Scraping Web Page Tables with PowerShell
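Today I happened to read a blog post by 传教士, http://www.cnblogs.com/piapia/p/5367556.html ("PowerShell中的两只爬虫", Two Crawlers in PowerShell). It gave me some ideas, so I tried it myself and managed to scrape tables from web pages. I run an English-language system, and Chinese pages turned into garbled text when converted to strings, so all of my tests were done on English web pages.
PowerShell 5 introduced a new cmdlet called ConvertFrom-String, which converts strings into objects. One of its parameters takes a template and turns the matching parts of the string into objects. We can use this feature to scrape tables from web pages.
Detailed help documentation:
https://technet.microsoft.com/library/dn807178(v=wps.640).aspx
Let's start with a basic example.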
$a=@'
1 2 3 4
5 6 7 8
9 2 2 3
'@
$t=@'
{Co1*:1} {Co2:2} {Co3:3} {Co4:4}
{Co1*:5} 6 7 8
'@
$c=$a | ConvertFrom-String -Delimiter "\r\n"
$d=$a | ConvertFrom-String -TemplateContent $t
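The same string is processed twice: the first call uses a carriage return plus line feed as the delimiter to build one object, and the second call matches against a custom template. Note the syntax for defining properties: each one is wrapped in {}. The first property of a record is written {Name*:value}, and the properties after it omit the *. The template needs at least two example rows to match against.
The first object has three properties: P1 is 1 2 3 4, P2 is 5 6 7 8, and P3 is 9 2 2 3.
The second result is matched automatically, column by column (the template already supplies examples for the first two rows).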
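As a side note on the delimiter variant: to get one object per row instead of a single object with three properties, one option is to split the text into lines first and let ConvertFrom-String split each line on whitespace, which is its default delimiter. The sketch below assumes Windows PowerShell 5.x, where ConvertFrom-String is available. The variable names $text and $rows and the property names Co1 to Co4 are only illustrative.

```powershell
# Same sample text as in the basic example.
$text = @'
1 2 3 4
5 6 7 8
9 2 2 3
'@

# Split into lines first (handles both CRLF and LF line endings),
# then let ConvertFrom-String split each line on whitespace.
$rows = $text -split "\r?\n" |
    ConvertFrom-String -PropertyNames Co1, Co2, Co3, Co4

$rows | Format-Table
```

Each element of $rows then carries the four columns of one line as named properties, which is closer to what the template version produces.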
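Next, let's look at two real examples.
The first example is this page, which has a list of Australian proxy servers that I want to extract:
http://www.proxylisty.com/country/Australia-ip-list
The basic idea: Invoke-RestMethod fetches the whole page and automatically converts it to a string.
Then design a matching template. Because the page is an HTML file, the string still contains all of the HTML markup, so the key question is how to build a table template that includes that markup.
It is simple. Every page lets you view its HTML source, so the large chunk of HTML below is just two table rows copied and pasted from the page source, lightly edited to add the property names.
Matching against the template then generates the table objects automatically.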
$web = 'http://www.proxylisty.com/country/Australia-ip-list'
$template = @'
<tr><td>{IP*:203.56.188.145}</td><td><a href='http://www.proxylisty.com/port/8080-ip-list' title='Port 8080 Proxy List'>{Port:8080}</a></td><td>HTTP</td><td><a style='color:red;' href='http://www.proxylisty.com/anonymity/High anonymous / Elite proxy-ip-list' title='High anonymous / Elite proxy Proxy List'>High anonymous / Elite proxy</a></td><td>No</td><td><a href='http://www.proxylisty.com/country/Australia-ip-list' title='Australia IP Proxy List'><img style="max-width:90%" src='http://www.proxylisty.com/assets/flags/AU.png' title='Australia IP Proxy List'/>Australia</a></td><td>13 Months</td><td>2.699 Sec</td><td><div id="progress-bar" class="all-rounded"><div title='50%' id="progress-bar-percentage" class="all-rounded" style="width: 50%">{Reliability:50%}</div></div></td></tr>
<tr><td>{IP*:103.25.182.1}</td><td><a href='http://www.proxylisty.com/port/8081-ip-list' title='Port 8081 Proxy List'>{Port:8081}</a></td><td>HTTP</td><td><a style='color:red;' href='http://www.proxylisty.com/anonymity/Anonymous proxy-ip-list' title='Anonymous proxy Proxy List'>Anonymous proxy</a></td><td>No</td><td><a href='http://www.proxylisty.com/country/Australia-ip-list' title='Australia IP Proxy List'><img style="max-width:90%" src='http://www.proxylisty.com/assets/flags/AU.png' title='Australia IP Proxy List'/>Australia</a></td><td>15 Months</td><td>7.242 Sec</td><td><div id="progress-bar" class="all-rounded"><div title='55%' id="progress-bar-percentage" class="all-rounded" style="width: 55%">{Reliability:55%}</div></div></td></tr>
'@
# Fetch the page as one string, then extract the rows that match the template.
$temp = Invoke-RestMethod -Uri $web
$result = ConvertFrom-String -TemplateContent $template -InputObject $temp
$result | Sort-Object Reliability
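The scrape succeeded.
I can take this one step further. I want to check whether the scraped addresses actually work, so let's write a function to test them.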
function Test-Proxy {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory = $true, ValueFromPipeline = $true,
                   ValueFromPipelineByPropertyName = $true, Position = 0)]
        [string]$Server,
        [string]$Url = "http://www.microsoft.com"
    )
    process {
        Write-Host "Test Proxy Server: $Server" -NoNewline
        # Route the request through the proxy under test.
        $proxy = New-Object System.Net.WebProxy($Server)
        $webClient = New-Object System.Net.WebClient
        $webClient.Proxy = $proxy
        try {
            # Only success or failure matters here, so the page body is discarded.
            $null = $webClient.DownloadString($Url)
            Write-Host " Opened $Url successfully" -ForegroundColor Cyan
        }
        catch {
            Write-Host " Unable to access $Url" -ForegroundColor Yellow
        }
    }
}

# Build a proxy address from each scraped row and test it.
foreach ($r in $result) {
    $servername = "http://" + $r.IP + ":" + $r.Port
    Test-Proxy -Server $servername -Url "http://www.google.com"
}
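In my test, every one of them turned out to be a dud.
Similarly, I (豆子) have been paying more attention to healthy food lately, and I wanted to see which foods have a low GI:
http://ultimatepaleoguide.com/glycemic-index-food-list
I want to extract the table on that page.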
$t2=@'
<tr><td valign="top">{Food*:Banana cake, made with sugar}</td><td valign="top">{GI:47}</td><td valign="top">{Size:60}</td></tr>
<tr><td valign="top">{Food*:Banana cake, made without sugar}</td><td valign="top">{GI:55}</td><td valign="top">{Size:60}</td></tr>
'@
$web2 = 'http://ultimatepaleoguide.com/glycemic-index-food-list/'
$temp = Invoke-RestMethod -Uri $web2
$result1 = ConvertFrom-String -TemplateContent $t2 -InputObject $temp
$result1 | Out-GridView
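Success!
This approach is very useful, especially when you need to pull list data out of a web page. Of course, if the site itself offers a RESTful interface that returns JSON, you can fetch that directly and save yourself the trouble.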
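For the JSON case, here is a minimal sketch. Invoke-RestMethod parses a JSON response into objects automatically. To keep the example runnable offline, it feeds a made-up JSON string through ConvertFrom-Json instead. The payload, its field names, and the commented-out $apiUrl are assumptions for illustration only, not a real API of either site above.

```powershell
# Hypothetical JSON payload, standing in for what a REST endpoint might return.
$json = @'
[
  { "IP": "203.0.113.10", "Port": 8080, "Reliability": 50 },
  { "IP": "203.0.113.20", "Port": 8081, "Reliability": 55 }
]
'@

# With a live endpoint this would be a single call (the URL is hypothetical):
# $proxies = Invoke-RestMethod -Uri $apiUrl
$proxies = $json | ConvertFrom-Json

# The fields are already typed properties, so no template is needed.
$sorted = $proxies | Sort-Object Reliability -Descending
$sorted | Format-Table IP, Port, Reliability
```

Because Reliability arrives as a number here rather than as text such as "50%", the sort is numeric.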
