Scraping Web Page Tables with PowerShell

WBOY

Jun 24, 2016, 11:16 AM

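Today I stumbled on a blog post by 传教士, http://www.cnblogs.com/piapia/p/5367556.html ("Two Crawlers in PowerShell"). It inspired me to try some scraping myself, and I managed to pull a table out of a web page. My system is the English edition, and Chinese text turned into garbled characters when converted to strings, so all of my tests were run against English-language pages.

PowerShell 5 adds a new cmdlet called ConvertFrom-String, which converts strings into objects. One of its parameters accepts a template and extracts the matching parts of the string into objects. We can use this feature to scrape tables from web pages.

Detailed help documentation:

https://technet.microsoft.com/library/dn807178(v=wps.640).aspx

Let's start with a basic example.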

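# Sample input: three lines of space-separated values
$a = @'
1 2 3 4
5 6 7 8
9 2 2 3
'@

# Template: two sample rows, with the first field of each row marked by *
$t = @'
{Co1*:1} {Co2:2} {Co3:3} {Co4:4}
{Co1*:5} 6 7 8
'@

# Split on CR/LF (the delimiter is a regular expression)
$c = $a | ConvertFrom-String -Delimiter "\r\n"

# Extract fields according to the template
$d = $a | ConvertFrom-String -TemplateContent $t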


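The same string is processed twice: the first call uses the CR/LF delimiter to produce an object, and the second matches the string against my custom template. Note the template syntax. Each field is wrapped in {}. The first field of a row is written as {PropertyName*:value}, with an asterisk; the remaining fields do not take the *. The template must contain at least two sample rows for the matching to work.

The first object has three properties: P1 is 1 2 3 4, P2 is 5 6 7 8, and P3 is 9 2 2 3.

The second result is matched automatically column by column (the template already covers the first two rows).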
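To make the shape of the template result easier to picture, here is a minimal sketch that builds similar row objects by hand with plain string splitting. It is not what ConvertFrom-String does internally; it only assumes the columns are separated by whitespace and reuses the Co1 to Co4 names from the template. It may also be handy as a fallback, since as far as I know ConvertFrom-String ships only with Windows PowerShell 5.x and not with PowerShell 6 and later.

```powershell
# Same sample text as in the basic example.
$text = @'
1 2 3 4
5 6 7 8
9 2 2 3
'@

# Split into lines (CRLF or LF), then split each line on whitespace
# and map the four columns to named properties.
$rows = foreach ($line in ($text -split '\r?\n')) {
    $cols = $line.Trim() -split '\s+'
    [pscustomobject]@{
        Co1 = $cols[0]
        Co2 = $cols[1]
        Co3 = $cols[2]
        Co4 = $cols[3]
    }
}

$rows | Format-Table
```

Each element of $rows is one table row, so $rows[1].Co3 refers to the third column of the second line.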

Next, let's look at two real examples.

The first is the page below, which contains a list of Australian proxy servers that I want to extract.

http://www.proxylisty.com/country/Australia-ip-list


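Basic approach: Invoke-RestMethod fetches the whole page and returns it as a string.

Then design a matching template. Because the page is an HTML file, the resulting string contains all of the HTML markup, so the key question is how to produce a table template that includes that markup.

This is easy. Any web page lets you view its HTML source, so you can copy the markup for two table rows straight from the page source and paste it into the template, then edit it slightly to add the property names.

Matching against the template then generates the table objects automatically.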

$web = 'http://www.proxylisty.com/country/Australia-ip-list'

# Two table rows copied from the page source, with property names added
$template = @'
<tr><td>{IP*:203.56.188.145}</td><td><a href='http://www.proxylisty.com/port/8080-ip-list' title='Port 8080 Proxy List'>{Port:8080}</a></td><td>HTTP</td><td><a style='color:red;' href='http://www.proxylisty.com/anonymity/High anonymous / Elite proxy-ip-list' title='High anonymous / Elite proxy Proxy List'>High anonymous / Elite proxy</a></td><td>No</td><td><a href='http://www.proxylisty.com/country/Australia-ip-list' title='Australia IP Proxy List'><img src='http://www.proxylisty.com/assets/flags/AU.png' title='Australia IP Proxy List'/>Australia</a></td><td>13 Months</td><td>2.699 Sec</td><td><div id="progress-bar" class="all-rounded"><div title='50%' id="progress-bar-percentage" class="all-rounded" style="width: 50%">{Reliability:50%}</div></div></td></tr>
<tr><td>{IP*:103.25.182.1}</td><td><a href='http://www.proxylisty.com/port/8081-ip-list' title='Port 8081 Proxy List'>{Port:8081}</a></td><td>HTTP</td><td><a style='color:red;' href='http://www.proxylisty.com/anonymity/Anonymous proxy-ip-list' title='Anonymous proxy Proxy List'>Anonymous proxy</a></td><td>No</td><td><a href='http://www.proxylisty.com/country/Australia-ip-list' title='Australia IP Proxy List'><img src='http://www.proxylisty.com/assets/flags/AU.png' title='Australia IP Proxy List'/>Australia</a></td><td>15 Months</td><td>7.242 Sec</td><td><div id="progress-bar" class="all-rounded"><div title='55%' id="progress-bar-percentage" class="all-rounded" style="width: 55%">{Reliability:55%}</div></div></td></tr>
'@

# Download the page as a string, then extract the rows with the template
$temp = Invoke-RestMethod -Uri $web
$result = ConvertFrom-String -TemplateContent $template -InputObject $temp
$result | Sort-Object Reliability


The scrape succeeded.
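One caveat about the final sort in the script above: if Reliability comes back as text such as 50% (which is my assumption for template fields without a type), sorting compares the values alphabetically, so 100% would land before 50%. The sketch below shows one way to sort numerically instead. The sample rows and their IP addresses are hypothetical stand-ins for $result.

```powershell
# Hypothetical rows shaped like the scraped $result (IP, Port, Reliability).
$sample = @(
    [pscustomobject]@{ IP = '192.0.2.1'; Port = '8080'; Reliability = '50%' }
    [pscustomobject]@{ IP = '192.0.2.2'; Port = '8081'; Reliability = '100%' }
    [pscustomobject]@{ IP = '192.0.2.3'; Port = '3128'; Reliability = '9%' }
)

# Strip the trailing % and cast to int so the comparison is numeric.
$sorted = $sample | Sort-Object -Property { [int]($_.Reliability.TrimEnd('%')) }

$sorted | Format-Table
```

Adding -Descending to Sort-Object would put the most reliable proxies first.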

I can take this a step further. I want to check whether the scraped addresses actually work, so I wrote a function to test them.

function Test-Proxy {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory = $true,
                   ValueFromPipeline = $true,
                   ValueFromPipelineByPropertyName = $true,
                   Position = 0)]
        [string]$server,

        [string]$url = "http://www.microsoft.com"
    )

    Write-Host "Test Proxy Server: $server" -NoNewline

    # Route a WebClient request through the proxy under test
    $proxy = New-Object System.Net.WebProxy($server)
    $WebClient = New-Object System.Net.WebClient
    $WebClient.Proxy = $proxy

    try {
        $content = $WebClient.DownloadString($url)
        Write-Host " Opened $url successfully" -ForegroundColor Cyan
    }
    catch {
        Write-Host " Unable to access $url" -ForegroundColor Yellow
    }
}

foreach ($r in $result) {
    $servername = "http://" + $r.IP + ":" + $r.Port
    # DownloadString needs an absolute URI, so the scheme is included
    Test-Proxy -server $servername -url "http://www.google.com"
}


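In my test, every one of them turned out to be a dud.

In the same way, I have been paying more attention to healthy food lately, and I want to see which foods have a low GI.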

http://ultimatepaleoguide.com/glycemic-index-food-list


I need to extract the table on that page.

$t2 = @'
<tr><td valign="top">{Food*:Banana cake, made with sugar}</td><td valign="top">{GI:47}</td><td valign="top">{Size:60}</td></tr>
<tr><td valign="top">{Food*:Banana cake, made without sugar}</td><td valign="top">{GI:55}</td><td valign="top">{Size:60}</td></tr>
'@

$web2 = 'http://ultimatepaleoguide.com/glycemic-index-food-list/'
$temp = Invoke-RestMethod -Uri $web2
$result1 = ConvertFrom-String -TemplateContent $t2 -InputObject $temp
$result1 | Out-GridView


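Success!

This approach is very useful, especially when you need to pull list-style information out of a web page. Of course, if the site itself offers a RESTful interface that returns JSON directly, that is even less work.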
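For comparison, here is a rough sketch of the JSON case. The JSON text is a made-up stand-in for what a hypothetical API might return (the two rows reuse the values from the template above), and the cut-off of 50 is an arbitrary choice for the demo. With a real JSON endpoint, Invoke-RestMethod would normally return already-parsed objects, so the ConvertFrom-Json step would not be needed.

```powershell
# Stand-in for the response body of a hypothetical JSON endpoint.
$json = @'
[
  { "Food": "Banana cake, made with sugar",    "GI": 47, "Size": 60 },
  { "Food": "Banana cake, made without sugar", "GI": 55, "Size": 60 }
]
'@

# Parse first, then filter, so the array is enumerated item by item.
$foods = $json | ConvertFrom-Json

$threshold = 50   # arbitrary cut-off, for illustration only
$lowGi = @($foods | Where-Object { $_.GI -le $threshold })

$lowGi | Format-Table Food, GI, Size
```

No template is required here because the property names come from the JSON itself.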
