复杂的HTML解析 003_html/css_WEB-ITnose
当我们想要从复杂的网络信息中抽取出我们需要的信息时,不假思索的写出层层抽取的代码是非常不理智的。这样的代码又臭又长,难以阅读,且容易失效。正确的方法是三思而后行,使用正确的技巧方法写出简洁,易懂,扩展性强的代码。重要的事情说三遍:三思而后行,三思而后行,三思而后行### 1.class和id抽取网页的class属性和id属性是非常丰富的,我们可以使用这些属性来抽取信息:例子:```from urllib.request import urlopenfrom bs4 import BeautifulSouphtml = urlopen("http://pythonscraping.com/pages/warandpeace.html")#网页html代码赋给变量bs0bjbs0bj = BeautifulSoup(html)#返回一个含有class=green属性的列表namelist = bs0bj.findAll("span",{"class":"green"})#遍历输出列表中不含标签的文字。for name in namelist: print(name.get_text())```.get_text():把正在处理的html文档中的所有标签都清除,返回一个只包含文字的字符串,一般情况下因尽可能保留标签,在最后情况下使用get_text()。### 2.find()和finaall()find()和finaall()这两个函数是非常常用的,我们可以用它们找到需要的标签或标签组。函数的格式定义如下:``` findAll(tag, attributes, recursive, text, limit, keywords)find(tag, attributes, recursive, text, keywords)```- tag:可以传入一个标签或多个标签的标签组,.findAll({"h1","h2","h3","h4","h5","h6"})- attributes:用一个字典封装的一个标签的若干属性和对应的属性值- recursive:是否递归查找子标签,默认为true,如果设置为False,则只查找一级标签。- text:用来匹配标签的文本内容,不是标签的属性,切记。- limit:限制数量,当为1时相当于find()函数,设置为n时,显示前n个,按在网页上的顺序排列显示。- keyword:选择具有指定属性的标签。### 3.beautifulsoup对象- BeautifulSoup 对象:前面代码中的bsObj- Tag 对象 :前面代码中的bsObj.div.h1- NavigableString 对象:标签里的文字- comment 对象:查找html的注释标签, <!--like this one-->### 4.导航树:整个html页面可以简化为一棵树,如下所示:1.处理子标签和后代标签:一般情况下,beautifulsoup总是处理当前标签下的后代标签,如bs0bj.body.h1是处理当前body标签下的第一个h1标签。bs0bj.body.finaAll("img")是找到第一div标签下的索引img列表。如果只想要找到子标签,可以使用.children标签。```from urllib.request import urlopenfrom bs4 import BeautifulSouphtml = urlopen("http://www.pythonscraping.com/pages/page3.html")bsObj = BeautifulSoup(html)for child in bsObj.find("table",{"id":"giftList"}).children: print(child)```上面的代码会打印giftlist表格中的索引产品的数据行。使用descendants()标签可以输出所有的后代标签。2.兄弟标签:使用next_siblings()函数可以处理兄弟标签。```from urllib.request import urlopenfrom bs4 import BeautifulSouphtml = urlopen("http://www.pythonscraping.com/pages/page3.html")bsObj = BeautifulSoup(html)for sibling in bsObj.find("table",{"id":"giftList"}).tr.next_siblings:print(sibling)```使用兄弟标签会跳过标签自身,因为自身不能当作自己的兄弟,同时兄弟标签只能查找后面的标签,不能查找自身以及前面的兄弟标签。与next_siblings()类似,previous_siblings()函数是查找前面的兄弟节点。同时还有next_siblings()和previous_siblings()函数,它们的作业是返回单个标签。3.父标签使用parent和parents标签可以查找给定标签的父标签或父标签列表。```from urllib.request import urlopenfrom bs4 import BeautifulSouphtml = urlopen("http://www.pythonscraping.com/pages/page3.html")bsObj = BeautifulSoup(html)print(bsObj.find("img",{"src":"../img/gifts/img1.jpg" }).parent.previous_sibling.get_text())```上面代码的意思是找到给定标签的父标签的前一个兄弟标签中的文字。### 5.正则表达式:正则表达式是非常丰富的,但是上手很简单。可以自己搜索一些教程学习。正则测试:http://www.regexpal.com/### 6.正则表达式与BeautifulSoup正则表达式是非常有用的,举个例子,加入你想要抓取索引的img图片,使用finaall()得到的结果可能会出乎意料,多了许多隐藏的图片或不相关的图片,此时可以使用正则表达式。我们发现想要获取的图片源代码格式如下:```<img src="/static/imghw/default1.png" data-src="../img/gifts/img3.jpg" class="lazy" alt="复杂的HTML解析 003_html/css_WEB-ITnose" >```这样我们可以直接通过文件路径来查找信息,```from urllib.requestimport urlopenfrom bs4import BeautifulSoupimport rehtml = urlopen("http://www.pythonscraping.com/pages/page3.html")bsObj = BeautifulSoup(html)images = bsObj.findAll("img", {"src":re.compile("\.\.\/img\/gifts/img.*\.jpg")})for image in images: print(image["src"])```上面的例子使用了正则表达式匹配了以../img/gifts/img开头,以.jpg结尾的图片。结果如下:```../img/gifts/img1.jpg../img/gifts/img2.jpg../img/gifts/img3.jpg../img/gifts/img4.jpg../img/gifts/img6.jpg```### 6.标签属性:对于一个标签对象,可以使用下面代码获得它的全部属性:```myTag.attrs```获得的属性存在一个字典中,可以取出某一个属性```myTag.attrs["href"]```### 7.lambda表达式

Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

Video Face Swap
Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

Hot Tools

Notepad++7.3.1
Easy-to-use and free code editor

SublimeText3 Chinese version
Chinese version, very easy to use

Zend Studio 13.0.1
Powerful PHP integrated development environment

Dreamweaver CS6
Visual web development tools

SublimeText3 Mac version
God-level code editing software (SublimeText3)

Hot Topics

HTML is suitable for beginners because it is simple and easy to learn and can quickly see results. 1) The learning curve of HTML is smooth and easy to get started. 2) Just master the basic tags to start creating web pages. 3) High flexibility and can be used in combination with CSS and JavaScript. 4) Rich learning resources and modern tools support the learning process.

HTML defines the web structure, CSS is responsible for style and layout, and JavaScript gives dynamic interaction. The three perform their duties in web development and jointly build a colorful website.

WebdevelopmentreliesonHTML,CSS,andJavaScript:1)HTMLstructurescontent,2)CSSstylesit,and3)JavaScriptaddsinteractivity,formingthebasisofmodernwebexperiences.

AnexampleofastartingtaginHTMLis,whichbeginsaparagraph.StartingtagsareessentialinHTMLastheyinitiateelements,definetheirtypes,andarecrucialforstructuringwebpagesandconstructingtheDOM.

GiteePages static website deployment failed: 404 error troubleshooting and resolution when using Gitee...

The Y-axis position adaptive algorithm for web annotation function This article will explore how to implement annotation functions similar to Word documents, especially how to deal with the interval between annotations...

HTML, CSS and JavaScript are the three pillars of web development. 1. HTML defines the web page structure and uses tags such as, etc. 2. CSS controls the web page style, using selectors and attributes such as color, font-size, etc. 3. JavaScript realizes dynamic effects and interaction, through event monitoring and DOM operations.

To achieve the effect of scattering and enlarging the surrounding images after clicking on the image, many web designs need to achieve an interactive effect: click on a certain image to make the surrounding...
