


How Python uses the Beautiful Soup (BS4) library to parse HTML and XML
1. Overview of Beautiful Soup
Beautiful Soup is a Python library for extracting data from HTML and XML files.
It supports the HTML parser in the Python standard library, and also supports third-party parsers such as lxml and html5lib.
Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8 encoding.
Installation:
pip install beautifulsoup4
Optionally, install a third-party parser:
pip install lxml
pip install html5lib
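Which parser is used is chosen when the soup object is constructed. A minimal sketch of selecting a parser (the markup string here is a made-up example):

from bs4 import BeautifulSoup

markup = "<html><body><p>Hello</p></body></html>"  # toy markup for illustration

# The second argument selects the parser; "html.parser" needs no extra install.
soup_std = BeautifulSoup(markup, "html.parser")
soup_lxml = BeautifulSoup(markup, "lxml")        # fast, requires lxml
soup_h5 = BeautifulSoup(markup, "html5lib")      # most browser-like, requires html5lib

print(soup_std.p.string)  # Hello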
2. Quick start
The examples below all parse the following page, saved locally as aa.html:

<!DOCTYPE html>
<html>
<head>
    <meta content="text/html;charset=utf-8" http-equiv="content-type" />
    <meta content="IE=Edge" http-equiv="X-UA-Compatible" />
    <meta content="always" name="referrer" />
    <link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css" />
    <title>百度一下,你就知道 </title>
</head>
<body link="#0000cc">
    <div id="wrapper">
        <div id="head">
            <div class="head_wrapper">
                <div id="u1">
                    <a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻 </a>
                    <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>
                    <a class="mnav" href="http://map.baidu.com" name="tj_trmap">地图 </a>
                    <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">视频 </a>
                    <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">贴吧 </a>
                    <a class="bri" href="//www.baidu.com/more/" name="tj_briicon">更多产品 </a>
                </div>
            </div>
        </div>
    </div>
</body>
</html>
from bs4 import BeautifulSoup

file = open('./aa.html', 'rb')
html = file.read()
bs = BeautifulSoup(html, "html.parser")

print(bs.prettify())     # pretty-print the HTML structure with indentation
print(bs.title)          # the <title> tag and its contents
print(bs.title.name)     # the tag's name: title
print(bs.title.string)   # the tag's text: 百度一下,你就知道
print(bs.head)           # the entire <head> tag
print(bs.div)            # the first <div> tag and everything inside it
print(bs.div["id"])      # the id attribute of the first <div>: wrapper
print(bs.a)              # the first <a> tag: <a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻 </a>
print(bs.find_all("a"))  # a list of every <a> tag: [....]
print(bs.find(id="u1"))  # the tag whose id is "u1"
for item in bs.find_all("a"):   # iterate over all <a> tags and print each href value
    print(item.get("href"))
for item in bs.find_all("a"):   # iterate over all <a> tags and print each text value
    print(item.get_text())
3. The four object types
Beautiful Soup parses a document into four kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment.
1. Tag
In layman's terms, a Tag is an HTML tag, for example:
print(bs.title)    # the entire <title> tag
print(bs.head)     # the entire <head> tag
print(bs.a)        # the first <a> tag
print(type(bs.a))  # <class 'bs4.element.Tag'>
We can get a tag's content simply by appending the tag name to the soup object. These objects are of type bs4.element.Tag. Note, however, that this returns only the first matching tag in the document.
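A quick illustration of that first-match behavior, sketched against the aa.html document above:

first_link = bs.a                # shorthand for finding the first <a> tag
print(first_link == bs.find("a"))  # True — the two calls return the same tag
print(first_link["href"])          # http://news.baidu.com
# Later <a> tags (hao123, 地图, ...) are not reachable this way;
# use bs.find_all("a") to get all of them.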
A Tag has two important attributes: name and attrs.
print(bs.name)       # [document] — the BeautifulSoup object itself is special; its name is [document]
print(bs.head.name)  # head — for ordinary tags, name is the tag's own name
print(bs.a.attrs)    # all attributes of the <a> tag, returned as a dictionary
print(bs.a['class']) # a single attribute; equivalent to bs.a.get('class')
bs.a['class'] = "newClass"  # attributes and their values can be modified...
print(bs.a)
del bs.a['class']           # ...and deleted
print(bs.a)
2. NavigableString
Now that we can get a tag, how do we get the text inside it? It's simple: use .string, for example:
print(bs.title.string)        # 百度一下,你就知道
print(type(bs.title.string))  # <class 'bs4.element.NavigableString'>
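One caveat worth knowing: .string only returns a value when the tag has a single child. A small sketch contrasting .string with .get_text(), again using aa.html:

# .string returns None when a tag contains more than one child,
# while .get_text() concatenates all text in the subtree.
print(bs.div.string)       # None — the first <div> contains nested tags
print(bs.a.get_text())     # 新闻 — all text inside the first <a>
print(bs.head.get_text())  # all text inside <head>, including the title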
3. BeautifulSoup
The BeautifulSoup object represents the content of the document as a whole. Most of the time it can be treated as a special Tag object, so we can query its type, name, and attributes, for example:
print(type(bs.name))  # <class 'str'>
print(bs.name)        # [document]
print(bs.attrs)       # {}
4. Comment
The Comment object is a special type of NavigableString; its output does not include the comment markers. Suppose the first <a> tag in aa.html is edited so that its text is an HTML comment, with no extra spaces or newlines:
# the <a> tag now looks like:
# <a class="mnav" href="http://news.baidu.com" name="tj_trnews"><!--新闻--></a>
print(bs.a)
print(bs.a.string)        # 新闻 — printed without the <!-- --> markers
print(type(bs.a.string))  # <class 'bs4.element.Comment'>
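Since a Comment prints just like ordinary text, code that extracts strings often type-checks first. A minimal sketch of that pattern (bs4.element.Comment is the class to test against):

from bs4.element import Comment

s = bs.a.string
if isinstance(s, Comment):
    print("this is a comment:", s)   # skip or handle comments specially
else:
    print("ordinary text:", s)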
4. Traversing the document tree
- .contents: returns all child nodes of the Tag as a list
print(bs.head.contents)     # the tag's child nodes, output as a list: [...]
print(bs.head.contents[1])  # list indexing picks out a single child
- .children: returns all child nodes of the Tag as a generator
for child in bs.body.children:
    print(child)
- .descendants: all descendant nodes of the Tag
- .parent: the parent node of the Tag
- .previous_sibling: the node immediately before the current Tag; usually a string or whitespace — in practice, the text (such as a newline) that sits between the current tag and the previous one
- .next_sibling: the node immediately after the current Tag; likewise, usually the string between the current tag and the next one
- .previous_siblings: all sibling nodes before the current Tag, as a generator
- .next_siblings: all sibling nodes after the current Tag, as a generator
- .previous_element: the previous object (string or tag) in parse order; this may coincide with .previous_sibling, but is usually different
- .next_element: the next object (string or tag) in parse order; this may coincide with .next_sibling, but is usually different
- .previous_elements: a generator that walks backwards through the parsed content of the document
- .next_elements: a generator that walks forwards through the parsed content of the document
- .strings: if the Tag contains multiple strings (i.e. its descendant nodes contain text), this yields them all for iteration
- .stripped_strings: same as .strings, but with extra whitespace stripped out
- .has_attr(name): returns whether the Tag has the given attribute
Several of these attributes are exercised in the sketch below.
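A short sketch using the aa.html soup from the quick start (the exact sibling values depend on the whitespace in the file):

title = bs.title
print(title.parent.name)             # head — the tag that contains <title>

first_a = bs.find("a")
print(repr(first_a.next_sibling))    # usually the whitespace between two <a> tags
print(first_a.next_sibling.next_sibling)  # the second <a> tag

for s in bs.div.stripped_strings:    # all text under the first <div>, whitespace stripped
    print(s)

print(first_a.has_attr("class"))     # True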
5. Searching the document tree
1. find_all(): filtering
find_all(name, attrs, recursive, text, **kwargs)
Filters can be applied to a tag's name, to a node's attributes, and so on.
(1) The name parameter:
String filtering: finds the tags whose name exactly matches the string.
a_list = bs.find_all("a")
print(a_list)
Regular-expression filtering: if a regular expression is passed in, Beautiful Soup matches tag names against it via search().
import re

t_list = bs.find_all(re.compile("a"))  # every tag whose name contains "a" (a, head, meta, ...)
for item in t_list:
    print(item)
List filtering: if a list is passed in, Beautiful Soup returns the nodes that match any element of the list.
t_list = bs.find_all(["meta", "link"])
for item in t_list:
    print(item)
Method filtering: pass in a function; a tag matches if the function returns True for it.
def name_is_exists(tag):
    return tag.has_attr("name")

t_list = bs.find_all(name_is_exists)  # all tags that have a name attribute
for item in t_list:
    print(item)
(2) The keyword arguments:
t_list = bs.find_all(id="head")  # tags whose id is "head"
t_list = bs.find_all(href=re.compile("http://news.baidu.com"))  # tags whose href matches the pattern
t_list = bs.find_all(class_=True)  # all tags that have a class attribute
# (note: class is a Python keyword, hence the trailing underscore in class_)
for item in t_list:
    print(item)
(3) The attrs parameter:
Not every attribute can be searched with the keyword syntax above; HTML data-* attributes, for example:
t_list = bs.find_all(data-foo="value")
Running this raises an error. Instead, pass the attrs parameter with a dictionary to search for tags with such special attributes:
t_list = bs.find_all(attrs={"data-foo": "value"})
for item in t_list:
    print(item)
(4) The text parameter:
The text parameter searches the string content of the document. Like the name parameter, it accepts a string, a regular expression, or a list:
t_list = bs.find_all(text="hao123")
t_list = bs.find_all(text=["hao123", "地图", "贴吧"])
t_list = bs.find_all(text=re.compile(r"\d"))
As with names, a method can be passed in to match special string content:
def length_is_two(text):
    return text and len(text) == 2

t_list = bs.find_all(text=length_is_two)
(5) The limit parameter:
The limit parameter caps the number of results: if a search matches 5 nodes but limit=2 is set, only the first 2 are returned.
t_list = bs.find_all("a", limit=2)
Besides the usual forms above, find_all also supports a shorthand:
# the following two are equivalent
t_list = bs.find_all("a")
t_list = bs("a")
# and so are these two
t_list = bs.a.find_all(text="新闻")
t_list = bs.a(text="新闻")
2. find()
find() returns the first Tag that matches the criteria. When we only need a single Tag, find() is the method to use. We could of course call find_all() with limit=1 and take the first element of the result, but that is needlessly verbose:
t_list = bs.find_all("title", limit=1)  # returns a one-element list
t = bs.find("title")                    # returns the single value
t = bs.find("abc")                      # returns None if nothing is found
As the results show, find_all still returns a list even with limit=1, which is far less convenient than find when we only want one value. Note that find returns None when nothing matches.
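Because find() returns None on a miss, chaining attributes off its result can raise AttributeError; a minimal defensive sketch:

tag = bs.find("abc")
if tag is not None:
    print(tag.string)
else:
    # avoids AttributeError: 'NoneType' object has no attribute 'string'
    print("tag not found")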
As shown earlier, bs.div fetches the first <div>. To get the first <div> inside that first <div>, we can write:
t = bs.div.div
# equivalent to
t = bs.find("div").find("div")
6. CSS selectors: the select() method
Beautiful Soup supports a subset of CSS selectors. Pass a selector string to the .select() method of a Tag or BeautifulSoup object to find tags using CSS-selector syntax:
print(bs.select('title'))                # 1. by tag name
print(bs.select('a'))
print(bs.select('.mnav'))                # 2. by class name
print(bs.select('#u1'))                  # 3. by id
print(bs.select('div .bri'))             # 4. combined selectors
print(bs.select('a[class="bri"]'))       # 5. by attribute
print(bs.select('a[href="http://tieba.baidu.com"]'))
print(bs.select("head > title"))         # 6. direct child
print(bs.select(".mnav ~ .bri"))         # 7. sibling selector
print(bs.select('title')[0].get_text())  # 8. getting the text
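When only the first match is wanted, bs4 also offers select_one(), which returns a single Tag or None instead of a list; a small sketch:

title_tag = bs.select_one("head > title")
if title_tag is not None:
    print(title_tag.get_text())  # 百度一下,你就知道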
7. A complete example:
from bs4 import BeautifulSoup
import requests, re

req_obj = requests.get('https://www.baidu.com')
soup = BeautifulSoup(req_obj.text, 'lxml')

'''Finding tags'''
print(soup.title)            # only the first match
print(soup.find('title'))    # same effect as the line above
print(soup.find_all('div'))  # every <div> tag

'''Reading a tag's attributes'''
tag = soup.div
print(tag['class'])  # multi-valued attributes come back as a list
print(tag['id'])     # the tag's id attribute
print(tag.attrs)     # every attribute, as a dict of {name: value}

'''The string wrapped by a tag'''
tag = soup.title
print(tag.string)               # the text inside the tag
tag.string.replace_with("哈哈")  # the string cannot be edited in place, but can be replaced

'''Working with child nodes'''
tag = soup.head
print(tag.title)  # fetch <head>, then the <title> tag it contains

'''.contents and .children'''
tag = soup.body
print(tag.contents)                       # the tag's child nodes as a list
print([child for child in tag.children])  # same output as the line above

'''.descendants'''
tag = soup.body
for child_tag in tag.descendants:  # every child node and grandchild node
    print(child_tag)

'''.strings and .stripped_strings'''
tag = soup.body
for s in tag.strings:           # all text content
    print(s)
for s in tag.stripped_strings:  # all text content, minus blank lines and surrounding whitespace
    print(s)

'''.parent and .parents'''
tag = soup.title
print(tag.parent)           # the tag's parent tag
for parent in tag.parents:  # every ancestor tag
    print(parent)

'''.next_siblings and .previous_siblings: all sibling nodes'''
'''.next_element and .previous_element: the next/previous parsed node'''

'''The keyword arguments of find_all'''
soup.find_all(id='link2')                # tags whose id is "link2"
soup.find_all(href=re.compile("elsie"))  # searches every tag's href attribute
soup.find_all(id=True)                   # every tag that has an id attribute
soup.find_all(href=re.compile("elsie"), id='link1')  # filters can be combined
soup.find_all(attrs={"attribute-name": "attribute-value"})  # attributes can also be given as a dict
8. BeautifulSoup vs. lxml (XPath) comparison
# test.py
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
import traceback
import json
from lxml import etree
import re
import time

def getHtmlText(url):
    try:
        r = requests.get(url, headers=headers)
        r.raise_for_status()
        if r.encoding == 'ISO-8859-1':
            r.encoding = r.apparent_encoding
        return r.text
    except:
        traceback.print_exc()

# ---------- parsing with BeautifulSoup ------------------------
def parseWithBeautifulSoup(html_text):
    soup = BeautifulSoup(html_text, 'lxml')
    content = []
    for mulu in soup.find_all(class_='mulu'):  # first find every div with class="mulu"
        h3 = mulu.find('h3')                   # then find the <h3> inside it
        if h3 is not None:
            h3_title = h3.string               # the section title
            lst = []
            for a in mulu.select('div.box a'):
                href = a.get('href')        # the href attribute
                box_title = a.get('title')  # the title attribute
                pattern = re.compile(r'\s*\[(.*)\]\s+(.*)')  # (...) captures a group
                match = pattern.search(box_title)
                if match is not None:
                    date = match.group(1)
                    real_title = match.group(2)
                    lst.append({'href': href, 'title': real_title, 'date': date})
            content.append({'title': h3_title, 'content': lst})
    with open('dmbj_bs.json', 'w') as fp:
        json.dump(content, fp=fp, indent=4)

# ---------- parsing with XPath ------------------------
def parseWithXpath(html_text):
    html = etree.HTML(html_text)
    content = []
    for div_mulu in html.xpath('.//*[@class="mulu"]'):  # first find every div with class="mulu"
        # then find the <h3> inside the mulu-title div
        div_h3 = div_mulu.xpath('./div[@class="mulu-title"]/center/h3/text()')
        if len(div_h3) > 0:
            h3_title = div_h3[0]  # the section title
            a_s = div_mulu.xpath('./div[@class="box"]/ul/li/a')
            lst = []
            for a in a_s:
                href = a.xpath('./@href')[0]        # the href attribute
                box_title = a.xpath('./@title')[0]  # the title attribute
                pattern = re.compile(r'\s*\[(.*)\]\s+(.*)')  # (...) captures a group
                match = pattern.search(box_title)
                if match is not None:
                    date = match.group(1)
                    real_title = match.group(2)
                    lst.append({'href': href, 'title': real_title, 'date': date})
            content.append({'title': h3_title, 'content': lst})
    with open('dmbj_xp.json', 'w') as fp:
        json.dump(content, fp=fp, indent=4)

def main():
    html_text = getHtmlText('http://www.seputu.com')
    print(len(html_text))
    start = time.perf_counter()  # time.clock() was removed in Python 3.8
    parseWithBeautifulSoup(html_text)
    print('BSoup cost:', time.perf_counter() - start)
    start = time.perf_counter()
    parseWithXpath(html_text)
    print('Xpath cost:', time.perf_counter() - start)

if __name__ == '__main__':
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.117 Safari/537.36'
    headers = {'User-Agent': user_agent}
    main()
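On large documents, bs4's SoupStrainer can also narrow the speed gap by parsing only the matching part of the tree. A minimal sketch, reusing the class="mulu" blocks and the html_text variable from the example above:

from bs4 import BeautifulSoup, SoupStrainer

# Build only the div class="mulu" blocks instead of the full document;
# everything else is skipped at parse time.
only_mulu = SoupStrainer(class_='mulu')
soup = BeautifulSoup(html_text, 'lxml', parse_only=only_mulu)

for mulu in soup.find_all(class_='mulu'):
    h3 = mulu.find('h3')
    if h3 is not None:
        print(h3.string)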
The above is the detailed content of How Python uses the Beautiful Soup (BS4) library to parse HTML and XML. For more information, please follow other related articles on the PHP Chinese website!
