Home Backend Development Python Tutorial Introduction to python crawlers (4)--Detailed explanation of HTML text parsing library BeautifulSoup

Introduction to python crawlers (4)--Detailed explanation of HTML text parsing library BeautifulSoup

May 27, 2017 am 11:55 AM
python

Beautiful Soup is a library in python. Its main function is to grab data from web pages. The following article mainly introduces you to the relevant information of BeautifulSoup, the HTML text parsing library of python crawler. The introduction in the article is very detailed and has certain reference and learning value for everyone. Friends who need it can take a look below.

Preface

The third article in the python crawler series introduces the network request library artifact Requests. After the request returns the data, it must be extracted. Target data, the content returned by different websites usually has many different formats, one is the json format, this type of data is the most friendly to developers. Another XML format, and the most common format is HTML documents. Today I will talk about how to extract interesting data from HTML

Write your own HTML parsing Analyzer? Or use regular expression? None of these are the best solutions. Fortunately, the Python community has already had a very mature solution for this problem. BeautifulSoup is the nemesis of this type of problem. It focuses on HTML document operations and its name comes from a song of the same name by Lewis Carroll. poetry.

BeautifulSoup is a Python library for parsing HTML documents. Through BeautifulSoup, you only need to use very little code to extract any interesting content in HTML. In addition, it also has certain HTML fault tolerance. Ability to handle an incompletely formatted HTML document correctly.

Install BeautifulSoup

pip install beautifulsoup4
Copy after login

BeautifulSoup3 has been officially abandoned for maintenance, you need to download the latest version BeautifulSoup4.

HTML tag

Before learning BeautifulSoup4, it is necessary to have a basic understanding of HTML documents. As shown in the following code, HTML is a tree organizational structure .

<html> 
 <head>
  <title>hello, world</title>
 </head>
 <body>
  <h1>BeautifulSoup</h1>
  <p>如何使用BeautifulSoup</p>
 <body>
</html>
Copy after login
  • It consists of many tags (Tag), such as html, head, title, etc. are all tags

  • A tag pair constitutes a Node, for example... is a root node

  • There is a certain relationship between nodes, such as h1 and p are neighbors of each other, they are adjacent sibling nodes

  • h1 Is the direct child node of body or the descendants node of html

  • body is the parent node of p , html is the ancestor node of p

  • The string nested between the tags is a special child node under the node, such as "hello, world" is also a node , but it has no name.

Using BeautifulSoup

Constructing a BeautifulSoup object requires two parameters. The first parameter is the HTML to be parsed. A text string, the second parameter tells BeautifulSoup which parser to use to parse the HTML.

The parser is responsible for parsing HTML into related objects, while BeautifulSoup is responsible for manipulating data (adding, deleting, modifying, and checking). "html.parser" is Python's built-in parser, and "lxml" is a parser developed based on c language. It executes faster, but it requires additional installation.

It can be located through the BeautifulSoup object to any tag node in HTML.

from bs4 import BeautifulSoup 
text = """
<html> 
 <head>
  <title >hello, world</title>
 </head>
 <body>
  <h1>BeautifulSoup</h1>
  <p class="bold">如何使用BeautifulSoup</p>
  <p class="big" id="key1"> 第二个p标签</p>
  <a href="http://foofish.net" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >python</a>
 </body>
</html> 
"""
soup = BeautifulSoup(text, "html.parser")

# title 标签
>>> soup.title
<title>hello, world</title>

# p 标签
>>> soup.p
<p class="bold">\u5982\u4f55\u4f7f\u7528BeautifulSoup</p>

# p 标签的内容
>>> soup.p.string
u&#39;\u5982\u4f55\u4f7f\u7528BeautifulSoup&#39;
Copy after login

BeatifulSoup abstracts HTML into four main data types, namely Tag, NavigableString, BeautifulSoup, and Comment. Each tag node is a Tag object. NavigableString objects are generally strings wrapped in Tag objects. BeautifulSoup objects represent the entire HTML document. For example:

>>> type(soup)
<class &#39;bs4.BeautifulSoup&#39;>
>>> type(soup.h1)
<class &#39;bs4.element.Tag&#39;>
>>> type(soup.p.string)
<class &#39;bs4.element.NavigableString&#39;>
Copy after login

Tag

Each Tag has a name, which corresponds to the HTML tag name.


>>> soup.h1.name
u&#39;h1&#39;
>>> soup.p.name
u&#39;p&#39;
Copy after login

Tags can also have attributes. The access method of attributes is similar to that of dictionaries. It returns a list object

>>> soup.p[&#39;class&#39;]
[u&#39;bold&#39;]
Copy after login

NavigableString

To get the content in the tag, you can get it directly by using .stirng. It is a NavigableString object, and you can explicitly convert it to a unicode string.

>>> soup.p.string
u&#39;\u5982\u4f55\u4f7f\u7528BeautifulSoup&#39;
>>> type(soup.p.string)
<class &#39;bs4.element.NavigableString&#39;>
>>> unicode_str = unicode(soup.p.string)
>>> unicode_str
u&#39;\u5982\u4f55\u4f7f\u7528BeautifulSoup&#39;
Copy after login

After introducing the basic concepts, we can now officially enter the topic. How to find the data we care about from HTML? BeautifulSoup provides two methods, one is traversal and the other is search. Usually the two are combined to complete the search task.

Traversing the document tree

Traversing the document tree, as the name suggests, starts from the root node html tag and traverses until the target element is found. One drawback of is that if the content you are looking for is at the end of the document, then it has to traverse the entire document to find it, which is slow. Therefore, it is necessary to cooperate with the second method.

Obtaining tag nodes by traversing the document tree can be obtained directly through . tag name, for example:

Get the body tag:

>>> soup.body
<body>\n<h1>BeautifulSoup</h1>\n<p class="bold">\u5982\u4f55\u4f7f\u7528BeautifulSoup</p>\n</body>
Copy after login

获取 p 标签

>>> soup.body.p
<p class="bold">\u5982\u4f55\u4f7f\u7528BeautifulSoup</p>
Copy after login

获取 p 标签的内容

>>> soup.body.p.string
\u5982\u4f55\u4f7f\u7528BeautifulSoup
Copy after login

前面说了,内容也是一个节点,这里就可以用 .string 的方式得到。遍历文档树的另一个缺点是只能获取到与之匹配的第一个子节点,例如,如果有两个相邻的 p 标签时,第二个标签就没法通过 .p 的方式获取,这是需要借用 next_sibling 属性获取相邻且在后面的节点。此外,还有很多不怎么常用的属性,比如:.contents 获取所有子节点,.parent 获取父节点,更多的参考请查看官方文档。

搜索文档树

搜索文档树是通过指定标签名来搜索元素,另外还可以通过指定标签的属性值来精确定位某个节点元素,最常用的两个方法就是 find 和 find_all。这两个方法在 BeatifulSoup 和 Tag 对象上都可以被调用。

find_all()

find_all( name , attrs , recursive , text , **kwargs )
Copy after login

find_all 的返回值是一个 Tag 组成的列表,方法调用非常灵活,所有的参数都是可选的。

第一个参数 name 是标签节点的名字。

# 找到所有标签名为title的节点
>>> soup.find_all("title")
[<title>hello, world</title>]
>>> soup.find_all("p")
[<p class="bold">\xc8\xe7\xba\xce\xca\xb9\xd3\xc3BeautifulSoup</p>, 
<p class="big"> \xb5\xda\xb6\xfe\xb8\xf6p\xb1\xea\xc7\xa9</p>]
Copy after login

第二个参数是标签的class属性值

# 找到所有class属性为big的p标签
>>> soup.find_all("p", "big")
[<p class="big"> \xb5\xda\xb6\xfe\xb8\xf6p\xb1\xea\xc7\xa9</p>]
Copy after login

等效于

>>> soup.find_all("p", class_="big")
[<p class="big"> \xb5\xda\xb6\xfe\xb8\xf6p\xb1\xea\xc7\xa9</p>]
Copy after login

因为 class 是 Python 关键字,所以这里指定为 class_。

kwargs 是标签的属性名值对,例如:查找有href属性值为 "http://foofish.net" 的标签

>>> soup.find_all(href="foofish.net" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" )
[<a href="foofish.net" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >python</a>]
Copy after login

当然,它还支持正则表达式

>>> import re
>>> soup.find_all(href=re.compile("^http"))
[<a href="http://foofish.net" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >python</a>]
Copy after login

属性除了可以是具体的值、正则表达式之外,它还可以是一个布尔值(True/Flase),表示有属性或者没有该属性。

>>> soup.find_all(id="key1")
[<p class="big" id="key1"> \xb5\xda\xb6\xfe\xb8\xf6p\xb1\xea\xc7\xa9</p>]
>>> soup.find_all(id=True)
[<p class="big" id="key1"> \xb5\xda\xb6\xfe\xb8\xf6p\xb1\xea\xc7\xa9</p>]
Copy after login

遍历和搜索相结合查找,先定位到 body 标签,缩小搜索范围,再从 body 中找 a 标签。

>>> body_tag = soup.body
>>> body_tag.find_all("a")
[<a href="http://foofish.net" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >python</a>]
Copy after login

find()

find 方法跟 find_all 类似,唯一不同的地方是,它返回的单个 Tag 对象而非列表,如果没找到匹配的节点则返回 None。如果匹配多个 Tag,只返回第0个。

>>> body_tag.find("a")
<a href="foofish.net" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >python</a>
>>> body_tag.find("p")
<p class="bold">\xc8\xe7\xba\xce\xca\xb9\xd3\xc3BeautifulSoup</p>
Copy after login

get_text()

获取标签里面内容,除了可以使用 .string 之外,还可以使用 get_text 方法,不同的地方在于前者返回的一个 NavigableString 对象,后者返回的是 unicode 类型的字符串。

>>> p1 = body_tag.find(&#39;p&#39;).get_text()
>>> type(p1)
<type &#39;unicode&#39;>
>>> p1
u&#39;\xc8\xe7\xba\xce\xca\xb9\xd3\xc3BeautifulSoup&#39;

>>> p2 = body_tag.find("p").string
>>> type(p2)
<class &#39;bs4.element.NavigableString&#39;>
>>> p2
u&#39;\xc8\xe7\xba\xce\xca\xb9\xd3\xc3BeautifulSoup&#39;
>>>
Copy after login

实际场景中我们一般使用 get_text 方法获取标签中的内容。

总结

BeatifulSoup 是一个用于操作 HTML 文档的 Python 库,初始化 BeatifulSoup 时,需要指定 HTML 文档字符串和具体的解析器。BeatifulSoup 有3类常用的数据类型,分别是 Tag、NavigableString、和 BeautifulSoup。查找 HTML元素有两种方式,分别是遍历文档树和搜索文档树,通常快速获取数据需要二者结合。

【相关推荐】

1. python爬虫入门(5)--正则表达式实例教程

2. python爬虫入门(3)--利用requests构建知乎API

3. python爬虫入门(2)--HTTP库requests

4.  总结Python的逻辑运算符and

5. python爬虫入门(1)--快速理解HTTP协议

The above is the detailed content of Introduction to python crawlers (4)--Detailed explanation of HTML text parsing library BeautifulSoup. For more information, please follow other related articles on the PHP Chinese website!

Statement of this Website
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

Video Face Swap

Video Face Swap

Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Tools

Notepad++7.3.1

Notepad++7.3.1

Easy-to-use and free code editor

SublimeText3 Chinese version

SublimeText3 Chinese version

Chinese version, very easy to use

Zend Studio 13.0.1

Zend Studio 13.0.1

Powerful PHP integrated development environment

Dreamweaver CS6

Dreamweaver CS6

Visual web development tools

SublimeText3 Mac version

SublimeText3 Mac version

God-level code editing software (SublimeText3)

PHP and Python: Different Paradigms Explained PHP and Python: Different Paradigms Explained Apr 18, 2025 am 12:26 AM

PHP is mainly procedural programming, but also supports object-oriented programming (OOP); Python supports a variety of paradigms, including OOP, functional and procedural programming. PHP is suitable for web development, and Python is suitable for a variety of applications such as data analysis and machine learning.

Choosing Between PHP and Python: A Guide Choosing Between PHP and Python: A Guide Apr 18, 2025 am 12:24 AM

PHP is suitable for web development and rapid prototyping, and Python is suitable for data science and machine learning. 1.PHP is used for dynamic web development, with simple syntax and suitable for rapid development. 2. Python has concise syntax, is suitable for multiple fields, and has a strong library ecosystem.

Python vs. JavaScript: The Learning Curve and Ease of Use Python vs. JavaScript: The Learning Curve and Ease of Use Apr 16, 2025 am 12:12 AM

Python is more suitable for beginners, with a smooth learning curve and concise syntax; JavaScript is suitable for front-end development, with a steep learning curve and flexible syntax. 1. Python syntax is intuitive and suitable for data science and back-end development. 2. JavaScript is flexible and widely used in front-end and server-side programming.

PHP and Python: A Deep Dive into Their History PHP and Python: A Deep Dive into Their History Apr 18, 2025 am 12:25 AM

PHP originated in 1994 and was developed by RasmusLerdorf. It was originally used to track website visitors and gradually evolved into a server-side scripting language and was widely used in web development. Python was developed by Guidovan Rossum in the late 1980s and was first released in 1991. It emphasizes code readability and simplicity, and is suitable for scientific computing, data analysis and other fields.

Can vs code run in Windows 8 Can vs code run in Windows 8 Apr 15, 2025 pm 07:24 PM

VS Code can run on Windows 8, but the experience may not be great. First make sure the system has been updated to the latest patch, then download the VS Code installation package that matches the system architecture and install it as prompted. After installation, be aware that some extensions may be incompatible with Windows 8 and need to look for alternative extensions or use newer Windows systems in a virtual machine. Install the necessary extensions to check whether they work properly. Although VS Code is feasible on Windows 8, it is recommended to upgrade to a newer Windows system for a better development experience and security.

Can visual studio code be used in python Can visual studio code be used in python Apr 15, 2025 pm 08:18 PM

VS Code can be used to write Python and provides many features that make it an ideal tool for developing Python applications. It allows users to: install Python extensions to get functions such as code completion, syntax highlighting, and debugging. Use the debugger to track code step by step, find and fix errors. Integrate Git for version control. Use code formatting tools to maintain code consistency. Use the Linting tool to spot potential problems ahead of time.

How to run python with notepad How to run python with notepad Apr 16, 2025 pm 07:33 PM

Running Python code in Notepad requires the Python executable and NppExec plug-in to be installed. After installing Python and adding PATH to it, configure the command "python" and the parameter "{CURRENT_DIRECTORY}{FILE_NAME}" in the NppExec plug-in to run Python code in Notepad through the shortcut key "F6".

Is the vscode extension malicious? Is the vscode extension malicious? Apr 15, 2025 pm 07:57 PM

VS Code extensions pose malicious risks, such as hiding malicious code, exploiting vulnerabilities, and masturbating as legitimate extensions. Methods to identify malicious extensions include: checking publishers, reading comments, checking code, and installing with caution. Security measures also include: security awareness, good habits, regular updates and antivirus software.

See all articles