Table of Contents
Parsing HTML pages with Python
The structure of HTML page
XPath parsing
CSS selector parsing
Regular expression parsing

How to parse HTML pages with Python crawler

May 30, 2023 pm 09:41 PM
python html

Parsing HTML pages with Python

After crawling a web page, we usually need to parse it to get the data we need. By analyzing how the HTML tags are nested, we can extract the useful information the page contains. In Python, there are three common ways to parse HTML: regular expression parsing, XPath parsing, and CSS selector parsing.

The structure of HTML page

Understanding the basic structure of an HTML page is a prerequisite for the parsing methods that follow. When we open a website in a browser and choose the "Show web page source code" item from the right-click menu, we can see the HTML code of the page. HTML code usually consists of tags, attributes, and text. Tags carry the content displayed on the page, attributes supplement the tag information, and text is the content a tag displays. The following is a simple example of HTML page code:

<!DOCTYPE html>
<html>
    <head>
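        <!-- Content inside the head tag is not displayed in the browser window -->
        <title>This is the page title</title>
    </head>
    <body>
        <!-- Content inside the body tag is displayed in the browser window -->
        <h2>This is a heading</h2>
        <p>This is a paragraph of text</p>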
    </body>
</html>

In this example, <!DOCTYPE html> is the document type declaration and the <html> tag is the root tag of the entire page. <head> and <body> are child tags of <html>. The content under the <body> tag is displayed in the browser window and forms the main body of the page. The content under the <head> tag is not displayed in the browser window, but it contains important meta-information about the page and is usually called the header of the page. The general code structure of an HTML page is as follows:

<!DOCTYPE html>
<html>
    <head>
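        <!-- Page meta-information, such as character encoding, title, keywords, media queries -->
    </head>
    <body>
        <!-- Main body of the page, the content displayed in the browser window -->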
    </body>
</html>
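
To make the three building blocks (tags, attributes, and text) concrete, here is a minimal sketch that walks a page with `html.parser.HTMLParser` from the Python standard library and records what it meets. The class name `StructureCollector`, the `class="headline"` attribute, and the sample page are illustrative assumptions, not part of the original example.

```python
from html.parser import HTMLParser

# Hypothetical sample page, modeled on the structure described above.
PAGE = """<!DOCTYPE html>
<html>
    <head>
        <title>This is the page title</title>
    </head>
    <body>
        <h2 class="headline">This is a heading</h2>
        <p>This is a paragraph of text</p>
    </body>
</html>"""


class StructureCollector(HTMLParser):
    """Collects (kind, value) pairs for tags, attributes and text."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Called once per opening tag; attrs is a list of (name, value) pairs.
        self.events.append(("tag", tag))
        for name, value in attrs:
            self.events.append(("attribute", f"{name}={value}"))

    def handle_data(self, data):
        # Called for text between tags; skip whitespace-only runs.
        text = data.strip()
        if text:
            self.events.append(("text", text))


collector = StructureCollector()
collector.feed(PAGE)
collector.close()

for kind, value in collector.events:
    print(f"{kind:9} {value}")
```

The printed list shows each opening tag in document order, followed by any attributes it carries and the text it contains.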

Tags, Cascading Style Sheets (CSS), and JavaScript are the three basic components of an HTML page. Tags carry the content to be displayed, CSS is responsible for rendering the page, and JavaScript controls its interactive behavior. To parse HTML pages, you can use XPath, which was originally a query syntax for XML; it extracts tag content or tag attributes based on the hierarchical structure of HTML tags. You can also use CSS selectors to locate page elements, in the same way that CSS uses them to style elements.

XPath parsing

XPath is a syntax for finding information in XML (eXtensible Markup Language) documents. XML is similar to HTML: it is also a markup language that uses tags to carry data. The difference is that XML tags are extensible and customizable, and XML has stricter syntax requirements. XPath uses path expressions to select nodes or node sets in XML documents. The nodes mentioned here include elements, attributes, text, namespaces, processing instructions, comments, and the root node.

XPath path expressions resemble file path syntax: you can use "/" and "//" to select nodes. A path that starts with a single slash "/" is resolved from the root node; a double slash "//" selects matching nodes at any position in the document. For example, "/bookstore/book" selects all book child nodes of the root element bookstore, and "//title" selects title nodes at any position.

XPath can also use predicates to filter nodes. A predicate is an expression written in square brackets; it can be a number, a comparison, or a function call. For example, "/bookstore/book[1]" selects the first book child node of bookstore, and "//book[@lang]" selects all book nodes that have a lang attribute.

XPath functions include string, mathematical, logical, node, and sequence functions, among others. They can be used to select nodes, calculate values, and convert data types. For example, the "string-length(string)" function returns the length of a string, and the "count(node-set)" function returns the number of nodes in a node set.

Below we use an example to illustrate how to use XPath to parse the page. Suppose we have the following XML file:

<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
    <book>
      <title lang="eng">Harry Potter</title>
      <price>29.99</price>
    </book>
    <book>
      <title lang="zh">Learning XML</title>
      <price>39.95</price>
    </book>
</bookstore>

For this XML file, we can use the XPath syntax as shown below to get the nodes in the document.

Path expression       Result
/bookstore            Selects the root element bookstore. Note: a path that starts with a forward slash (/) always represents an absolute path to an element.
//book                Selects all book elements, regardless of their position in the document.
//@lang               Selects all attributes named lang.
/bookstore/book[1]    Selects the first book child node of bookstore.
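
As a rough illustration, the four expressions above can be tried from Python with the standard-library `xml.etree.ElementTree` module. This is a sketch under one important assumption: ElementTree implements only a subset of XPath, evaluates paths relative to an element, and cannot return attribute nodes, so `//@lang` is emulated by selecting the elements that carry the attribute and reading it from each. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

XML = """<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
    <book>
      <title lang="eng">Harry Potter</title>
      <price>29.99</price>
    </book>
    <book>
      <title lang="zh">Learning XML</title>
      <price>39.95</price>
    </book>
</bookstore>"""

# Parse from bytes so the encoding declaration is honored.
root = ET.fromstring(XML.encode("utf-8"))

# /bookstore : the parsed root element is already <bookstore>.
print(root.tag)

# //book : every book element at any depth below the root.
books = root.findall(".//book")
print(len(books))

# //@lang : emulated by finding elements with a lang attribute.
langs = [elem.get("lang") for elem in root.findall(".//*[@lang]")]
print(langs)

# /bookstore/book[1] : the first book child of bookstore (positions start at 1).
first_book = root.find("./book[1]")
print(first_book.findtext("title"), first_book.findtext("price"))
```

For full XPath 1.0 support, including functions such as `count()`, a third-party library is usually needed instead of ElementTree.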

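CSS selector parsing

CSS selectors locate elements through the attributes of HTML tags and the relationships between them. An element's position can be determined from the tag hierarchy, class names, ids, and other attributes. In Python, we can use the BeautifulSoup library to parse pages with CSS selectors.

Next, we use an example to explain how to analyze a page with CSS selectors. Suppose we have the following HTML code: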

<!DOCTYPE html>
<html>
<head>
	<meta charset="utf-8">
	<title>This is the page title</title>
</head>
<body>
	<div class="content">
		<h2>This is a heading</h2>
		<p>This is a paragraph of text</p>
	</div>
	<div class="footer">
		<p>Copyright © 2021</p>
	</div>
</body>
</html>

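We can use the CSS selector syntax shown below to select page elements.

Selector        Result
div.content     Selects the div element whose class is content.
h2              Selects all h2 elements.
div.footer p    Selects all p elements under the div element whose class is footer.
[href]          Selects all elements that have an href attribute.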
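
The four selectors can be tried with BeautifulSoup's `select` and `select_one` methods. The sketch below assumes the third-party `beautifulsoup4` package is installed and uses Python's built-in `html.parser` backend. The sample page has no element with an `href` attribute, so a hypothetical `<a>` link is added to the footer purely so that `[href]` has something to match.

```python
from bs4 import BeautifulSoup

HTML = """<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8">
    <title>This is the page title</title>
</head>
<body>
    <div class="content">
        <h2>This is a heading</h2>
        <p>This is a paragraph of text</p>
    </div>
    <div class="footer">
        <p>Copyright © 2021</p>
        <a href="https://example.com/about">About</a>  <!-- hypothetical link -->
    </div>
</body>
</html>"""

soup = BeautifulSoup(HTML, "html.parser")

# div.content : the div whose class is "content" (first match only).
content_div = soup.select_one("div.content")

# h2 : all h2 elements.
headings = [h.get_text() for h in soup.select("h2")]

# div.footer p : all p elements below the div whose class is "footer".
footer_paragraphs = [p.get_text() for p in soup.select("div.footer p")]

# [href] : all elements that have an href attribute.
links = [a["href"] for a in soup.select("[href]")]

print(content_div.h2.get_text())
print(headings)
print(footer_paragraphs)
print(links)
```

`select` always returns a list, which may be empty, while `select_one` returns the first match or `None`, so real crawler code should check for `None` before using the result.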

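Regular expression parsing

Regular expressions can parse HTML pages by matching, searching, and replacing text. In Python, the re module provides regular expression support.

Below we use an example to show how to parse a page with regular expressions. Suppose we have the following HTML code: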

<!DOCTYPE html>
<html>
<head>
	<meta charset="utf-8">
	<title>This is the page title</title>
</head>
<body>
	<div class="content">
		<h2>This is a heading</h2>
		<p>This is a paragraph of text</p>
	</div>
	<div class="footer">
		<p>Copyright © 2021</p>
	</div>
</body>
</html>

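We can use the regular expression shown below to select page elements.

import re

html = '''
<!DOCTYPE html>
<html>
<head>
	<meta charset="utf-8">
	<title>This is the page title</title>
</head>
<body>
	<div class="content">
		<h2>This is a heading</h2>
		<p>This is a paragraph of text</p>
	</div>
	<div class="footer">
		<p>Copyright © 2021</p>
	</div>
</body>
</html>
'''

# Capture the heading and the paragraph inside <div class="content">.
# re.S lets "." match newlines, so ".*?" can span several lines.
pattern = re.compile(
    r'<div class="content">.*?<h2>(.*?)</h2>.*?<p>(.*?)</p>.*?</div>',
    re.S,
)
match = re.search(pattern, html)
if match:
    title = match.group(1)  # text of the h2 element
    text = match.group(2)   # text of the p element
    print(title)
    print(text)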

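In the code above, we use the compile method of the re module to compile the regular expression, and then use the search method to match the HTML code. In the regular expression, ".*?" is a non-greedy match, meaning it stops at the first tag that satisfies the pattern, and "re.S" lets "." match any character, including newlines. Finally, we use the group method to get the matched results.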
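
The effect of non-greedy matching and of `re.S` is easiest to see side by side. The following is a small sketch on two made-up strings (`two_paragraphs` and `multiline_div` are illustrative names, not taken from the example above).

```python
import re

two_paragraphs = "<p>first</p>\n<p>second</p>"

# Greedy ".*" runs to the LAST </p>, swallowing the markup in between.
greedy = re.search(r"<p>(.*)</p>", two_paragraphs, re.S).group(1)

# Non-greedy ".*?" stops at the FIRST </p> that completes the match.
lazy = re.search(r"<p>(.*?)</p>", two_paragraphs, re.S).group(1)

# findall with a non-greedy group returns one entry per element.
all_paragraphs = re.findall(r"<p>(.*?)</p>", two_paragraphs, re.S)

multiline_div = "<div>\nline\n</div>"

# Without re.S, "." cannot cross the newlines, so there is no match.
without_dotall = re.search(r"<div>(.*?)</div>", multiline_div)

# With re.S, "." also matches "\n" and the content is captured.
with_dotall = re.search(r"<div>(.*?)</div>", multiline_div, re.S).group(1)

print(repr(greedy))
print(repr(lazy))
print(all_paragraphs)
print(without_dotall)
print(repr(with_dotall))
```

Regular expressions work well for small, predictable fragments like these; for nested or irregular markup, XPath or CSS selectors are usually more robust.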
