(.*?)
.*?(.*?)
.*?We usually need to parse web crawled pages to get the data we need. By analyzing the combined structure of HTML tags, we can extract useful information contained in web pages. In Python, there are three common ways to parse HTML: regular expression parsing, XPath parsing, and CSS selector parsing.
Understanding the basic structure of HTML page is a prerequisite before explaining the HTML parsing method. When we open a website in a browser and select the "Show web page source code" menu item through the right-click menu of the mouse, we can see the HTML code corresponding to the web page. HTML code usually consists of tags, attributes, and text. The label carries the content displayed on the page, the attributes supplement the label information, and the text is the content displayed by the label. The following is a simple HTML page code structure example:
<!DOCTYPE html> <html> <head> <!-- head 标签中的内容不会在浏览器窗口中显示 --> <title>这是页面标题</title> </head> <body> <!-- body 标签中的内容会在浏览器窗口中显示 --> <h2>这是一级标题</h2> <p>这是一段文本</p> </body> </html>
In this HTML page code example, <!DOCTYPE html>
is the document type declaration, <html> The
tag is the root tag of the entire page, <head>
and <body>
are sub-tags of the <html>
tag, placed in The content under the <body>
tag will be displayed in the browser window. This part of the content is the main body of the web page; the content under the <head>
tag will not be displayed in the browser window. It is displayed in the browser window, but it contains important meta-information of the page, usually called the header of the web page. The general code structure of an HTML page is as follows:
<!DOCTYPE html> <html> <head> <!-- 页面的元信息,如字符编码、标题、关键字、媒体查询等 --> </head> <body> <!-- 页面的主体,显示在浏览器窗口中的内容 --> </body> </html>
tags, cascading style sheets (CSS) and JavaScript are the three basic components that make up an HTML page. Tags are used to carry the content to be displayed on the page, CSS is responsible for rendering the page, and JavaScript is used to control the interactive behavior of the page. To parse HTML pages, you can use XPath syntax, which is originally a query syntax for XML. It can extract content or tag attributes in tags based on the hierarchical structure of HTML tags. In addition, you can also use CSS selectors to locate pages. Elements are the same as rendering page elements using CSS.
XPath is a syntax for finding information in XML (eXtensible Markup Language) documents. XML is similar to HTML and is a tag language that uses tags to carry data. The difference The reason is that XML tags are extensible and customizable, and XML has stricter syntax requirements. XPath uses path expressions to select nodes or node sets in XML documents. The nodes mentioned here include elements, attributes, text, namespaces, processing instructions, comments, root nodes, etc.
XPath path expression is similar to file path syntax, you can use "/" and "//" to select nodes. When selecting the root node, you can use a single slash "/"; when selecting a node at any position, you can use a double slash "//". For example, "/bookstore/book" means selecting all book sub-nodes under the root node bookstore, and "//title" means selecting the title node at any position.
XPath can also use predicates to filter nodes. Nested expressions within square brackets can be numbers, comparison operators, or function calls that serve as predicates. For example, "/bookstore/book[1]" means selecting the first child node book of bookstore, and "//book[@lang]" means selecting all book nodes with the lang attribute.
XPath functions include string, mathematical, logical, node, sequence and other functions. These functions can be used to select nodes, calculate values, convert data types and other operations. For example, the "string-length(string)" function can return the length of the string, and the "count(node-set)" function can return the number of nodes in the node set.
Below we use an example to illustrate how to use XPath to parse the page. Suppose we have the following XML file:
<?xml version="1.0" encoding="UTF-8"?> <bookstore> <book> <title lang="eng">Harry Potter</title> <price>29.99</price> </book> <book> <title lang="zh">Learning XML</title> <price>39.95</price> </book> </bookstore>
For this XML file, we can use the XPath syntax as shown below to get the nodes in the document.
Path expression | Result |
---|---|
Select the root element bookstore. Note: If the path starts with a forward slash ( / ), this path always represents an absolute path to an element! | |
Selects all book child elements regardless of their position in the document. | |
Select all attributes named lang. | |
Select the first child node book of bookstore. |
选择器 | 结果 |
---|---|
div.content | 选取 class 为 content 的 div 元素。 |
h2 | 选取所有的 h2 元素。 |
div.footer p | 选取 class 为 footer 的 div 元素下的所有 p 元素。 |
[href] | 选取所有具有 href 属性的元素。 |
用正则表达式可以解析 HTML 页面,从而实现文本的匹配、查找和替换。使用 re 模块可以进行 Python 的正则表达式解析。
下面我们通过一个例子来说明如何使用正则表达式对页面进行解析。假设我们有如下的 HTML 代码:
<!DOCTYPE html> <html> <head> <meta charset="utf-8"> <title>这是页面标题</title> </head> <body> <div class="content"> <h2>这是一级标题</h2> <p>这是一段文本</p> </div> <div class="footer"> <p>版权所有 © 2021</p> </div> </body> </html>
我们可以使用如下所示的正则表达式来选取页面元素。
import re html = ''' <!DOCTYPE html> <html> <head> <meta charset="utf-8"> <title>这是页面标题</title> </head> <body> <div class="content"> <h2>这是一级标题</h2> <p>这是一段文本</p> </div> <div class="footer"> <p>版权所有 © 2021</p> </div> </body> </html> ''' pattern = re.compile(r'.*?', re.S) match = re.search(pattern, html) if match: title = match.group(1) text = match.group(2) print(title) print(text)(.*?)
.*?(.*?)
.*?
以上代码中,我们使用 re 模块的 compile 方法来编译正则表达式,然后使用 search 方法来匹配 HTML 代码。在正则表达式中,“.*?”表示非贪婪匹配,也就是匹配到第一个符合条件的标签就停止匹配,而“re.S”表示让“.”可以匹配包括换行符在内的任意字符。最后,我们使用 group 方法来获取匹配的结果。
The above is the detailed content of How to parse HTML pages with Python crawler. For more information, please follow other related articles on the PHP Chinese website!
AI-powered app for creating realistic nude photos
Online AI tool for removing clothes from photos.
Undress images for free
AI clothes remover
Swap faces in any video effortlessly with our completely free AI face swap tool!
Easy-to-use and free code editor
Chinese version, very easy to use
Powerful PHP integrated development environment
Visual web development tools
God-level code editing software (SublimeText3)
PHP is mainly procedural programming, but also supports object-oriented programming (OOP); Python supports a variety of paradigms, including OOP, functional and procedural programming. PHP is suitable for web development, and Python is suitable for a variety of applications such as data analysis and machine learning.
The roles of HTML, CSS and JavaScript in web development are: 1. HTML defines the web page structure, 2. CSS controls the web page style, and 3. JavaScript adds dynamic behavior. Together, they build the framework, aesthetics and interactivity of modern websites.
PHP is suitable for web development and rapid prototyping, and Python is suitable for data science and machine learning. 1.PHP is used for dynamic web development, with simple syntax and suitable for rapid development. 2. Python has concise syntax, is suitable for multiple fields, and has a strong library ecosystem.
PHP originated in 1994 and was developed by RasmusLerdorf. It was originally used to track website visitors and gradually evolved into a server-side scripting language and was widely used in web development. Python was developed by Guidovan Rossum in the late 1980s and was first released in 1991. It emphasizes code readability and simplicity, and is suitable for scientific computing, data analysis and other fields.
The future of HTML is full of infinite possibilities. 1) New features and standards will include more semantic tags and the popularity of WebComponents. 2) The web design trend will continue to develop towards responsive and accessible design. 3) Performance optimization will improve the user experience through responsive image loading and lazy loading technologies.
Python is more suitable for beginners, with a smooth learning curve and concise syntax; JavaScript is suitable for front-end development, with a steep learning curve and flexible syntax. 1. Python syntax is intuitive and suitable for data science and back-end development. 2. JavaScript is flexible and widely used in front-end and server-side programming.
To run Python code in Sublime Text, you need to install the Python plug-in first, then create a .py file and write the code, and finally press Ctrl B to run the code, and the output will be displayed in the console.
Writing code in Visual Studio Code (VSCode) is simple and easy to use. Just install VSCode, create a project, select a language, create a file, write code, save and run it. The advantages of VSCode include cross-platform, free and open source, powerful features, rich extensions, and lightweight and fast.