Detailed explanation of how Python uses the Beautiful Soup module to search for content

高洛峰

Mar 31, 2017 am 09:47 AM

This article introduces the search methods of the Beautiful Soup module in Python. Passing different kinds of filter arguments to these methods filters the document in different ways to obtain the desired results.

Preface

We will use the search functions of the Beautiful Soup module to search by tag name, tag attributes, document text, and regular expressions.

Search method

Beautiful Soup's built-in search methods include:

find()
find_all()
find_parent()

Use the find() method to search

First, create an HTML file for testing. The code that follows reads it from search.html.

<html>
<body>
<p class="ecopyramid">
 <ul id="producers">
 <li class="producerlist">
  <p class="name">plants</p>
  <p class="number">100000</p>
 </li>
 <li class="producerlist">
  <p class="name">algae</p>
  <p class="number">100000</p>
 </li>
 </ul>
 <ul id="primaryconsumers">
 <li class="primaryconsumerlist">
  <p class="name">deer</p>
  <p class="number">1000</p>
 </li>
 <li class="primaryconsumerlist">
  <p class="name">rabbit</p>
  <p class="number">2000</p>
 </li>
 </ul>
 <ul id="secondaryconsumers">
 <li class="secondaryconsumerlist">
  <p class="name">fox</p>
  <p class="number">100</p>
 </li>
 <li class="secondaryconsumerlist">
  <p class="name">bear</p>
  <p class="number">100</p>
 </li>
 </ul>
 <ul id="tertiaryconsumers">
 <li class="tertiaryconsumerlist">
  <p class="name">lion</p>
  <p class="number">80</p>
 </li>
 <li class="tertiaryconsumerlist">
  <p class="name">tiger</p>
  <p class="number">50</p>
 </li>
 </ul>
</p>
</body>
</html>

We can get the <ul> tag with the find() method; by default, the first one that appears is returned. Then get the <li> tag, which by default is again the first one that appears. Then get the <p> tag, and verify that we got the first one by printing its content.
```
from bs4 import BeautifulSoup

# Parse the test file with the lxml parser
with open('search.html', 'r') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')

# find() returns the first matching <ul>
first_ul_entries = soup.find('ul')
print(first_ul_entries.li.p.string)
```

The find() method has the following signature:
```
find(name,attrs,recursive,text,**kwargs)
```

As the signature shows, the find() method accepts five parameters: name, attrs, recursive, text and **kwargs. The name, attrs and text parameters all act as filters in find() and narrow down the match.
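
As a minimal sketch of how these filters behave, the snippet below applies each one to a small inline fragment modelled on the test page. The fragment, the variable names, and the use of Python's built-in html.parser instead of lxml are assumptions made to keep the example self-contained. Newer Beautiful Soup 4 releases call the text parameter string, which is the spelling used here.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modelled on the test page; not the full search.html.
html = (
    '<ul id="producers">'
    '<li class="producerlist"><p class="name">plants</p><p class="number">100000</p></li>'
    '<li class="producerlist"><p class="name">algae</p><p class="number">100000</p></li>'
    '</ul>'
)
soup = BeautifulSoup(html, "html.parser")

by_name = soup.find("li")                        # name filter: first <li>
by_attrs = soup.find(attrs={"class": "number"})  # attrs filter: first tag with class "number"
by_text = soup.find(string="algae")              # text filter (string is the newer name of text)

# recursive=False restricts the search to direct children of the searched object.
ul = soup.find("ul")
direct_li = ul.find("li", recursive=False)       # <li> is a direct child of <ul>
direct_p = ul.find("p", recursive=False)         # <p> is nested deeper, so nothing matches

print(by_name.p.string, by_attrs.string, by_text, direct_li.p.string, direct_p)
```

When nothing matches, find() returns None, so it is worth checking the result before chaining attribute access onto it.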

Search for tags

In addition to searching for the <ul> tag as above, we can also search for the <li> tag; the result is again the first match that appears.
```
tag_li = soup.find('li')
# Equivalent: tag_li = soup.find(name="li")
print(type(tag_li))
print(tag_li.p.string)
```

Search text

To search by text content alone, pass only the text parameter:
```
search_for_text = soup.find(text='plants')
print(type(search_for_text))
# <class 'bs4.element.NavigableString'>
```

The returned result is a NavigableString object.
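
Because the result is a string node rather than a tag, a common next step is to move to the tag that encloses it. The sketch below illustrates this with an assumed inline fragment and the built-in html.parser; the variable names are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modelled on one <li> of the test page.
html = '<li class="producerlist"><p class="name">plants</p><p class="number">100000</p></li>'
soup = BeautifulSoup(html, "html.parser")

found = soup.find(string="plants")   # a NavigableString, not a Tag
enclosing = found.parent              # the <p class="name"> tag that holds the text

print(type(found).__name__)
print(enclosing.name, enclosing["class"])
```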

Search based on regular expression

Consider the following HTML text content:
```
The below HTML has the information that has email ids.
abc@example.com
xyz@example.com
foo@example.com
```

You can see that the email address abc@example.com is not enclosed in any tag, so it cannot be found by tag name. In this case, we can match it with a regular expression.
```
import re

email_id_example = """
 The below HTML has the information that has email ids.
 abc@example.com
 xyz@example.com
 foo@example.com
 """
email_soup = BeautifulSoup(email_id_example, 'lxml')
print(email_soup)
# Match text nodes that contain an email-like pattern
emailid_regexp = re.compile(r"\w+@\w+\.\w+")
first_email_id = email_soup.find(text=emailid_regexp)
print(first_email_id)
```

When matching with a regular expression, find() returns only the first match, even if several exist.
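
The sketch below shows the same idea on a variant of the sample in which two addresses sit inside tags and one is bare text. The <p> and <span> tags, the variable names, and the built-in html.parser are assumptions for illustration, and string is the newer name of the text parameter.

```python
import re
from bs4 import BeautifulSoup

# Assumed markup: two addresses are wrapped in tags, one is bare text.
email_html = (
    "<p>The below HTML has the information that has email ids.</p>\n"
    "abc@example.com\n"
    "<p>xyz@example.com</p>\n"
    "<span>foo@example.com</span>"
)
email_soup = BeautifulSoup(email_html, "html.parser")
email_pattern = re.compile(r"\w+@\w+\.\w+")

# find() stops at the first text node that matches the pattern.
first_email = email_soup.find(string=email_pattern).strip()
# find_all() collects every matching text node, whether or not a tag encloses it.
all_emails = [s.strip() for s in email_soup.find_all(string=email_pattern)]

print(first_email)
print(all_emails)
```

The strip() calls remove the newlines that surround the bare text node.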

Search by tag attribute value

Pass an attribute as a keyword argument to search by its value:
```
search_for_attribute = soup.find(id='primaryconsumers')
print(search_for_attribute.li.p.string)
```

Searching by attribute value works for most attributes, such as id, style, and title.

But two cases need different handling in the find() function: custom attributes and CSS classes.

Search based on custom attributes

In HTML5, you can add custom attributes to tags, such as a data-custom attribute.
As the following code shows, searching for it the same way as for id raises an error, because a Python keyword argument name cannot contain the - character.
```
customattr = """
 custom attribute example
 """
customsoup = BeautifulSoup(customattr, 'lxml')
# The call below is not valid Python, so it is left commented out:
# customsoup.find(data-custom="custom")
# SyntaxError: keyword can't be an expression
```

In this case, pass a dictionary through the attrs parameter to search:
```
using_attrs = customsoup.find(attrs={'data-custom': 'custom'})
print(using_attrs)
```
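
A self-contained sketch of the attrs lookup follows. It assumes the sample is a <p> tag carrying the data-custom attribute; the tag name, the variable names, and the built-in html.parser are illustrative assumptions.

```python
from bs4 import BeautifulSoup

# Assumed markup: the <p> tag name is illustrative; only data-custom comes from the text.
custom_html = '<p data-custom="custom">custom attribute example</p>'
custom_soup = BeautifulSoup(custom_html, "html.parser")

# The dictionary key may contain characters that a keyword argument cannot.
tag = custom_soup.find(attrs={"data-custom": "custom"})
print(tag.name, tag["data-custom"], tag.string)

# A value that matches nothing yields None rather than an exception.
missing = custom_soup.find(attrs={"data-custom": "other"})
print(missing)
```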

Search based on classes in CSS

Because class is a reserved keyword in Python, it cannot be passed directly as a keyword argument for the CSS class attribute. It is handled the same way as a custom attribute: pass a dictionary through the attrs parameter.
Alternatively, use the class_ keyword argument, which differs from class and does not cause an error.
```
css_class = soup.find(attrs={'class': 'producerlist'})
css_class2 = soup.find(class_="producerlist")
print(css_class)
print(css_class2)
```

Use custom function search

You can pass a function to the find() method, which then searches based on the conditions the function defines.
The function should return True or False.
```
def is_producers(tag):
    # True only for a tag whose id attribute equals 'producers'
    return tag.has_attr('id') and tag.get('id') == 'producers'

tag_producers = soup.find(is_producers)
print(tag_producers.li.p.string)
```

The code defines an is_producers function that checks whether a tag has an id attribute and whether its value equals producers. If both conditions hold it returns True; otherwise it returns False.
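
A function filter is most useful when the condition cannot be written as a plain attribute match. The sketch below uses a hypothetical has_large_number function that compares a tag's numeric text; the fragment, the threshold, and the built-in html.parser are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modelled on the primaryconsumers list of the test page.
html = (
    '<ul id="primaryconsumers">'
    '<li><p class="name">deer</p><p class="number">1000</p></li>'
    '<li><p class="name">rabbit</p><p class="number">2000</p></li>'
    '</ul>'
)
soup = BeautifulSoup(html, "html.parser")

def has_large_number(tag):
    # Hypothetical filter: a <p class="number"> whose text is an integer above 1000.
    if tag.name != "p" or "number" not in tag.get("class", []):
        return False
    return int(tag.string) > 1000

large = soup.find(has_large_number)
print(large.string)
print(large.find_previous_sibling("p").string)
```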

Combined use of various search methods

Beautiful Soup provides various search methods, and they can be combined to make a search more precise.
```
combine_html = """
 
 Example of p tag with class identical
 
 
 Example of p tag with class identical
 
 """
combine_soup = BeautifulSoup(combine_html, 'lxml')
# Filter by tag name and CSS class at the same time
identical_p = combine_soup.find("p", class_="identical")
print(identical_p)
```
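
Combining filters matters most when different tags share the same class. The sketch below assumes a <p> and a <div> that both carry class identical; this markup, the variable names, and the built-in html.parser are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Assumed markup: two different tags sharing one class.
combine_html = (
    '<p class="identical">Example of p tag with class identical</p>'
    '<div class="identical">Example of div tag with class identical</div>'
)
combine_soup = BeautifulSoup(combine_html, "html.parser")

by_class_only = combine_soup.find_all(class_="identical")     # matches both tags
identical_div = combine_soup.find("div", class_="identical")  # name + class narrows to the <div>

print(len(by_class_only))
print(identical_div.name, identical_div.string)
```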

Search with the find_all() method

The find() method returns only the first match, whereas find_all() returns every match.
The filters used with find() can also be used with find_all(). In fact, they work with any of the search methods, such as find_parents() and find_next_siblings().
```
# Find all tags whose class attribute equals tertiaryconsumerlist.
all_tertiaryconsumers = soup.find_all(class_='tertiaryconsumerlist')
print(type(all_tertiaryconsumers))
for tertiaryconsumers in all_tertiaryconsumers:
    print(tertiaryconsumers.p.string)
```

The find_all() method has the following signature:
```
find_all(name,attrs,recursive,text,limit,**kwargs)
```

Its parameters are similar to those of find(), with one addition: limit, which caps the number of results returned. find() behaves like find_all() with limit set to 1.
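
A minimal sketch of the limit parameter, using an assumed inline list and the built-in html.parser (both chosen only to keep the example self-contained):

```python
from bs4 import BeautifulSoup

# Hypothetical list with four items.
html = "<ul><li>plants</li><li>algae</li><li>deer</li><li>rabbit</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# limit=2 stops the search after two matches.
first_two = soup.find_all("li", limit=2)
print([li.string for li in first_two])

# With limit=1 the single result is the same tag that find() returns.
only_one = soup.find_all("li", limit=1)
print(only_one[0] == soup.find("li"))
```

Note that find_all() still returns a list when limit is 1, whereas find() returns the tag itself.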
We can also pass a list of strings to search for tags, tag attribute values, custom attribute values, and CSS classes.
```
# Find all p and li tags
p_li_tags = soup.find_all(["p", "li"])
print(p_li_tags)
print()
# Find all tags whose class is producerlist or primaryconsumerlist
all_css_class = soup.find_all(class_=["producerlist", "primaryconsumerlist"])
print(all_css_class)
print()
```
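
Search for related tags

We usually use find() and find_all() to search for specific tags, but we can also search for other tags of interest that are related to them.

Search for parent tags

Use the find_parent() or find_parents() method to search for a tag's parents.
find_parent() returns the first match, and find_parents() returns all matches, just like find() and find_all().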
```
# Search for parent tags
primaryconsumers = soup.find_all(class_='primaryconsumerlist')
print(len(primaryconsumers))
# Take the first matching tag
primaryconsumer = primaryconsumers[0]
# Find all ul ancestors of that tag
parent_ul = primaryconsumer.find_parents('ul')
print(len(parent_ul))
# The result contains the full content of each matching parent
print(parent_ul)
print()
# Get the closest parent; either of the two calls works
immediateprimary_consumer_parent = primaryconsumer.find_parent()
# immediateprimary_consumer_parent = primaryconsumer.find_parent('ul')
print(immediateprimary_consumer_parent)
```
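
The difference between the two parent searches can be sketched on a small nested fragment. The <div id="pyramid"> wrapper, the variable names, and the built-in html.parser are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical nesting: div > ul > li > p.
html = (
    '<div id="pyramid">'
    '<ul id="primaryconsumers">'
    '<li class="primaryconsumerlist"><p class="name">deer</p></li>'
    '</ul>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")
name_tag = soup.find("p", class_="name")

closest = name_tag.find_parent()               # nearest enclosing tag
ul_parent = name_tag.find_parent("ul")         # nearest enclosing <ul>
named = name_tag.find_parents(["ul", "div"])   # every enclosing <ul> or <div>, innermost first

print(closest.name, ul_parent["id"], [t.name for t in named])
```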

Search for sibling tags

Beautiful Soup also supports searching for sibling tags.
find_next_siblings() returns all following tags at the same level, while find_next_sibling() returns only the next one.
```
producers = soup.find(id='producers')
# All ul tags that follow the producers list at the same level
next_siblings = producers.find_next_siblings()
print(next_siblings)
```

Similarly, use find_previous_siblings() and find_previous_sibling() to search for preceding siblings.
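
A minimal sketch of searching in both directions from a middle sibling follows. The three empty lists, the variable names, and the built-in html.parser are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: three sibling lists with no whitespace between them.
html = (
    '<div>'
    '<ul id="producers"></ul>'
    '<ul id="primaryconsumers"></ul>'
    '<ul id="secondaryconsumers"></ul>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")
middle = soup.find(id="primaryconsumers")

next_one = middle.find_next_sibling()          # the sibling right after
previous_one = middle.find_previous_sibling()  # the sibling right before
after_first = [ul["id"] for ul in soup.find(id="producers").find_next_siblings()]

print(previous_one["id"], next_one["id"], after_first)
```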

Search for the next tag

find_next() returns the first match among the tags that follow, while find_all_next() returns all matching tags that follow.
```
# Find all li tags that come after the first p tag
first_p = soup.p
all_li_tags = first_p.find_all_next("li")
print(all_li_tags)
```

Search for the previous tag

As with searching for the next tag, use find_previous() and find_all_previous() to search the tags that come before.
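
The backward and forward searches can be compared on one short list. The fragment, the variable names, and the built-in html.parser in this sketch are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical list with three items.
html = "<ul><li>plants</li><li>algae</li><li>deer</li></ul>"
soup = BeautifulSoup(html, "html.parser")
items = soup.find_all("li")
last_li = items[-1]

# Walking backwards from the last item: nearest match first.
previous_li = last_li.find_previous("li")
all_previous = [li.string for li in last_li.find_all_previous("li")]

# Walking forwards from the first item, for comparison.
all_next = [li.string for li in items[0].find_all_next("li")]

print(previous_li.string, all_previous, all_next)
```

The backward results come out in reverse document order, because the search moves away from the starting tag.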