Web Scraping with Beautiful Soup and Scrapy: Extracting Data Efficiently and Responsibly

Patricia Arquette

Jan 05, 2025, 07:18 AM

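In the digital age, data is a valuable asset, and web scraping has become an essential tool for extracting information from websites. This article looks at two popular Python libraries for web scraping: Beautiful Soup and Scrapy. We will examine their features, walk through working code examples, and discuss best practices for scraping responsibly.

Introduction to Web Scraping

Web scraping is the automated process of extracting data from websites. It is widely used in data analysis, machine learning, and competitive analysis. It must be done responsibly, however: respect each site's terms of service and stay within legal boundaries.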
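
One concrete way to act on this is to check a site's robots.txt before fetching a page. The sketch below uses Python's standard urllib.robotparser on an in-memory robots.txt so that it runs without network access. The rules, the user agent name example-scraper, and the URLs are all made up for illustration; robots.txt is also only one signal and does not replace reading a site's terms of service.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt):
    """Build a parser from robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "example-scraper" is an assumed user agent name for this sketch.
allowed = parser.can_fetch("example-scraper", "https://example-blog.com/posts/1")
blocked = parser.can_fetch("example-scraper", "https://example-blog.com/private/data")
delay = parser.crawl_delay("example-scraper")

print(allowed, blocked, delay)
```

Against a live site you would typically call set_url() with the robots.txt address and then read() instead of parse(), and wait at least the advertised crawl delay between requests.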

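Beautiful Soup: A Beginner-Friendly Library

Beautiful Soup is a Python library designed for quick, simple web scraping tasks. It is particularly useful for parsing HTML and XML documents and extracting data from them. Beautiful Soup provides Pythonic idioms for iterating over, searching, and modifying the parse tree.

Key Features

Ease of use: Beautiful Soup is beginner-friendly and easy to learn.
Flexible parsing: It can parse HTML and XML documents, including those with malformed markup.
Integration: It works well with other Python libraries, such as requests for fetching web pages.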
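
To make "iterating, searching, and modifying the parse tree" concrete, here is a small sketch that works on an inline HTML string rather than a live page. The markup, class names, and link paths are invented for the example; a real page will need its own tag names and attributes.

```python
from bs4 import BeautifulSoup

# Invented sample markup; no network access is needed.
HTML = """
<html><body>
  <article>
    <h1 class="entry-title"><a href="/posts/1">First post</a></h1>
    <p class="summary">Intro text.</p>
  </article>
  <article>
    <h1 class="entry-title"><a href="/posts/2">Second post</a></h1>
    <p class="summary">More text.</p>
  </article>
</body></html>
"""

soup = BeautifulSoup(HTML, "html.parser")
titles = soup.find_all("h1", class_="entry-title")

# Searching: find each title, then read the link text and href inside it.
links = [(h1.a.get_text(strip=True), h1.a["href"]) for h1 in titles]

# Iterating: step from each title to the next <p> sibling in the tree.
summaries = [h1.find_next_sibling("p").get_text(strip=True) for h1 in titles]

# Modifying: remove every summary paragraph from the tree.
for paragraph in soup.find_all("p", class_="summary"):
    paragraph.decompose()
remaining_paragraphs = len(soup.find_all("p"))

print(links, summaries, remaining_paragraphs)
```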

Installation

To get started with Beautiful Soup, install it together with the requests library:

pip install beautifulsoup4 requests

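Basic Example

Let's extract article titles from a sample blog page (the URL and the entry-title class are placeholders; adjust them to the page you are scraping):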

import requests
from bs4 import BeautifulSoup

# Fetch the web page
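url = 'https://example-blog.com'  # placeholder URL; replace with the page you want to scrape
response = requests.get(url, timeout=10)  # a timeout keeps the script from hanging indefinitely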
# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content
    soup = BeautifulSoup(response.text, 'html.parser')
    # Extract article titles
    titles = soup.find_all('h1', class_='entry-title')
    # Check if titles were found
    if titles:
        for title in titles:
            # Extract and print the text of each title
            print(title.get_text(strip=True))
    else:
        print("No titles found. Please check the HTML structure and update the selector.")
else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")
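
For anything beyond a one-off script, it can help to wrap fetching in a small helper that identifies the scraper, sets a timeout, and pauses between requests. The following is a minimal sketch under assumptions: the helper names build_session and polite_get, the User-Agent string, and the one-second default delay are illustrative choices, not part of requests itself.

```python
import time

import requests

# Hypothetical identifier; use a string that tells site owners how to reach you.
USER_AGENT = "example-scraper/0.1 (+https://example.com/contact)"


def build_session(user_agent=USER_AGENT):
    """Create a session that sends the same User-Agent on every request."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


def polite_get(session, url, delay_seconds=1.0, timeout=10):
    """Wait, fetch the URL, and raise an exception on HTTP error statuses."""
    time.sleep(delay_seconds)
    response = session.get(url, timeout=timeout)
    response.raise_for_status()
    return response


session = build_session()
print(session.headers["User-Agent"])
```

A call such as polite_get(session, 'https://example-blog.com') would then return a response whose text can be passed to BeautifulSoup as in the example above.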

Advantages

Simplicity: Well suited to small and medium-sized projects.
Robustness: Handles poorly formed HTML gracefully.
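
The robustness point can be illustrated with a deliberately broken fragment: the b tag is never closed, and neither are the p or div tags. This sketch assumes the built-in html.parser backend; other parsers such as lxml or html5lib may repair the same markup into a slightly different tree.

```python
from bs4 import BeautifulSoup

# Invented fragment with unclosed <b>, <p>, and <div> tags.
BROKEN_HTML = "<div class='post'><h1 class='entry-title'>Hello <b>world</h1><p>Body text"

soup = BeautifulSoup(BROKEN_HTML, "html.parser")

# The parser still builds a usable tree, so the usual lookups work.
title = soup.find("h1", class_="entry-title").get_text(" ", strip=True)
body = soup.find("p").get_text(strip=True)

print(title)
print(body)
```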

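Scrapy: A Powerful Web Scraping Framework

Scrapy is a full-featured web scraping framework that provides tools for large-scale data extraction. It is designed for performance and flexibility, which makes it suitable for complex projects.

Key Features

Speed and efficiency: Built-in support for asynchronous requests.
Extensibility: Highly customizable through middlewares and pipelines.
Built-in data export: Supports exporting data in several formats, including JSON, CSV, and XML.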
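
As a rough illustration of the pipeline and export features, the sketch below shows an item pipeline written as a plain Python class, plus a settings dictionary that would register it and configure a JSON feed. The class name CleanTextPipeline, the module path myproject.pipelines, and the specific values are assumptions for this example; the setting names themselves (ROBOTSTXT_OBEY, DOWNLOAD_DELAY, AUTOTHROTTLE_ENABLED, CONCURRENT_REQUESTS_PER_DOMAIN, ITEM_PIPELINES, FEEDS) are standard Scrapy settings.

```python
class CleanTextPipeline:
    """Hypothetical item pipeline: trims whitespace from string fields."""

    def process_item(self, item, spider):
        # Scrapy calls this once per scraped item; return the item to keep it.
        for key, value in list(item.items()):
            if isinstance(value, str):
                item[key] = value.strip()
        return item


# Settings of this shape could go in settings.py or a spider's custom_settings.
# All values here are illustrative.
CUSTOM_SETTINGS = {
    "ROBOTSTXT_OBEY": True,               # honor robots.txt rules
    "DOWNLOAD_DELAY": 1.0,                # seconds between requests to a site
    "AUTOTHROTTLE_ENABLED": True,         # adapt the delay to server load
    "CONCURRENT_REQUESTS_PER_DOMAIN": 4,  # cap parallel requests per domain
    "ITEM_PIPELINES": {"myproject.pipelines.CleanTextPipeline": 300},
    "FEEDS": {"quotes.json": {"format": "json", "encoding": "utf8"}},
}

# Quick check of the pipeline logic without running a crawl.
pipeline = CleanTextPipeline()
cleaned = pipeline.process_item({"text": "  hello  ", "tags": ["a"]}, spider=None)
print(cleaned)
```

Keeping the cleaning step in a pipeline rather than in the spider means the same logic applies to every spider in the project.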

Installation

Install Scrapy with pip:

pip install scrapy

Basic Example

To demonstrate Scrapy, we will create a spider that scrapes quotes from a website:

Create a Scrapy project (the project name quotes_scraper is only an example):

scrapy startproject quotes_scraper

Define the spider: create a file named quotes_spider.py in the project's spiders directory. The spider below is a minimal sketch; the target site (quotes.toscrape.com, a public sandbox for scraping practice) and the CSS selectors are assumptions, so adjust them for the site you are scraping:

import scrapy


class QuotesSpider(scrapy.Spider):
    # Name used to run the spider: scrapy crawl quotes
    name = "quotes"
    # Assumed target: a public sandbox site built for scraping practice
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # The selectors below assume that site's markup
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }

        # Follow the pagination link, if there is one
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
