Web Scraping Made Easy: Parse Any HTML Page with Puppeteer

WBOY

Sep 05, 2024 pm 10:34 PM


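Imagine building an e-commerce platform that pulls real-time product data from major stores such as eBay, Amazon, and Flipkart. Sure, there are Shopify and similar services, but let's be honest: buying a subscription for a single project can feel like a hassle. So I thought, why not scrape these sites and store the products directly in our own database? It would be an efficient and cost-effective way to get products for our e-commerce project.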

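What is web scraping?

Web scraping means extracting data from websites by parsing a page's HTML to read and collect its content. It usually involves automating a browser or sending HTTP requests to the site, then analyzing the HTML structure to retrieve specific pieces of information such as text, links, or images. Puppeteer is one library that can be used to scrape websites.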
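
To make the "send a request, then analyze the HTML" idea concrete, here is a minimal sketch that needs no browser at all. It assumes Node.js 18 or later (for the built-in fetch). The helper names extractLinks and fetchLinks are made up for illustration, and the regular expression is a deliberate simplification that only handles double-quoted href attributes; for real pages a proper HTML parser or Puppeteer is the safer choice.

```javascript
// Hypothetical helper: collect href values from anchor tags in an HTML string.
// Simplification: a regex is not a real HTML parser; it only handles
// double-quoted href attributes.
function extractLinks(html) {
    const pattern = /<a\b[^>]*\bhref="([^"]*)"/gi;
    return Array.from(html.matchAll(pattern), (match) => match[1]);
}

// Hypothetical helper: download a page and return the links found in it.
// Assumes Node.js 18+ where fetch is available globally.
async function fetchLinks(url) {
    const response = await fetch(url);
    if (!response.ok) {
        throw new Error(`Request failed with status ${response.status}`);
    }
    return extractLinks(await response.text());
}
```

This approach only sees the HTML the server sends. Pages that build their content with JavaScript in the browser are where a tool like Puppeteer becomes useful.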

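What is Puppeteer?

Puppeteer is a Node.js library. It provides a high-level API for controlling headless Chrome or Chromium. Headless Chrome is Chrome running without a visible UI, which makes it well suited to running in the background.

We can use Puppeteer to automate a variety of tasks, for example:

Web scraping: extracting content from a website involves interacting with the page's HTML and JavaScript. We usually retrieve content by targeting CSS selectors.
PDF generation: converting a web page to PDF programmatically is ideal when you want to generate the PDF directly from the page, rather than taking a screenshot and then converting the screenshot to PDF. (P.S. Apologies if you already have a workaround for this.)
Automated testing: run tests on a web page by simulating user actions such as clicking buttons, filling in forms, and taking screenshots. This removes the tedious process of manually checking long forms to make sure everything is in place.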
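
As an illustration of the PDF generation task, the sketch below renders an HTML string straight to a PDF file. The function name htmlToPdf, the output path, and the injectable puppeteer parameter are assumptions made for this example; the default value simply loads the real library when no replacement is passed in.

```javascript
// Sketch: render an HTML string to a PDF file with Puppeteer.
// The function name and the injectable `puppeteer` parameter are illustrative;
// the default value loads the real library lazily, at call time.
async function htmlToPdf(html, outputPath, puppeteer = require("puppeteer")) {
    const browser = await puppeteer.launch();
    try {
        const page = await browser.newPage();
        // Load the markup directly instead of navigating to a URL
        await page.setContent(html, { waitUntil: "load" });
        // Write an A4 PDF to disk, including CSS backgrounds
        await page.pdf({ path: outputPath, format: "A4", printBackground: true });
        return outputPath;
    } finally {
        // Always release the browser, even if rendering fails
        await browser.close();
    }
}
```

With Puppeteer installed, a call such as htmlToPdf("<h1>Invoice</h1>", "invoice.pdf") should leave a one-page PDF in the working directory.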

How to get started with Puppeteer?

First we need to install the library.
Using npm:

npm i puppeteer # Downloads compatible Chrome during installation.
npm i puppeteer-core # Alternatively, install as a library, without downloading Chrome.


Using yarn:

yarn add puppeteer # Downloads compatible Chrome during installation.
yarn add puppeteer-core # Alternatively, install as a library, without downloading Chrome.


Using pnpm:

pnpm add puppeteer # Downloads compatible Chrome during installation.
pnpm add puppeteer-core # Alternatively, install as a library, without downloading Chrome.
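
Because puppeteer-core does not download a browser, launch() has to be told which browser to use, typically through the executablePath option. The sketch below shows one way to wire that up. CHROME_PATH is a hypothetical environment variable name chosen for this example (not a Puppeteer convention), and both helper names are made up.

```javascript
// Hypothetical helper: build launch options for puppeteer-core.
// CHROME_PATH is an assumed environment variable name, not a Puppeteer convention.
function buildLaunchOptions(env = process.env) {
    if (!env.CHROME_PATH) {
        throw new Error("Set CHROME_PATH to a Chrome or Chromium executable");
    }
    return { headless: true, executablePath: env.CHROME_PATH };
}

// Launch a browser through puppeteer-core using the options above.
// The library is injectable so the function can be exercised without Chrome.
async function launchWithLocalChrome(puppeteerCore = require("puppeteer-core"), env = process.env) {
    return puppeteerCore.launch(buildLaunchOptions(env));
}
```

The full puppeteer package needs none of this, since it launches the Chrome build it downloaded during installation.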


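Example demonstrating Puppeteer usage

Here is an example of how to scrape a website. (P.S. I used this code to retrieve products from the Myntra website for my e-commerce project.)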

const puppeteer = require("puppeteer");
const CategorySchema = require("./models/Category");

// Define scrape as an async arrow function
const scrape = async () => {
    // Launch a new browser instance
    const browser = await puppeteer.launch({ headless: false });

    // Open a new page
    const page = await browser.newPage();

    // Navigate to the target URL and wait for the DOMContentLoaded event
    await page.goto('https://www.myntra.com/mens-sport-wear?rawQuery=mens%20sport%20wear', { waitUntil: 'domcontentloaded' });

    // Pause for a fixed 25 seconds so client-side rendering can finish
    await new Promise((resolve) => setTimeout(resolve, 25000));

    // Extract product details from the page
    const items = await page.evaluate(() => {
        // Select all product elements
        const elements = document.querySelectorAll('.product-base');
        const elementsArray = Array.from(elements);

        // Map each element to an object with the desired properties
        const results = elementsArray.map((element) => {
            const image = element.querySelector(".product-imageSliderContainer img")?.getAttribute("src");
            return {
                image: image ?? null,
                brand: element.querySelector(".product-brand")?.textContent,
                title: element.querySelector(".product-product")?.textContent,
                discountPrice: element.querySelector(".product-price .product-discountedPrice")?.textContent,
                actualPrice: element.querySelector(".product-price .product-strike")?.textContent,
                discountPercentage: element.querySelector(".product-price .product-discountPercentage")?.textContent?.split(' ')[0]?.slice(1, -1),
                total: 20, // Placeholder value, adjust as needed
                available: 10, // Placeholder value, adjust as needed
                ratings: Math.round((Math.random() * 5) * 10) / 10 // Random rating for demonstration
            };
        });

        return results; // Return the list of product details
    });

    // Close the browser
    await browser.close();

    // Prepare the data for saving
    const data = {
        category: "mens-sport-wear",
        subcategory: "Mens",
        list: items
    };

    // Create a new Category document and save it to the database
    // Since we want to store product information in our e-commerce store, we use a schema and save it to the database.
    // If you don't need to save the data, you can omit this step.
    const category = new CategorySchema(data);
    console.log(category);
    await category.save();

    // Return the scraped items
    return items;
};

// Export the scrape function (CommonJS) so other modules can require it
module.exports = scrape;


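Explanation:

In this code, we use Puppeteer to scrape product data from a website. After extracting the details, we use a schema (CategorySchema) to structure the data and save it to the database. This step is especially useful if we want to integrate the scraped products into our e-commerce store. If you don't need to store the data in a database, you can omit the schema-related code.
Before scraping, it is important to understand the page's HTML structure and identify which CSS selectors contain the content you want to extract.
In my case, I used the relevant CSS selectors identified on the Myntra website to extract the content I was targeting.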
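
Once you know which selector holds your content, you can also wait for that selector directly. The example above pauses for a fixed 25 seconds and closes the browser only if every step succeeds. The sketch below is one possible variation, not the code used for the Myntra project: it waits until the selector appears and closes the browser in a finally block. The function name scrapeTexts, its parameters, and the 30-second timeout are illustrative choices.

```javascript
// Sketch: scrape text from elements once they appear, and always close the browser.
// `scrapeTexts`, its parameters, and the 30-second timeout are illustrative choices.
async function scrapeTexts(url, selector, puppeteer = require("puppeteer")) {
    const browser = await puppeteer.launch({ headless: true });
    try {
        const page = await browser.newPage();
        await page.goto(url, { waitUntil: "domcontentloaded" });
        // Wait until at least one matching element exists instead of sleeping
        // for a fixed time; rejects if nothing matches within the timeout
        await page.waitForSelector(selector, { timeout: 30000 });
        // Runs in the browser context: collect the trimmed text of every match
        return await page.$$eval(selector, (elements) =>
            elements.map((element) => element.textContent.trim())
        );
    } finally {
        // Runs on success and on failure, so no browser process is left behind
        await browser.close();
    }
}
```

For instance, scrapeTexts(url, ".product-brand") would return the brand names from a listing page, assuming that class name still matches the site's current markup.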
