Table of Contents
Extracting Unstructured Data from Web Pages
Leveraging Reader View Functionality
Scraping Data with Node.js and Readability.js
Integrating Readability with LangChain.js
Improved Web Scraping Accuracy with Readability.js
Home Web Front-end JS Tutorial Clean up HTML Content for Retrieval-Augmented Generation with Readability.js

Clean up HTML Content for Retrieval-Augmented Generation with Readability.js

Jan 22, 2025 am 10:33 AM

Web scraping is a common method for gathering content for your retrieval-augmented generation (RAG) application. However, parsing web page content can be challenging.

Mozilla's open-source Readability.js library offers a convenient solution for extracting only the essential parts of a web page. Let's explore its integration into a data ingestion pipeline for a RAG application.

Extracting Unstructured Data from Web Pages

Web pages are rich sources of unstructured data, ideal for RAG applications. However, web pages often contain irrelevant information such as headers, sidebars, and footers. While useful for browsing, this extra content detracts from the page's main subject.

For optimal RAG data, irrelevant content must be removed. While tools like Cheerio can parse HTML based on a site's known structure, this approach is inefficient for scraping diverse website layouts. A robust method is needed to extract only relevant content.

Leveraging Reader View Functionality

Most browsers include a reader view that removes all but the article title and content. The following image illustrates the difference between standard browsing and reader mode applied to a DataStax blog post:

Clean up HTML Content for Retrieval-Augmented Generation with Readability.js

Mozilla provides Readability.js, the library behind Firefox's reader mode, as a standalone open-source module. This allows us to integrate Readability.js into a data pipeline to remove irrelevant content and improve scraping results.

Scraping Data with Node.js and Readability.js

Let's illustrate scraping article content from a previous blog post about creating vector embeddings in Node.js. The following JavaScript code retrieves the page's HTML:

const html = await fetch(
  "https://www.datastax.com/blog/how-to-create-vector-embeddings-in-node-js"
).then((res) => res.text());
console.log(html);
Copy after login
Copy after login

This includes all HTML, including navigation, footers, and other elements common on websites.

Alternatively, you could use Cheerio to select specific elements:

npm install cheerio
Copy after login
Copy after login
import * as cheerio from "cheerio";

const html = await fetch(
  "https://www.datastax.com/blog/how-to-create-vector-embeddings-in-node-js"
).then((res) => res.text());

const $ = cheerio.load(html);

console.log($("h1").text(), "\n");
console.log($("section#blog-content > div:first-child").text());
Copy after login
Copy after login

This yields the title and article text. However, this approach relies on knowing the HTML structure, which is not always feasible.

A better approach involves installing Readability.js and jsdom:

npm install @mozilla/readability jsdom
Copy after login
Copy after login

Readability.js operates within a browser environment, requiring jsdom to simulate this in Node.js. We can convert the loaded HTML into a document and use Readability.js to parse the content:

import { Readability } from "@mozilla/readability";
import { JSDOM } from "jsdom";

const url = "https://www.datastax.com/blog/how-to-create-vector-embeddings-in-node-js";
const html = await fetch(url).then((res) => res.text());

const doc = new JSDOM(html, { url });
const reader = new Readability(doc.window.document);
const article = reader.parse();

console.log(article);
Copy after login
Copy after login

The article object contains various parsed elements:

Clean up HTML Content for Retrieval-Augmented Generation with Readability.js

This includes the title, author, excerpt, publication time, and both HTML (content) and plain text (textContent). textContent is ready for chunking, embedding, and storage, while content retains links and images for further processing.

The isProbablyReaderable function helps determine if the document is suitable for Readability.js:

const html = await fetch(
  "https://www.datastax.com/blog/how-to-create-vector-embeddings-in-node-js"
).then((res) => res.text());
console.log(html);
Copy after login
Copy after login

Unsuitable pages should be flagged for review.

Integrating Readability with LangChain.js

Readability.js integrates seamlessly with LangChain.js. The following example uses LangChain.js to load a page, extract content with MozillaReadabilityTransformer, split text with RecursiveCharacterTextSplitter, create embeddings with OpenAI, and store data in Astra DB.

Required dependencies:

npm install cheerio
Copy after login
Copy after login

You'll need Astra DB credentials ( ASTRA_DB_APPLICATION_TOKEN, ASTRA_DB_API_ENDPOINT) and an OpenAI API key (OPENAI_API_KEY) as environment variables.

Import necessary modules:

import * as cheerio from "cheerio";

const html = await fetch(
  "https://www.datastax.com/blog/how-to-create-vector-embeddings-in-node-js"
).then((res) => res.text());

const $ = cheerio.load(html);

console.log($("h1").text(), "\n");
console.log($("section#blog-content > div:first-child").text());
Copy after login
Copy after login

Initialize components:

npm install @mozilla/readability jsdom
Copy after login
Copy after login

Load, transform, split, embed, and store documents:

import { Readability } from "@mozilla/readability";
import { JSDOM } from "jsdom";

const url = "https://www.datastax.com/blog/how-to-create-vector-embeddings-in-node-js";
const html = await fetch(url).then((res) => res.text());

const doc = new JSDOM(html, { url });
const reader = new Readability(doc.window.document);
const article = reader.parse();

console.log(article);
Copy after login
Copy after login

Improved Web Scraping Accuracy with Readability.js

Readability.js, a robust library powering Firefox's reader mode, efficiently extracts relevant data from web pages, improving RAG data quality. It can be used directly or via LangChain.js's MozillaReadabilityTransformer.

This is just the initial stage of your ingestion pipeline. Chunking, embedding, and Astra DB storage are subsequent steps in building your RAG application.

Do you employ other methods for cleaning web content in your RAG applications? Share your techniques!

The above is the detailed content of Clean up HTML Content for Retrieval-Augmented Generation with Readability.js. For more information, please follow other related articles on the PHP Chinese website!

Statement of this Website
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

Video Face Swap

Video Face Swap

Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

Roblox: Bubble Gum Simulator Infinity - How To Get And Use Royal Keys
3 weeks ago By 尊渡假赌尊渡假赌尊渡假赌
Mandragora: Whispers Of The Witch Tree - How To Unlock The Grappling Hook
3 weeks ago By 尊渡假赌尊渡假赌尊渡假赌
Nordhold: Fusion System, Explained
3 weeks ago By 尊渡假赌尊渡假赌尊渡假赌

Hot Tools

Notepad++7.3.1

Notepad++7.3.1

Easy-to-use and free code editor

SublimeText3 Chinese version

SublimeText3 Chinese version

Chinese version, very easy to use

Zend Studio 13.0.1

Zend Studio 13.0.1

Powerful PHP integrated development environment

Dreamweaver CS6

Dreamweaver CS6

Visual web development tools

SublimeText3 Mac version

SublimeText3 Mac version

God-level code editing software (SublimeText3)

Hot Topics

Java Tutorial
1669
14
PHP Tutorial
1273
29
C# Tutorial
1256
24
JavaScript Engines: Comparing Implementations JavaScript Engines: Comparing Implementations Apr 13, 2025 am 12:05 AM

Different JavaScript engines have different effects when parsing and executing JavaScript code, because the implementation principles and optimization strategies of each engine differ. 1. Lexical analysis: convert source code into lexical unit. 2. Grammar analysis: Generate an abstract syntax tree. 3. Optimization and compilation: Generate machine code through the JIT compiler. 4. Execute: Run the machine code. V8 engine optimizes through instant compilation and hidden class, SpiderMonkey uses a type inference system, resulting in different performance performance on the same code.

Python vs. JavaScript: The Learning Curve and Ease of Use Python vs. JavaScript: The Learning Curve and Ease of Use Apr 16, 2025 am 12:12 AM

Python is more suitable for beginners, with a smooth learning curve and concise syntax; JavaScript is suitable for front-end development, with a steep learning curve and flexible syntax. 1. Python syntax is intuitive and suitable for data science and back-end development. 2. JavaScript is flexible and widely used in front-end and server-side programming.

From C/C   to JavaScript: How It All Works From C/C to JavaScript: How It All Works Apr 14, 2025 am 12:05 AM

The shift from C/C to JavaScript requires adapting to dynamic typing, garbage collection and asynchronous programming. 1) C/C is a statically typed language that requires manual memory management, while JavaScript is dynamically typed and garbage collection is automatically processed. 2) C/C needs to be compiled into machine code, while JavaScript is an interpreted language. 3) JavaScript introduces concepts such as closures, prototype chains and Promise, which enhances flexibility and asynchronous programming capabilities.

JavaScript and the Web: Core Functionality and Use Cases JavaScript and the Web: Core Functionality and Use Cases Apr 18, 2025 am 12:19 AM

The main uses of JavaScript in web development include client interaction, form verification and asynchronous communication. 1) Dynamic content update and user interaction through DOM operations; 2) Client verification is carried out before the user submits data to improve the user experience; 3) Refreshless communication with the server is achieved through AJAX technology.

JavaScript in Action: Real-World Examples and Projects JavaScript in Action: Real-World Examples and Projects Apr 19, 2025 am 12:13 AM

JavaScript's application in the real world includes front-end and back-end development. 1) Display front-end applications by building a TODO list application, involving DOM operations and event processing. 2) Build RESTfulAPI through Node.js and Express to demonstrate back-end applications.

Understanding the JavaScript Engine: Implementation Details Understanding the JavaScript Engine: Implementation Details Apr 17, 2025 am 12:05 AM

Understanding how JavaScript engine works internally is important to developers because it helps write more efficient code and understand performance bottlenecks and optimization strategies. 1) The engine's workflow includes three stages: parsing, compiling and execution; 2) During the execution process, the engine will perform dynamic optimization, such as inline cache and hidden classes; 3) Best practices include avoiding global variables, optimizing loops, using const and lets, and avoiding excessive use of closures.

Python vs. JavaScript: Community, Libraries, and Resources Python vs. JavaScript: Community, Libraries, and Resources Apr 15, 2025 am 12:16 AM

Python and JavaScript have their own advantages and disadvantages in terms of community, libraries and resources. 1) The Python community is friendly and suitable for beginners, but the front-end development resources are not as rich as JavaScript. 2) Python is powerful in data science and machine learning libraries, while JavaScript is better in front-end development libraries and frameworks. 3) Both have rich learning resources, but Python is suitable for starting with official documents, while JavaScript is better with MDNWebDocs. The choice should be based on project needs and personal interests.

Python vs. JavaScript: Development Environments and Tools Python vs. JavaScript: Development Environments and Tools Apr 26, 2025 am 12:09 AM

Both Python and JavaScript's choices in development environments are important. 1) Python's development environment includes PyCharm, JupyterNotebook and Anaconda, which are suitable for data science and rapid prototyping. 2) The development environment of JavaScript includes Node.js, VSCode and Webpack, which are suitable for front-end and back-end development. Choosing the right tools according to project needs can improve development efficiency and project success rate.

See all articles