Detailed explanation of steps to use nodeJs crawler
This time I will bring you a detailed explanation of the steps for using the nodeJs crawler. What are the precautions when using the nodeJs crawler? Here are practical cases, let’s take a look.
Background
Recently I plan to review the nodeJs-related content I have seen before, and write a few crawlers to kill the boredom, and I discovered some during the crawling process Questions, record them for future reference.
Dependencies
The cheerio library that is widely available on the Internet is used to process the crawled content, superagent is used to process requests, and log4js is used to record logs.
Log configuration
Without further ado, let’s go directly to the code:
const log4js = require('log4js'); log4js.configure({ appenders: { cheese: { type: 'dateFile', filename: 'cheese.log', pattern: '-yyyy-MM-dd.log', // 包含模型 alwaysIncludePattern: true, maxLogSize: 1024, backups: 3 } }, categories: { default: { appenders: ['cheese'], level: 'info' } } }); const logger = log4js.getLogger('cheese'); logger.level = 'INFO'; module.exports = logger;
The above directly exports a logger object and directly calls the logger in the business file. Just use .info() and other functions to add log information, and logs will be generated on a daily basis. There is a lot of relevant information on the Internet.
Crawling content and processing
superagent.get(cityItemUrl).end((err, res) => { if (err) { return console.error(err); } const $ = cheerio.load(res.text); // 解析当前页面,获取当前页面的城市链接地址 const cityInfoEle = $('.newslist1 li a'); cityInfoEle.each((idx, element) => { const $element = $(element); const sceneURL = $element.attr('href'); // 页面地址 const sceneName = $element.attr('title'); // 城市名称 if (!sceneName) { return; } logger.info(`当前解析到的目的地是: ${sceneName}, 对应的地址为: ${sceneURL}`); getDesInfos(sceneURL, sceneName); // 获取城市详细信息 ep.after('getDirInfoComplete', cityInfoEle.length, (dirInfos) => { const content = JSON.parse(fs.readFileSync(path.join(dirname, './imgs.json'))); dirInfos.forEach((element) => { logger.info(`本条数据为:${JSON.stringify(element)}`); Object.assign(content, element); }); fs.writeFileSync(path.join(dirname, './imgs.json'), JSON.stringify(content)); }); }); });
Use superagent to request the page. After the request is successful, use cheerio to load the page content, and then use matching rules similar to Jquery to find the target resource. .
Multiple resources are loaded, use eventproxy to proxy events, process one resource and punish one event, and process the data after all events are triggered.
The above is the most basic crawler. Next are some areas that may cause problems or require special attention. . .
Read and write local files
Create folder
function mkdirSync(dirname) { if (fs.existsSync(dirname)) { return true; } if (mkdirSync(path.dirname(dirname))) { fs.mkdirSync(dirname); return true; } return false; }
Read and write files
const content = JSON.parse(fs.readFileSync(path.join(dirname, './dir.json'))); dirInfos.forEach((element) => { logger.info(`本条数据为:${JSON.stringify(element)}`); Object.assign(content, element); }); fs.writeFileSync(path.join(dirname, './dir.json'), JSON.stringify(content));
Batch download resources
Downloaded resources may include pictures, audio, etc.
Use Bagpipe to handle asynchronous concurrency. Refer to
const Bagpipe = require('bagpipe'); const bagpipe = new Bagpipe(10); bagpipe.push(downloadImage, url, dstpath, (err, data) => { if (err) { console.log(err); return; } console.log(`[${dstpath}]: ${data}`); });
to download resources and use stream to complete file writing.
function downloadImage(src, dest, callback) { request.head(src, (err, res, body) => { if (src && src.indexOf('http') > -1 || src.indexOf('https') > -1) { request(src).pipe(fs.createWriteStream(dest)).on('close', () => { callback(null, dest); }); } }); }
Encoding
Sometimes the web page content processed directly using cheerio.load is found to be encoded text after writing to the file. You can use
const $ = cheerio.load(buf, { decodeEntities: false });
to disable encoding,
ps: The encoding library and iconv-lite failed to convert utf-8 encoded characters into Chinese. It may be that you are not familiar with the API. You can pay attention to it later.
Finally, attach a regular pattern that matches all dom tags
const reg = /<.*?>/g;
I believe you have mastered the method after reading the case in this article. For more exciting information, please pay attention to other related articles on the PHP Chinese website!
Recommended reading:
Detailed explanation of how to use jQuery class name selector (.class)
js encapsulates ajax function function Detailed explanation of implementation steps
The above is the detailed content of Detailed explanation of steps to use nodeJs crawler. For more information, please follow other related articles on the PHP Chinese website!

Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

Video Face Swap
Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

Hot Tools

Notepad++7.3.1
Easy-to-use and free code editor

SublimeText3 Chinese version
Chinese version, very easy to use

Zend Studio 13.0.1
Powerful PHP integrated development environment

Dreamweaver CS6
Visual web development tools

SublimeText3 Mac version
God-level code editing software (SublimeText3)

Hot Topics

Node.js can be used as a backend framework as it offers features such as high performance, scalability, cross-platform support, rich ecosystem, and ease of development.

To connect to a MySQL database, you need to follow these steps: Install the mysql2 driver. Use mysql2.createConnection() to create a connection object that contains the host address, port, username, password, and database name. Use connection.query() to perform queries. Finally use connection.end() to end the connection.

The following global variables exist in Node.js: Global object: global Core module: process, console, require Runtime environment variables: __dirname, __filename, __line, __column Constants: undefined, null, NaN, Infinity, -Infinity

There are two npm-related files in the Node.js installation directory: npm and npm.cmd. The differences are as follows: different extensions: npm is an executable file, and npm.cmd is a command window shortcut. Windows users: npm.cmd can be used from the command prompt, npm can only be run from the command line. Compatibility: npm.cmd is specific to Windows systems, npm is available cross-platform. Usage recommendations: Windows users use npm.cmd, other operating systems use npm.

The main differences between Node.js and Java are design and features: Event-driven vs. thread-driven: Node.js is event-driven and Java is thread-driven. Single-threaded vs. multi-threaded: Node.js uses a single-threaded event loop, and Java uses a multi-threaded architecture. Runtime environment: Node.js runs on the V8 JavaScript engine, while Java runs on the JVM. Syntax: Node.js uses JavaScript syntax, while Java uses Java syntax. Purpose: Node.js is suitable for I/O-intensive tasks, while Java is suitable for large enterprise applications.

Yes, Node.js is a backend development language. It is used for back-end development, including handling server-side business logic, managing database connections, and providing APIs.

Node.js and Java each have their pros and cons in web development, and the choice depends on project requirements. Node.js excels in real-time applications, rapid development, and microservices architecture, while Java excels in enterprise-grade support, performance, and security.

Yes, Node.js can be used for front-end development, and key advantages include high performance, rich ecosystem, and cross-platform compatibility. Considerations to consider are learning curve, tool support, and small community size.
