


How to control the number of concurrencies using async and enterproxy
I believe concurrency is familiar to everyone. This article mainly introduces to you the relevant information about using async and enterproxy to control the number of concurrencies. The article introduces it in detail through sample code, which is of certain significance to everyone's study or work. The reference learning value, friends who need it, please study together below.
Let’s talk about concurrency and parallelism
Concurrency, in the operating system, means that several programs are started in a period of time Between running and completion, these programs are all running on the same processor, but only one program is running on the processor at any point in time.
Concurrency is something we often mention. Whether it is a web server or app, concurrency is everywhere. In the operating system, it refers to several programs in a period of time between starting and running, and these programs They all run on the same processor, and only one program is running on the processor at any point in time. Many websites have limits on the number of concurrent connections, so when requests are sent too quickly, the return value will be empty or an error will be reported. What's more, some websites may block your IP because you send too many concurrent connections and think you are making malicious requests.
Compared with concurrency, parallelism may be a lot unfamiliar. Parallelism refers to a group of programs executing at an independent and asynchronous speed, which is not equal to overlap in time (occurring at the same moment). Multiple CPU cores can be implemented by increasing the number of CPU cores. Programs (tasks) are carried out simultaneously. That's right, parallel tasks can be performed simultaneously
Use enterproxy to control the number of concurrencies
enterproxy is a tool that Pu Lingda made a major contribution to , bringing about a change in the thinking of event-based programming, using the event mechanism to decouple complex business logic, solving the criticism of the coupling of callback functions, turning serial waiting into parallel waiting, and improving execution efficiency in multi-asynchronous collaboration scenarios
How do we use enterproxy to control the number of concurrencies? Usually if we don't use enterproxy and homemade counters, if we grab three sources:
This deeply nested, serial way
var render = function (template, data) { _.template(template, data); }; $.get("template", function (template) { // something $.get("data", function (data) { // something $.get("l10n", function (l10n) { // something render(template, data, l10n); }); }); });
Remove this deep nesting in the past Method, our conventional way of writing is to maintain a counter by ourselves
(function(){ var count = 0; var result = {}; $.get('template',function(data){ result.data1 = data; count++; handle(); }) $.get('data',function(data){ result.data2 = data; count++; handle(); }) $.get('l10n',function(data){ result.data3 = data; count++; handle(); }) function handle(){ if(count === 3){ var html = fuck(result.data1,result.data2,result.data3); render(html); } } })();
Here, enterproxy can play the role of this counter. It helps you manage whether these asynchronous operations are completed. After completion, it will automatically call you to provide processing function, and pass the captured data as parameters
var ep = new enterproxy(); ep.all('data_event1','data_event2','data_event3',function(data1,data2,data3){ var html = fuck(data1,data2,data3); render(html); }) $.get('http:example1',function(data){ ep.emit('data_event1',data); }) $.get('http:example2',function(data){ ep.emit('data_event2',data); }) $.get('http:example3',function(data){ ep.emit('data_event3',data); })
enterproxy also provides APIs required for many other scenarios. You can learn this API by yourself enterproxy
Use async to control the number of concurrency
If we have 40 requests to make, many websites may think you are making a malicious request because you send too many concurrent connections. Block your IP.
So we always need to control the number of concurrency, and then slowly crawl these 40 links.
Use mapLimit in async to control the number of concurrency at one time to 5, and only capture 5 links at a time.
async.mapLimit(arr, 5, function (url, callback) { // something }, function (error, result) { console.log("result: ") console.log(result); })
We should first know what concurrency is, why we need to limit the number of concurrencies, and what solutions are available. Then you can go to the documentation to see how to use the API. The async documentation is a great way to learn these syntaxes.
Simulate a set of data, the data returned here is false, and the return delay is random.
var concurreyCount = 0; var fetchUrl = function(url,callback){ // delay 的值在 2000 以内,是个随机的整数 模拟延时 var delay = parseInt((Math.random()* 10000000) % 2000,10); concurreyCount++; console.log('现在并发数是 ' , concurreyCount , ' 正在抓取的是' , url , ' 耗时' + delay + '毫秒'); setTimeout(function(){ concurreyCount--; callback(null,url + ' html content'); },delay); } var urls = []; for(var i = 0;i<30;i++){ urls.push('http://datasource_' + i) }
Then we use async.mapLimit to crawl concurrently and obtain the results.
async.mapLimit(urls,5,function(url,callback){ fetchUrl(url,callbcak); },function(err,result){ console.log('result: '); console.log(result); })
The simulation is excerpted from alsotang
The following results are obtained after running the output
We found that the number of concurrency starts from 1, but increases to At 5 o'clock, it is no longer increasing. When there are tasks, continue to crawl, and the number of concurrent connections is always controlled at 5.
Complete the node simple crawler system
Because the concurrency controlled by eventproxy is used in the tutorial example of "node teaching is not guaranteed" by senior alsotang Quantity, let’s complete a simple node crawler that uses async to control the number of concurrencies.
The crawling target is the homepage of this site (manual face protection)
The first step, first we need to use the following modules:
url: used for url parsing, here url.resolve() is used to generate a legal domain name
async: a practical module that provides powerful functions and asynchronous JavaScript work
cheerio: A fast, flexible, jQuery core implementation specially customized for the server
superagent: A very convenient client request in nodejs Agent module
Install the dependent module through npm
The second step is to introduce the dependent module through require and determine the crawling object URL:
var url = require("url"); var async = require("async"); var cheerio = require("cheerio"); var superagent = require("superagent"); var baseUrl = 'http://www.chenqaq.com';
Step 3: Use superagent to request the target URL, and use cheerio to process baseUrl to get the target content URL, and save it in the array arr
superagent.get(baseUrl) .end(function (err, res) { if (err) { return console.error(err); } var arr = []; var $ = cheerio.load(res.text); // 下面和jQuery操作是一样一样的.. $(".post-list .post-title-link").each(function (idx, element) { $element = $(element); var _url = url.resolve(baseUrl, $element.attr("href")); arr.push(_url); }); // 验证得到的所有文章链接集合 output(arr); // 第四步:接下来遍历arr,解析每一个页面需要的信息 })
We need a function to verify the captured url object , it’s very simple. We only need a function to traverse arr and print it out:
function output(arr){ for(var i = 0;i<arr.length;i++){ console.log(arr[i]); } }
第四步:我们需要遍历得到的URL对象,解析每一个页面需要的信息。
这里就需要用到async控制并发数量,如果你上一步获取了一个庞大的arr数组,有多个url需要请求,如果同时发出多个请求,一些网站就可能会把你的行为当做恶意请求而封掉你的ip
async.mapLimit(arr,3,function(url,callback){ superagent.get(url) .end(function(err,mes){ if(err){ console.error(err); console.log('message info ' + JSON.stringify(mes)); } console.log('「fetch」' + url + ' successful!'); var $ = cheerio.load(mes.text); var jsonData = { title:$('.post-card-title').text().trim(), href: url, }; callback(null,jsonData); },function(error,results){ console.log('results '); console.log(results); }) })
得到上一步保存url地址的数组arr,限制最大并发数量为3,然后用一个回调函数处理 「该回调函数比较特殊,在iteratee方法中一定要调用该回调函数,有三种方式」
callback(null)
调用成功callback(null,data)
调用成功,并且返回数据data追加到resultscallback(data)
调用失败,不会再继续循环,直接到最后的callback
好了,到这里我们的node简易的小爬虫就完成了,来看看效果吧
嗨呀,首页数据好少,但是成功了呢。
参考资料
Node.js 包教不包会 - alsotang
enterproxy
async
async Documentation
上面是我整理给大家的,希望今后会对大家有帮助。
相关文章:
The above is the detailed content of How to control the number of concurrencies using async and enterproxy. For more information, please follow other related articles on the PHP Chinese website!

Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

Video Face Swap
Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

Hot Tools

Notepad++7.3.1
Easy-to-use and free code editor

SublimeText3 Chinese version
Chinese version, very easy to use

Zend Studio 13.0.1
Powerful PHP integrated development environment

Dreamweaver CS6
Visual web development tools

SublimeText3 Mac version
God-level code editing software (SublimeText3)

Hot Topics

Face detection and recognition technology is already a relatively mature and widely used technology. Currently, the most widely used Internet application language is JS. Implementing face detection and recognition on the Web front-end has advantages and disadvantages compared to back-end face recognition. Advantages include reducing network interaction and real-time recognition, which greatly shortens user waiting time and improves user experience; disadvantages include: being limited by model size, the accuracy is also limited. How to use js to implement face detection on the web? In order to implement face recognition on the Web, you need to be familiar with related programming languages and technologies, such as JavaScript, HTML, CSS, WebRTC, etc. At the same time, you also need to master relevant computer vision and artificial intelligence technologies. It is worth noting that due to the design of the Web side

Concurrency and multithreading techniques using Java functions can improve application performance, including the following steps: Understand concurrency and multithreading concepts. Leverage Java's concurrency and multi-threading libraries such as ExecutorService and Callable. Practice cases such as multi-threaded matrix multiplication to greatly shorten execution time. Enjoy the advantages of increased application response speed and optimized processing efficiency brought by concurrency and multi-threading.

Concurrency and coroutines are used in GoAPI design for: High-performance processing: Processing multiple requests simultaneously to improve performance. Asynchronous processing: Use coroutines to process tasks (such as sending emails) asynchronously, releasing the main thread. Stream processing: Use coroutines to efficiently process data streams (such as database reads).

Transactions ensure database data integrity, including atomicity, consistency, isolation, and durability. JDBC uses the Connection interface to provide transaction control (setAutoCommit, commit, rollback). Concurrency control mechanisms coordinate concurrent operations, using locks or optimistic/pessimistic concurrency control to achieve transaction isolation to prevent data inconsistencies.

Functions and features of Go language Go language, also known as Golang, is an open source programming language developed by Google. It was originally designed to improve programming efficiency and maintainability. Since its birth, Go language has shown its unique charm in the field of programming and has received widespread attention and recognition. This article will delve into the functions and features of the Go language and demonstrate its power through specific code examples. Native concurrency support The Go language inherently supports concurrent programming, which is implemented through the goroutine and channel mechanisms.

Unit testing concurrent functions is critical as this helps ensure their correct behavior in a concurrent environment. Fundamental principles such as mutual exclusion, synchronization, and isolation must be considered when testing concurrent functions. Concurrent functions can be unit tested by simulating, testing race conditions, and verifying results.

Atomic classes are thread-safe classes in Java that provide uninterruptible operations and are crucial for ensuring data integrity in concurrent environments. Java provides the following atomic classes: AtomicIntegerAtomicLongAtomicReferenceAtomicBoolean These classes provide methods for getting, setting, and comparing values to ensure that the operation is atomic and will not be interrupted by threads. Atomic classes are useful when working with shared data and preventing data corruption, such as maintaining concurrent access to a shared counter.

Go process scheduling uses a cooperative algorithm. Optimization methods include: using lightweight coroutines as much as possible to reasonably allocate coroutines to avoid blocking operations and use locks and synchronization primitives.
