PHP simple crawler
Simple crawler design
Introduction
Calling this a crawler is a bit of an exaggeration, but the name fits well enough, so I added the word "simple" in front to make clear that
this is a stripped-down crawler: something you can use for simple jobs or just play around with.
My company recently took on a new project that requires scraping data from competing products. After reading through the scraping system a former colleague had written, I found some problems:
the rules were too rigid, and both its scalability and generality were somewhat weak. The old system had to build a list first
and then crawl from that list; there was no concept of depth, which is a real shortcoming for a crawler. So I decided to build a
slightly more general crawler, add the concept of depth, and improve scalability and generality.
Design
Let's agree that the content to be processed (which might be a URL, a user name, etc.) will be called an entity.
For scalability, a queue is used: all entities waiting to be processed are stored in the queue. Each round,
one entity is taken off the queue; once it has been handled, its result is stored and any newly discovered entities are pushed onto the queue. Of course,
both the stored results and the enqueued entities need to be deduplicated, so the processor doesn't waste effort on work it has already done.
+--------+  +-----------+  +----------+
| entity |  |  enqueue  |  |  result  |
|  list  |  | uniq list |  | uniq list|
|        |  |           |  |          |
|        |  |           |  |          |
|        |  |           |  |          |
|        |  |           |  |          |
+--------+  +-----------+  +----------+
When an entity enters the queue, its "already enqueued" flag is set to 1 so it will never be enqueued a second time. When an
entity is processed and result data is obtained, the result is handled and then marked in the result dedup list as already saved.
Of course, you can also choose to update an existing record at this point; the code supports that as well.
     +-------+
     | start |
     +---+---+
         |
         v
     +-------+   enqueue entities with deep = 1
     | init  |----------------------------------->
     +---+---+   set "already enqueued" flag
         |
         v
     +---------+   empty queue    +-----+
+--->| dequeue |----------------->| end |
|    +----+----+                  +-----+
|         |
|         v
|    +---------------+   enqueue entities with deep = deep + 1
|    | handle entity |------------------------------------------>
|    +-------+-------+   set "already enqueued" flag
|            |
|            v
|    +---------------+   set "result already handled" flag
|    | handle result |-------------------------------------->
|    +-------+-------+
|            |
+------------+
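To make the flow above concrete, here is a minimal, self-contained sketch of the loop in plain PHP. It is only an illustration of the flowchart, not the actual implementation further below (which uses a Redis-backed queue); fetchEntity() is a hypothetical stand-in for the real crawling work.

<?php
// Minimal sketch of the crawl loop shown in the flowchart above.
// fetchEntity() is a placeholder: it returns the data for an entity
// plus the child entities discovered under it.
function fetchEntity($entity)
{
    // placeholder: pretend every entity yields one piece of data and two children
    return [
        "data"     => "data of {$entity}",
        "children" => ["{$entity}/a", "{$entity}/b"],
    ];
}

function crawl(array $seedEntities, $maxDeep)
{
    $queue    = [];
    $enqueued = [];   // "already enqueued" flags
    $saved    = [];   // "result already handled" flags

    // init: enqueue entities with deep = 1 and set their flag
    foreach ($seedEntities as $entity) {
        $queue[] = ["deep" => 1, "entity" => $entity];
        $enqueued[$entity] = true;
    }

    while (!empty($queue)) {                      // empty queue => end
        $item   = array_pop($queue);              // dequeue
        $result = fetchEntity($item["entity"]);   // handle entity

        // enqueue newly found entities with deep = deep + 1
        foreach ($result["children"] as $child) {
            if ($item["deep"] + 1 > $maxDeep || isset($enqueued[$child])) {
                continue;
            }
            $queue[] = ["deep" => $item["deep"] + 1, "entity" => $child];
            $enqueued[$child] = true;
        }

        // handle result: save once, then set the "result already handled" flag
        if (!isset($saved[$item["entity"]])) {
            echo "saving: {$result['data']}\n";
            $saved[$item["entity"]] = true;
        }
    }
}

crawl(["seed-1", "seed-2"], 2);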
Crawling strategy (dealing with anti-crawling measures)
When crawling a website, the thing to fear most is getting your IP blocked. Ever had your IP banned? Proxies can only help so much, so the crawling
strategy really matters.
Before crawling, search the Internet for information about the target site and see whether anyone has crawled it before, so you can learn from their experience. Then carefully analyze the site's requests yourself: do the requests carry special parameters? When not logged in, are there relevant cookies? Finally, keep the crawl frequency as low as you can. If the site requires login, you can register a batch of accounts, simulate a successful login, and rotate through them when making requests.
If login also requires a verification code, it gets even more troublesome: you can try logging in manually and then saving the cookie (or, if you are up to it, try OCR recognition). Even after logging in, you still need to keep the points above in mind; being logged in does not mean everything is fine. Some sites will ban an account if it crawls too fast after login. So whenever possible, find an approach that does not require logging in at all, because dealing with banned accounts, registering new ones, and rotating accounts is a hassle.
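As a small illustration of the points above, here is a hedged sketch of fetching a page with a previously saved login cookie and pausing a random interval between requests, using PHP's curl extension. The URL and the cookie value are placeholders.

<?php
// Sketch: request a page with a saved login cookie and sleep a random
// interval afterwards to keep the crawl rate low. Placeholders throughout.
function fetchWithCookie($url, $cookie)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Macintosh; Intel Mac OS X)");
    curl_setopt($ch, CURLOPT_COOKIE, $cookie);   // e.g. "session_id=abc123"
    $content = curl_exec($ch);
    curl_close($ch);
    return $content;
}

$html = fetchWithCookie("https://example.com/user/profile", "session_id=abc123");

// sleep 1-3 seconds before the next request
usleep(rand(1000000, 3000000));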
Crawl data source and depth
Choosing the initial data source is also important. In my case I want to crawl once a day, so I look for a page on the target site that is updated daily; that way the initialization step is fully automatic and I hardly need to manage it, since each crawl
starts automatically from the daily update. The crawl depth also matters. It should be decided according to the specific site, your needs, and what has already been crawled, with the goal of covering as much of the site's data as possible.
Optimization
The first optimization is the queue, which has been changed to a stack-like structure. With the original FIFO queue, entities with a smaller depth were always processed first, so deeper entities kept piling up.
With the stack-like structure, processing works like recursion: one entity is followed all the way down to its full depth before the next entity is processed. For example, with 10 initial entities (deep=1), a maximum crawl depth
of 3, and 10 sub-entities under each entity, the maximum queue lengths are roughly:

original queue (lpush, rpop) => about 1000 items
modified queue (lpush, lpop) => about 28 items

Both approaches produce the same result, but the queue lengths differ greatly, so we switched to the second one. The maximum depth limit is enforced at enqueue time: anything beyond the maximum depth is simply discarded. In addition, the maximum queue length
is also capped to guard against unexpected problems.
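The numbers above can be checked with a small simulation. The sketch below is only an illustration (not part of the original code): it models both strategies with 10 seed entities, 10 children per entity, and a maximum depth of 3, and reports the largest queue length each one reaches.

<?php
// Simulate the maximum queue length for the FIFO (lpush, rpop) and
// LIFO (lpush, lpop) strategies described above.
function maxQueueLen($seeds, $children, $maxDeep, $lifo)
{
    $queue = array_fill(0, $seeds, 1);   // store only the depth of each item
    $max = count($queue);

    while (!empty($queue)) {
        $deep = $lifo ? array_shift($queue) : array_pop($queue); // lpop vs rpop
        if ($deep < $maxDeep) {
            for ($i = 0; $i < $children; $i++) {
                array_unshift($queue, $deep + 1);                // lpush
            }
        }
        $max = max($max, count($queue));
    }
    return $max;
}

echo "FIFO (lpush, rpop): " . maxQueueLen(10, 10, 3, false) . "\n"; // ~1000
echo "LIFO (lpush, lpop): " . maxQueueLen(10, 10, 3, true) . "\n";  // 28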
Code
What follows is a long and rather boring block of code. I originally wanted to put it on
GitHub, but the project felt a bit small, so I decided to post it here directly. I hope readers will point out whatever they find wrong, whether in the code or in the design.
abstract class SpiderBase
{
    /**
     * @var int Start of the random sleep interval (microseconds) between queue items
     */
    public $startMS = 1000000;

    /**
     * @var int End of the random sleep interval (microseconds) between queue items
     */
    public $endMS = 3000000;

    /**
     * @var int Maximum crawl depth
     */
    public $maxDeep = 1;

    /**
     * @var int Maximum queue length, 10000 by default
     */
    public $maxQueueLen = 10000;

    /**
     * @desc Push an entity to be processed onto the queue.
     *       Before inserting, @see isEnqueue is called to check whether the entity
     *       has already been enqueued; only entities that have not been enqueued are inserted.
     *
     * @param $deep   depth of the entity within the crawl
     * @param $entity content of the entity to insert
     * @return bool whether the insert succeeded
     */
    abstract public function enqueue($deep, $entity);

    /**
     * @desc Take one pending entity off the queue.
     *       Example return value (the entity format can be defined freely):
     *       [
     *           "deep" => 3,
     *           "entity" => "balabala"
     *       ]
     *
     * @return array
     */
    abstract public function dequeue();

    /**
     * @desc Get the length of the pending queue.
     *
     * @return int
     */
    abstract public function queueLen();

    /**
     * @desc Check whether the queue can still accept new items.
     *
     * @param $params mixed
     * @return bool
     */
    abstract public function canEnqueue($params);

    /**
     * @desc Check whether a pending entity has already been enqueued.
     *
     * @param $entity the entity
     * @return bool whether it has already been enqueued
     */
    abstract public function isEnqueue($entity);

    /**
     * @desc Mark an entity as already enqueued.
     *
     * @param $entity the entity
     * @return bool whether the flag was set successfully
     */
    abstract public function setEnqueue($entity);

    /**
     * @desc Check whether a unique piece of crawled information has already been saved.
     *
     * @param $entity mixed information used for the check
     * @return bool whether it has already been saved
     */
    abstract public function isSaved($entity);

    /**
     * @desc Mark an object as already saved.
     *
     * @param $entity mixed the key used to mark it as saved
     * @return bool whether the flag was set successfully
     */
    abstract public function setSaved($entity);

    /**
     * @desc Save the crawled content.
     *       Before saving, this checks whether the content has already been saved;
     *       if so it is skipped, unless updating is enabled, in which case the record is updated.
     *
     * @param $uniqInfo mixed the crawled information to save
     * @param $update bool whether to update a record that was already saved
     * @return bool
     */
    abstract public function save($uniqInfo, $update);

    /**
     * @desc Process the content of an entity.
     *       enqueue is called from here.
     *
     * @param $item entity array, @see the return value of dequeue
     * @return
     */
    abstract public function handle($item);

    /**
     * @desc Sleep for a random amount of time.
     *
     * @param $startMS start of the random interval, in microseconds
     * @param $endMS   end of the random interval, in microseconds
     * @return bool
     */
    public function randomSleep($startMS, $endMS)
    {
        $rand = rand($startMS, $endMS);
        usleep($rand);
        return true;
    }

    /**
     * @desc Change the default start of the sleep interval.
     *
     * @param $ms int microseconds
     * @return obj $this
     */
    public function setStartMS($ms)
    {
        $this->startMS = $ms;
        return $this;
    }

    /**
     * @desc Change the default end of the sleep interval.
     *
     * @param $ms int microseconds
     * @return obj $this
     */
    public function setEndMS($ms)
    {
        $this->endMS = $ms;
        return $this;
    }

    /**
     * @desc Set the maximum queue length; anything beyond it is discarded.
     *
     * @param $len int maximum queue length
     */
    public function setMaxQueueLen($len)
    {
        $this->maxQueueLen = $len;
        return $this;
    }

    /**
     * @desc Set the maximum crawl depth.
     *       The depth is checked at enqueue time; entities beyond the maximum depth are not enqueued.
     *
     * @param $maxDeep maximum crawl depth
     * @return obj
     */
    public function setMaxDeep($maxDeep)
    {
        $this->maxDeep = $maxDeep;
        return $this;
    }

    public function run()
    {
        while ($this->queueLen()) {
            $item = $this->dequeue();
            if (empty($item))
                continue;
            $item = json_decode($item, true);
            if (empty($item) || empty($item["deep"]) || empty($item["entity"]))
                continue;
            $this->handle($item);
            $this->randomSleep($this->startMS, $this->endMS);
        }
    }

    /**
     * @desc Fetch the content of a URL via curl.
     *
     * @param $url string the URL
     * @param $curlOptions array curl options
     * @return mixed
     */
    public function getContent($url, $curlOptions = [])
    {
        $ch = curl_init();
        curl_setopt_array($ch, $curlOptions);
        curl_setopt($ch, CURLOPT_URL, $url);
        if (!isset($curlOptions[CURLOPT_HEADER]))
            curl_setopt($ch, CURLOPT_HEADER, 0);
        if (!isset($curlOptions[CURLOPT_RETURNTRANSFER]))
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        if (!isset($curlOptions[CURLOPT_USERAGENT]))
            curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Macintosh; Intel Mac");
        $content = curl_exec($ch);
        if ($errorNo = curl_errno($ch)) {
            $errorInfo = curl_error($ch);
            echo "curl error : errorNo[{$errorNo}], errorInfo[{$errorInfo}]\n";
            curl_close($ch);
            return false;
        }
        $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_close($ch);
        if (200 != $httpCode) {
            echo "http code error : {$httpCode}, $url, [$content]\n";
            return false;
        }
        return $content;
    }
}

abstract class RedisDbSpider extends SpiderBase
{
    protected $queueName = "";
    protected $isQueueName = "";
    protected $isSaved = "";
    protected $objRedis = null;
    protected $objDb = null;

    public function __construct($objRedis = null, $objDb = null, $configs = [])
    {
        $this->objRedis = $objRedis;
        $this->objDb = $objDb;
        foreach ($configs as $name => $value) {
            if (isset($this->$name)) {
                $this->$name = $value;
            }
        }
    }

    public function enqueue($deep, $entities)
    {
        if (!$this->canEnqueue(["deep" => $deep]))
            return true;
        if (is_string($entities)) {
            if ($this->isEnqueue($entities))
                return true;
            $item = [
                "deep" => $deep,
                "entity" => $entities
            ];
            $this->objRedis->lpush($this->queueName, json_encode($item));
            $this->setEnqueue($entities);
        } else if (is_array($entities)) {
            foreach ($entities as $key => $entity) {
                if ($this->isEnqueue($entity))
                    continue;
                $item = [
                    "deep" => $deep,
                    "entity" => $entity
                ];
                $this->objRedis->lpush($this->queueName, json_encode($item));
                $this->setEnqueue($entity);
            }
        }
        return true;
    }

    public function dequeue()
    {
        $item = $this->objRedis->lpop($this->queueName);
        return $item;
    }

    public function isEnqueue($entity)
    {
        $ret = $this->objRedis->hexists($this->isQueueName, $entity);
        return $ret ? true : false;
    }

    public function canEnqueue($params)
    {
        $deep = $params["deep"];
        if ($deep > $this->maxDeep) {
            return false;
        }
        $len = $this->objRedis->llen($this->queueName);
        return $len < $this->maxQueueLen ? true : false;
    }

    public function setEnqueue($entity)
    {
        $ret = $this->objRedis->hset($this->isQueueName, $entity, 1);
        return $ret ? true : false;
    }

    public function queueLen()
    {
        $ret = $this->objRedis->llen($this->queueName);
        return intval($ret);
    }

    public function isSaved($entity)
    {
        $ret = $this->objRedis->hexists($this->isSaved, $entity);
        return $ret ? true : false;
    }

    public function setSaved($entity)
    {
        $ret = $this->objRedis->hset($this->isSaved, $entity, 1);
        return $ret ? true : false;
    }
}

class Test extends RedisDbSpider
{
    /**
     * @desc Constructor: set the redis and db instances as well as the queue parameters.
     */
    public function __construct($redis, $db)
    {
        $configs = [
            "queueName" => "spider_queue:zhihu",
            "isQueueName" => "spider_is_queue:zhihu",
            "isSaved" => "spider_is_saved:zhihu",
            "maxQueueLen" => 10000
        ];
        parent::__construct($redis, $db, $configs);
    }

    public function handle($item)
    {
        $deep = $item["deep"];
        $entity = $item["entity"];
        echo "start crawling user [{$entity}]\n";
        echo "save the crawled data\n";
        echo "enqueue the next level of depth\n";
        echo "finished crawling user [{$entity}]\n";
    }

    public function save($addUsers, $update)
    {
        echo "saved successfully\n";
    }
}
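To round things off, here is a hedged example of how the Test class above might be wired up and run. It assumes the phpredis extension is installed and a Redis server is reachable on localhost; the $db argument is passed as null because this sample handle() only echoes, and the seed user names are placeholders.

<?php
// Hypothetical usage of the Test spider above: connect to Redis with the
// phpredis extension, seed the queue with a few entities at depth 1, and run.
$redis = new Redis();
$redis->connect("127.0.0.1", 6379);

$spider = new Test($redis, null);       // no DB handle needed for this demo
$spider->setMaxDeep(3)
       ->setMaxQueueLen(10000)
       ->setStartMS(1000000)            // sleep between 1 and 3 seconds
       ->setEndMS(3000000);

// seed entities (placeholder user names) enter the queue at depth 1
$spider->enqueue(1, ["user-a", "user-b", "user-c"]);

$spider->run();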