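Scraping Baidu Tieba images with a PHP crawler
I recently needed to batch-download images from Baidu Tieba, that is, to download every image from a given tieba (forum).
I originally planned to write this in Python, but I'm not familiar with it; I tried minidom, HtmlParser and others and couldn't get comfortable with them, so I went with PHP, which I know better.
Here is the source code: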
<?php
// Maximum run time, in seconds
@set_time_limit(60);
// Tieba name, URL-encoded (this value appears to be the GBK encoding of "图片")
$tbname = "%CD%BC%C6%AC";
// Crawl mode: 0 = by thread order, 1 = by picture order
$type = 0;
// List page URL template
$listurltpl = "http://tieba.baidu.com/f?kw=%s" . ($type ? "&tp=1" : "&pn=");
// Gallery page URL template
$galleryurltpl = "http://tieba.baidu.com/photo/bw/picture/guide?kw=%s&tid=%s&next=9999";
// Image URL template
$imageurltpl = "http://imgsrc.baidu.com/forum/pic/item/%s.jpg";
// Local save directory
$savepath = "h:/images/";
// Per-thread subdirectory
$filedirtpl = $savepath . "%s/";
// Image file path
$filenametpl = $savepath . "%s/%s.jpg";

$listurlbase = sprintf($listurltpl, $tbname);
// Starting offset
$pn = 0;
while (1)
{
    // Build the URL fresh on every pass; appending $pn to the same
    // string each time would produce offsets like "050", "050100", ...
    $listurl = $type ? $listurlbase : $listurlbase . $pn;
    // Fetch the list page source
    $listhtml = file_get_contents($listurl);
    if ($listhtml === false) break;
    // Match the thread ids
    if ($type)
        preg_match_all('/<div class="aep_wrapper" id="pic_item_(\d+)" tid="\d+">/', $listhtml, $m1);
    else
        preg_match_all('/<ul class="threadlist_media j_threadlist_media" id="fm(\d+)"/', $listhtml, $m1);
    // List of thread ids
    $tidlist = $m1[1];
    // Stop when a page yields no threads (end of the list or a changed layout)
    if (!$tidlist) break;
    echo "Fetching ... <br /> \r\n";
    foreach ($tidlist as $tid)
    {
        echo "--Gallery $tid <br /> \r\n";
        $galleryurl = sprintf($galleryurltpl, $tbname, $tid);
        // Fetch the source of the thread's gallery page
        $galleryhtml = file_get_contents($galleryurl);
        if ($galleryhtml === false) continue;
        // Match the picture ids
        preg_match_all('/\{"original":\{"id":"(\w+)"/', $galleryhtml, $m2);
        // List of picture ids
        $pidlist = $m2[1];
        foreach ($pidlist as $pid)
        {
            echo "----Picture {$tid}/{$pid}.jpg ";
            $filedir = sprintf($filedirtpl, $tid);
            $filename = sprintf($filenametpl, $tid, $pid);
            // Skip files that already exist
            if (!is_file($filename))
            {
                $imageurl = sprintf($imageurltpl, $pid);
                // Download the image
                $imagebin = file_get_contents($imageurl);
                if ($imagebin === false)
                {
                    echo "Failed! <br />\r\n";
                    continue;
                }
                // Create the directory (and any missing parents) if needed
                if (!is_dir($filedir))
                    mkdir($filedir, 0777, true);
                // Save the image
                file_put_contents($filename, $imagebin);
                $rnd = rand(2000, 5000);
                echo "Downloaded! ";
                // Pause for 2 to 5 seconds; sleep() only accepts whole seconds,
                // so use usleep() with microseconds
                usleep($rnd * 1000);
                echo "Sleep $rnd ms <br />\r\n";
            }
            else
                echo "Existed! <br />\r\n";
        }
    }
    // Mode 1 has no paging parameter, so one pass is all there is
    if ($type) break;
    // Move on to the next page
    $pn += 50;
}
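The two steps the script relies on most, building the list-page URL for a given offset and pulling thread ids out of the HTML, can be tried in isolation without touching the network. The following is a minimal sketch: the helper names `buildListUrl` and `extractThreadIds` are hypothetical, and the sample HTML is a made-up fragment shaped like the markup the original pattern expects, not a real Tieba response.

```php
<?php
// Hypothetical helpers for illustration; the names are not from the original script.

// Build the list-page URL for a given offset (mirrors the $type == 0 template).
function buildListUrl(string $tbname, int $pn): string
{
    return sprintf("http://tieba.baidu.com/f?kw=%s&pn=%d", $tbname, $pn);
}

// Extract thread ids from list-page HTML using the script's $type == 0 pattern.
function extractThreadIds(string $html): array
{
    preg_match_all('/<ul class="threadlist_media j_threadlist_media" id="fm(\d+)"/', $html, $m);
    return $m[1];
}

// Made-up fragment shaped like the markup the pattern targets (an assumption).
$sample = '<ul class="threadlist_media j_threadlist_media" id="fm123456">'
        . '<li></li></ul>'
        . '<ul class="threadlist_media j_threadlist_media" id="fm789">';

print_r(extractThreadIds($sample));
echo buildListUrl("%CD%BC%C6%AC", 50), "\n";
```

Building the URL from the base and the offset on every call keeps each page's address independent of the previous one, which is what the paging loop needs.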
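Test run:
The program basically does the job, but after it has been fetching images for a long time, Baidu starts showing a CAPTCHA. At that point, redialing the modem gives you a new IP address and the crawl can continue.
(For learning and reference only; please do not use this for anything illegal.)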
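Once the CAPTCHA appears, a download may come back as something other than image data, and the script would save it under a .jpg name anyway. One way to notice this is to check the JPEG start-of-image marker before writing the file. This is only a sketch: it assumes the verification response is not a JPEG (the post does not show what Baidu actually returns), and `looksLikeJpeg` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: true when the bytes begin with the JPEG
// start-of-image marker (FF D8 FF).
function looksLikeJpeg(string $bin): bool
{
    return strncmp($bin, "\xFF\xD8\xFF", 3) === 0;
}

// Stand-in payloads for illustration; neither is a real Baidu response.
$fakeJpeg = "\xFF\xD8\xFF\xE0" . "rest of image data";
$fakeHtml = "<html><body>verification required</body></html>";

var_dump(looksLikeJpeg($fakeJpeg)); // bool(true)
var_dump(looksLikeJpeg($fakeHtml)); // bool(false)
```

In the download loop, such a check could run on `$imagebin` right before `file_put_contents`, stopping the crawl instead of saving a non-image response.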
