PHP script for crawling website images
Method one:
<code><?php
// The script prints the image URLs it finds, so send plain text rather than image/jpeg.
header("Content-Type: text/plain; charset=utf-8");

class download_image
{
    // Read the remote page line by line and return it as one string.
    function read_url($str)
    {
        $file = fopen($str, "r");
        if ($file === false) {
            return '';
        }
        $result = '';
        while (!feof($file)) {
            $result .= fgets($file, 9999);
        }
        fclose($file);
        return $result;
    }

    // Find the <img> tags on the page and download each image.
    function save_img($str)
    {
        $result = $this->read_url($str);
        // Strip double and single quotes so the pattern below sees bare URLs.
        $result = str_replace("\"", "", $result);
        $result = str_replace("'", "", $result);
        // NOTE: this only matches tags where src is the first attribute
        // and the URL starts with http://.
        preg_match_all('/<img\ssrc=(http:\/\/.*?)(\s(.*?)>|>)/i', $result, $matches);
        foreach ($matches[1] as $value) {
            echo $value . "\n";
            $this->GrabImage($value);
        }
    }

    // $url is the full URL of the remote image and must not be empty.
    // $filename is optional: if it is empty, the local file name is
    // generated automatically from the current date and time.
    function GrabImage($url, $filename = "")
    {
        if ($url == "") {
            return false;
        }
        $path = "download/"; // target folder
        // Create the folder if it does not exist.
        if (!file_exists($path)) {
            mkdir($path);
        }
        if ($filename == "") {
            $ext = strrchr($url, ".");
            // Only .gif and .jpg images are saved.
            if ($ext != ".gif" && $ext != ".jpg") {
                return false;
            }
            // uniqid() keeps names distinct when several images are saved in the same second.
            $filename = $path . date("YmdHis") . "_" . uniqid() . $ext;
        }
        $img = @file_get_contents($url);
        if ($img === false) {
            return false;
        }
        // "wb" overwrites instead of appending, so one file never holds two images.
        $fp2 = @fopen($filename, "wb");
        if ($fp2 === false) {
            return false;
        }
        fwrite($fp2, $img);
        fclose($fp2);
        return $filename;
    }
}

$download_img = new download_image();
$download_img->save_img("http://www.jb51.net");
?></code>
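Method one is simple and easy to follow, but it has a bug: it does not capture every image on the page. Likely contributors are the regular expression, which only matches img tags whose src attribute comes first and holds an absolute http:// URL, and the extension check, which skips everything except .gif and .jpg. This still needs a closer look.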
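One way around a strict pattern is to let PHP's DOM extension parse the markup instead of matching it with a regular expression. The sketch below is not part of the original scripts: the function name extract_image_urls is hypothetical, it assumes the DOM extension is enabled, and it only collects src values without downloading anything.

```php
<?php
// Hypothetical helper (not from the original article): collect the src of
// every <img> tag in an HTML string, whatever the attribute order or quoting.
function extract_image_urls(string $html): array
{
    if (trim($html) === '') {
        return []; // DOMDocument::loadHTML() rejects an empty string
    }
    $doc = new DOMDocument();
    // Real-world markup is rarely valid; keep libxml warnings out of the output.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $urls = [];
    foreach ($doc->getElementsByTagName('img') as $img) {
        $src = trim($img->getAttribute('src'));
        if ($src !== '') {
            $urls[] = $src;
        }
    }
    // Drop repeated images while keeping document order.
    return array_values(array_unique($urls));
}

$html = '<p><img class="logo" src="/static/logo.png">'
      . '<img src=\'http://example.com/a.jpg\' alt="a"></p>';
print_r(extract_image_urls($html));
```

The returned values are still raw attribute values, so relative paths such as /static/logo.png would need to be turned into absolute URLs before they can be downloaded.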
Method two:
<code><?php
class download_image
{
    // Directory where downloaded images are saved
    public $save_path;
    // Size limit in bytes: only images larger than this are saved
    public $img_size = 0;
    // Static array of links that were already crawled, to avoid crawling them again
    public static $a_url_arr = array();

    /**
     * @param String $save_path Directory where downloaded images are saved
     * @param Int $img_size Minimum image size in bytes
     */
    public function __construct($save_path, $img_size)
    {
        $this->save_path = $save_path;
        $this->img_size = $img_size;
        if (!file_exists($save_path)) {
            mkdir($save_path, 0775);
        }
    }

    /**
     * Recursively download the images of the home page and its subpages
     * @param String $capture_url URL of the page to crawl
     */
    public function recursive_download_images($capture_url)
    {
        if (!in_array($capture_url, self::$a_url_arr)) { // not crawled yet
            self::$a_url_arr[] = $capture_url; // record it in the static array
        } else { // already crawled, leave the function
            return;
        }
        $this->download_current_page_images($capture_url); // download all images on the current page
        // @ suppresses the warning raised when the URL cannot be read
        $content = @file_get_contents($capture_url);
        // Matches the part of an <a> tag's href attribute before any "?"
        $a_pattern = "|<a[^>]+href=['\" ]?([^ '\"?]+)['\" >]|U";
        preg_match_all($a_pattern, $content, $a_out, PREG_SET_ORDER);
        $tmp_arr = array(); // links found on this page that still need crawling
        foreach ($a_out as $k => $v) {
            /**
             * Skip empty links, '#', '/' and links that were already crawled:
             * 1: a link equal to the current page URL would cause an endless loop
             * 2: '', '#' and '/' also point to the current page and would loop as well
             * 3: the same link often appears several times on one page; without this
             *    check the same subpage would be downloaded repeatedly
             */
            if ($v[1] && !in_array($v[1], self::$a_url_arr) && !in_array($v[1], array('#', '/', $capture_url))) {
                $tmp_arr[] = $v[1];
            }
        }
        foreach ($tmp_arr as $k => $v) {
            $a_url = $this->absolute_url($capture_url, $v); // address of the linked page
            $this->recursive_download_images($a_url);
        }
    }

    /**
     * Turn a link found on a page into an absolute URL
     * @param String $capture_url URL of the page the link was found on
     * @param String $url Value of an href or src attribute
     * @return String Absolute URL
     */
    public function absolute_url($capture_url, $url)
    {
        // A link that starts with http:// or https:// can be used directly
        if (preg_match('#^https?://#i', $url)) {
            return $url;
        }
        // Otherwise it is a relative address: rebuild it from the site root
        $slash = strpos($capture_url, '/', 8);
        // Without a path (e.g. http://www.jb51.net) the whole URL is the site root
        $domain_url = ($slash === false) ? $capture_url . '/' : substr($capture_url, 0, $slash + 1);
        return $domain_url . ltrim($url, '/');
    }

    /**
     * Download all images on the current page
     * @param String $capture_url URL of the page to crawl
     */
    public function download_current_page_images($capture_url)
    {
        $content = @file_get_contents($capture_url); // suppress the warning
        // Matches the part of an <img> tag's src attribute before any "?"
        $img_pattern = "|<img[^>]+src=['\" ]?([^ '\"?]+)['\" >]|U";
        preg_match_all($img_pattern, $content, $img_out, PREG_SET_ORDER);
        $photo_num = count($img_out); // number of images matched
        echo $capture_url . ": found " . $photo_num . " images\n";
        foreach ($img_out as $k => $v) {
            $this->save_one_img($capture_url, $v[1]);
        }
    }

    /**
     * Save a single image
     * @param String $capture_url URL of the page the image was found on
     * @param String $img_url URL of the image to save
     */
    public function save_one_img($capture_url, $img_url)
    {
        $img_url = $this->absolute_url($capture_url, $img_url); // address of the image
        $pathinfo = pathinfo($img_url); // path information of the image
        $pic_name = $pathinfo['basename']; // file name of the image
        if (file_exists($this->save_path . $pic_name)) { // the file exists, so it was downloaded before
            echo $img_url . ' has already been downloaded!' . "\n";
            return;
        }
        // Read the image into a string; @ suppresses the warning if the URL cannot be read
        $img_data = @file_get_contents($img_url);
        if ($img_data !== false && strlen($img_data) > $this->img_size) { // only save images above the size limit
            $img_size = file_put_contents($this->save_path . $pic_name, $img_data);
            if ($img_size) {
                echo $img_url . ' saved successfully!' . "\n";
            } else {
                echo $img_url . ' could not be saved!' . "\n";
            }
        } else {
            echo $img_url . ' could not be read!' . "\n";
        }
    }
}

set_time_limit(120); // maximum execution time of the script; adjust as needed
$download_img = new download_image('images/', 0); // create the image downloader
//$download_img->recursive_download_images('http://www.oschina.net/'); // crawl images recursively
//$download_img->download_current_page_images($_POST['capture_url']); // crawl only the current page
$download_img->download_current_page_images('http://www.jb51.net'); // crawl only the current page
?></code>
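The recursive method in method two is described as crawling the home page and its subpages, yet it follows every link it finds, including links to other sites. A possible guard is sketched below. It is not part of the original class: the name is_same_host is hypothetical, the check relies on PHP's parse_url(), and non-HTTP schemes such as javascript: are not handled.

```php
<?php
// Hypothetical helper (not part of the original class): report whether $url
// points at the same host as $base_url, so a crawler can skip off-site links.
function is_same_host(string $url, string $base_url): bool
{
    $base_host = parse_url($base_url, PHP_URL_HOST);
    if (!is_string($base_host)) {
        return false; // the base URL itself has no recognisable host
    }
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        // No host component: a relative link such as "/news/1.html"
        return true;
    }
    // Host names are case-insensitive
    return strcasecmp($host, $base_host) === 0;
}

var_dump(is_same_host('/news/1.html', 'http://www.jb51.net/'));
var_dump(is_same_host('http://www.jb51.net/a.html', 'http://www.jb51.net/'));
var_dump(is_same_host('http://www.oschina.net/', 'http://www.jb51.net/'));
```

Calling such a check before each recursive call would keep the crawl on the starting site, which also makes the 120-second time limit less likely to be spent on unrelated pages.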
http://blog.csdn.net/china_skag/article/details/18452883
http://www.jb51.net/article/21738.htm
Copyright Statement: Please keep the article signature and link when reprinting
This article presented two PHP scripts for grabbing website images. I hope it is helpful to readers who are interested in PHP.
