PHP script for crawling website images

WBOY

Jul 29, 2016 am 09:14 AM


Method one:

<code><?php
// The script prints the image URLs it finds, so the output is plain text.
header("Content-Type: text/plain; charset=utf-8");

class download_image
{
    // Read the whole page at $str into a string.
    function read_url($str)
    {
        $file = @fopen($str, "r");
        if ($file === false) {
            return '';
        }
        $result = '';
        while (!feof($file)) {
            $result .= fgets($file, 9999);
        }
        fclose($file);
        return $result;
    }

    // Find every absolute <img> URL on the page and download it.
    function save_img($str)
    {
        $result = $this->read_url($str);
        // Strip quotes so that src values are delimited by whitespace or ">".
        $result = str_replace("\"", "", $result);
        $result = str_replace("'", "", $result);

        // [^>]*? lets src appear after other attributes such as alt or class.
        preg_match_all('/<img\s[^>]*?src=(https?:\/\/[^\s>]+)/i', $result, $matches);

        foreach ($matches[1] as $value) {
            echo $value . "\n";
            $this->GrabImage($value);
        }
    }

    // $url is the full URL of the remote image and must not be empty.
    // $filename is optional: if it is empty, the local file name is
    // generated automatically from the date and time.
    function GrabImage($url, $filename = "")
    {
        if ($url == "") {
            return false;
        }
        $path = "download/"; // storage folder; created if it does not exist
        if (!file_exists($path)) {
            mkdir($path);
        }

        if ($filename == "") {
            $ext = strtolower(strrchr($url, "."));
            if ($ext != ".gif" && $ext != ".jpg") {
                return false;
            }
            // uniqid() keeps images fetched within the same second from
            // being written to the same file.
            $filename = $path . date("YmdHis") . "_" . uniqid() . $ext;
        }

        $img = @file_get_contents($url);
        if ($img === false) {
            return false;
        }
        file_put_contents($filename, $img);

        return $filename;
    }
}
$download_img = new download_image();
$download_img->save_img("http://www.jb51.net");
?></code>


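Method one is simple and easy to follow, but it has a bug: not every image on the page is captured. For one thing, it only considers <img> tags whose src is an absolute URL ending in .gif or .jpg, so relative paths and other formats such as .png are skipped. This still needs a closer look.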
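One way to make the extraction step less fragile is to parse the page instead of matching it with a regular expression. The sketch below is only an illustration and is not part of the original script: the function name `extract_img_srcs` is hypothetical, and it assumes PHP's bundled DOM extension is enabled. It collects the `src` of every `<img>` tag regardless of attribute order or quoting style.

```php
<?php
// Hypothetical helper, not part of the original script: collect the src of
// every <img> tag in an HTML string using PHP's DOM extension.
function extract_img_srcs(string $html): array
{
    if ($html === '') {
        return [];
    }
    $doc = new DOMDocument();
    // Real-world pages are rarely valid HTML; collect parser warnings
    // internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $srcs = [];
    foreach ($doc->getElementsByTagName('img') as $img) {
        $src = trim($img->getAttribute('src'));
        // Skip empty values and URLs that were already seen.
        if ($src !== '' && !in_array($src, $srcs, true)) {
            $srcs[] = $src;
        }
    }
    return $srcs;
}

// Example input: src after other attributes, single quotes, and a duplicate.
$html = '<p><img alt="logo" class="x" src="/img/logo.png">'
      . '<img src=\'http://example.com/a.jpg\'>'
      . '<img src="/img/logo.png"></p>';
print_r(extract_img_srcs($html));
```

The returned values are exactly what the page contains, so relative paths such as `/img/logo.png` still have to be turned into absolute URLs before they can be downloaded.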

Method two:

<code><?php
class download_image
{
    // Directory where downloaded images are saved
    public $save_path;
    // Size limit in bytes: only images larger than this are saved
    public $img_size = 0;
    // Static array recording the page URLs already crawled, to avoid crawling them twice
    public static $a_url_arr = array();

    /**
     * @param String $save_path    Directory where downloaded images are saved
     * @param Int    $img_size     Size limit in bytes
     */
    public function __construct($save_path, $img_size)
    {
        $this->save_path = $save_path;
        $this->img_size = $img_size;
        if (!file_exists($save_path)) {
            mkdir($save_path, 0775);
        }
    }

    /**
     * Recursively download the images of the start page and its sub-pages
     * @param   String  $capture_url  URL of the page to crawl
     */
    public function recursive_download_images($capture_url)
    {
        if (!in_array($capture_url, self::$a_url_arr)) {   // not crawled yet
            self::$a_url_arr[] = $capture_url;             // record it in the static array
        } else {                                           // already crawled: leave the function
            return;
        }
        $this->download_current_page_images($capture_url);  // download all images on the current page
        // @ suppresses the warning raised when the URL cannot be read
        $content = (string) @file_get_contents($capture_url);
        // Match the part of each <a> tag's href attribute that precedes "?"
        $a_pattern = "|<a[^>]+href=['\" ]?([^ '\"?]+)['\" >]|U";
        preg_match_all($a_pattern, $content, $a_out, PREG_SET_ORDER);
        $tmp_arr = array();  // links found on this page that still have to be crawled
        foreach ($a_out as $k => $v) {
            /**
             * Skip links that are empty, '#', '/' or duplicates:
             * 1: a link equal to the URL of the current page would cause an infinite loop;
             * 2: '', '#' and '/' also point at the current page and would loop as well;
             * 3: a link may appear several times on one page; without this check the
             *    same sub-page would be downloaded repeatedly.
             */
            if ($v[1] && !in_array($v[1], self::$a_url_arr) && !in_array($v[1], $tmp_arr)
                && !in_array($v[1], array('#', '/', $capture_url))) {
                $tmp_arr[] = $v[1];
            }
        }
        foreach ($tmp_arr as $k => $v) {
            $this->recursive_download_images($this->absolute_url($capture_url, $v));
        }
    }

    /**
     * Turn a link found on $capture_url into an absolute URL
     */
    public function absolute_url($capture_url, $url)
    {
        if (preg_match('#^https?://#i', $url)) {   // already absolute: usable as is
            return $url;
        }
        // Otherwise treat it as relative to the site root and rebuild the address.
        // strpos() returns false when the page URL has no "/" after the host.
        $pos = strpos($capture_url, '/', 8);
        $domain_url = ($pos === false) ? $capture_url . '/' : substr($capture_url, 0, $pos + 1);
        return $domain_url . ltrim($url, '/');
    }

    /**
     * Download all images on the current page
     * @param   String  $capture_url  URL of the page to crawl
     */
    public function download_current_page_images($capture_url)
    {
        $content = (string) @file_get_contents($capture_url);   // suppress warnings
        // Match the part of each <img> tag's src attribute that precedes "?"
        $img_pattern = "|<img[^>]+src=['\" ]?([^ '\"?]+)['\" >]|U";
        preg_match_all($img_pattern, $content, $img_out, PREG_SET_ORDER);
        $photo_num = count($img_out);   // number of images matched
        echo $capture_url . ": found " . $photo_num . " images\n";
        foreach ($img_out as $k => $v) {
            $this->save_one_img($capture_url, $v[1]);
        }
    }

    /**
     * Save a single image
     * @param String $capture_url   URL of the page being crawled
     * @param String $img_url       URL of the image to save
     */
    public function save_one_img($capture_url, $img_url)
    {
        $img_url = $this->absolute_url($capture_url, $img_url);
        $pathinfo = pathinfo($img_url);    // path information of the image
        $pic_name = $pathinfo['basename']; // file name of the image
        if (file_exists($this->save_path . $pic_name)) {  // file exists, so it was downloaded before
            echo $img_url . ' has already been downloaded!' . "\n";
            return;
        }
        // Read the image into a string; @ suppresses the warning for unreadable URLs
        $img_data = (string) @file_get_contents($img_url);
        if (strlen($img_data) > $this->img_size) {   // only save images larger than the limit
            $img_size = file_put_contents($this->save_path . $pic_name, $img_data);
            if ($img_size) {
                echo $img_url . ' saved successfully!' . "\n";
            } else {
                echo $img_url . ' could not be saved!' . "\n";
            }
        } else {
            echo $img_url . ' could not be read!' . "\n";
        }
    }
}
set_time_limit(120);     // maximum execution time of the script; adjust as needed
$download_img = new download_image('images/', 0);   // create the downloader object
//$download_img->recursive_download_images('http://www.oschina.net/');      // crawl images recursively
//$download_img->download_current_page_images($_POST['capture_url']);     // crawl the current page only
$download_img->download_current_page_images('http://www.jb51.net');     // crawl the current page only
?></code>
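Method two builds absolute URLs by prefixing the site root to any link that is not already absolute. A somewhat more general resolution step could look like the sketch below. It is an illustration only: the function name `resolve_url` is hypothetical and not part of the class above, and the logic is deliberately simplified (it ignores `../` segments, query-only links and `<base>` tags). It relies on PHP's built-in `parse_url()`.

```php
<?php
// Hypothetical helper, not part of the original class: turn a link found on
// $page_url into an absolute URL.
function resolve_url(string $page_url, string $link): string
{
    // Already absolute (http or https): use as is.
    if (preg_match('#^https?://#i', $link)) {
        return $link;
    }
    $parts = parse_url($page_url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return $link; // not an absolute page URL; nothing to resolve against
    }
    $root = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $root .= ':' . $parts['port'];
    }
    // Protocol-relative link such as //cdn.example.com/a.png
    if (strpos($link, '//') === 0) {
        return $parts['scheme'] . ':' . $link;
    }
    // Root-relative link such as /img/a.png
    if (strpos($link, '/') === 0) {
        return $root . $link;
    }
    // Path-relative link: resolve against the directory of the current page.
    $path = $parts['path'] ?? '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $root . $dir . $link;
}

echo resolve_url('http://www.jb51.net', 'img/a.png'), "\n";
echo resolve_url('http://example.com/news/index.html', 'pic/b.jpg'), "\n";
```

Unlike a fixed-offset `strpos()` search, this also works when the page URL has no slash after the host, as in `http://www.jb51.net`.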


http://blog.csdn.net/china_skag/article/details/18452883
http://www.jb51.net/article/21738.htm

