
Fixing garbled web pages fetched with file_get_contents (PHP tutorial)

Jul 13, 2016 am 10:34 AM
file_get_contents gzip

Pages fetched with the file_get_contents() function sometimes come back garbled. There are two common causes: a character-encoding mismatch, or Gzip compression enabled on the target page.

Encoding problems are easy to handle: convert the fetched content to the encoding you need, for example $content = iconv("GBK", "UTF-8//IGNORE", $content);. This article focuses on fetching pages that are served with Gzip turned on. How can you tell? If the response headers contain Content-Encoding: gzip, the body is Gzip-compressed. You can use Firebug to check whether a page has Gzip enabled. Below is the header information for my blog as shown in Firebug; Gzip is turned on.

Request headers (raw header information)
Accept	text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Encoding	gzip, deflate
Accept-Language	zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3
Connection	keep-alive
Cookie	__utma=225240837.787252530.1317310581.1335406161.1335411401.1537; __utmz=225240837.1326850415.887.3.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=%E4%BB%BB%E4%BD%95%E9%A1%B9%E7%9B%AE%E9%83%BD%E4%B8%8D%E4%BC%9A%E9%82%A3%E4%B9%88%E7%AE%80%E5%8D%95%20site%3Awww.bkjia.com; PHPSESSID=888mj4425p8s0m7s0frre3ovc7; __utmc=225240837; __utmb=225240837.1.10.1335411401
Host	www.bkjia.com
User-Agent	Mozilla/5.0 (Windows NT 5.1; rv:12.0) Gecko/20100101 Firefox/12.0
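The same check can be done in code instead of in the browser. The following is a minimal sketch, not the author's method: response_is_gzipped is a hypothetical helper that scans an array of raw response header lines for a Content-Encoding value containing gzip.

```php
<?php
// Hypothetical helper: returns true when any header line declares
// a Content-Encoding that includes "gzip" (case-insensitive).
function response_is_gzipped(array $headerLines): bool
{
    foreach ($headerLines as $line) {
        if (preg_match('/^Content-Encoding:\s*(.+)$/i', $line, $m)
            && stripos($m[1], 'gzip') !== false) {
            return true;
        }
    }
    return false;
}

// Example header lines, made up for illustration.
$sample = array('HTTP/1.1 200 OK', 'Content-Encoding: gzip');
var_dump(response_is_gzipped($sample));
```

After a file_get_contents() call on an http:// URL, PHP exposes the response header lines in the local variable $http_response_header, which could be passed to this helper.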

Here are some solutions:

1. Use the built-in zlib library

If the server has the zlib extension installed, the following code solves the garbled-output problem easily.

$data = file_get_contents("compress.zlib://".$url); 
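To see what the compress.zlib:// wrapper does without touching the network, here is a small sketch that uses a local temporary file in place of a remote URL. The file contents and variable names are made up for illustration; the zlib extension is assumed to be loaded.

```php
<?php
// Write a gzip-compressed payload to a temporary file.
$path = tempnam(sys_get_temp_dir(), 'gz');
file_put_contents($path, gzencode('<html>hello</html>'));

// A plain read returns the compressed bytes (they start with 0x1f 0x8b).
$raw = file_get_contents($path);

// Reading through the wrapper returns the decompressed content.
$decoded = file_get_contents('compress.zlib://' . $path);

unlink($path);
echo $decoded, "\n";
```

The wrapper is applied the same way to a URL, as in the line above: the prefix is simply prepended to the address being read.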

2. Use CURL instead of file_get_contents

function curl_get($url, $gzip=false){
	$curl = curl_init($url);
	curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
	curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10);
	if($gzip) curl_setopt($curl, CURLOPT_ENCODING, "gzip"); // this is the key line
	$content = curl_exec($curl);
	curl_close($curl);
	return $content;
}
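A possible variation, sketched under the assumption that the curl extension is loaded: setting CURLOPT_ENCODING to an empty string asks cURL to advertise every encoding it supports and to decode the response automatically, so the caller does not need a $gzip flag. The names build_curl_options and curl_fetch are hypothetical.

```php
<?php
// Hypothetical helper: the option set used for every request.
function build_curl_options(int $timeoutSeconds = 10): array
{
    return array(
        CURLOPT_RETURNTRANSFER => true,            // return the body instead of printing it
        CURLOPT_CONNECTTIMEOUT => $timeoutSeconds, // connection timeout in seconds
        CURLOPT_ENCODING       => '',              // empty string: accept all supported encodings
    );
}

// Hypothetical helper: fetches a URL and returns the decoded body, or null on failure.
function curl_fetch(string $url): ?string
{
    $curl = curl_init($url);
    curl_setopt_array($curl, build_curl_options());
    $content = curl_exec($curl);
    return $content === false ? null : $content;
}
```

Whether to pass "gzip" explicitly, as in the function above, or an empty string is a judgment call; the empty string also covers servers that answer with deflate.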

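3. Use a gzip decompression function

PHP 5.4 and later already provide a built-in gzdecode() in the zlib extension, so define the function below only on older versions; redeclaring it on a newer version is a fatal error.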

function gzdecode($data) { 
  $len = strlen($data); 
  if ($len < 18 || strcmp(substr($data,0,2),"\x1f\x8b")) { 
    return null;  // Not GZIP format (See RFC 1952) 
  } 
  $method = ord(substr($data,2,1));  // Compression method 
  $flags  = ord(substr($data,3,1));  // Flags 
  if (($flags & 31) != $flags) { 
    // Reserved bits are set -- NOT ALLOWED by RFC 1952 
    return null; 
  } 
  // NOTE: $mtime may be negative (PHP integer limitations) 
  $mtime = unpack("V", substr($data,4,4)); 
  $mtime = $mtime[1]; 
  $xfl   = substr($data,8,1);  // Extra flags (byte 8) 
  $os    = substr($data,9,1);  // Operating system (byte 9) 
  $headerlen = 10; 
  $extralen  = 0; 
  $extra     = ""; 
  if ($flags & 4) { 
    // 2-byte length prefixed EXTRA data in header 
    if ($len - $headerlen - 2 < 8) { 
      return false;    // Invalid format 
    } 
    $extralen = unpack("v",substr($data,10,2));  // XLEN follows the 10-byte fixed header 
    $extralen = $extralen[1]; 
    if ($len - $headerlen - 2 - $extralen < 8) { 
      return false;    // Invalid format 
    } 
    $extra = substr($data,12,$extralen); 
    $headerlen += 2 + $extralen; 
  } 

  $filenamelen = 0; 
  $filename = ""; 
  if ($flags & 8) { 
    // C-style string file NAME data in header 
    if ($len - $headerlen - 1 < 8) { 
      return false;    // Invalid format 
    } 
    $filenamelen = strpos(substr($data,$headerlen),chr(0)); 
    if ($filenamelen === false || $len - $headerlen - $filenamelen - 1 < 8) { 
      return false;    // Invalid format 
    } 
    $filename = substr($data,$headerlen,$filenamelen); 
    $headerlen += $filenamelen + 1; 
  } 

  $commentlen = 0; 
  $comment = ""; 
  if ($flags & 16) { 
    // C-style string COMMENT data in header 
    if ($len - $headerlen - 1 < 8) { 
      return false;    // Invalid format 
    } 
    $commentlen = strpos(substr($data,$headerlen),chr(0)); 
    if ($commentlen === false || $len - $headerlen - $commentlen - 1 < 8) { 
      return false;    // Invalid header format 
    } 
    $comment = substr($data,$headerlen,$commentlen); 
    $headerlen += $commentlen + 1; 
  } 

  $headercrc = ""; 
  if ($flags & 1) { 
    // 2-bytes (lowest order) of CRC32 on header present 
    if ($len - $headerlen - 2 < 8) { 
      return false;    // Invalid format 
    } 
    $calccrc = crc32(substr($data,0,$headerlen)) & 0xffff; 
    $headercrc = unpack("v", substr($data,$headerlen,2)); 
    $headercrc = $headercrc[1]; 
    if ($headercrc != $calccrc) { 
      return false;    // Bad header CRC 
    } 
    $headerlen += 2; 
  } 

  // GZIP FOOTER - These may be negative due to PHP's integer limitations 
  $datacrc = unpack("V",substr($data,-8,4)); 
  $datacrc = $datacrc[1]; 
  $isize = unpack("V",substr($data,-4)); 
  $isize = $isize[1]; 

  // Perform the decompression: 
  $bodylen = $len-$headerlen-8; 
  if ($bodylen < 1) { 
    // This should never happen - IMPLEMENTATION BUG! 
    return null; 
  } 
  $body = substr($data,$headerlen,$bodylen); 
  $data = ""; 
  if ($bodylen > 0) { 
    switch ($method) { 
      case 8: 
        // Currently the only supported compression method: 
        $data = gzinflate($body); 
        break; 
      default: 
        // Unknown compression method 
        return false; 
    } 
  } else { 
    // I'm not sure if zero-byte body content is allowed. 
    // Allow it for now...  Do nothing... 
  } 

  // Verify decompressed size and CRC32: 
  // NOTE: This may fail with large data sizes depending on how 
  //       PHP's integer limitations affect strlen() since $isize 
  //       may be negative for large sizes. 
  if ($isize != strlen($data) || crc32($data) != $datacrc) { 
    // Bad format!  Length or CRC doesn't match! 
    return false; 
  } 
  return $data; 
}

Usage:

$html=file_get_contents('http://www.bkjia.com/librarys/veda/');
$html=gzdecode($html);
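The usage above assumes the response is always gzip-compressed. A minimal sketch of a guard follows, assuming PHP 5.4 or later where gzdecode() is built in: maybe_gunzip is a hypothetical helper that checks for the gzip magic bytes (0x1f 0x8b, per RFC 1952) and returns the input unchanged when they are absent or when decoding fails.

```php
<?php
// Hypothetical helper: decompress only when the body looks like gzip.
function maybe_gunzip(string $body): string
{
    // Gzip streams start with the two magic bytes 0x1f 0x8b.
    if (strncmp($body, "\x1f\x8b", 2) !== 0) {
        return $body;
    }
    // gzdecode() returns false (and emits a warning) on corrupt input.
    $decoded = @gzdecode($body);
    return $decoded === false ? $body : $decoded;
}

// Round trip with made-up content.
echo maybe_gunzip(gzencode('hello')), "\n";
echo maybe_gunzip('plain text'), "\n";
```

With a guard like this, the same call could be applied to every fetched page regardless of whether the server compressed it.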

These three methods should resolve most of the garbled-output problems caused by gzip when crawling pages.
