


Fixing garbled pages fetched with file_get_contents (PHP tutorial)
When you fetch web pages with file_get_contents(), the result is sometimes garbled. There are two common causes: a character encoding mismatch, or Gzip compression enabled on the target page.
Encoding problems are easy to handle: convert the fetched content to the encoding you need, for example $content = iconv("GBK", "UTF-8//IGNORE", $content);. This article focuses on how to fetch pages that have Gzip turned on. How can you tell? If the response headers contain Content-Encoding: gzip, the content is gzip-compressed. You can use Firebug to check whether gzip is enabled on a page. Below is the header information for my blog as viewed in Firebug; gzip is turned on.
Request headers (raw headers):
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Encoding: gzip, deflate
Accept-Language: zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3
Connection: keep-alive
Cookie: __utma=225240837.787252530.1317310581.1335406161.1335411401.1537; __utmz=225240837.1326850415.887.3.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=%E4%BB%BB%E4%BD%95%E9%A1%B9%E7%9B%AE%E9%83%BD%E4%B8%8D%E4%BC%9A%E9%82%A3%E4%B9%88%E7%AE%80%E5%8D%95%20site%3Awww.bkjia.com; PHPSESSID=888mj4425p8s0m7s0frre3ovc7; __utmc=225240837; __utmb=225240837.1.10.1335411401
Host: www.bkjia.com
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:12.0) Gecko/20100101 Firefox/12.0
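The same check can also be made from PHP rather than from a browser extension, by looking through the response header lines. The sketch below assumes the headers are available as an array of strings, which is the form PHP's HTTP wrapper stores in $http_response_header after a file_get_contents() call. The helper name response_is_gzipped and the sample header lines are made up for illustration.

```php
<?php
/**
 * Return true if any header line declares "Content-Encoding: gzip".
 * $headers is a list of raw header lines, e.g. $http_response_header.
 * (Hypothetical helper, not part of PHP.)
 */
function response_is_gzipped(array $headers)
{
    foreach ($headers as $line) {
        // Header names are case-insensitive; "x-gzip" is a legacy alias.
        if (preg_match('/^Content-Encoding:\s*(x-)?gzip\b/i', $line)) {
            return true;
        }
    }
    return false;
}

// Sample header lines for illustration only (not captured from a real server).
$sample = array(
    'HTTP/1.1 200 OK',
    'Content-Type: text/html; charset=utf-8',
    'Content-Encoding: gzip',
);
var_dump(response_is_gzipped($sample)); // bool(true)
```

After an HTTP fetch with file_get_contents($url), calling response_is_gzipped($http_response_header) in the same scope would tell you whether one of the methods below is needed.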
Here are some solutions:
1. Use the built-in zlib library
If PHP on your server was built with the zlib extension, the compress.zlib:// stream wrapper decompresses the content as it is read, which solves the problem in one line:
$data = file_get_contents("compress.zlib://".$url);
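To see what the wrapper does without depending on a remote server, the sketch below reads a local gzip file twice, once directly and once through compress.zlib://. It assumes the zlib extension is loaded; the temporary file merely stands in for a gzipped page.

```php
<?php
// Create a gzip-compressed temporary file to stand in for a gzipped page.
$original = "<html><body>hello gzip</body></html>";
$tmp = tempnam(sys_get_temp_dir(), 'gz');
file_put_contents($tmp, gzencode($original));

// Read directly: we get the compressed bytes, which look garbled.
$raw = file_get_contents($tmp);

// Read through the wrapper: the data is decompressed while reading.
$decoded = file_get_contents("compress.zlib://" . $tmp);

var_dump($raw === $original);     // bool(false)
var_dump($decoded === $original); // bool(true)

unlink($tmp);
```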
2. Use cURL instead of file_get_contents
function curl_get($url, $gzip = false)
{
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10);
    if ($gzip) {
        // This is the key line: cURL sends Accept-Encoding: gzip
        // and decompresses the response automatically.
        curl_setopt($curl, CURLOPT_ENCODING, "gzip");
    }
    $content = curl_exec($curl);
    curl_close($curl);
    return $content;
}
3. Use a gzip decompression function
// PHP 5.4 and later ship a built-in gzdecode(); define this fallback only when it is missing.
if (!function_exists('gzdecode')) {
    function gzdecode($data)
    {
        $len = strlen($data);
        if ($len < 18 || strcmp(substr($data, 0, 2), "\x1f\x8b")) {
            return null; // Not GZIP format (see RFC 1952)
        }
        $method = ord(substr($data, 2, 1)); // Compression method
        $flags  = ord(substr($data, 3, 1)); // Flags
        if (($flags & 31) != $flags) {
            // Reserved bits are set -- NOT ALLOWED by RFC 1952
            return null;
        }
        // NOTE: $mtime may be negative (PHP integer limitations)
        $mtime = unpack("V", substr($data, 4, 4));
        $mtime = $mtime[1];
        $xfl = substr($data, 8, 1); // Extra flags
        $os  = substr($data, 9, 1); // Operating system
        $headerlen = 10;
        $extralen = 0;
        $extra = "";
        if ($flags & 4) {
            // 2-byte length prefixed EXTRA data in header
            if ($len - $headerlen - 2 < 8) {
                return false; // Invalid format
            }
            $extralen = unpack("v", substr($data, $headerlen, 2));
            $extralen = $extralen[1];
            if ($len - $headerlen - 2 - $extralen < 8) {
                return false; // Invalid format
            }
            $extra = substr($data, $headerlen + 2, $extralen);
            $headerlen += 2 + $extralen;
        }
        $filenamelen = 0;
        $filename = "";
        if ($flags & 8) {
            // C-style string file NAME data in header
            if ($len - $headerlen - 1 < 8) {
                return false; // Invalid format
            }
            $filenamelen = strpos(substr($data, $headerlen), chr(0));
            if ($filenamelen === false || $len - $headerlen - $filenamelen - 1 < 8) {
                return false; // Invalid format
            }
            $filename = substr($data, $headerlen, $filenamelen);
            $headerlen += $filenamelen + 1;
        }
        $commentlen = 0;
        $comment = "";
        if ($flags & 16) {
            // C-style string COMMENT data in header
            if ($len - $headerlen - 1 < 8) {
                return false; // Invalid format
            }
            $commentlen = strpos(substr($data, $headerlen), chr(0));
            if ($commentlen === false || $len - $headerlen - $commentlen - 1 < 8) {
                return false; // Invalid header format
            }
            $comment = substr($data, $headerlen, $commentlen);
            $headerlen += $commentlen + 1;
        }
        $headercrc = "";
        if ($flags & 1) {
            // 2 bytes (lowest order) of CRC32 on header present
            if ($len - $headerlen - 2 < 8) {
                return false; // Invalid format
            }
            $calccrc = crc32(substr($data, 0, $headerlen)) & 0xffff;
            $headercrc = unpack("v", substr($data, $headerlen, 2));
            $headercrc = $headercrc[1];
            if ($headercrc != $calccrc) {
                return false; // Bad header CRC
            }
            $headerlen += 2;
        }
        // GZIP FOOTER - these may be negative due to PHP's integer limitations
        $datacrc = unpack("V", substr($data, -8, 4));
        $datacrc = $datacrc[1];
        $isize = unpack("V", substr($data, -4));
        $isize = $isize[1];
        // Perform the decompression:
        $bodylen = $len - $headerlen - 8;
        if ($bodylen < 1) {
            // This should never happen - IMPLEMENTATION BUG!
            return null;
        }
        $body = substr($data, $headerlen, $bodylen);
        $data = "";
        switch ($method) {
            case 8:
                // Currently the only supported compression method (deflate):
                $data = gzinflate($body);
                break;
            default:
                // Unknown compression method
                return false;
        }
        // Verify decompressed size and CRC32.
        // NOTE: This may fail with large data sizes depending on how
        // PHP's integer limitations affect strlen(), since $isize
        // may be negative for large sizes.
        if ($isize != strlen($data) || crc32($data) != $datacrc) {
            // Bad format! Length or CRC doesn't match!
            return false;
        }
        return $data;
    }
}
Usage:
$html = file_get_contents('http://www.bkjia.com/librarys/veda/');
$html = gzdecode($html);
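Calling gzdecode() on a page that was never compressed fails, so it may be safer to decompress only when the data actually looks like gzip. The following is a minimal sketch, assuming a gzdecode() function is available (the built-in one in PHP 5.4 and later, or a userland fallback like the one in method 3). The name maybe_gunzip is hypothetical.

```php
<?php
/**
 * Decompress $data if it looks like a gzip stream; otherwise return it unchanged.
 * (Hypothetical helper, assuming gzdecode() exists.)
 */
function maybe_gunzip($data)
{
    // A gzip stream starts with the magic bytes 0x1f 0x8b (RFC 1952)
    // and is at least 18 bytes long (10-byte header + 8-byte footer).
    if (strlen($data) >= 18 && substr($data, 0, 2) === "\x1f\x8b") {
        $decoded = @gzdecode($data); // suppress the warning on corrupt input
        if ($decoded !== false && $decoded !== null) {
            return $decoded;
        }
    }
    // Not gzip, or decoding failed: hand back the original bytes.
    return $data;
}

var_dump(maybe_gunzip(gzencode("hello"))); // string(5) "hello"
var_dump(maybe_gunzip("hello"));           // string(5) "hello"
```

With this wrapper, the usage above becomes $html = maybe_gunzip(file_get_contents($url)); and works whether or not the server compressed the response.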
These three methods should solve most cases of garbled output caused by gzip when fetching pages.

