Notes on processing scraped data with preg_match_all (encoding conversion and regular-expression matching)

Jul 13, 2016 am 10:39 AM

1. Use curl to fetch pages from the remote site

For details, see my previous note: http://www.jb51.net/article/46432.htm

2. Encoding conversion
First determine which encoding the scraped site uses by viewing its page source, then convert the content with the mb_convert_encoding function.

Usage:

// $str holds the source string; use one of the two calls below

// When the original encoding is known to be GBK, convert it to UTF-8
$str = mb_convert_encoding($str, "UTF-8", "GBK");

// When the original encoding is unknown, pass "auto" so mbstring detects it, then converts to UTF-8
// (detection follows mbstring's configured detect order, so check the result)
$str = mb_convert_encoding($str, "UTF-8", "auto");
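
The conversion step can be wrapped in a small helper. The following is a minimal sketch, not the author's code: `to_utf8` is a hypothetical name, and it assumes the scraped bytes are either already valid UTF-8 or in one known legacy encoding (GBK by default). It requires the mbstring extension.

```php
<?php
// Hypothetical helper: return $raw as UTF-8.
// Assumption: input is either valid UTF-8 already or encoded in $fallback.
function to_utf8(string $raw, string $fallback = 'GBK'): string
{
    // Leave the bytes alone when they are already well-formed UTF-8.
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw;
    }
    // Otherwise convert from the assumed legacy encoding.
    return mb_convert_encoding($raw, 'UTF-8', $fallback);
}

$gbkBytes = "\xD6\xD0\xCE\xC4";   // the two characters "中文" encoded as GBK
echo to_utf8($gbkBytes), "\n";    // prints the same text as UTF-8
echo to_utf8('plain ascii'), "\n"; // ASCII is valid UTF-8 and passes through
```

Checking for valid UTF-8 first avoids converting a page twice, which would corrupt text that was already in the target encoding.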

3. To keep line breaks, spaces and tabs from interfering with matching, first strip them from the scraped source

// Method 1: replace with str_replace
$contents = str_replace("\r\n", '', $contents); // remove Windows line breaks
$contents = str_replace("\n", '', $contents); // remove Unix line breaks
$contents = str_replace("\t", '', $contents); // remove tab characters
$contents = str_replace(" ", '', $contents); // remove space characters

// Method 2: replace with a regular expression
$contents = preg_replace("/[\r\n\t ]+/", '', $contents);
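
To check that the two methods agree, here is a small sketch run on a made-up HTML fragment (the sample markup is an assumption for illustration only). `str_replace` accepts an array of search strings, so Method 1 can also be written as a single call.

```php
<?php
// Sample input invented for this illustration.
$contents = "<ul>\r\n\t<li>first item</li>\n\t<li>second item</li>\r\n</ul>";

// Method 1 as a single call: each search string is removed in turn.
$viaStrReplace = str_replace(["\r\n", "\n", "\t", " "], '', $contents);

// Method 2: one character class covering CR, LF, tab and space.
$viaRegex = preg_replace('/[\r\n\t ]+/', '', $contents);

echo $viaStrReplace, "\n"; // <ul><li>firstitem</li><li>seconditem</li></ul>
echo $viaRegex, "\n";      // same output
```

As the output shows, deleting every space also joins the words inside the text ("firstitem"), so this step suits cases where only the tag structure matters for matching.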

4. Use a regular expression to locate the fragment you need, matching it with preg_match_all

Function explanation:
int preg_match_all ( string pattern, string subject, array matches [, int flags] )
pattern is the regular expression
subject is the text to search
matches is the array that receives the results
flags controls how the results are arranged in matches:
PREG_PATTERN_ORDER; // The default. The result is a two-dimensional array: $arr1[0] holds every full match (surrounding delimiters included), $arr1[1] holds the text captured by the first parenthesized group in each match (delimiters excluded)
PREG_SET_ORDER; // The result is a two-dimensional array grouped by match: $arr2[0][0] is the full text of the first match (delimiters included), $arr2[0][1] is the text captured in the first match (delimiters excluded), $arr2[1] describes the second match, and so on
PREG_OFFSET_CAPTURE; // The result is a three-dimensional array in which every string is paired with its offset: $arr3[0][0][0] is the full text of the first match (delimiters included) and $arr3[0][0][1] is the byte offset in subject where it starts; $arr3[1][0][0] is the text captured in the first match (delimiters excluded) and $arr3[1][0][1] is the byte offset where that captured text starts
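
The array shapes described above are easier to see on a concrete input. The following sketch uses an invented subject string and `<b>` tags purely for illustration; the variable names `$arr1`, `$arr2` and `$arr3` follow the explanation above.

```php
<?php
// Invented subject with two matches, each delimited by <b>...</b>.
$subject = '<b>one</b><b>two</b>';
$pattern = '/<b>(.*?)<\/b>/';

preg_match_all($pattern, $subject, $arr1, PREG_PATTERN_ORDER);
preg_match_all($pattern, $subject, $arr2, PREG_SET_ORDER);
preg_match_all($pattern, $subject, $arr3, PREG_OFFSET_CAPTURE);

// PREG_PATTERN_ORDER: first index selects full matches (0) or the group (1).
echo $arr1[0][1], "\n"; // <b>two</b>
echo $arr1[1][1], "\n"; // two

// PREG_SET_ORDER: first index selects the match, second the group.
echo $arr2[1][0], "\n"; // <b>two</b>
echo $arr2[1][1], "\n"; // two

// PREG_OFFSET_CAPTURE: every entry is a [string, byte offset] pair.
echo $arr3[0][1][0], ' at ', $arr3[0][1][1], "\n"; // <b>two</b> at 10
echo $arr3[1][1][0], ' at ', $arr3[1][1][1], "\n"; // two at 13
```

PREG_SET_ORDER is usually the more convenient layout when each match is processed as one record, which is why the application below uses it.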

// Application
// <td> and </td> below stand in for whichever tags surround the text you want
preg_match_all('/<td>(.*?)<\/td>/', $contents, $out, PREG_SET_ORDER);
// $out receives every match
// $out[0][0] is the whole first match, including the surrounding tags
// $out[0][1] is only the text captured by (.*?) in the first match

// Likewise, the captured text of the nth match is
$out[n-1][1]

// If the regular expression contains several parenthesized groups, the mth group of the nth match is
$out[n-1][m]
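
The `$out[n-1][m]` indexing can be tried out with a pattern that has two groups. This is an illustrative sketch: the link markup and the values of `$n` and `$m` are made up for the example.

```php
<?php
// Invented input: two links, each with an href (group 1) and a label (group 2).
$contents = '<a href="/a.html">Alpha</a><a href="/b.html">Beta</a>';

preg_match_all('/<a href="(.*?)">(.*?)<\/a>/', $contents, $out, PREG_SET_ORDER);

$n = 2; // second match
$m = 1; // first parenthesized group (the href)

echo count($out), "\n";     // 2 matches in total
echo $out[$n - 1][0], "\n"; // <a href="/b.html">Beta</a>
echo $out[$n - 1][$m], "\n"; // /b.html
echo $out[$n - 1][2], "\n"; // Beta
```

Index 0 of each match is always the full matched text, so group numbering in the second index starts at 1.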

5. If you need to remove the HTML tags from the extracted text, PHP's built-in strip_tags function does this easily

// Example
$result = strip_tags($out[0][1]);
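
For a concrete picture of what strip_tags returns, here is a short sketch on an invented fragment. The second call uses the function's optional allowed-tags argument, which the note above does not cover; it is shown only as an aside.

```php
<?php
// Invented fragment, as might be captured by the matching step.
$fragment = '<p>Price: <b>42</b> <a href="#">details</a></p>';

// Remove every tag, keeping only the text between them.
$plain = strip_tags($fragment);
echo $plain, "\n"; // Price: 42 details

// Optionally keep selected tags by listing them in the second argument.
$keepBold = strip_tags($fragment, '<b>');
echo $keepBold, "\n"; // Price: <b>42</b> details
```

strip_tags removes only the markup, so entities such as `&amp;` remain in the result and can be decoded separately if needed.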
