
What does crawling data mean?

Jul 24, 2020 04:12 PM
Web Crawler

Crawling data means using a web crawler program to obtain the content you need from websites, such as text, video, pictures, and other data. A web crawler (also called a web spider) is a program or script that automatically crawls information from the World Wide Web according to certain rules.


What is the use of learning about crawling data?

For example, take the search engines everyone uses (Google, Sogou).

When a user searches for keywords on Google, Google analyzes those keywords, finds the entries most likely to suit the user among the web pages it has indexed, and presents them to the user. Obtaining those web pages in the first place is exactly what the crawler does; deciding which pages are most valuable to show the user also requires appropriate algorithms, which involves data-mining knowledge.

For smaller applications: for example, to measure the workload of testing work, we need to count the number of modification orders per week/month, plus the number and content of the defects recorded in Jira.

There is also the recent World Cup craze: you might want to collect statistics on each player or country and store that data for other purposes.

Alternatively, you can run analyses driven by your own interests (for example, statistics on the popularity of a book or movie). This requires crawling the data from existing web pages and then performing the specific analysis or statistical work on the data you obtain.

What basic knowledge is needed to learn a simple crawler?

I divide the basic knowledge into two parts:

1. Front-end basics

HTML, CSS, JSON, Ajax

Reference materials:

http://www.w3school.com.cn/h.asp

http://www.w3school.com.cn/ajax/

http://www.w3school.com.cn/json/

https://www.php.cn/course/list/1.html

https://www.php.cn/course/list/2.html

https://www.html.cn/

2. Python programming knowledge

(1) Python basics

Basic syntax, dictionaries, lists, functions, regular expressions, JSON, etc.

Reference materials:

http://www.runoob.com/python3/python3-tutorial.html

https://www.py.cn/

https://www.php.cn/course/list/30.html

(2) Commonly used Python libraries:

Python's urllib library (I mostly use the urlretrieve function from this module, mainly to save acquired resources such as documents, pictures, mp3s, and videos).
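A minimal sketch of that usage; the URL below is a hypothetical placeholder, not from the original article:

```python
# Minimal sketch: save a remote resource to disk with urlretrieve.
# The URL is a placeholder; substitute a resource you are allowed to fetch.
from urllib.request import urlretrieve

url = "https://example.com/sample.jpg"  # hypothetical image URL
local_path, headers = urlretrieve(url, "sample.jpg")
print("Saved to:", local_path)
```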

Python's PyMySQL library (database connections plus insert, delete, update, and query operations).
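A minimal sketch of those operations; the host, credentials, database, and table names are hypothetical placeholders:

```python
# Minimal sketch: connect, insert a row, then query it back with PyMySQL.
import pymysql

conn = pymysql.connect(host="localhost", user="root", password="secret",
                       database="testdb", charset="utf8mb4")
try:
    with conn.cursor() as cur:
        # insert; delete/update work the same way via execute()
        cur.execute("INSERT INTO defects (title) VALUES (%s)", ("sample defect",))
        conn.commit()
        # query
        cur.execute("SELECT id, title FROM defects")
        for row in cur.fetchall():
            print(row)
finally:
    conn.close()
```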

Python's bs4 module (requires knowledge of CSS selectors, the HTML tree structure (DOM tree), etc.; it locates the content we need via a CSS selector, an HTML tag, or an attribute).
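For example, a minimal sketch of locating content via a CSS selector and via a tag/attribute lookup; the HTML fragment is made up for illustration:

```python
# Minimal sketch: locate elements via a CSS selector and via tag/attribute.
from bs4 import BeautifulSoup

html = """
<div class="book">
  <h2 class="title">Example Book</h2>
  <span class="rating">4.5</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
for title in soup.select("div.book h2.title"):          # CSS selector
    print(title.get_text(strip=True))
rating = soup.find("span", attrs={"class": "rating"})   # tag/attribute lookup
print(rating.get_text())
```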

Python's requests module (as the name suggests, it sends HTTP requests such as GET and POST and returns a Response object).
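A minimal sketch of a GET and a POST, assuming the public httpbin.org test service is reachable:

```python
# Minimal sketch: send GET and POST requests and inspect the Response.
import requests

resp = requests.get("https://httpbin.org/get",
                    params={"q": "crawler"}, timeout=10)
print(resp.status_code, resp.headers.get("Content-Type"))

resp = requests.post("https://httpbin.org/post",
                     data={"key": "value"}, timeout=10)
print(resp.json()["form"])  # httpbin echoes the form data back
```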

Python's os module (this module provides a rich set of methods for handling files and directories; the os.path.join and os.path.exists functions are the most commonly used).
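A minimal sketch of the two functions named above:

```python
# Minimal sketch: build a path portably and create it only if missing.
import os

save_dir = os.path.join("downloads", "images")   # portable path joining
if not os.path.exists(save_dir):
    os.makedirs(save_dir)                        # create the directory tree
file_path = os.path.join(save_dir, "photo.jpg")
print(file_path, "exists:", os.path.exists(file_path))
```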

Reference materials: for this part, refer to each module's API documentation.

Extended information:

A web crawler is a program that automatically extracts web pages. It downloads pages from the World Wide Web for a search engine and is an important component of one.

A traditional crawler starts from the URLs of one or more seed pages and obtains the URLs on those pages; as it crawls, it continuously extracts new URLs from the current page and puts them into a queue, until certain stopping conditions of the system are met.
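A minimal sketch of that queue-driven loop, using requests and bs4 from the list above; the seed URL and the page-count stop condition are hypothetical placeholders:

```python
# Minimal sketch: seed URLs enter a queue; each fetched page yields new
# URLs, and crawling stops once a page limit (the stop condition) is hit.
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

seed = "https://example.com/"   # placeholder seed URL
max_pages = 10                  # placeholder stop condition
queue, seen, crawled = deque([seed]), {seed}, 0

while queue and crawled < max_pages:
    url = queue.popleft()
    try:
        html = requests.get(url, timeout=10).text
    except requests.RequestException:
        continue                # skip pages that fail to load
    crawled += 1
    # extract new URLs from the current page and enqueue unseen ones
    for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
        link = urljoin(url, a["href"])
        if link not in seen:
            seen.add(link)
            queue.append(link)

print("Crawled", crawled, "pages;", len(queue), "URLs still queued")
```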

The workflow of a focused crawler is more complicated. According to a certain web-page analysis algorithm, it filters out links unrelated to the topic, keeps the useful ones, and puts them into the queue of URLs waiting to be crawled. It then selects the next URL to crawl from the queue according to a certain search strategy, and repeats this process until some system condition is met.
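As an illustration only, here is a very naive stand-in for that link-filtering step (real focused crawlers use much richer page-analysis algorithms; the topic keywords are hypothetical):

```python
# Naive sketch: keep a link only if its URL or anchor text mentions
# the topic keywords; this stands in for the page-analysis algorithm.
TOPIC_KEYWORDS = ("world cup", "player", "goal")   # hypothetical topic

def is_relevant(url: str, anchor_text: str) -> bool:
    text = (url + " " + anchor_text).lower()
    return any(kw in text for kw in TOPIC_KEYWORDS)

print(is_relevant("https://example.com/players/messi", "player profile"))  # True
print(is_relevant("https://example.com/cooking", "pasta recipes"))          # False
```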

In addition, all web pages fetched by the crawler are stored by the system and undergo a certain amount of analysis, filtering, and indexing for later query and retrieval. For a focused crawler, the analysis results obtained in this process may also feed back into and guide subsequent crawling.

Compared with general web crawlers, focused crawlers also need to solve three main problems:

(1) Description or definition of the crawling target;

(2) Analysis and filtering of web pages or data;

(3) Search strategy for URLs.

Recommended tutorial: "Python tutorial"

The above is the detailed content of "What does crawling data mean?". For more information, please follow other related articles on the PHP Chinese website!
