What is a crawler? What is the basic process of crawler?
A web crawler is a program, mainly used for search engines. It reads all the content and links of a website, builds relevant full-text indexes into the database, and then jumps to another website. It looks like A big spider.
When people search for keywords on the Internet (such as Google), they are actually comparing the content in the database to find those that match the user. The quality of the web crawler program determines the ability of the search engine , for example, Google’s search engine is obviously better than Baidu, because its web crawler program is efficient and its programming structure is good.
1. What is a crawler
First of all, let’s briefly understand the crawler. That is a process of requesting a website and extracting the data you need. As for how to climb and how to climb, it will be the content of learning later, so there is no need to go into it for now. Through our program, we can send requests to the server on our behalf, and then download large amounts of data in batches.
2. The basic process of the crawler
Initiate a request: initiate a request request to the server through the url, Requests can contain additional header information.
Get the response content: If the server responds normally, we will receive a response. The response is the content of the web page we requested, which may include HTML. Json string or binary data (video, picture), etc.
Parse content: If it is HTML code, it can be parsed using a web page parser. If it is Json data, it can be converted into a Json object for parsing. If Is binary data, it can be saved to a file for further processing.
Save data: You can save it to a local file or to a database (MySQL, Redis, Mongodb, etc.)
3. What does the request contain?
When we send a request to the server through the browser When requesting, what information does this request contain? We can explain it through Chrome's developer tools (if you don't know how to use it, read the notes in this article).
Request method: The most commonly used request methods include get request and post request. The most common post request in development is to submit it through a form. From the user's perspective, the most common one is login verification. When you need to enter some information to log in, this request is a post request.
url Uniform Resource Locator: A URL, a picture, a video, etc. can all be defined using URL. When we request a web page, we can view the network tag. The first one is usually a document, which means that this document is an HTML code that is not rendered with external images, css, js, etc. Below this document we will see To a series of jpg, js, etc., this is a request initiated by the browser again and again based on the html code, and the requested address is the url address of the image, js, etc. in the html document
request headers: Request headers, including the request type of this request, cookie information, browser type, etc. This request header is still useful when we crawl web pages. The server will review the information by parsing the request header to determine whether the request is a legitimate request. So when we make a request through a program that disguises the browser, we can set the request header information.
Request body: The post request will package the user information in form-data for submission, so compared to the get request, the content of the Headers tag of the post request There will be an additional information package called Form Data. The get request can be simply understood as an ordinary search carriage return, and the information will be added at the end of the url at ? intervals.
4. What does the response contain
##Response status: The status code can be seen through General in Headers. 200 indicates success, 301 jump, 404 web page not found, 502 server error, etc.
#Response header: includes content type, cookie information, etc.
Response body: The purpose of the request is to get the response body, including html code, Json and binary data.
5. Simple request demonstration
Perform web page requests through Python's request library :
The output result is the web page code that has not yet been rendered, that is, the content of the request body. You can view the response header information:
View status code:
You can also add the request header to the request information:
Grab the picture (Baidu logo):
6. How to solve JavaScript rendering problems
Use Selenium webdriver
Enter print(driver.page_source) and you can see that this time the code is the code after rendering.
【Remarks】Using chrome browser
##F12 to open the developer tools
##
Network tag
##
The above is the detailed content of What is a crawler? What is the basic process of crawler?. For more information, please follow other related articles on the PHP Chinese website!

Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

Video Face Swap
Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

Hot Tools

Notepad++7.3.1
Easy-to-use and free code editor

SublimeText3 Chinese version
Chinese version, very easy to use

Zend Studio 13.0.1
Powerful PHP integrated development environment

Dreamweaver CS6
Visual web development tools

SublimeText3 Mac version
God-level code editing software (SublimeText3)

Hot Topics

How to delete Xiaohongshu notes? Notes can be edited in the Xiaohongshu APP. Most users don’t know how to delete Xiaohongshu notes. Next, the editor brings users pictures and texts on how to delete Xiaohongshu notes. Tutorial, interested users come and take a look! Xiaohongshu usage tutorial How to delete Xiaohongshu notes 1. First open the Xiaohongshu APP and enter the main page, select [Me] in the lower right corner to enter the special area; 2. Then in the My area, click on the note page shown in the picture below , select the note you want to delete; 3. Enter the note page, click [three dots] in the upper right corner; 4. Finally, the function bar will expand at the bottom, click [Delete] to complete.

Notes deleted from Xiaohongshu cannot be recovered. As a knowledge sharing and shopping platform, Xiaohongshu provides users with the function of recording notes and collecting useful information. According to Xiaohongshu’s official statement, deleted notes cannot be recovered. The Xiaohongshu platform does not provide a dedicated note recovery function. This means that once a note is deleted in Xiaohongshu, whether it is accidentally deleted or for other reasons, it is generally impossible to retrieve the deleted content from the platform. If you encounter special circumstances, you can try to contact Xiaohongshu’s customer service team to see if they can help solve the problem.

How to set up keyboard startup on Gigabyte's motherboard. First, if it needs to support keyboard startup, it must be a PS2 keyboard! ! The setting steps are as follows: Step 1: Press Del or F2 to enter the BIOS after booting, and go to the Advanced (Advanced) mode of the BIOS. Ordinary motherboards enter the EZ (Easy) mode of the motherboard by default. You need to press F7 to switch to the Advanced mode. ROG series motherboards enter the BIOS by default. Advanced mode (we use Simplified Chinese to demonstrate) Step 2: Select to - [Advanced] - [Advanced Power Management (APM)] Step 3: Find the option [Wake up by PS2 keyboard] Step 4: This option The default is Disabled. After pulling down, you can see three different setting options, namely press [space bar] to turn on the computer, press group

What graphics card is good for Core i73770? RTX3070 is a very powerful graphics card with excellent performance and advanced technology. Whether you're playing games, rendering graphics, or performing machine learning, the RTX3070 can handle it with ease. It uses NVIDIA's Ampere architecture, has 5888 CUDA cores and 8GB of GDDR6 memory, which can provide a smooth gaming experience and high-quality graphics effects. RTX3070 also supports ray tracing technology, which can present realistic light and shadow effects. All in all, the RTX3070 is a powerful and advanced graphics card suitable for those who pursue high performance and high quality. RTX3070 is an NVIDIA series graphics card. Powered by 2nd generation NVID

As a Xiaohongshu user, we have all encountered the situation where published notes suddenly disappeared, which is undoubtedly confusing and worrying. In this case, what should we do? This article will focus on the topic of "What to do if the notes published by Xiaohongshu are missing" and give you a detailed answer. 1. What should I do if the notes published by Xiaohongshu are missing? First, don't panic. If you find that your notes are missing, staying calm is key and don't panic. This may be caused by platform system failure or operational errors. Checking release records is easy. Just open the Xiaohongshu App and click "Me" → "Publish" → "All Publications" to view your own publishing records. Here you can easily find previously published notes. 3.Repost. If found

Link AppleNotes on iPhone using the Add Link feature. Notes: You can only create links between Apple Notes on iPhone if you have iOS17 installed. Open the Notes app on your iPhone. Now, open the note where you want to add the link. You can also choose to create a new note. Click anywhere on the screen. This will show you a menu. Click the arrow on the right to see the "Add link" option. click it. Now you can type the name of the note or the web page URL. Then, click Done in the upper right corner and the added link will appear in the note. If you want to add a link to a word, just double-click the word to select it, select "Add Link" and press

How to add product links in notes in Xiaohongshu? In the Xiaohongshu app, users can not only browse various contents but also shop, so there is a lot of content about shopping recommendations and good product sharing in this app. If If you are an expert on this app, you can also share some shopping experiences, find merchants for cooperation, add links in notes, etc. Many people are willing to use this app for shopping, because it is not only convenient, but also has many Experts will make some recommendations. You can browse interesting content and see if there are any clothing products that suit you. Let’s take a look at how to add product links to notes! How to add product links to Xiaohongshu Notes Open the app on the desktop of your mobile phone. Click on the app homepage

Which tablet is suitable for musicians? The 12.9-inch speaker in Huawei’s iPad is a very good product. It comes with four speakers and the sound is excellent. Moreover, it belongs to the pro series, which is slightly better than other styles. Overall, ipad pro is a very good product. The speaker of this mini4 mobile phone is small and the effect is average. It cannot be used to play music externally, you still need to rely on headphones to enjoy music. Headphones with good sound quality will have a slightly better effect, but cheap headphones worth thirty or forty yuan cannot meet the requirements. What tablet should I use for electronic piano music? If you want to buy an iPad larger than 10 inches, I recommend using two applications, namely Henle and Piascore. Provided by Henle
