Getting started with Python crawlers (1)--Quickly understand the HTTP protocol

零下一度

May 27, 2017 am 11:36 AM

The HTTP protocol is one of the most important and basic protocols on the Internet, and crawlers have to deal with it constantly. This article introduces the HTTP protocol for readers who are getting started with Python crawlers.

Preface

The basic principle of a crawler is to simulate a browser making HTTP requests, so understanding the HTTP protocol is a necessary foundation for writing one. Job postings for crawler positions also state explicitly that candidates must be proficient in the HTTP protocol specification. Writing a crawler therefore starts with HTTP.

What is the HTTP protocol?

Every web page you browse is delivered over the HTTP protocol. HTTP is a protocol for data communication between a client (such as a browser) and a server in Internet applications. It specifies the format in which the client sends requests to the server and the format of the responses the server returns.

As long as both sides send requests and return responses according to the protocol, anyone can implement their own web client (browser, crawler) or web server (Nginx, Apache, etc.) on top of HTTP.

The HTTP protocol itself is very simple. Only the client can initiate a request; the server returns a response after receiving and processing that request. HTTP is also a stateless protocol: the protocol itself keeps no record of a client's previous requests.

How does the HTTP protocol specify the request and response formats? In other words, in what format must the client send a request so that it is valid, and in what format must the server return a response so that the client can parse it correctly?

HTTP request

An HTTP request consists of three parts: the request line, the request headers, and the request body. The headers and the body are optional and are not required in every request (although HTTP/1.1 requires at least the Host header).

Request line

The request line is a mandatory part of every request. It consists of three parts separated by spaces: the request method, the request URL (URI), and the HTTP protocol version.
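To make the three-part structure concrete, here is a minimal sketch that splits a request line on spaces. The function name parse_request_line and the sample line are illustrative assumptions, not part of any library.

```python
def parse_request_line(line):
    """Split an HTTP request line into (method, target, version)."""
    # A well-formed request line has exactly three space-separated parts.
    parts = line.rstrip("\r\n").split(" ")
    if len(parts) != 3:
        raise ValueError(f"malformed request line: {line!r}")
    method, target, version = parts
    return method, target, version


method, target, version = parse_request_line("GET /index.html HTTP/1.1")
print(method)   # request method
print(target)   # requested resource
print(version)  # protocol version
```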

The most commonly used request methods in HTTP are GET, POST, PUT, and DELETE. The GET method retrieves a resource from the server, and roughly 90% of crawlers fetch their data with GET requests.

The request URL is the path of the resource on the server. For example, a request line such as GET /index.html HTTP/1.1 sent to the server foofish.net means that the client wants the resource index.html, which is located in the server's root directory (/).
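As a rough illustration, the standard library's urllib.parse.urlsplit can separate a full URL into the host and the path that ends up in the request line. The full URL below (including the http scheme) is an assumption made for the example; the host usually travels in the Host header rather than in the request line.

```python
from urllib.parse import urlsplit

# Assumed full URL for illustration only.
url = "http://foofish.net/index.html"

parts = urlsplit(url)
host = parts.netloc          # the server the request is sent to
path = parts.path or "/"     # an empty path means the root directory

request_line = f"GET {path} HTTP/1.1"
host_header = f"Host: {host}"

print(request_line)
print(host_header)
```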

Request header

Because the request line carries very little information, everything else the client has to tell the server goes into the request headers. Headers give the server additional information. For example, User-Agent identifies the client, letting the server know whether the request comes from a browser or a crawler, and whether the browser is Chrome or Firefox. HTTP/1.1 specifies 47 header fields. The format of a header field resembles Python's dictionary type: a key-value pair separated by a colon. For example:

User-Agent: Mozilla/5.0
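Since header fields resemble dictionary entries, the mapping in both directions can be sketched in a few lines. The helper names format_headers and parse_headers and the Host value are hypothetical. A plain dict is a simplification: real header names are case-insensitive and a field may appear more than once.

```python
def format_headers(headers):
    """Serialize a dict into CRLF-separated 'Name: value' lines."""
    return "\r\n".join(f"{name}: {value}" for name, value in headers.items())


def parse_headers(text):
    """Parse CRLF-separated 'Name: value' lines back into a dict."""
    headers = {}
    for line in text.split("\r\n"):
        if not line:
            continue
        # Split at the first colon only; values may contain colons themselves.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers


headers = {"Host": "foofish.net", "User-Agent": "Mozilla/5.0"}
raw = format_headers(headers)
print(raw)
print(parse_headers(raw))
```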

When the client sends a request, the message is transmitted as a single string of data. To mark where the headers end and the body begins, a blank line is used: reaching a blank line means that the headers are finished and the request body follows.
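The blank-line rule can be sketched as a split on the first empty line, that is, on two consecutive CRLF sequences. The function name split_message and the sample request are made up for illustration.

```python
def split_message(message):
    """Split a raw HTTP message into (head, body) at the first blank line."""
    head, separator, body = message.partition(b"\r\n\r\n")
    if not separator:
        raise ValueError("no blank line found: the headers are incomplete")
    return head, body


# Hypothetical request used only to exercise the function.
raw_request = (
    b"POST /login HTTP/1.1\r\n"
    b"Host: foofish.net\r\n"
    b"Content-Length: 9\r\n"
    b"\r\n"
    b"user=anna"
)

head, body = split_message(raw_request)
print(head.split(b"\r\n"))  # request line followed by the header lines
print(body)                 # everything after the blank line
```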

Request body

The request body is the actual content the client submits to the server, such as the username and password sent when a user logs in, file upload data, or the form fields submitted when registering an account.
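As a sketch of what a login submission might look like on the wire, the following builds a POST request whose body is a URL-encoded form. The host, path, and field names are hypothetical, and the function build_post_request is not a library API. The Content-Length header tells the server how many bytes of body to expect.

```python
from urllib.parse import urlencode


def build_post_request(host, path, form):
    """Build a raw HTTP/1.1 POST request carrying a URL-encoded form body."""
    body = urlencode(form).encode("ascii")
    lines = [
        f"POST {path} HTTP/1.1",
        f"Host: {host}",
        "Content-Type: application/x-www-form-urlencoded",
        f"Content-Length: {len(body)}",
        "",  # these two empty strings make the join end with a blank line
        "",
    ]
    return "\r\n".join(lines).encode("ascii") + body


message = build_post_request(
    "example.com", "/login", {"username": "alice", "password": "secret"}
)
print(message.decode("ascii"))
```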

Now let's use the low-level socket module provided by Python to send an HTTP request to a server:

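import socket

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    # 1. Establish a connection to the server
    s.connect(("www.seriot.ch", 80))
    # 2. Build the request line; the requested resource is index.php
    request_line = b"GET /index.php HTTP/1.1"
    # 3. Build the request header, specifying the host name
    headers = b"Host: seriot.ch"
    # 4. Use a blank line to mark the end of the request headers
    blank_line = b"\r\n"

    # Join the request line, header, and blank line with CRLF to form
    # the request message, then send it to the server
    message = b"\r\n".join([request_line, headers, blank_line])
    s.sendall(message)

    # The response returned by the server is analyzed later
    response = s.recv(1024)
    print(response)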
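The single recv(1024) call above reads at most 1024 bytes. One possible way to read the rest is to call recv in a loop until it returns empty bytes, which signals that the peer has closed the connection. This is only a sketch: recv_all is a hypothetical helper, and because HTTP/1.1 connections are persistent by default, the loop would only finish promptly if the request also carried a Connection: close header (or the socket had a timeout).

```python
import socket


def recv_all(sock, chunk_size=1024):
    """Read from a connected socket until the peer closes the connection."""
    chunks = []
    while True:
        chunk = sock.recv(chunk_size)
        if not chunk:  # empty bytes: the other side closed the connection
            break
        chunks.append(chunk)
    return b"".join(chunks)


# Local demonstration without any network access: a connected socket pair
# stands in for the client and the server.
server_side, client_side = socket.socketpair()
server_side.sendall(b"x" * 3000)
server_side.close()
data = recv_all(client_side)
client_side.close()
print(len(data))
```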

HTTP response

After the server receives and processes the request, it returns a response to the client. The response must likewise follow a fixed format so that the browser can parse it correctly. An HTTP response also consists of three parts, mirroring the request format: the response line, the response headers, and the response body.

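Response line

The response line also has three parts: the HTTP protocol version supported by the server, the status code, and a short reason phrase describing the status code.

The status code is an important field in the response line. It tells the client whether the server handled the request normally. A status code of 200 means the request was processed successfully; 500 means the server hit an error while processing the request; 404 means the requested resource could not be found on the server. HTTP defines many other status codes, but they are outside the scope of this article.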
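A minimal sketch of reading a response line and reacting to its status code follows. The names parse_status_line and describe are illustrative; http.HTTPStatus is the standard library enumeration of the registered status codes and their reason phrases.

```python
from http import HTTPStatus


def parse_status_line(line):
    """Split a response line into (version, status code, reason phrase)."""
    # Split at most twice: the reason phrase may itself contain spaces.
    version, code, reason = line.rstrip("\r\n").split(" ", 2)
    return version, int(code), reason


def describe(code):
    """Give a coarse description of a status code based on its first digit."""
    if 200 <= code < 300:
        return "success"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "other"


version, code, reason = parse_status_line("HTTP/1.1 200 OK")
print(version, code, reason, describe(code))

for known in (200, 404, 500):
    print(known, HTTPStatus(known).phrase, describe(known))
```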

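Response headers

Response headers are similar to request headers and supply additional information about the response. They can tell the client the data type of the response body, the time the response was returned, whether the response body is compressed, and when the response body was last modified.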
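As a rough sketch of how a client might read these fields, the snippet below inspects a small dictionary of response headers. The Date and Content-Type values are the ones from this article's sample response; the charset extraction is deliberately simplified and assumes a single parameter. email.utils.parsedate_to_datetime is a standard library function that understands this date format.

```python
from email.utils import parsedate_to_datetime

response_headers = {
    "Date": "Tue, 04 Apr 2017 16:22:35 GMT",
    "Content-Type": "text/html; charset=UTF-8",
}

# When the response was generated
sent_at = parsedate_to_datetime(response_headers["Date"])

# What kind of data the body contains (simplified parameter handling)
media_type, _, params = response_headers["Content-Type"].partition(";")
media_type = media_type.strip()
charset = params.strip().partition("=")[2]

# Whether the body is compressed: servers announce this in Content-Encoding
is_compressed = "Content-Encoding" in response_headers

# When the resource last changed, if the server says so
last_modified = response_headers.get("Last-Modified")

print(sent_at.isoformat(), media_type, charset, is_compressed, last_modified)
```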

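Response body

The response body is the actual content returned by the server. It can be an HTML page, an image, a video, and so on.

Continuing with the earlier example, let's see what response the server returned. Because only the first 1024 bytes were received, part of the response is not shown.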

b'HTTP/1.1 200 OK\r\n
Date: Tue, 04 Apr 2017 16:22:35 GMT\r\n
Server: Apache\r\n
Expires: Thu, 19 Nov 1981 08:52:00 GMT\r\n
Set-Cookie: PHPSESSID=66bea0a1f7cb572584745f9ce6984b7e; path=/\r\n
Transfer-Encoding: chunked\r\n
Content-Type: text/html; charset=UTF-8\r\n\r\n118d\r\n

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n\n
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">\n
<head>\n\t
 <meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1" /> \n\t
 <meta http-equiv="content-language" content="en" />\n\t
...
</html>

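The result matches the format defined by the protocol. The first line is the response line, and its status code of 200 indicates that the request succeeded. The second part is the response headers, made up of several fields such as the time the server returned the response and cookie information. The third part is the actual response body, the HTML text.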
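The three parts described here can be separated programmatically. The following is a simplified sketch: parse_response is a hypothetical helper, the sample bytes are a shortened stand-in for the real response, and chunked transfer encoding is not decoded.

```python
def parse_response(raw):
    """Split a raw HTTP response into (status code, reason, headers, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")

    # Part 1: the response line
    _version, code, reason = lines[0].split(" ", 2)

    # Part 2: the headers; names are lower-cased because they are case-insensitive
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()

    # Part 3: everything after the blank line is the body
    return int(code), reason, headers, body


# Shortened stand-in for the response shown above.
sample = (
    b"HTTP/1.1 200 OK\r\n"
    b"Server: Apache\r\n"
    b"Content-Type: text/html; charset=UTF-8\r\n"
    b"\r\n"
    b"<!DOCTYPE html>"
)

code, reason, headers, body = parse_response(sample)
print(code, reason)
print(headers)
print(body)
```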

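By now you should have an overall picture of the HTTP protocol. A crawler essentially simulates a browser sending HTTP requests, so a solid understanding of HTTP is essential if you want to go deep in the crawler field.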
