


Method for simulating HTTP requests to realize automatic operation of web pages and data collection
Preface
Web pages can be divided into two categories: information provision and business operation. Information sites include news and stock quote websites. Business sites include online business halls, OA (office automation) systems, and so on. Many websites have both properties: Weibo, Douban, and Taobao, for example, both provide information and carry out business.
People normally browse the web by hand (no explanation needed :D). But sometimes manual operation is not enough, for example when crawling large amounts of data, monitoring a page for changes in real time, performing batch operations (such as batch posting on Weibo or batch shopping on Taobao), or order brushing (placing orders in bulk). Because these tasks are large and repetitive, doing them by hand is inefficient and error-prone. In such cases, software can perform the operations automatically.
I have developed several programs of this kind, including web crawlers and tools that automate batch business operations. A core function they all rely on is simulating HTTP requests. Sometimes the HTTPS protocol is involved, and the website generally requires logging in before further operations can be performed. The most important point is to understand the website's business process, that is, to know what data to submit to which page, when, and how, in order to achieve a given operation. Finally, to extract data or learn the result of an operation, you also need to parse the HTML. This article explains each of these in turn.
This article shows its code in C#, but the same approach can be implemented in other languages, since the principle is the same. Logging in to JD.com is used as the example.
Simulating HTTP requests
To simulate HTTP requests in C#, you need the following classes:
•WebRequest
•HttpWebRequest
•HttpWebResponse
•Stream
First, create a request object (HttpWebRequest), set the relevant headers, and send the request (for a POST, also write the form data to the request stream). If the target address is reachable, you get back a response object (HttpWebResponse), and the result can be read from the response object's stream. The sample code is as follows:
using System;
using System.IO;
using System.Net;
using System.Text;

// The class name HttpHelper and the "utf-8" default are assumptions;
// the original snippet shows neither.
public class HttpHelper
{
    private const String DEFAULT_ENCODE = "utf-8";

    String contentType = "application/x-www-form-urlencoded";
    String accept = "image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/x-shockwave-flash, application/x-silverlight, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, application/x-ms-application, application/x-ms-xbap, application/vnd.ms-xpsdocument, application/xaml+xml, application/x-silverlight-2-b1, */*";
    String userAgent = "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.116 Safari/537.36";

    public String Get(String url, String encode = DEFAULT_ENCODE)
    {
        HttpWebRequest request = WebRequest.Create(url) as HttpWebRequest;
        InitHttpWebRequestHeaders(request);
        request.Method = "GET";
        return ReadHtml(request, encode);
    }

    public String Post(String url, String param, String encode = DEFAULT_ENCODE)
    {
        // The form data is sent as UTF-8 bytes.
        byte[] data = Encoding.UTF8.GetBytes(param);
        HttpWebRequest request = WebRequest.Create(url) as HttpWebRequest;
        InitHttpWebRequestHeaders(request);
        request.Method = "POST";
        request.ContentLength = data.Length;
        // Write the form data and close the request stream before reading the response.
        using (Stream outstream = request.GetRequestStream())
        {
            outstream.Write(data, 0, data.Length);
        }
        return ReadHtml(request, encode);
    }

    private void InitHttpWebRequestHeaders(HttpWebRequest request)
    {
        request.ContentType = contentType;
        request.Accept = accept;
        request.UserAgent = userAgent;
    }

    private String ReadHtml(HttpWebRequest request, String encode)
    {
        // Dispose the response, stream, and reader even if reading throws.
        using (HttpWebResponse response = request.GetResponse() as HttpWebResponse)
        using (Stream stream = response.GetResponseStream())
        using (StreamReader reader = new StreamReader(stream, Encoding.GetEncoding(encode)))
        {
            return reader.ReadToEnd();
        }
    }
}
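The Post method above expects param to already be a form-encoded string. A minimal sketch of building such a string follows. The FormEncoder class and the field names in the usage note are hypothetical, not taken from the article. It relies on Uri.EscapeDataString, which encodes a space as %20 rather than +; servers generally decode both.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the original article.
public static class FormEncoder
{
    // Builds "key1=value1&key2=value2" with every key and value percent-encoded,
    // so characters such as '&' and '=' inside a value cannot break the form.
    public static String Encode(IEnumerable<KeyValuePair<String, String>> fields)
    {
        return String.Join("&", fields.Select(f =>
            Uri.EscapeDataString(f.Key) + "=" + Uri.EscapeDataString(f.Value)));
    }
}
```

For example, a list holding the pairs ("username", "a b") and ("password", "x&y=z") encodes to a single string that can be passed as the param argument of Post.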
HTTPS request
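When the website uses the HTTPS protocol, the code above may fail with an error that begins "The underlying connection was closed: Could not establish trust relationship for". This means the server's certificate was not trusted. One workaround is to register a certificate validation callback before creating the request: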
// IsHttpsProtocol and CheckValidationResult are helper methods that this snippet does not show.
private HttpWebRequest CreateHttpWebRequest(String url)
{
    HttpWebRequest request;
    if (IsHttpsProtocol(url))
    {
        // Register the certificate validation callback before creating the HTTPS request.
        ServicePointManager.ServerCertificateValidationCallback =
            new RemoteCertificateValidationCallback(CheckValidationResult);
        request = WebRequest.Create(url) as HttpWebRequest;
        request.ProtocolVersion = HttpVersion.Version10;
    }
    else
    {
        request = WebRequest.Create(url) as HttpWebRequest;
    }
    return request;
}
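The snippet above calls two helpers, IsHttpsProtocol and CheckValidationResult, without showing them. The following is only a guess at what they might look like, and the class name HttpsHelpers and the TrustAllCertificates flag are hypothetical. A callback that always returns true makes the trust error disappear, but it also disables certificate validation entirely, so this sketch puts that behavior behind a flag meant only for testing against hosts you control.

```csharp
using System;
using System.Net.Security;
using System.Security.Cryptography.X509Certificates;

// Hypothetical helpers; the original article does not show its own versions.
public static class HttpsHelpers
{
    // When true, every certificate is accepted. Intended for testing only.
    public static bool TrustAllCertificates = false;

    // True when the URL uses the https scheme.
    public static bool IsHttpsProtocol(String url)
    {
        return url.StartsWith("https://", StringComparison.OrdinalIgnoreCase);
    }

    // Matches the RemoteCertificateValidationCallback delegate signature.
    // Accepts the certificate if validation found no errors, or if the
    // testing flag above has been switched on.
    public static bool CheckValidationResult(object sender, X509Certificate certificate,
        X509Chain chain, SslPolicyErrors errors)
    {
        return TrustAllCertificates || errors == SslPolicyErrors.None;
    }
}
```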
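Recording cookies
Websites generally track the login state with cookies, so the cookies returned by each response must be kept and sent with later requests. Attach a shared CookieContainer to every request and record the cookies from every response:
private CookieContainer cookieContainer = new CookieContainer();

public String Get(String url, String encode = DEFAULT_ENCODE)
{
    HttpWebRequest request = WebRequest.Create(url) as HttpWebRequest;
    InitHttpWebRequestHeaders(request);
    request.Method = "GET";
    // Send the cookies collected so far with this request.
    request.CookieContainer = cookieContainer;
    HttpWebResponse response = request.GetResponse() as HttpWebResponse;
    // Record any cookies the server set in this response.
    foreach (Cookie c in response.Cookies)
    {
        cookieContainer.Add(c);
    }
    // A second GetResponse call on the same request returns the same response,
    // so ReadHtml reads the body of the response obtained above.
    return ReadHtml(request, encode);
}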
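To see what a CookieContainer does with the cookies it holds, the following offline sketch may help. It adds a cookie by hand, as a login response might set it, and then asks the container which Cookie header it would send to two different URLs. The cookie name, value, and hosts are made up for illustration. To my understanding, once a CookieContainer is assigned to an HttpWebRequest, the response's cookies are also stored in it automatically.

```csharp
using System;
using System.Net;

// Hypothetical demo class, not part of the original article.
public static class CookieDemo
{
    // Returns the Cookie header value the container would send to the given URL.
    public static String HeaderFor(CookieContainer jar, String url)
    {
        return jar.GetCookieHeader(new Uri(url));
    }

    public static void Main()
    {
        var jar = new CookieContainer();
        // Simulate a session cookie set by a login response on example.com.
        jar.Add(new Uri("http://example.com/"), new Cookie("session", "abc123", "/"));

        // Same host: the cookie is included.
        Console.WriteLine(HeaderFor(jar, "http://example.com/orders"));
        // Different host: the cookie is not sent.
        Console.WriteLine(HeaderFor(jar, "http://other.example.org/"));
    }
}
```

Because the container matches cookies to hosts and paths itself, one container can be shared by all requests in a session.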
Analysis and debugging website
The code above simulates HTTP requests, but the most important task is still analyzing the website. Typically there is no documentation and no way to reach the site's developers, so you have to explore it as a black box. There are many analysis tools; I recommend Chrome together with the Advanced Rest Client plug-in. Chrome's developer tools show what operations and requests happen in the background when a page is opened, and Advanced Rest Client can simulate sending requests. For example, when logging in to JD.com, the following data will be submitted:
We can also see that JD.com's password is actually transmitted in plain text, which is a real security concern!
You can also see the returned data:
The returned data is JSON, but what are sequences such as \u8d26? These are Unicode escape sequences, and a Unicode conversion tool can turn them into readable text. For example, the result returned this time is: the account name and password do not match, please re-enter.
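Instead of an online conversion tool, the escapes can be decoded in code. The sketch below is one possible approach using a regular expression; the class name UnicodeEscapes is hypothetical. A JSON parser would normally decode these escapes on its own when reading string values, so this is mainly useful for quick inspection of raw responses.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original article.
public static class UnicodeEscapes
{
    // Replaces each \uXXXX sequence with the UTF-16 code unit it names
    // and leaves all other text unchanged.
    public static String Decode(String text)
    {
        return Regex.Replace(text, @"\\u([0-9a-fA-F]{4})", m =>
            ((char)Int32.Parse(m.Groups[1].Value, NumberStyles.HexNumber)).ToString());
    }

    public static void Main()
    {
        // The verbatim string holds a literal backslash, as in a raw response body.
        Console.WriteLine(Decode(@"\u8d26\u6237"));
    }
}
```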
Parsing HTML
The data returned by an HTTP request is generally HTML, and sometimes JSON or XML. It has to be parsed to extract the useful data. Components for parsing HTML include:
•HTML Parser. Available on multiple platforms such as Java, C#, and Python. I have not used it for a long time.
•HtmlAgilityPack. Parses HTML and lets you query it with XPath. This is the one I use all the time. For an XPath tutorial, see W3School's XPath tutorial.
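As a rough illustration of the XPath approach, the sketch below uses HtmlAgilityPack to collect every link in a page. It assumes the HtmlAgilityPack NuGet package is installed; the class name LinkExtractor and the output format are hypothetical choices for this example.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper, not part of the original article.
public static class LinkExtractor
{
    // Returns one "href | link text" entry for each <a> element that has an href.
    public static List<String> ExtractLinks(String html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var links = new List<String>();
        // SelectNodes returns null, not an empty collection, when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
        if (nodes == null)
        {
            return links;
        }
        foreach (HtmlNode node in nodes)
        {
            links.Add(node.GetAttributeValue("href", "") + " | " + node.InnerText.Trim());
        }
        return links;
    }
}
```

The same pattern applies to other data: change the XPath expression to select the elements of interest, then read their attributes or InnerText.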
Conclusion
This article covered the skills needed to develop automated web page operations: simulating HTTP/HTTPS requests, handling cookies, analyzing websites, and parsing HTML. The code is meant to illustrate usage; it is not complete and may not run as-is.