Home Backend Development C#.Net Tutorial Method for simulating HTTP requests to realize automatic operation of web pages and data collection

Method for simulating HTTP requests to realize automatic operation of web pages and data collection

Apr 20, 2017 pm 05:26 PM

The following editor will bring you an article on how to simulate HTTP requests to realize automatic operation and data collection of web pages. The editor thinks it’s pretty good, so I’ll share it with you now and give it as a reference. Let’s follow the editor to take a look.

Preface

Web pages can be divided into information provision and business operation categories. Information provision such as news, stocks Quotes and other websites. Business operations such as online business hall, OA and so on. Of course, there are many websites that have both properties at the same time. Websites such as Weibo, Douban, and Taobao not only provide information but also implement certain businesses.

Ordinary Internet access methods are generally manual operations (this does not require explanation: D). But sometimes manual operations may not be enough, such as crawling a large amount of data on the Internet, monitoring changes in a page in real time, batch operations (such as batch posting on Weibo, batch Taobao shopping), brushing orders, etc. Due to the large amount of operations and the repetitive operations, manual operations are inefficient and error-prone. At this time, you can use software to automatically operate.

I have developed a number of such software, including web crawlers and automatic batch operation businesses. One of the core functions used is to simulate HTTP requests. Of course, the HTTPS protocol is sometimes used, and the website generally needs to be logged in before further operations can be performed. The most important point is to understand the business process of the website, that is, to know when and how to submit to which page in order to achieve a certain operation. What data, finally, to extract the data or know the results of the operation, you also need to parse the HTML. This article will explain them one by one.

This article uses C# language to display the code. Of course, it can also be implemented in other languages. The principle is the same. Take logging into JD.com as an example.

Simulating HTTP requests

C# To simulate HTTP requests, you need to use the following classes:

•WebRequest

##•HttpWebRequest

•HttpWebResponse

•Stream

First create a request object (HttpWebRequest), set the relevant Headers information and then send the request (if it is POST, also write the form data to the network stream), if the target address is accessible, a response object (HttpWebResponse) will be obtained, and the return result can be read from the network stream of the corresponding object.

The sample code is as follows:

String contentType = "application/x-www-form-urlencoded";
String accept = "image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/x-shockwave-flash, application/x-silverlight,
 application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, application/x-ms-application, application/x-ms-xbap,
  application/vnd.ms-xpsdocument, application/xaml+xml, application/x-silverlight-2-b1, */*";
String userAgent = "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.116 Safari/537.36";

public String Get(String url, String encode = DEFAULT_ENCODE)
{
   HttpWebRequest request = WebRequest.Create(url) as HttpWebRequest;
   InitHttpWebRequestHeaders(request);
   request.Method = "GET";
   var html = ReadHtml(request, encode);
   return html;
}

public String Post(String url, String param, String encode = DEFAULT_ENCODE)
{
   Encoding encoding = System.Text.Encoding.UTF8;
   byte[] data = encoding.GetBytes(param);
   HttpWebRequest request = WebRequest.Create(url) as HttpWebRequest;
   InitHttpWebRequestHeaders(request);
   request.Method = "POST";
   request.ContentLength = data.Length;
   var outstream = request.GetRequestStream();
   outstream.Write(data, 0, data.Length);
   var html = ReadHtml(request, encode);
   return html;
}

private void InitHttpWebRequestHeaders(HttpWebRequest request)
{
  request.ContentType = contentType;
  request.Accept = accept;
  request.UserAgent = userAgent;
}

private String ReadHtml(HttpWebRequest request, String encode)
{
  HttpWebResponse response = request.GetResponse() as HttpWebResponse;
  Stream stream = response.GetResponseStream();
  StreamReader reader = new StreamReader(stream, Encoding.GetEncoding(encode));
  String content = reader.ReadToEnd();
  reader.Close();
  stream.Close();
  return content;
}
Copy after login

It can be seen that most of the codes of the Get and Post methods are similar, so the codes are encapsulated. Extracted the same code as a new function.

HTTPS request

When the website uses https protocol, the following error may occur in the above code:


The underlying connection was closed: Could not establish trust relationship for
Copy after login

The reason is a certificate error. When opening it with a browser, the following page will appear:

Method for simulating HTTP requests to realize automatic operation of web pages and data collection

When you click to continue to xxx.xx (unsafe), You can continue to open the web page. In the program, you only need to simulate this step to continue. In C#, you only need to set the ServicePointManager.ServerCertificateValidationCallback proxy and return true directly in the proxy method.


private HttpWebRequest CreateHttpWebRequest(String url)
{
  HttpWebRequest request;
  if (IsHttpsProtocol(url))
  {
    ServicePointManager.ServerCertificateValidationCallback = new RemoteCertificateValidationCallback(CheckValidationResult);
    request = WebRequest.Create(url) as HttpWebRequest;
    request.ProtocolVersion = HttpVersion.Version10;
  }
  else
  {
    request = WebRequest.Create(url) as HttpWebRequest;
  }

  return request;
}

private HttpWebRequest CreateHttpWebRequest(String url)
{
  HttpWebRequest request;
  if (IsHttpsProtocol(url))
  {
    ServicePointManager.ServerCertificateValidationCallback = new RemoteCertificateValidationCallback(CheckValidationResult);
    request = WebRequest.Create(url) as HttpWebRequest;
    request.ProtocolVersion = HttpVersion.Version10;
  }
  else
  {
    request = WebRequest.Create(url) as HttpWebRequest;
  }

  return request;
}
Copy after login

In this way, you can access the https website normally.

Record Cookies to achieve identity authentication

Some websites require logging in to perform the next step. For example, shopping on JD.com requires logging in first. The website server uses sessions to record client users. Each session corresponds to a user, and the previous code will re-establish a session every time a request is created. Even if the login is successful, the login will be invalid because a new connection is created during the next step. At this time, you have to find a way to make the server think that this series of requests come from the same session.

The client only has Cookies. In order to let the server know which session the client corresponds to during the next request, there will be a record of the session ID in the Cookies. Therefore, as long as the cookies are the same, it is the same user to the server.

You need to use CookieContainer at this time. As the name suggests, this is a Cookies container. HttpWebRequest has a CookieContainer property. As long as the cookies for each request are recorded in CookieContainer, the CookieContainer attribute of HttpWebRequest is set on the next request. Since the cookies are the same, it is the same user to the server.


public String Get(String url, String encode = DEFAULT_ENCODE)
{
   HttpWebRequest request = WebRequest.Create(url) as HttpWebRequest;
   InitHttpWebRequestHeaders(request);
   request.Method = "GET";

   request.CookieContainer = cookieContainer;
   HttpWebResponse response = request.GetResponse() as HttpWebResponse;
   foreach (Cookie c in response.Cookies)
   {
      cookieContainer.Add(c);
   }
}
Copy after login

Analysis and debugging website

The above realizes the simulated HTTP request, of course, the most important thing Or an analysis station. The general situation is that there is no documentation, no website developer can be found, and exploration starts from a black box. There are many analysis tools. It is recommended to use the Chrome+ plug-in Advanced Rest Client. Chrome's developer tools allow us to know what operations and requests are made in the background when opening a web page. Advanced Rest Client can simulate sending requests.

For example, when logging in to JD.com, the following data will be submitted:

Method for simulating HTTP requests to realize automatic operation of web pages and data collection

We can also see that Jingdong’s password is actually transmitted in clear text, which is very worrying about security!

You can also see the returned data:

Method for simulating HTTP requests to realize automatic operation of web pages and data collection

The returned data is JSON data, but\u8d26What are these? In fact, this is Unicode encoding. You can use the Unicode encoding conversion tool to convert it into readable text. For example, the result returned this time is: the account name and password do not match, please re-enter.

Parsing HTML

The data obtained by HTTP request is generally in HTML format, and sometimes it may be Json or XML. Parsing is required to extract useful data. The components that parse HTML are:

•HTML Parser. Available on multiple platforms such as Java/C#/Python. Haven't used it for a long time.

•HtmlAgilityPack. By parsing HMTL via XPath. Used all the time. For XPath tutorials, you can see W3School's XPath tutorials.

Conclusion

This article introduces the skills required to develop simulated automatic web page operations, from simulating HTTP/HTTPS requests, to cookies, and analyzing websites , parse HTML. The code is intended to illustrate usage and is not complete code and may not be run directly.

The above is the detailed content of Method for simulating HTTP requests to realize automatic operation of web pages and data collection. For more information, please follow other related articles on the PHP Chinese website!

Statement of this Website
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

Video Face Swap

Video Face Swap

Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Tools

Notepad++7.3.1

Notepad++7.3.1

Easy-to-use and free code editor

SublimeText3 Chinese version

SublimeText3 Chinese version

Chinese version, very easy to use

Zend Studio 13.0.1

Zend Studio 13.0.1

Powerful PHP integrated development environment

Dreamweaver CS6

Dreamweaver CS6

Visual web development tools

SublimeText3 Mac version

SublimeText3 Mac version

God-level code editing software (SublimeText3)

How to send email using PHP HTTP request How to send email using PHP HTTP request May 21, 2023 pm 07:10 PM

PHP is a widely used programming language, and one of its common applications is sending emails. In this article, we will discuss how to send emails using HTTP requests. We will introduce this topic from the following aspects: What is an HTTP request? Basic principles of sending emails using PHP. Sending HTTP requests. Sample code for sending emails. What is an HTTP request? An HTTP request refers to a request sent to a web server to obtain a web resource. . HTTP is a protocol used in web browsers and we

From start to finish: How to use php extension cURL to make HTTP requests From start to finish: How to use php extension cURL to make HTTP requests Jul 29, 2023 pm 05:07 PM

From start to finish: How to use php extension cURL for HTTP requests Introduction: In web development, it is often necessary to communicate with third-party APIs or other remote servers. Using cURL to make HTTP requests is a common and powerful way. This article will introduce how to use PHP to extend cURL to perform HTTP requests, and provide some practical code examples. 1. Preparation First, make sure that php has the cURL extension installed. You can execute php-m|grepcurl on the command line to check

How to solve the problem of HTTP request connection refused in Java development How to solve the problem of HTTP request connection refused in Java development Jun 29, 2023 pm 02:29 PM

How to solve the problem of HTTP request connection being refused in Java development. In Java development, we often encounter the problem of HTTP request connection being refused. This problem may occur because the server side restricts access rights, or the network firewall blocks access to HTTP requests. Fixing this problem requires some adjustments to your code and environment. This article will introduce several common solutions. Check the network connection and server status. First, confirm that your network connection is normal. You can try to access other websites or services to see

Cause analysis: HTTP request error 504 gateway timeout Cause analysis: HTTP request error 504 gateway timeout Feb 19, 2024 pm 05:12 PM

Brief introduction to the reason for the http request error: 504GatewayTimeout: During network communication, the client interacts with the server by sending HTTP requests. However, sometimes we may encounter some error messages during the process of sending the request. One of them is the 504GatewayTimeout error. This article will explore the causes and solutions to this error. What is the 504GatewayTimeout error? GatewayTimeo

Solution: Socket Error when handling HTTP requests Solution: Socket Error when handling HTTP requests Feb 25, 2024 pm 09:24 PM

http request error: Solution to SocketError When making network requests, we often encounter various errors. One of the common problems is SocketError. This error is thrown when our application cannot establish a connection with the server. In this article, we will discuss some common causes and solutions of SocketError. First, we need to understand what Socket is. Socket is a communication protocol that allows applications to

Set query parameters for HTTP requests using Golang Set query parameters for HTTP requests using Golang Jun 02, 2024 pm 03:27 PM

To set query parameters for HTTP requests in Go, you can use the http.Request.URL.Query().Set() method, which accepts query parameter names and values ​​as parameters. Specific steps include: Create a new HTTP request. Use the Query().Set() method to set query parameters. Encode the request. Execute the request. Get the value of a query parameter (optional). Remove query parameters (optional).

How Nginx implements retry configuration of HTTP requests How Nginx implements retry configuration of HTTP requests Nov 08, 2023 pm 04:47 PM

How Nginx implements HTTP request retry configuration requires specific code examples. Nginx is a very popular open source reverse proxy server. It has powerful functions and flexible configuration options and can be used to implement HTTP request retry configuration. In network communication, sometimes the HTTP request we initiate may fail due to various reasons, such as network delay, server load, etc. In order to improve the reliability and stability of the application, we may need to retry when the request fails. The following will introduce how to use Ng

How to use Nginx for compression and decompression of HTTP requests How to use Nginx for compression and decompression of HTTP requests Aug 02, 2023 am 10:09 AM

How to use Nginx to compress and decompress HTTP requests Nginx is a high-performance web server and reverse proxy server that is powerful and flexible. When processing HTTP requests, you can use the gzip and gunzip modules provided by Nginx to compress and decompress the requests to reduce the amount of data transmission and improve the request response speed. This article will introduce the specific steps of how to use Nginx to compress and decompress HTTP requests, and provide corresponding code examples. Configure gzip module

See all articles