


Why do you get 403 errors when using a crawler to crawl Zhihu content?
I want to capture Zhihu users' follow relationships, e.g. to see who user A follows, and get the followee list from the page www.zhihu.com/people/XXX/followees, but I run into a 403 error while crawling.
1. The crawler only collects users' follow information, for academic research; it is not for commercial or any other purpose.
2. I use PHP, construct requests with curl, and parse documents with simple_html_dom.
3. The user's followee list is loaded incrementally via Ajax, so I want to crawl the interface data directly. Through Firebug I can see that loading more followees seems to go through http://www.zhihu.com/node/ProfileFolloweesListV2, and the POST data includes _xsrf, method, and params. So, while simulating a logged-in session, I submitted a request to this URL with the required POST parameters, but 403 is returned.
4. With the same simulated login, I can parse data that does not require Ajax, such as the numbers of upvotes and thanks.
5. I use curl_setopt($ch, CURLOPT_HTTPHEADER, $header); to set the request headers so that they match the headers my browser submits, but this still results in a 403.
6. I tried to print out the request headers curl actually sends and compare them with the headers the browser sends, but couldn't find the right way (according to Baidu, curl_getinfo() seems to print the corresponding information).
7. Many people hit 403 because User-Agent or X-Requested-With was not set, but I set both when setting the request headers as described in 5.
8. If the description is unclear and code is needed, I can post the code.
9. This crawler is part of my graduation project and I need the data for subsequent work. As stated in 1, the crawling is purely for academic research.
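For reference, one way to verify exactly what the client sends (points 5 and 6) is to build the request object without sending it and print its headers and body for comparison with the browser's. A minimal sketch using Python's standard library rather than PHP curl; the header values and the _xsrf token below are placeholders, not real ones:

import urllib.parse
import urllib.request

# Placeholder header values mirroring what a browser might send.
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:39.0) Gecko/20100101 Firefox/39.0',
    'X-Requested-With': 'XMLHttpRequest',
    'Referer': 'http://www.zhihu.com/people/XXX/followees',
}
fields = {'method': 'next', 'params': '{"offset":0}', '_xsrf': 'placeholder'}
body = urllib.parse.urlencode(fields).encode()

# Build the request without sending it, so the outgoing headers
# and body can be printed and compared with the browser's.
req = urllib.request.Request('http://www.zhihu.com/node/ProfileFolloweesListV2',
                             data=body, headers=headers)
print(req.header_items())
print(req.data)

In PHP curl the equivalent check is setting CURLINFO_HEADER_OUT to true and reading the `request_header` entry from curl_getinfo() after the transfer.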
Reply content:
If the server has firewall protection, continuous crawling may get you blocked unless you have many proxy servers; the simplest alternative is to keep redialing an ADSL connection to change your IP address. Start by studying the HTTP headers of the request in a browser, then capture accordingly. In the past two days I happened to write a crawler that captures users' followees and followers, in Python. Here is the Python code; compare it with yours to find the problem.
A 403 usually means some data in the request was sent incorrectly. The code below opens a text file whose lines are user IDs; I put a screenshot of the file's format at the end.

#encoding=utf8
import urllib2
import json
import requests
from bs4 import BeautifulSoup

Default_Header = {'X-Requested-With': 'XMLHttpRequest',
                  'Referer': 'http://www.zhihu.com',
                  'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; '
                                'rv:39.0) Gecko/20100101 Firefox/39.0',
                  'Host': 'www.zhihu.com'}
_session = requests.session()
_session.headers.update(Default_Header)
resourceFile = open('/root/Desktop/UserId.text','r')
resourceLines = resourceFile.readlines()
resultFollowerFile = open('/root/Desktop/userIdFollowees.text','a+')
resultFolloweeFile = open('/root/Desktop/userIdFollowers.text','a+')

BASE_URL = 'https://www.zhihu.com/'
CAPTURE_URL = BASE_URL + 'captcha.gif?r=1466595391805&type=login'
PHONE_LOGIN = BASE_URL + 'login/phone_num'

def login():
    '''Log in to Zhihu'''
    username = ''  # username
    password = ''  # password; note this logs in by phone number -- for email login, change the login URL below
    cap_content = urllib2.urlopen(CAPTURE_URL).read()
    cap_file = open('/root/Desktop/cap.gif','wb')
    cap_file.write(cap_content)
    cap_file.close()
    captcha = raw_input('capture:')
    data = {"phone_num":username,"password":password,"captcha":captcha}
    r = _session.post(PHONE_LOGIN, data)
    print (r.json())['msg']

def readFollowerNumbers(followerId, followType):
    '''Read each user's followees or followers, depending on followType'''
    print followerId
    personUrl = 'https://www.zhihu.com/people/' + followerId.strip('\n')
    xsrf = getXsrf()
    hash_id = getHashId(personUrl)
    headers = dict(Default_Header)
    headers['Referer'] = personUrl + '/follow' + followType
    followerUrl = 'https://www.zhihu.com/node/ProfileFollow' + followType + 'ListV2'
    params = {"offset":0,"order_by":"created","hash_id":hash_id}
    params_encode = json.dumps(params)
    data = {"method":"next","params":params_encode,'_xsrf':xsrf}
    signIndex = 20
    offset = 0
    while signIndex == 20:
        params['offset'] = offset
        data['params'] = json.dumps(params)
        followerUrlJSON = _session.post(followerUrl, data=data, headers=headers)
        signIndex = len((followerUrlJSON.json())['msg'])
        offset = offset + signIndex
        followerHtml = (followerUrlJSON.json())['msg']
        for everHtml in followerHtml:
            everHtmlSoup = BeautifulSoup(everHtml)
            personId = everHtmlSoup.a['href']
            resultFollowerFile.write(personId + '\n')
            print personId

def getXsrf():
    '''Get the _xsrf token of the currently logged-in user'''
    soup = BeautifulSoup(_session.get(BASE_URL).content)
    _xsrf = soup.find('input', attrs={'name':'_xsrf'})['value']
    return _xsrf

def getHashId(personUrl):
    '''This is the hash_id of the user being crawled, not that of the logged-in user'''
    soup = BeautifulSoup(_session.get(personUrl).content)
    hashIdText = soup.find('script', attrs={'data-name': 'current_people'})
    return json.loads(hashIdText.text)[3]

def main():
    login()
    followType = input('Capture type: 0 - who the user follows; other - who follows the user')
    followType = 'ees' if followType == 0 else 'ers'
    for followerId in resourceLines:
        try:
            readFollowerNumbers(followerId, followType)
            resultFollowerFile.flush()
        except:
            pass

if __name__ == '__main__':
    main()
A 403 here is usually one of:
- No cookies sent
- Wrong _xsrf or hash_id
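On the second point: the _xsrf the server accepts is the one embedded in the page it served to your current session; a stale or foreign token gives a 403. A minimal sketch of pulling it out of the page source (the markup below is illustrative, not Zhihu's exact HTML):

import re

# Illustrative fragment of the page; the real markup differs.
html = '<form><input type="hidden" name="_xsrf" value="abc123def"/></form>'

# The token must come from a page fetched with the same session/cookies
# that will later make the Ajax POST.
match = re.search(r'name="_xsrf" value="([^"]+)"', html)
xsrf = match.group(1) if match else None
print(xsrf)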
/// <summary>
/// Post a question to Zhihu
/// </summary>
/// <param name="question_title">Question title</param>
/// <param name="question_detail">Detail content</param>
/// <param name="cookie">Cookie obtained after logging in</param>
public void ZhiHuFaTie(string question_title, string question_detail, CookieContainer cookie)
{
    question_title = "Question Content";
    question_detail = "Question Detailed Description";
    // Traverse the cookies and get the value of _xsrf
    var xsrf = "";
    var list = GetAllCookies(cookie);
    foreach (var item in list)
    {
        if (item.Name == "_xsrf")
        {
            xsrf = item.Value;
            break;
        }
    }
    // Post
    var FaTiePostUrl = "http://www.zhihu.com/question/add";
    var dd = topicStr.ToCharArray();
    var FaTiePostStr = "question_title=" + HttpUtility.UrlEncode(question_title) + "&question_detail=" + HttpUtility.UrlEncode(question_detail) + "&anon=0&topic_ids=" + topicId + "&new_topics=&_xsrf=" + xsrf;
    var FaTieResult = nhp.PostResultHtml(FaTiePostUrl, cookie, "http://www.zhihu.com/", FaTiePostStr);
}
/// <summary>
/// Traverse a CookieContainer
/// </summary>
/// <param name="cc"></param>
/// <returns></returns>
public static List<Cookie> GetAllCookies(CookieContainer cc)
{
    List<Cookie> lstCookies = new List<Cookie>();
    Hashtable table = (Hashtable)cc.GetType().InvokeMember("m_domainTable",
        System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.GetField |
        System.Reflection.BindingFlags.Instance, null, cc, new object[] { });
    foreach (object pathList in table.Values)
    {
        SortedList lstCookieCol = (SortedList)pathList.GetType().InvokeMember("m_list",
            System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.GetField
            | System.Reflection.BindingFlags.Instance, null, pathList, new object[] { });
        foreach (CookieCollection colCookies in lstCookieCol.Values)
            foreach (Cookie c in colCookies) lstCookies.Add(c);
    }
    return lstCookies;
}

Modify the X-Forwarded-For field of the header to disguise the IP address.

What a coincidence, I ran into this problem just last night. There can be many causes; I'll only describe the one I hit, for reference and as an idea. I was crawling Sina Weibo through a proxy, and the 403 appeared because the site rejected the access. The same thing happens in a browser: if I just open a few pages I get a 403, but it goes away after refreshing a few times. In code, the equivalent is simply to retry the request several times.

After reading the answers above I was stunned -- so many capable people here. But I'd suggest you ask Kai-Fu Lee ~ haha. Speaking of which, how do you even capture that interface? I can't catch it with Firebug, and Chrome's Network panel can't catch it either.
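The retry idea from the Weibo answer can be sketched as a small helper; `fetch` here is any hypothetical function returning (status, body), injected so the logic is independent of the HTTP library:

import time

def fetch_with_retry(fetch, url, retries=4, delay=0.1):
    """Retry on 403, since some sites reject a request intermittently
    and accept the same request a moment later."""
    status, body = None, None
    for _ in range(retries):
        status, body = fetch(url)
        if status != 403:
            return status, body
        time.sleep(delay)
    return status, body

# Demo with a fake fetch that rejects the first two attempts.
calls = {'n': 0}
def fake_fetch(url):
    calls['n'] += 1
    return (403, None) if calls['n'] < 3 else (200, 'ok')

print(fetch_with_retry(fake_fetch, 'http://example.com'))  # (200, 'ok')

Keep the delay modest and the retry count small; hammering a server that is rate-limiting you tends to make the blocking stricter.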
Actually, you can get the data by requesting the followees page directly; the rest is just regular expressions.
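The "regular expressions" part of that last answer might look like this; the fragments below only mimic the shape of the entries in the Ajax response's msg array (not Zhihu's exact markup):

import re

# Illustrative HTML fragments, one per followee entry.
fragments = [
    '<a class="zg-link" href="https://www.zhihu.com/people/alice">Alice</a>',
    '<a class="zg-link" href="https://www.zhihu.com/people/bob">Bob</a>',
]

# Pull the profile slug out of each fragment's link.
ids = [re.search(r'/people/([^"/]+)"', f).group(1) for f in fragments]
print(ids)  # ['alice', 'bob']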
