An example of obtaining proxy IPs with Python
This article shares an example of obtaining proxy IPs with Python. It may be a useful reference for anyone who needs it.
When we crawl data, some websites block repeated visits from the same IP. In that case we can use a proxy IP to disguise ourselves before each visit so that the "enemy" cannot detect us.
OK, let's get started!
This is the file that gets the proxy IPs. I made it a module and divided it into three functions.
Note: some of the comments in the code are in English, simply because a word or two of English is quicker to type while coding.
#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
author: dasuda
"""
# Note: this script targets Python 2 (urllib2, print statements).
import urllib2
import re
import socket
import threading

findIP = []            # raw IP strings scraped from the page
IP_data = []           # "ip:port" strings after joining the port
IP_data_checked = []   # proxies that passed the availability check
findPORT = []          # ports that correspond to the scraped IPs
available_table = []   # indexes into IP_data of the usable proxies
print_lock = threading.Lock()  # a single lock shared by all checker threads


def getIP(url_target):
    patternIP = re.compile(r'(?<=<td>)[\d]{1,3}\.[\d]{1,3}\.[\d]{1,3}\.[\d]{1,3}')
    patternPORT = re.compile(r'(?<=<td>)[\d]{2,5}(?=</td>)')
    print "now,start to refresh proxy IP..."
    for page in range(1, 4):
        url = 'http://www.xicidaili.com/nn/' + str(page)
        headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64)"}
        request = urllib2.Request(url=url, headers=headers)
        response = urllib2.urlopen(request)
        content = response.read()
        findIP = re.findall(patternIP, str(content))
        findPORT = re.findall(patternPORT, str(content))
        # assemble the ip and port
        for i in range(len(findIP)):
            findIP[i] = findIP[i] + ":" + findPORT[i]
        IP_data.extend(findIP)
        print "get page", page
    print "refresh done!!!"
    # use multithreading
    mul_thread_check(url_target)
    return IP_data_checked


def check_one(url_check, i):
    # setting timeout
    socket.setdefaulttimeout(8)
    try:
        ppp = {"http": IP_data[i]}
        proxy_support = urllib2.ProxyHandler(ppp)
        openercheck = urllib2.build_opener(proxy_support)
        request = urllib2.Request(url_check)
        request.add_header('User-Agent', "Mozilla/5.0 (Windows NT 10.0; WOW64)")
        # open through this thread's own opener; install_opener() would
        # replace a process-wide opener that every thread shares
        html = openercheck.open(request).read()
        with print_lock:
            print IP_data[i], 'is OK'
            # get available ip index
            available_table.append(i)
    except Exception as e:
        with print_lock:
            print 'error'


def mul_thread_check(url_mul_check):
    threads = []
    for i in range(len(IP_data)):
        # create thread...
        thread = threading.Thread(target=check_one, args=[url_mul_check, i])
        threads.append(thread)
        thread.start()
        print "new thread start", i
    for thread in threads:
        thread.join()
    # get the IP_data_checked[]
    for error_cnt in range(len(available_table)):
        aseemble_ip = {'http': IP_data[available_table[error_cnt]]}
        IP_data_checked.append(aseemble_ip)
    print "available proxy ip:", len(available_table)
1. getIP(url_target): the main function. Its parameter is the URL used to verify that a proxy IP is usable; ipchina is recommended.
The proxy IPs are obtained from the http://www.xicidaili.com/nn/ website, which provides free proxy IPs. Not all of them can be used: depending on your geographical location, network conditions, the target server and so on, probably fewer than 20% work, at least in my case.
The http://www.xicidaili.com/nn/ website is accessed with an ordinary request, and the required IPs and their ports are extracted from the returned page with regular expressions. The code is as follows:
patternIP = re.compile(r'(?<=<td>)[\d]{1,3}\.[\d]{1,3}\.[\d]{1,3}\.[\d]{1,3}')
patternPORT = re.compile(r'(?<=<td>)[\d]{2,5}(?=</td>)')
...
findIP = re.findall(patternIP,str(content))
findPORT = re.findall(patternPORT,str(content))
For how to construct regular expressions, you can refer to other articles.
The IPs are stored in findIP and the corresponding ports in findPORT; the two lists correspond to each other by index. A page normally yields 100 IPs.
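To see what the two patterns actually match, here is a small sketch in Python 3 that runs them against a made-up table fragment. The sample HTML and the name extract are assumptions for illustration only; the real page layout may differ.

```python
import re

# The same two patterns used in the article.
PATTERN_IP = re.compile(r'(?<=<td>)[\d]{1,3}\.[\d]{1,3}\.[\d]{1,3}\.[\d]{1,3}')
PATTERN_PORT = re.compile(r'(?<=<td>)[\d]{2,5}(?=</td>)')

# Hypothetical table rows, invented for this example.
SAMPLE_HTML = """
<tr><td>10.0.0.1</td><td>8080</td><td>HTTP</td></tr>
<tr><td>10.0.0.2</td><td>3128</td><td>HTTP</td></tr>
"""


def extract(html):
    """Return (ips, ports) found in html, in document order."""
    ips = PATTERN_IP.findall(html)
    ports = PATTERN_PORT.findall(html)
    return ips, ports


ips, ports = extract(SAMPLE_HTML)
print(ips)    # ['10.0.0.1', '10.0.0.2']
print(ports)  # ['8080', '3128']
```

Note that the port pattern matches any table cell holding only 2 to 5 digits, so a page with other purely numeric cells would put the two lists out of step.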
Next, each IP is joined with its port.
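As a variation on the index loop in the full listing, the joining step could be sketched in Python 3 like this. The function name join_ip_port is hypothetical, and raising on a length mismatch is my own choice, not something the original code does.

```python
def join_ip_port(ips, ports):
    """Pair each IP with the port at the same index and return "ip:port" strings."""
    if len(ips) != len(ports):
        # The lists are matched purely by position, so a mismatch means
        # at least one IP would be paired with the wrong port.
        raise ValueError("got %d IPs but %d ports" % (len(ips), len(ports)))
    return [ip + ":" + port for ip, port in zip(ips, ports)]


# Placeholder values, not real proxies.
print(join_ip_port(["10.0.0.1", "10.0.0.2"], ["8080", "3128"]))
```

Failing loudly here seems safer than silently building wrong pairs, because a bad pair would only show up later as a proxy that never works.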
Finally, the availability check is run.
2. check_one(url_check, i): the thread function
url_check is requested in the usual way, but through the proxy at index i. If the web page is returned, the proxy IP is usable, and its index is recorded so that all usable IPs can be extracted later.
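For readers on Python 3, where urllib2 has become urllib.request, a single check might look roughly like the sketch below. The function name check_proxy and the choice to return a boolean instead of recording an index are assumptions of this sketch.

```python
import urllib.request


def check_proxy(proxy, url_check, timeout=8):
    """Return True if url_check can be fetched through proxy (an "ip:port" string)."""
    handler = urllib.request.ProxyHandler({"http": proxy})
    # A private opener per call, so concurrent checks do not share state.
    opener = urllib.request.build_opener(handler)
    request = urllib.request.Request(
        url_check,
        headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64)"},
    )
    try:
        with opener.open(request, timeout=timeout) as response:
            response.read()
        return True
    except Exception:
        # Any failure (refused connection, timeout, HTTP error) counts as unusable.
        return False
```

Passing the timeout to open() limits it to this request, whereas socket.setdefaulttimeout() changes the default for every socket in the process.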
3. mul_thread_check(url_mul_check): starts the threads
This function checks proxy availability with multiple threads, starting one thread per IP.
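The one-thread-per-IP pattern can be sketched in Python 3 as follows. The names check_all and is_usable are hypothetical; is_usable stands in for whatever function performs the real network check, which keeps the sketch runnable without a network.

```python
import threading


def check_all(proxies, is_usable):
    """Run is_usable(proxy) in one thread per proxy and return the usable
    ones as {"http": "ip:port"} dicts, in their original order."""
    available = []           # indexes of the proxies that passed
    lock = threading.Lock()  # one lock shared by every worker thread

    def worker(index):
        if is_usable(proxies[index]):
            with lock:
                available.append(index)

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(len(proxies))]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()        # wait until every check has finished
    return [{"http": proxies[i]} for i in sorted(available)]


# A stand-in checker that accepts everything except one placeholder proxy.
result = check_all(["10.0.0.1:8080", "10.0.0.2:3128"],
                   lambda proxy: proxy != "10.0.0.2:3128")
print(result)
```

The lock only protects the shared list if every thread uses the same lock object, which is why it is created once outside the worker.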
To use the module, call getIP() and pass in the URL used to check availability. It returns a list of the proxies that passed the availability check, in the format
[{'http': 'ip1:port1'}, {'http': 'ip2:port2'}, ...]
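One possible way to use the returned list, sketched in Python 3, is to pick a proxy at random before each visit. The function name opener_with_random_proxy and the sample addresses are assumptions; the sketch expects each entry to be a dict of the form {'http': 'ip:port'}, which is what the full listing appends to IP_data_checked.

```python
import random
import urllib.request


def opener_with_random_proxy(checked_proxies):
    """Pick one checked proxy at random and build an opener that uses it."""
    proxy = random.choice(checked_proxies)        # e.g. {"http": "ip:port"}
    handler = urllib.request.ProxyHandler(proxy)
    return urllib.request.build_opener(handler), proxy


# Placeholder data; in practice this would be the list returned by getIP().
checked = [{"http": "10.0.0.1:8080"}, {"http": "10.0.0.2:3128"}]
opener, chosen = opener_with_random_proxy(checked)
print("the next request would go through", chosen["http"])
# opener.open("http://example.com/", timeout=8) would send the request via that proxy.
```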
