


Detailed explanation of python3 Baidu index crawling example
This article mainly introduces python3 Baidu index crawling. The editor thinks it is quite good. Now I will share it with you and give it as a reference. Let’s follow the editor to have a look.
Catch the Baidu index, and then use image recognition to get the index
Foreword:
Tufu once said that the Baidu index is difficult to grasp. On Taobao, it costs 20 yuan per unit. Keywords:
How could someone be frightened by someone with such a big mouth, so it took me about 2 and a half days to complete it. I despise Tufu
There are many installed libraries:
谷歌图像识别tesseract-ocr pip3 install pillow pip3 install pyocr selenium2.45 Chrome47.0.2526.106 m or Firebox32.0.1 chromedriver.exe
You need to log in to enter the Baidu Index. The login account and password are written in the text account:
The universal login code is as follows:
# 打开浏览器 def openbrowser(): global browser # http://www.php.cn/ url = "https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F" # 打开谷歌浏览器 # Firefox() # Chrome() browser = webdriver.Chrome() # 输入网址 browser.get(url) # 打开浏览器时间 # print("等待10秒打开浏览器...") # time.sleep(10) # 找到id="TANGRAM__PSP_3__userName"的对话框 # 清空输入框 browser.find_element_by_id("TANGRAM__PSP_3__userName").clear() browser.find_element_by_id("TANGRAM__PSP_3__password").clear() # 输入账号密码 # 输入账号密码 account = [] try: fileaccount = open("../baidu/account.txt") accounts = fileaccount.readlines() for acc in accounts: account.append(acc.strip()) fileaccount.close() except Exception as err: print(err) input("请正确在account.txt里面写入账号密码") exit() browser.find_element_by_id("TANGRAM__PSP_3__userName").send_keys(account[0]) browser.find_element_by_id("TANGRAM__PSP_3__password").send_keys(account[1]) # 点击登陆登陆 # id="TANGRAM__PSP_3__submit" browser.find_element_by_id("TANGRAM__PSP_3__submit").click() # 等待登陆10秒 # print('等待登陆10秒...') # time.sleep(10) print("等待网址加载完毕...") select = input("请观察浏览器网站是否已经登陆(y/n):") while 1: if select == "y" or select == "Y": print("登陆成功!") print("准备打开新的窗口...") # time.sleep(1) # browser.quit() break elif select == "n" or select == "N": selectno = input("账号密码错误请按0,验证码出现请按1...") # 账号密码错误则重新输入 if selectno == "0": # 找到id="TANGRAM__PSP_3__userName"的对话框 # 清空输入框 browser.find_element_by_id("TANGRAM__PSP_3__userName").clear() browser.find_element_by_id("TANGRAM__PSP_3__password").clear() # 输入账号密码 account = [] try: fileaccount = open("../baidu/account.txt") accounts = fileaccount.readlines() for acc in accounts: account.append(acc.strip()) fileaccount.close() except Exception as err: print(err) input("请正确在account.txt里面写入账号密码") exit() browser.find_element_by_id("TANGRAM__PSP_3__userName").send_keys(account[0]) browser.find_element_by_id("TANGRAM__PSP_3__password").send_keys(account[1]) # 点击登陆sign in # id="TANGRAM__PSP_3__submit" browser.find_element_by_id("TANGRAM__PSP_3__submit").click() elif selectno == "1": # 验证码的id为id="ap_captcha_guess"的对话框 input("请在浏览器中输入验证码并登陆...") select = input("请观察浏览器网站是否已经登陆(y/n):") else: print("请输入“y”或者“n”!") select = input("请观察浏览器网站是否已经登陆(y/n):")
Login page:
After logging in, you need to open a new window, that is, open the Baidu Index and switch windows. Use selenium:
# 新开一个窗口,通过执行js来新开一个窗口 js = 'window.open("http://index.baidu.com");' browser.execute_script(js) # 新窗口句柄切换,进入百度指数 # 获得当前打开所有窗口的句柄handles # handles为一个数组 handles = browser.window_handles # print(handles) # 切换到当前最新打开的窗口 browser.switch_to_window(handles[-1])
Clear the input box and construct the number of click days:
# 清空输入框 browser.find_element_by_id("schword").clear() # 写入需要搜索的百度指数 browser.find_element_by_id("schword").send_keys(keyword) # 点击搜索 # <input type="submit" value="" id="searchWords" onclick="searchDemoWords()"> browser.find_element_by_id("searchWords").click() time.sleep(2) # 最大化窗口 browser.maximize_window() # 构造天数 sel = int(input("查询7天请按0,30天请按1,90天请按2,半年请按3:")) day = 0 if sel == 0: day = 7 elif sel == 1: day = 30 elif sel == 2: day = 90 elif sel == 3: day = 180 sel = '//a[@rel="' + str(day) + '"]' browser.find_element_by_xpath(sel).click() # 太快了 time.sleep(2)
The number of days is here:
Find the graphics box:
xoyelement = browser.find_elements_by_css_selector("#trend rect")[2]
The graphics box is:
Construct offsets based on different coordinate points:
Select the coordinates of 7 days to observe:
The abscissa of one point is 1031.66666
The abscissa of the second point is 1234
from selenium.webdriver.common.action_chains import ActionChains ActionChains(browser).move_to_element_with_offset(xoyelement,x_0,y_0).perform()
But this is sure The point is pointed out at this position:
, which is the upper left corner of the rectangle. The js will not be loaded here to display the pop-up box, so the abscissa + 1:
x_0 = 1 y_0 = 0
Write a cycle based on the number of days and let the abscissa accumulate:
# 按照选择的天数循环 for i in range(day): # 构造规则 if day == 7: x_0 = x_0 + 202.33 elif day == 30: x_0 = x_0 + 41.68 elif day == 90: x_0 = x_0 + 13.64 elif day == 180: x_0 = x_0 + 6.78
A box will pop up when the mouse moves horizontally. Find this box in the URL:
Selenium automatically recognizes...:
# <p class="imgtxt" style="margin-left:-117px;"></p> imgelement = browser.find_element_by_xpath('//p[@id="viewbox"]')
And determine the size and position of this box:
# 找到图片坐标 locations = imgelement.location print(locations) # 找到图片大小 sizes = imgelement.size print(sizes) # 构造指数的位置 rangle = (int(locations['x']), int(locations['y']), int(locations['x'] + sizes['width']), int(locations['y'] + sizes['height']))
The intercepted graphic is:
The following idea is:
1. Take a screenshot of the entire screen
2. Open the screenshot and use the coordinate range obtained above to crop it
But the final crop is the black frame above. The effect I want is:
So I need to calculate the range, but I am lazy and ignore the length of the search term, so I directly write violently:
# 构造指数的位置 rangle = (int(locations['x'] + sizes['width']/3), int(locations['y'] + sizes['height']/2), int(locations['x'] + sizes['width']*2/3), int(locations['y'] + sizes['height']))
这个写法最终不太好,最起码要对keyword的长度进行判断,长度过长会导致截图坐标出现偏差,反正我知道怎么做,就是不写出来给你们看!
后面的完整代码是:
# <p class="imgtxt" style="margin-left:-117px;"></p> imgelement = browser.find_element_by_xpath('//p[@id="viewbox"]') # 找到图片坐标 locations = imgelement.location print(locations) # 找到图片大小 sizes = imgelement.size print(sizes) # 构造指数的位置 rangle = (int(locations['x'] + sizes['width']/3), int(locations['y'] + sizes['height']/2), int(locations['x'] + sizes['width']*2/3), int(locations['y'] + sizes['height'])) # 截取当前浏览器 path = "../baidu/" + str(num) browser.save_screenshot(str(path) + ".png") # 打开截图切割 img = Image.open(str(path) + ".png") jpg = img.crop(rangle) jpg.save(str(path) + ".jpg")
但是后面发现裁剪的图片太小,识别精度太低,所以需要对图片进行扩大:
# 将图片放大一倍 # 原图大小73.29 jpgzoom = Image.open(str(path) + ".jpg") (x, y) = jpgzoom.size x_s = 146 y_s = 58 out = jpgzoom.resize((x_s, y_s), Image.ANTIALIAS) out.save(path + 'zoom.jpg', 'png', quality=95)
原图大小请 右键->属性->详细信息 查看,我的是长73像素,宽29像素
最后就是图像识别
# 图像识别 index = [] image = Image.open(str(path) + "zoom.jpg") code = pytesseract.image_to_string(image) if code: index.append(code)
最后效果图:
以上就是本文的全部内容,希望对大家的学习有所帮助,也希望大家多多支持PHP中文网。
更多Detailed explanation of python3 Baidu index crawling example相关文章请关注PHP中文网!

Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

Video Face Swap
Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

Hot Tools

Notepad++7.3.1
Easy-to-use and free code editor

SublimeText3 Chinese version
Chinese version, very easy to use

Zend Studio 13.0.1
Powerful PHP integrated development environment

Dreamweaver CS6
Visual web development tools

SublimeText3 Mac version
God-level code editing software (SublimeText3)

Hot Topics

Solution to permission issues when viewing Python version in Linux terminal When you try to view Python version in Linux terminal, enter python...

How to avoid being detected when using FiddlerEverywhere for man-in-the-middle readings When you use FiddlerEverywhere...

When using Python's pandas library, how to copy whole columns between two DataFrames with different structures is a common problem. Suppose we have two Dats...

How does Uvicorn continuously listen for HTTP requests? Uvicorn is a lightweight web server based on ASGI. One of its core functions is to listen for HTTP requests and proceed...

Fastapi ...

How to teach computer novice programming basics within 10 hours? If you only have 10 hours to teach computer novice some programming knowledge, what would you choose to teach...

Using python in Linux terminal...

Understanding the anti-crawling strategy of Investing.com Many people often try to crawl news data from Investing.com (https://cn.investing.com/news/latest-news)...
