Home Backend Development Python Tutorial Detailed explanation of python3 Baidu index crawling example

Detailed explanation of python3 Baidu index crawling example

Feb 14, 2017 pm 01:42 PM

This article mainly introduces python3 Baidu index crawling. The editor thinks it is quite good. Now I will share it with you and give it as a reference. Let’s follow the editor to have a look.

Catch the Baidu index, and then use image recognition to get the index

Foreword:

Tufu once said that the Baidu index is difficult to grasp. On Taobao, it costs 20 yuan per unit. Keywords:

Detailed explanation of python3 Baidu index crawling example

How could someone be frightened by someone with such a big mouth, so it took me about 2 and a half days to complete it. I despise Tufu

There are many installed libraries:

谷歌图像识别tesseract-ocr

pip3 install pillow

pip3 install pyocr

selenium2.45

Chrome47.0.2526.106 m or Firebox32.0.1

chromedriver.exe
Copy after login

You need to log in to enter the Baidu Index. The login account and password are written in the text account:

Detailed explanation of python3 Baidu index crawling example

The universal login code is as follows:

# 打开浏览器
def openbrowser():
  global browser

  # http://www.php.cn/
  url = "https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F"
  # 打开谷歌浏览器
  # Firefox()
  # Chrome()
  browser = webdriver.Chrome()
  # 输入网址
  browser.get(url)
  # 打开浏览器时间
  # print("等待10秒打开浏览器...")
  # time.sleep(10)

  # 找到id="TANGRAM__PSP_3__userName"的对话框
  # 清空输入框
  browser.find_element_by_id("TANGRAM__PSP_3__userName").clear()
  browser.find_element_by_id("TANGRAM__PSP_3__password").clear()

  # 输入账号密码
  # 输入账号密码
  account = []
  try:
    fileaccount = open("../baidu/account.txt")
    accounts = fileaccount.readlines()
    for acc in accounts:
      account.append(acc.strip())
    fileaccount.close()
  except Exception as err:
    print(err)
    input("请正确在account.txt里面写入账号密码")
    exit()
  browser.find_element_by_id("TANGRAM__PSP_3__userName").send_keys(account[0])
  browser.find_element_by_id("TANGRAM__PSP_3__password").send_keys(account[1])

  # 点击登陆登陆
  # id="TANGRAM__PSP_3__submit"
  browser.find_element_by_id("TANGRAM__PSP_3__submit").click()

  # 等待登陆10秒
  # print('等待登陆10秒...')
  # time.sleep(10)
  print("等待网址加载完毕...")

  select = input("请观察浏览器网站是否已经登陆(y/n):")
  while 1:
    if select == "y" or select == "Y":
      print("登陆成功!")
      print("准备打开新的窗口...")
      # time.sleep(1)
      # browser.quit()
      break

    elif select == "n" or select == "N":
      selectno = input("账号密码错误请按0,验证码出现请按1...")
      # 账号密码错误则重新输入
      if selectno == "0":

        # 找到id="TANGRAM__PSP_3__userName"的对话框
        # 清空输入框
        browser.find_element_by_id("TANGRAM__PSP_3__userName").clear()
        browser.find_element_by_id("TANGRAM__PSP_3__password").clear()

        # 输入账号密码
        account = []
        try:
          fileaccount = open("../baidu/account.txt")
          accounts = fileaccount.readlines()
          for acc in accounts:
            account.append(acc.strip())
          fileaccount.close()
        except Exception as err:
          print(err)
          input("请正确在account.txt里面写入账号密码")
          exit()

        browser.find_element_by_id("TANGRAM__PSP_3__userName").send_keys(account[0])
        browser.find_element_by_id("TANGRAM__PSP_3__password").send_keys(account[1])
        # 点击登陆sign in
        # id="TANGRAM__PSP_3__submit"
        browser.find_element_by_id("TANGRAM__PSP_3__submit").click()

      elif selectno == "1":
        # 验证码的id为id="ap_captcha_guess"的对话框
        input("请在浏览器中输入验证码并登陆...")
        select = input("请观察浏览器网站是否已经登陆(y/n):")

    else:
      print("请输入“y”或者“n”!")
      select = input("请观察浏览器网站是否已经登陆(y/n):")
Copy after login

Login page:

Detailed explanation of python3 Baidu index crawling example

After logging in, you need to open a new window, that is, open the Baidu Index and switch windows. Use selenium:

# 新开一个窗口,通过执行js来新开一个窗口
js = 'window.open("http://index.baidu.com");'
browser.execute_script(js)
# 新窗口句柄切换,进入百度指数
# 获得当前打开所有窗口的句柄handles
# handles为一个数组
handles = browser.window_handles
# print(handles)
# 切换到当前最新打开的窗口
browser.switch_to_window(handles[-1])
Copy after login

Clear the input box and construct the number of click days:

# 清空输入框
browser.find_element_by_id("schword").clear()
# 写入需要搜索的百度指数
browser.find_element_by_id("schword").send_keys(keyword)
# 点击搜索
# <input type="submit" value="" id="searchWords" onclick="searchDemoWords()">
browser.find_element_by_id("searchWords").click()
time.sleep(2)
# 最大化窗口
browser.maximize_window()
# 构造天数
sel = int(input("查询7天请按0,30天请按1,90天请按2,半年请按3:"))
day = 0
if sel == 0:
  day = 7
elif sel == 1:
  day = 30
elif sel == 2:
  day = 90
elif sel == 3:
  day = 180
sel = &#39;//a[@rel="&#39; + str(day) + &#39;"]&#39;
browser.find_element_by_xpath(sel).click()
# 太快了
time.sleep(2)
Copy after login

The number of days is here:

Detailed explanation of python3 Baidu index crawling example

Find the graphics box:

xoyelement = browser.find_elements_by_css_selector("#trend rect")[2]
Copy after login

The graphics box is:

Detailed explanation of python3 Baidu index crawling example

Construct offsets based on different coordinate points:

Detailed explanation of python3 Baidu index crawling example

Select the coordinates of 7 days to observe:

The abscissa of one point is 1031.66666

The abscissa of the second point is 1234

Detailed explanation of python3 Baidu index crawling example

##So the distance between the two coordinates is 7 days The difference is: 202.33, other days are similar

Use selenium library to simulate mouse sliding suspension:


##
from selenium.webdriver.common.action_chains import ActionChains
ActionChains(browser).move_to_element_with_offset(xoyelement,x_0,y_0).perform()
Copy after login

But this is sure The point is pointed out at this position:


Detailed explanation of python3 Baidu index crawling example, which is the upper left corner of the rectangle. The js will not be loaded here to display the pop-up box, so the abscissa + 1:


x_0 = 1
y_0 = 0
Copy after login

Write a cycle based on the number of days and let the abscissa accumulate:


# 按照选择的天数循环
for i in range(day):
  # 构造规则
  if day == 7:
    x_0 = x_0 + 202.33
  elif day == 30:
    x_0 = x_0 + 41.68
  elif day == 90:
    x_0 = x_0 + 13.64
  elif day == 180:
    x_0 = x_0 + 6.78
Copy after login

A box will pop up when the mouse moves horizontally. Find this box in the URL:


Detailed explanation of python3 Baidu index crawling exampleSelenium automatically recognizes...:


# <p class="imgtxt" style="margin-left:-117px;"></p>
imgelement = browser.find_element_by_xpath(&#39;//p[@id="viewbox"]&#39;)
Copy after login

And determine the size and position of this box:


# 找到图片坐标
locations = imgelement.location
print(locations)
# 找到图片大小
sizes = imgelement.size
print(sizes)
# 构造指数的位置
rangle = (int(locations[&#39;x&#39;]), int(locations[&#39;y&#39;]), int(locations[&#39;x&#39;] + sizes[&#39;width&#39;]),
     int(locations[&#39;y&#39;] + sizes[&#39;height&#39;]))
Copy after login

The intercepted graphic is:

Detailed explanation of python3 Baidu index crawling example
The following idea is:

1. Take a screenshot of the entire screen


2. Open the screenshot and use the coordinate range obtained above to crop it

But the final crop is the black frame above. The effect I want is:

Detailed explanation of python3 Baidu index crawling exampleSo I need to calculate the range, but I am lazy and ignore the length of the search term, so I directly write violently:


# 构造指数的位置
rangle = (int(locations[&#39;x&#39;] + sizes[&#39;width&#39;]/3), int(locations[&#39;y&#39;] + sizes[&#39;height&#39;]/2), int(locations[&#39;x&#39;] + sizes[&#39;width&#39;]*2/3),
     int(locations[&#39;y&#39;] + sizes[&#39;height&#39;]))
Copy after login

这个写法最终不太好,最起码要对keyword的长度进行判断,长度过长会导致截图坐标出现偏差,反正我知道怎么做,就是不写出来给你们看!

后面的完整代码是:

# <p class="imgtxt" style="margin-left:-117px;"></p>
imgelement = browser.find_element_by_xpath(&#39;//p[@id="viewbox"]&#39;)
# 找到图片坐标
locations = imgelement.location
print(locations)
# 找到图片大小
sizes = imgelement.size
print(sizes)
# 构造指数的位置
rangle = (int(locations[&#39;x&#39;] + sizes[&#39;width&#39;]/3), int(locations[&#39;y&#39;] + sizes[&#39;height&#39;]/2), int(locations[&#39;x&#39;] + sizes[&#39;width&#39;]*2/3),
     int(locations[&#39;y&#39;] + sizes[&#39;height&#39;]))
# 截取当前浏览器
path = "../baidu/" + str(num)
browser.save_screenshot(str(path) + ".png")
# 打开截图切割
img = Image.open(str(path) + ".png")
jpg = img.crop(rangle)
jpg.save(str(path) + ".jpg")
Copy after login

但是后面发现裁剪的图片太小,识别精度太低,所以需要对图片进行扩大:

# 将图片放大一倍
# 原图大小73.29
jpgzoom = Image.open(str(path) + ".jpg")
(x, y) = jpgzoom.size
x_s = 146
y_s = 58
out = jpgzoom.resize((x_s, y_s), Image.ANTIALIAS)
out.save(path + &#39;zoom.jpg&#39;, &#39;png&#39;, quality=95)
Copy after login

原图大小请 右键->属性->详细信息 查看,我的是长73像素,宽29像素

最后就是图像识别

# 图像识别
index = []
image = Image.open(str(path) + "zoom.jpg")
code = pytesseract.image_to_string(image)
if code:
  index.append(code)
Copy after login

最后效果图:

Detailed explanation of python3 Baidu index crawling example

Detailed explanation of python3 Baidu index crawling example


以上就是本文的全部内容,希望对大家的学习有所帮助,也希望大家多多支持PHP中文网。

更多Detailed explanation of python3 Baidu index crawling example相关文章请关注PHP中文网!

Statement of this Website
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

Video Face Swap

Video Face Swap

Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Tools

Notepad++7.3.1

Notepad++7.3.1

Easy-to-use and free code editor

SublimeText3 Chinese version

SublimeText3 Chinese version

Chinese version, very easy to use

Zend Studio 13.0.1

Zend Studio 13.0.1

Powerful PHP integrated development environment

Dreamweaver CS6

Dreamweaver CS6

Visual web development tools

SublimeText3 Mac version

SublimeText3 Mac version

God-level code editing software (SublimeText3)

How to solve the permissions problem encountered when viewing Python version in Linux terminal? How to solve the permissions problem encountered when viewing Python version in Linux terminal? Apr 01, 2025 pm 05:09 PM

Solution to permission issues when viewing Python version in Linux terminal When you try to view Python version in Linux terminal, enter python...

How to avoid being detected by the browser when using Fiddler Everywhere for man-in-the-middle reading? How to avoid being detected by the browser when using Fiddler Everywhere for man-in-the-middle reading? Apr 02, 2025 am 07:15 AM

How to avoid being detected when using FiddlerEverywhere for man-in-the-middle readings When you use FiddlerEverywhere...

How to efficiently copy the entire column of one DataFrame into another DataFrame with different structures in Python? How to efficiently copy the entire column of one DataFrame into another DataFrame with different structures in Python? Apr 01, 2025 pm 11:15 PM

When using Python's pandas library, how to copy whole columns between two DataFrames with different structures is a common problem. Suppose we have two Dats...

How does Uvicorn continuously listen for HTTP requests without serving_forever()? How does Uvicorn continuously listen for HTTP requests without serving_forever()? Apr 01, 2025 pm 10:51 PM

How does Uvicorn continuously listen for HTTP requests? Uvicorn is a lightweight web server based on ASGI. One of its core functions is to listen for HTTP requests and proceed...

How to solve permission issues when using python --version command in Linux terminal? How to solve permission issues when using python --version command in Linux terminal? Apr 02, 2025 am 06:36 AM

Using python in Linux terminal...

How to teach computer novice programming basics in project and problem-driven methods within 10 hours? How to teach computer novice programming basics in project and problem-driven methods within 10 hours? Apr 02, 2025 am 07:18 AM

How to teach computer novice programming basics within 10 hours? If you only have 10 hours to teach computer novice some programming knowledge, what would you choose to teach...

How to get news data bypassing Investing.com's anti-crawler mechanism? How to get news data bypassing Investing.com's anti-crawler mechanism? Apr 02, 2025 am 07:03 AM

Understanding the anti-crawling strategy of Investing.com Many people often try to crawl news data from Investing.com (https://cn.investing.com/news/latest-news)...

See all articles