Python crawler---Autohome font anti-crawling
in the source code, so I looked through myfont to see if I could find anything? As a result, I really found something
# 那么便开始通过字体库进行解析 world = TTFont('./world.ttf') # 读取响应的映射关系 uni_list = world['cmap'].tables[0].ttFont.getGlyphOrder() # 'cmap' 表示汉字对应的映射 为unicode编码 print(uni_list) # 按顺序拿到各个字符的unicode编码 # 打印结果: ['.notdef', 'uniEDE8', 'uniED35', 'uniED87', 'uniECD3', 'uniED25', 'uniEC72', 'uniEDB2', 'uniEE04', 'uniED51', 'uniEC9D', 'uniECEF', 'uniEC3C', 'uniEC8D', 'uniEDCE', 'uniED1B', 'uniED6C', 'uniECB9', 'uniEDFA', 'uniEC57', 'uniED98', 'uniEDEA', 'uniED36', 'uniEC83', 'uniECD5', 'uniEC21', 'uniED62', 'uniEDB4', 'uniED00', 'uniED52', 'uniEC9F', 'uniEDDF', 'uniEC3D', 'uniED7E', 'uniECCA', 'uniED1C', 'uniEC69', 'uniECBB', 'uniEDFB'] # .notdef 并不是汉字的映射, 而是表示字体家族名称。 将映射列表转换成unicode的类型,因为自己文中获取的是字符串unicode类型的,当然你也可以转化为utf-8,不过你获取的文章内容也要转化为utf-8 unicode_list= [eval(r"u'\u" + uni[3:] + "'") for uni in uni_list[1:]]
https://www.zhihu.com/question/23374078
Okay, everything has been prepared above. Let’s write some code.# coding:utf-8 import re import requests from scrapy import Selector from fontTools.ttLib import TTFont class QiCheZhiJiaSpider: def article_content(self): url = 'https://club.autohome.com.cn/bbs/thread/2d8a42404ba24266/77486027-1.html#pvareaid=2199101' headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.119 Safari/537.36' } try: response = requests.get(url=url, headers=headers).text response_info = Selector(text=response) except BaseException as e: print(e) else: content = response_info.xpath('//div[@class="tz-paragraph"]//text()').extract() # 获取列表的形式内容。 # print(content) content_str = ''.join(content) # 紧接着获取字体的链接 world_href = re.findall(r",url\('(//.*\.ttf)'\).*", response, re.M or re.S)[0] world_href = 'https:' + world_href world_content = requests.get(url=world_href, headers=headers).content # 对获取到的字体进行下载.......... with open('./world.ttf', 'wb') as f: f.write(world_content) # 那么便开始通过字体库进行解析 world = TTFont('./world.ttf') # 读取响应的映射关系 uni_list = world['cmap'].tables[0].ttFont.getGlyphOrder() unicode_list = [eval(r"u'\u" + uni[3:] + "'") for uni in uni_list[1:]] world_list = ["右", "远", "高", "呢", "了", "短", "得", "矮", "多", "二", "大", "一", "不", "近", "是", "着", "五", "三", "九", "六", "少", "好", "上", "七", "和", "很", "十", "四", "左", "下", "八", "小", "坏", "低", "长", "更", "的", "地"] # # 录入字体文件中的字符。必须要以国际标准的unicode编码 for i in range(len(unicode_list )): content_str = content_str.replace(unicode_list [i], world_list[i]) print(content_str) if __name__ == '__main__': qi_che_zhi_jia = QiCheZhiJiaSpider() qi_che_zhi_jia.article_content()
Python video tutorial Please pay attention to the PHP Chinese website.
##
The above is the detailed content of Python crawler---Autohome font anti-crawling. For more information, please follow other related articles on the PHP Chinese website!

Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

Video Face Swap
Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

Hot Tools

Notepad++7.3.1
Easy-to-use and free code editor

SublimeText3 Chinese version
Chinese version, very easy to use

Zend Studio 13.0.1
Powerful PHP integrated development environment

Dreamweaver CS6
Visual web development tools

SublimeText3 Mac version
God-level code editing software (SublimeText3)

Hot Topics

Solution to permission issues when viewing Python version in Linux terminal When you try to view Python version in Linux terminal, enter python...

How to avoid being detected when using FiddlerEverywhere for man-in-the-middle readings When you use FiddlerEverywhere...

When using Python's pandas library, how to copy whole columns between two DataFrames with different structures is a common problem. Suppose we have two Dats...

How to teach computer novice programming basics within 10 hours? If you only have 10 hours to teach computer novice some programming knowledge, what would you choose to teach...

How does Uvicorn continuously listen for HTTP requests? Uvicorn is a lightweight web server based on ASGI. One of its core functions is to listen for HTTP requests and proceed...

Fastapi ...

Using python in Linux terminal...

Understanding the anti-crawling strategy of Investing.com Many people often try to crawl news data from Investing.com (https://cn.investing.com/news/latest-news)...
