Python implements simple crawler-Python Tutorial-php.cn

Home

Backend Development

Python Tutorial

Python implements simple crawler

高洛峰

Dec 13, 2016 pm 01:53 PM

python crawler

Introduction

A crawler is a program that automatically crawls Internet information. The value of it is that the internet data is all mine. You can use the crawled data to do many things, such as: you can perform data statistics and comparisons; you can use the crawled data to make an app in a certain aspect; you can also use the crawled data to make a news reader, etc.

Crawler architecture

1) URL manager
2) Web page downloader
3) Web page analyzer
4) Crawler caller
5) Value data usage

Crawler implementation

1) Scheduler implementation

# coding:utf-8
import url_manager
import html_downloader
import html_parser
import html_outputer
import url_manager
class SpiderMain(object):
    def __init__(self):
        self.urls = url_manager.UrlManager()
        self.downloader = html_downloader.HtmlDownloader()
        self.parser = html_parser.HtmlParser()
        self.outputer = html_outputer.HtmlOutputer()

    def craw(self, root_url):
        count = 1
        self.urls.add_new_url(root_url)
        while self.urls.has_new_url():
            try:
                new_url = self.urls.get_new_url()
                print "craw %d : %s" % (count, new_url)
                html_cont = self.downloader.download(new_url)
                new_urls, new_data = self.parser.parse(new_url, html_cont)
                self.urls.add_new_urls(new_urls)
                self.outputer.collect_data(new_data)
                if count == 1000:
                    break
                count = count + 1
            except:
                print "craw failed"
        self.outputer.output_html()

if __name__ == "__main__":
    root_url = "http://baike.baidu.com/view/21087.htm"
    obj_spider = SpiderMain()
    obj_spider.craw(root_url)

Copy after login

2) URL manager implementation

class UrlManager(object):
    def __init__(self):
        self.new_urls = set()
        self.old_urls = set()

    def add_new_url(self, url):
        if url is None:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        if urls is None or len(urls) == 0:
            return
        for url in urls:
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) != 0

    def get_new_url(self):
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url

Copy after login

3) URL downloader implementation

import urllib2
class HtmlDownloader(object):
    def download(self, url):
        if url is None:
            return None
        response = urllib2.urlopen(url)
        if response.getcode() != 200:
            return None
        return response.read()

Copy after login

4) URL parser implementation

from bs4 import BeautifulSoup
import re
import urlparse
class HtmlParser(object):
    def _get_new_urls(self, page_url, soup):
        new_urls = set()
        links = soup.find_all(&#39;a&#39;, href=re.compile(r"/view/\d+\.htm"))
        for link in links:
            new_url = link[&#39;href&#39;]
            new_full_url = urlparse.urljoin(page_url, new_url)
            new_urls.add(new_full_url)
        return new_urls

    def _get_new_data(self, page_url, soup):
        res_data = {}
        res_data[&#39;url&#39;] = page_url
        title_node = soup.find(&#39;dd&#39;, class_="lemmaWgt-lemmaTitle-title").find("h1")
        res_data[&#39;title&#39;] = title_node.get_text()
        summary_node = soup.find(&#39;div&#39;, class_="lemma-summary")
        res_data[&#39;summary&#39;] = summary_node.get_text()
        return res_data

    def parse(self, page_url, html_cont):
        if page_url is None or html_cont is None:
            return
        soup = BeautifulSoup(html_cont, &#39;html.parser&#39;, from_encoding=&#39;utf-8&#39;)
        new_urls = self._get_new_urls(page_url, soup)
        new_data = self._get_new_data(page_url, soup)
        return new_urls, new_data

Copy after login

5) Value data output display

# coding:utf-8
class HtmlOutputer(object):
    def __init__(self):
        self.datas = []

    def collect_data(self, data):
        if data is None:
            return
        self.datas.append(data)

    def output_html(self):
        fout = open(&#39;output.html&#39;, &#39;w&#39;)
        fout.write("<html>")
        fout.write("<meta charset=\"UTF-8\">")
        fout.write("<body>")
        fout.write("<table>")
        for data in self.datas:
            fout.write("<tr>")
            fout.write("<td>%s</td>" % data[&#39;url&#39;])
            fout.write("<td>%s</td>" % data[&#39;title&#39;].encode(&#39;utf-8&#39;))
            fout.write("<td>%s</td>" % data[&#39;summary&#39;].encode(&#39;utf-8&#39;))
            fout.write("</tr>")
        fout.write("</table>")
        fout.write("</body>")
        fout.write("</html>")
        fout.close()

Copy after login

Execution

This crawler crawls Baidu Encyclopedia and Python For 1,000 static web pages related to keywords, we mainly extract keywords and summary information from the data in the web pages, and store the crawled information in the form of HTML files, which can then be accessed by opening them with a browser.

Statement of this Website

The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Hot AI Tools

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress images for free

Clothoff.io

AI clothes remover

Video Face Swap

Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

Assassin's Creed Shadows: Seashell Riddle Solution

4 weeks ago By DDD

What's New in Windows 11 KB5054979 & How to Fix Update Issues

3 weeks ago By DDD

Where to find the Crane Control Keycard in Atomfall

4 weeks ago By DDD

Roblox: Dead Rails - How To Complete Every Challenge

1 months ago By DDD

Atomfall guide: item locations, quest guides, and tips

1 months ago By DDD

Hot Tools

Notepad++7.3.1

Easy-to-use and free code editor

SublimeText3 Chinese version

Chinese version, very easy to use

Zend Studio 13.0.1

Powerful PHP integrated development environment

Dreamweaver CS6

Visual web development tools

SublimeText3 Mac version

God-level code editing software (SublimeText3)

Hot Topics

Where is the login entrance for gmail email?

7702

Java Tutorial

1640

CakePHP Tutorial

1393

Laravel Tutorial

1287

PHP Tutorial

1230

Related knowledge

How to solve the permissions problem encountered when viewing Python version in Linux terminal? Apr 01, 2025 pm 05:09 PM

Solution to permission issues when viewing Python version in Linux terminal When you try to view Python version in Linux terminal, enter python...

How to teach computer novice programming basics in project and problem-driven methods within 10 hours? Apr 02, 2025 am 07:18 AM

How to teach computer novice programming basics within 10 hours? If you only have 10 hours to teach computer novice some programming knowledge, what would you choose to teach...

How to avoid being detected by the browser when using Fiddler Everywhere for man-in-the-middle reading? Apr 02, 2025 am 07:15 AM

How to avoid being detected when using FiddlerEverywhere for man-in-the-middle readings When you use FiddlerEverywhere...

How to efficiently copy the entire column of one DataFrame into another DataFrame with different structures in Python? Apr 01, 2025 pm 11:15 PM

When using Python's pandas library, how to copy whole columns between two DataFrames with different structures is a common problem. Suppose we have two Dats...

How does Uvicorn continuously listen for HTTP requests without serving_forever()? Apr 01, 2025 pm 10:51 PM

How does Uvicorn continuously listen for HTTP requests? Uvicorn is a lightweight web server based on ASGI. One of its core functions is to listen for HTTP requests and proceed...

How to handle comma-separated list query parameters in FastAPI? Apr 02, 2025 am 06:51 AM

Fastapi ...

How to get news data bypassing Investing.com's anti-crawler mechanism? Apr 02, 2025 am 07:03 AM

Understanding the anti-crawling strategy of Investing.com Many people often try to crawl news data from Investing.com (https://cn.investing.com/news/latest-news)...

How to solve permission issues when using python --version command in Linux terminal? Apr 02, 2025 am 06:36 AM

Using python in Linux terminal...

See all articles