
First Introduction to Scrapy: Image Crawling Practice on Moko.cc

Jun 24, 2016, 11:53 AM

I have been studying the Scrapy crawler framework for the past two days and plan to write a crawler for practice. What I usually do most is browse pictures. Yes, that's right, artistic photos. I proudly believe that looking at more beautiful photos will definitely improve your aesthetics and make you an elegant programmer. O(∩_∩)O~ Just kidding. So without further ado, let's get to the point and write an image crawler.

Design idea: the crawl target is the model photos on Moko.cc. We use CrawlSpider to extract the URL of each photo and write the extracted image URLs into a static HTML file, which can then be opened to view the images. My environment is Win 8.1, Python 2.7, and Scrapy 0.24.4. I won't explain how to set up the environment; you can search for that on Baidu yourself.

Referring to the official documentation, I summarized the four steps to build a crawler program:

  • Create a scrapy project
  • Define the element items that need to be extracted from the web page
  • Implement a spider class that crawls URLs and extracts items
  • Implement an item pipeline class that stores the extracted items
The next steps are straightforward; just follow them one by one. First, create a project in the terminal. Let's name the project moko and enter the command scrapy startproject moko. Scrapy will create a moko directory in the current directory with some initial files in it. If you are interested in what each file is for, check the documentation; here I will mainly introduce the files we use this time.
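As a rough sketch of what scrapy startproject generates (the exact contents can differ slightly between Scrapy versions), the project layout looks like this:

    moko/
        scrapy.cfg            # deploy/configuration file
        moko/                 # the project's Python package
            __init__.py
            items.py          # item definitions (edited below)
            pipelines.py      # item pipelines (edited below)
            settings.py       # project settings (edited below)
            spiders/          # our spider code goes here
                __init__.py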

    Define the Item: declare the data we want to capture in items.py:

    # -*- coding: utf-8 -*-

    # Define here the models for your scraped items
    #
    # See documentation in:
    # http://doc.scrapy.org/en/latest/topics/items.html

    import scrapy


    class MokoItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        url = scrapy.Field()

The url field here is the key in the dict that stores the final result; more on that later. Field names are arbitrary: for example, if we also wanted to crawl the name of the photo's author, we could simply add name = scrapy.Field(), and so on.
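As a small sketch (the name field and the sample URL below are hypothetical, purely for illustration), an extended item would look like this, and items are filled in like dicts:

    import scrapy

    class MokoItem(scrapy.Item):
        url = scrapy.Field()   # the image URL collected in this tutorial
        name = scrapy.Field()  # hypothetical extra field, e.g. the photo author's name

    # Items behave like dicts:
    item = MokoItem()
    item['url'] = 'http://img.moko.cc/example.jpg'  # placeholder URL for illustration
    print(item['url'])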
Next, go into the spiders folder and create a Python file in it. Let's call it mokospider.py and add the core code that implements the Spider.
The Spider is a Python class that inherits from scrapy.contrib.spiders.CrawlSpider and has three required members:

    name: the identifier of this spider; it must be unique, so different crawlers define different names.

    start_urls: a list of URLs; the spider starts crawling from these pages.

    parse(): the parsing method; when called, it receives the Response object returned for each URL as its only parameter, and it is responsible for parsing the crawled data (into items) and following further URLs.
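    For contrast, a bare-bones spider containing only these three members might look like the sketch below (not the spider used in this tutorial; the real CrawlSpider-based code follows):

    from scrapy.spider import Spider

    class MinimalSpider(Spider):
        name = "minimal"                          # unique identifier of the spider
        start_urls = ["http://www.example.com"]   # crawling starts from these pages

        def parse(self, response):
            # called once per downloaded start URL with its Response object
            pass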

    # -*- coding: utf-8 -*-
    # File name: spiders/mokospider.py
    # Author: Jhonny Zhang
    # mail: veinyy@163.com
    # create Time: 2014-11-29
    ###########################################################################

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from moko.items import MokoItem
    import re
    from scrapy.http import Request
    from scrapy.selector import Selector


    class MokoSpider(CrawlSpider):
        name = "moko"
        allowed_domains = ["moko.cc"]
        start_urls = ["http://www.moko.cc/post/aaronsky/list.html"]
        rules = (Rule(SgmlLinkExtractor(allow=('/post/\d*\.html')), callback = 'parse_img', follow=True),)

        def parse_img(self, response):
            urlItem = MokoItem()
            sel = Selector(response)
            for divs in sel.xpath('//div[@class="pic dBd"]'):
                img_url = divs.xpath('.//img/@src2').extract()[0]
                urlItem['url'] = img_url
                yield urlItem

    Our project is named moko. allowed_domains restricts the crawler to moko.cc; it is the crawler's allowed area and means the spider only crawls pages under that domain. The crawl starts from http://www.moko.cc/post/aaronsky/list.html. Next we set the crawling rules (Rule); this is what distinguishes CrawlSpider from a basic spider. For example, if we start from page A, which contains many hyperlinks, the crawler follows only those hyperlinks that match the configured rules and then repeats the process on each page it visits. Each matching page is handed to the callback function. The reason I did not use the default name parse is that, according to the official documentation, parse may be called by the CrawlSpider framework itself, which would cause a conflict.

    The target page http://www.moko.cc/post/aaronsky/list.html contains many links to photos, and each photo link follows a pattern. For example, open one of them: http://www.moko.cc/post/1052776.html. The http://www.moko.cc/post/ part is always the same, and links differ only in the number at the end. So we use a regular expression in the rules: rules = (Rule(SgmlLinkExtractor(allow=('/post/\d*\.html')), callback = 'parse_img', follow=True),). This means that, starting from the current page, every linked page matching /post/\d*\.html is crawled and handed to parse_img for processing.
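    To sanity-check the pattern on its own, here is a tiny standalone sketch using the two URLs mentioned above:

    import re

    pattern = re.compile(r'/post/\d*\.html')

    # A single photo post matches; the list page itself does not.
    print(bool(pattern.search('http://www.moko.cc/post/1052776.html')))       # True
    print(bool(pattern.search('http://www.moko.cc/post/aaronsky/list.html'))) # False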

    Next, define the parsing function parse_img. This is the key part. The parameter passed in is the Response object the crawler gets back after opening the URL; the content of the Response object is, simply put, one large string, and we need to filter out the parts we want. How? There is a great Selector class whose xpath() path expressions can parse the content. Before parsing, you need to inspect the page in detail; the tool used here is Firebug. Looking at the core markup of the page, what we need is the src2 attribute: it sits on the <img> tag inside the <div class="pic dBd"> tag. So first instantiate urlItem, an object of the MokoItem() class defined in items.py, and pass the response into the mighty Selector. I use a loop here, handling one URL per iteration: an xpath path expression extracts the URL (as for how to use xpath, search Baidu yourself), and the result is stored in urlItem, using the url field we defined in items.py.
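    As a standalone sketch of this extraction step (the HTML fragment and image URLs below are made up, only shaped like the markup described above):

    from scrapy.selector import Selector

    # Made-up fragment shaped like the page markup described above.
    html = '''
    <div class="pic dBd">
        <img src="placeholder.gif" src2="http://img.moko.cc/example/photo1.jpg" />
    </div>
    <div class="pic dBd">
        <img src="placeholder.gif" src2="http://img.moko.cc/example/photo2.jpg" />
    </div>
    '''

    sel = Selector(text=html)
    for div in sel.xpath('//div[@class="pic dBd"]'):
        print(div.xpath('.//img/@src2').extract()[0])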

    Then define the pipeline; this part is responsible for storing our content.

    from moko.items import MokoItem


    class MokoPipeline(object):
        def __init__(self):
            self.mfile = open('test.html', 'w')

        def process_item(self, item, spider):
            text = '<img src="' + item['url'] + '" alt = "" />'
            self.mfile.writelines(text)

        def close_spider(self, spider):
            self.mfile.close()

    A test.html file is created to store the results. Note that process_item writes a bit of HTML markup, so the images are displayed directly when test.html is opened in a browser. At the end, a method that closes the file is defined; it is called when the spider finishes.
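    If you prefer, the file handling can also be done in open_spider/close_spider and the output wrapped in a minimal HTML skeleton. A sketch of such a variant (an alternative, not the pipeline used above; note that process_item normally also returns the item so later pipelines can see it):

    class MokoHtmlPipeline(object):
        """Hypothetical variant of MokoPipeline that writes a complete HTML page."""

        def open_spider(self, spider):
            self.mfile = open('test.html', 'w')
            self.mfile.write('<html><body>\n')

        def process_item(self, item, spider):
            self.mfile.write('<img src="%s" alt="" />\n' % item['url'])
            return item  # pass the item on to any later pipelines

        def close_spider(self, spider):
            self.mfile.write('</body></html>\n')
            self.mfile.close()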

    Finally, configure settings.py:

    BOT_NAME = 'moko'

    SPIDER_MODULES = ['moko.spiders']
    NEWSPIDER_MODULE = 'moko.spiders'

    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    #USER_AGENT = 'moko (+http://www.yourdomain.com)'

    ITEM_PIPELINES = {
        'moko.pipelines.MokoPipeline': 1,
    }
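    The ITEM_PIPELINES dict maps the pipeline's class path to an order value (lower values run earlier). If you want the crawler to behave more politely, settings such as DOWNLOAD_DELAY and USER_AGENT can also be set; a sketch of such optional extras (example values, not required for this tutorial):

    # Optional politeness-related settings (example values).
    DOWNLOAD_DELAY = 1   # seconds to wait between requests
    USER_AGENT = 'moko (+http://www.example.com)'   # placeholder contact URL

    With everything in place, run the spider from the project root with scrapy crawl moko; when it finishes, the collected image URLs are in test.html.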

     


     

    To wrap up, here is what the result looks like. Have fun, everyone! ^_^

                   
