Mobike crawler source code analysis

Apr 04, 2017 am 10:40 AM

The previous two articles covered why I scraped Mobike's API and the results of the data analysis. This article provides the runnable source code directly, for learning purposes.

Statement:
This crawler is for learning and research purposes only. Do not use it for illegal purposes. You bear sole responsibility for any legal disputes that arise from doing so.

If you don't have the patience to read the article, go straight to the code:

git clone https://github.com/derekhe/mobike-crawler
cd mobike-crawler
python3 crawler.py

Please don't forget to give it a star!

#Directory structure

  • \analysis - Jupyter notebooks for data analysis

  • \influx-importer - imports into InfluxDB; an earlier attempt that was not done well

  • \modules - proxy module

  • \web - real-time graphical display module, written just to learn React. The effect can be found here

  • crawler.py - crawler core code

  • importToDb.py - imports into a Postgres database for analysis

  • sql.sql - SQL for creating the tables

  • start.sh - script that keeps the crawler running

#Idea

The core code is in crawler.py. The data is first stored in a sqlite3 database, then deduplicated and exported to a csv file to save space. Mobike's API returns the bikes within a square area, so a large area can be captured by moving that square piece by piece. left, top, right and bottom define the crawling range, which is currently the area inside Chengdu's ring expressway plus the square area extending south to Nanhu. offset defines the crawling step. It is currently 0.002, which lets a $5 DigitalOcean server complete one full pass within 15 minutes.
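To get a feel for what a step of 0.002 implies, the number of requests in one pass can be estimated from the bounding box. This is only a rough sketch: the coordinates below are made-up placeholders rather than the values used in crawler.py, and `count_cells` is a hypothetical helper.

```python
import math


def count_cells(left, top, right, bottom, offset):
    """Number of square cells (one API request each) needed to cover the box.

    Follows the article's naming: left/right bound the latitude
    (left > right), top/bottom bound the longitude (top < bottom).
    """
    # Round before ceil so that float noise such as 150.00000000000003
    # does not add a spurious extra row or column.
    rows = math.ceil(round((left - right) / offset, 6))
    cols = math.ceil(round((bottom - top) / offset, 6))
    return rows * cols


# Placeholder box of roughly 0.3 x 0.3 degrees (assumed values).
cells = count_cells(left=30.8, top=103.9, right=30.5, bottom=104.2, offset=0.002)
print(cells)  # requests needed for one full pass

# Requests per second needed to finish one pass within 15 minutes.
print(cells / (15 * 60))
```

Halving the offset roughly quadruples the number of requests, so the step size is the main knob for trading coverage density against crawl time.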

Then 250 threads are started. As for why I didn't use coroutines: I hadn't learned them at the time. They would work too, and might even be more efficient.
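For comparison, a coroutine version of the same fan-out might look like the sketch below. It uses only asyncio from the standard library. `fetch_cell` is a hypothetical stand-in for the real HTTP request, and the concurrency limit of 250 simply mirrors the thread count.

```python
import asyncio


async def fetch_cell(lat, lon):
    # Hypothetical stand-in for one API request; a real crawler would
    # await an async HTTP client here.
    await asyncio.sleep(0)
    return (lat, lon)


async def crawl(points, max_concurrency=250):
    # The semaphore caps in-flight requests, playing the role of
    # max_workers in a thread pool.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(lat, lon):
        async with semaphore:
            return await fetch_cell(lat, lon)

    # gather() returns results in the same order as the input points.
    return await asyncio.gather(*(bounded(lat, lon) for lat, lon in points))


points = [(30.0 + i * 0.002, 104.0) for i in range(5)]
results = asyncio.run(crawl(points))
print(len(results))
```

Whether this is actually faster than 250 threads depends on the HTTP client and the proxies, so it is best treated as an option to measure rather than a guaranteed improvement.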

The data has to be deduplicated after crawling to eliminate the overlap between neighbouring small squares, and the final group_data call does exactly this. The core API code is here. The mini program's API interface only needs a few variables, so it is very simple.

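        # One pool of 250 worker threads shared by every grid cell.
        executor = ThreadPoolExecutor(max_workers=250)
        print("Start")
        self.total = 0
        # left/right bound the latitude; the step is negative, so left must be greater than right.
        lat_range = np.arange(left, right, -offset)
        for lat in lat_range:
            # top/bottom bound the longitude, scanned in increasing order.
            lon_range = np.arange(top, bottom, offset)
            for lon in lon_range:
                self.total += 1
                executor.submit(self.get_nearby_bikes, (lat, lon))

        # shutdown() waits for all submitted tasks before deduplication starts.
        executor.shutdown()
        self.group_data()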
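The deduplication step that runs at the end can be sketched with the standard library alone. This is not the repository's group_data: the table layout, column names, and sample rows are assumptions made for illustration.

```python
import csv
import io
import sqlite3

# Sample rows as (bike_id, lat, lon); two overlapping squares both
# returned bike "A1", so it appears twice.
rows = [
    ("A1", 30.6001, 104.0602),
    ("B2", 30.6010, 104.0615),
    ("A1", 30.6001, 104.0602),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bikes (bike_id TEXT, lat REAL, lon REAL)")
conn.executemany("INSERT INTO bikes VALUES (?, ?, ?)", rows)

# DISTINCT collapses identical rows reported by neighbouring squares.
unique = conn.execute(
    "SELECT DISTINCT bike_id, lat, lon FROM bikes ORDER BY bike_id"
).fetchall()
conn.close()

# Write the deduplicated rows as CSV (to a string here; a file works the same).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["bike_id", "lat", "lon"])
writer.writerows(unique)
csv_text = buffer.getvalue()
print(csv_text)
```

Doing the deduplication in SQL keeps the crawler threads simple: they only insert rows, and the overlap is resolved once at the end.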

Finally, you may ask: doesn't crawling this frequently get the IP blocked? Mobike does rate-limit by IP, but the way around it is simple: use a large number of proxies. I have a proxy pool that holds more than 8,000 proxies on most days. ProxyProvider fetches this pool directly and provides a pick function that randomly selects one of the top 50 proxies. Note that my proxy pool is updated every hour, but the jsonblob proxy list in the code is only a sample, and most of it will probably be invalid after a while. A proxy scoring mechanism is used here. Instead of picking proxies purely at random, I sort them by score. Each successful request adds points and each failed request deducts points, so the fastest and most reliable proxies rise to the top within a short time. You can save the scores and reuse them next time if needed.

In actual use, pick a proxy through proxyProvider.pick() and use it. If the proxy has any problem, call proxy.fatal_error() to lower its score so that it will not be selected again.

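import logging
import random
import threading

import requests
import ujson

logger = logging.getLogger(__name__)  # assumption: a module-level logger


class ProxyProvider:
    # Proxy (not shown here) must provide a score attribute and a used() method.
    def __init__(self, min_proxies=200):
        self._bad_proxies = {}
        self._minProxies = min_proxies
        self.lock = threading.RLock()

        self.get_list()

    def get_list(self):
        logger.debug("Getting proxy list")
        r = requests.get("https://jsonblob.com/31bf2dc8-00e6-11e7-a0ba-e39b7fdbe78b", timeout=10)
        proxies = ujson.decode(r.text)
        logger.debug("Got %s proxies", len(proxies))
        self._proxies = list(map(lambda p: Proxy(p), proxies))

    def pick(self):
        with self.lock:
            # Best-scored proxies first.
            self._proxies.sort(key=lambda p: p.score, reverse=True)
            proxy_len = len(self._proxies)
            max_range = 50 if proxy_len > 50 else proxy_len
            # randrange(max_range) covers indices 0..max_range-1, so the
            # top-ranked proxy is included and a one-element list still works.
            proxy = self._proxies[random.randrange(max_range)]
            proxy.used()

            return proxy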
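The scoring idea itself can be illustrated with a small stand-alone sketch. The `Proxy` class below is not the one from the repository: only the method names `used` and `fatal_error` come from the article, while `success`, the point values, and the sample addresses are assumptions.

```python
import random


class Proxy:
    """Minimal scored proxy; the point values are arbitrary assumptions."""

    def __init__(self, url):
        self.url = url
        self.score = 0
        self.used_count = 0

    def used(self):
        self.used_count += 1

    def success(self):
        self.score += 1       # reward a request that worked

    def fatal_error(self):
        self.score -= 10      # push a broken proxy far down the ranking


def pick(proxies, top_n=50):
    # Rank by score, then choose randomly among the best top_n.
    ranked = sorted(proxies, key=lambda p: p.score, reverse=True)
    proxy = random.choice(ranked[:top_n])
    proxy.used()
    return proxy


proxies = [Proxy("http://10.0.0.%d:8080" % i) for i in range(1, 6)]
proxies[0].fatal_error()      # simulate one failing proxy
for p in proxies[1:]:
    p.success()               # and four working ones

chosen = pick(proxies, top_n=4)
print(chosen.url, chosen.score)
```

Choosing randomly within the top group, rather than always taking the single best proxy, spreads the load so that one good proxy is not hammered until it gets rate-limited too.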

Okay, that's basically it. Study the rest of the code yourself.
