
Scrapy case analysis: How to crawl company information on LinkedIn

Jun 23, 2023 am 10:04 AM

Scrapy is a Python-based crawler framework that makes it quick and easy to collect information from the Internet. In this article, we will walk through a Scrapy case study of how to crawl company information on LinkedIn.

  1. Determine the target URL

First, we need to be clear that our target is company information on LinkedIn, so we need to find the URL of the LinkedIn company information page. Open the LinkedIn website, enter a company name in the search box, and select the "Company" option in the drop-down list to reach the company profile page. On this page we can see the company's basic information, number of employees, affiliated companies, and so on. At this point, we grab the URL of the page from the browser's developer tools for later use. The structure of this URL is:

https://www.linkedin.com/search/results/companies/?keywords=xxx

Here, keywords=xxx is the search keyword; xxx can be replaced with any company name.
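To build this search URL programmatically, the company name should be URL-encoded first. Below is a minimal sketch using Python's standard library; the helper name build_search_url is our own, not part of Scrapy:

import urllib.parse

def build_search_url(company_name):
    # URL-encode the company name so spaces and special characters are safe
    keywords = urllib.parse.quote_plus(company_name)
    return f"https://www.linkedin.com/search/results/companies/?keywords={keywords}"

# Example: build_search_url("apple inc") returns
# https://www.linkedin.com/search/results/companies/?keywords=apple+inc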

  2. Create a Scrapy project

Next, we need to create a Scrapy project. Enter the following command on the command line:

scrapy startproject linkedin

This command will create a Scrapy project named linkedin in the current directory.
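If the command succeeds, Scrapy generates a standard project skeleton similar to the following (the exact layout may vary slightly between Scrapy versions):

linkedin/
    scrapy.cfg            # deployment configuration file
    linkedin/             # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # downloader and spider middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory where spiders live
            __init__.py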

  3. Create a crawler

After creating the project, enter the following command in the project root directory to create a new crawler:

scrapy genspider company_spider www.linkedin.com

This will create a spider named company_spider whose allowed domain is linkedin.com (we will rename it to company in the code below).

  4. Configure Scrapy

In the spider, we need to configure some basic information, such as the URL to be crawled and how to parse the data on the page. Add the following code to the company_spider.py file we just created:

import scrapy

class CompanySpider(scrapy.Spider):
    name = "company"
    allowed_domains = ["linkedin.com"]
    start_urls = [
        "https://www.linkedin.com/search/results/companies/?keywords=apple"
    ]

    def parse(self, response):
        pass

In the above code, we have only defined the site URL to be crawled and an empty parse callback; the spider does not actually extract anything yet. Next we need to write the parse function to capture and process LinkedIn company information.
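Before running the spider, a few project settings in linkedin/settings.py usually need adjusting. The values below are illustrative assumptions rather than the only correct ones; LinkedIn is aggressive about blocking automated clients, so a realistic User-Agent and a download delay help:

# linkedin/settings.py (illustrative values)

# Present a realistic browser User-Agent instead of the Scrapy default
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"

# Scrapy obeys robots.txt by default, which would block this crawl
ROBOTSTXT_OBEY = False

# Wait between requests to reduce the chance of being rate-limited
DOWNLOAD_DELAY = 2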

  5. Write the parsing function

In the parse function, we need to write code to capture and process LinkedIn company information. We can use XPath or CSS selectors to parse the HTML. For example, the company name on the LinkedIn company information page can be extracted with the following XPath:

//*[@class="org-top-card-module__name ember-view"]/text()

This XPath will select the element with class "org-top-card-module__name ember-view" and return its text value.
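Before wiring a selector into the spider, it is worth testing it interactively with Scrapy's shell. A sketch of that workflow is shown below; note that LinkedIn typically redirects anonymous requests to a login page, so the selector may return nothing without an authenticated session:

scrapy shell "https://www.linkedin.com/search/results/companies/?keywords=apple"
>>> response.xpath('//*[@class="org-top-card-module__name ember-view"]/text()').extract_first()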

The following is the complete company_spider.py file:

import scrapy

class CompanySpider(scrapy.Spider):
    name = "company"
    allowed_domains = ["linkedin.com"]
    start_urls = [
        "https://www.linkedin.com/search/results/companies/?keywords=apple"
    ]

    def parse(self, response):
        # Get the company name
        company_name = response.xpath('//*[@class="org-top-card-module__name ember-view"]/text()').extract_first()

        # Get the company summary (extract_first may return None, so guard before strip)
        company_summary = response.css('.org-top-card-summary__description::text').extract_first()
        company_summary = company_summary.strip() if company_summary else "N/A"

        # Get the company category tags
        company_tags = response.css('.org-top-card-category-list__top-card-category::text').extract()
        company_tags = ','.join(company_tags) if company_tags else "N/A"

        # Get the company employee information
        employees_section = response.xpath('//*[@class="org-company-employees-snackbar__details-info"]')
        employees_current = employees_section.xpath('.//li[1]/span/text()').extract_first() or "N/A"
        employees_past = employees_section.xpath('.//li[2]/span/text()').extract_first() or "N/A"

        # Fall back to "N/A" when the name is missing
        company_name = company_name if company_name else "N/A"

        # Print the scraped results
        print('Company Name: ', company_name)
        print('Company Summary: ', company_summary)
        print('Company Tags: ', company_tags)
        print('\nEmployee Information')
        print('Current: ', employees_current)
        print('Past: ', employees_past)

In the above code, we use XPath and CSS selectors to extract the company's basic information, profile, category tags, and employee information from the page, apply some basic processing (falling back to "N/A" when a field is missing), and print the results.
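Printing is fine for a quick check, but the more idiomatic Scrapy approach is to yield the result as an item so the framework can persist it through its feed exports. A minimal sketch of how the end of parse could look instead (the field names are our own choice):

        # Replace the print calls at the end of parse with a yielded dict;
        # Scrapy treats any dict yielded from a callback as a scraped item
        yield {
            'name': company_name,
            'summary': company_summary,
            'tags': company_tags,
            'employees_current': employees_current,
            'employees_past': employees_past,
        }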

  6. Run Scrapy

Now we have finished writing the spider that crawls and processes the LinkedIn company information page. Next, we need to run Scrapy to execute it. Enter the following command on the command line:

scrapy crawl company

After executing this command, Scrapy will begin to crawl and process the data in the LinkedIn company information page, and output the crawl results.
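If the spider yields items as sketched above, the results can also be written straight to a file using Scrapy's feed exports, for example:

scrapy crawl company -o companies.json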

Summary

The above shows how to use Scrapy to crawl LinkedIn company information. With the help of the Scrapy framework, we can easily carry out large-scale data scraping while also processing and transforming the data, which saves time and effort and improves data collection efficiency.



