
Example tutorial on crawling MOOC course information

Jun 26, 2017 am 10:36 AM
javascript node.js information crawler course

This is my first attempt at writing a crawler in Node.js, so it is a simple one. A key advantage of Node.js for this job is its non-blocking I/O: many HTTP requests can be in flight at the same time.

The crawler fetches course information from MOOC.com (imooc.com) and stores it in a file. It uses the cheerio library, which lets us work with HTML conveniently, much like jQuery.

Before starting, remember to install cheerio:

npm install cheerio

To crawl pages concurrently, each request is wrapped in a Promise:

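// Accepts a URL, fetches the whole page, and returns a Promise that resolves with its HTML
function getPageAsync(url){
    return new Promise((resolve,reject)=>{
        console.log(`Crawling ${url}`);
        http.get(url,function(res){
            let html = '';
            // Decode as UTF-8 so a character split across two chunks is not garbled
            res.setEncoding('utf8');

            res.on('data',function(data){
                html += data;
            });

            res.on('end',function(){
                resolve(html);
            });
        }).on('error',function(err){
            // Connection-level failures are emitted on the request object, not on res
            console.log('Error: ' + err);
            reject(err);
        });
    })
}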
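Response chunks arrive as Buffers unless an encoding is set on the response, and decoding each chunk on its own can garble a multi-byte character that straddles two chunks. The sketch below reproduces the effect without any network access. The chunk boundary is picked by hand purely for illustration.

```javascript
// Simulate a response body that arrives in two chunks, split inside a character.
const whole = Buffer.from('慕课网', 'utf8'); // 3 characters, 3 bytes each
const chunks = [whole.subarray(0, 4), whole.subarray(4)];

// Appending a Buffer to a string decodes that chunk on its own.
let naive = '';
chunks.forEach((chunk) => { naive += chunk; });

// Safer: collect the raw chunks and decode once at the end.
const joined = Buffer.concat(chunks).toString('utf8');

console.log(naive === '慕课网');  // false
console.log(joined === '慕课网'); // true
```

Calling res.setEncoding('utf8') before attaching the data handler is another option; Node.js then buffers incomplete characters itself.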

On MOOC.com every course has an ID, and a course's address is a common base address followed by that ID. We therefore list the IDs of the courses we want in an array in advance and concatenate the base address with each ID to get the course addresses.

const baseUrl = 'http://www.imooc.com/learn/';
const baseNuUrl = 'http://www.imooc.com/course/AjaxCourseMembers?ids=';
// IDs of the courses to fetch
const videosId = [773,371];
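As a small illustration of the concatenation step, the sketch below builds the list of course page addresses. The base address and IDs are the ones from the article; buildCourseUrls is a hypothetical helper name, not part of the original code.

```javascript
const baseUrl = 'http://www.imooc.com/learn/';
const videosId = [773, 371];

// Hypothetical helper: append each ID to the base address.
function buildCourseUrls(base, ids) {
  return ids.map((id) => base + id);
}

const courseUrls = buildCourseUrls(baseUrl, videosId);
console.log(courseUrls);
// [ 'http://www.imooc.com/learn/773', 'http://www.imooc.com/learn/371' ]
```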

To fetch all course pages concurrently, use Promise.all:

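Promise
    // Resolves once every page has been fetched
    .all(courseArray)
    .then((pages)=>{
        // Extracted content for all pages
        let courseData = [];
        // Walk each page and extract what we need
        pages.forEach((html)=>{
            let courses = filterChapter(html);
            courseData.push(courses);
        });
        // Copy each learner count in courseMembers onto the matching course
        for(let i=0;i<courseMembers.length;i++){
            for(let j=0;j<videosId.length;j++){
                if(courseMembers[i].id + '' == videosId[j]){
                    courseData[j].number = courseMembers[i].numbers;
                }
            }
        }
        // Sort by learner count, ascending (the comparator must return a number, not a boolean)
        courseData.sort((a,b)=>{
            return a.number - b.number;
        });
        // Overwrite the file with a header before the crawled content is appended
        // (writeFileSync takes no callback; it throws on failure)
        fs.writeFileSync(outputFile,'### Course information crawled from MOOC.com ###');
        printfData(courseData);
    });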
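The assignment loop above relies on Promise.all returning results in the order of the input array, not in the order the requests finish. A minimal sketch of that behavior, using timers instead of real requests (fakeFetch is a hypothetical stand-in for getPageAsync):

```javascript
// Hypothetical stand-in for getPageAsync: resolves after delayMs milliseconds.
function fakeFetch(id, delayMs) {
  return new Promise((resolve) => {
    setTimeout(() => resolve(`page-${id}`), delayMs);
  });
}

function fetchAll() {
  // The second "request" finishes first, yet the results keep the input order.
  return Promise.all([fakeFetch(773, 30), fakeFetch(371, 10)]);
}

fetchAll().then((pages) => console.log(pages)); // [ 'page-773', 'page-371' ]
```

Note that Promise.all rejects as soon as any one of its promises rejects, so a single failed page makes the whole batch fail unless the error is handled per request.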

In the then callback, pages holds the HTML of each course page. The information we need is extracted from it with the following function:

// Accepts the HTML of a crawled page and extracts the information we need
function filterChapter(html){
    const $ = cheerio.load(html);
    // All chapters
    const chapters = $('.chapter');
    // Course title and learner count
    let title = $('.hd>h2').text();
    let number = 0;
    // Structure of the data returned for each page
    let courseData = {'title':title,'number':number,'videos':[]};

    chapters.each(function(item){
        let chapter = $(this);
        // Chapter title
        let chapterTitle = Trim(chapter.find('strong').text(),'g');
        // Structure of each chapter
        let chapterdata = {'chapterTitle':chapterTitle,'video':[]};
        // All videos in this chapter
        let videos = chapter.find('.video').children('li');
        videos.each(function(item){
            // Video title
            let videoTitle = Trim($(this).find('a.J-media-item').text(),'g');
            // Video ID
            let id = $(this).find('a').attr('href').split('video/')[1];
            chapterdata.video.push({'title':videoTitle,'id':id});
        });

        courseData.videos.push(chapterdata);
    });
    return courseData;
}

Note: the learner count is set to 0 above because the page loads it dynamically via Ajax, so a separate function is written later to fetch it. The Trim() function removes whitespace from the text.
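To make the whitespace handling concrete, here is a minimal sketch of such a helper. trimText is a hypothetical name used only for this illustration: it always trims both ends and, when removeAll is true, also strips whitespace inside the string.

```javascript
// Hypothetical stand-in for Trim(): trim the ends, optionally remove inner whitespace too.
function trimText(str, removeAll) {
  const edgesTrimmed = str.trim();
  return removeAll ? edgesTrimmed.replace(/\s/g, '') : edgesTrimmed;
}

console.log(JSON.stringify(trimText('  Node.js \n crawler  ', false))); // "Node.js \n crawler"
console.log(JSON.stringify(trimText('  Node.js \n crawler  ', true)));  // "Node.jscrawler"
```

Removing inner whitespace suits titles scraped from indented HTML, but it also deletes the spaces between English words, so the second mode is a trade-off rather than a general-purpose cleanup.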

Fetching the learner count for a course:

// Fetch the number of learners for a course
function getNumber(url){

    let datas = '';

    http.get(url,(res)=>{
        res.on('data',(chunk)=>{
            datas += chunk;
        });

        res.on('end',()=>{
            datas = JSON.parse(datas);
            courseMembers.push({'id':datas.data[0].id,'numbers':parseInt(datas.data[0].numbers,10)});
        });
    });
}
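Because getNumber returns nothing, the caller cannot tell when the count has arrived. One possible variant, sketched here under assumptions, returns a Promise instead. The names getNumberAsync and parseMembers are hypothetical, and the response shape { data: [{ id, numbers }] } is inferred from the code above rather than from any documented API.

```javascript
const http = require('http');

// Parse the JSON body; the expected shape is an assumption taken from the article's code.
function parseMembers(body) {
  const first = JSON.parse(body).data[0];
  return { id: String(first.id), numbers: parseInt(first.numbers, 10) };
}

// Hypothetical Promise-returning variant of getNumber.
function getNumberAsync(url) {
  return new Promise((resolve, reject) => {
    http.get(url, (res) => {
      let body = '';
      res.setEncoding('utf8');
      res.on('data', (chunk) => { body += chunk; });
      res.on('end', () => {
        // JSON.parse throws on malformed input; turn that into a rejection.
        try { resolve(parseMembers(body)); } catch (err) { reject(err); }
      });
    }).on('error', reject);
  });
}

// Offline check of the parsing step with a hand-written sample body.
console.log(parseMembers('{"data":[{"id":773,"numbers":"1024"}]}'));
```

With this shape, the returned promises can be passed to Promise.all alongside the page requests, so the counts are guaranteed to be present before the results are merged.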

Each learner count is pushed onto the courseMembers array. At the end, each count is assigned to the corresponding course:

        // Copy each learner count in courseMembers onto the matching course
        for(let i=0;i<courseMembers.length;i++){
            for(let j=0;j<videosId.length;j++){
                if(courseMembers[i].id + '' == videosId[j]){
                    courseData[j].number = courseMembers[i].numbers;
                }
            }
        }
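The nested loop can also be written as a lookup by ID. The sketch below is one way to do that, followed by a numeric sort; attachLearnerCounts is a hypothetical helper, and the titles and counts are made-up sample data.

```javascript
// Hypothetical helper: courses[i] is assumed to belong to ids[i], as in the article.
function attachLearnerCounts(courses, ids, members) {
  const countById = new Map(members.map((m) => [String(m.id), m.numbers]));
  courses.forEach((course, index) => {
    const count = countById.get(String(ids[index]));
    // Leave the default of 0 when no count was fetched for this course.
    if (count !== undefined) course.number = count;
  });
  // sort() expects a negative, zero, or positive number from the comparator.
  return courses.sort((a, b) => a.number - b.number);
}

const sample = attachLearnerCounts(
  [{ title: 'A', number: 0 }, { title: 'B', number: 0 }],
  [773, 371],
  [{ id: '371', numbers: 50 }, { id: '773', numbers: 200 }]
);
console.log(sample.map((c) => `${c.title}:${c.number}`)); // [ 'B:50', 'A:200' ]
```

A comparator such as a.number > b.number returns only true or false, never a negative value, so it cannot express "a comes before b" and the resulting order is not reliable.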

Once we have the data, we save it to a file in a fixed format:

// Write to the file
function writeFile(file,string) {
    // appendFileSync takes no callback; it throws on failure
    try{
        fs.appendFileSync(file,string);
    }catch(err){
        console.log(err);
    }
}
// Print the information
function printfData(coursesData){

    coursesData.forEach((courseData)=>{
        // console.log(`${courseData.number} learners have studied ${courseData.title}\n`);
        writeFile(outputFile,`\n\n${courseData.number} learners have studied ${courseData.title}\n\n`);

        courseData.videos.forEach(function(item){
            let chapterTitle = item.chapterTitle;
            // console.log(chapterTitle + '\n');
            writeFile(outputFile,`\n  ${chapterTitle}\n`);

            item.video.forEach(function(item){
                // console.log('     【' + item.id + '】' + item.title + '\n');
                writeFile(outputFile,`     【${item.id}】  ${item.title}\n`);
            })
        });

    });

}
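An alternative to appending line by line is to build the whole report in memory and write it once. The sketch below assumes the same data structure as above; formatReport is a hypothetical helper, its layout only roughly mirrors printfData, and the course shown is made-up sample data written to a temporary file.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: turn the crawled structure into one string.
function formatReport(courses) {
  const lines = ['### Course information crawled from MOOC.com ###'];
  courses.forEach((course) => {
    lines.push('', `${course.number} learners have studied ${course.title}`, '');
    course.videos.forEach((chapter) => {
      lines.push(`  ${chapter.chapterTitle}`);
      chapter.video.forEach((video) => {
        lines.push(`     【${video.id}】  ${video.title}`);
      });
    });
  });
  return lines.join('\n') + '\n';
}

// Made-up sample data, written to a temporary file with a single call.
const report = formatReport([
  { title: 'Sample course', number: 42,
    videos: [{ chapterTitle: 'Chapter 1', video: [{ id: '1001', title: 'Intro' }] }] }
]);
const reportFile = path.join(os.tmpdir(), 'imooc-report-demo.txt');
fs.writeFileSync(reportFile, report);
console.log(report);
```

A single write keeps the formatting logic separate from the I/O and avoids reopening the file for every line.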

The final data obtained:

Source code:

/**
 * Created by hp-pc on 2017/6/7 0007.
 */
const http = require('http');
const fs = require('fs');
const cheerio = require('cheerio');
const baseUrl = 'http://www.imooc.com/learn/';
const baseNuUrl = 'http://www.imooc.com/course/AjaxCourseMembers?ids=';
// IDs of the courses to fetch
const videosId = [773,371];
// Output file
const outputFile = 'test.txt';
// Learner counts, one entry per course
let courseMembers = [];
// Remove whitespace from a string
function Trim(str,is_global)
{
    let  result;
    result = str.replace(/(^\s+)|(\s+$)/g,"");
    if(is_global.toLowerCase()=="g")
    {
        result = result.replace(/\s/g,"");
    }
    return result;
}
// Accepts a URL, fetches the whole page, and returns a Promise that resolves with its HTML
function getPageAsync(url){
    return new Promise((resolve,reject)=>{
        console.log(`Crawling ${url}`);
        http.get(url,function(res){
            let html = '';
            // Decode as UTF-8 so a character split across two chunks is not garbled
            res.setEncoding('utf8');

            res.on('data',function(data){
                html += data;
            });

            res.on('end',function(){
                resolve(html);
            });
        }).on('error',function(err){
            // Connection-level failures are emitted on the request object, not on res
            console.log('Error: ' + err);
            reject(err);
        });
    })
}
// Accepts the HTML of a crawled page and extracts the information we need
function filterChapter(html){
    const $ = cheerio.load(html);
    // All chapters
    const chapters = $('.chapter');
    // Course title and learner count
    let title = $('.hd>h2').text();
    let number = 0;
    // Structure of the data returned for each page
    let courseData = {'title':title,'number':number,'videos':[]};

    chapters.each(function(item){
        let chapter = $(this);
        // Chapter title
        let chapterTitle = Trim(chapter.find('strong').text(),'g');
        // Structure of each chapter
        let chapterdata = {'chapterTitle':chapterTitle,'video':[]};
        // All videos in this chapter
        let videos = chapter.find('.video').children('li');
        videos.each(function(item){
            // Video title
            let videoTitle = Trim($(this).find('a.J-media-item').text(),'g');
            // Video ID
            let id = $(this).find('a').attr('href').split('video/')[1];
            chapterdata.video.push({'title':videoTitle,'id':id});
        });

        courseData.videos.push(chapterdata);
    });
    return courseData;
}
// Fetch the number of learners for a course
function getNumber(url){

    let datas = '';

    http.get(url,(res)=>{
        res.on('data',(chunk)=>{
            datas += chunk;
        });

        res.on('end',()=>{
            datas = JSON.parse(datas);
            courseMembers.push({'id':datas.data[0].id,'numbers':parseInt(datas.data[0].numbers,10)});
        });
    });
}
// Write to the file
function writeFile(file,string) {
    // appendFileSync takes no callback; it throws on failure
    try{
        fs.appendFileSync(file,string);
    }catch(err){
        console.log(err);
    }
}
// Print the information
function printfData(coursesData){

    coursesData.forEach((courseData)=>{
        // console.log(`${courseData.number} learners have studied ${courseData.title}\n`);
        writeFile(outputFile,`\n\n${courseData.number} learners have studied ${courseData.title}\n\n`);

        courseData.videos.forEach(function(item){
            let chapterTitle = item.chapterTitle;
            // console.log(chapterTitle + '\n');
            writeFile(outputFile,`\n  ${chapterTitle}\n`);

            item.video.forEach(function(item){
                // console.log('     【' + item.id + '】' + item.title + '\n');
                writeFile(outputFile,`     【${item.id}】  ${item.title}\n`);
            })
        });

    });

}
// Promises for all page requests
let courseArray = [];
// Loop over videosId, concatenate each ID with baseUrl, and fetch the page
videosId.forEach((id)=>{
    // Collect the Promise returned for each page request
    courseArray.push(getPageAsync(baseUrl + id));
    // Fetch the learner count
    // (not awaited: a count that arrives after Promise.all resolves leaves that course at 0)
    getNumber(baseNuUrl + id);
});

Promise
    // Resolves once every page has been fetched
    .all(courseArray)
    .then((pages)=>{
        // Extracted content for all pages
        let courseData = [];
        // Walk each page and extract what we need
        pages.forEach((html)=>{
            let courses = filterChapter(html);
            courseData.push(courses);
        });
        // Copy each learner count in courseMembers onto the matching course
        for(let i=0;i<courseMembers.length;i++){
            for(let j=0;j<videosId.length;j++){
                if(courseMembers[i].id + '' == videosId[j]){
                    courseData[j].number = courseMembers[i].numbers;
                }
            }
        }
        // Sort by learner count, ascending (the comparator must return a number, not a boolean)
        courseData.sort((a,b)=>{
            return a.number - b.number;
        });
        // Overwrite the file with a header before the crawled content is appended
        fs.writeFileSync(outputFile,'### Course information crawled from MOOC.com ###');
        printfData(courseData);
    });
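In the program above the learner counts are requested without being waited for, so whether they are present when the pages are processed depends on timing. One way to wire both kinds of request together is sketched below. crawlCourses is a hypothetical function, and the two fetchers are injected so the wiring can be tried without network access; in the real crawler they would be getPageAsync and a Promise-returning version of getNumber.

```javascript
// Hypothetical wiring: wait for every page and every count before combining them.
function crawlCourses(ids, fetchPage, fetchCount) {
  const pagePromises = ids.map((id) => fetchPage(id));
  const countPromises = ids.map((id) => fetchCount(id));
  return Promise.all([Promise.all(pagePromises), Promise.all(countPromises)])
    .then(([pages, counts]) =>
      ids.map((id, i) => ({ id, page: pages[i], number: counts[i] })));
}

// Made-up stand-ins for the real network calls.
const fakePage = (id) => Promise.resolve(`<h2>course ${id}</h2>`);
const fakeCount = (id) =>
  new Promise((resolve) => setTimeout(() => resolve(id * 2), 10));

crawlCourses([773, 371], fakePage, fakeCount)
  .then((result) => console.log(result));
```

All requests still start immediately and run concurrently; the only change is that the merge step does not begin until both groups have finished.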