


Detailed explanation of node's simulated login and crawling steps based on puppeteer
This time I will bring you a detailed explanation of the steps for node to simulate login and capture based on puppeteer. What are the precautions for node to simulate login and capture based on puppeteer? The following is a practical case, let's take a look.
About heat map
In the website analysis industry, website heat map can well reflect the user's operating behavior on the website. Specific analysis User preferences, targeted optimization of the website, an example of a heat map (from ptengine)Mainstream implementation methods of heat maps
Generally, the following stages are required to implement heat map display: 1. Obtain the website Page2. Obtain processed user data
3. Draw heat map
This article mainly focuses on stage 1 to introduce in detail the mainstream implementation method of obtaining website pages in heat map
4. Use iframe to directly embed the user website
5. Grab the user page and save it locally, and embed local resources through iframe (the so-called local resources are considered to be the analysis tool side)
if(window.top !== window.self){ window.top.location = window.location;} ), in this case you need The client's website has to do some work before it can be loaded by the iframe of the analysis tool, and it may not be so convenient to use because not all website users who need to detect and analyze the website can manage the website.
user login authorization, cannot crawl the page where the user has set a clear setting, etc.
Both methods have https and http resources. Another problem caused by the same-origin policy is that the https station cannot load http resources, so for the best compatibility, the heat map analysis tool needs to be applied with the http protocol. , of course, specific sub-station optimization can be carried out according to the customer websites visited.How to optimize the crawling of website pages
Here we do some optimization based on puppeteer to improve the crawling of the problems encountered in crawling website pages The probability of success mainly optimizes the following two pages:1.spa page
The spa page is considered mainstream on the current page, but it is always well-known for its Unfriendly to search engines; the usual page crawler is actually a simple crawler, and the process usually involves initiating an http get request to the user's website (should be the user's website server). This crawling method itself has problems. First of all, the direct request is to the user server. The user server should have many restrictions on non-browser agents and need to be bypassed. Secondly, the request returns the original content, which needs to be bypassed. The part rendered through js in the browser cannot be obtained (of course, after the iframe is embedded, js execution will still make up for this problem to a certain extent). Finally, if the page is a spa page, then only the template is obtained at this time. In the heat map The display effect is very unfriendly.In response to this situation, if it is done based on puppeteer, the process becomes
puppeteer starts the browser to open the user website-->page rendering-->returns the rendered result. Simply use pseudo code to implement it as follows:const puppeteer = require('puppeteer'); async getHtml = (url) =>{ const browser = await puppeteer.launch(); const page = await browser.newPage(); await page.goto(url); return await page.content(); }
Pages that require login
There are actually many situations for pages that require login:
需要登录才可以查看页面,如果没有登录,则跳转到login页面(各种管理系统)
对于这种类型的页面我们需要做的就是模拟登录,所谓模拟登录就是让浏览器去登录,这里需要用户提供对应网站的用户名和密码,然后我们走如下的流程:
访问用户网站-->用户网站检测到未登录跳转到login-->puppeteer控制浏览器自动登录后跳转到真正需要抓取的页面,可用如下伪代码来说明:
const puppeteer = require("puppeteer"); async autoLogin =(url)=>{ const browser = await puppeteer.launch(); const page =await browser.newPage(); await page.goto(url); await page.waitForNavigation(); //登录 await page.type('#username',"用户提供的用户名"); await page.type('#password','用户提供的密码'); await page.click('#btn_login'); //页面登录成功后,需要保证redirect 跳转到请求的页面 await page.waitForNavigation(); return await page.content(); }
登录与否都可以查看页面,只是登录后看到内容会所有不同 (各种电商或者portal页面)
这种情况处理会比较简单一些,可以简单的认为是如下步骤:
通过puppeteer启动浏览器打开请求页面-->点击登录按钮-->输入用户名和密码登录 -->重新加载页面
基本代码如下图:
const puppeteer = require("puppeteer"); async autoLoginV2 =(url)=>{ const browser = await puppeteer.launch(); const page =await browser.newPage(); await page.goto(url); await page.click('#btn_show_login'); //登录 await page.type('#username',"用户提供的用户名"); await page.type('#password','用户提供的密码'); await page.click('#btn_login'); //页面登录成功后,是否需要reload 根据实际情况来确定 await page.reload(); return await page.content(); }
总结
明天总结吧,今天下班了。
补充(还昨天的债):基于puppeteer虽然可以很友好的抓取页面内容,但是也存在这很多的局限
1.抓取的内容为渲染后的原始html,即资源路径(css、image、javascript)等都是相对路径,保存到本地后无法正常显示,需要特殊处理(js不需要特殊处理,甚至可以移除,因为渲染的结构已经完成)
2.通过puppeteer抓取页面性能会比直接http get 性能会差一些,因为多了渲染的过程
3.同样无法保证页面的完整性,只是很大的提高了完整的概率,虽然通过page对象提供的各种wait 方法能够解决这个问题,但是网站不同,处理方式就会不同,无法复用。
相信看了本文案例你已经掌握了方法,更多精彩请关注php中文网其它相关文章!
推荐阅读:
The above is the detailed content of Detailed explanation of node's simulated login and crawling steps based on puppeteer. For more information, please follow other related articles on the PHP Chinese website!

Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

Video Face Swap
Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

Hot Tools

Notepad++7.3.1
Easy-to-use and free code editor

SublimeText3 Chinese version
Chinese version, very easy to use

Zend Studio 13.0.1
Powerful PHP integrated development environment

Dreamweaver CS6
Visual web development tools

SublimeText3 Mac version
God-level code editing software (SublimeText3)

Hot Topics

With the rapid development of social media, Xiaohongshu has become a popular platform for many young people to share their lives and explore new products. During use, sometimes users may encounter difficulties logging into previous accounts. This article will discuss in detail how to solve the problem of logging into the old account on Xiaohongshu, and how to deal with the possibility of losing the original account after changing the binding. 1. How to log in to Xiaohongshu’s previous account? 1. Retrieve password and log in. If you do not log in to Xiaohongshu for a long time, your account may be recycled by the system. In order to restore access rights, you can try to log in to your account again by retrieving your password. The operation steps are as follows: (1) Open the Xiaohongshu App or official website and click the "Login" button. (2) Select "Retrieve Password". (3) Enter the mobile phone number you used when registering your account

When you log in to someone else's steam account on your computer, and that other person's account happens to have wallpaper software, steam will automatically download the wallpapers subscribed to the other person's account after switching back to your own account. Users can solve this problem by turning off steam cloud synchronization. What to do if wallpaperengine downloads other people's wallpapers after logging into another account 1. Log in to your own steam account, find cloud synchronization in settings, and turn off steam cloud synchronization. 2. Log in to someone else's Steam account you logged in before, open the Wallpaper Creative Workshop, find the subscription content, and then cancel all subscriptions. (In case you cannot find the wallpaper in the future, you can collect it first and then cancel the subscription) 3. Switch back to your own steam

Thousands of ghosts screamed in the mountains and fields, and the sound of the exchange of weapons disappeared. The ghost generals who rushed over the mountains, with fighting spirit raging in their hearts, used the fire as their trumpet to lead hundreds of ghosts to charge into the battle. [Blazing Flame Bairen·Ibaraki Doji Collection Skin is now online] The ghost horns are blazing with flames, the gilt eyes are bursting with unruly fighting spirit, and the white jade armor pieces decorate the shirt, showing the unruly and wild momentum of the great demon. On the snow-white fluttering sleeves, red flames clung to and intertwined, and gold patterns were imprinted on them, igniting a crimson and magical color. The will-o'-the-wisps formed by the condensed demon power roared, and the fierce flames shook the mountains. Demons and ghosts who have returned from purgatory, let's punish the intruders together. [Exclusive dynamic avatar frame·Blazing Flame Bailian] [Exclusive illustration·Firework General Soul] [Biography Appreciation] [How to obtain] Ibaraki Doji’s collection skin·Blazing Flame Bailian will be available in the skin store after maintenance on December 28.

The solution to the Discuz background login problem is revealed. Specific code examples are needed. With the rapid development of the Internet, website construction has become more and more common, and Discuz, as a commonly used forum website building system, has been favored by many webmasters. However, precisely because of its powerful functions, sometimes we encounter some problems when using Discuz, such as background login problems. Today, we will reveal the solution to the Discuz background login problem and provide specific code examples. We hope to help those in need.

Recently, some friends have asked me how to log in to the Kuaishou computer version. Here is the login method for the Kuaishou computer version. Friends who need it can come and learn more. Step 1: First, search Kuaishou official website on Baidu on your computer’s browser. Step 2: Select the first item in the search results list. Step 3: After entering the main page of Kuaishou official website, click on the video option. Step 4: Click on the user avatar in the upper right corner. Step 5: Click the QR code to log in in the pop-up login menu. Step 6: Then open Kuaishou on your phone and click on the icon in the upper left corner. Step 7: Click on the QR code logo. Step 8: After clicking the scan icon in the upper right corner of the My QR code interface, scan the QR code on your computer. Step 9: Finally log in to the computer version of Kuaishou

GitHubCopilot is the next level for coders, with an AI-based model that successfully predicts and autocompletes your code. However, you might be wondering how to get this AI genius on your device so that your coding becomes even easier! However, using GitHub isn't exactly easy, and the initial setup process is a tricky one. Therefore, we created this step-by-step tutorial on how to install and implement GitHub Copilot in VSCode on Windows 11, 10. How to install GitHubCopilot on Windows There are several steps to this process. So, follow the steps below now. Step 1 – You must have the latest version of Visual Studio installed on your computer

How to log in to two devices with Quark? Quark Browser supports logging into two devices at the same time, but most friends don’t know how to log in to two devices with Quark Browser. Next, the editor brings users Quark to log in to two devices. Method graphic tutorials, interested users come and take a look! Quark Browser usage tutorial Quark how to log in to two devices 1. First open the Quark Browser APP and click [Quark Network Disk] on the main page; 2. Then enter the Quark Network Disk interface and select the [My Backup] service function; 3. Finally, select [Switch Device] to log in to two new devices.

Baidu Netdisk can not only store various software resources, but also share them with others. It supports multi-terminal synchronization. If your computer does not have a client downloaded, you can choose to enter the web version. So how to log in to Baidu Netdisk web version? Let’s take a look at the detailed introduction. Baidu Netdisk web version login entrance: https://pan.baidu.com (copy the link to open in the browser) Software introduction 1. Sharing Provides file sharing function, users can organize files and share them with friends in need. 2. Cloud: It does not take up too much memory. Most files are saved in the cloud, effectively saving computer space. 3. Photo album: Supports the cloud photo album function, import photos to the cloud disk, and then organize them for everyone to view.
