Table of Contents
select机制
关于webmagic的后续打算
Home Web Front-end HTML Tutorial Jsoup代码解读之七-实现一个CSS Selector_html/css_WEB-ITnose

Jsoup代码解读之七-实现一个CSS Selector_html/css_WEB-ITnose

Jun 21, 2016 am 08:56 AM

当当当!终于来到了Jsoup的特色:CSS Selector部分。selector也是我写的爬虫框架webmagic开发的一个重点。附上一张street fighter的图,希望以后webmagic也能挑战Jsoup!

select机制

Jsoup的select包里,类结构如下:

在最开始介绍Jsoup的时候,就已经说过NodeVisitor和Selector了。Selector是select部分的对外facade,而NodeVisitor则是遍历树的底层API,CSS Selector也是根据NodeVisitor实现的遍历。

Jsoup的select核心是Evaluator。Selector所传递的表达式,会经过QueryParser,最终编译成一个Evaluator。Evaluator是一个抽象类,它只有一个方法:

public abstract boolean matches(Element root, Element element);
Copy after login

注意这里传入了root,是为了某些情况下对树进行遍历时用的。

Evaluator的设计简洁明了,所有的Selector表达式单词都会编译到对应的Evaluator。例如#xx对应Id,.xx对应Class,[]对应Attribute。这里补充一下w3c的CSS Selector规范:http://www.w3.org/TR/CSS2/selector.html

当然,只靠这几个还不够,Jsoup还定义了CombiningEvaluator(对Evaluator进行And/Or组合),StructuralEvaluator(结合DOM树结构进行筛选)。

这里我们可能最关心的是,“div ul li”这样的父子结构是如何实现的。这个的实现方式在StructuralEvaluator.Parent中,贴一下代码了:

static class Parent extends StructuralEvaluator { public Parent(Evaluator evaluator) { this.evaluator = evaluator; }public boolean matches(Element root, Element element) { if (root == element) return false;Element parent = element.parent(); while (parent != root) { if (evaluator.matches(root, parent)) return true; parent = parent.parent(); } return false; }}
Copy after login

这里Parent包含了一个evaluator属性,会根据这个evaluator去验证所有父节点。注意Parent是可以嵌套的,所以这个表达式”div ul li”最终会编译成And(Parent(And(Parent(Tag(“div”)),Tag(“ul”)),Tag(“li”)))这样的Evaluator组合。

select部分比想象的要简单,代码可读性也很高。经过了parser部分的研究,这部分应该算是驾轻就熟了。

关于webmagic的后续打算

webmagic是一个爬虫框架,它的Selector是用于抓取HTML中指定的文本,其机制和Jsoup的Evaluator非常像,只不过webmagic暂时是将Selector封装成较简单的API,而Evaluator直接上了表达式。之前也考虑过自己定制DSL来写一个HTML,现在看了Jsoup的源码,实现能力算是有了,但是引入DSL,实现只是一小部分,如何让DSL易写易懂才是难点。

其实看了Jsoup的源码,精细程度上比webmagic要好得多了,基本每个类都对应一个真实的概念抽象,可能以后会在这方面下点工夫。

下篇文章将讲最后一部分:白名单及HTML过滤机制。

最后依然附上这系列文章和代码的github地址:https://github.com/code4craft/jsoup-learning

Statement of this Website
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

Video Face Swap

Video Face Swap

Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

Roblox: Bubble Gum Simulator Infinity - How To Get And Use Royal Keys
3 weeks ago By 尊渡假赌尊渡假赌尊渡假赌
Nordhold: Fusion System, Explained
3 weeks ago By 尊渡假赌尊渡假赌尊渡假赌
Mandragora: Whispers Of The Witch Tree - How To Unlock The Grappling Hook
3 weeks ago By 尊渡假赌尊渡假赌尊渡假赌

Hot Tools

Notepad++7.3.1

Notepad++7.3.1

Easy-to-use and free code editor

SublimeText3 Chinese version

SublimeText3 Chinese version

Chinese version, very easy to use

Zend Studio 13.0.1

Zend Studio 13.0.1

Powerful PHP integrated development environment

Dreamweaver CS6

Dreamweaver CS6

Visual web development tools

SublimeText3 Mac version

SublimeText3 Mac version

God-level code editing software (SublimeText3)

Hot Topics

Java Tutorial
1666
14
PHP Tutorial
1273
29
C# Tutorial
1252
24
Understanding HTML, CSS, and JavaScript: A Beginner's Guide Understanding HTML, CSS, and JavaScript: A Beginner's Guide Apr 12, 2025 am 12:02 AM

WebdevelopmentreliesonHTML,CSS,andJavaScript:1)HTMLstructurescontent,2)CSSstylesit,and3)JavaScriptaddsinteractivity,formingthebasisofmodernwebexperiences.

HTML: The Structure, CSS: The Style, JavaScript: The Behavior HTML: The Structure, CSS: The Style, JavaScript: The Behavior Apr 18, 2025 am 12:09 AM

The roles of HTML, CSS and JavaScript in web development are: 1. HTML defines the web page structure, 2. CSS controls the web page style, and 3. JavaScript adds dynamic behavior. Together, they build the framework, aesthetics and interactivity of modern websites.

The Future of HTML, CSS, and JavaScript: Web Development Trends The Future of HTML, CSS, and JavaScript: Web Development Trends Apr 19, 2025 am 12:02 AM

The future trends of HTML are semantics and web components, the future trends of CSS are CSS-in-JS and CSSHoudini, and the future trends of JavaScript are WebAssembly and Serverless. 1. HTML semantics improve accessibility and SEO effects, and Web components improve development efficiency, but attention should be paid to browser compatibility. 2. CSS-in-JS enhances style management flexibility but may increase file size. CSSHoudini allows direct operation of CSS rendering. 3.WebAssembly optimizes browser application performance but has a steep learning curve, and Serverless simplifies development but requires optimization of cold start problems.

The Future of HTML: Evolution and Trends in Web Design The Future of HTML: Evolution and Trends in Web Design Apr 17, 2025 am 12:12 AM

The future of HTML is full of infinite possibilities. 1) New features and standards will include more semantic tags and the popularity of WebComponents. 2) The web design trend will continue to develop towards responsive and accessible design. 3) Performance optimization will improve the user experience through responsive image loading and lazy loading technologies.

HTML vs. CSS vs. JavaScript: A Comparative Overview HTML vs. CSS vs. JavaScript: A Comparative Overview Apr 16, 2025 am 12:04 AM

The roles of HTML, CSS and JavaScript in web development are: HTML is responsible for content structure, CSS is responsible for style, and JavaScript is responsible for dynamic behavior. 1. HTML defines the web page structure and content through tags to ensure semantics. 2. CSS controls the web page style through selectors and attributes to make it beautiful and easy to read. 3. JavaScript controls web page behavior through scripts to achieve dynamic and interactive functions.

HTML: Building the Structure of Web Pages HTML: Building the Structure of Web Pages Apr 14, 2025 am 12:14 AM

HTML is the cornerstone of building web page structure. 1. HTML defines the content structure and semantics, and uses, etc. tags. 2. Provide semantic markers, such as, etc., to improve SEO effect. 3. To realize user interaction through tags, pay attention to form verification. 4. Use advanced elements such as, combined with JavaScript to achieve dynamic effects. 5. Common errors include unclosed labels and unquoted attribute values, and verification tools are required. 6. Optimization strategies include reducing HTTP requests, compressing HTML, using semantic tags, etc.

HTML vs. CSS and JavaScript: Comparing Web Technologies HTML vs. CSS and JavaScript: Comparing Web Technologies Apr 23, 2025 am 12:05 AM

HTML, CSS and JavaScript are the core technologies for building modern web pages: 1. HTML defines the web page structure, 2. CSS is responsible for the appearance of the web page, 3. JavaScript provides web page dynamics and interactivity, and they work together to create a website with a good user experience.

HTML: Is It a Programming Language or Something Else? HTML: Is It a Programming Language or Something Else? Apr 15, 2025 am 12:13 AM

HTMLisnotaprogramminglanguage;itisamarkuplanguage.1)HTMLstructuresandformatswebcontentusingtags.2)ItworkswithCSSforstylingandJavaScriptforinteractivity,enhancingwebdevelopment.

See all articles