Scraping Weibo hot search terms: using simple_html_dom to work with HTML data

Jun 06, 2016, 08:09 PM

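Parsing an HTML document tree in PHP has long been awkward. Simple HTML DOM parser handles this well: it is a PHP class (PHP 5+) that parses HTML documents and lets you manipulate their elements. The parser does more than help validate HTML documents: it can also parse HTML that does not conform to W3C standards. It uses jQuery-like selectors to locate elements by id, class, tag, and so on, and it can add, remove, and modify nodes in the document tree, which makes it about as convenient as jQuery.

There are three ways to load a document into the class:
- from a URL
- from a string
- from a file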
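<?php
include 'simple_html_dom.php';
// Create a new DOM instance
$html = new simple_html_dom();
// Load from a URL
$html->load_file('http://www.xxx.com');
// Load from a string
$html->load('<html><body>Loading an HTML document from a string</body></html>');
// Load from a file
$html->load_file('path/file/test.html');
?>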

Finding HTML elements

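Use the find function to locate elements in the HTML document. Called with only a selector, it returns an array of element objects; called with a selector and an index, it returns a single element. You then work with those objects through the methods of the HTML DOM parser class. A few examples: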
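<?php
// Find all anchor (<a>) elements in the document; returns an array
$a = $html->find('a');
// Find the anchor at index N (zero-based); returns null if it is not found
$a = $html->find('a', 0);
// Find the div element whose id is "main"
$main = $html->find('div[id=main]', 0);
// Find all div elements that have an id attribute
$divs = $html->find('div[id]');
// Find all elements that have an id attribute
$divs = $html->find('[id]');
?>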
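<?php
// Find the elements whose id is "container"
$ret = $html->find('#container');
// Find all elements with class "foo"
$ret = $html->find('.foo');
// Find several kinds of tags at once
$ret = $html->find('a, img');
// Selectors can also be combined with attribute filters
$ret = $html->find('a[title], img[title]');
?>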
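<?php
// $e is any element object returned by find()
// Return the parent element
$e->parent();
// Return an array of child elements
$e->children();
// Return the child element at the given index
$e->children(0);
// Return the first child element
$e->first_child();
// Return the last child element
$e->last_child();
// Return the previous sibling element
$e->prev_sibling();
// Return the next sibling element
$e->next_sibling();
?>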
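As a rough illustration of how find and the traversal methods combine, the sketch below parses a small made-up snippet. It assumes simple_html_dom.php is available next to the script; the sample markup and the variable names are invented for this example.

```php
<?php
require_once 'simple_html_dom.php'; // assumption: the library file sits next to this script

// Sample markup made up for this illustration.
$html = str_get_html('<ul id="menu"><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul>');

// find() without an index returns an array, so it can be looped over directly.
$links = array();
foreach ($html->find('#menu a') as $link) {
    // Attributes are read as properties; parent() steps up to the enclosing <li>.
    $links[] = $link->href . ' in <' . $link->parent()->tag . '>';
}

// find() with an index returns a single element, or null when nothing matches.
$first = $html->find('li', 0);
$second_text = $first !== null ? $first->next_sibling()->plaintext : null;

// Copy out what you need before releasing the parser.
$html->clear();

print_r($links);
echo $second_text, "\n";
```

The values are copied into plain strings before clear() is called, so nothing depends on the parsed tree afterwards.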

Working with element attributes

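Attribute selectors support a small set of simple, regex-like operators:
- [attribute]: elements that have the given attribute
- [attribute=value]: elements whose attribute equals the given value
- [attribute!=value]: elements whose attribute does not equal the given value
- [attribute^=value]: elements whose attribute starts with the given value
- [attribute$=value]: elements whose attribute ends with the given value
- [attribute*=value]: elements whose attribute contains the given value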
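To make the operator semantics concrete, here is a minimal sketch in plain PHP that does not use the library at all. The helper attribute_matches is hypothetical (it is not part of simple_html_dom) and only mirrors the six rules listed above. Treating a missing attribute as a non-match for every rule is an assumption of this sketch, and the library's own matching may differ in details such as case handling.

```php
<?php
// Hypothetical helper mirroring the six attribute-selector rules listed above.
// $actual is the element's attribute value, or null when the attribute is absent.
// $operator is '' for the bare [attribute] form.
function attribute_matches($actual, $operator, $expected = '')
{
    if ($actual === null) {
        // Assumption: a missing attribute satisfies none of the rules.
        return false;
    }
    switch ($operator) {
        case '':   // [attribute]
            return true;
        case '=':  // [attribute=value]
            return $actual === $expected;
        case '!=': // [attribute!=value]
            return $actual !== $expected;
        case '^=': // [attribute^=value]
            return $expected !== '' && strncmp($actual, $expected, strlen($expected)) === 0;
        case '$=': // [attribute$=value]
            return $expected !== '' && substr($actual, -strlen($expected)) === $expected;
        case '*=': // [attribute*=value]
            return $expected !== '' && strpos($actual, $expected) !== false;
        default:
            throw new InvalidArgumentException("Unknown operator: $operator");
    }
}

// Example value made up for this illustration.
$href = 'https://example.com/logo.png';
var_dump(attribute_matches($href, '^=', 'https://')); // bool(true)
var_dump(attribute_matches($href, '$=', '.gif'));     // bool(false)
var_dump(attribute_matches($href, '*=', 'example'));  // bool(true)
```

With the library itself, the same rules are written inside the selector string, for example find('a[href^=https://]').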

How to keep the parser from using too much memory

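The Simple HTML DOM parser can sometimes use a lot of memory. If a PHP script takes up too much memory, the site may stop responding, among other serious problems. The fix is simple: once the parser has loaded an HTML document and you have finished with it, remember to clean up the object.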
<?php
// Release the parser's internal node references once you are done with the document
$html->clear();
?>
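The cleanup matters most when one script parses many documents in a row. A minimal sketch of that pattern follows, assuming simple_html_dom.php is available next to the script; the three sample documents and the variable names are made up for illustration.

```php
<?php
require_once 'simple_html_dom.php'; // assumption: the library file sits next to this script

// Sample documents made up for this illustration.
$pages = array('<p>one</p>', '<p>two</p>', '<p>three</p>');

$texts = array();
foreach ($pages as $markup) {
    $html = str_get_html($markup);
    // Copy the value out as a plain string before the tree is released.
    $texts[] = $html->find('p', 0)->plaintext;
    // Release this document before moving on to the next one.
    $html->clear();
    unset($html);
}

print_r($texts);
```

Clearing inside the loop keeps at most one parsed tree alive at a time, rather than letting them accumulate until the script ends.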
Now let's look at a source code example that scrapes Weibo hot search terms.
<?php
// The output is GBK-encoded, matching the iconv conversion in unicode_hex_2_gbk()
header('Content-Type:text/html;charset=gbk');
include "simple_html_dom.php";
class Tmemcache {
    protected $memcache;
    function __construct($cluster) {
        $this->memcache = new Memcache;
        foreach ($cluster['memcached'] as $server) {
            $this->memcache->addServer($server['host'], $server['port']);
        }
    }
    function fetch($cache_key) {
        return $this->memcache->get($cache_key);
    }
    function store($cache_key, $val, $expire = 7200) {
        $this->memcache->set($cache_key, $val, MEMCACHE_COMPRESSED, $expire);
    }
    function flush() {
        $this->memcache->flush();
    }
    function delete($cache_key, $timeout = 0) {
        $this->memcache->delete($cache_key, $timeout);
    }
}
// Decode \uXXXX escapes in $name (via json_decode) and convert the result from UTF-8 to GBK
function unicode_hex_2_gbk($name) {
    $a = json_decode('{"a":"' . $name . '"}');
    if (isset($a) && is_object($a)) {
        return iconv('UTF-8', 'GBK//IGNORE', $a->a);
    }
    return null;
}
function curl_fetch($url, $time = 3) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_TIMEOUT, $time);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $data = curl_exec($ch);
    $errno = curl_errno($ch);
    if ($errno > 0) {
        $err = "[CURL] url:{$url} ; errno:{$errno} ; info:" . curl_error($ch) . ";";
        echo $err;
        $data = false;
    }
    curl_close($ch);
    return $data;
}
$cluster["memcached"] = array(
    array("host" => "10.11.1.1", "port" => 11211),
);
//$memcache = new Tmemcache($cluster);
$url = "http://s.weibo.com/top/summary?cate=total&key=event";
$cache_key = md5("weibo" . $url);
//$str = $memcache->fetch($cache_key);
//if (!isset($_GET["nocache"]) && !empty($str)) {
//    echo $str;
//    exit;
//}
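$content = curl_fetch($url);
if ($content === false) {
    exit;
}
$html = str_get_html($content);
// The ranking markup is embedded as an escaped string inside the ninth <script> element
$script = $html->find('script', 8);
if ($script === null) {
    exit;
}
// Unescape the embedded markup (the element is cast to its outer HTML)
$a = str_replace(array('\\"', '\\/', "\\n", "\\t"), array('"', '/', "", ""), (string) $script);
// The search string is empty in the source text; it likely held the tag where the markup starts
$pos = strpos($a, '');
$a = substr($a, $pos);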
////////
//echo "";
//echo ($a);
//echo "";
$html = str_get_html($a);
$arr = array();
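$table = $html->find('table[id=event]', 0);
// find() with an index returns null when nothing matches, so fall back to an empty list
foreach ($table ? $table->find('.rank_content') : array() as $element) {
    $arr[] = unicode_hex_2_gbk($element->find("a", 0)->plaintext);
}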
$html->clear();
$str = implode(",", $arr);
//if (!isset($_GET["nocache"]))
//    $memcache->store($cache_key, $str, 3600);
echo $str;
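The unicode_hex_2_gbk helper works because json_decode expands \uXXXX escape sequences inside a JSON string into UTF-8 characters. Below is a minimal sketch of that step on its own, written in plain PHP. The function name decode_unicode_escapes and the sample input are illustrative assumptions, not part of the original script.

```php
<?php
// Hypothetical helper: decode \uXXXX escapes by letting json_decode parse the
// text as the body of a JSON string. Returns null when the text is not a valid
// JSON string body (for example, when it contains an unescaped double quote).
function decode_unicode_escapes($text)
{
    $decoded = json_decode('"' . $text . '"');
    return is_string($decoded) ? $decoded : null;
}

// Sample input: the escaped form of the two characters U+5FAE U+535A ("Weibo").
$escaped = '\u5fae\u535a';
$utf8 = decode_unicode_escapes($escaped);

// Print the raw UTF-8 bytes as hex so the result does not depend on the terminal encoding.
echo bin2hex($utf8), "\n"; // e5beaee58d9a
```

The original helper then passes the decoded string through iconv('UTF-8', 'GBK//IGNORE', ...) because the script serves its output as GBK.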