Scraping Links With PHP-tutorial php-php.cn

Rumah

pembangunan bahagian belakang

tutorial php

Scraping Links With PHP

WBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWB

Jun 23, 2016 pm 02:36 PM

Scraping Links With PHP

by justin on August 11, 2007

FROM:http://www.merchantos.com/makebeta/php/scraping-links-with-php/#curl_content

In this tutorial you will learn how to build a PHP script that scrapes links from any web page.

What You’ll Learn How to use cURL to get the content from a website (URL). Call PHP DOM functions to parse the HTML so you can extract links. Use XPath to grab links from specific parts of a page. Store the scraped links in a MySQL database. Put it all together into a link scraper. What else you could use a scraper for. Legal issues associated with scraping content. What You Will Need Basic knowledge of PHP and MySQL. A web server running PHP 5. The cURL extension for PHP. MySQL ? if you want to store the links. Get The Page Content

cURL is a great tool for making requests to remote servers in PHP. It can imitate a browser in pretty much every way. Here’s the code to grab our target site content:

$ch = curl_init();curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);curl_setopt($ch, CURLOPT_URL,$target_url);curl_setopt($ch, CURLOPT_FAILONERROR, true);curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);curl_setopt($ch, CURLOPT_AUTOREFERER, true);curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);curl_setopt($ch, CURLOPT_TIMEOUT, 10);$html = curl_exec($ch);if (!$html) {echo "<br />cURL error number:" .curl_errno($ch);echo "<br />cURL error:" . curl_error($ch);exit;}

Salin selepas log masuk

If the request is successful $html will be filled with the content of $target_url. If the call fails then we’ll see an error message about the failure.

curl_setopt($ch, CURLOPT_URL,$target_url);

Salin selepas log masuk

This line determines what URL will be requested. For example if you wanted to scrape this site you’d have $target_url = “/makebeta/”. I won’t go into the rest of the options that are set (except for CURLOPT_USERAGENT ? see below). You can read an in depth tutorial on PHP and cURL here.

Tip: Fake Your User Agent

Many websites won’t play nice with you if you come knocking with the wrong User Agent string. What’s a User Agent string? It’s part of every request to a web server that tells it what type of agent (browser, spider, etc) is requesting the content. Some websites will give you different content depending on the user agent, so you might want to experiment. You do this in cURL with a call to curl_setopt() with CURLOPT_USERAGENT as the option:

$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);

Salin selepas log masuk

This would set cURL’s user agent to mimic Google’s. You can find a comprehensive list of user agents here: User Agents.

Common User Agents

I’ve done a bit of the leg work for you and gathered the most common user agents:

Search Engine User Agents Google ? Googlebot/2.1 ( http://www.googlebot.com/bot.html) Google Image ? Googlebot-Image/1.0 ( http://www.googlebot.com/bot.html) MSN Live ? msnbot-Products/1.0 (+http://search.msn.com/msnbot.htm) Yahoo ? Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp) ask Browser User Agents Firefox (WindowsXP) ? Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6 IE 7 ? Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30) IE 6 ? Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322) Safari ? Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en) AppleWebKit/522.11 (KHTML, like Gecko) Safari/3.0.2 Opera ? Opera/9.00 (Windows NT 5.1; U; en) Using PHP’s DOM Functions To Parse The HTML

PHP provides with a really cool tool for working with HTML content: DOM Functions. The DOM Functions allow you to parse HTML (or XML) into an object structure (or DOM ? Document Object Model). Let’s see how we do it:

$dom = new DOMDocument();@$dom->loadHTML($html);

Salin selepas log masuk

Wow is it really that easy? Yes! Now we have a nice DOMDocument object that we can use to access everything within the HTML in a nice clean way. I discovered this over at Russll Beattie’s post on: Using PHP TO Scrape Sites As Feeds, thanks Russell!

Tip: You may have noticed I put @ in front of loadHTML(), this suppresses some annoying warnings that the HTML parser throws on many pages that have non-standard compliant code.

XPath Makes Getting The Links You Want Easy

Now for the real magic of the DOM: XPath! XPath allows you to gather collections of DOM nodes (otherwise known as tags in HTML). Say you want to only get links that are within unordered lists. All you have to do is write a query like “/html/body//ul//li//a” and pass it to XPath->evaluate(). I’m not going to go into all the ways you can use XPath because I’m just learning myself and someone else has already made a great list of examples: XPath Examples. Here’s a code snippet that will just get every link on the page using XPath:

$xpath = new DOMXPath($dom);$hrefs = $xpath->evaluate("/html/body//a");

Salin selepas log masuk

Iterate And Store Your Links

Next we’ll iterate through all the links we’ve gathered using XPath and store them in a database. First the code to iterate through the links:

for ($i = 0; $i < $hrefs->length; $i++) {$href = $hrefs->item($i);$url = $href->getAttribute('href');storeLink($url,$target_url);}

Salin selepas log masuk

FULL PROGRAM:

Salin selepas log masuk


$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);curl_setopt($ch, CURLOPT_URL,$target_url);curl_setopt($ch, CURLOPT_FAILONERROR, true);curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);curl_setopt($ch, CURLOPT_AUTOREFERER, true);curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);curl_setopt($ch, CURLOPT_TIMEOUT, 10);$html = curl_exec($ch);if (!$html) { echo "
cURL error number:" .curl_errno($ch); echo "
cURL error:" . curl_error($ch); exit;}$dom = new DOMDocument();@$dom->loadHTML($html);$xpath = new DOMXPath($dom);$hrefs = $xpath->evaluate("/html/body//a");
for ($i = 0; $i < $hrefs->length; $i++) {  $href = $hrefs->item($i); $url = $href->getAttribute('href');  echo $url; echo "
"; }
?>

Salin selepas log masuk

then you can store url to your database. more details from here:http://www.merchantos.com/makebeta/php/scraping-links-with-php/#curl_content

Salin selepas log masuk

REF:tutorial on PHP and cURL

Salin selepas log masuk

You can find a comprehensive list of user agents here: User Agents.

Salin selepas log masuk

Using PHP TO Scrape Sites As Feeds

Salin selepas log masuk

Kenyataan Laman Web ini

Kandungan artikel ini disumbangkan secara sukarela oleh netizen, dan hak cipta adalah milik pengarang asal. Laman web ini tidak memikul tanggungjawab undang-undang yang sepadan. Jika anda menemui sebarang kandungan yang disyaki plagiarisme atau pelanggaran, sila hubungi admin@php.cn

Alat AI Hot

Undresser.AI Undress

Apl berkuasa AI untuk mencipta foto bogel yang realistik

AI Clothes Remover

Alat AI dalam talian untuk mengeluarkan pakaian daripada foto.

Undress AI Tool

Gambar buka pakaian secara percuma

Clothoff.io

Penyingkiran pakaian AI

Video Face Swap

Tukar muka dalam mana-mana video dengan mudah menggunakan alat tukar muka AI percuma kami!

Tunjukkan Lagi

Artikel Panas

<🎜>: Tumbuh Taman - Panduan Mutasi Lengkap

4 minggu yang lalu By DDD

<🎜>: Bubble Gum Simulator Infinity - Cara Mendapatkan dan Menggunakan Kekunci Diraja

4 minggu yang lalu By 尊渡假赌尊渡假赌尊渡假赌

Nordhold: Sistem Fusion, dijelaskan

1 bulan yang lalu By 尊渡假赌尊渡假赌尊渡假赌

Mandragora: Whispers of the Witch Tree - Cara Membuka Kunci Cangkuk Bergelut

4 minggu yang lalu By 尊渡假赌尊渡假赌尊渡假赌

Clair Obscur: Ekspedisi 33 UE-Sandfall Game Crash? 3 cara!

2 minggu yang lalu By DDD

Tunjukkan Lagi

Alat panas

Notepad++7.3.1

Editor kod yang mudah digunakan dan percuma

SublimeText3 versi Cina

Versi Cina, sangat mudah digunakan

Hantar Studio 13.0.1

Persekitaran pembangunan bersepadu PHP yang berkuasa

Dreamweaver CS6

Alat pembangunan web visual

SublimeText3 versi Mac

Perisian penyuntingan kod peringkat Tuhan (SublimeText3)

Tunjukkan Lagi

Topik panas

Tutorial Java

1677

Tutorial CakePHP

1431

Tutorial Laravel

1333

Tutorial PHP

1278

Tutorial C#

1257

Tunjukkan Lagi

Related knowledge

Terangkan hashing kata laluan yang selamat di PHP (mis., Password_hash, password_verify). Mengapa tidak menggunakan MD5 atau SHA1? Apr 17, 2025 am 12:06 AM

Dalam php, kata laluan_hash dan kata laluan 1) password_hash menjana hash yang mengandungi nilai garam untuk meningkatkan keselamatan. 2) Kata Laluan_verify Sahkan kata laluan dan pastikan keselamatan dengan membandingkan nilai hash. 3) MD5 dan SHA1 terdedah dan kekurangan nilai garam, dan tidak sesuai untuk keselamatan kata laluan moden.

Bagaimanakah jenis membayangkan jenis PHP, termasuk jenis skalar, jenis pulangan, jenis kesatuan, dan jenis yang boleh dibatalkan? Apr 17, 2025 am 12:25 AM

Jenis PHP meminta untuk meningkatkan kualiti kod dan kebolehbacaan. 1) Petua Jenis Skalar: Oleh kerana Php7.0, jenis data asas dibenarkan untuk ditentukan dalam parameter fungsi, seperti INT, Float, dan lain -lain. 2) Return Type Prompt: Pastikan konsistensi jenis nilai pulangan fungsi. 3) Jenis Kesatuan Prompt: Oleh kerana Php8.0, pelbagai jenis dibenarkan untuk ditentukan dalam parameter fungsi atau nilai pulangan. 4) Prompt jenis yang boleh dibatalkan: membolehkan untuk memasukkan nilai null dan mengendalikan fungsi yang boleh mengembalikan nilai null.

PHP dan Python: Paradigma yang berbeza dijelaskan Apr 18, 2025 am 12:26 AM

PHP terutamanya pengaturcaraan prosedur, tetapi juga menyokong pengaturcaraan berorientasikan objek (OOP); Python menyokong pelbagai paradigma, termasuk pengaturcaraan OOP, fungsional dan prosedur. PHP sesuai untuk pembangunan web, dan Python sesuai untuk pelbagai aplikasi seperti analisis data dan pembelajaran mesin.

Memilih antara php dan python: panduan Apr 18, 2025 am 12:24 AM

PHP sesuai untuk pembangunan web dan prototaip pesat, dan Python sesuai untuk sains data dan pembelajaran mesin. 1.Php digunakan untuk pembangunan web dinamik, dengan sintaks mudah dan sesuai untuk pembangunan pesat. 2. Python mempunyai sintaks ringkas, sesuai untuk pelbagai bidang, dan mempunyai ekosistem perpustakaan yang kuat.

PHP dan Python: menyelam mendalam ke dalam sejarah mereka Apr 18, 2025 am 12:25 AM

PHP berasal pada tahun 1994 dan dibangunkan oleh Rasmuslerdorf. Ia pada asalnya digunakan untuk mengesan pelawat laman web dan secara beransur-ansur berkembang menjadi bahasa skrip sisi pelayan dan digunakan secara meluas dalam pembangunan web. Python telah dibangunkan oleh Guidovan Rossum pada akhir 1980 -an dan pertama kali dikeluarkan pada tahun 1991. Ia menekankan kebolehbacaan dan kesederhanaan kod, dan sesuai untuk pengkomputeran saintifik, analisis data dan bidang lain.

PHP dan Rangka Kerja: Memodenkan bahasa Apr 18, 2025 am 12:14 AM

PHP tetap penting dalam proses pemodenan kerana ia menyokong sejumlah besar laman web dan aplikasi dan menyesuaikan diri dengan keperluan pembangunan melalui rangka kerja. 1.Php7 meningkatkan prestasi dan memperkenalkan ciri -ciri baru. 2. Rangka kerja moden seperti Laravel, Symfony dan CodeIgniter memudahkan pembangunan dan meningkatkan kualiti kod. 3. Pengoptimuman prestasi dan amalan terbaik terus meningkatkan kecekapan aplikasi.

Mengapa menggunakan PHP? Kelebihan dan faedah dijelaskan Apr 16, 2025 am 12:16 AM

Manfaat utama PHP termasuk kemudahan pembelajaran, sokongan pembangunan web yang kukuh, perpustakaan dan kerangka yang kaya, prestasi tinggi dan skalabilitas, keserasian silang platform, dan keberkesanan kos. 1) mudah dipelajari dan digunakan, sesuai untuk pemula; 2) integrasi yang baik dengan pelayan web dan menyokong pelbagai pangkalan data; 3) mempunyai rangka kerja yang kuat seperti Laravel; 4) Prestasi tinggi dapat dicapai melalui pengoptimuman; 5) menyokong pelbagai sistem operasi; 6) Sumber terbuka untuk mengurangkan kos pembangunan.

Impak PHP: Pembangunan Web dan seterusnya Apr 18, 2025 am 12:10 AM

Phphassignificantelympactedwebdevelopmentandextendsbeyondit.1) itpowersmajorplatformslikeworderpressandexcelsindatabaseIntions.2) php'SadaptabilityAldoStoScaleforlargeapplicationFrameworksLikelara.3)

See all articles