Image Scraping with Symfony's DomCrawler

Table of Contents

Key Takeaways

How the Class works

Coding the Class

Class Dependency

Using the Class

Summary

Frequently Asked Questions (FAQs) about Image Scraping with Symfony’s DomCrawler

What is Symfony’s DomCrawler Component?

How do I install Symfony’s DomCrawler Component?

How do I use Symfony’s DomCrawler Component to scrape images?

Can I use Symfony’s DomCrawler Component with Laravel?

How do I select elements using Symfony’s DomCrawler Component?

Can I modify the content of elements using Symfony’s DomCrawler Component?

How do I handle errors and exceptions when using Symfony’s DomCrawler Component?

Can I use Symfony’s DomCrawler Component to scrape websites that require authentication?

How do I extract attribute values using Symfony’s DomCrawler Component?

Can I use Symfony’s DomCrawler Component to scrape AJAX-loaded content?

Jennifer Aniston

Feb 21, 2025, 08:47 AM

A photographer friend of mine implored me to find and download images of picture frames from the internet. I eventually landed on a web page that had a number of them available for free but there was a problem: a link to download all the images together wasn’t present.

I didn’t want to go through the stress of downloading the images individually, so I wrote this PHP class to find, download and zip all images found on the website.

Key Takeaways

The PHP class utilizes Symfony’s DomCrawler component to scrape images from a webpage, download and save them into a folder, create a ZIP archive of the folder, and then delete the folder. This class is designed to automate the process of downloading multiple images from a website.
The class includes five private properties and eight public methods. The properties store information such as the folder name, webpage URL, HTML document code, ZIP file name, and operation status. The methods include functions to set the folder and file name, instantiate the DomCrawler, download and save images, create a ZIP file, delete the folder, and get the operation status.
To use the class, all required files must be included, either via autoload or explicitly. The setFolder and setFileName methods should be called with their respective arguments, and the process method is then called to put the class to work. The DomCrawler component and create_zip function must be included for the class to function.

How the Class works

The class searches a web page for images, downloads and saves them into a folder, creates a ZIP archive of the folder, and finally deletes the folder.

The class uses Symfony’s DomCrawler component to search for all image links found on the webpage and a custom zip function that creates the zip file. Credit to David Walsh for the zip function.
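The zip function itself is not reproduced in this article. As a rough idea of what such a helper does, here is a minimal sketch with the same call shape the class relies on: an array of file paths and a destination, returning true on success. It is an assumption built on PHP's ZipArchive extension (ext-zip), not David Walsh's original code.

```php
<?php
// Hypothetical stand-in for the article's create_zip helper; requires ext-zip.
function create_zip(array $files, string $destination, bool $overwrite = false): bool
{
    // refuse to replace an existing archive unless asked to
    if (file_exists($destination) && !$overwrite) {
        return false;
    }
    // keep only the paths that point at regular files
    $valid = array_values(array_filter($files, 'is_file'));
    if (count($valid) === 0) {
        return false;
    }
    $zip = new ZipArchive();
    if ($zip->open($destination, ZipArchive::CREATE | ZipArchive::OVERWRITE) !== true) {
        return false;
    }
    foreach ($valid as $file) {
        // store each file under the path it was given, so "image/a.jpg"
        // ends up inside an "image" folder in the archive
        $zip->addFile($file, $file);
    }
    $zip->close();
    return file_exists($destination);
}
```

Returning a boolean keeps the caller simple: the class only needs to know whether the archive exists afterwards.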

Coding the Class

The class consists of five private properties and eight public methods including the __construct magic method.

Below is the list of the class properties and their roles.
1. $folder: stores the name of the folder that contains the scraped images.
2. $url: stores the webpage URL.
3. $html: stores the HTML document code of the webpage to be scraped.
4. $fileName: stores the name of the ZIP file.
5. $status: stores the status of the operation, i.e., whether it succeeded or failed.

Let’s get started building the class.

Create the class ZipImages containing the above five properties.

<?php
use Symfony\Component\DomCrawler\Crawler;

class ZipImages {
    private $folder;    // folder that holds the downloaded images
    private $url;       // URL of the page to scrape
    private $html;      // HTML source of that page
    private $fileName;  // name of the ZIP file, without the extension
    private $status;    // outcome message of the operation

Create a __construct magic method that accepts a URL as an argument.
It stores the URL, fetches the page’s HTML with file_get_contents, and creates the default image folder by calling setFolder.

public function __construct($url) {
    $this->url = $url;
    // file_get_contents returns false when the page cannot be fetched
    $this->html = file_get_contents($this->url);
    if ($this->html === false) {
        throw new \Exception('Failed to fetch the page');
    }
    $this->setFolder();
}
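The constructor relies on file_get_contents with its default settings. If you want a timeout and a user agent on the request, one option is a stream context. The sketch below is a hypothetical helper, not part of the original class; the function name fetchHtml and the user agent string are made up for illustration.

```php
<?php
// Hypothetical helper, not part of the original ZipImages class.
function fetchHtml(string $url, int $timeoutSeconds = 10): string
{
    // the "http" options apply to http:// and https:// URLs
    $context = stream_context_create([
        'http' => [
            'timeout'    => $timeoutSeconds,
            'user_agent' => 'ZipImagesBot/0.1', // made-up identifier
        ],
    ]);
    // @ silences the warning so that failure surfaces only as the exception
    $html = @file_get_contents($url, false, $context);
    if ($html === false) {
        throw new RuntimeException("Could not fetch $url");
    }
    return $html;
}
```

Throwing instead of returning false means a failed fetch cannot silently turn into an empty document further down the pipeline.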

The created ZIP archive has a folder that contains the scraped images. The setFolder method below configures this.

By default, the folder name is set to image, but you can change it by passing a different folder name as the method’s argument.

public function setFolder($folder = "image") {
    // if the folder doesn't exist, attempt to create it, then store its name in $folder
    if (!file_exists($folder)) {
        mkdir($folder);
    }
    $this->folder = $folder;
}

setFileName provides an option to change the name of the ZIP file, with a default name set to zipImages:

public function setFileName($name = "zipImages") {
    $this->fileName = $name;
}

At this point, we instantiate the Symfony crawler component to search for images, then download and save all the images into the folder.

public function domCrawler() {
    // instantiate the Symfony DomCrawler component
    $crawler = new Crawler($this->html);
    // create an array of all scraped image links
    $result = $crawler
        ->filterXPath('//img')
        ->extract(array('src'));

    // download and save each image to the folder
    // (relative src values are not resolved against the page URL)
    foreach ($result as $image) {
        $path = $this->folder."/".basename($image);
        $file = file_get_contents($image);
        if ($file === false) {
            throw new \Exception('Failed to download image');
        }
        // file_put_contents returns false on failure
        if (file_put_contents($path, $file) === false) {
            throw new \Exception('Failed to write image');
        }
    }
}

After the download is complete, we compress the image folder to a ZIP archive using our custom create_zip function.

public function createZip() {
    $folderFiles = scandir($this->folder);
    if (!$folderFiles) {
        throw new \Exception('Failed to scan folder');
    }
    // collect every file in the folder, skipping the . and .. entries
    $fileArray = array();
    foreach ($folderFiles as $file) {
        if (($file != ".") && ($file != "..")) {
            $fileArray[] = $this->folder."/".$file;
        }
    }

    if (create_zip($fileArray, $this->fileName.'.zip')) {
        $this->status = <<<HTML
File successfully archived. <a href="{$this->fileName}.zip">Download it now</a>
HTML;
    } else {
        $this->status = "An error occurred";
    }
}

Lastly, we delete the created folder after the ZIP file has been created.

public function deleteCreatedFolder() {
    $dp = opendir($this->folder) or die ('ERROR: Cannot open directory');
    // compare against false so that a file named "0" doesn't end the loop early
    while (false !== ($file = readdir($dp))) {
        if ($file != '.' && $file != '..') {
            if (is_file("$this->folder/$file")) {
                unlink("$this->folder/$file");
            }
        }
    }
    closedir($dp);
    rmdir($this->folder) or die ('could not delete folder');
}

Get the status of the operation, i.e., whether it succeeded or an error occurred.

public function getStatus() {
    echo $this->status;
}

Finally, the process method runs all the methods above in order.

public function process() {
    $this->domCrawler();
    $this->createZip();
    $this->deleteCreatedFolder();
    $this->getStatus();
}
} // closes class ZipImages

You can download the full class from GitHub.

Class Dependency

For the class to work, the DomCrawler component and create_zip function need to be included. You can download the code for this function here.

Download and install the DomCrawler component via Composer by adding symfony/dom-crawler to the require section of your composer.json file, then run $ php composer.phar install to download the library and generate the vendor/autoload.php autoloader file.

Using the Class

Make sure all required files are included, via autoload or explicitly.
Call the setFolder and setFileName methods and pass in their respective arguments. Only call setFolder when you need to change the folder name.
Call the process method to put the class to work.
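The steps above can be sketched as follows. The file names create_zip.php and ZipImages.php, the URL, and the folder and archive names are placeholders for illustration; they are assumptions, not taken from the original project.

```php
<?php
// Assumed file layout: adjust these paths to your own project.
require 'vendor/autoload.php'; // Composer autoloader, provides DomCrawler
require 'create_zip.php';      // hypothetical file that defines create_zip()
require 'ZipImages.php';       // hypothetical file that defines the ZipImages class

// placeholder URL of the page whose images should be collected
$scraper = new ZipImages('https://example.com/picture-frames');

// optional: only needed when the default folder name should change
$scraper->setFolder('frames');

// name of the archive, without the .zip extension
$scraper->setFileName('picture-frames');

// scrape, zip, clean up, and print the status message
$scraper->process();
```

Because the constructor already calls setFolder with its default, the explicit setFolder call creates a second folder; the default one is left behind unless you remove it yourself.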

Summary

In this article, we learned how to create a simple PHP image scraper that automatically compresses downloaded images into a ZIP archive. If you have alternative solutions or suggestions for improvement, please leave them in the comments below. All feedback is welcome!

Frequently Asked Questions (FAQs) about Image Scraping with Symfony’s DomCrawler

What is Symfony’s DomCrawler Component?

Symfony’s DomCrawler Component is a library for traversing HTML and XML documents. Its API is small and easy to learn, which makes it a popular choice for web scraping tasks. The DomCrawler Component can select specific elements on a page and extract data from them; changes to the document are made through the underlying DOM nodes it exposes.

How do I install Symfony’s DomCrawler Component?

Installing Symfony’s DomCrawler Component is straightforward with Composer, the dependency manager for PHP. Run the following command in your project directory: composer require symfony/dom-crawler. This downloads and installs the DomCrawler Component along with its dependencies. To select elements with CSS selectors through the filter method, also install symfony/css-selector.

How do I use Symfony’s DomCrawler Component to scrape images?

To scrape images using Symfony’s DomCrawler Component, you first need to create a new instance of the Crawler class and load the HTML content into it. Then, you can use the filter method to select the image elements and extract their src attributes. Here’s a basic example:

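use Symfony\Component\DomCrawler\Crawler;

$crawler = new Crawler($html);
$crawler->filter('img')->each(function (Crawler $node) {
    // print the src attribute of every <img> element
    echo $node->attr('src'), PHP_EOL;
});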
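Scraped src values are often relative to the page, so they usually need to be turned into absolute URLs before they can be downloaded. A minimal sketch of that step in plain PHP follows; resolveImageUrl is a hypothetical helper, and it deliberately does not normalize ../ segments or handle every edge case of URL resolution.

```php
<?php
// Hypothetical helper: turn an <img> src into an absolute URL.
function resolveImageUrl(string $pageUrl, string $src): string
{
    // already absolute, e.g. https://cdn.example.com/a.png
    if (parse_url($src, PHP_URL_SCHEME) !== null) {
        return $src;
    }
    $base = parse_url($pageUrl);
    $scheme = $base['scheme'] ?? 'http';
    // protocol-relative, e.g. //cdn.example.com/a.png
    if (strpos($src, '//') === 0) {
        return $scheme . ':' . $src;
    }
    $origin = $scheme . '://' . ($base['host'] ?? '');
    if (isset($base['port'])) {
        $origin .= ':' . $base['port'];
    }
    // root-relative, e.g. /img/a.png
    if (strpos($src, '/') === 0) {
        return $origin . $src;
    }
    // document-relative: append to the directory part of the page path
    $path = $base['path'] ?? '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $src;
}
```

Each src returned by the crawler could be passed through a function like this before the download step.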

Can I use Symfony’s DomCrawler Component with Laravel?

Yes, you can use Symfony’s DomCrawler Component with Laravel: it is a standalone package that installs through Composer like any other dependency. Laravel’s older browser-style testing helpers, now maintained as the laravel/browser-kit-testing package, use the DomCrawler Component under the hood, so the same methods and techniques apply when you traverse HTML content in those tests.

How do I select elements using Symfony’s DomCrawler Component?

Symfony’s DomCrawler Component provides several methods to select elements, including filter, filterXPath, and selectLink. These methods select elements by CSS selector, XPath expression, and link text, respectively. The filter method relies on the separate symfony/css-selector package.
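A short sketch of the three selection methods side by side. The markup is made up for illustration, and the example assumes symfony/dom-crawler and symfony/css-selector have been installed through Composer.

```php
<?php
require 'vendor/autoload.php'; // assumes both packages are installed via Composer

use Symfony\Component\DomCrawler\Crawler;

// made-up markup for illustration
$html = <<<'HTML'
<html><body>
  <img class="frame" src="/img/oak.jpg" alt="Oak frame">
  <img src="/img/logo.png" alt="Logo">
  <a href="/frames/all">All frames</a>
</body></html>
HTML;

$crawler = new Crawler($html);

// CSS selector: src of every <img> with the class "frame"
$frames = $crawler->filter('img.frame')->extract(['src']);

// XPath expression: src and alt of every <img> that has an alt attribute
$withAlt = $crawler->filterXPath('//img[@alt]')->extract(['src', 'alt']);

// link text: href of the link labelled "All frames"
$href = $crawler->selectLink('All frames')->attr('href');

print_r($frames);
print_r($withAlt);
echo $href, PHP_EOL;
```

With a single attribute name, extract returns a flat list of values; with several names, it returns one array per matched element.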

Can I modify the content of elements using Symfony’s DomCrawler Component?

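Yes, but not through the Crawler methods themselves. DomCrawler is primarily a read-oriented traversal API: the attr method returns an attribute’s value and does not set one. To change the document, reach the underlying DOMElement with getNode and use the standard DOM methods. For example, you can change the src attribute of every image element like this:

$crawler->filter('img')->each(function (Crawler $node) {
    // getNode(0) returns the DOMElement wrapped by this single-node Crawler
    $node->getNode(0)->setAttribute('src', 'new-image.jpg');
});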

How do I handle errors and exceptions when using Symfony’s DomCrawler Component?

When using Symfony’s DomCrawler Component, errors and exceptions can be handled using try-catch blocks. Note that the filter method does not throw when nothing matches; it returns an empty Crawler. An InvalidArgumentException is thrown when you then read from that empty node list, for example with attr, text, or html. You can check count() first, or catch the exception and handle it appropriately.
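One way to guard against an empty match is sketched below, using made-up markup that contains no images. It assumes symfony/dom-crawler and symfony/css-selector have been installed through Composer.

```php
<?php
require 'vendor/autoload.php'; // assumes both packages are installed via Composer

use Symfony\Component\DomCrawler\Crawler;

$crawler = new Crawler('<html><body><p>No images here</p></body></html>');

// filter() itself succeeds and returns an empty Crawler
$images = $crawler->filter('img');

// option 1: check the match count before reading anything
if ($images->count() === 0) {
    echo 'No <img> elements found', PHP_EOL;
}

// option 2: catch the exception raised by reading from an empty node list
try {
    echo $images->attr('src'), PHP_EOL;
} catch (\InvalidArgumentException $e) {
    echo 'Nothing to read: ', $e->getMessage(), PHP_EOL;
}
```

Checking the count first tends to read better when an empty result is expected; the try-catch form suits cases where a missing element really is an error.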

Can I use Symfony’s DomCrawler Component to scrape websites that require authentication?

Yes, you can use Symfony’s DomCrawler Component to scrape websites that require authentication. However, this requires additional steps, such as sending a POST request with the login credentials and storing the session cookie.

How do I extract attribute values using Symfony’s DomCrawler Component?

You can read an attribute of a matched element with the attr method, or collect an attribute from every matched element at once with the extract method. For example, to print the src attribute of each image element, you can do the following:

$crawler->filter('img')->each(function (Crawler $node) {
    echo $node->attr('src'), PHP_EOL;
});

Can I use Symfony’s DomCrawler Component to scrape AJAX-loaded content?

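Unfortunately, Symfony’s DomCrawler Component cannot directly scrape AJAX-loaded content because it doesn’t execute JavaScript. HTTP clients such as Guzzle and Goutte don’t execute JavaScript either, but you can use them to request the endpoint that the page’s script calls and parse the response yourself. When the content only exists after scripts have run, a browser-driving tool such as Symfony Panther can render the page and expose the result through a DomCrawler-style API.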
