
robots.txt

Jun 24, 2016, 11:53 AM

In China, website managers do not seem to pay much attention to robots.txt, but some features cannot be achieved without it, so today Shijiazhuang SEO would like to use this article to briefly explain how to write robots.txt.


Basic introduction to robots.txt

robots.txt is a plain text file. In this file, website administrators can declare the parts of the website that they do not want to be accessed by robots, or specify that search engines only include specified content.

When a search robot (also called a search spider) visits a site, it first checks whether robots.txt exists in the site's root directory. If it does, the robot determines its scope of access based on the file's content; if the file does not exist, the robot simply crawls along the site's links.

In addition, robots.txt must be placed in the root directory of a site, and the file name must be all lowercase.
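As a quick illustration of the lookup behavior described above, here is a minimal sketch using Python's standard urllib.robotparser module (the example.com URLs and the rules are placeholders, not any real site's file):

```python
from urllib.robotparser import RobotFileParser

# Parse robots.txt content directly; in practice a crawler would
# first fetch http://<site>/robots.txt from the site's root.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# The root page is allowed; anything under /private/ is blocked.
print(rp.can_fetch("*", "http://example.com/index.html"))
print(rp.can_fetch("*", "http://example.com/private/a.html"))
```

If the file does not exist at all, `can_fetch` defaults to allowing everything, matching the behavior described above.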

Robots.txt writing syntax

First, let's take a look at a robots.txt example: http://www.shijiazhuangseo.com.cn/robots.txt

Visiting the above address, we can see that the content of the robots.txt file is as follows:

# Robots.txt file from http://www.shijiazhuangseo.com.cn

# All robots will spider the domain

User-agent: *

Disallow:

The above text means that all search robots are allowed to access all files under the www.shijiazhuangseo.com.cn site.

Specific syntax analysis: text after # is a comment; User-agent: is followed by the name of a search robot (a value of * refers to all search robots); Disallow: is followed by a file or directory path that robots are not allowed to access.
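The line-oriented syntax just described can be sketched with a small parser. This is a hypothetical helper for illustration only (real crawlers use more complete parsers): '#' starts a comment, and each remaining line is a case-insensitive "Field: value" pair.

```python
def parse_robots_lines(text):
    """Split robots.txt text into (field, value) pairs,
    dropping comments and blank lines."""
    records = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        field, _, value = line.partition(":")
        records.append((field.strip().lower(), value.strip()))
    return records

example = """\
# All robots will spider the domain
User-agent: *
Disallow:
"""
print(parse_robots_lines(example))
# [('user-agent', '*'), ('disallow', '')]
```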

Below, I will list some specific uses of robots.txt:

Allow all robots to access

User-agent: *

Disallow:

Alternatively, you can create an empty "/robots.txt" file.

Prevent all search engines from accessing any part of the site

User-agent: *

Disallow: /

Prevent all search engines from accessing certain parts of the site (the 01, 02, and 03 directories in the example below)

User-agent: *

Disallow: /01/

Disallow: /02/

Disallow: /03/

Block a specific search engine (BadBot in the example below)

User-agent: BadBot

Disallow: /

Only allow access by a certain search engine (Crawler in the example below)

User-agent: Crawler

Disallow:

User-agent: *

Disallow: /
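The "only allow a certain search engine" rules above can be verified with Python's standard urllib.robotparser (the example.com URL is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# The two rule groups are separated by a blank line:
# "Crawler" may fetch everything, all other robots nothing.
rp = RobotFileParser()
rp.parse([
    "User-agent: Crawler",
    "Disallow:",
    "",
    "User-agent: *",
    "Disallow: /",
])

print(rp.can_fetch("Crawler", "http://example.com/page.html"))
print(rp.can_fetch("BadBot", "http://example.com/page.html"))
```

Note that an empty `Disallow:` means "allow everything", while `Disallow: /` blocks the whole site.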

In addition, I think it is necessary to expand on this and give a brief introduction to the Robots META tag:

The Robots META tag applies to an individual page. Like other META tags (such as the language used, the page description, and keywords), the Robots META tag is placed in the <head> section of the page, and it tells search engine robots how to crawl that page's content.

How to write the Robots META tag:

The Robots META tag is case-insensitive. name="Robots" applies to all search engines; for a specific search engine you can write, for example, name="BaiduSpider". The content part takes four directives: index, noindex, follow, and nofollow, separated by ",".

The INDEX directive tells the search robot to index the page;

The FOLLOW directive means the search robot may continue crawling along the links on the page;

The default values of the Robots META tag are INDEX and FOLLOW, except for Inktomi, whose defaults are INDEX, NOFOLLOW.

In this way, there are four combinations:

<META NAME="ROBOTS" CONTENT="INDEX,FOLLOW">

<META NAME="ROBOTS" CONTENT="NOINDEX,FOLLOW">

<META NAME="ROBOTS" CONTENT="INDEX,NOFOLLOW">

<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">

Here, <META NAME="ROBOTS" CONTENT="INDEX,FOLLOW"> can be written as <META NAME="ROBOTS" CONTENT="ALL">, and <META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW"> can be written as <META NAME="ROBOTS" CONTENT="NONE">.
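As a sketch of how a crawler might read these directives, here is a minimal parser using Python's standard html.parser module (the page snippet is made up for illustration):

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the directives of <meta name="robots" ...> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        if attrs.get("name", "").lower() == "robots":
            content = attrs.get("content", "")
            # Directives are comma-separated and case-insensitive.
            self.directives |= {
                d.strip().lower() for d in content.split(",") if d.strip()
            }

page = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX,FOLLOW"></head><body></body></html>'
parser = RobotsMetaParser()
parser.feed(page)
print(sorted(parser.directives))
# ['follow', 'noindex']
```

Because the tag is case-insensitive, both the tag name and the directives are lowercased before comparison.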

At present, the vast majority of search engine robots comply with the rules of robots.txt, while support for the Robots META tag is less widespread but gradually increasing. For example, the well-known search engine Google fully supports it, and Google has also added a directive, "archive", which controls whether Google keeps a snapshot of the page. For example:

<META NAME="googlebot" CONTENT="index,follow,noarchive">

means to crawl the pages on this site and follow the links on each page, but not to keep a web snapshot of the page on Google.

The above is Shijiazhuang SEO's brief introduction to writing robots.txt.

