What kind of work do search robots do? Search engines and their robots and spiders. What are search robots?

A search robot (bot, spider, crawler) is a special search engine program designed to scan sites on the Internet.

Many people don't know that crawling bots simply collect and store information. They don't process it; other programs do that.

If you want to see a site through the eyes of a search robot, you can do so through the webmaster panel.

You can see how Google works through the webmaster panel. There you need to add your site, and then you can view a page as the bot sees it:

https://www.google.com/webmasters/tools/googlebot-fetch?hl=ru

You can check Yandex through a saved copy of a page. To do this, find the desired page in Yandex search, click "saved copy" and then "view text version".

Below is a list of the search robots that visit our sites. Some of them index sites, others monitor contextual advertising. There are also specialized robots that perform narrow tasks, such as indexing pictures or news.

If you can recognize a robot by name, you can prohibit or allow it to crawl the site, thereby reducing the load on the server, or keep your information from ending up in search results.

Yandex search robots

The Yandex search engine has about fifteen search robots known to us. The list of bots that I managed to dig up, including from the official help, is below.

YandexBot - the main indexing robot;
YandexMedia - a robot that indexes multimedia data;
YandexImages - the Yandex.Images indexer;
YandexCatalog - a "pinger" for Yandex.Catalog, used to temporarily remove unavailable sites from publication in the Catalog;
YaDirectFetcher - the Yandex.Direct robot;
YandexBlogs - a blog search robot that indexes posts and comments;
YandexNews - the Yandex.News robot;
YandexWebmaster - comes when a site is added through the AddURL form;
YandexPagechecker - the micro-markup validator;
YandexFavicons - the favicon indexer;
YandexMetrika - the Yandex.Metrica robot;
YandexMarket - the Yandex.Market robot;
YandexCalendar - the Yandex.Calendar robot.

Google search robots (bots)

Googlebot - the main indexing robot;
Googlebot News - the news indexer;
Googlebot Images - the image indexer;
Googlebot Video - the robot for video data;
Google Mobile - the mobile content indexer;
Google Mobile AdSense - the mobile AdSense robot;
Google AdSense - the AdSense robot;
Google AdsBot - the landing page quality checking bot;
Mediapartners-Google - the AdSense robot.

Robots of other search engines

You may also stumble upon robots of other search engines in your site's logs.

Rambler - StackRambler
Mail.ru - Mail.Ru
Yahoo! — Slurp (or Yahoo! Slurp)
AOL - Slurp
MSN - MSNBot
Live - MSNBot
Ask - Teoma
Alexa - ia_archiver
Lycos - Lycos
Aport - Aport
Webalta - WebAlta (WebAlta Crawler/2.0)

In addition to search engine bots, a huge army of all kinds of rogue spiders crawls the sites. These are various parsers that collect information from sites, usually for the selfish purposes of their creators.

Some steal content, others steal pictures, and others hack websites and secretly place links. If you notice that such a parser has latched onto your site, block its access by every possible means, including through the robots.txt file.
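As a sketch, a robots.txt rule for shutting out one such parser might look like this (the bot name ScraperBot is a made-up example; note that rogue parsers often ignore robots.txt, so server-level blocking may also be needed):

```
User-agent: ScraperBot
Disallow: /
```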

Hello friends! Today you will learn how Yandex and Google search robots work and what function they perform in website promotion. So let's go!

Search engines do this in order to find, out of a million sites, the ten web projects that have a high-quality, relevant answer to the user's query. Why only ten? Because the first page of search results consists of only ten positions.

Search robots are friends to both webmasters and users

Why it is important for search robots to visit a site is already clear, but why does the user need this? Right: so that the user sees only those sites that fully answer his query.

A search robot is a very flexible tool: it can find even a site that has just been created and that its owner has not yet worked on. That's why this bot is called a spider: it can stretch its legs and reach anywhere on the virtual web.

Is it possible to control a search robot to your advantage?

There are cases when some pages do not make it into the search. This is mainly because the page has not yet been indexed by a search robot. Of course, sooner or later a search robot will notice the page, but that takes time, and sometimes quite a lot of it. Here, though, you can help the search robot visit the page faster.

To do this, you can place links to your website in special directories or lists and on social networks; in general, anywhere the search robot practically lives. Social networks, for example, update every second. Try to promote your site there, and the search robot will come to your site much faster.

One main rule follows from this: if you want search engine bots to visit your site, you need to feed them new content on a regular basis. If they notice that the content is being updated and the site is developing, they will begin to visit your Internet project much more often.

Every search robot remembers how often your content changes. It evaluates not only quality but also time intervals. If the material on a site is updated once a month, the robot will come to the site once a month.

Likewise, if the site is updated once a week, the search robot will come once a week. If you update the site every day, the robot will visit every day or every other day. Some sites are indexed within minutes of updating: social media, news aggregators, and sites that publish several articles a day.
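As a toy illustration of this revisit logic, the interval between crawler visits can be modeled as the average gap between observed content updates (this is a deliberate simplification; real crawler scheduling is far more involved):

```python
# A toy model of the revisit logic described above: estimate the next
# crawl interval as the average gap between observed content updates.
import datetime

def revisit_interval(update_times):
    """Average gap between consecutive observed updates."""
    if len(update_times) < 2:
        return datetime.timedelta(days=7)  # arbitrary default for unknown sites
    gaps = [b - a for a, b in zip(update_times, update_times[1:])]
    return sum(gaps, datetime.timedelta()) / len(gaps)

# A site observed to publish weekly gets a weekly revisit interval.
updates = [datetime.datetime(2024, 1, d) for d in (1, 8, 15, 22)]
print(revisit_interval(updates))  # 7 days, 0:00:00
```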

How do you give a robot a task or prohibit it from doing something?

Earlier, we learned that search engines have multiple robots that perform different tasks: some look for pictures, some for links, and so on.

You can control any robot using a special file, robots.txt. The robot begins its acquaintance with the site from this file. In it you can specify whether the robot may index the site, and if so, which sections. All these instructions can be created for one robot or for all of them.
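For example, a robots.txt file with instructions for one specific robot and for all the others might look like this (the paths are illustrative):

```
# Rules for Yandex's image indexer only
User-agent: YandexImages
Disallow: /private-images/

# Rules for all other robots
User-agent: *
Disallow: /admin/
```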

Website promotion training

I cover the finer points of SEO promotion of sites in the Google and Yandex search engines in more detail in personal Skype sessions. I have brought all my web projects to higher traffic and get excellent results from this. I can teach you this too, if you are interested!

A search robot is a special search engine program designed to enter sites and their pages found on the Internet into a database (index). Other names are also used: crawler, spider, bot, automatic indexer, ant, webcrawler, webscutter, webrobots, webspider.

Principle of operation

A search robot is a browser-type program. It constantly scans the network: it visits indexed (already known to it) sites, follows links from them, and finds new resources. When a new resource is discovered, the robot adds it to the search engine's index. The search robot also indexes updates on sites, at a frequency that it records. For example, a site that is updated once a week will be visited by a spider at that frequency, while content on news sites can be indexed within minutes of publication. If no links from other resources lead to a site, then in order to attract search robots, the resource must be added through a special form (Google Webmaster Center, Yandex Webmaster Panel, etc.).

Types of search robots

Yandex spiders:

  • Yandex/1.01.001 I - the main bot involved in indexing,
  • Yandex/1.01.001 (P) - indexes pictures,
  • Yandex/1.01.001 (H) - finds mirror sites,
  • Yandex/1.03.003 (D) - determines whether the page added from the webmaster panel meets the indexing parameters,
  • YaDirectBot/1.0 (I) - indexes resources from the Yandex advertising network,
  • Yandex/1.02.000 (F) - indexes site favicons.

Google Spiders:

  • Googlebot - the main robot,
  • Googlebot News - scans and indexes news,
  • Google Mobile - indexes sites for mobile devices,
  • Googlebot Images - searches and indexes images,
  • Googlebot Video - indexes videos,
  • Google AdsBot - checks the quality of the landing page,
  • Google Mobile AdSense and Google AdSense - indexes sites of the Google advertising network.

Other search engines also use several types of robots that are functionally similar to those listed.


Contrary to popular belief, the robot is not directly involved in any processing of scanned documents. It only reads and saves them; they are then processed by other programs. This can be confirmed by analyzing the logs of a site being indexed for the first time. On its first visit, the bot first requests the robots.txt file, then the main page of the site. That is, it follows the only link known to it. This is where the bot's first visit always ends. After some time (usually the next day), the bot requests the next pages, using the links found on the page it has already read. The process then continues in the same order: a request for pages whose links have already been found, a pause for processing the documents read, and the next session with a request for the newly found links.

Parsing pages on the fly would mean significantly greater resource consumption by the robot and a loss of time. Each scanning server runs multiple bot processes in parallel. They must act as quickly as possible in order to have time to read new pages and re-read existing ones. Therefore, bots only read and save documents. Whatever they save is queued for processing (code parsing). Links found during page processing are placed in a task queue for the bots. This is how the entire network is continuously scanned. The only thing a bot can and should analyze on the fly is the robots.txt file, so as not to request addresses that are prohibited in it. During each site crawling session, the robot first requests this file, and after it, all the pages queued for crawling.
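The crawl order described above - robots.txt first, then a queue of pages whose links have already been found - can be sketched roughly like this (fetching and link extraction are stubbed out with fixed data; a real bot would make HTTP requests and parse HTML):

```python
# A rough sketch of the crawl loop: read robots.txt first, then fetch
# queued pages, saving each document and queuing the links found in it.
from collections import deque
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())

# A hypothetical link graph standing in for real pages.
LINKS = {
    "http://example.com/": ["http://example.com/a",
                            "http://example.com/private/x"],
    "http://example.com/a": [],
}

queue = deque(["http://example.com/"])  # the only link known at first
seen, saved = set(), []

while queue:
    url = queue.popleft()
    if url in seen or not rules.can_fetch("*", url):
        continue                      # already read, or forbidden by robots.txt
    seen.add(url)
    saved.append(url)                 # "read and save"; parsing happens later
    queue.extend(LINKS.get(url, []))  # found links go into the task queue

print(saved)
```

The disallowed /private/ URL is queued when its link is found, but it is never fetched, exactly because the robots.txt rules were read before any page requests.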

Types of search robots

Each search engine has its own set of robots for different purposes.
Basically, they differ in their functional purpose, although the boundaries are rather arbitrary, and each search engine understands them in its own way. For full-text-only search systems, one robot is enough for all occasions. For search engines that deal with more than text, bots are divided into at least two categories: for texts and for images. There are also separate bots dedicated to specific kinds of content - mobile, blogs, news, video, and so on.

Google Robots

All Google robots are collectively called Googlebot. The main indexing robot "introduces itself" like this:

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

This bot is busy scanning HTML pages and other documents for the main Google search. It also occasionally reads CSS and JS files; this is mainly noticeable at the early stage of indexing, while the bot crawls the site for the first time. It accepts all content types (Accept: */*).

The second of the main bots is busy scanning images from the site. It “introduces itself” simply:

Googlebot-Image/1.0

At least three bots have also been seen in the logs, busy collecting content for the mobile version of the search. The User-agent field of all three ends with the line:

(compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)

Before this line comes the model of the mobile phone with which the bot is compatible. The bots spotted so far are compatible with Nokia, Samsung, and iPhone models. Accepted content types are all, but with priorities indicated:

Accept: application/vnd.wap.xhtml+xml,application/xhtml+xml;q=0.9,text/vnd.wap.wml;q=0.8,text/html;q=0.7,*/*;q=0.6
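Given User-agent strings like those above, picking search-engine bots out of a site's access logs can be sketched like this (the list of bot names is illustrative and far from complete):

```python
# A sketch of identifying search bots from User-agent strings.
KNOWN_BOTS = [
    "Googlebot-Image", "Googlebot-Mobile", "Googlebot",
    "YandexImages", "YandexBot", "StackRambler",
]

def identify_bot(user_agent):
    # Longer names come first in the list, so "Googlebot-Image/1.0"
    # is not misreported as plain "Googlebot".
    for name in KNOWN_BOTS:
        if name in user_agent:
            return name
    return None  # an ordinary browser, or an unknown bot

print(identify_bot("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))
print(identify_bot("Googlebot-Image/1.0"))
print(identify_bot("Mozilla/5.0 (Windows NT 10.0; rv:109.0) Gecko/20100101"))
```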

Yandex robots

Of the search engines active on the RuNet, Yandex has the largest collection of bots. The official list of all spider personnel can be found in the webmaster help section. There is no point in reproducing it here in full, since it changes periodically.
However, the most important Yandex robots need to be mentioned separately.
The main indexing robot is currently called

Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)

Previously represented as

Yandex/1.01.001 (compatible; Win16; I)

It reads HTML pages of the site and other documents for indexing. The list of accepted media types was previously limited:

Accept: text/html, application/pdf;q=0.1, application/rtf;q=0.1, text/rtf;q=0.1, application/msword;q=0.1, application/x-shockwave-flash;q=0.1, application/vnd.ms-excel;q=0.1, application/vnd.ms-powerpoint;q=0.1

Since July 31, 2009, a significant expansion of this list was noticed (the number of types almost doubled), and since November 10, 2009, the list has been shortened to */* (all types).
This robot shows keen interest in a very specific set of languages: Russian, slightly less Ukrainian and Belarusian, a little less English, and very little of all other languages.

Accept-Language: ru, uk;q=0.8, be;q=0.8, en;q=0.7, *;q=0.01

The image-scanning robot carries the following line in the User-agent field:

Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots)

It scans graphics in various formats for image search.

Unlike Google, Yandex has separate bots serving some special functions of the general search.
The "mirror" robot

Mozilla/5.0 (compatible; YandexBot/3.0; MirrorDetector; +http://yandex.com/bots)

It doesn't do anything particularly complicated: it periodically appears and checks whether the main page of the site matches when the domain is accessed with and without www. It also checks parallel "mirror" domains for matches. Apparently, mirrors and the canonical form of domains in Yandex are handled by a separate software package not directly related to indexing; otherwise there is nothing to explain the existence of a separate bot for this purpose.

The favicon.ico icon collector

Mozilla/5.0 (compatible; YandexFavicons/1.0; +http://yandex.com/bots)

It periodically appears and requests the favicon.ico icon, which then appears in search results next to the link to the site. Why the image collector does not share this responsibility is unknown. Apparently, a separate software package is at play here as well.

The verification bot for new sites, which runs when they are added through the AddURL form

Mozilla/5.0 (compatible; YandexWebmaster/2.0; +http://yandex.com/bots)

This bot checks the site's response by sending a HEAD request to the root URL. This verifies that a home page exists in the domain, and the HTTP headers of that page are analyzed. The bot also requests the robots.txt file in the site root. Thus, after a link is submitted to AddURL, it is determined that the site exists and that neither robots.txt nor the HTTP headers prohibit access to the main page.
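A rough sketch of that kind of check, assuming Python's standard urllib (the URL is a placeholder): a HEAD request makes the server return the HTTP headers of the page without its body.

```python
# Building a HEAD request like the one the verification bot sends.
import urllib.request

def head_request(url):
    # Request(..., method="HEAD") tells urlopen() to send a HEAD request.
    return urllib.request.Request(url, method="HEAD")

req = head_request("http://example.com/")
print(req.get_method())  # HEAD

# Actually performing the check needs network access:
# with urllib.request.urlopen(req) as resp:
#     print(resp.status, resp.getheader("Content-Type"))
```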

Rambler robot

It currently no longer operates, since Rambler now uses Yandex search.
The Rambler indexer robot can be easily identified in the logs by its User-agent field

StackRambler/2.0 (MSIE incompatible)

Compared to its "colleagues" from other search engines, this bot seems quite simple: it does not specify a list of media types (so it receives requested documents of any type), the Accept-Language field is missing from its requests, and the If-Modified-Since field is not found in the bot's requests either.

Robot Mail.Ru

Little is known about this robot yet. The Mail.Ru portal has been developing its own search for a long time, but it still hasn't gotten around to launching it. Therefore, only the bot's name in the User-agent field is known for certain: Mail.Ru/2.0 (previously Mail.Ru/1.0). The bot's name for robots.txt directives has not been published anywhere; there is an assumption that the bot should be addressed as Mail.Ru.

Other robots

Internet search is, of course, not limited to two search engines. There are other robots as well - for example, the robot of Bing, the search engine from Microsoft. In China, in particular, there is the national search engine Baidu - but its robot is unlikely to make it to the middle of the river, let alone reach a Russian site.

In addition, many services have recently proliferated - solomono in particular - which, although not search engines, also scan sites. The value of giving such systems information about your site is often questionable, and so their robots can be banned in robots.txt.

