What work do search engine spiders do? What is a search robot? Functions of the Yandex and Google search robots. What does a search robot do?

Contrary to popular belief, the robot is not directly involved in any processing of scanned documents. It only reads and saves them; they are then processed by other programs. Visual confirmation can be obtained by analyzing the logs of a site that is being indexed for the first time. On its first visit, the bot first requests the robots.txt file, then the main page of the site. That is, it follows the only link known to it. This is where the bot's first visit always ends. After some time (usually the next day), the bot requests the next pages, using the links found on the page it has already read. Then the process continues in the same order: a request for pages whose links have already been found - a pause to process the read documents - the next session with a request for the newly found links.

Parsing pages on the fly would mean significantly greater resource consumption for the robot and a loss of time. Each scan server runs multiple bot processes in parallel. They must act as quickly as possible in order to have time to read new pages and re-read existing ones. Therefore, bots only read and save documents. Whatever they save is queued for processing (code parsing). Links found during page processing are placed in a task queue for the bots. This is how the entire network is continuously scanned. The only thing that a bot can and should analyze on the fly is the robots.txt file, so as not to request addresses that are prohibited in it. During each site crawling session, the robot first requests this file, and after it, all the pages queued for crawling.
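To illustrate this read-and-queue cycle, here is a minimal sketch of such a crawling loop in Python. It is not the code of any real search engine; the start URL, the page limit and the one-second pause are assumptions made only for the example.

import time
import urllib.robotparser
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects href attributes from <a> tags on a saved page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=10):
    # The only thing parsed on the fly is robots.txt.
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(urljoin(start_url, "/robots.txt"))
    robots.read()

    queue = deque([start_url])      # task queue of found links
    seen = {start_url}
    saved = {}                      # url -> raw document, processed separately later

    while queue and len(saved) < max_pages:
        url = queue.popleft()
        if not robots.can_fetch("*", url):
            continue                # never request addresses prohibited in robots.txt
        with urlopen(url) as response:
            saved[url] = response.read().decode("utf-8", errors="ignore")
        # "Processing" happens after saving: extract links and queue them for the bots.
        parser = LinkParser()
        parser.feed(saved[url])
        for link in parser.links:
            absolute = urljoin(url, link)
            if urlparse(absolute).netloc == urlparse(start_url).netloc and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
        time.sleep(1)               # pause between requests
    return saved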

Types of search robots

Each search engine has its own set of robots for different purposes.
They differ mainly in their functional purpose, although the boundaries are quite arbitrary, and each search engine understands them in its own way. For systems that only do full-text search, one robot is enough for all occasions. For search engines that deal with more than text, bots are divided into at least two categories: for texts and for images. There are also separate bots dedicated to specific types of content - mobile, blog, news, video, etc.

Google Robots

All Google robots are collectively called Googlebot. The main indexing robot “introduces itself” like this:

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

This bot is busy scanning HTML pages and other documents for the main Google search. It also occasionally reads CSS and JS files - this can mainly be noticed at the early stage of site indexing, while the bot is crawling the site for the first time. Accepted content types are all (Accept: */*).

The second of the main bots is busy scanning images from the site. It “introduces itself” simply:

Googlebot-Image/1.0

At least three bots have also been seen in the logs collecting content for the mobile version of search. The User-agent field of all three ends with the line:

(compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)

Before this line comes the model of mobile phone with which this bot is compatible. The bots spotted so far impersonate Nokia, Samsung and iPhone models. Accepted content types are all, but with priorities indicated:

Accept: application/vnd.wap.xhtml+xml,application/xhtml+xml;q=0.9,text/vnd.wap.wml;q=0.8,text/html;q=0.7,*/*;q=0.6
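The q parameters in such a header are relative priorities from 0 to 1: the higher the value, the more the bot prefers that content type. A small sketch of how a server-side script might order these preferences; the header string is simply the one quoted above.

def parse_accept(header):
    """Split an Accept header into (media_type, q) pairs, highest priority first."""
    preferences = []
    for item in header.split(","):
        parts = item.strip().split(";")
        media_type = parts[0].strip()
        q = 1.0                                   # default quality factor
        for param in parts[1:]:
            name, _, value = param.strip().partition("=")
            if name == "q":
                q = float(value)
        preferences.append((media_type, q))
    return sorted(preferences, key=lambda pair: pair[1], reverse=True)

header = ("application/vnd.wap.xhtml+xml,application/xhtml+xml;q=0.9,"
          "text/vnd.wap.wml;q=0.8,text/html;q=0.7,*/*;q=0.6")
for media_type, q in parse_accept(header):
    print(f"{q:.1f}  {media_type}")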

Yandex robots

Of the search engines active on the RuNet, Yandex has the largest collection of bots. In the webmaster help section you can find an official list of all its spider staff. There is no point in presenting it in full here, since this list changes periodically.
However, the Yandex robots most important to us need to be mentioned separately.
The basic indexing robot is currently called

Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)

Previously it identified itself as

Yandex/1.01.001 (compatible; Win16; I)

It reads website HTML pages and other documents for indexing. The list of accepted media types used to be limited:

Accept: text/html, application/pdf;q=0.1, application/rtf;q=0.1, text/rtf;q=0.1, application/msword;q=0.1, application/x-shockwave-flash;q=0.1, application/vnd.ms-excel;q=0.1, application/vnd.ms-powerpoint;q=0.1

Since July 31, 2009, a significant expansion has been noticed in this list (the number of types has almost doubled), and since November 10, 2009, the list has been shortened to */* (all types).
This robot is keenly interested in a very specific set of languages: Russian above all, Ukrainian and Belarusian a little less, English less still, and all other languages barely at all.

Accept-Language: ru, uk;q=0.8, be;q=0.8, en;q=0.7, *;q=0.01

The image-scanning robot carries the following line in the User-agent field:

Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots)

It scans graphics in various formats for image search.

Unlike Google, Yandex has separate bots serving some special functions of its general search.
The “mirror” robot

Mozilla/5.0 (compatible; YandexBot/3.0; MirrorDetector; +http://yandex.com/bots)

It does not do anything particularly complicated - it periodically shows up and checks whether the main page of the site is the same when the domain is accessed with the www. prefix and without it. It also checks parallel “mirror” domains for matches. Apparently, mirrors and the canonical form of domains are handled in Yandex by a separate software package not directly related to indexing. Otherwise there is simply nothing to explain the existence of a separate bot for this purpose.
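A rough idea of what such a check involves can be sketched in a few lines of Python; the domain is a placeholder, and comparing raw HTML is a simplification of whatever Yandex actually does.

from urllib.request import urlopen

def looks_like_mirror(domain):
    """Compare the main page served with and without the www. prefix."""
    with urlopen(f"http://{domain}/") as response:
        bare = response.read()
    with urlopen(f"http://www.{domain}/") as response:
        www = response.read()
    return bare == www   # identical content suggests one address mirrors the other

print(looks_like_mirror("example.com"))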

The favicon.ico icon collector

Mozilla/5.0 (compatible; YandexFavicons/1.0; +http://yandex.com/bots)

It periodically shows up and requests the favicon.ico icon, which then appears in the search results next to the link to the site. Why the image-collecting bot does not take over this duty is unknown. Apparently a separate software package is at work here as well.

The verification bot for new sites, which runs when a site is added via the AddURL form

Mozilla/5.0 (compatible; YandexWebmaster/2.0; +http://yandex.com/bots)

This bot checks the site's response by sending a HEAD request to the root URL. This verifies that a home page exists on the domain, and its HTTP headers are analyzed. The bot also requests the robots.txt file in the root of the site. Thus, after a link is submitted to AddURL, it is determined that the site exists and that neither robots.txt nor the HTTP headers prohibit access to the main page.
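The same kind of check is easy to reproduce yourself. A hedged sketch, using only the standard library and a placeholder domain, of what an AddURL-style verification might look like:

import urllib.robotparser
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

def site_accepts_indexing(domain):
    """Roughly what an AddURL-style check does: HEAD the root URL, then consult robots.txt."""
    root = f"http://{domain}/"
    try:
        response = urlopen(Request(root, method="HEAD"))
    except (HTTPError, URLError):
        return False                        # no home page, or the HTTP headers refuse access
    if response.status != 200:              # an unexpected non-OK status after redirects
        return False
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(root + "robots.txt")
    robots.read()                           # a missing robots.txt is treated as "allow everything"
    return robots.can_fetch("*", root)      # robots.txt must not prohibit the main page

print(site_accepts_indexing("example.com"))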

Rambler robot

It is no longer in operation, since Rambler now uses Yandex search.
The Rambler indexing robot could easily be identified in the logs by its User-agent field

StackRambler/2.0 (MSIE incompatible)

Compared to its “colleagues” from other search engines, this bot seems quite simple: it does not specify a list of media types (accordingly, it receives the requested document in any type), the Accept-Language field is missing from its requests, and the If-Modified-Since field is not found in them either.
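Spotting these bots in practice usually comes down to scanning the access log for their User-agent substrings. A minimal sketch; the log file name and the list of signatures are chosen purely for illustration.

from collections import Counter

# Substrings that identify the bots discussed above in the User-agent field.
BOT_SIGNATURES = ["Googlebot", "YandexBot", "YandexImages", "StackRambler", "Mail.Ru"]

def count_bot_hits(log_path):
    """Count requests per bot in a web server access log."""
    hits = Counter()
    with open(log_path, encoding="utf-8", errors="ignore") as log:
        for line in log:
            for signature in BOT_SIGNATURES:
                if signature in line:
                    hits[signature] += 1
                    break
    return hits

for bot, count in count_bot_hits("access.log").most_common():
    print(f"{bot}: {count}")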

The Mail.Ru robot

Little is known about this robot yet. The Mail.Ru portal has been developing its own search for a long time, but has still not gotten around to launching it. Therefore, only the name of the bot in the User-agent field is known for certain - Mail.Ru/2.0 (previously Mail.Ru/1.0). The name of the bot for robots.txt directives has not been published anywhere; presumably the bot should be addressed as Mail.Ru.

Other robots

Internet search is, of course, not limited to two search engines, so there are other robots - for example the robot of Bing, the search engine from Microsoft, and others. For instance, China has its own national search engine, Baidu - but its robot is unlikely to make it all the way to a Russian site.

In addition, many services have proliferated recently - solomono, in particular - which, although not search engines, also scan sites. The value of handing site information to such systems is often questionable, so their robots can be banned in robots.txt.

How search engine robots work

A search robot (spider, bot) is a small program that can visit millions of websites and scan gigabytes of text without operator intervention. Reading pages and storing text copies of them is the first stage of indexing new documents. It should be noted that search engine robots do not perform any processing of the received data. Their task is only to preserve text information.


List of search robots

Of all the search engines that scan the Runet, Yandex has the largest collection of bots. The following bots are responsible for indexing:

  • the main indexing robot that collects data from website pages;
  • a bot that can recognize mirrors;
  • Yandex search robot, which indexes images;
  • a spider that scans the pages of sites accepted into YAN (the Yandex Advertising Network);
  • robot scanning favicon icons;
  • several spiders that determine the accessibility of site pages.

Google's main search robot collects textual information. It mainly views HTML files and analyzes JS and CSS at certain intervals. It can accept any type of content allowed for indexing. Google also has a spider that controls the indexing of images, as well as a search robot - a program that supports the functioning of the mobile version of search.

See the site through the eyes of a search robot

To correct code errors and other shortcomings, a webmaster can find out how the search robot sees the site. Google provides this capability. You need to go to the webmaster tools and then click on the “crawling” tab. In the window that opens, select the line “view as Googlebot”. Next, enter the address of the page you are researching into the search form (without specifying the domain or the http:// protocol).

By selecting the “get and display” command, the webmaster will be able to visually assess the state of the site's page. To do this, click the “request to display” checkbox. A window will open with two versions of the web document, so the webmaster learns how a regular visitor sees the page and in what form it is available to the search spider.

Tip! If the web document you are analyzing is not yet indexed, you can use the “add to index” >> “scan only this URL” command. The spider will analyze the document in a few minutes, and in the near future the web page will appear in the search results. The monthly limit for indexing requests is 500 documents.

How to influence indexing speed

Having figured out how search robots work, a webmaster can promote his site much more effectively. One of the main problems of many young web projects is poor indexing: search engine robots are reluctant to visit little-known Internet resources.
It has been established that indexing speed directly depends on how intensively the site is updated. Regularly adding unique text materials will attract the attention of search engines.

To speed up indexing, you can use social bookmarking services and Twitter. It is also recommended to create a Sitemap and upload it to the root directory of the web project.
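A Sitemap is simply an XML file listing the URLs you want crawled. A minimal sketch of generating one (the URLs are placeholders); the resulting sitemap.xml is then uploaded to the site root.

import xml.etree.ElementTree as ET

def build_sitemap(urls, path="sitemap.xml"):
    """Write a minimal sitemap.xml for the given list of page URLs."""
    ns = "http://www.sitemaps.org/schemas/sitemap/0.9"
    urlset = ET.Element("urlset", xmlns=ns)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    ET.ElementTree(urlset).write(path, encoding="utf-8", xml_declaration=True)

build_sitemap([
    "http://example.com/",
    "http://example.com/articles/how-search-robots-work",
])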

A search robot is a special program of a search engine designed to enter into a database (index) the sites and pages it finds on the Internet. Other names in use: crawler, spider, bot, automatic indexer, ant, webcrawler, webscutter, webrobot, webspider.

Principle of operation

A search robot is a browser-type program. It constantly scans the network: it visits indexed (already known to it) sites, follows links from them and finds new resources. When a new resource is discovered, the robot adds it to the search engine's index according to a set procedure. The search robot also indexes updates on sites, at a frequency it records for each of them. For example, a site that is updated once a week will be visited by a spider with that frequency, while content on news sites can be indexed within minutes of publication. If no links from other resources lead to a site, then in order to attract search robots the resource must be added through a special form (Google Webmaster Center, Yandex Webmaster Panel, etc.).

Types of search robots

Yandex spiders:

  • Yandex/1.01.001 I - the main bot involved in indexing,
  • Yandex/1.01.001 (P) - indexes pictures,
  • Yandex/1.01.001 (H) - finds mirror sites,
  • Yandex/1.03.003 (D) - determines whether the page added from the webmaster panel meets the indexing parameters,
  • YaDirectBot/1.0 (I) - indexes resources of the Yandex advertising network,
  • Yandex/1.02.000 (F) - indexes site favicons.

Google Spiders:

  • Googlebot is the main robot
  • Googlebot News - scans and indexes news,
  • Google Mobile - indexes sites for mobile devices,
  • Googlebot Images - searches and indexes images,
  • Googlebot Video - indexes videos,
  • Google AdsBot - checks the quality of the landing page,
  • Google Mobile AdSense and Google AdSense - index sites of the Google advertising network.

Other search engines also use several types of robots that are functionally similar to those listed.

How do search engines work? One of the wonderful things about the Internet is that there are hundreds of millions of web resources waiting and ready to be presented to us. But the bad thing is that there are just as many millions of pages that, even if we need them, will never appear before us, because they are simply unknown to us. How can you find out what can be found on the Internet, and where? To do this, we usually turn to search engines.

Internet search engines are special sites on the global network designed to help people find the information they need on the World Wide Web. There are differences in how search engines perform their functions, but in general there are 3 main functions they all share:

- all of them “search” the Internet (or some sector of the Internet) based on given keywords;
- all search engines index the words they find and the places where they find them;
- all search engines allow users to search for words or combinations of keywords based on web pages already indexed and included in their databases.

The very first search engines indexed up to several hundred thousand pages and received 1,000 - 2,000 requests per day. Today, top search engines have indexed and are continuously indexing hundreds of millions of pages and processing tens of millions of requests per day. Below we will talk about how search engines work and how they “put together” all the pieces of information found so as to be able to answer any question that interests us.

Let's look at the Web

When people talk about Internet search machines, they really mean search engines for the World Wide Web. Before the Web became the most visible part of the Internet, search engines already existed to help people find information on the network. Programs called "gopher" and "Archie" could index files located on different servers connected to the Internet and significantly reduced the time spent searching for the necessary programs or documents. In the late 1980s, being able to use search programs such as gopher, Archie, Veronica and the like was synonymous with "knowing how to work on the Internet". Today, most Internet users limit their searches to the worldwide web, or WWW.

A small beginning

Before a search engine can tell you where to find a document or file, that file or document must already have been found. To find information about the hundreds of millions of existing web pages, a search engine uses a special robot program. This program is also called a spider, and it is used to build a list of the words found on a page. The process of building such a list is called web crawling. To build and record a “useful” (meaningful) list of words, the search spider must “look through” a great many other pages.

How does a spider begin its journey across the web? Usually the starting point is the world's largest servers and very popular web pages. The spider starts from such a site, indexes all the words it finds and continues on its way, following links to other sites. In this way, the spider robot begins to cover ever larger “pieces” of web space. Google.com began as an academic search engine. In an article describing how this search engine was created, Sergey Brin and Lawrence Page (the founders and owners of Google) gave an example of how fast Google's spiders work. There are several of them, and the search usually begins with 3 spiders. Each spider maintains up to 300 simultaneously open connections to web pages. At peak load, using 4 spiders, the Google system was capable of processing 100 pages per second, generating traffic of about 600 kilobytes/sec.

To provide the spiders with the data they needed to process, Google used to have a server that did nothing but feed the spiders more and more URLs. In order not to depend on Internet service providers for the domain name servers (DNS) that translate URLs into IP addresses, Google set up its own DNS server, reducing the time spent on indexing pages to a minimum.

When the Google robot visits an HTML page, it takes into account 2 things:

- the words (text) on the page;
- their location (in which part of the page body).

Words located in service sections such as the title, subtitles, meta tags and others were flagged as particularly important for user search queries. The Google spider was built to index every such word on a page, with the exception of articles like "a", "an" and "the". Other search engines take a slightly different approach to indexing.

All search engine approaches and algorithms are ultimately aimed at making spider robots work faster and more efficiently. For example, some search robots track the words in the title, the links, and up to the 100 most frequently used words on a page during indexing, as well as each of the words in the first 20 lines of the page's text content. This is, in particular, the indexing algorithm of Lycos.

Other search engines, such as AltaVista, go in the other direction, indexing every single word on a page, including "a", "an", "the" and other unimportant words.
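The difference between the two approaches is easy to show on a toy example: one index keeps every word, the other drops the articles. This is only an illustration of the idea, not either engine's real code.

import re

STOP_WORDS = {"a", "an", "the"}

def words_on_page(html):
    """Extract lower-cased words from a page, ignoring tags."""
    text = re.sub(r"<[^>]+>", " ", html)
    return re.findall(r"[a-z]+", text.lower())

def index_every_word(html):
    return set(words_on_page(html))                                    # the AltaVista-style approach

def index_without_stop_words(html):
    return {w for w in words_on_page(html) if w not in STOP_WORDS}     # the Google-style approach

page = "<html><title>The spider</title><body>A spider reads the page</body></html>"
print(index_every_word(page))
print(index_without_stop_words(page))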

Meta Tags

Meta tags allow the owner of a web page to specify the keywords and concepts that define the essence of its content. This is a very useful tool, especially when these keywords may be repeated up to 2-3 times in the text of the page. In this case, meta tags can “direct” the search robot to the right choice of keywords for indexing the page. However, meta tags can be “gamed” with popular search queries and concepts that have nothing to do with the content of the page itself. Search robots are able to combat this, for example by analyzing the correlation between the meta tags and the content of the page and “throwing out” from consideration those meta tags (and the corresponding keywords) that do not match the content of the pages.
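A naive version of such a correlation check simply asks what share of the declared keywords actually occur in the visible text of the page. This is only a sketch of the idea, not any engine's actual algorithm.

import re

def meta_keyword_overlap(html):
    """Fraction of the declared meta keywords that actually appear in the page body."""
    match = re.search(r'<meta\s+name="keywords"\s+content="([^"]*)"', html, re.I)
    if not match:
        return 1.0                                   # nothing declared, nothing to cheat with
    keywords = [k.strip().lower() for k in match.group(1).split(",") if k.strip()]
    body = re.sub(r"<[^>]+>", " ", html).lower()
    found = sum(1 for k in keywords if k in body)
    return found / len(keywords)

page = ('<html><head><meta name="keywords" content="spiders, indexing, free iphone"></head>'
        '<body>How spiders handle indexing.</body></html>')
print(meta_keyword_overlap(page))                    # 2 of the 3 keywords match the content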

All this applies to those cases when the owner of a web resource really wants to be included in search results for the desired search words. But it often happens that the owner does not want to be indexed by the robot at all. But such cases are not the topic of our article.

Index construction

Once the spiders have finished their work of finding new web pages, search engines must store all the information found in such a way that it is convenient to use later. There are 2 key components that matter here:

- the information stored together with the data;
- the method by which this information is indexed.

In the simplest case, a search engine could simply record each word and the URL where it was found. But this would make the search engine a completely primitive tool, since there would be no information about which part of the document the word is in (meta tags or plain text), whether the word is used once or repeatedly, and whether it is contained in a link to another important, related resource. In other words, this method would not rank sites, would not provide relevant results to users, and so on.

To provide us with useful data, search engines store more than just the word and its URL. A search engine can save data on the number (frequency) of mentions of a word on a page and assign a “weight” to the word, which then helps produce search listings (results) based on the weighted ranking for this word, taking into account its location (in links, meta tags, the page title and so on). Each commercial search engine has its own formula for calculating the “weight” of keywords during indexing. This is one of the reasons why different search engines produce completely different results for the same search query.
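In code, such storage is essentially an inverted index: for every word, a list of (URL, weight) records, where the weight is computed by the engine's own formula. Below is a toy version with a deliberately simple weight formula invented purely for the example (frequency plus a bonus for title words).

from collections import defaultdict

def add_page(index, url, title, body):
    """Store each word with a weight: frequency in the text plus a bonus for the title."""
    words = (title + " " + body).lower().split()
    for word in set(words):
        frequency = words.count(word)
        weight = frequency + (5 if word in title.lower().split() else 0)
        index[word].append((url, weight))

index = defaultdict(list)
add_page(index, "http://example.com/a", "Search robots", "How search robots crawl pages")
add_page(index, "http://example.com/b", "Cooking", "Recipes and more recipes")

# Results for a word are ordered by the stored weight.
print(sorted(index["search"], key=lambda rec: rec[1], reverse=True))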

The next important point in processing the found information is encoding it in order to reduce the amount of disk space needed to store it. For example, the original Google paper describes that 2 bytes (of 8 bits each) were used to store the weight data of words - this takes into account the capitalization of the word, the size of the letters themselves (font size) and other information that helps rank the page. Each such “piece” of information requires 2-3 bits of data within the complete 2-byte set. As a result, a huge amount of information can be stored in a very compact form. Once the information is “compressed”, it is time to start indexing.
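Packing such attributes is ordinary bit manipulation. The layout below (1 bit for capitalization, 3 bits for relative font size, 12 bits for position) is invented purely to show the idea; it is not Google's real format.

def pack_hit(capitalized, font_size, position):
    """Pack word attributes into one 16-bit value: 1 + 3 + 12 bits."""
    assert 0 <= font_size < 8 and 0 <= position < 4096
    return (int(capitalized) << 15) | (font_size << 12) | position

def unpack_hit(value):
    return bool(value >> 15), (value >> 12) & 0b111, value & 0xFFF

packed = pack_hit(capitalized=True, font_size=3, position=427)
print(packed, unpack_hit(packed))        # the whole record fits in exactly 2 bytes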

Indexing has one goal: to ensure the fastest possible search for the necessary information. There are several ways to build indexes, but the most effective is to build hash tables. Hashing uses a specific formula to assign a numerical value to each word.

In any language there are letters with which many more words begin than with the rest of the alphabet. For example, there are significantly more words starting with the letter "M" in an English dictionary than words starting with the letter "X". This means that searching for a word starting with the most popular letter would take longer than for any other word. Hashing evens out this difference and reduces the average search time, and also separates the index itself from the real data. A hash table contains hash values along with a pointer to the data corresponding to each value. Effective indexing plus effective placement together provide high search speed, even if the user asks a very complex search query.
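In Python the same idea can be shown with a handcrafted hash function over a plain list of buckets; a real engine's hash function and table are of course far more elaborate.

TABLE_SIZE = 1024

def word_hash(word):
    """Assign a numerical value to a word with a simple polynomial hash."""
    value = 0
    for ch in word:
        value = (value * 31 + ord(ch)) % TABLE_SIZE
    return value

# Each bucket holds (word, pointer-to-data) pairs, keeping the index separate from the data.
buckets = [[] for _ in range(TABLE_SIZE)]

def index_word(word, data_pointer):
    buckets[word_hash(word)].append((word, data_pointer))

def lookup(word):
    return [ptr for w, ptr in buckets[word_hash(word)] if w == word]

index_word("mouse", "doc-17")
index_word("xylophone", "doc-3")
print(lookup("mouse"), lookup("xylophone"))   # both lookups cost the same regardless of first letter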

The future of search engines

A search based on Boolean operators ("and", "or", "not") is a literal search - the search engine receives the search words exactly as they were entered. This can cause problems when, for example, the entered word has multiple meanings. "Key", for example, can mean a means of opening a door, or it can mean a "password" for logging into a server. If you are only interested in one meaning of the word, you obviously do not need data on its other meaning. You can, of course, build a literal query that will exclude output based on the unneeded meaning of the word, but it would be nice if the search engine itself could help you.
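Literal Boolean search over an already built index reduces to set operations on the lists of pages containing each word. A minimal sketch with a toy index and invented page names:

# Toy inverted index: word -> set of page URLs containing it.
index = {
    "key":      {"doors.html", "logins.html"},
    "door":     {"doors.html"},
    "password": {"logins.html"},
}

def search(include, exclude=()):
    """AND together all words in `include`, then NOT the words in `exclude`."""
    results = set.intersection(*(index.get(w, set()) for w in include))
    for word in exclude:
        results -= index.get(word, set())
    return results

print(search(["key"]))                        # both meanings of "key"
print(search(["key"], exclude=["password"]))  # a literal query that drops the login sense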

One area of ​​research into future search engine algorithms is conceptual information retrieval. These are algorithms that use statistical analysis of pages containing a given search keyword or phrase to find relevant data. It is clear that such a "conceptual search engine" would require much more storage space for each page and more time to process each request. Currently, many researchers are working on this problem.

No less intensive work is being done in the field of developing search algorithms based on natural-language queries (Natural-Language query).

The idea behind natural queries is that you can write your query as if you were asking a colleague sitting across from you. There is no need to worry about Boolean operators or to strain to compose a complex query. The most popular natural-language search site today is AskJeeves.com. It converts the query into keywords, which it then uses to search the indexed sites. This approach only works for simple queries. However, progress does not stand still; it is possible that very soon we will be “talking” to search engines in our own “human language”.

Friends, I welcome you again! Now we will look at what search robots are and talk in detail about the Google search robots and how to be friends with them.

First you need to understand what search robots actually are; they are also called spiders. What work do search engine spiders do?

These are programs that check sites. They look through all the posts and pages on your blog, collect information, which they then transmit to the database of the search engine for which they work.

You don’t need to know the entire list of search robots, the most important thing is to know that Google now has two main spiders, called “panda” and “penguin”. They fight against low-quality content and junk links, and you need to know how to repel their attacks.

The Google Panda search robot was created to promote only high-quality material in searches. All sites with low-quality content are lowered in search results.

This spider first appeared in 2011. Before it appeared, it was possible to promote any website by publishing large amounts of text in articles and using a huge number of keywords. Together, these two techniques pushed low-quality content to the top of search results, while good sites were lowered in the rankings.

“Panda” immediately put things in order by checking all the sites and putting everyone in their rightful place. Since it fights low-quality content, it is now possible to promote even small sites with high-quality articles, whereas previously it was useless to promote such sites - they could not compete with giants that have a large amount of content.

Now let's figure out how you can avoid the “panda” sanctions. First you need to understand what it doesn't like. I already wrote above that it fights bad content, but what kind of text does it consider bad? Let's figure this out so that we don't publish it on our website.

The Google search robot strives to ensure that this search engine serves only high-quality material to searchers. If you have articles that contain little information and are unattractive in appearance, then urgently rewrite these texts so that the “panda” does not get to you.

High-quality content can be both large and small, but if the spider sees a long article with a lot of information, then it will be more useful to the reader.

Then you need to watch out for duplication, in other words plagiarism. If you think you can just rewrite other people's articles for your blog, you can immediately put an end to your site. Copying is strictly punished by applying a filter, and plagiarism is very easy to check; I wrote an article on how to check texts for uniqueness.

The next thing to notice is the oversaturation of the text with keywords. Anyone who thinks that they can write an article using only keywords and take first place in the search results is very much mistaken. I have an article on how to check pages for relevance, be sure to read it.

And another thing that can attract a “panda” to you is old articles that have become outdated and do not bring traffic to the site. They definitely need to be updated.

There is also the Google search robot “penguin”. This spider fights spam and junk links on your site. It also detects purchased links from other resources. Therefore, in order not to fear this search robot, you should not buy links, but publish high-quality content so that people link to you themselves.

Now let's formulate what needs to be done to make the site look perfect in the eyes of a search robot:

  • To create quality content, first research the topic well before writing the article. Then make sure that people are really interested in this topic.
  • Use specific examples and pictures, this will make the article lively and interesting. Break the text into small paragraphs to make it easy to read. For example, if you open the jokes page in a newspaper, which ones will you read first? Naturally, each person first reads the short texts, then the longer ones, and only last the long walls of text.
  • The “panda’s” favorite quibble is the lack of relevance of an article that contains outdated information. Follow the updates and change the texts.
  • Keep track of the keyword density; I wrote above how to determine this density; in the service I described, you will receive the exact required number of keywords (a simple way to estimate density is sketched just after this list).
  • Don't plagiarize; everyone knows that you can't steal other people's things, and text is no different. You will be punished for theft by being caught in a filter.
  • Write texts of at least two thousand words, then such an article will look informative in the eyes of search engine robots.
  • Stay on topic with your blog. If you are running a blog about making money on the Internet, then you do not need to publish articles about air guns. This may lower the rating of your resource.
  • Design your articles beautifully, divide them into paragraphs and add pictures so that you enjoy reading and don’t want to leave the site quickly.
  • When purchasing links, make them to the most interesting and useful articles that people will actually read.
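As promised in the list above, keyword density is simply the share of the text that the keyword occupies. A small sketch of how it can be estimated; the sample article and keyword are placeholders.

import re

def keyword_density(text, keyword):
    """Percentage of words in the text that are the given keyword."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return 0.0
    occurrences = words.count(keyword.lower())
    return 100.0 * occurrences / len(words)

article = "Search robots visit the site, and search robots index the site again."
print(f"{keyword_density(article, 'robots'):.1f}% of the words are the keyword")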

Well, now you know what work search engine robots do and you can be friends with them. And most importantly, you have studied the Google search robots “panda” and “penguin” in detail.



