Google's Web Crawling Technology And Basic Concepts…

A web crawler, also known as a web spider or web robot (or, especially in the FOAF community, a web scutter), is a program or automated script that browses the World Wide Web in a methodical, automated manner. Other, less frequently used names for web crawlers are ants, automatic indexers, bots, and worms.

This process is called web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine, which indexes the downloaded pages to provide fast searches. (Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, physics and computer science. In the context of search engines designed to find web pages on the Internet, the process is also called Web indexing.) Crawlers can also be used for automating maintenance tasks on a website, such as checking links or validating HTML code. They can also gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for spam).
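
To make the indexing idea above a little more concrete, here is a minimal sketch of an inverted index in Python. It is only an illustration under simple assumptions (pages are plain text already downloaded, words are split on non-alphanumeric characters); the names build_index, search and sample_pages are made up for this example and say nothing about how any real search engine stores its index.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of page URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word of the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical pages standing in for what a crawler downloaded.
sample_pages = {
    "http://example.com/a": "Web crawlers browse the web",
    "http://example.com/b": "Search engines index downloaded pages",
    "http://example.com/c": "Crawlers feed pages to search engines",
}
index = build_index(sample_pages)
hits = search(index, "search pages")
```

The point of the structure is that a query only has to look up its own words instead of scanning every stored page, which is what makes searches fast.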

A web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.
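
The seed list and crawl frontier described above can be sketched in a few lines of Python. This is a rough illustration, not how Googlebot or any particular crawler works: the fetch function is supplied by the caller (so the sketch runs without network access), the allow function is a stand-in for the "set of policies", and all names and URLs here are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, allow=lambda url: True, max_pages=100):
    """Visit pages breadth-first, starting from the seed URLs.

    fetch(url) is a caller-supplied function that returns the page's
    HTML as a string, or None if the page cannot be fetched.
    allow(url) stands in for the crawl policies: it decides whether
    a discovered URL may join the frontier.
    """
    frontier = deque(seeds)   # URLs waiting to be visited
    seen = set(seeds)         # URLs already queued, to avoid repeats
    visited = []              # URLs fetched, in visit order

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)

        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment.
            link, _ = urldefrag(urljoin(url, href))
            if link not in seen and allow(link):
                seen.add(link)
                frontier.append(link)

    return visited


# A tiny in-memory "web" used instead of real HTTP requests.
sample_web = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b#top">B</a>',
    "http://example.com/a": '<a href="/">home</a> <a href="/c">C</a>',
    "http://example.com/b": "",
    "http://example.com/c": '<a href="http://other.example/">out</a>',
}
visited = crawl(
    ["http://example.com/"],
    sample_web.get,
    allow=lambda u: u.startswith("http://example.com/"),
)
```

Here the policy simply keeps the crawler on one site. A real crawler would also need politeness delays, robots.txt handling and re-visit scheduling, none of which are shown.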

We found a great post that discusses Google's crawling technology. It is one of the most convenient ways to understand the subject, and anybody can get the main theme of Google's crawling technology from it…

"Google does two types of crawl: the main crawl and the fresh crawl. The main crawl is done once a month; the fresh crawl is done more-or-less daily, but only some pages are crawled. Google is still experimenting with which sites and pages to crawl and how deep to crawl. Neither type of crawl puts any new pages into Google's main index. That only happens at the next update – at the conclusion of the next Google Dance. Fresh crawls can be distinguished from main crawls by the IP addresses used by Googlebot. Fresh crawl: 64.68.82…; Main crawl: 216.239.46…

The fresh crawl re-crawls pages that are already in the index, picking up new pages along the way. Fresh-crawled new pages are evaluated in some way and inserted into the search results straight away, which means that new pages can be found by surfers almost immediately, even though they are not yet in Google's main index. A new page can be added to a site today and traffic could start arriving on it within hours.

Also, updated pages that are already in Google's main index are re-evaluated in some way and inserted into the search results in places that reflect the changes. For example, the day after the link to this site's SEO Copywriting page was placed on the index page, the index page showed up at #3 for the search term "seo copywriting". The index page was well established in Google's main index, but the SEO copywriting part of it was new, and was given the "fresh" treatment. Very soon after that, the SEO Copywriting page itself was 'fresh' ranked at #1.

This is good news for surfers and webmasters, although some websites can suffer for a while due to fresh-crawled new pages pushing them down the rankings.

In practice, many new fresh-crawled pages enjoy a flurry of traffic while they are not in the main index. When they have been included in the main index, they take their place in the rankings according to their evaluated merit, and the traffic tends to be reduced unless the page actually merits its 'fresh' ranking, of course.

At the time of writing, the fresh crawl is still new, but my theory of the experience of a new page is this:

Sometime during a month, the new page is found by Google and fresh-crawled. It is evaluated in some way and placed in a 'fresh' index. From there it is inserted into the rankings, according to its 'fresh' evaluation.

The page is involved in the next end-of-month dance but, because it hasn't yet been main-crawled, it isn't included in the actual update and isn't placed in the main index. It continues to be a 'fresh' page.

Then the main crawl gets underway. If the page still exists, it is crawled and will be included in the following update, when it will enter the main index. During this period, it may keep the 'fresh' ranking that it achieved, provided that other new pages don't come along to push it down. It is only after the page enters the main index that its true ranking is seen.

Because of the page's revised evaluation when entering the main index, traffic to it is likely to drop. That's assuming that the page didn't really merit its 'fresh' ranking.

It should be noted that Google is continually updating the rankings and 'fresh' rankings are very volatile in that they come, go and change during a page's 'fresh' period.

As I said, the fresh crawl is still quite new and not yet fully understood. The experience of a new page from fresh crawl to main index is what I believe I have observed, but my conclusions could easily be wrong. I believe that new pages don't enter the main index until the dance and update after they have been main-crawled, even though they have usually been involved in one dance already. The reason is that Google still shows no links to them until after the update following their first main crawl. This is my theory of a new page's experience but, like any theory, it may need to be revised in the light of new observations.

Addendum

As of the New Year 2003 update, Google is applying Toolbar PR0 (zero PageRank) values to some new pages. PR0 normally indicates that a page has been penalized, but these PR0s are not penalties. From my observations, it appears that the values apply to pages that have been fresh-crawled and have gone through an update following the fresh crawl. Such pages don't get into the main index until after they have been main-crawled and gone through the update after that. It appears that, between the two updates, Google applies PR0 to the pages.

The reason for it may be to do with how Google inserts 'fresh' pages into the rankings or it may be for some other reason entirely. Also, it may be that different PR values are applied to different pages, but it is brand new and, as yet, I have seen only PR0 values applied.

Submitted by Paavan Solanki – leading search engine optimization expert and owner of premier seo company in India – Targetseo.com"

Source: Latest Google Crawling Technology – Thanks To Targetseo
Source: Wikipedia
