Before giving us the information that we require, the search engine must be able to find this information for itself.
It does this by web crawling.
To find information on the hundreds of millions of web pages that exist today, a search engine uses a web crawler (often called a web spider). A web crawler is any program that browses the World Wide Web in an automated and methodical way. The process through which a web crawler scans the Web little-by-little is called web crawling or spidering.
As a spider crawls the World Wide Web, page after page, it builds a massive list of all the information out there on the Web, along with the web address where it found each piece of information. Obviously, to build and maintain a useful record of information and words, a search engine's spiders have to look at a lot of pages.
Where does a spider begin its journey across the Web? The usual starting points are very popular web pages that are frequently visited and have many links leading to other pages. The spider follows every link it finds within one site, then moves on to the sites those links point to. In this way, a spider begins to move across the Web.
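The traversal described above can be sketched as a breadth-first walk over a link graph. The sketch below substitutes a small in-memory dictionary of hypothetical URLs for the real Web (an actual spider would fetch each page over HTTP and parse its links); the URLs and link structure are invented for illustration.

```python
from collections import deque

# A tiny in-memory stand-in for the Web: each hypothetical URL maps to
# the list of URLs it links to. A real spider would fetch pages over HTTP.
FAKE_WEB = {
    "http://popular.example/": ["http://a.example/", "http://b.example/"],
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://popular.example/"],
    "http://c.example/": [],
}

def crawl(seed):
    """Breadth-first traversal starting from a popular seed page."""
    seen = {seed}
    queue = deque([seed])
    visited_order = []
    while queue:
        url = queue.popleft()
        visited_order.append(url)          # "visit" (fetch and record) the page
        for link in FAKE_WEB.get(url, []):
            if link not in seen:           # never re-crawl a page we have queued
                seen.add(link)
                queue.append(link)
    return visited_order

print(crawl("http://popular.example/"))
# → ['http://popular.example/', 'http://a.example/',
#    'http://b.example/', 'http://c.example/']
```

Starting from a well-linked seed page, the queue naturally fans out across every page reachable by links, which is why popular pages make good starting points.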
When Google was initially built as an academic search engine, its creators, Sergey Brin and Lawrence Page, gave an example of how quickly their spiders could work. They built their first system using three spiders at a time. Each spider could keep about 300 connections to Web pages open at a time. At its peak performance, using four spiders, their system could crawl over 100 pages per second, generating around 600 kilobytes of data each second.
When a spider looks at an HTML page, it takes note of two things: the words on the page, and the exact address of the page. Words occurring in the title, subtitles, and other positions of importance are noted for special consideration. The Google spider was built to take note of every significant word on a page, leaving out the articles "a," "an," and "the." Other spiders take different approaches.
These different approaches usually attempt to make the spider operate faster and retrieve the information that users really want. For example, the search engine Lycos is said to have its spiders keep special track not only of the title, sub-headings, and links on a page, but also of the 100 most frequently used words on the page and every word in the first 20 lines of text.
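The Lycos-style approach described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not Lycos's actual algorithm: it records the most frequent words on the page plus every word in the page's opening lines.

```python
from collections import Counter

def lycos_style_summary(page_text, top_n=100, head_lines=20):
    """Record the top_n most frequent words on a page, plus every word
    appearing in the first head_lines lines of text (a sketch of the
    Lycos approach described in the text)."""
    words = page_text.lower().split()
    top_words = [w for w, _ in Counter(words).most_common(top_n)]
    head_text = " ".join(page_text.splitlines()[:head_lines])
    head_words = head_text.lower().split()
    return top_words, head_words

top, head = lycos_style_summary("spider spider web\ncrawl web\nindex",
                                top_n=2, head_lines=2)
print(top)    # → ['spider', 'web']
print(head)   # → ['spider', 'spider', 'web', 'crawl', 'web']
```

Tracking only frequent and early-appearing words keeps the index small, which is one way a spider can run faster at the cost of completeness.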
Other search engines place a greater emphasis on completeness while spidering the Web. AltaVista spiders index every single word on a page, including "a," "an," "the," and other "insignificant" words.
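The completeness-oriented approach can be sketched as a simple inverted index that maps every word, stop words included, to the pages containing it. The pages and URLs below are invented for illustration; this is a sketch of the idea, not AltaVista's implementation.

```python
from collections import defaultdict

def build_full_index(pages):
    """Build an inverted index over every word on every page,
    including "insignificant" words like "a," "an," and "the"."""
    index = defaultdict(set)        # word -> set of URLs containing it
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

# Hypothetical pages standing in for crawled content.
pages = {
    "http://x.example/": "the spider crawls the web",
    "http://y.example/": "a web page",
}
index = build_full_index(pages)
print(sorted(index["web"]))   # → ['http://x.example/', 'http://y.example/']
print(sorted(index["the"]))   # even "the" is indexed → ['http://x.example/']
```

Indexing every word makes exact-phrase queries like "to be or not to be" answerable, which is impossible if the common words were discarded at crawl time.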