So once the creepy crawlies are done with web crawling, that is, scouring the World Wide Web for all possibly relevant information, the search engine needs to store this information in a way that makes it easy to use. Note here that the process of crawling the Web never really “completes”. The Web is ever-changing, every second of the day, which means that the crawlers need to keep at their crawling all the time.
Nonetheless, we move on to the next task of a search engine: sorting and storing all the information it just found.
Now, in its simplest form, a search engine could index just the word it found and where it found this word (the URL). If we then searched for that particular word, the search engine would be of very little use. Every page containing the word would appear in our search results, irrespective of whether the word was used in an important or a trivial way on the page, whether the word was used once or many times, or whether the page contained links to other pages containing the word. In other words, there would be no way of building a ranked list that tries to present the most useful pages at the top of the search results.
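To make the "simplest form" concrete, here is a minimal sketch of such an index in Python. The function name `build_simple_index` and the example URLs and page texts are made up for illustration; a real search engine stores far more than this.

```python
from collections import defaultdict


def build_simple_index(pages):
    """Map each word to the set of URLs where it was found.

    `pages` is a hypothetical dict of {url: page text}.
    """
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


# Made-up pages: one is about cats, the other only mentions them.
pages = {
    "http://example.com/a": "cats are great pets and cats love naps",
    "http://example.com/b": "a passing mention of cats",
}

index = build_simple_index(pages)

# Both URLs come back, with nothing to tell us which page matters more.
print(sorted(index["cats"]))
```

Notice that the lookup returns both pages as equals. The index knows where the word is, but nothing about how it was used.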
So to make the search more relevant, search engines evidently need to store more than just the word and the URL where it was found. The search engine would need to assign some sort of importance, or weight, to each page containing the word. Perhaps it could attach a greater weight to pages in which the word appears more often, or a greater weight to a page with the word in its heading or title and a lower weight to a page where the word only appears in regular text.
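As a rough sketch of this idea, the index below stores a weight next to each URL. The weights (5 for a word in the title, 1 for a word in the body) and the names `build_weighted_index` and `search` are assumptions chosen for illustration; actual search engines use their own, far more elaborate formulae.

```python
from collections import defaultdict

# Hypothetical weights, picked only for this example.
TITLE_WEIGHT = 5
BODY_WEIGHT = 1


def build_weighted_index(pages):
    """Map each word to {url: weight}.

    `pages` is a hypothetical dict of {url: (title, body)}.
    Every occurrence adds to the weight, and title occurrences add more.
    """
    index = defaultdict(dict)
    for url, (title, body) in pages.items():
        for word in title.lower().split():
            index[word][url] = index[word].get(url, 0) + TITLE_WEIGHT
        for word in body.lower().split():
            index[word][url] = index[word].get(url, 0) + BODY_WEIGHT
    return index


def search(index, word):
    """Return the URLs containing `word`, heaviest weight first."""
    postings = index.get(word.lower(), {})
    return sorted(postings, key=lambda url: (-postings[url], url))


pages = {
    "http://example.com/a": ("Caring for cats", "cats need food and cats need play"),
    "http://example.com/b": ("Garden notes", "the neighbour has cats"),
}

weighted_index = build_weighted_index(pages)
results = search(weighted_index, "cats")
print(results)
```

Here the page with the word in its title and twice in its body ends up with a weight of 7, the passing mention gets 1, and the search lists the heavier page first.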
Different web search engines have adopted different ways of ranking pages (they use different formulae to assign the weights). That is why the same search query run on different search engines often returns the results in a different order.
So that answers the question of what else is stored: the additional information kept alongside the data that a spider found.
There is one more major issue that we need to deal with when we talk about indexing: the method by which the information is stored (how is it stored?). Regardless of the type of additional information (weights, rankings etc.) stored with the data, the data is encoded when saved. This is done to save vital storage space. As a result of this encoding, a great amount of information can be stored in a very compact form. After this information is compacted, it is ready to be indexed!
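The text does not say which encoding is used, so the following is only one possible illustration: storing the gaps between sorted page numbers as variable-length bytes, a common way of compacting lists of integers. The page numbers and the function names `encode_postings` and `decode_postings` are hypothetical.

```python
def encode_postings(doc_ids):
    """Store the gaps between sorted IDs, 7 bits per byte.

    A set high bit on a byte means "more bytes follow for this gap".
    """
    out = bytearray()
    previous = 0
    for doc_id in sorted(doc_ids):
        gap = doc_id - previous
        previous = doc_id
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)  # low 7 bits, continuation flag set
            gap >>= 7
        out.append(gap)  # final byte of this gap, flag clear
    return bytes(out)


def decode_postings(data):
    """Rebuild the original sorted IDs from the encoded bytes."""
    doc_ids = []
    current = 0
    gap = 0
    shift = 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7  # more bytes belong to this gap
        else:
            current += gap
            doc_ids.append(current)
            gap = 0
            shift = 0
    return doc_ids


# Made-up page numbers for pages that contain some word.
doc_ids = [1000, 1003, 1010, 1500]
encoded = encode_postings(doc_ids)
decoded = decode_postings(encoded)
print(len(encoded), decoded)
```

Stored as four 4-byte integers, this list would take 16 bytes; encoded this way it takes 6, and decoding gives back exactly the same numbers.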