The Web is a Trillion Pages Long: Google

Google has counted a trillion pages on the web, and that number is growing by several billion pages per day, the company said in a blog post. The web itself, though, is larger than what Google indexes, since the company says it does not index every one of those trillion pages: "We don't index every one of those trillion pages -- many of them are similar to each other, or represent auto-generated content..." Most of those pages are duplicates, with multiple URLs pointing to the same content.

The first Google index in 1998 had 26 million pages, and by 2000 the Google index reached the one billion mark. The blog further charts the nature of this task and the evolution of Google's own methods: "Back then, we did everything in batches: one workstation could compute the PageRank graph on 26 million pages in a couple of hours, and that set of pages would be used as Google's index for a fixed period of time. Today, Google downloads the web continuously, collecting updated page information and re-processing the entire web-link graph several times per day."

The blog post led Michael Arrington of TechCrunch to hint at something interesting coming next week. Noting that Google is proud to have the most 'comprehensive index of any search engine', Arrington adds, "That may be true today, but it probably won't be true next week". A hint at a potential challenger for the search engine crown, if there ever was one.
