Google 'Crawling' and 'Caching'

dav2008 · May 22, 2003

One of the Google help pages says, "Google takes a snapshot of each page examined as it crawls the web and caches these as a back-up in case the original page is unavailable."

I'm wondering how it 'crawls' through the web, and where in the world Google gets all that storage space for all those websites?

LogicalAtheist · May 22, 2003

Google is constantly going to every single webpage and making a copy of it.

Right now it's going through every damn page there is and updating the cached version it has.

Where does it get the space? It's a secret.

Greg Bernhardt · May 22, 2003

Google has thousands of Linux servers networked together that scour the net. Right now they're working on some new algorithms for PageRank, but not much is known. As for space, it's not a big issue since storage is so cheap. I'd imagine it's hundreds of terabytes.

LogicalAtheist · May 22, 2003

Two years ago I lived with a programmer, and I learned some basics then.

We wrote a search engine. Because of its nature, it was the most powerful search engine ever created. This may sound hard to prove, but in fact it uses a simple idea that would prove it.

The code for the entire engine would fit on the front and back of one piece of paper in regular font.

Unfortunately, I don't talk to him anymore. He's never put the search engine on the web. I think I'm going to contact him and get the code if he still has it.

Does anyone here use Delphi and would be able to build the code and make changes?

I felt that, based on the design of the engine, which I conceived, it'd be the next popular engine when the time comes.

Yes, it's better than google. In fact, it is literally "better" than all search engines combined.

Any programmers willing to guess how the code works, so that it could be better than all of them combined?

damgo · May 22, 2003

The way they 'crawl' through the net is they have a database of known pages. Every so often they check these and follow any links off the page; update those pages, follow their links, etc.
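
To make the idea concrete, here is a minimal sketch of that loop: keep a set of known pages, check each one, and queue any links not seen before. It is only an illustration, not how Google actually does it. The toy link table `TOY_WEB` and the helper `fetch_links` are made up so the example runs without touching the network.

```python
from collections import deque

# Hypothetical toy web: each URL maps to the links found on that page.
TOY_WEB = {
    "a.example/": ["a.example/about", "b.example/"],
    "a.example/about": ["a.example/"],
    "b.example/": ["c.example/", "a.example/"],
    "c.example/": [],
}

def fetch_links(url):
    """Stand-in for downloading a page and extracting its links."""
    return TOY_WEB.get(url, [])

def crawl(seed_urls, max_pages=100):
    """Breadth-first crawl: visit known pages, queue any new links found."""
    known = set(seed_urls)    # the "database of known pages"
    queue = deque(seed_urls)  # pages waiting to be checked
    cache = {}                # url -> snapshot (here, just its link list)
    while queue and len(cache) < max_pages:
        url = queue.popleft()
        links = fetch_links(url)
        cache[url] = links    # store or update the cached copy
        for link in links:
            if link not in known:
                known.add(link)
                queue.append(link)
    return cache

pages = crawl(["a.example/"])
print(sorted(pages))
```

Starting from a single seed page, the crawl reaches every page that is linked, directly or indirectly, from it. A real crawler would also revisit pages on a schedule to refresh the cached copies.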

LogicalAtheist · May 23, 2003

Originally posted by damgo
The way they 'crawl' through the net is they have a database of known pages. Every so often they check these and follow any links off the page; update those pages, follow their links, etc.

Yeah. Google is good. Not quite as good as mine, but good.

You know what would make good fiction? The interconnection between computers is completely ruined, and the servers which house the original material are almost completely ruined.

But Google remains, and all their cached pages are only a few weeks old. OK, maybe not. But it's neat to know a lot of the internet is "backed up," heh.

As many of you have noticed, Google lists the number of webpages there are (unofficially). Here is the current number:

3,083,324,652

Discuss.
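
For the storage question raised earlier in the thread, a rough back-of-the-envelope estimate can be built from that page count. The average stored size per page is purely an assumption here (10 KiB of text per page); nobody in the thread gives a real figure.

```python
PAGE_COUNT = 3_083_324_652   # figure quoted in the thread
AVG_PAGE_BYTES = 10 * 1024   # assumption: about 10 KiB stored per page

def cache_size_terabytes(pages, bytes_per_page):
    """Raw size of one copy of every page, in terabytes (10**12 bytes)."""
    return pages * bytes_per_page / 1e12

print(round(cache_size_terabytes(PAGE_COUNT, AVG_PAGE_BYTES), 1))
```

Under that assumption a single copy of the cache comes to a few tens of terabytes, before any replication or index overhead. A larger assumed page size scales the result proportionally.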

FZ+ · May 23, 2003

Originally posted by LogicalAtheist
Any programmers willing to guess how the code works, so that it could be better than all of them combined?

I'm not a very good programmer, but I'm guessing from your phrasing that it is one of those multi-engine searches that cross-references results from several pages?

LogicalAtheist · May 23, 2003

YAY.

Indeed the program works as such...

Once a search parameter is entered, the program searches that parameter on 20+ engines.

It then ranks all the results, and compares.

Each result gets a number equal to its position on a given engine.

Then the results are compared, and identical ones get their numbers added together.

Then the resulting numbers are ranked, and thus the new list is displayed.

How long does this take? It takes only as long as the slowest of the engines, namely around one third of a second.

Thus the engine is more powerful because it uses the power of all engines!
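
The merging step described above can be sketched roughly as follows. The post does not say how positions turn into a combined score, so this sketch assumes a simple points scheme: with `depth` results considered per engine, position 1 earns `depth` points, position 2 earns `depth - 1`, and so on, and a page's points are added across engines. The engine names and result lists are invented for the example.

```python
from collections import defaultdict

def merge_rankings(engine_results, depth=10):
    """Combine ranked lists by summing per-engine points for each URL.

    Assumption: position 1 earns `depth` points, position 2 earns
    depth - 1, and so on; results beyond `depth` are ignored.
    """
    scores = defaultdict(int)
    for results in engine_results.values():
        for position, url in enumerate(results[:depth], start=1):
            scores[url] += depth - position + 1
    # Highest total first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda url: (-scores[url], url))

# Hypothetical result lists from three made-up engines.
results = {
    "engine_a": ["x.example", "y.example", "z.example"],
    "engine_b": ["y.example", "x.example"],
    "engine_c": ["y.example", "w.example", "x.example"],
}
print(merge_rankings(results, depth=3))
```

With this scheme a page that ranks well on several engines beats one that is first on a single engine. Adding raw positions together instead would have the opposite effect unless lower totals are treated as better and missing results are penalized.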

damgo · May 23, 2003

www.metacrawler.com

FZ+ · May 23, 2003

But the problem with such systems is that the "power" of an engine is not that easy to quantify. For most people, the power of an engine relates to how relevant its results are to what they want. In many cases, specialising in one particular engine that is good in a certain field is better than watering down results with irrelevant answers from other engines. I.e., you may be wrong in considering "power" to be cumulative.

plus · May 25, 2003

I use google because it is quick and easy.

30 seconds is far too long to wait.

dav2008 · May 25, 2003

Originally posted by plus
I use google because it is quick and easy.

30 seconds is far too long to wait.

I think he said 0.3 seconds.

LogicalAtheist · May 25, 2003

Originally posted by plus
I use google because it is quick and easy.

30 seconds is far too long to wait.

Plus - I said one third of a second.

In other words, it searches all the search engines in the same time it takes to just search one.

By the time the page has loaded the search has been finished for almost an entire second.
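
The timing claim rests on sending all the queries at once rather than one after another. Here is a small sketch of that, using `asyncio.sleep` as a stand-in for each engine's response time. The engine names and delays are made up, and a real version would also have to pay for parsing and merging the results.

```python
import asyncio
import time

async def query_engine(name, delay):
    """Stand-in for one engine's round trip; `delay` is simulated latency."""
    await asyncio.sleep(delay)
    return name

async def query_all(latencies):
    """Send every query at once and wait for all of them to finish."""
    tasks = [query_engine(name, delay) for name, delay in latencies.items()]
    return await asyncio.gather(*tasks)

def timed_fan_out(latencies):
    """Run all queries concurrently and report the total wall-clock time."""
    start = time.perf_counter()
    names = asyncio.run(query_all(latencies))
    return names, time.perf_counter() - start

names, elapsed = timed_fan_out(
    {"engine_a": 0.05, "engine_b": 0.10, "engine_c": 0.15}
)
print(names, round(elapsed, 2))
```

The elapsed time comes out close to the slowest simulated engine (0.15 s) rather than the sum of all three (0.30 s), which is the point being made about waiting only for the slowest engine.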

plus · May 25, 2003

Originally posted by LogicalAtheist
Plus - I said one third of a second.

In other words, it searches all the search engines in the same time it takes to just search one.

By the time the page has loaded the search has been finished for almost an entire second.

My mistake.

Yes, there is clearly no problem if it takes 1/3 of a second.
