Google 'Crawling' and 'Caching'


dav2008

Gold Member
One of the Google help pages says, "Google takes a snapshot of each page examined as it crawls the web and caches these as a back-up in case the original page is unavailable."


I'm wondering: how does it 'crawl' through the web, and where in the world does Google get all that storage space for all those websites?
 

LogicalAtheist

Google is constantly going to every single webpage and making a copy of it.

Right now it's going through every damn page there is and updating the cached version it has.

Where does it get the space? It's a secret.
 
Google has thousands of Linux servers networked together that scour the net. Right now they're working on some new algorithms for PageRank; not much is known about them. As for space, it's not a big issue since storage is so cheap. I'd imagine it's hundreds of terabytes.
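For anyone curious, the published PageRank idea (as opposed to whatever new algorithms they are tweaking, which nobody outside knows) amounts to repeatedly redistributing each page's score along its outgoing links until the scores settle. Here is a toy Python sketch of that power-iteration version with a made-up three-page web; it is only an illustration, not Google's implementation.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy power-iteration PageRank.

    `links` maps each page to the list of pages it links to; every
    link target must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

if __name__ == "__main__":
    toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    for page, score in sorted(pagerank(toy_web).items(), key=lambda kv: -kv[1]):
        print(page, round(score, 3))
```

Google obviously runs this sort of computation distributed across its server farm and with many refinements, but the core idea is that simple.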
 

LogicalAtheist

Two years ago I lived with a programmer. I learned some basics, then.

We wrote a search engine. Because of its nature, it was the most powerful search engine ever created. This may sound hard to prove, but in fact it uses a simple idea that proves it.

The code for the entire engine would fit on the front and back of one piece of paper in regular font.

Unfortunately, I don't talk to him anymore. He never put the search engine on the web. I think I'm going to contact him and get the code if he still has it.

Does anyone here use Delphi, and would you be able to build the code and make changes?

I felt that, based on the design of the engine, which I conceived, it would be the next popular engine when the time came.

Yes, it's better than Google. In fact, it is literally "better" than all search engines combined.

Any programmers willing to guess how the code works, so that it could be better than all of them combined?
 

damgo

The way they 'crawl' through the net is that they have a database of known pages. Every so often they check these and follow any links off each page, then update those pages, follow their links, and so on.
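To make that concrete, here is a rough Python sketch of the idea (my own illustration, not Google's code): keep a store of known pages, fetch each one, cache the HTML as the "snapshot", and queue any links found on the page for a later visit. It ignores real-world concerns such as robots.txt, politeness delays, and re-crawl scheduling.

```python
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkParser(HTMLParser):
    """Collects the href targets of <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=50):
    """Breadth-first crawl: fetch known pages, keep a cached snapshot of
    each, and queue any newly discovered links for a later visit."""
    cache = {}                       # url -> snapshot of the HTML
    queue = deque(seed_urls)
    while queue and len(cache) < max_pages:
        url = queue.popleft()
        if url in cache:
            continue
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                html = response.read().decode("utf-8", errors="replace")
        except Exception:
            continue                 # unreachable pages are simply skipped
        cache[url] = html            # the "back-up" copy the help page mentions
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if urlparse(absolute).scheme in ("http", "https"):
                queue.append(absolute)
    return cache

if __name__ == "__main__":
    snapshots = crawl(["https://example.com/"])
    print(f"Cached {len(snapshots)} pages")
```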
 

LogicalAtheist

Originally posted by damgo
The way they 'crawl' through the net is that they have a database of known pages. Every so often they check these and follow any links off each page, then update those pages, follow their links, and so on.
Yeah. Google is good. Not quite as good as mine, but good.

You know what would make good fiction? The interconnection between computers is completely ruined, and the servers that house the original material are almost completely ruined as well.

But Google remains, and all their cached pages are only a few weeks old. OK, maybe not. But it's neat to know that a lot of the internet is "backed up", heh.

As many of you have noticed, Google lists the number of webpages it has indexed (unofficially). Here is the current number:

3,083,324,652

Discuss.
 

FZ+

Originally posted by LogicalAtheist
Any programmers willing to guess how the code works, so that it could be better than all of them combined?
I'm not a very good programmer, but I'm guessing from your phrasing that it is one of those multi-engine searches that cross-references results from several engines?
 

LogicalAtheist

YAY.

Indeed, the program works as follows...

Once a search parameter is entered, the program runs that query on 20+ engines.

It then ranks all the results and compares them.

Each result gets a number corresponding to its position on a given engine.

Then the results are compared, and identical ones get their numbers added together.

Then the resulting numbers are ranked, and thus the new list is displayed.

How long does this take? Only as long as the slowest of the engines: around one third of a second.

Thus the engine is more powerful because it uses the power of all engines!!!
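For the curious, here is a rough Python sketch of the scheme as described above. The search_engine_a / search_engine_b functions are hypothetical stand-ins (real engines would each need their own query code), and the fixed penalty for results an engine does not return is my own addition, there so that a result found by many engines isn't punished simply for having more rank numbers summed.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins: each returns an ordered list of result URLs
# for the query. A real metasearcher would replace these with code
# that actually queries each engine.
def search_engine_a(query):
    return ["http://a.example/", "http://b.example/", "http://c.example/"]

def search_engine_b(query):
    return ["http://b.example/", "http://a.example/", "http://d.example/"]

ENGINES = [search_engine_a, search_engine_b]

def metasearch(query, missing_penalty=50):
    """Query every engine in parallel, then merge the result lists by
    summing each URL's rank across engines (lower total = better)."""
    with ThreadPoolExecutor(max_workers=len(ENGINES)) as pool:
        result_lists = list(pool.map(lambda engine: engine(query), ENGINES))

    all_urls = {url for results in result_lists for url in results}
    scores = {url: 0 for url in all_urls}
    for results in result_lists:
        ranks = {url: position for position, url in enumerate(results, start=1)}
        for url in all_urls:
            # Add this engine's rank for the URL; if the engine did not
            # return it at all, add a fixed penalty instead (an assumption
            # on my part, not part of the description above).
            scores[url] += ranks.get(url, missing_penalty)

    # Identical results have had their numbers added together; now rank
    # by the combined score, best (smallest) first.
    return sorted(all_urls, key=scores.get)

if __name__ == "__main__":
    for url in metasearch("physics forums"):
        print(url)
```

Querying the engines in parallel with a thread pool is what would keep the total latency close to that of the slowest engine, as claimed above.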
 

FZ+

But the problem with such systems is that the "power" of an engine is not that easy to quantify. For most people, the power of an engine relates to how relevant its results are to what they want. In many cases, sticking with one particular engine that is good in a certain field is better than watering down results with irrelevant answers from other engines. That is, you may be wrong in considering "power" to be cumulative.
 
plus
I use Google because it is quick and easy.

30 seconds is far too long to wait.
 

dav2008

Gold Member
Originally posted by plus
I use Google because it is quick and easy.

30 seconds is far too long to wait.
I think he said 0.3 seconds.
 

LogicalAtheist

Originally posted by plus
I use Google because it is quick and easy.

30 seconds is far too long to wait.
Plus - I said one third of a second.

In other words, it searches all the search engines in the same time it takes to just search one.

By the time the page has loaded the search has been finished for almost an entire second.
 
plus
Originally posted by LogicalAtheist
Plus - I said one third of a second.

In other words, it searches all the search engines in the same time it takes to just search one.

By the time the page has loaded the search has been finished for almost an entire second.
My mistake.

Yes, there is clearly no problem if it takes 1/3 of a second.
 
