Google 'Crawling' and 'Caching'

  • Thread starter dav2008
  • #1
dav2008
Gold Member
612
1
One of the Google help pages says, "Google takes a snapshot of each page examined as it crawls the web and caches these as a back-up in case the original page is unavailable."


I'm wondering how it 'crawls' through the web, and where in the world Google gets all that storage space for all those websites?
 

Answers and Replies

  • #2
Google is constantly going to every single webpage and making a copy of it.

Right now it's going through every damn page there is, and updating the cached version it has.

Where does it get the space? It's a secret.
 
  • #3
Google has thousands of Linux servers networked together that scour the net. Right now they're working on some new algorithms for PageRank, but not much is known. As for space, it's not a big issue since storage is so cheap. I'd imagine it's hundreds of terabytes.
 
  • #4
Two years ago I lived with a programmer. I learned some basics, then.

We wrote a search engine. Because of its nature, it was the most powerful search engine ever created. This may sound hard to prove, but in fact it uses a simple idea that would prove it.

The code for the entire engine would fit on the front and back of one piece of paper in regular font.

Unfortunately, I don't talk to him anymore. He's never put the search engine on the web. I think I'm going to contact him and get the code if he still has it.

Does anyone here use Delphi, and would you be able to build the code and make changes?

I felt that, based on the design of the engine, which I conceived, it'd be the next popular engine when the time comes.

Yes, it's better than Google. In fact, it is literally "better" than all search engines combined.

Any programmers willing to guess how the code works, so that it could be better than all of them combined?
 
  • #5
The way they 'crawl' through the net is they have a database of known pages. Every so often they check these and follow any links off the page; update those pages, follow their links, etc.
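
Roughly, that loop can be sketched as below. This is only a toy illustration in Python, not Google's actual crawler: the `crawl` function, the `LinkParser` class, and the `FAKE_WEB` dictionary standing in for real HTTP fetches are all made up for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl: start from known pages and follow their links.

    `fetch` is a hypothetical callable returning a page's HTML, or None
    if the page is unavailable. Returns a dict of url -> cached HTML.
    """
    cache = {}
    queue = deque(seeds)
    while queue and len(cache) < max_pages:
        url = queue.popleft()
        if url in cache:
            continue  # already have a snapshot of this page
        html = fetch(url)
        if html is None:
            continue  # page unavailable, nothing to cache
        cache[url] = html  # keep the snapshot
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            queue.append(urljoin(url, href))  # resolve relative links
    return cache


# Assumption: a tiny in-memory "web" replaces real network requests.
FAKE_WEB = {
    "http://a.example/": '<a href="/about">About</a> <a href="http://b.example/">B</a>',
    "http://a.example/about": '<a href="/">Home</a>',
    "http://b.example/": '<a href="http://a.example/missing">gone</a>',
}

cache = crawl(["http://a.example/"], FAKE_WEB.get)
print(sorted(cache))
```

A real crawler would also revisit cached pages on a schedule, respect robots.txt, and throttle its requests; none of that is shown here.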
 
  • #6
Originally posted by damgo
The way they 'crawl' through the net is they have a database of known pages. Every so often they check these and follow any links off the page; update those pages, follow their links, etc.

Yeah. Google is good. Not quite as good as mine, but good.

You know what would make good fiction? The interconnection between computers is completely ruined, and the servers which house the original material are almost completely ruined.

But Google remains, and all their cached pages are like a few weeks old. OK, maybe not. But it's neat to know a lot of the internet is "backed up", heh.

As many of you have noticed, Google lists the number of webpages there are (unofficially). Here is the current number:

3,083,324,652

Discuss.
 
  • #7
FZ+
1,599
3
Originally posted by LogicalAtheist
Any programmers willing to guess how the code works, so that it could be better than all of them combined?
I'm not a very good programmer, but I'm guessing from your phrasing that it is one of those multi-engine searches that cross-references results from several pages?
 
  • #8
YAY.

Indeed the program works as such...

Once a search term is entered, the program runs that search on 20+ engines.

It then collects all the results and compares them.

Each result gets a number equal to its position on a given engine.

Then the results are compared, and identical ones get their numbers added together.

Then the resulting totals are ranked, and thus the new list is displayed.
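
For anyone who wants to see the steps above as code, here is a minimal sketch in Python (not the original Delphi code, which isn't shown in this thread). The function name `merge_rankings` and the sample data are made up. Two assumptions are filled in because the steps don't cover them: a result missing from an engine's list is scored as if it sat one place past the end of that list, and the lowest total ranks first, with ties broken alphabetically.

```python
def merge_rankings(rankings):
    """Merge per-engine result lists by summing each result's positions.

    `rankings` maps an engine name to its ordered list of results.
    Assumption: a result absent from an engine counts as position
    len(list) + 1 for that engine. Lower totals rank higher.
    """
    all_urls = set()
    for results in rankings.values():
        all_urls.update(results)

    totals = {}
    for url in all_urls:
        total = 0
        for results in rankings.values():
            if url in results:
                total += results.index(url) + 1  # 1-based position
            else:
                total += len(results) + 1  # penalty for being absent
        totals[url] = total

    # Sort by total, then alphabetically so ties are deterministic.
    return sorted(totals, key=lambda u: (totals[u], u))


# Hypothetical results from three engines for the same query.
rankings = {
    "engine_a": ["x.example", "y.example", "z.example"],
    "engine_b": ["y.example", "x.example"],
    "engine_c": ["y.example", "w.example", "x.example"],
}

merged = merge_rankings(rankings)
print(merged)  # ['y.example', 'x.example', 'w.example', 'z.example']
```

Without some penalty for missing results, a page listed first on a single engine would total 1 and beat a page listed first on every engine, so the handling of absent results matters more than it first appears.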

How long does this take? It takes only as long as the slowest of the engines, namely around one third of a second.
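
The timing claim holds when the engines are queried at the same time rather than one after another. A small sketch in Python, with made-up engines that just sleep to simulate network delay (the names `fake_engine`, `ENGINES`, and `search_all` are hypothetical, as are the delays):

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_engine(name, delay):
    """Build a stand-in search function that waits `delay` seconds."""
    def search(query):
        time.sleep(delay)  # simulate the engine's response time
        return [f"{name}: result for {query}"]
    return search


# Assumption: three pretend engines with different response times.
ENGINES = {
    "fast": fake_engine("fast", 0.1),
    "medium": fake_engine("medium", 0.2),
    "slow": fake_engine("slow", 0.3),
}


def search_all(query):
    """Send the query to every engine at once and wait for all replies."""
    with ThreadPoolExecutor(max_workers=len(ENGINES)) as pool:
        futures = {
            name: pool.submit(search, query)
            for name, search in ENGINES.items()
        }
        return {name: future.result() for name, future in futures.items()}


start = time.perf_counter()
results = search_all("google caching")
elapsed = time.perf_counter() - start
print(f"{len(results)} engines answered in {elapsed:.2f}s")
```

Run one after another, these three would take about 0.6 seconds in total; run in parallel, the wait is close to the slowest engine's 0.3 seconds, plus whatever time the merging step needs.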

Thus the engine is more powerful because it uses the power of all engines!!!
 
  • #10
FZ+
1,599
3
But the problem with such systems is that the "power" of an engine is not that easy to quantify. For most people, the power of an engine relates to how relevant its results are to what they want. In many cases, sticking to one particular engine that is good in a certain field is better than watering down results with irrelevant answers from other engines. I.e., you may be wrong in considering "power" to be cumulative.
 
  • #11
plus
178
1
I use google because it is quick and easy.

30 seconds is far too long to wait.
 
  • #12
dav2008
Gold Member
612
1
Originally posted by plus
I use google because it is quick and easy.

30 seconds is far too long to wait.
I think he said 0.3 seconds.
 
  • #13
Originally posted by plus
I use google because it is quick and easy.

30 seconds is far too long to wait.

Plus - I said one third of a second.

In other words, it searches all the search engines in the same time it takes to just search one.

By the time the page has loaded the search has been finished for almost an entire second.
 
  • #14
plus
178
1
Originally posted by LogicalAtheist
Plus - I said one third of a second.

In other words, it searches all the search engines in the same time it takes to just search one.

By the time the page has loaded the search has been finished for almost an entire second.

My mistake.

Yes, there is clearly no problem if it takes 1/3 of a second.
 
