
Google 'Crawling' and 'Caching'

  1. May 22, 2003 #1



    One of the Google help pages says "Google takes a snapshot of each page examined as it crawls the web and caches these as a back-up in case the original page is unavailable."

    I'm wondering how it 'crawls' through the web, and where in the world Google gets all that storage space for all those websites?
  3. May 22, 2003 #2
    Google is constantly going to every single webpage and making a copy of it.

    Right now it's going through every damn page there is and updating the cached version it has.

    Where does it get the space? It's a secret.
  4. May 22, 2003 #3
    Google has thousands of Linux servers networked together that scour the net. Right now they're working on some new algorithms for PageRank, but not much is known about them. As for space, it's not a big issue since storage is so cheap. I'd imagine they have hundreds of terabytes.
  5. May 22, 2003 #4
    Two years ago I lived with a programmer. I learned some basics, then.

    We wrote a search engine. Because of its nature, it was the most powerful search engine ever created. This may sound hard to prove, but in fact it uses a simple idea that would prove it.

    The code for the entire engine would fit on the front and back of one piece of paper in regular font.

    Unfortunately, I don't talk to him anymore. He's never put the search engine on the web. I think I'm gonna contact him and get the code if he still has it.

    Does anyone here use Delphi, and would you be able to build the code and make changes?

    I felt that, based on the design of the engine, which I conceived, it'd be the next popular engine when the time comes.

    Yes, it's better than google. In fact, it is literally "better" than all search engines combined.

    Any programmers willing to guess how the code works, so that it could be better than all of them combined?
  6. May 22, 2003 #5
    The way they 'crawl' through the net is that they keep a database of known pages. Every so often they recheck these pages and follow any links on them, then update the linked pages, follow their links, and so on.
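
    The loop described in this post can be sketched roughly as below. This is only a toy illustration, not how Google actually does it: the `crawl` function, the `LinkExtractor` class, the `max_pages` limit and the example URLs are all made up for the sketch, and the "web" is an in-memory dictionary so nothing is fetched over the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl. `fetch(url)` returns HTML text, or None if unavailable."""
    cache = {}                    # url -> latest snapshot of the page
    queue = deque(seed_urls)      # known pages still to visit
    seen = set(seed_urls)
    while queue and len(cache) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:          # page unavailable: nothing to snapshot
            continue
        cache[url] = html         # store (or refresh) the cached copy
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)   # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return cache


# Toy in-memory "web" (made-up URLs) so the sketch runs without a network.
TOY_WEB = {
    "http://a.example/": '<a href="/about">About</a> <a href="http://b.example/">B</a>',
    "http://a.example/about": '<a href="/">Home</a>',
    "http://b.example/": '<a href="http://a.example/">A</a> <a href="/missing">gone</a>',
}

snapshots = crawl(["http://a.example/"], TOY_WEB.get)
print(sorted(snapshots))
```

    Starting from a single known page, the sketch ends up with a snapshot of every reachable page and skips the one link that leads nowhere. A real crawler would also have to revisit pages on a schedule, respect robots.txt and rate limits, and spread the work over many machines.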
  7. May 23, 2003 #6
    Yeah. Google is good. Not quite as good as mine, but good.

    You know what would make good fiction? The interconnection between computers is completely ruined, and the servers which house the original material are almost completely ruined.

    But Google remains, and all their cached pages are only a few weeks old. OK, maybe not. But it's neat to know a lot of the internet is "backed up", heh.

    As many of you noticed, Google lists the number of webpages there are (unofficially). Here is the current number:


  8. May 23, 2003 #7



    I'm not a very good programmer, but I'm guessing from your phrasing that it is one of those multi-engine searches that cross-references results from several pages?
  9. May 23, 2003 #8

    Indeed, the program works like this...

    Once a search term is entered, the program runs that search on 20+ engines.

    It then collects and ranks all the results.

    Each result gets a number for the position it held on a given engine.

    Then the results are compared, and identical ones get their numbers added together.

    Then the resulting totals are ranked, and the new list is displayed.

    How long does this take? It takes only as long as the slowest of the engines, namely around one third of a second.

    Thus the engine is more powerful because it uses the power of all engines!!!
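
    The steps above can be sketched in a few lines. The post doesn't say whether a high or a low total wins, so this sketch makes an assumption: each position is turned into points (top result gets the most), which is essentially a Borda count, and the highest total comes first. The `merge_rankings` function, the engine names and the result lists are all hypothetical, not the original Delphi code.

```python
from collections import defaultdict


def merge_rankings(rankings):
    """Combine per-engine result lists into one list (Borda-style points)."""
    scores = defaultdict(int)
    for results in rankings.values():
        n = len(results)
        for position, url in enumerate(results):
            # Top result earns n points, the next n - 1, and so on.
            scores[url] += n - position
    # Highest total first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Made-up result lists standing in for three engines.
example = {
    "engine_a": ["x.example", "y.example", "z.example"],
    "engine_b": ["y.example", "x.example", "w.example"],
    "engine_c": ["y.example", "z.example", "x.example"],
}
merged = merge_rankings(example)
print(merged)
```

    With points instead of raw positions, a result that shows up high on several engines beats one that a single engine happens to rank first. Adding raw positions and sorting by the smallest total would instead favour results that only one engine returned.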
    Last edited by a moderator: May 25, 2003
  10. May 23, 2003 #9
  11. May 23, 2003 #10



    But the problem with such systems is that the "power" of an engine is not that easy to quantify. For most people, the power of an engine relates to how relevant its results are to what they want. In many cases, sticking to one particular engine that is good in a certain field is better than watering down its results with irrelevant answers from other engines. I.e., you may be wrong in considering "power" to be cumulative.
  12. May 25, 2003 #11
    I use google because it is quick and easy.

    30 seconds is far too long to wait.
  13. May 25, 2003 #12



    I think he said .3 seconds
  14. May 25, 2003 #13
    Plus - I said one third of a second.

    In other words, it searches all the search engines in the same time it takes to just search one.

    By the time the page has loaded the search has been finished for almost an entire second.
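
    The "same time it takes to just search one" claim holds if the engines are queried in parallel rather than one after another. Here is a rough sketch of that idea, with made-up engine names and response times and `time.sleep` standing in for the network round trip; it is an illustration only, not the original program.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical engines and response times in seconds.
ENGINE_DELAYS = {"engine_a": 0.10, "engine_b": 0.20, "engine_c": 0.30}


def query_engine(name, term):
    """Pretend to query one engine; sleep() stands in for network latency."""
    time.sleep(ENGINE_DELAYS[name])
    return [f"{name}-result-for-{term}"]


def search_all(term):
    """Send the query to every engine at once and wait for all the answers."""
    with ThreadPoolExecutor(max_workers=len(ENGINE_DELAYS)) as pool:
        futures = {
            name: pool.submit(query_engine, name, term) for name in ENGINE_DELAYS
        }
        return {name: future.result() for name, future in futures.items()}


start = time.perf_counter()
results = search_all("physics")
elapsed = time.perf_counter() - start
print(f"{len(results)} engines answered in {elapsed:.2f}s")
```

    The total wait comes out close to the slowest engine's delay instead of the sum of all three, which is the point being made in this post.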
  15. May 25, 2003 #14
    My mistake.

    Yes, there is clearly no problem if it takes 1/3 of a second.