Google 'Crawling' and 'Caching'

In summary: Google runs thousands of servers that constantly crawl the web and refresh its cached copies of pages. The crawler works from a database of known pages and follows links to update existing entries and discover new ones. The thread also features a poster claiming to have built a search engine more powerful than Google by querying multiple engines simultaneously, and a debate over whether such a system is effective and what a "powerful" search engine even means.
  • #1
dav2008
Gold Member
One of the Google help pages says: "Google takes a snapshot of each page examined as it crawls the web and caches these as a back-up in case the original page is unavailable."

I'm wondering: how does it 'crawl' through the web, and where in the world does Google get all that storage space for all those websites?
 
  • #2
Google is constantly going to every single webpage and making a copy of it.

Right now it's going through every damn page there is and updating the cached version it has.

Where does it get the space? It's a secret.
 
  • #3
Google has thousands of Linux servers networked together that scour the net. Right now they're working on some new algorithms for PageRank, though not much is known about them. As for space, it's not a big issue since storage is so cheap. I'd imagine it's hundreds of terabytes.
 
  • #4
Two years ago I lived with a programmer, and I learned some programming basics then.

We wrote a search engine. Because of its nature, it was the most powerful search engine ever created. This may sound hard to prove, but in fact it rests on a simple idea that proves it.

The code for the entire engine would fit on the front and back of one piece of paper in regular font.

Unfortunately, I don't talk to him anymore. He's never put the search engine on the web. I think I'm going to contact him and get the code if he still has it.

Does anyone here use Delphi, and would you be able to build the code and make changes?

I felt that, based on the design of the engine, which I conceived, it'd be the next popular engine when the time comes.

Yes, it's better than Google. In fact, it is literally "better" than all search engines combined.

Any programmers willing to guess how the code works, so that it could be better than all of them combined?
 
  • #5
The way they 'crawl' through the net is they have a database of known pages. Every so often they check these and follow any links off the page; update those pages, follow their links, etc.
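Roughly, in Python, that crawl loop might look like the sketch below. The seed URL and page cap are made-up placeholders, and a real crawler like Googlebot adds robots.txt handling, politeness delays, and massive scale; this is just the follow-the-links idea.

Code:
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    # Collects the href of every <a> tag on a page.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, max_pages=50):
    frontier = list(seeds)   # pages waiting to be (re)visited
    seen = set(seeds)        # the "database of known pages"
    while frontier and len(seen) < max_pages:
        url = frontier.pop(0)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except OSError:
            continue         # page unavailable; try the next one
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)        # a newly discovered page
                frontier.append(absolute)
    return seen

# crawl(["https://example.com/"])  # example.com is a placeholder seed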
 
  • #6
Originally posted by damgo
The way they 'crawl' through the net is they have a database of known pages. Every so often they check these and follow any links off the page; update those pages, follow their links, etc.

Yeah. Google is good. Not quite as good as mine, but good.

You know what would make good fiction? The interconnection between computers is completely ruined, and the servers that house the original material are almost completely destroyed.

But Google remains, and all their cached pages are just a few weeks old. OK, maybe not. But it's neat to know a lot of the internet is "backed up", heh.

As many of you have noticed, Google lists the number of webpages it has indexed (unofficially). Here is the current number:

3,083,324,652

Discuss.
 
  • #7
Originally posted by LogicalAtheist
Any programmers willing to guess how the code works, so that it could be better than all of them combined?
I'm not a very good programmer, but I'm guessing from your phrasing that it is one of those multi-engine searches that cross-references results from several engines?
 
  • #8
YAY.

Indeed, the program works as follows...

Once a search parameter is entered, the program searches that parameter on 20+ engines.

It then ranks all the results, and compares.

Each result gets a number corresponding to its rank position on a given engine.

Then the results are compared, and identical ones get their numbers added together.

Then the resulting numbers are ranked, and thus the new list is displayed.

How long does this take? It takes only as long as the slowest of the engines; namely, around one third of a second.

Thus the engine is more powerful because it uses the power of all engines!
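Taking that description literally, the combining step might look something like this sketch. This is not the actual code: the engine result lists are hard-coded placeholders, and the penalty rank for a result an engine never returned is an assumption, since the post doesn't say how absences are scored.

Code:
from collections import defaultdict

def combine(rankings, miss_penalty=50):
    # rankings: one ordered result list per engine; rank 1 is best.
    scores = defaultdict(int)
    urls = {url for ranking in rankings for url in ranking}
    for ranking in rankings:
        position = {url: i + 1 for i, url in enumerate(ranking)}
        for url in urls:
            # Identical results get their rank numbers added together;
            # a result missing from an engine gets a penalty rank
            # (an assumption) so absences don't look like top placements.
            scores[url] += position.get(url, miss_penalty)
    # Rank the summed numbers: lowest combined score comes first.
    return sorted(urls, key=lambda url: (scores[url], url))

engine_a = ["a.com", "b.com", "c.com"]
engine_b = ["b.com", "a.com", "d.com"]
print(combine([engine_a, engine_b]))
# ['a.com', 'b.com', 'c.com', 'd.com'] -- a.com and b.com tie at 3.

Summing rank positions like this is essentially a Borda-count style of rank aggregation.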
 
  • #10
But the problem with such systems is that the "power" of an engine is not that easy to quantify. For most people, the power of an engine relates to how relevant its results are to what they want. In many cases, a single engine that specialises in a certain field is better than watering down its results with irrelevant answers from other engines. That is, you may be wrong in considering "power" to be cumulative.
 
  • #11
I use google because it is quick and easy.

30 seconds is far too long to wait.
 
  • #12
Originally posted by plus
I use google because it is quick and easy.

30 seconds is far too long to wait.
I think he said 0.3 seconds.
 
  • #13
Originally posted by plus
I use google because it is quick and easy.

30 seconds is far too long to wait.

Plus - I said one third of a second.

In other words, it searches all the search engines in the same time it takes to just search one.

By the time the page has loaded the search has been finished for almost an entire second.
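That timing only holds if the engines are queried in parallel, so the total wait is the slowest single engine rather than the sum of all of them. Here is a sketch of that fan-out; the engine names and delays are stand-ins for real network calls:

Code:
import asyncio
import time

async def query_engine(name, delay):
    await asyncio.sleep(delay)  # stand-in for a network round trip
    return f"{name} results"

async def main():
    start = time.perf_counter()
    # All three queries run concurrently; gather waits for the slowest.
    results = await asyncio.gather(
        query_engine("EngineA", 0.10),
        query_engine("EngineB", 0.25),
        query_engine("EngineC", 0.33),
    )
    elapsed = time.perf_counter() - start
    print(results, f"elapsed: {elapsed:.2f}s")  # ~0.33s, not 0.68s

asyncio.run(main())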
 
  • #14
Originally posted by LogicalAtheist
Plus - I said one third of a second.

In other words, it searches all the search engines in the same time it takes to just search one.

By the time the page has loaded the search has been finished for almost an entire second.

My mistake.

Yes, there is clearly no problem if it takes 1/3 of a second.
 

What is Google 'Crawling'?

Google 'Crawling' is the process by which Google's crawler, a program called Googlebot, scans and indexes webpages on the internet. This allows Google to gather information and determine the relevance and quality of a webpage for search engine results.

What is Google 'Caching'?

Google 'Caching' is the process of storing a copy of a webpage on Google's servers. This allows for faster loading times for users and also ensures that the webpage is still accessible even if the original webpage is down.

How do Google 'Crawling' and 'Caching' work together?

Google 'Crawling' and 'Caching' work together to ensure that Google's search results are accurate and up-to-date. Googlebot crawls and indexes webpages, while also checking for any changes or updates to the webpage. If there are any changes, Google will recrawl the webpage and update its cached version.
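One standard way a crawler can check for changes without re-downloading everything is an HTTP conditional request: it sends the validator saved from the last crawl, and the server answers 304 Not Modified if the page is unchanged. Whether Googlebot works exactly this way is an assumption here, but the mechanism itself is plain HTTP:

Code:
from urllib.error import HTTPError
from urllib.request import Request, urlopen

def recrawl(url, last_modified):
    # Ask the server for the page only if it changed since the last crawl.
    request = Request(url, headers={"If-Modified-Since": last_modified})
    try:
        return urlopen(request, timeout=5).read()  # changed: update cache
    except HTTPError as err:
        if err.code == 304:
            return None   # unchanged: the cached copy is still current
        raise

# Hypothetical call; the date is the validator from a previous fetch.
# recrawl("https://example.com/", "Mon, 01 Sep 2003 00:00:00 GMT")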

Why are Google 'Crawling' and 'Caching' important for SEO?

Google 'Crawling' and 'Caching' are important for SEO because they help Google determine the relevance and quality of a webpage. This is a crucial factor in determining where a webpage will rank in search engine results. Additionally, having a cached version of a webpage can help improve loading times and user experience, which can also positively impact SEO.

Can you control how often Google crawls and caches your webpage?

While you cannot directly control how often Google crawls and caches your webpage, you can indirectly influence it through various SEO strategies. This includes regularly updating and optimizing your webpage, building backlinks, and having a well-designed and user-friendly website. These factors can signal to Google that your webpage is relevant and high-quality, which may lead to more frequent crawls and caching.
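One of the few direct signals a site owner can send is a robots.txt file, which tells crawlers such as Googlebot what they may fetch and where to find a sitemap of new and updated pages. A minimal sketch follows; the domain and paths are placeholders, and crawl frequency itself still remains up to Google:

Code:
# robots.txt at the site root (https://example.com/robots.txt)
User-agent: Googlebot
Disallow: /private/    # keep the crawler out of this directory
Allow: /

# Point crawlers at a sitemap listing pages and their last-modified dates.
Sitemap: https://example.com/sitemap.xml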
