Comp Sci Why Are Hotspots Rare in GFS with Sequential Reads of Large Multi-Chunk Files?

  • Thread starter: shivajikobardan
  • Tags: File, Google, System
AI Thread Summary
Hotspots in the Google File System (GFS) are not a significant issue primarily because applications typically read large multi-chunk files sequentially, reducing the likelihood of multiple clients accessing the same chunk simultaneously. When small files, consisting of fewer chunks, are accessed by many clients, hotspots can occur as contention arises for those limited resources. The discussion highlights that while hotspots can happen during simultaneous read/write operations, they are less frequent with larger files due to the sequential access pattern. The analogy of a barrel filled with tennis balls illustrates how contention decreases as the number of chunks increases. Overall, the design of GFS and its handling of large files effectively mitigates hotspot issues in practice.
shivajikobardan
Homework Statement
In Google File System, hotspots haven't been a major issue because our applications mostly read large multi-chunk files sequentially. What does this mean?
Relevant Equations
none
In Google File System, hotspots haven't been a major issue because our applications mostly read large multi-chunk files sequentially. What does this mean?

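Hotspot: in general, a region of a computer program where a high proportion of executed instructions occur. In the GFS passage below, the term refers instead to a chunkserver that receives a disproportionate share of client requests.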

Lazy space allocation: https://stackoverflow.com/questions/18109582/what-is-lazy-space-allocation-in-google-file-system

With lazy space allocation, the physical allocation of space is delayed as long as possible, until data amounting to the chunk size (in GFS's case, 64 MB according to the 2003 paper) has accumulated.
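To make the idea concrete, here is a minimal sketch of lazy allocation. The class `LazyChunk` and its methods are hypothetical names invented for illustration, not GFS code; the only point is that backing storage grows with the data actually written instead of reserving the full 64 MB up front.

```python
class LazyChunk:
    """Toy chunk whose backing storage grows only as data is appended."""

    def __init__(self, capacity=64 * 1024 * 1024):
        self.capacity = capacity      # logical chunk size (64 MB in the 2003 paper)
        self._data = bytearray()      # physical storage: starts empty, grows on demand

    def append(self, payload: bytes) -> bool:
        """Append payload if it fits in the chunk; return False otherwise."""
        if len(self._data) + len(payload) > self.capacity:
            return False              # caller would have to continue in a new chunk
        self._data.extend(payload)
        return True

    @property
    def allocated(self) -> int:
        """Bytes physically held right now (not the 64 MB logical size)."""
        return len(self._data)
```

A chunk holding a 3-byte file reports `allocated == 3` here, which is why a small file does not waste most of a 64 MB chunk.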
Large chunk size in GFS:
=> A large chunk size, even with lazy space allocation, has its disadvantages.
=> A small file consists of a small number of chunks, perhaps just one.
=> The chunkservers storing those chunks may become hot spots if many clients are accessing the same file.
=> In practice, hot spots haven't been a major issue because our applications mostly read large multi-chunk files sequentially.
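The difference between a "small" and a "large multi-chunk" file in the points above comes down to simple arithmetic on the fixed 64 MB chunk size. The sketch below shows it; the function names are made up for illustration and are not part of any GFS API.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size given in the 2003 paper

def chunk_count(file_size: int) -> int:
    """Number of chunks needed to hold file_size bytes (ceiling division)."""
    return -(-file_size // CHUNK_SIZE)

def chunk_index(offset: int) -> int:
    """Index of the chunk that contains the given byte offset."""
    return offset // CHUNK_SIZE

small = chunk_count(1024)            # a 1 KB file fits in a single chunk
large = chunk_count(10 * 1024 ** 3)  # a 10 GB file spans many chunks
```

Every reader of the 1 KB file has to go to the same single chunk, while a reader working through the 10 GB file moves from one chunk index to the next as its offset advances.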
I don't understand why hotspots are not an issue when we read large multi-chunk files sequentially. The paper says hotspots are an issue if many clients are accessing the same small file (a file of just one chunk).

Here is the scenario where a small file (a small number of chunks) is being accessed by multiple clients.



It makes sense that the chunkservers become hotspots in this case, since they are being accessed by multiple clients at once.
But it doesn't make sense to me when the research paper says "In practice hotspots haven't been a major issue because our applications mostly read large multi chunk files sequentially." What's the difference? If I imagine the same scenario as above, except that the file is made up of multiple chunks, what changes?
 
Collision issues can occur when multiple clients try to read / write or append to a common file. When writing only one client is given permission to write and all others must wait until the operation is complete before they can access the file.
 
jedishrfu said:
Collision issues can occur when multiple clients try to read / write or append to a common file.
Alright I get this.
jedishrfu said:
When writing only one client is given permission to write and all others must wait until the operation is complete before they can access the file.
So what? I don't get this.
My question is: in Google File System, hotspots haven't been a major issue because our applications mostly read large multi-chunk files sequentially. What does this mean?
 
shivajikobardan said:
In Google File System, hotspots haven't been a major issue because our applications mostly read large multi-chunk files sequentially. What does this mean?
Because
shivajikobardan said:
our applications mostly read large multi chunk files sequentially
then the situation where multiple clients try to read or write the same chunk at the same time does not occur often so it has not been a major issue.
 
pbuk said:
Because

then the situation where multiple clients try to read or write the same chunk at the same time does not occur often so it has not been a major issue.
Can you tell me why that is? I have one example, but I'd prefer to hear your idea first.
 
Hmm, so we are giving you helpful suggestions here and you have an example, but you don't want to share it until you hear someone else’s example first.

That's not being very open. I would have provided my example, which would get me even more comments, but now I guess I'll just wait and see what happens.

If your example is proprietary to your work then I understand, but I must also say you should not be discussing work-related stuff on the internet.
 
LOL, what are you saying? Why wouldn't I share it? Here it is:
Imagine you have a large barrel (file). In it, there is one tennis ball (chunk). Reach in blindfolded and grab the tennis ball (read the file). OK. Now put the ball back and get nine friends to join you. Then have everyone grab the ball. There WILL be contention (hotspot). Now put 100 tennis balls into the barrel, and you and your friends each try to grab a ball. Most of the time, everyone will get a ball. Occasionally there will be contention (hotspot), but it will be far less frequent.
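The barrel analogy can be checked numerically. The sketch below assumes each client grabs a chunk uniformly at random, which models the analogy only and is not how GFS clients actually pick chunks; both function names are made up for illustration.

```python
import random
from collections import Counter

def prob_no_collision(clients: int, chunks: int) -> float:
    """Exact probability that all clients pick distinct chunks."""
    p = 1.0
    for i in range(clients):
        # The i-th client must avoid the i chunks already taken.
        p *= max(0.0, 1 - i / chunks)
    return p

def average_peak_load(clients: int, chunks: int, trials: int = 2000, seed: int = 0) -> float:
    """Monte Carlo estimate of the busiest chunk's load, averaged over trials."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        picks = Counter(rng.randrange(chunks) for _ in range(clients))
        total += max(picks.values())
    return total / trials
```

With one ball and ten people, `prob_no_collision(10, 1)` is 0 and the single chunk always carries all ten requests. With 100 balls, `prob_no_collision(10, 100)` is about 0.63, and the busiest chunk rarely serves more than two clients at once.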
 
It’s an interesting analogy, though it’s unlikely that Google chunks data in tennis balls. In filesystems or databases, contention occurs when clients try to update a specific resource. Locks are used to ensure that only one client may write to that resource at a time.

It may be that Google logs some information as each client tries to read a given chunk which causes other clients to wait on that chunk. It may be that the web service that handles the reads has serialized the client requests which appears to the client as a wait. I’ve seen that in some web services but wouldn’t expect it in a Google service.

I’ve found this writeup on how it works so maybe you can find your answer there:

https://computer.howstuffworks.com/internet/basics/google-file-system.htm

and here’s a stackoverflow discussion on GFS hotspots

https://stackoverflow.com/questions...es-create-hot-spots-in-the-google-file-system
 
Hmm, I didn't come up with it; someone from another forum did. It clicked for me immediately.
 