What is the optimal ID key size for minimizing data usage and avoiding overflow?

volican · Jul 31, 2017

Hi, say I have billions of things that I would like to individually name (give a unique identifying number), what is the best way to go about that? I would like to minimise the amount of data that is needed to uniquely identify each object, as it then takes less memory to store and transmit. So far what I have been thinking along the lines of concatenating the date and time of creation. This is fine but if things are made concurrently this would not work. Any ideas?

mfb · Jul 31, 2017

Just use numbers from 1 to N?

More context would help.

Filip Larsen · Jul 31, 2017

If you have a central store that knows the ID already issued (or how many) then, as mfb says, it is hard to do it more compactly than simply increase a counter. With a 4 byte (32 bit) count you can identify up to 4 billion unique IDs. Using 8 bytes (64 bit) this increases to 18 * 10¹⁸.

However, if by concurrent mean you will have distributed registration of ID's (on a computer), then other schemes like GUIDs that uses 16 bytes may be more appropriate. While these are a practical solution for use by computer systems you should be able to extract the "mathematical" methodology behind this if that is what you want.

Since you are talking about storing and transmitting IDs in a forum for mathematical set theory it is really not clear what you are after. More context, as mfb said, would definitely increase the chances for others to give you relevant help.

jedishrfu · Aug 1, 2017

Some of this discussion of IDs came up in the early years of computing where we started with Baudot code (5-bits) then ASCII (7-bit) then EBCDIC (8-bit) and extended ASCII (8-bit) then WideChar and Unicode (16-bit) ... The complaint often was too much storage is needed when we want to save documents written in language X. At the time, the byte was popular as a unit of storage and so 8-bit extended ASCII was a natural encoding (actually there were many coding standards that dictated the sorting of characters in the locale) to cover all the European alphabets.

The notion of indexing comes into play when you assign a number to a thing like a letter in an alphabet. Chinese, Japanese and Korean challenged the dominance of ASCII since they needed 2 bytes to represent a character. This means that a text document must use 2-byte characters to represent Chinese and English (0-byte + 8-bit ASCII) in the same document in memory and use a variable length coding like UTF-8 (1 byte for ASCII and 2, 3 ... bytes for Chinese, nulls and certain control characters not allowed) when saving to a file. Unicode further challenged the 2-byte notion when all the alphabets of the world are added into the mix.

The point here is that the byte size of your ID will determine how long it will be useful before it "overflows" ie runs out of number space. We have seen this happen with seconds, milliseconds and microsecond timestamps of 32-bit, 64-bit and 128-bit size causing the Y2K and related overflow events. You'll need to consider the maximal number of concurrent events that can occur to then determine the best ID key size that won't overflow too soon.

seconds + upto 2, 4, 8 ... 256 events
or
millisecs + upto 2, 4, 8 ... 256 events
or
microseconds + upto 2, 4, 8 ... 256 events

What is the optimal ID key size for minimizing data usage and avoiding overflow?

1. What is a unique identifier problem?

2. Why is a unique identifier important?

3. What are some common examples of unique identifier problems?

4. How can a unique identifier problem be solved?

5. What are the potential consequences of not addressing a unique identifier problem?

Similar threads

Hot Threads

Recent Insights