Parts of Hashes and Expected Collisions#
Problem#
I am building a documentation system that works using Markdown for the documents and Pandoc to transform the documents to HTML, PDF, etc… It works well and is very easy to use. However there is a problem I have encountered. By default when Pandoc transforms a Markdown file to HTML, it automatically inserts section anchors. In Markdown, an ATX section header could look something like this:
# Parts of Hashes and Expected Collisions
Pandoc will convert that to the equivalent HTML, which in most cases is perfectly reasonable:
<h1 id="parts-of-hashes-and-expected-collisions">Parts of Hashes and Expected Collisions</h1>
My problem, I am building a system that has to deal with language localizations. This means that if a document is translated to Spanish, the Pandoc auto-generated anchors would break external and internal links. I need to consider a different solution.
Another problem is that collisions can occur when generating single file formats like PDF or Word documents. Pandoc requires a single file. This means that all of the separate Markdown files have to be merged into one document. Now, all links that point to files are simply replaced with pointers to section anchors - this increases the chances of a collision quite a bit.
The problem is, how do we create a section anchor that is relatively easy for a person to use and is relatively unique so that it avoids collisions. Thinking about the problem I realized that I don’t care what the name of the anchor is, it could be random for all I care. I just need to make sure that it is short enough so that it is easy for a person to use.
Pandoc has an attribute syntax which can handle naming the anchor. But how do we name the anchor so that it is language agnostic, stable and easy to use?
Possible Solutions#
I could use a UUID [1] and be done with it. That is after all, what they were designed for. A universal unique identifier (UUID)
is a 128-bit number that is guaranteed to be unique. Here is a sample UUID:
1f69d70a-8efb-11eb-8d17-71cde3084076
The UUID is too large (16-bytes - 32 hex digits) for me to want to use it as a section anchor. I want something in the 8 to 10 character range, not much more. I was thinking maybe it could be shortened by using the first 10 characters. This seems to be a very bad idea [2]. It appears that truncating the number produces unstable results and may cause collisions very early depending on the algorithm used.
Note
You could also change the base of the number you use to express the UUID. Expressing the UUID as a base 64 number would shortening the representation somewhat. It would take approximately 24 characters to represent the UUID. Better, but still more than twice what I want. Base 64 conversions can be done in an URL safe way so it is something to consider.
Maybe I could use a combination of the Markdown file name and some random characters. Unfortunately, I can’t guarantee that the file name won’t be localized. Most likely the names of the files will be localized so that any URL string will appear correctly localized. It would seem odd to me to have an English URL that points to Spanish resources. Using the file name directly will not work.
What if I hashed [3] the file name and used part of that hash as the anchor string. Git [4] allows the user to type in the partial hash number (first n
characters) to match a commit. It seems plausible. But what does it mean if we use the first n
characters of the hash?
Hashes and Collisions#
What does it mean if we use less bits in a hash? Why can we do this for a hash and not a UUID? A UUID is designed to be globally unique based on the algorithm it uses. Parts of the number are generated different ways depending on the algorithm and can mean different things [1] [5] [6]. So truncating out bits will most likely render it completely useless.
Essentially a hash is a mathematical one-way function. You pass it a big number (text strings or bytes) and it maps it to a shorter number that represents the original value (a data signature or fingerprint). What is a collision? A collision occurs if you take two or more different pieces of data and run them through the hashing function and they resolve to the same hash signature. In cryptography, this is bad. For what I want to do, it depends.
Why a Hash?#
The nice thing about a hash is that it will generate a fixed length number that reflects the input data. For example, SHA-256 will generate a number that is 256-bits long (32-bytes or 64 hex digits). This is twice as along as the UUID! Why would this choice even be on the table when I am looking for 8 to 10 characters? It is possible to use first n
bits of the hash.
Why can we truncate the hash and use less bits when we couldn’t do it with the UUID? The answer is not simple, and I won’t get into too many details. The short answer is that the algorithms are different. The UUID algorithm is designed to produce guaranteed unique numbers (within the 128-bit range and within the context you are using).
The hashing algorithm is different. From my understanding, because it is reproducible you can use less bits of the signature. What I mean is that, if you feed in the same data, you get the same results. It is consistent. If you only use the first 60-bits of the 256-bit number of the SHA-256 hash. If you feed the same data in, you expect the first 60-bits to match. To me it is that simple.
What does Collision have to do with Short Hashes?#
If your hashing algorithm has a high collision rate, it isn’t very good. The Wikipedia page covers the collisions. For a SHA-256, the search space is \(2^{256}\) which is an enormous search space. The probability of a collision is very small. If we shorten the hash, we reduce the search space.
To understand what is happening we will look at the Birthday Paradox. Simply put, the birthday paradox explains that the probability of two or more people (in a small sample size, say about 23 or more) sharing the same birthday is about 50 percent. This is unintuitive to most people (I would argue the vast majority of people). Given a year is about 365 days, it is difficult to imagine that out of 23 people, the probability that at least two share the same birthday (not year, just the day in the year - July 31 for example) is about 50%.
There is an approximation to the Birthday paradox [7] that well use to understand what is happening.
Where:
\(n\) - This is the number of values we would see before we would expect a collision
\(p\) - This is the probability that after n values a collision will occur
\(H\) - This is the size of the hash search space. Generally it is \( 2^b \) where b is the number of bits.
We’ll work with a SHA-256 hash and use hex digits to represent the number. An SHA-256 hash consists of a number that is 256-bits or 32-bytes. One hex digit can represent 16 numbers from one of 0 through 9 and A through F. A hex value can be represented by 4-bits. An SHA-256 can be represented by 64 hex digits.
If we rearrange (1) to solve for \(n\), we can determine how many hashes we can expect before a collision.
How many hashes would we expect to generate before we have a 50% chance of a collision (i.e. the same hash is generated)?
For a 50% chance:
What happens if we truncate the SHA-256 to n
characters, that is n
hexadecimal numbers?
For 4 characters:
The process is the same for different numbers of characters. The following table summarizes the first 15 characters of the SHA-256 hash:
Characters |
bits |
n |
---|---|---|
1 |
4 |
4 |
2 |
8 |
16 |
3 |
12 |
64 |
4 |
16 |
256 |
5 |
20 |
1,024 |
6 |
24 |
4,096 |
7 |
28 |
16,384 |
8 |
32 |
65,536 |
9 |
36 |
262,144 |
10 |
40 |
1,048,576 |
11 |
44 |
4,194,304 |
12 |
48 |
16,777,216 |
13 |
52 |
67,108,864 |
14 |
56 |
268,435,456 |
15 |
60 |
10,73,741,824 |
Looking at the table, if you choose to use the first 10 characters of the SHA-256 hash, you can expect a \(\frac{50}{50}\) chance of a collision after the first million hashes.
Additional References and Resources#
Here are some references I used while writing this document: