Introduction
I have long waffled on whether a GUID or integer make for the best primary key. I wrote this with the goal of refining or challenging my current opinion.
The identity, or primary key, uniquely identifies an entity or record. To be clear, ‘identity’ is a Domain Driven Design concept that describes an immutable identifying attribute of an entity (not to be confused with a SQL Server identity column, which is an automatically incrementing integer). A ‘primary key’ is a database concept that serves the same purpose for a record and enforces referential integrity. In this post, I will use the term ‘primary key’, since we are focusing on the database. The concept of an identity attribute is more than just a database concern, it is a domain concern, but it is important to consider where these concerns intersect.
While my bias as a software developer guides me to focus on the domain model, the realities of database persistence cannot be overlooked. For this post, we are going to assume that we are using SQL Server for persistence. Therefore, its performance must be factored into the identifier decision. Consider that other databases are going to have similar issues.
GUID
From a domain perspective, I have an affinity for GUID (or UUID) as my identifier. It is almost assuredly unique in the world and therefore will not collide with other GUIDs should my system be distributed, replicated, merged, or otherwise interact with other systems or contexts. It also has the benefit that it can be created in the application, rather than being a side-effect of adding the entity to the database.
Column Size
The most basic and easy to understand issue with GUID is the size of the column, which is 16 bytes. Consider how keys will proliferate in your database. Not only will it consume space in the table where it is a primary key, but in every table where it is a foreign key. If every key in your database is a GUID, you can imagine the memory implications versus a 4-byte int or a 8-byte bigint. Consider further that memory usage will not only occur on disk, but in the RAM of the server.
Clustered Index Fragmentation
A common argument against the use of GUID for the primary key is fragmentation.[2] Clustered indexes sort and store the data rows based on their key values. There can be only one clustered index per table, because the data rows themselves can be stored in only one order. When setting a primary key for a table, SQL Server creates a clustered index on that column by default. That may be what you want initially. When you have an auto-incrementing integer as a clustering key, that key is in the same order as the clustered index. When you add a key that is not ordered, such as a GUID, fragmentation occurs. There are some options to mitigate fragmentation when using a GUID. COMB (for COMBined, abbreviated)[3] or sequential GUIDs allow for ordered identifiers as do integers, thus mitigating clustered index fragmentation. A COMB GUID has the benefit of being generated in code like a regular GUID using libraries such as RT.Comb. NEWSEQUENTIALID() is a Transact-SQL function built into SQL Server that creates a GUID that is greater than any GUID previously generated by this function on a specified computer since it was started. Both of these options allow for the creation of ordered GUIDs that mitigate fragmentation. There is also the option of declaring a separate primary key and clustering key. This way you could have a GUID as your primary key, and an integer as your clustering key. Other causes of fragmentation exist that have nothing to do with the key, but that is a topic for another time.
Index Structure Size
Another strong argument against the use of GUID as the clustering key is the table size resulting from the aforementioned clustered index. The pointer from an index row in a nonclustered index to a data row is called a row locator. For a clustered table, the row locator is the clustered index key. Since a GUID is 16 bytes versus 4 bytes for an int, the index structure is going to be larger than using an integer. Since every nonclustered index will contain the clustering key, a larger clustering key will widen the nonclustered index. The result is that a clustering key of a GUID will make the index structure multiple times larger than will an integer.
Hard to Remember
Finally, GUIDs are just plain ugly. Integers are much easier to remember and type. If your table has 10,000 rows, it is far easier to remember a key of ‘8,711’ than it is to remember ‘55e7ad83-c81b-4148-a658-07766c221558’. You may be able to hold the former in your head, but the latter will require the use of the clipboard.
API
Occasionally, you will see an API that is hackable, either by design or accident. This probably is not what you want. If someone has order ‘100’, you don’t want them to have the ability to enter ‘99’ or ‘101’ to see the orders before and after them. GUIDs make this less likely. Simply encoding the GUID to a BASE64 string gives us something that is highly unlikely to be hacked in this manner.
public string GuidToBase64(Guid guid)
{
string base64 = Convert.ToBase64String(guid.ToByteArray());
base64 = base64.Replace("/", "_").Replace("+", "-");
return base64.Substring(0, 22);
}
public Guid Base64ToGuid(string base64)
{
Guid guid = default(Guid);
base64 = base64.Replace("-", "/").Replace("_", "+") + "==";
try {
guid = new Guid(Convert.FromBase64String(base64));
}
catch (Exception ex) {
throw new Exception("Failed to covert BASE64 string to GUID", ex);
}
return guid;
}
Using the code above, a GUID of ‘55e7ad83-c81b-4148-a658-07766c221558’ would return a BASE64 string of ‘g63nVRvISEGmWAd2bCIVWA’. This string could then be used on your API. (Keep in mind that this ID is case sensitive.)
If one uses integers as their primary key, there are options to obfuscate that key on their API. One such library is Hashids, a small open-source library that generates short, unique, non-sequential IDs from numbers, which is available for a variety of programming languages.
While you may see tutorials or production websites that use hackable IDs, you will want to use a mechanism like those suggested above to decouple the keys of your database from your public API.
Conclusion
I began writing this with the premise of showing how a COMB or similar sequential GUID is a good default choice for an identity attribute or primary key. There remain many strong arguments in favor of the GUID. However, the table and index structure size of GUIDs make integers a wise choice. My recommendations are:
- If you choose GUID as a primary and clustering key, use some form of sequential GUID,
- Choose GUID if you need to generate the key in code,
- Choose GUID if you require its distributed benefits,
- If using GUID, consider an int or bigint clustering key,
- If performance is your top concern, use int or bigint as your primary and clustering key,
- Only expose hashed keys to the outside world.
This is an interesting topic to dive into deeper. I recommend reading the posts below, and if you use SQL Server, consume just about everything that Kimberly Tripp writes or says.
References
- Robert C. Martin. No DB, The Clean Code Blog, May 15, 2012.
- Kimberly Tripp. GUIDs as PRIMARY KEYs and/or the clustering key. SQLskills, March 5, 2009.
- Jimmy Nilsson. The Cost of GUIDs as Primary Keys, InformIT, March 8, 2002.
- Jeff Atwood. Primary Keys: IDs versus GUIDs, Coding Horror, March 19, 2007.
- Tom Harrison. UUID or GUID as Primary Keys? Be Careful!, Tom Harrison’s Blog, February 12, 2017.
