Using GUIDs for Identifiers

When a developer is faced with creating a data model, the type of primary key is selected “out of habit” as it was 10 and 20 years ago, well, or generally out of the box, in the same Yii, this is an auto-incrementing integer field. In reality, this is not always a competent decision. I will describe what are the advantages of a GUID.

No replication collisions

When replicating between multiple instances of a database where new entries are added to more than one replica, the GUID ensures that no primary key matches.

For example, in MySQL, for replication to work, you have to do N + 1, but N + 2, N + 3, etc., for a standard auto-increment PK, where 2 and 3 are the number of databases. And since the number of bases can grow, you have to do N + 10, N + 20, which very quickly selects the entire available range of INT4, and you have to switch to INT8.

In MSSQL, in general, when using standard replication methods, the use of a column of the GUID type is actually a mandatory requirement.

The INT type is universal, and since everyone out of habit takes MySQL as the main base, it does not have a GUID type. And here it is logical that INT is better than CHAR (36) ... for MySQL, in PostgreSQL, MSSQL, Oracle there is normal and correct GUID support.

Uniqueness in any system

The GUID ensures the uniqueness of identifiers not only within the same table, but in general in any, even in different tables of different systems. This uniqueness becomes relevant when data can be moved between systems (databases) or when there was one large table, and then it is divided into two / three.

Another plus is that you do not need to store the entity type for tables such as comments, reviews, images. Usually the same as: PK (object_id, type_id), where type_id is 1 (recall), 2 (comment), 3 (image). In the case of the GUID, there can be no comment and recall with the same PK. Therefore PK (object_id) is enough.

Another example is the merging of databases among themselves. When some small project is absorbed by a larger one, dances begin with how to merge the same numerical identifiers. A very big problem is in joining the table-links, and you have to either remember the identifiers “before-after” in a separate table, or do “id = id + 100000”. In the case of the GUID, you don’t have to do anything at all - just insert the entries.

Client Generation (in PHP)

Using a GUID allows you to generate an identifier before saving the record to the database. After all, a GUID is a globally unique identifier, and the generation of the second one in any system is zero.

This item becomes quite relevant when you need to perform any action with the object BEFORE saving it to the database, for example, send a request to an external service for verification, and then save it to the database.

So, let’s say, a significant plus is the fact that you can immediately generate 10-20-100 identifiers with a multi-insert, and then insert records through batchInsert. You don’t have to climb into the database or worry about the fact that the auto-increment counter has already moved at the time of insertion.

GUID makes guessing and identifier impossible

Because there is no sequence of identifiers. For example, there is a tariff “10”, that means there are tariffs “9”, “8”, “7” - and you can try to change the form parameters so that instead of the tariff “10” change / delete the tariff “9”.

This is not only for protection against hacking, but also against banal parsing, when the parser simply puts id + 1 in the address bar.

Auto increment can select the whole range

An auto-increment field stores the last (maximum) identifier. For example, now there is "100". And if I insert an entry with the identifier “100000000”, it turns out that I have 101 entries in the table, but the following identifiers will start not with 102, but with 100000001. This creates an INT4 overflow problem (which is used by default for PK). Simply put, the site will stop working due to out of range, because it will be necessary to change the type of field. Changing the field type to INT8 does not solve the problem, because then it will be necessary to do INT16, etc.

Once, due to a programmer’s error, the countdown began not with 10000, but with 1000010001. It is logical that they noticed it when day X arrived - the range of INT4 ended. The change to INT8 was unsuccessful, because JOIN INT4 on INT8 was very slow due to type conversion. I had to rewrite all the identifiers and relationships, returning the missing gap.

Another resulting issue with INT auto-increment is the fast overflow of the allocated set due to transactions. The fact is that the transaction increments the counter, even if it did not complete. That is, if the counter was “1”, I did and rolled back 10 save entries to the database, the counter will already be “11”. A rollback may be, for example, due to an error in the associated data, when, for example, I saved the number, and then I want to save the tariffs, and there is an error in some tariff.

Because the GUID is not stored as CHAR (36) - this is essentially the same INT, just with higher bit depth (INT16) - INT4 stores 4 bytes, and INT16 stores 16 bytes.

The disadvantages of using GUIDs include only the random nature of their generation, which leads to fragmentation of the indexes, because it turns out that each subsequent generated GUID is exactly different from the previous one and is more likely to fall into another sheet in the index tree. In other words, it will require rebuilding the index more often than the auto-increment field.

What conclusions can be made:

We use PostgreSQL, and it has excellent native GUID support.

As identifiers we use not the INT type, but the GUID type.