Introduction
If you've ever dealt with ANY database system, you've probably encountered the ACID acronym. Let's briefly go over each of the principles it describes:
A for Atomicity
The concept of treating something (a block of data operations in our context) as a single, indivisible unit of work.
C for Consistency
In the context of databases, consistency relates to abiding by data constraints - in relational DBs, these are usually foreign key/relation constraints, cascades, and validations. When this property is effectively guaranteed, the results of your data operations become more predictable and stable.
I for Isolation
As transactions and data operations can run concurrently, the isolation property ensures that each transaction runs in an “isolated” environment - generally making sure that the effects one transaction has on the data are not “visible” to another transaction, thus preventing certain race conditions and data conflicts. In reality, there are DIFFERENT levels of isolation (discussed in detail in a dedicated section of this article), with each level allowing some degree of visibility between transactions. Knowing about these levels and tailoring them to your needs can greatly improve the effectiveness and performance of your DB operations.
D for Durability
Durability is about the “permanent persistence” of data. Since much of a database's work is done in non-permanent mediums such as memory (RAM), effective durability guarantees that data processes can recover in case of system crashes - and eventually become permanently persistent.
All of these are closely related to the concept of database Transactions.
Transactions: more than just "atomic operations"
You've probably heard about them and used them before. Most people know that transactions encapsulate a series of data operations into an atomic unit. Some people may also think that simply using transactions guarantees protection from race conditions and data conflicts - this is not necessarily the case, and transactions have more to them, like different isolation levels.
Let's explore some deeper facts and knowledge about database transactions.
What is a Transaction?
In the general abstract sense, a transaction is a way to say: “I want to try these few data steps on my data, and see if they work OKAY, BEFORE making the changes permanent. If something goes wrong - rollback”.
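A minimal sketch of what that looks like in SQL (assuming a hypothetical accounts table):
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT; -- make both changes permanent; on an error we would issue ROLLBACK instead
One important thing to remember is that: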
Databases work in memory
Databases do most of their work in memory (RAM) - for obvious performance reasons - while the permanent state of a dataset is saved on disk. This means that a considerable part of a DB's work involves transferring data from disk to RAM and vice-versa.
Some of the orchestration between disk storage and RAM is handled by an OS facility known as “Virtual Memory”, which can effectively make use of disk storage as available "memory". The internal workings of Virtual Memory are outside the scope of this article.
But, much of the committing of data changes from memory to disk is also the responsibility of a DB engine, and this is an important aspect to know about:
A higher rate of disk flushes yields better durability, at the expense of higher latency (in terms of the time it takes for a transaction to "finish") and worse performance; lower disk flush rates increase performance but carry a higher risk of losing data in case of unexpected crashes === lower durability. This is a constant balance DBs are trying to achieve, and knowing how to control or tweak the configurations that affect it may greatly improve your ability to adapt specific parts of your data to specific use cases.
When you use transactions in databases, you are also making sure that all the transaction's operations are done in memory first, and become permanent (or persistent, or durable) only after a successful call to COMMIT. But did you know that, actually...
Everything is a Transaction
You probably know that in an SQL environment, to start a transaction you call START TRANSACTION (or BEGIN), and then, after running the data operations, either commit it using COMMIT or roll it back using ROLLBACK.
However, even if you just pass a single INSERT or UPDATE, relational database engines will still treat it as a "transaction" (internally giving it a "transaction id" like other transactions). BUT, DB engines treat these single-operation transactions differently, by automatically adding the COMMIT operation after the data operation. These single-operation transactions are also known as implicit transactions, while the popularly known multi-operation transactions are known as explicit transactions.
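For example (with a hypothetical users table), these two are effectively equivalent:
UPDATE users SET name = 'Dana' WHERE id = 7; -- implicit: the engine wraps it in a transaction and commits for you
-- ...is treated the same as the explicit form:
BEGIN;
UPDATE users SET name = 'Dana' WHERE id = 7;
COMMIT;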
This behavior of auto committing implicit transactions is known as:
Auto Commit
With some rare exceptions (Oracle, I'm looking at you), Auto Commit is turned on by default in almost all database engines. Most databases also offer a configuration setting to turn this behavior OFF; each DB has its own way of doing so. Let's go over a few:
MySQL
MySQL has Auto Commit turned ON by default.
To check the current Auto Commit setting:
SELECT @@AUTOCOMMIT;
To turn off auto-commit:
SET AUTOCOMMIT = 0;
(Or use 1 to turn it back on.)
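A quick sketch of the session-level effect (assuming a hypothetical accounts table):
SELECT @@AUTOCOMMIT;   -- 1: every statement commits on its own
SET AUTOCOMMIT = 0;    -- from now on, this session's changes stay in memory...
UPDATE accounts SET balance = 0 WHERE id = 1;
COMMIT;                -- ...until we commit explicitly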
PostgreSQL
Postgres has Auto Commit by default turned ON.
There is no way to change this setting at the engine level, but most clients and drivers allow you to turn it off at the client level - e.g., in psql: \set AUTOCOMMIT off - which will then treat the following data operations as an "explicit" transaction, committed only when you call COMMIT.
MS-SQL
MS-SQL treats the term implicit transaction a little differently: in this context, an implicit transaction is a multi-operation transaction the engine starts for you when "auto-commit" is turned off. SET IMPLICIT_TRANSACTIONS ON will turn on a mode where you always have to call COMMIT manually for the previous data operations to be committed.
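A short sketch of how that looks (assuming a hypothetical accounts table):
SET IMPLICIT_TRANSACTIONS ON;
UPDATE accounts SET balance = 0 WHERE id = 1; -- implicitly starts a transaction
COMMIT; -- without this, the update would eventually be rolled back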
So, how does changing these settings affect the issue of performance vs. durability?
Disk Flushes
At some point the in-memory representation of the data has to be flushed to disk. These flushes are done at preset intervals known as "checkpoints". The checkpoint interval can be configured (in turn increasing or decreasing the DB's durability).
PostgreSQL
In PostgreSQL, the checkpoint_timeout configuration sets the checkpoint interval, and checkpoint_completion_target controls how the checkpoint's work is spread across that interval.
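For example (the exact values here are illustrative, not recommendations):
SHOW checkpoint_timeout;                              -- the default is 5min
ALTER SYSTEM SET checkpoint_timeout = '15min';        -- fewer, larger checkpoints
ALTER SYSTEM SET checkpoint_completion_target = 0.9;  -- spread checkpoint I/O over 90% of the interval
SELECT pg_reload_conf();                              -- apply the changes without a restart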
MySQL (InnoDB)
InnoDB (the default storage engine in MySQL) has a general setting that affects the checkpoint rate, innodb_io_capacity, along with a setting that controls the maximum percentage of "dirty memory pages" before a checkpoint is issued: innodb_max_dirty_pages_pct. It also supports an "adaptive flushing" mode, which automatically tries to adjust flush rates according to workload, and can be turned on or off using innodb_adaptive_flushing.
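For example (illustrative values only):
SET GLOBAL innodb_io_capacity = 400;        -- hint at the I/O budget available for flushing
SET GLOBAL innodb_max_dirty_pages_pct = 75; -- flush more aggressively past 75% dirty pages
SET GLOBAL innodb_adaptive_flushing = ON;   -- let InnoDB adjust the flush rate to the workload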
Oracle
Oracle has the LOG_CHECKPOINT_INTERVAL setting to control the checkpoint interval.
There are many more available settings related to disk flush rates in each DB system, consult their documentation for more information.
Write Ahead Log (WAL)
To further ensure the durability of their data, databases implement a special LOG that records data operations between checkpoints, called a Write Ahead Log, or WAL. If a system crashes, it can read the WAL records written since the last checkpoint and redo the data operations. The WAL is also used for transaction rollbacks.
WAL entries are also initially written into memory, and also need to be flushed to disk periodically - though the flushing of the WAL is usually done at a much higher rate. If a system crashes completely before the WAL is flushed to disk, all data operations logged in it since the last disk flush will be lost. This also means that no database system in existence can promise 100% durability - but they can offer percentages very close to it.
MySQL (InnoDB)
Durability is such an important aspect that the default setting for flushing the WAL to disk in InnoDB is: after each commit. This also means that a disk write is issued after every single implicit transaction (single data operation). So the default for InnoDB is to put more emphasis on durability than on speed and performance - something that is not necessarily right for all use cases; for some types of data, such as analytics and certain non-real-time data, you can "risk" losing a small amount to gain higher performance.
The setting that controls this behavior is innodb_flush_log_at_trx_commit:
1 (the default): InnoDB flushes the WAL (which is called "the redo log" in its context) to disk on every commit.
0: The WAL is written and flushed to disk about once per second.
2: The WAL is written on each commit, but flushed to disk about once per second.
3: The WAL is flushed to disk both at the "prepare" stage of a transaction and at the commit (highest level of durability/slowest operation; this is considered a "legacy" behavior).
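For example, trading some durability for speed - a judgment call that fits data like analytics, as mentioned above:
SET GLOBAL innodb_flush_log_at_trx_commit = 2; -- write the redo log on each commit, flush to disk ~once per second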
Oracle
The default behavior in Oracle is also to flush the WAL on each commit. You can change this behavior with the COMMIT_WRITE setting.
PostgreSQL
The default behavior differs depending on the system and OS, but it can be controlled using the wal_sync_method setting.
Additional configuration settings in each DB control different aspects of the WAL and the flush timings. Consult the respective documentation.
MongoDB
In MongoDB the "WAL" is known as the "Journal", and Write Ahead Logging is called "journaling". To control the rate at which the Journal is flushed to disk, use the storage.journal.commitIntervalMs setting. The default is 100ms.
So, what is actually considered a "successful transaction commit"?
Knowing these different aspects and stages of a commit might raise the question: "when does the issuing client know that the transaction was successful?" - well, the answer to this question is, of course, configurable as well.
PostgreSQL
The major configuration parameter controlling this is called synchronous_commit.
Here are the optional values and their meanings:
on: Wait for the WAL disk flush before reporting success (with synchronous replication, this also waits for the standby's flush).
remote_apply: Wait for WAL records to be applied on the standby server (a "backup server" used as part of Postgres' replication system).
remote_write: Wait for WAL records to be written to the standby server's OS, though not necessarily flushed to its disk.
local: Wait only for the local WAL disk flush.
off: Report success without waiting for WAL records to be flushed to disk.
The default is on.
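The setting can also be changed per session, or even per transaction - for example, for low-value writes where losing the last few transactions on a crash is acceptable:
SET synchronous_commit = off;       -- this session: report success before the WAL flush
-- or, inside a single transaction:
BEGIN;
SET LOCAL synchronous_commit = off; -- applies to this transaction only
-- ...data operations...
COMMIT;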
Oracle
Use the COMMIT_WRITE setting:
IMMEDIATE, NOWAIT: Return success without waiting for the WAL disk flush.
IMMEDIATE, WAIT: Wait for the WAL to flush to disk before returning success.
BATCH, NOWAIT: The WAL is written in batches; return success without waiting for the disk flush.
BATCH, WAIT: The WAL is written in batches; wait for the disk flush before returning success.
MySQL
The same innodb_flush_log_at_trx_commit setting mentioned before also affects when a transaction commit is considered "successful".
MongoDB
Write operations in Mongo can have a writeConcern property to control this: if it includes j: true, the operation will wait for the write to reach the journal, and for the journal's disk flush, before acknowledging success. Note that j: true will also force a disk flush right after the write operation finishes.
Summary
In summary, to ensure high durability databases implement the following:
They use Virtual Memory and RAM for most of their work.
They use a Write Ahead Log that records all data operations between checkpoints, flushing its content at preset, configurable intervals (which can be after each transaction commit).
They use "checkpoints" - points at which the in-memory data is flushed to disk.
Now let's explore how the Atomicity aspect is implemented in databases:
Atomicity === Locking
Databases ensure atomicity by using locks on the data a transaction is working on in memory (the in-memory data currently being worked on is known as "the working set"). There are generally two types of locks:
Pessimistic locking (aka: Exclusive Lock)
This generally means that whatever data a transaction works on is completely locked for other transactions for the duration of the work. Any concurrent or later-run transactions that need to deal with some of the same data must wait until the first transaction finishes. A more advanced method of locking is known as
Optimistic locking (MVCC)
This method is also known as Multi Version Concurrency Control (MVCC).
With this approach, each record in the in-memory dataset has an associated version number. When a transaction starts, it records the version number so it knows the state of the entity; a "snapshot" of the relevant data is created, and the transaction's READ operations have access to that snapshot.
If the transaction modifies the data with WRITE operations and tries to commit, the engine verifies the versions. If records were modified in the meantime by some other transaction, the update is rejected and the transaction has to roll back or start over.
The non-relational MongoDB uses MVCC for multi-document transactions. For single-document operations it implements pessimistic locking on entire documents.
Relational databases that implement this approach (such as PostgreSQL, MySQL (with the InnoDB engine), Oracle, and MS-SQL (as an optional setting)) use it at the row level, which allows several transactions to modify different rows concurrently.
This approach offers higher concurrency, but it can introduce a higher number of transaction rollbacks if concurrent transactions often work on the same data. It also introduces more complexity into the database engine, as multiple versions have to be managed, stored, and garbage collected.
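The core version-check idea can be sketched at the application level with an explicit version column (a hypothetical accounts table; real MVCC engines do this internally and automatically):
SELECT balance, version FROM accounts WHERE id = 1; -- say this returns version 42
UPDATE accounts
SET balance = 150, version = version + 1
WHERE id = 1 AND version = 42; -- succeeds only if nobody changed the row in the meantime
-- if 0 rows were affected, another transaction "won" - retry or give up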
Shared Locks
For READ operations, most DB engines implement "shared locking", which allows multiple concurrent transactions to read the same (versioned) data.
Many DB systems also allow explicitly acquiring shared locks using SQL syntax, for example:
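In PostgreSQL (and MySQL 8), on a hypothetical accounts table:
SELECT * FROM accounts WHERE id = 1 FOR SHARE; -- other readers proceed, but writers must wait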
More details about shared read locks are outside the scope of this article.
Summary
To ensure atomicity databases use a combination of locks:
Exclusive locks.
Multi-Version Concurrency Control (multi-versioned data "snapshots").
Shared locks.
Isolation
Now that we know all of this, we can understand how DB engines effectively run transactions in an "isolated" environment. This isolation can ensure that data affected by one transaction is neither affected by nor visible to another - but some occasions DO require a certain level of visibility between transactions. Before looking at the different levels of isolation and their use cases, let's go over a few related terms:
Dirty Read
A "dirty read" is a term used to describe any data changes made by a transaction, that were still not committed.
Non-Repeatable Read
This describes a scenario where a transaction has, for example, two READ operations, and another source alters the data between the reads, resulting in the second read returning different data results.
Phantom Read
Phantom reads are scenarios where a transaction has, for example, two READ operations, and in between those, another source adds or deletes data records, resulting in additional (or fewer) data records in the second read.
Now, let's go over the different kinds of isolation levels:
Isolation Levels
Read Uncommitted
This isolation level, considered the lowest, allows transactions to read uncommitted changes made by other transactions, i.e. it allows "dirty reads". This means that while transaction A is running, if any other transaction modifies data and that data is read by transaction A, it will read the "dirty" uncommitted data. Depending on the use case, this can sometimes lead to undesirable results:
Dirty reads rollback
Transaction A updates data in the database but does not commit.
Transaction B reads this uncommitted data.
Transaction A rolls back its changes.
Transaction B has read a value that is now invalid.
Non-Repeatable Reads
Transaction A reads data from the database.
Transaction B updates that data.
Transaction A reads the same data again and sees the updated version.
Phantom Reads
Transaction A queries a set of rows with a specific condition.
Transaction B inserts, updates, or deletes rows that affect the result set of Transaction A's query.
Transaction A re-executes the query and gets a different set of rows.
Lost Updates
Occurs when two transactions read the same data and then update it based on the read value. The first update is overwritten by the second update, leading to lost data.
• Transaction A reads a value.
• Transaction B reads the same value.
• Transaction A updates the value.
• Transaction B updates the value, overwriting the change made by Transaction A.
Uncommitted Dependency
Occurs when a transaction’s outcome depends on changes made by another transaction that are not yet committed. If the other transaction rolls back, the dependent transaction’s outcome may be incorrect.
• Transaction A updates a value but does not commit.
• Transaction B reads the updated value and uses it to perform further operations.
• Transaction A rolls back, invalidating the read value and subsequent operations performed by Transaction B.
Read Committed
This isolation level does not allow a transaction access to uncommitted changes from other transactions, but it does allow visibility into committed changes made by other transactions. This means that Non-Repeatable Reads and Phantom Reads can both still occur at this isolation level, whenever changes between the data operations of a transaction are made by other transactions and committed.
In both of these isolation levels, since Non-Repeatable Reads can happen, imagine the following scenario:
Transaction A reads a balance value to determine if it is available (non-zero), intending to follow with certain updates. It reads the value as 1.
Transaction B modifies the balance to 0 and commits the changes.
Transaction A still performs an update based on its first read of the balance (1) - even though at this stage the balance is already updated to 0!
A possible "resolution" for this is to have Transaction A read the balance once again, the second time receiving the updated balance of 0 (since reads in both isolation levels are Non-Repeatable) - however, this requires that Transaction A be "aware" of Transaction B (or any other relevant transactions), which complicates the business logic.
Lost Updates can also still occur in this isolation level:
Transaction A reads a value and then updates it.
Transaction B reads the same value and then updates it to another value - overwriting the update of Transaction A.
This is the "second level" of isolation, and, interestingly enough, it is actually the default isolation level in most of the DB engines: PostgreSQL, MSSQL, and Oracle (but not in MySQL).
The next level of isolation is:
Repeatable Read
This isolation level ensures that if a value is read more than once in the same transaction, the same data will be returned.
Non-Repeatable reads are not allowed with this isolation level, but phantom reads can still occur.
This is the default isolation level in MySQL.
This isolation level does not yet completely "solve" the "balance" problem shown above:
Transaction A reads the balance as 1.
Transaction B updates the balance to 0.
Transaction A will still read the balance as 1 throughout its duration, and can perform a wrong UPDATE.
With all the above isolation levels, another possible "solution" to the "balance" problem is acquiring an exclusive lock for Transaction A, by using the SELECT FOR UPDATE syntax on balance - this ensures Transaction B will not be able to update the balance to 0 until Transaction A finishes. For example:
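A sketch of that solution (again with the hypothetical accounts table):
BEGIN;
SELECT balance FROM accounts WHERE id = 1 FOR UPDATE; -- exclusive row lock: concurrent writers now wait
UPDATE accounts SET balance = 0 WHERE id = 1;         -- safe: the value we read cannot change under us
COMMIT; -- releases the lock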
But, another way, would be to use the next isolation level:
Serializable
This is the highest isolation level, where a transaction is executed as if it were "the only transaction in the system": transactions behave as if they ran sequentially, one by one, with each transaction starting only after the previous one has finished.
At this isolation level there can be no dirty reads, non-repeatable reads, or phantom reads - but the level of concurrency is obviously greatly reduced.
Side Note
On many occasions, for the "balance problem" mentioned above, it is recommended to implement a solution "outside" of the main database layer (especially if concurrent 'requests' are also involved) - in the application layer, or using another kind of data structure, e.g. a sorted queue.
Isolation Levels Summary
Here is a nifty table summarizing the different isolation levels:
Isolation Level  | Dirty Reads  | Non-Repeatable Reads | Phantom Reads | Lost Updates
Read Uncommitted | Possible     | Possible             | Possible      | Possible
Read Committed   | Not possible | Possible             | Possible      | Possible
Repeatable Read  | Not possible | Not possible         | Possible      | Possible
Serializable     | Not possible | Not possible         | Not possible  | Not possible
Changing or Setting Isolation Levels
DB engines allow you to change isolation levels at a broader scope (global or session) or per transaction. Here's how, with a few examples across different DB systems:
MySQL
Global
SET GLOBAL TRANSACTION ISOLATION LEVEL READ COMMITTED;
Session level
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;
Transaction level (applies to the next transaction only)
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
PostgreSQL
Session level (Postgres has no global SQL command for this; the engine-wide default is the default_transaction_isolation configuration parameter)
SET SESSION CHARACTERISTICS AS TRANSACTION ISOLATION LEVEL READ COMMITTED;
Transaction level
BEGIN;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- Perform operations
COMMIT;
MS-SQL
Session level
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
Transaction level
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
BEGIN TRANSACTION;
-- Perform operations
COMMIT TRANSACTION;
Oracle
Session level
ALTER SESSION SET ISOLATION_LEVEL = SERIALIZABLE;
Transaction level
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
-- Perform operations within this transaction
COMMIT;
Summary
Databases strive for durability, transaction atomicity and isolation, consistency, and integrity - but it's not always possible to achieve 100% of each, as an increase in one can lead to a decrease in another. Knowing the different approaches database systems take, and the different ways to change the relevant configurations, allows developers to use the right tools, systems, and configurations, custom-tailor them to their use cases, and achieve greater performance, durability, and more predictable results.