Isolation & Concurrency

A few meetups ago, we had Jim Starkey come and speak. During that evening, he asked “Why is MVCC not serializable?” It was a “go and do your homework” kind of question.

So, in order to answer that question, let’s explore the concepts of isolation and concurrency, the “i” and “c” in ACID compliance. For this post, I will refer to “transactions”, by which I mean actual transactions, as well as statements where autocommit=0. In other words, every statement is its own transaction, unless it is within an explicit transaction, and for the purpose of this article, those statements are “transactions”. Of course, for many of the explanations below, if there’s only one statement in the transaction, there’s no problem. (As usual, if I make a mistake, please correct me. I will make sure all comments from all sources get to the article at http://www.sheeri.net/123 .)

Isolation levels define how separate the transaction data is from the actual data in the table. So if I begin a transaction, read a row from the table, change the row, read the row from the table, and commit, what do I see at each step? What does someone else who is also reading the same row see?

Isolation Level: Read Uncommitted
#	Step	Thread 1: changing data	Thread 2: reading data
0	Thread 1: Read row Thread 2: Read row	sees data from original row	sees data from original row
1	Thread 1: Change Row Thread 2: N/A	changes data	N/A
2	Thread 1: Read row Thread 2: Read row	sees new data	sees new data
3	Thread 1: Commit Thread 2: N/A	Commits data	N/A
4	Thread 1: Read row Thread 2: Read row	sees new data	sees new data

So a read uncommitted is the lowest level of isolation in that there is none. This is how statements that are implicit transactions act — you see the data right away, even if what you are doing is logically a transaction. This is also called a “dirty read” because information is seen right away. If the transaction rolled back, and ‘undid’ the change, then the original data would be back for all to see.

More isolated than read uncommitted read is “read committed”, which, as you can guess, means that thread 2 can only read committed changes.

The description in bold is the difference between read uncommitted and read commited:

Isolation Level: Read Committed
#	Step	Thread 1: changing data	Thread 2: reading data
0	Thread 1: Read row Thread 2: Read row	sees data from original row	sees data from original row
1	Thread 1: Change Row Thread 2: N/A	changes data	N/A
2	Thread 1: Read row Thread 2: Read row	sees new data	sees data from the original row
3	Thread 1: Commit Thread 2: N/A	Commits data	N/A
4	Thread 1: Read row Thread 2: Read row	sees new data	sees new data

However, this means that for the duration of thread 2‘s transaction, the data may change. This is not necessarily desirable, and is seen as a ‘dirty’ read because for the duration of thread 2’s transaction, the state of the data changes.

So if thread 1 commits a change, thread 2 does not see it until it reads the changed row in a transaction that started after the commit.

The part in bold is the difference between read uncommitted and repeatable read:

Isolation Level: Repeatable Read
#	Step	Thread 1: changing data	Thread 2: reading data
0	Thread 1: Read row Thread 2: Read row	sees data from original row	sees data from original row
1	Thread 1: Change Row Thread 2: N/A	changes data	N/A
2	Thread 1: Read row Thread 2: Read row	sees new data	sees data from the original row
3	Thread 1: Commit Thread 2: N/A	Commits data	N/A
4	Thread 1: Read row Thread 2: Read row	sees new data	sees data from original table
5	Thread 1: N/A Thread 2: New transaction	Thread 2 starts new transaction	N/A
6	Thread 1: Read row Thread 2: Read row	sees new data	sees new data

This sounds like a good level of isolation. However, there are many cases where a transaction will want the most recent available data. For instance, if two people who have a joint bank account take out money at the same time, having two systems start with the same balance, update the new balance and then write the new balance to the database will cause the bank to lose a lot of money, as only one of the two withdrawals will be reflected.

The serializable isolation level helps prevent this — it does not allow a read of data that is being written to. In other words, there is a serial order to transactions, even reads.

The part in bold is the difference between repeatable read and serializable:

Isolation Level: Serializable
#	Step	Thread 1: changing data	Thread 2: reading data
0	Thread 1: Read row Thread 2: Read row	sees data from original row	sees data from original row
1	Thread 1: Change Row Thread 2: N/A	changes data	N/A
2	Thread 1: Read row Thread 2: Read row	sees new data	is not allowed to start the transaction of reading
3	Thread 1: Commit Thread 2: N/A	Commits data	N/A
4	Thread 1: Read row Thread 2: Read row	sees new data	is allowed to start the read transaction, sees new data
5	Thread 1: N/A Thread 2: New transaction	Thread 2 starts new transaction	N/A
6	Thread 1: Read row Thread 2: Read row	sees new data	sees new data

I have not yet addressed how these restrictions are enforced, that’s what concurrency is.

Concurrency is how the different working environments occur at the same time. Folks who know about version control understand concurrency issues. One way to control concurrency is to use READ and WRITE locks. To achieve a read uncommitted isolation level, there are no locks. For read committed, a read lock on the data allows multiple threads to read the data, but no thread may write to the data; a write lock does not allow reading until the lock is released. For repeatable read, a read lock means that the thread always sees the same data even if it is changed on disk. For serializable, only one lock can be on the data at a time.

Those who deal with version control know that there is a better system than those locks. Imagine not being able to look at a code file while someone was changing it! In many version control systems, each user has a working copy where changes can be made, which is hidden from everyone else until commit. That is multi-version concurrency control (MVCC) — there are multiple versions, all out there at once concurrently. MySQL employes MVCC.

The serializable isolation level makes more sense using the MVCC model — before a software developer commits, they ususally update their files to see if other changes have been made. This does not quite carry over to databases.

Within a multi-statement transaction (for storage engines with support), the database gives a snapshot of the data to the transaction to work with, much like checking out a version. The physical database may be changing with other commits, but much like version control in software, you only see the data you ‘checked out’ for the duration of the transaction.

Therefore, all isolation levels up to repeatable read are easy to enforce, but serializable is more difficult. There is no way to “update” the snapshot of data within a transaction, to use what’s currently in the database; you can only look at the data as it was at the beginning of the transaction (repeatable read isolation level).

So that’s why MVCC in a database is not serializable.

To be clear: You can set the isolation level to be SERIALIZABLE in MySQL.

See:
http://dev.mysql.com/doc/refman/5.0/en/innodb-transaction-isolation.html

and

http://dev.mysql.com/doc/refman/5.0/en/innodb-transaction-isolation.html