MVCC Part 4: Distributed MVCC

Greetings, oh-so-persistent nuonians! Your faithful consistency nerd is here to continue the discussion of MVCC. For background, please examine part 1 (wherein our heroes are introduced to MVCC), part 2 (wherein our heroes witness examples) or part 3 (wherein our heroes are exposed to the subtleties of vanilla MVCC). This chapter of the story will introduce you to distributed MVCC, in particular the NuoDB flavor.

The Problem Distributed MVCC is Solving

NuoDB is a distributed ACID database. We recognize that applications are easier to write with transactions, and that not all transactions are created equal. Like any good database, we allow the user to tell us what level of consistency they want, and we need to act accordingly. NuoDB uses MVCC to handle concurrent reads and writes to shared data, and the previous posts have gone over the broad strokes of what that entails. However, in a distributed system we can have situations where multiple separate nodes each want to update the same record at the same time.

‘Classic’ databases were designed as single-server solutions, where concurrency came in the form of local threads. Much work was done in decades past to build systems that could resolve update conflicts between threads. In general, the approaches involved ‘managers’, a term I’m going to use for any centralized, synchronized object that acts as the main clearing-house for requests that must be handled consistently. For readers familiar with run-of-the-mill thread programming, a manager was usually some fine-grained, lock-based object that would hand out ‘grants’ for record versions to requesting threads. In other words, if multiple threads try to add a new record version ‘simultaneously’, the manager determines who gets which version (and whether anyone needs to be failed). This is perfectly fine for a single-node system, and optimizing local-memory concurrent data structures is a task that can keep armies of engineers occupied for untold eons.

Distributed systems are, at a high level of abstraction, just large-scale concurrent systems. Instead of function calls and atomic locks, a distributed system uses messages and message processing. The similarity starts to break down when you want to make things fast. The network is orders of magnitude slower than local memory, and a design that worked jim-dandy in local memory may (and often does) fail to scale to multiple machines even when retooled for distributed operation. One of the first lessons I learned when transitioning from a concurrent programmer to a distributed programmer was that centralized manager-type systems are a bad idea in distributed systems. The reason is simple enough: a centralized point of control may require blocking, which can effectively stall every node for a time. There are also reliability concerns, namely: what do you do if the node acting as the ‘manager’ dies? So centralized points of control can cause issues that defeat the reasons for using a distributed system in the first place (e.g. performance and reliability). ‘Well, shoot,’ some of you may be saying, ‘if we can’t just distributify the old manager-based approaches, how are we going to swing a distributed MVCC database?’

There are some tools in the distributed engineer’s toolkit available to us. The first trick is that we build everything from atoms (fine-grained partitioning). A table itself is actually a galaxy of atoms, each of which is a small, independent distributed object. As a table grows, it grows by adding more atoms rather than inflating existing atoms indefinitely. Because the atoms are independent, we can leverage another trick (asynchronous messaging) to overlap processing with messaging and hide as much message-processing latency as possible. These are useful tricks and help keep atom processing lean and fast. However, the real speedup comes from applying asynchrony more generally.
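The grow-by-adding-atoms idea can be sketched in a few lines. This is purely illustrative: the names (`Atom`, `Table`, `ATOM_CAPACITY`) are invented for the example, and real atoms are sized in bytes and replicated across nodes, not counted in rows.

```python
# Illustrative sketch only; names are invented, not NuoDB's API.
ATOM_CAPACITY = 4  # toy limit; real atoms are sized in bytes

class Atom:
    """A small, independently replicable unit of table data."""
    def __init__(self):
        self.rows = {}

    def is_full(self):
        return len(self.rows) >= ATOM_CAPACITY

class Table:
    """A table is a galaxy of atoms; growth adds new atoms
    rather than inflating any single atom indefinitely."""
    def __init__(self):
        self.atoms = [Atom()]

    def insert(self, row_id, value):
        if self.atoms[-1].is_full():
            self.atoms.append(Atom())  # grow by adding an atom
        self.atoms[-1].rows[row_id] = value

t = Table()
for i in range(10):
    t.insert(i, f"row-{i}")
```

Because each `Atom` is independent, work on one atom never has to synchronize with work on another, which is what makes the asynchronous-messaging trick pay off.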

A time-honored trope in systems engineering is ‘make the common case fast’. In general, any marginally well-designed application won’t have every thread on every node banging away on the same record, so most databases assume that the common case for a record update is no-conflict or low-conflict. NuoDB makes the same assumption. We’d prefer not to wait for messages from a ‘manager node’ in order to process an update, so every node assumes it has the right to install a new version on top of what it believes is the latest version, and then marches ahead. Of course, conflicts do happen, and we need to be able to detect them and deal with them.
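The optimistic assumption can be sketched as follows. All names here are hypothetical; the point is only that installing a provisional version requires no network round trip.

```python
# Toy sketch of optimistic versioning; names are invented.
class Record:
    def __init__(self, latest_committed=0):
        self.latest_committed = latest_committed
        self.pending = {}  # provisional versions: version -> (node, value)

def optimistic_update(record, node, value):
    """Install a provisional new version locally, with no round trip.
    The node simply assumes it holds the latest version; a conflict,
    if any, is detected later when peers broadcast their changes."""
    version = record.latest_committed + 1
    record.pending[version] = (node, value)
    return version

r = Record(latest_committed=5)
v = optimistic_update(r, "node2", "A")  # node 2 provisionally claims version 6
```

Nothing blocks here; the cost of being wrong is deferred to conflict detection.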

Of Chairmen and Record Metadata 

Consider a database with 3 nodes (1, 2 and 3). All of them agree that the latest version of row 42 is 5. Nodes 2 and 3 have clients connected to them that are attempting to update row 42. The simplest approach would be for node 2 to confirm that it was allowed to update row 42 before proceeding. But if node 2 had to chat with all its peers before allowing that update, then updates would be incredibly slow, as each row would require all kinds of chit-chat. This slowdown is doubly bad in the assumed common case, where most updates don’t conflict at all. So, obviously, doing distributed coordination for each row before an update is slow. NuoDB is not slow; therefore NuoDB doesn’t do that. Instead, nodes 2 and 3 both assume that it’s OK for them to install a new version locally, and that someone, somewhere will call them on it if it isn’t actually OK.

In the example above, nodes 2 and 3 would both independently approve their respective client updates, provisionally give them version 6 and then march ahead. Of course, now we have an undetected and unresolved conflict: node 2 thinks that row 42, version 6 is A, and node 3 thinks that it is B. How can this work? How will the conflict be detected, and then dealt with? The system works thanks to the magic of MVCC. Because the updates are part of uncommitted transactions, they won’t be visible to anyone else. MVCC therefore restricts the scope of conflict detection and resolution for a single row to exactly the set of active transactions updating that row.

The problem is not that node 2’s update is ‘bad’, or that node 3’s update is ‘bad’. The problem is that both changes can’t be in flight at the same time, so we need some way of detecting that such a conflict exists. In NuoDB, there’s no extra-special node responsible for database leadership; there are no leader/master/mongo nodes. However, in a situation like this we need some kind of arbitration. Fortunately, NuoDB has a distributed, lightweight pseudo-leadership that we can piggyback on. In NuoDB, each atom has a ‘chairman’: a node that can handle these arbitration situations. Every peer knows who the chairman is, even during waves of node failure, and a lot of clever logic has gone into making chairmanship and chairman changes message-free, so there’s no election process or similar on-demand consensus gathering to worry about. In the example, that means that one of nodes 1, 2 or 3 is chairman for the atom containing the record metadata for row 42.

Because the chairman may not be executing a transaction in the set of conflicting transactions, we need to make sure that the chairman still finds out about record updates. Of course, we had to do this anyway to keep everybody consistent. We just need to make sure that update changes are broadcast before commit, so that the chairman has everything it needs to break any ties. This broadcast-before-commit process has an added benefit: in the common case of no conflicts, all the changes will already be replicating around the database before the commit is even invoked (which leads to faster replication and commits).

The chairman breaks ties in a simple way: when two nodes attempt to update the same row at the same time, the chairman grants the update to the node whose update notification arrived at the chairman first. In the example, assume that node 1 is the chairman. Nodes 2 and 3 both broadcast their update changes, and assume that node 3’s message arrives first. Node 1 will therefore install node 3’s record version, and the canonical record metadata atom will now note that row 42, version 6 is ‘B’ and is owned by the transaction on node 3. When node 2’s message shows up, the conflict is detected and a failure response is sent to node 2. This is illustrated below, where we assume that both node 2’s and node 3’s transactions are updating multiple rows, so local row updates keep proceeding even though some of the updates haven’t yet been officially endorsed by the chairman.

[Figure: MultiNode Conflict]
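The tie-break amounts to first-arrival-wins at the chairman. Here is a minimal sketch of that arbitration logic, with invented names (`Chairman`, `on_update_broadcast`); the real protocol runs over NuoDB's messaging layer, not method calls.

```python
# Toy sketch of chairman arbitration; names are invented.
class Chairman:
    """The first update notification to arrive for a (row, version)
    pair wins; later arrivals for the same pair get a conflict reply."""
    def __init__(self):
        self.granted = {}  # (row, version) -> winning node

    def on_update_broadcast(self, node, row, version):
        key = (row, version)
        if key in self.granted:
            return ("conflict", self.granted[key])  # tell loser who won
        self.granted[key] = node
        return ("granted", node)

c = Chairman()
first = c.on_update_broadcast("node3", 42, 6)   # node 3's message arrives first
second = c.on_update_broadcast("node2", 42, 6)  # node 2's message arrives later
```

Note that the chairman never initiates anything; it simply reacts to the same update broadcasts every peer receives anyway, which is why no extra coordination messages are needed in the no-conflict case.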

This system enables us to do distributed conflict detection lazily and asynchronously. The only time node 2 or node 3 has to block awaiting a response is when a commit point is reached and the system must determine the final state of the update operation (e.g. the transaction is being committed). Because SQL updates are inherently bulk operations, the latencies of multiple updates can be overlapped and effectively eliminated (or greatly reduced). Another thing to bear in mind is that NuoDB nodes are constantly garbage-collecting disused atoms, so even though the database itself may consist of 100 nodes, only the exact subset of nodes executing against the rows in question has to participate in any conflict detection decision.
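The latency overlap can be illustrated with a toy sketch (all names invented): every row's update broadcast is fired without waiting, and only the commit point gathers the responses, so per-row latencies overlap instead of adding up.

```python
# Toy illustration of overlapped update latency; names are invented.
from concurrent.futures import ThreadPoolExecutor

def send_update(row):
    # stand-in for broadcasting one row's change to peers
    return ("ok", row)

def bulk_update(rows):
    """Fire all broadcasts asynchronously; block only at the
    'commit point', when every outcome must be known."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(send_update, r) for r in rows]
        # local per-row processing would continue here, unblocked
        return [f.result() for f in futures]  # the commit point

results = bulk_update([42, 43, 44])
```

With N rows, the total wait is roughly one round trip rather than N round trips, which is the whole point of deferring conflict detection to commit.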

We’re now two of three steps towards having distributed MVCC. We know that using normal MVCC semantics means we don’t have to do any fancy coordination of pending changes. We also know how to leverage the chairmanship concept to break ties between conflicting updates from different nodes. We’re left with a final problem: how do we resolve a conflict once it’s detected? This is more complicated, not least because the answer depends on the semantics the client is looking for. For example, SQL allows clients to set ‘isolation levels’, which are basically a knob for trading off consistency against performance, and each isolation level may have completely different semantics to support. Therefore, we actually need a meta-solution to this problem: a system that allows us to handle update conflicts differently depending upon client-requested semantics. From the universe of possibilities, there is an obvious set of three:

  1. Fail right away (e.g. on conflict, abort the transaction)
  2. Block until the conflicting transaction’s final state is known, then react against that state
  3. Something special
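The three strategies could be modeled as a policy with a conflict handler, sketched below with hypothetical names; how NuoDB actually maps isolation levels to strategies is internal and not spelled out here.

```python
# Hypothetical sketch of the three conflict strategies; names invented.
from enum import Enum

class ConflictPolicy(Enum):
    FAIL_FAST = 1    # strategy 1: abort the transaction on conflict
    BLOCK = 2        # strategy 2: await the winner's final state
    WORK_AROUND = 3  # strategy 3: 'something special', e.g. re-versioning

def on_conflict(policy, winner_outcome=None):
    if policy is ConflictPolicy.FAIL_FAST:
        return "abort"
    if policy is ConflictPolicy.BLOCK:
        # block until the winner commits or rolls back, then react
        return "retry" if winner_outcome == "rolled_back" else "abort"
    # work-around case: e.g. bump our update to the next version instead
    return "re-version"
```

The dispatch structure is the point: the detection machinery stays the same, and only the reaction varies with the client-requested semantics.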

The choice between 1 and 2 is similar to the choice that needed to be made for non-distributed MVCC, and in NuoDB it is currently made based on isolation level. Choice number 3 is a bit of a cop-out, but it basically refers to strategies that allow the system to detect a conflict that can be ‘worked around’ and then work around it. NuoDB does this in several situations where a conflict can be fixed post facto. For example, if node 3 wins the race to update row 42 to version 6, there are situations where it is OK to change node 2’s update to be an update of row 42 to version 7 instead. In these cases, NuoDB has determined that transactional semantics allow us to effectively enforce that node 3’s transaction happens before node 2’s transaction. Doing this correctly and efficiently requires some fancy bookkeeping. For example, now that node 2’s transaction happens after node 3’s, what happens if the transaction on node 3 aborts and is rolled back? Of course, this depends on the isolation level of node 2’s transaction and whether or not node 2’s transaction used/read any state that’s now being rolled back. By being even more optimistic about updates, NuoDB avoids aborting transactions that could otherwise commit cleanly.

Martin Suer

thanks Trek, has been a pleasure reading this

Pedro Eugênio Rocha

Great posts!

Pedro Eugênio Rocha

I think the main problem when implementing distributed MVCC is how to establish a global ordering between distributed transactions. In your example, for instance, you roll back the transaction on Node 2 even though it might have started after the transaction on Node 3, in which case it could be committed. There is no way you could infer distributed transaction ordering without using a synchronized clock, which is very hard to implement even in a local network. See Google’s Spanner paper for more on that matter. How do you handle that? NTP is not enough to provide such guarantees.


In NuoDB, transaction control is local to a particular transaction engine. Transaction start and commit/failure are driven from the same machine, and rollback is controlled similarly. Computation can be distributed, but if state transitions are guaranteed to be processed in the same order on all peers, then you don’t need an atomic clock!

Wilson A. Higashino

Hi Trek, great post.

I am just wondering if NuoDB is strongly consistent in the CAP sense.

Imagine the following scenario:

- Two clients, A and B;
- Three servers C, D and E;

Client A has a session with server D and is updating a record, say row number 100. Client B is connected to server E, which has the same record in memory because client B was using it before. Server C is the chairman of the record atom that contains this row.

Before the commit, server D sends the update to the chairman (server C) and the other ‘replicas’ of the record (server E). No conflict is detected, so server D is allowed to proceed with the commit.

The commit message is sent (asynchronously?), and the server D executes the commit.

The commit message to server E is delayed. Client B executes a new read of row 100 in server E. Due to the MVCC concurrency control, the reader is not blocked and reads an older version of the data.

Client A executes a new read of the row 100, and gets the new data.

Is that possible?

What am I missing here?



Thank you for your question. Of course, MVCC is complicated, because consistency is complicated. Here are a couple of things that might help you when trying to reason your way through the MVCC jungle:

1) Transaction control in NuoDB is local to a particular node, so the set of visible records is determined by the set of transactions known to be committed at the time the transaction STARTED. What this means is that it doesn’t matter how many transactions commit, or in what order, if they do so after the reading transaction started (from the point of view of that node). In your example, this means that client B will read an older version of the data, but what client A reads depends on whether or not its transaction started after the update transaction committed from the point of view of server D.
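The start-time visibility rule above can be sketched in a couple of lines (all names hypothetical): a reader's view is frozen to the set of transactions it knew were committed when it began.

```python
# Toy sketch of start-time snapshot visibility; names are invented.
def visible_at_start(committed_at_start):
    """Freeze the reader's view at transaction start: only writers
    already known committed on this node at that moment are visible."""
    snapshot = frozenset(committed_at_start)
    def can_see(writer_txn_id):
        return writer_txn_id in snapshot
    return can_see

# Client B's read started before txn 17's commit reached server E,
# so txn 17's write is simply outside B's snapshot:
b_can_see = visible_at_start({10, 12})
```

Later commits never retroactively enter the snapshot, which is why a delayed commit message changes nothing for an already-started reader.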

This brings up another important point:

2) Consistency is NOT serializability. Serializability implies consistency, but there are often interleavings of operations that are not serializable yet result in consistent state. In fact, serializability is expensive to maintain and rarely needed. NuoDB bypasses the whole issue of global synchronicity (which is what a global serial order amounts to) by just worrying about local transaction order and consistency of operations at the record level.

Wilson A. Higashino

Hi Trek. Yes, this definitely is not a simple subject.

I understand that NuoDB is consistent in the ACID sense, and this consistency does not depend on a global serial order.

My question was about consistency as defined in the CAP theorem. I am citing S. Gilbert and N. Lynch here, who proved the validity of the theorem: “Under this consistency guarantee, there must exist a total order on all operations such that each operation looks as if it were completed at a single instant… [in a consistent service] any read operation that begins after a write operation completes must return that value, or the result of a later write operation.”

Based on your reply, can I assume NuoDB is an “eventually consistent” database (under the above definition of consistency)? Is it correct to say that in this sense NuoDB is more similar to NoSQL solutions than to traditional RDBMSs?

Regards and thanks for the discussion


NuoDB is not eventually consistent. Every operation (read: transaction) does occur at a single instant, insofar as its reads can be ordered with respect to any other transaction’s writes. Each transaction ‘knows’ the set of visible records at transaction start because it ‘knows’ about all committed transactions visible at that time. NuoDB leverages MVCC to answer visibility questions as records are being read: when an active transaction reads a record, it reads the version that is visible to it, which may not be the most recent version. This is possible because NuoDB can determine the transaction that installed a version from the record version itself. Because a transaction ‘knows’ what transactions were committed when it started, it has a set of visible transaction results that it can choose from. Of course, in practical terms the system can’t lug around enormous lists of transactions, so the implementation does a lot of ‘accounting tricks’ to keep everything tight and fast.
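The version-selection walk described here can be roughly sketched as follows. All names are invented, and the real implementation replaces the explicit snapshot set with the 'accounting tricks' mentioned above.

```python
# Toy sketch of reading the visible version from a version chain.
class Version:
    """One entry in a record's version chain; each version remembers
    which transaction installed it."""
    def __init__(self, txn_id, value, previous=None):
        self.txn_id, self.value, self.previous = txn_id, value, previous

def visible_version(version, snapshot):
    """Walk back through the chain until a version installed by a
    transaction in the reader's committed-at-start snapshot is found."""
    while version is not None:
        if version.txn_id in snapshot:
            return version.value
        version = version.previous
    return None

v5 = Version(txn_id=10, value="old")                 # committed earlier
v6 = Version(txn_id=17, value="new", previous=v5)    # newer version
old_reader = visible_version(v6, snapshot={10, 12})  # txn 17 not visible
new_reader = visible_version(v6, snapshot={10, 17})  # txn 17 visible
```

Every reader gets a definite, ordered answer; no reader ever sees a half-propagated state, which is why this is not eventual consistency.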


Hi Trek,

MVCC deals with concurrency control, but the question is about consistency. For the database to be strictly consistent, when a commit is confirmed to the client, the changes must already have propagated to all replicas (to memory or disk). You say ‘This broadcast-before-commit process … all the changes will already be replicating around the database before the commit is even invoked’. How do you ensure that changes are propagated to all replicas before the commit is confirmed (if not using synchronous replication)?


Katarina, you raise an excellent question. First, I think I need to explain some things a little more clearly. NuoDB is a peer-to-peer system: unlike a more traditional DB, NuoDB’s nodes aren’t in a primary/replica configuration. All NuoDB nodes that have a copy of a particular atom can read AND write to it. Another important point is that the StorageManager nodes are the only nodes responsible for durability guarantees. When a transaction on a node changes an atom, that change is broadcast to all the peers (just as they broadcast their changes), and some of those peers will be storage managers. When a transaction commits, it has a choice of the number of durable copies it wants to be sure exist; this number can range from 0 to the number of storage managers. In general, if the transaction engine hasn’t heard from enough storage managers to confirm the desired durability level, commit will block until enough SMs report back. Note that ALL changes are always propagated to SMs in the order that the transaction made those changes. For more details on the durability layer of the NuoDB stack, have a look at:
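The commit-time durability check described here might be sketched like this (names hypothetical): commit is gated purely on how many storage-manager acknowledgements have arrived relative to the durability level the transaction asked for.

```python
# Toy sketch of the durable-copies commit gate; names are invented.
def commit_may_proceed(sm_acks, durable_copies_wanted):
    """Commit blocks until at least `durable_copies_wanted` storage
    managers have acknowledged holding the transaction's changes."""
    return len(sm_acks) >= durable_copies_wanted

# A transaction asking for 2 durable copies, with three SMs in the DB:
waiting = commit_may_proceed({"sm1"}, durable_copies_wanted=2)
done = commit_may_proceed({"sm1", "sm3"}, durable_copies_wanted=2)
```

Because the changes were already broadcast before commit, the acks are usually in flight by the time the client calls commit, which keeps the blocking window small in the common case.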

