A Little Riak Book for LFE

Clusters

Up to this point you've conceptually read about "clusters" and the "Ring" in nebulous summations. What exactly do we mean, and what are the practical implications of these details for Riak developers and operators?

A cluster in Riak is a managed collection of nodes that share a common Ring.

The Ring

The Ring in Riak is actually a two-fold concept.

Firstly, the Ring represents the consistent hash partitions (the partitions managed by vnodes). This partition range is treated as circular, from 0 to 2^160-1 and back to 0 again. (If you're wondering, yes, this means that we are limited to 2^160 nodes, which caps us at a 1.46 quindecillion, or 1.46 x 10^48, node cluster. For comparison, there are only about 1.92 x 10^49 silicon atoms on Earth.)
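
That 2^160 figure is easy to check for yourself; any Erlang shell will happily compute it as a bignum:

1> 1 bsl 160.
1461501637330902918203684832716283019655932542976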

When we consider replication, the N value defines how many nodes an object is replicated to. Riak makes a best attempt at spreading those copies across as many distinct nodes as it can: it copies the object to the next N adjacent partitions, starting with the primary partition and counting around the Ring; if it reaches the last partition, it loops back around to the first one.
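
If that wrap-around walk sounds abstract, here is a minimal sketch in Erlang (the ring_walk module and preference_partitions function are hypothetical names of my own; Riak's real implementation lives in riak_core's chash module):

-module(ring_walk).
-export([preference_partitions/3]).

%% Given a key's hash, the sorted list of partition start values, and N,
%% return the N partitions that follow the hash, wrapping past the end
%% of the Ring back to the beginning.
preference_partitions(KeyHash, Partitions, N) ->
    {Smaller, Larger} = lists:partition(fun(P) -> P =< KeyHash end, Partitions),
    lists:sublist(Larger ++ Smaller, N).

We'll watch Riak do exactly this with real partition numbers in the preflist example below.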

Secondly, the Ring is also used as a shorthand for describing the state of the circular hash ring I just mentioned. This Ring (aka Ring State) is a data structure that gets passed around between nodes, so each knows the state of the entire cluster. Which node manages which vnodes? If a node gets a request for an object managed by other nodes, it consults the Ring and forwards the request to the proper nodes. It's a local copy of a contract that all of the nodes agree to follow.

Obviously, this contract needs to stay in sync between all of the nodes. If a node is permanently taken offline or a new one added, the other nodes need to readjust, balancing the partitions around the cluster, then updating the Ring with this new structure. This Ring state gets passed between the nodes by means of a gossip protocol.

Gossip and CMD

Riak has two methods of keeping nodes current on the state of the Ring. The first, and oldest, is the gossip protocol. If a node's state in the cluster is altered, information is propagated to other nodes. Periodically, nodes will also send their status to a random peer for added consistency.

A newer method of information exchange in Riak is cluster metadata (CMD), which uses a more sophisticated method (a Plumtree epidemic broadcast tree with dotted version vector (DVV) consistent state) to pass large amounts of metadata between nodes. The superiority of CMD is one of the benefits of using bucket types in Riak 2.0, discussed below.

In both cases, propagating changes in the Ring is an asynchronous operation, and can take a couple of minutes depending on Ring size.
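
You can ask any node what it currently believes about cluster membership: the riak-admin tool that ships alongside bin/riak has a member-status command for exactly this (one handy way to watch a membership change propagate).

$ bin/riak-admin member-status

It prints each member's status (valid, joining, leaving, and so on) along with that node's share of the Ring.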

How Replication Uses the Ring

Even if you are not a programmer, it's worth taking a look at this Ring example. It's also worth remembering that partitions are managed by vnodes, and in conversation the two terms are sometimes used interchangeably, though I'll try to be more precise here.

Let's start with Riak configured to have 8 partitions, which are set via ring_size in the etc/riak.conf file (we'll dig deeper into this file later).

## Number of partitions in the cluster (only valid when first
## creating the cluster). Must be a power of 2, minimum 8 and maximum
## 1024.
## 
## Default: 64
## 
## Acceptable values:
##   - an integer
ring_size = 8

In this example, I have a total of 4 Riak nodes running on [email protected], [email protected], [email protected], and [email protected], each with two partitions (and thus two vnodes).

Riak has the amazing, and dangerous, attach command that attaches an Erlang console to a live Riak node, with access to all of the Riak modules.

The riak_core_ring:chash(Ring) function extracts the total count of partitions (8), along with a list of pairs: the number marking the start of each partition (some fraction of the 2^160 range) and the name of the node that manages it.

$ bin/riak attach
([email protected])1> {ok,Ring} = riak_core_ring_manager:get_my_ring().
([email protected])2> riak_core_ring:chash(Ring).
{8,
 [{0,'[email protected]'},
  {182687704666362864775460604089535377456991567872, '[email protected]'},
  {365375409332725729550921208179070754913983135744, '[email protected]'},
  {548063113999088594326381812268606132370974703616, '[email protected]'},
  {730750818665451459101842416358141509827966271488, '[email protected]'},
  {913438523331814323877303020447676887284957839360, '[email protected]'},
  {1096126227998177188652763624537212264741949407232, '[email protected]'},
  {1278813932664540053428224228626747642198940975104, '[email protected]'}]}
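
Notice the spacing: with 8 partitions, each one claims an equal 1/8th slice of the 160-bit space, so every start value is a multiple of (1 bsl 160) div 8. You can confirm this from any Erlang shell:

1> (1 bsl 160) div 8.
182687704666362864775460604089535377456991567872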

To discover which partition the bucket/key food/favorite object would be stored in, for example, we execute riak_core_util:chash_key({<<"food">>, <<"favorite">>}) and get a wacky 160-bit Erlang number we named DocIdx (document index).

Just to illustrate that this Erlang binary value is a real number, the next line converts it to a more readable format, similar to the Ring partition numbers.

([email protected])3> DocIdx = 
([email protected])3> riak_core_util:chash_key({<<"food">>,<<"favorite">>}).
<<80,250,1,193,88,87,95,235,103,144,152,2,21,102,201,9,156,102,128,3>>

([email protected])4> <<I:160/integer>> = DocIdx, I.
462294600869748304160752958594990128818752487427
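
Where does that number come from? Under the hood (assuming the default chash_keyfun bucket property, judging from riak_core's chash module), chash_key is simply the SHA-1 hash of the serialized bucket/key pair, so the following should produce the same 20-byte binary as DocIdx above:

1> crypto:hash(sha, term_to_binary({<<"food">>,<<"favorite">>})).
<<80,250,1,193,88,87,95,235,103,144,152,2,21,102,201,9,156,102,128,3>>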

With this DocIdx number, we can order the partitions, starting with the first partition number greater than DocIdx. The remaining partitions follow in numerical order; after the highest partition we loop back around to zero and continue until the list is exhausted.

([email protected])5> Preflist = riak_core_ring:preflist(DocIdx, Ring).
[{548063113999088594326381812268606132370974703616, '[email protected]'},
 {730750818665451459101842416358141509827966271488, '[email protected]'},
 {913438523331814323877303020447676887284957839360, '[email protected]'},
 {1096126227998177188652763624537212264741949407232, '[email protected]'},
 {1278813932664540053428224228626747642198940975104, '[email protected]'},
 {0,'[email protected]'},
 {182687704666362864775460604089535377456991567872, '[email protected]'},
 {365375409332725729550921208179070754913983135744, '[email protected]'}]

So what does all this have to do with replication? With the above list, we simply replicate a write down the list N times. If we set N=3, then the food/favorite object will be written to [email protected]'s partition 5480631... (I've truncated the numbers here), [email protected]'s partition 7307508..., and [email protected]'s partition 9134385....
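
Back in the attached console, picking those primaries is nothing more than taking the first N entries of the Preflist we bound above:

([email protected])6> lists:sublist(Preflist, 3).
[{548063113999088594326381812268606132370974703616, '[email protected]'},
 {730750818665451459101842416358141509827966271488, '[email protected]'},
 {913438523331814323877303020447676887284957839360, '[email protected]'}]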

If something has happened to one of those nodes, like a network split (confusingly also called a partition---the "P" in "CAP"), the remaining active nodes in the list become candidates to hold the data.

So if the node coordinating the write could not reach node [email protected] to write to partition 7307508..., it would then attempt to write that partition 7307508... to [email protected] as a fallback (it's the next node in the preflist after the 3 primaries).

The way that the Ring is structured allows Riak to ensure data is always written to the appropriate number of physical nodes, even in cases where one or more physical nodes are unavailable. It does this by simply trying the next available node in the preflist.
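
Riak packages this skip-the-unreachable-nodes logic in the riak_core_apl module (apl stands for "active preference list"). Continuing the same console session, here is a sketch of how you might inspect it (the get_apl/3 arguments shown match the riak_core sources I've read, but double-check them against your release):

([email protected])7> riak_core_apl:get_apl(DocIdx, 3, riak_kv).

With all four nodes up, this returns the same three primaries as lists:sublist(Preflist, 3); if [email protected] were unreachable, its entry would be replaced by a fallback vnode on the next reachable node in the list.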

Hinted Handoff

When a node goes down, data is replicated to a fallback node. This is not permanent; Riak will periodically examine whether each vnode resides on the correct physical node and hand it off to the proper node when possible.

As long as the fallback node cannot connect to the primary, it will continue to accept write and read requests on behalf of its incapacitated brethren.

Hinted handoff not only helps Riak achieve high availability, it also facilitates data migration when physical nodes are added to or removed from the Ring.
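
If you want to watch handoff happen, riak-admin can report on it; the transfers command lists partitions awaiting handoff and any transfers currently in flight, so you can confirm that a recovered node is getting its data back:

$ bin/riak-admin transfers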