Entropy is a byproduct of eventual consistency. In other words: although eventual consistency says a write will replicate to other nodes in time, there can be a bit of delay during which not all nodes contain the same value.
That difference is entropy, and so Riak has created several anti-entropy strategies (abbreviated as AE). We've already talked about how an R/W quorum can deal with differing values when write/read requests overlap at least one node. Riak can repair entropy, or allow you the option to do so yourself.
Riak has two basic strategies to address conflicting writes.
The most basic, and least reliable, strategy for curing entropy is called last write wins. It's the simple idea that the last write based on a node's system clock will overwrite an older one. This is currently the default behavior in Riak (by virtue of the allow_mult property defaulting to false). You can also set the last_write_wins property to true, which improves performance by never retaining vector clock history.
Realistically, this exists for speed and simplicity, when you really don't care about true order of operations, or the possibility of losing data. Since it's impossible to keep server clocks truly in sync (without the proverbial geosynchronized atomic clocks), this is a best guess as to what "last" means, to the nearest millisecond.
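To make the risk concrete, here is a minimal sketch of timestamp-based resolution under clock skew. The Write class, the last_write_wins function, and the 50 ms skew are hypothetical names and numbers chosen for illustration; this is not Riak's implementation.

```python
from dataclasses import dataclass


@dataclass
class Write:
    value: str
    timestamp_ms: int  # taken from the writing node's own system clock


def last_write_wins(current: Write, incoming: Write) -> Write:
    """Keep whichever write carries the larger timestamp."""
    return incoming if incoming.timestamp_ms >= current.timestamp_ms else current


# Assumption for the example: node A's clock runs 50 ms ahead of node B's.
first = Write("kale", timestamp_ms=1_000_050)   # written first, via node A
second = Write("milk", timestamp_ms=1_000_020)  # written 20 ms later, via node B

winner = last_write_wins(first, second)
print(winner.value)  # prints "kale": the write that really came last is dropped
```

The later write loses only because the node that accepted it had a slower clock, which is exactly the kind of silent data loss this strategy accepts in exchange for speed.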
As we saw under Concepts, vector clocks are Riak's way of tracking a true sequence of events of an object. Let's take a look at using vector clocks to allow for a more sophisticated conflict resolution approach than simply retaining the last-written value.
Siblings occur when you have conflicting values, with no clear way for Riak to know which value is correct. As of Riak 2.0, as long as you use a custom (not default) bucket type that isn't a datatype, conflicting writes should create siblings. This is a good thing, since it ensures no data is ever lost.
In the case where you forgo a custom bucket type, Riak will try to resolve these conflicts itself if the allow_mult parameter is configured to false. You should generally set your buckets to retain siblings, to be resolved by the client, by ensuring allow_mult is true.
curl -i -XPUT "$RIAK/types/default/buckets/cart/props" \
-H "Content-Type:application/json" \
-d '{"props":{"allow_mult":true}}'
Siblings arise in a couple of cases.
We used the second scenario to manufacture a conflict in the previous chapter when we introduced the concept of vector clocks, and we'll do so again here.
Imagine we create a shopping cart for a single refrigerator, but several people in a household are able to order food for it. Because losing orders would result in an unhappy household, Riak is using a custom bucket type, shopping, which keeps the default allow_mult=true.
First Casey (a vegan) places 10 orders of kale in the cart.
Casey writes [{"item":"kale","count":10}].
curl -i -XPUT "$RIAK/types/shopping/buckets/fridge/keys/97207?returnbody=true" \
-H "Content-Type:application/json" \
-d '[{"item":"kale","count":10}]'
HTTP/1.1 200 OK
X-Riak-Vclock: a85hYGBgzGDKBVIcypz/fgaUHjmTwZTImMfKsMKK7RRfFgA=
Vary: Accept-Encoding
Server: MochiWeb/1.1 WebMachine/1.9.0 (someone had painted...
Last-Modified: Thu, 01 Nov 2012 00:13:28 GMT
ETag: "2IGTrV8g1NXEfkPZ45WfAP"
Date: Thu, 01 Nov 2012 00:13:28 GMT
Content-Type: application/json
Content-Length: 28
[{"item":"kale","count":10}]
Note the opaque vector clock (via the X-Riak-Vclock header) returned by Riak. That same value will be returned with any read request issued for that key until another write occurs.
His roommate Mark reads the order and adds milk. In order to allow Riak to track the update history properly, Mark includes the most recent vector clock with his PUT.
Mark writes [{"item":"kale","count":10},{"item":"milk","count":1}].
curl -i -XPUT "$RIAK/types/shopping/buckets/fridge/keys/97207?returnbody=true" \
-H "Content-Type:application/json" \
-H "X-Riak-Vclock:a85hYGBgzGDKBVIcypz/fgaUHjmTwZTImMfKsMKK7RRfFgA=" \
-d '[{"item":"kale","count":10},{"item":"milk","count":1}]'
HTTP/1.1 200 OK
X-Riak-Vclock: a85hYGBgzGDKBVIcypz/fgaUHjmTwZTIlMfKcMaK7RRfFgA=
Vary: Accept-Encoding
Server: MochiWeb/1.1 WebMachine/1.9.0 (someone had painted...
Last-Modified: Thu, 01 Nov 2012 00:14:04 GMT
ETag: "62NRijQH3mRYPRybFneZaY"
Date: Thu, 01 Nov 2012 00:14:04 GMT
Content-Type: application/json
Content-Length: 54
[{"item":"kale","count":10},{"item":"milk","count":1}]
If you look closely, you'll notice that the vector clock changed with the second write request.
Now let's consider a third roommate, Andy, who loves almonds. Before Mark updated the shared cart with milk, Andy retrieved Casey's kale order and appended almonds. As with Mark, Andy's update includes the vector clock as it existed after Casey's original write.
Andy writes [{"item":"kale","count":10},{"item":"almonds","count":12}].
curl -i -XPUT "$RIAK/types/shopping/buckets/fridge/keys/97207?returnbody=true" \
-H "Content-Type:application/json" \
-H "X-Riak-Vclock:a85hYGBgzGDKBVIcypz/fgaUHjmTwZTImMfKsMKK7RRfFgA=" \
-d '[{"item":"kale","count":10},{"item":"almonds","count":12}]'
HTTP/1.1 300 Multiple Choices
X-Riak-Vclock: a85hYGBgzGDKBVIcypz/fgaUHjmTwZTInMfKoG7LdoovCwA=
Vary: Accept-Encoding
Server: MochiWeb/1.1 WebMachine/1.9.0 (someone had painted...
Last-Modified: Thu, 01 Nov 2012 00:24:07 GMT
ETag: "54Nx22W9M7JUKJnLBrRehj"
Date: Thu, 01 Nov 2012 00:24:07 GMT
Content-Type: multipart/mixed; boundary=Ql3O0enxVdaMF3YlXFOdmO5bvrs
Content-Length: 491
--Ql3O0enxVdaMF3YlXFOdmO5bvrs
Content-Type: application/json
Etag: 62NRijQH3mRYPRybFneZaY
Last-Modified: Thu, 01 Nov 2012 00:14:04 GMT
[{"item":"kale","count":10},{"item":"milk","count":1}]
--Ql3O0enxVdaMF3YlXFOdmO5bvrs
Content-Type: application/json
Etag: 7kfvPXisoVBfC43IiPKYNb
Last-Modified: Thu, 01 Nov 2012 00:24:07 GMT
[{"item":"kale","count":10},{"item":"almonds","count":12}]
--Ql3O0enxVdaMF3YlXFOdmO5bvrs--
Whoa! What's all that?
Since there was a conflict between what Mark and Andy set the fridge value to be, Riak kept both values.
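Why could Riak not simply pick one? A rough sketch of the comparison a vector clock makes possible is shown here. It assumes a simplified clock modeled as a plain dictionary from actor name to update count, with the roommates as actors; Riak's real vector clocks are opaque and tracked by the cluster, so treat every name in this block as illustrative.

```python
from typing import Dict

VClock = Dict[str, int]  # actor name -> number of updates seen from that actor


def increment(clock: VClock, actor: str) -> VClock:
    """Return a copy of `clock` with one more event recorded for `actor`."""
    updated = dict(clock)
    updated[actor] = updated.get(actor, 0) + 1
    return updated


def descends(a: VClock, b: VClock) -> bool:
    """True if clock `a` has seen everything clock `b` has."""
    return all(a.get(actor, 0) >= count for actor, count in b.items())


def concurrent(a: VClock, b: VClock) -> bool:
    """Neither clock descends from the other, so the writes conflict."""
    return not descends(a, b) and not descends(b, a)


casey = increment({}, "casey")   # Casey's original write
mark = increment(casey, "mark")  # built on Casey's clock
andy = increment(casey, "andy")  # also built on Casey's clock

print(descends(mark, casey))   # True: Mark's write supersedes Casey's
print(concurrent(mark, andy))  # True: no ordering exists, so both are kept
```

Mark's and Andy's clocks each contain an entry the other lacks, so neither write can be called the ancestor of the other, and both values have to be retained.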
Since we're using the HTTP client, Riak returned a 300 Multiple Choices code with a multipart/mixed MIME type. It's up to you to parse the results (or you can request a specific value by its Etag, also called a Vtag).
Issuing a plain get on the shopping/fridge/97207 key will also return the vtags of all siblings.
curl "$RIAK/types/shopping/buckets/fridge/keys/97207"
Siblings:
62NRijQH3mRYPRybFneZaY
7kfvPXisoVBfC43IiPKYNb
What can you do with this tag? You can request the value of a specific sibling by its vtag. To get the first sibling in the list (Mark's milk):
curl "$RIAK/types/shopping/buckets/fridge/keys/97207?vtag=62NRijQH3mRYPRybFneZaY"
[{"item":"kale","count":10},{"item":"milk","count":1}]
If you want to retrieve all sibling data, tell Riak that you'll accept the multipart message by adding -H "Accept:multipart/mixed".
curl "$RIAK/types/shopping/buckets/fridge/keys/97207" \
-H "Accept:multipart/mixed"
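If you do parse the multipart response yourself, one possible approach is sketched here using only Python's standard library. The parse_siblings helper is a hypothetical name, and the body is an abbreviated copy of the 300 response shown earlier (Last-Modified headers omitted); a real client library would normally handle this for you.

```python
import json
from email import message_from_string


def parse_siblings(content_type: str, body: str) -> dict:
    """Map each sibling's vtag (its Etag header) to its decoded JSON value."""
    # The email parser needs the top-level Content-Type to find the boundary.
    message = message_from_string(f"Content-Type: {content_type}\n\n{body}")
    return {
        part["Etag"]: json.loads(part.get_payload())
        for part in message.get_payload()
    }


# Content-Type header and abbreviated body from the 300 response.
content_type = "multipart/mixed; boundary=Ql3O0enxVdaMF3YlXFOdmO5bvrs"
body = "\n".join([
    "--Ql3O0enxVdaMF3YlXFOdmO5bvrs",
    "Content-Type: application/json",
    "Etag: 62NRijQH3mRYPRybFneZaY",
    "",
    '[{"item":"kale","count":10},{"item":"milk","count":1}]',
    "--Ql3O0enxVdaMF3YlXFOdmO5bvrs",
    "Content-Type: application/json",
    "Etag: 7kfvPXisoVBfC43IiPKYNb",
    "",
    '[{"item":"kale","count":10},{"item":"almonds","count":12}]',
    "--Ql3O0enxVdaMF3YlXFOdmO5bvrs--",
    "",
])

siblings = parse_siblings(content_type, body)
for vtag, cart in siblings.items():
    print(vtag, cart)
```

Keying the result by vtag keeps each value traceable to the sibling it came from, which is handy when deciding how to resolve the conflict.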
Aside: Use-Case Specific?
When siblings are created, it's up to the application to know how to deal with the conflict. In our example, do we want to accept only one of the orders? Should we remove both milk and almonds and only keep the kale? Should we calculate the cheaper of the two and keep the cheapest option? Should we merge all of the results into a single order? This is why we asked Riak not to resolve this conflict automatically... we want this flexibility.
When we have conflicting writes, we want to resolve them. Since that problem is typically use-case specific, Riak defers it to us, and our application must decide how to proceed.
For our example, let's merge the values into a single result set, taking the larger count if the item is the same. When done, write the new results back to Riak with the vclock of the multipart object, so Riak knows you're resolving the conflict, and you'll get back a new vector clock.
Successive reads will receive a single (merged) result.
curl -i -XPUT "$RIAK/types/shopping/buckets/fridge/keys/97207?returnbody=true" \
-H "Content-Type:application/json" \
-H "X-Riak-Vclock:a85hYGBgzGDKBVIcypz/fgaUHjmTwZTInMfKoG7LdoovCwA=" \
-d '[{"item":"kale","count":10},{"item":"milk","count":1},{"item":"almonds","count":12}]'
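The merge rule described above (union of the items, keeping the larger count when an item appears in more than one sibling) might look like the following on the client side. This is only a sketch: merge_carts is a hypothetical helper, and the two sibling values are copied from the conflict created earlier.

```python
from typing import Dict, List

Cart = List[dict]  # each entry looks like {"item": "kale", "count": 10}


def merge_carts(siblings: List[Cart]) -> Cart:
    """Union the items of every sibling, keeping the largest count per item."""
    counts: Dict[str, int] = {}
    for cart in siblings:
        for entry in cart:
            item, count = entry["item"], entry["count"]
            counts[item] = max(counts.get(item, 0), count)
    # Dictionaries preserve insertion order, so items appear as first seen.
    return [{"item": item, "count": count} for item, count in counts.items()]


marks_cart = [{"item": "kale", "count": 10}, {"item": "milk", "count": 1}]
andys_cart = [{"item": "kale", "count": 10}, {"item": "almonds", "count": 12}]

merged = merge_carts([marks_cart, andys_cart])
print(merged)
```

The merged list is what gets written back, together with the vector clock from the multipart response, so that Riak treats the write as the resolution of both siblings.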
Your data and your business needs will dictate which approach to conflict resolution is appropriate. You don't need to choose one strategy globally; instead, feel free to take advantage of Riak's buckets to specify which data uses siblings and which blindly retains the last value written.
A quick recap of the two configuration values you'll want to set:
allow_mult defaults to false for buckets in the default bucket type, which means that the last write wins.
Setting allow_mult to true instructs Riak to retain conflicting writes as siblings.
last_write_wins defaults to false, which (perhaps counter-intuitively) still can mean that the behavior is last write wins: allow_mult is the key parameter for the behavioral toggle.
Setting last_write_wins to true will optimize writes by assuming that previous vector clocks have no inherent value.
Setting both allow_mult and last_write_wins to true is unsupported and will result in undefined behavior.
When a successful read happens, but not all replicas agree upon the value, this triggers a read repair. This means that Riak will update the replicas with the most recent value. This can happen either when an object is not found (the vnode has no copy) or a vnode contains an older value (older means that it is an ancestor of the newest vector clock). Unlike last_write_wins or manual conflict resolution, read repair is (obviously, I hope, by the name) triggered by a read, rather than a write.
If your nodes get out of sync (for example, if you increase the n_val on a bucket), you can force read repair by performing a read operation for all of that bucket's keys. They may return with not found the first time, but later reads will pull the newest values.
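The read repair idea can be sketched in a few lines. The model below is an assumption made for illustration: replicas are a dictionary of vnode name to a (clock, value) pair, None stands for "not found", clocks are simplified dictionaries rather than Riak's opaque vector clocks, and the repair happens inline rather than in the background. Concurrent copies (siblings) are deliberately left out.

```python
from typing import Dict, Optional, Tuple

VClock = Dict[str, int]
Replica = Optional[Tuple[VClock, str]]  # None models "not found" on that vnode


def descends(a: VClock, b: VClock) -> bool:
    """True if clock `a` has seen everything clock `b` has."""
    return all(a.get(actor, 0) >= n for actor, n in b.items())


def read_with_repair(replicas: Dict[str, Replica]) -> Optional[str]:
    """Return the newest value and overwrite missing or ancestor copies."""
    found = [copy for copy in replicas.values() if copy is not None]
    if not found:
        return None
    # The newest copy is one whose clock descends from every other clock.
    newest = next(
        (c for c in found if all(descends(c[0], o[0]) for o in found)), None
    )
    if newest is None:
        raise ValueError("concurrent copies (siblings) are not handled here")
    for vnode, copy in replicas.items():
        if copy is None or copy[0] != newest[0]:
            replicas[vnode] = newest  # repair the stale or missing replica
    return newest[1]


replicas: Dict[str, Replica] = {
    "vnode1": ({"casey": 1, "mark": 1}, "kale, milk"),
    "vnode2": ({"casey": 1}, "kale"),  # ancestor of vnode1's copy
    "vnode3": None,                    # not found
}
value = read_with_repair(replicas)
print(value)     # prints "kale, milk"
print(replicas)  # all three vnodes now hold the newest copy
```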
Although resolving conflicting data during get requests via read repair is sufficient for most needs, data which is never read can eventually be lost as nodes fail and are replaced.
Riak supports active anti-entropy (AAE) to proactively identify and repair inconsistent data. This feature is also helpful for recovering from data loss in the event of disk corruption or administrative error.
The overhead for this functionality is minimized by maintaining sophisticated hash trees ("Merkle trees") which make it easy to compare data sets between vnodes, but if desired the feature can be disabled.
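To see why hash trees keep that comparison cheap, consider this simplified two-level sketch. It is not Riak's AAE implementation: the segment count, the function names, and the use of SHA-256 are all assumptions picked for the example. The point it illustrates is that matching root hashes settle the question in one comparison, and a mismatch narrows the search to a few segments.

```python
import hashlib
from typing import Dict, List, Set, Tuple

SEGMENTS = 8  # hypothetical segment count, kept small for illustration


def _digest(data: str) -> str:
    return hashlib.sha256(data.encode("utf-8")).hexdigest()


def segment_of(key: str) -> int:
    """Assign a key to a segment based on its hash."""
    return int(_digest(key), 16) % SEGMENTS


def build_tree(store: Dict[str, str]) -> Tuple[str, List[str]]:
    """Two-level hash tree: one hash per key segment plus a root over them."""
    buckets: List[List[str]] = [[] for _ in range(SEGMENTS)]
    for key in sorted(store):
        buckets[segment_of(key)].append(f"{key}={_digest(store[key])}")
    leaves = [_digest("\n".join(bucket)) for bucket in buckets]
    return _digest("".join(leaves)), leaves


def differing_keys(a: Dict[str, str], b: Dict[str, str]) -> Set[str]:
    """Find keys whose values differ, comparing hashes before data."""
    (root_a, leaves_a), (root_b, leaves_b) = build_tree(a), build_tree(b)
    if root_a == root_b:
        return set()  # a single comparison shows the replicas agree
    suspect = {i for i in range(SEGMENTS) if leaves_a[i] != leaves_b[i]}
    # Only keys in mismatched segments need a key-by-key comparison.
    return {
        key
        for key in set(a) | set(b)
        if segment_of(key) in suspect and a.get(key) != b.get(key)
    }


vnode_a = {"97207": "kale, milk, almonds", "97208": "bread"}
vnode_b = {"97207": "kale", "97208": "bread"}
print(differing_keys(vnode_a, vnode_b))  # {'97207'}
```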