While assembling the Writing Applications chapter, I (John) tried to develop a data model that would allow for reliable locking without strong consistency. Although that attempt failed, I thought it would be more useful to include it here, to illustrate the complexities of coping with eventual consistency, than to throw it away entirely.
Basic premise: multiple workers may be assigned datasets to process, but each dataset should be assigned to no more than one worker.
In the absence of strong consistency, the best an application can do is to use the pr and pw (primary read and primary write) parameters with a value of quorum or n_val.
Common features to all of these models: each worker checks for an existing lock with a pr=quorum read and attempts to claim the lock for itself with a pw=quorum write.

Model #1

* Bucket: allow_mult=false
* Sequence:
    1. pr=quorum read to determine whether a lock exists
    2. pw=quorum write to claim the lock
* Failure scenario: two workers read at nearly the same time, each finds no lock, and each writes its claim. With siblings disabled, the later write silently replaces the earlier one, and both workers proceed as though they own the dataset.
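To make the sequence concrete, here is a minimal sketch of Model #1. The Python riak client, the dataset_locks bucket name, and the worker/dataset identifiers are assumptions for illustration, not part of the original exercise.

```python
# Minimal sketch of Model #1: check for a lock, then claim it.
# Assumes the Basho "riak" Python client and a local cluster, with the
# dataset_locks bucket left at allow_mult=false.
import riak

client = riak.RiakClient()              # defaults to a local node
locks = client.bucket('dataset_locks')  # hypothetical bucket name

def try_lock(dataset_id, worker_id):
    # Step 1: pr=quorum read to determine whether a lock exists
    existing = locks.get(dataset_id, pr='quorum')
    if existing.exists:
        return False                    # someone else appears to hold it

    # Step 2: pw=quorum write to claim the lock
    locks.new(dataset_id, data=worker_id).store(pw='quorum')
    return True                         # we *believe* we hold the lock
```

The race lives in the gap between the two steps: nothing stops two workers from both passing the existence check before either write arrives.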
Model #2

* Bucket: allow_mult=false
* Sequence:
    1. pr=quorum read to determine whether a lock exists
    2. pw=quorum write to claim the lock
    3. pr=quorum read to verify that the claim succeeded
* Failure scenario: worker #1 writes its claim and reads it back before worker #2's write arrives, so it sees only its own value and proceeds; worker #2's write then overwrites the key, its verification read returns its own value, and it proceeds as well. If you've done any programming with threads before, you'll recognize this as a common problem with non-atomic lock operations.
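Continuing the sketch above (same assumed client and bucket), the verification read narrows the window but cannot close it:

```python
def try_lock_with_verify(dataset_id, worker_id):
    # Step 1: pr=quorum read to determine whether a lock exists
    if locks.get(dataset_id, pr='quorum').exists:
        return False

    # Step 2: pw=quorum write to claim the lock
    locks.new(dataset_id, data=worker_id).store(pw='quorum')

    # Step 3: pr=quorum read to verify the claim. With allow_mult=false a
    # competing write may simply not have arrived yet, so this check can
    # still report success to two workers at once.
    return locks.get(dataset_id, pr='quorum').data == worker_id
```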
Model #3

* Bucket: allow_mult=true
* Sequence:
    1. pr=quorum read to determine whether a lock exists
    2. pw=quorum write to claim the lock
    3. pr=quorum read to check for siblings
* Failure scenario: both workers write their claims, and each reads the key back before the other's write arrives; each sees only a single value, concludes there is no conflict, and proceeds.
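A sketch of the sibling check, assuming a separate bucket with allow_mult=true; the bucket name and client calls remain illustrative:

```python
locks_mult = client.bucket('dataset_locks_mult')   # hypothetical bucket
locks_mult.set_property('allow_mult', True)        # keep conflicting claims

def try_lock_with_siblings(dataset_id, worker_id):
    # Step 1: pr=quorum read to determine whether a lock exists
    if locks_mult.get(dataset_id, pr='quorum').exists:
        return False

    # Step 2: pw=quorum write to claim the lock
    locks_mult.new(dataset_id, data=worker_id).store(pw='quorum')

    # Step 3: pr=quorum read to check for siblings. If the competing write
    # hasn't arrived yet, we see only our own claim and wrongly conclude
    # that we won.
    return len(locks_mult.get(dataset_id, pr='quorum').siblings) == 1
```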
Model #4

* Bucket: allow_mult=true
* Sequence:
    1. pr=quorum read to determine whether a lock exists
    2. pw=quorum write to claim the lock, including a timestamp in the value
    3. pr=quorum read to check for siblings, with the earliest timestamp winning any conflict
* Failure scenario: worker #2 attempts its write first, yet worker #1's final read shows its own claim winning (or shows no competing claim at all), so worker #1 keeps the dataset; worker #2's comparison tells it the same thing, and both workers process the dataset.
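A sketch of the timestamp tie-break, continuing the allow_mult=true bucket from the previous sketch. The JSON claim format and time.time() timestamps are illustrative assumptions, and the comparison in step 3 is precisely what the following paragraphs call into question:

```python
import time

def try_lock_with_timestamps(dataset_id, worker_id):
    # Step 1: pr=quorum read to determine whether a lock exists
    if locks_mult.get(dataset_id, pr='quorum').exists:
        return False

    # Step 2: pw=quorum write of the worker's ID and a timestamp
    claim = {'worker': worker_id, 'ts': time.time()}
    locks_mult.new(dataset_id, data=claim).store(pw='quorum')

    # Step 3: pr=quorum read; among any siblings, the earliest timestamp
    # wins. This assumes the competing claim is visible by now and that
    # the two clocks are comparable; neither is guaranteed.
    siblings = locks_mult.get(dataset_id, pr='quorum').siblings
    winner = min((s.data for s in siblings), key=lambda c: c['ts'])
    return winner['worker'] == worker_id
```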
At this point I may hear you grumbling: clearly worker #2 would have the lower timestamp because it attempted its write first, and thus #1 would skip the dataset and try another one.
Even if both workers are running on the same server (and thus probably have timestamps that can be compared)[^clock-comparisons], perhaps worker #1 started its write earlier but contacted an overloaded cluster member that took longer to process the request.
[^clock-comparisons]: An important part of any distributed systems discussion is the fact that clocks are inherently untrustworthy, and thus calling any event the "last" to occur is an exercise in faith: faith that a system clock wasn't set backwards in time by ntpd or an impatient system administrator, and faith that all clocks involved are in perfect sync. And, to be clear, perfect synchronization of clocks across multiple systems is unattainable. Google is attempting to solve this by purchasing lots of atomic and GPS clocks at great expense, and even that only narrows the margin of error.
The same failure could occur if, instead of timestamps, the ID of each worker were used for comparison. All it takes is for one worker to read its own write before another worker's write request arrives.
We can certainly come up with algorithms that limit the number of times that multiple workers tackle the same job, but I have yet to find one that guarantees exclusion.
What I found surprising about this exercise is that none of the
failure scenarios required any of the odder edge conditions that can
cause unexpected outcomes. For example, pw=quorum
writes will return
an error to the client if 2 of the primary servers are not available,
but the value will still be written to the 3rd server and 2 fallback
servers. Predicting what will happen the next time someone tries to
read the key is challenging.
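To sketch why that edge condition is awkward to handle from the client side (the RiakError exception and the locks bucket are assumptions carried over from the earlier sketches):

```python
from riak import RiakError

claim = locks.new('dataset-42', data='worker-7')
try:
    claim.store(pw='quorum')
except RiakError:
    # An error here does not mean the write had no effect: with two
    # primaries down, the value can still land on the remaining primary
    # and two fallbacks, and may surface on a later read once the
    # primaries return. Retrying, giving up, and reading back are all
    # ambiguous responses.
    pass
```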
None of these algorithms required deletion of a value, but that is particularly fraught with peril. It's not difficult to construct scenarios in which deleted values reappear if servers are temporarily unavailable during the deletion request.