A Little Riak Book for LFE

Data modeling

It can be hard to think outside the table, but once you do, you may find interesting patterns to use in any database, even a relational one.^{sql-databases}

^{sql-databases}. Feel free to use a relational database when you're ↩

willing to sacrifice the scalability, performance, and availability of Riak...but why would you?

If you thoroughly absorbed the earlier content, some of this may feel redundant, but the implications of the key/value model are not always obvious.

Rules to live by

As with most such lists, these are guidelines rather than hard rules, but take them seriously.

(@keys) Know your keys.

The cardinal rule of any key/value datastore: the fastest way to get
data is to know what to look for, which means knowing which key you want.

How do you pull that off? Well, that's the trick, isn't it?

The best way to always know the key you want is to be able to
programmatically reproduce it based on information you already
have. Need to know the sales data for one of your client's
magazines in December 2013? Store it in a **sales** bucket and
name the key after the client, magazine, and month/year combo.

Guess what? Retrieving it will be much faster than running a SQL
`SELECT *` statement in a relational database.

And if it turns out that the magazine didn't exist yet, and there
are no sales figures for that month? No problem. A negative
response, especially for immutable data, is among the fastest
operations Riak offers.

Because keys are only unique within a bucket, the same unique
identifier can be used in different buckets to represent different
information about the same entity (e.g., a customer address might
be in an `address` bucket with the customer id as its key, whereas
the customer id as a key in a `contacts` bucket would presumably
contain contact information).

(@namespace) Know your namespaces.

Riak has several levels of namespaces when storing data.

Historically, buckets have been what most thought of as Riak's
virtual namespaces.

The newest level is provided by **bucket types**, introduced in Riak 2.0, which
allow you to group buckets for configuration and security purposes.

Less obviously, keys are their own namespaces. If you want a
hierarchy for your keys that looks like `sales/customer/month`,
you don't need nested buckets: you just need to name your keys
appropriately, as discussed in (@keys). `sales` can be your
bucket, while each key is prepended with customer name and month.

(@views) Know your queries.

Writing data is cheap. Disk space is cheap. Dynamic queries in Riak
are very, very expensive.

As your data flows into the system, generate the views you're going to
want later. That magazine sales example from (@keys)? The December
sales numbers are almost certainly aggregates of smaller values, but
if you know in advance that monthly sales numbers are going to be
requested frequently, when the last data arrives for that month the
application can assemble the full month's statistics for later
retrieval.

Yes, getting accurate business requirements is non-trivial, but
many Riak applications are version 2 or 3 of a system, written
once the business discovered that the scalability of MySQL,
Postgres, or MongoDB simply wasn't up to the job of handling their
growth.

(@small) Take small bites.

Remember your parents' advice over dinner? They were right.

When creating objects that will be updated, constrain their scope
and keep the number of contained elements low to reduce the odds
of multiple clients attempting to update the data concurrently.

(@indexes) Create your own indexes.

Riak offers metadata-driven secondary indexes (2i) and full-text indexes
(Riak Search) for values, but these face scaling challenges: in
order to identify all objects for a given index value, roughly a
third of the cluster must be involved.

For many use cases, creating your own indexes is straightforward
and much faster/more scalable, since you'll be managing and
retrieving a single object.

See [Conflict Resolution](#conflict-resolution) for more discussion of this.

(@immutable) Embrace immutability.

As we discussed in [Mutability], immutable data offers a way out
of some of the challenges of running a high-volume, high-velocity
datastore.

If possible, segregate mutable from non-mutable data, ideally
using different buckets for [request tuning][Request tuning].

[Datomic](http://www.datomic.com) is a unique data storage system
that leverages immutability for all data, with Riak commonly used
as a backend datastore. It treats any data item in its system as
a "fact," to be potentially superseded by later facts but never
updated.

(@hybrid) Don't fear hybrid solutions.

As much as we would all love to have a database that is an excellent
solution for any problem space, we're a long way from that goal.

In the meantime, it's a perfectly reasonable (and very common)
approach to mix and match databases for different needs. Riak is
very fast and scalable for retrieving keys, but it's decidedly
suboptimal at ad hoc queries. If you can't model your way out of
that problem, don't be afraid to store keys alongside searchable
metadata in a relational or other database that makes querying
simpler, and once you have the keys you need, grab the values
from Riak.

Just make sure that you consider failure scenarios when doing so;
it would be unfortunate to compromise Riak's availability by
rendering it useless when your other database is offline.

A Little Riak Book for LFE

Data modeling

Rules to live by

Further reading