Datomic

25 September 2012 21:30

I generally have a lot of things to say about DBMSs in general, but here I'm going to concentrate on Datomic's features, and try to leave the comparison with other DBMSs to a more general post on the subject (that post has been in progress for many years now).

Datomic is a new DBMS based on Rich Hickey's ideas. It does a few things fundamentally differently to other DBMSs. One is the architecture: to be ACID, all transactions go through a single transactor node. What's unusual is that nothing else needs to happen on that node. Data from there goes to storages. The application runs elsewhere, and any queries run on those application nodes. Nodes access the storages to download and cache any data required for the queries given. See these slides for pretty pictures which may help clarify things.

Like quite a few modern DBMSs, Datomic uses append-only storage. However, Datomic really takes advantage of this. Every transaction corresponds to a version of the database, and that version can be queried for eternity. So if someone asks "What was the customer's address at the time the order was placed?", you just need to find the transaction id when the order was created, and look at the address record at that transaction. It's also possible to add other data to transactions beyond the default transaction id and time; the most obvious to me is the user responsible for the changes.
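The idea can be sketched with a toy in-memory datom log. This is plain Python standing in for Datomic, not Datomic's actual implementation; the attribute names and the `as_of` helper are invented for illustration:

```python
# A toy append-only log of (entity, attribute, value, transaction) tuples.
log = []
tx_counter = 0

def transact(facts):
    """Append facts as datoms; return the transaction id."""
    global tx_counter
    tx_counter += 1
    for entity, attribute, value in facts:
        log.append((entity, attribute, value, tx_counter))
    return tx_counter

def as_of(tx):
    """The value of every (entity, attribute) pair as of transaction tx."""
    state = {}
    for e, a, v, t in log:
        if t <= tx:
            state[(e, a)] = v
    return state

transact([("customer-1", ":customer/address", "1 Old Street")])
order_tx = transact([("order-1", ":order/customer", "customer-1")])
transact([("customer-1", ":customer/address", "2 New Road")])

# The address at the time the order was placed:
print(as_of(order_tx)[("customer-1", ":customer/address")])  # → 1 Old Street
```

Because nothing is ever overwritten, the answer to the "address at order time" question is just a lookup against an old transaction id.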

Having all this built into the database sounds very powerful. What percentage of database applications end up with a history table mirroring every data table? I'm willing to bet it's significant, and it's certainly not fun.

What about storage limits?

Since people always get obsessed about performance, let's briefly discuss the amount of data we're going to store. Suppose we're writing an app for a company with a large data entry department, say 100 employees. Every day each of those employees enters details from 50 forms, each of which contains 100 integers and 2 strings averaging 1000 characters. That's roughly (1000×2 + 100×8) × 50 × 200 × 100 = 2,800,000,000 bytes, or 2.8GB per year. Add an overhead for structure and storage of about 4×, and we get about 10GB/year. DNUK will currently charge me £1380 to put 128GB of RAM in a server. For some strange reason people still don't like selling servers with SSDs in them (meaning that for many tasks they can't compete with my Macbook Air), but anyway 1TB of SSD is about £400.
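The arithmetic, for the sceptical (200 here is working days per year):

```python
# Back-of-envelope storage estimate: 100 clerks, 50 forms/day, 200 days/year.
bytes_per_form = 1000 * 2 + 100 * 8   # two 1000-char strings + 100 8-byte ints
per_year = bytes_per_form * 50 * 200 * 100
print(per_year)               # 2,800,000,000 bytes raw
print(per_year * 4 / 1e9)     # ~11 GB/year with a 4x structure overhead
```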

My point here is that, for human-entered data, there is no reason to have update-in-place unless you're designing a system to record all the items purchased at Tesco, in which case, why are you reading my blog? For the vast majority of systems ever developed, storage is not going to be an issue. Of course if you're planning on developing the next Facebook you might want to choose something else, but I suggest waiting until you look like you might have a problem. At that point you can probably afford a big enough team to address the issue.

Okay, so how does it work?

Schemas

Datomic could be described as an entity-attribute-value-transaction database. A database has a schema determining which attributes can be set on entities, what type each holds, and whether its cardinality is one or many. Since any entity can have any attribute, this makes data modelling very flexible, and you don't get stuck with the old pain of mapping polymorphic objects to SQL. Let me say that again: the schema specifies which attributes are valid; there is no constraint on which attributes an entity can have. If you want SQL-like tables then you could stick a :table attribute on all entities.

[{:db/id #db/id[:db.part/db]
  :db/ident :account/name
  :db/valueType :db.type/string
  :db/cardinality :db.cardinality/one
  :db/doc "The account holder's name"
  :db.install/_attribute :db.part/db}

 {:db/id #db/id[:db.part/db]
  :db/ident :account/balance
  :db/valueType :db.type/long
  :db/cardinality :db.cardinality/one
  :db/doc "The account balance"
  :db.install/_attribute :db.part/db}]

Adding data

To add data to the database, you commit facts as datoms. A datom is the addition or retraction of an (entity, attribute, value, transaction) tuple.

When designing a schema, there will obviously be times when there is more than one way to do it. One example I came across was flags. If you have an entity representing an order, you may want a flag to say whether the order has shipped. There are two simple ways to do this: either add a :shipped attribute only when the order ships, or add the attribute set to false when the entity is created and change it to true when the order ships. It turns out that for cases where the flag can go from true back to false, it's better to have it always present: it is then easy to query for the transaction in which it was set to its current state.
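To see why the always-present flag wins, here's a toy sketch (plain Python standing in for the datom log; the attribute names are invented):

```python
# Each datom records (entity, attribute, value, transaction).  With the flag
# always asserted, the transaction in which it last changed is right there.
datoms = [
    ("order-1", ":order/shipped", False, 100),  # order created, flag asserted
    ("order-1", ":order/shipped", True,  105),  # order shipped
    ("order-1", ":order/shipped", False, 110),  # order returned to us
]

def flag_last_set(entity, attribute):
    """Current value of the flag and the transaction that set it."""
    current = max((d for d in datoms if d[0] == entity and d[1] == attribute),
                  key=lambda d: d[3])
    return current[2], current[3]

print(flag_last_set("order-1", ":order/shipped"))  # → (False, 110)
```

With the add-only-when-shipped design, an unshipped order has no :shipped datoms at all, so there is no transaction to point at.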

[{:db/id #db/id[:db.part/user], :account/name "alan", :account/balance 0}
 {:db/id #db/id[:db.part/user], :account/name "brenda", :account/balance 0}]

Querying

Queries are written in Datalog, a logic query language. Since Rich Hickey wrote this, everything is just data, and queries can be run against other data structures as well as database values.

Arbitrary logic

Since Datomic runs on the JVM, it's possible to throw code around in various sneaky ways. Since Java makes me sleepy, I'm going to assume the use of Clojure. There are two points at which we might want to do sneaky things: during transactions, and during queries.

During transactions you may wish to call a function to generate the exact update so that changes are atomic; the canonical example of this is a bank transfer, where it is critical that the new balance is a function of the old value at the time the transaction began and not at an arbitrary point in the past. If a database function throws an exception then the transaction is aborted.

The documentation says that transactions can also be used for integrity checks. Thinking about this, I guess that it must mean that the update logic can check the current state and make sure the update doesn't break it, which is rather more an update than an integrity check. For me an integrity check would imply a function being called after an update with the new state of the database and a list of new facts, but that doesn't seem to be possible.

The other thing which seems odd to me about update functions is that they look like other facts in a query. Let's define ourselves a credit function which adds credit to an account:

[{:db/id #db/id [:db.part/user]
  :db/ident :credit
  :db/fn #db/fn { :lang "clojure"
                  :params [db name amount]
                  :code "(let [id (ffirst (datomic.api/q
                                             '[:find ?e :in $ ?name :where [?e :account/name ?name]]
                                             db name))
                               e (datomic.api/entity db id)]
                              [[:db/add id :account/balance (+ (:account/balance e 0)
                                                               amount)]])"}}]

Now I can update the accounts like this:

[[:credit "brenda" 10]]

Now what happens if I do this:

[[:credit "brenda" 12]
 [:credit "brenda" 15]]

What I expected was that the balance would be increased by 15. Or maybe 12. I certainly didn't expect the entity to grow two balance values, though thinking about it I can see that given that the transaction was explicitly requesting multiple values, that's what you get. An error that the schema is contradicted would seem more natural to me, though. Anyway, two values is what Datomic gives you, and maybe this highlights a more general point: Datomic is less strict than many users of relational databases will be used to; the onus is on the application to be correct. For databases which are shared between many applications, this may be an unpopular balance, as it makes it more likely that other people break your data model, but maybe the way this is avoided is to have a shared data model which the applications all build on.

The reason I was surprised that calling a database function in a transaction looks like a normal fact is that transactional database functions will normally want to be called on their own; although each individual function will check that it maintains consistency, having more than one, or having other updates in the same transaction, is not guaranteed to. On further thought, this is probably just another example of the app being responsible for not making a mess of the database.

In queries, code may be used to, er, do other clever things. And possibly basic things like sorting. As you can probably tell, I'm not really too clear about this, but hopefully I'll get a better understanding once I can use database functions in queries from the REST interface.

Indexes

By default, Datomic keeps indexes on EAVT (entity, attribute, value, transaction), AEVT and VAET (the last for reference attributes). Datomic can optionally also index on AVET; I should learn more about this and how it affects range queries.
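A rough way to picture these indexes (a sketch of the idea, not Datomic's internals): each one is the same set of datoms kept sorted in a different component order, so any lookup by the leading components becomes a cheap range scan.

```python
# The same datoms, as (e, a, v, t) tuples.
datoms = [(1, ":account/name", "alan", 100),
          (2, ":account/name", "brenda", 100),
          (1, ":account/balance", 0, 100)]

# EAVT: sorted by entity first -- "give me everything about entity 1".
eavt = sorted(datoms)
# AEVT: sorted by attribute first -- "give me every :account/name".
aevt = sorted(datoms, key=lambda d: (d[1], d[0]))

names = [d for d in aevt if d[1] == ":account/name"]
print([d[2] for d in names])  # → ['alan', 'brenda']
```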

Partitions

Data is split into partitions. Querying within a partition is quick, across partitions is slow. Presumably querying also gets slower as partitions get larger, but until I understand more I guess I'm going to stick everything in one big partition and wait for something to go wrong.

Usage from other languages

Whilst it's most natural to use Datomic from JVM languages, and particularly from Clojure, Datomic recently grew a REST API so that everyone else can access it. I've written a Python Datomic client.

Since you can still call database functions (or will be able to soon in the case of queries, I think), you still have most of the power you need. But clearly the question of where application logic goes and what needs to be pushed into the database gets more critical, both because any logic will have to be written in a different language, and because the cost of transferring query results out of the database will be a lot higher.

Personally I see this being the way I would be most likely to use Datomic. I learnt Clojure a while back, but whilst there are probably tasks out there that I might choose it for, they are few and far between. (Developing Datomic is, as Rich says, a very good fit.) Web applications usually consist of many independent threads which communicate only at the database layer, so the value of Clojure's cunning concurrency is small. On the opposite side, Python's web development libraries are very mature and flexible, and not something I'd want to lose.

Common SQL problems

Associating users with changes

Because every transaction is logged and has a time, it's easy to find the time of any update. But since transactions are themselves entities, arbitrary extra data can also be recorded. My application transaction function adds the current user to the transaction:

transaction.append('{:db/id #db/id[:db.part/tx] :data/user %s}' % user)

Get the id of a new entity

Getting the row id that was created by a SQL insert can be surprisingly painful. Fortunately, this one's quite easy in Datomic. Transactions return responses confirming what has changed, like this:

{':db-after': {':basis-t': 1093, ':db/alias': u'tdb/markov'},
 ':db-before': {':basis-t': 1091, ':db/alias': u'tdb/markov'},
 ':tempids': {-9223350046623220439: 17592186045510},
 ':tx-data': ({':a': 50,
               ':added': True,
               ':e': 13194139534405,
               ':tx': 13194139534405,
               ':v': datetime.datetime(2012, 9, 21, 15, 13, 48, 304000)},
              {':a': 62,
               ':added': True,
               ':e': 17592186045510,
               ':tx': 13194139534405,
               ':v': u'Understand history'},
              {':a': 63,
               ':added': True,
               ':e': 17592186045510,
               ':tx': 13194139534405,
               ':v': False})}

The :tempids map maps temporary ids to the ids of the created entities. Provided you've only created one entity in your transaction, life is simple. Otherwise you'll need to think a little harder.
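Given a response shaped like the one above (as parsed into Python types), pulling out the new id in the single-entity case is a one-liner:

```python
# `response` is the parsed transaction result, trimmed to the relevant key.
# With a single created entity, :tempids has exactly one entry.
response = {':tempids': {-9223350046623220439: 17592186045510}}

(new_id,) = response[':tempids'].values()
print(new_id)  # → 17592186045510
```

The tuple-unpacking assignment doubles as a check: it raises if the transaction created more or fewer than one entity.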

Conclusion

I'm quite excited by Datomic. It's quite different from my attempt to write something similar to the ZODB for Clojure! I think the accessible history is great. I've never used an EAV store before, so Datomic's data model is entirely new to me.

Hopefully I'll soon get the chance to publish my version of the Pyramid todo app backed by Datomic. Whether I'll get to use Datomic for anything more is not clear. I'd like to use it for CMS development - performance is never going to be an issue, and having full history is naturally really useful.

Comments

Donald Parish wrote on 28 September 2012:

Nice summary. I've tagged many other sites at http://www.delicious.com/milhouse/datomic
