Constraints of the data model

  • There are no built-in indexes on column values, so data may have to be denormalised
  • Columns and supercolumns are sorted by key name
    • names are byte strings but interpretation for sorting can be changed
  • Whether range queries over row keys are possible depends on the partitioner
    • RandomPartitioner distributes rows among machines according to the MD5 hash of the row key, which balances load
      • within a node, rows are sorted by key
    • OrderPreservingPartitioner distributes rows according to the key itself, so neighbouring keys stay together
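
The difference between the two partitioners can be sketched in a few lines of Python. This is only an illustration: the function names, the four-node cluster and the split points are assumptions, and taking the hash modulo the node count is a simplification of how a real cluster assigns token ranges.

```python
import hashlib
from bisect import bisect_right


def random_partition(key: bytes, nodes: int) -> int:
    """Pick a node from the MD5 hash of the row key (RandomPartitioner-like)."""
    token = int.from_bytes(hashlib.md5(key).digest(), "big")
    return token % nodes  # simplification: real clusters own token ranges


def order_preserving_partition(key: bytes, boundaries: list) -> int:
    """Pick a node by comparing the key itself with sorted split points."""
    return bisect_right(boundaries, key)


# Hypothetical split points dividing the key space among 4 nodes.
BOUNDARIES = [b"g", b"n", b"t"]
keys = [b"alice", b"bob", b"carol", b"zed"]

random_nodes = [random_partition(k, 4) for k in keys]
ordered_nodes = [order_preserving_partition(k, BOUNDARIES) for k in keys]
```

With the order-preserving placement, sorted keys map to non-decreasing node numbers, so a key range touches a contiguous run of nodes. With the hashed placement, neighbouring keys can land anywhere, which spreads load but scatters a range across the cluster.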

Cassandra data model

Column :: key → value

  • similar to a single datum

SuperColumn :: key → { subkey1 → value1, … }

  • a datum whose value is structured

ColumnFamily :: { column1, column2, … } = { key1 → {subkey1 → value1, subkey2 → value2}, … }

  • column families are stored in separate files
  • sorted in key-major order
  • similar to an RDBMS table, except sparse

SuperColumnFamily :: { supercolumn1, supercolumn2, … }

Keyspace :: { columnfamily1, columnfamily2, … }

  • the outermost grouping of data, similar to an RDBMS schema
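
As a rough mental model, the column structures above can be written as nested Python dicts. This is a sketch rather than any Cassandra API: every name and value is made up, and the helper functions `get_column` and `sorted_columns` are hypothetical.

```python
# Plain dicts standing in for the structures above; all names are illustrative.

# Column: name -> value
column = {b"email": b"alice@example.com"}

# SuperColumn: name -> {subcolumn name -> value}
super_column = {b"address": {b"city": b"San Mateo", b"state": b"CA"}}

# ColumnFamily: row key -> {column name -> value}.
# Rows are sparse: "bob" simply has no email column.
users = {
    b"alice": {b"email": b"alice@example.com", b"state": b"CA"},
    b"bob": {b"state": b"NY"},
}

# SuperColumnFamily: row key -> {supercolumn name -> {subcolumn name -> value}}
profiles = {
    b"alice": {b"address": {b"city": b"San Mateo", b"state": b"CA"}},
}


def get_column(cf, row_key, name, default=None):
    """Look up one column in one row; a missing column returns the default."""
    return cf.get(row_key, {}).get(name, default)


def sorted_columns(row, interpret=None):
    """Return (name, value) pairs sorted by column name.

    Names are byte strings; `interpret` changes how they are compared,
    for example `int` to sort b"9" before b"10".
    """
    if interpret is None:
        return sorted(row.items())
    return sorted(row.items(), key=lambda kv: interpret(kv[0]))
```

Sorting the names `b"10"` and `b"9"` as raw bytes puts `b"10"` first, while passing `int` as the interpretation puts `b"9"` first, which is the kind of choice a sort interpretation for column names makes.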

An example

  • User (an RDBMS table, a Cassandra ColumnFamily)
    • maps user attributes to byte array values
  • To do a query on one of those attributes, say state,
    • need to manually create a ColumnFamily { state → { city → { name → username } } }
      • like indexing on state
    • then, where state == ‘CA’ is efficient (since ColumnFamilies are sorted by key)
  • Composite keys
    • correspond to queries such as where state == ‘CA’ and city == ‘San Mateo’
      • { state:city → {name → username} }
    • ColumnFamilies are sorted by key
    • we can do where state == ‘CA’ (get all cities)
    • but also where state == ‘CA’ and city == ‘San Mateo’ (get one city)
    • but not range queries on city
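
The composite-key index can be sketched in Python with a sorted list of keys. The sample users and the helpers `build_index`, `prefix_scan` and `exact_lookup` are hypothetical, and the sketch assumes keys are kept in sorted order, as under an order-preserving layout.

```python
from bisect import bisect_left

# Hypothetical User rows: username -> attributes.
users = {
    "alice01": {"name": "Alice", "state": "CA", "city": "San Mateo"},
    "bob02": {"name": "Bob", "state": "CA", "city": "Oakland"},
    "carol03": {"name": "Carol", "state": "NY", "city": "Albany"},
}


def build_index(users):
    """Build { state:city -> {name -> username} } as a list sorted by key."""
    index = {}
    for username, attrs in users.items():
        key = f"{attrs['state']}:{attrs['city']}"
        index.setdefault(key, {})[attrs["name"]] = username
    return sorted(index.items())


def prefix_scan(index, prefix):
    """Return all rows whose key starts with prefix, using the sort order."""
    keys = [k for k, _ in index]
    start = bisect_left(keys, prefix)  # first key >= prefix
    rows = []
    for key, columns in index[start:]:
        if not key.startswith(prefix):
            break  # sorted keys: nothing further can match
        rows.append((key, columns))
    return rows


def exact_lookup(index, key):
    """Return the columns stored under exactly this composite key, or None."""
    keys = [k for k, _ in index]
    pos = bisect_left(keys, key)
    if pos < len(keys) and keys[pos] == key:
        return index[pos][1]
    return None


index = build_index(users)
ca_rows = prefix_scan(index, "CA:")             # where state == 'CA'
san_mateo = exact_lookup(index, "CA:San Mateo")  # state and city together
```

Because the state comes first in the key, all rows for one state sit next to each other. A range over city alone, across all states, does not correspond to one contiguous run of keys in this layout.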

Cribbed from

  • http://www.slideshare.net/benjaminblack/cassandra-basics-indexing
  • http://wiki.apache.org/cassandra/DataModel/
  • http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model