Liveblogging: Indexing in Cassandra by Ed Anuff

Indexing in Cassandra

First, a brief history:

Cassandra 0.6

– No built-in secondary

– all indexes were custom-built like using supercolumn

Cassandra 0.7

– new users flocked to the built-in secondary indexes

pros – easy to use out of the box

cons – not the same as SQL indexes but they look similar

– reinforce data modeling that plays *against* cassandra’s strengths

Present Day

– new users can get started with Cassandra w/out understanding internals, using CQL

– Veteran users are using advanced techniques like composites that aren’t really documented anywhere.

– New user panic mode when they try to use the next level and find themselves in the deep end.

 

Quick review of Indexing

2 ways of finding rows – primary index and alternate indexes

primary index (row keys):

– sometimes it’s meaningful (natural key)

– usually not, like a uuid

 

Get vs. find:

using row key is best to retrieve info if you’ve got precise and immutable 1:1 mapping

if you plan to iterate over keys, you’re probably doing something wrong. (that’s finding, not getting)

So search shouldn’t really use primary keys, just ‘get’.

 

 

alternate indexes (everything else)

Native Secondary indexes:
– easy to use, looks like SQL
– every index is stored as its own “hidden” column family (CF)
– nodes index the rows they store
– when you issue a query it gets sent to ALL nodes (no partition pushdown)
– Currently does equality ops, range get performed by memory coordinator node.
This behavior contributes to these limitations:
– Not recommended for high cardinality values (timestamps, birthdays, keywords, etc)
– Requires AT LEAST one equality comparison in a query, not efficient for less than, greater than or range queries
– Unsorted – results are in token order, not query value order
– Limited to search on data types Cassandra *natively* understands.
wide rows as lookup and grouping tables
“Why would a row need 2B columns?
– It’s the basis of all indexing, organizing and relationships in Cassandra?
– if your data model has no rows with >100 columns, you’re probably doing it wrong (you’re thinking in relational terms!)
Inherently, wide rows work as a simple index — 
indexes = {
“User_Keys_by_last_name”: {
 “aaaa”
“aaaab”
etc…
CF as indexes
-cf column ops very fast
-column slices can be retrieved by range, are always sorted, can be reversed, etc.
-if a target key a TimeUUID you get both grouping AND sorting by timestamp.  Good for inboxes, feeds, logs, etc
– This is the best option when you need to combine groups, sort and search, such as a friends list, inbox, etc.
But…what about 2 people with the same last name?  (non-unique keys)
In the docs, the first answer you might find is SuperColumns – lets you have your col name have >1 col value.  So, 2 row keys for people with 1 last name
Use SuperColumns with caution.
– Not officially deprecated, but they’re not highly recommended either.
– sorts only on the supercolumn, not the subcolumn
– some performance issues
– Cannot do more nesting, can only have 1 level of subcolumns
– Many projects have moved away from supercolumns b/c of these limitations.
So, let’s revisit regular CF’s — what happens with >1 person with the same last name?  you can’t have 2 cols with the same column name.  You can do a composite column name:
“User_Keys_by_last_name”: {
(“alden”, 1): “e5d”,
(“adams”, 1): “et3”,
(“anderson”, 1): “e5f”,
(“anderson”, 2): “e71..”,
(“doe”, 1): “a4f”,
(“franks”, 1): “f4e”,
Composite column names
Comparator = “CompositeType” or “DynamicCompositeType”
 – you don’t lose the sort capability – sorts by component values using each component type’s sort order
2 types of composites, static and dynamic
column_families:
name: My_Composite_Index_CF
compare_with: CompositeType(UTF8Type, UUIDType)
— note in static composite types, fixed # and order of columns in the definition
name: My_Dynamic_Composite_Index_CF
compare_with: DynamicCompositeType(s=>UTF8Type, u=>UUIDType)
— Any # and order of types at runtime, the definitions are just for convenience and smaller serialized component names.
The main difference is whether you need to create one CF per index s, or one CF for all indexes with one row per index
How does this work?
Queries are easy – just regular column slice ops
Updates are harder – need to remove old value and insert the new value — this is why they recommend starting with the built-in native secondary indexes. You have to know how to remove old values then insert new values, which involves a read before write.
Example – Users by Location
Use 3 CFs, not 2 for safe concurrent access
First 2 CF’s are natural:
Users
Indexes
We also need a 3rd:
User_Index_Entries
Users = {
“username”: “…”
“location”: <location>
}
Indexes = {
Users_By_Location” : {
  {<location>, <user_key>, <ts>} : …, …: …, 
  }
}
Users_Index_Entries = {
<user_key>: {
 {“location”, <ts 1>}: <location 1>,
 {“location”, <ts 2>}: <location 2>,
 {“location”, <ts N>}: <location N>,
Allows you to read the previous index value from Users_Index_Entries CF and delete the previous one. 
Read from Users_I_E WHERE KEY=<user_key>;
DELETE FROM Users_Index_Entries
DELETE FROM Users_By_Location
UPDATE Users_Index_Entries
UPDATE Users_By_Location
B/c there’s a timestamp, doing it >1 time has no consequence, so eventually consistent.
What if something goes wrong?  Repeat batch operation until it completes.
False positive?  possible, so if it’s a problem, filter on the reads.
This approach is VERY common — with some variations.  So use this *idea*, but not necessary to be an exact copy of this example.
At least now, composite indexing is now standard. 
Can do derived indexes — create a “last_name, first_name” index from a “fullname” column.    Can also unroll a JSON object to construct deep indexes of serialized JSON structures.
– Include additional denormalized values in the index for faster lookups
– use composites for column values, too — not just column names.
custom secondary indexes
Note: no official alternate index “way”.  Everything talked about here is using an official Cassandra feature/property.
How can I learn more?
Sample using Hector:
JPA implementation for this using Hector:

 

Jira entry on this:

http://issues.apache.org/jira/browse/CASSANDRA-2231

Indexing in Cassandra

First, a brief history:

Cassandra 0.6

– No built-in secondary

– all indexes were custom-built like using supercolumn

Cassandra 0.7

– new users flocked to the built-in secondary indexes

pros – easy to use out of the box

cons – not the same as SQL indexes but they look similar

– reinforce data modeling that plays *against* cassandra’s strengths

Present Day

– new users can get started with Cassandra w/out understanding internals, using CQL

– Veteran users are using advanced techniques like composites that aren’t really documented anywhere.

– New user panic mode when they try to use the next level and find themselves in the deep end.

 

Quick review of Indexing

2 ways of finding rows – primary index and alternate indexes

primary index (row keys):

– sometimes it’s meaningful (natural key)

– usually not, like a uuid

 

Get vs. find:

using row key is best to retrieve info if you’ve got precise and immutable 1:1 mapping

if you plan to iterate over keys, you’re probably doing something wrong. (that’s finding, not getting)

So search shouldn’t really use primary keys, just ‘get’.

 

 

alternate indexes (everything else)

Native Secondary indexes:
– easy to use, looks like SQL
– every index is stored as its own “hidden” column family (CF)
– nodes index the rows they store
– when you issue a query it gets sent to ALL nodes (no partition pushdown)
– Currently does equality ops, range get performed by memory coordinator node.
This behavior contributes to these limitations:
– Not recommended for high cardinality values (timestamps, birthdays, keywords, etc)
– Requires AT LEAST one equality comparison in a query, not efficient for less than, greater than or range queries
– Unsorted – results are in token order, not query value order
– Limited to search on data types Cassandra *natively* understands.
wide rows as lookup and grouping tables
“Why would a row need 2B columns?
– It’s the basis of all indexing, organizing and relationships in Cassandra?
– if your data model has no rows with >100 columns, you’re probably doing it wrong (you’re thinking in relational terms!)
Inherently, wide rows work as a simple index — 
indexes = {
“User_Keys_by_last_name”: {
 “aaaa”
“aaaab”
etc…
CF as indexes
-cf column ops very fast
-column slices can be retrieved by range, are always sorted, can be reversed, etc.
-if a target key a TimeUUID you get both grouping AND sorting by timestamp.  Good for inboxes, feeds, logs, etc
– This is the best option when you need to combine groups, sort and search, such as a friends list, inbox, etc.
But…what about 2 people with the same last name?  (non-unique keys)
In the docs, the first answer you might find is SuperColumns – lets you have your col name have >1 col value.  So, 2 row keys for people with 1 last name
Use SuperColumns with caution.
– Not officially deprecated, but they’re not highly recommended either.
– sorts only on the supercolumn, not the subcolumn
– some performance issues
– Cannot do more nesting, can only have 1 level of subcolumns
– Many projects have moved away from supercolumns b/c of these limitations.
So, let’s revisit regular CF’s — what happens with >1 person with the same last name?  you can’t have 2 cols with the same column name.  You can do a composite column name:
“User_Keys_by_last_name”: {
(“alden”, 1): “e5d”,
(“adams”, 1): “et3”,
(“anderson”, 1): “e5f”,
(“anderson”, 2): “e71..”,
(“doe”, 1): “a4f”,
(“franks”, 1): “f4e”,
Composite column names
Comparator = “CompositeType” or “DynamicCompositeType”
 – you don’t lose the sort capability – sorts by component values using each component type’s sort order
2 types of composites, static and dynamic
column_families:
name: My_Composite_Index_CF
compare_with: CompositeType(UTF8Type, UUIDType)
— note in static composite types, fixed # and order of columns in the definition
name: My_Dynamic_Composite_Index_CF
compare_with: DynamicCompositeType(s=>UTF8Type, u=>UUIDType)
— Any # and order of types at runtime, the definitions are just for convenience and smaller serialized component names.
The main difference is whether you need to create one CF per index s, or one CF for all indexes with one row per index
How does this work?
Queries are easy – just regular column slice ops
Updates are harder – need to remove old value and insert the new value — this is why they recommend starting with the built-in native secondary indexes. You have to know how to remove old values then insert new values, which involves a read before write.
Example – Users by Location
Use 3 CFs, not 2 for safe concurrent access
First 2 CF’s are natural:
Users
Indexes
We also need a 3rd:
User_Index_Entries
Users = {
“username”: “…”
“location”: <location>
}
Indexes = {
Users_By_Location” : {
  {<location>, <user_key>, <ts>} : …, …: …, 
  }
}
Users_Index_Entries = {
<user_key>: {
 {“location”, <ts 1>}: <location 1>,
 {“location”, <ts 2>}: <location 2>,
 {“location”, <ts N>}: <location N>,
Allows you to read the previous index value from Users_Index_Entries CF and delete the previous one. 
Read from Users_I_E WHERE KEY=<user_key>;
DELETE FROM Users_Index_Entries
DELETE FROM Users_By_Location
UPDATE Users_Index_Entries
UPDATE Users_By_Location
B/c there’s a timestamp, doing it >1 time has no consequence, so eventually consistent.
What if something goes wrong?  Repeat batch operation until it completes.
False positive?  possible, so if it’s a problem, filter on the reads.
This approach is VERY common — with some variations.  So use this *idea*, but not necessary to be an exact copy of this example.
At least now, composite indexing is now standard. 
Can do derived indexes — create a “last_name, first_name” index from a “fullname” column.    Can also unroll a JSON object to construct deep indexes of serialized JSON structures.
– Include additional denormalized values in the index for faster lookups
– use composites for column values, too — not just column names.
custom secondary indexes
Note: no official alternate index “way”.  Everything talked about here is using an official Cassandra feature/property.
How can I learn more?
Sample using Hector:
JPA implementation for this using Hector:

 

Jira entry on this:

http://issues.apache.org/jira/browse/CASSANDRA-2231