Liveblogging: Migration from 100s to 100s of Millions

Migration From 100s to 100s of Millions – Messaging to Mobile Devices with Cassandra at Urban Airship

Urban Airship – provides hosting for mobile services that devs should not build themselves (e.g. push notifications, content delivery services, etc.). They provide a unified API for these services across all platforms (Android, iOS, RIM, etc.), and are starting to have SLAs for throughput and latency.

Transactional intake system at Urban Airship:
API – Apache/Python/Django + piston + pycassa
Device negotiation layer – Java NIO + Hector
Message delivery layer – Python, Java NIO + Hector
Device data check-ins – Java HTTPS endpoint
Persistence – sharded PostgreSQL, Cassandra 0.7, “diminishing footprint” of MongoDB 1.7

What do they use?
Cassandra
HBase
redis (analytics internal measurements)
MongoDB (phasing out)

They don’t use Riak.

Converging on Cassandra + PostgreSQL for transactions, HBase for data warehousing.

They started with PostgreSQL on EC2, but had so many writes that after 6 months they couldn’t scale, so they went to MongoDB.

MongoDB had heavy disk I/O problems, and its unsophisticated locking caused lock contention, deadlocks and replication slave lag that just wasn’t working out for them.

So they moved to Cassandra. Why?
– Well-suited to their data model – simple DAGs
– lots of UUIDs and hashes, which partition well
– retrievals don’t need ordering beyond row keys or time-series data (e.g. it doesn’t matter what order 10 million devices are retrieved in, they just need to be retrieved!)
– Rolling minor version upgrades are easy in Cassandra, no downtime.
– Column TTLs were huge for them (and resulting expiration)
– Particularly well-suited to working around EC2 availability problems
– They needed to partition across multiple availability zones, they didn’t trust fault containment within one zone.
– Read repair and handoff generally did the right thing when a node would flap (Ubuntu #708920)
– No single point-of-failure
– Ability to alter consistency levels (CL) on a per-operation basis – some things don’t need to be consistent right away, others do (this, along with column TTLs, is shown in the pycassa sketch after this list)
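
To make the TTL and per-operation consistency points concrete, here is a minimal pycassa sketch (the keyspace, column family and column names are made up for illustration) of a write with a column TTL and per-call consistency levels:

import pycassa
from pycassa.cassandra.ttypes import ConsistencyLevel

pool = pycassa.ConnectionPool('Keyspace1', ['localhost:9160'])
devices = pycassa.ColumnFamily(pool, 'DeviceTokens')

# the column expires automatically after 30 days - no cleanup job needed
devices.insert('some-device-uuid',
               {'token': 'abc123'},
               ttl=30 * 86400,
               write_consistency_level=ConsistencyLevel.QUORUM)

# a less critical read can trade consistency for latency on a per-call basis
devices.get('some-device-uuid', read_consistency_level=ConsistencyLevel.ONE)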

Cassandra tips:
– Know your data model – creating indexes after the fact is a PITA
– design around wide rows (but be careful of I/O, Thrift, Count problems)
– Favor JSON over packed binaries if possible (unless you’re Twitter)
– Be careful using Thrift in the stack – having other services that use Thrift that have to talk to Cassandra has some painful versioning limitations.
– Don’t fear the StorageProxy.
– Looking at the Cassandra source code and getting your hands dirty with the Java code is a MUST.
– Assume the client will fail (difference between read timeout and connection refused)
– When maintaining your own indexes, try and clean up after failure. (i.e. have a good rollback strategy)
– Be ready to clean up inconsistencies anyway
– Verify client library assumptions and exception handling – make sure you know what is going on when the client reports that it couldn’t do a write, so you can decide whether to retry now, retry later, or give up (see the retry sketch after this list)
– Embedding Cassandra for testing really helped
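
As a rough sketch of those last points (a hypothetical helper, not something from the talk): a timeout means the operation may still have been applied server-side, while a refused connection means it definitely was not, so the retry strategy can differ:

import time

def with_retry(op, attempts=3, backoff=0.1):
    """Run op(), retrying with exponential backoff.

    Which exception classes to catch depends on your client library; the key
    point is to treat a timeout (the write may have happened) differently
    from a refused connection (it did not).
    """
    for i in range(attempts):
        try:
            return op()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(backoff * (2 ** i))

# e.g. with_retry(lambda: devices.get('some-device-uuid'))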

Cassandra in EC2:
– Ensure the Dynamic Snitch is enabled (and check your config files during upgrades… they had the Dynamic Snitch off in 0.6 due to bugs, and when they upgraded to 0.7 they didn’t turn it back on)
– Disk I/O – avoid EBS except for snapshot backups… or use S3. Stripe ephemeral disks, not EBS volumes, because Cassandra is network-I/O heavy and EBS is a networked disk.
– Avoid smaller instances altogether – i.e. avoid virtualization overhead if you can
– Don’t assume that traffic crossing into a nearby availability zone is more expensive than staying within one zone – sometimes it is, often it isn’t. (No comment on different regions; they haven’t tested that yet.)
– Balance RAM costs vs. the costs of additional hosts. Spend time with the GC logs.

Java best practices:
– ALL Java services, including Cassandra, are managed via the same set of scripts. In most cases operators don’t treat Cassandra differently from HBase: one mechanism to take a thread or heap dump; consistent GC, application and stdout/stderr logging across HBase and Cassandra; even the init scripts use the same scripts the operators do.
– Bare metal will rock your world
– Configuring -XX:+UseLargePages is worthwhile (on bare metal)
– Get familiar with GC logs (-XX:+PrintGCDetails); understand what a degenerate CMS collection looks like and what promotion failures look like. Urban Airship settled on -XX:CMSInitiatingOccupancyFraction=60, lowered from the default of 75, so that CMS collection kicks in before there’s a problem and promotion failures are avoided (collected in the sample flags below).
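
Pulling those flags together, a starting point might look like the following (the CMS value is the one quoted above; the GC log path is just an example, and CMSInitiatingOccupancyFraction is only honored on every cycle when combined with UseCMSInitiatingOccupancyOnly):

-XX:+UseLargePages
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=60
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+PrintGCDetails
-Xloggc:/var/log/cassandra/gc.log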

Operations:
– Understand when to compact
– Understand upgrade implications for data files
– Watch hinted handoff closely
– Monitor JMX religiously

Looking forward:
– Cassandra is a great hammer, but not everything is a nail
– Co-processors would be awesome (hint hint!)
– They still spend too much time worrying about GC
– Glad to see the ecosystem around the product evolving, CQL, Pig, Brisk

Liveblogging: Indexing in Cassandra by Ed Anuff

Indexing in Cassandra

First, a brief history:

Cassandra 0.6

– No built-in secondary indexes

– all indexes were custom-built, e.g. using SuperColumns

Cassandra 0.7

– new users flocked to the built-in secondary indexes

pros – easy to use out of the box

cons – not the same as SQL indexes but they look similar

– reinforce data modeling that plays *against* cassandra’s strengths

Present Day

– new users can get started with Cassandra w/out understanding internals, using CQL

– Veteran users are using advanced techniques like composites that aren’t really documented anywhere.

– New users panic when they try to go to the next level and find themselves in the deep end.

 

Quick review of Indexing

2 ways of finding rows – primary index and alternate indexes

primary index (row keys):

– sometimes it’s meaningful (natural key)

– usually not, like a uuid

 

Get vs. find:

using row key is best to retrieve info if you’ve got precise and immutable 1:1 mapping

if you plan to iterate over keys, you’re probably doing something wrong. (that’s finding, not getting)

So searches shouldn’t really go through primary keys; row keys are for ‘get’, not ‘find’.

 

 

alternate indexes (everything else)

Native secondary indexes:
– easy to use, looks like SQL
– every index is stored as its own “hidden” column family (CF)
– nodes index the rows they store
– when you issue a query it gets sent to ALL nodes (no partition pushdown)
– currently handles only equality operations; range filtering is performed in memory by the coordinator node
This behavior contributes to these limitations:
– Not recommended for high-cardinality values (timestamps, birthdays, keywords, etc.)
– Requires AT LEAST one equality comparison in a query; not efficient for less-than, greater-than or range queries
– Unsorted – results come back in token order, not query-value order
– Limited to searching on data types Cassandra *natively* understands
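
For reference, this is roughly what a native secondary-index query looks like from pycassa (the ‘state’ column and its index are made-up examples; the index helper calls are real pycassa functions):

import pycassa
from pycassa.index import create_index_clause, create_index_expression

pool = pycassa.ConnectionPool('Keyspace1', ['localhost:9160'])
users_cf = pycassa.ColumnFamily(pool, 'users')   # assumes an index exists on the 'state' column

# the index itself can only serve equality predicates
expr = create_index_expression('state', 'UT')
clause = create_index_clause([expr], count=100)

for key, columns in users_cf.get_indexed_slices(clause):
    print(key, columns)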
wide rows as lookup and grouping tables
Why would a row need 2B columns?
– It’s the basis of all indexing, organizing and relationships in Cassandra.
– if your data model has no rows with >100 columns, you’re probably doing it wrong (you’re thinking in relational terms!)
Inherently, wide rows work as a simple index — 
indexes = {
    “User_Keys_by_last_name”: {
        “aaaa”,
        “aaaab”,
        etc…
    }
}
CF as indexes
– CF column ops are very fast
– column slices can be retrieved by range, are always sorted, can be reversed, etc.
– if the target key is a TimeUUID you get both grouping AND sorting by timestamp.  Good for inboxes, feeds, logs, etc.
– This is the best option when you need to combine groups, sort and search, such as a friends list, inbox, etc.
But…what about 2 people with the same last name?  (non-unique keys)
In the docs, the first answer you might find is SuperColumns – they let a column name hold more than one column value, so you can store two user keys under one last name.
Use SuperColumns with caution.
– Not officially deprecated, but they’re not highly recommended either.
– sorts only on the supercolumn, not the subcolumn
– some performance issues
– Cannot do more nesting, can only have 1 level of subcolumns
– Many projects have moved away from supercolumns b/c of these limitations.
So, let’s revisit regular CFs – what happens with more than one person with the same last name?  You can’t have 2 columns with the same column name, but you can use a composite column name:
“User_Keys_by_last_name”: {
    (“adams”, 1): “et3”,
    (“alden”, 1): “e5d”,
    (“anderson”, 1): “e5f”,
    (“anderson”, 2): “e71..”,
    (“doe”, 1): “a4f”,
    (“franks”, 1): “f4e”,
    …
}
Composite column names
Comparator = “CompositeType” or “DynamicCompositeType”
 – you don’t lose the sort capability – sorts by component values using each component type’s sort order
2 types of composites, static and dynamic
column_families:
    - name: My_Composite_Index_CF
      compare_with: CompositeType(UTF8Type, UUIDType)
      # static composite type: fixed number and order of component types in the definition
    - name: My_Dynamic_Composite_Index_CF
      compare_with: DynamicCompositeType(s=>UTF8Type, u=>UUIDType)
      # dynamic composite type: any number and order of types at runtime; the definitions are
      # just for convenience and smaller serialized component names
The main difference is whether you need to create one CF per index, or one CF for all indexes with one row per index.
How does this work?
Queries are easy – just regular column slice ops
Updates are harder – you need to remove the old value and then insert the new one, which involves a read before the write – and this is why they recommend that new users start with the built-in secondary indexes.
Example – Users by Location
Use 3 CFs, not 2 for safe concurrent access
First 2 CF’s are natural:
Users
Indexes
We also need a 3rd:
Users_Index_Entries
Users = {
    “username”: “…”,
    “location”: <location>
}
Indexes = {
    “Users_By_Location”: {
        {<location>, <user_key>, <ts>}: …,
        …: …,
    }
}
Users_Index_Entries = {
    <user_key>: {
        {“location”, <ts 1>}: <location 1>,
        {“location”, <ts 2>}: <location 2>,
        {“location”, <ts N>}: <location N>,
    }
}
This allows you to read the previous index value from the Users_Index_Entries CF and delete the previous entry:
Read from Users_Index_Entries WHERE KEY=<user_key>;
DELETE FROM Users_Index_Entries …
DELETE FROM Users_By_Location …
UPDATE Users_Index_Entries …
UPDATE Users_By_Location …
Because every entry carries a timestamp, running this more than once has no ill effect, so it’s eventually consistent and safe to retry.
What if something goes wrong?  Repeat batch operation until it completes.
False positive?  possible, so if it’s a problem, filter on the reads.
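
A hedged pycassa sketch of the whole update path described above (CF names are taken from the example; it assumes the Indexes and Users_Index_Entries CFs were created with the composite comparators shown earlier, and that a failed batch is simply re-run):

import time

import pycassa
from pycassa.batch import Mutator

pool = pycassa.ConnectionPool('MyApp', ['localhost:9160'])
users = pycassa.ColumnFamily(pool, 'Users')
indexes = pycassa.ColumnFamily(pool, 'Indexes')               # holds the 'Users_By_Location' row
entries = pycassa.ColumnFamily(pool, 'Users_Index_Entries')

def set_location(user_key, new_location):
    ts = int(time.time() * 1e6)   # microsecond timestamp used as the <ts> component
    # 1. read-before-write: fetch any previous index entries for this user
    try:
        old = entries.get(user_key)               # {('location', old_ts): old_location, ...}
    except pycassa.NotFoundException:
        old = {}
    b = Mutator(pool)
    # 2. delete the stale entries from the per-user row and from the index row
    for (field, old_ts), old_location in old.items():
        b.remove(entries, user_key, columns=[(field, old_ts)])
        b.remove(indexes, 'Users_By_Location', columns=[(old_location, user_key, old_ts)])
    # 3. insert the new entry and the new index column (the index column value is unused)
    b.insert(entries, user_key, {('location', ts): new_location})
    b.insert(indexes, 'Users_By_Location', {(new_location, user_key, ts): ''})
    # 4. update the user row itself
    b.insert(users, user_key, {'location': new_location})
    b.send()   # if this fails partway, simply re-run the whole batch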
This approach is VERY common, with some variations.  So use this *idea*; it doesn’t need to be an exact copy of this example.
At least composite indexing is now standard. 
Can do derived indexes — create a “last_name, first_name” index from a “fullname” column.    Can also unroll a JSON object to construct deep indexes of serialized JSON structures.
– Include additional denormalized values in the index for faster lookups
– use composites for column values, too — not just column names.
custom secondary indexes
Note: no official alternate index “way”.  Everything talked about here is using an official Cassandra feature/property.
How can I learn more?
Sample using Hector:
JPA implementation for this using Hector:

 

Jira entry on this:

http://issues.apache.org/jira/browse/CASSANDRA-2231

Liveblogging: A dozen databases in 45 minutes

 

Actually, the slide says 12 databases in 25 minutes (and 20 minutes of theory), by Eric Redmond (@inviite).  

Complex data:  A lot of data isn’t really complex, it’s just modeled in a complex way.

“Complexity is a symptom of confusion, not a cause” Jeff Hawkins.

 

NoSQL

  Linear Scalability

  Ability to be Distributed

  Low Latency

 

SQL 

  Not NoSQL

 

ACID (transaction-based)

  Atomic – Transactions are “all or nothing”

  Consistent – system data will have integrity

  Isolated – Transactions can’t see each other

  Durable – changes aren’t lost

 

BASE (request based)

  Basically Available

  Soft state

  Eventual consistency

Redmond says: ACID is negative (avoid negative things), BASE is mostly positive, things will be good, not great.

CAP theorem – it’s a fact 😀

Consistent, Available, Partition-tolerant web services.  “It is impossible to reliably provide atomic, consistent data when there are partitions in the network.  It is feasible, however, to achieve any two of these three properties.”

Note that “consistent” here is not the same as the C in ACID; it’s closer to Atomicity.

Strong consistency: when an update completes, subsequent access returns the new result.  [to my mind this is actually durability]

Weak consistency – eventual consistency

“Correct consistency” is the most important part.  DNS, for example, is eventually consistent.

Common patterns:

Replication

 CouchDB has an amazing ability to do this, Mongo is also good but not as good.

– copying data amongst nodes in a distributed database.  Lazy (optimistic) replication; gossip (nodes communicate to stay in sync).

– master/slave (Mongo)

– master/master (Riak, Couch)

  – vector clocks (keep track of write order per client)

  – MVCC (MySQL)

N/R/W

  N – the number of replica nodes for any piece of data (per bucket)

  R – the number of nodes that must respond to a read before it succeeds

  W – the number of nodes that must acknowledge a write before it succeeds

Amazon Dynamo does this (Cassandra and Riak do too) – it supports both CP and AP in one database (in CAP-theorem terms).  For example, with N=3, setting W=2 and R=2 means R+W > N, so any read overlaps the latest write; dropping to W=1 and R=1 trades that consistency for availability and latency.

Consistent Hashing

Balance your servers: when you hash your keys onto the ring, if a server goes down or is added you don’t have to rebalance ALL the keys, just some percentage of them (see the toy sketch below).
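
A toy Python sketch of the idea (not any particular database’s implementation, just to show why only a fraction of keys move when a node joins or leaves):

import bisect
import hashlib

class HashRing(object):
    """Minimal consistent-hash ring using virtual nodes."""
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._ring = []                       # sorted list of (hash, node)
        for node in nodes:
            self.add(node)

    def _hash(self, key):
        return int(hashlib.md5(key.encode('utf-8')).hexdigest(), 16)

    def add(self, node):
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash('%s-%d' % (node, i)), node))

    def remove(self, node):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def node_for(self, key):
        h = self._hash(key)
        idx = bisect.bisect(self._ring, (h, ''))  # first vnode clockwise from the key
        return self._ring[idx % len(self._ring)][1]

ring = HashRing(['node-a', 'node-b', 'node-c'])
print(ring.node_for('some-row-key'))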

Mapreduce

 

Relational Models:

“Nothing beats relational databases for raw queryability.”  The tradeoff — you have to structure your data and tell the system how it is structured.

PostgreSQL (full featured) – http://bitbucket.org/ged/ruby-pg, http://github.com/Casecommons/pg_search, http://github.com/tenderlove/texticle

MySQL (lighter) – http://github.com/oldmoe/mysqlplus, http://github.com/brianmario/mysql2

Drizzle (lightest) – http://www.drizzle.org

 

Bigtable/Columnar Style

 

What makes it columnar?  Well, the primary key is really a row key, and then you have column families – groups of columns stored together (each column’s values are stored together, as opposed to each row being stored together).  You can set an expiry on a column family too, after which the data expires (which is why it’s great for Google).

HBase – http://hbase.apache.org – an implementation of Google’s BigTable, born of Hadoop (the Java MapReduce engine).  If you want to use HBase in production, use Thrift (http://thrift.apache.org), which Cassandra also uses.  This is CP, but configurable to AP.  Does sequential reads and column versioning, with a strong but flexible columnar schema.

Cassandra – a hybrid.  Node architecture like Dynamo, data structure like BigTable with column families – http://cassandra.apache.org – Good for hundreds of nodes in the same data center; if you have more than that, or multiple data centers, use HBase (that’s what Digg and Facebook are running into).  In Cassandra you set up your schemas with a config file, not with DDL.  Benefits – sequential reads of ordered keys, and it also has versioning.  It’s AP, configurable to CP.

 

Documentation Datastores:

MongoDB (AP focused – master/slave)

http://www.mongodb.org – created to be huge (huMONGous).  Made to be partitioned and distributed, and to support ad hoc queries.  Wasn’t built to be durable.

 

CouchDB

Not originally made to be distributed; it was meant to be very durable.  AP focused (master/master).

http://couchdb.apache.org

http://tilgovi.github.com/couchdb-lounge (clustering)

MapReduce in Mongo is an ad hoc query, comfortable for relational-database people.  In CouchDB, you create views and then request data from those views.  

Riak – the most “architecturally cool” database out there.  It’s a Dynamo implementation that is purely REST based.  Calling it a key-value store isn’t descriptive enough – it has map-reduce built in, plus metadata and links you can walk.  You can store ANYTHING in Riak, not just text.  Example: fetching a JPG file from the web and storing it as the value for the key “firefox.jpg”.  Neat demo.

Riak has a ring and eventual consistency; you can pull nodes in and take nodes out without having to invalidate all the IDs.  It has quorum consistency, which blows Eric’s mind, but we didn’t have time to go into it.

 

Key/value stores

memcached – don’t use it

Kyoto Cabinet – don’t use it

Redis – use it – http://redis.io – it can handle lists and hashes, and can intersect the values of 2 keys (e.g. a set of dog owners and a set of cat owners, to find the people who own both – see the sketch below).
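
For example, with the redis-py client (key and member names made up), a set intersection is a single command:

import redis

r = redis.Redis()                        # assumes a local Redis server
r.sadd('owns:dog', 'alice', 'bob')
r.sadd('owns:cat', 'bob', 'carol')

print(r.sinter('owns:dog', 'owns:cat'))  # members present in both sets, i.e. 'bob'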

 

Graph datastores – you walk the graph instead of querying or doing mapreduce.

Neo4j

FlockDB – distributed; “unless you’re Twitter, you don’t need to use it”.  It’s not really possible to fully distribute a graph database: you can’t walk it and do arbitrary node traversals, you can just walk edges (you can do friends, but not friends of friends, etc.).

Slides are available at https://github.com/coderoshi/holy-grail-dbs

 

 

 

Open Source Bridge, day 1 keynote

Hacker Lounge – open 22 hours a day (closed between 6 and 8 for cleaning).
Android app
A session notes link on each page takes you to the wiki.
#osb11 is the official hashtag
Friday is unconference (I won’t be here on Friday).

What is open source citizenship? Nominate people for open source citizenship awards by the registration desk.

Now the keynote, hacking for freedom: join the doocracy by Peter Fein.

Hacker, hacktivist – names are important, names have power. Peter uses the word “hack” to mean doing something clever with technology. “If we cannot see, we cannot act.”

Do-ocracy. You. Just. Do.

Telecomix (http://www.telecomix.org/) built mirrors to blocked sites and helps folks tweet. They are the yin to Anonymous’ yang – they build and facilitate communication rather than destroy and take down sites. They also go lower tech when they need to, setting up fax and reverse-fax lines, for example.

The last 6-8 months have shown how dramatic a difference the Internet can make for freedom fighting. Egypt pulled fiber optic cable to censor the ‘net; in the Western world we have censorship too, it’s just more subtle. It’s not just WikiLeaks – there is a PROTECT-IP act going through the US Congress that would allow Homeland Security to seize domain names under the guise of copyright issues.

Communicate, organize, resist – or we become subjects:

“If we can’t communicate, we can’t organize. If we can’t organize, we can’t resist. If we can’t resist, we become subjects [not citizens].”

We are seeing a sort of info war of nation-states vs. the Internet. “The Internet didn’t cause the protests in the Middle East [North Africa, really], but it did facilitate them.” For me, though, Napster and Limewire did not cause piracy, they just facilitated piracy – and both companies no longer exist…

Peter introduces the concept of a “disorganization” and different tactics to organize volunteers.
Radical openness – Anonymous chat rooms can be found by searching Google.
Adhocracy – put a team together to do a certain task.
At the end of the day, what matters is what you get done, and that is the core principle of the doocracy.

He points to Critical Mass as a disorganization – there’s no hierarchy, no centralized leadership, people just go out and do. He points to Burning Man as another, and Open Source Bridge as yet another. People show up, they set up stuff, they ask what they can do to help.

Then he plays a clip from Lawrence Lessig speaking at OSCON 2002 about copyright: “If you don’t do something now…this freedom will be taken away…..If you can’t fight for your freedom, you don’t deserve it.”

———
Overall, the keynote was good, but for me it was not particularly informational, inspirational, nor entertaining. I enjoyed it, and it reminded me of a lot of things that have been happening, but nothing I had actually forgotten. Maybe that’s because I already subscribe to the doocracy way? I can see how it might spark a non-doer to go out and do, but that was not the effect it had on me.

An interesting start to a conference that has less in the way of MySQL content, and more of the “soft” topics, which really are important to me.

PalominoDB Nagios Plugin for MySQL

This post has sample configurations, and an update, to the Nagios plugin called mysql_health_check.pl that PalominoDB has developed.  Presentation slides, the presentation video, the whitepaper and the download link to the actual Nagios check are always up-to-date at http://palominodb.com/about-us/projects.

There has been exciting news for the Nagios plugin PalominoDB developed for MySQL – we now have --critical and --warning options for all modes (for --varcomp mode we only had a critical option before).

 

Here are some sample configurations (which are also seen in the slides):

To check rate of temporary tables:

define command {
    command_name    check_mysql_tmp_tables
    command_line    $USER1$/mysql_health_check.pl
        --hostname $HOSTADDRESS$ --user myuser --password mypass
        --cache-dir=/var/lib/nagios/mysql_cache
        --max-cache-age=300 --mode=varcomp
        --expression="Created_tmp_tables/Uptime_since_flush_status"
        --warning=">(8/60)" --critical=">(10/60)"
}

define service {
    use                      generic-service
    host_name                __HOSTNAME__
    service_description      MySQL Temp Tables
    check_command            check_mysql_tmp_tables
}
To check for long-running locked queries (and an example of using parameters – setting 2 different thresholds, one for master servers and one for slave servers):

define command {
    command_name    check_mysql_locked_queries
    command_line    $USER1$/mysql_health_check.pl
        --hostname $HOSTADDRESS$ --user myuser --password mypass
        --cache-dir=/var/lib/nagios/mysql_cache
        --max-cache-age=300
        --mode=locked-query
        --warning=$ARG1$ --critical=$ARG2$
}

define service {
    use                      generic-service
    host_name                __HOSTNAME__
    service_description      MySQL Master Locked Queries
    check_command            check_mysql_locked_queries!60!120
}

define service {
    use                      generic-service
    host_name                __HOSTNAME__
    service_description      MySQL Slave Locked Queries
    check_command            check_mysql_locked_queries!300!600
}

If there are questions, comments or feedback, please let us know!

Why use PalominoDB?

 [edited June 8th to correct typo; the figure given is per YEAR, not per month]

Why PalominoDB:

Efficiency in Time and Cost – 

The average salary for a full-time DBA is $80-120K per year, not counting payroll and benefit costs.  You can retain PalominoDB for your senior needs at only $37,200 a year based on a 20 hour retainer.  This leaves the balance for you to invest in a staff of flexible, competent, systems administrators who can handle day to day tasks. A retainer with PalominoDB includes access to our senior and principal personnel, engineers, and systems administrators.  Whatever your need, PalominoDB can fulfill it.  As you have access to an entire team, you are never left short-handed on holidays or during flu season…and we will never leave you for another company.
 
Team integration –

PalominoDB isn’t just a remote database services company.  From day one, we jump in and work as hard as we can to become part of your team.  We don’t sit in the wings, waiting for you to give us work or for your systems to page us.  We plug into your team’s IRC room and get into your monitoring systems.  We provide constant communication through the use of instant messaging and email.  We attend your team meetings or we set up meetings with you – and we ensure that they are efficient meetings.  Our company integrates into your systems rather than forcing you to integrate with ours.  We are confident you will come to see us as a part of your full-time workforce rather than a group of consultants.

Proactive service –

We review systems daily.  We look for anomalies in workloads.  We check error logs, and we ensure successful backups. We research the benefits of upgraded binaries, new parameters and technologies, and how they can be applied for our clients.  If you aren’t actively generating work for us, we use those hours for proactive work – SQL Reviews, Capacity Reviews and Backup/Recovery tests.  We understand how busy small businesses can be, and we make sure to keep tickets moving.  We do not require babysitting; it is our job to free up your time so you can focus on the rest of your infrastructure.  We can even provide help and recommendations on the rest of your environments.

Extensive Experience –

The professionals at PalominoDB have been managing production systems since before the dot com era.  We’ve seen explosive growth and scaled companies through years, not months, of development.  Our experience is not just with particular technologies, but with operational process, change management, incident and problem management, documentation, and configuration management.  We see the big picture, and work in it using our breadth of experience to do so.  And bottom line, we follow our mantra of three key principles:  Keep production safe, keep production simple, and document and share everything.

Oracle is not removing InnoDB!

 

I was asked “What’s the deal with Oracle removing InnoDB?”  I had not heard this, so I did some research.  It took less than 5 minutes to figure out what happened, and it was easy to see where the confusion is.

On the MySQL products page at http://mysql.com/products/ the matrix of MySQL editions includes “MySQL Classic” which is free, “MySQL Standard” which costs $2k per year, “MySQL Enterprise” which costs $5k per year and “MySQL Cluster Carrier Grade” which costs $10k per year.

Indeed, the “MySQL Classic” does not include InnoDB.  What happened was that folks assumed that, because it was free, it was the MySQL Community edition we all know and love.

This is not true.  How do I know?  Because just above the matrix is a set of links to each edition, and if you click the “MySQL Classic” link you get to http://mysql.com/products/classic/ which explains “MySQL Classic Edition is the ideal embedded database for ISVs, OEMs and VARs developing read-intensive applications using the MyISAM storage engine.”

 

So calm down, folks. 

 

How to Stop Hating MySQL

Last week my presentation from PICC10 on “How to Stop Hating MySQL” was uploaded to YouTube.  PICC10 (Professional IT Community Conference) was held May 7-8 in New Brunswick, NJ.  The video can be found on youtube at http://www.youtube.com/user/TomOnTime#p/u/6/xPHJnB0COjo and the slides can be downloaded from http://technocation.org/files/doc/2010_05_stophatingmysql.pdf


Pictures from OpenSQLCamp Boston 2010

I have not seen folks posting any pics they took at OpenSQLCamp here, so I will post the few I have — I have a few pictures of the Indexing Panel on Saturday (the participants and some audience shots), and one picture of Matt and Monty during the Friday night social event.  You can see them here.

(The indexing panel included MySQL, Postgres, Cassandra, CouchDB and MongoDB folks)

How I Find MySQL Fragmentation

Sarah Novotny recently posted InnoDB Tablespace Fragmentation – Find it and Fix it, so I thought I would share how I determine MySQL fragmentation.  This works for MyISAM and InnoDB tables, and will work whether or not innodb_file_per_table is in use.

The basic idea is to compare the size of the file(s) on disk with the size in the database metadata.  The DATA_LENGTH and INDEX_LENGTH fields of the TABLES table in the INFORMATION_SCHEMA database have size information calculated by the storage engine.  There are a few points to note:

  • Sometimes querying INFORMATION_SCHEMA can take a long time, and in rare cases has been reported to crash MySQL.  I have not seen this, however I have written about how you can tell how risky an INFORMATION_SCHEMA query might be.
  • MyISAM tables – In MyISAM, the metadata is exact, and the files on disk are split into data (.MYD) and index (.MYI).  Compare the size of the .MYD to DATA_LENGTH and compare the size of the .MYI file to INDEX_LENGTH.  You can calculate exactly how fragmented each table is, and decide which tables to defragment.  Sample query:

SELECT (DATA_LENGTH+INDEX_LENGTH)/1024/1024 AS sizeMb FROM INFORMATION_SCHEMA.TABLES WHERE ENGINE='MyISAM';

  • InnoDB tables when using a centralized file for data and indexes (aka “not using innodb_file_per_table”) – Because all of the data and indexes are in one file, you have to compare the sum of the size of all InnoDB tables with the size of the ibdata file(s).  However, because InnoDB has estimated metadata, you cannot use a direct comparison.  What I do is get a percentage difference between the size on disk and the size reported by the metadata, and if it’s more than about a 20% difference, then the tablespace is likely fragmented enough to justify defragmentation.  Sample query:

SELECT SUM(DATA_LENGTH+INDEX_LENGTH)/1024/1024 AS sizeMb FROM INFORMATION_SCHEMA.TABLES WHERE ENGINE='InnoDB';

  • InnoDB tables that use separate files for metadata versus data and indexes (aka “using innodb_file_per_table”) – As explained in the previous example, InnoDB uses estimated size for metadata, not exact numbers like MyISAM.  So again I figure out the percentage difference and if it’s more than 20% then I will recommend defragmentation.  Sample query:

 

SELECT (DATA_LENGTH+INDEX_LENGTH)/1024/1024 AS sizeMb FROM INFORMATION_SCHEMA.TABLES WHERE ENGINE='InnoDB';

 

Note that every database is different.  I use the 20% figure as a guideline, not a hard-and-fast rule.  It really depends on how “wrong” InnoDB’s estimate of the size is.  
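
Putting the pieces together for the innodb_file_per_table case, a rough sketch of the comparison (the MySQLdb driver, the credentials and the 20% threshold are assumptions to adapt to your environment):

import os
import MySQLdb   # or any DB-API driver; adjust connection details to your environment

conn = MySQLdb.connect(host='localhost', user='monitor', passwd='secret')
cur = conn.cursor()

cur.execute("SELECT @@datadir")
datadir = cur.fetchone()[0]

cur.execute("""SELECT TABLE_SCHEMA, TABLE_NAME, DATA_LENGTH + INDEX_LENGTH
               FROM INFORMATION_SCHEMA.TABLES WHERE ENGINE='InnoDB'""")

for schema, table, meta_bytes in cur.fetchall():
    path = os.path.join(datadir, schema, table + '.ibd')   # innodb_file_per_table layout
    if not os.path.exists(path):
        continue                     # shared-tablespace table, or not on this host
    disk_bytes = os.path.getsize(path)
    pct = 100.0 * (disk_bytes - meta_bytes) / disk_bytes
    if pct > 20:                     # the ~20% rule of thumb from this post
        print('%s.%s looks fragmented: %.0f%% difference' % (schema, table, pct))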

Fragmentation occurs when variable-length data is updated, and when variable-length rows are deleted.  You can probably guess which tables have the most fragmentation if you know what queries are being run.  If you want to get a better sense of how often updates and deletes are being run, I highly recommend using mk-query-digest against a sample set of binary logs.
