Liveblogging: Cassandra Internals

Cassandra Internals by Gary Dusbabek of Rackspace

Questions?

What’s the best way to access data if you’re running a program in the same JVM as Cassandra? — will talk about it during StorageProxy section of the talk

Performance characteristics of using MMAP vs. not using it? – won’t cover it.

When does repair happen? will talk about it during repair part of the talk

How do Snitch and replication strategy work together? — will discuss though there is no slide on it.

Ring services – services that go throughout the ring. These are in a class called StorageService.

Storage services – things that happen locally. In a class called StorageProxy.

The cassandra executable in /bin executes cassandra.in.sh, which does:

– sets $CLASSPATH

– looks for the .jar files

– sets $CASSANDRA_CONF (mandatory, where yaml file lives)

– sets $CASSANDRA_HOME (not mandatory)

then it looks for another file [didn’t get what it was] which:

– determines heap size

– sets max heap size by default to 1/2 available memory

– sets the size for the young generation for Java GC

– sets “a whole bunch of other -X options for Java”

… then it goes to the main() class, org.Apache.Cassandra.Thrift.CassandraDaemon, which:

extends AbstractCassandraDaemon, the guts of the startup sequence. Has a method called setup(), raises config file from a Database Descriptor class.

“Database Descriptor is an awful class.”

– loads yaml file, reads into a config object, gets all the settings.

– then calls DatabaseDescriptor.loadSchemas() and loads the schema based on the last versionID, and sets them up to store them in the system column families (in the system datadir, schema column family).

– scrubs the data directories, takes out the trash (e.g. leftovers from compaction, bits and pieces from other SS tables)

– initializes the storage (keyspaces + CFs)

– Commit log recovery: CommitLog.recover() (row mutations)

– StorageService.initServer() and StorageService.joinTokenRing — this is where the magic of joining the ring happens

— starts gossip

— starts MessagingService

— Negotiates bootstrap

— knowledge of ring topology is in StorageService.tokenMetadata_ (btw underscore at end of a member variable means it’s old facebook stuff, b/c that’s their naming convention)

— partitioner is also here.

Configuration

– in DatabaseDescriptor, really a side effect of AbstractCassandraDaemon.setup

– reads config settings from yaml

– defines system tables

– changes regularly

It uses a static initializer, so we might end up making a change that happens when we’re not ready for it.

MessagingService

– Verb handlers live here (initialized from StorageService)

— main event handlers, haven’t changed much

– Socket listener

— 2 threads per ring node

– Message gateway

— MessagingService.sendRequestResponse()

— MessagingService.sendOneWay()

— MessagingService.receive() — when another node contacts you, this is the method that’s used to pass the message to a verb handler

– Messages are versioned starting in 0.8

— with IncomingTCPConnection

StageManager – fancy java ThreadPoolExecutor

– SEDA design: http://www.eecs.harvard.edu/~mdw/papers/seda-sosp01.pdf

Adding the API Methods

– open up cassandra.thrift file in the interface directory, this is where you describe methods and new data structures

– regenerate files with ant gen-thrift-java gen-thrift-py

– implement stubs: o.a.c.thrift.CassandraServer

StorageProxy – where local reads and writes happen.

– Called from o.a.c.thrift.CassandraServer

– write path changed in new version b/c of counters

— notion of WritePerformer

– eventually to Table and others

– for reads, there’s a local read path and remote read path

— Socket->CassandraServer. Looks at permissions, request validation, and marshalling.

ReadCommands created in CS.multigetSiceinternal, passed to StorageProxy — 1 per key.

StorageProxy iterates over the ReadCommands, then runs StorageProxy.read(), .fetchRows(), determines endpoints.

Locally, StorageProxy:

– READ stage executes a LocalReadRunnable

– True read vs. digest

– Table, ColumnFamilyStore

Remotely, StorageProxy:

– serializes read command

– Response handler

– Send to remote nodes

ReadRepair happens in StorageProxy.fetchRows()

Writing — follows similar pattern to reads — there is a local path and remote path.

– The marshalling turns into row mutations in CS.doInsert()

– StorageProxy.sendToHintedEndpoints

– RowMutation – one key per row (several CFs), so it calls ColumnFamilyStores.apply() to update the memtables.

RowMutation is serialized into a Message.

Then he goes into the challenges of working with the code, which I won’t reproduce here.