Liveblogging at OSCON Data: MongoDB Schema Design

Dwight Merriman gives “MongoDB Schema Design” at OSCon Data 2011.

@dmerr

 

RDBMS / MongoDB

relational / document-oriented

database / database

table / collection

row / JSON (BSON) document

index / index

column / field (dynamic/not predeclared)

SQL / Mongo query language (JSON)

Join / Embedding & Linking

Primary Key / _id field

 

Schema design is coupled with what you want to do:

– Dynamic queries

– Secondary indexes

– Atomic updates

– Map Reduce

Considerations:

– no joins

– atomic operations are at the single document level only

– types of queries/updates to do

– sharding

– speed

 

This is the command-line mongo shell, but all of this can be done from any (modern) programming language.

post = { author: "Herge",
  date: new Date(),
  text: "Destination Moon",
  tags: ["comic", "adventure"] }

> db.posts.insert(post)

“posts” is the collection name.  Documents are analogous to rows but can be more complex.  Documents for one collection are grouped together.

> db.posts.find()

{ _id: ObjectId("4c4ba5c0672…"),
  author: "Herge",
  date: new Date(),
  text: "Destination Moon",
  tags: ["comic", "adventure"] }

_id must exist and must be unique. If you don't create an _id, one is made for you: a 12-byte BSON ObjectId, shorter than a normal UUID, which is fine because it doesn't need to be globally unique, just unique on this db cluster.
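For instance, a minimal sketch (the explicit _id value here is hypothetical):

> db.posts.insert({_id: 1, author: "Herge"})  // supply your own _id
> db.posts.insert({author: "Herge"})          // an ObjectId is generated for you
> ObjectId()                                  // the shell can also mint one directly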

Secondary index, on “author”:

> db.posts.ensureIndex({author: 1})  // 1 means ascending, -1 means descending

> db.posts.find({author: "Herge"}).explain()  // shows you the explain plan
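The explain output is abbreviated here, and the exact fields vary by version, but it looks something like:

{ "cursor" : "BtreeCursor author_1",
  "nscanned" : 1,
  "n" : 1,
  "millis" : 0 }

"BtreeCursor author_1" confirms the index is being used; a full collection scan would show "BasicCursor" instead.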

 

Multi-key indexes

// build an index on the "tags" array

> db.posts.ensureIndex({tags: 1})

Arrays are exploded: every element of the array is indexed, each added separately to the index's B-tree.

> db.posts.find({tags: "comic"})

When you query an array field, MongoDB assumes you mean you're looking for documents where some element of the array matches.

 

Query operators

Conditional operators:

 

$ne, $in, $nin, $mod, $all, $size, $exists, $type, $lt, $lte, $gt, $gte
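A few of these in action (a sketch using the posts collection from above):

> db.posts.find({tags: {$all: ["comic", "adventure"]}})  // tags contains both values
> db.posts.find({author: {$in: ["Herge", "Kyle"]}})      // author is any of these
> db.posts.find({author: {$ne: "Kyle"}})                 // author is anything but Kyle
> db.posts.find({tags: {$size: 2}})                      // tags has exactly two elements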


Update operators:

$set, $inc, $push, $pop, $pull, $pushAll, $pullAll
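For example (a sketch; the "moon" tag and featured field are hypothetical):

> db.posts.update({text: "Destination Moon"}, {$push: {tags: "moon"}})  // append to the array
> db.posts.update({text: "Destination Moon"}, {$pull: {tags: "moon"}})  // remove matching elements
> db.posts.update({text: "Destination Moon"}, {$set: {featured: true}}) // set a field, creating it if absent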

Extending the Schema:

new_comment = { author: "Kyle",
  date: new Date(),
  text: "great book!",
  votes: 5 }

> db.posts.update({ text: "Destination Moon" },  // this is the WHERE filter
    { $push: { comments: new_comment },          // do this
      $inc: { comments_count: 1 } })             // and do this

If you $push onto a comments array that isn't there yet, it is created without a problem.

 

> db.posts.ensureIndex({"comments.author": 1})
> db.posts.find({"comments.author": "Kyle"})
> db.posts.find({"comments.text": "good book"})
This is the "dot" operator, for reaching into embedded documents; note that dotted keys must be quoted.

 

Find all posts with a comment that has more than 50 votes:
> db.posts.find({"comments.votes": {$gt: 50}})
Not as rich as the full set of operators in SQL, but it's pretty good, and more concise than SQL. Over time, more expressions will be added.
Find the same posts, ordered by author ascending:
> db.posts.find({"comments.votes": {$gt: 50}}).sort({author: 1})
There are no functional indexes (indexes on functions of fields).
If you index a field that some documents don't have, those documents are indexed under null (which is necessary because not all documents have the same fields).
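To tell a missing field apart from an explicit null, use $exists (a sketch):

> db.posts.find({comments_count: null})             // matches docs where the field is null or missing
> db.posts.find({comments_count: {$exists: false}}) // matches only docs that lack the field entirely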
From a schema design standpoint, the point of MongoDB is to make the documents rich. He puts up an example of a sales order with many line items, an address field containing name, street, and zip, and a cc field containing number and expiration date.
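Something along these lines (a sketch reconstructing the slide; the field names are assumptions):

order = { _id: 1001,
  line_items: [ {sku: "…", qty: 2, price: 9.99} ],
  address: {name: "…", street: "…", zip: "…"},
  cc: {number: "…", exp: "…"} }

One document holds the whole order, so reading or updating the order touches a single record.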
There is model inheritance. For instance, if you have:
> db.shapes.find()
{_id: 1, type: "circle", area: 3.14, radius: 1}
{_id: 2, type: "square", area: 4, d: 2}
{_id: 3, type: "rect", area: 10, length: 5, width: 2}
All shapes have an area, but the other dimensions differ by shape.
> db.shapes.find({radius: {$gt: 0}})
That automatically finds only the circles.
Note that this avoids the need to join for 1:many and many:many relationships, as in the relational model.
That was embedding; now let's talk about linking, which is done client-side.

 

So, for instance, a one-to-many relationship might look like this: in addition to the collection for posts, a collection for authors with author info.
// collection authors
{ _id: "Herge",
  email: "h@foo.com",
  karma: 14.142 }
> var p = db.posts.findOne()
> var author = db.authors.findOne({_id: p.author})
> print(author.email)
If it's a "contains" relationship, you want to embed.
If you need more flexibility than that, link.
Rich documents are usually easy to query

 

Rich documents are great for performance

 

Rich documents give more atomicity capability:
{ account: "abc",
  debits: 21,
  credits: 11 }
> db.stuff.update({account: "abc"},
    {$inc: {debits: 21, credits: -11}})
(There is no $dec operator; decrement by $inc-ing a negative value. Both fields change in one atomic update because they live in one document.)
Caching is based on 4 KB pages, so very small documents spread across many collections can be a problem: each document you pull in drags a whole page into the cache.
Trees in MongoDB:
{ comments: [
    { author: "Kyle", text: "…",
      replies: [
          { author: "Fred", text: "…",
            replies: [] }
      ] }
] }
Mongo doesn't search recursively, so while this nesting is great for display, it's not great for search.
> t = db.mytree;
> t.find()
{ "_id" : "a" }
{ "_id" : "b", "ancestors" : [ "a" ], "parent" : "a" }
{ "_id" : "c", "ancestors" : [ "a", "b" ], "parent" : "b" }
{ "_id" : "d", "ancestors" : [ "a", "b" ], "parent" : "b" }
{ "_id" : "e", "ancestors" : [ "a" ], "parent" : "a" }
{ "_id" : "f", "ancestors" : [ "a", "e" ], "parent" : "e" }
{ "_id" : "g", "ancestors" : [ "a", "b", "d" ], "parent" : "d" }
> t.ensureIndex( { ancestors : 1 } )
> // find all descendants of b:
> t.find( { ancestors : "b" } )
{ "_id" : "c", "ancestors" : [ "a", "b" ], "parent" : "b" }
{ "_id" : "d", "ancestors" : [ "a", "b" ], "parent" : "b" }
{ "_id" : "g", "ancestors" : [ "a", "b", "d" ], "parent" : "d" }
> // get all ancestors of f:
> anc = db.mytree.findOne({_id: "f"}).ancestors
[ "a", "e" ]
> db.mytree.find( { _id : { $in : anc } } )
{ "_id" : "a" }
{ "_id" : "e", "ancestors" : [ "a" ], "parent" : "a" }
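Inserting a new node under this scheme means copying the parent's ancestors array and appending the parent (a sketch; the new _id "h" is hypothetical):

> var parent = t.findOne({_id: "g"})
> t.insert({_id: "h", parent: "g", ancestors: parent.ancestors.concat("g")})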
The document size limit is 16 MB per document. It was 4 MB, and will probably keep increasing along with Moore's Law. The limit is arbitrary, purely a safety measure.
BSON is not compressed, because we want to be able to scan it quickly.
You can do queueing too: use the findAndModify command to find the highest-priority job and atomically mark it in-progress. See http://www.mongodb.org/display/DOCS/findAndModify+Command
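A sketch of the pattern, assuming a jobs collection with priority and state fields:

> db.jobs.findAndModify({
    query: {state: "ready"},
    sort: {priority: -1},
    update: {$set: {state: "in-progress"}}
  })

It returns the job it claimed, and because the find and the update happen as one operation, two workers can't grab the same job.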
The operation is fast; however, the entire document is locked while it runs. Not ideal, but concurrency is getting better and better in MongoDB.
MongoDB tries to update in place; if you grow a document so that it no longer fits in its allocation unit, the document has to be deleted and reinserted, which is expensive. Hence the allocation unit: an adaptive padding factor computed per collection.