BigTable Thoughts

So, Paul’s blog post pointing to Todd’s blog post got me thinking.

The main point Paul summarized was that duplicating data was a great way to scale, and used Todd’s reference to Flickr and how in their partition-by-user scheme, they put a comment in the commenter’s shard as well as in the commentee’s shard.

In my recent post about Twitter, I wrote:

Now, I understand that it is hard to get all the histories for the people I follow. But it only needs to be done once, and could then be cached — “Posts from who Sheeri follows on 5/20″. It would not be difficult, and I would be OK with the functionality changing such that “once you follow a new person, their tweets prior to when you followed them do not show up in the history.”

So using this thinking, every time someone I follow (say, @paulandstorm) makes a comment, it not only writes to their shard, but to mine. Now, that may not work given that the system also has to send messages at the same time, and that there can be numerous followers — dozens, hundreds, thousands.

The Flickr model works because it involves 2 writes to get the faster caching later, and there are more reads than writes. Twitter is more write-heavy, and likely has more writes than reads, considering that many folks do not visit an historical website to see their history.

This particular idea may not work for Twitter. But I’ve picked on Twitter enough….

I thought about livejournal. I’ve been a livejournal member since 2001 — after 2 months of writing my own journaling system with comments, I got wind that a system already existed, so started to use that.

Now, I can go and pick specific entries from specific days, or I can read my “friends list”. I specify my friends and livejournal dynamically populates pages of my friends list, with the amount of entries per page that I specify.

Livejournal could also use the idea presented above, as well as the concept of semi-dynamic data. Instead of dynamically generating the last, let’s say 20, entries of my “friends list”, livejournal could be making my friends list as it gets written to. A friend makes a post and it gets added to my shard, whether or not I read it. Once the count gets up to 20, a new cache page is generated.

Now, livejournal already has great caching, and has indeed had the growing pains Twitter is seeing. And for either livejournal or twitter to take advantage of these concepts, they would likely require a rewrite from the ground up. So it’s not that I am suggesting this. I just think it’s a great idea, and if you are working on a project, think of where it might be useful to apply…..again, it may not be applicable in all situations. Like Twitter, livejournals may have many “friends” so doing 100 or 1,000 writes every time a post is made may not actually be feasible.

So, Paul’s blog post pointing to Todd’s blog post got me thinking.

The main point Paul summarized was that duplicating data was a great way to scale, and used Todd’s reference to Flickr and how in their partition-by-user scheme, they put a comment in the commenter’s shard as well as in the commentee’s shard.

In my recent post about Twitter, I wrote:

Now, I understand that it is hard to get all the histories for the people I follow. But it only needs to be done once, and could then be cached — “Posts from who Sheeri follows on 5/20″. It would not be difficult, and I would be OK with the functionality changing such that “once you follow a new person, their tweets prior to when you followed them do not show up in the history.”

So using this thinking, every time someone I follow (say, @paulandstorm) makes a comment, it not only writes to their shard, but to mine. Now, that may not work given that the system also has to send messages at the same time, and that there can be numerous followers — dozens, hundreds, thousands.

The Flickr model works because it involves 2 writes to get the faster caching later, and there are more reads than writes. Twitter is more write-heavy, and likely has more writes than reads, considering that many folks do not visit an historical website to see their history.

This particular idea may not work for Twitter. But I’ve picked on Twitter enough….

I thought about livejournal. I’ve been a livejournal member since 2001 — after 2 months of writing my own journaling system with comments, I got wind that a system already existed, so started to use that.

Now, I can go and pick specific entries from specific days, or I can read my “friends list”. I specify my friends and livejournal dynamically populates pages of my friends list, with the amount of entries per page that I specify.

Livejournal could also use the idea presented above, as well as the concept of semi-dynamic data. Instead of dynamically generating the last, let’s say 20, entries of my “friends list”, livejournal could be making my friends list as it gets written to. A friend makes a post and it gets added to my shard, whether or not I read it. Once the count gets up to 20, a new cache page is generated.

Now, livejournal already has great caching, and has indeed had the growing pains Twitter is seeing. And for either livejournal or twitter to take advantage of these concepts, they would likely require a rewrite from the ground up. So it’s not that I am suggesting this. I just think it’s a great idea, and if you are working on a project, think of where it might be useful to apply…..again, it may not be applicable in all situations. Like Twitter, livejournals may have many “friends” so doing 100 or 1,000 writes every time a post is made may not actually be feasible.