Twitter Should Get Back to Basics

Twitter has had many outages recently. On May 17th, 2008, http://blog.twitter.com/2007/05/devils-in-details.html was posted; it says:

What went wrong? We checked in code to provide more accurate pagination, to better distribute and optimize our messaging system—basically we just kept tweaking when we should have called it a day. Details are great but getting too caught up in them is a mistake. I’ve been CEO of Twitter for two months now and this an awesome lesson learned. We’re seeing the bigger picture and Twitter is back. Please contact us if something isn’t working right (with Twitter that is).

(In other news, that post was made on May 17th but does not show up on http://blog.twitter.com, where it should appear between the May 16th and May 19th posts. I found a reference to it in other posts and had to search the site to find it.)

A real “awesome lesson learned” is “do not tweak production without testing first.” In every job I have had, I have first learned and then taught the concept of “test everything possible.” Twitter has not learned that concept yet, because http://blog.twitter.com/2008/05/not-true.html, posted on Tuesday, May 20th, states:

We caused a database to fail during a routine update early this afternoon.

As someone who has years of experience working with MySQL, and who before that was a systems administrator; as someone who was referred to as “the MySQL Queen” yesterday (by someone who wanted me to test their product, so yes, they were flattering me); I can assure you:

no matter how “routine” a change is, if you do it on production without testing it first, you are playing with fire, and 95% of the fires caused by not testing first are completely preventable.

I will repeat this, because repetition is important to learning concepts.

no matter how “routine” a change is, if you do it on production without testing it first, you are playing with fire, and 95% of the fires caused by not testing first are completely preventable.

With a proper testing environment, 19 out of 20 “whoops, didn’t expect THAT from a routine change!” issues are caught. And I can tell you from experience that “routine changes” often cause unexpected results.
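
To make that concrete, here is a minimal sketch of what I mean by testing a “routine” change first. It is in Python, using SQLite purely as a stand-in for a test copy of the real database; the table, the change, and the sanity checks are all made up for illustration. The idea: apply the change to a disposable copy inside a transaction, run a few checks, and only consider touching production if everything passes.

    import sqlite3

    def apply_change_with_checks(conn, change_sql, sanity_checks):
        """Apply a change inside an explicit transaction, run sanity checks,
        and roll back (reporting the failures) if anything does not pass."""
        cur = conn.cursor()
        cur.execute("BEGIN")
        try:
            cur.execute(change_sql)
            failures = []
            for description, check_sql, expected in sanity_checks:
                actual = cur.execute(check_sql).fetchone()[0]
                if actual != expected:
                    failures.append((description, expected, actual))
            if failures:
                cur.execute("ROLLBACK")
                return False, failures
            cur.execute("COMMIT")
            return True, []
        except sqlite3.Error as exc:
            cur.execute("ROLLBACK")
            return False, [("change failed to apply", None, str(exc))]

    # Run the "routine" change against a TEST copy first.
    # isolation_level=None gives explicit BEGIN/COMMIT/ROLLBACK control.
    test_db = sqlite3.connect(":memory:", isolation_level=None)
    test_db.execute("CREATE TABLE tweets (id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT)")
    test_db.execute("INSERT INTO tweets (user_id, body) VALUES (1, 'hello'), (2, 'world')")

    ok, failures = apply_change_with_checks(
        test_db,
        "ALTER TABLE tweets ADD COLUMN created_at TEXT",  # the "routine" change
        [("row count unchanged", "SELECT COUNT(*) FROM tweets", 2)],
    )
    print("looks safe to schedule for production" if ok else "do NOT run on production: %s" % failures)

The specific checks matter much less than the habit: the change gets exercised somewhere disposable before it ever touches production.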

Now, I was online during an outage, and http://twitter.com/home was showing their “site isn’t working” page for at least 3 hours between 2 and 5 am EDT yesterday (Tuesday, May 20th, 2008).

So… there is no read-only copy around that Twitter could use? Maybe I cannot tweet, but I should at least be able to read what was posted before!
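
To sketch what I mean, here is a toy example in Python; the fetch functions are hypothetical stand-ins, since I obviously do not know what Twitter’s data access layer looks like. The point is the shape of the fallback: if the master is unavailable, serve reads from a read-only replica and disable posting, rather than showing the “site isn’t working” page.

    # Hypothetical sketch: degrade to read-only instead of going dark.
    # The fetch_* functions are made-up stand-ins for a real data layer.

    class MasterUnavailable(Exception):
        pass

    def fetch_timeline_from_master(user_id):
        # Stand-in: pretend the master database is down for a "routine update".
        raise MasterUnavailable("master is unavailable")

    def fetch_timeline_from_replica(user_id):
        # Stand-in: a read-only replica can still answer reads.
        return ["an older tweet", "an even older tweet"]

    def home_page(user_id):
        try:
            tweets = fetch_timeline_from_master(user_id)
            banner = ""
        except MasterUnavailable:
            tweets = fetch_timeline_from_replica(user_id)
            banner = "Posting is temporarily disabled."
        return {"tweets": tweets, "banner": banner}

    print(home_page(42))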

Of course, since last week Twitter has done the opposite: often I can see the most recent 20 or so posts, but not anything prior. Now, I understand that it is hard to get all the histories for the people I follow. But that work only needs to be done once, and the result could then be cached as, say, “Posts from the people Sheeri follows on 5/20”. It would not be difficult, and I would be OK with the functionality changing such that “once you follow a new person, their tweets prior to when you followed them do not show up in the history.”
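
Here is a rough sketch of the caching I have in mind, again in Python, with a plain dictionary standing in for whatever cache would actually be used (memcached or otherwise); all the names and data are made up. The expensive merge of followed users’ histories happens once per user per day, and every later request for that day is a cache hit.

    from datetime import date

    # In-memory stand-in for a real cache (memcached, etc.).
    timeline_cache = {}

    def build_timeline(followees, tweets_by_user):
        """The expensive step: merge the histories of everyone the user follows.
        It only needs to happen once; the result can then be cached."""
        merged = []
        for user in followees:
            merged.extend(tweets_by_user.get(user, []))
        merged.sort(key=lambda t: t["posted_at"], reverse=True)
        return merged

    def cached_timeline(follower, followees, tweets_by_user, day=None):
        """Return the cached "posts from the people <follower> follows" for a day,
        building it only on a cache miss."""
        key = (follower, (day or date.today()).isoformat())
        if key not in timeline_cache:
            timeline_cache[key] = build_timeline(followees, tweets_by_user)
        return timeline_cache[key]

    # Toy data.
    tweets = {
        "alice": [{"user": "alice", "posted_at": "2008-05-20T01:00", "body": "hi"}],
        "bob":   [{"user": "bob",   "posted_at": "2008-05-19T23:00", "body": "yo"}],
    }
    print(cached_timeline("sheeri", ["alice", "bob"], tweets, date(2008, 5, 20)))
    # The second call hits the cache instead of redoing the merge.
    print(cached_timeline("sheeri", ["alice", "bob"], tweets, date(2008, 5, 20)))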

Alternatively, you could go the snarky route, as http://www.techcrunch.com/2008/05/20/twitter-something-is-technically-wrong/ does:
What would be great is if Twitter just moved their blog to another platform so that it doesn’t fail when users need it most.

I am not a huge user of Rails, but I will say that, given the content of the public announcements, the platform is not the problem. The code release process is the problem. Maybe there is “agile development” happening, with pair programming and code reviews, but there is not adequate testing.

Twitter: if you truly need scaling help, please ask for it; I know Pythian would be happy to help. However, if it really is as it seems, and basic good practice is not being followed, I would like to remind you that backups are really important too, just on the off chance that backups are not happening.
