Backup – Page 6 – Sheeri Dot Org

Open Source and Money

Matt Asay wrote an article about open source leakage. It’s quite good, and got me thinking.

First I thought, “Open source companies do not ‘lose’ revenue to non-paying customers, they just do not gain revenue from them.” But that’s based on the model of open-source software I have in my head that open source software usually starts out as a free, collaborative effort, and if enough folks get enough steam and come up with a business model (aka “a way to get paid”), then they form a company around the open source software.

Simplifying that model: open source software is free until it’s not.

Saying there is leakage does not do justice to the fact that the river flowed freely until the company came along and dammed up the river. Sure, maybe there’s a big leak, but there’s a lot more not leaking than there is leaking.

But open source != free. And it’s not required, either.

Take a for-pay e-book. You buy a license for a personal copy of a book, and read it. You’re not supposed to make copies of the e-book, or redistribute it, etc under the terms of your license.

However, you can “delve into the source code” of an e-book. You cannot change it and redistribute it claiming you authored it. You can, however, change the words in the book to make it more meaningful for yourself. There’s nothing to stop you from annotating the work. The source is open — all the words are there for you to play with.

Now, open source is like that e-book. There’s nothing that says open source HAS to be free. By convention, it has been. Patents are good for keeping secrets and making money. The open source movement shuns patents. But they’re not shunning the making money. They’re shunning the secretive nature of it.

I once had a housemate who was vegan, whose brother owned a restaurant 3,000 miles away. She made the best vegan pancakes, and refused to give out the recipe because it was her brother’s secret recipe. Now, vegan pancakes are not that complicated. There are about 5 ingredients that could go into them. Why the need for the secret? Because her brother would lose business? Restaurants produce cookbooks all the time; I doubt business would die if the recipe got out.

And that’s what open source is all about — “I have this great recipe for vegan pancakes, and I want to share it with you.”

Let me be clear: I think that open source companies deserve to be paid for their work. Much of the time the products are excellent. That does not mean they’re bug-free. (I live in the United States, and I think it’s one of the best countries to live in, but that does not mean we do everything right….far from it!) Most of this is a semantic rant.

I find it amusing that it used to be difficult to convince big companies that open source was good, because upper management equated free with bad. Now that we’ve convinced some of them, we’re upset that it’s difficult to convince big companies that they should pay for something we give them for free.

I think MySQL actually has a sane licensing policy, and I think they’re going in the right direction with MySQL Network. Having free software and for-pay technical service and support seems like a good mix….for MySQL. I can certainly see that being abused by a company that has a bad product, intentionally, to get more $$ out of customers because they are forced to get support — much like Remedy requires lots of customization before it actually can work. MySQL is much better than that.

I think MySQL in particular would do well to offer “Optimization Consulting” for a fee. I know they offer that already, but particularly call it that, as I am always hearing about companies looking for a MySQL consultant for a few weeks to help them optimize their servers.

MySQL Movie Magic

Weta Digital, the company on Peter Jackson films
Heavenly Creatures (1994) nominated for
The Frighteners (1996)
In 1997, Peter Jackson started working on King Kong, but Universal canned it because there were a lot of monster and disaster movies.
Contact (1997) with Robert Zemeckis — the zero-gravity ride was done by Weta Digital
Then they did the Lord of the Rings trilogy (2001-3)
Van Helsing (2004) (just a few bits and pieces)
I, Robot (2004) nominated for best digital effects — technology for armies in Lord of the Rings was used here.
King Kong (2005) and trying to get what Peter Jackson wanted.

(make a chart!)
Size of a movie is based on a shot (camera does not cut away). Visual effects movie typically had 500-1000 shots.

Year	Movie	Shots	Processors	Notes
1994	Heavenly Creatures	30	1
1996	The Frighteners	450	2?
1997	Contact	48	60
2001	Fellowship of the Ring	450	384	More processors needed for the massive armies in the prelude, and all the CG creatures
2002	The Two Towers	760	1400	More armies, the Ents, more fantastic creatures including Gollum
2003	Return of the King	1400	3200	1/2 the movie, 90 minutes, was effects!

For King Kong (2005) they made Skull Island and digitized 1930s New York. Peter Jackson gets seasick, so all the water was digitally added.

300MB per second of film! how is it archived, after scanning the film & digitizing it? 1 petabyte of data for LOTR and King Kong — 5-6,000 tapes.
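
As a quick sanity check on those archive numbers, here is a back-of-the-envelope calculation. It assumes decimal units (1 petabyte = 1,000,000 gigabytes) and an even spread of data across the tapes; neither assumption comes from the talk.

```python
# Back-of-the-envelope: average data per tape if 1 PB sits on 5,000-6,000 tapes.
# Assumptions (not from the talk): decimal units, data spread evenly over tapes.
PETABYTE_IN_GB = 1_000_000


def gb_per_tape(total_gb: float, tapes: int) -> float:
    """Average gigabytes stored per tape."""
    return total_gb / tapes


low = gb_per_tape(PETABYTE_IN_GB, 6000)   # more tapes means less data per tape
high = gb_per_tape(PETABYTE_IN_GB, 5000)
print(f"{low:.0f}-{high:.0f} GB per tape")
```

So each tape would hold somewhere in the range of a couple hundred gigabytes, if the assumptions hold.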

Artists get to work on the online copy. 120 Terabytes of storage isn’t enough to store all the data, so it’s copied from the archive to a server. 100G ethernet even to desktops. 700 linux workstations, win and mac boxes around too. 10G ethernet to connect rooms together [sic]. High-density blade servers. After visual effects are rendered, they have to be put back to film, using red, green and blue lasers and a spinning mirror to burn film.

Wow! Pretty neat.

He showed us the shots of the way they built New York, including the almost (or just, I forget) finished Empire State Building. Also, he had many shots of the studio and outdoors with green screens, and it was just fabulous. And how they animated Kong and how it was different from Gollum.

In 1999 Weta used MySQL 3.22 as the backend for an online recruiting system. They migrated from 3.23 to 4.0 in 2005, and to 5.0 in 2006.
5 production machines
10 replicas (replication)
100 dbs
thousands of tables
millions of rows

MySQL helps with
production management (who’s doing what)
HR System
User database and access control for all the different systems
System monitoring (nagios, internal tools, using MySQL as its backend)
Theater, conference rooms and event booking systems
Polls for employees
Online stores (for employees to buy movie swag)
Internal auction site

Why do they use it?
Simple
Reliable (hardware crashes, but db didn’t. No lost data to date)
Scalable

MySQL at Weta
The Cluster — persistent db connections from webservers are 2/3 of db connections into cluster. 50-100 cxns per second peak, up to 50MB/sec coming out. But they’re not running on high-end hardware, just using the old rendering hardware.
The Monster — one monster db. 40 cols, sparsely used, ENUMs that need to be updated all the time, 20 useless indexes. 750,000 rows, 2/3 are meaningless. No normalization.
The Work Horse — One db with dedicated hardware — the disk monitoring system. 30 file servers, every few hours they need to know updated disk space stats (because so much disk can be used by folks). Computing that stuff takes a lot of CPU. up to 3,000 queries/sec as it compares new data with stored data and updating if necessary.
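
The "compare new data with stored data and update if necessary" pattern from The Work Horse can be sketched roughly as below. This is only an illustration: it uses Python's built-in sqlite3 as a stand-in for MySQL, and the table name `disk_usage`, its columns, and the server names are all hypothetical.

```python
import sqlite3


def store_if_changed(conn, readings):
    """Compare fresh readings with stored ones and write only rows that differ.

    Returns the number of rows inserted or updated.
    """
    written = 0
    for server, used_gb in readings.items():
        row = conn.execute(
            "SELECT used_gb FROM disk_usage WHERE server = ?", (server,)
        ).fetchone()
        if row is None:
            # First time this file server is seen: insert it.
            conn.execute(
                "INSERT INTO disk_usage (server, used_gb) VALUES (?, ?)",
                (server, used_gb),
            )
            written += 1
        elif row[0] != used_gb:
            # Stored value is stale: update it.
            conn.execute(
                "UPDATE disk_usage SET used_gb = ? WHERE server = ?",
                (used_gb, server),
            )
            written += 1
    conn.commit()
    return written


# Hypothetical schema and sample data, in-memory for the sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE disk_usage (server TEXT PRIMARY KEY, used_gb INTEGER)")

first = store_if_changed(conn, {"fs01": 900, "fs02": 450})   # both rows are new
second = store_if_changed(conn, {"fs01": 900, "fs02": 470})  # only fs02 changed
print(first, second)
```

Most of the traffic in a scheme like this is reads; writes only happen when a number actually moved.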

ShotInfo
Tracks thousands of shots over multiple projects
Tracks all cuts and edit changes
Tracks all the plates and film rolls (so you can find a bit of film you want to recreate/duplicate)
Tracks assignments
Data originates from FileMaker, so normalization isn’t great, field names aren’t consistently named.
One way mirror

ShotSub
Key system
Shot review system
Tracks work in progress
Visual History
How they know where they are in a shot at a given time
35,000 submissions per month for King Kong!

Disk Space Management System
Load balances data
tracks data usage
looks like a normal filesystem, and must also be cross-platform
Global Name Space Distributed File System
Transaction based (like a filesystem!)
Millions of allocations, thousands created per day.

Weta’s Future with MySQL:
refactor databases and code
More scalability, more reliability, and less simplicity.
Multi-Master clustering
Federated Database servers
64-bit platforms
Faster hardware

Falcon: the new Transactional Storage Engine

Jim Starkey

Falcon is based on the Netfrastructure db engine
Netfrastructure has been deployed in mission critical apps for >4 years.
Extended and integrated into the MySQL environment.

What Falcon is NOT:
InnoDB clone
Firebird (open source derivative of Interbase db that Jim wrote years ago)
Firebird clone
Standalone Database Management System (it was, inside of the Netfrastructure engine)
Netfrastructure (Netfrastructure is much more, with a JVM and search, though these features may roll out later)

What Jim’s learned in 20 years
Relative to CPU and memory, disks are slower than they were 25 years ago.
MVCC = multi-version concurrency control (Jim named it “multi-generational”, but someone changed it to “version”)
Putting record versions on disk is problematic
Web applications are better and for the future (religion) [I agree, though, for portability]
People have more important things to do than tune databases

Claim: Falcon is the engine design for the next 20 years.
Goals:
Exploit large memory for more than just a bigger cache
Use threads and processors for data migration
Eliminate tradeoffs, minimize tuning
Scale gracefully to very heavy loads

Basic Architectural Model:
Incomplete in-memory db with backfill from disk
2 caches: 1) traditional LRU page cache for disk and 2) larger row cache with age group scavenging
Serial log for single write group commits — single write-ahead log.
Multi-version in memory, single version on disk
all transaction states are in memory with automatic overflow to disk
Data and indexes are 1 file plus log files (MySQL does not do this, but most other db servers do)
future: blob repositories (put them in different area of db); multiple page spaces

Basic model is MVCC. It will be extended for relaxed consistency (but why would you want to do that?!?!), and will be extended for serializable.
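
For anyone who has not met MVCC before, here is a minimal sketch of the general idea in Python. It is a generic illustration, not Falcon's implementation: the version list, the transaction ids and the `read` helper are all hypothetical.

```python
def read(versions, snapshot):
    """Return the newest value written by a transaction the reader can see.

    versions: list of (writer_txn_id, value) pairs, oldest first.
    snapshot: set of txn ids that had committed when the reader started.
    """
    for writer, value in reversed(versions):
        if writer in snapshot:
            return value
    return None  # the row did not exist yet as far as this reader is concerned


# Hypothetical row with two versions, written by transactions 1 and 3.
row_versions = [(1, "draft"), (3, "final")]

old_reader = read(row_versions, snapshot={1})      # started before txn 3 committed
new_reader = read(row_versions, snapshot={1, 3})   # started after txn 3 committed
print(old_reader, new_reader)
```

Readers never block writers here: an older reader simply keeps seeing the older version.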

Index Implementation:
Btree index with prefix compression — no difference in performance with primary or secondary indexes.
No data except key in index
2-stage index retrievals — index scan generates row bitmap, so you can get from 2 indexes before going to rows; records are fetched from disk in physical row order
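
The two-stage retrieval can be sketched as follows. This is a toy model under stated assumptions, not Falcon code: the indexes are plain dictionaries, the column names and row numbers are made up, and a Python integer stands in for the row bitmap (bit n set means row n matches).

```python
def index_scan(index, key):
    """Stage 1: scan one index and return a bitmap of matching row numbers."""
    bitmap = 0
    for row_number in index.get(key, ()):
        bitmap |= 1 << row_number
    return bitmap


def rows_in_physical_order(bitmap):
    """Stage 2: yield row numbers from the bitmap in ascending (physical) order."""
    row_number = 0
    while bitmap:
        if bitmap & 1:
            yield row_number
        bitmap >>= 1
        row_number += 1


# Hypothetical indexes: key -> row numbers, in whatever order the index stores them.
color_index = {"red": [7, 2, 9, 4], "blue": [1, 3]}
size_index = {"large": [9, 5, 2], "small": [4, 7]}

# Combine both index scans before touching any rows (AND of the two bitmaps).
candidates = index_scan(color_index, "red") & index_scan(size_index, "large")
matching = list(rows_in_physical_order(candidates))
print(matching)
```

Because the bitmap is walked from the lowest bit up, rows come out in row-number order no matter how the index entries were ordered, which is what lets the fetch go through the disk in one direction.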

Data Flow:
Uncommitted row data is staged in memory
On commit, the txn copies row data to the serial log, which is written to disk (not committed until the OS says the page was written)
post commit, dedicated thread copies row data from serial log to data pages
Page cache periodically flushed to disk (except blob data)
BLOB data is queued for write at blob creation, backed out on rollback — otherwise it wastes time putting it into log.
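
The data flow above (ignoring BLOBs) can be modelled with a small toy class. This is a sketch for illustration only: `ToyEngine` and its methods are hypothetical names, everything lives in memory, and an explicit `migrate()` call stands in for the dedicated post-commit thread.

```python
class ToyEngine:
    """Toy model of staged rows -> serial log -> data pages."""

    def __init__(self):
        self.staged = {}      # txn id -> {row id: data}; uncommitted, memory only
        self.serial_log = []  # committed row data, in commit order
        self.migrated = 0     # number of log entries already copied to data pages
        self.data_pages = {}  # row id -> data

    def write(self, txn, row_id, data):
        # Uncommitted row data is only staged in memory.
        self.staged.setdefault(txn, {})[row_id] = data

    def rollback(self, txn):
        # Nothing reached the log, so dropping the staged rows is enough.
        self.staged.pop(txn, None)

    def commit(self, txn):
        # A real engine would wait for the log write to be durable here.
        for row_id, data in self.staged.pop(txn, {}).items():
            self.serial_log.append((row_id, data))

    def migrate(self):
        # Stand-in for the dedicated thread that runs after commit.
        for row_id, data in self.serial_log[self.migrated:]:
            self.data_pages[row_id] = data
        self.migrated = len(self.serial_log)


engine = ToyEngine()
engine.write("t1", 1, "alpha")
engine.write("t2", 2, "beta")
engine.rollback("t2")                 # t2's row never reaches the log
engine.commit("t1")
before = dict(engine.data_pages)      # committed, but not yet migrated
engine.migrate()
after = dict(engine.data_pages)
print(before, after)
```

The point of the split is that a commit costs one sequential log write, and the scattered writes to data pages happen later, off the commit path.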

Data Reliability:
Physical structure protected by “careful write” — db on disk is ALWAYS valid and consistent; a page is written BEFORE the pointer to it is made. So worse comes to worse, you have an orphan page and NOT a null pointer. Orphaned pages will be found by
Atomicity protected by serial log — a transaction is committed when commit record hits the oxide.
The serial log is a “do” log for post-commit data migration, a “redo” log for post-crash recovery of data, and an “undo” log for post-crash resource recovery.
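
The "careful write" ordering is easy to see in a toy simulation. The sketch below is an illustration under assumptions, not Falcon's on-disk format: the "disk" is a dictionary, and the helper names are hypothetical. It writes the page first, publishes the pointer second, and simulates a crash in between.

```python
class CrashBeforePointer(Exception):
    """Simulated crash between the page write and the pointer write."""


def careful_write(disk, page_id, contents, crash_after_page=False):
    disk["pages"][page_id] = contents   # 1. write the new page first
    if crash_after_page:
        raise CrashBeforePointer
    disk["pointers"].append(page_id)    # 2. only then publish the pointer to it


def dangling_pointers(disk):
    """Pointers that refer to a page that was never written."""
    return [p for p in disk["pointers"] if p not in disk["pages"]]


def orphan_pages(disk):
    """Pages that were written but that nothing points to."""
    return [p for p in disk["pages"] if p not in disk["pointers"]]


disk = {"pages": {}, "pointers": []}
careful_write(disk, 1, "row data A")
try:
    careful_write(disk, 2, "row data B", crash_after_page=True)
except CrashBeforePointer:
    pass  # the "machine" went down between the two writes

print(dangling_pointers(disk), orphan_pages(disk))
```

After the simulated crash there is one orphan page and no pointer to a missing page, so the structure is still consistent; the orphan just wastes space until it is reclaimed.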

Jim’s Secret Agenda: not so secret anymore!
Replace varchar with string (varchar is an ABOMINATION left over from punch cards)
Replace tiny, small, medium, big ints with “number” (set limits if you want…)
Adopt a security model that’s useful for app servers (app server logs on to db server and THEN sets up security, but by then it’s too late. 3rd party client security control should have been put there 15 years ago)
Introduce useable row level security (filter sets, so querying does not accidentally give out the wrong info to the wrong people)
Teach the database world the merits of free context search, that everyone else already knows. (do you type a SELECT statement into Google?)

“The difference between theory and practice: in theory, there isn’t any difference.”

Extending MySQL Made Easy: Plugin API FULLTEXT parsers, Storage Engines, and More

Sergei Golubchik

The Plugin API is new in MySQL 5.1, so you can plug in your own functionality.

Built-in versioning
Easy to maintain and distribute
Generic — allows you to load any functionality into mysqld

Plugins can add new status variables for SHOW STATUS
For the future, plugins will allow you to add new commandline options, new server variables, and new SQL keywords.

Plugin administration:
INSTALL PLUGIN foo SONAME 'bar.so'
UNINSTALL PLUGIN foo
SHOW PLUGINS
INFORMATION_SCHEMA.PLUGINS
--plugin-dir=/path/to/dir

Plugin types:
Storage Engines
Fulltext Search Parsers
code changes text before it goes to the FULLTEXT data
can be used to search non-plaintext data formats, such as pdf, doc, mp3
can be used to parse Chinese and Japanese text.

Plugin types future:
UDFs: versioning, security, ease of use
Lang Modules for stored procedures (perl, php…)
Pluggable Authentication (ie, LDAP)
Fulltext Search Engines (replace the whole thing!)
Maybe new SQL commands?

Fulltext Parser plugin has to take the object, extract text, split into words, postprocess, and then the words are stored into the index. Currently the extract part does nothing because fulltext is used on strings only right now, and postprocessing is pruning out words < min length or > max length
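
The four stages described above can be mimicked in a few lines of Python. This is only a sketch of the pipeline, not the plugin API (real parser plugins are written in C against the server's structures); the function names and the length limits used here are illustrative assumptions.

```python
import re


def extract_text(obj):
    """Extraction step: a no-op here, since the input is already a string."""
    return obj


def split_into_words(text):
    """Split on non-word characters and lowercase everything."""
    return re.findall(r"\w+", text.lower())


def postprocess(words, min_len=4, max_len=84):
    """Prune words shorter than min_len or longer than max_len."""
    return [w for w in words if min_len <= len(w) <= max_len]


def parse_for_index(obj):
    """Run the whole pipeline and return the words that would be indexed."""
    return postprocess(split_into_words(extract_text(obj)))


words = parse_for_index("The root user owns /etc/passwd on this box")
print(words)
```

A parser for PDFs or MP3 tags would only need to replace the extraction step; the splitting and pruning stay the same.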

So, here is a small parser plugin that allows external files to be indexed: you give MySQL the path.

Make a new directory (say, from_file). copy the template files for the fulltext files.

mv plugin_example.c from_file.c
There’s a Makefile.am, so change the libdir and the SOURCES
Change AC_INIT in configure.in to use from_file

look for mysql_declare_plugin — it contains
type of plugin
descriptor (what’s different for different types)
name
author
description
init function (on load)
de-init function (on unload)
version
status variables

Chapter 28.2 in MySQL 5.1 documentation has a complete walk-through of all the structures.

automake using your new .am file, and make install, and then load the plugin.

on http://forge.mysql.com you can find plugins at Database software -> MySQL specific -> Plugins

SHOW PLUGINS shows name, status (ie, active), type (Storage Engine or fulltext parser) and library (filename, if blank, it’s built in).

To use,
CREATE TABLE t1 (file text, fulltext(file) with parser from_file);
insert into t1 values('/etc/passwd'),('/etc/services');
select * from t1; shows we have filenames only in the table.
select * from t1 where match(file) against ('root'); will give the result of the filename.

If you try to uninstall a plugin that is in use by an open table, it will have a status of “deleted” but the table that’s open will still use it. After a FLUSH TABLES or a closed connection, your table is invalid. 🙂 So be careful when uninstalling plugins: find the tables using them FIRST, drop the tables, and then uninstall.

This plugin does not load the data in the file every time a query runs. This plugin should be able to handle a load_file() for a filename OR a filename. If you change the file and need to reindex, you have to do REPAIR TABLE.

——————————
I loved having the example, and this stuff seems so easy to just implement. Sure, it’s the featureset itself that’s difficult, but …