subtitle: How to use awk to filter binlogs for specific tables
The task: Find out how many inserts were done to a set of tables over a period of time. Specifically avg inserts/min, avg inserts/hour, avg inserts/day, for the past week. Seems simple, right?
I was surprised that there was no good way to get what was done to a specific table, for data gathering AND forensic purposes!
These tables are be deleted from, so just using rows and timestamps is not valid. We have binary logs, which log when an insert was done. My first thought was to use pt-query-digest, since the tool has a “Tables” section in the output, so it already parses out which tables are in a query.
Sadly, tables is not an attribute I can --filter
on, at least not in my version of pt-query-digest. (insert sad trombone here)
My second thought was to see if mysqlbinlog could filter for records only in that table. No such luck.
So, I decided to use awk. With awk I could define a “record” as “separated by the string ‘# at'”, match for table names, and print out the separator and record.
In other words, I could parse binlogs for a specific match, and print the entire record. Because each record is a variable size (maybe the record is 5 lines, maybe it is 7, maybe it is even more!)
To get MySQL binlog events on a set of tables on a unix server:
mysqlbinlog $FILE | awk -vRS='# at' '/TABLE1|TABLE2/ { print RS $0 }' | grep -v ^$ > filtered_binlog.txt
Explanation (making no assumptions about unix knowledge):
mysqlbinlog $FILE
– parses the mysql-bin.xxxxxx file
| awk
– sends the parsed file to awk
-v
– means “set this variable”
RS
– record separator
'# at'
– the record separator string, which is the beginning of each record.
/TABLE1|TABLE2/
– a regular expression matching either TABLE1 or TABLE2 – you may decide you want your string to be something like /FROM TABLE1|FROM TABLE2|JOIN TABLE1|JOIN TABLE2/ – it depends on your queries.
print RS $0
– prints both the record separator (RS) and the record ($0). If you just print the record, the default output record separator (ORS) is the empty string, so you will not get a good file.
| grep -v ^$
– take out blank lines, grep -v means “all lines except” and ^$ is start of line (^) followed by end of line ($). If there are blank lines, pt-query-digest will not work on this file.
From here, filtered_binlog.txt can be parsed and analyzed with tools like pt-query-digest or mysqlbinlog.
Edited to add: Commenters point out that this works for STATEMENT based binlogs. If your binlogging format is ROW, you can use mysqlbinlog –verbose (or -v) to get the reconstructed SQL statement, which includes the table name.
What happen when binlog_format is ROW?
Great question! I am working with binlog_format=STATEMENT, so I didn’t test out ROW. Regardless, mysqlbinlog’s documentation does not have any flags to deal with filtering by table, so there would still have to be processing done.
Thanks for the awk trick! It works great with ROW logs as well, you just need to add the -v option to mysqlbinlog so that the table names appear in comments.