Dev Blog

Changing The Engine While You’re Driving Down The Highway

by Owen on October 14, 2011

We recently ran into an interesting and difficult problem: how do we change a large, heavily-used table without violating our “avoid downtime if possible” mantra?

The following is a slightly expurgated version of the postmortem I sent the rest of the team. Table names have been changed to protect the guilty.

What Hotfixes?

In the Sinistar release’s post-release migrations list, there were two ALTER TABLE migrations slated to make structural changes to the customer and customer_login tables (changing the types of some columns, specifically). When we tested these migrations on a copy of the production system’s data, we discovered that they would have caused those tables to be inaccessible for two and five minutes respectively. Since customers are core to our applications, we determined that this qualified as “downtime” in the release plan.

I prototyped an alternate approach to the customer changes (outlined below) intended to make changes to the table without taking the app offline while they were happening. The alternate approach was considerably more complex than the original ALTER TABLE statement, as well as being considerably slower (~15 minutes for our largest shard, as compared to two minutes); we discussed whether it was worth taking the app offline at zero-dark-hundred on a weekend instead and concluded that we’d need to be able to make online changes to expensive tables eventually and opted to go ahead with the alternate approach as a learning experience.

How Did It Work?

The alternate approach we used relied on more MySQL-side tools to make the changes. Specifically, we:

  1. Created a new, empty version of the customer table (the “target”) table based on the current customer table plus intended structural changes. Structural changes to an empty table that the app does not use are quite cheap.
  2. Installed triggers on INSERT, UPDATE, and DELETE operations on the customer table that copied rows (using REPLACE) into the target table, applying the intended transforms to the affected columns.
  3. Installed a server-side procedure (“the migration procedure”) that:
    1. Selected all of the keys from the original customer table.
    2. For each key, copied the corresponding row from the customer table to the target table using REPLACE.
  4. Ran the migration procedure. (This part is where most of the time went.)
  5. Verified that the contents of the target table were in 1:1 correspondence with the contents of the original customer table, where the only permitted differences between corresponding rows were the changes intended by the original migration.
  6. Renamed the original customer table out of the way and renamed the target table to customer.
  7. Verified that nothing had gone horribly wrong in our apps.
  8. Dropped the renamed version of the original customer table, along with the triggers and the migration procedure.

The steps are grouped into four phases: prepare.xml (covering steps 1-3), execute.xml (covering step 4), cutover.xml (covering step 6) and cleanup.xml (covering step 8). The steps not covered by phases were performed semi-manually.

The changes to customer_login proceeded along the same paths.

What Went Right

Well, first of all, we noticed a potential app outage hazard before it went live.

We were able to back out and rewrite the migrations without affecting existing development environments, where the original ALTER TABLE might have already run, thanks largely to Liquibase‘s precondition support.

We ran through several versions of this plan, both on my development environment image and with QA (on our release candidate environment), which ferretted out a handful of subtle bugs in the migration procedure that would’ve damaged customer data. Our QA team’s extensive experience and automation tooling around our applications prevented any of those mistakes from going live.

Designing the hotfix in several phases permitted us to run the relatively safe phases over the weekend, with relatively little involvement from our operations team beyond monitoring.

The verification steps did not take out our applications in the process, and revealed a way to move at least some verification steps out of our applications’ scripts directory and into the core-dbs source tree.

Finally, it worked. Despite the relatively high complexity and the large collection of moving parts, this alternate plan worked very well. Rather than taking our applications offline, everything stayed up (and didn’t slow down too badly).

What Went Wrong, And What We Learned

Finding Problems Late is A Problem

We didn’t catch the initial problem before it went out to everyone’s development environments. I’ve habitually not worried about migration timing until code freeze, which means that potentially-troublesome migrations will have already run on our bleeding-edge environment (and possibly our release candidate environment) before we notice the problem. This lead to increased complexity in the alternate migration path’s Liquibase configuration – we made extensive use of preconditions to ensure that the alternate migration would be harmless on systems where the original ALTER TABLE had already run, which seems to have worked, but I’d rather not have had that problem in the first place as it’s another moving part that can have problems.

Concurrency is a Hard Problem

The first version of the migration procedure looked like this:

  1. Select all of the rows from the original customer table.
  2. For each row, insert that row into the target table using REPLACE.

The select and loop steps use a MySQL cursor to iterate over their results. While the MySQL documentation states that it’s undefined whether a cursor is over a snapshot of the query results or over the underlying tables, in practice we found that it was using a snapshot. Since the loop took considerable time to run, and since it (intentionally) did not prevent further changes to the original customer table from occurring after the snapshot, the loop would insert, into the target table, a “stale” version of any customer row that had been modified since the snapshot was taken (overwriting the correct version copied into the target table by the triggers on the original table).

We only noticed this on our release candidate environent. When we ran through the migration plan initially to ensure that the procedure did not block our applications’ normal usage, QA ran part of their Selenium suite, which produces a relatively high volume of test traffic. Out of ten thousand customers in the release-candidate databases, four customers had stale rows copied. This is not a bug we would ever have identified using only manual testing (even with a verification tool), so I’m intensely pleased that we had automation in place and very glad I got QA involved in testing the hotfix as extensively as we did.

Switching from a snapshot of the rows in customer to a snapshot of the keys in customer and selecting out the row itself only immediately before copying it to the target table addressed the problem; even in production, our verification script identified exactly zero mismatches between the original customer table and the target table (across ten million customers). It also made the procedure body somewhat shorter and simpler by reducing the number of fields copied from the cursor on each step of the iteration.

I’ve Got Ten Million Queries, Is This Bad?

Boy, is it slow. The migration procedure, which issues one query per affected customer row, takes approximately ten times as long as a straight ALTER TABLE, and produces considerably more IO load. This lead to the customer hotfix running into Friday morning (we kicked it off Thursday afternoon). I made some changes to the customer_login migration procedure to attempt to compensate – instead of issuing a REPLACE query for each row, we used a prefix of customer_login‘s primary key (customerid) instead to issue only one REPLACE query for each customer’s worth of rows. This seems to have helped, but not by much. Future iterations of this process might even go by larger groups of rows, depending on the average and worst-case number of rows per group.

The Best Laid Plans…

We ran through three iterations of the migration procedure for the customer table before we settled on a working version. While I did design recovery migrations to back out the broken versions of the procedure that were intended to leave trunk and other environments unaffected by the iterative testing process, they didn’t quite work as planned. Somehow (and we still don’t know how for sure), all of the customers on our bleeding-edge environment were deleted outright instead of being migrated over. We weren’t able to reproduce the problem with the final plan, and we do keep backups of our testing databases, so we were able to recover the missing customers, but it’s another thing that we only noticed because of QA’s intervention.

We avoided this with customer_login by not building the migrations for that table until we’d ironed out issues in the customer migrations, and in the future we’ll avoid rolling this sort of hotfix into master (for our bleeding edge environment and development environments) until we’re confident that it’s correct.

No Space Left On Device

This is one that caught me totally by surprise. When we ran the migration procedure for customer in production, we learned that passing 10m queries through the MySQL binary logs (for replication) uses up a lot of space. Fortunately, we keep the binlogs on a separate partition from the main data store, so this did not take MySQL out, but it could’ve if our Ops folks weren’t so sharp. This did, however, take replication out temporarily. Operations resized the partition the binlogs live on, which kept things running (slowly, but smoothly) while the customer_login migration procedure ran.

Read Locking Is Not Your Friend

Our initial version of a verification tool, a PHP script in one of our apps’ scripts/ directory that ran against the live databases on our master database, would’ve taken our applications offline in roughly the same way as the original ALTER TABLE statements would’ve. I only realized this when I ran the verification queries (outside of their script) against one of our replica databases’ shards and realized how long they took, shortly before we had planned to run the verification step in the hotfix plan. Fortunately, the verification used very little from our app’s environment; I ported it to a simple shell script and ran it against a replica (after replication caught up) instead. In the future, I’m going to try to get more of our verification scripts written that way where possible, since it does a good job of insulating our applications from side effects.

In Summary

Working with large data sets is hard, especially in MySQL (where even reads can cause locks). Making structural changes to those data sets is even harder, especially while the application’s in flight. However, I feel like this went relatively well even with the “creative” issues discovered during testing, and we’ll be using this approach for large tables more frequently in the future.

4 comments

10 free passes to X.commerce Innovate

by Sunir on October 7, 2011

I have a great offer for the first 10 people to jump on this deal.

Every year PayPal throws a huge developer conference, Innovate, in San Francisco. This year they have expanded it to launch X.commerce, the new e-commerce platform from eBay, PayPal, Magento, and GSI Commerce.

That conference is next week, October 12-13 in the Moscone Conference Center in San Francisco. I have 10 free passes. Email me — sunir@freshbooks.com — and I’ll send you a coupon code.

And of course I’ll be there representing FreshBooks. If you’re there, I’d love to meet you! Again, shoot me an email — sunir@freshbooks.com.

add comment

Database migrations using triggers in MySQL with MyISAM and replication

by Taavi on October 3, 2011

Last year we purchased the book Refactoring Databases. While the first chapter was preaching to the choir (database migrations are normal around here), there was one extremely valuable gem: the use of database triggers to effect seamless data migrations.

The book targets enterprise environments where there are multiple applications accessing the database concurrently, each with release cycles measured in months and years, and regularly scheduled downtime windows. In contrast, FreshBooks has fewer moving parts (and each part is much smaller), release cycles on the order of weeks (sometimes hours!), and no time is good for downtime. We’re used around the world, and it shows in the server activity logs 24/7.

However, we can still learn a lot from Refactoring Databases. Methods of performing migrations with staggered application releases over the course of months are equally applicable to a normal web app undergoing a 0-downtime deployment with rolling backend restarts to a new version of the code.

Database triggers are a way of telling the database server to react automatically to some other action, like If This Then That. The Big Three operations that change data are INSERT, UPDATE, and DELETE. One example use of triggers is to create an audit log by instructing the database server to record the effect of all UPDATE statements against a table. From this description it’s tempting to think that a lot of things should be done using triggers! Unfortunately they run on the database server itself (so can easily cause a performance bottleneck), are written in SQL (which can make it hard to express yourself if you need to do any kind of data transformation), are impossible to unit test (any test you can write for them is essentially an integration test), and don’t allow chaining (in MySQL at least; if you need two logical things done you have to write them as a single trigger; Observer and Visitor patterns are not an option here).

Believe it or not, MySQL (even using the MyISAM table engine) has full support for triggers! In the context of supporting data migrations, there are some more specific uses whose benefits outweigh the issues above. Data migration triggers don’t have to exist for very long, and it’s a strong hint to reorganize your release schedule if you find yourself wanting to do more than one thing in a single trigger. Useful things a trigger can do to support a data migration include:

  • keeping two columns in a table in sync
  • keeping two columns in two different tables in sync
  • creating new column(s) with data synthesized from old one(s) and vice versa
  • anything else you can do with a trigger (no dynamic SQL!)

However, there are some gotchas to keep in mind when you’re replicating with triggers. With MySQL replication the master records all the data-changing SQL statements executed, and the slave(s) execute those recorded SQL statements1. The slave machine doesn’t check any privileges; it assumes that if the master performed the operation that it was okay. Normally this is perfectly fine, but triggers add an extra layer of indirection.

Let’s try an example. Say that we want to replicate data from one table (FOO) into another (BAR) using an AFTER INSERT trigger. The CREATE TRIGGER statement itself will be written to the replication logs to be executed blindly by the slave(s); each MySQL instance will then have a copy of the trigger. Trigger creation on the master requires the SUPER privilege. Note that ALL PRIVILEGES does not include SUPER!

CREATE TRIGGER `foo_to_bar`
AFTER INSERT ON foo FOR EACH ROW
INSERT INTO bar (some_column)
VALUES (NEW.some_old_column);

NEW is an alias for use in triggers. When a user issues an INSERT INTO FOO statement, the master checks that this is okay, inserts the row in FOO, runs the trigger (with the privileges of the trigger creator), and records only the original INSERT statement for replication.

A slave server sees the INSERT INTO FOO statement, does the insert blindly, and runs the trigger. And that works IFF the user who created the trigger on the master also exists on the slave! If the trigger-creator user doesn’t exist on the slave, replication will fail on the INSERT statement, because the slave is no longer in “blind-execution” replication, but acting on a trigger just as if it had been caused only on the local instance.

“So,” you think to yourself, “I’ll just add the trigger-creator user and re-start replication.” Except that this is MyISAM, and there are no transactions…and that original INSERT statement already completed! So restarting replication will cause a duplicate PK error (if you’re lucky). Being lucky, removing the new row in FOO on the slave will let you restart replication and everything goes on its merry way. If you weren’t lucky, your slave is now strangely out of sync with the master. Have fun cleaning up!

But assuming the trigger-creator user is in place, things will work quite happily. At one point, we had over a million triggers running simultaneously for a few weeks. Of course, before trusting the results we ran some data verifications to ensure we had exactly the right data.

So be careful the first time you deploy triggers, and enjoy the power they give to change your schema without inconveniencing your users. :)

1 This is equally a problem for saving the results of local-state-specific functions. For example, it’s a bad idea to do something like INSERT INTO foo VALUES (GET_LOCK('a lock')) because the value returned by GET_LOCK() could be different on the slave versus on the master. Timestamps manage to propagate properly because of extra annotations (read: computer-readable SQL comments) included in the replication logs.

2 comments

Twig in Production

by Mark on September 28, 2011

For the past year we’ve been busy behind the scenes updating much of the existing FreshBooks application to use Twig. If you haven’t heard of Twig before, it’s a great templating library for PHP. Its syntax and features are mostly inspired by Jinja2. Twig provides a number of features that are useful in larger applications:

  • Template inheritance – Useful for creating application layouts and re-using common page layouts.
  • Horizontal reuse of blocks – You can reuse blocks from one template in another giving you an alternative way to re-use template code beyond extension.
  • Macros – Create re-usable ‘functions’ entirely of template code.
  • Automatic HTML escaping – The ‘killer feature’ you should look for in any template language.
  • Sandboxing – Great for limiting what can be accessed in a template. This is really useful when exposing templates to end users.

Automatic escaping is something I think every PHP template language should have. While PHP is ok at being a template language, it is not automatically safe against cross-site scripting attacks. Having a template language offer both syntactic sugar and invisibily securing template code with no additional work is fantastic feature. If you’re using a template language without automatic escaping you should really switch to one that does. Having to explicitly turn off HTML escaping forces you to think about whether or not content is actually safe to output. In practice, we’ve found autoescaping makes developers more confident about not having accidentally created a new XSS problems. Unlike an omitted |escape, which could easily get lost in the crowd, every |raw sticks out like a sore thumb in code review.

We first started using Twig around the 0.9.6 release. It was chosen over the existing solution (Smarty) primarily because Twig offers a better feature set. Inheritance and automatic escaping were two concrete features that would improve our workflow. At the time Smarty 3 was still in beta, and the team wasn’t really happy with how Smarty had treated them in the past. We also considered other options:

  • Dwoo – This was not chosen because the project activity was not overly high, and it seemed like development had stalled.
  • PHPTal – XML only meant there were going to be significant integration hurdles with our non-compliant legacy code. It
    also meant it would be impossible to use with plain text email templates.

Getting twig into production

While updating all of our templates to use Twig we learned a few things along the way that are worth sharing.

Flushing the compiled templates is important. Forgetting to do it will cause you a great deal of pain. Thankfully, Twig has this functionality built in which makes it super easy to integrate as part of your deployment:

<?php
/**
 * Clean out compiled templates.
 *
 */

require_once './inc/bootstrap.php';

$environments = array();
$return = 0;

echo "Clearing template cache files...\n";

// Cache files used for the actual application
$environments['app'] = Fresh_Twig::getEnvironment();

// Other configurations like tests are loaded as well,
// but have been omitted.

foreach ($environments as $name => $environment) {
    $cache = $environment->getCache();

    if (!$cache) {
        echo "[Info '$name']: skipping because caching is disabled.\n";
        continue;
    }

    if (!is_dir($cache)) {
        echo "[Info '$name']: skipping because '$cache' does not exist.\n";
        continue;
    }

    if (!is_writable($cache)) {
        echo "[Error '$name']: '$cache' is not writable. Try levelling up.\n";
        $return++;
        continue;
    }

    $environment->clearCacheFiles();

    echo "[Success '$name']: cleared directory '$cache'.\n";
}
exit($return);

Automate building the cache files. Deploying on a cold cache will work for a while. If you have multiple webservers and many concurrent users you should think about pre-compiling your template cache. Deploying with a hot cache will remove the possiblity of two users generating the same template cache file. It also removes the possibility for contention to occur around the creation of cache files, and finally it just reduces the amount of work done overall. Since each webserver isn’t spending time generating cache files individually they can get back to serving users faster. We use make and a simple php script to do this:

<?php
/**
 * Simple script that gets a Twig_Enviornment and compiles all the 
 * templates that match the glob pattern.
 *
 */
if (empty($argv[1])) {
    echo "No file path provided, please give a directory to compile templates in.\n";
    exit(1);
}
if (empty($argv[2])) {
    $argv[2] = 'twig';
}
$verbose = false;
if (in_array('--verbose', $argv)) {
    $verbose = true;
}

require_once './inc/bootstrap.php';

// A wrapper that provides a Twig_Environment with
// all the extensions we use attached.
$environment = Fresh_Twig::getEnvironment();
$directory = new RecursiveDirectoryIterator($argv[1]);
$recurser = new RecursiveIteratorIterator($directory);
$matcher = new RegexIterator($recurser, '/.*?\.' . $argv[2] . '$/', RegexIterator::GET_MATCH);

echo "Compiling Templates\n";

$i = 0;
foreach ($matcher as $file) {
    try {
        $filename = str_replace($argv[1], '', $file[0]);
        $environment->loadTemplate($filename);
        $environment->loadTemplate('/' . $filename);
        if ($verbose) {
            echo " - " . $filename . "\n";
        } else {
            if ($i % 72 == 0) {
                echo "\n";
            }
            echo '.';
        }
    } catch (Exception $e) {
        echo $e->getMessage() . "\n";
    }
    $i++;
}

echo "\nTemplates compiled\n";
exit(0);

We compile both the slash-less and leading slash versions of the template. This produces more template files, but saves us from having to be super picky about how we use twig files in our application. Before each deploy we run the make script to generate all the template cache files. The compiled templates are shipped with the code changes.

In closing

Since the switch to Twig last year we haven’t really hit any real problems with it. Its been a fantastic and easily extensible library. Thanks to Fabien and all the other Twig contributors for making such a solid product.

add comment

UTF-8 is here!

by Taavi on September 21, 2011

Today I’m happy to announce that FreshBooks is fully UTF-8!

What this means for you: You can use unicode snowman and all the other characters in the Unicode Basic Multilingual Plane to your heart’s content, knowing that each symbol takes the space of only one code point (essentially a “letter”), for example being able to fit a full 50 non-Latin characters into an invoice item name. It also means that you’ll be able to use snowmen and other extended characters in the time tracking section of FreshBooks!

What this means for you as an API consumer: You can now send FreshBooks the full gamut of BMP characters directly as UTF-8 with no encoding shenanigans! Previously anything not in the ISO-8859-1 character set would be smashed into a question mark, including XML entities representing those extended charaters. As a workaround, one could “double-encode” entities, e.g. your XML would contain the ASCII string &amp;#9731; to get a snowman. This is no longer required!

Here’s how things work, starting today:

If you submit this via your browser It looks like this in the browser And like this in an API response
&amp; & &amp;
&#9731;
&amp;#9731; &#9731; &#9731;
If you submit this via the API It looks like this in the browser And like this in an API response
&amp; & &amp;
&#9731;
&amp;#9731;

For the next month or so we’ll continue to accept HTML entities through the web application (and double-encoded entities through the API), and display them just like we have for years. But after that, we’ll migrate all the exiting entity-based characters into real UTF-8 codepoints, and flip the switch such that plain text is plain text.

When that happens, the behaviour will change like this:

If you submit this via your browser It looks like this in the browser And like this in an API response
&amp; &amp; &amp;amp
&#9731; &#9731; &amp;#9731;
&amp;#9731; &amp;#9731; &amp;amp;#9731;
If you submit this via the API It looks like this in the browser And like this in an API response
&amp; & &amp;
&#9731;
&amp;#9731; &#9731; &amp;#9731;

Enjoy!

4 comments

Search