Dev Blog

The TechTalksTO Award!

by Corey on January 20, 2012

Shey & Jason & the TechTalksTO Shield

TechTalksTO is a series of technical presentations here in Toronto. Throughout 2011 they brought us folks from Mozilla, Twitter, Well.ca and other cool places around the internets. They even put on an all-day event at the not-too-classy-but-so-cool Toronto Underground Cinema, featuring speakers from Disqus, Github, OpsCode, and more. It was a fantastic event! We had dumplings!

Well, not everyone knows that TechTalksTO is the brainchild of two FreshBooks developers: Shey and Jason. They organized this whole thing, found every speaker and arranged every event. They worked their hearts out producing a series of valuable talks for the Toronto technology community. At FreshBooks we are super inspired by their example. Shey and Jason are both great developers, and great team leads, but clearly they had something more to offer than producing great software — they wanted to help build a community of great software development practice right here in Toronto.

We couldn’t be prouder of them. So we gave them a shield. It’ll come in handy during a zombie apocalypse, we figure, and blue goes with everything, right?


Manually verifying character representations

by Taavi on December 19, 2011

As a part of the UTF-8 work we’ve been doing, we have to turn HTML entities (numeric and named character references) into UTF-8 byte sequences. As it turns out, browsers don’t do this the way PHP 5.3’s html_entity_decode() function does. :(

There are 4 classes of difference:

  1. Surrogate codepoints
  2. Windows-1252 characters in the ISO-8859-1 dead zone
  3. Mathematical angle brackets
  4. U+0000 (NUL)

The first we can just ignore. Surrogate codepoints aren’t valid in UTF-8 (if you have them, you’ve actually got CESU-8), and we’re going to leave them entirely alone. This means that eventually things might get a bit uglier: instead of seeing a replacement character (�), you’ll see the literal text of the character reference, e.g. &#xD800;. This way we’re not destroying data, even if it wasn’t displaying correctly before.

The second is a bit of fun. All the specs say that numeric character references use Unicode codepoint values. But (apparently) a lot of data contains codepoints in the range U+0080 through U+009F inclusive, which are (largely) non-printing control characters. Browsers rewrite numeric character references for the 27 codepoints in that range which are defined in Windows-1252 anyway. For example, the numeric character reference &#146; looks like ’ in your browser (properly referenced as &#8217; or &#x2019;, i.e. &rsquo;). Because all the major browsers do this remapping, we’re going to use it, too.
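To make the remapping concrete, here’s a minimal Python 2 sketch of it (our real decoding work happens in PHP, so this is purely illustrative). It leans on Python’s cp1252 codec, which leaves undefined exactly the five positions Windows-1252 doesn’t define:

# -*- coding: utf-8 -*-
# A minimal sketch of the browser-style remapping (not our production code):
# numeric references in U+0080..U+009F get reinterpreted as Windows-1252 bytes
# wherever Windows-1252 actually defines a character there.

def browser_codepoint(codepoint):
    """Return the codepoint a browser will actually use for &#<codepoint>;."""
    if 0x80 <= codepoint <= 0x9F:
        try:
            # cp1252 leaves 0x81, 0x8D, 0x8F, 0x90 and 0x9D undefined,
            # which raises UnicodeDecodeError; everything else remaps.
            return ord(chr(codepoint).decode('cp1252'))
        except UnicodeDecodeError:
            pass  # no Windows-1252 character here; leave it alone
    return codepoint

assert browser_codepoint(0x92) == 0x2019  # &#146; displays as a right single quote
assert browser_codepoint(0xE9) == 0xE9    # codepoints outside the dead zone pass through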

The third is another place where browsers violate the letter of the spec. Officially, &lang; (&#9001;) and &rang; (&#9002;) map to U+2329 (〈) and U+232A (〉), respectively. But browsers actually render them as U+27E8 (⟨) and U+27E9 (⟩)! The rationale seems to be that the old codepoints are ambiguous (the Unicode spec agrees).

The fourth fun thing is how browsers handle null bytes. Literal nulls aren’t visible at all, but the numeric character reference for codepoint U+0000 (&#0;) comes across as the replacement character (�) in the browsers we tested.

In order to test these (and make sure there aren’t more hiding!), I devised a number of tests, each attacking the problem from a slightly different angle. Nobody wants to sit through visually confirming that several thousand pairs of characters are the same character, and depending on the font it can be very difficult to see the difference between the angle brackets from #3 above.

Method #1: Pure JavaScript

Trying to fall asleep last week, I realized that we can get a browser to tell us what Unicode codepoint it uses to represent a given entity by using innerHTML and textContent.charCodeAt:

<!DOCTYPE html>
<html><head></head>
<body>
<span id="codepoint">?</span>: <span id="test_location"></span>
<table border="1">
	<thead>
		<tr><th>Entity</th><th>HTML got</th><th>JS got</th></tr>
	</thead>
	<tbody id="results">
	</tbody>
</table>
<script>
// This list was mechanically extracted from the PHP 5.3.8 source file ext/standard/tests/strings/htmlentities_html4.phpt
// It is assumed that this is how PHP's html_entity_decode will treat each of the named entities, so we can compare that to how browsers render them.
var named_entities = {"&quot;": 0x00022, "&amp;": 0x00026, "&#039;": 0x00027, "&lt;": 0x0003C, "&gt;": 0x0003E, "&nbsp;": 0x000A0, "&iexcl;": 0x000A1, "&cent;": 0x000A2, "&pound;": 0x000A3, "&curren;": 0x000A4, "&yen;": 0x000A5, "&brvbar;": 0x000A6, "&sect;": 0x000A7, "&uml;": 0x000A8, "&copy;": 0x000A9, "&ordf;": 0x000AA, "&laquo;": 0x000AB, "&not;": 0x000AC, "&shy;": 0x000AD, "&reg;": 0x000AE, "&macr;": 0x000AF, "&deg;": 0x000B0, "&plusmn;": 0x000B1, "&sup2;": 0x000B2, "&sup3;": 0x000B3, "&acute;": 0x000B4, "&micro;": 0x000B5, "&para;": 0x000B6, "&middot;": 0x000B7, "&cedil;": 0x000B8, "&sup1;": 0x000B9, "&ordm;": 0x000BA, "&raquo;": 0x000BB, "&frac14;": 0x000BC, "&frac12;": 0x000BD, "&frac34;": 0x000BE, "&iquest;": 0x000BF, "&Agrave;": 0x000C0, "&Aacute;": 0x000C1, "&Acirc;": 0x000C2, "&Atilde;": 0x000C3, "&Auml;": 0x000C4, "&Aring;": 0x000C5, "&AElig;": 0x000C6, "&Ccedil;": 0x000C7, "&Egrave;": 0x000C8, "&Eacute;": 0x000C9, "&Ecirc;": 0x000CA, "&Euml;": 0x000CB, "&Igrave;": 0x000CC, "&Iacute;": 0x000CD, "&Icirc;": 0x000CE, "&Iuml;": 0x000CF, "&ETH;": 0x000D0, "&Ntilde;": 0x000D1, "&Ograve;": 0x000D2, "&Oacute;": 0x000D3, "&Ocirc;": 0x000D4, "&Otilde;": 0x000D5, "&Ouml;": 0x000D6, "&times;": 0x000D7, "&Oslash;": 0x000D8, "&Ugrave;": 0x000D9, "&Uacute;": 0x000DA, "&Ucirc;": 0x000DB, "&Uuml;": 0x000DC, "&Yacute;": 0x000DD, "&THORN;": 0x000DE, "&szlig;": 0x000DF, "&agrave;": 0x000E0, "&aacute;": 0x000E1, "&acirc;": 0x000E2, "&atilde;": 0x000E3, "&auml;": 0x000E4, "&aring;": 0x000E5, "&aelig;": 0x000E6, "&ccedil;": 0x000E7, "&egrave;": 0x000E8, "&eacute;": 0x000E9, "&ecirc;": 0x000EA, "&euml;": 0x000EB, "&igrave;": 0x000EC, "&iacute;": 0x000ED, "&icirc;": 0x000EE, "&iuml;": 0x000EF, "&eth;": 0x000F0, "&ntilde;": 0x000F1, "&ograve;": 0x000F2, "&oacute;": 0x000F3, "&ocirc;": 0x000F4, "&otilde;": 0x000F5, "&ouml;": 0x000F6, "&divide;": 0x000F7, "&oslash;": 0x000F8, "&ugrave;": 0x000F9, "&uacute;": 0x000FA, "&ucirc;": 0x000FB, "&uuml;": 0x000FC, "&yacute;": 0x000FD, "&thorn;": 0x000FE, "&yuml;": 0x000FF, "&OElig;": 0x00152, "&oelig;": 0x00153, "&Scaron;": 0x00160, "&scaron;": 0x00161, "&Yuml;": 0x00178, "&fnof;": 0x00192, "&circ;": 0x002C6, "&tilde;": 0x002DC, "&Alpha;": 0x00391, "&Beta;": 0x00392, "&Gamma;": 0x00393, "&Delta;": 0x00394, "&Epsilon;": 0x00395, "&Zeta;": 0x00396, "&Eta;": 0x00397, "&Theta;": 0x00398, "&Iota;": 0x00399, "&Kappa;": 0x0039A, "&Lambda;": 0x0039B, "&Mu;": 0x0039C, "&Nu;": 0x0039D, "&Xi;": 0x0039E, "&Omicron;": 0x0039F, "&Pi;": 0x003A0, "&Rho;": 0x003A1, "&Sigma;": 0x003A3, "&Tau;": 0x003A4, "&Upsilon;": 0x003A5, "&Phi;": 0x003A6, "&Chi;": 0x003A7, "&Psi;": 0x003A8, "&Omega;": 0x003A9, "&alpha;": 0x003B1, "&beta;": 0x003B2, "&gamma;": 0x003B3, "&delta;": 0x003B4, "&epsilon;": 0x003B5, "&zeta;": 0x003B6, "&eta;": 0x003B7, "&theta;": 0x003B8, "&iota;": 0x003B9, "&kappa;": 0x003BA, "&lambda;": 0x003BB, "&mu;": 0x003BC, "&nu;": 0x003BD, "&xi;": 0x003BE, "&omicron;": 0x003BF, "&pi;": 0x003C0, "&rho;": 0x003C1, "&sigmaf;": 0x003C2, "&sigma;": 0x003C3, "&tau;": 0x003C4, "&upsilon;": 0x003C5, "&phi;": 0x003C6, "&chi;": 0x003C7, "&psi;": 0x003C8, "&omega;": 0x003C9, "&thetasym;": 0x003D1, "&upsih;": 0x003D2, "&piv;": 0x003D6, "&ensp;": 0x02002, "&emsp;": 0x02003, "&thinsp;": 0x02009, "&zwnj;": 0x0200C, "&zwj;": 0x0200D, "&lrm;": 0x0200E, "&rlm;": 0x0200F, "&ndash;": 0x02013, "&mdash;": 0x02014, "&lsquo;": 0x02018, "&rsquo;": 0x02019, "&sbquo;": 0x0201A, "&ldquo;": 0x0201C, "&rdquo;": 0x0201D, "&bdquo;": 0x0201E, "&dagger;": 0x02020, "&Dagger;": 0x02021, "&bull;": 0x02022, 
"&hellip;": 0x02026, "&permil;": 0x02030, "&prime;": 0x02032, "&Prime;": 0x02033, "&lsaquo;": 0x02039, "&rsaquo;": 0x0203A, "&oline;": 0x0203E, "&frasl;": 0x02044, "&euro;": 0x020AC, "&image;": 0x02111, "&weierp;": 0x02118, "&real;": 0x0211C, "&trade;": 0x02122, "&alefsym;": 0x02135, "&larr;": 0x02190, "&uarr;": 0x02191, "&rarr;": 0x02192, "&darr;": 0x02193, "&harr;": 0x02194, "&crarr;": 0x021B5, "&lArr;": 0x021D0, "&uArr;": 0x021D1, "&rArr;": 0x021D2, "&dArr;": 0x021D3, "&hArr;": 0x021D4, "&forall;": 0x02200, "&part;": 0x02202, "&exist;": 0x02203, "&empty;": 0x02205, "&nabla;": 0x02207, "&isin;": 0x02208, "&notin;": 0x02209, "&ni;": 0x0220B, "&prod;": 0x0220F, "&sum;": 0x02211, "&minus;": 0x02212, "&lowast;": 0x02217, "&radic;": 0x0221A, "&prop;": 0x0221D, "&infin;": 0x0221E, "&ang;": 0x02220, "&and;": 0x02227, "&or;": 0x02228, "&cap;": 0x02229, "&cup;": 0x0222A, "&int;": 0x0222B, "&there4;": 0x02234, "&sim;": 0x0223C, "&cong;": 0x02245, "&asymp;": 0x02248, "&ne;": 0x02260, "&equiv;": 0x02261, "&le;": 0x02264, "&ge;": 0x02265, "&sub;": 0x02282, "&sup;": 0x02283, "&nsub;": 0x02284, "&sube;": 0x02286, "&supe;": 0x02287, "&oplus;": 0x02295, "&otimes;": 0x02297, "&perp;": 0x022A5, "&sdot;": 0x022C5, "&lceil;": 0x02308, "&rceil;": 0x02309, "&lfloor;": 0x0230A, "&rfloor;": 0x0230B, "&lang;": 0x02329, "&rang;": 0x0232A, "&loz;": 0x025CA, "&spades;": 0x02660, "&clubs;": 0x02663, "&hearts;": 0x02665, "&diams;": 0x02666};
var test_location = document.getElementById('test_location');
var name_span = document.getElementById('codepoint');
var results = document.getElementById('results');

function testEntity(entity, expected_codepoint) {
	var new_row;
	name_span.textContent = entity;
	test_location.innerHTML = entity;
	if (test_location.textContent.charCodeAt(0) != expected_codepoint) {
		new_row = document.createElement("tr");
		new_row.innerHTML = "<td>" + entity.replace("&", "&amp;") + "</td><td>U+" + test_location.textContent.charCodeAt(0).toString(16) + "</td><td>U+" + expected_codepoint.toString(16) + "</td>";
		results.appendChild(new_row);
	}
}

function testBouncer(tests, i) {
	if (i >= tests.length) {
		name_span.textContent = "DONE!";
		return;
	}
	testEntity.apply(null, tests[i]);
	setTimeout(function(){testBouncer(tests, i+1);}, 0);
}

function main() {
	var tests = [];
	for (var entity in named_entities) {
		tests.push([entity, named_entities[entity]]);
	}
	for (var codepoint = 0; codepoint < 0x10000; codepoint++) {
		if (codepoint >= 0xD800 && codepoint <= 0xDFFF) {
			// Ignore surrogates
			continue;
		}
		tests.push(["&#" + codepoint + ";", codepoint]);
	}
	setTimeout(function(){testBouncer(tests, 0);}, 0);
}

setTimeout(main, 0);
</script>
</body>
</html>

It takes a few minutes to run, and it doesn’t actually tell us what the browser displays, but it does tell us what the browser interpreted. It’s probably close enough, and it has the benefit of being easy to run on a wide variety of browsers.

It turns out that there are some differences:

  • IE likes to render U+003F (a literal question mark “?”) for a lot of things
  • Firefox and IE on Windows complain about U+000D, but Chrome and Safari do not (nor does Firefox 8 on OSX)
  • Safari on iPhone matches Safari on OSX, but you’ll probably want to be plugged in while it runs…

The results aren’t convincing enough for me to bet customer data on it, though.

Method #2: Building a Browser

I also looked into how to render HTML into a canvas element in order to do automated pixel-perfect comparisons. Doing it with Firefox seems to involve writing a bit of chrome, and I got nowhere fast. With some idle poking around, though, I found the PyQt bindings, and enough examples of how to use a QWebPage to turn HTML into a bitmap.

First, install the PyQt bindings from Homebrew:

brew install pyqt

And get a drink. Maybe go to lunch. When it’s done, we can run this bit of fun:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import base64
import htmlentitydefs
import json
import re
import sys
sys.path.insert(0, '/usr/local/lib/python')

from PyQt4.QtCore import *
from PyQt4.QtGui import *
from PyQt4.QtWebKit import *

with open('entity_list.json') as f:
    # A file containing the named entities I want to check, as a JSON array
    unique_entities = json.load(f)

app = QApplication([])

SIZE = 100
size = QSize(SIZE, SIZE)

page = QWebPage()
# If you don't set the viewport size, you won't see anything
page.setViewportSize(size)
frame = page.mainFrame()

surface1 = QImage(size, QImage.Format_ARGB32_Premultiplied)
surface2 = QImage(size, QImage.Format_ARGB32_Premultiplied)

entity_re = re.compile(r'^&(?:(?P<name>\w+)|#(?:x(?P<hex>[0-9a-f]+)|(?P<decimal>\d+)));$')

def decode(entity):
    """Naive HTML entity decoder.
    
    Explicitly follows the specs, not what browsers actually do.
    
    Takes a string containing only the entity, and returns a unicode string with the decoded interpretation.
    
    """
    m = entity_re.match(entity)
    if not m:
        raise ValueError()
    elif m.group('name'):
        return unichr(htmlentitydefs.name2codepoint[m.group('name')])
    elif m.group('hex'):
        return unichr(int(m.group('hex'), 16))
    elif m.group('decimal'):
        return unichr(int(m.group('decimal'), 10))
    else:
        raise ValueError()

for entity in unique_entities:
    try:
        decoded_entity = decode(entity)
    except ValueError:
        print "Couldn't decode entity", entity
        continue
    if u'\uD800' <= decoded_entity <= u'\uDFFF':
        # Don't care about surrogates.
        continue

    surface1.fill(0xFFFFFFFF)
    surface2.fill(0xFFFFFFFF)

    # Experiments tell me I should make new painters every time (see .end() below)
    painter1 = QPainter(surface1)
    painter2 = QPainter(surface2)

    SCALE = 3
    painter1.scale(SCALE,SCALE)
    painter2.scale(SCALE,SCALE)

    frame.setHtml(decoded_entity)
    frame.render(painter1)
    painter1.end()

    frame.setHtml(entity)
    frame.render(painter2)
    painter2.end()

    # The bits() method returns a sip_ptr thing, which is a Python interface
    # to a void* pointing at the ARGB32 backing store for the image.
    # Hence width x height x byte width of a pixel
    bitmap1 = surface1.bits().asstring(SIZE*SIZE*4)
    bitmap2 = surface2.bits().asstring(SIZE*SIZE*4)
    # Should probably be a set of 4-tuples (i.e. pixel values)
    # but "nothing" is "opaque white" so all 0xFF.
    if len(set(bitmap1)) == 1:
        print "Nothing to see", entity
    # Looking for a bit-perfect image
    if bitmap1 != bitmap2:
        print "Glyph mismatch on entity %s" % (entity,)
        surface1.save('mismatch%s.utf8.png' % (entity,))
        surface2.save('mismatch%s.entity.png' % (entity,))

# Seems to crash at the end no matter what I do. :( 
# sys.exit(app.exec_())

It’s probably advisable to generate the before and after entity transformations outside of Python, since we won’t be doing the real translation with Python. In this case, however, I know that htmlentitydefs.name2codepoint has the same data as PHP, and that the decode() function above does the naïve, spec-following thing with numeric character references (just like PHP’s html_entity_decode).

I was pretty happy with this, but it’s only for one browser engine, and not actually a browser that anyone runs. But it’s neat to know that you can create an entire browser inside of Python to generate screenshots and even hook into the JS engine.

Method #3: Taking Screenshots

As it turns out, we don’t need to verify all 60-some-thousand codepoints in the Basic Multilingual Plane. We only have about 6,000 of them in the database, and 6,000 characters nearly fit on a single 27″ iMac screen!

The first step is to create a big table (≈150×50 cells) containing the codepoints in question. Take a screenshot.
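Generating such a grid is only a few lines of Python. Here’s a sketch; the codepoint range is a stand-in (ours came out of the database), and it writes one numeric character reference per cell:

# Sketch: write an HTML grid with one numeric character reference per cell.
# The codepoint list here is a stand-in; ours was pulled from the database.
codepoints = range(0x20, 0x1800)   # roughly 6,000 codepoints
COLS = 150

with open('before.html', 'w') as f:
    f.write('<!DOCTYPE html>\n<meta charset="utf-8">\n<table>\n')
    for i in xrange(0, len(codepoints), COLS):
        cells = ''.join('<td>&#%d;</td>' % cp for cp in codepoints[i:i + COLS])
        f.write('<tr>%s</tr>\n' % cells)
    f.write('</table>\n')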

Then load up a similar table, but with the expected transformations applied. Take another screenshot.

Then we’ll have some fun in the GIMP using layers and the Difference mode.

The top layer is using the Difference mode at 72% opacity. It’s offset by a few dozen pixels to show the effect.

When we move the two layers to be perfectly overlaid, most characters go black (meaning “no difference”). The white background is grey because we’re still at only 72% opacity. You can see the signal we get in the URI when things are different. There’s also that pesky �…

At full opacity, any difference between the two screenshots pops out pretty loudly. Looks like &#0; renders differently than a null byte!
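If you’d rather not trust your eyeballs at all, the same comparison can be scripted. Here’s a small sketch using PIL, assuming the two screenshots are saved as before.png and after.png (made-up names) and are exactly the same size:

# Sketch: a scripted stand-in for the GIMP difference trick.
# Assumes before.png and after.png (made-up names) exist and match in size.
from PIL import Image, ImageChops

before = Image.open('before.png').convert('RGB')
after = Image.open('after.png').convert('RGB')

diff = ImageChops.difference(before, after)
bbox = diff.getbbox()  # None means every pixel matched
if bbox is None:
    print "screenshots are pixel-identical"
else:
    print "screenshots differ inside bounding box", bbox
    diff.save('diff.png')  # non-black pixels mark the mismatches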

At the end of the day, being able to visually and manually verify all the things we intend to work on gives us a lot of confidence. It doesn’t hurt that all 3 methods agree on where the trouble spots are!


Changing The Engine While You’re Driving Down The Highway

by Owen on October 14, 2011

We recently ran into an interesting and difficult problem: how do we change a large, heavily-used table without violating our “avoid downtime if possible” mantra?

The following is a slightly expurgated version of the postmortem I sent the rest of the team. Table names have been changed to protect the guilty.

What Hotfixes?

In the Sinistar release’s post-release migrations list, there were two ALTER TABLE migrations slated to make structural changes to the customer and customer_login tables (changing the types of some columns, specifically). When we tested these migrations on a copy of the production system’s data, we discovered that they would have caused those tables to be inaccessible for two and five minutes respectively. Since customers are core to our applications, we determined that this qualified as “downtime” in the release plan.

I prototyped an alternate approach to the customer changes (outlined below) intended to make changes to the table without taking the app offline while they were happening. The alternate approach was considerably more complex than the original ALTER TABLE statement, and considerably slower (~15 minutes for our largest shard, compared to two minutes). We discussed whether it was worth taking the app offline at zero-dark-hundred on a weekend instead, concluded that we’d eventually need to be able to make online changes to expensive tables anyway, and opted to go ahead with the alternate approach as a learning experience.

How Did It Work?

The alternate approach we used relied on more MySQL-side tools to make the changes. Specifically, we:

  1. Created a new, empty version of the customer table (the “target” table) based on the current customer table plus the intended structural changes. Structural changes to an empty table that the app does not use are quite cheap.
  2. Installed triggers on INSERT, UPDATE, and DELETE operations on the customer table that copied rows (using REPLACE) into the target table, applying the intended transforms to the affected columns.
  3. Installed a server-side procedure (“the migration procedure”) that:
    1. Selected all of the keys from the original customer table.
    2. For each key, copied the corresponding row from the customer table to the target table using REPLACE.
  4. Ran the migration procedure. (This part is where most of the time went.)
  5. Verified that the contents of the target table were in 1:1 correspondence with the contents of the original customer table, where the only permitted differences between corresponding rows were the changes intended by the original migration.
  6. Renamed the original customer table out of the way and renamed the target table to customer.
  7. Verified that nothing had gone horribly wrong in our apps.
  8. Dropped the renamed version of the original customer table, along with the triggers and the migration procedure.

The steps are grouped into four phases: prepare.xml (covering steps 1-3), execute.xml (covering step 4), cutover.xml (covering step 6) and cleanup.xml (covering step 8). The steps not covered by phases were performed semi-manually.
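For flavour, here’s roughly what steps 1 and 2 look like, sketched as a Python script driving the SQL. This isn’t the actual hotfix (that lived in Liquibase changesets), and the table and column names are invented; it’s just the shape of the thing:

# Hypothetical sketch of steps 1 and 2: build the target table, then mirror
# live writes into it with triggers. Table and column names are invented.
import MySQLdb

conn = MySQLdb.connect(host='db-master', user='migrator', passwd='secret', db='shard1')
cur = conn.cursor()

# Step 1: an empty copy of the table, with the intended structural change applied.
# Altering an empty table that nothing reads from is cheap.
cur.execute("CREATE TABLE customer_target LIKE customer")
cur.execute("ALTER TABLE customer_target MODIFY notes MEDIUMTEXT")

# Step 2: triggers that keep the target table in step with live writes.
cur.execute("""
    CREATE TRIGGER customer_mig_ins AFTER INSERT ON customer FOR EACH ROW
        REPLACE INTO customer_target (customerid, organization, notes)
        VALUES (NEW.customerid, NEW.organization, NEW.notes)
""")
# ...plus matching AFTER UPDATE and AFTER DELETE triggers.

The migration procedure from step 3 is where the concurrency story below comes in.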

The changes to customer_login proceeded along the same paths.

What Went Right

Well, first of all, we noticed a potential app outage hazard before it went live.

We were able to back out and rewrite the migrations without affecting existing development environments, where the original ALTER TABLE might have already run, thanks largely to Liquibase’s precondition support.

We ran through several versions of this plan, both on my development environment image and with QA (on our release candidate environment), which ferreted out a handful of subtle bugs in the migration procedure that would’ve damaged customer data. Our QA team’s extensive experience and automation tooling around our applications prevented any of those mistakes from going live.

Designing the hotfix in several phases permitted us to run the relatively safe phases over the weekend, with relatively little involvement from our operations team beyond monitoring.

The verification steps did not take out our applications in the process, and revealed a way to move at least some verification steps out of our applications’ scripts directory and into the core-dbs source tree.

Finally, it worked. Despite the relatively high complexity and the large collection of moving parts, this alternate plan worked very well. Rather than taking our applications offline, everything stayed up (and didn’t slow down too badly).

What Went Wrong, And What We Learned

Finding Problems Late is A Problem

We didn’t catch the initial problem before it went out to everyone’s development environments. I’ve habitually not worried about migration timing until code freeze, which means that potentially-troublesome migrations will have already run on our bleeding-edge environment (and possibly our release candidate environment) before we notice the problem. This led to increased complexity in the alternate migration path’s Liquibase configuration: we made extensive use of preconditions to ensure that the alternate migration would be harmless on systems where the original ALTER TABLE had already run. That seems to have worked, but I’d rather not have had the problem in the first place; it’s another moving part that can fail.

Concurrency is a Hard Problem

The first version of the migration procedure looked like this:

  1. Select all of the rows from the original customer table.
  2. For each row, insert that row into the target table using REPLACE.

The select and loop steps use a MySQL cursor to iterate over their results. While the MySQL documentation states that it’s undefined whether a cursor is over a snapshot of the query results or over the underlying tables, in practice we found that it was using a snapshot. Since the loop took considerable time to run, and since it (intentionally) did not prevent further changes to the original customer table from occurring after the snapshot, the loop would insert, into the target table, a “stale” version of any customer row that had been modified since the snapshot was taken (overwriting the correct version copied into the target table by the triggers on the original table).

We only noticed this on our release candidate environment. When we ran through the migration plan initially to ensure that the procedure did not block our applications’ normal usage, QA ran part of their Selenium suite, which produces a relatively high volume of test traffic. Out of ten thousand customers in the release-candidate databases, four customers had stale rows copied. This is not a bug we would ever have identified using only manual testing (even with a verification tool), so I’m intensely pleased that we had automation in place and very glad I got QA involved in testing the hotfix as extensively as we did.

Switching from a snapshot of the rows in customer to a snapshot of the keys in customer, and selecting the row itself immediately before copying it to the target table, addressed the problem; even in production, our verification script identified exactly zero mismatches between the original customer table and the target table (across ten million customers). It also made the procedure body somewhat shorter and simpler by reducing the number of fields copied from the cursor on each step of the iteration.
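Expressed as a client-side loop instead of the stored procedure we actually ran, the revised approach looks roughly like this (names invented, explicit column transforms elided):

# Sketch of the fixed copy loop: snapshot only the keys, then re-read each row
# right before copying it, so stale data can never overwrite the triggers' work.
import MySQLdb

conn = MySQLdb.connect(host='db-master', user='migrator', passwd='secret', db='shard1')
cur = conn.cursor()

cur.execute("SELECT customerid FROM customer")
for (customerid,) in cur.fetchall():   # fetchall() is our key snapshot
    cur.execute(
        "REPLACE INTO customer_target SELECT * FROM customer WHERE customerid = %s",
        (customerid,))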

I’ve Got Ten Million Queries, Is This Bad?

Boy, is it slow. The migration procedure, which issues one query per affected customer row, takes approximately ten times as long as a straight ALTER TABLE, and produces considerably more IO load. This led to the customer hotfix running into Friday morning (we kicked it off Thursday afternoon). I made some changes to the customer_login migration procedure to compensate: instead of issuing a REPLACE query for each row, we used a prefix of customer_login’s primary key (customerid) to issue only one REPLACE query per customer’s worth of rows. This seems to have helped, but not by much. Future iterations of this process might even go by larger groups of rows, depending on the average and worst-case number of rows per group.

The Best Laid Plans…

We ran through three iterations of the migration procedure for the customer table before we settled on a working version. I did design recovery migrations to back out the broken versions of the procedure, intended to leave trunk and other environments unaffected by the iterative testing process, but they didn’t quite work as planned. Somehow (and we still don’t know how for sure), all of the customers on our bleeding-edge environment were deleted outright instead of being migrated over. We weren’t able to reproduce the problem with the final plan, and we do keep backups of our testing databases, so we were able to recover the missing customers, but it’s another thing that we only noticed because of QA’s intervention.

We avoided this with customer_login by not building the migrations for that table until we’d ironed out issues in the customer migrations, and in the future we’ll avoid rolling this sort of hotfix into master (for our bleeding edge environment and development environments) until we’re confident that it’s correct.

No Space Left On Device

This is one that caught me totally by surprise. When we ran the migration procedure for customer in production, we learned that passing 10m queries through the MySQL binary logs (for replication) uses up a lot of space. Fortunately, we keep the binlogs on a separate partition from the main data store, so this did not take MySQL out, but it could’ve if our Ops folks weren’t so sharp. This did, however, take replication out temporarily. Operations resized the partition the binlogs live on, which kept things running (slowly, but smoothly) while the customer_login migration procedure ran.

Read Locking Is Not Your Friend

Our initial version of a verification tool, a PHP script in one of our apps’ scripts/ directories that ran against the live tables on our master database, would’ve taken our applications offline in roughly the same way as the original ALTER TABLE statements would’ve. I only realized this when I ran the verification queries (outside of their script) against one of our replica databases’ shards and saw how long they took, shortly before we had planned to run the verification step in the hotfix plan. Fortunately, the verification used very little from our app’s environment; I ported it to a simple shell script and ran it against a replica (after replication caught up) instead. In the future, I’m going to try to get more of our verification scripts written that way where possible, since it does a good job of insulating our applications from side effects.
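The verification itself is conceptually just a row-by-row comparison between the original table and the target. In spirit it looks something like this sketch (column names invented), pointed at a caught-up replica rather than the master:

# Sketch of the verification idea: count rows missing from the target, or rows
# that differ in ways the migration was not supposed to introduce.
# Run against a replica so the long read doesn't lock anything on the master.
import MySQLdb

replica = MySQLdb.connect(host='db-replica', user='checker', passwd='secret', db='shard1')
cur = replica.cursor()
cur.execute("""
    SELECT COUNT(*)
    FROM customer c
    LEFT JOIN customer_target t USING (customerid)
    WHERE t.customerid IS NULL
       OR c.organization <> t.organization   -- a column the migration should not touch
""")
(mismatches,) = cur.fetchone()
print "mismatched rows:", mismatches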

In Summary

Working with large data sets is hard, especially in MySQL (where even reads can cause locks). Making structural changes to those data sets is even harder, especially while the application’s in flight. However, I feel like this went relatively well even with the “creative” issues discovered during testing, and we’ll be using this approach for large tables more frequently in the future.


10 free passes to X.commerce Innovate

by Sunir on October 7, 2011

I have a great offer for the first 10 people to jump on this deal.

Every year PayPal throws a huge developer conference, Innovate, in San Francisco. This year they have expanded it to launch X.commerce, the new e-commerce platform from eBay, PayPal, Magento, and GSI Commerce.

That conference is next week, October 12-13 in the Moscone Conference Center in San Francisco. I have 10 free passes. Email me — sunir@freshbooks.com — and I’ll send you a coupon code.

And of course I’ll be there representing FreshBooks. If you’re there, I’d love to meet you! Again, shoot me an email — sunir@freshbooks.com.


Database migrations using triggers in MySQL with MyISAM and replication

by Taavi on October 3, 2011

Last year we purchased the book Refactoring Databases. While the first chapter was preaching to the choir (database migrations are normal around here), there was one extremely valuable gem: the use of database triggers to effect seamless data migrations.

The book targets enterprise environments where there are multiple applications accessing the database concurrently, each with release cycles measured in months and years, and regularly scheduled downtime windows. In contrast, FreshBooks has fewer moving parts (and each part is much smaller), release cycles on the order of weeks (sometimes hours!), and no time is good for downtime. We’re used around the world, and it shows in the server activity logs 24/7.

However, we can still learn a lot from Refactoring Databases. Methods of performing migrations with staggered application releases over the course of months are equally applicable to a normal web app undergoing a 0-downtime deployment with rolling backend restarts to a new version of the code.

Database triggers are a way of telling the database server to react automatically to some other action, like If This Then That. The Big Three operations that change data are INSERT, UPDATE, and DELETE. One example use of triggers is to create an audit log by instructing the database server to record the effect of all UPDATE statements against a table. From this description it’s tempting to think that a lot of things should be done using triggers! Unfortunately they run on the database server itself (so can easily cause a performance bottleneck), are written in SQL (which can make it hard to express yourself if you need to do any kind of data transformation), are impossible to unit test (any test you can write for them is essentially an integration test), and don’t allow chaining (in MySQL at least; if you need two logical things done you have to write them as a single trigger; Observer and Visitor patterns are not an option here).

Believe it or not, MySQL (even using the MyISAM table engine) has full support for triggers! In the context of supporting data migrations, there are some more specific uses whose benefits outweigh the issues above. Data migration triggers don’t have to exist for very long, and it’s a strong hint to reorganize your release schedule if you find yourself wanting to do more than one thing in a single trigger. Useful things a trigger can do to support a data migration include:

  • keeping two columns in a table in sync
  • keeping two columns in two different tables in sync
  • creating new column(s) with data synthesized from old one(s) and vice versa
  • anything else you can do with a trigger (no dynamic SQL!)

However, there are some gotchas to keep in mind when you’re replicating with triggers. With MySQL replication, the master records all the data-changing SQL statements executed, and the slave(s) execute those recorded SQL statements [1]. The slave machine doesn’t check any privileges; it assumes that if the master performed the operation, it was okay. Normally this is perfectly fine, but triggers add an extra layer of indirection.

Let’s try an example. Say that we want to replicate data from one table (FOO) into another (BAR) using an AFTER INSERT trigger. The CREATE TRIGGER statement itself will be written to the replication logs to be executed blindly by the slave(s); each MySQL instance will then have a copy of the trigger. Trigger creation on the master requires the SUPER privilege. Note that ALL PRIVILEGES does not include SUPER!

CREATE TRIGGER `foo_to_bar`
AFTER INSERT ON foo FOR EACH ROW
INSERT INTO bar (some_column)
VALUES (NEW.some_old_column);

NEW is an alias, available inside trigger bodies, for the row being inserted. When a user issues an INSERT INTO FOO statement, the master checks that this is okay, inserts the row into FOO, runs the trigger (with the privileges of the trigger creator), and records only the original INSERT statement for replication.

A slave server sees the INSERT INTO FOO statement, does the insert blindly, and runs the trigger. And that works IFF the user who created the trigger on the master also exists on the slave! If the trigger-creator user doesn’t exist on the slave, replication will fail on the INSERT statement, because the slave is no longer in “blind-execution” replication, but acting on a trigger just as if it had been caused only on the local instance.

“So,” you think to yourself, “I’ll just add the trigger-creator user and re-start replication.” Except that this is MyISAM, and there are no transactions… and that original INSERT statement already completed! So restarting replication will cause a duplicate PK error (if you’re lucky). If you were lucky, removing the new row in FOO on the slave will let you restart replication, and everything goes on its merry way. If you weren’t lucky, your slave is now strangely out of sync with the master. Have fun cleaning up!
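Since cleaning up after that failure is no fun, it’s worth checking ahead of time that the trigger’s definer account exists on every slave. A quick sketch of that check (host names and credentials are invented):

# Sketch: before relying on replication to re-fire triggers on the slaves,
# confirm the trigger's definer account exists on each slave too.
# Host names and credentials here are invented.
import MySQLdb

master = MySQLdb.connect(host='db-master', user='checker', passwd='secret', db='mysql')
slave = MySQLdb.connect(host='db-slave', user='checker', passwd='secret', db='mysql')

cur = master.cursor()
cur.execute("SELECT DEFINER FROM information_schema.TRIGGERS "
            "WHERE TRIGGER_NAME = 'foo_to_bar'")
(definer,) = cur.fetchone()            # e.g. 'migrator@localhost'
user, host = definer.split('@', 1)

cur = slave.cursor()
cur.execute("SELECT COUNT(*) FROM mysql.user WHERE user = %s AND host = %s",
            (user, host))
(count,) = cur.fetchone()
if not count:
    print "definer %s is missing on the slave; the first trigger will break replication" % definer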

But assuming the trigger-creator user is in place, things will work quite happily. At one point, we had over a million triggers running simultaneously for a few weeks. Of course, before trusting the results we ran some data verifications to ensure we had exactly the right data.

So be careful the first time you deploy triggers, and enjoy the power they give to change your schema without inconveniencing your users. :)

[1] This is equally a problem for saving the results of local-state-specific functions. For example, it’s a bad idea to do something like INSERT INTO foo VALUES (GET_LOCK('a lock')) because the value returned by GET_LOCK() could be different on the slave versus on the master. Timestamps manage to propagate properly because of extra annotations (read: computer-readable SQL comments) included in the replication logs.

