
TestPackage

In the company where I work, we’ve had a pretty good relationship with automated testing over the years (particularly in our digital practice Java dev teams). A while back I was lucky enough to be the technical lead of an ~8-person dev team where, I thought, we got our automated testing approach pretty much as good as we could.

We used an in-house BDD framework that let analysts and client teams review acceptance criteria on JIRA tickets, and see progress on a functional map display. We had page objects for ease of refactoring of functional tests. We had sharding of test runs to keep the feedback loop tight. We measured our coverage and acted on it, pruning dead code or adding tests (and documentation) for untested emergent features. Coverage — measured over unit, integration and functional test levels — was basically 100%. By all accounts the project was a great success, and I think no small part of this was down to our testing approach.

But I think we can still do better.

Over the course of this and many other projects I’ve been able to see or try subtly different approaches to test automation, and the benefits or downsides that ensue. While most of the good things come from simple practices and behaviours as a team, I think just a few have a technical solution — so I’ve put together a tool, TestPackage, that’s intended to put these within reach of future development teams. (At least those using Java, though the principles are portable…)

Caveats

Note: This is all mainly about automated functional tests — black box tests — that are intended to verify the system against desired functionality. In my experience this has mostly been Java JUnit tests using Selenium/WebDriver, and/or some kind of integration test harness. We tend to use something like Jenkins as a Continuous Integration tool.

Many of these principles apply to white box tests too, but there are differences.

Also, this isn’t about TDD, BDD frameworks or whether 100% test coverage is needed. Assuming you have something external to your system that subjects it to some kind of verification tests, this is for you. If not, good luck ☺

Observation #1: No broken windows

The Pragmatic Programmer (and later Jeff Atwood) described how the ‘No broken windows’ theory applies to software development in general. The idea is that even cosmetic damage to something quickly sets up an environment where people don’t notice/don’t care if they damage it even further.

I think this is especially relevant to automated testing and CI. Every time I’ve been on a project with just a couple of ‘flapping tests’, it’s painfully obvious that people fail to notice new failures as easily, often missing new failures by hours or a whole day. On a Jenkins dashboard, ‘OK-yellow’ looks pretty much the same colour as ‘sky-is-falling-yellow’, at least to my eyes.

It’s even harder to understand the reasons behind new test failures too: if a test fails on our check-in, the most logical explanation should be that our change was responsible. However, once a mindset takes hold that it’s more likely to be unreliable tests than unreliable code, it’s easy to shirk critically re-examining our changes. This can be a very hard thing to shake off…

Observation #2: Give developers meaningful feedback quickly, and always in under 30 minutes

I’ve worked on projects where the automated functional tests running on CI have ranged from a few minutes up to several hours. I’ve even seen a team with a test suite that took more than 24 hours to run.

In all of these cases I think it’s a no-brainer that the quicker the test suite runs, the better. However, I think there is a non-linear relationship somewhere around the 30-minute mark: if test feedback takes more than this, quality gradually degrades over the course of the project. If it’s less, the team have a drastically easier job keeping the build stable, to the point that it’s worth putting a fair amount of effort into trimming five or ten minutes off a 35-minute build.

Why? Perhaps two reasons:

If it’s still running, though, we can either wait or start something new — at the risk of having to context switch back to fix a broken build. Invariably you’ll notice the test failure at exactly the wrong time: either when you’re deep in thought designing the next feature, or (worse) when colleagues start complaining about how you broke the build X hours ago and don’t seem to be doing anything about it (Sorry about that again).

Observation #3: Some tests can tell us more than others

In an ideal world, all tests should probably be equally useful; they should each verify a similarly sized chunk of functionality, and no one area of the system under test should be any more fragile than others. However, in reality, there seems to be a Pareto-like effect whereby some tests are just more useful at detecting regression than others.

Perhaps these are the tests that have better coverage, perform more assertions, or just exercise an easily-destabilized part of the system. Regardless, we can probably identify these stronger tests by keeping track of historical failures, and running first those tests which have failed most recently.

This is most pronounced when a developer is fixing a regression. For example, if you’re trying to fix the 5% of tests that you know you just broke, it would probably be good to get feedback on whether you succeeded before the rest of the test suite runs. You still value the outcome of the other 95% of tests, but running these 5% first is far more likely to be useful.
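
As a rough illustration of what ‘run the recently-failed tests first’ can look like, here’s a minimal sketch using JUnit 4. The class name and the history file are made up, the history is assumed to be a simple properties file mapping test class names to last-failure timestamps, and this is not TestPackage’s actual implementation:

```java
import org.junit.runner.JUnitCore;
import org.junit.runner.Result;

import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Properties;

// Hypothetical sketch only: run JUnit test classes ordered by how recently each
// last failed. Assumes a ".failure-history.properties" file mapping test class
// names to last-failure epoch millis, updated by some other step after each run.
public class FailureHistoryRunner {

    public static void main(String[] args) throws Exception {
        List<Class<?>> testClasses = new ArrayList<>();
        for (String className : args) {
            testClasses.add(Class.forName(className));
        }

        Properties history = new Properties();
        try (FileInputStream in = new FileInputStream(".failure-history.properties")) {
            history.load(in);
        } catch (IOException ignored) {
            // No history yet: just keep the order the classes were given in.
        }

        // Most recently failed first; tests that have never failed (timestamp 0) run last.
        testClasses.sort(Comparator.comparingLong(
                (Class<?> c) -> Long.parseLong(history.getProperty(c.getName(), "0"))
        ).reversed());

        Result result = new JUnitCore().run(testClasses.toArray(new Class<?>[0]));
        System.exit(result.wasSuccessful() ? 0 : 1);
    }
}
```

The other half of the loop — writing a fresh timestamp back to the history file for every test that fails — is omitted here for brevity.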

How can TestPackage help?

TestPackage can help in two ways here:

1. TestPackage records a simple history of past test failures, and sequences tests in ascending order of how recently they failed. Tests which recently broke run first, and tests which might never have failed run last. This way, a developer watching the tests as they run can get useful feedback far faster than if they had to wait for the tests to run in random order. Doing this isn’t anything particularly new, but TestPackage does this by default, rather than being a setting you have to look for.
2. TestPackage can use JaCoCo to record coverage data from the system under test while the tests are running. With the knowledge of each test’s exact coverage pattern, it is able to select subsets of tests to meet criteria the user specifies (e.g. ‘the quickest tests that will hit 50% coverage’ or ‘the tests that provide best coverage in 10 minutes’); a rough sketch of this kind of selection follows below.
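
For the second point, one way such a selection could work is a greedy pass over per-test coverage data. Here I’m assuming the coverage has already been extracted (for example, from JaCoCo execution data) into a set of covered-line identifiers per test; the class name is hypothetical and this is not a claim about TestPackage’s actual algorithm:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative greedy selection, not TestPackage's actual algorithm: keep picking
// the test that adds the most not-yet-covered lines until a target fraction of
// the system's coverable lines has been hit.
public class CoverageSubsetSelector {

    public static List<String> select(Map<String, Set<String>> coverageByTest,
                                      int totalCoverableLines,
                                      double targetFraction) {
        Set<String> covered = new HashSet<>();
        List<String> selected = new ArrayList<>();
        Map<String, Set<String>> remaining = new HashMap<>(coverageByTest);

        while ((double) covered.size() / totalCoverableLines < targetFraction
                && !remaining.isEmpty()) {
            String best = null;
            int bestGain = 0;
            for (Map.Entry<String, Set<String>> candidate : remaining.entrySet()) {
                Set<String> gain = new HashSet<>(candidate.getValue());
                gain.removeAll(covered);
                if (gain.size() > bestGain) {
                    bestGain = gain.size();
                    best = candidate.getKey();
                }
            }
            if (best == null) {
                break; // No remaining test adds coverage; the target is unreachable.
            }
            covered.addAll(remaining.remove(best));
            selected.add(best);
        }
        return selected;
    }
}
```

Selecting ‘the tests that provide best coverage in 10 minutes’ would keep the same greedy shape, with each test’s recorded run time folded into the stopping condition.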

Observation #4: Maybe black box tests should be versioned independently of the system under test

Black box tests are just another form (or maybe the main form) of stating our current best understanding of the system requirements. While our understanding of the requirements always changes a lot through the course of a project, I think that it’s generally additive — our understanding grows. For example, we add a new requirement, or add a new special case that affects an old requirement. Going back and deleting requirements can happen, but in my experience it’s been the exception rather than the rule, and it’s still ‘growth’ of our understanding of what’s required.

System code, though, has a much higher rate of change than the requirements, and that change is frequently negative — we add new code, but we also refactor or periodically break things (after all, if regressions never happened we wouldn’t need to automate our tests). The system under test will often be an imperfect realization of our understanding of the requirements.

So ideally — assuming defective tests are relatively rare vs defective system code — the only version of the test suite that matters is the latest one.

When a real system bug is discovered, we try to build an initially-failing automated test that exposes it. This is essential to help diagnose the bug, to prove that we fixed it, and to guard against it recurring.

In this situation, the latest test suite is what matters, but actually it would be quite nice to run this test suite against older versions of the system: this would help us understand when the defect arose and maybe link it to a specific commit, zeroing in on the root cause faster than hunting through the codebase. Debugging could become as simple as running git bisect in the background; if we’re lucky the tool will find the bug before we can.

When we hold our test suite in the same versioning scheme as the system, and more importantly, bake the test suite into the build process, running the latest tests against older versions of the system gets hard. We’d have to hack the build process heavily to achieve this, so we just track down the bug the hard way.

The few times that I’ve broken the version interdependency between system and tests, it’s really paid off. I’d be keen to hear what others’ experiences have been with this approach.
