In the company where I work, we’ve had a pretty good relationship with automated testing over the years (particularly in our digital practice Java dev teams). A while back I was lucky enough to be the technical lead of an eight-person dev team where, I thought, we got our automated testing approach about as good as it could be.
We used an in-house BDD framework that let analysts and client teams review acceptance criteria on JIRA tickets, and see progress on a functional map display. We had page objects for ease of refactoring of functional tests. We had sharding of test runs to keep the feedback loop tight. We measured our coverage and acted on it, pruning dead code or adding tests (and documentation) for untested emergent features. Coverage — measured over unit, integration and functional test levels — was basically 100%. By all accounts the project was a great success, and I think no small part of this was down to our testing approach.
But I think we can still do better.
Over the course of this and many other projects I’ve been able to see or try subtly different approaches to test automation, and the benefits or downsides that ensue. While most of the good things come from simple practices and behaviours as a team, I think just a few have a technical solution — so I’ve put together a tool, TestPackage, that’s intended to put these within reach of future development teams. (At least those using Java, though the principles are portable…)
Caveats
Note: this is all mainly about automated functional tests — black box tests — that are intended to verify the system against desired functionality. In my experience this has mostly been Java JUnit tests using Selenium/WebDriver, and/or some kind of integration test harness. We tend to use something like Jenkins as a Continuous Integration tool.
Many of these principles apply to white box tests too, but there are differences.
Also, this isn’t about TDD, BDD frameworks or whether 100% test coverage is needed. Assuming you have something external to your system that subjects it to some kind of verification tests, this is for you. If not, good luck ☺
Observation #1: No broken windows
The Pragmatic Programmer (and later Jeff Atwood) described how the ‘No broken windows’ theory applies to software development in general. The idea is that even cosmetic damage to something quickly sets up an environment where people don’t notice, or don’t care, if they damage it even further.
I think this is especially relevant to automated testing and CI. Every time I’ve been on a project with ‘just a couple of flapping tests’, it’s painfully obvious that people fail to notice new failures as easily, often missing new failures by hours or a whole day. On a Jenkins dashboard, ‘OK-yellow’ looks pretty much the same colour as ‘sky-is-falling-yellow’, at least to my eyes.
It’s even harder to understand the reasons behind new test failures too: if a test fails on our check-in, the most logical explanation should be that our change was responsible. However, once a mindset takes hold that unreliable tests are more likely than unreliable code, it’s easy to avoid critically re-examining our changes. This can be a very hard thing to shake off…
Observation #2: Give developers meaningful feedback quickly, and always in under 30 minutes
I’ve worked on projects where the automated functional tests running on CI have ranged from a few minutes up to several hours. I’ve even seen a team with a test suite that took more than 24 hours to run.
In all of these cases I think it’s a no-brainer that the quicker the test suite runs, the better. However, I think there is a non-linear relationship somewhere around the 30 minute mark: if test feedback takes more than this, quality gradually degrades over the course of the project. If it’s less, the team have a drastically easier job keeping the build stable, to the point that it’s worth putting a fair amount of effort into trimming five or ten minutes off a 35 minute build.
Why? Perhaps two reasons:
Confidence: Every developer knows that their next commit could never break the build, so there’s no need to spend ages running the full test suite locally before they check in. That’s great, until you’re the one who breaks the build and has to buy donuts for the whole team. I suspect that most developers are happy with running a test suite that takes a few minutes, accepting if it takes 10–20 minutes, but downright annoyed when they have to wait 30 or more minutes before they can check in their latest pride and joy. After all, the build is blue right now but might be failing later (back to observation #1…)
Context switching: After check-in, regardless of whether the developer ran tests locally, the next most important verification is the continuous integration build. After finishing development of a feature, we might check emails, go for a coffee, or catch up with other team members. Upon returning, we have an easy decision to make if the build has finished (whether it succeeded or not). We either celebrate and move on to the next work item, or try to figure out why the build failed.
If it’s still running, though, we can either wait or start something new — at the risk of having to context switch back to fix a broken build. Invariably you’ll notice the test failure at exactly the wrong time: either when you’re deep in thought designing the next feature, or (worse) when colleagues start complaining about how you broke the build X hours ago and don’t seem to be doing anything about it (Sorry about that again).
Observation #3: Some tests can tell us more than others
In an ideal world, all tests should probably be equally useful; they should each verify a similarly sized chunk of functionality, and no one area of the system under test should be any more fragile than others. However, in reality, there seems to be a Pareto-like effect whereby some tests are just more useful at detecting regression than others.
Perhaps these are the tests that have better coverage, perform more assertions, or just exercise an easily-destabilized part of the system. Regardless, we can probably identify these stronger tests by keeping track of historical failures, and running first those tests which have failed most recently.
This is most pronounced when a developer is fixing a regression. For example, if you’re trying to fix the 5% of tests that you know you just broke, it would probably be good to get feedback on whether you succeeded before the rest of the test suite runs. You still value the outcome of the other 95% of tests, but running these 5% first is far more likely to be useful.
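To make the idea concrete, here is a minimal sketch (not TestPackage’s actual implementation — the class and method names are hypothetical) of ordering tests by failure recency: sort test identifiers against a map of last-failure timestamps, with never-failed tests running last.

```java
import java.time.Instant;
import java.util.*;

public class RecencyOrdering {

    // Sorts test names so the most recently failed tests come first.
    // Tests with no recorded failure get Instant.MIN and so sort to the
    // end; List.sort is stable, so they keep their original order there.
    static List<String> orderByFailureRecency(List<String> tests,
                                              Map<String, Instant> lastFailure) {
        List<String> ordered = new ArrayList<>(tests);
        ordered.sort(Comparator.comparing(
                (String t) -> lastFailure.getOrDefault(t, Instant.MIN),
                Comparator.reverseOrder()));
        return ordered;
    }

    public static void main(String[] args) {
        Map<String, Instant> history = new HashMap<>();
        history.put("LoginTest", Instant.parse("2014-01-10T09:00:00Z"));
        history.put("CheckoutTest", Instant.parse("2014-01-12T14:30:00Z"));
        // SearchTest has never failed, so it has no history entry.

        List<String> run = orderByFailureRecency(
                Arrays.asList("SearchTest", "LoginTest", "CheckoutTest"), history);
        System.out.println(run); // [CheckoutTest, LoginTest, SearchTest]
    }
}
```

In a JUnit 4 setting, the same comparator could be plugged in via `Request.sortWith` so the runner executes tests in that order.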
How can TestPackage help?
TestPackage can help in two ways here:

1. TestPackage records a simple history of past test failures, and sequences tests in ascending order of how recently they failed. Tests which recently broke run first, and tests which might never have failed run last. This way, a developer watching the tests as they run can get useful feedback far faster than if they had to wait for the tests to run in random order. Doing this isn’t anything particularly new, but TestPackage does it by default, rather than as a setting you have to look for.
2. TestPackage can use JaCoCo to record coverage data from the system under test while the tests are running. With the knowledge of each test’s exact coverage pattern, it is able to select subsets of tests to meet criteria the user specifies (e.g. ‘the quickest tests that will hit 50% coverage’ or ‘the tests that provide best coverage in 10 minutes’).
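The second point amounts to a budgeted set-cover problem. TestPackage’s real selection logic isn’t shown here, but a naive greedy sketch of the ‘best coverage in N minutes’ idea — with hypothetical names and made-up coverage data — might look like this:

```java
import java.util.*;

public class GreedySelection {

    // A test, the coverage probes it hits, and how long it takes to run.
    record TestCase(String name, Set<Integer> covered, int seconds) {}

    // Greedily picks whichever affordable test adds the most not-yet-covered
    // probes, until the budget is exhausted or no test adds new coverage.
    static List<String> selectWithinBudget(List<TestCase> tests, int budgetSeconds) {
        Set<Integer> coveredSoFar = new HashSet<>();
        List<TestCase> remaining = new ArrayList<>(tests);
        List<String> selected = new ArrayList<>();
        int timeUsed = 0;

        while (true) {
            TestCase best = null;
            int bestGain = 0;
            for (TestCase t : remaining) {
                if (timeUsed + t.seconds() > budgetSeconds) continue;
                Set<Integer> gain = new HashSet<>(t.covered());
                gain.removeAll(coveredSoFar);
                if (gain.size() > bestGain) {
                    bestGain = gain.size();
                    best = t;
                }
            }
            if (best == null) break; // nothing fits, or nothing adds coverage
            selected.add(best.name());
            coveredSoFar.addAll(best.covered());
            timeUsed += best.seconds();
            remaining.remove(best);
        }
        return selected;
    }

    public static void main(String[] args) {
        List<TestCase> suite = List.of(
                new TestCase("broad", Set.of(1, 2, 3, 4), 60),
                new TestCase("narrow", Set.of(1, 2), 10),
                new TestCase("disjoint", Set.of(5, 6), 20));

        // With a 90-second budget, 'broad' and 'disjoint' fit; 'narrow'
        // is skipped because it adds no new coverage.
        System.out.println(selectWithinBudget(suite, 90));
    }
}
```

Greedy selection isn’t optimal, but for set cover it’s a well-known approximation and is simple enough to run on every build.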
Observation #4: Maybe black box tests should be versioned independently of the system under test
Black box tests are just another form (or maybe the main form) of stating our current best understanding of the system requirements. While our understanding of the requirements always changes a lot through the course of a project, I think that it’s generally additive — our understanding grows. For example, we add a new requirement, or add a new special case that affects an old requirement. Going back and deleting requirements can happen, but in my experience it’s been the exception rather than the rule, and it’s still ‘growth’ of our understanding of what’s required.
System code, though, has a much higher rate of change than the requirements and it’s frequently negative — we add new code, but we also refactor or periodically break things (after all, if regressions never happened we wouldn’t need to automate our tests). The system under test will often be an imperfect realization of our understanding of the requirements.
So ideally — assuming defective tests are relatively rare vs defective system code — the only version of the test suite that matters is the latest one.
When a real system bug is discovered, we try to build an initially-failing automated test that exposes it. This is essential to help diagnose the bug, to prove that we fixed it, and to guard against it recurring.
In this situation, the latest test suite is what matters, but actually it would be quite nice to run this test suite against older versions of the system: this would help us understand when the defect arose and maybe link it to a specific commit, zeroing in on the root cause faster than hunting through the codebase. Debugging could become as simple as running git bisect in the background; if we’re lucky the tool will find the bug before we can.
When we hold our test suite in the same versioning scheme as the system and, more importantly, bake the test suite into the build process, running the latest tests against older versions of the system gets hard. We’d have to hack the build process heavily to achieve this, so we usually just track down the bug the hard way.
The few times that I’ve broken the version interdependency between system and tests, it’s really paid off. I’d be keen to hear what others’ experiences have been with this approach.