Tests allow us to forget the context

In this article, I will explore the idea that tests (the automated kind) are bits of context which we can afford to forget about — until we break them, and need to reevaluate what we are doing.

What are tests? Why do we test?

A few examples

In the context of producing software, we usually define “testing” as a manual or automated activity aimed at making sure that our product does what it is supposed to do, and that it doesn’t blow up in the process.

  • Having end users explore an unfinished version can provide invaluable feedback, but is costly to set up.

  • Running tests in build pipelines will usually prevent untested versions from being deployed, but requires a lot of compute.

  • “Empirical” testing on a dev instance may reveal scenarios you unexpectedly made possible, but is a non-repeatable, time-consuming endeavour.

  • Small test suites attached to the codebase are usually fast, but only cover a narrow scope of the project - with no guarantees regarding the behavior of the whole.

Generally speaking, we test because we hope to catch bugs before they materialize in production.

A former colleague of mine would insist that whatever the kind of test (unit, integration, end-to-end, …), automation is what matters most.

Having worked on projects with lots of moving parts and a mostly manual QA process, I can only agree with him - if only out of empathy for the colleagues doing the validation.

A quick word about test harnesses

When I first heard “test harness”, I thought it meant having enough tests that they act like a straitjacket for the code base, with both the good (security) and the bad (limited movement) that entails.

It made sense to me, but I was wrong.

According to online definitions, a test harness is everything we can use to simulate a full environment for the purposes of our testing:

In software testing, a test harness is a collection of stubs and drivers configured to assist with the testing of an application or component. It acts as imitation infrastructure for test environments or containers where the full infrastructure is either not available or not desired.

Wikipedia

In terms of code that would mean mocks, dummies, fakes, spies… But it doesn’t end there, as code is capable of provisioning its own infrastructure, ranging from simple test containers to entire environments. There is virtually nothing we cannot do when testing.
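As a minimal sketch of what a code-level harness piece looks like (the `notify_user` function and its `mailer` parameter are invented for illustration), Python’s `unittest.mock` lets a fake stand in for real infrastructure:

```python
from unittest.mock import Mock

def notify_user(user_id, mailer):
    """The mailer is injected, so a test can substitute a fake for it."""
    mailer.send(to=user_id, body="Hello!")
    return True

# The Mock acts as imitation infrastructure: no SMTP server required.
fake_mailer = Mock()
result = notify_user(42, fake_mailer)
fake_mailer.send.assert_called_once_with(to=42, body="Hello!")
```

The test never touches a mail server; it only checks that the code talked to its dependency the way we expect.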

Which is equal parts wonderful and terrifying, because now the testing infrastructure becomes a project of its own, with its own conventions, limitations and coupling to the original project’s technologies — but I digress.

The concise answer

At first we test to assert some degree of correctness.

Then we keep repeating the tests to make sure we did not introduce a regression.

Which is why manual testing becomes worthless the minute the code is deployed: even if we documented the test in great detail, the amount of work required to reassert correctness will keep growing.

Context windows vs “the zone”

Context window?

According to Claude’s documentation:

The “context window” refers to all the text a language model can reference when generating a response, including the response itself. This is different from the large corpus of data the language model was trained on, and instead represents a “working memory” for the model. A larger context window allows the model to handle more complex and lengthy prompts, but more context isn’t automatically better. As token count grows, accuracy and recall degrade, a phenomenon known as context rot. This makes curating what’s in context just as important as how much space is available.

Adding more and more tokens (context) leads to better accuracy, but only up to a point, and then it’s downhill from there. Hallucinations start happening, and in some cases, the original context can get lost entirely, making the model endlessly ramble to itself1.

Our “zone” is not that different

I can’t help but draw a parallel with our own minds here. It’s a well-known phenomenon that we developers need time to get “in the zone” and that interruptions destroy our focus (and in turn, our productivity)2.

If we repeatedly feed ourselves low-quality “brainrot” content, we damage our ability to focus on a specific task3.

Our mental state — from extreme stress to great joy — also affects our ability to focus in various ways. The mind needs to quiet down so we can focus on our work.

When dealing with a complex enough problem, the amount of information to keep track of can be overwhelming — especially if the components involved have ill-defined boundaries & abstractions, or the requested change has cross-cutting concerns that weren’t anticipated.

This is the point where we tend to split the problem into smaller, more digestible tasks. Whether that means preparatory cleanup in related areas of the code or tackling the biggest expected pain point head-on, all that matters is that we get the ball rolling; the rest is easy(-ish).

“For each desired change, make the change easy (warning: this may be hard), then make the easy change”

Kent Beck

From the simple tasks we achieve, we grow our understanding – our context – of the problem. For some of us, this context stays mostly “up” until we are done, and we have little option but to code ourselves to sleep.

Circling back to the original point

Emulating our shortcomings is something Language Models do very well: if we give one a far-reaching task with too little context, something will be missed. If we provide it with too much context, it will get confused.

It resembles our own thinking in this regard: “I must remember to take care of X, Y and Z” — the moment the action is “remember …”, we’ve already forgotten. This is why we create to-do lists: so that nothing slips through “the context” - and the smallest tasks can harass us constantly until we take care of them.

We need a way to reduce the overhead

This is where the tests come in:

  • They serve as an inventory of working features
  • They can (and should) be run before and after making changes
  • They cover edge cases we’ve uncovered, even if we forget about them in the moment
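A single regression test can carry that context for us. A minimal sketch, where the price parser and its comma-decimal edge case are invented for illustration:

```python
def normalise_price(raw):
    # Hypothetical parser: trims whitespace, drops a currency symbol,
    # and accepts a comma as the decimal separator.
    cleaned = raw.strip().lstrip("€$")
    return float(cleaned.replace(",", "."))

# The edge case lives in the test; we are free to forget about it
# until some change turns this red.
assert normalise_price(" €19,99 ") == 19.99
assert normalise_price("7.50") == 7.5
```

The assertion is the inventory entry: the day it fails, the forgotten context comes straight back into view.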

The larger the codebase and the farther-reaching the change, the more they help.

If our test coverage is low, so is our confidence in not breaking anything. The first (reasonable) thing to do is to add more tests.

Once that is done, we can reduce the cognitive effort associated with that area of the code. We can forget about it until some seemingly unrelated change turns a test red, and then we’ll say aloud: “I had completely forgotten about this!”.

How do we make the best out of it?

While the realisation itself is new to me, what I would do with it isn’t exactly ground-breaking:

  • Keep modules small and focused on doing one thing well
  • Test at boundaries and on error-prone / brittle features
  • Keep the scope of manual tests as small as possible - automate as much as you can
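To illustrate “testing at boundaries” (module and function names are hypothetical), the idea is to exercise a module’s public entry point rather than its internals:

```python
# A small, focused module: one public entry point, private helpers behind it.
def _strip_noise(text):          # internal detail, free to change
    return " ".join(text.split())

def make_slug(title):            # the boundary we test against
    return _strip_noise(title).lower().replace(" ", "-")

# Tests target the boundary only, so refactoring _strip_noise never
# breaks them as long as the observable behaviour stays the same.
assert make_slug("  Hello   World ") == "hello-world"
```

Tests pinned to private helpers would turn red on every refactoring; tests pinned to the boundary only turn red when behaviour actually changes.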

For AI-augmented coding

There might be an angle here about keeping the code-focused models away from the tests during development, or at least getting them excluded from the context. Of course that means the code itself should not hide its intent.

This would lighten the context and help prevent the cases where the model / agent messes up the tests in an attempt to have them pass. Should the tests fail, the context they encapsulate could be fed back to the model to improve the next output.

I am entirely unsure how handy such a solution would be - I’m more the type to write my own bugs.

For the Luddites who still write their code “character by character”

I hope the tools could get to a point where a mix of:

  • User “persona” so the scenarios contain expected data
  • Description of intent — “User Stories” (which we already have because we’re Agile™)
  • A browser-use kind of module
  • A deployed instance of the application

would be enough to run a lot of the manual validation.

Of course, weird UI behaviors may happen and the “security / malicious user” angle isn’t covered, but it would at least confirm that regular users can achieve their goals. This may even help older companies be more amenable to modernising their software, by offering a “low investment” path to regression-testing 15-year-old front-ends.

It’ll always beat reviewing generated code all day long.

Footnotes

  1. Try playing the “guess my character” game on a local model with little guidelines, and see for yourself.

  2. Thankfully, we solved that problem by outlawing open-plan offices.

  3. See “Mobile phone short video use negatively impacts attention functions: an EEG study” and “The effect of short-form video addiction on users’ attention”.