A small tale about testing and what a coffee machine has to do with it

Heiko W. Rupp
3 min read · May 22, 2023

I am a big fan of testing in production. That does not mean I don’t write unit tests or test in staging, but all those environments differ enough from production that they can only give hints about whether the code will work there.

Internals of a coffee machine, with the white tube that I had to replace after the original one went boom

Last week our coffee machine broke down. The white hose in the above picture had burst, and the machine leaked hot water instead of brewing coffee. It took a while to disassemble it and find the leak (it was on the underside of the hose). I looked around for replacement parts and ordered some, including the hose and some hose clips. I wasn’t able to fit the clips securely (the hose is operated at 15 atmospheres), so I ordered new, different ones…

Long story short: after I installed everything I tested the machine by first running the cleaning program and then successfully pouring myself a coffee.

Putting it back into production

So I reassembled the machine and put it back on the shelf. When I tried to pour myself a coffee the next morning, I heard the machine grind and pump, but I got no coffee in my mug.

After debugging (disassembling again), I found out that I had put one seal in the wrong way round. After I fixed that and ran more tests, the machine finally worked again as intended.

How’s that related to software?

Remember: after I replaced the broken hose, the machine was still disassembled. My tests could be seen as a unit test for the pump and the high-pressure tube, followed by an integration test. So we can put it in production, can’t we?

Such integration tests are often done in stage, and one tries to mimic production as well as possible in order to be able to say that if it works in stage, it will work in production.

The badly applied seal in my case could be a misconfigured database connection on the software side. Or some value in a vault. Or a different third-party system you talk to. The list is endless. You will not find such an issue in stage, only in production.

Testing in production

In the end, the tale above tells me that you can only really test in production. I am not suggesting that you skip upfront testing, but only production can tell you whether your software works.

Especially in SaaS-y environments, it is easy to spot issues in production and then either roll back or just divert the traffic to other replicas that still run the old version (our “old replica” was a French press, which we have at home).

I already hear you saying “but what about the customers, I can’t give them a broken version”, and you are absolutely right. What you want to do is run a so-called canary deployment: you give a small share of the traffic to the new version (the canary) and see if that works. If all is good, you give it more traffic. And there is no requirement to pick that initial small share at random. The start could be QE traffic only, and only if that is successful would you put real customer traffic on it. Or in my case, I poured the first two mugs after the repair for myself and then let the family touch the machine again :)
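
To make the idea a bit more concrete, here is a minimal sketch of such a staged canary rollout in Python. It is not taken from any real deployment tool: the step values in `STEPS`, the function names `pick_version` and `rollout`, and the `canary_is_healthy` callback are all made up for illustration. In practice the health check would look at error rates or similar signals of the new version.

```python
import random

# Assumed rollout steps: the share of customer traffic sent to the new version.
# The first step (0.0) means that only QE traffic reaches the canary.
STEPS = (0.0, 0.05, 0.5, 1.0)


def pick_version(is_qe: bool, customer_share: float, rng: random.Random) -> str:
    """Decide whether one request is served by the "old" or the "new" version."""
    if is_qe:
        return "new"  # QE traffic always exercises the canary
    # rng.random() is in [0.0, 1.0), so a share of 0.0 never picks the canary
    return "new" if rng.random() < customer_share else "old"


def rollout(canary_is_healthy, steps=STEPS) -> bool:
    """Walk through the steps; return True if promoted, False if rolled back.

    `canary_is_healthy(share)` is a hypothetical callback that reports whether
    the new version behaves well while receiving the given customer share.
    """
    for share in steps:
        if not canary_is_healthy(share):
            return False  # roll back: all traffic stays on the old version
    return True


if __name__ == "__main__":
    rng = random.Random(42)
    print(pick_version(is_qe=True, customer_share=0.0, rng=rng))   # QE request
    print(pick_version(is_qe=False, customer_share=0.0, rng=rng))  # customer request
    print(rollout(lambda share: True))
```

The coffee machine equivalent of the first step is me drinking the first two mugs: the "customers" (the family) only get traffic once the QE-only stage has passed.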

Heiko W. Rupp

Long time Open Source developer, currently working at Red Hat. Find me also at https://mastodon.social/@pilhuhn