The progression bug I'd never have found in a unit test
A bug in my own app where the program quietly stopped delivering its prescribed intensity — found mid-squat, not in CI — and why dogfooding catches what tests cannot.
I build SteelRep, a strength-training app that tells you exactly what to lift next. I also use it on every gym visit. Those two facts are related, and the second one keeps finding bugs the first one shipped.
Here’s the one that made me question everything underneath it.
The set that gave it away
I was running the Cube Method. The Cube rotates each lift through three treatments on a wave — one heavy session, one volume session, one speed session — and the whole point is that they live at different percentages of your training max. Heavy is near-maximal. Speed is 60–65% moved as fast as you can. Volume sits in between, higher reps at a moderate load. The cube turns each week. That variation is the method.
I did my heavy squat session at 140 kg. That’s about 90% of my training max — exactly right for a heavy day. Good set.
The next squat session wasn’t a heavy one. It was one of the lighter treatments — speed or volume, I genuinely can’t remember which came up. Either way it should have dropped a long way down the percentages: nowhere near 90%.
The app told me to load 142.5 kg.
It hadn’t applied the session’s percentage at all. It had taken my heavy weight and added the standard 2.5 kg increment, as if every session were the same kind of session. The cube wasn’t turning. I was standing there looking at a speed-day prescription that was heavier than my heavy day, and the only reason I caught it is that I know what the Cube is supposed to feel like.
The question that opened the trapdoor
A weird thing happens when you catch your own app being wrong about something you care about. The specific bug stops mattering for a second, and a bigger question takes its place: is this program even doing what I designed it to do?
Because if the percentages weren’t being applied here, what else wasn’t? I built these programs. I knew the Cube was meant to hit three different intensities. And the app had just shown me, in a rack, that it wasn’t.
So I went and looked. The answer was worse than one bad number.
The intensity was real — it just never reached the bar
The percentages existed — but only as text. Each session carried a
human-readable note like target ~80% of training max or 65% intensity, and
the engine treated those notes as decoration. The one function that actually
turns a percentage into a weight was never called on the workout path. The app
prescribed the raw training max, applied the normal increment, and shipped you
out the door.
And it wasn’t just the Cube. A whole class of programs had the same hole:
- A wave program computed its ascending waves into a note, then prescribed the same bar weight for all nine of them. A wave program that didn’t wave.
- A daily-undulating program wrote “heavy / medium / light” into notes and then gave all three days one weight — and, because they shared an exercise id, one progression counter. Undulation that didn’t undulate.
- A block program labelled its light orientation week and its near-maximal peak week with the same number, and limped along only because it also gave RPE cues, so a lifter could do the maths by hand.
Every one of these looks complete in the app. Nothing crashes. A weight appears. You only notice the promise is broken if you know what the program was promising — which, outside of me using it, almost nobody does set by set.
Why a unit test would never have caught it
This is the part that should bother you more than the bug.
The percentage logic had tests. Green ones. They fed a set with a % TM into the
function and asserted it returned the right weight. Correct, and passing.
The function was also dead code — called only from those tests. The app’s workout path never invoked it. So my test suite was cheerfully validating a unit that wasn’t plugged into anything. Passing tests on code your product never runs aren’t neutral; they’re worse than no tests, because they point your confidence in exactly the wrong direction. Every green run told me the intensity system worked. None of them ran the path a lifter actually takes.
A unit test checks a unit in isolation. The bug was the isolation. There was no failing assertion to write, because the thing that was missing wasn’t logic — it was a wire between two correct pieces, and you can’t fail-test a connection you never asserted exists.
How it actually got in (the unglamorous truth)
I know exactly how this shipped, because I did it.
During development I built the percentage logic, wrote its tests, watched them
pass — and left the last mile, consuming that percentage per set in the live
workout, as an “I’ll wire that up later.” There was no // TODO. No issue. No
failing test, because what I’d deferred was the connection, and nothing was
asserting the connection. So “later” became never, the tests stayed green, and
the program went out looking finished.
That’s the honest anatomy of most bugs like this. Not a wrong decision — a deferred one that left no trace of having been deferred. The app didn’t lie to me on purpose. I just never made it tell the truth, and then I forgot I hadn’t.
The fix, and the thing that makes it stay fixed
The patch is one engine. Every prescribed weight now flows through a single path that computes the intensity for the set in front of you — consuming those percentages for the first time instead of reading them off a note. The dead percentage parser is gone.
But the patch isn’t the point. The point is the thing that makes it stay fixed: a closed-loop simulator that runs all 22 programs, week by week, and asserts that each one delivers the intensity its method promises — that the wave waves, the undulation undulates, the Cube actually turns. Plus a catalog test that fails the build if a future program ships without a deliverable intensity.
Now the test suite tests the path the app runs, not a unit off to the side. The gap I couldn’t see is written down in the one place that will shout if it ever reopens.
You can’t test for a gap you can’t see
I’m wary of the tidy moral here, the one that ends “so use your own product.” Of course you should. But the reason it works isn’t motivational, it’s mechanical: being a real user under real conditions generates the one input your imagination won’t supply — the speed day that should feel light and didn’t, the set that made you stop and ask whether the whole thing was built the way you thought.
Your test suite checks your code against your spec. Your spec — and your wiring — are the parts most likely to be quietly wrong, and they’re invisible from inside the code, where you already know what everything is supposed to do. The bar doesn’t know any of that. It just sits there at 142.5 kg on a speed day, telling you the truth your CI couldn’t.
That turns out to be the most useful code reviewer I have. More of these — the features I talked myself out of, and the rebuild that taught me what I got wrong about shipping — are coming next to this one, and the apps themselves are on the projects page.
FAQ
What exactly was the bug?
For programs that vary intensity by session type — like the Cube's heavy, repetition, and speed days — the percentages of your training max lived only in human-readable notes. The engine prescribed the raw training max plus the normal increment and never applied the per-session percentage, so a 90%-of-TM heavy day and a 60–65% speed day came out at nearly the same weight. It affected some programs, not all.
How did a unit-tested engine ship this?
The function that applied a set's percentage was dead code — called only from its own tests, never from the app's workout path. Green tests on a component that was never wired in gave false confidence. The tests passed; the product never ran the code they covered.
How do you stop it recurring?
One prescription engine now computes every weight, the dead percentage parser is gone, and a closed-loop simulator runs all 22 programs and asserts each delivers its method's intensity week to week — with a catalog-invariant test that fails if a future program doesn't.
Also published on Medium , LinkedIn . The version here is canonical.