Who Tests the Testers

     

I was chatting with a fellow student about my LLM harness, and he offered up a great suggestion: “Why don’t you run the harness against some of the standard LLM benchmarks”. My initial thought, was the base model is already run against those benchmarks so what would be the point of re-running them through the harness. After a bit more discussion, I realised his point. Any deviation between the base model and the model running within the harness would likely be because of the harness. And, at the minimum, it would be a nice way to test if the harness was improving over time.

To be honest, I have been very critical of LLM benchmarks; well benchmarks used in tech in general. I am a firm believer in Goodhart’s law: “When a measure becomes a target, it ceases to be a good measure”. It seems like almost all benchmarks are now targets to game, but in this case, it’s the environmental difference between the two that is the measure which makes this a borderline integration test in my mind; anyway, he won me over.

The interface between the a raw LLM call and Strap is obviously different, so I made a little app that tries to run them in the most similar way possible. The “base qwen” runner runs the benchmark questions directly against my LLM server using the llama.cpp API endpoint directly. Strap, which in the end also calls that same API endpoint, has the questions fed to it via an IRC interface.

I haven’t finished building out the whole thing, but when I ran a test run of 50 basic maths questions to make sure both systems were working I saw something intriguing:

Base Qwen Test Run 100% accurate, 117 seconds

Qwen Test Run in Strap 98% accurate, 31 seconds

Firstly, it was unexpected that Strap would be 3x faster than talking directly to the API endpoint (the reason is for another day), but I would have expected Strap to be better at math not worse. That it got a question wrong was confusing. I thought maybe it was just a networking error or something, so I ran the same test 3 times, every time Strap got the same question incorrect. So I added some extra logging to see what it was doing wrong

the failing test

When Carlos Gives You Lemons

The question it gets wrong is:

Carlos is planting a lemon tree. The tree will cost \$90 to plant. Each year it will grow 7 lemons, which he can sell for \$1.5 each. It costs \$3 a year to water and feed the tree. How many years will it take before he starts earning money on the lemon tree?

The data set says the correct answer is $13$. I am not great at maths, but if I do that by hand I get $12$:

$90 / \left(7 \times 1.50 - 3\right) = 12$

But after some chatting with Claude, it seems the ambiguity is in: “… How many years will it take before he starts earning money on the lemon tree?”

I think this really highlights one of the reasons you hear engineers say, “English as a programming language isn’t sufficient”. Depending on the real world context of that question, I think either 12 or 13 years is correct. For example, if you are a struggling business and want to know when your first dollar of profit will happen, 12 is likely the correct answer. If you are an accounting firm and what to know when the P&L for the business will show a profit, then 13 is the correct answer.

Just to make sure my confusion was actually warranted, I presented the question to Claude who first told me that the data set was in error:

claude saying the math is unambiguous and the data set is wrong

But then when I asked, c’mon, what are the odds of me finding an error in a well used data set used in benchmarking, and how is it that a harness with a maths tool would get it wrong, but a raw model with no math capabilities would get it right. Claude then “changed its mind” and told me the data set was correct and Strap’s “strict maths mode” made it wrong:

claude saying the dataset answer is correct because of before he starts earning money

And finally, I tried to reframe the question to see if it could tell me which is objectively correct, it (probably leaning into sycophancy), told me that both were correct depending on how you look at it:

claude saying the question and answer is ambiguous and both answers are defensible

I think this is a somewhat useful example of the short comings of an LLM, human language, benchmarks or the issues when combining all three. Regardless of if this error is coming from Strap, the dataset, human communication, or my understanding, it’s intriguing.