Who Tests the Testers

     

I was chatting with a fellow student about my LLM harness, and he offered up a great suggestion: “Why don’t you run the harness against some of the standard LLM benchmarks?” My initial thought was that the base model has already been run against those benchmarks, so what would be the point of re-running them through the harness? After a bit more discussion, I realised his point. Any deviation between the base model’s scores and those of the model running within the harness would likely be because of the harness. And, at a minimum, it would be a nice way to test whether the harness was improving over time.


The Context As a Protocol

     

“If everyone is thinking alike, then no one is thinking. -Benjamin Franklin” -Rob Rohan

For some quick background, I have 3 modes:

  1. I write code “by hand” for work, for school, and for fun.
  2. I use Claude Code for some open source projects, test projects, and to keep up with what’s going on.
  3. I have my own lab where I made my own LLM harness, and run a local LLM on a 5080 with 16 GB and a 32k context window (i.e. my own personal Anthropic / OpenAI).

I generally do things like number 3 because I am a slow learner. Unless I can take something apart or build it from the ground up with my hands, I have a hard time understanding it.


The Realist Adjusts The Sails

     

I have always lived in the wilderness.

I am the guy who gets called when the co-founder embezzled a bunch of the company’s money, and they need someone to fix the product, but they don’t have the funds. Or the product has already had six “hot shot” ex-FANG programmers with “strong opinions” who all half-implemented their ideas, and now the product’s code goes in different directions because they thought the code wasn’t fashionable for the time. Or there was a rift with the original developer, so he ran off with all the DNS and AWS logins. Or the original website developer died, so there was no way to get the original source code. Or the product just doesn’t fit the market, and they have about 2 months of runway left and need to pivot to “something”.


Local Agent Vibe Coded Keylogger

     

I’ve been doing a lot of experiments and spikes using my local lab running a local LLM (on an Nvidia 5060 Ti with 16 GB) using my own AI harness (written by hand in a language other than Python or JavaScript; thank you very much), and I decided to put it through its paces and let it try to code something itself.

I am close to releasing an old-school digital audio workstation (called a Tracker) for Mac and the Steam Deck. It’s really fun, but if you are not used to trackers, you’d probably find it impossible to use. To hopefully flatten that learning curve, I decided to make some videos showing how to use it.


Ghost Installs via AI Harness

     

This is a feature and a bug. Somewhat scary, but also has the potential for being cool.

People are doing all kinds of interesting things with LLMs, but the original use case (and the thing I find them to be the best at) is translating from one thing to another; doing what is sometimes called style transfer. In fact, I believe the transformer architecture was created by Google when they were trying to make a better translate.google.com, which should be an indication.


Testing a Spatial Memory Index for Strap

     

One of the pieces I still need to build for strap is a long-term memory store. As I’ve written about before, strap compresses context to work within small token budgets. But context compression only handles what’s in the current conversation. The bigger question is how to surface relevant memories from past sessions.

The obvious approach is brute-force cosine similarity over sentence embeddings. It works for a small number of stored memories, and it’s fast enough. But I got curious whether a spatial index could do better, and whether you could make that index inspectable at the same time.
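For reference, the brute-force baseline can be sketched in a few lines. This is only an illustration of the general technique, not strap’s actual code: the `(text, embedding)` memory layout, the function names, and the toy 3-dimensional vectors are all assumptions made for the example, and real sentence embeddings would have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k(query, memories, k=3):
    """Score every stored memory against the query and return the k best.

    `memories` is a list of (text, embedding) pairs (a hypothetical layout).
    Every memory is compared on every query, so cost grows linearly with
    the number of stored memories.
    """
    scored = [(cosine_similarity(query, emb), text) for text, emb in memories]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy vectors standing in for real sentence embeddings (illustrative only).
memories = [
    ("user prefers tabs", [1.0, 0.0, 0.0]),
    ("project uses sqlite", [0.0, 1.0, 0.0]),
    ("user prefers spaces in yaml", [0.9, 0.1, 0.0]),
]
results = top_k([1.0, 0.0, 0.0], memories, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

The linear scan is the part a spatial index would try to avoid: instead of scoring every memory, it would only visit the region of the embedding space near the query.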