Benchmarks for the actual wedge
A note on wanting better ways to decide which tools, models, and services are actually worth our attention.
I want better benchmarks for what software actually offers. Not only benchmarks for whether something is impressive in general, but benchmarks for whether it has a real wedge.
There are almost as many tools and services as there are stars in the sky. Then you zoom in and there are even more: libraries, model providers, wrappers, agents, observability tools, deployment platforms, databases, vector stores, design tools, testing tools, and some new thing that claims to replace three of them by Friday.
For a long time I have wanted a clearer way to answer a basic question: what is this piece of software actually better at? What is its wedge? Is it worth using? More importantly, is it even worth spending attention on?
The model dropdown problem
LLMs are the clearest example right now. Models keep dropping. The dropdown gets longer. I do not blame anyone for opening the model selector and replacing GPT-5.4 with GPT-5.5, or Opus 4.7 with Opus 4.8, just because the number changed and the newer one is probably better.
Sometimes that works. Sometimes it is just vibes with a version number.
We either rely blindly on public benchmark sites like Artificial Analysis, or we spend our own quota and API budget testing models against whatever task happens to be in front of us that day.
Public benchmarks are useful. I use them. But they do not fully answer the question I care about: is this tool better for my work, my constraints, my users, my taste, and my failure modes?
Generic better is not enough
A tool can be better in a way that does not matter. A model can score higher and still be worse for the way you use it. A framework can be elegant and still cost too much migration energy. A service can save time in theory and create a new operational dependency in practice.
That is why I keep coming back to the word wedge. A wedge is not the feature list. It is the specific advantage that lets a tool enter your workflow and stay there.
Faster is not a wedge unless speed matters at that point in the system. Cheaper is not a wedge unless cost is actually constraining you. More powerful is not a wedge unless the extra power shows up in the work. Better developer experience is not a wedge unless it compounds into fewer mistakes, faster changes, or less mental drag.
What I want from a benchmark
I do not want benchmarks that only produce a leaderboard. I want benchmarks that help me make the same kind of decision again and again. A useful benchmark is closer to a decision ritual than a chart. It asks:
What am I trying to improve?
What is the current baseline?
What task represents real use?
What does success look like?
What does failure look like?
What is the cost in money, time, and attention?
What would make me switch?
What would make me stay put?
That kind of benchmark is less glamorous, but much more useful. It forces the tool to earn its place. It also protects you from novelty. New tools are exciting because they create a feeling of possibility. A benchmark should preserve that curiosity without letting it become drift.
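One way to make the ritual concrete is to write the switch criteria down before running anything, then let the numbers decide. The sketch below is only an illustration of that idea: the names `Trial`, `SwitchRule`, and `decide`, and the three measured quantities, are assumptions of mine, not an established method.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One tool measured on a real task (hypothetical structure)."""
    name: str
    success_rate: float     # fraction of runs judged acceptable, 0.0 to 1.0
    cost_per_run: float     # money per run, in whatever unit you track
    attention_hours: float  # hours spent learning, integrating, debugging


@dataclass
class SwitchRule:
    """What would make me switch, decided before testing."""
    min_success_gain: float     # candidate must beat the baseline by this much
    max_cost_ratio: float       # candidate may cost at most this multiple of baseline
    max_attention_hours: float  # attention budget for adopting the candidate


def decide(baseline: Trial, candidate: Trial, rule: SwitchRule) -> str:
    """Return 'switch' only if the candidate clears every bar, else 'stay'."""
    gain = candidate.success_rate - baseline.success_rate
    if gain < rule.min_success_gain:
        return "stay"  # better on paper, but not better enough to matter
    if candidate.cost_per_run > baseline.cost_per_run * rule.max_cost_ratio:
        return "stay"  # the improvement costs too much money
    if candidate.attention_hours > rule.max_attention_hours:
        return "stay"  # the improvement costs too much attention
    return "switch"


# Example with made-up numbers.
current = Trial("current", success_rate=0.70, cost_per_run=1.0, attention_hours=0.0)
shiny = Trial("shiny", success_rate=0.80, cost_per_run=1.2, attention_hours=3.0)
rule = SwitchRule(min_success_gain=0.05, max_cost_ratio=1.5, max_attention_hours=8.0)
print(decide(current, shiny, rule))
```

The point of the design is that staying put is the default: the candidate has to clear every bar, and the bars exist before the excitement does.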
Testing against real work
The hardest part is choosing the test. If the test is too abstract, it turns into theater. If it is too narrow, it overfits to one weird task. The best test is usually a small piece of real work that appears often enough to matter.
For LLMs, that might mean using the same messy prompt, same repository context, same review task, or same data-cleaning problem across models. For a library, it might mean rebuilding one real feature and tracking the rough edges. For a service, it might mean measuring not only setup time, but debugging time, billing clarity, latency, reliability, and how easy it is to leave.
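For the "same task across candidates" idea, a tiny harness is often enough. The sketch below assumes each candidate can be wrapped as a plain function from input text to output text, and that each task comes with its own pass/fail check. `run_benchmark` and the toy candidates are hypothetical stand-ins; in practice each function would wrap a real model, library, or service call.

```python
import time
from typing import Callable, Dict, List, Tuple

# A task is an input plus a check that judges the output.
Task = Tuple[str, Callable[[str], bool]]


def run_benchmark(
    candidates: Dict[str, Callable[[str], str]],
    tasks: List[Task],
) -> Dict[str, dict]:
    """Run every candidate on the same tasks and record passes and wall time."""
    results = {}
    for name, run in candidates.items():
        passed = 0
        started = time.perf_counter()
        for task_input, check in tasks:
            try:
                if check(run(task_input)):
                    passed += 1
            except Exception:
                pass  # a crash is one of the annoying parts; count it as a failure
        results[name] = {
            "passed": passed,
            "total": len(tasks),
            "seconds": time.perf_counter() - started,
        }
    return results


# Toy example: a whitespace-cleaning task with two stand-in candidates.
tasks: List[Task] = [
    ("  hello ", lambda out: out == "hello"),
    ("world", lambda out: out == "world"),
]
candidates = {
    "strips": lambda text: text.strip(),
    "does_nothing": lambda text: text,
}
for name, result in run_benchmark(candidates, tasks).items():
    print(name, result["passed"], "of", result["total"])
```

This only measures correctness and elapsed time. Debugging time, billing clarity, and how easy it is to leave still have to be noted by hand.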
The benchmark should include the annoying parts. Especially the annoying parts. That is where a lot of software reveals itself.
Attention is part of the cost
The older I get as an engineer, the more I think attention is the real budget. Money matters, latency matters, correctness matters, but attention decides what actually gets adopted.
Every tool asks for a piece of your mind. You have to learn its model, read its docs, trust its failure modes, understand its pricing, integrate it, update it, debug it, and remember why it exists in the stack. Even knowing about a tool has a cost.
So the question is not only whether the tool is good. The question is whether it deserves a permanent slot in your context window.
Everything needs a benchmark
This starts with software, but I think it leaks into everything. Tools, models, workflows, habits, ideas, even opportunities. The same question keeps showing up: compared to what, for which job, under which constraints, and at what cost?
I do not want to become cynical about new things. I like new things. I want to stay curious. But I want a repeatable way to decide. I want to be able to confidently choose tools, LLMs, services, and systems over and over again without pretending every decision is a fresh act of faith.
The dream is simple: less blind trust, less random switching, fewer vibes, better judgment. Find the wedge. Test the wedge. Keep what earns its place.