
Is Claude Code the best coding harness? Is anyone running evals on that?


In my anecdotal experience, it is not. The same model, Opus, works better in third-party harnesses such as Factory Droid or Amp.

Claude Code, on the other hand, is the most subsidized one, both for consumers (through the Max subscription) and for enterprises (token discounts). It is also heavily optimized for cost, especially through token caching and reduced thinking, at the expense of quality.


Codex is far more subsidized currently, with much more generous limits even at 20 dollars a month.


I always wonder why everybody is so hyped about the CLI harnesses. I am quite happy with Cursor (even though I used to code with tmux+vim in the old days). One feature I like very much is that I can switch between different models without needing an account with every model provider.

The main thing I am missing is having it on all my devices (for example, using it from a smartphone). They have a solution for that too (the cloud version), but that is too expensive IMHO. The last time I checked out Claude Code, it was too expensive for my taste as well (burning through tokens like there was no tomorrow).


Ironically, there are plenty of evals showing that it’s not actually that great. Even with Anthropic models, other harnesses are more efficient, both in terms of the number of problems solved and token usage.

Significant regressions also seem to be introduced from time to time after releases.

The UX is great, and if you need a kitchen sink packed with tons of features, even though you’ll probably only end up using a fraction of them, it’s fine.

But if you want something that performs well, you’re better off using something like Opencode or Swival.dev.


Terminal Bench tests agent harnesses.

The best two are Codex and Forge Code.

However, I am using plugins and skills that are only compatible with Claude Code or work best with it.

So, for me, Claude Code with plugins like claude-meme, Context Mode, Superpowers and Get Shit Done is better than other tools.

I think everyone should test multiple models and multiple agent harnesses against their own specific needs, codebase, and way of working.



