These posts are going to be a constant for the next year, because there's no obj...

pjerem · 2026-05-03T07:04:48 1777791888

The news isn’t about how to compare models; it’s that Kimi K2.6 (and I’d add DeepSeek v4 Pro) are more or less equivalent to Opus, and that’s already pretty big.

They are open source and cost waaaay less per token than American models.

I’m using them right now on the $20 Ollama cloud plan and I can actually work with them on my side projects without reaching the limits too much. With Claude Pro $20 plan my usage can barely survive one or two prompts.

And I chose Ollama cloud just because their CLI is convenient to use, but there are a lot of other providers for those models, so you aren’t even stuck with shitty conditions and usage rules.

To me that’s a pretty bad thing for American economy.

chvid · 2026-05-03T08:52:31 1777798351

Or maybe it is a pretty good thing for the American economy that you can get AI at cost rather than monopoly pricing.

You know, for the rest of the economy that is not big tech.

PunchyHamster · 2026-05-03T10:17:05 1777803425

It's not good for the current administration. American AI growth is the only thing that keeps the GDP from looking terrible.

And investors pumping money into the circular US AI money flow just makes innovation everywhere else slower. If not for the GPU/memory drought, running stuff locally (or just in a competitor's cloud) would be far cheaper.

arvid-lind · 2026-05-03T12:01:03 1777809663

> It's not good for current administration

I don't know where to begin if you're leading with that. Anything approaching reality is not good for the current administration.

nelox · 2026-05-03T11:43:54 1777808634

That is the very reason the open source models exist. Prestige and soft power to influence interest away from American models and hopefully slow down their progress.

zozbot234 · 2026-05-03T12:06:51 1777810011

DeepSeek and other Chinese model makers are massively accelerating progress in AI not slowing it down. They're the only ones who still come up with real technical innovations while the proprietary model makers are stagnating.

Sammi · 2026-05-03T12:33:09 1777811589

I'm as happy to see cheap open weight models as anyone is, and I'm in Europe and certainly not cheering the US on, but that's a bunch of unfounded hyperbole you just said.

bigbadfeline · 2026-05-03T23:45:08 1777851908

>> DeepSeek and other Chinese model makers are massively accelerating progress in AI... They're the only ones who still come up with real technical innovations.

> that's a bunch of unfounded hyperbole you just said.

Calling the quote on top "unfounded hyperbole" betrays lack of knowledge and awareness about the subject. Keep in mind that when we talk about real technical innovations, we have in mind published research - not closed or hidden models, some of which we know only from hype but cannot even test. A cursory look at said research reveals more Chinese names than I can count.

Deepseek did introduce real technical innovations, they're in their papers, and there was plenty of talk about another "Sputnik moment" when their first model appeared. If you don't know what that means - it's the moment when the industry mobilizes to "accelerate progress" due to the unexpected appearance of strong competition.

There's a lot more to be said, but it wouldn't do much good to a person who's not following the trends.

overfeed · 2026-05-03T16:09:10 1777824550

You should read the research papers that come out with Deepseek releases. There is a reason why the first Deepseek release briefly caused existential panic.

Sammi · 2026-05-03T16:14:12 1777824852

I did not and am not inclined to invest the time to do so.

But I did read some second-hand reports that what was new and exciting was that they found some really good performance optimizations. The thing about DeepSeek publishing this is that now everyone has this.

Or did I miss something?

overfeed · 2026-05-03T20:55:56 1777841756

> The thing about DeepSeek publishing this is that now everyone has this.

It sounds like you're agreeing with upstream comment then!

>> DeepSeek and other Chinese model makers are massively accelerating progress in AI not slowing it down

DroneBetter · 2026-05-03T17:44:39 1777830279

from the "DeepSeek is a ploy to undermine usamerican models' duopoly" theory's perspective, "now everyone has this" helps them achieve this goal more efficiently.

especially if it's something that the major companies had already stumbled upon (something equivalent to) and regarded as a trade secret.

dumbmrblah · 2026-05-03T13:23:10 1777814590

That is a pretty big assumption (aka bullshit) unless you have direct insight into the inner workings of the big US labs. Just because it isn’t published doesn’t mean that innovation is not happening.

zozbot234 · 2026-05-03T13:30:34 1777815034

That's an unfalsifiable assertion with no evidence to support it, while all the visible evidence we do have points to stagnation and merely incremental pushes among the big proprietary model makers. Even Claude Mythos, which was 'teased' to the public but not released, is reportedly mostly a scaled-up model that takes massive compute resources to run (and lengthy agentic loops to achieve its reported results in computer security). The polar opposite to what the Chinese labs are releasing now.

dumbmrblah · 2026-05-03T13:46:02 1777815962

So no insight and just going off blogposts and YouTube huh. Pot kettle calling each other black etc.

Danox · 2026-05-03T17:09:50 1777828190

Sam Altman certainly got all lovey-dovey and less arrogant after DeepSeek came into prominence with at most a 3-6 month gap. If there was something mind-blowing, Sam would’ve gotten money out of Apple, and the same applies to Google: if they had something mind-blowing, they would’ve gotten more than a $1 billion refund. Neither happened. The bubble is near…

darkoob12 · 2026-05-03T12:34:03 1777811643

Can you name some tangible AI idea that came out of Chinese labs?

I can name thousands that came out of western universities.

I see a lot of rhetoric that only the Chinese labs are contributing to AI, while companies like Google and Microsoft are still publishing their research.

Unfortunately the domain of scientific papers is cluttered with AI slop, but the occasional serious papers that I find are still from western labs, particularly Google Research or Microsoft Research.

satvikpendem · 2026-05-03T12:58:59 1777813139

Any of DeepSeek's recent papers. They are mostly about efficiency, and that's how their inference costs can be so low.

gmerc · 2026-05-03T15:12:46 1777821166

Oh please https://github.com/deepseek-ai

darkoob12 · 2026-05-04T06:04:02 1777874642

It doesn't mean only Chinese companies are contributing. Take TurboQuant, a serious theoretical paper and not just GPU optimization: it came from Google Research, as did the original transformer, MoE, and many other techniques we use daily in deep learning, as well as libraries like TensorFlow, which were pivotal to the fast development of AI.

gmerc · 2026-05-04T09:09:59 1777885799

Bit of a strawman, isn’t it.

amunozo · 2026-05-03T16:47:29 1777826849

I am using both on the OpenCode Go plan and they're pretty good, but I would say still not at the same level as GPT-5.5 in my experience. I don't know about Opus.

On a different note, is Ollama cloud good?

pjerem · 2026-05-03T18:40:10 1777833610

> is Ollama cloud good?

I'd say they have reliability issues but for the price it's worth it.

I like that usage isn't measured per token but per computation time, which means that you get more usage when models become more efficient.

rurban · 2026-05-03T10:16:36 1777803396

They are in no way as good as Opus yet. But as good as Sonnet, yes. I'm using all of them in real life.

alansaber · 2026-05-03T11:26:43 1777807603

I appreciate your reply but you are completely glossing over his point about how head to head model evals are useless lmao

Cookingboy · 2026-05-03T09:03:03 1777798983

> for American economy.

There is more to American economy than big tech.

And that's precisely why this has started: https://www.wired.com/story/super-pac-backed-by-openai-and-p...

joe_mamba · 2026-05-03T09:14:44 1777799684

>There is more to American economy than big tech.

Most of the stock market valuation is big-tech, and most of people's retirements are the stock market, so... if the AI bubble bursts a lot of the US will be affected.

coldtea · 2026-05-03T10:59:37 1777805977

>Most of the stock market valuation is big-tech

Which is why most of it is a bubble

atemerev · 2026-05-03T09:39:29 1777801169

I do not know why this is downvoted. This is true.

gigatexal · 2026-05-03T10:57:58 1777805878

Agreed. I upvoted.

yorwba · 2026-05-03T06:42:24 1777790544

There are objective ways to compare models. They involve repeated sampling and statistical analysis to determine whether the results are likely to hold up in the future or whether they're just a fluke. If you fine-tune each model to achieve its full potential on the task you expect to be giving it, the rankings produced by different benchmarks even agree to a high degree: https://arxiv.org/abs/2507.05195

The author didn't do any of that. They ran each model once on each of 13 (so far) problems and then they chose to highlight the results for the 12th problem. That's not even p-hacking, because they didn't stop to think about p-values in the first place.

LLM quality is highly variable across runs, so running each model once tells you about as much about which one is better as flipping two coins once and having one come up heads and the other tails tells you about whether one of them is more biased than the other.
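
To make the coin-flip point concrete, here is a minimal sketch in Python (standard library only) that runs a Fisher exact test on pass/fail counts. The counts are invented for illustration and are not results from the article, and the choice of test is an assumption; any reasonable two-sample test would make the same point.

```python
from math import comb

def fisher_exact_two_sided(pass_a, n_a, pass_b, n_b):
    """Two-sided Fisher exact p-value for a difference between two pass rates."""
    total_pass = pass_a + pass_b
    denom = comb(n_a + n_b, total_pass)

    def prob(k):
        # Probability that model A accounts for exactly k of the total passes
        # if both models had the same underlying pass rate.
        return comb(n_a, k) * comb(n_b, total_pass - k) / denom

    observed = prob(pass_a)
    lo = max(0, total_pass - n_b)
    hi = min(n_a, total_pass)
    # Sum every outcome that is at least as unlikely as the observed one.
    return sum(p for p in (prob(k) for k in range(lo, hi + 1))
               if p <= observed * (1 + 1e-9))

# Hypothetical: one run each, model A passes and model B fails.
p_single = fisher_exact_two_sided(1, 1, 0, 1)

# Hypothetical: 20 runs each, model A passes 15 and model B passes 5.
p_repeated = fisher_exact_two_sided(15, 20, 5, 20)

print(f"one run each:  p = {p_single:.3f}")
print(f"20 runs each:  p = {p_repeated:.4f}")
```

With one run per model the p-value is 1.0, so a single pass against a single fail is no evidence at all. With 20 runs per model and the same kind of gap, it drops to roughly 0.004.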

jiggunjer · 2026-05-03T06:47:26 1777790846

That's objective metrics. Not an objective way to compare, which is the selection of metrics to include.

cromka · 2026-05-03T06:58:21 1777791501

That's exactly why there's a ton of different benchmarking suites used for evaluating hardware performance.

I reckon we'll have similar suites comparing different aspects of models.

And, at some point, we'll be dealing with models skewing results whenever they detect they're being benchmarked, like it happened before with hardware. Some say that's already happening with the pelican test.

PunchyHamster · 2026-05-03T10:20:33 1777803633

> I reckon we'll have similar suites comparing different aspects of models.

The problem is that hardware benchmarks are harder to game. Yes, a hardware manufacturer can make driver tweaks so that, say, a particular game runs better, but the benchmark is still representative of the workload the user faces, and they can't change the most important part, the hardware. They can't gimmick their way through a benchmark when designing hardware.

Meanwhile in LLM land the game is to tune it for the current popular set of benchmarks, all while user experience is only vaguely related to those results

adrian_b · 2026-05-03T10:06:44 1777802804

Fine-tuning for a specific task is even less realistic than the benchmarks shown in TFA.

Most people who have computers could run inference for even the biggest LLMs, albeit very slowly when they do not fit in fast memory.

On the other hand, training or even fine tuning requires both more capable hardware and more competent users. Moreover the effort may not be worthwhile when diverse tasks must be performed.

Instead of attempting fine-tuning, a much simpler and more feasible strategy is to keep multiple open-weights LLMs and run them all for a given task, then choose the best solution.

This can be done at little cost with open-weights models, but it can be prohibitively expensive with proprietary models.
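
A rough sketch of that strategy in Python, under some loud assumptions: the "models" here are stub functions standing in for real inference calls, and `score_add` is a hypothetical automated scorer. The comment above does not say how the best solution gets chosen, so the scoring step is a guess at one option.

```python
from typing import Callable, Dict, Tuple

def pick_best(task: str,
              models: Dict[str, Callable[[str], str]],
              score: Callable[[str], float]) -> Tuple[str, str, float]:
    """Run every model on the task and return (name, answer, score) of the top scorer."""
    best = None
    for name, generate in models.items():
        answer = generate(task)
        s = score(answer)
        if best is None or s > best[2]:
            best = (name, answer, s)
    if best is None:
        raise ValueError("no models given")
    return best

# Stand-ins for real open-weights models (hypothetical canned outputs).
stub_models = {
    "model_a": lambda task: "def add(a, b): return a - b",
    "model_b": lambda task: "def add(a, b): return a + b",
}

def score_add(source: str) -> float:
    """Hypothetical scorer: fraction of test cases the generated add() passes.

    exec() on model output is unsafe outside a sandbox; it is used here
    only because the inputs are the hard-coded stubs above.
    """
    namespace = {}
    try:
        exec(source, namespace)
        cases = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
        return sum(namespace["add"](*args) == want for args, want in cases) / len(cases)
    except Exception:
        return 0.0

winner, answer, best_score = pick_best("write add(a, b)", stub_models, score_add)
print(winner, best_score)
```

The cost argument shows up in the loop: every extra model is one more `generate` call per task, which is cheap on hardware you already own and adds up quickly on a per-token API.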

verve_rat · 2026-05-03T06:20:44 1777789244

My theory is we will end up in a similar spot to hiring people. You can look at a CV (benchmarks) but you won't know for sure until you've worked with them for six months.

We as an industry cannot determine if one software engineer is objectively better than another, on practically any dimension, so why do we think we can come to an objective ranking of models?

tlb · 2026-05-03T08:59:34 1777798774

Yes, the entire field of software engineering ran aground on not being able to test how well people can write software.

But I'm more optimistic about testing programming models. You can run repeated tests, and compare median performance. You can run long tests, like hundreds of hours, while getting more than a few humans to complete half-day tests is a huge project. And you can do ablation testing, where you remove some feature of the environment or tools and see how much it helps/hurts.
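
As a sketch of what comparing median performance over repeated tests could look like, here is a bootstrap confidence interval for the difference of medians, in standard-library Python. The scores are made up, and the bootstrap is just one assumed way to put error bars on a median.

```python
import random
from statistics import median

def bootstrap_median_diff(scores_a, scores_b, n_resamples=2000, seed=0):
    """Return (median difference, 95% CI lower bound, 95% CI upper bound)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    diffs = []
    for _ in range(n_resamples):
        # Resample each model's runs with replacement and recompute the medians.
        resample_a = rng.choices(scores_a, k=len(scores_a))
        resample_b = rng.choices(scores_b, k=len(scores_b))
        diffs.append(median(resample_a) - median(resample_b))
    diffs.sort()
    lower = diffs[int(0.025 * n_resamples)]
    upper = diffs[int(0.975 * n_resamples) - 1]
    return median(scores_a) - median(scores_b), lower, upper

# Hypothetical per-run scores (e.g. fraction of tests passed) for two models.
scores_a = [0.71, 0.74, 0.78, 0.80, 0.83, 0.85, 0.88, 0.90]
scores_b = [0.40, 0.45, 0.48, 0.52, 0.55, 0.57, 0.59, 0.60]

point, lower, upper = bootstrap_median_diff(scores_a, scores_b)
print(f"median difference {point:.3f}, 95% CI [{lower:.3f}, {upper:.3f}]")
```

If the interval excludes zero, the gap is unlikely to be run-to-run noise. The same function can compare "with tool X" against "without tool X" runs for the ablation case.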

zelphirkalt · 2026-05-03T06:54:25 1777791265

Not many things are as manifold broken as hiring these days. I hope we do not end up there.

roymain · 2026-05-03T07:39:19 1777793959

The CV-to-six-months analogy is actually exactly right and it's also why benchmarks for hiring people stopped being useful. The signal that holds up is what you see when something breaks, which is hard to compress into a number.

bartekpacia · 2026-05-03T10:11:02 1777803062

this smells like an ai-generated comment so much

pishpash · 2026-05-03T06:58:57 1777791537

You do not interview 1000 rounds on problems you're actually solving. If you did, hiring would be fine. Minus the social fit aspect, which isn't as relevant for a model.

PunchyHamster · 2026-05-03T10:22:14 1777803734

Terrible comparison. A CV is just a list that tells you barely anything about performance, and that's when the candidate is not lying to get through the HR filter.

And we can judge developer performance. It just takes 6 months to a year of working with a team, so it's hard to turn into a metric.

taegee · 2026-05-03T08:38:47 1777797527

While I partially agree with you, there IS work being done to make the metrics comparable. Eg:

https://ghzhang233.github.io/blog/2026/03/05/train-before-te...

It just hasn't been widely adopted yet. And it might be in each of their particular interests that it continues to stay so for a while. It's basically like p-hacking.

surgical_fire · 2026-05-03T11:38:44 1777808324

This is a problem for OpenAI and Anthropic when they are bleeding money and in desperate need to jack up prices by moving people to their very expensive API.

It's very difficult to justify spending on their models in a world where DeepSeek costs a fraction as much, and Chinese open models exist that perform as well as what is considered the state of the art; it only depends on you adjusting how you use them.

A couple of days ago I canceled ChatGPT and started to try out DeepSeek. Let's see how it goes.

Danox · 2026-05-03T17:16:06 1777828566

Cheaper and only 3 to 6 months behind at most.

mark_l_watson · 2026-05-03T14:11:23 1777817483

I agree. I have rather constrained use cases for LLMs and the agentic harnesses that I use with them.

I try one or two of my use cases with new models or harnesses, make my own often subjective judgements, and largely ignore benchmarks.

Blogging and writing in general are a business, or feed other tech adjacent businesses, and a lot of writing about evals is attention getting - nothing wrong with that but there is a lot of noise.

charcircuit · 2026-05-03T08:30:07 1777797007

A pretty simple one would be to have every model try and one shot every ticket your company has and then measure the acceptance rate of each model.
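
A minimal sketch of that measurement in Python, assuming you already have a log of (model, accepted) outcomes per ticket. The outcomes below are invented, and the Wilson interval is my addition, there to show how much uncertainty a small ticket count leaves.

```python
from collections import defaultdict
from math import sqrt

def wilson_interval(accepted, total, z=1.96):
    """Approximate 95% Wilson score interval for an acceptance rate."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = accepted / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half

def acceptance_report(outcomes):
    """Map each model to (acceptance rate, interval lower, interval upper)."""
    counts = defaultdict(lambda: [0, 0])  # model -> [accepted, total]
    for model, accepted in outcomes:
        counts[model][0] += int(accepted)
        counts[model][1] += 1
    return {m: (a / n, *wilson_interval(a, n)) for m, (a, n) in counts.items()}

# Hypothetical outcomes: 20 tickets per model, one attempt each.
outcomes = (
    [("model_a", True)] * 14 + [("model_a", False)] * 6
    + [("model_b", True)] * 9 + [("model_b", False)] * 11
)

report = acceptance_report(outcomes)
for model, (rate, low, high) in report.items():
    print(f"{model}: {rate:.0%} accepted, 95% CI [{low:.2f}, {high:.2f}]")
```

With these made-up numbers, 70% against 45% looks decisive, but at 20 tickets each the two intervals still overlap, so more tickets would be needed before ranking the models.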

sam_goody · 2026-05-03T08:35:14 1777797314

Except that if you tried one-shotting your ticket twenty times at different hours of the day and different days of the week, you would see enough variation to fill a benchmark even if you used the same model every time. Much more so if you fiddled with the thinking or changed the prompt.

That's because the models are non-deterministic, because of constant updates and changes, and because the models are throttled according to number of users, releases, et al.

serial_dev · 2026-05-03T08:46:27 1777797987

You never get "the same" Steph Curry, he might be tired, annoyed by a fan, getting older... but if he and I were to throw 100 3-pointers, we could all correctly guess who will perform better.
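
The arithmetic behind that intuition, as a small Python sketch. The 40% and 10% make rates are assumed numbers picked for illustration, not real shooting statistics.

```python
from math import comb

def binom_pmf(n, p):
    """Probabilities of making 0..n shots out of n at make rate p."""
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

def prob_a_beats_b(n, p_a, p_b):
    """Exact probability that shooter A makes strictly more of n shots than B."""
    pmf_a = binom_pmf(n, p_a)
    pmf_b = binom_pmf(n, p_b)
    total = 0.0
    cum_b = 0.0  # P(B makes fewer than k shots)
    for k in range(n + 1):
        total += pmf_a[k] * cum_b
        cum_b += pmf_b[k]
    return total

# Assumed make rates: 40% for the pro, 10% for the amateur.
one_shot = prob_a_beats_b(1, 0.4, 0.1)
hundred_shots = prob_a_beats_b(100, 0.4, 0.1)

print(f"1 shot each:    pro wins outright with probability {one_shot:.2f}")
print(f"100 shots each: pro wins outright with probability {hundred_shots:.6f}")
```

On a single shot each, the better shooter comes out strictly ahead only 36% of the time under these assumptions. Over 100 shots the same gap becomes a near-certainty, which is the case for repeated runs over single-run comparisons.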

sam_goody · 2026-05-03T13:10:21 1777813821

Good point.

But I use Codex and Claude daily (work and hobby respectively). And there are days where one or the other just seems to have gotten up on the wrong side of the bed. Or is just being lazy. Or is suddenly super-powered and does everything, including what I asked it not to. (To be fair, the same thing happens with myself. :/)

I am convinced that if I was bench-marking, I would be convinced these are different models on different days.

[This conviction may say more about me than about the model.]

serial_dev · 2026-05-03T15:14:27 1777821267

That's also fair, Anthropic lobotomized their services a couple of times already. One week, you are in awe that the tools figure out everything, explain everything, consider everything, produce a clean fix... next week, they are completely useless.

cyanydeez · 2026-05-03T10:06:26 1777802786

Unfortunately, you're probably right, but the cock measuring contest is going to keep escalating because the billionaires and VC backers need to _win_. And the Psychosis is going to produce some horrible collateral damage.

chrisandchris · 2026-05-03T07:10:47 1777792247

That was my thought too.

> The Word Gem Puzzle is a sliding-tile letter puzzle. The board is a rectangular grid (10×10, 15×15, 20×20, 25×25, or 30×30) filled with letter tiles and one blank space.

Just last week my superior asked to implement that for a customer. /s

Maybe some real, real task would be good? Add some database, some REST, some random JS framework and let it figure out a full-stack task instead of creating some rectangles?

PunchyHamster · 2026-05-03T10:23:18 1777803798

Giving a real, relatable task like that is a memory exercise, not a reasoning exercise. The training datasets have tens of thousands of apps like that.

ljlolel · 2026-05-03T06:23:57 1777789437

[flagged]

idonotknowwhy · 2026-05-03T06:37:12 1777790232

So like Open Router?

ljlolel · 2026-05-03T16:46:19 1777826779

A secure and open source Open Router