Guessing the timing isn't accidental. Demonstrated openness vs harsh regulation
declan_roberts · 2026-06-27 13:19:27 UTC
Nobody forced Anthropic to go on a media blitz loudly proclaiming the dangers of their new AI model. Serves them right, honestly.
cr125rider · 2026-06-27 13:23:51 UTC
China = Open. US = Harsh Regulation
Strange timeline, though this only works because it’s aligned with Xi’s goals.
Havoc · 2026-06-27 13:41:31 UTC
Yeah can definitely see a world where China pivots and we're stuck with closed/closed
Mistral...don't fumble this
skeledrew · 2026-06-28 16:27:28 UTC
What are some things that China has pivoted on in history?
ricardobeat · 2026-06-27 10:02:00 UTC
Presumably this has been in production for a while, and is one of the reasons they were able to dramatically lower prices a month ago?
_0ffh · 2026-06-27 10:28:49 UTC
Lookahead Sparse Attention should be playing a big role as well, as it dramatically slashes memory consumption.
chronogram · 2026-06-27 11:24:08 UTC
Yes. Section 5 talks about real-world deployment: 5.1: "The DSpark draft models are co-deployed with the preview versions of DeepSeek-V4-Flash and DeepSeek-V4-Pro"; 5.4: "MTP-1 represents the former production setup, having been superseded by DSpark two weeks following the DeepSeek-V4-preview release."
sourcecodeplz · 2026-06-27 20:32:53 UTC
good catch, they reduced the prices by 75%, which seems exactly in line with the speed/inference optimization gains?
Jackobrien · 2026-06-27 10:02:26 UTC
I see a world soon where there’s an extremely wide variety of small models for speculative decoding, unique to use cases, companies, and even individuals.
nicce · 2026-06-27 10:13:14 UTC
Hopefully that is the case and hardware does not get impossible to get.
pydry · 2026-06-27 10:24:58 UTC
yes, heavily constrained by sophisticated guardrails.
this is definitely where things are going. the enormous "eat the world" models have extreme diminishing returns by comparison.
Der_Einzige · 2026-06-27 14:54:31 UTC
You clearly didn't read the recent speculative decoding papers because it's been possible to use any model to speculate for any other model for a while. They solved the tokenization problems that prevented this in the past.
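For readers who have not seen speculative decoding written out, here is a minimal sketch of the generic greedy variant. It is not DSpark's method and not taken from the paper: `toy_target`, `toy_draft`, and every other name below are made up for illustration, and the "models" are plain functions from a token context to the next token. With greedy verification the output is identical to what the target model would produce alone; the draft only changes how many target steps are needed.

```python
from typing import Callable, List, Sequence

NextToken = Callable[[Sequence[int]], int]


def greedy_decode(model: NextToken, prompt: Sequence[int], max_new: int) -> List[int]:
    """Plain one-token-at-a-time decoding, used here as the reference."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: Sequence[int], max_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: the draft proposes, the target verifies."""
    out = list(prompt)
    goal = len(prompt) + max_new
    while len(out) < goal:
        # 1. The cheap draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # 2. The target checks every proposed position. A real system does this
        #    in one batched forward pass; the loop here is only for clarity.
        for i, tok in enumerate(proposal):
            expected = target(out + proposal[:i])
            if tok != expected:
                # First mismatch: keep the agreed prefix plus the target's token.
                out.extend(proposal[:i])
                out.append(expected)
                break
        else:
            # Every proposed token was accepted.
            out.extend(proposal)
            if len(out) < goal:
                out.append(target(out))  # bonus token from the verification pass
    return out


# Hypothetical toy "models": the draft agrees with the target most of the time.
def toy_target(ctx: Sequence[int]) -> int:
    return (3 * ctx[-1] + 1) % 7


def toy_draft(ctx: Sequence[int]) -> int:
    return 0 if ctx[-1] % 3 == 0 else toy_target(ctx)
```

Calling `speculative_decode(toy_target, toy_draft, [1, 2], 12)` gives the same sequence as `greedy_decode(toy_target, [1, 2], 12)`. Systems that sample instead of decoding greedily replace the equality check with a probabilistic accept/reject rule, which this sketch leaves out.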
preetham_rangu · 2026-06-27 10:08:32 UTC
do they use their OCR, or someone else?
piterrro · 2026-06-27 10:09:21 UTC
I’ve been using DeepSeek v4 pro for a month now in Kilo Code and it's great. Fast, reliable, large context window and cheap as… Did 1.5B tokens this month and it cost me 40 USD (majority cached, but still).
spiderfarmer · 2026-06-27 10:16:25 UTC
Is there a way to see how many tokens one uses with Claude Code (Pro)?
the casino has no clocks, as one HN user put it some time ago.
I second ccusage, it's nice
edg5000 · 2026-06-27 11:09:32 UTC
It's in the JSONs in ~/.claude, but last 30 days only I think. You can have the model analyze history. So for correct history you'd need to run history analysis on a cron job or something. Kinda hacky.
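A rough sketch of the "analyze the JSONs yourself" approach, in case it helps. Everything about the file layout here is an assumption rather than something stated above: it assumes the session logs are JSON Lines files somewhere under the directory you pass in (for example `~/.claude/projects`), and that each record may carry a `message.usage` object with the field names listed in `USAGE_FIELDS`. Check a few lines of your own files before trusting the totals.

```python
import json
from collections import Counter
from pathlib import Path

# Assumed usage field names; adjust to whatever your log files actually contain.
USAGE_FIELDS = (
    "input_tokens",
    "output_tokens",
    "cache_creation_input_tokens",
    "cache_read_input_tokens",
)


def sum_token_usage(log_dir: Path) -> Counter:
    """Add up token counts from every *.jsonl file below log_dir."""
    totals: Counter = Counter()
    for path in log_dir.rglob("*.jsonl"):
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    continue  # skip blank, partial, or corrupt lines
                if not isinstance(record, dict):
                    continue
                message = record.get("message")
                usage = message.get("usage") if isinstance(message, dict) else None
                if not isinstance(usage, dict):
                    continue  # records without usage data (summaries, etc.)
                for field in USAGE_FIELDS:
                    value = usage.get(field)
                    if isinstance(value, int):
                        totals[field] += value
    return totals


# Example (not run here): sum_token_usage(Path.home() / ".claude" / "projects")
```

Running this from cron and appending the result to a file would keep a longer history than the retention window, at the cost of being exactly as hacky as described.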
Stagnant · 2026-06-27 16:25:22 UTC
The 30 day limit can be overridden by adding "cleanupPeriodDays": 9999 to .claude/settings.json
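If you would rather not hand-edit the file, a small helper can merge the key in without touching your other settings. The key name and value come from the comment above; the helper name `set_cleanup_period` is made up, and which settings file applies (a project-level `.claude/settings.json` or a user-level one in your home directory) is an assumption you should check for your setup.

```python
import json
from pathlib import Path


def set_cleanup_period(settings_path: Path, days: int = 9999) -> dict:
    """Set "cleanupPeriodDays" in a JSON settings file, keeping all other keys."""
    settings = {}
    if settings_path.exists():
        text = settings_path.read_text(encoding="utf-8")
        if text.strip():
            settings = json.loads(text)
    settings["cleanupPeriodDays"] = days
    # Create the directory if this is the first settings file written there.
    settings_path.parent.mkdir(parents=True, exist_ok=True)
    settings_path.write_text(json.dumps(settings, indent=2) + "\n", encoding="utf-8")
    return settings


# Example (not run here): set_cleanup_period(Path(".claude") / "settings.json")
```

Back up the file first if it contains anything you care about; the helper rewrites it with two-space indentation.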
> Local-first session search, analytics, insights, and token use statistics for coding agents, supporting Claude Code, Codex, and more than 20 other agents.
solid piece of software
richardlblair · 2026-06-27 12:26:50 UTC
I've been using omp with deepseek as my task and quicktask agents, and sonnet as everything else.
It's drastically reduced my AI spend. I went from spending $40/day to $10/day.
Which provider? I went through 40 bucks on it on openrouter. It was not a lot of back and forth, context ended at around 300k, 15kloc output. I was using opencode, unsure if I can make the total token count visible.
peheje · 2026-06-27 15:24:09 UTC
OpenRouter sometimes chooses a very expensive provider. Try the floor slug or choose directly the provider. I moved to just putting 5 dollars directly on deepseek instead of going through OR.
apitman · 2026-06-27 14:45:47 UTC
Have you compared Kilo to Pi or OpenCode? Those are the two I'm most familiar with but always looking for alternatives.
redman25 · 2026-06-29 20:22:32 UTC
I've been preferring Mimo recently. Same price as DeepSeek, more reliable tool calling (subjectively), and has some nice qualities in terms of prose, etc.
I've heard others say that DeepSeek tends to be smarter on specific problems but that Mimo tends to be more well-rounded.
rvz · 2026-06-27 10:17:52 UTC
This is just one of many papers DeepSeek have released to be able to serve models at extremely cheap prices, unlike the others taking on >$100B+ of debt in building data centers for the same thing.
> As with V4-Flash, we treat this point as an indication that DSpark sustains useful throughput under an interactivity target that the baseline cannot efficiently support. At matched system capacities, DSpark delivers 57% to 78% faster per-user generation.
Reminds me of the flawed solution in scaling servers in 2017 that use memory-intensive technologies by adding even more servers to solve the problem. (It just increases costs.)
Rather than doing that, think about which critical parts of your app can be written in a more performant technology.
Fast forward to 2026, now you can see who is just throwing more money at the problem to create even more problems, whereas DeepSeek is giving us optimized solutions.
I know exactly who I would pay attention to, and it is absolutely not Anthropic.
denverllc · 2026-06-27 12:38:27 UTC
For so long American companies have operated under the assumption that servers are cheaper than developers, and that was used to justify all sorts of inefficient practices.
The last year has shown that’s not true anymore (even for web servers).
simianwords · 2026-06-27 14:10:40 UTC
...... are you really suggesting OpenAI and Anthropic don't have access to these techniques?
sourcecodeplz · 2026-06-27 20:34:38 UTC
if they didn't, they do now. as deepseek published the howto
2838383838 · 2026-06-27 10:18:36 UTC
Must be wonderful to be on the board of OpenAi et al & their PE investors whilst China keeps blowing up these mines under their feet lmao.
Luckily Korean pension funds will buy all the trash as usual but goddamn you gotta start moving quick or you are gonna need some serious AGI to show you how to offload those bonds
ForHackernews · 2026-06-27 10:28:52 UTC
"We will build the machine-god and pray for it to pay for itself."
FridgeSeal · 2026-06-27 10:42:41 UTC
Every day, the rate of “could post a picture of 40k tech priests and have it taken unironically” goes up, and it’s starting to get concerning.
ozgrakkurt · 2026-06-27 10:56:50 UTC
Don’t worry they will sell all the hardware and data they acquired with their grift
throwa356262 · 2026-06-27 19:50:30 UTC
Why do you think they have started accusing Chinese labs of stealing and distillation?
A&O no longer have the moat to justify their high valuation. The only thing they can do now is to get the government to forbid the Chinese models.
kamranjon · 2026-06-27 10:22:25 UTC
DeepSeek continues to not only push the boundaries but also publish these incredible papers explaining how they achieved their gains - something the American labs no longer do unfortunately. Chinese labs are doing the most interesting work in AI right now.
herodoturtle · 2026-06-27 10:25:40 UTC
Publishing by necessity I wonder? American labs on the cutting edge pioneering the way forward, so Deepseek open sourcing what they’ve got is to help even the playing field.
Hopefully the experts here can offer insight. The above is just my hunch and I’m not a specialist in this field.
jonplackett · 2026-06-27 10:27:41 UTC
Wouldn’t that just help the American labs anyway though? Or do they assume they’ve actually already figured this stuff out and kept it secret?
7speter · 2026-06-27 11:19:52 UTC
From what I gather, the Chinese are behind, but a lot of their research amounts to scrappy, clever discoveries in how to use more novel technologies (for Qwen and DeepSeek, it's mixture-of-experts models, which can do inference using a portion of the model at a time). The Chinese also distill information from American models, so there’s that.
The American companies, from my impression, don’t involve themselves with such lowly “hacks” because they have so much money to just push forward with doing everything on big heavy models that run on the most cutting-edge Nvidia chips that they can, at the moment, kinda sorta get on demand (I say that in some degree of jest).
idiotsecant · 2026-06-27 13:14:17 UTC
The American companies would love to develop these 'hacks' because it would make them more money, something they are in existential need of right now.
They don't develop them because they don't collaborate publicly anymore.
Where would the whole industry be if Google never allowed publishing the transformers paper?
It's not a coincidence that the American AI industry grew fastest in capability when it was the most open.
7speter · 2026-06-27 14:17:54 UTC
Just a crazy catch 22, it seems
tiahura · 2026-06-27 15:11:17 UTC
Why would they collaborate? Why not defect and just keep theirs private and implement the open ones?
mistercheph · 2026-06-27 20:45:33 UTC
this is not an effective long term strategy in a collaborative environment that is advancing for the same reason that having a private secret fork of the linux kernel with a few proprietary improvements is not an effective strategy.
integrating your own work with the latest public advances takes resources. For one or two small changes this is manageable, but the further you diverge from the public, the cost of maintenance rises exponentially if you want to continue to integrate public advances. when you publish your meaningful advance, you offload the maintenance burden onto everyone else (and they only have to pay a linear cost rather than an exponential one) as it's integrated by default in new work.
In most cases, the (exponential) maintenance cost of integrating public advances with secret ones exceeds the value of the public advances, so most that undertake this strategy of advancing the open frontier in secret don't attempt to integrate continually, but instead try to make a breakaway sprint in isolation to grab a few sticky customers before the unstoppable wave of the public frontier catches up.
This is a pattern commonly seen in university research departments when researchers switch into product development mode, most of these projects are a sprint to advance away from the public frontier once a good idea is found and they do good work and find a few customers for a little while. But if you check back in a few years you won't find an advanced research department but a zombie IP company that brings in a steady income via IP enforcement and a small number of customers for whom switching is too expensive.
parineum · 2026-06-28 02:17:13 UTC
> They don't develop them because they don't collaborate publicly anymore.
How do you know they aren't doing this stuff? Something has to account for them leading the industry.
vintermann · 2026-06-27 11:40:25 UTC
It used to be the case that NSA hired the majority of all math graduates in the US, and were assumed to be years ahead in cryptography. Yet in the 90s, it became clear that they no longer were that - among other things, the cipher of the notorious Clipper chip was broken, and we can rule out that it was made weak on purpose because the whole point of Clipper was that they had a backdoor.
So, despite hiring the cream of the crop of math graduates, who could read the papers of free academia, but whose own results the free world could not access - they fell behind.
I have a theory explaining why. I think it's because science is an interactive process. NSA cryptographers could read papers, but they couldn't talk openly with the authors of those papers, because of secrecy demands - even asking a question might indicate what they were working on. You can easily imagine them spending months on something they could have avoided by going to the original authors and getting told "Oh, we tried that for a long time, it doesn't work".
Whether that theory is right or not, cryptography is a concrete example of a domain where public research with fewer resources beat private research with a lot more resources.
idiotsecant · 2026-06-27 13:11:39 UTC
Everyone in this thread is getting distracted by nationalism, but you hit the nail on the head. In this case for whatever reason the Chinese AI industry is collaborative and the American AI industry is not. This will result in the Chinese companies making progress faster. Full stop. This isn't a judgement on the merits of either system, only an observation of likely results.
tiahura · 2026-06-27 15:08:57 UTC
Hasn't that been the mantra of open source for 40 years? Armies of companies, trillions of valuation, or even just Wayland, suggest that isn't always the case.
eikenberry · 2026-06-27 18:48:30 UTC
So free software can only be considered a successful strategy if every single project succeeds?
idiotsecant · 2026-06-28 02:39:31 UTC
And yet, Linux runs approximately every ounce of computing substrate on earth
overfeed · 2026-06-28 06:47:18 UTC
The Linux Foundation was bankrolled by the US government (via grants and code donations) to undermine the EU Operating System industry. Symbian was going to be amazing, until Microsoft - an American company with government links - nuked it /s
idiotsecant · 2026-06-28 08:50:35 UTC
this is the part where you make a point cogent to the point to which you responded.
tiahura · 2026-06-28 14:49:09 UTC
The point that I was responding to was that open sores leads to faster development. It's 2026, and it has been "Next Year will be the year of Linux on the Desktop" since about 2000.
One would have to conclude that there is little correlation b/w openness and progress speed. Sometimes open is faster, sometimes it isn't.
parineum · 2026-06-28 02:15:44 UTC
> This will result in the Chinese companies making progress faster. Full stop.
Is this happening? These open models have been a generation or two behind the closed models for quite a while now. They've been keeping pace but clearly behind.
idiotsecant · 2026-06-28 14:00:38 UTC
They've been making enormous developments on a tiny fraction of the capital. Right now they've got no reason to devote half the electrical grid to brute forcing models when the Americans will waste their power doing that work and China can distill it for free.
tiahura · 2026-06-28 14:50:10 UTC
What happens when they can't just distill from closed models?
NamlchakKhandro · 2026-06-27 14:13:51 UTC
Reminds me of Dot Net in the early 2000-2012... No one collaborated
_0ffh · 2026-06-27 10:37:25 UTC
I'm afraid I'm even balking at the word "pioneering" in context with US frontier labs. They are probably doing a few new things, right, but they are not blazing any trails for others to follow along, the Chinese are.
d0gsg0w00f · 2026-06-28 04:19:17 UTC
Or if the US labs are innovating, they're not talking about specifics.
_0ffh · 2026-06-28 07:43:56 UTC
Oh, I assume they're innovating - it's what I meant with "doing new things".
But the word pioneer comes from French pionnier, literally “foot soldier”, a soldier who goes ahead to prepare the way.
If you don't publish you may be advancing, but you're not preparing anyone's way.
epolanski · 2026-06-27 10:55:51 UTC
Chinese papers and techniques have been very influential and copied by US labs.
Multi-head Latent Attention (MLA), Multi-Token prediction, MoE architecture are some of the most famous examples.
HarHarVeryFunny · 2026-06-27 12:48:19 UTC
MoE is from Google (Noam Shazeer)
MTP is from Meta
Another DeepSeek advance that the west are copying is DeepSeek Sparse Attention (DSA)
xgk · 2026-06-27 18:59:04 UTC
Mixture-of-Expert (MoE) was introduced in the 1990s [1, 2], see also
[3, 4]. The idea was that MoE scales up model capacity and only
introduces small computation overhead. MoEs did not become viable for high-performance
applications until sparse routing was integrated with modern deep
networks, made possible by large-scale distributed computation. The
breakthrough came with the development of sparsely gated networks [5],
which showed that it is possible to maintain model accuracy while
activating only a small fraction of a large parameter network during both
training and inference.
[1] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, G. E. Hinton, Adaptive mixtures of local experts. (1991)
[2] M. I. Jordan, R. A. Jacobs, Hierarchical mixtures of experts and the EM algorithm. (1993)
[3] L. Xu, M. Jordan, G. E. Hinton, An alternative model for mixtures of experts. (1994)
[4] S. Waterhouse, D. MacKay, A. Robinson, Bayesian methods for mixtures of experts. (1995)
[5] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, J. Dean, Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. (2017)
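To make "activating only a small fraction of a large parameter network" concrete, here is a toy top-k routing sketch in plain Python. It is a simplification and an assumption-laden one: it omits the gating noise and load-balancing terms used in the sparsely-gated work cited as [5], the experts are arbitrary functions rather than feed-forward layers, and the name `top_k_moe` and all parameters are made up for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Expert = Callable[[Sequence[float]], Vector]


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_moe(x: Sequence[float], gate_weights: Sequence[Sequence[float]],
              experts: Sequence[Expert], k: int = 2) -> Tuple[Vector, List[int]]:
    """Route x to the k highest-scoring experts and mix their outputs."""
    # One gate logit per expert: a dot product between that expert's gate row and x.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    # Keep only the k best experts; the others are never evaluated.
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    mix = softmax([logits[i] for i in chosen])
    out = [0.0] * len(x)
    for weight, idx in zip(mix, chosen):
        y = experts[idx](x)  # the only place expert compute happens
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out, chosen
```

With, say, 4 experts and `k=2`, each input pays for two expert evaluations while the layer holds four experts' worth of parameters, which is the capacity-versus-compute trade described above.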
HarHarVeryFunny · 2026-06-27 19:40:35 UTC
Yes - I meant as applied to LLMs/Transformers.
try-working · 2026-06-27 11:42:42 UTC
Yes, challenger labs publish out of necessity. It is a marketing strategy. People assume open source means giving something up, but the reality is that Z.ai has a revenue of some $100M and it would be about $0M if they never open sourced their models.
skeledrew · 2026-06-27 12:03:15 UTC
> Publishing by necessity
It's more a cultural thing. Sharing progress is just in their blood.
idiotsecant · 2026-06-27 13:15:43 UTC
This is overly simplistic to the point of glazing. Plenty of Chinese companies maintain industrial secrets to gain an advantage.
skeledrew · 2026-06-28 09:08:52 UTC
Yes there are Chinese companies which maintain industrial secrets. Doesn't change the general cultural tendency that they prefer to share.
tomalaci · 2026-06-27 10:28:11 UTC
Probably because American AI companies are on the hook for quite a lot of investment money. I think they are trying to find the magical moat to justify their valuation.
Revealing optimizations similar to these would pretty much reduce their competitive position.
lwansbrough · 2026-06-27 10:35:43 UTC
Chinese labs are also still behind, so they’re incentivized to collaborate and have no reason to do it in private.
I suspect their tune will change if they ever take the lead..
colordrops · 2026-06-27 10:37:38 UTC
So the marketplace is working.
abc123abc123 · 2026-06-27 10:46:33 UTC
This is the way! Open source models will benefit, and once open source models reach the state of "good enough" the hyped up US AI companies will fear, since the availability of free, good enough, AI models will set the ceiling for how much they can charge. Then the bubble will pop.
VorpalWay · 2026-06-27 12:26:24 UTC
You mean open weights, I guess? There are as far as I know very few open source models, the training data is seldom released. Sadly.
tw1984 · 2026-06-27 10:39:17 UTC
> Chinese labs are also still behind, so they’re incentivized to collaborate and have no reason to do it in private.
US labs at Google, Meta and SpaceX are not leading; none of them managed to build something on par with GLM 5.2.
Care to explain to me why they still don't collaborate and still choose to do it in private?
lwansbrough · 2026-06-27 10:42:00 UTC
No idea I don’t work there.
budsniffer952 · 2026-06-27 10:43:42 UTC
Wait, are you claiming that these companies haven't contributed to the ecosystem via research and open source?
vidarh · 2026-06-27 10:43:54 UTC
I'm not sure I'd put Google in that list, but either way: Because they think they have enough capital that they can catch up and don't need the reputational boost of this.
CuriouslyC · 2026-06-27 10:50:35 UTC
As good as Gemini's visual intelligence is, it's a terrible agent.
7speter · 2026-06-27 11:09:28 UTC
Google at least still releases open source models to the public.
re-thc · 2026-06-27 12:12:45 UTC
Thank Apple?
Those are mostly for embedded devices and the current "sponsor" is Apple.
VorpalWay · 2026-06-27 12:20:45 UTC
Aren't they only open weights, not true open source?
HarHarVeryFunny · 2026-06-27 14:11:46 UTC
The concept of open source doesn't really apply to AI models since their behavior is mostly controlled by the data they were trained on and the complex ways they are trained. Having the source code of the model by itself wouldn't help you.
From a practical POV having all the training data, training infrastructure, and training know-how wouldn't help you either unless you could afford to spend the millions of dollars (hundreds of millions for a SOTA model) in compute to train it each time they released a new training set, in which case you're only talking about the big commercial companies. "open source for the people" just does not apply.
VorpalWay · 2026-06-28 10:45:46 UTC
If (and that is a big if) the concept of open source doesn't apply, then the term shouldn't be coopted to mean something else though.
But even if I can't build it from source locally, being able to see what went into the model is an important part of what open source is about.
HarHarVeryFunny · 2026-06-28 12:53:46 UTC
> If (and that is a big if) the concept of open source doesn't apply, then the term shouldn't be coopted to mean something else though.
Yes, but for whatever reason this usage seems to have stuck. Open weights is definitely a better name. I assume the reason "open source" has stuck is because you can download and use it for free, but "open source" was always intended to be about "free as in speech", not "free as in beer". That said, I remember when the term "open source" was invented, and it was always a bit different, more commercially aligned, than the goals of the FSF.
> But even if I can't build it from source locally, being able to see what went into the model is an important part of what open source is about.
True. Unfortunately LLMs have become such a big money and closed enterprise (the opposite of OpenAI and Anthropic's altruistic founding principles) that it's hard to see these commercial models releasing their training data, especially since this data is the closest thing they have to a moat other than the cost of training.
The most valuable training data right now seems to be "reasoning data", and the need for this at least may disappear as AI moves beyond pre-trained language models to smarter systems capable of learning for themselves, and that can actually reason rather than needing to parrot reasoning data.
nullc · 2026-06-29 09:21:29 UTC
Publishing RL/SFT/self-distillation harnesses would be very impactful even without the data.
Particularly when it comes to tool use w/ self-distillation it can be done without any data... have a tool the model doesn't know? a teacher model RTFMs and the source code, and helps the student learn to get it right.
wqaatwt · 2026-06-27 12:48:10 UTC
Gemini 3.1 is still up there, though? If Google started to compete on price they could be very successful.
disgruntledphd2 · 2026-06-28 09:55:07 UTC
It'll be their inability to build coherent products that does them in, not their models.
oefrha · 2026-06-27 10:43:23 UTC
Which is a good thing. Self-serving motives are more reliable than altruistic ones.
nubg · 2026-06-27 10:52:02 UTC
Very interesting take
broodbucket · 2026-06-27 10:58:00 UTC
Look at how far OpenAI has drifted from their original mission. Everything comes back to greed, so it's ideal for the world if selfish motives happen to coincide with what's good for the world, like advancements in open models
https://github.com/esengine/deepseek-reasonix