p.enthalabs

DSpark: Speculative decoding accelerates LLM inference [pdf]

github.com · Read Story HN original

Comments

Nice.

Guessing the timing isn't accidental. Demonstrated openness vs harsh regulation

Nobody forced anthropic to go on a media blitz loudly proclaiming the dangers their new AI model. Serves them right honestly.
China = Open. US = Harsh Regulation

Strange timeline, though this only works because it’s aligned with Xi’s goals.

Yeah can definitely see a world where china pivots and we're stuck with closed/closed

Mistral...don't fumble this

What are some things that China has pivoted on in history?
Presumably this has been in production for a while, and is one of the reasons they were able to dramatically lower prices a month ago?
Lookahead Sparse Attention should be playing a big role as well, as it dramatically slashes memory consumption.
Yes. Section 5 talks about real-world deployment: 5.1: "The DSpark draft models are co-deployed with the preview versions of DeepSeek-V4-Flash and DeepSeek-V4-Pro"; 5.4: "MTP-1 represents the former production setup, having been superseded by DSpark two weeks following the DeepSeek-V4-preview release."
good catch, they reduced the prices 75% seems like exactly in line with the speed/inference optimizations gains?
I see a world soon where there’s an extremely wide variety of small models for speculative decoding, unique to use cases, companies, and even individuals.
Hopefully that is the case and hardware does not get impossible to get.
yes, heavily constrained by sophisticated guardrails.

this is definitely where things are going. the enormous "eat the world" models have extreme diminishing returns by comparison.

You clearly didn't read the recent speculative decoding papers because it's been possible to use any model to speculate for any other model for awhile. They solved the tokenization problems that prevented this in the past.
do they use their OCR, or someone else?
I’ve been using DeepSeek v4 pro for a month now in Kilo Code and its great. Fast, reliable, large context window and cheap as… Did 1,5B tokens this month and cost me 40usd (majority cached, but still).
Is there a way to see how many tokes one does with claude code (pro)?
the casino has no clocks, as one HN user put it some time ago.

I second ccusage, it's nice

It's in the JSONs in ~/.claude, but last 30 days only I think. You can have the model analyze history. So for correct history you'd need to run history analysis on a cron job or something. Kinda hacky.
The 30 day limit can be overridden by adding "cleanupPeriodDays": 9999 to .claude/settings.json
https://github.com/kenn-io/agentsview

> Local-first session search, analytics, insights, and token use statistics for coding agents, supporting Claude Code, Codex, and more than 20 other agents.

solid piece of software

I've been using omp with deepseek as my task and quicktask agents, and sonnet as everything else.

It's drastically reduced my AI spend. I went from spending $40/day to $10/day.

Which provider? I went through 40 bucks on it on openrouter. It was not a lot of back and forth, context ended at around 300k, 15kloc output. I was using opencode, unsure if I can make the total token count visible.
OpenRouter sometimes chooses a very expensive provider. Try the floor slug or choose directly the provider. I moved to just putting 5 dollars directly on deepseek instead of going through OR.
Have you compared Kilo to Pi or OpenCode? Those are the two I'm most familiar with but always looking for alternatives.
I've been preferring Mimo recently. Same price as deekseek, more reliable tool calling (subjectively), and has some nice qualities in terms of prose, etc.

I've heard others say that Deepseek tends to be smarter on specific problems but that Mimo tends to more well-rounded.

This is just one of many papers DeepSeek have released to be able to serve models at extremely cheap prices, unlike the others taking on >$100B+ of debt in building data centers for the same thing.

> As with V4-Flash, we treat this point as an indication that DSpark sustains useful throughput under an interactivity target that the baseline cannot efficiently support. At matched system capacities, DSpark delivers 57% to 78% faster per-user generation.

Reminds me of the flawed solution in scaling servers in 2017 that use memory-intensive technologies by adding even more servers to solve the problem. (It just increases costs.)

Rather than doing that, think about which critical parts of your app can be written in a more performant technology.

Fast forward to 2026, now you can see who is just throwing more money at the problem to create even more problems where as DeepSeek is giving us optimized solutions.

I know exactly who I would pay attention to, and it is absolutely not Anthropic.

For so long American companies have operated under the assumption that servers are cheaper than developers, and that was used to justify all sorts of inefficient practices.

The last year has shown that’s not true anymore (even for web servers).

...... are you really suggesting OpenAI and Anthropic don't have access to these techniques?
if they didn't, they do now. as deepseek published the howto
Must be wonderful to be on the board of OpenAi et al & their PE investors whilst China keeps blowing up these mines under their feet lmao. Luckily Korean pension funds will buy all the trash as usual but goddamn you gotta start moving quick or you are gonna need some serious AGI to show you how to offload those bonds
"We will build the machine-god and pray for it to pay for itself."
Every day, the rate of “could post a picture of 40k tech priests and have it taken unironically” goes up, and it’s starting to get concerning.
Don’t worry they will sell all the hardware and data they acquired with their grift
Why do you think they have started accusing Chinese labs of stealing and distillation?

A&O no longer have the most to justify their high valuation. The only thing they can do now is to get the government forbid the Chinese models.

DeepSeek continues to not only push the boundaries but also publish these incredible papers explaining how they achieved their gains - something the American labs no longer do unfortunately. Chinese labs are doing the most interesting work in AI right now.
Publishing by necessity I wonder? American labs on the cutting edge pioneering the way forward, so Deepseek open sourcing what they’ve got is to help even the playing field.

Hopefully the experts here can offer insight. The above is just my hunch and I’m not a specialist in this field.

Wouldn’t that just help the American labs anyway though? Or do they assume they’ve actually already figured this stuff out and kept it secret?
From what I gather, the Chinese are behind, but a lot of their research amounts to scrappy, clever discoveries in how to use more novel technologies (for Qwen and Deepseek, its mixture of expert models, that can do inference using a portion of the model at a time). The chinese also distill information from American models, so there’s that.

The American companies, from my impression don’t involve themselves with such lowly “hacks” because they have so much money to just push forward with doing everything on big heavy models that run on the most cutting edge nvidia chips that they can, the moment, kinda sorta get on demand (I say that in some degree of jest).

The American companies would love to develop these 'hacks' because it would make them more money, something they are in existential need of right now.

They don't develop them because they don't collaborate publicly anymore.

Where would the whole industry be if Google never allowed publishing the transformers paper?

It's not a coincidence that the American AI industry grew fastest in capability when it was the most open.

Just a crazy catch 22, it seems
Why would they collaborate? Why not defect and just keep theirs private and implement the open ones?
this is not an effective long term strategy in a collaborative environment that is advancing for the same reason that having a private secret fork of the linux kernel with a few proprietary improvements is not an effective strategy.

integrating your own work with the latest public advances takes resources. For one or two small changes this is manageable, but the further you diverge from the public, the cost of maintenance rises exponentially if you want to continue to integrate public advances. when you publish your meaningful advance, you offload the maintenance burden onto everyone else (and they only have to pay a linear cost rather than an exponential one) as it's integrated by default in new work.

In most cases, the (exponential) maintenance cost of integrating public advances with secret ones exceeds the value of the public advances, so most that undertake this strategy of advancing the open frontier in secret don't attempt to integrate continually, but instead try to make a breakaway sprint in isolation to grab a few sticky customers before the unstoppable wave of the public frontier catches up.

This is a pattern commonly seen in university research departments when researchers switch into product development mode, most of these projects are a sprint to advance away from the public frontier once a good idea is found and they do good work and find a few customers for a little while. But if you check back in a few years you won't find an advanced research department but a zombie IP company that brings in a steady income via IP enforcement and a small number of customers for whom switching is too expensive.

> They don't develop them because they don't collaborate publicly anymore.

How do you know they aren't doing this stuff? Something has to account for them leading the industry.

It used to be the case that NSA hired the majority of all math graduates in the US, and were assumed to be years ahead in cryptography. Yet in the 90s, it became clear that they no longer were that - among other things, the cipher of the notorious Clipper chip was broken, and we can rule out that it was made weak on purpose because the whole point of Clipper was that they had a backdoor.

So, despite hiring the cream of the crop of math graduates, who could read the papers of free academia, but whose own result the free world could not access - they fell behind.

I have a theory explaining why. I think it's because science is an interactive process. NSA cryptographers could read papers, but they couldn't talk openly with the authors of those papers, because of secrecy demands - even asking question might indicate what they were working on. You can easily imagine them spending months on something they could have avoided by going to the original authors and getting told "Oh, we tried that for a long time, it doesn't work".

Whether that theory is right or not, cryptography is a concrete example of a domain where public research with fewer resources beat private research with a lot more resources.

Everyone in this thread is getting distracted by nationalism, but you hit the nail on the head. In this case for whatever reason the Chinese AI industry is collaborative and the American AI industry is not. This will result in the Chinese companies making progress faster. Full stop. This isn't a judgement on the merits of either system, only an observation of likely results.
Hasn't that been the mantra of open source for 40 years. Armies of companies, trillions of valuation, or even just Wayland, suggest that isn't always the case.
So free software can only be considered a successful strategy if every single project succeeds?
And yet, Linux runs approximately every ounce of computing substrate on earth
The Linux Foundation was bankrolled by the US government (via grants and code donations) to undermine the EU Operating System industry. Symbian was going to be amazing, until Microsoft - an American company with government links - nuked it /s
this is the part where you make a point cogent to the point to which you responded.
The point that I was responding to was that open sores leads to faster development. It's 2026 and "Next Year will be the year of Linux on the Desktop" since about 2000.

One would have to conclude that there is little correlation b/w openness and progress speed. Sometimes open is faster, sometimes it isn't.

> This will result in the Chinese companies making progress faster. Full stop.

Is this happening? These open models have been a generation or two behind the closed models for quite a while now. They've been keeping pace but clearly behind.

They've been making enormous developments on a tiny fraction of the capital. Right now they've got no reason to devote half the electrical grid to brute forcing models when the Americans will waste their power doing that work and China can distill it for free.
What happens when they can't just distill from closed models?
Reminds me of Dot Net in the early 2000-2012... No one collaborated
I'm afraid I'm even balking at the word "pioneering" in context with US frontier labs. They are probably doing a few new things, right, but they are not blazing any trails for others to follow along, the Chinese are.
Or if the US labs are innovating, they're not talking about specifics.
Oh, I assume they're innovating - it's what I meant with "doing new things".

But the word pioneer comes from French pionnier, literally “foot soldier”, a soldier who goes ahead to prepare the way.

If you don't publish you may be advancing, but you're not preparing anyone's way.

Chinese papers and techniques have been very influential and copied by US labs.

Multi-head Latent Attention (MLA), Multi-Token prediction, MoE architecture are some of the most famous examples.

MoE is from Google (Noam Shazeer)

MTP is from Meta

Another DeepSeek advance that the west are copying is DeepSeek Sparse Attention (DSA)

Mixture-of-Expert (MoE) was introduced in the 1990s [1, 2], see also [3, 4]. The idea was that MoE scales up model capacity and only introduces small computation overhead. MoEs did not become viable for high-performance applications until sparse routing was integrated with modern deep networks, made possible by large-scale distributed computation. The breakthrough came with the development of sparsely gated networks [5], which showed that it is possible to maintain model accuracy while activating only a small fraction of a large parameter network during both training and inference.

[1] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, G. E. Hinton, Adaptive mixtures of local experts. (1991)

[2] M. I. Jordan, R. A. Jacobs, Hierarchical mixtures of experts and the EM algorithm. (1993)

[3] L. Xu, M. Jordan, G. E. Hinton, An alternative model for mixtures of experts. (1994)

[4] S. Waterhouse, D. MacKay, A. Robinson, Bayesian methods for mixtures of experts. (1995)

[5] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, J. Dean, Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. (2017)

Yes - I meant as applied to LLMs/Transformers.
Yes, challenger Labs publish out of necessity. It is a marketing strategy. People assuming open source means giving something up, but the reality is that Z.ai has a revenue of some $100M and it would be about $0M if they never open sourced their models.
> Publishing by necessity

It's more a cultural thing. Sharing progress is just in their blood.

This is overly simplistic to the point of glazing. Plenty of Chinese companies maintain industrial secrets to gain an advantage.
Yes there are Chinese companies which maintain industrial secrets. Doesn't change the general cultural tendency that they prefer to share.
Probably because American AI companies are on the hook for quite a lot of investment money. I think they are trying to find the magical moat to justify their valuation.

Revealing optimizations similar to these would pretty much reduce their competitive position.

Chinese labs are also still behind, so they’re incentivized to collaborate and have no reason to do it in private.

I suspect their tune will change if they ever take the lead..

So the marketplace is working.
This is the way! Open source models will benefit, and once open source models reach the state of "good enough" the hyped up US AI companies will fear, since the availability of free, good enough, AI models will set the ceiling for how much they can charge. Then the bubble will pop.
You mean open weights, I guess? There are as far as I know very few open source models, the training data is seldom released. Sadly.
> Chinese labs are also still behind, so they’re incentivized to collaborate and have no reason to do it in private.

US labs in Google, Meta and SpaceX are not leading, none of them managed to build something on par with GLM 5.2.

Care to explain to me why they still don't collaborate and still choose to do it in private?

No idea I don’t work there.
Wait, are you claiming that these companies haven't contributed to the ecosystem via research and open source?
I'm not sure I'd put Google in that list, but either way: Because they think they have enough capital that they can catch up and don't need the reputational boost of this.
As good as Gemini's visual intelligence is, it's a terrible agent.
Google at least still releases open source models to the public.
Thank Apple?

Those are mostly for embedded devices and the current "sponsor" is Apple.

Aren't they only open weights, not true open source?
The concept of open source doesn't really apply to AI models since their behavior is mostly controlled by the data they were trained on and the complex ways they are trained. Having the source code of the model by itself wouldn't help you.

From a practical POV having all the training data, training infrastructure, and training know-how wouldn't help you either unless you could afford to spend the millions of dollars (hundreds of millions for a SOTA model) in compute to train it each time they released a new training set, in which case you're only talking about the big commercial companies. "open source for the people" just does not apply.

If (and that is a big if) the concept of open source doesn't apply, then the term shouldn't be coopted to mean something else though.

But even if I can't build it from source locally, being able to see what went into the model is an important part of what open source is about.

> If (and that is a big if) the concept of open source doesn't apply, then the term shouldn't be coopted to mean something else though.

Yes, but for whatever reason this usage seems to have stuck. Open weights is definitely a better name. I assume the reason "open source" has stuck is because you can download and use it for free, but "open source" was always intended to be about "free as in speech", not "free as in beer". That said, I remember when the term "open source" was invented, and it was always a bit different, more commercially aligned, than the goals of the FSF.

> But even if I can't build it from source locally, being able to see what went into the model is an important part of what open source is about.

True. Unfortunately LLMs have become such a big money and closed enterprise (the opposite of OpenAI and Anthropic's altruistic founding principles) that it's hard to see these commercial models releasing their training data, especially since this data is the closest thing they have to a moat other than the cost of training.

The most valuable training data right now seems to be "reasoning data", and the need for this at least may disappear as AI moves beyond pre-trained language models to smarter systems capable of learning for themselves, and that can actually reason, not need to parrot reasoning data.

Publishing RL/SFT/self-distillation harnesses would be very impactful even without the data.

Particularly when it comes to tool use w/ self-distillation it can be done without any data... have a tool the model doesn't know? a teacher model RTFMs and the source code, and helps the student learn to get it right.

Gemini 3.1 is still up there, though? If Google started to compete on price they could be very successful.
It'll be their inability to build coherent products that dies them in, not their models.
Which is a good thing. Self-serving motives are more reliable than altruistic ones.
Very interesting take
Look at how far OpenAI has drifted from their original mission. Everything comes back to greed, so it's ideal for the world if selfish motives happen to coincide with what's good for the world, like advancements in open models