This is kind of like saying grass is green to be honest
madduci · 2026-06-29 17:29:43 UTC
Like everybody got 128 GB RAM..
dofm · 2026-06-29 17:31:53 UTC
Doesn't need it at Q4 at least; it'll run in 64GB.
intothemild · 2026-06-29 19:23:00 UTC
Q6 can run with 256k at Q4 on 32gb easy.
200k @ K : Q5_0 V: 4_1 (which is a bit of a sweet spot)
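A rough way to sanity-check KV-cache settings like these is to estimate the cache size from the bits each cache type spends per stored value. The sketch below uses llama.cpp's block layouts (Q5_0, for instance, packs 32 values into 22 bytes, so 5.5 bits each). The layer count, KV-head count and head dimension are placeholder values, not the real configuration of any Qwen model, so substitute the numbers from the model's own metadata before trusting the result.

```python
# Bits per cached value for a few llama.cpp tensor types (blocks of 32 values).
BITS_PER_VALUE = {
    "f16": 16.0,
    "q8_0": 8.5,  # 34 bytes per 32 values
    "q5_0": 5.5,  # 22 bytes per 32 values
    "q4_1": 5.0,  # 20 bytes per 32 values
    "q4_0": 4.5,  # 18 bytes per 32 values
}


def kv_cache_gb(n_ctx, n_layers, n_kv_heads, head_dim, type_k="f16", type_v="f16"):
    """Estimate the KV-cache size in GB (1e9 bytes) for a completely filled context."""
    # K and V each store this many values across all caching layers.
    values = n_ctx * n_layers * n_kv_heads * head_dim
    bits = values * (BITS_PER_VALUE[type_k] + BITS_PER_VALUE[type_v])
    return bits / 8 / 1e9


if __name__ == "__main__":
    # HYPOTHETICAL model shape, for illustration only.
    shape = dict(n_layers=16, n_kv_heads=8, head_dim=128)
    for type_k, type_v in [("f16", "f16"), ("q8_0", "q8_0"), ("q5_0", "q4_1")]:
        size = kv_cache_gb(200_000, type_k=type_k, type_v=type_v, **shape)
        print(f"K={type_k:5s} V={type_v:5s} -> {size:.2f} GB")
```

Whatever is left after subtracting this from your RAM or VRAM is the budget for the weights themselves.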
sleepyeldrazi · 2026-06-29 17:39:18 UTC
I've been running it almost since launch on a 3090 (24gb vram), you really don't need that much. Second hand those are really cheap and i get 50-70 t/s (with MTP at 2), full ctx. IQ4_NL (unsloth) on this model seems suspiciously competent, and after the (by now not so recent) updates to q4 KV on llama.cpp, I just keep going back to it after dsv4pro disappointed me for the 100th time because it gave up on a task.
aand16 · 2026-06-29 17:22:28 UTC
I've come from the future to say Qwen 3.7 27B is just around the corner and slaps!
lor_louis · 2026-06-29 17:24:15 UTC
Do not give me hope like that.
mendeza · 2026-06-29 17:29:28 UTC
I am eagerly waiting!
jensC · 2026-06-29 19:20:57 UTC
Me too, I am on a Jetson Orin 64GB (about 50W max). Using NVIDIA graphics cards for AI seems so power hungry that it was not a choice I could make with today's environmental problems.
NamlchakKhandro · 2026-06-30 11:24:48 UTC
Huh?
layer8 · 2026-06-29 17:47:01 UTC
Are RAM prices down?
alfiedotwtf · 2026-06-29 19:30:58 UTC
Qwen 3.7 120B will kill off Anthropic’s IPO
HotGarbage · 2026-06-29 17:22:40 UTC
And AI companies will continue to buy up all the silicon to make this prohibitively expensive to run at home.
dofm · 2026-06-29 17:28:07 UTC
It will run (somewhat slowly) on a five year old M1 Max with 64GB RAM.
Personally I prefer the 35B MoE model, which is fast enough to be interactively useful, and capable, but I would probably use the 27B if I wanted to generate whole applications like that.
I am unconvinced that most "local" AI applications need anything much more powerful than the Gemma 4 12B model. Local agentic coding is a small niche, but there are plenty of ways a local model can help with development tasks.
I would really like to see a 12B or 16B Qwen 3.6.
I am currently playing with Ornith 1.0 in the MoE configuration, which is based on the 35B variant of Qwen 3.5; I am not sure if it is better than the 3.6 version.
Benchmarks say it is; my own silly tests either suggest otherwise or suggest that I have to talk to it a bit differently.
sleepyeldrazi · 2026-06-29 17:45:27 UTC
I need to ask, since I have desperately wanted to make Gemma 4 12B work. I'm not sure if it's the quant (I usually up it to q8, which is a lot higher than the iq4_nl that I use for 3.6 27B) or the model itself, but it just starts confusing itself really quickly when I give it coding tasks, and quickly starts failing tool calls.
I really want to have a model that i can run locally on my 24gb m4 pro mbp for when i don't have internet to connect to my 3090 running the qwen, and i love how gemma 4 models 'feel', but i can't make them be competent. I am in the middle of finetuning both qwen3.5 9B and gemma 4 12B just to try and make those bridge closer to 27B for coding/agentic tasks (and am trying to ternarize and DQT 27B so that it fits in ~9gb pre-KV).
How do you run the gemma? What do you use it for (and in what harness), maybe llama.cpp and pi-mono just aren't for this model and that's what i'm doing wrong.
dofm · 2026-06-29 18:23:25 UTC
It sounds to me like you're further along on this than I am, if you are fine tuning?
I am still mostly tinkering/learning rather than spilling out code, and I feel quite slow on it. So it doesn't matter too much to me if it is really slow. More the journey than the destination if that makes sense. I'm stubborn.
I have tried the Gemma 4 12B model (Unsloth's QAT version) with search/browse tools in LM Studio and Unsloth Studio, when I am trying to understand a new thing.
Basically I get it to write introductory starter documentation for me to absorb, because my big personal problem, these days, is focussing enough to start a project and then digging in; I need the help.
I have found its limits on obscure packages (that it sometimes makes up) but before that it's a bit like stumbling on a blog post that happens to be really right for your particular need. Good enough to work through.
It's stuff I could ask Perplexity to do, or ChatGPT, to be fair, I just like LM Studio for this and have the inquisitiveness to want to run it locally.
In your case: I don't believe it's the quant. I'm sure it's the model — it has good coding knowledge but it's clearly not specialised. It might be good enough at writing Python/PHP/JavaScript at a novice level. It is also quite good on WordPress tooling and functions.
But I wouldn't bother with it for agentic coding if you've got experience elsewhere. Might be interesting to see what you can do with the 9B Ornith model?
Qwen 3.6 MoE in its Unsloth version is another matter. Impressive and I am trying to find ways to support my old brain doing what I've done before.
rusk · 2026-06-29 17:25:07 UTC
Spent a week trying to get sensible results out of llama 3.3. At one point it even simulated doing the work, log output and everything, and when I challenged it about the missing artefacts it actually started questioning my intelligence. Seems appropriate for a Zuck enterprise.
Qwen on the other hand got straight to work with astonishing competency on the same system.
From what I read llama3 needs beefier compute to reliably invoke tools, which I presume relates to it focussing more on simulating AGI rather than being a useful tool.
am17an · 2026-06-29 17:26:30 UTC
llama 3? Are you from 2023?
culi · 2026-06-29 17:28:59 UTC
You might find this helpful. llama is not anywhere near the Pareto frontier (performance vs cost)
Llama3.1 instruct seems to be doing okay on that page, mostly because it's dirt cheap.
rhgraysonii · 2026-06-29 17:28:07 UTC
I have been having pretty good success with Qwen 3.5 9B for "nontrivial but not challenging work all things considered" -- it runs great on my 24gb unified memory m4 pro MacBook Pro. What do the baseline specs look like Mac-wise for getting this model to run? Am I looking at a 96gb? 128? 256?
dofm · 2026-06-29 17:31:12 UTC
You might be interested in Ornith 1.0 9B, which is a new intriguing post-training of Qwen 3.5 9B.
Qwen 3.6 27B will run in full offload with a 4-bit quantisation in 64GB on an M1 Max. It is quite slow.
I don't know about 48GB but 64GB should be enough.
rhgraysonii · 2026-06-29 17:33:02 UTC
Thanks! I was thinking of doing the 128gb to have some future proofing. I figure at this point, it's akin to a mechanic keeping great tools around, when it comes to having this sort of homelab and exposing it for your own uses. And great practice for building the next era of user facing computing that will be around as this proliferates.
dofm · 2026-06-29 17:40:28 UTC
I would not buy a 64GB model again, probably, if this were to remain particularly important to me. But I gather memory bandwidth is pretty important here.
So for example I'd favour a used M1 Max over a used M2 Pro, at least based on my naïve understanding. Not quite sure where the balance changes.
There appear to be some hardware improvements with the M3 and up regarding the Apple Neural Engine which I'd hope would show up in MLX performance; I remember seeing some optimisations in image generation models that are only possible on later hardware.
The GPU cores are progressively better I believe, but the memory bandwidth is lower. Though perhaps the M4 can get closer to actually saturating said bandwidth.
(And I must reiterate that my understanding of this stuff is pretty naïve.)
freehorse · 2026-06-29 18:58:53 UTC
A used M1 Max is still a good choice because its memory bandwidth was only surpassed by the M4 generation and later (except by the Ultra variants, which are more expensive). Its prefill speed is not great though, which is an issue for running larger contexts, and that only substantially improved with the M5. Moreover, chips up to the M3 only have Thunderbolt 4, not 5, which means they lack the RDMA support that would make stacking machines more effective. So unless you pay more for an M4+ Max, or any Ultra, the M1 Max is still pretty decent compared to the M2 and M3 Max, and definitely better than the Pro variants, if you can find one at a decent price and want to experiment without caring much about time to first token and large contexts.
Note the drop in performance for the base (binned) m3 max version. You are better off with full m1 max than the binned m3 max, even price aside.
The issue I have with my M1 Max is that with 64GB you cannot run really decent MoE models, i.e. the ones you can run, like Qwen 35B-A3B, have only 3B active parameters and are much less capable than Qwen 27B in my testing. So I end up running the 27B one, but it runs relatively slowly (though still usable at 10-20 tok/s) and I would have been better off with a used NVIDIA GPU setup for dense models. I assume 35B-A3B has its use cases, e.g. as subagents, just that I cannot find them. With more RAM I could probably run bigger MoE models which could be more comparable, though prefill would still be an issue (and probably a bigger one). The only hopeful thing is that performance hacks are appearing (speculative decoding and prefill improvements) that start improving inference speed once they get implemented, so I am mildly hopeful.
(I must also stress that my understanding is not very deep either)
dofm · 2026-06-29 19:25:17 UTC
Good reply, those two links are v. useful and I had missed them.
It got rather tangled up when I tried it with one of my coding tests, which is a simple wordpress plugin, but I frustrate the model by asking it to write code for older PHP, break WP coding conventions and use a rather bespoke method for arranging code in objects. So it is sort of a hybrid of a green field and brown field task; a bit muddy.
It did not do as well as Qwen 3.6 35B, but the way it worked through its thoughts was interesting.
TBH I struggled to understand what DeepReinforce are doing that is materially different; the explanation of their training technique goes over my head at this point.
jensC · 2026-06-29 19:23:00 UTC
It is also available with Ollama now and I am equally impressed too.
MatthiasPortzel · 2026-06-29 18:23:29 UTC
I posted this elsewhere, but Unsloth says the 27B model should run in 18GB. That leaves little RAM for other tasks, but it depends on your tolerance for slowness I suppose. I haven’t tried it in 24GB so report back if you do.
How does llama.cpp use the GPU efficiently as opposed to MLX?
Is there any way to use MLX and GPU at the same time? Or does memory become a big problem?
TBH, I never understood Apple hyping these neural cores because I didn't think anyone actually uses them except maybe certain photo/video editing software.
If I can generate voice at the same time as video, that would be useful.
dannyw · 2026-06-29 17:46:44 UTC
Llama.cpp uses the GPU very effectively because LLM token generation is computationally simple and is basically limited by your GPU memory bandwidth. That's essentially the baseline performance ceiling, with model-specific optimisations like MTP potentially raising it.
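That bandwidth ceiling can be put into rough numbers: for a dense model every generated token has to read all of the weights once, so decode speed tops out near bandwidth divided by weight size. A minimal sketch, assuming published spec-sheet bandwidths (400 GB/s for an M1 Max, 936 GB/s for an RTX 3090, neither figure taken from this thread) and the 17 GB Q4_K_M size quoted elsewhere in the discussion. It ignores KV-cache reads, speculative decoding and MoE sparsity, so treat it as an upper bound rather than a prediction.

```python
def decode_ceiling_tok_per_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on tokens/s for a dense model: one full pass over the weights per token."""
    if bandwidth_gb_s <= 0 or weights_gb <= 0:
        raise ValueError("bandwidth and weight size must be positive")
    return bandwidth_gb_s / weights_gb


if __name__ == "__main__":
    weights_gb = 17.0  # Q4_K_M size of the 27B model as quoted in the thread
    # ASSUMED spec-sheet memory bandwidths in GB/s.
    for name, bandwidth in [("M1 Max", 400.0), ("RTX 3090", 936.0)]:
        ceiling = decode_ceiling_tok_per_s(bandwidth, weights_gb)
        print(f"{name}: at most ~{ceiling:.1f} tok/s")
```

Real throughput lands below the ceiling; techniques like MTP can exceed it because they accept more than one token per pass over the weights.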
The neural cores aren't suitable for LLMs/transformers and aren't used in LLM inference. The M5 and later chips come with neural accelerators, aka Tensor Cores, which speed up the 'prefill' (i.e. processing your context window) part, but don't do anything for token generation.
The MLX vs GGUF debate is mostly irrelevant. The GGUF pathways are optimised for apple silicon to the extent of practically identical performance to MLX. MLX is just one way of using Apple GPUs, it comes with many optimisations in the box, but they're not hard and they're no longer MLX-exclusive.
kpw94 · 2026-06-29 17:32:43 UTC
> What it does:
>
> --jinja for tool calling support
Pretty sure this flag hasn't done anything for a while. It's enabled by default since ~November of last year
ascii0eks84 · 2026-06-29 17:33:27 UTC
Very capable lora adapters are surfacing but it seems they are very niche.
DenisM · 2026-06-29 17:47:44 UTC
Can you share more? It’s the first I hear of lora outside research papers. Practical applications would be great to see.
Lora if effective could be a great reason to run local models.
0x0000000 · 2026-06-29 17:33:50 UTC
> ... on my Macbook Max M5 128 GB
Local development for who? How many of y'all are rocking 128GB of memory? Am I reading Apple's site correctly that it's a $10,000 laptop?
wpm · 2026-06-29 17:36:19 UTC
It wasn't $10k a month ago
kllrnohj · 2026-06-29 17:44:04 UTC
You don't need nearly that much RAM to run Qwen 3.6 27B, though. qwen3.6:27b-q4_K_M is only 17GB, for example.
DanHulton · 2026-06-29 18:13:42 UTC
This is what I run on an M5 MacBook Air 32GB. Works great.
I’m not having it build whole features from scratch, though. I give it pretty explicit instructions closer to the class or function level, and it still saves me an immense amount of time, while I’m very connected to the code that’s written.
Definitely the sweet spot for me.
spike021 · 2026-06-29 17:44:21 UTC
Certainly won't work on my M4 Pro with 24GB lol
whynotmaybe · 2026-06-29 17:54:33 UTC
I feel you!
Sent from my 8gb M2 Mac mini.
kevinrineer · 2026-06-30 02:35:38 UTC
I'm still rocking my nvidia 2060, which I had purchased for $400 at the time.
I struggle to imagine purchasing multiple 1k+ cards on my own dime.
MatthiasPortzel · 2026-06-29 18:18:55 UTC
I’m using it on a 48GB machine and it causes some lag, so it might be worse on 24, but it should run.
Unsloth recommends 18GB of RAM for Qwen3.6-27B (for their version of the model).
I'm on 128GB ram strix halo, bought framework desktop for a few thousand CAD back when everyone was calling framework desktop overpriced
rhdunn · 2026-06-29 17:50:16 UTC
A 27B model can fit easily on a 32GB VRAM card (e.g. 5090) or a 32GB computer in RAM at FP8/Q8 (unsloth have 28.6GB Q8 files).
For 24GB VRAM cards (e.g. 4090) you can use Q6_K (22.5GB) or Q5_K_M (19.5GB) quants, possibly offloading some of the weights to RAM.
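To compare these quants on equal footing, the file sizes can be turned into effective bits per weight and checked against a memory budget. A rough sketch, assuming exactly 27 billion parameters and GB meaning 1e9 bytes (both simplifications), with a headroom figure for KV cache and runtime buffers that is a guess you should tune for your own context length.

```python
def bits_per_weight(file_gb: float, params_billion: float) -> float:
    """Effective bits per parameter implied by a quantised file size (GB = 1e9 bytes)."""
    return file_gb * 8 / params_billion


def fits(file_gb: float, budget_gb: float, headroom_gb: float = 2.0) -> bool:
    """True if the weights plus an ASSUMED headroom for KV cache and buffers fit the budget."""
    return file_gb + headroom_gb <= budget_gb


if __name__ == "__main__":
    # File sizes as quoted in the thread for the 27B model.
    quants = {"Q8": 28.6, "Q6_K": 22.5, "Q5_K_M": 19.5, "Q4_K_M": 17.0}
    for name, size_gb in quants.items():
        bpw = bits_per_weight(size_gb, 27.0)
        print(f"{name:7s} {size_gb:5.1f} GB  ~{bpw:.2f} bits/weight  "
              f"24GB card: {fits(size_gb, 24.0)}  32GB card: {fits(size_gb, 32.0)}")
```

When `fits` returns False for a card, that is the case where part of the weights would have to be offloaded to system RAM.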
jboss10 · 2026-06-29 20:05:31 UTC
For the 35B model, offloading to RAM doesn't slow it down much. If you have a nice CPU and a weak GPU, it will be fast enough to use.
mr_mitm · 2026-06-29 18:09:12 UTC
Think commercial. My company invested in a local rig since privacy is important to our customers and sometimes I want to use these models on private data.
Gigachad · 2026-06-29 23:09:39 UTC
Even in that case it would make more sense to put the hardware in a server rack shared with everyone rather than inside macbooks.
At any rate it makes a stolen backpack or spilled drink a lot less damaging.
mr_mitm · 2026-06-30 07:51:49 UTC
Obviously the rig is not a macbook but indeed a server rack. I'm just saying that we're using this model for local development.
scotty79 · 2026-06-29 20:42:31 UTC
Qwen3.6 runs great on a GPU with 24GB VRAM. You could get a used 3090 for it.
bahmboo · 2026-06-29 22:15:18 UTC
I work with a lot of 3D graphics and geo stuff so I can hit the ceiling with my 48 GB mac. It's not all LLM work. I prioritized more storage than RAM with my budget. Being able to run local llms has greatly helped me understand how they work. For day to day dev I pay for Gemini or Claude.
onion2k · 2026-06-29 17:34:30 UTC
None of the examples reflect 'real work', at least not what I'd consider real work. Being able to nail a zero-shot greenfield project is relatively easy even for a small model. There's not much context to build up and it can fall back to similar examples in the training data easily. So long as you're not asking it to invent something wholly new it'll probably manage.
The real test is whether or not it can work with your existing codebases. In my limited experiments Qwen 3.5 (maybe 3.6 is loads better) does OK on a Rust+React app, and less well on a C# monolith. Not to the point of being unusable but definitely poorly enough that I went back to Claude after 20 minutes. If I lost access to a cloud model and had to use Qwen instead I'd be visibly sad.
h4ny · 2026-06-29 17:50:17 UTC
> In my limited experiments Qwen 3.5 (maybe 3.6 is loads better)
1. Maybe you should tell us what those limited experiments are.
2. Maybe you should actually try 3.6 because it's huge difference in most cases. Don't forget to tell us quants and don't forget to tell us scope.
3. Maybe actually show us data compared to frontier models instead of this... vibe comment. Pretty tired of this kind of comment on HN that doesn't require logic or evidence. Just vibes. Like the pelican riding a bicycle crap that everyone has taken for granted but that has no objective way of assessing goodness.
snapcaster · 2026-06-29 19:13:00 UTC
Nobody owes you a scientifically rigorous write up
sosodev · 2026-06-29 18:02:19 UTC
In my experience, even with basic project concepts the small models struggle to spin up greenfield stuff. There's just too many decisions to be made and they're not good at that.
Modifying existing code is way easier if you don't expect it to be smart about it. Don't say "add X feature" and let it explore the codebase and build its own understanding. Point it at the relevant files and say "the goal is to add X feature to this code, follow Y guidelines". Now you've done the hardest part of making the decisions and it just has to follow instructions while coloring within the lines.
fluoridation · 2026-06-29 18:40:54 UTC
>Point it at the relevant files and say "the goal is to add X feature to this code, follow Y guidelines".
Is that not how you would work with any model, local or not? I wouldn't trust it to make the right decisions unattended. I just know the moment I look away it's going to do something utterly braindead.
tenuousemphasis · 2026-06-29 21:20:35 UTC
Claude Opus with xhigh thinking is surprisingly good at figuring out details. Granted I'm only using it for little hobby projects, nothing overly complicated.
verdverm · 2026-06-29 19:20:39 UTC
I had good results doing an open box reimplementation. Gave qwen access to my old projects and it rebuilt it on JAX.
> Being able to nail a zero-shot greenfield project is relatively easy even for a small model
Not really germane to your comment but I hope I don’t sound old when I say I remember a time when spinning up a PoC was a week of work, and a statement like yours was pure science fiction.
cyanydeez · 2026-06-29 18:26:52 UTC
I love the ability to spin up any repo on github by pointing a local model at it with zero cost beyond the heat & electricity.
ai_fry_ur_brain · 2026-06-29 19:43:18 UTC
Yeah, and we still do take a week for people that actually care.
If I start prompting away the core of a new project I lose interest in the entire thing almost straight away. I hate it. The next day I couldn't care less about it. In fact it just makes me lazy, like a fat person who drives everywhere.
I love typing code and thinking for myself. I'm going to continue to do that. I still don't know anyone who's shipped anything truly useful with this garbage tech, let alone with a local 30b param model. So much cope in these comments.
Spending 6k on hardware to run the world's most mediocre model truly does make you an incredibly stupid person, so I'm not really surprised by these comments of people saying these tiny models are helping them so much.
Its like a special needs kid all of sudden got the ability to code, of course they'd be impressed by basically all the code it produces.
j_bum · 2026-06-29 20:06:00 UTC
I mean, have you looked for examples of things that people are using local models to build and ship? Or are you just assuming it doesn’t happen?
I’ve used Qwen 3.6 27B for many things at work, and I’m regularly able to use it for reasonably scoped tasks.
I’m not saying these models are perfect.
But you are complaining about people on the extreme, while at the same time shouting from the opposite extreme.
hollowturtle · 2026-06-29 21:11:11 UTC
In what era did spinning up a PoC require a week of work? Especially on the web. I've been a developer for roughly 20 years and that has never been the case, to the point that I believe the people impressed by LLMs are the same ones who had very low productivity. Today we have game jams as short as 3 days and talented people are able to produce very good PoCs, with some almost complete!
spiralcoaster · 2026-06-29 21:46:49 UTC
So what you're saying is that all PoC's are guaranteed to take less than a week of work.
What are you even saying? Are you aware that there is a massive range in the scope of projects? You must work on some incredibly simple CRUD apps if this is your take.
hollowturtle · 2026-06-30 07:31:10 UTC
These people work mostly on CRUD apps and they're telling you how productive they feel. Btw exploratory ideas even for hard problems already come out of a one-day hackathon or a 3-day game jam.
janalsncm · 2026-06-30 01:30:42 UTC
1) It depends entirely on the concept you are trying to prove and how experienced you are in that domain.
2) Not every team will have someone with 20 years of experience in a particular domain eager to spin up a PoC.
I have been using pi (and previously the codex cli) with Qwen 3.6 27b with 100k context for my development at work, and I have been very blown away by how well it works. It's not perfect, but it's enough to accelerate my normal development flow. I mostly use it for writing Go and C#.
Aurornis · 2026-06-29 19:34:03 UTC
> and it can fall back to similar examples in the training data easily.
This is an underrated consideration when evaluating the small models: The further you deviate from standard example code, the more their weaknesses show.
My experience is that Qwen3.6 produced some amazing results for a small model when I tried it with simple apps that are widely reproduced everywhere. If you want a React TODO app or to set up a little boilerplate app with shadcn and other popular tools, it will produce something that looks not too bad.
Then when I started straying outside of common tasks and into some of my more niche work, it would spin for hours and go in circles before finally producing some groan-inducing output that wasn't usable.
If you're looking for a model to help with simple refactoring or small tasks where you provide very explicit instructions for exactly what you want, but you don't want to do all of the typing yourself, they can do a lot of good work, though. But you're right that once you get into long context sessions involving topics off the beaten path, the weaknesses are very apparent.
The quantizations that are popular for making these models fit on smaller hardware make the problems worse. When you read about it online there is almost a consensus that 4-bit quants are lossless and that you can use q8_0/q8_0 kv cache quantization without any real loss, but in my experience with real projects there's a substantial degradation in long context performance with any of these quants.
CMay · 2026-06-29 20:18:50 UTC
This is my experience too. Qwen optimizes for a lot of scenarios which masks their weaker generalization compared to US frontier models.
Never go below an fp16 kv cache unless you've already tested it in advance with your model on a verified task that you know it can successfully complete. People should also test the difference using the exact same seed value so they can see how the tokens diverge. If you have memory constraints, sometimes you can still use an fp16 kv cache and use storage for an agentic buffer to work your task with mixed abstractions rather than having everything in memory.
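The same-seed comparison described here boils down to finding where two token sequences stop agreeing. A minimal sketch, assuming you have already captured the generated token IDs from two runs (one with an fp16 KV cache, one quantized) using the same prompt, seed and sampler settings; how you capture them depends on your runtime and is not shown, and the sample IDs below are made up.

```python
from itertools import zip_longest

_MISSING = object()  # sentinel so a shorter run counts as a divergence


def first_divergence(baseline, candidate):
    """Return the index of the first differing token, or None if the runs are identical."""
    pairs = zip_longest(baseline, candidate, fillvalue=_MISSING)
    for index, (a, b) in enumerate(pairs):
        if a != b:
            return index
    return None


def agreement_ratio(baseline, candidate):
    """Fraction of the longer run that matched before the first divergence."""
    longest = max(len(baseline), len(candidate))
    if longest == 0:
        return 1.0
    split = first_divergence(baseline, candidate)
    return 1.0 if split is None else split / longest


if __name__ == "__main__":
    fp16_run = [101, 7, 42, 9, 13, 5]  # made-up token IDs
    quant_run = [101, 7, 42, 8, 13]    # made-up token IDs
    print(first_divergence(fp16_run, quant_run), agreement_ratio(fp16_run, quant_run))
```

An early divergence is not proof of a worse answer, since two valid continuations can differ, so it is best paired with the verified-task check the comment recommends.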
For 4-bit weight quants, Gemma 4 31B QAT is where people should be looking instead of Qwen 3.6.
mark_l_watson · 2026-06-29 21:33:29 UTC
There are several general types of tasks that a Gemma 4 12B class model handles well for me, including: 1) design a large project composed of small libraries that can be coded and tested in isolation. 2) clean up old coding projects: add README files, comment code, show an example of using a new API and have it update API use, etc.
All small-scale stuff. For large integrated projects I am finding DeepSeek v4 Pro commercial API to be very inexpensive and helps me produce good results.
internet101010 · 2026-06-30 04:36:32 UTC
Exactly. If the repo has all of the knowledge living inside of it that window fills up fast, even when using something like codegraph.
mikert89 · 2026-06-29 17:35:01 UTC
none of these local models are good for development, complete waste of time. nobody has $100k+ hardware sitting around at home to actually run a good model
jlongr · 2026-06-29 17:36:54 UTC
skill issue
mikert89 · 2026-06-29 20:08:55 UTC
the models suck
anonym29 · 2026-06-29 17:35:57 UTC
Strix Halo user here. While Qwen 3.6 27B exhibits remarkable intelligence density, I will still take unsloth's dynamic IQ2_XXS of Minimax M2.7 over Q8_0 Qwen 3.6 27B any day of the week, and this isn't just because of generation speed either. I wrote my own custom harness, and I get hallucinated tool call parameters and bizarre invocations with Q3.6 27B even at Q8_0, but no issues with the IQ2_XXS of M2.7.
BoredomIsFun · 2026-06-29 18:14:07 UTC
> I get hallucinated tool call parameters and bizarre invocations
tweaking sampler might help
RedCinnabar · 2026-06-29 17:38:31 UTC
Call me back when you can run these models on 16GB of RAM and any recent i5/i7. Until then, there’s no point in using these toy models.
giancarlostoro · 2026-06-29 17:39:21 UTC
You need it to run in about 8 GB so you have extra space for the context window.
Catloafdev · 2026-06-29 17:40:42 UTC
Hello, it's the internet calling, today is that day.
Edit: it's gonna be slow if you're not using any VRAM. But it's possible. Software isn't going to speed that up anytime soon, it's just a hardware bandwidth limit.
guax · 2026-06-29 19:48:57 UTC
It's so funny, these "toy models" would have been the wet dreams of researchers not 5 years ago.
Progress marches without mercy.
kgeist · 2026-06-29 21:20:46 UTC
Yeah people don't realize these "toy models" now completely destroy gpt-4o on most tasks, and no one called gpt-4o a toy model back in the day... It was OpenAI's flagship model from 2024 to 2025.
Gigachad · 2026-06-29 23:12:36 UTC
Tbh in 2024 most were calling these models useless for programming and a scam. It wasn't until this year things really changed. My experience with Qwen 3.6 is it can do things, and it's super impressive it can do things, but it's not any more productive than doing it myself.
jboss10 · 2026-06-29 20:08:30 UTC
They can be run on 32GB with 8GB VRAM. I don't think these will be on 16GB for a while. (35B MoE)
TheCycoONE · 2026-06-29 20:32:49 UTC
I have 32GB of RAM with 16GB VRAM and I haven't had a lot of luck running larger models like this. Are you able to expand on that?
slim · 2026-06-29 21:03:13 UTC
use llama.cpp with cuda
TheCycoONE · 2026-06-29 22:16:12 UTC
The problem may be that it's a 7800XT which handles memory contention by freezing.
bensyverson · 2026-06-29 17:38:52 UTC
The article is based on running Qwen 3.6 on a 128GB MacBook Pro. For reference, a 128GB MBP currently starts at $6699 USD [0]
Some people will be happy to pay that premium for privacy, but at roughly 10X the cost of a MacBook Neo, that money could also buy a lot of credits on OpenRouter or frontier labs.
https://arena.ai/leaderboard/code/webdev/pareto?license=open...
https://arena.ai/leaderboard/text/pareto?license=open-source
A very useful resource for characteristics and comparative performance of all M variants, if anybody is interested, is https://github.com/ggml-org/llama.cpp/discussions/4167?sort=...
Its sister discussion for nvidia gpus is https://github.com/ggml-org/llama.cpp/discussions/15013
Note the drop in performance for the base (binned) m3 max version. You are better off with full m1 max than the binned m3 max, even price aside.
The issue I have with my m1 max is that with 64gb you cannot run really decent MoE models, ie the ones you can run like qwen 35B-A3B have only 3b active parameters and are much less capable than qwen 27b in my testing. So I end up running the 27b one, but it runs relatively slow (though still usable at 10-20 tok/s) and I would have been better off with a used nvidia gpu setup for dense models. I assume 35B-A3B has its use cases, eg as subagents, just that I cannot find them. With a higher amount of ram I could probably run bigger MoE models which could be more comparable, though prefill would still be an issue (and prob a bigger one). The only hopeful thing is that there are performance hacks appearing (speculative decoding and prefill) that seem to improve inference speed once they get implemented, so I am mildly hopeful.
(I must also reiterate that my understanding is not very deep either)
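A rough way to sanity-check the 10-20 tok/s figure: token generation is mostly memory-bandwidth bound, so an upper bound is bandwidth divided by the bytes of weights read per token. The sketch below assumes the M1 Max's quoted 400 GB/s peak bandwidth and an effective 4.5 bits per weight for a 4-bit quant (my assumption, not a measured value), and it ignores KV cache reads and other overhead. Under those assumptions the dense 27B tops out at roughly 26 tok/s, and a model with 3B active parameters has a ceiling about nine times higher.

```python
def decode_tokens_per_second(active_params_billion, bits_per_weight, bandwidth_gb_s):
    """Upper bound on decode speed if every active weight is read once per token."""
    bytes_per_token = active_params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


M1_MAX_BANDWIDTH_GB_S = 400  # quoted peak; sustained throughput is lower
BITS_PER_WEIGHT = 4.5        # assumed effective size of a 4-bit quant incl. scales

dense = decode_tokens_per_second(27, BITS_PER_WEIGHT, M1_MAX_BANDWIDTH_GB_S)
moe = decode_tokens_per_second(3, BITS_PER_WEIGHT, M1_MAX_BANDWIDTH_GB_S)

print(f"27B dense:           <= {dense:.0f} tok/s")
print(f"35B-A3B (3B active): <= {moe:.0f} tok/s")
```

This says nothing about prefill, which is compute bound rather than bandwidth bound.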
It got rather tangled up when I tried it with one of my coding tests, which is a simple wordpress plugin, but I frustrate the model by asking it to write code for older PHP, break WP coding conventions and use a rather bespoke method for arranging code in objects. So it is sort of a hybrid of a green field and brown field task; a bit muddy.
It did not do as well as Qwen 3.6 35B, but the way it worked through its thoughts was interesting.
TBH I struggled to understand what DeepReinforce are doing that is materially different; the explanation of their training technique goes over my head at this point.
https://unsloth.ai/docs/models/qwen3.6
Is there any way to use MLX and GPU at the same time? Or does memory become a big problem?
TBH, I never understood Apple hyping these neural cores because I didn't think anyone actually uses them except maybe certain photo/video editing software.
If I can generate voice at the same time as video, that would be useful.
The neural cores aren't suitable for LLMs/transformers and aren't used in LLM inference. The M5 and later chips come with neural accelerators, aka Tensor Cores, which speed up the 'prefill' (i.e. processing your context window) part, but don't do anything for token generation.
The MLX vs GGUF debate is mostly irrelevant. The GGUF pathways are optimised for apple silicon to the extent of practically identical performance to MLX. MLX is just one way of using Apple GPUs, it comes with many optimisations in the box, but they're not hard and they're no longer MLX-exclusive.
>
> --jinja for tool calling support
Pretty sure this flag hasn't done anything for a while. It has been enabled by default since ~November of last year.
LoRA, if effective, could be a great reason to run local models.
Local development for who? How many of y'all are rocking 128GB of memory? Am I reading Apple's site correctly that it's a $10,000 laptop?
I’m not having it build whole features from scratch, though. I give it pretty explicit instructions closer to the class or function level, and it still saves me an immense amount of time, while I’m very connected to the code that’s written.
Definitely the sweet spot for me.
Sent from my 8gb M2 Mac mini.
I struggle to imagine purchasing multiple 1k+ cards on my own dime.
Unsloth recommends 18GB of RAM for Qwen3.6-27B (for their version of the model).
https://unsloth.ai/docs/models/qwen3.6
For 24GB VRAM cards (e.g. 4090) you can use Q6_K (22.5GB) or Q5_K_M (19.5GB) quants, possibly offloading some of the weights to RAM.
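To put those file sizes in context, here is a small sketch that backs out the effective bits per weight from the quoted sizes and shows how much of a 24GB card is left for the KV cache and compute buffers. It assumes exactly 27e9 parameters and 1 GB = 1e9 bytes; both are simplifications.

```python
PARAMS = 27e9  # assumed exact parameter count for the 27B model
GB = 1e9       # assumed decimal gigabytes


def effective_bits_per_weight(file_size_gb, params=PARAMS):
    """Average bits stored per parameter for a quantised file of this size."""
    return file_size_gb * GB * 8 / params


def headroom_gb(vram_gb, file_size_gb):
    """VRAM left for KV cache and buffers if all weights are on the GPU."""
    return vram_gb - file_size_gb


# File sizes as quoted above.
quants = {"Q6_K": 22.5, "Q5_K_M": 19.5}

for name, size_gb in quants.items():
    bpw = effective_bits_per_weight(size_gb)
    free = headroom_gb(24, size_gb)
    print(f"{name}: ~{bpw:.2f} bits/weight, {free:.1f} GB left on a 24GB card")
```

With only 1.5GB spare for Q6_K, long contexts will likely force some layers into system RAM, which is the offloading mentioned above.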
At any rate it makes a stolen backpack or spilled drink a lot less damaging.
The real test is whether or not it can work with your existing codebases. In my limited experiments Qwen 3.5 (maybe 3.6 is loads better) does OK on a Rust+React app, and less well on a C# monolith. Not to the point of being unusable but definitely poorly enough that I went back to Claude after 20 minutes. If I lost access to a cloud model and had to use Qwen instead I'd be visibly sad.
1. Maybe you should tell us what those limited experiments are.
2. Maybe you should actually try 3.6 because it's a huge difference in most cases. Don't forget to tell us quants and don't forget to tell us scope.
3. Maybe actually show us data compared to frontier models instead of this... vibe comment. Pretty tired of these kinds of comments on HN that don't require logic or evidence. Just vibes. Like the pelican riding a bicycle crap that everyone has taken for granted but has no objective way of assessing goodness.
Modifying existing code is way easier if you don't expect it to be smart about it. Don't say "add X feature" and let it explore the codebase and build its own understanding. Point it at the relevant files and say "the goal is to add X feature to this code, follow Y guidelines". Now you've done the hardest part of making the decisions and it just has to follow instructions while coloring within the lines.
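As a minimal sketch of that workflow, the snippet below assembles a prompt from a goal, a list of guidelines and the contents of the files you picked. The function name, the prompt wording and the "only modify the files below" line are my own assumptions for illustration, not anything a particular tool requires.

```python
def build_scoped_prompt(goal, guidelines, files):
    """Build a prompt that states the goal, the rules, and the exact files in scope.

    `files` maps a path to its source text; the caller chooses which files
    are relevant, so the model does not have to explore the codebase.
    """
    parts = [f"The goal is to {goal}.", "Follow these guidelines:"]
    parts += [f"- {rule}" for rule in guidelines]
    parts.append("Only modify the files below.")  # assumed wording
    for path, source in files.items():
        parts.append(f"\n--- {path} ---\n{source}")
    return "\n".join(parts)


# Hypothetical example inputs.
prompt = build_scoped_prompt(
    goal="add CSV export to the report view",
    guidelines=["keep the existing function signatures", "no new dependencies"],
    files={"report/export.py": "def export_json(report):\n    ...\n"},
)
print(prompt)
```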
Is that not how you would work with any model, local or not? I wouldn't trust it to make the right decisions unattended. I just know the moment I look away it's going to do something utterly braindead.
https://github.com/verdverm/pge-jax
Not really germane to your comment but I hope I don’t sound old when I say I remember a time when spinning up a PoC was a week of work, and a statement like yours was pure science fiction.
If I start prompting away the core of a new project I lose interest in the entire thing almost straight away. I hate it. The next day I couldn't care less about it. In fact it just makes me lazy, like a fat person who drives everywhere.
I love typing code and thinking for myself. I'm going to continue to do that. I still don't know anyone who's shipped anything truly useful with this garbage tech, let alone with a local 30b param model. So much cope in these comments.
Spending 6k on hardware to run the world's most mediocre model truly does make you an incredibly stupid person, so I'm not really surprised by these comments of people saying these tiny models are helping them so much.
It's like a special needs kid all of a sudden got the ability to code, of course they'd be impressed by basically all the code it produces.
I’ve used Qwen 3.6 27B for many things at work, and I’m regularly able to use it for reasonably scoped tasks.
I’m not saying these models are perfect.
But you are complaining about people on the extreme, while at the same time shouting from the opposite extreme.
What are you even saying? Are you aware that there is a massive range in the scope of projects? You must work on some incredibly simple CRUD apps if this is your take.
2) Not every team will have someone with 20 years of experience in a particular domain eager to spin up a PoC.
This is an underrated consideration when evaluating the small models: The further you deviate from standard example code, the more their weaknesses show.
My experience is that Qwen3.6 produced some amazing results for a small model when I tried it with simple apps that are widely reproduced everywhere. If you want a React TODO app or to set up a little boilerplate app with shadcn and other popular tools, it will produce something that looks not too bad.
Then when I started straying outside of common tasks and into some of my more niche work, it would spin for hours and go in circles before finally producing some groan-inducing output that wasn't usable.
If you're looking for a model to help with simple refactoring or small tasks where you provide very explicit instructions for exactly what you want, but you don't want to do all of the typing yourself, they can do a lot of good work, though. But you're right that once you get into long context sessions involving topics off the beaten path, the weaknesses are very apparent.
The quantizations that are popular for making these models fit on smaller hardware make the problems worse. When you read about it online there is almost a consensus that 4-bit quants are lossless and that you can use q8_0/q8_0 kv cache quantization without any real loss, but in my experience with real projects there's a substantial degradation in long context performance with any of these quants.
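For a sense of what is being traded away, here is a rough KV cache size calculator. The bytes-per-element values follow the ggml block layouts (32 values per block plus an fp16 scale, and an extra fp16 minimum for the `_1` types). The model shape in the example is purely hypothetical and is not Qwen 3.6's real configuration; models with hybrid or sliding-window attention will need less than this.

```python
# Storage cost per cached element for common llama.cpp cache types.
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,  # 32 int8 values + fp16 scale
    "q5_0": 22 / 32,  # 32 5-bit values + fp16 scale
    "q4_1": 20 / 32,  # 32 4-bit values + fp16 scale + fp16 minimum
    "q4_0": 18 / 32,  # 32 4-bit values + fp16 scale
}


def kv_cache_gib(n_layers, n_kv_heads, head_dim, n_ctx, k_type="f16", v_type="f16"):
    """Size of the K and V caches in GiB for a plain full-attention transformer."""
    elements = n_layers * n_kv_heads * head_dim * n_ctx  # per K or per V
    total_bytes = elements * (BYTES_PER_ELEMENT[k_type] + BYTES_PER_ELEMENT[v_type])
    return total_bytes / 2**30


# Hypothetical shape: 64 layers, 8 KV heads of dimension 128, 128k context.
shape = dict(n_layers=64, n_kv_heads=8, head_dim=128, n_ctx=131072)

for k_type, v_type in [("f16", "f16"), ("q8_0", "q8_0"), ("q5_0", "q4_1")]:
    size = kv_cache_gib(**shape, k_type=k_type, v_type=v_type)
    print(f"K={k_type:<4} V={v_type:<4} -> {size:.1f} GiB")
```

The memory savings are real, which is why the quantised cache is tempting; whether the quality holds up is the thing to test per model and per task.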
Never go below an fp16 kv cache unless you've already tested it in advance with your model on a verified task that you know it can successfully complete. People should also test the difference using the exact same seed value so they can see how the tokens diverge. If you have memory constraints, sometimes you can still use an fp16 kv cache and use storage for an agentic buffer to work your task with mixed abstractions rather than having everything in memory.
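The same-seed comparison can be as simple as generating twice (once with an fp16 cache, once quantised) and finding where the token ids first differ. A minimal sketch, assuming you already have both outputs as lists of token ids; the example lists are made up.

```python
from itertools import zip_longest


def first_divergence(baseline, candidate):
    """Return the index of the first differing token, or None if identical.

    If one sequence is a strict prefix of the other, the index is the
    length of the shorter sequence.
    """
    for index, (a, b) in enumerate(zip_longest(baseline, candidate)):
        if a != b:
            return index
    return None


def shared_prefix_ratio(baseline, candidate):
    """Fraction of the longer sequence that both runs agree on from the start."""
    index = first_divergence(baseline, candidate)
    if index is None:
        return 1.0
    return index / max(len(baseline), len(candidate))


# Hypothetical token ids from two runs with the same seed and prompt.
fp16_run = [101, 7, 7, 42, 9, 300, 12]
q8_run = [101, 7, 7, 42, 8, 311, 12]

print("first divergence at token", first_divergence(fp16_run, q8_run))
print("shared prefix ratio", round(shared_prefix_ratio(fp16_run, q8_run), 2))
```

An early divergence is not automatically a failure, since two different continuations can both be correct, so this is best paired with a task whose outcome you can verify.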
For 4-bit weight quants, Gemma 4 31B QAT is where people should be looking instead of Qwen 3.6.
All small-scale stuff. For large integrated projects I am finding DeepSeek v4 Pro commercial API to be very inexpensive and helps me produce good results.
Tweaking the sampler might help.
https://github.com/ikawrakow/ik_llama.cpp
Edit: it's gonna be slow if you're not using any VRAM. But it's possible. Software isn't going to speed that up anytime soon, it's just a hardware bandwidth limit.
Progress marches without mercy.
Some people will be happy to pay that premium for privacy, but at roughly 10X the cost of a MacBook Neo, that money could also buy a lot of credits on OpenRouter or frontier labs.
[0]: https://www.apple.com/shop/buy-mac/macbook-pro/14-inch-space...