2ndorderthought 1 days ago [-]
I test drove it yesterday. It's pretty impressive at 8b. Runs on commodity hardware quickly.
Qwen3.6 35b a3b is still my local champion but I may use this for auto complete and small tasks. Granite has recent training data which is nice. If the other small models got fine tuned on recent data I don't know if I would use this at all, but that alone makes it pretty decent.
The 4b they released was not good for my needs but could probably handle tool calls or something
3abiton 1 days ago [-]
> Qwen3.6 35b a3b is still my local champion but I may use this for auto complete and small tasks.
I second this! Using the Unsloth Q6 (I forgot the exact name). Currently using it with forgecode (with zsh) on my Strix Halo, and it's surprisingly good. I would say somewhat similar to Haiku 4.5, plus additional privacy, minus speed. It's surprisingly fast for the hardware, given the speculative decoding, though PP (prompt processing) is still on the slow side.
bpye 13 hours ago [-]
Out of interest, what are you seeing for token generation - especially as the context fills?
lostmsu 19 hours ago [-]
If you use it for agentic coding and often hit PP, there's something wrong with your harness IMO
vessenes 1 days ago [-]
Have you tried the Gemma 4 series, out of curiosity? I haven’t run a local model in a while, but the benchmarks look good. I’d take a free local tool-use model if it was relatively consistent.
v3ss0n 1 days ago [-]
Qwen 3.6 burns it to the ground. It was not even a challenge. Gemma 4 seriously fails at tool calls and agentic work. It got all messed up after 2-3 turns of vibe coding.
xrd 1 days ago [-]
How do you run it? vllm? llama.cpp?
Can you share the parameters you use to enable tool calling and agentic usage?
Or, higher level, some philosophies on what approaches you are using for tuning to get better tool calling and/or agentic usage?
I'm having surprisingly good success with unsloth/Qwen3.6-27B-GGUF:Q4_K_M (love unsloth guys) on my RTX3090/24GB using opencode as the orchestrator.
It concocts some misleading paths, but the code often compiles, and I consider that a victory.
You have to watch it like you would watch a 14 year old boy who says he is doing his homework but you hear the sound effects of explosions.
jyap 1 days ago [-]
I run it with Llama.cpp on my RTX 3090. Also using the same Unsloth model.
I need to try out some of the other setups mentioned in this repo for increased TPS.
thot_experiment 1 days ago [-]
naw, i mean i prefer Qwen 3.6 to Gemma 90% of the time, especially the MoE with a light tune to make its tone more claude-like, but Gemma 4 is definitely better in some cases and I think they're pretty close in general.
The difference basically boils down to Gemma 4 making more assumptions and Qwen 3.6 sticking closer to the prompt. If your prompt is bad or leaves things up to the imagination, Gemma will do a better job; if you need strict prompt adherence, Qwen is better. Since local models are "dumb" I think it makes sense to prefer prompt adherence, but there are complex tasks that Gemma will complete much, much faster than Qwen because it makes the right assumptions the first time and as a result, even with slower inference, requires way fewer turns.
My speculation is that this comes from Google having a much better strategy for filtering their training data. I think this also shows up in the shape of the world knowledge of the models: Gemma's world knowledge seems deeper even though the models are of roughly equivalent size to the Qwen counterparts, so it's most likely just concentrated in places that are more relevant to my queries.
Most notably in my testing, Gemma 4 31b is the ONLY local model that will tell me the significance of 1738 correctly. Even most flagship/cloud models answer with some hallucinatory nonsense.
59nadir 1 days ago [-]
Counter-point: I built an agent that can only interface with Kakoune, a much less common and more challenging situation for an LLM to find itself in, and Gemma4-A4B quantized to 8 bits does remarkably better at actually figuring out how to get text into buffers than Qwen3.6-35B-A3B, which is in a similar class to Gemma4 A4B.
Now, is this the usual use case? No, it's a benchmark I created specifically in order to put LLMs in situations where they can't just blast out their bash commands without having to interface with something else and adapt.
celrod 1 days ago [-]
Fellow kakoune user here.
I'm curious about your use case/ what you're doing with it!
59nadir 1 days ago [-]
I'm just messing around with building agents, that's all. I'm not super interested in making ones that just sit in a terminal executing shell scripts, because truth be told they're absolutely trivial to make and don't show any interesting parts of LLMs. Telling an agent that it is sitting in Kakoune is a whole lot more interesting and really shows a lot of what LLMs aren't great at, and how they have to fight their urge to spit out overwrought bash invocations, or at the very least find a way to fit those into something new.
So far the only tools the agent has access to are `evaluate_commands(commands=["...", "..."])` and `get_buffer_contents()`, which really makes them have to work to get things done. I could make it super easy for them, but then it wouldn't be an interesting experiment.
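For context, a rough sketch of what those two tool definitions could look like in an OpenAI-style tools array (only the two names come from the description above; the descriptions and parameter wording are mine):

# Sketch of the two Kakoune tools in OpenAI-style function-calling schema.
KAKOUNE_TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "evaluate_commands",
            "description": "Send a list of Kakoune commands to the editor session, executed in order.",
            "parameters": {
                "type": "object",
                "properties": {
                    "commands": {"type": "array", "items": {"type": "string"}},
                },
                "required": ["commands"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "get_buffer_contents",
            "description": "Return the full text of the current Kakoune buffer.",
            "parameters": {"type": "object", "properties": {}},
        },
    },
]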
59nadir 20 hours ago [-]
As an addendum to this:
If I were to try to make something more useful out of this, I'd probably add the ability for LLMs to list buffers, give them an easier out for executing shell scripts in the way they prefer, make it easier for them to list docs, and a few other things like that.
The tools and the interaction with Kakoune are really trivial to write; I already use this by having the agent write to the session FIFO (a very simple binary format), and I extract information via my own FIFO that Kakoune writes to (this is used for the buffer data only right now).
I think once you start using it more as a tool and not a pseudo-benchmark like I am, you'd probably think of even more things to add, but a lot of it comes down to just making Kakoune's state visible and making shell spam (which the LLMs love) easier.
lambda 1 days ago [-]
Gemma 4 31b was working OK for me, but it was consuming tons of memory on SWA checkpoints, so I had to turn them way down, and as a 31b dense model it is fairly slow on a Strix Halo. I did have a lot of tool calling issues on 26b-a4b, though.
The Qwen models are quite solid though.
xrd 1 days ago [-]
What are you using to run it: vllm, llama.cpp, or other?
Can you share your switches and approach for using tools?
lambda 1 days ago [-]
llama.cpp
My setup is a bit of a mess as I experiment with different ways of configuring and hosting local models. At some point I was experimenting with the router server but stopped doing that; some of my settings are still in models.ini while some are on the command line.
The relevant settings in models.ini are below (I actually have no idea if these settings are applied when not using the router server; it's been hard for me to figure out what settings actually apply when using both the command line and models.ini):
[*]
jinja = true
seed = 3407
flash-attn = on
[unsloth/gemma-4-31B-it-GGUF:UD-Q8_K_XL]
temperature = 1.0
top_p = 0.95
top_k = 64
As my harness, I'm using pi, with a pretty vanilla config.
Anyhow, Gemma 4 31b worked in this config, but it was slow and RAM hungry. Since then, I've mostly moved to Qwen 3.6 35b-a3b because it's a lot faster.
I'm not actually doing anything useful with these yet, but I've used them for some experiments and Qwen 3.6 35b-a3b was capable of doing some pretty long mostly unsupervised agentic loops in my experimentation.
BoredomIsFun 5 hours ago [-]
> Qwen 3.6 burns it to the ground.
Not for creative writing or NLP.
2ndorderthought 1 days ago [-]
Gemma4 is definitely not meant for vibe/agentic coding. Not even worth trying. But it's a different weight class.
blurbleblurble 1 days ago [-]
I agree but would add that gemma 4 is really nice at vibing though in ways qwen 3.6 could never.
Maybe it could be fun to hook them up via a2a protocol as left and right brain agents operating in tandem.
copper-float 12 hours ago [-]
As someone who has never used AI for any coding or agent tasks, I feel like I'm going insane when I read things like this.
>Maybe it could be fun to hook them up via a2a protocol as left and right brain agents operating in tandem
What in the world does this even mean?
zkmon 1 days ago [-]
I have tested Gemma4-26B against Qwen3.6-35B. Gemma beats Qwen on structured data extraction and instruction following. Gemma is far more precise than Qwen in these tasks, while Qwen gets a bit more creative, verbose, and imprecise. However, Qwen has far more general smartness and higher token throughput. Qwen could precisely pinpoint the issues in data quality and code, while Gemma had no clue. On coding skills, Qwen appears to have an edge over Gemma, but this could depend on the agent you use. For direct chat (llama_cpp UI), both models show the same skills for coding.
seemaze 1 days ago [-]
That's interesting. I've been using Qwen3.5-35B for (poorly) structured table extraction based largely on the reports that Qwen had a much better vision implementation.
I have not benchmarked Qwen3.5 vs. Qwen3.6 for the same task, nor trialed Gemma4-26B. Guess it's time for some testing!
2ndorderthought 1 days ago [-]
I tried the Gemma 4, I think the 2b and 4b. The 2b was not useful for me at all. A little too weak for my use cases.
The 4b was okay. It didn't get all of my small math questions right, and it didn't know about some of the libraries I use, but it was able to do some basic auto complete type stuff. For microscopic models I like the llama 3.2 3b more right now for what I do; it's a little faster and seems a little stronger for my tasks. But everyone is different, and I don't think I'll use it anymore; this past month has been crazy for local model releases.
throwaw12 1 days ago [-]
can you share your use cases for 2b and 4b models?
curious how people are leveraging these models
2ndorderthought 1 days ago [-]
For me, I use them for quick auto complete or small questions. I am not a vibe/agentic coder. I know I am a relic and a Luddite because of this.
Instead of hitting stack overflow and Google I will ask questions like "can you give me an example of how to do x in library y?" Or "this error is appearing what might be happening if I checked a b and c". Or "please write unit tests for this function". Or code auto complete.
I am not looking for the world's best answer from a 3b model. I am looking for a super fast answer that reminds me of things I already know or maybe just maybe gives me a fast idea to stub something while I focus on something more important, I am going to refactor anyways. Think a low quality rubber duck
I mostly use 7-9b models for this now but llama 3.2 3b is pretty decent for not hogging resources while say I have other compute heavy operations happening on a weak computer.
Probably half the questions people ask ChatGPT could get roughly the same quality of answer with a small model, in my opinion. You can't fully trust an LLM anyways, so the difference between 60% and 70% accuracy isn't as big as marketing makes it sound. That said, the quality of a good 7-9b model is worth it compared to a 3b if your machine can run it. Furthermore, the quality of Qwen 3.6 is crazy and makes me wonder if I will ever need an AI provider again if the trend continues.
SwellJoe 1 days ago [-]
Over the weekend I used the small models for experimental training runs when figuring out how to build LoRAs. It takes a lot less time to do smoke tests of the process on E2B vs the 31B version. And E4B was a reasonable stop along the line just to make sure the LoRA combined with the base model to produce coherent output.
Also, they're good enough for a lot of simple categorization and data extraction tasks, e.g. something like "flag abusive posts/comments", or "visit website, find the contact info, open hours, address". And they run fast on the kind of hardware you're likely to have at home, while the bigger dense versions decidedly do not.
I used Gemma 4 itself to review and prune the data (my social media posts over the last ~5 years, about 5 million words) being ingested into the training process for a LoRA for Gemma 4. I found the bigger model (31B) was more nuanced and useful than the smaller ones, and I wasn't in a big hurry by that stage of the process, so I used the big one overnight. Gemma 4 31B was also a better judge of my writing than Gemini Flash 2.5, by my reckoning.
It was, again, more nuanced, and was able to recognize a generally helpful comment that opened kinda jokey/rude, while the smaller model and Gemini 2.5 Flash tended to gravitate toward extremes (1 or 5) rather than the 1-5 scale they were prompted to rate on. I assume Gemini 3.1 Flash is probably competitive or better, but I didn't try it, since I liked the results the self-hosted Gemma 4 was giving for free.
The little ones also run great on very modest hardware. Both run at comfortable interactive speed on mid-range tablets. E4B is blazing fast on an iPad M4 or Pixel 10 Pro and entirely usable on a midrange Android with sufficient RAM.
UncleOxidant 1 days ago [-]
> I may use this for auto complete
Using an 8B LLM for auto complete seems kind of like overkill. Couldn't a much smaller model handle that? IIRC there's a Qwen 1B model.
steveharing1 1 days ago [-]
Yeah, no doubt, Qwen 3.6 open weights are far stronger.
rnadomvirlabe 1 days ago [-]
Why no doubt?
captainbland 1 days ago [-]
The lack of comparison with any competitor models other than the previous Granite version strongly implies that it does not compete well with other comparable models. At least this is the most reasonable assumption until data comes out to the contrary.
2ndorderthought 1 days ago [-]
Qwen 3.6 is effectively a pocket-sized frontier model. It's really surprising, to me anyway.
steveharing1 1 days ago [-]
Because Qwen 3.6 pushes way above its weight. Granite 8B is impressive, but Qwen still wins on raw capability, especially for coding.
rnadomvirlabe 1 days ago [-]
You just asserted the same thing again. Why do you say this is the case?
2ndorderthought 1 days ago [-]
Qwen scores above sonnet in coding benchmarks. Runs locally. In personal use it's really good. Anecdotally others have used it to vibe code or agentic code successfully. Not toy problems. Not a toy model.
Qwen3.6 raises the bar for models of its size. There really isn't a comparison in my opinion.
albedoa 1 days ago [-]
Maybe you could tell him what you want instead of making him guess.
noodletheworld 1 days ago [-]
Having tried it.
Qwen is really good.
Also, generally, it makes sense. 8B models are generally not very good^.
That this 8B model is decent is impressive, but that it could perform on par with a good model 4 times as large is a daydream.
^ - To be polite. Small models + tool use for coding agents are almost universally ass. Proof: my personal experience. I've tried many of them.
meatmanek 1 days ago [-]
It's not that surprising that an 8B dense model would compete with a 35B-A3B MoE model.
The geometric mean rule of thumb for MoE models is that the intelligence level of an MoE model with T total parameters and A active parameters is roughly equivalent to that of a dense model with sqrt(A*T) parameters. For Qwen3.6-35B-A3B, that equivalent size is about 10.25B, within spitting distance of an 8B model. Good training can make up the 28% difference in size.
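As a quick sanity check of that rule of thumb (plain arithmetic, nothing model-specific):

import math

total, active = 35e9, 3e9                 # Qwen3.6-35B-A3B: total vs. active parameters
dense_equiv = math.sqrt(total * active)   # geometric-mean heuristic
print(f"dense-equivalent: {dense_equiv / 1e9:.2f}B")  # ~10.25B
print(f"vs. an 8B model: {dense_equiv / 8e9:.0%}")    # ~128%, i.e. ~28% larger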
irishcoffee 1 days ago [-]
So it’s just like, your opinion, man?
edit: It was a play on The Big Lebowski, folks.
Terretta 1 days ago [-]
College SAT scores do not tell you how the dev applying for your open back end systems engineering job is going to do once they're in your workplace harness.
Nor do class standings, nor hackerrank and the like.
What will tell you is asking them to fix a thing in your codebase. Once you ask an LLM to do that, a dozen times, I'd argue it's no longer "just your opinion man", it's a context-engineered performance x applicability assessment.
And it is very predictive.
But it's also why someone doing well at job A isn't necessarily going to be great at B, or bad at A doesn't mean will necessarily be bad at B.
I've often felt we should normalize a sort of mutual try-before-you-buy period where job-change seeker and company can spend a series of days together without harming one's existing employment, to derisk the mutual learning. ESPECIALLY to derisk the career change for the applicant, who only gets one timeline to manage, as opposed to the company, which considers the applicant fungible.
But back to the LLM, yeah, the only valid opinion on whether it works for you is not benchmark, it's an informed opinion from 'using it in anger'.
robotmaxtron 1 days ago [-]
the (dead) internet is full of opinions exactly like this
brazukadev 1 days ago [-]
you tried qwen3.6 and you think it is not good?
robotmaxtron 1 days ago [-]
I do not have high opinions of any ai model.
noodletheworld 1 days ago [-]
> So it’s just like, your opinion, man?
Yes.
That is how you empirically evaluate tools; not by reading stupid benchmarks. By actually using the tools, for hours and hours. Doing real work.
Did you try using it? For hours? Do you use qwen?
How about you tell us about your experience with your great 8B models that you use daily. What coding agent harness do you have them hooked up to? What context size can you get before they lose track of what's happening? Do you swap between models for different coding tasks?
Or have you not actually tried any of this stuff yourself?
irishcoffee 22 hours ago [-]
Work pays for copilot, so I use copilot. I will never spend a penny of my own money on this stuff. If it is free, I'll use it.
I'll never use any free opensource anything from china ever, so fuck no I haven't used qwen.
steveharing1 1 days ago [-]
[dead]
actionfromafar 1 days ago [-]
Way above its weights.
drittich 1 days ago [-]
Nanobanana for scale.
locknitpicker 1 days ago [-]
[dead]
cyanydeez 1 days ago [-]
Qwen3-Coder-Next seems to be the perfect size for coding. I tried the new one and just found the verbosity not really useful for coding. But it's probably fine for more analytical tasks or writing docs.
UncleOxidant 24 hours ago [-]
Qwen3-coder-next is still my favorite local model. Qwen3.6-27b is probably a bit better, but it also runs much slower on my Strix Halo box. Hoping we see a Qwen3.6-coder soon!
> designed for multilingual automatic speech recognition (ASR) and bidirectional automatic speech translation (AST) for English, French, German, Spanish, Portuguese and Japanese.
uf00lme 1 days ago [-]
Woah, is this part of the future of models?
Basically little models you can use as tools.
2ndorderthought 1 days ago [-]
It's looking like running your own mini ecosystem is the way of the future to me. No data centers, just a decent GPU 16-24gb of VRAM, CPU, and 32gb of RAM.
Lalabadie 1 days ago [-]
This is Apple's bet, among others.
Training purpose-specific miniature models lets you have a lot of tasks you can run with high confidence on consumer hardware.
twoodfin 1 days ago [-]
Or on a commodity EC2 instance with a relatively cheap inference sidecar.
I don’t know how many different little models this uses under the hood, but I was shocked at how good it was at the couple of document extraction tasks I threw at it.
SecretDreams 1 days ago [-]
Eventually we'll have models small enough to do a single thing really well and we'll call them functions.
hathym 1 days ago [-]
True, if you can write a function that summarizes an article, for example.
cyanydeez 1 days ago [-]
I'm pretty sure there's someone somewhere who'll create a proper harness that's equivalent to one giant model. The difficulty is mostly that local hardware has a lot of memory constraints. Targeting 128GB would seem to be the current sweet spot. If we could get away from the corporate market movers buying up all the memory, we could maybe have more.
Regardless, the kind of work people did in the 80s, pruning programs to fit on small devices, is likely happening now. I'd bet most of the Chinese firms are doing it because of the US's silly GPU games, among other constraints.
nickpsecurity 23 hours ago [-]
What needs to happen is for companies (or individuals) tired of that to pool money together to build new memory products. Then, sell them to consumers first and for non-AI use. If not that, then round-robin scheduling of quantities so the units are spread around more.
If costs are high, they might reserve a certain percentage for big business at market prices (or just under) to cover the chip's mask costs.
After DDR5+ RAM, then GDDR5-6 RAM for use with AI accelerators. They might try to jump right in on a HBM alternative. That could be the percentage for AI buyers I just mentioned. Especially if they could put 40-80GB on accelerators like Intel ARC's.
If successful enough, they license MIPS' gaming GPUs to combine with this stuff, with a full, open-source stack and RTOS support for military sales.
Tuna-Fish 17 hours ago [-]
Time for my daily "HBF is coming" comment.
The next step for models is to put the weights on flash, connected with a very wide interface to the accelerator. The first users will be datacenters, but it should trickle down to consumer hardware eventually. A single 512GB stack is expected to cost about $200, and provide 1.6TB/s of reads.
You still need some fast DRAM for the KV cache and for activations, but weights should be sitting on flash.
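A back-of-envelope sketch of what that bandwidth figure implies for inference speed (my arithmetic; only the 1.6TB/s number comes from the comment above):

# If every active weight has to be streamed from flash once per generated token,
# read bandwidth puts an upper bound on throughput: bandwidth / active-weight bytes.
bandwidth_gb_s = 1600  # ~1.6 TB/s per HBF stack, per the figure above
configs = {
    "70B dense, 8-bit (~70 GB active)": 70,
    "35B-A3B MoE, 8-bit (~3 GB active)": 3,
}
for name, active_gb in configs.items():
    print(f"{name}: <= ~{bandwidth_gb_s / active_gb:.0f} tokens/s from weight reads alone")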
zozbot234 14 hours ago [-]
Reading from Flash is too power-intensive compared to DRAM, this is why Flash offload isn't used in the data center today. Flash is also prone to wearing out quickly so ephemeral data like the KV-cache can't really be stashed in there. Unless your model has an unprecedented level of sparsity I just don't see how HBF could ever be useful.
Tuna-Fish 5 hours ago [-]
Currently available flash is obviously unusable. HBF is not that.
The reason HBF is (about to be) a thing is that flash manufacturers realized that if you heavily optimize flash for read throughput and energy, as opposed to density, you can match DRAM on throughput and get to within 2x on energy, at the cost of half your density. That would make the density still ~50 times better than DRAM, built on a cheap mass-produced process. All manufacturers are chasing this hard right now, with first samples to arrive later this year.
You are correct that it would absolutely not be used for any mutable data, only weights in inference. This is both because there is insufficient endurance (expected to be ~hundreds of drive writes total), but also because it will be very slow to write compared to the read speed. A single HBF stack is expected to provide 1.6TB/s reads, and single-digit GB/s writes. That's why I wrote the last sentence of my post that you replied to.
HBF is not that. The paper you linked is about how to use flash memory that exists to boost LLM performance, with all kinds of optimization tricks. HBF is about making flash memory that doesn't require any of those tricks, and just has the read throughput that's needed for inference.
smj-edison 1 days ago [-]
On the topic of local models, is there a good equivalent to something like Claude's chat interface? I've recently started transitioning to open models after getting fed up with Claude's usage limits (I'm not in a position to drop $200/month), and for coding tasks Kimi 2.6 has been about the same as Sonnet in my experience. The only thing I've found myself missing is a nice interface to ask it questions and have it help me with my math assignments.
0xbadcafebee 22 hours ago [-]
Yes but not exactly.
- A lot of people suggesting llama-server's web ui, but that requires you use local AI (llama.cpp), it's persisting content into your browser rather than the server (so you can lose your chats), and it doesn't support much functionality.
- There are some pure-browser chat interfaces that are like llama-server but you can use remote LLMs. This is closer to what you want, but everything is stored in the browser, so backup is harder.
- There's LocalAI, which is like the llama-server option, but more stuff is built in and it persists data to disk. It's flashy and very easy if all you want to do is local AI.
- There's LM Studio, which is another thing like LocalAI, but a desktop app.
- There's OpenWebUI, where it's like LocalAI, except you don't do local inference, you use remote LLMs. It sucks to be honest, just stops working a lot of the time, UX is terrible, lots of weird bugs.
- There's OpenHands, which is more like Codex/Claude Code web UI. You run it locally and connect to remote LLMs. Kinda clunky, limited, poor design. Like most coding agents, it doesn't support all the features you would want, like LocalAI/OpenWebUI do.
- There's OpenCode's web UI, which is like OpenHands, but less crappy.
- There's Jan, which is probably what you want. It's a desktop app rather than a web UI.
lostmsu 19 hours ago [-]
I started using https://github.com/milisp/codexia/ (which is a desktop app or a web server) that wraps your regular codex-cli or Claude Code CLI. So you can see Codex/Claude threads in your web UI and access it remotely. I love it because you can do Web UI or terminal and all conversations are preserved.
Unfortunately it is pretty buggy, so I am maintaining a fork matching my personal needs with bugfixes and a few extra features.
SwellJoe 1 days ago [-]
Most of the common ways to run local LLMs include a chat interface. llama.cpp's `llama-server` stands up a chat interface on 8080, as well as an OpenAI compatible API. LM Studio is a desktop app with a chat interface and API, as well. unsloth Studio, too.
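For anyone who hasn't tried it, a minimal sketch of hitting that OpenAI-compatible endpoint from Python (the model name and prompt are placeholders; llama-server serves whatever model it was launched with):

from openai import OpenAI

# llama-server exposes an OpenAI-compatible API on localhost:8080 by default.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused-for-local")

resp = client.chat.completions.create(
    model="local-model",  # placeholder; a single-model llama-server generally ignores the exact name
    messages=[{"role": "user", "content": "In two sentences, what is speculative decoding?"}],
)
print(resp.choices[0].message.content)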
LM Studio is nice in that it makes it easy to add tools, like search. Qwen 3.6 is such a small model that it lacks a lot of knowledge of the world (so it can hallucinate at an uncomfortable rate, which is a common failure mode of very small models), but it can use tools, so being able to search lets it research before answering. It has pretty good reasoning and tool calling, so it's actually pretty effective. I've been comparing Gemma 4 (31B at 8-bits, also very good with tools and reasoning for its size) and Qwen 3.6 (27B at 8-bits) against Claude Opus and Gemini Pro lately. And, obviously the frontiers are better, but most of the time, I find the tiny models are fine. I'm still not quite at the point where I'd be willing to code with local models, as the time wasted on hallucinations and logic bugs and sloppy coding practices is much higher, as is the cost of security bugs that make it past review.
You can try Open WebUI. It's genuinely useful when it comes to running open models locally with a clean interface.
RationPhantoms 1 days ago [-]
Yep, couple Open WebUI for general chats and OpenCode for software-specific tasks and it feels close to Claude Desktop and Claude Code.
simonw 1 days ago [-]
I've been mostly using LM Studio for this recently. Ollama has an OK chat UI now too. 'brew install llama.cpp' gets you 'llama-server' which provides quite a good web UI.
Svoka 1 days ago [-]
With Ollama* you can use Claude Code with `ollama launch claude`
v0.125.0 finally broke open models including their own gpt-oss over llama.cpp or vllm. I don't think they will fix it.
rangerelf 1 days ago [-]
llama-server from the llama.cpp package has a local web interface.
steveharing1 1 days ago [-]
Yes. I've used it a lot. It's very simple and good.
Havoc 1 days ago [-]
Interesting to see a pivot away from MoE by both IBM and Mistral, while the larger classes of SOTA models all seem to be sticking to it.
Quick vibe check of it (8B @ Q6): seems promising. Bit of a clinical tone, but I can see that being useful for data processing and similar. Sometimes you don't really want an LLM that spams you with emojis...
embedding-shape 1 days ago [-]
Makes sense: dense for small models, dense or MoE for larger ones. They end up fitting various hardware setups pretty neatly; there's no need for MoE at smaller scale, and dense is too heavy at large scale.
npodbielski 1 days ago [-]
I never want an LLM to spam me with emojis. What is the use case for that?
I find it highly annoying.
2ndorderthought 1 days ago [-]
Shh people are paying for each token. Don't get them asking too many questions
Havoc 1 days ago [-]
Think it can be a plus in moderation, e.g. in openclaw it can add some character.
But yea dislike that style where each heading and bullet point gets an emoji
0xbadcafebee 1 days ago [-]
People complain a lot about LLM-written articles, but the human comments here on HN are far worse. Mostly a bunch of people extremely proud of themselves for not reading an LLM-written article, and then a bunch of people who take it at face value and make the model seem almost useful, and one comment that actually looked at other benchmarks. Good 'ol humanity, good at.. being emotional... and not doing analysis.....
The article makes some good points about model design (how different size models within a family can get similar results, how to filter out hallucination, math result reinforcement), so that's worth understanding. It's analyzing a paper, which only discussed 3 sizes of the same model family. But what the article doesn't say is, compared to other model families, Granite 4.1 8B sucks. The only benchmark it does well at compared to other models is non-hallucination and instruction following. Qwen 3.5 4B (among other models) easily outclass it on every other metric.
This article teaches a valuable lesson about reading articles in general. You can take useful information away from them (yes, despite being written by LLM). But you should also use critical thinking skills and be proactive to see if the article missed anything you might find relevant.
sureMan6 1 days ago [-]
The pro-LLM rant is weird. LLMs "hallucinate" by creating detailed, elaborate lies, and the frontier models still do this egregiously. An LLM-written article by default has zero value, since every single line could be true or it could be a convincingly crafted lie; every line has to be fact checked.
I'm using Gemini 3.1 Pro to help me research my thesis. Even with search enabled and in pro mode, it still invents entire papers that don't exist, and lies about the contents of existing papers to relate them to the context or to appease me. If I submitted an LLM-written article based on the results it's given me, 80% of the article would be lies.
Commenting to complain that the article is LLM written is helpful too since some people aren't able to distinguish
0xbadcafebee 1 days ago [-]
> an LLM written article by default has 0 value since every single line could be true or it could be a convincingly crafted lie, every line has to be fact checked
The exact same thing is true of Human speech. You have no idea if anything a human says is true until you fact check it. But you don't fact check everything every person says, do you?
So what do you do instead? You use heuristics. Simple - and quite flawed - subconscious rules to stop worrying about things. You find a person you like, and you classify them "trustworthy", and believe almost all of what they say, not considering if any of it might be false. But of course, humans are fallible, and many of them receive "poisoned" input, and even hallucinate (making up information). They then spread that false information around. Yes, even the people you trust.
And when you're faced with something untrue, said by someone you trust, you rationalize it. "Oh, they just made a mistake." And you completely ignore that the person you trust told you a falsehood. Life is hard enough without having to question if everything we hear is false. So we just accept falsehoods from some people, and not others.
LLMs are likely more factual and knowledgeable today than humans are, thanks to their constant improvements via reinforcement. They're going to keep getting better too. But they'll never be perfect. Rather than rejecting anything they produce, my suggestion would be to do what you do with humans: trust them a little, verify big things, let the little things go, accept that there will be errors, and move on with life.
WarmWash 1 days ago [-]
If you are asking an LLM to cite its sources you are wasting your time and degrading the quality of the response. LLMs have no inherent mechanism for "knowledge source tracking", because that isn't at all how they work. We're trying to get there with agentic stacks, but it's still too new.
For sparse knowledge tasks, where you know that the model can't possibly have much training because even humans themselves don't have much knowledge there, use it as a brainstorming partner, not as a source. Or put relevant papers in its context to help you evaluate those papers in relation to your work. But it's just going to hurt itself in confusion trying to tie fuzzy ideas to sparse sources embedded in pages upon pages of mildly related Google search results.
kevin42 1 days ago [-]
If they can't distinguish LLM text, then why should they care?
Anti-AI people like to bring up hallucination as if everything AI generates is false.
I can write pages of text, with my own content, and then use AI to improve my writing and clarity. Then I review and edit. It might have some LLM markers in there, which I remove sometimes because it's distracting. But the final, AI assisted writing is easier to read and better organized. But all the ideas are mine. Hallucinations are not remotely a problem in this case.
Forgeties79 1 days ago [-]
If you can’t distinguish between fake images and real ones why should you care?
kevin42 1 days ago [-]
That depends on the purpose of the image.
If it's used to create a false narrative (like a deep fake), sure, you should care. But if it's used as an alternative to a stock photo, or as an easy way to make an infographic then no, I don't think you should care.
joquarky 24 hours ago [-]
> you should care
Why should I care? The world is full of false narratives.
How can I have the bandwidth to care about everything all of the time?
I swear that more than half of the complaining that I find here comes from privileged people bikeshedding over inane topics, who have never had to really worry about serious survival-level (how am I going to eat today?) issues in their lives.
Forgeties79 1 days ago [-]
And when an LLM starts hallucinating, and I emphasize “when,” is that not the same issue as creating a false narrative?
halJordan 1 days ago [-]
No, you're being weird (why are you calling people weird anyway, not helpful).
You're complaining about facts that have been true since words have been written on paper. If you read the article with the same criticality you read any other article, you won't have the problem you complain about.
The reality is, you're only complaining because you hate AI. Cool, but don't dress it up and resort to name calling to browbeat the other guy.
lelanthran 1 days ago [-]
If I read something and cannot tell that it is AI generated, then there's no problem.
If it has AI tells then I won't bother to continue reading, because it was either written by an AI or it was written by someone who can't tell the difference.
Either way it's probably a poor piece of writing.
phkahler 1 days ago [-]
>> The only benchmark it does well at compared to other models is non-hallucination and instruction following.
I think instruction following is going to be the most useful thing these models do. Add a voice interface and access to a bunch of simple, straight-forward devices or APIs and you have a mildly useful assistant. If that can be done in 8B parameters it will soon run on edge devices. That's solid usefulness.
encrux 1 days ago [-]
Anything that beats alexa-level intelligence on an edge-device is what I'd call useful as well, which shouldn't be too hard.
It's mind-boggling how bad current voice assistants sometimes are when you prompt them some fairly easy questions.
haolez 1 days ago [-]
The problem is the signal/noise ratio in these articles. If the AI has written the article, then this same info could have been generated by my own AI, but tailored to my needs. So what, exactly, is the new info that this article is generating that I can use to consult with my AI? That's what I want to get out of this interaction.
Maybe my point is something on the lines of "Just send me the prompt"[0]
The prompt, plus all the other bits of information the context has been seeded with before the output was created (documents, web searches, other sources), in which case it might be more efficient to just consume the final deliverable (yourself or via LLM).
haolez 3 hours ago [-]
Fair point. We could classify AI generated articles in two categories:
1) articles generated with context data that's trivial to find (or even embedded into the model)
2) articles generated with context data that's hard to find or not publicly available
lelanthran 1 days ago [-]
> people complain a lot about LLM-written articles, but the human comments here on HN are far worse.
No, they aren't.
You are comparing writing produced with little to no effort to writing produced with the minimal effort required to communicate.
It's reasonable for people to complain that they are presented material that not even the author thought was worth the effort.
simonw 1 days ago [-]
"The article makes some good points about model design"
But how can I tell if those are good points or not?
I don't want to invest time in reading something if the presence of those "good points" depends on a roll of the dice.
steveharing1 1 days ago [-]
Even calling it a roll of the dice is an assumption. Can you point to anything you found to be a mistake?
lelanthran 1 days ago [-]
You expect people to read every single excretion, which can be generated faster than I can read, just to find the rare gem that might exist?
The problem is that in the past it took multiple times more effort and hours to write something than it took to read. That served two purposes:
1. Lazy people just looking for an audience were effectively gatekept from drowning the world with their every vapid thought.
2. Because supply was many times slower than consumption it was viable to give most articles a chance: the author could not drown me in a deluge even if they wanted to.
Having the criteria now that the author should spend at least as much effort creating the piece as they expect the reader expend reading it is a damn useful bar: instead of reading 1000 AI articles just to find the one good one, I can simply read 10 human authored articles and be certain that 9 of them have something worthwhile.
simonw 1 days ago [-]
No, because I'm not going to spend a bunch of my time fact-checking obvious AI slop.
joquarky 24 hours ago [-]
Then don't complain.
simonw 23 hours ago [-]
?
geraneum 1 days ago [-]
> the human comments here on HN are far worse
I already assume some comments here are LLM written.
mkovach 1 days ago [-]
I just wait until I'm hallucinating, then I comment. Keeps the classifiers honest.
elxr 1 days ago [-]
I mean, obviously.
I assume some people here have never programmed a single useful thing even once in their lives.
drob518 1 days ago [-]
> But what the article doesn't say is, compared to other model families, Granite 4.1 8B sucks.
Right. This just says that Granite 4.1 8B is better than a previous version, Granite 4.0-H-Small, which has 32B, 9B active.
So, they made a less bad model than before. But that doesn't tell you anything about how it compares with other models.
DetroitThrow 1 days ago [-]
>Mostly a bunch of people extremely proud of themselves for not reading an LLM-written article
I'm not sure it's pride as much as people voicing displeasure with the uncertainty about what went into the LLM prompt. This may have been a one-sentence prompt, or it may have been some well-researched background that it simply reformatted. Why waste minutes or hours on verifying it if it's possible someone spent 10 seconds on it? It's very easy to see their point.
People lately seem to treat anyone they disagree with voicing their opinion about anything as some kind of auto-fellatio; I wonder what causes them to think this way.
whalesalad 1 days ago [-]
The thing is it's just a bunch of other original content that has been chewed up and regurgitated into something "new". Just show us the original content instead. This is by definition, slop. https://huggingface.co/blog/ibm-granite/granite-4-1
steveharing1 1 days ago [-]
[dead]
nielsbot 20 hours ago [-]
Very much an aside, but I'm struck by IBM's consistent iconic design language. For me it harkens all the way back to the futuristic design in 2001: A Space Odyssey from 1968. But you can also see it in their old mainframe hardware designs and other places.
100ms 1 days ago [-]
> Full stop.
Why do people not edit out obvious sloppification and expect to still have readers left?
wewewedxfgdf 1 days ago [-]
Third line into the article: "But there’s one result in the benchmarks I keep coming back to."
I hear this sort of thing all the time now on YouTube from media/news personalities:
“And that’s the part nobody seems to be talking about.”
"And here's what keeps me up at night."
“This is where the story gets complicated.”
“Here’s the piece that doesn’t quite fit.”
“And this is where the usual explanation starts to break down.”
“Here’s what I can’t stop thinking about.”
“The part that should worry us is not the obvious one.”
“And that’s where the real problem begins.”
“But the more interesting question is the one no one is asking.”
“And this is where things stop being simple.”
It doesn't really worry me, but I think it's interesting that LLM speak sounds so distinctive, and how willing these media personalities are to be so obvious in reading out on TV what the LLM spat out.
I've never studied what LLMs say in depth, but it is interesting that my brain recognises the speech pattern so easily.
frereubu 1 days ago [-]
I think this kind of language predates widespread LLM use, and has been picked up from that kind of writing. It's a "and here's where it gets interesting" pattern that people like Malcolm Gladwell and Freakonomics have used, even if the same thing could be said in a way that makes it sound much less intriguing.
cwillu 1 days ago [-]
There's even a word for it: “cliché”
someguyiguess 1 days ago [-]
How banal
cyanydeez 1 days ago [-]
10 EASY WAYS TO SPOT A LLM~ THE 10TH ONE WILL SURPRISE YOU!
helsinkiandrew 1 days ago [-]
Isn't this the format of "hook-driven media": a constant stream of "second-act pivots", where some new twist is added to a story to re-engage the reader and keep them reading?
BuzzFeed and Upworthy etc pioneered this for web 'news stories', then it got used in linkedin, twitter, and everywhere where views are more important than the content.
jmbwell 1 days ago [-]
The language of drama and import without meaningful substance. Words statistically likely to be used in a segue, regardless of the preceding or subsequent point. Particularly effective when it seems like you’re getting let in on a secret. Really fatiguing to read
A writing teacher once excoriated me for saying that something was important. “Don’t tell me it’s important, show me, and let me decide, and if you do your job I’ll agree”
I don’t know how a completion can tell when it needs to do this. Mostly so far it doesn’t seem capable
MarsIronPI 1 days ago [-]
Maybe the solution is to cull the bad, cliché writing from the training data.
wewewedxfgdf 1 days ago [-]
You can just instruct the LLM not to write like an LLM.
MarsIronPI 1 days ago [-]
Ugh, you're making me remember the last time I listened to NPR. It's so bad.
stuff4ben 1 days ago [-]
I listen to NPR daily and I don't think I've ever heard any of them use that phrasing.
bambax 1 days ago [-]
I notice this very often in LinkedIn posts, and it's annoying, but I had not realized it was LLM-speak? Isn't it possible that people write like this naturally?
wewewedxfgdf 1 days ago [-]
I think LLMs have that sort of "summarise, wrap it in a bow tie, give a little dramatic punch as a preview to the next few points".
cyanydeez 1 days ago [-]
Guys, LLMs are built on all these social cues, which were developed pre-model. There's at least 10 years of pre-LLM gibberish.
This is to say: marketers and spammers repeat the same things over and over, and these models are built on coalescing repetition into the basis.
So yeah, of course people talked like this before, but it was always in some known context like linked in or a spam website.
fwip 1 days ago [-]
Sure, but RLHF ended up emphasizing this to a level beyond normal human writing.
spicyusername 1 days ago [-]
Arguably it's exactly because it was used naturally so often that the LLMs parrot it so frequently.
trvz 1 days ago [-]
Yes. Some people are very trigger happy in attributing human slop to LLMs.
steveharing1 1 days ago [-]
[dead]
nwatson 1 days ago [-]
Nate B Jones' videos ... the YouTube channel "AI News and Strategy Daily" uses all of these. Every video.
bityard 1 days ago [-]
I listened to a lot of NPR podcasts before LLM were around, and most of them are full of these kinds of filler phrases.
riknos314 1 days ago [-]
The general concept of a hook with delayed payoff is far from new, and generally one of the better ways at keeping attention.
It's also exactly the Mr beast playbook, and got him to the largest channel on YouTube.
Any system attempting to capture human attention will use these techniques, nothing LLM-specific here at all.
Lerc 1 days ago [-]
Apparently John Oliver was an LLM before they were even invented.
cbg0 1 days ago [-]
So are we saying it's fine that the article is written by an LLM as long as it doesn't have the tell-tale signs of LLMs?
ramon156 1 days ago [-]
It's more about curating the things you're publishing. Why would I bother reading what you couldn't bother to read?
alienbaby 1 days ago [-]
They could easily have read it and thought: that communicates the information it needs to.
No point creating busywork for yourself just shuffling words around when the information is there, no?
I guess it depends on what you want out of the article. Substance, or style?
lelanthran 1 days ago [-]
> They could easily have read it, and thought , that communicates the information that it needs to.
If they aren't self-aware enough or smart enough to determine that what they wrote is indistinguishable from text generation, how probable is it that they have something of value to add to any thought?
100ms 1 days ago [-]
I don't really see reason to complain about tool use, so long as the result is cohesive, accurate and that ultimately means a human has at least read their own output before publishing. It's a bit like receiving a supposedly personal letter that starts "Dear [INSERT_FIRST_NAME_FIELD]," are you really going to read such a thing?
HighGoldstein 1 days ago [-]
An article without telltale signs of an LLM is indistinguishable from an article written by a human, so yes.
spicyusername 1 days ago [-]
My opinion is that literature and art will continue pushing the envelope in the places they always pushed the envelope. LLMs will not change this, humans love making art, and they love doing it in new ways.
Corporate announcements were never the places that literature and art were pushing the envelope. They were slop before, and they're slop now.
crunis 1 days ago [-]
Are you referring to the literal use of the expression "full stop"? I don't see it anymore in the article, maybe they edited it out?
Do you have any reason to believe that Granite is more immune to the effects of quantization than other tiny models? Otherwise it seems odd to judge a tiny model's true capabilities by using its 4-bit quant.
simonw 1 days ago [-]
This model is small enough that it might be sensible to try the same prompts against all of the quant sizes to try and spot any differences.
I'm convinced that SLMs are a real part of the solution for truly integrated AI in processes...
dash2 1 days ago [-]
Nah, I ain't reading that. If they can't be bothered to get a human to write it, it can't be that important. I'm glad for them though. Or sorry that happened.
The 8B class closing the gap with 32B is the real story of 2026 for anyone running models locally. I've been using smaller models for agent tool-use and the progress this year is real.
The gap that still matters most isn't intelligence, it's consistency on structured output. When you chain 5+ tool calls in sequence, even a small per-call reliability difference compounds fast. Would love to see Granite 4.1 benchmarked specifically on multi-step function calling rather than just general benchmarks.
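To put numbers on how fast that compounds (illustrative probabilities, not measured results):

# If each tool call in a chain succeeds independently with probability p,
# the whole n-step chain succeeds with probability p**n.
for p in (0.99, 0.95, 0.90):
    row = ", ".join(f"n={n}: {p**n:.0%}" for n in (5, 10))
    print(f"per-call {p:.0%} -> {row}")
# per-call 99% -> n=5: 95%, n=10: 90%
# per-call 95% -> n=5: 77%, n=10: 60%
# per-call 90% -> n=5: 59%, n=10: 35%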
agunapal 1 days ago [-]
If you really think about why MoE came into existence, it's to save significant cost during training. I don't think there was any concrete evidence of performance gains for comparable MoE vs dense models. Over the years, I believe all the new techniques being employed in post-training have made the models better.
vessenes 1 days ago [-]
I think you mean inference compute? I believe all expert weights are updated in each backward pass during MoE training. The first benefit was getting a sort of structured pruning of weights through the mechanism of expert selection so that the model didn’t need to go through ‘unnecessary’ parts of the model for a given token. This then let inference use memory more efficiently in memory constrained environments, where non-hot or less common experts could be put into slow RAM, or sometimes even streamed off storage.
But I don’t think it necessarily saved training cost; if it did, I’d be interested to learn how!
agunapal 9 hours ago [-]
Here is a paper from a few years ago where they talk about a 7x speed increase, which equates to savings.
Each token is only routed through a few chosen (topk) experts during training. So not all expert weights are updated in the backward pass. Otoh, you may need more training to ensure all experts see enough tokens!
I doubt MoE is actually worth it, given how complicated high-performance expert routing and training is. But who knows, I don't.
zozbot234 1 days ago [-]
MoE models will have far more world knowledge than dense models with the same amount of active parameters. MoE is a no-brainer if your inference setup is ultimately limited by compute or memory throughput - not total memory footprint - or alternately if it has fast, high-bandwidth access to lower-tier storage to fetch cold model weights from on demand.
regularfry 1 days ago [-]
Yes, this. I can run the 122B Qwen3.5 MoE usably on one 4090 + 64GB RAM. That's a monster of a model, comparatively speaking.
aitchnyu 1 days ago [-]
Tangential. I'm a newb; can you name the concept of partitioning weights so we don't need to load the whole thing?
agunapal 9 hours ago [-]
Do you mean model sharding?
woadwarrior01 1 days ago [-]
The most salient thing about these models is that they're non-reasoning models. This makes them very token efficient and particularly well suited for local inference, where decoding is usually slower than with datacenter GPUs.
Probably worse than Gemma 4 or Qwen 3.6 with thinking off.
mdp2021 1 days ago [-]
I read that IBM pioneered the concept of shifting, through "mid-training", from "guessing the next token" to "guessing the next logical step". I am wondering how far the research is from "enhancing apparent reasoning" to "achieving solid, reliable reasoning".
If techniques existed to shift from "guess the next highly probable" token to "guess the best next logical step", as some interpreted said research, should not that be the foremost objective?
latentframe 9 hours ago [-]
The limit is changing from scaling parameters to scaling data quality; however, compute is still the big constraint.
dissahc 1 days ago [-]
qwen3.5 9b outperforms granite 4.1 30b by a huge amount (32 vs 15 on artificialanalysis benchmark)... i have no idea what made the writer of this article say so many demonstrably incorrect things
RandyOrion 1 days ago [-]
Although the performance claim of 8b dense matching 32b moe is somewhat questionable, thank you granite team for releasing small dense LLMs.
peter_d_sherman 3 hours ago [-]
>"Stage two was RLHF training on general chat prompts using a reward model to improve helpfulness. This worked. AlpacaEval scores jumped around 18.9 points on average compared to the fine-tuned checkpoints.
Then something broke. The RLHF stage, while improving chat quality, caused math benchmark scores to drop. GSM8K and DeepMind-Math both regressed."
Observation: Math (which, when fully decomposed, results in Logic) is at the core of how computers (and traditional/older, non-LLM programming languages) work. If an LLM gets Math training wrong at any stage for any reason, then, in my opinion, that should be viewed as something that needs to be fixed at a lower level, not a higher one; not a later training level...
I think it would be an interesting exercise to train an LLM that only deals in simple Math, simple English, and only the ability to compute simple equations (+, -, x, /)... like, what's the absolute minimum in terms of text and layers necessary to train a model like that?
I think some interesting understandings could potentially be had by experimentation like that...
I myself would love a pure (simplest, smallest possible) text-to-Math-only LLM (TTMLLM? TTMSLM?), along with all of the necessary corpora (which would ideally be as small as possible) and the instructions necessary to train such an LLM...
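To make the idea concrete, a minimal sketch of what generating such a tiny corpus could look like (the file name, number ranges, and phrasing are all arbitrary choices, and division is omitted to keep answers integral):

import random

# Generate a toy corpus of simple-English arithmetic Q&A lines, the kind of
# data a minimal "text-to-math only" model might be trained on.
OPS = {
    "plus": lambda a, b: a + b,
    "minus": lambda a, b: a - b,
    "times": lambda a, b: a * b,
}

def make_example(rng: random.Random) -> str:
    a, b = rng.randint(0, 99), rng.randint(0, 99)
    name, fn = rng.choice(list(OPS.items()))
    return f"What is {a} {name} {b}? Answer: {fn(a, b)}"

rng = random.Random(0)
with open("toy_math_corpus.txt", "w") as f:
    for _ in range(10_000):
        f.write(make_example(rng) + "\n")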
mdp2021 1 days ago [-]
Wish they also released an embedding model, in the line of their previous: compact (while good)...
Is this a model that will create reliable output or will it also produce errors?
RugnirViking 1 days ago [-]
Sounds interesting. Here's hoping they release a 32B model; that's a pretty good sweet spot for feasibility of home setups.
edit: I just realised they do actually have a 30b release alongside this. Haven't tried it yet.
2ndorderthought 1 days ago [-]
Try Qwen 3.6. It will knock your socks off.
cindyllm 1 days ago [-]
[dead]
SwellJoe 1 days ago [-]
I wish AI slop articles were somehow automatically flagged and deaded. They're all flowery verbose piles of crap. Yeah, the model is interesting, but the article is trash. I can't believe real humans are willing to sign their name to this stuff.
theblazehen 1 days ago [-]
> models are judged by GPT-4
An interesting choice
cubefox 1 days ago [-]
It's strange that they don't include reasoning training (RLVR). Their justification doesn't sound convincing:
> While reasoning models have grown in popularity in recent years, their abilities aren’t always the most efficient way to get a result. In enterprise settings, token costs and speed are often as important as performance. That is why turning to less expensive, non-reasoning models with similar benchmark performance for select tasks like instruction following and tool calling makes sense for enterprise users.
I guess they currently don't have the ability to do proper RLVR.
mdp2021 8 hours ago [-]
I may have misunderstood: is not reasoning training (RLVR) independent from the use of the "<think>" tags - is it not a method that improves results in reasoning? How do we know that it was not carried out?
Incidentally: I am trying to spend some time researching in the progresses in the area (the jump from parroting, to inconsistent apparent reasoning, to reliable reasoning).
nickpsecurity 23 hours ago [-]
On changing the training mix, H2O did that with Danube in 2024:
With those results, I would've already done that in any models I got to train. There's also the principle that LLMs are often better at what they saw last in their training set. That also justifies putting more logic, code, and math in at the end for an analytical or coding model. So, a few precedents for that technique already.
rasz 17 hours ago [-]
me: "Bits 5-4 indicate the video mode", how do you derive VIDEO_MASK_00b = 0x03C0 from this?
Granite 4.1 8B: "I apologize for the confusion earlier. Let's clarify the derivation of the VIDEO__MASK_00b based on the description "Bits 5-6 indicate the video mode" from the equipment_list section.
Given that bits 5 and 6 are used for video mode, we need to create a mask that isolates these two bits:
Bit 5 is the least significant bit of the group (2^5 = 32 decimal, 0b00100000 in 8-bit form).
Bit 6 is the next bit up (2^6 = 64 decimal, 0b01000000 in 8-bit form).
Combining these bits: ...
Therefore, the correct mask for VIDEO_MASK_00b should be 0x00C0"
Errors on top of errors when converting a description into binary numbers. It's hopeless for a basic task like parsing/generating headers :(
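For reference, a sketch of the derivation the model kept fumbling, assuming a 16-bit equipment-list word where bits 5-4 hold the video mode:

# Mask that isolates bits 5-4 of a 16-bit word.
VIDEO_MASK = (1 << 5) | (1 << 4)   # 0b0000_0000_0011_0000
print(hex(VIDEO_MASK))             # 0x30, not 0x03C0 or 0x00C0

def video_mode(equip_word: int) -> int:
    # Extract the 2-bit video mode field from bits 5-4.
    return (equip_word & VIDEO_MASK) >> 4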
robotmaxtron 1 days ago [-]
"open source"
show me.
jasonlotito 1 days ago [-]
Apache 2.0 License. Did you not click the link to the project? They even list it in the article.
> Apache 2.0 across the board, so commercial use is clean.
Did you just stop when you saw open source and come post this here because you couldn't be bothered to... look at the project and see it's cleanly and clearly listed.
Edit: Like. I get it. It's fine to question open source. But this isn't hidden. It's repeated and made clear multiple times. They even link to the license: https://www.apache.org/licenses/LICENSE-2.0
It wasn't hidden, it wasn't in some weird, out-of-the-way place. In fact, I found it so easily that I genuinely questioned whether it was real because of your comment. Like, why would anyone post what you posted if it was this easy to find?
NOPE! It was right there.
speedgoose 1 days ago [-]
If I give you an amd64 elf binary under Apache2 license, is it open source?
EagnaIonat 1 days ago [-]
Can you clarify what you mean?
If you check HF you will see it's Apache 2.0 and the datasets were also permissive.
It's one of the few models on the market where the creator indemnifies users against copyright claims.
https://research.ibm.com/blog/granite-ethical-ai
https://github.com/ibm-granite
Maybe I suck but I didn’t find that in 5 seconds. Or with more time.
I meant the full training datasets and the complete recipes to make the models.
EagnaIonat 13 hours ago [-]
The training datasets are listed there and are all open source.
> the complete recipes to make the models.
You mean the weights, which most companies don't release. Again, you can find them from that link.
speedgoose 11 hours ago [-]
Where is the list?
No I didn't mean the weights, but the source code to make the weights.
jeraldbenny 13 hours ago [-]
If you Google around you can find many open source datasets. Also check Kaggle; they have training datasets we can use as well.
speedgoose 11 hours ago [-]
I know, but that is not my point.
jasonlotito 4 hours ago [-]
You never had one. You tried to be clever and failed.
speedgoose 3 hours ago [-]
I don’t know. Since you are perhaps clever, can you show me the training datasets and recipes so I can replicate this model locally? I have access to good HPCs.
I think it’s fair if you use a bit more than 5 seconds as someone stated above. I would gladly be proven stupid.
robotmaxtron 23 hours ago [-]
if I can't reproduce the artifact, is it really open source?
postalrat 22 hours ago [-]
If IBM themselves can't reproduce the artifact do they have the source?
robotmaxtron 20 hours ago [-]
I guess not
nickpsecurity 16 hours ago [-]
Open source for ML is more like Allen Institute's Olmo models: https://allenai.org/olmo
I'm just giving it as an example. I haven't looked at Granite's repos.
My config is similar to: https://github.com/noonghunna/club-3090/blob/master/docs/eng...
I need to try out some of the other set ups mentioned in this repo for increased TPS.
Most notably in my testing, Gemma 4 31b is the ONLY local model that will tell me the significance of 1738 correctly. Even most flagship/cloud models answer with some hallucinatory nonsense.
Now, is this the usual use case? No, it's a benchmark I created specifically in order to put LLMs in situations where they can't just blast out their bash commands without having to interface with something else and adapt.
So far the only tools the agent has access to are `evaluate_commands(commands=["...", "..."])` and `get_buffer_contents()`, which really makes them have to work for doing things. I could make it super easy for them but then it wouldn't be an interesting experiment.
If I were to try to make something more useful out of this, I'd probably add the ability for LLMs to list buffers, probably give them an easier out for executing shell scripts in the way they prefer, give them an easier time to list docs and a few other things like that.
The tools and the interaction with Kakoune are really trivial to write; I already use this by having the agent write to the session FIFO (a very simple binary format) and I extract information via my own FIFO that Kakoune writes to (this is used for the buffer data only right now).
I think once you started using it more as a tool and not a pseudo-benchmark like I am you'd probably think of even more things to add but a lot of it comes down to just making Kakoune's state visible and making shell spam (which the LLMs love) easier.
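For anyone wanting to try something similar, a rough sketch of that two-tool surface (this is my guess at a minimal version, not the setup described above: it assumes a session named "agent", goes through `kak -p` instead of writing the binary FIFO format directly, and the temp-file buffer dump is a simplification):
  import os, subprocess, tempfile
  SESSION = "agent"  # assumed Kakoune session name
  def evaluate_commands(commands):
      # kak -p pipes raw commands from stdin into a running session.
      subprocess.run(["kak", "-p", SESSION],
                     input="\n".join(commands).encode(), check=True)
  def get_buffer_contents():
      # Ask Kakoune to dump the current buffer to a temp file, then read it back.
      fd, path = tempfile.mkstemp(suffix=".kak-buf")
      os.close(fd)
      evaluate_commands([f"write -force {path}"])
      with open(path) as f:
          return f.read()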
The Qwen models are quite solid though.
Can you share your switches and approach for using tools?
My setup is a bit of a mess as I experiment with different ways of configuring and hosting local models. At some point I was experimenting with the router server but stopped doing that, so some of my settings are still in models.ini while some are on the command line.
podman run --env "HF_TOKEN=$HF_TOKEN" --env "LLAMA_SERVER_SLOTS_DEBUG=1" -p 8080:8080 --device /dev/kfd --device /dev/dri --security-opt seccomp=unconfined --security-opt label=disable --rm -it -v ~/.cache/huggingface/:/root/.cache/huggingface/ -v ./unsloth:/app/unsloth -v ./models.ini:/app/models.ini llama.cpp-rocm7.2 -hf unsloth/gemma-4-31B-it-GGUF:UD-Q8_K_XL --chat-template-file /root/.cache/huggingface/gemma-4-31B-it-chat_template.jinja -ctxcp 8 --port 8080 --host 0.0.0.0 -dio --models-preset models.ini
With the following as the relevant settings in models.ini (I actually have no idea if these settings are applied when not using the router server; it's been hard for me to figure out which settings are actually applied when using both the command line and models.ini).
And it looks like the chat_template.jinja I have is actually out of date by now; there was a new one pushed just a couple of days ago that seems to have some further tool calling fixes: https://huggingface.co/google/gemma-4-31B-it/blob/main/chat_...
As my harness, I'm using pi, with a pretty vanilla config.
Anyhow, Gemma 4 31b worked in this config, but it was slow and RAM hungry. Since then, I've mostly moved to Qwen 3.6 35b-a3b because it's a lot faster.
I'm not actually doing anything useful with these yet, but I've used them for some experiments and Qwen 3.6 35b-a3b was capable of doing some pretty long mostly unsupervised agentic loops in my experimentation.
Not for creative writing or NLP.
Maybe it could be fun to hook them up via a2a protocol as left and right brain agents operating in tandem.
>Maybe it could be fun to hook them up via a2a protocol as left and right brain agents operating in tandem
What in the world does this even mean?
I have not benchmarked Qwen3.5 vs. Qwen3.6 for the same task, nor trialed Gemma4-26B. Guess it's time for some testing!
The 4b was okay. It didn't get all of my small math questions right, and it didn't know about some of the libraries I use, but it was able to do some basic auto complete type stuff. For microscopic models I like Llama 3.2 3B more right now; it's a little faster and seems a little stronger for what I do. But everyone is different, and I don't think I'll use it anymore; this past month has been crazy for local model releases.
curious how people are leveraging these models
Instead of hitting stack overflow and Google I will ask questions like "can you give me an example of how to do x in library y?" Or "this error is appearing what might be happening if I checked a b and c". Or "please write unit tests for this function". Or code auto complete.
I am not looking for the world's best answer from a 3b model. I am looking for a super fast answer that reminds me of things I already know, or maybe, just maybe, gives me a fast idea to stub something out while I focus on something more important; I am going to refactor anyway. Think of it as a low-quality rubber duck.
I mostly use 7-9b models for this now but llama 3.2 3b is pretty decent for not hogging resources while say I have other compute heavy operations happening on a weak computer.
Probably half the questions people ask ChatGPT could get roughly the same quality of answer from a small model, in my opinion. You can't fully trust an LLM anyway, so the difference between 60% and 70% accuracy isn't as big as marketing makes it sound. That said, the quality of a good 7-9b model is worth it compared to a 3b if your machine can run it. Furthermore, the quality of Qwen 3.6 is crazy and makes me wonder if I will ever need an AI provider again if the trend continues.
Also, they're good enough for a lot of simple categorization and data extraction tasks, e.g. something like "flag abusive posts/comments", or "visit website, find the contact info, open hours, address". And they run fast on the kind of hardware you're likely to have at home, while the bigger dense versions decidedly do not.
I used Gemma 4 itself to review and prune the data (my social media posts over the last ~5 years, about 5 million words) being ingested into the training process for a LoRA for Gemma 4. I found the bigger model (31B) was more nuanced and useful than the smaller ones, and I wasn't in a big hurry by that stage of the process, so I used the big one overnight. Gemma 4 31B was also a better judge of my writing than Gemini Flash 2.5, by my reckoning.
It was, again, more nuanced, and was able to recognize a generally helpful comment that opened kinda jokey/rude, while the smaller model and Gemini 2.5 Flash tended to gravitate toward the extremes (1 or 5) rather than using the full 1-5 scale they were prompted to rate on. I assume Gemini 3.1 Flash is probably competitive or better, but I didn't try it, since I liked the results the self-hosted Gemma 4 was giving for free.
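For anyone curious what that kind of pruning pass looks like mechanically, here is a minimal sketch against a local OpenAI-compatible endpoint (the URL, model name, and rubric wording are placeholders, not the exact setup described above):
  from openai import OpenAI
  # llama.cpp's llama-server, LM Studio, and Ollama all expose this API shape locally.
  client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")
  RUBRIC = ("Rate the following post from 1 (low effort, not worth training on) "
            "to 5 (thoughtful, well written). Reply with a single digit.")
  def rate(post):
      resp = client.chat.completions.create(
          model="local-judge",  # placeholder model name
          messages=[{"role": "system", "content": RUBRIC},
                    {"role": "user", "content": post}],
          max_tokens=4, temperature=0)
      digits = [c for c in resp.choices[0].message.content if c.isdigit()]
      return int(digits[0]) if digits else 1
  posts = ["example post one", "example post two"]  # stand-in for the real corpus
  kept = [p for p in posts if rate(p) >= 4]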
The little ones also run great on very modest hardware. Both run at comfortable interactive speed on mid-range tablets. E4B is blazing fast on an iPad M4 or Pixel 10 Pro and entirely usable on a midrange Android with sufficient RAM.
Using an 8B LLM for auto complete seems kind of like overkill. Couldn't a much smaller model handle that? IIRC there's a Qwen 1B model.
Qwen3.6 raises the bar for models of its size. There really isn't a comparison in my opinion.
Qwen is really good.
Also, generally, it makes sense. 8B models are generally not very good^.
That this 8B model is decent is impressive, but that it could perform on par with a good model 4 times as large is a daydream.
^ - To be polite. The small models + tool use for coding agents are almost universally ass. Proof: my personal experience. I've tried many of them.
The geometric mean rule of thumb for MoE models is that the intelligence level of an MoE model with T total parameters and A active parameters is roughly equivalent to that of a dense model with sqrt(A*T) parameters. For qwen3.6-35B-A3B, that equivalent size is 10.24B, within spitting distance of an 8B model. Good training can make up the 28% difference in size.
edit: It was a play on The Big Lebowski, folks.
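Spelled out, reading A3B as roughly 3B active parameters (hence the small rounding difference from the 10.24B figure above):
  from math import sqrt
  total, active = 35e9, 3e9
  print(sqrt(total * active) / 1e9)  # ~10.25, the "dense-equivalent" size in billions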
Nor do class standings, nor hackerrank and the like.
What will tell you is asking them to fix a thing in your codebase. Once you ask an LLM to do that, a dozen times, I'd argue it's no longer "just your opinion man", it's a context-engineered performance x applicability assessment.
And it is very predictive.
But it's also why someone doing well at job A isn't necessarily going to be great at B, or bad at A doesn't mean will necessarily be bad at B.
I've often felt we should normalize a sort of mutual try-before-you-buy period where the job-change seeker and company can spend a series of days together without harming the seeker's existing employment, to derisk the mutual learning. ESPECIALLY to derisk the career change for the applicant, who only gets one timeline to manage, as opposed to the company, which considers the applicant fungible.
But back to the LLM, yeah, the only valid opinion on whether it works for you is not benchmark, it's an informed opinion from 'using it in anger'.
Yes.
That is how you empirically evaluate tools; not by reading stupid benchmarks. By actually using the tools, for hours and hours. Doing real work.
Did you try using it? For hours? Do you use qwen?
How about you tell us about your experience with your great 8B models that you use daily. What coding agent harness do you have them hooked up to? What context size can you get before they lose track of what's happening? Do you swap between models for different coding tasks?
Or, have you not, actually, even actually tried any of this stuff, yourself?
I'll never use any free opensource anything from china ever, so fuck no I haven't used qwen.
Original article on IBM research
Hugging face weights: https://huggingface.co/collections/ibm-granite/granite-41-la...
https://huggingface.co/ibm-granite/granite-speech-4.1-2b
designed for multilingual automatic speech recognition (ASR) and bidirectional automatic speech translation (AST) for English, French, German, Spanish, Portuguese and Japanese.
Training purpose-specific miniature models lets you have a lot of tasks you can run with high confidence on consumer hardware.
I don't know how many different little models this uses under the hood, but I was shocked at how good it was at the couple of document extraction tasks I threw it at.
Regardless, the kind of pruning people did in the 80s to fit programs on small devices is likely happening now. I'd bet most of the Chinese firms are doing it because of the US's silly GPU games, among other constraints.
If costs are high, they might reserve a certain percentage for big business at market prices (or just under) to cover the chip's mask costs.
After DDR5+ RAM, then GDDR5-6 RAM for use with AI accelerators. They might try to jump right in on a HBM alternative. That could be the percentage for AI buyers I just mentioned. Especially if they could put 40-80GB on accelerators like Intel ARC's.
If successful enough, they license MIPS' gaming GPU's to combine with this stuff with full, open-source stack and RTOS support for military sales.
The next step for models is to put the weights on flash, connected with a very wide interface to the accelerator. The first users will be datacenters, but it should trickle down to consumer hardware eventually. A single 512GB stack is expected to cost about $200, and provide 1.6TB/s of reads.
You still need some fast DRAM for the KV cache and for activations, but weights should be sitting on flash.
The reason HBF is (about to be) a thing is that flash manufacturers realized that if you heavily optimize flash for read throughput and energy, as opposed to density, you can match DRAM on throughput and get to within 2x on energy, at the cost of half your density. That would make the density still ~50 times better than DRAM, built on a cheap mass-produced process. All manufacturers are chasing this hard right now, with first samples to arrive later this year.
You are correct that it would absolutely not be used for any mutable data, only weights in inference. This is both because there is insufficient endurance (expected to be ~hundreds of drive writes total), but also because it will be very slow to write compared to the read speed. A single HBF stack is expected to provide 1.6TB/s reads, and single-digit GB/s writes. That's why I wrote the last sentence of my post that you replied to.
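Back-of-envelope, using the 1.6TB/s read figure above and an imaginary 3B-active MoE at 8-bit, to show why read bandwidth is the number that matters for weights-on-flash decode:
  read_bw = 1.6e12        # bytes/s claimed per HBF stack
  active_params = 3e9     # hypothetical 3B-active MoE
  bytes_per_param = 1     # ~8-bit quantized weights
  bytes_per_token = active_params * bytes_per_param  # weights streamed per decoded token
  print(read_bw / bytes_per_token)  # ~533 tokens/s ceiling, ignoring KV cache and compute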
https://arxiv.org/pdf/2312.11514
- A lot of people are suggesting llama-server's web ui, but that requires you to use local AI (llama.cpp), it persists content in your browser rather than on the server (so you can lose your chats), and it doesn't support much functionality.
- There are some pure-browser chat interfaces that are like llama-server but you can use remote LLMs. This is closer to what you want, but everything is stored in the browser, so backup is harder.
- There's LocalAI, which is like the llama-server option, but more stuff is built in and it persists data to disk. It's flashy and very easy if all you want to do is local AI.
- There's LM Studio, which is another thing like LocalAI, but a desktop app.
- There's OpenWebUI, where it's like LocalAI, except you don't do local inference, you use remote LLMs. It sucks to be honest, just stops working a lot of the time, UX is terrible, lots of weird bugs.
- There's OpenHands, which is more like Codex/Claude Code web UI. You run it locally and connect to remote LLMs. Kinda clunky, limited, poor design. Like most coding agents, it doesn't support all the features you would want, like LocalAI/OpenWebUI do.
- There's OpenCode's web UI, which is like OpenHands, but less crappy.
- There's Jan, which is probably what you want. It's a desktop app rather than a web UI.
Unfortunately it is pretty buggy, so I am maintaining a fork matching my personal needs with bugfixes and a few extra features.
LM Studio is nice in that it makes it easy to add tools, like search. Qwen 3.6 is such a small model that it lacks a lot of knowledge of the world (so it can hallucinate at an uncomfortable rate, which is a common failure mode of very small models), but it can use tools, so being able to search lets it research before answering. It has pretty good reasoning and tool calling, so it's actually pretty effective. I've been comparing Gemma 4 (31B at 8-bits, also very good with tools and reasoning for its size), Qwen 3.6 (27B at 8-bits), against Claude Opus and Gemini Pro lately. And, obviously the frontiers are better, but most of the time, I find the tiny models are fine. I'm still not quite at the point where I'd be willing to code with local models, as the time wasted on hallucinations and logic bugs and sloppy coding practices are much higher, as is the cost of security bugs that make it past review.
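If anyone wants to see how little plumbing the search-tool part needs, here is a rough sketch against a local OpenAI-compatible server (the port, model name, and stubbed search function are placeholders; wire the stub to whatever search backend you actually use):
  import json
  from openai import OpenAI
  client = OpenAI(base_url="http://localhost:1234/v1", api_key="none")  # LM Studio's default port
  def web_search(query):
      return f"(search results for {query!r} would go here)"  # stub
  tools = [{"type": "function", "function": {
      "name": "web_search",
      "description": "Search the web and return result snippets.",
      "parameters": {"type": "object",
                     "properties": {"query": {"type": "string"}},
                     "required": ["query"]}}}]
  messages = [{"role": "user", "content": "What changed in Granite 4.1?"}]
  resp = client.chat.completions.create(model="local-model", messages=messages, tools=tools)
  msg = resp.choices[0].message
  if msg.tool_calls:  # the model decided to search before answering
      messages.append(msg)
      for call in msg.tool_calls:
          args = json.loads(call.function.arguments)
          messages.append({"role": "tool", "tool_call_id": call.id,
                           "content": web_search(**args)})
      resp = client.chat.completions.create(model="local-model", messages=messages, tools=tools)
  print(resp.choices[0].message.content)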
* https://docs.ollama.com/integrations/claude-code
Quick vibe check of it- 8B @ Q6 - seems promising. Bit of a clinical tone, but can see that being useful for data processing and similar. You don't really want a LLM that spams you with emojis sometimes...
But yea dislike that style where each heading and bullet point gets an emoji
The article makes some good points about model design (how different size models within a family can get similar results, how to filter out hallucination, math result reinforcement), so that's worth understanding. It's analyzing a paper, which only discussed 3 sizes of the same model family. But what the article doesn't say is, compared to other model families, Granite 4.1 8B sucks. The only benchmarks it does well at compared to other models are non-hallucination and instruction following. Qwen 3.5 4B (among other models) easily outclasses it on every other metric.
This article teaches a valuable lesson about reading articles in general. You can take useful information away from them (yes, despite being written by LLM). But you should also use critical thinking skills and be proactive to see if the article missed anything you might find relevant.
I'm using Gemini 3.1 Pro to help me research my thesis. Even with search enabled and in pro mode, it still invents entire papers that don't exist, and lies about the contents of existing papers to relate them to the context or to appease me. If I submitted an LLM-written article based on the results it's given me, 80% of the article would be lies.
Commenting to complain that the article is LLM written is helpful too since some people aren't able to distinguish
The exact same thing is true of Human speech. You have no idea if anything a human says is true until you fact check it. But you don't fact check everything every person says, do you?
So what do you do instead? You use heuristics. Simple - and quite flawed - subconscious rules to stop worrying about things. You find a person you like, and you classify them "trustworthy", and believe almost all of what they say, not considering if any of it might be false. But of course, humans are fallible, and many of them receive "poisoned" input, and even hallucinate (making up information). They then spread that false information around. Yes, even the people you trust.
And when you're faced with something untrue, said by someone you trust, you rationalize it. "Oh, they just made a mistake." And you completely ignore that the person you trust told you a falsehood. Life is hard enough without having to question if everything we hear is false. So we just accept falsehoods from some people, and not others.
LLMs are likely more factual and knowledgeable today than humans are, thanks to their constant improvements via reinforcement. They're going to keep getting better too. But they'll never be perfect. Rather than rejecting anything they produce, my suggestion would be to do what you do with humans: trust them a little, verify big things, let the little things go, accept that there will be errors, and move on with life.
For sparse knowledge tasks, where you know that the model can't possibly have much training because even humans themselves don't have much knowledge there, use it as a brainstorming partner, not as a source. Or put relevant papers in its context to help you evaluate those papers in relation to your work. But it's just going to hurt itself in confusion trying to tie fuzzy ideas to sparse sources embedded in pages upon pages of mildly related Google search results.
Anti-AI people like to bring up hallucination as if everything AI generates is false.
I can write pages of text, with my own content, and then use AI to improve my writing and clarity. Then I review and edit. It might have some LLM markers in there, which I remove sometimes because it's distracting. But the final, AI assisted writing is easier to read and better organized. But all the ideas are mine. Hallucinations are not remotely a problem in this case.
If it's used to create a false narrative (like a deep fake), sure, you should care. But if it's used as an alternative to a stock photo, or as an easy way to make an infographic then no, I don't think you should care.
Why should I care? The world is full of false narratives.
How can I have the bandwidth to care about everything all of the time?
I swear that more than half of the complaining that I find here comes from privileged people bikeshedding over inane topics who have never had to really worry about serious survival-level (how am I going to eat today?) issues in their lives.
You're complaining about facts that have been true since words have been written on paper. If you read the article with the same criticality you read any other article, you won't have the problem you complain about.
The reality is, you're only complaining because you hate AI. Cool, but don't dress it up and resort to name-calling to browbeat the other guy.
If it has AI tells then I won't bother to continue reading, because it was either written by an AI or it was written by someone who can't tell the difference.
Either way it's probably a poor piece of writing.
I think instruction following is going to be the most useful thing these models do. Add a voice interface and access to a bunch of simple, straight-forward devices or APIs and you have a mildly useful assistant. If that can be done in 8B parameters it will soon run on edge devices. That's solid usefulness.
It's mind-boggling how bad current voice assistants sometimes are when you prompt them some fairly easy questions.
Maybe my point is something on the lines of "Just send me the prompt"[0]
[0] https://blog.gpkb.org/posts/just-send-me-the-prompt/
1) articles generated with context data that's trivial to find (or even embedded into the model)
2) articles generated with context data that's hard to find or not publicly available
No, they aren't.
You are comparing writing produced with little to no effort to writing produced with the minimal effort required to communicate.
It's reasonable for people to complain that they are presented material that not even the author thought was worth the effort.
But how can I tell if those are good points or not?
I don't want to invest time in reading something if the presence of those "good points" depends on a roll of the dice.
The problem is that in the past it took multiple times more effort and hours to write something than it took to read. That served two purposes:
1. Lazy people just looking for an audience were effectively gatekept from drowning the world with their every vapid thought.
2. Because supply was many times slower than consumption it was viable to give most articles a chance: the author could not drown me in a deluge even if they wanted to.
Having the criterion now that the author should spend at least as much effort creating the piece as they expect the reader to expend reading it is a damn useful bar: instead of reading 1000 AI articles just to find the one good one, I can simply read 10 human-authored articles and be certain that 9 of them have something worthwhile.
I already assume some comments here are LLM written.
I assume some people here have never programmed a single useful thing even once in their lives.
Right. This just says that Granite 4.1 8B is better than a previous version, Granite 4.0-H-Small, which has 32B, 9B active.
So, they made a less bad model than before. But that doesn't tell you anything about how it compares with other models.
I'm not sure it's proud as much as people voicing displeasure with the uncertainty about what went into the LLM prompt. This may have been a 1 sentence prompt, or it may have been some well researched background that simply reformatted it. Why waste minutes-hours on verifying it if it's possible someone could have spent 10 second on it? It's very easy to see their point.
People seem to suggest lately that anyone they disagree with voicing their opinion about anything is some kind of auto-fellatio; I wonder what causes them to think this way.
Why don't people edit out obvious sloppification if they expect to still have readers left?
I hear this sort of thing all the time now on YouTube from media/news personalities:
“And that’s the part nobody seems to be talking about.”
"And here's what keeps me up at night."
“This is where the story gets complicated.”
“Here’s the piece that doesn’t quite fit.”
“And this is where the usual explanation starts to break down.”
“Here’s what I can’t stop thinking about.”
“The part that should worry us is not the obvious one.”
“And that’s where the real problem begins.”
“But the more interesting question is the one no one is asking.”
“And this is where things stop being simple.”
It doesn't really worry me, but I think it's interesting that LLM speak sounds so distinctive, and how willing these media personalities are to be so obvious in reading out on TV what the LLM spat out.
I've never studied what LLMs say in depth, but it is interesting that my brain recognises the speech pattern so easily.
BuzzFeed and Upworthy etc pioneered this for web 'news stories', then it got used in linkedin, twitter, and everywhere where views are more important than the content.
A writing teacher once excoriated me for saying that something was important. “Don’t tell me it’s important, show me, and let me decide, and if you do your job I’ll agree”
I don’t know how a completion can tell when it needs to do this. Mostly so far it doesn’t seem capable
This is to say: marketers and spammers repeat the same things over and over, and these models are built on coalescing repetition into the basis.
So yeah, of course people talked like this before, but it was always in some known context like linked in or a spam website.
It's also exactly the Mr beast playbook, and got him to the largest channel on YouTube.
Any system attempting to capture human attention will use these techniques, nothing LLM-specific here at all.
No point creating busywork for yourself just shuffling words around when the information is there, no?
I guess it depends on what you want out of the article. Substance, or style?
If they aren't self-aware enough or smart enough to determine that what they wrote is indistinguishable from text generation, how probable is it that they have something of value to add to any thought?
Corporate announcements were never the places that literature and art were pushing the envelope. They were slop before, and they're slop now.
I ran it in LM Studio and got a pleasingly abstract pelican on a bicycle (genuinely not bad for a tiny 3B model - it can at least output valid SVG): https://gist.github.com/simonw/5f2df6093885a04c9573cf5756d34...
I have been using it with their Chunkless RAG concept and it fits in very well! (For the curious: https://github.com/scub-france/Docling-Studio)
I'm convinced that SLMs are a real part of the solution for truly integrated AI in processes...
It is not the researchers' fault that some slop got posted here instead.
The gap that still matters most isn't intelligence — it's consistency on structured output. When you chain 5+ tool calls in sequence, even a small per-call reliability difference compounds fast. Would love to see Granite 4.1 benchmarked specifically on multi-step function calling rather than just general benchmarks.
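To make the compounding concrete (illustrative per-call success rates, not measured numbers):
  # Chance a 5-step tool chain completes with every call succeeding.
  for p in (0.99, 0.95, 0.90):
      print(f"{p:.2f} per call -> {p ** 5:.3f} for the chain")
  # 0.99 -> 0.951, 0.95 -> 0.774, 0.90 -> 0.590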
But I don’t think it necessarily saved training cost; if it did, I’d be interested to learn how!
https://arxiv.org/abs/2101.03961
I doubt MoE is actually worth it, given how complicated high-performance expert routing and training is. But who knows, I don't.
Link to HF collection: https://huggingface.co/collections/ibm-granite/granite-41-la...
If techniques existed to shift from "guess the next highly probable" token to "guess the best next logical step", as some interpreted said research, should not that be the foremost objective?
Then something broke. The RLHF stage, while improving chat quality, caused math benchmark scores to drop. GSM8K and DeepMind-Math both regressed."
Observation: Math (which, when fully decomposed, results in Logic) is at the core of how computers (and traditional/older, non-LLM programming languages) work. If an LLM gets Math training wrong at any stage for any reason, then, in my opinion, that should be viewed as something that needs to be fixed at a lower level, not a higher one; not a later training level...
https://huggingface.co/collections/ibm-granite/granite-embed...
311M and 97M versions.
Granite Vision 4.1; Granite Speech 4.1; Granite Guardian 4.1; Granite Embedding Multilingual R2 - with, of course, the "Small Language Models"
https://research.ibm.com/blog/granite-4-1-ai-foundation-mode...