VibeThinker: 3B param model that beats Opus 4.5 on reasoning with novel SFT+GRPO

(arxiv.org)

52 points | by timhigins 2 hours ago

8 comments

secretslol 5 minutes ago
Am I right in thinking this is a tiny model which has been trained well to reason, and that's it? Makes me think of a smart person who doesn't know anything about a given topic, but with the right tools will go and research the heck out of it. I really like the sound of this... why have models train on learning anything when you can just train them how to learn and let them get on with it from something as small as a Pi Zero and an internet connection.
gslepak 44 minutes ago
Note that these are Python-only results, the model will not do as well with other languages.
I'm glad to see more domain-focused SLMs, we need more of them! A programming focused MoE should work well across many languages.
deftio 30 minutes ago
There is some base level of intelligence any model needs to be useful, even in narrow tasks.
Could you teach a 5 year old to drive a car? A 10 year old? A 12 year old? To drive a car requires being able to read, to have judgement about ice or rainy conditions, to anticipate a child running after a ball. By the time a human is in their mid teens they have acquired the base knowledge...
Small models need to have enough base knowledge to be able to be good enough -- even in a seemingly narrow regime. Where is that? Obviously they don't need all the obscure knowledge of a frontier model but there is some base level which is probably more than it would first seem.
aero2146 1 hour ago
I tried generating the classic pelican svg, but it failed horribly just showing me a rectangle and a black circle...
- fwipsy 1 hour ago
  I think this is predicted? Part of the story is how they were able to preserve core reasoning ability while cutting knowledge like "pelicans have wings."
  > these findings motivate the Parametric Compression-Coverage Hypothesis, which views verifiable reasoning as compressible into compact reasoning cores, while open-domain knowledge and general-purpose competence require broad parameter coverage over facts, concepts, and long-tail scenarios.
  - pylotlight 47 minutes ago
    The only real essential item here is tool calling capability, is it not? So I assume they tested for strong consistency with read/write/edit tools?
    - nsingh2 23 minutes ago
      This model doesn't support tool calling; that was not part of its training. It's focused on Python (and I think C++) competitive programming and mathematics tasks, i.e. tasks with verifiable rewards. So if you have a task that fits that description, the size-to-capability ratio is good.
      These kinds of models might be more useful as tools to be used by larger orchestrator models, than being the orchestrators themselves.
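To make "verifiable rewards" concrete, here is a minimal sketch of an all-or-nothing reward for a generated Python solution. The function name `binary_reward`, the `solve` entry point, and the test-case format are assumptions for illustration, not anything taken from the paper.

```python
def binary_reward(candidate_source, cases, entry_point="solve"):
    """Return 1.0 only if the candidate passes every test case, else 0.0.

    `cases` is a list of (args_tuple, expected_result) pairs.
    `entry_point` is a hypothetical convention for the function to call.
    """
    namespace = {}
    try:
        # NOTE: exec on untrusted model output is unsafe; a real harness
        # would run this in a sandbox with time and memory limits.
        exec(candidate_source, namespace)
        fn = namespace[entry_point]
        for args, expected in cases:
            if fn(*args) != expected:
                return 0.0
    except Exception:
        # Syntax errors, missing entry point, and runtime errors all score 0.
        return 0.0
    return 1.0


cases = [((1, 2), 3), ((0, 0), 0)]
good = "def solve(a, b):\n    return a + b\n"
bad = "def solve(a, b):\n    return a - b\n"
print(binary_reward(good, cases), binary_reward(bad, cases))
```

The point is only that the score comes from running the answer, not from a judge model, which is why math and competitive programming fit this kind of training.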
    - btown 22 minutes ago
      I'm not seeing any mention of tools in the paper, much less a bias towards "curiosity" to use those tools when it encounters gaps in its knowledge. So perhaps this is a good proof-of-concept that single-pass code generation is viable with this small a model - but we're still a long way from a viable solution.
- realitysballs 1 hour ago
  That’s all I needed to hear
  - pylotlight 48 minutes ago
    As in, you learnt that a useless test that no one should be using was tested here, that's what you meant right?
- physPop 1 hour ago
  It's for reasoning, not generating art?
  - websap 1 hour ago
    Can you explain this a bit more?
    - tyre 53 minutes ago
      Imagine you want to make a smaller model that is really good at one thing, say, driving a car. You could remove the parameters that lead it to correctly answer, "What is the powerhouse of the cell?" or, "Who was the first president of the United States?"
      It would look really dumb if someone asked it that, but that's fine. You're trying to make a model that is optimized for efficiency for a specific task. As much as possible, you should prune uncorrelated things.
    - pylotlight 47 minutes ago
      SVG generation is a useless test, what's there more to know?
      - steve_adams_86 25 minutes ago
        What if you're reasoning about how to generate SVG correctly?
        - Mtinie 5 minutes ago
          In this case, I’d expect it to make a web search tool call to find the Python library best suited to SVG generation and manipulation, and then use what it learns there to execute the task you’ve asked it to do (either asking if you’d like to incorporate the library as a dependency or rolling its own implementation of a subset of the features, if that is your preference), assuming tool calling hasn’t been entirely stripped out of this model.
          (Edit) No tool calling, per this comment: https://news.ycombinator.com/item?id=48640189
noperator 1 hour ago
Having some success while testing this model out as a replacement for GPT-5 nano in source code security review. Running on an RTX 3090 (24 GB VRAM) via vLLM. It's not great on structured output (as noted in the model card) but I'm working around that in my harness.
- dummydummy1234 43 minutes ago
  Can't you just force it to do structured output via constrained generation?
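For anyone unfamiliar with the idea, here is a toy sketch of constrained generation: at each step, tokens the grammar forbids are excluded before picking the best-scoring one. Everything here is hypothetical (the fixed `{"vulnerable": true|false}` template, `allowed_next`, `toy_scores`); it does not use vLLM's or any other library's actual API.

```python
import json

# Hypothetical fixed template; None marks the one slot the model may choose.
TEMPLATE = ["{", '"vulnerable"', ":", None, "}"]


def allowed_next(prefix):
    """Return the set of tokens the toy grammar permits after `prefix`."""
    i = len(prefix)
    if i >= len(TEMPLATE):
        return {"<eos>"}
    if TEMPLATE[i] is None:
        return {"true", "false"}
    return {TEMPLATE[i]}


def constrained_decode(score_fn, max_steps=10):
    """Greedy decoding restricted to grammar-allowed tokens."""
    out = []
    for _ in range(max_steps):
        scores = score_fn(out)
        # sorted() keeps tie-breaking deterministic.
        candidates = sorted(allowed_next(out))
        token = max(candidates, key=lambda t: scores.get(t, float("-inf")))
        if token == "<eos>":
            break
        out.append(token)
    return "".join(out)


def toy_scores(prefix):
    """Stand-in for a model that would rather start with prose than JSON."""
    return {"Sure": 5.0, "true": 1.0, "false": 0.5}


result = constrained_decode(toy_scores)
print(json.loads(result))
```

Even though the stand-in model scores "Sure" highest, it can never be emitted, so the output always parses. Whether forcing the format hurts answer quality on a model that is weak at structured output is a separate question.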
SwellJoe 25 minutes ago
It's terrible at hunting security bugs (I expected it to be, but I wanted to be sure). I added it to a benchmark I made with a corpus of some Mythos-discovered bugs, and it found zero. The smallest pretty successful models remain Qwen 3.6 and Gemma 4 (but I haven't tested the very small variants of those yet).
https://swelljoe.com/post/will-it-mythos/
- nsingh2 15 minutes ago
  The lack of tool use will hinder it a lot I think, since bug hunting requires collecting context across a code base and stitching it together. It might be good in a narrower sense, i.e. "is there a bug in this block of code", without considering how it interacts with the rest of the code base.
  That's also more aligned with its leetcode-style training data, where the code under test is fully in the context window. It might be interesting to have a bigger tool-use model go through the effort of collecting the context and feeding it into this kind of model for analysis only. It becomes more of a thinking tool, instead of the orchestrator.
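A rough sketch of that split, assuming the context collection is done outside the small model and the small model only ever sees one self-contained prompt. `collect_context`, `review`, and the `analyze` callable are hypothetical names; `analyze` stands in for whatever call actually reaches the small model.

```python
def collect_context(files, symbol, max_chars=4000):
    """Gather every file that mentions `symbol` into one text block.

    `files` maps path -> source text. In the setup described above, a
    larger tool-using model would do this step; here it is a plain
    substring search for illustration.
    """
    chunks = []
    for path in sorted(files):
        if symbol in files[path]:
            chunks.append(f"# file: {path}\n{files[path]}")
    # Truncate so the whole prompt fits a small context window.
    return "\n\n".join(chunks)[:max_chars]


def review(files, symbol, analyze):
    """Build a self-contained prompt and hand it to the analysis-only model."""
    context = collect_context(files, symbol)
    prompt = f"Is there a bug involving `{symbol}` in the code below?\n\n{context}"
    return analyze(prompt)


files = {
    "a.py": "def parse(x):\n    return x[0]\n",
    "b.py": "print('hi')\n",
    "c.py": "from a import parse\nparse([])\n",
}
context = collect_context(files, "parse")
# Stub analyzer so the sketch runs without any model attached.
verdict = review(files, "parse", lambda prompt: "stub: " + str(len(prompt) > 0))
print(verdict)
```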