From Gemma and Qwen to Ornith 1.0: the open-source flywheel

DeepReinforce’s Ornith-1.0 matches Claude Opus 4.7 on agentic coding by fine-tuning open Gemma 4 and Qwen 3.5 weights with reinforcement learning – a clean case study in how open-source compounds.

TL;DR

DeepReinforce released Ornith-1.0 on June 25, 2026 – a family of open-source coding models built on top of Gemma 4 and Qwen 3.5 weights, fine-tuned with a self-improving RL framework. The flagship 397B model scores 77.5 on Terminal-Bench 2.1 and 82.4 on SWE-Bench Verified, surpassing Claude Opus 4.7 on both benchmarks at time of release. But the more durable story isn’t in the benchmark numbers – it’s in what made them possible: Google and Alibaba released their weights openly, a specialist research team applied years of RL expertise on top, and the result is now available to anyone with the hardware to run it. DiscreteStack began serving Ornith-1.0 before official inference providers deployed it, which means enterprises that care about data sovereignty can already use it on their own infrastructure. Benchmarks will age – this model architecture won’t be the last word – but the pattern it represents is here to stay.

The open source flywheel – a story about compounding

There’s a particular dynamic in open-source AI that doesn’t get enough credit: the best open release today becomes the foundation for the best open release next quarter.

It’s not that every open-weight model is equally good. It’s that each good one lowers the floor for what comes next. When Google released Gemma 4 and Alibaba released Qwen 3.5 openly, they weren’t just sharing model weights – they were handing hundreds of thousands of GPU-hours of pretraining to anyone with the right expertise to apply fine-tuning on top. The question then becomes: who has that expertise, and what are they building?

Open-source AI compounding chain; from base weights to enterprise deployment, as taken from Deep Reinforce.

DeepReinforce is one answer. They’re a research group whose entire body of work is about applying reinforcement learning to make systems self-improve – not just LLMs, but CUDA kernels (CUDA-L2 achieves up to 28.7% speedup over cuBLAS in server-like inference settings), nearest-neighbor search algorithms (CRINN), and competitive programming AI (GrandCode, which ranked first across three consecutive Codeforces live competitions, beating every human participant including legendary grandmasters). Ornith-1.0 is the natural next step: take open base weights, apply deep RL expertise, and ask what happens to coding performance.

The answer is that it matches – and in several benchmarks surpasses – Claude Opus 4.7.

That’s the compounding flywheel in action.

What Ornith-1.0 actually is

Ornith-1.0 is a family of four models: 9B Dense, 31B Dense, 35B MoE, and 397B MoE. All are available on HuggingFace under an open license. None are deployed by official inference providers as of this writing – a point we’ll come back to.

The core technical innovation is what DeepReinforce calls a self-improving training framework. The name is accurate: rather than using a fixed, human-designed harness to guide solution generation during RL training, Ornith-1.0 learns to generate both the solution and the scaffold that guides the solution. These co-evolve over training.

Ornith self-improving training framework diagram

Concretely: each RL step first proposes a refined scaffold conditioned on the task and prior scaffold, then generates a solution rollout conditioned on that scaffold. Reward from the rollout propagates back to both stages. Over thousands of steps, this produces scaffolds that are progressively better at eliciting high-reward trajectories – without anyone having to design them by hand.

The obvious risk is reward hacking. If the model can modify its own evaluation harness, it might learn to satisfy the verifier without doing the actual work. DeepReinforce defends against this on three levels:

Immutable environment boundary – the tool surface, environment, and test isolation are fixed and outside the model’s reach.
Deterministic monitor – flags any attempt to read withheld paths or modify verification scripts, assigning zero reward and excluding from advantage computation.
Frozen LLM judge – a secondary veto on top of the verifier to catch intent-level gaming within the allowed tool surface.

For long rollouts, they also address the off-policy problem in RL training with a staleness weight that downweights older tokens and drops them once a threshold is exceeded – a practical engineering solution to a real problem in RL training at this scale.

The benchmark picture (and how to read it)

Numbers first, context second.

The flagship Ornith-1.0-397B scores 77.5 on Terminal-Bench 2.1 (Terminus-2 harness) and 82.4 on SWE-Bench Verified. Claude Opus 4.7 scores 70.3 and 80.8 respectively. Claude Opus 4.8 – a newer release – scores 85.0 on Terminal-Bench 2.1 and 87.6 on SWE-Bench Verified, which does outperform Ornith-1.0-397B. The 397B model’s moment at the top of the open-source leaderboard is real, but the frontier keeps moving.

Ornith-1.0-397B evaluation results comparing against Claude Opus 4.7 and other leading models

How to read these numbers honestly: benchmarks are a snapshot. The SWE-Bench and Terminal-Bench scores reflect specific evaluation harnesses at a specific point in time. New models will move these tables. What’s less likely to change is the pattern: a well-applied RL method can substantially close the gap between open-weight fine-tunes and closed frontier models. The specific scores in this post will date; the method will not.

Once companies focus on what they do best and share their technical advances as open source for others to build upon, there’s no ceiling on what’s possible.

The benchmark comparison in full

For reference, here’s the complete benchmark table from the Ornith-1.0 release:

Benchmark	Ornith-1.0-397B	Qwen3.5-397B	Qwen3.7-Max	GLM-5.2-744B	Minimax-M3	DeepSeek-V4-Pro	Claude Opus 4.7	Claude Opus 4.8
Terminal Bench 2.1 (Terminus-2)	77.5	53.5	73.5	81.0	64.0	64.0	70.3	85.0
Terminal Bench 2.1 (Claude Code)	78.2	48.6	69.8	82.7	–	66.5	69.7	78.9
SWE-Bench Verified	82.4	76.4	80.4	–	–	80.6	80.8	87.6
SWE-Bench Pro	62.2	51.6	60.6	62.1	59.0	55.4	64.3	69.2
SWE-Bench Multilingual	78.9	69.3	78.3	–	–	76.2	–	–
NL2Repo	48.2	36.8	47.2	48.9	42.1	–	–	69.7
ClawEval Avg	77.1	70.7	65.2	–	–	75.8	78.2	–

Source: deep-reinforce.com/ornith_1_0.html

What open source actually unlocks

The benchmark performance is interesting. What it represents is more interesting.

Before Ornith-1.0, if you wanted Claude Opus 4.7-level coding performance, you had to either pay Anthropic per token and send your code to their servers, or deploy frontier level open source models with trillions of parameters. Now there’s an open-weight alternative that matches both on several benchmarks. And since it is open source it can be downloaded, audited, fine-tuned, quantized, and run entirely on your own hardware.

It means the “intelligence tax” argument for sending code to a closed API has weakened substantially. Our analysis of open vs. closed model performance shows the gap between the best open-weight and best closed models has narrowed to roughly three months on average. In some domains, including coding, it has effectively disappeared.

Terminal-Bench 2.1 scores comparison: open-source vs closed models (June 2026)

For AI infrastructure providers, it creates a new product to offer: serving a model that matches closed alternatives at a fraction of the per-token cost, on hardware you control.

The Linux Foundation reports that 90% of organizations now cite open source as essential to their sovereignty strategy. That number sounds inflated until you consider what’s at stake: vendor lock-in, data residency, compliance with the EU AI Act, and the ability to audit what the model is actually doing with your sensitive code.

What is really the latest model?

Here’s a detail worth noting: in the days following the June 25th Ornith-1.0’s release, HuggingFace’s model card explicitly stated it isn’t deployed by any inference provider. OpenRouter, which has become the default discovery layer for many developers, hasn’t listed it as well. The major cloud AI APIs haven’t added it.

DiscreteStack had Ornith-1.0 available for its private AI infrastructure since day 0. This is worth mentioning. The model exists. The weights are public. And you can have latest state of the art model, without routing code through a third-party cloud.

Private deployment vs cloud API; data sovereignty and cost comparison with DiscreteStack

The cost argument is separate from the sovereignty argument, but worth stating. DiscreteStack’s comparison calculator puts the effective per-token cost at €0.24/1M tokens on their platform, versus €1.30/1M tokens via hyperscaler APIs. For a 50-person engineering team that pushes through heavy AI-assisted coding workflows, that works out to roughly €341,000 in annual savings. The off-hours compute is a bonus: when developers aren’t generating code, the same hardware can handle RAG indexing, batch inference, report generation, and agentic workflows – at no incremental cost.

The DiscreteStack stack runs on Hopper (H100/200), Blackwell (B100/B200) and RTX (5000/6000 Pro) hardware, with each model compiled specifically for the target GPU topology. The API is OpenAI- and Anthropic-compatible, meaning Claude Code, OpenCode, LibreChat, and any other tool that speaks those protocols works without changes.

The honest assessment

Ornith-1.0 is a real result. The 397B model’s performance on Terminal-Bench 2.1 and SWE-Bench Verified at time of release is genuinely competitive with both closed frontier models and frontier level open source models – in several cases surpassing them.

It is also worth being clear-eyed about the limits. Claude Opus 4.8 scores above Ornith-1.0-397B on most of the same benchmarks. Benchmarks measure specific harness-defined tasks, not all real-world coding scenarios. The 397B variant still requires enterprise-grade hardware to serve at reasonable latency. And the model is freshly released – production track record takes time to develop.

What’s durable is the pattern: an open-source research team, starting from publicly available base weights, applies a focused method and produces results genuinely competitive at the frontier — then releases them openly for the next team to build on. It will happen again. The question for enterprises is whether they’re positioned to take advantage of the next iteration, or locked at vendor APIs.

The Stanford HAI 2026 AI Index confirmed that model leadership is now fluid – no single provider holds the frontier for more than a few months. The teams that capture value from that cycle are the ones that can evaluate, adopt, and serve new open models faster.

Try DiscreteStack

DiscreteStack is a private AI operating system for enterprises – a hardware-native platform that runs the best open-weight models, including Ornith-1.0, on your own NVIDIA GPU infrastructure. Fixed annual licensing at compute node per year, OpenAI- and Anthropic-compatible API, and complete trial up and running in 24 hours.

For engineering teams that need frontier-class coding capability without sending code to a third-party cloud, or worrying about AI cost, this is what that looks like in practice.

Request trial access at discretestack.com/try

Back to blog