
Laguna XS.2 and M.1: A Deeper Dive

We’ve released the first two models in the Laguna family, Laguna M.1 and Laguna XS.2, alongside the runtime we use to train and operate agents, available through two product experiences in preview.

Laguna M.1 came first, finishing pre-training at the end of last year; it's the foundation for everything else we're building across the family. Laguna XS.2 is a much smaller model, but remarkably capable for its size, and it's our first open-weight release. Both models are free to use for a limited time via our API and on OpenRouter, and Laguna XS.2 weights are also available under an Apache 2.0 license.

Laguna XS.2 and Laguna M.1 are agentic coding models built for long-horizon work. To date, we’ve been focused on serving our government and public sector clients with capable models deployable into the highest-security environments. And while our commitment to these customers remains, we’re now ready to share where we are with the world. We’re also excited to release the weights of Laguna XS.2, our newest generation model, to the open ecosystem to support builders and the wider research community.

We're working toward models that enable more capable agents, and we believe the path runs through coding capability and increasingly long-horizon tasks. Creating software is the core skill through which many other capabilities get expressed.

Today, most agents interact with the world through tool calling, where structured interfaces restrict agents to a fixed set of actions defined in advance. We think this is a transitional pattern. Software is a much more expressive interface. An agent that can write and execute code can compose actions, parallelize work, and build its own ad-hoc systems to interact with the world.

These models are the work of the roughly 60 people who make up our Applied Research organization, across architecture, data, pre-training, and reinforcement learning. We're excited to bring this work into the world and see what the community builds with it.

_[Chart: benchmark comparison of Laguna M.1 (225B-A23B) and Laguna XS.2 (33B-A3B) against Qwen3.5 (397B-A17B), Qwen3.5 (35B-A3B), Qwen3.6 (35B-A3B), and Claude Sonnet 4.6.]_

**Laguna M.1** is our most capable model to date and completed pre-training at the end of last year. It's a 225B total parameter Mixture of Experts (MoE) model with 23B activated parameters, trained completely in-house and from scratch on 30T tokens, using 6,144 interconnected NVIDIA Hopper GPUs. Laguna M.1 reaches **46.9% on SWE-bench Pro and 40.7% on Terminal-Bench 2.0**.

_[Chart: Laguna M.1 (225B-A23B) benchmark comparison against Devstral 2 (123B dense)†, GLM-4.7 (355B-A32B), DeepSeek-V4-Flash (284B-A13B), Qwen3.5 (397B-A17B), and Claude Sonnet 4.6.]_

_† We have chosen to include dense models with larger activated parameter counts to highlight the relative efficiency of MoE models._

**Laguna XS.2** is our second-generation MoE and our first open-weight model, built on everything we've learned since training Laguna M.1 across data (including synthetic data) and RL. At 33B total parameters with 3B activated (30T tokens trained), it's a highly capable open-weight agentic coding model in its weight class, reaching **44.5% on SWE-bench Pro and 30.1% on Terminal-Bench 2.0**. The weights are available for download today under Apache 2.0.

_[Chart: Laguna XS.2 (33B-A3B) benchmark comparison against Devstral Small 2 (24B dense)†, Gemma 4 (31B dense)†, Qwen3.5 (35B-A3B), Qwen3.6 (35B-A3B), Claude Haiku 4.5, and GPT-5.4 Nano.]_

_† We have chosen to include dense models with larger activated parameter counts to highlight the relative efficiency of MoE models._

Our agent harness, an Agent Client Protocol (ACP) server, is the same one we use for agent RL training and evaluation. We're releasing it alongside the models because we believe models and agents should be seen and used together as the gap between them closes.

| | Laguna M.1 (225B-A23B) | Devstral 2 (123B dense) | GLM-4.7 (355B-A32B) | DeepSeek-V4-Flash (284B-A13B) | Qwen3.5 (397B-A17B) | Claude Sonnet 4.6 (-) |
| --- | --- | --- | --- | --- | --- | --- |
| SWE-bench Verified | 72.5 | 72.2 | 73.8 | 79.0 | 76.2 | 79.6 |
| SWE-bench Multilingual | 67.3 | 61.3 | 66.7 | 73.3 | 69.3 | - |
| SWE-bench Pro | 46.9 | - | - | 52.6 | 50.9 | - |
| Terminal-Bench 2.0 | 40.7 | 32.6 | 41.0 | 56.9 | 52.5 | 59.1 |

| | Laguna XS.2 (33B-A3B) | Devstral Small 2 (24B dense) | Gemma 4 (31B dense) | Qwen3.5 (35B-A3B) | Qwen3.6 (35B-A3B) | Claude Haiku 4.5 (-) | GPT-5.4 Nano (-) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SWE-bench Verified | 68.2 | 68.0 | 52.0 | 69.2 | 73.4 | 73.3 | - |
| SWE-bench Multilingual | 62.4 | 55.7 | 51.7 | 60.3 | 67.2 | - | - |
| SWE-bench Pro | 44.5 | - | 35.7 | 44.6 | 49.5 | 39.5 | 52.4 |
| Terminal-Bench 2.0 | 30.1 | 22.5 | 42.9 | 40.5 | 51.5 | 29.8 | 46.3 |

Footnotes: All benchmarking for Laguna M.1 and Laguna XS.2 was completed using the Laude Institute's Harbor Framework with our agent harness, using a maximum of 500 steps and sandboxed execution using 8 GB RAM/2 CPUs (with the exception of Terminal-Bench 2.0; see below). The same sampling parameters were used across both models and for all benchmarking: `temperature=0.7` and `top_k=20`.

Some base task images and verifiers were patched to fix infrastructure reliability issues inherent in task setup, such as rate limits on third-party dependencies in external registries used by the verifier. More details outlining these updates and other findings will follow in a future technical blog post.

- SWE-bench Pro: mean pass@1 averaged over 3 runs.

- SWE-bench Verified: mean pass@1 averaged over 4 runs.

- SWE-bench Multilingual: mean pass@1 averaged over 7 runs.

- Terminal-Bench 2.0: mean pass@1 averaged over 5 runs; sandboxed execution used 48 GB RAM/32 CPUs.

- We used the highest publicly referenced scores for all comparison models on each benchmark. In all cases these were official scores published in release blog posts or equivalent, with the exception of Gemma 4 31B IT, where the highest published scores were reported by the Qwen team, and Claude Haiku 4.5, where the highest published (verified) scores for SWE-bench Pro and Terminal-Bench 2.0 are from their respective official leaderboards.

Open weights

Laguna XS.2 is our first open-weight model. Until now, we've been focused on building for the public sector, where security requirements like on-prem and air-gapped deployments make shipping frontier models a uniquely hard but important problem. That work continues and remains core to what we do.

At the same time, we believe the West needs strong open-weight models, and we want to contribute to that ecosystem. The fastest way for us to improve our models is to bring the world along in building and evaluating them, and we want people to know they can look to us to contribute going forward.

The most exciting applications of foundation models come from people building on top of capable starting points. If you want to fine-tune, quantize, or serve, the weights are yours. We will release Laguna XS.2-base soon.

- OpenRouter

- Ollama

Coming soon

We're bringing Laguna XS.2 to more leading frameworks in the coming weeks, with the help of our partners and the community.

Working with NVIDIA

Every aspect of our Laguna series, from data curation and pre-training through post-training, was conducted on NVIDIA hardware. Additionally, Laguna XS.2 is supported in NVIDIA TensorRT-LLM on Day 1. We're also providing an NVFP4 version of Laguna XS.2, so you can expect strong performance on NVIDIA Blackwell architecture.

Model building

We train all our models from scratch. That means our own data work, our own training codebase (Titan, which we cover in this blog post), and our own agent RL infrastructure. Laguna pushed the limits of that stack, particularly across three domains: our data pipeline including synthetic data, how we optimized the efficiency of the Muon optimizer, and our asynchronous online RL scheme.

Data and automixing

Both Laguna M.1 and XS.2 were trained on more than 30T tokens. Reaching that scale, and using it productively in training, required pushing the limits of data generation, processing, curation, and mixing.

#### Large-scale web data

We take great care in building and curating our datasets. We treat web data curation as a joint optimization of quality and diversity. We model quality as a continuous, multi-dimensional signal and rank data using a composite score, using models heavily across the stack for quality signals. Crucially, we don't only keep top-quality data. We found it to be biased toward STEM and reasoning, so we retain portions of mid- and lower-quality buckets to preserve diversity, which is critical for generalization.
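As a toy illustration of bucketed retention, the sketch below ranks documents by a composite score and keeps a decreasing fraction of each quality bucket. The signal names, weights, and retention rates are invented for illustration; this is not our production scorer.

```python
import numpy as np

def composite_score(signals: dict[str, float]) -> float:
    # Combine model-based quality signals into one scalar. The dimensions
    # and weights here are placeholders, not the scorer we actually use.
    weights = {"educational_value": 0.4, "coherence": 0.3, "factuality": 0.3}
    return sum(w * signals[k] for k, w in weights.items())

def retain(docs: list[dict], rng: np.random.Generator) -> list[dict]:
    ranked = sorted(docs, key=lambda d: composite_score(d["signals"]), reverse=True)
    n = len(ranked)
    # Keep all of the top bucket but only portions of the mid and low buckets,
    # preserving diversity instead of hard-thresholding on quality alone.
    buckets = [(ranked[: n // 3], 1.0),
               (ranked[n // 3 : 2 * n // 3], 0.5),
               (ranked[2 * n // 3 :], 0.1)]
    return [d for bucket, rate in buckets for d in bucket if rng.random() < rate]
```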

Compared to precision-focused pipelines optimized for short token horizons, this approach yields ~2× more unique tokens while maintaining performance. The gain persists when scaling to longer training horizons, which highlights the importance of diversity alongside quality.

We also conducted a detailed deduplication analysis and confirmed FineWeb's hypothesis that global deduplication disproportionately removes high-quality data. By matching the quality distribution between global and snapshot deduplication, we could further close the gap on downstream performance.

#### Synthetic data

To round out natural web data, we use synthetic data to complement the training mix along dimensions that are otherwise hard to control. In Laguna XS.2, it contributes about 13% of the final training mix across all pre-training stages, building on organic data rather than replacing it and expanding coverage where it falls short. The Laguna series uses more than 4.4T synthetic tokens in total.

To preserve diversity and validity at pre-training scale, our work spans a spectrum between seed-heavy and pipeline-heavy generation. At the seed-heavy end, we reshape content across formats (Q&A, structured lists, dialogue, and so on) to regularize how information is presented, so the model sees valuable artifacts from multiple angles. At the pipeline-heavy end, we move into feature extraction and recomposition, surfacing implicit reasoning, structure, and relationships, and teaching them in new forms and contexts.
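As a toy illustration of the seed-heavy end, the sketch below re-renders one source document into several formats via a hypothetical `rewrite` model call; the prompts and function are stand-ins, not our pipeline.

```python
# Toy sketch of seed-heavy format reshaping: the same source document is
# re-rendered as Q&A, a structured list, and a dialogue. `rewrite` stands in
# for a model-backed generation call; the prompts are illustrative only.

FORMATS = {
    "qa": "Rewrite the document as question-answer pairs:\n{doc}",
    "list": "Rewrite the document as a structured list of facts:\n{doc}",
    "dialogue": "Rewrite the document as a dialogue between two experts:\n{doc}",
}

def reshape(doc: str, rewrite) -> dict[str, str]:
    # One source document yields several synthetic views of the same content.
    return {name: rewrite(tmpl.format(doc=doc)) for name, tmpl in FORMATS.items()}
```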

We also expand synthetic generation beyond narrow, high-signal domains. Alongside STEM and code, we apply these pipelines across the broader data mix, expanding coverage while maintaining high, grounded signal density.

Our approach is designed to integrate within the larger training ecosystem, focusing on robustness and letting synthetic data contribute earlier and more consistently throughout training.

#### AutoMixer: data mixture optimization

Data curation and the mixture that goes into training have an outsized impact on final model performance. We developed an automixing framework to systematically explore and optimize pre-training data mixtures. Instead of relying on manual heuristics, each run of the automixer trains a swarm of ~60 sufficiently large proxy models, each on a different data mix, and measures performance across key capability groups (code, math, STEM, common sense). From these runs, we fit surrogate regressors that approximate how changes in dataset proportions affect downstream evaluation. That gives us a learned mapping from data mix to performance, which we can directly optimize to propose improved mixtures. The approach is inspired by recent work such as Olmix, MDE, and RegMix, but adapted to our setting with richer data groupings and controlled exploration around a strong prior to fit the tight surrogate budget.

The learned signals are both intuitive and informative. Code performance is strongly driven by synthetic and curated code sources, while general web data tends to hurt it. Math performance benefits primarily from diverse web math data. STEM knowledge correlates with academic and educational text. Importantly, the regression recovers these expected relationships while providing a much more fine-grained view of how individual subsets contribute, enabling more precise trade-offs. To avoid trivial solutions (for example, over-indexing on a single high-signal source), we regularize the optimization toward a baseline mix and simulate the target repetition rate.
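A minimal sketch of the surrogate-and-optimize idea: fit a regressor from mix proportions to a capability score, then optimize the mix on the simplex with a pull toward the baseline. The proxy-run results below are random stand-ins, and the regressor choice and regularizer are illustrative, not our actual setup.

```python
import numpy as np
from sklearn.linear_model import Ridge
from scipy.optimize import minimize

n_sources, n_runs = 8, 60
rng = np.random.default_rng(0)
mixes = rng.dirichlet(np.ones(n_sources), size=n_runs)  # one mix per proxy model
scores = rng.normal(size=n_runs)                        # stand-in eval results

surrogate = Ridge(alpha=1.0).fit(mixes, scores)         # mix -> predicted score

baseline = np.full(n_sources, 1.0 / n_sources)          # strong prior mix
lam = 0.5   # strength of the pull toward the baseline (placeholder value)

def objective(w):
    # Maximize predicted score while staying near the baseline mix.
    return -surrogate.predict(w[None, :])[0] + lam * np.sum((w - baseline) ** 2)

res = minimize(objective, baseline,
               constraints={"type": "eq", "fun": lambda w: w.sum() - 1.0},
               bounds=[(0.0, 1.0)] * n_sources)
print("proposed mix:", np.round(res.x, 3))
```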

When scaled to a significantly larger model and longer training horizon, the optimized mix delivered substantial gains on targeted capabilities, particularly code and math, relative to a strong prior baseline obtained through a series of independent and more costly ablations, and without compromising generalization to held-out benchmarks.

Muon

Through all training stages of Laguna XS.2 and Laguna M.1, we used an internal distributed implementation of the Muon optimizer. In our initial pre-training ablations, we were able to achieve the same training loss as an AdamW baseline in ~15% fewer steps, with large absolute evaluation uplifts on the final model, while also achieving learning rate transfer across model scales.

Compared to many other optimizers, Muon incurs significant compute overhead, which we tackle by distributing that compute across ranks. At a high level, Muon needs to aggregate the gradients into a momentum buffer, apply Nesterov momentum to the gradients, perform orthogonalization of the gradients via Newton-Schulz, and update the parameters with the orthogonalized gradients. Naively, each rank would need to do this for every full parameter. Our implementation assigns each parameter and gradient to only one of the ranks sharding it, gathers the full parameter and gradient on that rank, performs Newton-Schulz, and redistributes the corresponding orthogonalized gradient shards back to all other ranks within the group, which then update their local parameter shards. That effectively removes the compute bottleneck of the Muon optimizer, at the cost of additional communication.
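For reference, here is a compact sketch of the Newton-Schulz orthogonalization at the core of Muon, using the coefficients from the public reference implementation. In the scheme above, only the rank that owns a parameter would run this on the gathered full gradient before scattering the result back to the other ranks in its group.

```python
import torch

@torch.no_grad()
def newton_schulz(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Approximately orthogonalize the (momentum-averaged) gradient via the
    # quintic Newton-Schulz iteration used by public Muon implementations.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)            # scale so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:                        # work with the smaller Gram matrix
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```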

Our implementation overlaps batched communication with the Newton-Schulz computations. An additional benefit of Muon is its lower memory requirement compared to AdamW, as only one state per parameter is required rather than two; that's equally beneficial for checkpoint saving and loading. We also support enabling CUDA graphs for the Newton-Schulz procedure to reduce the CPU overhead of launching many relatively small kernels in the orthogonalization procedure, which is mainly beneficial for smaller models. Thanks to the above, during pre-training of Laguna M.1, the overhead from the optimizer was less than 1% of the training step time.

As the updates and compute are replicated across model replicas (i.e., DDP ranks), we have periodic hash checks on the model weights in place to assert all replicas hold the exact same weights. These checks primarily catch silent data corruption (SDC) from defective GPUs; specifically, errors originating in arithmetic logic and pipeline registers, which unlike DRAM and SRAM are not covered by ECC protection. Hash checks also verify the correctness of our distributed training implementation, protecting against data races, collective communication bugs, and replica divergence.
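A minimal sketch of such a check, assuming an initialized `torch.distributed` process group: each replica hashes its local weights and all ranks compare digests. The function names are illustrative.

```python
import hashlib
import torch
import torch.distributed as dist

def weights_digest(model: torch.nn.Module) -> str:
    # Hash the raw bytes of every parameter in a deterministic order.
    h = hashlib.sha256()
    for name, p in sorted(model.named_parameters()):
        h.update(name.encode())
        h.update(p.detach().cpu().contiguous().view(torch.uint8).numpy().tobytes())
    return h.hexdigest()

def verify_replicas(model: torch.nn.Module) -> None:
    digests = [None] * dist.get_world_size()
    dist.all_gather_object(digests, weights_digest(model))
    if len(set(digests)) != 1:
        # Divergent digests point at SDC or a distributed-training bug.
        raise RuntimeError("model replicas have diverged")
```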

Agent RL

To train our models to excel at long-horizon agentic tasks, we built a fully asynchronous online RL system that uses our agentic harness inside the training loop, running across large quantities of realistic end-to-end software engineering, terminal, and tool-integrated reasoning tasks.

Our RL stack is a custom-built system loosely coupling the major components of inference and rollout generation, orchestration of code execution sandboxes, trajectory scoring, buffering and filtering, and distributed training.

At a high level, a single turn of the loop looks like this: the trainer publishes a new checkpoint, which is deployed to our inference cluster. Actor processes pull tasks from a dataset, spin up sandboxed containers, and run our production agent binary against each task using the freshly deployed model. The resulting trajectories are scored, filtered, and written to Iceberg tables. The trainer consumes those records continuously and produces the next checkpoint. Inference and training run asynchronously in parallel, with their throughput tuned to balance off-policy staleness. Our modular approach helps us iterate quickly on each component, and swap components easily to experiment with new ideas.
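As an illustration, here is a toy version of that loop with in-process stand-ins for the real components (inference cluster, sandboxes, Iceberg tables); `run_task`, `train_step`, and `publish` are hypothetical callables, not our actual interfaces.

```python
import queue

trajectories = queue.Queue(maxsize=1024)  # stands in for the trajectory store
latest_version = 0                         # stands in for the deployed checkpoint

def actor_loop(run_task):
    while True:
        version = latest_version                    # checkpoint this rollout uses
        traj = run_task(version)                    # sandboxed agent run + scoring
        trajectories.put({"version": version, "traj": traj})

def trainer_loop(train_step, publish, batch_size: int = 32):
    global latest_version
    while True:
        batch = [trajectories.get() for _ in range(batch_size)]
        train_step(batch)                           # consume records continuously
        latest_version += 1
        publish(latest_version)                     # deploy weights to inference
```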

#### Asynchronous rollouts for long-horizon tasks

Realistic software engineering tasks can span a very large number of tool calls over long time horizons: reading files, running tests, iterating on a patch, re-running the suite. If the trainer had to wait for a full batch of trajectories before taking a step, the GPUs would sit idle for most of the wall-clock time, and long trajectories would be systematically under-represented.

We use a fully asynchronous setup. Actors and the trainer run independently, and trajectories go through a queue buffer with explicit gating on how far behind the current policy a rollout is allowed to be. Actors continuously generate data against the most recent checkpoint. The trainer pulls records at its own pace, with a configurable parameter to control the maximum off-policyness.
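Continuing the toy loop above, explicit staleness gating might look like the following; the bound is a placeholder for the empirically tuned value.

```python
MAX_STALENESS = 4   # max checkpoint lag a rollout may have (placeholder value)

def next_fresh_batch(batch_size: int, current_version: int) -> list[dict]:
    batch = []
    while len(batch) < batch_size:
        item = trajectories.get()
        if current_version - item["version"] <= MAX_STALENESS:
            batch.append(item)
        # else: drop the rollout; it is too off-policy to train on
    return batch
```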

To synchronize checkpoints between trainer and inference, we developed a custom weight transfer scheme over GPUDirect RDMA, which lets us transfer hundreds of gigabytes of weights within seconds. For Laguna M.1, we can transfer BF16 weights within 5s across nodes between training and inference.

During long rollouts, it's common that training has already progressed to a new checkpoint. In those cases, we optimize for throughput and update the inference model without disrupting running inference requests. To further optimize throughput, inference supports running model weights and KV cache in FP8, even when training is on BF16. We route all requests from a given agent to the same inference replica, enabling KV cache reuse across turns.

#### Training stability with off-policy RL

We optimize our RL pipeline for scale and throughput, which makes training off-policy to some degree. Off-policyness has several sources: stale model parameters, non-deterministic kernels, and even numerical precision mismatch between inference and training. Staleness, in particular, is by design. The gap in training steps between the inference and training models is empirically tuned to balance data freshness and system throughput.

A common issue that can lead to off-policyness is the way data is wired through from rollout generation to training. Naive implementations lead to re-tokenization of data, which can cause a mismatch in token representations. Instead, our actors are designed in a token-in, token-out manner, where token IDs are preserved across multiple agentic turns in the whole trajectory.
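A minimal sketch of that contract, with hypothetical `generate`, `run_tool`, and `encode_tool_result` stand-ins: the model's output token IDs are appended to the context directly, so no decode/re-encode round trip can perturb them.

```python
def run_trajectory(prompt_ids: list[int], generate, run_tool,
                   encode_tool_result, max_turns: int) -> list[int]:
    context = list(prompt_ids)
    for _ in range(max_turns):
        new_ids = generate(context)          # token IDs straight from inference
        context += new_ids                   # never decoded and re-tokenized
        context += encode_tool_result(run_tool(new_ids))
    return context                           # the trainer sees these exact IDs
```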

To train stably in the off-policy regime, we use a variant of the CISPO algorithm. Our RL runs maintain stability and continued performance improvements over many days of training, without the need for additional stability techniques like entropy regularization.
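As a rough sketch of the idea behind CISPO-style objectives (not our exact variant): clip the importance-sampling weight itself and detach it, so every token keeps a policy-gradient signal instead of being zeroed out by PPO-style ratio clipping. The epsilon values below are placeholders.

```python
import torch

def cispo_style_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                     advantages: torch.Tensor,
                     eps_low: float = 0.2, eps_high: float = 5.0) -> torch.Tensor:
    # Per-token importance-sampling ratio between current and behavior policy.
    ratio = torch.exp(logp_new - logp_old)
    # Clip the IS weight and stop its gradient; unlike PPO clipping, clipped
    # tokens still contribute a (bounded) gradient through logp_new.
    weight = torch.clamp(ratio, 1 - eps_low, 1 + eps_high).detach()
    return -(weight * advantages * logp_new).mean()
```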

A technical report on our Laguna XS.2 model is in development. You can learn more about our approach to model building in our Model Factory series.

Get started

Laguna M.1 and Laguna XS.2 are free to use for a limited time. Alongside the models, we're releasing our agent harness, pool, as a research preview. It's the same environment we use internally for agent RL training and evaluation.

Jump straight into Shimmer, our vision of how software will be built in the future, or download pool. You can also get started on OpenRouter.

And if you're building on models at a startup, an institution, or a university, we're happy to support higher rate limits on request or provide access to the weights for Laguna M.1. Get in touch at models@poolside.ai or send us a DM on X.