OLMo Hybrid: From Theory to Practice and Back

Published as an arXiv preprint, 2026

We demonstrate that hybrid models combining recurrence and attention outperform traditional transformers. We introduce OLMo Hybrid, a 7-billion-parameter model that replaces sliding-window attention layers with Gated DeltaNet layers. We show theoretically that these hybrid models can express tasks beyond the capabilities of both pure transformers and pure linear RNNs. Empirically, we demonstrate that the hybrid approach scales more efficiently during pretraining while achieving superior downstream performance.
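To make the hybrid design concrete, here is a minimal sketch of a single-head gated delta-rule recurrence and a layer schedule that interleaves it with full attention. This is an illustrative sketch, not the paper's implementation: the dimensions, the sigmoid gate parameterization (`alpha`, `beta`), and the one-attention-layer-in-four ratio (`attn_every`) are assumptions made for the example.

```python
# Minimal sketch of a gated delta-rule recurrence and a hybrid layer
# schedule. All shapes, gates, and the attention ratio are illustrative
# assumptions, not the released OLMo Hybrid configuration.
import torch

def gated_delta_rule(K, V, alpha, beta):
    """Sequential gated delta-rule scan over one sequence.
    K: (T, d_k) keys, V: (T, d_v) values, alpha/beta: (T,) gates in (0, 1).
    Returns per-step readouts O: (T, d_v)."""
    T, d_k = K.shape
    d_v = V.shape[1]
    S = torch.zeros(d_v, d_k)  # fast-weight state
    outs = []
    for t in range(T):
        k, v = K[t], V[t]
        S = alpha[t] * S                              # gated decay of the state
        S = S + beta[t] * torch.outer(v - S @ k, k)   # delta-rule write toward v
        outs.append(S @ k)                            # read out with the current key
    return torch.stack(outs)

def layer_kinds(n_layers, attn_every=4):
    """Hybrid schedule: Gated DeltaNet layers everywhere except every
    `attn_every`-th layer, which keeps full attention (ratio assumed)."""
    return ["attention" if (i + 1) % attn_every == 0 else "gated_deltanet"
            for i in range(n_layers)]

if __name__ == "__main__":
    T, d_k, d_v = 8, 16, 16
    K = torch.randn(T, d_k)
    V = torch.randn(T, d_v)
    alpha = torch.sigmoid(torch.randn(T))  # per-step forget gate
    beta = torch.sigmoid(torch.randn(T))   # per-step write strength
    print(gated_delta_rule(K, V, alpha, beta).shape)  # torch.Size([8, 16])
    print(layer_kinds(8))
```

Algebraically, the two-step update in the loop amounts to S_t = alpha_t * S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T, i.e., a gated decay followed by a delta-rule correction along the current key.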


BibTeX citation:

@article{merrill2026olmohybrid,
      title={OLMo Hybrid: From Theory to Practice and Back},
      author={William Merrill and Yanhong Li and Tyler Romero and Anej Svete and Caia Costello and Pradeep Dasigi and Dirk Groeneveld and David Heineman and Bailey Kuehl and Nathan Lambert and Chuan Li and Kyle Lo and Saumya Malik and DJ Matusz and Benjamin Minixhofer and Jacob Morrison and Luca Soldaini and Finbarr Timbers and Pete Walsh and Noah A. Smith and Hannaneh Hajishirzi and Ashish Sabharwal},
      year={2026},
      eprint={2604.03444},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2604.03444},
      journal={arXiv preprint arXiv:2604.03444},
}