Probability Distributions Computed by Hard-Attention Transformers
Published on arXiv, 2025
Most expressivity results for transformers treat them as language recognizers, which accept or reject strings, rather than as they are used in practice: as language models, which generate strings autoregressively and probabilistically. Here, we characterize the probability distributions that transformer language models can express. We show that making transformer language recognizers autoregressive can sometimes increase their expressivity, and that making them probabilistic can break equivalences that hold in the non-probabilistic case. Our overall contribution is to tease apart what functions transformers are capable of expressing in their most common use case: as language models.
BibTeX citation:
@article{yang2025probabilitydistributionshardattention,
  title={Probability Distributions Computed by Hard-Attention Transformers},
  author={Andy Yang and Anej Svete and Jiaoda Li and Anthony Widjaja Lin and Jonathan Rawski and Ryan Cotterell and David Chiang},
  year={2025},
  eprint={2510.27118},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.27118},
  journal={arXiv preprint arXiv:2510.27118},
}
