A Fundamental Trade-off in Aligned Language Models and its Relation to Sampling Adaptors
Under review, 2024
The relationship between the quality of a string and its probability p(y) under a language model has been influential in the development of techniques to build good text generation systems. For example, several decoding algorithms are motivated by the goal of manipulating p(y) to produce higher-quality text. In this work, we examine the probability–quality relationship in language models explicitly aligned to human preferences, e.g., through Reinforcement Learning from Human Feedback (RLHF). We find that, given a general language model and its aligned version, there exists a trade-off, for corpora sampled from the aligned model, between the average reward and the average log-likelihood of the strings under the general language model. We provide a formal treatment of this issue and demonstrate how the choice of sampling adaptor allows for a selection of how much likelihood we exchange for the reward.
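To make the trade-off concrete, below is a minimal, hypothetical Python sketch (not code from the paper). It assumes a toy aligned model of the standard reward-tilted form p(y)·exp(r(y)/β) over a small single-token vocabulary, uses temperature scaling as the sampling adaptor, and reports the average reward and average base-model log-likelihood of corpora drawn at different temperatures. All names and the toy setup are illustrative assumptions, not the paper's experimental setting.

# Hypothetical toy sketch, not the paper's code: a sampling adaptor
# (temperature scaling) applied to an aligned model's distribution, and the
# resulting average reward vs. average base-model log-likelihood of samples.
import numpy as np

rng = np.random.default_rng(0)
V = 50  # toy vocabulary; each "string" is a single token for simplicity

# Toy base model p(y) and reward r(y); the aligned model is assumed (here) to
# take the standard reward-tilted form p(y) * exp(r(y) / beta), renormalized.
log_p = rng.normal(size=V)
log_p -= np.log(np.exp(log_p).sum())
reward = rng.normal(size=V)
beta = 1.0
log_aligned = log_p + reward / beta
log_aligned -= np.log(np.exp(log_aligned).sum())

def sample_with_adaptor(log_q, tau, n=20000):
    """Sampling adaptor: rescale log-probabilities by 1/tau and renormalize."""
    scaled = log_q / tau
    probs = np.exp(scaled - np.log(np.exp(scaled).sum()))
    return rng.choice(len(log_q), size=n, p=probs)

for tau in (0.5, 1.0, 2.0):
    ys = sample_with_adaptor(log_aligned, tau)
    avg_reward = reward[ys].mean()
    avg_loglik = log_p[ys].mean()  # log-likelihood under the *base* model
    print(f"tau={tau}: avg reward = {avg_reward:+.3f}, avg log p(y) = {avg_loglik:+.3f}")

The printed averages show how different adaptor settings shift the sampled corpus between reward and base-model likelihood; how pronounced the exchange is in this toy depends on the illustrative setup chosen above, and the formal statement of the trade-off is given in the paper itself.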
Citation (BibTeX):
@article{tan2024fundamentaltradeoffalignedlanguage,
title={A Fundamental Trade-off in Aligned Language Models and its Relation to Sampling Adaptors},
author={Naaman Tan and Josef Valvoda and Anej Svete and Tianyu Liu and Yanxia Qin and Kan Min-Yen and Ryan Cotterell},
year={2024},
eprint={2406.10203},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2406.10203},
journal={arXiv preprint arXiv:2406.10203},
}