PipelineRL: Faster On-policy Reinforcement Learning for Long Sequence Generation
Published in arXiv preprint arXiv:2509.19128, 2025
We present PipelineRL, a fully pipelined on-policy reinforcement learning stack for large language models that keeps accelerators saturated without generating stale, off-policy trajectories. The system overlaps rollout, preference modeling, and policy updates to deliver up to 2× faster convergence on long-context reasoning benchmarks while retaining stability guarantees for PPO-style algorithms.
Recommended citation: Alexandre Piché, Ehsan Kamalloo, Rafael Pardinas, Xiaoyin Chen, Dzmitry Bahdanau. (2025). "PipelineRL: Faster On-policy Reinforcement Learning for Long Sequence Generation." arXiv preprint arXiv:2509.19128.
Download Paper
