skandermoalla/qrpo-paper-llama-nosft-ultrafeedback-armorm-temp1-ref50-offline-armorm
Datasets with reference completions and rewards used in the paper https://arxiv.org/abs/2507.08068.