hf-doc-build/doc-build
Updated • 306k • 38
[batch × seq × vocab] logits tensor before computing cross-entropy, which dominates peak memory at long context lengths. The new loss_type="chunked_nll" path drops ignored-label tokens before the lm_head matmul and computes cross-entropy in checkpointed chunks of 256.nll, and unlocks sequence lengths that don't fit at all under the standard path.SFTConfig(loss_type="chunked_nll")trl.experimental.openreward adapter plugs any environment speaking the [Open Reward Standard](https://openrewardstandard.io) protocol into any TRL trainer that takes an environment_factory. One string — a catalog name or a URL — wires the dataset, factory, and reward_func slots; tools are bound dynamically from JSON Schema, no per-env wrapper code:from trl import GRPOTrainer
from trl.experimental.openreward import OpenRewardSpec
spec = OpenRewardSpec("Eigent/SETA", num_tasks=64)
trainer = GRPOTrainer(
...,
train_dataset=spec.train_dataset,
environment_factory=spec.environment_factory,
reward_funcs=spec.reward_funcs,
)