Error on fsdp2 trainer with fsdp_cpu_ram_efficient_loading = True

Hi all, I had a quick question. There's an error when I try to train my model using FSDP2 with fsdp_cpu_ram_efficient_loading = True. It raises the following error:

[rank0]: Traceback (most recent call last):
[rank0]: File "train.py", line 259, in <module>
[rank0]: invoke_main()
[rank0]: File "train.py", line 256, in invoke_main
[rank0]: main(config)
[rank0]: File "train.py", line 99, in main
[rank0]: model, optimizer, train_loader, valid_loader, scheduler = accelerator.prepare(
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/data0/miniconda3/envs/esat/lib/python3.12/site-packages/accelerate/accelerator.py", line 1555, in prepare
[rank0]: result = self._prepare_fsdp2(*args)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/data0/miniconda3/envs/esat/lib/python3.12/site-packages/accelerate/accelerator.py", line 1687, in _prepare_fsdp2
[rank0]: model = fsdp2_prepare_model(self, model)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/data0/miniconda3/envs/esat/lib/python3.12/site-packages/accelerate/utils/fsdp_utils.py", line 678, in fsdp2_prepare_model
[rank0]: fsdp2_load_full_state_dict(accelerator, model, original_sd)
[rank0]: File "/data0/miniconda3/envs/esat/lib/python3.12/site-packages/accelerate/utils/fsdp_utils.py", line 507, in fsdp2_load_full_state_dict
[rank0]: device_mesh = sharded_param.device_mesh
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: AttributeError: 'Tensor' object has no attribute 'device_mesh'

and here is my accelerate config:

distributed_type: FSDP
downcast_bf16: 'no'
dynamo_backend: 'INDUCTOR'  # 'INDUCTOR' or 'NO'
fsdp_config:
  fsdp_version: 2
  fsdp_reshard_after_forward: True
  fsdp_mixed_precision_policy:
    param_dtype: "bf16"
    reduce_dtype: "bf16"
    buffer_dtype: "fp32"
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: Block
  fsdp_offload_params: False
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_cpu_ram_efficient_loading: False
  fsdp_activation_checkpointing: False
mixed_precision: bf16
num_processes: 2
gpu_ids: 5,6
use_cpu: False

I noticed that fsdp_cpu_ram_efficient_loading can prevent CPU out-of-memory (OOM) failures, which is helpful when I want to scale up my model; what's more, fsdp_cpu_ram_efficient_loading=True is the default setting. However, this feature appears to be unusable in my setup. How can I resolve this issue?


This seems to be a known bug.


This failure comes from the same root class of bugs others have hit with Accelerate's FSDP2 "CPU RAM efficient loading": the FSDP2 loading path assumes it can read DTensor metadata (a DeviceMesh) from the "sharded params", but in some configurations it instead sees a plain Tensor, so accessing .device_mesh throws exactly the error you see.

This is not a YAML typo on your side; it is a real integration edge case.


What fsdp_cpu_ram_efficient_loading actually does

Accelerate documents this flag very narrowly:

  • It is only applicable to Transformers models.
  • It is meant for loading via from_pretrained.
  • When enabled, only rank 0 loads the pretrained checkpoint; the other ranks start with "empty" weights, which are then broadcast/synced from rank 0. (Hugging Face)
  • Accelerate explicitly says to set it to False if you get errors while loading pretrained Transformers weights. (Hugging Face)

So this flag is not a general FSDP2 “reduce CPU RAM always” switch. It is specifically a Transformers from_pretrained loading optimization.
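
Conceptually, the flag means something like the sketch below. This is only the idea, not Accelerate's actual implementation; the model name and the torchrun-style launch are assumptions for illustration.

import torch
import torch.distributed as dist
from transformers import AutoConfig, AutoModelForCausalLM

dist.init_process_group("nccl")          # distributed must exist before from_pretrained
if dist.get_rank() == 0:
    model = AutoModelForCausalLM.from_pretrained("my-org/my-model")   # full checkpoint in CPU RAM
else:
    config = AutoConfig.from_pretrained("my-org/my-model")
    with torch.device("meta"):           # no real memory allocated on these ranks
        model = AutoModelForCausalLM.from_config(config)
# the real weights reach the other ranks later, during sharding/broadcast in prepare()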


Why you get AttributeError: 'Tensor' object has no attribute 'device_mesh'

Background: FSDP2 is DTensor-based

In PyTorch FSDP2 (fully_shard), parameters are converted in-place from plain tensors to DTensors, and DTensors carry a DeviceMesh that describes the topology of ranks and how tensors are sharded/replicated. (PyTorch Documentation)

That is why “DeviceMesh” exists at all: DTensor layouts are defined by (DeviceMesh, Placement). (PyTorch Documentation)
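
For intuition, here is a minimal sketch of what FSDP2 does to parameters, assuming a 2-GPU torchrun launch and a recent PyTorch where fully_shard and torch.distributed.tensor are public:

import os
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import fully_shard      # public import path in recent PyTorch
from torch.distributed.tensor import DTensor

torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))  # one GPU per rank under torchrun
mesh = init_device_mesh("cuda", (2,))                  # 1-D mesh over 2 ranks
model = nn.Linear(16, 16, device="cuda")
fully_shard(model, mesh=mesh)                          # converts params to DTensors in place

p = next(model.parameters())
print(isinstance(p, DTensor))   # True
print(p.device_mesh)            # every DTensor carries its DeviceMesh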

What Accelerate is doing in the failing path

Accelerate’s FSDP2 utility fsdp2_load_full_state_dict is documented as:

  • “Loads the full state dict (could be only on rank 0) into the sharded model”
  • “done by broadcasting the parameters from rank 0”
  • expects the model to be on meta device to avoid VRAM spikes. (Hugging Face)

To broadcast correctly, it tries to infer sharding metadata from the sharded parameters, and that is where .device_mesh is accessed. Your crash means the "sharded param" it looked at was not a DTensor (which would carry a .device_mesh) but a plain Tensor.
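
You can reproduce the exact AttributeError outside Accelerate with a few lines. A standalone sketch, assuming 2 GPUs, a recent PyTorch with the public torch.distributed.tensor API, and a torchrun --nproc_per_node=2 launch:

import os
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard, distribute_tensor

torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
mesh = init_device_mesh("cuda", (2,))

plain = torch.randn(8, 8, device="cuda")
sharded = distribute_tensor(plain, mesh, placements=[Shard(0)])

print(sharded.device_mesh)  # works: DTensors know their mesh
print(plain.device_mesh)    # AttributeError: 'Tensor' object has no attribute 'device_mesh'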

This is a known, recurring incompatibility

There are recent Accelerate issues showing the same family of breakage when cpu_ram_efficient_loading=True triggers broadcast-based loading under FSDP2, especially when DTensor/DeviceMesh context is not what Accelerate expects. Example: an Oct 2025 issue shows cpu_ram_efficient_loading=True leading into fsdp2_load_full_state_dict and then failing inside dist.broadcast because DTensor dispatch cannot find a DeviceMesh context. (GitHub)

Another related issue (Aug 2025) notes the error is tied to cpu_ram_efficient_loading and that setting it to False makes training run, with maintainers stating they were working on making it compatible (in that report, with TP). (GitHub)


Practical resolution options

Option A (most reliable): disable it for now

Set:

fsdp_cpu_ram_efficient_loading: False

This is explicitly recommended in the docs when it causes loading errors. (Hugging Face)
It also matches the guidance maintainers gave in similar real-world reports. (GitHub)

If you only need a working run today, this is the fastest path.
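
If you configure FSDP programmatically instead of via YAML, the equivalent switch lives on the FSDP plugin. This is only a sketch; the exact argument names depend on your Accelerate version, and fsdp_version=2 requires a recent release:

from accelerate import Accelerator
from accelerate.utils import FullyShardedDataParallelPlugin

fsdp_plugin = FullyShardedDataParallelPlugin(
    fsdp_version=2,
    cpu_ram_efficient_loading=False,   # same workaround as the YAML change above
)
accelerator = Accelerator(fsdp_plugin=fsdp_plugin)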


Option B: keep it, but satisfy the real prerequisites

If you want to keep True, make sure you are actually using it in the narrow supported way:

  1. You must be loading a Transformers model via from_pretrained (not constructing weights yourself, not loading custom state dicts first). (Hugging Face)

  2. Distributed must be initialized before from_pretrained.
    Accelerate's docs explicitly call this out. In Trainer, this usually happens when TrainingArguments is created; in a custom loop, it means creating the Accelerator (or otherwise initializing the process group) before calling from_pretrained. (Hugging Face)

  3. Ensure fsdp_sync_module_states=True (or let Accelerate force it).
    Docs: it “must be True” when CPU RAM efficient loading is True. (Hugging Face)

If you do all three and still hit the crash, you are in the “known bug” territory, not the “misuse” territory.
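
For a custom loop, the order of operations looks roughly like this (a sketch; the model name is a placeholder):

import torch
from accelerate import Accelerator
from transformers import AutoModelForCausalLM

accelerator = Accelerator()   # initializes distributed BEFORE from_pretrained
model = AutoModelForCausalLM.from_pretrained("my-org/my-model")

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model, optimizer = accelerator.prepare(model, optimizer)   # FSDP2 sharding happens here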


Option C: get the CPU-RAM benefit without this brittle path

If your core goal is “avoid CPU OOM when scaling model size,” you can often avoid the broadcast-based full-state-dict load entirely:

1) Prefer sharded checkpoint formats for resuming and for large-scale runs

You already set:

fsdp_state_dict_type: SHARDED_STATE_DICT

That is good. The remaining pain point is usually the initial pretrained load, which is still a full checkpoint.

A common workaround is a one-time conversion step: load the pretrained model once, on a machine with enough CPU RAM, save out a sharded checkpoint, and from then on resume from the shards.

Accelerate even provides utilities around sharded weights (and warns merging is CPU-bound). (Hugging Face)
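
A sketch of that one-time conversion, assuming your existing FSDP2 config with SHARDED_STATE_DICT and a placeholder model name; run it once on a machine with enough CPU RAM:

import torch
from accelerate import Accelerator
from transformers import AutoModelForCausalLM

accelerator = Accelerator()
model = AutoModelForCausalLM.from_pretrained("my-org/my-model")   # the one expensive full load
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model, optimizer = accelerator.prepare(model, optimizer)
accelerator.save_state("checkpoints/initial")                     # sharded, per-rank files

# later runs: rebuild the model cheaply (e.g. from_config), prepare it,
# then call accelerator.load_state("checkpoints/initial") instead of a full load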

2) Use PyTorch Distributed Checkpointing (DCP) when possible

PyTorch DCP is designed to save and load checkpoints in a distributed way (it needs the sharding info before loading), which directly addresses the scaling problem you are fighting: avoiding "gather everything on one rank" patterns. (PyTorch Tutorials KR)
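
A minimal DCP sketch, assuming a recent PyTorch with torch.distributed.checkpoint and its state_dict helpers, and that model is your already-sharded FSDP2 model from the surrounding training script:

import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_model_state_dict, set_model_state_dict

# save: every rank writes only its own shards of the FSDP2-sharded model
dcp.save({"model": get_model_state_dict(model)}, checkpoint_id="ckpt/step_1000")

# load: every rank reads back only the shards it owns, then applies them
state = {"model": get_model_state_dict(model)}
dcp.load(state, checkpoint_id="ckpt/step_1000")
set_model_state_dict(model, state["model"])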


Strong “do this next” checklist

  1. Print versions:
python -c "import torch,accelerate,transformers; print(torch.__version__, accelerate.__version__, transformers.__version__)"

FSDP2 is fast-moving and Accelerate has had multiple FSDP2 fixes recently. (GitHub)

  2. If you need it to work now: set fsdp_cpu_ram_efficient_loading=False. (Hugging Face)

  3. If you need the memory win: plan to convert the pretrained weights once into a sharded checkpoint, or move to DCP-based checkpointing, to avoid full-state broadcast-style loading. (PyTorch Tutorials KR)


Key references and “similar cases”

  • Accelerate doc for fsdp_cpu_ram_efficient_loading requirements and “disable if errors.” (Hugging Face)
  • Accelerate FSDP2 loading utility description (broadcast from rank 0; expects meta-device model). (Hugging Face)
  • Similar recent failures: DeviceMesh/DTensor broadcast path errors with cpu_ram_efficient_loading=True. (GitHub)
  • Another similar issue: setting cpu_ram_efficient_loading=False as workaround. (GitHub)
  • PyTorch FSDP2 fully_shard converts params to DTensor (why DeviceMesh is expected). (PyTorch Documentation)

Summary

  • Your crash is a DTensor/DeviceMesh mismatch in Accelerate’s FSDP2 “RAM efficient loading” path. (Hugging Face)
  • The flag is Transformers from_pretrained–specific, not a universal knob. (Hugging Face)
  • Most reliable fix: set it to False (docs and similar issues endorse this). (Hugging Face)
  • To still avoid CPU OOM: move to sharded checkpoints or PyTorch DCP flows. (PyTorch Tutorials KR)
