# Fully Sharded Data Parallel utilities

## enable_fsdp_ram_efficient_loading[[accelerate.utils.enable_fsdp_ram_efficient_loading]]

<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


<docstring><name>accelerate.utils.enable_fsdp_ram_efficient_loading</name><anchor>accelerate.utils.enable_fsdp_ram_efficient_loading</anchor><source>https://github.com/huggingface/accelerate/blob/v1.11.0/src/accelerate/utils/fsdp_utils.py#L39</source><parameters>[]</parameters></docstring>

Enables RAM efficient loading of Hugging Face models for FSDP in the environment.


</div>

## disable_fsdp_ram_efficient_loading[[accelerate.utils.disable_fsdp_ram_efficient_loading]]

<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


<docstring><name>accelerate.utils.disable_fsdp_ram_efficient_loading</name><anchor>accelerate.utils.disable_fsdp_ram_efficient_loading</anchor><source>https://github.com/huggingface/accelerate/blob/v1.11.0/src/accelerate/utils/fsdp_utils.py#L49</source><parameters>[]</parameters></docstring>

Disables RAM efficient loading of Hugging Face models for FSDP in the environment.


</div>

## merge_fsdp_weights[[accelerate.utils.merge_fsdp_weights]]

<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


<docstring><name>accelerate.utils.merge_fsdp_weights</name><anchor>accelerate.utils.merge_fsdp_weights</anchor><source>https://github.com/huggingface/accelerate/blob/v1.11.0/src/accelerate/utils/fsdp_utils.py#L360</source><parameters>[{"name": "checkpoint_dir", "val": ": str"}, {"name": "output_path", "val": ": str"}, {"name": "safe_serialization", "val": ": bool = True"}, {"name": "remove_checkpoint_dir", "val": ": bool = False"}]</parameters><paramsdesc>- **checkpoint_dir** (`str`) --
  The directory containing the FSDP checkpoints (can be either the model or optimizer).
- **output_path** (`str`) --
  The path to save the merged checkpoint.
- **safe_serialization** (`bool`, *optional*, defaults to `True`) --
  Whether to save the merged weights with safetensors (recommended).
- **remove_checkpoint_dir** (`bool`, *optional*, defaults to `False`) --
  Whether to remove the checkpoint directory after merging.</paramsdesc><paramgroups>0</paramgroups></docstring>

Merge the weights from sharded FSDP model checkpoints into a single combined checkpoint. Should be used if
`SHARDED_STATE_DICT` was used for the model. Weights will be saved to `{output_path}/model.safetensors` if
`safe_serialization` is `True`, otherwise to `{output_path}/pytorch_model.bin`.

Note: this is a CPU-bound process.




</div>

## FullyShardedDataParallelPlugin[[accelerate.FullyShardedDataParallelPlugin]]

<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


<docstring><name>class accelerate.FullyShardedDataParallelPlugin</name><anchor>accelerate.FullyShardedDataParallelPlugin</anchor><source>https://github.com/huggingface/accelerate/blob/v1.11.0/src/accelerate/utils/dataclasses.py#L1538</source><parameters>[{"name": "fsdp_version", "val": ": int = None"}, {"name": "sharding_strategy", "val": ": typing.Union[str, ForwardRef('torch.distributed.fsdp.ShardingStrategy')] = None"}, {"name": "reshard_after_forward", "val": ": typing.Union[str, ForwardRef('torch.distributed.fsdp.ShardingStrategy'), bool] = None"}, {"name": "backward_prefetch", "val": ": typing.Union[str, ForwardRef('torch.distributed.fsdp.BackwardPrefetch'), NoneType] = None"}, {"name": "mixed_precision_policy", "val": ": typing.Union[dict, str, ForwardRef('torch.distributed.fsdp.MixedPrecision'), ForwardRef('torch.distributed.fsdp.MixedPrecisionPolicy'), NoneType] = None"}, {"name": "auto_wrap_policy", "val": ": typing.Union[typing.Callable, typing.Literal['transformer_based_wrap', 'size_based_wrap', 'no_wrap'], NoneType] = None"}, {"name": "cpu_offload", "val": ": typing.Union[bool, ForwardRef('torch.distributed.fsdp.CPUOffload'), ForwardRef('torch.distributed.fsdp.CPUOffloadPolicy')] = None"}, {"name": "ignored_modules", "val": ": typing.Union[collections.abc.Iterable[torch.nn.modules.module.Module], str, NoneType] = None"}, {"name": "state_dict_type", "val": ": typing.Union[str, ForwardRef('torch.distributed.fsdp.StateDictType')] = None"}, {"name": "state_dict_config", "val": ": typing.Union[ForwardRef('torch.distributed.fsdp.FullStateDictConfig'), ForwardRef('torch.distributed.fsdp.ShardedStateDictConfig'), NoneType] = None"}, {"name": "optim_state_dict_config", "val": ": typing.Union[ForwardRef('torch.distributed.fsdp.FullOptimStateDictConfig'), ForwardRef('torch.distributed.fsdp.ShardedOptimStateDictConfig'), NoneType] = None"}, {"name": "limit_all_gathers", "val": ": bool = True"}, {"name": "use_orig_params", "val": ": typing.Optional[bool] = None"}, 
{"name": "param_init_fn", "val": ": typing.Optional[typing.Callable[[torch.nn.modules.module.Module], NoneType]] = None"}, {"name": "sync_module_states", "val": ": typing.Optional[bool] = None"}, {"name": "forward_prefetch", "val": ": bool = None"}, {"name": "activation_checkpointing", "val": ": bool = None"}, {"name": "cpu_ram_efficient_loading", "val": ": bool = None"}, {"name": "transformer_cls_names_to_wrap", "val": ": typing.Optional[list[str]] = None"}, {"name": "min_num_params", "val": ": typing.Optional[int] = None"}]</parameters><paramsdesc>- **fsdp_version** (`int`, defaults to `1`) --
  The version of FSDP to use. Defaults to 1. If set to 2, the launcher expects the config to be in FSDP2
  format.
- **sharding_strategy** (`Union[str, torch.distributed.fsdp.ShardingStrategy]`, defaults to `'FULL_SHARD'`) --
  Sharding strategy to use. Should be either a `str` or an instance of
  `torch.distributed.fsdp.fully_sharded_data_parallel.ShardingStrategy`. Is deprecated in favor of
  `reshard_after_forward`.
- **reshard_after_forward** (`Union[str, torch.distributed.fsdp.ShardingStrategy, bool]`, defaults to `'FULL_SHARD'` for `fsdp_version=1` and `True` for `fsdp_version=2`) --
  Sharding strategy to use. Should be a bool if `fsdp_version` is set to 2 else a `str` or an instance of
  `torch.distributed.fsdp.fully_sharded_data_parallel.ShardingStrategy`.
- **backward_prefetch** (`Union[str, torch.distributed.fsdp.BackwardPrefetch]`, defaults to `'NO_PREFETCH'`) --
  Backward prefetch strategy to use. Should be either a `str` or an instance of
  `torch.distributed.fsdp.fully_sharded_data_parallel.BackwardPrefetch`.
- **mixed_precision_policy** (`Optional[Union[dict, str, torch.distributed.fsdp.MixedPrecision, torch.distributed.fsdp.MixedPrecisionPolicy]]`, defaults to `None`) --
  A config to enable mixed precision training with FullyShardedDataParallel. If passing in a `dict`, it
  should have the following keys: `param_dtype`, `reduce_dtype`, and `buffer_dtype`, can be an instance of
  `torch.distributed.fsdp.MixedPrecisionPolicy` if `fsdp_version` is set to 2. If passing in a `str`, it
  should be one of the following values: fp8, fp16, bf16, fp32, and used to set `param_dtype`,
  `reduce_dtype`, and `buffer_dtype`.
- **auto_wrap_policy** (`Optional[Union[Callable, Literal["transformer_based_wrap", "size_based_wrap", "no_wrap"]]]`, defaults to `NO_WRAP`) --
  A callable or string specifying a policy to recursively wrap layers with FSDP. If a string, it must be one
  of `transformer_based_wrap`, `size_based_wrap`, or `no_wrap`. See
  `torch.distributed.fsdp.wrap.size_based_wrap_policy` for a direction on what it should look like.
- **cpu_offload** (`Union[bool, torch.distributed.fsdp.CPUOffload, torch.distributed.fsdp.CPUOffloadPolicy]`, defaults to `False`) --
  Whether to offload parameters to CPU. Should be either a `bool` or an instance of
  `torch.distributed.fsdp.fully_sharded_data_parallel.CPUOffload` or
  `torch.distributed.fsdp.fully_sharded_data_parallel.CPUOffloadPolicy` if `fsdp_version` is set to 2.
- **ignored_modules** (`Optional[Union[Iterable[torch.nn.Module], str]]`, defaults to `None`) --
  A list of modules to ignore when wrapping with FSDP. When passing a string, will match the modules by name
  using regex fullmatch. If `fsdp_version` is set to 2, the modules are converted to parameters and used.
- **state_dict_type** (`Union[str, torch.distributed.fsdp.StateDictType]`, defaults to `'FULL_STATE_DICT'`) --
  State dict type to use. If a string, it must be one of `full_state_dict`, `local_state_dict`, or
  `sharded_state_dict`.
- **state_dict_config** (`Optional[Union[torch.distributed.fsdp.FullStateDictConfig, torch.distributed.fsdp.ShardedStateDictConfig]`, defaults to `None`) --
  State dict config to use. Is determined based on the `state_dict_type` if not passed in.
- **optim_state_dict_config** (`Optional[Union[torch.distributed.fsdp.FullOptimStateDictConfig, torch.distributed.fsdp.ShardedOptimStateDictConfig]`, defaults to `None`) --
  Optim state dict config to use. Is determined based on the `state_dict_type` if not passed in.
- **limit_all_gathers** (`bool`, defaults to `True`) --
  Whether to have FSDP explicitly synchronize the CPU thread to prevent too many in-flight all-gathers. This
  bool only affects the sharded strategies that schedule all-gathers. Enabling this can help lower the number
  of CUDA malloc retries.
- **use_orig_params** (`bool`, defaults to `False`) --
  Whether to use the original parameters for the optimizer.
- **param_init_fn** (`Optional[Callable[[torch.nn.Module], None]`, defaults to `None`) --
  A `Callable[torch.nn.Module] -> None` that specifies how modules that are currently on the meta device
  should be initialized onto an actual device. Only applicable when `sync_module_states` is `True`. By
  default is a `lambda` which calls `to_empty` on the module.
- **sync_module_states** (`bool`, defaults to `False`) --
  Whether each individually wrapped FSDP unit should broadcast module parameters from rank 0 to ensure they
  are the same across all ranks after initialization. Defaults to `False` unless `cpu_ram_efficient_loading`
  is `True`, then will be forcibly enabled.
- **forward_prefetch** (`bool`, defaults to `False`) --
  Whether to have FSDP explicitly prefetch the next upcoming all-gather while executing in the forward
  pass. Only use with static graphs.
- **activation_checkpointing** (`bool`, defaults to `False`) --
  A technique to reduce memory usage by clearing activations of certain layers and recomputing them during a
  backward pass. Effectively, this trades extra computation time for reduced memory usage.
- **cpu_ram_efficient_loading** (`bool`, defaults to `None`) --
  If `True`, only the first process loads the pretrained model checkpoint while all other processes have empty
  weights. Only applicable for Transformers. When using this, `sync_module_states` needs to be `True`.
- **transformer_cls_names_to_wrap** (`Optional[List[str]]`, defaults to `None`) --
  A list of transformer layer class names to wrap. Only applicable when `auto_wrap_policy` is
  `transformer_based_wrap`.
- **min_num_params** (`Optional[int]`, defaults to `None`) --
  The minimum number of parameters a module must have to be wrapped. Only applicable when `auto_wrap_policy`
  is `size_based_wrap`.</paramsdesc><paramgroups>0</paramgroups></docstring>

This plugin is used to enable fully sharded data parallelism.





<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


<docstring><name>set_auto_wrap_policy</name><anchor>accelerate.FullyShardedDataParallelPlugin.set_auto_wrap_policy</anchor><source>https://github.com/huggingface/accelerate/blob/v1.11.0/src/accelerate/utils/dataclasses.py#L2016</source><parameters>[{"name": "model", "val": ""}]</parameters></docstring>

Given `model`, creates an `auto_wrap_policy` from the configured policy, using `transformer_cls_to_wrap`
when applicable.


</div>
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


<docstring><name>set_mixed_precision</name><anchor>accelerate.FullyShardedDataParallelPlugin.set_mixed_precision</anchor><source>https://github.com/huggingface/accelerate/blob/v1.11.0/src/accelerate/utils/dataclasses.py#L2050</source><parameters>[{"name": "mixed_precision", "val": ""}, {"name": "buffer_autocast", "val": " = False"}, {"name": "override", "val": " = False"}]</parameters></docstring>
Sets the mixed precision policy for FSDP.

</div>
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


<docstring><name>set_state_dict_type</name><anchor>accelerate.FullyShardedDataParallelPlugin.set_state_dict_type</anchor><source>https://github.com/huggingface/accelerate/blob/v1.11.0/src/accelerate/utils/dataclasses.py#L1971</source><parameters>[{"name": "state_dict_type", "val": " = None"}]</parameters></docstring>

Set the state dict config based on the `StateDictType`.


</div>
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


<docstring><name>validate_mixed_precision_policy</name><anchor>accelerate.FullyShardedDataParallelPlugin.validate_mixed_precision_policy</anchor><source>https://github.com/huggingface/accelerate/blob/v1.11.0/src/accelerate/utils/dataclasses.py#L2102</source><parameters>[]</parameters></docstring>

Validates the mixed precision policy, abstracted away so the relevant imports are only pulled in when needed.


</div></div>

## fsdp2_load_full_state_dict[[accelerate.utils.fsdp2_load_full_state_dict]]

<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


<docstring><name>accelerate.utils.fsdp2_load_full_state_dict</name><anchor>accelerate.utils.fsdp2_load_full_state_dict</anchor><source>https://github.com/huggingface/accelerate/blob/v1.11.0/src/accelerate/utils/fsdp_utils.py#L461</source><parameters>[{"name": "accelerator", "val": ""}, {"name": "model", "val": ": Module"}, {"name": "full_sd", "val": ": dict"}]</parameters><paramsdesc>- **accelerator** (`Accelerator`) -- The accelerator instance
- **model** (`torch.nn.Module`) --
  The model to load the state dict into. It should be on the meta device, otherwise a VRAM spike can occur
- **full_sd** (`dict`) -- The full state dict to load, which may be present only on rank 0</paramsdesc><paramgroups>0</paramgroups></docstring>

Loads the full state dict (which may be present only on rank 0) into the sharded model. This is done by
broadcasting the parameters from rank 0 to all other ranks. This function modifies the model in-place.




</div>

## fsdp2_switch_optimizer_parameters[[accelerate.utils.fsdp2_switch_optimizer_parameters]]

<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


<docstring><name>accelerate.utils.fsdp2_switch_optimizer_parameters</name><anchor>accelerate.utils.fsdp2_switch_optimizer_parameters</anchor><source>https://github.com/huggingface/accelerate/blob/v1.11.0/src/accelerate/utils/fsdp_utils.py#L538</source><parameters>[{"name": "optimizer", "val": ": Optimizer"}, {"name": "mapping", "val": ": dict"}]</parameters><paramsdesc>- **optimizer** (`torch.optim.Optimizer`) -- Optimizer instance which contains the original model parameters
- **mapping** (`dict`) -- Mapping from the original parameter (specified by `data_ptr`) to the sharded parameter</paramsdesc><paramgroups>0</paramgroups><raises>- ``KeyError`` -- 
  If a parameter in the optimizer couldn't be switched to its sharded version. This should never happen and
  indicates a bug. If we kept the original params instead of raising, the training wouldn't be numerically
  correct and weights wouldn't get updated.</raises><raisederrors>``KeyError``</raisederrors></docstring>

Switches the parameters of the optimizer to new ones (sharded parameters in usual case). This function modifies the
optimizer in-place.








</div>

## fsdp2_prepare_model[[accelerate.utils.fsdp2_prepare_model]]

<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


<docstring><name>accelerate.utils.fsdp2_prepare_model</name><anchor>accelerate.utils.fsdp2_prepare_model</anchor><source>https://github.com/huggingface/accelerate/blob/v1.11.0/src/accelerate/utils/fsdp_utils.py#L602</source><parameters>[{"name": "accelerator", "val": ""}, {"name": "model", "val": ": Module"}]</parameters><paramsdesc>- **accelerator** (`Accelerator`) -- The accelerator instance
- **model** (`torch.nn.Module`) -- The model to prepare</paramsdesc><paramgroups>0</paramgroups><rettype>`torch.nn.Module`</rettype><retdesc>Prepared model</retdesc></docstring>
Prepares the model for FSDP2 in-place. Also returns the model to avoid misuse of the original model.








</div>

## fsdp2_prepare_auto_wrap_policy


<EditOnGithub source="https://github.com/huggingface/accelerate/blob/main/docs/source/package_reference/fsdp.md" />