QLoRA with torchao
Background
bitsandbytes is the historical backend for QLoRA in Axolotl. It works well in
single-GPU and FSDP1 setups, but its 4-bit kernels graph-break under
torch.compile and its Params4bit storage needs custom patches to cooperate
with FSDP2.
torchao provides tensor-subclass quantization
(AffineQuantizedTensor, NF4Tensor) that:
- compiles cleanly with torch.compile (no graph breaks for INT4/INT8
  weight-only quant);
- ships native FSDP2 support via DTensor;
- shares its codebase with QAT / PTQ.
Axolotl exposes both backends via the structured model_quantization_config
field, which discriminates on the backend name.
Usage
model_quantization_config is a structured discriminator: set exactly one of
bnb / torchao / mxfp4 / fp8. The adapter type is auto-detected from
the dtype, so users can write adapter: lora and the validator promotes it
to qlora for 4-bit dtypes.
# torchao QLoRA (auto-promotes adapter to qlora for int4 / nf4 / nvfp4)
adapter: lora
model_quantization_config:
  torchao:
    weight_dtype: int4   # one of: int4, nf4, nvfp4, int8, fp8
    # group_size: 128    # defaults: int4/int8 → 128, nf4 → 64, nvfp4 → 16

# bnb QLoRA shorthand (replaces `adapter: qlora` + `load_in_4bit: true`)
adapter: lora
model_quantization_config:
  bnb:
    weight_dtype: nf4    # qlora + load_in_4bit

# bnb 8-bit LoRA shorthand
adapter: lora
model_quantization_config:
  bnb:
    weight_dtype: int8   # lora + load_in_8bit

| Backend | weight_dtype | Effective adapter | Quant config installed |
|---|---|---|---|
| bnb | nf4 | qlora | BitsAndBytesConfig(load_in_4bit=True) |
| bnb | int8 | lora | BitsAndBytesConfig(load_in_8bit=True) |
| torchao | int4 | qlora | Int4WeightOnlyConfig |
| torchao | nf4 | qlora | NF4WeightOnlyConfig |
| torchao | nvfp4 | qlora | NVFP4WeightOnlyConfig |
| torchao | int8 | lora | Int8WeightOnlyConfig |
| torchao | fp8 | lora | Float8WeightOnlyConfig |
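
The table above can be read as a plain lookup. The following is a minimal sketch of that reading, not Axolotl's implementation: `resolve_quantization`, `QUANT_CONFIG`, and `DEFAULT_GROUP_SIZE` are hypothetical names, and the values are copied from the table and from the `group_size` comment in the config example. The sketch assumes that `group_size` only applies to the torchao branch and that dtypes without a listed default (fp8) get none.

```python
# Illustrative sketch only; all names here are hypothetical, not Axolotl's API.

# Dtypes that promote `adapter: lora` to qlora, per the table above.
FOUR_BIT_DTYPES = {"int4", "nf4", "nvfp4"}

# (backend, weight_dtype) -> name of the quant config listed in the table.
QUANT_CONFIG = {
    ("bnb", "nf4"): "BitsAndBytesConfig(load_in_4bit=True)",
    ("bnb", "int8"): "BitsAndBytesConfig(load_in_8bit=True)",
    ("torchao", "int4"): "Int4WeightOnlyConfig",
    ("torchao", "nf4"): "NF4WeightOnlyConfig",
    ("torchao", "nvfp4"): "NVFP4WeightOnlyConfig",
    ("torchao", "int8"): "Int8WeightOnlyConfig",
    ("torchao", "fp8"): "Float8WeightOnlyConfig",
}

# Defaults quoted in the config comment; fp8 has no documented default.
DEFAULT_GROUP_SIZE = {"int4": 128, "int8": 128, "nf4": 64, "nvfp4": 16}


def resolve_quantization(backend, weight_dtype, group_size=None):
    """Return the effective adapter and quant config for one table row."""
    key = (backend, weight_dtype)
    if key not in QUANT_CONFIG:
        raise ValueError(f"unsupported combination: {backend} / {weight_dtype}")

    resolved = {
        "adapter": "qlora" if weight_dtype in FOUR_BIT_DTYPES else "lora",
        "quant_config": QUANT_CONFIG[key],
    }
    if backend == "torchao":
        # Assumption: an explicit group_size overrides the per-dtype default.
        if group_size is None:
            group_size = DEFAULT_GROUP_SIZE.get(weight_dtype)
        resolved["group_size"] = group_size
    return resolved
```

For instance, `resolve_quantization("torchao", "nf4")` yields the qlora adapter with `NF4WeightOnlyConfig` and a group size of 64, while `("torchao", "mxfp4")` raises because no such row exists.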
Deprecations
There is only one canonical shape for base-quant LoRA training:
adapter: lora
model_quantization_config:
  bnb:                   # or torchao / mxfp4 / fp8
    weight_dtype: nf4

The following legacy shapes keep working. They are translated to the
canonical form at config load and emit a DEPRECATED: warning, and they will
be removed in a future release:

- adapter: qlora (with or without load_in_4bit: true) → adapter: lora plus
  model_quantization_config: {bnb: {weight_dtype: nf4}}. QLoRA is just “LoRA
  with a 4-bit base quant”; there is no separate adapter type.
- load_in_4bit: true (alone) →
  model_quantization_config: {bnb: {weight_dtype: nf4}}.
- load_in_8bit: true (alone) →
  model_quantization_config: {bnb: {weight_dtype: int8}}.
- model_quantization_config: Mxfp4Config (string) with
  model_quantization_config_kwargs: … →
  model_quantization_config: {mxfp4: {config_kwargs: …}}.
- model_quantization_config: FineGrainedFP8Config (string) →
  model_quantization_config: {fp8: {config_kwargs: …}}.
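
The translation rules above could be sketched roughly as follows. This is an illustration under stated assumptions, not the actual Axolotl code: `canonicalize` is a hypothetical helper, the config is modelled as a plain dict, the warning is emitted through Python's `warnings` module, and a structured `model_quantization_config` that is already present is assumed to take precedence over the legacy flags.

```python
import warnings

# Hypothetical helper for illustration; not Axolotl's implementation.
_LEGACY_STRINGS = {"Mxfp4Config": "mxfp4", "FineGrainedFP8Config": "fp8"}


def canonicalize(cfg: dict) -> dict:
    """Translate the legacy shapes listed above into the canonical form."""
    out = dict(cfg)  # shallow copy so the caller's dict is left untouched
    legacy = []

    # Legacy string form: Mxfp4Config / FineGrainedFP8Config (+ kwargs).
    mqc = out.get("model_quantization_config")
    if isinstance(mqc, str) and mqc in _LEGACY_STRINGS:
        kwargs = out.pop("model_quantization_config_kwargs", None) or {}
        out["model_quantization_config"] = {
            _LEGACY_STRINGS[mqc]: {"config_kwargs": kwargs}
        }
        legacy.append(f"model_quantization_config: {mqc}")

    # Legacy bnb flags and the legacy `adapter: qlora` spelling.
    load_4bit = bool(out.pop("load_in_4bit", False))
    load_8bit = bool(out.pop("load_in_8bit", False))
    dtype = None
    if out.get("adapter") == "qlora":
        out["adapter"] = "lora"
        dtype = "nf4"
        legacy.append("adapter: qlora")
    elif load_4bit:
        dtype = "nf4"
        legacy.append("load_in_4bit: true")
    elif load_8bit:
        dtype = "int8"
        legacy.append("load_in_8bit: true")

    # Assumption of this sketch: an existing structured config is kept as-is.
    if dtype is not None and not out.get("model_quantization_config"):
        out["model_quantization_config"] = {"bnb": {"weight_dtype": dtype}}

    for shape in legacy:
        warnings.warn(
            f"DEPRECATED: `{shape}` is a legacy shape; use the canonical "
            "`adapter: lora` + `model_quantization_config` form instead.",
            DeprecationWarning,
            stacklevel=2,
        )
    return out
```

Passing `{"adapter": "qlora", "load_in_4bit": True}` through this helper returns `adapter: lora` with the `bnb` / `nf4` branch set, which is the canonical shape shown at the top of this section.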
Mxfp4Config / FineGrainedFP8Config
Pre-quantized checkpoints typically encode their quant scheme in the
checkpoint’s own quantization_config; you don’t need to set
model_quantization_config in that case. For the cases where you do want to
override at load time, the legacy string form keeps working:
# Legacy string form (still works)
model_quantization_config: Mxfp4Config
model_quantization_config_kwargs: {}

…and has a parallel structured form:

model_quantization_config:
  mxfp4:
    config_kwargs: {}

model_quantization_config:
  fp8:
    config_kwargs: {}

MXFP4 (MoE only)
torchao has no weight-only MXFP4 config for arbitrary linear layers, so
model_quantization_config.torchao.weight_dtype: mxfp4 is rejected
by the loader. For Mixture-of-Experts models, the MXFP4 LoRA path lives behind
quantize_moe_experts: true with lora_target_parameters matching the fused
3D expert tensors (gate_up_proj / down_proj). See
Expert Quantization for that flow.
Mixed-quant models
model_quantization_config.torchao is a single-knob shorthand that installs
one TorchAoConfig covering every linear layer in the model. It does not
compose with other quantization mechanisms — combining them would produce
silent overrides or fights over the same tensors. The validator and loader
reject the conflicts explicitly:
| Combination | Outcome |
|---|---|
| model_quantization_config.torchao + quantize_moe_experts: true | rejected at validation |
| model_quantization_config.torchao + gptq: true | rejected at validation |
| model_quantization_config.torchao + checkpoint with embedded quant | rejected at load time |
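
As a rough illustration of the three rows above, the checks might look like the sketch below. The function names `validate_config` and `check_checkpoint` are hypothetical, the config is modelled as a plain dict, and the error messages are invented for the example; only the rejected combinations and the validation-time versus load-time split come from the table.

```python
# Illustrative sketch only; function names and messages are hypothetical.

def _uses_torchao(cfg: dict) -> bool:
    """True when the structured config selects the torchao branch."""
    mqc = cfg.get("model_quantization_config")
    return isinstance(mqc, dict) and "torchao" in mqc


def validate_config(cfg: dict) -> None:
    """Validation-time checks: conflicts visible from the config alone."""
    if not _uses_torchao(cfg):
        return
    if cfg.get("quantize_moe_experts"):
        raise ValueError(
            "model_quantization_config.torchao cannot be combined with "
            "quantize_moe_experts: true"
        )
    if cfg.get("gptq"):
        raise ValueError(
            "model_quantization_config.torchao cannot be combined with gptq: true"
        )


def check_checkpoint(cfg: dict, checkpoint_quantization_config) -> None:
    """Load-time check: only possible once the checkpoint config is known."""
    if _uses_torchao(cfg) and checkpoint_quantization_config:
        raise ValueError(
            "model_quantization_config.torchao cannot be applied to a "
            "checkpoint that already embeds a quantization_config"
        )
```

The last check can only run at load time because the checkpoint's own `quantization_config` is not known until the checkpoint is opened, which matches the "rejected at load time" row.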
Modern quantized checkpoints encode the mix via a per-module exclusion
list in their own quantization_config. For example,
amd/Kimi-K2.6-MXFP4 ships
with quant_method: quark, MXFP4 weights, and ~305 excluded modules
(every attention projection, lm_head, the vision tower, the
mm_projector). Loading that checkpoint just lets transformers honor the
checkpoint’s own scheme — attention layers stay in their native dtype,
only the listed modules get quantized.
If you want experts in MXFP4 + attention in bf16, drop
model_quantization_config entirely and pick one of:
- Load a checkpoint that already carries the right quantization_config
  (gpt-oss MXFP4, AMD Quark MXFP4, AWQ/GPTQ, …). Axolotl forwards the
  checkpoint’s config to transformers unchanged, exclusion list and all.
- For a bf16 MoE checkpoint where you want MXFP4 experts at load time, set
  quantize_moe_experts: true. Attention stays in its checkpoint dtype; the
  expert tensors get quantized via the path landed in #3663.

Attention-only LoRA on a MoE model with quantized experts is the natural training-time recipe; see Expert Quantization.
Constraints
- DoRA works with torchao (the kernels dequantize through the unified
  dequantize_weight helper). Other PEFT extras follow standard LoRA rules.
- Merging adapters with axolotl merge-lora requires merge_method: legacy.
  The memory-efficient merger simulates bnb’s NF4 quantization and does not
  yet understand torchao tensor subclasses.
- PEFT’s TorchaoLoraLinear only supports INT8. Axolotl patches
  dispatch_torchao so INT4 / NF4 weights fall back to standard Linear LoRA
  layers (the kernels dequantize the base weight on access).
- load_in_4bit and load_in_8bit are bnb-only. They cannot be combined with
  model_quantization_config.torchao; the validator rejects the config. (The
  bnb branch sets these flags automatically as part of auto-promotion.)
Example config
FSDP2
torchao tensor subclasses are first-class DTensor citizens, so the bnb-specific
FSDP2 patches (sharded-init shims, Linear8bitLt save patch) are skipped
automatically when model_quantization_config.torchao is in use.
adapter: lora
model_quantization_config:
  torchao:
    weight_dtype: nf4
fsdp_version: 2
fsdp_config:
  fsdp_offload_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP