DoRA
Background
Citation: DoRA: Weight-Decomposed Low-Rank Adaptation (Liu et al., 2024)
DoRA aims to close the accuracy gap between LoRA and full fine-tuning. It decomposes each pre-trained weight matrix into magnitude and direction components and fine-tunes both, using LoRA only for the directional update so that the number of trainable parameters stays small. This improves both learning capacity and training stability without adding inference overhead.
Quick Facts
DoRA uses weight-decomposed fine-tuning to extend LoRA with magnitude-direction decomposition.
DoRA can often, though not always (our experiments are one exception), reach accuracy close to full fine-tuning with nearly the same parameter count as LoRA; the only addition is the magnitude vector.
DoRA introduces no additional inference latency when weights are merged.
Algorithmic Idea
LoRA’s limitations come from coupling magnitude and direction updates. DoRA separates those components, enabling more fine-grained adaptation that more closely matches full fine-tuning.
For a pre-trained weight matrix \(\mathbf{W}_0\), DoRA decomposes it into a magnitude vector \(\mathbf{m}\) and a direction matrix \(\mathbf{V}\), where \(\mathbf{m} = ||\mathbf{W}_0||_c\) (the column-wise norm) and \(\mathbf{V} = \mathbf{W}_0\). The magnitude vector holds the \(\ell_2\) norm of each column. The direction matrix is initialized to the original weights and is normalized column-wise whenever the weight is reconstructed, so that \(\mathbf{W}_0 = \mathbf{m} \frac{\mathbf{V}}{||\mathbf{V}||_c}\).
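The decomposition can be checked numerically on a tiny matrix. The following is a minimal, dependency-free sketch rather than the FinLoRA or PEFT implementation; the helper names (`column_norms`, `decompose`, `recompose`) and the 2x2 example matrix are illustrative assumptions.

```python
import math

def column_norms(W):
    """Return the l2 norm of each column of a matrix stored as a list of rows."""
    return [math.sqrt(sum(row[j] ** 2 for row in W)) for j in range(len(W[0]))]

def decompose(W0):
    """Split W0 into a magnitude vector m (one norm per column) and V = W0."""
    m = column_norms(W0)
    V = [row[:] for row in W0]  # V starts as a copy of the original weights
    return m, V

def recompose(m, V):
    """Rebuild m * V / ||V||_c by rescaling column j with m[j] / ||V[:, j]||."""
    norms = column_norms(V)
    return [[m[j] * row[j] / norms[j] for j in range(len(row))] for row in V]

# Hypothetical 2x2 weight matrix (d = 2, k = 2), chosen so the norms are easy to read.
W0 = [[3.0, 0.0],
      [4.0, 2.0]]

m, V = decompose(W0)
W_rebuilt = recompose(m, V)

print(m)          # [5.0, 2.0]
print(W_rebuilt)  # same entries as W0
```

Because \(\mathbf{V}\) is divided by its own column norms before being scaled by \(\mathbf{m}\), the reconstruction reproduces \(\mathbf{W}_0\) before any training takes place.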
During fine-tuning, the following hold true:
\(\mathbf{W}_0\) is decomposed into a magnitude component \(\mathbf{m}\) and a direction component \(\mathbf{V}\).
Only the direction matrix receives LoRA updates \(\Delta\mathbf{V} = \mathbf{B}\mathbf{A}\) while the magnitude vector is trained directly.
The forward pass becomes: \(\mathbf{W}' = \mathbf{m} \frac{\mathbf{V} + \Delta\mathbf{V}}{||\mathbf{V} + \Delta\mathbf{V}||_c}\).
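The steps above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the training code used in FinLoRA: the helper names, the 2x2 matrix, the rank \(r = 1\), and the zero initialization of \(\mathbf{B}\) (borrowed from standard LoRA practice) are all assumptions made for the example.

```python
import math

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def column_norms(W):
    """Return the l2 norm of each column."""
    return [math.sqrt(sum(row[j] ** 2 for row in W)) for j in range(len(W[0]))]

def dora_weight(m, V, B, A):
    """Compute W' = m * (V + BA) / ||V + BA||_c."""
    delta = matmul(B, A)  # low-rank directional update, shape d x k
    updated = [[V[i][j] + delta[i][j] for j in range(len(V[0]))]
               for i in range(len(V))]
    norms = column_norms(updated)
    return [[m[j] * updated[i][j] / norms[j] for j in range(len(V[0]))]
            for i in range(len(V))]

# Hypothetical frozen weights with d = 2, k = 2, and rank r = 1.
W0 = [[3.0, 0.0],
      [4.0, 2.0]]
m = column_norms(W0)  # magnitude vector, trained directly in DoRA
V = W0                # direction matrix, updated only through B and A

A = [[0.5, -1.0]]          # shape r x k
B_zero = [[0.0], [0.0]]    # shape d x r, zero at initialization (assumed)
W_init = dora_weight(m, V, B_zero, A)

B = [[1.0], [-2.0]]        # a non-zero B, as if after some training steps
W_new = dora_weight(m, V, B, A)
new_norms = column_norms(W_new)
```

With \(\mathbf{B} = \mathbf{0}\), `W_init` matches `W0`, so fine-tuning starts from the pre-trained weights. After the directional update, the column norms of `W_new` still equal `m`: the low-rank update changes only direction, and magnitude changes only when `m` itself is trained.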
Key Equations
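For a pre-trained weight matrix \(\mathbf{W}_0 \in \mathbb{R}^{d \times k}\), the DoRA decomposition follows:
\[\mathbf{W}_0 = \mathbf{m} \frac{\mathbf{V}}{||\mathbf{V}||_c}\]
Where the weight decomposition is:
\[\mathbf{m} = ||\mathbf{W}_0||_c, \quad \mathbf{V} = \mathbf{W}_0\]
The updated weight matrix becomes:
\[\mathbf{W}' = \mathbf{m} \frac{\mathbf{V} + \Delta\mathbf{V}}{||\mathbf{V} + \Delta\mathbf{V}||_c} = \mathbf{m} \frac{\mathbf{W}_0 + \mathbf{B}\mathbf{A}}{||\mathbf{W}_0 + \mathbf{B}\mathbf{A}||_c}\]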
Where:
\(\mathbf{m} \in \mathbb{R}^{1 \times k}\) is the magnitude vector of column-wise \(\ell_2\) norms.
\(\mathbf{V} \in \mathbb{R}^{d \times k}\) is the direction matrix initialized as \(\mathbf{W}_0\).
\(\mathbf{A} \in \mathbb{R}^{r \times k}\) and \(\mathbf{B} \in \mathbb{R}^{d \times r}\) are the LoRA adaptation matrices for directional updates.
\(||\cdot||_c\) denotes the column-wise \(\ell_2\) norm operation.
The number of trainable parameters is:
\[|\Theta| = L_{\text{DoRA}} \times (2 \times d_{\text{model}} \times r + k)\]
where the \(k\) term accounts for the magnitude vector of each adapted matrix, \(L_{\text{DoRA}}\) is the number of weight matrices DoRA is applied to, \(d_{\text{model}}\) is the model dimension, and \(r\) is the rank.
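As a rough check on the parameter budget, the count can be worked out from the shapes given above: \(\mathbf{B}\) has \(d \times r\) entries, \(\mathbf{A}\) has \(r \times k\), and \(\mathbf{m}\) has \(k\). The sketch below assumes square matrices with \(d = k = 4096\) and a hypothetical 32 adapted matrices; the function names and both figures are illustrative, and real models may adapt matrices of differing shapes.

```python
def lora_params(d, k, r):
    """Trainable entries in B (d x r) plus A (r x k)."""
    return r * (d + k)

def dora_params(d, k, r):
    """LoRA parameters plus the magnitude vector m (1 x k)."""
    return lora_params(d, k, r) + k

def total_dora_params(num_matrices, d_model, r):
    """Total over num_matrices square d_model x d_model matrices (an assumption)."""
    return num_matrices * dora_params(d_model, d_model, r)

per_matrix_lora = lora_params(4096, 4096, 8)
per_matrix_dora = dora_params(4096, 4096, 8)
total = total_dora_params(32, 4096, 8)  # 32 is a hypothetical matrix count

print(per_matrix_lora)  # 65536
print(per_matrix_dora)  # 69632
print(total)            # 2228224
```

Under these assumptions the magnitude vector adds 4096 parameters per matrix, about 6% on top of the LoRA adapter at rank 8, and its relative share shrinks as the rank grows.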
Implementation in FinLoRA
To use DoRA in FinLoRA, configure fine-tuning with DoRA enabled:
python lora/finetune.py sentiment_llama_3_1_8b_8bits_r8_dora
Configuration example from lora/finetune_configs.json:
"sentiment_llama_3_1_8b_8bits_r8_dora": {
    "base_model": "meta-llama/Llama-3.1-8B-Instruct",
    "dataset_path": "../data/train/finlora_sentiment_train.jsonl",
    "lora_r": 8,
    "quant_bits": 8,
    "peft_use_dora": true,
    "learning_rate": 0.0001,
    "num_epochs": 4,
    "batch_size": 8,
    "gradient_accumulation_steps": 2
}
Key parameters:
- lora_r: The rank \(r\) of the LoRA adapter (typically 8-16 for DoRA)
- quant_bits: The quantization bits (8 or 4, same as vanilla LoRA)
- peft_use_dora: Enable DoRA decomposition (set to true)
- lora_alpha: The scaling parameter \(\alpha\) (default: 16, giving \(\gamma_r = \alpha/r\))
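To see how these parameters interact, the sketch below parses a trimmed copy of the configuration entry and derives two quantities: the scaling factor \(\alpha/r\) and the effective batch size. It is only an illustration. The `summarize` helper is hypothetical and not part of FinLoRA, the fallback to `lora_alpha = 16` follows the default stated above, and the batch-size product ignores any multi-GPU data parallelism.

```python
import json

# Trimmed copy of the configuration entry, wrapped in braces to make it valid JSON.
config_text = """
{
  "sentiment_llama_3_1_8b_8bits_r8_dora": {
    "base_model": "meta-llama/Llama-3.1-8B-Instruct",
    "lora_r": 8,
    "quant_bits": 8,
    "peft_use_dora": true,
    "batch_size": 8,
    "gradient_accumulation_steps": 2
  }
}
"""

def summarize(entry, default_alpha=16):
    """Derive a few quantities from one config entry (hypothetical helper)."""
    alpha = entry.get("lora_alpha", default_alpha)  # assumed default when the key is absent
    return {
        "use_dora": bool(entry.get("peft_use_dora", False)),
        "scaling": alpha / entry["lora_r"],
        "effective_batch_size": entry["batch_size"] * entry["gradient_accumulation_steps"],
    }

configs = json.loads(config_text)
summary = summarize(configs["sentiment_llama_3_1_8b_8bits_r8_dora"])
print(summary)
# {'use_dora': True, 'scaling': 2.0, 'effective_batch_size': 16}
```

With the example values, the low-rank update is scaled by 2.0 and each optimizer step accumulates gradients over 16 samples per device.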
Usage Example
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch
# Load base model
base_model_name = "meta-llama/Llama-3.1-8B-Instruct"
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)
# Load DoRA adapter on top of the base model
adapter_path = "./lora_adapters/8bits_r8_dora/sentiment_llama_3_1_8b_8bits_r8_dora"
model = PeftModel.from_pretrained(base_model, adapter_path)
model.eval()
# Tokenize the prompt and move the tensors to the model's device
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
prompt = "The financial markets showed positive sentiment today"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Generate text with greedy decoding (do_sample=False stands in for temperature=0,
# which generate() does not accept when sampling is enabled)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
References
Why This Method?
DoRA is important because it addresses a fundamental limitation of LoRA that causes its accuracy to lag behind full fine-tuning. It shows how decomposing weights into magnitude and direction components can improve fine-tuning by allowing the model to capture more fine-grained patterns.
Useful Links
NVIDIA DoRA Implementation - Official implementation by NVIDIA
NVIDIA Technical Blog: Introducing DoRA - Technical blog about DoRA by NVIDIA
Axolotl - Training framework with DoRA support used in FinLoRA