DoRA 

Background 

Citation: DoRA: Weight-Decomposed Low-Rank Adaptation (Liu et al., 2024)

DoRA aims to close the accuracy gap between LoRA and full fine-tuning. It decomposes pre-trained weights into magnitude and direction components and fine-tunes both, applying LoRA only to the directional update to keep the number of trainable parameters small. This improves learning capacity and training stability without adding inference overhead.

Quick Facts 

DoRA uses weight-decomposed fine-tuning to extend LoRA with magnitude-direction decomposition.
DoRA often, though not always (our experiments are one exception), achieves accuracy close to full fine-tuning while keeping nearly the same parameter count as LoRA.
DoRA introduces no additional inference latency when weights are merged.

Algorithmic Idea 

LoRA’s limitations come from coupling magnitude and direction updates. DoRA separates those components, enabling more fine-grained adaptation that more closely matches full fine-tuning.

For a pre-trained weight matrix \(\mathbf{W}_0\), DoRA decomposes it into a magnitude vector \(\mathbf{m}\) and a direction matrix \(\mathbf{V}\), where \(\mathbf{m} = ||\mathbf{W}_0||_c\) (the column-wise norm) and \(\mathbf{V} = \mathbf{W}_0\). The magnitude vector holds the \(\ell_2\) norm of each column, while the direction matrix is initialized to the original weight matrix and divided by its column-wise norms to obtain unit-length directions.

During fine-tuning, the following hold true:

\(\mathbf{W}_0\) is decomposed into a magnitude component \(\mathbf{m}\) and a direction component \(\mathbf{V}\).
Only the direction matrix receives LoRA updates \(\Delta\mathbf{V} = \mathbf{B}\mathbf{A}\) while the magnitude vector is trained directly.
The forward pass uses the recomposed weight: \(\mathbf{W}' = \mathbf{m} \frac{\mathbf{V} + \Delta\mathbf{V}}{||\mathbf{V} + \Delta\mathbf{V}||_c}\).
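
The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the FinLoRA or PEFT implementation: matrices are lists of rows, the helper names (`column_norms`, `matmul`, `dora_weight`) are hypothetical, and the numeric values are made up.

```python
import math

def column_norms(W):
    """Return the l2 norm of each column of W (a list of rows)."""
    return [math.sqrt(sum(row[j] ** 2 for row in W)) for j in range(len(W[0]))]

def matmul(B, A):
    """Multiply a d x r matrix B by an r x k matrix A."""
    return [[sum(B[i][t] * A[t][j] for t in range(len(A)))
             for j in range(len(A[0]))] for i in range(len(B))]

def dora_weight(W0, m, B, A):
    """Compute W' = m * (W0 + BA) / ||W0 + BA||_c, column by column."""
    delta = matmul(B, A)
    V = [[w + dv for w, dv in zip(w_row, d_row)] for w_row, d_row in zip(W0, delta)]
    norms = column_norms(V)
    return [[m[j] * v / norms[j] for j, v in enumerate(row)] for row in V]

# Toy 2 x 2 weight with rank-1 adapters (illustrative values only).
W0 = [[3.0, 0.0], [4.0, 2.0]]
m = column_norms(W0)          # magnitude vector initialized from W0
A = [[0.5, -1.0]]             # r x k
B_zero = [[0.0], [0.0]]       # d x r, zero-initialized as in LoRA
B_trained = [[1.0], [0.0]]    # hypothetical values after some training

W_init = dora_weight(W0, m, B_zero, A)     # equals W0 before any update
W_new = dora_weight(W0, m, B_trained, A)   # direction changed, column norms still m
```

With `B_zero`, the recomposed weight reproduces `W0`. With `B_trained`, the column directions change but `column_norms(W_new)` still equals `m`, so in this sketch the column magnitudes are controlled only by the trainable vector `m`.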

Key Equations 

For a pre-trained weight matrix \(\mathbf{W}_0 \in \mathbb{R}^{d \times k}\), the DoRA decomposition follows:

\[\mathbf{W}_0 = \mathbf{m} \frac{\mathbf{V}}{||\mathbf{V}||_c} = ||\mathbf{W}_0||_c \frac{\mathbf{W}_0}{||\mathbf{W}_0||_c}\]

Where the weight decomposition is:

\[\mathbf{m} = ||\mathbf{W}_0||_c, \quad \mathbf{V} = \mathbf{W}_0\]

The updated weight matrix becomes:

\[\mathbf{W}' = \mathbf{m} \frac{\mathbf{V} + \Delta\mathbf{V}}{||\mathbf{V} + \Delta\mathbf{V}||_c} = \mathbf{m} \frac{\mathbf{W}_0 + \mathbf{B}\mathbf{A}}{||\mathbf{W}_0 + \mathbf{B}\mathbf{A}||_c}\]

Where:

\(\mathbf{m} \in \mathbb{R}^{1 \times k}\) is the magnitude vector of column-wise \(\ell_2\) norms.
\(\mathbf{V} \in \mathbb{R}^{d \times k}\) is the direction matrix initialized as \(\mathbf{W}_0\).
\(\mathbf{A} \in \mathbb{R}^{r \times k}\) and \(\mathbf{B} \in \mathbb{R}^{d \times r}\) are the LoRA adaptation matrices for directional updates.
\(||\cdot||_c\) denotes the column-wise \(\ell_2\) norm operation.
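
As a short worked check of the update rule, it can help to write it per column. Here \(\mathbf{w}_{0,j}\) and \((\mathbf{B}\mathbf{A})_j\) denote the \(j\)-th columns of \(\mathbf{W}_0\) and \(\mathbf{B}\mathbf{A}\), and \(m_j\) is the \(j\)-th entry of \(\mathbf{m}\); this column notation is introduced here only for illustration:

\[\mathbf{w}'_j = m_j \frac{\mathbf{w}_{0,j} + (\mathbf{B}\mathbf{A})_j}{||\mathbf{w}_{0,j} + (\mathbf{B}\mathbf{A})_j||_2}, \quad j = 1, \dots, k\]

Assuming \(\mathbf{B}\) is zero-initialized as in standard LoRA, \(\mathbf{B}\mathbf{A} = \mathbf{0}\) at the start of training, and with \(m_j = ||\mathbf{w}_{0,j}||_2\) each column reduces to:

\[\mathbf{w}'_j = ||\mathbf{w}_{0,j}||_2 \frac{\mathbf{w}_{0,j}}{||\mathbf{w}_{0,j}||_2} = \mathbf{w}_{0,j}\]

so \(\mathbf{W}' = \mathbf{W}_0\) before any update, and fine-tuning starts from the pre-trained weights.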

The number of trainable parameters is:

\[|\Theta| = L_{\text{DoRA}} \times (k + 2 \times d_{\text{model}} \times r)\]

where \(k\) accounts for the magnitude vector of each adapted matrix, \(L_{\text{DoRA}}\) is the number of weight matrices DoRA is applied to, \(d_{\text{model}}\) is the model dimension, and \(r\) is the rank. The LoRA term \(2 \times d_{\text{model}} \times r\) assumes square matrices with \(d = k = d_{\text{model}}\).
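
To get a feel for the sizes involved, the count can be evaluated for concrete numbers. The sketch below is illustrative only: it assumes square \(d_{\text{model}} \times d_{\text{model}}\) matrices (so \(k = d_{\text{model}}\)), assumes each adapted matrix carries its own magnitude vector, and uses hypothetical values (4 adapted matrices, \(d_{\text{model}} = 4096\), \(r = 8\)). The function name `dora_trainable_params` is made up for this example.

```python
def dora_trainable_params(num_matrices, d_model, rank):
    """Count trainable DoRA parameters, assuming square d_model x d_model matrices."""
    lora_params = 2 * num_matrices * d_model * rank   # A and B for every adapted matrix
    magnitude_params = num_matrices * d_model         # one magnitude vector per matrix
    return lora_params, magnitude_params

# Hypothetical setup: 4 adapted matrices, d_model = 4096, rank 8.
lora_params, magnitude_params = dora_trainable_params(num_matrices=4, d_model=4096, rank=8)
total = lora_params + magnitude_params
print(lora_params, magnitude_params, total)  # 262144 16384 278528
```

In this setup the magnitude vectors add 1/16 (6.25%) on top of the LoRA parameter count, which is why DoRA stays close to LoRA in trainable parameters.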

Implementation in FinLoRA 

To use DoRA in FinLoRA, configure fine-tuning with DoRA enabled:

python lora/finetune.py sentiment_llama_3_1_8b_8bits_r8_dora

Configuration example from lora/finetune_configs.json:

"sentiment_llama_3_1_8b_8bits_r8_dora": {
  "base_model": "meta-llama/Llama-3.1-8B-Instruct",
  "dataset_path": "../data/train/finlora_sentiment_train.jsonl",
  "lora_r": 8,
  "quant_bits": 8,
  "peft_use_dora": true,
  "learning_rate": 0.0001,
  "num_epochs": 4,
  "batch_size": 8,
  "gradient_accumulation_steps": 2
}

Key parameters:

lora_r: The rank \(r\) of the LoRA adapter (typically 8-16 for DoRA).
quant_bits: The quantization bits (8 or 4, same as vanilla LoRA).
peft_use_dora: Enable DoRA decomposition (set to true).
lora_alpha: The scaling parameter \(\alpha\) (default: 16, giving \(\gamma_r = \alpha/r\)).
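
To make the role of these keys concrete, the sketch below reads a trimmed copy of the configuration entry and derives adapter settings from it. How FinLoRA actually consumes these keys is not shown here, so the mapping is an assumption: `to_lora_kwargs` is a hypothetical helper, and the output names (`r`, `lora_alpha`, `use_dora`) are chosen to match arguments of PEFT's `LoraConfig`.

```python
import json

# Trimmed copy of the configuration entry shown above.
config_text = """
{
  "sentiment_llama_3_1_8b_8bits_r8_dora": {
    "base_model": "meta-llama/Llama-3.1-8B-Instruct",
    "lora_r": 8,
    "quant_bits": 8,
    "peft_use_dora": true
  }
}
"""

def to_lora_kwargs(entry, default_alpha=16):
    """Hypothetical helper: translate a config entry into LoraConfig-style kwargs."""
    return {
        "r": entry["lora_r"],
        "lora_alpha": entry.get("lora_alpha", default_alpha),  # falls back to the default
        "use_dora": entry.get("peft_use_dora", False),
    }

entry = json.loads(config_text)["sentiment_llama_3_1_8b_8bits_r8_dora"]
lora_kwargs = to_lora_kwargs(entry)
scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]  # gamma_r = alpha / r

# With PEFT installed, these kwargs could be passed on as
# peft.LoraConfig(task_type="CAUSAL_LM", **lora_kwargs).
```

For this entry, `lora_alpha` is not set and falls back to 16, so the scaling factor is \(\gamma_r = 16/8 = 2\).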

Usage Example 

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

# Load base model
base_model_name = "meta-llama/Llama-3.1-8B-Instruct"
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load DoRA adapter
adapter_path = "./lora_adapters/8bits_r8_dora/sentiment_llama_3_1_8b_8bits_r8_dora"
model = PeftModel.from_pretrained(base_model, adapter_path)

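# Generate text
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
prompt = "The financial markets showed positive sentiment today"
# Move the tokenized inputs to the same device as the model
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    # do_sample=False selects greedy decoding explicitly; temperature only applies when sampling
    outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False)

# Decode the generated token IDs back into text
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)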

References 

Why This Method?

DoRA is important because it addresses a fundamental limitation of LoRA that causes it to lag behind full fine-tuning in accuracy. It shows how decomposing weights into magnitude and direction components can improve fine-tuning by allowing the model to capture more fine-grained patterns.

Useful Links 

NVIDIA DoRA Implementation - Official implementation by NVIDIA
NVIDIA Technical Blog: Introducing DoRA - Technical blog about DoRA by NVIDIA
Axolotl - Training framework with DoRA support used in FinLoRA