DoRA

Background

Citation: DoRA: Weight-Decomposed Low-Rank Adaptation (Liu et al., 2024)

DoRA aims to close the accuracy gap between LoRA and full fine-tuning. It decomposes pre-trained weights into magnitude and direction components and fine-tunes both, using LoRA only for the directional updates to keep the number of trainable parameters small. This improves both learning capacity and training stability without adding inference overhead.

Quick Facts

  1. DoRA uses weight-decomposed fine-tuning to extend LoRA with magnitude-direction decomposition.

  2. DoRA can often, though not in every case (ours included), achieve accuracy close to full fine-tuning while keeping roughly the same parameter count as LoRA.

  3. DoRA introduces no additional inference latency when weights are merged.

Algorithmic Idea

LoRA’s limitations come from coupling magnitude and direction updates. DoRA separates those components, enabling more fine-grained adaptation that more closely matches full fine-tuning.

For a pre-trained weight matrix \(\mathbf{W}_0\), DoRA decomposes it into a magnitude vector \(\mathbf{m}\) and a direction matrix \(\mathbf{V}\), where \(\mathbf{m} = ||\mathbf{W}_0||_c\) (the column-wise norm) and \(\mathbf{V} = \mathbf{W}_0\). The magnitude vector holds the \(\ell_2\) norm of each column of \(\mathbf{W}_0\), while the direction matrix starts as the original weight matrix and is divided by its column-wise norm to give unit-norm directions.

During fine-tuning, the following hold true:

  1. \(\mathbf{W}_0\) is decomposed into a magnitude component \(\mathbf{m}\) and a direction component \(\mathbf{V}\).

  2. Only the direction matrix receives LoRA updates \(\Delta\mathbf{V} = \mathbf{B}\mathbf{A}\) while the magnitude vector is trained directly.

  3. The forward pass becomes: \(\mathbf{W}' = \mathbf{m} \frac{\mathbf{V} + \Delta\mathbf{V}}{||\mathbf{V} + \Delta\mathbf{V}||_c}\).
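
The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration on a toy matrix, not FinLoRA's or PEFT's implementation: the helper names (`column_norms`, `matmul`, `dora_weight`) and all numeric values are assumptions made for the example, and lists of rows stand in for tensors to keep it dependency-free.

```python
import math

def column_norms(W):
    """Column-wise l2 norms of a matrix stored as a list of rows."""
    return [math.sqrt(sum(row[j] ** 2 for row in W)) for j in range(len(W[0]))]

def matmul(B, A):
    """Plain matrix product of B (d x r) and A (r x k)."""
    return [[sum(B[i][t] * A[t][j] for t in range(len(A)))
             for j in range(len(A[0]))] for i in range(len(B))]

def dora_weight(W0, m, B, A):
    """W' = m * (W0 + BA) / ||W0 + BA||_c, applied column by column."""
    delta = matmul(B, A)
    V = [[W0[i][j] + delta[i][j] for j in range(len(W0[0]))]
         for i in range(len(W0))]
    norms = column_norms(V)
    return [[m[j] * V[i][j] / norms[j] for j in range(len(V[0]))]
            for i in range(len(V))]

# Toy 2 x 2 weight with rank r = 1 (values chosen only for illustration).
W0 = [[3.0, 0.0], [4.0, 2.0]]
m = column_norms(W0)   # magnitude vector, initialised from W0
B = [[0.5], [0.0]]     # d x r
A = [[1.0, 1.0]]       # r x k

W_new = dora_weight(W0, m, B, A)
print(column_norms(W_new))  # matches m up to floating-point rounding
```

Because the updated direction is renormalized column by column, the low-rank update changes only where each column points. Its length is set entirely by \(\mathbf{m}\), which is why the magnitude can be trained separately.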

Key Equations

For a pre-trained weight matrix \(\mathbf{W}_0 \in \mathbb{R}^{d \times k}\), the DoRA decomposition follows:

\[\mathbf{W}_0 = \mathbf{m} \frac{\mathbf{V}}{||\mathbf{V}||_c} = ||\mathbf{W}_0||_c \frac{\mathbf{W}_0}{||\mathbf{W}_0||_c}\]

Where the weight decomposition is:

\[\mathbf{m} = ||\mathbf{W}_0||_c, \quad \mathbf{V} = \mathbf{W}_0\]

The updated weight matrix becomes:

\[\mathbf{W}' = \mathbf{m} \frac{\mathbf{V} + \Delta\mathbf{V}}{||\mathbf{V} + \Delta\mathbf{V}||_c} = \mathbf{m} \frac{\mathbf{W}_0 + \mathbf{B}\mathbf{A}}{||\mathbf{W}_0 + \mathbf{B}\mathbf{A}||_c}\]

Where:

  1. \(\mathbf{m} \in \mathbb{R}^{1 \times k}\) is the magnitude vector of column-wise \(\ell_2\) norms.

  2. \(\mathbf{V} \in \mathbb{R}^{d \times k}\) is the direction matrix initialized as \(\mathbf{W}_0\).

  3. \(\mathbf{A} \in \mathbb{R}^{r \times k}\) and \(\mathbf{B} \in \mathbb{R}^{d \times r}\) are the LoRA adaptation matrices for directional updates.

  4. \(||\cdot||_c\) denotes the column-wise \(\ell_2\) norm operation.
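
As a quick check of these definitions, consider an illustrative \(2 \times 2\) matrix (the values are arbitrary and chosen so the norms are round numbers):

\[\mathbf{W}_0 = \begin{pmatrix} 3 & 0 \\ 4 & 2 \end{pmatrix}, \quad \mathbf{m} = ||\mathbf{W}_0||_c = \begin{pmatrix} 5 & 2 \end{pmatrix}, \quad \frac{\mathbf{V}}{||\mathbf{V}||_c} = \begin{pmatrix} 0.6 & 0 \\ 0.8 & 1 \end{pmatrix}\]

If \(\mathbf{B}\) starts at zero, as in the usual LoRA initialization, then \(\mathbf{B}\mathbf{A} = \mathbf{0}\) and the updated weight reduces to the original one:

\[\mathbf{W}' = \mathbf{m} \frac{\mathbf{W}_0 + \mathbf{0}}{||\mathbf{W}_0 + \mathbf{0}||_c} = \begin{pmatrix} 5 \cdot 0.6 & 2 \cdot 0 \\ 5 \cdot 0.8 & 2 \cdot 1 \end{pmatrix} = \begin{pmatrix} 3 & 0 \\ 4 & 2 \end{pmatrix} = \mathbf{W}_0\]

Under that assumption, fine-tuning starts exactly from the pre-trained model.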

The number of trainable parameters is:

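\[|\Theta| = L_{\text{DoRA}} \times (k + 2 \times d_{\text{model}} \times r)\]

where \(k\) accounts for the magnitude vector of each adapted matrix, \(L_{\text{DoRA}}\) is the number of weight matrices DoRA is applied to, \(d_{\text{model}}\) is the model dimension (each adapted matrix is assumed to be \(d_{\text{model}} \times d_{\text{model}}\), so \(k = d_{\text{model}}\)), and \(r\) is the rank.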
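
A minimal sketch of this count, under the assumption that each adapted \(d \times k\) matrix carries its own magnitude vector of length \(k\) plus LoRA factors \(\mathbf{B}\) and \(\mathbf{A}\). For square matrices the LoRA part equals the \(2 \times L_{\text{DoRA}} \times d_{\text{model}} \times r\) term in the formula. The function name and the example shapes are illustrative, not taken from FinLoRA.

```python
def dora_trainable_params(shapes, r):
    """Count DoRA trainable parameters over a list of (d, k) weight shapes.

    Assumption: every adapted matrix has a magnitude vector of length k
    and LoRA factors B (d x r) and A (r x k).
    Returns (magnitude_params, lora_params).
    """
    magnitude = sum(k for _, k in shapes)
    lora = sum(r * (d + k) for d, k in shapes)
    return magnitude, lora

# Hypothetical setup: 4 square matrices with d_model = 4096 and rank r = 8.
magnitude, lora = dora_trainable_params([(4096, 4096)] * 4, r=8)
print(magnitude, lora, magnitude + lora)
```

Passing explicit shapes rather than a single \(d_{\text{model}}\) also covers non-square projection matrices, where the LoRA term is \(r(d + k)\) instead of \(2 \times d_{\text{model}} \times r\).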

Implementation in FinLoRA

To use DoRA in FinLoRA, configure fine-tuning with DoRA enabled:

python lora/finetune.py sentiment_llama_3_1_8b_8bits_r8_dora

Configuration example from lora/finetune_configs.json:

"sentiment_llama_3_1_8b_8bits_r8_dora": {
  "base_model": "meta-llama/Llama-3.1-8B-Instruct",
  "dataset_path": "../data/train/finlora_sentiment_train.jsonl",
  "lora_r": 8,
  "quant_bits": 8,
  "peft_use_dora": true,
  "learning_rate": 0.0001,
  "num_epochs": 4,
  "batch_size": 8,
  "gradient_accumulation_steps": 2
}

Key parameters:

  - lora_r: The rank \(r\) of the LoRA adapter (typically 8-16 for DoRA).

  - quant_bits: The quantization bits (8 or 4, same as vanilla LoRA).

  - peft_use_dora: Enable DoRA decomposition (set to true).

  - lora_alpha: The scaling parameter \(\alpha\) (default: 16, giving \(\gamma_r = \alpha/r\)).
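
To make the role of these parameters concrete, the sketch below maps a config entry to keyword arguments for PEFT's `LoraConfig`, whose `r`, `lora_alpha`, and `use_dora` arguments exist in PEFT. The mapping itself and the function name `lora_kwargs_from_config` are assumptions for illustration; lora/finetune.py may translate these keys differently.

```python
import json

def lora_kwargs_from_config(cfg):
    """Map a FinLoRA-style config entry to keyword arguments for peft.LoraConfig.

    Hypothetical mapping for illustration only.
    """
    return {
        "r": cfg["lora_r"],
        "lora_alpha": cfg.get("lora_alpha", 16),  # default listed under key parameters
        "use_dora": bool(cfg.get("peft_use_dora", False)),
    }

cfg = json.loads('{"lora_r": 8, "quant_bits": 8, "peft_use_dora": true}')
kwargs = lora_kwargs_from_config(cfg)
scaling = kwargs["lora_alpha"] / kwargs["r"]  # gamma_r = alpha / r
print(kwargs, scaling)

# With PEFT installed, the arguments could be passed on as:
#   from peft import LoraConfig
#   lora_config = LoraConfig(**kwargs)
```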

Usage Example

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

# Load base model
base_model_name = "meta-llama/Llama-3.1-8B-Instruct"
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load DoRA adapter
adapter_path = "./lora_adapters/8bits_r8_dora/sentiment_llama_3_1_8b_8bits_r8_dora"
model = PeftModel.from_pretrained(base_model, adapter_path)

# Generate text
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
prompt = "The financial markets showed positive sentiment today"
# Move the input tensors to the same device as the model
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    # Greedy decoding for deterministic output
    outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
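
The Quick Facts note that DoRA adds no inference latency once the weights are merged. A minimal sketch of that step, assuming `model` is the `PeftModel` loaded above and relying on PEFT's `merge_and_unload()`; the helper name `merge_dora_adapter` and the optional `save_dir` are hypothetical.

```python
def merge_dora_adapter(peft_model, save_dir=None):
    """Fold the adapter into the base weights and return the merged model.

    Assumes `peft_model` is a peft.PeftModel; `save_dir` is a hypothetical
    output directory for the merged checkpoint.
    """
    merged = peft_model.merge_and_unload()  # merges adapter weights into the base model
    if save_dir is not None:
        merged.save_pretrained(save_dir)    # standard transformers save call
    return merged

# Usage, given the `model` loaded above:
# merged_model = merge_dora_adapter(model)
```

After merging, generation runs on an ordinary model with no adapter layers in the forward pass.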

Why This Method?

DoRA is important because it addresses a fundamental flaw in LoRA that causes it to lag behind full fine-tuning in accuracy. It shows how decomposing weights into magnitude and direction components can enhance fine-tuning by allowing the model to capture more fine-grained patterns.