Vanilla LoRA

Background

Citation: LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2021)

Full fine-tuning of large language models (LLMs) becomes increasingly impractical as models grow larger. LoRA addresses this by freezing the pre-trained model weights and injecting trainable rank-decomposition matrices into each layer of the Transformer architecture, enabling efficient adaptation to downstream tasks.

Quick Facts

  1. LoRA is a parameter-efficient fine-tuning method.

  2. LoRA can reduce trainable parameters by ~6,333× in the Llama 3.1 70B rank-4 configuration.

  3. LoRA introduces no additional inference latency when weights are merged.

Algorithmic Idea

The core idea behind LoRA is that the change in weights during model adaptation has a low “intrinsic rank.” LoRA preserves the weights of the pre-trained model and introduces a smaller set of trainable weights through low-rank decomposition.

Stage One: Fine-tuning Process

  1. Add a second path: Introduce two low-rank matrices \(\mathbf{A} \in \mathbb{R}^{r \times n}\) and \(\mathbf{B} \in \mathbb{R}^{n \times r}\) where the rank \(r \ll n\).

  2. Forward pass: The forward pass becomes \(\mathbf{h} = \mathbf{W}_0 \mathbf{x} + \gamma_r \mathbf{B}\mathbf{A} \mathbf{x}\), where \(\gamma_r = \frac{\alpha}{r}\) is a scaling factor, \(\mathbf{W}_0 \mathbf{x}\) is the contribution from the frozen weights, and \(\gamma_r \mathbf{B}\mathbf{A} \mathbf{x}\) is the adapter contribution. This combined output is then used to compute the loss.

  3. Backpropagation: \(\mathbf{W}_0\) is frozen and receives no gradient updates. Only \(\mathbf{A}\) and \(\mathbf{B}\) receive gradients and are updated during training. \(\mathbf{A}\) is initialized randomly while \(\mathbf{B}\) is initialized to zero, ensuring \(\Delta\mathbf{W} = 0\) at training start.
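The three steps above can be sketched as a small PyTorch module. This is a minimal illustration under stated assumptions, not the implementation used by FinLoRA or PEFT: the class name `LoRALinear`, the toy dimensions, and the 0.01 standard deviation for \(\mathbf{A}\) are all chosen here for demonstration.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """A frozen square linear map plus a trainable low-rank path (illustrative only)."""

    def __init__(self, n: int, r: int = 4, alpha: float = 16.0):
        super().__init__()
        # Frozen pre-trained weight W_0 (random here, standing in for real weights).
        self.W0 = nn.Parameter(torch.randn(n, n), requires_grad=False)
        # A is r x n with Gaussian init (std 0.01 is an arbitrary choice for this sketch).
        self.A = nn.Parameter(torch.randn(r, n) * 0.01)
        # B is n x r with zero init, so Delta W = 0 at the start of training.
        self.B = nn.Parameter(torch.zeros(n, r))
        self.scale = alpha / r  # gamma_r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W_0 x + gamma_r * B A x, for x of shape (batch, n).
        frozen = x @ self.W0.T
        adapter = (x @ self.A.T) @ self.B.T
        return frozen + self.scale * adapter


torch.manual_seed(0)
layer = LoRALinear(n=16, r=4)
x = torch.randn(2, 16)

# A stand-in loss, used only to trigger backpropagation.
loss = layer(x).pow(2).mean()
loss.backward()

# Names of the parameters that received a gradient: only the adapter matrices.
trained = [name for name, p in layer.named_parameters() if p.grad is not None]
print(trained)
```

Because \(\mathbf{B}\) starts at zero, the layer's output before any update equals that of the frozen layer alone, and only `A` and `B` appear in the list of parameters with gradients.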

Stage Two: Inference Weight Merging

After training, the learned adapter can be merged with the original weights for efficient inference:

\[\mathbf{W}_{merged} = \mathbf{W}_0 + \Delta\mathbf{W} = \mathbf{W}_0 + \gamma_r \mathbf{B}\mathbf{A}, \quad \text{where } \gamma_r = \frac{\alpha}{r}.\]

Once merged, inference becomes a standard matrix multiplication \(\mathbf{h} = \mathbf{W}_{merged} \mathbf{x}\) with no additional computational overhead.
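The equivalence between the two-path forward pass and the merged weight can be checked numerically. The following NumPy sketch uses small random matrices as stand-ins for trained weights; the dimensions and the values of `alpha` and `r` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, alpha = 8, 2, 16.0
gamma = alpha / r  # gamma_r

W0 = rng.normal(size=(n, n))  # frozen pre-trained weight
A = rng.normal(size=(r, n))   # adapter factor, r x n
B = rng.normal(size=(n, r))   # adapter factor, n x r (non-zero after training)
x = rng.normal(size=n)

# Two-path forward pass used during training.
h_adapter = W0 @ x + gamma * (B @ (A @ x))

# Merge once, then run a single matrix multiplication at inference.
W_merged = W0 + gamma * (B @ A)
h_merged = W_merged @ x

print(np.allclose(h_adapter, h_merged))
```

The merged matrix has the same shape as \(\mathbf{W}_0\), which is why serving it costs no more than serving the original model.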

Only the small matrices \(\mathbf{A}\) and \(\mathbf{B}\) require gradient computation and storage as LoRA adapters, dramatically reducing memory requirements and trainable parameters compared to full fine-tuning.

Detailed Parameter Reduction with Rank=4 (Llama 3.1 70B)

  • Full model parameters: ~70.55B

  • LoRA applied to: q_proj, k_proj, v_proj

    • q_proj (in_features=8192, out_features=8192): \((8192 \times 4) + (4 \times 8192) = 32{,}768 + 32{,}768 = 65{,}536\)

    • k_proj (in_features=8192, out_features=1024): \((8192 \times 4) + (4 \times 1024) = 32{,}768 + 4{,}096 = 36{,}864\)

    • v_proj (in_features=8192, out_features=1024): \((8192 \times 4) + (4 \times 1024) = 32{,}768 + 4{,}096 = 36{,}864\)

  • Single attention block: 65,536 + 36,864 + 36,864 = 139,264 LoRA parameters

  • For 80 blocks: 139,264 × 80 = 11,141,120 total LoRA parameters

  • Reduction factor:

    \[\frac{\text{Full model parameters}}{\text{LoRA parameters}} = \frac{70{,}553{,}706{,}496}{11{,}141{,}120} \approx 6{,}333\]

Thus, rank-4 LoRA adapts a 70B model by training only ~1/6,333 of its parameters, making large-model fine-tuning both memory-efficient and feasible.
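The arithmetic in this breakdown can be reproduced with a few lines of plain Python. The helper name `lora_params` is introduced here for illustration; the shapes, block count, and full-model parameter count are the ones listed above.

```python
RANK = 4
NUM_BLOCKS = 80
FULL_MODEL_PARAMS = 70_553_706_496

# (in_features, out_features) for each adapted projection, as listed above.
PROJECTIONS = {
    "q_proj": (8192, 8192),
    "k_proj": (8192, 1024),
    "v_proj": (8192, 1024),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """Parameters in A (rank x in_features) plus B (out_features x rank)."""
    return rank * in_features + out_features * rank


per_block = sum(lora_params(i, o, RANK) for i, o in PROJECTIONS.values())
total = per_block * NUM_BLOCKS
reduction = FULL_MODEL_PARAMS / total

print(f"per block: {per_block:,}")
print(f"total:     {total:,}")
print(f"reduction: {reduction:,.0f}x")
```

Changing `RANK` or the entries of `PROJECTIONS` shows how the trainable parameter count scales linearly with the rank and with the number of adapted matrices.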

Key Equations

For a pre-trained weight matrix \(\mathbf{W}_0 \in \mathbb{R}^{n \times n}\), the LoRA update follows: \(\mathbf{h} = \mathbf{W}_0 \mathbf{x} + \Delta\mathbf{W} \mathbf{x} = \mathbf{W}_0 \mathbf{x} + \gamma_r \mathbf{B}\mathbf{A} \mathbf{x}\).

Where the low-rank decomposition is: \(\Delta\mathbf{W} = \gamma_r \mathbf{B}\mathbf{A}\).

The scaling factor is defined as: \(\gamma_r = \frac{\alpha}{r}\).

Where:

  1. \(\alpha > 0\) is a hyperparameter controlling the scaling.

  2. \(r > 0\) is the rank with the low-rank condition \(r \ll n\).

  3. \(\mathbf{A} \in \mathbb{R}^{r \times n}\) is initialized with random Gaussian weights.

  4. \(\mathbf{B} \in \mathbb{R}^{n \times r}\) is initialized to zero, so \(\Delta\mathbf{W} = 0\) at training start.

The number of trainable parameters is: \(|\Theta| = 2 \times L_{\text{LoRA}} \times n \times r\).

where \(L_{\text{LoRA}}\) is the number of weight matrices LoRA is applied to, \(n\) is the matrix dimension, and \(r\) is the rank.
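The count above assumes every adapted matrix is square. As a hedged generalization (the symbols \(d_{\text{in}}\) and \(d_{\text{out}}\) are introduced here and do not appear in the original formulation), for a weight \(\mathbf{W}_0 \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}\) the factors become \(\mathbf{A} \in \mathbb{R}^{r \times d_{\text{in}}}\) and \(\mathbf{B} \in \mathbb{R}^{d_{\text{out}} \times r}\), so that:

\[|\Theta| = \sum_{\ell=1}^{L_{\text{LoRA}}} r \left(d_{\text{in}}^{(\ell)} + d_{\text{out}}^{(\ell)}\right),\]

which reduces to \(2 \times L_{\text{LoRA}} \times n \times r\) when \(d_{\text{in}} = d_{\text{out}} = n\). For example, with \(r = 4\), a square q_proj with \(n = 8192\) gives \(2 \times 8192 \times 4 = 65{,}536\), while a k_proj with \(d_{\text{in}} = 8192\) and \(d_{\text{out}} = 1024\) gives \(4 \times (8192 + 1024) = 36{,}864\).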

Implementation in FinLoRA

To use vanilla LoRA in FinLoRA, configure fine-tuning with standard parameters:

python lora/finetune.py sentiment_llama_3_1_8b_8bits_r8

Configuration example from lora/finetune_configs.json:

"sentiment_llama_3_1_8b_8bits_r8": {
  "base_model": "meta-llama/Llama-3.1-8B-Instruct",
  "dataset_path": "../data/train/finlora_sentiment_train.jsonl",
  "lora_r": 8,
  "quant_bits": 8,
  "learning_rate": 0.0001,
  "num_epochs": 4,
  "batch_size": 8,
  "gradient_accumulation_steps": 2
}

Key parameters:

  • lora_r: The rank \(r\) of the LoRA adapter (typically 4-16)

  • quant_bits: The quantization bits (we use 8 for vanilla LoRA, but different values can be used)

  • lora_alpha: The scaling parameter \(\alpha\) (default: 16, giving \(\gamma_r = \alpha/r\))
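To see what these settings imply, the sketch below parses the configuration entry shown earlier and derives two quantities from it. It makes two assumptions that the source does not confirm: that `lora_alpha` falls back to the documented default of 16 when it is absent from an entry, and that `batch_size` is a per-device value. How lora/finetune.py actually consumes these fields may differ.

```python
import json

# The entry shown above, wrapped in braces so it parses as a JSON object.
config_text = """
{
  "sentiment_llama_3_1_8b_8bits_r8": {
    "base_model": "meta-llama/Llama-3.1-8B-Instruct",
    "dataset_path": "../data/train/finlora_sentiment_train.jsonl",
    "lora_r": 8,
    "quant_bits": 8,
    "learning_rate": 0.0001,
    "num_epochs": 4,
    "batch_size": 8,
    "gradient_accumulation_steps": 2
  }
}
"""

cfg = json.loads(config_text)["sentiment_llama_3_1_8b_8bits_r8"]

# Assumption: lora_alpha is absent here, so use the documented default of 16.
lora_alpha = cfg.get("lora_alpha", 16)
scaling = lora_alpha / cfg["lora_r"]  # gamma_r = alpha / r

# Assumption: batch_size is per device, so this is the per-device effective batch.
effective_batch = cfg["batch_size"] * cfg["gradient_accumulation_steps"]

print(f"gamma_r = {scaling}, effective batch size = {effective_batch}")
```

With rank 8 and the default \(\alpha\), the adapter output is scaled by 2, and each optimizer step accumulates gradients over 16 samples per device.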

Usage Example

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

# Load base model
base_model_name = "meta-llama/Llama-3.1-8B-Instruct"
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load LoRA adapter
adapter_path = "./lora_adapters/8bits_r8/sentiment_llama_3_1_8b_8bits_r8"
model = PeftModel.from_pretrained(base_model, adapter_path)

# Generate text
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
prompt = "The financial markets showed positive sentiment today"
# Move the tokenized inputs to the same device as the model
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    # Greedy (deterministic) decoding
    outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

Why This Method?

LoRA is crucial to understanding parameter-efficient fine-tuning. It introduced the core mathematical concepts on which subsequent LoRA variants build, provided the theoretical justification for low-rank adaptation, and has been widely adopted for LLM fine-tuning.