Vanilla LoRA
Background
Citation: LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2021)
LoRA addresses the fundamental challenge of fully fine-tuning large language models (LLMs), which becomes increasingly impractical as models grow larger. LoRA freezes pre-trained model weights and injects trainable rank-decomposition matrices into each layer of the Transformer architecture, enabling efficient adaptation to downstream tasks.
Quick Facts
LoRA is a parameter-efficient fine-tuning method.
LoRA reduces the number of trainable parameters by roughly 6,333× in the rank-4 Llama 3.1 70B configuration worked through below.
LoRA introduces no additional inference latency when weights are merged.
Algorithmic Idea
The core idea behind LoRA is that the change in weights during model adaptation has a low “intrinsic rank.” LoRA preserves the weights of the pre-trained model and introduces a smaller set of trainable weights through low-rank decomposition.
Stage One: Fine-tuning Process
Add a second path: Introduce two low-rank matrices \(\mathbf{A} \in \mathbb{R}^{r \times n}\) and \(\mathbf{B} \in \mathbb{R}^{n \times r}\) where the rank \(r \ll n\).
Forward pass: The output becomes \(\mathbf{h} = \mathbf{W}_0 \mathbf{x} + \gamma_r \mathbf{B}\mathbf{A} \mathbf{x}\), where the contribution from the frozen weights is \(\mathbf{W}_0 \mathbf{x}\) and the adapter contribution is \(\gamma_r \mathbf{B}\mathbf{A} \mathbf{x}\). This combined output is then used to compute the loss.
Backpropagation: \(\mathbf{W}_0\) is frozen and receives no gradient updates. Only \(\mathbf{A}\) and \(\mathbf{B}\) receive gradients and are updated during training. \(\mathbf{A}\) is initialized randomly while \(\mathbf{B}\) is initialized to zero, ensuring \(\Delta\mathbf{W} = 0\) at training start.
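The three steps above can be sketched with NumPy on a toy layer. This is a minimal illustration, not FinLoRA's training code: the sizes, the seed, and the 0.01 standard deviation for \(\mathbf{A}\) are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, alpha = 16, 4, 16          # toy sizes; real layers have n in the thousands
gamma = alpha / r                # scaling factor gamma_r = alpha / r

W0 = rng.normal(size=(n, n))             # frozen pre-trained weight (never updated)
A = rng.normal(scale=0.01, size=(r, n))  # trainable, random Gaussian init (scale is an assumption)
B = np.zeros((n, r))                     # trainable, zero init so that B @ A == 0

def lora_forward(x):
    """Frozen path plus scaled low-rank adapter path."""
    return W0 @ x + gamma * (B @ (A @ x))

x = rng.normal(size=n)
h = lora_forward(x)

# With B = 0 the adapter contributes nothing, so the output equals the frozen path.
starts_at_base = np.allclose(h, W0 @ x)
print(starts_at_base)  # True
```

In a real training loop only `A` and `B` would be handed to the optimizer, while `W0` stays fixed.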
Stage Two: Inference Weight Merging
After training, the learned adapter can be merged with the original weights for efficient inference:
\[\mathbf{W}_{merged} = \mathbf{W}_0 + \gamma_r \mathbf{B}\mathbf{A}\]
Once merged, inference becomes a standard matrix multiplication \(\mathbf{h} = \mathbf{W}_{merged} \mathbf{x}\) with no additional computational overhead.
Only the small matrices \(\mathbf{A}\) and \(\mathbf{B}\) require gradient computation and storage as LoRA adapters, dramatically reducing memory requirements and trainable parameters compared to full fine-tuning.
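To see why merging adds no overhead, the sketch below folds the adapter into the base weight as \(\mathbf{W}_0 + \gamma_r \mathbf{B}\mathbf{A}\) and checks that the merged layer reproduces the two-path output. The random matrices merely stand in for trained adapters, and all sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, alpha = 16, 4, 16
gamma = alpha / r

W0 = rng.normal(size=(n, n))   # frozen pre-trained weight
A = rng.normal(size=(r, n))    # stands in for a trained A
B = rng.normal(size=(n, r))    # stands in for a trained B (no longer zero)

# Fold the adapter into the base weight once, after training.
W_merged = W0 + gamma * (B @ A)

x = rng.normal(size=n)
h_two_path = W0 @ x + gamma * (B @ (A @ x))  # training-time forward pass
h_merged = W_merged @ x                      # inference with merged weights

same_output = np.allclose(h_two_path, h_merged)
print(same_output)  # True
```

The merged matrix has the same shape as `W0`, so the deployed model has the same size and per-token cost as the base model.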
Detailed Parameter Reduction with Rank=4 (Llama 3.1 70B)
Full model parameters: ~70.55B
LoRA applied to: q_proj, k_proj, v_proj
q_proj (in_features=8192, out_features=8192): \((8192 \times 4) + (4 \times 8192) = 32{,}768 + 32{,}768 = 65{,}536\)
k_proj (in_features=8192, out_features=1024): \((8192 \times 4) + (4 \times 1024) = 32{,}768 + 4{,}096 = 36{,}864\)
v_proj (in_features=8192, out_features=1024): \((8192 \times 4) + (4 \times 1024) = 32{,}768 + 4{,}096 = 36{,}864\)
Single attention block: 65,536 + 36,864 + 36,864 = 139,264 LoRA parameters
For 80 blocks: 139,264 × 80 = 11,141,120 total LoRA parameters
Reduction factor:
\[\frac{\text{Full model parameters}}{\text{LoRA parameters}} = \frac{70{,}553{,}706{,}496}{11{,}141{,}120} \approx 6{,}333\]
Thus, rank-4 LoRA adapts a 70B model while training only ~1/6,333 of the total parameters, making large-model fine-tuning both memory-efficient and feasible.
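The arithmetic above can be reproduced in a few lines of plain Python. The helper `lora_params` is a hypothetical name introduced for this sketch; the layer shapes, block count, and full-model parameter count are the ones listed in this section.

```python
def lora_params(in_features, out_features, r):
    """Parameters in one adapter: A is (r x in_features), B is (out_features x r)."""
    return r * in_features + out_features * r

r = 4
num_blocks = 80
full_model_params = 70_553_706_496

# (in_features, out_features) for each adapted projection, as listed above
projections = {
    "q_proj": (8192, 8192),
    "k_proj": (8192, 1024),
    "v_proj": (8192, 1024),
}

per_block = sum(lora_params(i, o, r) for i, o in projections.values())
total = per_block * num_blocks
reduction = full_model_params / total

print(per_block)  # 139264
print(total)      # 11141120
print(f"{reduction:.1f}")
```

Changing `r` or adding projections such as `o_proj` to the dictionary shows how quickly the adapter size grows with rank and coverage.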
Key Equations
For a pre-trained weight matrix \(\mathbf{W}_0 \in \mathbb{R}^{n \times n}\), the LoRA update follows: \(\mathbf{h} = \mathbf{W}_0 \mathbf{x} + \Delta\mathbf{W} \mathbf{x} = \mathbf{W}_0 \mathbf{x} + \gamma_r \mathbf{B}\mathbf{A} \mathbf{x}\).
Where the low-rank decomposition is: \(\Delta\mathbf{W} = \gamma_r \mathbf{B}\mathbf{A}\).
The scaling factor is defined as: \(\gamma_r = \frac{\alpha}{r}\).
Where:
\(\alpha > 0\) is a hyperparameter controlling the scaling.
\(r > 0\) is the rank with the low-rank condition \(r \ll n\).
\(\mathbf{A} \in \mathbb{R}^{r \times n}\) is initialized with random Gaussian weights.
\(\mathbf{B} \in \mathbb{R}^{n \times r}\) is initialized to zero, so \(\Delta\mathbf{W} = 0\) at training start.
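The number of trainable parameters is: \(|\Theta| = 2 \times L_{\text{LoRA}} \times n \times r\), where \(L_{\text{LoRA}}\) is the number of weight matrices LoRA is applied to, \(n\) is the matrix dimension, and \(r\) is the rank. This formula assumes square \(n \times n\) matrices. For a non-square matrix with input dimension \(d_{\text{in}}\) and output dimension \(d_{\text{out}}\), each adapter instead contributes \(r \times (d_{\text{in}} + d_{\text{out}})\) parameters, as in the k_proj and v_proj counts above.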
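As a quick numerical check of these definitions, the sketch below builds \(\Delta\mathbf{W} = \gamma_r \mathbf{B}\mathbf{A}\) from random factors and confirms that the full \(n \times n\) update has rank \(r\) while needing only \(2nr\) parameters. The sizes and seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, alpha = 64, 4, 16
gamma = alpha / r                      # gamma_r = alpha / r

A = rng.normal(size=(r, n))
B = rng.normal(size=(n, r))
delta_W = gamma * (B @ A)              # n x n matrix, but rank at most r

rank = np.linalg.matrix_rank(delta_W)
adapter_params = 2 * n * r             # |Theta| for one square matrix (L_LoRA = 1)
dense_params = n * n                   # parameters in a full n x n update

print(gamma)                         # 4.0
print(rank)                          # 4
print(adapter_params, dense_params)  # 512 4096
```

Even at this toy size the factorized update uses one eighth of the parameters of a dense update, and the gap widens as \(n\) grows with \(r\) held fixed.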
Implementation in FinLoRA
To use vanilla LoRA in FinLoRA, run the fine-tuning script with the name of a configuration entry:
python lora/finetune.py sentiment_llama_3_1_8b_8bits_r8
Configuration example from lora/finetune_configs.json:
"sentiment_llama_3_1_8b_8bits_r8": {
    "base_model": "meta-llama/Llama-3.1-8B-Instruct",
    "dataset_path": "../data/train/finlora_sentiment_train.jsonl",
    "lora_r": 8,
    "quant_bits": 8,
    "learning_rate": 0.0001,
    "num_epochs": 4,
    "batch_size": 8,
    "gradient_accumulation_steps": 2
}
Key parameters:
- lora_r: The rank \(r\) of the LoRA adapter (typically 4-16)
- quant_bits: The quantization bits (we use 8 for vanilla LoRA, but different values can be used)
- lora_alpha: The scaling parameter \(\alpha\) (default: 16; with \(r = 8\) this gives \(\gamma_r = \alpha/r = 2\))
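The sketch below shows how these parameters combine, using only the standard library. It is not FinLoRA's actual config loader: the fragment is an abbreviated copy of the example wrapped in braces to make it valid JSON, and treating `batch_size` as the per-step batch that gradient accumulation multiplies is an assumption based on the usual convention.

```python
import json

# Abbreviated copy of the example entry, wrapped in braces so it parses as JSON.
config_text = """
{
  "sentiment_llama_3_1_8b_8bits_r8": {
    "base_model": "meta-llama/Llama-3.1-8B-Instruct",
    "lora_r": 8,
    "quant_bits": 8,
    "batch_size": 8,
    "gradient_accumulation_steps": 2
  }
}
"""
cfg = json.loads(config_text)["sentiment_llama_3_1_8b_8bits_r8"]

lora_alpha = cfg.get("lora_alpha", 16)   # falls back to the documented default of 16
gamma = lora_alpha / cfg["lora_r"]       # scaling factor alpha / r

# Assumption: batch_size is the per-step batch, so accumulation multiplies it.
effective_batch = cfg["batch_size"] * cfg["gradient_accumulation_steps"]

print(gamma)            # 2.0
print(effective_batch)  # 16
```

Because \(\gamma_r\) shrinks as the rank grows, raising `lora_r` without adjusting `lora_alpha` also weakens the adapter's contribution per update.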
Usage Example
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

# Load the base model in half precision, placed automatically on available devices
base_model_name = "meta-llama/Llama-3.1-8B-Instruct"
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Attach the trained LoRA adapter to the frozen base model
adapter_path = "./lora_adapters/8bits_r8/sentiment_llama_3_1_8b_8bits_r8"
model = PeftModel.from_pretrained(base_model, adapter_path)
model.eval()

# Tokenize the prompt and move the tensors to the model's device
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
prompt = "The financial markets showed positive sentiment today"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding: do_sample=False replaces temperature=0, which is not a valid sampling temperature
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
References
Why This Method?
LoRA is central to understanding parameter-efficient fine-tuning. It introduced the core mathematical concepts on which subsequent LoRA variants build, provided a theoretical justification for low-rank adaptation, and has been widely adopted for LLM fine-tuning.
Useful Links
Microsoft LoRA - Original implementation
Axolotl - Training framework with LoRA support used in FinLoRA