Resource Usage and Cost Analysis

Fine-tuning Costs

Part of Benchmark Angle III: Resources of LoRA Fine-tuning and Inference

The computational requirements for fine-tuning large language models vary dramatically depending on the approach used. The following table compares the cost and time requirements across different methods:

Table 13 Fine-tuning Cost Comparison

Models

Time

GPUs

Est. Cost (USD)

BloombergGPT

53 days

512×A100

$2.7M

LoRA

14.9h

4×A5000

$15.50

QLoRA

14.1h

4×A5000

$14.66

DoRA

15.9h

4×A5000

$16.54

rsLoRA

14.5h

4×A5000

$15.11

Gemini 2.0 FL

8.8h

$162.02

GPT-4o-mini

$312.00

Note: GPT-4o cost is estimated based on 4 epochs of fine-tuning at OpenAI fine-tuning pricing.

LoRA-based methods demonstrate remarkable cost efficiency compared to traditional approaches:

Self-hosted LoRA Fine-tuning: Using four NVIDIA A5000 GPUs, fine-tuning requires 14.1-15.9 hours wall-clock time, corresponding to 56.4-63.6 total GPU hours. At an estimated rate of $0.26 per GPU hour, this translates to approximately $14.66-$16.54 per fine-tuning run.

Comparison with Commercial Services: - LoRA methods: ~$15 (99.5% cost reduction vs. BloombergGPT) - Gemini 2.0 FL: $162.02 (10× more expensive than LoRA) - GPT-4o-mini: $312.00 (20× more expensive than LoRA)

This demonstrates that self-hosted LoRA fine-tuning is substantially more cost-effective than commercial fine-tuning services.

GPU Memory Requirements

Memory usage varies significantly based on quantization level and LoRA rank. The following table shows memory requirements for Llama-3.1-8B fine-tuning on NER tasks:

Table 14 GPU Memory Usage for Fine-tuning

Models

Parameters

GPU Memory (Batch=4)

GPU Memory (Batch=8)

Model Size

Percentage

Llama-3.1-8B-16bit (base)

8.03B

16.06 GB

Llama-3.1-8B-r8-16bit

4.72M

30.91 GB

30.91 GB

16.08 GB

100.1%

Llama-3.1-8B-r8-8bit

4.72M

11.41 GB

11.81 GB

8.04 GB

50.1%

Llama-3.1-8B-r8-4bit

4.72M

8.26 GB

8.65 GB

4.02 GB

25.0%

Llama-3.1-8B-r4-16bit

2.36M

30.90 GB

30.90 GB

16.07 GB

100.1%

Llama-3.1-8B-r4-8bit

2.36M

11.40 GB

11.78 GB

8.03 GB

50.0%

Llama-3.1-8B-r4-4bit

2.36M

8.25 GB

8.61 GB

4.02 GB

25.0%

Memory Usage Insights

Quantization Impact: - 4-bit quantization: Reduces memory usage to 25% of the base model - 8-bit quantization: Reduces memory usage to ~50% of the base model - 16-bit (no quantization): Maintains full memory requirements

LoRA Rank Impact: The LoRA rank (r4 vs r8) has minimal impact on total memory usage, as the adapter parameters (2.36M vs 4.72M) are negligible compared to the base model size (8.03B parameters).

Practical Implications: - 16-bit fine-tuning: Requires high-end GPUs (>30GB VRAM) - 8-bit fine-tuning: Accessible on mid-range GPUs (12-16GB VRAM) - 4-bit fine-tuning: Enables fine-tuning on consumer GPUs (8-12GB VRAM)

Batch Size Effects: Memory usage shows minimal increase when doubling batch size from 4 to 8, indicating efficient gradient accumulation strategies can be employed for larger effective batch sizes.