Evaluation
This guide explains how to evaluate fine-tuned models in FinLoRA.
Evaluation Process
FinLoRA provides scripts to evaluate models on various financial tasks. The evaluation process uses the test datasets in the data/test directory.
Using run_test.sh
The main script for evaluation is test/run_test.sh. This script runs test.py with specific parameters to evaluate models on different datasets.
Basic usage:
cd test
./run_test.sh
You can modify the script to change the evaluation parameters, such as the dataset, model, and quantization bits.
Using test.py Directly
You can also run test.py directly with custom parameters:
python test/test.py \
--dataset <dataset_name> \
--base_model <model_path_or_name> \
--peft_model <peft_model_path> \
--batch_size <batch_size> \
--quant_bits <quant_bits> \
--source <source>
Where:
- --dataset: The dataset to evaluate on (e.g., “sentiment”, “headline”, “ner”)
- --base_model: The base model path or name
- --peft_model: The path to the LoRA adapter (optional)
- --batch_size: Batch size for evaluation
- --quant_bits: Quantization bits (4 or 8)
- --source: The source of the model (e.g., “hf” for Hugging Face, “google” for Google models)
Example:
python test/test.py \
--dataset sentiment \
--base_model meta-llama/Llama-3.1-8B-Instruct \
--peft_model lora_adapters/8bits_r8/sentiment_llama_3_1_8b_8bits_r8 \
--batch_size 8 \
--quant_bits 8 \
--source hf
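The same invocation can also be assembled programmatically, which may help when scripting several runs. The following is a minimal sketch and not part of FinLoRA: the helper name `build_test_command` is hypothetical, and it assumes that test.py accepts exactly the flags listed above and is launched from the repository root.

```python
import shlex


def build_test_command(dataset, base_model, batch_size, quant_bits,
                       source, peft_model=None):
    """Return the argument list for one test/test.py run (hypothetical helper)."""
    # The guide documents 4 and 8 as the supported quantization bit widths.
    if quant_bits not in (4, 8):
        raise ValueError("quant_bits must be 4 or 8")
    cmd = [
        "python", "test/test.py",
        "--dataset", dataset,
        "--base_model", base_model,
        "--batch_size", str(batch_size),
        "--quant_bits", str(quant_bits),
        "--source", source,
    ]
    # --peft_model is optional, so it is only added when an adapter is given.
    if peft_model is not None:
        cmd += ["--peft_model", peft_model]
    return cmd


cmd = build_test_command(
    dataset="sentiment",
    base_model="meta-llama/Llama-3.1-8B-Instruct",
    batch_size=8,
    quant_bits=8,
    source="hf",
    peft_model="lora_adapters/8bits_r8/sentiment_llama_3_1_8b_8bits_r8",
)
# Print the shell-quoted command; pass cmd to subprocess.run(cmd, check=True) to execute it.
print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell-quoting problems if a model path contains spaces.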
Using run_all_adapters.sh
To test multiple adapters systematically, use the run_all_adapters.sh script:
cd test
bash run_all_adapters.sh
Before running, edit the configuration variables in the script to define the adapters and tasks you want to evaluate.
This script allows you to batch evaluate multiple LoRA adapters across different tasks efficiently.
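Conceptually, a batch run of this kind is a loop over adapters and tasks. The sketch below illustrates that idea in Python; it is not the contents of run_all_adapters.sh. The `TASKS`, `ADAPTERS`, and `BASE_MODEL` variables are assumptions, and the adapter naming pattern is generalized from the single example path given earlier in this guide, so check it against your own adapter directories.

```python
import itertools

# Assumed configuration; the real script's variable names may differ.
TASKS = ["sentiment", "headline", "ner"]
# Maps an adapter directory name to its quantization bits (illustrative).
ADAPTERS = {"8bits_r8": 8}
BASE_MODEL = "meta-llama/Llama-3.1-8B-Instruct"


def adapter_path(task, adapter):
    # Pattern inferred from lora_adapters/8bits_r8/sentiment_llama_3_1_8b_8bits_r8.
    return f"lora_adapters/{adapter}/{task}_llama_3_1_8b_{adapter}"


def batch_commands(tasks, adapters):
    """Yield one test.py argument list per (adapter, task) pair."""
    for (adapter, bits), task in itertools.product(adapters.items(), tasks):
        yield [
            "python", "test/test.py",
            "--dataset", task,
            "--base_model", BASE_MODEL,
            "--peft_model", adapter_path(task, adapter),
            "--batch_size", "8",
            "--quant_bits", str(bits),
            "--source", "hf",
        ]


for cmd in batch_commands(TASKS, ADAPTERS):
    # Replace print with subprocess.run(cmd, check=True) to run each evaluation.
    print(" ".join(cmd))
```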
Using run_openai.sh
To run evaluations using base models from external APIs (e.g., OpenAI):
bash run_openai.sh
Before running, edit the script:
- Enter your API key
- Set the tasks you want to run
- Configure any other API-specific parameters
This is useful for comparing your fine-tuned LoRA adapters against commercial models like GPT-4.
Evaluation Results
Evaluation results are printed to the console, including metrics such as accuracy and F1 score, and can also be found in the test/results directory.
For each dataset, the evaluation script generates a report of the model’s performance on the test set.
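For readers unfamiliar with the two metrics named above, the following sketch shows how accuracy and a macro-averaged F1 score can be computed from reference labels and predictions. It is an illustration only: FinLoRA's own metric code is not shown in this guide and may use a different averaging scheme (for example, weighted F1) or a library implementation. The function names and sample labels are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the labels that occur in y_true."""
    scores = []
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up labels for illustration, not FinLoRA results.
labels = ["positive", "negative", "neutral", "positive"]
preds = ["positive", "neutral", "neutral", "negative"]
print(accuracy(labels, preds))  # 0.5
print(macro_f1(labels, preds))
```

Macro averaging gives every class equal weight, so a rare class that is always misclassified pulls the score down even when overall accuracy looks acceptable.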
Available Datasets and LoRA Adapters for Evaluation
The following tables list the available datasets and LoRA adapters for evaluation in FinLoRA:
| Dataset | Description | Dataset Parameter | Documentation |
|---|---|---|---|
| Sentiment Analysis | Financial sentiment analysis datasets (FPB, FiQA SA, TFNS, NWGI) | | |
| Headline Analysis | Financial headline classification | | |
| Named Entity Recognition | Financial named entity recognition | | |
| FiNER-139 | XBRL tagging with 139 common US GAAP tags | | |
| XBRL Term | XBRL terminology explanation | | |
| XBRL Extraction | Tag and value extraction from XBRL documents | | |
| Financial Math | Financial mathematics problems | | |
| FinanceBench | Financial benchmarking and analysis | | |
| CFA Level I | CFA Level I exam questions | | |
| CFA Level II | CFA Level II exam questions | | |
| CFA Level III | CFA Level III exam questions | | |
| CPA REG | CPA Regulation exam questions | | |

| Adapter Type | Description | Path | Documentation |
|---|---|---|---|
| Vanilla LoRA (8-bit) | 8-bit quantization with rank 8 | | ../lora_methods/lora_methods |
| QLoRA (4-bit) | 4-bit quantization with rank 4 | | |
| DoRA | Weight-Decomposed Low-Rank Adaptation | | |
| RSLoRA | Rank-Stabilized LoRA | | |
In the adapter paths, replace <task> with the specific task name (e.g., sentiment, headline, ner).