Data Processing

This guide explains the datasets and data processing scripts used in FinLoRA.

Dataset Overview

FinLoRA tests Llama 3.1 8B Instruct with LoRA adapters on 19 datasets across 4 different types of tasks, ranging from general financial tasks to professional level XBRL (eXtensible Business Reporting Language)-based financial statement analysis.

The train-test splits for the four task categories are as follows:

General Financial Tasks: 122.9k/31.7k
Financial Certificate Tasks: 472/346
Financial Reporting Tasks: 15.9k/8.3k
Financial Statement Analysis Tasks: 27.9k/7.3k

Dataset Categories

General Financial Tasks:

Sentiment Analysis (FPB, FiQA SA, TFNS): Financial sentences classified with sentiment from {negative, neutral, positive}
NWGI Sentiment: Financial text classified into 7-level sentiment, simplified to {negative, neutral, positive}
Headline Analysis: Financial headlines classified with binary answers from {Yes, No}
Named Entity Recognition: Financial text with highlighted entities classified into {person, location, organization}

Financial Certificate Tasks:

CFA Level I/II/III & CPA REG: Multiple choice questions from mock exams with answers from {A, B, C, D} or {A, B, C}

Financial Reporting Tasks:

XBRL Term: Brief explanations for XBRL terminology from XBRL International website
FiNER/FNXL Tagging: Financial text with numerical entities tagged with appropriate US GAAP tags

Financial Statement Analysis Tasks:

XBRL Tag Extraction: XBRL context analysis to identify specific XBRL tags
XBRL Value Extraction: XBRL context analysis to find specific numerical values
XBRL Formula Construction: Creating financial formulas using US GAAP tags
XBRL Formula Calculation: Substituting numerical values into financial formulas
Financial Math: Applying financial formulas to solve numerical problems
FinanceBench: Answering questions based on XBRL financial reports

Dataset Directories

FinLoRA uses two main dataset directories:

data/train: Contains training datasets for fine-tuning models
data/test: Contains test datasets for evaluating models

Each dataset is stored in JSONL format, with each line containing a JSON object with fields like context (or input), target (or output), and sometimes instruction.

Dataset Formats

The processed datasets follow consistent formats:

Standard Format:

{
  "context": "The input text/instruction",
  "target": "The expected output"
}

Data Processing Scripts

FinLoRA includes several scripts for processing raw data into the format required for training and testing:

process_xbrl_extract.py

This script processes XBRL (eXtensible Business Reporting Language) data for extraction tasks. It converts raw XBRL data into a format suitable for training models to extract information from financial reports.

python data/process_xbrl_extract.py

The script processes multiple categories: - Tags extraction - Value extraction - Formula construction - Formula calculations

process_finer.py

This script processes data for the FiNER-139 (Financial Named Entity Recognition) dataset. It prepares data for training models to tag financial entities with 139 common US GAAP tags.

python data/process_finer.py

process_fnxl.py

This script processes data for the FNXL (Financial XBRL) dataset. It prepares batched data for training models to work with XBRL tags in financial documents.

python data/process_fnxl.py

process_sentiment_train.py

This script processes various financial sentiment analysis datasets including FPB, FiQA SA, TFNS, and NWGI.

python data/process_sentiment_train.py

The script handles: - Financial Phrasebank (FPB) sentiment classification - FiQA sentiment analysis - Twitter Financial News Sentiment (TFNS) - News With GPT Instruction (NWGI) sentiment

process_xbrl_agent_data.py

This script processes XBRL terminology data and FinanceBench datasets with OCR capabilities for PDF processing.

python data/process_xbrl_agent_data.py

It does the folowing: - XBRL terminology processing - FinanceBench data with PDF OCR - Formula data processing

process_multitask_data.py

This script processes data for multi-task learning scenarios, combining multiple financial datasets.

python data/process_multitask_data.py

Additional Processing Scripts

convert_gemini_format.py: Converts datasets to Gemini API format
dataset_statistics.py: Computes dataset statistics and metrics
process_xbrl_formula.py: Processes XBRL formula-specific data
process_xbrl_tag.py: Processes XBRL tag extraction data

Dataset Statistics

The datasets vary significantly in size and complexity:

Average Prompt Lengths: - Certificate tasks (CFA/CPA): 147-1,000 tokens - XBRL Analysis tasks: 3,800+ tokens - General Financial tasks: 43-138 tokens - Reporting tasks: 25-7,100 tokens

Evaluation Metrics: - Most datasets: Accuracy and F1 score - XBRL Term and FinanceBench: BERTScore F1 - Complex reasoning tasks: Custom evaluation metrics

Working with Custom Datasets

To add your own financial dataset:

Prepare your data in the required JSONL format
Place training data in data/train/
Place test data in data/test/
Update configuration in finetune_configs.json
Run processing scripts if needed

Example dataset entry:

{
  "context": "Instruction: Analyze the financial statement...\nInput: Company XYZ reported...\nAnswer: ",
  "target": "The company shows strong performance..."
}