Dataset Processing
Training Data Preparation
Sentiment Analysis
The training data for the Sentiment Analysis LoRA model was constructed by aggregating and processing four distinct financial sentiment datasets.
Datasets Used:
Financial PhraseBank (FPB): financial_phrasebank
FiQA Sentiment Analysis (FiQA SA): ChanceFocus/flare-fiqasa
Twitter Financial News Sentiment (TFNS): zeroshot/twitter-financial-news-sentiment
News with GPT Instructions (NWGI): oliverwang15/news_with_gpt_instructions
Common Processing Steps (Applied Before/During Splitting as appropriate):
Label Normalization: Labels were standardized. Numerical labels (FPB, TFNS) and sentiment scores (FiQA) were mapped to string labels (e.g., "negative", "neutral", "positive"). FiQA scores were binned using thresholds of -0.1 and 0.1. NWGI labels were kept as multi-class strings.
Instruction Formatting: An instruction column was added to each dataset.
Column Standardization: Datasets were standardized to have input, output, and instruction columns.
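The normalization steps above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual preprocessing code: the function names, the `EXAMPLE_ID2LABEL` mapping, and the instruction wording are hypothetical, and the treatment of scores that fall exactly on a threshold is an assumption.

```python
from typing import Dict

# Thresholds taken from the text; boundary handling is an assumption.
NEG_THRESHOLD = -0.1
POS_THRESHOLD = 0.1


def bin_fiqa_score(score: float) -> str:
    """Map a continuous sentiment score to one of three string labels."""
    if score < NEG_THRESHOLD:
        return "negative"
    if score > POS_THRESHOLD:
        return "positive"
    return "neutral"


def map_numeric_label(label: int, id2label: Dict[int, str]) -> str:
    """Map an integer class id to its string label using a per-dataset table."""
    return id2label[label]


def to_record(text: str, label: str, instruction: str) -> Dict[str, str]:
    """Build a row with the three standardized columns."""
    return {"input": text, "output": label, "instruction": instruction}


# Hypothetical mapping and prompt, for illustration only.
EXAMPLE_ID2LABEL = {0: "negative", 1: "neutral", 2: "positive"}
EXAMPLE_INSTRUCTION = "What is the sentiment of this text?"

row = to_record(
    "Shares fell sharply after the earnings call.",
    bin_fiqa_score(-0.42),
    EXAMPLE_INSTRUCTION,
)
```

Each source dataset needs its own id-to-label table, since the integer codes are not guaranteed to mean the same thing across datasets.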
Dataset-Specific Contribution and Splitting:
Financial PhraseBank (FPB):
Source: Original train split (sentences_50agree configuration).
Splitting: Manual split (25% test, 75% train, seed 42).
Contribution to Combined Training Set: The training portion from the split (duplicated 6 times).
Test Set: The test portion from the split.
FiQA Sentiment Analysis (FiQA SA):
Source: The original train, validation, and test splits were loaded.
Splitting: Original.
Contribution to Combined Training Set: The original train split (duplicated 21 times).
Test Set: The original test split.
Twitter Financial News Sentiment (TFNS):
Source: The original train and test splits were loaded and processed.
Splitting: Original.
Contribution to Combined Training Set: The original train split (duplicated 2 times).
Test Set: The original test split.
News with GPT Instructions (NWGI):
Source: The original train and test splits were loaded and processed.
Splitting: Original.
Contribution to Combined Training Set: The original train split.
Test Set: The original test split.
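FPB is the only dataset above that is split manually. A seeded 75/25 split can be sketched with the standard library as follows; `manual_split` is a hypothetical helper, and this sketch will not reproduce the exact rows of the original split, which was presumably produced with a dataset library rather than this code.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def manual_split(
    rows: Sequence[T], test_fraction: float = 0.25, seed: int = 42
) -> Tuple[List[T], List[T]]:
    """Shuffle deterministically, then carve off the test portion."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    # The first n_test rows become the test set; the rest are training rows.
    return shuffled[n_test:], shuffled[:n_test]


# Toy stand-in for the FPB rows.
train_rows, test_rows = manual_split(range(100))
```

With the Hugging Face `datasets` library, the counterpart call is `Dataset.train_test_split(test_size=0.25, seed=42)`; whether the original pipeline used exactly that call is not stated in this document.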
Final Combined Training Set Construction: The processed and augmented training portions from FPB, FiQA, TFNS, and NWGI (as described above) were concatenated into a single large dataset and shuffled (seed=42). Total size PENDING.
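The construction described above amounts to repeating each training portion, concatenating, and shuffling with a fixed seed. The sketch below uses plain Python lists and assumes that "duplicated N times" means N total copies of a portion; the `DUPLICATION` table and `build_combined` are hypothetical names.

```python
import random
from typing import Dict, List

# Copies per dataset, read from the text as total copies (an assumption).
DUPLICATION = {"fpb": 6, "fiqa": 21, "tfns": 2, "nwgi": 1}


def build_combined(train_parts: Dict[str, List[dict]], seed: int = 42) -> List[dict]:
    """Repeat each training portion, concatenate, and shuffle once."""
    combined: List[dict] = []
    for name, rows in train_parts.items():
        combined.extend(rows * DUPLICATION[name])
    random.Random(seed).shuffle(combined)
    return combined


# Toy data: one row per dataset.
toy_parts = {
    name: [{"input": f"{name} example", "output": "neutral", "instruction": "placeholder"}]
    for name in DUPLICATION
}
combined = build_combined(toy_parts)
print(len(combined))  # 6 + 21 + 2 + 1 = 30
```

Under the same assumption, the total size equals 6 times the FPB training portion plus 21 times the FiQA train split plus 2 times the TFNS train split plus the NWGI train split.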
Evaluation Strategy: The fine-tuned Sentiment Analysis model was evaluated separately against each of the test sets created for FPB, FiQA, TFNS, and NWGI.
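Evaluating separately means one score per test set rather than a single pooled number. The sketch below uses exact-match accuracy after lowercasing and trimming whitespace; the metric and the helper names are assumptions, since this document does not state which metric was reported.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (predicted label, gold label)


def accuracy(pairs: List[Pair]) -> float:
    """Share of predictions that match the gold label, ignoring case and spaces."""
    if not pairs:
        return 0.0
    correct = sum(
        1 for pred, gold in pairs if pred.strip().lower() == gold.strip().lower()
    )
    return correct / len(pairs)


def evaluate_separately(results: Dict[str, List[Pair]]) -> Dict[str, float]:
    """Compute one accuracy value per test set."""
    return {name: accuracy(pairs) for name, pairs in results.items()}


# Toy predictions, for illustration only.
toy_results = {
    "fpb": [("positive", "positive"), ("neutral", "negative")],
    "fiqa": [("Negative ", "negative")],
}
scores = evaluate_separately(toy_results)
```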
Headline Analysis
The data preparation for the Headline Analysis LoRA model was more straightforward:
Dataset Used: The standard Financial Headline Analysis dataset [headline] was used.
Train/Test Split: The original train split provided with the dataset was used directly as the training set for LoRA fine-tuning. The original test split was reserved and used as the evaluation set to measure performance after fine-tuning.
Formatting: Data was formatted to include input (the headline), output (the classification label, e.g., "Yes"/"No"), and an appropriate instruction guiding the model on the headline analysis task.
Named Entity Recognition
Similar to Headline Analysis, the data preparation for the Named Entity Recognition (NER) LoRA model utilized the standard splits of the chosen dataset:
Dataset Used: The financial Named Entity Recognition (NER) dataset [ner] was used.
Train/Test Split: The official train split accompanying the dataset formed the training data for fine-tuning. The corresponding official test split was used for model evaluation.
Formatting: Data was formatted into the required structure, typically involving an instruction asking for the entity type of a specific phrase within the input sentence, and the output being the correct entity label (e.g., "location", "person", "organization").
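A single NER training row in this structure might be built as follows. The instruction wording, the `format_ner_example` helper, and the example sentence are hypothetical; the dataset's actual prompt text is not given in this document.

```python
from typing import Dict


def format_ner_example(sentence: str, phrase: str, entity_label: str) -> Dict[str, str]:
    """Build one row: the instruction names the phrase, the output is its label."""
    instruction = (
        f"What is the entity type of '{phrase}' in the input sentence? "
        "Answer with the entity label only."
    )
    return {"input": sentence, "output": entity_label, "instruction": instruction}


# Illustrative sentence, not taken from the dataset.
example = format_ner_example(
    "The agreement was signed in New York by the borrower.",
    "New York",
    "location",
)
```

Keeping the phrase inside the instruction and the full sentence in input means a sentence with several entities yields several rows, one per phrase.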
Citations
[headline] Sinha, A., & Khandait, P. (2020). Headline-Enhanced Financial Embedding. In Proceedings of the 2nd Workshop on Economics and Natural Language Processing (pp. 66-74).
[ner] Salinas Alvarado, D., Rönnqvist, S., & Niklaus, J. (2015). Domain-Specific Named Entity Recognition: A Case Study in Finance. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing (pp. 110-115).