Dataset Processing

Training Data Preparation

Sentiment Analysis

The training data for the Sentiment Analysis LoRA model was constructed by aggregating and processing four distinct financial sentiment datasets.

Datasets Used:
- Financial PhraseBank (FPB) financial_phrasebank
- FiQA Sentiment Analysis (FiQA SA) ChanceFocus/flare-fiqasa
- Twitter Financial News Sentiment (TFNS) zeroshot/twitter-financial-news-sentiment
- News with GPT Instructions (NWGI) oliverwang15/news_with_gpt_instructions
Common Processing Steps (Applied Before/During Splitting as appropriate):
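- Label Normalization: Labels were standardized to strings. Numerical class labels (FPB, TFNS) were mapped to string labels (e.g., "negative", "neutral", "positive"). Continuous FiQA sentiment scores were binned into three classes using thresholds at -0.1 and 0.1: scores below -0.1 became "negative", scores above 0.1 became "positive", and scores in between became "neutral". NWGI labels were kept as their original multi-class strings.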
- Instruction Formatting: A specific instruction column was added.
- Column Standardization: Datasets were standardized to have input, output, and instruction columns.
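The common processing steps above can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the integer-to-string map, the instruction wording, the example sentence, and the handling of scores that fall exactly on a threshold are all assumptions, since the text does not specify them.

```python
# Hypothetical integer-to-string map; the text does not list the real codes.
ID_TO_LABEL = {0: "negative", 1: "neutral", 2: "positive"}

# Hypothetical instruction wording, for illustration only.
INSTRUCTION = (
    "What is the sentiment of this news? "
    "Please choose an answer from {negative/neutral/positive}."
)


def bin_score(score, low=-0.1, high=0.1):
    """Bin a continuous sentiment score into a string label.

    Scores exactly equal to a threshold are treated as neutral here,
    which is an assumption.
    """
    if score < low:
        return "negative"
    if score > high:
        return "positive"
    return "neutral"


def to_record(text, label, instruction=INSTRUCTION):
    """Return a row with the three standardized columns."""
    return {"input": text, "output": label, "instruction": instruction}


# A made-up example sentence with a numeric class label.
record = to_record("Quarterly operating profit rose.", ID_TO_LABEL[2])
```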
Dataset-Specific Contribution and Splitting:
- Financial PhraseBank (FPB):
  - Source: Original train split (sentences_50agree configuration).
  - Splitting: Manual split (75% train, 25% test, seed 42).
  - Contribution to Combined Training Set: The training portion from the split, duplicated 6 times.
  - Test Set: The test portion from the split.
- FiQA Sentiment Analysis (FiQA SA):
  - Source: The original train, validation, and test splits were loaded.
  - Splitting: Original.
  - Contribution to Combined Training Set: The original train split, duplicated 21 times.
  - Test Set: The original test split.
- Twitter Financial News Sentiment (TFNS):
  - Source: The original train and test splits were loaded and processed.
  - Splitting: Original.
  - Contribution to Combined Training Set: The original train split, duplicated 2 times.
  - Test Set: The original test split.
- News with GPT Instructions (NWGI):
  - Source: The original train and test splits were loaded and processed.
  - Splitting: Original.
  - Contribution to Combined Training Set: The original train split.
  - Test Set: The original test split.
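FPB is the only dataset that required a manual split. A seeded 75/25 split of that kind can be sketched as follows. The helper name `split_train_test` is hypothetical and the sketch uses only the standard library; the actual pipeline may have used a dataset library's own splitting routine, which would produce a different (but equally reproducible) partition.

```python
import random


def split_train_test(rows, test_fraction=0.25, seed=42):
    """Shuffle rows with a fixed seed and hold out a test fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    # The first n_test shuffled rows form the test set, the rest the train set.
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in data: 100 integer "rows" instead of real sentences.
train_rows, test_rows = split_train_test(range(100))
```

Fixing the seed makes the partition reproducible, so the held-out test portion never leaks into the replicated training portion across runs.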
Final Combined Training Set Construction: The processed training portions from FPB, FiQA, TFNS, and NWGI, replicated as described above, were concatenated into a single dataset and shuffled (seed=42). Total size: PENDING.
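The replication and concatenation step can be illustrated with a short sketch. The replication factors come from the list above; reading "duplicated N times" as N total copies is an assumption, and `build_combined` is a hypothetical helper, not the project's code.

```python
import random

# Replication factors from the text; NWGI is used once, without duplication.
COPIES = {"fpb": 6, "fiqa": 21, "tfns": 2, "nwgi": 1}


def build_combined(train_sets, copies=COPIES, seed=42):
    """Replicate each training set, concatenate, and shuffle with a fixed seed."""
    combined = []
    for name, rows in train_sets.items():
        # Assumption: N means N total copies of the training portion.
        combined.extend(list(rows) * copies[name])
    random.Random(seed).shuffle(combined)
    return combined


# Tiny stand-in training sets; real rows would be input/output/instruction dicts.
toy_sets = {"fpb": ["a", "b"], "fiqa": ["c"], "tfns": ["d", "e", "f"], "nwgi": ["g"]}
combined = build_combined(toy_sets)
```

Replicating the smaller datasets more heavily increases their share of the shuffled training set relative to the larger ones.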
Evaluation Strategy: The fine-tuned Sentiment Analysis model was evaluated separately against each of the test sets created for FPB, FiQA, TFNS, and NWGI.
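Per-dataset evaluation can be sketched as a loop over the four test sets. The text does not state which metric was reported, so accuracy is used here purely as an assumption, and `predict` stands in for whatever inference function wraps the fine-tuned model.

```python
def accuracy(predictions, references):
    """Fraction of predictions that exactly match the reference labels."""
    if not references:
        raise ValueError("test set is empty")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def evaluate_per_dataset(predict, test_sets):
    """Score each test set separately and return one value per dataset."""
    scores = {}
    for name, rows in test_sets.items():
        preds = [predict(row["instruction"], row["input"]) for row in rows]
        refs = [row["output"] for row in rows]
        scores[name] = accuracy(preds, refs)
    return scores


# Dummy predictor and made-up rows, for illustration only.
def always_positive(instruction, text):
    return "positive"


toy_tests = {
    "fpb": [
        {"instruction": "i", "input": "x", "output": "positive"},
        {"instruction": "i", "input": "y", "output": "negative"},
    ],
    "tfns": [{"instruction": "i", "input": "z", "output": "positive"}],
}
scores = evaluate_per_dataset(always_positive, toy_tests)
```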

Headline Analysis

The data preparation for the Headline Analysis LoRA model was more straightforward:

Dataset Used: The standard Financial Headline Analysis dataset [headline] was used.
Train/Test Split: The original train split provided with the dataset was used directly as the training set for LoRA fine-tuning. The original test split was reserved and used as the evaluation set to measure performance after fine-tuning.
Formatting: Data was formatted to include input (the headline), output (the classification label, e.g., “Yes”/“No”), and an appropriate instruction guiding the model on the headline analysis task.
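A formatted headline record might look like the following sketch. The instruction wording and the example headline are made up for illustration; only the three column names and the “Yes”/“No” labels come from the description above.

```python
# Hypothetical instruction text; the actual prompts are not given in the text.
DEFAULT_INSTRUCTION = (
    "Does the news headline talk about price going up? "
    "Please choose an answer from {Yes/No}."
)


def format_headline(headline, answer, instruction=DEFAULT_INSTRUCTION):
    """Build one input/output/instruction row for the headline task."""
    if answer not in ("Yes", "No"):
        raise ValueError(f"unexpected label: {answer!r}")
    return {"input": headline, "output": answer, "instruction": instruction}


# Made-up headline, not taken from the dataset.
row = format_headline("Gold prices climb as the dollar weakens", "Yes")
```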

Named Entity Recognition

Similar to Headline Analysis, the data preparation for the Named Entity Recognition (NER) LoRA model utilized the standard splits of the chosen dataset:

Dataset Used: The financial Named Entity Recognition (NER) dataset [ner] was used.
Train/Test Split: The official train split accompanying the dataset formed the training data for fine-tuning. The corresponding official test split was used for model evaluation.
Formatting: Data was formatted into the required structure, typically involving an instruction asking for the entity type of a specific phrase within the input sentence, and the output being the correct entity label (e.g., “location”, “person”, “organization”).
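The NER format described above, one record per phrase, can be sketched as follows. The instruction template and the example sentence are hypothetical; the label set is limited to the three examples named in the text, and the real dataset may contain others.

```python
# Labels named as examples in the text; the full label set is an assumption.
ENTITY_LABELS = ("location", "person", "organization")


def format_ner(sentence, phrase, label):
    """Build one row asking for the entity type of a phrase in a sentence."""
    if phrase not in sentence:
        raise ValueError("phrase must occur in the input sentence")
    if label not in ENTITY_LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    # Hypothetical instruction template.
    instruction = (
        f'What is the entity type of "{phrase}" in the input sentence? '
        f"Options: {', '.join(ENTITY_LABELS)}."
    )
    return {"input": sentence, "output": label, "instruction": instruction}


# Made-up sentence, not taken from the dataset.
row = format_ner(
    "The agreement is governed by the laws of California.",
    "California",
    "location",
)
```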

Citations

[headline]

Sinha, A., & Khandait, P. (2020). Headline-Enhanced Financial Embedding. In Proceedings of the 2nd Workshop on Economics and Natural Language Processing (pp. 66-74).

[ner]

Salinas Alvarado, D., Rönnqvist, S., & Niklaus, J. (2015). Domain-Specific Named Entity Recognition: A Case Study in Finance. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing (pp. 110-115).