Financial Question Answering with BART + LoRA Fine-Tuning
1. Introduction
This project explores the task of financial question answering, specifically focusing on generating answers based on provided context snippets, often simulating Retrieval-Augmented Generation (RAG) scenarios. Accurate information extraction is crucial in finance, but LLMs can sometimes struggle with domain-specific factuality.
To address this, we fine-tuned the facebook/bart-large-cnn model using Low-Rank Adaptation (LoRA) on a financial QA dataset derived from financial_phrasebank. The goal was to improve the model's ability to generate concise, contextually grounded answers.
Post-training benchmarks using an LLM-as-Judge approach (Gemini) and attempted evaluations with RAG-specific frameworks (like Ragas) indicated mixed results. While fine-tuning potentially improved the model's faithfulness (generating answers more consistent with the provided context), it did not show clear improvements in answer relevancy or overall quality scores compared to the baseline model, and may have slightly decreased performance in these areas for some examples. Automated RAG benchmarks further highlighted challenges with context retrieval precision in the experimental setup.
2. Training Data
The dataset used for fine-tuning was financial_phrasebank. Texts were loaded using LangChain's UnstructuredURLLoader and chunked with RecursiveCharacterTextSplitter. Question-answer pairs were then constructed from these chunks.
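The sketch below illustrates this loading-and-chunking step. The URLs and chunking parameters are placeholder assumptions, and the import paths assume a recent LangChain release (older releases expose the same classes under langchain.document_loaders and langchain.text_splitter).

```python
# Sketch of the loading and chunking step (placeholder URLs and chunking parameters).
from langchain_community.document_loaders import UnstructuredURLLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Hypothetical source URLs for the financial texts.
urls = ["https://example.com/financial-news-article"]

# Load the raw documents.
documents = UnstructuredURLLoader(urls=urls).load()

# Split into overlapping chunks that serve as contexts for QA-pair construction.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,    # assumed chunk size (characters)
    chunk_overlap=64,  # assumed overlap between consecutive chunks
)
chunks = splitter.split_documents(documents)
print(f"Created {len(chunks)} context chunks")
```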
3. Training Method
The chosen training method was LoRA (Low-Rank Adaptation), applied to the facebook/bart-large-cnn base model. LoRA allows efficient adaptation of large models using low-rank updates while keeping most base model parameters frozen.
The LoRA configuration used was:
- r=8
- lora_alpha=16
- lora_dropout=0.1
- task_type=SEQ_2_SEQ_LM
- target_modules=["q_proj", "v_proj"] (note: target modules depend on the base model architecture)
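A minimal sketch of how this configuration maps onto the PEFT API, assuming the standard LoraConfig / get_peft_model workflow (variable names are illustrative):

```python
# Minimal sketch of applying the LoRA configuration above with PEFT.
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, get_peft_model, TaskType

base_model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling factor (effective scale alpha / r)
    lora_dropout=0.1,                     # dropout applied to the LoRA layers
    task_type=TaskType.SEQ_2_SEQ_LM,      # sequence-to-sequence language modeling
    target_modules=["q_proj", "v_proj"],  # attention projections in BART
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the low-rank adapters are trainable
```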
Training Arguments (Seq2SeqTrainingArguments):
- Learning rate: 2e-4
- Epochs: 2 (adjusted based on the training logs provided)
- Batch size: 2 per device
- eval_strategy: "epoch"
- save_strategy: "epoch"
- load_best_model_at_end: True
- metric_for_best_model: "eval_loss"
- Other standard arguments (weight decay, save limits, logging, etc.)
The model was trained using the Seq2SeqTrainer from the Hugging Face transformers library, as sketched below.
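The following is a hedged sketch of this training setup; the output directory, weight decay, save limit, logging frequency, and the dataset variables are placeholders for values not fully specified in this card.

```python
# Sketch of the training setup (placeholder values where this card does not specify them).
from transformers import (
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")

training_args = Seq2SeqTrainingArguments(
    output_dir="./bart-lora-finance",  # assumed output directory
    learning_rate=2e-4,
    num_train_epochs=2,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    eval_strategy="epoch",             # "evaluation_strategy" on older transformers versions
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    weight_decay=0.01,                 # assumed value
    save_total_limit=2,                # assumed value
    logging_steps=50,                  # assumed value
)

trainer = Seq2SeqTrainer(
    model=model,                       # the PEFT-wrapped BART model from the LoRA sketch above
    args=training_args,
    train_dataset=train_dataset,       # tokenized QA pairs built from the chunked data
    eval_dataset=eval_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    tokenizer=tokenizer,
)
trainer.train()
```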
4. Evaluation
Post-training evaluation focused on assessing the quality of answers generated by the fine-tuned model compared to the baseline facebook/bart-large-cnn model within a simulated RAG context (using financial_phrasebank as the context source for benchmarking).
Evaluation Methods:
LLM-as-Judge (Gemini):
- A capable LLM (Gemini-1.5-Flash) was used as an impartial judge to evaluate the quality of answers generated by both the baseline and the fine-tuned models for the same set of questions and retrieved contexts.
- Criteria included: Answer Relevancy, Faithfulness (presumed consistency with context), Fluency, Helpfulness (relative to context), and Overall Quality.
- Findings: The fine-tuned model showed potential improvements in faithfulness (less likely to generate answers inconsistent with the context). Helpfulness scores saw minor increases in some cases. However, answer relevancy and overall quality scores did not show clear improvement and potentially decreased slightly compared to the baseline. Fluency remained high for both.
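A minimal sketch of the judging loop, assuming the google-generativeai client; the rubric wording and output handling are simplified illustrations, not the exact prompts used.

```python
# Sketch of the LLM-as-Judge evaluation with Gemini (simplified rubric, not the exact prompt used).
import google.generativeai as genai

genai.configure(api_key="YOUR_GOOGLE_API_KEY")  # placeholder
judge = genai.GenerativeModel("gemini-1.5-flash")

def judge_answer(question: str, context: str, answer: str) -> str:
    """Ask the judge model to rate an answer on the card's criteria (1-5 each)."""
    rubric = (
        "You are an impartial judge. Rate the answer from 1 to 5 on: "
        "Answer Relevancy, Faithfulness to the context, Fluency, Helpfulness, "
        "and Overall Quality. Return one line per criterion as 'Criterion: score'.\n\n"
        f"Question: {question}\n\nContext: {context}\n\nAnswer: {answer}"
    )
    response = judge.generate_content(rubric)
    return response.text  # raw scores; parsed and aggregated downstream

# The same questions and contexts are scored for both baseline and fine-tuned outputs,
# and the per-criterion scores are compared between the two models.
```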
Automated RAG Benchmarks (Ragas/DeepEval - Attempted):
- Frameworks like Ragas and DeepEval were used to attempt calculation of metrics such as faithfulness, answer_relevancy, and context_precision.
- Findings & Caveats: Initial runs suggested trends similar to the LLM-as-Judge evaluation (a potential decrease in faithfulness/relevancy post-tuning). A key finding was consistently low context precision, indicating significant challenges for the retriever component in finding relevant context snippets from financial_phrasebank for the questions asked. However, these benchmark runs were incomplete due to evaluation setup issues (missing API keys for the default metric models), resulting in many NaN/failed calculations. Therefore, quantitative scores from these runs are unreliable and not reported here.
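For reference, a sketch of the attempted Ragas run. The column names follow the Ragas 0.1-style dataset convention, the example row is illustrative, and an API key for the default judge models (the missing piece noted above) is assumed to be configured.

```python
# Sketch of the attempted Ragas benchmark (assumes an API key for the default judge models).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# One row per (question, retrieved contexts, generated answer, reference answer); values are illustrative.
eval_data = {
    "question": ["What was the main reason for Tesla's stock rally?"],
    "contexts": [["Tesla rallied 10% after Morgan Stanley upgraded the electric car maker..."]],
    "answer": ["Morgan Stanley upgraded Tesla to 'overweight', citing its Dojo supercomputer."],
    "ground_truth": ["The Morgan Stanley upgrade, driven by the Dojo supercomputer's potential."],
}
dataset = Dataset.from_dict(eval_data)

# context_precision scores the retrieved contexts against the reference answer,
# which is where the consistently low retrieval scores showed up.
results = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision])
print(results)
```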
Summary: Fine-tuning with LoRA improved the model's tendency to stick to the provided context (faithfulness) but did not demonstrably improve, and may have slightly hindered, its ability to generate the most relevant or highest overall quality answers for this specific task and evaluation setup compared to the baseline BART model. Retrieval quality (context precision) was a significant challenge throughout.
5. Usage and Intended Uses
The model is intended for use in applications requiring question answering based on provided financial text snippets, such as assisting analysts, summarizing news, or powering chatbots where the context is explicitly provided.
```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import PeftModel, PeftConfig

# Specify the path to your saved LoRA model repository
# This could be a local path or a Hugging Face Hub repo ID
# PEFT_MODEL_PATH = "./bart-lora-finance/best_model"  # Local path example
PEFT_MODEL_PATH = "deoleojr/bart-finance-lora"  # HF Hub example (replace if different)

# Determine device
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Load the configuration from the PEFT model path
try:
    config = PeftConfig.from_pretrained(PEFT_MODEL_PATH)
    print(f"Base model identified from config: {config.base_model_name_or_path}")
except Exception as e:
    print(f"Error loading PeftConfig: {e}")
    # Fallback if the config cannot be loaded
    config = None
    BASE_MODEL_NAME = "facebook/bart-large-cnn"  # Manually specify if needed
    print(f"Warning: Could not load PeftConfig. Assuming base model: {BASE_MODEL_NAME}")

# Load the base model
if config:
    BASE_MODEL_NAME = config.base_model_name_or_path
try:
    print(f"Loading base model: {BASE_MODEL_NAME}...")
    base_model = AutoModelForSeq2SeqLM.from_pretrained(BASE_MODEL_NAME)
    print("Base model loaded.")
except Exception as e:
    print(f"Error loading base model '{BASE_MODEL_NAME}': {e}")
    exit()

# Load the PEFT model (LoRA layers) on top of the base model
try:
    print(f"Loading LoRA adapter from: {PEFT_MODEL_PATH}...")
    # The base model must be loaded before applying adapters
    model = PeftModel.from_pretrained(base_model, PEFT_MODEL_PATH)
    model.to(device)  # Move the combined model to the device
    model.eval()      # Set model to evaluation mode
    print("LoRA model loaded and ready.")
except Exception as e:
    print(f"Error loading PEFT model: {e}")
    exit()

# Load the tokenizer (usually saved with the adapters)
try:
    print("Loading tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained(PEFT_MODEL_PATH)
    print("Tokenizer loaded.")
except Exception as e:
    print(f"Error loading tokenizer from {PEFT_MODEL_PATH}: {e}")
    # Fall back to the base model tokenizer if necessary
    try:
        print(f"Falling back to base model tokenizer: {BASE_MODEL_NAME}")
        tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_NAME)
        print("Base model tokenizer loaded.")
    except Exception as e_base_tok:
        print(f"Error loading base tokenizer: {e_base_tok}")
        exit()

# Example usage
question = "What was the main reason for Tesla's stock rally?"
context = "Tesla (TSLA.O) rallied 10% after Morgan Stanley upgraded the electric car maker to 'overweight' from 'equal-weight', saying its Dojo supercomputer could boost the company's market value by nearly $600 billion."

# Use the specified prompt format
prompt = f"Instruction: {question}\n\n[Context Information]\n{context}"
print(f"\nInput Prompt:\n{prompt}")

inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512).to(device)

# Generate the answer
with torch.no_grad():
    # Adjust generation parameters as needed
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,  # Max tokens to ADD to the input
        temperature=0.7,     # Controls randomness (lower is more deterministic)
        top_k=50,            # Consider only the top K tokens
        top_p=0.95,          # Consider tokens with cumulative probability > P
        do_sample=True,      # Use sampling (needed for temperature, top_k, top_p)
        num_beams=1,         # Use 1 for sampling, >1 for beam search
    )

prediction = tokenizer.decode(outputs[0], skip_special_tokens=True).strip()
print(f"\nQuestion: {question}")
print(f"Generated Answer: {prediction}")
```
Intended Use: Primarily for offline analysis or integration into systems where context is supplied alongside the question. Suitable for assisting financial analysts or researchers needing fact extraction from specific texts. Not intended for real-time trading decisions or fully automated financial advice due to limitations.
6. Prompt Format
The model expects prompts formatted with an "Instruction:" prefix for the question, followed by two newlines, the label "[Context Information]", a newline, and then the context text itself.
Instruction: {question}
[Context Information]
{news_article_chunk}
Replace {question} with the user's question and {news_article_chunk} with the relevant context passage, as shown in the helper below.
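For convenience, a small (hypothetical) helper that produces exactly this format:

```python
def build_prompt(question: str, context: str) -> str:
    """Build a prompt in the format the model was fine-tuned on."""
    return f"Instruction: {question}\n\n[Context Information]\n{context}"
```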
7. Expected Output Format
The model outputs a single block of text: a concise, fact-based answer derived from the provided question and context.
Example:
"Morgan Stanley upgraded Tesla to 'overweight' from 'equal-weight', citing the potential of its Dojo supercomputer."
8. Limitations
Context Dependency: Performance heavily relies on the quality and relevance of the provided context. It may struggle or hallucinate if the context is insufficient or irrelevant to the question.
Retriever Performance: Benchmarks indicated challenges with retrieving highly relevant context snippets from the financial_phrasebank corpus, which impacts the overall RAG quality regardless of the generator model. The model's performance in a real RAG system will depend heavily on the effectiveness of the retrieval component used.
Evaluation Metrics: Automated metrics like BLEU/ROUGE (not used here) and even model-based metrics (Faithfulness, Relevancy) do not perfectly capture factual correctness or nuanced understanding. Human oversight or external fact-checking is advisable for critical applications.
Potential Relevancy Decrease: LLM-as-Judge evaluation suggested that fine-tuning, while potentially improving faithfulness, might have slightly reduced the direct relevance of answers compared to the baseline model in some cases.