Boosting Financial NLP with Reinforcement Learning: How a 3B Parameter Model Outperforms the Giants

Introduction

In the dynamic world of financial markets and monetary policy, capturing and understanding public sentiment around inflation is critically important. Social media platforms like Twitter provide real-time insights into how people perceive economic conditions—but extracting meaningful signal from this noisy data requires sophisticated natural language processing (NLP) techniques.

While working on a research project analyzing inflation trends in German-speaking countries, I encountered a fascinating challenge: how to accurately classify tweets discussing inflation in the DACH region (Germany, Austria, and Switzerland) into distinct categories representing upward, downward, or neutral price pressure. The resulting paper is published here: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4567305

While the project has officially concluded, my curiosity led me to explore whether reinforcement learning techniques could improve classification accuracy beyond what we achieved with traditional methods such as manual labeling and off-the-shelf cloud APIs. The results were surprising and demonstrated that a lightweight model trained with the right approach can compete with—and even outperform—models hundreds of times larger.

The Challenge: Nuanced Classification of Multilingual Financial Content

The task at hand was deceptively complex. We needed to analyze tweets that could be written in any language but specifically discussed inflation in German-speaking countries. Each tweet had to be classified into one of three categories:

  • +1 (Upward Price Pressure): Evidence of inflation being problematic, increasing, or elevated
  • -1 (Downward Price Pressure): Indications of lower inflation or prices
  • 0 (Neutral/Unclear): Stable inflation, ambiguous direction, or unclear geographic focus

What made this particularly challenging was the need for models to understand:

  1. Multilingual content: Tweets in German, English, and other languages
  2. Regional specificity: Only inflation discussions about the DACH region were relevant
  3. Directional nuance: Distinguishing between absolute high inflation and rising inflation
  4. Implicit indicators: Many tweets don’t explicitly state “inflation is rising” but use colloquial expressions
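To make the task concrete, here is a minimal sketch of what a classification prompt could look like. The exact wording and the `build_prompt` helper are my own assumptions for illustration, not the prompt used in the project:

```python
def build_prompt(tweet: str) -> str:
    """Assemble a hypothetical classification prompt for a single tweet."""
    return (
        "You are classifying tweets about inflation in the DACH region "
        "(Germany, Austria, Switzerland).\n"
        "Answer with exactly one label:\n"
        "  +1  upward price pressure (inflation problematic, increasing, or elevated)\n"
        "  -1  downward price pressure (lower inflation or prices)\n"
        "   0  neutral, unclear direction, or unclear geographic focus\n\n"
        f"Tweet: {tweet}"
    )

# Example with a made-up German tweet ("Everything at the supermarket
# is getting more expensive again"):
print(build_prompt("Im Supermarkt wird schon wieder alles teurer #Inflation"))
```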

During the official project, we experimented with various approaches and large language models, but were never fully satisfied with the results. Models trained on API-labeled data struggled to exceed 70% accuracy on test sets, leaving room for improvement.

The Solution: Reinforcement Learning on a Lightweight Model

Instead of pursuing the “bigger is better” approach, I hypothesized that reinforcement learning could help a much smaller model learn the nuances of this classification task more effectively.

My approach centered on fine-tuning a Qwen 2.5 3B model using Group Relative Policy Optimization (GRPO)—the same reinforcement learning technique behind DeepSeek’s powerful R1 model. The critical difference? I applied it to a model roughly 1/200th the size of the largest models we’d previously tested.

The fine-tuned Qwen 2.5 3B performs on par with much larger models at pass@1, approaches human-level accuracy with pass@N sampling at N=6, and reaches 84% accuracy with majority voting.

Technical Implementation

For those interested in the technical details, my implementation involved:

  1. Base Model: Qwen 2.5 3B, a relatively small but capable language model
  2. Fine-tuning Method: LoRA (Low-Rank Adaptation) with rank 32
  3. Reinforcement Learning: GRPO with a learning rate of 3e-4
  4. Training Data: Just 619 high-quality, human-labeled tweets with balanced distribution across classes
  5. Hardware: Custom-built rig with three RTX 4090 GPUs
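The setup above can be sketched with the `trl` GRPO trainer and a PEFT LoRA configuration. This is a hedged outline assuming recent `trl`/`peft` APIs, not the exact training script; `train_dataset`, the reward functions, and the parameters not stated in the post (alpha, target modules, batch size, generations per prompt) are placeholders:

```python
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

# LoRA adapter: rank 32, as used in the experiment (alpha/targets are assumptions)
peft_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# GRPO training arguments: learning rate 3e-4 from the post; the rest are assumptions
training_args = GRPOConfig(
    output_dir="qwen25-3b-inflation-grpo",
    learning_rate=3e-4,
    num_generations=8,  # completions sampled per prompt for the group baseline
    per_device_train_batch_size=8,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-3B-Instruct",  # base model loaded by name
    reward_funcs=[exact_correctness_reward_func, sparse_reward_func],
    args=training_args,
    train_dataset=train_dataset,  # the 619 human-labeled tweets, as a HF Dataset
    peft_config=peft_config,
)
trainer.train()
```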

The training process leveraged a carefully designed reward function that combined:

  1. Correctness Reward: Verifying that the model’s output matched the expected classification
  2. Format Reward: Ensuring the model followed the required output format
def exact_correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    """Check if extracted answer matches ground truth"""
    responses = [completion[0]['content'] for completion in completions]
    extracted_responses = [extract_xml_answer(r) for r in responses]

    # Debug printing omitted for brevity

    return [2.0 if r == a else 0.0 for r, a in zip(extracted_responses, answer)]

def sparse_reward_func(completions, **kwargs) -> list[float]:
    """Check for general format compliance with granular rewards"""
    base_reward = 0.125  # Reward per tag
    responses = [completion[0]["content"] for completion in completions]
    rewards = []

    for r in responses:
        score = 0.0
        # Check each tag individually
        if "<reasoning>" in r:
            score += base_reward
        if "</reasoning>" in r:
            score += base_reward
        if "<answer>" in r:
            score += base_reward
        if "</answer>" in r:
            score += base_reward

        rewards.append(score)

    return rewards
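The correctness reward relies on an `extract_xml_answer` helper that is not shown above. A minimal version, assuming the model wraps its chain of thought in `<reasoning>` tags and the label in `<answer>` tags, could look like this:

```python
def extract_xml_answer(text: str) -> str:
    """Pull the label out of the model's <answer>...</answer> block."""
    if "<answer>" not in text or "</answer>" not in text:
        return ""  # malformed output earns no correctness reward
    answer = text.split("<answer>")[-1].split("</answer>")[0]
    return answer.strip()

response = (
    "<reasoning>Rising supermarket prices in Germany imply upward pressure.</reasoning>\n"
    "<answer>+1</answer>"
)
print(extract_xml_answer(response))  # -> +1
```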

Ablation Studies and Optimizations

To find the optimal configuration, I conducted extensive ablation studies exploring:

  1. Learning Rates: Comparing 3e-4, 4e-5, and 5e-6
  2. LoRA Ranks: Testing rank 16 vs. rank 32
  3. Dataset Distribution: Experimenting with different class balancing strategies

The training curves below showcase how the reward improved over time, with the optimal configuration (learning rate 3e-4, LoRA rank 32) consistently outperforming alternatives:

Both the total reward and the correctness reward increase as training progresses:

This graph shows the standard deviation of rewards decreasing over time, indicating increasingly stable performance:

Results: David vs. Goliath

The most exciting aspect of this experiment was comparing my 3B parameter model against much larger competitors:

What’s remarkable is that a model 200+ times smaller achieved significantly better performance on this specific task. This demonstrates that task-specific reinforcement learning can be more effective than scaling up model size—particularly for specialized classification tasks.

Inference-Time Scaling: What if I Increase the Number of Samples?

Performance of the RL-tuned Qwen 2.5 3B model across various sampling methods. Accuracy, Precision, Recall, and F1-Score are shown for Greedy Decoding, Pass@1 (single sample) with different temperatures, Majority Voting, and Pass@N (multiple samples, averaging accuracy).

The model clearly scores better when sampling multiple completions: both majority voting and pass@N outperform single-sample decoding.
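As a sketch of how these sampling metrics can be computed from N sampled labels per tweet (the helper names are my own, not from the project code):

```python
from collections import Counter

def majority_vote(labels: list[str]) -> str:
    """Most common label among N sampled completions."""
    return Counter(labels).most_common(1)[0][0]

def pass_at_n(labels: list[str], gold: str) -> bool:
    """Pass@N: correct if any of the N samples matches the gold label."""
    return gold in labels

# Hypothetical samples for one tweet with gold label "+1"
samples = ["+1", "0", "+1", "+1", "-1", "+1"]
print(majority_vote(samples))    # -> +1
print(pass_at_n(samples, "+1"))  # -> True
```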

Technical Insights for Practitioners

For those considering similar approaches, here are some key technical insights from this experiment:

1. Reward Function Design is Critical

The combination of correctness and formatting rewards proved essential. The correctness reward provided the primary learning signal, while the formatting reward ensured the model’s outputs remained consistent and parsable.

2. Hyperparameter Sensitivity

Reinforcement learning is notoriously sensitive to hyperparameters. In my experiments:

  • Learning rate 3e-4 significantly outperformed lower rates
  • LoRA rank 32 provided better results than rank 16
  • Batch size and training steps needed careful tuning to avoid instability

3. High-Quality Training Data Beats Quantity

The 619-tweet dataset was relatively small but labeled by an economics postdoc with domain expertise. This high-quality data proved more valuable than larger datasets with noisier labels from generic APIs.

4. LoRA Enables Efficient Fine-tuning

Low-Rank Adaptation allowed efficient fine-tuning even on consumer GPUs, making this approach accessible without enterprise-grade infrastructure.

Broader Implications

While this experiment focused on inflation tweet classification, the approach has broader implications:

  1. Specialized Models > General Models: For specific tasks, a well-tuned smaller model can outperform general-purpose larger models
  2. Cost Efficiency: Smaller models require fewer computational resources and can run on consumer hardware
  3. Inference Speed: A 3B parameter model is dramatically faster at inference than 670B+ parameter models
  4. Environmental Impact: Training and running smaller models has a reduced carbon footprint

Note: in the comparison above, the Qwen 2.5 3B results use pass@N, while the other models use pass@1 due to API costs.

Conclusion

This exploration demonstrates that reinforcement learning, when applied thoughtfully to smaller models, can achieve remarkable results on complex NLP tasks. For specialized applications like financial sentiment analysis, this approach offers a compelling alternative to simply scaling up model size.

Even though the original project has concluded, this experiment satisfied my curiosity about whether reinforcement learning could overcome the limitations we encountered. The answer is a resounding yes—with the right training approach, even a 3B parameter model can become a specialist that outperforms giants hundreds of times its size.

As language models continue to evolve, I believe we’ll see increasing value in these “specialist” models that leverage reinforcement learning to excel at specific tasks—a trend that could democratize access to advanced NLP capabilities without requiring enterprise-scale compute resources.


Note: This experiment was conducted on a personal custom rig with three RTX 4090 GPUs, demonstrating that meaningful AI research doesn’t necessarily require massive cloud infrastructure.

Citation

@article{dalal2025boosting,
  author = {Dalal, Hrishbh},
  title  = {Boosting Financial NLP with Reinforcement Learning: How a 3B Parameter Model Outperforms the Giants},
  year   = {2025},
  month  = {3},
  day    = {4},
  url    = {https://hrishbhdalal.com/projects/boosting-financial-nlp},
  note   = {Accessed on March 19, 2025}
}
