Introduction
In the dynamic world of financial markets and monetary policy, capturing and understanding public sentiment around inflation is critically important. Social media platforms like Twitter provide real-time insights into how people perceive economic conditions—but extracting meaningful signal from this noisy data requires sophisticated natural language processing (NLP) techniques.
While working on a research project analyzing inflation trends in German-speaking countries, I encountered a fascinating challenge: how to accurately classify tweets discussing inflation in the DACH region (Germany, Austria, and Switzerland) into distinct categories representing upward, downward, or neutral price pressure. We published the resulting paper here: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4567305
While the project has officially concluded, my curiosity led me to explore whether reinforcement learning could improve classification accuracy beyond what we achieved with traditional methods such as manual labeling and off-the-shelf cloud-provider APIs. The results were surprising: a lightweight model trained with the right approach can compete with—and even outperform—models hundreds of times larger.
The Challenge: Nuanced Classification of Multilingual Financial Content
The task at hand was deceptively complex. We needed to analyze tweets that could be written in any language but specifically discussed inflation in German-speaking countries. Each tweet had to be classified into one of three categories:
- +1 (Upward Price Pressure): Evidence of inflation being problematic, increasing, or elevated
- -1 (Downward Price Pressure): Indications of lower inflation or prices
- 0 (Neutral/Unclear): Stable inflation, ambiguous direction, or unclear geographic focus
What made this particularly challenging was the need for models to understand:
- Multilingual content: Tweets in German, English, and other languages
- Regional specificity: Only inflation discussions about the DACH region were relevant
- Directional nuance: Distinguishing between inflation that is high and inflation that is rising
- Implicit indicators: Many tweets don’t explicitly state “inflation is rising” but use colloquial expressions
During the official project, we experimented with various approaches and large language models, but were never fully satisfied with the results. Models trained on API-labeled data struggled to exceed 70% accuracy on test sets, leaving room for improvement.
The Solution: Reinforcement Learning on a Lightweight Model
Instead of pursuing the “bigger is better” approach, I hypothesized that reinforcement learning could help a much smaller model learn the nuances of this classification task more effectively.
My approach centered on fine-tuning a Qwen 2.5 3B model using Group Relative Policy Optimization (GRPO)—the same reinforcement learning technique behind DeepSeek’s powerful R1 model. The critical difference? I applied it to a model roughly 1/200th the size of the largest models we’d previously tested.
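The core idea of GRPO can be sketched in a few lines: instead of learning a separate value model as a baseline, it samples a group of completions per prompt and normalizes each completion's reward against the group's mean and standard deviation. This is a minimal illustration of that normalization step (the function name is mine, not from any library):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each completion's reward against its sampling group.

    GRPO replaces a learned value baseline with group statistics:
    advantage_i = (r_i - mean(group)) / std(group).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0.0:  # all completions scored the same -> no learning signal
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

# Example: four sampled completions for one prompt, scored 2.0 (correct) or 0.0
print(group_relative_advantages([2.0, 0.0, 0.0, 2.0]))  # [1.0, -1.0, -1.0, 1.0]
```

Completions that beat their group's average get a positive advantage and are reinforced; below-average completions are pushed down, all without training a critic.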

These are the results for the fine-tuned Qwen 2.5 3B model: it performs on par with much larger models at pass@1, approaches human-level performance with pass@N at N=6, and reaches 84% accuracy with majority voting.
Technical Implementation
For those interested in the technical details, my implementation involved:
- Base Model: Qwen 2.5 3B, a relatively small but capable language model
- Fine-tuning Method: LoRA (Low-Rank Adaptation) with rank 32
- Reinforcement Learning: GRPO with a learning rate of 3e-4
- Training Data: Just 619 high-quality, human-labeled tweets with balanced distribution across classes
- Hardware: Custom-built rig with three RTX 4090 GPUs
The training process leveraged a carefully designed reward function that combined:
- Correctness Reward: Verifying that the model’s output matched the expected classification
- Format Reward: Ensuring the model followed the required output format
def exact_correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    """Check if the extracted answer matches the ground truth."""
    responses = [completion[0]["content"] for completion in completions]
    extracted_responses = [extract_xml_answer(r) for r in responses]
    # Debug printing omitted for brevity
    return [2.0 if r == a else 0.0 for r, a in zip(extracted_responses, answer)]
def sparse_reward_func(completions, **kwargs) -> list[float]:
    """Check for general format compliance with granular rewards."""
    base_reward = 0.125  # reward per tag
    responses = [completion[0]["content"] for completion in completions]
    rewards = []
    for r in responses:
        score = 0.0
        # Check each expected tag individually
        # (tag names assume a <reasoning>...</reasoning><answer>...</answer> output format)
        for tag in ("<reasoning>", "</reasoning>", "<answer>", "</answer>"):
            if tag in r:
                score += base_reward
        rewards.append(score)
    return rewards
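The helper extract_xml_answer used by the correctness reward is not shown above. A minimal sketch, assuming completions wrap the final label in `<answer>` tags as the format reward suggests:

```python
def extract_xml_answer(text: str) -> str:
    """Pull the text between the last <answer> and </answer> pair."""
    if "<answer>" not in text or "</answer>" not in text:
        return ""  # malformed output earns no correctness reward
    answer = text.split("<answer>")[-1].split("</answer>")[0]
    return answer.strip()

print(extract_xml_answer("<reasoning>Prices rising in DE.</reasoning><answer>+1</answer>"))  # +1
```

Returning an empty string for malformed output means such completions simply never match the ground truth, so the correctness and format rewards stay cleanly separated.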
Ablation Studies and Optimizations
To find the optimal configuration, I conducted extensive ablation studies exploring:
- Learning Rates: Comparing 3e-4, 4e-5, and 5e-6
- LoRA Ranks: Testing rank 16 vs. rank 32
- Dataset Distribution: Experimenting with different class balancing strategies
The training curves below showcase how the reward improved over time, with the optimal configuration (learning rate 3e-4, LoRA rank 32) consistently outperforming alternatives:
The total reward and the reward for correct answers both increase as training progresses.


This graph shows the standard deviation of rewards decreasing over time, indicating increasingly stable performance:

Results: David vs. Goliath
The most exciting aspect of this experiment was comparing my 3B parameter model against much larger competitors:

What’s remarkable is that a model 200+ times smaller achieved significantly better performance on this specific task. This demonstrates that task-specific reinforcement learning can be more effective than scaling up model size—particularly for specialized classification tasks.
What if I increase the number of samples and then check the scores? – Inference-time scaling

Performance of the RL-tuned Qwen 2.5 3B model across various sampling methods. Accuracy, Precision, Recall, and F1-Score are shown for Greedy Decoding, Pass@1 (single sample) at different temperatures, Majority Voting, and Pass@N (multiple samples, averaging accuracy). The model clearly scores better when taking the majority vote over multiple samples.
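The two inference-time scaling strategies compared above are simple to state in code. This sketch (with hypothetical predictions, not data from the experiment) shows majority voting and pass@N over N sampled completions per tweet:

```python
from collections import Counter

def majority_vote(labels: list[str]) -> str:
    """Return the most common predicted label among N sampled completions."""
    return Counter(labels).most_common(1)[0][0]

def pass_at_n(samples_per_prompt: list[list[str]], answers: list[str]) -> float:
    """Fraction of prompts where at least one of the N samples is correct."""
    hits = [any(p == a for p in preds) for preds, a in zip(samples_per_prompt, answers)]
    return sum(hits) / len(hits)

# Hypothetical predictions for two tweets, six samples each
preds = [["+1", "+1", "0", "+1", "-1", "+1"], ["0", "-1", "-1", "0", "0", "0"]]
gold = ["+1", "-1"]
print(majority_vote(preds[0]))   # "+1"
print(pass_at_n(preds, gold))    # 1.0
```

Pass@N is an optimistic upper bound (it only asks whether any sample was right), while majority voting is a deployable strategy, which is why the 84% majority-voting figure is the more practically relevant one.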
Technical Insights for Practitioners
For those considering similar approaches, here are some key technical insights from this experiment:
1. Reward Function Design is Critical
The combination of correctness and formatting rewards proved essential. The correctness reward provided the primary learning signal, while the formatting reward ensured the model’s outputs remained consistent and parsable.
2. Hyperparameter Sensitivity
Reinforcement learning is notoriously sensitive to hyperparameters. In my experiments:
- Learning rate 3e-4 significantly outperformed lower rates
- LoRA rank 32 provided better results than rank 16
- Batch size and training steps needed careful tuning to avoid instability
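The winning configuration can be sketched with the trl and peft libraries. This is a sketch, not the actual training script; everything beyond the values stated in the post (LoRA rank 32, learning rate 3e-4, Qwen 2.5 3B, the two reward functions) is an assumption:

```python
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

peft_config = LoraConfig(
    r=32,                       # LoRA rank used in the experiments
    lora_alpha=64,              # assumed; commonly set to 2x the rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    task_type="CAUSAL_LM",
)

training_args = GRPOConfig(
    learning_rate=3e-4,         # best-performing rate from the ablations
    num_generations=8,          # assumed group size per prompt
    output_dir="qwen25-3b-inflation-grpo",
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-3B-Instruct",
    args=training_args,
    train_dataset=train_dataset,  # the 619 human-labeled tweets
    reward_funcs=[exact_correctness_reward_func, sparse_reward_func],
    peft_config=peft_config,
)
trainer.train()
```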
3. High-Quality Training Data Beats Quantity
The 619-tweet dataset was relatively small but labeled by an economics postdoc with domain expertise. This high-quality data proved more valuable than larger datasets with noisier labels from generic APIs.
4. LoRA Enables Efficient Fine-tuning
Low-Rank Adaptation allowed efficient fine-tuning even on consumer GPUs, making this approach accessible without enterprise-grade infrastructure.
Broader Implications
While this experiment focused on inflation tweet classification, the approach has broader implications:
- Specialized Models > General Models: For specific tasks, a well-tuned smaller model can outperform general-purpose larger models
- Cost Efficiency: Smaller models require fewer computational resources and can run on consumer hardware
- Inference Speed: A 3B parameter model is dramatically faster at inference than 670B+ parameter models
- Environmental Impact: Training and running smaller models has a reduced carbon footprint

Conclusion
This exploration demonstrates that reinforcement learning, when applied thoughtfully to smaller models, can achieve remarkable results on complex NLP tasks. For specialized applications like financial sentiment analysis, this approach offers a compelling alternative to simply scaling up model size.
Even though the original project has concluded, this experiment satisfied my curiosity about whether reinforcement learning could overcome the limitations we encountered. The answer is a resounding yes—with the right training approach, even a 3B parameter model can become a specialist that outperforms giants hundreds of times its size.
As language models continue to evolve, I believe we’ll see increasing value in these “specialist” models that leverage reinforcement learning to excel at specific tasks—a trend that could democratize access to advanced NLP capabilities without requiring enterprise-scale compute resources.
Note: This experiment was conducted on a personal custom rig with three RTX 4090 GPUs, demonstrating that meaningful AI research doesn’t necessarily require massive cloud infrastructure.
Citation
@article{dalal2025boosting,
  author = {Dalal, Hrishbh},
  title  = {Boosting Financial NLP with Reinforcement Learning: How a 3B Parameter Model Outperforms the Giants},
  year   = {2025},
  month  = {3},
  day    = {4},
  url    = {https://hrishbhdalal.com/projects/boosting-financial-nlp},
  note   = {Accessed on March 19, 2025}
}