Small Model, Big Equations: Training Qwen 2.5-VL to Master LaTeX OCR with GRPO

Vision Language Models (VLMs) are bridging the gap between pixels and text, but can a relatively small VLM be trained to rival its larger counterparts on highly structured, visually complex tasks like LaTeX Optical Character Recognition (OCR)? This project explores exactly that: using Group Relative Policy Optimization (GRPO), a reinforcement learning technique, to significantly boost the LaTeX conversion quality of the Qwen 2.5-VL 3B and 7B models. The goal? To achieve accuracy and formatting precision that meets, or even surpasses, what’s often expected only from much larger models.

The Challenge: More Than Just Reading Text

Converting images of mathematical equations into accurate, compilable LaTeX code is notoriously difficult, even for sophisticated AI models. It demands more than simple character recognition:

  1. Visual Complexity: Models must interpret intricate 2D layouts involving nested fractions, matrices, complex subscripts/superscripts, varied spacing, and a vast array of specialized symbols.
  2. Syntactic Precision: The output isn’t just text; it’s code. It needs to be perfectly valid LaTeX, using correct environments (\begin{equation}, \[…\]), commands (\frac, \left(, \right)), and structure to be compilable. A single misplaced brace can break everything.
  3. Implicit Mathematical Context: Understanding the underlying mathematical structure often helps disambiguate visually similar symbols or layouts.
  4. Structured Reasoning & Output: The process often benefits from explicit reasoning (a “thought process”) before generating the final LaTeX code, requiring the model to adhere to a specific output format (e.g., using <think> and <answer> tags).

Smaller VLMs, while capable, might initially lack the nuanced understanding required for high-fidelity LaTeX generation out-of-the-box. Can RL provide the targeted training needed to elevate their performance?

Source: unsloth/LaTeX_OCR

The Foundation: A Ready-Made LaTeX Dataset

In this work I leveraged an existing resource: the unsloth/LaTeX_OCR dataset available on Hugging Face. This provided a collection of mathematical expression images paired with their corresponding LaTeX source code. The preparation process focused on structuring this data for VLM training:

  1. Data Loading: The dataset (specifically the ‘train’ and ‘test’ splits) was loaded.
  2. Processing & Formatting: A Python script (process_dataset in the code) iterated through the examples:
    • Saved each PIL image locally (e.g., as train_XXX.png).
    • Constructed a JSONL file where each line represents a training example.
    • Each entry contained the absolute path to the saved image and a structured conversation:
    • A human turn combining the <image> placeholder with a standard TASK_INSTRUCTION, guiding the model on the goal and required format.
    • A gpt turn containing the ground-truth LaTeX code from the dataset.
    • A standard SYSTEM_PROMPT was used to define the model’s persona as a “LaTeX OCR expert” and reinforce the required <think>/<answer> structure.

This resulted in training files (like train_1000.jsonl) directly usable by the GRPO trainer, pairing images with their target LaTeX outputs within a conversational format suitable for instruction-following VLMs.
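As a rough sketch of what one JSONL line might look like (the field names, SYSTEM_PROMPT text, and TASK_INSTRUCTION text here are illustrative stand-ins, not the exact values used by process_dataset):

```python
import json

# Hypothetical stand-ins for the actual prompts used in the script.
SYSTEM_PROMPT = (
    "You are a LaTeX OCR expert. Respond with your reasoning in <think> tags "
    "followed by the final LaTeX in <answer> tags."
)
TASK_INSTRUCTION = "Convert this image of a mathematical expression to LaTeX."

def make_jsonl_line(image_path: str, latex_gt: str) -> str:
    """Build one JSONL training example: an image path plus a conversation
    pairing the instruction (with <image> placeholder) and the ground truth."""
    example = {
        "image": image_path,
        "conversations": [
            {"from": "system", "value": SYSTEM_PROMPT},
            {"from": "human", "value": f"<image>\n{TASK_INSTRUCTION}"},
            {"from": "gpt", "value": latex_gt},
        ],
    }
    return json.dumps(example)
```

Writing one such line per example yields a file like train_1000.jsonl that the trainer can stream directly.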

My Experimental Setup: GRPO on Qwen 2.5-VL

I chose Group Relative Policy Optimization (GRPO), inspired by its successful application by Deepseek in training models for mathematical reasoning. GRPO is effective for optimizing language model policies by comparing the rewards of multiple generated responses.
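As an illustrative simplification (the full GRPO objective also includes a clipped policy-ratio term and a KL penalty against a reference model), the group-relative scoring at its heart looks like this:

```python
import statistics
from typing import List

def group_relative_advantages(rewards: List[float]) -> List[float]:
    """Standardize rewards within one prompt's group of generations.

    Each completion is scored relative to its siblings:
    advantage_i = (r_i - group_mean) / group_std. When all rewards in
    the group are equal (zero spread), no completion is preferred and
    every advantage is zero.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]
```

With num_generations=6 (as configured below), each prompt contributes one such group of six rewards per optimization step.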

  • Model: Qwen/Qwen2.5-VL-7B-Instruct and Qwen2.5-VL-3B-Instruct, a capable family of open-source VLMs chosen for their balance of size and performance.
  • Optimization: Training relied on memory- and compute-saving techniques:
    • Quantization: Loading the model in 4-bit (load_in_4bit=True) using nf4 quantization and bfloat16 compute dtype (bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype="bfloat16"), with double quantization enabled (bnb_4bit_use_double_quant=True). This significantly reduces the memory footprint.
    • LoRA: Efficient fine-tuning using Low-Rank Adaptation (use_peft=True) with lora_r=16, lora_alpha=32, lora_dropout=0.05, targeting key attention and feed-forward modules (q_proj, k_proj, v_proj, o_proj, etc.), and lora_bias="none".
  • Training Configuration (based on finalized_vlm_r1_grpo_qwen_training.py):
    • Max Sequence Length: 1200 tokens (allowing for reasoning + LaTeX)
    • Max Prompt Length: 600 tokens
    • Max Completion Length: 600 tokens
    • Batch Size: 2 per device (per_device_train_batch_size=2)
    • Gradient Accumulation: 4 steps (effective batch size of 8)
    • Number of Generations per Prompt: 6 (num_generations=6 for GRPO comparison)
    • Learning Rate: 4e-5 (a crucial hyperparameter)
    • Max Steps: 100 (for initial runs/epochs)
    • Optimizer: paged_adamw_8bit, which stores optimizer states in 8 bits, yielding substantial memory savings.
    • Precision: bfloat16 (bf16=True)
  • Hardware: A single consumer GPU (e.g., an RTX 4090 with 24 GB of VRAM, as in my previous project).
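Under recent transformers/peft/trl releases, the setup above might be assembled roughly as follows (argument names follow those libraries' public configs; the original script may differ in details):

```python
from transformers import BitsAndBytesConfig
from peft import LoraConfig
from trl import GRPOConfig

# 4-bit NF4 quantization with bfloat16 compute and double quantization.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16",
    bnb_4bit_use_double_quant=True,
)

# LoRA adapters on the attention (and, per the script, feed-forward) projections.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none",
)

# GRPO training arguments matching the bullet list above.
training_args = GRPOConfig(
    output_dir="qwen25vl-latex-grpo",  # hypothetical output path
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_generations=6,
    max_prompt_length=600,
    max_completion_length=600,
    learning_rate=4e-5,
    max_steps=100,
    optim="paged_adamw_8bit",
    bf16=True,
)
```

These three configs are then handed to the model loader and GRPOTrainer together with the reward functions described next.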

The Reward System: Guiding Towards Valid and Accurate LaTeX

The effectiveness of RL hinges on the reward functions. Initially, I focused on rewarding structural compliance (using the correct tags) and textual similarity (using ROUGE-L). However, a key challenge with LaTeX is ensuring the output is syntactically valid so it can actually be compiled. Simply being textually similar isn’t enough if a misplaced brace breaks the code. To address this, I refined the core reward mechanism by introducing a validity check before considering similarity.

1. Format Compliance Rewards (helpful initially): These ensure the model learns the structure of the desired response.

  • tag_presence_reward_fn: Gives partial credit (0.25 points each, max 1.0) for including each of the required tags: <think>, </think>, <answer>, </answer>. Encourages structural completeness.
def tag_presence_reward_fn(completions: List[List[Dict]], **kwargs) -> List[float]:
    """
    Gives 0.25 for each required tag present (max 1.0)
    Required tags: <think>, </think>, <answer>, </answer>
    """
    rewards = []
    # Extract the string content from the completion structure
    responses = [comp[0]["content"] for comp in completions]
    for response in responses:
        score = 0.0
        # Use case-insensitive matching for tags
        if re.search(r"<think>", response, re.IGNORECASE): score += 0.25
        if re.search(r"</think>", response, re.IGNORECASE): score += 0.25
        if re.search(r"<answer>", response, re.IGNORECASE): score += 0.25
        if re.search(r"</answer>", response, re.IGNORECASE): score += 0.25
        rewards.append(score)
    return rewards
  • tag_order_reward_fn: Gives a full reward (1.0) only if the tags appear in the correct sequence: <think> block followed by <answer> block. Enforces logical flow.
def tag_order_reward_fn(completions: List[List[Dict]], **kwargs) -> List[float]:
    """
    Gives 1.0 if tags appear in correct order:
    <think>...</think>...<answer>...</answer> (case-insensitive)
    """
    # Case-insensitive pattern
    pattern = r"<think>.*?</think>.*?<answer>.*?</answer>"
    # Extract the string content
    responses = [comp[0]["content"] for comp in completions]
    return [1.0 if re.search(pattern, comp, re.DOTALL | re.IGNORECASE) else 0.0 for comp in responses]

2. Validity-Gated Accuracy Reward (The Core Improvement): This is the main reward function, designed to strongly prioritize basic structural correctness.

  • valid_latex_and_similarity_reward_fn: This function operates in two stages:
    • Stage 1: Heuristic Validity Check. It first extracts the LaTeX code from the <answer> tags using parse_vlm_answer. Then, it runs a fast, custom heuristic check (basic_latex_heuristic_check) on this extracted code.
    • What the Heuristic Does: This simple check primarily looks for balanced curly braces {}, while attempting to ignore braces within comments (lines starting with %) and escaped braces (\{, \}). It’s a quick proxy for structural soundness, catching common errors.
    • Limitations: It’s crucial to understand this is not a full LaTeX compiler. It doesn’t understand commands, environments, or complex syntax rules. It’s a fast check for a common failure mode, suitable for the rapid feedback needed in RL.
def basic_latex_heuristic_check(latex_string: str) -> bool:
    """
    VERY basic heuristic for LaTeX validity (balanced braces), ignoring comments
    and escaped braces.
    WARNING: This is NOT a reliable check for true LaTeX compilability.
    """
    if not isinstance(latex_string, str) or not latex_string:
        return False # Empty or non-string is not valid for our purpose
    
    balance = 0
    escaped = False

    lines = latex_string.splitlines()
    for line in lines:
        # Find the first unescaped comment character '%'
        comment_start_index = -1
        temp_escaped = False
        for i, char in enumerate(line):
            if char == '%' and not temp_escaped:
                comment_start_index = i
                break
            temp_escaped = (char == '\\' and not temp_escaped)
        
        effective_line = line if comment_start_index == -1 else line[:comment_start_index]

        # Process characters in the effective line
        escaped = False # Reset escape status for each effective line part
        for char in effective_line:
            if escaped:
                escaped = False # Consume the escape character
                continue
            if char == '\\':
                escaped = True
                continue
            
            # Check braces only if not escaped
            if char == '{':
                balance += 1
            elif char == '}':
                balance -= 1
            
            if balance < 0: # Closing brace before opening
                return False
                
    return balance == 0 # Must be perfectly balanced at the end
  • Stage 2: Conditional Similarity Score.
    • If the heuristic check fails: The function immediately assigns a reward of 0.0. The model gets no credit, even if the text was otherwise similar to the ground truth. This sends a strong signal: invalid structure is unacceptable.
    • If the heuristic check passes: The function proceeds to calculate the similarity. It normalizes both the (heuristically valid) generated LaTeX and the ground truth LaTeX using normalize_text (lowercase, strip whitespace). It then calculates the ROUGE-L F1 score between them. The final reward is 1.0 + ROUGE-L F1 score. This gives a base reward of 1.0 for passing the validity check, plus an additional reward (up to 1.0) based on how textually similar it is to the correct answer.
def valid_latex_and_similarity_reward_fn(
    completions: List[List[Dict]],
    solution: List[str], # TRL passes batch values as List
    prompt: Optional[List[List[Dict]]] = None, # Pass prompt if available for logging context
    **kwargs # Catches model_version, hyperparameters, batch_index etc. from trainer state
) -> List[float]:
    """
    Calculates reward based on heuristic LaTeX validity and ROUGE-L similarity.
    - If heuristic check fails OR answer/GT is empty/parsing fails: reward = 0.0
    - If heuristic check passes: reward = 1.0 + ROUGE-L F1 score
    Logs details including the validity check result.
    """
    rewards = []
    # Assume log_training_response, parse_vlm_answer, normalize_text exist elsewhere
    # And necessary imports like List, Dict, Optional, Any, rouge_scorer are present
    
    scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
    responses = [comp[0]["content"] for comp in completions] # Extract first generated response text

    batch_size = len(responses)
    if batch_size != len(solution):
        print(f"Warning: Mismatch in lengths: {batch_size} completions vs {len(solution)} ground truths.")
        # Log an error for each expected item if log_training_response exists
        if 'log_training_response' in globals():
            for i in range(batch_size):
                log_training_response(error="Batch length mismatch during reward calc", **kwargs)
        return [0.0] * batch_size # Return zero rewards for the batch

    # --- Loop through each item in the batch ---
    for i in range(batch_size):
        response_text = responses[i]
        gt_text = solution[i]
        current_prompt_text = "Prompt not provided" # Default, attempt extraction below
        image_id = f"item_{i}_in_batch" # Basic identifier

        # Try to extract prompt text if provided and log_training_response exists
        if prompt and i < len(prompt) and 'log_training_response' in globals():
            # (Keep your existing prompt extraction logic here if needed for logging)
            try:
                prompt_details = prompt[i]
                system_content = next((msg['content'][0]['text'] for msg in prompt_details if msg['role'] == 'system'), "")
                user_content_list = next((msg['content'] for msg in prompt_details if msg['role'] == 'user'), None)
                user_text = next((item['text'] for item in user_content_list if item['type'] == 'text'), "") if user_content_list else ""
                current_prompt_text = f"SYSTEM: {system_content}\nUSER: {user_text}"
            except Exception as e:
                 current_prompt_text = f"Error extracting prompt: {e}"


        # --- Initialize reward calculation variables ---
        reward = 0.0
        rouge_l_f1 = 0.0
        is_heuristically_valid = False
        parsed_ans = "" # Initialize to avoid unbound error in logging
        norm_parsed = ""
        norm_gt = ""
        log_error = None
        exact_match = False # Keep for logging consistency

        try:
            # Assume parse_vlm_answer(text) exists
            parsed_ans = parse_vlm_answer(response_text)

            if not parsed_ans:
                log_error = "Failed to parse answer from <answer> tags"
            elif not gt_text:
                 log_error = "Ground truth text is missing/empty"
            else:
                # --- Heuristic Validity Check ---
                is_heuristically_valid = basic_latex_heuristic_check(parsed_ans)

                if is_heuristically_valid:
                    # --- Calculate Similarity ONLY IF valid ---
                    # Assume normalize_text(text) exists
                    norm_parsed = normalize_text(parsed_ans)
                    norm_gt = normalize_text(gt_text)

                    if not norm_parsed or not norm_gt:
                        log_error = "Normalization resulted in empty string(s), treating as invalid"
                        is_heuristically_valid = False # Override validity
                        reward = 0.0
                    else:
                        # Calculate ROUGE-L F1 score
                        scores = scorer.score(norm_gt, norm_parsed)
                        rouge_l_f1 = scores['rougeL'].fmeasure
                        
                        # Assign reward: 1.0 bonus + similarity score
                        reward = 1.0 + rouge_l_f1
                        
                        exact_match = (norm_parsed == norm_gt) # Check exact match for logging
                else:
                    # If heuristic check fails
                    log_error = "Failed basic LaTeX heuristic check (e.g., unbalanced braces)"
                    reward = 0.0
                    rouge_l_f1 = 0.0 # Ensure score is also 0 if invalid

        except Exception as e:
             log_error = f"Reward calculation error: {str(e)}"
             print(f"Error calculating reward for GT='{gt_text}', PRED='{parsed_ans}': {e}")
             reward = 0.0 # Ensure reward is 0 on error
             is_heuristically_valid = False
             rouge_l_f1 = 0.0


        rewards.append(reward)

        # --- Log results for this sample (if logging function exists) ---
        if 'log_training_response' in globals():
            log_training_response(
                image_identifier=image_id,
                prompt_text=current_prompt_text,
                generated_response=response_text,
                parsed_answer=parsed_ans,
                ground_truth_answer=gt_text,
                total_reward=reward,
                # Log reward components based on validity
                reward_components={
                    "heuristic_validity_bonus": 1.0 if is_heuristically_valid and reward > 0 else 0.0,
                    "similarity_rouge_l": rouge_l_f1 if is_heuristically_valid and reward > 0 else 0.0
                    },
                rouge_l_f1=rouge_l_f1 if is_heuristically_valid and reward > 0 else 0.0, # Log the conditional ROUGE
                # You could add 'is_heuristically_valid=is_heuristically_valid' if you modify the logger
                exact_match=exact_match,
                error=log_error,
                **kwargs # Pass through other info like model_version
            )

    return rewards

3. Exact Match Bonus (Optional):

  • exact_match_bonus_reward_fn: Provides an additional bonus reward (e.g., 2.0) if the normalized generated answer exactly matches the normalized ground truth. This strongly incentivizes perfection but can be a very sparse reward signal, as achieving exact matches in complex LaTeX is difficult. (During my experiments, this reward was often zero, highlighting the difficulty of exact matching).
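A minimal sketch of such a bonus function (the parse_answer and normalize_text helpers below are simplified stand-ins for the script's own parse_vlm_answer and normalize_text; the exact normalization may differ):

```python
import re
from typing import Dict, List

def parse_answer(text: str) -> str:
    """Extract the content of the <answer> tags (simplified stand-in)."""
    m = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL | re.IGNORECASE)
    return m.group(1).strip() if m else ""

def normalize_text(text: str) -> str:
    """Lowercase and collapse whitespace (assumed normalization)."""
    return " ".join(text.lower().split())

def exact_match_bonus_reward_fn(
    completions: List[List[Dict]], solution: List[str], **kwargs
) -> List[float]:
    """Award a 2.0 bonus only when the normalized answer exactly
    equals the normalized ground truth; otherwise 0.0."""
    responses = [comp[0]["content"] for comp in completions]
    rewards = []
    for response, gt in zip(responses, solution):
        ans = parse_answer(response)
        rewards.append(
            2.0 if ans and normalize_text(ans) == normalize_text(gt) else 0.0
        )
    return rewards
```

Because even one wrong token forfeits the entire bonus, this signal is sparse by design, which matches the behavior observed during training.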

This improved reward system, particularly the valid_latex_and_similarity_reward_fn, aims to guide the model more effectively by first ensuring a baseline level of structural correctness before rewarding textual similarity.
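When several reward functions are passed to the trainer, each sample's per-function scores are combined into one total; assuming equal weights (summation is, to my understanding, TRL's default, with optional per-function weights), the combination is simply:

```python
from typing import List

def combine_rewards(per_function_rewards: List[List[float]]) -> List[float]:
    """Sum each sample's scores across reward functions (equal weights assumed).

    per_function_rewards[f][i] is reward function f's score for sample i.
    """
    return [sum(scores) for scores in zip(*per_function_rewards)]
```

For example, a completion that earns full tag-presence and tag-order rewards and passes the validity gate with a ROUGE-L of 0.8 would total 1.0 + 1.0 + 1.8 = 3.8.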

The Results: Small Model Learns More Robust LaTeX

The GRPO training process, guided by the validity-gated reward function, demonstrated a significant improvement in both the Qwen 2.5-VL 3B and 7B models’ ability to generate accurate and structurally sound LaTeX code. To quantify this, I evaluated both the original base model and the final model (with merged LoRA adapters) on a held-out test set of 200 samples, excluded from the training data.

Metrics and graphs below cover the successful, most stable runs:

  • Reward Curves: Mean reward climbed steadily throughout training without plateauing, suggesting the models were still improving when the runs ended.
  • Comparison Table: The base and GRPO-tuned models were compared on average ROUGE-L and heuristic validity rate over the held-out test set.

Key Takeaways from the Results

  • Improved Similarity: The average ROUGE-L F1 score saw a notable increase from 0.7376 to 0.8009 for the 7B model. This indicates that the generated LaTeX from the fine-tuned model is significantly more similar in sequence and structure to the ground truth than that of the base model.
  • Enhanced Validity: While the base model already performed well on the basic brace-balancing heuristic (99.5% validity), the GRPO training pushed this to 100% on the test set. This suggests the reward function successfully penalized outputs with even simple structural flaws, leading to more consistently well-formed code according to this basic check.

Why This Matters: Democratizing Specialized AI

Improving LaTeX OCR on smaller models isn’t just an academic curiosity. It has real-world implications:

  • Accessibility: Enables high-quality math OCR on consumer hardware and in resource-constrained environments. The fine-tuned 3B model approaches the 7B’s quality in the final LaTeX strings generated from equation images, a promising sign that much more performance can be squeezed out of existing small models.
  • Efficiency: Faster and cheaper inference for processing scientific documents, educational materials, etc.
  • Reliability: The focus on validity leads to more consistently usable output.
  • Customization: Smaller models are easier to fine-tune further.
  • Research: Provides a platform for exploring VLM capabilities and RL techniques.

Overall Results Across Both Models

  • Improved Similarity: GRPO fine-tuning significantly boosted ROUGE-L F1 scores for both models (3B: +0.0640, 7B: +0.0633), showing enhanced textual accuracy.
  • Increased Reliability: The models produced more structurally sound outputs, achieving near-perfect or perfect heuristic validity rates and dramatically reducing metric calculation errors (especially for the 3B model).
  • Model Scale Matters: The larger 7B model reached higher absolute ROUGE-L scores (0.8009 vs 0.7502), indicating its higher performance ceiling.
  • Accessibility Win: The substantial gains on the 3B model confirm the effectiveness of this RL approach on smaller, more accessible hardware.

Conclusion: A Promising Path for Specialized VLMs

This project demonstrates the power of reinforcement learning (GRPO) to significantly enhance the capabilities of small Vision Language Models (Qwen 2.5-VL 3B and 7B) for the challenging task of LaTeX OCR. By refining the reward system to prioritize basic structural validity before rewarding textual similarity, we can train smaller models to produce more reliable and accurate results, making powerful AI tools more accessible and efficient. The journey towards perfect AI-driven OCR continues, but this experiment highlights the importance of thoughtful reward engineering in achieving robust performance.

Reference

@misc{dalal2025latexocr_grpo_blog_v2,
author = {Dalal, Hrishbh},
title = {{Small Model, Big Equations: Training Qwen 2.5-VL to Master LaTeX OCR with GRPO and Validity Rewards}},
year = {2025},
month = {April},
url = {https://hrishbh.com/small-model-big-equations-training-qwen-2-5-vl-to-master-latex-ocr-with-grpo/},
note = {[Blog post] Accessed: April 14, 2025}
}
