Teaching Language Models to Invent or Optimize Efficient Sudoku Algorithms Through Reinforcement Learning

While my previous experiment (Teaching Language Models to Solve Sudoku Through Reinforcement Learning) demonstrated that language models can learn to solve Sudoku puzzles through reinforcement learning by finding solutions through language and writing, this latest research takes a fundamentally different approach. Instead of teaching the model to solve Sudoku puzzles directly, I’m now training an LLM to invent increasingly efficient algorithms for solving Sudoku puzzles.

This meta-approach asks: Can AI not only solve problems but also innovate better methods of solving them? The results suggest the answer is a promising yes—even with relatively small 7B parameter models.

Building on Previous Success: From Puzzle Solvers to Algorithm Inventors

In my previous work, I showed how language models could learn step-by-step reasoning to solve Sudoku puzzles. This new experiment shifts focus:

  • Previous approach: Train the model to directly solve Sudoku puzzles
  • New approach: Train the model to write code that solves Sudoku puzzles efficiently

This shift leverages language models’ existing strength in code generation (they have been trained on billions, if not trillions, of high-quality coding and mathematical tokens) while providing a framework to systematically improve algorithmic thinking through reinforcement learning.

The progression feels natural – having taught models to solve puzzles, we now challenge them to invent better ways of solving those same puzzles. This represents a higher order of reasoning and creativity, moving from “can you solve this problem by writing language?” to “can you invent a better method for solving this class of problems by writing programs?”

By building on the reward functions, evaluation frameworks, and insights from my previous Sudoku-solving experiment, I’ve created a more ambitious challenge that pushes language models toward genuine algorithmic innovation. The knowledge gained from teaching models to maintain grid structures and follow logical rules now serves as the foundation for this more advanced task.

The Challenge of Algorithmic Innovation

Creating efficient Sudoku solvers presents a fascinating challenge for language models. Unlike simple code generation, developing optimized algorithms requires:

  • Understanding algorithmic complexity and performance trade-offs
  • Applying domain-specific knowledge about constraint satisfaction problems
  • Implementing optimization techniques like constraint propagation and pruning
  • Balancing readability with efficiency
  • Ensuring 100% correctness across all test cases

Another thing to note is that the model is never shown any puzzles other than the two included in the prompts! The prompt instructs the model to improve a baseline solver (which was itself generated by this model), and the result is then checked for correctness and speed. We know these LLMs have definitely seen algorithms that solve Sudoku; the interesting question is whether they can reproduce and improve one under this setup.

What makes this particularly interesting is that language models aren’t designed for algorithmic innovation, as far as we know. They’re trained to predict text, not to systematically improve code efficiency, least of all the smaller open-source models. Yet with the right approach, they can learn these skills.

The Naive Baseline Algorithm

For this experiment, I implemented a simple baseline Sudoku solver that uses basic backtracking:

def baseline_solver(quiz: str) -> str:
    """
    Reliable baseline Sudoku solver that always returns correct solution
    Returns only the solution string without timing information
    """
    import numpy as np
    grid = np.zeros((9, 9), dtype=int)
    for i in range(81):
        row, col = i // 9, i % 9
        grid[row][col] = int(quiz[i])
    
    # Backtracking solver implementation
    def solve(g):
        for i in range(81):
            r, c = i // 9, i % 9
            if g[r][c] == 0:
                for n in range(1, 10):
                    if (n not in g[r, :] and 
                        n not in g[:, c] and 
                        n not in g[3*(r//3):3*(r//3)+3, 3*(c//3):3*(c//3)+3]):
                        g[r][c] = n
                        if solve(g):
                            return True
                        g[r][c] = 0
                return False
        return True
    
    solve(grid)
    return ''.join(map(str, grid.flatten()))

This baseline algorithm, while correct, suffers from several critical limitations:

  1. Inefficient cell traversal: The algorithm processes cells in sequential order (left-to-right, top-to-bottom) rather than targeting the most constrained cells first.
  2. Naive constraint checking: For each cell, it rechecks all constraints from scratch without caching previous results.
  3. Slow for difficult puzzles: Performance degrades exponentially as the number of clues decreases, making it impractical for harder puzzles.
  4. No early constraint propagation: The algorithm doesn’t eliminate impossible values before attempting backtracking.
  5. Memory inefficiency: The approach creates numerous function call stacks during recursion.

The most serious problem is that for very difficult puzzles (those with fewer than 25 clues), the algorithm can time out completely, failing to find a solution within a reasonable timeframe.
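To illustrate the first limitation, here is a minimal sketch (my own, not code from the experiment) of the "most constrained cell" heuristic an improved solver could use: instead of scanning cells left-to-right, pick the empty cell with the fewest legal candidates. The helper names `candidates` and `most_constrained_cell` are illustrative.

```python
import numpy as np

def candidates(g: np.ndarray, r: int, c: int) -> set:
    """Legal digits for cell (r, c) under row/column/box constraints."""
    used = set(g[r, :]) | set(g[:, c]) | \
           set(g[3*(r//3):3*(r//3)+3, 3*(c//3):3*(c//3)+3].flatten())
    return set(range(1, 10)) - used

def most_constrained_cell(g: np.ndarray):
    """Return the empty cell with the fewest candidates, or None if the grid is full."""
    best, best_n = None, 10
    for r in range(9):
        for c in range(9):
            if g[r, c] == 0:
                n = len(candidates(g, r, c))
                if n < best_n:
                    best, best_n = (r, c), n
    return best
```

Branching on such a cell first shrinks the search tree dramatically, which is exactly the behavior the trained models ended up discovering.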

Data Preparation: From Puzzles to Algorithm Training

For this experiment, I leveraged a dataset of 4 million Sudoku puzzles from Kaggle, ranging from very easy to challenging difficulties. The dataset preparation involved several key steps:

  1. Loading and filtering: I used the kagglehub library to download the dataset and filtered puzzles based on difficulty levels.
  2. Difficulty classification: Puzzles were categorized into four difficulty levels based on the number of clues:
    • Level 1 (Very Easy): 50-81 clues
    • Level 2 (Easy): 40-49 clues
    • Level 3 (Medium): 30-39 clues
    • Level 4 (Hard): 24-29 clues (many of these have multiple solutions and led to tons of failed runs, sigh, German electricity bill 😉 )
    • I sample these levels with weights [1, 2, 3, 4] respectively, since in my opinion the true innovation should come from the hard tasks: the timeout for generated code is set to 5.0 seconds, and the models have to be faster than that.
  3. Prompt engineering: Each training sample contained:
    • The baseline solver code as a reference
    • Instructions to improve the algorithm’s efficiency
    • Format requirements (<think></think> and <answer></answer> tags)
    • Example test cases of varying difficulty

For this initial experiment, I created a focused dataset of 800 training samples, primarily using easier puzzles to establish a baseline for learning. I deliberately kept the dataset small to test how efficiently the models could learn with limited examples. Even though I call them training samples, the model never sees them directly during training; it only receives rewards computed from running its code on them.
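The clue-count bucketing above can be sketched as follows. Puzzles are 81-character strings in row-major order with '0' for blanks; the thresholds come from the list above, and the function names are my own:

```python
def clue_count(quiz: str) -> int:
    """Number of given digits in an 81-character puzzle string ('0' = blank)."""
    return sum(ch != '0' for ch in quiz)

def difficulty_level(quiz: str) -> int:
    """Map clue count to the four difficulty levels used in this experiment."""
    clues = clue_count(quiz)
    if clues >= 50: return 1   # Very Easy
    if clues >= 40: return 2   # Easy
    if clues >= 30: return 3   # Medium
    return 4                   # Hard (24-29 clues in this dataset)
```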

My Experimental Approach

I decided to explore whether reinforcement learning—specifically Group Relative Policy Optimization (GRPO)—could teach language models to become Sudoku algorithm inventors. I experimented with the Qwen 2.5 3B and 7B Instruct models, fine-tuned with LoRA rank 36. Importantly, this was a pure reinforcement learning approach, starting from the base instruction-tuned model without any cold-start data. The training configuration included:

  • Batch size: 1
  • Gradient accumulation steps: 4
  • Learning rate: 3e-4 (Karpathy Constant)
  • Max steps: 100
  • Max sequence length: 3000 tokens
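For reference, a configuration along these lines could be expressed with TRL's `GRPOTrainer` roughly as below. This is a hedged sketch, not the actual training script: the dataset variable and the mapping of "max sequence length" to `max_completion_length` are my assumptions, and the real runs used Unsloth with LoRA on top of the Qwen 2.5 Instruct models.

```python
from trl import GRPOConfig, GRPOTrainer

# Sketch only: train_dataset and the reward functions are defined elsewhere
config = GRPOConfig(
    output_dir="sudoku-grpo",
    learning_rate=3e-4,                 # the "Karpathy constant"
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    max_steps=100,
    max_completion_length=3000,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",
    args=config,
    train_dataset=train_dataset,        # prompts built as described above
    reward_funcs=[
        format_followed_completion_reward_fn,
        function_present_reward_fn,
        advantage_over_baseline_reward_fn_bounded,
    ],
)
trainer.train()
```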

The Code Execution Framework

To train a model to improve algorithms, I needed a robust system to extract, execute, and evaluate code. This presented several technical challenges:

def time_to_solution(
    fn_code: str, 
    quiz: str, 
    true_solution: str,
    subprocess_time: bool = True,
    timeout: float = 10.0,
    num_samples: int = 1
) -> Tuple[Optional[str], bool, float]:
    """
    Safely execute generated function and validate solution, averaging multiple runs
    """
    try:
        fn_name = extract_function_name(fn_code)
        if not fn_name:
            print("Could not find function name in code")
            return None, False, timeout
        
        # Create the full code to execute with timing measurement
        full_code = f'''
import sys
import time
import numpy as np

{fn_code}

try:
    start = time.perf_counter()
    solution = {fn_name}("{quiz}")
    solve_time = time.perf_counter() - start
    
    print(f"{{solution}}")
    print(f"{{solve_time:.16f}}")
except Exception as e:
    print(f"Error in solver: {{e}}", file=sys.stderr)
    sys.exit(1)
'''
        
        # Process execution and result handling...
        
    except Exception as e:
        print(f"Error running solver: {e}")
        return None, False, timeout

This framework allowed me to:

  1. Extract the function from the model’s response
  2. Execute it safely in a subprocess with timeout protection
  3. Measure execution time precisely
  4. Validate the correctness of the solution
  5. Compare performance against the baseline algorithm
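Steps 2 and 3 above can be sketched with Python's standard `subprocess` module. This is a simplified version of the idea, not the exact code from the experiment; `run_snippet` is an illustrative name:

```python
import subprocess
import sys

def run_snippet(code: str, timeout: float = 5.0):
    """
    Run untrusted Python code in a separate interpreter process.
    Returns (stdout, success). A hung solver is killed by the timeout.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout,
        )
        return proc.stdout, proc.returncode == 0
    except subprocess.TimeoutExpired:
        return "", False
```

Running in a subprocess rather than via `exec` in-process means an infinite loop or a crash in generated code cannot take down the training loop itself.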

A particular challenge was dealing with puzzles that have multiple valid solutions:

def validate_sudoku_solution(puzzle: str, solution: str) -> bool:
    """
    Validate a Sudoku solution against the original puzzle.
    """
    # Basic validation checks: 81 digits, no zeros remaining
    if len(solution) != 81 or not solution.isdigit() or '0' in solution:
        return False
    # Check original clues are preserved
    if any(p != '0' and p != s for p, s in zip(puzzle, solution)):
        return False
    # Check Sudoku rules: each row, column and box holds nine distinct digits
    for i in range(9):
        row = {solution[9 * i + j] for j in range(9)}
        col = {solution[9 * j + i] for j in range(9)}
        r0, c0 = 3 * (i // 3), 3 * (i % 3)
        box = {solution[9 * (r0 + j // 3) + c0 + j % 3] for j in range(9)}
        if len(row) != 9 or len(col) != 9 or len(box) != 9:
            return False
    return True

The Reward System: Training Better Algorithm Inventors

The heart of this experiment is the reward system that guides the language model toward more efficient algorithms. I implemented a multi-component reward system:

1. Format Compliance Rewards

These rewards ensure the model uses proper thinking and answering tags:

def format_followed_completion_reward_fn(completions, **kwargs):
    """
    Reward for presence of required tags (0.25 per tag, max 1.0)
    """
    responses = [completion[0]["content"] for completion in completions]
    rewards = []
    for response in responses:
        score = 0.0
        if "<think>" in response: score += 0.25
        if "</think>" in response: score += 0.25
        if "<answer>" in response: score += 0.25
        if "</answer>" in response: score += 0.25
        rewards.append(score)
    return rewards

2. Function Validation Rewards

These rewards check that the model generates valid Python functions with the correct signature:

def function_present_reward_fn(completions, **kwargs):
    """
    Evaluate completions and return rewards based on whether they contain a valid function.
    """
    rewards = []
    responses = [completion[0]["content"] for completion in completions]
    for response in responses:
        function_code = parse_improved_function(response)
        if function_code is not None:
            rewards.append(1.0)
        else:
            rewards.append(0.0)
    
    return rewards
def format_followed_by_parsed_function_reward_fn(completions, **kwargs):
    """
    Validate function structure and syntax
    """
    responses = [completion[0]["content"] for completion in completions]
    rewards = []
    for response in responses:
        # Extract the function code
        function_code = parse_improved_function(response)
        if function_code is None:
            rewards.append(0.0)
            continue
            
        try:
            # Validate function signature and basic structure
            namespace = {}
            exec(function_code, namespace)
            
            # Get the function name from the code
            import re
            fn_name_match = re.search(r'def\s+(\w+)', function_code)
            if not fn_name_match:
                rewards.append(0.0)
                continue
                
            fn_name = fn_name_match.group(1)
            fn = namespace.get(fn_name)
            
            if not fn:
                rewards.append(0.0)
                continue
                
            import inspect
            sig = inspect.signature(fn)
            params = list(sig.parameters.values())
            
            # Check function parameters
            if (len(params) != 1 or 
                params[0].name != 'quiz' or 
                params[0].annotation != str):
                rewards.append(0.0)
            else:
                rewards.append(1.0)
                
        except Exception as e:
            rewards.append(0.0)
            
    return rewards

3. The Critical Performance Reward

The most important reward measures algorithmic performance improvement over the baseline:

def advantage_over_baseline_reward_fn_bounded(
    completions,
    quiz,
    **kwargs
) -> List[float]:
    """
    Calculate reward by comparing against baseline solution and timing
    
    Reward breakdown:
    - 1.0 base reward for correctness
    - Up to 5.0 additional reward for speed improvements
    
    Example rewards (for correct solutions):
    - No speedup:   1.0 reward
    - 2x speedup:   2.0 reward (1.0 + log2(2))
    - 4x speedup:   3.0 reward (1.0 + log2(4))
    - 32x speedup:  6.0 reward (1.0 + min(5.0, log2(32)))
    """
    rewards = []
    responses = [completion[0]["content"] for completion in completions]
    
    # Get baseline solver source code
    baseline_solver_source = inspect.getsource(baseline_solver)
    difficulty = kwargs.get('difficulty')
    clue_count = kwargs.get('clue_count')
    
    for response, q, diff, cc in zip(responses, quiz, difficulty, clue_count):
        error = None
        baseline_solution = None
        model_solution = None
        is_correct = False
        baseline_time = 0.0
        model_time = 0.0
        reward = 0.0
        
        try:
            # Get baseline solution and time
            baseline_solution, baseline_correct, baseline_time = time_to_solution(
                fn_code=baseline_solver_source,
                quiz=q,
                true_solution=None,
                subprocess_time=True,
                timeout=5.0
            )
            
            if baseline_solution is None:
                error = "Baseline solver failed to find solution"
                rewards.append(0.0)
                continue
                
            # Parse model's function code
            fn_code = parse_improved_function(response)
            if not fn_code:
                error = "No valid function found in response"
                rewards.append(0.0)
                continue
                
            # Execute and time model's solution
            model_solution, is_correct, model_time = time_to_solution(
                fn_code=fn_code,
                quiz=q,
                true_solution=baseline_solution,
                subprocess_time=True,
                timeout=5.0
            )
            
            if not is_correct:
                error = "Model solution incorrect"
                rewards.append(0.0)
                continue
                
            # Calculate speed advantage with logarithmic scaling
            time_ratio = baseline_time / max(model_time, 1e-9)
            if time_ratio > 1:
                reward = 1.0 + min(5.0, np.log2(time_ratio))
            else:
                reward = 1.0  # Just the base reward for correctness
                
            rewards.append(reward)
            
        except Exception as e:
            error = str(e)
            rewards.append(0.0)
            
        finally:
            # Log results to database
            log_training_response(
                quiz=q,
                response=response,
                is_correct=is_correct,
                baseline_time=baseline_time,
                model_time=model_time,
                reward=reward,
                error=error,
                difficulty=diff,
                clue_count=cc,
                **training_context
            )
        
    return rewards

This reward function is particularly elegant because it:

  1. Ensures correctness as a prerequisite for any reward
  2. Uses logarithmic scaling to reward exponential improvements
  3. Provides a smooth gradient for the model to follow
  4. Records detailed performance metrics in a database

I tried to keep things simple to actually see the power of RL in exploration and exploitation.

The initial results are as follows:

  1. Main reward (the sum of all reward components): there is a drop in reward as the 7B model diverges from good learning, but the 3B is, fascinatingly, still improving. Maybe LoRA rank 32 helps, so I am training a 7B model with rank 32 as well.
  2. Advantage-over-baseline reward: the 7B model peaks around step 60, while the 3B keeps learning, albeit much more slowly than the 7B. The divergence of the 7B might be due to the smaller LoRA rank; I am training one with rank 32.
  3. Function format reward (the function must appear in the exact format described in the prompt): the 7B model stopped improving here and diverged to a different format than asked for. Weighting this reward more heavily would help keep the model in line.

Observations:

  1. Early stopping would help with the 7B model
  2. Training with LoRA rank 32 should improve stability (doing so right now)
  3. Response length increases around the point where training diverges from convergence, despite higher-quality rewards

Some examples of algorithms that performed better than the baseline:

  1. For a difficulty level 4 puzzle, the baseline time is 0.39977 s and the improved time is 0.00191 s; the improvement for this algorithm is 1985.22 times over the baseline.
import numpy as np
from itertools import product

def your_solver(quiz: str) -> str:
    """
    A faster Sudoku solver using constraint propagation and advanced pruning strategies.
    """
    
    grid = np.zeros((9, 9), dtype=int)
    for i in range(81):
        row, col = i // 9, i % 9
        grid[row][col] = int(quiz[i])
    
    # Initialize constraints
    row_constraints = [set(range(1, 10)) for _ in range(9)]
    col_constraints = [set(range(1, 10)) for _ in range(9)]
    block_constraints = [[set(range(1, 10)) for _ in range(3)] for _ in range(3)]
    
    def initialize_constraints():
        for r in range(9):
            for c in range(9):
                if grid[r][c] != 0:
                    row_constraints[r].remove(grid[r][c])
                    col_constraints[c].remove(grid[r][c])
                    block_constraints[r//3][c//3].remove(grid[r][c])
    
    initialize_constraints()
    
    # Apply constraints to each cell
    def apply_constraints():
        for i in range(81):
            r, c = i // 9, i % 9
            if grid[r][c] == 0:
                possible_values = row_constraints[r] & col_constraints[c] & block_constraints[r//3][c//3]
                if len(possible_values) == 1:
                    grid[r][c] = int(possible_values.pop())
                    row_constraints[r].remove(grid[r][c])
                    col_constraints[c].remove(grid[r][c])
                    block_constraints[r//3][c//3].remove(grid[r][c])
    
    apply_constraints()
    
    # Select cells with the fewest possible candidates
    def select_cell():
        min_candidates = 10
        selected = None
        for r in range(9):
            for c in range(9):
                if grid[r][c] == 0:
                    candidates = row_constraints[r] & col_constraints[c] & block_constraints[r//3][c//3]
                    if len(candidates) < min_candidates:
                        min_candidates = len(candidates)
                        selected = (r, c)
        return selected
    
    # Solve the grid using backtracking
    def solve(g):
        select = select_cell()
        if select is None:
            return True
        row, col = select
        for n in range(1, 10):
            if n in row_constraints[row] and n in col_constraints[col] and n in block_constraints[row//3][col//3]:
                g[row][col] = n
                row_constraints[row].remove(n)
                col_constraints[col].remove(n)
                block_constraints[row//3][col//3].remove(n)
                if solve(g):
                    return True
                g[row][col] = 0
                row_constraints[row].add(n)
                col_constraints[col].add(n)
                block_constraints[row//3][col//3].add(n)
        return False
    
    solve(grid)
    return ''.join(map(str, grid.flatten()))

2. Another example for difficulty level 4, where the speedup was about 700 times: the baseline time is 0.3998 s, the improved time is 0.00057 s, and the time ratio is 704.25.

def your_solver(quiz: str) -> str:
    # Initialize the grid and constraints
    grid = [[0] * 9 for _ in range(9)]
    row_constraints = [set(range(1, 10)) for _ in range(9)]
    col_constraints = [set(range(1, 10)) for _ in range(9)]
    block_constraints = [set(range(1, 10)) for _ in range(9)]
    
    for i, char in enumerate(quiz):
        if char == '0':
            continue
        r, c = i // 9, i % 9
        val = int(char)
        grid[r][c] = val
        row_constraints[r].remove(val)
        col_constraints[c].remove(val)
        block_constraints[r // 3 * 3 + c // 3].remove(val)
    
    # Backtracking solver with constraint propagation
    def solve(g, row_constraints, col_constraints, block_constraints, cells):
        if not cells:
            return True
        
        i, j = cells.pop()
        box_id = 3 * (i // 3) + j // 3
        
        for n in range(1, 10):
            if n in row_constraints[i] and n in col_constraints[j] and n in block_constraints[box_id]:
                g[i][j] = n
                row_constraints[i].remove(n)
                col_constraints[j].remove(n)
                block_constraints[box_id].remove(n)
                
                if solve(g, row_constraints, col_constraints, block_constraints, cells):
                    return True
                
                g[i][j] = 0
                row_constraints[i].add(n)
                col_constraints[j].add(n)
                block_constraints[box_id].add(n)
        
        cells.append((i, j))
        return False
    
    # Initialize the stack of empty cells
    cells = []
    for i in range(9):
        for j in range(9):
            if grid[i][j] == 0:
                cells.append((i, j))
    
    # Start solving
    solve(grid, row_constraints, col_constraints, block_constraints, cells)
    
    # Convert the grid to a string
    return ''.join(map(str, sum(grid, [])))

Failures and Frustrations

These results did not come without hiccups, bumps, and bruises.

  1. Small models do not learn fast enough. Something around 7B parameters seems necessary if no SFT is done first.
  2. Defining the prompt for the task goes a long way. Think of it as explaining to an intern exactly what the task is and what the output should look like. Prompt engineering gives a really good head start.
  3. Compute matters. It lets you try tons of different things, like ablating on puzzle hardness; I wanted to try a uniform difficulty distribution or bigger datasets. But constraints breed intuition, so I think about how I would want a teacher to choose the problems that help me shine in class 🙂
  4. The library I am using, Unsloth, works amazingly and lets me try crazy stuff, but unfortunately only on one GPU for now. I tried writing my own multi-GPU code, but set it aside to focus on this research and leave that to the legends already working on it.
  5. Working with evaluations was a nightmare: they made runs 3 times slower and the code kept breaking. I spent nights until 2-3am figuring out what was wrong in Unsloth or TRL. A good standardized codebase for compute-poor people would go a long way.

Potential Areas of Improvements

As with everything in the world, things definitely could be better in many ways.

  1. Longer evaluation times: Some Sudokus, however easy, do actually take longer than 5 seconds, so I designed an adaptive timeout of [2, 4, 5, 15] seconds for the four difficulty levels respectively, but did not implement it due to time constraints and wanting to work on other projects. This should definitely improve results, since many of the discarded algorithms do solve their puzzles, just outside the time limit. This is my next task.
  2. Using bigger models: Too obvious. Yes, models improve faster when they are bigger. What we sacrifice in GPU memory is saved in time to convergence, as I have trained far too many models not to realize this.
  3. Having an SFT dataset to start with: My intuition is that SFTing small models first on a dataset of tens or even a few hundred examples, and then RLing them, should go a long way. It is like teaching a child by illustration first: they memorize it, then you give them new problems to FAFO on, and in this way they generalize and even carry the teachings into other aspects of life.
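The adaptive timeout from point 1 amounts to a small lookup keyed by difficulty level (values from above; it is not yet wired into the training loop):

```python
# Per-difficulty timeout in seconds (levels 1-4), as proposed above
ADAPTIVE_TIMEOUT = {1: 2.0, 2: 4.0, 3: 5.0, 4: 15.0}

def timeout_for(difficulty: int) -> float:
    """Fall back to the original flat 5.0 s limit for unknown levels."""
    return ADAPTIVE_TIMEOUT.get(difficulty, 5.0)
```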

Citation

@online{dalal2025sudoku,
  author  = {Dalal, Hrishbh},
  title   = {Teaching Language Models to Invent or Optimize Efficient Sudoku Algorithms Through Reinforcement Learning},
  year    = {2025},
  month   = {3},
  day     = {18},
  url     = {https://hrishbh.com/teaching-language-models-to-invent-or-optimize-efficient-sudoku-algorithms-through-reinforcement-learning/},
  urldate = {2025-03-18}
}

Note: This is a work in progress, and I welcome feedback and suggestions from the community!
