While my previous experiment (Teaching Language Models to Solve Sudoku Through Reinforcement Learning) demonstrated that language models can learn to solve Sudoku puzzles through reinforcement learning by finding solutions through language and writing, this latest research takes a fundamentally different approach. Instead of teaching the model to solve Sudoku puzzles directly, I’m now training an LLM to invent increasingly efficient algorithms for solving Sudoku puzzles.
This meta-approach asks: Can AI not only solve problems but also innovate better methods of solving them? The results suggest the answer is a promising yes—even with relatively small 7B parameter models.
Building on Previous Success: From Puzzle Solvers to Algorithm Inventors
In my previous work, I showed how language models could learn step-by-step reasoning to solve Sudoku puzzles. This new experiment shifts focus:
- Previous approach: Train the model to directly solve Sudoku puzzles
- New approach: Train the model to write code that solves Sudoku puzzles efficiently
This shift leverages language models’ existing strength in code generation (they have been trained on billions, if not trillions, of high-quality coding and mathematical tokens) while providing a framework for systematically improving algorithmic thinking through reinforcement learning.
The progression feels natural: having taught models to solve puzzles, we now challenge them to invent better ways of solving those same puzzles. This represents a higher order of reasoning and creativity, moving from “can you solve this problem by writing in language?” to “can you invent a better method for solving this class of problems by writing programs?”

By building on the reward functions, evaluation frameworks, and insights from my previous Sudoku-solving experiment, I’ve created a more ambitious challenge that pushes language models toward genuine algorithmic innovation. The knowledge gained from teaching models to maintain grid structures and follow logical rules now serves as the foundation for this more advanced task.
The Challenge of Algorithmic Innovation
Creating efficient Sudoku solvers presents a fascinating challenge for language models. Unlike simple code generation, developing optimized algorithms requires:
- Understanding algorithmic complexity and performance trade-offs
- Applying domain-specific knowledge about constraint satisfaction problems
- Implementing optimization techniques like constraint propagation and pruning
- Balancing readability with efficiency
- Ensuring 100% correctness across all test cases
Another thing to note is that THE MODEL IS NEVER SHOWN ANY PUZZLES OTHER THAN THE TWO IN THE PROMPT! The prompt instructs the model to improve a baseline solver that this same model generated, and the reward system then checks whether the improvement was any good. We know these LLMs have definitely seen algorithms that solve Sudoku during pretraining; the interesting question is whether they can rediscover and improve on them in this setting.
What makes this particularly interesting is that language models aren’t designed for algorithmic innovation, as far as we know. They’re trained to predict text, not to systematically improve code efficiency, least of all the ones available as open source. Yet with the right approach, they can learn these skills.
The Naive Baseline Algorithm
For this experiment, I implemented a simple baseline Sudoku solver that uses basic backtracking:
def baseline_solver(quiz: str) -> str:
    """
    Reliable baseline Sudoku solver that always returns correct solution
    Returns only the solution string without timing information
    """
    import numpy as np
    grid = np.zeros((9, 9), dtype=int)
    for i in range(81):
        row, col = i // 9, i % 9
        grid[row][col] = int(quiz[i])

    # Backtracking solver implementation
    def solve(g):
        for i in range(81):
            r, c = i // 9, i % 9
            if g[r][c] == 0:
                for n in range(1, 10):
                    if (n not in g[r, :] and
                        n not in g[:, c] and
                        n not in g[3*(r//3):3*(r//3)+3, 3*(c//3):3*(c//3)+3]):
                        g[r][c] = n
                        if solve(g):
                            return True
                        g[r][c] = 0
                return False
        return True

    solve(grid)
    return ''.join(map(str, grid.flatten()))
This baseline algorithm, while correct, suffers from several critical limitations:
- Inefficient cell traversal: The algorithm processes cells in sequential order (left-to-right, top-to-bottom) rather than targeting the most constrained cells first.
- Naive constraint checking: For each cell, it rechecks all constraints from scratch without caching previous results.
- Slow for difficult puzzles: Performance degrades exponentially as the number of clues decreases, making it impractical for harder puzzles.
- No early constraint propagation: The algorithm doesn’t eliminate impossible values before attempting backtracking.
- Memory inefficiency: The approach creates numerous function call stacks during recursion.
The most serious problem is that for very difficult puzzles (those with fewer than 25 clues), the algorithm can time out completely, failing to find a solution within a reasonable timeframe.
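The “most constrained cells first” fix alluded to above is the classic minimum-remaining-values (MRV) heuristic. Here is a minimal sketch of the idea; the helper names are my own for illustration, not from the training code:

```python
import numpy as np

def candidates_for(grid: np.ndarray, r: int, c: int) -> set:
    """Values not yet used in the cell's row, column, or 3x3 box."""
    used = set(grid[r, :]) | set(grid[:, c])
    used |= set(grid[3*(r//3):3*(r//3)+3, 3*(c//3):3*(c//3)+3].flatten())
    return set(range(1, 10)) - used

def most_constrained_cell(grid: np.ndarray):
    """Return the empty cell with the fewest candidates (MRV heuristic)."""
    best, best_count = None, 10
    for r in range(9):
        for c in range(9):
            if grid[r, c] == 0:
                n = len(candidates_for(grid, r, c))
                if n < best_count:
                    best, best_count = (r, c), n
    return best
```

Branching on such cells first shrinks the search tree dramatically, which is exactly the kind of improvement we hope the model discovers.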
Data Preparation: From Puzzles to Algorithm Training
For this experiment, I leveraged a dataset of 4 million Sudoku puzzles from Kaggle, ranging from very easy to challenging difficulties. The dataset preparation involved several key steps:
- Loading and filtering: I used the kagglehub library to download the dataset and filtered puzzles based on difficulty levels.
- Difficulty classification: Puzzles were categorized into four difficulty levels based on the number of clues:
- Level 1 (Very Easy): 50-81 clues
- Level 2 (Easy): 40-49 clues
- Level 3 (Medium): 30-39 clues
- Level 4 (Hard): 24-29 clues (many of these have multiple solutions and led to tons of failed runs, sigh, German electricity bill 😉 )
- I sample across these levels with a [1, 2, 3, 4] distribution, respectively, since in my opinion the true innovation should come from hard tasks: the timeout for generated code is set to 5.0 seconds, and the models have to be faster than that.
- Prompt engineering: Each training sample contained:
- The baseline solver code as a reference
- Instructions to improve the algorithm’s efficiency
- Format requirements (<think></think> and <answer></answer> tags)
- Example test cases of varying difficulty
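The clue-count-based difficulty classification can be sketched as a small helper (the function name is my own; the boundaries mirror the levels listed above):

```python
def difficulty_level(quiz: str) -> int:
    """Map an 81-character puzzle string to the four difficulty levels above."""
    clues = sum(ch != '0' for ch in quiz)  # non-zero characters are given clues
    if clues >= 50:
        return 1  # very easy
    if clues >= 40:
        return 2  # easy
    if clues >= 30:
        return 3  # medium
    return 4      # hard (24-29 clues in this dataset)
```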
For this initial experiment, I created a focused dataset of 800 training samples, primarily using easier puzzles to establish a baseline for learning. I deliberately kept the dataset small to test how efficiently the models could learn with limited examples. Even though I call them training samples, the model never sees them during training; it only receives rewards computed from them.
My Experimental Approach
I decided to explore whether reinforcement learning—specifically Group Relative Policy Optimization (GRPO)—could teach language models to become Sudoku algorithm inventors. I experimented with the Qwen 2.5 3B and 7B Instruct models, fine-tuned with LoRA at rank 36. Importantly, this was a pure reinforcement learning approach, starting from the instruction-tuned model without any cold-start data. The training configuration included:
- Batch size: 1
- Gradient accumulation steps: 4
- Learning rate: 3e-4 (Karpathy Constant)
- Max steps: 100
- Max sequence length: 3000 tokens
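Under the hood this maps onto trl’s GRPO trainer, which Unsloth wraps. A rough sketch of what the configuration could look like; the parameter mapping and the `model`/`dataset` setup here are assumptions for illustration, not the actual training script:

```python
from trl import GRPOConfig, GRPOTrainer

# Hypothetical sketch mirroring the hyperparameters listed above.
config = GRPOConfig(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=3e-4,
    max_steps=100,
    max_completion_length=3000,
)

# `model` and `dataset` are placeholders for the LoRA-wrapped Qwen 2.5
# Instruct model and the 800-sample prompt dataset described above.
trainer = GRPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    reward_funcs=[
        format_followed_completion_reward_fn,
        function_present_reward_fn,
        format_followed_by_parsed_function_reward_fn,
        advantage_over_baseline_reward_fn_bounded,
    ],
)
trainer.train()
```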
The Code Execution Framework
To train a model to improve algorithms, I needed a robust system to extract, execute, and evaluate code. This presented several technical challenges:
def time_to_solution(
    fn_code: str,
    quiz: str,
    true_solution: str,
    subprocess_time: bool = True,
    timeout: float = 10.0,
    num_samples: int = 1
) -> Tuple[Optional[str], bool, float]:
    """
    Safely execute generated function and validate solution, averaging multiple runs
    """
    try:
        fn_name = extract_function_name(fn_code)
        if not fn_name:
            print("Could not find function name in code")
            return None, False, timeout

        # Create the full code to execute with timing measurement
        full_code = f'''
import sys
import time
import numpy as np

{fn_code}

try:
    start = time.perf_counter()
    solution = {fn_name}("{quiz}")
    solve_time = time.perf_counter() - start
    print(f"{{solution}}")
    print(f"{{solve_time:.16f}}")
except Exception as e:
    print(f"Error in solver: {{e}}", file=sys.stderr)
    sys.exit(1)
'''
        # Process execution and result handling...
    except Exception as e:
        print(f"Error running solver: {e}")
        return None, False, timeout
This framework allowed me to:
- Extract the function from the model’s response
- Execute it safely in a subprocess with timeout protection
- Measure execution time precisely
- Validate the correctness of the solution
- Compare performance against the baseline algorithm
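The subprocess-with-timeout part of the framework can be sketched in isolation. This is a simplified stand-in for the real execution path, not the actual implementation:

```python
import subprocess
import sys

def run_with_timeout(code: str, timeout: float = 5.0):
    """Run a Python snippet in a child process; return (stdout_lines, ok).

    A timeout or a non-zero exit code counts as failure, mirroring how a
    slow or crashing generated solver would be scored zero.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None, False  # solver exceeded the time budget
    if result.returncode != 0:
        return None, False  # solver crashed
    return result.stdout.strip().splitlines(), True
```

Running the generated code in a separate process also protects the training loop from infinite loops and crashes in model-written solvers.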
A particular challenge was dealing with puzzles that have multiple valid solutions:
def validate_sudoku_solution(puzzle: str, solution: str) -> bool:
    """
    Validate a Sudoku solution against the original puzzle.
    """
    # Basic validation checks
    # Check original clues are preserved
    # Check Sudoku rules (rows, columns, boxes)
    # Check no zeros remain
    # Check whether the solution is actually a solution for the puzzle
    # If any of the above checks fail, return False, meaning not a solution
    ...
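For illustration, here is one way the elided checks could be implemented. This is a plausible reconstruction, not the exact code from the experiment; note that it accepts any fully valid grid that preserves the clues, which is what makes it robust to puzzles with multiple solutions:

```python
def validate_sudoku_solution(puzzle: str, solution: str) -> bool:
    """Accept any completed grid that satisfies Sudoku rules and keeps the clues."""
    # Basic shape/character checks; a '0' means an unfilled cell
    if len(solution) != 81 or not solution.isdigit() or '0' in solution:
        return False
    # Original clues must be preserved
    for p, s in zip(puzzle, solution):
        if p != '0' and p != s:
            return False
    # Rows, columns, and 3x3 boxes must each contain 1-9 exactly once
    digits = set('123456789')
    grid = [solution[r*9:(r+1)*9] for r in range(9)]
    for i in range(9):
        if set(grid[i]) != digits:                 # row i
            return False
        if {row[i] for row in grid} != digits:     # column i
            return False
        r0, c0 = 3 * (i // 3), 3 * (i % 3)
        box = {grid[r0+dr][c0+dc] for dr in range(3) for dc in range(3)}
        if box != digits:                          # box i
            return False
    return True
```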
The Reward System: Training Better Algorithm Inventors
The heart of this experiment is the reward system that guides the language model toward more efficient algorithms. I implemented a multi-component reward system:
1. Format Compliance Rewards
These rewards ensure the model uses proper thinking and answering tags:
Code for Format Compliance
def format_followed_completion_reward_fn(completions, **kwargs):
    """
    Reward for presence of required tags (0.25 per tag, max 1.0)
    """
    responses = [completion[0]["content"] for completion in completions]
    rewards = []
    for response in responses:
        score = 0.0
        if "<think>" in response: score += 0.25
        if "</think>" in response: score += 0.25
        if "<answer>" in response: score += 0.25
        if "</answer>" in response: score += 0.25
        rewards.append(score)
    return rewards
2. Function Validation Rewards
These rewards check that the model generates valid Python functions with the correct signature:
def function_present_reward_fn(completions, **kwargs):
    """
    Evaluate completions and return rewards based on whether they contain a valid function.
    """
    rewards = []
    responses = [completion[0]["content"] for completion in completions]
    for response in responses:
        function_code = parse_improved_function(response)
        if function_code is not None:
            rewards.append(1.0)
        else:
            rewards.append(0.0)
    return rewards

def format_followed_by_parsed_function_reward_fn(completions, **kwargs):
    """
    Validate function structure and syntax
    """
    responses = [completion[0]["content"] for completion in completions]
    rewards = []
    for response in responses:
        # Extract the function code
        function_code = parse_improved_function(response)
        if function_code is None:
            rewards.append(0.0)
            continue
        try:
            # Validate function signature and basic structure
            namespace = {}
            exec(function_code, namespace)
            # Get the function name from the code
            import re
            fn_name_match = re.search(r'def\s+(\w+)', function_code)
            if not fn_name_match:
                rewards.append(0.0)
                continue
            fn_name = fn_name_match.group(1)
            fn = namespace.get(fn_name)
            if not fn:
                rewards.append(0.0)
                continue
            import inspect
            sig = inspect.signature(fn)
            params = list(sig.parameters.values())
            # Check function parameters
            if (len(params) != 1 or
                params[0].name != 'quiz' or
                params[0].annotation != str):
                rewards.append(0.0)
            else:
                rewards.append(1.0)
        except Exception as e:
            rewards.append(0.0)
    return rewards
3. The Critical Performance Reward
The most important reward measures algorithmic performance improvement over the baseline:
def advantage_over_baseline_reward_fn_bounded(
    completions,
    quiz,
    **kwargs
) -> List[float]:
    """
    Calculate reward by comparing against baseline solution and timing

    Reward breakdown:
    - 1.0 base reward for correctness
    - Up to 5.0 additional reward for speed improvements

    Example rewards (for correct solutions):
    - No speedup: 1.0 reward
    - 2x speedup: 2.0 reward (1.0 + log2(2))
    - 4x speedup: 3.0 reward (1.0 + log2(4))
    - 32x speedup: 6.0 reward (1.0 + min(5.0, log2(32)))
    """
    rewards = []
    responses = [completion[0]["content"] for completion in completions]
    # Get baseline solver source code
    baseline_solver_source = inspect.getsource(baseline_solver)
    difficulty = kwargs.get('difficulty')
    clue_count = kwargs.get('clue_count')
    for response, q, diff, cc in zip(responses, quiz, difficulty, clue_count):
        error = None
        baseline_solution = None
        model_solution = None
        is_correct = False
        baseline_time = 0.0
        model_time = 0.0
        reward = 0.0
        try:
            # Get baseline solution and time
            baseline_solution, baseline_correct, baseline_time = time_to_solution(
                fn_code=baseline_solver_source,
                quiz=q,
                true_solution=None,
                subprocess_time=True,
                timeout=5.0
            )
            if baseline_solution is None:
                error = "Baseline solver failed to find solution"
                rewards.append(0.0)
                continue
            # Parse model's function code
            fn_code = parse_improved_function(response)
            if not fn_code:
                error = "No valid function found in response"
                rewards.append(0.0)
                continue
            # Execute and time model's solution
            model_solution, is_correct, model_time = time_to_solution(
                fn_code=fn_code,
                quiz=q,
                true_solution=baseline_solution,
                subprocess_time=True,
                timeout=5.0
            )
            if not is_correct:
                error = "Model solution incorrect"
                rewards.append(0.0)
                continue
            # Calculate speed advantage with logarithmic scaling
            time_ratio = baseline_time / max(model_time, 1e-9)
            if time_ratio > 1:
                reward = 1.0 + min(5.0, np.log2(time_ratio))
            else:
                reward = 1.0  # Just the base reward for correctness
            rewards.append(reward)
        except Exception as e:
            error = str(e)
            rewards.append(0.0)
        finally:
            # Log results to database
            log_training_response(
                quiz=q,
                response=response,
                is_correct=is_correct,
                baseline_time=baseline_time,
                model_time=model_time,
                reward=reward,
                error=error,
                difficulty=diff,
                clue_count=cc,
                **training_context
            )
    return rewards
This reward function is particularly elegant because it:
- Ensures correctness as a prerequisite for any reward
- Uses logarithmic scaling to reward exponential improvements
- Provides a smooth gradient for the model to follow
- Records detailed performance metrics in a database
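For clarity, here is the speed component in isolation — the same formula used in advantage_over_baseline_reward_fn_bounded, extracted into a hypothetical standalone helper:

```python
import math

def speed_reward(baseline_time: float, model_time: float) -> float:
    """1.0 for a correct solution, plus log2 of the speedup, capped at +5.0."""
    ratio = baseline_time / max(model_time, 1e-9)  # guard against division by zero
    if ratio > 1:
        return 1.0 + min(5.0, math.log2(ratio))
    return 1.0  # no bonus when the model is not faster than the baseline
```

The logarithm means a 2x speedup and a doubling from 512x to 1024x earn the same marginal reward, which keeps the gradient informative across the whole range of improvements.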
I tried to keep things simple to actually see the power of RL in exploration and exploitation.
The initial results are as follows:
- Total reward: the sum of all the individual rewards

- Advantage-over-baseline reward: the main reward, measuring speedup over the initial baseline

- Function format reward: the function must be present in the proper format described in the prompt

Observations:
- Early stopping will help the model
- Training at LoRA rank 32 should keep things stable (a run is in progress)
- Response length increases around the point where training diverged from convergence, even as reward quality improved
Some examples of algorithms that performed better than the baseline:
- For difficulty level 4, the baseline time is 0.39977 s and the improved time is 0.00191 s; the improvement for this algorithm is 1985.22138x over the baseline.
import numpy as np
from itertools import product

def your_solver(quiz: str) -> str:
    """
    A faster Sudoku solver using constraint propagation and advanced pruning strategies.
    """
    grid = np.zeros((9, 9), dtype=int)
    for i in range(81):
        row, col = i // 9, i % 9
        grid[row][col] = int(quiz[i])

    # Initialize constraints
    row_constraints = [set(range(1, 10)) for _ in range(9)]
    col_constraints = [set(range(1, 10)) for _ in range(9)]
    block_constraints = [[set(range(1, 10)) for _ in range(3)] for _ in range(3)]

    def initialize_constraints():
        for r in range(9):
            for c in range(9):
                if grid[r][c] != 0:
                    row_constraints[r].remove(grid[r][c])
                    col_constraints[c].remove(grid[r][c])
                    block_constraints[r//3][c//3].remove(grid[r][c])

    initialize_constraints()

    # Apply constraints to each cell
    def apply_constraints():
        for i in range(81):
            r, c = i // 9, i % 9
            if grid[r][c] == 0:
                possible_values = row_constraints[r] & col_constraints[c] & block_constraints[r//3][c//3]
                if len(possible_values) == 1:
                    grid[r][c] = int(possible_values.pop())
                    row_constraints[r].remove(grid[r][c])
                    col_constraints[c].remove(grid[r][c])
                    block_constraints[r//3][c//3].remove(grid[r][c])

    apply_constraints()

    # Select cells with the fewest possible candidates
    def select_cell():
        min_candidates = 10
        selected = None
        for r in range(9):
            for c in range(9):
                if grid[r][c] == 0:
                    candidates = row_constraints[r] & col_constraints[c] & block_constraints[r//3][c//3]
                    if len(candidates) < min_candidates:
                        min_candidates = len(candidates)
                        selected = (r, c)
        return selected

    # Solve the grid using backtracking
    def solve(g):
        select = select_cell()
        if select is None:
            return True
        row, col = select
        for n in range(1, 10):
            if n in row_constraints[row] and n in col_constraints[col] and n in block_constraints[row//3][col//3]:
                g[row][col] = n
                row_constraints[row].remove(n)
                col_constraints[col].remove(n)
                block_constraints[row//3][col//3].remove(n)
                if solve(g):
                    return True
                g[row][col] = 0
                row_constraints[row].add(n)
                col_constraints[col].add(n)
                block_constraints[row//3][col//3].add(n)
        return False

    solve(grid)
    return ''.join(map(str, grid.flatten()))
- Another example for difficulty level 4, where the speedup was about 700x: the baseline time is 0.3998 s, the improved time is 0.00057 s, and the time ratio is 704.25.
def your_solver(quiz: str) -> str:
    # Initialize the grid and constraints
    grid = [[0] * 9 for _ in range(9)]
    row_constraints = [set(range(1, 10)) for _ in range(9)]
    col_constraints = [set(range(1, 10)) for _ in range(9)]
    block_constraints = [set(range(1, 10)) for _ in range(9)]
    for i, char in enumerate(quiz):
        if char == '0':
            continue
        r, c = i // 9, i % 9
        val = int(char)
        grid[r][c] = val
        row_constraints[r].remove(val)
        col_constraints[c].remove(val)
        block_constraints[r // 3 * 3 + c // 3].remove(val)

    # Backtracking solver with constraint propagation
    def solve(g, row_constraints, col_constraints, block_constraints, cells):
        if not cells:
            return True
        i, j = cells.pop()
        box_id = 3 * (i // 3) + j // 3
        for n in range(1, 10):
            if n in row_constraints[i] and n in col_constraints[j] and n in block_constraints[box_id]:
                g[i][j] = n
                row_constraints[i].remove(n)
                col_constraints[j].remove(n)
                block_constraints[box_id].remove(n)
                if solve(g, row_constraints, col_constraints, block_constraints, cells):
                    return True
                g[i][j] = 0
                row_constraints[i].add(n)
                col_constraints[j].add(n)
                block_constraints[box_id].add(n)
        cells.append((i, j))
        return False

    # Initialize the stack of empty cells
    cells = []
    for i in range(9):
        for j in range(9):
            if grid[i][j] == 0:
                cells.append((i, j))

    # Start solving
    solve(grid, row_constraints, col_constraints, block_constraints, cells)

    # Convert the grid to a string
    return ''.join(map(str, sum(grid, [])))
Failures and Frustrations
These results did not come without hiccups, bumps, and bruises.
- Small models do not learn fast enough. Something around the 7B size is needed if no SFT is done first.
- Defining the prompt for the task goes a long way. Think of it as explaining to an intern exactly what the task is and what the output should look like. Good prompt engineering gives a really good head start.
- Compute matters: it lets you try tons of different things. For ablating on hardness, for example, I wanted to try a uniform distribution or bigger datasets. But constraints breed intuition, so I think about how I would want a teacher to pick the problems that help me shine in class 🙂
- The library I am using, Unsloth, works amazingly and lets me try crazy stuff, but unfortunately only on one GPU for now. I tried writing my own multi-GPU code but set it aside, as I wanted to focus on this research first and leave that to the legends already working on it.
- Working with evaluations was a nightmare: they led to 3x slower runs and code breaking all the time. I spent nights until 2-3 am figuring out what was wrong in Unsloth or TRL. A good standardized codebase for the compute-poor would go a long way.
Potential Areas of Improvements
As with everything in the world, things definitely could be better in many ways.
- Longer evaluation timeouts: Some Sudokus, however easy, actually take longer than 5 seconds to solve, so I designed an adaptive timeout of [2, 4, 5, 15] seconds for the respective difficulty levels, but did not implement it due to time constraints and wanting to work on other projects. This should definitely improve results, since many of the discarded algorithms do solve the puzzle, just more slowly. It is my next task.
- Using bigger models: Too obvious. Yes, models improve faster when they are bigger; what we sacrifice in GPU memory we save in time to convergence. I have trained way too many models not to realize this.
- Starting from an SFT dataset: My intuition is that SFT-ing small models on a dataset of tens or even a few hundred examples first, and then RLing them, should go a long way. It is like teaching a child with worked examples first; the child memorizes them, and then you give them new problems to FAFO on. This way the child generalizes and even carries the lessons into other aspects of life.
Citation
@online{dalal2025sudoku,
  author  = {Dalal, Hrishbh},
  title   = {Teaching Language Models to Invent or Optimize Efficient Sudoku Algorithms Through Reinforcement Learning},
  year    = {2025},
  month   = {3},
  day     = {18},
  url     = {https://hrishbh.com/teaching-language-models-to-invent-or-optimize-efficient-sudoku-algorithms-through-reinforcement-learning/},
  urldate = {2025-03-18}
}
Note: This is a work in progress, and I welcome feedback and suggestions from the community!