Vision-Language Models (VLMs) have shown incredible prowess in understanding and describing the visual world. But can they translate that sight into strategic action? Can a VLM not only see the arrangement of tiles in the classic game 2048 but also learn to play it effectively, anticipating merges and planning moves based on visual input? This project dives into teaching a VLM, specifically Qwen2.5-VL, the strategic nuances of 2048 using the power of reinforcement learning.
The Challenge: Seeing is Believing, but Strategy is Learned
Teaching a VLM to play 2048 introduces unique hurdles compared to a text-only model:
- Visual Interpretation & Rule Adherence: The model must accurately interpret the tile values and positions directly from the image and apply the game’s movement and merging rules based on this visual understanding.
- Strategic Depth from Visual State: Good play still requires looking ahead and avoiding grid lock, but the planning must originate from the visual representation of the board.
- Visual-Spatial Reasoning: The model needs to inherently understand the spatial relationships between tiles on the grid as depicted in the image, not just infer them from text coordinates.
- Numerical Grounding in Vision: It must connect the visual appearance of tile numbers (e.g., the image of “128”) to their numerical value and the goal of doubling.
- Structured Text Output from Visual Input: Despite the visual input, the model must still generate a precise, valid text action (‘up’, ‘down’, ‘left’, ‘right’) within the required structured format (<think>, <answer> tags).
VLMs are naturally better suited for visual tasks, but the strategic planning and sequential decision-making aspects still require targeted training. Reinforcement learning provides the framework to bridge this gap.
VLM Playing 2048
YOUTUBE LINK HERE
The Foundation: Visual Data for a Seeing Agent
2048 is a Markov process: the current board state contains everything you need to choose an action, and that action (plus the random tile spawn) determines the reward and the next state. You do not need to remember history for this game, and that is beautiful! Backgammon is similar in this respect: you just need to make the optimal move based on the board you currently see.
- A crucial part of this project was creating a diverse and representative dataset of 2048 board states. Simply using random boards wouldn’t capture the nuances of different game stages and difficulties: naive random sampling just gets you impractical data, a lesson reinforced by feedback from my amazing community of twitter/x research and anon accounts. Honestly, I enjoyed this part more than the training and results. This time, I developed a more deliberate data generation pipeline:
- Simulated Gameplay: Instead of static puzzles, I generated board states by simulating thousands of games using random (but valid) moves. This creates more realistic mid-game scenarios.
- Difficulty Scoring: I implemented a custom heuristic function to evaluate the complexity of a given board state. This considered factors like the number of empty cells, the value of the highest tile, the similarity of adjacent tile values (smoothness), and the directional flow of values (monotonicity).
- Difficulty Classification: Based on the difficulty score and predefined thresholds (calibrated using a large sample of generated states), each board was classified into one of five difficulty levels, ensuring the dataset spanned a range from beginner to expert positions.
- Targeted Generation Pipeline:
- Calibration: First, a large number of states were generated to determine the score thresholds for each difficulty level automatically.
- Oversampling: To ensure enough examples for rarer, harder difficulty levels, the pipeline generated significantly more states than initially needed, using an oversampling factor of 3.0.
- Parallel Processing: Simulation is computationally intensive, so the generation process was designed to run in parallel across multiple CPU cores, significantly speeding up dataset creation.
- Balanced Sampling: Finally, the target number of states for each difficulty level was sampled from the generated candidates, creating a balanced dataset according to predefined ratios (weighted towards harder states with ratios of 1:2:3:4:5 for increasing difficulty).
- Prompt Formatting: Each selected board state was formatted into a structured prompt suitable for the language model. This involved providing a system message explaining the required thinking process (<think>…</think>) and answer format (<answer>…</answer>), followed by a user prompt describing the game, presenting the current board state as a formatted text grid, and asking for the best move.
- Conversion of the grid to images: Each generated grid was rendered to an image using one of four randomly chosen formats, so that the model sees multiple variations of the numbers, their formatting, the background, and the overall color of the grid. The images are saved locally and referenced from the dataset.
This meticulous process produced my train and test sets, carefully curated to represent a wide spectrum of 2048 game situations.
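For illustration, a difficulty heuristic in the spirit described above, combining empty cells, the highest tile, smoothness, and monotonicity, might look like the following sketch. The function name `calculate_difficulty_score` appears in the project's reward code, but the weights and implementation here are my assumptions, not the project's actual code:

```python
import numpy as np

def calculate_difficulty_score(board: np.ndarray) -> float:
    """Heuristic difficulty score: crowded, rough, non-monotone boards
    with high tiles score as harder. Weights are illustrative only."""
    empty = np.count_nonzero(board == 0)
    max_tile = board.max()

    # Smoothness penalty: total log-value gap between adjacent tiles
    logs = np.where(board > 0, np.log2(np.maximum(board, 1)), 0.0)
    smoothness = (np.abs(np.diff(logs, axis=0)).sum()
                  + np.abs(np.diff(logs, axis=1)).sum())

    # Monotonicity violations: a perfectly monotone line only increases
    # or only decreases; count the cheaper direction to flip per line
    def violations(lines):
        total = 0
        for line in lines:
            d = np.diff(line)
            total += min(np.count_nonzero(d > 0), np.count_nonzero(d < 0))
        return total

    mono_violations = violations(logs) + violations(logs.T)

    return ((16 - empty) * 1.0
            + np.log2(max(max_tile, 2)) * 0.5
            + smoothness * 0.3
            + mono_violations * 0.5)
```

A near-empty opening board scores low, while a crowded checkerboard of 2s and 4s (rough, non-monotone, no free cells) scores much higher, which matches the intuition of "harder" states.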
Using a dataset of only 800 examples, I first trained multiple models to figure out the best hyperparameters (otherwise RL training can be quite a waste of life), then scaled the data to 2000 training examples, and finally, for the last model, to 8000 training examples.
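As a sketch of the sampling arithmetic (my own illustration, not the project's code): with difficulty ratios 1:2:3:4:5 and an oversampling factor of 3.0, the per-level targets and candidate-generation budgets for the 800-example dataset work out as follows:

```python
def level_targets(total: int, ratios=(1, 2, 3, 4, 5), oversample: float = 3.0):
    """Split `total` across difficulty levels by the given ratios
    (largest-remainder rounding), and compute how many candidate
    states to generate per level before balanced sampling."""
    s = sum(ratios)
    exact = [total * r / s for r in ratios]
    targets = [int(x) for x in exact]
    # Hand the rounding remainder to the largest fractional parts
    order = sorted(range(len(ratios)),
                   key=lambda i: exact[i] - targets[i], reverse=True)
    for i in order[: total - sum(targets)]:
        targets[i] += 1
    budgets = [int(t * oversample) for t in targets]
    return targets, budgets

targets, budgets = level_targets(800)
# targets sum to 800, weighted toward harder levels;
# budgets are 3x each target so rarer hard states are still covered
```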
My Experimental Setup
I chose Group Relative Policy Optimization (GRPO), implemented via the trl library, to train the VLM.
- Model: Qwen/Qwen2.5-VL-7B-Instruct – part of a family of powerful Vision-Language Models capable of processing both images and text.
- Optimization: Employed 4-bit quantization and parameter-efficient LoRA.
- Fine-tuning: Low-Rank Adaptation (LoRA) applied with lora_r=16, lora_alpha=32, targeting standard transformer modules (q_proj, k_proj, etc.). For VLM-specific parameters, runs were done with freeze_vision_modules set to both True and False (the latter meaning the vision tower was also trained).
- Training Configuration (from script):
- Max Sequence Length: 1000 tokens.
- Batch Size: per_device_train_batch_size = 1.
- Gradient Accumulation: gradient_accumulation_steps = 8 (effective batch size of 8).
- Learning Rate: learning_rate = 4e-5.
- Optimizer: optim = “paged_adamw_8bit”.
- Max Steps: Variable (100, 200, etc., depending on the run).
- Number of Generations per Prompt: num_generations = 3.
- Precision: bf16 = True.
- VLM Params: max_pixels=12845056, min_pixels=1000.
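Put together, the configuration above might map onto trl and peft roughly like this. This is a sketch under the assumption of recent trl/peft APIs; dataset preparation, quantization setup, and the VLM pixel parameters (which go to the processor rather than the trainer config) are elided:

```python
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

peft_config = LoraConfig(
    r=16, lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

training_args = GRPOConfig(
    output_dir="qwen2_5_vl_2048_grpo",
    learning_rate=4e-5,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # effective batch size of 8
    num_generations=3,               # completions sampled per prompt
    max_steps=100,                   # varied per run (100, 200, ...)
    optim="paged_adamw_8bit",
    bf16=True,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    reward_funcs=[tag_presence_reward_fn, tag_order_reward_fn,
                  valid_action_reward_fn, post_action_density_ratio_reward_fn,
                  post_action_highest_reward_fn],
    args=training_args,
    train_dataset=train_dataset,  # prepared as described above
    peft_config=peft_config,
)
trainer.train()
```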
Hyperparameter Tuning
- Main Culprit for Instability, the Learning Rate: This RL task turned out to be trickier than usual. I wanted to release this days ago, but I was using a high LR that deepfried my models. Once I lowered the LR to 4e-5, I finally started seeing success.
- Generation Length: I kept decreasing it because the models never used the full budget. Another great thing I noticed: as the models became more efficient, generation lengths dropped from ~500 to ~250 tokens, which is amazing, suggesting the models refine their thoughts over training.
- Data Scaling: This should always be done last. If you first nail down really good hyperparameters on a small amount of high-quality data, scaling is icing on the cake, and so it was here. My loop: generate data on my PC via simulation (which takes time), then use it to see where the rewards go within a reasonable budget. Scaling does help if the data is high quality: around 2000 examples already gives a good model, and going beyond that helps further. The models I put on Hugging Face here, 2000_steps_trained_model and 8000_steps_trained_model, were trained in around 10 hours, and I am quite happy with the results. All done on a single RTX 4090.
The Reward System: Guiding the Learning Process
The core of RL lies in the reward functions – how we provide feedback to the model. I designed a suite of functions, each targeting a specific aspect of good 2048 play:
Format Compliance Rewards:
These reward functions teach the model how to respond and what format to follow when responding.
- tag_presence_reward_fn: Gives partial credit (0.25 points each, max 1.0) for simply including each of the four required tags (<think>, </think>, <answer>, </answer>). This encourages the model to remember all parts of the structure.
```python
import re
from typing import List

def tag_presence_reward_fn(completions, **kwargs) -> List[float]:
    """
    Gives 0.25 for each required tag present (max 1.0).
    Required tags: <think>, </think>, <answer>, </answer>
    """
    rewards = []
    # Extract the raw response text from the completion structure
    responses = [comp[0]["content"] for comp in completions]
    for response in responses:
        score = 0.0
        # Check for the presence of each required tag
        if "<think>" in response: score += 0.25
        if "</think>" in response: score += 0.25
        if "<answer>" in response: score += 0.25
        if "</answer>" in response: score += 0.25
        rewards.append(score)
    return rewards
```
- tag_order_reward_fn: Gives a full reward (1.0) only if the tags appear in the correct sequence: <think> block followed by <answer> block. This ensures logical flow.
```python
def tag_order_reward_fn(completions, **kwargs) -> List[float]:
    """
    Gives 1.0 if tags appear in correct order:
    <think>...</think>...<answer>...</answer>
    """
    # Regex pattern to match the correct sequence of tags with content in between
    # .*? makes the matching non-greedy
    # re.DOTALL allows '.' to match newline characters
    pattern = r"<think>.*?</think>.*?<answer>.*?</answer>"
    responses = [comp[0]["content"] for comp in completions]
    # Return 1.0 if the pattern is found in the response, 0.0 otherwise
    return [1.0 if re.search(pattern, resp, re.DOTALL) else 0.0 for resp in responses]
```
- valid_action_reward_fn: The model’s final output within <answer> must be one of the four allowed moves. This involves two steps:
```python
def parse_action(completion: str) -> str:
    """
    Extracts content between <answer> tags, normalizes it,
    and validates against allowed actions.
    Returns: Valid action (lowercase) or empty string if not found/invalid.
    """
    # Search for the <answer> block using regex, allowing for surrounding whitespace
    # re.DOTALL makes '.' match newlines, robustly capturing multi-line answers
    match = re.search(r"<answer>\s*(.*?)\s*</answer>", completion, re.DOTALL)
    if not match:
        # Return empty string if <answer> tags are not found
        return ""
    # Extract the content, remove leading/trailing whitespace, convert to lowercase
    action = match.group(1).strip().lower()
    # Check if the extracted action is one of the valid moves
    return action if action in {"up", "down", "left", "right"} else ""

def valid_action_reward_fn(completions, **kwargs) -> List[float]:
    """
    Gives 1.0 if the parsed action from the completion is valid
    (up/down/left/right), 0.0 otherwise.
    """
    responses = [comp[0]["content"] for comp in completions]
    # Apply the parse_action function to each response.
    # Reward is 1.0 if parse_action returns a non-empty string (a valid move),
    # and 0.0 if it returns an empty string (invalid or missing move).
    return [1.0 if parse_action(response) else 0.0 for response in responses]
```
Game State Improvement (The Core Strategy Rewards):
This is the main juice of the whole process. I designed very simple rewards, but they turned out to be quite powerful. If you think about it, the whole idea of 2048 is to reduce the spread of the tiles, keep merging them together, and never completely fill the board. In other words, you want to increase the density of the board: (sum of tile values / number of occupied tiles).
Increasing density as a reward has many direct and indirect benefits: it increases the sum of the tiles, decreases the number of occupied tiles (which leaves room for taking risks in future moves), consolidates everything together, and even increases the highest tile value. So I created these rewards:
- Density Reward: A key heuristic in 2048 is maximizing the value packed into the fewest tiles. This function measures board “density” (sum of tile values / number of occupied cells) before and after the model’s move. If a move increases density (e.g., by merging tiles efficiently), the model receives a reward proportional to the ratio of the new density to the old density. This directly encourages consolidation and efficient use of space.
```python
def post_action_density_ratio_reward_fn(completions, initial_state, rng_seed=42, **kwargs):
    """
    Reward based purely on density ratio after action.
    Rewards:
    - 0.0 if move is invalid or density doesn't change
    - density_ratio (new_density/prev_density) if density increases
    """
    rewards = []
    # Extract the text response from the model's completion structure
    responses = [comp[0]["content"] for comp in completions]
    for response, init_state in zip(responses, initial_state):
        reward = 0.0
        error = None
        # Convert initial state to numpy array if necessary
        init_state_np = np.array(init_state) if isinstance(init_state, list) else init_state
        # Calculate initial difficulty for logging
        difficulty = calculate_difficulty_score(init_state_np)
        try:
            # Simulate the move using the parsed action and get the resulting state
            new_board, changed, score_delta = get_next_board_state(
                init_state_np, response, rng_seed
            )
            game_over = is_game_over(new_board)
            max_tile = np.max(new_board) if changed else np.max(init_state_np)
            # Calculate density-based reward only if the move was valid
            if changed:
                prev_density = calculate_density(init_state_np)
                new_density = calculate_density(new_board)
                # Reward is the ratio if density improved
                if new_density > prev_density:
                    # Small epsilon prevents division by zero if prev_density is near zero
                    reward = new_density / (prev_density + 1e-6)
                # else: reward stays 0.0 (density didn't improve or decreased)
            # else: reward stays 0.0 (invalid move)
            # Log the details of this step (move, outcome, reward, etc.)
            log_training_response(
                board_state=init_state_np.tolist(),      # Ensure list format for JSON
                next_board_state=new_board.tolist(),     # Ensure list format for JSON
                move_direction=parse_action(response),
                is_valid=changed,
                score_delta=int(score_delta),
                max_tile=int(max_tile),
                game_over=game_over,
                reward=float(reward),
                difficulty=int(difficulty),
                model_version=kwargs.get('model_name', 'unknown'),
                hyperparameters=kwargs.get('hyperparameters', {})
            )
        except Exception as e:
            # Handle potential errors during simulation or calculation
            error = str(e)
            reward = 0.0
            # Log the error case
            log_training_response(
                board_state=init_state_np.tolist(),
                next_board_state=init_state_np.tolist(),  # Log initial state on error
                move_direction="error",
                is_valid=False,
                score_delta=0,
                max_tile=int(np.max(init_state_np)),
                game_over=False,  # Assume not game over if an error occurred mid-step
                reward=0.0,
                error=error,
                difficulty=int(difficulty),  # Difficulty was calculated before the error
                model_version=kwargs.get('model_name', 'unknown'),
                hyperparameters=kwargs.get('hyperparameters', {})
            )
        rewards.append(reward)
    return rewards
```
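The reward above leans on `calculate_density`, whose implementation isn't shown; it is just the sum-over-occupied-cells ratio defined earlier. A minimal sketch (my reconstruction, with the empty-board convention as an assumption):

```python
import numpy as np

def calculate_density(board) -> float:
    """Density = sum of tile values / number of occupied cells.
    An empty board is given density 0 by convention."""
    board = np.asarray(board)
    occupied = np.count_nonzero(board)
    if occupied == 0:
        return 0.0
    return float(board.sum()) / occupied
```

For example, a board holding two 2s has density 4/2 = 2.0; merging them into a single 4 (ignoring the random spawn) gives 4/1 = 4.0, so the density ratio for that move is 2.0.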
- Highest Tile Reward: The ultimate goal in 2048 is creating the highest possible tile. This function directly incentivizes progress towards that goal: if a merge creates a new highest tile, the model receives a fixed bonus of 5.0, simulating a surprise reward for merging into bigger tiles. With a bonus of 1.0, the model did not converge well even with 10x the examples and focused only on density; at 5.0, it is actually incentivized to create higher tiles, which is the endgame of the goal anyway. Even though this is partially covered by the density ratio, I still wanted an explicit reward here, because many moves in many states do not lead to the highest rewards, and this acts like unexpected candy for a good move.
```python
def post_action_highest_reward_fn(completions, initial_state, rng_seed=42, **kwargs):
    """Reward for increasing the maximum tile value"""
    rewards = []
    responses = [comp[0]["content"] for comp in completions]
    for response, init_state in zip(responses, initial_state):
        # Use the SAME transition function with the SAME seed as the density reward
        new_board, changed, _ = get_next_board_state(
            init_state,
            response,
            rng_seed
        )
        if not changed:
            rewards.append(0.0)  # Invalid move gets no reward
            continue
        prev_max = np.max(init_state)
        new_max = np.max(new_board)
        if new_max > prev_max:
            rewards.append(5.0)  # Fixed bonus for creating a new highest tile
        else:
            rewards.append(0.0)
    return rewards
```
Progress Logging: Integrated within the game state reward functions, every move attempt during training (along with the resulting board state, score change, validity, and calculated reward) was logged to an SQLite database. This allows for detailed post-hoc analysis of the learning process.
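The logging layer itself isn't shown; as a sketch of what `log_training_response` might do with the standard-library sqlite3 module (the table name, column layout, and database path here are my assumptions):

```python
import json
import sqlite3

DB_PATH = "training_log.db"

def log_training_response(board_state, next_board_state, move_direction,
                          is_valid, score_delta, max_tile, game_over,
                          reward, difficulty, model_version,
                          hyperparameters, error=None):
    """Append one move attempt to an SQLite table for post-hoc analysis."""
    con = sqlite3.connect(DB_PATH)
    con.execute("""CREATE TABLE IF NOT EXISTS moves (
        board TEXT, next_board TEXT, move TEXT, valid INTEGER,
        score_delta INTEGER, max_tile INTEGER, game_over INTEGER,
        reward REAL, difficulty INTEGER, model TEXT,
        hyperparameters TEXT, error TEXT)""")
    con.execute(
        "INSERT INTO moves VALUES (?,?,?,?,?,?,?,?,?,?,?,?)",
        (json.dumps(board_state), json.dumps(next_board_state),
         move_direction, int(is_valid), score_delta, max_tile,
         int(game_over), reward, difficulty, model_version,
         json.dumps(hyperparameters), error),
    )
    con.commit()
    con.close()
```

Because every row keeps the full before/after board as JSON, questions like "how does reward vary with difficulty?" become simple SQL queries after training.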
The Results: Model Learns to Play 2048
The training process showed that the Qwen 2.5 VL 7B model, guided by the GRPO algorithm and the multi-component reward system, could learn to play 2048 effectively even with a relatively small dataset and limited training steps. The 7B model learned the most from the same amount of data, which makes sense in scaling-law terms.
Some results and graphs:
1. 7B on different sizes of data


The graph on the left shows 100 steps on different dataset sizes.
The graph on the right shows the model trained on 5 times the initial data, to see how well it performs.
2. Survival Reward

3. Generation Length Over Time

4. Net Reward Over Time

Findings and Observations
- Bigger Model Helps: A 7B model converges better than the 3B model. What you pay in terms of memory, you gain in terms of efficiency!
- Lower Learning Rate Helps: In my hyperparameter testing on high-reasoning tasks, a lower LR works better. I also observed that at an LR of 3e-4, the model always answered DOWN, a sign it had simply been deepfried.
- Data Scaling Helps: As seen above, scaling the data five times gave better and more stable results.
- Intelligent Reward Functions: I am quite proud of the density reward function, as I wanted very simple but high-quality rewards, drawing on the mathematical skills I gained in school and from Professor Andrew Ng. I remembered his lectures on deep learning losses that combine multiple objectives in a single function for simplicity and direct learning benefits.
- Rewards Do Not Tell the Whole Story: Even though the reward curves seem to reach an asymptotic ceiling, scaling still leads to better games. Stopped at 1000 steps, the model's reasoning is not as good as with 5000 data points and 4 times the number of steps. At 1000 steps, the model constantly does things you will disagree with; with 5000 data points, its play matches human intuition much more closely.
What This Teaches Us About AI Learning
This 2048 experiment offers several takeaways:
- Data Generation is Key for Complex Tasks: The quality and diversity of the initial board states, generated through simulation and difficulty classification, were likely critical for effective learning. RL needs a good “curriculum” to learn from.
- Multi-Component Rewards Work: Breaking down the desired behavior into specific, rewardable components (format, validity, density, max tile, survival) provides clearer guidance than a single, monolithic reward signal.
- Heuristics as Reward Signals: Game heuristics, like board density, can be effectively translated into reward functions to teach strategic concepts to language models.
- RL Can Teach Strategy: GRPO successfully encouraged the model to move beyond just making valid moves towards making strategically sound ones within the context of 2048.
- Compute/Model Size Matters: While the 7B model learned successfully here, the Sudoku experiment showed that smaller models might struggle with complex reasoning tasks, highlighting the importance of model capacity.
Headaches and Future Directions
- Tuning hyperparameters is a nightmare sometimes 😉. But the more you train, the better intuition you develop.
- Too many reward functions mean too many things to worry about; consolidating them always helps. Initially I used 1 + log(density ratio), which led nowhere: when you analyze the game plays, the density usually does not change much per move, and in my opinion the log squishes the good strategies and fails to reward intermediate behavior. Would an exponential, e^(density ratio), help by amplifying density gains? I hope someone tries this.
- Trying even bigger models should help.
- Multi-step Interaction Training: Right now, all the MDP-based intuitions behind the reward functions look only one step ahead, and you can see that the model is quite myopic: it sometimes does not behave in a planned manner because it chases the maximum reward for the very next step. We humans plan ahead and make sacrifices for the long-term win, like declining to merge some tiles now and taking some risk to set things up, so that a step or a few steps later we can make a move for a much higher reward.
- Trying JEPA: Yann LeCun keeps saying that LLMs and VLMs are an off-ramp, and the more models I train and the bigger the datasets I throw at them, the more I can see it. Yann invented JEPA, which learns well on robotics, image embeddings, and the like. I want to train a policy using DINOv2 and JEPA to pick the next best move. Most probably I will fail, as there are no code bases, but let's see.
- Implement Continual Learning Algos from Richard Sutton: The Godfather of RL recently published a paper in Nature on continual backpropagation. LoRA finetuning does have some regularization effect, but the models still degrade at other things. I want to explore whether we can keep learning new skills while maintaining the progress already made. LoRA helps in this manner, but to teach an LLM genuinely new things you usually have to do full finetuning on complete datasets on the base model.
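To make the log-squishing point above concrete (my own illustration, not an experiment from the project): for the modest per-move density ratios you actually see in play, 1 + log(r) barely separates a mediocre merge from a great one, while e^r stretches the gap considerably:

```python
import math

# Typical post-move density ratios range from barely-above-1 to about 2
for r in (1.05, 1.2, 1.5, 2.0):
    shaped_log = 1 + math.log(r)   # the original shaping: compresses differences
    shaped_exp = math.exp(r)       # proposed alternative: amplifies differences
    print(f"ratio {r:.2f}: 1+log -> {shaped_log:.3f}, exp -> {shaped_exp:.3f}")
```

Over this range, the log-shaped reward spans well under one point while the exponential spans several, so a great consolidating move would stand out far more clearly in the gradient signal.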
Why This Matters: Beyond the Game
Teaching AI to play games like 2048 isn’t just an academic exercise. The underlying challenges – strategic planning, rule adherence, resource optimization (tile space), and sequential decision-making – are relevant to many real-world problems:
- Logistics and Operations: Optimizing routes or resource allocation.
- Process Optimization: Finding efficient sequences of steps in manufacturing or workflows.
- Financial Modeling: Making sequential decisions based on evolving market states.
- Scientific Discovery: Planning sequences of experiments.
By developing methods to teach AI these skills in controlled environments like games, we build the foundation for more capable and reliable AI systems in complex domains.
Conclusion: A Promising Trajectory
This experiment demonstrates that reinforcement learning, specifically GRPO coupled with carefully crafted data generation and multi-component reward functions, can effectively teach language models strategic gameplay in a complex puzzle like 2048. The Qwen 2.5 VL 7B Instruct model showed a clear ability to learn and apply game heuristics. The journey continues, but these initial results are highly encouraging, showcasing the potential for RL to imbue language models with deeper reasoning and planning capabilities.
Note: This is an ongoing project. Feedback and suggestions are welcome!
Citation
```bibtex
@misc{dalal2025agent2048vlm,
  author = {Dalal, Hrishbh},
  title  = {{Agent 2048 Visually Masters Strategic Gameplay Through Data, Rewards, and RL}},
  year   = {2025},
  month  = {April},
  url    = {https://hrishbh.com/agent-2048-visually-masters-strategic-gameplay-through-data-rewards-and-rl/},
  note   = {[Blog post] Accessed: April 18, 2025}
}
```