Agent 2048: Forging Strategic Gameplay in an AI Through Data, Rewards, and RL

Language models excel at generating human-like text, but can they master games that require strategic planning, spatial reasoning, and numerical foresight? That’s the question I explored in my recent research project: teaching a language model to play the classic game 2048 using reinforcement learning.

The Challenge: More Than Just Sliding Tiles

2048 seems simple on the surface: slide numbered tiles on a 4×4 grid, merging identical tiles to create higher values, with the game getting harder the further you progress. However, teaching an AI to play well presents unique challenges for language models:

  1. Rule Adherence: The model must understand and strictly follow the game’s movement and merging rules.
  2. Strategic Depth: Good play requires looking ahead, setting up merges, avoiding grid lock, and maximizing score – concepts not inherent to text prediction.
  3. Spatial Reasoning: The model needs to “see” the grid, understand tile positions, and anticipate the results of moving in different directions.
  4. Numerical Understanding: It must grasp the concept of doubling values and prioritize creating higher-numbered tiles.
  5. Structured Output: The model needs to output a specific, valid action (‘up’, ‘down’, ‘left’, ‘right’) based on its analysis, often within a structured reasoning format.

Language models aren’t naturally equipped for this kind of structured, stateful, strategic thinking. My goal was to see if reinforcement learning could bridge this gap.

LLM Playing 2048

Models are available to test out here:

  1. 200 steps and 2000 data points model: https://huggingface.co/HriDal/2048-game-qwen-7b-2k-ds
  2. 200 steps and 2000 data points model: uploading right now

The Foundation: Generating High-Quality Game States

2048 is a Markov process, meaning the current state of the game and the action you take fully determine the next state and reward. You do not need to remember any history for this game, and that is beautiful! Other games, like Backgammon, share this property: you just need to make the most optimal move based on the board you see.

A crucial part of this project was creating a diverse and representative dataset of 2048 board states. Simply using random boards wouldn’t capture the nuances of different game stages and difficulties. I enjoyed this process even more than the training and the results: naively sampling at random just gets you impractical data, a lesson I learned from plenty of feedback from my amazing community of Twitter/X research and anon accounts. This time, I developed a sophisticated data generation pipeline:

  1. Simulated Gameplay: Instead of static puzzles, I generated board states by simulating thousands of games using random (but valid) moves. This creates more realistic mid-game scenarios.
  2. Difficulty Scoring: I implemented a custom heuristic function to evaluate the complexity of a given board state. This considered factors like the number of empty cells, the value of the highest tile, the similarity of adjacent tile values (smoothness), and the directional flow of values (monotonicity).
  3. Difficulty Classification: Based on the difficulty score and predefined thresholds (calibrated using a large sample of generated states), each board was classified into one of five difficulty levels, ensuring the dataset spanned a range from beginner to expert positions.
  4. Targeted Generation Pipeline:
    • Calibration: First, a large number of states were generated to determine the score thresholds for each difficulty level automatically.
    • Oversampling: To ensure enough examples for rarer, harder difficulty levels, the pipeline generated significantly more states than initially needed, using an oversampling factor of 3.0.
    • Parallel Processing: Simulation is computationally intensive, so the generation process was designed to run in parallel across multiple CPU cores, significantly speeding up dataset creation.
    • Balanced Sampling: Finally, the target number of states for each difficulty level was sampled from the generated candidates, creating a balanced dataset according to predefined ratios (weighted towards harder states with ratios of 1:2:3:4:5 for increasing difficulty).
  5. Prompt Formatting: Each selected board state was formatted into a structured prompt suitable for the language model. This involved providing a system message explaining the required thinking process (<think>…</think>) and answer format (<answer>…</answer>), followed by a user prompt describing the game, presenting the current board state as a formatted text grid, and asking for the best move.
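
The prompt-formatting step above can be sketched as follows; the exact system message wording and grid rendering here are illustrative guesses, not the project's actual templates:

```python
def format_board(board):
    """Render a 4x4 board as an aligned text grid ('.' for empty cells)."""
    return "\n".join(
        " ".join(f"{v if v else '.':>5}" for v in row) for row in board
    )

def build_prompt(board):
    """Wrap a board state in the system/user message structure described above."""
    system_msg = (
        "Think step by step inside <think>...</think>, then give your move "
        "inside <answer>...</answer> as one of: up, down, left, right."
    )
    user_msg = (
        "You are playing 2048. Here is the current board:\n\n"
        f"{format_board(board)}\n\nWhat is the best move?"
    )
    return [
        {"role": "system", "content": system_msg},
        {"role": "user", "content": user_msg},
    ]
```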

This meticulous process resulted in my desired train and test sets, carefully curated to represent a wide spectrum of 2048 game situations.
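
For illustration, the difficulty heuristic from step 2 might look something like this; the factor weights and the exact scoring are my own guesses at a plausible formulation, not the project's actual function:

```python
import numpy as np

def difficulty_score(board):
    """Rough board-difficulty heuristic: fewer empty cells, a higher max tile,
    rough (non-smooth) neighbours, and broken monotonicity all raise the score.
    Weights are illustrative, not the project's calibrated values."""
    b = np.asarray(board, dtype=float)
    empty = np.count_nonzero(b == 0)            # more free space -> easier
    max_tile = np.log2(b.max()) if b.max() > 0 else 0.0
    # Smoothness: absolute differences between vertical/horizontal neighbours
    rough = np.abs(np.diff(b, axis=0)).sum() + np.abs(np.diff(b, axis=1)).sum()
    # Monotonicity violations: sign changes along each row, then each column
    def violations(a):
        d = np.diff(a, axis=1)
        return np.count_nonzero((d[:, :-1] * d[:, 1:]) < 0)
    mono = violations(b) + violations(b.T)
    return (16 - empty) * 1.0 + max_tile * 0.5 + rough / 64.0 + mono * 2.0
```

With thresholds calibrated on a large sample (step 4), such a scalar score is enough to bucket boards into the five difficulty levels.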

Starting with a dataset of only 800 examples, I trained multiple models to first figure out the best hyperparameters (otherwise, training with RL can be quite a waste of life), then scaled the data to 2,000 training examples, and finally, for the final model, scaled it to 8,000 training examples.

My Experimental Setup: GRPO on Qwen 7B

I chose Group Relative Policy Optimization (GRPO), a reinforcement learning algorithm well-suited to language models, implemented using the trl library and significantly accelerated by Unsloth. GRPO was introduced by DeepSeek and first used in their compact maths model, which achieved results on par with GPT-4 last year.

  • Model: Qwen2.5 3B-Instruct and 7B-Instruct – powerful instruction-following models
  • Optimization: Unsloth was used for 2x faster fine-tuning and reduced memory usage, enabling training on available hardware. This included 4-bit quantization and efficient LoRA implementation.
  • Fine-tuning: Low-Rank Adaptation (LoRA) was applied with a rank of 16 and 32 to adapt the model efficiently.
  • Training Configuration:
    • Max Sequence Length: 1500 tokens, later decreased to 1000 since the model never used the full length anyway and it was eating my GPU (balancing reasoning depth with GPU memory limits on a single RTX 4090 24GB).
    • Batch Size: 1.
    • Gradient Accumulation: 4 steps (simulating a batch size of 4 for stability).
    • Learning Rate: 3e-4, 4e-5
    • Max Steps: 100, 200, 800
    • Optimizer: paged_adamw_8bit.
    • Number of Generations per Prompt: 4 (exploring different potential responses).

The relatively small dataset size and limited training steps were intentional, designed to test how quickly and effectively the model could learn the core game strategy with GRPO.
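
As a rough sketch, this configuration maps onto trl's GRPOConfig/GRPOTrainer roughly as follows; argument names follow recent trl releases and may differ in your version, and `model` and `train_dataset` are placeholders for the Unsloth-prepared model and the generated dataset:

```python
from trl import GRPOConfig, GRPOTrainer

config = GRPOConfig(
    learning_rate=4e-5,               # the stable LR found during tuning
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,    # effective batch size of 4
    max_steps=200,
    num_generations=4,                # responses sampled per prompt
    max_completion_length=1000,
    optim="paged_adamw_8bit",
)

trainer = GRPOTrainer(
    model=model,                      # LoRA-wrapped Qwen, loaded via Unsloth
    args=config,
    reward_funcs=[
        tag_presence_reward_fn,
        tag_order_reward_fn,
        valid_action_reward_fn,
        post_action_density_ratio_reward_fn,
    ],
    train_dataset=train_dataset,
)
trainer.train()
```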

Hyperparameter Tuning

  1. Main Culprit for Instability, the Learning Rate: This task turned out to be trickier with RL than usual. I wanted to release this days ago, but I was using a high LR which deep-fried my models. When I lowered the LR to 4e-5, I finally started seeing success.
  2. Generation Length: I kept decreasing it because the models never actually utilized it. Another great thing I noticed was that as the models became efficient, generation lengths dropped from ~500 to ~90 tokens, which is amazing: it means the models refine their thoughts.
  3. Data Scaling: This should always be done last. I have found that if you figure out really good hyperparameters with high-quality data at the beginning, scaling afterwards is usually the icing on the cake, and that held here. I would create data on my PC by simulation (which takes time), then use it to see where the rewards improved best within a reasonable budget. Scaling does help if you have really high-quality data; in this case, around 2,000 examples already gives a good model, but if you can scale further, it always helps. The last models I trained, which I put on Hugging Face (2000_steps_trained_model, 8000_steps_trained_model), took around 10 hours to train, and I am quite happy with the results. All done on a single RTX 4090.

The Reward System: Guiding the Learning Process

The core of RL lies in the reward functions – how we provide feedback to the model. I designed a suite of functions, each targeting a specific aspect of good 2048 play:

Format Compliance Rewards:

These reward functions teach the model how to respond and what format to follow when responding.

  • tag_presence_reward_fn: Gives partial credit (0.25 points each, max 1.0) for simply including each of the four required tags (<think>, </think>, <answer>, </answer>). This encourages the model to remember all parts of the structure.
        # Imports shared by the reward functions in this post
        import re
        import numpy as np
        from typing import List

        def tag_presence_reward_fn(completions: List[str], **kwargs) -> List[float]:
            """
            Gives 0.25 for each required tag present (max 1.0).
            Required tags: <think>, </think>, <answer>, </answer>
            """
            rewards = []
            # Extract the raw response text from the completion structure
            responses = [comp[0]["content"] for comp in completions]
            for response in responses:
                score = 0.0
                # Check for the presence of each required tag
                if "<think>" in response: score += 0.25
                if "</think>" in response: score += 0.25
                if "<answer>" in response: score += 0.25
                if "</answer>" in response: score += 0.25
                rewards.append(score)
            return rewards
  • tag_order_reward_fn: Gives a full reward (1.0) only if the tags appear in the correct sequence: <think> block followed by <answer> block. This ensures logical flow.
    def tag_order_reward_fn(completions: List[str], **kwargs) -> List[float]:
        """
        Gives 1.0 if tags appear in correct order:
        <think>...</think>...<answer>...</answer>
        """
        # Regex pattern to match the correct sequence of tags with content in between
        # .*? makes the matching non-greedy
        # re.DOTALL allows '.' to match newline characters
        pattern = r"<think>.*?</think>.*?<answer>.*?</answer>"
        responses = [comp[0]["content"] for comp in completions]
        # Return 1.0 if the pattern is found in the response, 0.0 otherwise
        return [1.0 if re.search(pattern, comp, re.DOTALL) else 0.0 for comp in responses]
  • valid_action_reward_fn: The model’s final output within <answer> must be one of the four allowed moves. This involves two steps:
        def parse_action(completion: str) -> str:
            """
            Extracts content between <answer> tags, normalizes it,
            and validates against allowed actions.

            Returns: Valid action (lowercase) or empty string if not found/invalid.
            """
            # Search for the <answer> block using regex, allowing for surrounding whitespace
            # re.DOTALL makes '.' match newlines, robustly capturing multi-line answers
            match = re.search(r"<answer>\s*(.*?)\s*</answer>", completion, re.DOTALL)
            if not match:
                # Return empty string if <answer> tags are not found
                return ""

            # Extract the content, remove leading/trailing whitespace, convert to lowercase
            action = match.group(1).strip().lower()

            # Check if the extracted action is one of the valid moves
            return action if action in {"up", "down", "left", "right"} else ""

        def valid_action_reward_fn(completions: List[str], **kwargs) -> List[float]:
            """
            Gives 1.0 if the parsed action from the completion is valid
            (up/down/left/right), 0.0 otherwise.
            """
            responses = [comp[0]["content"] for comp in completions]
            # Apply the parse_action function to each response.
            # Reward is 1.0 if parse_action returns a non-empty string (a valid move),
            # and 0.0 if it returns an empty string (invalid or missing move).
            return [1.0 if parse_action(response) else 0.0 for response in responses]

Game State Improvement (The Core Strategy Rewards):

This is the main juice of the whole process. I designed very simple rewards, but they turned out to be quite powerful. If you think about it, the whole idea of the game 2048 is to reduce the spread of the tiles, keep adding them together, and never completely fill your board. In other words, you have to increase the density of the tiles – (sum of tile values / number of occupied tiles).

Increasing density as a reward has many direct and indirect benefits: it increases the sum of the tiles, decreases the number of occupied tiles (which leaves room for taking risks in future moves), consolidates everything together, and even increases the highest tile value. So I created these rewards:
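
The density metric itself is one line of math. Since the reward functions below call a `calculate_density` helper, here is a minimal reconstruction of it from the formula above:

```python
import numpy as np

def calculate_density(board):
    """Board density = sum of tile values / number of occupied cells.
    Returns 0.0 for an empty board."""
    b = np.asarray(board)
    occupied = np.count_nonzero(b)
    return float(b.sum()) / occupied if occupied else 0.0
```

Merging two 2s into a 4, for example, keeps the sum at 4 while dropping the occupied count from 2 to 1, doubling the density.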

  • Density Reward: A key heuristic in 2048 is maximizing the value packed into the fewest tiles. This function measures board “density” (sum of tile values / number of occupied cells) before and after the model’s move. If a move increases density (e.g., by merging tiles efficiently), the model receives a reward proportional to the ratio of the new density to the old density. This directly encourages consolidation and efficient use of space.
    def post_action_density_ratio_reward_fn(completions, initial_state, rng_seed=42, **kwargs):
        """
        Reward based purely on density ratio after action.
        Rewards:
        - 0.0 if move is invalid or density doesn't change
        - density_ratio (new_density/prev_density) if density increases
        """
        rewards = []
        # Extract the text response from the model's completion structure
        responses = [comp[0]["content"] for comp in completions]

        for response, init_state in zip(responses, initial_state):
            reward = 0.0
            error = None
            # Convert initial state to numpy array if necessary
            init_state_np = np.array(init_state) if isinstance(init_state, list) else init_state
            # Calculate initial difficulty for logging
            difficulty = calculate_difficulty_score(init_state_np)

            try:
                # Simulate the move using the parsed action and get the resulting state
                new_board, changed, score_delta = get_next_board_state(
                    init_state_np, response, rng_seed
                )
                game_over = is_game_over(new_board)
                max_tile = np.max(new_board) if changed else np.max(init_state_np)

                # Calculate density-based reward only if the move was valid
                if changed:
                    prev_density = calculate_density(init_state_np)
                    new_density = calculate_density(new_board)

                    # Reward is the ratio if density improved
                    if new_density > prev_density:
                        # Add a small epsilon to prevent division by zero if prev_density is near zero
                        reward = new_density / (prev_density + 1e-6)
                    # else: reward stays 0.0 (density didn't improve or decreased)
                # else: reward stays 0.0 (invalid move)

                # Log the details of this step (move, outcome, reward, etc.)
                # Note: The actual logging happens within this called function
                log_training_response(
                    board_state=init_state_np.tolist(), # Ensure list format for JSON
                    next_board_state=new_board.tolist(), # Ensure list format for JSON
                    move_direction=parse_action(response),
                    is_valid=changed,
                    score_delta=int(score_delta),
                    max_tile=int(max_tile),
                    game_over=game_over,
                    reward=float(reward),
                    difficulty=int(difficulty),
                    model_version=kwargs.get('model_name', 'unknown'), # Get model name if passed via kwargs
                    hyperparameters=kwargs.get('hyperparameters', {}) # Get hyperparameters if passed
                )

            except Exception as e:
                # Handle potential errors during simulation or calculation
                error = str(e)
                reward = 0.0
                # Log the error case
                log_training_response(
                    board_state=init_state_np.tolist(),
                    next_board_state=init_state_np.tolist(), # Log initial state if error occurred
                    move_direction="error",
                    is_valid=False,
                    score_delta=0,
                    max_tile=int(np.max(init_state_np)),
                    game_over=False, # Assume not game over if error occurred mid-step
                    reward=0.0,
                    error=error,
                    difficulty=int(difficulty), # Log difficulty calculated before error
                    model_version=kwargs.get('model_name', 'unknown'),
                    hyperparameters=kwargs.get('hyperparameters', {})
                )

            rewards.append(reward)

        return rewards
  • Highest Tile Reward: The ultimate goal in 2048 is creating the highest possible tile. This function directly incentivizes progress towards that goal by rewarding the model based on the ratio of the maximum tile value after the move compared to before. If a merge creates a new highest tile, this yields a significant reward signal. Even though this is partially covered by the density ratio, I still wanted the model to be rewarded for it separately: on many states, most moves do not create a new highest tile, so this acts like unexpected candy for a good move.
    def post_action_highest_reward_fn(completions, initial_state, rng_seed=42, **kwargs):
        """Reward for increasing the maximum tile value"""
        rewards = []
        responses = [comp[0]["content"] for comp in completions]

        for response, init_state in zip(responses, initial_state):
            # Ensure initial state is a numpy array
            init_state_np = np.array(init_state) if isinstance(init_state, list) else init_state

            # Simulate the next state based on the action
            new_board, changed, _ = get_next_board_state(
                init_state_np,
                response,
                rng_seed
            )

            # No reward if the move was invalid (didn't change the board)
            if not changed:
                rewards.append(0.0)
                continue

            # Calculate max tiles before and after
            prev_max = np.max(init_state_np)
            new_max = np.max(new_board)

            # Reward is the ratio of new max to old max.
            # Adding epsilon prevents division by zero if prev_max is 0 (initial board)
            # and ensures a positive reward even if max doesn't increase (reward=1),
            # but a larger reward ( > 1) if it does.
            reward = new_max / (prev_max + 1e-6)
            rewards.append(reward)

            # Note: Logging could also be integrated here similarly to the density function

        return rewards
  • Survival Reward: A good move shouldn’t lead to an immediate game over. This function provides a simple but crucial check: it gives a reward of 1.0 if the game state after the move is still playable (i.e., not game over), and 0.0 if the move results in a loss or was invalid. This encourages prudence and discourages moves that lock up the board. Since my dataset contains more difficult boards than easy ones, it makes sense to include this reward so the model learns to plan well.
    def post_reward_survival_fn(completions, initial_state, rng_seed=42, **kwargs):
        """Reward 1.0 if game can continue after move, 0.0 otherwise"""
        rewards = []
        responses = [comp[0]["content"] for comp in completions]
        for response, init_state in zip(responses, initial_state):
            # Ensure initial state is a numpy array
            init_state_np = np.array(init_state) if isinstance(init_state, list) else init_state

            # Get the board state after the proposed move
            new_board, changed, _ = get_next_board_state(
                init_state_np,
                response,
                rng_seed
            )

            # If the move was invalid (board didn't change), reward is 0
            if not changed:
                rewards.append(0.0)
                continue

            # Check if the resulting board state is game over
            game_over = is_game_over(new_board)
            # Reward is 1.0 for surviving, 0.0 for game over
            rewards.append(1.0 if not game_over else 0.0)

            # Note: Logging could also be integrated here

        return rewards

Progress Logging: Integrated within the game state reward functions, every move attempt during training (along with the resulting board state, score change, validity, and calculated reward) was logged to an SQLite database. This allows for detailed post-hoc analysis of the learning process.

The Results: Model Learns to Play 2048

The training process showed that the Qwen 2.5 3B and 7B models, guided by the GRPO algorithm and the multi-component reward system, could learn to play 2048 effectively even with a relatively small dataset and limited training steps. The 7B model learns best on the same amount of data, which makes sense in scaling-law terms.

Some results and graphs:

1. 3B vs 7B on comparative data

The 7B model is deep-fried here because of the higher learning rate and becomes unstable after 100 steps.

2. Scaling the data and lowering the LR helps

Lowering the LR helped stabilize the increase in rewards, with no crashes.

3. Further Data Scaling helps with much better rewards

Findings and Observations

  1. A Bigger Model Helps: The 7B model converges better than the 3B model. What you pay in memory, you gain in efficiency!
  2. A Lower Learning Rate Helps: As seen above, for heavy reasoning tasks a lower LR works better. I also observed that at an LR of 3e-4, the model always output the DOWN move, which showed the model had simply been deep-fried.
  3. Data Scaling Helps: As can be seen above, scaling the data to twice and then four times the original size helped the model achieve density ratios greater than 1 (see the faded spikes, which are the actual values), meaning the model genuinely consolidates values much better than before.
  4. Intelligent Reward Functions: I am quite proud of the density reward function I came up with, as I wanted very simple but high-quality rewards, drawing on the mathematical skills I gained in school and from Professor Andrew Ng. I remember his lectures on deep learning losses that combine multiple objectives in the same function, giving simplicity and direct benefits for learning.

What This Teaches Us About AI Learning

This 2048 experiment offers several takeaways:

  1. Data Generation is Key for Complex Tasks: The quality and diversity of the initial board states, generated through simulation and difficulty classification, were likely critical for effective learning. RL needs a good “curriculum” to learn from.
  2. Multi-Component Rewards Work: Breaking down the desired behavior into specific, rewardable components (format, validity, density, max tile, survival) provides clearer guidance than a single, monolithic reward signal.
  3. Heuristics as Reward Signals: Game heuristics, like board density, can be effectively translated into reward functions to teach strategic concepts to language models.
  4. RL Can Teach Strategy: GRPO successfully encouraged the model to move beyond just making valid moves towards making strategically sound ones within the context of 2048.
  5. Compute/Model Size Matters: While the 7B model learned successfully here, the Sudoku experiment showed that smaller models might struggle with complex reasoning tasks, highlighting the importance of model capacity.

Headaches and Future Directions

  1. Tuning hyperparameters is sometimes a nightmare 😉. But the more you train, the better intuition you develop.
  2. Too many reward functions mean too many things to worry about; consolidating them always helps. In the beginning I was using 1 + log(density ratio), which did not lead anywhere: the density usually does not change that much when you analyze the game plays, so the log squishes good strategies and, in my opinion, fails to reward intermediate behavior. Would an exponential, i.e. e^(density ratio), help by making density gains loom larger? I hope someone tries this.
  3. Trying even bigger models should help.
  4. VLM Integration: Training a vision-language model using the visual board representations as an image to see if visual input improves performance. This is my next mission.
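
The reward-shaping question from point 2 is easy to eyeball numerically; this small comparison (my own sketch, not from the training runs) shows how the three candidate shapes treat a modest 10% density gain:

```python
import math

def shaped_rewards(density_ratio: float) -> dict:
    """Compare candidate reward shapes for a given new/old density ratio."""
    return {
        "raw_ratio": density_ratio,          # the shape the final runs used
        "log": 1 + math.log(density_ratio),  # squashes small improvements
        "exp": math.exp(density_ratio),      # amplifies small improvements
    }

r = shaped_rewards(1.10)  # a modest 10% density improvement
# The log shape lands barely above baseline, while exp pushes well past it.
```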

Why This Matters: Beyond the Game

Teaching AI to play games like 2048 isn’t just an academic exercise. The underlying challenges – strategic planning, rule adherence, resource optimization (tile space), and sequential decision-making – are relevant to many real-world problems:

  • Logistics and Operations: Optimizing routes or resource allocation.
  • Process Optimization: Finding efficient sequences of steps in manufacturing or workflows.
  • Financial Modeling: Making sequential decisions based on evolving market states.
  • Scientific Discovery: Planning sequences of experiments.

By developing methods to teach AI these skills in controlled environments like games, we build the foundation for more capable and reliable AI systems in complex domains.

Conclusion: A Promising Trajectory

This experiment demonstrates that reinforcement learning, specifically GRPO coupled with carefully crafted data generation and multi-component reward functions, can effectively teach language models strategic gameplay in a complex puzzle like 2048. The Qwen 2.5 7B model showed a clear ability to learn and apply game heuristics. The journey continues, but these initial results are highly encouraging, showcasing the potential for RL to imbue language models with deeper reasoning and planning capabilities.

Note: This is an ongoing project. Feedback and suggestions are welcome!

Citation

@misc{dalal2024agent2048blog,
    author = {Dalal, Hrishbh},
    title = {{Agent 2048: Forging Strategic Gameplay in an AI Through Data, Rewards, and RL}},
    year = {2024},
    month = {March},
    url = {https://yourwebsite.com/blog/ai-agent-plays-2048},
    note = {[Blog post] Accessed: March 30, 2024}
}