
Solving the Rubik's Cube using Reinforcement Learning

  • Writer: Kito Theron
  • May 18
  • 10 min read

I modelled and visualised a Rubik's cube and implemented Reinforcement Learning techniques to attempt to solve it. This project stemmed from my personal interest in Rubik's cubes; at the time of writing, my own best solve time by hand is around 20 seconds.


Here is a preview of what the program looks like.



Here’s a short 28-second clip demonstrating the program in action.




Project structure


rubiks-cube-ai-solver/
├── src/
│   ├── controller/
│   │   └── app_controller.py   # Application controller
│   ├── model/
│   │   └── data_model.py       # Cube data representation and logic
│   ├── solver/
│   │   └── rl_solver.py        # Reinforcement Learning solver
│   ├── view/
│   │   ├── cube_visualization_2d.py # 2D cube drawing
│   │   └── main_window.py      # Main GUI window
│   └── main.py                 # Main application entry point
└── README.md      


Modelling the Cube


I modelled the cube in my data_model.py file, where I represented it as a list of six 3x3 matrices (one for each face). I found this approach effective since I'm familiar with matrix transformations, and it aligns well with how a Rubik's cube is structured.


self.cube = [
            [['y0', 'y1', 'y2'], ['y3', 'y4', 'y5'], ['y6', 'y7', 'y8']],
            [['b0', 'b1', 'b2'], ['b3', 'b4', 'b5'], ['b6', 'b7', 'b8']],
            [['r0', 'r1', 'r2'], ['r3', 'r4', 'r5'], ['r6', 'r7', 'r8']],
            [['o0', 'o1', 'o2'], ['o3', 'o4', 'o5'], ['o6', 'o7', 'o8']],
            [['g0', 'g1', 'g2'], ['g3', 'g4', 'g5'], ['g6', 'g7', 'g8']],
            [['w0', 'w1', 'w2'], ['w3', 'w4', 'w5'], ['w6', 'w7', 'w8']]
        ]
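
The rotation methods later in this post refer to faces through index attributes such as self.right_face and self.top_face, which simply select one of these six matrices. Purely for illustration (the actual assignment lives in data_model.py), that mapping looks something like:

# Face indices into self.cube (illustrative values only; the real
# assignment is defined in data_model.py alongside the list above)
self.top_face = 0
self.left_face = 1
self.front_face = 2
self.right_face = 3
self.back_face = 4
self.bottom_face = 5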

When solving a Rubik's Cube, there is a standard notation for moves, which I incorporated into the project. For example, an R move means rotating the right side upwards (a clockwise quarter turn of the right face), and R' means rotating it downwards (counterclockwise).
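
Internally, each move boils down to a (face, direction) pair; my solver defines its algorithms directly as such tuples, as seen later. A small helper for translating notation strings into those pairs could look like this (a hypothetical sketch, not part of my code):

def parse_move(move):
    """Translate standard notation ("R", "R'", "R2") into a (face, direction) pair."""
    face = move[0]
    if move.endswith("'"):
        return face, 'counterclockwise'
    if move.endswith('2'):
        return face, 'double'
    return face, 'clockwise'

# e.g. [parse_move(m) for m in "R U R' U2".split()]
# -> [('R', 'clockwise'), ('U', 'clockwise'), ('R', 'counterclockwise'), ('U', 'double')]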


I had to write functions for all of these move variations individually and figure out which indices in the cube model needed to change. For example, here is the rotate_R method:


def rotate_R(self):
        """Rotate the Right face clockwise"""
        # Rotate the right face itself
        self.cube[self.right_face] = [list(row) for row in zip(*self.cube[self.right_face][::-1])]
        
        # Rotate the adjacent edges
        temp = [self.cube[self.top_face][i][2] for i in range(3)]
        for i in range(3):
            self.cube[self.top_face][i][2] = self.cube[self.front_face][i][2]
            self.cube[self.front_face][i][2] = self.cube[self.bottom_face][i][2]
            self.cube[self.bottom_face][i][2] = self.cube[self.back_face][2 - i][0]
            self.cube[self.back_face][2 - i][0] = temp[i]

The first line rotates the right face itself (a 90-degree rotation of that matrix). The lines after it are more involved: they cycle the strips of stickers on the four adjacent faces, which are also affected when a face is turned. It was crucial to get this exactly right, as the cube controls wouldn't work otherwise.


To simplify calling face rotations, I created this function, which decides which rotation method should be called.


def rotate_face(self, face, direction='clockwise'):
        """
        Rotate a face of the cube
        
        Args:
            face: One of 'U', 'D', 'L', 'R', 'F', 'B'
            direction: 'clockwise', 'counterclockwise', or 'double'
        """
        if face == 'R':
            if direction == 'clockwise':
                self.rotate_R()
            elif direction == 'counterclockwise':
                self.rotate_R_prime()
            elif direction == 'double':
                self.rotate_R2()
        elif face == 'L':
            if direction == 'clockwise':
                self.rotate_L()
            elif direction == 'counterclockwise':
                self.rotate_L_prime()
            elif direction == 'double':
                self.rotate_L2()
        elif face == 'U':
            if direction == 'clockwise':
                self.rotate_U()
            elif direction == 'counterclockwise':
                self.rotate_U_prime()
            elif direction == 'double':
                self.rotate_U2()
        elif face == 'D':
            if direction == 'clockwise':
                self.rotate_D()
            elif direction == 'counterclockwise':
                self.rotate_D_prime()
            elif direction == 'double':
                self.rotate_D2()
        elif face == 'F':
            if direction == 'clockwise':
                self.rotate_F()
            elif direction == 'counterclockwise':
                self.rotate_F_prime()
            elif direction == 'double':
                self.rotate_F2()
        elif face == 'B':
            if direction == 'clockwise':
                self.rotate_B()
            elif direction == 'counterclockwise':
                self.rotate_B_prime()
            elif direction == 'double':
                self.rotate_B2()
        else:
            raise ValueError(f"Invalid face '{face}' or direction '{direction}'")
        self.debug_cube_state()
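
With the dispatcher in place, applying a sequence of moves is just a series of calls. For example, the common "sexy move" trigger R U R' U' can be applied like this (assuming cube_model is an instance of the data model class):

# Apply the "sexy move" trigger R U R' U' to the cube model
sexy_move = [
    ('R', 'clockwise'),
    ('U', 'clockwise'),
    ('R', 'counterclockwise'),
    ('U', 'counterclockwise'),
]
for face, direction in sexy_move:
    cube_model.rotate_face(face, direction)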


Visualising the cube


I decided that a 2D plan (an unfolded net) of the cube would be the most effective way of displaying it, as it keeps all six faces visible at once.



For example, this is what it looks like when the cube's front face has been rotated clockwise (F).




One challenge I faced was ensuring that the model stayed in sync with the visualised cube. I added a text-based debug readout of the cube state to the screen so I could check that the two remained consistent.




You will notice that each sticker has a number; this is so I can track specific stickers. It matters because certain stickers should always stay connected (they belong to the same physical piece), so if they drift apart I know something isn't working correctly. It also matters to the solver: the cube cannot reach a solved state unless each numbered sticker is in its correct place, not just a sticker of the right colour, so the Reinforcement Learning algorithm needs this information as well.
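
A quick way of catching this kind of bug is a sanity check along the following lines: all 54 sticker labels should appear exactly once, no matter how many moves have been applied. (This is a minimal sketch of the idea rather than the exact check in my code.)

from collections import Counter

def check_sticker_labels(cube):
    """Sanity check: every one of the 54 sticker labels appears exactly once."""
    labels = [sticker for face in cube for row in face for sticker in row]
    duplicates = [label for label, n in Counter(labels).items() if n != 1]
    assert len(labels) == 54 and not duplicates, f"Inconsistent stickers: {duplicates}"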


When creating the visualisation, I assigned each face a grid position (a 2D vector) that determines where it is placed on the screen.


def create_empty_grid(self):
        """Create an empty visualization grid"""
        # Calculate square size based on canvas size
        self.square_size = 30
        self.spacing = 5
        
        # Define positions for the unfolded cube layout
        # The layout looks like:
        #     [U]
        #  [L][F][R][B]
        #     [D]
        self.face_positions = {
            'U': (1, 0),  # Top face (x=1, y=0)
            'L': (0, 1),  # Left face (x=0, y=1)
            'F': (1, 1),  # Front face (x=1, y=1)
            'R': (2, 1),  # Right face (x=2, y=1)
            'B': (3, 1),  # Back face (x=3, y=1)
            'D': (1, 2),  # Down face (x=1, y=2)
        }
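
Each sticker's on-screen position can then be derived from its face's grid position plus its row and column within the face. A sketch of that calculation, using the square_size and spacing values above (the helper name is hypothetical and the real drawing code may differ in detail):

def sticker_origin(self, face, row, col):
    """Top-left pixel position of a single sticker in the unfolded layout (illustrative)."""
    grid_x, grid_y = self.face_positions[face]
    face_span = 3 * self.square_size + self.spacing   # width/height of one face plus a gap
    x = grid_x * face_span + col * self.square_size
    y = grid_y * face_span + row * self.square_size
    return x, y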


Reinforcement Learning


Reinforcement Learning (RL) is about learning from rewards: here, a neural network predicts the value of each possible next move and is updated after every move based on whether the outcome was positive or negative. For this project, I implemented a Deep Q-Learning (DQL) approach using the TensorFlow library. For this to work well, the model needs a precise scoring (reward) function so it can learn what constitutes a good or bad move.


model = tf.keras.Sequential([
                tf.keras.layers.Input(shape=(324,)),  # Explicit input layer
                tf.keras.layers.Dense(512, activation='relu'),
                tf.keras.layers.Dense(256, activation='relu'),
                tf.keras.layers.Dense(128, activation='relu'),
                tf.keras.layers.Dense(len(self.actions))  # Q-values for each action/algorithm
            ])
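
The input size of 324 corresponds to the 54 stickers of the cube, each encoded over the 6 possible colours (54 x 6 = 324). A minimal sketch of such an encoding, in the spirit of the _state_to_features method used later (the real method may differ in detail):

import numpy as np

COLOURS = ['y', 'b', 'r', 'o', 'g', 'w']

def state_to_features(cube):
    """Encode the cube as a (1, 324) vector: 54 stickers x 6 one-hot colours (sketch)."""
    features = np.zeros((54, 6), dtype=np.float32)
    for idx, sticker in enumerate(
        sticker for face in cube for row in face for sticker in row
    ):
        features[idx, COLOURS.index(sticker[0])] = 1.0  # the label's first character is the colour
    return features.reshape(1, -1)  # ready to pass to model.predict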

In my first attempt I created a simple RL agent that could only make single moves, and it used a basic scoring method that counted how many stickers were in the correct position. I quickly realised the cube was never going to make any progress this way and I had to adapt my approach.
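
For reference, that first scoring method amounted to something like this (an illustrative sketch, not the exact code):

def naive_score(cube, solved_cube):
    """First scoring idea: count stickers already in their solved position.
    This signal turned out to be far too flat for the agent to make progress."""
    return sum(
        sticker == solved_sticker
        for face, solved_face in zip(cube, solved_cube)
        for row, solved_row in zip(face, solved_face)
        for sticker, solved_sticker in zip(row, solved_row)
    )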


I decided to split the solving process into stages (the same stages a human uses when learning to solve a cube), collectively known as CFOP: Cross, First two layers, Orient last layer, Permute last layer. This improved my scoring function, as I only scored moves against the stage the cube was currently on, given that the previous stages had been completed.
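
In code terms, the evaluation dispatches on the current stage and only rewards progress on that stage once the earlier ones are done. A simplified sketch (only _evaluate_white_cross is a real method from my solver; the other per-stage helpers here are hypothetical names):

def evaluate_for_stage(self, state):
    """Score only the stage the cube is currently on (sketch)."""
    if self.stage == "cross":
        return self._evaluate_white_cross(state)
    elif self.stage == "f2l":
        return 1000 + self._evaluate_f2l_pairs(state)               # hypothetical helper
    elif self.stage == "oll":
        return 2000 + self._evaluate_last_layer_orientation(state)  # hypothetical helper
    else:  # "pll"
        return 3000 + self._evaluate_last_layer_permutation(state)  # hypothetical helper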


I also added a feature where high-scoring states are stored, and if the cube stays in worse states for too many moves it jumps back to a previous high-scoring state. The model is also penalised for revisiting previously explored states:


new_state_hash = self._get_state_hash(temp_eval_state)
visit_penalty = 200 if new_state_hash in visited_states else 0
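
The backtracking side of this works roughly as follows (a sketch of the idea; the variable names are illustrative, and copy is already imported in rl_solver.py):

# Inside the solving loop (illustrative names)
if current_score > best_score:
    best_score = current_score
    best_state = copy.deepcopy(current_state)
    moves_since_improvement = 0
else:
    moves_since_improvement += 1
    if moves_since_improvement > patience:            # e.g. allow ~25 non-improving moves
        current_state = copy.deepcopy(best_state)     # jump back to the best state seen so far
        moves_since_improvement = 0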

I also gave the neural network the option of performing full algorithms I specified, as opposed to only making single moves, with the set of available algorithms depending on which solving stage the cube is at.


def _update_actions_for_stage(self):
        """
        Update self.actions based on the current solving stage.
        """
        if self.stage == "cross":
            # Only single moves
            self.actions = [action for action in self.all_actions if action['type'] == 'single_move']
        elif self.stage == "f2l":
            # Single moves and F2L algorithms (no repetitions)
            self.actions = [
                action for action in self.all_actions
                if action['type'] == 'single_move' or
                (action['type'] == 'algorithm' and self._is_f2l_algorithm(action['name']))
            ]
        elif self.stage == "oll":
            # Single moves, OLL algorithms, and U moves
            self.actions = [
                action for action in self.all_actions
                if action['type'] == 'single_move' or
                (action['type'] == 'algorithm' and self._is_oll_algorithm(action['name'])) or
                (action['type'] == 'single_move' and action['move_tuple'][0] == 'U')
            ]
        elif self.stage == "pll":
            # Single moves, PLL algorithms, and U moves
            self.actions = [
                action for action in self.all_actions
                if action['type'] == 'single_move' or
                (action['type'] == 'algorithm' and self._is_pll_algorithm(action['name'])) or
                (action['type'] == 'single_move' and action['move_tuple'][0] == 'U')
            ]
        else:
            # Default to all single moves if stage is unknown
            self.actions = [action for action in self.all_actions if action['type'] == 'single_move']

Here is an example of a PLL algorithm:


("PLL - H Perm Alt", [
                ("R", "clockwise"), ("U", "clockwise"), ("R", "counterclockwise"),
                ("U", "clockwise"), ("R", "clockwise"), ("U", "clockwise"), ("U", "clockwise"),
                ("R", "counterclockwise"), ("U", "clockwise")
            ]),
            

Applied to a solved cube, it produces this configuration.





I also wanted to make sure these algorithms could be applied relative to any face, so I created this mapping function to simulate looking at the cube from a different side.


for name, moves in filtered_algorithms:
            # Add the original algorithm without orientation changes
            all_oriented_algorithms.append((name, moves))
            
            # Create oriented versions
            for orientation in face_map.keys():
                oriented_moves = []
                for move in moves:
                    face, direction = move[:2]  # Safely unpack only the first two elements
                    # Check if face is in our standard faces list
                    if face in standard_faces:
                        # Handle D face specially since it's not in the mapping arrays
                        if face == "D":
                            # For D face, we need a special mapping
                            if orientation == "U": mapped_face = "D"
                            elif orientation == "D": mapped_face = "U"
                            elif orientation == "F": mapped_face = "B"
                            elif orientation == "B": mapped_face = "F"
                            elif orientation == "R": mapped_face = "L"
                            elif orientation == "L": mapped_face = "R"
                        else:
                            # For all other faces, use the mapping
                            face_idx = ["U", "F", "R", "B", "L"].index(face)
                            mapped_face = face_map[orientation][face_idx]
                        
                        oriented_moves.append((mapped_face, direction))
                    else:
                        # For faces not recognized, keep them unchanged
                        oriented_moves.append((face, direction))
                
                # Add the oriented algorithm
                all_oriented_algorithms.append((f"{name} (oriented {orientation})", oriented_moves))

Here is the code which illustrates the flow of the reinforcement learning element.


# --- RL: Predict Q-values for all actions ---
                state_features = self._state_to_features(current_state)
                q_values = None
                if hasattr(self, "model") and self.model is not None:
                    try:
                        q_values = self.model.predict(state_features, verbose=0)[0]
                    except Exception as e:
                        print(f"Warning: Model prediction failed: {e}")
                        q_values = None

                for idx, action_definition in enumerate(self.actions):
                    temp_eval_state = copy.deepcopy(current_state)
                    action_type = action_definition['type']

                    option_entry = {"type": action_type}
                    first_move_for_penalty_calc = None

                    if action_type == 'single_move':
                        face, direction = action_definition['move_tuple']
                        temp_eval_state = self._apply_action(temp_eval_state, face, direction)
                        option_entry["move_tuple"] = action_definition['move_tuple']
                        first_move_for_penalty_calc = action_definition['move_tuple']
                    elif action_type == 'algorithm':
                        alg_name = action_definition['name']
                        alg_moves = action_definition['moves_list']
                        if not alg_moves: continue
                        temp_eval_state = self._apply_algorithm(temp_eval_state, alg_moves)
                        option_entry["name"] = alg_name
                        option_entry["moves_list"] = alg_moves
                        first_move_for_penalty_calc = alg_moves[0] if alg_moves else None
                    else:
                        print(f"Warning: Unknown action type: {action_type}")
                        continue

                    original_score = self._evaluate_state(temp_eval_state)

                    repeat_penalty = 0
                    if first_move_for_penalty_calc:
                        repeat_penalty = self._calculate_repeat_penalty(move_history, first_move_for_penalty_calc)
                        if action_type == 'algorithm':  # Algorithms are more deliberate
                            repeat_penalty *= 0.3 

                    stage_bonus = 0
                    eval_cross_score = self._evaluate_white_cross(temp_eval_state)

                    # Penalties for losing progress (relaxed: allow temporary regression)
                    # Only apply a strong penalty if cross is lost for too long (handled above)
                    # Here, just apply a mild penalty for regression
                    if white_cross_solved and eval_cross_score < 100:  # Lost solved cross
                        stage_bonus -= 2000 
                    elif not white_cross_solved and white_cross_score > 0 and eval_cross_score < white_cross_score * 0.7:
                        stage_bonus -= (white_cross_score - eval_cross_score) * 2  # Mild penalty

                    # Bonuses for achieving/improving stages
                    if not white_cross_solved and eval_cross_score > white_cross_score:
                        stage_bonus += (eval_cross_score - white_cross_score) * 15  # Strong bonus for cross progress
                        if eval_cross_score >= 100: stage_bonus += 500  # Cross completion bonus

                    new_state_hash = self._get_state_hash(temp_eval_state)
                    visit_penalty = 200 if new_state_hash in visited_states else 0

                    adjusted_score = original_score + stage_bonus - repeat_penalty - visit_penalty

                    # RL: If Q-values available, blend with adjusted_score
                    if q_values is not None and idx < len(q_values):
                        # Weighted sum: RL Q-value + heuristic
                        rl_weight = 0.6
                        adjusted_score = rl_weight * q_values[idx] + (1 - rl_weight) * adjusted_score

                    option_entry["next_state"] = temp_eval_state
                    option_entry["original_score"] = original_score
                    option_entry["adjusted_score"] = adjusted_score
                    option_entry["action_idx"] = idx
                    all_evaluated_options.append(option_entry)

                all_evaluated_options.sort(key=lambda x: x["adjusted_score"], reverse=True) 

                # --- RL: Epsilon-greedy selection ---
                chosen_option_details = None
                epsilon = exploration_probability
                if q_values is not None:
                    # Anneal epsilon over time
                    epsilon = max(0.05, exploration_probability * (1 - step / self.max_steps))
                if np.random.rand() < epsilon and len(all_evaluated_options) > 1:
                    chosen_option_details = np.random.choice(all_evaluated_options[1:min(6, len(all_evaluated_options))])
                    print(f"  RL Exploring: {chosen_option_details.get('name', chosen_option_details.get('move_tuple'))}")
                elif all_evaluated_options:
                    chosen_option_details = all_evaluated_options[0]
                    print(f"  RL Exploiting: {chosen_option_details.get('name', chosen_option_details.get('move_tuple'))}")

                # --- RL: Store experience for learning ---
                if chosen_option_details:
                    prev_features = self._state_to_features(current_state)
                    action_idx = chosen_option_details.get("action_idx", 0)
                    reward = chosen_option_details["original_score"] - current_score_eval
                    next_state = chosen_option_details["next_state"]
                    done = self._is_solved(next_state)
                    next_features = self._state_to_features(next_state)
                    experience_buffer.append((prev_features, action_idx, reward, next_features, done))

                    # Keep buffer size reasonable
                    if len(experience_buffer) > 10000:
                        experience_buffer = experience_buffer[-10000:]

                # --- RL: Train model from experience buffer ---
                if hasattr(self, "model") and self.model is not None and len(experience_buffer) >= batch_size:
                    minibatch = random.sample(experience_buffer, batch_size)
                    states = np.vstack([exp[0] for exp in minibatch])
                    actions = [exp[1] for exp in minibatch]
                    rewards = [exp[2] for exp in minibatch]
                    next_states = np.vstack([exp[3] for exp in minibatch])
                    dones = [exp[4] for exp in minibatch]

                    # Predict Q-values for current and next states
                    q_current = self.model.predict(states, verbose=0)
                    q_next = self.model.predict(next_states, verbose=0)

                    for i in range(batch_size):
                        target = rewards[i]
                        if not dones[i]:
                            target += gamma * np.max(q_next[i])
                        q_current[i][actions[i]] = target

                    self.model.fit(states, q_current, epochs=1, verbose=0)

                # --- Apply chosen move ---
                if chosen_option_details:
                    current_state = chosen_option_details["next_state"]
                    visited_states.add(self._get_state_hash(current_state))
                    if chosen_option_details["type"] == "single_move":
                        solution_moves.append(chosen_option_details["move_tuple"])
                        move_history.append(chosen_option_details["move_tuple"])
                    elif chosen_option_details["type"] == "algorithm":
                        solution_moves.extend(chosen_option_details["moves_list"])
                        move_history.extend(chosen_option_details["moves_list"])

                    if controller and hasattr(controller.view, 'root'):
                        controller.model.cube = copy.deepcopy(current_state)
                        controller.view.root.after(0, lambda: controller.view.update_view(controller.model.cube))
                        controller.view.root.update_idletasks()
                        controller.view.root.update()
                        time.sleep(0.01)

                if self._is_solved(current_state):
                    print(f"Solution found in {step + 1} steps!")
                    return solution_moves

The loop retrieves the Q-values (predicted action values) from the RL model and blends them with heuristic scores (stage bonuses, repeat penalties and the revisit penalty) to evaluate potential moves. Action selection is epsilon-greedy: with probability epsilon a near-top option is chosen at random (exploration), otherwise the best-scoring option is taken (exploitation), and epsilon is annealed towards 0.05 as the attempt progresses, so the solver explores less over time. Each experience (state, action, reward, next state and whether the cube is solved) is stored in a buffer, and minibatches sampled from that buffer are used to train the neural network. The chosen action is then applied and the cube is checked to see whether it is solved.


This model does have its limits: I've not yet been able to take the cube from a completely scrambled state to a fully solved state within the 5 minutes of running I allow per attempt. Currently, the model can typically solve the first layer in that time, but struggles to complete the remaining two layers. Here is an example of what a final solution can look like.





The GUI is really good for illustrating the reinforcement learning process. Here is a video which demonstrates the beginning stages of the RL model attempting to solve the cube.




The model is also quite good at solving cubes which are less than 5 moves from being solved.




After every move, the RL model sleeps for 0.01 seconds to allow the GUI to update. When I first added this it caused the window to freeze, so to fix it I moved the RL solver onto its own thread inside the controller, where it no longer blocks the other processes from running.
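
The pattern, roughly, is to run the solver on a worker thread and hand any widget updates back to the main thread with root.after. A minimal sketch (the solve call and controller attributes are illustrative names, not the exact API in app_controller.py):

import threading

def start_solver_in_background(controller):
    """Run the RL solver off the main thread so the Tk mainloop stays responsive."""
    def run():
        moves = controller.solver.solve(controller.model)   # long-running RL loop (illustrative API)
        print(f"Solver finished after {len(moves)} moves")
        # Hand the final GUI refresh back to the main thread
        controller.view.root.after(0, lambda: controller.view.update_view(controller.model.cube))

    threading.Thread(target=run, daemon=True).start()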



Next Steps


I am planning to keep developing this project, with the eventual goal of having the cube solve itself from start to finish.


I would like to experiment with running the Reinforcement Learning algorithm for longer, and potentially with having multiple cubes learn in parallel and share their findings with each other.


I would also like a 3D cube to be visualised in addition to the 2D plan. I've already experimented with a few implementations and had some struggles, as it is difficult to keep the 3D cube consistent with the model, but I expect to get this working in the future.


I would like to add the ability to rotate the whole Rubik's cube within the 2D plan, with each face shifting along accordingly.


I encourage other programmers to fork my repository as a base for any Rubik's cube related programs they wish to create.


Repository: https://github.com/KitoSTheron/rubiks-cube-ai-solver (my README.md contains a more technical explanation of my program)

 
 
 
