Deep Reinforcement Learning with Space Invaders… Literally from scratch

Yassine Kaddi
11 min read · Nov 7, 2020


I was so amazed when I first saw DeepMind's agent playing Atari games at a superhuman level. I was even more astonished when I watched AlphaGo's mind-blowing story and OpenAI's Hide and Seek experiment, where the agents not only learned how to master the game but even surprised their own developers by exploiting a bug in the game to reach their goals.

Ever since RL caught my attention (compared to Supervised/Unsupervised Learning), I started learning the basics of Reinforcement Learning from the bible, Reinforcement Learning: An Introduction (Richard Sutton and Andrew Barto). Along with that, since I strongly believe in the "learning by doing" philosophy, I started looking up tutorials online to get my hands dirty. I came across some pretty good ones that helped me understand and apply the basics (mainly Q-Learning and Deep Q-Learning). I was also excited that Kaggle opened up to reinforcement learning with some interesting competitions.

But at some point, I realized that more or less the same ideas are repeated. Besides, the tutorials are mainly meant to "get your hands on" the topic, which means you only see the visible part of the iceberg: the environment, the actions, the states and the rewards are already defined, and you only choose the RL method to apply. I totally agree that this helps to understand RL methods. But when you want to do everything from scratch, tons of questions arise that seemed obvious while following tutorials. Hence, I decided to do everything from scratch: I applied RL to the game Space Invaders. I coded the game in Python using the Pygame package, then modified it so I could train the RL learner.

Please note that the focus here will be on the details and questions that arise when implementing everything from zero, along with the results obtained. For that matter, I will only give a brief definition of RL in the introduction. Note also that I'm not an RL expert: I'll explain things as I understood them, and I'd be grateful for any remarks.

Introduction

Reinforcement Learning is considered the third field of Machine Learning beside Supervised Learning and Unsupervised Learning. It's about learning what to do in an environment in order to maximize a score. In other terms, the learner is supposed to map situations of its environment to actions in order to get maximum rewards. It's about finding solutions to Markov Decision Processes, defined by a set of states, a set of actions, a reward function and a discount factor gamma.
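
Concretely, the quantity the agent tries to maximize is the expected discounted return, which in Sutton and Barto's notation reads

G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + … = Σ_{k=0..∞} γ^k R_{t+k+1},

where gamma in [0, 1) controls how much future rewards count compared to immediate ones.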

I want to apply Reinforcement Learning to a Space Invaders game implemented from zero. As the RL algorithm, I'm using the model-free Deep Q-Network (DQN). I started straight with DQN instead of a Q-table (as is usually done in tutorials) because (1) DQN, thanks to its weights, generalizes better in complex environments with numerous unseen states, (2) a Q-table needs much more memory as the state space grows, since the table has to grow with it, and (3) DQN is much more widely used. The downside is that it takes a long time to train (a GPU is probably needed) and there are more hyperparameters to handle.

In the long run, I want to look into reward shaping, observation definitions and hyperparameter optimization, and compare with other model-free RL techniques (Dueling DQN, Double DQN, A3C…).

The Game : Space Invaders

The Space Invaders game is coded using the Pygame package. I first coded a version that can be played by us humans, then modified it so it can be played by the RL agent. Here are the game details that might matter for the RL part:

The window

The window dimensions are (700+150)×500 pixels. The 700 px wide area is where the player and the enemies can move, and the remaining 150 px are used for monitoring: that's where we show the health, the score, the highest score, the episode and the reward.

The agent

The agent's position is initialized at the bottom center of the window (X = (window_width - player_width)/2, Y = window_height - player_height). Note that with Pygame, the point (0, 0) is the upper left corner of the screen: the X coordinates increase from left to right and the Y coordinates from top to bottom. He can move left, move right and shoot. The player's health is 10, which means he'll be killed after being hit by the enemy's laser 10 times. He is supposed to kill as many enemies as possible; whenever an enemy is killed, the score increases by 1.

The enemy

The enemies are initialized right above the window at a random horizontal position (X = np.random.randint(window_width), Y = -enemy_height). The total number of enemies is 7; whenever an enemy is killed, another one is created in order to keep the total number constant. Each enemy's health is 4, which means it is killed after being hit by the player's laser 4 times.
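
To make this concrete, here is a minimal sketch of the spawning logic described above. The constant values for the sprite sizes and the dict-based entities are illustrative placeholders, not the exact code of my implementation:

```python
import numpy as np

# Illustrative constants matching the description above
WINDOW_WIDTH, SIDE_PANEL, WINDOW_HEIGHT = 700, 150, 500
PLAYER_WIDTH, PLAYER_HEIGHT = 50, 40   # placeholder sprite sizes
ENEMY_WIDTH, ENEMY_HEIGHT = 40, 40
N_ENEMIES = 7

def spawn_player():
    """The player starts at the bottom center with 10 health points."""
    return {"x": (WINDOW_WIDTH - PLAYER_WIDTH) / 2,
            "y": WINDOW_HEIGHT - PLAYER_HEIGHT,
            "health": 10}

def spawn_enemy():
    """Enemies appear just above the window at a random X, with 4 health points."""
    return {"x": np.random.randint(WINDOW_WIDTH),
            "y": -ENEMY_HEIGHT,
            "health": 4,
            "moving_right": True}

player = spawn_player()
enemies = [spawn_enemy() for _ in range(N_ENEMIES)]  # killed enemies get respawned
```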

The RL part

The Space Invaders Atari game, as used in many (if not all) tutorials, is solved with DQN, where a convolutional neural net (CNN) learns the states from raw images of the game. That's something I noticed while going through many DQN tutorials (mostly games): they usually learn the states from raw images with a CNN. I'm planning to use RL in other areas where we might not have images, so I was curious to represent the states in another form than images. It's also an approach that can be reproduced in many areas other than games.

Observations

So how am I going to define observations if I'm not using raw images? That was actually a fun part, because I played the game many times to figure out what I, as a human player, unconsciously do in order to choose an action. In other terms, which features I rely on while playing in order to get the highest score possible. That was not as easy as it seems; I might have forgotten some features, but here are the main remarks I made:

  • I localize the enemies' positions in order to shoot.
  • I keep track of my own position.
  • I track my lasers' positions to time my shots better.
  • I track the closest enemy lasers' positions so I can avoid them.

The next step was to convert those remarks into plain maths, and that can be a bit tricky since some features are not easily noticeable. Here are the results, hence the observations, of the "conversion" of my remarks (a code sketch of the full observation vector follows the list):

  1. The player's position: the X coordinate of the player normalized by the window width. We only have one player, so that's the 1st observation.
  2. The enemies' relative distances: the distance between each enemy and the player, normalized by the window height. We multiply the relative distance by +1 or -1 depending on which side (left or right) the enemy is on. Since we have 7 enemies, that gives 7 relative distances, hence 8 observations so far.
  3. The relative distances of the player's lasers: the player's lasers are localized by tracking their relative distance to the player. We multiply the relative distance by +1 or -1 depending on which side the laser is on. The problem here is that the number of the player's lasers changes all the time, so I decided to track only 3 lasers for the moment (unless you have another idea?). That brings us to 11 observations.
  4. The relative distances of the enemy lasers: here I want to keep track of the closest enemy lasers. We multiply the relative distance by +1 or -1 depending on which side the enemy's laser is on. Same problem as with the previous observation: the total number of enemy lasers is random, so I chose to track the 6 closest ones. In mathematical terms, I took the 6 lowest relative distances between the enemy lasers' positions and the player's position. That brings the total to 17 observations.
  5. The direction of the enemies: so that the agent can capture whether the enemies are moving right or left, I assigned a value of -0.5 for moving left and +0.5 for moving right. This can also be interpreted as a velocity, especially since the velocity is constant. Since we have 7 enemies, that gives 7 directions and 24 observations in total.
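
Here is the sketch announced above: a minimal version of how the 24-dimensional observation vector could be assembled. The helper names, the padding values for missing lasers and the dict-based entities are assumptions for illustration, not the exact code:

```python
import numpy as np

def signed_relative_distance(player_x, player_y, obj_x, obj_y, norm):
    """Distance between the player and an object, normalized by `norm` and
    signed by which side the object is on (right: +, left: -)."""
    dist = np.hypot(obj_x - player_x, obj_y - player_y) / norm
    return dist if obj_x >= player_x else -dist

def build_observation(player, enemies, player_lasers, enemy_lasers,
                      window_width=700, window_height=500):
    """Assemble the 24-dimensional observation vector described above."""
    obs = [player["x"] / window_width]                     # 1: player position

    for e in enemies:                                      # 2-8: 7 enemy distances
        obs.append(signed_relative_distance(player["x"], player["y"],
                                            e["x"], e["y"], window_height))

    # 9-11: up to 3 of the player's own lasers (padding with 0.0 when fewer
    # lasers are on screen is an assumption, not specified in the post)
    lasers = player_lasers[:3]
    for l in lasers + [None] * (3 - len(lasers)):
        obs.append(0.0 if l is None else
                   signed_relative_distance(player["x"], player["y"],
                                            l["x"], l["y"], window_height))

    # 12-17: the 6 closest enemy lasers (sorted by absolute distance),
    # padded with 1.0 ("far away") when fewer than 6 exist
    dists = sorted((signed_relative_distance(player["x"], player["y"],
                                             l["x"], l["y"], window_height)
                    for l in enemy_lasers), key=abs)[:6]
    obs.extend(dists + [1.0] * (6 - len(dists)))

    # 18-24: enemy directions, -0.5 = moving left, +0.5 = moving right
    obs.extend(0.5 if e["moving_right"] else -0.5 for e in enemies)

    return np.array(obs, dtype=np.float32)                 # shape: (24,)
```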

Note that I mostly use relative distances instead of coordinates. The main reason is that it leads to fewer observations and thus less computation: keep in mind that we're using a deep NN for the learner and the observations are its input. One relative distance takes one neuron, whereas (X, Y) coordinates would take two, which keeps the network smaller and faster.

Actions

The action space is pretty easy here: the player can move left, move right, stay still or shoot. So the action space size is 4.
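
In code, this is simply a mapping from the index of the DQN's output to a game command; something like the following, where the command names are placeholders:

```python
# One Q-value per index; the index with the highest Q-value gets executed.
ACTIONS = {
    0: "move_left",
    1: "move_right",
    2: "stay_still",
    3: "shoot",
}
N_ACTIONS = len(ACTIONS)  # 4
```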

Rewards

The reward function is really important for the learning process: the better the reward function, the better the agent learns. In other terms, a better reward function will increase the speed of convergence and help avoid getting stuck in local minima. I defined the rewards as follows:

  • Death Reward = -1
  • Player Hit Reward = -0.5
  • Enemy Hit Reward = 0.5
  • Enemy Killed Reward = 1
  • Player Alive Reward = 0.0001

You'll notice that the maximum value (1) is assigned to the action that increases the score, that is, killing an enemy. Hitting an enemy helps kill it, so it gets a lower reward: 0.5. Of course, we want to avoid getting killed, so the minimum reward is given when we die. Being hit contributes to getting killed, so it gets a slightly higher reward: -0.5. And since we want the player to stay alive as long as possible, staying alive gets a very small reward: 0.0001.
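
Put together, the game loop can hand these rewards back to the agent at every frame. A rough sketch, assuming the per-frame events are counted somewhere in the game loop (the event names are placeholders):

```python
# Reward constants as listed above
DEATH_REWARD = -1.0
PLAYER_HIT_REWARD = -0.5
ENEMY_HIT_REWARD = 0.5
ENEMY_KILLED_REWARD = 1.0
ALIVE_REWARD = 0.0001

def compute_reward(events, player_alive):
    """Accumulate the reward for one frame from what happened in the game."""
    reward = ALIVE_REWARD if player_alive else DEATH_REWARD
    reward += PLAYER_HIT_REWARD * events["player_hits_taken"]
    reward += ENEMY_HIT_REWARD * events["enemy_hits"]
    reward += ENEMY_KILLED_REWARD * events["enemies_killed"]
    return reward
```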

DQN Agent

The DQN used here has 24 neurons in the input layer (the 24 observations detailed above) and 4 neurons in the output layer (corresponding to the 4 actions). It has 3 hidden layers of 100 neurons each. The replay memory size is 10 000 and the target network is updated every 5 episodes. The batch size is 64, the decay factor for epsilon is 0.999, the discount factor that accounts for future rewards is 0.99, and the learning rate is 10e-6.
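
The network itself is small enough to sketch. Below is a minimal version matching those numbers, assuming a Keras implementation with ReLU activations, an Adam optimizer and a mean-squared-error loss (only the layer sizes and hyperparameters above are fixed; the rest is one reasonable choice, not my exact implementation):

```python
from collections import deque
import numpy as np
from tensorflow import keras

N_OBSERVATIONS, N_ACTIONS = 24, 4
GAMMA = 0.99               # discount factor for future rewards
EPSILON_DECAY = 0.999      # multiplicative decay of the exploration rate
LEARNING_RATE = 10e-6      # as given above
BATCH_SIZE = 64
MEMORY_SIZE = 10_000
TARGET_UPDATE_EVERY = 5    # episodes between target-network syncs

def build_model():
    """24 inputs -> 3 hidden layers of 100 neurons -> 4 Q-values."""
    model = keras.Sequential([
        keras.Input(shape=(N_OBSERVATIONS,)),
        keras.layers.Dense(100, activation="relu"),
        keras.layers.Dense(100, activation="relu"),
        keras.layers.Dense(100, activation="relu"),
        keras.layers.Dense(N_ACTIONS, activation="linear"),
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=LEARNING_RATE),
                  loss="mse")
    return model

model = build_model()           # online network
target_model = build_model()    # target network, synced every 5 episodes
target_model.set_weights(model.get_weights())

# Each entry is a (state, action, reward, next_state, done) tuple
replay_memory = deque(maxlen=MEMORY_SIZE)

epsilon = 1.0  # decaying per step, with a 0.01 floor, is an assumption

def choose_action(state):
    """Epsilon-greedy action selection on the online network."""
    global epsilon
    epsilon = max(0.01, epsilon * EPSILON_DECAY)
    if np.random.rand() < epsilon:
        return np.random.randint(N_ACTIONS)
    q_values = model.predict(state[np.newaxis, :], verbose=0)[0]
    return int(np.argmax(q_values))

def train_step():
    """One gradient step on a random minibatch from the replay memory."""
    if len(replay_memory) < BATCH_SIZE:
        return
    idx = np.random.choice(len(replay_memory), BATCH_SIZE, replace=False)
    batch = [replay_memory[i] for i in idx]
    states = np.array([b[0] for b in batch])
    next_states = np.array([b[3] for b in batch])
    q_values = model.predict(states, verbose=0)
    next_q = target_model.predict(next_states, verbose=0)
    for i, (_, action, reward, _, done) in enumerate(batch):
        target = reward if done else reward + GAMMA * np.max(next_q[i])
        q_values[i, action] = target
    model.fit(states, q_values, batch_size=BATCH_SIZE, epochs=1, verbose=0)
```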

Results

Each training run lasted 2500 episodes in total, which took over 30 hours on my laptop (1.60 GHz Intel i5 CPU, 8th Gen). The first episodes, where the learner behaves by choosing actions randomly, looked like this:

After a while (roughly from episode 600 to 1500), the agent learned to start by moving left in order to avoid the first shots. This might be explained by the fact that the enemies start moving from left to right, so there is a higher chance of many shots landing in the left area:

Over the last episodes (>2000), the player starts to shoot a bit more, which enabled him to get higher scores since he starts killing more (instead of just avoiding lasers). However, he still doesn't seem to really avoid every enemy laser. Maybe more training is needed, and more enemy lasers should be tracked:

Here I plot the evolution of some parameters along the learning process: the reward values, epsilon, the score and the loss of the deep NN, with a learning rate of 10e-6. The increasing reward curve shows at least that the agent learned some tricks that enabled him to get higher scores, starting from episode ~1300. A higher total number of episodes would probably have led to higher rewards.

I only varied the learning rate to see how it changes the learning. The learning rate values adopted here are pretty small compared with what we can find in the literature (mainly between 0.1 and 1e-4). I actually started with 10e-3 and 10e-4, but the learner seemed to need many more episodes to learn, which is why I went with lower values.

Discussion

Observations

Note that we talk about "observations" instead of "states". That is because the agent doesn't know "everything knowable" that could determine the response of the environment to his actions: he only tracks 3 of his lasers and only localizes the 6 closest enemy lasers. Tracking all the lasers would probably enhance his performance, at the expense of higher computational cost.

Besides, the core feature used here is the relative distance. One can argue that this might be a bad way to represent observations (interesting discussion with Neil Slater here), since we lose information about direction and movement (which is captured in the Space Invaders Atari game by using a stack of frames as the state instead of only one frame). I believe that the signs we account for (+1 or -1 depending on which side the laser/enemy is on) help track the direction. As for the movement, as long as the velocity doesn't change, capturing the relative distance along with the direction might be enough.

Reward

Defining the reward function was a little bit challenging. It might be obvious that we should give the highest reward to the action that maximizes the score, but the question was: should that highest reward be 1? 20? 1000? I saw RL examples giving rewards (in absolute value) on the order of ~50 (i.e. -50, 50…) and examples on the order of ~1… I settled on rewards around ~1 and it turned out to work. A comparison with higher values (imagine just multiplying the reward values by 10, then 100) could be interesting. I even thought that a continuous reward function might be more compelling and realistic.

Hyperparameters

Another interesting (or daunting, depending on how you see things!) point is hyperparameter optimization. Here I'm talking about all the parameters detailed in the DQN Agent section. Starting with the deep NN, the number of neurons and hidden layers is essentially arbitrary. One can try many values and see how they impact the learning, but the computational cost can be very high. The same goes for the replay memory size, the number of episodes after which the target network is updated, the decay factor for epsilon, the discount factor that accounts for future rewards, the learning rate…

With this in mind, I finally understood why RL can be difficult to apply to real-world problems. Still, the recent advances in hyperparameter optimization are encouraging. I'll let you know if I happen to try one of the methods proposed in the literature.

Next

I'd like to see the agent reach some really high scores that I couldn't achieve myself. Maybe longer training can get him there, but I'm thinking of tuning some parameters before increasing the total number of episodes. Playing with the reward function before moving to another RL technique seems compelling to me: I'll see if any modification of the reward values enhances the learning. I might also track more enemy lasers instead of only 6, even though it'll be a bit more costly. I'm actually mulling over computing on the cloud (AWS, Azure or Google Cloud?), which would help me move faster. I'll keep you updated if you're interested!
