Building a Smart PySC2 Agent

Steven Brown
Published in Chatbots Life · Oct 9, 2017

In my previous tutorial, we were able to create a simple PySC2 agent that built units and attacked the enemy. This time we will use a branch of Machine Learning known as Reinforcement Learning to teach our agent how to build units and attack the enemy automatically.

If you are unfamiliar with machine learning, I’d recommend the free Coursera and Udacity courses. In this tutorial I have borrowed some Q-Learning table code from Morvan Zhou’s reinforcement learning tutorials. Morvan’s YouTube videos are quite concise and contain examples of simple challenges that illustrate the concept well.

Q-Learning tables essentially maintain a list of states (where the game is at) and a list of actions for each state, where each action in a given state has a score. As the agent performs actions it can receive rewards, and the score for that action in that state is adjusted accordingly. The reward is then propagated back through the states and actions that led to the current point, with its impact slightly reduced at each step back in time.

This combination of action, reward, and the flow of the reward back through the history allows the system to identify good or bad paths to follow, even if the reward is several steps ahead.

1. Create the Agent

To begin, let’s set up the basic constants and classes that we will need; you might recognise these from my last tutorial:
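
(Sketch only; the function IDs and feature-layer index come from PySC2 itself, the unit type IDs are standard StarCraft II values, and the remaining names simply follow the previous tutorial’s conventions.)

import random

from pysc2.agents import base_agent
from pysc2.lib import actions
from pysc2.lib import features

# feature layer index for the unit type screen layer
_UNIT_TYPE = features.SCREEN_FEATURES.unit_type.index

# Terran unit type IDs
_TERRAN_SUPPLY_DEPOT = 19
_TERRAN_BARRACKS = 21
_TERRAN_SCV = 45

# PySC2 action function IDs
_NO_OP = actions.FUNCTIONS.no_op.id
_SELECT_POINT = actions.FUNCTIONS.select_point.id
_BUILD_SUPPLY_DEPOT = actions.FUNCTIONS.Build_SupplyDepot_screen.id
_BUILD_BARRACKS = actions.FUNCTIONS.Build_Barracks_screen.id
_TRAIN_MARINE = actions.FUNCTIONS.Train_Marine_quick.id
_SELECT_ARMY = actions.FUNCTIONS.select_army.id
_ATTACK_MINIMAP = actions.FUNCTIONS.Attack_minimap.id

_NOT_QUEUED = [0]
_QUEUED = [1]


class SmartAgent(base_agent.BaseAgent):
    def step(self, obs):
        super(SmartAgent, self).step(obs)

        # for now the agent does nothing at all
        return actions.FunctionCall(_NO_OP, [])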

Now let’s add the Q-Learning table class; this is essentially the “brain” that keeps track of all the states and actions:
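
(A sketch in the spirit of Morvan Zhou’s tutorial code; the hyperparameters are illustrative defaults.)

import numpy as np
import pandas as pd


class QLearningTable:
    def __init__(self, actions, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9):
        self.actions = actions          # list of action indexes
        self.lr = learning_rate
        self.gamma = reward_decay
        self.epsilon = e_greedy
        self.q_table = pd.DataFrame(columns=self.actions, dtype=np.float64)

    def choose_action(self, observation):
        self.check_state_exist(observation)

        if np.random.uniform() < self.epsilon:
            # exploit: pick the best-scoring action, breaking ties randomly
            state_action = self.q_table.loc[observation, :]
            state_action = state_action.reindex(np.random.permutation(state_action.index))
            action = state_action.idxmax()
        else:
            # explore: pick a random action
            action = np.random.choice(self.actions)

        return action

    def learn(self, s, a, r, s_):
        self.check_state_exist(s_)
        self.check_state_exist(s)

        q_predict = self.q_table.loc[s, a]
        q_target = r + self.gamma * self.q_table.loc[s_, :].max()

        # nudge the score for (state, action) towards the observed target
        self.q_table.loc[s, a] += self.lr * (q_target - q_predict)

    def check_state_exist(self, state):
        if state not in self.q_table.index:
            # add a row of zeros for a state we haven't seen before
            new_row = pd.Series([0] * len(self.actions), index=self.q_table.columns, name=state)
            self.q_table = pd.concat([self.q_table, new_row.to_frame().T])

The e_greedy value means the table picks its best known action 90% of the time and a random action the remaining 10%, which keeps it exploring new options while it learns.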

You can now run your agent, although it won’t do much yet:

python -m pysc2.bin.agent \
--map Simple64 \
--agent smart_agent.SmartAgent \
--agent_race T \
--max_agent_steps 0 \
--norender

As with the previous tutorial, we are going to use Terran as our race. I have also disabled rendering of the feature layers in PySC2, as it doesn’t provide much value here and slows down processing. It is also a good idea to set the agent step limit to 0, otherwise the agent will stop running once it hits the default limit of 2,500 steps.

2. Define the Actions

In order for the agent to do anything, we need to define the actions. For this tutorial we want to be able to build marines and attack the enemy. We define the actions as constants, then add them to a list so they can be selected from later:
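
(One possible set, assuming the marine build order from the previous tutorial; the string values are arbitrary labels, only the list order matters to the Q-Learning table.)

ACTION_DO_NOTHING = 'donothing'
ACTION_SELECT_SCV = 'selectscv'
ACTION_BUILD_SUPPLY_DEPOT = 'buildsupplydepot'
ACTION_BUILD_BARRACKS = 'buildbarracks'
ACTION_SELECT_BARRACKS = 'selectbarracks'
ACTION_BUILD_MARINE = 'buildmarine'
ACTION_SELECT_ARMY = 'selectarmy'
ACTION_ATTACK = 'attack'

smart_actions = [
    ACTION_DO_NOTHING,
    ACTION_SELECT_SCV,
    ACTION_BUILD_SUPPLY_DEPOT,
    ACTION_BUILD_BARRACKS,
    ACTION_SELECT_BARRACKS,
    ACTION_BUILD_MARINE,
    ACTION_SELECT_ARMY,
    ACTION_ATTACK,
]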

It might seem odd to define an action to “do nothing”, but it comes in handy to stop the system from locking up when no other action can be performed. It’s also possible the system could learn that waiting is better than trying to perform an action that results in a negative reward.

There are a few more steps involved before we can connect up the Q-Learning system, so until we get there we can choose an action at random inside the step() method:
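
(A sketch with only a few of the action handlers shown; the minimap attack coordinates are placeholders, since in practice they depend on where your base spawns.)

class SmartAgent(base_agent.BaseAgent):
    def step(self, obs):
        super(SmartAgent, self).step(obs)

        # pick one of the smart actions completely at random for now
        smart_action = smart_actions[random.randrange(0, len(smart_actions))]

        if smart_action == ACTION_DO_NOTHING:
            return actions.FunctionCall(_NO_OP, [])

        elif smart_action == ACTION_SELECT_ARMY:
            if _SELECT_ARMY in obs.observation['available_actions']:
                return actions.FunctionCall(_SELECT_ARMY, [_NOT_QUEUED])

        elif smart_action == ACTION_ATTACK:
            if _ATTACK_MINIMAP in obs.observation['available_actions']:
                # placeholder coordinates
                return actions.FunctionCall(_ATTACK_MINIMAP, [_NOT_QUEUED, [39, 45]])

        # handlers for the remaining actions follow the same pattern

        return actions.FunctionCall(_NO_OP, [])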

You may notice one optimisation I have made since the last tutorial: when selecting SCVs I now choose the x and y coordinates at random, since picking the first coordinate pair can sometimes select something next to the SCV instead. Choosing randomly reduces the chance of this happening, especially on repeated calls.
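
The SCV-selection branch inside step() might look something like this:

if smart_action == ACTION_SELECT_SCV:
    unit_type = obs.observation['screen'][_UNIT_TYPE]
    unit_y, unit_x = (unit_type == _TERRAN_SCV).nonzero()

    if len(unit_y) > 0:
        # pick a random matching pixel rather than always the first one
        i = random.randint(0, len(unit_y) - 1)
        target = [unit_x[i], unit_y[i]]

        return actions.FunctionCall(_SELECT_POINT, [_NOT_QUEUED, target])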

Now we can add the Q-Learning table:
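
(Sketch; the table is indexed by action position rather than by name.)

class SmartAgent(base_agent.BaseAgent):
    def __init__(self):
        super(SmartAgent, self).__init__()

        # one Q-table column per smart action, addressed by index
        self.qlearn = QLearningTable(actions=list(range(len(smart_actions))))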

3. Define the State

Now we need to let the Q-Learning system know what’s happening in the game. You can be as broad or as specific as you like with your state definition; for this tutorial we will keep the state fairly simple, which also makes the system faster since there are fewer combinations of state and action.

We will track whether or not our supply depot and barracks have been built, the supply limit, and the army supply. These values all seem important to helping the system learn a good sequence of events.
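
Inside step(), the state might be assembled along these lines (indexes 4 and 5 of the player array are food_cap and food_army):

unit_type = obs.observation['screen'][_UNIT_TYPE]

depot_y, depot_x = (unit_type == _TERRAN_SUPPLY_DEPOT).nonzero()
supply_depot_count = 1 if len(depot_y) > 0 else 0

barracks_y, barracks_x = (unit_type == _TERRAN_BARRACKS).nonzero()
barracks_count = 1 if len(barracks_y) > 0 else 0

supply_limit = obs.observation['player'][4]
army_supply = obs.observation['player'][5]

current_state = [supply_depot_count, barracks_count, supply_limit, army_supply]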

Now, instead of randomly picking an action, we can ask the Q-Learning system for an action based on the current state.
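
Something like this, with the state stringified so it can be used as a lookup key in the table:

rl_action = self.qlearn.choose_action(str(current_state))
smart_action = smart_actions[rl_action]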

For now it’s not that useful since there is no reward for any actions, so let’s define the rewards.

4. Define the Rewards

We’re almost there! The next step is to tell the Q-Learning system it has done a good or bad job by introducing rewards.

While many reinforcement learning systems limit the rewards to a 1 for a win or a 0 for a loss, we can reward any action we like. I like the idea of rewarding the system for killing units or destroying buildings:
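
(Illustrative values; the exact numbers are a tuning choice.)

KILL_UNIT_REWARD = 0.2
KILL_BUILDING_REWARD = 0.5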

In order to tell when we have killed a unit or building, we can use the cumulative score system. This system is incremental, so we need to keep track of the previous step’s value and compare it with the current value; if the value has increased, we must have killed something. First, let’s add some properties to track the previous values:
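
For example:

# in __init__(), alongside the Q-learning table:
self.previous_killed_unit_score = 0
self.previous_killed_building_score = 0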

Next, we calculate the reward when the value has changed:
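
(Sketch; indexes 5 and 6 of score_cumulative are killed_value_units and killed_value_structures.)

# inside step():
killed_unit_score = obs.observation['score_cumulative'][5]
killed_building_score = obs.observation['score_cumulative'][6]

reward = 0

if killed_unit_score > self.previous_killed_unit_score:
    reward += KILL_UNIT_REWARD

if killed_building_score > self.previous_killed_building_score:
    reward += KILL_BUILDING_REWARD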

Then we want to make sure we store the new values for the next step:
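
For example:

# still inside step(), once the reward has been used:
self.previous_killed_unit_score = killed_unit_score
self.previous_killed_building_score = killed_building_score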

That’s all great, but it still doesn’t do much! Fair point, why don’t we connect it all up then?

5. It’s Alive!

For the system to learn the consequences of its actions, we need to track the previous state and action:
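
For example:

# in __init__():
self.previous_action = None
self.previous_state = None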

Now we can feed all of this data into the Q-Learning system:
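
(Sketch; this sits inside step() once the current state and the kill scores have been read, and the whole block is wrapped in the condition explained below.)

if self.previous_action is not None:
    reward = 0

    if killed_unit_score > self.previous_killed_unit_score:
        reward += KILL_UNIT_REWARD

    if killed_building_score > self.previous_killed_building_score:
        reward += KILL_BUILDING_REWARD

    self.qlearn.learn(str(self.previous_state),
                      self.previous_action,
                      reward,
                      str(current_state))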

You may notice we have wrapped the reward calculation and the learn step inside a condition. This ensures we are not trying to learn from the very first step, before we have performed any action.

Finally, we update the state and action values for the next step:
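
For example:

# last thing in step(), before returning the chosen game action:
self.previous_state = current_state
self.previous_action = rl_action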

Give it a test run and see how it goes. I found that initially it would send my SCVs out to attack the enemy, but over time it figured out that sending marines was a better approach. After long enough it worked out that it needed to build a few marines together before attacking.

That’s it for now, hopefully soon I will create another tutorial with information on how you can make this agent even smarter.

All of the code for this tutorial is available here.

In the next step, learn how to add smart attacking to your agent.

If you like this tutorial, please support me on Patreon. Also please join me on Discord, or follow me on Twitch, Medium, GitHub, Twitter and YouTube.
