An Introduction to Reinforcement Learning

Reinforcement learning (RL) is a machine learning paradigm in which a computer learns to perform tasks such as driving a vehicle, playing Atari games, and beating humans at the game of Go, with little to no supervision from human experts. Several RL results were hailed among the most interesting breakthroughs of 2017. Everyone was excited about the new possibilities. I was excited.

However, one thing kept bugging me: almost all RL implementations rely on Deep Neural Networks.

Admittedly, I like Neural Networks, but I prefer gradient boosted trees wherever I have the choice. I simply find designing, training, and testing neural networks tedious, with a lot of uncertainty and swings in the evaluation results. In my experience, gradient boosted trees offer better performance (at least on structured datasets) while converging much faster and giving consistent results. So, I began my journey to implement RL (Q-Learning, in this case) with Gradient Boosted Trees.

Theoretically, there is no restriction on the underlying machine learning algorithm for Q-Learning. The most basic version uses a table that maps each (state, action) pair to an expected reward. However, because the table is often far too large in practice, we need a model to approximate it. The model can be any regression algorithm. On this quest, I have tried Linear Regression, SVR, KNN regressors, Random Forest, and many more. Trust me, they all work (to varying degrees).
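For reference, the tabular version boils down to a single update rule, and a function approximator simply replaces the table lookup with a predict call. The sketch below uses illustrative names and values (q_update, alpha), not code from this project:

```python
# Minimal sketch of tabular Q-Learning, assuming a small discrete problem.
# `alpha` (learning rate) and GAMMA (discount factor) are illustrative values.
import numpy as np

n_states, n_actions = 500, 2
Q = np.zeros((n_states, n_actions))   # table of expected rewards per (state, action)
alpha, GAMMA = 0.1, 0.95

def q_update(state, action, reward, next_state, done):
    # Target: immediate reward plus the discounted best value of the next state.
    target = reward if done else reward + GAMMA * Q[next_state].max()
    # Nudge the stored estimate toward the target.
    Q[state, action] += alpha * (target - Q[state, action])
```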

Then why do Deep Neural Networks dominate Q-Learning, and RL in general? There are a couple of reasons:

  1. Partial fit — With neural networks, you can ask the network to learn only the new experiences. Even with experience replay, you usually train the network on a fraction of the memory. Most other algorithms do not support partial fitting and must be retrained on the whole memory, which implies long training times at later steps, once the memory has grown large (see the sketch after this list).
  2. Input sources — If your inputs are images or videos, convolutional networks are the way to go. And since you are already using a CNN, it makes sense to stack fully connected layers on top of it to do Q-Learning.
  3. Literature — Because everyone else does it with Deep Learning, there is a lot more literature to fall back on when you get stuck.
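To make the partial-fit point concrete, here is a small sketch with placeholder data and models; none of this is from the original experiment:

```python
# Incremental updates vs. full retraining (illustrative data only).
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.ensemble import RandomForestRegressor

X_batch, y_batch = np.random.rand(32, 4), np.random.rand(32)    # one replay mini-batch
X_all, y_all = np.random.rand(5000, 4), np.random.rand(5000)    # the whole replay memory

sgd = SGDRegressor()
sgd.partial_fit(X_batch, y_batch)   # keeps previous weights, learns only the new batch
                                    # (a Keras model.fit on a mini-batch behaves similarly)

forest = RandomForestRegressor(n_estimators=50)
forest.fit(X_all, y_all)            # no partial_fit: each call rebuilds the model from scratch,
                                    # so training gets slower as the memory grows
```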

OK. So, Deep Learning is a soulmate to RL? Not so fast, my pal.

  1. Performance — Not every problem deals with unstructured data. And structured data is where gradient boosting shines. There is a good reason that Kaggle competitions are dominated by XGBoost and its friends.
  2. Convergence — Gradient boosted trees generally converge quickly and reliably, unlike neural networks, with which you need a little bit of luck.

These two are the main advantages gradient boosted trees hold over neural networks, and they make GBT a strong contender. But would introducing GBT to Q-Learning be feasible? Or would it be too slow, too resource-hungry, or perform too poorly for the task?


Problem Formulation

This project began out of sheer curiosity about whether other algorithms can be used for Q-Learning. Thus, we are not touching deeper questions, such as implementing decision trees specifically for reinforcement learning, or a deeper analysis of performance and speed.

We are going to experiment with one of the most basic RL problems — CartPole-v0. This environment is provided by OpenAI Gym, a library of environments for test-driving reinforcement learning frameworks. Basically, there is a pole on a cart. You can move the cart left or right. Your goal is to keep the pole from toppling over for as long as possible.
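For anyone new to the environment, the interaction loop is only a few lines. The sketch below uses the classic CartPole-v0 step API; newer gym/gymnasium releases return extra values from reset and step:

```python
import gym

env = gym.make("CartPole-v0")
state = env.reset()                      # 4 floats: cart position/velocity, pole angle/velocity
done, steps = False, 0
while not done:
    action = env.action_space.sample()   # 0 = push left, 1 = push right (random policy here)
    state, reward, done, info = env.step(action)
    steps += 1                           # reward is +1 for every step the pole stays upright
print("Survived", steps, "steps")
```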

I forked a GitHub repository from Greg Surma, where he provides a Keras model to solve the CartPole problem. I reused most of his code, modifying only the parts that needed to change. My kudos to Greg Surma for his code base: gsurma/cartpole on GitHub.

Let’s go over what Surma’s model does step by step.

  1. The CartPole game is initialized. The number of possible actions is noted down.
  2. At each time step, a tuple (x1, x2, x3, x4) is given to us. This tuple represents the current state of the cart and the pole.
  3. We tell the game if we want to move left or right. The decision is taken from our model (Neural Networks, generally) by asking the model for expected rewards for carrying out each action in this state.
  4. After the action is executed, we obtain a new state and a reward.
  5. We ask our model to predict the expected rewards for each action in the new state. Take the highest reward as the expected reward for being in this new state, discount it by a factor GAMMA, and add it to the present reward. This is the new q-value (or expected reward) for this state and action combination.
  6. Put the state/action/reward/new state tuple into some sort of buffer or memory.
  7. Select all or a part of memory. Partially fit the model with the recalled memory.
  8. Go back to step 2. Repeat until we get satisfactory results.
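Put together, the loop looks roughly like the sketch below. It is not Surma's exact code: the regressor, the epsilon schedule, and the helper names are placeholders, and it assumes the classic gym API.

```python
# Hedged sketch of the Q-Learning loop above with a generic scikit-learn regressor.
# All hyperparameters and names are illustrative, not taken from the original repo.
import random
import numpy as np
import gym
from sklearn.linear_model import LinearRegression
from sklearn.multioutput import MultiOutputRegressor

GAMMA, EXPLORATION, EXPLORATION_MIN, EXPLORATION_DECAY = 0.95, 1.0, 0.1, 0.995
env = gym.make("CartPole-v0")
n_actions = env.action_space.n                      # step 1
model, is_fit = MultiOutputRegressor(LinearRegression()), False
memory = []                                         # (state, target q-values) pairs

def predict_q(state):
    # Before the first fit the model cannot predict, so fall back to zeros.
    return model.predict([state])[0] if is_fit else np.zeros(n_actions)

for episode in range(50):
    state = env.reset()                             # step 2
    done = False
    while not done:
        # Step 3: epsilon-greedy choice based on the model's predicted rewards.
        if random.random() < EXPLORATION:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(predict_q(state)))
        next_state, reward, done, _ = env.step(action)          # step 4
        # Step 5: new q-value = reward + discounted best value of the next state.
        target = predict_q(state)
        target[action] = reward if done else reward + GAMMA * predict_q(next_state).max()
        memory.append((state, target))                          # step 6
        # Step 7: refit on the recalled memory (a full retrain here, no partial fit).
        model.fit([s for s, _ in memory], [t for _, t in memory])
        is_fit = True
        state = next_state
        EXPLORATION = max(EXPLORATION_MIN, EXPLORATION * EXPLORATION_DECAY)
```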

As you can see, the whole process is straightforward with Neural Networks (or at the very least, with Keras). We can ask a neural network for predictions on the expected rewards of actions even before the network is trained (of course, you get random numbers, but that is fine for the first round). Once we compute the new q-value, we can select an instance from memory and fit it to the network without losing everything we have trained so far (admittedly, there will be some degradation; that is the reason experience replay was introduced in the first place).

If we want to replace neural networks with other regressors, there are a few requirements.

  1. The regressors should support multiple outputs (one expected reward per action). I put “should” instead of “must” because it is not essential: with a few lines of code, we could train an individual regressor for each action separately, but native multi-output support is cleaner to work with.
    Most algorithms in SKLearn natively support multiple outputs, so you can pretty much drop in Linear Regression, SVR, or Random Forest. Gradient Boosted Trees are a little trickier, as popular libraries such as XGBoost, CatBoost, and LightGBM do not offer multi-output support. Fortunately, we can wrap SKLearn’s MultiOutputRegressor around them and solve this problem (as sketched after this list).
  2. Because GBT must be fit before calling predict, we must supply our own expected rewards for the first round. This involves checking whether the model has been fit. If it has not, we can supply either a fixed value or a random number. I chose to supply 0 as the initial value.
  3. We cannot do a partial fit with GBT. Instead of fitting one instance at a time, we must construct a training set from the whole memory, along with an array of q-values, and completely retrain the regressor at each time step.
  4. Because we retrain on the whole memory, stale experiences with stale q-values never go away on their own. Unlike with neural networks, we must restrict the memory to a much smaller size. I chose 1,000 in my case.
  5. As a side effect, we must push the minimum exploration rate up to prevent the learner from getting stuck. The small memory is likely to fill up with low-quality experiences, so we need to keep exploring.
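These requirements boil down to a handful of substitutions. Below is a hedged sketch with hypothetical names and values (the exploration floor of 0.1 and the helper names are mine, not the repository's):

```python
# Sketch of the LightGBM-specific changes, with illustrative names and values.
import numpy as np
from lightgbm import LGBMRegressor
from sklearn.multioutput import MultiOutputRegressor

MEMORY_SIZE = 1_000            # much smaller than a typical DQN replay buffer (item 4)
EXPLORATION_MIN = 0.1          # raised exploration floor, illustrative value (item 5)

model = MultiOutputRegressor(LGBMRegressor(n_estimators=100))   # multi-output wrapper (item 1)
is_fit = False
memory = []                                                     # (state, target q-values) pairs

def act_values(state, n_actions=2):
    # GBT cannot predict before the first fit, so return zeros initially (item 2).
    return model.predict([state])[0] if is_fit else np.zeros(n_actions)

def remember(state, target_q):
    memory.append((state, target_q))
    if len(memory) > MEMORY_SIZE:       # stale experiences never fade on their own,
        memory.pop(0)                   # so the buffer is capped explicitly

def replay():
    global is_fit
    # No partial_fit with GBT: rebuild the regressor from the entire (small) memory (item 3).
    model.fit([s for s, _ in memory], [t for _, t in memory])
    is_fit = True
```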

The following figure highlights the changes I made to the original code. I used LightGBM for the experiment because of its performance and speed.

Remark: I ran the code through a Jupyter Notebook, which, surprisingly, was faster than running it from the command line. That might have something to do with being on a Windows machine.

The code can also be found on GitHub: palmbook/cartpole.


Result

It works. Sometimes it works surprisingly well. Sometimes it frustratingly requires over 1,000 time steps to solve the problem. Rarely, it does not solve the problem at all.

Here is a graph of the number of time steps taken to solve the problem in each trial.

From the graph, it is clear that, on average, the approach is inferior to Neural Networks. The silver lining is that in a few trials LightGBM solved the problem even before hitting the 100th round, effectively meaning it solved the problem right from the beginning. This suggests the approach has potential. I believe we need better experience replay and memory management to take full advantage of LightGBM in Q-Learning, which would make the whole code base considerably more complicated.

At the very least, we now know that Q-Learning can be done with other regressors as well. But for practical purposes, I think I will stick with Neural Networks for the foreseeable future.

