Bandit Swarm Networks

Hello,

It’s been a while since I posted to this blog, as I have been busy at Ogma Corp., where I develop fast, incremental/online, neuroscience-inspired machine learning algorithms. However, I sometimes dabble in other things as well, so I thought I should start sharing them here.

So first up is Bandit Swarm Networks (BSN). These grew out of a frustration with temporal difference learning methods, which I find are often too restrictive and computationally expensive for many tasks. They also tend to be unstable, and implementations are often very bug-prone.

The idea is to perform a reinforcement learning task without any form of temporal difference learning that requires tracking state-action pairs (which in turn often requires backpropagation, which is slow). So I came up with Bandit Swarm Networks, which use very simple multi-armed bandit algorithms arranged in a swarm to solve a variety of reinforcement learning tasks.

Consider any arbitrary neural network or computational graph. This may be continuous, discrete, dense, or sparse; it doesn’t really matter. What if we could optimize it in a generic way (network type agnostic) that also functions incrementally? For non-incremental scenarios, we have things like genetic algorithms, simulated annealing, and particle swarm optimization – these can all optimize basically anything, but require either a very rigid experiment setup or populations of agents. What we want is to optimize within a single agent in an incremental update regime.

Enter multi-armed bandits, the classic mathematical model of reinforcement learning, which deals with the exploration and exploitation of the stochastic rewards received when an arm of the bandit is pulled. Our particular case requires a multi-armed bandit solution that can handle feedback with variable delay. We find that even the simplest of delayed multi-armed bandit algorithms (a running average) performs quite well for this, although more powerful methods do exist.
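To make this concrete, here is a minimal sketch of such a running-average bandit (rough illustrative code, not the actual implementation): it greedily picks the arm with the highest decaying reward average and nudges that average toward each reward it receives, as described in more detail below.

```python
import numpy as np

class RunningAverageBandit:
    """A K-armed bandit: greedy selection over decaying per-arm reward averages."""

    def __init__(self, num_arms, decay=0.01):
        self.averages = np.zeros(num_arms)  # running reward estimate per arm
        self.decay = decay                  # how quickly old rewards are forgotten
        self.last_arm = 0

    def select(self):
        # Greedily pick the arm with the highest average reward so far.
        self.last_arm = int(np.argmax(self.averages))
        return self.last_arm

    def update(self, reward):
        # Decaying running average, applied only to the arm that was last pulled.
        self.averages[self.last_arm] += self.decay * (reward - self.averages[self.last_arm])
```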

The basic idea is to have each parameter/weight in a neural network represent the selected arm of a multi-armed bandit. This means that each weight has some discrete number of arms associated with it, with the selected arm dictating the value of the weight. To perform the mapping of selected arm to weight, we use the following:

w = logit(\frac{n + 1}{N + 1}), 0 \le n < N

Depending on the number of arms per weight (a hyperparameter, N), this maps an arm index n to a nonlinearly spaced range of weight values (with more values concentrated around 0).
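In code, the mapping is just a logit of a uniformly spaced index. A quick sketch to show how the candidate weight values are spaced:

```python
import numpy as np

def arm_to_weight(n, N):
    """Map arm index n (0 <= n < N) to a weight value via the logit function."""
    p = (n + 1.0) / (N + 1.0)
    return np.log(p / (1.0 - p))  # logit(p) = log(p / (1 - p))

# With N = 8 arms, the candidate weight values are symmetric around zero
# and spaced more densely near zero:
print([round(arm_to_weight(n, 8), 3) for n in range(8)])
# [-2.079, -1.253, -0.693, -0.223, 0.223, 0.693, 1.253, 2.079]
```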

Now, for each weight we keep track of the average reward of each of its arms. This means each weight requires N additional values attached to it.

The arm that is selected is simply the one with the highest average reward. The average reward is updated only for the last arm selected, or (if the network supports it) surrounding arms as well with some falloff.

The average reward for each arm is updated with a simple decaying average. The decay rate determines how quickly the agent will discount rewards, similar to temporal difference learning.
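To be explicit, by a decaying average I mean a simple exponential running average: for the last-selected arm with tracked average q, incoming reward r, and decay rate \alpha, the update is

q \leftarrow (1 - \alpha) \, q + \alpha \, r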

The resulting algorithm is a swarm of little per-weight agents that attempt to locally maximize the global performance of the network. Each multi-armed bandit is assumed to stop changing its selected arm once its reward has been maximized. We are therefore essentially performing a highly local parameter search that receives no information aside from how well it did (through the reward).
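Putting the pieces together, a rough vectorized sketch of a single BSN-parameterized layer might look like the following (again illustrative, not the actual implementation; the tanh activation and the tiny random initialization of the averages are just placeholder choices here):

```python
import numpy as np

def logit(p):
    return np.log(p / (1.0 - p))

class BSNLayer:
    """A fully connected layer in which every weight is a small N-armed bandit."""

    def __init__(self, in_size, out_size, num_arms=64, decay=0.01):
        # Candidate weight values shared by all bandits, via the logit mapping above.
        self.arm_values = logit((np.arange(num_arms) + 1.0) / (num_arms + 1.0))
        # Per-weight decaying reward averages, shape (out, in, N).
        # Tiny random init is just a placeholder to break ties between arms.
        self.averages = np.random.randn(out_size, in_size, num_arms) * 1e-3
        self.selected = np.argmax(self.averages, axis=2)
        self.decay = decay

    def weights(self):
        # Each weight takes the value of its currently best arm (greedy, no exploration).
        self.selected = np.argmax(self.averages, axis=2)
        return self.arm_values[self.selected]

    def forward(self, x):
        # tanh is just a placeholder activation.
        return np.tanh(self.weights() @ x)

    def update(self, reward):
        # Nudge the average of each weight's last-selected arm toward the received reward.
        rows, cols = np.indices(self.selected.shape)
        q = self.averages[rows, cols, self.selected]
        self.averages[rows, cols, self.selected] = q + self.decay * (reward - q)
```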

So how does this algorithm perform? Let’s start simple, with the classic cart-pole experiment (using the OpenAI gym). We will use a simple 1-hidden layer multi-layer perceptron with 8 neurons in the hidden layer. Here is the resulting reward graph:

It quickly learns to solve the environment. In this example, N = 64 (a 64-armed bandit per weight) and the reward average decay rate was 0.01. No exploration was used, but it can be added by randomly selecting a different arm every now and then or by adding some noise to the reward itself.
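For reference, the training loop is about as simple as it gets. Here is a rough sketch of the cart-pole setup (using the classic Gym reset/step API and the BSNLayer sketch from above; exactly how the reward is fed back to the bandits is a design choice, and here the raw per-step reward is simply broadcast to every weight):

```python
import gym
import numpy as np

# Uses the BSNLayer sketch from above: a 4-8-2 network for cart-pole.
env = gym.make("CartPole-v1")
hidden = BSNLayer(4, 8, num_arms=64, decay=0.01)
output = BSNLayer(8, 2, num_arms=64, decay=0.01)

for episode in range(1000):
    obs = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        h = hidden.forward(np.asarray(obs))
        action = int(np.argmax(output.forward(h)))  # greedy discrete action
        obs, reward, done, _ = env.step(action)
        # Broadcast the per-step reward to every weight's bandit.
        hidden.update(reward)
        output.update(reward)
        total_reward += reward
    print(episode, total_reward)
```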

So now let’s try something more complicated. Let’s control the Minitaur robot in the PyBullet simulation.

Unnatural looking, but very fast!

It runs pretty well. The video again uses a single-hidden-layer MLP, but this time with 16 neurons and recurrent connections.

Now let’s try the BipedalWalker-v2 environment in the OpenAI Gym:

The video is at episode 1200. This used the same network structure as the Minitaur, but without the recurrent connections (they were not necessary this time).

All the experiments above were run on a single CPU core.

While the BSN technique seems really silly and is trivial to implement (so much so that I only felt the need to describe it in text for the most part), it also seems to compete with much heavier algorithms like A2C, A3C, DDPG, PPO, and DQN, at least on the tasks selected. More testing is needed to determine what the downsides of this algorithm are.

Source code will be released soon; I just need to perform some cleanup (although that seems almost unnecessary given how simple the algorithm is 😉).

Until next time!
