ART Theft Auto

While working on AOgmaNeo at Ogma, I test a ton of different encoding methods. AOgmaNeo, and SPH (Sparse Predictive Hierarchies) in general, rely heavily on high-quality, incrementally-learned sparse codes for their encoders. Recently, I have been playing a lot with ART (Adaptive Resonance Theory)-based encoders. ART is a neuroscience-inspired online/incremental learning method with tons of variants for different tasks. Here is a good survey paper. I have experimented with ART encoders in the past, but this time I found a new way to make them distributed, which seems to drastically outperform previous attempts in terms of both runtime speed and quality of the results.

The new ART-based encoders use a 2-stage process to learn distributed codes. First, each column in SPH (an ART module) performs the standard ART activation algorithm (search and resonance). Then, a second stage selects among the columns, allowing only the most active ones to participate in learning. This results in column-wise distributed codes.
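
To give a rough idea of the two-stage scheme, here is a minimal numpy sketch (not taken from the actual AOgmaNeo code; it assumes fuzzy ART for the per-column stage, and the vigilance, learning rate, and top-k values are purely illustrative; real columns would also see local receptive fields rather than the whole input):

    import numpy as np

    def fuzzy_art_search(x, W, rho=0.75, alpha=0.001):
        # Stage 1 for one column: standard fuzzy ART choice + vigilance search.
        # x is the (complement-coded) input, W holds one category prototype per row.
        T = np.minimum(x, W).sum(axis=1) / (alpha + W.sum(axis=1))  # choice function
        for j in np.argsort(-T):                                    # search in order of activation
            if np.minimum(x, W[j]).sum() / x.sum() >= rho:          # vigilance test -> resonance
                return j, T[j]
        return int(np.argmax(T)), float(T.max())  # no resonance; a full ART would commit a new category here

    def encode_and_learn(x, columns, k=4, beta=0.5):
        # Stage 2: every column runs its own ART search, but only the k most
        # active columns get to learn, giving a column-wise distributed code.
        winners, activations = zip(*(fuzzy_art_search(x, W) for W in columns))
        winners = np.array(winners)
        activations = np.array(activations)
        for c in np.argsort(-activations)[:k]:
            j = winners[c]
            columns[c][j] = beta * np.minimum(x, columns[c][j]) + (1.0 - beta) * columns[c][j]
        return winners  # one winning cell index per column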

If you want to know more about AOgmaNeo and SPH in general, here is our Handmade Network entry. This contains a bunch of useful links.

In the past, I attempted to re-create YouTuber Sentdex’s GAN Theft Auto experiment, but using SPH. It worked, but it didn’t run quite as fast as I would have liked and lacked a lot of detail. It ran in real-time on the CPU with 8 threads, but it turns out we can do better. You can see the video and code for that attempt here.

For those who haven’t seen it, GAN Theft Auto is a model that uses Generative Adversarial Networks to learn a simple simulation of the game Grand Theft Auto V from a video dataset of movements collected on a bridge in the game.

With the new ART-based encoders, we can take things to the next level. With 8 CPU threads, I can train a better model than before in just 10 minutes. While the following results are still noisy compared to the original GAN-based result, and notably lack upscaling, I think it’s cool that we can run such things in the browser with WebAssembly. I also only have access to a sample of the dataset, not the whole thing.

Keep in mind that the original GAN Theft Auto was trained for quite some time on a DGX A100. It also requires quite a powerful GPU for inference. My version, however, can be trained at roughly the same speed as inference (which runs at 434 fps on my machine). And, well, it runs in the browser, using your CPU.

Anyways, without further ado, here is “ART Theft Auto”.

ART Theft Auto

Controls: A/D to turn. The demo may take a bit to load.

LD47 game “Dopamine Drip” Post-Jam version

Hello, here is the post-jam version of my Ludum Dare 47 entry. The main change is the AI has a better reward function now and there are 2 additional players. It isn’t nearly as exploitable now. Here is the Jam version.

As before, you (the green agent) move with WASD.

NOTE: It may take a minute to load.

LD47

A bit more information on the AI – it uses AOgmaNeo (the Arduino-compatible version of OgmaNeo2) compiled to WebAssembly (with Emscripten). Graphics are handled by the SFML-like (but WebAssembly-compatible) library SMK.

Each player has 18 columns with 16 cells in each column. Overall, there are 1,440 cells and almost 1 million synapses being simulated in the game.

The agents are trained via self-play, but they continue learning while you play (online learning).

LD47 Game using RL

A little game I made for Ludum Dare 47. It uses reinforcement learning (via AOgmaNeo) compiled with WebAssembly. It’s a capture-the-flag game, but you play with and against reinforcement learning agents!

WASD keys to move. There isn’t really a goal yet other than trying to get more points than the red team.

LD47

Bandit Swarm Networks

Hello,

It’s been a while since I posted to this blog, as I have been busy at Ogma Corp., where I develop fast, incremental/online-learning, neuroscience-inspired machine learning algorithms. However, I sometimes dabble in other things as well, so I thought I should start sharing them here.

So first up is Bandit Swarm Networks (BSN). These basically grew out of a frustration with temporal difference learning methods, which I find too restrictive and computationally expensive for many tasks. They are also often unstable, and implementations tend to be very error (bug) prone.

The idea is to perform a reinforcement learning task without temporal difference learning and its tracking of state-action pairs (which in turn often requires backpropagation, which is slow). So, I came up with Bandit Swarm Networks, which use very simple multi-armed bandit algorithms configured in a swarm in order to solve a variety of reinforcement learning tasks.

Consider any arbitrary neural network or computational graph. This may be continuous, discrete, dense, or sparse; it doesn’t really matter. What if we could optimize it in a generic way (network type agnostic) that also functions incrementally? For non-incremental scenarios, we have things like genetic algorithms, simulated annealing, and particle swarm optimization – these can all optimize basically anything, but require either a very rigid experiment setup or populations of agents. What we want is to optimize within a single agent in an incremental update regime.

Enter multi-armed bandits, the classic mathematical model of reinforcement learning that deals with exploration and exploitation of the stochastic rewards received when the arm of some multi-armed bandit is pulled. Our particular case requires a multi-armed bandit solution with feedback of variable delay. We find that even the simplest of delayed multi-armed bandit algorithms (a running average) performs quite well for this, although more powerful methods do exist.

The basic idea is to have each parameter/weight in a neural network represent the selected arm of a multi-armed bandit. This means that each weight has some discrete number of arms associated with it, with the selected arm dictating the value of the weight. To perform the mapping of selected arm to weight, we use the following:

w = logit(\frac{n + 1}{N + 1}), 0 \le n < N

Depending on the number of arms per weight (a hyperparameter, N), this maps an arm index n to a nonlinear range of weight values (with more values concentrated around 0).

So now, for each weight we keep track of the average reward of each arm that the weight has. This results in each weight requiring N additional values attached to it.

The arm that is selected is simply the one with the highest average reward. The average reward is updated only for the last arm selected, or (if the network supports it) surrounding arms as well with some falloff.

The average reward for each arm is updated with a simple decaying average. The decay rate determines how quickly the agent will discount rewards, similar to temporal difference learning.

The resulting algorithm is a swarm of per-weight little agents that attempt to locally maximize the global performance of the network. Each multi-armed bandit is assumed to never change its selected arm once reward has been maximized. We are therefore essentially performing a highly local parameter search that receives no information aside from how well it did (through the reward).
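
Since the algorithm really is this simple, a quick numpy sketch of one bandit-swarm layer may be clearer than more prose (this is just an illustrative sketch, not the code I will be releasing; the dense layer shape and the tanh activation are placeholder choices):

    import numpy as np

    class BanditSwarmLayer:
        # one dense layer where every weight is chosen by its own N-armed bandit
        def __init__(self, n_in, n_out, n_arms=64, decay=0.01, seed=0):
            rng = np.random.default_rng(seed)
            # running average reward per (output, input, arm); tiny noise breaks ties
            self.values = rng.normal(0.0, 1e-3, (n_out, n_in, n_arms))
            self.decay = decay
            # arm index n -> weight value logit((n + 1) / (N + 1)), denser around 0
            p = (np.arange(n_arms) + 1.0) / (n_arms + 1.0)
            self.arm_weights = np.log(p / (1.0 - p))
            self.selected = np.argmax(self.values, axis=-1)

        def forward(self, x):
            # each weight takes the value of its currently best arm
            self.selected = np.argmax(self.values, axis=-1)
            return np.tanh(self.arm_weights[self.selected] @ x)

        def reward(self, r):
            # decaying-average update, applied only to the last selected arms
            sel = self.selected[..., None]
            v = np.take_along_axis(self.values, sel, axis=-1)
            np.put_along_axis(self.values, sel, v + self.decay * (r - v), axis=-1)

Stacking two of these layers gives the small MLPs used in the experiments below.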

So how does this algorithm perform? Let’s start simple, with the classic cart-pole experiment (using the OpenAI gym). We will use a simple 1-hidden layer multi-layer perceptron with 8 neurons in the hidden layer. Here is the resulting reward graph:

It quickly learns to solve the environment. In this example, N=64 (a 64-armed bandit per weight), and the average decay rate was 0.01. No exploration was used, but it can be added by randomly selecting a different arm every now and then, or by adding some noise to the reward itself.
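
For reference, a training loop along these lines might look like the following, using the BanditSwarmLayer sketch from above (written against the newer gymnasium API; the exact reward signal isn't spelled out above, so the per-step feedback with a simple failure penalty here is just one illustrative choice):

    import gymnasium as gym  # the plot above used the older OpenAI gym API
    import numpy as np

    env = gym.make("CartPole-v1")
    hidden = BanditSwarmLayer(4, 8, n_arms=64, decay=0.01)
    output = BanditSwarmLayer(8, 2, n_arms=64, decay=0.01)

    for episode in range(300):
        obs, _ = env.reset()
        done, total = False, 0.0
        while not done:
            action = int(np.argmax(output.forward(hidden.forward(obs))))
            obs, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            total += reward
            # per-step feedback to every bandit; penalize the pole actually falling
            signal = -1.0 if terminated else reward
            hidden.reward(signal)
            output.reward(signal)
        print(episode, total)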

So now let’s try something more complicated. Let’s control the Minitaur robot in the PyBullet simulation.

Unnatural looking, but very fast!

It runs pretty well. The video again uses a single-hidden-layer MLP, but this time with 16 neurons and recurrent connections.

Now let’s try the BipedalWalker-v2 environment in the OpenAI Gym:

The video is at episode 1200. This uses the same network structure as the Minitaur experiment, but without the recurrent connections (they were not necessary this time).

All the experiments above were run on a single CPU core.

While the BSN technique seems really silly and is trivial to implement (so much so that I only felt the need to describe it in text for the most part), it also seems to compete with much heavier algorithms like A2C, A3C, DDPG, PPO, DQN, etc., at least on the tasks selected. More testing is needed to determine what the downsides of this algorithm are.

Source code will be released soon, I just need to perform some cleanup (although it seems almost unnecessary with how simple the algorithm is 😉 ).

Until next time!

Generative SDRs

Hello again,

It’s been a while! I have been working on AI related stuff of course, but what exactly I have been spending the bulk of my time on will be revealed in the near future.

For now though, I would like to show a simple little demo I made for showing the generative characteristics of SDRs (Sparse Distributed Representations).

In terms of generative models, Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs) currently seem to be among the most popular.

I have come up with another way of doing a generative model relatively easily. It’s based on K-sparse autoencoders (here). SDRs are implicitly generative, as they force a certain (binary) distribution on the hidden units. K-sparse autoencoders can be used to learn SDRs with some modifications: The hidden units should be binary (top K are set to 1, the rest to 0), and training proceeds by minimizing the reconstruction error with tied weights.

With these modifications, one can learn some nice features from data. It is also possible to control certain features by forcing a correlation between a feature and an input pattern. I did this by having two networks: the K-sparse autoencoder network (with binary states + reconstruction), and a random projection network that maps “control” features to random biases in the hidden units of the autoencoder.

The resulting system learns to incorporate the controlled features into the hidden features such that the reconstruction from a biased set of features produces the image with the desired properties.
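
To make that a bit more concrete, here is a minimal numpy sketch of the idea (not the actual GSDR code linked below; the sizes, the bias scale, and the exact update rule are simplified for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_hidden, n_ctrl, k = 784, 256, 10, 16

    W = rng.normal(0.0, 0.01, (n_hidden, n_in))   # tied encoder/decoder weights
    P = rng.normal(0.0, 2.0, (n_hidden, n_ctrl))  # fixed random projection of the control features

    def encode(x, control):
        # hidden activations biased by the control projection, then binary top-K
        act = W @ x + P @ control
        sdr = np.zeros(n_hidden)
        sdr[np.argsort(act)[-k:]] = 1.0
        return sdr

    def train_step(x, control, lr=0.01):
        sdr = encode(x, control)
        recon = W.T @ sdr                 # decode with the transposed (tied) weights
        W[sdr > 0.5] += lr * (x - recon)  # reduce reconstruction error on the active units
        return recon

    def generate(control):
        # reconstruct from the control bias alone, i.e. a biased set of hidden features
        return W.T @ encode(np.zeros(n_in), control)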

So let’s try the standard MNIST test. The control features are a 1-hot vector of length 10 (one for each digit), while the hidden size is 256 units. The random projections were initialized to relatively high weights to overcome the K-sparsity’s thresholding. After training for a few seconds, we can see the results:

I applied a thresholding shader to make it look a bit prettier, although it does cut off some transitions a little.

If you would like to experiment with GSDR yourself, here is the code:

https://github.com/222464/GSDR

Until next time!


MiniNeoRL

Hello,

I have recently made a small Python port of my GPU NeoRL library. It doesn’t have the same feature set; the most important differences are listed here:

  • It is fully connected (not sparsely connected)
  • It uses a new method for organizing temporal data (predictive coding)
  • It is slower
  • It is much easier to understand!

I call this port MiniNeoRL, since it serves mostly to help me prototype new algorithms and to explain the algorithms to others. Now, I am not exactly a Python expert, but I think the code is simple enough that, with some explanation, it should be easy to understand.

Along with MiniNeoRL I have made this slideshow that serves as a brief overview of what NeoRL is and how it works:

NeoRL_presentation

Until next time!

Driving a Car with NeoRL

Hello,

In my ongoing quest to improve NeoRL into a generic, intelligent reinforcement learning agent, I created a demo I would like to share. It’s a simple demo, but still interesting in my opinion.

The agent receives 1D vision data (since the game is 2D), and must drive a car around a thin track. This is essentially a “thread the needle” task, where the AI requires relatively precise control in order to obtain reward.

As a human, I was not able to make it as far as the AI did. The AI almost completed the entire track, while I was only able to make it about half-way. It looks easy, but it’s not!

Training time: About 2 minutes.

Here is a video of the car being controlled by NeoRL:

Until next time!

Generating Audio with NeoRL’s Predictive Hierarchy

Hello everyone,

Small (but hopefully interesting) update!

A while back I showed how I was able to memorize music and play it back using HTSL. Now, with NeoRL, I can not only remember music but also generate more music based on sample data.

As is usually done with these predictive-generative scenarios, I added some noise to the input as it runs off of its own predictions. This causes it to diverge from the original data somewhat, resulting in semi-original audio.
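
The generation loop itself is roughly the following (only a sketch; hierarchy.step here is a stand-in for the actual NeoRL predictive hierarchy calls in the C++ example linked below, and the noise level is arbitrary):

    import numpy as np

    def generate(hierarchy, seed_window, n_steps, noise_std=0.02, seed=0):
        # run the trained predictive hierarchy on its own predictions, plus a little noise
        rng = np.random.default_rng(seed)
        window = np.asarray(seed_window, dtype=np.float32)
        samples = []
        for _ in range(n_steps):
            pred = hierarchy.step(window, learn=False)  # hypothetical call: predict the next audio chunk
            window = pred + rng.normal(0.0, noise_std, pred.shape).astype(np.float32)
            samples.append(pred)
        return np.concatenate(samples)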

Here is some audio data from a song called “Glorious Morning” by Waterflame:

Here is a sample of some audio I was able to generate, after training off of raw audio data, without preprocessing:

Training time: about 1 minute.

A problem with this is that it is only being trained on one song right now, so the result is basically just a reorganized form of the original plus noise. I am going to try to train it on multiple songs, extract end-of-sequence SDRs, and use these to generate songs with a particular desired style based on the input data styles. Longer training times should help clear up the noise a bit too (hopefully).

Full source code is available in the NeoRL repository. It is the Audio_Generate.cpp example. Link to repository here.

Until next time!


NeoRL – Self-Sustaining Predictions

Hello!

Just a small post on an update to my NeoRL algorithm.

A while ago, I showed an MNIST prediction demo. Many rightfully thought that it might just be learning an identity mapping. But with some slight modifications, I can show that the algorithm does indeed predict properly, and does so fully online.

I changed the SDRs to binary; this way, there is no decay/explosion when continuously feeding its own predictions back in as input. So I can now run NeoRL’s predictive hierarchy (without RL) on itself indefinitely. It simplifies the digits to noisy blobs (since the digits are chosen randomly, it can’t predict uniform randomness), but the movement trajectories are preserved.

Another interesting thing is how fast it learns this – I trained it for about 1 minute to get the video below. It also ran in real-time while training (I didn’t write a “speed mode” in the demo yet).

The binary SDRs do have some downsides, though. While the indefinite prediction is interesting, binary SDRs sacrifice some representational power by removing the ability to have scalar SDRs.

So here’s a video. The first half shows it just predicting the next frame based on the input on the left. Then, in the second half, the input on the left is ignored (it is not fed into the agent at all); instead, the agent’s own predictions are used as input. As a result, it plays a sort of video of its own knowledge of the input.

Until next time!

NeoRL – Reinforcement Learning

Hello everyone!

I continue to work on NeoRL, and have a new way of applying reinforcement learning to the underlying predictive hierarchy. So far it works better than previous algorithms, despite not yet being debugged or optimized.

The new reinforcement learning algorithm is based on the deterministic policy gradient version (action gradient) of my SDRRL algorithm (SDRRL 2.0). Recall that a single SDRRL agent has an architecture like this (see here for the original post: link):

SDRRLDiag

It was able to solve a large variety of simple tasks very quickly while using next to no processing power, due to its sparsity. But it had a problem: it didn’t scale well, since it didn’t have a hierarchy. I have now come up with an efficient way of adding hierarchy to this system.

Consider now a layer of SDRRL units, with sparse, local connectivity. It uses multiple Q nodes for different portions of the layer (they are also convolutional). The architecture looks like this:

ConvSDRRL


There can be as many action layers as desired. In my model, I use one action layer for the output actions and one for attention.

The input comes from the layer below or the input layer to the system. In my implementation it is 2D, so it can work easily on images and run well on the GPU. The hidden layer performs prediction-assisted sparse coding, so as to form a predictive hierarchy. Once the sparse codes are found, we activate sub-networks, with the action layers as input, through the “on” bits of the sparse codes. This is basically a convolutional form of the SDRRL 2.0 algorithm. Actions are then created by starting from the predicted action and then moving along the deterministic policy gradient.
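
As a rough illustration of that last step, here is a small numpy sketch of ascending the action gradient from the predicted action (the one-hidden-layer Q function is only a stand-in; the real system does this inside the convolutional SDRRL layers on the GPU):

    import numpy as np

    def q_forward(state, action, params):
        # stand-in Q function: one tanh hidden layer, linear output
        x = np.concatenate([state, action])
        h = np.tanh(params["W1"] @ x + params["b1"])
        return params["w2"] @ h + params["b2"], h

    def refine_action(state, predicted_action, params, steps=10, lr=0.1):
        # start from the hierarchy's predicted action and move along dQ/da
        a = np.array(predicted_action, dtype=np.float64)
        n_state = len(state)
        for _ in range(steps):
            _, h = q_forward(state, a, params)
            dz = params["w2"] * (1.0 - h * h)  # backprop through the tanh layer
            dx = params["W1"].T @ dz           # gradient of Q w.r.t. the full input
            a += lr * dx[n_state:]             # keep only the action part
        return np.clip(a, -1.0, 1.0)

The attention action layer described below can, in principle, be refined the same way, with its output treated as which input regions to block.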

As always, features are extracted upwards, and actions flow downwards. Now, actions are integrated into the lower layers as another set of sparse codes in the SDRRL hidden layer. So the full state of the hidden layer in SDRRL contains the feed-forward features and the feed-back action codes.

As explained earlier, I use two layers of actions: one for the action to be taken (output), and another for attention. Attention works by blocking off regions of the input so that they are ignored. Which regions should be blocked is learned through the deterministic policy gradient.

I just finished coding this thing, and got excited when I saw it working without any tuning at all, even though it likely still has many bugs. So I decided to make a video of it moving to the right (not shown, but it still works when I tell it to reverse directions):

Until next time!

(For those who missed it, the repository for this is here)