NeoRL: How it works

Hello everyone,

I would like to give an explanation of the algorithm behind NeoRL (GPU) and NeoRL-CPU (Available here and here). In this post, I will only go over the predictive hierarchy, since the reinforcement learning is still work-in-progress.

Overview

So let’s start with the basic idea behind the algorithm.

NeoRL operates on the theory that the neocortex is a bidirectional predictive hierarchy, similar to HTM (Hierarchical Temporal Memory). However, it differentiates from HTM in several important aspects:

  • Includes temporal pooling
  • Full hierarchy support of multiple regions
  • Does not use columns for predictions, these are reserved for reinforcement learning
  • Uses spiking neurons with “explaining-away” characteristics
  • Continuous inputs and predictions

The connectivity scheme of NeoRL is sort of like that of a convolutional neural network, but the weights are not shared. Also, as mentioned before, it is a bidirectional hierarchy unlike convolutional neural networks. The basic idea is that features are extracted upwards and predictions come downwards. The predictions can then influence the feature extraction to improve the features, which in turn results in better predictions.

NeoRL uses this hierarchy to predict the next timestep. The input is a 2D field (although theoretically it can be any dimension) of scalars, and the predictions are another 2D field of the same dimensions as the input. This makes it sort of like a predictive autoencoder.

IPRSDR_diag

Why predict only one timestep? Well, for one, it’s the theoretical minimum for building a world model. Why is this? Well, if your model understands how to predict the next timestep, then it can predict the timestep after that based on the timestep it just predicted, and then predict from that, and so on.

Now, let’s go into how the spatio-temporal features are extracted (upwards flow).

Spatio-Temporal Feature Extraction

Spatial-temporal feature extraction is based on sparse coding, or SDRs (sparse distributed representations). With SDRs/sparse coding, one attempts to find a sparse (few active) set of bases that can be used to reconstruct the input. This is not unlike a sparse autoencoder, however the way NeoRL does it is a bit different.

Below is an image of the FISTA algorithm being used to extract sparse codes.

IRSDR_codes

The algorithm used in NeoRL is similar to ISTA, but uses spiking neurons to produces time-averaged codes that are always in the [0, 1] range. I found that ISTA seems to not have strict enough bounding for codes, so I opted for something between ISTA and another algorithm called SAILnet (here).

The result works as follows:

  1. Excite neurons from reconstruction error
  2. Inhibit neurons from each other (lateral connectivity)
  3. Reconstruct the average firing rates of the neurons to get a reconstruction error (input – reconstruction)
  4. Repeat 1-3

Like ISTA, neurons are activated off of reconstruction error, but like SAILnet the codes are formed by having a spiked neuron inhibit its neighbors. This is performed sequentially for some iterations until a stable sparse code has been formed.

Once the code has been formed, the feed-forward and lateral weights are updated through Hebbian and anti-Hebbian learning rules respectively, based on the average spiking activities and the reconstruction thereof.

In order to extend this idea to the time domain, one can add an extra set of recurrent connections. This takes the previous average spiking activities and feeds them back in to the current cycle. So it will try to form sparse codes not only of the input, but also itself. This leads to a history compression algorithm.

However, there is one more trick we can do: In order to make the representation as efficient as possible, we can only compress the history/spatial features that lead to low prediction errors. This is accomplished through eligibility traces.

Eligibility traces are often used in reinforcement learning to propagate reward signals back in time to address the distal reward problem. In our case, we are using them as a replacement for backpropagation through time (BPTT), which is typically used with the LSTM algorithm. Instead of having to save a history buffer and update on that to a fixed-length horizon, we can easily propagate prediction errors to past codes with an infinite horizon (well, limited by floating-point accuracy of course).

The idea is that instead of updating the weights for the sparse coder directly, we instead use the weight change we would have applied to increment the eligibility trace. This trace then decays exponentially, giving newer samples more importance. Then, when the prediction error is below average, we want to update on those traces (since the past updates were good for us).

Here is an example of how such a trace variable can look (plotted over time). It’s not shown in the image, but the trace can also be negative.

AccumulatingTrace

 

So that’s how a single layer performs feature extraction.

Prediction

The prediction in NeoRL is very simple. It’s essentially a multilayer perceptron with thresholded units that points in the opposite direction of the feature extraction hierarchy. It can be thought of as overlaying the feature extraction hierarchy – each prediction neuron tries to learn a mapping from local (lateral and feedback) inputs to the next state of the feature extractor neuron it is associated with.

Each layer receives local (lateral) features along with the predictions of the next higher layer (if there is one) as input. It is then trained with a standard time-delayed perceptron learning rule to produce the next SDR at each layer.

The prediction errors are kept around to feed back to the feature extractors as explained earlier. We keep a decaying average of the error, and if the current error is less than the average, we “reward” the feature extractor here, otherwise we do nothing.

NeoRL predicting a periodic function (Red – actual, blue – prediction):

neorl_test

Conclusion

 

NeoRL is still in early stages of development, as is NeoRL-CPU. I hope I was able to give a decent explanation of what is going on in the algorithm. I will post explanations of the reinforcement learning extension as soon as it is up and running as well. Feel free to try out the library yourself!

Until next time,

CireNeikual

Neo/RL – a GPU algorithmic neocortex simulation.

Hello everyone!

I now have a LSTM/ConvNet competitive version of my HTM-inspired algorithm. It now lives as part of a new library I am working on called Neo/RL, (neocortex + reinforcement learning). The reinforcement learning is not yet complete, but the predictive hierarchy is up and running smoothly, and it is able to match or even outperform LSTMs/ConvNets, all while being fully online without any sort of rehearsal or batching. Don’t believe me? Evaluate it for yourself! https://github.com/222464/NeoRL

I am working on additional benchmarks, here is one where I reproduced this paper which used LSTMs to predict moving digits (link to original paper: http://arxiv.org/pdf/1502.04681.pdf)

https://youtu.be/sC4H85snNyc

I alternate between predicting the next input based on current input and predicting the next input based on its own predictions in the video. The video is real-time.

I am also working on a text prediction benchmark, and a time series dataset benchmark (the latter of which already works, but I need the LSTM comparison working properly as well).

To truly convince people, I probably need more benchmarks still though, three is not enough! So, if you are interested in helping out on that front, let me know!

Until next time!

CireNeikual

A New Cortical Column Model

Hello!

As always, I am trying to figure out how the Neocortex works in order to exploit its properties for reinforcement learning.

I believe I have finally found something substantial, at least in how complete it is. My latest model has several bizarre features. Whether or not the whole thing actually works remains to be seen as I am still in the process of coding it. This model is built of components that I have already shown work, so the question is whether the combination of these components leads to desired properties.

The fundamental idea behind this theory I came up with as a submission to Numenta’s HTM challenge. It is as follows: Every cortical column is in itself a tiny reinforcement learning agent that learns to control information flow in order to maximize a reward signal.

There are three important modules to this new system:

  • A bottom-up sparse coding hierarchy
  • A top-down prediction hierarchy
  • The gating SDRRL units (reinforcement learners)

So, I decided to use my previous SDRRL algorithm for this task, but really any reinforcement learning agent should work.

Sparse codes are extracted in a bottom up fashion. However, unlike typical hierarchical sparse coding, the inputs from one layer to the next are modulated by the SDRRL units – this way, the column can learn to drive attention to certain inputs. Each SDRRL unit itself receives sparse codes in a local radius as input, and along with this attention gate, it has a prediction learning gate and a sparse code learning gate. This makes 3 gates in total, although the exact amount may change as I develop this theory further.

The top-down predictive hierarchy learns to predict the sparse codes of the next timestep, but its learning rate is modulated by SDRRL. This way, SDRRL can choose to only predict things that lead to higher rewards in the future – considering that some of the predicted inputs may actually be actions as well, this allows the system to select actions.

The system as a whole gates information flow in a “reinforcement-learning-modulated” fashion, so instead of the purely unsupervised learning typical associated with hierarchical sparse coding/prediction, it “bends” the process towards important information and rewarding prediction-actions.

Below is a diagram of a single column of this model:

CireNeikualColumnModel

Well, on to coding the thing! I am developing a CPU version first, and then a multithreaded/GPU version in OpenCL.

Until next time!

~CireNeikual

 

SDRRL v2.0 and PRSDRRL

Hello everyone!

In my previous post I presented SDRRL, and in the one before that a demo of that algorithm. Since then, I have made many improvements to the algorithm, vastly increasing performance, both in terms of convergence rate and processing power used. I have another demo, but this time it is not a web demo, since it is something I used for internal testing that I just cleaned up a bit 🙂

SDRRL v2.0 Demo

I present to you a simple “Big Dog” style demo, where SDRRL must learn to move a robotic dog body to the right. Almost all of the processing time spent is taken up by the physics engine instead of the AI.

When running the demo, press T to speed up time, and K to reverse the walking direction.

Link to the demo: https://drive.google.com/file/d/0B2btNvgW7MHUV2xfalUzLW5YaXc/view?usp=sharing

Note: It’s made in VS2015, so you may need the VS2015 runtime to run it: http://www.microsoft.com/en-us/download/details.aspx?id=48145

Screenshot:

runner_image1

The current SDR is shown in the bottom left.

PRSDRRL

Wow, that acronym, I know! It stands for Predictive Recurrent Sparse Distributed Representation Reinforcement Learning!

I cannot give a full description of this beast yet, I would be here forever! It’s my latest AGI attempt, with several bizarre features:

  • No stochastic sampling/experience replay – it is fully online
  • Hierarchical feature extraction and prediction (world model building)
  • Prediction-perturbation action selection
  • Imagination! (Yes, really – read more below!)
  • Self-model

So first off, let me say that it is not done yet! So if any of you try the code (which is available on GitHub: https://github.com/222464/BIDInet/blob/master/BIDInet/source/sdr/PRSDRRL.h), don’t expect it to work yet! It is highly experimental.

So, the first feature I harp on a lot – something backpropagation-based solutions lack, and that is fully online learning without experience replay or stochastic sampling (which have horrible computational complexity).

The second feature is there because this is based off of my PRSDR algorithm, which is basically a hierarchical LSTM replacement (for those interested, I have some performance benchmarks, showing the up sides and down sides). It’s the usual HTM-like bidirectional predictive hierarchy thing.

Actions are selected by perturbing the predictions towards actions that lead to higher reward. Right now I am using a simple policy gradient method to do this.

Now, the last two points are sort of the same thing: This model has imagination. I’m serious! The basic idea is as follows: Leak some of your own predictions into your input. This way, the model tries not only to run and predict off of the world, but also itself. It tries to predict its own predictions – lead to a sort of sensory-implanting imagination similar to that that humans have. Sure, this imagination isn’t really necessary for AGI, but it’s a good heuristic to speed up learning I think. It allows for simulation of situations ahead of time, and planning as a result.

Other than that the model uses good ol’ SARSA for reward prediction and temporal difference error generation.

I am working on some demos for it now, and am trying to train it on the ALE. Let’s see how that goes!

Until next time!

SDRRL Tutorial (sparse distributed representation reinforcement learning)

Hello!

In the last post I provided an online demo for a new reinforcement learning algorithm. However, I did not provide many details on how it actually works, certainly not near enough for anyone else to code it without looking at the Javascript themselves.

So, here is a tutorial on how to get your own super-processing-power-efficient reinforcement learner!

SDRRL

Overview

SDRRL (sparse distributed representation reinforcement learning) was created in order to produce a simple reinforcement learning agent that can handle continuous states, actions, and could assign credit back in time. However, what sets it apart from other reinforcement learners with similar capabilities is the computational efficiency of the algorithm.

An important characteristic of SDRRL is also that it doesn’t use backpropagation or the experience replay/rehearsal that if often associated with backpropagation-based reinforcement learners. Instead, it avoids catastrophic interference by using SDRs (sparse distributed representations). These are also a source of computational efficiency, since as their name suggests, we often only have to update a handful of nodes at a time.

SDRRL is an actor-critic algorithm. Both the actor and critic feed off of the sparse distributed representations for input, which in turn are learned in an unsupervised fashion. Q values and the policy are updated using an algorithm similar to CACLA (read) that has been extended with eligibility traces.

Below is an image describing the overall network architecture. Connections are only shown between active nodes. Note that the hidden layer is binary, a by-product of the particular SDR learning algorithm used in this case. All other nodes are continuous.

SDRRLDiag

 

Here, each node in the hidden layer is called a cell. This naming comes from HTM terminology (hierarchical temporal memory). This reinforcement learning agent is meant as a possible explanation of cortical microcolumn functionality (this will be the subject of a future post).

The inhibitory sheath surrounds the cells, and ensures that on average the cells have some desired sparsity level. The sheath can be viewed as a set of connections on the cells. The cells are then fully connected with these inhibitory connections.

In this tutorial, we will use a very simple sparse distributed representation learning algorithm, inspired by more complex models such as SAILnet (read) which use spiking neurons.

SDR formation

An iteration of the algorithm starts by computing the sparse distribution representation of the input:

SDRRLequations_SDR

Where A is the activation value of the cell, I is the inhibition, and S is the resulting state. W, B, and N are weights (B is a per-cell bias). In is the input to the agent.

The idea behind this is as follows: We compute a linear combination of the inputs for each cell based on their “feed forward” weights, and then inhibit them to be sparse using their “lateral” weights. The lateral connections do not provide any inhibition if the activation of the current cell is greater than that of the neighbor. If it is less, however, it receives the inhibition weight N. This is a sort of comparison-based inhibition algorithm. The advantage of this over iterative algorithms that solve for the SDRs is that it is blazing fast.

Now, we compute a reconstruction of the input based on this SDR:

SDRRLequations_recon

This is basically just a reverse-activation of the “feed forward” weights. There are alternatives to this reconstruction method, such as Oja’s rule (Hebbian learning), but I find that this produces better results.

To learn the SDR weights, we use the following updates:

SDRRLequations_SDRupdate

Where alpha, beta, and omega are learning rates. Rho is the target sparsity level.

The intuition here is that we update the feed-forward weights to minimize the reconstruction error. We only update on states (S) that are 1, the rest can be ignored. The “lateral” weights are updated such that the covariance of two cells is equal to the covariance of two ideally sparse cells. Finally, the bias (B) is adjusted to simply maintain sparsity levels.

Now we have a way of learning SDRs. This method is fully online, it doesn’t need any sort of stochastic sampling due to its sparseness.

Q and Actions

Q values and actions are trained on the SDRs we now have. Both are updated with different forms of eligibility traces.

First, we compute the actions Act and Q value from the SDR:

SDRRLequations_actionQ

Where V and P are weights, and f is the logistic sigmoid function.

An exploratory version of the action (Act) must be generated, we will call this Actexp. The exploration can be done with various methods, I simply used a combination of epsilon-greedy and normal distribution perturbations.

Once we have Q, we can compute the temporal difference error using SARSA:

SDRRLequations_actionQbellman

We are then ready to update the Q and action weights:

SDRRLequations_actionQupdate

Where T and E are eligibility traces, and phi and theta are learning rates. Lambda is a trace decay factor, which is piled on top of gamma.

Here, Q uses replacing traces, hence the max function. The traces for the actions function like momentum. They can also be modified to standard accumulating traces if desired, but I find that these work better.

Conclusion

So, if all goes well, you should have a fun little reinforcement learning agent that is fully online and uses no backpropagation or stochastic sampling/experience replay/rehearsal.

A C++ implementation exists as part of a larger project of mine here.

Please let me know if you make anything cool with this! Have fun!

Reinforcement Learning Demo!

Hello!

I created a new reinforcement learning algorithm, and thanks to this new website, I have a three.js demo for it in this post!

The reinforcement learning algorithm is a combination of my one-iteration sparse distributed representation unsupervised learning algorithm as well as a version of the continuous actor-critic learning automaton with eligibility traces.

It works entirely without backpropagation! It also doesn’t use stochastic sampling from a replay buffer. The SDRs assure that there is little to no catastrophic interference. Everything is updated in one go over the weights per timestep.

This algorithm is still a bit of a prototype, but I think it works well enough to warrant a demo!

When running the demo, you can speed up time by dragging the slider in the controls menu.

The bits at the top left represent the current SDR.

The agent should learn to crawl withing a few seconds with the speed turned up to max.
It may get stuck at times, if this is a case just refresh the page!

Have fun!