NeoRL – Reinforcement Learning

Hello everyone!

I continue to work on NeoRL, and I have a new way of applying reinforcement learning to the underlying predictive hierarchy. So far it works better than my previous algorithms, despite not yet being debugged or optimized.

The new reinforcement learning algorithm is based on the deterministic policy gradient version (action gradient) of my SDRRL algorithm (SDRRL 2.0). Recall that a single SDRRL agent has an architecture like this (see here for the original post: link):


It was able to solve a large variety of simple tasks very quickly while using next to no processing power, due to its sparsity. But it had a problem: it didn't scale well, since it lacked a hierarchy. I have now come up with an efficient way of adding hierarchy to this system.

Now consider a layer of SDRRL units with sparse, local connectivity. It uses multiple Q nodes covering different portions of the layer (these are also convolutional). The architecture looks like this:



There can be as many action layers as desired. In my model, I use one action layer for the output actions and one for attention.
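To make the "multiple Q nodes for different portions of the layer" idea concrete, here is a minimal NumPy sketch. The patch size, pooling, and weight sharing are my assumptions for illustration, not the actual NeoRL layout:

```python
import numpy as np

# Hypothetical sketch: each Q node covers one local patch of the hidden
# layer, and the same Q weights are shared across patches (convolutional).
# Patch size and shapes are assumptions, not taken from the NeoRL code.

def local_q_values(hidden, q_weights, patch=4):
    # hidden: 2D hidden-layer state; q_weights: one weight per cell in a patch.
    h, w = hidden.shape
    qs = []
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            region = hidden[i:i + patch, j:j + patch]
            qs.append(np.sum(region * q_weights))  # shared weights per patch
    return np.array(qs)

hidden = np.ones((8, 8))                 # toy hidden-layer state
q_weights = np.full((4, 4), 0.25)        # shared Q weights for a 4x4 patch
qs = local_q_values(hidden, q_weights)   # one Q estimate per layer portion
```

Each entry of `qs` is a local value estimate; a layer-wide value could then be, say, their mean.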

The input comes from the layer below, or from the input layer of the system. In my implementation it is 2D, so it can work easily on images and run well on the GPU. The hidden layer performs prediction-assisted sparse coding, so as to form a predictive hierarchy. Once the sparse codes are found, the "on" bits of the codes activate sub-networks that take the action layers as input. This is essentially a convolutional form of the SDRRL 2.0 algorithm. Actions are then produced by starting from the predicted action and moving along the deterministic policy gradient.
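The action step above (start from the predicted action, then follow the gradient of Q with respect to the action) can be sketched as follows. The critic here is a toy quadratic so the gradient is analytic; the function names and learning rate are mine, not from the NeoRL code:

```python
import numpy as np

# Hypothetical sketch of deterministic-policy-gradient action refinement:
# begin at the predicted action and ascend dQ/da. The quadratic critic is
# a stand-in so the example is self-contained.

def q_value(state, action, W):
    # Toy critic: Q(s, a) = -||a - W s||^2 (maximized when a = W s).
    target = W @ state
    return -np.sum((action - target) ** 2)

def q_grad_action(state, action, W):
    # Analytic dQ/da for the toy critic above.
    return -2.0 * (action - W @ state)

def refine_action(state, predicted_action, W, lr=0.1, steps=20):
    a = predicted_action.copy()
    for _ in range(steps):
        a += lr * q_grad_action(state, a, W)  # move along the action gradient
    return np.clip(a, -1.0, 1.0)              # keep actions bounded

state = np.array([0.5, -0.2])
W = np.eye(2)
predicted = np.zeros(2)      # the action predicted by the hierarchy
refined = refine_action(state, predicted, W)
```

The refined action should score higher under Q than the raw prediction, which is the whole point of the gradient step.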

As always, features are extracted upwards and actions flow downwards. Actions are integrated into the lower layers as another set of sparse codes in the SDRRL hidden layer, so the full hidden-layer state in SDRRL contains both the feed-forward feature codes and the feed-back action codes.
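A minimal sketch of that combined hidden state, assuming a simple k-sparse (top-k) coding step; the dictionaries, sizes, and k are illustrative assumptions:

```python
import numpy as np

# Hypothetical sketch: the full hidden state holds the feed-forward sparse
# feature code alongside the fed-back sparse action code. The k-sparse
# coding step below is a stand-in for NeoRL's prediction-assisted coder.

def sparse_code(x, dictionary, k):
    # Keep the k dictionary atoms with the strongest responses ("on" bits).
    responses = dictionary @ x
    code = np.zeros(dictionary.shape[0])
    code[np.argsort(responses)[-k:]] = 1.0
    return code

rng = np.random.default_rng(0)
D_feat = rng.standard_normal((16, 8))  # feature dictionary (assumed shape)
D_act = rng.standard_normal((16, 4))   # action-code dictionary (assumed)

input_below = rng.standard_normal(8)   # feed-forward input from below
action_above = rng.standard_normal(4)  # feed-back action from above

features = sparse_code(input_below, D_feat, k=3)
action_code = sparse_code(action_above, D_act, k=3)

# Full hidden state: feed-forward and feed-back codes side by side.
hidden_state = np.concatenate([features, action_code])
```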

As explained earlier, I use two layers of actions: one for the action to be taken (the output), and another for attention. Attention works by blocking off regions of the input so that they are ignored. Which regions to block is learned through the deterministic policy gradient.
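Mechanically, the attention action can be pictured as a per-region mask applied to the 2D input. The region size and the 0.5 threshold below are my assumptions; in NeoRL the mask values themselves would be what the policy gradient learns:

```python
import numpy as np

# Hypothetical sketch of the attention action: one mask value per region
# of a 2D input; regions with low mask values are blocked (zeroed out).

def apply_attention(image, mask, region=4):
    # mask has one entry per (region x region) block of the image.
    out = image.copy()
    h, w = image.shape
    for i in range(0, h, region):
        for j in range(0, w, region):
            if mask[i // region, j // region] < 0.5:  # assumed threshold
                out[i:i + region, j:j + region] = 0.0
    return out

image = np.ones((8, 8))
mask = np.array([[1.0, 0.0],
                 [0.0, 1.0]])  # attend only to the diagonal regions
attended = apply_attention(image, mask)
```

Only the two attended 4x4 regions survive; everything else is masked to zero before reaching the sparse coder.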

I just finished coding this, and got excited when I saw it working without any tuning at all, even though it likely still contains many bugs. So I decided to make a video of the agent moving to the right (not shown: it still works when I tell it to reverse direction):

Until next time!

(For those who missed it, the repository for this is here)
