In my previous post I presented SDRRL, and in the one before that a demo of that algorithm. Since then, I have made many improvements to the algorithm, vastly increasing performance in both convergence rate and processing power used. I have another demo, but this time it is not a web demo; it is something I used for internal testing that I just cleaned up a bit 🙂
SDRRL v2.0 Demo
I present to you a simple “Big Dog” style demo, where SDRRL must learn to move a robotic dog body to the right. Almost all of the processing time is taken up by the physics engine rather than the AI.
When running the demo, press T to speed up time, and K to reverse the walking direction.
Note: It’s made in VS2015, so you may need the VS2015 runtime to run it: http://www.microsoft.com/en-us/download/details.aspx?id=48145
The current SDR is shown in the bottom left.
PRSDRRL

Wow, that acronym, I know! It stands for Predictive Recurrent Sparse Distributed Representation Reinforcement Learning!
I cannot give a full description of this beast yet; I would be here forever! It’s my latest AGI attempt, with several bizarre features:
- No stochastic sampling/experience replay – it is fully online
- Hierarchical feature extraction and prediction (world model building)
- Prediction-perturbation action selection
- Imagination! (Yes, really – read more below!)
First off, let me say that it is not done yet! So if any of you try the code (which is available on GitHub: https://github.com/222464/BIDInet/blob/master/BIDInet/source/sdr/PRSDRRL.h), don’t expect it to work yet! It is highly experimental.
The first feature is one I harp on a lot, and it is something backpropagation-based solutions lack: fully online learning, without experience replay or stochastic sampling (which have horrible computational complexity).
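To make the distinction concrete, here is a toy contrast in Python (the actual implementation is C++ and SDR-based; all names below are illustrative): a fully online learner updates from each sample exactly once as it arrives, while a replay-based learner must store samples and revisit them in shuffled batches.

```python
import random

def online_sgd(stream, lr=0.1):
    """Fit y = w * x one sample at a time, discarding each sample after use."""
    w = 0.0
    for x, y in stream:  # single pass, O(1) memory
        w += lr * (y - w * x) * x
    return w

def replay_sgd(stream, lr=0.1, epochs=50, seed=0):
    """Same model, but trained from a stored replay buffer."""
    rng = random.Random(seed)
    buffer = list(stream)           # O(N) memory just to hold the data
    w = 0.0
    for _ in range(epochs):         # many passes over the stored samples
        for x, y in rng.sample(buffer, len(buffer)):
            w += lr * (y - w * x) * x
    return w

# A stream of (x, y) pairs from y = 3x; both approach w = 3,
# but the online learner never stores a sample.
data = [((i % 10 + 1) / 10.0, 3.0 * ((i % 10 + 1) / 10.0)) for i in range(500)]
print(online_sgd(data))
print(replay_sgd(data))
```

The online version processes the stream in one pass with constant memory; the replay version pays for a buffer and repeated passes over it, which is the cost the post is complaining about.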
The second feature is there because this is built on my PRSDR algorithm, which is basically a hierarchical LSTM replacement (for those interested, I have some performance benchmarks showing its upsides and downsides). It’s the usual HTM-like bidirectional predictive hierarchy.
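As a very rough sketch of what “bidirectional predictive hierarchy” means (scalar states stand in for SDRs here, and the update rules are invented for illustration): each layer encodes its input on an upward sweep, then predicts its *next* input on a downward sweep, conditioned on feedback from the layer above.

```python
class Layer:
    def __init__(self):
        self.state = 0.0        # recurrent feature (an SDR in the real thing)
        self.prediction = 0.0

    def up(self, x):
        """Feed-forward pass: leakily encode the input into the state."""
        self.state = 0.5 * self.state + 0.5 * x
        return self.state

    def down(self, feedback):
        """Feed-back pass: predict the next input from state + feedback."""
        self.prediction = 0.5 * self.state + 0.5 * feedback
        return self.prediction

def step(layers, x):
    h = x
    for layer in layers:            # upward sweep: each layer encodes the one below
        h = layer.up(h)
    fb = layers[-1].state           # top layer predicts from its own state
    for layer in reversed(layers):  # downward sweep: predictions flow back down
        fb = layer.down(fb)
    return fb                       # the hierarchy's prediction of the next input

layers = [Layer(), Layer()]
for t in range(20):
    pred = step(layers, 1.0)  # constant input: the prediction settles on it
print(pred)
```

The point of the structure is that higher layers see slower, more abstract features, and their feedback shapes the predictions the lower layers make.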
Actions are selected by perturbing the predictions towards actions that lead to higher reward. Right now I am using a simple policy gradient method to do this.
Now, the last two points are sort of the same thing: this model has imagination. I’m serious! The basic idea is as follows: leak some of the model’s own predictions into its input. This way, the model tries to predict not only the world, but also itself. It tries to predict its own predictions, leading to a sort of sensory-implanting imagination similar to the one humans have. Sure, this imagination isn’t strictly necessary for AGI, but I think it’s a good heuristic for speeding up learning. It allows the model to simulate situations ahead of time, and to plan as a result.
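The leak itself is just a blend of observation and the model’s previous prediction of that observation (the blend factor here is my assumption, not a value from the actual code):

```python
def imagined_input(observation, prev_prediction, leak=0.3):
    """Mix the real observation with the model's own previous prediction."""
    return (1.0 - leak) * observation + leak * prev_prediction

# With leak = 0 the model runs purely on the world; with leak = 1 it runs
# purely on itself, i.e. it "dreams" forward from its own predictions.
obs = 0.0    # e.g. the sensors report nothing
pred = 1.0   # the model's last prediction of this input
x = imagined_input(obs, pred, leak=1.0)
print(x)  # 1.0: the model now feeds entirely on its own prediction
```

Intermediate leak values give the "simulate ahead of time" behavior: the input stream is partly real, partly the model's own rollout.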
Other than that, the model uses good ol’ SARSA for reward prediction and temporal difference error generation.
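For completeness, here is the SARSA update in its minimal tabular form (the actual code predicts values with SDRs, not a table, but the TD error is the same quantity):

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, s2, a2, alpha=0.5, gamma=0.9):
    """One SARSA step: update Q(s, a) toward r + gamma * Q(s', a')."""
    td_error = r + gamma * Q[(s2, a2)] - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_error

Q = defaultdict(float)
# One transition: state 0, action "right", reward 1, next state 1, next action "right".
td = sarsa_update(Q, 0, "right", 1.0, 1, "right")
print(td, Q[(0, "right")])  # td = 1.0, Q = 0.5 on the first visit
```

That `td_error` is what drives the prediction-perturbation updates above: positive error means the perturbed action did better than predicted.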
I am working on some demos for it now, and am trying to train it on the ALE (Arcade Learning Environment). Let’s see how that goes!
Until next time!