So, here is a tutorial on how to build your own super-efficient (in terms of processing power) reinforcement learner!
SDRRL (sparse distributed representation reinforcement learning) was created to be a simple reinforcement learning agent that can handle continuous states and actions, and can assign credit back in time. What sets it apart from other reinforcement learners with similar capabilities, however, is the computational efficiency of the algorithm.
Another important characteristic of SDRRL is that it doesn’t use backpropagation or the experience replay/rehearsal that is often associated with backpropagation-based reinforcement learners. Instead, it avoids catastrophic interference by using SDRs (sparse distributed representations). These are also a source of computational efficiency, since, as their name suggests, we often only have to update a handful of nodes at a time.
SDRRL is an actor-critic algorithm. Both the actor and critic feed off of the sparse distributed representations for input, which in turn are learned in an unsupervised fashion. Q values and the policy are updated using an algorithm similar to CACLA (read) that has been extended with eligibility traces.
Below is an image describing the overall network architecture. Connections are only shown between active nodes. Note that the hidden layer is binary, a by-product of the particular SDR learning algorithm used in this case. All other nodes are continuous.
Here, each node in the hidden layer is called a cell. This naming comes from HTM terminology (hierarchical temporal memory). This reinforcement learning agent is meant as a possible explanation of cortical microcolumn functionality (this will be the subject of a future post).
The inhibitory sheath surrounds the cells, and ensures that on average the cells have some desired sparsity level. The sheath can be viewed as a set of connections on the cells. The cells are then fully connected with these inhibitory connections.
In this tutorial, we will use a very simple sparse distributed representation learning algorithm, inspired by more complex models such as SAILnet (read) which use spiking neurons.
An iteration of the algorithm starts by computing the sparse distributed representation of the input:
Where A is the activation value of the cell, I is the inhibition, and S is the resulting state. W, B, and N are weights (B is a per-cell bias). In is the input to the agent.
The idea behind this is as follows: We compute a linear combination of the inputs for each cell based on their “feed forward” weights, and then inhibit them to be sparse using their “lateral” weights. The lateral connections do not provide any inhibition if the activation of the current cell is greater than that of the neighbor. If it is less, however, it receives the inhibition weight N. This is a sort of comparison-based inhibition algorithm. The advantage of this over iterative algorithms that solve for the SDRs is that it is blazing fast.
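To make the comparison-based inhibition concrete, here is a minimal sketch in plain Python. The function and variable names are my own, and the exact thresholding (a cell fires if its activation survives the inhibition) is my reading of the description above, not the original formulas:

```python
def compute_sdr(inputs, W, B, N):
    """Compute a binary SDR via comparison-based lateral inhibition.

    W: feed-forward weights, one row per cell.
    B: per-cell biases.
    N: lateral inhibition weights between cells.
    Returns the activations A and the binary states S.
    """
    num_cells = len(W)
    # Activation: linear combination of the inputs plus the bias.
    A = [B[i] + sum(w * x for w, x in zip(W[i], inputs))
         for i in range(num_cells)]
    S = []
    for i in range(num_cells):
        # A neighbor inhibits this cell only if its activation is higher.
        inhibition = sum(N[i][j] for j in range(num_cells)
                         if j != i and A[j] > A[i])
        # The cell's state is 1 if its activation survives the inhibition.
        S.append(1.0 if A[i] - inhibition > 0.0 else 0.0)
    return A, S
```

Because this is a single pass with no iterative settling, the cost per step is low, which is where the speed advantage over iterative SDR solvers comes from.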
Now, we compute a reconstruction of the input based on this SDR:
This is basically just a reverse-activation of the “feed forward” weights. There are alternatives to this reconstruction method, such as Oja’s rule (Hebbian learning), but I find that this produces better results.
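A sketch of the reverse-activation reconstruction, under the same assumed naming as before. Note that only active cells need to contribute, which is another place the sparseness pays off:

```python
def reconstruct(S, W):
    """Reverse-activate the feed-forward weights: each active cell adds
    its weight vector to the reconstruction of the input."""
    num_inputs = len(W[0])
    recon = [0.0] * num_inputs
    for i, s in enumerate(S):
        if s > 0.0:  # inactive cells contribute nothing (sparse = fast)
            for j in range(num_inputs):
                recon[j] += s * W[i][j]
    return recon
```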
To learn the SDR weights, we use the following updates:
Where alpha, beta, and omega are learning rates. Rho is the target sparsity level.
The intuition here is that we update the feed-forward weights to minimize the reconstruction error. We only update on states (S) that are 1, the rest can be ignored. The “lateral” weights are updated such that the covariance of two cells is equal to the covariance of two ideally sparse cells. Finally, the bias (B) is adjusted to simply maintain sparsity levels.
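The three updates described above might look like the following sketch. The precise forms (reconstruction-error rule for W, a covariance target of rho squared for the lateral weights, and a sparsity-tracking rule for B) are my interpretation of the description, and the learning rates are placeholder values:

```python
def update_sdr_weights(inputs, recon, S, W, B, N,
                       alpha=0.1, beta=0.01, omega=0.01, rho=0.1):
    """One online update of the SDR weights (W, B, N in place)."""
    num_cells = len(W)
    num_inputs = len(inputs)
    # Feed-forward weights: minimize reconstruction error,
    # updating only the cells whose state is 1.
    for i in range(num_cells):
        if S[i] > 0.0:
            for j in range(num_inputs):
                W[i][j] += alpha * S[i] * (inputs[j] - recon[j])
    # Lateral weights: push the pairwise covariance toward that of
    # ideally sparse cells (rho * rho), keeping inhibition non-negative.
    for i in range(num_cells):
        for j in range(num_cells):
            if i != j:
                N[i][j] = max(0.0, N[i][j] + beta * (S[i] * S[j] - rho * rho))
    # Biases: nudge each cell toward the target sparsity level.
    for i in range(num_cells):
        B[i] += omega * (rho - S[i])
```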
Now we have a way of learning SDRs. This method is fully online; thanks to its sparseness, it doesn’t need any sort of stochastic sampling.
Q and Actions
Q values and actions are trained on the SDRs we now have. Both are updated with different forms of eligibility traces.
First, we compute the actions Act and Q value from the SDR:
Where V and P are weights, and f is the logistic sigmoid function.
An exploratory version of the action (Act) must be generated; we will call it Actexp. The exploration can be done with various methods; I simply used a combination of epsilon-greedy and normal-distribution perturbations.
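Putting the forward pass and exploration together, a sketch might look like this. The names and default parameters are illustrative assumptions, not the original code:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def act(S, V, P, epsilon=0.1, noise_std=0.05):
    """Compute Q and the actions from the SDR, then an exploratory action.

    V: Q weights (one per cell); P: action weights (one row per action
    output). Exploration combines epsilon-greedy (a fully random action)
    with Gaussian perturbation of the computed action.
    """
    q = sum(v * s for v, s in zip(V, S))
    actions = [sigmoid(sum(p * s for p, s in zip(row, S))) for row in P]
    if random.random() < epsilon:
        explored = [random.random() for _ in actions]  # epsilon-greedy
    else:
        explored = [min(1.0, max(0.0, a + random.gauss(0.0, noise_std)))
                    for a in actions]
    return q, actions, explored
```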
Once we have Q, we can compute the temporal difference error using SARSA:
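In code, the SARSA TD error is a one-liner (a sketch with assumed variable names: q_next is the Q value of the next state under the action actually taken):

```python
def td_error_sarsa(reward, q_next, q, gamma=0.97):
    """SARSA temporal difference error: reward plus the discounted Q of
    the next state-action pair, minus the current Q."""
    return reward + gamma * q_next - q
```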
We are then ready to update the Q and action weights:
Where T and E are eligibility traces, and phi and theta are learning rates. Lambda is a trace decay factor, applied on top of the discount factor gamma.
Here, Q uses replacing traces, hence the max function. The traces for the actions function like momentum. They can also be modified to standard accumulating traces if desired, but I find that these work better.
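A sketch of the trace and weight updates, under the same assumed naming. The replacing-trace max for Q and the positive-TD gating of the action update (the CACLA-style part) are my reading of the description; the momentum-like action traces here accumulate the action error weighted by the active states:

```python
def update_rl_weights(S, td_error, action_error, V, P, T, E,
                      phi=0.01, theta=0.01, gamma=0.97, lam=0.9):
    """One update of the Q weights V and action weights P (in place).

    T: replacing traces for Q; E: momentum-like traces for the actions.
    action_error: exploratory action minus computed action, per output.
    """
    num_cells = len(S)
    for i in range(num_cells):
        # Replacing traces: the max resets the trace to the cell's state
        # instead of accumulating it.
        T[i] = max(lam * gamma * T[i], S[i])
        V[i] += phi * td_error * T[i]
    if td_error > 0.0:  # CACLA-style: move toward actions that did better
        for a in range(len(P)):
            for i in range(num_cells):
                E[a][i] = lam * gamma * E[a][i] + action_error[a] * S[i]
                P[a][i] += theta * E[a][i]
```

Since S is sparse and the traces decay, most of these per-cell updates touch values near zero; an optimized implementation can skip inactive cells with negligible traces, which is where much of the efficiency comes from.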
So, if all goes well, you should have a fun little reinforcement learning agent that is fully online and uses no backpropagation or stochastic sampling/experience replay/rehearsal.
A C++ implementation exists as part of a larger project of mine here.
Please let me know if you make anything cool with this! Have fun!