LeabRa: Biologically realistic neural networks based on Leabra in R

Johannes Titz (johannes.titz at gmail.com)

2017-09-22

This package provides the Leabra artificial neural network algorithm (O’Reilly, 1996) for R. Leabra stands for “local error driven and associative biologically realistic algorithm”. It is the Rolls Royce of artificial neural networks because it combines error-driven and self-organized learning in an elegant way, while focusing on a biologically plausible learning rule. If you have never heard of Leabra, you should read about it first. A good place to start is Part I of the computational cognitive neuroscience book, available at https://grey.colorado.edu/CompCogNeuro/index.php/CCNBook/Main (O’Reilly et al., 2016).

This version of Leabra is rather slow compared to the original implementation in C++ (https://grey.colorado.edu/emergent/index.php/Main_Page). It was not intended for constructing large networks or running many trials – unless you do not care about processing resources. The main purpose of this implementation is to try out new ideas quickly, either in constructing new networks or in changing the algorithm to achieve certain outcomes. If you wanted to do this with the C++ version, you would have to deal with optimized code, which is harder to read. Note that the MATLAB version by Sergio Verduzco-Flores (https://grey.colorado.edu/svn/emergent/emergent/trunk/Matlab/) has the same purpose, and I recommend trying it out and reading his great documentation.

I still believe that this R version is the easiest way to get familiar with Leabra quickly. In contrast to MATLAB, R respects your freedom (https://www.gnu.org/philosophy/free-sw.html), so you can get going by simply downloading R and installing the leabRa package. Getting started should be especially easy for psychologists, many of whom are already acquainted with R (it is quite popular in the field) and might want to “wrangle” with the data in their programming mother tongue. What follows is a brief introduction to constructing networks with this package. The first network will be a pattern associator: it associates a specific input with a specific output in an error-driven fashion through a hidden layer. The second network will be a self-organized network that attempts to categorize animals represented by feature vectors.

Pattern Associator

Associating two patterns may seem unspectacular at first glance, because this can be achieved easily by back-propagating errors (Rumelhart et al., 1986). But the learning rule employed by Leabra is not only more sophisticated, it is also biologically oriented. In essence, you first present the input without the correct output. After that, you present the input and the correct output together. Now the weights are changed in such a way that the activation of the output converges to the correct output if the same input is shown again. The critical part is that changing weights is done completely locally. Only the activations of connected neurons over different time frames (short, medium, and long term) are used to achieve this.

I think of it as akin to learning a new language or geography. For instance, suppose I want to associate Burkina Faso with its capital Ouagadougou (pronounced waga’du:gu). In the first phase only Burkina Faso is presented and the neuronal network in my brain will try to produce the correct output, Ouagadougou. It will likely not succeed at first; maybe something similar will be output, but not the correct city. After this phase Burkina Faso and Ouagadougou are presented simultaneously. Now is the time to change the weights so that the output will be more similar to Ouagadougou. This adjustment depends only on the neuron’s activation over short, medium, and long time frames – variables that are likely available to a real biological neuron.

Constructing the Network

Let us load the package

library(leabRa)

To reproduce the example we can use a seed. Try to guess who was born on July 22nd, 1904.

set.seed(07221904)

To construct a network, we have to specify the dimensions of the network and the connections between the layers. We will create three layers: input, hidden, and output. They are quite small to keep calculation time low. The first layer will have dimensions of \(2 \times 5\) (2 rows, 5 columns), the second of \(2 \times 10\) and the third of \(2 \times 5\) again. Note that these dimensions are not relevant for the algorithm, because the units are vectorized internally, so we could have specified \(1 \times 10\) or \(5 \times 2\) for layer 1 as well.

dim_lays <- list(c(2, 5), c(2, 10), c(2, 5))

Let us now specify the connections between these layers. Layer 1 (input) should be connected with layer 2 (hidden). Layer 3 (output) will be bidirectionally connected with layer 2. If layer j sends projections to layer i, then connections[i, j] = strength > 0 and 0 otherwise. Strength specifies the relative strength of that connection with respect to the other projections to layer i. More intuitively, just look at the rows and you will see that row (layer) 2 receives from columns (layers) 1 and 3; the connection with layer 1 is 5 times stronger (\(0.2 \cdot 5 = 1\)) than the connection with layer 3. Furthermore, row (layer) 3 receives from column (layer) 2. Row 1 (layer 1) does not receive from anywhere, because all connection strengths are set to 0.

connections <- matrix(c(0, 0, 0,
                        1, 0, 0.2,
                        0, 1, 0), nrow = 3, byrow = T)

Note that in the current version of the package, layers are either fully connected or unconnected. If you need partially connected layers, you will need to add this functionality on your own. Now we will create a network with default parameters.

net <- network$new(dim_lays, connections)

As a side note, the package uses R6 classes, a style of object-oriented programming that behaves differently from the usual R styles (S3 or S4). You can see this because we call a method of a class with the dollar sign (network$new(…)) instead of using a generic function. Furthermore, fields are also accessed via the dollar sign instead of the at sign (@).

dim_lays and connections are the minimum you need to specify a network. But if you are constructing more complex networks, you should pay attention to g_i_gain, which controls the overall inhibition in a layer (inhibitory conductance gain). If this value is not set carefully, you might get unexpected results (too much or not enough activation).
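
For example, if the hidden layer needs stronger inhibition, you could pass a different gain for every layer when constructing the network. This is only a sketch: it assumes that network$new accepts g_i_gain as a numeric vector with one value per layer, so check the class documentation before relying on it.

# sketch with an assumed interface: one inhibitory gain value per layer
net_strong_inhib <- network$new(dim_lays, connections,
                                g_i_gain = c(1.5, 2.5, 1.5))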

Creating Input Patterns

Now we have a network, but no inputs. Let us create 15 random patterns with the method create_inputs in the network class. We want random patterns in layers 1 and 3; these are supposed to be associated during learning. We call these inputs inputs_plus, because they are what is presented to the network during the plus phase (when the correct output in layer 3 is present). prop_active is the proportion of active units in the patterns; activation is either 0.05 or 0.95. We choose .3, meaning that on average 30% of units will have an activation of 0.95 and 70% an activation of 0.05.

inputs_plus <- net$create_inputs(which_layers = c(1, 3),
                                 n_inputs = 15,
                                 prop_active = .3)

It is possible to create inputs with your own functions. The network will accept an external input list that has a length equal to the number of layers. Every element in the list should have the activation values of the neurons for the specific layer.
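
For illustration, a single hand-crafted input for our three-layer network could look like the following sketch. Layer 2 gets NULL because it receives no external input; the values 0.05 and 0.95 mirror what create_inputs produces.

# one hand-made external input: 10 activations for layer 1 (2 x 5 units),
# NULL for the hidden layer, 10 activations for layer 3
my_input <- list(c(0.95, 0.05, 0.05, 0.95, 0.05, 0.05, 0.95, 0.05, 0.05, 0.95),
                 NULL,
                 c(0.05, 0.95, 0.95, 0.05, 0.05, 0.95, 0.05, 0.05, 0.05, 0.95))
# the learning methods expect a list of such inputs, one element per stimulus
my_inputs <- list(my_input)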

For error-driven learning the Leabra way, we need to remove the inputs of the output layer (layer 3) for the minus phase. We will call this list inputs_minus (the correct output is missing, so it needs to be “subtracted”). Functionals are neat, so we will use lapply here:

inputs_minus <- lapply(inputs_plus, function(x) replace(x, 3, list(NULL)))

Learning

Now we can start learning with the default parameters. The return value of the learning function is the output activation after each trial, recorded before the weights are changed. This saves resources, because we do not have to present the inputs again after learning. The first epoch therefore serves as an approximate baseline for each stimulus. In the next step we will use these output activations to calculate the error. During learning, progress is reported by dots, each representing a single trial, i.e. the minus phase, plus phase, and weight change for one stimulus. Every row of dots is a new epoch, meaning that all stimuli have been presented once.

n_epochs <- 10
outs <- lapply(seq(n_epochs), function(x) 
  net$learn_error_driven(inputs_minus,
                         inputs_plus))
## ...............
## ...............
## ...............
## ...............
## ...............
## ...............
## ...............
## ...............
## ...............
## ...............

Plotting Results

The network class can calculate the mean absolute distance (mad) between the actual and correct patterns for each epoch with the method mad_per_epoch. You can also apply your own functions to these lists to calculate other types of errors, such as a cosine error (a sketch follows below). We are interested in the error of layer 3 (output).

mad <- net$mad_per_epoch(outs, inputs_plus, 3)
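
As an illustration of a custom error function, the following sketch computes a mean cosine distance for layer 3 per epoch. It assumes that outs[[epoch]][[stimulus]] is a list with one activation vector per layer, the same indexing used for the self-organized network further below.

# cosine distance between two activation vectors
cos_dist <- function(x, y) 1 - sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
# mean cosine distance between output activations and correct patterns per epoch
cos_per_epoch <- sapply(outs, function(epoch)
  mean(mapply(function(out, correct) cos_dist(unlist(out[[3]]),
                                              unlist(correct[[3]])),
              epoch, inputs_plus)))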

How about a minimalist plot to see if it worked?

plot(mad, axes = F, pch = 16, family = "serif", type = "b",
     xlab = "epoch [#]",
     ylab = "mean absolute distance [activation]",
     ylim = c(round(min(mad), 2), round(max(mad + 0.01), 2)))
axis(1, at = seq(length(mad)), tick = T, family = "serif")
axis(2, at = seq(0, 1, 0.05), labels = seq(0, 1, 0.05), tick = T,
     family = "serif", las = 2)

The error gets smaller with each epoch, so the pattern associator seems to work just fine.

Some Additional Notes

You can influence how many cycles are run during the minus phase and the plus phase via parameters of the learn_error_driven method. You could also implement your own learning functions. Internally, the learn_error_driven method is straightforward. It uses the method cycle to clamp the external input activations and to get the internal inputs from other layers. This is done several times for the minus phase (50 times by default) and then for the plus phase (25 times by default). After that, the method chg_wt is called to adjust the weights. This procedure is repeated for every stimulus.
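
For example, running a longer minus phase might look like the following sketch. The parameter names n_cycles_minus and n_cycles_plus are my assumption here, so check the documentation of learn_error_driven for the actual argument names.

# sketch with assumed argument names; see the help page of the network class
outs <- net$learn_error_driven(inputs_minus, inputs_plus,
                               n_cycles_minus = 75, n_cycles_plus = 30)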

If you want to modify the initial weight matrix you have several options. When creating the network, you can specify a function that generates the random initial weights. The default function is:

w_init_fun = function(x) runif(x, 0.3, 0.7)

It produces weights between 0.3 and 0.7 from a uniform distribution. Let us say you want to generate weights from a normal distribution with a mean of 0.6 and a standard deviation of 0.1. Just specify the w_init_fun accordingly when constructing a new network object:

net <- network$new(dim_lays, connections,
                   w_init_fun = function(x) rnorm(x, mean = 0.6, sd = 0.1))

If this does not offer enough flexibility, you can also create your own weight matrix from scratch and pass it as the parameter w_init, the initial weight matrix. w_init is a matrix of matrices (like a cell array in MATLAB):

all_weights <- net$get_weights()
all_weights
##      [,1]        [,2]        [,3]       
## [1,] NULL        NULL        NULL       
## [2,] Numeric,200 NULL        Numeric,200
## [3,] NULL        Numeric,200 NULL
all_weights[3, 2]
## [[1]]
##            [,1]      [,2]      [,3]      [,4]      [,5]      [,6]
##  [1,] 0.6199792 0.5816655 0.5966143 0.6089049 0.5807257 0.5352835
##  [2,] 0.5377392 0.5942340 0.5608627 0.5589160 0.4438800 0.6483404
##  [3,] 0.6602290 0.6014105 0.6947235 0.7350114 0.5999013 0.5665938
##  [4,] 0.5013222 0.5464402 0.5464375 0.6761180 0.5927200 0.6464844
##  [5,] 0.7125542 0.5597441 0.6541184 0.7898291 0.6359726 0.6202081
##  [6,] 0.6245973 0.6564946 0.6311048 0.6802373 0.6934607 0.5922013
##  [7,] 0.4933633 0.7623871 0.6622023 0.5322418 0.4851709 0.5541157
##  [8,] 0.6356127 0.6887199 0.6371301 0.7478384 0.5997549 0.6241816
##  [9,] 0.7103423 0.4964789 0.6067320 0.7706679 0.4798251 0.6227243
## [10,] 0.7072451 0.6498023 0.7569080 0.5923243 0.6646205 0.4953556
##            [,7]      [,8]      [,9]     [,10]     [,11]     [,12]
##  [1,] 0.4657841 0.4880591 0.7225643 0.4055622 0.5686968 0.5642393
##  [2,] 0.5458209 0.6754034 0.4899045 0.4762736 0.6155015 0.6982280
##  [3,] 0.5430482 0.5845956 0.4379233 0.6910951 0.5464861 0.5369073
##  [4,] 0.5609346 0.5984497 0.4077778 0.5712721 0.7759060 0.6386810
##  [5,] 0.4486378 0.5860919 0.6365965 0.5661420 0.4844575 0.5107766
##  [6,] 0.4439360 0.6209044 0.6283859 0.4777480 0.7109740 0.4373598
##  [7,] 0.6267409 0.5056140 0.5632133 0.7094829 0.5958420 0.4406635
##  [8,] 0.6970609 0.6294668 0.4147536 0.6806164 0.7446742 0.7294338
##  [9,] 0.5323914 0.8517820 0.5133184 0.4869751 0.5099135 0.6131721
## [10,] 0.7229776 0.5214591 0.6332850 0.5893964 0.4716185 0.5320660
##           [,13]     [,14]     [,15]     [,16]     [,17]     [,18]
##  [1,] 0.4239211 0.6668740 0.6651165 0.5915938 0.5722867 0.4240229
##  [2,] 0.6572897 0.6708034 0.6310316 0.3660084 0.7316463 0.7295660
##  [3,] 0.5677364 0.7676059 0.5588645 0.6585694 0.5487881 0.5413860
##  [4,] 0.4914294 0.4506926 0.5760387 0.5909589 0.6628934 0.5140708
##  [5,] 0.5495123 0.5132568 0.6472836 0.6800647 0.4999588 0.7499881
##  [6,] 0.5861824 0.5952372 0.5133139 0.6587509 0.7144574 0.6387816
##  [7,] 0.5304132 0.5694711 0.6346313 0.5565570 0.5557743 0.6464434
##  [8,] 0.5989620 0.6050367 0.5551266 0.6165508 0.4864436 0.5043580
##  [9,] 0.5850210 0.6888675 0.7597975 0.4697959 0.6854917 0.5578774
## [10,] 0.7583314 0.6909298 0.6063724 0.5446131 0.6422718 0.7091008
##           [,19]     [,20]
##  [1,] 0.7470220 0.5926404
##  [2,] 0.6023885 0.5333771
##  [3,] 0.6978501 0.3738139
##  [4,] 0.6481803 0.9615503
##  [5,] 0.3121930 0.5945361
##  [6,] 0.5437668 0.4853428
##  [7,] 0.6345784 0.6550956
##  [8,] 0.5550953 0.6028935
##  [9,] 0.4613824 0.5491546
## [10,] 0.6528826 0.4835204

Be careful when you create a w_init matrix on your own: judging from the output of get_weights() above, it must have the same structure, with a weight matrix wherever the connections matrix specifies a projection (one row per receiving unit, one column per sending unit) and NULL everywhere else.
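
One way to stay consistent with this structure is to start from get_weights() and overwrite only the non-NULL entries, as in the following sketch (it assumes that network$new accepts a matrix of exactly this form as w_init):

# reuse the structure of the current weights and refill every existing
# projection with new random values of the same dimensions
w_init <- net$get_weights()
for (i in seq_len(nrow(w_init))) {
  for (j in seq_len(ncol(w_init))) {
    if (!is.null(w_init[[i, j]])) {
      w_init[[i, j]] <- matrix(runif(length(w_init[[i, j]]), 0.4, 0.6),
                               nrow = nrow(w_init[[i, j]]))
    }
  }
}
net <- network$new(dim_lays, connections, w_init = w_init)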

As mentioned before, this package uses R6 classes, meaning that methods modify objects in place instead of returning modified copies that you assign in the usual R way. For instance, calling net$learn_error_driven above actually modified the net object, although we did not make any explicit assignment. This is unusual for R and has some disadvantages, but it is faster and uses fewer resources (ideal for a simulation) than the more common S3/S4 classes. Just pay attention when you call methods in this package: they will modify objects in place.
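
If you ever need an independent copy of a network, plain assignment is not enough, because it only copies the reference. R6 objects provide a clone() method for this (assuming the class uses the R6 default of being cloneable):

net_ref  <- net                      # points to the same object as net
net_copy <- net$clone(deep = TRUE)   # an independent copy of the network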

Hello World in “Connectionism”: Categorizing Animals

Every time I explore a new piece of neural network software, I try to create some typical examples. One obvious example is the pattern associator. Personally, I like the example by Knight (1990, p. 70) for unsupervised (self-organized) learning of animal categories. It became my “hello world” for artificial neural networks.

Again, let us set a seed, so you can reproduce the example.

set.seed(22071904)

We will start with the input patterns, because the network architecture depends on the dimension of these patterns.

Input Patterns

The inputs for the network are animals represented by features that are either present or absent (Knight, 1990, p. 71). This data comes directly with the leabRa package and is called animals:

animals
##           has.hair has.scales has.feathers flies lives.in.water lays.eggs
## dog              1          0            0     0              0         0
## cat              1          0            0     0              0         0
## bat              1          0            0     1              0         0
## whale            1          0            0     0              1         0
## canary           0          0            1     1              0         1
## robin            0          0            1     1              0         1
## ostrich          0          0            1     0              0         1
## snake            0          1            0     0              0         1
## lizard           0          1            0     0              0         1
## alligator        0          1            0     0              1         1

Because the network class at present only accepts a list as external inputs, we transform the data frame rows into elements of a list.

inputs <- plyr::alply(animals, 1)

Furthermore, we need an empty list element (NULL) for the second layer.

inputs <- lapply(inputs, function(x) list(x, NULL))

This is what I meant when I wrote that R people might prefer wrangling with data in their mother tongue.

Network Architecture

We will use a 2-layer network, where layer 2 receives projections from layer 1. The size of layer 1 must be 6, because there are 6 features for representing an animal in our example. The size of layer 2 is 3, meaning that the inputs will be categorized into three groups (the active unit will be the category). You can experiment with the number of units in layer 2 to get other categories.

dim_lays <- list(c(6, 1), c(3, 1))
connections <- matrix(c(0, 0,
                        1, 0), nrow = 2, byrow = T)

Learning

We want to run the simulation not just once, but several times to get a feeling for how much the results can vary. To achieve this, we write a short function that initializes a network and then learns in a self-organized fashion. After learning, we test the network’s reactions to the presented inputs with the method test_inputs (weight changes are turned off in this method). In contrast to the network described previously, this extra test is necessary because learning lasts only one epoch per simulation, so we need the responses after learning rather than the activations recorded during that single epoch. The network will be different for each simulation because the weights are initialized randomly. You can think of this procedure as having several participants observe ten different animals; the differences between participants correspond to the individual weight matrices assigned to each network.

run_sim <- function(dim_lays, connections, inputs){
  net <- network$new(dim_lays, connections)
  net$learn_self_organized(inputs, random_order = TRUE)
  return(net$test_inputs(inputs))
}

Now we can run the simulation. Ten runs should not be a problem, because the network is tiny.

n_runs <- 10
outs <- lapply(seq(n_runs), function(x) run_sim(dim_lays, connections, inputs))
## ..........
## ..........
## ..........
## ..........
## ..........
## ..........
## ..........
## ..........
## ..........
## ..........

Plotting Results

The output for each run contains the activations of every layer for each stimulus after all stimuli have been presented once. We are only interested in layer 2, so let us extract these activations and combine them into matrices (some “wrangling” again). We can then look at the outputs of two simulation runs to get a feeling for whether it worked.

outs_layer_two <- lapply(outs, function(x) lapply(x, function(y) y[[2]]))
outs_layer_two <- lapply(outs_layer_two, function(x) do.call(rbind, x))
outs_layer_two <- lapply(outs_layer_two, round, 2)

To inspect the third simulation we just call:

outs_layer_two[[3]]
##    [,1] [,2] [,3]
## 1  0.00 0.00 0.92
## 2  0.00 0.00 0.92
## 3  0.00 0.00 0.44
## 4  0.08 0.00 0.75
## 5  0.00 0.94 0.00
## 6  0.00 0.94 0.00
## 7  0.00 0.88 0.00
## 8  0.91 0.18 0.00
## 9  0.91 0.18 0.00
## 10 0.97 0.00 0.00

The output units fight for activation, such that only one unit is active most of the time. This is the category of the animal and it seems to work quite well. For instance, recall that the animals in rows 5, 6, and 7 were canary, robin, and ostrich and they all have high activations on unit 2. Let us look at another simulation, where the result is not as straightforward:

outs_layer_two[[1]]
##    [,1] [,2] [,3]
## 1  0.00 0.89    0
## 2  0.00 0.89    0
## 3  0.00 0.72    0
## 4  0.00 0.96    0
## 5  0.88 0.00    0
## 6  0.88 0.00    0
## 7  0.74 0.06    0
## 8  0.71 0.00    0
## 9  0.71 0.00    0
## 10 0.00 0.75    0

One problem we can see here is that only 2 output units are active. This happens because of “hogging,” a problem that often occurs in self organized learning (e.g. Knight, 1990, p. 72). Some output units are so strong that they attract everything. This can also happen with a single unit. There are a couple of ways to deal with hogging (see https://grey.colorado.edu/emergent/index.php/Leabra), but for our simple example we can simply ignore it; we have run several simulations, so it is not an issue if a couple of them have hogging units. Maybe this also reflects that grouping animals is to some degree subjective and that sometimes only two categories emerge.

There are many ways to work with these output activations. For instance, we can calculate the distance between the ten animals in their output activation and then run a cluster analysis or draw a heatmap. But a devil’s advocate might say that this allows for too many degrees of freedom. The output units can have activations between 0 and 1 and there are three of them. Maybe the 6 binary features will just be mapped onto three units which have a wide range of possible values. This might not be terribly impressive. Instead we can try to force the network to make a clear decision in which category to put an animal (one, two, or three). This is also more similar to how we would prompt human participants in a cognitive experiment. For instance, we could ask them to group animals into three categories, with every animal in exactly one category.

To achieve this, we can transform the activation matrices into 1s and 0s: the maximum value in each row is set to 1 and all other values to 0. This is a clear-cut decision about which category an animal belongs to. We will use a short function for this, applied to every row of every output matrix.

apply_threshold_on_row <- function(row){
  row[-which.max(row)] <- 0
  row[which.max(row)] <- 1
  return(row)
}

outs_layer_two <- lapply(outs_layer_two,
                         function(x) t(apply(x, 1, apply_threshold_on_row)))
outs_layer_two[[1]]
##    [,1] [,2] [,3]
## 1     0    1    0
## 2     0    1    0
## 3     0    1    0
## 4     0    1    0
## 5     1    0    0
## 6     1    0    0
## 7     1    0    0
## 8     1    0    0
## 9     1    0    0
## 10    0    1    0

Now we want to know which animals are grouped together. Here, we take a shortcut by calculating the binary distance matrix for every simulation. Using the value assignments described in the previous paragraph, we know the distance between two animals is either 0 if they belong to the same category or 1 if they do not.

dists <- lapply(outs_layer_two, dist, method = "binary")
dists[[1]]
##    1 2 3 4 5 6 7 8 9
## 2  0                
## 3  0 0              
## 4  0 0 0            
## 5  1 1 1 1          
## 6  1 1 1 1 0        
## 7  1 1 1 1 0 0      
## 8  1 1 1 1 0 0 0    
## 9  1 1 1 1 0 0 0 0  
## 10 0 0 0 0 1 1 1 1 1

So here animals 1, 2, 3, 4, and 10 are in one category and animals 5 to 9 are in the other. But this is only one distance matrix; we have ten of them, which is simply too much information. We can average these values over the simulation runs by using a neat functional again:

dists_mtrx <- lapply(dists, as.matrix)
mean_dists <- Reduce("+", dists_mtrx) / length(dists)
mean_dists
##      1   2   3   4   5   6   7   8   9  10
## 1  0.0 0.0 0.2 0.2 0.6 0.6 0.7 0.9 0.9 0.8
## 2  0.0 0.0 0.2 0.2 0.6 0.6 0.7 0.9 0.9 0.8
## 3  0.2 0.2 0.0 0.4 0.4 0.4 0.5 0.8 0.8 0.7
## 4  0.2 0.2 0.4 0.0 0.8 0.8 0.8 0.9 0.9 0.8
## 5  0.6 0.6 0.4 0.8 0.0 0.0 0.1 0.7 0.7 0.8
## 6  0.6 0.6 0.4 0.8 0.0 0.0 0.1 0.7 0.7 0.8
## 7  0.7 0.7 0.5 0.8 0.1 0.1 0.0 0.6 0.6 0.7
## 8  0.9 0.9 0.8 0.9 0.7 0.7 0.6 0.0 0.0 0.2
## 9  0.9 0.9 0.8 0.9 0.7 0.7 0.6 0.0 0.0 0.2
## 10 0.8 0.8 0.7 0.8 0.8 0.8 0.7 0.2 0.2 0.0

We need to add the row names from the original data set, so that we know which animal is which.

colnames(mean_dists) <- rownames(animals)
rownames(mean_dists) <- rownames(animals)

We are finally ready to apply clustering and then plot a dendrogram:

plot(hclust(as.dist(mean_dists)), main = "", sub = "", xlab = "",
     ylab = "Distance")

Three natural categories seem to emerge. The distance between the two animals in each pair is always zero, which means they always end up in the same category (indeed, their feature vectors are identical: 1. snake and lizard; 2. canary and robin; 3. dog and cat). The more interesting part is what happens with the alligator, ostrich, whale, and bat. The alligator is grouped with snake and lizard. These are the reptiles, although the alligator is not a typical member because it lives in water. The ostrich is grouped with canary and robin, the birds. Although it cannot fly, it still makes sense to put the ostrich in this category. Finally, the whale and the bat are grouped with dog and cat, the mammals. They are rather untypical members of this category, but zoologists also group them this way. Obviously, the example by Knight is somewhat artificial, but that is exactly why it makes such a nice “hello world” example for artificial neural networks.

Summary and Restrictions

These examples show that leabRa seems to work fine for two typical use cases. Still, I cannot guarantee that the code is correct in every detail. Furthermore, there are some differences from the original C++ code (as of the time of writing this vignette, September 2017). For instance, you cannot specify partial connections, and the noisy XX1 (nxx1) activation function is implemented as a step function to save computational resources. What is more, compared to the default in the current emergent version, the R version does not use momentum for calculating weight changes. Overall, the algorithm should still produce results very similar to the original Leabra implementation.

References

Knight, K. (1990). Connectionist ideas and algorithms. Communications of the ACM, 33(11), 59–74.

O’Reilly, R. C. (1996). The Leabra Model of Neural Interactions and Learning in the Neocortex. PhD thesis, Carnegie Mellon University, Pittsburgh, PA. URL: ftp://grey.colorado.edu/pub/oreilly/thesis/oreilly_thesis.all.pdf

O’Reilly, R. C., Munakata, Y., Frank, M. J., Hazy, T. E., and Contributors (2016). Computational Cognitive Neuroscience. Wiki Book, 3rd (partial) Edition. URL: http://ccnbook.colorado.edu

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533–536. URL: http://dx.doi.org/10.1038/323533a0