If you ever worked with neural networks, you will know that selecting network topology is usually not a trivial task. This makes figuring out how to build ever more clever networks a real challenge — especially because they also need to be trained in a resource and data intense process.
If we think about engineering our way to something as big and complex as a whole human brain, we will simply hit hard resource and time constraints. To try and shortcut this, various optimization techniques have been developed to tune network structures and hyper-parameters automatically. From all these techniques however, there is only one we know of that can provably lead to brain-level intellect: Evolution (and we are the proof).
The whole complexity of our brain (and more) was encoded and evolved over millennia by use of something as compact as our genome. This is the reason why computer scientists have been trying to replicate that success since the 1950's.
As of now (Feb. 2022), we have not been able (with any technique) to create a human-level machine intelligence. Yet, in the years to come, these evolutionary mechanisms might well play a part in that, so let’s investigate them in more detail!
In this article, we’ll look at the concept of evolving (i.e. optimizing) neural networks based on genetic encodings (Neuroevolution). To understand this, we will first dive into principles of evolution in nature and IT (sometimes called ‘Genetic Algorithms’ or similar — the concepts all being closely related).
NB: To follow this article, you should understand what artificial neural nets are and (roughly) how they can be used and trained via gradient descent/backpropagation. A fundamental level of understanding of Darwinian evolution principles might also be useful — but I try to explain this part again anyway.
We won’t spend too much time on this here, but a (small) part of the motivation for looking into evolutionary optimization was understanding alternatives to gradient descent/backpropagation based training. Some other examples I already explored can be found in the articles about Cascade-Correlation and Quickprop.
As mentioned, a lot of different genetic or evolutionary algorithms exist and they have different connotations. To keep things simple, whenever we talk about genetic or evolutionary algorithms, we talk about algorithms using the principles describes in the “Evolution & Genetics” section. If you have specific questions about these algorithms or questions about the choice of words in the article, let me know.
To understand why and how we can build on the examples that nature provides, we’ll first have a quick look at the runaway optimization process that created us all — evolution.
Wikipedia defines it from a top-down perspective:
Evolution is change in the heritable characteristics of biological populations over successive generations.
To paraphrase this, in nature, evolution is a set of principles that allow for gradual (random) change of characteristics or features of individuals within a population over generations. As a (statistical) side effect of the principles, these populations tend to adapt to their biological niche over time.
The evolutionary principles that allow for this to happen in nature are:
Let’s elaborate on these points and compare nature and computer science side by side for each:
Genetics/Genetic encodings (of heritable characteristics):
Any organism in nature is constructed from a genome, i.e. a set of genes. These are a sort of seed, a schema or plan that encapsulates all information necessary to create the whole organism. In nature, the DNA stores our genes, which get transcribed to RNA and eventually decoded into proteins — which, in turn, make up our cells, tissues, organs etc.
In computer science, a genome or genetic encoding refers to data in any form (e.g. a bit-string, list of integers or other data structures) that can be translated into an algorithm or the parameters thereof (e.g. a neural net topology and weights). Exactly how data is encoded genetically has some interesting implications for the performance of the optimization process as a whole.
Note that the encoding does not necessarily need to cover all possible configurations of our algorithm or parameters. In fact, we might really want a much more restricted encoding, simply to reduce the search space.
Selection (of the fittest for reproduction):
Any individual of a population has a certain probability of reproducing, thus passing on its genes. Usually, the ‘fitter’ it is to exploit its biological niche, the higher the probability of reproduction and the higher the probability of replicating its genes.
In computer science, any solution that performs well on a task — measured by some predefined metric — is selected for reproduction, i.e. future refinement of its genome in a new iteration/generation.
Random changes/Random walks (in the space of genetic encodings):
Through exchange of genetic information between genomes (a.k.a. recombination) and through mutation (i.e. random changes) of a genome, the whole space of possible organisms is slowly explored in a random walk fashion. Interestingly, in nature, the visible ‘steps’ of the random walk — i.e. the changes within the phenotype, the ‘shape’ of an organism — are usually very small and/or modular.
In computer science, we also make use of recombination and mutation to enrich the genetic possibilities and explore possible solutions randomly. But since we’re engineers, we can also construct more fine-tuned (i.e. less random) ways to modify the genome, as a way to make this artificial evolution more efficient. Ideally, we can create just enough diversity to find better solutions by chance, but not too much to lose good solutions from one generation to the next.
Side note: This means that evolution is not typically leading to individuals that ‘make sense’, i.e. that have specific characteristics for certain purposes. In fact, we have features that make absolutely no sense: The recurrent laryngeal nerve for example connects our brain to regions in our throat (the larynx) — yet to do that, it goes way down below, curves around the aorta (close to our heart) and then comes back up again. This is so obviously inefficient, people have made entertaining videos about it. [1]
So why do these ‘evolutionary mistakes’ happen? To answer that, we need to look at evolution from a bottom-up point of view. Seeing it from the molecular perspective, evolution is simply a side-effect of having a molecular structure that allows for replication. Given that this process is error prone (e.g. because of mutation) it will eventually create molecules that simply have a higher probability of replicating. They can thus better ‘compete’ for necessary resources, decreasing the probability of replication for other molecules. After a stupendous amount of these random changes (and some less random ones, such as through recombination), some molecules eventually end up creating very effective and resilient machines that help them in their replication efforts — organisms, such as you and me.
The reason why these machines have design defects, like the recurrent laryngeal nerve, is simply because it would take a surprising amount of mutations to fix it, and all intermediate steps are lethal (because it loops around the aorta). This means a gradual fix over several generations is impossible and a fix from one generation to the next is just incredibly unlikely. Not to mention that an actual fix might not substantially enhance the replication probability of that individual’s genetic code, meaning it could simply get lost again.
Another interesting aspect that the molecular point of view provides is that the system can be gamed: Viruses, for example, are very fast and resilient at replication, even tough they only consist of comparatively little DNA or RNA wrapped in a simplistic package. They do this by intruding into an organism and inserting their genome into the cells of the host, so they are forced to produce copies of the virus.
Usually, the molecules we are talking about are RNA and DNA, our genetic code. Apart from these though, other, more exotic, examples exist — see e.g. prions which are weirdly folded proteins. These are not even at the level of a virus — instead of possessing DNA or RNA their weird folding forces proteins of similar structure to also undergo this weird folding, creating a snowball effect within organisms. Even if this seems unreasonably simplistic, prions can be transfered from host to host and can undergo mutation, providing the basis for evolution.
Looking at evolution from this perspective means that the real ‘reason’ why we exist is to be a rather intricate replication mechanism, a vessel, for our genes. I know this seems rather bland, but that is basically at the core of it. It doesn’t answer any of the great philosophical questions, but the world around us would look much more rocky and dead without this blind optimization process.
If you are curious what other interesting implications evolution has as an optimization mechanism, check out e.g. [2] & [3].
So, with this understanding of evolution, how do we get from there to a neural network configuration?
In the beginning of the article I promised a technique that could possibly a create brain-like level of intelligence. This, of course, is rather easy to claim, given that evolution provided our brains — so, promise fulfilled! But then, biological evolution took quite some time — only a few billion years, or at least a few million if you strip the very early stages — which is, I guess, a tad more than what we can invest.
So instead of going all the way from molecules, to cells, to tissues, organs and organisms, we’re going to cut some corners. We’re going to create brains without the body to support them. Well, sort of: We’ll create tiny test networks that will be a stand-in example here.
This is possible because we already have a working model of how simplified information processing happens on the micro scale of a brain: artificial neural networks. As you know, the cells (neurons) in this model have weighted input signals that, summed up, are pumped into an activation function, returning the activation signal of that neuron, which in turn is being transfered to other neurons.
In supervised learning, we want to adjust the weights of the neurons such that for a given input signal, we achieve an approximation to an expected output signal. Classically, one would compare the network output to the expected output (the resulting difference is called the loss) and then use that for a gradient-descent-based optimization. Which just means iterating backwards through the network and adjusting the weights in tiny steps, to gain a slightly better result when comparing the net’s output to the expected output next time.
NB: This illustration and the networks we will look at later on are intentionally not including a bias input. This is to simplify the case. A bias could be added simply as another (constant) input neuron.
Now, this is, of course, a high-level summary of what happens, but the real problems are hidden in the details. How can you avoid getting stuck in local optima? How can you prevent the algorithm from converging too slowly if the loss gradient is not steep enough? How do you find a network topology that can handle the problem without overfitting?
If we were to approach this from a machine-learning engineer’s perspective, we’d have to:
Practically all of these steps are problem-domain specific, which is the main reason we need experienced ML engineers who have worked on similar domains before.
But we are in luck! Many heuristics and meta-level optimizations have been invented, and one of them is neuroevolution.
Neuroevolution means using principles of evolution to generate problem-domain adjusted artificial neural networks. This could mean generating specific structures, weights, activation functions or any combination of these.
Concretely, neuroevolution means encoding networks like the one above genetically, generating a random population of individual network encodings, creating actual networks from them, evaluating these networks, selecting the fittest ones and introduce certain random changes to create a new population from them. Then rinse and repeat until some fitness level is reached.
Ideally, this process happens completely automatic without having to determine any problem-domain specific hyperparameters or network architectures upfront. Nice, eh?
“This sounds great, why isn’t everyone doing this instead of hiring more machine learning engineers?” — you might wonder. Truth be told, the devil is in the details. If done poorly, not only will this process deliver vastly suboptimal results, it will also take much more time and resources to do so. In fact, if speed of learning is the essence and everything else is a non-issue, then there is really no reason to use neuroevolution as opposed to standard back-propagation.
However, in domains where expected output data is hard to define a priori (e.g. games and reinforcement learning in general) or where the best network structure is unknown, neuroevolution could be a good fit. The challenge then is to create a system that can both reach a stable state and global optimum fast and with high probability and low resource requirements. Since most of this relies on random choices, solving this issue is equivalent to solving the classic exploration vs. exploitation challenge. That is, how much randomness do I introduce (exploration) to discover potentially new local optima — and how much do I guide or restrict these random changes to exploit a local optimum that I already discovered further.
To find a good sweet spot, we have four general opportunities for tweaking. Each of these leans on the principles of evolution outlined above:
Let’s look into each to see why and how it can be tweaked.
How well a given evolutionary algorithm performs on the task of optimization is influenced strongly by its genetic encoding, because this in turn influences available genetic operators and the random walk in the genotype space.
Let’s understand what it means to encode a neural net and see what options we have.
To start of, let’s look at an example we’ll use throughout the rest of the article.
Say you want to encode a neural network. You already came up with the architecture and it’s the classic layered one we have seen in network illustration above. It has two input cells, three hidden neurons and an output.
Now, a simple way to encode this network it to pack the weights into an adjacency matrix. In this encoding, the rows of the matrix signify input and hidden neurons and the columns signify hidden and output neurons. For each (directed) connection between these, the respective entry (by row/column chosen from source/target neuron) in the matrix contains the weight. If there is no connection from some neuron to another, then the entry is set to 0.
If we were to assume all the weights had a value of 1, the resulting matrix for the net from above would then look like this:
NB: To hammer the point home — the first row are all the connections going out of the first input neuron, the second of the second input neuron, the third of the first hidden neuron and so on. The first column are all incoming signals of the first hidden neuron, the second column of the second hidden neuron and so on. The last column is the input signal weights of the output neuron. That means that the first row, first column entry of 1 is the weight of the connection for the first input neuron to the first hidden neuron — w1,1 from the sketch.
Also note, again, that we do not include a bias input here to make it as simple as possible.
This network encoding is directly useful: We can directly run a forward pass through it, using a slightly modified input vector and multiplying it with this matrix. Applying the activation function and iterating this for the output layer (because it is a layered network) we can attain the output:
1
2import numpy as np
3
4# ...
5
6def relu(activations):
7 '''Sample 'ReLU' activation function for illustration purposes'''
8 return np.maximum(activations, 0)
9
10def forward(model, batch, input_size, hidden_size, activation=relu, iterations=2):
11 '''Push a batch of input vectors (in rows) through an adjacency matrix (the 'model').
12 Since it is undefined how 'deep' the network is, we repeat the process `iterations`-times.
13 NB: This is unoptimized. Intermediate results from the output layer will be discarded.
14 Hidden layer activation function is assumed to be ReLU for illustration purposes.
15 No activation function is applied to the output layer nodes.'''
16
17 # helpers for formatting input vectors/matrices:
18 def left_pad_zero(mat, pad_size):
19 return np.pad(mat, ((0,0), (pad_size,0)), 'constant', constant_values=(0,0))
20
21 def right_pad_zero(mat, pad_size):
22 return np.pad(mat, ((0,0), (0,pad_size)), 'constant', constant_values=(0,0))
23
24 # initially, the hidden layer neuron activation is zero, so we pad the input vector(s)
25 input_batch = right_pad_zero(batch, hidden_size)
26
27 output_batch = None
28 for i in range(iterations):
29 # push each input vector through the adj matrix, computing the output of one layer
30 output_batch = np.matmul(input_batch, model)
31
32 # apply activation function (e.g. ReLU) on the hidden node outputs
33 output_batch[:,0:hidden_size] = activation(output_batch[:,0:hidden_size])
34
35 # create a new input batch for the next iteration (i.e. the next layer)
36 input_batch = left_pad_zero(output_batch[:,0:hidden_size], input_size)
37
38 # then rinse & repeat for the next layer/iteration
39
40 # finally return the values that are left in the output vector
41 return output_batch[:,hidden_size:]
This algorithm features an ‘iterations’ parameter, effectively specifying the depth of our network. Because we pack all matrices — that would usually be divided — into one, one matrix multiplication is only one ‘activation’. To activate one layer after another we need to do the multiplication another time, passing in the intermediate results as inputs. (This is not necessarily very efficient, so don’t use this approach in production. In fact it makes more sense if the network is recurrent.)
With this example, we have effectively encoded the network structure as seen in the sketch and its weights and encoded it in a way that lets us directly operate with it. Genotype matches phenotype 1-to-1 here. This is nice as it allows us to recombine and randomly adjust networks and their internal wiring simply by tweaking the matrix entries.
Since each entry in the matrix directly refers to the weight of one possible connection, this kind of encoding is called a direct encoding.
Another such direct encoding could be to simply list the nodes and connections in a fitting data structure, as has been done for general network topologies, weights and activation functions in [4]. This could then be translated back into an adjacency matrix to make use of it.
While flexible and simple, one obvious problem with direct encodings is the size of it and therefore the size of the space of possible genotypes that must be traversed in the evolutionary random walk. Another issue stemming from the size is the relatively low chance of finding genetic patterns/modules that can be exploited, refined and transfered between individuals. It is easier, for example, to first discover how to make several, very similar limbs and then discover how to put fingers on a limb (and repeat that for all other limbs).
NB: I suspect this could potentially be partially remediated by using specialized recombination and mutation operations that enhance the chance of occurring patterns, like some sophisticated gene duplication mechanism. But I couldn’t find any info on that in the literature I reviewed.
The pattern and symmetries we observe in nature, and the tiny size of our genetic code as compared to the complexity of our genotypes suggests that it would be more clever to use an indirect encoding instead. I.e. a construction plan or much shorter algorithm that eventually creates the final phenotype over the course of some sort of development.
Indirect encodings are a simpler, compressed format of specification. For example, our DNA encodes the basic shape of our brains without encoding each connection (it contains far to little data to do that).
Encodings like these influence which ways evolution can take in all kinds of subtle ways. Stanley et al. already mention this in [5]: “because indirect encodings do not map directly to their phenotypes, they can bias the search in unpredictable ways”. In the end, direct encodings are usually chosen precisely because indirect encodings make it really hard to reason about cause and effect of mutation and recombination.
Independent of whether the encoding is direct or indirect, a problem that both can suffer from is loss of useful traits over the course of evolution. This issue cannot be addressed by the encoding alone, but the encoding can be made more resilient towards this kind of feature regression. In e.g. NEAT [5], Stanley et al. overcome this in part by tracking the historic changes of genes and forcing speciation, i.e. forcing inidivuals to only reproduce with genetically similar ones.
To sum this up, a well designed genetic encoding can make it easier to find and exploit good solutions. Not only can it help to discover symmetries and inheritable ‘modules’ (think limbs), in the search space faster, it can also reduce the risk of loosing useful traits or introducing lethal mutations (as opposed to beneficial ones). Indirect encodings can potentially help this whole process by reducing the search space and biasing the search in useful ways.
How we select individuals of the population for reproduction is the next possible part that can be tweaked. Just to illustrate the point without diving too deep, this can range from randomly selecting individuals according to their fitness, to selecting individuals with very specific requirements, like say, only the top 10.
As one may guess, different approaches have different effects on the speed of convergence and stability of the algorithm, some aspects have been explored in this comparison paper by Goldberg et al. [6].
To continue our example from before, let us assume we have selected two individuals from the population, A and B. Their genome (i.e. the set of their genes) looks like this respectively:
Depending on the chosen encoding, recombination or ‘cross-over’ of genes can be defined in many different ways. Each of which, again, have different implications on exploration vs. exploitation.
An example could look like this: Take two genomes of the selected individuals. Align them in a way that pairs similar genes. Swap genes or sections of genes between them randomly. The length of the sections and how many are being exchanged would be parameters here.
Taking our selected genomes A and B and recombining them with this scheme might give us the following offspring:
Again, the possibilities are endless here and depend a lot on the encoding. A possible implementation of mutation could take a genome and introduce random changes to it. E.g. flip bits or randomly increase or decrease values slightly. The likelihood of occurring changes per gene could be a parameter here.
Take the resulting matrix from recombination, C and mutate it to gain, as an example, the following:
This resulting — new — individual then becomes part of the next generation, a new population, to be tested, selected and iterated.
As with hyperparameters of learning algorithms for standard artificial neural nets, any variant or parameter of influences 1.-4. need to be defined before the actual evolution process can start. The hope is, however, that viable parameters will be problem-domain agnostic. Instead of optimizing the training, we optimize the rate of evolution by adjusting these knobs beforehand, and then just reuse that exact same configuration later on.
In recent years, this seems to actually have been proven ‘in the wild’ to be a viable approach, as Uber’s AI dept have released several papers supporting neuroevolution in favor of back-propagation [7].
Generally speaking, solutions that can actually exploit symmetries and copy network features have a huge potential for generating architectures that approach brain structures. After all we know, the power of the human neocortex stems from a simple building block, a cortical column (or a variant thereof), that is being replicated many, many times. [8] So it may well be that training an ever evolving network on selected, hard problems will some day lead to it showing aspects of general intelligence. After all, this is more or less how we as humanity got to where we are now, biologically speaking.
Of course, all this generality means that there will be copious examples of where a very specific, engineered and back-prop trained network performs better with less resources invested in the training. This is a standard case of specialization vs. generalization. To give an analogous example, a manually coded algorithm in a modern programming language can be much more efficient and error proof in well-definable arithmetic and logic tasks compared to a trained neural network. A similar effect then applies to engineered neural nets vs. evolved ones.
It remains a task for you to decide (or test) which one is better suited for your problem domain.
All finished source documents, notebooks and code related to this is also available on Github. Please feel encouraged to leave feedback and suggest improvements.
If you’d like to support the creation of this and similarly fascinating articles, you can sign up for a medium membership and/or follow my account.
[1] R. Francis, “The stupidest nerve in the human body”, Medlife Crisis, https://www.youtube.com/watch?v=wzIXF6zy7hg
[2] R. Dawkins, “The Selfish Gene” (1976), Oxford University Press
[3] E. Yudkowski, et al. “The Simple Math of Evolution”, LessWrong, https://www.lesswrong.com/s/MH2b8NfWv22dBtrs8
[4] D. V. Vargas, J. Murata, “Spectrum-Diverse Neuroevolution With Unified Neural Models” (2019), IEEE Transactions on Neural Networks and Learning Systems
[5] K. O. Stanley, R. Miikkulainen, “Evolving Neural Networks through Augmenting Topologies” (2002), Evolutionary Computation 10 (2): 99–127
[6] D. E. Goldberg, D. Kalyanmoy, “A comparative analysis of selection schemes used in genetic algorithms.” (1991), Foundations of genetic algorithms Vol. 1: 69–93
[7] K. O. Stanley, “Welcoming the Era of Deep Neuroevolution” (2017), Uber, https://eng.uber.com/deep-neuroevolution/
[8] J. Hawkins, “A Thousand Brains: A New Theory of Intelligence” (2021), Basic Books
Let’s discuss how your data combined with machine learning technologies can increase the performance of your organization.