Generation 5: Artificial Intelligence Repository - How Do Genetic Algorithms Work?

#### How Do Genetic Algorithms Work?

This essay is heavily based on the relevant sections in "An Introduction to Genetic Algorithms" by Melanie Mitchell. This essay contains a great deal of mathematics; a decent high-school background in statistics, algebra, and logic is advised.

#### Schemata and Other Terminology

The pioneer of genetic algorithms (GA), John Holland, was the first person to fully describe GAs mathematically. GAs are designed to search for, discover, emphasize and breed good solutions (or "building blocks") to a problem in a highly parallel fashion. Holland invented the idea of schemas (or schemata) to formally conceptualize the notion of "building blocks."

Each schema is a template for a bit string of length l. Schemata are composed of ones and zeros (defining the bits), and asterisks ('*') that act as wild cards within the string. Therefore, the schema:

H = 1 * * * 0 *

represents all bit strings of length 6 that begin with '1' and have a '0' at the 5th bit position. Examples of bit strings that satisfy this are 111001, 100001, 111100, 111101, etc. These are called instances of the schema. The H in the equation stands for hyperplane: each schema corresponds to a hyperplane in the l-dimensional space of bit strings, where l is the number of bits in the string. To illustrate this principle, take the schema H = * * 0. Since this H lives in three-dimensional space, it can be pictured as a face of the unit cube: its instances are the four corners 000, 010, 100, and 110, i.e. those whose third coordinate is 0.

Other traits of a schema include its defined bits and its defining length. The defined bits are the non-wildcard positions of the schema - in the above example there are 2 defined bits. The defining length is the distance between the two outermost defined bits - in the above example, the defining length is 4.
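These definitions translate directly into code; here is a minimal sketch in Python (the function names are my own, not from the essay):

```python
def is_instance(schema, s):
    """True if bit string s matches the schema ('*' is a wild card)."""
    return len(schema) == len(s) and all(
        c == '*' or c == b for c, b in zip(schema, s))

def order(schema):
    """o(H): the number of defined (non-wildcard) bits."""
    return sum(c != '*' for c in schema)

def defining_length(schema):
    """d(H): distance between the outermost defined bits."""
    defined = [i for i, c in enumerate(schema) if c != '*']
    return defined[-1] - defined[0]

H = "1***0*"
assert is_instance(H, "111001") and not is_instance(H, "011001")
assert order(H) == 2 and defining_length(H) == 4
```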

Sets and Subsets
Note that most subsets of the set of length-l bit strings cannot be described as a schema. For a bit string of length l, there are 2^l possible strings, 2^(2^l) possible subsets of those strings, yet only 3^l schemata (each position is a '0', a '1', or a wildcard). Therefore, for a bit string with 4 bits, there are 16 possible bit strings, an amazing 65,536 subsets, yet only 81 schemata to represent them.
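The counts for l = 4 are easy to confirm:

```python
l = 4
num_strings = 2 ** l             # distinct bit strings of length l
num_subsets = 2 ** num_strings   # subsets of the set of all such strings
num_schemata = 3 ** l            # each position is '0', '1', or '*'
assert (num_strings, num_subsets, num_schemata) == (16, 65536, 81)
```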

#### Processing

Remembering that any bit string is a member of 2^l different schemata, look at the example Mitchell uses. The length-2 bit string '11' is an instance of **, 1*, *1, and 11. Now, in a population of n strings it is easy to see that you will have anything between 2^l and n × 2^l different schemata.
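The 2^l schemata that a given string instantiates can be enumerated by choosing, at each position, either the bit itself or a wildcard; a short sketch:

```python
from itertools import product

def schemata_of(s):
    """All schemata that the bit string s is an instance of."""
    options = [(c, '*') for c in s]          # each position: keep bit or '*'
    return {''.join(t) for t in product(*options)}

assert schemata_of("11") == {"**", "1*", "*1", "11"}
assert len(schemata_of("11")) == 2 ** len("11")
```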

It can therefore be said that while the GA explicitly evaluates n strings during any given generation, it is implicitly evaluating a much larger number of schemata. Think of half of the population being instances of the schema '1 * * ... *' and the other half of '0 * * ... *'. Now, after a given generation has been run, you will have an estimate of the average fitness of those schemata (it will be quite an erroneous estimate, since the population actually evaluated is only a tiny sample of each schema). These estimated averages are obviously not stored explicitly, since the schemata themselves are never explicitly evaluated. Where this idea helps is in looking at the increase or decrease of a given schema in a population, because the GA behaves as though it were storing these implicitly calculated values.

The Dynamics of a Schema
The dynamics of a schema can be approximated as follows. Let H be a schema with at least one instance in the population at time t. Let m(H,t) be the number of instances of H at time t, and let û(H,t) be the observed average fitness of H at time t (this is the estimate described in the above section). Now, we want to calculate the expected number of instances of schema H in the next generation. Writing E(x) for the expected value of x, we want to find E(m(H,t+1)).

Now, the expected number of instances at t+1 is equal to the expected number of offspring produced by the current instances of H. Therefore, in order to calculate further we need to clarify what form of selection process we will use. Assume that we use a fitness-proportionate selection method (this method is used in the GA example); then the expected number of offspring of a string x is equal to the fitness of the string divided by the average fitness of the generation, or:

E(offspring of x) = f(x) / f̄(t)

where f(x) is the fitness of x and f̄(t) is the average fitness of the population at time t.

So, summing over every x ∈ H (every x that is an instance of H), we have the equation:

E(m(H,t+1)) = Σ(x∈H) f(x) / f̄(t)

Now, since by definition û(H,t) = (Σ(x∈H) f(x)) / m(H,t), we come up with the final equation:

E(m(H,t+1)) = (û(H,t) / f̄(t)) · m(H,t)     (1.1)
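The relationship between the sum over instances and the schema's average fitness can be verified on a tiny example population (the strings and fitness values below are made up for illustration):

```python
population = ["110", "100", "011", "111"]
fitness = {"110": 4.0, "100": 2.0, "011": 1.0, "111": 5.0}

def matches(schema, s):
    return all(c == '*' or c == b for c, b in zip(schema, s))

H = "1**"
f_bar = sum(fitness[x] for x in population) / len(population)
instances = [x for x in population if matches(H, x)]
m = len(instances)
u_hat = sum(fitness[x] for x in instances) / m

# Each instance x contributes f(x)/f_bar expected offspring; summing them
# gives the same value as the compact form u_hat/f_bar * m.
expected_direct = sum(fitness[x] / f_bar for x in instances)
expected_formula = u_hat / f_bar * m
assert abs(expected_direct - expected_formula) < 1e-12
```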

There we have our basic formula for tracking the dynamics of a schema in a genetic algorithm. Genetic algorithms are not this simple, though, since they also simulate the effects of mutation and crossover. These two operators have both constructive and destructive effects - but for the moment, concentrate only on the destructive effects.

Compensating for Destructive Effects
Firstly, let pc be the probability that crossover will be applied to a given string in the population, and assume that an instance of schema H is picked to be a parent. Schema H is said to have "survived" if one of the offspring is still an instance of H. For example, take the schema "1 1 * * * *" and two strings, 110010 and 010100. Note that the first string is an instance of the schema; the second is not. Now, if a single-point crossover is applied after the 4th bit, you get the following offspring - 110000 and 010110. The first child is still an instance of the schema, thus it has survived.

Given Sc(H) as the probability of schema H surviving a single-point crossover, d(H) as the defining length, and l as the length of the bit string, you can create the equation:

Sc(H) ≥ 1 - pc · (d(H) / (l - 1))

A breakdown of this equation is necessary. The term d(H)/(l-1) is the fraction of crossover points that fall between the schema's outermost defined bits. For example, if we have a defining length of 3 in a string of length 5, then 3 of the 4 possible crossover points (75%) fall inside schema H. Now, we multiply that by pc, the probability that crossover is applied at all. This gives us an upper bound on the probability that the schema will be destroyed - an upper bound because, as the example above showed, even a crossover inside the schema can leave an instance intact. We want a lower bound on survival, though, so the upper bound is subtracted from 1 (remember that cumulative probabilities equal 1).
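The lower bound can be checked by simulation. The sketch below assumes random mates for the example schema; the seed and trial count are arbitrary choices:

```python
import random
random.seed(0)

l, p_c = 6, 0.7
H = "11****"                                  # defining length d(H) = 1

def matches(schema, s):
    return all(c == '*' or c == b for c, b in zip(schema, s))

def offspring(parent, mate):
    """One child of a possible single-point crossover (parent keeps its head)."""
    if random.random() < p_c:
        c = random.randint(1, l - 1)          # l-1 possible crossover points
        return parent[:c] + mate[c:]
    return parent

trials = 100_000
parent = "110010"                             # an instance of H
survived = sum(
    matches(H, offspring(parent, ''.join(random.choice("01") for _ in range(l))))
    for _ in range(trials))
lower_bound = 1 - p_c * 1 / (l - 1)           # Sc(H) >= 1 - pc*d(H)/(l-1)
assert survived / trials >= lower_bound
```

The observed survival rate sits above the bound because a crossover falling between the defined bits only destroys the schema when the mate supplies a non-matching bit.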

Now that crossover has been addressed, we must look at mutation. Let pm be the probability that a bit will be mutated, and let Sm(H) be the probability that the schema H will survive mutation. In order to survive, the schema must remain intact; therefore Sm(H) must be equal to the probability that a bit will NOT be mutated (1 - pm) raised to the power of the number of defined bits in the schema, or:

Sm(H) = (1 - pm)^o(H)

where o(H) is the number of defined bits (or order) of the schema. From these two calculations you can see that the shorter and lower-order the schema, the greater chance it has to survive. By applying these effects to equation 1.1, you get the Schema Theorem first derived by Holland:

E(m(H,t+1)) ≥ (û(H,t) / f̄(t)) · m(H,t) · [1 - pc · (d(H) / (l - 1))] · (1 - pm)^o(H)     (1.2)
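The Schema Theorem translates directly into a bound function; a sketch with illustrative parameter values (not from the essay):

```python
def schema_theorem_bound(m, u_hat, f_bar, d, o, l, p_c, p_m):
    """Lower bound on E[m(H, t+1)] from Holland's Schema Theorem."""
    selection = (u_hat / f_bar) * m                # equation 1.1
    crossover_survival = 1 - p_c * d / (l - 1)     # Sc(H) lower bound
    mutation_survival = (1 - p_m) ** o             # Sm(H)
    return selection * crossover_survival * mutation_survival

# A short, low-order schema with above-average fitness is expected to grow:
bound = schema_theorem_bound(m=10, u_hat=1.2, f_bar=1.0,
                             d=1, o=2, l=20, p_c=0.7, p_m=0.001)
assert bound > 10
```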

Hopefully, you now fully understand the mathematics behind genetic algorithms. For an analysis of what equation 1.2 implies, see the relevant section in Melanie Mitchell's book An Introduction to Genetic Algorithms. Note that this equation only deals with the destructive effects of crossover and mutation; in fact, crossover is regarded as one of the chief sources of the GA's power, and mutation is often seen as an insurance policy against a loss of diversity. It is the GA's ability to implicitly evaluate many solutions at once that makes it so powerful.

#### Mathematical Models

The Schema Theorem only shows how schemata dynamically change, and how short, low-order schemata whose fitness remains above the population average receive exponentially increasing numbers of samples. It cannot make more direct predictions about population composition, the distribution of fitness, and other statistics more directly related to the genetic algorithm itself.

We are now going to look at the simplest mathematical model of a GA, as derived by Michael Vose and Gunar Liepins. This is a model of a genetic algorithm like the one used in the sample. The algorithm is as follows:

1. Calculate the fitness of each string.
2. Choose two parents using fitness-proportionate selection.
3. Use a single-point crossover, use only one of the offspring.
4. Mutate each bit with probability pm, and place in population.
5. Go to step 2 until population complete.
6. Go to step 1.
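The six steps above can be sketched as a minimal Python program. The one-max fitness function (count the ones), population size, and rate parameters below are illustrative assumptions, not values from the essay:

```python
import random
random.seed(1)

L_BITS, POP, P_C, P_M, GENS = 10, 20, 0.7, 0.01, 30

def fitness(s):
    """Step 1 (toy problem): count of ones, plus a tiny offset so that
    every selection weight stays positive."""
    return s.count("1") + 1e-9

def select(pop):
    """Step 2: fitness-proportionate selection."""
    return random.choices(pop, weights=[fitness(x) for x in pop])[0]

pop = ["".join(random.choice("01") for _ in range(L_BITS)) for _ in range(POP)]
for _ in range(GENS):                              # step 6: next generation
    new_pop = []
    while len(new_pop) < POP:                      # step 5: fill population
        a, b = select(pop), select(pop)
        child = a
        if random.random() < P_C:                  # step 3: single-point
            c = random.randint(1, L_BITS - 1)      # crossover, keep one child
            child = a[:c] + b[c:]
        child = "".join(                           # step 4: mutate each bit
            ("1" if bit == "0" else "0") if random.random() < P_M else bit
            for bit in child)
        new_pop.append(child)
    pop = new_pop

best = max(pop, key=fitness)
```

After a few dozen generations the best string should comfortably beat the random-string average of L_BITS/2 ones.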
As always, this applies to a random population of binary strings of length l. In the model that Vose and Liepins used, the binary strings simply represent binary numbers; therefore, when l = 8, 00000101 = 5. The population of strings is represented at any generation t by two vectors, p(t) and s(t) (the arrows marking them as vectors are omitted to keep the HTML simple). The ith component of p(t), written pi(t), denotes the proportion of the population at time t consisting of string i. The ith component of s(t), written si(t), denotes the probability that string i is selected for mating. Note that both p(t) and s(t) are column vectors even though they are written here as row vectors. It is necessary to know this because the fitnesses are represented in a two-dimensional matrix F. F is structured such that Fi,j is zero for all elements, except when i = j; in that case, the element is the fitness of string i. Therefore, using the definition of fitness-proportionate selection, we get:

s(t) = F p(t) / Σj Fj,j pj(t)
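The selection equation can be illustrated with a toy diagonal F; the fitness values here are arbitrary:

```python
# Strings of length l = 2, indexed i = 0..3 as the binary numbers 00..11.
fitnesses = [1.0, 2.0, 3.0, 4.0]   # the diagonal entries F[i][i]
p = [0.25, 0.25, 0.25, 0.25]       # p(t): a uniform population

# s(t) = F p(t) / sum_j F[j][j] * p_j(t)   (fitness-proportionate selection)
Fp = [f * pi for f, pi in zip(fitnesses, p)]
s = [x / sum(Fp) for x in Fp]

assert abs(sum(s) - 1.0) < 1e-12   # s(t) is a probability vector
assert all(abs(a - b) < 1e-12 for a, b in zip(s, [0.1, 0.2, 0.3, 0.4]))
```

Fitter strings get proportionally larger selection probabilities, which is exactly what the matrix form expresses.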

Now that p(t) and s(t) can be defined in terms of each other, we must find an operator G that acts like a genetic algorithm, meaning that if G were applied to s(t), it would mimic the effects of running a genetic algorithm through one generation:

G s(t) = s(t+1)

Now, use the idea that, under selection alone, the expected proportion of each string in the population at t+1 equals its probability of being selected for mating. Also, let's define x ~ y to mean 'x differs from y by a scalar factor'. We now have:

E(p(t+1)) = s(t)

E(s(t+1)) ~ F E(p(t+1)) = F s(t)

The last equation is the type of relationship we are looking for - yet due to statistical problems, and the lack of crossover and mutation in the equation, we have to complicate matters further (a lot further!). Plus, we still haven't figured G into the equation. The way Vose and Liepins continued was by defining G as the composition of the fitness matrix F and a recombination operator (we'll call this ß). ß mimics both crossover and mutation, hence the term 'recombination.'

The question now is how to define ß. Vose and Liepins defined it by finding ri,j(k), the probability that string k will be produced by recombination between string i and string j. Therefore, once ri,j(k) is known we can see that:

E(pk(t+1)) = Σi,j si(t) sj(t) ri,j(k)

This equation shows that the expected value of the proportion of string k in the next generation is equal to the "probability that it will be produced by each given pair of parents, times by those parents' probabilities of being selected, summed over all possible pairs of parents."

We still have to define both ri,j(k) and ß, though. Vose and Liepins first defined a simpler matrix M whose element at i,j gives the probability ri,j(0) that string 0 (all bits 0) would be produced through recombination of i and j.

ri,j(0) is equal to the sum of the probability that crossover does not occur between i and j and the selected offspring is mutated to all zeros, and the probability that crossover does occur and the selected offspring is mutated to all zeros. The probability that crossover occurs between i and j is pc (thus the probability that it doesn't is 1 - pc). The same goes for mutation: pm is the probability a bit is mutated, 1 - pm the probability it isn't. Let's define |i| as the number of ones in the bit string i of length l. Therefore, the probability that i will be mutated to all zeros is:

pm^|i| · (1 - pm)^(l - |i|)

that is, the probability that all of the |i| ones are mutated, multiplied by the probability that none of the (l - |i|) zeros are mutated. With this in mind, the first term in the expression for ri,j(0) equates to:

((1 - pc) / 2) · [ pm^|i| (1 - pm)^(l - |i|) + pm^|j| (1 - pm)^(l - |j|) ]

|j| is defined the same way as |i|, only for string j. Note that the factor of 1/2 is included since only one of the children survives the mating process. Now, for the second term, let h and k denote the two offspring produced from a single-point crossover at point c (counted from the right-hand side, as with binary digits). There are, therefore, l - 1 crossover points, each having the probability 1/(l - 1) of being chosen. Therefore, the second term can be written as:

(pc / (2(l - 1))) · Σc=1..l-1 [ pm^|h| (1 - pm)^(l - |h|) + pm^|k| (1 - pm)^(l - |k|) ]

All that is missing now is a definition of h and k. So let i1 be the substring of i consisting of the l - c bits to the left of point c, and i2 the substring of i to the right of c (and similarly for j). Therefore |h| = |i| - |i2| + |j2| and |k| = |j| - |j2| + |i2|, which lets you rewrite the second term entirely in terms of i and j.

Now, we can finally define ri,j(0). To simplify the equation, let's define η = pm/(1 - pm). After a little algebraic manipulation we get the final equation:

ri,j(0) = ((1 - pm)^l / 2) · [ (1 - pc)(η^|i| + η^|j|) + (pc / (l - 1)) · Σc=1..l-1 (η^|h| + η^|k|) ]
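The closed form can be checked against the unsimplified two-term expression; a sketch in Python, where the test strings and parameter values are arbitrary inputs:

```python
def p_zero(s, p_m):
    """Probability that mutation turns string s into all zeros."""
    ones = s.count("1")
    return p_m ** ones * (1 - p_m) ** (len(s) - ones)

def r0_direct(i, j, p_c, p_m):
    """r_{i,j}(0) summed term by term, before any simplification."""
    l = len(i)
    no_cross = (1 - p_c) / 2 * (p_zero(i, p_m) + p_zero(j, p_m))
    cross = 0.0
    for c in range(1, l):                    # crossover point c, from the right
        h = i[:l - c] + j[l - c:]
        k = j[:l - c] + i[l - c:]
        cross += p_zero(h, p_m) + p_zero(k, p_m)
    return no_cross + p_c / (2 * (l - 1)) * cross

def r0_formula(i, j, p_c, p_m):
    """The simplified closed form, with eta = p_m / (1 - p_m)."""
    l, eta = len(i), p_m / (1 - p_m)
    ones = lambda s: s.count("1")
    cross = sum(
        eta ** ones(i[:l - c] + j[l - c:]) + eta ** ones(j[:l - c] + i[l - c:])
        for c in range(1, l))
    return (1 - p_m) ** l / 2 * (
        (1 - p_c) * (eta ** ones(i) + eta ** ones(j))
        + p_c / (l - 1) * cross)

a, b = "1010", "0110"
assert abs(r0_direct(a, b, 0.7, 0.05) - r0_formula(a, b, 0.7, 0.05)) < 1e-12
```

The two functions agree because pm^|x|(1 - pm)^(l - |x|) factors as (1 - pm)^l · η^|x|.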

We now finally have ri,j(0); we need to use this to come up with ß and finally G. By use of logical operators and permutations, and a lot of manipulation, you can express ß in terms of M.

Defining |v| as the sum of a vector's elements, we can define Gp, the version of G that acts on p(t):

Gp(p(t)) = ß(F p(t) / |F p(t)|)

Now, although G and Gp act on different representations of the population, they can be translated into one another through a simple transformation.

Finally, we are at the end of this long essay! Hopefully you've got a much better understanding of the theory and mathematics behind genetic algorithms. Note that the math we looked at only applies to the simplest of genetic algorithms (and one with an infinite population); more complicated models, and explanations much better than my own, can be found in Melanie Mitchell's book An Introduction to Genetic Algorithms.