1 Popular EA Variants

Now, we’ll describe some of the most widely known evolutionary algorithm variants.

Historical EA variants:
* Genetic Algorithms
* Evolution Strategies
* Evolutionary Programming
* Genetic Programming

More recent versions:
* Learning Classifier Systems
* Differential Evolution
* Particle Swarm Optimization
* Estimation of Distribution Algorithms
* Evolutionary Robotics
* Neuroevolution
* Many more!

1.2 Genetic Algorithms

The genetic algorithm (GA) is the most widely known type of evolutionary algorithm.
GAs have largely (if mistakenly) been considered as function optimization methods.

1.2.1 Quick Overview

1.2.2 SGA technical summary

GAs employ:
a binary representation,
fitness proportionate selection,
a low probability of mutation,
an emphasis on genetically inspired recombination as a means of generating new candidate solutions.

Representation Bit-strings
Recombination 1-Point crossover
Mutation Bit flip
Parent selection Fitness proportional, implemented by Roulette Wheel
Survivor selection Generational

1.2.3 SGA reproduction cycle

1. Select parents for the mating pool (size of mating pool = population size).
2. Shuffle the mating pool.
3. Apply crossover to each consecutive pair with probability pc; otherwise copy the parents.
4. Apply mutation to each offspring (bit-flip with probability pm, independently for each bit).
5. Replace the whole population with the resulting offspring.

In detail:
GAs traditionally have a fixed work-flow:
given a population of μ individuals,
parent selection fills an intermediary population of μ,
allowing duplicates.
Then the intermediary population is shuffled to create random pairs,
and crossover is applied to each consecutive pair with probability pc,
and the children replace the parents immediately.
The new intermediary population undergoes mutation individual by individual,
where each of the l bits in an individual is modified by mutation with independent probability pm.
The resulting intermediary population forms the next generation replacing the previous one entirely.
Note that in this new generation there might be pieces,
perhaps complete individuals,
from the previous one that survived crossover and mutation without being modified,
but the likelihood of this is rather low (depending on the parameters μ, pc, pm).
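The cycle above can be sketched in Python. This is a minimal illustration, not a tuned implementation: the list-of-bit-lists encoding, an even μ, and strictly positive fitness values are assumptions made here.

```python
import random

def roulette_select(pop, fits):
    """Fitness-proportional selection of one parent (roulette wheel)."""
    total = sum(fits)
    pick = random.uniform(0, total)
    acc = 0.0
    for ind, f in zip(pop, fits):
        acc += f
        if acc >= pick:
            return ind
    return pop[-1]

def sga_generation(pop, fitness, pc=0.7, pm=None):
    """One SGA cycle: fill the mating pool, shuffle, cross pairs, mutate bits.
    Assumes an even population size and strictly positive fitness values."""
    mu, l = len(pop), len(pop[0])
    pm = pm if pm is not None else 1.0 / l          # common default: 1/l
    fits = [fitness(ind) for ind in pop]
    mating = [roulette_select(pop, fits) for _ in range(mu)]
    random.shuffle(mating)
    offspring = []
    for a, b in zip(mating[::2], mating[1::2]):
        if random.random() < pc:                    # 1-point crossover
            cut = random.randint(1, l - 1)
            a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
        offspring += [a, b]
    # bit-flip mutation, independently for each bit
    return [[bit ^ (random.random() < pm) for bit in ind] for ind in offspring]
```

Note that the whole population is replaced at once, matching the generational survivor-selection row of the summary table.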

1.2.4 An example after Goldberg ’89

1.2.5 X2 example

We covered this previously.

[Figures: selection, crossover, and mutation illustrations]


1.2.6 The simple GA

Has been subject of many (early) studies.
In the early years of the field,
there was significant attention paid to trying to establish suitable values for GA parameters,
such as the population size, crossover and mutation probabilities.
Recommendations were for mutation rates between 1/l and 1/μ,
crossover probabilities around 0.6-0.8,
and population sizes in the fifties or low hundreds,
although to some extent these values reflect the computing power available in the 1980s and 1990s.

Weaknesses

It has been recognized that there are some flaws in the SGA.
It shows several shortcomings, e.g.:
* The representation is too restrictive.
* The mutation and crossover operators are only applicable to bit-string and integer representations.
* The selection mechanism is sensitive to converging populations with close fitness values.
* The generational population model can be improved with explicit survivor selection.

New features

Various features have been added since the first SGA.
Factors such as elitism, and non-generational models were added to offer faster convergence if needed.
SUS is preferred to roulette wheel implementation,
and most commonly rank-based selection is used,
implemented via tournament selection for simplicity and speed.
Studying the biases in the interplay between representation and one-point crossover led to the development of alternatives,
such as uniform crossover,
and a stream of work through ‘messy-GAs’ and ‘Linkage Learning’,
to Estimation of Distribution Algorithms.
Analysis and experience have recognised the need to use non-binary representations where more appropriate.
The problem of how to choose a suitable fixed mutation rate has largely been solved by adopting the idea of self-adaptation,
where the rates are encoded as extra genes in an individual’s representation and allowed to evolve.

1.3 Evolution Strategies

This one was actually intended as a numerical optimization technique.

1.3.1 Quick overview

1.3.2 ES technical summary table

Representation Real-valued vectors
Recombination Discrete or intermediary
Mutation Gaussian perturbation
Parent selection Uniform random
Survivor selection (μ,λ) or (μ+λ)

1.3.3 (1+1) ES

1.3.4 Adaptive mutation mechanism

1.3.5 Illustration of normal distribution


1.3.6 Historical example: the jet nozzle experiment

The original nozzle:
To actually build it:

1.3.7 Representation

ES works with extended chromosomes \(\langle\bar{x}, \bar{p}\rangle\),
where \(\bar{x}\) ∈ Rn is a vector from the domain of the given objective function to be optimized,
while \(\bar{p}\) carries the algorithm parameters.

1.3.8 Recombination

1.3.9 Names of recombinations

                                     | Two fixed parents  | Two parents selected anew for each i
z_i = (x_i + y_i)/2                  | Local intermediary | Global intermediary
z_i is x_i or y_i, chosen randomly   | Local discrete     | Global discrete
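The four variants in the table can be sketched as follows. This is a simplified illustration: offspring-surplus creation and strategy-parameter handling are omitted.

```python
import random

def es_recombine(pop, mode="global", kind="discrete"):
    """ES recombination per the table: 'local' keeps two fixed parents for
    the whole child; 'global' redraws the two parents for every position i."""
    n = len(pop[0])
    x, y = random.sample(pop, 2)          # the (initially chosen) parents
    child = []
    for i in range(n):
        if mode == "global":              # redraw parents for this position
            x, y = random.sample(pop, 2)
        if kind == "intermediary":
            child.append((x[i] + y[i]) / 2)
        else:                             # discrete: pick one allele at random
            child.append(random.choice((x[i], y[i])))
    return child
```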

1.3.10 Parent selection

Parents are selected by uniform random distribution whenever an operator needs one/some.
Thus, ES parent selection is unbiased.
Every individual has the same probability to be selected.

1.3.11 Self-adaptation illustrated

1.3.12 Prerequisites for self-adaptation

μ > 1 to carry different strategies.
λ > μ to generate offspring surplus.
(μ,λ)-selection to get rid of mis-adapted σ’s.
Mixing strategy parameters by (intermediary) recombination on them.
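A minimal sketch of the mutation step with one self-adapted step size σ per individual. The lognormal update with learning rate τ ≈ 1/√n is the standard scheme; the lower bound on σ is an assumed safeguard, not part of the original notes.

```python
import math, random

def mutate_self_adaptive(x, sigma, tau=None):
    """Uncorrelated ES mutation with one step size: sigma mutates first
    (lognormal self-adaptation), then perturbs every coordinate."""
    n = len(x)
    tau = tau if tau is not None else 1 / math.sqrt(n)
    sigma2 = sigma * math.exp(tau * random.gauss(0, 1))
    sigma2 = max(sigma2, 1e-8)            # guard against vanishing step sizes
    x2 = [xi + sigma2 * random.gauss(0, 1) for xi in x]
    return x2, sigma2
```

With (μ,λ)-selection, mis-adapted σ values die out together with the poor candidate solutions they produce.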

1.3.13 Selection Pressure

Is quite high for ES.
Takeover time τ* is a measure to quantify the selection pressure:
the number of generations it takes until repeated application of selection completely fills the population with copies of the best individual.
Goldberg and Deb showed that,
for proportional selection in a genetic algorithm, the takeover time is λ ln(λ).

1.3.14 Example application: The cherry brandy experiment

Task: to create a color mix yielding a target color (that of a well known cherry brandy).
Ingredients: water + red, yellow, blue dye
Representation: <w, r, y, b> no self-adaptation!
Values scaled to give a predefined total volume (30 ml).
Mutation: lo / med / hi values of σ used with equal chance.
Selection: (1,8) strategy
Fitness: students actually making the mix and comparing it with the target color
Termination criterion: student satisfied with mixed color
Solution is found mostly within 20 generations
Accuracy is very good.

1.3.15 Example application: The Ackley function (Bäck et al ’93)

The Ackley function is widely used for testing optimization algorithms,
much as image datasets or OpenAI Gym are used now.
In its two-dimensional form,
it is characterized by a nearly flat outer region,
and a large hole at the centre.
The function poses a risk for optimization algorithms,
particularly hillclimbing algorithms,
to be trapped in one of its many local minima.
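For reference, a direct Python transcription of the n-dimensional Ackley function, using the commonly cited constants a = 20, b = 0.2, c = 2π:

```python
import math

def ackley(x, a=20.0, b=0.2, c=2 * math.pi):
    """Ackley test function: nearly flat outer region, deep hole at the
    origin, many local minima; global minimum f(0, ..., 0) = 0."""
    n = len(x)
    s1 = sum(xi * xi for xi in x) / n
    s2 = sum(math.cos(c * xi) for xi in x) / n
    return -a * math.exp(-b * math.sqrt(s1)) - math.exp(s2) + a + math.e
```

The cosine term produces the regular grid of local minima that traps hillclimbers.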

1.4 Evolutionary Programming


1.4.1 Quick overview

1.4.2 Technical summary table

Representation Real-valued vectors
Recombination None
Mutation Gaussian perturbation
Parent selection Deterministic (each parent one offspring)
Survivor selection Probabilistic (μ+μ)

1.4.3 Historical EP perspective

EP aimed at achieving intelligence
Intelligence was viewed as adaptive behavior
Prediction of the environment was considered a prerequisite to adaptive behavior
Thus: capability to predict is key to intelligence

1.4.4 Prediction by finite state machines


The original paper:
Fogel, L.J., Owens, A.J., Walsh, M.J.
Artificial Intelligence Through Simulated Evolution
New York: John Wiley, 1966
A discussion of it:

A population of FSMs is exposed to the environment, that is,
the sequence of symbols that has been observed up to the current time.
For each parent machine,
as each input symbol is presented to the machine,
the corresponding output symbol is compared with the next input symbol.
The worth of this prediction is then measured with respect to the payoff function
(e.g., all-none, squared error).
After the last prediction is made,
a function of the payoff for the sequence of symbols indicates the fitness of the machine or program
(e.g., average payoff per symbol).
Offspring machines are created by randomly mutating the parents,
and are scored in a similar manner.
Those machines that provide the greatest payoff are retained,
to become parents of the next generation,
and the process iterates.
When new symbols are to be predicted,
the best available machine serves as the basis for making such a prediction,
and the new observation is added to the available database.

1.4.5 General FSM example

Consider the FSM with:
S = {A, B, C}
I = {0, 1}
O = {a, b, c}
δ given by a diagram (above).

1.4.6 FSM as predictor

Consider the following FSM
Task: predict next input
Quality (prediction accuracy): % of positions where out_i = in_(i+1)
Given initial state C
Input sequence 011101
Leads to output 110111
Quality: 3 out of 5
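The scoring mechanism can be illustrated generically. Since the transition diagram from the slides is not reproduced here, the example machine below is a made-up one that simply predicts “the same symbol again”; only the scoring rule (compare out_i with in_(i+1)) is taken from the text.

```python
def fsm_quality(delta, state, inputs):
    """Run an FSM predictor over `inputs`; each output out_i is scored
    against the next input in_(i+1). delta: (state, symbol) -> (next, out)."""
    outputs = []
    for sym in inputs:
        state, out = delta[(state, sym)]
        outputs.append(out)
    hits = sum(o == nxt for o, nxt in zip(outputs, inputs[1:]))
    return hits, len(inputs) - 1

# A hypothetical one-state machine that always predicts "same symbol again"
delta = {('S', s): ('S', s) for s in '01'}
```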

1.4.7 Evolving FSMs to predict primes

1.4.8 Modern EP

No predefined representation in general.
Thus, no predefined mutation (must match representation).
Often applies self-adaptation of mutation parameters.

1.4.9 Representation

1.4.10 Mutation

Since the 1990s,
EP variants for optimization of real-valued parameter vectors have become more frequent,
and even positioned as “standard” EP.
During the history of EP a number of mutation schemes,
such as one in which the step size is inversely related to the fitness of the solutions,
have been proposed.
Since the proposal of meta-EP,
self-adaptation of step sizes has become the norm.
Other ideas from ES have also informed the development of EP algorithms,
and a version with self-adaptation of covariance matrices, called R-meta-EP is also in use.
Worthy of note is Yao’s improved fast evolutionary programming algorithm (IFEP),
whereby two offspring are created from each parent,
one using a Gaussian distribution to generate the random mutations,
and the other using the Cauchy distribution.
The latter has a fatter tail
(i.e., more chance of generating a large mutation),
which the authors suggest gives the overall algorithm greater chance of escaping from local minima,
whilst the Gaussian distribution (if small step sizes evolve),
gives greater ability to fine-tune the current parents.
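The two-offspring idea can be sketched as follows. Cauchy sampling is done via the inverse CDF; IFEP’s self-adaptive handling of step sizes is simplified to a fixed σ here.

```python
import math, random

def gaussian_step(sigma):
    return sigma * random.gauss(0, 1)

def cauchy_step(sigma):
    """Standard Cauchy via inverse CDF; fat tails => occasional huge jumps."""
    return sigma * math.tan(math.pi * (random.random() - 0.5))

def ifep_offspring(x, sigma):
    """IFEP idea: from each parent, one child by Gaussian mutation and one
    by Cauchy mutation; the fat-tailed child helps escape local minima."""
    g = [xi + gaussian_step(sigma) for xi in x]
    c = [xi + cauchy_step(sigma) for xi in x]
    return g, c
```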

1.4.11 Recombination

Little to none, though it was present:
One point in the search space stands for a species,
not for an individual, and there can be no crossover between species.
Much historical debate “mutation vs. crossover”.
The issue of using a mutation-only algorithm versus a recombination and mutation variant has been intensively discussed since the 1990s.
“the traditional practice of setting operator probabilities at constant values
… is quite limiting and may even prevent the successful discovery of suitable solutions.”

1.4.12 Parent selection

1.4.13 Survivor selection

1.4.14 Evolving checkers players

Skim paper, figure of ANN at end.

Tournament size q = 5
Programs (with NN inside) play against other programs,
no human trainer or hard-wired intelligence.
After 840 generations (6 months!),
best strategy was tested against humans via Internet.
Program earned “expert class” ranking,
outperforming 99.61% of all rated players.

The checkers player is a kind of neuroevolution:
evolutionary algorithms generate artificial neural networks (ANNs),
their parameters, and their rules.
Neuroevolution is commonly used as part of the reinforcement learning paradigm,
and it can be contrasted with conventional deep learning techniques,
which use gradient descent on a neural network with a fixed topology.

Discussion question:
What happened to the field of EP’s evolving FSMs?

1.5 Genetic Programming


1.5.1 History

The first record of the proposal to evolve programs is probably that of Alan Turing in the 1950s.

Non-GP EC grew in the 60s and 70s,
but despite this progress,
there was a gap of some thirty years,
before Richard Forsyth demonstrated the successful evolution of small programs,
represented as trees,
to perform classification of crime scene evidence for the UK Home Office!

Although the idea of evolving programs,
initially in the computer language LISP,
was discussed among John Holland’s students,
it was not until they organized the first Genetic Algorithms conference in Pittsburgh,
that Nichael Cramer published evolved programs in two specially designed languages.

In 1988, another PhD student of John Holland,
John Koza, patented his invention of a GA for program evolution:
“Non-Linear Genetic Algorithms for Solving Problems”,
“United States Patent 4935877”, 1990

This was followed by publication in the International Joint Conference on Artificial Intelligence IJCAI-89.

Then, Koza followed this with 205 publications on “Genetic Programming” (GP).
GP was a name coined by David Goldberg, also a PhD student of John Holland.
There was an enormous expansion of the number of publications.
In 2010, Koza listed 77 results where Genetic Programming was human competitive.

1.5.2 Quick overview

1.5.3 Theory

Genetic programming is a relatively young member of the evolutionary algorithm family.
EAs discussed so far are typically applied to optimization problems.
Most other EAs find some input, realizing maximum payoff.

GP could instead be positioned in machine learning.
GP is used to seek models with maximum fit.
Once maximization is introduced,
modeling problems can be seen as special cases of optimization.
This is the basis of using evolution for such tasks:
models, represented as parse trees,
are treated as individuals,
and their fitness is the model quality to be maximized.

Discussion question:
Might GP subsume other fields?
Does GP suffice in all cases where evolving an FSM might?
Would you guess GP has any disadvantages compared to simpler representations?

1.5.4 Basics

The parse trees used by GP as chromosomes are expressions in a given formal syntax.
Depending on the problem at hand,
and the users’ perceptions on what the solutions must look like,
this can be the syntax of arithmetic expressions,
formulas in first-order predicate logic,
or code written in a programming language.
In particular, they can be envisioned as executable source code,
that is, programs.
The syntax of functional programming,
e.g., the language LISP,
very closely matches the so-called Polish notation of expressions.
For instance, the formula we covered before:
\(2 \cdot \pi + ((x + 3) - \frac{y}{5+1})\)
can be rewritten in this Polish notation as:
+(·(2, π), −(+(x, 3), /(y, +(5, 1)))),
while the executable LISP code looks like:
(+ (· 2 π) (− (+ x 3) (/ y (+ 5 1)))).
This is cool!
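Such a tree is easy to evaluate recursively. A sketch in Python, with trees written as nested `(op, left, right)` tuples and the document’s “·” spelled as `*`:

```python
import math, operator

OPS = {'+': operator.add, '-': operator.sub,
       '*': operator.mul, '/': operator.truediv}

def eval_tree(node, env):
    """Recursively evaluate a GP parse tree written as nested tuples
    (op, left, right); leaves are numbers or variable names."""
    if isinstance(node, tuple):
        op, left, right = node
        return OPS[op](eval_tree(left, env), eval_tree(right, env))
    return env.get(node, node)            # variable lookup, or a constant

# the expression 2*pi + ((x + 3) - y/(5 + 1)) from above
expr = ('+', ('*', 2, 'pi'),
             ('-', ('+', 'x', 3), ('/', 'y', ('+', 5, 1))))
```

Crossover then amounts to swapping subtrees, i.e., nested tuples, between two such expressions.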

1.5.5 Technical summary table

Representation Tree structures
Recombination Exchange of subtrees
Mutation Random change in trees
Parent selection Fitness proportional
Survivor selection Generational replacement

Most GP systems use basic trees,
but there are other flavors too (more below).

1.5.6 Example credit scoring

Bank wants to distinguish good from bad loan applicants
Model needed that matches historical data

ID    No. of children (NoC)    Salary (S)    Marital status    OK?
ID-1  2                        45000         Married           0
ID-3  1                        40000         Divorced          1
ID-2  0                        30000         Single            1

A possible model:

IF (NoC = 2) AND (S > 80000) 
THEN good 
ELSE bad

In general:

IF formula 
THEN good 
ELSE bad

Only unknown is the right formula, hence:
Our search space (phenotypes) is the set of formulas.

Natural fitness of a formula:
percentage of well classified cases of the model it stands for.

IF (NoC = 2) AND (S > 80000)
THEN good  
ELSE bad

can be represented by the following tree
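The fitness computation can be sketched directly. Here a model is just a predicate over an applicant, “good” is encoded as 1, and the field names are taken from the table above; all of this encoding is an assumption for illustration.

```python
def classify(rule, applicant):
    """Apply an IF-THEN-ELSE model: 1 (good) if the formula holds, else 0."""
    return 1 if rule(applicant) else 0

def fitness(rule, data):
    """Fraction of historical cases the model classifies correctly."""
    correct = sum(classify(rule, row) == row['OK'] for row in data)
    return correct / len(data)

data = [
    {'NoC': 2, 'S': 45000, 'status': 'Married',  'OK': 0},
    {'NoC': 1, 'S': 40000, 'status': 'Divorced', 'OK': 1},
    {'NoC': 0, 'S': 30000, 'status': 'Single',   'OK': 1},
]
# the example model; it gets only 1 of the 3 historical cases right
model = lambda a: a['NoC'] == 2 and a['S'] > 80000
```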

1.5.7 Offspring creation scheme

GA scheme: crossover AND mutation applied sequentially (each probabilistically)
GP scheme: crossover OR mutation (chosen probabilistically)

1.5.8 GA vs GP


1.5.9 Selection

1.5.10 Initialization

1.5.11 Bloat

Average tree sizes tend to grow during the course of a run.
This is known as bloat,
sometimes called the ‘survival of the fattest’.
This may be a broader phenomenon than GP.
Bloat is seemingly observed in organismal chromosomes as well.
Some small viruses show size constraints.
The simplest way to prevent bloat,
is to introduce a maximum tree size.
If the child(ren) resulting from a variation operation would exceed this maximum size,
then the variation operator is forbidden from acting.
The threshold can be seen as an additional parameter of mutation and recombination in GP.
Another mechanism, parsimony pressure
(i.e., being ‘stingy’ or ungenerous),
is achieved by introducing a penalty term in the fitness formula,
that reduces the fitness of large chromosomes,
or by using other multiobjective techniques.
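Both countermeasures are one-liners in spirit. The penalty coefficient and size cap below are illustrative values, not recommendations from the text:

```python
def parsimony_fitness(raw_fitness, tree_size, penalty=0.01):
    """Parsimony pressure: a size-proportional penalty term, so that at
    equal raw fitness the smaller tree wins."""
    return raw_fitness - penalty * tree_size

def within_size_limit(tree_size, max_size=50):
    """Hard cap: variation results exceeding the limit are rejected."""
    return tree_size <= max_size
```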

1.5.12 Example symbolic regression

1.5.13 Modern sub-fields

Types of GP include:
Tree-based Genetic Programming
Stack-based Genetic Programming
Linear Genetic Programming (LGP)
Grammatical Evolution
Extended Compact Genetic Programming (ECGP)
Cartesian Genetic Programming (CGP)
Probabilistic Incremental Program Evolution (PIPE)
Strongly Typed Genetic Programming (STGP)
Genetic Improvement of Software for Multiple Objectives (GISMO)

Basic steps in generational, Tree-based GP:

  1. Generate an initial, stochastic population.
  2. Iteratively perform selection, genetic operations, and evaluation.
  3. Repeat until the user-defined termination criteria are met.

1.5.14 Modern software options

Review this briefly:
Mention Clojush, Tiny Genetic Programming

1.5.15 Free books


Mention his GP bot competitions.

And, our future lecture for this course:

1.6 Learning Classifier Systems


Is much like an independent discovery of processes which resemble:
reinforcement learning, Q-learning, TD-learning.
https://en.wikipedia.org/wiki/Q-learning (show images)

Read (show this briefly):

Animation on ch12 here (Java applets, lol):


Where payoff is like RL’s reward.

1.6.1 Quick overview Michigan-style

1.6.2 Technical summary table Michigan-style

Representation Tuple of {condition:action:payoff,accuracy}; conditions use the {0,1,#} alphabet
Recombination One-point crossover on conditions/actions
Mutation Binary resetting as appropriate on conditions/actions
Parent selection Fitness proportional with sharing within environmental niches
Survivor selection Stochastic, inversely related to the number of rules covering the same environmental niche
Fitness Each reward received updates the predicted payoff and accuracy of the rules in the relevant action sets by reinforcement learning

1.6.3 Representation

Each rule of the rule base is a tuple {condition:action:payoff}
Match set: subset of rules whose condition matches the current inputs from the environment
Action set: subset of the match set advocating the chosen action
Later, the rule-tuple included an accuracy value,
reflecting the system’s experience of how well the predicted payoff matches the reward received
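Match and action sets are cheap to compute. A sketch with rules as dictionaries (the field names are assumptions for illustration):

```python
def matches(condition, inputs):
    """A {0,1,#} condition matches a bit-string when every non-# position
    agrees; '#' is the don't-care symbol."""
    return all(c == '#' or c == i for c, i in zip(condition, inputs))

def match_set(rules, inputs):
    """Match set [M]: rules whose condition covers the current inputs."""
    return [r for r in rules if matches(r['cond'], inputs)]

def action_set(rules, inputs, action):
    """Action set [A]: members of the match set advocating the chosen action."""
    return [r for r in match_set(rules, inputs) if r['action'] == action]
```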

1.6.4 Introduction example: the multiplexer

1.6.5 Iteration


1.6.6 Pittsburgh style

The Pittsburgh-style LCS predates,
but is similar to the better-known GP:

Each member of the evolutionary algorithm’s population represents a complete model,
of the mapping from input to output spaces.

Each gene in an individual typically represents a rule,
and a new input item may match more than one rule,
in which case typically the first match is taken.

This means that the representation should be viewed as an ordered list.
Two individuals which contain the same rules,
but in a different order on the genome,
are effectively different models.

Learning of appropriately complex models is typically facilitated by using a variable-length representation,
so that new rules can be added at any stage.
This approach has several conceptual advantages.
In particular, since fitness is awarded to complete rule sets,
models can be learned for complex multi-step problems.
The downside of this flexibility is that,
like GP, Pittsburgh-style LCS suffers from bloat,
and the search space becomes potentially infinite.
Nevertheless, given sufficient computational resources,
and effective methods of parsimony to counteract bloat,
Pittsburgh-style LCS has demonstrated state-of-the-art performance in several machine learning domains,
especially for applications such as bioinformatics and medicine,
where human-interpretability of the evolved models is vital,
and large data-sets are available,
so that the system can evolve off-line to minimize prediction error.
Two recent examples winning Humies Awards,
for better than human performance are in the realms of:
prostate cancer detection and protein structure prediction.

1.7 Differential Evolution


1.7.1 Quick overview

1.7.2 Technical summary table

Representation Real-valued vectors
Recombination Uniform crossover
Mutation Differential mutation
Parent selection Uniform random selection of the 3 necessary vectors
Survivor selection Deterministic elitist replacement (parent vs child)

1.7.3 Differential mutation

In this section we describe a young member of the evolutionary algorithm family:
differential evolution (DE).
Its birth can be dated to 1995,
when Storn and Price published a technical report describing the main concepts behind a
“new heuristic approach for minimizing possibly nonlinear and non-differentiable continuous space functions”.
The distinguishing feature that delivered the name of this approach,
is a twist to the usual reproduction operators in EC:
the so-called differential mutation.

Given a population of candidate solution vectors in the set of real numbers, Rn,
a new mutant vector x′ is produced,
by adding a perturbation vector to an existing one,
x′ = x + p,
where the perturbation vector p is the scaled vector difference of
two other, randomly chosen population members:
p = F · (y − z),
where y and z are those randomly chosen population members,
and the scaling factor F > 0 is a real number,
that controls the rate at which the population evolves.

1.7.4 Uniform crossover

The other reproduction operator is the usual uniform crossover, but with a slight twist.
It is subject to one parameter,
the crossover probability Cr ∈ [0, 1],
that defines the chance that, for any position in the parents currently undergoing crossover,
the allele of the first parent will be included in the child.

Remember that in GAs, the crossover rate pc ∈ [0, 1] is defined for any given pair of individuals,
and it is the likelihood of actually executing crossover.

In DE, at one randomly chosen position,
the child allele is taken from the first parent, without making a random decision.
This ensures that the child does not duplicate the second parent.
The number of inherited mutant alleles therefore follows a binomial distribution.
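Putting differential mutation and the twisted uniform crossover together, one child in the common DE/rand/1/bin scheme might be created like this (the F and Cr values are illustrative defaults, not prescriptions from the text):

```python
import random

def de_child(target, pop, F=0.8, Cr=0.9):
    """DE/rand/1/bin: mutant = y + F*(z - w) from three distinct others,
    then binomial (uniform) crossover with one guaranteed mutant allele."""
    n = len(target)
    y, z, w = random.sample([p for p in pop if p is not target], 3)
    mutant = [yi + F * (zi - wi) for yi, zi, wi in zip(y, z, w)]
    forced = random.randrange(n)      # this position always comes from mutant
    return [mutant[i] if (i == forced or random.random() < Cr) else target[i]
            for i in range(n)]
```

Survivor selection then pits the child deterministically against its target parent.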

1.7.5 Evolutionary cycle

1.7.6 Different variants

1.8 Particle Swarm Optimization


1.8.1 Quick overview

1.8.2 Technical summary table

Representation Real-valued vectors
Recombination None
Mutation Adding velocity vector
Parent selection Deterministic (each parent creates one offspring via mutation)
Survivor selection Generational (offspring replaces parents)

1.8.3 Representation

Every population member can be considered as a pair,
where the first vector is a candidate solution,
and the second one a perturbation vector in Rn.

On a conceptual level every population member in a PSO can be considered as a pair ⟨x, p⟩,
where x ∈ Rn is a candidate solution vector and p ∈ Rn is a perturbation vector,
that determines how the solution vector is changed to produce a new one.
The main idea is that a new pair <x’, p’> is produced from <x, p>
by first calculating a new perturbation vector p’
(using p and some additional information)
and adding this to x.
That is,
x’ = x + p’
The core of the PSO perspective is to consider a population member as a point in space,
with a position and a velocity,
and use the latter to determine a new position (and a new velocity).
Thus, looking under the hood of a PSO,
we find that a perturbation vector is a velocity vector v,
and a new velocity vector v’ is defined as:
the weighted sum of three components: v and two vector differences.
The first points from the current position x to the best position y the given population member ever had in the past,
and the second points from x to the best position z the whole population ever had.

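The velocity and position update can be sketched as follows. The inertia weight w and acceleration coefficients c1, c2 are assumed, commonly cited values, not part of the text above:

```python
import random

def pso_step(x, v, y_best, z_best, w=0.7, c1=1.5, c2=1.5):
    """One PSO update: the new velocity is a weighted sum of the old
    velocity, the pull toward this particle's own best y, and the pull
    toward the swarm's best z; the position then moves by the new velocity."""
    v2 = [w * vi
          + c1 * random.random() * (yi - xi)
          + c2 * random.random() * (zi - xi)
          for xi, vi, yi, zi in zip(x, v, y_best, z_best)]
    x2 = [xi + vi for xi, vi in zip(x, v2)]
    return x2, v2
```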

1.9 Estimation of Distribution Algorithms

Estimation of distribution algorithms (EDAs) are based on the idea of replacing the creation of offspring by ‘standard’ variation operators (recombination and mutation) with a three-step process.
First a ‘graphical model’ is chosen to represent the current state of the search,
in terms of the dependencies between variables (genes) describing a candidate solution.
Next the parameters of this model are estimated from the current population,
to create a conditional probability distribution over the variables.
Finally, offspring are created by sampling this distribution.

    BEGIN
        INITIALISE population P_0 with μ random candidate solutions;
        set t = 0;
        REPEAT UNTIL (TERMINATION CONDITION is satisfied) DO
            EVALUATE each candidate in P_t;
            SELECT subpopulation Ps_t to be used in the modelling steps;
            MODEL SELECTION creates graph G by dependencies in Ps_t;
            MODEL FITTING creates Bayesian Network BN with G and Ps_t;
            MODEL SAMPLING produces a set Sample(BN) of μ candidates;
            set t = t + 1;
            P_t = Sample(BN);
        OD
    END
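A full Bayesian-network EDA is beyond a snippet, but the simplest member of the family, UMDA, drops the dependency graph entirely and estimates an independent marginal per gene. A sketch for bit-strings; the truncation-selection ratio is an assumption:

```python
import random

def umda_generation(pop, fitness, keep=0.5):
    """Simplest EDA (UMDA, bit-strings): no dependency graph — estimate an
    independent per-gene probability from the selected parents, then sample."""
    mu, l = len(pop), len(pop[0])
    ranked = sorted(pop, key=fitness, reverse=True)
    selected = ranked[:max(2, int(keep * mu))]   # truncation selection
    # model fitting: marginal probability of a 1 at each locus
    p = [sum(ind[i] for ind in selected) / len(selected) for i in range(l)]
    # model sampling: mu new candidates drawn from the marginals
    return [[1 if random.random() < p[i] else 0 for i in range(l)]
            for _ in range(mu)]
```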

1.10 Evolutionary Robotics


1.11 NeuroEvolution

Discussed above!