# 1 05-FitnessSelection

Previous: 04-RepresentMutateRecombine.html

## 1.1 Review

Two fundamental forces form the basis of evolutionary systems:

variation and selection.

These opposing forces are both necessary.
Maintaining diversity is important,
but not sufficient.
Selection for more fit individuals is important,
but not sufficient.
Each is one side, of a dual process.

## 1.2 Fitness, Selection, and Population Management

https://en.wikipedia.org/wiki/Selection_(genetic_algorithm)

Selection for fitness is the second fundamental force driving evolution.
By selection for fitness, we mean meritocratically determined,
differential success in survival and reproduction.

We will cover:
* Population management models
* Selection operators
* Preserving diversity
* Parent selection
* Survivor selection

## 1.3 Population Management Models

Two different population management models exist:

### 1.3.1 Generational model

In each generation we begin with a population of size μ,
from which a mating pool of parents is selected.
Every member of the pool is derived from something in the population.
The proportions from each parents will probably differ,
with usually more copies of the ‘better’ parents.
Next, λ offspring are created from the mating pool,
through the application of variation operators.
Offspring are evaluated by the fitness function.
From the offspring, select μ individuals as then next generation.

#### 1.3.1.1 Options

In the model typically used within the Simple Genetic Algorithm,
the population, mating pool, and offspring are all the same size,
so that each generation is replaced by all of its offspring.

This restriction to use all offspring is not necessary.
For example in the (μ, λ) Evolution Strategy,
typically λ/μ is in the range 5-7.
Create an excess of offspring,
and select only the best fraction.

Summary
* μ is population number
* λ is offspring number
* each individual survives for exactly one generation
* the entire set of parents is replaced by the offspring

the entire population is not changed synchronously at once.
λ (< μ) old individuals are replaced by λ new offspring.
The steady-state model has been widely studied and applied.

### 1.3.3 Generational gap

is the proportion of the population that is replaced,
when after breeding produces more children,
we cull back to μ.
Formally, it is:

λ/μ

For generational, it is often λ/μ = 1.0
We replace all parents with children.

For steady state, it is often, λ = 1/pop_size
We replace 1 (generally bad and/or old) parent with 1 new child.

Summary
* one offspring is often generated per generation
* one member of population replaced
* Generation Gap
* The proportion of the population replaced
* 1.0 for generational

### 1.3.4 Fitness based competition

Operators responsible for competition in population management,
work on the basis of an individual’s fitness value itself.
Abstract fitness rank is independent of representation.
Selection and replacement operators work independently of the problem representation.

#### 1.3.4.1 Two selection mechanisms

Fitness-based competition occurs at two points in the evolutionary cycle:

1. during selection, to take part in mating,
2. and during the selection of individuals to survive, into the next generation.

Many mechanisms can be used for either.

#### 1.3.4.2 Reminder: maximize vs. minimize

We will adopt a convention that:
we are trying to maximize fitness,
and that fitness values are not negative.

We often express problems in terms of an objective function to be minimized,
For some formulations negative fitness values occur.
We can map these into the desired form by using an appropriate transformation.

#### 1.3.4.3 Summary

• Selection can occur in two places:
• Selection from current generation to take part in mating (parent selection)
• Selection from parents + offspring to go into next generation (survivor selection)
• Selection operators work on whole individual
• i.e. they are representation-independent !
• Distinction between selection
• Operators: define selection probabilities
• Algorithms: define how probabilities are implemented

## 1.4 Parent Selection

We review several different parent selection mechanisms.
Some of these can be used for survivor selection too.

### 1.4.1 Fitness-Proportionate Selection (FP)

https://en.wikipedia.org/wiki/Fitness_proportionate_selection
https://en.wikipedia.org/wiki/Selection_(genetic_algorithm)#Roulette_Wheel_Selection
The probability that an individual, i, is selected for mating, depends on a ratio,
of the individual’s fitness, relative to the population’s fitness.

The sum of the probabilities over the whole population should equal 1.
The selection probability of individual i can be defined as:
$$P_{FPS}(i) = f_i / \sum_{j=1}^{u} f_j$$
This selection mechanism has been the topic of intensive study.
It is particularly amenable to theoretical analysis.

There are some problems with this selection mechanism:

#### 1.4.1.1 Problem 1: Premature convergence

Initially outstanding individuals can take over the entire population quickly.
This is known as premature convergence.
It tends to focus the search process.
It decreases the probability the algorithm will thoroughly search,
through the space of possible solutions.
Better solutions may exist.
This phenomenon is often observed in early generations.
To make the problem worse, in early generations,
most randomly created individuals will likely have low fitness.

#### 1.4.1.2 Problem 2: Diminished pressure

When fitness values are all very close together,
selective pressure is reduced.
Thus, selection approaches random.
Having a slightly better fitness, is not very ‘useful’ to an individual.
Later in a run, some convergence has taken place,
and the worst individuals are gone.
Then, it is often observed that the mean population fitness may only increase very slowly.

#### 1.4.1.3 Problem 3: Transformation distortion

FP behaves differently if the fitness function is linearly transposed.
The table shows three individuals and a fitness function with:
f(A) = 1, f(B) = 4, f(C) = 5

Transposing this fitness function changes the selection probabilities.
The ordinal linkage of the fitness landscape,
and hence the location of the optimum,
remains the same.

#### 1.4.1.4 Solution 1: Windowing

To avoid the second two problems with FPS,
we employ a procedure called windowing.

Fitness differentials are maintained by:
subtracting from the raw fitness f(x) a value βt,
which depends in some way on the recent search history,
and so can change over time or iterations
(hence the superscript t).
It is often not a strict function of time,
but of the population that happens to exist,
at any given time.
One approach subtracts the value of the least-fit member of the current population,
Pt by setting:
$$\beta^t = min_{y \in p^t} f(y)$$
We then re-compute the fitness:
$$f'(x) = f(x) - \beta^t$$
What effect does that have?

βt value may fluctuate quite rapidly,
so one alternative is to use a running average,
over the last few generations.

This is like normalizing grades to the best student in the class.
Normalizing to a max or min is common.

#### 1.4.1.5 Solution 2: Sigma scaling

Another well-known approach is sigma scaling.
It incorporates summary information for fitnesses in the population:
the mean $$\bar{f}$$ and
standard deviation $$\sigma_f$$
where c is a constant value,
usually set to 2:
$$f'(x) = max(f(x) - (\bar{f} - c \cdot \sigma_f), 0)$$
What effect does that have?
Normalization!

#### 1.4.1.6 Summary

• Problems include
• Premature Convergence:
• One highly fit member can rapidly take over, if rest of population is much less fit,
• Diminished pressure
• At end of runs when fitnesses are similar, loss of selection pressure
• Transformation distortion
• Highly susceptible to function transposition
• Solutions include
• Scaling, which can fix last two problems, via:
• Windowing:
• where Beta is worst fitness in this (last n) generations
• Sigma Scaling:
• where c is a constant, usually 2.0
• What about the first problem?

### 1.4.2 Rank-based Selection

https://en.wikipedia.org/wiki/Selection_(genetic_algorithm)#Rank_Selection
Rank-based selection was designed to address previous problems.
It preserves a constant differential selection pressure.
It sorts the population on the basis of fitness.
Selection probabilities map to individuals according to their rank,
rather than according to their raw fitness.

#### 1.4.2.1 Decreasing rank

If we define ranks as decreasing numbers,
then an individual’s rank implies the number of worse solutions in the population.
Specifically, the best has rank μ-1,
and the worst has rank 0.

#### 1.4.2.2 Cumulative probability

In any selection scheme,
we require that the sum over the population of the selection probabilities must be unity.
The cumulative probability is often thought of as 1.
That the sum not less than 1,
implies that we must select one of the parents.

#### 1.4.2.3 Types of mapping

We must mapping from rank number to selection probability.
The mechanism of mapping can vary.

##### 1.4.2.3.1 Linear Ranking

The simplest concept to map this idea would be:
$$P_{lin-too-simple^{(i)}} = \frac{i}{\mu}$$
μ is fixed, population size.
i is rank, decreasing.
Observe that:
with high rank (large i), p(i) is high.

05-FitnessSelection/rank_linear_simple.py

#!/usr/bin/python3
"""
This one behaves more like a cumulative probability.
"""

import matplotlib.pyplot as plt

u = 200
ui = range(u)
fitnesses = []
for i in range(u):
P_i = i / u
print(f"Fitness of {i}, P_i is {P_i}")
fitnesses.append(P_i)
plt.plot(ui, fitnesses)
plt.show()

Note: this looks like a cumulative probability distribution,
which we used before:
03-MetaheuristicParts.html#run-1-cycle

A common real linear ranking schemes is parameterized by a value s (1 < s ≤ 2).
In the case of a generational EA, where μ = λ,
this can be interpreted as:
the expected number of offspring allotted to the fittest individual.
Since this individual has i (rank) defined as μ − 1,
and the worst has i (rank) is 0,
the selection probability for an individual of rank i is:
$$P_{lin-rank^{(i)}} = \frac{(2-s)}{\mu} + \frac{2i(s-1)}{\mu(\mu-1)}$$
The first term will be constant for all individuals.
It is there to ensure the probabilities add to one.
Did my bad example above do that?
Since the second term will be zero for the worst individual (with rank i = 0),
it can be thought of as the ‘baseline’ probability of selecting that individual.
s just pushes the proportion in a direction of favorable.
s can be meta-optimized.

05-FitnessSelection/rank_linear.py

#!/usr/bin/python3
"""
When s = 1,
this would add up to the cumulative probability of the linear_rank_simple.py
"""

import numpy as np
import matplotlib.pyplot as plt

u = 10
ui = range(u)
s_values = np.arange(1, 2.1, 0.1)
for s in s_values:
print(f"\nS value of {s}")
fitnesses = []
for i in range(u):
fitness_i = ((2 - s) / u) + ((2 * i) * (s - 1)) / (u * (u - 1))
fitnesses.append(fitness_i)
print(f"fitness of {i} is {fitness_i}")
plt.plot(ui, fitnesses, label=f"{s}")
plt.legend()
plt.show()

Example: FP versus Linear Rank
We show an example of how the selection probabilities differ for
a population of μ = 3 different individuals,
with fitness proportionate versus rank-based selection,
with different values of s.
Parameterized by factor s:
1 < s ≤ 2.

FP is not linear: e.g., 0.5-0.4=0.1 and 0.4-0.1=0.3
LR produces linear differentials between individuals.
When the mapping from rank to selection probabilities is linear,
only limited selection pressure can be applied.
This arises from the assumption that, on average,
an individual of median fitness should have one chance to be reproduced,
which in turn imposes a maximum value of s = 2.
Given the equation above, letting s > 2,
would require the worst to have a negative selection probability.
Recall: the probabilities are defined to sum to unity.

##### 1.4.2.3.2 Exponential Ranking

We may want a higher selection pressure on top individuals.
To put more emphasis on selecting individuals of above-average fitness,
an exponential ranking scheme is often used.
Since this individual has i (rank) defined as μ − 1,
and the worst has i (rank) is 0,
the selection probability for an individual of rank i is:
e is Euler’s number:
https://en.wikipedia.org/wiki/E_(mathematical_constant)
$$P_{exp-rank^{(i)}} = \frac{1-e^{-i}}{c}$$
The normalization factor c,
is chosen so that the sum of the probabilities is unity,
i.e., it is a function of the population size.

Linear Ranking is limited in selection pressure.
Exponential Ranking can allocate an average of more than 2 copies to fittest individual.
We can normalize the constant factor, c, according to population size.
Finally, from the resulting probability distribution,
we sample the next pool using something like:
roulette wheel, stochastic universal sampling, etc.
These are illustrated in the next section.

05-FitnessSelection/rank_exp.py

#!/usr/bin/python3
"""
Exponential ranking
"""

import numpy as np
import matplotlib.pyplot as plt

u = 200
# TODO fiddle with C value
c = 200
ui = range(u)
fitnesses = []
for i in range(u):
e = np.exp(1)
P_i = (1 - (e ** -i)) / c
print(f"Fitness of {i}, P_i is {P_i}")
fitnesses.append(P_i)
print(sum(fitnesses))
plt.plot(ui, fitnesses)
plt.show()

#### 1.4.2.4 Summary

We attempt to address the problems of FP,
by basing selection probabilities on relative rather than absolute fitness.
We rank the population according to fitness,
and then base selection probabilities on rank.
The fittest has rank u_-1_ and worst rank 0.
Computationally, this imposes a sorting overhead on the algorithm.
However, this is usually negligible compared to the fitness evaluation time!

### 1.4.3 Implementing Selection Probabilities

A probability distribution defines the likelihood of each individual in the population being selected for reproduction.

#### 1.4.3.1 Problem: finite population size

Recall:
03-MetaheuristicParts.html#run-1-cycle
Where I implement the “marbles in a bucket” algorithm,
which assumes the popuulation size is 100.

We previously covered two schemes for defining a probability distribution.
Ideally, the mating pool of parents taking part in recombination,
would have exactly the same proportions as this selection probability distribution.
This would mean that the number of any given individual,
would be given by its selection probability,
multiplied by the size of the mating pool.
In practice this is not possible with a finite sized population.
When we do the multiplications to calculate probabilities,
we typically find that some individuals have an expected number of copies,
which is non-integer, often less than 1.
Yet, we need to select complete individuals.
The mating pool of parents is sampled from the selection probability distribution,
but will not in always perfectly reflect it’s proportions.

#### 1.4.3.2 Solution 1: Roulette wheel

The simplest way to sample is known as the roulette wheel algorithm.
By analogy to roulette,
assuming the sizes of the holes reflect the selection probabilities,
This is like repeatedly throwing one ball onto this spinning wheel:

In general, the algorithm can be applied to a set of μ parents to select λ members.
The λ individuals form the mating pool.
We will assume some order over the population (ranking or random) from 1 to μ,
so that we can calculate the cumulative probability disibutribution,
which is a list of values [a1, a2, …, aµ],
such that:
$$a_i = \sum_1^i P_{sel^{(i)}}$$
where $$P_{sel^{(i)}}$$ is defined by the selection distribution,
derived from either fitness proportionate (FP) or ranking.
Note that this implies aµ = 1.
Recall our example from the simple evolution of x2?
Show that now:
03-MetaheuristicParts.html#run-1-cycle
The important part:
We are given the cumulative probabilily distribution,
a list of probabilities for each individual.

Roulette wheel algorithm:

BEGIN
/* Given the cumulative probability distribution, a, */
/* and assuming we wish to select λ members of the mating pool */
set current_member = 1;
WHILE ( current_member ≤ λ ) DO
Pick a random value, r, uniformly from [0, 1];
set i = 1;
WHILE ( a_i < r ) DO
set i = i + 1;
OD
set mating_pool[current_member] = parents[i];
set current_member = current_member + 1;
OD
END

How many times does the outer while loop execute?

Despite its simplicity,
and logical consistency,
it may not provide representative sample of the distribution.

#### 1.4.3.3 Solution 2: Stochastic Universal Sampling

https://en.wikipedia.org/wiki/Stochastic_universal_sampling
Whenever more than one sample is to be drawn from the distribution,
for instance, λ for a mating pool,
the use of the stochastic universal sampling (SUS) algorithm is preferred.
By analogy to roulette,
assuming the sizes of the holes reflect the selection probabilities,
this is equivalent to making one spin of a wheel with λ equally spaced arms,
rather than λ spins of a one-armed wheel.

Given the same list of cumulative selection probabilities,
[a1, a2, …, aµ],
it performs λ selections, choosing the mating pool.

Stochastic universal sampling algorithm:

BEGIN
/* Given the cumulative probability distribution a */
/* and assuming we wish to select λ members of the mating pool */
set current_member = i = 1;
Pick a random value, r, uniformly from [0, 1/λ];
WHILE ( current_member ≤ λ ) DO
WHILE ( r ≤ a[i] ) DO
set mating_pool[current_member] = parents[i];
set r = r + 1/λ;
set current_member = current_member + 1;
OD
set i = i + 1;
OD
END

How many times does the outer while loop execute?

Since the value of the variable r is initialised in the range [0, 1/λ],
and increases by an amount 1/λ every time a selection is made,
it is guaranteed that the number of copies made of each parent i,
is at least the integer part of $$\lambda \cdot P_{sel^{(i)}}$$,
and is no more than one greater.
Finally, we should note that with minor changes to the code,
SUS can be used to make any number of selections from the parents,
and in the case of making just one selection,
it is the same as the roulette wheel.

### 1.4.4 Tournament Selection

https://en.wikipedia.org/wiki/Tournament_selection
Repeated competitions between k individuals select breeders.

E.g., k=3:

#### 1.4.4.1 Problem 1: calculating fitness of all individuals

The previous two selection methods,
and the algorithms used to sample from their probability distributions,
relied on a knowledge of the entire population.
If the population size is very large,
or if the population is distributed in some way (perhaps on a parallel system),
obtaining this knowledge is either highly time consuming,
or at worst impossible.

#### 1.4.4.2 Problem 2: indirect fitness measures

Both methods assume that fitness is an directly quantifiable measure,
based on some explicit objective function,
to be optimized in a vacuum.
This assumption may not always be practically true.

##### 1.4.4.2.1 Example: zero-sum game player

Imagine evolving game playing strategies.
We might not be able to quantify the strength of a given individual (strategy) in isolation.
We may need to compare any two of them,
by simulating a game played by these strategies as opponents.
The success of the strategy is still merely and indicator of the fitness,
under varied conditions, opponents’ strategies.

##### 1.4.4.2.2 Example: evolutionary art

https://en.wikipedia.org/wiki/Evolutionary_art

Similar situations occur also in evolutionary art applications.
The fitness function is a human’s judgement.
The user makes a subjective selection,
by comparing individuals representing designs or pieces of art,
rather than using a quantitative measure to assign fitness.

##### 1.4.4.2.3 Interactive genetic programming

This problem is not limited to art,
but any problem where the human is the fitness function.
More broadly, it can be considered a sub
This is sometimes called:
interactive evolutionary computing,
human-in-the-loop, or
interactive EC.
https://en.wikipedia.org/wiki/Interactive_evolutionary_computation

Interactive evolutionary computation (IEC) or aesthetic selection is a general term for methods of evolutionary computation that use human evaluation.
Usually human evaluation is necessary when the form of fitness function is not known (for example, visual appeal or attractiveness;
or the result of optimization should fit a particular user preference
(for example, taste of coffee or color set of the user interface).

An interactive genetic algorithm (IGA) is defined as a genetic algorithm that uses human evaluation.
These algorithms belong to a more general category of Interactive evolutionary computation.
The main application of these techniques include domains where it is hard or impossible to design a computational fitness function.
For example, evolving images, music, various artistic designs and forms to fit a user’s aesthetic preferences.
Interactive computation methods can use different representations,
both linear (as in traditional genetic algorithms)
and tree-like ones (as in genetic programming).

Human-based EC is a bit different of a notion:
https://en.wikipedia.org/wiki/Human-based_evolutionary_computation
https://en.wikipedia.org/wiki/Human-based_genetic_algorithm

We can also evolve things for the human, such as a more optimal keyboard (show this in class):

#### 1.4.4.3 Tournament selection properties

No global knowledge of the population,
nor a quantifiable measure of quality.
Instead, it only relies on an ordering relation,
that can compare and rank any two individuals.
It is therefore conceptually simple,
and fast to implement and apply.
The application of tournament selection,
to select λ members of a pool of μ individuals,
works according to the following procedure:

Tournament selection:

BEGIN
/* Assume we wish to select λ members of a pool of µ individuals */
set current_member = 1;
WHILE ( current_member ≤ λ ) DO
Pick k individuals randomly, with or without replacement;
Compare these k individuals and select the best of them;
Denote this individual as i;
set mating_pool[current member] = i;
set current_member = current_member + 1;
OD
END

#### 1.4.4.4 Indirect evaluation

Tournament selection evaluates relative fitness,
rather than absolute fitness.
Thus, it shares properties with previous ranking schemes.
Both demonstrate invariance to translation and transposition of the fitness function.
Four factors impact the probability of selecting an individual in a tournament:

##### 1.4.4.4.1 1. Rank

Individual rank in the population.
Tournaments estimate rank,
without directly sorting the whole population.

##### 1.4.4.4.2 2. Tournament size, k

The larger the tournament size, k,
the greater the chance that it will contain members of above-average fitness,
and the lesser the chance that it will consist entirely of low-fitness members.
As k is increases,
the probability of selecting a high-fitness member increases,
and that of selecting a low-fitness member decreases.
Increasing k increases the selection pressure.

##### 1.4.4.4.3 3. Certainty of outcome

p defines the probability that the most fit member of the tournament is selected.
Deterministic tournament outcomes define p = 1.
Stochastic tournament outcomes permit p < 1.
Since stochasticity makes it more likely that a less-fit member is selected,
decreasing p will decrease the selection pressure.

##### 1.4.4.4.4 4. Replacement strategy

Whether individuals are chosen with, or without replacement,
to participate in the tournament.
This replacement choice is before competing.

For simplicity of this sub-analysis,
we temporarily assume deterministic tournaments:

Without replacement:
The k-1 least-fit members of the population can never be selected,
since the other member of the tournament will be fitter.

With replacement:
All tournament candidates could randomly be copies of the worst member.
This, it is possible for the least-fit member of a population to be selected.
With a low probability, 1/μk > 0,

#### 1.4.4.5 Comparison with ranking

Timing similarities
For binary (k = 2) tournaments with parameter p (determinism of outcome),
the expected time for a single individual of high fitness to take over the population,
is roughly the same as that for linear ranking with s = 2p.

Weakness similarities
Since λ tournaments are required to produce λ selections,
it suffers from similar problems as the roulette wheel algorithm.
Specifically, with low n,
the observed outcomes can differ greatly,
from the theoretical probability distribution,

Despite these drawbacks,
tournament selection is a widely used selection operator in some EC dialects,
(in particular, Genetic Algorithms).
It is simple.
Selection pressure is easy to control,
by varying the tournament size, k.
And, most importantly, it can handle implicit value.

Summary
* All previous methods rely on global population statistics
* Could be a bottleneck especially on parallel machines, very large population
* Relies on presence of external fitness function which might not exist:
* e.g. evolving game players
* Idea for a procedure using only local fitness information:
* Pick k members at random then select the best of these
* Repeat to select more individuals
* Probability of selecting i will depend on:
* Rank of i
* Size of sample k
* higher k increases selection pressure
* Whether contestants are picked with replacement
* Picking without replacement increases selection pressure
* Whether fittest contestant always wins (deterministic) or this happens with probability p

### 1.4.5 Uniform selection

(i.e., entirely random parental selection)
In some dialects of EC,
it is common to select each individual for breeding with the same probability.
This might appear to suggest that there is no selection pressure in the algorithm,
which would indeed be true,
if this was not coupled with a strong fitness-based survivor selection mechanism.

In Evolutionary Programming,
usually there is no recombination,
only mutation,
and parent selection is deterministic.
Each parent produces exactly one child by mutation.

Evolution Strategies usually randomly select parents for mating,
i.e.,
for each 1 ≤ i ≤ μ ,
we have:
Puniform(i) = 1/μ.

Summary
Parents are selected by uniform random distribution,
whenever an operator needs one/some.
Uniform parent selection is unbiased.
Every individual has the same probability to be selected.
When working with extremely large populations,
over-selection can be used.

### 1.4.6 Over-selection for large populations

In many cases,
it may be desirable to work with extremely large populations.

#### 1.4.6.1 For technical reasons

For example, implementing EAs using graphics cards (GPUs),
which offer similar speed-up to clusters or supercomputers,
but at much lower cost.
However, achieving the maximum potential speed-up,
typically depends on having a large population on each processing node.

##### 1.4.6.1.1 For theoretical reasons

If the potential search space is enormous,
it might be a good idea to use a large population,
to avoid ‘missing’ promising regions in the initial random generation,
and thereafter to maintain the diversity needed to support exploration.

##### 1.4.6.1.2 Example: Genetic Programming

GP often employs population sizes of several thousands to millions.
A method called over-selection can be used for population sizes of 1000 and above.
The population is first ranked by fitness,
and then divided into two groups,
the top x% in one,
and the remaining (100 − x)% in the other.
For example, with x at 80,
around 80% of the selection operations choose from the first group,
and around 20% from the second.
Koza defines some “rule of thumb” values for x,
depending on the population size, show below.
Proportion of ranked population in fitter sub-population,
from which majority of parents are selected:

As can be seen,
the number of individuals from which the majority of parents are chosen,
stays constant,
i.e., the selection pressure increases dramatically for larger populations.

## 1.5 Survivor Selection

This step in the main evolutionary cycle is also called replacement.
Survivor selection mechanisms take set of μ parents and λ offspring,
and choose a set of μ individuals for the next generation.
In principle, any of the mechanisms introduced for parent selection,
could be also used for selecting survivors.
Historically, different selection strategies were employed for survival.
They can be categorized according to whether they discriminate on either:
1. the age of an individual
2. the fitness of an individual

### 1.5.1 Age-based replacemnet

Each individual exists in the population for the same number of EA iterations.
Fitness of individuals is not taken into account for replacement.

Copies of highly-fit individuals might still persist in the population.
However, they would have to be chosen at least once in the parental selection phase,
and the resulting offspring would have to survive recombination and mutation stages,
and they would have to escape the recombination and mutation stages un-modified…
This is quite unlikely, but possible.

#### 1.5.1.1 Exploration

Since fitness is not taken into account,
the mean fitness, and even best fitness, of any given generation,
may be lower than that of its predecessor.
While slightly counter-intuitive,
this is the nature of exploration.
Such curiosity may escape a local optimum.
To increase average fitness over time, we must:
have sufficient selection pressure when selecting parents for the mating pool, and
use variation operators that are not too disruptive.

#### 1.5.1.2 Generational versus Steady State

Age-based replacement is often used in the classic “Genetic Algorithm” methods.

Generational implementations
use the same number of offspring and parents, (μ = λ).
Thus, each individual exists for just one cycle.
to be replaced by the entire set of offspring.
This is the generational model we discussed before.

define overlapping populations (λ < μ)
In each cycle, λ new offspring are created,
and inserted in the population.
The parent(s) that produced the offspring is discarded.
The strategy takes the form of a first-in-first-out (FIFO) queue.

an alternative method to discarding the λ breeding parent(s),
is to randomly select λ parent(s) for replacement.
This probabilistic strategy has the same mean effect.
With λ=1, on average individuals live for μ iterations.
Some work has shown higher variance in performance than a comparable generational GA.
The random strategy is more likely to lose the best member of the population.
The random replacement strategy is not recommended.

### 1.5.2 Fitness-based replacement

Fitness selection rules choose individuals for survival.
The fittest of μ parents + λ offspring survive into the next generation.
A wide number of strategies based on fitness have been proposed.
Some also take age into account.

#### 1.5.2.1 Replace worst (GENITOR)

Replace the worst λ members of the existing μ population.
This can lead to very rapid improvements in the mean population fitness.
That is not always good.
It can also lead to premature convergence.
The population tends to rapidly focus on the fittest member currently present.
Thus, it is commonly used in conjunction with large populations,
and/or a “no duplicates” policy.

#### 1.5.2.2 Elitism

https://en.wikipedia.org/wiki/Selection_(genetic_algorithm)#Elitism_Selection
Age-based and stochastic fitness-based replacement schemes can discard fit members.
Elitism prevents the loss of the current fittest member of the population.
The current fittest member is always kept in the population.
If the fittest current member would be replaced,
and none of the new offspring has equal or better fitness,
then the best is kept,
and one of the offspring is discarded.

#### 1.5.2.3 Round-robin tournament

https://en.wikipedia.org/wiki/Round-robin_tournament
A round-robin tournament (or all-play-all tournament),
is a competition in which each contestant meets every other participant,
usually in turn.

This mechanism was introduced within Evolutionary Programming,
where it is applied to choose μ survivors.
In principle, it can also be used to select λ parents from a given population of μ.

The method works by holding pairwise tournament competitions in round-robin format.
Each individual is evaluated against q others,
that are randomly chosen from the merged parent and offspring populations.
For each comparison, a “win” is assigned if the individual is better than its opponent.
After finishing all tournaments,
the μ individuals with the greatest number of wins are selected.
Often, q = 10 is used in Evolutionary Programming.
This stochastic variant of inclusion in the tournament allows for less-fit solutions to be selected,
if they had a lucky draw of opponents.
As the value of q increases,
this chance becomes more and unlikely,
until in the limit it becomes deterministic μ + μ.

#### 1.5.2.4 (μ + λ) Selection

Offspring and parents are merged,
and ranked according to (estimated) fitness,
then the top μ are kept to form the next generation.
This strategy can be seen as a generalization of the replace-worst method (μ > λ),
and the round-robin tournament used in Evolutionary Programming where (μ = λ).
In Evolution Strategies,
λ > μ is often used with a great offspring surplus,
(typically λ/μ ≈ 5-7).
This induces a large selection pressure.

#### 1.5.2.5 (μ, λ) Selection

This method works on a mixture of age and fitness.
Typically, λ > μ children are created from a population of μ parents.
The age component dictates that all the parents are discarded,
so no individual is directly kept for more than one generation
The fitness component ranks λ offspring according to fitness,
and the best μ form the next generation.
The (μ, λ) strategy is often used in Evolution Strategies.

#### 1.5.2.6 (μ + λ) versus (μ, λ)

In Evolution Strategies, (μ, λ) selection, is generally preferred over (μ + λ) selection for the following reasons:
The (μ, λ) discards all parents and is therefore in principle able to leave (small) local optima.
This may be advantageous in a multimodal search space with many local optima.
If the fitness function is not fixed,
but changes in time,
the (μ+λ) selection preserves outdated solutions,
so it may not follow a moving optimum well.
The (μ + λ) selection hinders the self-adaptation mechanism used to adapt strategy parameters (upcoming later).

#### 1.5.2.7 Summary

• Elitism
• Always keep at least one copy of the fittest solution so far
• Widely used in both population models (GGA, SSGA)
• GENITOR: a.k.a. “delete-worst”
• From Whitley’s original Steady-State algorithm (he also used linear ranking for parent selection)
• Rapid takeover: use with large populations or “no duplicates” policy
• Round-robin tournament
• P(t): μ parents, P’(t): μ offspring
• Pairwise competitions in round-robin format:
• Each solution x from P(t) U P’(t) is evaluated against q other randomly chosen solutions
• For each comparison, a “win” is assigned if x is better than its opponent
• The μ solutions with the greatest number of wins are retained to be parents of the next generation
• Parameter q allows tuning selection pressure
• Typically q = 10
• (μ,λ)-selection
• based on the set of children only (λ > μ)
• choose best μ
• (μ+λ)-selection
• based on the set of parents and children
• choose best μ
• Often (μ,λ)-selection is preferred for:
• Better in leaving local optima
• Better in following moving optima.
Using the + strategy bad $$\sigma$$ values can survive in $$<x, \sigma>$$ too long, if their host x is very fit
• λ around 7 * μ is a traditionally good setting (decreasing over the last couple of years, λ around 3 * μ seems more popular lately)

## 1.6 Selection Pressure

We have referred informally to the notion of selection pressure.
As selection pressure increases,
fitter solutions are more proportionately likely to be selected,
and less-fit solutions are proportionately less likely.
We can formalize and quantify this.

The takeover time, τ∗, can be defined for a given selection mechanism.
The number of generations it takes from one copy,
until the application of selection homogenously fills the population with that copy.
Some work has shown that:
τ∗ = ln λ / ln(λ/μ)
For a typical evolution strategy,
with μ = 15 and λ = 100,
this results in τ∗ ≈ 2.
For fitness-proportional (FP) selection in a genetic algorithm it is:
τ∗ = λ ln λ,
resulting in τ∗ = 460 for population size λ = 100.

Other considered solutions such as a a ‘Diversity Indicator’.
One called ‘Expected Loss of Diversity’ is the expected change in the number of diverse solutions after μ selection events;
“Selection Intensity” is the expected relative increase in mean population fitness after applying a selection operator.

## 1.7 Multimodal feature-fitness landscapes

https://en.wikipedia.org/wiki/Multimodal_distribution
Most interesting problems have more than one locally optimal solution.
In statistics, a multimodal distribution is a probability distribution with more than one mode.
These appear as distinct peaks (local maxima) in the probability density function:

One gene:

Two genes:

n-genes:
To visualize this, you need to do dimensionality reduction.
But, it still works regardless.

### 1.7.1 Global versus local optima

We introduced the concept of local optima.
Effective search relies on the preservation of sufficient diversity,
to allow both:
exploitation of learned information,
by investigating regions containing high fitness solutions discovered,
and exploration, in order to uncover new high-fitness regions.

### 1.7.2 Global goal

Multimodality is a typical feature of the type of problems for which EAs are often employed.
One goal is to locate the global optimum, in spite of local optima.

### 1.7.3 Local goal(s)

An alternative may be to identify a number of high-fitness solutions,
corresponding to various local optima.
The latter situation can often arise,
for example, when the fitness function used by the EA does not completely specify the underlying problem.

An example of this might be in the design of a new widget,
where the parameters of the fitness function may change during the design process,
as progressively more refined and detailed models are used over time,
as decisions are made, such as the choice of materials, etc.
In this situation, it is valuable to be able to examine a number of possible options,
first to permit room for human aesthetic judgments, and
second because it is probably desirable to use solutions from niches with broader peaks,
rather than from a sharp peak.
This is because the latter may be over-fitted (that is, overly specialized) to the current fitness function,
and may not be as good once the fitness function is refined.

### 1.7.4 Genetic Drift

The population-based nature of EAs excels at identifying multiple optima.

In practice, the finite population size,
when coupled with recombination between any parents
(known as panmictic mixing),
leads to the phenomenon known as genetic drift,
and eventual convergence around one optimum.

The reasons for this can easily be seen:
imagine that we have two equally fit niches,
and a population of 100 individuals,
originally equally divided between them.
Eventually, because of the random effects in selection,
it is likely that we will obtain a parent population,
consisting of 49 of one sort and 51 of the other.
Ignoring the effects of recombination and mutation,
in the next generation the probabilities of selecting individuals from the two niches are now 0.49 and 0.51 respectively,
i.e., we are increasingly likely to select individuals from the second niche.
This effect increases as the two sub-populations become unbalanced, until eventually we end up with only one niche represented in the population.

Summary
Finite population with global mixing and selection,
eventually convergence around one optimum.
Why?
Often might want to identify several possible peaks.
Sub-optimum mean can still be more attractive,
if we have a number of local optima.

### 1.7.5 Approaches for Preserving Diversity

#### 1.7.5.1 Introduction

• Different diversity spaces:
• Genotype space
• Set of representable solutions
• Phenotype space
• The end result
• neighborhood structure may bear little relation with genotype space
• Algorithmic space
• Equivalent of the geographical space on which life on earth has evolved
• Structuring the population of candidate solutions
• Explicit vs implicit
• Explicit approaches
• Make similar individuals compete for resources (fitness)
• Make similar individuals compete with each other for survival
• Implicit approaches:
• Impose an equivalent of speciation
• Impose an equivalent of geographical separation

#### 1.7.5.2 What is diversity

Discussion question:
What is diversity?
Can you attempt a structured or formal definition of diversity?
In a population of solutions, we rely on an assumption:
there can be too much and too little diversity.
Why might that assumption exist in EC?

Review basics of diversity informatics
../../../Classes/Bioinformatics/Content.html
Lecture 07 diversity informatics:
git clone https://gitlab.com/bio-data/sequence-informatics.git
Unpack this some in lecture.

It is useful to clarify exactly what we mean by “diversity” and “space”.
Just as biological evolution takes place on a geographic surface,
but can also be considered to occur on an adaptive landscape,
so we can define a number of spaces within which the evolutionary algorithms operate:

##### 1.7.5.2.1 Phenotype Space:

This is the end result:
a search space whose structure is based on distance metrics between solutions.
The neighborhood structure in this space,
may bear little relationship to that in the genotype space,
according to the complexity of the representation–solution mapping.

##### 1.7.5.2.2 Genotype Space:

We may perceive the set of representable solutions as a genotype space,
and define some distance metrics.
This can be a natural distance metrics in that space
(e.g., the Manhattan distance, hamming distance)
or based on some fundamental move operator.
Typical move operators include a single bit-flip for binary spaces,
a single inversion for adjacency-based permutation problems,
and a single swap for order-based permutations problems.

• https://en.wikipedia.org/wiki/Hamming_distance
• https://en.wikipedia.org/wiki/Distance_matrix (images partway down)
##### 1.7.5.2.3 Phylogenetic space:

See bioinformatics lecture above.

##### 1.7.5.2.4 Algorithmic Space:

This is the equivalent of the geographical space,
on which life on Earth has evolved.
Effectively we are considering that the working memory of the EA,
that is, the population of candidate solutions, can be structured in some way.
This spatial structure could be either a conceptual division, or real:
for example,
an isolated breeding population might be split over a number of processors or cores.

#### 1.7.5.3 Mechanisms

A number of mechanisms have been proposed to aid the use of EAs on multimodal problems.
These can be broadly separated into two camps:

explicit approaches,
in which specific changes are made to operators in order to preserve diversity, and

implicit approaches,
in which a framework is used that permits,
but does not guarantee,
the preservation of diverse solutions.

Explicit approaches to diversity maintenance,
based on measures of either:
genotype or phenotypic space.
Includes: Fitness Sharing, Crowding, and Speciation,
These work by affecting the probability distributions used by selection.

Implicit approaches to diversity maintenance,
based on the concept of algorithmic space.
Include Island Model EAs, and Cellular EAs.

#### 1.7.5.4 Explicit Approaches for Preserving Diversity

Fitness sharing, Crowding, Speciation all shape the probability of selection.

##### 1.7.5.4.1 Fitness Sharing

This scheme is based upon the idea that,
the number of individuals within a given niche is controlled,
by sharing their fitness, immediately prior to selection,
in an attempt to allocate individuals to niches,
in proportion to the niche fitness.
In practice, the scheme considers each possible pairing,
of individuals i and j within the population (including i with itself),
and calculates a distance d(i, j) between them according to some distance metric
(phenotypic is preferred if possible, else genotypic,
e.g., Hamming distance for binary representations).
The fitness F of each individual i is then adjusted,
according to the number of individuals falling within some pre-specified distance,
σshare using a power-law distribution:
$$F'(i) = \frac{F(i)}{\sum_j sh(d(i,j))}$$
where the sharing function sh(d) is a function of the distance d, given by

$$sh(d) = \left\{ \begin{array}{cc} 1 - (d/\sigma_{share})^\alpha & if d \leq \sigma_{share}, \\ 0 & otherwise. \\ \end{array} \right\}$$
The constant value α determines the shape of the sharing function:
for α=1 the function is linear, but for values greater than this,
the effect of similar individuals, in reducing a solution’s fitness,
falls off more rapidly, with increasing distance.
The value of the share radius σshare decides both:
how many niches can be maintained, and
the granularity with which different niches can be discriminated.
Some suggest that a default value in the range 5-10 should be used.
Fitness proportionate selection is assumed for the fitness-sharing method.
There exists a stable distribution of solutions among the niches,
when solutions from each peak have the same effective fitness F’.
Since the niche fitness Fk’ = Fk / nk ,
in this stable distribution each niche k,
contains a number of solutions nk
proportional to the niche fitness Fk1 .
This point is illustrated in below.
Studies have indicated that the use of alternative selection methods,
does not lead to the formation and preservation of stable sub-populations in niches.

Summary
Restricts the number of individuals within a given niche by “sharing” their fitness,
so as to allocate individuals to niches in proportion to the niche fitness.
Need to set the size of the niche $$\sigma_{share}$$,
in either genotype or phenotype space
Run EA as normal but after each generation set.

Note:
if we used sh(d) = 1 for d < $$\sigma_{share}$$
then the sum that reduces the fitness would simply count the number of neighbors,
i.e., individuals closer than $$\sigma_{share}$$
This creates an advantage of being alone in the neighborhood
Using 1 – d / $$\sigma_{share}$$ instead of 1 implies that we count distant neighbors less

##### 1.7.5.4.2 Crowding

The crowding algorithm was first suggested as a way of preserving diversity,
by ensuring that new individuals replaced similar members of the population.
The original scheme worked in a steady-state setting
(the number of new individuals generated in each step was 20% of the population size).
When a new offspring is inserted into the population,
a number 2 of members of the parent population are chosen at random,
and then the offspring replaces the most similar of those parents.
A number of problems were found with this approach,
and others have suggested an improvement called deterministic crowding.
This algorithm relies on the fact that offspring are likely to be similar to their parents as follows:
1. The parent population is randomly paired.
2. Each pair produces two offspring via recombination.
3. These offspring are mutated and then evaluated.
4. The four pairwise distances between offspring and parents are calculated.
5. Each offspring then competes for survival in a tournament with one parent,
so that the inter-competition distances are minimized.
In other words,
denoting the parents as p,
the offspring as o,
and using the subscript to indicate tournament pairing,
d(p1, o1) + d(p2, o2) < d(p1, o2) + d(p2, o1).
The net result of all this is that offspring tend to compete for survival with the most similar parent,
so sub-populations are preserved in niches but their size does not depend on fitness;
rather it is equally distributed amongs the peaks available.
The image illustrates this point in comparison with the distribution achieved under crowding.

Summary
Attempts to distribute individuals evenly among niches.
Relies on the assumption that offspring will tend to be close to parents.
Eses a distance metric in ph/genotype space.
Randomly shuffle and pair parents, produce 2 offspring.
Set up the parent vs. child tournaments such that the intertournament distances are minimal.
That is, number the two p’s (parents) and the two o’s (offspring) such that
d(p1, o1) + d(p2, o2) < d(p1, o2) + d(p2, o1)
and let o1 compete with p1 and o2 compete with p2

##### 1.7.5.4.3 Crowding or Fitness sharing?

Fitness sharing (top)

Crowding (bottom)

Which seems like it produces “more diversity”?

##### 1.7.5.4.4 Automatic Speciation

The automatic speciation approach imposes mating restrictions based on some aspect of the candidate solutions (or their genotypes) defining them as belonging to different species.
The population contains multiple species,
and during parent selection for recombination individuals will only mate with others from the same (or similar) species.
The biological analogy becomes particularly clear when we note that some authors refer to the aspect controlling reproductive opportunities as an individual’s “plumage”.
A number of schemes have been proposed to implement speciation,
which can be divided into two main approaches.
In the first speciation is based on the solution (or its representation),
e.g., Deb’s phenotype (genotype)-restricted mating.
The alternative approach is to add some elements such as tags to the genotype that code for the individual’s species,
rather than representing part of the solution.
Many of these ideas were previously suggested by other authors.
These are usually randomly initialized and subject to recombination and mutation.
Common to both approaches is the idea that once an individual has been selected to be a parent,
then the choice of mate involves the use of a pairwise distance metric
(in phenotype or genotype space as appropriate),
with potential mates being rejected beyond a certain distance.
Note that in the tag scheme,
there is initially no guarantee that individuals with similar tags will represent similar solutions,
although after a few generations selection will usually take care of this problem.
Neither is there any guarantee that different species will contain different solutions,
although Spears goes some way towards rectifying this by also using the tags to perform fitness sharing,
and even without this Deb reported improved performance compared to a standard GA.
Similarly, although the phenotype-based speciation scheme does not guarantee diversity maintenance,
when used in conjunction with fitness sharing, it was reported to give better results than fitness sharing on its own.

Summary
* Either only mate with genotypically / phenotypically similar members or
* Add bits (tags) to problem representation
* that are initially randomly set
* subject to recombination and mutation
* when selecting partner for recombination, only pick members with a good match

#### 1.7.5.5 Implicit Approaches for Preserving Diversity

• expand with more

Summary
Island models, Cellular EAs employ more arbitrary geographic constraint, segmenting populations in algorithmic space.

##### 1.7.5.5.1 “Island” Model Parallel EAs
• expand

Summary
Periodic migration of individual solutions between populations
Run multiple populations in parallel
After a (usually fixed) number of generations (an Epoch), exchange individuals with neighbors
Repeat until ending criteria met
Partially inspired by parallel/clustered systems

Parameters
* How often to exchange individuals ?
* too quick and all sub-populations converge to same solution
* too slow and waste time
* most authors use range ~25-150 generations
* can do it adaptively (stop each pop when no improvement for (say) 25 generations)
* How many, which individuals to exchange ?
* usually ~2-5, but depends on population size.
* Copied vs moved
* Martin et al found that better to exchange randomly selected individuals than best
* Operators can differ between the sub-populations

##### 1.7.5.5.2 Cellular EAs
• expand

Summary
Like a smoothed, continuous island logic, or animals without modern human travel.
Impose spatial structure (usually grid) in 1 pop
Current
individual
Consider each individual to exist on a point on a (usually rectangular toroid) grid
Selection (hence recombination) and replacement happen using concept of a neighborhood a.k.a. deme
Leads to different parts of grid searching different parts of space, good solutions diffuse across grid over a number of gens
* Assume rectangular grid so each individual has 8 immediate neighbors
* Equivalent of 1 generation is:
* pick individual in pop at random
* pick one of its neighbors using roulette wheel
* crossover to produce 1 child, mutate
* replace individual if fitter
* circle through population until done

## 1.8 Modern approaches

Lexicase selection: