Now, we’ll describe some of the most widely known evolutionary algorithm variants.

Historical EA variants:

* Genetic Algorithms

* Evolution Strategies

* Evolutionary Programming

* Genetic Programming

More recent versions:

* Learning Classifier Systems

* Differential Evolution

* Particle Swarm Optimization

* Estimation of Distribution Algorithms

* Evolutionary Robotics

* Neuroevolution

* Many more!

https://en.wikipedia.org/wiki/Genetic_algorithm

http://scholarpedia.org/article/Genetic_algorithms

The genetic algorithm (GA) is the most widely known type of evolutionary
algorithm.

GAs have largely (if mistakenly) been considered as function
optimization methods.

- Developed: USA in the 1960’s
- Early names: J. Holland, K. DeJong, D. Goldberg
- Holland’s original GA is now known as the simple genetic algorithm (SGA)
- Typically applied to:
- discrete function optimization
- binary representation
- straightforward problems
- now, used for benchmarks

- Features:
- not too fast
- SGA missing new variants (elitsm, sus)
- often modeled by theorists

- Other GAs use different:
- Representations
- Mutations
- Crossovers
- Selection mechanisms

GAs employ:

a binary representation,

fitness proportionate selection,

a low probability of mutation,

an emphasis on genetically inspired recombination as a means of
generating new candidate solutions.

- | - |
---|---|

Representation | Bit-strings |

Recombination | 1-Point crossover |

Mutation | Bit flip |

Parent selection | Fitness proportional, implemented by Roulette Wheel |

Survivor selection | Generational |

Select parents - for the mating pool

(size of mating pool = population size)

Shuffle the mating pool

Apply crossover - for each consecutive pair with probability
p_{c}, otherwise copy parents

Apply mutation - for each offspring (bit-flip with probability
p_{m} independently for each bit)

Replace the whole population with the resulting offspring

In detail:

GAs traditionally have a fixed work-flow:

given a population of μ individuals,

parent selection fills an intermediary population of μ,

allowing duplicates.

Then the intermediary population is shuffled to create random
pairs,

and crossover is applied to each consecutive pair with probability
p_{c},

and the children replace the parents immediately.

The new intermediary population undergoes mutation individual by
individual,

where each of the l bits in an individual is modified by mutation with
independent probability p_{m}.

The resulting intermediary population forms the next generation
replacing the previous one entirely.

Note that in this new generation there might be pieces,

perhaps complete individuals,

from the previous one that survived crossover and mutation without being
modified,

but the likelihood of this is rather low (depending on the parameters μ,
p_{c}, p_{m}).

- Simple problem: maximize x
^{2}over {0, 1, …, 31} - GA approach:
- Representation: binary code, e.g., 01101 ↔︎ 13
- Population size: 4
- 1-point x-over, bit-wise mutation
- Roulette wheel selection
- Random initialization

- We previously showed one generational cycle done by hand

We covered this previously.

Has been subject of many (early) studies.

In the early years of the field,

there was significant attention paid to trying to establish suitable
values for GA parameters,

such as the population size, crossover and mutation probabilities.

Recommendations were for mutation rates between 1/l and 1/μ,

crossover probabilities around 0.6-0.8,

and population sizes in the fifties or low hundreds,

although to some extent these values reflect the computing power
available in the 1980s and 1990s.

It has been recognized that there are some flaws in the SGA.

Shows many shortcomings, e.g.,

Representation is too restrictive.

Mutation and crossover operators are only applicable for bit-string and
integer representations.

Selection mechanism is sensitive to converging populations with close
fitness values.

Generational population model can be improved with explicit survivor
selection.

Various features have been added since the first SGA.

Factors such as **elitism,** and **non-generational
models** were added to offer faster convergence if needed.

**SUS** is preferred to roulette wheel
implementation,

and most commonly **rank-based selection** is used,

implemented via **tournament selection** for simplicity and
speed.

Studying the biases in the interplay between representation and
one-point crossover led to the development of alternatives,

such as **uniform crossover**,

and a stream of work through ‘messy-GAs’ and ‘Linkage Learning’,

to Estimation of Distribution Algorithms.

Analysis and experience has recognised the need to use
**non-binary representations** where more
appropriate.

The problem of how to choose a suitable fixed mutation rate has largely
been solved by adopting the idea of
**self-adaptation**,

where the rates are encoded as extra genes in an individuals
representation and allowed to evolve.

https://en.wikipedia.org/wiki/Evolution_strategy

http://scholarpedia.org/article/Evolutionary_strategies

This one was actually intended as a numerical optimization
technique.

- Developed: Germany in the 1960’s
- Early names: I. Rechenberg, H.-P. Schwefel
- Typically applied to:
- numerical optimization

- Attributed features:
- fast
- good optimizer for real-valued optimization
- relatively much theory

- Special:
- self-adaptation of (mutation) parameters standard

- μ is population number
- λ is offspring number

- | - |
---|---|

Representation | Real-valued vectors |

Recombination | Discrete or intermediary |

Mutation | Gaussian perturbation |

Parent selection | Uniform random |

Survivor selection | (μ,λ) or (μ+λ) |

- Task: minimize f: R
^{n}in R - Algorithm: “two-membered ES” using
- Vectors from R
^{n}directly as chromosomes - Population size 1
- Only mutation creating one child
- Greedy selection

- Vectors from R

- z values drawn from normal distribution N(ξ,σ)
- mean ξ is set to 0
- variation σ is called mutation step size

- σ is varied on the fly by the “1/5 success rule”:

Rechenberg’s 1/5 success rule states that the ratio of successful mutations to all mutations should be 1/5.

Hence if the ratio is greater than 1/5 the step size should be increased,

and if the ratio is less than 1/5,

then it should be decreased.

The rule is executed at periodic intervals,

for instance, after k iterations,

each σ is reset by:

σ = σ / c if p_{s}> 1/5

σ = σ * c if p_{s}< 1/5

σ = σ if p_{s}= 1/5

where:

p_{s}is the % of successful mutations

constant c is \(0.8 \leq c \leq 1\)

Using this mechanism,

changes in the parameter values are now based on feedback from the search.

The original nozzle:

To actually build it:

ES works with extended chromosomes \(\bar{x}\), \(\bar{p}\),

where \(\bar{x}\) ∈ R^{n} is a
vector from the domain of the given objective function to be
optimized,

while \(\bar{p}\) carries the algorithm
parameters.

- Chromosomes consist of three parts:
- Object variables: x
_{1}, …, x_{n} - Strategy parameters:
- Mutation step sizes: \(\sigma_1, ..., \sigma_{n_\sigma}\)
- Rotation angles: \(\alpha_1, ..., \alpha_{n_\alpha}\)

- Object variables: x
- Not every component is always present
- Full size: <x
_{1}, …, x_{n}, σ_{1},…,σ_{n}, α_{1}, …, α_{k}> - where k = n(n-1)/2 (no. of i,j pairs)

- Creates one child
- Acts per variable / position by either
- Averaging parental values, or
- Selecting one of the parental values

- From two or more parents by either:
- Using two selected parents to make a child
- Selecting two parents for each position

Two fixed parents | Two parents selected for each i | |
---|---|---|

z_i = (x_i + y_i)/2 | Local intermediary | Global intermediary |

z_i is x_i or y_i chosen randomly | Local discrete | Global discrete |

Parents are selected by uniform random distribution whenever an
operator needs one/some.

Thus, ES parent selection is unbiased.

Every individual has the same probability to be selected.

- Given a dynamically changing fitness landscape (optimum location shifted every 200 generations)
- Self-adaptive ES is able to
- follow the optimum and
- adjust the mutation step size after every shift !

μ > 1 to carry different strategies.

λ > μ to generate offspring surplus.

(μ,λ)-selection to get rid of mis-adapted σ’s.

Mixing strategy parameters by (intermediary) recombination on them.

Is quite high for ES.

Takeover time *τ* _*_ is a measure to quantify the selection
pressure.

The number of generations it takes until the application of selection
completely fills the population with copies of the best
individual.

Goldberg and Deb showed:

For proportional selection in a genetic algorithm the takeover time is
*λ ln(λ)*

Task: to create a color mix yielding a target color (that of a well
known cherry brandy).

Ingredients: water + red, yellow, blue dye

Representation: <w, r, y, b> no self-adaptation!

Values scaled to give a predefined total volume (30 ml).

Mutation: lo / med / hi values of σ used with equal chance.

Selection: (1,8) strategy

Fitness: students effectively making the mix and comparing it with
target color

Termination criterion: student satisfied with mixed color

Solution is found mostly within 20 generations

Accuracy is very good.

https://en.wikipedia.org/wiki/Ackley_function

The Ackley function is widely used for testing optimization
algorithms,

much like an image dataset, or openAI gym, now.

In its two-dimensional form, as shown in the plot above,

it is characterized by a nearly flat outer region,

and a large hole at the centre.

The function poses a risk for optimization algorithms,

particularly hillclimbing algorithms,

to be trapped in one of its many local minima.

- The Ackley function (here used with n=30):

\(f(x,y)=-20\exp \left[-0.2{\sqrt{1/n)} \sum_{i=1}^n x^2_i}\,\right] -\exp \left[ 1/n \sum_{i=1}^n cos(2\pi x_i) \right]+e+20\) - Evolution strategy:
- Representation:
- -30 < x
_{i}< 30 - 30 step sizes

- -30 < x
- (30,200) selection
- Termination : after 200000 fitness evaluations
- Results: average best solution is 7.48 x 10
^{-8}(very good)

- Representation:

https://en.wikipedia.org/wiki/Evolutionary_programming

http://scholarpedia.org/article/Evolutionary_programming

- Developed: USA in the 1960’s
- Early names: D. Fogel
- Typically applied to:
- traditional EP: prediction by finite state machines
- contemporary EP: (numerical) optimization and beyond

- Attributed features:
- now, very open framework: any representation and mutation operators OK
- crossbred with ES (contemporary EP)
- consequently: now, hard to say what “standard” EP is

- Special:
- no recombination
- self-adaptation of parameters standard (contemporary EP)

- | - |
---|---|

Representation | Real-valued vectors |

Recombination | None |

Mutation | Gaussian perturbation |

Parent selection | Deterministic (each parent one offspring) |

Survivor selection | Probabilistic (μ+μ) |

EP aimed at achieving intelligence

Intelligence was viewed as adaptive behavior

Prediction of the environment was considered a prerequisite to adaptive
behavior

Thus: capability to predict is key to intelligence

https://en.wikipedia.org/wiki/Finite-state_machine

The original paper:

Fogel, L.J., Owens, A.J., Walsh, M.J.

Artificial Intelligence Through Simulated Evolution

New York: John Wiley, 1966

A discussion of it:

06-PopularVariants/fogel66.pdf

06-PopularVariants/fogel66a.pdf

06-PopularVariants/fogel95.pdf

- Finite state machine (FSM):
- States S
- Inputs I
- Outputs O
- Transition function δ: S x I -> S x O
- Transforms input stream into output stream

- Can be used for predictions,
- e.g. to predict next input symbol in a sequence

A population of FSMs is exposed to the environment, that is,

the sequence of symbols that has been observed up to the current
time.

For each parent machine,

as each input symbol is presented to the machine,

the corresponding output symbol is compared with the next input
symbol.

The worth of this prediction is then measured with respect to the payoff
function

(e.g., all-none, squared error).

After the last prediction is made,

a function of the payoff for the sequence of symbols indicates the
fitness of the machine or program

(e.g., average payoff per symbol).

Offspring machines are created by randomly mutating the parents,

and are scored in a similar manner.

Those machines that provide the greatest payoff are retained,

to become parents of the next generation,

and the process iterates.

When new symbols are to be predicted,

the best available machine serves as the basis for making such a
prediction,

and the new observation is added to the available database.

Consider the FSM with:

S = {A, B, C}

I = {0, 1}

O = {a, b, c}

δ given by a diagram (above).

Consider the following FSM

Task: predict next input

Quality (prediction accuracy): % of in_{(i+1)} =
out_{i}

Given initial state C

Input sequence 011101

Leads to output 110111

Quality: 3 out of 5

- Evolve a function P, where:

P(n) = 1 if n is prime, 0 otherwise - I = N = {1, 2, 3, …, n, …}
- O = {0,1}
- Correct prediction: out
_{i}= P(in(i+1)) - Fitness function:
- 1 point for correct prediction of next input
- 0 point for incorrect prediction
- Penalty for “too many” states

- Parent selection: each FSM is mutated once
- Mutation operators (one selected randomly):
- Change an output symbol
- Change a state transition (i.e. redirect edge)
- Add a state
- Delete a state
- Change the initial state

- Survivor selection: (μ+μ)
- Results: overfitting, after 202 inputs, the best FSM had one state,
and both outputs were 0,
- i.e., it always predicted “not prime”…

- Main point: not perfect accuracy, but demonstration that simulated evolutionary process can create good solutions for intelligent task.
- Not all approaches using FSM’s failed.
- Others were successful; this is just a funny one!

No predefined representation in general.

Thus, no predefined mutation (must match representation).

Often applies self-adaptation of mutation parameters.

- For continuous parameter optimization
- Chromosomes consist of two parts:
- Object variables: x
_{1}, …, x_{n} - Mutation step sizes: σ
_{1}, …, σ_{n}

- Object variables: x
- Full size: <x
_{1}, …, x_{n}, σ_{1}, …, σ_{n}>

- Chromosomes: <x
_{1}, …, x_{n}, σ_{1}, …, σ_{n}> - σ
_{i}’ = σ_{i}* (1 + α * N(0,1)) - x
_{i}’ = x_{i}+ σ_{i}’ * N_{i}(0,1) - α ≈ 0.2
- boundary rule: σ’ < ε
_{0}⇒ σ’ = ε_{0} - Other variants proposed and tried:
- Using variance instead of standard deviation
- Mutate σ-last
- Other distributions, e.g, Cauchy instead of Gaussian

Since the 1990s,

EP variants for optimization of real-valued parameter vectors have
become more frequent,

and even positioned as “standard’ EP.

During the history of EP a number of mutation schemes,

such as one in which the step size is inversely related to the fitness
of the solutions,

have been proposed.

Since the proposal of meta-EP,

self-adaptation of step sizes has become the norm.

Other ideas from ES have also informed the development of EP
algorithms,

and a version with self-adaptation of covariance matrices, called
R-meta-EP is also in use.

Worthy of note is Yao’s improved fast evolutionary programming algorithm
(IFEP).

whereby two offspring are created from each parent,

one using a Gaussian distribution to generate the random
mutations,

and the other using the Cauchy distribution.

The latter has a fatter tail

(i.e., more chance of generating a large mutation),

which the authors suggest gives the overall algorithm greater chance of
escaping from local minima,

whilst the Gaussian distribution (if small step sizes evolve),

gives greater ability to fine-tune the current parents.

Little to none, though it was present:

http://scholarpedia.org/article/Evolutionary_programming

Rationale:

One point in the search space stands for a species,

not for an individual and there can be no crossover between
species.

Much historical debate “mutation vs. crossover”.

The issue of using a mutation-only algorithm versus a recombination and
mutation variant has been intensively discussed since the 1990s.

“the traditional practice of setting operator probabilities at constant
,

… is quite limiting and may even prevent the successful discovery of
suitable solutions.”

- Each individual creates one child by mutation
- Thus:
- Deterministic
- Not biased by fitness

- Use survivor selection primarily

06-PopularVariants/fogel02.pdf

Skim paper, figure of ANN at end.

- Neural nets for evaluating future values of moves are evolved
- NNs have fixed structure with 5046 weights, these are evolved + one weight for “kings”
- Representation:
- vector of 5046 real numbers for object variables (weights)
- vector of 5046 real numbers for σ’s

- Mutation:
- Gaussian, lognormal scheme with σ-first
- Plus special mechanism for the kings’ weight

- Population size 15

Tournament size q = 5

Programs (with NN inside) play against other programs,

no human trainer or hard-wired intelligence.

After 840 generation (6 months!),

best strategy was tested against humans via Internet.

Program earned “expert class” ranking,

outperforming 99.61% of all rated players.

**NeuroEvolution**

The checkers player is a kind of:

https://en.wikipedia.org/wiki/Neuroevolution

http://www.scholarpedia.org/article/Neuroevolution

Evolutionary algorithms generate artificial neural networks (ANN),

their parameters, and their rules.

It is commonly used as part of the reinforcement learning
paradigm,

and it can be contrasted with conventional deep learning
techniques,

that use gradient descent on a neural network with a fixed topology.

Discussion question:

What happened to the field of EP’s evolving FSMs?

https://en.wikipedia.org/wiki/Genetic_programming

https://geneticprogramming.com/

https://www.genetic-programming.com/

The first record of the proposal to evolve programs is probably that
of Alan Turing in the 1950s.

06-PopularVariants/turing50.pdf

Non-GP EC grew in the 60s and 70s,

but despite this progress,

there was a gap of some thirty years,

before Richard Forsyth demonstrated the successful evolution of small
programs,

represented as trees,

to perform classification of crime scene evidence for the UK Home
Office!

06-PopularVariants/beagle81.pdf

Although the idea of evolving programs,

initially in the computer language LISP,

was discussed among John Holland’s students,

it was not until they organized the first Genetic Algorithms conference
in Pittsburgh,

that Nichael Cramer published evolved programs in two specially designed
languages.

https://dl.acm.org/doi/10.5555/645511.657085

In 1988, another PhD student of John Holland,

John Koza, patented his invention of a GA for program evolution:

“Non-Linear Genetic Algorithms for Solving Problems”,

“United States Patent 4935877”, 1990

This was followed by publication in the International Joint
Conference on Artificial Intelligence IJCAI-89.

06-PopularVariants/koza_ijcai1989.pdf

Then, Koza followed this with 205 publications on “Genetic
Programming” (GP).

GP was a name coined by David Goldberg, also a PhD student of John
Holland.

There was an enormous expansion of the number of publications.

In 2010, Koza listed 77 results where Genetic Programming was human
competitive.

06-PopularVariants/koza_GPEM2010article.pdf

- Developed: Grew mostly in the USA in the 1990’s
- Early popularizer: J. Koza
- Typically applied to:
- Machine learning tasks (prediction, classification…)

- Attributed features:
- Competes with neural nets and alike
- Often needs huge populations (thousands)
- Slow

- Special:
- non-linear chromosomes: trees, graphs
- Mutation possible but not necessary

Genetic programming is a relatively young member of the evolutionary
algorithm family.

EAs discussed so far are typically applied to optimization
problems.

Most other EAs find some input, realizing maximum payoff.

GP could instead be positioned in machine learning.

GP is used to seek models with maximum fit.

Once maximization is introduced,

modeling problems can be seen as special cases of optimization.

This is the basis of using evolution for such tasks:

models, represented as parse trees,

are treated as individuals,

and their fitness is the model quality to be maximized.

**Discussion question:**

Might GP subsume other fields?

Does GP suffice in all cases where evolving an FSM might?

Would you guess GP has any disadvantages compared to simpler
representations?

The parse trees used by GP as chromosomes are expressions in a given
formal syntax.

Depending on the problem at hand,

and the users’ perceptions on what the solutions must look like,

this can be the syntax of arithmetic expressions,

formulas in first-order predicate logic,

or code written in a programming language.

In particular, they can be envisioned as executable source code,

that is, programs.

The syntax of functional programming,

e.g., the language LISP,

very closely matches the so-called Polish notation of expressions.

For instance, the formula we covered before:

\(2 \cdot \pi + ((x + 3) -
\frac{y}{5+1})\)

can be rewritten in this Polish notation as:

+(·(2, π), −(+(x, 3), /(y, +(5, 1)))),

while the executable LISP code1 looks like:

(+ (· 2 π) (− (+ x 3) (/ y (+ 5 1)))).

This is cool!

- | - |
---|---|

Representation | Tree structures |

Recombination | Exchange of subtrees |

Mutation | Random change in trees |

Parent selection | Fitness proportional |

Survivor selection | Generational replacement |

Most are basic trees:

But, there are other flavors too (more below).

Bank wants to distinguish good from bad loan applicants

Model needed that matches historical data

ID | No of children (NoS) | Salary (S) | Marital status | OK? |
---|---|---|---|---|

ID-1 | 2 | 45000 | Married | 0 |

ID-3 | 1 | 40000 | Divorced | 1 |

ID-2 | 0 | 30000 | Single | 1 |

… |

A possible model:

```
IF (NoC = 2) AND (S > 80000)
THEN good
ELSE bad
```

In general:

```
IF formula
THEN good
ELSE bad
```

Only unknown is the right formula, hence:

Our search space (phenotypes) is the set of formulas.

Natural fitness of a formula:

percentage of well classified cases of the model it stands for.

```
IF (NoC = 2) AND (S > 80000)
THEN good
ELSE bad
```

can be represented by the following tree

Compare:

GA scheme using crossover AND mutation sequentially (be it
probabilistically)

GP scheme using crossover OR mutation (chosen probabilistically)

- Parent selection was typically fitness proportionate.
- Over-selection was used in very large populations.
- rank population by fitness and divide it into two groups:
- group 1: best x% of population,
- group 2: other (100-x)%

- Around f80% of selection operations chooses from group 1, 20% from group 2
- for pop. size = 1000, 2000, 4000, 8000 x = 32%, 16%, 8%, 4%
- motivation: to increase efficiency, %’s come from rule of thumb

- rank population by fitness and divide it into two groups:
- Survivor selection:
- Typical: generational scheme (thus none)
- Recently steady-state is becoming popular for its elitism

- Maximum initial depth of trees D
_{max}is set - Full method (each branch has depth = D
_{max}):- nodes at depth d < D
_{max}randomly chosen from function set F - nodes at depth d = D
_{max}randomly chosen from terminal set T

- nodes at depth d < D
- Grow method (each branch has depth ≤ D
_{max}):- nodes at depth d < D
_{max}randomly chosen from F ∪ T - nodes at depth d = D
_{max}randomly chosen from T

- nodes at depth d < D
- Common GP initialisation:
- ramped half-and-half, where grow & full method each deliver half of initial population

Average tree sizes tend to grow during the course of a run.

This is known as bloat,

sometimes called the ‘survival of the fattest’.

This may be a broader phenomenon than GP.

Bloat is seemingly observed in organismal chromosomes as well.

Some small viruses show size constraints.

The simplest way to prevent bloat,

is to introduce a maximum tree size.

If the child(ren) resulting from a variation aperation would exceed this
maximum size,

then the variation operator is forbidden from acting.

The threshold can be seen as an additional parameter of mutation and
recombination in GP.

Another mechanism is parsimony pressure.

(i.e., being ‘stingy’ or ungenerous)

is achieved through introducing a penalty term in the fitness
formula,

that reduces the fitness of large chromosomes,

or uses other multiobjective techniques.

- Given some points in
**R**^{2}, (x_{1}, y_{1}), … , (x_{n}, y_{n}) - Find function f(x) s.t. ∀i = 1, …, n : f(x
_{i}) = y_{i} - Possible GP solution:
- Representation by F = {+, -, /, sin, cos}, T =
**R**∪ {x} - Fitness is the error

\(err(f) = \sum_{i=1}^n (f(x_i) - y_i)^2\) - All operators standard
- pop.size = 1000, ramped half-half initialisation
- Termination: n “hits” or 50000 fitness evaluations reached (where
“hit” is if | f(x
_{i}) - y_{i}| < 0.0001)

- Representation by F = {+, -, /, sin, cos}, T =

Types of GP include:

Tree-based Genetic Programming

Stack-based Genetic Programming

Linear Genetic Programming (LGP)

Grammatical Evolution

Extended Compact Genetic Programming (ECGP)

Cartesian Genetic Programming (CGP)

Probabilistic Incremental Program Evolution (PIPE)

Strongly Typed Genetic Programming (STGP)

Genetic Improvement of Software for Multiple Objectives (GISMO)

Basic steps in generational, Tree-based GP:

- Generate an initial, stochastic population.
- Iteratively perform selection, genetic operation, and evaluation:
- Evaluate each program (hypothesis) in the current population, against the given dataset and determine how well it performed, the value recorded as a fitness score.
- Randomly select programs and compare their fitness scores.
- Apply one of three genetic operators to a copy of the leading
program:
- reproduction
- mutation
- crossover

- … then move the program into the subsequent generation.
- Evaluate all programs (hypotheses) in each new generation.

- Repeat until the user-defined termination criteria are met.

Review this briefly:

https://geneticprogramming.com/software/

Mention Clojush, Tiny Genetic Programming

http://www.gp-field-guide.org.uk/

06-PopularVariants/GP_book1.pdf

https://www.moshesipper.com/evolved-to-win.html

06-PopularVariants/evolvedtowin.pdf

Mention his GP bot competitions.

And, our future lecture for this course:

18-GeneticProgramming.html

https://en.wikipedia.org/wiki/Learning_classifier_system

Is much like an independent discovery of processes which
resemble:

reinforcement learning, Q-learning, TD-learning.

https://en.wikipedia.org/wiki/Reinforcement_learning

https://en.wikipedia.org/wiki/Temporal_difference_learning

https://en.wikipedia.org/wiki/Q-learning (show images)

Read (show this briefly):

https://artint.info/2e/html/ArtInt2e.Ch12.html

https://artint.info/3e/html/ArtInt3e.Ch13.html

Animation on ch12 here (Java applets, lol):

https://artint.info/2e/online.html

Genome:

{condition:action:payoff}

{condition:action:payoff,accuracy}

Where payoff is like RL’s reward.

- Developed: First described in 1976
- Early names: Holland
- Typically applied to:
- Machine learning tasks working with rule sets
- Giving best response to current state environment

- Attributed features:
- Combination classifier system and learning algorithm
- Cooperation (iso usual competition)
- Using genetic algorithms
- Michigan-style (rule is individual) vs Pittsburgh-style (rule set is individual)

- | - |
---|---|

Representation | Tuple of {condition:action:payoff,accuracy}, conditions use {0,1,#} alphabet |

Survivor selection | Stochastic, inversely related to number of rules covering same environmental niche |

Recombination | One-point crossover on conditions/actions |

Parent selection | Fitness proportional with sharing within environmental niches |

Mutation | Binary resetting as appropriate on action/conditions |

Fitness | Each reward received updates predicted pay-off and accuracy of rules in relevant action sets by reinforcement learning |

Each rule of the rule base is a tuple {condition:action:payoff}

Match set: subset of rules whose condition matches the current inputs
from the environment

Action set: subset of the match set advocating the chosen action

Later, the rule-tuple included an accuracy value,

reflecting the system’s experience of how well the predicted payoff
matches the reward received

- k-bit multiplexer is bit-string with length k where
- k = l + 2
^{l} - l is the address part
- 2
^{l}is the data part - example: l=2, k=6, 101011 correct string

- k = l + 2
- Return the value of the data bit specified by the address part
- (take position 10 of data part 1011, which is 0)
- Rewards can be assigned to the correct answer

The Pittsburgh-style LCS predates,

but is similar to the better-known GP:

Each member of the evolutionary algorithm’s population represents a
complete model,

of the mapping from input to output spaces.

Each gene in an individual typically represents a rule,

and a new input item may match more than one rule,

in which case typically the first match is taken.

This means that the representation should be viewed as an ordered
list.

Two individuals which contain the same rules,

but in a different order on the genome,

are effectively different models.

Learning of appropriately complex models is typically facilitated by
using a variable-length representation,

so that new rules can be added at any stage.

This approach has several conceptual advantages.

In particular, since fitness is awarded to complete rule sets,

models can be learned for complex multi-step problems.

The downside of this flexibility is that,

like GP, Pittsburgh-style LCS suffers from bloat,

and the search space becomes potentially infinite.

Nevertheless, given sufficient computational resources,

and effective methods of parsimony to counteract bloat,

Pittsburgh-style LCS has demonstrated state-of-the-art performance in
several machine learning domains,

especially for applications such as bioinformatics and medicine,

where human-interpretability of the evolved models is vital,

and large data-sets are available,

so that the system can evolve off-line to minimize prediction
error.

Two recent examples winning Humies Awards,

for better than human performance are in the realms of:

prostate cancer detection and protein structure prediction.

https://en.wikipedia.org/wiki/Differential_evolution

- Developed: USA in 1995
- Early names: Storn, Price
- Typically applied to:
- Non-linear and non-differentiable continuous space functions

- Attributed features:
- populations are lists
- four parents are needed to create a new individual
- different variants exist, due to changing the base vector

- Special:
- differential mutation

- | - |
---|---|

Representation | Real-valued vectors |

Recombination | Uniform crossover |

Mutation | Differential mutation |

Parent selection | Uniform random selection of the 3 necessary vectors |

Survivor selection | Deterministic elitist replacement (parent vs child) |

In this section we describe a young member of the evolutionary
algorithm family:

differential evolution (DE).

Its birth can be dated to 1995,

when Storn and Price published a technical report describing the main
concepts behind a

“new heuristic approach for minimizing possibly nonlinear and
non-differentiable continuous space functions”.

The distinguishing feature that delivered the name of this
approach,

is a twist to the usual reproduction operators in EC:

the so-called **differential mutation**.

Given a population of candidate solution vectors in the set of real
numbers, R^{n},

a new mutant vector x’ is produced,

by adding a perturbation vector to an existing one,

x’ = x + p,

where the perturbation vector, p, is the scaled vector difference
of:

two other, randomly chosen population members

p = F · (y − z),

where y and x are randomly chosen population members,

the scaling factor F > 0 is a real number,

that controls the rate at which the population evolves.

DE uses uniform crossover but with a slight twist

The other reproduction operator is the usual uniform crossover,

subject to one parameter,

the crossover probability Cr ∈ [0, 1]

that defines the chance that for any position in the parents currently
undergoing crossover,

the allele of the first parent will be included in the child.

Remember that in GAs, the crossover rate pc ∈ [0, 1] is defined for
any given pair of individuals,

and it is the likelihood of actually executing crossover.

In DE, at one randomly chosen position,

the child allele is taken from the first parent, without making a random
decision.

This ensures that the child does not duplicate the second parent.

The number of inherited mutant alleles follows a binomial
distribution

- Population is a list, not related to fitness value
- Creating a mutant vector population
- For each new mutant, three vectors are chosen randomly from the population P (base vector υ and two others, y and z)

- Trial vector population is created
- Uniform crossover between and

- Deterministic selection applied to each pair and
- where the i-th individual in next generation is the one with highest fitness value

- Different variants exists due to changing the base vector
- The variant is described as DE/a/b/c where
- a is the base vector (rand or best)
- b is the number of different vectors to define perturbation vectorexample four randomly chosen:
- c denotes the crossover scheme (“bin” is uniform crossover)

https://en.wikipedia.org/wiki/Particle_swarm_optimization

http://scholarpedia.org/article/Particle_swarm_optimization

- Developed: in the 1995
- Early names: Kennedy, Eberhart
- Typically applied to:
- Optimizing nonlinear functions

- Attributed features:
- No crossover
- Every candidate solution carries it own perturbation vector

- Special:
- inspired by social behavior of bird flocking/fish schooling
- Particles with location and velocity iso individuals with genotype and mutation

- | - |
---|---|

Representation | Real-valued vectors |

Recombination | None |

Mutation | Adding velocity vector |

Parent selection | Deterministic (each parent creates one offspring via mutation) |

Survivor selection | Generational (offspring replaces parents) |

Every population member can be considered as a pair,

where the first vector is candidate solution,

and the second one a perturbation vector in IR^{n}

On a conceptual level every population member in a PSO can be
considered as a pair <x, p>,

where x ∈ IRn is a candidate solution vector and p ∈ IRn is a
perturbation vector,

that determines how the solution vector is changed to produce a new
one.

The main idea is that a new pair <x’, p’> is produced from <x,
p>

by first calculating a new perturbation vector p’

(using p and some additional information)

and adding this to x.

That is,

x’ = x + p’

The core of the PSO perspective is to consider a population member as a
point in space,

with a position and a velocity,

and use the latter to determine a new position (and a new
velocity).

Thus, looking under the hood of a PSO,

we find that a perturbation vector is a velocity vector v,

and a new velocity vector v’ is defined as:

the weighted sum of three components: v and two vector
differences.

The first points from the current position x to the best position y the
given population member ever had in the past,

and the second points from x to the best position z the whole population
ever had.

The perturbation vector determines how the solution vector is changed to produce a new one.

- A member is considered as a point in space with a position and a velocity
- Perturbation vector is a velocity vector and a new velocity vector
is defined as the weighted sum of three components:
- Current velocity vector
- Vector difference current position to best position of member so far
- Vector difference from current position to best position of population so far

- where
*w*and*Φ**i*are the weights and*U1*and*U2*randomizer matrices - Because the personal best and global best must be remembered, the populations are lists

https://en.wikipedia.org/wiki/Estimation_of_distribution_algorithm

Estimation of distribution algorithms (EDA) are based on the idea of
replacing the creation of offspring by ‘standard’ variation operators
(recombination and mutation) by a three-step process.

First a ‘graphical model’ is chosen to represent the current state of
the search,

in terms of the dependencies between variables (genes) describing a
candidate solution.

Next the parameters of this model are estimated from the current
population,

to create a conditional probability distribution over the
variables.

Finally, offspring are created by sampling this distribution.

```
BEGIN
INITIALISE population P 0 with µ random candidate solutions;
set t = 0;
REPEAT UNTIL ( TERMINATION CONDITION is satisfied ) DO
EVALUATE each candidate in P t ;
SELECT subpopulation Pst to be used in the modeling steps;
MODEL SELECTION creates graph G by dependencies in Pst ;
MODEL FITTING creates Bayesian Network BN with G and Pst ;
MODEL SAMPLING produces a set Sample(BN ) of µ candidates;
set t = t + 1;
P t = Sample(BN );
OD
END
```

https://en.wikipedia.org/wiki/Evolutionary_robotics

http://scholarpedia.org/article/Evolutionary_Robotics

https://en.wikipedia.org/wiki/Neuroevolution

http://www.scholarpedia.org/article/Neuroevolution

Discussed above!