Which EA is best?

How do I choose an EA?

“I want the EA to solve my problem…”

Different objectives imply different ways of designing and working with
an EA.

A good first step is to examine the given problem context.

We can roughly distinguish two main types of problems:

* design (one-off) problems

* repetitive problems

  * including on-line control problems as special cases

- Experiment design
- Algorithm design
- Test problems
- Measurements and statistics
- Some tips and summary

- Has a **goal** or goals
- Involves **algorithm** design and implementation
- Needs **problem**(s) to run the algorithm(s) on
- Amounts to **running** the algorithm(s) on the problem(s)
- Delivers **measurement data**, the results
- Is concluded with **evaluating** the results in the light of the given goal(s)
- Is often **documented**

- Get a good solution for a given problem
- Show that EC is applicable in a (new) problem domain
- Show that *my_EA* is better than *benchmark_EA*
- Show that EAs outperform traditional algorithms
- Find best setup for parameters of a given algorithm
- Understand algorithm behavior (e.g., pop dynamics)
- See how an EA scales-up with problem size
- See how performance is influenced by parameters
- …

Logistics: Optimizing delivery routes.

* Different destinations each day

* Limited time to run algorithm each day

* Must *always* produce a *reasonably* good route in the limited time

- Optimizing spending on improvements to national road network
- Total cost: billions of Euros
- Computing costs relatively negligible
- Six months to run the algorithm on hundreds of computers
- Many runs possible
- Must produce a *very* good result just *once*

**Design** perspective:

find a very good solution at least once

**Production** perspective:

find a good solution at almost every run

**Publication** perspective:

must meet scientific standards (heh?)

**Application** perspective:

good enough is good enough (verification!)

These perspectives have very different implications for evaluating the
results.

However, they are often left implicit…

**Design a representation**

Design a way of mapping a genotype to a phenotype

Design a way of evaluating an individual

**Introduce diversity**

Design suitable mutation operator(s)

Design suitable recombination operator(s)

**Select individuals**

Decide how to select individuals to be parents

Decide how to select individuals for the next generation (how to manage
the population)

**Instantiate**

Decide how to start: initialization method

Decide how to stop: termination criterion
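The design steps above can be sketched as one minimal EA. This is a hedged illustration, not a prescribed implementation: it assumes a simple GA on bitstrings maximizing OneMax, and all function names (`run_ea`, `tournament`, etc.) are hypothetical.

```python
import random

def one_max(genotype):                 # evaluation: count of 1-bits
    return sum(genotype)

def init_population(mu, n, rng):       # initialization method
    return [[rng.randint(0, 1) for _ in range(n)] for _ in range(mu)]

def mutate(genotype, p, rng):          # bit-flip mutation, rate p per gene
    return [1 - g if rng.random() < p else g for g in genotype]

def crossover(a, b, rng):              # one-point recombination
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def tournament(pop, k, rng):           # parent selection, tournament size k
    return max(rng.sample(pop, k), key=one_max)

def run_ea(n=30, mu=20, gens=200, seed=0):
    rng = random.Random(seed)
    pm = 1.0 / n
    pop = init_population(mu, n, rng)
    for _ in range(gens):              # termination: fixed budget...
        offspring = [mutate(crossover(tournament(pop, 3, rng),
                                      tournament(pop, 3, rng), rng), pm, rng)
                     for _ in range(mu)]
        # survivor selection: keep the mu best of parents + offspring
        pop = sorted(pop + offspring, key=one_max, reverse=True)[:mu]
        if one_max(pop[0]) == n:       # ...or optimum found
            break
    return one_max(pop[0])

print(run_ea())
```

Each design decision (representation, operators, selection, initialization, termination) maps onto one function above and can be swapped independently.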

Many representations can be generalized to a numerical one.

By assuming this unified representation space,

we can define purely numerical landscapes with different shapes.

- 5 DeJong functions
- 25 “hard” objective functions
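As a sketch of such numerical landscapes, here are two classic examples: the De Jong F1 ("sphere") function, and the multimodal Rastrigin function for contrast. These specific two are chosen for illustration; the benchmark suites above contain many more.

```python
import math

def sphere(x):
    """De Jong F1 ("sphere"): unimodal, separable; minimum 0 at the origin."""
    return sum(xi * xi for xi in x)

def rastrigin(x):
    """A "hard" multimodal landscape; minimum 0 at the origin."""
    return 10 * len(x) + sum(xi * xi - 10 * math.cos(2 * math.pi * xi)
                             for xi in x)

print(sphere([1.0, 2.0]))      # 5.0
print(rastrigin([0.0, 0.0]))   # 0.0
```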

- Frequently encountered or otherwise important variants of a given practical problem
- Selection from recognized benchmark problem repository (“challenging” by being NP— ?!)
- Problem instances made by random generator
- The choice has severe implications for
  - generalizability and
  - scope of the results

- I invented “tricky mutation”
- Showed that it is a good idea by:
- Running standard (?) GA and tricky GA
- On 10 objective functions from the literature
- Finding tricky GA better on 7, equal on 1, worse on 2 cases

- I wrote it down in a paper
- And it got published!
- What did I learn from this experience?
- Is this good work?

What did I (my readers) not learn:

* How relevant are these results (test functions)?

* What is the scope of claims about the superiority of the tricky
GA?

* Is there a property distinguishing the 7 good and the 2 bad
functions?

* Are my results generalizable? (Is the tricky GA applicable for other
problems? Which ones?)

- Advantages:
- Results can be considered very relevant from the point of view of the application domain (data supplier)

- Disadvantages:
- Can be over-complicated
- There may be few available sets of real data
- May be commercially sensitive – difficult to publish and to let others compare
- Results are hard to generalize

- OR-Library
- http://www.ms.ic.ac.uk/info.html
- UCI Machine Learning Repository
- www.ics.uci.edu/~mlearn/MLRepository.html

- Advantage:
- Well-chosen problems and instances (hopefully)
- Much other work on these results comparable

- Disadvantage:
- Not real – might miss crucial aspect
- Algorithms get tuned for popular test suites

- GA/EA Repository of Test Problem Generators
- http://www.cs.uwyo.edu/~wspears/generators.html
- Advantage:
- Allow very systematic comparisons for they
- can produce many instances with the same characteristics
- enable gradual traversal of a range of characteristics (hardness)

- Can be shared allowing comparisons with other researchers

- Disadvantage:
- Not real – might miss crucial aspect
- Given generator might have hidden bias

- EAs are stochastic, so don’t rely on conclusions from a single run
- perform sufficient number of independent runs
- use statistical measures (averages, standard deviations)
- use statistical tests to assess reliability of conclusions
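The "sufficient independent runs plus statistical measures" rule can be sketched with the standard library alone. Here `run_once` is a hypothetical stand-in for one stochastic EA run; the Gaussian draw only simulates run-to-run variation.

```python
import random
import statistics

def run_once(seed):
    # hypothetical stand-in for one EA run returning its best fitness;
    # a seeded Gaussian draw simulates the stochastic outcome
    return random.Random(seed).gauss(0.8, 0.05)

results = [run_once(s) for s in range(30)]          # 30 independent runs
mean = statistics.mean(results)
std = statistics.stdev(results)
print(f"mean={mean:.3f}  std={std:.3f}  n={len(results)}")
```

Reporting the mean alone hides the spread; always report a dispersion measure and the number of runs alongside it.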

- EA experimentation is about comparison, so always do a fair
competition
- use the same amount of resources for the competitors
- try different computational limits (to cope with the turtle/hare effect)
- use the same performance measures

Many different ways to measure, for example:

* Average result in given time

* Average time for given result

* Proportion of runs within % of target

* Best result over *n* runs

* Amount of computing required to reach target in given time with %
confidence

* …

- Elapsed time?
- Depends on computer, network, etc…

- CPU Time?
- Depends on skill of programmer, implementation, etc…

- Generations?
- Difficult to compare when parameters like population size change

- Evaluations?
- Evaluation time could depend on algorithm, e.g. direct vs. indirect representation

Performance measures (off-line)

* **Efficiency** (alg. speed)

* CPU time

* No. of steps, i.e., generated points in the search space

* **Effectivity** (alg. quality)

* Success rate

* Solution quality at termination

“Working” measures (on-line)

* Population distribution (genotypic)

* Fitness distribution (phenotypic)

* Improvements per time unit or per genetic operator

* …

- Number of generated points in the search space = number of fitness evaluations
- (don’t use no. of generations!)

**AES**: average number of evaluations to solution

**SR**: success rate = % of runs finding a solution (individual with acceptable quality / fitness)

**MBF**: mean best fitness at termination, i.e., best per run, mean over a set of runs

- SR ≠ MBF
- Low SR, high MBF: good approximator (more time helps?)
- High SR, low MBF: “Murphy” algorithm
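These three measures can be computed directly from per-run records. A minimal sketch, assuming hypothetical data where each record is `(best_fitness, evaluations_used, solved)`:

```python
# hypothetical per-run records: (best_fitness, evaluations_used, solved)
runs = [(0.99, 4200, True), (0.97, 5100, True), (0.60, 10000, False),
        (1.00, 3800, True), (0.55, 10000, False)]

sr = sum(solved for _, _, solved in runs) / len(runs)    # success rate
mbf = sum(best for best, _, _ in runs) / len(runs)       # mean best fitness
successful = [evals for _, evals, solved in runs if solved]
aes = sum(successful) / len(successful)  # averaged over successful runs only

print(f"SR={sr:.2f}  MBF={mbf:.3f}  AES={aes:.0f}")
# SR=0.60  MBF=0.822  AES=4367
```

Note that AES is defined only over successful runs, which is why a high AES with a low SR can be misleading on its own.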

- Basic rule: use the same computational limit for each competitor
- Allow each EA the same no. of evaluations, but
- Beware of hidden labour, e.g. in heuristic mutation operators
- Beware of possibly fewer evaluations by smart operators

- EA vs. heuristic: allow the same no. of steps:
- Defining “step” is crucial, might imply bias!
- Scale-up comparisons eliminate this bias
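One way to enforce the same evaluation budget for each competitor, and to catch hidden labour, is to count every fitness call through a wrapper. A sketch (the `Budget` class is hypothetical):

```python
class Budget:
    """Wraps a fitness function and enforces a fixed evaluation budget."""
    def __init__(self, fitness, max_evals):
        self.fitness = fitness
        self.max_evals = max_evals
        self.used = 0

    def __call__(self, x):
        if self.used >= self.max_evals:
            raise RuntimeError("evaluation budget exhausted")
        self.used += 1                 # every call is counted, including
        return self.fitness(x)         # "hidden" ones inside smart operators

f = Budget(lambda x: sum(x), max_evals=3)
f([1, 1]); f([1, 0]); f([0, 0])
print(f.used)   # 3
```

Routing all competitors through the same wrapper makes the comparison fair by construction, since heuristic operators cannot evaluate "for free".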

Which algorithm is better?

Why?

When?

Population’s mean (or best) fitness

Comparing algorithms A and B by their scale-up behaviour.

Algorithm B can be considered preferable because its scale-up curve is
less steep.

Comparing algorithms on problem instances with a scalable parameter

More abstract distributions:

Comparing algorithms by histograms of the best found fitness values

Which algorithm is better?

Why?

When?

Comparing algorithms A and B after terminating at times T1 and T2 (for
a minimisation problem).

Algorithm A clearly wins in the first case, while B is better in the
second one.

Averaging can “choke” or filter out interesting information

Overlay of curves can lead to very “cloudy” figures

Best: show the continuous mean, SEM, and SD over the x axis, plus various
randomized controls!
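The suggested presentation, per-generation mean and SEM across runs, can be computed with the standard library. A sketch on synthetic data (the "best fitness so far" curves below are simulated, not from a real EA):

```python
import math
import random
import statistics

rng = random.Random(1)
# 20 synthetic runs, each a 50-generation "best fitness so far" curve
runs = []
for _ in range(20):
    best, curve = 0.0, []
    for _ in range(50):
        best = max(best, rng.random())   # monotone best-so-far trace
        curve.append(best)
    runs.append(curve)

# per-generation mean and SEM across runs (columns of the run matrix)
mean_curve = [statistics.mean(col) for col in zip(*runs)]
sem_curve = [statistics.stdev(col) / math.sqrt(len(col))
             for col in zip(*runs)]
print(f"gen 0: {mean_curve[0]:.2f} ± {sem_curve[0]:.2f}   "
      f"gen 49: {mean_curve[-1]:.2f} ± {sem_curve[-1]:.2f}")
```

Plotting `mean_curve` with a shaded SEM band (e.g. via matplotlib's `fill_between`) avoids both the "choking" of bare averages and the "cloudiness" of overlaid raw curves.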

Like the supplementary figures in this paper of mine:

https://www.nature.com/articles/srep18112

- Algorithms are stochastic, results have element of “luck”
- If a claim is made “Mutation A is better than mutation B”, need to show statistical significance of comparisons
- Fundamental problem: two series of samples (random drawings) from the SAME distribution may have DIFFERENT averages and standard deviations
- Tests can show if the differences are significant or not

Is the new method better?

- Standard deviations supply additional info
- A t-test (and the like) indicates the chance that the values came from the
same underlying distribution (i.e., that the difference is due to random
effects), e.g. a 7% chance in this example.
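A sketch of such a test with the standard library only: Welch's t statistic plus a normal approximation for the two-sided p-value (adequate for roughly 30+ runs; for exact p-values use `scipy.stats.ttest_ind`). The two result samples below are invented for illustration.

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(
        va / len(a) + vb / len(b))

def approx_p_two_sided(t):
    # normal approximation to the t distribution's two-sided tail
    return 2 * (1 - 0.5 * (1 + math.erf(abs(t) / math.sqrt(2))))

# hypothetical best-fitness samples from two mutation variants
old = [0.80, 0.82, 0.79, 0.81, 0.78, 0.83, 0.80, 0.79, 0.82, 0.81]
new = [0.84, 0.86, 0.83, 0.85, 0.87, 0.84, 0.85, 0.83, 0.86, 0.85]
t = welch_t(new, old)
print(f"t={t:.2f}  p≈{approx_p_two_sided(t):.4f}")
```

A small p-value says the difference is unlikely to be a random effect; it says nothing about whether the difference is large enough to matter, which is the job of the descriptive plots above.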

You’re better off with visualizations and distributions…

- Start with descriptive plots.
- If you must resort to statistics, do so only after fully exploring descriptive plots, distributions, and raw data!
- The t-test assumes:
- Data taken from a continuous interval or a close approximation
- Normal distribution
- Similar variances (for small numbers of data points)
- Similarly sized groups of data points

- Other tests:
- Wilcoxon - preferred to t-test where numbers are small or distribution is not known.
- F-test - tests if two samples have different variances.
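When those t-test assumptions are in doubt, a rank-based test is safer. A minimal stdlib-only sketch of the Mann–Whitney U (rank-sum) statistic with a normal-approximation p-value; for small samples or tie corrections, prefer `scipy.stats.mannwhitneyu`. The data below are hypothetical.

```python
import math

def mann_whitney_u(a, b):
    """U statistic: number of (a, b) pairs where a wins; ties count 0.5."""
    return sum((x > y) + 0.5 * (x == y) for x in a for y in b)

def rank_sum_p(a, b):
    # two-sided p via the normal approximation to U (no tie correction)
    n, m = len(a), len(b)
    u = mann_whitney_u(a, b)
    z = (u - n * m / 2) / math.sqrt(n * m * (n + m + 1) / 12)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# hypothetical best-fitness samples from two algorithms
old = [0.80, 0.82, 0.79, 0.81, 0.78]
new = [0.84, 0.86, 0.83, 0.85, 0.87]
print(mann_whitney_u(new, old), f"p≈{rank_sum_p(new, old):.3f}")
```

Because it only uses ranks, this test needs no normality assumption, which is exactly why it is preferred when the distribution of EA results is unknown.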

- I invented myEA for problem X
- Looked and found 3 other EAs and a traditional benchmark heuristic for problem X in the literature
- Asked myself when and why is myEA better

- Found/made a problem instance generator for problem X with 2
  parameters: *n* (problem size) and *k* (some problem-specific indicator)

- Selected 5 values for *k* and 5 values for *n*
- Generated 100 problem instances for all combinations
- Executed all alg’s on each instance 100 times (benchmark was also stochastic)
- Recorded AES, SR, MBF values w/ same comp. limit
- (AES for benchmark?)
- Put my program code and the instances on the Web

- Arranged results “in 3D” (*n, k*) + performance
- (with special attention to the effect of *n*, as for scale-up)
- Assessed statistical significance of results
- Found the niche for my_EA:
- Weak in … cases, strong in … cases, comparable otherwise
- Thereby I answered the “when question”

- Analyzed the specific features and the niches of each algorithm thus answering the “why question”
- Learned a lot about problem X and its solvers
- Achieved generalizable results, or at least claims with well-identified scope based on solid data
- Facilitated the reproduction of my results, and thereby further research

- Be organized
- Decide what you want & define appropriate measures
- Choose test problems carefully
- Make an experiment plan (estimate time when possible)
- Perform sufficient number of runs
- Keep all experimental data (never throw away anything)
- Use good statistics (“standard” tools from Web, MS, R)
- Present results well (figures, graphs, tables, …)
- Watch the scope of your claims
- Aim at generalizable results
- Publish code for reproducibility of results (if applicable)
- Publish data for external validation (open science)