Password for the Vimeo videos is in Zulip chat.
https://vimeo.com/674616757
Tip: If anyone want to speed up the lecture videos a little,
inspect the page, go to the browser console, and paste this in:
document.querySelector('video').playbackRate = 1.2
Add discussion questions
"**Wherever there is an adaptation that is highly successful in a broad range of similar environments, it is apt to emerge again and again, independently - the phenomenon known in biology as convergent evolution. I call these adaptations 'good tricks.'"**
- Daniel Dennett (A thought-leader in evolution, and a great writer)
A perspective-changing read by Dennet and Levin:
https://aeon.co/essays/how-to-understand-cells-tissues-and-organisms-as-agents-with-agendas
How to construct such a tree structure?
* DNA sequence databases were first assembled at Los Alamos National
Laboratory (LANL), New Mexico, by Walter Goad and colleagues in the
GenBank database and at the European Molecular Biology Laboratory (EMBL)
in Heidelberg, Germany.
* Initially, a sequence entry included a computer filename, and DNA or
protein sequence files.
* These were eventually expanded to include much more information about
the sequence, such as function, mutations, encoded proteins, regulatory
sites, and references.
* This information was then placed, along with the sequence, into a
database format that could be readily searched for many types of
information.
]
* An important step in providing sequence database access was the
development of Web pages that allow queries to be made of the major
sequence databases (GenBank, EMBL, etc.).
* An early example of this technology at NCBI was a menu-driven program
called GEN-INFO developed by D. Benson, D. Lipman, and colleagues.
* This program searched rapidly through previously indexed sequence
databases for entries that matched a biologist’s query.
* Subsequently, a derivative program called ENTREZ with a simple
window-based interface, and eventually a Web-based interface, was
developed at NCBI.
* The idea behind these programs was to provide an easy-to-use interface
with a flexible search procedure to the sequence databases.
* Because DNA sequencing involves ordering a set of peaks (A, G, C, or
T) on a sequencing gel, the process can be quite error-prone, depending
on the quality of the data.
* As more DNA sequences became available in the late 1970s, interest
also increased in developing computer programs to analyze these
sequences in various ways.
* In 1982 and 1984, Nucleic Acids Research published two special issues
devoted to the application of computers for sequence analysis, including
programs for large mainframe computers down to the then-new
microcomputers.
* In 1970, A.J. Gibbs and G.A. McIntyre (1970) described a new method
for comparing two amino acid and nucleotide sequences in which a graph
was drawn with one sequence written across the page and the other down
the left-hand side.
* Whenever the same letter appeared in both sequences, a dot was placed
at the intersection of the corresponding sequence positions on the
graph
Various methods for aligning entire matching segments, small matching adjacent segments, and multiple variable-length segments.
Global versus local alignment:
Also, multiple sequence alignment:
Why is this useful?
* Methods for predicting RNA secondary structure on computers were also
developed at an early time.
* For example, if the complement of a sequence on an RNA molecule is
repeated down the sequence in the opposite chemical direction, the
regions may base-pair and form a hairpin structure:
* There are a large number of proteins whose sequences are known, but
very few whose structures have been solved.
* Solving protein structures involves the time-consuming and highly
specialized procedures of X-ray crystallography and nuclear magnetic
resonance (NMR).
* Consequently, there is much interest in trying to predict the
structure of a protein, given its sequence.
* Early attempts were made at predicting protein structure from
sequence.
+++++++++++++++++++ Cahoot-05-1
Variations within a family of related nucleic acid or protein
sequences provide a source of information for evolutionary
biology,
enabling the discovery of relationships between species in an
objectively quantifiable manner.
It’s not just species that one can compare,
but also proteins within an organism,
which can be duplicated within an organism,
and then re-purposed for new, independent functions.
The first genome database, was called ACEDB (a C. elegans
database),
and the methods to access this database were developed by Mike Cherry
and colleagues (Cherry and Cartinhour 1993).
This database was accessible through the internet and allowed retrieval
of sequences,
information about genes and mutants, investigator addresses, and
references.
Similar databases were subsequently developed using the same methods for
A. thaliana and S. cerevisiae.
This is C. elegans:
And then the field of bioinformatics exploded
“… from 1982 to the present, the number of bases in GenBank has doubled
approximately every 18 months”. As of 15 August 2017, GenBank release
221.0 has 203,180,606 loci, 240,343,378,258 bases, from 203,180,606
reported sequences.
Venn of the Nexus of many fields
Contrasted to data science
Same job, way worse pay…
Slightly more detail
Even more detail
A different perspective
AI methods in Bioinformatics
https://en.wikipedia.org/wiki/Computational_epidemiology
https://en.wikipedia.org/wiki/Mathematical_modelling_of_infectious_disease
https://en.wikipedia.org/wiki/Compartmental_models_in_epidemiology
https://en.wikipedia.org/wiki/Computational_biology
https://en.wikipedia.org/wiki/Bioinformatics
https://en.wikipedia.org/wiki/Sequence_assembly
https://en.wikipedia.org/wiki/Sequence_analysis
https://en.wikipedia.org/wiki/Comparative_genomics
https://en.wikipedia.org/wiki/Health_informatics
https://en.wikipedia.org/wiki/Imaging_informatics
https://en.wikipedia.org/wiki/Neuroinformatics
https://en.wikipedia.org/wiki/Computational_neuroscience
https://en.wikipedia.org/wiki/Modelling_biological_systems
https://en.wikipedia.org/wiki/Computational_phylogenetics
https://en.wikipedia.org/wiki/Computational_genomics
https://en.wikipedia.org/wiki/Biodiversity_informatics
https://en.wikipedia.org/wiki/Biological_network
https://en.wikipedia.org/wiki/Structural_bioinformatics
https://en.wikipedia.org/wiki/Ecosystem_model
https://en.wikipedia.org/wiki/Models_of_DNA_evolution
https://en.wikipedia.org/wiki/Translational_bioinformatics
https://en.wikipedia.org/wiki/Gene_ontology
https://en.wikipedia.org/wiki/Gene_prediction
https://en.wikipedia.org/wiki/Bioimage_informatics
https://en.wikipedia.org/wiki/Protein_structure_prediction
https://en.wikipedia.org/wiki/Computational_anatomy
https://en.wikipedia.org/wiki/Cellular_model
https://en.wikipedia.org/wiki/Computational_biology
https://en.wikipedia.org/wiki/Bioinformatics
https://en.wikipedia.org/wiki/Sequence_assembly
https://en.wikipedia.org/wiki/Sequence_analysis
https://en.wikipedia.org/wiki/Comparative_genomics
https://en.wikipedia.org/wiki/Health_informatics
https://en.wikipedia.org/wiki/Imaging_informatics
https://en.wikipedia.org/wiki/Neuroinformatics
https://en.wikipedia.org/wiki/Computational_neuroscience
https://en.wikipedia.org/wiki/Modelling_biological_systems
https://en.wikipedia.org/wiki/Computational_phylogenetics
https://en.wikipedia.org/wiki/Computational_genomics
https://en.wikipedia.org/wiki/Biodiversity_informatics
https://en.wikipedia.org/wiki/Structural_bioinformatics
https://en.wikipedia.org/wiki/Ecosystem_model
https://en.wikipedia.org/wiki/Models_of_DNA_evolution
https://en.wikipedia.org/wiki/Translational_bioinformatics
https://en.wikipedia.org/wiki/Gene_ontology
https://en.wikipedia.org/wiki/Gene_prediction
https://en.wikipedia.org/wiki/Bioimage_informatics
https://en.wikipedia.org/wiki/Protein_structure_prediction
https://en.wikipedia.org/wiki/Computational_anatomy
https://en.wikipedia.org/wiki/Cellular_model
https://www.mkbergman.com/374/an-intrepid-guide-to-ontologies/
In computer science and information science,
an ontology is a formal naming and definition of:
the types, properties, and interrelationships of the entities,
that really exist in a particular domain of discourse.
An upper ontology (or foundation ontology) is a model
of the common objects that are generally applicable across a wide range
of domain ontologies.
It usually employs a core glossary that contains the terms and
associated object descriptions as they are used in various relevant
domain sets, for example, the Basic Formal Ontology (BFO)
Domain ontology: Open Biomedical Ontologies (abbreviated OBO; formerly Open Biological Ontologies) is an effort to create controlled vocabularies for shared use across different biological and medical domains. As of 2006, OBO forms part of the resources of the U.S. National Center for Biomedical Ontology where it will form a central element of the NCBO’s BioPortal.
The Sequence Ontology (SO) at http://www.sequenceontology.org is a
collaborative ontology project for the definition of sequence features
used in biological sequence annotation.
For example, an X element combinatorial repeat is a repeat region
located between the X element and the telomere or adjacent Y’
element.
The Gene Ontology (GO) is a controlled vocabulary that connects each
gene to one or more functions.
http://geneontology.org/
The ontology is intended to categorize gene products rather than the
genes themselves.
Different products of the same gene may play very different roles,
and labelling and treating all of these functions under the same gene
name may (and often does) lead to confusion.
++++++++++++ Cahoot-05-2