1 05-BioInfoBasics


1.1 Audio-recording

1.2 Opening thought

**"Wherever there is an adaptation that is highly successful in a broad range of similar environments, it is apt to emerge again and again, independently - the phenomenon known in biology as convergent evolution. I call these adaptations 'good tricks.'"**
    - Daniel Dennett (a thought leader in evolution and a great writer)

A perspective-changing read by Dennett and Levin:
https://aeon.co/essays/how-to-understand-cells-tissues-and-organisms-as-agents-with-agendas

1.3 Protein sequencing

05-BioInfoBasics/protein-seqence-example.png

1.3.1 First sequences to be databased were proteins

How to construct such a tree structure?

1.4 DNA sequence databases

05-BioInfoBasics/databases1.webp
* DNA sequence databases were first assembled at Los Alamos National Laboratory (LANL), New Mexico, by Walter Goad and colleagues in the GenBank database and at the European Molecular Biology Laboratory (EMBL) in Heidelberg, Germany.
* Initially, a sequence entry included a computer filename, and DNA or protein sequence files.
* These were eventually expanded to include much more information about the sequence, such as function, mutations, encoded proteins, regulatory sites, and references.
* This information was then placed, along with the sequence, into a database format that could be readily searched for many types of information.
05-BioInfoBasics/databases2.png
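
To make the idea of an annotated, searchable entry concrete, here is a minimal sketch in Python. The class name, field names, and records are all invented for illustration; they mirror the kinds of information listed above and are not the actual GenBank or EMBL record format.

```python
from dataclasses import dataclass, field


@dataclass
class SequenceEntry:
    """One record in a toy sequence database (field names are illustrative)."""
    filename: str
    sequence: str
    function: str = ""
    mutations: list[str] = field(default_factory=list)
    encoded_proteins: list[str] = field(default_factory=list)
    regulatory_sites: list[str] = field(default_factory=list)
    references: list[str] = field(default_factory=list)


def search(entries, keyword):
    """Return entries whose annotation fields mention the keyword (case-insensitive)."""
    keyword = keyword.lower()
    hits = []
    for entry in entries:
        annotations = [entry.function, *entry.mutations, *entry.encoded_proteins,
                       *entry.regulatory_sites, *entry.references]
        if any(keyword in text.lower() for text in annotations):
            hits.append(entry)
    return hits


# Made-up records, for illustration only.
toy_db = [
    SequenceEntry("seq001.txt", "ATGGCCATTG", function="hypothetical kinase"),
    SequenceEntry("seq002.txt", "ATGTTTAAAC", function="hypothetical transporter",
                  regulatory_sites=["TATA box at -30"]),
]

print([e.filename for e in search(toy_db, "kinase")])
```

The point of the sketch is only that once annotations sit next to the sequence in a structured record, a single query can search across all of them.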

1.5 Sequence retrieval from public databases

05-BioInfoBasics/entrez.png
* An important step in providing sequence database access was the development of Web pages that allow queries to be made of the major sequence databases (GenBank, EMBL, etc.).
* An early example of this technology at NCBI was a menu-driven program called GEN-INFO developed by D. Benson, D. Lipman, and colleagues.
* This program searched rapidly through previously indexed sequence databases for entries that matched a biologist’s query.
* Subsequently, a derivative program called ENTREZ with a simple window-based interface, and eventually a Web-based interface, was developed at NCBI.
* The idea behind these programs was to provide an easy-to-use interface with a flexible search procedure to the sequence databases.
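
Today the same databases can also be queried from a script. The sketch below assumes NCBI's E-utilities web service and its EFetch endpoint, neither of which is covered in these notes. The helper names `efetch_url` and `fetch_fasta` are made up for this example, and the download step needs network access.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the NCBI E-utilities service (an assumption of this sketch).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(accession, db="nucleotide", rettype="fasta"):
    """Build an EFetch URL that asks for one record as plain text."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


def fetch_fasta(accession, db="nucleotide"):
    """Download one record as FASTA text (requires network access)."""
    with urlopen(efetch_url(accession, db=db)) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_fasta` with an accession string of your choice should return the record as FASTA text. Building the URL in its own function makes it easy to inspect the query before sending it.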

1.6 Sequence analysis software

05-BioInfoBasics/dna_sequencing_workflow.jpg
* Because DNA sequencing involves ordering a set of peaks (A, G, C, or T) on a sequencing gel, the process can be quite error-prone, depending on the quality of the data.
* As more DNA sequences became available in the late 1970s, interest also increased in developing computer programs to analyze these sequences in various ways.
* In 1982 and 1984, Nucleic Acids Research published two special issues devoted to the application of computers for sequence analysis, including programs for large mainframe computers down to the then-new microcomputers.

1.7 Dot matrix method for comparing sequences

05-BioInfoBasics/hist01.png
* A.J. Gibbs and G.A. McIntyre (1970) described a new method for comparing two amino acid or nucleotide sequences, in which a graph was drawn with one sequence written across the page and the other down the left-hand side.
* Whenever the same letter appeared in both sequences, a dot was placed at the intersection of the corresponding sequence positions on the graph.
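
The dot matrix idea fits in a few lines of Python. This is a minimal sketch of the letter-by-letter comparison described above, with no window size or filtering. The function names and the example sequences are made up for illustration.

```python
def dot_matrix(seq_across, seq_down):
    """Return one string per row: '*' where the letters match, '.' elsewhere."""
    return ["".join("*" if a == b else "." for a in seq_across) for b in seq_down]


def render(seq_across, seq_down):
    """Format the matrix with one sequence across the top and one down the side."""
    lines = ["  " + seq_across]
    for letter, row in zip(seq_down, dot_matrix(seq_across, seq_down)):
        lines.append(f"{letter} {row}")
    return "\n".join(lines)


print(render("GATTACA", "GATCACA"))
```

Runs of dots along a diagonal mark stretches where the two sequences agree, which is what the eye picks out in a dot plot.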

1.8 Alignment of sequences, global, local, and multiple

Various methods exist for aligning sequences over their entire length (global), over shorter matching segments (local), and across several variable-length sequences at once (multiple).

Global versus local alignment:
05-BioInfoBasics/Global-alignment-vs-Local-alignment.png
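
The difference between the two can be sketched with one dynamic-programming routine. The names Needleman-Wunsch (global) and Smith-Waterman (local) are the classic algorithms for these tasks and do not come from these notes. The scoring values (match +1, mismatch -1, gap -1) are arbitrary assumptions, and the sketch returns only the best score, not the alignment itself.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Best alignment score by dynamic programming.

    local=False: global alignment (Needleman-Wunsch style), end to end.
    local=True:  local alignment (Smith-Waterman style), best-matching segment.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment pays for gaps at the start of either sequence.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                # Local alignment may restart anywhere, so scores never drop below 0.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


# Two made-up sequences sharing a core but with unrelated flanks.
x, y = "TTTGATTACATTT", "CCCGATTACACCC"
print("global:", alignment_score(x, y))
print("local: ", alignment_score(x, y, local=True))
```

With unrelated flanks, the global score is dragged down by the forced mismatches at both ends, while the local score reflects only the shared core.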
Also, multiple sequence alignment:
05-BioInfoBasics/alignment-types.jpg
Why is this useful?
05-BioInfoBasics/uses-of-sequence-alignment-l.jpg

1.9 RNA and protein structure

1.9.1 Prediction of RNA secondary structure

05-BioInfoBasics/hist02.png
* Methods for predicting RNA secondary structure on computers were also developed early on.
* For example, if the complement of a sequence on an RNA molecule is repeated down the sequence in the opposite chemical direction, the regions may base-pair and form a hairpin structure.
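
The hairpin condition above can be checked directly by looking for a stretch whose reverse complement appears further along the same strand. This is a minimal sketch that assumes only Watson-Crick pairs (no G-U wobble) and no energy model. The function names, default stem length, minimum loop size, and example sequence are all made up for illustration.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Complement each base and read the result in the opposite direction."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def find_hairpin_stems(rna, stem_len=4, min_loop=3):
    """Return (start_5p, start_3p) index pairs of candidate hairpin stems.

    A candidate is a stretch of `stem_len` bases whose reverse complement
    appears at least `min_loop` bases further along the same strand.
    """
    stems = []
    for i in range(len(rna) - 2 * stem_len - min_loop + 1):
        arm = rna[i:i + stem_len]
        partner = reverse_complement(arm)
        j = rna.find(partner, i + stem_len + min_loop)
        while j != -1:
            stems.append((i, j))
            j = rna.find(partner, j + 1)
    return stems


print(find_hairpin_stems("GGGCAAAAGCCC"))
```

Real secondary-structure predictors score many competing structures by estimated stability. This sketch only flags places where a hairpin is possible.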

1.9.2 Prediction of protein structure

05-BioInfoBasics/hist03.png
* There are a large number of proteins whose sequences are known, but very few whose structures have been solved.
* Solving protein structures involves the time-consuming and highly specialized procedures of X-ray crystallography and nuclear magnetic resonance (NMR).
* Consequently, there is much interest in trying to predict the structure of a protein, given its sequence.
* Early attempts were made at predicting protein structure from sequence.

1.10 Evolutionary relationships

05-BioInfoBasics/tree.jpg

+++++++++++++++++++ Cahoot-05-1

1.10.1 Protein, DNA, and RNA sequences

Variations within a family of related nucleic acid or protein sequences provide a source of information for evolutionary biology,
enabling the discovery of relationships between species in an objectively quantifiable manner.
05-BioInfoBasics/protein_tree.jpg
It’s not just species that one can compare,
but also proteins within an organism,
which can be duplicated within an organism,
and then re-purposed for new, independent functions.
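
One simple way to quantify variation within a sequence family is the fraction of aligned positions at which two sequences differ, often called the p-distance. The sketch below assumes the sequences are already aligned to equal length. The sequences and names are invented, and p-distance is just one of many possible measures, not necessarily the one used to build the trees shown here.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of aligned positions at which two equal-length sequences differ."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_table(aligned):
    """Map each unordered pair of names to its p-distance."""
    return {(m, n): p_distance(aligned[m], aligned[n])
            for m, n in combinations(aligned, 2)}


# Made-up aligned sequences, for illustration only.
toy = {"seqA": "ACGTACGTAC", "seqB": "ACGTACGTTC", "seqC": "ACCTAGGTTC"}

for pair, dist in distance_table(toy).items():
    print(pair, dist)
```

A table of pairwise distances like this is the usual starting point for distance-based tree building: pairs with small distances are grouped first.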

1.11 Genome databases

05-BioInfoBasics/Pic_1-The-Human-Genome-1.jpg

1.11.1 The first genome database

The first genome database was called ACEDB (A C. elegans Database),
and the methods to access this database were developed by Mike Cherry and colleagues (Cherry and Cartinhour 1993).
This database was accessible through the internet and allowed retrieval of sequences,
information about genes and mutants, investigator addresses, and references.
Similar databases were subsequently developed using the same methods for A. thaliana and S. cerevisiae.
This is C. elegans:
05-BioInfoBasics/celegansacedb.jpg

1.12 Boom

And then the field of bioinformatics exploded.
05-BioInfoBasics/genbank.png
“… from 1982 to the present, the number of bases in GenBank has doubled approximately every 18 months.” As of 15 August 2017, GenBank release 221.0 held 240,343,378,258 bases from 203,180,606 reported sequences (loci).
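
To get a feel for what an 18-month doubling time implies, the arithmetic can be sketched as follows. Only the doubling time comes from the quote above. The function names are made up, and the model assumes perfectly steady exponential growth, which real data only approximate.

```python
import math

DOUBLING_TIME_YEARS = 1.5  # "approximately every 18 months", from the quote above


def growth_factor(years, doubling_time=DOUBLING_TIME_YEARS):
    """How many times larger the database is after `years` of steady doubling."""
    return 2 ** (years / doubling_time)


def years_to_grow(factor, doubling_time=DOUBLING_TIME_YEARS):
    """How long steady doubling takes to grow the database by `factor`."""
    return doubling_time * math.log2(factor)


print(growth_factor(15))    # fold increase over 15 years
print(years_to_grow(1000))  # years needed for a 1000-fold increase
```

Under this idealized model, 15 years is ten doublings, which is roughly a thousandfold increase.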

1.13 Bioinformatics Today

A Venn diagram of the nexus of many fields

05-BioInfoBasics/image5.png
05-BioInfoBasics/image4.png

Contrasted to data science
05-BioInfoBasics/image2.png
Same job, way worse pay…

Slightly more detail
05-BioInfoBasics/image3.png

Even more detail
05-BioInfoBasics/image1.png

A different perspective
05-BioInfoBasics/image6.png

AI methods in Bioinformatics
05-BioInfoBasics/image7.png

1.13.1 Sub-fields

05-BioInfoBasics/bioinformatics_diagram1-1024x1011.png

https://en.wikipedia.org/wiki/Computational_epidemiology
https://en.wikipedia.org/wiki/Mathematical_modelling_of_infectious_disease
https://en.wikipedia.org/wiki/Compartmental_models_in_epidemiology
https://en.wikipedia.org/wiki/Computational_biology
https://en.wikipedia.org/wiki/Bioinformatics
https://en.wikipedia.org/wiki/Sequence_assembly
https://en.wikipedia.org/wiki/Sequence_analysis
https://en.wikipedia.org/wiki/Comparative_genomics
https://en.wikipedia.org/wiki/Health_informatics
https://en.wikipedia.org/wiki/Imaging_informatics
https://en.wikipedia.org/wiki/Neuroinformatics
https://en.wikipedia.org/wiki/Computational_neuroscience
https://en.wikipedia.org/wiki/Modelling_biological_systems
https://en.wikipedia.org/wiki/Computational_phylogenetics
https://en.wikipedia.org/wiki/Computational_genomics
https://en.wikipedia.org/wiki/Biodiversity_informatics
https://en.wikipedia.org/wiki/Biological_network
https://en.wikipedia.org/wiki/Structural_bioinformatics
https://en.wikipedia.org/wiki/Ecosystem_model
https://en.wikipedia.org/wiki/Models_of_DNA_evolution
https://en.wikipedia.org/wiki/Translational_bioinformatics
https://en.wikipedia.org/wiki/Gene_ontology
https://en.wikipedia.org/wiki/Gene_prediction
https://en.wikipedia.org/wiki/Bioimage_informatics
https://en.wikipedia.org/wiki/Protein_structure_prediction
https://en.wikipedia.org/wiki/Computational_anatomy
https://en.wikipedia.org/wiki/Cellular_model

1.13.2 Ontologies

https://www.mkbergman.com/374/an-intrepid-guide-to-ontologies/
05-BioInfoBasics/ontology_070501d_SemanticSpectrum.png
In computer science and information science,
an ontology is a formal naming and definition of
the types, properties, and interrelationships of the entities
that exist in a particular domain of discourse.
05-BioInfoBasics/ontology_070501b_OntologyLevels.png
An upper ontology (or foundation ontology) is a model of the common objects that are generally applicable across a wide range of domain ontologies.
It usually employs a core glossary that contains the terms and associated object descriptions as they are used in various relevant domain sets. One example is the Basic Formal Ontology (BFO).

Domain ontology: Open Biomedical Ontologies (abbreviated OBO; formerly Open Biological Ontologies) is an effort to create controlled vocabularies for shared use across different biological and medical domains. As of 2006, OBO forms part of the resources of the U.S. National Center for Biomedical Ontology (NCBO), where it is intended to form a central element of the NCBO’s BioPortal.

1.13.2.1 Sequence ontology

05-BioInfoBasics/seq_ont.png
The Sequence Ontology (SO) at http://www.sequenceontology.org is a collaborative ontology project for the definition of sequence features used in biological sequence annotation.
For example, an X element combinatorial repeat is a repeat region located between the X element and the telomere or adjacent Y’ element.

1.13.2.2 Gene ontology

05-BioInfoBasics/gene_ont.png
The Gene Ontology (GO) is a controlled vocabulary that connects each gene to one or more functions.
http://geneontology.org/
The ontology is intended to categorize gene products rather than the genes themselves.
Different products of the same gene may play very different roles,
and labelling and treating all of these functions under the same gene name may (and often does) lead to confusion.
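
A small sketch can show why annotating gene products, rather than genes, avoids this confusion. The term labels below loosely echo GO-style names, but the hierarchy is a simplified toy and not the real GO graph. The gene, its products, and all function names are invented for illustration.

```python
# Toy ontology: each term lists its parent terms ("is_a" links).
# A simplified illustration, not the real GO hierarchy.
IS_A = {
    "kinase activity": ["catalytic activity"],
    "catalytic activity": ["molecular function"],
    "DNA binding": ["binding"],
    "binding": ["molecular function"],
    "molecular function": [],
}

# Annotations attach to gene products: two products of one hypothetical
# gene can carry different terms.
ANNOTATIONS = {
    "geneX-isoform1": ["kinase activity"],
    "geneX-isoform2": ["DNA binding"],
}


def ancestors(term):
    """All terms reachable from `term` by following is_a links upward."""
    seen = set()
    stack = list(IS_A[term])
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(IS_A[parent])
    return seen


def implied_terms(product):
    """Direct annotations plus every ancestor term they imply."""
    terms = set()
    for term in ANNOTATIONS[product]:
        terms.add(term)
        terms |= ancestors(term)
    return terms


for product in ANNOTATIONS:
    print(product, sorted(implied_terms(product)))
```

In this toy, the two products share only the root term. Lumping both under the single name "geneX" would hide that they do quite different things.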

++++++++++++ Cahoot-05-2

1.14 Databases and data sources