Tip: If you want to speed up the lecture videos a little, inspect
the page, open the browser console, and paste this in:
document.querySelector('video').playbackRate = 1.2
1.2 What is a data-oriented “lab”?
in bioinformatics or computational biology
…or a lab in any field of -informatics or computational-
My personal experience felt something like this:
1.3 How to explore data?
Using a real data-focused IDE!
https://www.spyder-ide.org/
Actually exploring data in a real IDE like this (above) stands in
contrast to the later documenting and publishing you do with different
tools (below).
1.4 What is a data-oriented “lab notebook”?
https://en.wikipedia.org/wiki/Lab_notebook
* Lab notebooks are a real thing, and scientists actually keep them!
* They are used for documentation, which feeds into publication.
1.5 Consistency and publication!
The classic paper model:
How to increase consistency, transparency, and computability in
scientific publishing?
1.5.1 How to publish experimental data exploration code?
Goal: A communicative, transparent, easy-to-read, and actually
executable publication.
1.5.2 Tools
A variety of data-focused documentation and publication tools.
1.5.2.1 Jupyter
https://jupyter.org/ notebooks made this approach popular.
A type of lab notebook for data analysis that enabled more easily
reproducible science!
Jupyter notebooks are NOT great for coding, implementation,
early-stage programming, NOR for real data exploration (in my opinion),
but they were a great start for improving the culture of transparent
publication. They are nicer for more polished documentation stages of
coding data analysis.
Despite their widespread popularity, Jupyter and the .ipynb format are
both buggy and poorly programmed…
Thanks to Jupyter for kickstarting a trend, and for the great
high-level design/concept, but the back-end implementation was
technically bad.
Thus, I no longer recommend using .ipynb as your
primary format (perhaps a secondary format; see below).
Install
* System repos:
* There are multiple packages, named differently across distributions, so
search:
* $ sudo dnf/apt/zypper search jupyter
* For example, install some, or all, of the results of the previous
search:
* $ sudo dnf/apt/zypper install jupyter-*
If you want a newer version, don’t have sudo access, or prefer not to
use system repositories, then use:
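The notes don’t name the alternative tool here; a common no-sudo route (my assumption, not from the original) is a Python virtual environment plus pip:

```shell
# Sketch, assuming python3 with the venv module is available; the
# environment path and the "jupyterlab" package name are my choices,
# not from the notes.
python3 -m venv "$HOME/jupyter-venv"      # isolated, no sudo required
"$HOME/jupyter-venv/bin/pip" --version    # sanity-check the environment
# Then, with network access:
#   "$HOME/jupyter-venv/bin/pip" install jupyterlab
#   "$HOME/jupyter-venv/bin/jupyter" lab
```

This keeps the newer version entirely in your home directory, so it never conflicts with system packages.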
As a computer scientist into security, I am very much an enthusiastic
fan of virtual machines in general, and our book for biologists who
want to do computation better ( https://www.biostarhandbook.com/ ) also
suggests VMs as a way to bundle up an environment for replicability:
* Do a real science study,
* write the scripts that go from raw data to final figures (with no
human input!!),
* put the data and scripts in a VM,
* run them all,
* generate the figures and paper, etc.
Then right when you finish, export the VM as an OVA/snapshot (e.g.,
https://www.virtualbox.org/), and publish it all.
Afterwards, there is no question about what produced the results,
and anyone can do so.
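The export step above can be sketched with VirtualBox’s command-line tool (this is a command fragment, not a runnable script; the VM name and output filename are placeholders I made up):

```shell
# Requires VirtualBox; "paper-vm" and "paper-vm.ova" are placeholder names.
VBoxManage list vms                           # find your VM's exact name
VBoxManage export "paper-vm" -o paper-vm.ova  # snapshot the whole environment
# Publish paper-vm.ova alongside the paper; readers re-import it with:
#   VBoxManage import paper-vm.ova
```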
This is both good proactive, future-proof science and a good
defensive strategy if your results are later replicated by another
researcher.
1.5.4.2 Online virtualization of shared code, data, and environment
Publishing a git repository of your work has become the gold
standard of public code/data.
Sites like https://mybinder.org/ enable you to run the Jupyter
notebooks in an arbitrary Git repository on someone else’s computer,
while installing a pre-specified dependency environment.
One problem is that you can’t securely perform any push operations
back to the Git repository you are working on without disclosing your
password to the owner of the Binder server…
Future-proof dependency handling is much less reliable than with a
VM.
Ask:
* Is this as future-proof and standalone reproducible as a VM?
* Is it as easy as a VM?
* Is it as likely to be reproduced?