
- Password for the Vimeo videos is in Zulip chat.
- Lecture 1 - (Basic matplotlib, matrices, images, models): https://vimeo.com/537862669
- Lecture 2 - (numpy and matrix-based math): https://vimeo.com/478824328
- Lecture 3 - (data, stats, pandas): https://vimeo.com/539803723
- Lecture 4 - (jupyter, where to go from here?): https://vimeo.com/540808006

- Tip: If anyone wants to speed up the lecture videos a little, inspect
the page, go to the browser console, and paste this in:

`document.querySelector('video').playbackRate = 1.2`

* Why should you care about crossword?

* Why should you care about matrices?

* What did the first pre-computers process?

* What is a List?

* Just a special case of a matrix

* THE universal user interface (UI)

* What is an image?

* ../../Bioinformatics/Content/20-ImageBasics.html
(actually review the top of this page.html)

* 22-DataVis/data_00a_matplotlib.py
(pudb3)

* How do you detect a face?

* How do you detect a simple shape?

* How do you detect a line?

* How do you detect an edge?

* 22-DataVis/data_00b_images.py
(spyder)

* How do I do computer vision, or machine learning face
recognition?

* How does one store/model a 3D environment, like a realistic game
map?

* How does one image a brain, a brain over time?

* How does one keep abstract time series data?

* How does one keep abstract experimental data?

* How does one simulate a game or real-world conflict over a space?
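The questions above about lists, images, and edges all come back to the same data structure. Here is a minimal sketch, assuming numpy is installed; the array names and the tiny synthetic image are made up for illustration and are not taken from the course scripts.

```python
import numpy as np

# A list is a matrix with a single axis.
shopping = np.array([3, 1, 4, 1, 5])
print(shopping.ndim)   # number of axes

# A grayscale image is a matrix with two axes (rows, columns).
# Here: dark left half (0.0), bright right half (1.0).
image = np.zeros((5, 6))
image[:, 3:] = 1.0
print(image.shape)

# A color image adds a third axis for the red, green, and blue channels.
color = np.zeros((5, 6, 3))
color[:, :, 0] = image   # put the pattern in the red channel only
print(color.shape)

# An edge is a place where brightness changes sharply between neighbors.
# np.diff subtracts each pixel from the one to its right.
change = np.abs(np.diff(image, axis=1))
edge_columns = np.flatnonzero(change.max(axis=0) > 0.5)
print(edge_columns)    # index of the column just left of each jump
```

Real edge detectors smooth the image first and look in every direction, but the core idea is the same: compare neighboring entries of a matrix.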

Matrices are deeply intertwined with computation!

Welcome to the MATRIX!

+++++++++++ Cahoot-22a.1

https://mst.instructure.com/courses/58101/quizzes/57373

- https://numpy.org/ is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
- https://en.wikipedia.org/wiki/NumPy
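As a quick taste of what "high-level mathematical functions to operate on these arrays" means in practice, here is a small sketch. The arrays `a` and `b` are arbitrary examples, not the contents of data_01_numpy.py.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[10, 20], [30, 40]])

elementwise_sum = a + b         # adds matching entries, no loop needed
doubled = a * 2                 # multiplies every entry by 2
matrix_product = a @ b          # true matrix multiplication
column_means = a.mean(axis=0)   # average down each column

print(elementwise_sum)
print(doubled)
print(matrix_product)
print(column_means)
```

Note that `*` is element-wise while `@` is matrix multiplication; mixing them up is a common early mistake.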

Step the code!

* 22-DataVis/data_01_numpy.py

Show these links, but don’t go over them in detail:

* https://scipy-lectures.org/intro/numpy/index.html

* https://numpy.org/doc/stable/user/index.html

* https://numpy.org/doc/stable/user/absolute_beginners.html

* https://numpy.org/doc/stable/user/quickstart.html

* https://numpy.org/doc/stable/user/basics.html

If you are interested in computational math, modeling, physics, AI, or
machine learning, I highly suggest you read the above tutorials in
full.

++++++ Cahoot-22b.1

https://mst.instructure.com/courses/58101/quizzes/57426

```
"In data science, 85 percent of time spent is preparing data, 10 percent of time is spent complaining about the need to prepare data, and 5 percent of the time is actually analyzing or modeling data..."
**"Datasets are like people... interrogate them enough, and they will tell you whatever you want to hear... whether or not it is true."**
```

The state of data analysis in many domains of science really is this dark. A useful rule of thumb:

If you can’t see the pattern, with simple descriptive statistics and
graphs, the pattern is probably not real!

https://cacm.acm.org/magazines/2019/9/238959-an-inability-to-reproduce/fulltext

http://blogs.nature.com/news/2012/12/is-the-scientific-literature-self-correcting.html

Can’t find a taxpayer-funded publication because it sits behind a for-profit paywall? Just read these articles from the journal Science:

- http://www.sciencemag.org/news/2016/04/whos-downloading-pirated-papers-everyone
- http://www.sciencemag.org/news/2016/04/alexandra-elbakyan-founded-sci-hub-thwart-journal-paywalls?IntCmp=scihub-1-11
- Hint: Can you find the .onion?

J. P. A. Ioannidis, “Why most published research findings are false,” PLoS Med, vol. 2, no. 8, p. e124, 2005.

Elaborate on this one in class

Open Science Collaboration, “Estimating the reproducibility of psychological science,” Science, vol. 349, p. aac4716, Aug. 2015.

Elaborate on this one in class

D. Butler, “Biologists join physics preprint club,” Nature, vol. 425, p. 548, Oct. 2003.

Delamothe, R. Smith, M. A. Keller, J. Sack, and B. Witscher, “Netprints: the next phase in the evolution of biomedical publishing,” BMJ, vol. 319, pp. 1515–1516, Dec. 1999.

Van Noorden, “Mathematicians aim to take publishers out of publishing,” Nature, Jan. 2013.

C. M. Bennett, M. B. Miller, and G. L. Wolford, “Neural correlates of inter-species perspective taking in the post-mortem Atlantic salmon: An argument for multiple comparisons correction,” NeuroImage, vol. 47, no. 1, p. S125, 2009.

http://genomesunzipped.org/2011/07/why-publish-science-in-peer-reviewed-journals.php
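The salmon paper cited above is an argument for multiple comparisons correction. The arithmetic behind that argument fits in a few lines. This is a sketch under a simplifying assumption: every test is independent and is run on pure noise. The function names are made up for illustration, and Bonferroni is only one of several correction methods.

```python
def chance_of_false_positive(num_tests, alpha=0.05):
    """Probability that at least one of num_tests independent tests
    on pure noise comes out 'significant' at threshold alpha."""
    return 1 - (1 - alpha) ** num_tests


def bonferroni_threshold(num_tests, alpha=0.05):
    """Per-test threshold that keeps the overall false-positive
    chance at or below alpha (Bonferroni correction)."""
    return alpha / num_tests


for num_tests in (1, 10, 100):
    print(
        num_tests,
        round(chance_of_false_positive(num_tests), 3),
        bonferroni_threshold(num_tests),
    )
```

With one test, the chance of a false alarm is 5 percent. With 100 uncorrected tests on noise, finding at least one "significant" result is close to certain, which is how a dead fish can appear to have brain activity.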

Publication bias

Actual bias

Some domains of science are vulnerable to such problems, while others are not.

- Examples?

In psychology or biology, data mining is often an accusation, while in computer science, it may be something we say with pride. The difference is, in part, one of transparency.

What to do about it??

```
**Dr. Taylor's Tao of data analysis: Follow the data, and abstract as little as possible!**
Occasionally, thoughtful abstraction and summary statistics will be needed and helpful, but much more rarely, and usually only in the end-stage analysis or automation, not in initial exploration (initial bushwhacking science).
```

- 22-DataVis/data_02_statistics.py (If you have really simple data, and want to calculate some basic statistics)
- 22-DataVis/data_03_pandas.py (If you have larger, more complicated datasets).
- Trace this one in Spyder for a quick demo

For the one-off little summary, not really for large-scale data
analysis:

* https://docs.python.org/3/library/statistics.html
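For example, a one-off summary with the standard library's `statistics` module might look like the sketch below. The reaction times are made-up numbers, used only to have something to summarize.

```python
import statistics

# Hypothetical measurements, in milliseconds.
reaction_times_ms = [310, 295, 402, 288, 350, 295]

mean_ms = statistics.mean(reaction_times_ms)
median_ms = statistics.median(reaction_times_ms)
mode_ms = statistics.mode(reaction_times_ms)
stdev_ms = statistics.stdev(reaction_times_ms)   # sample standard deviation

print(f"mean={mean_ms:.1f} median={median_ms} mode={mode_ms} stdev={stdev_ms:.1f}")
```

No installation is needed, which makes this handy for quick checks, but it works on plain Python sequences and is not meant for large datasets.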

If we are doing science, how do we organize our data correctly the
first time, so as not to have to spend all that time wrangling it?

* Wide, narrow, columns, rows?
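To make the wide versus narrow question concrete, here is a small sketch using pandas, assuming it is installed. The subjects, days, and scores are invented for illustration.

```python
import pandas as pd

# Wide: one row per subject, one column per measurement day.
wide = pd.DataFrame(
    {
        "subject": ["s1", "s2"],
        "day1": [5.0, 6.1],
        "day2": [5.4, 6.0],
    }
)

# Narrow (long): one row per single observation.
narrow = wide.melt(id_vars="subject", var_name="day", value_name="score")
print(narrow)

# And back to wide again.
wide_again = narrow.pivot(index="subject", columns="day", values="score")
print(wide_again)
```

Narrow data is usually easier to filter, group, and plot, while wide data is often easier for a person to read. Deciding on one layout before collecting data saves wrangling later.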

If you are doing data analysis, what language do you use?

* Python

* R

* Matlab

* Julia

Provide some history and context on these and the dataframe.

To learn more:

https://learnxinyminutes.com/docs/pythonstatcomp/

Q: How did the panda interpret the data wrong?

A: He was “Bamboozled”!

**pandas has created pandamonium in the data science
world!**

A great way to pander to the needs of your data… as you ponder the dataset’s deeper meaning.

The pandas dataframe allows you to arbitrarily retrieve complex subsets of your data!!!
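A minimal sketch of what retrieving subsets looks like, assuming pandas is installed. The animals table is invented for illustration and is not the dataset used in data_03_pandas.py.

```python
import pandas as pd

animals = pd.DataFrame(
    {
        "species": ["cat", "dog", "cat", "bird"],
        "age": [3, 7, 10, 2],
        "weight_kg": [4.2, 30.0, 5.1, 0.1],
    }
)

# Rows matching two conditions at once (note the parentheses and &).
old_cats = animals[(animals["species"] == "cat") & (animals["age"] > 5)]

# Chosen rows AND chosen columns in one step.
light = animals.loc[animals["weight_kg"] < 5, ["species", "weight_kg"]]

# A summary per group.
mean_age = animals.groupby("species")["age"].mean()

print(old_cats)
print(light)
print(mean_age)
```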

- The official documentation:
- https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/index.html
- review in lecture
- https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html

- Third-party (may be out-of-date)
- https://scipy-lectures.org/packages/statistics/index.html
- http://data-analysis-in-python.org/3_pandas.html

Note: pandas was/is a rapidly evolving package, and they have ruthlessly broken backwards compatibility for new optimizations over the years, so these (or any) cheatsheets may not be current.

In the past, you may have pulled data you wanted to analyze into Excel, whereas pandas can do all that and more!

```
* https://www.tomasbeuzen.com/python-programming-for-data-science/README.html (good interactive ipynb book).
* https://jakevdp.github.io/PythonDataScienceHandbook/ (good book in Jupyter notebooks)
* https://pythonprogramming.net/data-analysis-tutorials/
* http://data-analysis-in-python.org/
* https://pandas.pydata.org/pandas-docs/stable/tutorials.html
* http://shop.oreilly.com/product/0636920023784.do
* Pandas cheat sheet: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
```

+++++++++++++ Cahoot-22c.1

https://mst.instructure.com/courses/58101/quizzes/57516

Data scientists love beautiful data pictures!

* http://www.scipy-lectures.org/intro/matplotlib/index.html

* https://matplotlib.org/tutorials/

* https://matplotlib.org/tutorials/introductory/usage.html (go over in
lecture)

* https://matplotlib.org/tutorials/introductory/pyplot.html
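The tutorials above build everything from a Figure and one or more Axes. Here is a minimal sketch of that pattern, assuming numpy and matplotlib are installed; the sine wave is just a placeholder dataset.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder data: one period of a sine wave.
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)

# The Figure is the whole canvas; the Axes is one plot on it.
fig, ax = plt.subplots()
ax.plot(x, y, label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("A first matplotlib plot")
ax.legend()
```

In a script, follow this with `plt.show()` to open a window or `fig.savefig("sine.png")` to write a file; in a Jupyter notebook, the figure is displayed inline.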

See: ../../Bioinformatics/Content/02-PlatformTools.html

- Data and eScience deep dive
- MST’s ACM-data (They analyze Mo data…)
- https://modata.blog/
- https://modata.blog/learn/

- Pretty data pictures
- https://informationisbeautiful.net/
- https://informationisbeautiful.net/visualizations/worlds-biggest-data-breaches-hacks/

- Data science tutorials and competitions
- https://www.kaggle.com/
- Example competition story: final project
- In-class, download a jupyter notebook, and walk through an analysis,
for example:
- https://www.kaggle.com/c/plant-pathology-2021-fgvc8

These are some resources to actually learn data analysis and science
in a focused, sequential way:

* https://jakevdp.github.io/PythonDataScienceHandbook/ (looks like a
quite good book, built from Jupyter notebooks)

* http://data-analysis-in-python.org/

+++++++++++ Cahoot-22d.1

https://mst.instructure.com/courses/58101/quizzes/57573

What is a Jupyter notebook?

An IDE

A lab notebook

A format for tutorials

A python interpreter

Sci-fi love story
