My copy of Probably Overthinking It has arrived!
It should be shipping soon. If you would like to pre-order, you can get 30% off from University of Chicago Press. Use the code UCPNEW.
Today I asked ChatGPT to solve almost every exercise in Think Python and DSIRP.
It did.
My conclusion: everyone who writes code should spend the next month doing professional development on writing code with LLM-assist.
This is how code will be written from now on.
In the last week, three people on reddit/r/statistics have asked about testing whether a sample came from a Gaussian distribution.
The answer is that you should never test for normality. The result is a non-answer to the wrong question.
This semester I wrote a book, Data Structures and Information Retrieval in Python.
It covers data structures and algorithms, organized around a motivating example: building a search engine.
It's all in Jupyter notebooks that run on Colab.
With quizzes!
A few years ago I wrote a short book that explains basic use of Git:
It contains exercises you can do on a practice repository:
Today I added a page called "Merge conflicts with minimal pain":
Good luck!
Math notation is good for a lot of things, but representing algorithms is not one of them.
Fortunately, we have other formal languages that are really good at representing algorithms -- programming languages.
It's a shame academic papers don't use them more often.
The "Elements of Data Science" curriculum is now substantially complete:
It is a notebook-based introduction to Data Science in Python for people with no prior experience in programming or statistics.
I have started work on the second edition of Think Bayes. I think it will be much better!
Here's the notebook for Chapter 1:
There are exercises at the end if you want to play along at home.
The 3rd Edition of Think Python is available now at
The print edition is available for preorder, expected to ship in June.
What's new? Jupyter notebooks, turtle graphics, doctest, unittest, regular expressions, and a new, full-color parrot on the cover!
This is interesting if true, but I'm not sure it is.
I tried to replicate the US graph with GSS data and I'm not seeing it.
Whether the gap is growing depends on how seriously we take the last two points in a noisy series.
But it's nowhere near 30 points.
NEW: an ideological divide is emerging between young men and women in many countries around the world.
I think this is one of the most important social trends unfolding today, and provides the answer to several puzzles.
A recent article in the Financial Times claims that there is an increasing ideological gender gap in several countries.
In this article, I replicate their analysis with GSS data and conclude that there is little evidence the gap in the US is growing.
A modest proposal: Let's stop using the term "bias-variance tradeoff".
"Underfitting" and "overfitting" are clear, self-explanatory, and easy to remember and recognize.
"Bias" and "variance" add nothing but confusion.
And the word "bias" is already too overloaded.
Modeling and Simulation in Python is off to the printer!
And available for pre-order :)
To celebrate, I published one of my favorite examples: One Queue or Two?
I have news! After 18 years at @olincollege, I am leaving at the end of this academic year. Since I turn 55 in May, I am retiring, technically :)
Not sure yet what I'll do next. I'll take some time to figure it out -- watch this space!
Planning my Data Science class for the spring, I have 7 slots reserved to critique interesting data visualizations: what do you notice? what do you wonder? what works? what would you try changing?
The NYT cochlea of COVID will be on the list.
What else?
Most of the teaching examples for classification algorithms are toy data, fake applications (looking at you, irises and Titanic datasets).
Any suggestions for examples with real data, real applications, ideally in the space of data science for social good?
@causalinf
If grandfathers count: when I was 10 I got to stay up all night in the print room of the Boston Herald. I got (and still have) my name in several typefaces on metal slugs.
And I got to press the big, red STOP THE PRESSES button!
Which is actually big and red.
The second edition of Modeling and Simulation in Python is done:
I am printing copies for my class this fall. If you would like a hard copy, you can get one from Lulu:
Cover design by Olin's own Tim Sauder.
Can someone explain why, if you write an idea in math notation, that's "theory", which provides deep understanding of the math "behind" it, but if you write the same idea in a programming language, it's just hacking?
This bizarre prejudice is the bane of my professional life.
Programming languages like Python are more readable than pseudocode, and have the additional advantage of being executable and debuggable.
Pseudocode is obsolete.
I know I shouldn't always take the bait, but someone on the Internet was mean about Bayesian statistics, so I wrote a manifesto:
"Bayesian and frequentist results are not the same, ever"
Got a helpful email today from a student reading my book with a screenreader.
Among the suggestions for better accessibility: use "can not" rather than "can't"; with a screenreader it is hard to hear the difference between "can" and "can't".
And that's an important difference!
For my Data Science class in the spring I am compiling resources related to data visualization.
What books would you recommend? Web sites? Other?
I will collect suggestions and post them next week.
Ruin this joke by explaining it? Ok!
This is an instance of Berkson's paradox. If a dingy restaurant with a broken website has bad food, it won't last, so among surviving restaurants, greasy styrofoam is correlated with good food.
There's a chapter about this in my book!
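The selection effect can be sketched with a small simulation. Everything here is an assumption made for illustration: food quality and "polish" (decor, website, packaging) are independent standard normal scores, and a restaurant survives only if their sum clears an arbitrary threshold.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Assumed model: the two traits are independent in the full population.
food = rng.normal(size=n)      # food quality
polish = rng.normal(size=n)    # decor, website, packaging

# Hypothetical survival rule: a restaurant lasts only if the combination is good enough.
survives = food + polish > 0.5

r_all = np.corrcoef(food, polish)[0, 1]
r_surviving = np.corrcoef(food[survives], polish[survives])[0, 1]

print(f"all restaurants:       r = {r_all:.2f}")
print(f"surviving restaurants: r = {r_surviving:.2f}")
```

Under these assumptions the correlation is near zero overall and clearly negative among survivors, so low polish goes with better food only because of who is left.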
I was checking to see if a burrito place near me had good reviews and the photos of the place look dingy and the food looks greasy and they serve it in styrofoam and when I clicked their website I got a 404 error. This is going to be the best food I’ve ever tasted.
Scientists: if you are still writing articles primarily in the passive voice because you think journals require it, please check the style guides.
Many journals, including Nature, Science, and PNAS, have been begging you to stop for years.
I'm working on a new series of notebooks to teach probability and Bayesian statistics.
The first notebook starts with a famous example of the conjunction fallacy, Tversky and Kahneman's Linda the banker.
Want to play along at home?
Today is my first day on an exciting new project.
This semester I will be at Harvard one day a week, co-teaching a seminar on data science education and helping to design a new undergrad class for Spring 2020.
Details at
More examples of Simpson's paradox in the General Social Survey. Old people are racist, sexist, and homophobic, but it's not because they're old; it's because they were born a long time ago.
In my Complexity Science class, I mentioned the way NumPy creates "views" to avoid copying arrays.
To explain the idea more clearly, I created this notebook, which you can run on Colab:
It has some exercises you can work on, in case you are bored.
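For readers who have not met views before, here is a minimal sketch (not taken from the notebook) of the difference between a view and a copy. The array values are arbitrary.

```python
import numpy as np

a = np.arange(6)

view = a[1:4]          # basic slicing returns a view onto the same memory
copy = a[1:4].copy()   # an explicit copy owns its own data

view[0] = 100          # writes through to `a`
copy[1] = -1           # leaves `a` untouched

print(a)
print(np.shares_memory(a, view))  # True
print(np.shares_memory(a, copy))  # False
```

Modifying the view changes the original array, which is why views are fast but can surprise you.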
If you've been vaccinated, thank a scientist. And then thank about 100 engineers.
Because inventing mRNA vaccines is science. But manufacturing billions of doses, keeping them cold, and delivering them around the world is a feat of engineering.
I've been working on Modeling and Simulation in Python for... a while.
On my 4th try, I have a version I am happy with. It's still a work in progress, but I've posted a mostly complete draft:
I don't understand why @github STILL can't reliably render a Jupyter notebook. It's been years!
NBViewer is 100% reliable as far as I can tell. Why is this so hard?
This fall, I am taking my Data Science class on the road...
...the really long road to Ashesi University in Ghana:
Classes start August 31.
I can't wait!
Every time someone invites me to Slack, I spend 15 minutes figuring out what email address to use, I connect once, and then never use it again.
I'm on about 20 channels, can't log into any of them, and have no idea how or why anyone uses it.
Is it terrible, or am I just old?
I ran into a NumPy gotcha today: np.var and np.cov have inconsistent default behavior.
var divides by N
cov divides by N-1
IMO, both should use N, which computes a simple descriptive statistic; if you want an estimator, you have to ask for it
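A minimal sketch of the inconsistency, using a small made-up array. The `ddof` parameter is how you ask either function for the other behavior.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])

var_default = np.var(x)              # default ddof=0: divides by N
cov_default = np.cov(x)              # default: divides by N-1

var_estimator = np.var(x, ddof=1)    # ask np.var for the N-1 version
cov_descriptive = np.cov(x, ddof=0)  # ask np.cov for the N version

print(var_default, cov_default)
print(var_estimator, cov_descriptive)
```

With four values, the two defaults differ by a factor of 4/3, which is easy to miss until the results do not match.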
Theorem: If students find a statistical concept hard, the problem is the concept, not the students.
Proof by example: p-values, confidence intervals and likelihood functions are "hard to understand" because they are fundamentally broken, bad ideas.
Last week I asked for your favorite data visualization resources. Thank you to everyone who replied. I have organized the responses (and added a few of my own):
What's new in English version 20.20?
* Singular "they" is recommended.
* Ending a sentence with a preposition is allowed.
* Split infinitives are allowed.
* "Whom" is now deprecated.
* Latin plurals are deprecated.
* The passive voice in science writing is deprecated.
Here's a quick Bayesian analysis of the results from the vaccine trial announced today:
Based on some guesses about the raw data, and some modeling assumptions, it seems unlikely that the effectiveness is less than 80%.
I was at @Google today to give a talk about Chapter 7 of Probably Overthinking It: Causation, Collision, and Confusion.
I'll post the video when it's available, but in the meantime, the slides are here:
conda is so slow it is now unusable, and now can't solve some environments it used to.
mamba is fast but buggy and the documentation is not ready for prime time.
Is there a good option for package management?
There are lots of good articles explaining how LLMs work at a mechanical level.
This is the best explanation I've seen of how LLMs are able to do what they do, at least as we currently understand it.
I wrote an article that uses Bayesian decision analysis to find the optimal strategy for plugging in a USB connector.
It turns out there's a reason it's so common to flip twice.
My revised tutorial on Bayesian Statistics is ready to go. The slides and notebooks are here:
If you are coming to @SciPyConf and you want to see it live, good seats are still available:
I am excited to announce the forthcoming third edition of Think Python!
What's new? Jupyter notebooks on Colab, learning to program with ChatGPT, regular expressions, automated testing -- and turtle graphics that work in notebooks!
Does anyone know why GitHub has such a hard time rendering Jupyter notebooks? For me, it often fails several times and then works, or never works.
Whereas nbviewer seems to work, quickly, 100% of the time.
@github, can you borrow nbviewer's renderer?
If you've ever been confused about joint distributions, marginal distributions, and conditional distributions, you might like the next notebook in Bite Size Bayes:
Welcome to the world of two dimensions!
I'm teaching Complexity Science this semester, so I updated the notebooks.
They run on @googlecolab, so you can run them without installing anything!
Links here:
I'm starting a new job this week, as a curriculum designer at Brilliant @brilliantorg, focusing on data science and computer science.
If you want to try one of their online classes, here's a freebie:
My thoughts on this whole 10x engineer thing:
Many people are effectively 0x engineers (that's "zero ex") because they are working on things that will never have positive impact.
Pay less attention to 10x; focus on making sure you are not 0x.
The video of my tutorial on Bayesian Decision Analysis, from PyData Global 2022, is available now.
For links to the video, slides, and Jupyter notebook, start at
I am developing examples of survival analysis for my Data Science class. Anyone have any favorite application domains and/or datasets?
@Cmrn_DP
So far:
1) Literal survival, probably using datasets from R
2) Time until marriage, divorce, NSFG
3) Customer conversion, dataset?
I am moving toward treating NumPy as core Python and using all NumPy functions instead of the math module. The side effect is that Python sequences get quietly upgraded to NumPy arrays, which are usually faster and smaller.
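A small sketch of what the quiet upgrade looks like, with arbitrary example values: the `math` function handles one scalar at a time, while the NumPy function accepts the whole list and returns an array.

```python
import math
import numpy as np

values = [1.0, 4.0, 9.0]

# math.sqrt takes a single number, so a list needs an explicit loop.
roots_math = [math.sqrt(v) for v in values]

# np.sqrt accepts the list directly and returns a NumPy array.
roots_np = np.sqrt(values)

print(type(roots_math))
print(type(roots_np))
```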
I am working on an ebook, tentatively called Bite Size Bayes, that introduces Bayesian statistics gradually, for people with no prior stats.
Here's Python notebook 2, if you want to check it out.
R version coming soon!
Replication crisis update
"none of the 193 experiments were described in sufficient detail"
"of the 50 experiments from 23 papers that were [repeated], effect sizes were, on average, 85 percent lower than those reported in the original experiments."
A few weeks ago I led a workshop at Harvard on "Using computation to teach everything else"
The slides for the workshop are at
Including my favorite provocative slides:
Just ran some older code and got a million warnings about features that have been deprecated.
If you find yourself writing one of these warning messages, please, please, please include instructions for whatever the new thing is that replaces the deprecated thing.
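As a sketch of what that could look like, here is a deprecation warning that names its replacement. The functions `old_mean` and `new_mean` are hypothetical, invented for this example.

```python
import warnings


def new_mean(values):
    """Hypothetical replacement function."""
    return sum(values) / len(values)


def old_mean(values):
    """Hypothetical deprecated function that tells users what to use instead."""
    warnings.warn(
        "old_mean() is deprecated and will be removed in a future release; "
        "use new_mean() instead, which takes the same arguments.",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller's line
    )
    return new_mean(values)
```

The message says what is going away and exactly what to call instead, so the fix is a one-line change.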
Mental health tip: DO NOT watch the election live like it's a sporting event.
To keep you watching, TV people will make it seem like new information is streaming in.
It's not. They're just making up stories about noise.
If you must, check once Tuesday night. Then go to bed.
Let's practice Bayesian thinking.
H: aliens
D: unsupported testimony before Congress
P(D|H) = high
if there were aliens, someone would talk
P(D|not H) = high
not even the craziest thing said in Congress this week
Likelihood ratio close to 1 = little or no evidence
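The same update in odds form, as a sketch. All the numbers are made up for illustration: the prior of 0.01 and the likelihoods of 0.9 are placeholders, not estimates.

```python
def posterior_probability(prior, likelihood_h, likelihood_not_h):
    """Update a prior probability using the odds form of Bayes's theorem."""
    prior_odds = prior / (1 - prior)
    likelihood_ratio = likelihood_h / likelihood_not_h
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)


prior = 0.01            # placeholder prior for H, chosen arbitrarily
p_d_given_h = 0.9       # "high" (placeholder value)
p_d_given_not_h = 0.9   # also "high" (placeholder value)

posterior = posterior_probability(prior, p_d_given_h, p_d_given_not_h)
print(prior, posterior)
```

When the two likelihoods are equal, the likelihood ratio is 1 and the posterior equals the prior, whatever prior you started with.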
Breaking: Former US Intelligence agent David Grusch, while testifying to Congress in the UFO hearings, just scared the crap out of me.
The idea of UFOs and aliens isn’t frightening to me, but the idea that they could exist and that they could be trying to harm humanity is very
I just published an article that demonstrates my incremental process for developing and testing models in PyMC:
It also explains the relationships between the four distributions of Bayesian analysis.
In recent versions of Jupyter, it looks like the default matplotlib backend is inline, so the magic command
%matplotlib inline
is no longer necessary.
Good: that's one less thing to explain to beginners.
But can someone confirm that I can count on this behavior?
@ProjectJupyter
I'm developing an introductory data science curriculum. Would you like to play along?
Try out this do-it-yourself, choose-your-own-adventure mini-project that explores the relationship between political alignment and other attitudes and beliefs.
Exploring the effect of researcher choices on statistical results: 73 teams estimate the same effect size with the same data, and generate 1,253 different results.
As always, statistical results depend on modeling decisions.
We are born Gaussian, but we grow up to be lognormal.
Video, slides, and Jupyter notebooks from my talk @PyDataGlobal:
Extremes, outliers, and GOATS: on life in a lognormal world
Just heard from my friends @OReillyMedia that Think Bayes second edition is a go.
So, let's celebrate with the notebook for Chapter 2, featuring cookies, dice, socks, and Elvis.
If you search for "python breadth first search", a substantial majority of the implementations you find are accidentally quadratic. Should be O(n+m), instead they are O(n^2).
Here's the first hit: can you spot the performance error?
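The linked implementation is not reproduced here, so this is not a diagnosis of it. Two common ways BFS becomes quadratic in Python are popping from the front of a list with `list.pop(0)` and checking membership in a list of visited nodes. A minimal sketch of a linear-time version, assuming the graph is a dict that maps each node to its neighbors:

```python
from collections import deque


def bfs(graph, start):
    """Return the nodes reachable from start, in breadth-first order.

    graph: dict mapping each node to an iterable of neighbors
    (an assumed representation for this sketch).
    """
    seen = {start}          # set: constant-time membership tests
    queue = deque([start])  # deque: constant-time popleft, unlike list.pop(0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return order


graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(bfs(graph, "a"))  # ['a', 'b', 'c', 'd']
```

Each node is enqueued once and each edge is examined once, which gives O(n+m).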