Ok, so LLMs are a Thing.
How do they work? Embeddings.
WTF are embeddings?
I spent a year doing a deep dive. But when I was researching, I couldn't find anything that explained them in business, engineering, AND math contexts. So I wrote a thing.🚀
Ten years ago, I didn't know what unit tests, version control, or continuous integration were.
Today, my code fails at each of these steps at least once a day.
Follow your dreams.
Me at career day in middle school, brightly: Okay, kids. I'm a 'machine learning engineer', which is just a fancy word for -
Kid in the back: Yeah, we know what you do. PyTorch or Tensorflow?
Me: Well, I, uh, don't do deep learn-
Kids: BOOOO GO HOME I BET YOU ONLY USE ONE CORE
Hottest programming skills in 2018:
5. Fixing git merge conflicts
4. Correctly mapping ports in Docker containers to host machines
3. Getting info from AWS documentation
2. Pulling summary stats from a data stream
1. Turning any of the above into a conference talk about AI
ML textbooks have titles like “A Gentle Introduction to Linear Algebra for Data Analysis” and then the first sentence hits you out the gate with, “Assume the inverse covariance matrix theta we were imagining in our head before you picked up the book is transposed to reveal— ”
Sorting Hat: Merge sort.
Harry Potter: Er, excuse me?
Sorting Hat: The white board, do a merge sort.
Harry Potter: But I thought Hogwarts was a mag-
Dumbledore: My dear boy, castle rent is much too expensive. We've pivoted to SaaS products. The hat does our interviews now.
LLMs are so weird because one side is people with five PhDs who have been studying neuron activations for the past three decades and on the other side is someone called leetm5n with an anime avatar just casually releasing increasingly better performing fine tunes of mistral
Producer: Pitch me.
Me: It's a heartfelt romance about two data scientists who have never met, but leave each other carefully-commented notes in a shared codebase, falling in love in the process. It's called "The Jupyter Notebook."
Producer: Get out.
Producer: Pitch me.
Me: It's an ensemble sitcom about a lovable, goofball DevOps team that works for a startup in New York and investigates outages. It's called Brooklyn Five-Nines.
Producer: Get out.
Hey, I’m trying to improve this unsupervised model to correctly label users. Looking for an intern to improve it. Here’s a picture of the current clusters. If anyone has ideas, feel free to respond to this tweet with a gist of your implementation. Winner gets a lollipop.
New blog post: For the past couple years, I've been telling people who ask me for advice not to go into data science. Here's why: The data science job market is way oversaturated. Here's what they should do instead.
Some MIT faculty have put together a course called "The Missing Semester of Your CS Education." Having looked it over a bit, it looks fantastic and will help data science people from non-dev backgrounds fill in a lot of gaps, too.
I was worried ChatGPT would make me obsolete and then I tried it and it almost got the syntax I wanted, I just had to prompt it seventeen more times, now I’m pair programming with someone with no long-term memory and network timeouts every day, this thing is truly revolutionary
Ten years ago I was working with malformed data solely in spreadsheets. After ten years of hard work learning engineering and statistics, I finally am blessed to be working with malformed data in queues, matrices, containers, and serialized formats.
Suppose you have to choose between a black box AI surgeon that runs on TensorFlow 1.0 on an EC2 instance that hasn't been upgraded to Python 3 but has an 80% cure rate and a black box AI surgeon with an 80% cure rate that runs on Excel vlookups. Do you want to live on this planet?
This is actually not dumb! One of the first things I learned year 1 on the job was that executives are often busy and consume content very differently from the rest of us at work, usually via email on the phone. (BlackBerry at the time 🫠)
December 31 Resolutions:
1. Well-named Jupyter notebooks that run in order
2. No random temporary S3 buckets
3. Clean commit messages
4. Small p-values, good A/B tests, no peeking
5. Lots of READMEs
January 16:
1. This model runs. Please, please, don't ask me how it works.
An absolutely fantastic way to increase this is to start a blog. Almost all the cool fun stuff in my professional life for me has come from doing stuff then blogging about it.
"The amount of serendipity that will occur in your life is directly proportional to the degree to which you do something you're passionate about combined with the total number of people to whom this is effectively communicated."
Hope everyone is enjoying their week-long stint as a data scientist, the most glamorous tech job of the last 10 years, supposed to be spent analyzing sophisticated models, but really spent mostly monitoring those few stray batch import jobs that haven't finished yet.
Some personal news: I had a baby this weekend! 👶 Everyone is healthy, we are thrilled, and big sister is thrilled. ❤️
So I’m looking forward to catching up with everyone on Twitter either at 2:37 am or next year.
Me, standing outside Geoffrey Hinton’s office, matted hair, bloodshot eyes, shouting at passers-by: IT’S ALL A SCAM. THERE IS NO AI. NEURAL NETS ARE JUST MATRIX OPERATIONS. *sobbing softly as guards approach* they’re just matrix operations.
Hot data science trends:
2011 T-tests on laptops
2012 Hadoop
2013 Bayesian inference
2016 Spark
2017 Deep learning
2019 Reinforcement learning
2022 Robot war
2023 Cloud computing outlawed
2024 Computers outlawed
2025 T-tests on pen and paper
Producer: Pitch me.
Me: It's a reality competition show featuring two groups of people that can't stop talking about what they do: gym rats and data scientists. We pit them against each other in feats of strength. It's called "CrossFit Validation"
Producer: Get out.
I am much less worried about sentient AIs than the fact that we are surrounded and influenced by machine learning systems and are not taught how to reason through how they work.
GitHub just wrote an article about how they had to write their own search engine (in Rust, for performance reasons) and a new probabilistic data structure to reduce indexing time because ES and Lucene were blockers for them but sure big data is over 😅
Gonna start a conference called #NormIPS that’s just presentations of middlebrow ML topics. “how to structure Python packages 2022”, “how many k-folds is too many”, “how to make the browser pop-up come up when the notebook is done running”, “putting features in Postgres”, etc.
Big shoutout to this book. I’ve already recommended it to a lot of people looking for either an intro or refresh to linear algebra and am psyched to recommend it in an official capacity.
The paperback for my @OReillyMedia book "Essential Math for Data Science" is now available! Thanks to the hard-working folks at O’Reilly for helping make this book as great as possible for readers. It is going to fill a much-needed gap.
Producer: Pitch me.
Me: It’s a gripping action thriller about a detective who switches from Java to Python, hoping to catch a criminal developer, who then switches from Python to Java. Both have extreme syntax difficulties. It’s called Brace/Off.
Producer: Get out.
Marginalia, the indie search engine that surfaced non-commercial content first, is currently on the front page of HN and handling the traffic load with one $5k commodity server with 128GB RAM/24 cores at 85% utilization with a single Java app
Job Req:
------
Years of Experience: 37
PhD: Required
Languages: Python, R, Scala, Fortran, and Cantonese
Experience with: Machine Learning, DevOps, Agile, Marie Kondo Method
Someone who is good at recruiting, help me find a data scientist, my company is dying.
I'm an introvert, so I'm not having as hard of a time as the poor extroverts, but something that I really miss is ambiently being around people. I sometimes like being in cafes, in workspaces, surrounded by conversation and the pulse of busy-ness, feeling like a part of humanity.
Fifty billion years ago in March, before 2020 really hit the ground running, I started working on a fun proof-of-concept ML project to really explain all the things that need to happen for machine learning to work in the wild. I finally wrote it all up.
When you go to a dude's website and it's just plain HTML, not even any CSS, and links to posts like "Some thoughts on prime numbers" and "Efficiently checking tries for fun and profit" and a picture of him in a sweater at a party from 2014, watch the fuck out. This guy codes.
i'm just a girl standing in front of the ml research community please begging everyone to type their python method inputs and outputs especially if they are tensors or weird nested dicts of lists of dicts
My developer path:
Learning how to work with dirty data in Excel @ 19
Learning how to work with dirty data in Access @ 24
Learning how to work with dirty data in pandas @ 25
Learning how to work with dirty data in scikit @ 28
Learning how to work with dirty data in Airflow @ 33
Producer: Pitch me.
Me: It's a musical comedy about an overeager team of data scientists that does way too many tests on bad data. It's called "Gimme Gimme Gimme a NaN after Midnight."
Producer: Get out.
Me: *over my shoulder* The team calls themselves A/B/B/A.
Producer: Pitch me.
Me: It’s a high-stakes Korean thriller where you watch people under enormous stress kill Linux processes. It’s called Pid Game.
Producer: Get out.
Checking out Effective Python by @haxor. I’m a big fan of the book so far because it takes all the Pythonic best practices you hear about in conferences and on StackOverflow and contextualizes and organizes them. Thanks to the publisher for sending this over.
I love how the LinkedIn crowd is like, “Get fifteen hours of sleep, create space for your meditation. Truly focus on your success.”
Buddy, just now as I was trying to eat a piece of toast for 2 minutes, the toddler found me and bit me. LMK how that fits into The Strategy.
CTO: We’re thinking of replacing our on-prem server with a distributed system in the cloud. What kinds of considerations would help us make that decision?
Me:
My working theory is that 10% of any data science/adjacent job is actual machine learning. Unless your title is "Machine Learning Engineer", in which case it can be as much as 20%.
Every data science article on Medium is like “How every day I deploy a million-feature deep learning model at scale to millions of users. This is Real machine learning.” Meanwhile I spent a good half hour today figuring out why sbt wouldn’t build and it was because of a typo.
2019:
+ Had my second baby
+ Read 103 books
+ Started a newsletter that now has 250+ paid and 3k free subscribers
+ Changed 1k+ diapers
+ Cancelled a load balancer that was costing me $50/month in AWS
2020 Resolutions:
+ Sleep more than 5 hours at a time
Getting marginally depressed thinking about all those brilliant Joel on Software essays, the level of craftsmanship in the software, those gorgeous offices, the whole of Stack Overflow’s contributions to humanity and for what? To end up as training data for ChatGPT.
VC: Ok, whatchu got?
Me: Imagine WeWork, but with clean and stocked bathrooms.
VC: That's just WeWork.
Me: No it's not.
VC: We won't fund it.
Me: The bathrooms will be cleaned with AI.
VC: Here's $30 million.