Dentist: "You have mild evidence of teeth grinding"
Me: "Oh darn! What could happen?"
Them: "I can't guarantee you exactly what will happen"
Me: "Sure, but for a patient population like me, what is the distribution of outcomes?"
Them: "Medicine doesn't work that way"
Me: "Yes it does"
It occurs to me that dentists do themselves a bit of a disservice by not qualifying their interventions.
How important is it that I floss, really? Saying "really really important" about everything just trains the general population to ignore dental advice.
When I stopped writing papers, and started writing blog posts everything got better.
I had more impact, got more immediate feedback, was offered jobs with increased flexibility, and enjoyed the writing process infinitely more.
@robertsd Do you have a sense of what the distribution of outcomes is for people presenting with mild grinding?
Do you have a sense for the efficacy of various interventions?
"No" is a fine answer (I don't expect you to have numbers off the top of your head) but I'd be curious if you do
Every few months Dask gets a PR titled "Pandas Compatibility" from @TomAugspurger.
This is the kind of cross-project maintenance that is crucial for smooth operation across the #PyData ecosystem, but often goes unnoticed.
Thanks Tom!
Ooh, Databricks wrote a negative piece on Dask. Fun!
I've always avoided negative messaging around other projects. Aside from tone, it's also incredibly hard to do well/honestly.
I'll just leave this here:
How much do you know about Koalas? And no, we're not talking about these (even though they're pretty cool): 🐨
We compared the performance of Koalas (PySpark) and Dask – and the results are in!
I enjoyed making this screencast accelerating a processing pipeline several orders of magnitude with Numba, Dask, and CuPy/RAPIDS.
I did it all in a couple of hours. It was a blast to use everything together.
The new Pandas 1.3 release has great performance improvements for text data, thanks to @ApacheArrow.
I made an 8-minute video about how to enable it here:
Pandas 1.0 is here!
* Read the release notes:
* Read the blogpost reflecting on what 1.0 means to our project:
* Install with conda / PyPI:
Thanks to our 300+ contributors to this release.
It's ok to not be productive right now.
When I experience stress I usually respond by working harder. My stress is usually work related, so this usually makes sense.
But right now working harder doesn't help the situation. Taking time to relax and assess mental health does.
Looking at job applications I find that I'm more drawn to candidates with GitHub accounts over candidates with PhDs and CVs.
It's not because the work is more impressive. It's because I can more rapidly assess fit with GitHub than with papers.
Prefect is what happens when an Airflow developer and a Dask developer get together to make a data engineering platform.
I've enjoyed watching this company develop and am excited to see how people use it, now that the core is open source.
Spark, Dask, DuckDB, Polars: TPC-H Benchmarks at scale
This is the result of a couple weeks of work comparing large data frameworks on benchmarks ranging in size from 10 GB to 10 TB. No project wins. Analyzing the results is really interesting, though.
If we care about global warming then we should be open to Nuclear Power.
Renewables are awesome, but they're not likely to solve the problem on their own.
I support Nuclear because I care deeply about the environment. I encourage you to do the same.
Two short blogposts on Python + HPC:
1. Five reasons to keep your HPC center rather than transition to the cloud
2. How to retrofit an HPC center for interactive data science with Conda, Jupyter{Hub,Lab}, and Dask
I'm conda updating NumPy in a big environment. The slowness of this used to frustrate me.
However, now that Conda gives live reports about its work I suddenly have great compassion for it. Go conda go!
I guess UX really does matter.
New blogpost on Dask Arrays, Numba, smoothing images, and using NumPy's generalized universal functions to make complex computation more accessible for users.
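For context, NumPy's generalized universal functions let you define an operation over core dimensions and broadcast it over everything else. A minimal sketch using `np.vectorize` with a `signature` (the smoothing function and names here are illustrative, not the blogpost's code):

```python
import numpy as np

# A function defined on 1-D windows (core dimension "(n)"),
# lifted into a gufunc-like callable that broadcasts over leading axes.
smooth = np.vectorize(
    lambda row: np.convolve(row, np.ones(3) / 3, mode="same"),
    signature="(n)->(n)",
)

signals = np.random.random((4, 100))  # a stack of 4 signals
result = smooth(signals)              # applied independently to each row
print(result.shape)                   # (4, 100)
```

Because the core-dimension signature is declared up front, libraries like Dask can apply the same function blockwise across a larger array without the user writing any parallel code.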
Opportunities and challenges of GPU-enabled #PyData
My first impressions since joining NVIDIA. GPUs are truly impressive, but there are social and technical challenges to integrating them into an ecosystem.
High performance processing pipelines
I often refer to this video demo. It uses Dask, Numba, GPUs, and asyncio to accelerate an image processing pipeline several orders of magnitude.
Recording of my talk "Refactoring the SciPy Ecosystem for Heterogeneous Computing" at #SciPy2019
Two main points:
- GPUs are neat
- We should focus on standards and protocols
I'm glad to see that Pandas 1.0 includes optional support for @numba_jit for accelerated Python.
I thought I'd repost this blog from two years ago on The Case for Numba in Community Code
Credit models with Dask. A guest post by Rich Postelnik using Dask to build computational systems for credit modeling.
A fine example of complex "Big Data" workflows that fall outside of big dataframe/array/sql paradigms.
Note that #Dask arrays support any NumPy-like container. It would be great to see someone parallelize around CuPy for general-purpose distributed GPU arrays in Python.
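A minimal sketch of the idea, using a NumPy array as the chunk container (on a GPU machine you could hand `da.from_array` a CuPy array instead; the shapes here are illustrative):

```python
import numpy as np
import dask.array as da

# Wrap an in-memory array; each chunk keeps the container type you passed in.
x = da.from_array(np.ones((1000, 1000)), chunks=(500, 500))

# Operations dispatch to the underlying container's own functions,
# so swapping NumPy for a NumPy-like library runs the same code on GPUs.
result = (x + x.T).sum().compute()
print(result)  # 2000000.0
```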
Sales tip for engineers: Ask more questions before you dive into deep technical answers.
Life tip for anyone: listen more, speak less
Single-page blogpost version here:
This is great!
One challenge to integrating GPUs into the Python ecosystem is the lack of a uniform API. PyTorch, RAPIDS, and Numba all wrap CUDA separately, so there is no generic way to manipulate the GPU.
This advance enables ecosystem libraries like Dask to operate generically.
I like this article not only because Dask does well, but also because it was written by someone who wasn't associated with any of the existing projects.
Benchmarks run by people with a vested interest are almost always incorrect.
Accidental bias is too easy.
Video visualizing 1,000,000,000 points
This is a common example with Datashader. This video is different in that it focuses on performance tuning. When we start, an update takes 40s. When we're done, it takes ~1s.
High performance requires thought.
Anti-Hype: Remember that FORTRAN-77 supported tensors. It called them Arrays. Almost all languages have supported them natively ever since, without fanfare.
Small Docker Images with Conda
This small blogpost from @jcristharif is old, but keeps coming up at work. I thought I'd retweet it here for others.
Slides for my #AnacondaCon talk on Real Time Processing with #Dask
It includes a Binder environment with all of the examples as live, runnable notebooks (see futures and streaming dataframes in particular).
Coiled is growing
We're expanding our engineering teams on both the web/product side and the Dask-OSS/support side.
We're looking for both experts and dabblers.
Come play!
The @rapidsai / @dask_dev joint team is working on high performance networking. We were spending too much time interpreting text printouts to see bandwidth across nodes.
Now we have live visual diagnostics with Dask
The Dask project is looking to hire a community developer to support a variety of ongoing work in the life sciences.
Does this interest you, or someone that you know? Please apply.
Small post on how to craft a bug report that a package maintainer will appreciate:
I wrote this to summarize common anti-patterns for new users, but thought that others might find it helpful as well.
Matplotlib is used by 17% of papers on @arXiv.
@tacaswell and I played around with this a couple months ago at a CZI meeting.
Nice work @matplotlib devs!
"How many people use your OSS project?"
This is a critical question for funding OSS, and is also really hard to answer well. This blogpost is my attempt to answer it honestly for Dask.
I'm excited about the developments in @openucx and UCX-Py for high performance communication in Python. This should help @dask_dev users in HPC, but also many others in Python and beyond.
Thanks Akshay Venkatesh and @TomAugspurger for your work.
I miss the forced idleness of air travel.
After a couple weeks of intense productivity, being trapped in an internet-less metal tube suspended miles above the earth sounds like bliss.
New screencast on a recent Dask feature that minimizes memory use through smarter scheduling
This is having a huge impact in a surprising number of common workloads.
This Friday @CoiledHQ engineers are collectively taking a day off.
We found that people weren't taking vacation time, either due to being on a small team (it's hard to be the only one out) or due to COVID (vacations are less fun).
So we're experimenting with extra days off.
Recently I've been seeing public institutions and OSS developers build around proprietary APIs in ways that slightly concern me.
I wrote down some thoughts on avoiding cloud lock-in by adopting open standards:
Toolz was my first major OSS project (it has a wider install base than Dask). Today, after several years of constant, bug-free use, I discovered a bug.
It was easy to fix, but an odd blast from the past.
Also, the entire toolz test suite runs in 1.2s. I really miss tiny projects
Now with GPUs!
The speedups for stencil computations on GPUs were surprisingly nice for this problem. This was also my first time writing CUDA code in Python with @numba_jit, which was a pleasure. Thanks @numba_jit!
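The CUDA version is GPU-specific, but the stencil pattern itself can be sketched with plain NumPy slicing (a 1-D three-point mean; the function and coefficients here are illustrative, not the screencast's code):

```python
import numpy as np

def smooth3(x):
    """Three-point mean stencil: each interior point becomes the
    average of itself and its two neighbors; edges are left as-is."""
    out = x.copy()
    out[1:-1] = (x[:-2] + x[1:-1] + x[2:]) / 3.0
    return out

a = np.array([0.0, 3.0, 6.0, 9.0, 12.0])
print(smooth3(a))  # already linear, so the stencil leaves it unchanged
```

Numba's `@cuda.jit` lets you write the same neighbor-indexing kernel per GPU thread instead of with whole-array slices.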
You don't need to apologize for having your kids interrupt business calls.
Someone did this yesterday and I was surprised. I think that we're past that now.
Kids are just part of the modern workplace. It's like hearing a Slack message in the background.
I stuttered until age 16, when it miraculously went away (there are a few situations which trigger it still)
Every time I see a fellow stutterer my heart aches. Stuttering feels like really wanting to chase after someone in sports, but with your shoelaces tied.