Matthew Rocklin Profile
Matthew Rocklin

@mrocklin

Followers: 8,726
Following: 99
Media: 280
Statuses: 6,233

Open source maintainer. @dask_dev author. CEO at @CoiledHQ. Additionally, I try to be a decent human and help keep the world from melting.

Austin, TX
Joined April 2009
@mrocklin
Matthew Rocklin
2 months
Dentist: "You have mild evidence of teeth grinding" Me: "Oh darn! What could happen?" Them: "I can't guarantee you exactly what will happen" Me: "Sure, but for patient population like me what is a distribution of outcomes?" Them: "Medicine doesn't work that way" Me: "Yes it does"
101
110
5K
@mrocklin
Matthew Rocklin
6 years
#AntiHype : Single large machines are often a surprisingly pragmatic computing platform
42
950
2K
@mrocklin
Matthew Rocklin
2 months
It occurs to me that dentists do themselves a bit of a disservice by not qualifying their interventions. How important is it that I floss, really? Saying "really really important" to everything just makes the general population regularly not follow dental advice.
15
15
2K
@mrocklin
Matthew Rocklin
4 years
When I stopped writing papers, and started writing blog posts everything got better. I had more impact, got more immediate feedback, was offered jobs with increased flexibility, and enjoyed the writing process infinitely more.
19
180
2K
@mrocklin
Matthew Rocklin
4 years
I've left my position at NVIDIA (it was great!) I'm starting a Dask company Wish me luck :)
127
147
1K
@mrocklin
Matthew Rocklin
2 months
Communicating nuance is really important to establish trust.
9
9
932
@mrocklin
Matthew Rocklin
5 years
I left employment at Anaconda Inc. last week. I start employment at NVIDIA next month. Surprisingly little about my work will change. More details:
47
63
455
@mrocklin
Matthew Rocklin
4 years
Introducing Coiled Computing A Dask Company
19
99
410
@mrocklin
Matthew Rocklin
7 years
I've been playing with reactive/streaming systems recently. Here is a blogpost about streaming Pandas dataframes.
5
112
330
@mrocklin
Matthew Rocklin
2 months
@robertsd Do you have a sense of what the distribution of outcomes is for people presenting with mild grinding? Do you have a sense for the efficacy of various interventions? "No" is a fine answer (I don't expect you to have numbers off the top of your head) but I'd be curious if you do
16
2
306
@mrocklin
Matthew Rocklin
5 years
Parallel GPU Arrays with #Dask and CuPy. A blogpost on first steps.
3
103
291
@mrocklin
Matthew Rocklin
4 years
Tiny summary of a lot of work:
Coiled raised $5M seed funding
$ pip install coiled
>>> cluster = coiled.Cluster()
@CoiledHQ
Coiled
4 years
We're excited to launch Coiled Cloud and to announce our recent funding. "Dask and scalable data science for everyone, everywhere."
11
92
338
16
44
283
@mrocklin
Matthew Rocklin
6 years
New Blogpost: Beyond Numpy arrays in Python Preparing the ecosystem for sparse, distributed, and GPU Numpy-style arrays.
1
89
272
@mrocklin
Matthew Rocklin
5 years
Are we allowed to drop Python 2 now?
22
35
258
@mrocklin
Matthew Rocklin
4 years
Every few months Dask gets a PR titled "Pandas Compatibility" from @TomAugspurger . This is the kind of cross-project maintenance that is crucial for smooth operation across the #PyData ecosystem, but often goes unnoticed. Thanks Tom!
2
19
252
@mrocklin
Matthew Rocklin
3 years
Ooh, Databricks wrote a negative piece on Dask. Fun! I've always avoided negative messaging around other projects. Aside from tone, it's also incredibly hard to do well/honestly. I'll just leave this here:
@databricks
Databricks
3 years
How much do you know about Koalas? And no, we're not talking about these (even though they're pretty cool): 🐨 We compared the performance of Koalas (PySpark) and Dask – and the results are in!
5
16
43
23
40
250
@mrocklin
Matthew Rocklin
7 years
Hello world. Please stop calling multi-dimensional arrays "tensors". This angers mathematicians and physicists to no end.
21
93
238
@mrocklin
Matthew Rocklin
5 years
I enjoyed making this screencast accelerating a processing pipeline several orders of magnitude with Numba, Dask, and CuPy/RAPIDS. I did it all in a couple of hours. It was a blast to use everything together.
1
56
238
@mrocklin
Matthew Rocklin
4 years
I guess now that Pandas is at 1.0 we can finally use it in production ;)
@pandas_dev
pandas
4 years
Pandas 1.0 is here! * Read the release notes: * Read the blogpost reflecting on what 1.0 means to our project: * Install with conda / PyPI: Thanks to our 300+ contributors to this release.
17
853
2K
2
17
199
@mrocklin
Matthew Rocklin
5 years
Why I avoid Slack, and encourage developers to hold technical conversations on GitHub.
8
52
174
@mrocklin
Matthew Rocklin
5 years
Slides on four libraries for high performance Python and how they play together, delivered at @PASC_Conference - @numba_jit - @dask_dev - @rapidsai - @openucx
2
61
170
@mrocklin
Matthew Rocklin
4 years
It's ok to not be productive right now. When I experience stress I usually respond by working harder. My stress is usually work related, so this usually makes sense. But right now working harder doesn't help the situation. Taking time to relax and assess mental health does.
4
29
163
@mrocklin
Matthew Rocklin
3 years
Looking at job applications I find that I'm more drawn to candidates with GitHub accounts over candidates with PhDs and CVs. It's not because the work is more impressive. It's because I can more rapidly assess fit with GitHub than with papers.
7
8
163
@mrocklin
Matthew Rocklin
3 years
Earlier this year I realized that my job is now to hire smart people, get them to talk to each other, and then step out of the way most of the time.
8
9
157
@mrocklin
Matthew Rocklin
5 months
The OG Xarray+Pangeo+Dask gang at AGU It's been a good decade 🙂
4
10
148
@mrocklin
Matthew Rocklin
5 years
Prefect is what happens when an Airflow developer and a Dask developer get together to make a data engineering platform. I've enjoyed watching this company develop and am excited to see how people use it, now that the core is open source.
@PrefectIO
Prefect
5 years
To celebrate its second anniversary, Prefect is now open source! What will you build?
3
18
95
2
30
142
@mrocklin
Matthew Rocklin
4 years
I'm taking this week off and will be unresponsive to email/GitHub/tweets. Cheers everyone!
4
2
142
@mrocklin
Matthew Rocklin
6 months
Spark, Dask, DuckDB, Polars: TPC-H Benchmarks at scale This is the result of a couple weeks of work comparing large data frameworks on benchmarks ranging in size 10GB to 10TB. No project wins. It's really interesting analyzing results though.
4
27
142
@mrocklin
Matthew Rocklin
5 years
If we care about global warming then we should be open to Nuclear Power. Renewables are awesome, but they're not likely to solve the problem on their own. I support Nuclear because I care deeply about the environment. I encourage you to do the same.
@SylvainCorlay
Sylvain Corlay
5 years
This is the worldwide CO2 emission per capita. Do you know why the French emit three times less CO2 per capita than North Americans?
3
11
34
12
36
138
@mrocklin
Matthew Rocklin
5 years
Two short blogposts on Python + HPC: 1. Five reasons to keep your HPC center rather than transition to the cloud 2. How to retrofit an HPC center for interactive data science with Conda, Jupyter{Hub,Lab}, and Dask
5
55
137
@mrocklin
Matthew Rocklin
7 years
Accelerating Geospatial analysis in Python with #GeoPandas #Cython and #Dask A recent project with @jorisvdbossche
3
58
138
@mrocklin
Matthew Rocklin
6 years
New Blogpost: The case for Numba in community code
14
54
128
@mrocklin
Matthew Rocklin
5 years
I'm conda updating numpy in a big environment. Previously the slowness of this caused me frustration. However, now that Conda gives live reports about its work I suddenly have great compassion for it. Go conda go! I guess UX really does matter.
4
22
127
@mrocklin
Matthew Rocklin
5 years
The role and best practices for open source software maintainers: It's more about social interactions and logistics than algorithms and writing code.
2
45
120
@mrocklin
Matthew Rocklin
5 years
Early experiments with and a plan for with Dask, RAPIDS, and GPU Dataframes.
2
34
116
@mrocklin
Matthew Rocklin
3 years
As a developer I haven't had to think about Python 2 compatibility for a long while now. Thank you world for taking the trouble to switch.
4
1
113
@mrocklin
Matthew Rocklin
4 years
A short blogpost about writing to a short attention span:
9
33
112
@mrocklin
Matthew Rocklin
5 years
New blogpost on Dask Arrays, Numba, smoothing images, and using NumPy's generalized universal functions to make complex computation more accessible for users.
1
28
107
@mrocklin
Matthew Rocklin
5 years
Opportunities and challenges of GPU-enabled #PyData My first impressions since joining NVIDIA. GPUs are truly impressive, but there are social and technical challenges to integrating them into an ecosystem.
5
34
110
@mrocklin
Matthew Rocklin
5 years
Blogpost encouraging people to write more short blogposts: Writing is good. Writing is easy if we don’t make a big deal out of it.
7
23
111
@mrocklin
Matthew Rocklin
3 years
High performance processing pipelines I often refer to this video demo. It uses Dask, Numba, GPUs, and asyncio to accelerate an image processing pipeline several orders of magnitude.
4
20
109
@mrocklin
Matthew Rocklin
2 years
A new post on the historical mission of #SciPy , our success, and some ideas from community members on what to focus on next.
5
30
109
@mrocklin
Matthew Rocklin
5 years
Numpy drops Python 2
2
49
107
@mrocklin
Matthew Rocklin
6 years
New Blog: HDF in the cloud An analysis of the challenges and potential solutions for scientific data on cloud storage
7
66
104
@mrocklin
Matthew Rocklin
5 years
Recording of my talk "Refactoring the SciPy Ecosystem for Heterogeneous Computing" at #SciPy2019 Two main points: - GPUs are neat - We should focus on standards and protocols
1
41
106
@mrocklin
Matthew Rocklin
2 years
#Dask developers hanging out at #SciPy2022 It's absolutely amazing to see these folks in person again.
0
11
101
@mrocklin
Matthew Rocklin
4 years
New post: Seven stages of openness in open source software
3
41
102
@mrocklin
Matthew Rocklin
4 years
I'm glad to see that Pandas 1.0 includes optional support for @numba_jit for accelerated Python. I thought I'd repost this blog from two years ago on The Case for Numba in Community Code
1
21
100
@mrocklin
Matthew Rocklin
6 years
Sparse ND-Arrays in Python, built on NumPy and Scipy.sparse. Release notes for v0.2: Thanks to new maintainer @hameerabbasi !
0
35
99
@mrocklin
Matthew Rocklin
6 years
Credit models with Dask. A guest post by Rich Postelnik using dask to build credit modelling computational systems. A fine example of complex "Big Data" workflows that fall outside of big dataframe/array/sql paradigms.
1
37
98
@mrocklin
Matthew Rocklin
7 years
Dask and Pandas and XGBoost: playing nicely between distributed systems. #PyData
0
42
98
@mrocklin
Matthew Rocklin
5 years
Slides for my talk at #GTC2019 about scaling GPU analytics in Python with @dask_dev and @rapidsai
2
31
96
@mrocklin
Matthew Rocklin
4 years
I'm sitting in a room full of really senior Scientific Python developers. I'm pretty sure that all of us are spending our day writing proposals.
5
4
98
@mrocklin
Matthew Rocklin
6 years
Note that #Dask arrays support any Numpy-like containers. It would be great to see someone parallelize around CuPy for general purpose distributed GPU arrays in Python.
@teoliphant
Travis Oliphant
6 years
I've been impressed recently with Chainer and CuPy . I like the fact that NumPy *is* the CPU implementation for Chainer and CuPy is NumPy on GPU. #NumPy #Python #MachineLearning #ArtificalIntelligence
2
115
302
1
28
98
@mrocklin
Matthew Rocklin
3 years
My workspace leveled up recently
7
2
97
@mrocklin
Matthew Rocklin
4 years
Sales tip for engineers: Ask more questions before you dive into deep technical answers. Life tip for anyone: listen more, speak less Single-page blogpost version here:
3
18
97
@mrocklin
Matthew Rocklin
6 years
Blogpost: Write dumb code A summary of several conversations that I (and many others) have frequently when maintaining software projects.
6
50
96
@mrocklin
Matthew Rocklin
3 years
This is great! One challenge to integrating GPUs into the Python ecosystem is the lack of uniform API. PyTorch, RAPIDS, Numba all wrap cuda separately so there is no generic way to manipulate the GPU. This advance enables ecosystem libraries like Dask to operate generically.
@gmarkall
Graham Markall
3 years
The NVIDIA CUDA Python runtime, driver, and NVRTC API bindings are now in a public Github repo: Docs: Background:
3
59
255
0
14
95
@mrocklin
Matthew Rocklin
4 years
Short blogpost on OSS growth metrics, VC funding, and a couple of awkward conversations.
5
11
91
@mrocklin
Matthew Rocklin
4 years
I like this article not only because Dask does well, but also because it was written by someone who wasn't associated with any of the existing projects. Benchmarks run by people with a vested interest are almost always incorrect. Accidental bias is too easy.
@RealUncleData
Tomas Peluritis
4 years
Did a comparison on Spark, Dask, Pandas, Koalas, Modin. Check it out: #dataengineering #spark #dask #modin #pandas #dataframes #python #performance
8
31
136
4
8
91
@mrocklin
Matthew Rocklin
2 years
Write tests. For reasons other than just correctness
5
20
88
@mrocklin
Matthew Rocklin
5 years
A small blogpost on avoiding code indirection for readability I request this often during code review, and wanted a writeup to point to in the future.
4
32
88
@mrocklin
Matthew Rocklin
7 years
Slides for my #SciPy2017 talk on Advanced Techniques in #Dask
0
40
84
@mrocklin
Matthew Rocklin
1 year
Video visualizing 1,000,000,000 points This is a common example with Datashader. This video is different in that it focuses on performance tuning. When we start, an update takes 40s. When we're done, it takes ~1s. High performance requires thought.
7
27
85
@mrocklin
Matthew Rocklin
3 years
I'm moving back to Austin, TX this weekend. Yesterday I drove through CA, AZ, NM. Today, West Texas.
11
0
85
@mrocklin
Matthew Rocklin
6 years
Anti-Hype: Remember that FORTRAN-77 supported tensors. It called them Arrays. Almost all languages have supported them natively since without the need for fanfare.
8
17
84
@mrocklin
Matthew Rocklin
5 years
Slides from my #SciPy2019 talk: "Refactoring the SciPy Ecosystem for Heterogeneity"
2
40
85
@mrocklin
Matthew Rocklin
3 years
Small Docker Images with Conda This small blogpost from @jcristharif is old, but keeps coming up at work. I thought I'd retweet it out here for others.
6
14
85
@mrocklin
Matthew Rocklin
6 years
Slides for my #AnacondaCon talk on Real Time Processing with #Dask It includes a Binder environment that includes all of the examples as live runnable notebooks. (see futures and streaming dataframes in particular)
2
24
82
@mrocklin
Matthew Rocklin
5 years
A short blogpost motivating and highlighting simple HTML outputs in Jupyter
4
18
82
@mrocklin
Matthew Rocklin
5 years
Slides from my #AnacondaCON talk on challenges and opportunities for the #PyData Ecosystem and GPUs
3
25
79
@mrocklin
Matthew Rocklin
1 year
Tweet media one
3
7
80
@mrocklin
Matthew Rocklin
4 years
Coiled is growing We're expanding our engineering teams on both the web/product side and the Dask-OSS/support side. We're looking for both experts and dabblers. Come play!
4
42
82
@mrocklin
Matthew Rocklin
7 years
New post: Distributed Pandas Dataframes with #Dask #PyData Includes interactive profiling @BokehPlots
2
46
80
@mrocklin
Matthew Rocklin
5 years
The @rapidsai / @dask_dev joint team is working on high performance networking. We were spending too much time interpreting text printouts to see bandwidth across nodes. Now we have live visual diagnostics with Dask
4
16
81
@mrocklin
Matthew Rocklin
8 years
Screenshots using #Dask with JupyterLab
0
34
79
@mrocklin
Matthew Rocklin
4 years
The Dask project is looking to hire a community developer to support a variety of ongoing work in the life sciences. Does this interest you, or someone that you know? Please apply.
1
61
79
@mrocklin
Matthew Rocklin
6 years
Small post on how to craft a bug report that a package maintainer will appreciate: I wrote this to summarize common anti-patterns for new users, but thought that others might find it helpful as well.
4
44
78
@mrocklin
Matthew Rocklin
4 years
"How many people use your OSS project?" This is a critical question for funding OSS, and is also really hard to answer well. This blogpost is my attempt to answer it honestly for Dask.
10
17
76
@mrocklin
Matthew Rocklin
5 years
I'm excited about the developments in @openucx and UCX-Py for high performance communication in Python. This should help @dask_dev users in HPC, but also many others in Python and beyond. Thanks Akshay Venkatesh and @TomAugspurger for your work.
2
20
72
@mrocklin
Matthew Rocklin
4 years
There is no higher praise than "I happily stopped using the software I built to use someone else's software"
1
4
74
@mrocklin
Matthew Rocklin
3 years
I miss the forced idleness of air travel. After a couple weeks of intense productivity, being trapped in an internet-less metal tube suspended miles above the earth sounds like bliss.
11
2
74
@mrocklin
Matthew Rocklin
4 years
I have a joke about open source software, but I'm pretty sure that someone else has already made it for me.
3
4
74
@mrocklin
Matthew Rocklin
6 years
New blogpost about the changing relationship between public research institutions and open source software
0
30
74
@mrocklin
Matthew Rocklin
3 years
New screencast on a recent Dask feature that minimizes memory use through smarter scheduling This is having a huge impact in a surprising number of common workloads.
4
19
71
@mrocklin
Matthew Rocklin
7 years
In my opinion this is the biggest feature to hit Spark for Python users in a very long while
@databricks
Databricks
7 years
Introducing Vectorized UDFs for #PySpark : How to run your native Python code with PySpark, fast. #python
2
114
147
0
35
70
@mrocklin
Matthew Rocklin
4 years
This Friday @CoiledHQ engineers are collectively taking a day off. We found that people weren't taking vacation time, either due to being on a small team (it's hard to be the only one out) or due to COVID (vacations are less fun). So we're experimenting with extra days off.
3
5
72
@mrocklin
Matthew Rocklin
6 years
Recently I've been seeing public institutions and OSS developers build around proprietary APIs in ways that slightly concern me. I wrote down some thoughts on avoiding cloud lock-in by adopting open standards:
0
29
71
@mrocklin
Matthew Rocklin
5 years
Toolz was my first major OSS project (it has a wider install base than Dask). Today, after several years of bug-free constant use, I discovered a bug. It was easy to fix, but an odd blast from the past. Also, the entire toolz test suite runs in 1.2s. I really miss tiny projects
3
3
71
@mrocklin
Matthew Rocklin
6 years
Slides for my #PyData NYC talk on streaming processing in Python
0
20
69
@mrocklin
Matthew Rocklin
5 years
Now with GPUs! The speedups for stencil computations on GPUs were surprisingly nice for this problem. This was also my first time writing CUDA code in Python with @numba_jit , which was a pleasure. Thanks @numba_jit !
@mrocklin
Matthew Rocklin
5 years
New blogpost on Dask Arrays, Numba, smoothing images, and using NumPy's generalized universal functions to make complex computation more accessible for users.
1
28
107
1
17
68
@mrocklin
Matthew Rocklin
4 years
You don't need to apologize for having your kids interrupt business calls. Someone did this yesterday and I was surprised. I think that we're past that now. Kids are just part of the modern workplace. It's like hearing a slack message in the background.
2
7
69
@mrocklin
Matthew Rocklin
6 years
Slides and demonstrations for my keynote at the AMS Python symposium on scaling Dask and XArray to HPC and cloud systems.
2
30
67
@mrocklin
Matthew Rocklin
8 years
Really been enjoying building a Dask.distributed web UI with the @BokehPlots server. This just landed in master.
3
48
68
@mrocklin
Matthew Rocklin
7 years
New post: Distributed NumPy on a cluster with image analysis. #Dask #PyData
1
36
66
@mrocklin
Matthew Rocklin
4 years
I stuttered until age 16, when it miraculously went away (there are a few situations which trigger it still) Every time I see a fellow stutterer my heart aches. Stuttering feels like really wanting to chase after someone in sports, but with your shoelaces tied.
5
3
66