Dentist: "You have mild evidence of teeth grinding"
Me: "Oh darn! What could happen?"
Them: "I can't guarantee you exactly what will happen"
Me: "Sure, but for a patient population like me, what is the distribution of outcomes?"
Them: "Medicine doesn't work that way"
Me: "Yes it does"
It occurs to me that dentists do themselves a bit of a disservice by not qualifying their interventions.
How important is it that I floss, really? Saying "really really important" about everything just trains the general population to ignore dental advice.
When I stopped writing papers, and started writing blog posts everything got better.
I had more impact, got more immediate feedback, was offered jobs with increased flexibility, and enjoyed the writing process infinitely more.
@robertsd Do you have a sense of what the distribution of outcomes is for people presenting with mild grinding?
Do you have a sense for the efficacy of various interventions?
"No" is a fine answer (I don't expect you to have numbers off the top of your head) but I'd be curious if you do
Every few months Dask gets a PR titled "Pandas Compatibility" from @TomAugspurger.
This is the kind of cross-project maintenance that is crucial for smooth operation across the #PyData ecosystem, but often goes unnoticed.
Thanks Tom!
Ooh, Databricks wrote a negative piece on Dask. Fun!
I've always avoided negative messaging around other projects. Aside from tone, it's also incredibly hard to do well/honestly.
I'll just leave this here:
How much do you know about Koalas? And no, we're not talking about these (even though they're pretty cool): 🐨
We compared the performance of Koalas (PySpark) and Dask – and the results are in!
I enjoyed making this screencast accelerating a processing pipeline several orders of magnitude with Numba, Dask, and CuPy/RAPIDS.
I did it all in a couple of hours. It was a blast to use everything together.
The new Pandas 1.3 release has great performance improvements for text data, thanks to @ApacheArrow.
I made an 8-minute video about how to enable it here:
Pandas 1.0 is here!
* Read the release notes:
* Read the blogpost reflecting on what 1.0 means to our project:
* Install with conda / PyPI:
Thanks to our 300+ contributors to this release.
It's ok to not be productive right now.
When I experience stress I usually respond by working harder. My stress is usually work related, so this usually makes sense.
But right now working harder doesn't help the situation. Taking time to relax and assess mental health does.
Looking at job applications I find that I'm more drawn to candidates with GitHub accounts over candidates with PhDs and CVs.
It's not because the work is more impressive. It's because I can more rapidly assess fit with GitHub than with papers.
Prefect is what happens when an Airflow developer and a Dask developer get together to make a data engineering platform.
I've enjoyed watching this company develop and am excited to see how people use it, now that the core is open source.
Spark, Dask, DuckDB, Polars: TPC-H Benchmarks at scale
This is the result of a couple weeks of work comparing large data frameworks on benchmarks ranging in size from 10 GB to 10 TB. No project wins. Analyzing the results is really interesting, though.
If we care about global warming then we should be open to Nuclear Power.
Renewables are awesome, but they're not likely to solve the problem on their own.
I support Nuclear because I care deeply about the environment. I encourage you to do the same.
Two short blogposts on Python + HPC:
1. Five reasons to keep your HPC center rather than transition to the cloud
2. How to retrofit an HPC center for interactive data science with Conda, Jupyter{Hub,Lab}, and Dask
I'm conda updating NumPy in a big environment. The slowness of this used to frustrate me.
However, now that Conda gives live reports about its work I suddenly have great compassion for it. Go conda go!
I guess UX really does matter.
New blogpost on Dask Arrays, Numba, smoothing images, and using NumPy's generalized universal functions to make complex computation more accessible for users.
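For context, NumPy's generalized universal functions let you define an operation over core dimensions and broadcast it over everything else. A minimal sketch using `np.vectorize` with a `signature` (the smoothing function and names here are illustrative, not the blogpost's code):

```python
import numpy as np

# A function defined on 1-D windows (core dimension "(n)"),
# lifted into a gufunc-like callable that broadcasts over leading axes.
smooth = np.vectorize(
    lambda row: np.convolve(row, np.ones(3) / 3, mode="same"),
    signature="(n)->(n)",
)

signals = np.random.random((4, 100))  # a stack of 4 signals
result = smooth(signals)              # applied independently to each row
print(result.shape)                   # (4, 100)
```

Because the core-dimension signature is declared up front, libraries like Dask can apply the same function blockwise across a larger array without the user writing any parallel code.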
Opportunities and challenges of GPU-enabled #PyData
My first impressions since joining NVIDIA. GPUs are truly impressive, but there are social and technical challenges to integrating them into an ecosystem.
High performance processing pipelines
I often refer to this video demo. It uses Dask, Numba, GPUs, and asyncio to accelerate an image processing pipeline several orders of magnitude.
Recording of my talk "Refactoring the SciPy Ecosystem for Heterogeneous Computing" at #SciPy2019
Two main points:
- GPUs are neat
- We should focus on standards and protocols
I'm glad to see that Pandas 1.0 includes optional support for @numba_jit for accelerated Python.
I thought I'd repost this blog from two years ago on The Case for Numba in Community Code
Credit models with Dask. A guest post by Rich Postelnik using Dask to build computational systems for credit modeling.
A fine example of complex "Big Data" workflows that fall outside of big dataframe/array/sql paradigms.
Note that #Dask arrays support any NumPy-like container. It would be great to see someone parallelize around CuPy for general-purpose distributed GPU arrays in Python.
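A minimal sketch of the idea, using a NumPy array as the chunk container (on a GPU machine you could hand `da.from_array` a CuPy array instead; the shapes here are illustrative):

```python
import numpy as np
import dask.array as da

# Wrap an in-memory array; each chunk keeps the container type you passed in.
x = da.from_array(np.ones((1000, 1000)), chunks=(500, 500))

# Operations dispatch to the underlying container's own functions,
# so swapping NumPy for a NumPy-like library runs the same code on GPUs.
result = (x + x.T).sum().compute()
print(result)  # 2000000.0
```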
Sales tip for engineers: Ask more questions before you dive into deep technical answers.
Life tip for anyone: listen more, speak less
Single-page blogpost version here:
This is great!
One challenge to integrating GPUs into the Python ecosystem is the lack of a uniform API. PyTorch, RAPIDS, and Numba all wrap CUDA separately, so there is no generic way to manipulate the GPU.
This advance enables ecosystem libraries like Dask to operate generically.
I like this article not only because Dask does well, but also because it was written by someone who wasn't associated with any of the existing projects.
Benchmarks run by people with a vested interest are almost always incorrect.
Accidental bias is too easy.
Video visualizing 1,000,000,000 points
This is a common example with Datashader. This video is different in that it focuses on performance tuning. When we start, an update takes 40s. When we're done, it takes ~1s.
High performance requires thought.
Anti-Hype: Remember that FORTRAN-77 supported tensors. It called them Arrays. Almost all languages have supported them natively ever since, without fanfare.
Small Docker Images with Conda
This small blogpost from @jcristharif is old, but keeps coming up at work. I thought I'd retweet it here for others.
Slides for my #AnacondaCon talk on Real Time Processing with #Dask
It includes a Binder environment with all of the examples as live, runnable notebooks (see futures and streaming dataframes in particular).
Coiled is growing
We're expanding our engineering teams on both the web/product side and the Dask-OSS/support side.
We're looking for both experts and dabblers.
Come play!
The @rapidsai / @dask_dev joint team is working on high performance networking. We were spending too much time interpreting text printouts to see bandwidth across nodes.
Now we have live visual diagnostics with Dask
The Dask project is looking to hire a community developer to support a variety of ongoing work in the life sciences.
Does this interest you, or someone that you know? Please apply.
Small post on how to craft a bug report that a package maintainer will appreciate:
I wrote this to summarize common anti-patterns for new users, but thought that others might find it helpful as well.
Matplotlib is used by 17% of papers on @arXiv.
@tacaswell and I played around with this a couple months ago at a CZI meeting.
Nice work @matplotlib devs!
"How many people use your OSS project?"
This is a critical question for funding OSS, and is also really hard to answer well. This blogpost is my attempt to answer it honestly for Dask.
I'm excited about the developments in @openucx and UCX-Py for high performance communication in Python. This should help @dask_dev users in HPC, but also many others in Python and beyond.
Thanks Akshay Venkatesh and @TomAugspurger for your work.
I miss the forced idleness of air travel.
After a couple weeks of intense productivity, being trapped in an internet-less metal tube suspended miles above the earth sounds like bliss.
New screencast on a recent Dask feature that minimizes memory use through smarter scheduling
This is having a huge impact in a surprising number of common workloads.
This Friday @CoiledHQ engineers are collectively taking a day off.
We found that people weren't taking vacation time, either due to being on a small team (it's hard to be the only one out) or due to COVID (vacations are less fun).
So we're experimenting with extra days off.
Recently I've been seeing public institutions and OSS developers build around proprietary APIs in ways that slightly concern me.
I wrote down some thoughts on avoiding cloud lock-in by adopting open standards:
Toolz was my first major OSS project (it has a wider install base than Dask). Today, after several years of constant, bug-free use, I discovered a bug.
It was easy to fix, but an odd blast from the past.
Also, the entire toolz test suite runs in 1.2s. I really miss tiny projects
Now with GPUs!
The speedups for stencil computations on GPUs were surprisingly nice for this problem. This was also my first time writing CUDA code in Python with @numba_jit, which was a pleasure. Thanks @numba_jit!
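The CUDA version is GPU-specific, but the stencil pattern itself can be sketched with plain NumPy slicing (a 1-D three-point mean; the function and coefficients here are illustrative, not the screencast's code):

```python
import numpy as np

def smooth3(x):
    """Three-point mean stencil: each interior point becomes the
    average of itself and its two neighbors; edges are left as-is."""
    out = x.copy()
    out[1:-1] = (x[:-2] + x[1:-1] + x[2:]) / 3.0
    return out

a = np.array([0.0, 3.0, 6.0, 9.0, 12.0])
print(smooth3(a))  # already linear, so the stencil leaves it unchanged
```

Numba's `@cuda.jit` lets you write the same neighbor-indexing kernel per GPU thread instead of with whole-array slices.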
You don't need to apologize for having your kids interrupt business calls.
Someone did this yesterday and I was surprised. I think that we're past that now.
Kids are just part of the modern workplace. It's like hearing a Slack message in the background.
I stuttered until age 16, when it miraculously went away (there are a few situations which trigger it still)
Every time I see a fellow stutterer my heart aches. Stuttering feels like really wanting to chase after someone in sports, but with your shoelaces tied.