Leland McInnes

@leland_mcinnes

5,768
Followers
813
Following
79
Media
3,723
Statuses

A mathematician dabbling in the world of data science. Researcher at the Tutte Institute for Mathematics and Computing. UMAP, HDBSCAN, PyNNDescent. He / Him.

Ottawa, Ontario
Joined October 2016
@leland_mcinnes
Leland McInnes
6 years
Our paper on UMAP, a faster alternative to t-SNE, is now up on arXiv! The paper provides a more detailed account of the theoretical underpinnings of the algorithm, as well as performance benchmarks.
15
575
1K
@leland_mcinnes
Leland McInnes
4 years
The first release candidate for UMAP 0.4 is out providing lots of new features, including performance improvements, embedding to different manifolds, inverse transform, and plotting tools.
10
360
1K
@leland_mcinnes
Leland McInnes
3 years
The latest version of umap-learn is now out. Version 0.5 includes some major new features, including ParametricUMAP, DensMAP, AlignedUMAP, model composition, and model updating. Thank you to everyone who contributed! 1/14
9
305
1K
@leland_mcinnes
Leland McInnes
5 months
The landscape of the Machine Learning section of ArXiv.
23
168
796
@leland_mcinnes
Leland McInnes
5 years
Understanding UMAP - an interactive introduction to the algorithm and how to use (and mis-use) it from @_coenen and @adamrpearce. A must read for anyone interested in dimension reduction.
7
232
655
@leland_mcinnes
Leland McInnes
4 years
UMAP 0.4 is now out! It includes a host of new features, including plotting support, better sparse data support, inverse transforms, and embedding to non-euclidean manifolds.
pip install umap-learn
See this thread for some of the new features:
@leland_mcinnes
Leland McInnes
4 years
UMAP 0.4 supports embedding to non-Euclidean manifolds, including spheres, Poincare disks, and more.
8
40
131
5
177
585
@leland_mcinnes
Leland McInnes
5 years
An updated and significantly expanded version of our UMAP paper is now on arXiv: More explanation, algorithm descriptions, and more experiments looking at stability, and working directly on high dimensional data -- as high as 1.8 million dimensional data!
11
230
574
@leland_mcinnes
Leland McInnes
4 months
Introducing DataMapPlot for creating beautiful presentation ready plots of data maps. 🧵
10
110
551
@leland_mcinnes
Leland McInnes
3 months
14
105
518
@leland_mcinnes
Leland McInnes
6 years
UMAP version 0.3 is now available. You can now add new data to an existing embedding, embed using labelled data, or use both features for metric learning. Documentation is on readthedocs: .
8
187
502
@leland_mcinnes
Leland McInnes
2 years
Ever needed a few more colours than the standard colour cycle for your plot? Ever wanted a categorical colour palette based around your own custom colours? With glasbey you can create and extend custom categorical colour palettes with ease.🧵
12
60
471
@leland_mcinnes
Leland McInnes
7 years
The new numba based version of UMAP is out. Now faster than ever, it takes only 2.5 minutes to embed the full 70000 points of the 784-dimensional "Fashion MNIST" dataset.
11
173
466
@leland_mcinnes
Leland McInnes
4 years
Here's a really nice, simple, intuitive explanation of the HDBSCAN clustering algorithm:
4
75
296
@leland_mcinnes
Leland McInnes
5 years
Really enjoying the #mlprague conference. Slides for my talk on topological approaches to unsupervised learning problems can be found here:
6
66
260
@leland_mcinnes
Leland McInnes
3 months
A major update for DataMapPlot adds interactive plots. See for an example. Let's dig in to what you can do with DataMapPlot 0.2 ... 🧵
6
56
251
@leland_mcinnes
Leland McInnes
4 years
Pynndescent, an approximate nearest neighbor search library, got a major update recently. Index construction is now multicore by default. Querying is now much faster -- competitive with some of the fastest ANN libraries around. (1/4)
4
49
243
@leland_mcinnes
Leland McInnes
2 years
A new round of Approximate Nearest Neighbour search benchmarking by is out, including lots of new libraries and algorithms. It is good to see PyNNDescent still performing very well.
5
38
192
@leland_mcinnes
Leland McInnes
6 years
I just started playing around with @datashader edge bundling for visualizing graphs associated to UMAP embeddings. Here's one for MNIST:
7
29
182
@leland_mcinnes
Leland McInnes
5 years
My talk at PyData NYC on dimension reduction is now available. Hopefully it provides a useful basic taxonomy to help people navigate the vast zoo of dimension reduction techniques.
6
42
180
@leland_mcinnes
Leland McInnes
4 years
This is some amazing work from @tim_sainburg . Some major takeaways: - lightning fast transform/inverse_transform operations (comparable to PCA if you have a GPU); - semi-supervised classification: 97.8% accuracy on MNIST with only 4 labelled items per class!
@tim_sainburg
Tim Sainburg
4 years
New paper "Parametric UMAP: learning embeddings with deep neural networks for representation and semi-supervised learning" with @leland_mcinnes and @TqGentner ! 1/
4
61
295
1
38
153
@leland_mcinnes
Leland McInnes
1 year
Have you been frustrated that HDBSCAN doesn't use all your cores, or is too slow? Fast-hdbscan is a numba based version of HDBSCAN that can use all your cores and significantly outperform the hdbscan python package for low-d Euclidean data.
2
18
146
@leland_mcinnes
Leland McInnes
4 years
UMAP 0.4 supports embedding to non-Euclidean manifolds, including spheres, Poincare disks, and more.
8
40
131
@leland_mcinnes
Leland McInnes
7 years
Initial experimental version of UMAP code now on github: . Aiming for better dimension reduction than t-SNE.
7
49
123
@leland_mcinnes
Leland McInnes
6 years
Here's a useful term when looking at t-SNE/UMAP plots ...
@DataSciFact
Data Science Fact
6 years
Apophenia: The tendency to see patterns in random data.
5
178
436
2
33
122
@leland_mcinnes
Leland McInnes
2 months
The landscape of Machine Learning on ArXiv: Now available in a zoomable, searchable version with paper titles on hover.
@leland_mcinnes
Leland McInnes
5 months
The landscape of the Machine Learning section of ArXiv.
23
168
796
0
27
116
@leland_mcinnes
Leland McInnes
3 years
Playing with some nlp related tools I've been working on, I ended up with some nice visualizations. This is Top2Vec style topic words on a UMAP layout of 20-newsgroups document vectors using masked word-clouds for each newsgroup.
3
19
110
@leland_mcinnes
Leland McInnes
5 years
Using UMAP to make neural net activation spaces more interpretable.
@OpenAI
OpenAI
5 years
In collaboration with Google, we're releasing Activation Atlases: a new technique for visualizing what interactions between neurons can represent. 💻Blog: 📝Paper: 🔤Code: 🗺️Demo:
17
852
2K
2
22
107
@leland_mcinnes
Leland McInnes
5 years
If you want to spend some time exploring a UMAP embedding of images (like MNIST) @GrantCuster put together a nice tool:
2
37
104
@leland_mcinnes
Leland McInnes
4 years
A great introduction to HDBSCAN and density based clustering:
0
22
101
@leland_mcinnes
Leland McInnes
3 years
My #scipy2021 talk on PyNNDescent, a library for fast approximate nearest neighbour search is now available:
5
23
99
@leland_mcinnes
Leland McInnes
6 years
What if I redesigned HDBSCAN from scratch based on the theory behind UMAP? Apparently it might actually work fine and look something like this:
3
23
99
@leland_mcinnes
Leland McInnes
6 years
Simplicial Autoencoders using UMAP theory to build better autoencoders (and a nice introduction to UMAP as well):
3
30
98
@leland_mcinnes
Leland McInnes
3 years
The first release candidate for umap-learn 0.5 is out. Take the opportunity to verify the new version works for you.
3
14
96
@leland_mcinnes
Leland McInnes
3 years
A great example of what UMAP is for: look at your data and realise it wasn't what you thought -- and then use it to ask better questions about your data before proceeding with fancier ML tools.
@alexijielu
Alex Lu
3 years
It was only when we visualized the UMAP that we got suspicious: the representations of all IDRs split into two big blobs. That's when we decided to interpret the features, and then we realized: half the features had a big "M" capturing the start methionine.
1
3
10
0
21
95
@leland_mcinnes
Leland McInnes
5 years
This is a nice way to get some sense of what UMAP is doing at least for low dimensional data.
@MaxNoichl
Max Noichl
5 years
2D UMAP of a 3D woolly mammoth, to build intuitions about how features are preserved in dimensionality reduction. Wonderful 3D scan from the people at @3D_Digi_Si .
5
28
77
0
26
89
@leland_mcinnes
Leland McInnes
3 years
Hypergraphs and simplicial complexes are going to become ever more prevalent. Here's a great article on some of the reasons why they are so interesting.
2
21
85
@leland_mcinnes
Leland McInnes
7 years
Embedding the MNIST test set with a new manifold learning approach. Captures more global structure than t-SNE.
6
27
83
@leland_mcinnes
Leland McInnes
5 years
I'm considering dropping python 2.7 support for hdbscan and umap-learn. Let me know if this would be extremely painful for you. Also let me know if this would make you happy.
16
1
84
@leland_mcinnes
Leland McInnes
4 years
I really want to emphasize how amazing @numba_jit is. Pynndescent is pure python code relying on numba for acceleration. It is performance competitive with *highly optimized* C++ code. I still can't actually believe how incredibly well numba works!
3
20
80
@leland_mcinnes
Leland McInnes
4 years
Delivery is apparently a little slower to Canada, but I finally got my copy of @math3ma 's book! Certainly worth the wait...
2
5
78
@leland_mcinnes
Leland McInnes
5 years
Suppose UMAP could represent data not as 2d points, but as 2d gaussians with a full covariance matrix. Would that be useful? What would be the best way to represent that visually?
11
7
77
@leland_mcinnes
Leland McInnes
4 years
I have been revisiting pynndescent recently, and with help from the @numba_jit team I managed to get some significant performance gains. Preliminary tests on @fulhack 's ann-benchmarks are looking very promising. Hopefully I'll have a new 0.5 release with these changes out soon.
2
17
76
@leland_mcinnes
Leland McInnes
2 months
I just added support for these binary embedding vectors to pynndescent. Using them directly with UMAP should be possible very soon...
@Nils_Reimers
Nils Reimers
2 months
🚀 𝐂𝐨𝐡𝐞𝐫𝐞 𝐄𝐦𝐛𝐞𝐝 𝐕𝟑 - 𝐢𝐧𝐭𝟖 & 𝐛𝐢𝐧𝐚𝐫𝐲 𝐒𝐮𝐩𝐩𝐨𝐫𝐭🚀 I'm excited to launch our native support for int8 & binary embeddings for Cohere Embed V3. They slash your vector DB cost 4x - 32x while keeping 95% - 100% of the search quality.
14
73
437
4
8
74
@leland_mcinnes
Leland McInnes
3 years
Plots not meta enough? Here is a nice UMAP plot of different plots. From "Viral Visualizations: How Coronavirus Skeptics Use Orthodox Data Practices to Promote Unorthodox Science Online"
2
16
73
@leland_mcinnes
Leland McInnes
4 years
Support for an "inverse transform" has been added to UMAP 0.4, providing the ability to generate a high dimensional representation of a point in the embedding space.
2
18
66
@leland_mcinnes
Leland McInnes
3 years
AlignedUMAP allows sequences of different UMAP embeddings to be aligned with each other according to relations among the datasets. This can be particularly useful for situations such as time evolving data. 7/ 14
8
9
68
@leland_mcinnes
Leland McInnes
4 years
An upcoming feature currently in the 0.5dev branch of UMAP will make this much easier to do. e.g.
mapper1 = umap.UMAP(metric='euclidean').fit(continuous_data)
mapper2 = umap.UMAP(metric="dice").fit(discrete_data)
consensus_mapper = mapper1 * mapper2
@NikolayOskolkov
Nikolay Oskolkov
4 years
I just published UMAP for Data Integration
2
57
183
0
14
67
@leland_mcinnes
Leland McInnes
6 years
A paper in @JOSS_TheOJ for the UMAP software implementation is now published: . Thanks to the editors ( @arokem ) and reviewers ( @TerryTangYuan ) for providing such a smooth process for publication.
1
23
65
@leland_mcinnes
Leland McInnes
4 years
@DrPattiJones PCA provides a global linear projection onto the hyperplane defined by the directions of global maximal variance in your data. UMAP attempts to stitch together many local views of the data accounting for local variance, into an intermediate structure, then represent that in low D
3
3
61
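The "global linear projection" half of the reply above can be illustrated with a small sketch. This is a minimal pure-Python illustration rather than code from any library: the function names `top_principal_direction` and `project` and the toy data are assumptions made for the example, and only the first principal direction is found, using power iteration on the sample covariance matrix.

```python
import math


def top_principal_direction(points, iterations=200):
    """Return (mean, unit vector) for the direction of maximal variance."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - mean[j] for j in range(d)] for p in points]
    # Sample covariance matrix of the centred data.
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d  # arbitrary start vector (hypothetical choice)
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("data has no variance along the start vector")
        v = [x / norm for x in w]
    return mean, v


def project(point, mean, direction):
    """Coordinate of a point along the principal direction."""
    return sum((a - m) * u for a, m, u in zip(point, mean, direction))


# Toy data lying exactly on the line y = 2x.
data = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
mean, direction = top_principal_direction(data)
scores = [project(p, mean, direction) for p in data]
```

Every point is mapped with the same `mean` and `direction`, which is what makes the projection global and linear. UMAP's local, stitched-together view is not shown here.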
@leland_mcinnes
Leland McInnes
6 years
Inspired by the t-SNE animation from @ChaseClarkatUIC I decided to try something similar for UMAP. Here is an animation for varying values of the n_neighbors parameter. Increasing values give more weight to global structure over local structure.
2
19
59
@leland_mcinnes
Leland McInnes
6 years
UMAP now has 1,000 github stars! Thanks to all the users and contributors! There are more features coming in version 0.3 soon, and some exciting ones in very early development.
1
11
59
@leland_mcinnes
Leland McInnes
3 years
@ch402 @SuhnyllaKler @AnthropicAI An example of current work: is linear optimal transport applied to word vectors a decent sentence/document embedding model? It turns out yes, yes it is. There's still a long way to go to scale and benchmark on larger datasets, but it's promising.
4
11
56
@leland_mcinnes
Leland McInnes
1 month
A new minor release of umap-learn adds some very useful features: - Updating ParametricUMAP to Keras3 (kindly submitted by @fchollet ); - Initial support for binary embedding vectors with metric="bit_hamming" and metric="bit_jaccard".
1
3
58
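For readers unfamiliar with distances on packed binary vectors, here is a minimal sketch of what Hamming and Jaccard distances mean when each bit is one feature. It uses common textbook definitions and is an assumption about the general idea, not a copy of how umap-learn or pynndescent implement `bit_hamming` and `bit_jaccard`. In particular, any normalisation those libraries apply is not shown, and the function names below are hypothetical.

```python
def popcount(value):
    """Count the set bits in a non-negative integer."""
    return bin(value).count("1")


def bit_hamming(a, b):
    """Number of differing bits between two equal-length packed bit vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(popcount(x ^ y) for x, y in zip(a, b))


def bit_jaccard(a, b):
    """1 - |intersection| / |union| over the set bits of two packed vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    intersection = sum(popcount(x & y) for x, y in zip(a, b))
    union = sum(popcount(x | y) for x, y in zip(a, b))
    return 0.0 if union == 0 else 1.0 - intersection / union


# Two 8-bit embeddings packed into a single byte each.
vec_a = bytes([0b11110000])
vec_b = bytes([0b11000011])
hamming = bit_hamming(vec_a, vec_b)
jaccard = bit_jaccard(vec_a, vec_b)
```

Packing eight features per byte is what makes binary embeddings so compact, and both distances reduce to bitwise operations plus a bit count.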
@leland_mcinnes
Leland McInnes
3 years
I'll be giving a talk on PyNNDescent, a library for approximate nearest neighbour search, at #SciPy2021 on Friday.
0
10
54
@leland_mcinnes
Leland McInnes
5 years
Code from my lightning talk: ensemble topic modelling in Python with pLSA for fast stable topic modelling with the enstop package: #SciPy2019
1
17
51
@leland_mcinnes
Leland McInnes
3 years
HDBSCAN is now in RAPIDS!
@RAPIDSai
RAPIDS AI
3 years
Out now, RAPIDS release 21.06! New #cuML and #cuGraph algorithms, new list functionality, a whole new way to measure @RAPIDSai progress with the change to CalVer, and much more!
1
24
67
2
6
53
@leland_mcinnes
Leland McInnes
6 years
It's well worth reading the paper on FIt-SNE -- useful techniques and fun math.
@GCLinderman
George C. Linderman
6 years
@F_Vaggi @leland_mcinnes FIt-SNE uses an O(N) interpolation scheme to accelerate the computation of the gradient at each step. More details are available in the preprint () or some notes I wrote ()
1
1
21
1
9
49
@leland_mcinnes
Leland McInnes
6 years
I belatedly got to experimenting with FIt-SNE from @GCLinderman . It's very impressive and very fast -- definitely the implementation you should be using if you want to use t-SNE for visualization.
1
11
48
@leland_mcinnes
Leland McInnes
6 years
Good news for #rstats users looking for dimension reduction: An R package wrapping UMAP: ; and an independent implementation of UMAP in R: !
0
25
46
@leland_mcinnes
Leland McInnes
5 years
Thanks also go to James Melville, author of the UWOT implementation of UMAP for R (), who has joined as a co-author.
0
9
45
@leland_mcinnes
Leland McInnes
4 years
The ambient coordinates of your data (coming from features) need not be related to the intrinsic notion of distance internal to the data itself. An idea worth wrapping your head around.
@TopologyFact
Topology Fact
4 years
'It's not so easy to free oneself from the idea that coordinates must have an immediate metrical meaning.' -- Albert Einstein
0
25
109
4
14
44
@leland_mcinnes
Leland McInnes
6 years
Really interesting to see UMAP on real-world data!
@EvNewell1
Evan Newell
6 years
Checkout Etienne Becht's bioRxiv preprint that compares UMAP with t-SNE for visualizing CyTOF and scRNAseq data. Many advantages of UMAP over t-SNE for high dimensional single-cell data! @leland_mcinnes
11
109
227
2
10
43
@leland_mcinnes
Leland McInnes
4 years
Documentation for UMAP 0.4 now includes examples of UMAP usage for visualization, exploratory analysis, and scientific publications. If you have a compelling use case, we would love to include it as well.
4
3
43
@leland_mcinnes
Leland McInnes
5 years
Code from my lightning talk: ensemble topic modelling in Python with pLSA for fast stable topic modelling with the enstop package:
1
10
42
@leland_mcinnes
Leland McInnes
6 years
This was a fantastic series of posts! If you want a well written intro to some of the ideas in topological data analysis this is a great place to start.
@asemic_horizon @scikit_tda @leland_mcinnes I wrote a series of posts leading up to some TDA (see "Topology" section here: ) And then a few posts in the TDA family before I lost steam (see Computational Topology section of )
1
4
36
1
13
42
@leland_mcinnes
Leland McInnes
5 years
Many thanks to @datametrician @cjnolet and @rapidsai for making this possible -- definitely some amazing performance available for UMAP on GPU!
@franschrandez
Nicolas Fernandez
5 years
Reproduced the #UMAP on #RAPIDS example by @ceshine_en () on Colab (with help from ). Seeing 60X speedup on Colab @leland_mcinnes @rapidsai @keithjkraus @datametrician @rodaramburu see Colab Gist
0
13
44
1
18
41
@leland_mcinnes
Leland McInnes
2 years
It is a huge testament to the power of @numba_jit that a pure python library like PyNNDescent can be performance competitive with C++ libraries from Google (ScaNN), Microsoft (DiskANN), and Facebook (FAISS) among others. Many, many thanks to the whole @numba_jit team!
3
6
41
@leland_mcinnes
Leland McInnes
2 years
The glasbey library is on github:
Documentation can be found on readthedocs:
And you can pip install it:
$ pip install glasbey
1
6
39
@leland_mcinnes
Leland McInnes
5 years
An amazing introduction to UMAP and its parameters. This is for UMAP what the Distill article was for t-SNE. Great work from @_coenen and @adamrpearce as always!
@_coenen
Andy Coenen
5 years
Understanding UMAP - a high-level introduction to how the algorithm works, how to use it effectively, and how it compares with t-SNE.
8
181
630
1
19
38
@leland_mcinnes
Leland McInnes
5 years
@rctatman Here's a plan we use: Take the term-frequency matrix, remove the "expected" frequency (by subtracting, or using the column marginal as a noise model), UMAP with hellinger distance, and HDBSCAN for clustering. Still fine tuning the process, but has been very powerful so far.
4
2
39
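The Hellinger distance mentioned in the reply above treats each row of a term-frequency matrix as a probability distribution over terms. The following is a minimal pure-Python sketch of that distance for two count vectors. The function name and toy documents are assumptions for illustration, and the "remove the expected frequency" step from the reply is deliberately not shown.

```python
import math


def hellinger(x, y):
    """Hellinger distance between two non-negative count vectors."""
    total_x, total_y = sum(x), sum(y)
    if total_x == 0 and total_y == 0:
        return 0.0  # two empty documents are treated as identical
    if total_x == 0 or total_y == 0:
        return 1.0  # an empty document shares nothing with a non-empty one
    # Bhattacharyya coefficient of the two normalised distributions.
    overlap = sum(math.sqrt(a * b) for a, b in zip(x, y))
    coefficient = overlap / math.sqrt(total_x * total_y)
    # Clamp to guard against tiny negative values from rounding.
    return math.sqrt(max(0.0, 1.0 - coefficient))


# Toy term counts over a three-word vocabulary.
doc_a = [3, 0, 1]
doc_b = [0, 4, 0]
doc_c = [6, 0, 2]  # same proportions as doc_a, twice the length

d_ab = hellinger(doc_a, doc_b)
d_ac = hellinger(doc_a, doc_c)
```

Because the vectors are normalised by their totals, documents with the same word proportions but different lengths come out at distance zero, which is one reason this metric suits count data.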
@leland_mcinnes
Leland McInnes
3 years
@EmilyTWinn13 @SC_Griffith After the flood Noah is checking up on the animals. They're all breeding well, except for a pair of snakes. Noah gets a little worried and follows them. Eventually they find a fallen tree, and suddenly ... lots of baby snakes. It turns out that adders need logs to multiply.
1
4
39
@leland_mcinnes
Leland McInnes
6 years
@michaelhoffman Many of the t-SNE (and UMAP) plots I see suffer from potential over-plotting issues. This is particularly dangerous if you are trying to eyeball cluster purity. Using such plots as a starting point for further analysis rather than an endpoint is critical.
4
10
38
@leland_mcinnes
Leland McInnes
5 years
I've started telling people "Look at your data, because whatever you think you know about the data is almost certainly wrong". I'm not sure it works any better, but at least I warned them...
@Squared2020
Justin Jacobs
5 years
“Have you tried looking at the data?” is my most common question when talking to folks who are inexperienced with data. Over the last two years, about 90% of the time, the answer has been, “Why?” or “What good would that do?” 🙄
0
2
13
2
11
36
@leland_mcinnes
Leland McInnes
2 years
This is a fascinating paper -- using a contrastive approach on augmentations of images to learn a low dimensional representation they generate truly impressive results for image datasets!
@jnboehm
Nik Böhm
2 years
Ever wondered what image datasets look like if they could be visualized? We have developed a new algorithm for visualization based on contrastive learning. Joint work with @hippopedoid and @CellTypist . The full details are available as a preprint 🧵/16
4
67
269
1
3
37
@leland_mcinnes
Leland McInnes
6 years
A new version of UMAP is now available. A new layout algorithm provides more accurate embeddings even faster than before.
1
17
35
@leland_mcinnes
Leland McInnes
6 years
I'll be speaking at the Fields Institute today on using UMAP theory for general unsupervised learning. I'll be happy to chat more about these ideas afterwards as well.
2
2
37
@leland_mcinnes
Leland McInnes
4 years
Here's a really great interactive article using UMAP to explore and compare large deep neural networks by @mwli16 and @scheidegger :
0
9
35
@leland_mcinnes
Leland McInnes
2 years
I will be co-chairing the machine learning track at SciPy this year. Submissions are open, so if you have a machine learning project in python consider submitting. This is a great opportunity to share your work with a wide audience. @SciPyConf
0
6
36
@leland_mcinnes
Leland McInnes
6 years
Getting close to finishing version 0.3 of UMAP, including some useful new features. Ideally it'll come out just before or at @SciPyConf this year.
2
5
36
@leland_mcinnes
Leland McInnes
4 years
The core neighbor search in UMAP has been expanded upon in a separate library, PyNNDescent, which provides significantly improved performance. Combined with PyNNDescent, UMAP 0.4 now supports multi-core computation end-to-end (MNIST in ~45s on a laptop).
2
7
35
@leland_mcinnes
Leland McInnes
3 years
ParametricUMAP uses a neural network to learn a UMAP embedding. This allows for a number of significant advantages. 2/14
3
5
32
@leland_mcinnes
Leland McInnes
3 years
The approximate nearest neighbour search from UMAP is now fully moved to an ANN library PyNNDescent (). In turn PyNNDescent has seen significant development and is faster, multithreaded, and supports new metrics such as Wasserstein distance. 11/14
1
2
32
@leland_mcinnes
Leland McInnes
4 years
UMAP 0.4 also supports added hyper-parameters allowing you to tune results toward different tasks, such as outlier detection.
1
2
31
@leland_mcinnes
Leland McInnes
4 years
@kareem_carr Clustering is a hard problem, not least because even defining what constitutes a "cluster" is hard. I would be interested to hear your thoughts of what, conceptually, makes a cluster.
1
1
33