MLing biomolecules en route to structural systems biology. Asst Prof of Systems Biology and CS @Columbia. Prev. @Harvard SysBio; @Stanford Genetics, Stats.
We’ve a new review on DL methods for protein-protein interactions, focused on discovering novel interactions, structurally characterizing known interactions, and designing binders. Work by @JuliaRuRogers and Gergő Nikolényi, who’ve done a great job distilling a huge field. More👇
Can we characterize the full diversity of protein interactions that coordinate cell function? Deep learning is a promising way! @MoAlQuraishi, Gergő Nikolényi, and I review the ecosystem of DL models for protein interaction discovery, elucidation & design
We have successfully trained OpenFold from scratch, our trainable PyTorch implementation of AlphaFold2. The new OpenFold (OF) (slightly) outperforms AlphaFold2 (AF2). I believe this is the first publicly available reproduction of AF2. We learned a lot. A🧵1/12
CASP14 #s just came out and they’re astounding: DeepMind looks to have solved protein structure prediction. Median GDT_TS went from 68.5 (CASP13) to 92.4!!!! Cf. their 2nd best CASP13 struct scored 92.8 (out of 100). Median RMSD is 2.1Å. I think it's over
An announcement I’ve been aching to make! After much sweat, we’ve built a trainable version of AlphaFold2, implemented in PyTorch, which we’re calling OpenFold.
GitHub:
Colab:
Why a trainable version of AlphaFold2 you ask? ⬇️
Now that the #alphafold hype has completely died down (ha!), I've written a new blog post on the AF2 method paper: . This is a technical deep-dive into aspects of AF2 that I find most surprising/innovative and of relevance to broader biomolecular modeling.
Some news! After many lovely years at @harvardmed I'm moving to @Columbia fall 2020 to start a new lab as an Assistant Professor in Systems Biology and the Program for Mathematical Genomics--and I'm recruiting students and postdocs! Email/DM me or see . 1/2
AlphaFold 3 is out! As expected expands coverage to small molecules and nucleic acids. And replaces the structure module with a diffusion-based one. Unfortunately no code or model weights--just a web server for a limited set of ligands:
Baker lab's effort at reproducing AlphaFold2 is out on bioRxiv. Pretty impressive performance gains (relative to original trRosetta), if not quite yet at AlphaFold2 level.
Building on last week’s announcement of OpenFold, an academic-industry consortium is being announced today within the non-profit @openmsf. The OpenFold Consortium will develop open source ML-based molecular modeling tools in a community-driven fashion. 1/3
Deep learning has obviously transformed protein structure prediction, but can it do the same for the rest of biology? In a perspective by @sorger_peter and myself out this week in @naturemethods, we begin to try to answer this question:
We built a new diffusion protein design model named Genie. We preprinted it a while ago (soon after RFDiffusion and Chroma preprints) but kept mum due to embargo. Final ICML version (major update) with code and paper here (1/7)
Interesting status update from DeepMind on AlphaFold (just that, no model, paper, or code). All atom version in the works (similar to RFAA). Meaningful gains on small molecules but far from 'solved' (think AF1 vs AF2). Same w/nucleic acids and antibodies.
New! We’ve just put up a note evaluating the latest, in-development version of AlphaFold (“AlphaFold-latest”). This is a preview - development is still in progress - but performance across a wide range of tasks is striking.
Highlights in the thread.
1/7
I have a new review out on machine learning in protein structure prediction in the past 2 years (not focused on AlphaFold but obviously mentions it), part of a special issue on "Machine Learning in Chemical Biology" in COCHBI edited by @cwcoley and Xiao Wang.
Today with @emblebi, we're launching the #AlphaFold Protein Structure Database, which offers the most complete and accurate picture of the human proteome, doubling humanity’s accumulated knowledge of high-accuracy human protein structures - for free: 1/
Even ~2 years after AlphaFold2's announcement this paper () remains my favorite in the post-AF2 realm. To be sure RFDiffusion is a strong contender and arguably has been more immediately useful, but I strongly believe this work will stand the test of time.
I’m late to my own party but excited to share our new work on predicting SLiM-mediated protein-protein interactions, out today in @naturemethods with Joe Cunningham, @GregKoytiger, and @sorger_peter! A blogpost is forthcoming but for now a tweetstorm (1/8)
Last year we presented #AlphaFold v2 which predicts 3D structures of proteins down to atomic accuracy. Today we’re proud to share the methods in @Nature w/open source code. Excited to see the research this enables. More very soon!
These are for single domains-not whole proteins-and there are a few poor predictions. So corner cases remain but core problem appears solved: 88% of predictions are <4Å, 76% <3Å, 46% <2Å. Unlike last time where there was some competition, this time AF2 was best for 88/97 targets.
Excited to release a preliminary version of ProteinNet, a data set for doing machine learning on protein structure. Aim is to lower the barrier to entry to protein folding, and spur more ML researchers to tackle the problem. More here: (1/3)
Put up new preprint on arXiv () describing ProteinNet, a dataset for doing ML on protein sequence-structure relationships. ProteinNet is already on GitHub () and I hope the preprint sheds greater transparency on how it is constructed.
Haven't looked at in detail but appears very interesting. Claims that AF2 learns an energy function for proteins independent of MSAs, while MSAs are used primarily (and implicitly) by AF2 to solve the global search problem.
Glad to see @DeepMindAI’s AlphaFold paper finally out. I had the pleasure of being one of the reviewers and getting to write the accompanying @NatureNV article. The future of protein structure prediction is looking very bright!
A protein language model for MSAs. Likely relevant for the 'trunk' part of the AlphaFold2 model. Basically just an axial transformer with tied row attention, but they see a rather dramatic jump in performance.
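For readers unfamiliar with the term, tied row attention can be sketched in a few lines of NumPy. Instead of each MSA row computing its own position-by-position attention map, the query-key logits are summed over all rows so that one map is shared by every sequence. This is only an illustrative sketch: the function name, the `(M, L, d)` tensor layout, and the `sqrt(M * d)` scaling are assumptions made here, not code from the paper, and it omits multiple heads, learned projections, and the column-attention half of an axial block.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def tied_row_attention(q, k, v):
    """Single-head tied row attention (hypothetical helper for illustration).

    q, k, v: arrays of shape (M, L, d) for M aligned sequences of length L.
    Returns the attended values (M, L, d) and the shared attention map (L, L).
    """
    M, L, d = q.shape
    # Sum per-row logits over all M rows: every row shares one L x L map.
    logits = np.einsum("mid,mjd->ij", q, k) / np.sqrt(M * d)
    attn = softmax(logits, axis=-1)
    # Apply the shared map to each row's values.
    out = np.einsum("ij,mjd->mid", attn, v)
    return out, attn

# Toy MSA: 4 sequences, 6 aligned positions, 8 feature channels.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 6, 8)) for _ in range(3))
out, attn = tied_row_attention(q, k, v)
print(out.shape, attn.shape)
```

Sharing the map means the attention pattern reflects position-position relationships common to the whole alignment rather than to any single sequence.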
This is great news! Our plans for OpenFold won’t change, as having a trainable platform is still incredibly valuable for modifying and building on AF2. The first step ofc is reproducing the AF2 weights independently which is what we’re currently working on.
"The AlphaFold parameters are made available under the terms of the Creative Commons Attribution 4.0 International (CC BY 4.0) license" 🙂
(thanks to @BrianWeitzner for alerting me)
We have a faculty search this year (all ranks). If you're a computational biologist I strongly recommend you apply! Lots of fantastic people in dept () and at Columbia interested in ML and biomolecules (@NShahLab @helloanum @HarmenBussemkr V. Cornish) 1/5
As promised here is our high-level review of #AlphaFold and surrounding events, by @NazimBouatta, Peter Sorger, and myself. This is nearly all @NazimBouatta's work, a string theorist turned protein theorist who has taken a far more expansive view of the field than I could have.
The final version of our RGN2 manuscript for single-sequence structure prediction is out in @NatureBiotech! Peer review dramatically improved this work, thanks to @fraser_lab, @thisismadani, and anonymous reviewers. For more on what’s new, see thread below by @NazimBouatta ⬇️
Our new approach for predicting protein 3D structure using single sequence + protein language model, w/o MSAs, is out. We combine a protein language model (AminoBERT) with a structural module using a transfer matrix formalism. (1/5)
This looks interesting: an open-source implementation of the AlphaFold distance prediction NN. . I haven't had a chance to look in detail yet but there's an associated preprint:
First off: model weights, training code and colab notebook are here . We are also making available a training set of 400K unique MSAs & predicted structures for self-distillation. These live in the Registry of Open Data on AWS 2/12
Protein language models scaled up massively (training on ~5600 GPUs!) Unfortunately doesn't seem to have resulted in a meaningful performance improvement yet.
Exciting to see the preprint on ESM2/ESMFold from @alexrives group out: . Far larger protein LMs than ESM-1b and applied to MSA-less protein structure prediction. Can't wait for code/models!
OpenFold preprint is out! Much richer story than expctd 1) AF2 shockingly robust to data elision; train on 1k chains→get AF1 acc; train on helices or sheets→do ok on other 2) it learns 1D→2D→3D proteins. Tweetorial👇incl💯animation of low-D predictions
Our preprint on OpenFold, our trainable reproduction of AlphaFold2, is finally up ()! Since we open-sourced parameters in June, we've trained the model to high accuracy more than 50 times, on a variety of datasets. Here's what we learned (a lot) -> (1/19)
An updated version of my AlphaFold blogpost is now a Letter to the Editor in Bioinformatics: The science part was revised to reflect new information and reviewer feedback. The 'sociology' part was scrubbed to make it a dignified piece of writing :-D
To be clear: we have NOT yet trained this new model from scratch but are doing so now and expect to release new model weights shortly. We have however confirmed that OF’s inference is identical to AF2’s by loading it with AF2's weights and predicting identical structures.
Great work using inter-residue orientations to exceed AlphaFold’s performance on protein structure prediction by Jianyi Yang, Ivan Anishchenko, and others from the Baker lab: . First heard about this at RosettaCon and I’m very glad to finally see it out!
Finally, by “we”, I mean the inimitable OpenFold team, led by @gahdritz, @SachinKadyan99, Will Gerecke, and Luna Xia. All were co-supervised by @NazimBouatta and myself (I mostly stayed out of the way to avoid slowing them down.) More very soon.
A key finding is that AF2/OF accuracy climbs very sharply then tapers off for a long and gradual increase. While total training time took ~100K A100 hours, 90% of final accuracy could be achieved in ~3K hours. This has important implications for training AF2/OF variants. 4/12
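As a quick back-of-the-envelope check on the figures quoted above, the arithmetic works out as follows. The two inputs are the approximate numbers from the tweet; the variable names are made up for illustration.

```python
# Approximate figures quoted in the tweet (A100 GPU-hours).
total_hours = 100_000      # full training run
hours_to_90pct = 3_000     # time to reach ~90% of final accuracy

fraction = hours_to_90pct / total_hours
remaining_hours = total_hours - hours_to_90pct

print(f"{fraction:.0%} of the compute reaches ~90% of final accuracy")
print(f"the remaining {remaining_hours:,} hours buy the last ~10%")
```

In other words, a variant can be evaluated to a first approximation for roughly 3% of the cost of a full run, under the assumption that its training curve has a similar shape.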
Coming back to my (new) office after four weeks of travel to find it freshly decorated by my lab. I get to work with the best people ❤️. (Also learned that the one letter abbreviation for ornithine is O)
We have a new faculty position open in my department () with a strong focus on machine learning and quantitative biology, broadly defined. We value method development as much as hypothesis- and discovery-driven science. And we keep getting more GPUs!
As we saw with the recent AlphaFold-Multimer, some applications can benefit from training new AF2 variants and possibly integrating AF2 within larger models. DeepMind’s JAX version, while excellent, is missing training code. PyTorch is also more widely used, hence OpenFold.
2nd is memory: we use less due to optimizations and custom CUDA kernels, enabling inference of much longer sequences. In general we get up to ~4,600 residues on a 40GB A100 and we believe we can optimize further. 7/12
Google's preprint on annotating the protein universe just got an update that includes clustered training/test splits, as well as new timing experiments. Looks like a major revision.
Somehow this slipped my radar. Very cool looking work from the @DrorLab: Hierarchical, rotation-equivariant neural networks to predict the structure of protein complexes.
We combine a new protein language model (AminoBERT) with an improved version of our end-to-end differentiable machinery (RGN2) to directly generate 3D coordinates. On orphan proteins, RGN2 outperforms all major methods, including #AlphaFold, RoseTTAFold, and trRosetta. (2/4)
I'm not myself when I haven't programmed in a while. I notice this most acutely when I get an uninterrupted block of coding time after a months-long drought, and feel like I am made whole again. Is anyone else this way? Unfortunately the droughts are increasing in intensity.
I should note that another blog post has been written by @c_outeiral and it’s great and entirely complementary, so be sure to read his for another perspective:
Back to model: as this scatterplot shows (GDT_TS scores on CAMEO-based validation set) accuracy is very comparable to AF2 but slightly higher on average with OF, perhaps because of our slightly larger training set. 3/12
Our PyTorch implementation has some advantages over the publicly available JAX implementation from DeepMind, beyond the obvious one of being trainable. 5/12
Been perusing the new CASP14 abstracts (): MSR & Baidu entered, and AlphaFold2 is using raw MSAs (cf. extracting pairwise values) and doing self-consistent predictions! RGNs are self-consistent too, but details likely very different. More on our entry soon.
This was a big effort within the lab and with many external collaborators. Internally credit goes to the OF team led by @gahdritz (w/ @SachinKadyan99, Luna Xia, Will Gerecke) and co-advised by @NazimBouatta and me. 9/12
I had the pleasure of visiting @broadinstitute last week to give the MIA talk on differentiable protein structure learning. Video of the talk is up now:
New blogpost up on protein representation learning: . I use our recent UniRep preprint () in collab with @EthanAlley @grigonomics @SurgeBiswas @geochurch as a springboard for reflecting on the future of the field.
1st is speed: OF inference is up to 2x faster on short proteins even when excluding JAX compilation. On longer proteins advantage lessens, until AF2 begins to OOM (see 2nd point). Inference speed is key when coupled with fast MSA schemes like MMseqs2 6/12
MSA generation is not slow. In ColabFold we generate MSAs in seconds using MMseqs2. This can be tweaked to run in < second using batch. Most of the time of AlphaFold/ColabFold is spent predicting the structure.
Congratulations to @demishassabis and John Jumper who have won the 2023 Breakthrough Prize in Life Sciences for the development of #AlphaFold, our AI system that solved the 50-year-old challenge of protein structure prediction. 1/
Very excited about the launch of the CZI New York BioHub and what it means for the NYC ecosystem! Congratulations to @califano_lab for leading this effort!
We’re thrilled to share that we’re launching a new @CZBiohub in New York!
Bringing together engineers + scientists at @Columbia, @RockefellerUniv and @Yale, #CZBiohubNY will engineer immune cells for earlier detection & treatment of disease
A piece of holiday-time reflection: one thing I’m grateful about in science is the existence of a real field-wide community, made more visible by Twitter. I suspect this is less true in other professions and is a genuinely positive feature.
This looks great. I think it's an idea that's been in the ether for some time but getting it to work is an altogether different matter. Will be interesting to see if it can be translated to animals, especially mammals.
Final ProteinNet paper is now in @BMCBioinfo. Also quick update: raw MSAs for PN12 are available upon request (4TB), PN13 is in progress, planning on prelim PN14 in time for CASP14, and should have co-evo inputs soon for <=PN12.
Columbia is hiring! We have tenure-track/tenured positions at all ranks in the Program for Mathematical Genomics (Dept of Systems Biology). We have a special interest in method development but all areas of comp/sys bio are welcome. Come be my colleague!
Ref. implementation of RGNs is now available on GitHub (), along with 6 pre-trained models spanning CASP7 - 12. The code enables training quite a variety of RGN models, including ones I’ve never tried!
Hearing Gorman recite today reminded me of Teddy Roosevelt’s words that we are “a new nation, based on a mighty continent, of boundless possibilities.” Optimism may not be our birthright but it is our national character, and for the 1st time in at least 10 months, I’m feeling it.
Excited to share our latest pre-print🎉 - a framework for low-N protein engineering with data-efficient deep learning! Had a blast working with brilliant @EthanAlley @SurgeBiswas @kesvelt and @geochurch.
Thread (1/7)
Georgy Derevyanko and @g_lamoureux_ have just made public a very cool PyTorch library for differentiable protein primitives, with optimized CUDA kernels!
Been a while since I've blogged, but I figured yesterday's paper release deserved some background. In this post I write a little more about the conceptual ideas that led me to end-to-end differentiability for proteins.
Unusual approach for our lab - fantastic work from @pgainza + @Freyer02952299 and fun collaboration with @mmbronstein on using learning techniques for "Deciphering interaction fingerprints from protein molecular surfaces". Take a look.
Really cool work: incorporate a learnable ODE model of signal transduction within an ML framework to predict cell response to perturbations. I happen to be writing a review in which I speculate that this should be possible. Kudos to the @sandercbio team for actually doing it!
Aiming at more comprehensive computable perturbation/response models of cell biology. Preprint updated: Interpretable Machine Learning for Perturbation Biology
If you're interested in the latest on drug discovery + ML and QM, go follow @davidlmobley. He's done an amazing job live tweeting #OECUP2019. I feel like I'm practically there!
Congratulations to @Liu_Changchang for passing her PhD defense with flying colors! Changchang is the first graduate student to be (co-)supervised by me (w/Peter Sorger), and I could not be more proud. Can't wait to see what you do next Changchang!
@atomadam2 @pollyp1 The lucky ones, yes, but not all (or even most I suspect.) Students in colleges or even universities without strong graduate programs can be quite isolated from academic norms.
Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences (from FAIR)
- unsupervised learning recovers representations that map to multiple levels of biological granularity