Developing data intensive computational methods • PI @ Seoul National University 🇰🇷 •
#FirstGen
• he/him • Hauptschüler •
@martinsteinegger
@mstdn
.science
Foldseek-Multimer is a protein complex aligner that is up to 10,000 times faster than SOTA methods without sacrificing quality, enabling the comparison of billions of complex pairs per day. 1/5
Protein structure prediction with
#AlphaFold
2 in the browser using Google Colab. Just paste your protein in the input box and push "Run all". MSAs are generated by an MMseqs2 API call. Work by
@sokrypton
,
@milot_mirdita
. Try it out here:
Foldseek, our fast structural aligner, is now published
@NatureBiotech
. It allows you to search through large structural databases like
#AlphaFold
or
#ESMatlas
in seconds. A long journey since '18! 1/6
We clustered the
#AlphaFold
structure database with our novel Foldseek algorithm. We identified 2.27M clusters and analyzed them by function, annotation, domains and evolution. Amazing collaboration with
@pedrobeltrao
lab. 1/n
I got tenure at Seoul National University! Starting off as a first-gen student with a lower secondary education (Hauptschule) in Germany makes this very meaningful to me. I am grateful to all who supported and believed in me. Excited for what lies ahead!
ColabFold updated: we sped up
#AlphaFold2
's prediction to allow thousands of structures on a single GPU in a day. We added taxonomy-aware paired MSAs for complex predictions, a new metagenomic DB and support for running it on your local machine.
Updated
ColabFold makes structure prediction and complex modeling of
#AlphaFold
2 and
#RoseTTAFold
accessible through Google Colab. We show that MSAs produced by MMseqs2 match the accuracy of AF2 (HHblits/HMMer) while being faster.
Code
Search your protein structures against
#AlphaFold
DBs and
#PDB
in seconds using our Foldseek server. Just paste your PDB file and click search. We offer local (SW) and global (TMalign) structural alignments. The server was built by the amazing
@milot_mirdita
ColabFold makes folding with
#AlphaFold
&
#RoseTTAfold
accessible to everyone. Our MSA server has processed >1.6 million requests to date. We thank the community for all the help in improving ColabFold.
Now published
@naturemethods
AlphaFold2 predicts protein structures at near crystal structure accuracy in less than 10 minutes (~300 aa). The animation below shows the prediction of a viral RNA polymerase with >2k residues. I am grateful I could contribute to this huge milestone.
Igor Tolstoy brought to my attention that the
#AlphaFold
database contains predictions of nearly identical sequences with large pLDDT differences. For example, the two 99.6% similar sequences below have an avg. pLDDT of 97 and 33. We found over 1 million of these cases in the AFDB.
Our work on clustering the 214M
#AlphaFold
protein structures was published in
@Nature
. We identified 2.3M clusters using our fast structure clustering algorithm and analysed their annotations, evolution and novel domains. 1/4
It feels surreal to receive the Overton Prize from
@ISCB
! This reflects the incredible support of my mentors (
@SoedingL
,
@StevenSalzberg1
), collaborators, postdocs, students and friend
@milot_mirdita
. Excited to share this journey with you all at
#ISMB2024
in Montreal.
MSA diversity is key for
#AlphaFold2
's accuracy. Larger databases == better results. So, we generated MSAs from 22 petabytes of SRA data and show that ColabFold could have improved from rank 11 to 3 at CASP15.
Maximum likelihood structural phylogeny beyond the twilight zone by combining Foldseek's 3Di alphabet with AA alignments. ML trees resolve the topology of distantly related proteins where traditional AA methods fall short.
Foldseek, our local structural aligner, is four orders of magnitude faster than state-of-the-art structural aligners at similar sensitivity, allowing hits in the midnight zone to be detected confidently.
Code:
Predict protein structures in batches using the ColabFold "AlphaFold2_batch" notebook. It will predict all structures for a set of fasta files stored in a Google Drive folder. Try it out here:
Thanks to
@milot_mirdita
@sokrypton
Explore our clustered
#AlphaFold
structural database with our new website by
@milot_mirdita
@clmgilchrist
@jgyyy15
. With it you can find clusters, filter members by taxonomy, browse similar clusters and search with Foldseek.
ColabFold now uses the AlphaFold-multimer models paired with MMseqs2 searches for prediction of protein complexes. Just provide chains separated by : and press "Run all" (provide the same sequence multiple times for homo-oligomers). Check it out at
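A minimal sketch of the ":"-separated input convention described above. The helper name `split_chains` and the toy sequences are hypothetical and are not part of ColabFold.

```python
def split_chains(query: str) -> list[str]:
    """Split a ':'-separated complex query into one sequence per chain."""
    chains = [chain.strip().upper() for chain in query.split(":")]
    if any(not chain for chain in chains):
        raise ValueError("empty chain in query")
    return chains


# A homodimer is written by repeating the same sequence.
homodimer = split_chains("MKTAYIAK:MKTAYIAK")
# A heterodimer uses two different sequences.
heterodimer = split_chains("MKTAYIAK:GSHMLEDP")
```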
Our Foldtree notebook allows you to compute and visualize trees from protein structures in the browser. Generate trees from either 1) a set of protein structures, 2) AlphaFold DB identifiers or 3) an
#AlphaFold
cluster identifier.
In a post
#AlphaFold
world, we can use protein structures in ways we never could before. Can we build phylogenies with them? Are they any good? Yes! Foldtree surpasses traditional sequence-based methods, even for closely related proteins.
We predicted the structure of 140k protein isoforms from human using
#AlphaFold
/ColabFold. When comparing them to their canonical MANE partner we saw that structure predictions can improve genome annotation. Data is available at
Foldseek got published in 2024 in Volume 42 of
@NatureBiotech
. Here is the timeline of FS releases:
Source code: 2021/07
Webserver: 2022/01
Preprint: 2022/02
Journal: 2023/05
Print: 2024/02
Hundreds of millions of protein structures will require new tools. Foldseek is a fast structural aligner that scales to billions of structures. Work by Stephanie Kim, Michel van Kempen, Johannes Söding and me.
Poster at the
@ISMB
next week
My postdoc Stephanie Kim presents Foldseek, our fast and accurate protein structure aligner, during today's poster session (P156-T)
@ECCBinfo
#ECCB2022
. Unfortunately, she has to leave early to catch her flight, so please make sure not to miss her.
Our Foldseek structural clustering of the
#AlphaFold
DB is now accessible through the AFDB website and API. It allows the fast discovery of similar structures for
@uniprot
proteins. It is a pleasure to work with the AlphaFold DB team! Foldseek cluster
The
#AlphaFold
Database has levelled up
Sequence-based search: Find protein structures in the database using BLAST
🤝 With
@thesteinegger
team, we bring structure similarity clusters for seamless navigation
A collaboration with
@GoogleDeepMind
#AlphaFold
2 Colab has processed >10k queries. We now also search against BFD, Mgnify, SMAG(
@tomodelmont
), MetaEuk in addition to UniRef. SMAG&MetaEuk have >20M eukaryotic environmental proteins that were not used in AF2 before.
@sokrypton
@milot_mirdita
ColabFold now uses a faster MMseqs2 backend server. We switched from BFD/Mgnify to ColabFoldDB, a larger metagenomic database, and reduced rate limits a lot, so batch AlphaFold2 runs should be faster.
Our
#AlphaFold
cluster site has new features:
① Search for clusters using a protein structure via
#Foldseek
② Filter candidate clusters
③ Explore the cluster using a Pavian-style interactive Sankey taxonomy plot
Work by
@milot_mirdita
@jgyyy15
That's right! AlphaFold is coming to
@galaxyproject
! Soon anyone anywhere will be able to fold and analyze the structure of nearly any protein completely for free! Many thanks to
@thesteinegger
et al for working with us to deploy their optimized ColabFold implementation.
At
#ISMBECCB2023
, my talented students present their exciting work: 22 petabase search for structure prediction, structural clustering of AFDB, IDP multimer prediction, structural compression & metagenomic classification. Find us at posters B-036, B-038, B-039, B-040, B-114.
.
@MetaAI
released ESMfold and structure predictions for most metagenomic MGnify90 sequences. Thanks for early-access
@TomSercu
@alexrives
to 36 million structures clustered at 30% seq. id.
Check them out on our Foldseek search server:
Foldseek's webserver now allows predicting structures thanks to the great ESMfold API. Just paste an amino acid sequence, click "PREDICT STRUCTURE" and search against structures from ESMatlas, AlphaFoldDB, PDB & more. Work by
@milot_mirdita
. Check it out at
Through homology search & pLMs, we identified an effective kynureninase that degrades a key immunosuppressor in cancer, reducing tumor weight in mice by 3.4x.
Seek & rank your own protein based on only a handful of measures.
My group and I are excited to join this year's
#ISMBECCB2023
in Lyon. We are presenting 11 posters and 4 talks. We are preparing a set of updated stickers of our methods for the poster session. Sneak peek below!
Our Foldcomp library compressed the
#AlphaFold
/UniProt from 23TB to 950GB at an avg. loss of <0.5 Å; it decompresses ~200 structures per second per core and has a Python interface to download DBs and compress/decompress. Work by:
@HKgenomics
@milot_mirdita
1/4
Our Foldseek server now includes the
#AlphaFold
UniProt DB clustered to 52M structures at ~50% seq. id & 80% cov. The full Foldseek AlphaFoldDB, including Cα, can be downloaded through the Foldseek databases module (~700GB download, ~950GB extracted) 1/6
We're also sharing the proteomes of 20 other biologically-significant organisms, totalling over 350k structures. Soon we plan to expand to over 100 million, covering almost every sequenced protein known to science & the
@uniprot
reference database.
2/
ColabFold now supports uploading custom templates. Here is an example of a GPCR (ACM2_HUMAN) modeled with an active and inactive template using no MSA information. The example was taken from
@huhlim
and
@MeikelFeig
's preprint
OmegaFold is open source. Thank you so much for releasing it. It installs very easily. On Colab a protein with 583 res. ran out of memory (16GB GPU), 320 worked (it took 13min). 583 ran in ~23m on a 24GB GPU. Complex prediction via a glycine linker seems to work for a toy example.
Taxonomic assignment of contigs 2-18x faster than state-of-the-art. At a glance: we assign each ORF a tax-label considering alignment uncertainty, followed by a weighted majority prediction.
Code:
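A toy sketch of a weighted majority vote over per-ORF labels, to illustrate the idea only. It assumes flat labels and one numeric weight per ORF; the function name, the `min_fraction` threshold and the example weights are hypothetical, and the published method is not reproduced here.

```python
from collections import defaultdict


def weighted_majority(orf_labels, min_fraction=0.5):
    """Return the label with the largest summed weight, or None if no label
    reaches min_fraction of the total weight (hypothetical threshold)."""
    totals = defaultdict(float)
    for label, weight in orf_labels:
        totals[label] += weight
    total_weight = sum(totals.values())
    if total_weight <= 0:
        return None
    best = max(totals, key=totals.get)
    if totals[best] / total_weight < min_fraction:
        return None
    return best


# Illustrative (label, weight) pairs for the ORFs of one contig.
contig_label = weighted_majority(
    [("Escherichia", 120.0), ("Escherichia", 80.0), ("Salmonella", 90.0)]
)
undecided = weighted_majority([("A", 1.0), ("B", 1.0), ("C", 1.0)])
```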
New Foldcomp release! Our protein structure compression algorithm now supports multiple input/output file types as well as multiple chains/fragments. We updated the Python API & added new DBs including
#ESMatlas
and
#AlphaFold
cluster. Great work by
@HKgenomics
Conterminator terminates contamination in genomes.
@StevenSalzberg1
and I report over 114K/2M contaminations in RefSeq/GenBank and two unexpected ones in a GRCh38 alt. scaffold and the C. elegans ref. genome.
Preprint:
Code:
New MMseqs2 release 14-7e284: includes the features to run the ColabFold pipeline, position-specific gap costs/profile-profile Gotoh-Smith-Waterman, speed-ups and more. Thanks a lot to all contributors!
conda install -c conda-forge -c bioconda mmseqs2
ProstT5 is a protein LLM with structure-aware embeddings. It was trained on structures (Foldseek's 3Di) and AA sequences. It translates AA to 3Di for sensitive Foldseek search and designs proteins by converting 3Di to AA.
Code:
Our new bilingual protein language model (pLM), ProstT5, translates between protein sequence and structure. Besides producing more structure-aware embeddings that are better at remote homology detection than sequence-pLMs, its translation capability enables inverse folding.
Foldseek got a 3D structure visualization using NGL thanks to my postdoc
@clmgilchrist
and
@milot_mirdita
. We generate missing atoms using pulchra and superpose aligned sequences using TMalign in the browser using
#WebAssembly
ColabFold preprint update. Two highlights: colabfold_batch executes MMseqs2+AlphaFold2 in batch and is nearly 100x faster using early-stopping at ≥85 pLDDT compared to
#AlphaFold
2. ColabFold+AlphaFold-multimer performs similarly to AlphaFold-multimer.
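A minimal sketch of early stopping on a confidence threshold, assuming each model is a callable that returns a structure and its mean pLDDT. The function and the toy models are hypothetical stand-ins, not the colabfold_batch implementation; whether stopping applies across models or recycles is left open here.

```python
def predict_with_early_stop(models, sequence, stop_at_plddt=85.0):
    """Run models in order and stop once one reaches the pLDDT threshold."""
    best = None
    for model in models:
        structure, plddt = model(sequence)
        if best is None or plddt > best[1]:
            best = (structure, plddt)
        if plddt >= stop_at_plddt:
            break  # confident enough: skip the remaining models
    return best


calls = []


def make_toy_model(name, plddt):
    """Build a fake model that records its call and returns a fixed pLDDT."""
    def model(sequence):
        calls.append(name)
        return f"{name}:{sequence}", plddt
    return model


toy_models = [
    make_toy_model("m1", 72.0),
    make_toy_model("m2", 88.5),
    make_toy_model("m3", 91.0),
]
result = predict_with_early_stop(toy_models, "MKTAYIAK")
```

In this toy run the third model is never called, which is where the time saving comes from.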
Enjoyed making a small contribution to this study by
@AlbertPol10
and
@LabParton
. Quite amazed by how well AlphaFold2 predicts the unusual structures of the caveolin proteins.
The 100 most cited AI papers for 2022.
A detailed analysis of the most cited papers for the last three years allows good insights into the organisations and countries publishing the most impactful AI research right now.
Read here:
A thread 🧵
Metabuli 부리 improves metagenomic read classification through metamers, DNA-AA k-mers, to be sensitive and specific, recovering 99% and 98% of DNA or AA classifiers. Great work
@JaebeomKim6
!
conda install -c bioconda metabuli
.
@arian_jamasb
integrated Foldcomp, our structure compression algorithm, into Graphein - a Geometric Deep Learning framework for protein structures.
Now, you can train networks on a proteome scale in Colab! Great work.
View the notebook:
Foldseek Release 8: supporting searches against clustered databases (with prebuilt DBs for AFDB50 and PDB100) and protein-complexes. HTML output was improved by
@clmgilchrist
. In the webserver, you can download and (re)upload results.
💾 or bioconda
We changed the license of the ColabFoldDB and PDB70 from CC BY-NC to CC BY-4.0. Now, there shouldn't be any further roadblocks for commercial use of AlphaFold2 or ColabFold.
The Baker lab's yeast all-against-all protein complex paper is online. They scanned 8.3 million PPIs using a smaller RoseTTAFold model and predicted the complex structure for high-scoring PPIs using AlphaFold2. The Science review process is blazing fast (<3 months).
Our structural aligner Foldseek can now automatically download databases in a single command. We provide the PDB (
#PDB50
) and AlphaFold DB. You can download Foldseek here:
foldseek databases PDB pdb tmp
foldseek easy-search query.pdb pdb aln.m8 tmp
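A small sketch for reading the resulting `aln.m8` file in Python. It assumes the default tab-separated, BLAST-like column layout listed in `M8_COLUMNS`; if a custom output format was requested, the column list must be adapted. The example row is made up for illustration.

```python
import csv
import io

# Assumed default column order of the tab-separated output.
M8_COLUMNS = [
    "query", "target", "fident", "alnlen", "mismatch", "gapopen",
    "qstart", "qend", "tstart", "tend", "evalue", "bits",
]


def read_m8(handle):
    """Parse tab-separated hits into dicts with numeric score fields."""
    hits = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue
        hit = dict(zip(M8_COLUMNS, row))
        for key in ("fident", "evalue", "bits"):
            hit[key] = float(hit[key])
        hits.append(hit)
    return hits


# Made-up row standing in for one line of aln.m8.
example = "query.pdb\t1abc_A\t0.412\t150\t80\t3\t1\t150\t5\t152\t1.2e-10\t310\n"
hits = read_m8(io.StringIO(example))
```

For a real run, pass an open file instead: `with open("aln.m8") as fh: hits = read_m8(fh)`.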
100B parameter protein language model trained on
@uniprot
and the
#ColabFoldDB
using 768 NVIDIA A100 GPUs for several months. The LM shows significant improvements in most prediction categories. Note: the model is not open-source; only the training data is currently available.
"xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein"
A borderline-SOTA antibody structure-prediction method is tucked away in the results section
Penguin is our new assembler that reconstructs many-fold more accurate strain-level viral genomes and 16S rRNAs from metagenomes through a novel greedy AA/DNA-hybrid Bayesian overlap extension strategy. By
@AnnikaJochheim
et al.
.
@Rosy_Zh
presented a new method called Spacedust, which can cluster and detect recurring gene neighborhoods. It finds gene similarities using amino acids & structures. Check out her poster for more details and Marv stickers.
Code:
#ISMBECCB2023
Protein-Vec enhances protein function prediction by combining independently contrastively learned protein classifiers for EC, GO, PFAM, Gene3D, and TMscore (Aspect-vecs) into a merged embedding to boost prediction performance.
My student
@HKgenomics
talks about Foldcomp, a fast protein structure compression algorithm. Foldcomp compresses the AFDB down from 23TB to 1TB at the speed of gzip.
#ISMBECCB2023
Code:
We released the MMseqs2 ColabFold databases. In addition to BFD/Mgnify, we also built a database containing additional metagenomic databases: MetaEuk, SMAG, TOPAZ, MGV, GPD and MetaClust2. Thanks
@milot_mirdita
for getting MMseqs2 ready.
Agnostos defines a framework to annotate genes beyond the twilight zone using clustering and remote homology detection. It organized over 415 million genes from 1,749 metagenomes. Maybe the dark matter is not so dark after all. Great work
@ChiaraVanni5
!
"As part of our commitment to releasing our research breakthroughs safely and responsibly, we will not be sharing model weights, to prevent use in potentially unsafe applications."
Conterminator, a method to terminate contamination in genome and protein databases, is published in
@GenomeBiology
.
@StevenSalzberg1
and I found >114K/2M likely contaminations in RefSeq/GenBank.
conda install conterminator
Here is a blog post about ColabFold by
@labriataphd
, which summarizes our efforts very well. Thank you so much for writing it.
@labriataphd
did you try to predict PRTEINSEQENCE?
Our protein-level assembler "plass" paper is now published in
@naturemethods
. Plass recovers many-fold more proteins from complex metagenomes compared to nucleotide assemblers.
Paper:
Code&Data:
@milot_mirdita
@virus_x_team
The
#AlphaFold
source code has been updated and now accounts for multi-chain protein complexes - providing a significant improvement in accuracy for predicting protein interactions:
Generate predictions from your browser via:
For those interested in exploring the structural space using Foldseek, check out the tutorial video from
@SBGrid
, where I demonstrate the webserver and command line interface of foldseek. Thank you for hosting me.
MetaEuk predicts eukaryotic proteins from metagenomes. They extract millions of yet unknown proteins from marine metagenomes
@TaraOcean_
.
Preprint:
Code:
The proteins can be searched at
Yesterday was my last day at the lab of
@StevenSalzberg1
. I feel so lucky that I was able to join such a fun and talented group. I'm looking forward to starting my own lab at Seoul National University
@SNUnow
. I am hiring! Please reach out via DM or email.
Yesterday, we talked about ColabFold (AlphaFold2/RoseTTAFold in Google Colab)
@ProteinBoston
. Below are the slides, including a comparison of the MMseqs2 vs
@DeepMind
's jackhmmer version on CASP14-FM targets. We also covered some updates coming soon! Video will be posted soon too
This paper covers the
#ESMatlas
, a huge metagenomic protein structure database and the lightning-fast
#ESMfold
structure predictor, which the authors provide as API, allowing for direct structure predictions. Kudos to
@alexrives
& the
@MetaAI
team for this exceptional work!
In a Science study,
@MetaAI
researchers show the power of a large language model,
#ESMFold
, to enable protein structure prediction and analysis.
Using ESMFold, they generated a database, the ESM Metagenomic Atlas, of over 600 million metagenomic proteins.
.
@sokrypton
talked about ColabFold at the
@emblebi
AlphaFold webinar. Below is a screenshot of its complex modelling possibilities. He also presented a memory-friendly modeling approach for large complexes using trimming, implemented in the AlphaFold2_advanced_beta Colab
Today we present four posters at
#ISMBECCB2023
. A fast structural MSA algorithm (FoldMason), a NN to de-noise & select particles from Cryo-ET images, novel fungal core genes, and a benchmark of AA and structure measures for proteome comparison.
Posters C-114, C-148, C-238, C-262
Curious to see how the new PDB identifier, with 5-characters instead of 4, will impact bioinformatics pipelines. This might be a "millennium bug" moment in structure bioinformatics.
Today's the day: PDB no longer has 3-character chemical component IDs for incoming depositions. 1st structure with a 5-character CCD has been deposited.
Details at wwPDB: PDB Entries w/Novel Ligands Now Distributed Only in PDBx/mmCIF & PDBML File Formats
The
#CASP14
results are out and
#AlphaFold2
won.
It produces predictions with a margin of error close to crystal structures. Protein structure prediction might be solved.
I am happy that I could contribute to this milestone. See you at the conference.
CASP14
#s
just came out and they're astounding: DeepMind looks to have solved protein structure prediction. Median GDT_TS went from 68.5 (CASP13) to 92.4!!!! Cf. their 2nd best CASP13 struct scored 92.8 (out of 100). Median RMSD is 2.1 Å. I think it's over
.
@clmgilchrist
and I have refreshed the Foldseek webserver interface and made searches much quicker. We have also added the
@CATHDatabase
with the help of
@nicolabordin
.
To explore the updated server, visit:
Today
@milot_mirdita
is defending his PhD. I am so excited to hear his talk. It was such a pleasure to work with you. MMseqs2, ColabFold, and many more methods wouldn't have been possible without you. Good luck. :)
Thank you
@DeepMind
for making AlphaFold2's weights available for academic as well as commercial usage, thus making AF2 fully open to everybody (who gives proper attribution). We will reflect this change in the ColabFold usage texts soon.
"The AlphaFold parameters are made available under the terms of the Creative Commons Attribution 4.0 International (CC BY 4.0) license"
(thanks to
@BrianWeitzner
for alerting me)
Today
@DeepMind
released a colab for
#AlphaFold
2 using HMMer for the homology search against a reduced version of Uniprot, BFD, and Mgnify. Thank you for linking our Colab. It's great to have different flavors available.
DeepMind colab
I am incredibly proud of my two students
@imGyuriKim
and
@JaebeomKim6
for receiving the prestigious Korean Presidential Scholarship '제1기 대학원 대통령과학장학금' (1st cohort of the Presidential Science Scholarship for graduate students). This is a very competitive prize and it is exceptionally rare that two are awarded to the same lab. Congratulations
SkewIT (Skew Index Test) quantifies the bacterial GC Skew to detect mis-assembled genomes. It detected multiple mis-assemblies of complete RefSeq genomes. Great work
@JenniferLu717
and
@StevenSalzberg1
Preprint:
Code: (not public)
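For orientation, a short sketch of the standard GC skew, (G - C) / (G + C), computed in non-overlapping windows. This is only the textbook quantity on a made-up toy sequence; the SkewIT index itself is not reproduced here, and the window size is an arbitrary assumption.

```python
def gc_skew(seq: str) -> float:
    """Standard GC skew of a sequence: (G - C) / (G + C)."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return 0.0 if g + c == 0 else (g - c) / (g + c)


def windowed_gc_skew(genome: str, window: int) -> list[float]:
    """GC skew in consecutive non-overlapping windows of the given size."""
    return [
        gc_skew(genome[start:start + window])
        for start in range(0, len(genome) - window + 1, window)
    ]


# Toy sequence: a G-rich half followed by a C-rich half.
toy_genome = "GGGGGCATAT" + "CCCCCGATAT"
skews = windowed_gc_skew(toy_genome, 10)
```

In a correctly assembled circular bacterial genome the sign of the skew is expected to flip at the origin and terminus of replication, so unexpected sign changes can hint at a mis-assembly.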
End-to-end differentiable (vectorized) Smith-Waterman implemented in Jax. A new tool to optimize MSAs based on specific use cases like protein structure quality, phylogeny and many more. Great work by Petti et al.
Code:
Adieu Lyon! It was an incredible
#ISMBECCB2023
! Immensely grateful for the warm welcome extended to my students - for many, it was their first international conference. Thanks to
@BQPMalfoy
and his baby for capturing the moment.
Behind the scenes of AlphaFold2's success at CASP14. The manuscript describes how difficult targets were processed in order to achieve the highest performance. One takeaway: search full-length sequences instead of just single domains.
.
@DrArunimaSingh
has written a summary about Foldseek for
@naturemethods
. It's a great overview of the method and includes information about what we're working on. Arunima is also at the
#ISMBECCB2023
right now, so don't miss your chance to talk to her.
A bit late, but I just found this tweet interesting. The AlphaFold DB contains a weak prediction that can be predicted well by Deepmind's AF2 Colab. How is this possible?
@sokrypton
Here is an opposite example - Uniprot B2HHE4.
#AlphaFold
database model is low confidence whereas
#OmegaFold
models are reasonably good without MSA.
Our Marv stickers arrived just in time for
#ISMBECCB2023
. Stickers are available at our posters. I am looking forward to reconnecting with old friends, making new connections, and learning about the latest in bioinformatics. See you in Lyon.
spacegraphcats provides a tool to index and query metagenomic sequence diversity. Helps to recover missing content from genome bins and to quantify diversity. Published
@GenomeBiology
by
@ctitusbrown
et al. Great work!
.
@Deepmind
released the improved AlphaFold-multimer-v2 to reduce the clash problem. We integrated it in ColabFold. It's still possible to use older complex methods using model_type. Thank you for open sourcing it and John,
@tfgg2
and
@richevans_dm
for answering our questions.
Reciprocal best structure hit (RBSH) search with Foldseek detects more hits compared to sequence-based methods.
Great work by Vivian Angela Monzon, Typhaine Paysan-Lafosse, Valerie Wood and
@Alexbateman1
Code:
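A generic sketch of the reciprocal-best-hit idea, assuming two hit lists of `(query, target, score)` tuples, one per search direction. The function names and toy scores are hypothetical; this is not the authors' pipeline, and the scores could come from any structure or sequence search.

```python
def best_hits(hits):
    """Map each query to its highest-scoring target."""
    best = {}
    for query, target, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Keep pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Toy hit lists for proteomes A and B.
a_vs_b = [("a1", "b1", 300), ("a1", "b2", 120), ("a2", "b2", 250), ("a3", "b1", 100)]
b_vs_a = [("b1", "a1", 290), ("b2", "a1", 130), ("b2", "a2", 240)]
pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

Here `a3` is dropped because its best hit `b1` points back to `a1`, not to `a3`.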
Foldcomp is a protein structure compression algorithm and indexing system. It improves compression by 3x over PIC at similar speed to Gzip and reconstructs at ~0.08 Å Cα. AFDB/ESMatlas-HQ DBs are available for download. Python interface via pip.
AlphaFold2 improves the protein structure model quality by recycling (default 3 times), meaning feeding the prediction x times through the model.
@sokrypton
figured out that you can fold a de-novo designed protein from a single sequence by increasing recycles.
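A toy sketch of the recycling loop: the previous output is fed back into the same model a fixed number of times. `model_step` is a hypothetical stand-in for one forward pass, and counting one initial pass plus `num_recycles` extra passes is an assumption of this sketch.

```python
def run_with_recycling(model_step, features, num_recycles=3):
    """Call model_step once, then feed its output back num_recycles times."""
    previous = None
    for _ in range(num_recycles + 1):
        previous = model_step(features, previous)
    return previous


def toy_step(target, previous):
    """Fake model pass: move the current estimate halfway toward the target."""
    current = 0.0 if previous is None else previous
    return current + 0.5 * (target - current)


default_run = run_with_recycling(toy_step, 1.0, num_recycles=3)
longer_run = run_with_recycling(toy_step, 1.0, num_recycles=10)
```

In this toy setting each extra recycle halves the remaining error, which mirrors why more recycles can help on hard targets at the cost of more compute.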
AF2-multimer models monomer complexes by concatenating MSAs. We observed that monomers are best modeled with unpaired ("stair-case") MSAs. In this example the unpaired MSA of ColabFold+AF2-multimer (soon public) picks up an intra-complex signal that AlphaFold-Colab misses.
Roland Dunbrack 🏳️‍🌈 @rolanddunbrack.bsky.social
Hmm, Alphafold-multimer went off the rails on this one. Homodimer of BRD2 bromodomains 1 and 2. Even the single chains are a mess with large overlaps and breaks in the chain.
We have set up a new ColabFold MSA server provided by the Korean Bioinformation Center. For the switch, we will have a short downtime at ~8pm KST/1pm CET/7am EST.
We accelerated the MSA generation using multiple threads and updated Uniref30 to 2022_02 and PDB to March 2022.
.
@daniel_c0deb0t
's block-aligner is a library to align protein/nucleotide sequences using adaptive banding blocks + SIMD. It's ~9 times faster than Farrar's striped SW, implemented in Rust and available here: