Zhe Gan Profile

Zhe Gan (@zhegan4)
2,170 Followers · 321 Following · 16 Media · 134 Statuses

Staff Research Scientist @Apple AI/ML. Ex-Principal Researcher @Microsoft Azure AI. Working on building large-scale vision and multimodal foundation models.

Joined February 2019
@zhegan4
Zhe Gan
7 months
🚀🚀 Introducing Ferret, a new MLLM that can refer and ground anything, anywhere, at any granularity. 📰 1⃣ Ferret enables referring to an image region of any shape. 2⃣ It often shows more precise understanding of small image regions than GPT-4V (Sec. 5.6).
@zhegan4
Zhe Gan
2 years
NUWA-Infinity is our new multimodal generative model that can generate high-quality images and videos from a given text or image input. We can generate images with resolution up to 38912 × 2048 pixels. Check the demo here: abs:
@_akhaliq
AK
2 years
NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis abs: project page: Compared to DALL·E and Imagen/Parti, it generates high-resolution images of arbitrary size and supports long-duration video generation.
@zhegan4
Zhe Gan
21 days
💡 Imagine a multimodal LLM that can understand your iPhone screen 📱. Here it is: we present Ferret-UI, which can do precise referring and grounding on your iPhone screen, plus advanced reasoning. Free-form referring in, boxes out. Ferret itself will also be presented at ICLR.
@_akhaliq
AK
21 days
Apple presents Ferret-UI Grounded Mobile UI Understanding with Multimodal LLMs Recent advancements in multimodal large language models (MLLMs) have been noteworthy, yet, these general-domain MLLMs often fall short in their ability to comprehend and interact effectively with
@zhegan4
Zhe Gan
7 months
🌟 Introducing VeCLIP: improving CLIP training via visual-enriched captions 📘 ⛽️ Data is the fuel for CLIP training; however, alt-text can be noisy. 🚀🚀 By using Vicuna and LLaVA for text rewriting, VeCLIP boosts CLIP performance across 3M-200M data scales.
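For readers wondering what "using Vicuna and LLaVA for text rewriting" looks like in practice, here is a minimal sketch of a recaptioning step of this kind; the helper functions and prompt flow are illustrative assumptions, not the released VeCLIP pipeline.

```python
# Hypothetical sketch of an LLM-based recaptioning step for CLIP training data.
# caption_image and rewrite_with_llm are placeholders, not APIs from any release.

def caption_image(image_path: str) -> str:
    """Placeholder: run a multimodal captioner (e.g., LLaVA) to describe the image."""
    raise NotImplementedError

def rewrite_with_llm(alt_text: str, visual_caption: str) -> str:
    """Placeholder: ask an LLM (e.g., Vicuna) to fuse the noisy alt-text with the
    visual caption into one clean, visually grounded caption."""
    raise NotImplementedError

def enrich_caption(image_path: str, alt_text: str) -> str:
    # 1) Describe the image with a multimodal captioner.
    visual_caption = caption_image(image_path)
    # 2) Let a text LLM merge the (possibly noisy) alt-text with that description.
    return rewrite_with_llm(alt_text, visual_caption)

# The enriched caption then replaces (or is mixed with) the raw alt-text when
# forming image-text pairs for CLIP training.
```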
@zhegan4
Zhe Gan
2 months
🚀🚀 Excited to release code & checkpoints for our new image encoders. 1. VeCLIP: 83.1% zero-shot on ImageNet with a ViT-H/14, trained on DFN-5B and 300M synthetic captions. 2. MOFI: SOTA on image retrieval, trained on 1B entity-annotated images.
@zhegan4
Zhe Gan
2 years
GIT is our new multimodal foundation model and achieves new SOTA across 12 image/video captioning and QA tasks, including the first human parity on TextCaps. GIT achieves 88.79% accuracy on ImageNet-1k using a generative scheme, and can recognize logos, landmarks, characters, etc.
@_akhaliq
AK
2 years
GIT: A Generative Image-to-text Transformer for Vision and Language abs: model surpasses the human performance for the first time on TextCaps (138.2 vs. 125.5 in CIDEr)
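On "ImageNet-1k using a generative scheme": one common way a captioning model can act as a classifier is to score each class name as a candidate caption and pick the most likely one. The snippet below is a generic illustration of that idea with hypothetical helper names, not GIT's actual recipe.

```python
# Generic sketch: using a generative image-to-text model as a classifier by
# scoring each class name as text. caption_log_likelihood is a placeholder.

def caption_log_likelihood(image, text: str) -> float:
    """Placeholder: log-probability the captioner assigns to `text` given `image`."""
    raise NotImplementedError

def classify(image, class_names: list[str]) -> str:
    # Pick the class whose name the model is most willing to generate.
    scores = {name: caption_log_likelihood(image, name) for name in class_names}
    return max(scores, key=scores.get)

# e.g. classify(image, ["tabby cat", "golden retriever", "espresso"])
```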
@zhegan4
Zhe Gan
7 months
🌟 Introducing MGIE, which uses a multimodal LLM for instruction-based image editing 📜 🔍 (code will be released soon). It uses the MLLM to generate expressive instructions and achieves superior results compared with InstructPix2Pix.
@zhegan4
Zhe Gan
4 months
🎁🎁 Ferret is a multimodal LLM that can refer and ground, and it is now open-sourced. Find our code and checkpoints below: . Merry Christmas and Happy New Year! Work led by @XyouH @HaotianZhang4AI @yinfeiy
@zhegan4
Zhe Gan
7 months
🚀🚀 Introducing Ferret, a new MLLM that can refer and ground anything, anywhere, at any granularity. 📰 1⃣ Ferret enables referring to an image region of any shape. 2⃣ It often shows more precise understanding of small image regions than GPT-4V (Sec. 5.6).
@zhegan4
Zhe Gan
3 years
During the summer, we hosted a special Vision-Language Talk Series. With 11 invited speakers, we covered topics like captioning, VQA, ALIGN, MDETR, ViLD, MERLOT, MoCo, etc. Want to know more? 👇👇 YouTube: @MSFTResearch @jw2yang4ai @ChunyuanLi @PengchuanZ
@zhegan4
Zhe Gan
17 days
🌟 Besides Ferret-UI, we also upgrade Ferret to Ferret-v2 for natural images. Several design choices were made along the way: (1) SPHINX-like any-resolution processing for referring and grounding; (2) a CLIP encoder for the global low-res image and a DINOv2 encoder for the sub-images; (3) high-resolution dense alignment before the final SFT. (See the sketch below.)
@_akhaliq
AK
18 days
Apple presents Ferret-v2 An Improved Baseline for Referring and Grounding with Large Language Models While Ferret seamlessly integrates regional understanding into the Large Language Model (LLM) to facilitate its referring and grounding capability, it poses certain
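A rough sketch of the "global low-res + sub-image" encoding described in point (2) above, with dummy encoder modules standing in for CLIP and DINOv2; this illustrates the any-resolution idea only and is not the Ferret-v2 implementation.

```python
# Schematic any-resolution encoding: a global low-res view goes through one
# encoder, high-res sub-images through another, and the token sequences are
# concatenated for the LLM. Encoders here are dummies, not real CLIP/DINOv2.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DummyEncoder(nn.Module):
    def __init__(self, patch: int, dim: int):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                          # x: (B, 3, H, W)
        tokens = self.proj(x)                      # (B, dim, H/p, W/p)
        return tokens.flatten(2).transpose(1, 2)   # (B, N, dim)

global_encoder = DummyEncoder(patch=14, dim=1024)  # stands in for CLIP ViT
local_encoder = DummyEncoder(patch=14, dim=1024)   # stands in for DINOv2

def encode_any_res(image, grid=2, low_res=224, sub_res=224):
    # Global view: resize the whole image to a low resolution.
    g = F.interpolate(image, size=(low_res, low_res), mode="bilinear")
    global_tokens = global_encoder(g)
    # Local views: split the high-res image into a grid of sub-images.
    B, C, H, W = image.shape
    locals_ = []
    for i in range(grid):
        for j in range(grid):
            crop = image[:, :, i*H//grid:(i+1)*H//grid, j*W//grid:(j+1)*W//grid]
            crop = F.interpolate(crop, size=(sub_res, sub_res), mode="bilinear")
            locals_.append(local_encoder(crop))
    return torch.cat([global_tokens] + locals_, dim=1)  # (B, N_total, dim)

tokens = encode_any_res(torch.randn(1, 3, 896, 896))
```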
@zhegan4
Zhe Gan
4 years
Two papers got accepted to #ECCV2020! (1) UNITER: , a SOTA pre-trained V+L model; (2) VALUE (Spotlight): , the first work on probing pre-trained V+L models. Joint work with @YenChunChen4 @LINJIEFUN @Licheng_Yu and others.
@zhegan4
Zhe Gan
3 years
We all know GPT-3 is a strong few-shot learner for NLP problems, but can it also benefit multimodal tasks? In this work, we provide an empirical study of GPT-3 for OK-VQA and show that using GPT-3 in a few-shot manner surpasses the supervised SOTA by +8.6 points (from 39.4 to 48.0). :)
@_akhaliq
AK
3 years
An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA abs: a simple yet effective method that prompts GPT-3 via the use of image captions. Using only 16 examples, PICa surpasses the supervised SOTA by an absolute +8.6 points on the OK-VQA dataset.
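To make the "prompting GPT-3 with image captions" recipe concrete, here is a sketch of how such a few-shot prompt can be assembled. The prompt wording and helper names are illustrative, not the exact PICa implementation.

```python
# Sketch of a caption-based few-shot VQA prompt in the spirit of PICa.
# The prompt format and the downstream LLM call are assumptions for illustration.

def build_prompt(examples, caption, question):
    """examples: list of (caption, question, answer) in-context exemplars."""
    header = "Please answer the question according to the context.\n\n"
    shots = "".join(
        f"Context: {c}\nQ: {q}\nA: {a}\n\n" for c, q, a in examples
    )
    query = f"Context: {caption}\nQ: {question}\nA:"
    return header + shots + query

few_shot = [
    ("A man holding a tennis racket on a court.", "What sport is this?", "tennis"),
]
prompt = build_prompt(
    few_shot,
    caption="A red double-decker bus driving down a street.",
    question="In which country are such buses common?",
)
# `prompt` would then be sent to a text-only LLM (GPT-3 in the paper).
print(prompt)
```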
@zhegan4
Zhe Gan
3 years
We are hiring 2022 summer PhD research interns who are interested in Vision+Language research. Please send an email to zhe.gan@microsoft.com or apply directly if you are interested. Job link:
@zhegan4
Zhe Gan
4 years
All our tutorial slides and video recordings are available now: . Feel free to check them out if you are interested in Vision+Language research. Joint efforts with @Licheng_Yu, @luowei_zhou, @LINJIEFUN, @YenChunChen4, Yu, JJ, and Xiaodong.
@zhegan4
Zhe Gan
4 years
We will host a tutorial on Recent Advances in Vision+Language Research at #CVPR2020: (Zoom link provided inside). Welcome to join us at 1:15pm (PST), June 15th! Organizers: @Licheng_Yu, @luowei_zhou, @LINJIEFUN, @YenChunChen4, Yu, JJ, Xiaodong, and me.
@zhegan4
Zhe Gan
3 years
So happy for this. Our ClipBERT paper is nominated for CVPR 2021 best paper! :)
@jayleicn
Jie Lei
3 years
ClipBERT is nominated for best paper! 😆
@zhegan4
Zhe Gan
2 years
Please join us in our tutorial session if you are interested in vision-language research, or just want to chat and say hi. We will cover VLP for image-text, video-text, and core vision tasks, and also VLP for text2img synthesis.
@LINJIEFUN
Linjie (Lindsey) Li
2 years
Interested in Vision Language Pre-training (VLP) but do not know where to start? Hard to track the rapid progress in VLP? Come and join us at our CVPR2022 VLP tutorial on 19th Jun (9am-5pm CDT) in person in New Orleans or virtually. #CVPR2022
@zhegan4
Zhe Gan
3 years
Come and join us in this new benchmark for video and language! More details about the challenge here:
@LINJIEFUN
Linjie (Lindsey) Li
3 years
🎉 Our VALUE paper has been accepted to the NeurIPS 2021 Datasets and Benchmarks Track. Only 25 days left for the VALUE Challenge 2021! Participate to win up to $22.5K in prizes! More details:
@zhegan4
Zhe Gan
4 years
Our VILLA paper, which uses adversarial training for V+L pre-training and fine-tuning, got accepted to @NeurIPSConf #NeurIPS2020 as a Spotlight paper with review scores 8/8/8/7. Welcome to check it out :) arXiv: @YenChunChen4 @LINJIEFUN @Eiri1114
@zhegan4
Zhe Gan
1 year
Interesting paper that uses high-dimensional sparse semantic representations to train a CLIP-style model with SOTA performance and better interpretability.
@alex8937
Chen Chen
1 year
Is it possible to build a VLM with sparse semantic representations that is as powerful as, or even better than, dense representations like CLIP and ALIGN? Excited to share STAIR 𓊍: Learning Sparse Text and Image Representation in Grounded Tokens 🧵👇
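To give a feel for why sparse, vocabulary-grounded embeddings are interpretable, here is a tiny generic sketch of retrieval with sparse lexical-style vectors; it is an illustration of the general idea, not STAIR's specific model.

```python
# Tiny illustration of retrieval with sparse embeddings: each image/text is a
# sparse weight vector over a token vocabulary, and relevance is a sparse dot
# product whose contributions come from shared, human-readable tokens.
import numpy as np
from scipy.sparse import csr_matrix

vocab = ["dog", "grass", "frisbee", "kitchen", "cat"]

def sparse_embed(weights: dict) -> csr_matrix:
    """weights: token -> non-negative importance produced by some encoder."""
    row = np.zeros(len(vocab))
    for tok, w in weights.items():
        row[vocab.index(tok)] = w
    return csr_matrix(row)

image_emb = sparse_embed({"dog": 1.3, "grass": 0.7, "frisbee": 0.9})
text_emb = sparse_embed({"dog": 1.1, "frisbee": 1.0})
score = image_emb.multiply(text_emb).sum()   # sparse dot product = relevance
print(score)  # contributions are attributable to "dog" and "frisbee"
```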
@zhegan4
Zhe Gan
7 months
Work led by Haoxuan and @HaotianZhang4AI; joint efforts with @Phyyysalis, Bowen, @MrZiruiWang, @lyon_cao, @yinfeiy.
@zhegan4
Zhe Gan
4 years
We will host a tutorial on Recent Advances in Vision+Language Research at #CVPR2020: (Zoom link provided inside). Welcome to join us at 1:15pm (PST), June 15th! Organizers: @Licheng_Yu, @luowei_zhou, @LINJIEFUN, @YenChunChen4, Yu, JJ, Xiaodong, and me.
@zhegan4
Zhe Gan
4 years
We achieve No. 1 on two challenging multilingual benchmarks: XTREME and XGLUE. Welcome to check our FILTER paper: . XTREME: XGLUE:
@zhegan4
Zhe Gan
3 years
Thank you to our awesome speakers: @Yezhou_Yang, @ashkamath20, @kohjingyu, @YinCui1, @yinfeiy, @DamienTeney, @jesu9, @rown, @endernewton, Hanwang Zhang, and Beer Changpinyo.
@zhegan4
Zhe Gan
4 years
If you are interested in mutual information and optimal transport, check our #ICML2020 papers: (i) CLUB: , an upper bound of MI that is deeply connected with contrastive learning; (ii) GOT: , used for cross-domain alignment (VQA, NMT).
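For readers who have not seen CLUB, the bound has roughly the following form (paraphrased from memory; see the ICML 2020 paper for the exact statement and conditions under which it is a true upper bound):

```latex
% CLUB-style upper bound on mutual information, written with the true
% conditional p(y|x); in practice p(y|x) is approximated by a learned
% variational distribution q_\theta(y|x), giving a sampled estimator.
I(X;Y) \;\le\;
  \mathbb{E}_{p(x,y)}\big[\log p(y\mid x)\big]
  \;-\; \mathbb{E}_{p(x)}\,\mathbb{E}_{p(y)}\big[\log p(y\mid x)\big]
```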
@zhegan4
Zhe Gan
7 months
@AB_StateSpeed Yes, we will have a demo hosted soon (still under legal review). Pretty fun to play with the model. Will let you know once the demo is hosted. :)
@zhegan4
Zhe Gan
5 years
Our new work "UNITER: Learning Universal Image-Text Representations" achieves SOTA on 9 datasets (VQA, VCR, NLVR, Image-Text Retrieval, Visual Entailment, Grounding). GREAT effort by everyone, especially Yen-Chun, @LINJIEFUN, and @Licheng_Yu!
@zhegan4
Zhe Gan
2 months
MOFI is trained on our newly collected Image-to-Entities dataset, with 1 billion images and 2 million distinct entities covering rich visual concepts in the wild. Both VeCLIP and MOFI checkpoints are released, providing yet another choice for your downstream tasks.
@zhegan4
Zhe Gan
7 months
How we generate the new visual-enriched (VeC) captions with LLMs for CLIP training. 👇
@zhegan4
Zhe Gan
2 months
VeCLIP is trained with DFN-5B and our new VeCap-300M data, collected from our scalable recaptioning pipeline.
@zhegan4
Zhe Gan
1 year
Presenting FIBER for vision-language pre-training at #NeurIPS2022. It performs fusion in the backbone with coarse-to-fine pre-training, and can be used for VQA, captioning, retrieval, grounding, object detection, etc. Code: w/ @ZiYiDou @ashkamath20 etc.
@ashkamath20
Aishwarya Kamath
2 years
Presenting FIBER (Fusion In-the-Backbone transformER) a novel V&L architecture w/ deep multi-modal fusion + a new pre-training strategy that first learns through coarse-grained image level objectives, and then obtains fine-grained understanding using image-text-box data.
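A toy illustration of the "fusion in the backbone" idea: each backbone block optionally cross-attends to the other modality instead of fusing only at the top of the network. Dimensions and module layout are made up; this is not the FIBER code.

```python
# Toy "fusion in the backbone" block: self-attention over one modality plus an
# optional cross-attention into the other modality's tokens. Illustrative only.
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, other=None):
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]
        if other is not None:                      # fuse inside the backbone
            x = x + self.cross_attn(self.norm2(x), other, other)[0]
        return x + self.mlp(self.norm3(x))

img_tokens = torch.randn(2, 196, 768)
txt_tokens = torch.randn(2, 32, 768)
block = FusionBlock()
fused_img = block(img_tokens, other=txt_tokens)    # image tokens attend to text
```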
@zhegan4
Zhe Gan
21 days
Led by Keen You, @HaotianZhang4AI, @yinfeiy, among others.
@zhegan4
Zhe Gan
7 months
Uniform performance boost across 3M-200M data scales. 👇
@zhegan4
Zhe Gan
16 days
@WenhuChen Nice work! Btw, our MM1 model can also do multi-image reasoning, such as the examples shown in Figure 2 and the Appendix. :)
@zhegan4
Zhe Gan
5 years
FreeLB is a general adversarial training method for NLP tasks. We show that it improves BERT and RoBERTa on the GLUE and CommonsenseQA benchmarks. Our single model also achieves SOTA on the ARC dataset for commonsense reasoning. Paper link:
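For those curious what adversarial training on word embeddings looks like, here is a simplified FreeLB-style training step. It assumes a HuggingFace-style model that accepts `inputs_embeds` and `labels`; the perturbation initialization, L∞ clipping, and hyperparameters are simplifications of the paper's full recipe.

```python
# Simplified FreeLB-style step: perturb the input embeddings, take a few ascent
# steps on the perturbation while accumulating (averaged) parameter gradients,
# then update the model once. Hyperparameters are illustrative.
import torch

def freelb_step(model, embeds, labels, optimizer, K=3, adv_lr=1e-1, eps=1e-1):
    delta = torch.zeros_like(embeds, requires_grad=True)
    optimizer.zero_grad()
    for _ in range(K):
        loss = model(inputs_embeds=embeds + delta, labels=labels).loss / K
        loss.backward()                   # accumulates grads on model parameters
        grad = delta.grad.detach()        # ascent direction for the perturbation
        delta = (delta + adv_lr * grad / (grad.norm() + 1e-12)).detach()
        delta = delta.clamp(-eps, eps).requires_grad_(True)  # keep delta small
    optimizer.step()                      # descend on the accumulated gradients
```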
@zhegan4
Zhe Gan
7 months
@JeffLaiZF (finally found your twitter account🤣)
@zhegan4
Zhe Gan
4 years
Happy to share our ACL paper: Discourse-Aware Neural Extractive Text Summarization. arXiv: code: We propose DiscoBERT. This BERT is not only able to "disco and dance" but, more importantly, able to do summarization. 😉
@zhegan4
Zhe Gan
2 months
tagging the right Bowen account @bwzhang_usc :)
@zhegan4
Zhe Gan
7 months
@HaotianZhang4AI @Phyyysalis @MrZiruiWang @lyon_cao @yinfeiy nice work led by @XyouH (finally found your twitter account) 🤣
@zhegan4
Zhe Gan
2 years
@multimodalart Yes, for now it is trained on a more restricted domain to show its potential application scenario. We believe the method will work on more general domains, but training the model takes time and is computationally heavy, and we are actively working on this. Stay tuned!
@zhegan4
Zhe Gan
7 months
Work led by Jeff and @HaotianZhang4AI; joint efforts with @Phyyysalis, @yinfeiy, Meng Cao, among many others.
@zhegan4
Zhe Gan
7 months
🔍 Compressing LLMs: the truth is rarely pure and never simple. 📔 Takeaway: perplexity, though widely used, can provide some "false promises" for LLM compression; our LLM-KICK unveils the favorable merits and unfortunate plights of SOTA compression methods.
@ajayjaiswal1994
Ajay Jaiswal
7 months
A Deep Dive Investigating True Merits and Limitations of SoTA Compression Algorithms. Glad to share our recent collaboration with great mentors @zhegan4 @YangYinfei @Phyyysalis #BowenZhang from @Apple AIML. @VITAGroupUT #LLMs #Compression
@zhegan4
Zhe Gan
4 years
Here is another ACL paper from our team: Distilling Knowledge Learned in BERT for Text Generation arXiv: code: coming soon! (busy with EMNLP and NeurIPS deadlines...) We propose to use Knowledge Distillation to let BERT speak 😀
@zhegan4
Zhe Gan
21 days
@ysu_nlp Thanks. We are also working on web AI agents. Can't wait to try your new benchmark on this.
@zhegan4
Zhe Gan
7 months
@NielsRogge [cont.] so we used a combined strategy (Vicuna+LLaVA) to get more visual-enriched captions for CLIP training. By doing so, we observe a clear performance boost. Nevertheless, we are indeed inspired by both of the works you mentioned. Thanks for reading our work in detail. :)
@zhegan4
Zhe Gan
7 months
@NielsRogge Some additional comments. When playing with CC3M/CC12M, where the captions are already of good quality, using an LLM for rewriting as in LaCLIP works. However, for other web-crawled data where the alt-texts can be noisier (Fig 1), we found that LLM rewriting alone is not enough.
@zhegan4
Zhe Gan
2 years
@BrianHorakh thanks for the suggestion. We will take this into consideration. :)
@zhegan4
Zhe Gan
2 years
@abunayla_ I think the base- and large-size models will be released; for the huge-size one, I am not sure what the policy is, as private data was used for model training... For now, there is no code repo yet.
@zhegan4
Zhe Gan
21 days
@TobyJLi Yeah, it's a great collaboration, and we are further improving the model together. We are both in Seattle, so we connected for the project. :)
@zhegan4
Zhe Gan
1 month
@kohjingyu Haha sorry for that