Staff Research Scientist
@Apple
AI/ML. Ex-Principal Researcher
@Microsoft
Azure AI. Working on building large-scale vision and multimodal foundation models.
🚀🚀Introducing Ferret, a new MLLM that can refer and ground anything anywhere at any granularity.
📰
1⃣ Ferret enables referring to an image region of any shape
2⃣ It often shows more precise understanding of small image regions than GPT-4V (Sec 5.6)
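Under the hood, free-form referring needs region features that are not tied to boxes. A toy sketch of one way to picture it (sample points inside a binary mask and pool their features; the function name and shapes are mine, not Ferret's actual code, which uses a learned spatial-aware visual sampler):

```python
import numpy as np

def region_feature(feature_map, mask, num_samples=32, rng=None):
    """Pool features from points sampled inside an arbitrary-shape mask.

    feature_map: (H, W, C) dense image features
    mask:        (H, W) boolean free-form region
    Returns a single (C,) vector summarizing the region.
    """
    rng = rng or np.random.default_rng(0)
    ys, xs = np.nonzero(mask)                      # all pixels inside the region
    idx = rng.choice(len(ys), size=min(num_samples, len(ys)), replace=False)
    sampled = feature_map[ys[idx], xs[idx]]        # (num_samples, C)
    return sampled.mean(axis=0)                    # average-pool into one token

# toy example: an 8x8 feature map, square region (any mask shape works)
H, W, C = 8, 8, 4
fmap = np.arange(H * W * C, dtype=float).reshape(H, W, C)
mask = np.zeros((H, W), dtype=bool)
mask[2:6, 2:6] = True
tok = region_feature(fmap, mask)
print(tok.shape)  # (4,)
```

Because sampling works on any binary mask, the same code path handles points, boxes, scribbles, and free-form shapes.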
NUWA-Infinity is our new multimodal generative model that generates high-quality images and videos from text or image input. It can generate images with resolution up to 38912 × 2048 pixels.
check demo here:
abs:
NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis
abs:
project page:
Compared to DALL·E and Imagen/Parti, it generates high-resolution images of arbitrary sizes and supports long-duration video generation.
💡Imagine a multimodal LLM that can understand your iPhone screen📱? Here it is: we present Ferret-UI, which can do precise referring and grounding on your iPhone screen, plus advanced reasoning. Free-form referring in, boxes out. Ferret itself will also be presented at ICLR.
Apple presents Ferret-UI
Grounded Mobile UI Understanding with Multimodal LLMs
Recent advancements in multimodal large language models (MLLMs) have been noteworthy, yet, these general-domain MLLMs often fall short in their ability to comprehend and interact effectively with
🌟 Introducing VeCLIP: Improving CLIP training via visual-enriched captions
📘
⛽️ Data is the fuel for CLIP training; however, alt-text can be noisy.
🚀🚀 By using Vicuna and LLaVA for text rewriting, VeCLIP boosts CLIP perf across 3M-200M data scales.
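The rewriting step can be pictured as prompt construction around the two text sources: the noisy alt-text and a generated visual caption. A hypothetical sketch (the prompt wording and function name are mine, not VeCLIP's actual pipeline):

```python
def build_rewrite_prompt(alt_text: str, visual_caption: str) -> str:
    """Compose a prompt asking an LLM (e.g. Vicuna) to fuse a noisy alt-text
    with a visual caption (e.g. from LLaVA) into one clean training caption."""
    return (
        "Rewrite the following into a single fluent image caption. "
        "Keep factual details from both sources and drop boilerplate.\n"
        f"Alt-text: {alt_text}\n"
        f"Visual caption: {visual_caption}\n"
        "Rewritten caption:"
    )

prompt = build_rewrite_prompt(
    alt_text="IMG_2041.jpg best price free shipping dog bed",
    visual_caption="A small brown dog sleeping on a plush round bed.",
)
print(prompt)
```

The point of fusing rather than replacing is that alt-text can still carry named entities (brands, places) that a captioner would miss.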
🚀🚀 Excited to release code & ckpt for our new image encoders.
1. VeCLIP:
83.1% zero-shot on ImageNet with ViT-H/14, trained on DFN-5B and 300M synthetic captions
2. MOFI:
SOTA on image retrieval, trained on 1B entity-annotated images.
GIT is our new multimodal foundation model, achieving new SOTA across 12 image/video captioning and QA tasks, including the first human parity on TextCaps. GIT reaches 88.79% accuracy on ImageNet-1k using a generative scheme, and can recognize logos, landmarks, characters, etc.
GIT: A Generative Image-to-text Transformer for Vision and Language
abs:
Our model surpasses human performance for the first time on TextCaps (138.2 vs. 125.5 CIDEr)
🌟 Introducing MGIE: using a multimodal LLM for instruction-based image editing
📜
🔍 (code will be released soon)
MGIE uses an MLLM to generate expressive instructions, achieving superior results compared with InstructPix2Pix.
🎁🎁 Ferret, our multimodal LLM that can refer and ground, is now open-sourced. Find our code and checkpoints below:
Merry Christmas and Happy new year!
work led by
@XyouH
@HaotianZhang4AI
@yinfeiy
During the summer, we hosted a special Vision-Lang Talk Series. With 11 invited speakers, we covered topics like captioning, VQA, ALIGN, MDETR, ViLD, MERLOT, MoCo etc.
Want to know more? 👇👇
YouTube:
@MSFTResearch
@jw2yang4ai
@ChunyuanLi
@PengchuanZ
🌟Besides Ferret-UI, we also upgrade Ferret to Ferret-v2 for natural images.
Several design choices were made along the way: (1) SPHINX-like any-resolution for referring and grounding; (2) a CLIP encoder for the global low-res image, and a DINOv2 encoder for the sub-images; (3) high-res dense alignment before the final SFT.
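The any-res idea can be sketched in a few lines: encode a low-res global view with one encoder and full-res sub-image crops with another, then concatenate the token sequences. A toy sketch with stub encoders (shapes and names are mine; Ferret-v2 uses CLIP for the global view and DINOv2 for sub-images, each emitting many tokens, not one):

```python
import numpy as np

def split_into_subimages(img, grid=2):
    """Split an (H, W, 3) image into grid x grid sub-images (any-res style)."""
    H, W, _ = img.shape
    h, w = H // grid, W // grid
    return [img[i*h:(i+1)*h, j*w:(j+1)*w] for i in range(grid) for j in range(grid)]

def encode(img, dim=8):
    """Stand-in for a vision encoder: one token per image (real models emit many)."""
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((3, dim))
    return img.reshape(-1, 3).mean(axis=0) @ proj   # (dim,)

img = np.random.default_rng(1).random((224, 224, 3))
global_lowres = img[::2, ::2]                        # cheap stand-in for resizing
tokens = [encode(global_lowres)] + [encode(s) for s in split_into_subimages(img)]
seq = np.stack(tokens)                               # (1 global + 4 sub) visual tokens
print(seq.shape)  # (5, 8)
```

The global view keeps cheap scene context while the sub-images preserve high-res detail, which is what dense referring and grounding need.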
Apple presents Ferret-v2
An Improved Baseline for Referring and Grounding with Large Language Models
While Ferret seamlessly integrates regional understanding into the Large Language Model (LLM) to facilitate its referring and grounding capability, it poses certain
Two papers got accepted to
#ECCV2020
! (1) UNITER: , a SOTA pre-trained V+L model; (2) VALUE (Spotlight): , the first work on probing pre-trained V+L models.
Joint work with:
@YenChunChen4
@LINJIEFUN
@Licheng_Yu
and others.
We all know GPT-3 is a strong few-shot learner for NLP problems, but can it also benefit multimodal tasks? In this work, we provide an empirical study of GPT-3 for OK-VQA, and show using GPT-3 in a few-shot manner surpasses supervised sota by +8.6 points (from 39.4 to 48.0). :)
An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA
abs:
a simple yet effective method that Prompts GPT3 via the use of Image Captions. Using only 16 examples, PICa surpasses the supervised sota by an absolute +8.6 points on the OK-VQA dataset
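The core trick is that each image is represented to GPT-3 purely as text via its caption. A sketch of the prompt assembly (the exact wording and formatting are my approximation of the paper's idea, not its released code):

```python
def build_pica_prompt(examples, test_caption, test_question):
    """Assemble a PICa-style few-shot prompt: each image is represented by its
    caption, so a text-only LLM like GPT-3 can answer the visual question."""
    header = "Please answer the question according to the context.\n\n"
    shots = "".join(
        f"Context: {cap}\nQ: {q}\nA: {a}\n\n" for cap, q, a in examples
    )
    return header + shots + f"Context: {test_caption}\nQ: {test_question}\nA:"

prompt = build_pica_prompt(
    examples=[("A man riding a wave on a surfboard.",
               "What sport is this?", "surfing")],
    test_caption="A red double-decker bus on a city street.",
    test_question="What city is this likely in?",
)
print(prompt)
```

With 16 such in-context examples, the model's completion after the final "A:" is taken as the answer; no gradient update is ever needed.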
We are hiring 2022 summer PhD research interns who are interested in Vision+Language research. Please send an email to zhe.gan@microsoft.com or apply directly if you are interested.
Job Link:
All our tutorial slides and video recordings are available now: . Feel free to check them out if you are interested in Vision+Language Research.
Joint efforts with:
@Licheng_Yu
,
@luowei_zhou
,
@LINJIEFUN
,
@YenChunChen4
, Yu, JJ and Xiaodong.
Please join us in our tutorial session if you are interested in vision-language research, or just want to chat and say hi. We will cover VLP for image-text, video-text, and core vision tasks, and also VLP for text2img synthesis.
Interested in Vision Language Pre-training (VLP) but do not know where to start? Hard to track the rapid progress in VLP? Come and join us at our CVPR2022 VLP tutorial on 19th Jun (9am-5pm CDT) in person in New Orleans or virtually.
#CVPR2022
🎉Our VALUE paper has been accepted to NeurIPS 2021 Dataset and Benchmark Track.
Only 25 days left for the VALUE Challenge 2021!
Participate to win up to $22.5K in prizes!
More details:
Is it possible to build a VLM with sparse semantic representation that is as powerful as, or even better than, dense representations like CLIP and ALIGN?
Excited to share STAIR 𓊍: Learning Sparse Text and Image Representation in Grounded Tokens
🧵👇
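A sparse lexical embedding can be pictured as projecting a dense feature onto a token vocabulary and keeping only a few activations, so each active dimension behaves like a grounded token. A minimal sketch in that spirit (the log(1+ReLU) activation and top-k sparsification are SPLADE-style conventions I'm borrowing for illustration, not necessarily STAIR's exact recipe):

```python
import numpy as np

def sparse_embed(dense, proj, top_k=4):
    """Map a dense embedding to a sparse, vocab-sized vector.

    dense: (D,) dense feature; proj: (D, V) projection to a token vocabulary.
    Keeps only the top_k activations so the vector is interpretable and
    indexable like text (each active dimension ~ one grounded token).
    """
    logits = dense @ proj
    acts = np.log1p(np.maximum(logits, 0.0))       # non-negative, sparsity-friendly
    keep = np.argsort(acts)[-top_k:]               # top-k sparsification
    out = np.zeros_like(acts)
    out[keep] = acts[keep]
    return out

rng = np.random.default_rng(0)
img = sparse_embed(rng.standard_normal(16), rng.standard_normal((16, 1000)))
txt = sparse_embed(rng.standard_normal(16), rng.standard_normal((16, 1000)))
print((img > 0).sum(), float(img @ txt))          # few active dims; sparse dot product
```

Because both modalities land in the same vocab-sized space, image-text similarity is a sparse dot product, which an inverted index can serve efficiently.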
If you are interested in Mutual Information and Optimal Transport, check our
#ICML2020
papers: (i) CLUB: , an upper bound of MI that is deeply connected with contrastive learning; (ii) GOT: , used for cross-domain alignment (VQA, NMT).
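For reference, the CLUB estimator as I recall it from the paper (worth checking the original for the exact statement):

```latex
% CLUB: Contrastive Log-ratio Upper Bound of mutual information,
% with a variational approximation q_\theta(y \mid x) of p(y \mid x):
\mathrm{I}_{\mathrm{CLUB}}(x;y) \;:=\;
  \mathbb{E}_{p(x,y)}\!\left[\log q_\theta(y \mid x)\right]
  \;-\; \mathbb{E}_{p(x)}\,\mathbb{E}_{p(y)}\!\left[\log q_\theta(y \mid x)\right]
```

When $q_\theta$ equals the true conditional $p(y \mid x)$, this is a guaranteed upper bound on $I(x;y)$; its positive-pair vs. shuffled-pair structure is what connects it to contrastive learning.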
@AB_StateSpeed
Yes, we will have demo hosted soon (still under legal review). Pretty fun to play with the model. Will let you know once demo is hosted. :)
Our new work "UNITER: LEARNING UNIVERSAL IMAGE-TEXT REPRESENTATIONS" SOTA on 9 datasets (VQA, VCR, NLVR, Img-Txt Retrieval, Vis-Entailment, Grounding). GREAT effort by everyone, especially
Yen-Chun,
@LINJIEFUN
, and
@Licheng_Yu
!
MOFI is trained with our newly collected Image-to-Entities dataset with 1 billion images and 2 million distinct entities, covering rich visual concepts in the wild.
Both VeCLIP and MOFI ckpts are released, providing yet another choice for your downstream tasks.
Presenting FIBER for vision-lang pre-training at
#NeurIPS2022
. It performs fusion in the backbone and coarse-to-fine pre-training, and can be used for VQA, captioning, retrieval, grounding, object detection, etc.
Code:
w/
@ZiYiDou
@ashkamath20
etc.
Presenting FIBER (Fusion In-the-Backbone transformER) a novel V&L architecture w/ deep multi-modal fusion + a new pre-training strategy that first learns through coarse-grained image level objectives, and then obtains fine-grained understanding using image-text-box data.
FreeLB is a general adversarial training method for NLP tasks. We show that it improves BERT and RoBERTa on GLUE and CommonsenseQA benchmarks. Our single model also achieves SOTA on ARC dataset for commonsense reasoning.
Paper link:
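The "free" multi-step idea is: take K ascent steps on an input perturbation while accumulating the parameter gradient from each step, then update the parameters once with the average. A toy sketch on a squared loss with analytic gradients (my own minimal illustration; the paper applies this to word embeddings with an L2-ball projection):

```python
import numpy as np

def freelb_step(w, x, y, K=3, lr=0.1, adv_lr=0.05, eps=0.1, rng=None):
    """One FreeLB-style update on a toy squared loss L = ((x+delta)@w - y)^2.

    Runs K ascent steps on the input perturbation delta (kept inside an
    eps-ball) while accumulating the parameter gradient, then applies the
    averaged gradient to w.
    """
    rng = rng or np.random.default_rng(0)
    delta = rng.uniform(-eps, eps, size=x.shape)    # random init inside the ball
    grad_accum = np.zeros_like(w)
    for _ in range(K):
        err = (x + delta) @ w - y                   # residual at perturbed input
        grad_w = 2 * err * (x + delta)              # dL/dw, accumulated over steps
        grad_delta = 2 * err * w                    # dL/ddelta, for the ascent step
        grad_accum += grad_w
        delta = np.clip(delta + adv_lr * grad_delta, -eps, eps)  # projected ascent
    return w - lr * grad_accum / K                  # descend on the averaged grad

w = np.array([1.0, -2.0])
x, y = np.array([0.5, 1.5]), 0.0
w_new = freelb_step(w, x, y)
print(w_new)
```

Compared with running K separate PGD attacks, reusing each inner step's parameter gradient makes the extra adversarial steps nearly free, hence the name.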
Happy to share our ACL paper: Discourse-Aware Neural Extractive Text Summarization
arXiv:
code:
We propose DiscoBERT. This BERT can not only "disco and dance" but, more importantly, do summarization. 😉
@multimodalart
Yes, for now it is trained on a more restricted domain to show a potential application scenario. We believe the method will work on more general domains, but training the model takes time and is computationally heavy; we are actively working on this. Stay tuned!
🔍 Compressing LLMs: The truth is rarely pure and never simple:
📔 Takeaway: Perplexity, though widely used, can provide some "false promises" for LLM compression, our LLM-KICK unveils favorable merits and unfortunate plights of SoTA compression methods.
Here is another ACL paper from our team: Distilling Knowledge Learned in BERT for Text Generation
arXiv:
code: coming soon! (busy with EMNLP and NeurIPS deadlines...)
We propose to use Knowledge Distillation to let BERT speak 😀
@NielsRogge
[cont.] so we used a combined strategy (Vicuna+LLaVA) to get more visual-enriched captions for CLIP training. By doing so, we observe a clear performance boost. Nevertheless, we are indeed inspired by both of the works you mentioned. Thanks for reading our work in detail. :)
@NielsRogge
Some additional comments. When playing with CC3M/CC12M, where the captions are already of good quality, using an LLM for rewriting as in LaCLIP works. However, for other web-crawled data where alt-texts can be noisier (Fig 1), we found that LLM rewriting alone is not enough.
@abunayla_
I think the models at base and large sizes will be released; for the huge-size one, I am not sure what the policy is, as private data was used for training... For now, there is no code repo yet.
@TobyJLi
Yeah, it's a great collaboration, and we are further improving the model together. We are both in Seattle, so we connected for the project. :)