Staff Research Scientist
@Apple
AI/ML. Ex-Principal Researcher
@Microsoft
Azure AI. Working on building large-scale vision and multimodal foundation models.
🚀🚀Introducing Ferret, a new MLLM that can refer and ground anything anywhere at any granularity.
📰
1⃣ Ferret enables referring to an image region of any shape
2⃣ It often shows more precise understanding of small image regions than GPT-4V (Sec 5.6)
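Under the hood, free-form referring needs region features that are not tied to boxes. A toy sketch of one way to picture it (sample points inside a binary mask and pool their features; the function name and shapes are mine, not Ferret's actual code, which uses a learned spatial-aware visual sampler):

```python
import numpy as np

def region_feature(feature_map, mask, num_samples=32, rng=None):
    """Pool features from points sampled inside an arbitrary-shape mask.

    feature_map: (H, W, C) dense image features
    mask:        (H, W) boolean free-form region
    Returns a single (C,) vector summarizing the region.
    """
    rng = rng or np.random.default_rng(0)
    ys, xs = np.nonzero(mask)                      # all pixels inside the region
    idx = rng.choice(len(ys), size=min(num_samples, len(ys)), replace=False)
    sampled = feature_map[ys[idx], xs[idx]]        # (num_samples, C)
    return sampled.mean(axis=0)                    # average-pool into one token

# toy example: an 8x8 feature map, square region (any mask shape works)
H, W, C = 8, 8, 4
fmap = np.arange(H * W * C, dtype=float).reshape(H, W, C)
mask = np.zeros((H, W), dtype=bool)
mask[2:6, 2:6] = True
tok = region_feature(fmap, mask)
print(tok.shape)  # (4,)
```

Because sampling works on any binary mask, the same code path handles points, boxes, scribbles, and free-form shapes.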
NUWA-Infinity is our new multimodal generative model that generates high-quality images and videos from text or image input. It can generate images with resolution up to 38912 × 2048 pixels.
check demo here:
abs:
NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis
abs:
project page:
Compared to DALL·E and Imagen/Parti, it generates high-resolution images of arbitrary sizes and supports long-duration video generation.
💡Imagine a multimodal LLM that can understand your iPhone screen📱? Here it is: we present Ferret-UI, which can do precise referring and grounding on your iPhone screen, plus advanced reasoning. Free-form referring in, boxes out. Ferret itself will also be presented at ICLR.
Apple presents Ferret-UI
Grounded Mobile UI Understanding with Multimodal LLMs
Recent advancements in multimodal large language models (MLLMs) have been noteworthy, yet, these general-domain MLLMs often fall short in their ability to comprehend and interact effectively with
🌟 Introducing VeCLIP: Improving CLIP training via visual-enriched captions
📘
⛽️ Data is the fuel for CLIP training; however, alt-text can be noisy.
🚀🚀 By using Vicuna and LLaVA for text rewriting, VeCLIP boosts CLIP perf across 3M-200M data scales.
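The rewriting step can be pictured as prompt construction around the two text sources: the noisy alt-text and a generated visual caption. A hypothetical sketch (the prompt wording and function name are mine, not VeCLIP's actual pipeline):

```python
def build_rewrite_prompt(alt_text: str, visual_caption: str) -> str:
    """Compose a prompt asking an LLM (e.g. Vicuna) to fuse a noisy alt-text
    with a visual caption (e.g. from LLaVA) into one clean training caption."""
    return (
        "Rewrite the following into a single fluent image caption. "
        "Keep factual details from both sources and drop boilerplate.\n"
        f"Alt-text: {alt_text}\n"
        f"Visual caption: {visual_caption}\n"
        "Rewritten caption:"
    )

prompt = build_rewrite_prompt(
    alt_text="IMG_2041.jpg best price free shipping dog bed",
    visual_caption="A small brown dog sleeping on a plush round bed.",
)
print(prompt)
```

The point of fusing rather than replacing is that alt-text can still carry named entities (brands, places) that a captioner would miss.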
🚀🚀 Excited to release code & ckpt for our new image encoders.
1. VeCLIP:
83.1% zero-shot on ImageNet with ViT-H/14, trained on DFN-5B and 300M synthetic captions
2. MOFI:
SOTA on image retrieval, trained on 1B entity-annotated images.
GIT is our new multimodal foundation model, achieving new SOTA across 12 image/video captioning and QA tasks, including the first human parity on TextCaps. GIT reaches 88.79% accuracy on ImageNet-1k using a generative scheme, and can recognize logos, landmarks, characters, etc.
GIT: A Generative Image-to-text Transformer for Vision and Language
abs:
Our model surpasses human performance for the first time on TextCaps (138.2 vs. 125.5 CIDEr)
🌟 Introducing MGIE: using a multimodal LLM for instruction-based image editing
📜
🔍 (code will be released soon)
MGIE uses an MLLM to generate expressive instructions, achieving superior results compared with InstructPix2Pix.
🎁🎁 Ferret, our multimodal LLM that can refer and ground, is now open-sourced. Find our code and checkpoints below:
Merry Christmas and Happy new year!
work led by
@XyouH
@HaotianZhang4AI
@yinfeiy
During the summer, we hosted a special Vision-Lang Talk Series. With 11 invited speakers, we covered topics like captioning, VQA, ALIGN, MDETR, ViLD, MERLOT, MoCo etc.
Want to know more? 👇👇
YouTube:
@MSFTResearch
@jw2yang4ai
@ChunyuanLi
@PengchuanZ
🌟Besides Ferret-UI, we also upgrade Ferret to Ferret-v2 for natural images.
Several design choices were made along the way: (1) SPHINX-like any-resolution for referring and grounding; (2) a CLIP encoder for the global low-res image, and a DINOv2 encoder for the sub-images; (3) high-res dense alignment before the final SFT.
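The any-res idea can be sketched in a few lines: encode a low-res global view with one encoder and full-res sub-image crops with another, then concatenate the token sequences. A toy sketch with stub encoders (shapes and names are mine; Ferret-v2 uses CLIP for the global view and DINOv2 for sub-images, each emitting many tokens, not one):

```python
import numpy as np

def split_into_subimages(img, grid=2):
    """Split an (H, W, 3) image into grid x grid sub-images (any-res style)."""
    H, W, _ = img.shape
    h, w = H // grid, W // grid
    return [img[i*h:(i+1)*h, j*w:(j+1)*w] for i in range(grid) for j in range(grid)]

def encode(img, dim=8):
    """Stand-in for a vision encoder: one token per image (real models emit many)."""
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((3, dim))
    return img.reshape(-1, 3).mean(axis=0) @ proj   # (dim,)

img = np.random.default_rng(1).random((224, 224, 3))
global_lowres = img[::2, ::2]                        # cheap stand-in for resizing
tokens = [encode(global_lowres)] + [encode(s) for s in split_into_subimages(img)]
seq = np.stack(tokens)                               # (1 global + 4 sub) visual tokens
print(seq.shape)  # (5, 8)
```

The global view keeps cheap scene context while the sub-images preserve high-res detail, which is what dense referring and grounding need.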
Apple presents Ferret-v2
An Improved Baseline for Referring and Grounding with Large Language Models
While Ferret seamlessly integrates regional understanding into the Large Language Model (LLM) to facilitate its referring and grounding capability, it poses certain
Two papers got accepted to
#ECCV2020
! (1) UNITER: , a SOTA pre-trained V+L model; (2) VALUE (Spotlight): , the first work on probing pre-trained V+L models.
Joint work with:
@YenChunChen4
@LINJIEFUN
@Licheng_Yu
and others.
We all know GPT-3 is a strong few-shot learner for NLP problems, but can it also benefit multimodal tasks? In this work, we provide an empirical study of GPT-3 for OK-VQA, and show using GPT-3 in a few-shot manner surpasses supervised sota by +8.6 points (from 39.4 to 48.0). :)
An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA
abs:
a simple yet effective method that Prompts GPT3 via the use of Image Captions. Using only 16 examples, PICa surpasses the supervised sota by an absolute +8.6 points on the OK-VQA dataset
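The core trick is that each image is represented to GPT-3 purely as text via its caption. A sketch of the prompt assembly (the exact wording and formatting are my approximation of the paper's idea, not its released code):

```python
def build_pica_prompt(examples, test_caption, test_question):
    """Assemble a PICa-style few-shot prompt: each image is represented by its
    caption, so a text-only LLM like GPT-3 can answer the visual question."""
    header = "Please answer the question according to the context.\n\n"
    shots = "".join(
        f"Context: {cap}\nQ: {q}\nA: {a}\n\n" for cap, q, a in examples
    )
    return header + shots + f"Context: {test_caption}\nQ: {test_question}\nA:"

prompt = build_pica_prompt(
    examples=[("A man riding a wave on a surfboard.",
               "What sport is this?", "surfing")],
    test_caption="A red double-decker bus on a city street.",
    test_question="What city is this likely in?",
)
print(prompt)
```

With 16 such in-context examples, the model's completion after the final "A:" is taken as the answer; no gradient update is ever needed.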
We are hiring 2022 summer PhD research interns who are interested in Vision+Language research. Please send an email to zhe.gan@microsoft.com or apply directly if you are interested.
Job Link:
All our tutorial slides and video recordings are available now: . Feel free to check them out if you are interested in Vision+Language Research.
Joint efforts with:
@Licheng_Yu
,
@luowei_zhou
,
@LINJIEFUN
,
@YenChunChen4
, Yu, JJ and Xiaodong.
Please join us in our tutorial session if you are interested in vision-language research, or just want to chat and say hi. We will cover VLP for image-text, video-text, and core vision tasks, and also VLP for text2img synthesis.
Interested in Vision Language Pre-training (VLP) but do not know where to start? Hard to track the rapid progress in VLP? Come and join us at our CVPR2022 VLP tutorial on 19th Jun (9am-5pm CDT) in person in New Orleans or virtually.
#CVPR2022
🎉Our VALUE paper has been accepted to NeurIPS 2021 Dataset and Benchmark Track.
Only 25 days left for the VALUE Challenge 2021!
Participate to win up to $22.5K in prizes!
More details:
Is it possible to build a VLM with sparse semantic representation that is as powerful as, or even better than, dense representations like CLIP and ALIGN?
Excited to share STAIR 𓊍: Learning Sparse Text and Image Representation in Grounded Tokens
🧵👇
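A sparse lexical embedding can be pictured as projecting a dense feature onto a token vocabulary and keeping only a few activations, so each active dimension behaves like a grounded token. A minimal sketch in that spirit (the log(1+ReLU) activation and top-k sparsification are SPLADE-style conventions I'm borrowing for illustration, not necessarily STAIR's exact recipe):

```python
import numpy as np

def sparse_embed(dense, proj, top_k=4):
    """Map a dense embedding to a sparse, vocab-sized vector.

    dense: (D,) dense feature; proj: (D, V) projection to a token vocabulary.
    Keeps only the top_k activations so the vector is interpretable and
    indexable like text (each active dimension ~ one grounded token).
    """
    logits = dense @ proj
    acts = np.log1p(np.maximum(logits, 0.0))       # non-negative, sparsity-friendly
    keep = np.argsort(acts)[-top_k:]               # top-k sparsification
    out = np.zeros_like(acts)
    out[keep] = acts[keep]
    return out

rng = np.random.default_rng(0)
img = sparse_embed(rng.standard_normal(16), rng.standard_normal((16, 1000)))
txt = sparse_embed(rng.standard_normal(16), rng.standard_normal((16, 1000)))
print((img > 0).sum(), float(img @ txt))          # few active dims; sparse dot product
```

Because both modalities land in the same vocab-sized space, image-text similarity is a sparse dot product, which an inverted index can serve efficiently.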
If you are interested in Mutual Information and Optimal Transport, check our
#ICML2020
papers: (i) CLUB: , an upper bound of MI that is deeply connected with contrastive learning; (ii) GOT: , used for cross-domain alignment (VQA, NMT).
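For reference, the CLUB estimator as I recall it from the paper (worth checking the original for the exact statement):

```latex
% CLUB: Contrastive Log-ratio Upper Bound of mutual information,
% with a variational approximation q_\theta(y \mid x) of p(y \mid x):
\mathrm{I}_{\mathrm{CLUB}}(x;y) \;:=\;
  \mathbb{E}_{p(x,y)}\!\left[\log q_\theta(y \mid x)\right]
  \;-\; \mathbb{E}_{p(x)}\,\mathbb{E}_{p(y)}\!\left[\log q_\theta(y \mid x)\right]
```

When $q_\theta$ equals the true conditional $p(y \mid x)$, this is a guaranteed upper bound on $I(x;y)$; its positive-pair vs. shuffled-pair structure is what connects it to contrastive learning.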
@AB_StateSpeed
Yes, we will have demo hosted soon (still under legal review). Pretty fun to play with the model. Will let you know once demo is hosted. :)
Our new work "UNITER: LEARNING UNIVERSAL IMAGE-TEXT REPRESENTATIONS" SOTA on 9 datasets (VQA, VCR, NLVR, Img-Txt Retrieval, Vis-Entailment, Grounding). GREAT effort by everyone, especially
Yen-Chun,
@LINJIEFUN
, and
@Licheng_Yu
!
MOFI is trained with our newly collected Image-to-Entities dataset with 1 billion images and 2 million distinct entities, covering rich visual concepts in the wild.
Both VeCLIP and MOFI ckpts are released, providing yet another choice for your downstream tasks.
Presenting FIBER for vision-lang pre-training at
#NeurIPS2022
. It performs fusion in the backbone and coarse-to-fine pre-training, and can be used for VQA, captioning, retrieval, grounding, object detection, etc.
Code:
w/
@ZiYiDou
@ashkamath20
etc.
Presenting FIBER (Fusion In-the-Backbone transformER) a novel V&L architecture w/ deep multi-modal fusion + a new pre-training strategy that first learns through coarse-grained image level objectives, and then obtains fine-grained understanding using image-text-box data.
FreeLB is a general adversarial training method for NLP tasks. We show that it improves BERT and RoBERTa on GLUE and CommonsenseQA benchmarks. Our single model also achieves SOTA on ARC dataset for commonsense reasoning.
Paper link:
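The "free" multi-step idea is: take K ascent steps on an input perturbation while accumulating the parameter gradient from each step, then update the parameters once with the average. A toy sketch on a squared loss with analytic gradients (my own minimal illustration; the paper applies this to word embeddings with an L2-ball projection):

```python
import numpy as np

def freelb_step(w, x, y, K=3, lr=0.1, adv_lr=0.05, eps=0.1, rng=None):
    """One FreeLB-style update on a toy squared loss L = ((x+delta)@w - y)^2.

    Runs K ascent steps on the input perturbation delta (kept inside an
    eps-ball) while accumulating the parameter gradient, then applies the
    averaged gradient to w.
    """
    rng = rng or np.random.default_rng(0)
    delta = rng.uniform(-eps, eps, size=x.shape)    # random init inside the ball
    grad_accum = np.zeros_like(w)
    for _ in range(K):
        err = (x + delta) @ w - y                   # residual at perturbed input
        grad_w = 2 * err * (x + delta)              # dL/dw, accumulated over steps
        grad_delta = 2 * err * w                    # dL/ddelta, for the ascent step
        grad_accum += grad_w
        delta = np.clip(delta + adv_lr * grad_delta, -eps, eps)  # projected ascent
    return w - lr * grad_accum / K                  # descend on the averaged grad

w = np.array([1.0, -2.0])
x, y = np.array([0.5, 1.5]), 0.0
w_new = freelb_step(w, x, y)
print(w_new)
```

Compared with running K separate PGD attacks, reusing each inner step's parameter gradient makes the extra adversarial steps nearly free, hence the name.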
Happy to share our ACL paper: Discourse-Aware Neural Extractive Text Summarization
arXiv:
code:
We propose DiscoBERT. This BERT can not only "disco and dance" but, more importantly, do summarization. 😉
@multimodalart
Yes, for now it is trained on a more restricted domain to show a potential application scenario. We believe the method will work on more general domains, but training the model takes time and is computationally heavy; we are actively working on this. Stay tuned!
🔍 Compressing LLMs: The truth is rarely pure and never simple:
📔 Takeaway: Perplexity, though widely used, can provide some "false promises" for LLM compression, our LLM-KICK unveils favorable merits and unfortunate plights of SoTA compression methods.
Here is another ACL paper from our team: Distilling Knowledge Learned in BERT for Text Generation
arXiv:
code: coming soon! (busy with EMNLP and NeurIPS deadlines...)
We propose to use Knowledge Distillation to let BERT speak 😀
@NielsRogge
[cont.] so we used a combined strategy (Vicuna+LLaVA) to get more visual-enriched captions for CLIP training. By doing so, we observe a clear performance boost. Nevertheless, we are indeed inspired by both of the works you mentioned. Thanks for reading our work in detail. :)
@NielsRogge
Some additional comments. When playing with CC3M/CC12M, where the captions are already of good quality, using an LLM for rewriting as in LaCLIP works. However, for other web-crawled data where alt-texts can be noisier (Fig 1), we found that LLM rewriting alone is not enough.
@abunayla_
I think the models at base and large sizes will be released; for the huge-size one, I am not sure what the policy is, as private data was used for training... For now, there is no code repo yet.
@TobyJLi
Yeah, it's a great collaboration, and we are further improving the model together. We are both in Seattle, so we connected for the project. :)