Introducing Veo: our most capable generative video model. 🎥
It can create high-quality 1080p clips that go beyond 60 seconds.
From photorealism to surrealism and animation, it can tackle a range of cinematic styles. 🧵
#GoogleIO
After 4 wonderful years at Berkeley, I'm Phinally Done :) Huge thanks to everyone who supported me, especially my advisor
@trevordarrell
This week, I'm starting as a Research Scientist at Google Labs with Steve Seitz! Excited to continue my research on video understanding :)
Want to create short summaries of long videos? Check out our work, “CLIP-It! Language-Guided Video Summarization”. Joint work with Anna Rohrbach and
@trevordarrell
Paper:
Project Page:
Results:
Excited to share our work, “Strumming to the Beat: Audio-Conditioned Contrastive Video Textures” with Shiry Ginosar,
@andrewhowens
, Alyosha Efros, and
@trevordarrell
Website:
Talk:
Paper:
It's always great to look back on the year in a year-in-review blogpost with
@JeffDean
& James Manyika. It's been an amazingly productive year for us, doing awesome research, shipping products and advancing science - 2024 is going to be incredible!
Never fly
@lufthansa
!! They canceled my flight and haven't issued a refund yet. It's been 6 months and I haven't received a response from their customer relations team. How can airlines get away with this?!
🚨I'll be
@ICCVConference
next week, presenting my work at the
@cveu_workshop
on Oct 2nd! Excited to share two new works on learning from instructional videos. Please stop by for the talk/poster or reach out if you'd like to connect!
Oral: 1:45 - 2:30 PM
Poster: 12 - 1:45 PM
We’re excited to announce 𝗚𝗲𝗺𝗶𝗻𝗶:
@Google
’s largest and most capable AI model.
Built to be natively multimodal, it can understand and operate across text, code, audio, image and video - and achieves state-of-the-art performance across many tasks. 🧵
🎥Introducing Veo, our new generative video model from
@GoogleDeepMind
.
With just a text, image, or video prompt, you can create and edit high-quality videos over 60 seconds long in a range of visual styles. Join the waitlist in Labs to try it out in our new experimental tool, VideoFX.
#GoogleIO
@colorado_reed
and I are starting an AI + Climate Change reading group to meet every two weeks, Tuesday at 5 pm. Here’s a website with more info: You can join the meeting announcement list there if you’re interested in attending!
We’re excited to announce the Berkeley AI Research Climate Initiative! The BCI aims to foster fundamental AI research by working directly on impactful problems related to the most pressing issue of our time: climate change.
Will be in person
@eccvconf
presenting at the poster session on Tuesday, Oct 25th, 3:30-5:30 PM, Hall B poster 12! Please do stop by if you’d like to learn more about our work or just chat!
Join our
#AI4Climate
reading group today
@5PM
PST to hear from
@sarameghanbeery
!
“Computer Vision for Global-Scale Biodiversity Monitoring - Scaling Geospatial and Taxonomic Coverage Using Contextual Clues”
Where:
More info:
A generic video summary is an abridged version of a video that conveys the whole story and features the most important scenes. Yet, the importance of scenes in a video is often subjective, and users should have the option of customizing the summary by using language.
Although not new (Siamese nets are from the '90s), self-supervised image representation learning (SSL) has been getting a lot of attention recently, as it gets closer to supervised-learning performance.
⬇️ A thread trying to summarize recent advances and their challenges ⬇️
We show that our model outperforms baselines on human perceptual scores, can handle diverse input videos, and can combine semantic and audiovisual cues in order to synthesize videos that synchronize well with an audio signal.
Enjoy more examples here:
We draw on recent techniques from self-supervised learning to learn this distance metric, allowing us to compare frames in a manner that scales to more challenging dynamics, and to condition on other data, such as audio.
Our work is inspired by Video Textures, which showed that new videos could be generated from a single one by stitching its frames together in a novel yet consistent order. However, due to its use of hand-designed distance metrics, it was limited to simple, repetitive videos.
Existing models for generic summarization have not exploited available language models, which can serve as an effective prior for saliency. We propose a single framework for addressing both generic and query-focused video summarization, which are typically approached separately.
CLIP-It is a language-guided multimodal transformer that learns to score frames in a video based on their importance relative to one another and their correlation with a user-defined query (query-focused summarization) or an automatically generated dense video caption (generic summarization).
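For intuition, here is a minimal sketch of how such language-conditioned frame scoring could be wired up. This is an illustration, not the released CLIP-It code: the module sizes, the cross-attention design, and the use of CLIP-style embeddings as inputs are all assumptions.

```python
# Hypothetical sketch of language-guided frame scoring (not the official CLIP-It code).
import torch
import torch.nn as nn

class LanguageGuidedFrameScorer(nn.Module):
    """Scores video frames conditioned on a text query (or dense captions)."""

    def __init__(self, dim=512, num_heads=8, num_layers=2):
        super().__init__()
        # Cross-attention: frame features attend to the language embedding.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Self-attention over frames to score them relative to one another.
        encoder_layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.frame_encoder = nn.TransformerEncoder(encoder_layer, num_layers)
        self.score_head = nn.Linear(dim, 1)

    def forward(self, frame_feats, text_feats):
        # frame_feats: (B, T, D) image embeddings, one per frame
        # text_feats:  (B, L, D) text embeddings of the query / captions
        fused, _ = self.cross_attn(frame_feats, text_feats, text_feats)
        contextualized = self.frame_encoder(fused)
        return self.score_head(contextualized).squeeze(-1)  # (B, T) importance scores

# Toy usage with random features standing in for real CLIP embeddings.
scorer = LanguageGuidedFrameScorer()
frames = torch.randn(1, 120, 512)   # 120 frames
query = torch.randn(1, 16, 512)     # tokenized query embedding
scores = scorer(frames, query)
summary_idx = scores.topk(k=15, dim=1).indices  # keep the 15 highest-scoring frames
```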
We learn representations for video frames and frame-to-frame transition probabilities by fitting a video-specific model trained using contrastive learning. Our model also naturally extends to an audio-conditioned setting without requiring any finetuning.
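Read one way, a contrastive (InfoNCE-style) objective pulls each frame's embedding toward its true successor, and a softmax over pairwise similarities then gives frame-to-frame transition probabilities. A rough sketch under those assumptions, not the paper's implementation:

```python
# Rough sketch of contrastive frame embeddings and transition probabilities
# (an illustration of the idea, not the paper's code).
import torch
import torch.nn.functional as F

def contrastive_transition_loss(frame_emb, temperature=0.1):
    """InfoNCE-style loss: frame t should be most similar to frame t+1.

    frame_emb: (T, D) embeddings of consecutive frames from one video.
    """
    z = F.normalize(frame_emb, dim=-1)
    sim = z[:-1] @ z[1:].T / temperature     # (T-1, T-1) pairwise similarities
    targets = torch.arange(sim.size(0))      # positive = the true next frame
    return F.cross_entropy(sim, targets)

def transition_probabilities(frame_emb, temperature=0.1):
    """Softmax over similarities -> probability of jumping from frame i to frame j."""
    z = F.normalize(frame_emb, dim=-1)
    return F.softmax(z @ z.T / temperature, dim=-1)   # (T, T)

# Toy usage: embeddings from a small per-video encoder would go here.
emb = torch.randn(200, 128, requires_grad=True)
loss = contrastive_transition_loss(emb)
loss.backward()
probs = transition_probabilities(emb.detach())
next_frame = torch.multinomial(probs[0], num_samples=1)  # sample a transition from frame 0
```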
The Transformer design enables effective contextualization across frames. We demonstrate the impact of language guidance on generic summarization. We establish the new state-of-the-art on both generic and query-focused datasets in supervised and unsupervised settings.
PCs update:
Preparing for author notifications tomorrow.
To avoid CMT traffic, we will first publish the list of accepted papers, as is customary. Expect this late afternoon PST. Will tweet!
We will then publish the status, meta-reviews, and final reviews on CMT - the platform has warned us about traffic!
@icra2020
Many countries/universities have imposed travel restrictions. Consulates are closed and we cannot apply for visas now even if these bans were to be lifted later. Please consider those who have health issues and don't want to travel. Could we get a refund on the registration?
We propose an instructional video summarization network that combines a context-aware temporal video encoder and a segment scoring transformer. Using pseudo summaries as weak supervision, our network constructs a visual summary for any instructional video using video and speech.
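A compact sketch of the two components named here: a temporal encoder that contextualizes frame features, followed by a transformer that scores pooled segments. The layer sizes, mean-pooling, and weakly supervised loss below are assumptions for illustration, not the released model.

```python
# Illustrative sketch of a context-aware temporal encoder + segment scoring
# transformer (a simplified reading of the description, not the released model).
import torch
import torch.nn as nn

class SegmentScorer(nn.Module):
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.temporal_encoder = nn.TransformerEncoder(layer, num_layers=2)    # frame-level context
        self.segment_transformer = nn.TransformerEncoder(layer, num_layers=2) # segment-level scoring
        self.head = nn.Linear(dim, 1)

    def forward(self, frame_feats, segment_ids):
        # frame_feats: (B, T, D) visual (and/or speech) features per frame
        # segment_ids: (T,) long tensor mapping each frame to a segment index
        ctx = self.temporal_encoder(frame_feats)
        num_segments = int(segment_ids.max()) + 1
        # Mean-pool contextualized frame features within each segment.
        seg_feats = torch.stack(
            [ctx[:, segment_ids == s].mean(dim=1) for s in range(num_segments)], dim=1
        )
        scored = self.segment_transformer(seg_feats)
        return self.head(scored).squeeze(-1)  # (B, num_segments) importance scores

# Toy usage with pseudo-summary labels as weak supervision.
model = SegmentScorer()
frames = torch.randn(1, 300, 512)
segment_ids = torch.arange(300) // 30          # 10 segments of 30 frames each
pseudo_labels = torch.randint(0, 2, (1, 10)).float()
scores = model(frames, segment_ids)
loss = nn.functional.binary_cross_entropy_with_logits(scores, pseudo_labels)
```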
Evaluation: We collect a high-quality test set, WikiHow Summaries, by scraping WikiHow articles that contain video demonstrations and visual depictions of steps.
We’re especially excited to have folks from the earth, atmospheric, and climate sciences join from Berkeley, so even if it’s not directly related to your research, it will be a nice chance to look for potential connections to your work. This is open to folks outside Berkeley too!
Anyone who has tried to follow a recipe from a YouTube video would agree that most videos contain irrelevant filler content! In this work, we introduce an approach for creating short visual summaries of instructional videos containing only the most important steps.
Data: Existing video summarization datasets rely on manual frame-level annotations, making them subjective and limited in size. To overcome this, we first automatically generate pseudo summaries for a corpus of instructional videos by exploiting two key assumptions:
(i) relevant steps are likely to appear in multiple videos of the same task (Task Relevance), and (ii) they are more likely to be described by the demonstrator verbally (Cross-Modal Saliency).
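As a toy illustration of how these two cues could be turned into pseudo-summary labels; the scoring inputs, weights, and threshold below are made-up assumptions, not the paper's procedure:

```python
# Toy illustration of combining Task Relevance and Cross-Modal Saliency into
# pseudo-summary labels; the weighting and threshold are assumptions.
from dataclasses import dataclass

@dataclass
class Step:
    task_relevance: float       # e.g., fraction of same-task videos containing a similar step
    crossmodal_saliency: float  # e.g., similarity between the step's visuals and the narration

def pseudo_label(steps, weight=0.5, threshold=0.6):
    """Mark a step as summary-worthy if the weighted combination of cues is high."""
    labels = []
    for step in steps:
        score = weight * step.task_relevance + (1 - weight) * step.crossmodal_saliency
        labels.append(1 if score >= threshold else 0)
    return labels

# Example: three candidate steps from one instructional video.
steps = [Step(0.9, 0.8), Step(0.2, 0.3), Step(0.7, 0.6)]
print(pseudo_label(steps))  # -> [1, 0, 1]
```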