BREAKING
OpenAI released an implementation of Consistency Models
consistency models, a new family of generative models that achieve high sample quality without adversarial training. They support fast one-step generation by design, while still allowing for few-step sampling to…
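For intuition, the one-step vs. few-step sampling trade-off can be sketched like this (a minimal NumPy sketch; `consistency_fn` and the noise levels are toy stand-ins, not the released model, and the re-noising step omits the paper's exact variance bookkeeping):

```python
import numpy as np

T_MAX = 80.0  # largest noise level (sigma_max in typical diffusion notation)

def consistency_fn(x, t):
    # Hypothetical trained consistency function f(x, t): maps a noisy
    # sample at noise level t directly to an estimate of clean data.
    # This stand-in just rescales the input; the real thing is a network.
    return x / (1.0 + t)

def one_step_sample(shape, rng):
    # One-step generation: draw pure noise at the largest noise level
    # and map it to data with a single call to the consistency function.
    x_T = rng.standard_normal(shape) * T_MAX
    return consistency_fn(x_T, T_MAX)

def few_step_sample(shape, rng, sigmas=(80.0, 20.0, 5.0)):
    # Few-step refinement: alternate denoising with re-noising at
    # decreasing levels, trading extra compute for sample quality.
    x = consistency_fn(rng.standard_normal(shape) * sigmas[0], sigmas[0])
    for s in sigmas[1:]:
        x_noisy = x + rng.standard_normal(shape) * s  # re-noise to level s
        x = consistency_fn(x_noisy, s)                # denoise in one call
    return x

rng = np.random.default_rng(0)
sample = one_step_sample((4, 4), rng)
```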
Scaling Transformer to 1M tokens and beyond with RMT
Recurrent Memory Transformer retains information across up to 2 million tokens.
During inference, the model effectively utilized memory for up to 4,096 segments with a total length of 2,048,000 tokens—significantly exceeding…
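The segment-level recurrence behind this can be sketched as follows (toy NumPy stand-in: a single linear map replaces a real Transformer block, and all sizes are illustrative, not the paper's):

```python
import numpy as np

SEG_LEN = 500    # tokens per segment (illustrative; the paper uses longer segments)
MEM_SLOTS = 10   # number of memory tokens carried between segments
DIM = 16         # toy hidden size

def transformer_segment(tokens, memory, W):
    # Stand-in for one Transformer pass over [memory ; segment]: here a
    # single linear map plus tanh, just to show the data flow.
    h = np.tanh(np.concatenate([memory, tokens], axis=0) @ W)
    new_memory = h[:MEM_SLOTS]   # updated memory tokens, passed forward
    outputs = h[MEM_SLOTS:]      # per-token outputs for this segment
    return outputs, new_memory

def process_long_sequence(embeddings, W):
    # Split a long sequence into fixed-size segments and recur: memory
    # written after segment k is read by segment k+1, so information can
    # persist across very long inputs with constant per-step memory.
    memory = np.zeros((MEM_SLOTS, DIM))
    outputs = []
    for start in range(0, len(embeddings), SEG_LEN):
        seg = embeddings[start:start + SEG_LEN]
        out, memory = transformer_segment(seg, memory, W)
        outputs.append(out)
    return np.concatenate(outputs), memory

rng = np.random.default_rng(0)
W = rng.standard_normal((DIM, DIM)) / np.sqrt(DIM)
out, mem = process_long_sequence(rng.standard_normal((2000, DIM)), W)
```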
Apple announces LLM in a flash: Efficient Large Language Model Inference with Limited Memory
paper page:
Large language models (LLMs) are central to modern natural language processing, delivering exceptional performance in various tasks. However, their…
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
project page:
SOTA FID (7.27 on COCO) without ever training on COCO; human raters find Imagen samples to be on par with the COCO data itself in image-text alignment
Microsoft presents The Era of 1-bit LLMs
All Large Language Models are in 1.58 Bits
Recent research, such as BitNet, is paving the way for a new era of 1-bit Large Language Models (LLMs). In this work, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every single…
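The 1.58-bit figure is log2(3): each weight takes one of the three values {-1, 0, +1}. A sketch of the absmean ternary quantization described for b1.58 (NumPy toy; inference-side rounding only, not the quantization-aware training code):

```python
import numpy as np

def absmean_ternary(W, eps=1e-8):
    # Quantize a weight matrix to {-1, 0, +1} (log2(3) ~= 1.58 bits/weight)
    # via the absmean scheme: scale by the mean absolute value, then
    # round and clip into the ternary set.
    gamma = np.abs(W).mean()
    Wq = np.clip(np.round(W / (gamma + eps)), -1, 1)
    return Wq, gamma

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))
Wq, gamma = absmean_ternary(W)
# Matmuls against Wq need no multiplications: each weight either adds
# the activation, subtracts it, or skips it; results are rescaled by gamma.
dequantized = Wq * gamma  # coarse reconstruction of W
```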
Google presents Genie
Generative Interactive Environments
introduce Genie, the first generative interactive environment trained in an unsupervised manner from unlabelled Internet videos. The model can be prompted to generate an endless variety of action-controllable virtual…
Track Anything: Segment Anything Meets Videos
Track-Anything is a flexible and interactive tool for video object tracking and segmentation
suitable for:
- Video object tracking and segmentation with shot changes.
- Visualized development and data annotation for video object…
TikTok presents Depth Anything
Unleashing the Power of Large-Scale Unlabeled Data
paper page:
demo:
Depth Anything is trained on 1.5M labeled images and 62M+ unlabeled images jointly, providing the most capable Monocular Depth…
Apple presents Ferret-UI
Grounded Mobile UI Understanding with Multimodal LLMs
Recent advancements in multimodal large language models (MLLMs) have been noteworthy, yet these general-domain MLLMs often fall short in their ability to comprehend and interact effectively with…
Alibaba presents EMO: Emote Portrait Alive
Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions
tackle the challenge of enhancing the realism and expressiveness in talking head video generation by focusing on the dynamic and nuanced…
Meta releases Llama 2: Open Foundation and Fine-Tuned Chat Models
paper:
blog:
develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion…
Language Modeling Is Compression
paper page:
It has long been established that predictive models can be transformed into lossless compressors and vice versa. Incidentally, in recent years, the machine learning community has focused on training…
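The predictive-model-to-compressor direction is concrete: driving an ideal arithmetic coder with a model's predictions costs exactly the model's log loss in bits. A toy sketch with a Laplace-smoothed adaptive model (any predictive model, including an LLM, could supply the probabilities):

```python
import math
from collections import Counter

def model_probs(counts, alphabet):
    # Toy adaptive model: P(symbol) proportional to its count so far
    # plus 1 (Laplace smoothing). Any predictive model works here; a
    # language model would simply supply better probabilities.
    total = sum(counts[a] for a in alphabet) + len(alphabet)
    return {a: (counts[a] + 1) / total for a in alphabet}

def code_length_bits(text):
    # An ideal arithmetic coder driven by a predictive model spends
    # -log2 P(next symbol) bits per symbol, so the compressed size of
    # the text equals the model's total log loss on it.
    alphabet = sorted(set(text))
    counts = Counter()
    bits = 0.0
    for ch in text:
        bits += -math.log2(model_probs(counts, alphabet)[ch])
        counts[ch] += 1
    return bits

# A better predictor compresses better: this skewed string costs far
# less than the 1 bit/symbol a uniform code over 2 symbols would pay.
skewed = "a" * 15 + "b"
print(code_length_bits(skewed))  # ≈ 8.1 bits vs. 16 uniform
```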
Tracking Anything with Decoupled Video Segmentation
paper page:
Training data for video segmentation are expensive to annotate. This impedes extensions of end-to-end algorithms to new video segmentation tasks, especially in large-vocabulary settings. To…
It's over
run Large Language Models like LLaMA, llama.cpp, GPT-J, Pythia, OPT, GALACTICA, gpt4all, and auto-gpt easily in a web UI, free and open source
github:
Dreamix: Video Diffusion Models are General Video Editors
abs:
project page:
present a diffusion-based method that can perform text-based motion and appearance editing of general videos
JPMorgan announces DocLLM
A layout-aware generative language model for multimodal document understanding
paper page:
Enterprise documents such as forms, invoices, receipts, reports, contracts, and other similar records, often carry rich semantics at the…
Meta just released MusicGen, a simple and controllable model for music generation
MusicGen is a single-stage auto-regressive Transformer model trained over a 32kHz EnCodec tokenizer with 4 codebooks sampled at 50 Hz. Unlike existing methods like MusicLM, MusicGen doesn't…
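A key MusicGen ingredient is interleaving the 4 codebook streams with a delay pattern so that one autoregressive model can emit all of them. A toy NumPy sketch of that interleaving (shapes and the PAD token are illustrative, not the library's actual tensors):

```python
import numpy as np

PAD = -1  # placeholder token for positions with no code yet

def apply_delay_pattern(codes):
    # "Delay" interleaving: shift codebook k right by k steps, so at each
    # time step the model predicts one token per codebook, with codebook k
    # conditioned on codebooks 0..k-1 from earlier steps. A single
    # autoregressive Transformer can then emit all K streams at once.
    K, T = codes.shape
    out = np.full((K, T + K - 1), PAD, dtype=codes.dtype)
    for k in range(K):
        out[k, k:k + T] = codes[k]
    return out

def undo_delay_pattern(delayed):
    # Invert the shift to recover aligned codebook streams for decoding.
    K, Tp = delayed.shape
    T = Tp - (K - 1)
    return np.stack([delayed[k, k:k + T] for k in range(K)])

codes = np.arange(4 * 6).reshape(4, 6)  # toy indices for 4 codebooks, 6 steps
delayed = apply_delay_pattern(codes)
restored = undo_delay_pattern(delayed)
```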
MVDream: Multi-view Diffusion for 3D Generation
paper page:
propose MVDream, a multi-view diffusion model that is able to generate geometrically consistent multi-view images from a given text prompt. By leveraging image diffusion models pre-trained on…
Google announces Stealing Part of a Production Language Model
We introduce the first model-stealing attack that extracts precise, nontrivial information from black-box production language models like OpenAI's ChatGPT or Google's PaLM-2. Specifically, our attack recovers the…
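The flavor of the attack can be shown on a toy linear stand-in (this is not the paper's procedure against real APIs): every logit vector lies in the column space of the final projection matrix, so the rank of a stack of logit vectors reveals the model's hidden dimension.

```python
import numpy as np

VOCAB, HIDDEN, N_QUERIES = 1000, 32, 200

rng = np.random.default_rng(0)
W_out = rng.standard_normal((VOCAB, HIDDEN))  # secret final projection

def query_model(prompt_id):
    # Black-box oracle: returns the full logit vector for one prompt.
    # Internally logits = W_out @ h, but the attacker never sees h or W_out.
    h = rng.standard_normal(HIDDEN)
    return W_out @ h

# Attack: stack logit vectors from many queries. The stacked matrix has
# rank equal to the hidden dimension -- a nontrivial secret recovered
# from outputs alone.
L = np.stack([query_model(i) for i in range(N_QUERIES)])  # (N_QUERIES, VOCAB)
singular_values = np.linalg.svd(L, compute_uv=False)
recovered_hidden_dim = int((singular_values > 1e-6 * singular_values[0]).sum())
```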
OpenLLaMA 13B Released
model:
present a permissively licensed open source reproduction of Meta AI's LLaMA large language model. We are releasing 3B, 7B and 13B models trained on 1T tokens. We provide PyTorch and JAX weights of pre-trained OpenLLaMA…
Tencent announces AppAgent
Multimodal Agents as Smartphone Users
paper page:
Recent advancements in large language models (LLMs) have led to the creation of intelligent agents capable of performing complex tasks. This paper introduces a novel LLM-based…
BloombergGPT: A Large Language Model for Finance
a 50 billion parameter language model trained on a wide range of financial data. The authors construct a 363 billion token dataset based on Bloomberg's extensive data sources, perhaps the largest …
zeroscope_v2 XL, A watermark-free Modelscope-based video model capable of generating high quality video at 1024 x 576
Model on @huggingface:
This model was trained with offset noise using 9,923 clips and 29,769 tagged frames at 24 frames, 1024x576…
OpenAI releases GPT-4V(ision) system card
paper:
GPT-4 with vision (GPT-4V) enables users to instruct GPT-4 to analyze image inputs provided by the user, and is the latest capability we are making broadly available. Incorporating additional modalities…