We are getting closer to “Her” (part 2!)
Conversationally do anything with emails, using LLM chaining & few-shot prompting for tool use (@LangChainAI inspired)
This is now realtime (ish), thanks to #OpenAI gpt-3.5-turbo
🔈 on for voice realism!
🧵
I “jailbroke” a Google Nest Mini so that you can run your own LLMs, agents and voice models.
Here’s a demo using it to manage all my messages (with help from @onbeeper)
🔊 on, and wait for surprise guest!
I thought hard about how to best tackle this and why, see 🧵
I wanted to imagine how we’d better use #stablediffusion for video content / AR.
A major obstacle, why most videos are so flickery, is lack of temporal & viewing angle consistency, so I experimented with an approach to fix this
See 🧵 for process & examples
I’m working on open sourcing the PCB design, build instructions, firmware, bot & server code - expect something in the next week or so.
If you don't want to source Nest Mini's (or shells from AliExpress) it's still a great dev platform for developing an assistant!
Stay tuned!
Here's one of my modded Google Nest Mini's using @OpenAI function calling to take notes & control 💡
I’m releasing all code & docs to get this exact demo running yourself, including:
💬 Messages
🏡 Home Assistant integration
📝 Note-taking
We are getting closer to “Her” where conversation is the new interface.
Siri couldn’t do it, so I built an e-mail summarizing feature using #GPT3 and life-like #AI generated voice on iOS.
(🔈Audio on to be 🤯with voice realism!)
How did I do this? 👇
I've been experimenting with several of these: announcing important messages as they come in, morning briefings, noting down ideas and memos, and browsing agents.
I couldn’t resist - here's a playful (unscripted!) video of two talking to each other, prompted to be AIs from "Her"
After looking into jailbreaking options, I opted to completely replace the PCB.
This lets you use a cheap ($2) but powerful & developer-friendly WiFi chip with a highly capable audio framework.
This allows a paradigm of multiple cheap edge devices for audio & voice detection…
1/ I created this with Stable Diffusion using image inpainting and “walking through the latent space”
Without using tweening, every frame is generated by an interpolated embedding and variable denoising strength, so keeping continuity was tricky
See 🧵for process
I used the #StableDiffusion 2 Depth Guided model to create architecture photos from dollhouse furniture.
By using a depth-map you can create images with incredible spatial consistency without using any of the original RGB image.
See 🧵
The custom PCB uses @EspressifSystem's ESP32-S3
I went through 2 revisions from a module to a SoC package with extra flash, simplifying to single-sided SMT (< $10 BOM)
All features such as LEDs, capacitive touch and the mute switch are working, & it's even programmable from Arduino (/IDF)
& offloading large models to a more powerful local device (whether your M2 Mac, PC server w/ GPU or even "tinybox"!)
In most cases this device is already trusted with your credentials and data, so you don’t have to hand these off to some cloud & your data need never leave your home
For this demo I used a custom “Maubot” with my @onbeeper credentials (a messaging app which securely bridges your messaging clients using the Matrix protocol & e2e encryption) which runs locally serving an API
I’m then using GPT3.5 (for speed) with function calling to query this
I used AI to create a (comedic) guided meditation for the New Year!
(audio on, no meditation pose necessary!)
Used ChatGPT for an initial draft, and TorToiSe trained on only 30s of audio of Sam Harris
See 🧵 for implementation details
For the prompt I added details such as family & friends, current date, notification preferences & a list of additional character voices that GPT can respond in.
The response is then parsed and sent to @elevenlabsio
Here are some more out there takes, including turning my couch into a jumping castle! 🏰🎈
There are endless possibilities here for content creation. Follow for more creative AI experiments!
Once the "atlas" was learned I could then run it through
#depth2img
, then use the new atlas to reproject across the video.
This last remapping part is quick so you could imagine it being rendered live based on your viewing angle for
#AR
(for a pre-generated scene)
Imagine just speaking and waving your cursor to have a personalized AI assist you in any app
Here’s a quick demo of using voice, my cursor gestures & what's visible on-screen to prompt an #LLM agent (with access to my calendar & preferences)
(🔊 on for voice prompts!)
🧵
@LangChainAI
This provides an incredibly natural way of searching for emails & then referencing them
“are there any unread emails mentioning X in the last month?”
“tell me more about the last one”
“who else was cc’d on the picnic one?”
“reply to the one about X saying …”
another e.g.:
A Neural Radiance Field (#NeRF) lets you create unique viewpoints you couldn’t otherwise - here’s a great example of creating a drone shot from frames pulled from a camera video. Details follow...
6/ Not all walks through the latent space were smooth paths, but it’s easy to script it to find pairs that work well (and let your GPU replace your central heating)
Having the ability to play with these models on this level is incredible.
More creative AI experiments to come!
I’ve been building @onjuai, a tool that makes it incredibly natural to interact with computers
Make conversational requests, powered by LLMs & the context of the app you’re in, without breaking your flow
Here’s a first example of using it with... Terminal!
(🧵 for access)
Ideally you want to learn a single representation of an object across time or different viewing directions to perform a *single* #img2img generation on.
For this I used (2021)
I used the Gmail API to feed recent unread e-mails into a prompt and send it to the @OpenAI #GPT3 Completion API. Calling out details such as not “just reading them out” and other prompt tweaks gave good results
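A minimal sketch of the prompt-building step (wording and field names are illustrative; the actual fetch would use the Gmail API's `users().messages().list(q="is:unread")` endpoint):

```python
# Turn a list of unread emails into a conversational-summary prompt.
# The instruction mirrors the "don't just read them out" tweak above.
def build_prompt(emails):
    lines = [f"From: {e['from']} | Subject: {e['subject']}\n{e['snippet']}"
             for e in emails]
    return (
        "Conversationally summarize these unread emails for me. "
        "Don't just read them out; group related ones and keep it brief.\n\n"
        + "\n---\n".join(lines)
    )

prompt = build_prompt([
    {"from": "Ana", "subject": "Picnic Sat?", "snippet": "Bringing snacks..."},
    {"from": "Acme", "subject": "Invoice", "snippet": "Your invoice is ready."},
])
```

The returned string would then be sent as the completion prompt.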
The authors of the paper recommend using Mask R-CNN for creating a segmentation mask before training, but for this I found it easier (and cleaner) to just create a mask with the Rotobrush in After Effects
The audio model was fine-tuned on speech from the movie Her.
I got good results with TorToiSe, but have also experimented with ViTS & YourTTS from @coqui_ai and more recently @ElevenLabs.
None are fast enough for a snappy response together with text-davinci-003 completions, so...
It’s an incredible time to be building interactive experiences.
(Unposted) voice experiments I've been running are smart home control from Pi's, a morning chat briefed with my daily priorities, events, weather, sleep data etc.
See my profile for other creative AI experiments
@dessy_ocean
@onbeeper
Quick update: I've made some PCB improvements for WiFi performance etc., including making it 4-layer (should have done this sooner) & ordered a batch to validate
Meanwhile I applied to list this on @crowd_supply & am also following up with @seeedstudio & others.
So stay tuned! 🙏
@colinfortuner
This uses @elevenlabsio, but tbh I feel I can't release this wider for folks until there is a good open source option - the lack of good data privacy makes me uncomfortable recommending people send all their actual e-mail summaries to them.
I have hopes for …
I imagined how I might ask questions about my books, without the distraction of taking out my phone
Here's an experiment using #GPT4 on a Kindle with a voice request through an @Apple HomePod
5/ Some tricks were required with blending and adjusting the inpainting mask to smoothly switch over the init images of the two real phones
(example generations on the right)
4/ Transitions were done using a customized @huggingface 🧨Diffusers pipeline.
This lets me “slerp” between both noise latents AND text embeddings, for each given seed & prompt respectively
(while keeping denoising strength at ~0.8)
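A minimal NumPy sketch of the "slerp" step (a hypothetical helper, not the exact pipeline code - the real version operates on torch tensors inside the Diffusers pipeline):

```python
import numpy as np

def slerp(t, v0, v1, eps=1e-8):
    """Spherical interpolation between two latents/embeddings of any shape."""
    v0f, v1f = v0.ravel(), v1.ravel()
    dot = np.dot(v0f, v1f) / (np.linalg.norm(v0f) * np.linalg.norm(v1f) + eps)
    theta = np.arccos(np.clip(dot, -1.0, 1.0))
    if theta < eps:  # nearly parallel - plain lerp is fine
        return (1 - t) * v0 + t * v1
    return (np.sin((1 - t) * theta) * v0 + np.sin(t * theta) * v1) / np.sin(theta)

# walking between the noise latents of two seeds (SD latents are 4x64x64)
a = np.random.default_rng(0).standard_normal((4, 64, 64))
b = np.random.default_rng(1).standard_normal((4, 64, 64))
frames = [slerp(t, a, b) for t in np.linspace(0.0, 1.0, 24)]
```

The same function works on the text embeddings, giving one interpolated (latent, embedding) pair per frame.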
2/ First off, finding the right combination of prompt, seed and denoising strength for an #img2img in-painting is a roll of the dice
Luckily it is easy to script large batches to cherrypick
@OpenAI
Here are the settings I used - you can see how #GPT3 does a great job of conversationally summarizing. (For the sake of privacy I made up the e-mails shown in the demo)
This learns an "atlas" to represent an object and its background across the video.
Regularization losses during training help preserve the original shape, with a result that resembles a usable slightly "unwrapped" version of the object
2/ This model is unique as it was fine-tuned from the Stable Diffusion 2 base with an extra channel for depth.
Using MiDaS (a model to predict depth from a single image), it can create new images with matching depth maps to your "init image"
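A sketch of driving this via 🧨 Diffusers' `StableDiffusionDepth2ImgPipeline` (the model id, dtype and defaults are my choices, not necessarily what was used; imports are deferred inside the function since the dependencies are heavy and GPU-bound):

```python
# Generate a new image conditioned only on the depth map of an init image.
def depth2img(init_image, prompt, strength=1.0, seed=0):
    import torch
    from diffusers import StableDiffusionDepth2ImgPipeline
    pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
        "stabilityai/stable-diffusion-2-depth", torch_dtype=torch.float16
    ).to("cuda")
    generator = torch.Generator("cuda").manual_seed(seed)
    # strength=1.0 discards all of the original RGB - only the
    # MiDaS-predicted depth map conditions the new image
    return pipe(prompt=prompt, image=init_image,
                strength=strength, generator=generator).images[0]
```

With `strength` below 1.0 some of the original pixels survive; at 1.0 you get pure depth-guided generation.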
Hacked together a pair of wide-angle cameras onto a home-made rocket so that I could say I've built & launched a rocket from NASA... The landing could use some work but re-usability is in the bag 🚀♻️
@NASAAmes
What worked best for me was putting the prompt and few-shot examples in the “system” message, and getting the assistant to think it is providing the command to the user, who will then query the API and return results to the assistant to summarize (see e.g.)
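A minimal sketch of that message layout (the command syntax and example turns are illustrative, not the real prompt):

```python
# Few-shot examples live in the "system" message; real API results are fed
# back as "user" turns, which the assistant then summarizes.
def build_messages(history, user_request):
    system = (
        "You are an email assistant. To act, reply with exactly one command:\n"
        "SEARCH <gmail query> | GET <index> | REPLY <index> <text>\n"
        "The user runs the command and pastes the results back; then summarize.\n"
        "Example:\n"
        "user: any unread emails about the picnic?\n"
        "assistant: SEARCH is:unread picnic\n"
        "user: [1] 'Picnic Sat?' from Ana... [2] ...\n"
        "assistant: Yes - two unread emails mention the picnic..."
    )
    return [{"role": "system", "content": system}, *history,
            {"role": "user", "content": user_request}]

msgs = build_messages([], "who emailed me about the picnic?")
```

Each round trip appends the assistant's command and the pasted results to `history` before the next call.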
Voice transcription runs on device using the app name & selected text for context which gives it incredible accuracy.
See the previous demo using Terminal
3/ I set the denoising strength to 1.0 so that none of the original RGB image was used
Even with widely different prompts it was able to generate consistent objects
Using simple, recognizable shapes such as wooden doll-house furniture worked great for this
@fffiloni
@huggingface
I'd be happy to help guide if someone more experienced with this wants to set one up!
Currently learning an "atlas" takes some time (I left it overnight but it probably converges to something usable sooner).
This is just an MVP of the ideal case - which would be to scan your…
5/ Here are a few of the prompts used:
"A beautiful rustic Balinese villa, architecture magazine, modern bedroom, infinity pool outside, design minimalism, stone surfaces"
@NVIDIAAIDev
Here is the same scene shown in Instant NGP. While @nerfstudioteam is missing some features of Instant NGP, being community-driven, the rate of progress since their first launch a few weeks ago has been incredible to watch
8/ There is some “creativity” in how the depth-map is matched under the prompt.
Here are a few outtakes where the model tried to match the plant to antlers, toys, candles, statues, a double-necked guitar and even a kid with Mickey ears🤯
Follow for more creative experiments 👨🎨
4/ Regular photos ended up having an unavoidable “doll-house” feel to them (even with heavy prompt tweaking) due to the extreme perspective.
I found that changing to a longer focal length (3x on an iPhone) and capturing from further away resolved this.
@ekryski
@adafruit
Yep, I want to fix all the long tail issues with the PCB before releasing - avoiding wasting a lot of time debugging. Aiming to finish revisions in the next week.
Mostly interested in getting people experimenting with what they’d like to see built!
Beyond the above digital minimalism, I’ve spent a lot of time imagining what a LLM reading experience might look like, especially for fiction
Imagine pausing a book, and talking to characters at that moment, or unhurriedly exploring the scenes that the author has vividly crafted
An overlooked angle on why @Apple Reality (AR) could be relevant is iris scanning and “Proof of Personhood” for digital spaces.
We’ve seen an explosion in #AI agents, generative capabilities & increasingly realistic speech, not counting Twitter’s existing bot problems.
🧵
3 email commands are learned (which are then formatted into a GMail API request):
- search for emails with params, giving a list of email snippets indexed so the LLM can refer to in further requests
- get the full e-mail by index
- reply to an e-mail by index with a response
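The three commands above could be expressed as OpenAI function-calling schemas roughly like this (the schema format is OpenAI's; the names and parameters here are my guesses, with the indices referring back to the snippet list returned by the search call):

```python
# Hypothetical tool definitions for the three learned email commands.
functions = [
    {
        "name": "search_emails",
        "description": "Search Gmail and return indexed snippets",
        "parameters": {
            "type": "object",
            "properties": {"query": {
                "type": "string",
                "description": "Gmail query, e.g. 'is:unread picnic newer_than:30d'",
            }},
            "required": ["query"],
        },
    },
    {
        "name": "get_email",
        "description": "Fetch the full e-mail at a snippet index",
        "parameters": {
            "type": "object",
            "properties": {"index": {"type": "integer"}},
            "required": ["index"],
        },
    },
    {
        "name": "reply_to_email",
        "description": "Reply to the e-mail at a snippet index",
        "parameters": {
            "type": "object",
            "properties": {"index": {"type": "integer"},
                           "body": {"type": "string"}},
            "required": ["index", "body"],
        },
    },
]
```

Passed to the chat completion call, the model emits a `function_call` with arguments, which is formatted into the actual Gmail API request.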
@wakingup
Diffusion models & autoregressive transformers are coming for audio!
Text-To-Speech was created using
I also highly enjoyed reading the author's blog
@colinfortuner
This uses
@elevenlabsio
, but tbh feel I can't release this wider for folks until there is a good open source option - lack of good data privacy makes me uncomfortable recommending people send all their actual e-mail summaries to them.
I have hopes for …
I originally manually chained responses, keeping track of and pruning history to feed into the next chain, and using stop tokens to prevent the LLM from hallucinating the API response.
There was some refactoring & experimentation to make use of the new chat completions API...
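A toy version of that original manual chain, with a fake LLM standing in for the completion call (marker strings and the `FINAL:` convention are illustrative):

```python
# Stop tokens cut the model off right after it emits a command, so it can't
# hallucinate the API's response; we run the command for real and feed the
# result back into the next call.
def run_chain(llm, execute, prompt, max_steps=5):
    history = prompt
    for _ in range(max_steps):
        out = llm(history, stop=["API response:"])  # model halts at marker
        history += out
        if out.strip().startswith("FINAL:"):
            return out.strip()[len("FINAL:"):].strip()
        result = execute(out.strip())               # the real API call
        history += f"\nAPI response: {result}\n"
    return None

# fake LLM: first asks for a search, then answers once results are in history
def fake_llm(history, stop):
    return "FINAL: 2 unread emails" if "API response:" in history else "SEARCH is:unread"

answer = run_chain(fake_llm, lambda cmd: "[1] ..., [2] ...", "user: any unread?\n")
```

Pruning `history` between steps keeps the chain inside the context window.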
6/ "Luxurious modern studio bedroom, trending architecture magazine photo, colorful framed art hanging over bed, design minimalism, furry white rugs, trendy, industrial, pop art, boho chic"
@colinfortuner
@elevenlabsio
There are efforts to speed up TorToiSe, but it's inherently an approach that is still too slow for realtime (~1min)
I fine-tuned YourTTS with an hour of this voice - results were fast & decent but not nearly as expressive and still had phoneme errors.
I believe the magic happens when you pair 🍐 voice (for rich requests & feedback) with physical inputs (for shortcuts & confirmation) to get the answer where you need it
(More demos in different apps coming!)
Sign up for access at !
ChatGPT came up with some creative ideas, but the delivery was still fairly vanilla, so I iterated on it heavily and added a few Sam-isms from my experience with the @wakingup app (Jokes aside - highly recommended)
@wakingup
I split up text into short chunks to create the most natural flow, then did a grid search across multiple parameters to find the most realistic copy of Sam's mannerisms
Each sentence takes about 2-3 minutes to generate on a 3090, and I generated ~20 for each to cherrypick
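The chunking and grid-search steps could look roughly like this (the parameter names are placeholders, not TorToiSe's actual API; chunk length is arbitrary):

```python
import itertools
import re

# Split text at sentence boundaries into chunks under max_len characters,
# so each TTS generation gets a short, naturally-flowing piece.
def chunk_sentences(text, max_len=120):
    parts, current = [], ""
    for sent in re.split(r"(?<=[.!?])\s+", text.strip()):
        if current and len(current) + len(sent) > max_len:
            parts.append(current.strip())
            current = ""
        current += sent + " "
    if current.strip():
        parts.append(current.strip())
    return parts

# Enumerate a small parameter grid; each combination yields candidates
# to generate and cherrypick from.
grid = list(itertools.product([0.2, 0.5, 0.8],   # e.g. temperature
                              [1.0, 2.0]))       # e.g. repetition penalty

chunks = chunk_sentences(
    "Welcome. Take a breath. Notice the thoughts arising. Let them go.",
    max_len=30,
)
```

Each (chunk, parameter combination) pair then becomes one generation job for the GPU.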
Finding the right reference material is critical (quality over quantity), so I used 3 carefully edited clips of <10s each.
Tricks with prompt engineering are also possible - the content influences the generated audio intonations
While Apple already has hardware verification with its Secure Enclave chip & “hardware root of trust”, an AR headset with iris scanning and liveness detection would ensure a 1:1 mapping of human to device
Right now this just uses on-screen text, but it is easy to imagine this with multi-modal models (actively experimenting) and even other hardware form factors (😲🤫)
@JimmyBrumant
Voice Activity Detection for filtering out spoken phrases, Whisper for STT running on my Macbook / Linux server (depending), and using the "no speech prob".
If you were asking about TTS, using Elevenlabs - tbh best out there for voice cloning (for now...)
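Filtering on "no speech prob" could be sketched like this (the result/segment fields match the openai-whisper `transcribe` output; the threshold is my choice):

```python
# Keep only segments Whisper is confident contain actual speech.
def speech_segments(result, threshold=0.6):
    return [s["text"].strip() for s in result["segments"]
            if s["no_speech_prob"] < threshold]

# shape of a whisper transcribe() result, with fake values for illustration
fake = {"segments": [
    {"text": " turn on the lights", "no_speech_prob": 0.02},
    {"text": " [background noise]", "no_speech_prob": 0.93},
]}
kept = speech_segments(fake)
```

VAD runs first so Whisper only ever sees candidate speech; this filter catches what slips through.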
Responses are currently using GPT4, chained when needed with memory, preferences & integrations.
Although it can provide highly specific answers, it works across all apps w/o extensions needed for each & allows credentials for integrations to be centrally & securely managed
@guru154929
@OpenAI
Take a look at the readme and code. This uses timeouts to only listen after responses, or when the device first turns on or is tapped, and when the mute switch is not flipped. Of course wakewords can be used, but they clearly wouldn't make for as good a conversation or demo.
But critically,…
Lastly, yes, the headset will be 💰 initially, but
1) I’d expect this to set a precedent of HW verification for other devices & mfg'ers
2) this is (relatively) small change compared to other "flexes of fungibility" for the digital world (NFT's)
It's actually incredibly simple:
When I hit the shortcut to start voice recording, it captures the current window and uses Apple's Vision framework to perform OCR on all the text.
This takes about 1-2 seconds but runs in parallel to voice recording
@rhasspy
@IgorAntarov
@onbeeper
@home_assistant
Yep, it uses I2S for the 2 mic's & speaker.
I'm an advocate for offloading to a local GPU / "secure enclave" w credentials for my applications, but you can do whatever you'd like with it!
The great handling of unbounded scenes like this one is thanks to spatial distortion (as proposed in the Mip-NeRF 360 paper), where any point beyond a unit sphere (from 1 to ∞) is mapped into a second sphere (from 1 to 2), allowing the network to learn scenes beyond a bounded cube
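That contraction is compact enough to sketch directly with NumPy (a simplified version of the Mip-NeRF 360 formula, minus edge-case handling):

```python
import numpy as np

# Points inside the unit sphere pass through unchanged; everything from
# radius 1 out to infinity is squeezed into the shell between radius 1 and 2.
def contract(x):
    n = np.linalg.norm(x, axis=-1, keepdims=True)
    return np.where(n <= 1.0, x, (2.0 - 1.0 / n) * (x / n))

near = np.array([0.3, 0.4, 0.0])    # ||x|| = 0.5  -> unchanged
far = np.array([100.0, 0.0, 0.0])   # ||x|| = 100  -> lands at radius 1.99
```

So the whole unbounded background ends up in a finite volume the network can represent.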
Transcription has to be fast & flawless for a good UX.
To do this the model runs on-device and is conditioned on the app you’re in and with previous commands or messages to ensure accuracy with ambiguous words.
(Recordings never need to leave your device)
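One way to do that conditioning, assuming a Whisper-style `initial_prompt`: seed the decoder with the app name, selected text and recent commands so ambiguous words resolve correctly (the helper and truncation are my sketch):

```python
# Build a short context string to bias transcription toward in-app vocabulary.
def context_prompt(app, selected, recent, limit=200):
    ctx = f"App: {app}. {' '.join(recent)} {selected}"
    return ctx[-limit:]  # keep only the most recent `limit` characters

prompt = context_prompt("Terminal", "git rebase -i HEAD~3",
                        ["list my branches", "checkout main"])
# model.transcribe(audio, initial_prompt=prompt)  # openai-whisper usage
```

With "git rebase" in the context, a spoken "rebase" is far less likely to come out as "re-base" or "read base".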
This demo uses Siri & Shortcuts to send the question to a server.
It uses Kindle's built-in browser to load a formatted page with JavaScript to fetch updates as the response from @OpenAI is streaming.
The browser is really limited so I had to keep it minimal and stick to ES5
Nerfstudio is a great platform for experimenting with the latest research breakthroughs in NeRFs, co-created by one of the authors of the original paper. Here are some examples of features they’ve rolled into their de facto model: