If you're still at CVPR and have the stamina to make it through another poster session, check out RawNeRF tomorrow morning! We exploit the fact that NeRF is surprisingly robust to image noise to reconstruct scenes directly from raw HDR sensor data.
Sora is our first video generation model - it can create HD videos up to 1 min long. AGI will be able to simulate the physical world, and Sora is a key step in that direction. Thrilled to have worked on this with @billpeeb at @openai for the past year.
Code finally released for our CVPR 2022 papers (mip-NeRF 360/Ref-NeRF/RawNeRF)! You can also find links for each paper's dataset on its project page.
The code has some nice new camera utilities for larger real scenes, like this one.
We've finally released code for three of our CVPR2022 papers: mip-NeRF 360, Ref-NeRF, and RawNeRF. Instead of three separate releases, we've done something a little unusual and merged them into a single repo. Excited to see what people do with this!
Friday was my last day at Google after 3 years.
Will dearly miss hanging out with all my amazing coworkers, but looking forward to trying something new. The last few years of progress have been crazy and I expect even wilder things to come.
goodbye to my nerfing camera 💔
ReconFusion = standard single-scene optimized 3D reconstruction, additionally guided by a multi-view diffusion prior to allow for decent outputs from significantly fewer input views.
some thoughts below for view synthesis fanatics...
1/
The use of sigma in NeRF volume rendering is neither wrong nor a bug, but an intentional choice.
Here's the integral form of the radiative transfer equation, from rendering experts @_jannovak, @wkjarosz, et al. How does this become the equation in NeRF?
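For anyone reading without the attached screenshot, the integral form being referenced is approximately the following (a reconstruction in the notation defined later in this thread, not the exact image):

```latex
L(\mathbf{x}, \omega) = \int_0^\infty T(t)\,\big[\,\mu_a(\mathbf{x}_t)\,L_e(\mathbf{x}_t, \omega)
  \;+\; \mu_s(\mathbf{x}_t)\,L_s(\mathbf{x}_t, \omega)\,\big]\,\mathrm{d}t,
\qquad
T(t) = \exp\!\Big(-\!\int_0^t \mu_t(\mathbf{x}_s)\,\mathrm{d}s\Big)
```

where \(\mu_t = \mu_a + \mu_s\) is the extinction coefficient.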
Very glad I can finally talk about our newly-minted #CVPR2022 paper. We extended mip-NeRF to handle unbounded "360" scenes, and it got us ~photorealistic renderings and beautiful depth maps. Explainer video: and paper:
The new @googlemaps Immersive View feature is going to be pretty amazing (and uses Neural Radiance Fields, or NeRFs, developed by @GoogleAI, UC Berkeley, and UCSD researchers)
See
Watch the Maps Immersive portion of the #GoogleIO keynote:
+1 to Ben P
and between bill freeman’s famous (and accurate) plot and never-ending complaints about the randomness of the review process… can we really blame people for coming to the misguided conclusion “draw as many samples as I can to maximize chance of getting an outlier”?
One paper can change your life.
But which one? Overproductivity doesn't just come from paper counting, but from the desperate acts of young researchers under extreme pressure to be part of that one paper.
Happy to announce DreamFusion, our new method for Text-to-3D!
We optimize a NeRF from scratch using a pretrained text-to-image diffusion model. No 3D data needed!
Joint work w/ the incredible team of @BenMildenhall @ajayj_ @jon_barron #dreamfusion
check out our work on improving joint NeRF+camera optimization!
shoutout to this iOS app that records images and ARKit poses; I was able to use it to capture all the ARKit scenes for this paper in about 15 minutes
Introducing CamP🏕️ — a method to precondition camera optimization for NeRFs to significantly improve quality. With CamP we’re able to create high quality reconstructions even when input poses are bad.
Project page:
ArXiv:
(1/n)
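As a rough sketch of what "preconditioning camera optimization" can look like (this is my own illustrative whitening-style construction, not necessarily CamP's exact formulation): take the Jacobian J of some projected proxy points with respect to the camera parameters, and reparameterize so a unit step in the new variables moves the projections by roughly a unit amount in every direction:

```python
import numpy as np

def camera_preconditioner(jacobian, eps=1e-6):
    """Whiten camera parameters using the Jacobian of projected proxy points
    w.r.t. those parameters. Optimizing psi, with params = P @ psi, decorrelates
    updates to e.g. focal length vs. translation, which otherwise have wildly
    different effects on the projected geometry."""
    cov = jacobian.T @ jacobian                     # parameter-space metric J^T J
    eigval, eigvec = np.linalg.eigh(cov)
    # P = (J^T J)^{-1/2}, regularized so near-null directions don't blow up
    return eigvec @ np.diag(1.0 / np.sqrt(eigval + eps)) @ eigvec.T
```

With P applied, the effective Jacobian J @ P has an (approximately) identity Gram matrix, so gradient steps are equally scaled in all camera-parameter directions.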
@ajayj_ @poolio @jon_barron @BenMildenhall
I'm much more interested to see my home office redone to be mid-century modern style than I am in seeing a purple duck eating an everything bagel. A generative model conditioned on my personal experience would allow me to explore decisions that are within my reach to achieve. [2/n]
From our latest project, an homage to the original Photo Tourism visualizations by @Jimantha et al. - interpolating between camera pose, focal length, aspect ratio, and scene appearance from different tourist images. More details at
@_pratul_ @jon_barron
These days, who can say what crazy stuff is happening to your photos after they're captured, especially on a cellphone? Better to go straight to the source and grab those pixels fresh from the Bayer quads.
That means fewer views are required to reconstruct at the same quality as before. No one wants to capture 100-1000+ images for a good nerf or splatting result, it's super tedious!! Progress toward this means progress toward faster, easier, more casual 3D capture in the future 🎉
n/n
Plus, optimizing NeRF in the linear space of raw data means you can postprocess its rendered novel views just like any raw photograph (e.g., adjusting exposure, tonemapping, or white balance). You can even render synthetic defocus with correctly exposed bokeh.
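A toy illustration of that "postprocess the linear render like a raw photo" point (the gains and curve here are made up; sRGB gamma stands in for a real camera tonemap):

```python
import numpy as np

def postprocess_linear(rgb_linear, exposure=1.0, wb_gains=(2.0, 1.0, 1.6)):
    """Apply exposure, white balance, and a simple tonemap to a linear-space
    rendering, the same way one would develop a raw photograph.
    Illustrative sketch only: gains and curve are placeholders."""
    x = rgb_linear * exposure * np.asarray(wb_gains)  # per-channel WB gains
    x = np.clip(x, 0.0, 1.0)
    # sRGB gamma curve as a stand-in for a camera's tonemapping step
    return np.where(x <= 0.0031308, 12.92 * x, 1.055 * x ** (1 / 2.4) - 0.055)
```

Because the NeRF itself lives in linear space, you can re-run this step with any exposure or white balance after rendering, instead of baking one choice into the reconstruction.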
Great overview from @fdellaert! I'd also like to highlight @TinghuiZhou and @Jimantha et al. for bringing volume rendering into deep learning for view synthesis with their paper Stereo Magnification in 2018.
2020 was the year in which *neural volume rendering* exploded onto the scene, triggered by the impressive NeRF paper by Mildenhall et al. I wrote a post as a way of getting up to speed in a fascinating and very young field and share my journey with you:
3. T * sigma can be thought of as a PDF, implying that we're returning the expected color along the ray. This is easily extended to return the expected value of any other quantity (e.g., T * sigma * distance to get expected depth).
/end
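The weight/expectation view from point 3, as a minimal numpy sketch (quadrature follows the NeRF paper's discretization; this is illustrative, not the official code):

```python
import numpy as np

def render_ray(sigmas, colors, ts):
    """NeRF-style quadrature: w_i = T_i * (1 - exp(-sigma_i * delta_i)).
    The weights sum to at most 1, so they act like a (sub-)PDF along the ray:
    expected color = sum_i w_i * c_i, expected depth = sum_i w_i * t_i."""
    deltas = np.diff(ts, append=ts[-1] + 1e10)             # interval lengths
    alphas = 1.0 - np.exp(-sigmas * deltas)                # per-sample opacity
    trans = np.cumprod(np.append(1.0, 1.0 - alphas))[:-1]  # transmittance T_i
    weights = trans * alphas
    color = (weights[:, None] * colors).sum(axis=0)
    depth = (weights * ts).sum()
    return color, depth, weights
```

Swapping `ts` for any other per-sample quantity in the final expectation gives you the expected value of that quantity along the ray.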
This is such a huge blocker for neural rendering research! Graphics algorithms have so much dynamism and it is nigh impossible to try integrating these with current ML frameworks without a lot of custom low level work
for instance, a simulator of the future could involve NNs in k-d tree lookup, NNs for sim2real, NNs for contact prediction. These pieces depend on each other but not necessarily on a per-gradient-update basis, so potentially the AD software could be designed around that.
@pesarlin
certainly true but emphasizing this too much can also result in psychological torment, leading people to lock themselves in a room obsessing over how to create the Next Big Thing, rather than achieving a healthy balance and creative flow of research progress over time
I'm excited about ReconFusion because it shows the potential of taking that basic setup and guiding optimization with a smarter prior that has a strong opinion about what uncaptured novel views should look like, based on learning from large multiview image datasets.
8/
1. It produces rendering weights T * alpha that match the alpha compositing model used in previous view synthesis methods like Neural Volumes and multiplane images.
2. It's a more common convention in graphics/rendering work (for the reasons in the screenshot above).
This produces the typical integrand T * sigma * color from NeRF.
Ok, so what's up with the sigma debate? As some replies to the original thread have noted, this is a matter of volume rendering convention. Later on the same page of Novák et al. we see:
Time to revise these areas... so awkward trying to fit neural rendering submissions into these categories. Many are hiding in "stereo and multiview 3d", the place to be for orals this year 🔥🔥🔥
So we can either keep sigma * L in the integrand or fold them together into a new emitted L'. This is analogous to the question of premultiplied alpha in over-compositing.
We *intentionally* chose sigma * L in NeRF. It makes sense for a variety of reasons:
T is transmittance, L_e is (emitted) color, and mu_a is what we call "density" (but is technically the absorption coefficient) and typically denote by sigma in NeRF.
Since NeRF's rendering model doesn't include in- or out-scattering, mu_s = 0 and the second term disappears.
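Concretely, with mu_s = 0, writing sigma for mu_a and c for L_e, the integral reduces to the familiar NeRF volume rendering equation:

```latex
C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t), \mathbf{d})\,\mathrm{d}t,
\qquad
T(t) = \exp\!\Big(-\!\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,\mathrm{d}s\Big)
```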
Honourable Mention @ICCV_2021
Mip-NeRF: A Multiscale Representation for Anti-Aliasing Neural Radiance Fields
Jonathan T. Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, Pratul Srinivasan
[Session 5 A/B]
Appearance interpolation is done directly in NeRF weight space! Use simple meta-learning (Reptile) to get the initial scene weights, then just do a couple extra gradient steps to acquire the appearance of any new input image
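A toy sketch of that Reptile loop, with a one-parameter least-squares fit standing in for per-scene NeRF optimization (all names and hyperparameters here are illustrative):

```python
import numpy as np

def reptile(theta, sample_task, inner_steps=5, inner_lr=0.1, outer_lr=0.5, meta_iters=200):
    """Reptile meta-learning: adapt to a sampled task with a few SGD steps,
    then nudge the meta-initialization toward the adapted weights."""
    theta = theta.copy()
    rng = np.random.default_rng(0)
    for _ in range(meta_iters):
        target = sample_task(rng)                 # a "task" is just a target to fit
        phi = theta.copy()
        for _ in range(inner_steps):
            phi = phi - inner_lr * 2.0 * (phi - target)  # grad of ||phi - target||^2
        theta = theta + outer_lr * (phi - theta)  # Reptile outer update
    return theta

# Tasks: fit a scalar to targets drawn near 3.0; the meta-init converges near 3.0,
# so adapting to any new task takes only a couple of gradient steps.
meta_init = reptile(np.array([0.0]), lambda rng: 3.0 + 0.1 * rng.standard_normal(1))
```

In the paper's setting, the inner loop is NeRF training on one scene and the "couple extra gradient steps" specialize the meta-initialized weights to a new input image.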
Many view synthesis projects later, I'd say we are somewhat remiss for not having included more of these plots in our papers. The difficulty of reconstructing a captured scene is 100% tied to view sampling density, and as a community we tend to underemphasize this fact.
5/
This rate is basically sampling with 1 pixel of disparity between input views -- that is, you move the camera such that the nearest thing in the scene only moves 1 pixel in the resulting image (!!) Obviously, this is totally intractable.
3/
The point of NeRF was: setting up a dumb brute-force optimization for a dense scene reconstruction that can rerender all your input images works surprisingly well.
There's nothing "learned" there beyond the prior encoded by your 3D representation, plus the input images.
7/
In ReconFusion, we ran this test on the kitchenlego scene (quality vs # input views) and got a very satisfying result.
btw -- the lines do end up crossing for some metrics, at around 160 views :) The original full capture is about 250 input images.
6/
Back in grad school, @_pratul_, Ren, and I talked a lot about plenoptic sampling in the context of view synthesis. There's a fundamental Nyquist sampling rate that needs to be achieved to guarantee "perfect" view synthesis/light field interpolation at a given resolution.
2/
@BartWronsk
Some very cool results on tracing through variable index of refraction in "Refractive Radiative Transfer Equation" by Ament et al. in SIGGRAPH 2014. Not including the wind simulation though!
@mmalex
one thing siggraph conference track nailed is, don't force authors to label their own paper as "lower tier" -- leave it as dual-track that can end up as either conference/journal and let reviewers decide. win/win as even good papers are encouraged to tighten up to 7 pages :)
Their beautiful high-res results convinced @_pratul_ and me to immediately switch over from working on depth-map-based warping to a volumetric rendering model using multiplane images.
@jperldev
ha oops. that was meant to be in contrast to surface-based rendering (discontinuity at occlusion boundaries is hard to deal with), especially of SDFs (needs implicit differentiation).
only "trivial" as in, implement the fwd pass and autodiff gives you the gradient for free...
In LLFF, we related our view synthesis performance to this fundamental limit, showing that if we used N planes in an MPI, we could increase the baseline between sampled views to N pixels.
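Back-of-the-envelope version of that argument (pinhole camera; the numbers are made up): the nearest point's disparity between adjacent views is focal_px * baseline / z_near pixels, so allowing N pixels of disparity scales the maximum camera spacing by N:

```python
def max_baseline(n_pixels, z_near, focal_px):
    """Largest camera spacing such that the nearest scene point (at depth z_near)
    moves at most n_pixels between adjacent views, using the pinhole relation
    disparity_px = focal_px * baseline / depth."""
    return n_pixels * z_near / focal_px

# e.g. nearest object at 2 m, focal length of 1000 px:
b_lightfield = max_baseline(1, z_near=2.0, focal_px=1000.0)   # 1 px disparity: 2 mm steps
b_mpi32 = max_baseline(32, z_near=2.0, focal_px=1000.0)       # 32-plane MPI: 64 mm steps
```

Moving the camera only 2 mm between captures is why the raw Nyquist rate is intractable, and why an N-plane MPI relaxing it by a factor of N matters so much.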
You can read this as image quality vs. # of input views (going from most to least).
4/
@mmalex
the cultural shift is the main issue yeah, people have made various well-intentioned attempts to do this but i think most of them have missed the mark due to failing/refusing to understand the motivations behind prestige seeking behavior...
@simesgreen
siggraph submissions never fail to surprise, this is definitely right up there with computational citrus peeling and the knitting compiler
"We partner with several zoos and circuses..."
This is probably a good time to mention that my team (a subset of this author list, and a part of Google Research) is recruiting interns for next summer at our San Francisco and London offices. Email me at barron@google.com if you're interested in working on something NeRF-y.