Implicit Neural Representation for Vision
June 18th (afternoon), held in conjunction with CVPR 2024
Location: Seattle Convention Center - Summit 335-336
Overview
An emerging area within deep learning, implicit neural representation (INR), also known as neural fields, offers a powerful new mechanism and paradigm for processing and representing visual data. In contrast with the dominant big-data setting, INR focuses on neural networks that parameterize a field, often in a coordinate-based manner. The best-known model in this class is NeRF, which has been wildly successful for 3D modeling, especially for novel view synthesis. INRs for 2D images and videos have many compelling properties as well, particularly for visual data compression: several multimodal compression techniques have been developed by treating the network weights as the data itself and leveraging their implicit reconstruction ability.
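As a rough, illustrative sketch (not code from the workshop or any specific paper), the snippet below fits a single RGB image with a coordinate-based MLP in PyTorch: the network maps a pixel coordinate (x, y) to a color (r, g, b), so after fitting, the weights themselves act as the stored representation of the image. The names (CoordinateMLP, fit_image) and all hyperparameters are illustrative choices, not a reference implementation.

```python
# Minimal sketch: fit one image with a coordinate-based MLP (an INR).
import torch
import torch.nn as nn

class CoordinateMLP(nn.Module):
    def __init__(self, hidden=256, layers=4, out_dim=3):
        super().__init__()
        blocks, in_dim = [], 2
        for _ in range(layers):
            blocks += [nn.Linear(in_dim, hidden), nn.ReLU()]
            in_dim = hidden
        blocks += [nn.Linear(in_dim, out_dim), nn.Sigmoid()]
        self.net = nn.Sequential(*blocks)

    def forward(self, coords):           # coords: (N, 2) in [-1, 1]
        return self.net(coords)          # returns (N, 3) RGB in [0, 1]

def fit_image(image, steps=2000, lr=1e-3):
    """image: (H, W, 3) float tensor with values in [0, 1]."""
    H, W, _ = image.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
    coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)   # (H*W, 2)
    target = image.reshape(-1, 3)                           # (H*W, 3)

    model = CoordinateMLP()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((model(coords) - target) ** 2).mean()       # per-pixel MSE
        loss.backward()
        opt.step()
    return model   # the weights now encode the image
```

In practice, plain ReLU MLPs struggle to capture high-frequency detail, which is why positional encodings or sinusoidal activations (as in NeRF and SIREN) are typically used instead of the bare architecture above.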
This is a relatively new area in vision, with many opportunities to propose new algorithms, extend existing applications, and innovate entirely new systems. Since working with INRs often requires fewer resources than many other areas, this research is especially accessible in an academic setting. Additionally, while there are many workshops devoted to NeRF, there are often none for the much broader spectrum of other INR work. We therefore propose this workshop as an avenue to build up the fledgling INR community and to help disseminate knowledge about this exciting area. We invite researchers to the Workshop for Implicit Neural Representation for Vision (INRV), where we investigate multiple directions, challenges, and opportunities related to implicit neural representation.
The simple design and flexibility offered by INRs, together with recent work that proposes hypernetworks to circumvent expensive per-model training, point to the potential of INR as a unified architecture that can efficiently represent audio, images, video, and 3D data. The following are some of the key directions and challenges in INR.
Compression is learning: Data is crucial for modern AI success, and efficiently storing and transmitting this data is increasingly vital. An efficient data compressor is also an efficient learner: building a good compressor is a step toward learning good data representations for downstream tasks.
Faster training and encoding of INRs: INRs leverage neural networks to compress individual data points, but because each signal is fit (overfit) from scratch, prolonged training time is a significant drawback. A more efficient approach is to devise methods that identify patterns within a dataset and generalize to unseen data points; several meta-learning methods have been introduced to this end, reducing training time from hours to just seconds (a minimal sketch of this idea follows this list of directions).
Unified architecture: The advantage of a coordinate-based network is that it can, at least in theory, represent all kinds of data - images, video, audio, and 3D - with the same network architecture, a holy grail for the field.
Downstream tasks based on INRs: Besides being an efficient and unified representation, INR presents exciting opportunities for various vision tasks. These include enhancing signals, recognizing patterns, and generating new data.
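As a hedged illustration of the faster-encoding direction above (not any particular paper's method), the sketch below shows first-order, Reptile-style meta-learning of an INR initialization: a shared set of weights is repeatedly nudged toward the weights obtained after a few adaptation steps on each training signal, so that new signals can later be encoded in only a handful of gradient steps. The function names and the dataset.sample() interface are hypothetical, and the model can be any coordinate-based network such as the one sketched earlier.

```python
# Sketch: Reptile-style meta-learning of a fast-to-fit INR initialization.
import copy
import torch

def inner_fit(model, coords, target, steps=3, lr=1e-2):
    """Adapt a copy of the model to one signal for a few gradient steps."""
    adapted = copy.deepcopy(model)
    opt = torch.optim.SGD(adapted.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((adapted(coords) - target) ** 2).mean()
        loss.backward()
        opt.step()
    return adapted

def reptile_meta_train(model, dataset, meta_steps=1000, meta_lr=0.1):
    """dataset.sample() (assumed interface) yields (coords, target) for one signal."""
    for _ in range(meta_steps):
        coords, target = dataset.sample()
        adapted = inner_fit(model, coords, target)
        # Move the shared initialization toward the adapted weights.
        with torch.no_grad():
            for p, p_adapted in zip(model.parameters(), adapted.parameters()):
                p.add_(meta_lr * (p_adapted - p))
    return model   # initialization from which new signals fit in a few steps
```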
Call for Papers
INRV will have an 8-page CVPR proceedings track. Note that 8 pages is a maximum; shorter submissions are also permissible, provided they are original and of good quality. All submissions must follow the CVPR 2024 style guide, since accepted papers will be published in the conference proceedings.
- Submission Deadline: March 31st, 11:59 PM Pacific Time
- Acceptance Notification: April 7th, 11:59 PM Pacific Time
- Camera Ready: April 14th, 11:59 PM Pacific Time
- Submission Website: CMT
Papers should describe original research that leverages implicit neural representations to solve computer vision problems. Topics of interest lie at the intersection of computer vision and implicit neural representation, including but not limited to:
- Compression: Image, Video, Scene
- Restoration: Denoising, Inpainting
- Enhancement: Interpolation, Superresolution
- Generalizable INR: Hypernetworks, meta-learning
- Architecture Design and Optimization: Reducing model size and training time
- Generation: Generating videos, scenes, images
- Recognition: Classification, detection, segmentation
Participation
Each workshop paper must be registered under a full, in-person registration type ("AUTHOR / FULL PASSPORT").
Register here
Submission
Submit via the CMT portal: CMT
You may submit one supplementary file (zip or pdf), maximum 100MB.
Program Schedule
Accepted Papers
Contextualising Implicit Representations for Semantic Tasks
Theo Costain, Kejie Li, Victor Prisacariu
Connecting NeRFs, Images, and Text
Francesco Ballerini, Pierluigi Zama Ramirez, Roberto Mirabella, Samuele Salti, Luigi Di Stefano
StegaNeRV: Video Steganography using Implicit Neural Representation
Monsij Biswal, Tong Shao, Kenneth Rose, Peng Yin, Sean McCarthy
In Search of a Data Transformation That Accelerates Neural Field Training
Junwon Seo, Sangyoon Lee, Kwang In Kim, Jaeho Lee
DiVa-360: The Dynamic Visual Dataset for Immersive Neural Fields
Cheng-You Lu, Peisen Zhou, Angela Xing, Chandradeep Pokhariya, Arnab Dey, Ishaan Shah, Rugved Mavidipalli, Dylan Hu, Andrew Comport, Kefan Chen, Srinath Sridhar
ImplicitTerrain: a Continuous Surface Model for Terrain Data Analysis
Haoan Feng, Xin Xu, Leila De Floriani
Adversarial Text to Continuous Image Generation
Kilichbek Haydarov, Aashiq Muhamed, Xiaoqian Shen, Jovana Lazarevic, Ivan Skorokhodov, Chamuditha Jayanga Galappaththige, Mohamed Elhoseiny
Speakers
Vincent Sitzmann is an Assistant Professor at MIT EECS, where he is leading the Scene Representation Group. Previously, he did his Ph.D. at Stanford University as well as a Postdoc at MIT CSAIL. His research interest lies in building models that perceive and model the world the way that humans do. Specifically, Vincent works towards models that can learn to reconstruct a rich state description of their environment, such as reconstructing its 3D structure, materials, semantics, etc. from vision. More importantly, these models should then also be able to model the impact of their own actions on that environment, i.e., learn a "mental simulator" or "world model". Vincent is particularly interested in models that can learn these skills fully self-supervised only from video and by self-directed interaction with the world.
Xiaolong Wang is an Assistant Professor in the ECE department at the University of California, San Diego. He received his Ph.D. in Robotics at Carnegie Mellon University. His postdoctoral training was at the University of California, Berkeley. His research focuses on the intersection between computer vision and robotics. His specific interest lies in learning 3D and dynamics representations from videos and physical robotic interaction data. These comprehensive representations are utilized to facilitate the learning of human-like robot skills, with the goal of generalizing the robot to interact effectively with a wide range of objects and environments in the real physical world. He is the recipient of the NSF CAREER Award, Intel Rising Star Faculty Award, and Research Awards from Sony, Amazon, Adobe, and Cisco.
Hyunjik Kim is a research scientist at Google DeepMind in the London office, working on various topics in deep learning. His research interests continue to evolve; currently he is interested in video generation, neural compression, and neural fields, in particular the idea of using neural fields for compression and doing deep learning directly on the compressed space rather than on traditional array data (e.g., pixels). Prior to that, he worked on group-equivariant deep learning, theoretical properties of self-attention, unsupervised representation learning (disentangling), and learning stochastic processes via deep learning methods (neural processes).
Srinath Sridhar is an assistant professor of computer science at Brown University. He received his PhD at the Max Planck Institute for Informatics and was subsequently a postdoctoral researcher at Stanford. His research interests are in 3D computer vision and machine learning. Specifically, his group (https://ivl.cs.brown.edu) focuses on visual understanding of 3D human physical interactions with applications ranging from robotics to mixed reality. He is a recipient of the NSF CAREER award, a Google Research Scholar award, and his work received the Eurographics Best Paper Honorable Mention. He spends part of his time as a visiting academic at Amazon Robotics and has previously spent time at Microsoft Research Redmond and Honda Research Institute.
Jia-Bin Huang is a Capital One endowed Associate Professor in Computer Science at the University of Maryland, College Park. He received his Ph.D. degree from the Department of Electrical and Computer Engineering at the University of Illinois, Urbana-Champaign. His research interests include computer vision, computer graphics, and machine learning. Huang is the recipient of the Thomas & Margaret Huang Award, an NSF CRII award, faculty awards from Samsung, Google, 3M, and Qualcomm, and a Google Research Scholar Award.
Namitha is a second-year master's student in the Department of Computer Science at the University of Maryland (UMD), advised by Professor Abhinav Shrivastava. She looks forward to continuing with Abhinav Shrivastava's group as a PhD student in the fall. Previously, she obtained her bachelor's in Computer Science from RVCE Bangalore. Her research interests lie in understanding implicit representations and exploring their utility in computer vision tasks.
Please contact Matt Gwilliam with any questions: mgwillia [at] umd [dot] edu.
Important Dates and Details
- Paper submission deadline: March 31st. CMT portal: CMT
- Notification of acceptance: April 7th
- Camera ready due: April 14th
- Workshop date: June 18th (afternoon)