Implicit Neural Representation for Vision

June 18th (afternoon), held in conjunction with CVPR 2024


An emerging area within deep learning, implicit neural representation (INR), also known as neural fields, offers a powerful new mechanism and paradigm for processing and representing visual data. In contrast with the dominant big data setting, INR focuses on neural networks which parameterize a field, often in a coordinate-based manner. The most well-known of this class of models is NeRF, which has been wildly successful for 3D modeling, especially for novel view synthesis. INRs for 2D images and videos have many compelling properties as well, particularly for visual data compression. By treating the weights of the networks as the data itself and leveraging their implicit reconstruction ability, several multimodal compression techniques have been developed.
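To make the coordinate-based idea concrete, here is a minimal sketch of fitting a small network to a single image, so that the trained weights become the stored representation. Everything in it is an illustrative assumption rather than a method from any particular paper: the synthetic 16x16 target, the fixed random sin/cos encoding, the one-hidden-layer tanh network, plain full-batch gradient descent, and the hypothetical helper names `make_coords`, `encode`, `forward`, `fit_inr`, and `render`.

```python
import numpy as np

def make_coords(h, w):
    """Return an (h*w, 2) array of pixel coordinates scaled to [-1, 1]."""
    ys, xs = np.meshgrid(np.linspace(-1, 1, h), np.linspace(-1, 1, w), indexing="ij")
    return np.stack([ys.ravel(), xs.ravel()], axis=1)

def encode(coords, freqs):
    """Map 2D coordinates to sin/cos features (assumed positional encoding)."""
    proj = coords @ freqs
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=1)

def forward(params, coords):
    """Evaluate the network at the given coordinates; also return intermediates."""
    x = encode(coords, params["freqs"])
    hid = np.tanh(x @ params["w1"] + params["b1"])
    return x, hid, hid @ params["w2"] + params["b2"]

def fit_inr(image, hidden=32, n_freqs=8, steps=1000, lr=0.02, seed=0):
    """Overfit a tiny MLP to one grayscale image with full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    coords = make_coords(*image.shape)
    target = image.reshape(-1, 1)
    params = {
        "freqs": rng.normal(scale=3.0, size=(2, n_freqs)),  # fixed, not trained
        "w1": rng.normal(scale=(2 * n_freqs) ** -0.5, size=(2 * n_freqs, hidden)),
        "b1": np.zeros(hidden),
        "w2": rng.normal(scale=hidden ** -0.5, size=(hidden, 1)),
        "b2": np.zeros(1),
    }
    losses = []
    for _ in range(steps):
        x, hid, pred = forward(params, coords)
        err = pred - target
        losses.append(float(np.mean(err ** 2)))
        # Manual backpropagation of the mean squared error.
        g_pred = 2.0 * err / len(target)
        g_hid = (g_pred @ params["w2"].T) * (1.0 - hid ** 2)
        grads = {"w2": hid.T @ g_pred, "b2": g_pred.sum(axis=0),
                 "w1": x.T @ g_hid, "b1": g_hid.sum(axis=0)}
        for name, grad in grads.items():
            params[name] -= lr * grad
    return params, losses

def render(params, h, w):
    """Query the fitted field on an h-by-w grid (any resolution)."""
    return forward(params, make_coords(h, w))[2].reshape(h, w)

# Synthetic target image (an assumption; any 2D array would do).
grid = make_coords(16, 16)
image = (np.sin(3.0 * grid[:, 0]) * np.cos(2.0 * grid[:, 1])).reshape(16, 16)
params, losses = fit_inr(image)
upsampled = render(params, 32, 32)  # same weights, denser sampling grid
```

Because the network maps continuous coordinates to values, the same weights can be queried on a denser grid, as `render(params, 32, 32)` does. This toy network is not smaller than the 16x16 image it fits; compression-oriented INR methods rely on additional techniques that are not shown here.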

This is a relatively new area in vision, with many opportunities to propose new algorithms, extend existing applications, and innovate entirely new systems. Since working with INRs often requires fewer resources than many other areas, this sort of research is especially accessible in the academic setting. Additionally, while there are many workshops for NeRF, there are often none for the incredibly broad spectrum of other INR work. Therefore, we propose this workshop as an avenue to build up the fledgling INR community and help disseminate knowledge about this exciting area. We thus invite researchers to the Workshop for Implicit Neural Representation for Vision (INRV), where we investigate multiple directions, challenges, and opportunities related to implicit neural representation.

The simple design and flexibility offered by INRs, together with recent work that proposes hypernetworks to circumvent expensive per-model training, point to the potential of INR as a unified architecture that can efficiently represent audio, images, video, and 3D data. The following are some of the key directions and challenges in INR.

  • Compression is learning: Data is crucial for modern AI success, and efficiently storing and transmitting this data is increasingly vital. An efficient data compressor is also an efficient learner. A good compressor is a step towards learning a good data representation for downstream tasks.

  • Faster training, encoding of INRs: INRs leverage neural networks to compress individual data points. However, a more efficient approach would be to devise methods that can identify patterns within a dataset and generalize to unseen data points. Currently, prolonged training time is a significant drawback of INRs, because a network is overfit to each data point individually. Several meta-learning methods have been introduced to address this, reducing the training time from hours to just seconds.

  • Unified architecture: The advantage of having a network with implicit inputs is that it can, at least in theory, be used to represent all kinds of data - images, video, audio and 3D using the same network architecture - a holy grail for the field.

  • Downstream tasks based on INRs: Besides being an efficient and unified representation, INR presents exciting opportunities for various vision tasks. These include enhancing signals, recognizing patterns, and generating new data.

Call for Papers

INRV will have an 8-page CVPR proceedings track. Note that 8 pages is a maximum; shorter submissions, given they are original and of good quality, are also permissible. All submissions must follow the CVPR 2024 style guide, since if accepted they will be published in the conference proceedings.
  • Submission Deadline: March 31st, 11:59 PM Pacific Time
  • Acceptance Notification: April 7th, 11:59 PM Pacific Time
  • Camera Ready: April 14th, 11:59 PM Pacific Time
  • Submission Website: CMT

Papers ought to describe original research that leverages implicit neural representation to solve computer vision problems. Topics of interest are those which lie at the intersection of computer vision and implicit neural representation, including but not limited to:

  • Compression: Image, Video, Scene
  • Restoration: Denoising, Inpainting
  • Enhancement: Interpolation, Superresolution
  • Generalizable INR: Hypernetworks, meta-learning
  • Architecture Design and Optimization: Reducing model size and training time
  • Generation: Generating videos, scenes, images
  • Recognition: Classification, detection, segmentation


Each workshop paper must be registered under a full, in-person registration type ("AUTHOR / FULL PASSPORT").

Register here


Submit in this CMT portal: CMT

You may submit one supplementary file (zip or pdf), maximum 100MB.

This is a work in progress. All times and speaker orders are tentative.

Program Schedule

Time (Seattle, UTC-7)

  • 01:00 - 01:15: Opening Remarks (Shishira Maiya)
  • 01:15 - 01:45: Invited Talk #1 (Vincent Sitzmann)
  • 01:45 - 03:00: Accepted Paper Talks (10 Minute Talks for 7 Accepted Papers)
  • 03:00 - 03:45: Poster Session and Buffer
  • 03:45 - 04:15: Invited Talk #2 (Jia-Bin Huang)
  • 04:15 - 04:30: Invited Short Talk #1 (Jaeho Lee)
  • 04:30 - 05:00: Invited Talk #3 (Srinath Sridhar)
  • 05:00 - 05:15: Invited Short Talk #2 (Namitha Padmanabhan)
  • 05:15 - 05:45: Invited Talk #4 (Xiaolong Wang)
  • 05:45 - 06:15: Invited Talk #5 (Hyunjik Kim)
  • 06:15 - 06:30: Closing Remarks (Matthew Gwilliam)


Vincent Sitzmann

Vincent Sitzmann is an Assistant Professor at MIT EECS, where he is leading the Scene Representation Group. Previously, he did his Ph.D. at Stanford University as well as a Postdoc at MIT CSAIL. His research interest lies in building models that perceive and model the world the way that humans do. Specifically, Vincent works towards models that can learn to reconstruct a rich state description of their environment, such as reconstructing its 3D structure, materials, semantics, etc. from vision. More importantly, these models should then also be able to model the impact of their own actions on that environment, i.e., learn a "mental simulator" or "world model". Vincent is particularly interested in models that can learn these skills fully self-supervised only from video and by self-directed interaction with the world.

Xiaolong Wang

Xiaolong Wang is an Assistant Professor in the ECE department at the University of California, San Diego. He received his Ph.D. in Robotics at Carnegie Mellon University. His postdoctoral training was at the University of California, Berkeley. His research focuses on the intersection between computer vision and robotics. His specific interest lies in learning 3D and dynamics representations from videos and physical robotic interaction data. These comprehensive representations are utilized to facilitate the learning of human-like robot skills, with the goal of generalizing the robot to interact effectively with a wide range of objects and environments in the real physical world. He is the recipient of the NSF CAREER Award, Intel Rising Star Faculty Award, and Research Awards from Sony, Amazon, Adobe, and Cisco.

Hyunjik Kim

Hyunjik is a research scientist at Google DeepMind in the London office, working on various topics in Deep Learning. His research interests keep evolving, and currently he's interested in video generation, neural compression, and neural fields, in particular the idea of using them for compression and doing deep learning directly on the compressed space rather than on traditional array data (e.g. pixels). Prior to that, he worked on group equivariant deep learning, theoretical properties of self-attention, unsupervised representation learning (disentangling), and learning stochastic processes via Deep Learning methods (neural processes).

Srinath Sridhar

Srinath Sridhar is an assistant professor of computer science at Brown University. He received his PhD at the Max Planck Institute for Informatics and was subsequently a postdoctoral researcher at Stanford. His research interests are in 3D computer vision and machine learning. Specifically, his group focuses on visual understanding of 3D human physical interactions with applications ranging from robotics to mixed reality. He is a recipient of the NSF CAREER award, a Google Research Scholar award, and his work received the Eurographics Best Paper Honorable Mention. He spends part of his time as a visiting academic at Amazon Robotics and has previously spent time at Microsoft Research Redmond and Honda Research Institute.

Jia-Bin Huang

Jia-Bin Huang is a Capital One endowed Associate Professor in Computer Science at the University of Maryland College Park. He received his Ph.D. degree from the Department of Electrical and Computer Engineering at the University of Illinois, Urbana-Champaign. His research interests include computer vision, computer graphics, and machine learning. Huang is the recipient of the Thomas & Margaret Huang Award, an NSF CRII award, faculty awards from Samsung, Google, 3M, and Qualcomm, and a Google Research Scholar Award.

Namitha Padmanabhan

Namitha is a second year master's student in the department of Computer Science at the University of Maryland (UMD), advised by Professor Abhinav Shrivastava. She looks forward to continuing with Abhinav Shrivastava's group as a PhD student in the fall. Previously, she obtained her bachelor's in Computer Science from RVCE Bangalore. Her research interests lie in understanding implicit representations and exploring their utility in computer vision tasks.

Jaeho Lee

Jaeho Lee is an assistant professor at POSTECH EE and a visiting researcher at Google. At POSTECH, he leads the Efficient Learning Laboratory, which focuses on developing algorithmic and theoretical foundations for efficient machine learning. At Google, he works with the MLPerf team to reduce the computational cost of Google's production-level generative models. He received his Ph.D. degree from the Department of Electrical and Computer Engineering at the University of Illinois at Urbana-Champaign.

Please contact Matt Gwilliam with any questions: mgwillia [at] umd [dot] edu.

Important Dates and Details

  • Paper submission deadline: March 31st. CMT portal: CMT
  • Notification of acceptance: April 7th
  • Camera ready due: April 14th
  • Workshop date: June 18th, PM