Hyunjik Kim

I'm a researcher on the Foundation Models team at Apple, working from the New York office. Before that, I was a research scientist at Google DeepMind in the London office, working full-time on Google DeepMind's video generation model Veo. Prior to that, I did my PhD in machine learning at the University of Oxford, supervised by Prof. Yee Whye Teh in the Machine Learning group at the Department of Statistics.

My research interests keep evolving, and currently I'm excited about multimodal understanding and generation, with a focus on images and video. I'm also interested in neural compression, in particular video compression, and neural fields with applications to video. Previously, I worked on neural fields, group equivariant deep learning, theoretical properties of self-attention, unsupervised representation learning (disentangling) and learning stochastic processes via deep learning methods (neural processes).

Before my PhD, I studied Mathematics at the University of Cambridge, from which I obtained B.A. and M.Math. degrees. I spent a summer at Microsoft Research, Cambridge as a research intern, and worked on collaborative filtering. I also spent a summer interning at DeepMind working on unsupervised learning of disentangled representations.

Curriculum Vitae (last updated: 2024)

Google scholar page

E-mail: hyunjik11@gmail.com

Recent

Veo 1 & 2

Veo 2 is Google DeepMind's most capable video generation model to date (Dec 2024). It generates high-quality, 1080p resolution videos that can go beyond a minute, in a wide range of cinematic and visual styles.

Core contributor. Focused on pre-training: data, latent space design and decoder.
Veo 1 blog post
Veo 2 blog post

Good, Cheap, and Fast: Overfitted Image Compression with Wasserstein Distortion

Abstract: Inspired by the success of generative image models, recent work on learned image compression increasingly focuses on better probabilistic models of the natural image distribution, leading to excellent image quality. This, however, comes at the expense of a computational complexity that is several orders of magnitude higher than today's commercial codecs, and thus prohibitive for most practical applications. With this paper, we demonstrate that by focusing on modeling visual perception rather than the data distribution, we can achieve a very good trade-off between visual quality and bit rate similar to "generative" compression models such as HiFiC, while requiring less than 1% of the multiply-accumulate operations (MACs) for decompression. We do this by optimizing C3, an overfitted image codec, for Wasserstein Distortion (WD), and evaluating the image reconstructions with a human rater study, showing that WD clearly outperforms LPIPS as an optimization objective. The study also reveals that WD outperforms other perceptual metrics such as LPIPS, DISTS, and MS-SSIM as a predictor of human ratings, remarkably achieving over 94% Pearson correlation with Elo scores.

Jona Ballé, Luca Versari, Emilien Dupont, Hyunjik Kim, Matthias Bauer
CVPR, 2025 (highlight)
pdf

Publications

C3: High-performance and low-complexity neural compression from a single image or video

Abstract: Most neural compression models are trained on large datasets of images or videos in order to generalize to unseen data. Such generalization typically requires large and expressive architectures with a high decoding complexity. Here we introduce C3, a neural compression method with strong rate-distortion (RD) performance that instead overfits a small model to each image or video separately. The resulting decoding complexity of C3 can be an order of magnitude lower than neural baselines with similar RD performance. C3 builds on COOL-CHIC (Ladune et al.) and makes several simple and effective improvements for images. We further develop new methodology to apply C3 to videos. On the CLIC2020 image benchmark, we match the RD performance of VTM, the reference implementation of the H.266 codec, with less than 3k MACs/pixel for decoding. On the UVG video benchmark, we match the RD performance of the Video Compression Transformer (Mentzer et al.), a well-established neural video codec, with less than 5k MACs/pixel for decoding.

Hyunjik Kim^*, Matthias Bauer^*, Lucas Theis, Jonathan Schwarz, Emilien Dupont^*
^*Equal contribution.
CVPR, 2024
pdf | github | project page

Finding Increasingly Large Extremal Graphs with AlphaZero and Tabu Search

Abstract: This work studies a central extremal graph theory problem inspired by a 1975 conjecture of Erdős, which aims to find graphs with a given size (number of nodes) that maximize the number of edges without having 3- or 4-cycles. We formulate this problem as a sequential decision-making problem and compare AlphaZero, a neural network-guided tree search, with tabu search, a heuristic local search method. Using either method, by introducing a curriculum -- jump-starting the search for larger graphs using good graphs found at smaller sizes -- we improve the state-of-the-art lower bounds for several sizes. We also propose a flexible graph-generation environment and a permutation-invariant network architecture for learning to search in the space of graphs.

Abbas Mehrabian^*, Ankit Anand^*, Hyunjik Kim^* et al.
^*Equal contribution.
NeurIPS 2023 Workshop: MATH-AI.
pdf

Learning Instance-Specific Augmentations by Capturing Local Invariances

Abstract: We introduce InstaAug, a method for automatically learning input-specific augmentations from data. Previous methods for learning augmentations have typically assumed independence between the original input and the transformation applied to that input. This can be highly restrictive, as the invariances we hope our augmentation will capture are themselves often highly input dependent. InstaAug instead introduces a learnable invariance module that maps from inputs to tailored transformation parameters, allowing local invariances to be captured. This can be simultaneously trained alongside the downstream model in a fully end-to-end manner, or separately learned for a pre-trained model. We empirically demonstrate that InstaAug learns meaningful input-dependent augmentations for a wide range of transformation classes, which in turn provides better performance on both supervised and self-supervised tasks.

Ning Miao, Emile Mathieu, Yann Dubois, Tom Rainforth, Yee Whye Teh, Adam Foster, Hyunjik Kim
ICML, 2023
pdf

Spatial Functa: Scaling Functa to ImageNet Classification and Generation

Abstract: Neural fields, also known as implicit neural representations, have emerged as a powerful means to represent complex signals of various modalities. Based on this, Dupont et al. (2022) introduce a framework that views neural fields as data, termed *functa*, and propose to do deep learning directly on this dataset of neural fields. In this work, we show that the proposed framework faces limitations when scaling up to even moderately complex datasets such as CIFAR-10. We then propose *spatial functa*, which overcome these limitations by using spatially arranged latent representations of neural fields, thereby allowing us to scale up the approach to ImageNet-1k at 256x256 resolution. We demonstrate competitive performance to Vision Transformers (Steiner et al., 2022) on classification and Latent Diffusion (Rombach et al., 2022) on image generation respectively.

Matthias Bauer^*, Emilien Dupont, Andy Brock, Dan Rosenbaum, Jonathan Schwarz, Hyunjik Kim^*
^*Equal contribution.
ICLR 2023 Workshop: Neural Fields across Fields.
pdf | bibtex

Pre-training via Denoising for Molecular Property Prediction

Abstract: Many important problems involving molecular property prediction from 3D structures have limited data, posing a generalization challenge for neural networks. In this paper, we describe a pre-training technique that utilizes large datasets of 3D molecular structures at equilibrium to learn meaningful representations for downstream tasks. Inspired by recent advances in noise regularization, our pre-training objective is based on denoising. Relying on the well-known link between denoising autoencoders and score-matching, we also show that the objective corresponds to learning a molecular force field -- arising from approximating the physical state distribution with a mixture of Gaussians -- directly from equilibrium structures. Our experiments demonstrate that using this pre-training objective significantly improves performance on multiple benchmarks, achieving a new state-of-the-art on the majority of targets in the widely used QM9 dataset. Our analysis then provides practical insights into the effects of different factors -- dataset sizes, model size and architecture, and the choice of upstream and downstream datasets -- on pre-training.

Sheheryar Zaidi^*, Michael Schaarschmidt^*, James Martens, Hyunjik Kim, Yee Whye Teh, Alvaro Sanchez Gonzalez, Peter Battaglia, Razvan Pascanu, Jonathan Godwin
^*Equal contribution.
ICLR 2023, notable top 25%.
pdf | bibtex

From data to functa: Your data point is a function and you can treat it like one

Abstract: It is common practice in deep learning to represent a measurement of the world on a discrete grid, e.g. a 2D grid of pixels. However, the underlying signal represented by these measurements is often continuous, e.g. the scene depicted in an image. A powerful continuous alternative is then to represent these measurements using an implicit neural representation, a neural function trained to output the appropriate measurement value for any input spatial location. In this paper, we take this idea to its next level: what would it take to perform deep learning on these functions instead, treating them as data? In this context we refer to the data as functa, and propose a framework for deep learning on functa. This view presents a number of challenges around efficient conversion from data to functa, compact representation of functa, and effectively solving downstream tasks on functa. We outline a recipe to overcome these challenges and apply it to a wide range of data modalities including images, 3D shapes, neural radiance fields (NeRF) and data on manifolds. We demonstrate that this approach has various compelling properties across data modalities, in particular on the canonical tasks of generative modeling, data imputation, novel view synthesis and classification.

Emilien Dupont^*, Hyunjik Kim^*, Ali Eslami, Danilo Rezende, Dan Rosenbaum
^*Equal contribution.
ICML 2022.
pdf | bibtex | github

Group Equivariant Subsampling

Abstract: Subsampling is used in convolutional neural networks (CNNs) in the form of pooling or strided convolutions, to reduce the spatial dimensions of feature maps and to allow the receptive fields to grow exponentially with depth. However, it is known that such subsampling operations are not translation equivariant, unlike convolutions that are translation equivariant. Here, we first introduce translation equivariant subsampling/upsampling layers that can be used to construct exact translation equivariant CNNs. We then generalise these layers beyond translations to general groups, thus proposing group equivariant subsampling/upsampling. We use these layers to construct group equivariant autoencoders (GAEs) that allow us to learn low-dimensional equivariant representations. We empirically verify on images that the representations are indeed equivariant to input translations and rotations, and thus generalise well to unseen positions and orientations. We further use GAEs in models that learn object-centric representations on multi-object datasets, and show improved data efficiency and decomposition compared to non-equivariant baselines.

Jin Xu, Hyunjik Kim, Tom Rainforth, Yee Whye Teh
NeurIPS 2021.
pdf | bibtex

LieTransformer: Equivariant Self-Attention for Lie Groups

Abstract: Group equivariant neural networks are used as building blocks of group invariant neural networks, which have been shown to improve generalisation performance and data efficiency through principled parameter sharing. Such works have mostly focused on group equivariant convolutions, building on the result that group equivariant linear maps are necessarily convolutions. In this work, we extend the scope of the literature to non-linear neural network modules, namely self-attention, that is emerging as a prominent building block of deep learning models. We propose the LieTransformer, an architecture composed of LieSelfAttention layers that are equivariant to arbitrary Lie groups and their discrete subgroups. We demonstrate the generality of our approach by showing experimental results that are competitive to baseline methods on a wide range of tasks: shape counting on point clouds, molecular property regression and modelling particle trajectories under Hamiltonian dynamics.

Michael Hutchinson^*, Charline Le Lan^*, Sheheryar Zaidi^*, Emilien Dupont, Yee Whye Teh, Hyunjik Kim
^*Equal contribution.
ICML 2021.
pdf | bibtex | github

The Lipschitz Constant of Self-Attention

Abstract: Lipschitz constants of neural networks have been explored in various contexts in deep learning, such as provable adversarial robustness, estimating Wasserstein distance, stabilising training of GANs, and formulating invertible neural networks. Such works have focused on bounding the Lipschitz constant of fully connected or convolutional networks, composed of linear maps and pointwise non-linearities. In this paper, we investigate the Lipschitz constant of self-attention, a non-linear neural network module widely used in sequence modelling. We prove that the standard dot-product self-attention is not Lipschitz, and propose an alternative L2 self-attention that is Lipschitz. We derive an upper bound on the Lipschitz constant of L2 self-attention and provide empirical evidence for its asymptotic tightness. To demonstrate the practical relevance of the theory, we formulate invertible self-attention and use it in a Transformer-based architecture for a character-level language modelling task.

Hyunjik Kim, George Papamakarios, Andriy Mnih
ICML 2021.
pdf | bibtex

MetaFun: Meta-Learning with Iterative Functional Updates

Abstract: We develop a functional encoder-decoder approach to supervised meta-learning, where labeled data is encoded into an infinite-dimensional functional representation rather than a finite-dimensional one. Furthermore, rather than directly producing the representation, we learn a neural update rule resembling functional gradient descent which iteratively improves the representation. The final representation is used to condition the decoder to make predictions on unlabeled data. Our approach is the first to demonstrate the success of encoder-decoder style meta-learning methods like conditional neural processes on large-scale few-shot classification benchmarks such as miniImageNet and tieredImageNet, where it achieves state-of-the-art performance.

Jin Xu, Jean-Francois Ton, Hyunjik Kim, Adam Kosiorek, Yee Whye Teh
ICML 2020.
pdf | bibtex

Attentive Neural Processes

Abstract: Neural Processes (NPs) (Garnelo et al., 2018a,b) approach regression by learning to map a context set of observed input-output pairs to a distribution over regression functions. Each function models the distribution of the output given an input, conditioned on the context. NPs have the benefit of fitting observed data efficiently with linear complexity in the number of context input-output pairs, and can learn a wide family of conditional distributions; they learn predictive distributions conditioned on context sets of arbitrary size. Nonetheless, we show that NPs suffer a fundamental drawback of underfitting, giving inaccurate predictions at the inputs of the observed data they condition on. We address this issue by incorporating attention into NPs, allowing each input location to attend to the relevant context points for the prediction. We show that this greatly improves the accuracy of predictions, results in noticeably faster training, and expands the range of functions that can be modelled.

Hyunjik Kim, Andriy Mnih, Jonathan Schwarz, Marta Garnelo, Ali Eslami, Dan Rosenbaum, Oriol Vinyals, Yee Whye Teh
Bayesian Deep Learning Workshop, NeurIPS 2018. Contributed Talk.
pdf
ICLR 2019.
pdf | bibtex | openreview | github

Sequential Attend, Infer, Repeat: Generative Modelling of Moving Objects

Abstract: We present Sequential Attend, Infer, Repeat (SQAIR), an interpretable deep generative model for videos of moving objects. It can reliably discover and track objects throughout the sequence of frames, and can also generate future frames conditioning on the current frame, thereby simulating expected motion of objects. This is achieved by explicitly encoding object presence, locations and appearances in the latent variables of the model. SQAIR retains all strengths of its predecessor, Attend, Infer, Repeat (AIR, Eslami et al., 2016), including learning in an unsupervised manner, and addresses its shortcomings. We use a moving multi-MNIST dataset to show limitations of AIR in detecting overlapping or partially occluded objects, and show how SQAIR overcomes them by leveraging temporal consistency of objects. Finally, we also apply SQAIR to real-world pedestrian CCTV data, where it learns to reliably detect, track and generate walking pedestrians with no supervision.

Adam Kosiorek, Hyunjik Kim, Ingmar Posner, Yee Whye Teh
NeurIPS 2018, Spotlight.
pdf | bibtex | github

Disentangling by Factorising

Abstract: We define and address the problem of unsupervised learning of disentangled representations on data generated from independent factors of variation. We propose FactorVAE, a method that disentangles by encouraging the distribution of representations to be factorial and hence independent across the dimensions. We show that it improves upon β-VAE by providing a better trade-off between disentanglement and reconstruction quality. Moreover, we highlight the problems of a commonly used disentanglement metric and introduce a new metric that does not suffer from them.

Hyunjik Kim, Andriy Mnih
Learning Disentangled Representations: From Perception to Control Workshop, NIPS 2017. Spotlight Talk.
pdf
ICML 2018.
pdf | bibtex

Scaling up the Automatic Statistician: Scalable Structure Discovery using Gaussian Processes

Abstract: Automating statistical modelling is a challenging problem in artificial intelligence. The Automatic Statistician takes a first step in this direction, by employing a kernel search algorithm with Gaussian Processes (GP) to provide interpretable statistical models for regression problems. However, this does not scale due to its O(N^3) running time for the model selection. We propose Scalable Kernel Composition (SKC), a scalable kernel search algorithm that extends the Automatic Statistician to bigger data sets. In doing so, we derive a cheap upper bound on the GP marginal likelihood that sandwiches the marginal likelihood with the variational lower bound. We show that the upper bound is significantly tighter than the lower bound and thus useful for model selection.

Hyunjik Kim, Yee Whye Teh
AutoML 2016, Journal of Machine Learning Research Workshop and Conference Proceedings.
Practical Bayesian Nonparametrics Workshop, NIPS 2016. Oral & Travel Award.
pdf
AISTATS 2018, Oral.
pdf | bibtex

Preprints

Meta-Learning surrogate models for sequential decision making

Abstract: We introduce a unified probabilistic framework for solving sequential decision making problems ranging from Bayesian optimisation to contextual bandits and reinforcement learning. This is accomplished by a probabilistic model-based approach that explains observed data while capturing predictive uncertainty during the decision making process. Crucially, this probabilistic model is chosen to be a Meta-Learning system that allows learning from a distribution of related problems, allowing data efficient adaptation to a target task. As a suitable instantiation of this framework, we explore the use of Neural processes due to statistical and computational desiderata. We apply our framework to a broad range of problem domains, such as control problems, recommender systems and adversarial attacks on RL agents, demonstrating an efficient and general black-box learning approach.

Jonathan Schwarz, Alexandre Galashov, Hyunjik Kim, Marta Garnelo, David Saxton, Pushmeet Kohli, Ali Eslami, Yee Whye Teh
ArXiv, 2019.
pdf | bibtex

Collaborative Filtering with Side Information: a Gaussian Process Perspective

Abstract: We tackle the problem of collaborative filtering (CF) with side information, through the lens of Gaussian Process (GP) regression. Driven by the idea of using the kernel to explicitly model user-item similarities, we formulate the GP in a way that allows the incorporation of low-rank matrix factorisation, arriving at our model, the Tucker Gaussian Process (TGP). Consequently, TGP generalises classical Bayesian matrix factorisation models, and goes beyond them to give a natural and elegant method for incorporating side information, giving enhanced predictive performance for CF problems. Moreover we show that it is a novel model for regression, especially well-suited to grid-structured data and problems where the dependence on covariates is close to being separable.

Hyunjik Kim, Xiaoyu Lu, Seth Flaxman, Yee Whye Teh
ArXiv, 2016.
pdf | bibtex

Talks

Topics on Attention in Deep Learning

Abstract: Attention mechanisms are being widely used in state-of-the-art deep learning models across various data modalities. In this talk, we explore the concept of attention or self-attention from two perspectives: 1. A methodological point of view and 2. A theoretical point of view. For 1, we study the Attentive Neural Process (ANP) that incorporates attention into the recently introduced Neural Process (NP), a deep neural network that learns a stochastic process, with applications in meta-learning. We show the role of attention in ANPs that allows it to address some fundamental drawbacks of NPs. For 2, we investigate the Lipschitz constant of self-attention, that measures how much the output of self-attention can change with respect to the change in its inputs. We thus theoretically demonstrate how self-attention is different to standard neural network architectures such as fully connected networks and convolutional networks.

Venue: Summer AI Seminar Series @POSTECH, 06/08/20.
slides

Attention: the Analogue of Kernels in Deep Learning

Abstract: There have been many recent works that lie at the intersection of kernel methods and deep learning, namely Deep Kernel Learning, Deep Gaussian Processes (GPs) and Convolutional GPs. However, such works are often motivated by borrowing ideas that originate from deep learning and incorporating them into kernel methods. In this talk, we will explore the concept of attention or self-attention, which has interestingly travelled the opposite path; it is inherently motivated from kernels, but is being used extensively in state-of-the-art deep learning models in various data modalities. We investigate attention in more detail by studying the Attentive Neural Process (ANP) that incorporates attention into the recently introduced Neural Process (NP), a deep model that learns a stochastic process. We show that ANPs address some fundamental drawbacks of NPs by bringing them closer to GPs, while maintaining the benefits of neural networks such as scalability and flexibility.

Venue: Recent Developments in Kernel Methods workshop @Gatsby Computational Neuroscience Unit, UCL, 27/09/19.
slides

Interpretable Models in Probabilistic Deep Learning

Abstract: As Deep Learning (DL) solutions to real-world problems are becoming increasingly common, DL researchers are striving to better understand the models that they develop. The community has been using the term ‘interpretability’ to describe models and methods that help us achieve this rather vague goal. However many claim that deep models are inherently uninterpretable due to their black-box nature, and stop paying attention to interpretability in deep models on these grounds. In this talk, we show that ‘deep’ and ‘interpretability’ are not mutually exclusive terms, hence it is both possible and necessary to devise interpretable deep models. We first clarify what is meant by the term ‘interpretability’, by listing its desiderata and properties. We then introduce examples of deep probabilistic models that enjoy various properties of interpretability: the talk will cover FactorVAE, a model for learning disentangled representations, and the Attentive Neural Process, a model for learning stochastic processes in a data-driven fashion, focusing on their applications to image data.

Venues: Korea Institute of Science and Technology (KIST), Center for Imaging Media Research, 03/04/19.
Naver Labs, 04/04/19.
Seoul National University, Computer Vision Lab, 05/04/19.
slides

Hobbies

I enjoy playing football (soccer, not the American one...!), learning languages, solving maths olympiad problems and coding challenges. Here are my solutions for Advent of Code, 2024.

Public Engagement

Introducing Machine Learning to the Public

I helped create a cute two-minute animation that introduces machine learning to the general public, along with friends at Oxford. Check it out below!

Further details can be found here.