Hyunjik Kim
I'm a research scientist at DeepMind, based in the Google London office, working on probabilistic modelling, attention-based models for deep learning, and group equivariant neural networks. Prior to that, I did my PhD in machine learning at the University of Oxford, supervised by Prof. Yee Whye Teh in the Machine Learning group at the Department of Statistics.
Broadly speaking, my research interests lie in the field of probabilistic modelling and deep learning, mostly at the intersection of the two. My narrower research interests keep on evolving; currently I'm interested in group equivariant neural networks and self-attention. In particular, I've recently done theoretical work studying the Lipschitz constant of self-attention. Prior to that, I worked on unsupervised representation learning (disentangling) and learning stochastic processes via Deep Learning methods (neural processes). I have also worked on scaling up inference for Gaussian processes, in particular on regression models for collaborative filtering that are motivated by a scalable approximation to a GP, as well as a method for scaling up the compositional kernel search used by the Automatic Statistician via variational sparse GP methods.
Before my PhD, I studied Mathematics at the University of Cambridge, from which I obtained B.A. and M.Math. degrees. I spent a summer at Microsoft Research, Cambridge as a research intern, and worked on collaborative filtering. I also spent a summer interning at DeepMind working on unsupervised learning of disentangled representations.
Curriculum Vitae (last updated: July 2020)
Email: hyunjikk@google.com
Recent
The Lipschitz Constant of Self-Attention
Abstract: Lipschitz constants of neural networks have been explored in various contexts in deep learning, such as provable adversarial robustness, estimating Wasserstein distance, stabilising training of GANs, and formulating invertible neural networks. Such works have focused on bounding the Lipschitz constant of fully connected or convolutional networks, composed of linear maps and pointwise nonlinearities. In this paper, we investigate the Lipschitz constant of self-attention, a nonlinear neural network module widely used in sequence modelling. We prove that the standard dot-product self-attention is not Lipschitz, and propose an alternative L2 self-attention that is Lipschitz. We derive an upper bound on the Lipschitz constant of L2 self-attention and provide empirical evidence for its asymptotic tightness. To demonstrate the practical relevance of the theory, we formulate invertible self-attention and use it in a Transformer-based architecture for a character-level language modelling task. Hyunjik Kim, George Papamakarios, Andriy Mnih. ArXiv, 2020. pdf  bibtex 
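A minimal sketch of the contrast the abstract draws, in my own notation (function names and the toy scaling are illustrative, not the paper's exact L2 self-attention formulation): dot-product logits are inner products that can grow without bound, whereas the L2 variant scores queries against keys by negative squared distances.

```python
# Illustrative sketch (not the paper's exact L2 self-attention): contrasts
# dot-product attention logits with a distance-based alternative.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_attention(Q, K, V):
    # Standard scaled dot-product attention: logits are inner products,
    # which grow without bound as the inputs grow.
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def l2_attention(Q, K, V):
    # Distance-based variant in the spirit of L2 self-attention: logits are
    # negative squared Euclidean distances between queries and keys.
    d = Q.shape[-1]
    dists = ((Q[:, None, :] - K[None, :, :]) ** 2).sum(-1)
    return softmax(-dists / np.sqrt(d)) @ V

X = np.random.randn(5, 8)                   # 5 tokens, dimension 8
y_dot = dot_product_attention(X, X, X)      # self-attention: Q = K = V = X
y_l2 = l2_attention(X, X, X)
```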
MetaFun: Meta-Learning with Iterative Functional Updates
Abstract: We develop a functional encoder-decoder approach to supervised meta-learning, where labeled data is encoded into an infinite-dimensional functional representation rather than a finite-dimensional one. Furthermore, rather than directly producing the representation, we learn a neural update rule resembling functional gradient descent which iteratively improves the representation. The final representation is used to condition the decoder to make predictions on unlabeled data. Our approach is the first to demonstrate the success of encoder-decoder style meta-learning methods like conditional neural processes on large-scale few-shot classification benchmarks such as miniImageNet and tieredImageNet, where it achieves state-of-the-art performance. Jin Xu, Jean-Francois Ton, Hyunjik Kim, Adam Kosiorek, Yee Whye Teh. ICML 2020. pdf  bibtex 
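A rough sketch of the iterative refinement described above, with toy stand-ins throughout (the RBF smoother, residual update and one-dimensional read-out are placeholders for MetaFun's learned kernel/attention, update rule and decoder): a functional representation, stored by its values at all inputs, is improved over several steps by local updates computed at the labelled context points.

```python
# Rough sketch of the iterative functional-update idea; every component here
# is a toy placeholder, not one of MetaFun's learned modules.
import numpy as np

def rbf(a, b, ls=1.0):
    # Stand-in for the learned kernel/attention that propagates local updates
    # from context inputs to all inputs.
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

def metafun_like_predict(x_ctx, y_ctx, x_tgt, T=5, r_dim=8):
    # Functional representation r(.) stored by its values at every input.
    x_all = np.concatenate([x_ctx, x_tgt])
    r = np.zeros((len(x_all), r_dim))
    for _ in range(T):
        # Local update at the labelled context points (here a simple residual),
        # then smoothed out to all inputs: loosely, one step of functional
        # gradient descent on the representation.
        u = y_ctx - r[:len(x_ctx), :1]
        r = r + rbf(x_all, x_ctx) @ np.pad(u, ((0, 0), (0, r_dim - 1)))
    # Toy "decoder": read predictions off the first dimension at the targets.
    return r[len(x_ctx):, :1]

x_ctx = np.random.randn(10, 2)
y_ctx = np.random.randn(10, 1)
preds = metafun_like_predict(x_ctx, y_ctx, np.random.randn(4, 2))
```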
Publications
Attentive Neural Processes
Abstract: Neural Processes (NPs) (Garnelo et al., 2018a,b) approach regression by learning to map a context set of observed input-output pairs to a distribution over regression functions. Each function models the distribution of the output given an input, conditioned on the context. NPs have the benefit of fitting observed data efficiently with linear complexity in the number of context input-output pairs, and can learn a wide family of conditional distributions; they learn predictive distributions conditioned on context sets of arbitrary size. Nonetheless, we show that NPs suffer a fundamental drawback of underfitting, giving inaccurate predictions at the inputs of the observed data they condition on. We address this issue by incorporating attention into NPs, allowing each input location to attend to the relevant context points for the prediction. We show that this greatly improves the accuracy of predictions, results in noticeably faster training, and expands the range of functions that can be modelled. Hyunjik Kim, Andriy Mnih, Jonathan Schwarz, Marta Garnelo, Ali Eslami, Dan Rosenbaum, Oriol Vinyals, Yee Whye Teh. Bayesian Deep Learning Workshop, NeurIPS 2018. Contributed Talk. ICLR 2019. pdf  bibtex  openreview  github 
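A small sketch of the key difference the abstract points at (placeholder functions only; the actual ANP architecture also has a latent path and learned query/key embeddings): a vanilla NP summarises the context with one mean-pooled vector shared by every target input, while the attentive variant gives each target input its own context summary via cross-attention.

```python
# Sketch only: contrasts the shared mean-pooled context summary of a vanilla
# NP with a per-target summary obtained by attending over context points.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def np_summary(r_context):
    # Vanilla NP: one representation, shared by every target input.
    return r_context.mean(axis=0, keepdims=True)

def anp_summary(x_target, x_context, r_context):
    # Attentive NP idea: each target input attends to the context points, so
    # predictions at observed inputs can stay close to the observed data.
    d = x_target.shape[-1]
    logits = x_target @ x_context.T / np.sqrt(d)
    return softmax(logits) @ r_context

x_ctx = np.random.randn(6, 2)                  # 6 context inputs
r_ctx = np.random.randn(6, 16)                 # their encoded representations
x_tgt = np.random.randn(3, 2)                  # 3 target inputs
r_shared = np_summary(r_ctx)                   # shape (1, 16)
r_per_tgt = anp_summary(x_tgt, x_ctx, r_ctx)   # shape (3, 16)
```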
Sequential Attend, Infer, Repeat: Generative Modelling of Moving Objects
Abstract: We present Sequential Attend, Infer, Repeat (SQAIR), an interpretable deep generative model for videos of moving objects. It can reliably discover and track objects throughout the sequence of frames, and can also generate future frames conditioned on the current frame, thereby simulating expected motion of objects. This is achieved by explicitly encoding object presence, locations and appearances in the latent variables of the model. SQAIR retains all strengths of its predecessor, Attend, Infer, Repeat (AIR, Eslami et al., 2016), including learning in an unsupervised manner, and addresses its shortcomings. We use a moving multi-MNIST dataset to show limitations of AIR in detecting overlapping or partially occluded objects, and show how SQAIR overcomes them by leveraging temporal consistency of objects. Finally, we also apply SQAIR to real-world pedestrian CCTV data, where it learns to reliably detect, track and generate walking pedestrians with no supervision. Adam Kosiorek, Hyunjik Kim, Ingmar Posner, Yee Whye Teh. NeurIPS 2018, Spotlight. pdf  bibtex  github 
Disentangling by Factorising
Abstract: We define and address the problem of unsupervised learning of disentangled representations on data generated from independent factors of variation. We propose FactorVAE, a method that disentangles by encouraging the distribution of representations to be factorial and hence independent across the dimensions. We show that it improves upon β-VAE by providing a better trade-off between disentanglement and reconstruction quality. Moreover, we highlight the problems of a commonly used disentanglement metric and introduce a new metric that does not suffer from them. Hyunjik Kim, Andriy Mnih. Learning Disentangled Representations: From Perception to Control Workshop, NIPS 2017. Spotlight Talk. ICML 2018. pdf  bibtex 
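A sketch of the kind of objective the abstract describes, in my own notation (the weighting γ and the estimator for the penalty are not quoted from the paper): the usual VAE ELBO augmented with a total-correlation penalty that pushes the aggregate posterior over representations towards a factorised distribution.

```latex
% Sketch of a FactorVAE-style objective (notation mine): VAE ELBO plus a
% total-correlation penalty on the aggregate posterior q(z).
\begin{align*}
  \mathcal{L}(x) &= \mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big]
      \;-\; \mathrm{KL}\big(q(z \mid x)\,\|\,p(z)\big)
      \;-\; \gamma\,\mathrm{TC}(z),\\
  \mathrm{TC}(z) &= \mathrm{KL}\Big(q(z)\,\Big\|\,\prod_{j} q(z_j)\Big),
  \qquad q(z) = \mathbb{E}_{p_{\mathrm{data}}(x)}\big[q(z \mid x)\big].
\end{align*}
```

TC(z) is zero exactly when the dimensions of z are independent, which is the factorial/independence property the abstract asks for; γ trades this off against reconstruction quality.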
Scaling up the Automatic Statistician: Scalable Structure Discovery using Gaussian Processes
Abstract: Automating statistical modelling is a challenging problem in artificial intelligence. The Automatic Statistician takes a first step in this direction, by employing a kernel search algorithm with Gaussian Processes (GP) to provide interpretable statistical models for regression problems. However, this does not scale due to its O(N^3) running time for the model selection. We propose Scalable Kernel Composition (SKC), a scalable kernel search algorithm that extends the Automatic Statistician to bigger data sets. In doing so, we derive a cheap upper bound on the GP marginal likelihood that sandwiches the marginal likelihood with the variational lower bound. We show that the upper bound is significantly tighter than the lower bound and thus useful for model selection. Hyunjik Kim, Yee Whye Teh. AutoML 2016, Journal of Machine Learning Research Workshop and Conference Proceedings. Practical Bayesian Nonparametrics Workshop, NIPS 2016. Oral & Travel Award. AISTATS 2018, Oral. pdf  bibtex 
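The role of the sandwich in the kernel search can be written schematically (generic symbols, not the paper's exact bounds): for each candidate kernel k, a cheap variational lower bound and the proposed cheap upper bound bracket the exact log marginal likelihood, so the search can compare kernels without the O(N^3) exact computation.

```latex
% Schematic sandwich used for model selection (generic symbols): kernels are
% compared via the bracketing interval rather than the exact quantity.
\mathcal{L}_{\mathrm{var}}(k) \;\le\; \log p(\mathbf{y} \mid X, k) \;\le\; \mathcal{U}(k)
```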
Preprints
Meta-Learning surrogate models for sequential decision making
Abstract: We introduce a unified probabilistic framework for solving sequential decision making problems ranging from Bayesian optimisation to contextual bandits and reinforcement learning. This is accomplished by a probabilistic model-based approach that explains observed data while capturing predictive uncertainty during the decision making process. Crucially, this probabilistic model is chosen to be a meta-learning system that allows learning from a distribution of related problems, enabling data-efficient adaptation to a target task. As a suitable instantiation of this framework, we explore the use of Neural Processes due to statistical and computational desiderata. We apply our framework to a broad range of problem domains, such as control problems, recommender systems and adversarial attacks on RL agents, demonstrating an efficient and general black-box learning approach. Jonathan Schwarz, Alexandre Galashov, Hyunjik Kim, Marta Garnelo, David Saxton, Pushmeet Kohli, Ali Eslami, Yee Whye Teh. ArXiv, 2019. pdf  bibtex 
Collaborative Filtering with Side Information: a Gaussian Process Perspective
Abstract: We tackle the problem of collaborative filtering (CF) with side information, through the lens of Gaussian Process (GP) regression. Driven by the idea of using the kernel to explicitly model user-item similarities, we formulate the GP in a way that allows the incorporation of low-rank matrix factorisation, arriving at our model, the Tucker Gaussian Process (TGP). Consequently, TGP generalises classical Bayesian matrix factorisation models, and goes beyond them to give a natural and elegant method for incorporating side information, giving enhanced predictive performance for CF problems. Moreover, we show that it is a novel model for regression, especially well-suited to grid-structured data and problems where the dependence on covariates is close to being separable. Hyunjik Kim, Xiaoyu Lu, Seth Flaxman, Yee Whye Teh. ArXiv, 2016. pdf  bibtex 
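A hypothetical illustration of the kind of kernel the abstract gestures at (this is not the paper's Tucker GP construction; the function names and the additive combination are mine): low-rank user/item embeddings induce a matrix-factorisation-style similarity over (user, item) pairs, and side information enters through extra covariate kernels.

```python
# Illustrative only: a valid kernel over (user, item) pairs combining a
# low-rank matrix-factorisation part with side-information similarities.
import numpy as np

def rbf(a, b, ls=1.0):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

def cf_kernel(U, V, user_feats, item_feats, users_a, items_a, users_b, items_b):
    # Low-rank part: inner products of user/item embeddings, the kind of
    # similarity implicit in Bayesian matrix factorisation.
    k_lowrank = (U[users_a] @ U[users_b].T) * (V[items_a] @ V[items_b].T)
    # Side-information part: similarity of user/item covariates.
    k_side = rbf(user_feats[users_a], user_feats[users_b]) * \
             rbf(item_feats[items_a], item_feats[items_b])
    return k_lowrank + k_side

# Toy usage with random embeddings and features.
U, V = np.random.randn(100, 5), np.random.randn(50, 5)
uf, itf = np.random.randn(100, 3), np.random.randn(50, 4)
idx = np.arange(10)
K = cf_kernel(U, V, uf, itf, idx, idx, idx, idx)   # 10x10 Gram block
```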
Talks
Topics on Attention in Deep Learning
Abstract: Attention mechanisms are being widely used in state-of-the-art deep learning models across various data modalities. In this talk, we explore the concept of attention or self-attention from two perspectives: 1. a methodological point of view and 2. a theoretical point of view. For 1, we study the Attentive Neural Process (ANP), which incorporates attention into the recently introduced Neural Process (NP), a deep neural network that learns a stochastic process, with applications in meta-learning. We show the role of attention in ANPs that allows them to address some fundamental drawbacks of NPs. For 2, we investigate the Lipschitz constant of self-attention, which measures how much the output of self-attention can change with respect to a change in its inputs. We thus theoretically demonstrate how self-attention differs from standard neural network architectures such as fully connected networks and convolutional networks. Venue: Summer AI Seminar Series @POSTECH, 06/08/20. slides 
Attention: the Analogue of Kernels in Deep Learning
Abstract: There have been many recent works that lie at the intersection of kernel methods and deep learning, namely Deep Kernel Learning, Deep Gaussian Processes (GPs) and Convolutional GPs. However, such works are often motivated by borrowing ideas that originate from deep learning and incorporating them into kernel methods. In this talk, we will explore the concept of attention or self-attention, which has interestingly travelled the opposite path: it is inherently motivated by kernels, but is being used extensively in state-of-the-art deep learning models across various data modalities. We investigate attention in more detail by studying the Attentive Neural Process (ANP), which incorporates attention into the recently introduced Neural Process (NP), a deep model that learns a stochastic process. We show that ANPs address some fundamental drawbacks of NPs by bringing them closer to GPs, while maintaining the benefits of neural networks such as scalability and flexibility. Venue: Recent Developments in Kernel Methods workshop @Gatsby Computational Neuroscience Unit, UCL, 27/09/19. slides 
Interpretable Models in Probabilistic Deep Learning
Abstract: As Deep Learning (DL) solutions to real-world problems are becoming increasingly common, DL researchers are striving to better understand the models that they develop. The community has been using the term ‘interpretability’ to describe models and methods that help us achieve this rather vague goal. However, many claim that deep models are inherently uninterpretable due to their black-box nature, and stop paying attention to interpretability in deep models on these grounds. In this talk, we show that ‘deep’ and ‘interpretability’ are not mutually exclusive terms, hence it is both possible and necessary to devise interpretable deep models. We first clarify what is meant by the term ‘interpretability’, by listing its desiderata and properties. We then introduce examples of deep probabilistic models that enjoy various properties of interpretability: the talk will cover FactorVAE, a model for learning disentangled representations, and the Attentive Neural Process, a model for learning stochastic processes in a data-driven fashion, focusing on their applications to image data. Venues: Korea Institute of Science and Technology (KIST), Center for Imaging Media Research, 03/04/19. Naver Labs, 04/04/19. Seoul National University, Computer Vision Lab, 05/04/19. slides 
Public Engagement
Introducing Machine Learning to the Public
I helped create a cute two-minute animation that introduces machine learning to the general public, along with friends at Oxford.
Check it out below!
