Bayesian and Neural Systems (BayesWatch)
Machine learning research group in Edinburgh

Publications

  • Self-Supervised Relational Reasoning for Representation Learning

    To appear in Advances in Neural Information Processing Systems (NeurIPS)

    Massimiliano Patacchiola, Amos Storkey
    In self-supervised learning, a system is tasked with achieving a surrogate objective by defining alternative targets on a set of unlabeled data. The aim is to build useful representations that can be used in downstream tasks, without costly manual annotation. In this work, we propose a novel self-supervised formulation of relational reasoning that allows a learner to bootstrap a signal from information implicit in unlabeled data. Training a relation head to discriminate how entities relate to themselves (intra-reasoning) and other entities (inter-reasoning) results in rich and descriptive representations in the underlying neural network backbone, which can be used in downstream tasks such as classification and image retrieval. We evaluate the proposed method following a rigorous experimental procedure, using standard datasets, protocols, and backbones. Self-supervised relational reasoning outperforms the best competitor in all conditions by an average 14% in accuracy, and the most recent state-of-the-art model by 3%. We link the effectiveness of the method to the maximization of a Bernoulli log-likelihood, which can be considered as a proxy for maximizing the mutual information, resulting in a more efficient objective with respect to the commonly used contrastive losses.
    @inproceedings{Patacchiola2020Self-Supervised,
    author = {Massimiliano Patacchiola and Amos Storkey},
    title = {Self-Supervised Relational Reasoning for Representation Learning},
    year = {2020},
    month = {Dec},
    booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
    note = {To appear},
    url = {https://arxiv.org/abs/2006.05849},
    }
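
    A minimal sketch of the kind of relational objective described in the abstract, written here for illustration only (the relation head, hidden size, and pairing scheme are assumptions, not the authors' released implementation): a backbone embeds two augmentations of each image, and a relation head is trained with a Bernoulli (binary cross-entropy) objective to score pairs as coming from the same image (intra-reasoning) or from different images (inter-reasoning).

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class RelationalSSL(nn.Module):
        def __init__(self, backbone, feat_dim):
            super().__init__()
            self.backbone = backbone                                 # any CNN returning feat_dim features
            self.relation_head = nn.Sequential(                      # scores a concatenated pair of embeddings
                nn.Linear(2 * feat_dim, 256), nn.ReLU(), nn.Linear(256, 1))

        def forward(self, view1, view2):
            z1, z2 = self.backbone(view1), self.backbone(view2)      # two augmentations of the same batch
            n = z1.size(0)
            pos = torch.cat([z1, z2], dim=1)                         # same-image pairs, label 1
            neg = torch.cat([z1, z2.roll(shifts=1, dims=0)], dim=1)  # different-image pairs, label 0
            logits = self.relation_head(torch.cat([pos, neg], dim=0)).squeeze(1)
            targets = torch.cat([torch.ones(n), torch.zeros(n)]).to(logits.device)
            return F.binary_cross_entropy_with_logits(logits, targets)

    Maximising this Bernoulli log-likelihood over many such pairs is what the abstract links to maximising mutual information; after training, only the backbone is kept for downstream classification or retrieval.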
  • Bayesian Meta-Learning for the Few-Shot Setting via Deep Kernels

    To appear in Advances in Neural Information Processing Systems (NeurIPS)

    Massimiliano Patacchiola, Jack Turner, Elliot J. Crowley, Michael O'Boyle, Amos Storkey
    Recently, different machine learning methods have been introduced to tackle the challenging few-shot learning scenario, that is, learning from a small labeled dataset related to a specific task. Common approaches have taken the form of meta-learning: learning to learn on the new problem given the old. Following the recognition that meta-learning is implementing learning in a multi-level model, we present a Bayesian treatment for the meta-learning inner loop through the use of deep kernels. As a result we can learn a kernel that transfers to new tasks; we call this Deep Kernel Transfer (DKT). This approach has many advantages: it is straightforward to implement as a single optimizer, provides uncertainty quantification, and does not require estimation of task-specific parameters. We empirically demonstrate that DKT outperforms several state-of-the-art algorithms in few-shot classification, and is the state of the art for cross-domain adaptation and regression. We conclude that complex meta-learning routines can be replaced by a simpler Bayesian model without loss of accuracy.
    @inproceedings{Patacchiola2020Bayesian,
    author = {Massimiliano Patacchiola and Jack Turner and Elliot J. Crowley and Michael O'Boyle and Amos Storkey},
    title = {Bayesian Meta-Learning for the Few-Shot Setting via Deep Kernels},
    year = {2020},
    month = {Dec},
    booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
    note = {To appear},
    url = {https://arxiv.org/abs/1910.05199},
    }
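
    As a rough illustration of the approach in the abstract (my own simplified sketch, not the released DKT code), the core component is a Gaussian process whose kernel acts on features produced by a neural network; all parameters are trained with a single optimizer by maximising the GP marginal likelihood over sampled tasks, so no task-specific parameters need to be estimated at test time.

    import torch
    import torch.nn as nn

    def rbf_kernel(a, b, lengthscale):
        return torch.exp(-0.5 * torch.cdist(a, b).pow(2) / lengthscale.pow(2))

    class DeepKernelGP(nn.Module):
        def __init__(self, in_dim, feat_dim=16):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, feat_dim))
            self.log_lengthscale = nn.Parameter(torch.zeros(()))
            self.log_noise = nn.Parameter(torch.zeros(()))

        def task_nll(self, x, y):
            # negative GP log marginal likelihood of one regression task; x: [n, d], y: [n]
            z = self.net(x)
            K = rbf_kernel(z, z, self.log_lengthscale.exp()) + self.log_noise.exp() * torch.eye(len(x))
            gp = torch.distributions.MultivariateNormal(torch.zeros(len(x)), covariance_matrix=K)
            return -gp.log_prob(y)

    # Meta-training (sketch): repeatedly sample a task (x, y) and take one optimizer step on task_nll(x, y).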
  • Latent Adversarial Debiasing: Mitigating Collider Bias in Deep Neural Networks

    Luke N. Darlow, Stanisław Jastrzębski, Amos Storkey
    Collider bias is a harmful form of sample selection bias that neural networks are ill-equipped to handle. This bias manifests itself when the underlying causal signal is strongly correlated with other confounding signals due to the training data collection procedure. In the situation where the confounding signal is easy-to-learn, deep neural networks will latch onto this and the resulting model will generalise poorly to in-the-wild test scenarios. We argue herein that the cause of failure is a combination of the deep structure of neural networks and the greedy gradient-driven learning process used - one that prefers easy-to-compute signals when available. We show it is possible to mitigate against this by generating bias-decoupled training data using latent adversarial debiasing (LAD), even when the confounding signal is present in 100% of the training data. By training neural networks on these adversarial examples, we can improve their generalisation in collider bias settings. Experiments show state-of-the-art performance of LAD in label-free debiasing with gains of 76.12% on background coloured MNIST, 35.47% on foreground coloured MNIST, and 8.27% on corrupted CIFAR-10.
    @unpublished{Darlow2020Latent,
    author = {Luke N. Darlow and Stanisław Jastrzębski and Amos Storkey},
    title = {Latent Adversarial Debiasing: Mitigating Collider Bias in Deep Neural Networks},
    year = {2020},
    month = {Nov},
    institution = {School of Informatics, University of Edinburgh},
    url = {https://arxiv.org/abs/2011.11486},
    }
  • Non-greedy Gradient-based Hyperparameter Optimization Over Long Horizons

    Paul Micaelli, Amos Storkey
    Gradient-based hyperparameter optimization is an attractive way to perform meta-learning across a distribution of tasks, or improve the performance of an optimizer on a single task. However, this approach has been unpopular for tasks requiring long horizons (many gradient steps), due to memory scaling and gradient degradation issues. A common workaround is to learn hyperparameters online or split the horizon into smaller chunks. However, this introduces greediness which comes with a large performance drop, since the best local hyperparameters can make for poor global solutions. In this work, we enable non-greediness over long horizons with a two-fold solution. First, we share hyperparameters that are contiguous in time, and show that this drastically mitigates gradient degradation issues. Then, we derive a forward-mode differentiation algorithm for the popular momentum-based SGD optimizer, which allows for a memory cost that is constant with horizon size. When put together, these solutions allow us to learn hyperparameters without any prior knowledge. Compared to the baseline of hand-tuned off-the-shelf hyperparameters, our method compares favorably on simple datasets like SVHN. On CIFAR-10 we match the baseline performance, and demonstrate for the first time that learning rate, momentum and weight decay schedules can be learned with gradients on a dataset of this size. Code is available at: https://github.com/polo5/NonGreedyGradientHPO
    @unpublished{Micaelli2020Non-greedy,
    author = {Paul Micaelli and Amos Storkey},
    title = {Non-greedy Gradient-based Hyperparameter Optimization Over Long Horizons},
    year = {2020},
    month = {Jul},
    institution = {School of Informatics, University of Edinburgh},
    url = {https://arxiv.org/abs/2007.07869},
    }
  • Optimizing Grouped Convolutions on Edge Devices

    International Conference on Application-specific Systems, Architectures and Processors (ASAP)

    Perry Gibson, José Cano, Jack Turner, Elliot J. Crowley, Michael O'Boyle, Amos Storkey
    When deploying a deep neural network on constrained hardware, it is possible to replace the network's standard convolutions with grouped convolutions. This allows for substantial memory savings with minimal loss of accuracy. However, current implementations of grouped convolutions in modern deep learning frameworks are far from performing optimally in terms of speed. In this paper we propose Grouped Spatial Pack Convolutions (GSPC), a new implementation of grouped convolutions that outperforms existing solutions. We implement GSPC in TVM, which provides state-of-the-art performance on edge devices. We analyze a set of networks utilizing different types of grouped convolutions and evaluate their performance in terms of inference time on several edge devices. We observe that our new implementation scales well with the number of groups and provides the best inference times in all settings, improving the existing implementations of grouped convolutions in TVM, PyTorch and TensorFlow Lite by 3.4x, 8x and 4x on average respectively. Code is available at https://github.com/gecLAB/tvm-GSPC/.
    @inproceedings{Gibson2020Optimizing,
    author = {Perry Gibson and José Cano and Jack Turner and Elliot J. Crowley and Michael O'Boyle and Amos Storkey},
    title = {Optimizing Grouped Convolutions on Edge Devices},
    year = {2020},
    month = {Jul},
    booktitle = {International Conference on Application-specific Systems, Architectures and Processors (ASAP)},
    url = {https://arxiv.org/abs/2006.09791},
    }
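
    For reference, the operation being optimised looks as follows in a high-level framework; this snippet only illustrates why grouping shrinks the weight count and is unrelated to the GSPC/TVM implementation itself.

    import torch
    import torch.nn as nn

    x = torch.randn(1, 64, 32, 32)
    standard = nn.Conv2d(64, 64, kernel_size=3, padding=1)            # 64 * 64 * 3 * 3 weights
    grouped = nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=8)   # each group mixes only 64 / 8 = 8 channels

    print(sum(p.numel() for p in standard.parameters()))   # 36928 parameters (including bias)
    print(sum(p.numel() for p in grouped.parameters()))    # 4672 parameters
    print(grouped(x).shape)                                # torch.Size([1, 64, 32, 32])

    The paper's observation is that this parameter reduction does not automatically translate into speed, because common library kernels are tuned for ungrouped shapes; that gap is what GSPC addresses.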
  • Constraint-Based Regularization of Neural Networks

    Benedict Leimkuhler, Timothée Pouchon, Tiffany Vlaar, Amos Storkey
    We propose a method for efficiently incorporating constraints into a stochastic gradient Langevin framework for the training of deep neural networks. Constraints allow direct control of the parameter space of the model. Appropriately designed, they reduce the vanishing/exploding gradient problem, control weight magnitudes and stabilize deep neural networks and thus improve the robustness of training algorithms and the generalization capabilities of the trained neural network. We present examples of constrained training methods motivated by orthogonality preservation for weight matrices and explicit weight normalizations. We describe the methods in the overdamped formulation of Langevin dynamics and the underdamped form, in which momenta help to improve sampling efficiency. The methods are explored in test examples in image classification and natural language processing.
    @unpublished{Leimkuhler2020Constraint-Based,
    author = {Benedict Leimkuhler and Timothée Pouchon and Tiffany Vlaar and Amos Storkey},
    title = {Constraint-Based Regularization of Neural Networks},
    year = {2020},
    month = {Jun},
    institution = {School of Informatics, University of Edinburgh},
    url = {https://arxiv.org/abs/2006.10114},
    }
  • Neural Architecture Search without Training

    Joseph Mellor, Jack Turner, Amos Storkey, Elliot J. Crowley
    The time and effort involved in hand-designing deep neural networks is immense. This has prompted the development of Neural Architecture Search (NAS) techniques to automate this design. However, NAS algorithms tend to be extremely slow and expensive; they need to train vast numbers of candidate networks to inform the search process. This could be remedied if we could infer a network's trained accuracy from its initial state. In this work, we examine how the linear maps induced by data points correlate for untrained network architectures in the NAS-Bench-201 search space, and motivate how this can be used to give a measure of modelling flexibility which is highly indicative of a network's trained performance. We incorporate this measure into a simple algorithm that allows us to search for powerful networks without any training in a matter of seconds on a single GPU. Code to reproduce our experiments is available at https://github.com/BayesWatch/nas-without-training.
    @unpublished{Mellor2020Neural,
    author = {Joseph Mellor and Jack Turner and Amos Storkey and Elliot J. Crowley},
    title = {Neural Architecture Search without Training},
    year = {2020},
    month = {Jun},
    institution = {School of Informatics, University of Edinburgh},
    url = {https://arxiv.org/abs/2006.04647},
    }
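
    Below is a hedged sketch of the kind of training-free signal the abstract refers to: how strongly the locally linear maps (input Jacobians) of an untrained network overlap across a small batch, with less overlap read as greater modelling flexibility. This is an illustrative reconstruction under my own assumptions; see the linked repository for the actual scoring code.

    import torch
    import torch.nn as nn

    def jacobian_correlation_score(net, x):
        x = x.clone().requires_grad_(True)
        y = net(x)
        y.backward(torch.ones_like(y))                # gradient of the summed outputs w.r.t. each input
        J = x.grad.flatten(1)                         # one flattened local linear map per data point
        J = (J - J.mean(1, keepdim=True)) / (J.std(1, keepdim=True) + 1e-8)
        corr = J @ J.t() / J.size(1)                  # correlations between data points
        eigvals = torch.linalg.eigvalsh(corr)
        return -torch.sum(torch.log(eigvals + 1e-5) + 1.0 / (eigvals + 1e-5)).item()

    net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU(), nn.Linear(128, 10))
    print(jacobian_correlation_score(net, torch.randn(32, 3, 32, 32)))   # higher score, no training required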
  • Defining Benchmarks for Continual Few-Shot Learning

    Antreas Antoniou, Massimiliano Patacchiola, Mateusz Ochal, Amos Storkey
    Both few-shot and continual learning have seen substantial progress in the last years due to the introduction of proper benchmarks. That being said, the field has still to frame a suite of benchmarks for the highly desirable setting of continual few-shot learning, where the learner is presented with a number of few-shot tasks, one after the other, and then asked to perform well on a validation set stemming from all previously seen tasks. Continual few-shot learning has a small computational footprint and is thus an excellent setting for efficient investigation and experimentation. In this paper we first define a theoretical framework for continual few-shot learning, taking into account recent literature, then we propose a range of flexible benchmarks that unify the evaluation criteria and allow exploring the problem from multiple perspectives. As part of the benchmark, we introduce a compact variant of ImageNet, called SlimageNet64, which retains all original 1000 classes but only contains 200 instances of each one (a total of 200K data-points) downscaled to 64 x 64 pixels. We provide baselines for the proposed benchmarks using a number of popular few-shot learning algorithms, thereby exposing previously unknown strengths and weaknesses of those algorithms in continual and data-limited settings.
    @unpublished{Antoniou2020Defining,
    author = {Antreas Antoniou and Massimiliano Patacchiola and Mateusz Ochal and Amos Storkey},
    title = {Defining Benchmarks for Continual Few-Shot Learning},
    year = {2020},
    month = {Apr},
    institution = {School of Informatics, University of Edinburgh},
    url = {https://arxiv.org/abs/2004.11967},
    }
  • Meta-Learning in Neural Networks: A Survey

    Timothy Hospedales, Antreas Antoniou, Paul Micaelli, Amos Storkey
    The field of meta-learning, or learning-to-learn, has seen a dramatic rise in interest in recent years. Contrary to conventional approaches to AI where a given task is solved from scratch using a fixed learning algorithm, meta-learning aims to improve the learning algorithm itself, given the experience of multiple learning episodes. This paradigm provides an opportunity to tackle many of the conventional challenges of deep learning, including data and computation bottlenecks, as well as the fundamental issue of generalization. In this survey we describe the contemporary meta-learning landscape. We first discuss definitions of meta-learning and position it with respect to related fields, such as transfer learning, multi-task learning, and hyperparameter optimization. We then propose a new taxonomy that provides a more comprehensive breakdown of the space of meta-learning methods today. We survey promising applications and successes of meta-learning including few-shot learning, reinforcement learning and architecture search. Finally, we discuss outstanding challenges and promising areas for future research.
    @unpublished{Hospedales2020Meta-Learning,
    author = {Timothy Hospedales and Antreas Antoniou and Paul Micaelli and Amos Storkey},
    title = {Meta-Learning in Neural Networks: A Survey},
    year = {2020},
    month = {Apr},
    institution = {School of Informatics, University of Edinburgh},
    url = {https://arxiv.org/abs/2004.05439},
    }
  • Comparing Recurrent and Convolutional Neural Networks for Predicting Wave Propagation

    Workshop on Deep Learning and Differential Equations, ICLR

    Stathi Fotiadis, Eduardo Pignatelli, Mario Lino Valencia, Chris Cantwell, Amos Storkey, Anil A. Bharath
    Dynamical systems can be modelled by partial differential equations, and numerical computations of such systems are used everywhere in science and engineering. In this work, we investigate the performance of recurrent and convolutional deep neural network architectures in predicting surface waves. The system is governed by the Saint-Venant equations. We improve on the long-term prediction over previous methods while keeping the inference time at a fraction of numerical simulations. We also show that convolutional networks perform at least as well as recurrent networks in this task. Finally, we assess the generalisation capability of each network by extrapolating in longer time-frames and in different physical settings.
    @inproceedings{Fotiadis2020Comparing,
    author = {Stathi Fotiadis and Eduardo Pignatelli and Mario Lino Valencia and Chris Cantwell and Amos Storkey and Anil A. Bharath},
    title = {Comparing Recurrent and Convolutional Neural Networks for Predicting Wave Propagation},
    year = {2020},
    month = {Apr},
    booktitle = {Workshop on Deep Learning and Differential Equations, ICLR},
    url = {https://arxiv.org/abs/2002.08981},
    }
  • BlockSwap: Fisher-guided Block Substitution for Network Compression on a Budget

    International Conference on Learning Representations (ICLR)

    Jack Turner, Elliot J. Crowley, Michael O'Boyle, Amos Storkey, Gavin Gray
    The desire to map neural networks to varying-capacity devices has led to the development of a wealth of compression techniques, many of which involve replacing standard convolutional blocks in a large network with cheap alternative blocks. However, not all blocks are created equally; for a required compute budget there may exist a potent combination of many different cheap blocks, though exhaustively searching for such a combination is prohibitively expensive. In this work, we develop BlockSwap: a fast algorithm for choosing networks with interleaved block types by passing a single minibatch of training data through randomly initialised networks and gauging their Fisher potential. These networks can then be used as students and distilled with the original large network as a teacher. We demonstrate the effectiveness of the chosen networks across CIFAR-10 and ImageNet for classification, and COCO for detection, and provide a comprehensive ablation study of our approach. BlockSwap quickly explores possible block configurations using a simple architecture ranking system, yielding highly competitive networks in orders of magnitude less time than most architecture search techniques (e.g. 8 minutes on a single CPU for CIFAR-10).
    @inproceedings{Turner2020BlockSwap,
    author = {Jack Turner and Elliot J. Crowley and Michael O'Boyle and Amos Storkey and Gavin Gray},
    title = {{BlockSwap}: {F}isher-guided Block Substitution for Network Compression on a Budget},
    year = {2020},
    month = {Apr},
    booktitle = {International Conference on Learning Representations (ICLR)},
    url = {https://arxiv.org/abs/1906.04113},
    }
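
    A brief sketch of the single-minibatch Fisher signal mentioned above, with the exact aggregation chosen here as an assumption: pass one batch through a randomly initialised candidate and use the squared activation-times-gradient terms at a block's output as a cheap saliency for ranking block configurations.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def fisher_potential(block_output, loss):
        grad, = torch.autograd.grad(loss, block_output, retain_graph=True)
        per_sample = (block_output * grad).sum(dim=(2, 3))   # sum over spatial positions, per channel
        return per_sample.pow(2).sum()                       # aggregate over batch and channels

    conv, head = nn.Conv2d(3, 16, 3, padding=1), nn.Linear(16, 10)
    x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
    act = torch.relu(conv(x))                                # output of the candidate block
    logits = head(act.mean(dim=(2, 3)))                      # tiny stand-in for the rest of the network
    print(fisher_potential(act, F.cross_entropy(logits, y)).item())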
  • DHOG: Deep Hierarchical Object Grouping

    Luke N. Darlow, Amos Storkey
    Recently, a number of competitive methods have tackled unsupervised representation learning by maximising the mutual information between the representations produced from augmentations. The resulting representations are then invariant to stochastic augmentation strategies, and can be used for downstream tasks such as clustering or classification. Yet data augmentations preserve many properties of an image and so there is potential for a suboptimal choice of representation that relies on matching easy-to-find features in the data. We demonstrate that greedy or local methods of maximising mutual information (such as stochastic gradient optimisation) discover local optima of the mutual information criterion; the resulting representations are also less-ideally suited to complex downstream tasks. Earlier work has not specifically identified or addressed this issue. We introduce deep hierarchical object grouping (DHOG) that computes a number of distinct discrete representations of images in a hierarchical order, eventually generating representations that better optimise the mutual information objective. We also find that these representations align better with the downstream task of grouping into underlying object classes. We tested DHOG on unsupervised clustering, which is a natural downstream test as the target representation is a discrete labelling of the data. We achieved new state-of-the-art results on the three main benchmarks without any prefiltering or Sobel-edge detection that proved necessary for many previous methods to work. We obtain accuracy improvements of: 4.3% on CIFAR-10, 1.5% on CIFAR-100-20, and 7.2% on SVHN.
    @unpublished{Darlow2020DHOG,
    author = {Luke N. Darlow and Amos Storkey},
    title = {{DHOG}: Deep Hierarchical Object Grouping},
    year = {2020},
    month = {Mar},
    institution = {School of Informatics, University of Edinburgh},
    url = {https://arxiv.org/abs/2003.08821},
    }
  • Learning to Learn via Self-Critique

    Advances in Neural Information Processing Systems (NeurIPS)

    Antreas Antoniou, Amos Storkey
    In few-shot learning, a machine learning system learns from a small set of labelled examples relating to a specific task, such that it can generalize to new examples of the same task. Given the limited availability of labelled examples in such tasks, we wish to make use of all the information we can. Usually a model learns task-specific information from a small training-set (support-set) to predict on an unlabelled validation set (target-set). The target-set contains additional task-specific information which is not utilized by existing few-shot learning methods. Making use of the target-set examples via transductive learning requires approaches beyond the current methods; at inference time, the target-set contains only unlabelled input data-points, and so discriminative learning cannot be used. In this paper, we propose a framework called Self-Critique and Adapt (SCA), which learns to learn a label-free loss function, parameterized as a neural network. A base-model learns on a support-set using existing methods (e.g. stochastic gradient descent combined with the cross-entropy loss), and then is updated for the incoming target-task using the learnt loss function. This label-free loss function is itself optimized such that the learnt model achieves higher generalization performance. Experiments demonstrate that SCA offers substantially reduced error-rates compared to baselines which only adapt on the support-set, and results in state of the art benchmark performance on Mini-ImageNet and Caltech-UCSD Birds 200.
    @inproceedings{Antoniou2019Learning,
    author = {Antreas Antoniou and Amos Storkey},
    title = {Learning to Learn via Self-Critique},
    year = {2019},
    month = {Dec},
    booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
    url = {https://arxiv.org/abs/1905.10295},
    }
  • Zero-shot Knowledge Transfer via Adversarial Belief Matching

    Advances in Neural Information Processing Systems (NeurIPS)

    Paul Micaelli, Amos Storkey
    Performing knowledge transfer from a large teacher network to a smaller student is a popular task in modern deep learning applications. However, due to growing dataset sizes and stricter privacy regulations, it is increasingly common not to have access to the data that was used to train the teacher. We propose a novel method which trains a student to match the predictions of its teacher without using any data or metadata. We achieve this by training an adversarial generator to search for images on which the student poorly matches the teacher, and then using them to train the student. Our resulting student closely approximates its teacher for simple datasets like SVHN, and on CIFAR10 we improve on the state-of-the-art for few-shot distillation (with 100 images per class), despite using no data. Finally, we also propose a metric to quantify the degree of belief matching between teacher and student in the vicinity of decision boundaries, and observe a significantly higher match between our zero-shot student and the teacher, than between a student distilled with real data and the teacher.
    @inproceedings{Micaelli2019Zero-shot,
    author = {Paul Micaelli and Amos Storkey},
    title = {Zero-shot Knowledge Transfer via Adversarial Belief Matching},
    year = {2019},
    month = {Dec},
    booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
    url = {https://arxiv.org/abs/1905.09768},
    }
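
    A compact sketch of the adversarial signal described above; the architectures, optimizers, and update schedule are placeholders rather than the paper's settings. The generator is pushed towards pseudo-images on which student and teacher disagree, and the student is then trained to agree with the teacher on those same images.

    import torch
    import torch.nn.functional as F

    def kl_teacher_student(teacher_logits, student_logits):
        return F.kl_div(F.log_softmax(student_logits, dim=1),
                        F.softmax(teacher_logits, dim=1), reduction="batchmean")

    def zero_shot_step(generator, teacher, student, opt_g, opt_s, z_dim=128, batch=64):
        # the teacher is assumed frozen (eval mode, no optimizer of its own)
        z = torch.randn(batch, z_dim)
        opt_g.zero_grad()
        x = generator(z)
        (-kl_teacher_student(teacher(x), student(x))).backward()          # generator maximises disagreement
        opt_g.step()
        opt_s.zero_grad()
        x = generator(z).detach()
        kl_teacher_student(teacher(x).detach(), student(x)).backward()    # student minimises it
        opt_s.step()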
  • Performance Aware Convolutional Neural Network Channel Pruning for Embedded GPUs

    International Symposium on Workload Characterization (IISWC)

    Valentin Radu, Kuba Kaszyk, Yuan Wen, Jack Turner, José Cano, Elliot J. Crowley, Björn Franke, Amos Storkey, Michael O'Boyle
    Convolutional Neural Networks (CNN) are becoming a common presence in many applications and services, due to their superior recognition accuracy. They are increasingly being used on mobile devices, many times just by porting large models designed for server space, although several model compression techniques have been considered. One model compression technique intended to reduce computations is channel pruning. Mobile and embedded systems now have GPUs which are ideal for the parallel computations of neural networks and for their lower energy cost per operation. Specialized libraries perform these neural network computations through highly optimized routines. As we find in our experiments, these libraries are optimized for the most common network shapes, making uninstructed channel pruning inefficient. We evaluate higher level libraries, which analyze the input characteristics of a convolutional layer, based on which they produce optimized OpenCL (Arm Compute Library and TVM) and CUDA (cuDNN) code. However, in reality, these characteristics and subsequent choices intended for optimization can have the opposite effect. We show that a reduction in the number of convolutional channels, pruning 12% of the initial size, is in some cases detrimental to performance, leading to 2x slowdown. On the other hand, we also find examples where performance-aware pruning achieves the intended results, with performance speedups of 3x with cuDNN and above 10x with Arm Compute Library and TVM. Our findings expose the need for hardware-instructed neural network pruning.
    @inproceedings{Radu2019Performance,
    author = {Valentin Radu and Kuba Kaszyk and Yuan Wen and Jack Turner and José Cano and Elliot J. Crowley and Björn Franke and Amos Storkey and Michael O'Boyle},
    title = {Performance Aware Convolutional Neural Network Channel Pruning for Embedded {GPU}s},
    year = {2019},
    month = {Nov},
    booktitle = {International Symposium on Workload Characterization (IISWC)},
    url = {https://arxiv.org/abs/2002.08697},
    }
  • Separable Layers Enable Structured Efficient Linear Substitutions

    Gavin Gray, Elliot J. Crowley, Amos Storkey
    In response to the development of recent efficient dense layers, this paper shows that something as simple as replacing linear components in pointwise convolutions with structured linear decompositions also produces substantial gains in the efficiency/accuracy tradeoff. Pointwise convolutions are fully connected layers and are thus prepared for replacement by structured transforms. Networks using such layers are able to learn the same tasks as those using standard convolutions, and provide Pareto-optimal benefits in efficiency/accuracy, both in terms of computation (mult-adds) and parameter count (and hence memory).
    @unpublished{Gray2019Separable,
    author = {Gavin Gray and Elliot J. Crowley and Amos Storkey},
    title = {Separable Layers Enable Structured Efficient Linear Substitutions},
    year = {2019},
    month = {Jun},
    institution = {School of Informatics, University of Edinburgh},
    url = {https://arxiv.org/abs/1906.00859},
    }
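
    As a concrete but deliberately simplified illustration of the substitution discussed above: a pointwise 1x1 convolution is a fully connected map over channels, so it can be swapped for a structured decomposition. The low-rank factorisation below is my own stand-in chosen for brevity; the paper itself uses other structured transforms.

    import torch.nn as nn

    def low_rank_pointwise(in_ch, out_ch, rank):
        # two 1x1 convolutions through a narrow bottleneck replace one dense channel mixer
        return nn.Sequential(nn.Conv2d(in_ch, rank, kernel_size=1, bias=False),
                             nn.Conv2d(rank, out_ch, kernel_size=1))

    dense = nn.Conv2d(256, 256, kernel_size=1)        # 256 * 256 weights + bias = 65,792 parameters
    cheap = low_rank_pointwise(256, 256, rank=32)     # (256 * 32) + (32 * 256 + 256) = 16,640 parameters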
  • Exploration by Random Network Distillation

    International Conference on Learning Representations (ICLR)

    Yuri Burda, Harrison Edwards, Amos Storkey, Oleg Klimov
    We introduce an exploration bonus for deep reinforcement learning methods that is easy to implement and adds minimal overhead to the computation performed. The bonus is the error of a neural network predicting features of the observations given by a fixed randomly initialized neural network. We also introduce a method to flexibly combine intrinsic and extrinsic rewards. We find that the random network distillation (RND) bonus combined with this increased flexibility enables significant progress on several hard exploration Atari games. In particular we establish state of the art performance on Montezuma's Revenge, a game famously difficult for deep reinforcement learning methods. To the best of our knowledge, this is the first method that achieves better than average human performance on this game without using demonstrations or having access to the underlying state of the game, and occasionally completes the first level.
    @inproceedings{Burda2019Exploration,
    author = {Yuri Burda and Harrison Edwards and Amos Storkey and Oleg Klimov},
    title = {Exploration by Random Network Distillation},
    year = {2019},
    month = {May},
    booktitle = {International Conference on Learning Representations (ICLR)},
    url = {https://arxiv.org/abs/1810.12894},
    }
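
    A minimal sketch of the bonus described above, with network sizes as illustrative assumptions: the intrinsic reward for an observation is the prediction error of a trained predictor against a fixed, randomly initialised target network, so frequently visited states earn a shrinking bonus as the predictor catches up.

    import torch
    import torch.nn as nn

    class RNDBonus(nn.Module):
        def __init__(self, obs_dim, feat_dim=64):
            super().__init__()
            self.target = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))
            self.predictor = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))
            for p in self.target.parameters():        # the target stays fixed and random
                p.requires_grad_(False)

        def forward(self, obs):
            error = (self.predictor(obs) - self.target(obs)).pow(2).mean(dim=1)
            return error                              # per-observation intrinsic reward; also the predictor's training loss

    The paper additionally normalises observations and intrinsic rewards and combines intrinsic and extrinsic returns through separate value heads; those details are omitted from this sketch.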
  • How to train your MAML

    International Conference on Learning Representations (ICLR)

    Antreas Antoniou, Harrison Edwards, Amos Storkey
    The field of few-shot learning has recently seen substantial advancements. Most of these advancements came from casting few-shot learning as a meta-learning problem. Model Agnostic Meta Learning or MAML is currently one of the best approaches for few-shot learning via meta-learning. MAML is simple, elegant and very powerful; however, it has a variety of issues, such as being very sensitive to neural network architectures, often leading to instability during training, requiring arduous hyperparameter searches to stabilize training and achieve high generalization, and being very computationally expensive at both training and inference times. In this paper, we propose various modifications to MAML that not only stabilize the system, but also substantially improve the generalization performance and convergence speed of MAML and reduce its computational overhead; we call the result MAML++.
    @inproceedings{Antoniou2019How,
    author = {Antreas Antoniou and Harrison Edwards and Amos Storkey},
    title = {How to train your {MAML}},
    year = {2019},
    month = {May},
    booktitle = {International Conference on Learning Representations (ICLR)},
    url = {https://arxiv.org/abs/1810.09502},
    }
  • Large-Scale Study of Curiosity-Driven Learning

    International Conference on Learning Representations (ICLR)

    Yuri Burda, Harrison Edwards, Deepak Pathak, Amos Storkey, Trevor Darrell, Alexei A. Efros
    Reinforcement learning algorithms rely on carefully engineering environment rewards that are extrinsic to the agent. However, annotating each environment with hand-designed, dense rewards is not scalable, motivating the need for developing reward functions that are intrinsic to the agent. Curiosity is a type of intrinsic reward function which uses prediction error as reward signal. In this paper: (a) We perform the first large-scale study of purely curiosity-driven learning, i.e. without any extrinsic rewards, across 54 standard benchmark environments, including the Atari game suite. Our results show surprisingly good performance, and a high degree of alignment between the intrinsic curiosity objective and the hand-designed extrinsic rewards of many game environments. (b) We investigate the effect of using different feature spaces for computing prediction error and show that random features are sufficient for many popular RL game benchmarks, but learned features appear to generalize better (e.g. to novel game levels in Super Mario Bros.). (c) We demonstrate limitations of the prediction-based rewards in stochastic setups.
    @inproceedings{Burda2019Large-Scale,
    author = {Yuri Burda and Harrison Edwards and Deepak Pathak and Amos Storkey and Trevor Darrell and Alexei A. Efros},
    title = {Large-Scale Study of Curiosity-Driven Learning},
    year = {2019},
    month = {May},
    booktitle = {International Conference on Learning Representations (ICLR)},
    url = {https://arxiv.org/abs/1808.04355},
    }
  • On the Relation Between the Sharpest Directions of DNN Loss and the SGD Step Length

    International Conference on Learning Representations (ICLR)

    Stanisław Jastrzębski, Zachary Kenton, Nicolas Ballas, Asja Fischer, Yoshua Bengio, Amos Storkey
    Recent work has identified that using a high learning rate or a small batch size for Stochastic Gradient Descent (SGD) based training of deep neural networks encourages finding flatter minima of the training loss towards the end of training. Moreover, measures of the flatness of minima have been shown to correlate with good generalization performance. Extending this previous work, we investigate the loss curvature through the Hessian eigenvalue spectrum in the early phase of training and find an analogous bias: even at the beginning of training, a high learning rate or small batch size influences SGD to visit flatter loss regions. In addition, the evolution of the largest eigenvalues appears to always follow a similar pattern, with a fast increase in the early phase, and a decrease or stabilization thereafter, where the peak value is determined by the learning rate and batch size. Finally, we find that by altering the learning rate just in the direction of the eigenvectors associated with the largest eigenvalues, SGD can be steered towards regions which are an order of magnitude sharper but correspond to models with similar generalization, which suggests the curvature of the endpoint found by SGD is not predictive of its generalization properties.
    @inproceedings{Jastrzębski2019On,
    author = {Stanisław Jastrzębski and Zachary Kenton and Nicolas Ballas and Asja Fischer and Yoshua Bengio and Amos Storkey},
    title = {On the Relation Between the Sharpest Directions of {DNN} Loss and the {SGD} Step Length},
    year = {2019},
    month = {May},
    booktitle = {International Conference on Learning Representations (ICLR)},
    url = {https://arxiv.org/abs/1807.05031},
    }
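
    The largest Hessian eigenvalues tracked in this work are typically estimated without ever forming the Hessian; the sketch below uses power iteration on Hessian-vector products, with a toy model and iteration count as placeholder assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def top_hessian_eigenvalue(model, loss, iters=20):
        params = [p for p in model.parameters() if p.requires_grad]
        grads = torch.autograd.grad(loss, params, create_graph=True)
        v = [torch.randn_like(p) for p in params]
        norm = torch.sqrt(sum(u.pow(2).sum() for u in v))
        v = [u / norm for u in v]
        eig = 0.0
        for _ in range(iters):
            hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)  # H @ v
            eig = sum((h * u).sum() for h, u in zip(hv, v)).item()                      # Rayleigh quotient (v is unit norm)
            norm = torch.sqrt(sum(h.pow(2).sum() for h in hv))
            v = [h / (norm + 1e-12) for h in hv]
        return eig

    model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
    x, y = torch.randn(64, 10), torch.randint(0, 2, (64,))
    print(top_hessian_eigenvalue(model, F.cross_entropy(model(x), y)))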
  • Distilling with Performance Enhanced Students

    Jack Turner, Elliot J. Crowley, Valentin Radu, José Cano, Amos Storkey, Michael O'Boyle
    The task of accelerating large neural networks on general purpose hardware has, in recent years, prompted the use of channel pruning to reduce network size. However, the efficacy of pruning based approaches has since been called into question. In this paper, we turn to distillation for model compression---specifically, attention transfer---and develop a simple method for discovering performance enhanced student networks. We combine channel saliency metrics with empirical observations of runtime performance to design more accurate networks for a given latency budget. We apply our methodology to residual and densely-connected networks, and show that we are able to find resource-efficient student networks on different hardware platforms while maintaining very high accuracy. These performance-enhanced student networks achieve up to 10% boosts in top-1 ImageNet accuracy over their channel-pruned counterparts for the same inference time.
    @unpublished{Turner2019Distilling,
    author = {Jack Turner and Elliot J. Crowley and Valentin Radu and José Cano and Amos Storkey and Michael O'Boyle},
    title = {Distilling with Performance Enhanced Students},
    year = {2019},
    month = {Mar},
    institution = {School of Informatics, University of Edinburgh},
    url = {https://arxiv.org/abs/1810.10460},
    }
  • Assume, Augment and Learn: Unsupervised Few-Shot Meta-Learning via Random Labels and Data Augmentation

    Antreas Antoniou, Amos Storkey
    The field of few-shot learning has been laboriously explored in the supervised setting, where per-class labels are available. On the other hand, the unsupervised few-shot learning setting, where no labels of any kind are required, has seen little investigation. We propose a method, named Assume, Augment and Learn or AAL, for generating few-shot tasks using unlabeled data. We randomly label a random subset of images from an unlabeled dataset to generate a support set. Then by applying data augmentation on the support set's images, and reusing the support set's labels, we obtain a target set. The resulting few-shot tasks can be used to train any standard meta-learning framework. Once trained, such a model, can be directly applied on small real-labeled datasets without any changes or fine-tuning required. In our experiments, the learned models achieve good generalization performance in a variety of established few-shot learning tasks on Omniglot and Mini-Imagenet.
    @unpublished{Antoniou2019Assume,
    author = {Antreas Antoniou and Amos Storkey},
    title = {Assume, Augment and Learn: Unsupervised Few-Shot Meta-Learning via Random Labels and Data Augmentation},
    year = {2019},
    month = {Feb},
    institution = {School of Informatics, University of Edinburgh},
    url = {https://arxiv.org/abs/1902.09884},
    }
  • What Information Does a ResNet Compress?

    Luke N. Darlow, Amos Storkey
    The information bottleneck principle (Shwartz-Ziv & Tishby, 2017) suggests that SGD-based training of deep neural networks results in optimally compressed hidden layers, from an information theoretic perspective. However, this claim was established on toy data. The goal of the work we present here is to test whether the information bottleneck principle is applicable to a realistic setting using a larger and deeper convolutional architecture, a ResNet model. We trained PixelCNN++ models as inverse representation decoders to measure the mutual information between hidden layers of a ResNet and input image data, when trained for (1) classification and (2) autoencoding. We find that two stages of learning happen for both training regimes, and that compression does occur, even for an autoencoder. Sampling images by conditioning on hidden layers' activations offers an intuitive visualisation to understand what a ResNet learns to forget.
    @unpublished{Darlow2019What,
    author = {Luke N. Darlow and Amos Storkey},
    title = {What Information Does a {R}es{N}et Compress?},
    year = {2019},
    month = {Jan},
    institution = {School of Informatics, University of Edinburgh},
    url = {https://arxiv.org/abs/2003.06254},
    }
  • Pruning Neural Networks: Is it Time to Nip It in the Bud?

    Workshop on Compact Deep Neural Networks with industrial applications, NeurIPS

    Elliot J. Crowley, Jack Turner, Amos Storkey, Michael O'Boyle
    Pruning is a popular technique for compressing a neural network: a large pre-trained network is fine-tuned while connections are successively removed. However, the value of pruning has largely evaded scrutiny. In this extended abstract, we examine residual networks obtained through Fisher-pruning and make two interesting observations. First, when time-constrained, it is better to train a simple, smaller network from scratch than prune a large network. Second, it is the architectures obtained through the pruning process --- not the learnt weights --- that prove valuable. Such architectures are powerful when trained from scratch. Furthermore, these architectures are easy to approximate without any further pruning: we can prune once and obtain a family of new, scalable network architectures for different memory requirements.
    @inproceedings{Crowley2018Pruning,
    author = {Elliot J. Crowley and Jack Turner and Amos Storkey and Michael O'Boyle},
    title = {Pruning Neural Networks: Is it Time to Nip It in the Bud?},
    year = {2018},
    month = {Dec},
    booktitle = {Workshop on Compact Deep Neural Networks with industrial applications, NeurIPS},
    url = {https://arxiv.org/abs/1810.04622},
    }
  • Moonshine: Distilling with Cheap Convolutions

    Advances in Neural Information Processing Systems (NeurIPS)

    Elliot J. Crowley, Gavin Gray, Amos Storkey
    Many engineers wish to deploy modern neural networks in memory-limited settings; but the development of flexible methods for reducing memory use is in its infancy, and there is little knowledge of the resulting cost-benefit. We propose structural model distillation for memory reduction using a strategy that produces a student architecture that is a simple transformation of the teacher architecture: no redesign is needed, and the same hyperparameters can be used. Using attention transfer, we provide Pareto curves/tables for distillation of residual networks with four benchmark datasets, indicating the memory versus accuracy payoff. We show that substantial memory savings are possible with very little loss of accuracy, and confirm that distillation provides student network performance that is better than training that student architecture directly on data.
    @inproceedings{Crowley2018Moonshine,
    author = {Elliot J. Crowley and Gavin Gray and Amos Storkey},
    title = {Moonshine: Distilling with Cheap Convolutions},
    year = {2018},
    month = {Dec},
    booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
    url = {https://arxiv.org/abs/1711.02613},
    }
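
    For context, the attention-transfer loss used for distillation here (in the style of Zagoruyko and Komodakis) can be written in a few lines; the layer pairing and the weighting constant below are assumptions of this sketch rather than the paper's exact choices.

    import torch
    import torch.nn.functional as F

    def attention_map(feat):                              # feat: [batch, channels, H, W]
        att = feat.pow(2).mean(dim=1).flatten(1)          # spatial map from the channel-wise mean of squares
        return F.normalize(att, dim=1)

    def attention_transfer_loss(student_feats, teacher_feats, beta=1e3):
        loss = 0.0
        for fs, ft in zip(student_feats, teacher_feats):  # matched intermediate feature maps
            loss = loss + (attention_map(fs) - attention_map(ft)).pow(2).mean()
        return beta * loss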
  • Dilated DenseNets for Relational Reasoning

    Antreas Antoniou, Agnieszka Słowik, Elliot J. Crowley, Amos Storkey
    Despite their impressive performance in many tasks, deep neural networks often struggle at relational reasoning. This has recently been remedied with the introduction of a plug-in relational module that considers relations between pairs of objects. Unfortunately, this is combinatorially expensive. In this extended abstract, we show that a DenseNet incorporating dilated convolutions excels at relational reasoning on the Sort-of-CLEVR dataset, allowing us to forgo this relational module and its associated expense.
    @unpublished{Antoniou2018Dilated,
    author = {Antreas Antoniou and Agnieszka Słowik and Elliot J. Crowley and Amos Storkey},
    title = {Dilated {D}ense{N}ets for Relational Reasoning},
    year = {2018},
    month = {Nov},
    institution = {School of Informatics, University of Edinburgh},
    url = {https://arxiv.org/abs/1811.00410},
    }
  • CINIC-10 is not ImageNet or CIFAR-10

    Luke N. Darlow, Elliot J. Crowley, Antreas Antoniou, Amos Storkey
    In this brief technical report we introduce the CINIC-10 dataset as a plug-in extended alternative for CIFAR-10. It was compiled by combining CIFAR-10 with images selected and downsampled from the ImageNet database. We present the approach to compiling the dataset, illustrate the example images for different classes, give pixel distributions for each part of the repository, and give some standard benchmarks for well known models. Details for download, usage, and compilation can be found in the associated github repository.
    @techreport{Darlow2018CINIC-10,
    author = {Luke N. Darlow and Elliot J. Crowley and Antreas Antoniou and Amos Storkey},
    title = {{CINIC-10} is not {I}mage{N}et or {CIFAR-10}},
    year = {2018},
    month = {Oct},
    institution = {School of Informatics, University of Edinburgh},
    number = {EDI-INF-ANC-1802},
    url = {https://arxiv.org/abs/1810.03505},
    }
  • GINN: Geometric Illustration of Neural Networks

    Luke N. Darlow, Amos Storkey
    This informal technical report details the geometric illustration of decision boundaries for ReLU units in a three-layer fully connected neural network. The network is designed and trained to predict pixel intensity from an (x, y) input location. The Geometric Illustration of Neural Networks (GINN) tool was built to visualise and track the points at which ReLU units switch from being active to off (or vice versa) as the network undergoes training. Several phenomena were observed and are discussed herein.
    @techreport{Darlow2018GINN,
    author = {Luke N. Darlow and Amos Storkey},
    title = {{GINN}: Geometric Illustration of Neural Networks},
    year = {2018},
    month = {Oct},
    institution = {School of Informatics, University of Edinburgh},
    number = {EDI-INF-ANC-1801},
    url = {https://arxiv.org/abs/1810.01860},
    }
  • Augmenting Image Classifiers using Data Augmentation Generative Adversarial Networks

    International Conference on Artificial Neural Networks (ICANN)

    Antreas Antoniou, Amos Storkey, Harrison Edwards
    Effective training of neural networks requires much data. In the low-data regime, parameters are underdetermined, and learnt networks generalise poorly. Data Augmentation alleviates this by using existing data more effectively, but standard data augmentation produces only limited plausible alternative data. Given the potential to generate a much broader set of augmentations, we design and train a generative model to do data augmentation. The model, based on image conditional Generative Adversarial Networks, uses data from a source domain and learns to take a data item and augment it by generating other within-class data items. As this generative process does not depend on the classes themselves, it can be applied to novel unseen classes. We demonstrate that a Data Augmentation Generative Adversarial Network (DAGAN) augments classifiers well on Omniglot, EMNIST and VGG-Face.
    @inproceedings{Antoniou2018Augmenting,
    author = {Antreas Antoniou and Amos Storkey and Harrison Edwards},
    title = {Augmenting Image Classifiers using Data Augmentation Generative Adversarial Networks},
    year = {2018},
    month = {Oct},
    booktitle = {International Conference on Artificial Neural Networks (ICANN)},
    url = {https://www.bayeswatch.com/assets/papers/Augmenting_Image_Classifiers_using_Data_Augmentation_Generative_Adversarial_Networks.pdf},
    }
  • Three Factors Influencing Minima in SGD

    International Conference on Artificial Neural Networks (ICANN)

    Stanisław Jastrzębski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, Amos Storkey
    We investigate the dynamical and convergent properties of stochastic gradient descent (SGD) applied to Deep Neural Networks (DNNs). Characterizing the relation between learning rate, batch size and the properties of the final minima, such as width or generalization, remains an open question. In order to tackle this problem we investigate the previously proposed approximation of SGD by a stochastic differential equation (SDE). We theoretically argue that three factors - learning rate, batch size and gradient covariance - influence the minima found by SGD. In particular we find that the ratio of learning rate to batch size is a key determinant of SGD dynamics and of the width of the final minima, and that higher values of the ratio lead to wider minima and often better generalization. We confirm these findings experimentally. Further, we include experiments which show that learning rate schedules can be replaced with batch size schedules and that the ratio of learning rate to batch size is an important factor influencing the memorization process.
    @inproceedings{Jastrzębski2018Three,
    author = {Stanisław Jastrzębski and Zachary Kenton and Devansh Arpit and Nicolas Ballas and Asja Fischer and Yoshua Bengio and Amos Storkey},
    title = {Three Factors Influencing Minima in {SGD}},
    year = {2018},
    month = {Oct},
    booktitle = {International Conference on Artificial Neural Networks (ICANN)},
    url = {http://arxiv.org/abs/1711.04623},
    }
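
    In my own notation (a restatement for readers, not a quotation of the paper), the stochastic-differential-equation view behind these results can be summarised as

    \[
      \theta_{k+1} = \theta_k - \eta\, g_S(\theta_k), \qquad
      g_S(\theta) \approx \nabla L(\theta) + \tfrac{1}{\sqrt{S}}\,\epsilon, \quad
      \epsilon \sim \mathcal{N}\bigl(0,\, C(\theta)\bigr),
    \]
    \[
      \mathrm{d}\theta = -\nabla L(\theta)\,\mathrm{d}t
        + \sqrt{\tfrac{\eta}{S}}\; C(\theta)^{1/2}\,\mathrm{d}W_t ,
    \]

    where \eta is the learning rate, S the batch size, and C(\theta) the per-example gradient covariance: the noise scale, and hence the width of the minima reached, is governed by the ratio \eta / S.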
  • Characterising Across-Stack Optimisations for Deep Convolutional Neural Networks

    International Symposium on Workload Characterization (IISWC)

    Jack Turner, José Cano, Valentin Radu, Elliot J. Crowley, Michael O'Boyle, Amos Storkey
    Convolutional Neural Networks (CNNs) are extremely computationally demanding, presenting a large barrier to their deployment on resource-constrained devices. Since such systems are where some of their most useful applications lie (e.g. obstacle detection for mobile robots, vision-based medical assistive technology), significant bodies of work from both machine learning and systems communities have attempted to provide optimisations that will make CNNs available to edge devices. In this paper we unify the two viewpoints in a Deep Learning Inference Stack and take an across-stack approach by implementing and evaluating the most common neural network compression techniques (weight pruning, channel pruning, and quantisation) and optimising their parallel execution with a range of programming approaches (OpenMP, OpenCL) and hardware architectures (CPU, GPU). We provide comprehensive Pareto curves to instruct trade-offs under constraints of accuracy, execution time, and memory space.
    @inproceedings{Turner2018Characterising,
    author = {Jack Turner and José Cano and Valentin Radu and Elliot J. Crowley and Michael O'Boyle and Amos Storkey},
    title = {Characterising Across-Stack Optimisations for Deep Convolutional Neural Networks},
    year = {2018},
    month = {Sep},
    booktitle = {International Symposium on Workload Characterization (IISWC)},
    url = {https://arxiv.org/abs/1809.07196},
    }
  • Asymptotically Exact Inference in Differentiable Generative Models

    Electronic Journal of Statistics

    Matt Graham, Amos Storkey
    Many generative models can be expressed as a differentiable function applied to input variables sampled from a known probability distribution. This framework includes both the generative component of learned parametric models such as variational autoencoders and generative adversarial networks, and also procedurally defined simulator models which involve only differentiable operations. Though the distribution on the input variables to such models is known, often the distribution on the output variables is only implicitly defined. We present a method for performing efficient Markov chain Monte Carlo inference in such models when conditioning on observations of the model output. For some models this offers an asymptotically exact inference method where approximate Bayesian computation might otherwise be employed. We use the intuition that computing conditional expectations is equivalent to integrating over a density defined on the manifold corresponding to the set of inputs consistent with the observed outputs. This motivates the use of a constrained variant of Hamiltonian Monte Carlo which leverages the smooth geometry of the manifold to move between inputs exactly consistent with observations. We validate the method by performing inference experiments in a diverse set of models.
    @article{Graham2017Asymptotically,
    author = {Matt Graham and Amos Storkey},
    title = {Asymptotically Exact Inference in Differentiable Generative Models},
    year = {2017},
    month = {Dec},
    journal = {Electronic Journal of Statistics},
    volume = {1},
    url = {http://dx.doi.org/10.1214/17-EJS1340SI},
    }
  • Data Augmentation Generative Adversarial Networks

    Antreas Antoniou, Amos Storkey, Harrison Edwards
    Effective training of neural networks requires much data. In the low-data regime, parameters are underdetermined, and learnt networks generalise poorly. Data Augmentation alleviates this by using existing data more effectively. However standard data augmentation produces only limited plausible alternative data. Given there is potential to generate a much broader set of augmentations, we design and train a generative model to do data augmentation. The model, based on image conditional Generative Adversarial Networks, takes data from a source domain and learns to take any data item and generalise it to generate other within-class data items. As this generative process does not depend on the classes themselves, it can be applied to novel unseen classes of data. We show that a Data Augmentation Generative Adversarial Network (DAGAN) augments standard vanilla classifiers well. We also show a DAGAN can enhance few-shot learning systems such as Matching Networks. We demonstrate these approaches on Omniglot, on EMNIST having learnt the DAGAN on Omniglot, and VGG-Face data. In our experiments we can see over 13% increase in accuracy in the low-data regime experiments in Omniglot (from 69% to 82%), EMNIST (73.9% to 76%) and VGG-Face (4.5% to 12%); in Matching Networks for Omniglot we observe an increase of 0.5% (from 96.9% to 97.4%) and an increase of 1.8% in EMNIST (from 59.5% to 61.3%).
    @unpublished{Antoniou2017Data,
    author = {Antreas Antoniou and Amos Storkey and Harrison Edwards},
    title = {Data Augmentation Generative Adversarial Networks},
    year = {2017},
    month = {Nov},
    institution = {School of Informatics, University of Edinburgh},
    url = {https://arxiv.org/abs/1711.04340},
    }
  • Continuously Tempered Hamiltonian Monte Carlo

    Conference on Uncertainty in Artificial Intelligence (UAI)

    Matt Graham, Amos Storkey
    Hamiltonian Monte Carlo (HMC) is a powerful Markov chain Monte Carlo (MCMC) method for performing approximate inference in complex probabilistic models of continuous variables. In common with many MCMC methods, however, the standard HMC approach performs poorly in distributions with multiple isolated modes. We present a method for augmenting the Hamiltonian system with an extra continuous temperature control variable which allows the dynamic to bridge between sampling a complex target distribution and a simpler unimodal base distribution. This augmentation both helps improve mixing in multimodal targets and allows the normalisation constant of the target distribution to be estimated. The method is simple to implement within existing HMC code, requiring only a standard leapfrog integrator. We demonstrate experimentally that the method is competitive with annealed importance sampling and simulated tempering methods at sampling from challenging multimodal distributions and estimating their normalising constants.
    @inproceedings{Graham2017Continuously,
    author = {Matt Graham and Amos Storkey},
    title = {Continuously Tempered {H}amiltonian {M}onte {C}arlo},
    year = {2017},
    month = {Aug},
    booktitle = {Conference on Uncertainty in Artificial Intelligence (UAI)},
    url = {https://arxiv.org/abs/1704.03338},
    }
  • Asymptotically Exact Inference in Differentiable Generative Models

    International Conference on Artificial Intelligence and Statistics (AISTATS)

    Matt Graham, Amos Storkey
    Many generative models can be expressed as a differentiable function of random inputs drawn from some simple probability density. This framework includes both deep generative architectures such as Variational Autoencoders and a large class of procedurally defined simulator models. We present a method for performing efficient MCMC inference in such models when conditioning on observations of the model output. For some models this offers an asymptotically exact inference method where Approximate Bayesian Computation might otherwise be employed. We use the intuition that inference corresponds to integrating a density across the manifold corresponding to the set of inputs consistent with the observed outputs. This motivates the use of a constrained variant of Hamiltonian Monte Carlo which leverages the smooth geometry of the manifold to coherently move between inputs exactly consistent with observations. We validate the method by performing inference tasks in a diverse set of models.
    @inproceedings{Graham2017Asymptotically,
    author = {Matt Graham and Amos Storkey},
    title = {Asymptotically Exact Inference in Differentiable Generative Models},
    year = {2017},
    month = {Apr},
    booktitle = {International Conference on Artificial Intelligence and Statistics (AISTATS)},
    url = {https://arxiv.org/abs/1605.07826},
    }
  • Towards a Neural Statistician

    International Conference on Learning Representations (ICLR)

    Harrison Edwards, Amos Storkey
    An efficient learner is one who reuses what they already know to tackle a new problem. For a machine learner, this means understanding the similarities amongst datasets. In order to do this, one must take seriously the idea of working with datasets, rather than datapoints, as the key objects to model. Towards this goal, we demonstrate an extension of a variational autoencoder that can learn a method for computing representations, or statistics, of datasets in an unsupervised fashion. The network is trained to produce statistics that encapsulate a generative model for each dataset. Hence the network enables efficient learning from new datasets for both unsupervised and supervised tasks. We show that we are able to learn statistics that can be used for: clustering datasets, transferring generative models to new datasets, selecting representative samples of datasets and classifying previously unseen classes. We refer to our model as a neural statistician, and by this we mean a neural network that can learn to compute summary statistics of datasets without supervision.
    @inproceedings{Edwards2017Towards,
    author = {Harrison Edwards and Amos Storkey},
    title = {Towards a Neural Statistician},
    year = {2017},
    month = {Apr},
    booktitle = {International Conference on Learning Representations (ICLR)},
    url = {https://arxiv.org/abs/1606.02185},
    }
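    As a rough illustration of the "statistics of datasets" idea above, the sketch below embeds each datapoint, pools the embeddings into a permutation-invariant summary, and maps that summary to the parameters of a Gaussian context variable. The layer sizes and the mean-pooling choice are assumptions made for the example; the paper's full architecture is more involved.

    import torch
    import torch.nn as nn

    # Minimal sketch of a dataset-level ("statistic") encoder: embed each datapoint,
    # pool over the dataset, and output the mean and log-variance of a Gaussian
    # context vector. Sizes and the pooling choice are illustrative assumptions.

    class SetEncoder(nn.Module):
        def __init__(self, x_dim=2, h_dim=64, c_dim=16):
            super().__init__()
            self.embed = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU(),
                                       nn.Linear(h_dim, h_dim), nn.ReLU())
            self.to_mu = nn.Linear(h_dim, c_dim)
            self.to_logvar = nn.Linear(h_dim, c_dim)

        def forward(self, dataset):                  # dataset: (n_points, x_dim)
            h = self.embed(dataset)                  # per-point embeddings
            pooled = h.mean(dim=0)                   # permutation-invariant pooling
            return self.to_mu(pooled), self.to_logvar(pooled)

    # Two toy datasets drawn from different Gaussians yield different summaries.
    encoder = SetEncoder()
    mu_a, _ = encoder(torch.randn(100, 2) + 3.0)
    mu_b, _ = encoder(torch.randn(100, 2) - 3.0)

    In the full model this summary would condition a generative model of the datapoints and be trained with a variational objective; the sketch shows only the dataset-level encoder.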
  • Resource-Efficient Feature Gathering at Test Time

    Workshop on Reliable Machine Learning in the Wild, NeurIPS

    Gavin Gray, Amos Storkey
    Data collection is costly. A machine learning model requires input data to produce an output prediction, but that input is often not cost-free to produce accurately. For example, in the social sciences, it may require collecting samples; in signal processing it may involve investing in expensive accurate sensors. The problem of allocating a budget across the collection of different input variables is largely overlooked in machine learning, but is important under real-world constraints. Given that the noise level on each input feature depends on how much resource has been spent gathering it, and given a fixed budget, we ask how to allocate that budget to maximise our expected reward. At the same time, the optimal model parameters will depend on the choice of budget allocation, and so searching the space of possible budgets is costly. Using doubly stochastic gradient methods we propose a solution that allows expressive models and massive datasets, while still providing an interpretable budget allocation for feature gathering at test time.
    @inproceedings{Gray2016Resource-Efficient,
    author = {Gavin Gray and Amos Storkey},
    title = {Resource-Efficient Feature Gathering at Test Time},
    year = {2016},
    month = {Dec},
    booktitle = {Workshop on Reliable Machine Learning in the Wild, NeurIPS},
    url = {/assets/papers/resource-efficient-wildml16.pdf},
    }
  • Censoring Representations with an Adversary

    International Conference on Learning Representations (ICLR)

    Harrison Edwards, Amos Storkey
    In practice, there are often explicit constraints on what representations or decisions are acceptable in an application of machine learning. For example, it may be a legal requirement that a decision must not favour a particular group. Alternatively, it may be that the representation of the data must not contain identifying information. We address these two related issues by learning flexible representations that minimize the capability of an adversarial critic. This adversary is trying to predict the relevant sensitive variable from the representation, and so minimizing the performance of the adversary ensures there is little or no information in the representation about the sensitive variable. We demonstrate this adversarial approach on two problems: making decisions free from discrimination and removing private information from images. We formulate the adversarial model as a minimax problem, and optimize that minimax objective using a stochastic gradient alternate min-max optimizer. We demonstrate the ability to provide discrimination-free representations for standard test problems, and compare with previous state-of-the-art methods for fairness, showing statistically significant improvement across most cases. The flexibility of this method is shown via a novel problem: removing annotations from images, from unaligned training examples of annotated and unannotated images, and with no a priori knowledge of the form of annotation provided to the model. (An illustrative code sketch follows the citation below.)
    @inproceedings{Edwards2016Censoring,
    author = {Harrison Edwards and Amos Storkey},
    title = {Censoring Representations with an Adversary},
    year = {2016},
    month = {Mar},
    booktitle = {International Conference on Learning Representations (ICLR)},
    url = {https://arxiv.org/abs/1511.05897},
    }
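    The sketch below shows the structure of the alternating min-max training loop described above, assuming toy linear networks, binary task and sensitive labels, and an arbitrary trade-off weight lam; it is an illustration of the adversarial objective, not the paper's exact architecture.

    import torch
    import torch.nn as nn

    # Adversarial censoring sketch: the encoder and predictor are trained to solve the
    # task while making the sensitive variable s hard to recover from the
    # representation z; the adversary is trained to recover s from z. Architectures,
    # sizes, and the weight lam are illustrative assumptions.

    x_dim, z_dim = 10, 8
    encoder   = nn.Sequential(nn.Linear(x_dim, z_dim), nn.ReLU())
    predictor = nn.Linear(z_dim, 1)                  # predicts the task label y
    adversary = nn.Linear(z_dim, 1)                  # tries to predict s from z

    opt_model = torch.optim.Adam(list(encoder.parameters()) +
                                 list(predictor.parameters()), lr=1e-3)
    opt_adv   = torch.optim.Adam(adversary.parameters(), lr=1e-3)
    bce = nn.BCEWithLogitsLoss()

    def train_step(x, y, s, lam=1.0):
        # Adversary step: improve prediction of s from a detached representation.
        z = encoder(x).detach()
        adv_loss = bce(adversary(z), s)
        opt_adv.zero_grad(); adv_loss.backward(); opt_adv.step()

        # Model step: predict y well while degrading the adversary's performance.
        z = encoder(x)
        model_loss = bce(predictor(z), y) - lam * bce(adversary(z), s)
        opt_model.zero_grad(); model_loss.backward(); opt_model.step()
        return model_loss.item()

    # Shape-checking usage with random data.
    x = torch.randn(32, x_dim)
    y = torch.randint(0, 2, (32, 1)).float()
    s = torch.randint(0, 2, (32, 1)).float()
    train_step(x, y, s)

    Larger values of lam trade task accuracy for stronger censoring of the sensitive variable.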
  • Evaluation of a Pre-surgical Functional MRI Workflow: From Data Acquisition to Reporting

    International Journal of Medical Informatics

    Cyril Pernet, Krzysztof J Gorgolewski, Dominic Job, David Rodriguez, Amos J Storkey, Ian Whittle, Joanna Wardlaw
    Purpose: Present and assess clinical protocols and associated automated workflow for pre-surgical functional magnetic resonance imaging in brain tumor patients. Methods: Protocols were validated using a single-subject reliability approach based on 10 healthy control subjects. Results from the automated workflow were evaluated in 9 patients with brain tumors, comparing fMRI results to direct electrical stimulation (DES) of the cortex. Results: Using a new approach to compute single-subject fMRI reliability in controls, we show that not all tasks are suitable in the clinical context, even if they show meaningful results at the group level. Comparison of the fMRI results from patients to DES showed good correspondence between techniques (odds ratio 36). Conclusion: Provided that validated and reliable fMRI protocols are used, fMRI can accurately delineate eloquent areas, thus providing an aid to medical decision-making regarding brain tumor surgery.
    @article{Pernet2016Evaluation,
    author = {Cyril Pernet and Krzysztof J Gorgolewski and Dominic Job and David Rodriguez and Amos J Storkey and Ian Whittle and Joanna Wardlaw},
    title = {Evaluation of a Pre-surgical Functional {MRI} Workflow: From Data Acquisition to Reporting},
    year = {2016},
    month = {Feb},
    journal = {International Journal of Medical Informatics},
    volume = {86},
    url = {http://homepages.inf.ed.ac.uk/amos/publications/Pernet_al_Evaluation_Pre_Surgical.pdf},
    }
  • Stochastic Parallel Block Coordinate Descent for Large-scale Saddle Point Problems

    AAAI Conference on Artificial Intelligence (AAAI)

    Zhanxing Zhu, Amos Storkey
    We consider convex-concave saddle point problems with a separable structure and non-strongly convex functions. We propose an efficient stochastic block coordinate descent method using adaptive primal-dual updates, which enables flexible parallel optimization for large-scale problems. Our method shares the efficiency and flexibility of block coordinate descent methods with the simplicity of primal-dual methods, while exploiting the structure of the separable convex-concave saddle point problem. It is capable of solving a wide range of machine learning applications, including robust principal component analysis, Lasso, and feature selection by group Lasso. Theoretically and empirically, we demonstrate significantly better performance than state-of-the-art methods in all these applications. (An illustrative code sketch follows the citation below.)
    @inproceedings{Zhu2016Stochastic,
    author = {Zhanxing Zhu and Amos Storkey},
    title = {Stochastic Parallel Block Coordinate Descent for Large-scale Saddle Point Problems},
    year = {2016},
    month = {Feb},
    booktitle = {AAAI Conference on Artificial Intelligence (AAAI)},
    url = {https://arxiv.org/abs/1511.07294},
    }
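    As a toy illustration of primal-dual updates with randomly selected dual blocks, the sketch below solves a small, strongly convex-concave quadratic saddle point problem; this is a simplification of the non-strongly convex, separable setting addressed by the paper, and the step size and block size are arbitrary.

    import numpy as np

    # Toy primal-dual iteration with stochastic dual block updates on
    #     min_x max_y  0.5*||x||^2 + x^T A y - 0.5*||y||^2,
    # whose saddle point is (0, 0). The strongly convex-concave terms are an
    # assumption made so that a plain small-step scheme converges; the paper's
    # adaptive updates target the harder non-strongly convex case.

    rng = np.random.default_rng(0)
    n, m, block = 20, 30, 5
    A = rng.standard_normal((n, m)) / np.sqrt(m)
    x, y = rng.standard_normal(n), rng.standard_normal(m)
    step = 0.1

    for t in range(2000):
        x = x - step * (x + A @ y)                       # full primal descent step
        idx = rng.choice(m, size=block, replace=False)   # random dual block
        y[idx] = y[idx] + step * (A.T @ x - y)[idx]      # ascent on selected coordinates

    print(np.linalg.norm(x), np.linalg.norm(y))          # both should be close to zero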
  • Covariance-Controlled Adaptive Langevin Thermostat for Large-Scale Bayesian Sampling

    Advances in Neural Information Processing Systems (NeurIPS)

    Xiaocheng Shang, Zhanxing Zhu, Benedict Leimkuhler, Amos Storkey
    Monte Carlo sampling for Bayesian posterior inference is a common approach used in machine learning. The Markov Chain Monte Carlo procedures that are used are often discrete-time analogues of associated stochastic differential equations (SDEs). These SDEs are guaranteed to leave invariant the required posterior distribution. An area of current research addresses the computational benefits of stochastic gradient methods in this setting. Existing techniques rely on estimating the variance or covariance of the subsampling error, and typically assume constant variance. In this article, we propose a covariance-controlled adaptive Langevin thermostat that can effectively dissipate parameter-dependent noise while maintaining a desired target distribution. The proposed method achieves a substantial speedup over popular alternative schemes for large-scale machine learning applications. (An illustrative code sketch follows the citation below.)
    @inproceedings{Shang2015Covariance-Controlled,
    author = {Xiaocheng Shang and Zhanxing Zhu and Benedict Leimkuhler and Amos Storkey},
    title = {Covariance-Controlled Adaptive {L}angevin Thermostat for Large-Scale {B}ayesian Sampling},
    year = {2015},
    month = {Dec},
    booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
    url = {https://arxiv.org/abs/1510.08692},
    }
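    For context, the sketch below runs a plain stochastic-gradient adaptive Langevin (Nose-Hoover) thermostat on a toy Gaussian target with artificially noisy gradients. It shows the style of thermostatted dynamics the paper builds on, but omits the covariance-control correction that is the paper's contribution; the step size, friction constant, and noise level are arbitrary choices.

    import numpy as np

    # Stochastic-gradient Nose-Hoover thermostat on a toy standard-normal target. The
    # thermostat variable xi adapts the friction so the kinetic temperature stays near
    # one despite the gradient noise. The covariance-control step proposed in the
    # paper is omitted; all constants are arbitrary choices for the example.

    rng = np.random.default_rng(0)
    d, h, A = 2, 1e-2, 1.0                # dimension, step size, friction constant
    theta = np.zeros(d)
    p = rng.standard_normal(d)
    xi = A

    def noisy_grad_log_post(theta):
        # Stand-in for a minibatch gradient: exact N(0, I) gradient plus noise.
        return -theta + 0.1 * rng.standard_normal(d)

    samples = []
    for t in range(20000):
        p += (h * noisy_grad_log_post(theta) - h * xi * p
              + np.sqrt(2.0 * A * h) * rng.standard_normal(d))
        theta += h * p
        xi += h * (p @ p / d - 1.0)       # adapt friction towards unit kinetic temperature
        samples.append(theta.copy())

    print(np.cov(np.array(samples[5000:]).T))   # should be roughly the identity matrix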
  • Adaptive Stochastic Primal-dual Coordinate Descent for Separable Saddle Point Problems

    Joint European Conference on Machine Learning and Knowledge Discovery in Databases

    Zhanxing Zhu, Amos Storkey
    We consider a generic convex-concave saddle point problem with a separable structure, a form that covers a wide range of machine learning applications. Under this problem structure, we follow the framework of primal-dual updates for saddle point problems, and incorporate stochastic block coordinate descent with adaptive stepsizes into this framework. We theoretically show that our proposal of adaptive stepsizes potentially achieves a sharper linear convergence rate compared with the existing methods. Additionally, since we can select a “mini-batch” of block coordinates to update, our method is also amenable to parallel processing for large-scale data. We apply the proposed method to regularized empirical risk minimization and show that it performs comparably or, more often, better than state-of-the-art methods on both synthetic and real-world data sets.
    @inproceedings{Zhu2015Adaptive,
    author = {Zhanxing Zhu and Amos Storkey},
    title = {Adaptive Stochastic Primal-dual Coordinate Descent for Separable Saddle Point Problems},
    year = {2015},
    month = {Aug},
    booktitle = {Joint European Conference on Machine Learning and Knowledge Discovery in Databases},
    url = {https://arxiv.org/abs/1506.04093},
    }
  • Training Deep Convolutional Neural Networks to Play Go

    International Conference on Machine Learning (ICML)

    Chris Clark, Amos Storkey
    Mastering the game of Go has remained a long-standing challenge to the field of AI. Modern computer Go systems rely on processing millions of possible future positions to play well, but intuitively a stronger and more 'humanlike' way to play the game would be to rely on pattern recognition abilities rather than brute force computation. Following this sentiment, we train deep convolutional neural networks to play Go by training them to predict the moves made by expert Go players. To solve this problem we introduce a number of novel techniques, including a method of tying weights in the network to 'hard code' symmetries that are expected to exist in the target function, and demonstrate in an ablation study that they considerably improve performance. Our final networks are able to achieve move prediction accuracies of 41.1% and 44.4% on two different Go datasets, surpassing previous state of the art on this task by significant margins. Additionally, while previous move prediction programs have not yielded strong Go playing programs, we show that the networks trained in this work acquired high levels of skill. Our convolutional neural networks can consistently defeat the well known Go program GNU Go, indicating they are state of the art among programs that do not use Monte Carlo Tree Search. They are also able to win some games against the state-of-the-art Go playing program Fuego while using a fraction of the play time. This success at playing Go indicates that high-level principles of the game were learned. (An illustrative code sketch follows the citation below.)
    @inproceedings{Clark2015Training,
    author = {Chris Clark and Amos Storkey},
    title = {Training Deep Convolutional Neural Networks to Play {G}o},
    year = {2015},
    month = {Jun},
    booktitle = {International Conference on Machine Learning (ICML)},
    url = {https://arxiv.org/abs/1412.3409},
    }
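    One simple way to 'hard code' the symmetries of the Go board into a convolutional network, sketched below, is to keep a free kernel and symmetrise it over the eight rotations and reflections on every forward pass, so the learned filter is constrained to respect the dihedral symmetry. This illustrates the general idea of tying weights across symmetries; it is not necessarily the paper's exact scheme.

    import torch
    import torch.nn.functional as F

    # Tie convolution weights across the 8 symmetries of the board by averaging the
    # kernel over 4 rotations x 2 reflections each forward pass. This is one simple
    # way to hard-code the symmetry, not necessarily the paper's exact construction.

    def dihedral_symmetrise(w):
        # w: (out_channels, in_channels, k, k)
        variants = []
        for k in range(4):
            r = torch.rot90(w, k, dims=(2, 3))
            variants.append(r)
            variants.append(torch.flip(r, dims=(3,)))
        return torch.stack(variants).mean(dim=0)

    free_kernel = torch.nn.Parameter(torch.randn(16, 1, 3, 3) * 0.1)
    board = torch.randn(1, 1, 19, 19)            # dummy single-channel 19x19 board encoding

    features = F.conv2d(board, dihedral_symmetrise(free_kernel), padding=1)

    # Check: rotating the board rotates the feature maps in the same way.
    rotated_features = F.conv2d(torch.rot90(board, 1, dims=(2, 3)),
                                dihedral_symmetrise(free_kernel), padding=1)
    print(torch.allclose(torch.rot90(features, 1, dims=(2, 3)), rotated_features, atol=1e-5))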
  • The Supervised Hierarchical Dirichlet Process

    IEEE Transactions on Pattern Analysis and Machine Intelligence (Special Issue on Bayesian Nonparametrics)

    Andrew M. Dai, Amos Storkey
    We propose the supervised hierarchical Dirichlet process (sHDP), a nonparametric generative model for the joint distribution of a group of observations and a response variable directly associated with that whole group. We compare the sHDP with another leading method for regression on grouped data, the supervised latent Dirichlet allocation (sLDA) model. We evaluate our method on two real-world classification problems and two real-world regression problems. Bayesian nonparametric regression models based on the Dirichlet process, such as the Dirichlet process-generalised linear models (DP-GLM) have previously been explored; these models allow flexibility in modelling nonlinear relationships. However, until now, Hierarchical Dirichlet Process (HDP) mixtures have not seen significant use in supervised problems with grouped data since a straightforward application of the HDP on the grouped data results in learnt clusters that are not predictive of the responses. The sHDP solves this problem by allowing for clusters to be learnt jointly from the group structure and from the label assigned to each group.
    @article{Dai2015Supervised,
    author = {Andrew M. Dai and Amos Storkey},
    title = {The Supervised Hierarchical {D}irichlet process},
    year = {2015},
    month = {Apr},
    journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence (Special Issue on Bayesian Nonparametrics)},
    volume = {37},
    url = {https://arxiv.org/abs/1412.5236},
    }
  • Multi-period Trading Prediction Markets with Connections to Machine Learning

    International Conference on Machine Learning (ICML)

    Jinli Hu, Amos Storkey
    We present a new model for prediction markets, in which we use risk measures to model agents and introduce a market maker to describe the trading process. This specific choice of modelling tools brings mathematical convenience. The analysis shows that the whole market effectively approaches a global objective, even though the market is designed so that each agent cares only about its own goal. Additionally, the market dynamics provide a sensible algorithm for optimising the global objective. An intimate connection between machine learning and our markets is thus established, so that we can 1) analyse a market by applying machine learning methods to the global objective, and 2) solve machine learning problems by setting up and running certain markets.
    @inproceedings{Hu2014Multi-period,
    author = {Jinli Hu and Amos Storkey},
    title = {Multi-period Trading Prediction Markets with Connections to Machine Learning},
    year = {2014},
    month = {Jun},
    booktitle = {International Conference on Machine Learning (ICML)},
    url = {https://arxiv.org/abs/1403.0648},
    }
  • Series Expansion Approximations of Brownian Motion for Non-Linear Kalman Filtering of Diffusion Processes

    IEEE Transactions on Signal Processing

    Simon Lyons, Simo Särkkä, Amos Storkey
    In this paper, we describe a novel application of sigma-point methods to continuous-discrete filtering. In principle, the nonlinear continuous-discrete filtering problem can be solved exactly. In practice, the solution contains terms that are computationally intractable. Assumed density filtering methods attempt to match statistics of the filtering distribution to some set of more tractable probability distributions. We describe a novel method that decomposes the Brownian motion driving the signal in a generalised Fourier series, which is truncated after a number of terms. This approximation to Brownian motion can be described using a relatively small number of Fourier coefficients, and allows us to compute statistics of the filtering distribution with a single application of a sigma-point method. Assumed density filters that exist in the literature usually rely on discretisation of the signal dynamics followed by iterated application of a sigma-point transform (or a limiting case thereof). Iterating the transform in this manner can lead to loss of information about the filtering distribution in highly nonlinear settings. We demonstrate that our method is better equipped to cope with such problems. (An illustrative code sketch follows the citation below.)
    @article{Lyons2014Series,
    author = {Simon Lyons and Simo Särkkä and Amos Storkey},
    title = {Series Expansion Approximations of {B}rownian Motion for Non-Linear {K}alman Filtering of Diffusion Processes},
    year = {2014},
    month = {Mar},
    journal = {IEEE Transactions on Signal Processing},
    volume = {62},
    url = {https://arxiv.org/abs/1302.5324},
    }
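    Stated concretely, the representational idea above is that a Brownian motion path on [0, 1] can be expanded in a series with independent Gaussian coefficients and truncated after a few terms. The sketch below uses the standard Karhunen-Loeve sine basis as one example of such a generalised Fourier series; the filtering machinery of the paper is not reproduced here.

    import numpy as np

    # Truncated series representation of Brownian motion on [0, 1] with the standard
    # Karhunen-Loeve sine basis:
    #     B(t) ~= sum_k z_k * sqrt(2) * sin((k - 1/2) * pi * t) / ((k - 1/2) * pi),
    # with z_k i.i.d. standard normal, so each approximate path is described by just
    # n_terms coefficients.

    def brownian_paths(n_paths, n_terms, t, seed=0):
        rng = np.random.default_rng(seed)
        z = rng.standard_normal((n_paths, n_terms))                # series coefficients
        freqs = (np.arange(1, n_terms + 1) - 0.5) * np.pi          # (n_terms,)
        basis = np.sqrt(2.0) * np.sin(np.outer(t, freqs)) / freqs  # (len(t), n_terms)
        return z @ basis.T                                         # (n_paths, len(t))

    t = np.linspace(0.0, 1.0, 200)
    paths = brownian_paths(n_paths=2000, n_terms=10, t=t)

    # Var[B(t)] = t, so the empirical variance should track t up to truncation and
    # Monte Carlo error.
    print(np.max(np.abs(paths.var(axis=0) - t)))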
  • Bayesian Inference in Sparse Gaussian Graphical Models

    Peter Orchard, Felix Agakov, Amos Storkey
    One of the fundamental tasks of science is to find explainable relationships between observed phenomena. One approach to this task that has received attention in recent years is based on probabilistic graphical modelling with sparsity constraints on model structures. In this paper, we describe two new approaches to Bayesian inference of sparse structures of Gaussian graphical models (GGMs). One is based on a simple modification of the cutting-edge block Gibbs sampler for sparse GGMs, which results in significant computational gains in high dimensions. The other method is based on a specific construction of the Hamiltonian Monte Carlo sampler, which results in further significant improvements. We compare our fully Bayesian approaches with the popular regularisation-based graphical LASSO, and demonstrate significant advantages of the Bayesian treatment under the same computing costs. We apply the methods to a broad range of simulated data sets, and a real-life financial data set.
    @techreport{Orchard2013Bayesian,
    author = {Peter Orchard and Felix Agakov and Amos Storkey},
    title = {{B}ayesian Inference in Sparse {G}aussian Graphical Models},
    year = {2013},
    month = {Sep},
    institution = {School of Informatics, University of Edinburgh},
    number = {1},
    url = {https://arxiv.org/abs/1309.7311},
    }
  • A Topic Model for Melodic Sequences

    International Conference on Machine Learning (ICML)

    Athina Spiliopoulou, Amos Storkey
    We examine the problem of learning a probabilistic model for melody directly from musical sequences belonging to the same genre. This is a challenging task as one needs to capture not only the rich temporal structure evident in music, but also the complex statistical dependencies among different music components. To address this problem we introduce the Variable-gram Topic Model, which couples the latent topic formalism with a systematic model for contextual information. We evaluate the model on next-step prediction. Additionally, we present a novel way of model evaluation, where we directly compare model samples with data sequences using the Maximum Mean Discrepancy of string kernels, to assess how close the model distribution is to the data distribution. We show that the model has the highest performance under both evaluation measures when compared to LDA, the Topic Bigram and related non-topic models. (An illustrative code sketch follows the citation below.)
    @inproceedings{Spiliopoulou2012Topic,
    author = {Athina Spiliopoulou and Amos Storkey},
    title = {A Topic Model for Melodic Sequences},
    year = {2012},
    month = {Jun},
    booktitle = {International Conference on Machine Learning (ICML)},
    url = {https://arxiv.org/abs/1206.6441},
    }
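    The evaluation idea above, directly comparing model samples with data sequences via the Maximum Mean Discrepancy, is illustrated below with the standard unbiased MMD^2 estimator. A generic Gaussian (RBF) kernel on toy vectors stands in for the paper's string kernels on melodic sequences, and the bandwidth is an arbitrary choice.

    import numpy as np

    # Unbiased estimate of squared Maximum Mean Discrepancy between two samples, using
    # an RBF kernel on vectors as a stand-in for string kernels on melodic sequences.

    def rbf_kernel(a, b, bandwidth=1.0):
        sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

    def mmd2_unbiased(x, y, bandwidth=1.0):
        kxx, kyy, kxy = (rbf_kernel(x, x, bandwidth),
                         rbf_kernel(y, y, bandwidth),
                         rbf_kernel(x, y, bandwidth))
        m, n = len(x), len(y)
        term_xx = (kxx.sum() - np.trace(kxx)) / (m * (m - 1))
        term_yy = (kyy.sum() - np.trace(kyy)) / (n * (n - 1))
        return term_xx + term_yy - 2.0 * kxy.mean()

    rng = np.random.default_rng(0)
    data = rng.standard_normal((300, 5))
    good_model = rng.standard_normal((300, 5))        # same distribution: MMD^2 near 0
    bad_model = rng.standard_normal((300, 5)) + 1.0   # shifted distribution: larger MMD^2
    print(mmd2_unbiased(data, good_model), mmd2_unbiased(data, bad_model))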
  • Isoelastic Agents and Wealth Updates in Machine Learning Markets

    International Conference on Machine Learning (ICML)

    Amos Storkey, Jono Millin, Krzysztof Geras
    Recently, prediction markets have shown considerable promise for developing flexible mechanisms for machine learning. In this paper, agents with isoelastic utilities are considered. It is shown that the costs associated with homogeneous markets of agents with isoelastic utilities produce equilibrium prices corresponding to alpha-mixtures, with a particular form of mixing component relating to each agent's wealth. We also demonstrate that wealth accumulation for logarithmic and other isoelastic agents (through payoffs on prediction of training targets) can implement both Bayesian model updates and mixture weight updates by imposing different market payoff structures. An iterative algorithm is given for market equilibrium computation. We demonstrate that inhomogeneous markets of agents with isoelastic utilities outperform state-of-the-art aggregate classifiers such as random forests, as well as single classifiers (neural networks, decision trees) on a number of machine learning benchmarks, and show that isoelastic combination methods are generally better than their logarithmic counterparts.
    @inproceedings{Storkey2012Isoelastic,
    author = {Amos Storkey and Jono Millin and Krzysztof Geras},
    title = {Isoelastic Agents and Wealth Updates in Machine Learning Markets},
    year = {2012},
    month = {Jun},
    booktitle = {International Conference on Machine Learning (ICML)},
    url = {https://arxiv.org/abs/1206.6443},
    }
  • Comparing Probabilistic Models for Melodic Sequences

    Proceedings of the ECML-PKDD

    Athina Spiliopoulou, Amos Storkey
    Modelling the real-world complexity of music is a challenge for machine learning. We address the task of modelling melodic sequences from the same music genre. We perform a comparative analysis of two probabilistic models: a Dirichlet Variable Length Markov Model (Dirichlet-VMM) and a Time Convolutional Restricted Boltzmann Machine (TC-RBM). We show that the TC-RBM learns descriptive music features, such as underlying chords and typical melody transitions and dynamics. We assess the models for future prediction and compare their performance to a VMM, which is the current state of the art in melody generation. We show that both models perform significantly better than the VMM, with the Dirichlet-VMM marginally outperforming the TC-RBM. Finally, we evaluate the short order statistics of the models, using the Kullback-Leibler divergence between test sequences and model samples, and show that our proposed methods match the statistics of the music genre significantly better than the VMM.
    @inproceedings{Spiliopoulou2011Comparing,
    author = {Athina Spiliopoulou and Amos Storkey},
    title = {Comparing Probabilistic Models for Melodic Sequences},
    year = {2011},
    month = {Sep},
    booktitle = {Proceedings of the ECML-PKDD},
    url = {https://arxiv.org/abs/1109.6804},
    }
  • Machine Learning Markets

    International Conference on Artificial Intelligence and Statistics (AISTATS)

    Amos Storkey
    Prediction markets show considerable promise for developing flexible mechanisms for machine learning. Here, machine learning markets for multivariate systems are defined, and a utility-based framework is established for their analysis. This differs from the usual approach of defining static betting functions. It is shown that such markets can implement model combination methods used in machine learning, such as product-of-experts and mixture-of-experts approaches, as equilibrium pricing models, by varying agent utility functions. They can also implement models composed of local potentials, and message passing methods. Prediction markets also allow for more flexible combinations, by combining multiple different utility functions. Conversely, the market mechanisms implement inference in the relevant probabilistic models. This means that market mechanisms can be utilized for implementing parallelized model building and inference for probabilistic modelling. (An illustrative code sketch follows the citation below.)
    @inproceedings{Storkey2011Machine,
    author = {Amos Storkey},
    title = {Machine Learning Markets},
    year = {2011},
    month = {Apr},
    booktitle = {International Conference on Artificial Intelligence and Statistics (AISTATS)},
    url = {https://arxiv.org/abs/1106.4509},
    }
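    As a small concrete example of the equilibrium pricing models mentioned above, the sketch below combines the beliefs of several agents over a discrete variable as a mixture of experts (weighted arithmetic mean of probabilities) and as a product of experts (normalised weighted geometric mean). The beliefs and weights are made up for the example; which combination arises in a market depends on the agents' utility functions.

    import numpy as np

    # Two classic combination rules that can arise as market equilibrium prices,
    # depending on agent utilities: mixture of experts and product of experts.
    # The agent beliefs and weights below are made up for the example.

    beliefs = np.array([[0.7, 0.2, 0.1],      # agent 1's distribution over 3 outcomes
                        [0.3, 0.4, 0.3],      # agent 2
                        [0.1, 0.1, 0.8]])     # agent 3
    weights = np.array([0.5, 0.3, 0.2])       # e.g. proportional to agent wealth

    mixture = weights @ beliefs                            # mixture of experts
    product = np.prod(beliefs ** weights[:, None], axis=0)
    product /= product.sum()                               # product of experts (normalised)

    print(mixture, product)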