Publications · BayesWatch

Transferrable Surrogates in Expressive Neural Architecture Search Spaces

International Conference on Automated Machine Learning (AutoML)

Shiwen Qin, Gabriela Kadlecová, Martin Pilát, Shay B. Cohen, Roman Neruda, Elliot J. Crowley, Jovita Lukasik, Linus Ericsson

Neural architecture search (NAS) faces a challenge in balancing the exploration of expressive, broad search spaces that enable architectural innovation with the need for efficient evaluation of architectures to effectively search such spaces. We investigate surrogate model training for improving search in highly expressive NAS search spaces based on context-free grammars. We show that i) surrogate models trained either using zero-cost-proxy metrics and neural graph features (GRAF) or by fine-tuning an off-the-shelf LM have high predictive power for the performance of architectures both within and across datasets, ii) these surrogates can be used to filter out bad architectures when searching on novel datasets, thereby significantly speeding up search and achieving better final performances, and iii) the surrogates can be further used directly as the search objective for huge speed-ups.

@inproceedings{Qin2025_9_Transferrable,
author = {Shiwen Qin and Gabriela Kadlecová and Martin Pilát and Shay B. Cohen and Roman Neruda and Elliot J. Crowley and Jovita Lukasik and Linus Ericsson},
title = {Transferrable Surrogates in Expressive Neural Architecture Search Spaces},
year = {2025},
month = {Sep},
booktitle = {International Conference on Automated Machine Learning (AutoML)},
url = {https://arxiv.org/abs/2504.12971},
}

Lightweight Online Adaption for Time Series Foundation Model Forecasts

International Conference on Machine Learning (ICML)

Thomas L. Lee, William Toner, Rajkarn Singh, Artjom Joosem, Martin Asenov

Foundation models (FMs) have emerged as a promising approach for time series forecasting. While effective, FMs typically remain fixed during deployment due to the high computational costs of learning them online. Consequently, deployed FMs fail to adapt their forecasts to current data characteristics, despite the availability of online feedback from newly arriving data. This raises the question of whether FM performance can be enhanced by the efficient usage of this feedback. We propose AdapTS to answer this question. AdapTS is a lightweight mechanism for the online adaption of FM forecasts in response to online feedback. AdapTS consists of two parts: a) the AdapTS-Forecaster which is used to learn the current data distribution; and b) the AdapTS-Weighter which is used to combine the forecasts of the FM and the AdapTS-Forecaster. We evaluate the performance of AdapTS in conjunction with several recent FMs across a suite of standard time series datasets. In all of our experiments we find that using AdapTS improves performance. This work demonstrates how efficient usage of online feedback can be used to improve FM forecasts.
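As a rough illustration of the idea of combining a fixed FM forecast with an online-learned forecast, the sketch below blends two forecast streams with exponentially weighted averaging. This is a generic textbook scheme, not the AdapTS-Weighter itself; the function name `combine_online`, the learning rate `eta`, the squared-error penalty, and the toy data are all assumptions made for illustration.

```python
import math


def combine_online(fm_preds, adapter_preds, targets, eta=1.0):
    """Blend two forecast streams with exponentially weighted averaging.

    Hypothetical helper: the name and the weighting rule are illustrative only.
    """
    log_w = [0.0, 0.0]  # log-weights for [FM, adapter]; equal at the start
    blended = []
    for fm, adapter, y in zip(fm_preds, adapter_preds, targets):
        # Normalise the weights (subtract the max for numerical stability).
        top = max(log_w)
        w_fm = math.exp(log_w[0] - top)
        w_ad = math.exp(log_w[1] - top)
        share_fm = w_fm / (w_fm + w_ad)
        # The forecast is issued before the true value y is observed.
        blended.append(share_fm * fm + (1.0 - share_fm) * adapter)
        # Online feedback: penalise each forecaster by its squared error.
        log_w[0] -= eta * (fm - y) ** 2
        log_w[1] -= eta * (adapter - y) ** 2
    return blended


# Toy stream (made up): the fixed model is biased, the adapter tracks the data.
targets = [1.0] * 20
blended = combine_online([0.0] * 20, [1.0] * 20, targets)
print(blended[0], blended[-1])
```

In this toy run the blend starts halfway between the two forecasters and moves towards the one with the lower error as feedback accumulates, while the underlying model stays fixed.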

@inproceedings{Lee2025_6_Lightweight,
author = {Thomas L. Lee and William Toner and Rajkarn Singh and Artjom Joosem and Martin Asenov},
title = {Lightweight Online Adaption for Time Series Foundation Model Forecasts},
year = {2025},
month = {Jun},
booktitle = {International Conference on Machine Learning (ICML)},
url = {https://arxiv.org/abs/2502.12920},
}

Model Diffusion for Certifiable Few-shot Transfer Learning

Workshop on Neural Network Weights as a New Data Modality @ ICLR

Fady Rezk, Royson Lee, Henry Gouk, Timothy Hospedales, Minyoung Kim

In modern large-scale deep learning, a prevalent and effective workflow for solving low-data problems is adapting powerful pre-trained foundation models (FMs) to new tasks via parameter-efficient fine-tuning (PEFT). However, while empirically effective, the resulting solutions lack generalisation guarantees to certify their accuracy, which may be required for ethical or legal reasons prior to deployment in high-importance applications. In this paper we develop a novel transfer learning approach that is designed to facilitate non-vacuous learning-theoretic generalisation guarantees for downstream tasks, even in the low-shot regime. Specifically, we first use upstream tasks to train a distribution over PEFT parameters. We then learn the downstream task by a sample-and-evaluate procedure: sampling plausible PEFTs from the trained diffusion model and selecting the one with the highest likelihood on the downstream data. Crucially, this confines our model hypothesis to a finite set of PEFT samples. In contrast to learning in the typical continuous hypothesis spaces of neural network weights, this facilitates tighter risk certificates. We instantiate our bound and show non-trivial generalisation guarantees compared to existing learning approaches which lead to vacuous bounds in the low-shot regime.
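For intuition on why a finite hypothesis set can help, the classical finite-class bound is sketched below. It is a textbook result (Hoeffding's inequality plus a union bound) and not necessarily the bound instantiated in the paper. The symbols are assumptions introduced here: $\mathcal{H}$ is a finite hypothesis set fixed before the downstream data is seen, $n$ is the number of i.i.d. downstream samples, the loss takes values in $[0,1]$, $R$ is the true risk, and $\hat{R}_n$ is the empirical risk.

```latex
% With probability at least 1 - \delta over the downstream sample,
% simultaneously for every h in the finite set \mathcal{H}:
\[
  R(h) \;\le\; \hat{R}_n(h) + \sqrt{\frac{\ln|\mathcal{H}| + \ln(1/\delta)}{2n}} .
\]
% Illustrative (made-up) numbers: |\mathcal{H}| = 1000, n = 50, \delta = 0.05:
\[
  \sqrt{\frac{\ln 1000 + \ln 20}{2 \cdot 50}}
  \approx \sqrt{\frac{6.908 + 2.996}{100}}
  \approx 0.315 .
\]
```

With these hypothetical numbers the complexity term stays well below 1, so the bound can be non-vacuous even with only 50 samples. The term grows only logarithmically in the number of candidate hypotheses.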

@inproceedings{Rezk2025_4_Model,
author = {Fady Rezk and Royson Lee and Henry Gouk and Timothy Hospedales and Minyoung Kim},
title = {Model Diffusion for Certifiable Few-shot Transfer Learning},
year = {2025},
month = {Apr},
booktitle = {Workshop on Neural Network Weights as a New Data Modality @ ICLR},
url = {https://openreview.net/forum?id=Bqekdoe5CK},
}

COP-GEN-Beta: Unified Generative Modelling of COPernicus Imagery Thumbnails

Foundation and Large Vision Models in Remote Sensing @ CVPR

Miguel Espinosa, Valerio Marsocci, Yuru Jia, Elliot J. Crowley, Mikolaj Czerkawski

In remote sensing, multi-modal data from various sensors capturing the same scene offers rich opportunities, but learning a unified representation across these modalities remains a significant challenge. Traditional methods have often been limited to single or dual-modality approaches. In this paper, we introduce COP-GEN-Beta, a generative diffusion model trained on optical, radar, and elevation data from the Major TOM dataset. What sets COP-GEN-Beta apart is its ability to map any subset of modalities to any other, enabling zero-shot modality translation after training. This is achieved through a sequence-based diffusion transformer, where each modality is controlled by its own timestep embedding. We extensively evaluate COP-GEN-Beta on thumbnail images from the Major TOM dataset, demonstrating its effectiveness in generating high-quality samples. Qualitative and quantitative evaluations validate the model's performance, highlighting its potential as a powerful pre-trained model for future remote sensing tasks.

@inproceedings{Espinosa2025_4_COPGENBeta,
author = {Miguel Espinosa and Valerio Marsocci and Yuru Jia and Elliot J. Crowley and Mikolaj Czerkawski},
title = {COP-GEN-Beta: Unified Generative Modelling of COPernicus Imagery Thumbnails},
year = {2025},
month = {Apr},
booktitle = {Foundation and Large Vision Models in Remote Sensing @ CVPR},
url = {https://arxiv.org/abs/2504.08548},
}

Is Limited Participant Diversity Impeding EEG-based Machine Learning?

arXiv

Philipp Bomatter, Henry Gouk

The application of machine learning (ML) to electroencephalography (EEG) has great potential to advance both neuroscientific research and clinical applications. However, the generalisability and robustness of EEG-based ML models often hinge on the amount and diversity of training data. It is common practice to split EEG recordings into small segments, thereby increasing the number of samples substantially compared to the number of individual recordings or participants. We conceptualise this as a multi-level data generation process and investigate the scaling behaviour of model performance with respect to the overall sample size and the participant diversity through large-scale empirical studies. We then use the same framework to investigate the effectiveness of different ML strategies designed to address limited data problems: data augmentations and self-supervised learning. Our findings show that model performance scaling can be severely constrained by participant distribution shifts and provide actionable guidance for data collection and ML research.

@unpublished{Bomatter2025_3_Is,
author = {Philipp Bomatter and Henry Gouk},
title = {Is Limited Participant Diversity Impeding EEG-based Machine Learning?},
year = {2025},
month = {Mar},
institution = {University of Edinburgh},
note = {arXiv preprint arXiv:2503.13497},
url = {https://arxiv.org/abs/2503.13497},
}

Strategic Classification with Randomised Classifiers

arXiv

Jack Geary, Henry Gouk

We consider the problem of strategic classification, where a learner must build a model to classify agents based on features that have been strategically modified. Previous work in this area has concentrated on the case when the learner is restricted to deterministic classifiers. In contrast, we perform a theoretical analysis of an extension to this setting that allows the learner to produce a randomised classifier. We show that, under certain conditions, the optimal randomised classifier can achieve better accuracy than the optimal deterministic classifier, but under no conditions can it be worse. When a finite set of training data is available, we show that the excess risk of Strategic Empirical Risk Minimisation over the class of randomised classifiers is bounded in a similar manner as the deterministic case. In both the deterministic and randomised cases, the risk of the classifier produced by the learner converges to that of the corresponding optimal classifier as the volume of available training data grows. Moreover, this convergence happens at the same rate as in the i.i.d. case. Our findings are compared with previous theoretical work analysing the problem of strategic classification. We conclude that randomisation has the potential to alleviate some issues that could be faced in practice without introducing any substantial downsides.

@unpublished{Geary2025_2_Strategic,
author = {Jack Geary and Henry Gouk},
title = {Strategic Classification with Randomised Classifiers},
year = {2025},
month = {Feb},
institution = {University of Edinburgh},
note = {arXiv preprint arXiv:2502.01313},
url = {https://arxiv.org/abs/2502.01313},
}

Performance of Zero-Shot Time Series Foundation Models on Cloud Data

I Can’t Believe It’s Not Better Workshop @ ICLR

William Toner, Thomas L. Lee, Artjom Joosem, Rajkarn Singh, Martin Asenov

Time series foundation models (FMs) have emerged as a popular paradigm for zero-shot multi-domain forecasting. FMs are trained on numerous diverse datasets and claim to be effective forecasters across multiple different time series domains, including cloud data. In this work we investigate this claim, exploring the effectiveness of FMs on cloud data. We demonstrate that many well-known FMs fail to generate meaningful or accurate zero-shot forecasts in this setting. We support this claim empirically, showing that FMs are outperformed consistently by simple linear baselines. We also illustrate a number of interesting pathologies, including instances where FMs suddenly output seemingly erratic, random-looking forecasts. Our results suggest a widespread failure of FMs to model cloud data.

@inproceedings{Toner2025_1_Performance,
author = {William Toner and Thomas L. Lee and Artjom Joosem and Rajkarn Singh and Martin Asenov},
title = {Performance of Zero-Shot Time Series Foundation Models on Cloud Data},
year = {2025},
month = {Jan},
booktitle = {I Can’t Believe It’s Not Better Workshop @ ICLR},
url = {https://arxiv.org/abs/2502.12944},
}

LLM-Personalize: Aligning LLM Planners with Human Preferences via Reinforced Self-Training for Housekeeping Robots

International Conference on Computational Linguistics (COLING)

Dongge Han, Trevor McInroe, Adam Jelley, Stefano V. Albrecht, Peter Bell, Amos Storkey

Large language models (LLMs) have shown significant potential for robotics applications, particularly task planning, by harnessing their language comprehension and text generation capabilities. However, in applications such as household robotics, a critical gap remains in the personalization of these models to individual user preferences. We introduce LLM-Personalize, a novel framework with an optimization pipeline designed to personalize LLM planners for household robotics. Our LLM-Personalize framework features an LLM planner that performs iterative planning in multi-room, partially-observable household scenarios, making use of a scene graph constructed with local observations. The generated plan consists of a sequence of high-level actions which are subsequently executed by a controller. Central to our approach is the optimization pipeline, which combines imitation learning and iterative self-training to personalize the LLM planner. In particular, the imitation learning phase performs initial LLM alignment from demonstrations, and bootstraps the model to facilitate effective iterative self-training, which further explores and aligns the model to user preferences. We evaluate LLM-Personalize on Housekeep, a challenging simulated real-world 3D benchmark for household rearrangements, and show that LLM-Personalize achieves more than a 30 percent increase in success rate over existing LLM planners, showcasing significantly improved alignment with human preferences.

@inproceedings{Han2024_12_LLMPersonalize,
author = {Dongge Han and Trevor McInroe and Adam Jelley and Stefano V. Albrecht and Peter Bell and Amos Storkey},
title = {LLM-Personalize: Aligning LLM Planners with Human Preferences via Reinforced Self-Training for Housekeeping Robots},
year = {2024},
month = {Dec},
booktitle = {International Conference on Computational Linguistics (COLING)},
url = {https://arxiv.org/abs/2404.14285},
}

einspace: Searching for Neural Architectures from Fundamental Operations

Advances in Neural Information Processing Systems (NeurIPS)

Linus Ericsson, Miguel Espinosa, Chenhongyi Yang, Antreas Antoniou, Amos Storkey, Shay B. Cohen, Steven McDonagh, Elliot J. Crowley

Neural architecture search (NAS) finds high performing networks for a given task. Yet the results of NAS are fairly prosaic; they did not e.g. create a shift from convolutional structures to transformers. This is not least because the search spaces in NAS often aren't diverse enough to include such transformations a priori. Instead, for NAS to provide greater potential for fundamental design shifts, we need a novel expressive search space design which is built from more fundamental operations. To this end, we introduce einspace, a search space based on a parameterised probabilistic context-free grammar. Our space is versatile, supporting architectures of various sizes and complexities, while also containing diverse network operations which allow it to model convolutions, attention components and more. It contains many existing competitive architectures, and provides flexibility for discovering new ones. Using this search space, we perform experiments to find novel architectures as well as improvements on existing ones on the diverse Unseen NAS datasets. We show that competitive architectures can be obtained by searching from scratch, and we consistently find large improvements when initialising the search with strong baselines. We believe that this work is an important advancement towards a transformative NAS paradigm where search space expressivity and strategic search initialisation play key roles.

@inproceedings{Ericsson2024_12_einspace,
author = {Linus Ericsson and Miguel Espinosa and Chenhongyi Yang and Antreas Antoniou and Amos Storkey and Shay B. Cohen and Steven McDonagh and Elliot J. Crowley},
title = {einspace: Searching for Neural Architectures from Fundamental Operations},
year = {2024},
month = {Dec},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
url = {https://arxiv.org/abs/2405.20838},
}

Diffusion for World Modeling: Visual Details Matter in Atari

Advances in Neural Information Processing Systems (NeurIPS)

Eloi Alonso*, Adam Jelley*, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, François Fleuret

World models constitute a promising approach for training reinforcement learning agents in a safe and sample-efficient manner. Recent world models predominantly operate on sequences of discrete latent variables to model environment dynamics. However, this compression into a compact discrete representation may ignore visual details that are important for reinforcement learning. Concurrently, diffusion models have become a dominant approach for image generation, challenging well-established methods modeling discrete latents. Motivated by this paradigm shift, we introduce DIAMOND (DIffusion As a Model Of eNvironment Dreams), a reinforcement learning agent trained in a diffusion world model. We analyze the key design choices that are required to make diffusion suitable for world modeling, and demonstrate how improved visual details can lead to improved agent performance. DIAMOND achieves a mean human normalized score of 1.46 on the competitive Atari 100k benchmark; a new best for agents trained entirely within a world model. To foster future research on diffusion for world modeling, we release our code, agents and playable world models at https://github.com/eloialonso/diamond.

@inproceedings{Alonso*2024_12_Diffusion,
author = {Eloi Alonso and Adam Jelley and Vincent Micheli and Anssi Kanervisto and Amos Storkey and Tim Pearce and François Fleuret},
title = {Diffusion for World Modeling: Visual Details Matter in {A}tari},
year = {2024},
month = {Dec},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
url = {https://arxiv.org/abs/2405.12399},
}

Generating Compositional Scenes via Text-to-image RGBA Instance Generation

Advances in Neural Information Processing Systems (NeurIPS)

Alessandro Fontanella, Petru-Daniel Tudosiu, Yongxin Yang, Shifeng Zhang, Sarah Parisot

Text-to-image diffusion generative models can generate high quality images at the cost of tedious prompt engineering. Controllability can be improved by introducing layout conditioning; however, existing methods lack layout editing ability and fine-grained control over object attributes. The concept of multi-layer generation holds great potential to address these limitations; however, generating image instances concurrently with scene composition limits control over fine-grained object attributes, relative positioning in 3D space and scene manipulation abilities. In this work, we propose a novel multi-stage generation paradigm that is designed for fine-grained control, flexibility and interactivity. To ensure control over instance attributes, we devise a novel training paradigm to adapt a diffusion model to generate isolated scene components as RGBA images with transparency information. To build complex images, we employ these pre-generated instances and introduce a multi-layer composite generation process that smoothly assembles components in realistic scenes. Our experiments show that our RGBA diffusion model is capable of generating diverse and high quality instances with precise control over object attributes. Through multi-layer composition, we demonstrate that our approach allows building and manipulating images from highly complex prompts with fine-grained control over object appearance and location, granting a higher degree of control than competing methods.

@inproceedings{Fontanella2024_12_Generating,
author = {Alessandro Fontanella and Petru-Daniel Tudosiu and Yongxin Yang and Shifeng Zhang and Sarah Parisot},
title = {Generating Compositional Scenes via Text-to-image RGBA Instance Generation},
year = {2024},
month = {Dec},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
url = {https://arxiv.org/abs/2411.10913},
}

Evaluating the evaluators: Are validation methods for few-shot learning fit for purpose?

Transactions on Machine Learning Research

Luísa Shimabucoro, Ruchika Chavhan, Timothy Hospedales, Henry Gouk

Numerous benchmarks for Few-Shot Learning have been proposed in the last decade. However, all of these benchmarks focus on performance averaged over many tasks, and the question of how to reliably evaluate and tune models trained for individual few-shot tasks has not been addressed. This paper presents the first investigation into task-level validation, a fundamental step when deploying a model. We measure the accuracy of performance estimators in the few-shot setting, consider strategies for model selection, and examine the reasons for the failure of evaluators usually thought of as being robust. We conclude that cross-validation with a low number of folds is the best choice for directly estimating the performance of a model, whereas using bootstrapping or cross-validation with a large number of folds is better for model selection purposes. Overall, we find that with current methods, benchmarks, and validation strategies, one cannot get a reliable picture of how effectively methods perform on individual tasks. However, we find that existing methods already provide enough information to enable selection of few-shot learners on a task-level basis.

@article{Shimabucoro2024_11_Evaluating,
author = {Luísa Shimabucoro and Ruchika Chavhan and Timothy Hospedales and Henry Gouk},
title = {Evaluating the evaluators: Are validation methods for few-shot learning fit for purpose?},
year = {2024},
month = {Nov},
journal = {Transactions on Machine Learning Research},
url = {https://openreview.net/forum?id=dKKY2mDEnD},
}

There is no SAMantics! Exploring SAM as a Backbone for Visual Understanding Tasks

arXiv

Miguel Espinosa, Chenhongyi Yang, Linus Ericsson, Steven McDonagh, Elliot J. Crowley

The Segment Anything Model (SAM) was originally designed for label-agnostic mask generation. Does this model also possess inherent semantic understanding, of value to broader visual tasks? In this work we follow a multi-staged approach towards exploring this question. We firstly quantify SAM's semantic capabilities by comparing base image encoder efficacy under classification tasks, in comparison with established models (CLIP and DINOv2). Our findings reveal a significant lack of semantic discriminability in SAM feature representations, limiting potential for tasks that require class differentiation. This initial result motivates our exploratory study that attempts to enable semantic information via in-context learning with lightweight fine-tuning where we observe that generalisability to unseen classes remains limited. Our observations culminate in the proposal of a training-free approach that leverages DINOv2 features, towards better endowing SAM with semantic understanding and achieving instance-level class differentiation through feature-based similarity. Our study suggests that incorporation of external semantic sources provides a promising direction for the enhancement of SAM's utility with respect to complex visual tasks that require semantic understanding.

@unpublished{Espinosa2024_11_There,
author = {Miguel Espinosa and Chenhongyi Yang and Linus Ericsson and Steven McDonagh and Elliot J. Crowley},
title = {There is no SAMantics! Exploring SAM as a Backbone for Visual Understanding Tasks},
year = {2024},
month = {Nov},
institution = {University of Edinburgh},
note = {arXiv preprint arXiv:2411.15288},
url = {https://arxiv.org/abs/2411.15288},
}

PlainMamba: Improving Non-Hierarchical Mamba in Visual Recognition

British Machine Vision Conference (BMVC)

Chenhongyi Yang, Zehui Chen, Miguel Espinosa, Linus Ericsson, Zhenyu Wang, Jiaming Liu, Elliot J. Crowley

We present PlainMamba: a simple non-hierarchical state space model (SSM) designed for general visual recognition. The recent Mamba model has shown how SSMs can be highly competitive with other architectures on sequential data and initial attempts have been made to apply it to images. In this paper, we further adapt the selective scanning process of Mamba to the visual domain, enhancing its ability to learn features from two-dimensional images by (i) a continuous 2D scanning process that improves spatial continuity by ensuring adjacency of tokens in the scanning sequence, and (ii) direction-aware updating which enables the model to discern the spatial relations of tokens by encoding directional information. Our architecture is designed to be easy to use and easy to scale, formed by stacking identical PlainMamba blocks, resulting in a model with constant width throughout all layers. The architecture is further simplified by removing the need for special tokens. We evaluate PlainMamba on a variety of visual recognition tasks including image classification, semantic segmentation, object detection, and instance segmentation. Our method achieves performance gains over previous non-hierarchical models and is competitive with hierarchical alternatives. For tasks requiring high-resolution inputs, in particular, PlainMamba requires much less computing while maintaining high performance. Code and models are available at https://github.com/ChenhongyiYang/PlainMamba

@inproceedings{Yang2024_11_PlainMamba,
author = {Chenhongyi Yang and Zehui Chen and Miguel Espinosa and Linus Ericsson and Zhenyu Wang and Jiaming Liu and Elliot J. Crowley},
title = {{P}lain{M}amba: Improving Non-Hierarchical Mamba in Visual Recognition},
year = {2024},
month = {Nov},
booktitle = {British Machine Vision Conference (BMVC)},
url = {https://arxiv.org/abs/2403.17695},
}

EgoPoseFormer: A Simple Baseline for Stereo Egocentric 3D Human Pose Estimation

European Conference on Computer Vision (ECCV)

Chenhongyi Yang, Anastasia Tkach, Shreyas Hampali, Linguang Zhang, Elliot J. Crowley, Cem Keskin

We present EgoPoseFormer, a simple yet effective transformer-based model for stereo egocentric human pose estimation. The main challenge in egocentric pose estimation is overcoming joint invisibility, which is caused by self-occlusion or a limited field of view (FOV) of head-mounted cameras. Our approach overcomes this challenge by incorporating a two-stage pose estimation paradigm: in the first stage, our model leverages the global information to estimate each joint's coarse location, then in the second stage, it employs a DETR-style transformer to refine the coarse locations by exploiting fine-grained stereo visual features. In addition, we present a deformable stereo operation to enable our transformer to effectively process multi-view features, which enables it to accurately localize each joint in the 3D world. We evaluate our method on the stereo UnrealEgo dataset and show it significantly outperforms previous approaches while being computationally efficient: it improves MPJPE by 27.4mm (45% improvement) with only 7.9% model parameters and 13.1% FLOPs compared to the state-of-the-art. Surprisingly, with proper training techniques, we find that even our first-stage pose proposal network can achieve superior performance compared to prior art. We also show that our method can be seamlessly extended to monocular settings, which achieves state-of-the-art performance on the SceneEgo dataset, improving MPJPE by 25.5mm (21% improvement) compared to the best existing method with only 60.7% model parameters and 36.4% FLOPs.

@inproceedings{Yang2024_10_EgoPoseFormer,
author = {Chenhongyi Yang and Anastasia Tkach and Shreyas Hampali and Linguang Zhang and Elliot J. Crowley and Cem Keskin},
title = {{E}go{P}ose{F}ormer: A Simple Baseline for Stereo Egocentric {3D} Human Pose Estimation},
year = {2024},
month = {Oct},
booktitle = {European Conference on Computer Vision (ECCV)},
url = {https://arxiv.org/abs/2403.18080},
}

WidthFormer: Toward Efficient Transformer-based BEV View Transformation

IEEE / RSJ International Conference on Intelligent Robots and Systems (IROS)

Chenhongyi Yang, Tianwei Lin, Lichao Huang, Elliot J. Crowley

In this work, we present WidthFormer, a novel transformer-based Bird's-Eye-View (BEV) 3D detection method tailored for real-time autonomous-driving applications. WidthFormer is computationally efficient, robust and does not require any special engineering effort to deploy. We propose a novel 3D positional encoding mechanism capable of accurately encapsulating 3D geometric information, which enables our model to generate high-quality BEV representations with only a single transformer decoder layer. This mechanism is also beneficial for existing sparse 3D object detectors. Inspired by recently proposed works, we further improve our model's efficiency by vertically compressing the image features when serving as attention keys and values. We also introduce two modules to compensate for potential information loss due to feature compression. Experimental evaluation on the widely-used nuScenes 3D object detection benchmark demonstrates that our method outperforms previous approaches across different 3D detection architectures. More importantly, our model is highly efficient. For example, when using 256×704 input images, it achieves 1.5 ms and 2.8 ms latency on NVIDIA 3090 GPU and Horizon Journey-5 edge computing chips, respectively. Furthermore, WidthFormer also exhibits strong robustness to different degrees of camera perturbations. Our study offers valuable insights into the deployment of BEV transformation methods in real-world, complex road environments. Code is available at https://github.com/ChenhongyiYang/WidthFormer.

@inproceedings{Yang2024_10_WidthFormer,
author = {Chenhongyi Yang and Tianwei Lin and Lichao Huang and Elliot J. Crowley},
title = {{W}idth{F}ormer: Toward Efficient Transformer-based BEV View Transformation},
year = {2024},
month = {Oct},
booktitle = {IEEE / RSJ International Conference on Intelligent Robots and Systems (IROS)},
url = {https://arxiv.org/abs/2401.03836},
}

Diffusion Models for Counterfactual Generation and Anomaly Detection in Brain Images

IEEE Transactions on Medical Imaging

Alessandro Fontanella, Grant Mair, Joanna Wardlaw, Emanuele Trucco, Amos Storkey

Segmentation masks of pathological areas are useful in many medical applications, such as brain tumour and stroke management. Moreover, healthy counterfactuals of diseased images can be used to enhance radiologists’ training files and to improve the interpretability of segmentation models. In this work, we present a weakly supervised method to generate a healthy version of a diseased image and then use it to obtain a pixel-wise anomaly map. To do so, we start by considering a saliency map that approximately covers the pathological areas, obtained with ACAT. Then, we propose a technique that allows targeted modifications to these regions, while preserving the rest of the image. In particular, we employ a diffusion model trained on healthy samples and combine Denoising Diffusion Probabilistic Model (DDPM) and Denoising Diffusion Implicit Model (DDIM) at each step of the sampling process. DDPM is used to modify the areas affected by a lesion within the saliency map, while DDIM guarantees reconstruction of the normal anatomy outside of it. The two parts are also fused at each timestep, to guarantee the generation of a sample with a coherent appearance and a seamless transition between edited and unedited parts. We verify that when our method is applied to healthy samples, the input images are reconstructed without significant modifications. We compare our approach with alternative weakly supervised methods on the task of brain lesion segmentation, achieving the highest mean Dice and IoU scores among the models considered.

@article{Fontanella2024_9_Diffusion,
author = {Alessandro Fontanella and Grant Mair and Joanna Wardlaw and Emanuele Trucco and Amos Storkey},
title = {Diffusion Models for Counterfactual Generation and Anomaly Detection in Brain Images},
year = {2024},
month = {Sep},
journal = {IEEE Transactions on Medical Imaging},
url = {https://arxiv.org/abs/2308.02062},
}

Selecting Pre-trained Models for Transfer Learning with Data-centric Meta-features

International Conference on Automated Machine Learning (AutoML Workshop Track)

Matt van den Nieuwenhuijzen, Carola Doerr, Henry Gouk, Jan N. van Rijn

When applying a neural network to address a new learning problem, it is common to not train the network from scratch, but instead start with a neural network that has already been trained on a related dataset, and then fine-tune this on the data of the target task. This poses the question: which pre-trained network should be selected? In this work, we investigate this problem in the context of three different dataset relationships: same-source, same-domain, and cross-domain. We utilize Meta-Album, which offers an extensive collection of datasets from various unrelated domains. We first split each of the 30 datasets of Meta-Album into a meta-train dataset and meta-test dataset, then create pre-trained models for each meta-train dataset, and finally compare the performances of the pre-trained models in a fine-tuning context when applied to meta-test tasks. We categorize the performances into the three dataset relationship groups and find that the same-source category has the best performance. Then, using meta-features of the meta-train dataset and meta-test tasks, we train statistical meta-models that are employed to select the best pre-trained model for a given meta-test task. Our best meta-model identifies the best-performing model in approximately 25% of cases. It improves upon a baseline that always selects the best average model by more than 30%.

@inproceedings{Nieuwenhuijzen2024_9_Selecting,
author = {Matt van den Nieuwenhuijzen and Carola Doerr and Henry Gouk and Jan N. van Rijn},
title = {Selecting Pre-trained Models for Transfer Learning with Data-centric Meta-features},
year = {2024},
month = {Sep},
booktitle = {International Conference on Automated Machine Learning (AutoML Workshop Track)},
url = {https://openreview.net/forum?id=W92wb1TBWd},
}

Automated Prior Elicitation from Large Language Models for Bayesian Logistic Regression

International Conference on Automated Machine Learning (AutoML Workshop Track)

Henry Gouk, Boyan Gao

We investigate how one can automatically retrieve prior knowledge and use it to improve the sample efficiency of training linear models. This is addressed using the Bayesian formulation of logistic regression, which relies on the specification of a prior distribution that accurately captures the belief the data analyst, or an associated domain expert, has about the values of the model parameters before having seen any data. We develop a broadly applicable strategy for crafting informative priors through the use of Large Language Models (LLMs). The method relies on generating synthetic data using the LLM, and then modelling the distribution over labels that the LLM associates with the generated data. In contrast to existing methods, the proposed approach does not require a substantial time investment from a domain expert and has the potential to leverage access to a much broader range of information. Moreover, our method is straightforward to implement, requiring only the ability to make black-box queries of a pre-trained LLM. The experimental evaluation demonstrates that the proposed approach can have a substantial benefit in some situations, at times achieving an absolute improvement of more than 10% accuracy in the severely data-scarce regime. We show that such gains can be had even when only a small volume of information is elicited from the LLM.

@inproceedings{Gouk2024_9_Automated,
author = {Henry Gouk and Boyan Gao},
title = {Automated Prior Elicitation from Large Language Models for Bayesian Logistic Regression},
year = {2024},
month = {Sep},
booktitle = {International Conference on Automated Machine Learning (AutoML Workshop Track)},
url = {https://openreview.net/forum?id=euLzlnU7gz},
}

DLAS: A Conceptual Model for Across-Stack Deep Learning Acceleration

ACM Transactions on Architecture and Code Optimization

Perry Gibson, José Cano, Elliot J. Crowley, Amos Storkey, Michael O'Boyle

Deep Neural Networks (DNNs) are very computationally demanding, which presents a significant barrier to their deployment, especially on resource-constrained devices. Significant work from both the machine learning and computing systems communities has attempted to accelerate DNNs. However, the number of techniques available and the required domain knowledge for their exploration continues to grow, making design space exploration (DSE) increasingly difficult. To unify the perspectives from these two communities, this paper introduces the Deep Learning Acceleration Stack (DLAS), a conceptual model for DNN deployment and acceleration. We adopt a six-layer representation that organizes and illustrates the key areas for DNN acceleration, from machine learning to software and computer architecture. We argue that the DLAS model balances simplicity and expressiveness, assisting practitioners from various domains in tackling co-design acceleration challenges. We demonstrate the interdependence of the DLAS layers, and thus the need for co-design, through an across-stack perturbation study, using a modified tensor compiler to generate experiments for combinations of a few parameters across the DLAS layers. Our perturbation study assesses the impact on inference time and accuracy when varying DLAS parameters across two datasets, seven popular DNN architectures, four compression techniques, three algorithmic primitives (with sparse and dense variants), untuned and auto-scheduled code generation, and four hardware platforms. The study observes significant changes in the relative performance of design choices with the introduction of new DLAS parameters (e.g., the fastest algorithmic primitive varies with the level of quantization). Given the strong evidence for the need for co-design, and the high costs of DSE, DLAS offers a valuable conceptual model for better exploring advanced co-designed accelerated deep learning solutions.

@article{Gibson2024_9_DLAS,
author = {Perry Gibson and José Cano and Elliot J. Crowley and Amos Storkey and Michael O'Boyle},
title = {{DLAS}: A Conceptual Model for Across-Stack Deep Learning Acceleration},
year = {2024},
month = {Sep},
journal = {ACM Transactions on Architecture and Code Optimization},
url = {https://dl.acm.org/doi/10.1145/3688609},
}

Liouna: Biologically Plausible Learning for Efficient Pre-Training of Transferrable Deep Models

ICML Workshop on Advancing Neural Network Training: Computational Efficiency, Scalability, and Resource Optimization

Fady Rezk, Antreas Antoniou, Henry Gouk, Timothy Hospedales

Biologically plausible learning algorithms, inspired by the inherent constraints of biological neural systems, offer a promising path towards communication- and memory-efficient learning with extreme parallelizability, where the learning of each layer is decoupled so that layers can train in parallel. In this work, we introduce Liouna (Arabic for "plasticity"), an unsupervised biologically plausible local learning algorithm inspired by predictive coding and masked image modelling. We derive Liouna's update rule, which elegantly reduces to a simple Hebbian rule with subtractive inhibition. We establish new state-of-the-art results for local learning rules across CIFAR-10, CIFAR-100, STL-10, and Imagenette, without imposing training procedures that hinder the attainability of the true benefits of local learning. Remarkably, we discover and demonstrate an emergent behaviour in Liouna, where it learns inter-class similarity and separability through feature sharing and specialization, despite observing no labels during training. Notably, we are the first to study the transfer performance of local learning algorithms. By pre-training on unlabelled data, Liouna outperforms previous state-of-the-art methods on 6 out of 8 downstream tasks and even surpasses end-to-end (E2E) supervised training in the low compute regime. Liouna also demonstrates competitive performance with SimCLR pre-trained models in the resource-limited pre-training scenario. This highlights Liouna's potential for efficient transfer learning and/or acceleration of the initial stages of pre-training, improving its convergence rates in wall-clock time.

@inproceedings{Rezk2024_7_Liouna,
author = {Fady Rezk and Antreas Antoniou and Henry Gouk and Timothy Hospedales},
title = {Liouna: Biologically Plausible Learning for Efficient Pre-Training of Transferrable Deep Models},
year = {2024},
month = {Jul},
booktitle = {ICML Workshop on Advancing Neural Network Training: Computational Efficiency, Scalability, and Resource Optimization},
url = {https://openreview.net/forum?id=bYwg5Awx6n},
}

Skin Malignancy Classification Using Patients' Skin Images and Meta-Data: Multimodal Fusion for Improving Fairness

Medical Imaging with Deep Learning

Ke Wang, Ningyuan Shan, Henry Gouk, Iris Szu-Szu Ho

Skin cancer image classification across skin tones is a challenging problem due to the fact that skin cancer can present differently on different skin tones. This study evaluates the performance of image only models and fusion models in skin malignancy classification. The fusion models we consider are able to take in additional patient data, such as an indicator of their skin tone, and merge this information with the features provided by the image-only model. Results from the experiment show that fusion models perform substantially better than image-only models. In particular, we find that a form of multiplicative fusion results in the best performing models. This finding suggests that skin tones add predictive value in skin malignancy prediction problems. We further demonstrate that feature fusion methods reduce, but do not entirely eliminate, the disparity in performance of the model on patients with different skin tones.

@inproceedings{Wang2024_7_Skin,
author = {Ke Wang and Ningyuan Shan and Henry Gouk and Iris Szu-Szu Ho},
title = {Skin Malignancy Classification Using Patients' Skin Images and Meta-Data: Multimodal Fusion for Improving Fairness},
year = {2024},
month = {Jul},
booktitle = {Medical Imaging with Deep Learning},
url = {https://openreview.net/forum?id=5TWfxGVFWc},
}

Chunking: Continual Learning is not just about Distribution Shift

Third Conference on Lifelong Learning Agents (CoLLAs 2024)

Thomas L. Lee, Amos Storkey

Work on continual learning (CL) has thus far largely focused on the problems arising from shifts in the data distribution. However, CL can be decomposed into two sub-problems: (a) shifts in the data distribution, and (b) dealing with the fact that the data is split into chunks and so only a part of the data is available to be trained on at any point in time. In this work, we look at the latter sub-problem, the chunking of data. We show that chunking is an important part of CL, accounting for around half of the performance drop from offline learning in our experiments. Furthermore, our results reveal that current CL algorithms do not address the chunking sub-problem, only performing as well as plain SGD training when there is no shift in the data distribution. Therefore, we show that chunking is both an important and currently unaddressed sub-problem and until it is addressed CL methods will be capped in performance. Additionally, we analyse why performance drops when learning occurs on identically distributed chunks of data, and find that forgetting, which is often seen to be a problem due to distribution shift, still arises and is a significant problem. Motivated by an analysis of the linear case, we show that performance on the chunking sub-problem can be increased by using per-chunk weight averaging and that this performance transfers to the full CL setting, where there is distribution shift. Hence, we argue that work on chunking can help advance CL in general.
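
A minimal sketch of per-chunk weight averaging in a linear setting follows. The abstract does not specify the averaging scheme, so the uniform mean over end-of-chunk weights, the scalar model, the learning rate and the synthetic data are all assumptions made for illustration rather than the paper's method.

```python
import random


def sgd_on_chunk(w, chunk, lr=0.05, epochs=20):
    """Run plain SGD for a scalar linear model y = w * x on one chunk only."""
    for _ in range(epochs):
        for x, y in chunk:
            grad = 2.0 * (w * x - y) * x  # derivative of the squared error
            w -= lr * grad
    return w


def train_with_chunk_averaging(chunks, w0=0.0):
    """Train chunk by chunk; return the last weight and the mean of per-chunk weights."""
    w, end_of_chunk_weights = w0, []
    for chunk in chunks:
        w = sgd_on_chunk(w, chunk)  # earlier chunks are no longer available
        end_of_chunk_weights.append(w)
    averaged = sum(end_of_chunk_weights) / len(end_of_chunk_weights)
    return w, averaged


# Hypothetical identically distributed stream, split into 10 chunks of 20 points.
rng = random.Random(0)
true_w = 2.0
data = []
for _ in range(200):
    x = rng.uniform(-1.0, 1.0)
    data.append((x, true_w * x + rng.gauss(0.0, 0.5)))
chunks = [data[i:i + 20] for i in range(0, len(data), 20)]
last_w, avg_w = train_with_chunk_averaging(chunks)
```

The last weight mostly reflects the final chunk, while the averaged weight pools information from every chunk, which is the intuition behind averaging as a remedy for forgetting in this setting.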

@inproceedings{Lee2024_7_Chunking,
author = {Thomas L. Lee and Amos Storkey},
title = {Chunking: Continual Learning is not just about Distribution Shift},
year = {2024},
month = {Jul},
booktitle = {Third Conference on Lifelong Learning Agents (CoLLAs 2024)},
url = {https://arxiv.org/abs/2310.02206},
}

Planning to Go Out-of-Distribution in Offline-to-Online Reinforcement Learning

Reinforcement Learning Conference (RLC)

Trevor McInroe, Adam Jelley, Stefano V. Albrecht, Amos Storkey

Offline pretraining with a static dataset followed by online fine-tuning (offline-to-online, or OtO) is a paradigm well matched to a real-world RL deployment process. In this scenario, we aim to find the best-performing policy within a limited budget of online interactions. Previous work in the OtO setting has focused on correcting for bias introduced by the policy-constraint mechanisms of offline RL algorithms. Such constraints keep the learned policy close to the behavior policy that collected the dataset, but we show this can unnecessarily limit policy performance if the behavior policy is far from optimal. Instead, we forgo constraints and frame OtO RL as an exploration problem that aims to maximize the benefit of online data-collection. We first study the major online RL exploration methods based on intrinsic rewards and UCB in the OtO setting, showing that intrinsic rewards add training instability through reward-function modification, and UCB methods are myopic and it is unclear which learned-component's ensemble to use for action selection. We then introduce an algorithm for planning to go out-of-distribution (PTGOOD) that avoids these issues. PTGOOD uses a non-myopic planning procedure that targets exploration in relatively high-reward regions of the state-action space unlikely to be visited by the behavior policy. By leveraging concepts from the Conditional Entropy Bottleneck, PTGOOD encourages data collected online to provide new information relevant to improving the final deployment policy without altering rewards. We show empirically in several continuous control tasks that PTGOOD significantly improves agent returns during online fine-tuning and avoids the suboptimal policy convergence that many of our baselines exhibit in several environments.

@inproceedings{McInroe2024_6_Planning,
author = {Trevor McInroe and Adam Jelley and Stefano V. Albrecht and Amos Storkey},
title = {Planning to Go Out-of-Distribution in Offline-to-Online Reinforcement Learning},
year = {2024},
month = {Jun},
booktitle = {Reinforcement Learning Conference (RLC)},
url = {https://arxiv.org/abs/2310.05723},
}

Plug and Play Active Learning for Object Detection

IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR)

Chenhongyi Yang, Lichao Huang, Elliot J. Crowley

Annotating data for supervised learning is expensive and tedious, and we want to do as little of it as possible. To make the most of a given "annotation budget" we can turn to active learning (AL) which aims to identify the most informative samples in a dataset for annotation. Active learning algorithms are typically uncertainty-based or diversity-based. Both have seen success in image classification, but fall short when it comes to object detection. We hypothesise that this is because: (1) it is difficult to quantify uncertainty for object detection as it consists of both localisation and classification, where some classes are harder to localise, and others are harder to classify; (2) it is difficult to measure similarities for diversity-based AL when images contain different numbers of objects. We propose a two-stage active learning algorithm Plug and Play Active Learning (PPAL) that overcomes these difficulties. It consists of (1) Difficulty Calibrated Uncertainty Sampling, in which we use a category-wise difficulty coefficient that takes both classification and localisation into account to re-weight object uncertainties for uncertainty-based sampling; (2) Category Conditioned Matching Similarity to compute the similarities of multi-instance images as ensembles of their instance similarities. PPAL is highly generalisable because it makes no change to model architectures or detector training pipelines. We benchmark PPAL on the MS-COCO and Pascal VOC datasets using different detector architectures and show that our method outperforms the prior state-of-the-art. Code is available at https://github.com/ChenhongyiYang/PPAL

@inproceedings{Yang2024_6_Plug,
author = {Chenhongyi Yang and Lichao Huang and Elliot J. Crowley},
title = {Plug and Play Active Learning for Object Detection},
year = {2024},
month = {Jun},
booktitle = {IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR)},
url = {https://arxiv.org/abs/2211.11612},
}

On the Limitations of General Purpose Domain Generalisation Methods

Henry Gouk, Ondrej Bohdal, Da Li, Timothy Hospedales

We investigate the fundamental performance limitations of learning algorithms in several Domain Generalisation (DG) settings. Motivated by the difficulty that previously proposed methods have in reliably outperforming Empirical Risk Minimisation (ERM), we derive upper bounds on the excess risk of ERM, and lower bounds on the minimax excess risk. Our findings show that in all the DG settings we consider, it is not possible to significantly outperform ERM. Our conclusions hold not only in the standard covariate shift setting, but also in two other settings with additional restrictions on how domains can differ. The first constrains all domains to have a non-trivial bound on pairwise distances, as measured by a broad class of integral probability metrics. The second alternate setting considers a restricted class of DG problems where all domains have the same underlying support. Our analysis also suggests how different strategies can be used to optimise the performance of ERM in each of these DG settings. We also experimentally explore hypotheses suggested by our theoretical analysis.

@unpublished{Gouk2024_5_On,
author = {Henry Gouk and Ondrej Bohdal and Da Li and Timothy Hospedales},
title = {On the Limitations of General Purpose Domain Generalisation Methods},
year = {2024},
month = {May},
institution = {University of Edinburgh},
url = {https://arxiv.org/abs/2202.00563},
}

Approximate Bayesian Class-Conditional Models under Continuous Representation Shift

International Conference on Artificial Intelligence and Statistics (AISTATS 2024)

Thomas L. Lee, Amos Storkey

For models consisting of a classifier in some representation space, learning online from a non-stationary data stream often necessitates changes in the representation. So, the question arises of what is the best way to adapt the classifier to shifts in representation. Current methods adapt the classifier only slowly to representation shift, introducing noise into learning as the classifier is misaligned to the representation. We propose DeepCCG, an empirical Bayesian approach to solve this problem. DeepCCG works by updating the posterior of a class conditional Gaussian classifier such that the classifier adapts in one step to representation shift. The use of a class conditional Gaussian classifier also enables DeepCCG to use a log conditional marginal likelihood loss to update the representation. To perform the update to the classifier and representation, DeepCCG maintains a fixed number of examples in memory and so a key part of DeepCCG is selecting what examples to store, choosing the subset that minimises the KL divergence between the true posterior and the posterior induced by the subset. We explore the behaviour of DeepCCG in online continual learning (CL), demonstrating that it performs well against a spectrum of online CL methods and that it reduces the change in performance due to representation shift.
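
The idea of a classifier that re-aligns to a shifted representation in a single step can be sketched as follows. This is a deliberate simplification and not DeepCCG itself: it uses point estimates of class means (equivalent to class-conditional Gaussians with a shared isotropic covariance and uniform class prior) instead of a Bayesian posterior, and the helper names and toy embeddings are assumptions.

```python
def class_means(memory):
    """Recompute per-class mean embeddings from stored (embedding, label) pairs."""
    sums, counts = {}, {}
    for z, label in memory:
        acc = sums.setdefault(label, [0.0] * len(z))
        for i, v in enumerate(z):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}


def predict(z, means):
    """Pick the class whose mean is closest in squared Euclidean distance."""
    def sq_dist(m):
        return sum((a - b) ** 2 for a, b in zip(z, m))
    return min(means, key=lambda c: sq_dist(means[c]))


# Hypothetical memory of embeddings under the old representation.
memory_old = [([0.0, 0.0], "a"), ([0.2, 0.0], "a"),
              ([1.0, 1.0], "b"), ([1.2, 1.0], "b")]
means_old = class_means(memory_old)

# Simulated representation shift: every embedding moves by +5 on the first axis.
memory_new = [([z[0] + 5.0, z[1]], y) for z, y in memory_old]
means_new = class_means(memory_new)  # one-step re-estimation from memory

stale = predict([5.1, 0.0], means_old)    # classifier not yet adapted
adapted = predict([5.1, 0.0], means_new)  # classifier recomputed after the shift
```

Because the class statistics are recomputed directly from the re-embedded memory, no gradient steps are needed for the classifier to catch up with the representation.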

@inproceedings{Lee2024_5_Approximate,
author = {Thomas L. Lee and Amos Storkey},
title = {Approximate {B}ayesian Class-Conditional Models under Continuous Representation Shift},
year = {2024},
month = {May},
booktitle = {International Conference on Artificial Intelligence and Statistics (AISTATS 2024)},
url = {https://arxiv.org/abs/2305.19076},
}

DAM: Towards a Foundation Model for Forecasting

International Conference on Learning Representations (ICLR)

Luke Darlow, Qiwen Deng, Ahmed Hassan, Martin Asenov, Rajkarn Singh, Artjom Joosen, Adam Barker, Amos Storkey

It is challenging to scale time series forecasting models such that they forecast accurately for multiple distinct domains and datasets, all with potentially different underlying collection procedures (e.g., sample resolution), patterns (e.g., periodicity), and prediction requirements (e.g., reconstruction vs. forecasting). We call this general task universal forecasting. Existing methods usually assume that input data is regularly sampled, and they forecast to pre-determined horizons, resulting in failure to generalise outside of the scope of their training. We propose the DAM – a neural model that takes randomly sampled histories and outputs an adjustable basis composition as a continuous function of time for forecasting to non-fixed horizons. It involves three key components: (1) a flexible approach for using randomly sampled histories from a long-tail distribution, that enables an efficient global perspective of the underlying temporal dynamics while retaining focus on the recent history; (2) a transformer backbone that is trained on these actively sampled histories to produce, as representational output, (3) the basis coefficients of a continuous function of time. We show that a single univariate DAM, trained on 25 time series datasets, either outperformed or closely matched existing SoTA models at multivariate long-term forecasting across 18 datasets, including 8 held-out for zero-shot transfer, even though these models were trained to specialise for each dataset-horizon combination. This single DAM excels at zero-shot transfer and very-long-term forecasting, performs well at imputation, is interpretable via basis function composition and attention, can be tuned for different inference-cost requirements, and is robust to missing and irregularly sampled data by design.

@inproceedings{Darlow2024_4_DAM,
author = {Luke Darlow and Qiwen Deng and Ahmed Hassan and Martin Asenov and Rajkarn Singh and Artjom Joosen and Adam Barker and Amos Storkey},
title = {{DAM}: Towards a Foundation Model for Forecasting},
year = {2024},
month = {Apr},
booktitle = {International Conference on Learning Representations (ICLR)},
url = {https://openreview.net/forum?id=4NhMhElWqP},
}

Hyperparameter Selection in Continual Learning

Thomas L. Lee, Sigrid Passano Hellan, Linus Ericsson, Elliot J. Crowley, Amos Storkey

In continual learning (CL) -- where a learner trains on a stream of data -- standard hyperparameter optimisation (HPO) cannot be applied, as a learner does not have access to all of the data at the same time. This has prompted the development of CL-specific HPO frameworks. The most popular way to tune hyperparameters in CL is to repeatedly train over the whole data stream with different hyperparameter settings. However, this end-of-training HPO is unrealistic as in practice a learner can only see the stream once. Hence, there is an open question: what HPO framework should a practitioner use for a CL problem in reality? This paper answers this question by evaluating several realistic HPO frameworks. We find that all the HPO frameworks considered, including end-of-training HPO, perform similarly. We therefore advocate using the realistic and most computationally efficient method: fitting the hyperparameters on the first task and then fixing them throughout training.

@unpublished{Lee2024_4_Hyperparameter,
author = {Thomas L. Lee and Sigrid Passano Hellan and Linus Ericsson and Elliot J. Crowley and Amos Storkey},
title = {Hyperparameter Selection in Continual Learning},
year = {2024},
month = {Apr},
institution = {University of Edinburgh},
url = {https://arxiv.org/abs/2404.06466},
}

Is Scaling Learned Optimizers Worth It? Evaluating The Value of VeLO's 4000 TPU Months

I Can't Believe It's Not Better! (NeurIPS Workshop)

Fady Rezk, Antreas Antoniou, Henry Gouk, Timothy Hospedales

We analyze VeLO (versatile learned optimizer), the largest scale attempt to train a general purpose "foundational" optimizer to date. VeLO was trained on thousands of machine learning tasks using over 4000 TPU months with the goal of producing an optimizer capable of generalizing to new problems while being hyperparameter free, and outperforming industry standards such as Adam. We independently evaluate VeLO on the MLCommons optimizer benchmark suite. We find that, contrary to initial claims: (1) VeLO has a critical hyperparameter that needs problem-specific tuning, (2) VeLO does not necessarily outperform competitors in quality of solution found, and (3) VeLO is not faster than competing optimizers at reducing the training loss. These observations call into question VeLO's generality and the value of the investment in training it.

@inproceedings{Rezk2023_12_Is,
author = {Fady Rezk and Antreas Antoniou and Henry Gouk and Timothy Hospedales},
title = {Is Scaling Learned Optimizers Worth It? Evaluating The Value of {VeLO}'s 4000 {TPU} Months},
year = {2023},
month = {Dec},
booktitle = {I Can't Believe It's Not Better! (NeurIPS Workshop)},
url = {https://arxiv.org/abs/2310.18191},
}

Generate Your Own Scotland: Satellite Image Generation Conditioned on Maps

NeurIPS 2023 Workshop on Diffusion Models

Miguel Espinosa, Elliot J. Crowley

Despite recent advancements in image generation, diffusion models still remain largely underexplored in Earth Observation. In this paper we show that state-of-the-art pretrained diffusion models can be conditioned on cartographic data to generate realistic satellite images. We provide two large datasets of paired OpenStreetMap images and satellite views over the region of Mainland Scotland and the Central Belt. We train a ControlNet model and qualitatively evaluate the results, demonstrating that both image quality and map fidelity are possible. Finally, we provide some insights on the opportunities and challenges of applying these models for remote sensing. Our model weights and code for creating the dataset are publicly available at https://github.com/miquel-espinosa/map-sat.

@inproceedings{Espinosa2023_12_Generate,
author = {Miguel Espinosa and Elliot J. Crowley},
title = {Generate Your Own {S}cotland: Satellite Image Generation Conditioned on Maps},
year = {2023},
month = {Dec},
booktitle = {NeurIPS 2023 Workshop on Diffusion Models},
url = {https://arxiv.org/abs/2308.16648},
}

Quality Diversity for Visual Pre-Training

International Conference on Computer Vision

Ruchika Chavhan, Henry Gouk, Da Li, Timothy Hospedales

Models pre-trained on large datasets such as ImageNet provide the de-facto standard for transfer learning, with both supervised and self-supervised approaches proving effective. However, emerging evidence suggests that any single pre-trained feature will not perform well on diverse downstream tasks. Each pre-training strategy encodes a certain inductive bias, which may suit some downstream tasks but not others. Notably, the augmentations used in both supervised and self-supervised training lead to features with high invariance to spatial and appearance transformations. This renders them sub-optimal for tasks that demand sensitivity to these factors. In this paper we develop a feature that better supports diverse downstream tasks by providing a diverse set of sensitivities and invariances. In particular, we are inspired by Quality-Diversity in evolution to define a pre-training objective that requires high quality yet diverse features, where diversity is defined in terms of transformation (in)variances. Our framework plugs in to both supervised and self-supervised pre-training, and produces a small ensemble of features. We further show how downstream tasks can easily and efficiently select their preferred (in)variances. Both empirical and theoretical analysis show the efficacy of our representation and transfer learning approach for diverse downstream tasks.

@inproceedings{Chavhan2023_10_Quality,
author = {Ruchika Chavhan and Henry Gouk and Da Li and Timothy Hospedales},
title = {Quality Diversity for Visual Pre-Training},
year = {2023},
month = {Oct},
booktitle = {International Conference on Computer Vision},
url = {http://openaccess.thecvf.com/content/ICCV2023/html/Chavhan_Quality_Diversity_for_Visual_Pre-Training_ICCV_2023_paper.html},
}

Deep Learning Detection of Diabetic Retinopathy in Scotland’s Diabetic Eye Screening Programme

Alan D Fleming, Joseph Mellor, Stuart J McGurnaghan, Luke A K Blackbourn, Keith A Goatman, Caroline Styles, Amos J Storkey, Paul M McKeigue, Helen M Colhoun

Background/Aims: Support vector machine-based automated grading (known as iGradingM) has been shown to be safe, cost-effective and robust in the diabetic retinopathy (DR) screening (DES) programme in Scotland. It triages screening episodes as gradable with no DR versus manual grading required. The study aim was to develop a deep learning-based autograder using images and gradings from DES and to compare its performance with that of iGradingM. Methods: Retinal images, quality assurance (QA) data and routine DR grades were obtained from national datasets in 179 944 patients for years 2006–2016. QA grades were available for 744 images. We developed a deep learning-based algorithm to detect whether either eye contained ungradable images or any DR. The sensitivity and specificity were evaluated against consensus QA grades and routine grades. Results: Images used in QA which were ungradable or with DR were detected by deep learning with better specificity compared with manual graders (p<0.001) and with iGradingM (p<0.001) at the same sensitivities. Any DR according to the DES final grade was detected with 89.19% (270 392/303 154) sensitivity and 77.41% (500 945/647 158) specificity. Observable disease and referable disease were detected with sensitivities of 96.58% (16 613/17 201) and 98.48% (22 600/22 948), respectively. Overall, 43.84% of screening episodes would require manual grading. Conclusion: A deep learning-based system for DR grading was evaluated in QA data and images from 11 years in 50% of people attending a national DR screening programme. The system could reduce the manual grading workload at the same sensitivity compared with the current automated grading system.
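
As a quick arithmetic check, the percentages quoted in the Results can be recomputed from the counts given in parentheses. The snippet below is only an illustrative sketch; the dictionary keys and the helper name `percent` are labels chosen here, not terms from the paper.

```python
def percent(numerator, denominator):
    """Return numerator / denominator as a percentage rounded to 2 places."""
    return round(100.0 * numerator / denominator, 2)


# (detected, total) pairs copied from the counts reported in the abstract.
counts = {
    "any_dr_sensitivity": (270392, 303154),
    "any_dr_specificity": (500945, 647158),
    "observable_sensitivity": (16613, 17201),
    "referable_sensitivity": (22600, 22948),
}
rates = {name: percent(n, d) for name, (n, d) in counts.items()}
for name, value in rates.items():
    print(f"{name}: {value}%")
```

Sensitivity here is the fraction of true cases that were detected, and specificity is the fraction of true non-cases that were correctly passed, so each rate is simply the reported count divided by its total.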

@article{Fleming2023_9_Deep,
author = {Alan D Fleming and Joseph Mellor and Stuart J McGurnaghan and Luke A K Blackbourn and Keith A Goatman and Caroline Styles and Amos J Storkey and Paul M McKeigue and Helen M Colhoun},
title = {Deep Learning Detection of Diabetic Retinopathy in {S}cotland’s Diabetic Eye Screening Programme},
year = {2023},
month = {Sep},
journal = {British Journal of Ophthalmology},
url = {https://bjo.bmj.com/content/early/2023/09/13/bjo-2023-323395},
}

Evaluating the Evaluators: Are Current Few-Shot Learning Benchmarks Fit for Purpose?

ICML Workshop on Data-Centric Machine Learning Research

Luisa Shimabucoro, Timothy Hospedales, Henry Gouk

Numerous benchmarks for Few-Shot Learning have been proposed in the last decade. However all of these benchmarks focus on performance averaged over many tasks, and the question of how to reliably evaluate and tune models trained for individual tasks in this regime has not been addressed. This paper presents the first investigation into task-level evaluation -- a fundamental step when deploying a model. We measure the accuracy of performance estimators in the few-shot setting, consider strategies for model selection, and examine the reasons for the failure of evaluators usually thought of as being robust. We conclude that cross-validation with a low number of folds is the best choice for directly estimating the performance of a model, whereas using bootstrapping or cross validation with a large number of folds is better for model selection purposes. Overall, we find that existing benchmarks for few-shot learning are not designed in such a way that one can get a reliable picture of how effectively methods can be used on individual tasks.
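
The task-level estimators discussed above can be made concrete with a minimal k-fold cross-validation sketch. The nearest-centroid classifier, the one-dimensional toy data and all function names are assumptions chosen for illustration; the paper's own experimental protocol is not reproduced here.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    base, extra = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_val_accuracy(xs, ys, k, fit, predict):
    """Estimate accuracy by training on k-1 folds and testing on the held-out fold."""
    correct = 0
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        correct += sum(predict(model, xs[i]) == ys[i] for i in held_out)
    return correct / len(xs)


def fit_centroids(xs, ys):
    """Nearest-centroid 'model': the mean feature value of each class."""
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return {y: sum(v) / len(v) for y, v in groups.items()}


def predict_centroid(model, x):
    return min(model, key=lambda y: abs(x - model[y]))


# Hypothetical few-shot task: 10 labelled examples, two well-separated classes.
xs = [0.0, 1.0, 0.1, 0.9, 0.2, 1.1, 0.3, 0.8, 0.1, 1.2]
ys = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
acc_2_fold = cross_val_accuracy(xs, ys, 2, fit_centroids, predict_centroid)
acc_5_fold = cross_val_accuracy(xs, ys, 5, fit_centroids, predict_centroid)
```

With so few examples, the number of folds changes how much data each model is trained on and how noisy the estimate is, which is the trade-off the abstract refers to when it distinguishes performance estimation from model selection.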

@inproceedings{Shimabucoro2023_7_Evaluating,
author = {Luisa Shimabucoro and Timothy Hospedales and Henry Gouk},
title = {Evaluating the Evaluators: Are Current Few-Shot Learning Benchmarks Fit for Purpose?},
year = {2023},
month = {Jul},
booktitle = {ICML Workshop on Data-Centric Machine Learning Research},
url = {https://arxiv.org/abs/2307.02732},
}

QuickQual: Lightweight, Convenient Retinal Image Quality Scoring with Off-the-Shelf Pretrained Models

Justin Engelmann, Amos Storkey, Miguel O. Bernabeu

Image quality remains a key problem for both traditional and deep learning (DL)-based approaches to retinal image analysis, but identifying poor quality images can be time consuming and subjective. Thus, automated methods for retinal image quality scoring (RIQS) are needed. The current state-of-the-art is MCFNet, composed of three Densenet121 backbones each operating in a different colour space. MCFNet, and the EyeQ dataset released by the same authors, were a huge step forward for RIQS. We present QuickQual, a simple approach to RIQS, consisting of a single off-the-shelf ImageNet-pretrained Densenet121 backbone plus a Support Vector Machine (SVM). QuickQual performs very well, setting a new state-of-the-art for EyeQ (Accuracy: 88.50% vs 88.00% for MCFNet; AUC: 0.9687 vs 0.9588). This suggests that RIQS can be solved with generic perceptual features learned on natural images, as opposed to requiring DL models trained on large amounts of fundus images. Additionally, we propose a Fixed Prior linearisation scheme that converts EyeQ from a 3-way classification to a continuous logistic regression task. For this task, we present a second model, QuickQual MEga Minified Estimator (QuickQual-MEME), that consists of only 10 parameters on top of an off-the-shelf Densenet121 and can distinguish between gradable and ungradable images with an accuracy of 89.18% (AUC: 0.9537). Code and model are available on GitHub. QuickQual is so lightweight that the entire inference code (and even the parameters for QuickQual-MEME) is already contained in this paper.

@unpublished{Engelmann2023_7_QuickQual,
author = {Justin Engelmann and Amos Storkey and Miguel O. Bernabeu},
title = {{Q}uick{Q}ual: Lightweight, Convenient Retinal Image Quality Scoring with Off-the-Shelf Pretrained Models},
year = {2023},
month = {Jul},
institution = {University of Edinburgh},
url = {https://arxiv.org/abs/2307.13646},
}

Efficient and Fully-Automatic Retinal Choroid Segmentation in OCT Through DL-based Distillation of a Hand-Crafted Pipeline

Jamie Burke, Justin Engelmann, Charlene Hamid, Megan Reid-Schachter, Tom Pearson, Dan Pugh, Neeraj Dhaun, Stuart King, Tom MacGillivray, Miguel O Bernabeu, Amos Storkey, Ian JC MacCormick

Retinal vascular phenotypes, derived from low-cost, non-invasive retinal imaging, have been linked to systemic conditions such as cardio-, neuro- and reno-vascular disease. Recent high-resolution optical coherence tomography (OCT) allows imaging of the choroidal microvasculature which could provide more information about vascular health that complements the superficial retinal vessels, which current vascular phenotypes are based on. Segmentation of the choroid in OCT is a key step in quantifying choroidal parameters like thickness and area. Gaussian Process Edge Tracing (GPET) is a promising, clinically validated method for this. However, GPET is semi-automatic and thus requires time-consuming manual interventions by specifically trained personnel which introduces subjectivity and limits the potential for analysing larger datasets or deploying GPET into clinical practice. We introduce DeepGPET, which distils GPET into a neural network to yield a fully-automatic and efficient choroidal segmentation method. DeepGPET achieves excellent agreement with GPET on data from 3 clinical studies (AUC=0.9994, Dice=0.9664; Pearson correlation of 0.8908 for choroidal thickness and 0.9082 for choroidal area), while reducing the mean processing time per image from 34.49s (15.09) to 1.25s (0.10) on a standard laptop CPU and removing all manual interventions. DeepGPET will be made available for researchers upon publication.

@unpublished{Burke2023_7_Efficient,
author = {Jamie Burke and Justin Engelmann and Charlene Hamid and Megan Reid-Schachter and Tom Pearson and Dan Pugh and Neeraj Dhaun and Stuart King and Tom MacGillivray and Miguel O Bernabeu and Amos Storkey and Ian JC MacCormick},
title = {Efficient and Fully-Automatic Retinal Choroid Segmentation in {OCT} Through DL-based Distillation of a Hand-Crafted Pipeline},
year = {2023},
month = {Jul},
institution = {University of Edinburgh},
url = {https://arxiv.org/abs/2307.00904},
}

ACAT: Adversarial Counterfactual Attention for Classification and Detection in Medical Imaging

International Conference on Machine Learning (ICML)

Alessandro Fontanella, Antreas Antoniou, Wenwen Li, Joanna Wardlaw, Grant Mair, Emanuele Trucco, Amos Storkey

In some medical imaging tasks and other settings where only small parts of the image are informative for the classification task, traditional CNNs can sometimes struggle to generalise. Manually annotated Regions of Interest (ROI) are sometimes used to isolate the most informative parts of the image. However, these are expensive to collect and may vary significantly across annotators. To overcome these issues, we propose a framework that employs saliency maps to obtain soft spatial attention masks that modulate the image features at different scales. We refer to our method as Adversarial Counterfactual Attention (ACAT). ACAT increases the baseline classification accuracy of lesions in brain CT scans from 71.39% to 72.55% and of COVID-19 related findings in lung CT scans from 67.71% to 70.84% and exceeds the performance of competing methods. We investigate the best way to generate the saliency maps employed in our architecture and propose a way to obtain them from adversarially generated counterfactual images. They are able to isolate the area of interest in brain and lung CT scans without using any manual annotations. In the task of localising the lesion location out of 6 possible regions, they obtain a score of 65.05% on brain CT scans, improving the score of 61.29% obtained with the best competing method.

@inproceedings{Fontanella2023_7_ACAT,
author = {Alessandro Fontanella and Antreas Antoniou and Wenwen Li and Joanna Wardlaw and Grant Mair and Emanuele Trucco and Amos Storkey},
title = {{ACAT}: Adversarial Counterfactual Attention for Classification and Detection in Medical Imaging},
year = {2023},
month = {Jul},
booktitle = {International Conference on Machine Learning (ICML)},
url = {https://arxiv.org/abs/2303.15421},
}

Meta Omnium: A Benchmark for General-Purpose Learning-to-Learn

Computer Vision and Pattern Recognition

Ondrej Bohdal, Yinbing Tian, Yongshuo Zong, Ruchika Chavhan, Da Li, Henry Gouk, Li Guo, Timothy Hospedales

Meta-learning and other approaches to few-shot learning are widely studied for image recognition, and are increasingly applied to other vision tasks such as pose estimation and dense prediction. This naturally raises the question of whether there is any few-shot meta-learning algorithm capable of generalizing across these diverse task types. To support the community in answering this question, we introduce Meta Omnium, a dataset-of-datasets spanning multiple vision tasks including recognition, keypoint localization, semantic segmentation and regression. We experiment with popular few-shot meta-learning baselines and analyze their ability to generalize across tasks and to transfer knowledge between them. Meta Omnium enables meta-learning researchers to evaluate model generalization to a much wider array of tasks than previously possible, and provides a single framework for evaluating meta-learners across a wide suite of vision applications in a consistent manner.

@inproceedings{Bohdal2023_6_Meta,
author = {Ondrej Bohdal and Yinbing Tian and Yongshuo Zong and Ruchika Chavhan and Da Li and Henry Gouk and Li Guo and Timothy Hospedales},
title = {Meta {O}mnium: A Benchmark for General-Purpose Learning-to-Learn},
year = {2023},
month = {Jun},
booktitle = {Computer Vision and Pattern Recognition},
url = {http://openaccess.thecvf.com/content/CVPR2023/html/Bohdal_Meta_Omnium_A_Benchmark_for_General-Purpose_Learning-To-Learn_CVPR_2023_paper.html},
}

Amortised Invariance Learning for Contrastive Self-Supervision

International Conference on Learning Representations

Ruchika Chavhan, Henry Gouk, Jan Stuehmer, Calum Heggan, Mehrdad Taghoobi, Timothy Hospedales

Contrastive self-supervised learning methods famously produce high quality transferable representations by learning invariances to different data augmentations. Invariances established during pre-training can be interpreted as strong inductive biases. However these may or may not be helpful, depending on if they match the invariance requirements of downstream tasks or not. This has led to several attempts to learn task-specific invariances during pre-training, however, these methods are highly compute intensive and tedious to train. We introduce the notion of amortised invariance learning for contrastive self supervision. In the pre-training stage, we parameterize the feature extractor by differentiable invariance hyper-parameters that control the invariances encoded by the representation. Then, for any downstream task, both linear readout and task-specific invariance requirements can be efficiently and effectively learned by gradient-descent. We evaluate the notion of amortised invariances for contrastive learning over two different modalities: vision and audio, on two widely-used contrastive learning methods in vision: SimCLR and MoCo-v2 with popular architectures like ResNets and Vision Transformers, and SimCLR with ResNet-18 for audio. We show that our amortised features provide a reliable way to learn diverse downstream tasks with different invariance requirements, while using a single feature and avoiding task-specific pre-training. This provides an exciting perspective that opens up new horizons in the field of general purpose representation learning.

@inproceedings{Chavhan2023_5_Amortised,
author = {Ruchika Chavhan and Henry Gouk and Jan Stuehmer and Calum Heggan and Mehrdad Taghoobi and Timothy Hospedales},
title = {Amortised Invariance Learning for Contrastive Self-Supervision},
year = {2023},
month = {May},
booktitle = {International Conference on Learning Representations},
url = {https://arxiv.org/abs/2302.12712},
}

Effectiveness of Debiasing Techniques: An Indigenous Qualitative Analysis

International Conference on Learning Representations (Tiny Papers Track)

Vithya Yogarajan, Gillian Dobbie, Henry Gouk

An indigenous perspective on the effectiveness of debiasing techniques for pre-trained language models (PLMs) is presented in this paper. The current techniques used to measure and debias PLMs are skewed towards the US racial biases and rely on pre-defined bias attributes (e.g. "black" vs "white"). Some require large datasets and further pre-training. Such techniques are not designed to capture the underrepresented indigenous populations in other countries, such as Māori in New Zealand. Local knowledge and understanding must be incorporated to ensure unbiased algorithms, especially when addressing a resource-restricted society.

@inproceedings{Yogarajan2023_5_Effectiveness,
author = {Vithya Yogarajan and Gillian Dobbie and Henry Gouk},
title = {Effectiveness of Debiasing Techniques: An Indigenous Qualitative Analysis},
year = {2023},
month = {May},
booktitle = {International Conference on Learning Representations (Tiny Papers Track)},
url = {https://arxiv.org/abs/2304.11094},
}

Contrastive Meta-Learning for Partially Observable Few-Shot Learning

International Conference on Learning Representations (ICLR)

Adam Jelley, Amos Storkey, Antreas Antoniou, Sam Devlin

Many contrastive and meta-learning approaches learn representations by identifying common features in multiple views. However, the formalism for these approaches generally assumes features to be shared across views to be captured coherently. We consider the problem of learning a unified representation from partial observations, where useful features may be present in only some of the views. We approach this through a probabilistic formalism enabling views to map to representations with different levels of uncertainty in different components; these views can then be integrated with one another through marginalisation over that uncertainty. Our approach, Partial Observation Experts Modelling (POEM), then enables us to meta-learn consistent representations from partial observations. We evaluate our approach on an adaptation of a comprehensive few-shot learning benchmark, Meta-Dataset, and demonstrate the benefits of POEM over other meta-learning methods at representation learning from partial observations. We further demonstrate the utility of POEM by meta-learning to represent an environment from partial views observed by an agent exploring the environment.

@inproceedings{Jelley2023_5_Contrastive,
author = {Adam Jelley and Amos Storkey and Antreas Antoniou and Sam Devlin},
title = {Contrastive Meta-Learning for Partially Observable Few-Shot Learning},
year = {2023},
month = {May},
booktitle = {International Conference on Learning Representations (ICLR)},
url = {https://arxiv.org/abs/2301.13136},
}

GPViT: A High Resolution Non-Hierarchical Vision Transformer with Group Propagation

International Conference on Learning Representations (ICLR)

Chenhongyi Yang, Jiarui Xu, Shalini De Mello, Elliot J. Crowley, Xiaolong Wang

We present the Group Propagation Vision Transformer (GPViT): a novel nonhierarchical (i.e. non-pyramidal) transformer model designed for general visual recognition with high-resolution features. High-resolution features (or tokens) are a natural fit for tasks that involve perceiving fine-grained details such as detection and segmentation, but exchanging global information between these features is expensive in memory and computation because of the way self-attention scales. We provide a highly efficient alternative Group Propagation Block (GP Block) to exchange global information. In each GP Block, features are first grouped together by a fixed number of learnable group tokens; we then perform Group Propagation where global information is exchanged between the grouped features; finally, global information in the updated grouped features is returned back to the image features through a transformer decoder. We evaluate GPViT on a variety of visual recognition tasks including image classification, semantic segmentation, object detection, and instance segmentation. Our method achieves significant performance gains over previous works across all tasks, especially on tasks that require high-resolution outputs, for example, our GPViT-L3 outperforms Swin Transformer-B by 2.0 mIoU on ADE20K semantic segmentation with only half as many parameters. Code and pre-trained models are available at https://github.com/ChenhongyiYang/GPViT

@inproceedings{Yang2023_5_GPViT,
author = {Chenhongyi Yang and Jiarui Xu and Shalini De Mello and Elliot J. Crowley and Xiaolong Wang},
title = {{GPViT}: A High Resolution Non-Hierarchical Vision Transformer with Group Propagation},
year = {2023},
month = {May},
booktitle = {International Conference on Learning Representations (ICLR)},
url = {https://arxiv.org/abs/2212.06795},
}

Can Deep Learning on Retinal Images Augment Known Risk Factors for Cardiovascular Disease Prediction in Diabetes? A Prospective Cohort Study from the National Screening Programme in Scotland

International Journal of Medical Informatics

Joseph Mellor, Wenhua Jiang, Alan Fleming, Stuart McGurnaghan, Luke Blackbourn, Caroline Styles, Amos J Storkey, Paul M McKeigue, Helen M Colhoun, Scottish Diabetes Research Network Epidemiology Group

"Aims: This study's objective was to evaluate whether deep learning (DL) on retinal photographs from a diabetic retinopathy screening programme improve prediction of incident cardiovascular disease (CVD). Methods: DL models were trained to jointly predict future CVD risk and CVD risk factors and used to output a DL score. Poisson regression models including clinical risk factors with and without a DL score were fitted to study cohorts with 2,072 and 38,730 incident CVD events in type 1 (T1DM) and type 2 diabetes (T2DM) respectively. Results: DL scores were independently associated with incident CVD with adjusted standardised incidence rate ratios of 1.14 (P = 3 × 10-04 95 % CI (1.06, 1.23)) and 1.16 (P = 4 × 10-33 95 % CI (1.13, 1.18)) in T1DM and T2DM cohorts respectively. The differences in predictive performance between models with and without a DL score were statistically significant (differences in test log-likelihood 6.7 and 51.1 natural log units) but the increments in C-statistics from 0.820 to 0.822 and from 0.709 to 0.711 for T1DM and T2DM respectively, were small. Conclusions: These results show that in people with diabetes, retinal photographs contain information on future CVD risk. However for this to contribute appreciably to clinical prediction of CVD further approaches, including exploitation of serial images, need to be evaluated."

@article{Mellor2023_4_Can,
author = {Joseph Mellor and Wenhua Jiang and Alan Fleming and Stuart McGurnaghan and Luke Blackbourn and Caroline Styles and Amos J Storkey and Paul M McKeigue and Helen M Colhoun and Scottish Diabetes Research Network Epidemiology Group},
title = {Can Deep Learning on Retinal Images Augment Known Risk Factors for Cardiovascular Disease Prediction in Diabetes? A Prospective Cohort Study from the National Screening Programme in {S}cotland},
year = {2023},
month = {Apr},
journal = {International Journal of Medical Informatics},
volume = {},
url = {https://pubmed.ncbi.nlm.nih.gov/37167840/},
}

Adversarial Robustness of β-VAE Through the Lens of Local Geometry

International Conference on Artificial Intelligence and Statistics (AISTATS)

Asif Khan, Amos Storkey

Variational autoencoders (VAEs) are susceptible to adversarial attacks. An adversary can find a small perturbation in the input sample to change its latent encoding non-smoothly, thereby compromising the reconstruction. A known reason for such vulnerability is the latent space distortions arising from a mismatch between approximated latent posterior and a prior distribution. Consequently, a slight change in the inputs leads to a significant change in the latent space encodings. This paper demonstrates that the sensitivity around a data point is due to a directional bias of a stochastic pullback metric tensor induced by the encoder network. The pullback metric tensor measures the infinitesimal volume change from input to latent space. Thus, it can be viewed as a lens to analyse the effect of small changes in the input leading to distortions in the latent space. We propose robustness evaluation scores using the eigenspectrum of a pullback metric. Moreover, we empirically show that the scores correlate with the robustness parameter β of the β-VAE.

@inproceedings{Khan2023_4_Adversarial,
author = {Asif Khan and Amos Storkey},
title = {Adversarial Robustness of {$\beta$-VAE} Through the Lens of Local Geometry},
year = {2023},
month = {Apr},
booktitle = {International Conference on Artificial Intelligence and Statistics (AISTATS)},
url = {https://arxiv.org/abs/2208.03923},
}

Detection of Multiple Retinal Diseases in Ultra-Widefield Fundus Images using Deep Learning: Data-driven Identification of Relevant Regions

Nature Machine Intelligence

Justin Engelmann, Alice D. McTrusty, Ian J. C. MacCormick, Emma Pead, Amos Storkey, Miguel O. Bernabeu

Ultra-widefield (UWF) imaging is a promising modality that captures a larger retinal field of view compared to traditional fundus photography. Previous studies showed that deep learning (DL) models are effective for detecting retinal disease in UWF images, but primarily considered individual diseases under less-than-realistic conditions (excluding images with other diseases, artefacts, comorbidities, or borderline cases; and balancing healthy and diseased images) and did not systematically investigate which regions of the UWF images are relevant for disease detection. We first improve on the state of the field by proposing a DL model that can recognise multiple retinal diseases under more realistic conditions. We then use global explainability methods to identify which regions of the UWF images the model generally attends to. Our model performs very well, separating between healthy and diseased retinas with an area under the curve (AUC) of 0.9206 on an internal test set, and an AUC of 0.9841 on a challenging, external test set. When diagnosing specific diseases, the model attends to regions where we would expect those diseases to occur. We further identify the posterior pole as the most important region in a purely data-driven fashion. Surprisingly, 10% of the image around the posterior pole is sufficient for achieving comparable performance to having the full images available.

@article{Engelmann2022_12_Detection,
author = {Justin Engelmann and Alice D. McTrusty and Ian J. C. MacCormick and Emma Pead and Amos Storkey and Miguel O. Bernabeu},
title = {Detection of Multiple Retinal Diseases in Ultra-Widefield Fundus Images using Deep Learning: Data-driven Identification of Relevant Regions},
year = {2022},
month = {Dec},
journal = {Nature Machine Intelligence},
volume = {},
url = {https://arxiv.org/abs/2203.06113},
}

Hamiltonian Latent Operators for Content and Motion Disentanglement in Image Sequences

Advances in Neural Information Processing Systems

Asif Khan, Amos Storkey

We present a deep latent variable model for high dimensional sequential data. Our model factorises the latent space into content and motion variables. To model the diverse dynamics, we split the motion space into subspaces, and introduce a unique Hamiltonian operator for each subspace. The Hamiltonian formulation provides reversible dynamics that learn to constrain the motion path to conserve invariant properties. The explicit split of the motion space decomposes the Hamiltonian into symmetry groups and gives long-term separability of the dynamics. This split also means representations can be learnt that are easy to interpret and control. We demonstrate the utility of our model for swapping the motion of two videos, generating sequences of various actions from a given image and unconditional sequence generation.

@inproceedings{Khan2022_12_Hamiltonian,
author = {Asif Khan and Amos Storkey},
title = {{H}amiltonian Latent Operators for Content and Motion Disentanglement in Image Sequences},
year = {2022},
month = {Dec},
booktitle = {Advances in Neural Information Processing Systems},
url = {https://arxiv.org/abs/2112.01641},
}

Prediction-Guided Distillation for Dense Object Detection

European Conference on Computer Vision

Chenhongyi Yang, Mateusz Ochal, Amos Storkey, Elliot J. Crowley

Real-world object detection models should be cheap and accurate. Knowledge distillation (KD) can boost the accuracy of a small, cheap detection model by leveraging useful information from a larger teacher model. However, a key challenge is identifying the most informative features produced by the teacher for distillation. In this work, we show that only a very small fraction of features within a ground-truth bounding box are responsible for a teacher's high detection performance. Based on this, we propose Prediction-Guided Distillation (PGD), which focuses distillation on these key predictive regions of the teacher and yields considerable gains in performance over many existing KD baselines. In addition, we propose an adaptive weighting scheme over the key regions to smooth out their influence and achieve even better performance. Our proposed approach outperforms current state-of-the-art KD baselines on a variety of advanced one-stage detection architectures. Specifically, on the COCO dataset, our method achieves between +3.1% and +4.6% AP improvement using ResNet-101 and ResNet-50 as the teacher and student backbones, respectively. On the CrowdHuman dataset, we achieve +3.2% and +2.0% improvements in MR and AP, also using these backbones. Our code is available at https://github.com/ChenhongyiYang/PGD.

@inproceedings{Yang2022_10_PredictionGuided,
author = {Chenhongyi Yang and Mateusz Ochal and Amos Storkey and Elliot J. Crowley},
title = {Prediction-Guided Distillation for Dense Object Detection},
year = {2022},
month = {Oct},
booktitle = {European Conference on Computer Vision},
url = {https://arxiv.org/abs/2203.05469},
}

Deep Attention Super-Resolution of Brain Magnetic Resonance Images Acquired under Clinical Protocols

Frontiers in Computational Neuroscience

Bryan M Li, Leonardo V Castorina, Maria Del Carmen Valdés Hernández, Una Clancy, Stewart J Wiseman, Eleni Sakka, Amos J Storkey, Daniela Jaime Garcia, Yajun Cheng, Fergus Doubal, Michael T Thrippleton, Michael Stringer, Joanna M Wardlaw

Vast quantities of Magnetic Resonance Images (MRI) are routinely acquired in clinical practice but, to speed up acquisition, these scans are typically of a quality that is sufficient for clinical diagnosis but sub-optimal for large-scale precision medicine, computational diagnostics, and large-scale neuroimaging research. Here, we present a critic-guided framework to upsample low-resolution (often 2D) MRI scans. In addition, we incorporated feature-importance and self-attention methods into our model to improve the interpretability of this work. We evaluate our framework on paired low- and high-resolution brain MRI structural full scans (i.e. T1-, T2-weighted and FLAIR sequences are simultaneously input) obtained in clinical and research settings from scanners manufactured by Siemens, Philips and GE. We showed that the upsampled MRIs are qualitatively faithful to the ground-truth high-quality scans (PSNR = 35.39; MAE = 3.78E−3; NMSE = 4.32E−10; SSIM = 0.9852; mean normal-appearing grey/white matter ratio intensity differences ranging from 0.0363 to 0.0784 for FLAIR, from 0.0010 to 0.0138 for T1-weighted and from 0.0156 to 0.074 for T2-weighted sequences). The automatic raw segmentations of tissues and lesions using the super-resolved images have fewer false positives and higher accuracy than those obtained from interpolated images in protocols represented with more than three sets in the training sample, making our approach a strong candidate for practical application in clinical research.

@article{Li2022_8_Deep,
author = {Bryan M Li and Leonardo V Castorina and Maria Del Carmen Valdés Hernández and Una Clancy and Stewart J Wiseman and Eleni Sakka and Amos J Storkey and Daniela Jaime Garcia and Yajun Cheng and Fergus Doubal and Michael T Thrippleton and Michael Stringer and Joanna M Wardlaw},
title = {Deep Attention Super-Resolution of Brain Magnetic Resonance Images Acquired under Clinical Protocols},
year = {2022},
month = {Aug},
journal = {Frontiers in Computational Neuroscience},
volume = {},
url = {https://www.medrxiv.org/content/10.1101/2022.01.24.22269144v1},
}

Robust and Efficient Computation of Retinal Fractal Dimension through Deep Approximation

9th MICCAI Workshop on Ophthalmic Medical Image Analysis at MICCAI 2022

Justin Engelmann, Ana Villaplana-Velasco, Amos Storkey, Miguel O. Bernabeu

A retinal trait, or phenotype, summarises a specific aspect of a retinal image in a single number. This can then be used for further analyses, e.g. with statistical methods. However, reducing an aspect of a complex image to a single, meaningful number is challenging. Thus, methods for calculating retinal traits tend to be complex, multi-step pipelines that can only be applied to high quality images. This means that researchers often have to discard substantial portions of the available data. We hypothesise that such pipelines can be approximated with a single, simpler step that can be made robust to common quality issues. We propose Deep Approximation of Retinal Traits (DART) where a deep neural network is used to predict the output of an existing pipeline on high quality images from synthetically degraded versions of these images. We demonstrate DART on retinal Fractal Dimension (FD) calculated by VAMPIRE, using retinal images from UK Biobank that previous work identified as high quality. Our method shows very high agreement with FD VAMPIRE on unseen test images (Pearson r=0.9572). Even when those images are severely degraded, DART can still recover an FD estimate that shows good agreement with FD VAMPIRE obtained from the original images (Pearson r=0.8817). This suggests that our method could enable researchers to discard fewer images in the future. Our method can compute FD for over 1,000 img/s using a single GPU. We consider these to be very encouraging initial results and hope to develop this approach into a useful tool for retinal analysis.

@inproceedings{Engelmann2022_7_Robust,
author = {Justin Engelmann and Ana Villaplana-Velasco and Amos Storkey and Miguel O. Bernabeu},
title = {Robust and Efficient Computation of Retinal Fractal Dimension through Deep Approximation},
year = {2022},
month = {Jul},
booktitle = {9th MICCAI Workshop on Ophthalmic Medical Image Analysis at MICCAI 2022},
url = {https://arxiv.org/abs/2207.05757},
}

Learning Task Embeddings for Teamwork Adaptation in Multi-Agent Reinforcement Learning

Lukas Schäfer, Filippos Christianos, Amos Storkey, Stefano V. Albrecht

Successful deployment of multi-agent reinforcement learning often requires agents to adapt their behaviour. In this work, we discuss the problem of teamwork adaptation in which a team of agents needs to adapt their policies to solve novel tasks with limited fine-tuning. Motivated by the intuition that agents need to be able to identify and distinguish tasks in order to adapt their behaviour to the current task, we propose to learn multi-agent task embeddings (MATE). These task embeddings are trained using an encoder-decoder architecture optimised for reconstruction of the transition and reward functions which uniquely identify tasks. We show that a team of agents is able to adapt to novel tasks when provided with task embeddings. We propose three MATE training paradigms: independent MATE, centralised MATE, and mixed MATE which vary in the information used for the task encoding. We show that the embeddings learned by MATE identify tasks and provide useful information which agents leverage during adaptation to novel tasks.

@unpublished{Schäfer2022_7_Learning,
author = {Lukas Schäfer and Filippos Christianos and Amos Storkey and Stefano V. Albrecht},
title = {Learning Task Embeddings for Teamwork Adaptation in Multi-Agent Reinforcement Learning},
year = {2022},
month = {Jul},
institution = {University of Edinburgh},
url = {https://arxiv.org/abs/2207.02249},
}

Global Explainability in Aligned Image Modalities

Interpretable Machine Learning in Healthcare at ICML 2022

Justin Engelmann, Amos Storkey, Miguel O. Bernabeu

Deep learning (DL) models are very effective on many computer vision problems and increasingly used in critical applications. They are also inherently black box. A number of methods exist to generate image-wise explanations that allow practitioners to understand and verify model predictions for a given image. Beyond that, it would be desirable to validate that a DL model generally works in a sensible way, i.e. consistent with domain knowledge and not relying on undesirable data artefacts. For this purpose, the model needs to be explained globally. In this work, we focus on image modalities that are naturally aligned such that each pixel position represents a similar relative position on the imaged object, as is common in medical imaging. We propose the pixel-wise aggregation of image-wise explanations as a simple method to obtain label-wise and overall global explanations. These can then be used for model validation, knowledge discovery, and as an efficient way to communicate qualitative conclusions drawn from inspecting image-wise explanations. We further propose Progressive Erasing Plus Progressive Restoration (PEPPR) as a method to quantitatively validate that these global explanations are faithful to how the model makes its predictions. We then apply these methods to ultra-widefield retinal images, a naturally aligned modality. We find that the global explanations are consistent with domain knowledge and faithfully reflect the model's workings.

@inproceedings{Engelmann2021_12_Global,
author = {Justin Engelmann and Amos Storkey and Miguel O. Bernabeu},
title = {Global Explainability in Aligned Image Modalities},
year = {2021},
month = {Dec},
booktitle = {Interpretable Machine Learning in Healthcare at ICML 2022},
url = {https://arxiv.org/abs/2112.09591},
}

Gradient-Based Hyperparameter Optimization Over Long Horizons

Advances in Neural Information Processing Systems

Paul Micaelli, Amos Storkey

Gradient-based hyperparameter optimization has earned widespread popularity in the context of few-shot meta-learning, but remains broadly impractical for tasks with long horizons (many gradient steps), due to memory scaling and gradient degradation issues. A common workaround is to learn hyperparameters online, but this introduces greediness which comes with a significant performance drop. We propose forward-mode differentiation with sharing (FDS), a simple and efficient algorithm which tackles memory scaling issues with forward-mode differentiation, and gradient degradation issues by sharing hyperparameters that are contiguous in time. We provide theoretical guarantees about the noise reduction properties of our algorithm, and demonstrate its efficiency empirically by differentiating through ∼10⁴ gradient steps of unrolled optimization. We consider large hyperparameter search ranges on CIFAR-10 where we significantly outperform greedy gradient-based alternatives, while achieving ×20 speedups compared to the state-of-the-art black-box methods.

@inproceedings{Micaelli2021_12_GradientBased,
author = {Paul Micaelli and Amos Storkey},
title = {Gradient-Based Hyperparameter Optimization Over Long Horizons},
year = {2021},
month = {Dec},
booktitle = {Advances in Neural Information Processing Systems},
accepted = {2021-09-28},
url = {https://arxiv.org/abs/2007.07869},
}

Contrastive Object-level Pre-training with Spatial Noise Curriculum Learning

Chenhongyi Yang, Lichao Huang, Elliot J. Crowley

The goal of contrastive learning based pre-training is to leverage large quantities of unlabeled data to produce a model that can be readily adapted downstream. Current approaches revolve around solving an image discrimination task: given an anchor image, an augmented counterpart of that image, and some other images, the model must produce representations such that the distance between the anchor and its counterpart is small, and the distances between the anchor and the other images are large. There are two significant problems with this approach: (i) by contrasting representations at the image-level, it is hard to generate detailed object-sensitive features that are beneficial to downstream object-level tasks such as instance segmentation; (ii) the augmentation strategy of producing an augmented counterpart is fixed, making learning less effective at the later stages of pre-training. In this work, we introduce Curricular Contrastive Object-level Pre-training (CCOP) to tackle these problems: (i) we use selective search to find rough object regions and use them to build an inter-image object-level contrastive loss and an intra-image object-level discrimination loss into our pre-training objective; (ii) we present a curriculum learning mechanism that adaptively augments the generated regions, which allows the model to consistently acquire a useful learning signal, even in the later stages of pre-training. Our experiments show that our approach improves on the MoCo v2 baseline by a large margin on multiple object-level tasks when pre-training on multi-object scene image datasets. Code is available at https://github.com/ChenhongyiYang/CCOP.

@unpublished{Yang2021_11_Contrastive,
author = {Chenhongyi Yang and Lichao Huang and Elliot J. Crowley},
title = {Contrastive Object-level Pre-training with Spatial Noise Curriculum Learning},
year = {2021},
month = {Nov},
institution = {University of Edinburgh},
url = {https://arxiv.org/abs/2111.13651},
}

Better Training using Weight-Constrained Stochastic Dynamics

International Conference on Machine Learning (ICML)

Benedict Leimkuhler, Tiffany Vlaar, Timothée Pouchon, Amos Storkey

We employ constraints to control the parameter space of deep neural networks throughout training. The use of customized, appropriately designed constraints can reduce the vanishing/exploding gradients problem, improve smoothness of classification boundaries, control weight magnitudes and stabilize deep neural networks, and thus enhance the robustness of training algorithms and the generalization capabilities of neural networks. We provide a general approach to efficiently incorporate constraints into a stochastic gradient Langevin framework, allowing enhanced exploration of the loss landscape. We also present specific examples of constrained training methods motivated by orthogonality preservation for weight matrices and explicit weight normalizations. Discretization schemes are provided both for the overdamped formulation of Langevin dynamics and the underdamped form, in which momenta further improve sampling efficiency. These optimization schemes can be used directly, without needing to adapt neural network architecture design choices or to modify the objective with regularization terms, and see performance improvements in classification tasks.

@inproceedings{Leimkuhler2021_7_Better,
author = {Benedict Leimkuhler and Tiffany Vlaar and Timothée Pouchon and Amos Storkey},
title = {Better Training using Weight-Constrained Stochastic Dynamics},
year = {2021},
month = {Jul},
booktitle = {International Conference on Machine Learning (ICML)},
accepted = {2021-05-08},
url = {https://arxiv.org/abs/2106.10704},
}

Neural Architecture Search without Training

International Conference on Machine Learning (ICML)

Joseph Mellor, Jack Turner, Amos Storkey, Elliot J. Crowley

The time and effort involved in hand-designing deep neural networks is immense. This has prompted the development of Neural Architecture Search (NAS) techniques to automate this design. However, NAS algorithms tend to be slow and expensive; they need to train vast numbers of candidate networks to inform the search process. This could be alleviated if we could partially predict a network's trained accuracy from its initial state. In this work, we examine the overlap of activations between datapoints in untrained networks and motivate how this can give a measure which is usefully indicative of a network's trained performance. We incorporate this measure into a simple algorithm that allows us to search for powerful networks without any training in a matter of seconds on a single GPU, and verify its effectiveness on NAS-Bench-101, NAS-Bench-201, and Network Design Spaces. Finally, our approach can be readily combined with more expensive search methods; we examine a simple adaptation of regularised evolutionary search that outperforms its predecessor. Code for reproducing our experiments is available at https://github.com/BayesWatch/nas-without-training.

@inproceedings{Mellor2021_7_Neural,
author = {Joseph Mellor and Jack Turner and Amos Storkey and Elliot J. Crowley},
title = {Neural Architecture Search without Training},
year = {2021},
month = {Jul},
booktitle = {International Conference on Machine Learning (ICML)},
accepted = {2021-05-08},
url = {https://arxiv.org/abs/2006.04647},
}

Substituting Convolutions for Neural Network Compression

IEEE Access

Elliot J. Crowley, Gavia Gray, Jack Turner, Amos Storkey

Many practitioners would like to deploy deep, convolutional neural networks in memory-limited scenarios, e.g. on an embedded device. However, with an abundance of compression techniques available it is not obvious how to proceed; many bring with them additional hyperparameter tuning, and are specific to particular network types. In this paper, we propose a simple compression technique that is general, easy to apply, and requires minimal tuning. Given a large, trained network, we propose (i) substituting its expensive convolutions with cheap alternatives, leaving the overall architecture unchanged; (ii) treating this new network as a student and training it with the original as a teacher through distillation. We demonstrate this approach separately for (i) networks predominantly consisting of full 3×3 convolutions and (ii) 1×1 or pointwise convolutions which together make up the vast majority of contemporary networks. We are able to leverage a number of methods that have been developed as efficient alternatives to fully-connected layers for pointwise substitution, allowing us to provide Pareto-optimal benefits in efficiency/accuracy.

@article{Crowley2021_5_Substituting,
author = {Elliot J. Crowley and Gavia Gray and Jack Turner and Amos Storkey},
title = {Substituting Convolutions for Neural Network Compression},
year = {2021},
month = {May},
journal = {IEEE Access},
volume = {},
accepted = {2021-05-20},
url = {https://ieeexplore.ieee.org/document/9446890},
}

How Sensitive are Meta-Learners to Dataset Imbalance?

ICLR Learning to Learn Workshop

Mateusz Ochal, Massimiliano Patacchiola, Amos Storkey, Jose Vazquez, Sen Wang

Meta-Learning (ML) has proven to be a useful tool for training Few-Shot Learning (FSL) algorithms by exposure to batches of tasks sampled from a meta-dataset. However, the standard training procedure overlooks the dynamic nature of the real-world where object classes are likely to occur at different frequencies. While it is generally understood that imbalanced tasks harm the performance of supervised methods, there is no significant research examining the impact of imbalanced meta-datasets on the FSL evaluation task. This study exposes the magnitude and extent of this problem. Our results show that ML methods are more robust against meta-dataset imbalance than imbalance at the task-level with a similar imbalance ratio (ρ<20), with the effect holding even in long-tail datasets under a larger imbalance (ρ=65). Overall, these results highlight an implicit strength of ML algorithms, capable of learning generalizable features under dataset imbalance and domain-shift. The code to reproduce the experiments is released under an open-source license.

@inproceedings{Ochal2021_5_How,
author = {Mateusz Ochal and Massimiliano Patacchiola and Amos Storkey and Jose Vazquez and Sen Wang},
title = {How Sensitive are Meta-Learners to Dataset Imbalance?},
year = {2021},
month = {May},
booktitle = {ICLR Learning to Learn Workshop},
appeared = {2021-04-12},
url = {https://arxiv.org/abs/2104.05344},
}

Meta-Learning in Neural Networks: A Survey

IEEE Transactions on Pattern Analysis and Machine Intelligence

Timothy Hospedales, Antreas Antoniou, Paul Micaelli, Amos Storkey

The field of meta-learning, or learning-to-learn, has seen a dramatic rise in interest in recent years. Contrary to conventional approaches to AI where a given task is solved from scratch using a fixed learning algorithm, meta-learning aims to improve the learning algorithm itself, given the experience of multiple learning episodes. This paradigm provides an opportunity to tackle many of the conventional challenges of deep learning, including data and computation bottlenecks, as well as the fundamental issue of generalization. In this survey we describe the contemporary meta-learning landscape. We first discuss definitions of meta-learning and position it with respect to related fields, such as transfer learning, multi-task learning, and hyperparameter optimization. We then propose a new taxonomy that provides a more comprehensive breakdown of the space of meta-learning methods today. We survey promising applications and successes of meta-learning including few-shot learning, reinforcement learning and architecture search. Finally, we discuss outstanding challenges and promising areas for future research.

@article{Hospedales2021_5_MetaLearning,
author = {Timothy Hospedales and Antreas Antoniou and Paul Micaelli and Amos Storkey},
title = {Meta-Learning in Neural Networks: A Survey},
year = {2021},
month = {May},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
volume = {},
accepted = {2021-04-27},
url = {https://ieeexplore.ieee.org/document/9428530},
}

Neural Architecture Search as Program Transformation Exploration

International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)

Jack Turner, Elliot J. Crowley, Michael O'Boyle

Improving the performance of deep neural networks (DNNs) is important to both the compiler and neural architecture search (NAS) communities. Compilers apply program transformations in order to exploit hardware parallelism and memory hierarchy. However, legality concerns mean they fail to exploit the natural robustness of neural networks. In contrast, NAS techniques mutate networks by operations such as the grouping or bottlenecking of convolutions, exploiting the resilience of DNNs. In this work, we express such neural architecture operations as program transformations whose legality depends on a notion of representational capacity. This allows them to be combined with existing transformations into a unified optimization framework. This unification allows us to express existing NAS operations as combinations of simpler transformations. Crucially, it allows us to generate and explore new tensor convolutions. We prototyped the combined framework in TVM and were able to find optimizations across different DNNs, that significantly reduce inference time - over 3 times in the majority of cases.

@inproceedings{Turner2021_4_Neural,
author = {Jack Turner and Elliot J. Crowley and Michael O'Boyle},
title = {Neural Architecture Search as Program Transformation Exploration},
year = {2021},
month = {Apr},
booktitle = {International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
appeared = {2020-11-19},
accepted = {2020-11-19},
url = {https://arxiv.org/abs/2102.06599},
}

Few-Shot Learning with Class Imbalance

Mateusz Ochal, Massimiliano Patacchiola, Amos Storkey, Jose Vazquez, Sen Wang

Few-shot learning aims to train models on a limited number of labeled samples given in a support set in order to generalize to unseen samples from a query set. In the standard setup, the support set contains an equal number of data points for each class. However, this assumption overlooks many practical considerations arising from the dynamic nature of the real world, such as class-imbalance. In this paper, we present a detailed study of few-shot class-imbalance along three axes: meta-dataset vs. task imbalance, effect of different imbalance distributions (linear, step, random), and effect of rebalancing techniques. We extensively compare over 10 state-of-the-art few-shot learning and meta-learning methods using unbalanced tasks and meta-datasets. Our analysis using Mini-ImageNet reveals that 1) compared to the balanced task, performance on the class-imbalanced counterparts always drops, by up to 18.0% for optimization-based methods and up to 8.4% for metric-based methods, 2) contrary to popular belief, meta-learning algorithms, such as MAML, do not automatically learn to balance by being exposed to imbalanced tasks during (meta-)training time, 3) strategies used to mitigate imbalance in supervised learning, such as oversampling, can offer a stronger solution to the class imbalance problem, 4) the effect of imbalance at the meta-dataset level is less significant than the effect at the task level with similar imbalance magnitude. The code to reproduce the experiments is released under an open-source license.

@unpublished{Ochal2021_1_FewShot,
author = {Mateusz Ochal and Massimiliano Patacchiola and Amos Storkey and Jose Vazquez and Sen Wang},
title = {Few-Shot Learning with Class Imbalance},
year = {2021},
month = {Jan},
institution = {University of Edinburgh},
url = {https://arxiv.org/abs/2101.02523},
}

Self-Supervised Relational Reasoning for Representation Learning

Advances in Neural Information Processing Systems (NeurIPS)

Massimiliano Patacchiola, Amos Storkey

In self-supervised learning, a system is tasked with achieving a surrogate objective by defining alternative targets on a set of unlabeled data. The aim is to build useful representations that can be used in downstream tasks, without costly manual annotation. In this work, we propose a novel self-supervised formulation of relational reasoning that allows a learner to bootstrap a signal from information implicit in unlabeled data. Training a relation head to discriminate how entities relate to themselves (intra-reasoning) and other entities (inter-reasoning), results in rich and descriptive representations in the underlying neural network backbone, which can be used in downstream tasks such as classification and image retrieval. We evaluate the proposed method following a rigorous experimental procedure, using standard datasets, protocols, and backbones. Self-supervised relational reasoning outperforms the best competitor in all conditions by an average 14% in accuracy, and the most recent state-of-the-art model by 3%. We link the effectiveness of the method to the maximization of a Bernoulli log-likelihood, which can be considered as a proxy for maximizing the mutual information, resulting in a more efficient objective with respect to the commonly used contrastive losses.

@inproceedings{Patacchiola2020_12_SelfSupervised,
author = {Massimiliano Patacchiola and Amos Storkey},
title = {Self-Supervised Relational Reasoning for Representation Learning},
year = {2020},
month = {Dec},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
appeared = {2020-09-25},
accepted = {2020-09-25},
url = {https://arxiv.org/abs/2006.05849},
}

Bayesian Meta-Learning for the Few-Shot Setting via Deep Kernels

Advances in Neural Information Processing Systems (NeurIPS)

Massimiliano Patacchiola, Jack Turner, Elliot J. Crowley, Michael O'Boyle, Amos Storkey

Recently, different machine learning methods have been introduced to tackle the challenging few-shot learning scenario, that is, learning from a small labeled dataset related to a specific task. Common approaches have taken the form of meta-learning: learning to learn on the new problem given the old. Following the recognition that meta-learning is implementing learning in a multi-level model, we present a Bayesian treatment for the meta-learning inner loop through the use of deep kernels. As a result we can learn a kernel that transfers to new tasks; we call this Deep Kernel Transfer (DKT). This approach has many advantages: it is straightforward to implement as a single optimizer, provides uncertainty quantification, and does not require estimation of task-specific parameters. We empirically demonstrate that DKT outperforms several state-of-the-art algorithms in few-shot classification, and is the state of the art for cross-domain adaptation and regression. We conclude that complex meta-learning routines can be replaced by a simpler Bayesian model without loss of accuracy.

@inproceedings{Patacchiola2020_12_Bayesian,
author = {Massimiliano Patacchiola and Jack Turner and Elliot J. Crowley and Michael O'Boyle and Amos Storkey},
title = {{B}ayesian Meta-Learning for the Few-Shot Setting via Deep Kernels},
year = {2020},
month = {Dec},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
accepted = {2020-09-25},
url = {https://arxiv.org/abs/1910.05199},
}

Constraint-Based Regularisation of Neural Networks

NeurIPS OPT2020: 12th Annual Workshop on Optimization for Machine Learning

Benedict Leimkuhler, Timothée Pouchon, Tiffany Vlaar, Amos Storkey

We propose a method for efficiently incorporating constraints into a stochastic gradient Langevin framework for the training of deep neural networks. Constraints allow direct control of the parameter space of the model. Appropriately designed, they reduce the vanishing/exploding gradient problem, control weight magnitudes and stabilize deep neural networks and thus improve the robustness of training algorithms and generalization capabilities of the trained neural network. We present examples of constrained training methods motivated by orthogonality preservation for weight matrices and explicit weight normalizations. We describe the methods in the overdamped formulation of Langevin dynamics and the underdamped form, in which momenta help to improve sampling efficiency. Our methods see performance improvements on image classification tasks.

@inproceedings{Leimkuhler2020_12_ConstraintBased,
author = {Benedict Leimkuhler and Timothée Pouchon and Tiffany Vlaar and Amos Storkey},
title = {Constraint-Based Regularisation of Neural Networks},
year = {2020},
month = {Dec},
booktitle = {NeurIPS OPT2020: 12th Annual Workshop on Optimization for Machine Learning},
url = {http://homepages.inf.ed.ac.uk/amos/publications/LeimkuhlerPouchonVlaarStorkey2020ConstraintBasedRegularisatonNeurIPSWSOPT.pdf},
}

Classification with a Domain Shift in Medical Imaging

Med-NeurIPS 2020: Medical Imaging meets NeurIPS Workshop

Alessandro Fontanella, Emma Pead, Tom MacGillivray, Miguel O. Bernabeu, Amos Storkey

Labelled medical imaging datasets are often small in size, but other unlabelled datasets with a domain shift may be available. In this work, we propose a method that is able to exploit these additional unlabelled data, possibly with a domain shift, to improve predictions on our labelled data. To this aim, we learn features in a self-supervised way while projecting all the data onto the same space to achieve better transfer. We first test our approach on natural images and verify its effectiveness on Office-31 data. Then, we apply it to retinal fundus datasets and through a series of experiments on age-related macular degeneration (AMD) and diabetic retinopathy (DR) grading, we show how our method improves the baseline of pre-training on ImageNet and fine-tuning on the labelled data in terms of classification accuracy, AUC and clinical interpretability.

@inproceedings{Fontanella2020_12_Classification,
author = {Alessandro Fontanella and Emma Pead and Tom MacGillivray and Miguel O. Bernabeu and Amos Storkey},
title = {Classification with a Domain Shift in Medical Imaging},
year = {2020},
month = {Dec},
booktitle = {Med-NeurIPS 2020: Medical Imaging meets NeurIPS Workshop},
accepted = {2020-11-01},
url = {http://www.cse.cuhk.edu.hk/~qdou/public/medneurips2020/43_Classification_with_a_domain_shift_in_medical_imaging.pdf},
}

Defining Benchmarks for Continual Few-Shot Learning

NeurIPS MetaLearn 2020: Workshop on Meta-Learning

Antreas Antoniou, Massimiliano Patacchiola, Mateusz Ochal, Amos Storkey

Both few-shot and continual learning have seen substantial progress in the last years due to the introduction of proper benchmarks. That being said, the field has still to frame a suite of benchmarks for the highly desirable setting of continual few-shot learning, where the learner is presented a number of few-shot tasks, one after the other, and then asked to perform well on a validation set stemming from all previously seen tasks. Continual few-shot learning has a small computational footprint and is thus an excellent setting for efficient investigation and experimentation. In this paper we first define a theoretical framework for continual few-shot learning, taking into account recent literature, then we propose a range of flexible benchmarks that unify the evaluation criteria and allow exploring the problem from multiple perspectives. As part of the benchmark, we introduce a compact variant of ImageNet, called SlimageNet64, which retains all original 1000 classes but only contains 200 instances of each one (a total of 200K data-points) downscaled to 64 x 64 pixels. We provide baselines for the proposed benchmarks using a number of popular few-shot learning algorithms, thereby exposing previously unknown strengths and weaknesses of those algorithms in continual and data-limited settings.

@inproceedings{Antoniou2020_12_Defining,
author = {Antreas Antoniou and Massimiliano Patacchiola and Mateusz Ochal and Amos Storkey},
title = {Defining Benchmarks for Continual Few-Shot Learning},
year = {2020},
month = {Dec},
booktitle = {NeurIPS MetaLearn 2020: Workshop on Meta-Learning},
accepted = {2020-11-01},
url = {https://arxiv.org/abs/2004.11967},
}

Latent Adversarial Debiasing: Mitigating Collider Bias in Deep Neural Networks

Luke N. Darlow, Stanisław Jastrzębski, Amos Storkey

Collider bias is a harmful form of sample selection bias that neural networks are ill-equipped to handle. This bias manifests itself when the underlying causal signal is strongly correlated with other confounding signals due to the training data collection procedure. In the situation where the confounding signal is easy-to-learn, deep neural networks will latch onto this and the resulting model will generalise poorly to in-the-wild test scenarios. We argue herein that the cause of failure is a combination of the deep structure of neural networks and the greedy gradient-driven learning process used - one that prefers easy-to-compute signals when available. We show it is possible to mitigate against this by generating bias-decoupled training data using latent adversarial debiasing (LAD), even when the confounding signal is present in 100% of the training data. By training neural networks on these adversarial examples, we can improve their generalisation in collider bias settings. Experiments show state-of-the-art performance of LAD in label-free debiasing with gains of 76.12% on background coloured MNIST, 35.47% on foreground coloured MNIST, and 8.27% on corrupted CIFAR-10.

@unpublished{Darlow2020_11_Latent,
author = {Luke N. Darlow and Stanisław Jastrzębski and Amos Storkey},
title = {Latent Adversarial Debiasing: Mitigating Collider Bias in Deep Neural Networks},
year = {2020},
month = {Nov},
institution = {University of Edinburgh},
url = {https://arxiv.org/abs/2011.11486},
}

Optimizing Grouped Convolutions on Edge Devices

International Conference on Application-specific Systems, Architectures and Processors (ASAP)

Perry Gibson, José Cano, Jack Turner, Elliot J. Crowley, Michael O'Boyle, Amos Storkey

When deploying a deep neural network on constrained hardware, it is possible to replace the network's standard convolutions with grouped convolutions. This allows for substantial memory savings with minimal loss of accuracy. However, current implementations of grouped convolutions in modern deep learning frameworks are far from performing optimally in terms of speed. In this paper we propose Grouped Spatial Pack Convolutions (GSPC), a new implementation of grouped convolutions that outperforms existing solutions. We implement GSPC in TVM, which provides state-of-the-art performance on edge devices. We analyze a set of networks utilizing different types of grouped convolutions and evaluate their performance in terms of inference time on several edge devices. We observe that our new implementation scales well with the number of groups and provides the best inference times in all settings, improving the existing implementations of grouped convolutions in TVM, PyTorch and TensorFlow Lite by 3.4x, 8x and 4x on average respectively. Code is available at https://github.com/gecLAB/tvm-GSPC/.

@inproceedings{Gibson2020_7_Optimizing,
author = {Perry Gibson and José Cano and Jack Turner and Elliot J. Crowley and Michael O'Boyle and Amos Storkey},
title = {Optimizing Grouped Convolutions on Edge Devices},
year = {2020},
month = {Jul},
booktitle = {International Conference on Application-specific Systems, Architectures and Processors (ASAP)},
accepted = {2020-05-20},
url = {https://arxiv.org/abs/2006.09791},
}

Comparing Recurrent and Convolutional Neural Networks for Predicting Wave Propagation

Workshop on Deep Learning and Differential Equations, ICLR

Stathi Fotiadis, Eduardo Pignatelli, Mario Lino Valencia, Chris Cantwell, Amos Storkey, Anil A. Bharath

Dynamical systems can be modelled by partial differential equations and numerical computations are used everywhere in science and engineering. In this work, we investigate the performance of recurrent and convolutional deep neural network architectures for predicting surface waves. The system is governed by the Saint-Venant equations. We improve on the long-term prediction over previous methods while keeping the inference time at a fraction of numerical simulations. We also show that convolutional networks perform at least as well as recurrent networks in this task. Finally, we assess the generalisation capability of each network by extrapolating in longer time-frames and in different physical settings.

@inproceedings{Fotiadis2020_4_Comparing,
author = {Stathi Fotiadis and Eduardo Pignatelli and Mario Lino Valencia and Chris Cantwell and Amos Storkey and Anil A. Bharath},
title = {Comparing Recurrent and Convolutional Neural Networks for Predicting Wave Propagation},
year = {2020},
month = {Apr},
booktitle = {Workshop on Deep Learning and Differential Equations, ICLR},
url = {https://arxiv.org/abs/2002.08981},
}

BlockSwap: Fisher-guided Block Substitution for Network Compression on a Budget

International Conference on Learning Representations (ICLR)

Jack Turner, Elliot J. Crowley, Michael O'Boyle, Amos Storkey, Gavia Gray

The desire to map neural networks to varying-capacity devices has led to the development of a wealth of compression techniques, many of which involve replacing standard convolutional blocks in a large network with cheap alternative blocks. However, not all blocks are created equally; for a required compute budget there may exist a potent combination of many different cheap blocks, though exhaustively searching for such a combination is prohibitively expensive. In this work, we develop BlockSwap: a fast algorithm for choosing networks with interleaved block types by passing a single minibatch of training data through randomly initialised networks and gauging their Fisher potential. These networks can then be used as students and distilled with the original large network as a teacher. We demonstrate the effectiveness of the chosen networks across CIFAR-10 and ImageNet for classification, and COCO for detection, and provide a comprehensive ablation study of our approach. BlockSwap quickly explores possible block configurations using a simple architecture ranking system, yielding highly competitive networks in orders of magnitude less time than most architecture search techniques (e.g. 8 minutes on a single CPU for CIFAR-10).

@inproceedings{Turner2020_4_BlockSwap,
author = {Jack Turner and Elliot J. Crowley and Michael O'Boyle and Amos Storkey and Gavia Gray},
title = {{BlockSwap}: {F}isher-guided Block Substitution for Network Compression on a Budget},
year = {2020},
month = {Apr},
booktitle = {International Conference on Learning Representations (ICLR)},
url = {https://arxiv.org/abs/1906.04113},
}

DHOG: Deep Hierarchical Object Grouping

Luke N. Darlow, Amos Storkey

Recently, a number of competitive methods have tackled unsupervised representation learning by maximising the mutual information between the representations produced from augmentations. The resulting representations are then invariant to stochastic augmentation strategies, and can be used for downstream tasks such as clustering or classification. Yet data augmentations preserve many properties of an image and so there is potential for a suboptimal choice of representation that relies on matching easy-to-find features in the data. We demonstrate that greedy or local methods of maximising mutual information (such as stochastic gradient optimisation) discover local optima of the mutual information criterion; the resulting representations are also less-ideally suited to complex downstream tasks. Earlier work has not specifically identified or addressed this issue. We introduce deep hierarchical object grouping (DHOG) that computes a number of distinct discrete representations of images in a hierarchical order, eventually generating representations that better optimise the mutual information objective. We also find that these representations align better with the downstream task of grouping into underlying object classes. We tested DHOG on unsupervised clustering, which is a natural downstream test as the target representation is a discrete labelling of the data. We achieved new state-of-the-art results on the three main benchmarks without any prefiltering or Sobel-edge detection that proved necessary for many previous methods to work. We obtain accuracy improvements of: 4.3% on CIFAR-10, 1.5% on CIFAR-100-20, and 7.2% on SVHN.

@unpublished{Darlow2020_3_DHOG,
author = {Luke N. Darlow and Amos Storkey},
title = {{DHOG}: Deep Hierarchical Object Grouping},
year = {2020},
month = {Mar},
institution = {University of Edinburgh},
url = {https://arxiv.org/abs/2003.08821},
}

Learning to Learn via Self-Critique

Advances in Neural Information Processing Systems (NeurIPS)

Antreas Antoniou, Amos Storkey

In few-shot learning, a machine learning system learns from a small set of labelled examples relating to a specific task, such that it can generalize to new examples of the same task. Given the limited availability of labelled examples in such tasks, we wish to make use of all the information we can. Usually a model learns task-specific information from a small training-set (support-set) to predict on an unlabelled validation set (target-set). The target-set contains additional task-specific information which is not utilized by existing few-shot learning methods. Making use of the target-set examples via transductive learning requires approaches beyond the current methods; at inference time, the target-set contains only unlabelled input data-points, and so discriminative learning cannot be used. In this paper, we propose a framework called Self-Critique and Adapt (SCA), which learns to learn a label-free loss function, parameterized as a neural network. A base-model learns on a support-set using existing methods (e.g. stochastic gradient descent combined with the cross-entropy loss), and then is updated for the incoming target-task using the learnt loss function. This label-free loss function is itself optimized such that the learnt model achieves higher generalization performance. Experiments demonstrate that SCA offers substantially reduced error-rates compared to baselines which only adapt on the support-set, and results in state-of-the-art benchmark performance on Mini-ImageNet and Caltech-UCSD Birds 200.

@inproceedings{Antoniou2019_12_Learning,
author = {Antreas Antoniou and Amos Storkey},
title = {Learning to Learn via Self-Critique},
year = {2019},
month = {Dec},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
url = {https://arxiv.org/abs/1905.10295},
}

Zero-shot Knowledge Transfer via Adversarial Belief Matching

Advances in Neural Information Processing Systems (NeurIPS)

Paul Micaelli, Amos Storkey

Performing knowledge transfer from a large teacher network to a smaller student is a popular task in modern deep learning applications. However, due to growing dataset sizes and stricter privacy regulations, it is increasingly common not to have access to the data that was used to train the teacher. We propose a novel method which trains a student to match the predictions of its teacher without using any data or metadata. We achieve this by training an adversarial generator to search for images on which the student poorly matches the teacher, and then using them to train the student. Our resulting student closely approximates its teacher for simple datasets like SVHN, and on CIFAR10 we improve on the state-of-the-art for few-shot distillation (with 100 images per class), despite using no data. Finally, we also propose a metric to quantify the degree of belief matching between teacher and student in the vicinity of decision boundaries, and observe a significantly higher match between our zero-shot student and the teacher, than between a student distilled with real data and the teacher.

@inproceedings{Micaelli2019_12_Zeroshot,
author = {Paul Micaelli and Amos Storkey},
title = {Zero-shot Knowledge Transfer via Adversarial Belief Matching},
year = {2019},
month = {Dec},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
url = {https://arxiv.org/abs/1905.09768},
}

Performance Aware Convolutional Neural Network Channel Pruning for Embedded GPUs

International Symposium on Workload Characterization (IISWC)

Valentin Radu, Kuba Kaszyk, Yuan Wen, Jack Turner, José Cano, Elliot J. Crowley, Björn Franke, Amos Storkey, Michael O'Boyle

Convolutional Neural Networks (CNN) are becoming a common presence in many applications and services, due to their superior recognition accuracy. They are increasingly being used on mobile devices, many times just by porting large models designed for server space, although several model compression techniques have been considered. One model compression technique intended to reduce computations is channel pruning. Mobile and embedded systems now have GPUs which are ideal for the parallel computations of neural networks and for their lower energy cost per operation. Specialized libraries perform these neural network computations through highly optimized routines. As we find in our experiments, these libraries are optimized for the most common network shapes, making uninstructed channel pruning inefficient. We evaluate higher level libraries, which analyze the input characteristics of a convolutional layer, based on which they produce optimized OpenCL (Arm Compute Library and TVM) and CUDA (cuDNN) code. However, in reality, these characteristics and subsequent choices intended for optimization can have the opposite effect. We show that a reduction in the number of convolutional channels, pruning 12% of the initial size, is in some cases detrimental to performance, leading to 2x slowdown. On the other hand, we also find examples where performance-aware pruning achieves the intended results, with performance speedups of 3x with cuDNN and above 10x with Arm Compute Library and TVM. Our findings expose the need for hardware-instructed neural network pruning.

@inproceedings{Radu2019_11_Performance,
author = {Valentin Radu and Kuba Kaszyk and Yuan Wen and Jack Turner and José Cano and Elliot J. Crowley and Björn Franke and Amos Storkey and Michael O'Boyle},
title = {Performance Aware Convolutional Neural Network Channel Pruning for Embedded {GPU}s},
year = {2019},
month = {Nov},
booktitle = {International Symposium on Workload Characterization (IISWC)},
url = {https://arxiv.org/abs/2002.08697},
}

Separable Layers Enable Structured Efficient Linear Substitutions

Gavia Gray, Elliot J. Crowley, Amos Storkey

In response to the development of recent efficient dense layers, this paper shows that something as simple as replacing linear components in pointwise convolutions with structured linear decompositions also produces substantial gains in the efficiency/accuracy tradeoff. Pointwise convolutions are fully connected layers and are thus prepared for replacement by structured transforms. Networks using such layers are able to learn the same tasks as those using standard convolutions, and provide Pareto-optimal benefits in efficiency/accuracy, both in terms of computation (mult-adds) and parameter count (and hence memory).

@unpublished{Gray2019_6_Separable,
author = {Gavia Gray and Elliot J. Crowley and Amos Storkey},
title = {Separable Layers Enable Structured Efficient Linear Substitutions},
year = {2019},
month = {Jun},
institution = {University of Edinburgh},
url = {https://arxiv.org/abs/1906.00859},
}

Exploration by Random Network Distillation

International Conference on Learning Representations (ICLR)

Yuri Burda, Harrison Edwards, Amos Storkey, Oleg Klimov

We introduce an exploration bonus for deep reinforcement learning methods that is easy to implement and adds minimal overhead to the computation performed. The bonus is the error of a neural network predicting features of the observations given by a fixed randomly initialized neural network. We also introduce a method to flexibly combine intrinsic and extrinsic rewards. We find that the random network distillation (RND) bonus combined with this increased flexibility enables significant progress on several hard exploration Atari games. In particular we establish state of the art performance on Montezuma's Revenge, a game famously difficult for deep reinforcement learning methods. To the best of our knowledge, this is the first method that achieves better than average human performance on this game without using demonstrations or having access to the underlying state of the game, and occasionally completes the first level.

@inproceedings{Burda2019_5_Exploration,
author = {Yuri Burda and Harrison Edwards and Amos Storkey and Oleg Klimov},
title = {Exploration by Random Network Distillation},
year = {2019},
month = {May},
booktitle = {International Conference on Learning Representations (ICLR)},
url = {https://arxiv.org/abs/1810.12894},
}

How to train your MAML

International Conference on Learning Representations (ICLR)

Antreas Antoniou, Harrison Edwards, Amos Storkey

The field of few-shot learning has recently seen substantial advancements. Most of these advancements came from casting few-shot learning as a meta-learning problem. Model Agnostic Meta Learning or MAML is currently one of the best approaches for few-shot learning via meta-learning. MAML is simple, elegant and very powerful; however, it has a variety of issues, such as being very sensitive to neural network architectures, often leading to instability during training, requiring arduous hyperparameter searches to stabilize training and achieve high generalization, and being very computationally expensive at both training and inference times. In this paper, we propose various modifications to MAML that not only stabilize the system, but also substantially improve the generalization performance, convergence speed and computational overhead of MAML, which we call MAML++.

@inproceedings{Antoniou2019_5_How,
author = {Antreas Antoniou and Harrison Edwards and Amos Storkey},
title = {How to train your {MAML}},
year = {2019},
month = {May},
booktitle = {International Conference on Learning Representations (ICLR)},
url = {https://arxiv.org/abs/1810.09502},
}

Large-Scale Study of Curiosity-Driven Learning

International Conference on Learning Representations (ICLR)

Yuri Burda, Harrison Edwards, Deepak Pathak, Amos Storkey, Trevor Darrell, Alexei A. Efros

Reinforcement learning algorithms rely on carefully engineering environment rewards that are extrinsic to the agent. However, annotating each environment with hand-designed, dense rewards is not scalable, motivating the need for developing reward functions that are intrinsic to the agent. Curiosity is a type of intrinsic reward function which uses prediction error as reward signal. In this paper: (a) We perform the first large-scale study of purely curiosity-driven learning, i.e. without any extrinsic rewards, across 54 standard benchmark environments, including the Atari game suite. Our results show surprisingly good performance, and a high degree of alignment between the intrinsic curiosity objective and the hand-designed extrinsic rewards of many game environments. (b) We investigate the effect of using different feature spaces for computing prediction error and show that random features are sufficient for many popular RL game benchmarks, but learned features appear to generalize better (e.g. to novel game levels in Super Mario Bros.). (c) We demonstrate limitations of the prediction-based rewards in stochastic setups.

@inproceedings{Burda2019_5_LargeScale,
author = {Yuri Burda and Harrison Edwards and Deepak Pathak and Amos Storkey and Trevor Darrell and Alexei A. Efros},
title = {Large-Scale Study of Curiosity-Driven Learning},
year = {2019},
month = {May},
booktitle = {International Conference on Learning Representations (ICLR)},
url = {https://arxiv.org/abs/1808.04355},
}

On the Relation Between the Sharpest Directions of DNN Loss and the SGD Step Length

International Conference on Learning Representations (ICLR)

Stanisław Jastrzębski, Zachary Kenton, Nicolas Ballas, Asja Fischer, Yoshua Bengio, Amos Storkey

Recent work has identified that using a high learning rate or a small batch size for Stochastic Gradient Descent (SGD) based training of deep neural networks encourages finding flatter minima of the training loss towards the end of training. Moreover, measures of the flatness of minima have been shown to correlate with good generalization performance. Extending this previous work, we investigate the loss curvature through the Hessian eigenvalue spectrum in the early phase of training and find an analogous bias: even at the beginning of training, a high learning rate or small batch size influences SGD to visit flatter loss regions. In addition, the evolution of the largest eigenvalues appears to always follow a similar pattern, with a fast increase in the early phase, and a decrease or stabilization thereafter, where the peak value is determined by the learning rate and batch size. Finally, we find that by altering the learning rate just in the direction of the eigenvectors associated with the largest eigenvalues, SGD can be steered towards regions which are an order of magnitude sharper but correspond to models with similar generalization, which suggests the curvature of the endpoint found by SGD is not predictive of its generalization properties.

@inproceedings{Jastrzębski2019_5_On,
author = {Stanisław Jastrzębski and Zachary Kenton and Nicolas Ballas and Asja Fischer and Yoshua Bengio and Amos Storkey},
title = {On the Relation Between the Sharpest Directions of {DNN} Loss and the {SGD} Step Length},
year = {2019},
month = {May},
booktitle = {International Conference on Learning Representations (ICLR)},
url = {https://arxiv.org/abs/1807.05031},
}

Distilling with Performance Enhanced Students

Jack Turner, Elliot J. Crowley, Valentin Radu, José Cano, Amos Storkey, Michael O'Boyle

The task of accelerating large neural networks on general purpose hardware has, in recent years, prompted the use of channel pruning to reduce network size. However, the efficacy of pruning based approaches has since been called into question. In this paper, we turn to distillation for model compression---specifically, attention transfer---and develop a simple method for discovering performance enhanced student networks. We combine channel saliency metrics with empirical observations of runtime performance to design more accurate networks for a given latency budget. We apply our methodology to residual and densely-connected networks, and show that we are able to find resource-efficient student networks on different hardware platforms while maintaining very high accuracy. These performance-enhanced student networks achieve up to 10% boosts in top-1 ImageNet accuracy over their channel-pruned counterparts for the same inference time.

@unpublished{Turner2019_3_Distilling,
author = {Jack Turner and Elliot J. Crowley and Valentin Radu and José Cano and Amos Storkey and Michael O'Boyle},
title = {Distilling with Performance Enhanced Students},
year = {2019},
month = {Mar},
institution = {University of Edinburgh},
url = {https://arxiv.org/abs/1810.10460},
}

Assume, Augment and Learn: Unsupervised Few-Shot Meta-Learning via Random Labels and Data Augmentation

Antreas Antoniou, Amos Storkey

The field of few-shot learning has been laboriously explored in the supervised setting, where per-class labels are available. On the other hand, the unsupervised few-shot learning setting, where no labels of any kind are required, has seen little investigation. We propose a method, named Assume, Augment and Learn or AAL, for generating few-shot tasks using unlabeled data. We randomly label a random subset of images from an unlabeled dataset to generate a support set. Then by applying data augmentation on the support set's images, and reusing the support set's labels, we obtain a target set. The resulting few-shot tasks can be used to train any standard meta-learning framework. Once trained, such a model can be directly applied on small real-labeled datasets without any changes or fine-tuning required. In our experiments, the learned models achieve good generalization performance in a variety of established few-shot learning tasks on Omniglot and Mini-ImageNet.

@unpublished{Antoniou2019_2_Assume,
author = {Antreas Antoniou and Amos Storkey},
title = {Assume, Augment and Learn: Unsupervised Few-Shot Meta-Learning via Random Labels and Data Augmentation},
year = {2019},
month = {Feb},
institution = {University of Edinburgh},
url = {https://arxiv.org/abs/1902.09884},
}

What Information Does a ResNet Compress?

Luke N. Darlow, Amos Storkey

The information bottleneck principle (Shwartz-Ziv & Tishby, 2017) suggests that SGD-based training of deep neural networks results in optimally compressed hidden layers, from an information theoretic perspective. However, this claim was established on toy data. The goal of the work we present here is to test whether the information bottleneck principle is applicable to a realistic setting using a larger and deeper convolutional architecture, a ResNet model. We trained PixelCNN++ models as inverse representation decoders to measure the mutual information between hidden layers of a ResNet and input image data, when trained for (1) classification and (2) autoencoding. We find that two stages of learning happen for both training regimes, and that compression does occur, even for an autoencoder. Sampling images by conditioning on hidden layers' activations offers an intuitive visualisation to understand what a ResNet learns to forget.

@unpublished{Darlow2019_1_What,
author = {Luke N. Darlow and Amos Storkey},
title = {What Information Does a {R}es{N}et Compress?},
year = {2019},
month = {Jan},
institution = {University of Edinburgh},
url = {https://arxiv.org/abs/2003.06254},
}

Pruning Neural Networks: Is it Time to Nip It in the Bud?

Workshop on Compact Deep Neural Networks with industrial applications, NeurIPS

Elliot J. Crowley, Jack Turner, Amos Storkey, Michael O'Boyle

Pruning is a popular technique for compressing a neural network: a large pre-trained network is fine-tuned while connections are successively removed. However, the value of pruning has largely evaded scrutiny. In this extended abstract, we examine residual networks obtained through Fisher-pruning and make two interesting observations. First, when time-constrained, it is better to train a simple, smaller network from scratch than prune a large network. Second, it is the architectures obtained through the pruning process---not the learnt weights---that prove valuable. Such architectures are powerful when trained from scratch. Furthermore, these architectures are easy to approximate without any further pruning: we can prune once and obtain a family of new, scalable network architectures for different memory requirements.

@inproceedings{Crowley2018_12_Pruning,
author = {Elliot J. Crowley and Jack Turner and Amos Storkey and Michael O'Boyle},
title = {Pruning Neural Networks: Is it Time to Nip It in the Bud?},
year = {2018},
month = {Dec},
booktitle = {Workshop on Compact Deep Neural Networks with industrial applications, NeurIPS},
url = {https://arxiv.org/abs/1810.04622},
}

Moonshine: Distilling with Cheap Convolutions

Advances in Neural Information Processing Systems (NeurIPS)

Elliot J. Crowley, Gavia Gray, Amos Storkey

Many engineers wish to deploy modern neural networks in memory-limited settings; but the development of flexible methods for reducing memory use is in its infancy, and there is little knowledge of the resulting cost-benefit. We propose structural model distillation for memory reduction using a strategy that produces a student architecture that is a simple transformation of the teacher architecture: no redesign is needed, and the same hyperparameters can be used. Using attention transfer, we provide Pareto curves/tables for distillation of residual networks with four benchmark datasets, indicating the memory versus accuracy payoff. We show that substantial memory savings are possible with very little loss of accuracy, and confirm that distillation provides student network performance that is better than training that student architecture directly on data.

@inproceedings{Crowley2018_12_Moonshine,
author = {Elliot J. Crowley and Gavia Gray and Amos Storkey},
title = {Moonshine: Distilling with Cheap Convolutions},
year = {2018},
month = {Dec},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
url = {https://arxiv.org/abs/1711.02613},
}

Dilated DenseNets for Relational Reasoning

Antreas Antoniou, Agnieszka Słowik, Elliot J. Crowley, Amos Storkey

Despite their impressive performance in many tasks, deep neural networks often struggle at relational reasoning. This has recently been remedied with the introduction of a plug-in relational module that considers relations between pairs of objects. Unfortunately, this is combinatorially expensive. In this extended abstract, we show that a DenseNet incorporating dilated convolutions excels at relational reasoning on the Sort-of-CLEVR dataset, allowing us to forgo this relational module and its associated expense.

@unpublished{Antoniou2018_11_Dilated,
author = {Antreas Antoniou and Agnieszka Słowik and Elliot J. Crowley and Amos Storkey},
title = {Dilated {D}ense{N}ets for Relational Reasoning},
year = {2018},
month = {Nov},
institution = {University of Edinburgh},
url = {https://arxiv.org/abs/1811.00410},
}

CINIC-10 is not ImageNet or CIFAR-10

Luke N. Darlow, Elliot J. Crowley, Antreas Antoniou, Amos Storkey

In this brief technical report we introduce the CINIC-10 dataset as a plug-in extended alternative for CIFAR-10. It was compiled by combining CIFAR-10 with images selected and downsampled from the ImageNet database. We present the approach to compiling the dataset, illustrate the example images for different classes, give pixel distributions for each part of the repository, and give some standard benchmarks for well-known models. Details for download, usage, and compilation can be found in the associated GitHub repository.

@techreport{Darlow2018_10_CINIC10,
author = {Luke N. Darlow and Elliot J. Crowley and Antreas Antoniou and Amos Storkey},
title = {{CINIC-10} is not {I}mage{N}et or {CIFAR-10}},
year = {2018},
month = {Oct},
institution = {School of Informatics, University of Edinburgh},
number = {EDI-INF-ANC-1802},
url = {https://arxiv.org/abs/1810.03505},
}

GINN: Geometric Illustration of Neural Networks

Luke N. Darlow, Amos Storkey

This informal technical report details the geometric illustration of decision boundaries for ReLU units in a three-layer fully connected neural network. The network is designed and trained to predict pixel intensity from an (x, y) input location. The Geometric Illustration of Neural Networks (GINN) tool was built to visualise and track the points at which ReLU units switch from being active to off (or vice versa) as the network undergoes training. Several phenomena were observed and are discussed herein.

@techreport{Darlow2018_10_GINN,
author = {Luke N. Darlow and Amos Storkey},
title = {{GINN}: Geometric Illustration of Neural Networks},
year = {2018},
month = {Oct},
institution = {School of Informatics, University of Edinburgh},
number = {EDI-INF-ANC-1801},
url = {https://arxiv.org/abs/1810.01860},
}

Augmenting Image Classifiers using Data Augmentation Generative Adversarial Networks

International Conference on Artificial Neural Networks (ICANN)

Antreas Antoniou, Amos Storkey, Harrison Edwards

Effective training of neural networks requires much data. In the low-data regime, parameters are underdetermined, and learnt networks generalise poorly. Data Augmentation alleviates this by using existing data more effectively, but standard data augmentation produces only limited plausible alternative data. Given the potential to generate a much broader set of augmentations, we design and train a generative model to do data augmentation. The model, based on image conditional Generative Adversarial Networks, uses data from a source domain and learns to take a data item and augment it by generating other within-class data items. As this generative process does not depend on the classes themselves, it can be applied to novel unseen classes. We demonstrate that a Data Augmentation Generative Adversarial Network (DAGAN) augments classifiers well on Omniglot, EMNIST and VGG-Face.

@inproceedings{Antoniou2018_10_Augmenting,
author = {Antreas Antoniou and Amos Storkey and Harrison Edwards},
title = {Augmenting Image Classifiers using Data Augmentation Generative Adversarial Networks},
year = {2018},
month = {Oct},
booktitle = {International Conference on Artificial Neural Networks (ICANN)},
url = {https://www.bayeswatch.com/assets/papers/Augmenting_Image_Classifiers_using_Data_Augmentation_Generative_Adversarial_Networks.pdf},
}

Three Factors Influencing Minima in SGD

International Conference on Artificial Neural Networks (ICANN)

Stanisław Jastrzębski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, Amos Storkey

We investigate the dynamical and convergent properties of stochastic gradient descent (SGD) applied to Deep Neural Networks (DNNs). Characterizing the relation between learning rate, batch size and the properties of the final minima, such as width or generalization, remains an open question. In order to tackle this problem we investigate the previously proposed approximation of SGD by a stochastic differential equation (SDE). We theoretically argue that three factors - learning rate, batch size and gradient covariance - influence the minima found by SGD. In particular we find that the ratio of learning rate to batch size is a key determinant of SGD dynamics and of the width of the final minima, and that higher values of the ratio lead to wider minima and often better generalization. We confirm these findings experimentally. Further, we include experiments which show that learning rate schedules can be replaced with batch size schedules and that the ratio of learning rate to batch size is an important factor influencing the memorization process.

@inproceedings{Jastrzębski2018_10_Three,
author = {Stanisław Jastrzębski and Zachary Kenton and Devansh Arpit and Nicolas Ballas and Asja Fischer and Yoshua Bengio and Amos Storkey},
title = {Three Factors Influencing Minima in {SGD}},
year = {2018},
month = {Oct},
booktitle = {International Conference on Artificial Neural Networks (ICANN)},
url = {http://arxiv.org/abs/1711.04623},
}

Characterising Across-Stack Optimisations for Deep Convolutional Neural Networks

International Symposium on Workload Characterization (IISWC)

Jack Turner, José Cano, Valentin Radu, Elliot J. Crowley, Michael O'Boyle, Amos Storkey

Convolutional Neural Networks (CNNs) are extremely computationally demanding, presenting a large barrier to their deployment on resource-constrained devices. Since such systems are where some of their most useful applications lie (e.g. obstacle detection for mobile robots, vision-based medical assistive technology), significant bodies of work from both machine learning and systems communities have attempted to provide optimisations that will make CNNs available to edge devices. In this paper we unify the two viewpoints in a Deep Learning Inference Stack and take an across-stack approach by implementing and evaluating the most common neural network compression techniques (weight pruning, channel pruning, and quantisation) and optimising their parallel execution with a range of programming approaches (OpenMP, OpenCL) and hardware architectures (CPU, GPU). We provide comprehensive Pareto curves to instruct trade-offs under constraints of accuracy, execution time, and memory space.

@inproceedings{Turner2018_9_Characterising,
author = {Jack Turner and José Cano and Valentin Radu and Elliot J. Crowley and Michael O'Boyle and Amos Storkey},
title = {Characterising Across-Stack Optimisations for Deep Convolutional Neural Networks},
year = {2018},
month = {Sep},
booktitle = {International Symposium on Workload Characterization (IISWC)},
url = {https://arxiv.org/abs/1809.07196},
}

Asymptotically Exact Inference in Differentiable Generative Models

Electronic Journal of Statistics

Matt Graham, Amos Storkey

Many generative models can be expressed as a differentiable function applied to input variables sampled from a known probability distribution. This framework includes both the generative component of learned parametric models such as variational autoencoders and generative adversarial networks, and also procedurally defined simulator models which involve only differentiable operations. Though the distribution on the input variables to such models is known, often the distribution on the output variables is only implicitly defined. We present a method for performing efficient Markov chain Monte Carlo inference in such models when conditioning on observations of the model output. For some models this offers an asymptotically exact inference method where approximate Bayesian computation might otherwise be employed. We use the intuition that computing conditional expectations is equivalent to integrating over a density defined on the manifold corresponding to the set of inputs consistent with the observed outputs. This motivates the use of a constrained variant of Hamiltonian Monte Carlo which leverages the smooth geometry of the manifold to move between inputs exactly consistent with observations. We validate the method by performing inference experiments in a diverse set of models.

@article{Graham2017_12_Asymptotically,
author = {Matt Graham and Amos Storkey},
title = {Asymptotically Exact Inference in Differentiable Generative Models},
year = {2017},
month = {Dec},
journal = {Electronic Journal of Statistics},
volume = {11},
url = {http://dx.doi.org/10.1214/17-EJS1340SI},
}

Data Augmentation Generative Adversarial Networks

Antreas Antoniou, Amos Storkey, Harrison Edwards

Effective training of neural networks requires much data. In the low-data regime, parameters are underdetermined, and learnt networks generalise poorly. Data Augmentation alleviates this by using existing data more effectively. However standard data augmentation produces only limited plausible alternative data. Given there is potential to generate a much broader set of augmentations, we design and train a generative model to do data augmentation. The model, based on image conditional Generative Adversarial Networks, takes data from a source domain and learns to take any data item and generalise it to generate other within-class data items. As this generative process does not depend on the classes themselves, it can be applied to novel unseen classes of data. We show that a Data Augmentation Generative Adversarial Network (DAGAN) augments standard vanilla classifiers well. We also show a DAGAN can enhance few-shot learning systems such as Matching Networks. We demonstrate these approaches on Omniglot, on EMNIST having learnt the DAGAN on Omniglot, and VGG-Face data. In our experiments we can see over 13% increase in accuracy in the low-data regime experiments in Omniglot (from 69% to 82%), EMNIST (73.9% to 76%) and VGG-Face (4.5% to 12%); in Matching Networks for Omniglot we observe an increase of 0.5% (from 96.9% to 97.4%) and an increase of 1.8% in EMNIST (from 59.5% to 61.3%).

@unpublished{Antoniou2017_11_Data,
author = {Antreas Antoniou and Amos Storkey and Harrison Edwards},
title = {Data Augmentation Generative Adversarial Networks},
year = {2017},
month = {Nov},
institution = {University of Edinburgh},
url = {https://arxiv.org/abs/1711.04340},
}

Continuously Tempered Hamiltonian Monte Carlo

Conference on Uncertainty in Artificial Intelligence (UAI)

Matt Graham, Amos Storkey

Hamiltonian Monte Carlo (HMC) is a powerful Markov chain Monte Carlo (MCMC) method for performing approximate inference in complex probabilistic models of continuous variables. In common with many MCMC methods, however, the standard HMC approach performs poorly in distributions with multiple isolated modes. We present a method for augmenting the Hamiltonian system with an extra continuous temperature control variable which allows the dynamic to bridge between sampling a complex target distribution and a simpler unimodal base distribution. This augmentation both helps improve mixing in multimodal targets and allows the normalisation constant of the target distribution to be estimated. The method is simple to implement within existing HMC code, requiring only a standard leapfrog integrator. We demonstrate experimentally that the method is competitive with annealed importance sampling and simulated tempering methods at sampling from challenging multimodal distributions and estimating their normalising constants.

@inproceedings{Graham2017_8_Continuously,
author = {Matt Graham and Amos Storkey},
title = {Continuously Tempered {H}amiltonian {M}onte {C}arlo},
year = {2017},
month = {Aug},
booktitle = {Conference on Uncertainty in Artificial Intelligence (UAI)},
url = {https://arxiv.org/abs/1704.03338},
}

Asymptotically Exact Inference in Differentiable Generative Models

International Conference on Artificial Intelligence and Statistics (AISTATS)

Matt Graham, Amos Storkey

Many generative models can be expressed as a differentiable function of random inputs drawn from some simple probability density. This framework includes both deep generative architectures such as Variational Autoencoders and a large class of procedurally defined simulator models. We present a method for performing efficient MCMC inference in such models when conditioning on observations of the model output. For some models this offers an asymptotically exact inference method where Approximate Bayesian Computation might otherwise be employed. We use the intuition that inference corresponds to integrating a density across the manifold corresponding to the set of inputs consistent with the observed outputs. This motivates the use of a constrained variant of Hamiltonian Monte Carlo which leverages the smooth geometry of the manifold to coherently move between inputs exactly consistent with observations. We validate the method by performing inference tasks in a diverse set of models.

@inproceedings{Graham2017_4_Asymptotically,
author = {Matt Graham and Amos Storkey},
title = {Asymptotically Exact Inference in Differentiable Generative Models},
year = {2017},
month = {Apr},
booktitle = {International Conference on Artificial Intelligence and Statistics (AISTATS)},
url = {https://arxiv.org/abs/1605.07826},
}

Towards a Neural Statistician

International Conference on Learning Representations (ICLR)

Harrison Edwards, Amos Storkey

An efficient learner is one who reuses what they already know to tackle a new problem. For a machine learner, this means understanding the similarities amongst datasets. In order to do this, one must take seriously the idea of working with datasets, rather than datapoints, as the key objects to model. Towards this goal, we demonstrate an extension of a variational autoencoder that can learn a method for computing representations, or statistics, of datasets in an unsupervised fashion. The network is trained to produce statistics that encapsulate a generative model for each dataset. Hence the network enables efficient learning from new datasets for both unsupervised and supervised tasks. We show that we are able to learn statistics that can be used for: clustering datasets, transferring generative models to new datasets, selecting representative samples of datasets and classifying previously unseen classes. We refer to our model as a neural statistician, and by this we mean a neural network that can learn to compute summary statistics of datasets without supervision.

@inproceedings{Edwards2017_4_Towards,
author = {Harrison Edwards and Amos Storkey},
title = {Towards a Neural Statistician},
year = {2017},
month = {Apr},
booktitle = {International Conference on Learning Representations (ICLR)},
url = {https://arxiv.org/abs/1606.02185},
}

Resource-Efficient Feature Gathering at Test Time

Workshop on Reliable Machine Learning in the Wild, NeurIPS

Gavia Gray, Amos Storkey

Data collection is costly. A machine learning model requires input data to produce an output prediction, but that input is often not cost-free to produce accurately. For example, in the social sciences, it may require collecting samples; in signal processing it may involve investing in expensive accurate sensors. The problem of allocating a budget across the collection of different input variables is largely overlooked in machine learning, but is important under real-world constraints. Given that the noise level on each input feature depends on how much resource has been spent gathering it, and given a fixed budget, we ask how to allocate that budget to maximise our expected reward. At the same time, the optimal model parameters will depend on the choice of budget allocation, and so searching the space of possible budgets is costly. Using doubly stochastic gradient methods we propose a solution that allows expressive models and massive datasets, while still providing an interpretable budget allocation for feature gathering at test time.

@inproceedings{Gray2016_12_ResourceEfficient,
author = {Gavia Gray and Amos Storkey},
title = {Resource-Efficient Feature Gathering at Test Time},
year = {2016},
month = {Dec},
booktitle = {Workshop on Reliable Machine Learning in the Wild, NeurIPS},
url = {/assets/papers/resource-efficient-wildml16.pdf},
}

Censoring Representations with an Adversary

International Conference on Learning Representations (ICLR)

Harrison Edwards, Amos Storkey

In practice, there are often explicit constraints on what representations or decisions are acceptable in an application of machine learning. For example, it may be a legal requirement that a decision must not favour a particular group. Alternatively, it can be that the representation of data must not have identifying information. We address these two related issues by learning flexible representations that minimize the capability of an adversarial critic. This adversary is trying to predict the relevant sensitive variable from the representation, and so minimizing the performance of the adversary ensures there is little or no information in the representation about the sensitive variable. We demonstrate this adversarial approach on two problems: making decisions free from discrimination and removing private information from images. We formulate the adversarial model as a minimax problem, and optimize that minimax objective using a stochastic gradient alternate min-max optimizer. We demonstrate the ability to provide discriminant free representations for standard test problems, and compare with previous state of the art methods for fairness, showing statistically significant improvement across most cases. The flexibility of this method is shown via a novel problem: removing annotations from images, from unaligned training examples of annotated and unannotated images, and with no a priori knowledge of the form of annotation provided to the model.

@inproceedings{Edwards2016_3_Censoring,
author = {Harrison Edwards and Amos Storkey},
title = {Censoring Representations with an Adversary},
year = {2016},
month = {Mar},
booktitle = {International Conference on Learning Representations (ICLR)},
url = {https://arxiv.org/abs/1511.05897},
}

Evaluation of a Pre-surgical Functional MRI Workflow: From Data Acquisition to Reporting

International Journal of Medical Informatics

Cyril Pernet, Krzysztof J Gorgolewski, Dominic Job, David Rodriguez, Amos J Storkey, Ian Whittle, Joanna Wardlaw

Purpose: Present and assess clinical protocols and associated automated workflow for pre-surgical functional magnetic resonance imaging in brain tumor patients. Methods: Protocols were validated using a single-subject reliability approach based on 10 healthy control subjects. Results from the automated workflow were evaluated in 9 patients with brain tumors, comparing fMRI results to direct electrical stimulation (DES) of the cortex. Results: Using a new approach to compute single-subject fMRI reliability in controls, we show that not all tasks are suitable in the clinical context, even if they show meaningful results at the group level. Comparison of the fMRI results from patients to DES showed good correspondence between techniques (odds ratio 36). Conclusion: Providing that validated and reliable fMRI protocols are used, fMRI can accurately delineate eloquent areas, thus providing an aid to medical decision regarding brain tumor surgery.

@article{Pernet2016_2_Evaluation,
author = {Cyril Pernet and Krzysztof J Gorgolewski and Dominic Job and David Rodriguez and Amos J Storkey and Ian Whittle and Joanna Wardlaw},
title = {Evaluation of a Pre-surgical Functional {MRI} Workflow: From Data Acquisition to Reporting},
year = {2016},
month = {Feb},
journal = {International Journal of Medical Informatics},
volume = {86},
url = {http://homepages.inf.ed.ac.uk/amos/publications/Pernet_al_Evaluation_Pre_Surgical.pdf},
}

Stochastic Parallel Block Coordinate Descent for Large-scale Saddle Point Problems

AAAI Conference on Artificial Intelligence (AAAI)

Zhanxing Zhu, Amos Storkey

We consider convex-concave saddle point problems with a separable structure and non-strongly convex functions. We propose an efficient stochastic block coordinate descent method using adaptive primal-dual updates, which enables flexible parallel optimization for large-scale problems. Our method combines the efficiency and flexibility of block coordinate descent methods with the simplicity of primal-dual methods, while utilizing the structure of the separable convex-concave saddle point problem. It is capable of solving a wide range of machine learning applications, including robust principal component analysis, Lasso, and feature selection by group Lasso. Theoretically and empirically, we demonstrate significantly better performance than state-of-the-art methods in all these applications.

@inproceedings{Zhu2016_2_Stochastic,
author = {Zhanxing Zhu and Amos Storkey},
title = {Stochastic Parallel Block Coordinate Descent for Large-scale Saddle Point Problems},
year = {2016},
month = {Feb},
booktitle = {AAAI Conference on Artificial Intelligence (AAAI)},
url = {https://arxiv.org/abs/1511.07294},
}

Covariance-Controlled Adaptive Langevin Thermostat for Large-Scale Bayesian Sampling

Advances in Neural Information Processing Systems (NeurIPS)

Xiaocheng Shang, Zhanxing Zhu, Benedict Leimkuhler, Amos Storkey

Monte Carlo sampling for Bayesian posterior inference is a common approach used in machine learning. The Markov Chain Monte Carlo procedures that are used are often discrete-time analogues of associated stochastic differential equations (SDEs). These SDEs are guaranteed to leave invariant the required posterior distribution. An area of current research addresses the computational benefits of stochastic gradient methods in this setting. Existing techniques rely on estimating the variance or covariance of the subsampling error, and typically assume constant variance. In this article, we propose a covariance-controlled adaptive Langevin thermostat that can effectively dissipate parameter-dependent noise while maintaining a desired target distribution. The proposed method achieves a substantial speedup over popular alternative schemes for large-scale machine learning applications.

@inproceedings{Shang2015_12_CovarianceControlled,
author = {Xiaocheng Shang and Zhanxing Zhu and Benedict Leimkuhler and Amos Storkey},
title = {Covariance-Controlled Adaptive {L}angevin Thermostat for Large-Scale {B}ayesian Sampling},
year = {2015},
month = {Dec},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
url = {https://arxiv.org/abs/1510.08692},
}

Adaptive Stochastic Primal-dual Coordinate Descent for Separable Saddle Point Problems

Joint European Conference on Machine Learning and Knowledge Discovery in Databases

Zhanxing Zhu, Amos Storkey

We consider a generic convex-concave saddle point problem with a separable structure, a form that covers a wide range of machine learning applications. Under this problem structure, we follow the framework of primal-dual updates for saddle point problems, and incorporate stochastic block coordinate descent with adaptive stepsizes into this framework. We theoretically show that our proposal of adaptive stepsizes potentially achieves a sharper linear convergence rate compared with the existing methods. Additionally, since we can select a “mini-batch” of block coordinates to update, our method is also amenable to parallel processing for large-scale data. We apply the proposed method to regularized empirical risk minimization and show that it performs comparably or, more often, better than state-of-the-art methods on both synthetic and real-world data sets.

@inproceedings{Zhu2015_8_Adaptive,
author = {Zhanxing Zhu and Amos Storkey},
title = {Adaptive Stochastic Primal-dual Coordinate Descent for Separable Saddle Point Problems},
year = {2015},
month = {Aug},
booktitle = {Joint European Conference on Machine Learning and Knowledge Discovery in Databases},
url = {https://arxiv.org/abs/1506.04093},
}

Brain White Matter Structure and Information Processing Speed in Healthy Older Age

Brain Structure and Function

Ksenia A. Kuznetsova, Susana Munoz Maniega, Stuart J. Ritchie, Simon R. Cox, Amos J. Storkey, John M. Starr, Joanna M. Wardlaw, Ian J. Deary, Mark E. Bastin

Cognitive decline, especially the slowing of information processing speed, is associated with normal ageing. This decline may be due to brain cortico-cortical disconnection caused by age-related white matter deterioration. We present results from a large, narrow age range cohort of generally healthy, community-dwelling subjects in their seventies who also had their cognitive ability tested in youth (age 11 years). We investigate associations between older age brain white matter structure, several measures of information processing speed and childhood cognitive ability in 581 subjects. Analysis of diffusion tensor MRI data using Tract-based Spatial Statistics (TBSS) showed that all measures of information processing speed, as well as a general speed factor composed from these tests (gspeed), were significantly associated with fractional anisotropy (FA) across the white matter skeleton rather than in specific tracts. Cognitive ability measured at age 11 years was not associated with older age white matter FA, except for the gspeed-independent components of several individual processing speed tests. These results indicate that quicker and more efficient information processing requires global connectivity in older age, and that associations between white matter FA and information processing speed (both individual test scores and gspeed), unlike some other aspects of later life brain structure, are generally not accounted for by cognitive ability measured in youth.

@article{Kuznetsova2015_8_Brain,
author = {Ksenia A. Kuznetsova and Susana Munoz Maniega and Stuart J. Ritchie and Simon R. Cox and Amos J. Storkey and John M. Starr and Joanna M. Wardlaw and Ian J. Deary and Mark E. Bastin},
title = {Brain White Matter Structure and Information Processing Speed in Healthy Older Age},
year = {2015},
month = {Aug},
journal = {Brain Structure and Function},
volume = {},
url = {https://doi.org/10.1007/s00429-015-1097-5},
}

Training Deep Convolutional Neural Networks to Play Go

International Conference on Machine Learning (ICML)

Chris Clark, Amos Storkey

Mastering the game of Go has remained a long-standing challenge to the field of AI. Modern computer Go systems rely on processing millions of possible future positions to play well, but intuitively a stronger and more 'humanlike' way to play the game would be to rely on pattern recognition abilities rather than brute force computation. Following this sentiment, we train deep convolutional neural networks to play Go by training them to predict the moves made by expert Go players. To solve this problem we introduce a number of novel techniques, including a method of tying weights in the network to 'hard code' symmetries that are expected to exist in the target function, and demonstrate in an ablation study that they considerably improve performance. Our final networks are able to achieve move prediction accuracies of 41.1% and 44.4% on two different Go datasets, surpassing previous state of the art on this task by significant margins. Additionally, while previous move prediction programs have not yielded strong Go playing programs, we show that the networks trained in this work acquired high levels of skill. Our convolutional neural networks can consistently defeat the well-known Go program GNU Go, indicating that they are state of the art among programs that do not use Monte Carlo Tree Search. They are also able to win some games against the state-of-the-art Go playing program Fuego while using a fraction of the play time. This success at playing Go indicates that high-level principles of the game were learned.

@inproceedings{Clark2015_6_Training,
author = {Chris Clark and Amos Storkey},
title = {Training Deep Convolutional Neural Networks to Play {G}o},
year = {2015},
month = {Jun},
booktitle = {International Conference on Machine Learning (ICML)},
url = {https://arxiv.org/abs/1412.3409},
}

Reduced Structural Connectivity within a Prefrontal-Motor-Subcortical Network in Amyotrophic Lateral Sclerosis

Journal of Magnetic Resonance Imaging

Colin R. Buchanan, Leslie D. Pettit, Amos Storkey, Sharon Abrahams, Mark E. Bastin

Background: To investigate white matter structural connectivity changes associated with amyotrophic lateral sclerosis (ALS) using network analysis and compare the results with those obtained using standard voxel-based methods, specifically Tract-based Spatial Statistics (TBSS). Methods: MRI data were acquired from 30 patients with ALS and 30 age-matched healthy controls. For each subject, 85 grey matter regions (network nodes) were identified from high resolution structural MRI, and network connections formed from the white matter tracts generated by diffusion MRI and probabilistic tractography. Whole-brain networks were constructed using strong constraints on anatomical plausibility and a weighting reflecting tract-averaged fractional anisotropy (FA). Results: Analysis using Network-based Statistics (NBS), without a priori selected regions, identified an impaired motor-frontal-subcortical subnetwork (10 nodes and 12 bidirectional connections), consistent with upper motor neuron pathology, in the ALS group compared with the controls (P = 0.020). Reduced FA in three of the impaired network connections, which involved fibers of the corticospinal tract, correlated with rate of disease progression (P ≤ 0.024). A novel network-tract comparison revealed that the connections involved in the affected network had a strong correspondence (mean overlap of 86.2%) with white matter tracts identified as having reduced FA compared with the control group using TBSS. Conclusion: These findings suggest that white matter degeneration in ALS is strongly linked to the motor cortex, and that impaired structural networks identified using NBS have a strong correspondence to affected white matter tracts identified using more conventional voxel-based methods.

@article{Bastin2015_6_Reduced,
author = {Colin R. Buchanan and Leslie D. Pettit and Amos Storkey and Sharon Abrahams and Mark E. Bastin},
title = {Reduced Structural Connectivity within a Prefrontal-Motor-Subcortical Network in Amyotrophic Lateral Sclerosis},
year = {2015},
month = {Jun},
journal = {Journal of Magnetic Resonance Imaging},
volume = {41},
url = {https://doi.org/10.1002/jmri.24695},
}

Aggregation Under Bias: Renyi Divergence Aggregation and its Implementation via Machine Learning Markets

Proceedings of ECML/PKDD 2015

Amos Storkey, Zhanxing Zhu

@inproceedings{Storkey2015_6_Aggregation,
author = {Amos Storkey and Zhanxing Zhu},
title = {Aggregation Under Bias: {R}enyi Divergence Aggregation and its Implementation via Machine Learning Markets},
year = {2015},
month = {Jun},
booktitle = {Proceedings of ECML/PKDD 2015},
url = {https://arxiv.org/abs/1506.04093},
}

The Supervised Hierarchical Dirichlet process

IEEE Transactions on Pattern Analysis and Machine Intelligence (Special Issue on Bayesian Nonparametrics)

Andrew M. Dai, Amos Storkey

We propose the supervised hierarchical Dirichlet process (sHDP), a nonparametric generative model for the joint distribution of a group of observations and a response variable directly associated with that whole group. We compare the sHDP with another leading method for regression on grouped data, the supervised latent Dirichlet allocation (sLDA) model. We evaluate our method on two real-world classification problems and two real-world regression problems. Bayesian nonparametric regression models based on the Dirichlet process, such as the Dirichlet process-generalised linear models (DP-GLM) have previously been explored; these models allow flexibility in modelling nonlinear relationships. However, until now, Hierarchical Dirichlet Process (HDP) mixtures have not seen significant use in supervised problems with grouped data since a straightforward application of the HDP on the grouped data results in learnt clusters that are not predictive of the responses. The sHDP solves this problem by allowing for clusters to be learnt jointly from the group structure and from the label assigned to each group.

@article{Dai2015_4_Supervised,
author = {Andrew M. Dai and Amos Storkey},
title = {The Supervised Hierarchical {D}irichlet process},
year = {2015},
month = {Apr},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence (Special Issue on Bayesian Nonparametrics)},
volume = {37},
url = {https://arxiv.org/abs/1412.5236},
}

Multi-period Trading Prediction Markets with Connections to Machine Learning

International Conference on Machine Learning (ICML)

Jinli Hu, Amos Storkey

We present a new model for prediction markets, in which we use risk measures to model agents and introduce a market maker to describe the trading process. This specific choice of modelling tools brings us mathematical convenience. The analysis shows that the whole market effectively approaches a global objective, even though the market is designed such that each agent only cares about its own goal. Additionally, the market dynamics provide a sensible algorithm for optimising the global objective. An intimate connection between machine learning and our markets is thus established, such that we could 1) analyse a market by applying machine learning methods to the global objective, and 2) solve machine learning problems by setting up and running certain markets.

@inproceedings{Hu2014_6_Multiperiod,
author = {Jinli Hu and Amos Storkey},
title = {Multi-period Trading Prediction Markets with Connections to Machine Learning},
year = {2014},
month = {Jun},
booktitle = {International Conference on Machine Learning (ICML)},
url = {https://arxiv.org/abs/1403.0648},
}

Series Expansion Approximations of Brownian Motion for Non-Linear Kalman Filtering of Diffusion Processes

IEEE Transactions on Signal Processing

Simon Lyons, Simo Särkkä, Amos Storkey

In this paper, we describe a novel application of sigma-point methods to continuous-discrete filtering. In principle, the nonlinear continuous-discrete filtering problem can be solved exactly. In practice, the solution contains terms that are computationally intractable. Assumed density filtering methods attempt to match statistics of the filtering distribution to some set of more tractable probability distributions. We describe a novel method that decomposes the Brownian motion driving the signal in a generalised Fourier series, which is truncated after a number of terms. This approximation to Brownian motion can be described using a relatively small number of Fourier coefficients, and allows us to compute statistics of the filtering distribution with a single application of a sigma-point method. Assumed density filters that exist in the literature usually rely on discretisation of the signal dynamics followed by iterated application of a sigma-point transform (or a limiting case thereof). Iterating the transform in this manner can lead to loss of information about the filtering distribution in highly nonlinear settings. We demonstrate that our method is better equipped to cope with such problems.

@article{Lyons2014_3_Series,
author = {Simon Lyons and Simo Särkkä and Amos Storkey},
title = {Series Expansion Approximations of {B}rownian Motion for Non-Linear {K}alman Filtering of Diffusion Processes},
year = {2014},
month = {Mar},
journal = {IEEE Transactions on Signal Processing},
volume = {62},
url = {https://arxiv.org/abs/1302.5324},
}

Test-Retest Reliability of Structural Brain Networks from Diffusion MRI

NeuroImage

Colin R Buchanan, Cyril R Pernet, Krzysztof J Gorgolewski, Amos Storkey, Mark E Bastin

Structural brain networks constructed from diffusion MRI (dMRI) and tractography have been demonstrated in healthy volunteers and more recently in various disorders affecting brain connectivity. However, few studies have addressed the reproducibility of the resulting networks. We measured the test–retest properties of such networks by varying several factors affecting network construction using ten healthy volunteers who underwent a dMRI protocol at 1.5 T on two separate occasions. Each T1-weighted brain was parcellated into 84 regions-of-interest and network connections were identified using dMRI and two alternative tractography algorithms, two alternative seeding strategies, a white matter waypoint constraint and three alternative network weightings. In each case, four common graph-theoretic measures were obtained. Network properties were assessed both node-wise and per network in terms of the intraclass correlation coefficient (ICC) and by comparing within- and between-subject differences. Our findings suggest that test–retest performance was improved when: 1) seeding from white matter, rather than grey; and 2) using probabilistic tractography with a two-fibre model and sufficient streamlines, rather than deterministic tensor tractography. In terms of network weighting, a measure of streamline density produced better test–retest performance than tract-averaged diffusion anisotropy, although it remains unclear which is a more accurate representation of the underlying connectivity. For the best performing configuration, the global within-subject differences were between 3.2% and 11.9% with ICCs between 0.62 and 0.76. The mean nodal within-subject differences were between 5.2% and 24.2% with mean ICCs between 0.46 and 0.62. For 83.3% (70/84) of nodes, the within-subject differences were smaller than between-subject differences. Overall, these findings suggest that whilst current techniques produce networks capable of characterising the genuine between-subject differences in connectivity, future work must be undertaken to improve network reliability.

@article{Buchanan2014_2_TestRetest,
author = {Colin R Buchanan and Cyril R Pernet and Krzysztof J Gorgolewski and Amos Storkey and Mark E Bastin},
title = {Test-Retest Reliability of Structural Brain Networks from Diffusion {MRI}},
year = {2014},
month = {Feb},
journal = {NeuroImage},
volume = {86},
url = {https://doi.org/10.1016/j.neuroimage.2013.09.054},
}

Bayesian Inference in Sparse Gaussian Graphical Models

Peter Orchard, Felix Agakov, Amos Storkey

One of the fundamental tasks of science is to find explainable relationships between observed phenomena. One approach to this task that has received attention in recent years is based on probabilistic graphical modelling with sparsity constraints on model structures. In this paper, we describe two new approaches to Bayesian inference of sparse structures of Gaussian graphical models (GGMs). One is based on a simple modification of the cutting-edge block Gibbs sampler for sparse GGMs, which results in significant computational gains in high dimensions. The other method is based on a specific construction of the Hamiltonian Monte Carlo sampler, which results in further significant improvements. We compare our fully Bayesian approaches with the popular regularisation-based graphical LASSO, and demonstrate significant advantages of the Bayesian treatment under the same computing costs. We apply the methods to a broad range of simulated data sets, and a real-life financial data set.

@techreport{Orchard2013_9_Bayesian,
author = {Peter Orchard and Felix Agakov and Amos Storkey},
title = {{B}ayesian Inference in Sparse {G}aussian Graphical Models},
year = {2013},
month = {Sep},
institution = {School of Informatics, University of Edinburgh},
number = {1},
url = {https://arxiv.org/abs/1309.7311},
}

Image Analysis for Cosmology: Results from the GREAT10 Star Challenge

Astrophysical Journal Supplement Series

Tom D. Kitchin, B. Rowe, M. Gill, C. Heymans, R. Massey, D. Witherick, F. Courbin, K. Georgatzis, M. Gentile, D. Gruen, M. Kilbinger, G.L. Li, A.P. Mariglis, G. Meylan, Amos Storkey, B. Xin

We present the results from the first public blind point-spread function (PSF) reconstruction challenge, the GRavitational lEnsing Accuracy Testing 2010 (GREAT10) Star Challenge. Reconstruction of a spatially varying PSF, sparsely sampled by stars, at non-star positions is a critical part in the image analysis for weak lensing where inaccuracies in the modeled ellipticity $e$ and size $R^2$ can impact the ability to measure the shapes of galaxies. This is of importance because weak lensing is a particularly sensitive probe of dark energy and can be used to map the mass distribution of large scale structure. Participants in the challenge were presented with 27,500 stars over 1300 images subdivided into 26 sets, where in each set a category change was made in the type or spatial variation of the PSF. Thirty submissions were made by nine teams. The best methods reconstructed the PSF with an accuracy of $\sigma(e) \approx 2.5 \times 10^{-4}$ and $\sigma(R^2)/R^2 \approx 7.4 \times 10^{-4}$. For a fixed pixel scale, narrower PSFs were found to be more difficult to model than larger PSFs, and the PSF reconstruction was severely degraded with the inclusion of an atmospheric turbulence model (although this result is likely to be a strong function of the amplitude of the turbulence power spectrum).

@article{Kitchin2013_6_Image,
author = {Tom D. Kitchin and B. Rowe and M. Gill and C. Heymans and R. Massey and D. Witherick and F. Courbin and K. Georgatzis and M. Gentile and D. Gruen and M. Kilbinger and G.L. Li and A.P. Mariglis and G. Meylan and Amos Storkey and B. Xin},
title = {Image Analysis for Cosmology: Results from the {GREAT10} Star Challenge},
year = {2013},
month = {Jun},
journal = {Astrophysical Journal Supplement Series},
volume = {205},
url = {https://doi.org/10.1088/0067-0049/205/2/12},
}

Charles Bonnet Syndrome: Evidence for a Generative Model in the Cortex?

PLOS Computational Biology

David P. Reichert, Peggy Series, Amos Storkey

Several theories propose that the cortex implements an internal model to explain, predict, and learn about sensory data, but the nature of this model is unclear. One condition that could be highly informative here is Charles Bonnet syndrome (CBS), where loss of vision leads to complex, vivid visual hallucinations of objects, people, and whole scenes. CBS could be taken as indication that there is a generative model in the brain, specifically one that can synthesise rich, consistent visual representations even in the absence of actual visual input. The processes that lead to CBS are poorly understood. Here, we argue that a model recently introduced in machine learning, the deep Boltzmann machine (DBM), could capture the relevant aspects of (hypothetical) generative processing in the cortex. The DBM carries both the semantics of a probabilistic generative model and of a neural network. The latter allows us to model a concrete neural mechanism that could underlie CBS, namely, homeostatic regulation of neuronal activity. We show that homeostatic plasticity could serve to make the learnt internal model robust against e.g. degradation of sensory input, but overcompensate in the case of CBS, leading to hallucinations. We demonstrate how a wide range of features of CBS can be explained in the model and suggest a potential role for the neuromodulator acetylcholine. This work constitutes the first concrete computational model of CBS and the first application of the DBM as a model in computational neuroscience. Our results lend further credence to the hypothesis of a generative model in the brain.

@article{Reichert2013_6_Charles,
author = {David P. Reichert and Peggy Series and Amos Storkey},
title = {{C}harles {B}onnet Syndrome: Evidence for a Generative Model in the Cortex?},
year = {2013},
month = {Jun},
journal = {PLOS Computational Biology},
volume = {9},
url = {https://doi.org/10.1371/journal.pcbi.1003134},
}

Single subject fMRI Test-Retest Reliability Metrics and Confounding Factors

NeuroImage

Krzysztof J. Gorgolewski, Amos Storkey, Mark E. Bastin, Ian R. Whittle, Cyril R. Pernet

While the fMRI test–retest reliability has been mainly investigated from the point of view of group level studies, here we present analyses and results for single-subject test–retest reliability. One important aspect of group level reliability is that not only does it depend on between-session variance (test–retest), but also on between-subject variance. This has partly led to a debate regarding which reliability metric to use and how different sources of noise contribute to between-session variance. Focusing on single subject reliability allows considering between-session only. In this study, we measured test–retest reliability in four behavioural tasks (motor mapping, covert verb generation, overt word repetition, and a landmark identification task) to ensure generalisation of the results and at three levels of data processing (time-series correlation, t value variance, and overlap of thresholded maps) to understand how each step influences the other and how confounding factors influence reliability at each of these steps. The contributions of confounding factors (scanner noise, subject motion, and coregistration) were investigated using multiple regression and relative importance analyses at each step. Finally, to achieve a fuller picture of what constitutes a reliable task, we introduced a bootstrap technique of within- vs. between-subject variance. Our results show that (i) scanner noise and coregistration errors have little contribution to between-session variance; (ii) subject motion (especially correlated with the stimuli) can have detrimental effects on reliability; and (iii) different tasks lead to different reliability results. This suggests that between-session variance in fMRI is mostly caused by the variability of underlying cognitive processes and motion correlated with the stimuli rather than technical limitations of data processing.

@article{Gorgolewski2013_4_Single,
author = {Krzysztof J. Gorgolewski and Amos Storkey and Mark E. Bastin and Ian R. Whittle and Cyril R. Pernet},
title = {Single subject fMRI Test-Retest Reliability Metrics and Confounding Factors},
year = {2013},
month = {Apr},
journal = {Neuroimage},
volume = {69},
url = {https://doi.org/10.1016/j.neuroimage.2012.10.085},
}

A Test-Retest Functional MRI Dataset for Motor, Language and Spatial Attention Functions

Gigascience

Krzysztof J. Gorgolewski, Amos Storkey, Mark E. Bastin, Ian R. Whittle, Joanna M. Wardlaw, Cyril R. Pernet

Since its inception over twenty years ago, functional magnetic resonance imaging (fMRI) has been used in numerous studies probing neural underpinnings of human cognition. However, the between session variance of many tasks used in fMRI remains understudied. Such information is especially important in context of clinical applications. A test-retest dataset was acquired to validate fMRI tasks used in pre-surgical planning. In particular, five task-related fMRI time series (finger, foot and lip movement, overt verb generation, covert verb generation, overt word repetition, and landmark tasks) were used to investigate which protocols gave reliable single-subject results. Ten healthy participants in their fifties were scanned twice using an identical protocol 2–3 days apart. In addition to the fMRI sessions, high-angular resolution diffusion tensor MRI (DTI), and high-resolution 3D T1-weighted volume scans were acquired.

@article{Gorgolewski2013_4_TestRetest,
author = {Krzysztof J. Gorgolewski and Amos Storkey and Mark E. Bastin and Ian R. Whittle and Joanna M. Wardlaw and Cyril R. Pernet},
title = {A Test-Retest Functional {MRI} Dataset for Motor, Language and Spatial Attention Functions},
year = {2013},
month = {Apr},
journal = {Gigascience},
volume = {2},
url = {https://doi.org/10.1186/2047-217X-2-6},
}

Continuous relaxations for discrete Hamiltonian Monte-Carlo

Advances in Neural Information Processing Systems 25 (NIPS2012)

Yichuan Zhang, Charles Sutton, Amos Storkey, Zoubin Ghahramani

Continuous relaxations play an important role in discrete optimization, but have not seen much use in approximate probabilistic inference. Here we show that a general form of the Gaussian Integral Trick makes it possible to transform a wide class of discrete variable undirected models into fully continuous systems. The continuous representation allows the use of gradient-based Hamiltonian Monte Carlo for inference, results in new ways of estimating normalization constants (partition functions), and in general opens up a number of new avenues for inference in difficult discrete systems. We demonstrate some of these continuous relaxation inference algorithms on a number of illustrative problems.

@inproceedings{Zhang2012_12_Continuous,
author = {Yichuan Zhang and Charles Sutton and Amos Storkey and Zoubin Ghahramani},
title = {Continuous relaxations for discrete {H}amiltonian {M}onte-{C}arlo},
year = {2012},
month = {Dec},
booktitle = {Advances in Neural Information Processing Systems 25 (NIPS2012)},
url = {https://homepages.inf.ed.ac.uk/amos/publications/ZhangSuttonStorkeyGhahramani2012ContinuousRelaxationsDiscreteHMC.pdf},
}

The Coloured Noise Expansion and Parameter Estimation of Diffusion Processes

Advances in Neural Information Processing Systems 25 (NIPS2012)

Simon Lyons, Simo Särkkä, Amos Storkey

Stochastic differential equations (SDE) are a natural tool for modelling systems that are inherently noisy or contain uncertainties that can be modelled as stochastic processes. Crucial to the process of using SDE to build mathematical models is the ability to estimate parameters of those models from observed data. Over the past few decades, significant progress has been made on this problem, but we are still far from having a definitive solution. We describe a novel method of approximating a diffusion process that we show to be useful in Markov chain Monte-Carlo (MCMC) inference algorithms. We take the ‘white’ noise that drives a diffusion process and decompose it into two terms. The first is a ‘coloured noise’ term that can be deterministically controlled by a set of auxiliary variables. The second term is small and enables us to form a linear Gaussian ‘small noise’ approximation. The decomposition allows us to take a diffusion process of interest and cast it in a form that is amenable to sampling by MCMC methods. We explain why many state-of-the-art inference methods fail on highly nonlinear inference problems. We demonstrate experimentally that our method performs well in such situations. Our results show that this method is a promising new tool for use in inference and parameter estimation problems.

@inproceedings{Lyons2012_12_Coloured,
author = {Simon Lyons and Simo Särkkä and Amos Storkey},
title = {The Coloured Noise Expansion and Parameter Estimation of Diffusion Processes},
year = {2012},
month = {Dec},
booktitle = {Advances in Neural Information Processing Systems 25 (NIPS2012)},
url = {https://proceedings.neurips.cc/paper/2012/hash/5c936263f3428a40227908d5a3847c0b-Abstract.html},
}

Adaptive Thresholding for Reliable Topological Inference in Single Subject fMRI Analysis

Frontiers in Human Neuroscience

Krzysztof J. Gorgolewski, Amos Storkey, Mark E. Bastin, Cyril R. Pernet

Single subject fMRI has proved to be a useful tool for mapping functional areas in clinical procedures such as tumor resection. Using fMRI data, clinicians assess the risk, plan and execute such procedures based on thresholded statistical maps. However, because current thresholding methods were developed mainly in the context of cognitive neuroscience group studies, most single subject fMRI maps are thresholded manually to satisfy specific criteria related to single subject analyses. Here, we propose a new adaptive thresholding method which combines Gamma-Gaussian mixture modeling with topological thresholding to improve cluster delineation. In a series of simulations we show that by adapting to the signal and noise properties, the new method performs well in terms of total number of errors but also in terms of the trade-off between false negative and positive cluster error rates. Similarly, simulations show that adaptive thresholding performs better than fixed thresholding in terms of over and underestimation of the true activation border (i.e., higher spatial accuracy). Finally, through simulations and a motor test–retest study on 10 volunteer subjects, we show that adaptive thresholding improves reliability, mainly by accounting for the global signal variance. This in turn increases the likelihood that the true activation pattern can be determined offering an automatic yet flexible way to threshold single subject fMRI maps.

@article{Gorgolewski2012_8_Adaptive,
author = {Krzysztof J. Gorgolewski and Amos Storkey and Mark E. Bastin and Cyril R. Pernet},
title = {Adaptive Thresholding for Reliable Topological Inference in Single Subject f{MRI} Analysis},
year = {2012},
month = {Aug},
journal = {Frontiers in Human Neuroscience},
volume = {6},
url = {https://homepages.inf.ed.ac.uk/amos/publications/GorgolewskiStorkeyBasinPernet2012AdaptiveThresholdingTopologicalInferenceFMRI.pdf},
}

A Topic Model for Melodic Sequences

International Conference on Machine Learning (ICML)

Athina Spiliopoulou, Amos Storkey

We examine the problem of learning a probabilistic model for melody directly from musical sequences belonging to the same genre. This is a challenging task as one needs to capture not only the rich temporal structure evident in music, but also the complex statistical dependencies among different music components. To address this problem we introduce the Variable-gram Topic Model, which couples the latent topic formalism with a systematic model for contextual information. We evaluate the model on next-step prediction. Additionally, we present a novel way of model evaluation, where we directly compare model samples with data sequences using the Maximum Mean Discrepancy of string kernels, to assess how close the model distribution is to the data distribution. We show that the model has the highest performance under both evaluation measures when compared to LDA, the Topic Bigram and related non-topic models.

@inproceedings{Spiliopoulou2012_6_Topic,
author = {Athina Spiliopoulou and Amos Storkey},
title = {A Topic Model for Melodic Sequences},
year = {2012},
month = {Jun},
booktitle = {International Conference on Machine Learning (ICML)},
url = {https://arxiv.org/abs/1206.6441},
}

Isoelastic Agents and Wealth Updates in Machine Learning Markets

International Conference on Machine Learning (ICML)

Amos Storkey, Jono Millin, Krzysztof Geras

Recently, prediction markets have shown considerable promise for developing flexible mechanisms for machine learning. In this paper, agents with isoelastic utilities are considered. It is shown that the costs associated with homogeneous markets of agents with isoelastic utilities produce equilibrium prices corresponding to alpha-mixtures, with a particular form of mixing component relating to each agent's wealth. We also demonstrate that wealth accumulation for logarithmic and other isoelastic agents (through payoffs on prediction of training targets) can implement both Bayesian model updates and mixture weight updates by imposing different market payoff structures. An iterative algorithm is given for market equilibrium computation. We demonstrate that inhomogeneous markets of agents with isoelastic utilities outperform state of the art aggregate classifiers such as random forests, as well as single classifiers (neural networks, decision trees) on a number of machine learning benchmarks, and show that isoelastic combination methods are generally better than their logarithmic counterparts.

@inproceedings{Storkey2012_6_Isoelastic,
author = {Amos Storkey and Jono Millin and Krzysztof Geras},
title = {Isoelastic Agents and Wealth Updates in Machine Learning Markets},
year = {2012},
month = {Jun},
booktitle = {International Conference on Machine Learning (ICML)},
url = {https://arxiv.org/abs/1206.6443},
}

Discriminative Mixtures of Sparse Latent Fields for Stress Testing

International Conference on Artificial Intelligence and Statistics (AISTATS)

Felix V. Agakov, Peter Orchard, Amos Storkey

We describe a simple and efficient approach to learning structures of sparse high-dimensional latent variable models. Standard algorithms either learn structures of specific predefined forms, or estimate sparse graphs in the data space ignoring the possibility of the latent variables. In contrast, our method learns rich dependencies and allows for latent variables that may confound the relations between the observations. We extend the model to conditional mixtures with side information and non-Gaussian marginal distributions of the observations. We then show that our model may be used for learning sparse latent variable structures corresponding to multiple unknown states, and for uncovering features useful for explaining and predicting structural changes. We apply the model to real-world financial data with heavy-tailed marginals covering the low- and high- market volatility periods of 2005-2011. We show that our method tends to give rise to significantly higher likelihoods of test data than standard network learning methods exploiting the sparsity assumption. We also demonstrate that our approach may be practical for financial stress-testing and visualization of dependencies between financial instruments.

@inproceedings{Agakov2012_4_Discriminative,
author = {Felix V. Agakov and Peter Orchard and Amos Storkey},
title = {Discriminative Mixtures of Sparse Latent Fields for Stress Testing},
year = {2012},
month = {Apr},
booktitle = {International Conference on Artificial Intelligence and Statistics (AISTATS)},
url = {https://homepages.inf.ed.ac.uk/amos/publications/AgakovOrchardStorkey2012MixSparseLatentFieldsStressTesting.pdf},
}

Neuronal Adaptation for Sampling-Based Probabilistic Inference in Perceptual Bistability

Advances in Neural Information Processing Systems 24 (NIPS2011)

David P. Reichert, Peggy Series, Amos Storkey

It has been argued that perceptual multistability reflects probabilistic inference performed by the brain when sensory input is ambiguous. Alternatively, more traditional explanations of multistability refer to low-level mechanisms such as neuronal adaptation. We employ a Deep Boltzmann Machine (DBM) model of cortical processing to demonstrate that these two different approaches can be combined in the same framework. Based on recent developments in machine learning, we show how neuronal adaptation can be understood as a mechanism that improves probabilistic, sampling-based inference. Using the ambiguous Necker cube image, we analyze the perceptual switching exhibited by the model. We also examine the influence of spatial attention, and explore how binocular rivalry can be modeled with the same approach. Our work joins earlier studies in demonstrating how the principles underlying DBMs relate to cortical processing, and offers novel perspectives on the neural implementation of approximate probabilistic inference in the brain.

@inproceedings{Reichert2011_12_Neuronal,
author = {David P. Reichert and Peggy Series and Amos Storkey},
title = {Neuronal Adaptation for Sampling-Based Probabilistic Inference in Perceptual Bistability},
year = {2011},
month = {Dec},
booktitle = {Advances in Neural Information Processing Systems 24 (NIPS2011)},
url = {https://homepages.inf.ed.ac.uk/amos/publications/ReichertSeriesStorkey2011PerceptualBistability.pdf},
}

Particle Smoothing in Continuous Time: a Fast Approach via Density Estimation

IEEE Transactions on Signal Processing

Lawrence Murray, Amos Storkey

We consider the particle smoothing problem for state-space models where the transition density is not available in closed form, in particular for continuous-time, nonlinear models expressed via stochastic differential equations (SDEs). Conventional forward-backward and two-filter smoothers for the particle filter require a closed-form transition density, with the linear-Gaussian Euler-Maruyama discretization usually applied to the SDEs to achieve this. We develop a pair of variants using kernel density approximations to relieve the dependence, and in doing so enable use of faster and more accurate discretization schemes such as Runge-Kutta. In addition, the new methods admit arbitrary proposal distributions, providing an avenue to deal with degeneracy issues. Experimental results on a functional magnetic resonance imaging (fMRI) deconvolution task demonstrate comparable accuracy and significantly improved runtime over conventional techniques.

@article{Murray2011_10_Particle,
author = {Lawrence Murray and Amos Storkey},
title = {Particle Smoothing in Continuous Time: a Fast Approach via Density Estimation},
year = {2011},
month = {Oct},
journal = {IEEE Transactions on Signal Processing},
volume = {59},
url = {https://homepages.inf.ed.ac.uk/amos/publications/MurrayStorkey2011ParticleSmoothingContinuousTime.pdf},
}

TractoR: Magnetic Resonance Imaging and Tractography with R

Journal of Statistical Software

Jon D. Clayden, Susana Munoz Maniega, Amos Storkey, Martin D. King, Mark E. Bastin, Chris A. Clark

Statistical techniques play a major role in contemporary methods for analyzing magnetic resonance imaging (MRI) data. In addition to the central role that classical statistical methods play in research using MRI, statistical modeling and machine learning techniques are key to many modern data analysis pipelines. Applications for these techniques cover a broad spectrum of research, including many preclinical and clinical studies, and in some cases these methods are working their way into widespread routine use. In this manuscript we describe a software tool called TractoR (for “Tractography with R”), a collection of packages for the R language and environment, along with additional infrastructure for straightforwardly performing common image processing tasks. TractoR provides general purpose functions for reading, writing and manipulating MR images, as well as more specific code for fitting signal models to diffusion MRI data and performing tractography, a technique for visualizing neural connectivity.

@article{Clayden2011_10_Tractor,
author = {Jon D. Clayden and Susana Munoz Maniega and Amos Storkey and Martin D. King and Mark E. Bastin and Chris A. Clark},
title = {{TractoR}: Magnetic Resonance Imaging and Tractography with {R}},
year = {2011},
month = {Oct},
journal = {Journal of Statistical Software},
volume = {44},
url = {https://www.tractor-mri.org.uk/paper/index.html},
}

Comparing Probabilistic Models for Melodic Sequences

Proceedings of the ECML-PKDD

Athina Spiliopoulou, Amos Storkey

Modelling the real world complexity of music is a challenge for machine learning. We address the task of modeling melodic sequences from the same music genre. We perform a comparative analysis of two probabilistic models: a Dirichlet Variable Length Markov Model (Dirichlet-VMM) and a Time Convolutional Restricted Boltzmann Machine (TC-RBM). We show that the TC-RBM learns descriptive music features, such as underlying chords and typical melody transitions and dynamics. We assess the models for future prediction and compare their performance to a VMM, which is the current state of the art in melody generation. We show that both models perform significantly better than the VMM, with the Dirichlet-VMM marginally outperforming the TC-RBM. Finally, we evaluate the short order statistics of the models, using the Kullback-Leibler divergence between test sequences and model samples, and show that our proposed methods match the statistics of the music genre significantly better than the VMM.

@inproceedings{Spiliopoulou2011_9_Comparing,
author = {Athina Spiliopoulou and Amos Storkey},
title = {Comparing Probabilistic Models for Melodic Sequences},
year = {2011},
month = {Sep},
booktitle = {Proceedings of the ECML-PKDD},
url = {https://arxiv.org/abs/1109.6804},
}

Expectation-Maximization Methods for Solving (PO)MDPs and Optimal Control Problems

Silvia Chiappa, David Barber (Eds.) Bayesian Time Series Models, Cambridge University Press.

Mark Toussaint, Amos Storkey, Stephan Harmeling

@inproceedings{Toussaint2011_5_ExpectationMaximization,
author = {Mark Toussaint and Amos Storkey and Stephan Harmeling},
title = {Expectation-Maximization Methods for Solving {(PO)MDP}s and Optimal Control Problems},
year = {2011},
month = {May},
booktitle = {Silvia Chiappa, David Barber (Eds.) Bayesian Time Series Models, Cambridge University Press.},
url = {https://homepages.inf.ed.ac.uk/amos/publications/ToussainStorkeyHarmeling2011EMPOMDP.pdf},
}

The Grouped Author-Topic Model for Unsupervised Entity Resolution

International Conference on Artificial Neural Networks (ICANN)

Andrew Dai, Amos Storkey

@inproceedings{Dai2011_5_Grouped,
author = {Andrew Dai and Amos Storkey},
title = {The Grouped Author-Topic Model for Unsupervised Entity Resolution},
year = {2011},
month = {May},
booktitle = {International Conference on Artificial Neural Networks (ICANN)},
url = {https://homepages.inf.ed.ac.uk/amos/publications/ReichertSeriesStorkey2011ObjectBasedAttention.pdf},
}

A Hierarchical Generative Model of Recurrent Object-Based Attention in the Visual Cortex

International Conference on Artificial Neural Networks (ICANN)

David P. Reichert, Peggy Series, Amos Storkey

In line with recent work exploring Deep Boltzmann Machines (DBMs) as models of cortical processing, we demonstrate the potential of DBMs as models of object-based attention, combining generative principles with attentional ones. We show: (1) How inference in DBMs can be related qualitatively to theories of attentional recurrent processing in the visual cortex; (2) that deepness and topographic receptive fields are important for realizing the attentional state; (3) how more explicit attentional suppressive mechanisms can be implemented, depending crucially on sparse representations being formed during learning.

@inproceedings{Reichert2011_5_Hierarchical,
author = {David P. Reichert and Peggy Series and Amos Storkey},
title = {A Hierarchical Generative Model of Recurrent Object-Based Attention in the Visual Cortex},
year = {2011},
month = {May},
booktitle = {International Conference on Artificial Neural Networks (ICANN)},
url = {https://homepages.inf.ed.ac.uk/amos/publications/ReichertSeriesStorkey2011ObjectBasedAttention.pdf},
}

Machine Learning Markets

International Conference on Artificial Intelligence and Statistics (AISTATS)

Amos Storkey

Prediction markets show considerable promise for developing flexible mechanisms for machine learning. Here, machine learning markets for multivariate systems are defined, and a utility-based framework is established for their analysis. This differs from the usual approach of defining static betting functions. It is shown that such markets can implement model combination methods used in machine learning, such as product of expert and mixture of expert approaches as equilibrium pricing models, by varying agent utility functions. They can also implement models composed of local potentials, and message passing methods. Prediction markets also allow for more flexible combinations, by combining multiple different utility functions. Conversely, the market mechanisms implement inference in the relevant probabilistic models. This means that market mechanism can be utilized for implementing parallelized model building and inference for probabilistic modelling.

@inproceedings{Storkey2011_4_Machine,
author = {Amos Storkey},
title = {Machine Learning Markets},
year = {2011},
month = {Apr},
booktitle = {International Conference on Artificial Intelligence and Statistics (AISTATS)},
url = {https://arxiv.org/abs/1106.4509},
}

Sparse Instrumental Variables (SPIV) for Genome-Wide Studies

Advances in Neural Information Processing Systems 23 (NIPS2010)

Felix V. Agakov, Paul McKeigue, Jon Krohn, Amos Storkey

This paper describes a probabilistic framework for studying associations between multiple genotypes, biomarkers, and phenotypic traits in the presence of noise and unobserved confounders for large genetic studies. The framework builds on sparse linear methods developed for regression and modified here for inferring causal structures of richer networks with latent variables. The method is motivated by the use of genotypes as “instruments” to infer causal associations between phenotypic biomarkers and outcomes, without making the common restrictive assumptions of instrumental variable methods. The method may be used for an effective screening of potentially interesting genotype-phenotype and biomarker-phenotype associations in genome-wide studies, which may have important implications for validating biomarkers as possible proxy endpoints for early-stage clinical trials. Where the biomarkers are gene transcripts, the method can be used for fine mapping of quantitative trait loci (QTLs) detected in genetic linkage studies. The method is applied for examining effects of gene transcript levels in the liver on plasma HDL cholesterol levels for a sample of sequenced mice from a heterogeneous stock, with ∼ 10^5 genetic instruments and ∼ 47 × 10^3 gene transcripts.

@inproceedings{Agakov2010_12_Sparse,
author = {Felix V. Agakov and Paul McKeigue and Jon Krohn and Amos Storkey},
title = {Sparse Instrumental Variables ({SPIV}) for Genome-Wide Studies},
year = {2010},
month = {Dec},
booktitle = {Advances in Neural Information Processing Systems 23 (NIPS2010)},
url = {https://homepages.inf.ed.ac.uk/amos/publications/AgakovEtAl2010SparseInstrumentalVariables.pdf},
}

Hallucinations in Charles Bonnet Syndrome Induced by Homeostasis: a Deep Boltzmann Machine Model

Advances in Neural Information Processing Systems 23 (NIPS2010)

David P. Reichert, Peggy Series, Amos Storkey

The Charles Bonnet Syndrome (CBS) is characterized by complex vivid visual hallucinations in people with, primarily, eye diseases and no other neurological pathology. We present a Deep Boltzmann Machine model of CBS, exploring two core hypotheses: First, that the visual cortex learns a generative or predictive model of sensory input, thus explaining its capability to generate internal imagery. And second, that homeostatic mechanisms stabilize neuronal activity levels, leading to hallucinations being formed when input is lacking. We reproduce a variety of qualitative findings in CBS. We also introduce a modification to the DBM that allows us to model a possible role of acetylcholine in CBS as mediating the balance of feed-forward and feed-back processing. Our model might provide new insights into CBS and also demonstrates that generative frameworks are promising as hypothetical models of cortical learning and perception.

@inproceedings{Reichert2010_12_Hallucinations,
author = {David P. Reichert and Peggy Series and Amos Storkey},
title = {Hallucinations in {C}harles {B}onnet Syndrome Induced by Homeostasis: a Deep {B}oltzmann Machine Model},
year = {2010},
month = {Dec},
booktitle = {Advances in Neural Information Processing Systems 23 (NIPS2010)},
url = {https://homepages.inf.ed.ac.uk/amos/publications/ReichertSeriesStorkey2010CharlesBonnet.pdf},
}

When Training and Test Sets are Different: Characterising Learning Transfer

In Dataset Shift in Machine Learning, Eds Candela, Sugiyama, Schwaighofer, Lawrence. MIT Press.

Amos Storkey

In this chapter, a number of common forms of dataset shift are introduced, and each is related to a particular form of causal probabilistic model. Examples are given for the different types of shift, and some corresponding modelling approaches. By characterising dataset shift in this way, there is potential for the development of models which capture the specific types of variations, combine different modes of variation, or do model selection to assess whether dataset shift is an issue in particular circumstances. As an example of how such models can be developed, an illustration is provided for one approach to adapting Gaussian process methods for a particular type of dataset shift called Mixture Component Shift.

@inproceedings{Storkey2009_12_When,
author = {Amos Storkey},
title = {When Training and Test Sets are Different: Characterising Learning Transfer},
year = {2009},
month = {Dec},
booktitle = {In Dataset Shift in Machine Learning, Eds Candela, Sugiyama, Schwaighofer, Lawrence. MIT Press.},
url = {https://homepages.inf.ed.ac.uk/amos/publications/Storkey2009TrainingTestDifferent.pdf},
}

Reproducibility of Tract Segmentation between Sessions using an Unsupervised Modelling-based Approach

Neuroimage

Jonathan D. Clayden, Amos Storkey, Susana Munoz Maniega, Mark E. Bastin

This work describes a reproducibility analysis of scalar water diffusion parameters, measured within white matter tracts segmented using a probabilistic shape modelling method. In common with previously reported neighbourhood tractography (NT) work, the technique optimises seed point placement for fibre tracking by matching the tracts generated using a number of candidate points against a reference tract, which is derived from a white matter atlas in the present study. No direct constraints are applied to the fibre tracking results. An Expectation–Maximisation algorithm is used to fully automate the procedure, and make dramatically more efficient use of data than earlier NT methods. Within-subject and between-subject variances for fractional anisotropy and mean diffusivity within the tracts are then separated using a random effects model. We find test–retest coefficients of variation (CVs) similar to those reported in another study using landmark-guided single seed points; and subject to subject CVs similar to a constraint-based multiple ROI method. We conclude that our approach is at least as effective as other methods for tract segmentation using tractography, whilst also having some additional benefits, such as its provision of a goodness-of-match measure for each segmentation.

@article{Clayden2009_10_Reproducibility,
author = {Jonathan D. Clayden and Amos Storkey and Susana Munoz Maniega and Mark E. Bastin},
title = {Reproducibility of Tract Segmentation between Sessions using an Unsupervised Modelling-based Approach},
year = {2009},
month = {Oct},
journal = {Neuroimage},
volume = {45},
url = {https://homepages.inf.ed.ac.uk/amos/publications/ClaydenEtAl2009ReproducibilityUnsupervisedTractMatching.pdf},
}

Tract Shape Modelling Provides Evidence of Topological Change in Corpus Callosum Genu During Normal Ageing

Neuroimage

Mark E. Bastin, Jakob P. Piatowski, Amos Storkey, Laura J. Brown, Alistair M. Maclullich, Jonathan D. Clayden

Understanding how ageing affects brain structure is an important challenge for medical science. By allowing segmentation of fasciculi-of-interest from diffusion magnetic resonance imaging (dMRI) data, tractography provides a promising tool for assessing white matter connectivity in old age. However, the output from tractography algorithms is usually strongly dependent on the subjective location of user-specified seed points, with the result that it can be both difficult and time consuming to identify the same tract reliably in cross-sectional studies. Here we investigate whether a novel method for automatic single seed point placement based on tract shape modelling, termed probabilistic model-based neighbourhood tractography (PNT), can reliably segment the same tract from subject to subject in a non-demented cohort aged over 65 years. For the fasciculi investigated (genu and splenium of corpus callosum, cingulum cingulate gyri, corticospinal tracts and uncinate fasciculi), PNT was able to provide anatomically plausible representations of the tract in question in 70 to 90% of subjects compared with 2.5 to 60% if single seed points were simply transferred directly from standard to native space. In corpus callosum genu there was a significant negative correlation between a PNT-derived measure of tract shape similarity to a young brain reference tract and age, and a trend towards a significant negative correlation between tract-averaged fractional anisotropy and age; results that are consistent with previous dMRI studies of normal ageing. These data show that it is possible automatically to segment comparable tracts in the brains of older subjects using single seed point tractography, if the seed point is carefully chosen.

@article{Bastin2008_7_Tract,
author = {Mark E. Bastin and Jakob P. Piatowski and Amos Storkey and Laura J. Brown and Alistair M. Maclullich and Jonathan D. Clayden},
title = {Tract Shape Modelling Provides Evidence of Topological Change in Corpus Callosum Genu During Normal Ageing},
year = {2008},
month = {Jul},
journal = {Neuroimage},
volume = {43},
url = {https://homepages.inf.ed.ac.uk/amos/publications/BastinEtAl2008TractShapeTopologicalChangeAging.pdf},
}

Modelling Motion Primitives and Their Timing in Biologically Executed Movements

Advances in Neural Information Processing Systems 20 (NIPS2007)

Ben H. Williams, Marc Toussaint, Amos Storkey

Biological movement is built up of sub-blocks or motion primitives. Such primitives provide a compact representation of movement which is also desirable in robotic control applications. We analyse handwriting data to gain a better understanding of primitives and their timings in biological movements. Inference of the shape and the timing of primitives can be done using a factorial HMM-based model, allowing the handwriting to be represented in primitive timing space. This representation provides a distribution of spikes corresponding to the primitive activations, which can also be modelled using HMM architectures. We show how the coupling of the low-level primitive model and the higher-level timing model during inference can produce good reconstructions of handwriting, with shared primitives for all characters modelled. This coupled model also captures the variance profile of the dataset which is accounted for by spike timing jitter. The timing code provides a compact representation of the movement, while generating a movement without an explicit timing model produces a scribbling style of output.

@inproceedings{Williams2008_1_Modelling,
author = {Ben H. Williams and Marc Toussaint and Amos Storkey},
title = {Modelling Motion Primitives and Their Timing in Biologically Executed Movements},
year = {2008},
month = {Jan},
booktitle = {Advances in Neural Information Processing Systems 20 (NIPS2007)},
url = {https://homepages.inf.ed.ac.uk/amos/publications/WilliamsToussaintStorkey2008MotionPrimitivesTiming.pdf},
}

Continuous Time Particle Filtering for fMRI

Advances in Neural Information Processing Systems 20 (NIPS2007)

Lawrence Murray, Amos Storkey

We construct a biologically motivated stochastic differential model of the neural and hemodynamic activity underlying the observed Blood Oxygen Level Dependent (BOLD) signal in Functional Magnetic Resonance Imaging (fMRI). The model poses a difficult parameter estimation problem, both theoretically due to the nonlinearity and divergence of the differential system, and computationally due to its time and space complexity. We adapt a particle filter and smoother to the task, and discuss some of the practical approaches used to tackle the difficulties, including use of sparse matrices and parallelisation. Results demonstrate the tractability of the approach in its application to an effective connectivity study.

@inproceedings{Murray2008_1_Continuous,
author = {Lawrence Murray and Amos Storkey},
title = {Continuous Time Particle Filtering for f{MRI}},
year = {2008},
month = {Jan},
booktitle = {Advances in Neural Information Processing Systems 20 (NIPS2007)},
url = {https://homepages.inf.ed.ac.uk/amos/publications/MurrayStorkey2008ContinuousTimeParticleFilterFmri.pdf},
}

A Probabilistic Model-based Approach to Consistent White Matter Tract Segmentation

IEEE Transactions on Medical Imaging

Jonathan D. Clayden, Mark E. Bastin, Amos Storkey

Since the invention of diffusion MRI, currently the only established method for studying white matter connectivity in a clinical environment, there has been a great deal of interest in the effects of various pathologies on the connectivity of the brain. As methods for in vivo tractography have been developed it has become possible to track and segment specific white matter structures of interest for particular study. However, the consistency and reproducibility of tractography-based segmentation remain limited, and attempts to improve them have thus far typically involved the imposition of strong constraints on the tract reconstruction process itself. In this work we take a different approach, developing a formal probabilistic model for the relationships between comparable tracts in different scans, and then using it to choose a tract, a posteriori, which best matches a predefined reference tract for the structure of interest. We demonstrate that this method is able to significantly improve segmentation consistency without directly constraining the tractography algorithm.

@article{Clayden2007_11_Probabilistic,
author = {Jonathan D. Clayden and Mark E. Bastin and Amos Storkey},
title = {A Probabilistic Model-based Approach to Consistent White Matter Tract Segmentation},
year = {2007},
month = {Nov},
journal = {IEEE Transactions on Medical Imaging},
volume = {26},
url = {https://homepages.inf.ed.ac.uk/amos/publications/ClaydenStorkeyBastin2007ProbabilisticTractSegmentation.pdf},
}

A Probabilistic Model-Based Approach to Consistent White Matter Tract Segmentation

Proceedings of the ISMRM 15th Scientific Meeting and Exhibition

Jonathan D. Clayden, Mark E. Bastin, Amos Storkey

Tractography algorithms have advantages, as tools for segmentation of white matter structures from diffusion MRI (dMRI) data, over more established region of interest (ROI) approaches. In particular, they are capable of automatically segmenting irregularly shaped structures that would be difficult and error-prone for a human observer to isolate. The main problem with tractography based segmentation is that algorithms require a seed point as a starting location. Since this point is typically placed by a human observer, and the segmentation can be very sensitive to its placement, a strong element of subjectivity remains in the results. We have recently demonstrated proof of concept for an approach to automated seed point placement in which a set of points are each used to generate a "candidate" tract, and the single seed point is chosen whose corresponding tract matches best to a predefined reference tract [1]. In that case, each candidate seed point is treated as a hypothesis, and the hypothesis with the best evidence to support it, in terms of tract similarity, is chosen. In the present work we take this approach further, developing a formal probabilistic model for the shape and length relationships between tracts, which resolves many of the shortcomings of the previous method.

@inproceedings{Clayden2007_2_Probabilistic,
author = {Jonathan D. Clayden and Mark E. Bastin and Amos Storkey},
title = {A Probabilistic Model-Based Approach to Consistent White Matter Tract Segmentation},
year = {2007},
month = {Feb},
booktitle = {Proceedings of the ISMRM 15th Scientific Meeting and Exhibition},
url = {https://cds.ismrm.org/protected/07MProceedings/PDFfiles/00078.pdf},
}

Mixture Regression for Covariate Shift

Advances in Neural Information Processing Systems 19 (NIPS2006)

Amos Storkey, Masashi Sugiyama

In supervised learning there is a typical presumption that the training and test points are taken from the same distribution. In practice this assumption is commonly violated. The situation where the training and test data are from different distributions is called covariate shift. Recent work has examined techniques for dealing with covariate shift in terms of minimisation of generalisation error. As yet the literature lacks a Bayesian generative perspective on this problem. This paper tackles this issue for regression models. Recent work on covariate shift can be understood in terms of mixture regression. Using this view, we obtain a general approach to regression under covariate shift, which reproduces previous work as a special case. The main advantages of this new formulation over previous models for covariate shift are that we no longer need to presume the test and training densities are known, the regression and density estimation are combined into a single procedure, and previous methods are reproduced as special cases of this procedure, shedding light on the implicit assumptions the methods are making.

@inproceedings{Storkey2007_1_Mixture,
author = {Amos Storkey and Masashi Sugiyama},
title = {Mixture Regression for Covariate Shift},
year = {2007},
month = {Jan},
booktitle = {Advances in Neural Information Processing Systems 19 (NIPS2006)},
url = {https://homepages.inf.ed.ac.uk/amos/publications/StorkeySugiyama2007MixtureRegressionForCovariateShift.pdf},
}

Learning Structural Equation Models for fMRI

Advances in Neural Information Processing Systems 19 (NIPS2006)

Amos Storkey, Enrico Simonotto, Heather Whalley, Stephen Lawrie, Lawrence Murray, David McGonigle

Structural equation models can be seen as an extension of Gaussian belief networks to cyclic graphs, and we show they can be understood generatively as the model for the joint distribution of long term average equilibrium activity of Gaussian dynamic belief networks. Most use of structural equation models in fMRI involves postulating a particular structure and comparing learnt parameters across different groups. In this paper it is argued that there are situations where priors about structure are not firm or exhaustive, and given sufficient data, it is worth investigating learning network structure as part of the approach to connectivity analysis. First we demonstrate structure learning on a toy problem. We then show that for particular fMRI data the simple models usually assumed are not supported. We show that it is possible to learn sensible structural equation models that can provide modelling benefits, but that are not necessarily going to be the same as a true causal model, and suggest the combination of prior models and learning or the use of temporal information from dynamic models may provide more benefits than learning structural equations alone.

@inproceedings{Storkey2007_1_Learning,
author = {Amos Storkey and Enrico Simonotto and Heather Whalley and Stephen Lawrie and Lawrence Murray and David McGonigle},
title = {Learning Structural Equation Models for f{MRI}},
year = {2007},
month = {Jan},
booktitle = {Advances in Neural Information Processing Systems 19 (NIPS2006)},
url = {https://homepages.inf.ed.ac.uk/amos/publications/StorkeyEtAl2007LearningStructuralEquationModelsForFMRI.pdf},
}

A Primitive Based Generative Model to Infer Timing Information in Unpartitioned Handwriting Data

Twentieth International Joint Conference on Artificial Intelligence (IJCAI)

Ben H. Williams, Marc Toussaint, Amos Storkey

Biological movement control and planning is based upon motor primitives. In our approach, we presume that each motor primitive takes responsibility for controlling a small sub-block of motion, containing coherent muscle activation outputs. A central timing controller cues these subroutines of movement, creating complete movement strategies that are built up by overlaying primitives, thus creating synergies of muscle activation. This partitioning allows the movement to be defined by a sparse code representing the timing of primitive activations. This paper shows that it is possible to use a factorial hidden Markov model to infer primitives in handwriting data. The variation in the handwriting data can to a large extent be explained by timing variation in the triggering of the primitives. Once an appropriate set of primitives has been inferred, the characters can be represented as a set of timings of primitive activations, along with variances, giving a very compact representation of the character. The model is naturally partitioned into a low level primitive output stage, and a top-down primitive timing stage. This partitioning gives us an insight into behaviours such as scribbling, and what is learnt in order to write a new character.

@inproceedings{Williams2007_1_Primitive,
author = {Ben H. Williams and Marc Toussaint and Amos Storkey},
title = {A Primitive Based Generative Model to Infer Timing Information in Unpartitioned Handwriting Data},
year = {2007},
month = {Jan},
booktitle = {Twentieth International Joint Conference on Artificial Intelligence (IJCAI)},
url = {https://homepages.inf.ed.ac.uk/amos/publications/WilliamsEtAl2007PrimitiveBasedGenerativeModelToInferTimingInformationInUnpartitionedHandwritingData.pdf},
}

Probabilistic Inference for Solving (PO)MDPs

Informatics Research Report 0934

Marc Toussaint, Stefan Harmeling, Amos Storkey

The development of probabilistic inference techniques has made considerable progress in recent years, in particular with respect to exploiting the structure (e.g., factored, hierarchical or relational) of discrete and continuous problem domains. We show that these techniques can also be used for solving Markov Decision Processes (MDPs) or partially observable MDPs (POMDPs) when formulated in terms of a structured dynamic Bayesian network (DBN). The approach is based on an equivalence between maximization of the expected future return in the time-unlimited MDP and likelihood maximization in a related mixture of finite-time MDPs. This allows us to use expectation maximization (EM) for computing optimal policies, using arbitrary inference techniques in the E-step. Unlike previous approaches we can show that this actually optimizes the discounted expected future return for arbitrary reward functions and without assuming an ad hoc finite total time. We first develop the approach for standard MDPs and demonstrate it using exact inference on a discrete maze and Gaussian belief state propagation in non-linear stochastic optimal control problems. Then we present an extension for solving POMDPs. We consider an agent model that includes an internal memory variable used for gating reactive behaviors. Using exact inference on the respective DBN, the EM-algorithm solves complex maze problems by learning appropriate internal memory representations.

@techreport{Toussaint2006_12_Probabilistic,
author = {Marc Toussaint and Stefan Harmeling and Amos Storkey},
title = {Probabilistic Inference for Solving {(PO)MDP}s},
year = {2006},
month = {Dec},
institution = {School of Informatics, University of Edinburgh},
number = {0934},
url = {https://homepages.inf.ed.ac.uk/amos/publications/ToussaintetAl2006ProbabilisticInferenceForSolvingPOMDPs.pdf},
}

Automated Assessment of Tract Similarity in Group Diffusion MRI Data

Proceedings of the ISMRM 14th Scientific Meeting & Exhibition, Seattle, USA

Jonathan D. Clayden, Mark E. Bastin, Amos Storkey

@inproceedings{Clayden2006_9_Automated,
author = {Jonathan D. Clayden and Mark E. Bastin and Amos Storkey},
title = {Automated Assessment of Tract Similarity in Group Diffusion {MRI} Data},
year = {2006},
month = {Sep},
booktitle = {Proceedings of the ISMRM 14th Scientific Meeting \& Exhibition, Seattle, USA},
url = {http://www.homepages.ucl.ac.uk/~sejjjd2/papers/ClaydenEtAl2006NeighbourhoodTractography.pdf},
}

Neighbourhood Tractography: a New Approach to Seed Point Placement for Group Fibre Tracking

Proceedings of the Annual Meeting of the ISMRM British Chapter, Guildford, UK

Jonathan D. Clayden, Mark E. Bastin, Amos Storkey

@inproceedings{Clayden2006_9_Neighbourhood,
author = {Jonathan D. Clayden and Mark E. Bastin and Amos Storkey},
title = {Neighbourhood Tractography: a New Approach to Seed Point Placement for Group Fibre Tracking},
year = {2006},
month = {Sep},
booktitle = {Proceedings of the Annual Meeting of the ISMRM British Chapter, Guildford, UK},
url = {http://www.homepages.ucl.ac.uk/~sejjjd2/papers/ClaydenEtAl2006NeighbourhoodTractography.pdf},
}

Extracting Motion Primitives from Natural Handwriting Data

International Conference on Artificial Neural Networks (ICANN)

Ben H. Williams, Marc Toussaint, Amos Storkey

Over the past 10 years it has become clear that biological movement is made up of sub-routine type blocks, or motor primitives, with a central controller timing the activation of these blocks, creating synergies of muscle activation. This paper shows that it is possible to use a factorial hidden Markov model to infer primitives in handwriting data. These primitives are not predefined in terms of location of occurrence within the handwriting, and they are not limited or defined by a particular character set. Also, the variation in the data can to a large extent be explained by timing variation in the triggering of the primitives. Once an appropriate set of primitives has been inferred, the characters can be represented as a set of timings of primitive activations, along with variances, giving a very compact representation of the character. Separating the motor system into a motor primitive part and a timing control part gives us a possible insight into how we might create scribbles on paper.

@inproceedings{Williams2006_9_Extracting,
author = {Ben H. Williams and Marc Toussaint and Amos Storkey},
title = {Extracting Motion Primitives from Natural Handwriting Data},
year = {2006},
month = {Sep},
booktitle = {International Conference on Artificial Neural Networks (ICANN)},
url = {https://homepages.inf.ed.ac.uk/amos/publications/WilliamsToussaintStorkey2006ExtractingMotionPrimitives.pdf},
}

Improved Segmentation Reproducibility in Group Tractography using a Quantitative Tract Similarity Measure

Neuroimage

Jonathan D. Clayden, Mark E. Bastin, Amos Storkey

The field of tractography is rapidly developing, and many automatic or semiautomatic algorithms have now been devised to segment and visualize neural white matter fasciculi in vivo. However, these algorithms typically need to be given a starting location as input, and their output can be strongly dependent on the exact location of this "seed point". No robust method has yet been devised for placing these seed points so as to segment a comparable tract in a group of subjects. Here, we develop a measure of tract similarity, based on the shapes and lengths of the two tracts being compared, and apply it to the problem of consistent seed point placement and tract segmentation in group data. We demonstrate that using a single seed point transferred from standard space to each native space produces considerable variability in tractography output between scans. However, by seeding in a group of nearby candidate points and choosing the output with the greatest similarity to a reference tract chosen in advance--a method we refer to as neighborhood tractography--this variability can be significantly reduced.

@article{Clayden2006_7_Improved,
author = {Jonathan D. Clayden and Mark E. Bastin and Amos Storkey},
title = {Improved Segmentation Reproducibility in Group Tractography using a Quantitative Tract Similarity Measure},
year = {2006},
month = {Jul},
journal = {Neuroimage},
volume = {33},
url = {https://homepages.inf.ed.ac.uk/amos/publications/ClaydenEtAl2006TractSimilarity.pdf},
}

Probabilistic inference for solving discrete and continuous state Markov Decision Processes

ICML06 - Proceedings of the 23rd international conference on Machine learning

Marc Toussaint, Amos Storkey

Inference in Markov Decision Processes has recently received interest as a means to infer goals of an observed action, policy recognition, and also as a tool to compute policies. A particularly interesting aspect of the approach is that any existing inference technique in DBNs now becomes available for answering behavioral questions--including those on continuous, factorial, or hierarchical state representations. Here we present an Expectation Maximization algorithm for computing optimal policies. Unlike previous approaches we can show that this actually optimizes the discounted expected future return for arbitrary reward functions and without assuming an ad hoc finite total time. The algorithm is generic in that any inference technique can be utilized in the E-step. We demonstrate this for exact inference on a discrete maze and Gaussian belief state propagation in continuous stochastic optimal control problems.

@inproceedings{Toussaint2006_6_Probabilistic,
author = {Marc Toussaint and Amos Storkey},
title = {Probabilistic inference for solving discrete and continuous state {Markov} Decision Processes},
year = {2006},
month = {Jun},
booktitle = {ICML06 - Proceedings of the 23rd international conference on Machine learning},
url = {https://homepages.inf.ed.ac.uk/amos/publications/ToussaintStorkey2006ProbabilisticInferenceSolvingMDPs.pdf},
}