Please use this identifier to cite or link to this item:
http://hdl.handle.net/1893/33493
Appears in Collections: | Computing Science and Mathematics eTheses |
Title: | The Constrained InfoMax Objective: a Unifying Principle to Learn Optimal Representations in Both Unsupervised and Supervised scenarios |
Author(s): | Crescimanna, Vincenzo |
Supervisor(s): | Graham, Bruce |
Issue Date: | Dec-2020 |
Publisher: | University of Stirling |
Abstract: | In the machine learning framework, a data representation is defined as a possible description of the visible data. For example, to describe the number of balls on a table (nine) we can use two equivalent descriptions: either "9" or "IX". Both descriptions are universal representations of the same concept (the number), which can appear in different forms. This simple example shows that more than one representation can be associated with the data. For this reason it is useful to define the characteristics of the optimal representation, the most preferable among all the possible ones. The first property is non-ambiguity, i.e. from a given representation it is possible to reconstruct only one class of visible data (e.g. if we write the digit nine too quickly it could appear close to an eight, and so this representation is ambiguous). The second important property that characterises the optimal solution is shortness, i.e. the representation should use the shortest code such that the non-ambiguity property is satisfied (e.g. a possible representation of a number is to draw as many points as the number indicates; if the number is one million, one million points have to be drawn, so this representation is not short!). In other words, we can assert that the optimal representation of the visible data is the description that contains all the relevant knowledge about the data, and no more than that. Although intuitively clear, the definition of the optimal representation is not unequivocal and differs from context to context. In this work, utilising the basic concepts of Information Theory (a theory suited to measuring the knowledge stored within the data), we provide an information-based definition of the optimal representation that works in two of the main contexts of machine learning applications: Supervised and Unsupervised Learning. 
This concept is defined via an optimisation principle, the Constrained InfoMax (CIM): among the representations sharing maximal information with the task class, the optimal one is that storing the least knowledge. As this objective is infeasible to compute, to evaluate it quantitatively we provide a variational approximation of the CIM, the Variational InfoMax (VIM), a single variational objective to learn data representations in both supervised and unsupervised settings. In both settings, models trained with VIM outperform the alternative variational models proposed in the literature on the standard metrics for evaluating representation quality. This work was first motivated by the need to define the optimal representation within the unsupervised setting, and hence the principle to learn it. Indeed, since the introduction of the Variational AutoEncoder (VAE) [37] (the first variational model to learn a possibly non-linear representation, trained by optimising the Evidence Lower Bound (ELBO)), two main definitions of optimal representation have been provided: disentangled [28], which separates out the generative factors of the data; and informative [89], the most informative description of the data. Following the work done in [5], we observe that the two definitions are actually not in contrast, but two sides of the same coin. Indeed, an optimal representation of the data has to store all the information necessary to generate the data, but none of the generative factors may be redundant. So the optimal representation is both informative and disentangled, and it is the one maximising the Constrained InfoMax with respect to the data generation task. With the help of basic geometric notions, we also observe that the Variational AutoEncoder trained with VIM is a pure inference model: the representation is defined by the encoder (i.e. the inference process) and not by the decoder (i.e. the generative process). 
To show this, we consider a special Variational AutoEncoder with a linear decoder and a non-linear encoder, and we observe that, unlike the ELBO-trained models, which learn only a linear representation of the data, the VIM-trained models learn a possibly non-linear representation. In particular, we observe that the solution of the ELBO-trained model is equivalent to that of Principal Component Analysis (PCA), whereas the VIM-trained models learn a Mixture PCA representation. Finally, we consider the supervised setting, specifically the classification task, and we observe that the CIM objective is equivalent to the well-known Information Bottleneck [72]. However, its variational description, VIM, approximates the theoretical objective better than the Variational Information Bottleneck [4]. |
Type: | Thesis or Dissertation |
URI: | http://hdl.handle.net/1893/33493 |
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
thesis.pdf | | 11.8 MB | Adobe PDF | View/Open |
This item is protected by original copyright