Coding for Data Analytics: New Information Distances

Date

2020-09-03

Authors

Sajedi, Ahmad

Advisor

Yang, En-hui

Publisher

University of Waterloo

Abstract

Distance plays a vital role in many applications of data analytics. For instance, in image retrieval, organization, and management, one needs a proper similarity distance to measure the perceptual similarity between any two images X and Y. Once such a similarity distance is defined, it can be used in image retrieval to find images in a database that are perceptually similar to a query image, and in image management to organize images into groups according to their mutual similarity. Likewise, in data clustering and bioinformatics, the notion of distance plays a dominant role.

In this thesis, the concept of distance between any two data objects X and Y (continuous or discrete) is addressed from the perspective of Shannon information theory. Consider a coding paradigm where X and Y are encoded into a sequence of coded bits specifying a codeword (or method) which would, in turn, convert Y into X̂ and X into Ŷ such that both the distortion between X and X̂ and the distortion between Y and Ŷ are less than or equal to a prescribed threshold D. To achieve a degree of universality, we consider a class C of coding schemes within this paradigm. Given a class C, the information distance R_C(X, Y, D) between X and Y at distortion level D is defined as the smallest number of coded bits afforded by coding schemes from C. R_C(X, Y, D) is shown to be a pseudo distance in a certain sense; it is further characterized or bounded.

The new information distance is defined for all stationary sources with discrete alphabets, and for independent and identically distributed (IID) sources with continuous alphabets when the distortion level is small. For example, a pseudo distance for memoryless jointly Gaussian sources is presented when the distortion level is small, or more generally when it is less than or equal to a threshold determined by statistical properties of the sources, such as their variances. When C is the class of so-called separately precoded broadcast codes, it is shown that for any pair of discrete memoryless sources (DMSs) X and Y, R_C(X, Y, D) equals the maximum of the Wyner-Ziv coding rate of X with Y as side information and the Wyner-Ziv coding rate of Y with X as side information. In the general case where C consists of all codes within the coding paradigm, upper and lower bounds on R_C(X, Y, D) are established; these bounds are further shown to be tight when (X, Y) is an IID pair source, for example, memoryless jointly Gaussian sources or memoryless doubly symmetric binary sources (DSBSs). The distance R_C(X, Y, D) generalizes the notion of information distance defined within the framework of Kolmogorov complexity and, in contrast to other information distances in the literature, is applicable to both discrete- and continuous-valued data.
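
In symbols, the characterization stated above for separately precoded broadcast codes can be written as the following displayed equation; the notation R_WZ(· | ·, D) for the Wyner-Ziv coding rate is introduced here for illustration and need not match the thesis's own notation:

```latex
\[
  R_{\mathcal{C}}(X, Y, D) \;=\; \max\bigl\{\, R_{\mathrm{WZ}}(X \mid Y, D),\; R_{\mathrm{WZ}}(Y \mid X, D) \,\bigr\}
\]
```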
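For the memoryless jointly Gaussian case, where the bounds above are stated to be tight, the Wyner-Ziv rate of a Gaussian source is classically known to equal its conditional rate-distortion function. The sketch below evaluates the max-of-two-Wyner-Ziv-rates expression numerically under that standard assumption; the function names and the use of mean-squared-error distortion are illustrative choices, not taken from the thesis:

```python
import math

def gaussian_wyner_ziv_rate(var: float, rho: float, d: float) -> float:
    """Wyner-Ziv rate (bits/sample) of a Gaussian source with variance `var`,
    given side information with correlation coefficient `rho`, at
    mean-squared-error distortion `d`. For jointly Gaussian sources there is
    no rate loss, so this equals the conditional rate-distortion function
    0.5 * log2(Var(X|Y) / d)."""
    cond_var = var * (1.0 - rho ** 2)  # residual variance Var(X | Y)
    return 0.5 * math.log2(cond_var / d) if d < cond_var else 0.0

def info_distance_gaussian(var_x: float, var_y: float, rho: float, d: float) -> float:
    """Maximum of the two Wyner-Ziv rates, mirroring the characterization
    of R_C(X, Y, D) quoted above."""
    return max(gaussian_wyner_ziv_rate(var_x, rho, d),
               gaussian_wyner_ziv_rate(var_y, rho, d))

# Unit-variance jointly Gaussian pair with correlation 0.9, at D = 0.01
print(info_distance_gaussian(1.0, 1.0, 0.9, 0.01))  # ~2.12 bits/sample
```

Note that the rate falls to zero once D exceeds the conditional variance, which is consistent with the abstract's requirement that the distortion level be small (below a variance-dependent threshold).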

Keywords

information distances, Kolmogorov complexity, Wyner-Ziv coding
