Show simple item record

dc.contributor.author  Amer, Hossam
dc.date.accessioned  2020-05-15 14:09:33 (GMT)
dc.date.available  2020-05-15 14:09:33 (GMT)
dc.date.issued  2020-05-15
dc.date.submitted  2020-04-14
dc.identifier.uri  http://hdl.handle.net/10012/15865
dc.description.abstract  As we start a new decade, image and video compression should further improve to satisfy both human and computer vision. Human vision and computer vision have different perspectives on images and videos, which are compressed to meet bandwidth and storage requirements. From a human vision (HV) perspective, one key aspect of human satisfaction is the perceived quality of these compressed images and videos. From a computer vision (CV) perspective, especially in image classification, one crucial aspect of machine satisfaction is the ability to accurately recognize patterns or objects in these compressed images and videos. This thesis addresses a variety of image/video compression problems from both the human and the computer vision perspectives. For HV, our goal is to improve the trade-off between compression rate, compression distortion, and time complexity in video compression; for CV, our goal is to show that compression, if used in the right manner, helps improve deep neural network (DNN) classification accuracy while reducing the size in bits of the input image.

Toward the HV perspective, we first introduce a global rate-distortion optimization (RDO) model for the state-of-the-art video codec, High Efficiency Video Coding (HEVC). The existing RDO in HEVC is traditionally performed within each frame with fixed quantization parameters (QPs), without fully considering the coding dependencies between the current frame and future frames within a temporal propagation chain. To further improve the coding efficiency of HEVC, it is desirable to perform a global RDO among consecutive frames while maintaining a similar coding complexity. To address this problem, temporal dependencies are first measured via a model for the energy of prediction residuals, which enables the formulation of the global RDO in low-delay (LD) HEVC. Second, we introduce the notion of propagation length, defined as the length of the impact of the current frame on future frames. This length is estimated via offline experiments and used to propose two novel methods that predict the impact of the coding distortion of the current frame on future frames from previous frames with similar coding properties. Third, we apply these two methods to adaptively determine the Lagrangian multiplier and its corresponding QP for each frame in the LD configuration of HEVC. Experimental results show that, in comparison to the default LD HEVC, the first method achieves average BD-rate savings of 5.0% and 4.9% in the low-delay-P (LDP) and low-delay-B (LDB) configurations, respectively, and the second achieves average BD-rate savings of 4.9% in both configurations, all with only a 1% increase in encoding time. This work has attracted serious interest from industry, including Google.

Staying with the HV perspective, despite the rate-distortion performance improvement that HEVC offers, it is computationally expensive due to the large variety of coding unit (CU) sizes adopted in its RDO. We therefore investigate the application of fully connected neural networks (NNs) to this time-sensitive task to improve its time complexity while controlling the resulting bitrate loss. Specifically, four NNs are introduced, one for each depth of the coding tree unit; each NN either splits the current CU or terminates the CU search algorithm. Because training NNs is time-consuming and requires large training data, we further propose a novel training strategy in which offline training and online adaptation work together to overcome this limitation. Our features are extracted from the original frames based on the Laplacian Transparent Composite Model (LPTCM). Experiments carried out on the all-intra configuration of HEVC reveal that our method is among the best NN methods, with an average time saving of 32% and an average controlled bitrate loss of 1.6% compared to the original HEVC. In our CU partition algorithm, a fully connected NN thus 'sees' extracted LPTCM features to reduce the computational intensity of compression at a controlled trade-off between compression rate and compression distortion.

Turning to the CV perspective, where DNNs typically 'see' the input as a JPEG image, we revisit the impact of JPEG compression on deep learning (DL) in image classification. Given an underlying DNN pre-trained with pristine ImageNet images, we demonstrate that if, for any original image, one can select, among its many JPEG compressed versions including the original version itself, a suitable version as an input to the underlying DNN, then the classification accuracy of the underlying DNN can be improved significantly while the size in bits of the selected input is, on average, reduced dramatically in comparison with the original image. This is in contrast to the conventional understanding that JPEG compression generally degrades the classification accuracy of DL. Specifically, for each original image, consider its 10 JPEG compressed versions with quality factor (QF) values from $\{100, 90, 80, 70, 60, 50, 40, 30, 20, 10\}$. Under the assumption that the ground truth label of the original image is known at the time of selecting an input, but unknown to the underlying DNN, we present a selector called the Highest Rank Selector (HRS). HRS is shown to be optimal in the sense of achieving the highest top-$k$ accuracy on any set of images for any $k$ among all possible selectors. When the underlying DNN is Inception V3 or ResNet-50 V2, HRS improves, on average, the top-1 and top-5 classification accuracy on the whole ImageNet validation dataset by 5.6% and 1.9%, respectively, while reducing the input size in bits dramatically: the compression ratio (CR) between the size of the original images and the size of the inputs selected by HRS is 8 for the whole ImageNet validation dataset. When the ground truth label of the original image is unknown at the time of selection, we further propose selectors that maintain the top-1 accuracy, the top-5 accuracy, or both the top-1 and top-5 accuracy of the underlying DNN, while achieving CRs of 8.8, 3.3, and 3.1, respectively.  en
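
As a reference point for the per-frame Lagrangian multiplier and QP adaptation mentioned in the abstract, the conventional mapping used in the HEVC reference software (HM) can be sketched as follows; this is the standard HM convention, not necessarily the exact update rule derived in the thesis, and the constant $c$ is configuration-dependent (roughly 0.85, adjusted by frame-type weights):

$$\lambda \;=\; c \cdot 2^{(\mathrm{QP}-12)/3}, \qquad \mathrm{QP} \;\approx\; 4.2005\,\ln\lambda \;+\; 13.7122.$$

Under such a mapping, adapting $\lambda$ per frame according to its estimated temporal propagation impact directly implies a corresponding per-frame QP offset.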
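The abstract describes four fully connected networks, one per coding-tree-unit depth, that either split the current CU or terminate the search. The sketch below is only an illustration of that gating idea under stated assumptions: `nets`, `features`, `rd_cost`, and `cu.split4()` are hypothetical placeholders standing in for the per-depth classifiers, the LPTCM-based feature extraction, the encoder's RD-cost evaluation, and the quadtree split, respectively.

```python
# Illustrative sketch (not the thesis implementation) of an NN-gated HEVC CU search:
# at each depth, a per-depth classifier decides whether to encode the CU as-is
# ("terminate") or to recurse into its four sub-CUs ("split").

def search_cu(cu, depth, nets, features, rd_cost, max_depth=3):
    """Return (total RD cost, list of leaf CUs) found by the NN-gated search."""
    if depth < max_depth:
        decision = nets[depth].predict(features(cu))  # hypothetical classifier: "split"/"terminate"
    else:
        decision = "terminate"                        # smallest CUs cannot be split further
    if decision == "terminate":
        return rd_cost(cu), [cu]                      # encode this CU, prune deeper search
    total_cost, leaves = 0.0, []
    for sub in cu.split4():                           # quadtree split into four sub-CUs
        cost, sub_leaves = search_cu(sub, depth + 1, nets, features, rd_cost, max_depth)
        total_cost += cost
        leaves += sub_leaves
    return total_cost, leaves
```

The point of the gating is that pruned subtrees are never evaluated by the encoder's RDO, which is where the reported time saving comes from, at the cost of a controlled bitrate loss when the classifier's decision differs from the exhaustive search.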
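The Highest Rank Selector (HRS) described in the abstract can be sketched as follows. This is a minimal illustration, not the thesis code: `predict_probs` is a hypothetical forward pass of a pre-trained DNN (e.g., Inception V3) returning class probabilities, and the smaller-file tie-break is an illustrative assumption; the QF list is the one given in the abstract.

```python
# Minimal HRS sketch: among the original image and its 10 JPEG versions, pick the
# version whose prediction ranks the ground-truth label highest.
import io
import os
from PIL import Image  # pip install Pillow

QFS = [100, 90, 80, 70, 60, 50, 40, 30, 20, 10]

def candidate_versions(path):
    """Yield (num_bytes, PIL image) for the original file and its 10 JPEG re-encodings."""
    original = Image.open(path).convert("RGB")
    yield os.path.getsize(path), original
    for qf in QFS:
        buf = io.BytesIO()
        original.save(buf, format="JPEG", quality=qf)
        yield buf.tell(), Image.open(io.BytesIO(buf.getvalue()))

def highest_rank_select(path, true_label, predict_probs):
    """Return the candidate whose prediction ranks the ground-truth label highest;
    ties are broken here by smaller size in bits (an illustrative choice)."""
    best_key, best_img = None, None
    for num_bytes, img in candidate_versions(path):
        probs = predict_probs(img)                              # hypothetical DNN forward pass
        rank = sum(1 for p in probs if p > probs[true_label])   # rank 0 == top-1 hit
        key = (rank, num_bytes)                                 # better rank first, then fewer bits
        if best_key is None or key < best_key:
            best_key, best_img = key, img
    return best_img
```

Selecting the version with the best rank of the ground-truth label is what makes the selector achieve the highest possible top-$k$ accuracy for every $k$ simultaneously, since improving the label's rank on one image can never hurt the rank on another.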
dc.language.iso  en  en
dc.publisher  University of Waterloo  en
dc.subject  image compression  en
dc.subject  video compression  en
dc.subject  computer vision  en
dc.subject  deep learning  en
dc.subject  deep neural networks  en
dc.title  Image/Video Compression: Human and Computer Vision Perspectives  en
dc.type  Doctoral Thesis  en
dc.pending  false
uws-etd.degree.department  Electrical and Computer Engineering  en
uws-etd.degree.discipline  Electrical and Computer Engineering  en
uws-etd.degree.grantor  University of Waterloo  en
uws-etd.degree  Doctor of Philosophy  en
uws.contributor.advisor  Yang, En-hui
uws.contributor.affiliation1  Faculty of Engineering  en
uws.published.city  Waterloo  en
uws.published.country  Canada  en
uws.published.province  Ontario  en
uws.typeOfResource  Text  en
uws.peerReviewStatus  Unreviewed  en
uws.scholarLevel  Graduate  en

