Challenges of Learning Under Different Levels of Supervision for Images and Videos

Mr Rahul Rahaman, Department of Statistics and Data Science, NUS

Date: Thursday, 16 February 2023

Location: Zoom (link: https://nus-sg.zoom.us/j/89624948342?pwd=eGN2TDNvT3dvek5IRFFERWR5TGJUQT09)

Time: 11 am (Singapore time)

Abstract

Deep Neural Networks (DNNs) have achieved outstanding performance on several computer vision tasks. However, DNN performance depends heavily on the size of the training data, which can be costly to obtain because labeling high-volume datasets requires a tremendous amount of human annotation effort. This thesis focuses on three annotation-scarce learning scenarios: data-scarce fully-supervised learning, unsupervised learning, and weakly supervised learning. We examine the traditional mechanisms used to adapt DNNs to these scarce-data settings, point out the caveats of such mechanisms, and then propose solutions for each scenario that overcome the highlighted caveats.

In the first work, we look into the data-scarce but fully-labeled setting of image classification. We study the interaction between three widely used methods for adapting DNNs to the low-data regime: ensembling, temperature scaling, and mixup data augmentation. We first show that, contrary to common belief, standard ensembling practices do not lead to better-calibrated models. We empirically demonstrate that interactions between ensembling techniques and modern data-augmentation pipelines must be taken into account for proper uncertainty quantification. We formulate a straightforward "Pool-Then-Calibrate" strategy for post-processing deep ensembles, which can halve the Expected Calibration Error (ECE) on a range of benchmark classification problems compared to standard deep ensembles.
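For intuition, a minimal sketch of this post-processing idea is shown below: the ensemble members' probabilities are pooled by averaging, and a single temperature is then fitted on held-out data by minimizing the negative log-likelihood. The function names and the use of PyTorch's LBFGS optimizer are illustrative assumptions, not the thesis implementation.

```python
import torch
import torch.nn.functional as F

def pool_probabilities(member_logits):
    """The 'pool' step: average the softmax outputs of the ensemble members."""
    probs = torch.stack([F.softmax(logits, dim=-1) for logits in member_logits])
    return probs.mean(dim=0)

def fit_temperature(pooled_probs, labels):
    """The 'calibrate' step: fit one temperature on held-out data by minimizing the NLL."""
    # The log of the pooled probabilities plays the role of the logits being rescaled.
    log_probs = torch.log(pooled_probs.clamp_min(1e-12))
    temperature = torch.nn.Parameter(torch.ones(1))
    optimizer = torch.optim.LBFGS([temperature], lr=0.01, max_iter=200)

    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(log_probs / temperature, labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return temperature.detach()
```

At test time, the pooled probabilities would simply be rescaled by the fitted temperature before evaluating calibration metrics such as ECE.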

In the second work, we focus on an unsupervised setting, landmark discovery in images, where training labels are entirely absent. We empirically demonstrate that standard landmark-discovery approaches are inefficient at enforcing the "equivariance" property on intermediate representations. To quantify the equivariance of convolutional features, we first define a metric similar to cumulative error distribution curves. We then propose a two-step approach to landmark discovery that first learns strongly equivariant features through contrastive learning and then leverages these features within more standard unsupervised landmark-discovery pipelines. Our method finds semantically meaningful and consistent landmarks, and it outperforms previous approaches at finding human body landmarks in the BBC Pose dataset and facial landmarks in the Cat-head dataset.
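As a rough illustration of the equivariance property being measured, the sketch below compares the features of a warped image with the warped features of the original image, using a simple horizontal flip as the warp. The backbone, the choice of warp, and the normalized L2 distance are assumptions made for illustration; they are not the cumulative-error-style metric defined in the thesis.

```python
import torch

def horizontal_flip(t):
    """A simple, exactly invertible spatial transform on (B, C, H, W) tensors."""
    return torch.flip(t, dims=[-1])

def equivariance_error(backbone, images):
    """Compare features of the flipped image with the flipped features of the image."""
    with torch.no_grad():
        feats = backbone(images)                      # (B, C', H', W') feature maps
        feats_of_warped = backbone(horizontal_flip(images))
        warped_feats = horizontal_flip(feats)
    # Per-location L2 distance, normalised by the feature magnitude.
    num = (feats_of_warped - warped_feats).pow(2).sum(dim=1).sqrt()
    den = warped_feats.pow(2).sum(dim=1).sqrt().clamp_min(1e-8)
    return (num / den).mean()
```

A perfectly equivariant extractor would drive this error to zero; in practice, lower values indicate features whose spatial structure follows the input transformation more faithfully.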

In the third work, we consider the weakly supervised setting of timestamp supervision for temporal action segmentation in videos. Here, training labels are available only in an extremely weak form: a handful of labeled video frames (fewer than 0.1% of all frames). Unlike recent works that rely on distributional assumptions or ad-hoc loss functions, we propose EM-TSS, a novel and general model for timestamp supervision built from the first principles of an Expectation-Maximization (EM) formulation. We restructure the EM formulation so that the maximization step reduces to a frame-wise cross-entropy minimization. EM-TSS surpasses previous works on the majority of metrics by a significant margin, sometimes even outperforming the fully supervised setting with only a handful of labels. We further generalize our method to handle annotation errors and show that the generalized formulation can tolerate up to 20% missing segments with only a marginal drop in performance compared to the other methods.
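To make the EM structure concrete, the following sketch runs one simplified round for a single video segment between two consecutive timestamps: the E-step turns the current frame-wise log-probabilities into soft frame labels by marginalizing over the unknown boundary position, and the M-step is a frame-wise cross-entropy against those soft labels. This is a simplified, hypothetical rendering of the idea, not the EM-TSS formulation itself.

```python
import torch

def e_step(log_probs, t1, t2, a, b):
    """Soft frame labels for frames t1..t2, given timestamps (t1, label a) and (t2, label b).

    log_probs: (T, C) frame-wise log-probabilities from the current model.
    Returns soft labels of shape (t2 - t1 + 1, 2) over the two classes (a, b).
    """
    with torch.no_grad():                      # E-step targets are treated as constants
        scores = []
        for k in range(t1, t2):                # boundary k: frames t1..k get a, k+1..t2 get b
            scores.append(log_probs[t1:k + 1, a].sum() + log_probs[k + 1:t2 + 1, b].sum())
        boundary_post = torch.softmax(torch.stack(scores), dim=0)   # P(boundary = k)
        # Marginal P(frame i has label a) = P(boundary >= i).
        p_a = torch.stack([boundary_post[i - t1:].sum() for i in range(t1, t2 + 1)])
    return torch.stack([p_a, 1.0 - p_a], dim=1)

def m_step_loss(log_probs, soft_labels, t1, t2, a, b):
    """The reduced M-step: frame-wise cross-entropy against the E-step soft labels."""
    seg_log_probs = log_probs[t1:t2 + 1][:, [a, b]]
    return -(soft_labels * seg_log_probs).sum(dim=1).mean()
```

In practice the E-step would be recomputed as the segmentation model improves, and the M-step loss would be summed over all pairs of consecutive timestamps in a video before backpropagation.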