Filling The Gap With Missing Data (Ph.D. Oral Presentation)

Ms. Mao Yinan, Department of Statistics and Data Science, NUS

Date: 19 September 2022, Monday

Location: Zoom, https://nus-sg.zoom.us/j/89616121854?pwd=Y0FTS0RvVkRxMUx6TEg1VDMvaVlyUT09

Time: 3-4 pm, Singapore

Missing data arise in research for reasons such as technical glitches, input mistakes, drop-out, or outliers, and they appear in data sources ranging from primary care and digital recordings to surveys, to name a few. Omitting a non-trivial amount of incomplete data causes problems in analysis, so handling missing data is a crucial step in retaining the information in a large dataset and hence maximising inferential power. Flexible and powerful tools for missing data imputation have been studied extensively to accommodate various data types. While deletion and single-point imputation approaches are generally discouraged, approaches based on multiple imputation are efficient and can accommodate various data structures.
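To make the contrast between single and multiple imputation concrete, the following is a minimal sketch, not the method used in the thesis. It assumes a single outcome `y` with missing entries, a fully observed covariate `x`, and a normal linear imputation model; the function name `multiple_imputation_mean` and the simulated data are hypothetical. Each of the `m` imputed datasets gives its own estimate of the mean, and the estimates are pooled with Rubin's rules so that the between-imputation spread enters the total variance.

```python
import numpy as np

def multiple_imputation_mean(y, x, m=20, seed=0):
    """Pool the mean of y over m imputed datasets using Rubin's rules.

    Assumption (illustrative only): y given x follows a normal linear model,
    and the missing entries of y are marked with NaN.
    """
    rng = np.random.default_rng(seed)
    obs = ~np.isnan(y)
    n_obs = int(obs.sum())

    # Fit the imputation model on the observed cases.
    X_obs = np.column_stack([np.ones(n_obs), x[obs]])
    beta_hat, *_ = np.linalg.lstsq(X_obs, y[obs], rcond=None)
    resid = y[obs] - X_obs @ beta_hat
    dof = n_obs - X_obs.shape[1]
    xtx_inv = np.linalg.inv(X_obs.T @ X_obs)

    X_mis = np.column_stack([np.ones(int((~obs).sum())), x[~obs]])
    estimates, variances = [], []
    for _ in range(m):
        # Draw model parameters first, so imputations reflect parameter uncertainty.
        sigma2 = resid @ resid / rng.chisquare(dof)
        beta = rng.multivariate_normal(beta_hat, sigma2 * xtx_inv)

        # Fill in the missing values with a random draw, not a single best guess.
        y_imp = y.copy()
        noise = rng.normal(0.0, np.sqrt(sigma2), size=X_mis.shape[0])
        y_imp[~obs] = X_mis @ beta + noise

        estimates.append(y_imp.mean())
        variances.append(y_imp.var(ddof=1) / len(y_imp))

    estimates = np.array(estimates)
    within = np.mean(variances)          # average within-imputation variance
    between = estimates.var(ddof=1)      # between-imputation variance
    total_var = within + (1.0 + 1.0 / m) * between
    return estimates.mean(), total_var

# Simulated example data (hypothetical): roughly 30% of y is set to missing.
data_rng = np.random.default_rng(1)
x = data_rng.normal(size=200)
y = 2.0 + 1.5 * x + data_rng.normal(scale=0.5, size=200)
y_missing = y.copy()
y_missing[data_rng.random(200) < 0.3] = np.nan

pooled_mean, pooled_var = multiple_imputation_mean(y_missing, x)
```

A single-point imputation would correspond to using one filled-in dataset and reporting only its within-imputation variance, which ignores the uncertainty about the imputed values; the `between` term is what multiple imputation adds.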

Despite the wide adoption of imputation methods and the many efficiency comparisons among them, the existing literature focuses mainly on their application to inference when data are incomplete, and there is no silver bullet: the choice of imputation method depends on the data type and the study objective. In this thesis, we are interested in using missing data to our advantage, from the perspective of three pillars of application.

Firstly, imputation is integrated into a conflict detection strategy for summary statistics in likelihood-free inference. By intentionally deleting part of the summary statistics and then completing it using multiple imputation, we derive a relative belief ratio to measure the difference in statistical evidence and to evaluate the amount of conflict in the summary statistics. Secondly, for complicated data structures such as irregularly spaced time-dependent or longitudinal data, uncovering insights from the rich dataset greatly benefits inference efficiency. We demonstrate such an application by classifying a continuous glucose monitoring (CGM) dataset and propose a novel protocol for analysing large, irregularly spaced data. Instead of treating the data as missing and applying imputation, we classify the irregular time-dependent data with a carefully designed protocol, which yields a set of clinically meaningful and significant glucotypes differing in the degree of control, the amount of time spent in range, and the presence and timing of hyper- and hypoglycemia. Lastly, on the methodological side of classifying longitudinal datasets, we propose a Bayesian projection-based clustering method that facilitates a customised choice of random effects in mixed linear regression. The proposed method models time series data by mixed linear regression with a functional basis and uses predictive replicates to project onto the random effects of interest while treating the rest as missing. The clustering results depend on the choice of time effects to project onto, and the power of estimation is reflected in the MCMC output.