Date:3 November 2023, Friday
Location:S16-04-30
Time:10 am, Singapore
Abstract
This thesis consists of two separate projects: Total variation approximation of fingerprint distribution, and statistical analysis of centred subgraph counts.
Computer scientists are interested in assessing lower bounds on the sample complexity of certain algorithms that test whether the object exhibits some pre-specified property or is “far” from having that property. We employ the multivariate central limit theorem to improve the proof of a specific property testing problem in Valiant and Valiant (2010). We hope such improvement will serve as a template for solving more sophisticated distinguishing problems, for example, graph distinguishing problems.
Dense graph limit theory was developed by Lovász and Szegedy (2006) in a breakthrough paper as a way to understand the behaviour of very large networks. Kaur and Röllin (2021) developed an in-depth analysis of the finer scale stochastic fluctuation, and proposed to use centred subgraph counts which are uncorrelated and the sources of randomness are well captured. We use this idea in actual statistical applications for stochastic block models. First, we refine the centred subgraph count statistics by examining their asymptotic behaviour under unknown model parameters, in particular, to prove multivariate central limit theorems with a certain rate of convergence. Second, these statistics are used as diagnostic tests after a model has been fitted by other procedures. Our simulation study shows that under the core-periphery structure stochastic block model, “star-shaped” centred subgraphs can detect subtle errors in label estimation obtained by spectral clustering, suggesting that our statistics can detect local structures as well as deviations from theoretically predicted ones. Third, our statistics can be used to improve the estimation of labels, and further serve as an effective tool for model selection problems. As a real example, our statistics indicate that Caenorhabditis elegans neuronal network data does not fit the stochastic block model since certain local structures are not adequately captured. Finally, we developed an R-package to calculate centred subgraph count statistics and improve the estimation of labels, together with a Python script which allows using GPU to boost computing speed.