Graph Neural Network-based Approximate Bayesian Computation for Agent-based Model Calibration of Bacterial Population Growth

Bai, Xianglong

Files

Bai_Xianglong.pdf (11.09 MB)

Date

2026-04-21

Authors

Bai, Xianglong

Advisor

Ingalls, Brian

Publisher

University of Waterloo

Abstract

Approximate Bayesian Computation (ABC) has emerged as a powerful likelihood-free inference framework for model selection and parameter inference in complex biological systems where explicit likelihood functions are intractable or computationally prohibitive. However, the effectiveness of ABC strongly depends on the choice of summary statistics and distance metrics used to compare simulated and observed data. When analyzing time-lapse observations of growing cell populations, the selection of suitable summary statistics often relies on manually designed features informed by domain expertise. Designing such statistics is challenging, as they must capture complex spatial, structural, and temporal characteristics of the biological system. Consequently, handcrafted summary statistics may omit relevant information or fail to generalize across datasets, potentially leading to inefficient inference or biased posterior estimates. This motivates the use of deep learning approaches, such as Graph Neural Networks (GNNs), which can automatically learn informative representations directly from graph-structured data. To address these limitations, this thesis proposes and systematically investigates four novel strategies for integrating deep learning approaches into the Sequential Monte Carlo ABC (ABC-SMC) framework, with a focus on GNNs and Long Short-Term Memory (LSTM) models. These architectures are specifically designed to capture the relational structure of cell populations and the temporal dynamics inherent in time-lapse data. Using GNNs, we encode spatial interactions among cells through contact edges in graph representations of the biological system. The temporal dynamics of the evolving cell population are captured in two ways. In one approach, LSTM layers are incorporated to model dependencies across successive graph observations in time-lapse sequences.
In the alternative approach, we represent temporal relationships directly within the graph structure through lineage edges in a knowledge graph, which explicitly encode parent–daughter relationships between cells over time. We consider two learning paradigms for extracting informative representations from these graphs. In the first paradigm, graph regression models are trained using mean squared error (MSE) to directly predict model parameters from simulated data. In the second paradigm, graph embedding models are trained with a triplet loss to learn low-dimensional representations that preserve the similarity relationships among simulations generated from similar parameter configurations. The resulting representations serve as GNN-based summary statistics, replacing conventional handcrafted statistics within the ABC-SMC inference pipeline. Such deep learning approaches belong to the broader class of GNN-based methods for likelihood-free inference, which aim to automatically extract informative features from complex simulation outputs. We evaluate the proposed strategies against a baseline approach relying on classical summary statistics. Inference performance is assessed using two complementary metrics. One is the Kullback-Leibler (KL) divergence between the inferred posterior distributions and the ground-truth parameters. The other is the mean squared distance (MSD) between the inferred and true parameter values. Across all evaluated strategies, the GNN-based summary statistics consistently outperform conventional handcrafted summary statistics in simulation studies. They yield more accurate posterior approximations, as reflected by reduced KL divergence, and more precise parameter estimates, as reflected by lower MSD values. However, the results are less convincing on real data, likely due to model mismatch.
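
As an illustration of two quantities named above, the following is a minimal sketch of a triplet loss and of a mean squared distance, not the thesis implementation. The function names, the squared Euclidean distance, the margin value, and the exact MSD definition (mean over posterior samples of the squared distance to the true parameter vector) are all assumptions made for this example.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss on batches of embeddings, shape (batch, dim).

    Assumption: squared Euclidean distance and a fixed margin; the thesis
    may use a different distance or margin.
    """
    # Squared distances from each anchor to its positive and negative example.
    d_pos = np.sum((anchor - positive) ** 2, axis=1)
    d_neg = np.sum((anchor - negative) ** 2, axis=1)
    # Penalize triplets whose negative is not at least `margin` farther away
    # than the positive.
    return float(np.mean(np.maximum(d_pos - d_neg + margin, 0.0)))

def mean_squared_distance(samples, true_params):
    """Assumed MSD: mean squared Euclidean distance from samples to the truth."""
    samples = np.asarray(samples, dtype=float)
    true_params = np.asarray(true_params, dtype=float)
    return float(np.mean(np.sum((samples - true_params) ** 2, axis=1)))

# Toy embeddings: the positive lies close to the anchor, the negative far away.
anchor = np.array([[0.0, 0.0]])
positive = np.array([[0.0, 1.0]])
negative = np.array([[3.0, 0.0]])
loss_well_separated = triplet_loss(anchor, positive, negative)
loss_swapped = triplet_loss(anchor, negative, positive)
```

In this toy case the well-separated triplet incurs no loss, while swapping the positive and negative roles yields a positive penalty, which is the behavior that pushes simulations from similar parameter configurations toward nearby embeddings.
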
Overall, this work demonstrates that replacing handcrafted summary statistics with GNN-based ones can substantially improve likelihood-free inference in complex biological systems, provided that there is no model mismatch and no unknown noise in observations from real experiments. By integrating GNNs with the ABC-SMC framework, the proposed approach enables the automatic extraction of informative representations from graph-structured, time-evolving population data. The resulting methodology provides a principled strategy for parameter inference, bridging computational simulations and experimental observations through simulation-based model calibration. Although the biological model considered in this study serves primarily as a simple test case for developing the inference pipeline, the proposed framework is designed to be readily extended to more complex cases commonly encountered in systems biology.
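
To make the role of the summary statistic concrete, the sketch below shows plain rejection ABC with a pluggable summary function. It is deliberately simpler than the ABC-SMC scheme used in the thesis, and every name in it (`abc_rejection`, `simulate_growth`, `summarize_growth`), the toy exponential-growth simulator, the uniform prior, and the tolerance are hypothetical stand-ins. In the thesis setting, the `summarize` argument is where a trained GNN-based encoder would replace a handcrafted statistic.

```python
import numpy as np

def abc_rejection(observed, simulate, summarize, prior_sample, n_draws, epsilon, rng):
    """Plain rejection ABC (not ABC-SMC): keep prior draws whose simulated
    summary lies within `epsilon` of the observed summary."""
    s_obs = np.asarray(summarize(observed), dtype=float)
    accepted = []
    for _ in range(n_draws):
        theta = prior_sample(rng)
        s_sim = np.asarray(summarize(simulate(theta, rng)), dtype=float)
        if np.linalg.norm(s_sim - s_obs) <= epsilon:
            accepted.append(theta)
    return np.array(accepted)

def simulate_growth(rate, rng, n_steps=10):
    """Hypothetical toy simulator: noisy exponential growth of a population."""
    t = np.arange(n_steps)
    return np.exp(rate * t) * rng.lognormal(mean=0.0, sigma=0.05, size=n_steps)

def summarize_growth(counts):
    """Handcrafted summary statistic: slope of log-count against time."""
    t = np.arange(len(counts))
    return [np.polyfit(t, np.log(counts), 1)[0]]

rng = np.random.default_rng(0)
observed = simulate_growth(0.3, rng)  # synthetic "data" with known rate 0.3
posterior = abc_rejection(
    observed,
    simulate_growth,
    summarize_growth,
    prior_sample=lambda r: r.uniform(0.0, 1.0),  # assumed uniform prior on the rate
    n_draws=2000,
    epsilon=0.02,
    rng=rng,
)
```

The accepted draws in `posterior` approximate the posterior over the growth rate; how well they do so depends entirely on how informative `summarize` is, which is the dependence the abstract describes.
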