Browsing Department of Biostatistics and Epidemiology: Theses and Dissertations by Title
Now showing items 15-18 of 18
A resampling method of time course gene expression data for gene network inference
Manipulation of cellular functions may aid in the treatment or cure of a disease. Thus, identifying the topology of a gene regulatory network (GRN) and the molecular role of each gene is essential. Discovering GRNs from gene expression data is hampered by intrinsic attributes of the data: small sample size n, large number of variables (genes) p, and unknown error structure. Numerous theoretical approaches for GRN inference attempt to overcome these difficulties; however, most of these methods either provide only point estimators, such as coefficient estimators, or make numerous assumptions that are often incompatible with the data. Furthermore, the different solutions cause GRN inference methods to provide inconsistent results. This dissertation proposes a resampling method for time-course gene expression data that provides interval estimators for existing GRN inference methods without any distributional assumptions. It does so via bootstrapping and a statistical model that considers the various components of the data structure, such as the trend of gene expression, the errors of time-course data, and the correlation between genes. This method will produce more precise GRNs that are consistent with observed gene expression data. Furthermore, by applying our method to multiple existing GRN inference methods, the resulting networks obtained from different inference methods could be combined using the joint confidence region for their parameters. Thus, this method can be used for the validation of identified networks and GRN inference methods.
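As a rough illustration of how bootstrapping can turn a point estimator into an interval estimator without distributional assumptions, the sketch below applies a residual bootstrap to the slope of a straight-line trend fitted to one gene's time course. It is not the dissertation's method: the linear trend model, the function names, and the percentile interval are all assumptions chosen for brevity.

```python
import random

def fit_line(t, y):
    """Ordinary least squares fit of y = a + b*t; returns (a, b)."""
    n = len(t)
    t_bar = sum(t) / n
    y_bar = sum(y) / n
    sxx = sum((ti - t_bar) ** 2 for ti in t)
    sxy = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y))
    b = sxy / sxx
    return y_bar - b * t_bar, b

def residual_bootstrap_slope_ci(t, y, n_boot=2000, alpha=0.05, seed=0):
    """Point estimate and percentile interval for the slope (hypothetical helper).

    Residuals from the fitted trend are resampled with replacement and added
    back to the fitted values, so no error distribution is assumed.
    """
    rng = random.Random(seed)
    a, b = fit_line(t, y)
    fitted = [a + b * ti for ti in t]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    slopes = []
    for _ in range(n_boot):
        y_star = [fi + rng.choice(resid) for fi in fitted]
        slopes.append(fit_line(t, y_star)[1])
    slopes.sort()
    lo = slopes[int((alpha / 2) * n_boot)]
    hi = slopes[int((1 - alpha / 2) * n_boot) - 1]
    return b, (lo, hi)

# Made-up expression values at eight time points, for illustration only.
times = [0, 1, 2, 3, 4, 5, 6, 7]
expr = [1.1, 1.4, 2.2, 2.4, 3.1, 3.3, 4.2, 4.4]
slope, (ci_lo, ci_hi) = residual_bootstrap_slope_ci(times, expr)
```

Resampling residuals rather than raw observations keeps the time ordering intact, which is one simple way to respect the time-course structure; a model with correlated errors or multiple genes would need a more careful scheme.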
Statistical Methods for reaction Networks
Stochastic reaction networks are important tools for modeling many biological phenomena, and understanding these networks is important in a wide variety of applied research, such as disease treatment and drug development. Statistical inference about the structure and parameters of reaction networks, sometimes referred to in this setting as model calibration, is often challenging due to intractable likelihoods. Here we utilize an idea similar to that of generalized estimating equations (GEE), which in this context are the so-called martingale estimating equations, to estimate the reaction rates of the network. The variance component is estimated using the approximate variance under the linear noise approximation, which is based on partial differential equations (Fokker-Planck equations) that approximate the exact chemical master equation. The method is applied to data from the plague outbreak at Eyam, England, in 1665-1666 and to COVID-19 pandemic data. We show empirically that the proposed method gives good estimates of the parameters in a large volume setting and works well in small volume settings.
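To make the notion of a stochastic reaction network concrete, the sketch below simulates an epidemic (SIR) network with two reactions, infection (S + I -> 2I) and recovery (I -> R), using the Gillespie algorithm. This only illustrates the kind of model being calibrated, not the estimating-equation method of the thesis. The function name, rate constants, and population sizes are hypothetical and are not taken from the Eyam or COVID-19 data.

```python
import random

def gillespie_sir(s0, i0, beta, gamma, t_max, seed=0):
    """Simulate one SIR path; returns a list of (time, S, I, R) tuples.

    Reactions (propensities):
      infection  S + I -> 2I   beta * S * I / N
      recovery   I -> R        gamma * I
    """
    rng = random.Random(seed)
    n = s0 + i0
    t, s, i, r = 0.0, s0, i0, 0
    path = [(t, s, i, r)]
    while i > 0:
        rate_inf = beta * s * i / n
        rate_rec = gamma * i
        total = rate_inf + rate_rec
        # Waiting time to the next reaction is exponential with rate `total`.
        t_next = t + rng.expovariate(total)
        if t_next > t_max:
            break
        t = t_next
        # Choose which reaction fires, with probability proportional to its rate.
        if rng.random() * total < rate_inf:
            s, i = s - 1, i + 1
        else:
            i, r = i - 1, r + 1
        path.append((t, s, i, r))
    return path

# Illustrative values only.
path = gillespie_sir(s0=250, i0=5, beta=0.3, gamma=0.1, t_max=200.0, seed=1)
```

Each simulated path is an exact draw from the chemical master equation for this network. The likelihood of observed counts under such a model is typically intractable, which is the difficulty the abstract refers to.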
Statistical Methods to Detect Differentially Methylated Regions with Next-Generation Sequencing Data
Researchers in genomics are increasingly interested in epigenetic factors such as DNA methylation because they play an important role in regulating gene expression without changes in the DNA sequence. Abnormal DNA methylation is associated with many human diseases, including various types of cancer. We propose three different approaches to test for differentially methylated regions (DMRs) associated with complex traits, while accounting for correlations within and among CpG sites in the DMRs. The first approach is a nonparametric method using a kernel distance statistic, and the second is a likelihood-based method using a binomial spatial scan statistic. Both of these approaches detect differentially methylated regions between cases and controls along the genome. The kernel distance method uses a kernel function, while the binomial scan statistic approach uses a mixed effect model to incorporate correlations among CpG sites. Extensive simulations show that both approaches have excellent control of type I error, and both have reasonable statistical power. The binomial scan statistic approach appears to have higher power, while the kernel distance method is computationally faster. We also propose a third method under the Bayesian framework for comparing methylation rates when disease status is classified into ordinal multinomial categories (e.g., stages of cancer). The DMRs are detected using moving windows along the genome. Within each window, the Bayes factor is calculated to compare the two models corresponding to constant vs. monotonic methylation rates among the groups. As in the case of the scan statistic approach, the correlations between the sites are incorporated using a mixed effect model. Results from extensive simulation indicate that the Bayesian method is statistically valid and reasonably powerful to detect DMRs associated with disease severity.
The proposed methods are demonstrated using data from a chronic lymphocytic leukemia (CLL) study.
TWO-SAMPLE TESTS FOR HIGH DIMENSIONAL MEANS WITH PREPIVOTING and DATA TRANSFORMATION
Within the medical field, the need to store and analyze data with small samples and large numbers of variables has become increasingly common. Several two-sample tests for equality of means, including the well-known Hotelling’s T² test, have already been established for the case in which the combined sample size of both populations exceeds the dimension of the variables. However, tests such as Hotelling’s T² become either unusable or have low power when the number of variables is greater than the combined sample size. We propose a test using both prepivoting and Edgeworth expansion that maintains high power in this higher dimensional scenario, known as the “large p small n” problem. Our test’s finite sample performance is compared with that of other recently proposed tests that are also designed to handle the “large p small n” situation. We apply our test to a microarray gene expression data set and report competitive rates for both power and Type-I error.
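For context, the classical statistic mentioned above can be written in its standard textbook form as follows. This is not the test proposed in the thesis, and the symbols (sample sizes n_1 and n_2, sample means, sample covariances S_1 and S_2) are introduced here only for illustration.

```latex
T^2 = \frac{n_1 n_2}{n_1 + n_2}
      \,(\bar{x}_1 - \bar{x}_2)^{\top} S_{\mathrm{pooled}}^{-1} (\bar{x}_1 - \bar{x}_2),
\qquad
S_{\mathrm{pooled}} = \frac{(n_1 - 1) S_1 + (n_2 - 1) S_2}{n_1 + n_2 - 2}.
```

The pooled covariance matrix is p × p but has rank at most n_1 + n_2 − 2, so it cannot be inverted when p > n_1 + n_2 − 2. This is why the statistic becomes unusable in the “large p small n” setting.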