• False coverage rate-adjusted smoothed bootstrap simultaneous confidence intervals for selected parameters

      Sun, Jing; Department of Biostatistics and Epidemiology (Augusta University, 2020-05)
      Many modern applications involve a large number of populations with high dimensional parameters. Since there are so many parameters, researchers often draw inferences regarding only the most significant ones, which are called selected parameters. Benjamini and Yekutieli (2005) proposed the false coverage-statement rate (FCR) method for multiplicity correction when constructing confidence intervals for only the selected parameters. The FCR for confidence intervals is parallel to the concept of the false discovery rate in multiple hypothesis testing. In practice, we typically construct FCR-adjusted approximate confidence intervals for selected parameters using either the bootstrap method or the normal approximation method. However, these approximate confidence intervals exhibit inflated FCR for small and moderate sample sizes. Therefore, we suggest a novel procedure for constructing simultaneous confidence intervals for the selected parameters by using a smoothed bootstrap based on a kernel density estimator. A pertinent problem associated with the smoothed bootstrap approach is how to choose the unknown bandwidth in some optimal sense. We derive an optimal choice of the bandwidth, and the resulting smoothed bootstrap confidence intervals asymptotically give better control of the FCR than their competitors. We further show that the suggested smoothed bootstrap simultaneous confidence intervals are FCR-consistent if the dimension of the data grows no faster than N^(3/2). Finite sample performance of our method is illustrated through empirical studies, which show that the proposed method can be successfully applied in practice.
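
      For orientation, the sketch below shows only the selection-adjusted step of Benjamini and Yekutieli (2005), in which intervals for the R selected parameters are built at nominal level 1 - Rq/m, using an ordinary percentile bootstrap rather than the smoothed bootstrap developed here; the selection rule and simulated data are illustrative assumptions.

          # Minimal sketch of FCR-adjusted percentile-bootstrap intervals for
          # selected parameters (Benjamini & Yekutieli, 2005).  The smoothed
          # (kernel) bootstrap and optimal bandwidth of the dissertation are
          # NOT implemented; the selection rule and data are illustrative.
          import numpy as np

          rng = np.random.default_rng(0)
          m, n, q, B = 50, 20, 0.10, 2000     # populations, sample size, FCR level, bootstrap draws

          theta = np.where(rng.random(m) < 0.2, 1.0, 0.0)    # a few nonzero means
          data = theta[:, None] + rng.standard_normal((m, n))
          est = data.mean(axis=1)

          # Selection step: keep parameters with large |estimate| (illustrative rule).
          selected = np.flatnonzero(np.abs(est) > 2.0 / np.sqrt(n))
          R = len(selected)

          # FCR adjustment: nominal coverage 1 - R*q/m for each selected interval.
          level = 1.0 - R * q / m if R > 0 else 1.0 - q
          alpha = 1.0 - level

          for i in selected:
              boot_means = np.array([rng.choice(data[i], size=n, replace=True).mean()
                                     for _ in range(B)])
              lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
              print(f"parameter {i:2d}: estimate {est[i]: .3f}, "
                    f"{100*level:.1f}% bootstrap CI ({lo: .3f}, {hi: .3f})")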
    • Family-based genome-wide association study for simulated data of Framingham Heart Study.

      Xu, Hongyan; Mathew, George; George, Varghese; Department of Biostatistics and Epidemiology (2009-12-18)
      ABSTRACT: Genome-wide association studies (GWAS) have quickly become the norm in dissecting the genetic basis of complex diseases. Family-based association approaches have the advantage of being robust to possible hidden population structure in the samples. Most of these methods, however, were developed with a limited number of markers, so their applicability and performance for GWAS need to be examined. In this report, we evaluated the properties of the family-based association method implemented by ASSOC in the S.A.G.E. package using the simulated data sets for the Framingham Heart Study, and found that ASSOC is a highly useful tool for GWAS.
    • A gene-based approach for testing association of rare alleles

      Xu, Hongyan; George, Varghese; Department of Biostatistics and Epidemiology (2011-11-29)
      Rare genetic variants have been shown to be important in susceptibility to common human diseases, and methods for detecting association of rare genetic variants are drawing much attention. In this report, we applied a gene-based approach to the 200 simulated data sets of unrelated individuals. The test can detect the association of some genes with multiple rare variants.
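
      The report does not reproduce the test's details here, so the following is only a generic rare-variant burden (collapsing) sketch: rare variants within a gene are summed into a single score and tested against case-control status with logistic regression. The allele-frequency cutoff and simulated data are illustrative assumptions, not the report's method.

          # Generic rare-variant burden (collapsing) test sketch -- not the
          # specific gene-based method of the report.  Data and the 1%
          # frequency cutoff are illustrative assumptions.
          import numpy as np
          import statsmodels.api as sm

          rng = np.random.default_rng(1)
          n, n_variants = 1000, 30
          maf = rng.uniform(0.001, 0.02, n_variants)          # rare allele frequencies
          geno = rng.binomial(2, maf, size=(n, n_variants))   # genotypes in one gene

          burden = geno[:, maf < 0.01].sum(axis=1)            # collapse rare variants
          logit_p = -2.0 + 0.5 * burden                       # simulated disease model
          y = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))

          X = sm.add_constant(burden.astype(float))
          fit = sm.Logit(y, X).fit(disp=0)
          print(f"burden coefficient {fit.params[1]:.3f}, p-value {fit.pvalues[1]:.3g}")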
    • Maternal Health Literacy Progression Among Rural Perinatal Women

      Mobley, Sandra C.; Thomas, Suzanne Dixson; Sutherland, Donald E.; Hudgins, Jodi; Ange, Brittany L.; Johnson, Maribeth H.; Department of Obstetrics and Gynecology (Springer, 2014-01-28)
      This research examined changes in maternal health literacy progression among 106 low income, high risk, rural perinatal African American and White women who received home visits by Registered Nurse Case Managers through the Enterprise Community Healthy Start Program. Maternal health literacy progression would enable women to better address intermediate factors in their lives that impacted birth outcomes, and ultimately infant mortality (Lu and Halfon in Matern Child Health J 7(1):13-30, 2003; Sharma et al. in J Natl Med Assoc 86(11):857-860, 1994). The Life Skills Progression Instrument (LSP) (Wollesen and Peifer, Life skills progression: an outcome and intervention planning instrument for use with families at risk, Paul H. Brookes Publishing Co., Baltimore, 2006) measured changes in behaviors that represented intermediate factors in birth outcomes. Maternal Health Care Literacy (LSP/M-HCL) was a woman's use of information, critical thinking, and health care services; Maternal Self Care Literacy (LSP/M-SCL) was a woman's management of personal and child health at home (Smith and Moore in Health literacy and depression in the context of home visitation, Matern Child Health J, 2011). Adequacy was set at a score of ≥4. Among the 106 women in the study, initial scores were inadequate (<4) on LSP/M-HCL for 83 % and on LSP/M-SCL for 30 %. Significant positive changes were noted in maternal health literacy progression from the initial prenatal assessment to the first postpartum assessment (p < .01) and to the final postpartum assessment (p < .01) using McNemar's test of gain scores. Numeric comparison of first and last gain scores indicated that women's scores progressed (LSP/M-HCL: p < .0001; LSP/M-SCL: p < .0001). Elevated depression scores were most frequent among women with <4 LSP/M-HCL and/or <4 LSP/M-SCL. Visit notes indicated that lack or loss of relationship with the father of the baby and intimate partner discord contributed to higher depression scores.
    • Mathematical and Stochastic Modeling of HIV Immunology and Epidemiology

      Lee, Tae Jin; Department of Biostatistics and Epidemiology (8/3/2017)
      In HIV virus dynamics, controlling viral load and maintaining the CD4 count at a high level are the primary goals for providers. In recent years a new molecule, eCD4-Ig, was discovered; it mimics CD4 when introduced into the human body and has the potential to change existing HIV virus dynamics. Thus, to understand the joint dynamics of viral load, eCD4-Ig, and CD4 cells, we developed mathematical models incorporating the interactions between this new molecule and other known immunological and virological quantities. We further investigated model-based scenarios for management and obtained the level of eCD4-Ig required for elimination of the virus. Next, we built an epidemiological model for HIV spread and control among discordant couples through the dynamics of PrEP (pre-exposure prophylaxis). For this, a stochastic model based on actuarial assumptions is used to obtain the mean remaining time for which a couple stays discordant. We also generalized the single hook-up/marriage stochastic model to a multiple hook-up/marriage model.
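
      The dissertation's model equations are not reproduced in this abstract, so the sketch below is only a generic target-cell-limited viral dynamics system with an added eCD4-Ig-like neutralization term; the specific form of the eCD4-Ig interaction and every parameter value are illustrative assumptions.

          # Generic viral-dynamics sketch with an eCD4-Ig-like neutralizer E that
          # increases clearance of free virus.  The dissertation's actual model
          # and parameters differ; all numbers here are illustrative assumptions.
          import numpy as np
          from scipy.integrate import odeint

          def hiv_ecd4(y, t, lam, d, beta, delta, p, c, s_E, d_E, k):
              T, I, V, E = y            # target cells, infected cells, virus, eCD4-Ig
              dT = lam - d * T - beta * T * V
              dI = beta * T * V - delta * I
              dV = p * I - c * V - k * E * V        # extra clearance from eCD4-Ig binding
              dE = s_E - d_E * E                    # constant eCD4-Ig production and decay
              return [dT, dI, dV, dE]

          params = (1e4, 0.01, 2.4e-8, 1.0, 2e3, 23.0, 5.0, 0.3, 1e-2)
          t = np.linspace(0, 200, 2001)
          sol = odeint(hiv_ecd4, [1e6, 0.0, 1e-3, 0.0], t, args=params)
          print("final viral load:", sol[-1, 2])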
    • A modified bump hunting approach with correlation-adjusted kernel weight for detecting differentially methylated regions on the 450K array

      Daniel, Jeannie T; Department of Biostatistics and Epidemiology (8/3/2017)
      DNA methylation plays an important role in the regulation of gene expression, as hypermethylation is associated with gene silencing. The general purpose of this dissertation is the development of a statistical method, called DMR Detector, for detecting differentially methylated regions (DMRs) on the 450K array. DMR Detector makes three key modifications to an existing method called Bumphunter. The first is what statistic to collect from the initial fitting for further analysis. The second is to perform kernel smoothing under the assumption of correlated errors using a newly proposed correlation-adjusted kernel weight. The third is how to define regions of interest. In simulation, the method was shown to have high power comparable to Bumphunter, with consistently lower family-wise type I error rate, controlled well below the 0.1 FDR. DMR Detector was applied to real data and was able to detect one DMR that was not detected by Bumphunter.
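
      The correlation-adjusted kernel weight itself is defined in the dissertation, so the sketch below only illustrates the generic backbone shared with Bumphunter-style methods: kernel-smooth per-CpG statistics along genomic position and call candidate regions where the smoothed statistic exceeds a cutoff. The Gaussian kernel, bandwidth, cutoff, and data are illustrative assumptions.

          # Generic bump-hunting backbone: kernel-smooth per-CpG statistics along
          # genomic position, then call regions above a cutoff.  The dissertation's
          # correlation-adjusted kernel weight is not implemented here.
          import numpy as np

          rng = np.random.default_rng(2)
          pos = np.sort(rng.integers(0, 100_000, 500))          # CpG positions
          stat = rng.standard_normal(500)
          stat[(pos > 40_000) & (pos < 45_000)] += 2.0          # one simulated DMR

          def kernel_smooth(pos, stat, bandwidth=1_000.0):
              """Nadaraya-Watson smoothing of per-CpG statistics over position."""
              smoothed = np.empty_like(stat)
              for i, p in enumerate(pos):
                  w = np.exp(-0.5 * ((pos - p) / bandwidth) ** 2)
                  smoothed[i] = np.sum(w * stat) / np.sum(w)
              return smoothed

          smooth = kernel_smooth(pos, stat)
          above = smooth > 1.0                                  # illustrative cutoff
          # group consecutive above-cutoff CpGs into candidate regions
          edges = np.flatnonzero(np.diff(above.astype(int)) != 0) + 1
          for block in np.split(np.arange(len(pos)), edges):
              if above[block[0]]:
                  print(f"candidate DMR: {pos[block[0]]}-{pos[block[-1]]} ({len(block)} CpGs)")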
    • A Modified Information Criterion in the 1d Fused Lasso for DNA Copy Number Variant Detection using Next Generation Sequencing Data

      Lee, Jaeeun; Department of Biostatistics and Epidemiology (8/3/2017)
      DNA copy number variations (CNVs) are associated with many human diseases. Recently, CNV studies have been carried out using next generation sequencing (NGS) technology, which produces millions of short reads. With NGS reads-ratio data, we use 1d fused lasso regression for CNV detection. Given the number of copy number changes, the corresponding genomic locations are estimated by fitting the 1d fused lasso. Estimation of the number of copy number changes depends on a tuning parameter in the 1d fused lasso. In this dissertation, we propose a new modified Bayesian information criterion, called JMIC, to estimate the optimal tuning parameter in the 1d fused lasso. In our theoretical studies, we prove that the number of change points estimated by JMIC converges to the true number of changes. Our simulation studies also show that JMIC outperforms the other criteria considered. Finally, we apply the proposed method to the reads-ratio data from the breast tumor cell line HCC1954 and its matched cell line provided by Chiang et al. (2009).
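
      JMIC itself is defined in the dissertation, so the sketch below only shows the surrounding machinery: the 1d fused lasso signal approximator (solved here by rewriting it as an ordinary lasso on a cumulative-sum design) tuned over a penalty grid with a standard BIC as a stand-in for JMIC. The data, penalty grid, and BIC form are illustrative assumptions.

          # 1d fused lasso signal approximator tuned by an information criterion.
          # The fused lasso is solved by rewriting it as an ordinary lasso on a
          # cumulative-sum design; a standard BIC is used here as a stand-in for
          # the dissertation's JMIC.  Data and the penalty grid are illustrative.
          import numpy as np
          from sklearn.linear_model import Lasso

          rng = np.random.default_rng(3)
          n = 300
          truth = np.concatenate([np.zeros(100), 0.8 * np.ones(80), np.zeros(120)])
          y = truth + 0.3 * rng.standard_normal(n)              # simulated reads ratio

          # beta_i = mu + sum_{j<=i} theta_j  ->  lasso on a lower-triangular design
          X = np.tril(np.ones((n, n)))[:, 1:]                   # intercept handles the first column

          best = None
          for lam in np.geomspace(0.001, 1.0, 30):
              fit = Lasso(alpha=lam / n, fit_intercept=True, max_iter=50_000).fit(X, y)
              beta_hat = fit.intercept_ + X @ fit.coef_
              k = np.count_nonzero(fit.coef_)                   # number of change points
              rss = np.sum((y - beta_hat) ** 2)
              bic = n * np.log(rss / n) + k * np.log(n)         # generic BIC, not JMIC
              if best is None or bic < best[0]:
                  best = (bic, lam, k)

          print(f"selected lambda {best[1]:.4f} with {best[2]} change points")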
    • Multivariate Poisson Abundance Models for Analyzing Antigen Receptor Data

      Greene, Joshua C.; Department of Biostatistics and Epidemiology (2013-05)
      Antigen receptor data is an important source of information for immunologists that is highly statistically challenging to analyze due to the presence of a huge number of T-cell receptors in mammalian immune systems and the severe undersampling bias associated with the commonly used data collection procedures. Many important immunological questions can be stated in terms of richness and diversity of T-cell subsets under various experimental conditions. This dissertation presents a class of parametric models and uses a special case of them to compare the richness and diversity of antigen receptor populations in mammalian T-cells. The parametric models are based on a representation of the observed receptor counts as a multivariate Poisson abundance model (mPAM). A Bayesian model fitting procedure is developed which allows fitting of the mPAM parameters with the help of the complete likelihood as opposed to its conditional version which was used previously. The new procedure is shown to be often considerably more efficient (as measured by the amount of Fisher information) in the regions of the mPAM parameter space relevant to modeling T-cell data. A richness estimator based on the special case of the mPAM is shown to be superior to several existing richness estimators from the statistical ecology literature under the severe undersampling conditions encountered in antigen receptor data collection. The comparative diversity analyses based on the mPAM special case yield biologically meaningful results when applied to the T-cell receptor repertoires in mice. It is also shown that the amount of time to implement the Bayesian model fitting procedure for the mPAM special case scales well as the dimension increases and that the amount of computational resources required to conduct complete statistical analyses for the mPAM special case can be drastically lower for our Bayesian model fitting procedure than for code based on the conditional likelihood approach.
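
      The mPAM-based estimator is developed in the dissertation itself; the short sketch below only reproduces the undersampling setting and one classical ecology richness estimator (Chao1) of the kind the dissertation compares against. The gamma abundance model and sampling depth are illustrative assumptions.

          # Simulate an undersampled receptor repertoire and compute the classical
          # bias-corrected Chao1 richness estimator from statistical ecology --
          # an existing comparator, not the dissertation's mPAM estimator.
          import numpy as np

          rng = np.random.default_rng(4)
          true_richness = 5_000
          abundance = rng.gamma(shape=0.3, scale=1.0, size=true_richness)   # clone sizes
          sample_depth = 20_000
          counts = rng.multinomial(sample_depth, abundance / abundance.sum())

          observed = np.count_nonzero(counts)
          f1 = np.count_nonzero(counts == 1)       # singletons
          f2 = np.count_nonzero(counts == 2)       # doubletons
          chao1 = observed + f1 * (f1 - 1) / (2 * (f2 + 1))   # bias-corrected Chao1

          print(f"true richness {true_richness}, observed {observed}, Chao1 {chao1:.0f}")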
    • A new measure of population structure using multiple single nucleotide polymorphisms and its relationship with FST.

      Xu, Hongyan; Sarkar, Bayazid; George, Varghese; Department of Biostatistics and Epidemiology (2009-03-16)
      BACKGROUND: Large-scale genome-wide association studies are promising for unraveling the genetic basis of complex diseases. Population structure is a potential problem, the effects of which on genetic association studies are controversial. The first step toward systematically quantifying the effects of population structure is to choose an appropriate measure of population structure for human data. The commonly used measure is Wright's FST, and for a set of subpopulations it is generally assumed that there is a single value of FST. However, the estimates can differ across loci. Since population structure is a concept at the population level, a measure of population structure that utilizes information across loci would be desirable. FINDINGS: In this study we propose a measure C adjusted according to the sample size from each subpopulation. The new measure C is based on the c parameter proposed for SNP data, which was assumed to be subpopulation-specific and common to all loci. We performed extensive simulations of samples with varying levels of population structure to investigate the properties and relationships of both measures, and found that the two measures generally agree well. CONCLUSION: The new measure simultaneously uses marker information across the genome. It has the advantage of easy interpretation as a single measure of population structure and yet can also assess population differentiation.
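
      The adjusted measure C is defined in the paper; the sketch below only computes the baseline it is compared with, a simple per-locus Wright's FST of the textbook form (H_T - H_S)/H_T from subpopulation allele frequencies. The frequencies and equal-size subpopulations are illustrative assumptions.

          # Per-locus Wright's F_ST = (H_T - H_S) / H_T from subpopulation allele
          # frequencies -- the standard single-locus baseline, not the proposed
          # multi-locus measure C.  Frequencies are illustrative assumptions.
          import numpy as np

          rng = np.random.default_rng(5)
          n_subpops, n_loci = 4, 1000
          p = np.clip(rng.beta(2, 2, size=(n_subpops, n_loci)), 0.01, 0.99)  # allele freqs

          h_s = np.mean(2 * p * (1 - p), axis=0)          # mean within-subpop heterozygosity
          p_bar = p.mean(axis=0)                          # overall allele frequency
          h_t = 2 * p_bar * (1 - p_bar)                   # total heterozygosity
          fst = (h_t - h_s) / h_t

          print(f"per-locus F_ST: mean {fst.mean():.3f}, range ({fst.min():.3f}, {fst.max():.3f})")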
    • A New Method For Analyzing 1:N Matched Case Control Studies With Incomplete Data

      Jin, Chan; Department of Biostatistics and Epidemiology (5/8/2017)
      1:n matched case-control studies are commonly used to evaluate the association between exposure to a risk factor and a disease, where one case is matched to up to n controls. The odds ratio is typically used to quantify such an association. Difficulties in estimating the true odds ratio arise when the exposure status is unknown for at least one individual in a group. In the case where the exposure status is known for all individuals in a group, the true odds ratio is estimated as the ratio of the counts in the discordant cells of the observed two-by-two table. In the case where all data are independent, the odds ratio is estimated using the cross-product ratio from the observed table, and conditional logistic regression estimates are used for incomplete matching data. In this dissertation we suggest a simple method for estimating the odds ratio when the sample consists of a combination of paired and unpaired observations, with 1:n matching. This method uses a weighted average of the odds ratio estimates described above. The dissertation compares the new method to existing methods via simulation.
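
      The dissertation's specific weights are not given in the abstract, so the sketch below only combines the two component estimators named there, the discordant-cell ratio for matched pairs and the cross-product ratio for the unmatched table, using inverse-variance weights on the log scale as one plausible weighting; the counts are illustrative assumptions.

          # Combine the matched-pair odds ratio (ratio of discordant cells) and
          # the unmatched cross-product odds ratio with inverse-variance weights
          # on the log scale.  The dissertation's actual weighting may differ;
          # counts and weights here are illustrative assumptions.
          import numpy as np

          # matched pairs: b = case exposed / control unexposed, c = the reverse
          b, c = 40, 22
          or_matched = b / c
          var_log_matched = 1 / b + 1 / c

          # unmatched 2x2 table: a, b2 = exposed/unexposed cases; c2, d = exposed/unexposed controls
          a, b2, c2, d = 55, 60, 35, 70
          or_unmatched = (a * d) / (b2 * c2)
          var_log_unmatched = 1 / a + 1 / b2 + 1 / c2 + 1 / d

          w1, w2 = 1 / var_log_matched, 1 / var_log_unmatched
          log_or = (w1 * np.log(or_matched) + w2 * np.log(or_unmatched)) / (w1 + w2)
          se = np.sqrt(1 / (w1 + w2))
          print(f"combined OR {np.exp(log_or):.2f}, "
                f"95% CI ({np.exp(log_or - 1.96*se):.2f}, {np.exp(log_or + 1.96*se):.2f})")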
    • A new transmission test for affected sib-pair families.

      Xu, Hongyan; George, Varghese; Department of Biostatistics and Epidemiology (2008-05-09)
      Family-based association approaches such as the transmission-disequilibrium test (TDT) are used extensively in the study of genetic traits because they are generally robust to the presence of population structure. However, these approaches necessarily involve recruitment of families, which is more costly and time-consuming than sampling unrelated individuals as in population-based approaches. Therefore, a family-based approach with high power would be appealing because of the gain in time and cost from the reduced sample size required to attain adequate power. Here we introduce a new family-based transmission test using the joint transmission status from affected sib pairs. We show that, by including the transmission status of both siblings, our method gives higher power than the TDT design while maintaining the correct type I error rate. We use the simulated data from affected sib-pair families with rheumatoid arthritis provided by Genetic Analysis Workshop 15 to illustrate our approach.
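
      The joint-transmission test is developed in the paper itself; the sketch below only computes the classical TDT statistic it is compared against, (b - c)^2 / (b + c) on transmission counts from heterozygous parents. The counts are illustrative assumptions.

          # Classical transmission-disequilibrium test (TDT): for heterozygous
          # parents, compare counts of transmitted vs. untransmitted copies of an
          # allele with a McNemar-type chi-square.  This is the comparator design,
          # not the paper's joint sib-pair test; counts are illustrative.
          from scipy.stats import chi2

          b = 120   # heterozygous parents transmitting allele A to an affected child
          c = 85    # heterozygous parents transmitting the other allele
          tdt = (b - c) ** 2 / (b + c)
          p_value = chi2.sf(tdt, df=1)
          print(f"TDT chi-square {tdt:.2f}, p-value {p_value:.4f}")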
    • Penalized Least Squares and the Algebraic Statistical Model for Biochemical Reaction Networks

      Linder, Daniel F. II; Department of Biostatistics and Epidemiology (2013-07)
      Systems biology seeks to understand the formation of macro structures such as cellular processes and higher level cellular phenomena by investigating the interactions of a system's individual components. For cellular biology, the goal is to understand the dynamic behavior of biological materials within the cell, a container holding smaller constituents such as mRNA, proteins, enzymes, and other intermediates necessary for regulating intracellular functions and chemical species levels. Understanding these cellular dynamics is needed to help develop new drug therapies, which can be targeted to specific molecules or specific genes, in order to perturb the system for a desired result. In this work we develop inferential procedures to estimate reaction rate coefficients in cellular systems of ordinary differential equations (ODEs) from noisy data arising from realizations of molecular trajectories. It is assumed that these systems obey the so-called chemical mass action law of kinetics, with a corresponding deterministic mass action limit as the system size becomes infinite. The estimation and inference are based on penalized least squares estimates, where the covariance structure of these estimates corresponds to the solution of a system of coupled nonautonomous ODEs. Another topic discussed here is network topology estimation. The algebraic statistical model (ASM) offers a means of performing this topological inference for the special class of conic networks. We prove that the ASM recovers the true network topology as the number of samples grows without bound, a property known in the literature as sparsistency. We also propose a method to extend the ASM to a wider class of networks that are decomposable into multiple cones.
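
      A minimal sketch of the kind of penalized least squares fit described here, using a two-species mass-action system A -> B with rate k and a ridge penalty on the rate; the reaction system, penalty form, and noise level are illustrative assumptions rather than the dissertation's estimator or its covariance theory.

          # Penalized least squares for a mass-action rate constant: simulate noisy
          # trajectories of A -> B (rate k), then minimize squared trajectory error
          # plus a ridge penalty on k.  System, penalty, and noise are illustrative.
          import numpy as np
          from scipy.integrate import odeint
          from scipy.optimize import minimize_scalar

          def mass_action(y, t, k):
              A, B = y
              return [-k * A, k * A]            # d[A]/dt = -k[A], d[B]/dt = k[A]

          t = np.linspace(0, 5, 40)
          true_k = 0.7
          rng = np.random.default_rng(6)
          traj = odeint(mass_action, [10.0, 0.0], t, args=(true_k,))
          obs = traj + 0.2 * rng.standard_normal(traj.shape)     # noisy observations

          def penalized_loss(k, lam=0.1):
              pred = odeint(mass_action, [10.0, 0.0], t, args=(k,))
              return np.sum((obs - pred) ** 2) + lam * k ** 2    # ridge-penalized SSE

          fit = minimize_scalar(penalized_loss, bounds=(1e-3, 5.0), method="bounded")
          print(f"true k {true_k}, penalized least squares estimate {fit.x:.3f}")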
    • Ranking analysis of F-statistics for microarray data.

      Tan, Yuan-De; Fornage, Myriam; Xu, Hongyan; Department of Biostatistics and Epidemiology (2008-04-15)
      BACKGROUND: Microarray technology provides an efficient means for globally exploring physiological processes governed by the coordinated expression of multiple genes. However, identification of genes differentially expressed in microarray experiments is challenging because of the potentially high type I error rate. Methods for large-scale statistical analyses have been developed, but most of them are applicable only to two-sample or two-condition data. RESULTS: We developed a large-scale multiple-group F-test based method, named ranking analysis of F-statistics (RAF), which is an extension of ranking analysis of microarray data (RAM) for the two-sample t-test. In this method, we propose a novel random splitting approach to generate the null distribution instead of using permutation, which may not be appropriate for microarray data. We also implemented a two-simulation strategy to estimate the false discovery rate. Simulation results suggest that the method has higher efficiency in finding differentially expressed genes among multiple classes at a lower false discovery rate than some commonly used methods. Applying our method to the experimental data, we found 107 genes with significantly different expression among 4 treatments at <0.7% FDR, of which 31 belong to the expressed sequence tags (ESTs) and 76 are unique genes that have known functions in the brain or central nervous system and belong to six major functional groups. CONCLUSION: Our method is suitable for identifying differentially expressed genes among multiple groups, in particular when the sample size is small.
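
      RAF's random-splitting null and two-simulation FDR estimate are specific to the paper, so the sketch below only shows the baseline computation it extends: per-gene one-way ANOVA F statistics across multiple groups, with Benjamini-Hochberg FDR control as a generic stand-in. The simulated expression matrix is an illustrative assumption.

          # Baseline per-gene multi-group F statistics with Benjamini-Hochberg FDR,
          # as a generic stand-in for RAF (whose random-splitting null and
          # two-simulation FDR estimate are not reproduced here).  Data simulated.
          import numpy as np
          from scipy.stats import f_oneway
          from statsmodels.stats.multitest import multipletests

          rng = np.random.default_rng(7)
          n_genes, groups, n_per = 2000, 4, 5
          expr = rng.standard_normal((n_genes, groups * n_per))
          expr[:100, :n_per] += 1.5                      # first 100 genes differ in group 1

          pvals = np.array([
              f_oneway(*(expr[g, i*n_per:(i+1)*n_per] for i in range(groups))).pvalue
              for g in range(n_genes)
          ])
          reject, _, _, _ = multipletests(pvals, alpha=0.01, method="fdr_bh")
          print(f"{reject.sum()} genes called differentially expressed at 1% FDR")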
    • A resampling method of time course gene expression data for gene network inference

      Garren, Jeonifer Margaret; Department of Biostatistics (2015)
      Manipulation of cellular functions may aid in the treatment and/or cure of a disease. Thus, identifying the topology of a gene regulatory network (GRN) and the molecular role of each gene is essential. Discovering GRNs from gene expression data is hampered by intrinsic attributes of the data: small sample size n, large number of variables (genes) p, and unknown error structure. Numerous theoretical approaches for GRN inference attempt to overcome these difficulties; however, most of these methods either provide only point estimators, such as coefficient estimates, or make numerous assumptions that are often incompatible with the data. Furthermore, these differing solutions cause GRN inference methods to produce inconsistent results. This dissertation proposes a resampling method for time-course gene expression data which can provide interval estimators for existing GRN inference methods without any distributional assumptions, via bootstrapping and a statistical model that accounts for the various components of the data structure, such as the trend of gene expression, the errors of time-course data, and the correlation between genes. This method produces more precise GRNs that are consistent with the observed gene expression data. Furthermore, by applying our method to multiple existing GRN inference methods, the networks obtained from different inference methods can be combined using the joint confidence region for their parameters. Thus, the method can also be used for validation of identified networks and of GRN inference methods.
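
      As a minimal illustration of the resampling idea, the sketch below fits a smooth trend to each gene's time course, bootstraps the residuals, and recomputes a pairwise statistic (a simple correlation, standing in for a GRN edge score) to obtain an interval estimate. The polynomial trend, the statistic, and the data are illustrative assumptions; the dissertation's model additionally handles the error and correlation structure of time-course data.

          # Residual-bootstrap sketch for time-course data: fit a trend per gene,
          # resample residuals, and recompute a toy edge statistic to obtain an
          # interval estimate.  Not the dissertation's full statistical model.
          import numpy as np

          rng = np.random.default_rng(8)
          t = np.arange(20, dtype=float)
          gene1 = np.sin(t / 3) + 0.2 * rng.standard_normal(20)
          gene2 = 0.8 * np.sin(t / 3) + 0.2 * rng.standard_normal(20)

          def detrend(y, deg=3):
              trend = np.polyval(np.polyfit(t, y, deg), t)
              return trend, y - trend

          trend1, res1 = detrend(gene1)
          trend2, res2 = detrend(gene2)

          boot_corr = []
          for _ in range(2000):
              idx = rng.integers(0, len(t), len(t))           # resample residual pairs
              boot_corr.append(np.corrcoef(res1[idx], res2[idx])[0, 1])

          lo, hi = np.quantile(boot_corr, [0.025, 0.975])
          print(f"bootstrap 95% interval for the edge statistic: ({lo:.2f}, {hi:.2f})")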
    • Simultaneous analysis of all single-nucleotide polymorphisms in genome-wide association study of rheumatoid arthritis.

      Mathew, George; Xu, Hongyan; George, Varghese; Department of Biostatistics and Epidemiology (2009-12-18)
      ABSTRACT: The availability of a very large number of markers from modern technology has made genome-wide association studies very popular. The usual approach is to test single-nucleotide polymorphisms (SNPs) one at a time for association with disease status. However, it may not be possible to detect marginally significant effects by single-SNP analysis. Simultaneous analysis of SNPs enables detection of even those SNPs with small effects by evaluating the collective impact of several neighboring SNPs. Also, false-positive signals may be weakened by the presence of other neighboring SNPs included in the analysis. We analyzed the North American Rheumatoid Arthritis Consortium data of Genetic Analysis Workshop 16 using HLasso, a new method for simultaneous analysis of SNPs. The simultaneous analysis approach has excellent control of type I error, and many of the previously reported results of single-SNP analyses were confirmed by this approach.
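
      HLasso itself is not reimplemented here; the sketch below only illustrates the general idea of simultaneous SNP analysis with a sparsity-inducing penalty, using ordinary L1-penalized logistic regression as a generic stand-in. The simulated genotypes, effect sizes, and penalty strength are illustrative assumptions.

          # Simultaneous SNP analysis with a sparsity-inducing penalty: ordinary
          # L1-penalized logistic regression over all SNPs at once, as a generic
          # stand-in for HLasso (not a reimplementation of it).  Data simulated.
          import numpy as np
          from sklearn.linear_model import LogisticRegression

          rng = np.random.default_rng(9)
          n, p = 800, 500
          maf = rng.uniform(0.05, 0.5, p)
          G = rng.binomial(2, maf, size=(n, p)).astype(float)   # SNP genotype matrix

          beta = np.zeros(p)
          beta[:5] = 0.6                                        # five truly associated SNPs
          eta = G @ beta
          eta -= eta.mean()                                     # roughly balanced cases/controls
          y = rng.binomial(1, 1 / (1 + np.exp(-eta)))

          fit = LogisticRegression(penalty="l1", solver="liblinear", C=0.05).fit(G, y)
          nonzero = np.flatnonzero(fit.coef_[0])
          print(f"SNPs with nonzero coefficients: {nonzero}")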
    • SoTL Scholars Speak

      Schwind, Jessica Smith; Weeks, Thomas; Reich, Nickie; Johnson, Melissa; Armstrong, Rhonda; Hartmann, Quentin; Department of Biostatistics and Epidemiology; University Libraries; Department of Mathematics; University Libraries; Department of English and Foreign Languages; Department of English and Foreign Languages; Department of Psychological Sciences (2016-09)
      Jessica Smith Schwind, Learning is Contagious: Lessons in Online Course Design: Online learning environments are a key platform for teaching and learning in the 21st century, but they often try to simply recreate the classical in-person classroom. Our goal was to develop, implement and evaluate an immersive, online course where students are key players in a captivating epidemiologic outbreak investigation using a multidisciplinary team approach.; Thomas Weeks, Using threshold concepts in information literacy instruction: While "threshold concept" is a buzzword in information literacy instruction, can it be useful for single-session information literacy instruction? This project evaluated students who received instruction based in threshold concepts to see if they did better than their peers who received traditional skills-based instruction.; Nickie Reich, Lessons Learned From My First SoTL Project: Traditional vs. Discovery Learning in College Algebra: Ms. Reich will step the audience through the planning, implementation, and analysis of her first SoTL project. Knowledge gained from the experience and from the project data will be shared.; Melissa Johnson and Rhonda Armstrong, Using Freely Available Texts in a Literature Classroom: Rhonda Armstrong and Melissa Johnson will present their SoTL project and discuss the challenges of creating an American Literature survey (pre-colonial to present) using freely-available texts. They will also discuss the students' attitude toward and level of engagement with digital texts.; Quentin Hartmann, Can peers improve performance? An investigation of the Think-Pair-Share teaching strategy: The Think-Pair-Share teaching strategy was tested with a class of psychology majors in a Senior Capstone course. All students did the same assignment alone, then one half of the students provided feedback to each other; the other half worked alone and all were given the option to revise their work. Performance between groups was compared.
    • Statistical Methods for Reaction Networks

      Odubote, Oluseyi Samuel; Department of Biostatistics and Epidemiology
      Stochastic reaction networks are important tools for modeling many biological phenomena, and understanding these networks is important in a wide variety of applied research, such as disease treatment and drug development. Statistical inference about the structure and parameters of reaction networks, sometimes referred to in this setting as model calibration, is often challenging due to intractable likelihoods. Here we utilize an idea similar to that of generalized estimating equations (GEE), which in this context are the so-called martingale estimating equations, for estimation of the reaction rates of the network. The variance component is estimated using the approximate variance under the linear noise approximation, which is based on a partial differential equation, the Fokker-Planck equation, that provides an approximation to the exact chemical master equation. The method is applied to data from the plague outbreak at Eyam, England in 1665-1666 and to COVID-19 pandemic data. We show empirically that the proposed method gives good estimates of the parameters in a large-volume setting and works well in small-volume settings.
    • Statistical Methods to Detect Differentially Methylated Regions with Next-Generation Sequencing Data

      Hu, Fengjiao (2016-07-07)
      Researchers in genomics are increasingly interested in epigenetic factors such as DNA methylation because they play an important role in regulating gene expression without changes in the DNA sequence. Abnormal DNA methylation is associated with many human diseases, including various types of cancer. We propose three different approaches to test for differentially methylated regions (DMRs) associated with complex traits, while accounting for correlations within and among CpG sites in the DMRs. The first approach is a nonparametric method using a kernel distance statistic, and the second is a likelihood-based method using a binomial spatial scan statistic. Both of these approaches detect differentially methylated regions between cases and controls along the genome. The kernel distance method uses a kernel function, while the binomial scan statistic approach uses a mixed-effects model to incorporate correlations among CpG sites. Extensive simulations show that both approaches have excellent control of type I error and reasonable statistical power. The binomial scan statistic approach appears to have higher power, while the kernel distance method is computationally faster. We also propose a third method, under the Bayesian framework, for comparing methylation rates when disease status is classified into ordinal multinomial categories (e.g., stages of cancer). The DMRs are detected using moving windows along the genome. Within each window, the Bayes factor is calculated to compare the two models corresponding to constant vs. monotonic methylation rates among the groups. As in the scan statistic approach, the correlations between sites are incorporated using a mixed-effects model. Results from extensive simulation indicate that the Bayesian method is statistically valid and reasonably powerful for detecting DMRs associated with disease severity. The proposed methods are demonstrated using data from a chronic lymphocytic leukemia (CLL) study.
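
      Neither the kernel distance statistic nor the binomial spatial scan statistic is reproduced here; the sketch below only illustrates the simpler moving-window framing those methods refine, comparing case and control methylated read counts in each window with a two-proportion z-test and ignoring the CpG correlation the dissertation's models account for. Window size, read depths, and data are illustrative assumptions.

          # Naive moving-window comparison of case vs. control methylation counts,
          # ignoring the CpG correlation structure handled by the dissertation's
          # kernel distance and binomial scan statistics.  Data are illustrative.
          import numpy as np
          from scipy.stats import norm

          rng = np.random.default_rng(10)
          n_sites, depth = 400, 30
          p_control = np.full(n_sites, 0.3)
          p_case = p_control.copy()
          p_case[150:180] = 0.6                                 # one simulated DMR

          meth_case = rng.binomial(depth, p_case)
          meth_ctrl = rng.binomial(depth, p_control)

          window = 20
          for start in range(0, n_sites - window + 1, window):
              sl = slice(start, start + window)
              x1, n1 = meth_case[sl].sum(), window * depth
              x2, n2 = meth_ctrl[sl].sum(), window * depth
              p_pool = (x1 + x2) / (n1 + n2)
              z = (x1 / n1 - x2 / n2) / np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
              p_val = 2 * norm.sf(abs(z))                       # two-sided p-value
              if p_val < 1e-4:
                  print(f"window {start}-{start + window - 1}: p = {p_val:.2e}")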
    • TWO-SAMPLE TESTS FOR HIGH DIMENSIONAL MEANS WITH PREPIVOTING AND DATA TRANSFORMATION

      Hellebuyck, Rafael Adriel; Department of Biostatistics and Epidemiology (2019-01-08)
      Within the medical field, the demand to store and analyze small-sample, high-dimensional data has become ever more common. Several two-sample tests for equality of means, including the well-known Hotelling's T2 test, have long been established for the case in which the combined sample size of the two populations exceeds the dimension of the variables. However, tests such as Hotelling's T2 become either unusable or have low power when the number of variables is greater than the combined sample size. We propose a test using both prepivoting and Edgeworth expansion that maintains high power in this higher dimensional scenario, known as the "large p, small n" problem. Our test's finite sample performance is compared with that of other recently proposed tests designed to handle the "large p, small n" situation. We apply our test to a microarray gene expression data set and report competitive rates for both power and type I error.
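
      The prepivoting and Edgeworth-expansion construction is specific to the dissertation; the sketch below only illustrates the "large p, small n" failure of Hotelling's T2 (a singular pooled covariance matrix) and a simple diagonal-standardized statistic with a permutation p-value as a generic stand-in for a high-dimensional test. The data are illustrative assumptions.

          # "Large p, small n" illustration: the pooled covariance matrix is
          # singular, so Hotelling's T2 cannot be inverted; a simple diagonal-
          # standardized statistic with a permutation p-value is used instead,
          # as a generic stand-in for the dissertation's prepivoted test.
          import numpy as np

          rng = np.random.default_rng(11)
          n1, n2, p = 15, 15, 200
          X = rng.standard_normal((n1, p))
          Y = rng.standard_normal((n2, p))
          Y[:, :10] += 0.8                                      # shift in the first 10 coordinates

          pooled = ((n1 - 1) * np.cov(X, rowvar=False) +
                    (n2 - 1) * np.cov(Y, rowvar=False)) / (n1 + n2 - 2)
          print("pooled covariance rank:", np.linalg.matrix_rank(pooled), "of", p)  # singular

          def diag_stat(A, B):
              d = A.mean(axis=0) - B.mean(axis=0)
              v = A.var(axis=0, ddof=1) / len(A) + B.var(axis=0, ddof=1) / len(B)
              return np.sum(d ** 2 / v)

          obs = diag_stat(X, Y)
          Z = np.vstack([X, Y])
          perm = [diag_stat(Z[idx[:n1]], Z[idx[n1:]])
                  for idx in (rng.permutation(n1 + n2) for _ in range(999))]
          p_value = (1 + np.sum(np.array(perm) >= obs)) / (1 + len(perm))
          print(f"statistic {obs:.1f}, permutation p-value {p_value:.3f}")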