• An Iterative Procedure to Select and Estimate Wavelet-Based Functional Linear Mixed-Effects Regression Models

      Lundeen, Jordan Sarah; Biostatistics (Augusta University, 2019-12)
      Actigraphy is the continuous long-term measurement of activity-induced acceleration by means of a portable device that often resembles a watch and is typically worn on the wrist. Actigraphy is increasingly being used in clinical research to measure sleep and activity rhythms that might not otherwise be available using traditional techniques such as polysomnography. Actigraphy has been shown to be of value when assessing circadian rhythm disorders and sleep disorders and when evaluating treatment outcomes. It can provide more objective information on sleep habits in the patient's natural sleep environment than the patient's recollection of their activity or a written sleep diary. We propose a wavelet-based functional linear mixed model to investigate the impact of functional predictors on a scalar response when repeated measurements are available on multiple subjects. The advantage of the proposed model is that each subject has both individual scalar covariate effects and individual functional effects over time, while also sharing common population scalar covariate effects and common population slope functions. An iterative procedure is used to estimate and select the fixed and random effects by utilizing the partial consistency property of the random effect coefficients and selecting groups of random effects simultaneously via the smoothly clipped absolute deviation (SCAD) penalty function. In the first study of its kind, we compare multiple functional regression methods through a large number of simulation parameter combinations. The proposed model is applied to actigraphy data to investigate the effect of daily activity on Hamilton Rating Scale for Depression (HRSD), Insomnia Severity Index (ISI) and Reduced Morningness-Eveningness Questionnaire (RMEQ) scores.
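      The SCAD penalty named in the abstract above can be sketched as follows. This is the standard scalar form of the penalty, not the thesis's group-selection procedure; the function name `scad_penalty` and the default `a=3.7` (a commonly used value) are assumptions made for illustration.

```python
def scad_penalty(t, lam, a=3.7):
    """Scalar SCAD penalty for a coefficient t.

    lam > 0 is the tuning parameter and a > 2 controls where the penalty
    flattens out. Both names are illustrative, not taken from the thesis.
    """
    if lam <= 0 or a <= 2:
        raise ValueError("require lam > 0 and a > 2")
    x = abs(t)
    if x <= lam:
        # Small coefficients: same linear penalty as the lasso.
        return lam * x
    if x <= a * lam:
        # Intermediate coefficients: quadratic blend that tapers the slope.
        return (2 * a * lam * x - x * x - lam * lam) / (2 * (a - 1))
    # Large coefficients: constant penalty, so they are not shrunk further.
    return lam * lam * (a + 1) / 2


if __name__ == "__main__":
    for value in (0.5, 1.0, 2.0, 10.0):
        print(value, scad_penalty(value, lam=1.0))
```

      The three pieces join continuously at |t| = lam and |t| = a * lam, which is why large effects escape the shrinkage that a lasso penalty would keep applying.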
    • Assessing the Validity of Vitamin D Supplementation in Patients with Symptomatic Knee Osteoarthritis

      Campbell, Caroline; Harris, Matthew; Littlejohn, Rodney; Paletta, Nina; Stone, David; Williams, Ashley; Medical College of Georgia (2016-05)
      In an older adult population, does supplementation of 25-hydroxyvitamin D for patients with symptomatic knee osteoarthritis with low serum vitamin D alleviate symptoms of the disease?
    • A Bayesian Framework To Detect Differentially Methylated Loci in Both Mean And Variability with Next Generation Sequencing

      Li, Shuang; Department of Biostatistics and Epidemiology (2015-07)
      DNA methylation at CpG loci is the best known epigenetic process involved in many complex diseases including cancer. In recent years, next-generation sequencing (NGS) has been widely used to generate genome-wide DNA methylation data. Although substantial evidence indicates that difference in mean methylation proportion between normal and disease is meaningful, it has recently been proposed that it may be important to consider DNA methylation variability underlying common complex disease and cancer. We introduce a robust hierarchical Bayesian framework with a Latent Gaussian model which incorporates both mean and variance to detect differentially methylated loci for NGS data. To identify methylation loci which are associated with disease, we consider Bayesian statistical hypotheses testing for methylation mean and methylation variance using a two-dimensional highest posterior density region. To improve computational efficiency, we use Integrated Nested Laplace Approximation (INLA), which combines Laplace approximations and numerical integration in a very efficient manner for deriving marginal posterior distributions. We performed simulations to compare our proposed method to other alternative methods. The simulation results illustrate that our proposed approach is more powerful in that it detects fewer false positives and has a true positive rate comparable to the other methods.
    • Bayesian Functional Clustering and VMR Identification in Methylation Microarray Data

      Campbell, Jeff; Department of Biostatistics and Epidemiology (2015-07)
      The study of the relation between DNA and health and disease has had a lot of time, energy, and money invested in it over the years. As more scientific knowledge has accumulated, it has become clear that the relation between DNA and health isn’t just a function of the sequence of nucleotide bases, but also depends on permanent modifications of DNA that affect DNA transcription and thus have a macroscopic effect on an individual. The study of modifications to DNA is known as epigenetics. Epigenetic changes have been shown to play a role in certain diseases, including cancer (Novak 2004). Finding locations of differential methylation in two groups of cells is an ongoing area of research in both science and bioinformatics. The number of developed statistical methods for establishing differential DNA methylation between two groups is limited (Bock 2012). Many existing methods were developed for next-generation sequencing data and may not work for microarray data, and vice versa. Bisulfite sequencing, the next-generation sequencing technique for obtaining methylation data, often comes with limited sample size, and considerations must be made for low and variable coverage and for smoothing the methylation values. The analysis of next-generation sequencing data also involves small sample sizes. In addition, these methods can be sensitive to how individual CpG sites are grouped together as a region for analysis. If the DMRs are small relative to the sizes of established regions, then the method may not detect a region as having differential methylation. Robust methods for clustering microarray data have also been an ongoing area of research. It is desirable to have a method that can be applied to microarray data, which could increase the sample size and mitigate the previous problems if the method used is robust to missing values, outliers, and microarray data noise. Functional clustering has been shown to be effective when properly conducted on gene expression data. It can be used when the data have temporal measurements to identify genes that are possibly co-expressed. The clustering of methylation data can also be shown to identify epigenetic subgroups that can potentially be very useful (Wang, 2011). [introduction]
    • Classification Methods for Circular-Linear Data Using Periodic Functions

      Chen, Chen Chun (2016-07-08)
      In many fields such as medicine, agriculture and environmental studies, data are collected over time and can show a repeated pattern within a certain time period. Data with a linear response or measure, such as blood pressure or solar energy, and a circular predictor are called circular-linear data. Data with repeated measures over time are usually analyzed using longitudinal analysis methods. However, applying classical longitudinal data analysis to circular-linear data is generally inappropriate, since the circular pattern of time would be treated as a simple continuous variable. Parametric approaches for circular-linear data have been developed using various modeling methods. We propose a Bayesian non-parametric MCMC circular smoothing splines approach, which is not only appropriate but also adds more flexibility for modeling and classification of circular-linear data. We first fit the circular-linear data on an estimated circle to elicit the functional pattern from the data, and then classify the patterns. In the development of the classification procedure, we use functional data analysis and some widely used dimension reduction and classification methods such as principal component analysis and the support vector machine. We evaluate the performance of the proposed modelling and classification methods through extensive simulation, and demonstrate them using the 2005-2006 NHANES physical activity monitor data on insomnia patients. In the simulation study, the non-parametric Bayesian smoothing splines method coupled with the support vector machine approach yields the best classification performance in terms of concordance rate. Our proposed non-parametric approach performed slightly better than the established parametric methods. Also, the initial data fitting procedures using a periodic regression function to reduce the noise in the data are shown to improve the performance in the classification problem. The result in the analysis of the NHANES data is consistent with simulation
    • Classifying Rheumatoid Arthritis Risk with Genetic Subgroups Using Genome-Wide Association

      Letter, Abraham J.; Department of Biostatistics and Epidemiology (2010-04)
      Structured genome-wide association methods can be used to find population substructure, determine significant SNPs, and subsequently narrow down the field of SNPs to those most significant for determining disease risk. Beginning with more than 500,000 SNPs and rheumatoid arthritis (RA) phenotype data for cases and controls, we used a three-part clustering approach that found 684 SNPs significant for determining RA after accounting for clusters, and of those, 168 SNPs with differing odds across clusters. These 168 SNPs were used to create 16 population subgroups, each revealing a unique pattern of minor allele frequencies. The subgroups showed some commonality in multi-dimensional scaling plots, however, and were combined into five RA risk categories, each with odds differing from the other categories with p-values less than 0.0001. Thus, based on SNP information from 168 SNPs, it may be possible to assign an individual to one of five distinct RA risk categories.
    • COGA phenotypes and linkages on chromosome 2.

      Wiener, Howard W; Go, Rodney C P; Tiwari, Hemant; George, Varghese; Page, Grier P; Department of Biostatistics and Epidemiology (2010-01-19)
      An initial linkage analysis of the alcoholism phenotype as defined by DSM-III-R criteria and alcoholism defined by DSM-IV criteria showed many, sometimes striking, inconsistencies. These inconsistencies are greatly reduced by making the definition of alcoholism more specific. We defined new phenotypes combining the alcoholism definitions and the latent variables, defining an individual as affected if that individual is alcoholic under one of the definitions (either DSM-III-R or DSM-IV) and was indicated as having a symptom defined by one of the latent variables. This was done for each of the two alcoholism definitions and five latent variables, selected from a canonical discriminant analysis indicating they formed significant groupings using the electrophysiological variables. We found that linkage analyses utilizing these latent variables were much more robust and consistent than the linkage results based on DSM-III-R or DSM-IV criteria for definition of alcoholism. We also performed linkage analyses on two phenotypes derived from first principal components, one derived from the electrophysiological variables and the other derived from the latent variables. A region on chromosome 2 at 250 cM was found to be linked to both of these derived phenotypes. Further examination of the SNPs in this region identified several haplotypes strongly associated with these derived phenotypes.
    • Comparisons of mutation rate variation at genome-wide microsatellites: evolutionary insights from two cultivated rice and their wild relatives.

      Gao, Li-Zhi; Xu, Hongyan; Department of Biostatistics and Epidemiology (2008-02-13)
      BACKGROUND: Mutation rate (mu) per generation per locus is an important parameter in the models of population genetics. Studies on mutation rate and its variation are important for elucidating the extent and distribution of genetic variation, inferring evolutionary relationships among closely related species, and understanding genetic variation of genomes. However, patterns of rate variation of microsatellite loci are still poorly understood in plant species. Furthermore, how their mutation rates vary in di-, tri-, and tetra-nucleotide repeats within the species is largely uninvestigated across related plant genomes. RESULTS: Genome-wide variation of mutation rates was first investigated by means of the composite population parameter theta (theta = 4Nmu, where N is the effective population size and mu is the mutation rate per locus per generation) in four subspecies of Asian cultivated rice O. sativa and its three related species, O. rufipogon, O. glaberrima, and O. officinalis. On the basis of three data sets of microsatellite allele frequencies throughout the genome, population mutation rate (theta) was estimated for each locus. Our results reveal that the variation of population mutation rates at microsatellites within each studied species or subspecies of cultivated rice can be approximated with a gamma distribution. The mean population mutation rates of microsatellites do not significantly differ in motifs of di-, tri-, and tetra-nucleotide repeats for the studied rice species. The shape parameter was also estimated for each subspecies of rice as well as other related rice species. Of them, different subspecies of O. sativa possess similar shape parameters (alpha) of the gamma distribution, while other species extensively vary in their population mutation rates. CONCLUSION: Through the analysis of genome-wide microsatellite data, the population mutation rate can be approximately fitted with a gamma distribution in most of the studied species. In general, different population histories that occurred along different lineages may result in the observed variation of population mutation rates at microsatellites among the studied Oryza species.
    • Correlation Coefficient Inference for Left-Censored Biomarker Data with Known Detection Limits

      McCracken, Courtney Elizabeth; Department of Biostatistics and Epidemiology (2013-05)
      Researchers are often interested in the relationship between biological concentrations obtained using two different assays, both of which may be biomarkers. Despite the continuing advances in biotechnology, the value of a particular biomarker may fall below some known limit of detection (LOD). Data values such as these are referred to as non-detects (NDs) and can be treated as left-censored observations. When attempting to measure the association between two concentrations, both of which are subject to NDs, serious complications can arise in the data analysis. Simple substitution, random imputation, and maximum likelihood estimation methods are just a few of the methods that have been proposed for handling NDs when estimating the correlation between two variables, both of which are subject to left-censoring. Unfortunately, many of the popular methods require that the data follow a bivariate normal distribution or that only a small percentage of the data for each variable are below the LOD. These assumptions are often violated with biomarker data. In this paper, we evaluate the performance of several methods, including Spearman’s rho, when the data do not follow a bivariate normal distribution and when there are moderate to large censoring proportions in one or both of the variables. We evaluate the performance of seven methods for estimating the correlation, ρ, between two left-censored variables using bias, median absolute deviation, 95% confidence interval width, and coverage probability under assumptions of various sample sizes, correlations, and censoring proportions. We show that using substitution and imputation methods yields biased estimates of ρ and less than nominal coverage probability under most of the simulation parameters we examined. We recommend the maximum likelihood method for general use even when the data significantly depart from bivariate normality.
    • Epigenetic Silencing of Nucleolar rRNA Genes in Alzheimer's Disease

      Pietrzak, Maciej; Rempala, Grzegorz A.; Nelson, Peter T.; Zheng, Jing-Juan; Hetman, Michal; Department of Biostatistics and Epidemiology (2011-07-22)
      Background: Ribosomal deficits are documented in mild cognitive impairment (MCI), which often represents an early stage of Alzheimer's disease (AD), as well as in advanced AD. The nucleolar rRNA genes (rDNA), transcription of which is critical for ribosomal biogenesis, are regulated by epigenetic silencing including promoter CpG methylation.
    • False coverage rate - adjusted smoothed bootstrap simultaneous confidence intervals for selected parameters

      Sun, Jing; Department of Biostatistics and Epidemiology (Augusta University, 2020-05)
      Many modern applications refer to a large number of populations with high dimensional parameters. Since there are so many parameters, researchers often draw inferences regarding the most significant parameters, which are called selected parameters. Benjamini and Yekutieli (2005) proposed the false coverage-statement rate (FCR) method for multiplicity correction when constructing confidence intervals for only selected parameters. FCR for the confidence interval method is parallel to the concept of the false discovery rate for multiple hypothesis testing. In practice, we typically construct FCR-adjusted approximate confidence intervals for selected parameters using either the bootstrap method or the normal approximation method. However, these approximate confidence intervals show higher FCR for small and moderate sample sizes. Therefore, we suggest a novel procedure to construct simultaneous confidence intervals for the selected parameters by using a smoothed bootstrap procedure. We consider a smoothed bootstrap procedure using a kernel density estimator. A pertinent problem associated with the smoothed bootstrap approach is how to choose the unknown bandwidth in some optimal sense. We derive an optimal choice for the bandwidth, and the resulting smoothed bootstrap confidence intervals asymptotically give better control of the FCR than their competitors. We further show that the suggested smoothed bootstrap simultaneous confidence intervals are FCR-consistent if the dimension of data grows no faster than N^(3/2). Finite sample performances of our method are illustrated based on empirical studies. Through these empirical studies, it is shown that the proposed method can be successfully applied in practice.
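      As a rough illustration of the FCR adjustment attributed above to Benjamini and Yekutieli (2005), the sketch below builds normal-approximation intervals for R selected parameters out of m at marginal level 1 - R*q/m. It is the baseline normal-approximation method the abstract mentions, not the smoothed bootstrap procedure of the thesis; the function name, argument names, and the assumption of known standard errors are all hypothetical.

```python
from statistics import NormalDist


def fcr_adjusted_intervals(estimates, std_errors, selected, q=0.05):
    """Normal-approximation FCR-adjusted intervals for selected parameters.

    estimates, std_errors: sequences of length m (one entry per parameter).
    selected: indices of the R parameters chosen by some selection rule.
    q: target false coverage-statement rate.
    """
    m = len(estimates)
    r = len(selected)
    if r == 0:
        return {}
    # Each selected parameter gets a marginal interval at level 1 - r*q/m,
    # which is wider than the unadjusted 1 - q interval whenever r < m.
    alpha = r * q / m
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return {
        i: (estimates[i] - z * std_errors[i], estimates[i] + z * std_errors[i])
        for i in selected
    }


if __name__ == "__main__":
    est = [0.1, 2.9, -0.3, 3.4, 0.2, 0.0, -0.1, 0.4, 2.7, 0.3]
    se = [1.0] * len(est)
    chosen = [i for i, e in enumerate(est) if abs(e) > 2.0]
    print(fcr_adjusted_intervals(est, se, chosen))
```

      When every parameter is selected (R = m), the adjustment vanishes and the intervals reduce to ordinary 1 - q intervals.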
    • Family-based genome-wide association study for simulated data of Framingham Heart Study.

      Xu, Hongyan; Mathew, George; George, Varghese; Department of Biostatistics and Epidemiology (2009-12-18)
      Genome-wide association studies (GWAS) have quickly become the norm in dissecting the genetic basis of complex diseases. Family-based association approaches have the advantage of being robust to possible hidden population structure in samples. Most of these methods were developed with limited markers. Their applicability and performance for GWAS need to be examined. In this report, we evaluated the properties of the family-based association method implemented by ASSOC in the S.A.G.E. package using the simulated data sets for the Framingham Heart Study, and found that ASSOC is a highly useful tool for GWAS.
    • A gene-based approach for testing association of rare alleles

      Xu, Hongyan; George, Varghese; Department of Biostatistics and Epidemiology (2011-11-29)
      Rare genetic variants have been shown to be important to the susceptibility of common human diseases. Methods for detecting association of rare genetic variants are drawing much attention. In this report, we applied a gene-based approach to the 200 simulated data sets of unrelated individuals. The test can detect the association of some genes with multiple rare variants.
    • Maternal Health Literacy Progression Among Rural Perinatal Women

      Mobley, Sandra C.; Thomas, Suzanne Dixson; Sutherland, Donald E.; Hudgins, Jodi; Ange, Brittany L.; Johnson, Maribeth H.; Department of Obstetrics and Gynecology (Springer, 2014-01-28)
      This research examined changes in maternal health literacy progression among 106 low income, high risk, rural perinatal African American and White women who received home visits by Registered Nurse Case Managers through the Enterprise Community Healthy Start Program. Maternal health literacy progression would enable women to better address intermediate factors in their lives that impacted birth outcomes, and ultimately infant mortality (Lu and Halfon in Mater Child Health J 7(1):13-30, 2003; Sharma et al. in J Natl Med Assoc 86(11):857-860, 1994). The Life Skills Progression Instrument (LSP) (Wollesen and Peifer, in Life skills progression. An outcome and intervention planning instrument for use with families at risk. Paul H. Brookes Publishing Co., Baltimore, 2006) measured changes in behaviors that represented intermediate factors in birth outcomes. Maternal Health Care Literacy (LSP/M-HCL) was a woman's use of information, critical thinking and health care services; Maternal Self Care Literacy (LSP/M-SCL) was a woman's management of personal and child health at home (Smith and Moore in Health literacy and depression in the context of home visitation. Mater Child Health J, 2011). Adequacy was set at a score of (≥4). Among 106 women in the study initial scores were inadequate (<4) on LSP/M-HCL (83 %), and on LSP/M-SCL (30 %). Significant positive changes were noted in maternal health literacy progression from the initial prenatal assessment to the first (p < .01) postpartum assessment and to the final (p < .01) postpartum assessment using McNemar's test of gain scores. Numeric comparison of first and last gain scores indicated women's scores progressed (LSP/M-HCL; p < .0001) and (LSP/M-SCL; p < .0001). Elevated depression scores were most frequent among women with <4 LSP/M-HCL and/or <4 LSP/M-SCL. Visit notes indicated lack or loss of relationship with the father of the baby and intimate partner discord contributed to higher depression scores.
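      For readers unfamiliar with McNemar's test, which the abstract above uses for paired assessments, a minimal sketch follows. It uses the standard chi-square form without continuity correction and assumes the paired outcome is binary (for example, adequate versus inadequate score at two time points); the function name and the example counts are hypothetical and are not taken from the study.

```python
import math


def mcnemar_test(b, c):
    """McNemar's chi-square test (1 df, no continuity correction).

    b: pairs that changed from 'inadequate' to 'adequate'.
    c: pairs that changed from 'adequate' to 'inadequate'.
    Concordant pairs do not enter the statistic.
    """
    if b + c == 0:
        raise ValueError("no discordant pairs; the test is undefined")
    statistic = (b - c) ** 2 / (b + c)
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


if __name__ == "__main__":
    stat, p = mcnemar_test(b=15, c=5)
    print(f"statistic={stat:.3f}, p={p:.4f}")
```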
    • Mathematical and Stochastic Modeling of HIV Immunology and Epidemiology

      Lee, Tae Jin; Department of Biostatistics and Epidemiology (8/3/2017)
      In HIV viral dynamics, controlling viral load and maintaining CD4 values at a higher level are always primary goals for providers. In recent years, a new molecule, eCD4-Ig, was discovered; it mimics CD4 when introduced into the human body and has the potential to change existing HIV viral dynamics. Thus, to understand the dynamics of viral load, eCD4-Ig, and CD4 cells, we developed mathematical models that incorporate interactions between this new molecule and other known immunological and virological information. We further investigated model-based speculations for management and obtained the level of eCD4-Ig required for elimination of the virus. Next, we built an epidemiological model for HIV spread and control among discordant couples through the dynamics of PrEP (pre-exposure prophylaxis). For this, a stochastic model based on actuarial assumptions is used to obtain the mean remaining time for a couple to stay discordant. We generalized the single hook-up/marriage stochastic model to a multiple hook-up/marriage model.
    • A modified bump hunting approach with correlation-adjusted kernel weight for detecting differentially methylated regions on the 450K array

      Daniel, Jeannie T; Department of Biostatistics and Epidemiology (8/3/2017)
      DNA methylation plays an important role in the regulation of gene expression, as hypermethylation is associated with gene silencing. The general purpose of this dissertation is the development of a statistical method, called DMR Detector, for detecting differentially methylated regions (DMRs) on the 450K array. DMR Detector makes three key modifications to an existing method called Bumphunter. The first is what statistic to collect from the initial fitting for further analysis. The second is to perform kernel smoothing under the assumption of correlated errors using a newly proposed correlation-adjusted kernel weight. The third is how to define regions of interest. In simulation, the method was shown to have high power comparable to Bumphunter, with consistently lower family-wise type I error rate, controlled well below the 0.1 FDR. DMR Detector was applied to real data and was able to detect one DMR that was not detected by Bumphunter.
    • A Modified Information Criterion in the 1d Fused Lasso for DNA Copy Number Variant Detection using Next Generation Sequencing Data

      Lee, Jaeeun; Department of Biostatistics and Epidemiology (8/3/2017)
      DNA Copy Number Variations (CNVs) are associated with many human diseases. Recently, CNV studies have been carried out using Next Generation Sequencing (NGS) technology that produces millions of short reads. With NGS reads ratio data, we use the 1d fused lasso regression for CNV detection. Given the number of copy number changes, the corresponding genomic locations are estimated by fitting the 1d fused lasso. Estimation of the number of copy number changes depends on a tuning parameter in the 1d fused lasso. In this dissertation, we propose a new modified Bayesian information criterion, called JMIC, to estimate the optimal tuning parameter in the 1d fused lasso. In theoretical studies, we prove that the number of change points estimated by JMIC converges to the true number of changes. Also, our simulation studies show that JMIC outperforms the other criteria considered. Finally, we apply our proposed method to the reads ratio data from the breast tumor cell HCC1954 and its matched cell line provided by Chiang et al. (2009).
    • Multivariate Poisson Abundance Models for Analyzing Antigen Receptor Data

      Greene, Joshua C.; Department of Biostatistics and Epidemiology (2013-05)
      Antigen receptor data is an important source of information for immunologists that is highly statistically challenging to analyze due to the presence of a huge number of T-cell receptors in mammalian immune systems and the severe undersampling bias associated with the commonly used data collection procedures. Many important immunological questions can be stated in terms of richness and diversity of T-cell subsets under various experimental conditions. This dissertation presents a class of parametric models and uses a special case of them to compare the richness and diversity of antigen receptor populations in mammalian T-cells. The parametric models are based on a representation of the observed receptor counts as a multivariate Poisson abundance model (mPAM). A Bayesian model fitting procedure is developed which allows fitting of the mPAM parameters with the help of the complete likelihood as opposed to its conditional version which was used previously. The new procedure is shown to be often considerably more efficient (as measured by the amount of Fisher information) in the regions of the mPAM parameter space relevant to modeling T-cell data. A richness estimator based on the special case of the mPAM is shown to be superior to several existing richness estimators from the statistical ecology literature under the severe undersampling conditions encountered in antigen receptor data collection. The comparative diversity analyses based on the mPAM special case yield biologically meaningful results when applied to the T-cell receptor repertoires in mice. It is also shown that the amount of time to implement the Bayesian model fitting procedure for the mPAM special case scales well as the dimension increases and that the amount of computational resources required to conduct complete statistical analyses for the mPAM special case can be drastically lower for our Bayesian model fitting procedure than for code based on the conditional likelihood approach.
    • A new measure of population structure using multiple single nucleotide polymorphisms and its relationship with FST.

      Xu, Hongyan; Sarkar, Bayazid; George, Varghese; Department of Biostatistics and Epidemiology (2009-03-16)
      BACKGROUND: Large-scale genome-wide association studies are promising for unraveling the genetic basis of complex diseases. Population structure is a potential problem, the effects of which on genetic association studies are controversial. The first step to systematically quantify the effects of population structure is to choose an appropriate measure of population structure for human data. The commonly used measure is Wright's FST. For a set of subpopulations, a single value of FST is generally assumed. However, the estimates could be different for distinct loci. Since population structure is a concept at the population level, a measure of population structure that utilizes the information across loci would be desirable. FINDINGS: In this study we propose a C parameter adjusted for the sample size from each subpopulation. The new measure C is based on the c parameter proposed for SNP data, which was assumed to be subpopulation-specific and common for all loci. In this study, we performed extensive simulations of samples with varying levels of population structure to investigate the properties and relationships of both measures. It is found that the two measures generally agree well. CONCLUSION: The new measure simultaneously uses the marker information across the genome. It has the advantage of easy interpretation as one measure of population structure and yet can also assess population differentiation.
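      To make the remark about locus-specific FST estimates concrete, the sketch below computes FST for a single biallelic locus from subpopulation allele frequencies, using the textbook heterozygosity form FST = (HT - HS) / HT with equally weighted subpopulations. It does not implement the C parameter proposed in the abstract, whose definition is not given here; the function name and the example frequencies are hypothetical.

```python
def fst_biallelic(freqs):
    """FST at one biallelic locus from subpopulation allele frequencies.

    freqs: frequency of the reference allele in each subpopulation.
    Subpopulations are weighted equally (an illustrative simplification).
    """
    if not freqs or any(p < 0 or p > 1 for p in freqs):
        raise ValueError("frequencies must lie in [0, 1]")
    k = len(freqs)
    p_bar = sum(freqs) / k
    # Expected heterozygosity in the pooled population.
    h_total = 2 * p_bar * (1 - p_bar)
    if h_total == 0:
        raise ValueError("locus is monomorphic; FST is undefined")
    # Mean expected heterozygosity within subpopulations.
    h_within = sum(2 * p * (1 - p) for p in freqs) / k
    return (h_total - h_within) / h_total


if __name__ == "__main__":
    # Two loci in the same pair of subpopulations can give different values.
    print(fst_biallelic([0.2, 0.8]))
    print(fst_biallelic([0.45, 0.55]))
```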
    • A New Method For Analyzing 1:N Matched Case Control Studies With Incomplete Data

      Jin, Chan; Department of Biostatistics and Epidemiology (5/8/2017)
      1:n matched case-control studies are commonly used to evaluate the association between the exposure to a risk factor and a disease, where one case is matched to up to n controls. The odds ratio is typically used to quantify such association. Difficulties in estimating the true odds ratio arise when the exposure status is unknown for at least one individual in a group. In the case where the exposure status is known for all individuals in a group, the true odds ratio is estimated as the ratio of the counts in the discordant cells of the observed two-by-two table. In the case where all data are independent, the odds ratio is estimated using the cross-product ratio from the observed table. Conditional logistic regression estimates are used for incomplete matching data. In this dissertation we suggest a simple method for estimating the odds ratio when the sample consists of a combination of paired and unpaired observations, with 1:n matching. This method uses a weighted average of the odds ratio calculations described above. This dissertation compares the new method to existing methods via simulation.
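      The two building-block estimators described above can be sketched as follows for the simplest 1:1 matched case. The weighted average proposed in the dissertation is not reproduced because its weights are not stated in the abstract; the function names and the example counts are hypothetical.

```python
def matched_pair_odds_ratio(pairs):
    """Odds ratio from 1:1 matched pairs via the discordant cells.

    pairs: iterable of (case_exposed, control_exposed) booleans.
    """
    pairs = list(pairs)
    # Case exposed, matched control unexposed.
    b = sum(1 for case, ctrl in pairs if case and not ctrl)
    # Case unexposed, matched control exposed.
    c = sum(1 for case, ctrl in pairs if not case and ctrl)
    if c == 0:
        raise ValueError("no pairs with only the control exposed")
    return b / c


def cross_product_odds_ratio(exp_cases, exp_controls, unexp_cases, unexp_controls):
    """Odds ratio from an unmatched two-by-two table (cross-product ratio)."""
    denominator = exp_controls * unexp_cases
    if denominator == 0:
        raise ValueError("zero cell in the denominator")
    return (exp_cases * unexp_controls) / denominator


if __name__ == "__main__":
    example_pairs = (
        [(True, False)] * 6 + [(False, True)] * 3
        + [(True, True)] * 4 + [(False, False)] * 5
    )
    print(matched_pair_odds_ratio(example_pairs))
    print(cross_product_odds_ratio(20, 10, 30, 40))
```

      Concordant pairs carry no information in the matched estimator, which is why only the two discordant counts appear in the ratio.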