
INVESTIGATIONS ON GENOMIC META-ANALYSIS: IMPUTATION FOR INCOMPLETE DATA AND PROPERTIES OF ADAPTIVELY WEIGHTED FISHER’S METHOD

by

Shaowu Tang

MS in Biostatistics, University of Pittsburgh, 2010
PhD in Mathematics, Jacobs University Bremen, Germany, 2005
BS in Applied Mathematics, Wuhan University, China, 1993

Submitted to the Graduate Faculty of the Department of Biostatistics, Graduate School of Public Health in partial fulfillment of the requirements for the degree of Doctor of Philosophy

University of Pittsburgh
2014

UNIVERSITY OF PITTSBURGH
GRADUATE SCHOOL OF PUBLIC HEALTH

This dissertation was presented by Shaowu Tang.

It was defended on April 10, 2014 and approved by

George C. Tseng, ScD, Associate Professor, Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh

Daniel E. Weeks, PhD, Professor, Department of Human Genetics, Graduate School of Public Health, University of Pittsburgh

Jong-Hyeon Jeong, PhD, Professor, Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh

Eleanor Feingold, PhD, Professor, Department of Human Genetics, Graduate School of Public Health, University of Pittsburgh

Abdus S. Wahed, PhD, Associate Professor, Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh

Dissertation Director: George C. Tseng, ScD, Associate Professor, Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh

INVESTIGATIONS ON GENOMIC META-ANALYSIS: IMPUTATION FOR INCOMPLETE DATA AND PROPERTIES OF ADAPTIVELY WEIGHTED FISHER’S METHOD

Shaowu Tang, PhD

University of Pittsburgh, 2014

Abstract:

Microarray analysis to simultaneously monitor expression activities in thousands of genes has become a routine experiment in biomedical research during the past decade. The microarray expression data generated by high-throughput experiments may consist of thousands of variables and therefore pose great challenges to researchers in a wide variety of statistical and computational issues. A problem commonly encountered by researchers is the detection of genes differentially expressed between two or more conditions, and it is the major concern of this thesis.

In the first part of the thesis, we consider imputation of incomplete data in transcriptomic meta-analysis. In the past decade, a tremendous number of expression profiles have been generated and stored in the public domain, and information integration by meta-analysis to detect differentially expressed (DE) genes has become popular as a way to obtain increased statistical power and validated findings. Methods that combine p-values have been widely used in such a genomic setting, among which Fisher’s, Stouffer’s, minP and maxP methods are the most popular. In practice, raw data or p-values of DE evidence for the entire genome are often not available in a subset of the genomic studies that are to be combined. Instead, only the detected DE gene lists under a certain p-value threshold (e.g., DE genes with p-value < 0.001) are reported in journal publications. The truncated p-value information invalidates the aforementioned meta-analysis methods, and researchers are forced to apply the less efficient vote counting method or naïvely drop the studies with incomplete information. In the thesis, effective imputation methods are derived for such situations with partially censored p-values. We developed and compared three imputation methods (mean imputation, single random imputation and multiple imputation) for a general class of evidence aggregation methods of which the Fisher, Stouffer and logit methods are special examples. The null distribution of each method was analytically derived, and subsequent inference and genomic analysis frameworks were established. Simulations were performed to investigate the type I error and power for the univariate case and the control of false discovery rate (FDR) for (correlated) gene expression data. The proposed methods were also applied to several genomic applications in prostate cancer, major depressive disorder (MDD), colorectal cancer and pain research.

In the second part, we investigate statistical properties of the adaptively weighted (AW) Fisher’s method. The traditional Fisher’s method assigns equal weights to each study, which is simple in nature but cannot always achieve high power across a variety of alternative hypothesis settings. Intuitively, more weight should be assigned to the studies with higher power to detect the difference between conditions. The AW-Fisher’s method, in which the best binary 0/1 weights are determined by minimizing the p-value of the weighted test statistic, was proposed in Li and Tseng (2011). By using an order statistics technique, the search space for the adaptive weights is reduced from exponential to linear complexity, which reduces the computational cost dramatically. A closed form is derived to compute the p-values for combining two studies, and an importance sampling algorithm is proposed to evaluate the p-values for combining more than two studies. Theoretical properties of the AW-Fisher’s method, such as consistency and asymptotic Bahadur optimality (ABO), are also investigated. Simulations are performed to verify the asymptotic Bahadur optimality of the AW-Fisher’s method and to compare the performance of the AW-Fisher and Fisher’s methods.

Meta-analysis of multiple genomic studies increases the statistical power of biomarker detection, and therefore the work in this thesis could improve public health by providing more effective methodologies for biomarker detection in the integration of multiple genomic studies when the information is incomplete or when different hypothesis settings are tested.

TABLE OF CONTENTS

PREFACE
1.0 INTRODUCTION
1.1 Microarray data analysis
1.2 Meta-analysis and microarray meta-analysis
1.2.1 Combining effect sizes
1.2.1.1 Fixed effects model
1.2.1.2 Random effects model
1.2.2 Combining p-values
1.2.2.1 Evidence aggregation methods
1.2.2.2 Order-statistic based methods
1.2.3 Microarray meta-analysis
1.3 Complementary hypothesis settings
1.4 Scope of the thesis
2.0 IMPUTATION OF TRUNCATED P-VALUES FOR EVIDENCE AGGREGATION META-ANALYSIS METHODS AND ITS GENOMIC APPLICATION
2.1 Introduction and motivation
2.2 Methods and inferences
2.2.1 Evidence aggregation meta-analysis methods
2.2.2 Mean imputation method
2.2.3 Single random imputation method
2.2.4 Multiple imputation method
2.2.5 Some parameters in theorem 2.2.4 for the Stouffer’s and Fisher’s methods
2.2.5.1 Stouffer’s method
2.2.5.2 Fisher’s method
2.3 Simulation results
2.3.1 Control of type I error and power analysis for univariate meta-analysis
2.3.2 Simulated expression profiles
2.3.3 Simulation from complete real datasets
2.4 Applications
2.4.1 Application to colorectal cancer
2.4.2 Application to pain research
2.4.3 Application to a three-way association method (liquid association)
2.5 Discussion and conclusion
3.0 ON ADAPTIVE WEIGHTING FOR P-VALUE COMBINATION META-ANALYSIS
3.1 Introduction of meta-analysis
3.1.1 Genomic meta-analysis
3.1.2 Adaptively weighted Fisher’s method
3.1.3 Open questions of AW-Fisher’s method in Li and Tseng (2011)
3.2 Solutions to two computing problems
3.2.1 Fast searching of the adaptive weights
3.2.2 Computation of P(T AW log(t))
3.3 Asymptotical properties of the AW-Fisher’s method
3.3.1 Assumptions and notations
3.3.2 Consistency of the estimated weights Ŵ
3.3.3 The asymptotic Bahadur optimality (ABO) of AW-Fisher’s method
3.4 Simulations
3.4.1 ABO of AW-Fisher’s method
3.4.2 Comparison of AW-Fisher and Fisher’s method
3.4.3 Accuracy of importance sampling algorithm
3.5 Discussion and conclusion
4.0 CONCLUSION AND FUTURE WORKS
4.1 Conclusion
4.2 Future works
BIBLIOGRAPHY

LIST OF TABLES

1 Simulation results for correlated data matrix at nominal FDR 5%
2 Type I error control for independent data matrix at nominal significance level 5%
3 Detailed data sets description
4 Seven colorectal cancer versus normal tissue expression profiling studies included in analysis
5 Summary of results for colorectal cancer
6 Eleven pain-relevant microarray studies included in the analysis
7 Summary of results for patterns of pain
8 Summary of pathway analysis by DAVID
9 Toy example of finding the adaptive weights
10 Comparison of complexities $2^K - 1$ vs. $K$. Total cost: sorting (at most $O(K^2)$) and linear searching ($O(K)$)
11 Powers of AW-Fisher and Fisher’s method at different significance levels α

LIST OF FIGURES

1 Type I error analysis. C: complete cases; A: available-case; Me: mean-imputation; S: single-imputation; Mu: multiple imputation when α = 5%, 10% and 15%.
2 Power analysis. C: complete cases; A: available-case; Me: mean-imputation; S: single-imputation; Mu: multiple imputation when α = 5%, 10% and 15%.
3 Type I error analysis at α = 0.05 for different numbers of imputation D.
4 Number of DE genes at significance level 0.05 by multiple imputation method with different numbers of imputation D. The dashed lines represent the theoretical asymptotic power obtained by setting D = 1000.
5 Number of DE genes detected by Fisher’s or Stouffer’s method. C: complete data; A: available-case; Me: mean-imputation; S: single-imputation; Mu: multiple imputation
6 −log(p) comparison of the mean imputation method using truncated data with the complete case method using complete data. Vertical line: x = 71.3. Horizontal line: y = 72.58. Points to the right of the vertical line are the top 1,000 triplets detected by Fisher’s complete case method, and points above the horizontal line are the top 1,000 triplets detected by Fisher’s mean imputation method
7 Heatmaps of gene expressions for DE genes identified by Fisher’s and AW-Fisher’s methods in the mouse energy metabolism datasets.
8 p-values of AW-Fisher’s method in log scale for K = 3 and K = 20
9 Comparison of the approximated exact slopes for AW-Fisher and Fisher’s method for K = 2 and K = 3. Only the first study has non-zero effect size 0.3.
10 Comparison of the p-values of AW-Fisher and Fisher’s method
11 Comparison of the p-values in log-scale. Case 1: $P_1, P_2 \sim \mathrm{Uniform}(0, 1)$; Case 2: $P_1 \sim \mathrm{Uniform}(0, 1)$, $P_2 \sim \mathrm{Beta}(1, 10^{20})$; Case 3: $P_1 \sim \mathrm{Beta}(1, 10^{15})$, $P_2 \sim \mathrm{Beta}(1, 10^{20})$.

PREFACE

I would like to express my deepest gratitude to my advisor, Dr. George C. Tseng, for introducing me to the exciting field of genomic meta-analysis, and for his positive guidance throughout my research. His stimulating comments and arguments have been a constant source of inspiration.

I am also very thankful for Dr. Yongseok Park’s valuable support. Without his help, the second problem could not have been solved so perfectly.

There are some special persons whom I can never thank enough for the valuable expertise they shared with me, and also for their warmth and friendship: Dr. Jong-Hyeon Jeong frequently gave me the benefit of his advice and pertinent comments throughout my research in survival analysis. Dr. Eleanor Feingold, Dr. Abdus S. Wahed and Dr. Daniel E. Weeks have given me good comments on my thesis writing and career advice.

I would like to thank all my group members Ying Ding, Xingbin Wang, Lun-Ching Chang, Rui Chen, Serena Liao, Masaki Lin, Silvia Liu, SungHwan Kim and Tianzhou Ma for providing me kind help in various projects.

Finally, I would like to thank my wife, Hong Qu. Without her support, I would never have succeeded.

1.0 INTRODUCTION

The rapid development of high-throughput experimental technology in the past decade has made the generation of genomic data increasingly affordable. This has resulted in the rapid accumulation of experimental data in the public domain. The Gene Expression Omnibus (GEO; http://www.ncbi.nlm.nih.gov/geo/) is one example and is the largest public database of gene expression data.

Among the vast amounts of gene expression data stored in the public domain, it is common that many datasets were generated to test the same or similar hypotheses for the same disease. Since a single study in general contains only a limited number of samples, the statistical power of the test is relatively low and the generalizability of the conclusions is often unreliable. In order to improve the statistical power of the tests and provide validated conclusions, researchers commonly combine information across different, independent studies. This is done using a class of meta-analysis methods that are particularly useful in microarray data analysis.

In this thesis, I focus on applying meta-analysis to microarray data. Chapter 1 is organized as follows. In Section 1.1, I briefly review microarray data analysis. In Section 1.2, univariate meta-analysis methods and microarray meta-analysis are introduced. In Section 1.3, several complementary hypothesis settings are introduced. In Section 1.4, the questions that will be answered in this thesis are posed and the structure of the dissertation is outlined.

1.1 MICROARRAY DATA ANALYSIS

In the past decades, microarray technology has become one of the most important and powerful tools that researchers use to monitor genome-wide expression levels of genes. In general, a microarray may contain thousands of genes for a limited number of samples. Commonly used statistical methods for microarray data analysis include class comparison, multiple testing, class discovery, class prediction and pathway analysis. Among them, the most popular application is to compare the expression of a set of genes under different conditions (for instance, cases versus controls).

Unlike traditional epidemiological problems, microarrays monitor gene expression for thousands of genes simultaneously. The standard data structure of a set of microarray data is a series of rectangular matrices in which the rows represent genes and the columns represent samples. Therefore, one can express the microarrays by $y_{gsk}$, where $y_{gsk}$ denotes the gene expression for the $g$th gene in the $s$th sample of the $k$th study for $g = 1, \cdots, G$; $s = 1, \cdots, S$ and $k = 1, \cdots, K$. Usually samples are identified by a clinical variable $r_{sk}$ indicating their classes. Thus, for a given study $k$, $r_{sk} \in \{0, 1\}$ represents a two-class comparison problem and $r_{sk} \in \{1, \cdots, S\}$ leads to a multi-class comparison problem.

Microarray meta-analysis usually refers to combining multiple transcriptomic studies for detecting differentially expressed (DE) genes (or biomarkers) across two or more conditions (e.g., case and control) with statistical significance and/or biological significance (e.g., fold change). For DE gene detection, a hypothesis test (such as the two-sample t-test) is performed per gene. Since multiple hypothesis tests are performed, the problem of multiple comparisons should be addressed. For example, when all null hypotheses are true, $N$ tests at significance level $\alpha$ produce on average $\alpha N$ significant genes or biomarkers by chance alone. Therefore, the false discovery rate (FDR) should be controlled in microarray analysis. A widely used procedure to control the FDR is the B-H method proposed by [Benjamini and Hochberg, 1995].
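The B-H adjustment mentioned above can be sketched in a few lines. The following is a minimal illustration using only the Python standard library; the function names `bh_adjust` and `de_genes` and the example p-values are hypothetical and are not taken from the thesis.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so that the adjusted values
    # stay monotone in the raw p-values (step-up procedure).
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def de_genes(pvalues, fdr=0.05):
    """Indices of genes declared DE at the given nominal FDR."""
    return [g for g, q in enumerate(bh_adjust(pvalues)) if q <= fdr]


# Hypothetical per-gene p-values for four genes.
p = [0.01, 0.04, 0.03, 0.005]
print(bh_adjust(p))
print(de_genes(p, fdr=0.03))
```

Each adjusted value is the smallest nominal FDR at which the corresponding gene would be declared DE, so thresholding the adjusted values at different levels gives the DE gene lists at different FDR thresholds.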

1.2 META-ANALYSIS AND MICROARRAY META-ANALYSIS

Meta-analysis refers to systematic methods that integrate information from different, independent studies by using statistical techniques. Although the term "meta-analysis" was coined by Glass in 1976 [Glass 1976], some of the techniques of meta-analysis can be traced back much earlier. Pearson performed the first meta-analysis in 1904 to summarize the correlation coefficients across studies of typhoid vaccination (Pearson 1904). Tippett (1931), Fisher (1948), and Wilkinson (1951) also proposed methods for combining p-values. Today, meta-analysis is widely used in epidemiology and medical research. In meta-analysis, two major types of statistical techniques have been used: combining effect sizes and combining p-values.

1.2.1 Combining effect sizes

Among the methods for combining effect sizes, the fixed effects model and the random effects model are the most popular [Cooper et al., 2009]. These methods synthesize the effect size estimates directly, are usually more straightforward and powerful, and should be preferred when the effect sizes are well defined and comparable across studies.

1.2.1.1 Fixed effects model. In the fixed effects model, one assumes that there is one true effect size $\theta$ and that all the differences in observed effects are due to sampling error. In other words, the fixed effects model can be written as

\[ T_k = \theta + \epsilon_k \quad \text{with} \quad \epsilon_k \sim N(0, \sigma_k^2), \]

where $T_k$ is the observed effect size of study $k$. So each effect size $T_k$ estimates a single mean effect $\theta$, and differs from this mean effect by the sampling error $\epsilon_k$.
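As an illustration of how the fixed effects model is typically used, the sketch below computes the standard inverse-variance weighted estimate of $\theta$. The estimator is not spelled out in the text above and is assumed here; the function name and the example numbers are hypothetical, and the variances $\sigma_k^2$ are treated as known.

```python
from math import sqrt
from statistics import NormalDist


def fixed_effect_estimate(effects, variances):
    """Inverse-variance weighted estimate of the common effect theta.

    effects:   observed effect sizes T_k
    variances: sampling variances sigma_k^2 (assumed known)
    Returns (theta_hat, standard_error, two_sided_p_value).
    """
    weights = [1.0 / v for v in variances]  # w_k = 1 / sigma_k^2
    total = sum(weights)
    theta_hat = sum(w * t for w, t in zip(weights, effects)) / total
    se = sqrt(1.0 / total)  # Var(theta_hat) = 1 / sum_k w_k
    z = theta_hat / se
    # Two-sided test of H0: theta = 0 under the normal model.
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return theta_hat, se, p_value


# Two hypothetical studies with equal precision.
print(fixed_effect_estimate([0.2, 0.4], [0.04, 0.04]))
```

With equal variances the estimate reduces to the plain average of the $T_k$; studies with smaller $\sigma_k^2$ otherwise receive proportionally more weight.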

1.2.1.2 Random effects model. In fixed effects models, the true effect size is assumed to be the same in all studies, which in many applications is implausible. More generally, in random effects models, each effect size is assumed to differ from the underlying population mean $\theta$ due to both sampling error and the underlying population variance, i.e., the random effects model can be written as

\[ T_k = \theta + \epsilon_k + \zeta_k \quad \text{with} \quad \epsilon_k \sim N(0, \sigma_k^2), \quad \zeta_k \sim N(0, \tau^2). \]

1.2.2 Combining p-values

P-value combination methods are good alternatives to effect size combination methods when the effect sizes are not directly comparable across studies. The well-known p-value combination methods include Fisher’s method [Fisher, 1948], Stouffer’s method [Stouffer, 1949], the minP method [Tippett, 1931] and the maxP method [Wilkinson, 1951].

These methods can be divided further into two classes: evidence aggregation methods and order-statistic based methods. The maxP and minP methods are two commonly used order-statistic based meta-analysis methods, since they use order statistics of the observed p-values as their test statistics. In contrast, the Fisher and Stouffer methods are among the most popular evidence aggregation meta-analysis methods, in which the test statistic is defined as the sum of selected transformations of the p-values of the individual studies, i.e., the evidence is aggregated when new studies are included in the analysis. In this section, we assume that the null hypothesis is $H_0 : \bigcap_{k=1}^{K} \{\theta_k = 0\}$, where $\theta_k$ is the true effect size of study $k$.

1.2.2.1 Evidence aggregation methods. For evidence aggregation methods, given a set of p-values $\{p_1, \cdots, p_K\}$, the test statistic is defined as

\[ T = \sum_{k=1}^{K} T_k := \sum_{k=1}^{K} F_X^{-1}(p_k), \]

where $F_X(\cdot)$ is the cumulative distribution function (CDF) of some random variable $X$.

In theory, any random variable $X$ can be chosen to form a p-value combination method. In practice, only those $X$ for which the distribution of $T$ is simple under the null hypothesis $H_0$ are used. In this thesis, we focus on three popular special cases:

1. Fisher’s method: when $X \sim \chi^2_2$, $T_k = F_X^{-1}(1 - p_k) = -2\log(p_k)$. Replacing $p_k$ by $1 - p_k$ leaves the null distribution of $T$ unchanged, because both are Uniform(0, 1) under $H_0$.
2. Stouffer’s method: when $X \sim N(0, 1)$, $T_k = F_X^{-1}(p_k) = \Phi^{-1}(p_k)$.
3. Logit method: when $X \sim \mathrm{Logistic}(0, 1)$, $T_k = \log\frac{p_k}{1 - p_k}$ and $\sqrt{\frac{3(5K+4)}{\pi^2 K(5K+2)}}\, T \sim t_{5K+4}$ approximately (Hedges and Olkin 1985) under the null hypothesis. For $K \geq 5$, it has been further approximated by $\sqrt{\frac{3(5K+4)}{\pi^2 K(5K+2)}}\, T \sim N(0, 1)$.

1.2.2.2 Order-statistic based methods. Given a set of p-values $\{p_k\}_{k=1}^{K}$, let $\{p_{(k)}\}_{k=1}^{K}$ be its ordered version. Then for order-statistic based methods, an order statistic is selected as the test statistic (Song and Tseng 2014), i.e.,

\[ T := p_{(r)} \sim \mathrm{Beta}(r, K - r + 1) \quad \text{for } 1 \leq r \leq K. \]

The minP and maxP methods are the special cases with $r = 1$ and $r = K$, respectively.

1.2.3 Microarray meta-analysis

When multiple microarray studies are available, meta-analysis can be used to increase the statistical power for DE gene detection. Most meta-analytic methods for microarray studies are based on extensions of the univariate meta-analysis methods used in traditional medical research. Rhodes et al. were the first to apply the conventional Fisher’s method for combining multiple microarray studies [Rhodes et al., 2002]. In this thesis, I will focus on methods that combine p-values. Since these test statistics have simple analytical null distributions, they are easy to apply in the genomic setting. Given study $k$, suppose an appropriate test statistic $T_k$ is selected for the comparison $\{r_{sk}, 1 \leq s \leq S_k\}$ and the resulting p-value for each gene $g$ (denoted $p_{gk}$) is derived from the observed expression intensities. Then, for each fixed $g$, the conventional meta-analysis methods can be applied to $\{p_{gk}\}_{k=1}^{K}$ for information integration. The final p-values obtained for each gene, $\{p_g\}_{g=1}^{G}$, are adjusted by the B-H method to control the FDR, and the DE genes can be detected at different FDR thresholds.
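To make the p-value combination methods of Section 1.2.2 concrete, the sketch below implements Fisher’s method, Stouffer’s method and the order-statistic method (with minP and maxP as special cases) using only the Python standard library, and applies Fisher’s method gene by gene. It assumes independent studies and one-sided p-values in (0, 1); the function names and the example p-values are hypothetical. The logit method is omitted because it needs a t-distribution CDF, which the standard library does not provide.

```python
from math import comb, exp, log, sqrt
from statistics import NormalDist


def chi2_sf_even(x, df):
    """P(chi2_df > x) for even df, via the closed-form Poisson sum."""
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return exp(-half) * total


def fisher_pvalue(pvals):
    """T = sum_k -2 log(p_k) ~ chi2 with 2K degrees of freedom under H0."""
    t = -2.0 * sum(log(p) for p in pvals)
    return chi2_sf_even(t, 2 * len(pvals))


def stouffer_pvalue(pvals):
    """T = sum_k Phi^{-1}(p_k) ~ N(0, K) under H0; small p_k give T << 0."""
    nd = NormalDist()
    t = sum(nd.inv_cdf(p) for p in pvals)
    return nd.cdf(t / sqrt(len(pvals)))


def order_statistic_pvalue(pvals, r):
    """p_(r) ~ Beta(r, K - r + 1) under H0; r=1 is minP and r=K is maxP."""
    K = len(pvals)
    x = sorted(pvals)[r - 1]
    # Beta(r, K-r+1) CDF at x = P(at least r of K uniforms are <= x).
    return sum(comb(K, j) * x**j * (1.0 - x) ** (K - j) for j in range(r, K + 1))


# Hypothetical p-value matrix: rows are genes g, columns are studies k.
p_matrix = [
    [0.01, 0.20, 0.03],
    [0.50, 0.70, 0.40],
]
combined = [fisher_pvalue(row) for row in p_matrix]
print(combined)
print(stouffer_pvalue(p_matrix[0]))
print(order_statistic_pvalue(p_matrix[0], 1), order_statistic_pvalue(p_matrix[0], 3))
```

The list `combined` plays the role of $\{p_g\}_{g=1}^{G}$ in the text; it would then be passed to an FDR adjustment before DE genes are called.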

1.3 COMPLEMENTARY HYPOTHESIS SETTINGS

In meta-analysis one needs to combine independent p-values from a set of hypothesis tests. Given $K$ individual hypotheses $H_{0k} : \theta_k = 0$ for $k = 1, \cdots, K$, the joint null hypothesis is defined as $H_0 : \bigcap_{k=1}^{K} \{\theta_k = 0\}$. $H_0$ is true only if all the effect sizes are 0 and false when at least one effect size is non-zero. It has been shown that there is no uniformly most powerful test, and some tests may be more powerful than others when specific alternative hypotheses are true.

Two commonly encountered hypothesis settings are defined as follows:

\[ HS_A : \; H_0 : \bigcap_{k=1}^{K} \{\theta_k = 0\} \quad \text{versus} \quad H_A : \bigcap_{k=1}^{K} \{\theta_k \neq 0\}, \]

\[ HS_B : \; H_0 : \bigcap_{k=1}^{K} \{\theta_k = 0\} \quad \text{versus} \quad H_A : \bigcup_{k=1}^{K} \{\theta_k \neq 0\}. \]

In $HS_A$, the alternative hypothesis is the intersection event that the effect sizes of all $K$ studies are non-zero (i.e., an effect is present in every study), while $HS_B$ pursues non-zero effects in one or more studies (the alternative hypothesis is the union event instead of the intersection in $HS_A$). $HS_A$ is more stringent and more desirable for identifying consistency across all studies if the combined studies are homogeneous. $HS_B$, however, is useful when heterogeneity in effect sizes is expected.

$HS_A$ and $HS_B$ are closely related to two often-asked biological questions in genomic studies: "Which genes are significant in all data sets?" and "Which genes are significant in one or more data sets?", respectively. The maxP method targets $HS_A$, and all the other p-value combination methods target $HS_B$.

Since in practice it is unknown a priori which individual null hypotheses are false, it is difficult for researchers to select an appropriate hypothesis test with high power. In order to find a test which can achieve good power properties across such uncertainty, a new complementary hypothesis setting $HS_r$ is defined as

\[ HS_r : \; H_0 \quad \text{versus} \quad H_r : \bigcap_{k=1}^{r} \{\theta_k \neq 0\} \ \text{and} \ \bigcup_{k=r+1}^{K} \{\theta_k = 0\}. \]

For the null hypothesis to be false under $HS_r$, at least $r$ effect sizes should be non-zero.

1.4 SCOPE OF THE THESIS

In this thesis, I will focus on methods of combining p-values, which in turn implies that, in order to utilize the methods, the p-value of each study must be available in advance. Although this is generally true in conventional meta-analysis, it is not unusual in genomic studies that the raw data are unavailable and only a partial DE gene list is reported with a given p-value threshold [Griffith et al., 2006]. Therefore, for some gene $g$, only the range of $p_{gk}$ is available, i.e., whether the gene is differentially expressed at a given p-value threshold $\alpha_k$. The naïve methods that drop either the genes or the studies with incomplete information are unsatisfactory, because they neglect the rich information contained in the truncated data. Therefore, there is a practical need to develop meta-analysis approaches that can efficiently combine truncated p-value information. One solution is to impute the truncated p-values before applying conventional meta-analysis. In this thesis, three imputation methods (the mean imputation, the single random imputation and the multiple imputation) are applied. In Chapter 2, I investigate the imputation of truncated p-values for evidence aggregation meta-analysis methods.

When integrating multiple genomic studies, the expression patterns of genes may vary in a study-specific manner. Li and Tseng proposed an adaptively weighted Fisher’s method (AW-Fisher) to uncover the altered gene expression pattern across different studies [Li and Tseng, 2011], in which they started with the weighted statistic $U_g(w_g) = -\sum_{k=1}^{K} w_{gk} \log(p_{gk})$, where $w_{gk} \in \{0, 1\}$ is the weight assigned to the $k$th study and $w_g = (w_{g1}, \cdots, w_{gK})$. Denoting by $p_U(u_g(w_g))$ the corresponding p-value, the adaptively weighted statistic is defined as the minimal p-value among all possible weights:

\[ V_g^{AW} = \min_{w_g} p_U(u_g(w_g)) \quad \text{and} \quad \hat{w}_g = \arg\min_{w_g} p_U(u_g(w_g)). \]

The resulting weight $\hat{w}_g$ has a natural biological interpretation: it indicates whether or not a study contributes to the statistical significance of a gene.

Although the number of studies in which the null hypothesis is false is unknown a priori, the AW-Fisher’s method can maintain good power properties across such uncertainty. In Chapter 3, the AW-Fisher’s method is generalized to a class of evidence aggregation meta-analysis methods, and properties such as the linear searching complexity, the asymptotic consistency of the weights and the asymptotic Bahadur optimality of the proposed tests are investigated.
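The adaptive weight search for a single gene can be sketched as follows. The sketch computes $V^{AW}$ and $\hat{w}$ in two ways: by exhaustive search over all $2^K - 1$ non-zero binary weight vectors, and by a linear search over the sorted p-values. The linear search rests on the observation that, for a fixed number $j$ of non-zero weights, $p_U$ is smallest when the $j$ smallest p-values are selected; this is one reading of the order-statistics argument and is assumed here rather than quoted from Chapter 3. The function names are hypothetical, and the code returns the statistic $V^{AW}$ only, not the final p-value of the AW-Fisher test, which requires the null distribution of $V^{AW}$.

```python
from itertools import product
from math import exp, log


def chi2_sf_even(x, df):
    """P(chi2_df > x) for even df, via the closed-form Poisson sum."""
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return exp(-half) * total


def weighted_pvalue(pvals, weights):
    """p_U(u(w)) for U(w) = -sum_k w_k log(p_k); 2U ~ chi2_{2j} under H0,
    where j is the number of non-zero weights."""
    j = sum(weights)
    u = -sum(w * log(p) for w, p in zip(weights, pvals))
    return chi2_sf_even(2.0 * u, 2 * j)


def aw_exhaustive(pvals):
    """Search all 2^K - 1 non-zero 0/1 weight vectors."""
    K = len(pvals)
    candidates = [w for w in product((0, 1), repeat=K) if any(w)]
    best = min(candidates, key=lambda w: weighted_pvalue(pvals, w))
    return weighted_pvalue(pvals, best), best


def aw_linear(pvals):
    """Search only the K nested candidates built from the sorted p-values."""
    K = len(pvals)
    order = sorted(range(K), key=lambda k: pvals[k])
    weights = [0] * K
    best_val, best_w = None, None
    for k in order:  # switch studies on from most to least significant
        weights[k] = 1
        val = weighted_pvalue(pvals, weights)
        if best_val is None or val < best_val:
            best_val, best_w = val, tuple(weights)
    return best_val, best_w


# Hypothetical p-values of one gene in four studies.
p = [0.001, 0.6, 0.02, 0.9]
print(aw_exhaustive(p))
print(aw_linear(p))
```

In this example both searches select the first and third studies, which matches the interpretation of $\hat{w}_g$ as an indicator of the studies that contribute to the significance of the gene.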

2.0 IMPUTATION OF TRUNCATED P-VALUES FOR EVIDENCE AGGREGATION META-ANALYSIS METHODS AND ITS GENOMIC APPLICATION

2.1 INTRODUCTION AND MOTIVATION

Microarray analysis to monitor expression activities in thousands of genes simultaneously has become routine in biomedical research during the past decade. The rapid development of biological high-throughput technology has resulted in a tremendous amount of experimental data, and many datasets are available from public domains such as Gene Expression Omnibus (GEO) and ArrayExpress. Since most microarray studies have relatively small sample sizes and limited statistical power, integrating information from multiple transcriptomic studies using meta-analysis techniques is becoming popular. Microarray meta-analysis usually refers to combining multiple transcriptomic studies for detecting differentially expressed (DE) genes (or candidate markers). DE gene analysis identifies genes differentially expressed across two or more conditions (e.g., cases and controls) with statistical significance and/or biological significance (e.g., fold change). Microarray meta-analysis in many situations refers to performing traditional meta-analysis techniques on each gene repeatedly and then controlling the false discovery rate (FDR) to adjust p-values for multiple comparisons (Borovecki et al. 2005; Cardoso et al. 2007; Pirooznia et al. 2007; Segal et al. 2004). Fisher’s method (Fisher 1931) was the first meta-analysis technique introduced in microarray data analysis in 2002 (Rhodes et al. 2002), followed by Tippett’s minimum p-value method in 2003 (Moreau et al. 2003). Subsequently many meta-analysis approaches have been used in this field, including extensions of e
