Date of Completion


Embargo Period



Large Scale Multiple Testing, Dependence Structure, Signal, Noise, Signal-strength, Hidden Markov Models, Markov Chains, Random Transition Matrices, Sampling, High-dimensional Data

Major Advisor

Zhiyi Chi

Associate Advisor

Joseph Glaz

Associate Advisor

Vladimir Pozdnyakov

Field of Study



Doctor of Philosophy

Open Access

Open Access


This dissertation studies large-scale multiple testing, which plays an important role in many areas of modern science and technology, such as biomedical imaging and genomic data processing. It has long been recognized that statistical dependence in data poses a significant challenge to large-scale multiple testing. Failure to take the dependence into account can result in a severe drop in the performance of multiple testing. In particular, the detection power of large-scale multiple tests is known to suffer under dependence when the False Discovery Proportion must be controlled. However, it often happens that the dependence structure is unknown and only a single, albeit very high-dimensional, observation of the test statistics is available. This makes large-scale multiple testing under dependence considerably harder. This situation can be likened to a signal processing problem in which the truth or falsehood of a hypothesis plays the role of an unobservable binary signal and hypothesis testing becomes analogous to signal detection. To complete the analogy, the signals have an unknown statistical dependence and the test statistics are the dependent, noise-corrupted observations. The number of simultaneous hypotheses in this work typically ranges from a thousand to a million. The target application context is that of large-scale 'preliminary sieving', using noisy observations, with the goal of reducing the scale of the problem for further examination. Accordingly, the detection of extremely sparse signals lies outside the scope of this work.

Our work addresses this problem for the case of a stationary, ergodic signal vector with low signal-strength, a known noise distribution and a known signal-noise interaction-function. This case has many potential applications as signals embedded in data can often be characterized as a stationary ergodic process with an unknown distribution, while the distribution of the noise that distorts the signals can be accurately inferred beforehand under controlled experiments. Our main contribution in this setting is a new approach for improved recovery of a long sequence of dependent binary signals embedded in noisy observations. The novel aspect of our approach is the approximation and numerical computation of the posterior probabilities of binary signals at individual sites of the process, by drawing strength from observations at nearby sites without assuming the availability of their joint prior distribution. Although we only consider signal vectors registered as a time series, the approach in principle may apply to random fields as well.

A problem closely related to multiple testing under arbitrary dependence is the simulation of random transition matrices. This problem is motivated by the need for 'random' Markov chains in the study of multiple testing. Random transition matrices can also be used to simulate random contingency tables, models of real-world networks, and other high-dimensional data with versatile dependence structures. For example, simulating random stochastic matrices with a specified principal eigenvector or a specified spectral gap facilitates the simulation of Markov chains with a specified stationary distribution or mixing time, respectively. The exact-simulation problem is known to be hard, and consequently simple recipes for the exact simulation of such random matrices, even from a uniform distribution, are unavailable. We use known results to suggest simple heuristics to simulate, from an unknown distribution, stochastic matrices that have a prescribed principal left eigenvector and/or approximately a prescribed spectral gap.
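To make the idea of a prescribed principal left eigenvector concrete, the following is a minimal sketch of one well-known construction, the Metropolis-Hastings correction of a random proposal matrix. It is offered as an illustration only and is not claimed to be the heuristic developed in this dissertation. The function names, the flat Dirichlet proposal, and the use of NumPy are assumptions made for the example. The construction yields reversible chains only, so it covers a restricted class of stochastic matrices, drawn from an unknown distribution over that class.

```python
import numpy as np


def random_stochastic_matrix_with_stationary(pi, rng=None):
    """Return a random row-stochastic matrix P with pi @ P == pi.

    Illustrative sketch (hypothetical helper): a random proposal matrix is
    corrected with Metropolis-Hastings acceptance probabilities so that
    detailed balance holds with respect to pi. Assumes every entry of pi
    is strictly positive.
    """
    rng = np.random.default_rng(rng)
    pi = np.asarray(pi, dtype=float)
    pi = pi / pi.sum()
    n = pi.size

    # Random proposal: each row is drawn from a flat Dirichlet distribution,
    # so all entries are positive with probability one.
    Q = rng.dirichlet(np.ones(n), size=n)

    # flow[i, j] = pi_i * q_ij; acceptance a_ij = min(1, flow[j, i] / flow[i, j]).
    flow = pi[:, None] * Q
    accept = np.minimum(1.0, flow.T / flow)

    # Off-diagonal entries are accepted proposals; rejected mass stays put.
    P = Q * accept
    np.fill_diagonal(P, 0.0)
    P[np.diag_indices(n)] = 1.0 - P.sum(axis=1)
    return P


def spectral_gap(P):
    """Return 1 minus the second-largest eigenvalue modulus of P."""
    moduli = np.sort(np.abs(np.linalg.eigvals(P)))
    return 1.0 - moduli[-2]


if __name__ == "__main__":
    target = np.array([0.1, 0.2, 0.3, 0.4])
    P = random_stochastic_matrix_with_stationary(target, rng=0)
    print("row sums:", P.sum(axis=1))
    print("pi @ P:  ", target @ P)
    print("gap:     ", spectral_gap(P))
```

The sketch fixes the stationary distribution exactly but leaves the spectral gap uncontrolled; the gap is only measured after the fact, which reflects why prescribing it, even approximately, calls for additional heuristics.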

The unifying theme that pervades this dissertation is that of unknown dependence structure in high-dimensional data.