Title: | Adaptable Regularized Hotelling's T^2 Test for High-Dimensional Data |
---|---|
Description: | Perform the Adaptable Regularized Hotelling's T^2 test (ARHT) proposed by Li et al., (2016) <arXiv:1609.08725>. Both one-sample and two-sample mean test are available with various probabilistic alternative prior models. It contains a function to consistently estimate higher order moments of the population covariance spectral distribution using the spectral of the sample covariance matrix (Bai et al. (2010) <doi:10.1111/j.1467-842X.2010.00590.x>). In addition, it contains a function to sample from 3-variate chi-squared random vectors approximately with a given correlation matrix when the degrees of freedom are large. |
Authors: | Haoran Li [aut, cre] |
Maintainer: | Haoran Li <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.1.0 |
Built: | 2024-10-22 06:01:12 UTC |
Source: | https://github.com/haoranli/arht |
test for high dimensional dataThis function performs the adaptable regularized Hotelling's test (ARHT) (Li et al., (2016) <arXiv:1609.08725>) for the one-sample
and two-sample test problem, where we're interested in detecting the mean vector in the one-sample problem or the difference
between mean vectors in the two-sample problem in a high dimensional regime.
ARHT(X, Y = NULL, mu_0 = NULL, prob_alt_prior = list(c(1, 0, 0), c(0, 1, 0), c(0, 0, 1)), Type1error_calib = c("cube_root", "sqrt", "chi_sq", "none"), lambda_range = NULL, nlambda = 2000, bs_size = 1e+05)
ARHT(X, Y = NULL, mu_0 = NULL, prob_alt_prior = list(c(1, 0, 0), c(0, 1, 0), c(0, 0, 1)), Type1error_calib = c("cube_root", "sqrt", "chi_sq", "none"), lambda_range = NULL, nlambda = 2000, bs_size = 1e+05)
X |
the n1-by-p observation matrix with numeric column variables. |
Y |
an optional n2-by-p observation matrix; if |
mu_0 |
the null hypothesis vector to be tested; if |
prob_alt_prior |
a non-empty list; Each field is a numeric vector with sum 1. The default value is the "canonical weights"
|
Type1error_calib |
the method to calibrate Type 1 error rate of ARHT. Choose its first element when more than one are specified. Four values are allowed:
|
lambda_range |
optional user-supplied lambda range; If |
nlambda |
optional user-supplied number of lambda's in grid search; default to be |
bs_size |
positive numeric with default value |
The method incorporates ridge-regularization in the classic Hotelling's test with the regularization parameter
chosen such that the asymptotic power under a class of probabilistic alternative prior models is maximized. ARHT combines
different prior models by taking the maximum of statistics under all models. ARHT is distributed as the maximum
of a correlated multivariate normal random vector. We estimate its covariance matrix and bootstrap its distribution. The
returned p-value is a Monte Carlo approximation to its true value using the bootstrap sample, therefore not deterministic.
Various methods are available to calibrate the slightly inflated Type 1 error rate of ARHT, including Cube-root transformation,
square-root transformation and chi-square approximation.
ARHT_pvalue
: The p-value of ARHT test.
If length(prob_alt_prior)==1
, it is identical to RHT_pvalue
.
If length(prob_alt_prior)>1
, it is the p-value after combining results from all prior models. The value is
bootstrapped, therefore not deterministic.
RHT_opt_lambda
: The optimal lambda's chosen under each of the prior models in prob_alt_prior
. It has the same length and order as
prob_alt_prior
.
RHT_pvalue
: The p-value of RHT tests with the lambda's in RHT_opt_lambda
.
RHT_std
: The standardized RHT statistics with the lambda's in RHT_opt_lambda
.
Take its maximum to get the statistic of ARHT test.
Theta1
: As defined in Li et al. (2016) <arXiv:1609.08725>, the estimated asymptotic means of RHT statistics with the lambda's in RHT_opt_lambda
.
Theta2
: As defined in Li et al. (2016) <arXiv:1609.08725>, 2*Theta2
are the estimated asymptotic variances of RHT statistics the lambda's in RHT_opt_lambda
.
Corr_RHT
: The estimated correlation matrix of the statistics in RHT_std
.
Li, H. Aue, A., Paul, D. Peng, J., & Wang, P. (2016). An adaptable generalization of Hotelling's test in high dimension.
<arXiv:1609:08725>.
Chen, L., Paul, D., Prentice, R., & Wang, P. (2011). A regularized Hotelling's test for pathway analysis in proteomic studies.
Journal of the American Statistical Association, 106(496), 1345-1360.
set.seed(10086) # One-sample test n1 = 300; p =500 dataX = matrix(rnorm(n1 * p), nrow = n1, ncol = p) res1 = ARHT(dataX) # Two-sample test n2= 400 dataY = matrix(rnorm(n2 * p), nrow = n2, ncol = p ) res2 = ARHT(dataX, dataY, mu_0 = rep(0.01,p)) # Specify probabilistic alternative priors model res3 = ARHT(dataX, dataY, mu_0 = rep(0.01,p), prob_alt_prior = list(c(1/3, 1/3, 1/3), c(0,1,0))) # Change Type 1 error calibration method res4 = ARHT(dataX, dataY, mu_0 = rep(0.01,p), Type1error_calib = "sqrt") RejectOrNot = res4$ARHT_pvalue < 0.05
set.seed(10086) # One-sample test n1 = 300; p =500 dataX = matrix(rnorm(n1 * p), nrow = n1, ncol = p) res1 = ARHT(dataX) # Two-sample test n2= 400 dataY = matrix(rnorm(n2 * p), nrow = n2, ncol = p ) res2 = ARHT(dataX, dataY, mu_0 = rep(0.01,p)) # Specify probabilistic alternative priors model res3 = ARHT(dataX, dataY, mu_0 = rep(0.01,p), prob_alt_prior = list(c(1/3, 1/3, 1/3), c(0,1,0))) # Change Type 1 error calibration method res4 = ARHT(dataX, dataY, mu_0 = rep(0.01,p), Type1error_calib = "sqrt") RejectOrNot = res4$ARHT_pvalue < 0.05
The function calculates consistent estimators of moments of the spectral distribution of the population covariance matrix given the spectral of the sample covariance matrix.
moments_PSD(eigenvalues, n, mom_degree)
moments_PSD(eigenvalues, n, mom_degree)
eigenvalues |
all eigenvalues of the sample covariance matrix including 0's. |
n |
degree of freedom of the sample covariance matrix. |
mom_degree |
the maximum order of moments. |
Estimators of moments from the first to the mom_degree
-th order.
Bai, Z., Chen, J., & Yao, J. (2010). On estimation of the population spectral distribution from a high-dimensional sample covariance matrix. Australian & New Zealand Journal of Statistics, 52(4), 423-437.
set.seed(10086) n = 400; p= 500 pop_eig = seq(10,1,length = p) # Data with covariance matrix diag(pop_eig) Z = matrix(rnorm(n*p),n,p) X = Z %*% diag(sqrt(pop_eig)) raw_eig = svd(cov(X))$d emp_eig = raw_eig[raw_eig>=0] # Moments of population spectral distribution colMeans(outer(pop_eig, 1:4, "^")) # Estimators moments_PSD(emp_eig, n-1, 4)
set.seed(10086) n = 400; p= 500 pop_eig = seq(10,1,length = p) # Data with covariance matrix diag(pop_eig) Z = matrix(rnorm(n*p),n,p) X = Z %*% diag(sqrt(pop_eig)) raw_eig = svd(cov(X))$d emp_eig = raw_eig[raw_eig>=0] # Moments of population spectral distribution colMeans(outer(pop_eig, 1:4, "^")) # Estimators moments_PSD(emp_eig, n-1, 4)
Generate samples approximately from three positively correlated chi-squared random variables
when the degrees of freedom
are large.
r3chisq(size, df, corr_mat)
r3chisq(size, df, corr_mat)
size |
sample size. |
df |
the degree of freedoms of the marginal distributions. Must be non-negative, but can be non-integer.
The function uses |
corr_mat |
the target correlation matrix; negative elements will be set to 0. |
It is generally hard to sample from with a designed correlation matrix.
In the algorithm, we approximate the random vector by
where
is a standard norm random vector and
are diagonal matrices
with diagonal elements 1's and 0's. The designed positive correlations is approximated by carefully
selecting common locations of 1's on the diagonals. The generated sample may have slightly larger marginal degrees
of freedom than the inputted
df
, also slightly different covariances.
sample
: a size
-by-3 matrix contains the generated sample.
approx_cov
: the true covariance matrix of sample
.
Li, H., Aue, A., Paul, D., Peng, J., & Wang, P. (2016).
An adaptable generalization of Hotelling's
test in high dimension. arXiv preprint <arXiv:1609.08725>.
set.seed(10086) cor_examp = matrix(c(1,1/6,2/3,1/6,1,2/3,2/3,2/3,1),3,3) a_sam = r3chisq(size = 10000, df = c(80,90,100), corr_mat = cor_examp) cov(a_sam$sample) - a_sam$approx_cov cov2cor(a_sam$approx_cov) - cor_examp
set.seed(10086) cor_examp = matrix(c(1,1/6,2/3,1/6,1,2/3,2/3,2/3,1),3,3) a_sam = r3chisq(size = 10000, df = c(80,90,100), corr_mat = cor_examp) cov(a_sam$sample) - a_sam$approx_cov cov2cor(a_sam$approx_cov) - cor_examp