Title: | Sparse Canonical Correlation Analysis for High-Dimensional Mixed Data |
---|---|
Description: | Semi-parametric approach for sparse canonical correlation analysis which can handle mixed data types: continuous, binary and truncated continuous. Bridge functions are provided to connect Kendall's tau to latent correlation under the Gaussian copula model. The methods are described in Yoon, Carroll and Gaynanova (2020) <doi:10.1093/biomet/asaa007> and Yoon, Mueller and Gaynanova (2021) <doi:10.1080/10618600.2021.1882468>. |
Authors: | Grace Yoon [aut] , Mingze Huang [ctb] , Irina Gaynanova [aut, cre] |
Maintainer: | Irina Gaynanova <[email protected]> |
License: | GPL-3 |
Version: | 1.6.2 |
Built: | 2024-11-04 04:05:05 UTC |
Source: | https://github.com/irinagain/mixedcca |
Functions to create autocorrelation matrix (p by p) with parameter rho and block correlation matrix (p by p) using group index (of length p) and (possibly) different parameter rho for each group.
autocor(p, rho) blockcor(blockind, rho)
autocor(p, rho) blockcor(blockind, rho)
p |
Specified matrix dimension. |
rho |
Correlation value(s), must be between -0.99 and 0.99. Should be a scalar for |
blockind |
Block index 1,..., K for a positive integer K specifying which variable belongs to which block, the matrix dimension is equal to |
# For p = 8, # auto correlation matrix autocor(8, 0.8) # block correlation matrix: two blocks with the same correlation within each block blockcor(c(rep(1,3), rep(2,5)), 0.8) # block correlation matrix: two blocks with different correlation within each block blockcor(c(rep(1,3), rep(2,5)), c(0.8, 0.3))
# For p = 8, # auto correlation matrix autocor(8, 0.8) # block correlation matrix: two blocks with the same correlation within each block blockcor(c(rep(1,3), rep(2,5)), 0.8) # block correlation matrix: two blocks with different correlation within each block blockcor(c(rep(1,3), rep(2,5)), c(0.8, 0.3))
Estimation of latent correlation matrix from observed data of (possibly) mixed types (continuous/binary/truncated continuous) based on the latent Gaussian copula model.
estimateR( X, type = "trunc", method = "original", use.nearPD = TRUE, nu = 0.01, tol = 0.001, verbose = FALSE ) estimateR_mixed( X1, X2, type1 = "trunc", type2 = "continuous", method = "original", use.nearPD = TRUE, nu = 0.01, tol = 0.001, verbose = FALSE )
estimateR( X, type = "trunc", method = "original", use.nearPD = TRUE, nu = 0.01, tol = 0.001, verbose = FALSE ) estimateR_mixed( X1, X2, type1 = "trunc", type2 = "continuous", method = "original", use.nearPD = TRUE, nu = 0.01, tol = 0.001, verbose = FALSE )
X |
A numeric data matrix (n by p), n is the sample size and p is the number of variables. |
type |
A type of variables in |
method |
The calculation method of latent correlation. Either "original" method or "approx". If |
use.nearPD |
A logical value indicating whether to use nearPD or not when the resulting correlation estimator is not positive definite (have at least one negative eigenvalue). |
nu |
Shrinkage parameter for correlation matrix, must be between 0 and 1, the default value is 0.01. |
tol |
Desired accuracy when calculating the solution of bridge function. |
verbose |
If |
X1 |
A numeric data matrix (n by p1). |
X2 |
A numeric data matrix (n by p2). |
type1 |
A type of variables in |
type2 |
A type of variables in |
estimateR
returns
type: Type of the data matrix X
R: Estimated p by p latent correlation matrix of X
estimateR_mixed
returns
type1: Type of the data matrix X1
type2: Type of the data matrix X2
R: Estimated latent correlation matrix of whole X
= (X1
, X2
) (p1+p2 by p1+p2)
R1: Estimated latent correlation matrix of X1
(p1 by p1)
R2: Estimated latent correlation matrix of X2
(p2 by p2)
R12: Estimated latent correlation matrix between X1
and X2
(p1 by p2)
Fan J., Liu H., Ning Y. and Zou H. (2017) "High dimensional semiparametric latent graphicalmodel for mixed data" <doi:10.1111/rssb.12168>.
Yoon G., Carroll R.J. and Gaynanova I. (2020) "Sparse semiparametric canonical correlation analysis for data of mixed types" <doi:10.1093/biomet/asaa007>.
Yoon G., Mueller C.L., Gaynanova I. (2020) "Fast computation of latent correlations" <arXiv:2006.13875>.
### Data setting n <- 100; p1 <- 15; p2 <- 10 # sample size and dimensions for two datasets. maxcancor <- 0.9 # true canonical correlation ### Correlation structure within each data set set.seed(0) perm1 <- sample(1:p1, size = p1); Sigma1 <- autocor(p1, 0.7)[perm1, perm1] blockind <- sample(1:3, size = p2, replace = TRUE); Sigma2 <- blockcor(blockind, 0.7) mu <- rbinom(p1+p2, 1, 0.5) ### true variable indices for each dataset trueidx1 <- c(rep(1, 3), rep(0, p1-3)) trueidx2 <- c(rep(1, 2), rep(0, p2-2)) ### Data generation simdata <- GenerateData(n=n, trueidx1 = trueidx1, trueidx2 = trueidx2, maxcancor = maxcancor, Sigma1 = Sigma1, Sigma2 = Sigma2, copula1 = "exp", copula2 = "cube", muZ = mu, type1 = "trunc", type2 = "continuous", c1 = rep(1, p1), c2 = rep(0, p2) ) X1 <- simdata$X1 X2 <- simdata$X2 ### Check the range of truncation levels of variables range(colMeans(X1 == 0)) range(colMeans(X2 == 0)) ### Estimate latent correlation matrix # with original method R1_org <- estimateR(X1, type = "trunc", method = "original")$R # with faster approximation method R1_approx <- estimateR(X1, type = "trunc", method = "approx")$R R12_approx <- estimateR_mixed(X1, X2, type1 = "trunc", type2 = "continuous", method = "approx")$R12
### Data setting n <- 100; p1 <- 15; p2 <- 10 # sample size and dimensions for two datasets. maxcancor <- 0.9 # true canonical correlation ### Correlation structure within each data set set.seed(0) perm1 <- sample(1:p1, size = p1); Sigma1 <- autocor(p1, 0.7)[perm1, perm1] blockind <- sample(1:3, size = p2, replace = TRUE); Sigma2 <- blockcor(blockind, 0.7) mu <- rbinom(p1+p2, 1, 0.5) ### true variable indices for each dataset trueidx1 <- c(rep(1, 3), rep(0, p1-3)) trueidx2 <- c(rep(1, 2), rep(0, p2-2)) ### Data generation simdata <- GenerateData(n=n, trueidx1 = trueidx1, trueidx2 = trueidx2, maxcancor = maxcancor, Sigma1 = Sigma1, Sigma2 = Sigma2, copula1 = "exp", copula2 = "cube", muZ = mu, type1 = "trunc", type2 = "continuous", c1 = rep(1, p1), c2 = rep(0, p2) ) X1 <- simdata$X1 X2 <- simdata$X2 ### Check the range of truncation levels of variables range(colMeans(X1 == 0)) range(colMeans(X2 == 0)) ### Estimate latent correlation matrix # with original method R1_org <- estimateR(X1, type = "trunc", method = "original")$R # with faster approximation method R1_approx <- estimateR(X1, type = "trunc", method = "approx")$R R12_approx <- estimateR_mixed(X1, X2, type1 = "trunc", type2 = "continuous", method = "approx")$R12
Internal mixedCCA function finding w1 and w2 given R1, R2 and R12
find_w12bic( n, R1, R2, R12, lamseq1, lamseq2, w1init, w2init, BICtype, maxiter = 100, tol = 0.01, trace = FALSE, lassoverbose = FALSE )
find_w12bic( n, R1, R2, R12, lamseq1, lamseq2, w1init, w2init, BICtype, maxiter = 100, tol = 0.01, trace = FALSE, lassoverbose = FALSE )
n |
Sample size |
R1 |
Correlation matrix of dataset |
R2 |
Correlation matrix of dataset |
R12 |
Correlation matrix between the dataset |
lamseq1 |
A sequence of lambda values for the datasets |
lamseq2 |
A sequence of lambda values for the datasets |
w1init |
An initial vector of length p1 for canonical direction |
w2init |
An initial vector of length p1 for canonical direction |
BICtype |
Either 1 or 2: For more details for two options, see the reference. |
maxiter |
The maximum number of iterations allowed. |
tol |
The desired accuracy (convergence tolerance). |
trace |
If |
lassoverbose |
If |
find_w12bic
returns a data.frame containing
w1: estimated canonical direction .
w2: estimated canonical direction .
w1trace: a matrix, of which column is the estimated canonical direction at each iteration. The number of columns is the number of iteration until the convergence.
w2trace: a matrix, of which column is the estimated canonical direction at each iteration. The number of columns is the number of iteration until the convergence.
lam1.iter: For each iteration, what lambda value is selected for is stored.
lam2.iter: For each iteration, what lambda value is selected for is stored.
obj: objective function value without penalty: . If lamseq1 and lamseq2 are scalar, then original objective function including penalty part will be used.
Yoon G., Carroll R.J. and Gaynanova I. (2020) "Sparse semiparametric canonical correlation analysis for data of mixed types" <doi:10.1093/biomet/asaa007>.
GenerateData
is used to generate two sets of data of mixed types for sparse CCA under the Gaussian copula model.
GenerateData( n, trueidx1, trueidx2, Sigma1, Sigma2, maxcancor, copula1 = "no", copula2 = "no", type1 = "continuous", type2 = "continuous", muZ = NULL, c1 = NULL, c2 = NULL )
GenerateData( n, trueidx1, trueidx2, Sigma1, Sigma2, maxcancor, copula1 = "no", copula2 = "no", type1 = "continuous", type2 = "continuous", muZ = NULL, c1 = NULL, c2 = NULL )
n |
Sample size |
trueidx1 |
True canonical direction of length p1 for |
trueidx2 |
True canonical direction of length p2 for |
Sigma1 |
True correlation matrix of latent variable |
Sigma2 |
True correlation matrix of latent variable |
maxcancor |
True canonical correlation between |
copula1 |
Copula type for the first dataset. U1 = f(Z1), which could be either "exp", "cube". |
copula2 |
Copula type for the second dataset. U2 = f(Z2), which could be either "exp", "cube". |
type1 |
Type of the first dataset |
type2 |
Type of the second dataset |
muZ |
Mean of latent multivariate normal. |
c1 |
Constant threshold for |
c2 |
Constant threshold for |
GenerateData
returns a list containing
Z1: latent numeric data matrix (n by p1).
Z2: latent numeric data matrix (n by p2).
X1: observed numeric data matrix (n by p1).
X2: observed numeric data matrix (n by p2).
true_w1: normalized true canonical direction of length p1 for X1
.
true_w2: normalized true canonical direction of length p2 for X2
.
type: a vector containing types of two datasets.
maxcancor: true canonical correlation between Z1
and Z2
.
c1: constant threshold for X1
for "trunc" and "binary" data type.
c2: constant threshold for X2
for "trunc" and "binary" data type.
Sigma: true latent correlation matrix of Z1
and Z2
((p1+p2) by (p1+p2)).
### Simple example # Data setting n <- 100; p1 <- 15; p2 <- 10 # sample size and dimensions for two datasets. maxcancor <- 0.9 # true canonical correlation # Correlation structure within each data set set.seed(0) perm1 <- sample(1:p1, size = p1); Sigma1 <- autocor(p1, 0.7)[perm1, perm1] blockind <- sample(1:3, size = p2, replace = TRUE); Sigma2 <- blockcor(blockind, 0.7) mu <- rbinom(p1+p2, 1, 0.5) # true variable indices for each dataset trueidx1 <- c(rep(1, 3), rep(0, p1-3)) trueidx2 <- c(rep(1, 2), rep(0, p2-2)) # Data generation simdata <- GenerateData(n=n, trueidx1 = trueidx1, trueidx2 = trueidx2, maxcancor = maxcancor, Sigma1 = Sigma1, Sigma2 = Sigma2, copula1 = "exp", copula2 = "cube", muZ = mu, type1 = "trunc", type2 = "trunc", c1 = rep(1, p1), c2 = rep(0, p2) ) X1 <- simdata$X1 X2 <- simdata$X2 # Check the range of truncation levels of variables range(colMeans(X1 == 0)) range(colMeans(X2 == 0))
### Simple example # Data setting n <- 100; p1 <- 15; p2 <- 10 # sample size and dimensions for two datasets. maxcancor <- 0.9 # true canonical correlation # Correlation structure within each data set set.seed(0) perm1 <- sample(1:p1, size = p1); Sigma1 <- autocor(p1, 0.7)[perm1, perm1] blockind <- sample(1:3, size = p2, replace = TRUE); Sigma2 <- blockcor(blockind, 0.7) mu <- rbinom(p1+p2, 1, 0.5) # true variable indices for each dataset trueidx1 <- c(rep(1, 3), rep(0, p1-3)) trueidx2 <- c(rep(1, 2), rep(0, p2-2)) # Data generation simdata <- GenerateData(n=n, trueidx1 = trueidx1, trueidx2 = trueidx2, maxcancor = maxcancor, Sigma1 = Sigma1, Sigma2 = Sigma2, copula1 = "exp", copula2 = "cube", muZ = mu, type1 = "trunc", type2 = "trunc", c1 = rep(1, p1), c2 = rep(0, p2) ) X1 <- simdata$X1 X2 <- simdata$X2 # Check the range of truncation levels of variables range(colMeans(X1 == 0)) range(colMeans(X2 == 0))
Calculate Kendall's tau correlation.
The function KendallTau
calculates Kendall's tau correlation between two variables, returning a single correlation value. The function Kendall_matrix
returns a correlation matrix.
KendallTau(x, y) Kendall_matrix(X, Y = NULL)
KendallTau(x, y) Kendall_matrix(X, Y = NULL)
x |
A numeric vector. |
y |
A numeric vector. |
X |
A numeric matrix (n by p1). |
Y |
A numeric matrix (n by p2). |
KendallTau(x, y)
returns one Kendall's tau correlation value between two vectors, x
and y
.
Kendall_matrix(X)
returns a p1 by p1 matrix of Kendall's tau correlation coefficients. Kendall_matrix(X, Y)
returns a p1 by p2 matrix of Kendall's tau correlation coefficients.
n <- 100 # sample size r <- 0.8 # true correlation ### vector input # Data generation (X1: truncated continuous, X2: continuous) Z <- mvrnorm(n, mu = c(0, 0), Sigma = matrix(c(1, r, r, 1), nrow = 2)) X1 <- Z[,1] X1[Z[,1] < 1] <- 0 X2 <- Z[,2] KendallTau(X1, X2) Kendall_matrix(X1, X2) ### matrix data input p1 <- 3; p2 <- 4 # dimension of X1 and X2 JSigma <- matrix(r, nrow = p1+p2, ncol = p1+p2); diag(JSigma) <- 1 Z <- mvrnorm(n, mu = rep(0, p1+p2), Sigma = JSigma) X1 <- Z[,1:p1] X1[Z[,1:p1] < 0] <- 0 X2 <- Z[,(p1+1):(p1+p2)] Kendall_matrix(X1, X2)
n <- 100 # sample size r <- 0.8 # true correlation ### vector input # Data generation (X1: truncated continuous, X2: continuous) Z <- mvrnorm(n, mu = c(0, 0), Sigma = matrix(c(1, r, r, 1), nrow = 2)) X1 <- Z[,1] X1[Z[,1] < 1] <- 0 X2 <- Z[,2] KendallTau(X1, X2) Kendall_matrix(X1, X2) ### matrix data input p1 <- 3; p2 <- 4 # dimension of X1 and X2 JSigma <- matrix(r, nrow = p1+p2, ncol = p1+p2); diag(JSigma) <- 1 Z <- mvrnorm(n, mu = rep(0, p1+p2), Sigma = JSigma) X1 <- Z[,1:p1] X1[Z[,1:p1] < 0] <- 0 X2 <- Z[,(p1+1):(p1+p2)] Kendall_matrix(X1, X2)
This internal function generates lambda sequence of length nlamseq
equally spaced on a logarithmic scale. Since this is for sparse CCA, it returns a list of two vectors. Each vector will be used for each data set and
. And
and
denote canonical vector for each data set.
lambdaseq_generate( nlamseq = 20, lam.eps = 0.01, Sigma1, Sigma2, Sigma12, w1init = NULL, w2init = NULL )
lambdaseq_generate( nlamseq = 20, lam.eps = 0.01, Sigma1, Sigma2, Sigma12, w1init = NULL, w2init = NULL )
nlamseq |
The length of lambda sequence |
lam.eps |
The smallest value for lambda as a fraction of maximum lambda value |
Sigma1 |
Covariance/correlation matrix of |
Sigma2 |
Covariance/correlation matrix of |
Sigma12 |
Covariance/correlation matrix between |
w1init |
Initial value for canonical vector |
w2init |
Initial value for canonical vector |
lambdaseq_generate
returns a list of length 2. Each vector is of the same length nlamseq
and will be used for each data set separately.
Applies sparse canonical correlation analysis (CCA) for high-dimensional data of mixed types (continuous/binary/truncated continuous). Derived rank-based estimator instead of sample correlation matrix is implemented. There are two types of BIC criteria for variable selection. We found that BIC1 works best for variable selection, whereas BIC2 works best for prediction.
mixedCCA( X1, X2, type1, type2, lamseq1 = NULL, lamseq2 = NULL, nlamseq = 20, lam.eps = 0.01, w1init = NULL, w2init = NULL, BICtype, KendallR = NULL, maxiter = 100, tol = 0.01, trace = FALSE, lassoverbose = FALSE )
mixedCCA( X1, X2, type1, type2, lamseq1 = NULL, lamseq2 = NULL, nlamseq = 20, lam.eps = 0.01, w1init = NULL, w2init = NULL, BICtype, KendallR = NULL, maxiter = 100, tol = 0.01, trace = FALSE, lassoverbose = FALSE )
X1 |
A numeric data matrix (n by p1). |
X2 |
A numeric data matrix (n by p2). |
type1 |
A type of data |
type2 |
A type of data |
lamseq1 |
A tuning parameter sequence for |
lamseq2 |
A tuning parameter sequence for |
nlamseq |
The number of tuning parameter sequence lambda - default is 20. |
lam.eps |
A ratio of the smallest value for lambda to the maximum value of lambda. |
w1init |
An initial vector of length p1 for canonical direction |
w2init |
An initial vector of length p2 for canonical direction |
BICtype |
Either 1 or 2: For more details for two options, see the reference. |
KendallR |
An estimated Kendall |
maxiter |
The maximum number of iterations allowed. |
tol |
The desired accuracy (convergence tolerance). |
trace |
If |
lassoverbose |
If |
mixedCCA
returns a data.frame containing
KendallR: estimated Kendall's matrix estimator.
lambda_seq: the values of lamseq
used for sparse CCA.
w1: estimated canonical direction .
w2: estimated canonical direction .
cancor: estimated canonical correlation.
fitresult: more details regarding the progress at each iteration.
Yoon G., Carroll R.J. and Gaynanova I. (2020) "Sparse semiparametric canonical correlation analysis for data of mixed types" <doi:10.1093/biomet/asaa007>.
### Simple example # Data setting n <- 100; p1 <- 15; p2 <- 10 # sample size and dimensions for two datasets. maxcancor <- 0.9 # true canonical correlation # Correlation structure within each data set set.seed(0) perm1 <- sample(1:p1, size = p1); Sigma1 <- autocor(p1, 0.7)[perm1, perm1] blockind <- sample(1:3, size = p2, replace = TRUE); Sigma2 <- blockcor(blockind, 0.7) mu <- rbinom(p1+p2, 1, 0.5) # true variable indices for each dataset trueidx1 <- c(rep(1, 3), rep(0, p1-3)) trueidx2 <- c(rep(1, 2), rep(0, p2-2)) # Data generation simdata <- GenerateData(n=n, trueidx1 = trueidx1, trueidx2 = trueidx2, maxcancor = maxcancor, Sigma1 = Sigma1, Sigma2 = Sigma2, copula1 = "exp", copula2 = "cube", muZ = mu, type1 = "trunc", type2 = "trunc", c1 = rep(1, p1), c2 = rep(0, p2) ) X1 <- simdata$X1 X2 <- simdata$X2 # Check the range of truncation levels of variables range(colMeans(X1 == 0)) range(colMeans(X2 == 0)) # Kendall CCA with BIC1 kendallcca1 <- mixedCCA(X1, X2, type1 = "trunc", type2 = "trunc", BICtype = 1, nlamseq = 10) # Kendall CCA with BIC2. Estimated correlation matrix is plugged in from the above result. R <- kendallcca1$KendallR kendallcca2 <- mixedCCA(X1, X2, type1 = "trunc", type2 = "trunc", KendallR = R, BICtype = 2, nlamseq = 10)
### Simple example # Data setting n <- 100; p1 <- 15; p2 <- 10 # sample size and dimensions for two datasets. maxcancor <- 0.9 # true canonical correlation # Correlation structure within each data set set.seed(0) perm1 <- sample(1:p1, size = p1); Sigma1 <- autocor(p1, 0.7)[perm1, perm1] blockind <- sample(1:3, size = p2, replace = TRUE); Sigma2 <- blockcor(blockind, 0.7) mu <- rbinom(p1+p2, 1, 0.5) # true variable indices for each dataset trueidx1 <- c(rep(1, 3), rep(0, p1-3)) trueidx2 <- c(rep(1, 2), rep(0, p2-2)) # Data generation simdata <- GenerateData(n=n, trueidx1 = trueidx1, trueidx2 = trueidx2, maxcancor = maxcancor, Sigma1 = Sigma1, Sigma2 = Sigma2, copula1 = "exp", copula2 = "cube", muZ = mu, type1 = "trunc", type2 = "trunc", c1 = rep(1, p1), c2 = rep(0, p2) ) X1 <- simdata$X1 X2 <- simdata$X2 # Check the range of truncation levels of variables range(colMeans(X1 == 0)) range(colMeans(X2 == 0)) # Kendall CCA with BIC1 kendallcca1 <- mixedCCA(X1, X2, type1 = "trunc", type2 = "trunc", BICtype = 1, nlamseq = 10) # Kendall CCA with BIC2. Estimated correlation matrix is plugged in from the above result. R <- kendallcca1$KendallR kendallcca2 <- mixedCCA(X1, X2, type1 = "trunc", type2 = "trunc", KendallR = R, BICtype = 2, nlamseq = 10)
This function is modified from CCA package rcc function. This function is used for simulation. Inputs should be correlation or covariance matrices of each data set and between datasets.
myrcc(R1, R2, R12, lambda1, lambda2, tol = 1e-04)
myrcc(R1, R2, R12, lambda1, lambda2, tol = 1e-04)
R1 |
correlation/covariance/rank-based correlation matrix of dataset |
R2 |
correlation/covariance/rank-based correlation matrix of dataset |
R12 |
correlation/covariance/rank-based correlation matrix between dataset |
lambda1 |
tuning parameter (a scalar value) for dataset |
lambda2 |
tuning parameter (a scalar value) for dataset |
tol |
tolerance for eigenvalues. Refer to |
myrcc
returns a data.frame containing
cancor: estimated canonical correlation.
w1: estimated canonical direction .
w2: estimated canonical direction .
This function is modified from original CCA function for two reasons: to deal with only positive eigenvalues larger than the tolerance when calculating the inverse of the matrices and to compute Singular Value Decomposition using irlba
algorithm. Inputs should be correlation or covariance matrices of each data set and between datasets. This function returns only the first pair of canonical covariates.
standardCCA(S1, S2, S12, tol = 1e-04)
standardCCA(S1, S2, S12, tol = 1e-04)
S1 |
correlation/covariance matrix of dataset |
S2 |
correlation/covariance matrix of dataset |
S12 |
correlation/covariance matrix between dataset |
tol |
tolerance for eigenvalues. |
standardCCA
returns a data.frame containing
cancor: estimated canonical correlation.
w1: estimated canonical direction .
w2: estimated canonical direction .