Canonical Correlation Analysis with R

Canonical correlation analysis (CCA) is a multivariate statistical technique used to explore the relationship between two sets of variables. It is similar to correlation and regression, but it goes one step further: it finds a linear combination of the variables in each set such that the two combinations are maximally correlated with each other. Canonical correlation allows us to look at variables measured on different scales and identify the underlying relationship that explains their co-variation.

Correlation Matrix

All of that is a mouthful, but we can illustrate the technique in R using the mtcars dataset. Before digging into canonical correlation, though, it may be useful to first review the idea of plain correlation. Correlation is a statistical measure that describes the relationship between two variables: how they are related to each other and how strong that relationship is.

Correlation is often expressed as a correlation coefficient, which ranges from -1 to 1. A correlation coefficient of -1 indicates a perfect negative correlation, where the two variables move in opposite directions. A correlation coefficient of 0 indicates no correlation, where the two variables are not related to each other. A correlation coefficient of 1 indicates a perfect positive correlation, where the two variables move in the same direction.
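
As a toy illustration (these vectors are made up, not from mtcars), two sequences that move in lockstep have correlation 1, and reversing one of them flips the sign:

cor(1:10, 1:10 * 2)
[1] 1

cor(1:10, 10:1)
[1] -1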

To create a correlation matrix, we use the cor() function. The output of the correlation matrix shows the pairwise correlations between the variables in our dataset.

library(tidyverse)
data(mtcars)

# Create and print the correlation matrix
(corr_matrix <- cor(mtcars))
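
The full matrix is 11 by 11 and can be hard to scan in the console; rounding the values with base R's round() makes it easier to read:

# round the correlation matrix for readability
round(corr_matrix, 2)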

Mathematically, we can say that correlation is also the covariance divided by the product of the standard deviations, as we can see below, using the mpg and wt variables as examples.

cov(mtcars$mpg, mtcars$wt) / (sd(mtcars$mpg) * sd(mtcars$wt))
[1] -0.8676594
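
This matches what the built-in cor() function returns for the same pair of variables:

cor(mtcars$mpg, mtcars$wt)
[1] -0.8676594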

We can see in our correlation matrix that there is a strong negative correlation between the mpg variable and other variables such as cyl, disp, and wt. (The output above is dense, but the values are -0.85, -0.85, and -0.87, respectively.) This suggests that as these variables increase, mpg tends to decrease (hence the minus sign).

In the case of simple linear regression, with one independent and one dependent variable, the slope and intercept of the regression line can be calculated from these same quantities. By multiplying the correlation by the quotient of the standard deviation of mpg over the standard deviation of wt, we effectively put the correlation coefficient back on the scale of the data. Equivalently, we can think of the slope of the regression line as the covariance of mpg and wt over the variance of wt.

# slope as correlation rescaled by the ratio of standard deviations
r_slope <- cor(mtcars$mpg, mtcars$wt) * (sd(mtcars$mpg) / sd(mtcars$wt))

# equivalently, slope as covariance over the variance of the predictor
r_slope <- cov(mtcars$mpg, mtcars$wt) / var(mtcars$wt)

# intercept from the means and the slope
r_intercept <- mean(mtcars$mpg) - r_slope * mean(mtcars$wt)

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  ggtitle("Mtcars") +
  labs(subtitle = "mpg ~ wt") +
  geom_abline(intercept = r_intercept, slope = r_slope, color = "#1e466e")

Canonical Correlation

But what if we want to think about, for example, both mpg and qsec as separate but related measures of car performance? (mpg is fuel efficiency in miles per gallon, and qsec is quarter mile time in seconds.)

Enter canonical correlation: we start by splitting our mtcars dataset into two sections. We create one matrix that contains variables pertaining to car characteristics, and a second matrix that contains variables pertaining to car performance. The performance measures are miles per gallon (mpg) and quarter mile time in seconds (qsec), whereas the car characteristics matrix contains everything else. We can then use the cancor() function from the stats library to perform canonical correlation.

car_char <- mtcars %>% dplyr::select(-mpg, -qsec)

car_perf <- mtcars %>% dplyr::select(mpg, qsec)
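
A quick look at the dimensions confirms the split: 32 cars, with 9 characteristic variables in one matrix and the 2 performance variables in the other:

dim(car_char)
[1] 32  9

dim(car_perf)
[1] 32  2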

The cancor() function takes two arguments, just like the cor() function does. Unlike with correlation, however, instead of passing a vector for each argument, we pass a matrix for each argument. The cancor() function then performs a canonical correlation analysis between the two sets of variables.

(cc <- cancor(car_char, car_perf))

Here we look at the output. There’s a lot going on. We will take it one step at a time. To get this to fit more nicely in the console, I have cut off vs, am, gear, and carb.

$cor
[1] 0.9270377 0.8307044

$xcoef
              [,1]          [,2]          [,3]         [,4]          [,5]
cyl  -0.0499281753  0.0670144621  0.2294745588  0.022716673  0.1393809775
disp  0.0002925509  0.0005707853 -0.0004640581  0.002045239 -0.0041430468
hp   -0.0010711799  0.0013955682 -0.0041747403 -0.002529170  0.0006800028
drat  0.0028620032  0.1420736723  0.0458741499  0.488711843  0.1981653504
wt   -0.0697042950 -0.2556853374 -0.0725790308  0.074411093  0.3051552113

$ycoef
           [,1]       [,2]
mpg  0.02607841  0.0199184
qsec 0.02418249 -0.1080033

$xcenter
       cyl       disp         hp       drat         wt 
  6.187500 230.721875 146.687500   3.596563   3.217250 

$ycenter
     mpg     qsec 
20.09062 17.84875 

Let’s look at the parts one at a time.

cc$cor

[1] 0.9270377 0.8307044

The top part of the output, cc$cor, is a vector containing the canonical correlation coefficients between the two sets of variables. In this case, there are two coefficients: 0.9270377 and 0.8307044. These coefficients measure the strength of the relationship between the two sets of variables, with higher values indicating a stronger relationship.

In a canonical correlation analysis, there are usually multiple canonical correlation coefficients, not just one. The number of coefficients is equal to the minimum of the number of variables in each set. In our example, there are 9 variables in the car_char set and 2 variables in the car_perf set, so the cancor() function returns two canonical correlation coefficients.
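
We can verify this directly:

# one canonical correlation per dimension of the smaller set
length(cc$cor)
[1] 2

min(ncol(car_char), ncol(car_perf))
[1] 2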

Each canonical correlation coefficient represents the strength of the relationship between the two sets of variables, but for a different pair of linear combinations. Specifically, the first coefficient measures the strength of the relationship between the first linear combination of the variables in car_char and the first linear combination of the variables in car_perf, while the second coefficient measures the strength of the relationship between the second linear combination in car_char and the second linear combination in car_perf.

In our case, the first canonical correlation coefficient (CCC), 0.93, is a big number, and the second CCC, 0.83, is also a pretty big number. This means that the first pair of linear combinations is strongly correlated, and the second pair is also fairly strongly correlated. In other words, the car characteristics variables are highly related to the car performance variables, and we can use the car characteristics to predict car performance.

In more technical terms, the cancor() function is returning the correlations between pairs of canonical variates; the canonical variates themselves are the linear combinations of the two sets of variables that are maximally correlated with each other.
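
To make that concrete, here is a minimal sketch (the object names X, Y, U, and V are my own) that builds the canonical variates by hand from the centers and coefficients that cancor() returns, and checks that their pairwise correlations reproduce cc$cor:

# center each set using the centers cancor() reports
X <- scale(as.matrix(car_char), center = cc$xcenter, scale = FALSE)
Y <- scale(as.matrix(car_perf), center = cc$ycenter, scale = FALSE)

# canonical variates: centered data times the canonical coefficients
U <- X %*% cc$xcoef
V <- Y %*% cc$ycoef

cor(U[, 1], V[, 1])  # first canonical correlation, ~0.927
cor(U[, 2], V[, 2])  # second canonical correlation, ~0.831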

Because the goal of canonical correlation analysis is to find the linear combinations of two sets of variables that are maximally correlated, the first canonical correlation coefficient will generally be at least as high as the correlation between any individual variable in one set and any individual variable in the other. The reason is that canonical correlation analysis is designed to identify underlying relationships between two sets of variables that are not apparent from the individual variables alone. By identifying the combinations of variables that are most strongly related, canonical correlation analysis can provide insight into the underlying factors (shared variance) or dimensions that are driving the observed patterns of correlation.
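
We can check that claim on our data: the strongest single-variable cross-correlation between the two sets (wt with mpg, at -0.8676594) is smaller in magnitude than the first canonical correlation:

# largest cross-correlation between any characteristic and any performance measure
max(abs(cor(car_char, car_perf)))
[1] 0.8676594

cc$cor[1]
[1] 0.9270377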

Canonical correlation differs from regular correlation and regression in several ways. In regular correlation, we examine the relationship between two individual variables, whereas in canonical correlation we examine the relationship between two sets of variables, which may be measured on different scales. In regression, we aim to predict one variable from others, whereas in canonical correlation we aim to identify the underlying relationship between two sets of variables.

An interpretation of the results of the canonical correlation analysis on the mtcars dataset might be that there is a strong relationship between the first set of variables (car characteristics such as horsepower, weight, and number of cylinders) and the second set of variables (the performance measures mpg and qsec). This suggests, for example, that cars with higher horsepower, greater weight, and more cylinders tend to have lower miles per gallon. This interpretation is supported by the canonical coefficients, which show the weight and sign assigned to each variable in the linear combinations that form the canonical variates.