Some Simple Measures of Correlation

To implement the following measures of correlation, we use the ‘wine’ dataset from the HDclassif package. To whet your palate, we test for correlation between ‘alcohol’ and ‘proline’ (an amino acid) content, which are columns V1 and V13 of the dataset. Note that we use the correlation tests from the psych package rather than those in base R.


library(psych)
library(HDclassif)
data(wine)

 

Pearson’s Rho Correlation

The Pearson correlation coefficient is a measure of the linear dependence between two variables and takes values between −1 and 1. Here, 1 implies total positive linear correlation, 0 implies no linear correlation, and −1 implies total negative linear correlation. For a population, the Pearson correlation between variables X and Y is defined by

\rho_{XY}= \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}

where E denotes the expectation operator, \mu_X and \mu_Y are the means of X and Y, and \mathrm{Var} indicates variance. The sample correlation, which is the estimate of \rho_{XY} based on a sample of observations over T timesteps, is computed by

\hat{\rho}_{XY}= \frac{\sum_{t=1}^T(X_t-\bar{X})(Y_t-\bar{Y})}{\sqrt{\sum_{t=1}^T(X_t-\bar{X})^2\sum_{t=1}^T(Y_t-\bar{Y})^2}}

where (X_t, Y_t) is the pair of observed X and Y values at timestep t, t = 1, …, T, and \bar{X} and \bar{Y} are the averages of the observed X and Y values, respectively. The major caveat of the Pearson correlation coefficient is the set of assumptions it makes about the relationship between X and Y. Firstly, it supposes that X and Y are continuous variables, which is a practical limitation. Secondly, X and Y must be paired observations; in other words, for each X_t there is a Y_t. Thirdly, there are no outliers that skew the relationship in either direction. Fourthly, and arguably most critically, the relationship is linear: when plotting X against Y, a given increase in X corresponds to a proportional increase or decrease in Y. Finally, the relationship is homoscedastic, meaning that its variance is constant: when viewing the scatterplot of X against Y, the dispersion should be consistent across all values of X.

To perform in R:


corr.test(wine[, c("V1","V13")], method='pearson')

Which outputs:


Call:corr.test(x = wine[, c("V1", "V13")], method = "pearson")
Correlation matrix
      V1  V13
V1  1.00 0.64
V13 0.64 1.00
Sample Size
[1] 178
Probability values (Entries above the diagonal are adjusted for multiple tests.)
    V1 V13
V1   0   0
V13  0   0
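To connect the sample formula to the software output, the following is a minimal sketch that evaluates the Pearson estimate term by term and compares it with base R's cor(). The vectors x and y are hypothetical toy values chosen for illustration, not the wine measurements, and pearson_manual is a helper name introduced here.

```r
# Hypothetical toy data (assumption: illustrative values, not from the wine dataset)
x <- c(12.1, 13.4, 12.9, 14.2, 13.0, 11.8, 14.8, 13.7)
y <- c(520, 780, 735, 1050, 690, 450, 990, 1280)

# Sample Pearson correlation written out from its definition:
# sum of cross-deviations over the square root of the product
# of the two sums of squared deviations
pearson_manual <- function(x, y) {
  dx <- x - mean(x)
  dy <- y - mean(y)
  sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

r_manual  <- pearson_manual(x, y)
r_builtin <- cor(x, y, method = "pearson")

# The two values should agree up to floating-point error
print(c(manual = r_manual, builtin = r_builtin))
```

The same helper could be applied to wine$V1 and wine$V13 once the dataset is loaded.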

 

Spearman’s Rank Correlation

Spearman rank correlation is a nonparametric measure of association between variables measured on an ordinal, interval, or ratio scale that exhibit a monotonic relationship. For interval or ratio data (i.e. continuous), it differs from Pearson in that the relationship between variables need be neither linear nor homoscedastic. Otherwise, the same general assumptions about X and Y apply. The benefit of the Spearman measure is how little it imposes on the relationship: it assumes only that a monotonic relationship between X and Y exists. In other words, either the two variables increase together, or one decreases as the other increases. The magnitude by which this occurs is not assumed, because the measure is based on the ranks of the data values in each variable. When there are no tied ranks, Spearman’s rank correlation is calculated as:

\rho_{XY}=1-\frac{6 \sum_{i=1}^n d_i^2}{n(n^2-1)}

where d_i is the difference between the rank of x_i and the rank of y_i, and n is the number of observations. Spearman's rank correlation is satisfactory for determining whether two variables are related, but it can be difficult to interpret intuitively because it is built on squared rank differences. Kendall's rank correlation improves upon this by reflecting the strength of the dependence between the variables being compared without squaring the size of each discordance.

To perform in R:


corr.test(wine[, c("V1","V13")], method='spearman')

Which outputs:


Call:corr.test(x = wine[, c("V1", "V13")], method = "spearman")
Correlation matrix
      V1  V13
V1  1.00 0.63
V13 0.63 1.00
Sample Size
[1] 178
Probability values (Entries above the diagonal are adjusted for multiple tests.)
    V1 V13
V1   0   0
V13  0   0
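As a rough check of the rank formula, the sketch below computes Spearman's coefficient from rank differences and compares it with cor(). It assumes there are no ties, which is the case in which the closed-form expression applies; the data are hypothetical toy values and spearman_manual is a helper name introduced here.

```r
# Hypothetical toy data with no tied values (assumption: illustrative only)
x <- c(12.1, 13.4, 12.9, 14.2, 13.0, 11.8, 14.8, 13.7)
y <- c(520, 780, 735, 1050, 690, 450, 990, 1280)

# Spearman's rho from the rank-difference formula (valid without ties)
spearman_manual <- function(x, y) {
  stopifnot(!anyDuplicated(x), !anyDuplicated(y))  # formula assumes no ties
  d <- rank(x) - rank(y)   # d_i: difference between the two ranks
  n <- length(x)
  1 - 6 * sum(d^2) / (n * (n^2 - 1))
}

rho_manual   <- spearman_manual(x, y)
rho_builtin  <- cor(x, y, method = "spearman")
rho_on_ranks <- cor(rank(x), rank(y), method = "pearson")

# All three should agree: Spearman is Pearson applied to the ranks
print(c(manual = rho_manual, builtin = rho_builtin, on_ranks = rho_on_ranks))
```

With tied values, such as may occur in the wine columns, the shortcut formula is no longer exact, so computing Pearson on the ranks is the safer route.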

 

Kendall’s Tau Correlation

Kendall's Tau correlation is another nonparametric test of correlation and a measure of the strength of dependence between two variables. Consider n paired observations (x_1, y_1), …, (x_n, y_n) of X and Y. The total number of distinct pairs of observations is n(n-1)/2. For two observations i and j, the pair is concordant if the ordering on X agrees with the ordering on Y, that is, (x_i - x_j)(y_i - y_j) > 0, and discordant if the orderings disagree, that is, (x_i - x_j)(y_i - y_j) < 0. The numerator of Tau is the difference between the number of concordant (ordered in the same way, n_c) and discordant (ordered differently, n_d) pairs.

Tau is given by:

\tau=\frac{n_c-n_d}{n(n-1)/2}

If there are tied (same value) observations then the following formula is used:

\tau_{tie}=\frac{n_c-n_d}{\sqrt{[n(n-1)/2-\sum_{i=1}^t t_i(t_i-1)/2][n(n-1)/2-\sum_{i=1}^u u_i(u_i-1)/2]}}

where t_i is the number of observations tied at the i-th tied rank of X and u_i is the number tied at the i-th tied rank of Y. In summary, Kendall’s Tau penalizes disordered pairs by the distance of their disorder, whereas Spearman’s Rho penalizes them by the squared distance of their disorder.

To perform in R:


corr.test(wine[, c("V1","V13")], method='kendall')

Which outputs:


Call:corr.test(x = wine[, c("V1", "V13")], method = "kendall")
Correlation matrix
      V1  V13
V1  1.00 0.45
V13 0.45 1.00
Sample Size
[1] 178
Probability values (Entries above the diagonal are adjusted for multiple tests.)
    V1 V13
V1   0   0
V13  0   0
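To make the counting of concordant and discordant pairs concrete, here is a minimal sketch that enumerates all n(n-1)/2 pairs and forms Tau from the counts. It assumes no ties, so the simple (untied) formula applies and should match cor() with method = "kendall"; the data are hypothetical toy values and kendall_manual is a helper name introduced here.

```r
# Hypothetical toy data with no tied values (assumption: illustrative only)
x <- c(12.1, 13.4, 12.9, 14.2, 13.0, 11.8, 14.8, 13.7)
y <- c(520, 780, 735, 1050, 690, 450, 990, 1280)

# Kendall's tau by brute-force pair counting (untied formula)
kendall_manual <- function(x, y) {
  n <- length(x)
  idx <- combn(n, 2)  # 2 x n(n-1)/2 matrix: every pair of observations (i, j)
  # +1 if the pair is ordered the same way on x and y, -1 if ordered differently
  s <- sign(x[idx[1, ]] - x[idx[2, ]]) * sign(y[idx[1, ]] - y[idx[2, ]])
  n_c <- sum(s > 0)   # concordant pairs
  n_d <- sum(s < 0)   # discordant pairs
  (n_c - n_d) / (n * (n - 1) / 2)
}

tau_manual  <- kendall_manual(x, y)
tau_builtin <- cor(x, y, method = "kendall")

print(c(manual = tau_manual, builtin = tau_builtin))
```

The brute-force enumeration is quadratic in n, which is fine for a small illustration but not something we would use on large samples.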

 

As a rank-based measure, Kendall’s Tau is arguably superior to Spearman’s Rho from an interpretation perspective (Newson, 2002). In spite of this, Spearman and Kendall are intimately related. Letting \tau \rightarrow 1 in (5) of (Xu et al., 2010), we obtain for large n:

 3 \tau-2 \rho \leq 1\Rightarrow 3-2 \rho \leq 1\Rightarrow\rho \geq 1 \Rightarrow\rho=1

since \rho \leq 1. Hence, asymptotically, when Kendall’s Tau is equal to 1, Spearman’s Rho is also equal to 1.
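The link between the two rank measures can be illustrated with a small sketch: for a strictly increasing but nonlinear relationship, every pair is concordant, so both Kendall's Tau and Spearman's Rho reach 1 while Pearson stays below 1. The exponential curve used here is an arbitrary assumption chosen only because it is monotonic and clearly nonlinear.

```r
# Hypothetical monotonic, nonlinear relationship (assumption: illustrative only)
x <- 1:20
y <- exp(x / 4)   # strictly increasing in x, far from a straight line

tau <- cor(x, y, method = "kendall")
rho <- cor(x, y, method = "spearman")
r   <- cor(x, y, method = "pearson")

# Both rank measures hit their upper bound together; Pearson does not
print(c(kendall = tau, spearman = rho, pearson = r))
```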

While the threshold for judging the strength of a correlation differs depending on the application and correlation measure, it is common to use \pm 0.7 as a cut-off for strong correlation.

 

[1] Xu, Weichao; Hou, Yunhe; Hung, Y. S.; Zou, Yuexian. Comparison of Spearman's rho and Kendall's tau in Normal and Contaminated Normal Models. arXiv:1011.2009, 2010.

[2] Newson, R. Parameters behind "nonparametric" statistics: Kendall's tau, Somers' D and median differences. Stata Journal 2002; 2(1):45-64.