David Lovell, Vera Pawlowsky-Glahn, Juan José Egozcue, Samuel Marguerat, Jürg Bähler
When trying to understand whether one quantity covaries with another, the standard tools that come to mind are correlation coefficients. But consider three statistically independent variables X, Y and Z. Plotting a large number of samples of X vs Y, Y vs Z or X vs Z will indeed give small correlation coefficients. However, the quantities X/Z and Y/Z will be correlated because of their common divisor, which can mislead us into believing that X is correlated with Y, which we know is untrue (this is clearly shown in Fig. 1A). Thus, if we are interested in relationships between X and Y, searching for correlation between X/Z and Y/Z can be misleading. Note that this is only a concern when Z is a random variable with a large enough variance. If Z is constant across experiments, then our intuition for correlation coefficients is restored. The lesson is 'correlation between relative abundances is meaningless, if we are using different normalisations for each condition'.
This statistical trap is easy to overlook, as it is commonplace to search for correlations in quantities which are normalised (say mRNA of gene 1/total mRNA, i.e. X/Z). The authors highlight that if X/Z and Y/Z are proportional across samples, then X must be proportional to Y, because the common divisor Z cancels in the ratio. They therefore suggest a 'goodness-of-fit to proportionality' as a more appropriate statistic when searching for covariation in relative abundances. This is defined as ϕ = var(log(A/B))/var(log A), where A = X/Z and B = Y/Z. ϕ is zero when A and B are perfectly proportional, and it grows as the ratio A/B varies more from sample to sample.
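To make the definition concrete, here is a minimal sketch of ϕ in plain Python. The function name phi and the toy numbers are my own, not from the paper, and I use the population variance purely for convenience (the normalisation cancels in the ratio). All values must be positive, since we take logs.

```python
import math
from statistics import pvariance


def phi(a, b):
    """Goodness-of-fit to proportionality: var(log(a/b)) / var(log a).

    a and b are equal-length sequences of positive numbers.
    """
    log_ratio = [math.log(ai / bi) for ai, bi in zip(a, b)]
    log_a = [math.log(ai) for ai in a]
    return pvariance(log_ratio) / pvariance(log_a)


# Toy data (hypothetical): b_proportional is an exact multiple of a,
# b_scrambled has the same kind of values but no proportional relationship.
a = [1.0, 2.0, 3.0, 4.0]
b_proportional = [3.0 * v for v in a]
b_scrambled = [4.0, 1.0, 3.0, 2.0]

print("proportional:", phi(a, b_proportional))
print("scrambled:   ", phi(a, b_scrambled))
```

Note that, as defined, ϕ is not symmetric: swapping A and B changes the denominator, so phi(a, b) and phi(b, a) generally differ.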
Update: For enthusiasts!
Let's use some Monte Carlo to test this out! Here Unif(p,q) denotes a uniform distribution with minimum p and maximum q. I generated draws from three independent uniform random variables: X ~ Unif(1,2), Y ~ Unif(4,8), Z ~ Unif(5,300). We see that none of the variables correlate with each other. However, when we create the new variables A = X/Z and B = Y/Z, we see a striking correlation (0.95). So one cannot claim that X is correlated with Y just because A is correlated with B.
This is a particularly pathological example, since Z has a huge variance. This simulation yielded ϕ=0.11. Check out the comments on this post to see some back-and-forth between David and me on this.
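For anyone who wants to try this themselves, the simulation can be sketched roughly as follows, using only the Python standard library. The sample size, the seed and the helper names (pearson, variance, simulate) are my assumptions for illustration; the exact numbers will wobble a little from run to run and with different settings.

```python
import math
import random


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((ui - mean_u) * (vi - mean_v) for ui, vi in zip(u, v))
    ss_u = sum((ui - mean_u) ** 2 for ui in u)
    ss_v = sum((vi - mean_v) ** 2 for vi in v)
    return cov / math.sqrt(ss_u * ss_v)


def variance(u):
    """Population variance (the normalisation cancels in phi)."""
    mean_u = sum(u) / len(u)
    return sum((ui - mean_u) ** 2 for ui in u) / len(u)


def simulate(n=10_000, seed=0):
    """Draw independent X, Y, Z and compare raw vs ratio statistics."""
    rng = random.Random(seed)  # assumed seed, for reproducibility only
    x = [rng.uniform(1, 2) for _ in range(n)]
    y = [rng.uniform(4, 8) for _ in range(n)]
    z = [rng.uniform(5, 300) for _ in range(n)]

    # Ratios that share the common divisor Z
    a = [xi / zi for xi, zi in zip(x, z)]
    b = [yi / zi for yi, zi in zip(y, z)]

    # phi = var(log(A/B)) / var(log A)
    log_ratio = [math.log(ai / bi) for ai, bi in zip(a, b)]
    log_a = [math.log(ai) for ai in a]

    return {
        "corr_xy": pearson(x, y),
        "corr_ab": pearson(a, b),
        "phi": variance(log_ratio) / variance(log_a),
    }


result = simulate()
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Shrinking the range of Z (say Unif(100,101)) is a quick way to see the spurious correlation between A and B fade away, in line with the point above about Z being nearly constant.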