David Lovell, Vera Pawlowsky-Glahn, Juan José Egozcue, Samuel Marguerat, Jürg Bähler

When trying to understand whether one quantity covaries with another, the standard tools that come to mind are correlation coefficients. But consider three statistically independent variables, X, Y and Z, which by construction are uncorrelated. Plotting a large number of samples of X vs Y, Y vs Z or X vs Z will indeed give small correlation coefficients. However, the quantities X/Z and Y/Z must be correlated due to their common divisor, which can mislead us into believing that X is correlated with Y, which we know is untrue (this is shown clearly in Fig. 1A). Thus, if we are interested in relationships between X and Y, searching for correlation between X/Z and Y/Z can be misleading. It should be noted that this is only a concern when Z is a random variable with a large enough variance: if Z is constant across experiments, then our intuition for correlation coefficients is restored. The lesson is 'correlation between relative abundances is meaningless, *if we are using different normalisations for each condition*'.

This statistical trap is easy to overlook, as it is commonplace to search for correlations in quantities which are normalised (say, mRNA of gene 1/total mRNA, i.e. X/Z). The authors highlight that if X/Z and Y/Z are proportional across each sample, then X must be proportional to Y. They therefore suggest a 'goodness-of-fit to proportionality' as a more appropriate statistic when searching for covariation in relative abundances. This is defined as ϕ = var(log(*A*/*B*))/var(log *A*), where A = X/Z and B = Y/Z. ϕ is zero when A and B are perfectly proportional.
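As a quick sketch, ϕ can be computed directly from the definition above (a hypothetical helper of my own, not code from the paper):

```python
import numpy as np

def phi(a, b):
    """Goodness-of-fit to proportionality: var(log(A/B)) / var(log A).

    Equals zero when a and b are perfectly proportional, since then
    log(a/b) is constant and its variance vanishes.
    """
    log_a, log_b = np.log(a), np.log(b)
    return np.var(log_a - log_b) / np.var(log_a)
```

For example, `phi(a, 3 * a)` is exactly zero for any positive `a`, because multiplying by a constant only shifts the log-ratio.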

------------------------------------

Update: For enthusiasts!

Let's use some Monte Carlo to test this out! Writing Unif(p, q) for a uniform distribution with minimum p and maximum q, I have generated draws from three uniform random variables: X ~ Unif(1, 2), Y ~ Unif(4, 8), Z ~ Unif(5, 300). We see that none of the variables correlate with each other. However, when we create new variables A = X/Z and B = Y/Z, we see a striking correlation (0.95). So one cannot claim that X is correlated with Y just because A is correlated with B.

This is a particularly pathological example, since Z has a huge variance. This simulation yielded ϕ = 0.11. Check out the comments on this post to see some back-and-forth between David and me on this.
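The simulation can be reconstructed in a few lines of NumPy. This is a minimal sketch of my own; the sample size and seed are arbitrary choices, so the numbers will differ slightly from those quoted above:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Three statistically independent uniform variables.
x = rng.uniform(1, 2, n)
y = rng.uniform(4, 8, n)
z = rng.uniform(5, 300, n)

# Relative abundances sharing the common divisor Z.
a, b = x / z, y / z

print(np.corrcoef(x, y)[0, 1])  # close to zero: X and Y are independent
print(np.corrcoef(a, b)[0, 1])  # large -- the post reports 0.95

# Goodness-of-fit to proportionality, var(log(A/B)) / var(log A).
phi = np.var(np.log(a / b)) / np.var(np.log(a))
print(phi)  # the post reports 0.11
```

Note that log(A/B) = log(X/Y) here, so the numerator of ϕ does not involve Z at all; Z enters only through var(log A) in the denominator.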

Nice simulation, Juvid.

Actually, ϕ is a bit different to var(log(A/B)) for precisely the reason you point out (that it does not have a meaningful scale)

If you go to the end of the section "Measuring Proportionality" you will see that

ϕ(log x, log y) = var(log(x/y))/var(log x).

(...strictly speaking, you should use clr() instead of log(), but I don't think that'll make a big difference here).

I'd love to know what you get for ϕ now. Cheers, David

Hi David, thanks for your comment! I've removed that part of the post for now, and amended the definition of ϕ (sorry about that!).

Using my notation of A = X/Z, B = Y/Z (where we only have access to A and B and not the underlying distributions X, Y and Z), would it be fair to say that ϕ = var(log(A/B))/var(log(A))? If so, my simulation yields ϕ = 0.11 (I could get a confidence interval on that by bootstrapping, if you think that would help).

You have used the cut-off ϕ < 0.05 in your paper, so my value of ϕ = 0.11 appears to be sufficient to declare that A is not proportional to B. Is this correct?

Dear Juvid, sorry not to respond sooner: it is not lack of interest but abundance of commitments.

Your simulations are really making me think! Let me share my thoughts, starting from your last point about the cut-off of ϕ<0.05.

I don't believe that we should declare a hard and fast cut-off for statistics that measure association. I think the utility of statistics like ϕ, correlation, etc lies in helping us to _explore_ potential relationships, rather than adjudicate as to whether the relationships exist or not. So, on that point, I would encourage analysts to look at the pairs of data that have low ϕ values as an indicator of potential relationships. I think it is useful to look at Anscombe's quartet (http://en.wikipedia.org/wiki/Anscombe%27s_quartet) every now and again, to remind ourselves of the peril of relying solely upon statistics to summarise the relationships in data.

Turning back to your simulation, my suspicion is that the ϕ statistic is no panacea for spurious correlation. As you say, the variable Z has much higher variance than X or Y. My guess is that if you increase the variance of Z further still, the ϕ statistic will get even lower. Remember too that the ϕ statistic is related to the slope and goodness of fit (correlation!) of logarithmically transformed data. And with respect to the goodness of fit, it will be prone to the same kinds of issues as illustrated in Anscombe's quartet.

Thanks again for your simulations. And for making me think very hard! If I have any new insights I will look to share them with you.

Cheers, David

Brilliant! Thank you for getting back to us. I've updated the post, and hope you find it accurate. I think the conversation above really underscores the subtleties in thinking about this area. Please do get in touch with any further thoughts you may have! Juvid

Well, I took a leaf out of your book, Juvid, and did some simulations to explore the impact of variation in Z on ϕ(log x, log y) and on ϕ(clr x, clr y)

...and I realised that the clr() transformation is very important because it makes ϕ independent of variation in Z. So, I think I have to revise some of my suspicions in previous posts!
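One way to explore David's observation is to apply the centred log-ratio, clr(xᵢ) = log xᵢ minus the per-sample mean of the logs, and recompute ϕ. The sketch below is my own (the paper typically applies clr over many more parts than the three used here), and one useful identity falls out immediately: clr differences reduce to plain log-ratios, so only the denominator of ϕ changes:

```python
import numpy as np

def clr(parts):
    """Centred log-ratio: log of each part minus the per-sample mean
    of the logs.  `parts` has shape (n_samples, n_parts)."""
    logs = np.log(parts)
    return logs - logs.mean(axis=1, keepdims=True)

rng = np.random.default_rng(1)
n = 10_000
x = rng.uniform(1, 2, n)
y = rng.uniform(4, 8, n)
z = rng.uniform(5, 300, n)

c = clr(np.column_stack([x, y, z]))
cx, cy = c[:, 0], c[:, 1]

# clr(x) - clr(y) == log(x) - log(y): the per-sample mean cancels,
# so the numerator of phi is the same as in the log version.
phi_clr = np.var(cx - cy) / np.var(cx)
print(phi_clr)
```

Varying the width of Z's distribution in this sketch is a direct way to probe how much of its variance still leaks into var(clr x) in the denominator.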

I have a 2-page PDF describing this additional exploration... is there a way I could share that with readers of your blog??

If you can post a URL then I think that's the best way. You might be able to do this by following the steps in this article http://www.mybloggerlab.com/2013/03/how-to-embed-pdf-and-other-documents-in-blogger-posts.html
