The Cherriest Pick

PUBLISHED ON JUN 1, 2017 — R

We discussed a paper in journal club today (Tan et al. 2017), where three outliers were removed from a plot with just seven data points. The authors were happy to report a correlation coefficient (r) of 0.99. Here is their Fig. 2b (doesn’t matter what’s on the axes), where ‘outliers’ were labelled with open symbols.

What value of r do we expect between two normally distributed variables (where n=7), when we remove three points so that r is maximal?

# Take 7 normally distributed points, try all combinations where we choose 4
# points, and select the one with the highest correlation.
cherrypick <- function(plotit=FALSE){
  x <- rnorm(7, mean = 28, sd = 1)
  y <- rnorm(7, mean = 26, sd = 1)
  
  m <- combn(7, 4)
  cr <- apply(m, 2, function(i)cor(x[i],y[i]))
  
  if(plotit){ 
    ii <- m[,which.max(abs(cr))]
    
    plot(x,y)
    points(x[ii], y[ii], pch=19)
    abline(lm(y[ii] ~ x[ii]), lty=5)
  } else {
    return(max(abs(cr)))
  }
}

A few examples with the above function, where the regression line uses the four solid points (and the remaining 3 are ‘outliers’).

set.seed(1234)
par(mfrow=c(3,3), mar=c(2.2,2.2,0.2,0.2), 
    cex.axis=0.8,
    cex.lab=0.8, mgp=c(1.2,0.4,0), tcl=0.2)
for(i in 1:9)cherrypick(plotit=TRUE)

Now we simulate ten thousand times and get the distribution of the cherry-picked r (absolute values).

picks <- replicate(10^4, cherrypick())

hist(picks, breaks=100, col="darkgrey", border="darkgrey",
     xlab="Correlation coefficient n=4", main="")

Clearly high values of r are very likely. In this simulation, 87% of simulated r values were above 0.9, and 66.5% above 0.95.