The Cherriest Pick

PUBLISHED ON JUN 1, 2017 — R

We discussed a paper in journal club today (Tan et al. 2017), where three outliers were removed from a plot with just seven data points. The authors were happy to report a correlation coefficient (r) of 0.99. Here is their Fig. 2b (doesn’t matter what’s on the axes), where ‘outliers’ were labelled with open symbols. What value of r do we expect between two normally distributed variables (where n=7), when we remove three points so that r is maximal?

# Take 7 normally distributed points, try all combinations where we choose 4
# points, and select the one with the highest correlation.
cherrypick <- function(plotit=FALSE){
x <- rnorm(7, mean = 28, sd = 1)
y <- rnorm(7, mean = 26, sd = 1)

m <- combn(7, 4)
cr <- apply(m, 2, function(i)cor(x[i],y[i]))

if(plotit){
ii <- m[,which.max(abs(cr))]

plot(x,y)
points(x[ii], y[ii], pch=19)
abline(lm(y[ii] ~ x[ii]), lty=5)
} else {
return(max(abs(cr)))
}
}

A few examples with the above function, where the regression line uses the four solid points (and the remaining 3 are ‘outliers’).

set.seed(1234)
par(mfrow=c(3,3), mar=c(2.2,2.2,0.2,0.2),
cex.axis=0.8,
cex.lab=0.8, mgp=c(1.2,0.4,0), tcl=0.2)
for(i in 1:9)cherrypick(plotit=TRUE) Now we simulate ten thousand times and get the distribution of the cherry-picked r (absolute values).

picks <- replicate(10^4, cherrypick())

hist(picks, breaks=100, col="darkgrey", border="darkgrey",
xlab="Correlation coefficient n=4", main="") Clearly high values of r are very likely. In this simulation, 87% of simulated r values were above 0.9, and 66.5% above 0.95.