Careless can calculate CChalf and CCpred using either the Pearson or Spearman correlation coefficient. I know that the Pearson CC is the one used in the standard crystallographic rule of setting the high-resolution cutoff at the point where the CC drops below 0.3, but I found this in the original careless paper: “The choice of summary statistic is up to the user. However, we recommend Spearman’s rank correlation coefficient as a robust alternative to Pearson’s.” I don’t have much intuition here - is there a way to tell which statistic is most useful for understanding CChalf and/or CCpred for a given dataset?
The community has historically used Pearson’s correlation coefficients for measures like CC1/2, CCanom, and CCwork/free. In my opinion, the advantage of Pearson’s correlation coefficient is that it can be easily modified to factor in the uncertainty of the data. The advantage of Spearman’s CC is that it is less sensitive to outliers. In careless, I do use the uncertainty of the reflection observation (aka SIGIOBS) to weight each reflection’s contribution to the Pearson CCpred. I use this implementation of weighted Pearson CCs in careless.ccpred here.
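For concreteness, here is a minimal sketch of a weighted Pearson CC of the kind described above. This is not careless’s actual code; it just illustrates the idea, assuming inverse-variance weights like 1/SIGIOBS².

```python
import numpy as np

def weighted_pearson_cc(x, y, w):
    """Weighted Pearson correlation coefficient.

    x, y : paired intensity estimates
    w    : non-negative weights, e.g. 1 / SIGIOBS**2
    """
    w = np.asarray(w, dtype=float)
    mx = np.average(x, weights=w)
    my = np.average(y, weights=w)
    cov_xy = np.average((x - mx) * (y - my), weights=w)
    var_x = np.average((x - mx) ** 2, weights=w)
    var_y = np.average((y - my) ** 2, weights=w)
    return cov_xy / np.sqrt(var_x * var_y)
```

With uniform weights this reduces exactly to the ordinary (unweighted) Pearson CC, which is a useful sanity check.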
I think which measure you choose depends on how much you trust your data. If you think there are likely to be relatively few outliers and that the uncertainty estimates from integration are credible, I would feel comfortable using either metric. I’ll note that unlike most scaling packages, careless doesn’t do any outlier rejection. In light of this, I think Spearman is the safer choice, but the historical precedent is for Pearson which is why it is the default in careless.ccpred.
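To make the outlier sensitivity concrete, here is a small demonstration on synthetic (non-crystallographic) data: a single gross outlier pair, of the kind that survives when no outlier rejection is done, collapses the Pearson CC while leaving Spearman nearly unchanged.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = x + 0.1 * rng.normal(size=200)     # tightly correlated pairs

# one gross outlier pair appended to otherwise clean data
x_out = np.append(x, 50.0)
y_out = np.append(y, -50.0)

pearson_clean = pearsonr(x, y)[0]          # close to 1
pearson_dirty = pearsonr(x_out, y_out)[0]  # collapses
spearman_dirty = spearmanr(x_out, y_out)[0]  # barely moves
```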
Okay - would it be reasonable to use the Pearson correlation coefficient for CC1/2 (following historical precedent) and the Spearman one for CCpred (if I’m still trying to figure out how many outliers there are/aren’t), then? Or does the lack of outlier rejection mean that the Spearman is likely the safer choice for CC1/2 as well?
Thanks!
There is another important criterion, besides robustness in the face of outliers, for the choice between Pearson’s and Spearman’s correlation coefficient:
Pearson should be used for linear relationships, whereas Spearman only requires a monotonic relationship.
In crystallographic data processing, outliers arising from shadows (too low) or overloads or zingers (too high) can and should be discarded in any case. Outliers identified by comparison with symmetry equivalents do not usually have extreme values. Furthermore, the relation between intensity estimates is linear. Thus, Pearson is adequate for both CChalf and CCpred.
One caveat: comparing CC* (calculated as sqrt(2*CChalf/(1+CChalf))) with CCpred is an under-appreciated way of identifying overfitting, or inadequate models. That comparison, however, is only sensible if both CC calculations use the same type of correlation coefficient, and if both use the same type of weighting, or no weighting.
If CChalf uses no weighting (as do all the programs I know), then CCpred should also not use weighting. Otherwise, CC* should not be compared with CCpred.
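For reference, the CC* relation mentioned above as a one-line helper. This is just a sketch of the published formula, not code from any of the programs discussed:

```python
import numpy as np

def cc_star(cc_half):
    """CC* = sqrt(2*CC1/2 / (1 + CC1/2)), Karplus & Diederichs (2012).

    Estimates the correlation of the merged data with the unknown true
    signal; CCpred falling well below CC* (computed with matching
    weighting conventions) can flag overfitting or model inadequacy.
    """
    cc_half = np.asarray(cc_half, dtype=float)
    return np.sqrt(2.0 * cc_half / (1.0 + cc_half))
```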
@mklureza, and following up on @KayDiederichs: all the “stats” subprograms in careless (careless.{cchalf,ccpred,ccanom}, etc.) use the weighted variant of the Pearson correlation. They are all internally consistent, but perhaps not consistent with other programs, as Kay notes. From personal communications with Wolfgang Kabsch, XDS uses weights for CCanom in its reports. I certainly trust Kay to know whether any weights are used for CC1/2 and CC*.
If it would be helpful, I could make all three variants accessible on the command line:
- Pearson CC
- Weighted Pearson CC
- Spearman CC

@mklureza, let me know, and I’ll put it on my to-do list!
I like Kay’s point about Spearman quantifying non-linear, monotonic relationships. In principle this is not what we really want - what we really want is a robust, linear correlation estimator. I could look into using the minimum covariance determinant (MCD) estimator (see “2.6. Covariance estimation” in the scikit-learn documentation). Unfortunately, the scikit-learn MCD implementation grinds to a halt after about 100k examples. Alternatively, I could use a heuristic to reject outliers in the CCpred calculation, based on the predictive or observed variance. I could also try including the predictive variance in the weights, which would downweight outliers that careless “knows” aren’t well-modeled. I’d love @KayDiederichs’s take on all of this.
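For illustration, a sketch of what a robust linear CC via scikit-learn’s MinCovDet might look like. This is hypothetical and not implemented in careless, and as noted it scales poorly past roughly 100k observations:

```python
import numpy as np
from sklearn.covariance import MinCovDet

def robust_cc(x, y, random_state=0):
    """Robust linear correlation via the minimum covariance determinant.

    Fits the robust 2x2 covariance of the (x, y) pairs, then normalizes
    it to a correlation coefficient; outlying pairs are automatically
    excluded from the covariance estimate.
    """
    X = np.column_stack([np.asarray(x), np.asarray(y)])
    cov = MinCovDet(random_state=random_state).fit(X).covariance_
    return cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])
```

On data with a single gross outlier pair, this estimator stays near the true correlation where the plain Pearson CC collapses.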
I confirm that, as you say, XDS uses a weighted Pearson correlation coefficient for the anomalous signal. This is because the anom signal is so noisy that it makes sense to utilize the additional information that the sigmas deliver. Somewhat unfortunately, this fact is to my knowledge not documented anywhere (except here, now).
On the other hand, CChalf in XDS is calculated as the plain (unweighted) Pearson correlation coefficient. This is because it was defined that way in the publications (starting with Karplus&Diederichs 2012), and XDS typically sticks to published definitions.
I’d appreciate a nomenclature in careless like “weighted CChalf” when using the weighted Pearson, and “CChalf” when using the unweighted Pearson, for consistency with other programs and the literature - along with an implementation (and default) of the unweighted Pearson. Perhaps there is not even a need for a choice on the command line (as you suggest) - see below.
Robust estimator: After witnessing that it takes a decade to introduce a new quantity (CChalf) into MX, I am no longer a fan of introducing “new” quantities (like MCD) unless there is a very clear reason for doing so - there are just too many possibilities, and things become arbitrary. On the other hand, there’s nothing wrong with showing values of uncommon indicators side-by-side with those of common indicators - just add another column in the careless output table that shows the common indicators. In some cases, in this way one might see and understand phenomena that only the uncommon indicators capture. Or it may turn out in the long run that nothing interesting can be learnt from the uncommon indicators.
Thanks for the suggestions, Kay! I’ll update the nomenclature in careless.cchalf/ccanom to match XDS. I’ll also modify the stats subprograms to report all three variants with clear labels.
To circle back to @mklureza’s question: in the context of CCpred, the data can be noisy (as with CCanom), so I think it makes sense to consider the weighted estimator. An open question is whether using the predictive variance in the weights could be beneficial. This is possible because careless predicts a distribution for each reflection observation. Last I checked, the CLI records the mean and standard deviation of this “posterior distribution”.
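As a sketch of that open question, here is one hypothetical way the predictive variance could enter the weights: add it to the observed variance in inverse-variance weights, so reflections the model fits poorly contribute less. The names `sig_obs`/`sig_pred` are illustrative, and this is not current careless behavior:

```python
import numpy as np

def combined_weights(sig_obs, sig_pred):
    """Inverse-variance weights combining the observed uncertainty
    (SIGIOBS) with the posterior predictive standard deviation.
    A large sig_pred downweights reflections the model fits poorly."""
    return 1.0 / (np.asarray(sig_obs) ** 2 + np.asarray(sig_pred) ** 2)

def weighted_cc(x, y, w):
    """Weighted Pearson CC via numpy's weighted covariance."""
    cov = np.cov(x, y, aweights=w)
    return cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])
```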