Modeling uncertainty of a reference structure in scaling

dorismai · June 25, 2025, 11:43pm

It is common to scale datasets with respect to a reference structure. What would be some appropriate modeling choices for the uncertainty of this reference structure? That is, given $F_c$, are there reasonable approaches to model $\text{Sig}F_c$ without access to $F_o$ or $\text{Sig}F_o$?

I am asking because @kmdalton and I are trying to treat the reference structure as a prior distribution in a Bayesian scaling/merging framework. I am curious about your thoughts, @KayDiederichs and @afadini, and input from anyone (including warnings against bad approaches) is very welcome!

KayDiederichs · June 26, 2025, 12:18am

Hi Doris! Probably an idea that is worth testing! The zeroth-order approach to assigning sigma would be a fraction of the amplitude. I’d choose 20%, which mirrors a typical R-value. Of course that approach would underestimate the sigma for weak reflections and overestimate it for strong ones, but I currently fail to come up with a still simple but more sensible approach. Would it matter much?
What reviewers would say is that this type of scaling aid might introduce model bias. So this argument must be kept in mind. But it would certainly be helpful for low-symmetry partial datasets!
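
A minimal sketch of the fixed-fraction idea above, assuming the calculated amplitudes are available as a NumPy array. The function name `fractional_sigma` and the example amplitudes are made up for illustration; only the 20% figure comes from the post.

```python
import numpy as np

def fractional_sigma(f_calc, fraction=0.2):
    """Zeroth-order error model: sigma is a fixed fraction of |F_c|."""
    return fraction * np.abs(np.asarray(f_calc, dtype=float))

# Made-up amplitudes spanning weak to strong reflections.
f_calc = np.array([5.0, 50.0, 500.0])
sig_f_calc = fractional_sigma(f_calc)

# F/sigma is identical (1 / fraction = 5) for every reflection, which is
# why weak reflections end up looking too precise and strong ones too noisy.
print(sig_f_calc)
print(f_calc / sig_f_calc)
```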

Doeke_Hekstra · June 26, 2025, 3:41pm

Some initial thoughts

If one has a reference model, one presumably also has a reference data set, which would be a better reference than the reference model and comes with sigF.
If you do want to use the reference model for scaling, then from the point of view of a joint prior, what you care about is the correlation of the model structure factors with the observed ones; a worse, or less appropriate, model will have a lower correlation. We did some work in this direction in these notebooks: GitHub - Hekstra-Lab/dw: jupyter notebooks exploring the double-Wilson model (see also Preparing to download ...).
To Kay’s point: yes, if you want to pursue this approach it makes sense to assign errors based on the mismatch between reference model and reference data. In the likelihood formalism for structure refinement, one infers, in essence, a best P(Fo | Fc). You may want to extract that from, e.g., PHENIX directly and use as your prior.
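
One way to look at the model/data mismatch described above is per resolution bin. The sketch below is only an illustration under assumptions: it is not taken from the dw notebooks or from PHENIX, the function name `binned_mismatch` is hypothetical, and it assumes matched arrays of calculated amplitudes, observed reference amplitudes, and d-spacings. For each bin it reports the correlation between F_c and F_o and the RMS residual after a per-bin least-squares scale, which could serve as a crude resolution-dependent sigma.

```python
import numpy as np

def binned_mismatch(f_calc, f_obs, d_spacing, n_bins=10):
    """Per-resolution-bin correlation and RMS residual between Fc and Fo.

    Returns an (n_bins, 3) array with columns:
    high-resolution bin edge (d_min), correlation coefficient, RMS residual.
    """
    f_calc = np.asarray(f_calc, dtype=float)
    f_obs = np.asarray(f_obs, dtype=float)
    d_spacing = np.asarray(d_spacing, dtype=float)

    # Sort from low resolution (large d) to high resolution (small d),
    # then split into bins holding roughly equal numbers of reflections.
    order = np.argsort(d_spacing)[::-1]
    rows = []
    for idx in np.array_split(order, n_bins):
        fc, fo = f_calc[idx], f_obs[idx]
        k = np.sum(fo * fc) / np.sum(fc * fc)       # per-bin scale of Fc onto Fo
        cc = np.corrcoef(fc, fo)[0, 1]              # model/data correlation
        rms = np.sqrt(np.mean((fo - k * fc) ** 2))  # crude sigma estimate
        rows.append((d_spacing[idx].min(), cc, rms))
    return np.array(rows)
```

A lower correlation or larger RMS residual in a bin would suggest a weaker prior there. How to extend such a table to reflections missing from the reference data is left open here.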

kmdalton · June 26, 2025, 4:25pm

For sure, but that raises the question: What do you do with the reflections that are missing from the reference data but observed in the sample? Do you just discard them?

tjlane · July 3, 2025, 12:29pm

@dorismai - two more quick thoughts. I’m not entirely clear on what you need, so you might have to use some judgement.

If you are happy to just assume a probabilistic model where the Fc’s are “perfect”, then you could assert a model where the error between each Fo and Fc is independent and Gaussian, with a sigma given by SIGFo. For scaling this is probably good enough?

You can check out our re-implementation of SCALEIT (aka, least-squares) in meteor, which really should be merged upstream into rs (that’s on me): meteor/meteor/scale.py at 20f787464ab908026f6974e3ada4e550564fb39a · rs-station/meteor · GitHub
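
For orientation, here is a minimal sketch of the independent-Gaussian model with "perfect" Fc's. It is not the meteor or SCALEIT code; it fits only a single overall scale factor `k`, and the function names are hypothetical. Under the assumption Fo ~ Normal(k * Fc, SIGFo^2), the maximum-likelihood `k` is the inverse-variance weighted least-squares solution.

```python
import numpy as np

def fit_scale(f_obs, sig_f_obs, f_calc):
    """Closed-form weighted least-squares scale k minimizing
    sum(((Fo - k * Fc) / SIGFo)**2)."""
    f_obs = np.asarray(f_obs, dtype=float)
    sig_f_obs = np.asarray(sig_f_obs, dtype=float)
    f_calc = np.asarray(f_calc, dtype=float)
    w = 1.0 / np.square(sig_f_obs)  # inverse-variance weights
    return np.sum(w * f_obs * f_calc) / np.sum(w * np.square(f_calc))

def neg_log_likelihood(k, f_obs, sig_f_obs, f_calc):
    """Negative log-likelihood of the independent Gaussian error model."""
    f_obs = np.asarray(f_obs, dtype=float)
    sig_f_obs = np.asarray(sig_f_obs, dtype=float)
    f_calc = np.asarray(f_calc, dtype=float)
    z = (f_obs - k * f_calc) / sig_f_obs
    return 0.5 * np.sum(z ** 2 + np.log(2.0 * np.pi * sig_f_obs ** 2))
```

A resolution-dependent or anisotropic scale would replace the scalar `k` with a parameterized function of the Miller indices and require a numerical optimizer instead of the closed form.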

If you guys need to take into account the fact that the MODEL has errors, and the observed dataset corresponds to the model (as in, it’s not a different protein/condition/etc), then you could use (what I call) Srinivasan’s model, which is essentially what SIGMAA does. There, you assert that the model may have errors in terms of composition (missing atoms) or the positions of the atoms, and obtain a complex Rice distribution modeling P(Fo | Fc) parameterized by sigma-A.

That won’t include experimental errors, but you could “stack” it with the Gaussian error model above in a Bayesian sense.
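
As a rough sketch of the two layers above, the code below writes the acentric Rice density in terms of normalized amplitudes and then "stacks" Gaussian measurement error on top by simulation. Everything here is an assumption for illustration: the use of normalized amplitudes `e_obs` and `e_calc`, a single `sigma_a` value rather than a resolution-dependent one, the function names, and the restriction to acentric reflections. It is not the SIGMAA implementation.

```python
import numpy as np

def rice_pdf_acentric(e_obs, e_calc, sigma_a):
    """Acentric P(E_o | E_c) for normalized amplitudes, parameterized by sigma-A:
    2*Eo/v * exp(-(Eo^2 + sA^2 * Ec^2) / v) * I0(2 * sA * Eo * Ec / v), v = 1 - sA^2.
    """
    e_obs = np.asarray(e_obs, dtype=float)
    v = 1.0 - sigma_a ** 2
    x = 2.0 * sigma_a * e_obs * e_calc / v
    # Algebraically identical rearrangement that pairs I0(x) with exp(-x).
    # np.i0 still overflows for x above roughly 700 (sigma_a very close to 1);
    # an exponentially scaled Bessel function would be safer in that regime.
    return (2.0 * e_obs / v) * np.exp(-(e_obs - sigma_a * e_calc) ** 2 / v) \
        * np.i0(x) * np.exp(-x)

def sample_stacked(e_calc, sigma_a, sig_obs, size, rng):
    """Draw 'measured' amplitudes: Rice-distributed true amplitude
    (model error) plus independent Gaussian noise (experimental error)."""
    v = 1.0 - sigma_a ** 2
    # The true structure factor is complex normal around sigma_a * E_c,
    # with total variance v split equally between real and imaginary parts.
    re = sigma_a * e_calc + rng.normal(0.0, np.sqrt(v / 2.0), size)
    im = rng.normal(0.0, np.sqrt(v / 2.0), size)
    e_true = np.hypot(re, im)
    return e_true + rng.normal(0.0, sig_obs, size)
```

Sampling is used for the stacking step because the exact marginal over the true amplitude has no simple closed form; a real scaling model would more likely marginalize numerically or variationally.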

That’s a quick reply that hopefully makes some kind of sense, where I didn’t mangle anything too badly – I left out a lot of details. Happy to be corrected if I messed up or alternatively let me know if more elaboration is needed for it to click!