I am trying to tune the hyperparameters used when scaling (Laue) data with careless, and would like some clarity on what they imply and how best to adjust them. I have seen that processing the same data with different sets of hyperparameters (specifically the double-Wilson R and `--mlp-width`) can lead to radically divergent difference maps, but it is not yet apparent to me which results are more “correct”, nor whether I’m even sampling the full/correct region of hyperparameter space. As a result, I’m now getting deep into the weeds to try to understand my options for knobs to turn. To that end:
- While I understand the literal difference between `--mlp-layers` and `--mlp-width`, what is the difference between them in terms of their effect on the resulting scaling model? In particular, is it better to tune one hyperparameter while holding the other constant (if so, which one), or should they both be adjusted independently?
- What is the default value for `--image-layers`? And if the answer is that it is zero, is there any difference between including that flag but setting it to 0, versus just omitting it?
- Does the integer value given for `--positional-encoding-frequencies` need to match the number of items in the `--positional-encoding-keys` list? What happens (from a code standpoint) if the number of frequencies given is larger or smaller than the number of keys passed?
- The `careless poly` `--help` message says, “`--kl-weight KL_WEIGHT`: Set the weight of the kl divergence term relative to the likelihood. By default, this is based purely on the number of reflections.” I have been experimenting with tuning the double-Wilson R value - is it also worth adjusting the KL weight, or should that really be kept at the default (since the default is already based on the reflections in that specific dataset)?
Hi @mklureza,
While Careless has sensible default settings, the choice of hyperparameters can have quite a large effect, especially for applications in time-resolved crystallography where differences between merged structure factor amplitudes are important. Section III of https://pubs.aip.org/aca/sdy/article/11/6/064301/3319201 describes what some of the hyperparameters do and illustrates how one can explore settings. It is generally neither possible nor advisable to explore all possible combinations.
As for your specific questions, I’ll provide answers to the best of my understanding but hope that @kmdalton can add his perspective below.
- I typically do not vary `--mlp-layers` but, anecdotally, we have seen improvements in difference map quality with larger `--mlp-width` (>10) in the past. When you’re low on GPU memory for a large dataset, lowering `--mlp-width` may be necessary, and widths of 5-10 likely give good results.
- Yes, I think 0 image layers is the default. The biggest impact comes from whether you use image layers at all, so comparing 0 and 1, or 0 and 2, is useful.
- No. `--positional-encoding-frequencies` specifies the number of frequencies per key. The biggest impact comes from trying a number in the range of 2-5, rather than 0.
- `--kl-weight`: I will leave this to @kmdalton to advise. I’ve never varied this.
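To make the frequencies-per-key point concrete, here is a generic sketch of sinusoidal positional encoding. This is my own illustration of the general technique, not careless's actual implementation; the function name and the power-of-two frequency schedule are assumptions:

```python
import numpy as np

def positional_encoding(values: np.ndarray, n_frequencies: int) -> np.ndarray:
    """Generic sinusoidal encoding sketch (not careless source code).

    Each input column ("key") is expanded into sin/cos features at
    n_frequencies frequencies, so the output has
    2 * n_frequencies * n_keys columns. This illustrates why the
    frequency count applies per key and need not match the number
    of keys."""
    n_obs, n_keys = values.shape
    features = []
    for f in range(n_frequencies):          # one frequency band per loop
        for k in range(n_keys):             # applied to every key
            features.append(np.sin(2.0 ** f * np.pi * values[:, k]))
            features.append(np.cos(2.0 ** f * np.pi * values[:, k]))
    return np.stack(features, axis=1)

# e.g. two keys (say detector x, y) with 3 frequencies each:
xy = np.random.rand(100, 2)
enc = positional_encoding(xy, n_frequencies=3)
print(enc.shape)  # (100, 12): 2 (sin/cos) * 3 frequencies * 2 keys
```

The takeaway is that the encoded feature count grows multiplicatively with both settings, but they are independent knobs.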
@kmdalton, please correct me on any of this!
Hi @mklureza, I’ll do my best to add my own perspective on your questions with the caveat that some of these touch on deep issues in contemporary machine learning.
- `--mlp-layers` and `--mlp-width` work together to determine the total parameter count of the model. The number of parameters goes roughly as $|\theta| \propto (l-1)(w^2 + w)$, where $l$ is the number of layers and $w$ is the width. The overall parameter count controls the degree to which the model can fit/overfit the data, and I would anticipate similar behavior from models with similar parameter counts. There is some literature on the behavior of simple neural nets like the one used in careless as a function of depth and width, but my understanding is that these theoretical results mostly deal with asymptotic behavior, such as the limit $w \rightarrow \infty$.
- I can confirm zero is the default. I have yet to find an example where either 0 or 2 wasn’t the optimal value. Values higher than 2 usually lead to overfitting in my experience.
- Agree with @Doeke_Hekstra
- `--kl-weight` sets the relative impact of the fit to Wilson’s prior versus the fit to the observed data; higher values increase the importance of the prior in the optimization objective. It is most useful if you need to compare hyperparameter settings across datasets with different numbers of reflection observations. For hyperparameter tuning within the same dataset, I don’t typically use it. Numerically, it would correspond roughly to one over the average multiplicity of the reflections. I would expect that for most time-resolved datasets a value in the 0.01 to 0.0001 range would be reasonable.
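The parameter-count scaling mentioned above for `--mlp-layers` and `--mlp-width` is easy to spell out numerically. This is just arithmetic on that formula, with a hypothetical helper name:

```python
def approx_param_count(layers: int, width: int) -> int:
    """Approximate MLP parameter count per the scaling
    |theta| ~ (l - 1) * (w^2 + w): each of the (layers - 1)
    width-to-width transitions contributes width**2 weights
    plus width biases."""
    return (layers - 1) * (width ** 2 + width)

# Width dominates: doubling it roughly quadruples the count,
# while adding one layer adds only a single (w^2 + w) term.
baseline = approx_param_count(20, 10)  # 19 * 110 = 2090
wider    = approx_param_count(20, 20)  # 19 * 420 = 7980
deeper   = approx_param_count(21, 10)  # 20 * 110 = 2200
print(baseline, wider, deeper)
```

This is why tuning width tends to move the capacity of the model much faster than tuning depth, and why models with similar total counts may behave similarly.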
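The multiplicity rule of thumb for `--kl-weight` is easy to compute for a given dataset. A minimal sketch, in which the function name and the example numbers are hypothetical:

```python
def kl_weight_guess(n_observations: int, n_unique: int) -> float:
    """One over the average multiplicity, i.e. n_unique / n_observations,
    where average multiplicity = n_observations / n_unique."""
    return n_unique / n_observations

# e.g. 2,000,000 observations of 20,000 unique reflections gives an
# average multiplicity of 100 and a suggested KL weight of 0.01,
# at the top of the 0.01-0.0001 range mentioned above.
print(kl_weight_guess(2_000_000, 20_000))
```

A lower-multiplicity dataset would land closer to the high end of that range, and a very redundant one closer to the low end.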