LLM Prompting for Crystallographic Software Generation

I am interested in people’s tips and tricks for getting the most out of LLMs when working on technical projects. My own work is at the intersection of crystallography and machine learning, and I frequently combine variational inference with deep learning. I find LLMs can struggle to generate numerically stable code in this niche. I’m starting to keep track of the failure modes I encounter and the prompts that I hope will steer the model away from repeating these mistakes. I’m curious what pitfalls others are falling into, and how to prevent them. I’m creating this thread for open discussion and another where users can submit prompt snippets.

Over the weekend I was using Claude Code to translate one of my projects, careless, from TensorFlow to PyTorch. On the whole, I’ve had a tremendously positive experience. I estimate this task would have taken me a month on my own, but I was able to turn it around in a few days with Claude Code. I’ll note this is a very well-defined task: translating a command line application which already has an extensive and mature test suite. I had high expectations and for the most part they were met. I did hit a few snags along the way. I hope they are instructive. I don’t have any grand thesis, I’m afraid, just a bunch of hopefully helpful anecdotes.

LLMs make some of the same mistakes as students

With code generation tools, I think it is useful to set your expectations properly for the level of “intelligence” or, better yet, “experience” that your agent has. My mental model for today’s LLMs is that they are about as proficient as an entry-level PhD student, for better or worse. They know a good amount and they can do research independently, but they make some basic mistakes. I have two anecdotes along these lines.

Thing one: In translating careless, Claude made a mistake that I see from many students: instantiating PyTorch tensors on the wrong device. Unlike TensorFlow or JAX, PyTorch doesn’t presume to know what device you’d like your tensors on. So when you create a new one, it’s important to infer the right device, usually from the device of some input data or model parameters. Failing to do so will usually lead to code that runs on CPU but errors out on GPU.
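To make the device point concrete, here is a minimal sketch of the pattern I mean. The function names `positional_grid` and `add_noise` are made up for illustration and are not from careless; the idea is only that every newly created tensor takes its device from an existing one.

```python
import torch


def positional_grid(x: torch.Tensor, n: int) -> torch.Tensor:
    """Build a new tensor on the same device (and dtype) as the input `x`."""
    # Without device=..., torch.linspace allocates on the default device,
    # which is the CPU unless it has been changed.
    return torch.linspace(0.0, 1.0, n, device=x.device, dtype=x.dtype)


def add_noise(x: torch.Tensor) -> torch.Tensor:
    """The *_like constructors inherit device and dtype from their argument."""
    return x + torch.randn_like(x)


# Runs on GPU when one is available, otherwise on CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.zeros(4, device=device)
grid = positional_grid(x, 4)
noisy = add_noise(x)
```

Writing `torch.linspace(0.0, 1.0, n)` instead would pass any CPU-only test and then fail with a device mismatch the first time `x` lives on a GPU.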

Thing two: This one is more technical, but I think it falls under a larger class of issues. I found that when Claude went to implement a particular distribution I use heavily in careless, the truncated normal, it generated an implementation based on a very numerically unstable method (for the variational inference aficionados, it decided to use the reparameterization trick starting from a uniform distribution and using the inverse CDF). This is not a bug per se. It could work in some less sensitive applications, but it is inferior to the solution implemented in TensorFlow (again for the nerds out there: rejection sampling plus implicit reparameterization gradients). In any case, it was way too unstable for my application. Once upon a time I had a student who came to me with some code that used the same approach that Claude implemented, also with catastrophic results. As Claude put it, I guess this one’s “a rite of passage for anyone implementing truncated distributions in PyTorch.”
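For anyone who hasn't been bitten by this, below is a rough sketch of how the inverse-CDF approach falls over. It is not the code Claude wrote nor anything from careless; `sample_truncnorm_icdf` is a hypothetical helper for a normal truncated to [low, inf), and I pass in fixed uniform draws `u` so the result is deterministic.

```python
import torch


def sample_truncnorm_icdf(loc, scale, low, u):
    """Naive inverse-CDF sampler for a normal truncated to [low, inf).

    `u` holds uniform draws in (0, 1); the sample is a differentiable
    function of loc and scale, which is the reparameterization trick.
    """
    a = (low - loc) / scale            # standardized lower bound
    cdf_a = torch.special.ndtr(a)      # standard normal CDF at the bound
    p = cdf_a + u * (1.0 - cdf_a)      # map u onto the interval [cdf_a, 1)
    return loc + scale * torch.special.ndtri(p)  # inverse CDF


u = torch.linspace(0.01, 0.99, 5, dtype=torch.float64)
low = torch.tensor(0.0, dtype=torch.float64)
scale = torch.tensor(1.0, dtype=torch.float64)

# Benign case: most of the untruncated mass is inside the support.
good = sample_truncnorm_icdf(torch.tensor(1.0, dtype=torch.float64), scale, low, u)

# Tail case: the bound sits 10 standard deviations above the mean, so
# cdf_a rounds to exactly 1.0 and the inverse CDF is evaluated at 1.0.
bad = sample_truncnorm_icdf(torch.tensor(-10.0, dtype=torch.float64), scale, low, u)
```

In the tail case the samples come back non-finite even in double precision, because all of the information about where the sample lies is lost once the CDF rounds to 1.0. Clamping `p` hides the infinities but not the loss of precision, which is why I prefer the rejection sampling route.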

LLMs make the same mistakes I do

Crystallography is full of finicky gotchas. From a statistical perspective, one of the most annoying things is that reflections actually follow different classes of distributions depending on their Miller index. Skipping over the details, these distributions have different support. It seems like a painfully nitpicky thing, but the prior distribution for some reflections (acentric for the nerds) doesn’t have support at zero, whereas the prior for the other (centric) reflections does. Shockingly, this actually turns out to matter for careless. If you don’t ensure that the support of the structure factors exactly (and I mean exactly!) matches the prior, it won’t work. You’ll get a bunch of NaNs and the model will blow up. For a long time careless had this bug, which would manifest unpredictably and almost universally when I was working on a deadline. But I fixed it. I fixed it way back in 2021. The curious thing was that when the LLM went to translate my code, it reintroduced a nearly identical bug. So, I guess I’m left to conclude that Claude is about as detail oriented as I am.
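Here is a small illustration of the support issue, assuming the textbook Wilson priors on normalized structure factor amplitudes E: p(E) = 2E exp(-E^2) for acentric reflections and p(E) = sqrt(2/pi) exp(-E^2/2) for centric ones. The function `wilson_log_prob` is a hypothetical stand-in and not the careless implementation.

```python
import math

import torch


def wilson_log_prob(e: torch.Tensor, centric: torch.Tensor) -> torch.Tensor:
    """Log density of the Wilson prior for normalized amplitudes `e`.

    `centric` is a boolean tensor marking which reflections are centric.
    """
    # Acentric: log(2E exp(-E^2)); the log(E) term diverges as E -> 0.
    acentric_lp = math.log(2.0) + torch.log(e) - e**2
    # Centric: half-normal, which has finite density at E = 0.
    centric_lp = 0.5 * math.log(2.0 / math.pi) - 0.5 * e**2
    return torch.where(centric, centric_lp, acentric_lp)


# A structure factor sitting exactly on the boundary of the support.
e = torch.tensor([0.0, 0.0], dtype=torch.float64)
centric = torch.tensor([True, False])
lp = wilson_log_prob(e, centric)
```

The centric entry is finite while the acentric entry is negative infinity. If the surrogate posterior for an acentric reflection can ever produce a sample at zero, that infinity ends up in the loss, and the NaNs follow shortly after.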

LLMs can ignore critical details

During translation Claude removed several essential features in careless.

  • It changed the default bijector for positive variables from exp to softplus. Vexingly, it left the command line flag in place to set the bijector, but it wasn’t hooked up to anything. This took a while to realize. This small change has a big impact on performance metrics like time-resolved and anomalous difference peaks.
  • It totally ignored the special initialization scheme I use for linear layers in my neural network.
  • It neglected to rescale the neural network output to match the scale of the data, as I did in the TensorFlow implementation, which led to numerical issues.
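On the bijector point, the kind of check I wish I had run earlier looks roughly like the sketch below. The flag name `--positive-bijector` and the helpers are invented for illustration and are not the actual careless command line; the point is just to confirm that the flag reaches the function that gets applied.

```python
import argparse

import torch
import torch.nn.functional as F

# Hypothetical registry mapping flag values to positivity transforms.
BIJECTORS = {"exp": torch.exp, "softplus": F.softplus}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # The default is exp, matching the original behavior described above.
    parser.add_argument(
        "--positive-bijector", choices=sorted(BIJECTORS), default="exp"
    )
    return parser


def get_bijector(args: argparse.Namespace):
    """Look the transform up from the parsed flag instead of hard-coding it."""
    return BIJECTORS[args.positive_bijector]


raw = torch.zeros(1)
default_out = get_bijector(build_parser().parse_args([]))(raw)
softplus_out = get_bijector(
    build_parser().parse_args(["--positive-bijector", "softplus"])
)(raw)
```

A test asserting that `default_out` and `softplus_out` differ would have caught a flag that parses fine but is not hooked up to anything.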

hmm… sounds familiar :joy: