Output in unmerged batch mtz

Can 1/(spaceship) output unmerged mtz files in batches? I can’t seem to figure out how to do that if so. Gemmi can, but it’s messy:

cell = ds_corr.cell
sg = ds_corr.spacegroup

ds_corr = rs.DataSet(whatever)

mtz_out = gemmi.Mtz(with_base=True)
mtz_out.spacegroup = sg
mtz_out.cell = cell

ds = mtz_out.add_dataset("combined")
ds.project_name = "whatamidoing"
ds.crystal_name = "pleasehelp"
dataset_id = ds.id

# Add columns: M/ISYM BATCH I SIGI LAMBDA
mtz_out.add_column("M/ISYM", "Y", dataset_id=dataset_id)
mtz_out.add_column("BATCH", "B", dataset_id=dataset_id)
mtz_out.add_column("I", "J", dataset_id=dataset_id)
mtz_out.add_column("SIGI", "Q", dataset_id=dataset_id)
mtz_out.add_column("LAMBDA", "R", dataset_id=dataset_id)

all_data = []
for batch, group in ds_corr.groupby(by='BATCH'):
    nrefls = len(group)
    # Build array: H K L M/ISYM BATCH I SIGI LAMBDA
    batch_data = np.column_stack([
        group.index.get_level_values('H').to_numpy(),
        group.index.get_level_values('K').to_numpy(),
        group.index.get_level_values('L').to_numpy(),
        group['M/ISYM'].to_numpy(),  
        np.full(nrefls, batch, dtype='f4'),
        group['I'].to_numpy(),
        group['SIGI'].to_numpy(),
        group['LAMBDA'].to_numpy()
    ])
    all_data.append(batch_data)

combined = np.vstack(all_data)
mtz_out.set_data(combined)

nbatch = ds_corr.groupby(by='BATCH').ngroups

for b in range(1, nbatch + 1):
    batch = gemmi.Mtz.Batch()
    batch.number = b
    batch.cell = cell
    batch.dataset_id = dataset_id
    mtz_out.batches.append(batch)

mtz_out.sort()
mtz_out.write_to_file("combined_sorted_lorentz.mtz")

Hi @aaronfinke, can you say a little more about your use case? I’m a little unsure what wrangling your snippet does apart from telling gemmi to write batch headers. If you want to edit the batch headers, that probably needs to happen at the gemmi level. While rs supports “BATCH” columns to associate reflections with specific batches, it doesn’t have support fort batch headers which map BATCH numbers to dataset IDs.

I think you can probably simplify your code if you do whatever manipulations you want in rs, convert your rs.DataSet to gemmi, and edit the batch headers there.

ds = rs.read_mtz(whatever)

 ... #do whatever fancy stuff you want here to sort / modify ds

dataset_id = 0
cell = ds.cell
mtz = ds.to_gemmi()
for b in ds.BATCH.unique():
    batch = gemmi.Mtz.Batch()
    batch.number = b
    batch.cell = cell
    batch.dataset_id = dataset_id
    mtz.batches.append(batch)

mtz_out.write_to_file("combined_sorted_lorentz.mtz")

It’s definitely not ideal. Couple of questions

  1. should we modify rs to use the above strategy to populate batch headers automatically?
  2. should we add methods to edit batch headers in rs? or should we just punt that to gemmi?

I wanted to output an mtz file for input into aimless, which requires batch headers.

  1. should we modify rs to use the above strategy to populate batch headers automatically?

maybe- up to you if you want to include this functionality. Since it’s exclusive to mtz, maybe just set it as an option when writing to ds.write_mtz, like:

ds.write_mtz("foo.mtz", batch=True)

And if batch=True it can just run the gemmi batch file write like I’ve done above, but only if there’s a proper BATCH column in the dataset (e.g. raise ValueError if no BATCH column).

  1. should we add methods to edit batch headers in rs? or should we just punt that to gemmi?

I can’t think of a reason to do it other than for output to mtz. rs can do any other manipulation with a simple ds.groupby('BATCH') which is way simpler than anything else. I’d just stick with gemmi.