Loading and Saving

Loading

mzML

To accommodate disparate instrument types and manufacturers (e.g. Bruker, Waters, Thermo, Agilent), DEIMoS operates under the assumption that input data are in an open, standard format. As of this publication, the accepted file format for DEIMoS is mzML (or mzML.gz), which contains metadata, separation, and spectrometry data that reproduce the contents of vendor formats. Conversion to mzML from several other formats can be performed using the free and open-source ProteoWizard msconvert utility.
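
As a rough sketch of the conversion step, the snippet below assembles an msconvert command line from Python. It assumes msconvert is installed and on the PATH, and that the --mzML, --gzip, and -o options behave as described in the ProteoWizard documentation. The input name sample.d, the output directory converted, and the helper build_msconvert_command are hypothetical and only serve the illustration.

```python
import shutil
import subprocess
from pathlib import Path


def build_msconvert_command(input_path, output_dir, gzip=True):
    """Assemble an msconvert call that writes mzML (optionally gzipped)."""
    command = ["msconvert", str(input_path), "--mzML", "-o", str(output_dir)]
    if gzip:
        # --gzip compresses the whole output file, giving a .mzML.gz
        command.append("--gzip")
    return command


# Hypothetical vendor file and output directory.
command = build_msconvert_command("sample.d", "converted")
print(" ".join(command))

# Only attempt the conversion when msconvert and the input are actually present.
if shutil.which("msconvert") and Path("sample.d").exists():
    subprocess.run(command, check=True)
```

Support for individual vendor formats depends on the platform and the msconvert build, so the ProteoWizard documentation should be consulted before relying on this.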

By default, DEIMoS will load frame, scan, m/z, and intensity from the mzML, as well as precursor m/z for MS2, as available. Additional “accession” fields may be specified for data of higher dimension. To view these fields, a convenience function is provided.

[1]:
import deimos

accessions = deimos.get_accessions('example_data.mzML.gz')
accessions
[1]:
{'positive scan': 'MS:1000130',
 'ms level': 'MS:1000511',
 'MSn spectrum': 'MS:1000580',
 'profile spectrum': 'MS:1000128',
 'lowest observed m/z': 'MS:1000528',
 'highest observed m/z': 'MS:1000527',
 'no combination': 'MS:1000795',
 'scan start time': 'MS:1000016',
 'ion mobility drift time': 'MS:1002476',
 'scan window lower limit': 'MS:1000501',
 'scan window upper limit': 'MS:1000500',
 'isolation window target m/z': 'MS:1000827',
 'isolation window lower offset': 'MS:1000828',
 'isolation window upper offset': 'MS:1000829',
 'selected ion m/z': 'MS:1000744',
 'collision-induced dissociation': 'MS:1000133',
 'collision energy': 'MS:1000045',
 '32-bit float': 'MS:1000521',
 'zlib compression': 'MS:1000574',
 'm/z array': 'MS:1000514',
 'intensity array': 'MS:1000515'}

The example data referenced is from an Agilent 6560 Ion Mobility LC/Q-TOF system. Thus, we additionally need to parse retention time and ion mobility drift time. Consulting the list above, we can supply the appropriate accession fields to the load function, renaming as convenient (here, “scan start time” becomes “retention_time” and “ion mobility drift time” becomes “drift_time”). The load function infers the file type from the extension (here, .mzML or .mzML.gz).

[2]:
%%time
data = deimos.load('example_data.mzML.gz',
                   accession={'retention_time': 'MS:1000016',
                              'drift_time': 'MS:1002476'})
CPU times: user 6min 20s, sys: 5.78 s, total: 6min 26s
Wall time: 6min 31s
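
Rather than copying accession IDs by hand, the renaming described above can be derived from the accession listing by field name. The following is a minimal sketch, not part of DEIMoS: build_accession_map is a hypothetical helper, and the dictionary is a hand-copied subset of the listing shown earlier.

```python
def build_accession_map(accessions, rename):
    """Map new column names to accession IDs.

    accessions: {field name: accession ID}, as in the listing above.
    rename:     {field name: desired column name}.
    """
    missing = [field for field in rename if field not in accessions]
    if missing:
        raise KeyError(f"Fields not present in this file: {missing}")
    return {column: accessions[field] for field, column in rename.items()}


# Subset of the accession listing shown earlier.
accessions = {
    "scan start time": "MS:1000016",
    "ion mobility drift time": "MS:1002476",
    "collision energy": "MS:1000045",
}

accession = build_accession_map(
    accessions,
    {"scan start time": "retention_time",
     "ion mobility drift time": "drift_time"},
)
print(accession)
# {'retention_time': 'MS:1000016', 'drift_time': 'MS:1002476'}
```

The resulting dictionary has the same form as the accession argument passed to deimos.load in cell [2]. Raising on missing fields catches files that lack, for example, an ion mobility dimension before the long parsing step begins.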

The resulting data will be returned as a dictionary containing data frames, with keys per MS level. The example data contains MS1 and MS2 (collected at 20 eV).

[3]:
data['ms1']
[3]:
scanId retention_time drift_time mz intensity
0 416.0 0.07125 0.00000 71.677490 0.0
1 416.0 0.07125 0.00000 71.680420 11.0
2 416.0 0.07125 0.00000 71.683350 4.0
3 416.0 0.07125 0.00000 71.686279 0.0
4 416.0 0.07125 0.00000 71.703850 0.0
... ... ... ... ... ...
120486132 472575.0 21.98105 49.90292 1606.741089 4.0
120486133 472575.0 21.98105 49.90292 1606.755005 5.0
120486134 472575.0 21.98105 49.90292 1606.768921 0.0
120486135 472575.0 21.98105 49.90292 1608.308472 0.0
120486136 472575.0 21.98105 49.90292 1608.322266 1.0

120486137 rows × 5 columns

[4]:
data['ms2']
[4]:
scanId retention_time drift_time mz intensity precursor_mz
0 0.0 0.051783 0.00000 61.058502 0.0 813.195496
1 0.0 0.051783 0.00000 61.061207 7.0 813.195496
2 0.0 0.051783 0.00000 61.063911 29.0 813.195496
3 0.0 0.051783 0.00000 61.066612 5.0 813.195496
4 0.0 0.051783 0.00000 61.069317 6.0 813.195496
... ... ... ... ... ... ...
108650686 472159.0 21.961750 49.90292 1637.126221 6.0 843.569336
108650687 472159.0 21.961750 49.90292 1637.140259 2.0 843.569336
108650688 472159.0 21.961750 49.90292 1637.154297 5.0 843.569336
108650689 472159.0 21.961750 49.90292 1637.168213 6.0 843.569336
108650690 472159.0 21.961750 49.90292 1637.182251 11.0 843.569336

108650691 rows × 6 columns
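
Because the result is an ordinary dictionary of pandas data frames, each MS level can be inspected with standard pandas operations. The sketch below is illustrative only: it rebuilds a tiny stand-in for the dictionary from the first five rows of the outputs above (m/z, intensity, and precursor m/z columns only), and summarize_levels is a hypothetical helper. It counts rows with nonzero intensity, since the outputs above contain many zero-intensity points.

```python
import pandas as pd

# First five rows of each level, copied from the outputs above.
data = {
    "ms1": pd.DataFrame({
        "mz": [71.677490, 71.680420, 71.683350, 71.686279, 71.703850],
        "intensity": [0.0, 11.0, 4.0, 0.0, 0.0],
    }),
    "ms2": pd.DataFrame({
        "mz": [61.058502, 61.061207, 61.063911, 61.066612, 61.069317],
        "intensity": [0.0, 7.0, 29.0, 5.0, 6.0],
        "precursor_mz": [813.195496] * 5,
    }),
}


def summarize_levels(data):
    """Return {level: (total rows, rows with nonzero intensity)}."""
    return {
        level: (len(frame), int((frame["intensity"] > 0).sum()))
        for level, frame in data.items()
    }


print(summarize_levels(data))
# {'ms1': (5, 2), 'ms2': (5, 4)}
```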

HDF5

If the data have already been parsed and saved in the Hierarchical Data Format, loading will be much faster. The function does not change, as the loader again infers the format from the file extension. The arguments differ, however: specifying accessions is no longer required, but the relevant MS level must be selected using the key flag.

[5]:
%%time
ms1 = deimos.load('example_data.h5', key='ms1')
ms1
CPU times: user 7.92 s, sys: 4.85 s, total: 12.8 s
Wall time: 14 s
[5]:
scanId retention_time drift_time mz intensity
0 416.0 0.07125 0.00000 71.677490 0
1 416.0 0.07125 0.00000 71.680420 11
2 416.0 0.07125 0.00000 71.683350 4
3 416.0 0.07125 0.00000 71.686279 0
4 416.0 0.07125 0.00000 71.703850 0
... ... ... ... ... ...
120486132 472575.0 21.98105 49.90292 1606.741089 4
120486133 472575.0 21.98105 49.90292 1606.755005 5
120486134 472575.0 21.98105 49.90292 1606.768921 0
120486135 472575.0 21.98105 49.90292 1608.308472 0
120486136 472575.0 21.98105 49.90292 1608.322266 1

120486137 rows × 5 columns
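
The difference in arguments between the two formats can be captured in a small dispatch helper. This is a sketch under stated assumptions: load_kwargs is hypothetical, and it recognizes only the extensions that appear in this section (.mzML, .mzML.gz, and .h5); the loader may accept others.

```python
def load_kwargs(path, key="ms1", accession=None):
    """Choose keyword arguments for the load function based on file extension."""
    name = str(path).lower()
    if name.endswith((".mzml", ".mzml.gz")):
        # mzML: optional accession fields; all MS levels are returned.
        return {} if accession is None else {"accession": accession}
    if name.endswith(".h5"):
        # HDF5: no accessions needed, but one MS level must be selected.
        return {"key": key}
    raise ValueError(f"Unrecognized extension: {path}")


print(load_kwargs("example_data.h5"))
# {'key': 'ms1'}
print(load_kwargs("example_data.mzML.gz",
                  accession={"drift_time": "MS:1002476"}))
# {'accession': {'drift_time': 'MS:1002476'}}
```

A call such as deimos.load(path, **load_kwargs(path)) would then work for either format. Note that the return types still differ: a dictionary of data frames for mzML, and a single data frame for HDF5.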

Multi-file Loading

For certain alignment applications, the number of input files is too large to read all of them into memory at once. In these situations, Dask is used to load the data frames virtually (lazily), which makes them more amenable to downstream computation. The load function detects whether a list of inputs is passed and reads using the appropriate backend. The Dask chunk size (see docs) may be specified with the chunksize flag, and additional metadata per input file (e.g. date, sample type) can be passed as a dictionary with keys for each path. Only the HDF5 format is supported for multi-file loading.

[6]:
ms1 = deimos.load(['example_data.h5', 'example_data.h5'], key='ms1', chunksize=1E7, meta=None)
ms1
[6]:
Dask DataFrame Structure:
                 scanId  retention_time  drift_time       mz  intensity  sample_idx  sample_id
npartitions=26
                float64         float64     float64  float64      int64       int64     object
                    ...             ...         ...      ...        ...         ...        ...
...                 ...             ...         ...      ...        ...         ...        ...
                    ...             ...         ...      ...        ...         ...        ...
                    ...             ...         ...      ...        ...         ...        ...
Dask Name: concat, 91 tasks

Note that additional columns are appended to indicate each source file name and index. As the data frames are loaded virtually, the output is a placeholder for would-be data. For more on loading multiple files, see the section on alignment.
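
A quick back-of-the-envelope calculation shows why virtual loading matters and how the partition count above arises. The sketch assumes that chunksize is the number of rows per partition and that it is applied to each file separately. It counts only the five original columns at 8 bytes each, ignoring the index and the two appended sample columns.

```python
import math

ROWS_PER_FILE = 120_486_137  # MS1 rows in example_data.h5, from the output above
N_FILES = 2                  # the same file is passed twice in the example
CHUNKSIZE = int(1e7)         # rows per partition (assumption)
BYTES_PER_ROW = 5 * 8        # four float64 columns plus one int64 column


def partitions_per_file(rows, chunksize):
    """Number of chunks needed to cover all rows of one file."""
    return math.ceil(rows / chunksize)


def eager_bytes(rows, bytes_per_row):
    """Approximate size of one file's data frame if read fully into memory."""
    return rows * bytes_per_row


n_partitions = N_FILES * partitions_per_file(ROWS_PER_FILE, CHUNKSIZE)
gib_per_file = eager_bytes(ROWS_PER_FILE, BYTES_PER_ROW) / 2**30

print(n_partitions)            # 26
print(round(gib_per_file, 1))  # 4.5
```

Under these assumptions the estimate matches the npartitions=26 reported above. At roughly 4.5 GiB per MS1 data frame, a few dozen files would exceed the memory of most workstations if loaded all at once.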

Saving

HDF5

By default, DEIMoS exports a lightweight, data frame-based representation in the Hierarchical Data Format version 5 (HDF5). One must specify a path, the data frame to be saved, and a key for the container. Multiple keys may be saved to the same container (e.g. MS1 and MS2). The mode flag indicates file overwrite (mode='w') or append (mode='a'); use the latter when adding further data frames to an existing file.

[7]:
# Save ms1 to new file
deimos.save('example_data.h5', data['ms1'], key='ms1', mode='w')

# Save ms2 to same file
deimos.save('example_data.h5', data['ms2'], key='ms2', mode='a')
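
The overwrite-then-append pattern generalizes to any number of MS levels. Below is a minimal sketch in which save_levels is a hypothetical helper; the save function is passed in as an argument, so the example runs without DEIMoS installed. It assumes only the call signature used in the cell above (path, data frame, key, mode).

```python
def save_levels(path, data, save):
    """Write every MS level in `data` to a single container at `path`.

    The first level overwrites any existing file (mode='w');
    subsequent levels are appended to it (mode='a').
    """
    for i, (level, frame) in enumerate(data.items()):
        mode = "w" if i == 0 else "a"
        save(path, frame, key=level, mode=mode)


# Stand-in for the save function that records each call instead of writing.
calls = []


def record_save(path, frame, key, mode):
    calls.append((path, key, mode))


save_levels("example_data.h5", {"ms1": "ms1 frame", "ms2": "ms2 frame"}, record_save)
print(calls)
# [('example_data.h5', 'ms1', 'w'), ('example_data.h5', 'ms2', 'a')]
```

With DEIMoS available, the equivalent call would be save_levels('example_data.h5', data, deimos.save). Using mode='w' for every key would instead leave only the last level in the file.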

mzML

We are currently refactoring the code to export to mzML. Check back soon!

MGF

We are currently refactoring the code to export to MGF. Check back soon!