Loading and Saving
Loading
mzML
To accommodate disparate instrument types and manufacturers (e.g. Bruker, Waters, Thermo, Agilent), DEIMoS operates under the assumption that input data are in an open, standard format. As of this publication, the accepted file format for DEIMoS is mzML (or mzML.gz), which contains metadata, separation, and spectrometry data that reproduce the contents of vendor formats. Conversion to mzML from several other formats can be performed using the free and open-source ProteoWizard msconvert utility.
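As a rough illustration of the conversion step, the sketch below assembles an msconvert call from Python. It assumes msconvert is installed and on the PATH; the input name `example_data.d` and the output directory `converted` are hypothetical placeholders, and the helper `build_msconvert_command` is not part of DEIMoS.

```python
import shlex
from pathlib import Path


def build_msconvert_command(vendor_path, out_dir):
    """Assemble an msconvert call that writes gzip-compressed mzML."""
    return [
        "msconvert",         # assumes msconvert is on the PATH
        str(vendor_path),    # vendor file or directory (hypothetical name below)
        "--mzML",            # request mzML output
        "--gzip",            # gzip the output file, giving .mzML.gz
        "-o", str(out_dir),  # output directory
    ]


cmd = build_msconvert_command(Path("example_data.d"), Path("converted"))
print(shlex.join(cmd))

# To actually run the conversion (requires msconvert to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.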
By default, DEIMoS will load frame, scan, m/z, and intensity from the mzML, as well as precursor m/z for MS2, as available. Additional “accession” fields may be specified for data of higher dimension. A convenience function is provided to view the fields available in a given file.
[1]:
import deimos
accessions = deimos.get_accessions('example_data.mzML.gz')
accessions
[1]:
{'positive scan': 'MS:1000130',
'ms level': 'MS:1000511',
'MSn spectrum': 'MS:1000580',
'profile spectrum': 'MS:1000128',
'lowest observed m/z': 'MS:1000528',
'highest observed m/z': 'MS:1000527',
'no combination': 'MS:1000795',
'scan start time': 'MS:1000016',
'ion mobility drift time': 'MS:1002476',
'scan window lower limit': 'MS:1000501',
'scan window upper limit': 'MS:1000500',
'isolation window target m/z': 'MS:1000827',
'isolation window lower offset': 'MS:1000828',
'isolation window upper offset': 'MS:1000829',
'selected ion m/z': 'MS:1000744',
'collision-induced dissociation': 'MS:1000133',
'collision energy': 'MS:1000045',
'32-bit float': 'MS:1000521',
'zlib compression': 'MS:1000574',
'm/z array': 'MS:1000514',
'intensity array': 'MS:1000515'}
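Because the output maps field names to accession IDs, the IDs of interest can be looked up by name rather than copied by hand. The following is a minimal sketch using a subset of the dictionary above; `select_accessions` is a hypothetical helper, not a DEIMoS function, and the new column names are chosen only for illustration.

```python
# Subset of the dictionary returned by deimos.get_accessions above
accessions = {
    "ms level": "MS:1000511",
    "scan start time": "MS:1000016",
    "ion mobility drift time": "MS:1002476",
    "selected ion m/z": "MS:1000744",
}


def select_accessions(available, renames):
    """Map new column names to accession IDs, failing loudly on unknown fields."""
    missing = [name for name in renames if name not in available]
    if missing:
        raise KeyError(f"fields not present in file: {missing}")
    return {new_name: available[name] for name, new_name in renames.items()}


selected = select_accessions(
    accessions,
    {"scan start time": "retention_time", "ion mobility drift time": "drift_time"},
)
print(selected)
```

Checking for missing names up front gives a clear error before a long parse is started on a file that lacks the requested field.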
The example data come from an Agilent 6560 Ion Mobility LC/Q-TOF system, so we additionally need to parse retention time and ion mobility drift time. Consulting the list above, we can supply the appropriate accession fields to the load function, renaming as convenient (here, “scan start time” becomes “retention_time” and “ion mobility drift time” becomes “drift_time”). The load function will infer file type based on extension (here, .mzML or .mzML.gz).
[2]:
%%time
data = deimos.load('example_data.mzML.gz',
                   accession={'retention_time': 'MS:1000016',
                              'drift_time': 'MS:1002476'})
CPU times: user 6min 20s, sys: 5.78 s, total: 6min 26s
Wall time: 6min 31s
The resulting data will be returned as a dictionary containing data frames, with keys per MS level. The example data contains MS1 and MS2 (collected at 20 eV).
[3]:
data['ms1']
[3]:
| | scanId | retention_time | drift_time | mz | intensity |
|---|---|---|---|---|---|
| 0 | 416.0 | 0.07125 | 0.00000 | 71.677490 | 0.0 |
| 1 | 416.0 | 0.07125 | 0.00000 | 71.680420 | 11.0 |
| 2 | 416.0 | 0.07125 | 0.00000 | 71.683350 | 4.0 |
| 3 | 416.0 | 0.07125 | 0.00000 | 71.686279 | 0.0 |
| 4 | 416.0 | 0.07125 | 0.00000 | 71.703850 | 0.0 |
| ... | ... | ... | ... | ... | ... |
| 120486132 | 472575.0 | 21.98105 | 49.90292 | 1606.741089 | 4.0 |
| 120486133 | 472575.0 | 21.98105 | 49.90292 | 1606.755005 | 5.0 |
| 120486134 | 472575.0 | 21.98105 | 49.90292 | 1606.768921 | 0.0 |
| 120486135 | 472575.0 | 21.98105 | 49.90292 | 1608.308472 | 0.0 |
| 120486136 | 472575.0 | 21.98105 | 49.90292 | 1608.322266 | 1.0 |
120486137 rows × 5 columns
[4]:
data['ms2']
[4]:
| | scanId | retention_time | drift_time | mz | intensity | precursor_mz |
|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.051783 | 0.00000 | 61.058502 | 0.0 | 813.195496 |
| 1 | 0.0 | 0.051783 | 0.00000 | 61.061207 | 7.0 | 813.195496 |
| 2 | 0.0 | 0.051783 | 0.00000 | 61.063911 | 29.0 | 813.195496 |
| 3 | 0.0 | 0.051783 | 0.00000 | 61.066612 | 5.0 | 813.195496 |
| 4 | 0.0 | 0.051783 | 0.00000 | 61.069317 | 6.0 | 813.195496 |
| ... | ... | ... | ... | ... | ... | ... |
| 108650686 | 472159.0 | 21.961750 | 49.90292 | 1637.126221 | 6.0 | 843.569336 |
| 108650687 | 472159.0 | 21.961750 | 49.90292 | 1637.140259 | 2.0 | 843.569336 |
| 108650688 | 472159.0 | 21.961750 | 49.90292 | 1637.154297 | 5.0 | 843.569336 |
| 108650689 | 472159.0 | 21.961750 | 49.90292 | 1637.168213 | 6.0 | 843.569336 |
| 108650690 | 472159.0 | 21.961750 | 49.90292 | 1637.182251 | 11.0 | 843.569336 |
108650691 rows × 6 columns
HDF5
If the data have already been parsed and saved in the Hierarchical Data Format, loading will be much faster. The function does not change, as the loader will again infer format by file extension. However, the arguments differ: specifying accessions is no longer required, but the relevant MS level must be selected using the key flag.
[5]:
%%time
ms1 = deimos.load('example_data.h5', key='ms1')
ms1
CPU times: user 7.92 s, sys: 4.85 s, total: 12.8 s
Wall time: 14 s
[5]:
| | scanId | retention_time | drift_time | mz | intensity |
|---|---|---|---|---|---|
| 0 | 416.0 | 0.07125 | 0.00000 | 71.677490 | 0 |
| 1 | 416.0 | 0.07125 | 0.00000 | 71.680420 | 11 |
| 2 | 416.0 | 0.07125 | 0.00000 | 71.683350 | 4 |
| 3 | 416.0 | 0.07125 | 0.00000 | 71.686279 | 0 |
| 4 | 416.0 | 0.07125 | 0.00000 | 71.703850 | 0 |
| ... | ... | ... | ... | ... | ... |
| 120486132 | 472575.0 | 21.98105 | 49.90292 | 1606.741089 | 4 |
| 120486133 | 472575.0 | 21.98105 | 49.90292 | 1606.755005 | 5 |
| 120486134 | 472575.0 | 21.98105 | 49.90292 | 1606.768921 | 0 |
| 120486135 | 472575.0 | 21.98105 | 49.90292 | 1608.308472 | 0 |
| 120486136 | 472575.0 | 21.98105 | 49.90292 | 1608.322266 | 1 |
120486137 rows × 5 columns
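Extension-based format inference of the kind described above can be sketched in a few lines. This is an illustrative stand-in under stated assumptions, not the DEIMoS implementation: `infer_format` is a hypothetical helper, and it recognizes only the extensions mentioned in this section (.mzML, .mzML.gz, .h5).

```python
def infer_format(path):
    """Return 'mzml' or 'hdf5' based on the file name (case-insensitive)."""
    name = str(path).lower()
    if name.endswith((".mzml", ".mzml.gz")):
        return "mzml"
    if name.endswith(".h5"):
        return "hdf5"
    raise ValueError(f"unsupported file extension: {path}")


for path in ["example_data.mzML.gz", "example_data.mzML", "example_data.h5"]:
    print(path, "->", infer_format(path))
```

Raising on an unrecognized extension, rather than guessing, keeps a mistyped path from being handed to the wrong parser.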
Multi-file Loading
For certain alignment applications, the number of input files is too large for all of them to be read into memory simultaneously. In these situations, Dask is used to load multiple data frames virtually, which makes them more amenable to downstream computation. The load function will detect whether a list of inputs is passed and read using the appropriate backend. The Dask chunk size (see the Dask docs) may be specified with the chunksize flag, and additional metadata per input file (e.g. date, sample type, etc.) can be passed as a dictionary with keys for each path. Only HDF5 format is supported for multi-file loading.
[6]:
ms1 = deimos.load(['example_data.h5', 'example_data.h5'], key='ms1', chunksize=1E7, meta=None)
ms1
[6]:
| | scanId | retention_time | drift_time | mz | intensity | sample_idx | sample_id |
|---|---|---|---|---|---|---|---|
| npartitions=26 | | | | | | | |
| | float64 | float64 | float64 | float64 | int64 | int64 | object |
| | ... | ... | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| | ... | ... | ... | ... | ... | ... | ... |
| | ... | ... | ... | ... | ... | ... | ... |
Note that additional columns are appended to indicate each source file name and index. As the data frames are loaded virtually, the output is a placeholder for would-be data. For more on loading multiple files, see the section on alignment.
Saving
HDF5
By default, DEIMoS exports a lightweight, data frame-based representation in Hierarchical Data Format version 5 (HDF5). One must specify a path, the data frame to be saved, and a key for the container. Multiple keys may be saved to the same container (e.g. MS1 and MS2). The mode flag indicates whether the file is overwritten (mode='w') or appended to (mode='a'); use the latter when saving multiple data frames to the same file.
[7]:
# Save ms1 to new file
deimos.save('example_data.h5', data['ms1'], key='ms1', mode='w')
# Save ms2 to same file
deimos.save('example_data.h5', data['ms2'], key='ms2', mode='a')
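When more than two keys are involved, the same overwrite-then-append pattern can be generated rather than written out by hand. A minimal sketch follows; `plan_saves` is a hypothetical helper, and the commented loop assumes `data` is the dictionary of data frames loaded earlier.

```python
def plan_saves(keys):
    """Pair each key with a write mode: overwrite for the first, append after."""
    return [(key, "w" if i == 0 else "a") for i, key in enumerate(keys)]


plan = plan_saves(["ms1", "ms2"])
print(plan)

# With `data` loaded as above, the plan would be applied as:
# for key, mode in plan:
#     deimos.save('example_data.h5', data[key], key=key, mode=mode)
```

Using mode='w' only for the first key ensures a stale file is replaced once and that later keys do not erase earlier ones.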
mzML
We are currently refactoring the code to export to mzML. Check back soon!
MGF
We are currently refactoring the code to export to MGF. Check back soon!