Loading and Saving

Loading

mzML

To accommodate disparate instrument types and manufacturers (e.g. Bruker, Waters, Thermo, Agilent), DEIMoS operates under the assumption that input data are in an open, standard format. As of this publication, the accepted file format for DEIMoS is mzML (or mzML.gz), which contains metadata, separation, and spectrometry data that reproduce the contents of vendor formats. Conversion to mzML from several other formats can be performed using the free and open-source ProteoWizard msconvert utility.
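As a rough illustration of the conversion step, the sketch below assembles an msconvert call from Python. The input name `example_data.d` and the output directory `converted` are hypothetical, and the helper `build_msconvert_command` is not part of DEIMoS. The conversion is only attempted when msconvert is on the PATH and the input exists.

```python
import os
import shutil
import subprocess


def build_msconvert_command(input_path, outdir=".", gzip=True):
    """Assemble an msconvert call that writes mzML (optionally gzipped)."""
    command = ["msconvert", input_path, "--mzML", "-o", outdir]
    if gzip:
        # --gzip compresses the whole output file, yielding .mzML.gz
        command.append("--gzip")
    return command


# 'example_data.d' is a hypothetical vendor-format input
command = build_msconvert_command("example_data.d", outdir="converted")
print(" ".join(command))

# Only attempt the conversion when msconvert and the input are both available
if shutil.which("msconvert") is not None and os.path.exists("example_data.d"):
    subprocess.run(command, check=True)
```

Vendor-specific readers in msconvert may only be available on some platforms, so the exact invocation may need adjusting for a given instrument format.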

By default, DEIMoS will load frame, scan, m/z, and intensity from the mzML, as well as precursor m/z for MS2, as available. Additional “accession” fields may be specified for data of higher dimension. To view these fields, a convenience function is provided.

[1]:

import deimos

accessions = deimos.get_accessions('example_data.mzML.gz')
accessions

[1]:

{'positive scan': 'MS:1000130',
 'ms level': 'MS:1000511',
 'MSn spectrum': 'MS:1000580',
 'profile spectrum': 'MS:1000128',
 'lowest observed m/z': 'MS:1000528',
 'highest observed m/z': 'MS:1000527',
 'no combination': 'MS:1000795',
 'scan start time': 'MS:1000016',
 'ion mobility drift time': 'MS:1002476',
 'scan window lower limit': 'MS:1000501',
 'scan window upper limit': 'MS:1000500',
 'isolation window target m/z': 'MS:1000827',
 'isolation window lower offset': 'MS:1000828',
 'isolation window upper offset': 'MS:1000829',
 'selected ion m/z': 'MS:1000744',
 'collision-induced dissociation': 'MS:1000133',
 'collision energy': 'MS:1000045',
 '32-bit float': 'MS:1000521',
 'zlib compression': 'MS:1000574',
 'm/z array': 'MS:1000514',
 'intensity array': 'MS:1000515'}

The example data referenced is from an Agilent 6560 Ion Mobility LC/Q-TOF system, so we additionally need to parse retention time and ion mobility drift time. Consulting the list above, we can supply the appropriate accession fields to the load function, renaming as convenient (here, “scan start time” becomes “retention_time” and “ion mobility drift time” becomes “drift_time”). The load function infers file type from the extension (here, .mzML or .mzML.gz).

[2]:

%%time
data = deimos.load('example_data.mzML.gz',
                   accession={'retention_time': 'MS:1000016',
                              'drift_time': 'MS:1002476'})

CPU times: user 6min 20s, sys: 5.78 s, total: 6min 26s
Wall time: 6min 31s
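Rather than copying accession IDs by hand, the mapping can be built by looking up field names in the dictionary shown above. The following is a minimal sketch. `select_accessions` is a hypothetical helper, not a DEIMoS function, and the dictionary here is a small subset of the real output.

```python
def select_accessions(available, wanted):
    """Map output column names to accession IDs, looking each up by name."""
    missing = [name for name in wanted.values() if name not in available]
    if missing:
        raise KeyError(f"accession fields not present in file: {missing}")
    return {column: available[name] for column, name in wanted.items()}


# Subset of the dictionary returned by deimos.get_accessions
accessions = {
    "scan start time": "MS:1000016",
    "ion mobility drift time": "MS:1002476",
    "collision energy": "MS:1000045",
}

accession = select_accessions(
    accessions,
    {"retention_time": "scan start time",
     "drift_time": "ion mobility drift time"},
)
print(accession)
# {'retention_time': 'MS:1000016', 'drift_time': 'MS:1002476'}
```

The resulting dictionary has the same shape as the accession argument passed to the load function above, and the lookup fails early if a requested field is absent from the file.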

The result is a dictionary of data frames keyed by MS level (here, 'ms1' and 'ms2'). The example data contains MS1 and MS2 (collected at 20 eV).

[3]:

data['ms1']

[3]:

	scanId	retention_time	drift_time	mz	intensity
0	416.0	0.07125	0.00000	71.677490	0.0
1	416.0	0.07125	0.00000	71.680420	11.0
2	416.0	0.07125	0.00000	71.683350	4.0
3	416.0	0.07125	0.00000	71.686279	0.0
4	416.0	0.07125	0.00000	71.703850	0.0
...	...	...	...	...	...
120486132	472575.0	21.98105	49.90292	1606.741089	4.0
120486133	472575.0	21.98105	49.90292	1606.755005	5.0
120486134	472575.0	21.98105	49.90292	1606.768921	0.0
120486135	472575.0	21.98105	49.90292	1608.308472	0.0
120486136	472575.0	21.98105	49.90292	1608.322266	1.0

120486137 rows × 5 columns

[4]:

data['ms2']

[4]:

	scanId	retention_time	drift_time	mz	intensity	precursor_mz
0	0.0	0.051783	0.00000	61.058502	0.0	813.195496
1	0.0	0.051783	0.00000	61.061207	7.0	813.195496
2	0.0	0.051783	0.00000	61.063911	29.0	813.195496
3	0.0	0.051783	0.00000	61.066612	5.0	813.195496
4	0.0	0.051783	0.00000	61.069317	6.0	813.195496
...	...	...	...	...	...	...
108650686	472159.0	21.961750	49.90292	1637.126221	6.0	843.569336
108650687	472159.0	21.961750	49.90292	1637.140259	2.0	843.569336
108650688	472159.0	21.961750	49.90292	1637.154297	5.0	843.569336
108650689	472159.0	21.961750	49.90292	1637.168213	6.0	843.569336
108650690	472159.0	21.961750	49.90292	1637.182251	11.0	843.569336

108650691 rows × 6 columns

HDF5

If the data is already parsed and saved in the Hierarchical Data Format (HDF5), loading will be much faster (roughly 14 s versus 6.5 min of wall time for the example data). The function does not change, as the loader again infers format from the file extension. The arguments differ, however: specifying accessions is no longer required, but the relevant MS level must be selected using the key flag.

[5]:

%%time
ms1 = deimos.load('example_data.h5', key='ms1')
ms1

CPU times: user 7.92 s, sys: 4.85 s, total: 12.8 s
Wall time: 14 s

[5]:

	scanId	retention_time	drift_time	mz	intensity
0	416.0	0.07125	0.00000	71.677490	0
1	416.0	0.07125	0.00000	71.680420	11
2	416.0	0.07125	0.00000	71.683350	4
3	416.0	0.07125	0.00000	71.686279	0
4	416.0	0.07125	0.00000	71.703850	0
...	...	...	...	...	...
120486132	472575.0	21.98105	49.90292	1606.741089	4
120486133	472575.0	21.98105	49.90292	1606.755005	5
120486134	472575.0	21.98105	49.90292	1606.768921	0
120486135	472575.0	21.98105	49.90292	1608.308472	0
120486136	472575.0	21.98105	49.90292	1608.322266	1

120486137 rows × 5 columns
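Since the arguments depend on the file type, it can help to pick them by extension, mirroring how the loader infers format. The sketch below is an illustration only and is not the DEIMoS implementation. `infer_format` and `load_arguments` are hypothetical helpers, and treating only `.h5` as HDF5 is an assumption based on the file names used here.

```python
def infer_format(path):
    """Guess the file format from the extension (case-insensitive)."""
    lowered = path.lower()
    if lowered.endswith((".mzml", ".mzml.gz")):
        return "mzml"
    if lowered.endswith(".h5"):  # assumption: only .h5 appears in this guide
        return "hdf5"
    raise ValueError(f"unrecognized file extension: {path}")


def load_arguments(path, accession=None, key="ms1"):
    """Return the keyword arguments appropriate for the inferred format."""
    if infer_format(path) == "mzml":
        # mzML needs accession fields for any extra dimensions
        return {"accession": accession or {}}
    # HDF5 needs the MS level to read
    return {"key": key}


print(load_arguments("example_data.h5", key="ms2"))
print(load_arguments("example_data.mzML.gz",
                     accession={"drift_time": "MS:1002476"}))
```

The returned dictionary could then be unpacked into the load call, e.g. `deimos.load(path, **load_arguments(path))`.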

Multi-file Loading

For certain alignment applications, the number of input files is too large to read each into memory simultaneously. In these situations, Dask is used to virtually load multiple data frames, which makes them more amenable to downstream computation. The load function detects whether a list of inputs is passed and reads using the appropriate backend. The Dask chunk size (see docs) may be specified with the chunksize flag, and additional metadata per input file (e.g. date, sample type, etc.) can be passed as a dictionary with keys for each path. Only HDF5 format is supported for multi-file loading.

[6]:

ms1 = deimos.load(['example_data.h5', 'example_data.h5'], key='ms1', chunksize=1E7, meta=None)
ms1

[6]:

Dask DataFrame Structure:

	scanId	retention_time	drift_time	mz	intensity	sample_idx	sample_id
npartitions=26
	float64	float64	float64	float64	int64	int64	object
	...	...	...	...	...	...	...
...	...	...	...	...	...	...	...
	...	...	...	...	...	...	...
	...	...	...	...	...	...	...

Dask Name: concat, 91 tasks

Note that additional columns are appended to indicate each source file name and index. As the data frames are loaded virtually, the output is a placeholder for would-be data. For more on loading multiple files, see the section on alignment.

Saving

HDF5

By default, DEIMoS exports a lightweight, data frame-based representation in the Hierarchical Data Format version 5 (HDF5) file format. One must specify a path, the data frame to be saved, and a key for the container. Multiple keys may be saved to the same container (e.g. MS1 and MS2). The mode flag indicates file overwrite (mode='w') or append (mode='a'), the latter to be used when saving additional data frames to an existing file.

[7]:

# Save ms1 to new file
deimos.save('example_data.h5', data['ms1'], key='ms1', mode='w')

# Save ms2 to same file
deimos.save('example_data.h5', data['ms2'], key='ms2', mode='a')
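When a dictionary holds several MS levels, the overwrite-then-append pattern above can be wrapped in a small loop. This is a minimal sketch. `save_all` is a hypothetical helper, and a recording stand-in replaces the real save function so the example runs without DEIMoS or any data.

```python
def save_all(path, frames, save_fn):
    """Write every frame in `frames` to one container, one key per frame."""
    for i, (key, frame) in enumerate(frames.items()):
        # Overwrite on the first write, then append the remaining keys
        mode = "w" if i == 0 else "a"
        save_fn(path, frame, key=key, mode=mode)


# Stand-in for deimos.save that only records the calls it receives
calls = []


def record(path, frame, key, mode):
    calls.append((path, key, mode))


save_all("example_data.h5", {"ms1": "ms1 frame", "ms2": "ms2 frame"}, record)
print(calls)
# [('example_data.h5', 'ms1', 'w'), ('example_data.h5', 'ms2', 'a')]
```

With the loaded data from above, the equivalent call would be `save_all('example_data.h5', data, deimos.save)`.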

mzML

We are currently refactoring the code to export to mzML. Check back soon!

MGF

We are currently refactoring the code to export to MGF. Check back soon!