Peak Detection

Feature detection, also referred to as peak detection, is the process by which local maxima that fulfill certain criteria (such as sufficient signal-to-noise ratio) are located in the signal acquired by a given analytical instrument. This process results in “features” associated with the analysis of molecular analytes from the sample under study or from chemical, instrument, or random noise. Typically, feature detection involves a mass dimension (m/z) as well as one or more separation dimensions (e.g. drift and/or retention time), the latter offering distinction among isobaric/isotopic features.

Persistent Homology

We have recently implemented a peak detection method that leverages persistent homology. Advantages are that it runs faster, uses much less memory, has a measure of signal-to-noise (persistence) built into the method, does not depend on any user-defined parameters, and does not require partitioning of the data to keep the processing tractable. We recommend use of this method, though we will continue to make the convolution-based approach available.

[1]:
import deimos
import numpy as np
import matplotlib.pyplot as plt
[2]:
%%time
# Load data, excluding scanid column
ms1 = deimos.load('example_data.h5', key='ms1', columns=['mz', 'drift_time', 'retention_time', 'intensity'])
CPU times: user 3.35 s, sys: 3.86 s, total: 7.21 s
Wall time: 7.65 s

We will next build arrays containing unique, sorted values (factors) for each dimension of the input data. This consideration ensures data connectivity in downstream activities (smoothing, peak detection) is not affected by, for example, thresholding. If an index is built from thresholded data directly, there can be scenarios in which otherwise disconnected data points become connected due to data loss. Factors can be used to build the index at any point.

[3]:
%%time
# Build factors from raw data
factors = deimos.build_factors(ms1, dims='detect')
CPU times: user 8.76 s, sys: 2.73 s, total: 11.5 s
Wall time: 11.6 s

For computational efficiency reasons, we next threshold the data before applying downstream operations.

[4]:
%%time
# Nominal threshold
ms1 = deimos.threshold(ms1, threshold=500)
CPU times: user 568 ms, sys: 641 ms, total: 1.21 s
Wall time: 1.23 s

An index may now be built for the thresholded data for use in the following smoothing and peak detection operations.

[5]:
%%time
# Build index
index = deimos.build_index(ms1, factors)
CPU times: user 56.3 ms, sys: 4.14 ms, total: 60.5 ms
Wall time: 60 ms

Before peak detection, it is good practice to smooth the data. The following smoothing parameters were determined by analysis of TuneMix and internal standards to minimize m/z error. Generally, keeping the smooth radius small and increasing number of iterations is preferable to a larger smoothing radius, albeit at greater computational expense. Notice we pass the index constructed for the thresholded data here.

[6]:
%%time
# Smooth data
ms1 = deimos.filters.smooth(ms1, index=index, dims=['mz', 'drift_time', 'retention_time'],
                            radius=[0, 1, 0], iterations=7)
CPU times: user 5.09 s, sys: 292 ms, total: 5.38 s
Wall time: 5.43 s

Finally, the peak detection operation. Note that by supplying a radius per dimension, an intensity-weighted per-dimension coordinate will be returned for each feature. These steps were combined in order to apply intensity weighting only to detected peaks, as opposed to all data points, improving computational efficiency. Parameters here are dependent on the resolution of your data. Zero indicates no weighting will be applied in that dimension. Notice we again pass the index constructed for the thresholded data here.

[7]:
%%time
# Perform peak detection
ms1_peaks = deimos.peakpick.persistent_homology(ms1, index=index,
                                                dims=['mz', 'drift_time', 'retention_time'],
                                                radius=[2, 10, 0])

# Sort by persistence
ms1_peaks = ms1_peaks.sort_values(by='persistence', ascending=False).reset_index(drop=True)
ms1_peaks
CPU times: user 5.18 s, sys: 547 ms, total: 5.73 s
Wall time: 5.72 s
[7]:
mz drift_time retention_time intensity persistence mz_weighted drift_time_weighted retention_time_weighted
0 496.335480 28.378529 1.926033 47870.839844 47370.008285 496.337463 28.336977 1.926034
1 192.091949 16.594225 9.035784 37407.832031 36907.687972 192.094498 16.632273 9.035781
2 204.121979 16.954967 8.494817 36476.816406 35976.445785 204.123215 17.011349 8.494816
3 118.085197 13.467776 8.224500 36248.109375 35747.613753 118.086433 13.758590 8.224499
4 144.099182 14.670256 7.181000 34877.378906 34377.035628 144.101120 14.669243 7.181000
... ... ... ... ... ... ... ... ...
12864 582.417297 30.783487 2.312300 502.000000 1.027358 582.417297 30.996725 2.312300
12865 611.250916 28.739271 1.037067 1451.595947 1.018677 611.247253 28.496101 1.037066
12866 758.666321 35.593407 2.505433 731.699463 1.014221 758.661865 35.406212 2.505433
12867 118.085197 13.588024 10.233833 1418.360596 1.007446 118.085907 13.693040 10.233833
12868 634.256470 30.302496 5.712467 912.363647 1.002197 634.258362 30.128944 5.712467

12869 rows × 8 columns

The persistence column indicates the “height” of a given feature relative to the filtration. Higher values indicate greater prominence with respect to the surrounding signal.

Maximum Filtration

The version of DEIMoS published in Analytical Chemistry implements an N-dimensional maximum filter from scipy.ndimage that convolves the instrument signal with a structuring element, also known as a kernel, and compares the result against the input array to identify local maxima as candidate features or peaks. To demonstrate, we will operate on a subset of 2D data to minimize memory usage and computation time.

[8]:
# Load data, excluding scanid column
ms1 = deimos.load('example_data.h5', key='ms1', columns=['mz', 'drift_time', 'retention_time', 'intensity'])

# Sum over retention time
ms1_2d = deimos.collapse(ms1, keep=['mz', 'drift_time'])
[9]:
%%time
# Perform peak detection
ms1_peaks = deimos.peakpick.local_maxima(ms1_2d, dims=['mz', 'drift_time'], bins=[37, 9])
CPU times: user 17.6 s, sys: 3.67 s, total: 21.3 s
Wall time: 22.2 s

Selecting Kernel Size

Key to this process is the selection of kernel size, which can vary by instrument, dataset, and even compound. For example, in LC-IMS-MS/MS data, peak width increases with increasing m/z and drift time, and also varies in retention time. Ideally, the kernel would be the same size as the N-dimensional peak (i.e. wavelets), though computational efficiency considerations for high-dimensional data currently limit the ability to dynamically adjust kernel size. Thus, the selected kernel size should be representative of likely features of interest.

This process is exploratory, and selections can be further refined pending an initial processing of the data. To start, we will get a sense of our data by visualizing a high-intensity feature.

[10]:
# Subset to lower mass range
ms1_ss = deimos.slice(ms1, by='mz', low=150, high=250)

# Get maximal data point
mz_i, dt_i, rt_i, intensity_i = ms1_ss.loc[ms1_ss['intensity'] == ms1_ss['intensity'].max(), :].round(1).values[0]

# Subset the raw data
feature = deimos.slice(ms1,
                       by=['mz', 'drift_time', 'retention_time'],
                       low=[mz_i - 0.01, dt_i - 0.6, rt_i - 0.3],
                       high=[mz_i + 0.15, dt_i + 0.6, rt_i + 0.3])

# Visualize
deimos.plot.multipanel(feature, dpi=150)
plt.tight_layout()
plt.show()
../_images/user_guide_peak_detection_24_0.png
[11]:
print('{}:\t\t{}'.format('mz', len(feature['mz'].unique())))
print('{}:\t{}'.format('drift_time', len(feature['drift_time'].unique())))
print('{}:\t{}'.format('retention_time', len(feature['retention_time'].unique())))
mz:             32
drift_time:     10
retention_time: 14

The number of sampled data points in each dimension informs selection of suitable peak detection parameters, in this case 32 values in m/z, 10 values in drift time, and 14 values in retention time. For the kernel to be centered on each “voxel”, however, selections must be odd. Due to the multidimensional nature of the data, kernel size need not be exact: two features need only be separated in one dimension, not all dimensions simultaneously.

Partitioning

This dataset is comprised of almost 200,000 unique m/z values, 416 unique drift times, and 568 unique retention times. In order to process the data by N-dimensional filter convolution, the data frame-based coordinate format must be converted into a dense array. In this case, a dense array would comprise 4.7E9 cells and, for 32-bit intensities, require approximately 174 GB of memory.

[12]:
print('{}:\t\t{}'.format('mz', len(ms1['mz'].unique())))
print('{}:\t{}'.format('drift_time', len(ms1['drift_time'].unique())))
print('{}:\t{}'.format('retention_time', len(ms1['retention_time'].unique())))
mz:             197408
drift_time:     416
retention_time: 568

This is of course not tenable for many workstations, necessitating a partitioning utility by which the input may be split along a given dimension, each partition processed separately. Here, we create a Partitions object to divide the m/z dimension into chunks of 1000 unique values, with a partition overlap of 0.2 Da to ameliorate artifacts arising from artificial partition “edges”. Next, its map method is invoked to apply peak detection to each partition.

The processes flag may also be specified to spread the computational load over multiple cores. Memory footprint scales linearly with number of processes. Note that we will not observe any performance improvement for this 2D example case.

[13]:
%%time
# Partition the data
partitions = deimos.partition(ms1_2d, split_on='mz', size=10000, overlap=0.5)

# Map peak detection over partitions
ms1_peaks_partitioned = partitions.map(deimos.peakpick.local_maxima,
                                       dims=['mz', 'drift_time'],
                                       bins=[37, 9],
                                       processes=8)
CPU times: user 19.2 s, sys: 11.5 s, total: 30.7 s
Wall time: 36.8 s

With overlap selected appropriately, the partitioned result should be identical to the previous result.

[14]:
all(ms1_peaks_partitioned == ms1_peaks)
[14]:
True

Kernel Scaling

Peak width in m/z and drift time increase with m/z. In the example data used here, the sample inverval in m/z also increases with increasing m/z. This means that our kernel effectively “grows” as m/z increases, as kernel is selected by number of such intervals rather than an m/z range.

[15]:
# Unique m/z values
mz_unq = np.unique(ms1_2d['mz'])

# m/z sample intervals
mz_diff = np.diff(mz_unq)

# Visualize
plt.figure(dpi=150)
plt.plot(mz_unq[1:], mz_diff)
plt.xlabel('m/z', fontweight='bold')
plt.ylabel('Interval', fontweight='bold')
plt.show()
../_images/user_guide_peak_detection_37_0.png

However, the drift time sample interval is constant throughout the acquisition. To accommodate increasing peak width in drift time, we can scale the kernel in that dimension by the m/z per partition, scaled by a reference resolution (i.e. the minimum interval in the above). Thus, the drift time kernel size of the first partition will be scaled by a factor of 1 (no change), the last by a factor of ~1.4. This represents an advanced usage scenario and should only be considered with sufficient justification. That is, knowledge of sample intervals in each dimension, peak widths as a function of these sample intervals, and whether the relationship(s) scale linearly.

[16]:
%%time
# Partition the data
partitions = deimos.partition(ms1_2d, split_on='mz', size=10000, overlap=0.5)

# Map peak detection over partitions
ms1_peaks_partitioned = partitions.map(deimos.peakpick.local_maxima,
                                       dims=['mz', 'drift_time'],
                                       bins=[37, 9],
                                       scale_by='mz',
                                       ref_res=mz_diff.min(),
                                       scale=['drift_time'],
                                       processes=8)
CPU times: user 17.3 s, sys: 8.31 s, total: 25.6 s
Wall time: 30.1 s

Note that, though we have ignored retention time, its sample interval in these data is also constant. However, there is no discernable relationship with m/z, thus barring use of this scaling functionality. In such cases, simply determining an average, representative kernel size is typically sufficient.