How we analyse untargeted metabolomics data

As a lab primarily interested in metabolomics applications, our publications usually focus on the biological results and their relevance. Much more rarely do we describe the infrastructure that turns raw LC-MS/MS data into detected features, annotations, and ultimately the files that accompany a publication, whether tables, mzTab exports, or other deliverables. We sometimes allude to this infrastructure in talks, but there is rarely enough time to do justice to what is in fact a central pillar of our work. This post is therefore meant as a guided tour of our data-analysis ecosystem.

Our processing pipelines have been evolving for roughly 20 years to keep pace with advances in instrumentation, the growing size of studies, and the changing scientific questions we want to answer. We started with medium-scale analyses of primary metabolism using GC-TOF and LC-QQQ, including 13C tracing. At that stage, the core challenge was the rapid and reproducible quantification of thousands of EICs within each study.

Around 2012, we developed fiaMiner, our internal software for FIA-TOF. It was later used on more than 1.5 million samples and in studies of up to 100,000 injections. The software included raw-signal processing designed to exploit the full dynamic range of the detector, as well as a large GUI for assembling and running complex analysis workflows with more than 100 selectable plugins.

As we became more involved in projects that required charting metabolites and lipids beyond primary metabolism, we expanded our toolbox for processing and analysing untargeted LC-MS data and for annotating features across a much broader chemical space. That meant collecting large amounts of MS2 data for structural elucidation and moving to software that could integrate multiple sources of evidence, including isotopic patterns, MS2, and retention time, to generate structural information for thousands of features across studies that themselves can contain thousands of samples.

The range of possible software solutions was extensive, including well-established options such as mzmine, openMS, MSDIAL, Workflow4Metabolomics, RforMassSpectrometry, emzed, commercial alternatives, and many more. With so many strong options available, it should be straightforward to pick one, right?

In the end, we made the most radical choice possible: we started building a completely new analysis ecosystem, MASSter, from an architecture designed around our current and future needs. It was a major investment, with 600+ hours of development stolen from other things, but in hindsight it was absolutely worth it.

This post explains the rationale, the resulting design principles, and the impact they had on our ability to analyse untargeted metabolomics experiments. This is Part I, devoted to the core analysis workflow. Part II will follow and will discuss benchmarks and more advanced functionality.

1. Tasks and needs

Analysing untargeted metabolomics data can be reduced to a few core tasks:

  1. building a catalogue of the “unique” features detected across a set of samples,
  2. providing relative quantification, and
  3. attempting to characterise those features structurally based on MS1, MS2, RT, MS3, CCS, etc.

Dozens of software packages and tools can, in principle, address these tasks, with additional specialisation when ion mobility or MS3+ data are available. The tasks themselves, however, do not capture the practical constraints that users must deal with. If those constraints are not handled well, the final result falls short of what untargeted analysis should deliver.

In our lab, the following requirements matter especially:

  • First, untargeted metabolomics should not be worse than targeted data analysis or careful manual curation. By “targeted”, I do not mean a different acquisition mode such as MRM on a QQQ, but the analysis of high-resolution data within an expected m/z and retention-time window for a list of compounds or formulas, as in workflows such as HERMES. Given the same raw data, untargeted analysis should ideally deliver a similar detection limit and similar quantification accuracy while providing broader coverage than a targeted workflow.
  • Scalability: our studies range from a handful of LC-MS files to well over 1000, and they include both DDA and very bulky DIA data. Can the software handle that amount of data without a catastrophic loss of performance? A recurring problem is the algorithmic complexity of key steps such as alignment, clustering, and adduct detection. Some implementations scale quadratically and are therefore guaranteed to fail as sensitivity or study size increases.
  • Sensitivity coupled with robustness: sensitivity is the ability to detect low-abundance features, including rare ones that appear only in a small subset of samples. It is easy to increase sensitivity by relaxing quality criteria such as signal-to-noise, points per peak, or outlier tolerance, but this immediately increases the risk of picking up noise. In large studies, this can lead to an explosion of erratic features.
  • Reduce subjectivity: every step of the processing pipeline depends on numerical parameters, many of which vary across LC systems and instruments and strongly affect the outcome. In practice, parameterisation is often based on guesswork or on the user’s personal experience, which makes untargeted analysis far too subjective. Data processing should not be a form of art; it should rest on deterministic and rational criteria.
  • Reproducibility: everyone is familiar with the concept of FAIR data, but accessibility and reproducibility should also apply to data processing itself. Given the same raw data and the same parameters, it should be possible for any scientist, anywhere, to reproduce all processing steps and the final results independently.
  • Vendor-, technology-, and method-agnostic: high-resolution detectors differ fundamentally in dynamic range, resolution, mass accuracy, baseline shape, scan speed, and many other properties. Individual methods can further modify those characteristics. Once the data have been acquired, users should not need to worry about all those specifics; they should still obtain the best possible results without specialised training in either mass spectrometry or data processing.
  • Ease of use: what about users with no prior experience in LC-MS/MS data analysis and no coding skills? How steep is the learning curve?
  • Automation-ready: is it possible to integrate the pipeline into preexisting IT infrastructure, both locally and in the cloud?
  • Fast prototyping: we occasionally work with new technology, and new technology often requires new processing workflows. Recent examples include novel forms of DIA, GPU-centric annotation, and untargeted stable-isotope tracing. The ability to integrate new functionality quickly is a decisive factor for R&D.
  • Low demand for maintenance, support, and deployment. No, we do not have dedicated systems and software engineers to maintain scientific software. Hardly any academic lab does.

The additional complication is that some of these needs pull in opposite directions. We want something sensitive, but also scalable. We want something powerful, but also simple enough that not every user needs to become a domain expert in signal processing. We want something that can evolve quickly for R&D, but not in a way that makes routine analysis unmaintainable. We want something automated, but not a black box.

Hence, one of the most challenging requirements overall is to combine all of these needs in a single product: a workflow that performs beautifully on 20 samples but becomes unusable on 1000 is not enough. A workflow that discovers many features but cannot quantify them coherently is not enough. A workflow that generates a nice table but leaves no reproducible trail from raw data to result is not enough either.

2. The limitations of existing software

Given all these requirements, it is hardly surprising that a true jack-of-all-trades for untargeted metabolomics processing did not already exist. If we look at the software most commonly used in the field, we can find important pieces of the ideal solution here and there.

To date, mzmine comes closest because of its versatility, consistently solid performance, speed, scalability, and the possibility to share fully parameterised workflows. Unfortunately, many key features are only available under a yearly licence that can be expensive for academic labs. MSDIAL stands out for its user-friendliness and for unique functionality such as DIA analysis and extensive lipid-annotation support, including special dissociation modes such as OAD and EAD. In my hands, however, it scales poorly and struggles with annotation on messy real-life MS2 spectra. R-based ecosystems such as XCMS, Galaxy, RforMassSpectrometry, or patRoon, and Python-based wrappers such as pyOpenMS or emzed, offer more flexibility but require substantial prior knowledge and coding skills even to get started. In addition, dozens of tools developed by single academic groups—and often by single students—solve one specific part of the workflow extremely well, from MassCube to asari and khipu, ADAP-BIG, PeakDecoder, AnnoMe, etc. But stitching together a Frankenstein pipeline from many isolated code bases is not a realistic option for us.

On the vendor side, the offering is much thinner. The only package that clearly stands out to me is Compound Discoverer, which combines a polished interface, strong annotation tools, customisable workflows, and the possibility to code additional nodes. It is limited to Orbitrap and Astral data, of course, but Thermo still deserves credit for the quality of the design and the steady improvements introduced since version 3.x.

For most other vendors, the options range from essentially nothing to outsourced third-party solutions or intuitive but highly limited tools. The overall impression is that some MS vendors devote only modest resources to metabolomics data analysis. The main motivation seems to be filling a gap in their portfolio rather than seizing the opportunity to truly boost their hardware or solve users' concrete problems. We are told this is because the software team dedicated to metabolomics is very small, yet there are shining examples of better products driven by tiny teams, so maybe the problem lies elsewhere ;-).

3. Masstering untargeted metabolomics

For us, the only logical consequence was to start from scratch, beginning with the design principles that we considered essential for reaching our goals. We integrated and updated ideas from our earlier work on GC-TOF and QQQ data, and we ported some concepts from fiaMiner. A first attempt took the form of SLAW, in work led by Alexis Delabrière and published in 2023. SLAW was a real milestone. Alexis showed that multidimensional parameter optimisation can identify parameter sets that increase sensitivity without sacrificing robustness at scale, as described in our previous post and in the paper. This was demonstrated across different instruments and feature-detection algorithms, and also in other labs, and it opened the door to reduced subjectivity, greater scalability, automation, and improved ease of use.

This became especially clear when we looked at the centroid data produced by different instruments. The figure below compares the same sample acquired on an Orbitrap and on two TOF instruments from different vendors. For a small region of the RT × m/z space, we plotted all centroids coloured by intensity. The differences are striking. The Orbitrap data on the left look clean: peaks at constant m/z are clearly visible and the background is almost absent. The first TOF in the centre shows intensity-dependent shifts in m/z and possibly additional features, such as the red stripe at m/z 878.785. The second TOF retains a stable m/z at high intensity but shows much more low-intensity jitter and a more uniform scatter of sporadic centroids. Distinguishing noise from real peaks therefore requires parameters that are adapted to each instrument and LC system, not least because the LC setup affects the number of scans across a chromatographic peak.

Comparative analysis of LC-MS centroid data from various instruments showcasing differences in peak clarity and background noise. The left panel is from an Orbitrap instrument, the other two panels were run on TOF instruments from different vendors.
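For readers who want to produce a similar diagnostic plot from their own data, a minimal sketch is shown below. It assumes the data have been exported to mzML (so it does not reflect the direct raw-file access described later) and that retention times are stored in minutes; the file name and the RT/m/z window are placeholders.

```python
# Minimal sketch: plot all MS1 centroids in a small RT x m/z window,
# coloured by log10(intensity), to compare instruments visually.
# Assumes an mzML export and RT stored in minutes; "sample.mzML" and the
# window limits are placeholders.
import numpy as np
import matplotlib.pyplot as plt
from pyteomics import mzml

RT_WIN = (410.0, 470.0)   # seconds, placeholder
MZ_WIN = (878.5, 879.1)   # m/z, placeholder

rts, mzs, ints = [], [], []
with mzml.MzML("sample.mzML") as reader:
    for spec in reader:
        if spec.get("ms level") != 1:
            continue
        rt = float(spec["scanList"]["scan"][0]["scan start time"]) * 60.0
        if not (RT_WIN[0] <= rt <= RT_WIN[1]):
            continue
        mz, inten = spec["m/z array"], spec["intensity array"]
        keep = (mz >= MZ_WIN[0]) & (mz <= MZ_WIN[1]) & (inten > 0)
        rts.extend([rt] * int(keep.sum()))
        mzs.extend(mz[keep])
        ints.extend(inten[keep])

plt.scatter(rts, mzs, c=np.log10(ints), s=4, cmap="viridis")
plt.xlabel("retention time [s]")
plt.ylabel("m/z")
plt.colorbar(label="log10 intensity")
plt.show()
```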

The downside was that the search for optimal parameters could take several hours, even for small studies, and it retained a stochastic component that led to slightly different outcomes from run to run. It could also get trapped in local optima, and troubleshooting was cumbersome. After two years of testing, we concluded that this complex multidimensional optimisation could be avoided almost entirely by (i) relying on algorithms that are intrinsically robust and require only a small set of meaningful parameters and (ii) introducing ways to infer those parameters deterministically from the data.

One of the key lessons that followed was that relying on vendor-provided centroid LC-MS data can produce worse results—at least for TOF detectors. If we want to maximise sensitivity, we need to understand the nature of the noise. It turns out that many things go wrong in the low-intensity regime close to the baseline. A simple visual comparison between profile and centroided data often reveals that artefacts and jitter are introduced by centroiding itself, whereas the profile data are much more homogeneous and, at least visually, real peaks remain clearly discernible.

Most feature-detection algorithms, however, do not actually use a two-dimensional view of the data. Some do, borrowing ideas from the broader computer-vision community such as OpenCV or CNN-based approaches, but those methods have not struck us as especially efficient in practice. Most feature-detection algorithms instead work sequentially, one scan at a time, by searching for stretches of consecutive centroids or by walking the data in order of intensity. Once you understand how peak detection works, it becomes easier to see which centroiding artefacts are tolerable and which are toxic.

Centroiding and peak picking should therefore not be treated as independent steps. The final result depends on their interplay. This is one of the critical tricks for detecting true peaks close to the baseline, and it is the main reason why we stopped relying on msconvert and vendor centroiding. Orbitrap data are the main exception, because they are intrinsically cleaner and easier to analyse.

4. Dealing with raw data

Our processing starts from proprietary raw data, but the workflow does not merely bypass centroiding in msconvert. Whenever possible, we skip conversion to intermediate profile mzML (or mz5) altogether and work directly with the DLLs distributed through ProteoWizard. At the moment, this is only possible on Windows, but it brings several transformative advantages:

  • Speed: every LC-MS/MS file contains MS2 scans that are redundant in DDA or irrelevant, even empty, in DIA. Querying specific MS2 spectra only when they are needed greatly accelerates analysis. For example, we can extract only MS1 scans for feature detection and defer MS2 extraction until it is clear which scans are interesting because they contain the best available spectrum for a relevant feature. We gain time by avoiding file conversion, by indexing the file much faster than mzML allows, and by processing only the scans that actually matter. As we will see later, the gain is especially large for DIA (a sketch of this lazy-access pattern follows this list).
  • Small memory footprint: if profile retrieval from storage and on-the-fly centroiding are fast enough, there is little reason to cache MS2 data in memory. Disabling that cache reduces memory demand by roughly one to two orders of magnitude, which in turn allows parallel processing of many files – typically six to twelve, depending on available I/O bandwidth – with nearly proportional speed-up on large studies.
  • Possibility to differentiate centroiding between MS1 and MS2 scans: MS1 and MS2 spectra have different properties, so we can tailor centroiding and baselining to the needs of feature detection on the one hand and MS2-based identification on the other.
  • Information content: direct access to the raw files preserves a wealth of information about the instrument, acquisition method, and scan-level metadata that is stored in the proprietary files but not exported to mzML. NOTE: this capability is only for internal use, and not available in the public distribution.
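To make the lazy-access idea concrete, here is a deliberately simplified sketch. RawReader stands in for our wrapper around the vendor DLLs shipped with ProteoWizard (not shown), and detect_features / select_best_ms2 are dummy placeholders; none of this is a real, public API. The only point is the order of operations.

```python
# Hypothetical sketch of the lazy-access pattern: index the file once,
# run feature detection on MS1 scans only, and pull (and centroid) MS2
# spectra only for the scans that are actually needed.
def detect_features(reader, ms1_scans):
    # placeholder for the full single-sample feature detection step
    return [{"mz": 878.785, "rt": 432.1}]

def select_best_ms2(features, ms2_scans):
    # placeholder: keep, per feature, the single best matching MS2 scan
    return ms2_scans[:1]

def process_file(reader):
    index = reader.scan_index()                       # cheap: metadata only
    ms1 = [s for s in index if s["ms_level"] == 1]
    ms2 = [s for s in index if s["ms_level"] == 2]

    features = detect_features(reader, ms1)           # MS1 only
    wanted = select_best_ms2(features, ms2)           # decide what matters

    # only now retrieve the few MS2 spectra that are worth the effort
    spectra = {s["index"]: reader.profile_spectrum(s["index"]) for s in wanted}
    return features, spectra
```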

Vendors differ greatly in how they support direct data access. Some are pleasant to work with and provide detailed API documentation. Others are more neutral, but their libraries are still easy to inspect. In some cases, the file format is generic enough to be almost self-explanatory, as in the case of SQLite-based containers. A few vendors actively discourage this kind of access. In practice, we found the libraries and EULA terms distributed with ProteoWizard sufficient to access all information we needed.

The second component we built was fast centroiding. At present, pulling a random profile spectrum from a locally stored file and centroiding it in Python takes less than one millisecond. Caching of MS2 data in memory is disabled by default.
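As a rough illustration of what such a centroiding step involves, the sketch below converts a single profile spectrum into centroids by detecting local maxima and computing an intensity-weighted m/z around each apex. It is a simplification (fixed prominence threshold, no baseline model), not the routine we actually use.

```python
import numpy as np
from scipy.signal import find_peaks

def centroid_profile(mz, intensity, min_prominence=100.0, half_window=3):
    """Convert one profile spectrum into centroids (simplified sketch).

    min_prominence is a placeholder noise threshold in counts; the returned
    array has one row per centroid: intensity-weighted m/z, summed intensity.
    """
    apex_idx, _ = find_peaks(intensity, prominence=min_prominence)
    centroids = []
    for i in apex_idx:
        lo, hi = max(0, i - half_window), min(len(mz), i + half_window + 1)
        w = intensity[lo:hi]
        centroids.append((np.average(mz[lo:hi], weights=w), float(w.sum())))
    return np.array(centroids)

# usage, assuming mz/inten are the arrays of one profile MS1 scan:
# peaks = centroid_profile(mz, inten, min_prominence=50.0)
```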

5. Feature detection at the single-sample level

Single LC-MS runs are analysed independently to generate an inclusive list of features with extensive metadata, together with a catalogue of the MS2 scans available in each file.

We tested several feature-detection algorithms in combination with multiple centroiding strategies and found FeatureFinderMetabo from the openMS library to be the best overall candidate. It exposes dozens of parameters, but the vast majority can be fixed at constant values while still delivering consistently sensitive and precise detection of features and isotopes across all tested sample types. In practice, only a handful of parameters remain critical, notably the noise level in counts and the expected chromatographic FWHM in seconds. Because those parameters have a clear physical meaning and can be inferred directly from the data, we can avoid the tedious optimisation strategy introduced in SLAW. The resulting performance is excellent across the full dynamic range.

Feature detection of a lipid extract analysed on a SCIEX 8600 ZenoTOF system. The colormaps show all centroids in MS1 scans, with colors ranging over 6 orders of magnitude (!!). Green points indicate the monoisotopic peak of identified features. Grey dots and lines outline the isotopomers.
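For reference, the public pyOpenMS chain behind FeatureFinderMetabo looks roughly like the sketch below. The two values set explicitly, the noise threshold in counts and the expected chromatographic FWHM in seconds, are placeholders here; in MASSter they are inferred from the data rather than guessed by the user.

```python
import pyopenms as oms

exp = oms.MSExperiment()
oms.MzMLFile().load("sample.mzML", exp)
exp.sortSpectra(True)

# 1) mass-trace detection
mass_traces = []
mtd = oms.MassTraceDetection()
p = mtd.getDefaults()
p.setValue("noise_threshold_int", 500.0)   # placeholder noise level [counts]
mtd.setParameters(p)
mtd.run(exp, mass_traces, 0)

# 2) elution-peak detection
traces_split = []
epd = oms.ElutionPeakDetection()
p = epd.getDefaults()
p.setValue("chrom_fwhm", 5.0)              # placeholder expected FWHM [s]
epd.setParameters(p)
epd.detectPeaks(mass_traces, traces_split)

# 3) assembly of deisotoped features
features = oms.FeatureMap()
chroms = []
ffm = oms.FeatureFindingMetabo()
ffm.run(traces_split, features, chroms)
print(features.size(), "features detected")
```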

Feature detection returns deisotoped feature lists enriched with a wide range of metadata, including convex hulls, quality indicators, and the precursor-ion EIC. We then search among co-eluting features for adducts and in-source fragments, including correlation analysis of precursor EICs, and finally build a catalogue of all MS2 scans linked to MS1 features in the file. For each feature, we retrieve and store only the best MS2 spectrum. This is not a limitation of the architecture but a deliberate design choice that simplifies downstream processing while preserving performance. All redundant or unlinked MS2 spectra are ignored, and no time is wasted centroiding irrelevant MS2 data. The catalogue is sufficient to know what MS2 evidence exists without paying the full computational cost of digesting all fragment data prematurely. MS2 extraction can always be resumed later, once fragmentation data are actually needed and the relevant scans are known.
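The correlation part of the adduct and in-source fragment search can be sketched as follows: if two co-eluting features show an m/z difference matching a known adduct pair and their precursor EICs correlate strongly, they are grouped. The interpolation grid and thresholds below are illustrative, not our defaults.

```python
import numpy as np

def eic_correlation(eic_a, eic_b, grid_step=0.2):
    """Pearson correlation of two precursor EICs on a common RT grid.

    Each EIC is a (rt, intensity) tuple of arrays; both traces are
    interpolated onto the overlapping RT range before correlating.
    """
    lo = max(eic_a[0].min(), eic_b[0].min())
    hi = min(eic_a[0].max(), eic_b[0].max())
    grid = np.arange(lo, hi, grid_step)
    if grid.size < 3:
        return 0.0
    a, b = np.interp(grid, *eic_a), np.interp(grid, *eic_b)
    if a.std() == 0 or b.std() == 0:
        return 0.0
    return float(np.corrcoef(a, b)[0, 1])

# e.g. [M+Na]+ vs [M+H]+ of the same compound: a delta m/z of 21.9819 and a
# strongly correlated EIC support grouping the two features
# if abs(dmz - 21.9819) < 0.005 and eic_correlation(eic1, eic2) > 0.9: ...
```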

All results are stored in a custom HDF5 file. We chose HDF5 to guarantee selective retrieval of data and interoperability across Python versions and, potentially, with other software. Each file stores a pointer to the raw data, serialised spectra and chromatograms, and the full processing history together with all parameters: an essential precondition for the FAIRification of data processing.
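A stripped-down illustration of that layout with h5py is shown below; group names, datasets, and attributes are simplified examples, not the actual schema of our result files.

```python
import json
import h5py
import numpy as np

with h5py.File("sample_results.h5", "w") as h5:
    h5.attrs["raw_file"] = r"D:\data\sample_001.raw"        # pointer to raw data
    h5.attrs["software"] = "feature_detection v1.0"         # placeholder

    feats = h5.create_group("features")
    feats.create_dataset("mz", data=np.array([878.7851, 760.5851]))
    feats.create_dataset("rt", data=np.array([432.1, 388.7]))
    feats.create_dataset("area", data=np.array([1.2e6, 3.4e5]))

    # full processing history with all parameters, serialised step by step
    hist = h5.create_group("history")
    hist.attrs["step_001"] = json.dumps(
        {"step": "centroiding", "noise_threshold": 50, "date": "2024-05-01"}
    )
```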

6. Merging single samples into aligned studies

After all single LC-MS files have been analysed separately, we merge them into a study. The task is to determine which features across different files correspond to the same underlying signal and group them into study-level super-features, which we call consensus features. In principle, that linking can rely on similar m/z values, retention time or convex hull overlap, intensity patterns, isotopic patterns, MS2 spectra, EICs, and more. Conceptually, each layer contributes a similarity matrix over pairs of features from different samples. Together, these matrices form a weighted hypergraph that must be pruned and clustered to identify consensus features.

This is by far the most complex task of the entire workflow, because:

  • The size of the problem scales badly with the number of features, samples, and evidence layers. As soon as one works with more than a handful of samples, avoiding a combinatorial explosion becomes critical. At 1000+ files, we have to deal with one to ten million features or more, and naïve O(n²)-style merging becomes a dead end. Low algorithmic complexity and careful memory use are therefore essential.
  • Starting at roughly 300–1000 files, simply loading all data into memory for computation becomes challenging in its own right.
  • This is also where many untargeted workflows quietly decide what biology you are still allowed to see. Abundant features found in nearly every file are rarely the main challenge. The difficult part is preserving rare or subgroup-specific features while still merging the study into something coherent. If merging is too aggressive, the rare material disappears. If it is too permissive, the consensus becomes noisy and unstable. When rare features are scientifically important, pruning must be adapted accordingly.

This is a difficult problem, and each software package addresses it differently. What works beautifully on 10 samples may collapse on 100. We ported several algorithms and chunking strategies, optimised them for speed and scalability, and tested them on studies ranging from 10 to 3000 samples. MASSter automatically chooses the merging method by default—although the user can override it—and it infers most parameters automatically. In practice, only one parameter remains genuinely critical for the user: the minimum frequency required for a consensus feature across the study, which depends on the study design and the scientific question.
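To make the task concrete, here is a deliberately naïve sketch of the simplest possible merging strategy: sort all features from all samples by m/z, grow groups within m/z and RT tolerances, and keep only groups that reach the minimum frequency. MASSter uses different algorithms and chunking precisely to avoid the scaling pitfalls discussed above; the tolerances here are placeholders.

```python
import numpy as np

def naive_merge(features, mz_ppm=10.0, rt_tol=5.0, min_freq=0.5):
    """Group single-sample features into consensus features (naive sketch).

    `features` is a list of dicts with keys: sample, mz, rt, area.
    Sort by m/z, grow a group while m/z and RT stay within tolerance, and
    keep only groups observed in >= min_freq of the samples.
    """
    if not features:
        return []
    n_samples = len({f["sample"] for f in features})
    feats = sorted(features, key=lambda f: f["mz"])
    groups, current = [], [feats[0]]
    for f in feats[1:]:
        ref = current[0]
        same_mz = abs(f["mz"] - ref["mz"]) <= ref["mz"] * mz_ppm * 1e-6
        same_rt = abs(f["rt"] - np.median([c["rt"] for c in current])) <= rt_tol
        if same_mz and same_rt:
            current.append(f)
        else:
            groups.append(current)
            current = [f]
    groups.append(current)
    return [g for g in groups
            if len({f["sample"] for f in g}) >= min_freq * n_samples]
```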

In MASSter, merging can be preceded by a generic RT pre-alignment step, which is only needed when major chromatographic drift occurred during the sequence. Smaller shifts are handled directly by the merging algorithms. After merging, MASSter consolidates adduct detection by combining single-file evidence with correlations observed across the study. The catalogue of MS2 scans is updated across all files. The best MS1 and MS2 data are propagated from the individual files to the consensus features, and numerous additional quality metrics are computed at study level.

As for single files, we built an HDF5-based format to store study results, including the full history of processing steps and all associated parameters.

7. Reproducible quantification: merging untargeted discovery with targeted logic

This is the step where untargeted metabolomics has, in my opinion, too often accepted a lower standard than targeted workflows.

In targeted analyses, quantification is performed on one-dimensional chromatograms extracted from MS1 or MS2 data, for example MRM traces. Peak boundaries are determined either by explicit rules or by fitting a shape model. Users can customise the model and fine-tune parameters for each compound, and in the end the exact integration limits can still be edited manually.

Some software even offers AI-assisted peak recognition, with good results provided that the training data are large and representative. But even if the training burden is ignored, AI-assisted integration has a much higher computational cost than rule-based approaches. That is irrelevant for a few dozen compounds across a few dozen samples; it becomes a real bottleneck once one has to process more than a million traces, which is common in untargeted metabolomics.

To keep computational complexity within reasonable bounds, untargeted workflows often simplify integration by using peak height as a proxy or by summing all intensities within a broad expected region of the MS1 data. In SLAW, for example, we optimised the tolerances that define the extraction box so as to obtain similar results across QC samples. The problem appears when peak shapes vary in study samples because of tailing, crowded regions, or partial overlap. In those cases the apex intensity no longer reflects the amount faithfully, and blind summation can either trim part of the peak, causing underestimation, or include neighbouring compounds, causing overestimation.

The alternative is to rely on the integration already performed during single-file feature detection, before features are aligned and linked at study level. That comes at no extra computational cost, but it suffers from the fact that peak-detection algorithms are optimised for a different task. The problem is especially visible for partially overlapping isomers, which may be resolved as distinct features in one sample but merged into a single unsplit peak in the next because of small differences in relative intensity or valley depth. One file may then report a three-second-wide peak, while the next reports a seven-second peak because neighbouring isomers were not split. Both will still be linked to the same consensus feature, but their areas will not be comparable. This is fundamentally a study-level problem, and it can only be solved by ensuring that quantification across all samples follows a comparable logic.

To the best of my knowledge [please correct me!], most widely used untargeted software still quantifies all detectable features using heights, crude sums, or sample-level integration inherited from the feature-detection step. The resulting data are therefore approximations and do not fully meet the standard of targeted workflows. That may be acceptable when untargeted analysis is used primarily for discovery. But any downstream statistical analysis benefits from greater measurement precision: lower coefficients of variation make it easier to detect smaller changes. So, can we do better?

Our answer was to merge the two logics. Once consensus features have been discovered, we switch to a one-dimensional, chromatography-centric view and quantify every feature across every sample in that space. Information from the consensus feature is passed to a deterministic 1D integrator, which uses it as a seed and then refines the integration while accounting for RT drift, peak tailing, neighbouring peaks, and baseline shifts. The 1D integrator does not rely on AI.
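A minimal skeleton of the seeding idea is shown below: the consensus feature provides approximate boundaries, the integrator walks each EIC outwards from those seeds until it reaches a local valley or the baseline, and the area is computed with the trapezoid rule. The production integrator handles drift, tailing, and baseline estimation much more carefully; this only shows the principle.

```python
import numpy as np

def integrate_seeded(rt, intensity, seed_lo, seed_hi, baseline=0.0):
    """Refine consensus-seeded boundaries on one EIC and integrate (skeleton).

    rt, intensity: arrays of one extracted ion chromatogram.
    seed_lo, seed_hi: approximate boundaries [s] propagated from the
    consensus feature.
    """
    lo = min(int(np.searchsorted(rt, seed_lo)), len(rt) - 1)
    hi = min(int(np.searchsorted(rt, seed_hi)), len(rt) - 1)
    # push the left boundary outwards while the signal keeps falling
    while lo > 0 and baseline < intensity[lo - 1] < intensity[lo]:
        lo -= 1
    # push the right boundary outwards while the signal keeps falling
    while hi < len(rt) - 1 and baseline < intensity[hi + 1] < intensity[hi]:
        hi += 1
    area = np.trapz(np.clip(intensity[lo:hi + 1] - baseline, 0, None), rt[lo:hi + 1])
    return float(area), (float(rt[lo]), float(rt[hi]))
```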

The hybrid workflow also improves the quantification of features that were initially missed at single-file level. Gap filling is performed after merging: we recurse into the raw-data information stored in HDF5, extract EICs for all missing values, and then integrate them with the same reproducible 1D logic.

By feeding consensus information into the 1D integrator, we obtain areas that are genuinely comparable across all samples. In other words, the result approaches the standard of a targeted workflow while remaining extremely fast to compute (more on this in Part II).

Example of reproducible integration for 3 isobaric lipids. After merging and consensus feature detection, the 1D-integrator is called with the consensus information. In the figure, this is shown as black vertical bars pointing to the left and right tails of the peak that needs to be integrated. The integrator then refines peak identification by walking each individual EIC. The final limits (dashed lines) vary across samples to account for neighbouring peaks.

We will return later to QC and QA of the integration step. For us, it was essential to provide an efficient way for users to inspect and, when needed, modify the integration. Quantification software such as Skyline offers dedicated views and navigation tools for reviewing results. Untargeted software is generally less generous: some visualisation is possible, but side-by-side comparison and correction are rarely supported.

To address this need, we integrated a graphical dashboard that runs in the browser and allows users, for any consensus feature, to inspect all linked chromatograms rapidly and edit the integration settings either for the whole batch—which is the preferred mode—or, if absolutely necessary, for individual samples. The updated settings are applied on the fly, and all peaks, areas, and QC plots are refreshed immediately.

8. Identification

Let’s start by debunking a myth: an MS2 or RT match between an unknown feature and a pure compound is NOT a proof of identity and says NOTHING about confidence. The confusion arose largely because the metabolomics reporting initiatives of Sumner et al. (2007) and Schymanski et al. (2014) used the term “confidence” to describe the type of evidence available rather than the probability that a structural statement is true.

Take a simple example. If we state that a peak XY is glucose because it matches the MS2 spectrum and RT of the pure compound, we are implicitly claiming that no other plausible hexose (fructose, galactose, …) could explain the data equally well. That conclusion is justified only if the RT and MS2 spectra of all plausible alternative candidates are sufficiently different and have actually been considered. In most real cases, this is not true, either because the measurement is not selective enough or because we do not have the pure standards required even to test the alternatives.

In the context of peak identification, confidence…

  • … measures the reliability, certainty, or probability of a statement about the structural properties of a feature.
  • … can vary with the level of detail of the statement. I can be 100% sure about a formula, 100% sure that a molecule contains a phosphate group, 80% sure that it is a hexose monophosphate, 80% sure that the phosphate sits at position 1, and only 6% sure that the compound is glucose-1-phosphate.
  • … depends on whether plausible alternatives, especially isomers, can be excluded, as suggested by Metz et al., 2025.
  • … depends on the chemical space that is actually evaluated. If we only search the molecules available in the lab or those for which we have MS2 data, our apparent confidence may be artificially inflated.

To provide a fair assessment of annotation confidence, we follow a few simple principles:

  • We explore a chemical space of more than one million structures for both metabolites and lipids, including many types of isomers.
  • We share the information about the chemical space that was considered through public channels.
  • We enumerate all compounds that remain compatible with the available data instead of stopping at the first plausible hit.
  • If many candidates remain – sometimes tens or hundreds – we try to report a name that represents the whole class rather than one overly specific structure. In other words, we often increase confidence by reducing the level of structural detail.

The last point is simply illustrated with lipids, which we annotate with LipidOracle. We sequentially attempt to identify the class from headgroup-specific fragments, then the acyl chains, then the regiochemistry, and finally double-bond positions. At each level, confidence can be evaluated by counting how many isomers can actually be excluded because they fail to match the MS2 data. Whenever possible, we aggregate the result to report less structural detail but at higher confidence. In the end, we retain only identifications that pass the chosen confidence threshold.

Flowchart illustrating the step-by-step confidence assessment (in violet) for lipid identification using EAD and CID diagnostic rules.
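In code, the exclusion logic behind this confidence assessment can be sketched as below. The candidate structures and diagnostic fragment masses are invented for illustration; the actual CID/EAD rules in LipidOracle are far richer.

```python
# Invented example: three candidate PC isomers and the diagnostic fragments
# (m/z) each would require; `observed` is what matched in the MS2 scan.
candidates = {
    "PC 16:0/18:1": {184.0733, 255.2330, 281.2486},
    "PC 18:1/16:0": {184.0733, 255.2330, 281.2486},
    "PC 17:0/17:1": {184.0733, 269.2486, 267.2330},
}
observed = {184.0733, 255.2330, 281.2486}

def compatible(required, observed, tol=0.005):
    return all(any(abs(f - o) <= tol for o in observed) for f in required)

surviving = [name for name, frags in candidates.items()
             if compatible(frags, observed)]

if len(surviving) == 1:
    print(surviving[0], "- single surviving candidate")
elif surviving:
    # several sn-isomers survive: drop the regiochemistry and report the
    # level at which all surviving candidates agree
    print("PC 16:0_18:1 (sn-positions undetermined),",
          len(surviving), "isomers not excluded at the detailed level")
```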


For non-lipids, we rely on tima – taxonomically informed metabolite analysis (by Adriano Rutz) – to integrate MS1 and MS2 library searches (against virtually all public experimental MS2 libraries and in silico predictions for >1 Mio compounds), CSI::FingerID, Classyfire, internal RT libraries, etc., into a coherent assessment of putative identities.

We will not attempt to explain here how tima works, but note the t (taxonomy) term, which allows us to adjust scores depending on the nature of the sample. We do not restrict the chemical search space (it always spans >1.3 Mio compounds), but knowing that the sample was a yeast extract or a urine sample allows us to rerank isomers, demoting compounds that are unlikely to be present because they come from an unrelated taxon (such as a plant secondary metabolite).

Overview of metabolite classification metrics using taxonomic distance, physico-chemical consistency, and spectral similarity for enhanced metabolomics analysis (Figure by Adriano Rutz)
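The reranking idea (sketched below in Python for illustration; this is not tima's actual code or API) is that candidates keep their spectral score, but a penalty that grows with the taxonomic distance between the candidate's known producers and the sample's taxon is subtracted before re-sorting.

```python
# Illustration of taxonomically informed reranking; scores, distances, and
# the penalty weight are invented placeholders, not tima's implementation.
candidates = [
    {"name": "candidate A (yeast metabolite)",  "spectral_score": 0.78, "tax_distance": 0},
    {"name": "candidate B (plant alkaloid)",    "spectral_score": 0.81, "tax_distance": 4},
    {"name": "candidate C (bacterial product)", "spectral_score": 0.75, "tax_distance": 2},
]

PENALTY_PER_STEP = 0.05   # placeholder weight of the taxonomy term

for c in candidates:
    c["final_score"] = c["spectral_score"] - PENALTY_PER_STEP * c["tax_distance"]

# for a yeast extract, candidate A now outranks the plant alkaloid despite
# its slightly lower spectral score
for c in sorted(candidates, key=lambda c: c["final_score"], reverse=True):
    print(f'{c["name"]:38s} {c["final_score"]:.2f}')
```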

9. Quality control

One important question is quality assurance and control, all the more so in untargeted metabolomics, where sensitivity is pushed to the extreme to obtain the deepest possible coverage.

Sample-centric exploratory analyses, such as the one shown below, are commonly integrated into metabolomics software. They are suited for identifying problematic samples or obtaining a sense of drifts that might require normalisation. They are, however, ineffective at diagnosing data quality or processing issues.

Sample-centric unsupervised analysis of sample quality, including PCA, HCA, TIC along the injection order, RT correction, and parallel plots of sample metadata.

To assess the underlying quality of the data, we need specific metrics. In commonly used metabolomics software, QA/QC options remain rather limited. We wanted to go further, even if initially only for research purposes. In SLAW we introduced quantitative metrics for the reproducibility of alignment and peak detection, but many more dimensions can be evaluated: from individual spectra and chromatograms to single features, samples, studies, and even comparisons across studies.

At present we calculate roughly 20 quantitative metrics to track different aspects of processing, for example:

  • At the centroiding level, we track signal-to-noise, prominence, and related descriptors.
  • At the peak-detection level, we evaluate, among other things, how often 13C traces are recovered, signal-to-noise and prominence in the time domain, EIC coherence, noise, and baseline behaviour.
  • At the RT-alignment level, we monitor the MAE and shape of the RT correction as well as the tightness of linked features.
  • At the consensus level, we examine cluster robustness through metrics such as cluster size, hull size, and the reproducibility of boundaries across linked features.
  • At the quantification level, we assess boundary deviations, baseline behaviour, EIC shape, symmetry, and related descriptors for all individual chromatograms.

This provides a wealth of metadata, as shown below for the consensus features of a small lipidomics experiment:

The main lesson learned so far is that none of these QC/QA metrics should be used to select good features with strict cutoffs. Strict cutoffs are attractive because they are simple, transparent, and easy to automate. The problem is that they mostly reward ideal, intense, beautifully shaped peaks. Real biology does not always look like that. The vast majority of features detected in untargeted data are far from ideal, simply because of noise, deviations in peak shape, undersampling, …

For this reason, our default strategy is to filter at multiple levels but use only very permissive cutoffs, as low as 20-30% of the maximum value, to remove all features that are clearly inadmissible or non-quantifiable. The user can still apply stringent cutoffs in the downstream analysis.
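Expressed as code, this permissive filter amounts to something like the sketch below; the metric names and the 25% fraction are placeholders for the roughly 20 metrics described above, and the convention is that higher values mean better quality.

```python
import numpy as np
import pandas as pd

def permissive_filter(features: pd.DataFrame,
                      metrics=("snr", "eic_coherence", "boundary_reproducibility"),
                      fraction=0.25) -> pd.DataFrame:
    """Drop a consensus feature only if a metric falls below a small
    fraction of that metric's maximum across the study (higher = better)."""
    keep = np.ones(len(features), dtype=bool)
    for m in metrics:
        keep &= features[m].to_numpy() >= fraction * features[m].max()
    return features[keep]

# stringent, hypothesis-specific cutoffs can still be applied downstream
```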

End of Part I

You can find Part II here.

