The fun part of analysing untargeted metabolomics data

This is the second part of our mini-series on how we analyse untargeted metabolomics data. After outlining the needs and the core processing logic in Part I, we now turn to how all the parts come together with orchestration, user experience, zero-knowledge processing, FAIR reporting, benchmarking, and automation. Spoiler: it’s actually a lot of fun!

10. One architecture to serve them all?

One of the challenges of untargeted data processing is the diversity of users it has to serve. In practice, we distinguish three broad user categories:

  1. Developers, who need to inspect data in detail, assemble new workflows, and iterate quickly during prototyping and testing.
  2. Analysts: scientists such as chemists, biologists, and students who operate the instruments and need to analyse their data with state-of-the-art methods and resources. We assume they have limited programming skills, or at least better things to do than learning about configuration file syntax, parameter names, and the like.
  3. Collaborators: scientists who care mainly about the results and how to interpret them, much less about any technicality of metabolomics or its processing.

These three groups differ in skills, motivation, and willingness to invest time in metabolomics software. Looking at the current landscape of metabolomics software, it becomes clear that certain products are better suited to specific user groups. That raises an obvious question: can one framework really serve all of them? Or does one inevitably end up forking into separate tools for developers, analysts, and collaborators?

This question brings us back to the broader requirements listed in Part I, which were already demanding and often contradictory. We wanted something sensitive but robust, scalable but easy to use, flexible enough for R&D yet deterministic enough for routine work, and suitable both for interactive exploration and for headless execution. Above all, we wanted the whole chain to behave like a single analytical system, from raw data to quantitative tables, rather than as a collection of tools connected mostly by user patience. These constraints shaped the architectural choices summarized below.

We chose to build on Python. Yes, the usual objections are well known: Python is slow, dynamically typed, and too high-level for serious data processing. None of these arguments seemed fatal. We built around efficient libraries such as NumPy, Numba, and Polars, took advantage of vectorization and SIMD-friendly code paths where it mattered, and – as we will see later – speed is not the problem. In exchange, Python gave us readability, composability, rapid prototyping, a rich ecosystem, and a single language that can serve developers, analysts, collaborators, and automated pipelines. That was especially valuable after years of living with more heterogeneous and less maintainable stacks.

Headless operation was essential. We therefore built the codebase around the key concepts of sample and study, each with methods for loading, saving, processing, exporting, and visualising data and results. Around these we added accessory classes for spectra, chromatograms, libraries, and related objects. The goal was not object-oriented beauty for its own sake, but a structure that matches the actual logic of the analysis.
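To make that structure concrete, here is a minimal, hypothetical sketch of how sample- and study-level objects with such methods could fit together. The class names, methods, and parameters below are placeholders for illustration, not the actual MASSter API.

```python
from dataclasses import dataclass, field

# Placeholder skeleton for illustration only; not the real MASSter classes.

@dataclass
class Sample:
    raw_path: str
    features: list = field(default_factory=list)

    def process(self, noise_level: float, peak_width: float) -> None:
        # per-sample centroiding and feature detection would happen here
        ...

    def save(self, cache_dir: str) -> None:
        # persist intermediate results (e.g. HDF5) for selective reloading later
        ...

@dataclass
class Study:
    samples: list = field(default_factory=list)

    def add(self, sample: Sample) -> None:
        self.samples.append(sample)

    def align_and_merge(self) -> None:
        # retention-time alignment and merging into consensus features
        ...

    def export(self, out_dir: str) -> None:
        # tables, spectra, plots, and reports for downstream use
        ...
```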

Information vs. data flow

To deal efficiently with the sheer volume of data, each processing step extracts only new information and passes forward only what is needed. Duplication is avoided at both the data-object and processing levels: a scan is normally never retrieved – and never centroided – twice. At the same time, all objects remain linked, from single spectra to consensus features. For developers especially, this matters because it remains possible to jump back to the raw proprietary data or to the intermediary HDF5 objects and selectively reconstruct the exact layer of interest. Since the link to the raw data is never broken, the workflow can always recurse back to the source when needed and reconstruct non-cached data at any level.

Data flow and information flow in MASSter. Each sample is processed independently and in parallel, while only the essential metadata are passed on to the Study object. Bulky data such as MS2 spectra and EICs are retrieved on demand, only for the features and samples where they are actually needed.
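The on-demand retrieval mentioned above can be pictured as a simple lazy-loading pattern: bulky objects are fetched only when first requested and then cached, so the same scan is never read or centroided twice. The sketch below is a generic illustration with invented function names, not MASSter code.

```python
from functools import lru_cache

def _read_ms2_from_store(sample_path: str, feature_id: int) -> list:
    # placeholder for the expensive part: recursing into raw or HDF5 data,
    # centroiding, and assembling the spectrum
    return []

@lru_cache(maxsize=4096)
def get_ms2(sample_path: str, feature_id: int) -> tuple:
    # the first call pays the retrieval cost; repeated calls hit the cache
    return tuple(_read_ms2_from_store(sample_path, feature_id))
```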

This architecture has one immediate consequence: a small memory footprint. That brings two major benefits. First, it becomes possible to process multiple samples in parallel, with disk access as the main practical limitation. Second, parsimonious scan handling makes DIA much cheaper than one might expect. As we’ll see later in the benchmarks, we can neglect >99% of the MS2 scans and close the gap between DIA and DDA processing. Together, a small memory footprint and smart data retrieval pave the way to large-scale DIA studies.
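Because each sample passes only compact metadata forward, per-sample processing parallelises naturally. The following sketch shows the generic pattern with the standard library; the worker function, its return value, and the input location are assumptions for illustration.

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def process_one(raw_file: Path) -> dict:
    # Placeholder worker: in practice this would run centroiding and feature
    # detection, then return only compact per-sample metadata.
    return {"file": raw_file.name, "n_features": 0}

if __name__ == "__main__":
    raw_files = sorted(Path("raw_data").glob("*.mzML"))   # assumed input folder
    with ProcessPoolExecutor(max_workers=8) as pool:      # disk access is the practical limit
        results = list(pool.map(process_one, raw_files))
    print(f"processed {len(results)} samples")
```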

API first, GUI second

A fundamental architectural decision was to make scripts and the API the foundation of everything. If we are serious about reducing subjectivity and enabling automation, then the analysis must not depend on hidden clicks, transient UI states, or decisions that become impossible to reconstruct a week later. A GUI is seductive because it feels intuitive: one sees buttons, moves sliders, and gets the impression that the analysis is more accessible. But that convenience comes at a price. GUIs are exceptionally good at hiding state, encouraging trial-and-error parameter fiddling, and making it difficult to describe exactly how a result was obtained. If the goal is reproducibility, deterministic behavior, and low bias, then every meaningful parameter must exist explicitly, not as a hidden consequence of a sequence of clicks.

That is why headless execution is treated as the default case rather than as a special mode. A complete workflow must be runnable without any user interaction, with all parameters available from code or configuration files. A script must be sufficient to rerun the same logic tomorrow, on another computer, or within an ETL pipeline. This is not only a technical preference. It is a scientific one, and we already moved in that direction with SLAW, LipidOracle, and tima through Docker containers and explicit YAML configurations.
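A headless run therefore boils down to a small entry point where every meaningful parameter is explicit and versionable. A minimal sketch of that pattern, with invented parameter names and a JSON configuration, is shown below.

```python
import json
import sys

# Every parameter lives in code or in a config file, never in transient UI state.
# Parameter names and defaults are illustrative only.
DEFAULTS = {"noise_level": 100.0, "peak_width_s": 6.0, "min_samples": 30}

def main(config_path: str) -> None:
    with open(config_path) as fh:
        params = {**DEFAULTS, **json.load(fh)}   # explicit overrides only
    print("running with:", json.dumps(params, indent=2))
    # ...call the actual processing workflow with `params` here...

if __name__ == "__main__":
    main(sys.argv[1])
```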

The dashboard was added only recently, after the core architecture already existed. It is built on top of the same API and the same stored objects. Its purpose is to speed up review, lower the barrier for inspection, and help users navigate difficult results. But it should never become the place where essential processing decisions live in an unrecoverable way. The core method remains scriptable, explicit, and headless; the dashboard is a window into it. This also keeps the dashboard honest. Because it is not responsible for “doing the science” behind the scenes, it can focus on what visual interfaces are actually good at: inspecting outcomes, comparing samples, visualising chromatograms, browsing MS2 evidence, or building local molecular networks.

We also use the dashboard for R&D projects. It’s enjoyable to design and experiment with interactive modules that can access and connect nearly all data simultaneously. For example, what would an “ISF explorer” look like? We can perform real-time scoring of correlation, coelution, MS2 similarity, and more for thousands of features, but how do we distil this into meaningful information, and how should we present it? The dashboard provides us with a great, interactive playground for development.

The user’s perspective

Very few users care about what happens in the background. The real question is whether the system helps them achieve their goals quickly and with as little friction as possible. Building on the architecture above, this is how we envisage serving the three user categories with a single framework:

  • Developers: interactive scripting in Marimo or Jupyter, or custom workflows written from scratch, from templates, or by unleashing some coding agents on the API.
  • Analysts: the analysis should run automatically and produce a full set of results, reports, plots, and QC views without requiring detailed knowledge of the processing steps. This is what we call zero-knowledge processing, discussed below. Minimal Python skills are enough to adapt the workflow, create ad hoc plots, or launch the dashboard for review.
  • Collaborators: they are mostly interested in the results, which we provide in a variety of formats: Excel (yes, many still like it), parquet, MGF, mzTab-M, figures, and other interoperable artefacts. Most importantly, we share the Python scripts so that others can reproduce the full analysis independently. More on this in the FAIR processing section.

11. Zero-knowledge processing with the Wizard

One of our main goals was to reduce subjectivity. Much of untargeted metabolomics processing still relies on tacit knowledge, repeated guesswork, and ad hoc parameter tuning, all of which undermine reproducibility, transparency, and interoperability. A workflow should not become more trustworthy because a particularly experienced person happened to run it. It should become trustworthy because its logic is explicit, deterministic, and reproducible. That is the real goal of what I call zero-knowledge processing.

The expression may sound provocative. Obviously, the software is not the one that knows nothing. On the contrary, the whole point is to transfer as much accumulated knowledge as possible into the workflow itself. The zero-knowledge part refers to the user: one should be able to process a study without understanding every hidden assumption, every low-level algorithmic detail, or every parameter that once required months of habit to set properly. In other words, the software should carry more of the burden, and the user less.

To support zero-knowledge processing, we integrated a Wizard into MASSter. Our ambition was to build an assistant capable of analysing an LC-MS study from raw data, regardless of study size, instrumentation, LC settings, and so on. But how? Despite what the name might suggest, no wizardry was required.

In earlier work, we dedicated significant effort to the automatic optimisation of processing parameters (see SLAW). That work was highly valuable. It demonstrated that sensitivity and robustness could be improved methodically rather than by pure trial and error. However, it also revealed the limitations. It is computationally expensive because of the numerous parameters that must be optimised simultaneously. We also understood that optimisation is not equivalent to understanding. Relying on native parameters with partly cryptic meanings made it challenging to learn and transfer knowledge across different studies. If every study still begins with an extensive search through a large parameter space, the method remains cumbersome, slow, partly stochastic, and hard to troubleshoot.

The real breakthrough came with the recoding of the core processing steps, which gave us more naturally robust algorithms and, just as importantly, a much clearer view of the parameters that are truly critical, as described in Part I. What remained were a few parameters with real physical or analytical meaning, such as noise level, approximate chromatographic width, or similarly interpretable properties of the data. These can often be estimated from the data directly rather than guessed by the user. Once this became clear, the Wizard only needed the ability to inspect an exemplary raw file and infer those parameters directly from the data. That turned out to be much simpler than the word “Wizard” suggests. There is no magic and no AI involved: just measurement combined with heuristics and analysis templates. That is what hides in our Wizard.
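To give a flavour of what such measurement can look like, the sketch below estimates two interpretable quantities, a noise floor and a rough chromatographic peak width, from example arrays. The heuristics shown (a low intensity percentile, the half-decay lag of an EIC autocorrelation) are generic illustrations, not the Wizard’s actual rules.

```python
import numpy as np

def estimate_noise_level(intensities: np.ndarray) -> float:
    # take a low percentile of non-zero centroid intensities as a noise floor
    return float(np.percentile(intensities[intensities > 0], 5))

def estimate_peak_width(eic: np.ndarray, cycle_time_s: float) -> float:
    # crude width estimate: lag at which the EIC autocorrelation halves
    x = eic - eic.mean()
    acf = np.correlate(x, x, mode="full")[len(x) - 1:]
    acf = acf / acf[0]
    half_lag = int(np.argmax(acf < 0.5))   # 0 if the trace never decorrelates
    return 2.0 * half_lag * cycle_time_s

rng = np.random.default_rng(0)
print(estimate_noise_level(rng.lognormal(5.0, 1.0, 10_000)))
```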

From the user’s perspective, the Wizard only needs to know where to find the raw data and where to store the results. That is all. It then inspects an exemplary file, infers the key parameters, and writes a complete analysis workflow to a Python processing script. It takes 1-2 minutes to complete the task, and then it flies away.

The processing script is about 400 lines of code: the full sequence of processing steps together with their parameter values, including those adapted to the actual data. By executing the script, any user can perform the full analysis: single-sample analyses on multiple processes, alignment and merging into a study, plotting and exporting. The script can be modified and rerun at will, creating a new set of results coupled to the exact parameters specified in the processing script and the MASSter version used.

The initial part of the processing script created by the Wizard, showing the initialisation of the parameters that determine the behaviour of the script and the critical settings of the actual processing.
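Since the figure cannot be quoted directly, here is a hedged sketch of the kind of preamble such a generated script could contain; the paths, parameter names, and values are invented for illustration.

```python
# --- Sketch of a Wizard-style script preamble (illustrative names and values) ---
from pathlib import Path

RAW_DIR = Path("D:/studies/lipidomics/raw")      # where the raw data live
OUT_DIR = Path("D:/studies/lipidomics/results")  # where results will be written

PARAMS = {
    "noise_level": 120.0,   # inferred from the exemplary raw file
    "peak_width_s": 5.5,    # approximate chromatographic width
    "n_processes": 8,       # parallel single-sample analyses
    "min_samples": 30,      # keep consensus features seen in at least 30 samples
}

# The remaining lines of the generated script spell out single-sample processing,
# alignment and merging into a study, plotting, and exports.
```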

In addition, the Wizard creates a Marimo notebook so that the user can reopen the results in an interactive environment, import annotations from external tools, open the dashboard, continue with downstream analysis, and so on.

The Wizard has become the default method for processing LC-MS/MS data in our lab, for both newbies and experts. Of course, it is not perfect. We still stumble on unexpected samples and instrument behavior that require corrections. But this does not weaken the vision. It sharpens it. If a dataset needs expert intervention and manual adjustment, we use it to evolve the Wizard – rather than celebrating artisanal data processing.

We ask both beginners and experts to be cautious with the results, double-check their quality, and report any undesirable results. If a particular issue repeatedly requires expert intervention, we try to absorb that knowledge into the Wizard logic so the next user does not have to rediscover the same problem. This feedback loop helps us deliver rapid updates – sometimes several builds in a single day – and pushes the codebase toward greater robustness instead of relying on manual patches and heroic curation.

12. FAIR processing and fair reporting

FAIR is often invoked as if it simply meant “deposit the data somewhere”. It means more than that. Data should be Findable, Accessible, Interoperable, and Reusable. In other words, others should be able to find a study, access it, understand what is in it, and do something useful with it without having to reverse-engineer the whole experiment. In metabolomics, this logic is largely embodied by repositories such as MetaboLights and the Metabolomics Workbench, which have pushed the community toward richer metadata and more structured reporting.

This is clearly better than the alternative, which is to leave data on someone’s hard drive and hope for the best. Still, I find the current solutions unsatisfying in practice. They improve discoverability and metadata quality, but they also come with a substantial submission overhead, often ask for information that is already present in the files, and expose mainly metadata while the actual data remains buried in downloadable archives. Most importantly, they leave out a critical part of the scientific record: the path from raw data to results. So the study may be FAIR on paper, while the analysis itself remains opaque and difficult to reproduce.

This is why we often prefer to deposit data in generic repositories such as Zenodo, BioStudies, or our institutional infrastructure, where we can combine different kinds of data, results, and scripts that document the processing. Alternatively, we go to MassIVE, where the barrier to submission is much lower, but the data becomes actionable thanks to tools such as ReDU and MASST.

There are two fundamental issues. The first is to move beyond FAIR data and include the processing itself in FAIR practice. This is where our scripting-based processing becomes a critical asset. Whether the script is written by hand, assembled interactively by a power user, or generated automatically by the Wizard, the end product is the same kind of object: a concrete workflow that can be inspected, rerun, shared, and archived with the data.

At a deeper level, MASSter keeps a history of all steps and parameters used during processing. In our internal environment – though not in the public distribution – the history also includes the details of the LC and MS method that can be extracted directly from the proprietary raw files. These can be exported as a Python dictionary or JSON, and are appended to Excel exports by default. We don’t (yet) have a mechanism to reverse engineer a processing script from JSON, but it’s on our nice-to-have list.
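As a rough illustration, exporting such a history as JSON could look like the snippet below; the keys and structure are assumptions, not the exact objects MASSter writes.

```python
import json
from datetime import datetime, timezone

# Illustrative history record; keys, steps, and values are assumptions.
history = {
    "software": {"name": "MASSter", "version": "x.y.z"},
    "steps": [
        {"step": "find_features", "params": {"noise_level": 120.0}},
        {"step": "align", "params": {"max_shift_s": 20.0}},
        {"step": "merge", "params": {"min_samples": 30}},
    ],
    "exported": datetime.now(timezone.utc).isoformat(),
}

with open("processing_history.json", "w") as fh:
    json.dump(history, fh, indent=2)
```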

The second issue is how to share metabolomics results both FAIRly and fairly. FAIRly in the sense of using formats that other labs, tools, and infrastructures can actually read and reuse; fairly in the sense of representing what we truly know – and do not know – about the results, especially when it comes to structural annotation.

On the technical side, FAIR formats do exist. The best example in metabolomics is mzTab-M, a standards-oriented tabular format designed to report metadata, features, results, and supporting evidence. We generate compliant mzTab-M files for both flow injection and LC-MS/MS data, with full integration of identification results from tima and LipidOracle. However, an mzTab-M file should never be considered sufficient to describe the whole result. The format was designed primarily to report quantitative, well-annotated data. Given the intrinsic uncertainty of MS-based identification (see Part I), we always package representative MS1 and MS2 data for all features as MGF alongside the mzTab-M export.

Increasing reporting fairness is more challenging. In metabolomics, humans are very good at overselling identification results in prose, figures, and tables. We all know the temptation: write a precise name (preferably with full structural decoration), smooth over the ambiguity, and let the reader assume that a match implies confidence. As we explained earlier, this is not how identification works. A match is not proof of identity. Confidence does not come from how pretty a library hit looks, but from whether plausible alternatives can actually be excluded. The real problem emerges when this information reaches non-experts or machines and a poor guess quietly hardens into an assertion.

Our strategy for fair reporting is to export results and underlying data/evidence as a linked, semantically-defined, machine-readable knowledge graph. This is one of the goals of our MetaboLinkAI project, and we hope to publish a first version by the end of 2026.

13. Benchmarks

Enough talking: does it actually deliver? Let’s test with a lipidomics study comprising about 2800 LC-MS injections, acquired on a SCIEX 7600 ZenoTOF system using traditional DDA logic (MTBLS14098 on MetaboLights). The raw data amounted to roughly 22 GB and were stored on a local SSD. We asked the Wizard to create an analysis script, which we edited to set two parameters manually: allowing 8 parallel processes, and keeping consensus features present in at least 30 samples (~1%) to make the analysis as comprehensive as possible. We launched the script and went for lunch.

After 2.5 hours, the Wizard was done. Parallel processing of 2817 single files, from *.wiff to *.sample5, with 8 subprocesses took 39 minutes. The assembly into a study took ca. 110 minutes. It’s worth looking into the major steps, because they give a better feel for the complete workflow and where the time is spent:

Timing breakdown of the ~2800-file lipidomics workflow, obtained by running the processing script generated by the Wizard.

Let us decode the log:

  • Loading the results of 2817 files took 5 minutes. A total of 1,475,452 features was loaded, including their EIC and MS2 spectra.
  • Pre-alignment of retention time for all files took 5 seconds. This is to correct for common, non-linear shifts in chromatography.
  • Merging took 4 minutes and resulted in 3031 consensus features across all samples. Since we retained consensus features present in at least 1% of samples, the final study still contains many missing values: 88%.
  • The workflow then attempted to link all available MS2 spectra for the 3031 consensus features. A total of 1,052,125 MS2 scans were acquired across the 2817 injections. 895,422 MS2 scans could be associated with one of 2859 consensus features. For 172 consensus features, no MS2 was ever triggered by the DDA engine during the run. The MS2 association step took 30 seconds.
  • The workflow then extracted MS1 data for all consensus features. This information is needed to support adduct and isotope identification. This step chooses the files where the feature was highest, and recurses into the centroided data to retrieve the MS1 scan at the apex of the chromatographic peak. A total of 776 files had to be queried, and MS1 peaks associated with the canonical isotopes and adducts were found for 2870 consensus features. This step was completed in 43 seconds.
  • Next came gap filling, which retrieved EICs for the 88% of missing values: 2817 x 3031 x 88% ≈ 7.5 Mio EICs that had to be retrieved from stored data on the hard drive. Keep in mind that data is stored by scan, and retrieving orthogonal EICs is particularly demanding. This is the longest step, with a cost that scales quadratically with the number of samples and features. In our challenging example, gap filling 7.5 Mio traces still took only about 1 hour. We are genuinely proud of that number, because it reflects the efficiency of both the architecture and the implementation.
  • Once all 8.5 Mio EICs were available, they were integrated to obtain quantitative peak areas. As described in Part I, the integration takes the consensus features into account and adapts the boundaries to each individual EIC. The analysis speed was ~8000 EICs per second, completing all 8.5 Mio integrations in ~17 min.
  • The rest is dedicated to saving everything, exporting data (csv, xlsx, parquet, MGF), creating plots, reports, and so on, for a total of 10 more minutes not shown in the log.

Overall, the Wizard took 2.5 hours to perform a comprehensive analysis of 2817 raw files, including rare features, data recursion (in orange), reproducible quantification of 8.5 million chromatograms, and extraction of optimal MS1 and MS2 spectra. That is really not bad, especially since it was all done automatically during a lunch break. Isn’t it fun?

From this point on, the workflow depends on the study. In this particular case, we typically run LipidOracle on the Wizard output, then reopen the results in the Marimo notebook prepared by the Wizard to generate plots, launch the dashboard, and continue with downstream analysis.

ZT Scan DIA

Our architecture has a distinct advantage with DIA data such as SCIEX ZT Scan DIA, available on the 7600 and 8600 instruments. This scanning DIA mode can acquire roughly 850 MS2 spectra per second, effectively narrowing the gap between DDA and DIA by combining ultra-narrow precursor selection with very high speed across a broad precursor m/z range. A single raw file can easily reach 100,000 to 1 million MS2 scans and become unpleasantly heavy. I covered the details in a dedicated tutorial, but a direct comparison is worth mentioning here.

We conducted a test with a small LC-MS experiment comprising 9 ZT Scan DIA files (23 GB), using MSDIAL (5.5.251021), the only other software known to support this type of data. It ran for ca. 2.5 hours, processing all files sequentially, and extracted 900 aligned features. Unfortunately, we could not visualise or extract the MS2 data because of an immediate error. Our Wizard processed all files in parallel and completed the task in ~15 min, detecting 3190 features and extracting MS2 spectra for all of them, including detailed DIA statistics for 100’000s of MS2 fragments.

Because of the architecture and the delayed processing of MS2 spectra, the cost of analysing a ZT Scan DIA file is only marginally higher than for a DDA file, and entirely justified by the extra fragment-level properties that are computed. Resource demand remains low enough that large ZT Scan DIA studies can be processed in parallel just as easily as the 2800-file DDA study above. The Wizard recognises ZT Scan DIA files and handles them accordingly.

14. Automation

By now it should come as no surprise that our LC-MS processing can run in headless mode, either by trusting the Wizard or by using custom Python scripts. The only prerequisite is installation of the module from the repository. This makes it straightforward to integrate the workflow into tools such as Snakemake, Luigi, Prefect, Nextflow, or Airflow. Note, however, that direct raw-data access depends on .NET libraries and therefore only works on Windows. If the orchestrator runs on Unix, files must first be converted to mzML or mz5 using the Pwiz container.

In our lab, we recommend running the Wizard on Windows machines, either locally or on VMs that we provide centrally. Since the requirements are minimal, users can simply create a dedicated virtual environment for each study and add the masster distribution. With uv, that takes about 30 seconds, and the clean definition of the environment for FAIR data processing comes for free.

For annotation with tima and LipidOracle, however, we use Apache Airflow, the orchestrator we rely on for a variety of cron and on-demand processing jobs. Both tima and LipidOracle can be run locally with Docker, but this requires a certain level of expertise and, above all, familiarity with their configuration.

To support analysts and students with limited interest in the details, we created DAGs that request only a small number of user inputs, with the vast majority of parameters preconfigured. Parameters and thresholds are set conservatively, meaning that structures or classes are only reported when the evidence is solid. We also cached the relevant MS2 libraries, so that the DAGs can create local copies and proceed immediately with the analysis.

Users are only asked to provide basic information in the form of a JSON file. For example, this is the information required to launch a complete tima annotation job:

Minimal JSON required to launch an automated tima job through our Airflow instance.

The task points to the DAG; features and spectra point to the files generated by MASSter; taxon is used to adjust the scoring based on the sample type; mixed_mode indicates whether our internal RT library can be applied. This is the minimal information the user must provide, and the DAG does the rest.
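For illustration, assembling and dropping such a job file could look like the sketch below, restricted to the fields described above; the paths, values, and watched location are placeholders.

```python
import json
from pathlib import Path

# Illustrative job description using only the fields described above;
# all paths and values are placeholders.
job = {
    "task": "tima",                          # which DAG to launch
    "features": "Z:/study_x/features.csv",   # feature table generated by MASSter
    "spectra": "Z:/study_x/spectra.mgf",     # MS2 spectra generated by MASSter
    "taxon": "Homo sapiens",                 # adjusts the scoring to the sample type
    "mixed_mode": True,                      # whether the internal RT library applies
}

# Dropping the file onto the watched network drive triggers the DAG.
Path("job_study_x.json").write_text(json.dumps(job, indent=2))
```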

In practice, users are not exposed to Airflow or its DAGs at all. Jobs are triggered by dropping JSON files like this onto a network drive monitored by a watchdog, itself running on Airflow. If the JSON is valid and the task is recognized, a DAG is launched, the latest container version is built or pulled from Docker Hub, and the analysis proceeds. The computational requirements remain modest: the scheduler and workers run on a single on-premises VM with 20 vCPUs and 32 GB of RAM inside our protected VPZ, with user-level control of data access.

A single tima run for thousands of MS2 spectra typically takes about 10–30 minutes. It produces a detailed analysis and a full log. Most users will not care much about either. They simply import the results from the NAS into a MASSter script, possibly in an interactive environment, and continue with the analysis using the added structural information.

The view on Airflow’s operations. This is used mostly by admins and developers. Users can simply drop a JSON in a folder and, after 30 minutes, collect the results, and import them into MASSter.

15. Are we there yet?

It’s time to recapitulate this long post describing how we analyse untargeted metabolomics and lipidomics data. We started in Part I with very ambitious goals and needs that were unmet by existing solutions, prompting us to create a new ecosystem. Did we meet the expectations?

The first requirement was perhaps the most ambitious one: untargeted metabolomics should not be worse than targeted data analysis or careful manual curation. On this point, I believe we made real progress by merging the two workflows. The switch to consensus-guided, chromatography-centric quantification greatly facilitates gap filling and yields a peak integration that is consistent across all samples. It is a big step up from the common practice of applying a rule-based integrator to individual EICs, or taking the intensity as a proxy. The use of EICs improves consistency, traceability, and, in practice, precision. The approach is scalable (8.5 Mio traces retrieved and integrated in about 90 minutes!) and very robust – more so than typical untargeted processing.

Scalability changed from a concern into a design principle. Direct raw-data access, reduced caching, study-level objects, chunked merging, explicit attention to algorithmic complexity, and efficient storage formats allow the workflow to survive the study sizes that modern metabolomics now produces routinely. The lipidomics benchmark with about 2800 files, together with our experience on bulky DIA and ZT Scan data, shows that this was not an afterthought.

Sensitivity combined with robustness remains one of the hardest balances to strike. The main lesson was that aligning centroiding with feature detection markedly improves the reproducible detection of low-abundance features in TOF data. Across several comparisons with other software, we often detect two to three times more features while maintaining low CVs for technical replicates. That last point matters more than peak counts alone. We deliberately avoid smoothing the data, which is risky with undersampled EICs, and instead rely on distributed soft filtering to keep noise and spurious features under control, even when merging 3000 files.

Reducing subjectivity is where the change feels most profound. More robust functions, fewer truly critical parameters, deterministic rules where possible, and the Wizard’s ability to infer important settings from an example file remove a large amount of guesswork from the workflow. Subjectivity has not vanished entirely, of course. It survives where it belongs: in study design, interpretation, or deciding what consensus features to retain based on study criteria and goals. This is an area that offers significant room for further improvement, but we have already come much further than existing solutions.

Reproducibility and FAIRness also improved in concrete ways. The equation is simple: script plus code version plus data should yield the results, and the results should include the relevant settings and processing history. A processed study is therefore not only a numerical output; it is a documented analytical state. There is still work to do, especially around interoperability and semantic export, but the important shift has already happened: the workflow itself has become a shareable object.

Ease of use is the most treacherous aspect to discuss. It is very tempting to equate ease of use with maximum automation, as if the ideal software were simply the one that asks no questions and hides all complexity behind a reassuring interface. I do not think that is true. The goal is not to simplify and automate everything, but to provide each user with the right level of control and the right degree of abstraction. In fairness, we did not build this ecosystem out of a burning desire to rescue other labs from their software problems. We were mostly trying to rescue ourselves while juggling method development, large-scale analyses, and the will to deliver the most informative data possible to collaborators. From that gloriously self-centred perspective, I think we have come a long way.

Vendor-, technology-, and method-agnostic: well, yes! At least for the high-res data from the three vendors in our lab.

Automation-readiness; Fast prototyping for R&D; Low demand for maintenance, support, and deployment: yes, yes, and yes. 3x 100%.

Concluding remarks

This concludes a long description of the hows and whys of our analysis of untargeted metabolomics data. I hope I could convey that this is not an art, but a complex engineering problem that can be addressed quite efficiently once the right dots are connected in the right architecture.

It is also, thankfully, fun. Fun to think about how to solve these problems. Fun to be able to swiftly analyse studies of any size. Fun to discover new leads in untargeted metabolomics data.

If all of this sounds suspiciously optimistic, do not worry: there is still plenty left to do. More importantly, there is no shortage of new opportunities, new challenges, and occasionally new bad ideas ahead.
