The journey towards de novo structure elucidation

This post will describe the journey towards MSNovelist, a project I have been working on over the last two years and which is now finally available as a preprint. MSNovelist is a de novo structure elucidation method for MS2 spectra.

Background

De novo indicates that molecular structures are generated directly from the MS2 spectrum without referring to a database. In contrast to existing in silico methods, this allows discovering structures that are completely new and unexpected – such as novel natural products, pharmaceutical metabolites or environmental transformation products.

Aside from the practical applications, I find this problem interesting from a fundamental viewpoint. De novo identification methods exist for peptides, for example. For peptides, we understand the fragmentation rules, and the search space is well defined (since they are linear). In contrast, small molecules contain a broad variety of structural features – rings, functional groups, branches etc. For the formula C12H12 alone, there are 23 million isomers possible. We still don’t have a good understanding of fragmentation rules, but even then, how would we go about building a molecule from a mass spectrum?

There have been some attempts to tackle this by combinatorially generating molecules with some constraints, e.g., specific functional groups. However, these approaches suffer from a combinatorial explosion – even for small-ish molecules, they generate so many possible candidates that they become of little use in practice.

This question has fascinated me for a long time. So, I started thinking when – at the Dagstuhl seminar for Computational Metabolomics in 2017 – Kai Dührkop presented CSI:FingerID, an in silico database search method. CSI:FingerID compares a spectrum to a database by predicting a molecular fingerprint for the spectrum and find the best-matching fingerprint of all candidates in a database. A molecular fingerprint is basically a long vector of structural features (which may be concrete substructures like a carbonyl, or more abstract features calculated from the atom connections).

Database search with CSI:FingerID in a nutshell: A molecular fingerprint is predicted from an MS2 spectrum and compared to all database candidate fingerprint to find the best match.

If for a minute we forget that these are probabilistic, the CSI:FingerID output could be seen as a long list of constraints for a molecule. What if we could use all of these constraints as input for a molecule generator? Would this be enough to narrow down the chemical space to generate a target molecule?

The vision for de novo structure generation: A molecular fingerprint is predicted from an MS2 spectrum, and used as input to generate a candidate molecule directly.

At the same meeting, Kai also presented initial results with CANOPUS – where predicted fingerprints are used to determine the chemical class of a molecule. This contributed a second important insight: Predicted fingerprints as an input allow us to bypass the key limitation in machine learning for mass spectrometry – there are only some 30’000 training spectra around. In contrast, there are virtually unlimited training fingerprints to train any machine learning method on, because they can be computed for any known structure from a database. These were two of the three ingredients needed for de novo structure elucidation.

The third missing ingredient was a way to generate molecules. I first thought of something like a genetic algorithm, where a procedure would start to build molecules and iteratively come closer and closer to the target fingerprint. At one point, I tried to modify the Open Molecule Generator for that purpose; however, I never got very far with this.

Design

Luckily, developments in another field came to my advantage: In 2016, researchers started to explore the application of deep learning methods to molecule generation. I consider the Gómez-Bombarelli SMILES autoencoder the seminal publication in the field. This paper was published 2018, however the preprint has been around since 2016 and was followed by countless related models. How to apply deep learning to molecules? One way is to use methods that are used for natural language processing; at the time, these were mostly recurrent neural networks. An RNN is a network with a single set of parameters that describes the transition from an input state to an output state and an output. From a starting state, a sequence is modelled by recursively applying (“unrolling”) the RNN. This approach can be applied to sequence input, but also to sequence output and is used, for example, for text generation (but also, time series forecasting or any other sequence task).

Scheme of an unrolled RNN for a word-level language model

(By now, RNN are replaced with transformer models. Those have some advantages, but ultimately, this doesn’t matter for the application at hand.) Either way, using the long-established SMILES notation, molecules can be represented as sequences, and the same methods can be applied. Most work in the field uses the autoencoder paradigm, where a network is trained to encode the input into a low-dimensional vector, the latent representation, and reconstruct the input from the latent representation.

The idea is that the network is forced to keep all information about the structure in a compact vector (the latent space acts as a dimensional bottleneck), such that the latent representation is a continuous representation of a complex input. For application, the encoder is then thrown away, and new molecules are generated by sampling points in the latent space – with appropriate optimization techniques, one can obtain molecules with desired target properties. This could, for example, also be applied here to optimize a molecule until it matches a target fingerprint.

My work takes a slightly different approach. In our case, it is not necessary to “blindly” optimize in latent space, because a molecular fingerprint directly contains information about molecular structure. All we need to do is to reconstruct the structure from the fingerprint. The fingerprint is not a target, but an input, and it should be transformed into a molecule. This is almost verbatim the same task that is solved in image captioning: First, a convolutional neural network recognizes the features in an image, resulting in a vector (analogously to our fingerprint). Second, the vector is an input for an RNN decoding it into a sentence describing the image. I based the MSNovelist model on this approach and implemented an LSTM-based RNN (basically, an RNN that doesn’t suffer from rapid memory loss over the iterations) for the generation of SMILES code from fingerprint input.

(Note that most image captioning models use the second-last layer rather than the final layer as their input representation, before the network’s information is artificially reduced into discrete classes. We don’t have that option, but this is only a moderate problem, given how descriptive molecular fingerprints are.)

Scheme of an image captioning network (top) and the analogous MSNovelist structure elucidation method (bottom). “Skateboard Kitten” by jm2c is licensed under CC BY-NC-SA 2.0

Side note: Generating sequences with RNNs

The choice of model also influences the choice of sampling. How do we choose the “best” sequence from such a model? For every step in the sequence, the model predicts a probability for each possible following token (so, if we have tokens C,N,O,c,n,o,=,(,),1,$, this gives 11 probabilities that sum up to 1). For example, in the case shown above, the model has predicted "O=C(" already. The predicted probabilities for the next step could be, for example, 85% for N, 10% for C and 5% for c.

In the simplest case, we always follow the path of least resistance and choose the token with the highest probability. In this example, we would choose N 100% of the time. This is deterministic, and only one sequence will be generated for one specific input state. Note also that this is also not guaranteed to generate the most probable sequence overall since we have no way to see ahead whether the “easy” choice at this point will lead to “hard” choices down the road.

To generate more variability, stochastic sampling can be used, where the next token is chosen randomly based on the predicted probabilities (roughly speaking, if "C" has a probability of 0.9, it will be chosen in 90% of cases). By running the model multiple times, one generates a list of multiple possible sequences arising from the same starting state. However, because of the combinatorial complexity of this process, it becomes actually quite unlikely that something as good as the “least resistance” sequence is sampled at all, a really undesirable outcome for our case. (This method is desirable for “creatively” sampling from RNN.)

In contrast, we want to generate the best top-n sequences for our input vector. To do so exactly would require us to expand every possible sequence at every timestep, excluding only those guaranteed to never make it back to the top-n, which is not practical. In practice, an acceptable heuristic for this is a beam search. Roughly, we are extending n sequences in parallel. We expand all of them at every step, giving n times the number of tokens (in our example, 11*n) possible extensions.

For the next step, we will keep n of them: the ones that have the best overall probability up to this step, combining the probability of the sequence up to this step and the probability of the following token. This fulfils two goals: First, we generate multiple possible sequences that are somewhat close to the best n possible ones; second, we can get a higher overall probability than the “least resistance” sequence because we can follow a path that is not the easiest right now but will have easier choices down the road.

Implementation

The basic implementation of a RNN network like this is, by now, the easy part of the story. Sequence-generating models have become so common that any Google search will turn up good, mediocre and bad implementations and step-by-step tutorials, even chemistry-specific ones. Initial results were very encouraging and showed that the reconstruction of a molecule from a fingerprint is possible in many cases, even with proper evaluation. As fascinating as this is, it’s also somewhat unsurprising – the reconstruction of a molecule from its fingerprint (by optimization, not by “image captioning”) is part of the Guacamol benchmarks for molecule generation models. But the real challenges only start here:

CSI:FingerID is only able to predict part of a fingerprint, not a complete fingerprint. Is this subset still sufficient to reconstruct a molecule?

Initial tests were done with different fingerprints available in RDKit. Unsurprisingly, removing random parts of fingerprints has a minor effect, as in large fingerprints the information is to some degree redundant. In reality, the “missing” part of the fingerprint will be non-redundant, because the structural features that can be predicted from an MS2 spectrum are correlated with each other. Using the exact fingerprint predicted by CSI:FingerID as input, the performance dropped. However, the model could be rescued by explicitly adding the molecular formula as input. Presumably, knowing some substructure specifies a part of the molecule; knowing the formula narrows the blank space that needs to be filled in.

More importantly, CSI:FingerID predictions are probabilistic and definitely not perfect. Can we train the model such that it is tolerant to errors in the fingerprint?

Up to now, I had shown that I could use a fingerprint to reconstruct molecules. But in reality, nobody cares about this – the question is whether I can reconstruct the molecules from spectra. The project got out of the playground stage and into handling real data after Dagstuhl meeting in January 2020, when we joined forces between the Zamboni and the Böcker labs.

When I started to use the model on the fingerprints predicted from actual spectra by CSI:FingerID, instead of “synthetic” fingerprints, the performance took a complete nosedive. This was expected, since the model was trained on perfect data. To tackle this challenge, I could draw on the Böcker group’s experience with CANOPUS, where they faced the same problem. The solution was to train the model with simulated predicted fingerprints, which are generated by correlated or uncorrelated sampling from a set of known CSI:FingerID predictions. This simulation method is far from perfect, but efficient; trained in this way, the model was able to succesfully generate structures from predicted fingerprints.

Can we generate molecules with one specific molecular formula?

Without any tuning, only approximately 20% of generated structures in a top-128 beam search have the correct molecular formula we are looking for – other molecules have an extra CH2 group or a double bond too much. These structures can be instantly excluded as candidates for the spectrum. So while we think we are running a top-128 search, in reality we are only generating ~25 viable candidates. Can we improve on this?

This turns out to be remarkably simple. For example, starting from C9H11NO, the sequence O=C(Nc1c uses one oxygen, three carbons, and one nitrogen. This means there is only C6H??N0O0 left. We can feed this domain knowledge to the model by concatenating this info to the input at every step. This mechanism almost doubles the number of viable candidates to 40%. Of course, this doesn’t take care of hydrogens, which are implicit in SMILES, and not easy to predict in a rule-based manner. For example, in O=C(Nc1cc the first c has zero hydrogen atoms attached, whereas the second c has one. Here I tried an approach that worked better than expected: I predict the hydrogen counts from the left-hand context with an auxiliary, independent RNN, using only the total hydrogen count in the molecule for training – adding this input to the model raises viable candidates to 57%.

Does all of this magic help in the end? A tiny bit, as it slightly increases how many correct structures we retrieve. On the other hand, it inadvertently ended up giving us a powerful weapon: by using these data augmentation procedures, we have a model that gives us the structures with a specified molecular formula. If we now train and run the same model without the fingerprint input, we can directly measure how much the structural information from MS2 contributes to our structure generation, and how much we find purely from generating structures for one chemical formula, which I call “isomer sampling” or “naïve generation”.

How do we use the top-n output in a better way?

Generating more than one structure means that we are getting a list of candidates with associated probability scores under the model. Of course, generating more structures means that we will find the correct structure in the list more frequently; but in the end, no one is going to look through 128 results and by some magic determine which one is right – we will focus on the structures with the highest scores. So if we believe the model score, we might as well just generate one or a few candidates. Another possibility is to find a better way to rank the results – for example, by finding the structure in the output that is the best match to the input fingerprint. Naïvely, one might just apply (for example) Tanimoto similarity for this – but better ways to score the results were already developed of CSI:FingerID. We use the “modified Platt score” (ModPlatt), which takes into account the prediction accuracy of CSI:FingerID as well as the prediction confidence for a specific fingerprint bit.

Side note: Computational performance

As I said, initial implementation is not the challenging part here. Actually, even things like fingerprint simulation, formula hinting and beam search can be implemented in a quite straightforward way with imperative programming. However, this is usually not a very efficient way and becomes a roadblock when training, re-training, re-evaluating the model, changing small things here and there, etc: Everything that is not done on the GPU, or at least in the TensorFlow computational graph, becomes a bottleneck and slows down things by orders of magnitude. With time, I moved from a somewhat pedestrian implementation to a quite streamlined one, implementing everything possible directly with TensorFlow. In principle, this contributes zero to the scientific content of this work – in practice, it made it possible to train and evaluate with a reasonable timeframe and was worth the effort.

Evaluation

I could go on and write an entire book chapter about the evaluation strategy, but this post is already quite long – therefore I will try to focus on the conclusion. We evaluated MSNovelist de novo structure prediction on the GNPS dataset which was also used for the evaluation of CSI:FingerID. Importantly, the evaluation was done with full cross-validation – the input fingerprints for MSNovelist were predicted with structure-disjoint cross-validation instances of CSI:FingerID, and the de novo structures were generated with structure-disjoint cross-validation MSNovelist models.

MSNovelist evaluation: top-n retrieval: what fraction of structures (y axis) were retrieved by the model at rank <= n (x axis)? Left: entire dataset; right: database top1 matches

The most intuitive evaluation for MSNovelist is how many correct structures were retrieved overall, and where they are ranked. In total, MSNovelist retrieved almost half of the correct structures; 25% were even ranked first. This is way better than anything I would have expected before starting this (in all seriousness: I expected this to work on a few test cases, but did not think it would hold up to rigorous evaluation.) But wait: what do these numbers even mean? This apparently simple way to evaluate the model mixes many different aspects: for example, is the input fingerprint a good match to the true structure? How many similar structures exist for a fingerprint? How good is MSNovelist at making structures that match a fingerprint? And how comprehensive is the chemical space model of MSNovelist?

For a start, only 39% of structures can be correctly identified (ranked first) by database search! This means that even if MSNovelist was generating the exact same top-hypotheses as are in the database, it still could not possibly be better than 39% top 1. (It is theoretically possible to score higher than that – but this would indicate strange problems with the model rather than a success.) It is more realistic to look only at those 39% where MSNovelist actually has a chance to win. Here, MSNovelist finds 2/3 of those structures, and ranks 61% first, a quite impressive result. This also matches well with a chemical space coverage test: Over the whole dataset (not the 39%), MSNovelist finds about 60% of structures that are ranked highly in database search (including both correct and incorrect structures).

Note: there are still things we aren’t considering here. For example, in an extreme case, a structure can be scored on rank 1 in database search even if the fingerprint prediction is completely wrong, if there is simply no other candidate for this formula. The de novo model doesn’t have this luxury – it will try to generate a matching structure either way.

But to me, the key question is if the MS2 data, represented by the fingerprint, actually helps at all in generating the correct structures. For this, we can use the fingerprint-free “naïve generation” model as a comparison. Interestingly, naïve generation already retrieves 31% of correct structures (37% of the top-1 subset) – and by applying ModPlatt sorting, gets 17% to top-1!

Why is this important?
  • First, we can show that even if a substantial set of structures is found just by isomer sampling, the fingerprint data makes a clear difference and leads to a clear increase in identified structures. So all our work was not for nothing.
  • Second, we can show we aren’t fooling ourselves. If someone showed me a model with 31% retrieval and 17% top-1 for de novo ID, I would have considered this a great success since there was nothing in this field before. However, all of that could be achieved completely without any MS2 information for structure generation, simply by using an existing sorting method on a bag of randomly generated molecules.

The data shows further interesting outcomes. For example, it is clear that using ModPlatt sorting gives better performance than believing the model probability. However, using the real model without ModPlatt sorting is still better than naïve generation combined with ModPlatt sorting, making another argument that model probability truly reflects fingerprint match. In the manuscript, we further looked at the output in different ways to understand different aspects of model performance.

Practical considerations

We now have a nice evaluation that tells us the model works, which is great. But can we actually do something useful with it? I will not go into detail about our proof-of-concept application, but highlight a few points we noted along the way:

  • First of all, there are no miracles. If there is no information in the MS2, no model will be able to use the data in a useful way or give sensible results. In practice, this means that it is helpful to have clean, information-rich spectra as input.
  • Second, MSNovelist is focused purely on the task of structure generation, and relies completely on upstream tools (in practice, SIRIUS) for molecular formula determination. However, even this is not so easy in many cases. The de novo results were much easier to interpret and understand for cases where the molecular formula was clear enough that this problem was “off the table”. In contrast, we observed many cases where both database search results and de novo structure suggestions seemed quite unlikely, which was often a pointer to that the molecular formula was probably incorrect to start with.
  • Finally, de novo annotations should, for now, only be regarded as a starting point – while the model may provide a plausible prima facie explanation, there is often not yet enough evidence to rule out alternatives, and careful evaluation (by manual interpretation but also by orthogonal methods) should follow to get to the true answer.

Outlook

This concludes my short, or really not so short, look back on the development of MSNovelist. Right now, the model is on Github, and the manuscript is available as a preprint – so for now, all of this is still preliminary, but we hope to get it published in due time.

Where do we go from here? This is certainly by far not the end of the story regarding de novo annotation or deep learning for MS2 in general. Quite in contrast, we would rather regard it as a first baseline – we are demonstrating that we can achieve de novo structure elucidation even using a relatively simple model (which is regarded as somewhat outdated for molecule generation, even though it works just as well as many “newer” models), without extensive optimization. The field of molecule generation is moving at an amazing pace. Important themes are the use of reinforcement learning for molecule optimization, robust alternatives to SMILES and the generation of graphs rather than SMILES strings. On the spectrum side, new representations are being explored for different purposes. There is ample opportunity to explore the application of these methods to de novo structure elucidation, and we are looking forward to seeing what others come up with.

In the meantime, we would be happy to hear your thoughts about this. Comment below – I will try to give some useful answer if I have one 🙂

This Post Has 4 Comments

  1. Alaa Othman

    Thanks a lot, Michael for taking us through your exciting journey to reach MSNovelist and developing a great tool!! A couple of practical questions that I did not ask you yet 🙂 and perhaps more will come as I read even more carefully the pre-print!!
    1- Would MSNovelist work equally well on molecules with known fragmentation rules (such as lipids for example)? The last time that I tried Sirius, a few months ago, I could not get it to work properly on lipids with informative MS2 spectra and got the feeling it performed much better on small molecules (average lipid has an m/z of 700). Or Do you think that something like what Cyrus Papan from the Shevchenko lab tested a few years ago or a similar approach would work in a more straightforward way (described here https://pubmed.ncbi.nlm.nih.gov/24471557/)
    2- You briefly mentioned that MSNovelist will perform poorly if upstream tools (i.e. Sirius) could not generate clear formula. Is this always the case? and if yes, can one define thresholds MSNovelist to exclude those cases so it does not generate false positives structures anyway. Thanks again for the great work!!

    1. Sebastian Böcker

      You can use SIRIUC or ZODIAC for lipids but not CSI:FingerID or any other in silico method such as MetFrag, MAGMa, CFM-ID etc. MSNovelist is like CSI:FingerID, so not lipids. It works but it makes no sense. See here: https://www.youtube.com/watch?v=oEQoB2aaV2E&t=308s

  2. Michael Stravs

    Hi,
    sorry for the delay!
    1a)
    Cases where the fragmentation rules are already well understood are not necessarily the best use case for MSNovelist or even CSI:FingerID. Fingerprints of different lipid isomers might only differ in few bits, so it’s hard to distinguish them with a general-purpose method. For PC 34:4 there are >14000 entries in Pubchem with Tanimoto similarity >97%. For hesperetin, a natural product, there are 400. Now we haven’t filtered by formula yet, but you get the point: even with perfect fingerprint prediction and perfect structure generation from fingerprints, it would be a tough battle to win. It is in the nature of CSI:FingerID that the spectrum is “thrown away” after the fingerprint prediction, so even though you may know good rules to distinguish lipids based on peak patterns, MSNovelist cannot make good use of them.
    1b)
    But where could there be potential for MSNovelist in lipid discovery? It is imaginable that MSNovelist could find new lipid classes, e.g. a modified head group, or chemical modifications on the side chains. Those could then be the basis for further elucidation, such as regiochemistry resolution, by an expert.

    2) The issue is combining the probability of a formula being correct with the score of generated compounds. Here, database search may profit from the limited search space again. For example, imagine SIRIUS gives 65% probability to formula 1 and 35% to formula 2, and correspondingly predicts two fingerprints. For fingerprint 1, there are only bad candidates (or nothing at all), but for fingerprint 2, some very good matches – this makes it much easier to believe in formula 2. In contrast, MSNovelist will still try to build a good candidate for either formula, and will more frequently have narrow differences between the two. My simple solution was to just filter out everything where a formula wasn’t strongly favoured, but this is not currently done automatically by MSNovelist; instead, results for all formulas are listed.

  3. Alaa Othman

    Thanks a lot Michael and Sebastian! I was referring mainly to finding new lipid classes with never-seen-before modifications either on the headgroup or acyl chains or even sterols, isoprenoids, sterols etc.. so Michael point 1b addressed it. Thnx.

    However, in response to 1a and part of Sebastian’s comment, this also poses the question in datasets where we can have a mix of features with MS2 spectra for 1) lipids with clear fragmentation rules, 2) lipids without fragmentation rules and 3) other small molecules such phytochemicals ( by the way hesperetin is technically speaking also a lipid belonging to flavonoids (Lipid Maps ID: LMPK12140003) or even drugs etc. From the experimental point of view, that is not unusual and rather the common case in my experience as there is a big overlap of physicochemical properties between lipids with clear fragmentation rules and many other chemical classes. In those datasets, using a dedicated rule-based lipid annotation algorithm will surely miss everything else and using a dedicated “small molecule” annotation algorithm will not always perform as well on lipids. Does this mean, in your opinion, that one needs to always annotate these datasets twice with different algorithms to get the best of the two worlds as general purpose annotation methods will not work on both well?

    Thanks again Michael for the exciting story and Sebastian for enriching the discussion here as well.

Leave your comment:

This site uses Akismet to reduce spam. Learn how your comment data is processed.