All-inclusive, turnkey processing for untargeted LC-MS

Post author: Nicola Zamboni
Post published: 2 July 2021

Imagine running a study with 100 samples by untargeted LC-MS (polar or lipids, you pick!). Data acquisition by DIA/DDA is complete in 1-2 days, and then comes the step of data analysis.

The minimum expectation for the software is to find all peaks across samples (rare and frequent, high and low, symmetric and tailing), take care of shifts in m/z and RT, group isotopologues and adducts, and eventually output a table with peak areas/volumes for all features. The luxury version also delivers consolidated MS2 spectra, optionally some recursive gap filling, and maybe an mzTab file.

Common challenges in processing untargeted LC-MS/MS data

What software can accomplish all of this? The options are to use commercial software (if one can afford it), or one of the usual suspects: XCMS, openMS/KNIME, mzMine, MS-DIAL, … We tried everything we had access to – repeatedly over years – and the experience has been frustrating. Some recurring problems we encountered:

The software crashed (reproducibly) with mid-sized LC-MS studies. In some instances, it computed for 7 days and eventually triggered blue screens or odd exceptions.
Some software was speedy, but that was never a good sign. In our experience, if LC-MS software is too fast, it is because it’s cutting corners: leaving peaks behind, misaligning retention times across samples, reporting plenty of missing values, and so on.
Parameter-less algorithms did poorly on real LC-MS samples. Parameter-dense algorithms were tunable, but in most cases, it was a manual, trial-and-mostly-error procedure. Some parameters are just not trivial to understand and to adjust.
Plenty of coding is necessary to obtain the full result. Importantly, we had to switch between R, Python, and maybe Java to get everything done.
The coherence of the results produced by different tools on the same data was surprisingly low. What should we trust?

Many of these issues, and new ones, emerged with an increasing number of samples. Admittedly, it could be that we were simply not skilled enough to get it to work. It could also be that there is commercial software that solves all of these problems and scales, but we could not afford it. Nevertheless, we are probably about average users, and most metabolomics users interested in mid-to-large-sized studies (100s to 1000s of samples) have faced – or will face – similar issues. Please feel free to share your experience in the comments.

Introducing SLAW: a scalable and self-optimizing LC-MS/MS analysis package

We are proud to present a tool that tackles most of the problems we experienced when processing untargeted LC-MS data. The credit goes to Alexis Delabrière, an extremely skilled and knowledgeable Postdoc with a PhD in Computer Science who spent two years developing a unifying system that got us past all frustrations. The result is SLAW. The name stands for scalable LC-MS analysis workflow, and it hides plenty of functionalities:

SLAW performs peak picking by any of the most used and top-performing algorithms (featureFinderMetabo from openMS, ADAP from mzMine, CentWave from XCMS)
SLAW includes a novel sample alignment procedure that scales to thousands of individual LC-MS files. The algorithm works on a desktop computer but can automatically make use of high-performance computing infrastructure.
SLAW includes a novel parameter optimization procedure that efficiently optimizes peak picking and alignment parameters for each dataset (i.e. LC method and MS instrument). In our tests, 5-7 parameters with complex, interdependent effects on the results were optimized simultaneously. This means that all the key parameters can be automatically tuned by SLAW without the user’s intervention for any of the embedded peak pickers. The optimization procedure employs a metric that combines two terms. The first is sensitivity (# of features detected), the second is robustness (frequency of detection across replicate samples). Thereby, parameter optimization ensures that – in large studies – peak picking and alignment don’t degenerate into a myriad of noisy and sporadic features but yield robust and reproducible entities. Of course, the objective function could be modified at will, but the default form is effective for both small and large studies.
SLAW groups isotopologues and adducts into features by exploiting both intra- and inter-sample correlations.
SLAW fills missing values by raw data recursion, optimizing the extraction windows to minimize biases between recursed and non-recursed data. [This feature is only relevant for TOF data. For Orbitrap data, missing values have to be guessed by imputation.]
SLAW scales for thousands of LC-MS files. We tested several datasets with > 1000 LC-MS files, and all went fine.
SLAW digs into raw data to extract isotopic patterns from the best possible samples independent of peak picking.
Regardless of the number of LC-MS files, SLAW consolidates all MS2 spectra collected across a study (in top-n or iterative procedures) in consensus spectra for each collision energy.
SLAW outputs data in tables (csv), consensus MS2 spectra (mgf), as well as the individual MS2 spectra (mgf) and peaks table (csv).
If optimization is included, SLAW returns optimal parameter values that can be used in future runs with similar measurements without repeating parameter searches.
SLAW requires zero coding skills. SLAW comes as a Docker container and takes care of the communication between different languages (Python, R, C++, …).
SLAW can be installed in a minute via Docker and immediately run.

SLAW in practice

SLAW requires (i) centroided files in mzML or mzXML format, and (ii) a text file that describes the sample type for each file: BLANK, SAMPLE, QC, MS2. Optionally, some fundamental settings (e.g. the peak picking algorithm) or preferred parameter ranges can be passed with a text file provided by SLAW. That’s it. The whole process is started with a simple command-line statement.

Comment: QC files are quality controls, typically pooled study samples that have been measured at regular intervals during the sequence. QC files are crucial for parameter optimization. For short sequences, 3-4 QC samples are sufficient. For long sequences, more is recommended. MS2 indicates files dedicated to fragmentation spectra, for example, as obtained by iterative MS2 methods.
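To illustrate the second input, here is a minimal Python sketch that assigns one of the four sample types to each file and writes a small table. The file-name keywords, the column names `path` and `type`, and the CSV layout are assumptions made for this example only; the format SLAW actually expects is documented in the GitHub wiki.

```python
import csv
import io
from pathlib import PurePath

# Hypothetical naming convention: a keyword in the file name reveals the type.
KEYWORDS = (("blank", "BLANK"), ("qc", "QC"), ("ms2", "MS2"))


def sample_type(filename: str) -> str:
    """Return BLANK, QC, MS2 or SAMPLE based on keywords in the file name."""
    stem = PurePath(filename).stem.lower()
    for keyword, label in KEYWORDS:
        if keyword in stem:
            return label
    return "SAMPLE"  # anything without a keyword is treated as a study sample


def write_sample_sheet(filenames, handle) -> None:
    """Write a two-column table (hypothetical layout) to an open text handle."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["path", "type"])
    for name in sorted(filenames):
        writer.writerow([name, sample_type(name)])


files = ["blank_01.mzML", "QC_01.mzML", "serum_001.mzML", "pool_MS2_01.mzML"]
buffer = io.StringIO()
write_sample_sheet(files, buffer)
print(buffer.getvalue())
```

Keyword matching is only a convenience; for a real study the types are better taken from the acquisition sequence, where the QC and MS2 injections are known explicitly.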

Comment: any optimization procedure requires a metric. In SLAW, the metric combines two terms: sensitivity and robustness. Both can be measured based on QC samples, assuming they are representative of study samples. Notably, we don’t make use of spike-ins or any other ground truth. This was a decision we took early in development because (i) it adheres better to the idea of untargeted methods to capture possibly many features of different classes, and (ii) it preserves general and retroactive applicability (inclusion of pooled study samples is common practice, use of spike-ins is rarer).

Comment: the robustness term gains importance with the increasing number of samples. Pushing (only) sensitivity rewards the detection of noise. In large studies, this leads to detecting huge amounts of features in only a reduced number of samples. This degeneration can hardly be compensated in later filtering steps. The inclusion of a competing robustness term balances the optimization.
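The trade-off between the two terms can be illustrated with a toy score. The exact objective function used by SLAW is not given in this post, so the form below (feature count multiplied by the mean QC detection frequency raised to a weight) and the name `toy_score` are assumptions chosen only to show why a robustness term keeps sporadic features from being rewarded.

```python
from statistics import mean


def detection_frequency(row):
    """Fraction of QC injections in which a feature was detected (None = missing)."""
    return sum(value is not None for value in row) / len(row)


def toy_score(qc_table, robustness_weight=2.0):
    """Toy objective, not SLAW's: sensitivity times robustness ** weight.

    sensitivity = number of features (rows) in the QC table
    robustness  = mean detection frequency of those features across QC injections
    """
    if not qc_table:
        return 0.0
    sensitivity = len(qc_table)
    robustness = mean(detection_frequency(row) for row in qc_table)
    return sensitivity * robustness ** robustness_weight


# Strict parameters: 4 features, each found in all 3 QC injections.
strict = [[10.0, 11.0, 9.5]] * 4
# Permissive parameters: the same 4 features plus 6 sporadic ones (1 of 3 injections).
permissive = strict + [[None, 3.0, None]] * 6

print(toy_score(strict), toy_score(permissive))
```

With these made-up numbers the permissive setting reports 2.5 times more features but scores lower, because the extra features are rarely reproduced across the QC injections. A score based on feature count alone would prefer the permissive setting.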

We tested SLAW on dozens of studies spanning different MS brands, number of samples, gradient lengths, etc., with satisfying results. In our lab, it has become a workhorse in routine analysis, both for polar and lipid extracts. As it helped us a lot, we think it might be of general interest for scientists that face similar problems.

We just submitted a paper that describes the different components (e.g. parameter optimization, alignment) and the re-coding done to make it scale. The paper compares the pros and cons of SLAW against the only two workflows that were found to scale: XCMS/IPO and openMS. The paper focuses on scalability (up to 2500 LC-MS files). We hope to be soon able to share the article. In the meantime, we are glad to share some exemplary results from tests that are not part of the manuscript.

Comparison to vendor software on a small dataset

We used SLAW to analyze a complex lipidomics sample by DDA. We analyzed the same sample repeatedly on a Thermo QExactive HF-X and on an Agilent 6546 QTOF. Both datasets were analyzed with the vendor’s software. In SLAW, we used the out-of-the-box configuration and let the automated optimization find the best parameters individually for both datasets. Our evaluation was based on (i) number of peaks identified, (ii) CV of the identified peaks, (iii) number of lipid features left after declustering, deisotoping, and a crude annotation (MS1 and RT).

On QExactive data and compared to Compound Discoverer 3.1, SLAW (i) detects 2x more feature groups (8436 vs 4231), (ii) achieves better reproducibility (CV) over the whole range of intensities (13% vs 17%), (iii) finds features more frequently across samples (83% vs 26%), and (iv) finds more putative species (1733 vs 1125).

On QTOF data and compared to ProFinder 10, SLAW (i) detects 2x more feature groups (18222 vs 8444), (ii) achieves similar reproducibility (CV ~ 13%), (iii) finds features more frequently across samples (98% vs 83% after gap filling; the number before gap filling is not available), and (iv) finds more putative species (2462 vs 1460).
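For readers who want to compute comparable figures on their own replicate injections, the sketch below derives a CV and a detection frequency from a feature table. How the numbers above were aggregated is not stated in this post; taking the median CV over fully detected features, and counting features found in every replicate, are assumptions of this example.

```python
from statistics import mean, median, stdev


def cv_percent(areas):
    """Coefficient of variation (standard deviation / mean) in percent."""
    return 100.0 * stdev(areas) / mean(areas)


def summarize(feature_table):
    """Summarize replicate injections of the same sample.

    feature_table: one row per feature, one peak area per replicate,
                   None where the feature was not detected.
    Returns (median CV% of features detected in every replicate,
             fraction of features detected in every replicate).
    """
    complete = [row for row in feature_table if all(v is not None for v in row)]
    if not complete:
        raise ValueError("no feature was detected in all replicates")
    fraction_complete = len(complete) / len(feature_table)
    median_cv = median(cv_percent(row) for row in complete)
    return median_cv, fraction_complete


# Made-up peak areas for 4 features measured in 3 replicate injections.
table = [
    [100.0, 110.0, 90.0],
    [50.0, None, 55.0],
    [200.0, 200.0, 200.0],
    [10.0, 12.0, 11.0],
]
median_cv, fraction_complete = summarize(table)
print(f"median CV: {median_cv:.1f}%, detected in all replicates: {fraction_complete:.0%}")
```

Restricting the CV to fully detected features keeps it independent of gap filling, which matters when comparing tools that fill gaps differently.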

Large scale data

As an example, we show the analysis of ca. 1200 serum lipidome samples by LC-MSn, both on TOF and Orbitrap. In both cases, SLAW completed the analysis in less than 12 hours. Unfortunately, we can’t compare it to vendor software because it crashes… Despite the sheer number of samples, the number of feature groups is consistent with expectations. Most features have been detected in the majority of samples. The CV (SD) is still large, but this is before any type of normalization.

Additional information

The paper describing SLAW can be found at https://pubs.acs.org/doi/10.1021/acs.analchem.1c02687

Are you interested in using or testing SLAW? The simplest way is to install Docker and pull the latest SLAW version:

docker pull adelabriere/slaw:latest

You’ll find more information on GitHub https://github.com/zamboni-lab/SLAW (check the wiki). More will be disclosed with the publication of the companion paper.
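As a rough sketch of what starting a run can look like, the Python helper below assembles a `docker run` command for the image pulled above. The container mount points are assumptions: `/output` appears in a log excerpt in the comments, `/input` is a guess, and the optional `MEMORY` variable is the one suggested by the developer in the comments. The host paths are placeholders; please check the wiki for the authoritative command.

```python
import shlex

IMAGE = "adelabriere/slaw:latest"  # image name from the pull command above


def slaw_command(input_dir, output_dir, memory_mb=None):
    """Build a `docker run` command line for SLAW (a sketch, not the official one).

    Assumed container paths: /input for the mzML folder, /output for results.
    memory_mb, if given, is passed as the MEMORY environment variable.
    """
    cmd = [
        "docker", "run", "--rm",
        "-v", f"{input_dir}:/input",
        "-v", f"{output_dir}:/output",
    ]
    if memory_mb is not None:
        cmd += ["-e", f"MEMORY={memory_mb}"]
    cmd.append(IMAGE)
    return cmd


# Placeholder host folders; replace with your own.
command = slaw_command("/data/study/mzML", "/data/study/slaw_out", memory_mb=3000)
print(shlex.join(command))
```

The helper only prints the command. To launch the container from Python, the same list can be passed to `subprocess.run(command, check=True)`.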

How can you help us?

Please try it, challenge it, and report your experience. If you like it, just commenting below with a like or a thank you (to Alexis) will be enough to recognize that it is well-received.
If not, tell us what’s bad. Our main interest is in improving the algorithmic part, and we are eager to hear about scalability and parameter optimization. For instance, some TOF data are particularly challenging because of a dependency of peak shape on intensity, MCP ringing, and other issues. If you have good data to share (it doesn’t have to be large scale), it might help.

Comment: We don’t have the resources for troubleshooting data sets, in particular, if the problem is poor chromatography (massive tailing, noise, drifting RTs, overloaded columns, carryover, etc.), wrong instrument tuning, or that you don’t see a metabolite even though you know the m/z, but there is no peak, etc. ;-). For this, we apologize in advance.

Comment: SLAW does not include any GUI or fancy visualization. Such functionalities are not planned. SLAW will remain a stand-alone tool that focuses on the processing of untargeted LC-MS data. In our environment, SLAW is embedded in automated workflows. We have modules that precede SLAW and downstream modules that take SLAW’s output to visualization, analysis, annotation, etc.

Comment: SLAW does not include functionalities for annotation (identification) of features (i.e. by RT, AM, MS2, …). This is a work in progress and will be part of a different module. Therefore, for now, you’ll have to annotate by yourself.

This Post Has 27 Comments

Yasin 8 July 2021

Quickly checked it (centwave and optimization = true) for nine replicate injections of a lipid extract measured on RP-QEHF using my mzRAPP package. From a benchmark with 3594 identified peaks (RT shifts < 10s) 98% were recovered and <1% of isotopologue ratio biases were increased by more than 20 percentage-points when using Slaw-peak areas. Both of those metrics are for the result after peak alignment, so I think it is very good! Thanks for this tool, looks great and I am looking forward to scrutinizing it in more detail! 🙂

1. Nicola Zamboni 8 July 2021
  
  Hi Yasin, Thank you for testing, your feedback, and the positive news! Keep us posted.
  
2. Alexis Delabriere 8 July 2021
  
  Hi Yasin, thanks for the feedback. We designed it to be as robust as possible, so I am happy to see that this is the case on your data. I am also super happy to discover mzRAPP, and I’ll probably try to integrate it to evaluate the future development of SLAW. It is the kind of stuff that I do with a bunch of scripts lying around, and this seems a better solution.
  
  1. Yasin 8 July 2021
    
    Hi happy to hear that! Let me know if something with mzRAPP does not work properly.
    
Ralf Tautenhahn 16 July 2021

Can you please share the raw data files? I’d like to run this dataset in the current version of CD (3.2) and the soon-to-be-released version 3.3, which uses an entirely new peak detection algorithm.

1. Nicola Zamboni 16 July 2021
  
  I can share the small one (you should receive an email soon). The large one (1200+) I show here can’t be released at this point, but you can try with MSV000086486 (on MassIVE), which is the one we used in the paper (2500 LC-MS files).
  
Ralf Tautenhahn 16 July 2021

Sounds good. I’d also like to try the 2500 sample dataset, however MassIVE is telling me “Dataset MSV000086486 is currently private. It has been deposited to MassIVE, but has not yet been publicly released.”

1. Alexis Delabriere 16 July 2021
  
  Hi Ralf, my mistake. It should be released now. The pooled QC are the files labelled ‘PSS’.
  
  1. Ralf Tautenhahn 20 July 2021
    
    Hi Alexis, thank you. I see a lot of mzML files. Would it be possible to share the original RAW files? mzML files (in general) do not contain some very important data (e.g. details about accuracy and resolution) which CD uses for processing. I can provide a link to OneDrive or Google Drive if you prefer not to share the .raw files on MassIVE.
    
    1. Alexis Delabriere 16 August 2021
      
      Hi Ralf, sorry for the delay, I have a zip with all the raw files online, I however need an email to give you access to it using the existing ETH file sharing system. Do you have one to which I can send an invite so that you get access to the files ?
      
      1. Alexis Delabriere 16 August 2021
        
        You can send it to delabriere@imsb.biol.ethz.ch
        
  2. Alexis Delabriere 21 July 2021
    
    I’ll try to give them to you; it’ll probably take me a few days as they are quite big. SLAW does not perform peak detection during gap filling, because this is not something that I would be able to do reproducibly on different mass spectrometers while using 3 different peak picking algorithms. I use simple area filling, but the margins are optimized such that the distribution of the missing values is as similar as possible to the distribution of the detected intensities (optimization once again). It is cutting corners compared to peak detection, but I really went for robustness over specificity.
    
2. Alaa Othman 16 July 2021
  
  Thanks Ralf for the interest in SLAW. When we tried the large dataset used in the manuscript on CD3.1, it didn’t make it past the gap-filling node (which scaled non-linearly after almost a week) unless we increased the intensity cutoffs to a level where we knew we were missing a lot of features. We had to chop the dataset into three to make it work, and the number of gaps in all cases is just ridiculously exploding. We also had many other issues with grouping and identifying the base ions that you might see after annotation even in 3.2. With SLAW, however, not only did it finish much faster, but one also doesn’t have to set very high intensity cutoffs; intensity CVs are great for QC samples, coverage after annotation is excellent, and above all it’s open-source :). Thanks again for the interaction, and feel free to try SLAW as well and perhaps tell us what you think.
  
  1. Ralf Tautenhahn 20 July 2021
    
    Hi Alaa. You are right. Gap Filling can take quite a while for large datasets. Gap Filling performs actual peak (re-)detection for all (missed) ions for each compound across all the samples, so it’s not cutting any corners there. To get a result that might be more comparable to your workflow described above (it says you do not use gap filling but rather missing value imputation for Orbitrap data) you can simply remove the Gap Filling node and use the Missing Value Imputation node which is of course much faster. Also, I’d be happy to share a beta build of CD3.3 if you want to try it out. The new peak detection in CD3.3 is not only faster but also more sensitive. Peak detection now runs on a very low intensity threshold per default and uses a peak-quality based filtering mechanism. CD3.3 also directly shows the “Reference Ion” in the Compounds table, which might be what you were looking for. We can discuss details per email if you are interested.
    
    1. Alaa Othman 21 July 2021
      
      Thanks a lot, Ralf!! Yes, we tried without gap filling in CD 3.1 of course. The issue is that we get more compounds with gaps than without gaps in replicate samples (QCs), where I expect that the majority of compounds should be without gaps (only about 25% of compounds are without gaps in replicates). That’s the case even in small datasets such as the one that Nicola is showing in the post. With SLAW, we don’t get that many gaps in the first place for Orbitrap data (80% of the compounds are detected in all replicates before gap filling). That is due to Alexis’ brilliant parameter optimization algorithm implemented in SLAW, which finds one of the best parameter settings to pick, align, etc.
      
      Perhaps Nicola or Alexis can comment better on the gap filling in SLAW, but the major advantage of SLAW in my hands is to avoid imputing gap areas completely, or to minimize it to samples where there are really no peaks, so I always preferred to use the gap filling node in CD to comply with that reproducibility-of-detection constraint (btw, gap filling does an excellent job in CD at finding missed peaks).
      
      Finally, I have to admit that I have used/tested almost every Thermo software since Sieve, and it has really developed a lot over the years, and the development team is very responsive to academic needs, so thanks for all the effort you put in there. However, I have also used/tested most of the open-source and commercial untargeted LC-MS software, and in my opinion the issue of tuning parameters is largely overlooked. Even for expert users, there is simply not much time to try out different parameter sets to reach optimal settings for every study, and sometimes the best parameter settings are even counterintuitive. That is one of the strengths that SLAW offers, and the evaluation criteria that Alexis and Nicola are proposing here and in the manuscript help reach that.
      
      Thanks a lot for the offer of beta testing CD3.3; I will come back to you after we are back from holiday.
      
justin m 17 July 2021

Very interesting. Is there a reason this takes centroid data rather than profile data?

How does this compare to Progenesis, the gold-standard? Also, what are the computer specifications: “we show the analysis of ca. 1200 serum lipidome samples … SLAW completed the analysis in less than 12 hours.”?

Lastly, the instructions on GitHub to demo SLAW are too difficult to understand. I foresee low use if the general scientist cannot understand how to use this. Does the publication further elaborate on this: “In our environment, SLAW is embedded in automated workflows.”?

1. Alaa Othman 17 July 2021
  
  Thanks, Justin M. Did you ever try Progenesis with > 1000 samples? In my hands, Progenesis always took longer than Compound Discoverer with small datasets (I did a one-to-one comparison a few years ago), and Progenesis output had more false-positive peaks that I had to filter later. Perhaps this changed in later versions? Did you check the reproducibility of detection and peak intensities in Progenesis? I am curious to see data regarding these three evaluation criteria (scalability, CV% of intensities of the same peaks in replicates, and the number of reproducibly detected features in replicates). Finally, I am not sure what you mean by gold standard. There’s no gold-standard software for LC-MS data analysis, IMHO; however, there are popular ones, open-source, free software, and some commercial ones.
  
  1. justin m 17 July 2021
    
    Hi Alaa, I routinely use Progenesis (always latest generation) with 100s to 1000+ samples. On a Ryzen 3950x cpu, the time to import profile-mode data, perform alignment, and pick peaks ranges from a few minutes to a few hours, respectively. I also have 2 windows open, one for pos and one for neg ESI. Both run simultaneously.
    
    I agree with the false-positive noise. I solve that by decreasing peak picking (eg, auto 3 to 2) or filter out by ‘max abundance < X’. I always do manual peak review for the top-n biomarkers of interest, so over-picking is not a concern.
    
    By gold-standard here, I mean that, thus far, Progenesis appears to check all the boxes: 1) alignment quality, 2) peak picking quality, 3) speed, and 4) ease of use. For instance, a program that is proven to miss 50% of the peaks would not be in the running for Gold Standard (see below). I’m not sure what you mean by reproducibility and peak intensity. Progenesis has its own peak area scale, which is often much lower than vendor peak area, but this is all relative.
    
    I have extensively used 2 other software packages, one vendor-based and one-freeware. Both repeatedly crashed and missed many peaks. For the freeware, other users on the developer’s forum have also noted missed peaks. I have not extensively trialed web-based solutions since it would be against policy to dump data online. I have only tried with limited demo data that is not mine. The web based solutions would already fail the speed factor, and I think many of them have data file limits.
    
    I would like to demo genedata expressionist. That one looks good on paper, but only licenses to large enterprises. For you, what are some options that could be in the running for Gold Standard?
    
    1. Alaa Othman 19 July 2021
      
      Thanks Justin for the feedback. In my experience, these timescales were not feasible with Progenesis QI (approx. 24 samples took a few hours on a Xeon workstation with more cores, RAM and GPU than the Ryzen). How many compounds/peaks do you get with that speed? Could it be that a high intensity cutoff makes it faster because it does not detect many low-intensity peaks? Even in the official application note from the commercial provider (http://storage.nonlinear.com/webfiles/support/progenesis/docs/comet/Application-Note-Rapid-Validation-Of-LC-MS-Approach-for-Nontargeted-Metabolomics.pdf), it took them approx. 3 hours for 113 samples (the same number of samples takes less than half an hour with SLAW in our setup). Are any of the results with 1000+ samples in Progenesis published somewhere? I am curious to know in which context this can work with that speed, LC settings, etc.
      
      My criteria ( overlaps largely with Nicola’s and Alexis’) would be
      1) number of detected peaks
      2) reproducibility of peak detection (i.e. if you were to inject 5 replicates, for example, and run them with Progenesis QI, how many peaks would be detected in all samples without gap filling? I know that Progenesis does not allow for gaps, and that is where most of the noisy peaks come from)
      3) reproducibility of integrated areas of detected peaks in replicates ( again if you were to inject 5 replicates of the same chemical sample, would you get the same area/intensity for those detected peaks in all 5 replicates). That is usually expressed in CV% ( standard deviation/mean in %)
      4) chromatographic peak width distribution
      5) quality of grouping adducts and identifying the base ion
      6) speed
      
      and finally, of course, I agree with you that ease of use is a criterion for reaching a wider user base, but the data quality based on criteria 1-5 is of higher importance to me.
      
      I do not think there is any specific gold-standard software for untargeted LC-MS data analysis in the strict definition of gold standard (reflecting the truth of peaks, their areas and annotations). There are so many different software packages, and I always evaluate whatever comes new and compare it to others. That also depends on your purpose in using the software: whether you are going after wide-spectrum discovery metabolomics/lipidomics or biomarker research only. Thanks again!
      
      1. justin 20 July 2021
        
        Hello Alaa and team. I ran all QC samples (n=81, I think) from the MassIVE 2400 set mentioned above. The samples were in centroid format in positive ESI. Progenesis makes the user input the mass resolution when using centroid mode. I used 50,000 because the data came from an HFx. However, I’m not sure what setting was actually used.
        
        Using the default Progenesis parameters (just click start), the total data crunching time for 81 samples was just under 3 minutes. This included: converting mzML to Progenesis format, alignment, peak picking, and adduct deconvolution (i have 6 adducts–though this step takes almost no time). The next step, I can choose compound database(s). Automated compound annotation took a further 1 second.
        
        Since there is a question if increasing peak picking could increase data analysis time, I submitted all samples for data analysis using ‘max peak picking’. The completion window returns a ‘total analysis time’ so, tomorrow, we will know how long it took. There are no options for RT alignment. RT alignment and peak picking take 99% of the overall time.
        
        I think my main point got lost in this discussion. My point is, it would be relevant to compare SLAW to a program or programs that can reliably handle 1000s of samples. In my opinion, this should include Progenesis because it is one of the few programs that can reliably handle this sample volume. How Progenesis fares in your criteria (1-5) should indeed be evaluated. I feel that comparing SLAW to programs mentioned in this blog post above is inadequate. These programs are known to frequently crash and several are difficult to use. Also, that application note from Progenesis is from 2013–not a very fair comparison to use a 2013 cpu against your modern computer cluster.
        
        As an aside, I don’t understand your criteria #2-3. Reproducibility for #2-3 have to do with chromatography and MS cycle time. I think if you wanted to test data analysis reproducibility, you would copy/paste the exact same data file 5 times, so that the program analyzes the same data.
        
2. Nicola Zamboni 17 July 2021
  
  Dear Justin, if you think that Progenesis QI is the gold standard and SLAW too difficult, you should definitely stick to the first.
  
  Regarding your last comment: no, the planned publication does not elaborate on our internal workflows and infrastructure. The focus is SLAW and scalability.
  
3. Alexis Delabriere 18 July 2021
  
  Hi Justin, thanks for the comment. I will try to make the instructions clearer. I agree that the pure Docker setup is probably confusing, but there was no way to have the optimisation running with different peak pickers in another setup. We did not compare to Progenesis. The problem is that even if I wanted to, I would have to spend days/weeks learning the correct parametrization by trial and error, which makes a huge difference. And the results would still be worse than with SLAW’s optimized parameters, unless I really became an expert with it (at least that’s my experience: the optimized parameters pretty much always produce better results than I do by hand). I regularly hear that Progenesis QI has very good peak picking, so I would guess that if tuned well it could probably outperform SLAW. The 2400 samples were run on our cluster, and I was able to get a successful processing with a minimum of 25 GB of RAM and 15 cores. SLAW adapts its RAM usage depending on the setup.
  
Demetrius 10 August 2021

Hi, I ran SLAW with optimization set to “true” on 42 mzML samples using gcp (128 GB, 16 cpus). The total processing time was ~29 hours. Is this an expected result when using optimization or is there something I can do on my end to improve the time performance?

1. Nicola Zamboni 10 August 2021
  
  It seems much, much longer than I am used to. Definitely would need more details to understand (What instrument? Size of files? What peak picker? Duration of optimization vs. duration of other steps? Log?)
  
  1. Demetrius 13 August 2021
    
    Hi Alexis and Nicola,
    
    Thanks for the quick replies. I don’t have permission to share the data so the log is the best I can do. I lost the original log so I resubmitted the job. It’s still in progress. The files are between 160Mb and 215Mb in size. Optimization duration was ~ 13 hours. peakpicker: openMS. Here is the head and tail of the log so far:
    
    {"log":"2021-08-12|19:30:16|INFO: Total memory available: 65566 and 16 cores. The workflow will use 1257 Mb by core on 15 cores.\n","stream":"stderr","time":"2021-08-12T19:30:16.774443625Z"}
    {"log":"2021-08-12|19:30:16|INFO: Guessing polarity from file:SG_S004721_BN020010_PDB+100mM-NaCl_7_EtOAc_R1.mzML\n","stream":"stderr","time":"2021-08-12T19:30:16.774753587Z"}
    {"log":"2021-08-12|19:30:18|INFO: Polarity detected: positive\n","stream":"stderr","time":"2021-08-12T19:30:18.693467003Z"}
    {"log":"2021-08-12|19:30:19|INFO: STEP: initialisation TOTAL_TIME:2.97s LAST_STEP:2.97s\n","stream":"stderr","time":"2021-08-12T19:30:19.743851406Z"}
    {"log":"2021-08-12|19:39:08|INFO: Finished initial parameters estimation\n","stream":"stderr","time":"2021-08-12T19:39:08.908278376Z"}
    {"log":"2021-08-12|19:39:08|INFO: Optimizing remaining parameters\n","stream":"stderr","time":"2021-08-12T19:39:08.908313598Z"}
    {"log":"2021-08-12|19:39:10|INFO: 3 peakpicking added\n","stream":"stderr","time":"2021-08-12T19:39:10.472586305Z"}
    {"log":"2021-08-12|19:39:10|INFO: 3 peakpicking added\n","stream":"stderr","time":"2021-08-12T19:39:10.882260992Z"}
    ….
    ….
    ….
    {"log":"2021-08-13|08:30:54|INFO: Alignment finished\n","stream":"stderr","time":"2021-08-13T08:30:54.682478716Z"}
    {"log":"2021-08-13|08:31:36|INFO: Optimization stopped because no improvement was achieved in the last round. \n","stream":"stderr","time":"2021-08-13T08:31:36.85195006Z"}
    {"log":"2021-08-13|08:31:36|INFO: Finished optimization\n","stream":"stderr","time":"2021-08-13T08:31:36.86368749Z"}
    {"log":"2021-08-13|08:31:36|INFO: Exporting in /output/parameters.txt\n","stream":"stderr","time":"2021-08-13T08:31:36.863708263Z"}
    {"log":"2021-08-13|08:31:36|INFO: STEP: optimization TOTAL_TIME:46880.14s LAST_STEP:46877.17s\n","stream":"stderr","time":"2021-08-13T08:31:36.910261554Z"}
    {"log":"2021-08-13|08:31:37|INFO: 42 peakpicking added\n","stream":"stderr","time":"2021-08-13T08:31:37.560751833Z"}
    {"log":"2021-08-13|12:05:03|INFO: MS2 extraction finished\n","stream":"stderr","time":"2021-08-13T12:05:03.350349881Z"}
    {"log":"2021-08-13|12:05:03|INFO: STEP: peakpicking TOTAL_TIME:59686.58s LAST_STEP:12806.45s\n","stream":"stderr","time":"2021-08-13T12:05:03.35755059Z"}
    {"log":"2021-08-13|12:05:03|INFO: Aligning\n","stream":"stderr","time":"2021-08-13T12:05:03.362537501Z"}
    
    1. Alexis Delabriere 13 August 2021
      
      Ok, that is actually way bigger than what we usually process, and I guess that a large part of each file is composed of ‘useless’ points (solvent, noise). Please try to reconvert the files using ProteoWizard and add an absolute filter on intensity. You can bypass the problem during optimization by setting the optimization/noise_threshold parameters higher, but a lot of time will be lost anyway during peak picking. Reconverting is probably the best option.
      
      You also found a bug in the memory selection of SLAW: the 1257 Mb should be higher. I will correct it. You can bypass it by passing the memory directly to SLAW as an environment variable, i.e. adding ‘-e MEMORY=3000’ to your command line. This will help. I guess it is a big part of the issue on your computer.
      
      Tell me if that helps. I’ll comment here when I have fixed the memory bug.
      
      Best,
      
      Alexis
      
2. Alexis Delabriere 10 August 2021
  
  Hi Demetrius, this is way too long. The only explanations that I can think of are a failure to detect the multicore settings, or very big files, which could be cleaned with MSConvert settings. If you want me to investigate, you can give me access to a sample at delabriere@imsb.biol.ethz.ch
  