Research project BL/80/IN21 (Research action BL)
The management of an outbreak depends on the rapid and accurate identification of the source of contamination. This requires identifying the pathogens in potentially contaminated samples and in patients, and tracing the contamination back to its source based on the obtained genomic information. In the context of food-borne outbreaks, conventional methods based on isolation of the pathogen from the contaminated matrix and on target-specific real-time polymerase chain reactions (qPCRs) are routinely used. In recent years, whole genome sequencing (WGS) of bacterial isolates has proven its value for strain characterization as well as for tracing the origin of the contamination, by linking the isolate from a putatively contaminated environmental sample to the patient’s isolate with high resolution. Metagenomic sequencing offers a novel approach to outbreak investigation: it can reduce the turnaround time of the analysis and makes it possible to detect outbreaks involving multiple species or strains.
- So far, Sciensano has set up the workflows to generate metagenomics data from matrices (in this case, food). However, the existing bioinformatics pipeline to identify species from these data has limitations. It depends on the availability of databases with many reference genomes of the species of interest (which are not available for all pathogens) and is often unable to identify subtle differences and/or recombination between strains (such as insertions of mobile elements carrying antibiotic resistance genes).
- Alternatively, a de novo assembly-based approach can be used, in which longer contigs are reconstructed from smaller DNA segments, called reads. The contigs obtained by metagenomic assembly are then clustered based on their properties (k-mer composition, frequency), a process referred to as binning. The resulting bins serve as proxies for assembled genomes. However, most bins, irrespective of the de novo assembly algorithm that was used, do not represent single genomes. As a result, reconstructing species at the genome level or assigning antimicrobial resistance (AMR) genes to species remains infeasible, which complicates the unbiased identification of species and the monitoring of antibiotic resistance during outbreak detection.
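To illustrate what binning on k-mer composition means, the following is a minimal sketch in Python. It is not the pipeline used at Sciensano: the function names (`kmer_profile`, `cosine_distance`, `greedy_bin`), the greedy clustering strategy, and the distance threshold are assumptions chosen for illustration, and the sketch ignores the frequency (coverage) signal that real binning tools also use.

```python
from collections import Counter
from itertools import product
import math


def kmer_profile(seq, k=4):
    """Return the normalized k-mer frequency vector of a contig (ACGT only)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[km] for km in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[km] / total for km in kmers]


def cosine_distance(u, v):
    """Cosine distance between two profiles (1.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def greedy_bin(contigs, k=4, max_dist=0.1):
    """Hypothetical greedy binning: put each contig in the first bin whose
    representative profile lies within max_dist, else open a new bin."""
    bins = []  # list of (representative_profile, [contig names])
    for name, seq in contigs.items():
        profile = kmer_profile(seq, k)
        for representative, members in bins:
            if cosine_distance(representative, profile) <= max_dist:
                members.append(name)
                break
        else:
            bins.append((profile, [name]))
    return [members for _, members in bins]
```

With toy contigs such as `{"c1": "ATATATATATAT", "c2": "TATATATATATA", "c3": "GCGCGCGCGCGC"}` and `k=2`, the two AT-rich contigs end up in one bin and the GC-rich contig in another. Contigs from different species with similar composition would be merged in the same way, which is one reason bins often do not correspond to single genomes.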
Objectives: develop a workflow to support outbreak investigations based on metagenomics data. This requires:
O1. Improving the assembly process. We will use a novel paradigm based on Conditional Random Fields (CRF), developed in the lab of J. Fostier, that takes into account both the overlap of reads in the de Bruijn graph and the frequencies of reads in neighboring nodes of the graph. This makes it possible to estimate, in a probabilistic way, both the genomes of the species and their relative abundance in the mixture. The goal is to turn this theoretical framework into a practical assembler.
O2. Comparing the performance of this assembler with the state of the art, using metagenomics data available at Sciensano in the context of outbreak investigation.
O3. Completing the workflow by adding pipelines for inferring the spatial and temporal spread of the outbreak from the genomic information available in NITTTR Kolkata.
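As a concrete picture of the data structure that O1 builds on, the sketch below constructs a small de Bruijn graph from reads and records how often each k-mer node is observed. It shows only the graph and the per-node counts; it does not implement the CRF model itself. The function name `build_de_bruijn` and the choice to use k-mers as nodes (rather than edges) are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def build_de_bruijn(reads, k):
    """Build a de Bruijn graph whose nodes are k-mers.

    Returns:
      node_counts: how many times each k-mer occurs across all reads
      successors:  for each k-mer, the set of k-mers that directly follow it
                   in some read (they overlap by k-1 bases)
    """
    node_counts = Counter()
    successors = defaultdict(set)
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        node_counts.update(kmers)
        for u, v in zip(kmers, kmers[1:]):
            successors[u].add(v)
    return node_counts, successors


if __name__ == "__main__":
    counts, succ = build_de_bruijn(["ACGTA", "CGTAC"], k=3)
    for node in sorted(counts):
        print(node, counts[node], sorted(succ.get(node, ())))
```

In a mixture of genomes, the counts of neighboring nodes along a path tend to be similar when they originate from the same genome at the same abundance. A probabilistic model over the graph can exploit this kind of dependency between adjacent nodes.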
The end product of the networking activities among these complementary expert groups will be a workflow that can support decision making during outbreak investigation and the surveillance of antibiotic-resistant strains based on metagenomics data. J. Fostier will contribute the theoretical framework; K. Marchal will contribute expertise in metagenomic assembly using state-of-the-art tools; Indrajit Saha will contribute expertise in applied computer science and in the analysis of mutational landscapes; and Sciensano will contribute expertise in generating the data and in the applicability of the methods in a governmental setting.