PlasmidTron: assembling the cause of phenotypes from NGS data

Andrew J Page; Alexander Wailan; Yan Shao; Kim Judge; Gordon Dougan; Elizabeth J Klemm; Nicholas R Thomson; Jacqueline A Keane (2017) PlasmidTron: assembling the cause of phenotypes from NGS data. bioRxiv. DOI: 10.1101/188920

<jats:title>Abstract</jats:title><jats:p>When defining bacterial populations through whole genome sequencing (WGS), the samples often have detailed associated metadata that relate to disease severity, antimicrobial resistance, or even rare biochemical traits. When comparing these bacterial populations, it is apparent that some of these phenotypes do not follow the phylogeny of the host, i.e. they are genetically unlinked to the evolutionary history of the host bacterium. One possible explanation for this phenomenon is that the genes are moving independently between hosts and are likely associated with mobile genetic elements (MGEs). However, identifying the element that is associated with these traits can be complex if the starting point is short-read WGS data. With the increased use of next-generation WGS in routine diagnostics, surveillance and epidemiology, a vast amount of short-read data is available, and these types of associations are relatively unexplored. One way to address this would be to perform <jats:italic>de novo</jats:italic> assembly of the whole genome read data, including its MGEs. However, MGEs are often full of repeats, which can lead to fragmented consensus sequences. Deciding which sequence is part of the chromosome and which is part of an MGE can be ambiguous. We present <jats:italic>PlasmidTron</jats:italic>, which utilises the phenotypic data normally available in bacterial population studies, such as antibiograms, virulence factors, or geographic information, to identify sequences that are likely to represent MGEs linked to the phenotype. Given a set of reads, categorised into cases (showing the phenotype) and controls (phylogenetically related but phenotypically negative), <jats:italic>PlasmidTron</jats:italic> can be used to assemble <jats:italic>de novo</jats:italic> the reads from each sample that are linked to a phenotype. 
A <jats:italic>k</jats:italic>-mer-based analysis is performed to identify reads associated with a phylogenetically unlinked phenotype. These reads are then assembled <jats:italic>de novo</jats:italic> to produce contigs. Because it uses <jats:italic>k</jats:italic>-mers and assembles only a fraction of the raw reads, the method is fast and scalable to large datasets. This approach has been tested on plasmids, hence the name, because of their contribution to important pathogen-associated traits such as antimicrobial resistance (AMR), but there is no reason why it cannot be utilised for any MGE that can move independently through a bacterial population. <jats:italic>PlasmidTron</jats:italic> is written in Python 3 and available under the open source licence GNU GPL 3 from <jats:ext-link xmlns:xlink="http://www.w3.org/1999/xlink" ext-link-type="uri" xlink:href="https://github.com/sanger-pathogens/plasmidtron">https://github.com/sanger-pathogens/plasmidtron</jats:ext-link>.</jats:p><jats:sec><jats:title>DATA SUMMARY</jats:title><jats:p><jats:list list-type="order"><jats:list-item><jats:p>Source code for <jats:italic>PlasmidTron</jats:italic> is available from GitHub under the open source licence GNU GPL 3; (url - <jats:ext-link xmlns:xlink="http://www.w3.org/1999/xlink" ext-link-type="uri" xlink:href="https://goo.gl/ot6rT5">https://goo.gl/ot6rT5</jats:ext-link>)</jats:p></jats:list-item><jats:list-item><jats:p>Simulated raw read files have been deposited in Figshare; (url - <jats:ext-link xmlns:xlink="http://www.w3.org/1999/xlink" ext-link-type="uri" xlink:href="https://doi.org/10.6084/m9.figshare.5406355.vl">https://doi.org/10.6084/m9.figshare.5406355.vl</jats:ext-link>)</jats:p></jats:list-item><jats:list-item><jats:p><jats:italic>Salmonella enterica</jats:italic> serovar Weltevreden strain VNS10259 is available from GenBank; accession number GCA_001409135.</jats:p></jats:list-item><jats:list-item><jats:p><jats:italic>Salmonella enterica</jats:italic> serovar Typhi strain BL60006 is 
available from GenBank; accession number GCA_900185485.</jats:p></jats:list-item><jats:list-item><jats:p>Accession numbers for all of the Illumina datasets used in this paper are listed in the supplementary tables.</jats:p></jats:list-item></jats:list></jats:p><jats:p><jats:bold>I/We confirm all supporting data, code and protocols have been provided within the article or through supplementary data files.</jats:bold> ⊠</jats:p></jats:sec><jats:sec><jats:title>IMPACT STATEMENT</jats:title><jats:p>PlasmidTron utilises the phenotypic data normally available in bacterial population studies, such as antibiograms, virulence factors, or geographic information, to identify sequences that are likely to represent MGEs linked to the phenotype.</jats:p></jats:sec>
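
The k-mer filtering idea described in the abstract can be illustrated with a small Python sketch. This is not PlasmidTron's actual implementation: the function names (kmers, trait_kmers, select_reads), the min_fraction threshold, and the toy reads are all assumptions made for illustration. The sketch ignores reverse complements, sequencing errors and k-mer abundance, and it stops before the de novo assembly step.

```python
def kmers(seq, k):
    """Yield every substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def trait_kmers(case_samples, control_samples, k):
    """Return k-mers seen in at least one case sample and in no control sample.

    Each sample is a list of read strings (hypothetical input format).
    """
    control = {km for reads in control_samples for r in reads for km in kmers(r, k)}
    case = {km for reads in case_samples for r in reads for km in kmers(r, k)}
    return case - control


def select_reads(reads, candidate_kmers, k, min_fraction=0.5):
    """Keep reads in which at least min_fraction of k-mers are candidates.

    min_fraction is an illustrative threshold, not a value from the paper.
    """
    kept = []
    for r in reads:
        total = len(r) - k + 1
        if total <= 0:
            continue  # read shorter than k, so it has no k-mers
        hits = sum(1 for km in kmers(r, k) if km in candidate_kmers)
        if hits / total >= min_fraction:
            kept.append(r)
    return kept


# Toy data: one case sample and one control sample.
case_samples = [["ACGTACGTAA", "TTTTGGGGCC"]]
control_samples = [["ACGTACGTAA"]]

candidates = trait_kmers(case_samples, control_samples, k=4)
selected = select_reads(case_samples[0], candidates, k=4)
print(selected)
```

With these toy reads, only the read that shares no k-mers with the control sample is kept. In a real pipeline the selected reads would then be passed to a de novo assembler to produce contigs.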


188920.full.pdf
Submitted Version
Available under Creative Commons: NC 3.0
