Software Tool Article

Disambiguate: An open-source application for disambiguating two species in next generation sequencing data from grafted samples

[version 1; peer review: 2 approved, 1 approved with reservations]
PUBLISHED 22 Nov 2016

This article is included in the Python collection.

Abstract

Grafting of cell lines and primary tumours is a crucial step in the drug development process between cell line studies and clinical trials. Disambiguate is a program for computationally separating the sequencing reads of two species derived from grafted samples. Disambiguate operates on alignments to the two species and separates the components with very high sensitivity and specificity, as illustrated in artificially mixed human-mouse samples. This allows for maximum recovery of data from target tumours for more accurate variant calling and gene expression quantification. Given that no general-use open source algorithm accessible to the bioinformatics community exists for separating the two species' data, the proposed Disambiguate tool presents a novel approach to, and an improvement in, the sequence analysis of grafted samples. Both Python and C++ implementations are available, and they are integrated into several open and closed source pipelines. Disambiguate is open source and is freely available at https://github.com/AstraZeneca-NGS/disambiguate.

Keywords

NGS, patient derived xenograft, explant, disambiguation, sequencing

Introduction

Xenografts, both cell line and primary tumour, are routinely profiled in preclinical and translational research. Xenografts are used to study everything from new target identification to responses to targeted therapeutics and mechanisms of resistance [1] in an environment that is more realistic than just 2D cell lines. However, due to mouse stromal contamination of the human tumour, not all the data resulting from studying the extracted samples are guaranteed to be of human origin.

Direct high-throughput sequencing of grafted samples containing a mixture of two species is routine practice. However, given the high volume of data and the computational challenges of alignment and k-mer identification, new strategies are required to computationally separate the two species' components for more accurate downstream analysis [1], especially to reduce variant calling artefacts. The two-species alignment approach proposed in Bradford et al. [1] excludes reads that align to both organisms, dismissing a large portion of the data, as the cross-species alignment rates in Table 1 and Table 2 show.

Table 1. Read pairs assigned human (hg19) and mouse (mm10) post disambiguation in BWA aligned DNA-seq data.

The ’Ambiguous’ column includes reads that aligned to neither or had equal quality scores for the alignments and could not be disambiguated.

Sample      mm10       hg19       Ambiguous  Read pairs total
SRR1176814  47197650   26157†     88542      47312349
SRR1528269  11502††    77102895   153767     77268164

†Down from 25638785 read pairs with alignment to hg19

††Down from 39686392 read pairs with alignment to mm10

Table 2. Read pairs assigned human (hg19) and mouse (mm10) post disambiguation in STAR aligned RNA-seq data.

The ’Ambiguous’ column includes reads that aligned to neither or had equal quality scores for the alignments and could not be disambiguated.

Sample      mm10       hg19       Ambiguous  Read pairs total
SRR1930152  23126086   80694†     849364     24056144
SRR387400   94289††    49677937   9880844    59653070

†Down from 3005372 read pairs with alignment to hg19

††Down from 6001230 read pairs with alignment to mm10

Algorithms designed for disambiguating host and tumour sequences include the Xenome tool [2], which is based on machine learning applied to k-mers from both species. However, the implementation is not readily available and is not free for non-academic users. In [3], the authors also aligned the reads to both species, but made no attempt to disambiguate the data, and no implementation is readily available.

Here, an alternative approach using read alignment quality is proposed to further disambiguate reads that can be mapped to both species. Alignment is first performed to both species independently and the reads are disambiguated as a post-processing step. There is no requirement to maintain pseudo reference indices based on combinations of reference sequences. This approach shows a very high sensitivity and specificity on artificially generated samples obtained by mixing reads from the individual species. The Disambiguate tool is community supported and widely used in several open and closed source pipelines.

Methods

Implementation

The Disambiguate algorithm works by operating on natural name sorted BAM files from alignments to two species. Name sorting is a critical part in not having to read all the data from both species' alignments into memory simultaneously; the same read aligned to both species is disambiguated on the fly by going through both alignment files synchronously. For reads that have alignments to both species and therefore require disambiguation, the specific details of the disambiguation process are slightly different for the different aligners. Thus far the algorithm has been tested for BWA-MEM [4] and Bowtie2 [5] for DNA-seq, and TopHat2 [6], STAR [7] and Hisat2 [8] for RNA-seq. Illumina's paired-end sequencing is preferred, as the mate can often break a tie. Figure 1 illustrates the disambiguation process.
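The synchronous traversal described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the Disambiguate source: records are plain (name, payload) tuples standing in for BAM alignments, and the helper names natural_key, grouped and walk_in_sync are hypothetical. The natural ordering used here (digit runs compared numerically, digits sorting before text) is an assumption and would need to match the tool that produced the name sorted files.

```python
import re
from itertools import groupby


def natural_key(name):
    """Sort key comparing digit runs numerically, so 'r9' precedes 'r10'."""
    chunks = [
        (0, int(tok), "") if "0" <= tok[0] <= "9" else (1, 0, tok)
        for tok in re.findall(r"[0-9]+|[^0-9]+", name)
    ]
    # The raw name is appended so that only identical names compare equal.
    return (chunks, name)


def grouped(records):
    """Yield (name, [records]) for each run of consecutive records sharing a name."""
    for name, run in groupby(records, key=lambda rec: rec[0]):
        yield name, list(run)


def walk_in_sync(species_a, species_b):
    """Walk two name sorted record streams together.

    Yields (name, a_records, b_records); a list is empty when the read has no
    records in that stream. Only one read name per stream is held in memory.
    """
    it_a, it_b = grouped(species_a), grouped(species_b)
    cur_a, cur_b = next(it_a, None), next(it_b, None)
    while cur_a is not None or cur_b is not None:
        if cur_b is None or (
            cur_a is not None and natural_key(cur_a[0]) < natural_key(cur_b[0])
        ):
            # Read name seen only in stream A at this position.
            yield cur_a[0], cur_a[1], []
            cur_a = next(it_a, None)
        elif cur_a is None or natural_key(cur_b[0]) < natural_key(cur_a[0]):
            # Read name seen only in stream B at this position.
            yield cur_b[0], [], cur_b[1]
            cur_b = next(it_b, None)
        else:
            # Same read name in both streams: this is the case to disambiguate.
            yield cur_a[0], cur_a[1], cur_b[1]
            cur_a, cur_b = next(it_a, None), next(it_b, None)


if __name__ == "__main__":
    human = [("r1", "h1/1"), ("r1", "h1/2"), ("r2", "h2"), ("r10", "h10")]
    mouse = [("r2", "m2"), ("r9", "m9"), ("r10", "m10")]
    for name, a, b in walk_in_sync(human, mouse):
        print(name, len(a), len(b))
```

With real data, the tuples could plausibly be produced from Pysam, for example by pairing each alignment's query_name with the alignment itself while reading a BAM file sequentially; that wiring is left out here so the sketch stays self-contained.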


Figure 1. The disambiguation process illustrated.

Alignment is first performed against both species. The disambiguation application then operates on the raw, natural name sorted BAM files to assign the read pairs into one of the two species or as ambiguous for unresolved cases.

Disambiguate assigns reads on a per-pair basis, based on the highest quality alignment of the read pair. For BWA and STAR, the alignment score (AS, higher is better) is used as the primary disambiguation metric, followed by the edit distance to the reference (NM, lower is better). For TopHat2 and Hisat2 alignments, the sum (lower is better) of the edit distance, the number of reported alignments (NH) and the number of gap opens (XO) is used.
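One possible reading of this rule is sketched below in Python. It is not the tool's actual code: the function name disambiguate_pair, the representation of each alignment as a dict of SAM optional tags, and the labels "A", "B" and "ambiguous" are all assumptions made for illustration. The sketch also assumes that a pair aligned to only one species is assigned to that species, and that a pair aligned to neither is reported as ambiguous, in line with the table captions.

```python
def disambiguate_pair(a_tags, b_tags, aligner="bwa"):
    """Assign one read pair to species 'A', species 'B' or 'ambiguous'.

    a_tags and b_tags are lists of dicts, one dict of SAM optional tags per
    alignment of the pair to species A and B respectively (hypothetical
    representation). The best value across the pair is compared.
    """
    if not a_tags and not b_tags:
        return "ambiguous"  # aligned to neither species
    if not b_tags:
        return "A"
    if not a_tags:
        return "B"

    if aligner in ("bwa", "star"):
        # Primary metric: alignment score, higher is better.
        best_a = max(t["AS"] for t in a_tags)
        best_b = max(t["AS"] for t in b_tags)
        if best_a != best_b:
            return "A" if best_a > best_b else "B"
        # Tie-break: edit distance to the reference, lower is better.
        nm_a = min(t["NM"] for t in a_tags)
        nm_b = min(t["NM"] for t in b_tags)
        if nm_a != nm_b:
            return "A" if nm_a < nm_b else "B"
        return "ambiguous"

    if aligner in ("tophat2", "hisat2"):
        # Single penalty: NM + NH + XO, lower is better.
        pen_a = min(t["NM"] + t["NH"] + t["XO"] for t in a_tags)
        pen_b = min(t["NM"] + t["NH"] + t["XO"] for t in b_tags)
        if pen_a != pen_b:
            return "A" if pen_a < pen_b else "B"
        return "ambiguous"

    raise ValueError(f"unsupported aligner: {aligner}")


if __name__ == "__main__":
    human = [{"AS": 100, "NM": 0}, {"AS": 96, "NM": 1}]
    mouse = [{"AS": 100, "NM": 2}, {"AS": 90, "NM": 3}]
    print(disambiguate_pair(human, mouse, aligner="bwa"))
```

In the usage example the best alignment scores tie, so the lower edit distance on the first species decides the assignment; when every metric ties, the pair stays ambiguous.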

Operation

The algorithm is implemented in Python (depending on the Pysam package) and C++ (depending on BamTools); the C++ version is approximately four times faster than the Python code. 64-bit Unix/Linux systems are supported.

Given name sorted alignment (BAM) files aligned to the two species of interest (e.g. human and mouse), the algorithm infers for each read the most likely origin. The output contains BAM files for both species, BAM files for ambiguous reads and a text file describing how many read pairs were assigned to each BAM file. The simplest way to perform all of the alignment and disambiguation is by running bcbio, in which Disambiguate is integrated, on the raw sequencing data.

Results

To illustrate the utility of Disambiguate, publicly available raw human and mouse exome sequencing reads (100 bp paired-end Illumina data) were downloaded from the European Nucleotide Archive (ENA) under run accessions SRR1176814 and SRR1528269.

The reads were concatenated, aligned against hg19 and mm10 using BWA-MEM, and processed using Disambiguate. Before disambiguation, the human sample (SRR1528269) had 39686392 read pairs (out of 77268164 in total) for which at least one read aligned to mouse. Similarly, the mouse sample (SRR1176814) had 25638785 read pairs (out of 47312349 in total) for which at least one read aligned to human. Table 1 summarises the post-disambiguation results. As can be seen, the disambiguation algorithm correctly separates virtually all of the read pairs. In other internal studies, Disambiguate has repeatedly highlighted samples with a low human-assigned component, correlating with poor extraction or lack of growth of the tumour cells in the host.

STAR-aligned human (SRR387400) and mouse (SRR1930152) RNA-seq data were also analysed, with very similar results (see Table 2).
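The counts in Table 1 and Table 2 can be turned into per-sample fractions with a short calculation, sketched below in Python. The counts are copied from the two tables and the species of origin of each sample is taken from the text above; the dictionary layout and the function name summarise are illustrative assumptions only. Because the 'Ambiguous' column also includes read pairs that aligned to neither species, the fraction labelled "correct" is a lower bound on how well aligned read pairs are separated.

```python
# Read pair counts as reported in Table 1 (BWA-MEM, DNA-seq) and
# Table 2 (STAR, RNA-seq); "origin" is the known species of each sample.
TABLES = {
    "Table 1": {
        "SRR1176814": {"origin": "mm10", "mm10": 47197650, "hg19": 26157,
                       "ambiguous": 88542, "total": 47312349},
        "SRR1528269": {"origin": "hg19", "mm10": 11502, "hg19": 77102895,
                       "ambiguous": 153767, "total": 77268164},
    },
    "Table 2": {
        "SRR1930152": {"origin": "mm10", "mm10": 23126086, "hg19": 80694,
                       "ambiguous": 849364, "total": 24056144},
        "SRR387400": {"origin": "hg19", "mm10": 94289, "hg19": 49677937,
                      "ambiguous": 9880844, "total": 59653070},
    },
}


def summarise(row):
    """Return the fractions of read pairs assigned correctly, wrongly or left ambiguous."""
    other = "hg19" if row["origin"] == "mm10" else "mm10"
    # Sanity check: the three assignment columns must add up to the total.
    if row["mm10"] + row["hg19"] + row["ambiguous"] != row["total"]:
        raise ValueError("columns do not sum to the total")
    total = row["total"]
    return {
        "correct": row[row["origin"]] / total,
        "wrong": row[other] / total,
        "ambiguous": row["ambiguous"] / total,
    }


if __name__ == "__main__":
    for table, samples in TABLES.items():
        for sample, row in samples.items():
            s = summarise(row)
            print(f"{table} {sample}: correct {s['correct']:.4%}, "
                  f"wrong {s['wrong']:.4%}, ambiguous {s['ambiguous']:.4%}")
```

For both exome samples in Table 1, more than 99.7% of read pairs are assigned to the species of origin, and in all four samples fewer than 0.5% of read pairs are assigned to the other species.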

Conclusions

In summary, Disambiguate provides an important tool for computationally separating sequence reads originating from two species. In human-mouse studies it also allows the study of the mouse stromal component for gene expression and DNA variation.

In addition to RNA-seq and whole genome sequencing, it is worth highlighting that for targeted hybridisation capture sequencing of xenograft samples, where baits from a single species are used, disambiguation is still highly recommended. This is best seen in Table 1 where a large number of human exome reads aligned to mouse and would potentially affect downstream interpretation without disambiguation.

Disambiguate has been well adopted in the open source community: it is integrated in the open source bcbio pipeline and has been used successfully in both RNA and DNA sequencing of xenografts at AstraZeneca and other research institutes. This is evidenced by the number of support tickets from a variety of organisations on the bcbio-nextgen GitHub page.

Data availability

The data used here is available from the European Nucleotide Archive with Run Accession numbers SRR1176814 and SRR1528269.

Software availability

Software integrating Disambiguate available from: https://github.com/chapmanb/bcbio-nextgen

Latest source code: https://github.com/AstraZeneca-NGS/disambiguate

Archived source code as at time of publication: DOI: 10.5281/zenodo.1660179

License: MIT.

how to cite this article
Ahdesmäki MJ, Gray SR, Johnson JH and Lai Z. Disambiguate: An open-source application for disambiguating two species in next generation sequencing data from grafted samples [version 1; peer review: 2 approved, 1 approved with reservations] F1000Research 2016, 5:2741 (https://doi.org/10.12688/f1000research.10082.1)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
Open Peer Review

Key to Reviewer Statuses
Approved: The paper is scientifically sound in its current form and only minor, if any, improvements are suggested.
Approved with reservations: A number of small changes, sometimes more significant revisions, are required to address specific details and improve the paper's academic merit.
Not approved: Fundamental flaws in the paper seriously undermine the findings and conclusions.
Reviewer Report 05 Dec 2016
Gavin R. Oliver, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA 
Asha A. Nair, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA 
Approved with Reservations
We believe that overall the software tool article by Ahdesmäki et al. seems sound and provides a solution to a problem that appears to be inadequately addressed in the field currently. 

Nonetheless, we believe the manuscript would benefit from some ...
HOW TO CITE THIS REPORT
Oliver GR and Nair AA. Reviewer Report For: Disambiguate: An open-source application for disambiguating two species in next generation sequencing data from grafted samples [version 1; peer review: 2 approved, 1 approved with reservations]. F1000Research 2016, 5:2741 (https://doi.org/10.5256/f1000research.10863.r17881)
  • Author Response, 24 Jan 2017
    Miika Ahdesmäki
    Dear Gavin and Asha,
    Many thanks for the very detailed review and comments. We have addressed your points in v2 of the manuscript.

    Intro/background:
    We have added the text ...
Reviewer Report 25 Nov 2016
Matthew D. Eldridge, Cancer Research UK Cambridge Institute, University of Cambridge, Cambridge, UK 
Approved
This paper describes a computational tool for separating sequencing reads from a sample that contains DNA or RNA from two species. This is a necessary pre-processing step for genomic or transcriptomic analysis of patient-derived xenograft cancer models.

...
HOW TO CITE THIS REPORT
Eldridge MD. Reviewer Report For: Disambiguate: An open-source application for disambiguating two species in next generation sequencing data from grafted samples [version 1; peer review: 2 approved, 1 approved with reservations]. F1000Research 2016, 5:2741 (https://doi.org/10.5256/f1000research.10863.r17879)
  • Author Response, 24 Jan 2017
    Miika Ahdesmäki
    Dear Matthew,
    Many thanks for reviewing our manuscript and for the comments. We have modified v2 of the manuscript to address the points you raise, namely:
    1. The aligner tags
    ...
Reviewer Report 23 Nov 2016
Daniel Nicorici, Orion Corporation Orion Pharma, Espoo, Finland 
Approved
This paper introduces a tool, named Disambiguate, for computationally separating the DNA/RNA sequencing reads of two species, for example in xenograft samples. The tool takes as input BAM files from a wide range of NGS aligners.

...
HOW TO CITE THIS REPORT
Nicorici D. Reviewer Report For: Disambiguate: An open-source application for disambiguating two species in next generation sequencing data from grafted samples [version 1; peer review: 2 approved, 1 approved with reservations]. F1000Research 2016, 5:2741 (https://doi.org/10.5256/f1000research.10863.r17877)
  • Author Response, 24 Jan 2017
    Miika Ahdesmäki
    Dear Daniel,
    Thank you for the review, your comments are much appreciated. We have addressed your points in v2 of the manuscript.
    1. We have explicitly mentioned in the
    ...
