[Bioc-devel] MRD measurements in Leukemic patients using NGS data in r

Sat Mar 7 22:28:36 CET 2020

something like this perhaps?

library(Rsamtools)
BamFiles <- BamFileList(lapply(list.files(pattern=".bam$"), BamFile))
names(BamFiles) <- sapply(strsplit(list.files(pattern=".bam$"), "\\."),
`[`, 1)
show(BamFiles)
#BamFileList of length 14
#names(14): Normal_0813_S19_R1_001 Normal_2013_S18_R1_001 ... REMC_CD19
REMC_CD3

library(Homo.sapiens)
gxs <- genes(Homo.sapiens)
names(gxs)  <- mapIds(Homo.sapiens, names(gxs), "SYMBOL", "ENTREZID")
SomeGenes <- gxs[ c("HRAS", "WT1", "IGF2") ]

sbparam <- ScanBamParam(which=SomeGenes)
puparam <- PileupParam(max_depth=1e3, min_nucleotide_depth=5,
                       include_insertions=TRUE, min_minor_allele_depth=1)
piles <- lapply(BamFiles, pileup, ScanBamParam=sbparam, PileupParam=puparam)

filterInvariant <- function(x) {
  positions <- subset(x, duplicated(pos))$pos
  subset(x, pos %in% positions)
}
show(do.call(rbind, lapply(lapply(piles, filterInvariant), head, 2)))
#                           seqnames     pos strand nucleotide count
# Normal_0813_S19_R1_001.43    chr11 1979973      -          C     1
# Normal_0813_S19_R1_001.44    chr11 1979973      +          T     1
# Normal_2013_S18_R1_001.16    chr11 1979889      +          T     3
# Normal_2013_S18_R1_001.17    chr11 1979889      -          T     1
# Normal_2714_S20_R1_001.4     chr11 1979933      +          A     1
# Normal_2714_S20_R1_001.5     chr11 1979933      -          A     1
# P01-00020-T06-P-cfDNA.21     chr11 1980045      +          G     1
# P01-00020-T06-P-cfDNA.22     chr11 1980045      -          G     1
# P01-00020-T06-P-cfDNA2.11    chr11 1979884      +          T     1
# P01-00020-T06-P-cfDNA2.12    chr11 1979884      -          T     1
# P01-00021-T06-P-cfDNA.5      chr11 1979879      +          C     1
# P01-00021-T06-P-cfDNA.6      chr11 1979879      -          C     1
# P01-00024-T06-P-cfDNA.9      chr11 1979990      +          C     1
# P01-00024-T06-P-cfDNA.10     chr11 1979990      +          T     2
# P01-00028-T06-N-P01.3        chr11 1979851      +          A     1
# P01-00028-T06-N-P01.4        chr11 1979851      +          G     1
# P01-00028-T06-P-cfDNA.10     chr11 1979948      +          A     1
# P01-00028-T06-P-cfDNA.11     chr11 1979948      +          G     1
# P01-00028-T06-T-P01.2        chr11 1979850      +          C     3
# P01-00028-T06-T-P01.3        chr11 1979850      -          C     2
# P01-00028-T06-T-P02.3        chr11 1979851      +          A     1
# P01-00028-T06-T-P02.4        chr11 1979851      +          C     1

Or whatever you like (GAlignments would get you the reads, GAlignmentPairs
the read pairs, and so on)

--t

On Sat, Mar 7, 2020 at 3:51 PM Mulder, R <r.mulder01 using umcg.nl> wrote:

> Dear Tim,
>
>
> I installed Rsamtools and ran it on a bamfile using the following command
> to get nucleotide count for 2 hotspot regions.
>
>
> library(Rsamtools)
> bamfile <- "test.bam"
> bf <- BamFile(bamfile)
> param <- ScanBamParam(which=GRanges(c("4", "9"), IRanges(start=c(55599321,
> 5073770), end=c(55599321, 5073770))))
>
> Now, I want to perform this on more than 1 bam file at once and on more
> region.
>
> Do you perhaps have ideas on how I could do this?
>
> Best,
>
> Rene
>
> ------------------------------
> *Van:* Tim Triche, Jr. <tim.triche using gmail.com>
> *Verzonden:* vrijdag 6 maart 2020 13:57:10
> *Aan:* Mulder, R
> *CC:* bioc-devel using r-project.org
> *Onderwerp:* Re: [Bioc-devel] MRD measurements in Leukemic patients using
> NGS data in r
>
> Oh hey, one last thing — if all you want is to get nucleotide counts per
> region of interest, just use pileup() in Rsamtools, with bamWhich(GRanges)
> holding a GRanges (Genomic Ranges) of your regions added to scanBamParams
> for each BAM. It sounds awkward but in practice it is super fast and will
> give you all the nucleotide and read level information you could want. One
> of my interns implemented this for mitochondrial variant calling in
> MTseeker when we got sick of using gmapR and being flagged for errors on
> not-Linux. (We gutted the entire package recently and have new, insanely
> deep examples from Oxford Nanopore direct RNA sequencing and from large
> single cell datasets; I need to add those and get the package back out of
> purgatory).
>
> That said, in the end you will want a LOT of validation material so this
> is very much just a starting point. But still, it’s your starting point, in
> R at least. And truthfully I much prefer R/Bioconductor idioms to (say)
> pysam or the like. htsnim is nice but then you’ll be implementing the ML
> bits from scratch, so I think your instincts to try R first are sensible.
>
> Good luck! Even if you use this for something else besides MRD, I think it
> will become a useful exercise.
>
> --t
>
> On Mar 5, 2020, at 4:36 PM, Tim Triche, Jr. <tim.triche using gmail.com> wrote:
>
> 
> a few thoughts:
>
> 1) look into Shearwater (
> https://bioconductor.org/packages/release/bioc/html/deepSNV.html
> <https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbioconductor.org%2Fpackages%2Frelease%2Fbioc%2Fhtml%2FdeepSNV.html&data=02%7C01%7Cr.mulder01%40umcg.nl%7C1d7ff8bff12b40c27b7a08d7c1cde470%7C335122f9d4f44d67a2fccd6dc20dde70%7C0%7C0%7C637190962347080053&sdata=LpUeflml9u2XLENpmDAEuDBRANXZpjkFbIEiJAeP2pw%3D&reserved=0>),
> then
>
> 2) talk to Todd Druley @ WashU, Elli Pappaemanuil @ MSKCC, Ruud & Bob @
> Erasmus, the usual suspects
>
> 3) plan to validate w/ddPCR (at the absolute very least) and be aware that
> most MRD in leukemia is done by a combination of BCR/TCR + breakpoint PCR
> (lymphoid/fusion-driven) or DFN flow (myeloid + normal cyto)
>
> not saying that ML-based methods might not help, but if you've got a
> 30x-100x genome (or even 1000x FM1) and are trying to compete with existing
> standard approaches that can detect molecules at 1e-6, it'll be rough.  An
> alternative approach (that has been used repeatedly) is to throw caution to
> the wind, generate primers for numerous subject-specific somatic variants,
> and use the ensemble to try and model MRD (speaking of ML). On the one
> hand, that could give the clinic a "customer for life"; on the other hand,
> it's not conducive to large-scale automation & deployment. As far as I
> know, it never got much traction, in leukemia or anywhere else.  (Consider
> that flow cytometry is capable of detecting 1-in-10K to 1-in-a-million
> cells in most clinical flow labs.)
>
> Best of luck! (and if you're not already working with UMI-tagged reads,
> please talk to the people in #2 above; the reason most people won't go
> below 5% VAF is that you get thwacked by error rates at that level, and the
> reason most NGS-based MRD is based on UMIs is that existing PCR-based
> methods have 6 logs sensitivity.)
>
> --t
>
>
> On Thu, Mar 5, 2020 at 4:08 PM Mulder, R <r.mulder01 using umcg.nl> wrote:
>
>> Hi,
>>
>>
>> I was wondering if anyone could help me with a script and support for the
>> above mentioned goal.
>>
>> For this I have several BAM files for which I want to determine de
>> nucleotide count per region of interest. The latter could be several
>> hotspot mutation sites. I would like to get an overall overview of all the
>> BAM files. Next I want to use these counts to determine for any new BAM
>> file if the count for a particular genomic position is higher than the
>> allowable range, hence could indicate if a mutation is present. For this I
>> would like to use the modified Thompson Tau test. I think machine learning
>> could be used for this. So, why do I want to do all this? Well, normal NGS
>> pipelines only deal with variants at a frequency of 5%. Mutatios below this
>> frequency are often missed. To know if a mutation is present below this
>> level, you showed dive into the alignment and most often manually
>> investigate the base calls. I know that this races some questions regarding
>> call qualities, but then again our conventional assays have actually
>> confirmed some of these low mutations. In addition, NGS can
>>  be used to determine the presence of low frequent mutation which is of
>> great importance for determining the measurable residual disease after
>> treatment.
>>
>>
>> I am new to r and bioconductor so I would be very thankful if someone
>> could help me in my mission to setting up a script for this purpose.
>>
>>
>> Thanks,
>>
>>
>> Rene Mulder
>>
>> Laboratory Medicine
>>
>> University Medical Center Groningen
>>
>> The Netherlands
>>
>> ________________________________
>> De inhoud van dit bericht is vertrouwelijk en alleen bes...{{dropped:15}}
>>
>> _______________________________________________
>> Bioc-devel using r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>> <https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fstat.ethz.ch%2Fmailman%2Flistinfo%2Fbioc-devel&data=02%7C01%7Cr.mulder01%40umcg.nl%7C1d7ff8bff12b40c27b7a08d7c1cde470%7C335122f9d4f44d67a2fccd6dc20dde70%7C0%7C0%7C637190962347090042&sdata=HXNdz6iiEUQ3l%2BVwwDIGX0RAdO7Tdd7r7f4FcgW1k9M%3D&reserved=0>
>>
> ------------------------------
> De inhoud van dit bericht is vertrouwelijk en alleen bestemd voor de
> geadresseerde(n). Anderen dan de geadresseerde(n) mogen geen gebruik maken
> van dit bericht, het niet openbaar maken of op enige wijze verspreiden of
> vermenigvuldigen. Het UMCG kan niet aansprakelijk gesteld worden voor een
> incomplete aankomst of vertraging van dit verzonden bericht.
>
> The contents of this message are confidential and only intended for the
> eyes of the addressee(s). Others than the addressee(s) are not allowed to
> use this message, to make it public or to distribute or multiply this
> message in any way. The UMCG cannot be held responsible for incomplete
> reception or delay of this transferred message.
>

	[[alternative HTML version deleted]]