[Bioc-devel] MRD measurements in Leukemic patients using NGS data in r
Vincent Carey
@tvjc @end|ng |rom ch@nn|ng@h@rv@rd@edu
Sun Mar 8 12:00:41 CET 2020
This has been a very interesting and illuminating dialogue but it really
should be moved to
support.bioconductor.org so that a record is available to the general user
community.
The solution here
> library(Homo.sapiens)
Error in library(Homo.sapiens) :
there is no package called ‘Homo.sapiens’
is BiocManager::install("Homo.sapiens")
On Sun, Mar 8, 2020 at 6:22 AM Mulder, R <r.mulder01 using umcg.nl> wrote:
> Dear Tim,
>
>
> Thanks again for your (quick) reply.
>
>
> I just ran it on my computer and it gave me some results but also several
> errors.
>
>
> The first part went perfect
>
>
> > BamFiles <- BamFileList(lapply(list.files(pattern=".bam$"), BamFile))
> > names(BamFiles) <- sapply(strsplit(list.files(pattern=".bam$"), "\\."),
> `[`, 1)
> > show(BamFiles)
> BamFileList of length 2
> names(2): here were my 2 files named without bam extension
>
> However the second part did give results but also several errors
>
> > library(Homo.sapiens)
> Error in library(Homo.sapiens) :
> there is no package called ‘Homo.sapiens’
> > gxs <- genes(Homo.sapiens)
> Error in genes(Homo.sapiens) : could not find function "genes"
> > names(gxs) <- mapIds(Homo.sapiens, names(gxs), "SYMBOL", "ENTREZID")
> Error in mapIds(Homo.sapiens, names(gxs), "SYMBOL", "ENTREZID") :
> could not find function "mapIds"
> > SomeGenes <- gxs[ c("HRAS", "WT1", "IGF2") ]
> Error: object 'gxs' not found
> >
> > sbparam <- ScanBamParam(which=SomeGenes)
> Error in ScanBamParam(which = SomeGenes) : object 'SomeGenes' not found
> > puparam <- PileupParam(max_depth=1e3, min_nucleotide_depth=5,
> + include_insertions=TRUE, min_minor_allele_depth=1)
> > piles <- lapply(BamFiles, pileup, ScanBamParam=sbparam,
> PileupParam=puparam)
> >
> > filterInvariant <- function(x) {
> + positions <- subset(x, duplicated(pos))$pos
> + subset(x, pos %in% positions)
> + }
> > show(do.call(rbind, lapply(lapply(piles, filterInvariant), head, 2)))
> seqnames pos strand nucleotide count
> myfile1 1 36931671 + A 251
> myfile1 1 36931671 + G 2
> myfile2 1 36931673 + - 1
> myfile2 1 36931673 + A 253
>
>
> Error in library(Homo.sapiens) :
> there is no package called ‘Homo.sapiens’
>
> For this error I installed the package BSgenome.Hsapiens.UCSC.hg19 and ran
> it again, but the errors at the next operations remained.
>
> > gxs <- genes(Homo.sapiens)
> Error in genes(Homo.sapiens) : could not find function "genes"
>
> Renaming Homo.sapiens as BSgenome.Hsapiens.UCSC.hg19 did not work. Should
> I look up how this genome is called?
>
>
> > names(gxs) <- mapIds(Homo.sapiens, names(gxs), "SYMBOL", "ENTREZID")
> Error in mapIds(Homo.sapiens, names(gxs), "SYMBOL", "ENTREZID") :
> could not find function "mapIds"
>
> Where can I find this function?
>
> > SomeGenes <- gxs[ c("HRAS", "WT1", "IGF2") ]
> Error: object 'gxs' not found
>
> Does this have to do with the fact that IGF2 is not included in my panel?
>
> > sbparam <- ScanBamParam(which=SomeGenes)
> Error in ScanBamParam(which = SomeGenes) : object 'SomeGenes' not found
>
> As a result of the previous error, right?
>
> Interestingly enough, I did get results. Why?
>
>
>
> ________________________________
> Van: Tim Triche, Jr. <tim.triche using gmail.com>
> Verzonden: zaterdag 7 maart 2020 22:28
> Aan: Mulder, R
> CC: bioc-devel using r-project.org
> Onderwerp: Re: [Bioc-devel] MRD measurements in Leukemic patients using
> NGS data in r
>
> something like this perhaps?
>
> library(Rsamtools)
> BamFiles <- BamFileList(lapply(list.files(pattern=".bam$"), BamFile))
> names(BamFiles) <- sapply(strsplit(list.files(pattern=".bam$"), "\\."),
> `[`, 1)
> show(BamFiles)
> #BamFileList of length 14
> #names(14): Normal_0813_S19_R1_001 Normal_2013_S18_R1_001 ... REMC_CD19
> REMC_CD3
>
> library(Homo.sapiens)
> gxs <- genes(Homo.sapiens)
> names(gxs) <- mapIds(Homo.sapiens, names(gxs), "SYMBOL", "ENTREZID")
> SomeGenes <- gxs[ c("HRAS", "WT1", "IGF2") ]
>
> sbparam <- ScanBamParam(which=SomeGenes)
> puparam <- PileupParam(max_depth=1e3, min_nucleotide_depth=5,
> include_insertions=TRUE, min_minor_allele_depth=1)
> piles <- lapply(BamFiles, pileup, ScanBamParam=sbparam,
> PileupParam=puparam)
>
> filterInvariant <- function(x) {
> positions <- subset(x, duplicated(pos))$pos
> subset(x, pos %in% positions)
> }
> show(do.call(rbind, lapply(lapply(piles, filterInvariant), head, 2)))
> # seqnames pos strand nucleotide count
> # Normal_0813_S19_R1_001.43 chr11 1979973 - C 1
> # Normal_0813_S19_R1_001.44 chr11 1979973 + T 1
> # Normal_2013_S18_R1_001.16 chr11 1979889 + T 3
> # Normal_2013_S18_R1_001.17 chr11 1979889 - T 1
> # Normal_2714_S20_R1_001.4 chr11 1979933 + A 1
> # Normal_2714_S20_R1_001.5 chr11 1979933 - A 1
> # P01-00020-T06-P-cfDNA.21 chr11 1980045 + G 1
> # P01-00020-T06-P-cfDNA.22 chr11 1980045 - G 1
> # P01-00020-T06-P-cfDNA2.11 chr11 1979884 + T 1
> # P01-00020-T06-P-cfDNA2.12 chr11 1979884 - T 1
> # P01-00021-T06-P-cfDNA.5 chr11 1979879 + C 1
> # P01-00021-T06-P-cfDNA.6 chr11 1979879 - C 1
> # P01-00024-T06-P-cfDNA.9 chr11 1979990 + C 1
> # P01-00024-T06-P-cfDNA.10 chr11 1979990 + T 2
> # P01-00028-T06-N-P01.3 chr11 1979851 + A 1
> # P01-00028-T06-N-P01.4 chr11 1979851 + G 1
> # P01-00028-T06-P-cfDNA.10 chr11 1979948 + A 1
> # P01-00028-T06-P-cfDNA.11 chr11 1979948 + G 1
> # P01-00028-T06-T-P01.2 chr11 1979850 + C 3
> # P01-00028-T06-T-P01.3 chr11 1979850 - C 2
> # P01-00028-T06-T-P02.3 chr11 1979851 + A 1
> # P01-00028-T06-T-P02.4 chr11 1979851 + C 1
>
>
> Or whatever you like (GAlignments would get you the reads, GAlignmentPairs
> the read pairs, and so on)
>
>
>
> --t
>
>
> On Sat, Mar 7, 2020 at 3:51 PM Mulder, R <r.mulder01 using umcg.nl<mailto:
> r.mulder01 using umcg.nl>> wrote:
>
> Dear Tim,
>
>
> I installed Rsamtools and ran it on a bamfile using the following command
> to get nucleotide count for 2 hotspot regions.
>
>
> library(Rsamtools)
> bamfile <- "test.bam"
> bf <- BamFile(bamfile)
> param <- ScanBamParam(which=GRanges(c("4", "9"), IRanges(start=c(55599321,
> 5073770), end=c(55599321, 5073770))))
>
> Now, I want to perform this on more than 1 bam file at once and on more
> region.
>
> Do you perhaps have ideas on how I could do this?
>
> Best,
>
> Rene
>
> ________________________________
> Van: Tim Triche, Jr. <tim.triche using gmail.com<mailto:tim.triche using gmail.com>>
> Verzonden: vrijdag 6 maart 2020 13:57:10
> Aan: Mulder, R
> CC: bioc-devel using r-project.org<mailto:bioc-devel using r-project.org>
> Onderwerp: Re: [Bioc-devel] MRD measurements in Leukemic patients using
> NGS data in r
>
> Oh hey, one last thing — if all you want is to get nucleotide counts per
> region of interest, just use pileup() in Rsamtools, with bamWhich(GRanges)
> holding a GRanges (Genomic Ranges) of your regions added to scanBamParams
> for each BAM. It sounds awkward but in practice it is super fast and will
> give you all the nucleotide and read level information you could want. One
> of my interns implemented this for mitochondrial variant calling in
> MTseeker when we got sick of using gmapR and being flagged for errors on
> not-Linux. (We gutted the entire package recently and have new, insanely
> deep examples from Oxford Nanopore direct RNA sequencing and from large
> single cell datasets; I need to add those and get the package back out of
> purgatory).
>
> That said, in the end you will want a LOT of validation material so this
> is very much just a starting point. But still, it’s your starting point, in
> R at least. And truthfully I much prefer R/Bioconductor idioms to (say)
> pysam or the like. htsnim is nice but then you’ll be implementing the ML
> bits from scratch, so I think your instincts to try R first are sensible.
>
> Good luck! Even if you use this for something else besides MRD, I think it
> will become a useful exercise.
>
> --t
>
> On Mar 5, 2020, at 4:36 PM, Tim Triche, Jr. <tim.triche using gmail.com<mailto:
> tim.triche using gmail.com>> wrote:
>
>
> a few thoughts:
>
> 1) look into Shearwater (
> https://bioconductor.org/packages/release/bioc/html/deepSNV.html<
> https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbioconductor.org%2Fpackages%2Frelease%2Fbioc%2Fhtml%2FdeepSNV.html&data=02%7C01%7Cr.mulder01%40umcg.nl%7C5dfaa267d461487c892408d7c2de972a%7C335122f9d4f44d67a2fccd6dc20dde70%7C0%7C0%7C637192133587503301&sdata=3yQeLWUNk%2FLjDAOquRQf6wgbGvn06KZGUUzhIw%2BBUMk%3D&reserved=0>),
> then
>
> 2) talk to Todd Druley @ WashU, Elli Pappaemanuil @ MSKCC, Ruud & Bob @
> Erasmus, the usual suspects
>
> 3) plan to validate w/ddPCR (at the absolute very least) and be aware that
> most MRD in leukemia is done by a combination of BCR/TCR + breakpoint PCR
> (lymphoid/fusion-driven) or DFN flow (myeloid + normal cyto)
>
> not saying that ML-based methods might not help, but if you've got a
> 30x-100x genome (or even 1000x FM1) and are trying to compete with existing
> standard approaches that can detect molecules at 1e-6, it'll be rough. An
> alternative approach (that has been used repeatedly) is to throw caution to
> the wind, generate primers for numerous subject-specific somatic variants,
> and use the ensemble to try and model MRD (speaking of ML). On the one
> hand, that could give the clinic a "customer for life"; on the other hand,
> it's not conducive to large-scale automation & deployment. As far as I
> know, it never got much traction, in leukemia or anywhere else. (Consider
> that flow cytometry is capable of detecting 1-in-10K to 1-in-a-million
> cells in most clinical flow labs.)
>
> Best of luck! (and if you're not already working with UMI-tagged reads,
> please talk to the people in #2 above; the reason most people won't go
> below 5% VAF is that you get thwacked by error rates at that level, and the
> reason most NGS-based MRD is based on UMIs is that existing PCR-based
> methods have 6 logs sensitivity.)
>
> --t
>
>
> On Thu, Mar 5, 2020 at 4:08 PM Mulder, R <r.mulder01 using umcg.nl<mailto:
> r.mulder01 using umcg.nl>> wrote:
> Hi,
>
>
> I was wondering if anyone could help me with a script and support for the
> above mentioned goal.
>
> For this I have several BAM files for which I want to determine de
> nucleotide count per region of interest. The latter could be several
> hotspot mutation sites. I would like to get an overall overview of all the
> BAM files. Next I want to use these counts to determine for any new BAM
> file if the count for a particular genomic position is higher than the
> allowable range, hence could indicate if a mutation is present. For this I
> would like to use the modified Thompson Tau test. I think machine learning
> could be used for this. So, why do I want to do all this? Well, normal NGS
> pipelines only deal with variants at a frequency of 5%. Mutatios below this
> frequency are often missed. To know if a mutation is present below this
> level, you showed dive into the alignment and most often manually
> investigate the base calls. I know that this races some questions regarding
> call qualities, but then again our conventional assays have actually
> confirmed some of these low mutations. In addition, NGS can
> be used to determine the presence of low frequent mutation which is of
> great importance for determining the measurable residual disease after
> treatment.
>
>
> I am new to r and bioconductor so I would be very thankful if someone
> could help me in my mission to setting up a script for this purpose.
>
>
> Thanks,
>
>
> Rene Mulder
>
> Laboratory Medicine
>
> University Medical Center Groningen
>
> The Netherlands
>
> ________________________________
> De inhoud van dit bericht is vertrouwelijk en alleen bes...{{dropped:15}}
>
> _______________________________________________
> Bioc-devel using r-project.org<mailto:Bioc-devel using r-project.org> mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel<
> https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fstat.ethz.ch%2Fmailman%2Flistinfo%2Fbioc-devel&data=02%7C01%7Cr.mulder01%40umcg.nl%7C5dfaa267d461487c892408d7c2de972a%7C335122f9d4f44d67a2fccd6dc20dde70%7C0%7C0%7C637192133587503301&sdata=jDiihphF1FySPCdvLtXHMsmlkQ7ta6Fo0AjBvMa5VIA%3D&reserved=0
> >
> ________________________________
> De inhoud van dit bericht is vertrouwelijk en alleen bestemd voor de
> geadresseerde(n). Anderen dan de geadresseerde(n) mogen geen gebruik maken
> van dit bericht, het niet openbaar maken of op enige wijze verspreiden of
> vermenigvuldigen. Het UMCG kan niet aansprakelijk gesteld worden voor een
> incomplete aankomst of vertraging van dit verzonden bericht.
>
> The contents of this message are confidential and only intended for the
> eyes of the addressee(s). Others than the addressee(s) are not allowed to
> use this message, to make it public or to distribute or multiply this
> message in any way. The UMCG cannot be held responsible for incomplete
> reception or delay of this transferred message.
> ________________________________
> De inhoud van dit bericht is vertrouwelijk en alleen bestemd voor de
> geadresseerde(n). Anderen dan de geadresseerde(n) mogen geen gebruik maken
> van dit bericht, het niet openbaar maken of op enige wijze verspreiden of
> vermenigvuldigen. Het UMCG kan niet aansprakelijk gesteld worden voor een
> incomplete aankomst of vertraging van dit verzonden bericht.
>
> The contents of this message are confidential and only intended for the
> eyes of the addressee(s). Others than the addressee(s) are not allowed to
> use this message, to make it public or to distribute or multiply this
> message in any way. The UMCG cannot be held responsible for incomplete
> reception or delay of this transferred message.
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioc-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
--
The information in this e-mail is intended only for the ...{{dropped:18}}
More information about the Bioc-devel
mailing list