[Bioc-devel] Rsamtools applyPileups function not merging positions from multiple files if not identical
Martin Morgan
mtmorgan at fhcrc.org
Wed Apr 23 14:10:08 CEST 2014
Thanks for the file snippets; I'm able to reproduce this bug and will let you
know of its resolution. Martin
On 04/22/2014 07:44 AM, Jonathon Hill wrote:
> Hi Martin,
>
> Thank you for your response. I checked the header and it says that it is
> coordinate sorted, so that shouldn’t be the problem. Here are the results of the
> code you provided:
>
> > r3 = applyPileups(PileupFiles(c(fl1, fl2)), function(x) x, param=testparam)
> > any(duplicated(r3[[1]][["pos"]]))
> [1] TRUE
> > pos = r3[[1]][["pos"]]
> > table(table(pos))
>
> 1 2
> 4834 3115
> > udpos = unique(pos[duplicated(pos)])
> > head(pos[match(pos, udpos)], 20)
> [1] 135003 135006 135007 135008 135009 135010 135011 135012 135013 135014
> 135015 135016 135017 135018 135019 135020 135021 135022 135023
> [20] 135024
> > head(match(pos, udpos), 20)
> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
> > table(pos)[1:50]
> pos
> 134762 134763 134764 134765 134766 134767 134768 134769 134770 134771 134772
> 134773 134774 134775 134776 134777 134778 134779 134780
> 1 1 1 1 1 1 1 1 1 1 1
> 1 1 1 1 1 1 1 1
> 134781 134782 134783 134784 134785 134786 134787 134788 134991 134992 134993
> 134994 134995 134999 135001 135002 135003 135004 135005
> 1 1 1 1 1 1 1 1 1 1 1
> 1 1 1 1 1 2 1 1
> 135006 135007 135008 135009 135010 135011 135012 135013 135014 135015 135016 135017
> 2 2 2 2 2 2 2 2 2 2 2
> 2
>
> I think the last one shows it well. There are duplicates throughout the file,
> anywhere that there are reads in both files. I have attached the bam files you
> requested showing the region used here.
>
> Thanks,
>
> Jonathon
>
> On Apr 21, 2014, at 7:17 PM, Martin Morgan <mtmorgan at fhcrc.org
> <mailto:mtmorgan at fhcrc.org>> wrote:
>
>> On 04/21/2014 02:33 PM, Jonathon Hill wrote:
>>> Hi,
>>>
>>> I have been trying to use Rsamtools’ applyPileups function to compare two
>>> files position-by-position. In order to test it out, I simply ran:
>>> minBaseQuality <- 20
>>> minMapQuality <- 30
>>> minDepth <- 10
>>> maxDepth <- 1000
>>> testparam <- PileupParam(what="seq",
>>> which=GRanges(“chr20", IRanges(1, 1000000)),
>>> minBaseQuality=minBaseQuality,
>>> minMapQuality=minMapQuality,
>>> minDepth=minDepth,
>>> maxDepth=maxDepth,
>>> )
>>> fl1 <- "10696X1_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam"
>>> fl2 <- "10696X4_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam"
>>> r3 = applyPileups(PileupFiles(c(fl1, fl2)), function(x) x, param=testparam)
>>>
>>> My understanding is that this should result in a three-dimensional array with
>>> “ACTGN Counts” in the first dimension, files in the second and position in
>>> the third. Positions with overlapping reads in both files should thus be
>>> collapsed into a single line in the third dimension. However, selecting one
>>> of these positions shows that they are duplicated:
>>>
>>> r3[[1]][["seq"]][ , , r3[[1]][["pos"]] == 135003]
>>
>> I think your understanding is basically correct.
>>
>> The function is assuming that the BAM files are sorted by position (with,
>> e.g., sortBam, but the files don't have to be sorted by Rsamtools).
>>
>> Executing a similar command gives me
>>
>> > str(r3[[1]])
>> List of 3
>> $ seqnames: Named int 211195
>> ..- attr(*, "names")= chr "chr20"
>> $ pos : int [1:211195] 60026 60027 60028 60029 60030 60031 60032 60033
>> 60034 60035 ...
>> $ seq : int [1:5, 1:2, 1:211195] 0 0 0 0 0 0 0 0 0 0 ...
>> ..- attr(*, "dimnames")=List of 3
>> .. ..$ : chr [1:5] "A" "C" "G" "T" ...
>> .. ..$ : chr [1:2] "normal_srx113635_sorted.bam" "tumor_srx036691_sorted.bam"
>> .. ..$ : NULL
>>
>> Do you get something similar, especially the identical seqnames, pos
>> dimension, and third dimension of seq? 'pos' should apparently be unique; so
>>
>> > any(duplicated(r3[[1]][["pos"]]))
>> [1] FALSE
>>
>> If there are duplicates, I wonder how many there are and where they occur
>>
>> pos = r3[[1]][["pos"]]
>> table(table(pos))
>> udpos = unique(pos[duplicated(pos)])
>> head(pos[match(pos, udpos)], 20)
>> head(match(pos, udpos), 20)
>>
>> If nothing is suggested by the above, can you make a subset of the BAM files
>> available to me, e.g., the result of
>>
>> param = ScanBamParam(which=GRanges("chr20", IRanges(1, 1000000)))
>> filterBam(fls[1], tempfile(), param=param)
>>
>> Thanks,
>>
>> Martin
>>
>>> yields:
>>>
>>> , , 1
>>>
>>> 10696X1_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam
>>> 10696X4_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam
>>> A 0
>>> 0
>>> C 0
>>> 0
>>> G 10
>>> 0
>>> T 0
>>> 0
>>> N 0
>>> 0
>>>
>>> , , 2
>>>
>>> 10696X1_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam
>>> 10696X4_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam
>>> A 0
>>> 0
>>> C 0
>>> 0
>>> G 0
>>> 13
>>> T 0
>>> 0
>>> N 0
>>> 0
>>>
>>> Even though the position is the same, it is showing up twice. Each time, one
>>> of the files shows zeroes. This is not consistent with what happens if the
>>> files are identical (as in the example from the help docs).
>>>
>>> For example,
>>>
>>> r3 = applyPileups(PileupFiles(c(fl1, fl1)), function(x) x, param=testparam)
>>> #file 1 entered twice
>>> r3[[1]][["seq"]][ , , r3[[1]][["pos"]] == 135003]
>>>
>>> yields:
>>>
>>> 10696X1_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam
>>> 10696X1_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam
>>> A 0
>>> 0
>>> C 0
>>> 0
>>> G 10
>>> 10
>>> T 0
>>> 0
>>> N 0
>>> 0
>>>
>>> Is this the expected behavior? It seems like each position should only show
>>> up once in the output. Is there something I am missing?
>>>
>>> Thanks,
>>>
>>> Jonathon Hill
>>> Postdoc
>>> Yost Lab, University of Utah
>>>
>>>
>>>
>>> [[alternative HTML version deleted]]
>>>
>>>
>>>
>>> _______________________________________________
>>> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>
>>
>> --
>> Computational Biology / Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N.
>> PO Box 19024 Seattle, WA 98109
>>
>> Location: Arnold Building M1 B861
>> Phone: (206) 667-2793
>
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
More information about the Bioc-devel
mailing list