[Bioc-devel] Rsamtools applyPileups function not merging positions from multiple files if not identical

Martin Morgan mtmorgan at fhcrc.org
Wed Apr 23 14:10:08 CEST 2014


Thanks for the file snippets; I'm able to reproduce this bug and will let you 
know of its resolution. Martin

On 04/22/2014 07:44 AM, Jonathon Hill wrote:
> Hi Martin,
>
> Thank you for your response. I checked the header and it says that it is
> coordinate sorted, so that shouldn’t be the problem. Here are the results of the
> code you provided:
>
>  > r3 = applyPileups(PileupFiles(c(fl1, fl2)), function(x) x, param=testparam)
>  > any(duplicated(r3[[1]][["pos"]]))
> [1] TRUE
>  > pos = r3[[1]][["pos"]]
>  > table(table(pos))
>
>     1    2
> 4834 3115
>  > udpos = unique(pos[duplicated(pos)])
>  > head(pos[match(pos, udpos)], 20)
>   [1] 135003 135006 135007 135008 135009 135010 135011 135012 135013 135014
> 135015 135016 135017 135018 135019 135020 135021 135022 135023
> [20] 135024
>  > head(match(pos, udpos), 20)
>   [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
>  > table(pos)[1:50]
> pos
> 134762 134763 134764 134765 134766 134767 134768 134769 134770 134771 134772
> 134773 134774 134775 134776 134777 134778 134779 134780
>       1      1      1      1      1      1      1      1      1      1      1
>     1      1      1      1      1      1      1      1
> 134781 134782 134783 134784 134785 134786 134787 134788 134991 134992 134993
> 134994 134995 134999 135001 135002 135003 135004 135005
>       1      1      1      1      1      1      1      1      1      1      1
>     1      1      1      1      1      2      1      1
> 135006 135007 135008 135009 135010 135011 135012 135013 135014 135015 135016 135017
>       2      2      2      2      2      2      2      2      2      2      2
>     2
>
> I think the last one shows it well. There are duplicates throughout the file,
> anywhere that there are reads in both files. I have attached the bam files you
> requested showing the region used here.
>
> Thanks,
>
> Jonathon
>
> On Apr 21, 2014, at 7:17 PM, Martin Morgan <mtmorgan at fhcrc.org
> <mailto:mtmorgan at fhcrc.org>> wrote:
>
>> On 04/21/2014 02:33 PM, Jonathon Hill wrote:
>>> Hi,
>>>
>>> I have been trying to use Rsamtools’ applyPileups function to compare two
>>> files position-by-position. In order to test it out, I simply ran:
>>> minBaseQuality <- 20
>>> minMapQuality <- 30
>>> minDepth <- 10
>>> maxDepth <- 1000
>>> testparam <- PileupParam(what="seq",
>>>                          which=GRanges(“chr20", IRanges(1, 1000000)),
>>>                          minBaseQuality=minBaseQuality,
>>>                          minMapQuality=minMapQuality,
>>>                          minDepth=minDepth,
>>>                          maxDepth=maxDepth,
>>> )
>>> fl1 <- "10696X1_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam"
>>> fl2 <- "10696X4_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam"
>>> r3 = applyPileups(PileupFiles(c(fl1, fl2)), function(x) x, param=testparam)
>>>
>>> My understanding is that this should result in a three-dimensional array with
>>> “ACTGN Counts” in the first dimension, files in the second and position in
>>> the third. Positions with overlapping reads in both files should thus be
>>> collapsed into a single line in the third dimension. However, selecting one
>>> of these positions shows that they are duplicated:
>>>
>>> r3[[1]][["seq"]][ , , r3[[1]][["pos"]] == 135003]
>>
>> I think your understanding is basically correct.
>>
>> The function is assuming that the BAM files are sorted by position (with,
>> e.g., sortBam, but the files don't have to be sorted by Rsamtools).
>>
>> Executing a similar command gives me
>>
>> > str(r3[[1]])
>> List of 3
>> $ seqnames: Named int 211195
>>  ..- attr(*, "names")= chr "chr20"
>> $ pos     : int [1:211195] 60026 60027 60028 60029 60030 60031 60032 60033
>> 60034 60035 ...
>> $ seq     : int [1:5, 1:2, 1:211195] 0 0 0 0 0 0 0 0 0 0 ...
>>  ..- attr(*, "dimnames")=List of 3
>>  .. ..$ : chr [1:5] "A" "C" "G" "T" ...
>>  .. ..$ : chr [1:2] "normal_srx113635_sorted.bam" "tumor_srx036691_sorted.bam"
>>  .. ..$ : NULL
>>
>> Do you get something similar, especially the identical seqnames, pos
>> dimension, and third dimension of seq? 'pos' should apparently be unique; so
>>
>> > any(duplicated(r3[[1]][["pos"]]))
>> [1] FALSE
>>
>> If there are duplicates, I wonder how many there are and where they occur
>>
>>  pos = r3[[1]][["pos"]]
>>  table(table(pos))
>>  udpos = unique(pos[duplicated(pos)])
>>  head(pos[match(pos, udpos)], 20)
>>  head(match(pos, udpos), 20)
>>
>> If nothing is suggested by the above, can you make a subset of the BAM files
>> available to me, e.g., the result of
>>
>>  param = ScanBamParam(which=GRanges("chr20", IRanges(1, 1000000)))
>>  filterBam(fls[1], tempfile(), param=param)
>>
>> Thanks,
>>
>> Martin
>>
>>> yields:
>>>
>>> , , 1
>>>
>>>   10696X1_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam
>>> 10696X4_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam
>>> A                                                   0
>>>                                                   0
>>> C                                                   0
>>>                                                   0
>>> G                                                  10
>>>                                                   0
>>> T                                                   0
>>>                                                   0
>>> N                                                   0
>>>                                                   0
>>>
>>> , , 2
>>>
>>>   10696X1_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam
>>> 10696X4_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam
>>> A                                                   0
>>>                                                   0
>>> C                                                   0
>>>                                                   0
>>> G                                                   0
>>>                                                  13
>>> T                                                   0
>>>                                                   0
>>> N                                                   0
>>>                                                   0
>>>
>>> Even though the position is the same, it is showing up twice. Each time, one
>>> of the files shows zeroes. This is not consistent with what happens if the
>>> files are identical (as in the example from the help docs).
>>>
>>> For example,
>>>
>>> r3 = applyPileups(PileupFiles(c(fl1, fl1)), function(x) x, param=testparam)
>>> #file 1 entered twice
>>>  r3[[1]][["seq"]][ , , r3[[1]][["pos"]] == 135003]
>>>
>>> yields:
>>>
>>>   10696X1_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam
>>> 10696X1_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam
>>> A                                                   0
>>>                                                   0
>>> C                                                   0
>>>                                                   0
>>> G                                                  10
>>>                                                  10
>>> T                                                   0
>>>                                                   0
>>> N                                                   0
>>>                                                   0
>>>
>>> Is this the expected behavior? It seems like each position should only show
>>> up once in the output. Is there something I am missing?
>>>
>>> Thanks,
>>>
>>> Jonathon Hill
>>> Postdoc
>>> Yost Lab, University of Utah
>>>
>>>
>>>
>>> [[alternative HTML version deleted]]
>>>
>>>
>>>
>>> _______________________________________________
>>> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>
>>
>> --
>> Computational Biology / Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N.
>> PO Box 19024 Seattle, WA 98109
>>
>> Location: Arnold Building M1 B861
>> Phone: (206) 667-2793
>


-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793



More information about the Bioc-devel mailing list