[Bioc-devel] Rsamtools applyPileups function not merging positions from multiple files if not identical

Tue Apr 29 18:54:57 CEST 2014

On 04/29/2014 09:19 AM, Jonathon Hill wrote:
> Hi Martin,
>
> Thanks for looking into this. This is a problem that we have run into before.
> The zebrafish genome has a number of small contigs (> 1000 of them).
> Unfortunately, the novoalign aligner only includes @SQ lines if reads actually
> map to the contig. Since they are so small, each file ends up having its own set
> of small contigs included due to sporadic low-coverage alignments to the
> contigs, even when aligned using the same parameters and the same exact
> reference genome. That is what happened here. I am working on a small function
> to gather all of the contigs from each file and merge them to create a larger
> list that can be used to create a common header. However, it is taking a while
> because it looks like I have to covert the files to sam, replace the header and
> then convert back to bam, sort and index to make sure the factor levels remain
> consistent. I would love any thoughts you have on how to do this. I am not a C
> programmer, so I am resorting to system calls to samtools in my R code. It seems
> like something that would have a fairly broad appeal, as the genomes of many
> non-model organisms have numerous small contigs.

The right place to fix this is with novoalign, I guess.

For creating a common header, if you have the fasta files used in the alignment 
then perhaps

   seqinfo = seqinfo(FaFile("foo.fa"))

(though perhaps the file needs to be razip()'d and / or indexed indexFa() 
first). Alternatively (less preferred, I think) something like

    fls = dir(pattern="bam$")   ## vector of all relevant files
    bfl = BamFileList(fls)
    seqinfo = Reduce(merge, lapply(bfl, seqinfo))

with seqinfo now containing the the relevant ranges, etc.

You could view the current header of one of your files with

   samtools view -H ex1.bam > ex1.header

which looks like

@SQ	SN:seq1	LN:1575
@SQ	SN:seq2	LN:1584

Arrange to synthesize the correct @SQ tags with

     sprintf("@SQ\tSN:%s\tLN:%d", seqnmanes(seqinof), seqlengths(seqinfo))

Edit these into ex1.header, and then use the samtools reheader command, along 
the lines of

     samtools reheader ex1.header ex1.bam > ex1.rehead.bam

Martin

>
> Thanks again,
>
> Jonathon
>
> On Apr 26, 2014, at 5:28 AM, Martin Morgan <mtmorgan at fhcrc.org
> <mailto:mtmorgan at fhcrc.org>> wrote:
>
>> On 04/23/2014 07:42 AM, Jonathon Hill wrote:
>>> Thanks. I look forward to hearing from you.
>>
>> Hi Jonathon --
>>
>> It turns out that your BAM files have different seqlevels
>>
>> > fls <- PileupFiles(dir(pattern = "_sorted.bam$", full=TRUE))
>> > lvls = lapply(fls, seqlevels)
>> > identical(lvls[[1]], lvls[[2]])
>> [1] FALSE
>>
>> i.e., the BAM files have different reference sequences. Because of this,
>> samtools thinks of 'chr20' in one file as different from 'chr20' in another,
>> much as R might confuse values of factors with different level sets.
>>
>> I updated Rsamtools to check that the seqlevels are identical
>>
>> > applyPileups(fls, function(...) {})
>> Error in applyPileups(files, FUN, ..., param = plpParam(files)) :
>>  applyPileups 'seqlevels' must be identical(); failed when comparing
>>    '10696X1chr20testregion_sorted.bam' with
>>    '10696X4chr20testregion_sorted.bam'
>>
>> so at least the problem is more apparent. I don't think you can correct your
>> files using Rsamtools, maybe picard or worst-case (maybe it's appropriate
>> anyway) re-aligning all samples to a common set of sequences?
>>
>> Hope that sheds some light, and thanks for the report,
>>
>> Martin
>>
>>>
>>> On Apr 23, 2014, at 6:10 AM, Martin Morgan <mtmorgan at fhcrc.org
>>> <mailto:mtmorgan at fhcrc.org>
>>> <mailto:mtmorgan at fhcrc.org>> wrote:
>>>
>>>> Thanks for the file snippets; I'm able to reproduce this bug and will let you
>>>> know of its resolution. Martin
>>>>
>>>> On 04/22/2014 07:44 AM, Jonathon Hill wrote:
>>>>> Hi Martin,
>>>>>
>>>>> Thank you for your response. I checked the header and it says that it is
>>>>> coordinate sorted, so that shouldn’t be the problem. Here are the results
>>>>> of the
>>>>> code you provided:
>>>>>
>>>>> > r3 = applyPileups(PileupFiles(c(fl1, fl2)), function(x) x, param=testparam)
>>>>> > any(duplicated(r3[[1]][["pos"]]))
>>>>> [1] TRUE
>>>>> > pos = r3[[1]][["pos"]]
>>>>> > table(table(pos))
>>>>>
>>>>>   1    2
>>>>> 4834 3115
>>>>> > udpos = unique(pos[duplicated(pos)])
>>>>> > head(pos[match(pos, udpos)], 20)
>>>>> [1] 135003 135006 135007 135008 135009 135010 135011 135012 135013 135014
>>>>> 135015 135016 135017 135018 135019 135020 135021 135022 135023
>>>>> [20] 135024
>>>>> > head(match(pos, udpos), 20)
>>>>> [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
>>>>> > table(pos)[1:50]
>>>>> pos
>>>>> 134762 134763 134764 134765 134766 134767 134768 134769 134770 134771 134772
>>>>> 134773 134774 134775 134776 134777 134778 134779 134780
>>>>>     1      1      1      1      1      1      1      1      1      1      1
>>>>>   1      1      1      1      1      1      1      1
>>>>> 134781 134782 134783 134784 134785 134786 134787 134788 134991 134992 134993
>>>>> 134994 134995 134999 135001 135002 135003 135004 135005
>>>>>     1      1      1      1      1      1      1      1      1      1      1
>>>>>   1      1      1      1      1      2      1      1
>>>>> 135006 135007 135008 135009 135010 135011 135012 135013 135014 135015 135016
>>>>> 135017
>>>>>     2      2      2      2      2      2      2      2      2      2      2
>>>>>   2
>>>>>
>>>>> I think the last one shows it well. There are duplicates throughout the file,
>>>>> anywhere that there are reads in both files. I have attached the bam files you
>>>>> requested showing the region used here.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Jonathon
>>>>>
>>>>> On Apr 21, 2014, at 7:17 PM, Martin Morgan <mtmorgan at fhcrc.org
>>>>> <mailto:mtmorgan at fhcrc.org>
>>>>> <mailto:mtmorgan at fhcrc.org>
>>>>> <mailto:mtmorgan at fhcrc.org>> wrote:
>>>>>
>>>>>> On 04/21/2014 02:33 PM, Jonathon Hill wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I have been trying to use Rsamtools’ applyPileups function to compare two
>>>>>>> files position-by-position. In order to test it out, I simply ran:
>>>>>>> minBaseQuality <- 20
>>>>>>> minMapQuality <- 30
>>>>>>> minDepth <- 10
>>>>>>> maxDepth <- 1000
>>>>>>> testparam <- PileupParam(what="seq",
>>>>>>>                        which=GRanges(“chr20", IRanges(1, 1000000)),
>>>>>>>                        minBaseQuality=minBaseQuality,
>>>>>>>                        minMapQuality=minMapQuality,
>>>>>>>                        minDepth=minDepth,
>>>>>>>                        maxDepth=maxDepth,
>>>>>>> )
>>>>>>> fl1 <- "10696X1_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam"
>>>>>>> fl2 <- "10696X4_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam"
>>>>>>> r3 = applyPileups(PileupFiles(c(fl1, fl2)), function(x) x, param=testparam)
>>>>>>>
>>>>>>> My understanding is that this should result in a three-dimensional array with
>>>>>>> “ACTGN Counts” in the first dimension, files in the second and position in
>>>>>>> the third. Positions with overlapping reads in both files should thus be
>>>>>>> collapsed into a single line in the third dimension. However, selecting one
>>>>>>> of these positions shows that they are duplicated:
>>>>>>>
>>>>>>> r3[[1]][["seq"]][ , , r3[[1]][["pos"]] == 135003]
>>>>>>
>>>>>> I think your understanding is basically correct.
>>>>>>
>>>>>> The function is assuming that the BAM files are sorted by position (with,
>>>>>> e.g., sortBam, but the files don't have to be sorted by Rsamtools).
>>>>>>
>>>>>> Executing a similar command gives me
>>>>>>
>>>>>> > str(r3[[1]])
>>>>>> List of 3
>>>>>> $ seqnames: Named int 211195
>>>>>> ..- attr(*, "names")= chr "chr20"
>>>>>> $ pos     : int [1:211195] 60026 60027 60028 60029 60030 60031 60032 60033
>>>>>> 60034 60035 ...
>>>>>> $ seq     : int [1:5, 1:2, 1:211195] 0 0 0 0 0 0 0 0 0 0 ...
>>>>>> ..- attr(*, "dimnames")=List of 3
>>>>>> .. ..$ : chr [1:5] "A" "C" "G" "T" ...
>>>>>> .. ..$ : chr [1:2] "normal_srx113635_sorted.bam" "tumor_srx036691_sorted.bam"
>>>>>> .. ..$ : NULL
>>>>>>
>>>>>> Do you get something similar, especially the identical seqnames, pos
>>>>>> dimension, and third dimension of seq? 'pos' should apparently be unique; so
>>>>>>
>>>>>> > any(duplicated(r3[[1]][["pos"]]))
>>>>>> [1] FALSE
>>>>>>
>>>>>> If there are duplicates, I wonder how many there are and where they occur
>>>>>>
>>>>>> pos = r3[[1]][["pos"]]
>>>>>> table(table(pos))
>>>>>> udpos = unique(pos[duplicated(pos)])
>>>>>> head(pos[match(pos, udpos)], 20)
>>>>>> head(match(pos, udpos), 20)
>>>>>>
>>>>>> If nothing is suggested by the above, can you make a subset of the BAM files
>>>>>> available to me, e.g., the result of
>>>>>>
>>>>>> param = ScanBamParam(which=GRanges("chr20", IRanges(1, 1000000)))
>>>>>> filterBam(fls[1], tempfile(), param=param)
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Martin
>>>>>>
>>>>>>> yields:
>>>>>>>
>>>>>>> , , 1
>>>>>>>
>>>>>>> 10696X1_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam
>>>>>>> 10696X4_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam
>>>>>>> A                                                   0
>>>>>>>                                                 0
>>>>>>> C                                                   0
>>>>>>>                                                 0
>>>>>>> G                                                  10
>>>>>>>                                                 0
>>>>>>> T                                                   0
>>>>>>>                                                 0
>>>>>>> N                                                   0
>>>>>>>                                                 0
>>>>>>>
>>>>>>> , , 2
>>>>>>>
>>>>>>> 10696X1_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam
>>>>>>> 10696X4_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam
>>>>>>> A                                                   0
>>>>>>>                                                 0
>>>>>>> C                                                   0
>>>>>>>                                                 0
>>>>>>> G                                                   0
>>>>>>>                                                13
>>>>>>> T                                                   0
>>>>>>>                                                 0
>>>>>>> N                                                   0
>>>>>>>                                                 0
>>>>>>>
>>>>>>> Even though the position is the same, it is showing up twice. Each time, one
>>>>>>> of the files shows zeroes. This is not consistent with what happens if the
>>>>>>> files are identical (as in the example from the help docs).
>>>>>>>
>>>>>>> For example,
>>>>>>>
>>>>>>> r3 = applyPileups(PileupFiles(c(fl1, fl1)), function(x) x, param=testparam)
>>>>>>> #file 1 entered twice
>>>>>>> r3[[1]][["seq"]][ , , r3[[1]][["pos"]] == 135003]
>>>>>>>
>>>>>>> yields:
>>>>>>>
>>>>>>> 10696X1_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam
>>>>>>> 10696X1_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam
>>>>>>> A                                                   0
>>>>>>>                                                 0
>>>>>>> C                                                   0
>>>>>>>                                                 0
>>>>>>> G                                                  10
>>>>>>>                                                10
>>>>>>> T                                                   0
>>>>>>>                                                 0
>>>>>>> N                                                   0
>>>>>>>                                                 0
>>>>>>>
>>>>>>> Is this the expected behavior? It seems like each position should only show
>>>>>>> up once in the output. Is there something I am missing?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Jonathon Hill
>>>>>>> Postdoc
>>>>>>> Yost Lab, University of Utah
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> [[alternative HTML version deleted]]
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>>>>>>> <mailto:Bioc-devel at r-project.org><mailto:Bioc-devel at r-project.org>mailing
>>>>>>> list
>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>>>>> 1100 Fairview Ave. N.
>>>>>> PO Box 19024 Seattle, WA 98109
>>>>>>
>>>>>> Location: Arnold Building M1 B861
>>>>>> Phone: (206) 667-2793
>>>>>
>>>>
>>>>
>>>> --
>>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>>> 1100 Fairview Ave. N.
>>>> PO Box 19024 Seattle, WA 98109
>>>>
>>>> Location: Arnold Building M1 B861
>>>> Phone: (206) 667-2793
>>>
>>
>>
>> --
>> Computational Biology / Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N.
>> PO Box 19024 Seattle, WA 98109
>>
>> Location: Arnold Building M1 B861
>> Phone: (206) 667-2793
>

-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793