[BioC] GenomicAlignments and QNAME collision

Wed Jun 4 09:11:30 CEST 2014

Dear Valerie

it might be the string I used: qnameSuffixStart="\\" (while it should 
have been "/")

Being the escape character I guess it has been messing up the code.

Stefano

On 06/03/2014 05:47 PM, Valerie Obenchain wrote:
> I can't reproduce this problem.
>
> When a prefix/suffix is not found in the qname no error is thrown - it 
> just doesn't do any trimming. I've clarified this in the documentation 
> in 1.17.22.
>
> With Rsamtools 1.17.21:
>
> fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
>
> ## original qnames:
> bfm <- BamFile(fl, asMates=TRUE)
> scanBam(bfm, param=ScanBamParam(what="qname"))[[1]]$qname[1:2]
>> [1] "EAS54_61:4:143:69:578" "EAS54_61:4:143:69:578"
>
> ## bogus prefix:
> bfm <- BamFile(fl, asMates=TRUE, qnamePrefixEnd="*")
> scanBam(bfm, param=ScanBamParam(what="qname"))[[1]]$qname[1:2]
>> [1] "EAS54_61:4:143:69:578" "EAS54_61:4:143:69:578"
>
> ## bogus suffix:
> bfm <- BamFile(fl, asMates=TRUE, qnameSuffixStart="*")
> scanBam(bfm, param=ScanBamParam(what="qname"))[[1]]$qname[1:2]
>> [1] "EAS54_61:4:143:69:578" "EAS54_61:4:143:69:578"
>
> Can you provide a reproducible example? Maybe you were calling 
> scanBam() differently?
>
> Valerie
>
> On 06/03/2014 03:43 AM, Stefano Calza wrote:
>> Valerie,
>>
>> just a little note. In case the character in qnameSuffixEnd does not
>> exist in QNAME (i.e. you specify the wrong character...) scanBam (I
>> tested that) hangs forever.
>>
>> I guess there should be a test on the very first read: if the character
>> is not in the string then report error...
>>
>>
>> Stefano
>>
>> On 06/02/2014 05:34 PM, Valerie Obenchain wrote:
>>> You're welcome. Let us know if you run into other problems.
>>>
>>> Valerie
>>>
>>>
>>> On 06/01/2014 10:01 AM, Stefano Calza wrote:
>>> > Great!
>>> >
>>> > I would like to thank Valerie and all Bioconductor developers.
>>> >
>>> > Stefano
>>> >
>>> > Il 31/05/2014 21:39, Valerie Obenchain ha scritto:
>>>> Trimming the qname is implemented in Rsamtools 1.17.18 and should be
>>>> available via biocLite() ~ noon tomorrow.
>>>>
>>>> 'qnamePrefixStart' and 'qnameSuffixEnd' fields have been added to the
>>>> BamFile class. Currently the trimming is implemented for mate pairing
>>>> only, not just reading records from a bam.
>>>>
>>>> Unique qnames aren't a problem in this sample file - just using it to
>>>> demonstrate the output.
>>>>
>>>> fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
>>>> param <- ScanBamParam(what="qname")
>>>>
>>>> ## no trimming
>>>> bf <- BamFile(fl, asMates=TRUE)
>>>>>> scanBam(bf, param=param)[[1]]$qname[1:3]
>>>>> [1] "EAS54_61:4:143:69:578" "EAS54_61:4:143:69:578"
>>>>> [3] "EAS219_FC30151:7:51:1429:1043"
>>>>
>>>> ## trim prefix
>>>> qnamePrefixEnd(bf) <- "_"
>>>>>> scanBam(bf, param=param)[[1]]$qname[1:3]
>>>>> [1] "61:4:143:69:578"        "61:4:143:69:578"
>>>>> "FC30151:7:51:1429:1043"
>>>>
>>>> ## trim prefix and suffix
>>>> qnameSuffixStart(bf) <- ":"
>>>>>> scanBam(bf, param=param)[[1]]$qname[1:3]
>>>>> [1] "61:4:143:69"       "61:4:143:69" "FC30151:7:51:1429"
>>>>
>>>>
>>>> Valerie
>>>>
>>>>
>>>> On 05/08/14 12:17, Stefano Calza wrote:
>>>>> Yes, I say that would be easier to use than regexp
>>>>>
>>>>> Stefano
>>>>>
>>>>> On 05/08/2014 08:59 PM, Valerie Obenchain wrote:
>>>>>> Thanks for the SRA tips.
>>>>>>
>>>>>> It's starting to look like the modifications are an additional 
>>>>>> prefix
>>>>>> or suffix separated by dot or slash (possibly underscore?). Maybe
>>>>>> simply adding an option to trim the QNAME by the pre/post term
>>>>>> separated by a given character would be sufficient. This allows
>>>>>> flexibilty but prevents unwarranted QNAME mangling.
>>>>>>
>>>>>> Valerie
>>>>>>
>>>>>>
>>>>>> On 05/08/14 10:56, Stefano Calza wrote:
>>>>>>> Right this is how I got some other example. I think it would add 
>>>>>>> the
>>>>>>> files names, as from 2 SRA files (SRR1234 & SRR1235) my reads are
>>>>>>> named
>>>>>>> STARTING with SRR1234 or SRR1235 for the two mates, followed by
>>>>>>> actual
>>>>>>> read QNAME.
>>>>>>>
>>>>>>> Stefano
>>>>>>>
>>>>>>> On 05/08/2014 07:05 PM, James W. MacDonald wrote:
>>>>>>>> Hi Valerie,
>>>>>>>>
>>>>>>>> You get something similar from the .sra files that you can 
>>>>>>>> download
>>>>>>>> from the SRA, if they are paired data. If you use the SRA 
>>>>>>>> toolkit to
>>>>>>>> convert to fastq (fastq-dump), it will spit out two fastq 
>>>>>>>> files, and
>>>>>>>> the QNAME in each of the fastq files will be appended with a .1 
>>>>>>>> for
>>>>>>>> the first pairs and a .2 for the second pairs.
>>>>>>>>
>>>>>>>> As an example:
>>>>>>>>
>>>>>>>> zcat SRR833731_1.fastq.gz | head -n 1
>>>>>>>> @SRR833731.1.1 HWI-ST423:250:D0JRLACXX:8:1101:1473:1978 length=101
>>>>>>>> zcat SRR833731_2.fastq.gz | head -n 1
>>>>>>>> @SRR833731.1.2 HWI-ST423:250:D0JRLACXX:8:1101:1473:1978 length=101
>>>>>>>>
>>>>>>>>
>>>>>>>> Best,
>>>>>>>>
>>>>>>>> Jim
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thursday, May 08, 2014 12:03:29 PM, Stefano Calza wrote:
>>>>>>>>> Thanks Valerie
>>>>>>>>>
>>>>>>>>> I have got this BAM files from different sources but they 
>>>>>>>>> cannot be
>>>>>>>>> distributed.
>>>>>>>>>
>>>>>>>>> Up to now I experienced twp different 'patterns' in QNAME. One
>>>>>>>>> is the
>>>>>>>>> trailing value as we said (/1, /2). Another one is a leading
>>>>>>>>> string.
>>>>>>>>> Eg. (made up QNAME)
>>>>>>>>>
>>>>>>>>> SRR1122.12345HTR
>>>>>>>>> SRR1123.12345HTR
>>>>>>>>>
>>>>>>>>> So there must be removed SRR1122 and SRR1123)
>>>>>>>>>
>>>>>>>>> My little program actually uses a regex substitution, so the
>>>>>>>>> user can
>>>>>>>>> decide what pattern to edit. This second one though it seems quit
>>>>>>>>> unusual.
>>>>>>>>>
>>>>>>>>> Those with  trailing values were downloaded by TCGA (if I recall
>>>>>>>>> correctly the use a pipeline called MapSplice)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>>
>>>>>>>>> Stefano
>>>>>>>>>
>>>>>>>>> On 05/08/2014 05:54 PM, Valerie Obenchain wrote:
>>>>>>>>>> Hi Stefano,
>>>>>>>>>>
>>>>>>>>>> No, the current mate-pairing doesn't handle the trailing 
>>>>>>>>>> values. I
>>>>>>>>>> will implement this and post back when it's done.
>>>>>>>>>>
>>>>>>>>>> For reference, where did you download your bam files or what
>>>>>>>>>> application outputs QNAMEs in this format? I'd like to have
>>>>>>>>>> some for
>>>>>>>>>> test data.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks.
>>>>>>>>>> Valerie
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 05/08/14 08:14, Stefano Calza wrote:
>>>>>>>>>>> Hi everybody
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I am using GenomicAlignments package to read RNAseq pair-end
>>>>>>>>>>> data. The
>>>>>>>>>>> problem is that readGAlignmentPairsFromBam, after setting
>>>>>>>>>>> asMates=TRUE
>>>>>>>>>>> in BamFile, returns 0 mates.
>>>>>>>>>>>
>>>>>>>>>>> The reason is that mates have different QNAMEs. Eg:
>>>>>>>>>>>
>>>>>>>>>>> UNC15-SN850:240:D148CACXX:3:1308:19719:99367/1
>>>>>>>>>>> UNC15-SN850:240:D148CACXX:3:1308:19719:99367/2
>>>>>>>>>>>
>>>>>>>>>>> that is the two mates have /1 or /2 at the end.
>>>>>>>>>>>
>>>>>>>>>>> I wrote a Python (and a cpp) program to fix it...but this takes
>>>>>>>>>>> still
>>>>>>>>>>> quite a substantial amount of time on big files.
>>>>>>>>>>>
>>>>>>>>>>> Does the mating algorithm allow for this? If so how?
>>>>>>>>>>>
>>>>>>>>>>> Regards
>>>>>>>>>>>
>>>>>>>>>>> Stefano
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> Bioconductor mailing list
>>>>>>>>>>> Bioconductor at r-project.org
>>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>>>>>> Search the archives:
>>>>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor 
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Bioconductor mailing list
>>>>>>>>> Bioconductor at r-project.org
>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>>>> Search the archives:
>>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>>>
>>>>>>>> -- 
>>>>>>>> James W. MacDonald, M.S.
>>>>>>>> Biostatistician
>>>>>>>> University of Washington
>>>>>>>> Environmental and Occupational Health Sciences
>>>>>>>> 4225 Roosevelt Way NE, # 100
>>>>>>>> Seattle WA 98105-6099
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Bioconductor mailing list
>>>>>>>> Bioconductor at r-project.org
>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>>> Search the archives:
>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Bioconductor mailing list
>>>>>>> Bioconductor at r-project.org
>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>> Search the archives:
>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>
>>>>>>
>>>>>
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>>
>>
>
>