[BioC] GenomicAlignments and QNAME collision

James W. MacDonald jmacdon at uw.edu
Thu May 8 19:05:11 CEST 2014


Hi Valerie,

You get something similar from the .sra files that you can download 
from the SRA, if they contain paired data. If you use the SRA toolkit 
to convert them to fastq (fastq-dump), it will write two fastq files, 
and the QNAMEs in each file get a suffix: .1 for the first mate of 
each pair and .2 for the second.

As an example:

zcat SRR833731_1.fastq.gz | head -n 1
@SRR833731.1.1 HWI-ST423:250:D0JRLACXX:8:1101:1473:1978 length=101
zcat SRR833731_2.fastq.gz | head -n 1
@SRR833731.1.2 HWI-ST423:250:D0JRLACXX:8:1101:1473:1978 length=101
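
Suffixes like these, as well as the /1, /2 and leading-string patterns 
quoted below, can in principle be stripped with a regex substitution of 
the kind Stefano describes. The following is only a minimal Python 
sketch of that idea, not Stefano's program and not part of 
GenomicAlignments. The function name normalize_qname and the three 
patterns are assumptions made for illustration; in particular, treating 
the dot after SRR1122 as part of the leading string is a guess.

```python
import re

# Hypothetical helper for illustration only; not from any library.
def normalize_qname(qname, pattern=r"/[12]$", repl=""):
    """Replace the first match of `pattern` in `qname` with `repl`.

    The default pattern strips a trailing /1 or /2 mate suffix.
    """
    return re.sub(pattern, repl, qname, count=1)

# Trailing /1 and /2 (the TCGA-style names from the thread).
slash_mates = [
    "UNC15-SN850:240:D148CACXX:3:1308:19719:99367/1",
    "UNC15-SN850:240:D148CACXX:3:1308:19719:99367/2",
]
slash_names = {normalize_qname(q) for q in slash_mates}

# Trailing .1 and .2 (the fastq-dump names shown above).
sra_mates = ["SRR833731.1.1", "SRR833731.1.2"]
dot_names = {normalize_qname(q, pattern=r"\.[12]$") for q in sra_mates}

# Leading run accession (the made-up names from the thread); the
# pattern, including the trailing dot, is an assumption.
prefixed = ["SRR1122.12345HTR", "SRR1123.12345HTR"]
bare_names = {normalize_qname(q, pattern=r"^SRR\d+\.") for q in prefixed}

# Each set collapses to a single shared name, so the mates now agree.
print(slash_names, dot_names, bare_names)
```

The pattern is a parameter because the right one depends on the data. 
The dot pattern, for instance, is only safe when every QNAME is known 
to carry the suffix; otherwise it would also trim an unsuffixed name 
such as SRR833731.2.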


Best,

Jim



On Thursday, May 08, 2014 12:03:29 PM, Stefano Calza wrote:
> Thanks Valerie
>
> I got these BAM files from different sources, but they cannot be
> distributed.
>
> Up to now I have encountered two different 'patterns' in QNAME. One is
> the trailing value as we said (/1, /2). Another one is a leading string.
> E.g. (made-up QNAMEs)
>
> SRR1122.12345HTR
> SRR1123.12345HTR
>
> (So SRR1122 and SRR1123 must be removed.)
>
> My little program actually uses a regex substitution, so the user can
> decide what pattern to edit. This second pattern seems quite unusual,
> though.
>
> Those with trailing values were downloaded from TCGA (if I recall
> correctly, they use a pipeline called MapSplice).
>
>
> Regards
>
> Stefano
>
> On 05/08/2014 05:54 PM, Valerie Obenchain wrote:
>> Hi Stefano,
>>
>> No, the current mate-pairing doesn't handle the trailing values. I
>> will implement this and post back when it's done.
>>
>> For reference, where did you download your bam files or what
>> application outputs QNAMEs in this format? I'd like to have some for
>> test data.
>>
>>
>> Thanks.
>> Valerie
>>
>>
>> On 05/08/14 08:14, Stefano Calza wrote:
>>> Hi everybody
>>>
>>>
>>> I am using the GenomicAlignments package to read paired-end RNA-seq
>>> data. The problem is that readGAlignmentPairsFromBam, after setting
>>> asMates=TRUE in BamFile, returns 0 mates.
>>>
>>> The reason is that mates have different QNAMEs. E.g.:
>>>
>>> UNC15-SN850:240:D148CACXX:3:1308:19719:99367/1
>>> UNC15-SN850:240:D148CACXX:3:1308:19719:99367/2
>>>
>>> That is, the two mates have /1 or /2 at the end.
>>>
>>> I wrote a Python (and a C++) program to fix it, but this still takes
>>> a substantial amount of time on big files.
>>>
>>> Does the mating algorithm allow for this? If so, how?
>>>
>>> Regards
>>>
>>> Stefano
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>

--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099


