[BioC] GenomicAlignments and QNAME collision
James W. MacDonald
jmacdon at uw.edu
Thu May 8 19:05:11 CEST 2014
Hi Valerie,
You get something similar from the .sra files that you can download
from the SRA, if they contain paired data. If you use the SRA toolkit to
convert to fastq (fastq-dump), it will write two fastq files, and
the QNAMEs in them will have a .1 appended for the first mates and a
.2 for the second mates.
As an example:
zcat SRR833731_1.fastq.gz | head -n 1
@SRR833731.1.1 HWI-ST423:250:D0JRLACXX:8:1101:1473:1978 length=101
zcat SRR833731_2.fastq.gz | head -n 1
@SRR833731.1.2 HWI-ST423:250:D0JRLACXX:8:1101:1473:1978 length=101
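Until the pairing code handles this, suffixes like these could be stripped
with a regex substitution, along the lines Stefano describes. What follows is
only a rough sketch: strip_mate_suffix and MATE_SUFFIX are hypothetical names,
not part of GenomicAlignments or the SRA toolkit, and the default pattern
assumes every QNAME really does end in /1, /2, .1 or .2.

```python
import re

# Hypothetical helper for illustration only.
# Assumed default: a trailing "/1", "/2", ".1" or ".2" marks the mate.
# Only use it on files known to carry such a suffix; a name like
# "SRR833731.1" with no mate suffix would be truncated as well.
MATE_SUFFIX = re.compile(r"[/.][12]$")


def strip_mate_suffix(qname, pattern=MATE_SUFFIX):
    """Return qname with the first match of pattern removed."""
    return pattern.sub("", qname, count=1)


if __name__ == "__main__":
    for name in ("SRR833731.1.1",
                 "SRR833731.1.2",
                 "UNC15-SN850:240:D148CACXX:3:1308:19719:99367/1",
                 "UNC15-SN850:240:D148CACXX:3:1308:19719:99367/2"):
        print(strip_mate_suffix(name))
```

A different pattern can be passed in for other layouts, e.g.
re.compile(r"^SRR\d+\.") for the leading-string case mentioned below.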
Best,
Jim
On Thursday, May 08, 2014 12:03:29 PM, Stefano Calza wrote:
> Thanks Valerie
>
> I got these BAM files from different sources, but they cannot be
> distributed.
>
> Up to now I have seen two different 'patterns' in QNAME. One is the
> trailing value as we said (/1, /2). Another one is a leading string,
> e.g. (made-up QNAMEs):
>
> SRR1122.12345HTR
> SRR1123.12345HTR
>
> (So SRR1122 and SRR1123 must be removed.)
>
> My little program actually uses a regex substitution, so the user can
> decide what pattern to edit. This second one, though, seems quite
> unusual.
>
> Those with trailing values were downloaded from TCGA (if I recall
> correctly, they use a pipeline called MapSplice).
>
>
> Regards
>
> Stefano
>
> On 05/08/2014 05:54 PM, Valerie Obenchain wrote:
>> Hi Stefano,
>>
>> No, the current mate-pairing doesn't handle the trailing values. I
>> will implement this and post back when it's done.
>>
>> For reference, where did you download your bam files or what
>> application outputs QNAMEs in this format? I'd like to have some for
>> test data.
>>
>>
>> Thanks.
>> Valerie
>>
>>
>> On 05/08/14 08:14, Stefano Calza wrote:
>>> Hi everybody
>>>
>>>
>>> I am using the GenomicAlignments package to read paired-end RNA-seq
>>> data. The problem is that readGAlignmentPairsFromBam, after setting
>>> asMates=TRUE in BamFile, returns 0 mates.
>>>
>>> The reason is that mates have different QNAMEs. Eg:
>>>
>>> UNC15-SN850:240:D148CACXX:3:1308:19719:99367/1
>>> UNC15-SN850:240:D148CACXX:3:1308:19719:99367/2
>>>
>>> That is, the two mates have /1 or /2 at the end.
>>>
>>> I wrote a Python (and a C++) program to fix it, but this still takes
>>> quite a substantial amount of time on big files.
>>>
>>> Does the mating algorithm allow for this? If so how?
>>>
>>> Regards
>>>
>>> Stefano
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099