[BioC] Sorting a GAlignments object by QNAME
Hervé Pagès
hpages at fhcrc.org
Mon Sep 30 21:17:09 CEST 2013
Hi Rubi,
On 09/30/2013 07:33 AM, nimrod rubinstein wrote:
> Hi Hervé and thanks for the response.
>
> I assume you meant 'gal[order(names(gal))]' rather than
> 'gal[order(seqnames(gal))]'
> in order to sort by QNAME (i.e. read ID)?
My bad. You're right: seqnames is mapped to the RNAME field, not to the
QNAME field.
>
> Using this, however, doesn't seem to sort gal by the same sorting rules
> that sortBam or samtools sort -n does. Am I missing something?
2 reasons for that:
(1) The result of 'order(names(gal))' depends on the
collating sequence of the locale in use (see ?order
and ?Comparison). So different users can get different
results when doing 'gal[order(names(gal))]'. I don't
know what collating sequence 'samtools sort -n' is using
but even users using the same collating sequence might
get different results, because of (2).
(2) We don't know whether order() uses a "stable" sorting algorithm
or not (AFAIK ?order doesn't tell). When a tie occurs (and when
sorting by QNAME you can bet many ties will occur), a non-stable
algo is free to break the tie in any way, whereas a stable algo
is guaranteed to order the elements involved in the tie the
same way they show up in the input.
To summarize, 'gal[order(names(gal))]' and 'samtools sort -n' would
be guaranteed to give the same result if both were using the same
collating sequence and a stable sorting algo. I guess it's not the case.
HTH,
H.
>
> Thanks,
> rubi
>
>
>
> On Mon, Sep 30, 2013 at 12:51 AM, Hervé Pagès <hpages at fhcrc.org> wrote:
>
>> Hi Rubi,
>>
>> 'gal[order(seqnames(gal))]' will sort GAlignments object 'gal' by QNAME.
>>
>> H.
>>
>>
>>
>> On 09/29/2013 12:32 PM, rubi [guest] wrote:
>>
>>>
>>> Is there a way to sort the records in a GAlignments object by the QNAME,
>>> as this object is created with the readGAlignmentsFromBam function where
>>> the bam file and its corresponding index file must be sorted by RNAME and
>>> POS.
>>>
>>> Unless I'm missing something the only way I see how can this be done is
>>> read the bam into a data.table and sort that.
>>>
>>> -- output of sessionInfo():
>>>
>>> R version 3.0.2 (2013-09-25)
>>> Platform: x86_64-w64-mingw32/x64 (64-bit)
>>>
>>> locale:
>>> [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United
>>> States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
>>> [5] LC_TIME=English_United States.1252
>>>
>>> attached base packages:
>>> [1] parallel stats graphics grDevices utils datasets methods
>>> base
>>>
>>> other attached packages:
>>> [1] doParallel_1.0.3 iterators_1.0.6 foreach_1.4.1
>>> data.table_1.8.10 Rsamtools_1.13.44 Biostrings_2.29.19
>>> GenomicRanges_1.13.45 XVector_0.1.4
>>> [9] IRanges_1.19.38 BiocGenerics_0.7.5
>>>
>>> loaded via a namespace (and not attached):
>>> [1] bitops_1.0-6 codetools_0.2-8 stats4_3.0.2 tools_3.0.2
>>> zlibbioc_1.7.0
>>>
>>> --
>>> Sent via the guest posting facility at bioconductor.org.
>>>
>>> ______________________________**_________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
>>> Search the archives: http://news.gmane.org/gmane.**
>>> science.biology.informatics.**conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>>
>>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>>
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org
>> Phone: (206) 667-5791
>> Fax: (206) 667-1319
>>
>
> [[alternative HTML version deleted]]
>
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list