[Bioc-sig-seq] Minimal short sequences position/orientation container

Thu Sep 24 22:16:14 CEST 2009

Thanks, Sean. That answers the question.
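
For anyone finding this thread in the archive, here is a minimal sketch of
the arithmetic behind Sean's answer. It assumes only what the UCSC FAQ
states: BED chromStart is zero-based and chromEnd is exclusive. The helper
name bed_to_one_based is hypothetical (it is not part of rtracklayer or
IRanges), and the input values are the first three records of the file
quoted below.

```r
# Convert BED-style intervals (0-based, half-open) to 1-based, closed
# intervals, which is how the start/end columns are printed in R below.
# NOTE: bed_to_one_based is a hypothetical helper for illustration only.
bed_to_one_based <- function(chrom_start, chrom_end) {
  start <- chrom_start + 1L  # shift the 0-based start to 1-based
  end   <- chrom_end         # exclusive 0-based end equals inclusive 1-based end
  data.frame(start = start, end = end, width = end - start + 1L)
}

# First three records of myTags.bed as shown by `head`
bed_start <- c(3002444L, 3002989L, 3017603L)
bed_end   <- c(3002479L, 3003024L, 3017638L)

converted <- bed_to_one_based(bed_start, bed_end)
print(converted)
```

The converted starts match what import() reports, and each width comes out
as chromEnd - chromStart = 35, so the records in the file describe 35-base
intervals once the half-open convention is taken into account.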

Ivan

Ivan Gregoretti, PhD
National Institute of Diabetes and Digestive and Kidney Diseases
National Institutes of Health
5 Memorial Dr, Building 5, Room 205.
Bethesda, MD 20892. USA.
Phone: 1-301-496-1592
Fax: 1-301-496-9878

On Thu, Sep 24, 2009 at 4:04 PM, Sean Davis <seandavi at gmail.com> wrote:
> On Thu, Sep 24, 2009 at 2:40 PM, Ivan Gregoretti <ivangreg at gmail.com> wrote:
>> Hi Patrick,
>>
>> Great. It works.
>>
>> Can you clarify if the following observation is a feature or a bug?
>>
>> When I download
>>
>> http://dl.getdropbox.com/u/2051155/myTags.bed
>>
>> and from the unix prompt I take a peek at it, I get:
>>
>> head myTags.bed
>>
>> chr1    3002444 3002479                 +
>> chr1    3002989 3003024                 -
>> chr1    3017603 3017638                 +
>> chr1    3017879 3017914                 -
>> chr1    3018173 3018208                 +
>> chr1    3018183 3018218                 -
>> chr1    3018183 3018218                 -
>> chr1    3019065 3019100                 +
>> chr1    3019761 3019796                 -
>> chr1    3020044 3020079                 -
>>
>> Fine. It shows the 36-base-long reads.
>>
>> Now I follow your suggestion loading it into R:
>>
>> suppressMessages(library(rtracklayer))
>>
>> myTags <- import('myTags.bed')
>>
>> ranges(myTags["chr1"])[[1]]
>> IRanges instance:
>>             start       end width
>> [1]        3002445   3002479    35
>> [2]        3002990   3003024    35
>> [3]        3017604   3017638    35
>> [4]        3017880   3017914    35
>> [5]        3018174   3018208    35
>> [6]        3018184   3018218    35
>> [7]        3018184   3018218    35
>> [8]        3019066   3019100    35
>> [9]        3019762   3019796    35
>> ...            ...       ...   ...
>> [322808] 197166880 197166914    35
>> [322809] 197167672 197167706    35
>> [322810] 197167851 197167885    35
>> [322811] 197185820 197185854    35
>> [322812] 197185850 197185884    35
>> [322813] 197188518 197188552    35
>> [322814] 197189251 197189285    35
>> [322815] 197189593 197189627    35
>> [322816] 197191697 197191731    35
>>
>> So, all start positions are shown as starting one nucleotide upstream
>> from the original record and the features are reported as being 35
>> bases long instead of 36.
>>
>> Is it a feature or a bug?
>
> Hi, Ivan.  I think BED format uses zero-based, half-open coordinates.
>
> http://genome.ucsc.edu/FAQ/FAQformat#format1
>
> Sean
>
>
>>
>> On Thu, Sep 24, 2009 at 2:51 AM, Patrick Aboyoun <paboyoun at fhcrc.org> wrote:
>>> Ivan,
>>> The RangedData class can store strand information in its values table. The
>>> values table can store any "vector-like" object from simple R vectors
>>> (including lists) to an instance of any of the *List classes defined in
>>> IRanges. If you use rtracklayer's import function on a bed file containing
>>> the information you have shown, the chromosome information will be used to
>>> segment the other values into spaces, the start and end values will be
>>> joined together in the ranges information (as a CompressedIRangesList
>>> object) and the strand information will be stored as a factor column across
>>> the values set (which is a CompressedDataFrameList object). The strand
>>> information can be accessed by the strand accessor function. If your data
>>> are sorted by strand within chromosome, you could add another level of
>>> compression by storing the strand information as a 'factor' Rle in the
>>> values table instead of a plain factor. rtracklayer's export function is
>>> aware of a possible strand column in the values table and handles it
>>> appropriately when serializing a RangedData object back into a bed file.
>>>
>>>
>>> Patrick
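
A rough illustration of why run-length encoding helps when strand is sorted
within chromosome, as Patrick suggests above. This sketch uses base R's
rle() as a stand-in; it is not the IRanges Rle class, and the toy strand
vector is made up for the example.

```r
# Toy strand column, sorted so that identical values form long runs
strand <- factor(c(rep("+", 5), rep("-", 4)), levels = c("+", "-"))

# Base R's rle() needs an atomic vector, so encode the character form
runs <- rle(as.character(strand))
# runs$values  holds one entry per run
# runs$lengths holds how many times each value repeats

# Decoding restores the original factor
restored <- factor(inverse.rle(runs), levels = levels(strand))
```

Nine strand values collapse to two runs here; on sorted data with millions
of reads per chromosome the saving grows accordingly, while unsorted data
with alternating strands would gain little.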
>>>
>>>
>>> Ivan Gregoretti wrote:
>>>>
>>>> Hi everybody,
>>>>
>>>> What is the minimal container class for position-and-orientation of
>>>> Solexa reads?
>>>>
>>>>
>>>> For example, the minimal positional information should be something
>>>> like a BED record, like this
>>>>
>>>> chr1\t3000001\t3000036\t\t\t+\t
>>>> ...(and many more lines)...
>>>>
>>>> Sorry for the cumbersome string, but I just want to stress that the
>>>> minimal information is:
>>>>
>>>> column 1: chromosome
>>>> column 2: start
>>>> column 3: end
>>>> column 6: orientation, either 'plus', 'minus' or undefined. (in this case
>>>> a '+')
>>>>
>>>> Is there any compact container to load, say, 50 million records? I
>>>> thought that RangedData could do that but after reading the
>>>> documentation I see that it does not hold strand information.
>>>>
>>>> If there is such a container, how do you load it up from a BED file?
>>>>
>>>> Thank you,
>>>>
>>>> Ivan
>>>>
>>>>
>>>> _______________________________________________
>>>> Bioc-sig-sequencing mailing list
>>>> Bioc-sig-sequencing at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>>
>>>
>>>
>>
>>
>