[Bioc-sig-seq] Minimal short sequences position/orientation container
Ivan Gregoretti
ivangreg at gmail.com
Thu Sep 24 22:16:14 CEST 2009
Thanks, Sean. That answers the question.
Ivan
Ivan Gregoretti, PhD
National Institute of Diabetes and Digestive and Kidney Diseases
National Institutes of Health
5 Memorial Dr, Building 5, Room 205.
Bethesda, MD 20892. USA.
Phone: 1-301-496-1592
Fax: 1-301-496-9878
On Thu, Sep 24, 2009 at 4:04 PM, Sean Davis <seandavi at gmail.com> wrote:
> On Thu, Sep 24, 2009 at 2:40 PM, Ivan Gregoretti <ivangreg at gmail.com> wrote:
>> Hi Patrick,
>>
>> Great. It works.
>>
>> Can you clarify if the following observation is a feature or a bug?
>>
>> When I download
>>
>> http://dl.getdropbox.com/u/2051155/myTags.bed
>>
>> and from the unix prompt I take a peek at it, I get:
>>
>> head myTags.bed
>>
>> chr1 3002444 3002479 +
>> chr1 3002989 3003024 -
>> chr1 3017603 3017638 +
>> chr1 3017879 3017914 -
>> chr1 3018173 3018208 +
>> chr1 3018183 3018218 -
>> chr1 3018183 3018218 -
>> chr1 3019065 3019100 +
>> chr1 3019761 3019796 -
>> chr1 3020044 3020079 -
>>
>> Fine. It appears to show reads that are 36 bases long.
>>
>> Now I follow your suggestion and load it into R:
>>
>> suppressMessages(library(rtracklayer))
>>
>> myTags <- import('myTags.bed')
>>
>> ranges(myTags["chr1"])[[1]]
>> IRanges instance:
>> start end width
>> [1] 3002445 3002479 35
>> [2] 3002990 3003024 35
>> [3] 3017604 3017638 35
>> [4] 3017880 3017914 35
>> [5] 3018174 3018208 35
>> [6] 3018184 3018218 35
>> [7] 3018184 3018218 35
>> [8] 3019066 3019100 35
>> [9] 3019762 3019796 35
>> ... ... ... ...
>> [322808] 197166880 197166914 35
>> [322809] 197167672 197167706 35
>> [322810] 197167851 197167885 35
>> [322811] 197185820 197185854 35
>> [322812] 197185850 197185884 35
>> [322813] 197188518 197188552 35
>> [322814] 197189251 197189285 35
>> [322815] 197189593 197189627 35
>> [322816] 197191697 197191731 35
>>
>> So all start positions are reported as one nucleotide greater than in
>> the original record, and the features are reported as being 35 bases
>> long instead of 36.
>>
>> Is it a feature or a bug?
>
> Hi, Ivan. I think BED format uses zero-based, half-open coordinates.
>
> http://genome.ucsc.edu/FAQ/FAQformat#format1
>
> Sean
>
>
>>
>> On Thu, Sep 24, 2009 at 2:51 AM, Patrick Aboyoun <paboyoun at fhcrc.org> wrote:
>>> Ivan,
>>> The RangedData class can store strand information in its values table. The
>>> values table can store any "vector-like" object from simple R vectors
>>> (including lists) to an instance of any of the *List classes defined in
>>> IRanges. If you use rtracklayer's import function on a BED file containing
>>> the information you have shown, the chromosome information will be used to
>>> segment the other values into spaces, the start and end values will be
>>> joined together in the ranges information (as a CompressedIRangesList
>>> object) and the strand information will be stored as a factor column across
>>> the values set (which is a CompressedDataFrameList object). The strand
>>> information can be accessed by the strand accessor function. If your data
>>> are sorted by strand within chromosome, you could add another level of
>>> compression by storing the strand information as a 'factor' Rle in the
>>> values table instead of a plain factor. rtracklayer's export function is
>>> aware of a possible strand column in the values table and handles it
>>> appropriately when serializing a RangedData object back into a BED file.
>>>
>>>
>>> Patrick
>>>
>>>
>>> Ivan Gregoretti wrote:
>>>>
>>>> Hi everybody,
>>>>
>>>> What is the minimal container class for position-and-orientation of
>>>> Solexa reads?
>>>>
>>>>
>>>> For example, the minimal positional information should be something
>>>> like a BED record, like this
>>>>
>>>> chr1\t3000001\t3000036\t\t\t+\t
>>>> ...(and many more lines)...
>>>>
>>>> Sorry for the cumbersome string, but I just want to stress that the
>>>> minimal information is:
>>>>
>>>> column 1: chromosome
>>>> column 2: start
>>>> column 3: end
>>>> column 6: orientation, either 'plus', 'minus' or undefined (in this
>>>> case a '+')
>>>>
>>>> Is there any compact container to load, say, 50 million records? I
>>>> thought that RangedData could do that but after reading the
>>>> documentation I see that it does not hold strand information.
>>>>
>>>> If there is such a container, how do you load it up from a BED file?
>>>>
>>>> Thank you,
>>>>
>>>> Ivan
>>>>
>>>> _______________________________________________
>>>> Bioc-sig-sequencing mailing list
>>>> Bioc-sig-sequencing at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>>
>>>
>>>
>>
>
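
A note on the coordinate question in this thread: under the zero-based, half-open convention Sean points to, the length of a BED record is end minus start, so the first record (3002444 to 3002479) covers 35 bases, not 36. The same interval written in one-based, closed coordinates, which is how the IRanges listing above reads, keeps the same end and has start + 1. The base R sketch below reproduces that arithmetic on the first four records quoted from myTags.bed. It is an illustration only: bed_to_one_based is a hypothetical helper, not part of rtracklayer or IRanges, and it makes no claim about how import is implemented internally.

```r
# First four records copied from the `head myTags.bed` output quoted above.
# BED coordinates: start is zero-based, end is exclusive (half-open).
bed <- data.frame(
  chrom  = c("chr1", "chr1", "chr1", "chr1"),
  start  = c(3002444L, 3002989L, 3017603L, 3017879L),
  end    = c(3002479L, 3003024L, 3017638L, 3017914L),
  strand = c("+", "-", "+", "-"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not an rtracklayer or IRanges function):
# rewrite zero-based, half-open intervals as one-based, closed intervals.
bed_to_one_based <- function(bed) {
  data.frame(
    chrom  = bed$chrom,
    start  = bed$start + 1L,        # shift the start by one
    end    = bed$end,               # exclusive end equals the inclusive end
    width  = bed$end - bed$start,   # length is unchanged by the conversion
    strand = bed$strand,
    stringsAsFactors = FALSE
  )
}

converted <- bed_to_one_based(bed)
print(converted)
```

The converted starts, ends and widths agree with rows [1] to [4] of the IRanges listing in the thread, which is consistent with the import behaviour being a feature rather than a bug.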