[Bioc-sig-seq] Minimal short sequences position/orientation container

Sean Davis seandavi at gmail.com
Thu Sep 24 22:04:17 CEST 2009


On Thu, Sep 24, 2009 at 2:40 PM, Ivan Gregoretti <ivangreg at gmail.com> wrote:
> Hi Patrick,
>
> Great. It works.
>
> Can you clarify if the following observation is a feature or a bug?
>
> When I download
>
> http://dl.getdropbox.com/u/2051155/myTags.bed
>
> and from the unix prompt I take a peek at it, I get:
>
> head myTags.bed
>
> chr1    3002444 3002479                 +
> chr1    3002989 3003024                 -
> chr1    3017603 3017638                 +
> chr1    3017879 3017914                 -
> chr1    3018173 3018208                 +
> chr1    3018183 3018218                 -
> chr1    3018183 3018218                 -
> chr1    3019065 3019100                 +
> chr1    3019761 3019796                 -
> chr1    3020044 3020079                 -
>
> Fine. It shows the 36-base-long reads.
>
> Now I follow your suggestion loading it into R:
>
> suppressMessages(library(rtracklayer))
>
> myTags <- import('myTags.bed')
>
> ranges(myTags["chr1"])[[1]]
> IRanges instance:
>             start       end width
> [1]        3002445   3002479    35
> [2]        3002990   3003024    35
> [3]        3017604   3017638    35
> [4]        3017880   3017914    35
> [5]        3018174   3018208    35
> [6]        3018184   3018218    35
> [7]        3018184   3018218    35
> [8]        3019066   3019100    35
> [9]        3019762   3019796    35
> ...            ...       ...   ...
> [322808] 197166880 197166914    35
> [322809] 197167672 197167706    35
> [322810] 197167851 197167885    35
> [322811] 197185820 197185854    35
> [322812] 197185850 197185884    35
> [322813] 197188518 197188552    35
> [322814] 197189251 197189285    35
> [322815] 197189593 197189627    35
> [322816] 197191697 197191731    35
>
> So, all start positions are shown one nucleotide downstream of the
> original record, and the features are reported as being 35 bases long
> instead of 36.
>
> Is it a feature or a bug?

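Hi, Ivan.  I think BED format uses zero-based, half-open coordinates, so a
BED start of 3002444 is base 3002445 in one-based terms, and the interval
3002444-3002479 covers 35 bases, not 36.  What you see looks like a feature.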

http://genome.ucsc.edu/FAQ/FAQformat#format1
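
As a rough illustration of that convention (a sketch in base R, not
rtracklayer's actual code; the helper name bed_to_one_based is made up for
this example), going from a zero-based, half-open BED interval to the
one-based, closed interval that IRanges prints amounts to adding 1 to the
start and leaving the end alone:

```r
# Hypothetical helper (not part of rtracklayer or IRanges): convert
# zero-based, half-open BED coordinates to one-based, closed coordinates.
bed_to_one_based <- function(bed_start, bed_end) {
  start <- bed_start + 1L   # shift the zero-based start to one-based
  end   <- bed_end          # the half-open end equals the closed one-based end
  data.frame(start = start, end = end, width = end - start + 1L)
}

# First three records from the head of myTags.bed quoted above
converted <- bed_to_one_based(c(3002444L, 3002989L, 3017603L),
                              c(3002479L, 3003024L, 3017638L))
print(converted)
```

The widths come out as end - start in the original BED numbers, which is
why each of these records spans 35 bases.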

Sean


>
> On Thu, Sep 24, 2009 at 2:51 AM, Patrick Aboyoun <paboyoun at fhcrc.org> wrote:
>> Ivan,
>> The RangedData class can store strand information in its values table. The
>> values table can store any "vector-like" object from simple R vectors
>> (including lists) to an instance of any of the *List classes defined in
>> IRanges. If you use rtracklayer's import function on a bed file containing
>> the information you have shown, the chromosome information will be used to
>> segment the other values into spaces, the start and end values will be
>> joined together in the ranges information (as a CompressedIRangesList
>> object) and the strand information will be stored as a factor column across
>> the values set (which is a CompressedDataFrameList object). The strand
>> information can be accessed by the strand accessor function. If your data
>> are sorted by strand within chromosome, you could add another level of
>> compression by storing the strand information as a 'factor' Rle in the
>> values table instead of a plain factor. rtracklayer's export function is
>> aware of a possible strand column in the values table and handles it
>> appropriately when serializing a RangedData object back into a bed file.
>>
>>
>> Patrick
>>
>>
>> Ivan Gregoretti wrote:
>>>
>>> Hi everybody,
>>>
>>> What is the minimal container class for position-and-orientation of
>>> Solexa reads?
>>>
>>>
>>> For example, the minimal positional information should be something
>>> like a BED record, like this
>>>
>>> chr1\t3000001\t3000036\t\t\t+\t
>>> ...(and many more lines)...
>>>
>>> sorry for the cumbersome string but I just want to stress that the
>>> minimal information is:
>>>
>>> column 1: chromosome
>>> column 2: start
>>> column 3: end
>>> column 6: orientation, either 'plus', 'minus' or undefined. (in this case
>>> a '+')
>>>
>>> Is there any compact container to load, say, 50 million records? I
>>> thought that RangedData could do that but after reading the
>>> documentation I see that it does not hold strand information.
>>>
>>> If there is such a container, how do you load it up from a BED file?
>>>
>>> Thank you,
>>>
>>> Ivan
>>>
>>> Ivan Gregoretti, PhD
>>> National Institute of Diabetes and Digestive and Kidney Diseases
>>> National Institutes of Health
>>> 5 Memorial Dr, Building 5, Room 205.
>>> Bethesda, MD 20892. USA.
>>> Phone: 1-301-496-1592
>>> Fax: 1-301-496-9878
>>>
>>> _______________________________________________
>>> Bioc-sig-sequencing mailing list
>>> Bioc-sig-sequencing at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>
>>
>>
>


