[BioC] [devteam-bioc] GenomicRanges seqlengths problem
Hervé Pagès
hpages at fhcrc.org
Thu May 1 19:27:44 CEST 2014
On 05/01/2014 10:21 AM, Maintainer wrote:
> Hi Sonia, Val,
>
> On 05/01/2014 09:52 AM, Maintainer wrote:
>> Hi,
>>
>> The seqnames are in a different orders in the 'chrs' and 'sl' objects.
>> Looking at the first 3 names in each,
>>
>> > head(seqlengths(chrs), 3)
>> lm_SuperContig_0_v2 lm_SuperContig_1_v2 lm_SuperContig_10_v2
>> NA NA NA
>>
>> > head(sl, 3)
>> lm_SuperContig_0_v2 lm_SuperContig_1_v2 lm_SuperContig_2_v2
>> 4258568 3378610 2939989
>>
>> When a GRanges is created, seqnames are sorted in ascii order.
>
> Just to clarify, the seqlevels are sorted, not the seqnames:
>
> > gr <- GRanges(c("b", "a", "b"), IRanges(1:3, 10))
> > gr
> GRanges with 3 ranges and 0 metadata columns:
> seqnames ranges strand
> <Rle> <IRanges> <Rle>
> [1] b [1, 10] *
> [2] a [2, 10] *
> [3] b [3, 10] *
> ---
> seqlengths:
> a b
> NA NA
>
> Not sorted (i.e. user-supplied order is preserved):
>
> > seqnames(gr)
> factor-Rle of length 3 with 3 runs
> Lengths: 1 1 1
> Values : b a b
> Levels(2): a b
>
> Sorted:
>
> > seqlevels(gr)
> [1] "a" "b"
>
> This sorting of the seqlevels is not good anyway (people with different
> LOCALE will get different results). I just fixed this so now the
> GRanges() constructor will preserve the seqlevels in the order supplied
> by the user:
>
> > gr <- GRanges(c("b", "a", "b"), IRanges(1:3, 10))
> > seqlevels(gr)
> [1] "b" "a"
>
> More precisely, the seqlevels are obtained by doing unique() on the
> seqnames.
>
> The fix will propagate to the public repos and become available thru
> biocLite() in the next 24 hours.
>
> Then your original code should just work Sonia.
Or maybe not :) You might still need to use 'stringsAsFactors=FALSE'
when calling read.table(). Please let us know if that still doesn't
make it.
Thanks,
H.
>
> Cheers,
> H.
>
>
>> You can
>> see the full list with seqinfo():
>>
>> > seqinfo(chrs)
>> Seqinfo of length 34
>> seqnames seqlengths isCircular genome
>> lm_SuperContig_0_v2 <NA> <NA> <NA>
>> lm_SuperContig_1_v2 <NA> <NA> <NA>
>> lm_SuperContig_10_v2 <NA> <NA> <NA>
>> lm_SuperContig_11_v2 <NA> <NA> <NA>
>> lm_SuperContig_12_v2 <NA> <NA> <NA>
>> ... ... ... ...
>>
>> The seqnames in the replacement vector must match the order in the
>> 'Seqinfo' object.
>>
>> To re-order:
>>
>> sl_new <- sl[match(levels(factor(names(sl))), names(sl))]
>>
>> Then add the new lengths:
>>
>>>> seqlengths(chrs) <- sl_new
>>>> chrs
>>> GRanges with 34 ranges and 0 metadata columns:
>>> seqnames ranges strand
>>> <Rle> <IRanges> <Rle>
>>> [1] lm_SuperContig_0_v2 [1, 4258568] *
>>> [2] lm_SuperContig_1_v2 [1, 3378610] *
>>> [3] lm_SuperContig_2_v2 [1, 2939989] *
>>> [4] lm_SuperContig_3_v2 [1, 2348246] *
>>> [5] lm_SuperContig_4_v2 [1, 1918205] *
>>> ... ... ... ...
>>> [30] lm_SuperContig_29_v2 [1, 200940] *
>>> [31] lm_SuperContig_30_v2 [1, 154863] *
>>> [32] lm_SuperContig_31_v2 [1, 143268] *
>>> [33] lm_SuperContig_32_v2 [1, 87679] *
>>> [34] lm_SuperContig_34_v2 [1, 58596] *
>>> ---
>>> seqlengths:
>>> lm_SuperContig_0_v2 lm_SuperContig_1_v2 ... lm_SuperContig_9_v2
>>> 4258568 3378610 ... 1772623
>>
>>
>> Valerie
>>
>>
>> On 05/01/2014 07:20 AM, Maintainer wrote:
>>>
>>> Hello,
>>>
>>> I have a problem with supplying GRanges object with seqlengths.
>>> I have a files.
>>> chrs.txt - contains information about chromosomes
>>>
>>> chrs.txt
>>> chr,start,end,len
>>> lm_SuperContig_0_v2,1,4258568,4258568
>>> lm_SuperContig_1_v2,1,3378610,3378610
>>> lm_SuperContig_2_v2,1,2939989,2939989
>>> lm_SuperContig_3_v2,1,2348246,2348246
>>> lm_SuperContig_4_v2,1,1918205,1918205
>>> lm_SuperContig_6_v2,1,1888674,1888674
>>> lm_SuperContig_5_v2,1,1869450,1869450
>>> lm_SuperContig_8_v2,1,1809296,1809296
>>> lm_SuperContig_9_v2,1,1772623,1772623
>>> lm_SuperContig_7_v2,1,1769547,1769547
>>> lm_SuperContig_10_v2,1,1758670,1758670
>>> lm_SuperContig_13_v2,1,1634580,1634580
>>> lm_SuperContig_12_v2,1,1631710,1631710
>>> lm_SuperContig_11_v2,1,1590160,1590160
>>> lm_SuperContig_15_v2,1,1560629,1560629
>>> lm_SuperContig_14_v2,1,1533332,1533332
>>> lm_SuperContig_17_v2,1,1445693,1445693
>>> lm_SuperContig_16_v2,1,1397653,1397653
>>> lm_SuperContig_18_v2,1,1351976,1351976
>>> lm_SuperContig_19_v2,1,1186800,1186800
>>> lm_SuperContig_20_v2,1,1087932,1087932
>>> lm_SuperContig_21_v2,1,1020521,1020521
>>> lm_SuperContig_22_v2,1,731443,731443
>>> lm_SuperContig_23_v2,1,521426,521426
>>> lm_SuperContig_24_v2,1,475869,475869
>>> lm_SuperContig_25_v2,1,318058,318058
>>> lm_SuperContig_26_v2,1,261540,261540
>>> lm_SuperContig_27_v2,1,250629,250629
>>> lm_SuperContig_28_v2,1,236098,236098
>>> lm_SuperContig_29_v2,1,200940,200940
>>> lm_SuperContig_30_v2,1,154863,154863
>>> lm_SuperContig_31_v2,1,143268,143268
>>> lm_SuperContig_32_v2,1,87679,87679
>>> lm_SuperContig_34_v2,1,58596,58596
>>>
>>> I try to do the following:
>>> # creating GRanges object for chromosomes
>>> dataChr <- read.table("chrs.txt",header=T,sep=",")
>>> chrs <- with(dataChr, GRanges(chr, IRanges(start, end)))
>>> sl <- setNames(dataChr$len, as.character(dataChr$chr))
>>> seqlengths(chrs) <- sl
>>>
>>> And I get the following error:
>>>
>>> Error in .normargSeqlengths(value, seqnames(x)) :
>>> when the supplied 'seqlengths' vector is named, the names must match the seqnames
>>>
>>> Any chance you could help me with what is going on?
>>>
>>> Best wishes,
>>> Agnieszka
>>>
>>>
>>> -- output of sessionInfo():
>>>
>>> R version 3.1.0 (2014-04-10)
>>> Platform: i386-w64-mingw32/i386 (32-bit)
>>>
>>> locale:
>>> [1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252 LC_MONETARY=English_United Kingdom.1252
>>> [4] LC_NUMERIC=C LC_TIME=English_United Kingdom.1252
>>>
>>> attached base packages:
>>> [1] parallel stats graphics grDevices utils datasets methods base
>>>
>>> other attached packages:
>>> [1] XVector_0.4.0 ggbio_1.12.3 ggplot2_0.9.3.1 GenomicRanges_1.16.2 GenomeInfoDb_1.0.2 IRanges_1.22.4 BiocGenerics_0.10.0
>>> [8] BiocInstaller_1.14.2
>>>
>>> loaded via a namespace (and not attached):
>>> [1] AnnotationDbi_1.26.0 BatchJobs_1.2 BBmisc_1.6 Biobase_2.24.0 BiocParallel_0.6.0 biomaRt_2.20.0
>>> [7] Biostrings_2.32.0 biovizBase_1.12.1 bitops_1.0-6 brew_1.0-6 BSgenome_1.32.0 cluster_1.15.2
>>> [13] codetools_0.2-8 colorspace_1.2-4 DBI_0.2-7 dichromat_2.0-0 digest_0.6.4 fail_1.2
>>> [19] foreach_1.4.2 Formula_1.1-1 GenomicAlignments_1.0.0 GenomicFeatures_1.16.0 grid_3.1.0 gridExtra_0.9.1
>>> [25] gtable_0.1.2 Hmisc_3.14-4 iterators_1.0.7 labeling_0.2 lattice_0.20-29 latticeExtra_0.6-26
>>> [31] MASS_7.3-31 munsell_0.4.2 plyr_1.8.1 proto_0.3-10 RColorBrewer_1.0-5 Rcpp_0.11.1
>>> [37] RCurl_1.95-4.1 reshape2_1.4 Rsamtools_1.16.0 RSQLite_0.11.4 rtracklayer_1.24.0 scales_0.2.4
>>> [43] sendmailR_1.1-2 splines_3.1.0 stats4_3.1.0 stringr_0.6.2 survival_2.37-7 tools_3.1.0
>>> [49] VariantAnnotation_1.10.0 XML_3.98-1.1 zlibbioc_1.10.0
>>>
>>>
>>> --
>>> Sent via the guest posting facility at bioconductor.org.
>>>
>>> ________________________________________________________________________
>>> devteam-bioc mailing list
>>> To unsubscribe from this mailing list send a blank email to
>>> devteam-bioc-leave at lists.fhcrc.org
>>> You can also unsubscribe or change your personal options at
>>> https://lists.fhcrc.org/mailman/listinfo/devteam-bioc
>>>
>>
>>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list