[BioC] [devteam-bioc] GenomicRanges seqlengths problem

Fri May 2 02:28:18 CEST 2014

Sorry did not include my chromosome and snp files.

It looks like this:
chrs.txt - summary of chromosomes
chr,start,end,len
0,1,4258568,4258568
1,1,3378610,3378610
2,1,2939989,2939989
3,1,2348246,2348246
4,1,1918205,1918205
6,1,1888674,1888674
5,1,1869450,1869450
8,1,1809296,1809296
9,1,1772623,1772623
7,1,1769547,1769547
10,1,1758670,1758670
13,1,1634580,1634580
12,1,1631710,1631710
11,1,1590160,1590160
15,1,1560629,1560629
14,1,1533332,1533332
17,1,1445693,1445693
16,1,1397653,1397653
18,1,1351976,1351976
19,1,1186800,1186800
20,1,1087932,1087932
21,1,1020521,1020521
22,1,731443,731443
23,1,521426,521426
24,1,475869,475869
25,1,318058,318058
26,1,261540,261540
27,1,250629,250629
28,1,236098,236098
29,1,200940,200940
30,1,154863,154863
31,1,143268,143268
32,1,87679,87679
34,1,58596,58596

snps.txt - SNP file
chr,id,start,end
6,egn4_Lema_G046740.1,538805,538805
6,egn4_Lema_G049660.1,1607018,1607018
10,egn4_Lema_G076370.1,1242542,1242542
13,egn4_Lema_G081640.1,1352946,1352946
12,egn4_Lema_G082370.1,109077,109077
12,egn4_Lema_G085610.1,1190615,1190615
12,egn4_Lema_G085620.1,1190933,1190933
15,egn4_Lema_G090230.1,331669,331669
16,egn4_Lema_G104510.1,667778,667778
18,egn4_Lema_G110260.1,1321392,1321392
19,egn4_Lema_G113060.1,958993,958993
19,egn4_Lema_G113080.1,1177185,1177185
21,egn4_Lema_G116710.1,263332,263332

If I  only do that:
dataChr <- read.table("chrs.txt",header=T,sep=",",stringsAsFactors=FALSE)
dataChrS <- dataChr[order(dataChr$chr),]
t <- read.table("snps.txt",header=T,sep=",")
snps <- with(t, GRanges(chr, IRanges(start, end), id=id)) 
seqlens <- setNames(dataChrS$len, as.character(dataChrS$chr))
seqlengths(snps) <- seqlens
I get the initial error -
Error in .normargSeqlengths(value, seqnames(x)) :
when the supplied 'seqlengths' vector is named, the names must match the seqnames
Makes sense, there are no SNPs on some chromosomes

If I do that: 
lChrs <- dataChrS[dataChrS$chr%in%t$chr,]
seqlens <- setNames(lChrs$len, as.character(lChrs$chr))
seqlengths(snps) <- seqlens
seqlegths(snps)
      6      10      12      13      15      16 
1888674 1758670 1631710 1634580 1560629 1397653

I only have chromosomes with SNPs.

Best wishes,
Agnieszka

________________________________________
From: Miss Agnieszka Aleksandra Golicz
Sent: 02 May 2014 10:01
To: Hervé Pagès; vobencha at fhcrc.org; guest at bioconductor.org; bioconductor at r-project.org
Cc: GenomicRanges Maintainer
Subject: RE: [devteam-bioc] GenomicRanges seqlengths problem

Hello,

Thank you very much for your help.

I am facing one more problem with GRanges.
I am trying to emulate this object.
data("CRC", package = "biovizBase")
object: mut.gr (it's SNP annotation for human chromosomes 1-22)

When I just do show on mut.gr
show(mut.gr)
GRanges with 60 ranges and 10 metadata columns:
       seqnames                 ranges strand | Hugo_Symbol Entrez_Gene_Id   Center NCBI_Build   Strand Variant_Classification Variant_Type Reference_Allele
          <Rle>              <IRanges>  <Rle> |    <factor>      <integer> <factor>  <integer> <factor>               <factor>     <factor>         <factor>
   [1]        1 [ 11003085,  11003085]      + |      TARDBP          23435    Broad         36        +               Missense          SNP                G
   [2]        1 [ 62352395,  62352395]      + |       INADL          10207    Broad         36        +               Missense          SNP                T
   [3]        1 [194960885, 194960885]      + |         CFH           3075    Broad         36        +               Missense          SNP                G
   [4]        2 [ 10116508,  10116508]      - |        CYS1         192668    Broad         36        -               Missense          SNP                C
   [5]        2 [ 33617747,  33617747]      + |     RASGRP3          25780    Broad         36        +               Missense          SNP                C
   [6]        2 [ 73894280,  73894280]      + |     C2orf78         388960    Broad         36        +               Missense          SNP                T
   [7]        2 [ 96732769,  96732769]      + |      FER1L5          90342    Broad         36        +               Missense          SNP                T
   [8]        2 [179160267, 179160267]      - |         TTN           7273    Broad         36        -               Missense          SNP                C
   [9]        2 [217251189, 217251189]      - |      IGFBP5           3488    Broad         36        -               Missense          SNP                C
  [10]        3 [ 12620699,  12620699]      - |        RAF1           5894    Broad         36        -               Missense          SNP                G
  [11]        3 [ 46472880,  46472880]      - |         LTF           4057    Broad         36        -               Missense          SNP                C
  [12]        3 [137203130, 137203130]      + |     PPP2R3A           5523    Broad         36        +               Missense          SNP                A
  [13]        3 [137457429, 137457429]      + |        PCCB           5096    Broad         36        +               Missense          SNP                C
  [14]        3 [184708629, 184708629]      - |       KLHL6          89857    Broad         36        -               Missense          SNP                G
  [15]        4 [147434591, 147434591]      - |     SLC10A7          84068    Broad         36        -               Missense          SNP                G
  [16]        4 [185915412, 185915412]      - |       ACSL1           2180    Broad         36        -               Missense          SNP                A
  [17]        5 [ 79070342,  79070342]      + |       CMYA5         202333    Broad         36        +               Missense          SNP                C
  [18]        5 [ 94775579,  94775579]      + |      FAM81B         153643    Broad         36        +               Missense          SNP                G
  [19]        5 [140838266, 140838266]      + |     PCDHGC3           5098    Broad         36        +               Missense          SNP                T
  [20]        6 [109992724, 109992724]      - |        AKD1         221264    Broad         36        -               Missense          SNP                C
  [21]        6 [118993492, 118993492]      - |    C6orf204         387119    Broad         36        -               Missense          SNP                G
  [22]        7 [124286401, 124286401]      - |        POT1          25913    Broad         36        -               Missense          SNP                A
  [23]        7 [125960456, 125960456]      - |        GRM8           2918    Broad         36        -               Missense          SNP                A
  [24]        9 [ 35097658,  35097658]      - |    KIAA1539          80256    Broad         36        -               Missense          SNP                A
  [25]        9 [103164773, 103164773]      - |        BAAT            570    Broad         36        -               Missense          SNP                G
  [26]        9 [132745783, 132745783]      + |        ABL1             25    Broad         36        +               Missense          SNP                C
  [27]        9 [138481305, 138481305]      - |      SEC16A           9919    Broad         36        -               Missense          SNP                C
  [28]       10 [  7661761,   7661761]      - |       ITIH5          80760    Broad         36        -               Missense          SNP                C
  [29]       10 [106064683, 106064683]      - |      ITPRIP          85450    Broad         36        -               Missense          SNP                T
  [30]       11 [   603430,    603430]      - |        IRF7           3665    Broad         36        -               Missense          SNP                G
  [31]       11 [  5610148,   5610148]      + |      TRIM34          53840    Broad         36        +               Missense          SNP                A
  [32]       11 [  9420281,   9420281]      + |        IPO7          10527    Broad         36        +               Missense          SNP                G
  [33]       11 [ 26495005,  26495005]      + |        ANO3          63982    Broad         36        +               Missense          SNP                C
  [34]       11 [ 55629818,  55629818]      + |       OR8H2         390151    Broad         36        +               Missense          SNP                G
  [35]       11 [116737834, 116737834]      + |      CEP164          22897    Broad         36        +               Missense          SNP                C
  [36]       12 [ 25165980,  25165980]      - |       CASC1          55259    Broad         36        -               Missense          SNP                G
  [37]       12 [ 25289548,  25289548]      - |        KRAS           3845    Broad         36        -               Missense          SNP                C
  [38]       12 [ 31142142,  31142142]      + |       DDX11           1663    Broad         36        +               Missense          SNP                G
  [39]       12 [ 42434771,  42434771]      - |       PUS7L          83448    Broad         36        -               Missense          SNP                T
  [40]       12 [ 55006974,  55006974]      - |        PAN2           9924    Broad         36        -               Missense          SNP                T
  [41]       13 [102199348, 102199348]      - |   LOC643677         643677    Broad         36        -               Missense          SNP                G
  [42]       15 [ 40503502,  40503502]      - |      ZFP106          64397    Broad         36        -               Missense          SNP                C
  [43]       15 [ 70816269,  70816269]      + |        BBS4            585    Broad         36        +               Missense          SNP                A
  [44]       16 [ 23481033,  23481033]      + |       UBFD1          56061    Broad         36        +               Missense          SNP                C
  [45]       17 [ 21149010,  21149010]      + |      MAP2K3           5606    Broad         36        +               Missense          SNP                G
  [46]       17 [ 35812691,  35812691]      - |       TOP2A           7153    Broad         36        -               Missense          SNP                T
  [47]       18 [ 16788810,  16788810]      - |       ROCK1           6093    Broad         36        -               Missense          SNP                C
  [48]       18 [ 75322338,  75322338]      + |      NFATC1           4772    Broad         36        +               Missense          SNP                G
  [49]       19 [ 15713495,  15713495]      + |      OR10H3          26532    Broad         36        +               Missense          SNP                G
  [50]       19 [ 40730140,  40730140]      + |     TMEM147          10430    Broad         36        +               Missense          SNP                C
  [51]       19 [ 52338664,  52338664]      + |        SAE1          10055    Broad         36        +               Missense          SNP                G
  [52]       19 [ 57407795,  57407795]      + |     PPP2R1A           5518    Broad         36        +               Missense          SNP                G
  [53]       20 [ 23298287,  23298287]      + |        GZF1          64412    Broad         36        +               Missense          SNP                A
  [54]       20 [ 31012946,  31012946]      + |      EFCAB8         388795    Broad         36        +               Missense          SNP                C
  [55]       20 [ 40223536,  40223536]      - |       PTPRT          11122    Broad         36        -               Missense          SNP                C
  [56]       20 [ 54467136,  54467136]      + |       CASS4          57091    Broad         36        +               Missense          SNP                G
  [57]       20 [ 60201983,  60201983]      + |      GTPBP5          26164    Broad         36        +               Missense          SNP                C
  [58]       21 [ 36688774,  36688774]      + |      CHAF1B           8208    Broad         36        +               Missense          SNP                C
  [59]       21 [ 39699770,  39699770]      - |       LCA5L         150082    Broad         36        -               Missense          SNP                T
  [60]       22 [ 27437953,  27437953]      - |       CHEK2          11200    Broad         36        -               Missense          SNP                C

In the seqnames column chromosomes 8 and 14 are missing, which makes sense - there is no SNPs on those.

However, when  do:
head(seqlengths(mut.gr), 25)

       1         2         3         4         5         6         7         8         9        10        11        12        13        14        15
249250621 243199373 198022430 191154276 180915260 171115067 159138663 146364022 141213431 135534747 135006516 133851895 115169878 107349540 102531392
       16        17        18        19        20        21        22
 90354753  81195210  78077248  59128983  63025520  48129895  51304566

All 22 chromosomes are there.

How is that possible?

I believe that for my data, I need to do a similar thing (I'm trying to create a  multilayer circular plot in ggbio). Although I have snps for only few chromosomes, I think I need to include lengths for all of them.
How do I do that?

Best wishes,
Agnieszka
________________________________________
From: Hervé Pagès <hpages at fhcrc.org>
Sent: 02 May 2014 03:27
To: vobencha at fhcrc.org; guest at bioconductor.org; bioconductor at r-project.org; Miss Agnieszka Aleksandra Golicz
Cc: GenomicRanges Maintainer
Subject: Re: [devteam-bioc] GenomicRanges seqlengths problem

On 05/01/2014 10:21 AM, Maintainer wrote:
> Hi Sonia, Val,
>
> On 05/01/2014 09:52 AM, Maintainer wrote:
>> Hi,
>>
>> The seqnames are in a different orders in the 'chrs' and 'sl' objects.
>> Looking at the first 3 names in each,
>>
>>    > head(seqlengths(chrs), 3)
>>     lm_SuperContig_0_v2  lm_SuperContig_1_v2 lm_SuperContig_10_v2
>>                      NA                   NA                   NA
>>
>>    > head(sl, 3)
>> lm_SuperContig_0_v2 lm_SuperContig_1_v2 lm_SuperContig_2_v2
>>                4258568             3378610             2939989
>>
>> When a GRanges is created, seqnames are sorted in ascii order.
>
> Just to clarify, the seqlevels are sorted, not the seqnames:
>
>     > gr <- GRanges(c("b", "a", "b"), IRanges(1:3, 10))
>     > gr
>     GRanges with 3 ranges and 0 metadata columns:
>           seqnames    ranges strand
>              <Rle> <IRanges>  <Rle>
>       [1]        b   [1, 10]      *
>       [2]        a   [2, 10]      *
>       [3]        b   [3, 10]      *
>       ---
>       seqlengths:
>         a  b
>        NA NA
>
> Not sorted (i.e. user-supplied order is preserved):
>
>     > seqnames(gr)
>     factor-Rle of length 3 with 3 runs
>       Lengths: 1 1 1
>       Values : b a b
>     Levels(2): a b
>
> Sorted:
>
>     > seqlevels(gr)
>     [1] "a" "b"
>
> This sorting of the seqlevels is not good anyway (people with different
> LOCALE will get different results). I just fixed this so now the
> GRanges() constructor will preserve the seqlevels in the order supplied
> by the user:
>
>     > gr <- GRanges(c("b", "a", "b"), IRanges(1:3, 10))
>     > seqlevels(gr)
>     [1] "b" "a"
>
> More precisely, the seqlevels are obtained by doing unique() on the
> seqnames.
>
> The fix will propagate to the public repos and become available thru
> biocLite() in the next 24 hours.
>
> Then your original code should just work Sonia.

Or maybe not :) You might still need to use 'stringsAsFactors=FALSE'
when calling read.table(). Please let us know if that still doesn't
make it.

Thanks,
H.

>
> Cheers,
> H.
>
>
>> You can
>> see the full list with seqinfo():
>>
>>    > seqinfo(chrs)
>> Seqinfo of length 34
>> seqnames             seqlengths isCircular genome
>> lm_SuperContig_0_v2        <NA>       <NA>   <NA>
>> lm_SuperContig_1_v2        <NA>       <NA>   <NA>
>> lm_SuperContig_10_v2       <NA>       <NA>   <NA>
>> lm_SuperContig_11_v2       <NA>       <NA>   <NA>
>> lm_SuperContig_12_v2       <NA>       <NA>   <NA>
>> ...                         ...        ...    ...
>>
>> The seqnames in the replacement vector must match the order in the
>> 'Seqinfo' object.
>>
>> To re-order:
>>
>> sl_new <- sl[match(levels(factor(names(sl))), names(sl))]
>>
>> Then add the new lengths:
>>
>>>> seqlengths(chrs) <- sl_new
>>>> chrs
>>> GRanges with 34 ranges and 0 metadata columns:
>>>                      seqnames       ranges strand
>>>                         <Rle>    <IRanges>  <Rle>
>>>      [1]  lm_SuperContig_0_v2 [1, 4258568]      *
>>>      [2]  lm_SuperContig_1_v2 [1, 3378610]      *
>>>      [3]  lm_SuperContig_2_v2 [1, 2939989]      *
>>>      [4]  lm_SuperContig_3_v2 [1, 2348246]      *
>>>      [5]  lm_SuperContig_4_v2 [1, 1918205]      *
>>>      ...                  ...          ...    ...
>>>     [30] lm_SuperContig_29_v2  [1, 200940]      *
>>>     [31] lm_SuperContig_30_v2  [1, 154863]      *
>>>     [32] lm_SuperContig_31_v2  [1, 143268]      *
>>>     [33] lm_SuperContig_32_v2  [1,  87679]      *
>>>     [34] lm_SuperContig_34_v2  [1,  58596]      *
>>>     ---
>>>     seqlengths:
>>>       lm_SuperContig_0_v2  lm_SuperContig_1_v2 ...  lm_SuperContig_9_v2
>>>                   4258568              3378610 ...              1772623
>>
>>
>> Valerie
>>
>>
>> On 05/01/2014 07:20 AM, Maintainer wrote:
>>>
>>> Hello,
>>>
>>> I have a problem with supplying GRanges object with seqlengths.
>>> I have a files.
>>> chrs.txt - contains information about chromosomes
>>>
>>> chrs.txt
>>> chr,start,end,len
>>> lm_SuperContig_0_v2,1,4258568,4258568
>>> lm_SuperContig_1_v2,1,3378610,3378610
>>> lm_SuperContig_2_v2,1,2939989,2939989
>>> lm_SuperContig_3_v2,1,2348246,2348246
>>> lm_SuperContig_4_v2,1,1918205,1918205
>>> lm_SuperContig_6_v2,1,1888674,1888674
>>> lm_SuperContig_5_v2,1,1869450,1869450
>>> lm_SuperContig_8_v2,1,1809296,1809296
>>> lm_SuperContig_9_v2,1,1772623,1772623
>>> lm_SuperContig_7_v2,1,1769547,1769547
>>> lm_SuperContig_10_v2,1,1758670,1758670
>>> lm_SuperContig_13_v2,1,1634580,1634580
>>> lm_SuperContig_12_v2,1,1631710,1631710
>>> lm_SuperContig_11_v2,1,1590160,1590160
>>> lm_SuperContig_15_v2,1,1560629,1560629
>>> lm_SuperContig_14_v2,1,1533332,1533332
>>> lm_SuperContig_17_v2,1,1445693,1445693
>>> lm_SuperContig_16_v2,1,1397653,1397653
>>> lm_SuperContig_18_v2,1,1351976,1351976
>>> lm_SuperContig_19_v2,1,1186800,1186800
>>> lm_SuperContig_20_v2,1,1087932,1087932
>>> lm_SuperContig_21_v2,1,1020521,1020521
>>> lm_SuperContig_22_v2,1,731443,731443
>>> lm_SuperContig_23_v2,1,521426,521426
>>> lm_SuperContig_24_v2,1,475869,475869
>>> lm_SuperContig_25_v2,1,318058,318058
>>> lm_SuperContig_26_v2,1,261540,261540
>>> lm_SuperContig_27_v2,1,250629,250629
>>> lm_SuperContig_28_v2,1,236098,236098
>>> lm_SuperContig_29_v2,1,200940,200940
>>> lm_SuperContig_30_v2,1,154863,154863
>>> lm_SuperContig_31_v2,1,143268,143268
>>> lm_SuperContig_32_v2,1,87679,87679
>>> lm_SuperContig_34_v2,1,58596,58596
>>>
>>> I try to do the following:
>>> # creating GRanges object for chromosomes
>>> dataChr <- read.table("chrs.txt",header=T,sep=",")
>>> chrs <- with(dataChr, GRanges(chr, IRanges(start, end)))
>>> sl <- setNames(dataChr$len, as.character(dataChr$chr))
>>> seqlengths(chrs) <- sl
>>>
>>> And I get the following error:
>>>
>>> Error in .normargSeqlengths(value, seqnames(x)) :
>>>      when the supplied 'seqlengths' vector is named, the names must match the seqnames
>>>
>>> Any chance you could help me with what is going on?
>>>
>>> Best wishes,
>>> Agnieszka
>>>
>>>
>>>     -- output of sessionInfo():
>>>
>>> R version 3.1.0 (2014-04-10)
>>> Platform: i386-w64-mingw32/i386 (32-bit)
>>>
>>> locale:
>>> [1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252    LC_MONETARY=English_United Kingdom.1252
>>> [4] LC_NUMERIC=C                            LC_TIME=English_United Kingdom.1252
>>>
>>> attached base packages:
>>> [1] parallel  stats     graphics  grDevices utils     datasets  methods   base
>>>
>>> other attached packages:
>>> [1] XVector_0.4.0        ggbio_1.12.3         ggplot2_0.9.3.1      GenomicRanges_1.16.2 GenomeInfoDb_1.0.2   IRanges_1.22.4       BiocGenerics_0.10.0
>>> [8] BiocInstaller_1.14.2
>>>
>>> loaded via a namespace (and not attached):
>>>     [1] AnnotationDbi_1.26.0     BatchJobs_1.2            BBmisc_1.6               Biobase_2.24.0           BiocParallel_0.6.0       biomaRt_2.20.0
>>>     [7] Biostrings_2.32.0        biovizBase_1.12.1        bitops_1.0-6             brew_1.0-6               BSgenome_1.32.0          cluster_1.15.2
>>> [13] codetools_0.2-8          colorspace_1.2-4         DBI_0.2-7                dichromat_2.0-0          digest_0.6.4             fail_1.2
>>> [19] foreach_1.4.2            Formula_1.1-1            GenomicAlignments_1.0.0  GenomicFeatures_1.16.0   grid_3.1.0               gridExtra_0.9.1
>>> [25] gtable_0.1.2             Hmisc_3.14-4             iterators_1.0.7          labeling_0.2             lattice_0.20-29          latticeExtra_0.6-26
>>> [31] MASS_7.3-31              munsell_0.4.2            plyr_1.8.1               proto_0.3-10             RColorBrewer_1.0-5       Rcpp_0.11.1
>>> [37] RCurl_1.95-4.1           reshape2_1.4             Rsamtools_1.16.0         RSQLite_0.11.4           rtracklayer_1.24.0       scales_0.2.4
>>> [43] sendmailR_1.1-2          splines_3.1.0            stats4_3.1.0             stringr_0.6.2            survival_2.37-7          tools_3.1.0
>>> [49] VariantAnnotation_1.10.0 XML_3.98-1.1             zlibbioc_1.10.0
>>>
>>>
>>> --
>>> Sent via the guest posting facility at bioconductor.org.
>>>
>>> ________________________________________________________________________
>>> devteam-bioc mailing list
>>> To unsubscribe from this mailing list send a blank email to
>>> devteam-bioc-leave at lists.fhcrc.org
>>> You can also unsubscribe or change your personal options at
>>> https://lists.fhcrc.org/mailman/listinfo/devteam-bioc
>>>
>>
>>
>

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319