[BioC] [devteam-bioc] GenomicRanges seqlengths problem
Miss Agnieszka Aleksandra Golicz
agnieszka.golicz at uq.net.au
Fri May 2 02:28:18 CEST 2014
Sorry did not include my chromosome and snp files.
It looks like this:
chrs.txt - summary of chromosomes
chr,start,end,len
0,1,4258568,4258568
1,1,3378610,3378610
2,1,2939989,2939989
3,1,2348246,2348246
4,1,1918205,1918205
6,1,1888674,1888674
5,1,1869450,1869450
8,1,1809296,1809296
9,1,1772623,1772623
7,1,1769547,1769547
10,1,1758670,1758670
13,1,1634580,1634580
12,1,1631710,1631710
11,1,1590160,1590160
15,1,1560629,1560629
14,1,1533332,1533332
17,1,1445693,1445693
16,1,1397653,1397653
18,1,1351976,1351976
19,1,1186800,1186800
20,1,1087932,1087932
21,1,1020521,1020521
22,1,731443,731443
23,1,521426,521426
24,1,475869,475869
25,1,318058,318058
26,1,261540,261540
27,1,250629,250629
28,1,236098,236098
29,1,200940,200940
30,1,154863,154863
31,1,143268,143268
32,1,87679,87679
34,1,58596,58596
snps.txt - SNP file
chr,id,start,end
6,egn4_Lema_G046740.1,538805,538805
6,egn4_Lema_G049660.1,1607018,1607018
10,egn4_Lema_G076370.1,1242542,1242542
13,egn4_Lema_G081640.1,1352946,1352946
12,egn4_Lema_G082370.1,109077,109077
12,egn4_Lema_G085610.1,1190615,1190615
12,egn4_Lema_G085620.1,1190933,1190933
15,egn4_Lema_G090230.1,331669,331669
16,egn4_Lema_G104510.1,667778,667778
18,egn4_Lema_G110260.1,1321392,1321392
19,egn4_Lema_G113060.1,958993,958993
19,egn4_Lema_G113080.1,1177185,1177185
21,egn4_Lema_G116710.1,263332,263332
If I only do that:
dataChr <- read.table("chrs.txt",header=T,sep=",",stringsAsFactors=FALSE)
dataChrS <- dataChr[order(dataChr$chr),]
t <- read.table("snps.txt",header=T,sep=",")
snps <- with(t, GRanges(chr, IRanges(start, end), id=id))
seqlens <- setNames(dataChrS$len, as.character(dataChrS$chr))
seqlengths(snps) <- seqlens
I get the initial error -
Error in .normargSeqlengths(value, seqnames(x)) :
when the supplied 'seqlengths' vector is named, the names must match the seqnames
Makes sense, there are no SNPs on some chromosomes
If I do that:
lChrs <- dataChrS[dataChrS$chr%in%t$chr,]
seqlens <- setNames(lChrs$len, as.character(lChrs$chr))
seqlengths(snps) <- seqlens
seqlegths(snps)
6 10 12 13 15 16
1888674 1758670 1631710 1634580 1560629 1397653
I only have chromosomes with SNPs.
Best wishes,
Agnieszka
________________________________________
From: Miss Agnieszka Aleksandra Golicz
Sent: 02 May 2014 10:01
To: Hervé Pagès; vobencha at fhcrc.org; guest at bioconductor.org; bioconductor at r-project.org
Cc: GenomicRanges Maintainer
Subject: RE: [devteam-bioc] GenomicRanges seqlengths problem
Hello,
Thank you very much for your help.
I am facing one more problem with GRanges.
I am trying to emulate this object.
data("CRC", package = "biovizBase")
object: mut.gr (it's SNP annotation for human chromosomes 1-22)
When I just do show on mut.gr
show(mut.gr)
GRanges with 60 ranges and 10 metadata columns:
seqnames ranges strand | Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Strand Variant_Classification Variant_Type Reference_Allele
<Rle> <IRanges> <Rle> | <factor> <integer> <factor> <integer> <factor> <factor> <factor> <factor>
[1] 1 [ 11003085, 11003085] + | TARDBP 23435 Broad 36 + Missense SNP G
[2] 1 [ 62352395, 62352395] + | INADL 10207 Broad 36 + Missense SNP T
[3] 1 [194960885, 194960885] + | CFH 3075 Broad 36 + Missense SNP G
[4] 2 [ 10116508, 10116508] - | CYS1 192668 Broad 36 - Missense SNP C
[5] 2 [ 33617747, 33617747] + | RASGRP3 25780 Broad 36 + Missense SNP C
[6] 2 [ 73894280, 73894280] + | C2orf78 388960 Broad 36 + Missense SNP T
[7] 2 [ 96732769, 96732769] + | FER1L5 90342 Broad 36 + Missense SNP T
[8] 2 [179160267, 179160267] - | TTN 7273 Broad 36 - Missense SNP C
[9] 2 [217251189, 217251189] - | IGFBP5 3488 Broad 36 - Missense SNP C
[10] 3 [ 12620699, 12620699] - | RAF1 5894 Broad 36 - Missense SNP G
[11] 3 [ 46472880, 46472880] - | LTF 4057 Broad 36 - Missense SNP C
[12] 3 [137203130, 137203130] + | PPP2R3A 5523 Broad 36 + Missense SNP A
[13] 3 [137457429, 137457429] + | PCCB 5096 Broad 36 + Missense SNP C
[14] 3 [184708629, 184708629] - | KLHL6 89857 Broad 36 - Missense SNP G
[15] 4 [147434591, 147434591] - | SLC10A7 84068 Broad 36 - Missense SNP G
[16] 4 [185915412, 185915412] - | ACSL1 2180 Broad 36 - Missense SNP A
[17] 5 [ 79070342, 79070342] + | CMYA5 202333 Broad 36 + Missense SNP C
[18] 5 [ 94775579, 94775579] + | FAM81B 153643 Broad 36 + Missense SNP G
[19] 5 [140838266, 140838266] + | PCDHGC3 5098 Broad 36 + Missense SNP T
[20] 6 [109992724, 109992724] - | AKD1 221264 Broad 36 - Missense SNP C
[21] 6 [118993492, 118993492] - | C6orf204 387119 Broad 36 - Missense SNP G
[22] 7 [124286401, 124286401] - | POT1 25913 Broad 36 - Missense SNP A
[23] 7 [125960456, 125960456] - | GRM8 2918 Broad 36 - Missense SNP A
[24] 9 [ 35097658, 35097658] - | KIAA1539 80256 Broad 36 - Missense SNP A
[25] 9 [103164773, 103164773] - | BAAT 570 Broad 36 - Missense SNP G
[26] 9 [132745783, 132745783] + | ABL1 25 Broad 36 + Missense SNP C
[27] 9 [138481305, 138481305] - | SEC16A 9919 Broad 36 - Missense SNP C
[28] 10 [ 7661761, 7661761] - | ITIH5 80760 Broad 36 - Missense SNP C
[29] 10 [106064683, 106064683] - | ITPRIP 85450 Broad 36 - Missense SNP T
[30] 11 [ 603430, 603430] - | IRF7 3665 Broad 36 - Missense SNP G
[31] 11 [ 5610148, 5610148] + | TRIM34 53840 Broad 36 + Missense SNP A
[32] 11 [ 9420281, 9420281] + | IPO7 10527 Broad 36 + Missense SNP G
[33] 11 [ 26495005, 26495005] + | ANO3 63982 Broad 36 + Missense SNP C
[34] 11 [ 55629818, 55629818] + | OR8H2 390151 Broad 36 + Missense SNP G
[35] 11 [116737834, 116737834] + | CEP164 22897 Broad 36 + Missense SNP C
[36] 12 [ 25165980, 25165980] - | CASC1 55259 Broad 36 - Missense SNP G
[37] 12 [ 25289548, 25289548] - | KRAS 3845 Broad 36 - Missense SNP C
[38] 12 [ 31142142, 31142142] + | DDX11 1663 Broad 36 + Missense SNP G
[39] 12 [ 42434771, 42434771] - | PUS7L 83448 Broad 36 - Missense SNP T
[40] 12 [ 55006974, 55006974] - | PAN2 9924 Broad 36 - Missense SNP T
[41] 13 [102199348, 102199348] - | LOC643677 643677 Broad 36 - Missense SNP G
[42] 15 [ 40503502, 40503502] - | ZFP106 64397 Broad 36 - Missense SNP C
[43] 15 [ 70816269, 70816269] + | BBS4 585 Broad 36 + Missense SNP A
[44] 16 [ 23481033, 23481033] + | UBFD1 56061 Broad 36 + Missense SNP C
[45] 17 [ 21149010, 21149010] + | MAP2K3 5606 Broad 36 + Missense SNP G
[46] 17 [ 35812691, 35812691] - | TOP2A 7153 Broad 36 - Missense SNP T
[47] 18 [ 16788810, 16788810] - | ROCK1 6093 Broad 36 - Missense SNP C
[48] 18 [ 75322338, 75322338] + | NFATC1 4772 Broad 36 + Missense SNP G
[49] 19 [ 15713495, 15713495] + | OR10H3 26532 Broad 36 + Missense SNP G
[50] 19 [ 40730140, 40730140] + | TMEM147 10430 Broad 36 + Missense SNP C
[51] 19 [ 52338664, 52338664] + | SAE1 10055 Broad 36 + Missense SNP G
[52] 19 [ 57407795, 57407795] + | PPP2R1A 5518 Broad 36 + Missense SNP G
[53] 20 [ 23298287, 23298287] + | GZF1 64412 Broad 36 + Missense SNP A
[54] 20 [ 31012946, 31012946] + | EFCAB8 388795 Broad 36 + Missense SNP C
[55] 20 [ 40223536, 40223536] - | PTPRT 11122 Broad 36 - Missense SNP C
[56] 20 [ 54467136, 54467136] + | CASS4 57091 Broad 36 + Missense SNP G
[57] 20 [ 60201983, 60201983] + | GTPBP5 26164 Broad 36 + Missense SNP C
[58] 21 [ 36688774, 36688774] + | CHAF1B 8208 Broad 36 + Missense SNP C
[59] 21 [ 39699770, 39699770] - | LCA5L 150082 Broad 36 - Missense SNP T
[60] 22 [ 27437953, 27437953] - | CHEK2 11200 Broad 36 - Missense SNP C
In the seqnames column chromosomes 8 and 14 are missing, which makes sense - there is no SNPs on those.
However, when do:
head(seqlengths(mut.gr), 25)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
249250621 243199373 198022430 191154276 180915260 171115067 159138663 146364022 141213431 135534747 135006516 133851895 115169878 107349540 102531392
16 17 18 19 20 21 22
90354753 81195210 78077248 59128983 63025520 48129895 51304566
All 22 chromosomes are there.
How is that possible?
I believe that for my data, I need to do a similar thing (I'm trying to create a multilayer circular plot in ggbio). Although I have snps for only few chromosomes, I think I need to include lengths for all of them.
How do I do that?
Best wishes,
Agnieszka
________________________________________
From: Hervé Pagès <hpages at fhcrc.org>
Sent: 02 May 2014 03:27
To: vobencha at fhcrc.org; guest at bioconductor.org; bioconductor at r-project.org; Miss Agnieszka Aleksandra Golicz
Cc: GenomicRanges Maintainer
Subject: Re: [devteam-bioc] GenomicRanges seqlengths problem
On 05/01/2014 10:21 AM, Maintainer wrote:
> Hi Sonia, Val,
>
> On 05/01/2014 09:52 AM, Maintainer wrote:
>> Hi,
>>
>> The seqnames are in a different orders in the 'chrs' and 'sl' objects.
>> Looking at the first 3 names in each,
>>
>> > head(seqlengths(chrs), 3)
>> lm_SuperContig_0_v2 lm_SuperContig_1_v2 lm_SuperContig_10_v2
>> NA NA NA
>>
>> > head(sl, 3)
>> lm_SuperContig_0_v2 lm_SuperContig_1_v2 lm_SuperContig_2_v2
>> 4258568 3378610 2939989
>>
>> When a GRanges is created, seqnames are sorted in ascii order.
>
> Just to clarify, the seqlevels are sorted, not the seqnames:
>
> > gr <- GRanges(c("b", "a", "b"), IRanges(1:3, 10))
> > gr
> GRanges with 3 ranges and 0 metadata columns:
> seqnames ranges strand
> <Rle> <IRanges> <Rle>
> [1] b [1, 10] *
> [2] a [2, 10] *
> [3] b [3, 10] *
> ---
> seqlengths:
> a b
> NA NA
>
> Not sorted (i.e. user-supplied order is preserved):
>
> > seqnames(gr)
> factor-Rle of length 3 with 3 runs
> Lengths: 1 1 1
> Values : b a b
> Levels(2): a b
>
> Sorted:
>
> > seqlevels(gr)
> [1] "a" "b"
>
> This sorting of the seqlevels is not good anyway (people with different
> LOCALE will get different results). I just fixed this so now the
> GRanges() constructor will preserve the seqlevels in the order supplied
> by the user:
>
> > gr <- GRanges(c("b", "a", "b"), IRanges(1:3, 10))
> > seqlevels(gr)
> [1] "b" "a"
>
> More precisely, the seqlevels are obtained by doing unique() on the
> seqnames.
>
> The fix will propagate to the public repos and become available thru
> biocLite() in the next 24 hours.
>
> Then your original code should just work Sonia.
Or maybe not :) You might still need to use 'stringsAsFactors=FALSE'
when calling read.table(). Please let us know if that still doesn't
make it.
Thanks,
H.
>
> Cheers,
> H.
>
>
>> You can
>> see the full list with seqinfo():
>>
>> > seqinfo(chrs)
>> Seqinfo of length 34
>> seqnames seqlengths isCircular genome
>> lm_SuperContig_0_v2 <NA> <NA> <NA>
>> lm_SuperContig_1_v2 <NA> <NA> <NA>
>> lm_SuperContig_10_v2 <NA> <NA> <NA>
>> lm_SuperContig_11_v2 <NA> <NA> <NA>
>> lm_SuperContig_12_v2 <NA> <NA> <NA>
>> ... ... ... ...
>>
>> The seqnames in the replacement vector must match the order in the
>> 'Seqinfo' object.
>>
>> To re-order:
>>
>> sl_new <- sl[match(levels(factor(names(sl))), names(sl))]
>>
>> Then add the new lengths:
>>
>>>> seqlengths(chrs) <- sl_new
>>>> chrs
>>> GRanges with 34 ranges and 0 metadata columns:
>>> seqnames ranges strand
>>> <Rle> <IRanges> <Rle>
>>> [1] lm_SuperContig_0_v2 [1, 4258568] *
>>> [2] lm_SuperContig_1_v2 [1, 3378610] *
>>> [3] lm_SuperContig_2_v2 [1, 2939989] *
>>> [4] lm_SuperContig_3_v2 [1, 2348246] *
>>> [5] lm_SuperContig_4_v2 [1, 1918205] *
>>> ... ... ... ...
>>> [30] lm_SuperContig_29_v2 [1, 200940] *
>>> [31] lm_SuperContig_30_v2 [1, 154863] *
>>> [32] lm_SuperContig_31_v2 [1, 143268] *
>>> [33] lm_SuperContig_32_v2 [1, 87679] *
>>> [34] lm_SuperContig_34_v2 [1, 58596] *
>>> ---
>>> seqlengths:
>>> lm_SuperContig_0_v2 lm_SuperContig_1_v2 ... lm_SuperContig_9_v2
>>> 4258568 3378610 ... 1772623
>>
>>
>> Valerie
>>
>>
>> On 05/01/2014 07:20 AM, Maintainer wrote:
>>>
>>> Hello,
>>>
>>> I have a problem with supplying GRanges object with seqlengths.
>>> I have a files.
>>> chrs.txt - contains information about chromosomes
>>>
>>> chrs.txt
>>> chr,start,end,len
>>> lm_SuperContig_0_v2,1,4258568,4258568
>>> lm_SuperContig_1_v2,1,3378610,3378610
>>> lm_SuperContig_2_v2,1,2939989,2939989
>>> lm_SuperContig_3_v2,1,2348246,2348246
>>> lm_SuperContig_4_v2,1,1918205,1918205
>>> lm_SuperContig_6_v2,1,1888674,1888674
>>> lm_SuperContig_5_v2,1,1869450,1869450
>>> lm_SuperContig_8_v2,1,1809296,1809296
>>> lm_SuperContig_9_v2,1,1772623,1772623
>>> lm_SuperContig_7_v2,1,1769547,1769547
>>> lm_SuperContig_10_v2,1,1758670,1758670
>>> lm_SuperContig_13_v2,1,1634580,1634580
>>> lm_SuperContig_12_v2,1,1631710,1631710
>>> lm_SuperContig_11_v2,1,1590160,1590160
>>> lm_SuperContig_15_v2,1,1560629,1560629
>>> lm_SuperContig_14_v2,1,1533332,1533332
>>> lm_SuperContig_17_v2,1,1445693,1445693
>>> lm_SuperContig_16_v2,1,1397653,1397653
>>> lm_SuperContig_18_v2,1,1351976,1351976
>>> lm_SuperContig_19_v2,1,1186800,1186800
>>> lm_SuperContig_20_v2,1,1087932,1087932
>>> lm_SuperContig_21_v2,1,1020521,1020521
>>> lm_SuperContig_22_v2,1,731443,731443
>>> lm_SuperContig_23_v2,1,521426,521426
>>> lm_SuperContig_24_v2,1,475869,475869
>>> lm_SuperContig_25_v2,1,318058,318058
>>> lm_SuperContig_26_v2,1,261540,261540
>>> lm_SuperContig_27_v2,1,250629,250629
>>> lm_SuperContig_28_v2,1,236098,236098
>>> lm_SuperContig_29_v2,1,200940,200940
>>> lm_SuperContig_30_v2,1,154863,154863
>>> lm_SuperContig_31_v2,1,143268,143268
>>> lm_SuperContig_32_v2,1,87679,87679
>>> lm_SuperContig_34_v2,1,58596,58596
>>>
>>> I try to do the following:
>>> # creating GRanges object for chromosomes
>>> dataChr <- read.table("chrs.txt",header=T,sep=",")
>>> chrs <- with(dataChr, GRanges(chr, IRanges(start, end)))
>>> sl <- setNames(dataChr$len, as.character(dataChr$chr))
>>> seqlengths(chrs) <- sl
>>>
>>> And I get the following error:
>>>
>>> Error in .normargSeqlengths(value, seqnames(x)) :
>>> when the supplied 'seqlengths' vector is named, the names must match the seqnames
>>>
>>> Any chance you could help me with what is going on?
>>>
>>> Best wishes,
>>> Agnieszka
>>>
>>>
>>> -- output of sessionInfo():
>>>
>>> R version 3.1.0 (2014-04-10)
>>> Platform: i386-w64-mingw32/i386 (32-bit)
>>>
>>> locale:
>>> [1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252 LC_MONETARY=English_United Kingdom.1252
>>> [4] LC_NUMERIC=C LC_TIME=English_United Kingdom.1252
>>>
>>> attached base packages:
>>> [1] parallel stats graphics grDevices utils datasets methods base
>>>
>>> other attached packages:
>>> [1] XVector_0.4.0 ggbio_1.12.3 ggplot2_0.9.3.1 GenomicRanges_1.16.2 GenomeInfoDb_1.0.2 IRanges_1.22.4 BiocGenerics_0.10.0
>>> [8] BiocInstaller_1.14.2
>>>
>>> loaded via a namespace (and not attached):
>>> [1] AnnotationDbi_1.26.0 BatchJobs_1.2 BBmisc_1.6 Biobase_2.24.0 BiocParallel_0.6.0 biomaRt_2.20.0
>>> [7] Biostrings_2.32.0 biovizBase_1.12.1 bitops_1.0-6 brew_1.0-6 BSgenome_1.32.0 cluster_1.15.2
>>> [13] codetools_0.2-8 colorspace_1.2-4 DBI_0.2-7 dichromat_2.0-0 digest_0.6.4 fail_1.2
>>> [19] foreach_1.4.2 Formula_1.1-1 GenomicAlignments_1.0.0 GenomicFeatures_1.16.0 grid_3.1.0 gridExtra_0.9.1
>>> [25] gtable_0.1.2 Hmisc_3.14-4 iterators_1.0.7 labeling_0.2 lattice_0.20-29 latticeExtra_0.6-26
>>> [31] MASS_7.3-31 munsell_0.4.2 plyr_1.8.1 proto_0.3-10 RColorBrewer_1.0-5 Rcpp_0.11.1
>>> [37] RCurl_1.95-4.1 reshape2_1.4 Rsamtools_1.16.0 RSQLite_0.11.4 rtracklayer_1.24.0 scales_0.2.4
>>> [43] sendmailR_1.1-2 splines_3.1.0 stats4_3.1.0 stringr_0.6.2 survival_2.37-7 tools_3.1.0
>>> [49] VariantAnnotation_1.10.0 XML_3.98-1.1 zlibbioc_1.10.0
>>>
>>>
>>> --
>>> Sent via the guest posting facility at bioconductor.org.
>>>
>>> ________________________________________________________________________
>>> devteam-bioc mailing list
>>> To unsubscribe from this mailing list send a blank email to
>>> devteam-bioc-leave at lists.fhcrc.org
>>> You can also unsubscribe or change your personal options at
>>> https://lists.fhcrc.org/mailman/listinfo/devteam-bioc
>>>
>>
>>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list