[BioC] Coverage by base

Thu Oct 13 00:52:49 CEST 2011

On 11-10-12 02:08 PM, rohan bareja wrote:
> Hi Herve,
>
>
> Thanks Herve,Steve and Michael..I really appreciate for the help.I have been able to get the summaries as explained by you.
> As you can see below my result>viewSums(vl)
> SimpleIntegerList of length 93
> [["chr1"]] 556640 0 0 0 0 1085 3337 ... 209237 781 72987 154846 52256 168230[["chr10"]] 0 0 0 0 327 0 0 1065 0 0 0 ... 122 0 24495 29181 0 0 0 3337 0 14239
>
>
>   >viewSums(vl)[["chr1"]]
>     [1]  556640       0       0       0       0    1085    3337     355       0  [10]     497     445       0   17466    1065   34932    1065   27665    5251
> However,while getting the names of the genes from gene_ranges,I am getting the following error.Do you have any idea about this?
>
>   >  genes<- unlist(gene_ranges,use.names=FALSE)
> genesGRanges with 24774 ranges and 0 elementMetadata values              seqnames                 ranges strand   |<Rle>                <IRanges>    <Rle>     |    [1]          chr19 [ 58858172,  58864865]      -   |    [2]           chr8 [ 18248755,  18258723]      +   |    [3]          chr20 [ 43248163,  43280376]      -   |    [4]          chr18 [ 25530930,  25757445]      -   |    [5]           chr1 [243651535, 244006886]      -   |
>
>    >names(genes)<- names(gene_ranges)
> Error in `rownames<-`(`*tmp*`, value = c("1", "10", "100", "1000", "10000",  :   missing values not allowed in rownames

sessionInfo and reproducible example please? In particular, how did you
generate gene_ranges? As I said previously, you need to make sure that
every element in your GRangesList object 'gene_ranges' contains only 1
range. Otherwise you cannot assume 1-to-1 correspondence between its
elements and the elements of the GRanges object you get after
unlisting:

   > grl <- GRangesList(GRanges("chr1", IRanges(3, 9)),
                        GRanges("chr2", IRanges(2:1, 5)))
   > names(grl) <- c("geneA", "geneB")
   > gr <- unlist(grl)
   > names(gr) <- names(grl)
   Error in `rownames<-`(`*tmp*`, value = c("geneA", "geneB", NA)) :
     missing values not allowed in rownames

Yeah the error message is not helping a lot :-/

For example, if you use makeTranscriptDbFromUCSC() to get those genes,
it's very likely that you will face this problem:

   > txdb <- makeTranscriptDbFromUCSC(genome="hg19", tablename="refGene")
   > txbygene <- transcriptsBy(txdb, by="gene")
   > gene_ranges <- range(txbygene)
   > table(elementLengths(gene_ranges))

       1     2     3     4     5     6     7     8
   22721   311    15     7    18    40    73    66

   > gene_ranges[which(elementLengths(gene_ranges) == 8L)[1L]]
   GRangesList of length 1
   $100129636
   GRanges with 8 ranges and 0 elementMetadata values
             seqnames               ranges strand |
                <Rle>            <IRanges>  <Rle> |
   [1]           chr6 [29004000, 29044517]      + |
   [2] chr6_ssto_hap7 [  344880,   385401]      + |
   [3]  chr6_mcf_hap5 [  307555,   348067]      + |
   [4]  chr6_cox_hap2 [  522820,   563332]      + |
   [5] chr6_mann_hap4 [  307401,   347917]      + |
   [6]  chr6_apd_hap1 [  307442,   315573]      + |
   [7]  chr6_qbl_hap6 [  307397,   347916]      + |
   [8]  chr6_dbb_hap3 [  307411,   347926]      + |

   seqlengths
                   chr1                  chr2 ... chr18_gl000207_random
              249250621             243199373 ...                  4262

This just means that Entrez Gene ID 100129636 has transcripts on
reference sequences that are not the main chromosomes. So you need
to take extra steps to address this. One approach is to unlist
anyway and propagate the names of the genes by repeating them:

   > genes <- unlist(gene_ranges, use.names=FALSE)
   > names(genes) <- rep.int(names(gene_ranges),
                             elementLengths(gene_ranges))

(Note that the above will only work with the devel version of
GenomicFeatures, which requires R 2.14, because the current release
version of the package doesn't allow duplicate names in a GRanges
or GRangesList object).

Then subset 'genes' to get rid of the ranges that are not on the
main chromosomes:

   > main_chroms <- paste("chr", c(1:22, "X", "Y", "M"), sep="")
   > genes0 <- genes[seqnames(genes) %in% main_chroms]

Then check that the gene names are now unique (with
any(duplicated(names(genes0)))). If they are not, then again you'll
need to choose a reasonable strategy to get rid of the duplicates.

Note that it's not a strict requirement to have 1 range per gene,
it'll just be more convenient to summarize things at the gene
level e.g. when you do viewSums() on your SimpleRleViewsList object.

Cheers,
H.

>
>
>
> Thanks,Rohan
>
>
>
>
> 	[[alternative HTML version deleted]]
>
>
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319