[BioC] Re-send directly to BioC on miRNA annoation
Valerie Obenchain
vobencha at fhcrc.org
Tue May 8 18:35:29 CEST 2012
Hi William,
countOverlaps() was deprecated in R-2.14 and has been replaced by
summarizeOverlaps(). You'll want to update R and use
summarizeOverlaps(). When posting a question please remember to provide
the output of sessionInfo().
Valerie
On 05/07/2012 02:52 PM, Wu, Huiyun wrote:
> Dear Steve,
>
>
>
> thank you for your feedback. Below is what it returned.
>
>
>
>> st<- sum(counts==0)
>> st
> [1] 12296
>
>
>
> Let me give more details of my situation. Say, I have read in 6 BAM files (s1 to s6). and then got a count table (filename = countTable) using countOverlaps function with resultant file head as below. Question here, is the rowname (first col) miRNA ID?
>
>
>
>> colnames(countTable)<- samples
>> head(countTable)
> s1 s2 s3 s4 s5 s6
>
> 1 1 1 3 0 2 2
>
> 10 0 0 1 0 1 0
>
> 100 0 1 0 1 3 3
>
> 1000 0 7 1 1 3 3
>
> 10000 0 6 2 8 16 18
>
> 100008586 0 0 0 0 0 0
>
>> dim(countTable)
> [1] 22932 6
>
>
>
>
>
> I then filtered out those miRNAs whose expressions were 0 across all samples by the following code (it might be what you asked for).
>
>
>
>> newCountTable<- countTable[rowSums(countTable) != 0, ]
>> dim(newCountTable)
> [1] 22105 6
>
>> head(newCountTable)
> s1 s2 s3 s4 s5 s6
>
> 1 1 1 3 0 2 2
>
> 10 0 0 1 0 1 0
>
> 100 0 1 0 1 3 3
>
> 1000 0 7 1 1 3 3
>
> 10000 0 6 2 8 16 18
>
> 100009676 0 0 0 1 3 4
>
>
>
>
>
> After annotating using the following library, I got another count table.
>
>
>
> entrezGenes<- rownames(newCountTable)
>
> mapped_genes<- mappedkeys(org.Hs.egSYMBOL)
>
> overlapgenes<- intersect(entrezGenes, mapped_genes)
>
> overlapsymbols<- as.list(org.Hs.egSYMBOL[overlapgenes])
>
> getGeneSymbol<- function(x) {
>
> if (x %in% overlapgenes) {
>
> return(overlapsymbols[[which(x == names(overlapsymbols))]])
>
> }
>
> else {
>
> return(NA)
>
> }
>
> }
>
>
>
> geneSymbols<- sapply(entrezGenes, getGeneSymbol)
>
> countTable<- cbind(geneSymbols, numberOfexons[names(numberOfexons) %in% entrezGenes], geneLength[names(geneLength) %in% entrezGenes],
>
> newCountTable)
>
>
>
>> dim(countTable)
> [1] 22105 9
>
>
>
> I further removed redundant IDs and NAs to generate an updated count table (dataset named 'd') for DE analysis
>
>
>
>> dim(d)
> [1] 22100 7
>
>
>
>> d[20:30,]
> X s1 s2 s3 s4 s5 s6
>
> 20 7-Mar 1 2 2 11 12 5
>
> 21 7-Sep 2 6 4 6 19 9
>
> 22 8-Mar 1 2 0 1 6 4
>
> 23 8-Sep 0 4 4 4 3 10
>
> 24 9-Mar 1 3 5 1 6 9
>
> 25 9-Sep 2 5 8 10 35 20
>
> 26 A1BG 1 1 3 0 2 2
>
> 27 A1BG-AS1 0 1 1 0 0 4
>
> 28 A1CF 0 2 1 3 9 1
>
> 29 A2LD1 1 1 2 0 1 6
>
> 30 A2M 5 53 48 70 58 114
>
>
>
> It looks like there are 25 technical controls, but majority were annotated.
>
>
>
> thanks,
>
>
>
> ---
>
> William Wu
>
> Vanderbilt University/Department of Biostatistics.
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> -----Original Message-----
> From: Steve Lianoglou [mailto:mailinglist.honeypot at gmail.com]
> Sent: Monday, May 07, 2012 1:59 PM
> To: Wu, Huiyun
> Cc: bioconductor at r-project.org<mailto:bioconductor at r-project.org>
> Subject: Re: [BioC] miRNA annotation
>
>
>
> Hi,
>
>
>
> On Mon, May 7, 2012 at 2:15 PM, Wu, Huiyun<william.wu at vanderbilt.edu<mailto:william.wu at vanderbilt.edu>> wrote:
>
>> Hello there,
>> I am new in seq data analysis, and have difficulty in miRNA annotation. To clarify my question, I have the following codes.
>> bamfile<- "GSE33858_1.bowtie_alignment.bam"
>> aln<- readBamGappedAlignments(bamfile)
>> txdb<-makeTranscriptDbFromUCSC(genome="hg19",tablename="knownGene")
>> exonRangesList<- exonsBy(txdb, by = "gene")
>> exonRanges<- unlist(exonRangesList)
>> strand(exonRanges)<- "*"
>> geneNames<- sub("\\..*$<file:///\\..*$>", "", names(exonRanges))
>> exonRangesListNoStrand<- split(exonRanges, geneNames)
>> numberOfexons<- sapply(width(exonRangesListNoStrand), length)
>> geneLength<- sum(width(reduce(exonRangesListNoStrand)))
>> # Counting reads for the BAM file
>> counts<- countOverlaps(exonRangesListNoStrand, aln)
>> names(counts)<- names(exonRangesListNoStrand)
>> > length(counts)
>> [1] 22932
>> So my question is from aligned file to annotated file. Specifically, why I got 22932 features after mapping to hg19? I learned there are only less than 2000 miRNAs matched genes. I also know there are several RNA databases including miRBase for annotation. Did I do something logically wrong? It would be greatly appreciated if someone could show me how to implement this annotation in R.
>
>
> What does `sum(counts == 0)` give you?
>
>
>
> -steve
>
>
>
> --
>
> Steve Lianoglou
>
> Graduate Student: Computational Systems Biology
>
> | Memorial Sloan-Kettering Cancer Center
>
> | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
>
>
>
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor
mailing list