[BioC] Re-send directly to BioC on miRNA annoation

Tue May 8 21:27:29 CEST 2012

Hi Valerie, 

thank you for your helpful suggestions. I will try it. 

best, 

William

-----Original Message-----
From: Valerie Obenchain [mailto:vobencha at fhcrc.org] 
Sent: Tuesday, May 08, 2012 11:35 AM
To: Wu, Huiyun
Cc: bioconductor at r-project.org; Steve Lianoglou
Subject: Re: [BioC] Re-send directly to BioC on miRNA annoation

Hi William,

countOverlaps() was deprecated in R-2.14 and has been replaced by summarizeOverlaps(). You'll want to update R and use summarizeOverlaps(). When posting a question please remember to provide the output of sessionInfo().

Valerie

On 05/07/2012 02:52 PM, Wu, Huiyun wrote:
> Dear Steve,
>
>
>
> thank you for your feedback. Below is what it returned.
>
>
>
>> st<- sum(counts==0)
>> st
> [1] 12296
>
>
>
> Let me give more details of my situation. Say, I have read in 6 BAM files (s1 to s6). and then got a count table (filename = countTable) using countOverlaps function with resultant file head as below. Question here, is the rowname (first col) miRNA ID?
>
>
>
>>     colnames(countTable)<- samples
>>    head(countTable)
>            s1 s2 s3 s4 s5 s6
>
> 1          1  1  3  0  2  2
>
> 10         0  0  1  0  1  0
>
> 100        0  1  0  1  3  3
>
> 1000       0  7  1  1  3  3
>
> 10000      0  6  2  8 16 18
>
> 100008586  0  0  0  0  0  0
>
>>     dim(countTable)
> [1] 22932     6
>
>
>
>
>
> I then filtered out those miRNAs whose expressions were 0 across all samples by the following code (it might be what you asked for).
>
>
>
>>    newCountTable<- countTable[rowSums(countTable) != 0, ]
>>     dim(newCountTable)
> [1] 22105     6
>
>>     head(newCountTable)
>            s1 s2 s3 s4 s5 s6
>
> 1          1  1  3  0  2  2
>
> 10         0  0  1  0  1  0
>
> 100        0  1  0  1  3  3
>
> 1000       0  7  1  1  3  3
>
> 10000      0  6  2  8 16 18
>
> 100009676  0  0  0  1  3  4
>
>
>
>
>
> After annotating  using the following library, I got another count table.
>
>
>
> entrezGenes<- rownames(newCountTable)
>
>     mapped_genes<- mappedkeys(org.Hs.egSYMBOL)
>
>     overlapgenes<- intersect(entrezGenes, mapped_genes)
>
>     overlapsymbols<- as.list(org.Hs.egSYMBOL[overlapgenes])
>
>     getGeneSymbol<- function(x) {
>
>     if (x %in% overlapgenes) {
>
>               return(overlapsymbols[[which(x == 
> names(overlapsymbols))]])
>
>        }
>
>        else {
>
>       return(NA)
>
>         }
>
>      }
>
>
>
>      geneSymbols<- sapply(entrezGenes, getGeneSymbol)
>
>      countTable<- cbind(geneSymbols, 
> numberOfexons[names(numberOfexons) %in% entrezGenes], 
> geneLength[names(geneLength) %in%  entrezGenes],
>
>                    newCountTable)
>
>
>
>> dim(countTable)
> [1] 22105     9
>
>
>
> I further removed redundant IDs and NAs to generate an updated count 
> table (dataset named 'd') for DE analysis
>
>
>
>> dim(d)
> [1] 22100     7
>
>
>
>> d[20:30,]
>            X s1 s2 s3 s4 s5  s6
>
> 20    7-Mar  1  2  2 11 12   5
>
> 21    7-Sep  2  6  4  6 19   9
>
> 22    8-Mar  1  2  0  1  6   4
>
> 23    8-Sep  0  4  4  4  3  10
>
> 24    9-Mar  1  3  5  1  6   9
>
> 25    9-Sep  2  5  8 10 35  20
>
> 26     A1BG  1  1  3  0  2   2
>
> 27 A1BG-AS1  0  1  1  0  0   4
>
> 28     A1CF  0  2  1  3  9   1
>
> 29    A2LD1  1  1  2  0  1   6
>
> 30      A2M  5 53 48 70 58 114
>
>
>
> It looks like there are 25 technical controls, but majority were annotated.
>
>
>
> thanks,
>
>
>
> ---
>
> William Wu
>
> Vanderbilt University/Department of Biostatistics.
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> -----Original Message-----
> From: Steve Lianoglou [mailto:mailinglist.honeypot at gmail.com]
> Sent: Monday, May 07, 2012 1:59 PM
> To: Wu, Huiyun
> Cc: bioconductor at r-project.org<mailto:bioconductor at r-project.org>
> Subject: Re: [BioC] miRNA annotation
>
>
>
> Hi,
>
>
>
> On Mon, May 7, 2012 at 2:15 PM, Wu, Huiyun<william.wu at vanderbilt.edu<mailto:william.wu at vanderbilt.edu>>  wrote:
>
>> Hello there,
>> I am new in seq data analysis, and have difficulty in miRNA annotation. To clarify my question, I have the following codes.
>>    bamfile<- "GSE33858_1.bowtie_alignment.bam"
>>    aln<- readBamGappedAlignments(bamfile)
>>    txdb<-makeTranscriptDbFromUCSC(genome="hg19",tablename="knownGene")
>>    exonRangesList<- exonsBy(txdb, by = "gene")
>>    exonRanges<- unlist(exonRangesList)
>>    strand(exonRanges)<- "*"
>>    geneNames<- sub("\\..*$<file:///\\..*$>", "", names(exonRanges))
>>    exonRangesListNoStrand<- split(exonRanges, geneNames)
>>    numberOfexons<- sapply(width(exonRangesListNoStrand), length)
>>    geneLength<- sum(width(reduce(exonRangesListNoStrand)))
>>    # Counting reads for the BAM file
>>    counts<- countOverlaps(exonRangesListNoStrand, aln)
>>    names(counts)<- names(exonRangesListNoStrand)
>>    >  length(counts)
>>        [1] 22932
>> So my question is from aligned file to annotated file. Specifically, why I got 22932 features after mapping to hg19? I learned there are only less than 2000 miRNAs matched genes. I also know there are several RNA databases including miRBase for annotation. Did I do something logically wrong? It would be greatly appreciated if someone could show me how to implement this annotation in R.
>
>
> What does `sum(counts == 0)` give you?
>
>
>
> -steve
>
>
>
> --
>
> Steve Lianoglou
>
> Graduate Student: Computational Systems Biology
>
>   | Memorial Sloan-Kettering Cancer Center
>
>   | Weill Medical College of Cornell University Contact Info: 
> http://cbio.mskcc.org/~lianos/contact
>
>
>
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: 
> http://news.gmane.org/gmane.science.biology.informatics.conductor