[BioC] miRNA annotation

Tue May 8 03:18:45 CEST 2012

Hi William,

I'm actually now at a bit of a loss as to what you are asking.

(1) Do you want someone to check your code to see if you did the
annotations correctly? Or

(2)  are you curious why you are getting reads from (what I guess is)
a small RNA library that seem to be coming from full length mRNAs?

If it's the first, why not try loading your BAM file up in something
like IGV and hopping over to some of the non-miRNA genes that your
count table says you have reads for and see if you see in the browser
what your table says.

If it's the second question, I guess you'd have to go through the
protocol and see how they prepped the library. You might start by
looking to see when (if) they did the size selection of the RNA/cDNA
to enrich for small RNA's.

HTH,
-steve

On Mon, May 7, 2012 at 4:00 PM, Wu, Huiyun <william.wu at vanderbilt.edu> wrote:
> Dear Steve,
>
> thank you for your feedback. Below is what it returned.
>
>> st <- sum(counts==0)
>> st
> [1] 12296
>
> Let me give more details of my situation. Say, I have read in 6 BAM files
> (s1 to s6). and then got a count table (filename = countTable) using
> countOverlaps function with resultant file head as below. Question here, is
> the rowname (first col) miRNA ID?
>
>>    colnames(countTable) <- samples
>>   head(countTable)
>           s1 s2 s3 s4 s5 s6
> 1          1  1  3  0  2  2
> 10         0  0  1  0  1  0
> 100        0  1  0  1  3  3
> 1000       0  7  1  1  3  3
> 10000      0  6  2  8 16 18
> 100008586  0  0  0  0  0  0
>>    dim(countTable)
> [1] 22932     6
> I then filtered out those miRNAs whose expressions were 0 across all samples
> by the following code (it might be what you asked for).
>
>>   newCountTable <- countTable[rowSums(countTable) != 0, ]
>>    dim(newCountTable)
> [1] 22105     6
>>    head(newCountTable)
>           s1 s2 s3 s4 s5 s6
> 1          1  1  3  0  2  2
> 10         0  0  1  0  1  0
> 100        0  1  0  1  3  3
> 1000       0  7  1  1  3  3
> 10000      0  6  2  8 16 18
> 100009676  0  0  0  1  3  4
>
>
> After annotating using the following library, I got another count table.
>
> entrezGenes <- rownames(newCountTable)
>    mapped_genes <- mappedkeys(org.Hs.egSYMBOL)
>    overlapgenes <- intersect(entrezGenes, mapped_genes)
>    overlapsymbols <- as.list(org.Hs.egSYMBOL[overlapgenes])
>    getGeneSymbol <- function(x) {
>    if (x %in% overlapgenes) {
>              return(overlapsymbols[[which(x == names(overlapsymbols))]])
>       }
>       else {
>      return(NA)
>        }
>     }
>
>     geneSymbols <- sapply(entrezGenes, getGeneSymbol)
>     countTable <- cbind(geneSymbols, numberOfexons[names(numberOfexons) %in%
> entrezGenes], geneLength[names(geneLength) %in%  entrezGenes],
>                   newCountTable)
>
>> dim(countTable)
> [1] 22105     9
>
> I further removed redundant IDs and NAs to generate an updated count table
> (dataset named 'd') for DE analysis
>
>> dim(d)
> [1] 22100     7
>
>> d[20:30,]
>           X s1 s2 s3 s4 s5  s6
> 20    7-Mar  1  2  2 11 12   5
> 21    7-Sep  2  6  4  6 19   9
> 22    8-Mar  1  2  0  1  6   4
> 23    8-Sep  0  4  4  4  3  10
> 24    9-Mar  1  3  5  1  6   9
> 25    9-Sep  2  5  8 10 35  20
> 26     A1BG  1  1  3  0  2   2
> 27 A1BG-AS1  0  1  1  0  0   4
> 28     A1CF  0  2  1  3  9   1
> 29    A2LD1  1  1  2  0  1   6
> 30      A2M  5 53 48 70 58 114
>
> It looks like there are 25 technical controls, but majority were annotated.
>
> thanks,
>
> ---
> William Wu
> Vanderbilt University/Department of Biostatistics.
>
>
>
>
>
>
>
> -----Original Message-----
> From: Steve Lianoglou [mailto:mailinglist.honeypot at gmail.com]
> Sent: Monday, May 07, 2012 1:59 PM
> To: Wu, Huiyun
> Cc: bioconductor at r-project.org
> Subject: Re: [BioC] miRNA annotation
>
> Hi,
>
> On Mon, May 7, 2012 at 2:15 PM, Wu, Huiyun <william.wu at vanderbilt.edu>
> wrote:
>> Hello there,
>>
>> I am new in seq data analysis, and have difficulty in miRNA annotation. To
>> clarify my question, I have the following codes.
>>
>>
>>   bamfile <- "GSE33858_1.bowtie_alignment.bam"
>>   aln <- readBamGappedAlignments(bamfile)
>>   txdb<-makeTranscriptDbFromUCSC(genome="hg19",tablename="knownGene")
>>   exonRangesList <- exonsBy(txdb, by = "gene")
>>   exonRanges <- unlist(exonRangesList)
>>   strand(exonRanges) <- "*"
>>   geneNames <- sub("\\..*$", "", names(exonRanges))
>>   exonRangesListNoStrand <- split(exonRanges, geneNames)
>>   numberOfexons <- sapply(width(exonRangesListNoStrand), length)
>>   geneLength <- sum(width(reduce(exonRangesListNoStrand)))
>>   # Counting reads for the BAM file
>>   counts <- countOverlaps(exonRangesListNoStrand, aln)
>>   names(counts) <- names(exonRangesListNoStrand)
>>
>>   > length(counts)
>>
>>       [1] 22932
>>
>> So my question is from aligned file to annotated file. Specifically, why I
>> got 22932 features after mapping to hg19? I learned there are only less than
>> 2000 miRNAs matched genes. I also know there are several RNA databases
>> including miRBase for annotation. Did I do something logically wrong? It
>> would be greatly appreciated if someone could show me how to implement this
>> annotation in R.
>
> What does `sum(counts == 0)` give you?
>
> -steve
>
> --
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
>  | Memorial Sloan-Kettering Cancer Center
>  | Weill Medical College of Cornell University Contact Info:
> http://cbio.mskcc.org/~lianos/contact
>
>

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact