[BioC] FW: miRNA annotation

Tue May 8 16:07:21 CEST 2012

Hi Steve, 

thank you very much for your suggestion. I guess what I'd like to know is whether it makes sense to have 22100 genes annotated from a miRNA BAM file given (as I know) only about 2000 miRNAs available in the library. If that does not make sense, then it could be due to (1) wrong coding, or (2) misuse of library, which I just can't figure out. The length of miRNAs in my file is 16-23. Anyway, I will think about your suggestions. 

thanks again, 

William 

-----Original Message-----
From: Steve Lianoglou [mailto:mailinglist.honeypot at gmail.com] 
Sent: Monday, May 07, 2012 8:19 PM
To: Wu, Huiyun
Cc: bioconductor at r-project.org
Subject: Re: [BioC] miRNA annotation

Hi William,

I'm actually now at a bit of a loss as to what you are asking.

(1) Do you want someone to check your code to see if you did the annotations correctly? Or

(2)  are you curious why you are getting reads from (what I guess is) a small RNA library that seem to be coming from full length mRNAs?

If it's the first, why not try loading your BAM file up in something like IGV and hopping over to some of the non-miRNA genes that your count table says you have reads for and see if you see in the browser what your table says.

If it's the second question, I guess you'd have to go through the protocol and see how they prepped the library. You might start by looking to see when (if) they did the size selection of the RNA/cDNA to enrich for small RNA's.

HTH,
-steve

On Mon, May 7, 2012 at 4:00 PM, Wu, Huiyun <william.wu at vanderbilt.edu> wrote:
> Dear Steve,
>
> thank you for your feedback. Below is what it returned.
>
>> st <- sum(counts==0)
>> st
> [1] 12296
>
> Let me give more details of my situation. Say, I have read in 6 BAM 
> files
> (s1 to s6). and then got a count table (filename = countTable) using 
> countOverlaps function with resultant file head as below. Question 
> here, is the rowname (first col) miRNA ID?
>
>>    colnames(countTable) <- samples
>>   head(countTable)
>           s1 s2 s3 s4 s5 s6
> 1          1  1  3  0  2  2
> 10         0  0  1  0  1  0
> 100        0  1  0  1  3  3
> 1000       0  7  1  1  3  3
> 10000      0  6  2  8 16 18
> 100008586  0  0  0  0  0  0
>>    dim(countTable)
> [1] 22932     6
> I then filtered out those miRNAs whose expressions were 0 across all 
> samples by the following code (it might be what you asked for).
>
>>   newCountTable <- countTable[rowSums(countTable) != 0, ]
>>    dim(newCountTable)
> [1] 22105     6
>>    head(newCountTable)
>           s1 s2 s3 s4 s5 s6
> 1          1  1  3  0  2  2
> 10         0  0  1  0  1  0
> 100        0  1  0  1  3  3
> 1000       0  7  1  1  3  3
> 10000      0  6  2  8 16 18
> 100009676  0  0  0  1  3  4
>
>
> After annotating using the following library, I got another count table.
>
> entrezGenes <- rownames(newCountTable)
>    mapped_genes <- mappedkeys(org.Hs.egSYMBOL)
>    overlapgenes <- intersect(entrezGenes, mapped_genes)
>    overlapsymbols <- as.list(org.Hs.egSYMBOL[overlapgenes])
>    getGeneSymbol <- function(x) {
>    if (x %in% overlapgenes) {
>              return(overlapsymbols[[which(x == 
> names(overlapsymbols))]])
>       }
>       else {
>      return(NA)
>        }
>     }
>
>     geneSymbols <- sapply(entrezGenes, getGeneSymbol)
>     countTable <- cbind(geneSymbols, 
> numberOfexons[names(numberOfexons) %in% entrezGenes], 
> geneLength[names(geneLength) %in%  entrezGenes],
>                   newCountTable)
>
>> dim(countTable)
> [1] 22105     9
>
> I further removed redundant IDs and NAs to generate an updated count 
> table (dataset named 'd') for DE analysis
>
>> dim(d)
> [1] 22100     7
>
>> d[20:30,]
>           X s1 s2 s3 s4 s5  s6
> 20    7-Mar  1  2  2 11 12   5
> 21    7-Sep  2  6  4  6 19   9
> 22    8-Mar  1  2  0  1  6   4
> 23    8-Sep  0  4  4  4  3  10
> 24    9-Mar  1  3  5  1  6   9
> 25    9-Sep  2  5  8 10 35  20
> 26     A1BG  1  1  3  0  2   2
> 27 A1BG-AS1  0  1  1  0  0   4
> 28     A1CF  0  2  1  3  9   1
> 29    A2LD1  1  1  2  0  1   6
> 30      A2M  5 53 48 70 58 114
>
> It looks like there are 25 technical controls, but majority were annotated.
>
> thanks,
>
> ---
> William Wu
> Vanderbilt University/Department of Biostatistics.
>
>
>
>
>
>
>
> -----Original Message-----
> From: Steve Lianoglou [mailto:mailinglist.honeypot at gmail.com]
> Sent: Monday, May 07, 2012 1:59 PM
> To: Wu, Huiyun
> Cc: bioconductor at r-project.org
> Subject: Re: [BioC] miRNA annotation
>
> Hi,
>
> On Mon, May 7, 2012 at 2:15 PM, Wu, Huiyun <william.wu at vanderbilt.edu>
> wrote:
>> Hello there,
>>
>> I am new in seq data analysis, and have difficulty in miRNA 
>> annotation. To clarify my question, I have the following codes.
>>
>>
>>   bamfile <- "GSE33858_1.bowtie_alignment.bam"
>>   aln <- readBamGappedAlignments(bamfile)
>>   txdb<-makeTranscriptDbFromUCSC(genome="hg19",tablename="knownGene")
>>   exonRangesList <- exonsBy(txdb, by = "gene")
>>   exonRanges <- unlist(exonRangesList)
>>   strand(exonRanges) <- "*"
>>   geneNames <- sub("\\..*$", "", names(exonRanges))
>>   exonRangesListNoStrand <- split(exonRanges, geneNames)
>>   numberOfexons <- sapply(width(exonRangesListNoStrand), length)
>>   geneLength <- sum(width(reduce(exonRangesListNoStrand)))
>>   # Counting reads for the BAM file
>>   counts <- countOverlaps(exonRangesListNoStrand, aln)
>>   names(counts) <- names(exonRangesListNoStrand)
>>
>>   > length(counts)
>>
>>       [1] 22932
>>
>> So my question is from aligned file to annotated file. Specifically, 
>> why I got 22932 features after mapping to hg19? I learned there are 
>> only less than
>> 2000 miRNAs matched genes. I also know there are several RNA 
>> databases including miRBase for annotation. Did I do something 
>> logically wrong? It would be greatly appreciated if someone could 
>> show me how to implement this annotation in R.
>
> What does `sum(counts == 0)` give you?
>
> -steve
>
> --
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
>  | Memorial Sloan-Kettering Cancer Center
>  | Weill Medical College of Cornell University Contact Info:
> http://cbio.mskcc.org/~lianos/contact
>
>

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact