[BioC] RNASeq:- getting Zero Count

Thu Aug 8 17:41:30 CEST 2013

Hi,

I think you are on the right track. There are still a couple of things 
that don't look right.

A bam file often has more than one chromosome. In your example I see 
only one chromosome named 'Chromosome'. Is there really only one 
chromosome in the file? The idea behind changing the names of the 
seqlevels is to make something like 'chr22' and '22' match by adding the 
'chr' to the second. Then we have 'chr22' and 'chr22'. (Equivalently you 
could remove the 'chr' to get '22' and '22'). The chromosomes names must 
be preserved at some level so that chromosomes from the bam are being 
overlapped with the same chromosomes from the gff file.

In the counter function below you check the strand of 'aln' but don't do 
anything with it. Did you want to change the strand? As an fyi, you can 
ignore strand when performing overlaps by setting the 'ignore.strand' 
argument to TRUE.

Valerie

On 08/08/2013 12:17 AM, Reema Singh wrote:
>
>
>
> On Tue, Aug 6, 2013 at 9:13 PM, Valerie Obenchain <vobencha at fhcrc.org
> <mailto:vobencha at fhcrc.org>> wrote:
>
>     I would do some investigating with a single bam file.
>
>     Confirm 'gnModel' and 'aln' have some common seqlevels. This call
>     should produce a result.
>
>          intersect(seqlevels(aln), seqlevels(gnModel))
>
>
>    SRR568038    SRR568039    SRR568040
> "Chromosome" "Chromosome" "Chromosome"
>
>
>     Call countOverlaps on a single file.
>
>          co <- countOverlaps(aln, gnModel)
>
>  > aln <- GenomicRanges::readGappedAlignments(bamFiles[1])
>  > head(aln)
> GappedAlignments with 6 alignments and 0 metadata columns:
>          seqnames strand       cigar    qwidth     start       end     width
>             <Rle>  <Rle> <character> <integer> <integer> <integer> <integer>
>    [1] Chromosome      +         39M        39         2        40        39
>    [2] Chromosome      +         39M        39         2        40        39
>    [3] Chromosome      +         39M        39         4        42        39
>    [4] Chromosome      +         39M        39         4        42        39
>    [5] Chromosome      +         39M        39         5        43        39
>    [6] Chromosome      +         39M        39         5        43        39
>             ngap
>        <integer>
>    [1]         0
>    [2]         0
>    [3]         0
>    [4]         0
>    [5]         0
>    [6]         0
>    ---
>    seqlengths:
>     Chromosome
>        4411708
>
>
>     Evidently you only want the bam records that have exactly 5 hits.
>     This could be limiting. To see the distribution of hits make a table
>     of the counts.
>
>          table(co)
>
>  > co <- countOverlaps(aln, gnModel)
>  > table(co)
> co
>         0        1        2
>    575536 33179936    17850
>
> You are right, when I changed the exactly bam hit list from % to all.
> Then I got some count. Here's the table.
>
>  > head(counts)
>            SRR568038 SRR568039 SRR568040
> RVBD_0001       456      1331        94
> RVBD_0002       258      1714        64
> RVBD_0003       270      2090        90
> RVBD_0004       122       536        15
> RVBD_0005      2487     23056       997
> RVBD_0006      2343     23013       814
>
> I hope this is right. Kindly correct me, if you something wrong here.
>
>
> Kind Regards
>
>
>
>
>     Valerie
>
>
>
>     On 08/05/2013 10:05 PM, Reema Singh wrote:
>
>         Hi Valerie,
>
>         Thank you so much for the reply.
>
>         After checking the seqlevels, I am able to get rid off the
>         error, but
>         still getting the zero count entries. Is there any another way
>         of doing
>         this?
>
>         KInd Regards
>
>
>         On Mon, Aug 5, 2013 at 9:21 PM, Valerie Obenchain
>         <vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
>         <mailto:vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>>> wrote:
>
>              Hi Reema,
>
>              To perform overlap or matching operations the seqlevels
>         (chromosome
>              names) of the objects must match. The error message is
>         telling you
>              that some of these do not match. It's reasonable that a few
>         names
>              may not match (maybe a chromosome is present in one object
>         and not
>              the other) but the majority should.
>
>              Check the seqlevels:
>              seqlevels(aln)
>              seqlevels(gnModel)
>
>              Which names are common to both:
>              intersect(seqlevels(gnModel), seqlevels(aln))
>
>                You can rename seqlevels in several different ways. See
>              ?renameSeqlevels or ?seqlevels for examples.
>
>              Valerie
>
>
>              On 08/04/2013 06:35 AM, Reema Singh wrote:
>
>                  Dear All,
>
>                  I am trying to extract the read count from three .bam
>         files. But
>                  I am
>                  getting Zero count entries.
>
>                  I am using Mycobacterium Tuberculosis H37Rv gtf file (
>         ftp://ftp.ensemblgenomes.org/____pub/release-19/bacteria//gtf/____bacteria_1_collection/____mycobacterium_tuberculosis_____h37rv/Mycobacterium_____tuberculosis_h37rv.GCA_____000277735.1.19.gtf.gz
>         <ftp://ftp.ensemblgenomes.org/__pub/release-19/bacteria//gtf/__bacteria_1_collection/__mycobacterium_tuberculosis___h37rv/Mycobacterium___tuberculosis_h37rv.GCA___000277735.1.19.gtf.gz>
>
>         <ftp://ftp.ensemblgenomes.org/__pub/release-19/bacteria//gtf/__bacteria_1_collection/__mycobacterium_tuberculosis___h37rv/Mycobacterium___tuberculosis_h37rv.GCA___000277735.1.19.gtf.gz
>         <ftp://ftp.ensemblgenomes.org/pub/release-19/bacteria//gtf/bacteria_1_collection/mycobacterium_tuberculosis_h37rv/Mycobacterium_tuberculosis_h37rv.GCA_000277735.1.19.gtf.gz>>)
>
>                  and RNASeq data used here were downloaded from (
>         http://www.ncbi.nlm.nih.gov/____geo/query/acc.cgi?acc=GSE40846
>         <http://www.ncbi.nlm.nih.gov/__geo/query/acc.cgi?acc=GSE40846>
>
>         <http://www.ncbi.nlm.nih.gov/__geo/query/acc.cgi?acc=GSE40846
>         <http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE40846>__>__)
>
>                  and aligned
>                  with bowtie2.
>
>
>                  library(GenomicFeatures)
>                  txdb <-
>
>         makeTranscriptDbFromGFF(file="____Mycobacterium_tuberculosis_____h37rv.GCA_000277735.1.19.gtf"__,__format="gtf")
>                  saveDb(txdb,file="____MycoTubeH37Rv.sqlite")
>
>                  load("MycoTubeH37Rv.sqlite")
>                  gnModel <- exonsBy(txdb,"gene") ### *also tried with
>                  "transcripts", "cds",
>                  but getting same *
>
>
>                  bamFiles <- list.files(".", "bam$", full=TRUE)
>                  names(bamFiles) <- sub("\\..*","",basename(____bamFiles))
>                  counter <- function(fl, gnModel){
>                  aln <- GenomicRanges::____readGappedAlignments(fl)
>                  strand(aln)
>                  hits <- countOverlaps(aln,gnModel)
>                  counts <- countOverlaps(gnModel,aln[____hits==5])
>                  names(counts) <- names(gnModel)
>                  counts
>                  }
>
>                  counts <- sapply(bamFiles,counter,____gnModel)
>
>
>                  Note: method with signature ‘Vector#GRangesList’ chosen
>         for function
>                  ‘countOverlaps’,
>                     target signature ‘GappedAlignments#GRangesList’____.
>
>                     "GappedAlignments#Vector" would also be valid
>                  Note: method with signature ‘GRangesList#Vector’ chosen
>         for function
>                  ‘countOverlaps’,
>                     target signature ‘GRangesList#GappedAlignments’____.
>
>                     "Vector#GappedAlignments" would also be valid
>                  Warning messages:
>                  1: In .Seqinfo.mergexy(x, y) :
>                      Each of the 2 combined objects has sequence levels
>         not in
>                  the other:
>                      - in 'x': gi|448814763|ref|NC_000962.3|
>                      - in 'y': Chromosome
>                      Make sure to always combine/compare objects based
>         on the
>                  same reference
>                      genome (use suppressWarnings() to suppress this
>         warning).
>                  2: In .Seqinfo.mergexy(x, y) :
>                      Each of the 2 combined objects has sequence levels
>         not in
>                  the other:
>                      - in 'x': Chromosome
>                      - in 'y': gi|448814763|ref|NC_000962.3|
>                      Make sure to always combine/compare objects based
>         on the
>                  same reference
>                      genome (use suppressWarnings() to suppress this
>         warning).
>                  3: In .Seqinfo.mergexy(x, y) :
>                      Each of the 2 combined objects has sequence levels
>         not in
>                  the other:
>                      - in 'x': gi|448814763|ref|NC_000962.3|
>                      - in 'y': Chromosome
>                      Make sure to always combine/compare objects based
>         on the
>                  same reference
>                      genome (use suppressWarnings() to suppress this
>         warning).
>                  4: In .Seqinfo.mergexy(x, y) :
>                      Each of the 2 combined objects has sequence levels
>         not in
>                  the other:
>                      - in 'x': Chromosome
>                      - in 'y': gi|448814763|ref|NC_000962.3|
>                      Make sure to always combine/compare objects based
>         on the
>                  same reference
>                      genome (use suppressWarnings() to suppress this
>         warning).
>                  5: In .Seqinfo.mergexy(x, y) :
>                      Each of the 2 combined objects has sequence levels
>         not in
>                  the other:
>                      - in 'x': gi|448814763|ref|NC_000962.3|
>                      - in 'y': Chromosome
>                      Make sure to always combine/compare objects based
>         on the
>                  same reference
>                      genome (use suppressWarnings() to suppress this
>         warning).
>                  6: In .Seqinfo.mergexy(x, y) :
>                      Each of the 2 combined objects has sequence levels
>         not in
>                  the other:
>                      - in 'x': Chromosome
>                      - in 'y': gi|448814763|ref|NC_000962.3|
>                      Make sure to always combine/compare objects based
>         on the
>                  same reference
>                      genome (use suppressWarnings() to suppress this
>         warning).
>
>                  head(counts)
>
>                              SRR568038 SRR568039 SRR568040
>                  RVBD_0001         0         0         0
>                  RVBD_0002         0         0         0
>                  RVBD_0003         0         0         0
>                  RVBD_0004         0         0         0
>                  RVBD_0005         0         0         0
>                  RVBD_0006         0         0         0
>
>                      sessionInfo()
>
>                  R version 3.0.1 (2013-05-16)
>                  Platform: x86_64-redhat-linux-gnu (64-bit)
>
>                  locale:
>                     [1] LC_CTYPE=en_US.utf8       LC_NUMERIC=C
>                     [3] LC_TIME=en_US.utf8        LC_COLLATE=en_US.utf8
>                     [5] LC_MONETARY=en_US.utf8    LC_MESSAGES=en_US.utf8
>                     [7] LC_PAPER=C                LC_NAME=C
>                     [9] LC_ADDRESS=C              LC_TELEPHONE=C
>                  [11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C
>
>                  attached base packages:
>                  [1] parallel  stats     graphics  grDevices utils
>         datasets
>                    methods
>                  [8] base
>
>                  other attached packages:
>                  [1] Rsamtools_1.12.3       Biostrings_2.28.0
>                    GenomicFeatures_1.12.3
>                  [4] AnnotationDbi_1.22.6   Biobase_2.20.1
>                  GenomicRanges_1.12.4
>                  [7] IRanges_1.18.2         BiocGenerics_0.6.0
>
>                  loaded via a namespace (and not attached):
>                     [1] biomaRt_2.16.0     bitops_1.0-5
>         BSgenome_1.28.0
>                    DBI_0.2-6
>
>                     [5] RCurl_1.95-4.1     RSQLite_0.11.3
>         rtracklayer_1.20.4
>                  stats4_3.0.1
>
>                     [9] tools_3.0.1        XML_3.96-1.1       zlibbioc_1.6.0
>
>
>                  Kind regards
>
>
>
>
>                  ___________________________________________________
>                  Bioconductor mailing list
>         Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>         <mailto:Bioconductor at r-__project.org
>         <mailto:Bioconductor at r-project.org>>
>         https://stat.ethz.ch/mailman/____listinfo/bioconductor
>         <https://stat.ethz.ch/mailman/__listinfo/bioconductor>
>
>                  <https://stat.ethz.ch/mailman/__listinfo/bioconductor
>         <https://stat.ethz.ch/mailman/listinfo/bioconductor>>
>                  Search the archives:
>         http://news.gmane.org/gmane.____science.biology.informatics.____conductor
>         <http://news.gmane.org/gmane.__science.biology.informatics.__conductor>
>
>
>         <http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>         <http://news.gmane.org/gmane.science.biology.informatics.conductor>>
>
>
>
>
>         --
>         Reema Singh
>         PhD Scholar
>         Computational Biology and Bioinformatics
>         School of Computational and Integrative Sciences
>         Jawaharlal Nehru University
>         New Delhi-110067
>         INDIA
>
>
>
>
>
> --
> Reema Singh
> PhD Scholar
> Computational Biology and Bioinformatics
> School of Computational and Integrative Sciences
> Jawaharlal Nehru University
> New Delhi-110067
> INDIA