[BioC] RNASeq:- getting Zero Count
Valerie Obenchain
vobencha at fhcrc.org
Thu Aug 8 17:41:30 CEST 2013
Hi,
I think you are on the right track. There are still a couple of things
that don't look right.
A bam file often has more than one chromosome. In your example I see
only one chromosome named 'Chromosome'. Is there really only one
chromosome in the file? The idea behind changing the names of the
seqlevels is to make something like 'chr22' and '22' match by adding the
'chr' to the second. Then we have 'chr22' and 'chr22'. (Equivalently you
could remove the 'chr' to get '22' and '22'). The chromosomes names must
be preserved at some level so that chromosomes from the bam are being
overlapped with the same chromosomes from the gff file.
In the counter function below you check the strand of 'aln' but don't do
anything with it. Did you want to change the strand? As an fyi, you can
ignore strand when performing overlaps by setting the 'ignore.strand'
argument to TRUE.
Valerie
On 08/08/2013 12:17 AM, Reema Singh wrote:
>
>
>
> On Tue, Aug 6, 2013 at 9:13 PM, Valerie Obenchain <vobencha at fhcrc.org
> <mailto:vobencha at fhcrc.org>> wrote:
>
> I would do some investigating with a single bam file.
>
> Confirm 'gnModel' and 'aln' have some common seqlevels. This call
> should produce a result.
>
> intersect(seqlevels(aln), seqlevels(gnModel))
>
>
> SRR568038 SRR568039 SRR568040
> "Chromosome" "Chromosome" "Chromosome"
>
>
> Call countOverlaps on a single file.
>
> co <- countOverlaps(aln, gnModel)
>
> > aln <- GenomicRanges::readGappedAlignments(bamFiles[1])
> > head(aln)
> GappedAlignments with 6 alignments and 0 metadata columns:
> seqnames strand cigar qwidth start end width
> <Rle> <Rle> <character> <integer> <integer> <integer> <integer>
> [1] Chromosome + 39M 39 2 40 39
> [2] Chromosome + 39M 39 2 40 39
> [3] Chromosome + 39M 39 4 42 39
> [4] Chromosome + 39M 39 4 42 39
> [5] Chromosome + 39M 39 5 43 39
> [6] Chromosome + 39M 39 5 43 39
> ngap
> <integer>
> [1] 0
> [2] 0
> [3] 0
> [4] 0
> [5] 0
> [6] 0
> ---
> seqlengths:
> Chromosome
> 4411708
>
>
> Evidently you only want the bam records that have exactly 5 hits.
> This could be limiting. To see the distribution of hits make a table
> of the counts.
>
> table(co)
>
> > co <- countOverlaps(aln, gnModel)
> > table(co)
> co
> 0 1 2
> 575536 33179936 17850
>
> You are right, when I changed the exactly bam hit list from % to all.
> Then I got some count. Here's the table.
>
> > head(counts)
> SRR568038 SRR568039 SRR568040
> RVBD_0001 456 1331 94
> RVBD_0002 258 1714 64
> RVBD_0003 270 2090 90
> RVBD_0004 122 536 15
> RVBD_0005 2487 23056 997
> RVBD_0006 2343 23013 814
>
> I hope this is right. Kindly correct me, if you something wrong here.
>
>
> Kind Regards
>
>
>
>
> Valerie
>
>
>
> On 08/05/2013 10:05 PM, Reema Singh wrote:
>
> Hi Valerie,
>
> Thank you so much for the reply.
>
> After checking the seqlevels, I am able to get rid off the
> error, but
> still getting the zero count entries. Is there any another way
> of doing
> this?
>
> KInd Regards
>
>
> On Mon, Aug 5, 2013 at 9:21 PM, Valerie Obenchain
> <vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
> <mailto:vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>>> wrote:
>
> Hi Reema,
>
> To perform overlap or matching operations the seqlevels
> (chromosome
> names) of the objects must match. The error message is
> telling you
> that some of these do not match. It's reasonable that a few
> names
> may not match (maybe a chromosome is present in one object
> and not
> the other) but the majority should.
>
> Check the seqlevels:
> seqlevels(aln)
> seqlevels(gnModel)
>
> Which names are common to both:
> intersect(seqlevels(gnModel), seqlevels(aln))
>
> You can rename seqlevels in several different ways. See
> ?renameSeqlevels or ?seqlevels for examples.
>
> Valerie
>
>
> On 08/04/2013 06:35 AM, Reema Singh wrote:
>
> Dear All,
>
> I am trying to extract the read count from three .bam
> files. But
> I am
> getting Zero count entries.
>
> I am using Mycobacterium Tuberculosis H37Rv gtf file (
> ftp://ftp.ensemblgenomes.org/____pub/release-19/bacteria//gtf/____bacteria_1_collection/____mycobacterium_tuberculosis_____h37rv/Mycobacterium_____tuberculosis_h37rv.GCA_____000277735.1.19.gtf.gz
> <ftp://ftp.ensemblgenomes.org/__pub/release-19/bacteria//gtf/__bacteria_1_collection/__mycobacterium_tuberculosis___h37rv/Mycobacterium___tuberculosis_h37rv.GCA___000277735.1.19.gtf.gz>
>
> <ftp://ftp.ensemblgenomes.org/__pub/release-19/bacteria//gtf/__bacteria_1_collection/__mycobacterium_tuberculosis___h37rv/Mycobacterium___tuberculosis_h37rv.GCA___000277735.1.19.gtf.gz
> <ftp://ftp.ensemblgenomes.org/pub/release-19/bacteria//gtf/bacteria_1_collection/mycobacterium_tuberculosis_h37rv/Mycobacterium_tuberculosis_h37rv.GCA_000277735.1.19.gtf.gz>>)
>
> and RNASeq data used here were downloaded from (
> http://www.ncbi.nlm.nih.gov/____geo/query/acc.cgi?acc=GSE40846
> <http://www.ncbi.nlm.nih.gov/__geo/query/acc.cgi?acc=GSE40846>
>
> <http://www.ncbi.nlm.nih.gov/__geo/query/acc.cgi?acc=GSE40846
> <http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE40846>__>__)
>
> and aligned
> with bowtie2.
>
>
> library(GenomicFeatures)
> txdb <-
>
> makeTranscriptDbFromGFF(file="____Mycobacterium_tuberculosis_____h37rv.GCA_000277735.1.19.gtf"__,__format="gtf")
> saveDb(txdb,file="____MycoTubeH37Rv.sqlite")
>
> load("MycoTubeH37Rv.sqlite")
> gnModel <- exonsBy(txdb,"gene") ### *also tried with
> "transcripts", "cds",
> but getting same *
>
>
> bamFiles <- list.files(".", "bam$", full=TRUE)
> names(bamFiles) <- sub("\\..*","",basename(____bamFiles))
> counter <- function(fl, gnModel){
> aln <- GenomicRanges::____readGappedAlignments(fl)
> strand(aln)
> hits <- countOverlaps(aln,gnModel)
> counts <- countOverlaps(gnModel,aln[____hits==5])
> names(counts) <- names(gnModel)
> counts
> }
>
> counts <- sapply(bamFiles,counter,____gnModel)
>
>
> Note: method with signature ‘Vector#GRangesList’ chosen
> for function
> ‘countOverlaps’,
> target signature ‘GappedAlignments#GRangesList’____.
>
> "GappedAlignments#Vector" would also be valid
> Note: method with signature ‘GRangesList#Vector’ chosen
> for function
> ‘countOverlaps’,
> target signature ‘GRangesList#GappedAlignments’____.
>
> "Vector#GappedAlignments" would also be valid
> Warning messages:
> 1: In .Seqinfo.mergexy(x, y) :
> Each of the 2 combined objects has sequence levels
> not in
> the other:
> - in 'x': gi|448814763|ref|NC_000962.3|
> - in 'y': Chromosome
> Make sure to always combine/compare objects based
> on the
> same reference
> genome (use suppressWarnings() to suppress this
> warning).
> 2: In .Seqinfo.mergexy(x, y) :
> Each of the 2 combined objects has sequence levels
> not in
> the other:
> - in 'x': Chromosome
> - in 'y': gi|448814763|ref|NC_000962.3|
> Make sure to always combine/compare objects based
> on the
> same reference
> genome (use suppressWarnings() to suppress this
> warning).
> 3: In .Seqinfo.mergexy(x, y) :
> Each of the 2 combined objects has sequence levels
> not in
> the other:
> - in 'x': gi|448814763|ref|NC_000962.3|
> - in 'y': Chromosome
> Make sure to always combine/compare objects based
> on the
> same reference
> genome (use suppressWarnings() to suppress this
> warning).
> 4: In .Seqinfo.mergexy(x, y) :
> Each of the 2 combined objects has sequence levels
> not in
> the other:
> - in 'x': Chromosome
> - in 'y': gi|448814763|ref|NC_000962.3|
> Make sure to always combine/compare objects based
> on the
> same reference
> genome (use suppressWarnings() to suppress this
> warning).
> 5: In .Seqinfo.mergexy(x, y) :
> Each of the 2 combined objects has sequence levels
> not in
> the other:
> - in 'x': gi|448814763|ref|NC_000962.3|
> - in 'y': Chromosome
> Make sure to always combine/compare objects based
> on the
> same reference
> genome (use suppressWarnings() to suppress this
> warning).
> 6: In .Seqinfo.mergexy(x, y) :
> Each of the 2 combined objects has sequence levels
> not in
> the other:
> - in 'x': Chromosome
> - in 'y': gi|448814763|ref|NC_000962.3|
> Make sure to always combine/compare objects based
> on the
> same reference
> genome (use suppressWarnings() to suppress this
> warning).
>
> head(counts)
>
> SRR568038 SRR568039 SRR568040
> RVBD_0001 0 0 0
> RVBD_0002 0 0 0
> RVBD_0003 0 0 0
> RVBD_0004 0 0 0
> RVBD_0005 0 0 0
> RVBD_0006 0 0 0
>
> sessionInfo()
>
> R version 3.0.1 (2013-05-16)
> Platform: x86_64-redhat-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C
> [3] LC_TIME=en_US.utf8 LC_COLLATE=en_US.utf8
> [5] LC_MONETARY=en_US.utf8 LC_MESSAGES=en_US.utf8
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel stats graphics grDevices utils
> datasets
> methods
> [8] base
>
> other attached packages:
> [1] Rsamtools_1.12.3 Biostrings_2.28.0
> GenomicFeatures_1.12.3
> [4] AnnotationDbi_1.22.6 Biobase_2.20.1
> GenomicRanges_1.12.4
> [7] IRanges_1.18.2 BiocGenerics_0.6.0
>
> loaded via a namespace (and not attached):
> [1] biomaRt_2.16.0 bitops_1.0-5
> BSgenome_1.28.0
> DBI_0.2-6
>
> [5] RCurl_1.95-4.1 RSQLite_0.11.3
> rtracklayer_1.20.4
> stats4_3.0.1
>
> [9] tools_3.0.1 XML_3.96-1.1 zlibbioc_1.6.0
>
>
> Kind regards
>
>
>
>
> ___________________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
> <mailto:Bioconductor at r-__project.org
> <mailto:Bioconductor at r-project.org>>
> https://stat.ethz.ch/mailman/____listinfo/bioconductor
> <https://stat.ethz.ch/mailman/__listinfo/bioconductor>
>
> <https://stat.ethz.ch/mailman/__listinfo/bioconductor
> <https://stat.ethz.ch/mailman/listinfo/bioconductor>>
> Search the archives:
> http://news.gmane.org/gmane.____science.biology.informatics.____conductor
> <http://news.gmane.org/gmane.__science.biology.informatics.__conductor>
>
>
> <http://news.gmane.org/gmane.__science.biology.informatics.__conductor
> <http://news.gmane.org/gmane.science.biology.informatics.conductor>>
>
>
>
>
> --
> Reema Singh
> PhD Scholar
> Computational Biology and Bioinformatics
> School of Computational and Integrative Sciences
> Jawaharlal Nehru University
> New Delhi-110067
> INDIA
>
>
>
>
>
> --
> Reema Singh
> PhD Scholar
> Computational Biology and Bioinformatics
> School of Computational and Integrative Sciences
> Jawaharlal Nehru University
> New Delhi-110067
> INDIA
More information about the Bioconductor
mailing list