[BioC] Error mapping RNA-seq data to genome
Paul Geeleher
paulgeeleher at gmail.com
Mon Aug 23 13:51:03 CEST 2010
Hi Martin,
Your advice seems to have been pretty successful, there was a read
aligned beyond the end of the chrM. No idea why this might have
happened though, which is a bit worrying.
For the record removing the offending read worked ok. I used the following code:
# find offending read:
> refLen <- seqlengths(hgTxDb)[as.character(seqnames(alnRanges))]
> print(alnRanges[refLen < end(alnRanges)])
GRanges with 1 range and 1 elementMetadata value
seqnames ranges strand | pData.alignData.from...notNA...
<Rle> <IRanges> <Rle> | <integer>
[1] chrM [16541, 16577] - | 147
# remove offending read:
alnRanges <- alnRanges[-which(refLen < end(alnRanges))]
Cheers,
Paul
On Wed, Aug 18, 2010 at 2:30 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
> Hi Paul --
>
> On 08/17/2010 08:09 AM, Paul Geeleher wrote:
>> I'm attempting to read some aligned RNA-seq paired end reads in
>> Bioconductor. The reads were aligned with maq (.fastq to .bqf to .map)
>> and converted to SAM then BAM format by samtools as I couldn't get the
>> paired end reads into biocondcutor in MAQ format.
>>
>> I'm basically following this tutorial:
>> http://bioconductor.org/help/course-materials/2010/CSAMA10/Lab-8-RNAseqUseCase.pdf
>>
>> The following step causes me an error:
>>
>>> seqlengths(alnRanges) <- seqlengths(hgTxDb)[names(seqlengths(alnRanges))]
>> Error in validObject(.Object) :
>> invalid class "GRanges" object: slot 'ranges' contains values
>> outside of sequence bounds
>>
>> I think it means that there is something aligned to outside the range
>> of my reference genome but I don't understand why this would be as I
>> used HG18 for everything.
>
> probably one or more of your reads are 'hanging over' one of the
> chromosome bounds. You might
>
> tapply(alnRanges, seqnames(alnRanges), f
> unction(x) c(min(start(x)), max(end(x))))
>
> to get a sense of whether you're on the right track, and maybe something
> like
>
> refLen <- seqlengths(hgTxDb)[as.character(seqnames(alnRanges)]
> alnRanges[refLen < end(alnRanges)
>
> to get the actual ranges? You could trim or discard these
>
> Martin
>
>> Any suggestions as to how I can possibly figure out what went wrong
>> here would be greatly appreciated. I can provide more information if
>> necessary. Summaries of my GenomicRanges and GenomicFeatures objects
>> are below.
>>
>>
>>> alnRanges
>> GRanges with 11634719 ranges and 1 elementMetadata value
>> seqnames ranges strand | pData.alignData.from...notNA...
>> <Rle> <IRanges> <Rle> | <integer>
>> [1] chr1 [3345, 3381] + | 163
>> [2] chr1 [3391, 3427] - | 147
>> [3] chr1 [4223, 4259] + | 163
>> [4] chr1 [4247, 4283] - | 147
>> [5] chr1 [4275, 4311] + | 163
>> [6] chr1 [4304, 4340] + | 163
>> [7] chr1 [4320, 4356] + | 163
>> [8] chr1 [4343, 4379] - | 147
>> [9] chr1 [4343, 4379] - | 147
>> ... ... ... ... ... ...
>> [11634711] chrM [16531, 16567] - | 147
>> [11634712] chrM [16532, 16568] + | 161
>> [11634713] chrM [16532, 16568] - | 147
>> [11634714] chrM [16532, 16568] - | 147
>> [11634715] chrM [16532, 16568] - | 147
>> [11634716] chrM [16534, 16570] - | 147
>> [11634717] chrM [16534, 16570] - | 147
>> [11634718] chrM [16534, 16570] - | 147
>> [11634719] chrM [16541, 16577] - | 147
>>
>> seqlengths
>> chr1 chr2 chr3 chr4 chr5 chr6 ... chr20 chr21 chr22 chrX chrY chrM
>> NA NA NA NA NA NA ... NA NA NA NA NA NA
>>
>>> hgTxDb <- makeTranscriptDbFromUCSC(genome="hg18", tablename="ensGene")
>>> hgTxDb
>> TranscriptDb object:
>> | Db type: TranscriptDb
>> | Data source: UCSC
>> | Genome: hg18
>> | UCSC Table: ensGene
>> | Type of Gene ID: Ensembl gene ID
>> | Full dataset: yes
>> | transcript_nrow: 63280
>> | exon_nrow: 276075
>> | cds_nrow: 225373
>> | Db created by: GenomicFeatures package from Bioconductor
>> | Creation time: 2010-08-17 15:28:19 +0100 (Tue, 17 Aug 2010)
>> | GenomicFeatures version at creation time: 1.0.6
>> | RSQLite version at creation time: 0.9-1
>>>
>>
>>
>>
>>
>>
>
>
> --
> Martin Morgan
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
>
--
Paul Geeleher
School of Mathematics, Statistics and Applied Mathematics
National University of Ireland
Galway
Ireland
--
www.bioinformaticstutorials.com
More information about the Bioconductor
mailing list