[BioC] Error mapping RNA-seq data to genome

Mon Aug 23 13:51:03 CEST 2010

Hi Martin,

Your advice seems to have been pretty successful, there was a read
aligned beyond the end of the chrM. No idea why this might have
happened though, which is a bit worrying.

For the record removing the offending read worked ok. I used the following code:

# find offending read:
> refLen <- seqlengths(hgTxDb)[as.character(seqnames(alnRanges))]
> print(alnRanges[refLen < end(alnRanges)])

GRanges with 1 range and 1 elementMetadata value
    seqnames         ranges strand | pData.alignData.from...notNA...
       <Rle>      <IRanges>  <Rle> |                       <integer>
[1]     chrM [16541, 16577]      - |                             147

# remove offending read:
alnRanges <- alnRanges[-which(refLen < end(alnRanges))]

Cheers,

Paul

On Wed, Aug 18, 2010 at 2:30 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
> Hi Paul --
>
> On 08/17/2010 08:09 AM, Paul Geeleher wrote:
>> I'm attempting to read some aligned RNA-seq paired end reads in
>> Bioconductor. The reads were aligned with maq (.fastq to .bqf to .map)
>> and converted to SAM then BAM format by samtools as I couldn't get the
>> paired end reads into biocondcutor in MAQ format.
>>
>> I'm basically following this tutorial:
>> http://bioconductor.org/help/course-materials/2010/CSAMA10/Lab-8-RNAseqUseCase.pdf
>>
>> The following step causes me an error:
>>
>>> seqlengths(alnRanges) <- seqlengths(hgTxDb)[names(seqlengths(alnRanges))]
>> Error in validObject(.Object) :
>>   invalid class "GRanges" object: slot 'ranges' contains values
>> outside of sequence bounds
>>
>> I think it means that there is something aligned to outside the range
>> of my reference genome but I don't understand why this would be as I
>> used HG18 for everything.
>
> probably one or more of your reads are 'hanging over' one of the
> chromosome bounds. You might
>
>  tapply(alnRanges, seqnames(alnRanges), f
>         unction(x) c(min(start(x)), max(end(x))))
>
> to get a sense of whether you're on the right track, and maybe something
> like
>
>  refLen <- seqlengths(hgTxDb)[as.character(seqnames(alnRanges)]
>  alnRanges[refLen < end(alnRanges)
>
> to get the actual ranges? You could trim or discard these
>
> Martin
>
>> Any suggestions as to how I can possibly figure out what went wrong
>> here would be greatly appreciated. I can provide more information if
>> necessary. Summaries of my GenomicRanges and GenomicFeatures objects
>> are below.
>>
>>
>>> alnRanges
>> GRanges with 11634719 ranges and 1 elementMetadata value
>>            seqnames         ranges strand   | pData.alignData.from...notNA...
>>               <Rle>      <IRanges>  <Rle>   |                       <integer>
>>        [1]     chr1   [3345, 3381]      +   |                             163
>>        [2]     chr1   [3391, 3427]      -   |                             147
>>        [3]     chr1   [4223, 4259]      +   |                             163
>>        [4]     chr1   [4247, 4283]      -   |                             147
>>        [5]     chr1   [4275, 4311]      +   |                             163
>>        [6]     chr1   [4304, 4340]      +   |                             163
>>        [7]     chr1   [4320, 4356]      +   |                             163
>>        [8]     chr1   [4343, 4379]      -   |                             147
>>        [9]     chr1   [4343, 4379]      -   |                             147
>>        ...      ...            ...    ... ...                             ...
>> [11634711]     chrM [16531, 16567]      -   |                             147
>> [11634712]     chrM [16532, 16568]      +   |                             161
>> [11634713]     chrM [16532, 16568]      -   |                             147
>> [11634714]     chrM [16532, 16568]      -   |                             147
>> [11634715]     chrM [16532, 16568]      -   |                             147
>> [11634716]     chrM [16534, 16570]      -   |                             147
>> [11634717]     chrM [16534, 16570]      -   |                             147
>> [11634718]     chrM [16534, 16570]      -   |                             147
>> [11634719]     chrM [16541, 16577]      -   |                             147
>>
>> seqlengths
>>   chr1  chr2  chr3  chr4  chr5  chr6 ... chr20 chr21 chr22  chrX  chrY  chrM
>>     NA    NA    NA    NA    NA    NA ...    NA    NA    NA    NA    NA    NA
>>
>>> hgTxDb <- makeTranscriptDbFromUCSC(genome="hg18", tablename="ensGene")
>>> hgTxDb
>> TranscriptDb object:
>> | Db type: TranscriptDb
>> | Data source: UCSC
>> | Genome: hg18
>> | UCSC Table: ensGene
>> | Type of Gene ID: Ensembl gene ID
>> | Full dataset: yes
>> | transcript_nrow: 63280
>> | exon_nrow: 276075
>> | cds_nrow: 225373
>> | Db created by: GenomicFeatures package from Bioconductor
>> | Creation time: 2010-08-17 15:28:19 +0100 (Tue, 17 Aug 2010)
>> | GenomicFeatures version at creation time: 1.0.6
>> | RSQLite version at creation time: 0.9-1
>>>
>>
>>
>>
>>
>>
>
>
> --
> Martin Morgan
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
>

-- 
Paul Geeleher
School of Mathematics, Statistics and Applied Mathematics
National University of Ireland
Galway
Ireland
--
www.bioinformaticstutorials.com