[BioC] rtracklayer proposal for ISSUE: import.gff3 asRangedData=FALSE fails when strand is '.'

Cook, Malcolm MEC at stowers.org
Wed Apr 18 20:49:42 CEST 2012


Thanks Herve!

A+

~Malcolm


> -----Original Message-----
> From: Hervé Pagès [mailto:hpages at fhcrc.org]
> Sent: Wednesday, April 18, 2012 12:19 PM
> To: Cook, Malcolm
> Cc: bioconductor at r-project.org
> Subject: Re: [BioC] rtracklayer proposal for ISSUE: import.gff3
> asRangedData=FALSE fails when strand is '.'
> 
> Hi Malcolm,
> 
> On 04/18/2012 09:04 AM, Cook, Malcolm wrote:
> > Hi, rtracklayerers,
> >
> > import.gff3 with asRangedData=TRUE passes a period through to the strand
> > of imported RangedData; however, calling it with asRangedData=FALSE errors:
> >
> >> gff.str <- "2L\tFlyBase\tgene\t7529\t9484\t0\t.\t0\tID=FBgn0031208;Name=CG11023"
> >> import.gff3(textConnection(gff.str),asRangedData=TRUE)
> > RangedData with 1 row and 7 value columns across 1 space
> >       space       ranges |     type   source    phase   strand          ID        Name     score
> >    <factor>    <IRanges> | <factor> <factor> <factor> <factor> <character> <character> <numeric>
> > 1        2L [7529, 9484] |     gene  FlyBase        0       NA FBgn0031208     CG11023         0
> 
> IMO * should be used instead of NA to be more consistent with how the
> strand is handled in the rest of the infrastructure.
> 
> >> import.gff3(textConnection(gff.str),asRangedData=FALSE)
> > Error in strand(runValue(strand)) : strand values must be in '+' '-' '*'
> 
> It looks like this problem is fixed in rtracklayer 1.16.1:
> 
>    > import.gff3(textConnection(gff.str),asRangedData=FALSE)
>    GRanges with 1 range and 6 elementMetadata cols:
>          seqnames       ranges strand |   source     type     score     phase
>             <Rle>    <IRanges>  <Rle> | <factor> <factor> <numeric> <integer>
>      [1]       2L [7529, 9484]      * |  FlyBase     gene         1         0
>                   ID        Name
>          <character> <character>
>      [1] FBgn0031208     CG11023
>      ---
>      seqlengths:
>       2L
>       NA
>    Warning message:
>    In newGRanges("GRanges", seqnames = seqnames, ranges = ranges, strand = strand,  :
>      missing values in strand converted to "*"
> 
> >
> > The GFF3 spec allows '.' (and '?') to appear as value of strand:
> >
> > Column 7: "strand"
> > The strand of the feature.  + for positive strand (relative to the
> > landmark), - for minus strand, and . for features that are not
> > stranded.  In addition, ? can be used for features whose strandedness
> > is relevant, but unknown.
> >
> > Arguably, import.gff{,2,3} should provide some control over interpretation
> > of '.' and '?' appearing in the strand column, allowing it to comport with
> > strand and GRanges.
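
[Editorial sketch] To make the GFF3 strand values concrete, the following base R snippet parses the example record from this thread and collapses '.' and '?' onto '*', the only value besides '+' and '-' that strand() accepts. It does not use rtracklayer; the helper name normalizeGffStrand and the strand_normalized column are hypothetical, chosen for illustration only, and the raw strand column is kept so the '.' versus '?' distinction is not lost.

```r
# The example record from this thread (tab-separated, 9 GFF3 columns).
gff.str <- "2L\tFlyBase\tgene\t7529\t9484\t0\t.\t0\tID=FBgn0031208;Name=CG11023"

# Column names follow the GFF3 spec; base R only, no rtracklayer involved.
gff.cols <- c("seqid", "source", "type", "start", "end",
              "score", "strand", "phase", "attributes")
gff <- read.delim(text = gff.str, header = FALSE, col.names = gff.cols,
                  colClasses = "character", quote = "", comment.char = "#")

# Hypothetical helper (not part of any package): collapse GFF3's '.' and '?'
# (and NA) onto '*', and reject anything else that is not '+' or '-'.
normalizeGffStrand <- function(x) {
  x <- as.character(x)
  x[is.na(x) | x %in% c(".", "?")] <- "*"
  bad <- !(x %in% c("+", "-", "*"))
  if (any(bad))
    stop("unexpected strand value(s): ", paste(unique(x[bad]), collapse = ", "))
  factor(x, levels = c("+", "-", "*"))
}

# Keep the raw column and add a normalized one next to it.
gff$strand_normalized <- normalizeGffStrand(gff$strand)
```
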
> 
> In the early days of the strand() constructor, we also tried to make
> the distinction between *'s and NA's in the strand column, with more or
> less the same subtle differences as GFF3 makes between . and ?
> But then we abandoned that.
> 
> See:
> 
>    https://stat.ethz.ch/pipermail/bioconductor/2012-January/043067.html
> 
> It's not written in stone though so if people have a use case where
> they need to be able to distinguish between (a) "range/feature is on
> both strands" and (b) "strand is unknown or irrelevant", then we could
> revisit that decision.
> 
> Cheers,
> H.
> 
> >
> > I propose the following as an intended backwards compatible fix.
> >
> > New argument to import.gff{,2,3}
> >
> >   strandMap: controls the mapping of out-of-band values (FALSE, TRUE, a string,
> >   or a list), understood as follows:
> > 	FALSE: the default - do not map out-of-band values to '*'
> > 	TRUE:  map all out-of-band values to '*'
> > 	any length-1 character vector: map out-of-band values to it
> > 	  (presumably it will be one of '*', '-', '+')
> > 	a list: look up how to map out-of-band values in the list by name.
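
[Editorial sketch] The four strandMap cases listed above could behave roughly as follows. This is a minimal base R illustration under stated assumptions, not rtracklayer code: applyStrandMap is a hypothetical name, "out-of-band" is taken to mean any non-NA value other than '+', '-' or '*', and NA is left untouched.

```r
# Hypothetical helper sketching the proposed strandMap semantics.
applyStrandMap <- function(strand, strandMap = FALSE) {
  strand <- as.character(strand)
  # Out-of-band: anything that strand() would reject (NA is left alone here).
  oob <- !is.na(strand) & !(strand %in% c("+", "-", "*"))
  if (is.list(strandMap)) {
    # Named list: look each out-of-band value up by name, e.g. list("." = "*").
    # Values without an entry in the list are left unchanged.
    hit <- oob & strand %in% names(strandMap)
    strand[hit] <- as.character(unlist(strandMap[strand[hit]], use.names = FALSE))
  } else if (is.character(strandMap) && length(strandMap) == 1L) {
    # Single string: map every out-of-band value to it.
    strand[oob] <- strandMap
  } else if (isTRUE(strandMap)) {
    # TRUE: map every out-of-band value to '*'.
    strand[oob] <- "*"
  } else if (!identical(strandMap, FALSE)) {
    stop("strandMap must be FALSE, TRUE, a single string, or a named list")
  }
  # FALSE (the default) falls through and returns the values unchanged.
  strand
}

raw <- c("+", "-", ".", "?")
applyStrandMap(raw)                    # default: values passed through as is
applyStrandMap(raw, TRUE)              # '.' and '?' both become '*'
applyStrandMap(raw, "+")               # '.' and '?' both become '+'
applyStrandMap(raw, list("." = "*"))   # only '.' is mapped; '?' is kept
```

With the default FALSE the result would still be rejected by strand(), which matches the "backwards compatible" intent of the proposal: existing behavior changes only when the caller opts in.
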
> >
> > If it is agreed that this is the best resolution, and the rtracklayer gods
> > wish it, I will take this as my first opportunity to contribute and will
> > follow up accordingly....
> >
> > Else?
> >
> > Cheers,
> >
> > Malcolm
> >
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at r-project.org
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> 
> 
> --
> Hervé Pagès
> 
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
> 
> E-mail: hpages at fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319


