[BioC] rtracklayer proposal for ISSUE: import.gff3 asRangedData=FALSE fails when strand is '.'

Hervé Pagès hpages at fhcrc.org
Wed Apr 18 19:19:17 CEST 2012


Hi Malcolm,

On 04/18/2012 09:04 AM, Cook, Malcolm wrote:
> Hi, rtracklayerers,
>
> import.gff3 with asRangedData=TRUE passes a period through to the strand of imported RangedData; however, calling it with asRangedData=FALSE errors:
>
>> gff.str<-"2L\tFlyBase\tgene\t7529\t9484\t0\t.\t0\tID=FBgn0031208;Name=CG11023"
>> import.gff3(textConnection(gff.str),asRangedData=TRUE)
> RangedData with 1 row and 7 value columns across 1 space
>       space       ranges |     type   source    phase   strand          ID        Name     score
>    <factor>     <IRanges>  |<factor>  <factor>  <factor>  <factor>  <character>  <character>  <numeric>
> 1       2L [7529, 9484] |     gene  FlyBase        0       NA FBgn0031208     CG11023         0

IMO * should be used instead of NA to be more consistent with how the
strand is handled in the rest of the infrastructure.

>> import.gff3(textConnection(gff.str),asRangedData=FALSE)
> Error in strand(runValue(strand)) : strand values must be in '+' '-' '*'

It looks like this problem is fixed in rtracklayer 1.16.1:

   > import.gff3(textConnection(gff.str),asRangedData=FALSE)
   GRanges with 1 range and 6 elementMetadata cols:
         seqnames       ranges strand |   source     type     score     phase
            <Rle>    <IRanges>  <Rle> | <factor> <factor> <numeric> <integer>
     [1]       2L [7529, 9484]      * |  FlyBase     gene         1         0
                  ID        Name
         <character> <character>
     [1] FBgn0031208     CG11023
     ---
     seqlengths:
      2L
      NA
   Warning message:
   In newGRanges("GRanges", seqnames = seqnames, ranges = ranges, strand = strand,  :
     missing values in strand converted to "*"
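
To make the effect of that warning concrete, here is a minimal base-R
sketch of the same idea. It is only an illustration: na_strand_to_star()
is a hypothetical helper written for this message, not the code that
GenomicRanges actually runs.

```r
# Hypothetical helper (not part of GenomicRanges or rtracklayer):
# mimic "missing values in strand converted to '*'" using base R only.
na_strand_to_star <- function(x) {
  # Values outside the three strand levels (including NA) become NA here.
  f <- factor(x, levels = c("+", "-", "*"))
  if (anyNA(f)) {
    warning("missing values in strand converted to \"*\"")
    f[is.na(f)] <- "*"
  }
  f
}

s <- suppressWarnings(na_strand_to_star(c("+", NA, "-")))
print(s)
```

With this approach an unstranded feature and a feature of unknown
strand both end up as *, which is the behaviour discussed further down.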

>
> The GFF3 spec allows '.' (and '?') to appear as value of strand:
>
> Column 7: "strand"
> The strand of the feature.  + for positive strand (relative to the
> landmark), - for minus strand, and . for features that are not
> stranded.  In addition, ? can be used for features whose strandedness
> is relevant, but unknown.
>
> Arguably, import.gff{,2,3} should provide some control over interpretation of '.' and '?' appearing in the strand column, allowing it to comport with strand and GRanges

In the early days of the strand() constructor, we also tried to
distinguish between *'s and NA's in the strand column, with more or
less the same subtle difference that GFF3 makes between . and ?.
But then we abandoned that.

See:

   https://stat.ethz.ch/pipermail/bioconductor/2012-January/043067.html

It's not set in stone though, so if people have a use case where
they need to be able to distinguish between (a) "range/feature is on
both strands" and (b) "strand is unknown or irrelevant", then we could
revisit that decision.

Cheers,
H.

>
> I propose the following as an intended backwards compatible fix.
>
> New argument to import.gff{,2,3}
>
>   strandMap: control for mapping out-of-band values  (FALSE,TRUE,a string, a list), understood as follows
> 	FALSE: the default - do not  map out of band values to '*'
> 	TRUE:  map all out of band values to '*'
> 	any length-1 character vector: map out of band values to it (presumably it will be one of '*', '-', '+')
> 	a list: lookup how to map out of band values in the list by name.
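
The four cases listed above could be sketched in base R roughly as
follows. This is only one reading of the proposal: map_strand() is a
hypothetical helper, not an rtracklayer function, and leaving unmapped
values untouched is an assumption on my part.

```r
# Hypothetical helper illustrating the proposed strandMap semantics.
# Not part of rtracklayer; the name and exact behaviour are assumptions.
map_strand <- function(x, strandMap = FALSE) {
  valid <- c("+", "-", "*")
  oob <- !(x %in% valid)            # out-of-band values such as "." or "?"
  if (isTRUE(strandMap)) {
    x[oob] <- "*"                   # TRUE: everything out of band becomes "*"
  } else if (is.character(strandMap) && length(strandMap) == 1L) {
    x[oob] <- strandMap             # single string: use it as the target
  } else if (is.list(strandMap)) {
    hit <- oob & (x %in% names(strandMap))
    if (any(hit)) {                 # list: look each value up by name
      x[hit] <- as.character(unlist(strandMap[x[hit]], use.names = FALSE))
    }
  }
  x                                 # FALSE: out-of-band values left untouched
}

strands  <- c("+", ".", "?", "-")
as_is    <- map_strand(strands)
all_star <- map_strand(strands, TRUE)
by_name  <- map_strand(strands, list("." = "*"))
print(by_name)
```

Here the list form maps only "." and leaves "?" as it is, so the caller
would still have to decide what to do with the remaining value before
building a GRanges.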
>
> If it is agreed that this is the best resolution, and the rtracklayer gods wish it, I will take this as my first opportunity to contribute and will follow-up accordingly....
>
> Else?
>
> Cheers,
>
> Malcolm
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


