[BioC] rtracklayer proposal for ISSUE: import.gff3 asRangedData=FALSE fails when strand is '.'
Hervé Pagès
hpages at fhcrc.org
Wed Apr 18 19:19:17 CEST 2012
Hi Malcolm,
On 04/18/2012 09:04 AM, Cook, Malcolm wrote:
> Hi, rtracklayerers,
>
> import.gff3 with asRangedData=TRUE passes a period through to the strand of imported RangedData; however, calling it with asRangedData=FALSE errors:
>
>> gff.str<-"2L\tFlyBase\tgene\t7529\t9484\t0\t.\t0\tID=FBgn0031208;Name=CG11023"
>> import.gff3(textConnection(gff.str),asRangedData=TRUE)
> RangedData with 1 row and 7 value columns across 1 space
> space ranges | type source phase strand ID Name score
> <factor> <IRanges> |<factor> <factor> <factor> <factor> <character> <character> <numeric>
> 1 2L [7529, 9484] | gene FlyBase 0 NA FBgn0031208 CG11023 0
IMO * should be used instead of NA to be more consistent with how the
strand is handled in the rest of the infrastructure.
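As a minimal base-R sketch of that convention (the strand_col vector
below is made up for illustration and is not what import.gff3 does
internally), replacing NA with * on a strand factor could look like:

```r
# Hypothetical strand column as it might come out of a GFF3 import,
# with '.' already read in as NA.
strand_col <- factor(c("+", NA, "-"), levels = c("+", "-", "*"))

# Replace missing strands with "*", the value used for "no strand"
# in the rest of the infrastructure.
strand_col[is.na(strand_col)] <- "*"

print(as.character(strand_col))
```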
>> import.gff3(textConnection(gff.str),asRangedData=FALSE)
> Error in strand(runValue(strand)) : strand values must be in '+' '-' '*'
It looks like this problem is fixed in rtracklayer 1.16.1:
> import.gff3(textConnection(gff.str),asRangedData=FALSE)
GRanges with 1 range and 6 elementMetadata cols:
seqnames ranges strand | source type score phase
<Rle> <IRanges> <Rle> | <factor> <factor> <numeric> <integer>
[1] 2L [7529, 9484] * | FlyBase gene 1 0
ID Name
<character> <character>
[1] FBgn0031208 CG11023
---
seqlengths:
2L
NA
Warning message:
In newGRanges("GRanges", seqnames = seqnames, ranges = ranges, strand = strand, :
missing values in strand converted to "*"
>
> The GFF3 spec allows '.' (and '?') to appear as value of strand:
>
> Column 7: "strand"
> The strand of the feature. + for positive strand (relative to the
> landmark), - for minus strand, and . for features that are not
> stranded. In addition, ? can be used for features whose strandedness
> is relevant, but unknown.
>
> Arguably, import.gff{,2,3} should provide some control over the interpretation of '.' and '?' appearing in the strand column, allowing it to comport with strand and GRanges.
In the early days of the strand() constructor, we also tried to
distinguish between *'s and NA's in the strand column, with more or
less the same subtle difference that GFF3 makes between '.' and '?'.
But then we abandoned that.
See:
https://stat.ethz.ch/pipermail/bioconductor/2012-January/043067.html
It's not written in stone though so if people have a use case where
they need to be able to distinguish between (a) "range/feature is on
both strands" and (b) "strand is unknown or irrelevant", then we could
revisit that decision.
Cheers,
H.
>
> I propose the following as an intended backwards compatible fix.
>
> New argument to import.gff{,2,3}
>
> strandMap: control for mapping out-of-band values (FALSE,TRUE,a string, a list), understood as follows
> FALSE: the default - do not map out of band values to '*'
> TRUE: map all out of band values to '*'
> any length-1 character vector: map out of band values to it (presumably it will be one of '*', '-', '+')
> a list: lookup how to map out of band values in the list by name.
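To make the four cases above concrete, here is a rough base-R sketch.
Note that map_strand() is a hypothetical helper written only for this
illustration; it is not part of rtracklayer, and the proposed strandMap
argument does not exist in import.gff. Treating "a string" as a single
value and leaving NA and unlisted values untouched are assumptions on
top of the proposal.

```r
# Hypothetical helper, NOT an rtracklayer function: applies the proposed
# strandMap semantics to a plain character vector of strand values.
map_strand <- function(x, strandMap = FALSE) {
  x <- as.character(x)
  # "Out of band" = anything other than the values strand() accepts.
  # NA is left alone here (assumption).
  oob <- !is.na(x) & !(x %in% c("+", "-", "*"))

  if (is.list(strandMap)) {
    # A list: look up each out-of-band value by name; values without
    # an entry in the list are left unchanged (assumption).
    hit <- oob & (x %in% names(strandMap))
    x[hit] <- vapply(strandMap[x[hit]], as.character, character(1),
                     USE.NAMES = FALSE)
  } else if (is.character(strandMap)) {
    # A single string: map every out-of-band value to it.
    stopifnot(length(strandMap) == 1L)
    x[oob] <- strandMap
  } else if (isTRUE(strandMap)) {
    # TRUE: map every out-of-band value to "*".
    x[oob] <- "*"
  }
  # FALSE (the default): return the values unchanged.
  x
}

strands <- c("+", ".", "?", "-")
r_false <- map_strand(strands)                    # left as-is
r_true  <- map_strand(strands, TRUE)              # '.' and '?' become '*'
r_str   <- map_strand(strands, "+")               # '.' and '?' become '+'
r_list  <- map_strand(strands, list("." = "*"))   # only '.' is mapped
```

With the default FALSE the values would still be rejected later by
strand(), which is what keeps the current behaviour unchanged.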
>
> If it is agreed that this is the best resolution, and the rtracklayer gods wish it, I will take this as my first opportunity to contribute and will follow-up accordingly....
>
> Else?
>
> Cheers,
>
> Malcolm
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319