[BioC] rtracklayer proposal for ISSUE: import.gff3 asRangedData=FALSE fails when strand is '.'
Cook, Malcolm
MEC at stowers.org
Wed Apr 18 20:49:42 CEST 2012
Thanks Herve!
A+
~Malcolm
> -----Original Message-----
> From: Hervé Pagès [mailto:hpages at fhcrc.org]
> Sent: Wednesday, April 18, 2012 12:19 PM
> To: Cook, Malcolm
> Cc: bioconductor at r-project.org
> Subject: Re: [BioC] rtracklayer proposal for ISSUE: import.gff3
> asRangedData=FALSE fails when strand is '.'
>
> Hi Malcolm,
>
> On 04/18/2012 09:04 AM, Cook, Malcolm wrote:
> > Hi, rtracklayerers,
> >
> > import.gff3 with asRangedData=TRUE passes a period through to the strand of the imported RangedData; however, calling it with asRangedData=FALSE raises an error:
> >
> >> gff.str <- "2L\tFlyBase\tgene\t7529\t9484\t0\t.\t0\tID=FBgn0031208;Name=CG11023"
> >> import.gff3(textConnection(gff.str),asRangedData=TRUE)
> > RangedData with 1 row and 7 value columns across 1 space
> >   space ranges | type source phase strand ID Name score
> >   <factor> <IRanges> | <factor> <factor> <factor> <factor> <character> <character> <numeric>
> > 1 2L [7529, 9484] | gene FlyBase 0 NA FBgn0031208 CG11023 0
>
> IMO * should be used instead of NA to be more consistent with how the
> strand is handled in the rest of the infrastructure.
>
> >> import.gff3(textConnection(gff.str),asRangedData=FALSE)
> > Error in strand(runValue(strand)) : strand values must be in '+' '-' '*'
>
> It looks like this problem is fixed in rtracklayer 1.16.1:
>
> > import.gff3(textConnection(gff.str),asRangedData=FALSE)
> GRanges with 1 range and 6 elementMetadata cols:
>       seqnames ranges strand | source type score phase
>       <Rle> <IRanges> <Rle> | <factor> <factor> <numeric> <integer>
>   [1] 2L [7529, 9484] * | FlyBase gene 1 0
>       ID Name
>       <character> <character>
>   [1] FBgn0031208 CG11023
>   ---
>   seqlengths:
>    2L
>    NA
> Warning message:
> In newGRanges("GRanges", seqnames = seqnames, ranges = ranges, strand = strand, :
>   missing values in strand converted to "*"
>
> >
> > The GFF3 spec allows '.' (and '?') to appear as value of strand:
> >
> > Column 7: "strand"
> > The strand of the feature. + for positive strand (relative to the
> > landmark), - for minus strand, and . for features that are not
> > stranded. In addition, ? can be used for features whose strandedness
> > is relevant, but unknown.
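
For concreteness, here is a minimal base-R sketch of reading column 7 from the example record above and labelling it according to the spec text just quoted. The variable names `fields`, `strand` and `meaning` are only illustrative, and nothing here calls rtracklayer.

```r
# The one-line GFF3 record used earlier in this thread (tab-separated).
gff.str <- "2L\tFlyBase\tgene\t7529\t9484\t0\t.\t0\tID=FBgn0031208;Name=CG11023"

# Split the record into its nine columns; column 7 is the strand.
fields <- strsplit(gff.str, "\t", fixed = TRUE)[[1]]
strand <- fields[7]

# Label the value using the four symbols the GFF3 spec allows.
# The final unnamed argument is switch()'s fallback for anything else.
meaning <- switch(strand,
                  "+" = "positive strand",
                  "-" = "minus strand",
                  "." = "not stranded",
                  "?" = "strandedness relevant but unknown",
                  "not a valid GFF3 strand value")
```

Of the four symbols, only '+' and '-' are also accepted by strand(), which is why '.' and '?' need some explicit handling on import.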
> >
> > Arguably, import.gff{,2,3} should provide some control over interpretation of '.' and '?' appearing in the strand column, allowing it to comport with strand and GRanges.
>
> In the early days of the strand() constructor, we also tried to make
> a distinction between *'s and NA's in the strand column, with more or
> less the same subtle differences that GFF3 makes between . and ?
> But then we abandoned that.
>
> See:
>
> https://stat.ethz.ch/pipermail/bioconductor/2012-January/043067.html
>
> It's not written in stone though so if people have a use case where
> they need to be able to distinguish between (a) "range/feature is on
> both strands" and (b) "strand is unknown or irrelevant", then we could
> revisit that decision.
>
> Cheers,
> H.
>
> >
> > I propose the following as an intended backwards compatible fix.
> >
> > New argument to import.gff{,2,3}
> >
> > strandMap: control for mapping out-of-band values (FALSE, TRUE, a string, or a list), understood as follows:
> >   FALSE: the default - do not map out-of-band values to '*'
> >   TRUE: map all out-of-band values to '*'
> >   any length-1 character vector: map out-of-band values to it (presumably it will be one of '*', '-', '+')
> >   a list: look up how to map each out-of-band value in the list by name.
> > If it is agreed that this is the best resolution, and the rtracklayer gods wish it, I will take this as my first opportunity to contribute and will follow up accordingly.
> >
> > Else?
> >
> > Cheers,
> >
> > Malcolm
> >
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319