[Rd] duplicated.data.frame() is broken on data frames containing \r
Hervé Pagès
hpages at fhcrc.org
Mon Jul 29 21:06:03 CEST 2013
OK, it's actually documented (in ?duplicated):
The data frame method works by pasting together a character
representation of the rows separated by ‘\r’, so may be imperfect
if the data frame has characters with embedded carriage returns or
columns which do not reliably map to characters.
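To make the documented caveat concrete, here is a minimal sketch that
pastes the columns together by hand with "\r" as separator. It only mimics
the approach described above and is not the actual base R source:

```r
# Two rows that differ, but whose "\r"-separated concatenations coincide.
df <- data.frame(a = c("AA", "AA\r"), b = c("\rBBB", "BBB"),
                 stringsAsFactors = FALSE)

# Mimic the documented approach: one string per row, columns joined by "\r".
keys <- paste(df$a, df$b, sep = "\r")

identical(keys[1], keys[2])  # TRUE: both rows collapse to "AA\r\rBBB"
duplicated(keys)             # FALSE TRUE, although the rows are different
```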
But what about fixing it? One possible fix would be to use "\r\r" as the
separator and to replace any user-supplied "\r" with, say, "#\r#", so
that "\r\r" can never occur inside a field. Just an example.
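A rough sketch of that idea, assuming every column can be converted with
as.character(). The helper name row_keys() is made up for illustration and
is not part of base R:

```r
# Hypothetical helper: build one key per row, escaping "\r" in the data so
# that the "\r\r" separator cannot be confused with user-supplied content.
row_keys <- function(df) {
  cols <- lapply(df, function(col) {
    gsub("\r", "#\r#", as.character(col), fixed = TRUE)
  })
  # unname() so that column names are never matched to paste() arguments
  do.call(paste, c(unname(cols), sep = "\r\r"))
}

df <- data.frame(a = c("AA", "AA\r"), b = c("\rBBB", "BBB"),
                 stringsAsFactors = FALSE)

duplicated(row_keys(df))                 # FALSE FALSE: rows stay distinct
duplicated(row_keys(rbind(df, df[1, ]))) # FALSE FALSE TRUE: real duplicate
```

After the substitution every "\r" inside a field has a "#" on both sides,
so two consecutive "\r" can only come from the separator.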
Thanks,
H.
On 07/29/2013 11:52 AM, Hervé Pagès wrote:
> Hi,
>
> The trick used by duplicated.data.frame() is to transform the supplied
> data.frame into a character vector by pasting together the columns using
> "\r" as separator. But no precautions are taken to deal with "\r" in
> the supplied data.frame. As a consequence it's easy to imagine
> situations where duplicated.data.frame() returns an incorrect answer:
>
> > df <- data.frame(a=c("AA", "AA\r"), b=c("\rBBB", "BBB"))
> > df
>      a     b
> 1   AA \rBBB
> 2 AA\r   BBB
> > duplicated(df)
> [1] FALSE TRUE
>
> Cheers,
> H.
>
> > sessionInfo()
> R version 3.0.1 (2013-05-16)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319