[R] Parsing JSON records to a dataframe
Martin Morgan
mtmorgan at fhcrc.org
Fri Jan 7 14:17:55 CET 2011
On 01/07/2011 12:05 AM, Dieter Menne wrote:
>
>
> Jeroen Ooms wrote:
>>
>> What is the most efficient method of parsing a dataframe-like structure
>> that has been json encoded in record-based format rather than vector
>> based. For example a structure like this:
>>
>> [ {"name":"joe", "gender":"male", "age":41}, {"name":"anna",
>> "gender":"female", "age":23} ]
>>
>> RJSONIO parses this as a list of lists, which I would then have to apply
>> as.data.frame to and append them to an existing dataframe, which is
>> terribly slow.
>>
>>
>
> unlist is pretty fast. The solution below assumes that you know how your
> structure is, so it is not very flexible, but it should show you that the
> conversion to data.frame is not the bottleneck.
>
> # json
> library(RJSONIO)
> # [ {"name":"joe", "gender":"male", "age":41},
> # {"name":"anna", "gender":"female", "age":23} ]
> n = 300000
> d = data.frame(name=rep(c("joe","anna"),n),
>                gender=rep(c("male","female"),n),
>                age = rep(c("23","41"),n))
> dj = toJSON(d)
This doesn't create the required record-based structure; toJSON() encodes
the data frame column by column (shown here for n = 2):
> cat(dj)
{
 "name": [ "joe", "anna", "joe", "anna" ],
 "gender": [ "male", "female", "male", "female" ],
 "age": [ "23", "41", "23", "41" ]
}
Instead, build the record-based JSON by hand and parse that:
library(rjson)
n <- 1000
name <- apply(matrix(sample(letters, n * 5, TRUE), n),
              1, paste, collapse="")
gender <- sample(c("male", "female"), n, TRUE)
age <- ceiling(runif(n, 20, 60))
recs <- sprintf('{"name": "%s", "gender":"%s", "age":%d}',
                name, gender, age)
j <- sprintf("[%s]", paste(recs, collapse=","))
lol <- fromJSON(j)
and then, with a helper that pulls one field out of every record,
f <- function(lst)
    function(nm) unlist(lapply(lst, "[[", nm), use.names=FALSE)
> oopt <- options(stringsAsFactors=FALSE) # convenience for 'identical'
> system.time({
+ df0 <- as.data.frame(Map(f(lol), names(lol[[1]])))
+ })
   user  system elapsed
  0.006   0.000   0.006
versus, for instance, building a one-row data frame per record and
rbind-ing them:
> system.time({
+ df1 <- do.call(rbind, lapply(lol, data.frame))
+ })
   user  system elapsed
  1.497   0.000   1.500
> identical(df0, df1)
[1] TRUE
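
To make the column-wise idea concrete, here is a minimal sketch that
applies the same closure to the two records from the original question.
The list lol_small is written out by hand and is only assumed to match
what fromJSON() returns for that input; lol_small, cols and df_small are
names made up for this illustration.

```r
# Hand-written stand-in for the parsed JSON from the original question
# (assumed shape: an unnamed list of named lists, one per record).
lol_small <- list(
    list(name = "joe",  gender = "male",   age = 41),
    list(name = "anna", gender = "female", age = 23)
)

# Same helper as above: returns a function that pulls one field
# out of every record and flattens the result into a vector.
f <- function(lst)
    function(nm) unlist(lapply(lst, "[[", nm), use.names = FALSE)

# One vector per field, then a single as.data.frame() call.
cols <- Map(f(lol_small), names(lol_small[[1]]))
df_small <- as.data.frame(cols, stringsAsFactors = FALSE)
str(df_small)
```

The data frame is built once from whole columns, rather than once per
record. The sketch assumes every record carries all the fields of the
first record; a record with a missing field would make that column come
out short.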
Martin
>
> system.time(d1 <- fromJSON(dj))
> # user system elapsed
> # 4.06 0.26 4.32
>
> system.time(
> dd <- data.frame(
> name = unlist(d1$name),
> gender = unlist(d1$gender),
> age=as.numeric(unlist(d1$age)))
> )
> # user system elapsed
> # 1.13 0.05 1.18
>
>
>
>
--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: M1-B861
Telephone: 206 667-2793