[R] on ``unfolding'' a json into data frame columns
Hadley Wickham
h.wickham at gmail.com
Tue Nov 29 15:26:44 CET 2016
Two quick hints:
* use simplifyDataFrame = FALSE in fromJSON()
* read https://jennybc.github.io/purrr-tutorial/ls02_map-extraction-advanced.html
(and https://jennybc.github.io/purrr-tutorial/)
Hadley
On Tue, Nov 29, 2016 at 8:06 AM, Daniel Bastos <dbastos at toledo.com> wrote:
> Greetings!
>
> In an SQL table, I have a column that contains a JSON. I'd like easy
> access to all (in an ideal world) of these JSON fields. I started out
> trying to get all fields from the JSON and so I wrote this function.
>
> unfold.json <- function (df, column)
> {
> library(jsonlite)
> ret <- data.frame()
>
> for (i in 1:nrow(df)) {
> js <- fromJSON(df[i, ][[column]])
> ret <- rbind(ret, cbind(df[i, ], js))
> }
>
> ret
> }
>
> It takes a data frame and a column-string where the JSON is to be
> found. It produces a new RET data frame with all the rows of DF but
> with new columns --- extracted from every field in the JSON.
>
> (The performance is horrible.)
>
> fromJSON sometimes produces a list that sometimes contains a data frame.
> As a result, I end up getting a RET data frame with duplicated rows.
> Here's what happens.
>
>> nrow(df)
> [1] 1
>
>> nrow(unfold.json(df, "response"))
> [1] 3
> Warning messages:
> 1: In data.frame(CreateUTC = "2016-11-29 02:00:43", Payload = list( :
> row names were found from a short variable and have been discarded
> 2: In data.frame(..., check.names = FALSE) :
> row names were found from a short variable and have been discarded
>>
>
> I expected a data frame with 1 row. The reason 3 rows is produced is
> because in the JSON there's an array with 3 rows.
>
>> fromJSON(df$response)$RawPayload
> [1] 200 1 128
>
> I have also cases where fromJSON(df$response)$Payload$Fields is a data
> frame containing various rows. So unfold.json produces a data frame
> with these various rows.
>
> So I gave up on this general approach.
>
> (*) My humble approach
>
> For the moment I'm not interested in RawPayload nor Payload$Fields, so I
> nullified them in this new approach. To improve performance, I guessed
> perhaps merge() would help and I think it did, but this was not at all a
> decision thought out.
>
> unfold.json.fast <- function (df, column)
> {
> library(jsonlite)
> ret <- data.frame()
> if (nrow(df) > 0) {
> for (i in 1:nrow(df)) {
> ls <- fromJSON(df[i, ][[column]])
> ls$RawPayload <- NULL
> ls$Payload$Fields <- NULL
> js <- data.frame(ls)
> ret <- rbind(ret, merge(df[i, ], js))
> }
> }
>
> ret
> }
>
> I'm looking for advice. How would you approach this problem?
>
> Thank you!
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
--
http://hadley.nz
More information about the R-help
mailing list