[R] Reading Multi-value data fields for descriptive analysis

jim holtman jholtman at gmail.com
Mon Jul 14 02:45:29 CEST 2008


Is one of the rows NULL?  Do an 'str(x)' show?  The example you sent
seems to work with the code.  Are you reading in a different set of
data?  I think I know what happened.  I shortened the names on your
example so it was easier to access.  Here is the data I used:

ID|name|picnic|food|other
1|Yogi Bear|Yes|Hamburgers;#Hot Dogs;#I rely on others to bring the
good stuff|"Softball;#Blanket;#I bring boo-boo, but he hides"
2|Boo-Boo|Yes|Potato Salad;#Cole Slaw;#whatever Yogi doesn't eat|Lawn
Chairs;#Blanket;#my running shoes
3|Ranger Rick|No|I told you I don't picnic|a big net and handcuffs
4|Magilla Gorilla|Yes|Hamburgers;#Hot Dogs;#Potato Salad;#Cole
Slaw;#BBQ Chicken|Softball;#Volleyball;#Lawn Chairs;#Blanket
5|Foghorn Leghorn|Yes|"Hot Dogs;#Cole Slaw;#I say, I say, BBQ
Chicken?"|Softball;#Blanket
6|Peter Potamus|Yes|"Hamburgers;#Hot Dogs;#anything, just a lot of
it"|Softball;#Lawn Chairs;#hot air balloon
7|Jonny Quest|No|too busy getting into and out of trouble|Hadji and Bandit
8|"Fleegle, Bingo, Drooper and Snorky"|Yes|Hamburgers;#Hot
Dogs;#Potato Salad;#Cole Slaw;#A banana split|a laugh track
9|George Jetson|No|Mr. Spacely is making me work|Lawn
Chairs;#Blanket;#my flying car
10|Snagglepuss|Yes|Hamburgers;#Hot Dogs;#Potato Salad;#Cole Slaw;#BBQ
Chicken|Softball;#Heavens to Murgatroyd!  Exit stage left!


Here is the code with a few more comments.  The basic structure was to
loop through the three data columns since they had the same format.

x <- read.table("/tempxx.txt", comment="", quote="", sep="|",
header=TRUE, as.is=TRUE)
# split out by name. the 'lapply' will cycle through for each row in
the data and
# the index of the row is passed to the '.row' parameter of the function
z <- lapply(seq(nrow(x)), function(.row){
    # this sets the result to NULL so that we can accumulate the data
as it is processed
    .result <- NULL
    # construct the data output
    # the three columns were shorted to the following names.  the
'for' loop will
    # iterate through each of the three columns, taking the data in
the columns and
    # spliting them by your separator ';#'
    for (i in c('picnic', 'food', 'other')){
        # this will access the specific column (given in the variable 'i')
        .split <- strsplit(x[.row,][[i]], ";#")
        # this appended on to '.result' the contents of this column
after creating
        # the three columns of data which are the name, the column ID
and the value
        # from that column
        .result <- rbind(.result, cbind(name=x[.row,][['name']],
field=i, value=unlist(.split)))
    }
    .result   # return the result
})


z



On Sun, Jul 13, 2008 at 7:25 PM, Hohm, Dale <dale.hohm at hp.com> wrote:
> Thanks Jim,
>
> I wish I were comfortable enough with the language for the fix needed to the syntax to be obvious, but it is not yet.  With your example, I get:
>
>        Error in strsplit(x[.row, ][[i]], ";#") : non-character argument
>
> x appears to be filled properly, but z is not due to the error.
>
> Also, if you were willing to provide some brief annotation or describe the overall logic in the code you supplied it would help me immensely.
>
> Thanks,
>
> Dale
>
> -----Original Message-----
> From: jim holtman [mailto:jholtman at gmail.com]
> Sent: Sunday, July 13, 2008 6:35 AM
> To: Hohm, Dale
> Cc: r-help at r-project.org
> Subject: Re: [R] Reading Multi-value data fields for descriptive analysis
>
> This may do what you want:
>
>> x <- read.table("/tempxx.txt", comment="", quote="", sep="|", header=TRUE, as.is=TRUE)
>> # split out by name
>> z <- lapply(seq(nrow(x)), function(.row){
> +     .result <- NULL
> +     # construct the data output
> +     for (i in c('picnic', 'food', 'other')){
> +         .split <- strsplit(x[.row,][[i]], ";#")
> +         .result <- rbind(.result, cbind(name=x[.row,][['name']],
> field=i, value=unlist(.split)))
> +     }
> +     .result
> + })
>>
>>
>> z
> [[1]]
>     name        field    value
> [1,] "Yogi Bear" "picnic" "Yes"
> [2,] "Yogi Bear" "food"   "Hamburgers"
> [3,] "Yogi Bear" "food"   "Hot Dogs"
> [4,] "Yogi Bear" "food"   "I rely on others to bring the good stuff"
> [5,] "Yogi Bear" "other"  "\"Softball"
> [6,] "Yogi Bear" "other"  "Blanket"
> [7,] "Yogi Bear" "other"  "I bring boo-boo, but he hides\""
>
> [[2]]
>     name      field    value
> [1,] "Boo-Boo" "picnic" "Yes"
> [2,] "Boo-Boo" "food"   "Potato Salad"
> [3,] "Boo-Boo" "food"   "Cole Slaw"
> [4,] "Boo-Boo" "food"   "whatever Yogi doesn't eat"
> [5,] "Boo-Boo" "other"  "Lawn Chairs"
> [6,] "Boo-Boo" "other"  "Blanket"
> [7,] "Boo-Boo" "other"  "my running shoes"
>
> [[3]]
>     name          field    value
> [1,] "Ranger Rick" "picnic" "No"
> [2,] "Ranger Rick" "food"   "I told you I don't picnic"
> [3,] "Ranger Rick" "other"  "a big net and handcuffs"
>
> [[4]]
>      name              field    value
>  [1,] "Magilla Gorilla" "picnic" "Yes"
>  [2,] "Magilla Gorilla" "food"   "Hamburgers"
>  [3,] "Magilla Gorilla" "food"   "Hot Dogs"
>  [4,] "Magilla Gorilla" "food"   "Potato Salad"
>  [5,] "Magilla Gorilla" "food"   "Cole Slaw"
>  [6,] "Magilla Gorilla" "food"   "BBQ Chicken"
>  [7,] "Magilla Gorilla" "other"  "Softball"
>  [8,] "Magilla Gorilla" "other"  "Volleyball"
>  [9,] "Magilla Gorilla" "other"  "Lawn Chairs"
> [10,] "Magilla Gorilla" "other"  "Blanket"
>
>
>
> On Sun, Jul 13, 2008 at 12:56 AM, Hohm, Dale <dale.hohm at hp.com> wrote:
>> Thanks for the reply Jim.
>>
>> Here is a representation of the data I want to analyze - 10 records as requested.  Each line can easily include an ID number as below.
>>
>> So I want to determine a frequency or percentage of respondents that bring each of the 5 foods (Hamburgers, Hot Dogs, Potato Salad, Cole Slaw and BBQ Chicken) and how many "Other" write-ins there are.  The same for what else is brought besides food (Softball, Volleyball, Lawn Chairs and Blanket) as well as a count of "Other" write-ins.  I'll also need to be able to discern how many brought Hambergers AND a Blanket or how many brought a Softball AND a Vollyball etc.
>>
>> ID|Your Name|Do you picnic?|What is your favorite picnic food?|What do you bring besides food?
>> 1|Yogi Bear|Yes|Hamburgers;#Hot Dogs;#I rely on others to bring the good stuff|"Softball;#Blanket;#I bring boo-boo, but he hides"
>> 2|Boo-Boo|Yes|Potato Salad;#Cole Slaw;#whatever Yogi doesn't eat|Lawn Chairs;#Blanket;#my running shoes
>> 3|Ranger Rick|No|I told you I don't picnic|a big net and handcuffs
>> 4|Magilla Gorilla|Yes|Hamburgers;#Hot Dogs;#Potato Salad;#Cole Slaw;#BBQ Chicken|Softball;#Volleyball;#Lawn Chairs;#Blanket
>> 5|Foghorn Leghorn|Yes|"Hot Dogs;#Cole Slaw;#I say, I say, BBQ Chicken?"|Softball;#Blanket
>> 6|Peter Potamus|Yes|"Hamburgers;#Hot Dogs;#anything, just a lot of it"|Softball;#Lawn Chairs;#hot air balloon
>> 7|Jonny Quest|No|too busy getting into and out of trouble|Hadji and Bandit
>> 8|"Fleegle, Bingo, Drooper and Snorky"|Yes|Hamburgers;#Hot Dogs;#Potato Salad;#Cole Slaw;#A banana split|a laugh track
>> 9|George Jetson|No|Mr. Spacely is making me work|Lawn Chairs;#Blanket;#my flying car
>> 10|Snagglepuss|Yes|Hamburgers;#Hot Dogs;#Potato Salad;#Cole Slaw;#BBQ Chicken|Softball;#Heavens to Murgatroyd!  Exit stage left!
>>
>> Thanks in advance,
>>
>> Dale
>>
>> -----Original Message-----
>> From: jim holtman [mailto:jholtman at gmail.com]
>> Sent: Saturday, July 12, 2008 11:32 AM
>> To: Hohm, Dale
>> Cc: r-help at r-project.org
>> Subject: Re: [R] Reading Multi-value data fields for descriptive analysis
>>
>> Can you provide a more complete example (say 10 lines) of what the
>> input is like. Does each line have a unique index that can be related
>> to it?  Do you want to summarize all the multi1-n values of Col2?  Do
>> you want to know the percentage of input lines that have a
>> Col3/multi-value4 on them?  You could read in the data as you have
>> indicated below and add a column that is the record number and
>> therefore you would have have to worry about trying to say if it
>> existed or not.  For example, you might have:
>>
>> Rec#|col#|value
>> 1|1|single
>> 1|2|multi1
>> 1|2|multi2
>> 1|3|multi1
>> 2|1|single
>> 3|1|single
>> 3|2|multi1
>> ....
>>
>> There are a number of potential ways of representing the data, but a
>> lot depends on what you want to do with it, so a more extensive
>> example of the input, along with the type of output you would like
>> will help in providing an answer.
>>
>> On Sat, Jul 12, 2008 at 12:37 PM, Hohm, Dale <dale.hohm at hp.com> wrote:
>>> Hello,
>>>
>>> I'm looking for help on the best approach to get "multi-value" data fields into R for simple descriptive analysis.
>>>
>>> -------------------------------------
>>>
>>> I am new to this list and new to R, but I really want to get over the hump and get productive with it.  Some help with how to best get the following data into R would be greatly appreciated.  I have programming experience and stale experience with SPSS.
>>
>>>
>>> I am trying to do some simple descriptive analysis (frequencies, cross-tabs) of data stored in a Microsoft SharePoint list.  The data can be accessed with ODBC or it can readily be extracted into an Excel or CSV format.  One of the challenges with the data is that it uses several "multi-value" fields (Microsoft Access provides the same data-type).
>>>
>>> By "multi-value" I mean that multiple responses are packed into a single data column; the data input form presents a question with several checkboxes and a free-format write-in response.  The individual values within the data field are separated with the two characters ";#".  So, the data would be of the following format (in CSV form with column headers and a tilde as the field separator):
>>>
>>> Column1single~Column2multi~Column3multi
>>> a sample value~C2 a multi one;#C2 a multi two~C3 a multi one;#C3 a multi two;#C3 a free-form answer
>>>
>>>
>>> The first approach that comes to mind is to explode the multi-value fields into unique bi-variate data columns and then assign a 0 or 1 to these new columns in each record based on whether that specific value was present.  This approach is complicated by the free-form answer as the unique columns could grow very large in number - it might be better to figure out how to indicate the presence of the free-form value in a data column called "Other" (or "C2 Other") and then hold the free-form value in a separate column.
>>>
>>> The data would then look like this...
>>>
>>> Column1single: a sample value
>>> C2 a multi one: 1
>>> C2 a multi two: 1
>>> C2 a multi three: 0
>>> C3 a multi one: 1
>>> C3 a multi two: 1
>>> C3 a free-form answer: 1
>>> C3 another free-form answer: 0
>>>
>>>
>>> Or in the second scenario...
>>>
>>> Column1single: a sample value
>>> C2 a multi one: 1
>>> C2 a multi two: 1
>>> C2 a multi three: 0
>>> C3 a multi one: 1
>>> C3 a multi two: 1
>>> C3 Other: 1
>>> C3 Other Text: a free-form answer
>>>
>>>
>>> I am uncertain help to read this data into R in this format, so suggestions and examples would help me greatly.
>>>
>>> This is a pretty common data packing scenario, so perhaps there are better approaches to reading this data and better ways in R to analyze it than what I have presented.  Suggestions greatly appreciated.
>>>
>>>
>>> Thanks,
>>>
>>> Dale Hohm
>>>
>>>        [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>>
>>
>> --
>> Jim Holtman
>> Cincinnati, OH
>> +1 513 646 9390
>>
>> What is the problem you are trying to solve?
>>
>
>
>
> --
> Jim Holtman
> Cincinnati, OH
> +1 513 646 9390
>
> What is the problem you are trying to solve?
>



-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?



More information about the R-help mailing list