[R] Transforming a dataframe into a response/predictor matrix

cls59 chuck at sharpsteen.net
Fri Nov 13 02:44:53 CET 2009



Ki L. Matlock wrote:
> 
> I currently have a data frame whose rows correspond to each student and
> whose columns are different variables for the student, as shown below:
> 
>  Lastname Firstname CATALOG_NBR           Email StudentID   EMPLID    
> Start
> 1     alastname     afirstname        1213 *@uark.edu  10295236 #
> 12/2/2008
> 2     anotherlastname     anotherfirstname        1213 **@uark.edu  ##
> 10295236 9/3/2008
>   Xattempts Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16 Q17 Q18
> Q19
> 1         1  1  1  0  0  0  0  0  0  0   1   0   0   1   1   0   1   1   0  
> 1
> 2         1  1  1  1  1  1  0  1  0  0   1   1   0   0   1   0   0   0   0  
> 1
>   Q20 Q21 Q22 Q23 Q24 Q25 Q26 Q27 Q28 Q29 Q30 Q31 Q32 Score Form
> CRSE_GRADE_OFF
> 1   0   0   0   0   0   0   0   0   0   1   0   0   0     9    E             
> D
> 2   0   0   0   0   0   0   0   0   0   0   1   1   0    13    G             
> D
> 

Thanks for providing test data-- however, this sort of format is difficult to
work with, as email tends to mangle the line wrapping.  It took me about 5
minutes and the combined powers of Vim and OpenOffice to reflow and
re-export the example into a format that R could ingest.  And I probably
made a mistake somewhere along the way.  Nothing wrong with providing data
like this-- but it probably limits the number of people who are willing to
give your problem a try.

A good way to share test data frames if they contain a lot of rows/columns
is to dump them using the dput() function.  This encodes the data in the
following format:


structure(list(Lastname = structure(1:2, .Label = c("alastname", 
"anotherlastname"), class = "factor"), Firstname = structure(1:2, .Label =
c("afirstname", 
"anotherfirstname"), class = "factor"), CATALOG_NBR = c(1213L, 
1213L), Email = structure(1:2, .Label = c("*@uark.edu", "**@uark.edu"
), class = "factor"), StudentID = structure(c(2L, 1L), .Label = c("##", 
"10295236"), class = "factor"), EMPLID = structure(1:2, .Label = c("#", 
"10295236"), class = "factor"), Start = structure(c(14215, 14125
), class = "Date"), Xattempts = c(1L, 1L), Q1 = c(1L, 1L), Q2 = c(1L, 
1L), Q3 = 0:1, Q4 = 0:1, Q5 = 0:1, Q6 = c(0L, 0L), Q7 = 0:1, 
    Q8 = c(0L, 0L), Q9 = c(0L, 0L), Q10 = c(1L, 1L), Q11 = 0:1, 
    Q12 = c(0L, 0L), Q13 = c(1L, 0L), Q14 = c(1L, 1L), Q15 = c(0L, 
    0L), Q16 = c(1L, 0L), Q17 = c(1L, 0L), Q18 = c(0L, 0L), Q19 = c(1L, 
    1L), Q20 = c(0L, 0L), Q21 = c(0L, 0L), Q22 = c(0L, 0L), Q23 = c(0L, 
    0L), Q24 = c(0L, 0L), Q25 = c(0L, 0L), Q26 = c(0L, 0L), Q27 = c(0L, 
    0L), Q28 = c(0L, 0L), Q29 = c(1L, 0L), Q30 = 0:1, Q31 = 0:1, 
    Q32 = c(0L, 0L), Score = c(9L, 13L), Form = structure(1:2, .Label =
c("E", 
    "G"), class = "factor"), CRSE_GRADE_OFF = structure(c(1L, 
    1L), .Label = "D", class = "factor")), .Names = c("Lastname", 
"Firstname", "CATALOG_NBR", "Email", "StudentID", "EMPLID", "Start", 
"Xattempts", "Q1", "Q2", "Q3", "Q4", "Q5", "Q6", "Q7", "Q8", 
"Q9", "Q10", "Q11", "Q12", "Q13", "Q14", "Q15", "Q16", "Q17", 
"Q18", "Q19", "Q20", "Q21", "Q22", "Q23", "Q24", "Q25", "Q26", 
"Q27", "Q28", "Q29", "Q30", "Q31", "Q32", "Score", "Form", "CRSE_GRADE_OFF"
), row.names = c(NA, -2L), class = "data.frame")


Not very pretty, but this format is more resistant to email mangling and can
generally be copied/pasted into an R session-- saves all the monkey business
with Vim/OpenOffice/Excel/whatever.
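
As a minimal sketch of that round trip (the two-row data frame "df" below is
a made-up stand-in, not the poster's data), the text printed by dput() can be
parsed straight back into an equivalent object:

```r
# Hypothetical two-row data frame standing in for the real data.
df <- data.frame(Form = factor(c("E", "G")), Score = c(9L, 13L))

# Print the text representation that would be pasted into an email.
dput(df)

# Round trip: capture the dput() text and rebuild the object from it.
txt <- paste(capture.output(dput(df)), collapse = "\n")
rebuilt <- eval(parse(text = txt))

# The rebuilt object should match the original, factor levels included.
stopifnot(isTRUE(all.equal(df, rebuilt)))
```

On the receiving end, pasting the dput() text after "studentData <-" at the R
prompt has the same effect as the eval(parse()) step shown here.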



Ki L. Matlock wrote:
> 
> 
> Each student took a pre- and post- test indicated by the date under
> "Start", column 7.  (A date, mm/dd/yyyy, whose mm is 08 or 09 is pre-test;
> a date whose mm is 11 or 12 is post-test.)  This test was one of four
> forms, E, F, G, or H, listed under "Form", column 42. Each test had 32
> questions, Q1 to Q32, with a binary 1 indicating the student answered
> correctly to this question and 0 if incorrectly.
> 
> I am needing a matrix, y, with five columns labeled: response, i, j, r, s. 
> Column 1 indicates the response (0 or 1) for i-th student, on the j-th
> question (1:32), on the r-th form (E,F,G,H- these could be changed to
> numeric 1 for E, 2 for F, etc.), on the s-th test (pre or post indicated
> by a binary 0 for pre, 1 for post).
> 
> The data-set is very lengthy of approximately 2000 rows.  An efficient way
> to transform this data into the desired matrix would be very helpful. 
> Thank you.
> 
> 

The melt() function from Hadley Wickham's 'reshape' package can probably
take care of this for you.  Assuming the data.frame is named "studentData",
the following might process your data the way you want it:

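  # library() stops with an error if the package is missing, whereas
  # require() only warns and carries on.
  library( reshape )

  # Retrieve the names of all columns holding responses to questions
  # (Q1 to Q32); value = TRUE makes grep() return the names themselves.
  questions <- grep( '^Q[0-9]+$', names( studentData ), value = TRUE )

  # Stack the question columns into one long column, keeping StudentID
  # and Form as identifiers for each row.
  testBreakdown <- melt( studentData, c( 'StudentID', 'Form' ), questions,
                         variable_name = 'Question' )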


The first argument after the name of the data set (id.vars) specifies the
names of those columns that identify each observation. The next argument
(measure.vars) specifies the names of the columns that contain the data we
are interested in. testBreakdown is now a data.frame with one row per
student per question, containing:

  A column labeled "StudentID" -- contains the ID of the student.
  A column labeled "Form" -- contains the code of the form they used.
  A column labeled "Question" -- contains the name of the question they
    answered.  The default name for this column is "variable", but I
    overrode it by setting variable_name in the above call to melt().
  A column labeled "value" -- contains the result of the student's answer
    to the given question.

I was not able to figure out which part of your data.frame contained
information concerning the "s-th" test taken by a student-- maybe it got
lost in translation.  Anyway, if the column names and order you gave above
are important, then all you need to do is rename and reorder the columns of
testBreakdown.
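
The quoted question does say that the month in "Start" separates pre-tests
(08 or 09) from post-tests (11 or 12), so one possible way to finish the job
is sketched below.  This is only an illustration under assumptions: the toy
"studentData" is invented, "long" and "y" are names chosen here, and the
numeric codes for i and r are one reasonable choice rather than anything the
original poster specified.

```r
library(reshape)

# Toy stand-in for studentData: two students, three questions (made-up values).
studentData <- data.frame(
  StudentID = c(101L, 102L),
  Start     = as.Date(c("2008-12-02", "2008-09-03")),
  Form      = factor(c("E", "G"), levels = c("E", "F", "G", "H")),
  Q1 = c(1L, 1L), Q2 = c(1L, 1L), Q3 = c(0L, 1L)
)

questions <- grep("^Q[0-9]+$", names(studentData), value = TRUE)

# Carry Start along as an id variable so that it survives the melt.
long <- melt(studentData, id.vars = c("StudentID", "Form", "Start"),
             measure.vars = questions, variable_name = "Question")

# Assemble the five requested columns, in the requested order.
y <- as.matrix(data.frame(
  response = long$value,
  i = as.integer(factor(long$StudentID)),        # student index 1, 2, ...
  j = as.integer(sub("^Q", "", long$Question)),  # question number
  r = as.integer(long$Form),                     # E = 1, F = 2, G = 3, H = 4
  s = as.integer(format(long$Start, "%m") %in% c("11", "12"))  # 0 = pre, 1 = post
))
```

If a student appears once for the pre-test and once for the post-test, both
rows share the same StudentID, so they end up with the same index i and
differ only in s.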


Hope this helps!

-Charlie

-----
Charlie Sharpsteen
Undergraduate
Environmental Resources Engineering
Humboldt State University
-- 
View this message in context: http://old.nabble.com/Transforming-a-dataframe-into-a-response-predictor-matrix-tp26328345p26328719.html
Sent from the R help mailing list archive at Nabble.com.



