[R] reshaping data

Mia Bengtsson miamynta at gmail.com
Fri May 21 19:48:21 CEST 2010


Yes, that works beautifully on both the test dataset and my real dataset. This was exactly what I was looking for. Thank you!

/ Mia

On May 21, 2010, at 6:10 PM, William Dunlap wrote:

> 
>> -----Original Message-----
>> From: r-help-bounces at r-project.org 
>> [mailto:r-help-bounces at r-project.org] On Behalf Of Mia Bengtsson
>> Sent: Friday, May 21, 2010 3:39 AM
>> To: Dennis Murphy; Henrique Dallazuanna
>> Cc: r-help at r-project.org
>> Subject: Re: [R] reshaping data
>> 
>> Thank you Dennis and Henrique for your help!
>> 
>> Both solutions work! I just need to find a way of removing 
>> the empty "cells" from the final "long" dataframe since they 
>> are not NAs. 
>> 
>> Maybe there is an easier way of doing this if the data is not 
>> treated as a dataframe? The original data file that is 
>> derived from another program (mothur) is a textfile with the 
>> following format:
>> 
>> red \t A,B,C
>> green \t D
>> blue \t E,F
>> 
>> The first column "species" is separated from the 
>> "sequences"(A, B, C...) with tab, and then the "sequences" 
>> are separated from each other with commas.
>> 
>> I imported into R as what I thought was a dataframe using:
>> 
>> test1<-readLines("path/test")
>> test2<-gsub(pattern="\t", replacement=",", x=test1)
>> test3<-textConnection(test2)
>> test.df<-read.csv(test3, header=F)
>> 
>> Should I rather have imported it as something else if I want 
>> to reshape it into a list as described previously?
> 
> Does the following do what you want, where my "txt" should
> resemble the output of your test1, the output of
> readLines("path/test")?
> 
>> txt <- c("red \t A,B,C", "green \t D", "blue \t E,F")
>> f <- function (textLines) {
>    tmp <- strsplit(textLines, " *\t *")
>    letters <- strsplit(vapply(tmp, FUN = `[`, 2, FUN.VALUE = ""), 
>        ",")
>    numLetters <- vapply(letters, FUN = length, FUN.VALUE = 0L)
>    data.frame(Species = rep(vapply(tmp, FUN = `[`, 1, FUN.VALUE = ""), 
>        numLetters), Letter = unlist(letters))
> }
>> f(txt)
>  Species Letter
> 1     red      A
> 2     red      B
> 3     red      C
> 4   green      D
> 5    blue      E
> 6    blue      F
> 
> vapply() is new in R 2.11.? and is like sapply but lets
> you specify what the return value of FUN is expected to
> be.  Thus it gives you some error checking, saves some
> time over sapply, and works nicely when the length of the
> input is 0.  If you don't have 2.11, replace vapply with sapply
> and remove the FUN.VALUE argument.
> 
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com 
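
For readers on an older R without vapply(), the substitution described above might look like the sketch below. The name f_sapply and the intermediate variable names are illustrative only and are not from the thread; stringsAsFactors = FALSE is added so the columns stay character. Note that sapply() returns an empty list on zero-length input, so this version lacks the empty-input robustness mentioned above.

```r
# Sketch of the sapply() fallback; f_sapply is a hypothetical name.
f_sapply <- function(textLines) {
    # Split each line into "species" and "comma-separated sequences"
    tmp <- strsplit(textLines, " *\t *")
    species <- sapply(tmp, `[`, 1)
    # Split the second field on commas: one character vector per species
    seqs <- strsplit(sapply(tmp, `[`, 2), ",")
    numSeqs <- sapply(seqs, length)
    # Repeat each species name once per sequence it owns
    data.frame(Species = rep(species, numSeqs),
               Letter = unlist(seqs),
               stringsAsFactors = FALSE)
}

txt <- c("red \t A,B,C", "green \t D", "blue \t E,F")
result <- f_sapply(txt)
print(result)
```
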
>> 
>> Thanks a million!
>> 
>> / Mia Bengtsson
>> 
>> 
>> On May 21, 2010, at 2:15 AM, Dennis Murphy wrote:
>> 
>>> Hi:
>>> 
>>> 
>>> On Thu, May 20, 2010 at 10:13 AM, Mia Bengtsson <mia.bengtsson at bio.uib.no> wrote:
>>> Hello,
>>> 
>>> I am a relatively new R-user who has a lot to learn. I have a large dataset that is in the following dataframe format:
>>> 
>>> red             A       B       C
>>> green   D
>>> blue    E       F
>>> 
>>> This isn't a data frame in R - if it were, it would have NA (or at least ""/" " padding) at the end of each row.
>>> Data frames are not ragged arrays. To have this type of structure in R, the data would have to be in a list.
>>> 
>>> This matters because Henrique's solution with reshape() assumes a data frame as input. A similar solution
>>> would be to use melt() in the reshape package, something like
>>> 
>>> library(reshape)
>>> longdf <- melt(yourdf, id.vars = 'species')
>>> longdf
>>> 
>>> If you have NA padding, the way to get rid of them in the reshaped data frame is (with the above approach)
>>> 
>>> longdf[!is.na(longdf$value), names(longdf) != "variable"]
>>> 
>>> If the padding is with blanks, then Henrique's solution works here, too.
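
For the blank-padding case, one way to drop the empty cells in base R is sketched below. The data frame wide is a made-up stand-in for what read.csv(header = FALSE) might produce from the test file, and the column names V1 to V4 are assumed, not taken from the thread.

```r
# Hypothetical wide data frame padded with "" rather than NA
wide <- data.frame(V1 = c("red", "green", "blue"),
                   V2 = c("A", "D", "E"),
                   V3 = c("B", "", "F"),
                   V4 = c("C", "", ""),
                   stringsAsFactors = FALSE)

# Stack the sequence columns column by column, repeating the species
# column once for each sequence column
long <- data.frame(species = rep(wide$V1, times = ncol(wide) - 1),
                   value = unlist(wide[-1], use.names = FALSE),
                   stringsAsFactors = FALSE)

# Drop cells that are empty or contain only whitespace
long <- long[nzchar(trimws(long$value)), ]

# Restore the original species order and tidy the row names
long <- long[order(match(long$species, wide$V1)), ]
rownames(long) <- NULL
print(long)
```
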
>>> 
>>> HTH,
>>> Dennis
>>> 
>>> 
>>> Where red, green and blue are "species" names and A, B and C are observations (corresponding to DNA sequences). Each observation can only belong to one species. I would like to list the observations in one column, with the species they belong to in the next. Like this:
>>> 
>>> A       red
>>> B       red
>>> C       red
>>> D       green
>>> E       blue
>>> F       blue
>>> 
>>> I have tried using reshape() and stack() but I cannot get my head around it. Any help is highly appreciated!
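
Since stack() is mentioned here: it can produce this layout if the ragged data is first held as a named list rather than a data frame. The sketch below assumes the tab-and-comma text format described elsewhere in the thread; the variable names and the output column names sequence and species are illustrative.

```r
# Lines in the "species <tab> seq1,seq2,..." format (sample data)
txt <- c("red \t A,B,C", "green \t D", "blue \t E,F")

# Split off the species name, then split the sequences on commas
parts <- strsplit(txt, " *\t *")
seqs <- strsplit(sapply(parts, `[`, 2), ",")
names(seqs) <- sapply(parts, `[`, 1)

# stack() turns a named list of vectors into two columns:
# the values and the name of the list element each came from
long <- stack(seqs)
names(long) <- c("sequence", "species")
print(long)
```
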
>>> 
>>> Thanks in advance,
>>> __________________________________
>>> 
>>> Mia Bengtsson, PhD-student
>>> Department of Biology
>>> University of Bergen
>>> +47 55584715
>>> +47 97413634
>>> mia.bengtsson at bio.uib.no
>>> 
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>> 
>> 
>> 
>> 
