[R] importing spreadsheet data - linera regression - panel data
Millo Giovanni
Giovanni_Millo at Generali.com
Mon Apr 20 17:31:50 CEST 2009
Dear Cecilia,
just adding some examples to Stefan's post, which says everything
already. I've recently gone mad with reshaping, so I assume it is a
little tricky. Or maybe what I tell you is obvious, then just skip it.
**import**
Your files are spreadsheets, so the best way to import is to save them
in a .csv or maybe in a .txt file (File>Save as>[choose 'Tab delimited'
format]) and then do as Stefan said with read.csv() or, if you saved in
tab delimited, read.table(..., sep="\t"). See help("read.table") or
?read.table which is the same.
Specifically, looks like you already exported them as tab delimited. So
something like
yourdata <- read.table(file="yourfile.txt", sep="\t", header=TRUE)
should work.
**reshape**
Now 'yourdata' is a data.frame object. Then you reshape() them, as said,
from 'wide' to 'long' format.
You need to do something like
yourreshapeddata <- reshape(yourdata, direction="long",
varying=list(paste("revenue", 2007:1998, sep="")))
i.e., 'varying' must be the list of the vectors (here: only one vector!)
of column names corresponding to the time-varying variables to be
stacked. (Hint: see what 'paste("revenue", 2007:1998, sep=""))' does).
An example on the Grunfeld data just to clarify:
data(Grunfeld, package="Ecdat") # data start in 'long' format, the one
you need
## make them 'wide'
pippo<-reshape(Grunfeld, direction="wide", timevar="year", idvar="firm")
## see what you got
fix(pippo)
## now make them 'long' again (what you need!)
pluto<-reshape(pippo, direction="long",
varying=list(paste("inv",1935:1954,sep="."),paste("value",1935:1954,sep=
"."),paste("capital",1935:1954,sep=".")))
## see how it worked. Notice the "id" and "time" variables it produced.
fix(pluto)
(Nevermind the names, Pippo and Pluto are Goofy and his dog in Italian.
Stands for "Foo" etc.)
**merge**
You say, you have one variable per spreadsheet/txt-file. Then as soon as
you have as many 'long' dataframes, let's say they are called 'revenue',
'cost', etc., you might want to merge() them two by two to make your
final dataframe, along the lines of
yourfinaldata <- merge("revenue", "cost", by=c("id", "time"))
**modeling**
Now your data are ready to do, say,
yourmodel <- lm(revenue~cost, data=yourfinaldata)
Notice that if you want to use 'plm' for panel data models, you will
want to have "id" and "time" in the first two columns, or alternatively
to specify an 'index' (see ?plm in library(plm)). Most "panel" models
can also be very effectively estimated with the 'nlme' or 'lme4'
packages, although the syntax is slightly more complicated.
Just to be sure, I am adding a small reproducible example on fake data
that should resemble what you have to do:
## begin example ##
## make fake 'wide' data for variable 'revenue'
pippo1<-matrix(rnorm(12),ncol=4)
pippo1<-as.data.frame(pippo1)
pippo1[,1]<-1:3
dimnames(pippo1)[[2]]<-c("firm",paste("revenue",2002:2000, sep=""))
## reshape
pippo1<-reshape(pippo1, direction="long", drop="firm",
varying=list(paste("revenue",2002:2000,sep="")))
## ...and have a look:
print(pippo1)
## make fake 'wide' data for variable 'cost'
pippo2<-matrix(rnorm(12),ncol=4)
pippo2<-as.data.frame(pippo2)
pippo2[,1]<-1:3
dimnames(pippo2)[[2]]<-c("firm",paste("cost",2002:2000, sep=""))
## reshape
pippo2<-reshape(pippo2, direction="long", drop="firm",
varying=list(paste("cost",2002:2000,sep="")))
## merge them (R is smart enough to guess the names of 'by' variables)
alldata<-merge(pippo1, pippo2)
## see what happened:
print(alldata)
## fix var names
dimnames(alldata)[[2]][3:4]<-c("revenue","cost")
## estimate a linear model
yourmod<-lm(cost~revenue, data=alldata)
summary(yourmod)
## to do panel models (with 'plm'):
## e.g., a fixed effects model
yourFEmod<-plm(cost~revenue, data=alldata, index=c("id","time"))
## end example ##
Original message:
------------------------------
Date: Sun, 19 Apr 2009 12:30:10 +0100
From: "Cecilia Carmo" <cecilia.carmo at ua.pt>
Subject: [R] importing spreadsheet data - linera regression - panel
data
To: r-help at r-project.org
Message-ID: <web-74601530 at controller2.cgpmail.ua.pt>
Content-Type: text/plain;charset=iso-8859-1;format="flowed"
Hi everyone and thank you for the help you could give me.
My data is in a spreadsheet. The 1st column identifies the
firm (with the fiscal number), the columns 2 to 11 have
the variable value for 11 years. I have many variables
(files like this). Each file has about 40.000 firms
(rows). I transformed all the files in txt files. The data
is a panel data, like this:
firm revenu2007 revenue2006 revenue2005 revenue2004
revenue2003 revenue2002 revenue2001 revenue2000
revenue1999 revenue1998
500100144
504394029 4282809 3769159 3520807 3548322 3458122
503264032
502011475 2595780 2417433 2299563 2060552 1804531 1821638
1789533 1463371 947712
500400911
504615947 22801 28656 27067 26182 26356 34060
39147
502616695 1412354 1209619 1429755 1623496 1955123 2273486
2087406 2076868 2036937
500829993 1383396 1095570 805830 793809 777591 791614
779924 774560 702845
The objective of my work is to do linear regressions with
the variables in the files or with other variables that I
can obtain from those by doing some mathematical
operations. I?ve already tried to import this information
to an array in R, but I?ve seen that the linear
regressions functions need the data into a dataframe. So
I?m asking: How to import this information to a dataframe
in R, in a manner that I can easily do the mathematical
operations between the variables and then use it to do the
regressions?
Cec?lia Carmo
**********
I hope it helps. Please if you use 'plm' let me know how it fares with
40.000x10 data points. Sure 'nlme' and 'lme4' can handle this.
Best wishes,
Giovanni
Giovanni Millo
Research Dept.,
Assicurazioni Generali SpA
Via Machiavelli 4,
34132 Trieste (Italy)
tel. +39 040 671184
fax +39 040 671160
Ai sensi del D.Lgs. 196/2003 si precisa che le informazi...{{dropped:13}}
More information about the R-help
mailing list