[R] importing spreadsheet data - linera regression - panel data

Mon Apr 20 17:31:50 CEST 2009

Dear Cecilia,

just adding some examples to Stefan's post, which says everything
already. I've recently gone mad with reshaping, so I assume it is a
little tricky. Or maybe what I tell you is obvious, then just skip it.

**import**

Your files are spreadsheets, so the best way to import is to save them
in a .csv or maybe in a .txt file (File>Save as>[choose 'Tab delimited'
format]) and then do as Stefan said with read.csv() or, if you saved in
tab delimited, read.table(..., sep="\t"). See help("read.table") or
?read.table which is the same.

Specifically, looks like you already exported them as tab delimited. So
something like 

yourdata <- read.table(file="yourfile.txt", sep="\t", header=TRUE)
should work.

**reshape**

Now 'yourdata' is a data.frame object. Then you reshape() them, as said,
from 'wide' to 'long' format.
You need to do something like

yourreshapeddata <- reshape(yourdata, direction="long",
varying=list(paste("revenue", 2007:1998, sep="")))

i.e., 'varying' must be the list of the vectors (here: only one vector!)
of column names corresponding to the time-varying variables to be
stacked. (Hint: see what 'paste("revenue", 2007:1998, sep=""))' does).

An example on the Grunfeld data just to clarify:
data(Grunfeld, package="Ecdat") # data start in 'long' format, the one
you need
## make them 'wide'
pippo<-reshape(Grunfeld, direction="wide", timevar="year", idvar="firm")

## see what you got
fix(pippo) 
## now make them 'long' again (what you need!)
pluto<-reshape(pippo, direction="long",
varying=list(paste("inv",1935:1954,sep="."),paste("value",1935:1954,sep=
"."),paste("capital",1935:1954,sep=".")))
## see how it worked. Notice the "id" and "time" variables it produced.
fix(pluto)

(Nevermind the names, Pippo and Pluto are Goofy and his dog in Italian.
Stands for "Foo" etc.)

**merge**

You say, you have one variable per spreadsheet/txt-file. Then as soon as
you have as many 'long' dataframes, let's say they are called 'revenue',
'cost', etc., you might want to merge() them two by two to make your
final dataframe, along the lines of

yourfinaldata <- merge("revenue", "cost", by=c("id", "time"))

**modeling**

Now your data are ready to do, say,

yourmodel <- lm(revenue~cost, data=yourfinaldata)

Notice that if you want to use 'plm' for panel data models, you will
want to have "id" and "time" in the first two columns, or alternatively
to specify an 'index' (see ?plm in library(plm)). Most "panel" models
can also be very effectively estimated with the 'nlme' or 'lme4'
packages, although the syntax is slightly more complicated.

Just to be sure, I am adding a small reproducible example on fake data
that should resemble what you have to do:

## begin example ##
## make fake 'wide' data for variable 'revenue'
pippo1<-matrix(rnorm(12),ncol=4)
pippo1<-as.data.frame(pippo1)
pippo1[,1]<-1:3
dimnames(pippo1)[[2]]<-c("firm",paste("revenue",2002:2000, sep=""))
## reshape
pippo1<-reshape(pippo1, direction="long", drop="firm",
varying=list(paste("revenue",2002:2000,sep="")))
## ...and have a look:
print(pippo1)

## make fake 'wide' data for variable 'cost'
pippo2<-matrix(rnorm(12),ncol=4)
pippo2<-as.data.frame(pippo2)
pippo2[,1]<-1:3
dimnames(pippo2)[[2]]<-c("firm",paste("cost",2002:2000, sep=""))
## reshape
pippo2<-reshape(pippo2, direction="long", drop="firm",
varying=list(paste("cost",2002:2000,sep="")))

## merge them (R is smart enough to guess the names of 'by' variables)
alldata<-merge(pippo1, pippo2)
## see what happened:
print(alldata)

## fix var names
dimnames(alldata)[[2]][3:4]<-c("revenue","cost")

## estimate a linear model
yourmod<-lm(cost~revenue, data=alldata)
summary(yourmod)

## to do panel models (with 'plm'):
## e.g., a fixed effects model
yourFEmod<-plm(cost~revenue, data=alldata, index=c("id","time"))
## end example ##

Original message:
------------------------------
Date: Sun, 19 Apr 2009 12:30:10 +0100
From: "Cecilia Carmo" <cecilia.carmo at ua.pt>
Subject: [R] importing spreadsheet data - linera regression - panel
	data
To: r-help at r-project.org
Message-ID: <web-74601530 at controller2.cgpmail.ua.pt>
Content-Type: text/plain;charset=iso-8859-1;format="flowed"

Hi everyone and thank you for the help you could give me.

My data is in a spreadsheet. The 1st column identifies the 
firm (with the fiscal number), the columns 2 to 11 have 
the variable value for 11 years. I have many variables 
(files like this). Each file has about 40.000 firms 
(rows). I transformed all the files in txt files. The data 
is a panel data, like this:
firm	revenu2007	revenue2006	revenue2005	revenue2004
revenue2003	revenue2002	revenue2001	revenue2000
revenue1999	revenue1998		
500100144

504394029		4282809	3769159	3520807	3548322	3458122

503264032

502011475		2595780	2417433	2299563	2060552	1804531	1821638
1789533	1463371	947712
500400911

504615947		22801	28656	27067	26182	26356	34060
39147		
502616695		1412354	1209619	1429755	1623496	1955123	2273486
2087406	2076868	2036937
500829993		1383396	1095570	805830	793809	777591	791614
779924	774560	702845
The objective of my work is to do linear regressions with 
the variables in the files or with other variables that I 
can obtain from those by doing some mathematical 
operations.  I?ve already tried to import this information 
to an array in R, but I?ve seen that the linear 
regressions functions need the data into a dataframe. So 
I?m asking: How to import this information to a dataframe 
in R, in a manner that I can easily do the mathematical 
operations between the variables and then use it to do the 
regressions?

Cec?lia Carmo

**********
I hope it helps. Please if you use 'plm' let me know how it fares with
40.000x10 data points. Sure 'nlme' and 'lme4' can handle this.

Best wishes,
Giovanni

Giovanni Millo
Research Dept.,
Assicurazioni Generali SpA
Via Machiavelli 4, 
34132 Trieste (Italy)
tel. +39 040 671184 
fax  +39 040 671160

Ai sensi del D.Lgs. 196/2003 si precisa che le informazi...{{dropped:13}}