[R] How to count number of year per firm in panel data?
Marc Schwartz
marc_schwartz at comcast.net
Wed Feb 11 18:38:26 CET 2009
on 02/11/2009 10:43 AM Johannes Habel wrote:
> Hello,
>
> I have an unbalanced panel dataset and would like to exclude all objects
> that don't appear at least x times.
>
> Therefore, I would like to include a column indicating for every line how
> many periods are available, e.g.
>
> id, year, number
> 1, 2000, 3
> 1, 2001, 3
> 1, 2002, 3
> 2, 2001, 1
> 3, ..., ...
>
> This would allow me to exclude companies by setting "subset=number>=x".
> However, I don't know how to create this column, i.e. how to count the years
> for each object and include the numbers into the dataset.
>
> Could anybody help me, please?
>
> Alternatively, is there an easier way to achieve my goal?
>
> Thank you very much.
>
> Johannes Habel
You don't need to add the extra column. You can just create a frequency
table of the unique 'id' values, get the subset of values that meet your
count criteria and then use those values in subset().
Let's create a little larger dataset:
set.seed(1)
id <- sample(letters[1:4], 8, replace = TRUE)
> id
[1] "b" "b" "c" "d" "a" "d" "d" "c"
years <- unlist(lapply(split(id, id),
function(i) 2000:(2000 + length(i) - 1)))
> years
a b1 b2 c1 c2 d1 d2 d3
2000 2000 2001 2000 2001 2000 2001 2002
DF <- data.frame(id = sort(id), year = years)
> DF
id year
a a 2000
b1 b 2000
b2 b 2001
c1 c 2000
c2 c 2001
d1 d 2000
d2 d 2001
d3 d 2002
> subset(DF, id %in% names(which(table(DF$id) >= 3)))
id year
d1 d 2000
d2 d 2001
d3 d 2002
Step by step:
> table(DF$id)
a b c d
1 2 2 3
> table(DF$id) >= 3
a b c d
FALSE FALSE FALSE TRUE
> which(table(DF$id) >= 3)
d
4
> names(which(table(DF$id) >= 3))
[1] "d"
Then use subset() as above, filtering only those id's that are in the
names from the table. If we change the requirement to >= 2:
> subset(DF, id %in% names(which(table(DF$id) >= 2)))
id year
b1 b 2000
b2 b 2001
c1 c 2000
c2 c 2001
d1 d 2000
d2 d 2001
d3 d 2002
See ?table, ?names, ?which and ?"%in%"
If you really need to add the column, you could use aggregate() to get a
count of years for each id as a data frame, then use merge() to add the
column to DF:
> aggregate(DF$year, list(id = DF$id), length)
id x
1 a 1
2 b 2
3 c 2
4 d 3
> merge(DF, aggregate(DF$year, list(id = DF$id), length), by = "id")
id year x
1 a 2000 1
2 b 2000 2
3 b 2001 2
4 c 2000 2
5 c 2001 2
6 d 2000 3
7 d 2001 3
8 d 2002 3
Then use subset() as you initially considered.
See ?aggregate and ?merge
HTH,
Marc Schwartz
More information about the R-help
mailing list