[R] How to count number of year per firm in panel data?

Wed Feb 11 18:38:26 CET 2009

on 02/11/2009 10:43 AM Johannes Habel wrote:
> Hello,
> 
> I have an unbalanced panel dataset and would like to exclude all objects
> that don't appear at least x times.
> 
> Therefore, I would like to include a column indicating for every line how
> many periods are available, e.g.
> 
> id, year, number
> 1, 2000, 3
> 1, 2001, 3
> 1, 2002, 3
> 2, 2001, 1
> 3, ..., ...
> 
> This would allow me to exclude companies by setting "subset=number>=x".
> However, I don't know how to create this column, i.e. how to count the years
> for each object and include the numbers into the dataset.
> 
> Could anybody help me, please?
> 
> Alternatively, is there an easier way to achieve my goal?
> 
> Thank you very much.
> 
> Johannes Habel

You don't need to add the extra column. You can just create a frequency
table of the unique 'id' values, get the subset of values that meet your
count criteria and then use those values in subset().

Let's create a little larger dataset:

set.seed(1)
id <- sample(letters[1:4], 8, replace = TRUE)

> id
[1] "b" "b" "c" "d" "a" "d" "d" "c"

years <- unlist(lapply(split(id, id),
                       function(i) 2000:(2000 + length(i) - 1)))

> years
   a   b1   b2   c1   c2   d1   d2   d3
2000 2000 2001 2000 2001 2000 2001 2002

DF <- data.frame(id = sort(id), year = years)

> DF
   id year
a   a 2000
b1  b 2000
b2  b 2001
c1  c 2000
c2  c 2001
d1  d 2000
d2  d 2001
d3  d 2002

> subset(DF, id %in% names(which(table(DF$id) >= 3)))
   id year
d1  d 2000
d2  d 2001
d3  d 2002

Step by step:

> table(DF$id)

a b c d
1 2 2 3

> table(DF$id) >= 3

    a     b     c     d
FALSE FALSE FALSE  TRUE

> which(table(DF$id) >= 3)
d
4

> names(which(table(DF$id) >= 3))
[1] "d"

Then use subset() as above, filtering only those id's that are in the
names from the table. If we change the requirement to >= 2:

> subset(DF, id %in% names(which(table(DF$id) >= 2)))
   id year
b1  b 2000
b2  b 2001
c1  c 2000
c2  c 2001
d1  d 2000
d2  d 2001
d3  d 2002

See ?table, ?names, ?which and ?"%in%"

If you really need to add the column, you could use aggregate() to get a
count of years for each id as a data frame, then use merge() to add the
column to DF:

> aggregate(DF$year, list(id = DF$id), length)
  id x
1  a 1
2  b 2
3  c 2
4  d 3

> merge(DF, aggregate(DF$year, list(id = DF$id), length), by = "id")
  id year x
1  a 2000 1
2  b 2000 2
3  b 2001 2
4  c 2000 2
5  c 2001 2
6  d 2000 3
7  d 2001 3
8  d 2002 3

Then use subset() as you initially considered.

See ?aggregate and ?merge

HTH,

Marc Schwartz