[R] Data manipulation problem

moleps islon moleps2 at gmail.com
Tue Apr 6 15:56:16 CEST 2010


OK, next question. It is still a data manipulation problem, so I
believe the subject line still fits.

## So now I read my population data, exported from Excel as CSV.
pop<-read.csv("pop.csv")

typeof(pop) ## yields "list": rows hold the age-specific populations and
            ## each column holds one year, with the year prefixed by X

c<-(1953:2008)
names(pop)<-c
c.div<-cut(c,breaks=seq(1950,2010,by=5))

Now I'd like to sum the age-specific population over the individual
levels of -c.div- and build a new table with age-specific rows and
the 5-year bins as columns, replacing the original yearly data. Do I
have to program this from scratch, or is there an existing function
for it?
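
One possible approach, sketched on made-up data since pop.csv is not shown: split the column positions by the 5-year factor and take rowSums() within each group. The toy pop below (3 age groups, every cell equal to 1) and the names years, year.bin and pop5 are assumptions for illustration only.

```r
## toy stand-in for pop: 3 age groups (rows) x 56 yearly columns, all 1s
years <- 1953:2008
pop <- as.data.frame(matrix(1, nrow = 3, ncol = length(years)))
names(pop) <- years

## 5-year bins, using the same breaks as c.div
year.bin <- cut(years, breaks = seq(1950, 2010, by = 5))

## group the column positions by bin, then sum across the columns in each bin
pop5 <- sapply(split(seq_along(years), year.bin),
               function(i) rowSums(pop[, i, drop = FALSE]))

dim(pop5)   ## age groups in rows, 5-year bins in columns
```

With cut()'s default right-closed intervals, 1955 falls in (1950,1955]; right = FALSE would give bins of the form [1950,1955) instead. rowsum() applied to the transposed matrix should be another way to get the same sums.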


//M






qta<- table(cut(age,breaks = seq(0, 100, by = 10),include.lowest = TRUE),
            cut(year,breaks=seq(1950,2010,by=5),include.lowest=TRUE))

On Mon, Apr 5, 2010 at 10:11 PM, moleps <moleps2 at gmail.com> wrote:
>
> Thx Erik,
> I have no idea what went wrong with the other code snippet, but this one works.. Appreciate it.
>
> qta<- table(cut(age,breaks = seq(0, 100, by = 10),include.lowest = TRUE),cut(year,breaks=seq(1950,2010,by=5),include.lowest=TRUE))
>
> M
>
>
> On 5. apr. 2010, at 21.45, Erik Iverson wrote:
>
>> I don't know what your data are like, since you haven't given a reproducible example. I was imagining something like:
>>
>> ## generate fake data
>> age <- sample(20:90, 100, replace = TRUE)
>> year <- sample(1950:2000, 100, replace = TRUE)
>>
>> ##look at big table
>> table(age, year)
>>
>> ## categorize data
>> ## see include.lowest and right arguments to cut
>> age.factor <- cut(age, breaks = seq(20, 90, by = 10),
>>                  include.lowest = TRUE)
>>
>> year.factor <- cut(year, breaks = seq(1950, 2000, by = 10),
>>                   include.lowest = TRUE)
>>
>> table(age.factor, year.factor)
>>
>> moleps wrote:
>>> I already tried the regression modeling approach. However, the epidemiologist (referee) turns out to be quite fond of comparing the incidence rates to different standard populations, hence the need for this laborious approach. Trying the "cutting" approach, I ended up with:
>>> > table(age5)
>>> age5
>>>    (0,5]   (5,10]  (10,15]  (15,20]  (20,25]  (25,30]
>>>       35       34       33       47       51      109
>>>  (30,35]  (35,40]  (40,45]  (45,50]  (50,55]  (55,60]
>>>      157      231      362      511      745      926
>>>  (60,65]  (65,70]  (70,75]  (75,80]  (80,85] (85,100]
>>>     1002      866      547      247       82       18
>>> > table(yr5)
>>> yr5
>>> (1950,1955] (1955,1960] (1960,1965] (1965,1970]
>>>           3           5           5           5
>>> (1970,1975] (1975,1980] (1980,1985] (1985,1990]
>>>           5           5           5           5
>>> (1990,1995] (1995,2000] (2000,2005] (2005,2009]
>>>           5           5           5           3
>>> > table(yr5, age5)
>>> Error in table(yr5, age5) : all arguments must have the same length
>>> Sincerely,
>>> M
>>> On 5. apr. 2010, at 20.59, Bert Gunter wrote:
>>>> You have tempted, and being weak, I yield to temptation:
>>>>
>>>> "Any good ideas?"
>>>>
>>>> Yes. Don't do this.
>>>>
>>>> (what you probably really want to do is fit a model with age as a factor,
>>>> which can be done statistically e.g. by logistic regression; or graphically
>>>> using conditioning plots, e.g. via trellis graphics (the lattice package).
>>>> This avoids the arbitrariness and discontinuities of binning by age range.)
>>>>
>>>> Bert Gunter
>>>> Genentech Nonclinical Biostatistics
>>>>
>>>> -----Original Message-----
>>>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
>>>> Behalf Of moleps
>>>> Sent: Monday, April 05, 2010 11:46 AM
>>>> To: r-help at r-project.org
>>>> Subject: [R] Data manipulation problem
>>>>
>>>> Dear R'ers.
>>>>
>>>> I've got a dataset with age and year of diagnosis. In order to
>>>> age-standardize the incidence I need to transform the data into a matrix
>>>> with age-groups (divided in 5 or 10 years) along one axis and year divided
>>>> into 5 years along the other axis. Each cell should contain the number of
>>>> cases for that age group and for that period.
>>>> I.e.
>>>> My data format now is
>>>> ID-age (to one decimal)-year(yearly data).
>>>>
>>>> What I'd like is
>>>>
>>>> age 1960-1965 1966-1970 etc...
>>>> 0-5 3 8 10 15
>>>> 6-10 2 5 8 13
>>>> etc..
>>>>
>>>>
>>>> Any good ideas?
>>>>
>>>> Regards,
>>>> M
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>
>


