[R] Subsetting dataframes

Uwe Ligges ligges at statistik.uni-dortmund.de
Thu Jul 19 15:01:48 CEST 2007



CG Pettersson wrote:
> Dear all!
> 
> W2k, R 2.5.1
> 
> I am working with an ongoing malting barley variety evaluation within
> Sweden. The structure is 25 cultivars tested each year at four sites, in
> field trials with three replicates and 'lattice' structure (the replicates
> are divided into five sub blocks in a structured way). As we normally
> keep around 15 varieties from each year to the next and take in 10 new
> ones for the next year, we have tested a total of 72 different varieties
> over five years.
> 
> I store the data in a field trial database, and generate text tables with
> the subset of data I want and import the frame to R. I take in all
> cultivars in R and use 'subset' to select what I want to look at. Using
> lme{nlme} works with no problems to get mean results over the years, but
> as I now have a number of years I want to analyse the general site x
> cultivar relation. I am testing AMMI{agricolae} for this and it seems to
> work except for the subsetting. This is what happens:
> 
> If I do the subsetting like this:
> 
> x62_samvar <- subset(x62_5, cn %in%
> c("Astoria","Barke","Christina","Makof", "Prestige","Publican","Quench"))
> 
> A test run with AMMI seems to work in the first part:
> 
>> AMMI(site, cn, rep, yield)
> 
> ANALYSIS AMMI:  yield
> Class level information
> 
> ENV:  Hag Klb Bjt Ska
> GEN:  Astoria Prestige Makof Christina Publican Quench
> REP:  1 2 3
> 
> Number of observations:  240
> 
> model Y: yield  ~ ENV + REP%in%ENV + GEN + ENV:GEN
> 
> Analysis of Variance Table
> 
> Response: Y
>            Df    Sum Sq   Mean Sq F value    Pr(>F)
> ENV         3 120092418  40030806 90.0424 1.665e-06 ***
> REP(ENV)    8   3556620    444578  0.5674  0.803923
> GEN         5  21376142   4275228  5.4564 9.680e-05 ***
> ENV:GEN    15  28799807   1919987  2.4504  0.002555 **
> Residuals 208 162973213    783525
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> 
> Coeff var       Mean yield
> 13.08629         6764.098
> 
> After this something goes wrong, as AMMI finds a cultivar name not
> selected in the subsetting. (The plotting might go wrong anyhow, but I
> haven't got that far yet):
> 
> Error in model.frame.default(Terms, newdata, na.action = na.action, xlev =
> object$xlevels) :
>         factor 'y' has new level(s) Arkadia
> 
> 
> Looking at the dataframe using
> 
>> edit(x62_samvar)
> 
> only shows the selected lines, but using levels() gives another answer as
> 
>> levels(x62_samvar$cn)
> 
> gives back all 72 cultivar names used during the five years (starting with
> Arkadia).
> 
> Where do I go wrong and how do I use subset in a proper way?


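subset() removes the rows but keeps all levels of a factor, including
those that no longer occur in the data. So you have to drop the levels
you are excluding. Example: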

   x <- factor(letters[1:4])
   x                   # four levels: a b c d
   x[1:2]              # two values, but still all four levels
   x[1:2, drop=TRUE]   # unused levels dropped: a b
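
Applied to a data frame, the same idea might look like the sketch below.
The data frame 'trials' and its yield values are invented for
illustration; only the column name cn and the cultivar names come from
the question. Re-creating the factor with factor() is one way to drop the
unused levels; indexing with drop=TRUE as above should do the same.

```r
# Hypothetical stand-in for the trial data; the yield values are invented.
trials <- data.frame(
  cn    = factor(c("Arkadia", "Astoria", "Barke", "Quench")),
  yield = c(6500, 6800, 6700, 7000)
)

# subset() filters the rows but keeps every original level of cn.
sel <- subset(trials, cn %in% c("Astoria", "Quench"))
levels(sel$cn)

# Re-creating the factor keeps only the levels that still occur.
sel$cn <- factor(sel$cn)
levels(sel$cn)
```

R 2.12.0 and later also provide droplevels(), which drops the unused
levels of all factor columns of a data frame in one call.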


Uwe Ligges




> Thanks
> /CG
>


