[R] Working With Variables Having Different Lengths

David Winsemius dwinsemius at comcast.net
Fri Oct 21 21:43:00 CEST 2011


On Oct 21, 2011, at 3:02 PM, Rich Shepard wrote:

> On Fri, 21 Oct 2011, David Winsemius wrote:
>
>> First you need to clarify whether "TDS" is the name of a column or a
>> possible value in a column named "param". This whole painful
>> multi-question process would be greatly accelerated if you offered
>> str(chemdata).
>
>  Yes, I did on a different thread, but not on this one.
>
> str(chemdata)
> 'data.frame':	47244 obs. of  6 variables:
> $ site    : Factor w/ 143 levels "BC-0.5","BC-1",..: 134 134 134 127 127
> $ sampdate: Date, format: "2006-12-06" "2006-12-06" ...
> $ param   : Factor w/ 66 levels "AGP","ANP","ANP/AGP",..: 58 66 12 24 59 66
> $ quant   : num  1.08e+04 7.95 1.80e-02 2.80e+02 1.90e+01 8.44 1.62e+03
> $ stream  : Factor w/ 24 levels "B","C",..: 4 4 4 21 21 21 4
> $ basin   : Factor w/ 2 levels "Basin1","Basin2": 1 1 1 1 1 1 1 1 1 2 ...
>
>  What I need to do is examine the relationships between the parameter "TDS"
> and other parameters associated with it; e.g., "Cond" and "SO4".

How are we to determine which lines contain information about the
"relationships" of param=="TDS" with whatever cases or variables have
values of "Cond" and "SO4"? Are you really trying to compare two
disjoint groups on some statistic, such as the mean and standard
deviation of "quant"? (This would be a job for `aggregate`.)
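
A minimal sketch of that kind of group summary, assuming a small invented
data frame (here called chem) standing in for chemdata; the values are made
up purely for illustration:

```r
# Toy stand-in for chemdata; the numbers are invented for illustration only.
chem <- data.frame(
  param = factor(c("TDS", "TDS", "TDS", "Cond", "Cond", "SO4")),
  quant = c(100, 200, NA, 10, 30, 5)
)

# Mean and standard deviation of quant within each level of param.
# The formula method of aggregate() drops rows with NA in quant by default
# (na.action = na.omit), so the NA row does not poison the TDS summary.
stats <- aggregate(quant ~ param, data = chem,
                   FUN = function(x) c(mean = mean(x), sd = sd(x)))
stats
```

Because FUN returns two values, stats$quant is a matrix with columns "mean"
and "sd". A level with a single observation (SO4 here) gets an sd of NA.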

> I started
> by subsetting the main data frame (chemdata)
>
> tds.basin <- subset(chemdata, param == "TDS", select = c(param, quant, basin),
>                     na.rm = TRUE, drop = TRUE)
>
> cond.basin <- subset(chemdata, param == "Cond", select = c(param, quant, basin),
>                      na.rm = TRUE, drop = TRUE)

So now you have two disjoint subsets. Why should we think they can be  
analyzed with regression methods?

>
> However, these left the NA rows in the new data frames.

Not for the "param" column, I hope. And the na.rm= argument should get
ignored by subset().
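
A minimal sketch of the behavior, assuming an invented toy data frame (chem)
rather than the real chemdata. The point is that subset() drops rows where
the condition itself is NA, but an NA in "quant" has to be tested for
explicitly:

```r
# Toy data, invented for illustration: one NA in quant, one NA in param.
chem <- data.frame(
  param = factor(c("TDS", "TDS", NA, "Cond")),
  quant = c(100, NA, 50, 10)
)

# na.rm is not an argument of subset(); it is absorbed by '...' and has no effect.
with_na <- subset(chem, param == "TDS", select = c(param, quant), na.rm = TRUE)
nrow(with_na)   # the TDS row with quant = NA is still present

# Rows where the condition evaluates to NA are dropped, so adding the NA test
# to the condition removes the missing quant values as well.
no_na <- subset(chem, param == "TDS" & !is.na(quant), select = c(param, quant))
nrow(no_na)
```

An alternative is to call na.omit() on the subsetted result, which removes
any row with an NA in any of the selected columns.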

>
>  I can produce an xyplot() using tds.basin$quant and cond.basin$quant, but
> it's obvious there are many points where one or the other have NA values.
> When I tried a linear regression it failed because of an unequal number of
> rows in both data frames.
>
>  What I need to learn are: 1) how to write the subset() to remove the NA
> rows for each one and 2) how to perform linear regression (and further
> analyses) on these pairs of data frames.
>
>> If you do not offer both the code and the verbatim copy of the error there
>> will be very little that we can do to diagnose your problem.
>
> str(tds.basin)
> 'data.frame':	2206 obs. of  3 variables:
> $ param: Factor w/ 66 levels "AGP","ANP","ANP/AGP",..: 58 58 58 58 58 58 58
> $ quant: num  10800 530 3838 3658 3756 ...
> $ basin: Factor w/ 2 levels "Basin1","Basin2": 1 2 2 2 2 2 2 2 2 2 ...
>
> str(cond.basin)
> 'data.frame':	1191 obs. of  3 variables:
> $ param: Factor w/ 66 levels "AGP","ANP","ANP/AGP",..: 24 24 24 24 24 24 24
> $ quant: num  280 3170 4220 3420 3700 ...
> $ basin: Factor w/ 2 levels "Basin1","Basin2": 1 2 2 2 2 2 2 2 2 2 ...
>
> then,
>
> m1 <- lm(tds.basin$quant ~ cond.basin$quant)
> Error in model.frame.default(formula = tds.basin$quant ~ cond.basin$quant,  :
>   variable lengths differ (found for 'cond.basin$quant')

In regression calls it is almost always better to construct them with a
data argument:

> m1 <- lm(tds.basin$quant ~ cond.basin$quant)
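
A minimal sketch of one way to get there, under a loud assumption: that a TDS
value and a Cond value belong together when they share the same "site" and
"sampdate". Whether that is the right pairing rule for chemdata is for you to
decide. The toy data frame (chem) and the column names "tds" and "cond" are
invented for illustration:

```r
# Toy long-format data, invented for illustration only.
# Site D has a TDS measurement but no matching Cond measurement.
chem <- data.frame(
  site     = c("A", "A", "B", "B", "C", "C", "D"),
  sampdate = as.Date(c("2006-12-06", "2006-12-06", "2006-12-06", "2006-12-06",
                       "2007-01-10", "2007-01-10", "2007-01-10")),
  param    = c("TDS", "Cond", "TDS", "Cond", "TDS", "Cond", "TDS"),
  quant    = c(650, 1000, 1300, 2000, 1950, 3000, 500)
)

# One data frame per parameter, keeping the keys needed to pair observations.
tds  <- subset(chem, param == "TDS"  & !is.na(quant),
               select = c(site, sampdate, quant))
cond <- subset(chem, param == "Cond" & !is.na(quant),
               select = c(site, sampdate, quant))
names(tds)[3]  <- "tds"
names(cond)[3] <- "cond"

# Assumption: observations pair up on site and sampling date.
# merge() keeps only the key combinations present in both data frames.
paired <- merge(tds, cond, by = c("site", "sampdate"))

# Both variables now live in one data frame with equal lengths.
m1 <- lm(tds ~ cond, data = paired)
coef(m1)
```

With both variables in a single data frame the "variable lengths differ"
error cannot arise, and unmatched samples (site D in the toy data) drop out
at the merge step rather than silently misaligning the two vectors.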

>
> Rich
>

David Winsemius, MD
West Hartford, CT


