[R] Working With Variables Having Different Lengths

Sat Oct 22 01:46:04 CEST 2011

On Oct 21, 2011, at 6:17 PM, Rich Shepard wrote:

> On Fri, 21 Oct 2011, David Winsemius wrote:
>
>> What problem are you trying to solve?
>
>   What I need now is to compare TDS (total dissolved solids) with specific
> conductivity and the ions that normally comprise TDS. Before running any
> regression models I need to look at these data from three points of view:
> all data from all sites within each hydrographic drainage basin collected
> during the past 30 years; average (or total) concentrations (not yet decided
> on what makes the most ecological sense) within a stream having multiple
> collection sites; and by site within certain streams.
>
>  Here is the data frame structure:
>
>  str(chemdata)
> 'data.frame':	47244 obs. of  6 variables:
>  $ site    : Factor w/ 143 levels "BC-0.5","BC-1",..: 134 134 134 127 127
>  $ sampdate: Date, format: "2006-12-06" "2006-12-06" ...
>  $ param   : Factor w/ 66 levels "AGP","ANP","ANP/AGP",..: 58 66 12 24 59 66
>  $ quant   : num  1.08e+04 7.95 1.80e-02 2.80e+02 1.90e+01 8.44 1.62e+03
>  $ stream  : Factor w/ 24 levels "BCrk","CCrk",..: 4 4 4 21 21 21 4
>  $ basin   : Factor w/ 2 levels "BasinEast","BasinWest": 1 1 1 1 1 1 1 1 1 2 ...

The only variable in that dataframe with what appears to be a
continuous value (which is how I would expect "total dissolved solids"
to be measured) is 'quant'. Are you saying that the value of 'quant' is
measured in different units depending on the value of 'param', and that
'site' and 'sampdate' should be used to identify associated
measurements? That appears to be the case, based on what you say below.

If this is so, the problem is to break apart the dataframe by type of
measurement ('param'). One way would be to split it into separate
dataframes and then merge them back together, linking on site and date.
I'm guessing that 'stream' and 'basin' are superfluous for the matching
and can be associated with 'site' later?
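
The split-then-merge idea can be sketched on a small made-up dataframe. The 'toy' data, its values, and the names 'wanted', 'pieces' and 'wide' are all invented for illustration; only the column names follow the str() output above.

```r
# Toy stand-in for chemdata (values invented for illustration only)
toy <- data.frame(
  site     = c("A", "A", "A", "B", "B"),
  sampdate = as.Date(c("2006-12-06", "2006-12-06", "2007-01-10",
                       "2006-12-06", "2006-12-06")),
  param    = c("TDS", "Cond", "TDS", "TDS", "Mg"),
  quant    = c(280, 410, 300, 150, 12),
  stringsAsFactors = TRUE
)

wanted <- c("TDS", "Cond", "Mg")
keep   <- droplevels(subset(toy, param %in% wanted))

# One dataframe per parameter, with 'quant' renamed to the parameter code
pieces <- lapply(split(keep, keep$param), function(d) {
  out <- d[, c("site", "sampdate", "quant")]
  names(out)[3] <- as.character(d$param[1])
  out
})

# Full outer merge on site and sampdate; unmatched measurements become NA
wide <- Reduce(
  function(x, y) merge(x, y, by = c("site", "sampdate"), all = TRUE),
  pieces
)
wide
```

If a site has more than one value of the same parameter on the same date, merge() will pair every such row with every matching row in the other pieces, so duplicates are worth checking for first.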

The goal would be a dataframe with 7 renamed 'param' columns ('TDS',
'Cond', 'Mg', 'SO4', 'Cl', 'Na', and 'Ca') and two identifier columns
('site' and 'sampdate'). For the moment I would think you would want to
keep all the data together and not make any decisions about excluding
NA values until you get an overall picture of the situation.
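
Once such a wide dataframe exists, one way to get that overall picture is to count how many rows have both TDS and each other parameter present. This is only a sketch: the 'wide' dataframe below and its values are invented, and 'others', 'n_obs' and 'paired' are illustrative names.

```r
# Hypothetical wide dataframe (invented values), one row per site and date
wide <- data.frame(
  site     = c("A", "A", "B", "B"),
  sampdate = as.Date(c("2006-12-06", "2007-01-10",
                       "2006-12-06", "2007-01-10")),
  TDS      = c(280, 300, NA, 150),
  Cond     = c(410, NA, 390, 200),
  Mg       = c(NA, 14, 12, 11)
)

# Every measurement column except TDS itself
others <- setdiff(names(wide), c("site", "sampdate", "TDS"))

# Non-missing measurements per parameter
n_obs <- colSums(!is.na(wide[, c("TDS", others)]))
n_obs

# Rows where both TDS and the other parameter are present
paired <- sapply(others, function(p) {
  sum(!is.na(wide$TDS) & !is.na(wide[[p]]))
})
paired
```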

The first thing I would try would be

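# keep only the 7 parameters of interest and columns 1:4
# (site, sampdate, param, quant), then sum quant within each cell
with(subset(chemdata, param %in% c('TDS', 'Cond', 'Mg', 'SO4', 'Cl',
                                   'Na', 'Ca'), 1:4),
     xtabs(quant ~ site + sampdate + param, drop.unused.levels = TRUE))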

You would get 7 tables, one for each 'param', with up to 143 rows and as
many columns as you have sampdates.
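
A small made-up example shows the shape of what xtabs() returns here, and one thing to watch for: site/date/param combinations with no sample show up as 0 rather than NA. The 'toy' data and its values are invented for illustration.

```r
# Toy stand-in for chemdata (values invented for illustration only)
toy <- data.frame(
  site     = factor(c("A", "A", "B", "B")),
  sampdate = as.Date(c("2006-12-06", "2007-01-10",
                       "2006-12-06", "2006-12-06")),
  param    = factor(c("TDS", "TDS", "TDS", "Cond")),
  quant    = c(280, 300, 150, 390)
)

tab <- xtabs(quant ~ site + sampdate + param, data = toy)

dim(tab)                        # sites x dates x parameters
tab[, , "TDS"]                  # the site-by-date table for one parameter
tab["B", "2007-01-10", "TDS"]   # no sample was taken, yet the cell is 0
```

Because xtabs() sums 'quant' within each cell, repeated measurements at the same site and date would also be added together.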

This might be a good use for package reshape2 since it generally  
returns a dataframe. The above operation would return an array with 3  
dimensions. You might get immediate success with something like:

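library(reshape2)
dcast(subset(chemdata, param %in% c('TDS', 'Cond', 'Mg', 'SO4', 'Cl',
                                    'Na', 'Ca'), 1:4),
      site + sampdate ~ param, value.var = "quant")
# 'quant', the variable omitted from the formula, supplies the cell values;
# if a site/date/param cell has several rows, dcast defaults to counting them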

To do your testing it might be wise to use subset more selectively.
Perhaps only go for a few sites and dates.
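
Something along these lines would cut things down for testing. It is a sketch on invented data: the site codes "BC-0.5" and "BC-1" come from the str() output, while "XX-2", the dates, and the values are made up.

```r
# Toy stand-in for chemdata (values invented for illustration only)
toy <- data.frame(
  site     = factor(c("BC-0.5", "BC-1", "BC-1", "XX-2", "BC-1")),
  sampdate = as.Date(c("2006-12-06", "2006-12-06", "2005-06-01",
                       "2006-12-06", "2006-07-15")),
  param    = factor(c("TDS", "Cond", "TDS", "TDS", "Mg")),
  quant    = c(280, 410, 300, 150, 12)
)

# Two sites, one calendar year, two parameters; droplevels() discards
# the factor levels that no longer occur in the reduced data
small <- droplevels(subset(
  toy,
  site %in% c("BC-0.5", "BC-1") &
    sampdate >= as.Date("2006-01-01") &
    sampdate <= as.Date("2006-12-31") &
    param %in% c("TDS", "Cond")
))
small
```

Dropping the unused levels keeps later tabulations from carrying empty rows and columns for the sites and parameters that were filtered out.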

-- 
david.

>
>  While all the data sets used in the books I've read are simpler and well
> illustrate the analyses presented, what I've not read is guidance on how
> complex data sets could (or should) be partitioned into smaller but still
> related data sets to facilitate analyses. Or, how I extract the relevant
> rows and columns for specific analyses.
>
>> That seems very unlikely. What we need is a clearer description of the
>> values that your "param" variable can assume, and what you want to do within
>> categories of those values. We also need you to stop dropping context.
>
>  There are 66 different chemicals in the param factor. However, for the
> immediate effort, only 7 are needed. They are coded 'TDS', 'Cond', 'Mg',
> 'SO4', 'Cl', 'Na', and 'Ca'.
>
>  From the database table I know the number of non-NULL (non-NA) rows for
> each parameter:
>
> 	TDS	2181
> 	Cond	 820
> 	Mg	1120
> 	SO4	1980
> 	Cl	1971
> 	Na	 866
> 	Ca	1110
>
>  Not all were required to be measured at all sites from the beginning in
> 1981. I do not yet know how many rows have non-NULL values for the 6 pairs
> compared with TDS.
>
>  If there's more information to provide I'll gladly do so.
>
> Thanks,
>
> Rich
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

David Winsemius, MD
West Hartford, CT