[Bioc-devel] Multiple colData in SummarizedExperiment

Thu Jun 18 06:25:18 CEST 2015

Oh wow, I didn't know you could put a DataFrame into a single column of 
another DataFrame. That actually solves a problem for me too (I don't 
intend to expose nested DataFrames to the users though).
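
A minimal sketch of one way to keep the nested DataFrame out of the
users' sight, assuming the current SummarizedExperiment package is
installed. The accessor names sampleQuality() and sampleDesign() and the
toy assay names are hypothetical, made up for illustration only:

```r
library(SummarizedExperiment)  # also attaches S4Vectors, which provides DataFrame

# Toy data: 4 genes x 6 samples, one matrix per (hypothetical) pipeline
samples  <- LETTERS[1:6]
counts_a <- matrix(1:24, nrow = 4,
                   dimnames = list(paste0("gene", 1:4), samples))
counts_b <- counts_a * 2L

design  <- DataFrame(Treatment = rep(c("ChIP", "Input"), 3),
                     row.names = samples)
quality <- DataFrame(x = 1:6, y = 6:1, row.names = samples)

se <- SummarizedExperiment(
    assays  = list(pipelineA = counts_a, pipelineB = counts_b),
    colData = design
)

# Store the quality table as a single nested column of colData
colData(se)$quality <- quality

# Hypothetical accessors: users see two flat tables, never the nesting
sampleQuality <- function(se) colData(se)$quality
sampleDesign  <- function(se) {
    cd <- colData(se)
    cd[, setdiff(colnames(cd), "quality"), drop = FALSE]
}

sampleQuality(se[, 1:2])  # quality rows follow the sample subsetting
sampleDesign(se)          # experimental information only
```

Because the nested table is just another colData column, subsetting the
object by sample keeps both tables in sync without extra bookkeeping.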

On 6/17/15 7:23 PM, Martin Morgan wrote:
> On 06/17/2015 11:41 AM, davide risso wrote:
>> Dear list,
>>
>> I'm creating an R package to store RNA-seq data of a somewhat large
>> project in which I'm involved.
>>
>> One of the initial goals is to compare different pre-processing
>> pipelines, hence I have multiple expression matrices corresponding to
>> the same samples. The SummarizedExperiment class seems a good
>> candidate, since I have multiple expression matrices with the same
>> rowData and colData information.
>>
>> I have several sample-specific variables that I want to store with the
>> object, namely experimental information (e.g., batch, date,
>> experimental condition) and sample quality (e.g., proportion of aligned
>> reads, total duplicate reads).
>>
>> Of course, I can always create one big data frame concatenating the two
>> (experimental info + sample quality), but it seems that, both
>> conceptually and practically, it might be useful to have two separate
>> data frames. Since this seems a reasonably standard type of information
>> that one would want to carry along, I was wondering if it would be
>> possible / useful to allow the user to have multiple data.frames in the
>> colData slot
>
> Actually, colData() returns a DataFrame, and a DataFrame column can
> contain a DataFrame. So after
>
>   example(SummarizedExperiment)
>
> we could make some faux sample quality data
>
>   quality = DataFrame(x=1:6, y=6:1, row.names=colnames(se1))
>
> add this as a column in the colData()
>
>   colData(se1)$quality = quality
>
> (or create the SummarizedExperiment from a similar DataFrame up-front)
> and manage our grouped data
>
> > colData(se1)
> DataFrame with 6 rows and 2 columns
>     Treatment     quality
>   <character> <DataFrame>
> A        ChIP    ########
> B       Input    ########
> C        ChIP    ########
> D       Input    ########
> E        ChIP    ########
> F       Input    ########
> > colData(se1[,1:2])$quality
> DataFrame with 2 rows and 2 columns
>           x         y
>   <integer> <integer>
> A         1         6
> B         2         5
>
> I'm not sure that this is any less confusing to the end user than
> having to manage a DataFrameList(), but it does not require any new
> features.
>
> Martin
>
>> of SummarizedExperiment.
>>
>> Best,
>> Davide