[Bioc-devel] Multiple colData in SummarizedExperiment

Kasper Daniel Hansen kasperdanielhansen at gmail.com
Thu Jun 18 20:51:28 CEST 2015


You can just implement this by having reserved column names in the colData
slot; that will work and will take approx. 23 seconds to implement.  I agree
it is not as clean from a design perspective, but you get 100% of the
functionality and you can write a separate checker for the colData argument.
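
A rough sketch of what I mean, assuming a reserved "qc_" prefix marks the
quality columns. The prefix and the names checkColData() and quality() are
made up for illustration; they are not part of SummarizedExperiment.

```r
library(SummarizedExperiment)

# Hypothetical convention: quality columns carry the reserved prefix "qc_".
QC_PREFIX <- "qc_"

# Separate checker for a colData DataFrame: every reserved column must be
# numeric. Stops with an error naming the offending columns otherwise.
checkColData <- function(cd) {
    qc <- grep(paste0("^", QC_PREFIX), colnames(cd), value = TRUE)
    ok <- vapply(qc, function(nm) is.numeric(cd[[nm]]), logical(1))
    bad <- qc[!ok]
    if (length(bad))
        stop("non-numeric quality column(s): ", paste(bad, collapse = ", "))
    invisible(TRUE)
}

# Hypothetical accessor returning only the quality scores.
quality <- function(se) {
    cd <- colData(se)
    checkColData(cd)
    cd[, startsWith(colnames(cd), QC_PREFIX), drop = FALSE]
}

# Toy data: two quality columns next to one experimental column.
counts <- matrix(1:12, nrow = 3, dimnames = list(NULL, LETTERS[1:4]))
cd <- DataFrame(batch = c("b1", "b1", "b2", "b2"),
                qc_aligned = c(0.91, 0.88, 0.95, 0.90),
                qc_duplicates = c(120, 340, 95, 210),
                row.names = LETTERS[1:4])
se <- SummarizedExperiment(assays = list(counts = counts), colData = cd)

quality(se)
```

Everything stays in one colData DataFrame, so subsetting the object keeps
the quality columns in sync with the samples for free.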

On Thu, Jun 18, 2015 at 2:00 PM, davide risso <risso.davide at gmail.com>
wrote:

> Thank you all for the responses.
>
> I didn't think about the nested DataFrame solution.  It should work.
> I agree that an extension might be cleaner, but I clearly need to give it
> more thought.
>
> One of the reasons I wanted to have quality and metadata as separate slots
> is that one could enforce that all the qualities are numeric, and have a
> quality() method to extract just the quality scores (e.g., for plotting /
> quality control). Having them in the same slot makes it harder for the user
> to extract just the scores (if the column order and/or names are not
> standardized).
>
> Best,
> davide
>
>
> On Thu, Jun 18, 2015 at 6:35 AM Vincent Carey <stvjc at channing.harvard.edu>
> wrote:
>
>> yes, if a formal extension is warranted.  the metadata slot could also be
>> used.
>>
>> On Thu, Jun 18, 2015 at 2:59 PM, Kasper Daniel Hansen <
>> kasperdanielhansen at gmail.com> wrote:
>>
>> > I think the cleaner solution for Davide (if he insists on having separate
>> > objects; I decided against it in minfi) is to extend the class to allow
>> > this.
>> >
>> > Kasper
>> >
>> > On Thu, Jun 18, 2015 at 12:25 AM, Ryan <rct at thompsonclan.org> wrote:
>> >
>> > > Oh wow, I didn't know you could put a DataFrame into a single column
>> > > of another DataFrame. That actually solves a problem for me too (I
>> > > don't intend to expose nested DataFrames to the users though).
>> > >
>> > >
>> > > On 6/17/15 7:23 PM, Martin Morgan wrote:
>> > >
>> > >> On 06/17/2015 11:41 AM, davide risso wrote:
>> > >>
>> > >>> Dear list,
>> > >>>
>> > >>> I'm creating an R package to store RNA-seq data of a somewhat large
>> > >>> project in which I'm involved.
>> > >>>
>> > >>> One of the initial goals is to compare different pre-processing
>> > >>> pipelines, hence I have multiple expression matrices corresponding
>> > >>> to the same samples. The SummarizedExperiment class seems a good
>> > >>> candidate, since I have multiple expression matrices with the same
>> > >>> rowData and colData information.
>> > >>>
>> > >>> I have several sample-specific variables that I want to store with
>> > >>> the object, namely, experimental information (e.g., batch, date,
>> > >>> experimental condition, ...) and sample quality (e.g., proportion of
>> > >>> aligned reads, total duplicate reads, etc...).
>> > >>>
>> > >>> Of course, I can always create one big data frame concatenating the
>> > >>> two (experimental info + sample quality), but it seems that both
>> > >>> conceptually and practically, it might be useful to have two
>> > >>> separate data frames. Since this seems somewhat a reasonably
>> > >>> standard type of information that one would want to carry on, I was
>> > >>> wondering if it would be possible / useful to allow the user to have
>> > >>> multiple data.frames in the colData slot
>> > >>>
>> > >>
>> > >> Actually, colData() is a DataFrame, and a DataFrame column can
>> > >> contain a DataFrame. So after
>> > >>
>> > >>   example(SummarizedExperiment)
>> > >>
>> > >> we could make some faux sample quality data
>> > >>
>> > >>   quality = DataFrame(x=1:6, y=6:1, row.names=colnames(se1))
>> > >>
>> > >> add this as a column in the colData()
>> > >>
>> > >>   colData(se1)$quality = quality
>> > >>
>> > >> (or create the SummarizedExperiment from a similar DataFrame
>> > >> up-front) and manage our grouped data
>> > >>
>> > >> > colData(se1)
>> > >> DataFrame with 6 rows and 2 columns
>> > >>     Treatment     quality
>> > >>   <character> <DataFrame>
>> > >> A        ChIP    ########
>> > >> B       Input    ########
>> > >> C        ChIP    ########
>> > >> D       Input    ########
>> > >> E        ChIP    ########
>> > >> F       Input    ########
>> > >> > colData(se1[,1:2])$quality
>> > >> DataFrame with 2 rows and 2 columns
>> > >>           x         y
>> > >>   <integer> <integer>
>> > >> A         1         6
>> > >> B         2         5
>> > >>
>> > >> I'm not sure that this is any less confusing to the end user than
>> > >> having to manage a DataFrameList(), but it does not require any new
>> > >> features.
>> > >>
>> > >> Martin
>> > >>
>> > >>  of SummarizedExperiment.
>> > >>>
>> > >>> Best,
>> > >>> Davide
>> > >>>
>> > >>>     [[alternative HTML version deleted]]
>> > >>>
>> > >>> _______________________________________________
>> > >>> Bioc-devel at r-project.org mailing list
>> > >>> https://stat.ethz.ch/mailman/listinfo/bioc-devel