[Bioc-devel] Query regarding size limit and including external datasets

Anand MT anand_mt at hotmail.com
Mon Oct 2 04:39:46 CEST 2017


Hi Tim,


Thank you for the encouraging words.

I will wait for suggestions from the core members and will probably submit them as a data package.


-Anand.

________________________________
From: Tim Triche, Jr. <tim.triche at gmail.com>
Sent: 02 October 2017 07:44:14
To: Kasper Daniel Hansen
Cc: Anand MT; bioc-devel at r-project.org
Subject: Re: [Bioc-devel] Query regarding size limit and including external datasets

Any TCGA MAFs released to the public were considered deidentified. That wouldn't be the part I would worry about. It's a nice idea, and a data package or packages seems like the idiomatic way to do it, as you noted. Personally I think it would indeed benefit a lot of people (vs, say, GDC). Maftools is a super handy package for visualization.

Like Kasper, I am not speaking for bioc-core, just as a TCGA author who spent a lot of time discussing releases with our DCC back in the day.

--t

> On Oct 1, 2017, at 9:29 PM, Kasper Daniel Hansen <kasperdanielhansen at gmail.com> wrote:
>
> I cannot speak for the core team.
>
> You should separate the data from the software methods and provide a data
> package containing the MAFs. This has the additional advantage of
> separating versioning of the mutation data from your software. As a data
> package this does not sound extensive; the largest dataset is 3.7 MB. There
> is a potential privacy problem with sharing mutations, but I don't know at
> what level the mutations are described. I assume you have considered this?
>
> Best,
> Kasper
>
>> On Sun, Oct 1, 2017 at 9:16 PM, Anand MT <anand_mt at hotmail.com> wrote:
>>
>> Hi all,
>>
>> I maintain the maftools package, which offers a multitude of functions to
>> perform various analyses and visualizations of MAF (Mutation Annotation
>> Format) files from cancer cohorts.
>>
>> In the upcoming Bioconductor release, I plan to include all MAFs from 32
>> TCGA cohorts as part of the package. These TCGA MAFs will be stored as
>> MAF objects containing curated somatic mutations along with clinical
>> information in the extdata directory and can be loaded via a “tcga_load”
>> function.
>>
>> I think this will help many researchers working with TCGA mutation data
>> and save them the time and hassle of searching various databases and
>> assembling the data. I believe this also supports reproducible research.
>>
>> However, the size of these MAF objects varies with cohort size and
>> mutation burden, with LAML (leukemia) being the smallest (91 KB) and LUAD
>> (lung adenocarcinoma) being the largest (3.7 MB). Including these MAFs
>> also increases the package size to 46 MB (from 7 MB without these datasets).
>>
>> My questions are:
>>
>>  *   Is it okay for a package to be of this size?
>>  *   I haven't tried to push these commits to the repository yet, but in
>> case git rejects my push due to a size limit, is it possible to make an
>> exception, given the scenario?
>>
>> If this can't be done in any way, or if it breaks any of the package
>> guidelines, I don't mind dropping the idea either.
>>
>> Thanks.
>>
>> -Anand.
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>