[Bioc-devel] Query regarding size limit and including external datasets

Tim Triche, Jr. tim.triche at gmail.com
Mon Oct 2 04:14:14 CEST 2017

Any TCGA MAFs released to the public were considered deidentified. That wouldn't be the part I would worry about. It's a nice idea, and a data package or packages seems like the idiomatic way to do it, as you noted. Personally, I think it would indeed benefit a lot of people (vs, say, GDC). Maftools is a super handy package for visualization.

Like Kasper, I am not speaking for bioc-core, just as a TCGA author who spent a lot of time discussing releases with our DCC back in the day.


> On Oct 1, 2017, at 9:29 PM, Kasper Daniel Hansen <kasperdanielhansen at gmail.com> wrote:
> I cannot speak for the core team.
> You should separate the data from the software methods and provide a data
> package containing the MAFs. This has the additional advantage of
> separating versioning of the mutation data from your software. As a data
> package this does not sound extensive; the largest dataset is 3.7 MB. There
> is a potential privacy problem with sharing mutations, but I don't know at
> what level the mutations are described. I assume you have considered this?
> Best,
> Kasper
>> On Sun, Oct 1, 2017 at 9:16 PM, Anand MT <anand_mt at hotmail.com> wrote:
>> Hi all,
>> I maintain the maftools package, which offers a multitude of functions to
>> perform various analyses and visualizations of MAF (Mutation Annotation
>> Format) files from cancer cohorts.
>> In the upcoming Bioconductor release, I plan to include all MAFs from 32
>> TCGA cohorts as a part of the package. These TCGA MAFs will be stored as
>> MAF objects containing curated somatic mutations along with clinical
>> information in the extdata directory and can be loaded via a “tcga_load”
>> function.
>> I think this will help many researchers working with TCGA mutation data
>> and save them the time and hassle of going through various databases to
>> search and assemble it. I believe this also helps with reproducible research.
>> However, the size of these MAF objects varies with cohort size and
>> mutation burden, with LAML (leukemia) being the smallest (91 KB) and LUAD
>> (lung adenocarcinoma) being the largest (3.7 MB). Including these MAFs
>> also increases the package size to 46 MB (from 7 MB without these datasets).
>> My questions are:
>>  *   Is it okay for a package to be of this size?
>>  *   I haven't tried to push these commits to the repository yet, but in
>> case git rejects my push due to the size limit, is it possible to make an
>> exception, given the scenario?
>> If this can't be done in any way, or if it breaks any rules of the package
>> guidelines, I don't mind dropping the idea either.
>> Thanks.
>> -Anand.
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
