[Bioc-devel] Query regarding size limit and including external datasets

Anand MT anand_mt at hotmail.com
Mon Oct 2 03:57:35 CEST 2017

Hi Kasper,

Thanks for the suggestion. I thought including them in the same package would be much easier than making a separate package, but I guess you're right about versioning the mutation data.

Regarding the mutations, they are purely somatic variants (no germline variants) made public by the Broad GDAC. They are compiled from recent analysis runs generated by the Firehose pipeline, so I think it's okay to share them with proper acknowledgement (there is a citable DOI for each dataset).

From: Kasper Daniel Hansen <kasperdanielhansen at gmail.com>
Sent: 02 October 2017 06:59:16
To: Anand MT
Cc: bioc-devel at r-project.org
Subject: Re: [Bioc-devel] Query regarding size limit and including external datasets

I cannot speak for the core team.

You should separate the data from the software methods and provide a data package containing the MAFs. This has the additional advantage of separating the versioning of the mutation data from that of your software. As a data package this does not sound extensive; the largest dataset is 3.7 MB. There is a potential privacy problem with sharing mutations, but I don't know at what level the mutations are described. I assume you have considered this?


On Sun, Oct 1, 2017 at 9:16 PM, Anand MT <anand_mt at hotmail.com> wrote:
Hi all,

I maintain the maftools package, which offers a multitude of functions for analysis and visualization of MAF (Mutation Annotation Format) files from cancer cohorts.

In the upcoming Bioconductor release, I plan to include all MAFs from 32 TCGA cohorts as part of the package. These TCGA MAFs will be stored in the extdata directory as MAF objects containing curated somatic mutations along with clinical information, and can be loaded via a “tcga_load” function.

I think this will help many researchers working with TCGA mutation data and save them the time and hassle of searching various databases and assembling the data. I believe this also aids reproducible research.

However, the size of these MAF objects varies with cohort size and mutation burden, with LAML (leukemia) being the smallest (91 kB) and LUAD (lung adenocarcinoma) being the largest (3.7 MB). Including these MAFs also increases the package size to 46 MB (from 7 MB without these datasets).
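For anyone wanting to reproduce this kind of size accounting before pushing, here is a minimal sketch in Python. It is not part of maftools or any Bioconductor tooling; the function names and the idea of comparing against a caller-supplied limit are assumptions made for illustration, and no official size limit is encoded here.

```python
import os


def directory_size_bytes(root):
    """Return the total size in bytes of all regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so that linked files are not counted twice.
            if os.path.isfile(path) and not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def largest_files(root, n=5):
    """Return up to n (size_bytes, relative_path) pairs, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                sizes.append((os.path.getsize(path), os.path.relpath(path, root)))
    return sorted(sizes, reverse=True)[:n]


def exceeds_limit(root, limit_mb):
    """Return True if the files under root total more than limit_mb (1 MB = 10^6 bytes).

    limit_mb is supplied by the caller; it is a hypothetical threshold,
    not a value taken from any package guideline.
    """
    return directory_size_bytes(root) > limit_mb * 1000 * 1000
```

Running `largest_files` on a package source tree would show which stored objects dominate the total, and `directory_size_bytes` on the tree with and without the data directory gives the kind of before/after comparison quoted above.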

My question is,

  *   Is it okay for a package to be of this size?
  *   I haven't tried to push these commits to the repository yet, but if git rejects my push because of the size limit, would it be possible to make an exception, given the scenario?

If this can't be done in any way, or if it breaks any of the package guidelines, I don't mind dropping the idea either.



        [[alternative HTML version deleted]]

Bioc-devel at r-project.org mailing list
