[Bioc-devel] Query regarding size limit and including external datasets

Martin Morgan martin.morgan at roswellpark.org
Mon Oct 2 21:43:29 CEST 2017


The data definitely does not belong in the software package.

We are a little schizophrenic about Experiment Data at the moment; my 
own feeling is that ExperimentHub is the right place, especially for 
data over, say, 10 MB. This is more labor-intensive and harder to version, 
but potentially offers other benefits such as enhanced discoverability 
and access outside R.

Guidelines on ExperimentHub data are 
http://bioconductor.org/packages/devel/bioc/vignettes/ExperimentHub/inst/doc/CreateAnExperimentHubPackage.html#new-resources 
.


On 10/01/2017 10:39 PM, Anand MT wrote:
> Hi Tim,
> 
> 
> Thank you for the encouraging words.
> 
> I will wait for suggestions from the core members and will probably submit them as a data package.
> 
> 
> -Anand.
> 
> ________________________________
> From: Tim Triche, Jr. <tim.triche at gmail.com>
> Sent: 02 October 2017 07:44:14
> To: Kasper Daniel Hansen
> Cc: Anand MT; bioc-devel at r-project.org
> Subject: Re: [Bioc-devel] Query regarding size limit and including external datasets
> 
> Any TCGA MAFs released to the public were considered deidentified. That wouldn't be the part I would worry about. It's a nice idea, and a data package or packages seems like the idiomatic way to do it, as you noted. Personally I think it would indeed benefit a lot of people (vs, say, GDC). Maftools is a super handy package for visualization.
> 
> Like Kasper, I am not speaking for bioc-core, just as a TCGA author who spent a lot of time discussing releases with our DCC back-in-the-day.
> 
> --t
> 
>> On Oct 1, 2017, at 9:29 PM, Kasper Daniel Hansen <kasperdanielhansen at gmail.com> wrote:
>>
>> I cannot speak for the core team.
>>
>> You should separate the data from the software methods and provide a data
>> package containing the MAFs. This has the additional advantage of
>> separating versioning of the mutation data from your software. As a data
>> package this does not sound extensive; the largest dataset is 3.7 MB. There
>> is a potential privacy problem with sharing mutations, but I don't know at
>> what level the mutations are described. I assume you have considered this?
>>
>> Best,
>> Kasper
>>
>>> On Sun, Oct 1, 2017 at 9:16 PM, Anand MT <anand_mt at hotmail.com> wrote:
>>>
>>> Hi all,
>>>
>>> I maintain the maftools package, which offers a multitude of functions to
>>> perform various analyses and visualizations of MAF (Mutation Annotation
>>> Format) files from cancer cohorts.
>>>
>>> In the upcoming Bioconductor release, I plan to include all MAFs from 32
>>> TCGA cohorts as a part of the package. These TCGA MAFs will be stored as
>>> MAF objects containing curated somatic mutations along with clinical
>>> information in the extdata directory and can be loaded via a 'tcga_load'
>>> function.
>>>
>>> I think this will help many researchers working with TCGA mutation data
>>> and save the time and hassle of going through various databases to search
>>> and assemble. I believe this also helps in reproducible research.
>>>
>>> However, the size of these MAF objects varies according to cohort size and
>>> mutation burden, with LAML (leukemia) being the smallest (91 KB) and LUAD
>>> (lung adenocarcinoma) being the largest (3.7 MB). Also, including these
>>> MAFs increases the package size to 46 MB (from 7 MB without these datasets).
>>>
>>> My questions are:
>>>
>>>   *   Is it okay for a package to be of this size?
>>>   *   I haven't tried to push these commits to the repository yet, but in
>>> case git rejects my push due to the size limit, is it possible to make an
>>> exception, given the scenario?
>>>
>>> If this can't be done in any way, or if it breaks any of the package
>>> guidelines, I don't mind dropping the idea either.
>>>
>>> Thanks.
>>>
>>> -Anand.
>>>
>>>
>>>         [[alternative HTML version deleted]]
>>>
>>>
>>> _______________________________________________
>>> Bioc-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>
> 
> 

