[Bioc-devel] Including large files for the package

Mike Smith gr|mbough @end|ng |rom gm@||@com
Wed Sep 6 15:31:48 CEST 2023


Hi Ali,

I don't know the specifics of how the bioc-issue-bot checks submissions
(someone from the core team would know better), but I suspect you're
failing that file size test before the package is built.  Thus adding to
.Rbuildignore probably won't help for this specific issue.  I would still
think about compressing those files and then cleaning the git repository of
the original uncompressed versions by following
http://contributions.bioconductor.org/git-version-control.html#remove-large-data-files-and-clean-git-tree
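
To illustrate the compression point, here is a minimal R sketch. The file
names and table contents are made up for the example and are not taken from
your repository; it only assumes a tab-separated file saved with an .xls
extension, as described earlier in the thread.

```r
# Hypothetical stand-in for one of the raw files: a tab-separated table
# saved with a misleading .xls extension.
raw_path <- file.path(tempdir(), "example-metadata.xls")
raw_table <- data.frame(
  signature_id = sprintf("SIG_%05d", 1:5000),
  cell_line    = rep(c("A549", "MCF7", "PC3", "HA1E"), length.out = 5000),
  time_hours   = rep(c(6L, 24L), length.out = 5000),
  stringsAsFactors = FALSE
)
write.table(raw_table, raw_path, sep = "\t", quote = FALSE, row.names = FALSE)

# Write a gzip-compressed copy through a gzfile() connection.
gz_path <- file.path(tempdir(), "example-metadata.tsv.gz")
gz_con <- gzfile(gz_path, open = "wt")
write.table(raw_table, gz_con, sep = "\t", quote = FALSE, row.names = FALSE)
close(gz_con)

# Base R detects gzip compression when reading from a file path, so the
# compressed copy can be read without decompressing it first.
from_gz <- read.delim(gz_path, stringsAsFactors = FALSE)

# Compare on-disk sizes (in bytes) of the plain and compressed copies.
size_raw <- file.size(raw_path)
size_gz  <- file.size(gz_path)
```

How much space this saves depends on the data, so it is worth checking
size_gz against the submission limit for the real files.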

More generally, yes you want to exclude those files from the built packages
if they aren't useful to the end user of the package.  However, it looks
like you're already excluding the raw and data-raw directories via
.Rbuildignore, so I don't think you need to do more than that.
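
For reference, a rough sketch of how .Rbuildignore patterns select
top-level entries. The patterns and the directory listing below are
hypothetical, and the helper is_ignored() is written for this example only;
it approximates the documented behaviour (Perl-style regular expressions
matched case-insensitively against paths relative to the package root)
rather than reproducing what R CMD build does internally.

```r
# Hypothetical .Rbuildignore contents, one regular expression per line.
ignore_patterns <- c("^raw$", "^data-raw$", "^README\\.Rmd$")

# Hypothetical top-level entries of a package source tree.
top_level <- c("DESCRIPTION", "NAMESPACE", "R", "man", "raw", "data-raw",
               "README.Rmd", "vignettes")

# Return TRUE for each path that matches at least one pattern.
is_ignored <- function(paths, patterns) {
  vapply(paths, function(p) {
    any(vapply(patterns, function(pat) {
      grepl(pat, p, perl = TRUE, ignore.case = TRUE)
    }, logical(1)))
  }, logical(1), USE.NAMES = FALSE)
}

excluded <- top_level[is_ignored(top_level, ignore_patterns)]
kept     <- setdiff(top_level, excluded)
```

Note that this only affects what ends up in the built tarball; the files
stay in the git repository, which is why it doesn't change the file size
check on the repository itself.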

Best,
Mike



On Tue, 5 Sept 2023 at 19:15, Ali Sajid Imami <ali.sajid.imami using gmail.com>
wrote:

> Hi Mike,
>
> These are raw data files that I use to create internal data. They don’t
> end up being used in the final package themselves. Are you suggesting that
> adding them to .Rbuildignore should be sufficient? They are in the repo
> because I want to keep track of them.
>
> On Sep 5, 2023, at 4:54 AM, Mike Smith <grimbough using gmail.com> wrote:
>
> Hi Ali,
>
> Looking at the files, it seems that although the file extension is .xls,
> they're actually just plain text TSV files.  They compress pretty well with
> standard tools and R is able to easily read a TSV compressed with something
> like gzip. I wonder if you've considered just compressing the files and
> otherwise using them as they are.  The Hub approaches are neat, but maybe
> overkill if the files are < 2MB compressed.
>
> However, I wonder if it's necessary to distribute them with the package at
> all.  Perhaps I'm missing it, but I don't see any reference to reading
> those files in your code, and the contents already appear to be held in
> sysdata.rda.  IMO it would be sufficient to document in a README how
> sysdata.rda was created (perhaps hosting those files on your own S3
> storage if you have permission to do so) and then remove the raw files
> from the package.
>
> Best wishes,
> Mike
>
> On Thu, 31 Aug 2023 at 20:08, Ali Sajid Imami <ali.sajid.imami using gmail.com>
> wrote:
>
>> Hi,
>>
>> The file is downloaded and filtered from the ilincs website. Unfortunately
>> it is not readily available from ilincs.org itself. We have the capability
>> of storing the file in our S3 buckets, however.
>>
>> The metadata under discussion is metadata related to each individual
>> signature stored in iLINCS, including information about the cell lines,
>> the time points and the dosages. From what I understand this is more
>> likely to be suited for ExperimentHub since it's processed.
>>
>>
>> Regards,
>> Dr. Ali Sajid Imami
>> LinkedIn <https://pk.linkedin.com/pub/ali-sajid-imami/50/956/2a6>
>>
>>
>> On Thu, Aug 31, 2023 at 12:02 PM Kern, Lori via Bioc-devel <
>> bioc-devel using r-project.org> wrote:
>>
>> > Hello,
>> >
>> > Regarding Hub use:
>> > What sort of information does the metadata contain? That would determine
>> > whether ExperimentHub or AnnotationHub is more appropriate. Is the file
>> > accessed directly from the http://ilincs.org/ portal with a URL link, or
>> > is there processing/filtering that occurs?  The hubs can access data
>> > stored on other websites/hosts as long as they are trusted sites (ilincs
>> > would fall in this category) if you can access it directly with a URL
>> > link. The way the hubs work is that the data is stored elsewhere, either
>> > accessed directly from the original site or on some hosting server (S3,
>> > Azure, etc.) if it's processed. The data would be removed from the
>> > package itself and downloaded using the hub interface when needed (and
>> > also cached in the backend so it's not downloaded every time).
>> >
>> >
>> >
>> >
>> >
>> > Lori Shepherd - Kern
>> >
>> > Bioconductor Core Team
>> >
>> > Roswell Park Comprehensive Cancer Center
>> >
>> > Department of Biostatistics & Bioinformatics
>> >
>> > Elm & Carlton Streets
>> >
>> > Buffalo, New York 14263
>> >
>> > ________________________________
>> > From: Bioc-devel <bioc-devel-bounces using r-project.org> on behalf of
>> > Vincent Carey <stvjc using channing.harvard.edu>
>> > Sent: Thursday, August 31, 2023 8:29 AM
>> > To: Martin Grigorov <martin.grigorov using gmail.com>
>> > Cc: bioc-devel using r-project.org <bioc-devel using r-project.org>
>> > Subject: Re: [Bioc-devel] Including large files for the package
>> >
>> > On Thu, Aug 31, 2023 at 7:28 AM Martin Grigorov
>> > <martin.grigorov using gmail.com> wrote:
>> >
>> > > Hello,
>> > >
>> > > Perhaps you could use
>> > > https://bioconductor.r-universe.dev/BiocFileCache to download the big
>> > > file on demand.
>> > > The benefit is that the file would be stored in ~/.cache/R/yourPackage/
>> > > (for Linux; something similar for Windows/Mac) and reused between
>> > > sessions.
>> > >
>> >
>> > Thanks Martin.  I think that is a possible approach, but the proposals at
>> > http://contributions.bioconductor.org/non-software.html?q=AnnotationHub#annotationexperiment-hub-packages
>> > should also be considered.
>> >
>> > Ali, if the documentation regarding *Hub contributions is unclear, please
>> > file an issue or write back here with the difficulties so that we can
>> > improve the material and the methods!
>> >
>> > Thanks!
>> >
>> >
>> >
>> > >
>> > > Regards,
>> > > Martin
>> > >
>> > > On Tue, Aug 29, 2023 at 5:15 AM Ali Sajid Imami
>> > > <ali.sajid.imami using gmail.com> wrote:
>> > >
>> > > > Hi BioConductor Team,
>> > > >
>> > > > I am a PhD Candidate in the Cognitive Disorders Research lab at the
>> > > > University of Toledo. I am responsible for a number of R packages, and
>> > > > our intention is to submit them to Bioconductor over the next several
>> > > > months. I had just submitted a package, drugfindR
>> > > > (https://github.com/CogDisResLab/drugfindR). This was immediately
>> > > > closed as my repo had a single file over the 5MB limit.
>> > > >
>> > > > I wanted to ask whether you would reconsider or make an exception, or
>> > > > otherwise guide me in the right direction.
>> > > >
>> > > > This package serves as a way to quickly search through the LINCS data
>> > > > stored at the ilincs.org portal. The file in question is one of three
>> > > > metadata files that allow the package to function efficiently without
>> > > > having to make expensive network requests. It would really be helpful
>> > > > if we could include the file as is. I do not expect more files like
>> > > > that to be added to the package at all.
>> > > >
>> > > > Barring that, I have seen the suggestion of using AnnotationHub or
>> > > > ExperimentHub. While I have gone through the documentation, I'm not
>> > > > entirely sure how those services work. Are those services where we can
>> > > > store the data itself, or are we expected to host the data elsewhere
>> > > > and create lightweight "pointer" packages? Similarly, I'm not entirely
>> > > > sure which Hub this would go to.
>> > > >
>> > > > Any advice or guidance will be appreciated.
>> > > >
>> > > >
>> > > > Regards,
>> > > > Dr. Ali Sajid Imami
>> > > > LinkedIn <https://pk.linkedin.com/pub/ali-sajid-imami/50/956/2a6>
>> > > >
>> > > >         [[alternative HTML version deleted]]
>> > > >
>> > > > _______________________________________________
>> > > > Bioc-devel using r-project.org mailing list
>> > > > https://stat.ethz.ch/mailman/listinfo/bioc-devel
>> > > >
>> > >
>> >
>>
>
>