[Bioc-devel] Including large files for the package

Mike Smith gr|mbough @end|ng |rom gm@||@com
Tue Sep 5 10:54:22 CEST 2023


Hi Ali,

Looking at the files, it seems that although the file extension is .xls,
they're actually just plain-text TSV files.  They compress pretty well with
standard tools, and R can easily read a TSV compressed with something like
gzip.  I wonder if you've considered just compressing the files and
otherwise using them as they are.  The Hub approaches are neat, but maybe
overkill if the files are < 2MB compressed.
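
For illustration, here is a minimal sketch of that round trip in base R.
The table contents and file name are made up for the example; they are not
taken from the package.

```r
# Hypothetical stand-in for one of the metadata tables.
metadata <- data.frame(
  signature_id = c("SIG_1", "SIG_2", "SIG_3"),
  cell_line    = c("A549", "MCF7", "PC3"),
  dose_um      = c(10, 0.5, 1),
  stringsAsFactors = FALSE
)

# Write a gzip-compressed TSV through a gzfile() connection.
tsv_gz <- file.path(tempdir(), "metadata.tsv.gz")
con <- gzfile(tsv_gz, open = "wt")
write.table(metadata, con, sep = "\t", quote = FALSE, row.names = FALSE)
close(con)

# read.delim() opens the path via file(), which detects gzip compression
# when reading, so no separate decompression step is needed.
restored <- read.delim(tsv_gz, stringsAsFactors = FALSE)
```

The same read.delim() call should work on a compressed file shipped under
inst/extdata, with the path resolved through system.file().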

However, I wonder if it's necessary to distribute them with the package at
all.  Perhaps I'm missing it, but I don't see any reference to reading
those files in your code, and the contents already appear to be held in
sysdata.rda.  IMO it would be sufficient to document in a README how
sysdata.rda was created (perhaps hosting the raw files on your own S3
storage if you have permission to do so) and then remove the raw files from
the package.
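
As a rough sketch of what such a documented creation script might look
like: the object name, file names, and columns below are assumptions for
illustration only, and a temporary directory stands in for the package's
R/ directory so the code runs anywhere.

```r
# Stand-in for a raw metadata file; in practice this would be fetched from
# wherever the raw files end up being hosted (location not specified here).
raw_file <- file.path(tempdir(), "raw_metadata.tsv")
writeLines(
  c("signature_id\tcell_line\ttime_h",
    "SIG_1\tA549\t24",
    "SIG_2\tMCF7\t6"),
  raw_file
)

# Read the raw table into the object that will become internal data.
# Any filtering steps would go here so they are recorded alongside the data.
signature_metadata <- read.delim(raw_file, stringsAsFactors = FALSE)

# In a real package the target would be "R/sysdata.rda".
out_dir <- file.path(tempdir(), "R")
dir.create(out_dir, showWarnings = FALSE)
sysdata_path <- file.path(out_dir, "sysdata.rda")
save(signature_metadata, file = sysdata_path, compress = "xz")

# Confirm the saved object loads back as expected.
loaded <- new.env()
load(sysdata_path, envir = loaded)
```

If usethis is available, usethis::use_data(signature_metadata, internal = TRUE)
is another way to write the object to R/sysdata.rda.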

Best wishes,
Mike

On Thu, 31 Aug 2023 at 20:08, Ali Sajid Imami <ali.sajid.imami using gmail.com>
wrote:

> Hi,
>
> The file is downloaded and filtered from the ilincs website. Unfortunately
> it is not readily available from ilincs.org itself. We have the capability
> of storing the file in our S3 buckets, however.
>
> The metadata under discussion is metadata related to each individual
> signature stored in iLINCS, including information about the cell lines, the
> time points and the dosages. From what I understand this is more likely to
> be suited for ExperimentHub since it's processed.
>
>
> Regards,
> Dr. Ali Sajid Imami
> LinkedIn <https://pk.linkedin.com/pub/ali-sajid-imami/50/956/2a6>
>
>
> On Thu, Aug 31, 2023 at 12:02 PM Kern, Lori via Bioc-devel <
> bioc-devel using r-project.org> wrote:
>
> > Hello,
> >
> > Regarding Hub use:
> > What sort of information does the metadata contain? That would determine
> > whether ExperimentHub or AnnotationHub is more appropriate. Is the file
> > accessed directly from the http://ilincs.org/ portal with a url link or
> > is there processing/filtering that occurs?  The hubs can access data
> > stored on other websites/hosts as long as they are trusted sites (ilincs
> > would fall in this category) if you can access it directly with a url
> > link. The way the hubs work is the data is stored elsewhere, either
> > directly from site access or on some hosting server (S3, Azure, etc.) if
> > it's processed. The data would be removed from directly being in the
> > package, and then downloaded using the hub interface when needed (and
> > also cached in the backend so it's not done every time).
> >
> >
> >
> >
> >
> > Lori Shepherd - Kern
> >
> > Bioconductor Core Team
> >
> > Roswell Park Comprehensive Cancer Center
> >
> > Department of Biostatistics & Bioinformatics
> >
> > Elm & Carlton Streets
> >
> > Buffalo, New York 14263
> >
> > ________________________________
> > From: Bioc-devel <bioc-devel-bounces using r-project.org> on behalf of Vincent
> > Carey <stvjc using channing.harvard.edu>
> > Sent: Thursday, August 31, 2023 8:29 AM
> > To: Martin Grigorov <martin.grigorov using gmail.com>
> > Cc: bioc-devel using r-project.org <bioc-devel using r-project.org>
> > Subject: Re: [Bioc-devel] Including large files for the package
> >
> > On Thu, Aug 31, 2023 at 7:28 AM Martin Grigorov <martin.grigorov using gmail.com>
> > wrote:
> >
> > > Hello,
> > >
> > > Perhaps you could use
> > > https://bioconductor.r-universe.dev/BiocFileCache to
> > > download the big file on demand.
> > > The benefit is that the file would be stored in ~/.cache/R/yourPackage/
> > > (for Linux; something similar for Windows/Mac) and reused between
> > > sessions.
> > >
> >
> > Thanks Martin.  I think that is a possible approach, but the proposals at
> > http://contributions.bioconductor.org/non-software.html?q=AnnotationHub#annotationexperiment-hub-packages
> > should also be considered.
> >
> > Ali, if the documentation regarding *Hub contributions is unclear, please
> > file an issue or write back here with the difficulties so that we can
> > improve the material and the methods!
> >
> > Thanks!
> >
> >
> >
> > >
> > > Regards,
> > > Martin
> > >
> > > On Tue, Aug 29, 2023 at 5:15 AM Ali Sajid Imami <ali.sajid.imami using gmail.com>
> > > wrote:
> > >
> > > > Hi BioConductor Team,
> > > >
> > > > I am a PhD Candidate in the Cognitive Disorders Research lab at the
> > > > University of Toledo. I am responsible for a number of R packages and
> > > > our intention is to submit them to Bioconductor over the next several
> > > > months. I had just submitted a package, drugfindR
> > > > (https://github.com/CogDisResLab/drugfindR). This was immediately
> > > > closed as my repo had a single file over the 5MB limit.
> > > >
> > > > I wanted to ask both if you would reconsider/make an exception or
> > > > guide me in the right direction.
> > > >
> > > > This package serves as a way to quickly search through the LINCS data
> > > > stored at the ilincs.org portal. The file in question is one of three
> > > > metadata files that allow the package to function efficiently and
> > > > without having to go through the expensive network requests. It would
> > > > really be helpful if we could include the file as is. I do not expect
> > > > more files like that to be added to the package at all.
> > > >
> > > > Barring that, I have seen the suggestion of using AnnotationHub or
> > > > ExperimentHub. While I have gone through the documentation, I'm not
> > > > entirely sure how those services work. Are those services where we
> > > > can store the data itself, or are we expected to host the data
> > > > elsewhere and create lightweight "pointer" packages? Similarly, I'm
> > > > not entirely sure which Hub this would go to.
> > > >
> > > > Any advice or guidance will be appreciated.
> > > >
> > > >
> > > > Regards,
> > > > Dr. Ali Sajid Imami
> > > > LinkedIn <https://pk.linkedin.com/pub/ali-sajid-imami/50/956/2a6>
> > > >
> > > >         [[alternative HTML version deleted]]
> > > >
> > > > _______________________________________________
> > > > Bioc-devel using r-project.org mailing list
> > > > https://stat.ethz.ch/mailman/listinfo/bioc-devel
> > > >
> > >
> >
>
