[Bioc-devel] Moderately large files in an Experiment Data package?

Barry, Timothy P tbarry using hsph.harvard.edu
Mon Apr 8 17:59:43 CEST 2024


Hi Lori,

Thank you for the speedy and detailed reply. I will take a crack at the ExperimentHub option and resubmit.

Cheers,
Tim

On Apr 8, 2024, at 8:34 AM, Kern, Lori <Lori.Shepherd using RoswellPark.org> wrote:

Yes, we would recommend using ExperimentHub, which is a database with pointers to the data files. Files are downloaded only when necessary, which keeps the package lightweight for end users.

You have some options for where the data is stored. We encourage the use of Zenodo or other well-trusted data storage sites, but a Bioconductor-provided Microsoft data lake is also an option.

More documentation can be found at
https://bioconductor.org/packages/release/bioc/vignettes/HubPub/inst/doc/CreateAHubPackage.html

If you already have a data package, the only real changes would be to remove the data from that package and host it at a trusted remote location, create the required inst/extdata/metadata.csv, which holds the information to add to the ExperimentHub database, and add the required biocViews to the DESCRIPTION file.
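
As a rough illustration of the metadata.csv step, the sketch below builds a one-row file and checks that every expected column is present. The column names are given as best recalled from the HubPub vignette linked above and should be verified there. Every value in `example_row` (title, version, URLs, paths, maintainer) is a hypothetical placeholder, and `build_metadata_csv` is a helper name invented for this example.

```python
import csv
import io

# Column names as recalled from the HubPub "Create A Hub Package" vignette.
# Assumption: check this list against the vignette before relying on it.
REQUIRED_COLUMNS = [
    "Title", "Description", "BiocVersion", "Genome", "SourceType",
    "SourceUrl", "SourceVersion", "Species", "TaxonomyId",
    "Coordinate_1_based", "DataProvider", "Maintainer", "RDataClass",
    "DispatchClass", "Location_Prefix", "RDataPath", "Tags",
]


def build_metadata_csv(rows):
    """Return metadata.csv text; raise ValueError if a row lacks a column."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=REQUIRED_COLUMNS,
                            lineterminator="\n")
    writer.writeheader()
    for row in rows:
        missing = [col for col in REQUIRED_COLUMNS if col not in row]
        if missing:
            raise ValueError("missing columns: " + ", ".join(missing))
        writer.writerow(row)
    return buf.getvalue()


# Hypothetical placeholder values, one row per remotely hosted file.
example_row = {
    "Title": "Example perturb-seq data",
    "Description": "Placeholder description of one hosted data file",
    "BiocVersion": "3.19",
    "Genome": "NA",
    "SourceType": "RDS",
    "SourceUrl": "https://example.org/source",
    "SourceVersion": "1.0",
    "Species": "Homo sapiens",
    "TaxonomyId": "9606",
    "Coordinate_1_based": "NA",
    "DataProvider": "Example Lab",
    "Maintainer": "Jane Doe <jane.doe@example.org>",
    "RDataClass": "list",
    "DispatchClass": "Rds",
    "Location_Prefix": "https://zenodo.org/",
    "RDataPath": "record/0000000/files/example.rds",
    "Tags": "ExampleTag",
}

metadata_text = build_metadata_csv([example_row])
print(metadata_text)
```

Writing `metadata_text` to inst/extdata/metadata.csv would complete the step. The file should still be validated with the tools described in the vignette before resubmission.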

Cheers,


Lori Shepherd - Kern
Bioconductor Core Team
Roswell Park Comprehensive Cancer Center
Department of Biostatistics & Bioinformatics
Elm & Carlton Streets
Buffalo, New York 14263
________________________________
From: Bioc-devel <bioc-devel-bounces using r-project.org> on behalf of Barry, Timothy P <tbarry using hsph.harvard.edu>
Sent: Friday, April 5, 2024 4:22 PM
To: bioc-devel using r-project.org <bioc-devel using r-project.org>
Subject: [Bioc-devel] Moderately large files in an Experiment Data package?

Hello all,

I have initiated the submission of three packages to Bioconductor: sceptre <https://katsevich-lab.github.io/sceptre/> (an R package for perturb-seq analysis), ondisc <https://timothy-barry.github.io/ondisc/> (a companion R package to sceptre that implements new data structures for large-scale single-cell data), and sceptredata <https://github.com/Katsevich-Lab/sceptredata> (an experiment data package that provides example data for sceptre and ondisc). ondisc depends on sceptredata, and sceptre in turn depends on both ondisc and sceptredata. Our updated user manual <https://timothy-barry.github.io/sceptre-book/> describes how all three of these packages interface with one another.

In accordance with the Bioconductor submission instructions, I submitted the data package (i.e., sceptredata) first <https://github.com/Bioconductor/Contributions/issues/3386>. However, I received the following error message: "The package contains individual files over 5Mb in size. This is currently not allowed." Indeed, sceptredata contains two files that are 11MB and one file that is 6MB. The package stores example data in both the `data` directory and the `inst/extdata` directory.
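
For anyone wanting to find the offending files before submitting, a minimal sketch of a size check over a package directory follows. It assumes "5Mb" means 5 * 1024 * 1024 bytes, which may not match the threshold the Bioconductor check actually applies. `find_large_files` is a name made up for this example, and the `"sceptredata"` path in the usage comment stands for a local checkout.

```python
import os

# Assumption: the "5Mb" in the error message is read as 5 MiB.
LIMIT_BYTES = 5 * 1024 * 1024


def find_large_files(package_dir, limit_bytes=LIMIT_BYTES):
    """Return (relative_path, size_in_bytes) for files above limit_bytes.

    Results are sorted from largest to smallest.
    """
    large = []
    for dirpath, _dirnames, filenames in os.walk(package_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            size = os.path.getsize(path)
            if size > limit_bytes:
                large.append((os.path.relpath(path, package_dir), size))
    return sorted(large, key=lambda item: item[1], reverse=True)


# Usage, run from the directory that contains the package checkout:
#     for rel_path, size in find_large_files("sceptredata"):
#         print(f"{rel_path}: {size / 1024 / 1024:.1f} MiB")
```

The walk covers every subdirectory, so files under both `data` and `inst/extdata` are included. It also descends into hidden directories such as `.git`, which may need to be filtered out.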

I thought that experiment data packages were allowed to have larger files? If not, does anyone have a recommendation for how I should proceed? Kasper Hansen suggested ExperimentHub as a solution. Might that be the way to go?

Thank you greatly for the help!
Tim


        [[alternative HTML version deleted]]

_______________________________________________
Bioc-devel using r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
