[Bioc-devel] 5mb limit for packfiles in .git is too harsh

Kern, Lori Lor|@Shepherd @end|ng |rom Ro@we||P@rk@org
Mon Jan 23 13:23:04 CET 2023

I would also argue that Bioconductor's current policy is not to store such large data directly in a package, but to use the hub interface instead.  Large data files often aren't necessary for an end user and, as you say, are only used for examples.  Smaller files are often sufficient for a proof-of-principle use of a package, and since users may not want to run the examples, large files would just waste space on their local machines.

Lori Shepherd - Kern

Bioconductor Core Team

Roswell Park Comprehensive Cancer Center

Department of Biostatistics & Bioinformatics

Elm & Carlton Streets

Buffalo, New York 14263

From: Bioc-devel <bioc-devel-bounces using r-project.org> on behalf of Park, Adam Keebum <sein.park using psu.edu>
Sent: Saturday, January 21, 2023 3:24 PM
To: bioc-devel using r-project.org <bioc-devel using r-project.org>
Subject: [Bioc-devel] 5mb limit for packfiles in .git is too harsh

Dear community,

First, I want to thank Nathan for the amazing help with my two previous inquiries. The answers led me to pinpoint the issue.

 The final decision I made after hours of analysis was to remove all data files larger than 50 KB from the git history. However, this practice is not sustainable and is actually pathological, because it invalidates virtually all previous data files and hence hampers the reproducibility of previous commits, especially their unit tests. Therefore, I am leaving this message here in the hope of reaching the Bioconductor administrators.

 I would claim that this policy should be relaxed, at least for the git packfile. Most of us know that the .pack file residing in .git/objects/pack has frequently been flagged by BiocCheck() for its large size (as in here<https://stat.ethz.ch/pipermail/bioc-devel/2019-February/014703.html> or here<https://stat.ethz.ch/pipermail/bioc-devel/2020-October/017273.html>), which is natural given the purpose of packfiles: storing "all removal history" in a single compact space<https://git-scm.com/book/en/v2/Git-Internals-Packfiles#:~:text=The%20packfile%20is%20a%20single,seek%20to%20a%20specific%20object.>.
 Compressing the whole git history into one file is effective only as long as most deltas are line-level changes to text files such as source code. In my experience, however, modifications to binary blobs contributed much more, because the delta is inflated when a dataset is stored compressed and even a small modification shakes up its bit pattern. Such changes were still at the kilobyte level, but they gradually inflated the whole pack file, so I had to remove those files. The current policy therefore forces the deletion of kilobyte-sized files from the git history, not just 'large' files...
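
 As a rough illustration of the effect described above, here is a minimal Python sketch. It is not Git's delta algorithm: it uses zlib with a preset dictionary as a stand-in for "encode the new version given the old one", and the table data, the helper names delta_cost and make_table, and the edit pattern are all made up for the example. It only suggests why a small edit to a dataset that is stored compressed can cost far more to record than the same edit to the uncompressed text.

```python
import random
import zlib


def delta_cost(old: bytes, new: bytes) -> int:
    """Bytes needed to encode `new` when `old` is given as a zlib preset dictionary.

    Assumption: this is only a crude proxy for a delta encoder, not what Git does.
    """
    compressor = zlib.compressobj(level=9, zdict=old)
    return len(compressor.compress(new) + compressor.flush())


def make_table(rows: int, seed: int) -> list[str]:
    """Build a small synthetic CSV-like table (hypothetical test data)."""
    rng = random.Random(seed)
    return [f"{i},{rng.gauss(0.0, 1.0):.4f}\n" for i in range(rows)]


old_rows = make_table(rows=1200, seed=1)

# A "small modification": overwrite the value in every 100th row.
new_rows = list(old_rows)
for i in range(0, len(new_rows), 100):
    new_rows[i] = f"{i},0.0000\n"

old_raw = "".join(old_rows).encode()
new_raw = "".join(new_rows).encode()

# The same two versions, but stored compressed (as a data file often is).
old_packed = zlib.compress(old_raw)
new_packed = zlib.compress(new_raw)

raw_delta = delta_cost(old_raw, new_raw)
packed_delta = delta_cost(old_packed, new_packed)

print(f"uncompressed file size:        {len(new_raw)} bytes")
print(f"compressed file size:          {len(new_packed)} bytes")
print(f"cost of edit, uncompressed:    {raw_delta} bytes")
print(f"cost of edit, stored compressed: {packed_delta} bytes")
```

 In this sketch the edit to the plain-text table can be described almost entirely by references to the old version, whereas the two compressed streams share little byte-level content, so nearly the whole compressed file has to be stored again.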
 I might not be the only one using multiple 100 KB experimental data files in unit tests and vignettes. Including dozens of such files in a 5 MB package might be acceptable. I believe the same can hold for the pack file, because it just represents a collection of previous files which are still less than 5 MB. I suggest the policy relax this file size limit to allow safer and more reproducible developer practices.
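
 For anyone who wants to check where their own repository stands, the following is a minimal Python sketch that lists the .pack files under .git/objects/pack and reports those above a size limit. The function names packfile_sizes and oversized_packfiles are hypothetical, and the default limit is an assumption that reads the "5mb" in the subject line as 5 MiB; it is not taken from the Bioconductor checks themselves.

```python
from pathlib import Path

# Assumption: the "5mb" limit discussed in this thread, read as 5 MiB.
LIMIT_BYTES = 5 * 1024 * 1024


def packfile_sizes(repo) -> dict[str, int]:
    """Map each .pack file name in <repo>/.git/objects/pack to its size in bytes."""
    pack_dir = Path(repo) / ".git" / "objects" / "pack"
    if not pack_dir.is_dir():
        return {}
    return {p.name: p.stat().st_size for p in sorted(pack_dir.glob("*.pack"))}


def oversized_packfiles(repo, limit: int = LIMIT_BYTES) -> list[str]:
    """Return the names of pack files strictly larger than `limit` bytes."""
    return [name for name, size in packfile_sizes(repo).items() if size > limit]


if __name__ == "__main__":
    # Inspect the repository in the current working directory.
    for name, size in packfile_sizes(".").items():
        flag = "over limit" if size > LIMIT_BYTES else "ok"
        print(f"{name}: {size / 1024:.1f} KiB ({flag})")
```

 Running it from the top level of a clone prints one line per pack file; a repository with no packed objects simply prints nothing.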



Bioc-devel using r-project.org mailing list
