[Bioc-devel] Removal of large items in git history - BiocCheck warning

Mike Smith gr|mbough @end|ng |rom gm@||@com
Tue Mar 9 12:14:41 CET 2021

I've used something like this approach in the past.  All the normal caveats
about making sure you've got a backup apply.

Find the names of the largest objects in the pack file (the output is not
necessarily in size order).  In this case they're almost all .rda files.

git rev-list --objects --all | grep -f <(git verify-pack -v .git/objects/pack/*.idx | sort -k 3 -n | cut -f 1 -d " " | tail -15)

e63fb55738f4d6643939863ec7799776d4b161c5 EWCE.html
f67b528ec5e029fbeb45c2ff90d619de0d7ae4c0 articles/EWCE.html
ae0e4cda88322aaff0b064136c84096d16dc219f reference/ewce.plot-1.png
8946eeb7255c328676a61da71276a29002e34d1f data/all_hgnc.rda
60814dfe9cbf3cb77b846a9fc0270bc7cc00d50c data/all_hgnc_wtEnsembl.rda
d152a56e7290abb06eab1112910a499145dbd3e1 data/all_mgi.rda
7075962fb2ccc78b826c7fc1823d0e3d5e5d7b01 data/all_mgi_wtEnsembl.rda
100a7fa8df12deb1803a437b442c0897811916df data/mgi_synonym_data.rda
f890d2bbd63b7ecff94e4917b6b7188399659221 data/mouse_to_human_homologs.rda
fddddd7022bc96d24d75cf71d65c097d84bade88 data/tt_alzh.rda
98aba69ade5c09a2100248c963bb5397860ae089 data/tt_alzh_BA36.rda
0f006997c7a45a5647dd5ce21be650d6c197ea29 data/tt_alzh_BA44.rda
67b2d63f55531f85ece47e298213fd25cacdaa01 data/cortex_mrna.rda
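
If you also want the sizes next to the names, something like the following
might help.  It is only a sketch: the function name largest_blobs is made up
for illustration, and the sizes it prints are uncompressed blob sizes in
bytes, not the space each object takes up inside the pack file.

```shell
# List the N largest blobs reachable from any ref as "<bytes> <path>",
# smallest first.  N defaults to 15 to mirror the `tail -15` used above.
# (largest_blobs is a hypothetical helper name, not a git command.)
largest_blobs() {
    n="${1:-15}"
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
        sed -n 's/^blob [0-9a-f]* //p' |   # keep blobs only, drop type and id
        sort -n |
        tail -n "$n"
}
```

Run it from the top level of the repo, e.g. `largest_blobs 15`.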

Remove every file with the .rda extension from all commits.  You should be
more careful here if there are .rda files you want to retain, but I don't see
any in the main branch on GitHub.  git prints a pretty scary-looking warning
about filter-branch, but it has worked out OK for me in the past.

git filter-branch --index-filter 'git rm --cached --ignore-unmatch *.rda' -- --all
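
Before running the filter-branch command above, it may be worth listing
every .rda path that has ever been committed, so you can see exactly what is
about to disappear.  A minimal sketch (the function name is hypothetical):

```shell
# Print each distinct path ending in .rda that is touched by any commit
# on any ref.  (rda_paths_in_history is a made-up helper name.)
rda_paths_in_history() {
    git log --all --pretty=format: --name-only |
        grep '\.rda$' |
        sort -u
}
```

If anything in that list should be kept, narrow the pattern given to git rm
accordingly.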

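Delete the backup refs and reflogs that filter-branch leaves behind, then
garbage-collect so the old objects are actually removed from the repo.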

rm -Rf .git/refs/original
rm -Rf .git/logs/
git gc --aggressive --prune=now
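
To confirm that nothing reachable still carries an .rda path after the
cleanup, a check along these lines should do.  It is a sketch, and the
function name is made up for illustration:

```shell
# Exit status 0 only if no object reachable from any ref has a path
# ending in .rda.  (no_rda_left is a hypothetical helper name.)
no_rda_left() {
    ! git rev-list --objects --all | grep -q '\.rda$'
}
```

For example: `no_rda_left && echo "history is clean"`.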

Check the new size of the pack folder.

du -h .git/objects/pack
3,9M .git/objects/pack
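
Since the BiocCheck warning is about individual files over 5MB, you could
also test each pack file directly.  This is a sketch only: the function name
is hypothetical, and I'm assuming the limit is counted as 5 * 1024 * 1024
bytes, which may not be exactly how BiocCheck measures it.

```shell
# Print every pack file larger than the limit and return 1 if any is found.
# The default limit of 5242880 bytes (5 MiB) is an assumption.
# (packs_over_limit is a made-up helper name.)
packs_over_limit() {
    limit="${1:-5242880}"
    found=0
    for f in .git/objects/pack/*.pack; do
        [ -e "$f" ] || continue             # no pack files: glob stays literal
        size=$(( $(wc -c < "$f") ))         # arithmetic strips any padding
        if [ "$size" -gt "$limit" ]; then
            echo "$f: $size bytes"
            found=1
        fi
    done
    return "$found"
}
```

Run it from the top level of the repo; it prints nothing when every pack is
under the limit.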

You could probably apply this approach to remove the large .html files too,
but it looks like they're part of the pkgdown site for your package so I
imagine you want to keep them.


On Tue, 9 Mar 2021 at 10:09, Murphy, Alan E <a.murphy using imperial.ac.uk> wrote:

> Hi both,
> Thank you for your suggestions. Yes, I am still having problems with the
> size of my git history in the EWCE package. To clarify, I have already
> tried the BFG cleaner to no avail even when I set the max limit to 1 MB
> (see my first email for details).
> The issue is that a .git/objects/pack/ file is still greater than the
> allotted 5MB, it appears to be 8.9MB in size. As mentioned, I have used the
> BFG cleaner and yet this still remains too large. If anyone has suggestions
> on how else I could reduce this size that would be great.
> @Nitesh Turaga, how would I go about
> checking (and removing?) hidden files from the .git/objects/pack history?
> Kind regards,
> Alan.
> ________________________________
> From: stefano <mangiolastefano using gmail.com>
> Sent: 08 March 2021 22:18
> To: Nitesh Turaga <nturaga.bioc using gmail.com>
> Cc: Murphy, Alan E <a.murphy using imperial.ac.uk>; bioc-devel using r-project.org <
> bioc-devel using r-project.org>
> Subject: Re: [Bioc-devel] Removal of large items in git history -
> BiocCheck warning
> Hello,
> you can use bfg-repo-cleaner;
> have a read of this document, in the section "eliminate big files from
> repo"
> https://docs.google.com/document/d/1jxg7KCMQq3kiCcvodQk9JgtU51LqczOwLit1gHiTP4Q/edit?usp=sharing
> Best wishes.
> Stefano
> Stefano Mangiola | Postdoctoral fellow
> Papenfuss Laboratory
> The Walter and Eliza Hall Institute of Medical Research
> +61 (0)466452544
> On Tue, 9 Mar 2021 at 09:11, Nitesh Turaga <nturaga.bioc using gmail.com>
> wrote:
> Hi Alan,
> Did you manage to solve this?
> There seem to be objects in your git repo which are bigger than the size
> allowed by Bioconductor for a software package. Please check hidden files
> as well.
> One test you can do is to clone your package from GitHub and see how many
> MB are downloaded to the new location. This is a good test to check whether
> files are still larger than the limit.
> Best,
> Nitesh
> On 3/4/21, 11:19 AM, "Bioc-devel on behalf of Murphy, Alan E" <
> bioc-devel-bounces using r-project.org on behalf of
> a.murphy using imperial.ac.uk> wrote:
>     Hi all,
>     I am working on the development of EWCE
> <https://github.com/NathanSkene/EWCE> for submission to Bioconductor. I
> have removed some large objects from the package and moved them to a
> separate ExperimentHub package; however, after their removal, I got a
> BiocCheck large file warning.
>     To deal with the data stored in git history, I followed the
> instructions to use the BFG cleaner with the max size set to 5MB. This
> appeared to work and some things were removed, yet I still get the
> warning below:
>     $warning[1] "The following files are over 5MB in size:
> '.git/objects/pack/pack-366a7ab7a2ba4e656f3a9f3f1408be7ab9f41303.pack'"
>     If I try to rerun the BFG cleaner I get the following output:
>     Warning : no large blobs matching criteria found in packfiles - does
> the repo need to be packed?
>     I have tried two different methods of using the BFG cleaner, one from
> BFG <https://rtyley.github.io/bfg-repo-cleaner/> themselves and one from
> Bioconductor
> <https://bioconductor.org/developers/how-to/git/remove-large-data/>. I
> have also completed all steps in both, including the prune step:
>     git reflog expire --expire=now --all && git gc --prune=now --aggressive
>     I have even tried reducing the max from 5MB to 1MB but still nothing
> seems to be left even at that size. Does anyone know of another way to sort
> this issue or have any clue what I may be doing wrong?
>     Kind regards,
>     Alan.
>     Alan Murphy
>     Bioinformatician
>     Neurogenomics lab
>     UK Dementia Research Institute
>     Imperial College London
>     _______________________________________________
>     Bioc-devel using r-project.org mailing list
>     https://stat.ethz.ch/mailman/listinfo/bioc-devel
