[Bioc-devel] query regarding bfg on bioconductor repository

Turaga, Nitesh Nitesh.Turaga using RoswellPark.org
Fri Nov 2 17:54:01 CET 2018


Hello,

Shweta, I have now fixed your package in the “master” and “RELEASE_3_8” branches. You should sync your package on GitHub either by taking a fresh clone from git.bioconductor.org (http://bioconductor.org/developers/how-to/git/maintain-github-bioc/) or by syncing your existing repository (http://bioconductor.org/developers/how-to/git/sync-existing-repositories/). Please make sure there are no duplicate commits before you push from a local repository to the server.

Please also check that the package works as expected. Some commits may have been omitted, so make sure the code and all the files look exactly like they are supposed to.

Thank you. 

Nitesh,




P.S: This postscript is going to be somewhat long, for the benefit of the other maintainers in the community.

Just so other maintainers are informed, this was probably my most challenging “de-duplication” and “large-file” removal yet. One of the main concerns was the lack of informative commit messages in the package.

NOTE: Lots of assumptions were made in fixing this package as I’m not the maintainer. Every maintainer should take precaution after such a process to make sure the package works as expected, and debug any issues.

I found the largest files in the restfulSE package using,

	>> du -ha 

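This shows that the .pack file is the largest (257MB). A pack file stores the compressed objects (file contents, trees and commits) from the repository's history. Listing its contents sorted by size shows the largest objects: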

	>> git verify-pack -v ./.git/objects/pack/pack-39dcca4030fa3e3901bb5874afa855110956cb4b.pack | sort -k 3 -n | tail -10                                                                                                 
	d502afb0e1e3307f0185bf2463af330c0db01d45 blob   13144692 13148712 497110
	35452f5c77520daf966f0228e1a102b71f3302b1 blob   17148372 17018509 121401976
	b501ea4f451e61790377b06da6f7072f6532b2c9 blob   17849118 17730695 160967245
	1850dbfeb056878662e8b383fb2fbf0aa27f5ebd blob   25657619 22542379 138421662
	a495c635ebe44020dd563c0cd4265a33dd76c8ec blob   26339453 22683993 75191432
	7baad7ed1d83dbf2eeb8a97f4c542035a06adb43 blob   27136447 23505426 51685564
	6c28ec53a7765775540422258716467143182fb8 blob   27153019 23522387 97877531
	a34d64c68a7b02519905fa547674764351df4310 blob   27374529 23735207 13645822
	64d12c30c8f5805c896d010929ea7c77e7e39279 blob   38726127 38541293 230237839
	8bda5612a454c33e50efe4f87d9a7442e239ab23 blob   40075136 39119863 187385432
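
For readers unfamiliar with this listing: each line of `git verify-pack -v` gives the object ID, its type, its size in bytes, its size inside the pack, and its offset in the pack. As a rough sketch only, the listing can be condensed to the largest blobs with a small filter. The function name `largest_blobs` is made up for illustration and is not part of the procedure described here.

```shell
# Hypothetical helper: condense `git verify-pack -v` output (read from stdin)
# to the N largest blobs. Columns: id, type, size, size-in-pack, offset.
largest_blobs() {
    n="${1:-10}"                             # how many blobs to show
    awk '$2 == "blob" { print $3, $1 }' |    # keep blobs only, as "size id"
        sort -rn |                           # largest first
        head -n "$n" |
        awk '{ printf "%s %d MiB\n", $2, int($1 / 1048576) }'
}

# Usage (assumes the pack path is adjusted to your repository):
#   git verify-pack -v .git/objects/pack/pack-*.pack | largest_blobs 10
```

Sizes are rounded down to whole MiB, which is enough to spot files far above a 5M limit.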

The IDs in the first column are blob (file content) IDs rather than commit IDs; they identify the largest objects in the history. I saved these IDs from stdout to a new file called “shas”.

	>> while read sha; do
		git rev-list --objects --all | grep "$sha"
	   done < ../shas

	d502afb0e1e3307f0185bf2463af330c0db01d45 data/banoSEMeta.rda
	35452f5c77520daf966f0228e1a102b71f3302b1 data/stfull_rse.rda
	b501ea4f451e61790377b06da6f7072f6532b2c9 data/banoSEMeta.rda
	1850dbfeb056878662e8b383fb2fbf0aa27f5ebd data/stfull.rda
	a495c635ebe44020dd563c0cd4265a33dd76c8ec data/full_1Mneurons.rda
	7baad7ed1d83dbf2eeb8a97f4c542035a06adb43 data/full_1Mneurons.rda
	6c28ec53a7765775540422258716467143182fb8 data/sefull.rda
	a34d64c68a7b02519905fa547674764351df4310 data/full_1Mneurons.rda
	64d12c30c8f5805c896d010929ea7c77e7e39279 data/tenx_100k_sorted.rda
	8bda5612a454c33e50efe4f87d9a7442e239ab23 data/tenx_100k_sorted.h5
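
The loop above runs `git rev-list` once per ID. A minimal sketch of a single-pass variant follows, assuming the ID file holds exactly one object ID per line; the helper name `paths_for_ids` is hypothetical.

```shell
# Hypothetical helper: read "id path" lines (as printed by
# `git rev-list --objects --all`) on stdin and keep the lines that
# contain any of the ids listed in the file given as $1.
paths_for_ids() {
    grep -F -f "$1"    # -F: fixed strings, -f: read patterns from a file
}

# Usage, assuming ../shas holds one object id per line:
#   git rev-list --objects --all | paths_for_ids ../shas
```

On a repository with a long history this avoids walking all objects ten times.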

This gave me the names of the files which were causing the issue (the .rda files in data/, plus one .h5 file).

I used BFG Repo-Cleaner to strip those files (every blob bigger than 5M) from the history,

	>> java -jar ~/Downloads/bfg-1.13.0.jar --strip-blobs-bigger-than 5M restfulSE

	>> git reflog expire --expire=now --all && git gc --prune=now --aggressive
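
To confirm that the cleanup actually shrank the repository, the packed size can be compared before and after. This is a sketch under the assumption that `git count-objects -v` is run inside the repository; the helper name `pack_size_kib` is made up for illustration.

```shell
# Hypothetical helper: pull the packed size (in KiB) out of
# `git count-objects -v` output read from stdin.
pack_size_kib() {
    awk -F': ' '$1 == "size-pack" { print $2 }'
}

# Usage:
#   git count-objects -v | pack_size_kib    # run before BFG and again after gc
```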


Next, I wanted to clean the commit history because there are many duplicates. If you look at the commit history, it’s important to notice that the commits back to the ID `40a0cf0` are duplicated. The commit before this is `f6e30f8`, which is unduplicated. The goal is to reset to this unduplicated commit, and replay the subsequent unduplicated commits on top of it. (https://github.com/shwetagopaul92/restfulSE/commits/master)
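
One way to spot duplicated commits without reading the log by eye is to group commits by patch ID, since `git patch-id` prints a "patch-id commit-id" pair for each commit in `git log -p` output. The sketch below is not what I ran; the helper name `duplicate_patches` is hypothetical, and it only catches duplicates whose diffs are identical.

```shell
# Hypothetical helper: read "patch-id commit-id" pairs on stdin and print
# each patch id that belongs to more than one commit, followed by those
# commits.
duplicate_patches() {
    awk '{ count[$1]++; commits[$1] = commits[$1] " " $2 }
         END { for (p in count) if (count[p] > 1) print p commits[p] }'
}

# Usage:
#   git log -p master | git patch-id | duplicate_patches
```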


To overlay the series of commits, I first listed them,

	>> git log --oneline > commits.txt 

I manually unduplicated the “commits.txt” file (i.e., I literally just went through and deleted every alternate line). Then I extracted only the commit IDs, since we don’t need the commit messages. We have to overlay these commits in reverse order (oldest first); this is important to remember.

	>> cat commits.txt | awk -F" " '{print $1}' > commits

	>> tail -r commits > commits_reversed ## reverse the order of the commits (tail -r is BSD/macOS; GNU systems have tac)
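
The manual step could be approximated with a filter like the one below. This is only a sketch and swaps in a different rule from the one I used: it assumes duplicated commits carry identical subject lines and keeps the newest commit of each group, so genuinely different commits that share a message would be wrongly collapsed. The helper name `unique_ids_oldest_first` is hypothetical.

```shell
# Hypothetical helper: read `git log --oneline` output on stdin, keep the
# first (newest) commit for each distinct subject line, and print the kept
# ids oldest first, ready for cherry-picking.
unique_ids_oldest_first() {
    awk '{ id = $1; $1 = ""; if (!seen[$0]++) ids[++n] = id }
         END { for (i = n; i > 0; i--) print ids[i] }'
}

# Usage:
#   git log --oneline | unique_ids_oldest_first > commits_reversed
```

Because the reversal happens inside awk, this works the same on macOS and Linux. The resulting list should still be reviewed by hand before replaying it.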

I reset to the last commit before the duplicates,

	~/D/restfulSE ❯❯❯ git reset --hard f6e30f8                                                                                                                                                                                             
	HEAD is now at f6e30f8 drop .swp file


Now we have to overlay these commits onto the reset branch. NOTE: I’m oversimplifying the process here; there were a few “cherry-pick” conflicts which needed to be sorted out. Once a conflict was resolved, I’d have to curate the “commits_back” file manually based on the commit at which the “while loop” broke.

	~/D/restfulSE ❯❯❯ while read commit; do
				git cherry-pick $commit
			  done < ../commits_back
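
A plain loop like this keeps going after a failed cherry-pick. A minimal sketch of a variant that stops at the first failure and reports where it stopped is shown below; the function name `replay_commits` is hypothetical, and the list is read on file descriptor 3 so that git's own stdin is left alone.

```shell
# Hypothetical helper: cherry-pick the commit ids listed in file $1 (one per
# line, oldest first) and stop at the first one that fails.
replay_commits() {
    list="$1"
    n=0
    while read -r commit <&3; do
        n=$((n + 1))
        [ -n "$commit" ] || continue          # skip blank lines
        if ! git cherry-pick "$commit"; then
            echo "stopped at line $n ($commit)" >&2
            echo "resolve the conflict, run 'git cherry-pick --continue'," >&2
            echo "then rerun with the remaining lines of $list" >&2
            return 1
        fi
    done 3< "$list"
    return 0
}

# Usage:
#   replay_commits ../commits_back
```

Knowing the line number makes it easier to trim the list before resuming.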


This process puts us at the beginning of RELEASE_3_7. More commits had been made since then, and these needed to be cherry-picked from RELEASE_3_8, so I repeated the process to overlay them. To make those commits available for cherry-picking, I first added a remote pointing to /tmp/restfulSE, my copy of the package’s most recent version (with all the duplicated commits from RELEASE_3_8 / master),

	>> git remote add upstream /tmp/restfulSE

	>> git fetch --all

	>> git log --oneline upstream/master > ../more_commits

	## Manually deduplicate them,
	>> vim more_commits

	## reverse the order and get just the first column,
	>> cat more_commits | awk -F" " '{print $1}' | tail -r > ~/Documents/more_commits_back


Overlay all the new commits, while battling with conflicts. (There were plenty of conflicts in this)

	>> while read commit; do
		git cherry-pick $commit
	   done < ../more_commits_back


Once the commits were overlaid, I had to do the release process (RELEASE_3_8) again for restfulSE,

	>> git reset --soft 6899c3f 

	>> git branch RELEASE_3_8

	>> git add DESCRIPTION

	>> git commit -m "bump x.y.z versions to odd y after creation of RELEASE_3_8 branch"
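
The version arithmetic behind that commit message can be sketched as follows, assuming the convention it names (the devel branch carries an odd y, and z restarts at 0). The helper name `bump_to_odd_y` is made up for illustration; the new value still has to be written into the `Version:` field of DESCRIPTION by hand or with your editor.

```shell
# Hypothetical helper: given a version "x.y.z", print the next version with
# an odd y and z reset to 0 (assumed devel-branch convention).
bump_to_odd_y() {
    echo "$1" | awk -F. '{ y = $2 + 1; if (y % 2 == 0) y++; print $1 "." y ".0" }'
}

# Usage:
#   bump_to_odd_y 1.14.3
```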


After adding the commits from RELEASE_3_8, I again removed any large “.rda” files they had brought back, using BFG Repo-Cleaner,

	>> java -jar ~/Downloads/bfg-1.13.0.jar --strip-blobs-bigger-than 5M restfulSE

	>> git reflog expire --expire=now --all && git gc --prune=now --aggressive


Then, I added the appropriate remotes for git.bioconductor.org and force pushed.

This now gives the package restfulSE a clean commit history in both the `master` and the `RELEASE_3_8` branches. This of course does not guarantee that the package works as intended and the maintainer should take all precautions to fix it from this point forward. 

And as a reminder to authors, please follow the instructions on http://bioconductor.org/developers/how-to/git/. If you have any questions, please ask on the bioc-devel mailing list. It is much easier to answer a question beforehand than to manually fix a repository afterwards.

The main takeaway from this is that it is extremely tedious to fix commit histories. If you have any questions about how I did this, I did the best I could to document the process, and I saved copies of the original repositories with all their issues.

Best,

Nitesh 


P.S: Just kidding, there can’t be more than one :D (Follow best practices. )

> On Oct 31, 2018, at 3:38 PM, Turaga, Nitesh <Nitesh.Turaga using RoswellPark.org> wrote:
> 
> Hi Shweta,
> 
> Please hold off making any more commits to restfulSE. I’ve noticed some discrepancies in your package pre-release and post-release. I’ll try to correct it the best I can before letting you know. 
> 
> The issues seem to be two-fold: duplicate commits and an unusually large package. Your BFG cleaning, as we discussed offline, succeeded in making the package size smaller, but it seems to have induced more issues as far as contamination of commit history goes. 
> 
> I’ll work on this in the next day or so, and let you know. 
> 
> You can then sync a fresh copy of the package on the Bioconductor git server.
> 
> Best,
> 
> Nitesh 
> 
> 
>> On Oct 30, 2018, at 10:32 AM, Shweta Gopaulakrishnan <reshg using channing.harvard.edu> wrote:
>> 
>> Hi Nitesh, 
>> 
>> Hope you are doing good ! I am working with bfg to reduce the size of "restfulSE" package. I am able to do so with the github repository but not with the bioconductor repository. 
>> 
>> The steps I am following for the Bioconductor one are:
>> 
>>> git clone --mirror git@git.bioconductor.org:packages/restfulSE
>>> java -jar bfg.jar --strip-blobs-bigger-than 30M restfulSE.git
>>> cd restfulSE.git
>>> git reflog expire --expire=now --all && git gc --prune=now --aggressive
>>> git push 
>> 
>> I get an error : fatal: This operation must be run in a work tree 
>> 
>> Is there any other way to push changes upstream after bfg ? 
>> 
>> Thank you! 
>> -- 
>> Shweta Gopaulakrishnan,
>> Bioinformatician,
>> Channing Division of Network Medicine,
>> Brigham and Women's Health Hospital,
>> Boston,MA 02115
>> 
> 


