[Bioc-sig-seq] Using SVN for a "data package" with HTS data -- howto???

Wed Feb 17 03:10:21 CET 2010

On Tue, Feb 16, 2010 at 8:55 PM, Leonardo Collado Torres
<lcollado at lcg.unam.mx> wrote:
> Hello everyone,
>
> How are you doing? I hope that everything is working out great for you ^_^.
>
> Anyhow, I'm emailing you because I have a Subversion / R / HTS related
> question. A few of us (4 right now) in my lab want to analyze some Illumina
> GAIIx data and the idea is to use R as the backbone. We want to keep all the
> results in .Rdata format and kind of build an "internal" package so that the
> biologists at the lab could then load the tables easily. Kind of what
> Patrick Aboyoun told us at BioC2009. So, we want:
>
> A) Major Script
> This one will call the individual scripts that each do a step in the workflow. It
> will help us remember what we did and in what order. Actually, a .Rnw
> vignette file would be much better.

Yep, a vignette is a good way to go.  If you are doing time-consuming
things, be sure to check out the weaver package.

> B) Individual Scripts
> These will have code but no function definitions. For example, on one of
> these you could call the "aligner" through a function, then read the
> results, find the read coverage per base, make a plot. Kind of analysis
> modules.

There is also no real reason these scripts could not be in another
package; for example, you could have one package for each "project".
Using a package allows one to document objects and code as well as to
specify dependencies formally (a script might depend on several other
packages, for example).
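
As a rough illustration of the layout such a per-project package could
have, here is a minimal sketch in Python that only creates the
directory tree and a DESCRIPTION file with a Depends field.  The
package name "labHTSdata", the version number, and the listed
dependencies are placeholders, not anything from this thread, and a
package that passes R CMD check would need further fields (Author,
Maintainer, License) plus real documentation.

```python
import tempfile
from pathlib import Path


def make_package_skeleton(root, name, depends):
    """Create a bare R package directory tree and return its path.

    `name` and `depends` are illustrative placeholders; nothing here
    runs R or validates the package.
    """
    pkg = Path(root) / name
    # R/ holds function definitions, data/ the saved R objects,
    # man/ the documentation, inst/scripts/ the analysis scripts.
    for sub in ("R", "data", "man", "inst/scripts"):
        (pkg / sub).mkdir(parents=True, exist_ok=True)
    description = "\n".join([
        f"Package: {name}",
        "Version: 0.0.1",
        "Title: Internal HTS results (placeholder title)",
        "Description: Functions, scripts and saved objects for one project.",
        "Depends: " + ", ".join(depends),
    ]) + "\n"
    (pkg / "DESCRIPTION").write_text(description)
    return pkg


if __name__ == "__main__":
    # Hypothetical example: build the skeleton in a scratch directory.
    scratch = tempfile.mkdtemp()
    pkg = make_package_skeleton(scratch, "labHTSdata",
                                ["R (>= 2.10)", "ShortRead"])
    print((pkg / "DESCRIPTION").read_text())
```

The Depends line is where a project states formally which other
packages its scripts need, which is the main thing a loose collection
of scripts cannot do.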

> C) Package
> There we'll define all the functions that will be called by the
> "individual" scripts, examples, documentation for the functions, etc. Also,
> we'll save the results from the scripts as R objects; most likely, data
> frames. Some might be large (10 MB?).
>
> The Illumina data and some big files like the alignments will not be kept on
> the package.
>
> The idea is that someone or a small team will develop individual scripts,
> but the package and the major script will be edited by everyone
> participating. Now, I think that using Subversion is the way to go. However,
> I'm puzzled at what SVN hosting service we should use... We are not building
> open source software; it's more like a data package -- VJ Carey talked about
> them at BioC2009. Eventually it would be great to share the package, but for
> some months it will all be a work in progress meant to be seen only by those
> in the lab/project. In the worst case the package would never make it out of
> the lab.
>
> I'm not aware if there is a public SVN hosting service that meets our needs.
> I guess that we could use Google Code or Rforge (just to mention a few) and
> not distribute the url for those "lab-only" months -- but anyone could
> find it by chance. Or should we hire one of the commercial SVN hosting
> services to keep the work private? (check
> http://www.svnhostingcomparison.com/ ) Hosting it at a local server is a
> problem for us since they are quite restrictive and svn checkouts/commits
> would most likely be blocked. They've had bad luck with exterior attacks on
> the servers.
>
> Otherwise I think that all the people involved could use the same server
> user and use SVN only at the server. Something very similar to using SVN on
> your laptop with 2 directories: the checkout one and the "repository" one
> (check
> http://www.guyrutenberg.com/2007/10/29/creating-local-svn-repository-home-repository/
> ).
>
>
> As you can notice, I'm quite the newbie on SVN and working collaboratively
> with Illumina GA data. Any tips are more than welcome :) I also asked on
> SEQanswers: http://seqanswers.com/forums/showthread.php?t=4071

Hi, Leonardo.  SVN is not too hard to set up, but you will probably
want to serve it through Apache.  However, you might consider other
revision control systems as well.

http://en.wikipedia.org/wiki/Revision_control

The main discussion point, in my opinion, is whether to use a
distributed system (git, bazaar, mercurial) or a centralized system
like svn.  I actually prefer the distributed system (I use git) over
svn, but that is just personal preference.  Because much work is done
with svn, I interface with svn using git-svn (so that even my
interactions with the bioconductor svn server are via git).
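
For anyone unfamiliar with git-svn, a typical session could be
sketched as below.  This is only an outline under assumptions: the
repository URL and working-directory name are made up, and the Python
function just assembles the command lines (git svn clone, git svn
rebase, git svn dcommit) without running them.

```python
import shlex


def git_svn_session(svn_url, workdir, message):
    """Return the git commands for one clone/edit/commit round trip.

    The commands are returned as argument lists and are NOT executed;
    `svn_url` and `workdir` are hypothetical values from the caller.
    """
    return [
        # One-time: mirror the svn repository into a local git repository.
        ["git", "svn", "clone", svn_url, workdir],
        # Record edits locally; nothing reaches the svn server yet.
        ["git", "-C", workdir, "commit", "-a", "-m", message],
        # Fetch new svn revisions and replay local commits on top.
        ["git", "-C", workdir, "svn", "rebase"],
        # Push each local commit to the svn server as an svn revision.
        ["git", "-C", workdir, "svn", "dcommit"],
    ]


if __name__ == "__main__":
    # Placeholder URL; substitute the real repository location.
    for argv in git_svn_session("https://svn.example.org/repos/proj/trunk",
                                "proj", "Add coverage script"):
        print(" ".join(shlex.quote(arg) for arg in argv))
```

Each list could be passed to subprocess.run to actually execute it,
assuming git with git-svn support is installed.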

No matter what system you go with, make sure that it is well backed up!

Sean

> Thank you and greetings,
> Leonardo
>
> --
> Leonardo Collado Torres, Bachelor in Genomic Sciences
> Member of Dr. Enrique Morett's lab and Winter Genomics
> UNAM Campus Cuernavaca, Mexico
>
> Homepage: http://www.lcg.unam.mx/~lcollado/
> Phone: [52] (777) 313-28-05
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>