[Bioc-sig-seq] Using SVN for a "data package" with HTS data -- howto???

Leonardo Collado Torres lcollado at lcg.unam.mx
Wed Feb 17 02:55:13 CET 2010


Hello everyone,

How are you doing? I hope that everything is working out great for you ^_^.

Anyhow, I'm emailing you because I have a Subversion / R / HTS related 
question. A few of us (4 right now) in my lab want to analyze some 
Illumina GAIIx data and the idea is to use R as the backbone. We want to 
keep all the results in .Rdata format and kind of build an "internal" 
package so that the biologists at the lab could then load the tables 
easily. Kind of what Patrick Aboyoun told us at BioC2009. So, we want:

A) Major Script
This one will call the individual scripts, each of which does one step of the 
workflow. It will help us remember what we did and in what order. 
Actually, a .Rnw vignette file would be much better.
B) Individual Scripts
These will have code but no function definitions. For example, in one of 
these you could call the "aligner" through a function, then read the 
results, find the read coverage per base, and make a plot. Think of them 
as analysis modules.
C) Package
Here we'll define all the functions that will be called by the 
"individual" scripts, along with examples, documentation for the 
functions, etc. We'll also save the results from the scripts as R 
objects, most likely data frames. Some might be large (around 10 MB?).

The Illumina data and some big files like the alignments will not be 
kept in the package.
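
A minimal sketch of how the A/B/C layout above could be laid out on disk, 
written as a shell script. The package name labHTSdata, the file names 
under inst/scripts/, and the DESCRIPTION values are all hypothetical 
placeholders, not anything decided in the lab. The DESCRIPTION file is 
deliberately incomplete: fields such as Author, Maintainer and License 
still have to be filled in before the package can be built. Recent R 
versions expect vignettes in vignettes/; older ones used inst/doc/.

```shell
#!/bin/sh
set -e

# Hypothetical package directory; override with PKG_DIR=... if desired.
PKG_DIR="${PKG_DIR:-./labHTSdata}"

# C) Package: function definitions, their documentation, saved result tables.
mkdir -p "$PKG_DIR/R" "$PKG_DIR/man" "$PKG_DIR/data"

# B) Individual scripts: analysis modules that only call package functions.
mkdir -p "$PKG_DIR/inst/scripts"

# A) Major script, kept as an .Rnw vignette that runs the steps in order.
mkdir -p "$PKG_DIR/vignettes"

# Placeholder DESCRIPTION (incomplete on purpose, see the note above).
cat > "$PKG_DIR/DESCRIPTION" <<EOF
Package: labHTSdata
Title: Internal Data Package for Illumina GAIIx Analyses
Version: 0.0.1
Description: Functions, documentation and saved result tables for the lab.
EOF

# Empty placeholder files so the intended layout is visible.
: > "$PKG_DIR/vignettes/workflow.Rnw"
: > "$PKG_DIR/inst/scripts/01_align.R"
: > "$PKG_DIR/inst/scripts/02_coverage.R"

# Show what was created.
find "$PKG_DIR" | sort
```

The raw Illumina files and alignments stay outside this tree, so only 
the scripts, functions and the smaller result tables would end up under 
version control.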

The idea is that someone or a small team will develop individual 
scripts, but the package and the major script will be edited by everyone 
participating. Now, I think that using Subversion is the way to go. 
However, I'm puzzled at what SVN hosting service we should use... We are 
not building open source software; it's more like a data package -- VJ 
Carey talked about them at BioC2009. Eventually it would be great to 
share the package, but for some months it will all be a work in progress 
meant to be seen only by those in the lab/project. In the worst case the 
package would never make it out of the lab.

I don't know whether there is a public SVN hosting service that meets our 
needs. I guess that we could use Google Code or Rforge (just to mention 
a few) and not distribute the URL during those "lab-only" months, but 
anyone could still find it by chance. Or should we pay for one of the 
commercial SVN hosting services to keep the work private? (check 
http://www.svnhostingcomparison.com/ ) Hosting it on a local server is a 
problem for us since the administrators are quite restrictive and svn 
checkouts/commits would most likely be blocked. They've had bad luck 
with outside attacks on the servers.

Otherwise, I think that all the people involved could share a single user 
account on the server and use SVN only on that server. This would be very 
similar to using SVN on your laptop with 2 directories: the checkout one 
and the "repository" one (check 
http://www.guyrutenberg.com/2007/10/29/creating-local-svn-repository-home-repository/ 
).
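
A rough sketch of that two-directory setup, assuming the Subversion 
command-line tools (svnadmin and svn) are installed on the machine. The 
directory names svn-demo, repo and checkout and the README file are 
hypothetical; on the shared server they would be whatever paths the lab 
agrees on. The repository is reached through a file:// URL, so no SVN 
server process or open network port is involved.

```shell
#!/bin/sh
set -e

# Hypothetical locations for the repository and the working copy.
BASE="${BASE:-$PWD/svn-demo}"
REPO="$BASE/repo"
WC="$BASE/checkout"

if command -v svnadmin >/dev/null 2>&1 && command -v svn >/dev/null 2>&1; then
    mkdir -p "$BASE"

    # Create the repository once; it is only a directory on disk.
    [ -d "$REPO" ] || svnadmin create "$REPO"

    # Check out a working copy through a file:// URL.
    [ -d "$WC" ] || svn checkout "file://$REPO" "$WC"

    # Add a first file and commit it.
    if [ ! -e "$WC/README" ]; then
        echo "Lab HTS data package (work in progress)" > "$WC/README"
        svn add "$WC/README"
        svn commit -m "Add README" "$WC"
    fi

    # List the history straight from the repository.
    svn log "file://$REPO"
else
    echo "svn/svnadmin not found; install Subversion to try this." >&2
fi
```

If everyone logs in as the same server user, all commits are recorded 
under that one account name, so it would be worth agreeing to put your 
own name in each commit message.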


As you can tell, I'm quite new to SVN and to working 
collaboratively with Illumina GA data. Any tips are more than welcome :) 
I also asked on SEQanswers: 
http://seqanswers.com/forums/showthread.php?t=4071

Thank you and greetings,
Leonardo

-- 
Leonardo Collado Torres, Bachelor in Genomic Sciences
Member of Dr. Enrique Morett's lab and Winter Genomics
UNAM Campus Cuernavaca, Mexico

Homepage: http://www.lcg.unam.mx/~lcollado/
Phone: [52] (777) 313-28-05


