[Bioc-sig-seq] Genominator: strategy for combining multiple AlignedRead objects

Kasper Daniel Hansen kasperdanielhansen at gmail.com
Mon Apr 19 17:26:58 CEST 2010


Hi Joe

This is addressed in the development version.  We now have the
capability of giving importFromAlignedReads a (named) vector of
filenames instead of a named list of AlignedRead objects.  This vector
of filenames will be read in one at a time, so you just need enough
memory to process a single lane.  I have processed around 160 lanes
worth of data using this approach.

There is an extended example in the 'with ShortRead' vignette.

importFromAlignedReads can also sum several lanes directly into one
column (if you need this).  Say you have 6 files (lanes) from a 3x2
experiment, and you have decided to add up over the lanes so that the
database ends up with 2 columns.  You can do this by giving the files
repeated names, like
  "a", "a", "a", "b", "b", "b"
(this will create two columns named "a" and "b", each holding 3 lanes'
worth of data).
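
As a rough sketch of the naming scheme, the vector of filenames could be
built in base R like this.  The file names are made up for illustration,
and the final call to importFromAlignedReads is deliberately left out
because its remaining arguments depend on your setup (see the vignette).

```r
# Hypothetical file names; substitute the paths to your own lanes.
files <- c("lane1.txt", "lane2.txt", "lane3.txt",
           "lane4.txt", "lane5.txt", "lane6.txt")

# Repeated names mark the lanes that should be summed into one column.
names(files) <- c("a", "a", "a", "b", "b", "b")

# Check the grouping before importing: one entry per resulting column.
lanes_by_column <- split(unname(files), names(files))
print(lanes_by_column)
```

The named vector `files` is what would then be passed to
importFromAlignedReads in place of a named list of AlignedRead objects.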

In this case, all 3 lanes that share a name will be read into memory at
the same time.  It is less memory efficient, but it was much easier to
code.  If that is impossible, you should create a standard 6 column
database and then use collapseExpData.  Summing inside
importFromAlignedReads is more of a convenience (and speed) trick.
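
To make the effect of collapsing concrete, here is a toy illustration in
base R of summing a 6 column table into 2 columns.  It does not use
Genominator or collapseExpData at all; the matrix, its numbers and the
column names are invented, and it only shows the arithmetic that
collapsing by group performs.

```r
# Toy counts: 2 genomic positions (rows) by 6 lanes (columns).
counts <- matrix(1:12, nrow = 2,
                 dimnames = list(NULL, paste0("lane", 1:6)))

# Group label for each lane, in column order.
groups <- c("a", "a", "a", "b", "b", "b")

# rowsum() sums rows by group, so transpose, sum, and transpose back
# to get one summed column per group.
collapsed <- t(rowsum(t(counts), group = groups))
print(collapsed)
```

Each column of `collapsed` holds the per-position sum over the 3 lanes
in that group, which is the same end result as importing with repeated
names.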

I uploaded a new version 1.1.6 yesterday which I recommend, because of
some documentation updates.  This version should replace 1.1.5 on the
Bioconductor development servers sometime tomorrow.

Kasper


On Mon, Apr 19, 2010 at 11:06 AM, joseph franklin
<joseph.franklin at yale.edu> wrote:
> I'm addressing this to Jim Bullard, who has been really helpful answering some of my questions, as well as the list, in case anyone has some advice for me.
>
> I've started using Genominator (I'm using the release version right now) to quantitate and analyze RNA-seq data, and have been really successful aggregating AlignedRead objects with my own annotation tables to produce per-gene counts.  I've done this with sets of 2-3 AlignedRead objects (each representing an Illumina lane), but I'd like to extend the approach to a few dozen lanes.  Since this is far too much data to fit in memory, I need an efficient way to combine many AlignedRead objects at once that doesn't rely on them being loaded as objects at the same time.
>
> I imagine that I need to load the objects into tables using the importFromAlignedReads, and then join the appropriate columns, either before or after aggregation (the manual hints that afterwards is preferable).  However, there are a few points I'm confused with (probably resulting from my limited experience with SQLite):
>
> - I've been unable to load a SQLite database file that was previously created with importFromAlignedReads--what is the best way to load the database connection--for instance, during a new R session?
>
> - Can AlignedRead objects only be imported (via importFromAlignedReads) as named lists of two or more objects?  What about single AlignedRead objects?  I would imagine that a solution to my problem would be to create a separate table in a database file for each of my AlignedRead objects (I made a loop to do this), and then join these tables (as long as I can create a connection to the database).
>
> I think my problems could be solved if I could load the AlignedRead objects from multiple lanes into tables in database file, load it, and join the appropriate columns from the various tables (and then aggregate with the annotations in a single step--this would seem to be the most straightforward).  Any advice on accomplishing these steps would be much appreciated.
>
> Thanks again,
> Joe Franklin
>
> ________________________________
> Joseph Franklin
> Department of Cell Biology
> Yale University
> 295 Congress Ave, BCMM 137
> New Haven, CT 06519
> USA
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>


