[Bioc-devel] Best alternative to transferring large data in parallel runs, To be used in IntEREst

Oghabian, Ali ali.oghabian at helsinki.fi
Tue Apr 19 13:15:57 CEST 2022


Hi Martin!
It is done on a single bam file. However, the issue is that for a large bam file the whole data cannot be loaded at once, or even if it can be, analyzing it takes a long time! Therefore, the alignment info of every e.g. 1 million paired reads from the bam is read and sent to the parallel processes, so that the file is, overall, analyzed faster. With 11 parallel processes, analyzing the bam file would be almost 10 times faster.
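For illustration, here is a minimal sketch of that chunked reading, assuming a hypothetical file "sample.bam"; the yieldSize argument makes each readGAlignmentPairs() call return at most 1e6 read pairs:

    library(Rsamtools)
    library(GenomicAlignments)

    bf <- BamFile("sample.bam", yieldSize = 1e6)  # hypothetical path
    open(bf)
    repeat {
        chunk <- readGAlignmentPairs(bf)  # next <= 1e6 read pairs
        if (length(chunk) == 0L)
            break
        ## ... send `chunk` to a worker to be counted ...
    }
    close(bf)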

Cheers,

Ali



--


Ali Oghabian

Bioinformatics,

Postdoctoral Researcher
Folkhälsan Research Center
Neuromuscular Disorders Research lab
Genetic Determinants of Osteoporosis Research lab

Address: Room C307b, Folkhälsan Research Center,
Biomedicum, University of Helsinki,
Haartmaninkatu 8,
00290 Helsinki, Finland
Tel (Office): +358 2941 25629, +358 50 4484028
________________________________
From: Martin Morgan <mtmorgan.bioc at gmail.com>
Sent: Tuesday, April 19, 2022 2:00 PM
To: Oghabian, Ali <ali.oghabian at helsinki.fi>; bioc-devel at r-project.org <bioc-devel at r-project.org>
Subject: Re: Best alternative to transferring large data in parallel runs, To be used in IntEREst


Maybe seeking clarification more than answering your question, but can the summarization (count reads that map to exons / introns) be done independently for each BAM file, so that the return value is a vector with a length equal to the number of introns / exons? This would be "small", and would not require any special effort. Martin
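As a sketch of what that per-file summarization could look like, assuming `features` is a GRanges of the exons/introns of interest (the function and variable names here are illustrative, not IntEREst's API):

    library(GenomicAlignments)  # also attaches Rsamtools, SummarizedExperiment

    countOneBam <- function(bamPath, features) {
        se <- summarizeOverlaps(features, BamFile(bamPath),
                                mode = "Union", singleEnd = FALSE)
        assay(se)[, 1]  # one count per feature -- a small vector
    }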



From: Bioc-devel <bioc-devel-bounces at r-project.org> on behalf of Oghabian, Ali <ali.oghabian at helsinki.fi>
Date: Tuesday, April 19, 2022 at 6:51 AM
To: bioc-devel at r-project.org <bioc-devel at r-project.org>
Subject: [Bioc-devel] Best alternative to transferring large data in parallel runs, To be used in IntEREst

Dear fellow developers, Hello.

I have a question regarding the best way to collect, from parallel running processes, numeric results that will be summed, while preventing or limiting the transfer of large data to/from those processes.

My question is directed mostly towards IntEREst and other software/packages that attempt to summarise bam files in parallel. By "summarise" I mean counting how many reads map to genes, or to the introns and exons of genes, in an alignment bam file.

Currently, IntEREst uses the BiocParallel::bpiterate() function to go through all alignment info in a bam file N (e.g. 1,000,000) reads at a time; it counts the reads that map to each exon/intron of the requested genes, collects these read counts, and sums them to get the overall number of reads mapping to each exon/intron. Run genome-wide, it may collect numeric vectors of length up to 3,000,000 from each parallel run, which requires a large memory capacity. What are good alternatives to transferring these large numeric vectors from the parallel runs (so that they can eventually be summed)?
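For reference, a minimal sketch of this pattern; `countChunk` and `features` are hypothetical stand-ins for IntEREst's counting step, and "sample.bam" is a hypothetical path. Note that bpiterate()'s optional REDUCE argument can sum the chunk results as they arrive, instead of collecting them all first:

    library(BiocParallel)
    library(GenomicAlignments)

    bf <- BamFile("sample.bam", yieldSize = 1e6)   # hypothetical path
    ITER <- function() {
        chunk <- readGAlignmentPairs(bf)
        if (length(chunk) == 0L) NULL else chunk   # NULL ends iteration
    }

    open(bf)
    counts <- bpiterate(ITER,
                        FUN = function(chunk) countChunk(chunk, features),
                        REDUCE = `+`)              # running element-wise sum
    close(bf)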

I can think of two alternative methods:

1. Using temporary files: In the first versions of IntEREst, each run wrote its result to a temporary txt file, which was eventually read, analysed, and then deleted (see the first sketch after this list). However, when I submitted the package to Bioconductor I was advised to avoid controlling this kind of data transfer through temporary files.

2. Using databases: It is also possible to store and update the results in a database, I assume, but I am not sure how reliably a database (e.g. SQLite) copes when, for instance, two or more parallel processes want to modify the same table (see the second sketch after this list).
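A sketch of the temporary-file idea in point 1 above, assuming `chunkCounts` is a hypothetical list of per-chunk count vectors; each worker would return only the small path string:

    writeChunk <- function(counts) {
        path <- tempfile(fileext = ".rds")
        saveRDS(counts, path)   # worker writes its result to disk
        path                    # only the path travels back
    }

    ## Manager: read each file, sum element-wise, then clean up.
    paths <- vapply(chunkCounts, writeChunk, character(1))
    total <- Reduce(`+`, lapply(paths, readRDS))
    unlink(paths)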
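And a sketch of the database idea in point 2: SQLite serialises writers with a file lock, so concurrent workers need a busy timeout to retry rather than fail immediately. Here `chunkCounts` is a hypothetical named vector of counts, and the table/column names are illustrative:

    library(DBI)
    library(RSQLite)

    con <- dbConnect(SQLite(), "counts.db")        # hypothetical path
    dbExecute(con, "PRAGMA busy_timeout = 5000")   # wait up to 5 s on locks
    dbExecute(con,
        "CREATE TABLE IF NOT EXISTS counts (feature TEXT, n INTEGER)")
    dbWriteTable(con, "counts",
                 data.frame(feature = names(chunkCounts),
                            n = as.integer(chunkCounts)),
                 append = TRUE)
    dbDisconnect(con)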

Any ideas on these issues or suggestions of better alternatives would be very useful and much appreciated.

Cheers,

Ali Oghabian


_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel



