[Bioc-devel] Best alternative to transferring large data in parallel runs, To be used in IntEREst

Martin Morgan mtmorgan.bioc at gmail.com
Tue Apr 19 13:37:31 CEST 2022


I guess you are saying that there is one big BAM file, rather than several BAM files. If there are several BAM files, then the best strategy is probably to process each in its own parallel environment.

Perhaps GenomicFiles::reduceByRange is a solution (create ranges that span the genome, and process each range separately)? Care needs to be taken to avoid double-counting of reads spanning ranges.
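
In case a sketch helps, here is roughly how I imagine the reduceByRange() route. This is untested and assumes hypothetical objects 'bam' (path to the single BAM file) and 'exons' (a GRanges of the exons/introns to count); the tile width is arbitrary.

    library(GenomicFiles)
    library(GenomicAlignments)
    library(Rsamtools)

    ## genome-spanning ranges; 50 Mb tiles chosen arbitrarily
    tiles <- tileGenome(seqinfo(BamFile(bam)), tilewidth = 5e7,
                        cut.last.tile.in.chrom = TRUE)

    ## MAP reads only the alignments overlapping one range and returns a
    ## count vector over all features ('exons' comes from the enclosing
    ## environment); counts are non-zero only for features in that range.
    ## Reads spanning two tiles would be counted twice, so some filtering
    ## (e.g. by read start position) is still needed.
    MAP <- function(range, file) {
        param <- ScanBamParam(which = range)
        aln <- readGAlignments(file, param = param)
        countOverlaps(exons, aln)
    }

    res <- reduceByRange(tiles, files = bam, MAP = MAP)
    counts <- Reduce(`+`, unlist(res, recursive = FALSE))  # sum per-range vectors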

There is also GenomicFiles::reduceByYield, which I think implements your strategy: sending individual chunks to workers for processing. I guess I'm still not seeing where the data become very large; the chunks sent to workers can be controlled, and the return "counts" are also small.
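
Something along these lines, again untested and assuming hypothetical 'bam' and 'exons' objects; with parallel = TRUE the chunks are dispatched to workers via bpiterate() and only the running sum is kept.

    library(GenomicFiles)
    library(GenomicAlignments)
    library(Rsamtools)

    bf <- BamFile(bam, yieldSize = 1e6)               # 1e6 records per chunk

    YIELD  <- function(x) readGAlignments(x)          # next chunk of alignments
    MAP    <- function(aln) countOverlaps(exons, aln) # small per-chunk count vector
    REDUCE <- `+`                                     # running sum of the vectors

    counts <- reduceByYield(bf, YIELD, MAP, REDUCE, parallel = TRUE)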

Martin

From: Oghabian, Ali <ali.oghabian at helsinki.fi>
Date: Tuesday, April 19, 2022 at 7:16 AM
To: Martin Morgan <mtmorgan.bioc at gmail.com>, bioc-devel at r-project.org <bioc-devel at r-project.org>
Subject: Re: Best alternative to transferring large data in parallel runs, To be used in IntEREst
Hi Martin!
It is done on a single BAM file. However, the issue is that for a large BAM file the whole data set can't be loaded, or even if it can be, analyzing it takes a long time! Therefore, the alignment info is read in chunks of e.g. 1 million paired reads and sent to the parallel processes, so that overall the file is analyzed faster. With, say, 11 parallel processes, analyzing the BAM file becomes almost 10 times faster.

Cheers,

Ali



--

Ali Oghabian

Bioinformatics,
Post Doctoral Researcher
Folkhälsan Research Center
Neuromuscular Disorders Research lab
Genetic Determinants of Osteoporosis Research lab

Address: Room C307b, Folkhälsan Research Center,
Biomedicum, University of Helsinki,
Haartmaninkatu 8,
00290 Helsinki, Finland
Tel (Office): +358 2941 25629, +358 50 4484028
________________________________
From: Martin Morgan <mtmorgan.bioc at gmail.com>
Sent: Tuesday, April 19, 2022 2:00 PM
To: Oghabian, Ali <ali.oghabian at helsinki.fi>; bioc-devel at r-project.org <bioc-devel at r-project.org>
Subject: Re: Best alternative to transferring large data in parallel runs, To be used in IntEREst


Maybe seeking clarification more than answering your question, but can the summarization (count reads that map to exons / introns) be done independently for each BAM file, so that the return value is a vector with a length equal to the number of introns / exons? This would be "small", and would not require any special effort. Martin
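
Concretely, something like this (untested) is what I have in mind, assuming BAM paths in 'bams' and exon/intron features in 'features':

    library(BiocParallel)
    library(GenomicAlignments)

    countOne <- function(bam, features) {
        aln <- readGAlignments(bam)       # whole file, just for illustration
        countOverlaps(features, aln)      # numeric vector, length(features)
    }

    res    <- bplapply(bams, countOne, features = features)
    counts <- do.call(cbind, res)         # features x files count matrix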



From: Bioc-devel <bioc-devel-bounces at r-project.org> on behalf of Oghabian, Ali <ali.oghabian at helsinki.fi>
Date: Tuesday, April 19, 2022 at 6:51 AM
To: bioc-devel at r-project.org <bioc-devel at r-project.org>
Subject: [Bioc-devel] Best alternative to transferring large data in parallel runs, To be used in IntEREst

Dear fellow developers, Hello.

I have a question regarding the best way to collect numeric results (which will eventually be summed) from parallel processes while preventing, or at least limiting, the transfer of large data to/from those processes.

My question is more directed towards IntEREst and other software/packages that attempt to summarise BAM files in parallel. By "summarise" I mean counting how many reads map to the genes, or to the introns and exons of the genes, in an alignment BAM file.

Currently, IntEREst uses the BiocParallel::bpiterate() function to go through all the alignment info in a BAM file N (e.g. 1,000,000) reads at a time; it counts the reads that map to each exon/intron of the requested genes, collects these read counts, and sums them to get the overall number of reads that map to each exon/intron. If we want to run it genome-wide, it may be collecting numeric vectors of length up to 3,000,000 from each parallel run, which requires a large memory capacity. What are good alternatives to transferring these large numeric vectors from the parallel runs (so that they can eventually be summed)?
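
For reference, the pattern looks roughly like the following (simplified and untested; 'bam' and 'features' stand in for the real IntEREst objects). Note that bpiterate() also accepts a REDUCE function, which sums each worker's vector as soon as it arrives instead of keeping all of them in memory:

    library(BiocParallel)
    library(GenomicAlignments)
    library(Rsamtools)

    bf <- BamFile(bam, yieldSize = 1e6)
    open(bf)

    ITER <- function() {                       # runs on the manager; NULL ends iteration
        aln <- readGAlignments(bf)
        if (length(aln) == 0L) NULL else aln
    }

    FUN <- function(aln, features)             # runs on the workers
        countOverlaps(features, aln)

    counts <- bpiterate(ITER, FUN, features = features,
                        REDUCE = `+`, BPPARAM = MulticoreParam(4))
    close(bf)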

I can think of 2 alternative methods:

1. Using temporary files: In the first versions of IntEREst, each run wrote its result to a temporary txt file, which was eventually read, analysed, and then deleted (see the sketch after this list). However, when I submitted the package to Bioconductor I was advised to avoid managing this kind of data transfer through temporary files.

2. Using databases: I assume it is also possible to store and update the results in a database, but I am not sure how reliably a database (e.g. SQLite) handles the situation where two or more parallel processes want to modify the same table at the same time.
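
For what it's worth, the temporary-file pattern from alternative 1 looked roughly like this (a stripped-down, untested sketch; the actual counting is omitted, and 'chunks' and 'features' are stand-ins):

    library(BiocParallel)

    worker <- function(chunk, features) {
        counts <- integer(length(features))    # real code would count the chunk here
        out <- tempfile(fileext = ".txt")
        write(counts, out, ncolumns = 1)
        out                                    # only the small file path is returned
    }

    paths  <- bplapply(chunks, worker, features = features)
    counts <- Reduce(`+`, lapply(paths, scan, what = integer(), quiet = TRUE))
    unlink(unlist(paths))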

Any ideas on these issues or suggestions of better alternatives would be very useful and much appreciated.

Cheers,

Ali Oghabian


