[Bioc-devel] Best alternative to transferring large data in parallel runs, To be used in IntEREst

Oghabian, Ali @||@ogh@b|@n @end|ng |rom he|@|nk|@||
Tue Apr 19 12:50:46 CEST 2022


Dear fellow developers, Hello.

I have a question regarding the best method to collect numeric results that will be summed, from parallel running processes whilst preventing or limiting transfer of large data to/from the parallel processes.

My question is more directed towards IntEREst and other software/packags that attempt to summarise bam files in parallel. By "summarise" I mean to count how many reads map to genes or introns and exons of the genes in an alignment bam file.

Currently, IntEREst uses BiocParallel::bpiterate() function to go through all alignment info in a bam file N (e.g. 1,000,000) reads at a time; counts the reads that maps to each exon/intron of the requested genes; collects these read counts information; and sums over their values to get the overall reads that map to each exon/intron. If we want to run it genome-wide, it may be collecting numeric vectors of sizes up-to 3,000,000 from each parallel run which requires large memory capacity. What are good alternative to transferring these large numeric vectors from parallel runs (so that they would be summed over eventually) ?

I can think of 2 alternative methods:

1. Using temporary files: In the first versions of IntEREst, each run used to write its result into a temporary txt file which were eventually read and analysed and then deleted.  However, when I was submitting the package to BioConductor I was advised to avoid controlling these kind of data transfers by temporary files.

2. Using databases: It is also possible to store and update the results in a database I assume, but I am not sure how reliably it (e.g. SQLite) manages when as for instance 2 or several parallel running processes want to modify the same table in the database.

Any ideas on these issues or suggestions of better alternatives would be very useful and much appreciated.

Cheers,

Ali Oghabian

	[[alternative HTML version deleted]]



More information about the Bioc-devel mailing list