[Bioc-devel] Best alternative to transferring large data in parallel runs, To be used in IntEREst

Martin Morgan mtmorgan.bioc at gmail.com
Wed Apr 20 12:03:08 CEST 2022


Hi Kristian – I'm not exactly sure of your use case. I looked at ?scanFa and did

library(Rsamtools)
fa <- system.file("extdata", "ce2dict1.fa", package="Rsamtools", mustWork=TRUE)
seqinfo = seqinfo(FaFile(fa))

to get the names and lengths of the sequences in the index

Seqinfo object with 5 sequences from an unspecified genome:
  seqnames  seqlengths isCircular genome
  pattern01         18         NA   <NA>
  pattern02         25         NA   <NA>
  pattern03         24         NA   <NA>
  pattern04         24         NA   <NA>
  pattern05         25         NA   <NA>

I then used GenomicRanges::tileGenome() to generate ranges that span the entire FASTA file; I'm using a very small window because the example is small…

tiles = tileGenome(seqinfo, tilewidth=10, cut.last.tile.in.chrom=TRUE)

So I have

> tiles
GRanges object with 14 ranges and 0 metadata columns:
        seqnames    ranges strand
           <Rle> <IRanges>  <Rle>
   [1] pattern01      1-10      *
   [2] pattern01     11-18      *
   [3] pattern02      1-10      *
   [4] pattern02     11-20      *
   [5] pattern02     21-25      *
   ...       ...       ...    ...
  [10] pattern04     11-20      *
  [11] pattern04     21-24      *
  [12] pattern05      1-10      *
  [13] pattern05     11-20      *
  [14] pattern05     21-25      *
  -------
  seqinfo: 5 sequences from an unspecified genome

I could coerce this to a GRangesList and use it in lapply() to iterate over the ranges, e.g., counting the GC content of each region

FUN = function(tile, file) {
    dna = scanFa(file, tile)
    sum(alphabetFrequency(dna)[,c("G", "C")])
}
gc_tiles = lapply(as(tiles, "GRangesList"), FUN, fa)
unlist(gc_tiles) |> sum()

To run this in parallel with BiocParallel, I could do

gc_tiles = bplapply(as(tiles, "GRangesList"), FUN, fa)
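In case it's useful, the backend can also be pinned down explicitly; the particular param class and worker count below are just illustrative choices, not something the example requires

library(BiocParallel)

param = MulticoreParam(workers = 4)   ## SnowParam(4) would be the choice on Windows
gc_tiles = bplapply(as(tiles, "GRangesList"), FUN, fa, BPPARAM = param)
unlist(gc_tiles) |> sum()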

The GenomicFiles equivalent works directly on the GRanges, and FUN serves as the MAP argument; unlist() is used to simplify the MAP result on each worker

reduceByRange(tiles, fa, MAP = FUN, REDUCE = unlist)

which is automatically parallelized.
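Spelled out as a self-contained sketch (the library() calls and the registered backend are my additions, not something the one-liner itself needs)

library(GenomicFiles)
library(BiocParallel)

register(MulticoreParam(workers = 2))   ## the registered backend is what the automatic parallelization uses
res = reduceByRange(tiles, fa, MAP = FUN, REDUCE = unlist)
unlist(res) |> sum()                    ## same total GC count as the lapply() version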

Hope that’s on the right track…

Martin


From: Kristian Ullrich <ullrich using evolbio.mpg.de>
Date: Tuesday, April 19, 2022 at 10:58 AM
To: Oghabian, Ali <ali.oghabian using helsinki.fi>
Cc: Martin Morgan <mtmorgan.bioc using gmail.com>, bioc-devel using r-project.org <bioc-devel using r-project.org>
Subject: Re: [Bioc-devel] Best alternative to transferring large data in parallel runs, To be used in IntEREst
Dear Martin Morgan,

just a related question to the `FastqStreamer` and `bpiterate` function.

Do you have an example of how to parallelise the `scanFa` function, which extracts regions (rather than whole sequences by ID) from an indexed FASTA file?

The aim would be an easy, parallel sliding-window function (fasta_file_path, window_length, window_jump) for big whole-genome multiple-sequence-alignment files.

The chunks of that MSA could then be analysed in parallel without needing to pre-load the big MSA FASTA file on each worker.
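Something along these lines is what I have in mind (just a rough sketch; the function name and arguments are made up, and slidingWindows() from GenomicRanges stands in for the windowing):

library(Rsamtools)      ## FaFile(), scanFa(), seqinfo()
library(GenomicRanges)  ## GRanges(), slidingWindows()
library(BiocParallel)   ## bplapply()

window_apply = function(fasta_file_path, window_length, window_jump, FUN,
                        BPPARAM = bpparam()) {
    si = seqinfo(FaFile(fasta_file_path))
    chroms = GRanges(seqnames(si), IRanges(1, seqlengths(si)))
    windows = unlist(slidingWindows(chroms, width = window_length, step = window_jump))
    ## only the window coordinates travel to the workers; each worker re-opens
    ## the indexed FASTA and extracts just its window with scanFa()
    bplapply(as(windows, "GRangesList"), function(win) {
        FUN(scanFa(FaFile(fasta_file_path), win))
    }, BPPARAM = BPPARAM)
}

## e.g. GC count per window of a (hypothetical) whole-genome MSA
## gc = window_apply("msa.fa", 1000, 500,
##                   function(dna) sum(alphabetFrequency(dna)[, c("G", "C")]))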

Best regards

Kristian




On 19. Apr 2022, at 13:57, Oghabian, Ali <ali.oghabian using helsinki.fi> wrote:

Hi again Martin!

The GenomicFiles::reduceByYield function that you mentioned seems to be a very good fit for carrying out the task efficiently. Thank you for the nice hint.

Right now, all results are collected from all of the parallel runs (and stored in a list), and they are reduced at the end.

Cheers,

Ali


________________________________
From: Martin Morgan <mtmorgan.bioc using gmail.com>
Sent: Tuesday, April 19, 2022 2:37 PM
To: Oghabian, Ali <ali.oghabian using helsinki.fi>; bioc-devel using r-project.org <bioc-devel using r-project.org>
Subject: Re: Best alternative to transferring large data in parallel runs, To be used in IntEREst


I guess you are saying that there is one big BAM file, rather than several BAM files. If there are several BAM files, then the best strategy is probably to process each in its own parallel environment.



Perhaps GenomicFiles::reduceByRange is a solution (create ranges that span the genome, and process each range separately)? Care needs to be taken to avoid double-counting of reads spanning ranges.
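One common way to handle the double-counting (a sketch only; 'features' would be a GRanges of the exons/introns, available on the workers) is to count a read only in the range that contains its start position:

library(Rsamtools)           ## ScanBamParam()
library(GenomicAlignments)   ## readGAlignments(), countOverlaps()

MAP = function(range, file, ...) {
    aln = readGAlignments(file, param = ScanBamParam(which = range))
    ## keep only reads whose leftmost position falls inside this range, so a
    ## read spanning two adjacent ranges is counted exactly once
    aln = aln[start(aln) >= start(range) & start(aln) <= end(range)]
    countOverlaps(features, aln)
}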



There is also GenomicFiles::reduceByYield, which I think implements your strategy of sending individual chunks to workers for processing. I guess I'm still not seeing where the data become very large: the chunks sent to workers can be controlled, and the returned 'counts' are also small.
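For reference, the reduceByYield() pattern looks roughly like this (the file name, yield size, and the trivial MAP are placeholders; in your case MAP would do the per-chunk intron/exon counting):

library(GenomicFiles)
library(GenomicAlignments)
library(Rsamtools)

bf = BamFile("example.bam", yieldSize = 1000000)   ## 1 million records per yield

YIELD = function(x, ...) readGAlignments(x)        ## pull the next chunk of alignments
MAP = function(aln) length(aln)                    ## stand-in for the real counting step
REDUCE = `+`                                       ## combine chunk results as they arrive

total = reduceByYield(bf, YIELD, MAP, REDUCE, parallel = TRUE)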



Martin



From: Oghabian, Ali <ali.oghabian using helsinki.fi>
Date: Tuesday, April 19, 2022 at 7:16 AM
To: Martin Morgan <mtmorgan.bioc using gmail.com>, bioc-devel using r-project.org <bioc-devel using r-project.org>
Subject: Re: Best alternative to transferring large data in parallel runs, To be used in IntEREst

Hi Martin!

It is done on a single BAM file. However, the issue is that for a large BAM file the whole data set can't be loaded, or even if it can be, analyzing it takes a long time! Therefore, the alignment info of every e.g. 1 million paired reads from the BAM is read and sent to the parallel processes so that, overall, the file is analyzed faster. If there are 11 parallel processes then analyzing the BAM file would be almost 10 times faster.



Cheers,



Ali





--

Ali Oghabian

Bioinformatics,

Post Doctoral Researcher

Folkhälsan Research Center

Neuromuscular Disorders Research lab

Genetic Determinants of Osteoporosis Research lab



Address: Room C307b, Folkhälsan Research Center,

Biomedicum, University of Helsinki,

Haartmaninkatu 8,

00290 Helsinki, Finland

Tel (Office): +358 2941 25629, +358 50 4484028

________________________________

From: Martin Morgan <mtmorgan.bioc using gmail.com>
Sent: Tuesday, April 19, 2022 2:00 PM
To: Oghabian, Ali <ali.oghabian using helsinki.fi>; bioc-devel using r-project.org <bioc-devel using r-project.org>
Subject: Re: Best alternative to transferring large data in parallel runs, To be used in IntEREst



Maybe seeking clarification more than answering your question, but can the summarization (count reads that map to exons / introns) be done independently for each BAM file, so that the return value is a vector with a length equal to the number of introns / exons? This would be "small", and would not require any special effort. Martin



From: Bioc-devel <bioc-devel-bounces using r-project.org> on behalf of Oghabian, Ali <ali.oghabian using helsinki.fi>
Date: Tuesday, April 19, 2022 at 6:51 AM
To: bioc-devel using r-project.org <bioc-devel using r-project.org>
Subject: [Bioc-devel] Best alternative to transferring large data in parallel runs, To be used in IntEREst

Dear fellow developers, Hello.

I have a question regarding the best method to collect numeric results (which will eventually be summed) from parallel running processes, while preventing or limiting the transfer of large data to/from the parallel processes.

My question is mostly directed towards IntEREst and other software/packages that attempt to summarise BAM files in parallel. By "summarise" I mean counting how many reads map to genes, or to the introns and exons of the genes, in an alignment BAM file.

Currently, IntEREst uses the BiocParallel::bpiterate() function to go through all the alignment info in a BAM file N (e.g. 1,000,000) reads at a time; it counts the reads that map to each exon/intron of the requested genes, collects this read-count information, and sums the values to get the overall number of reads mapping to each exon/intron. If we run it genome-wide, it may be collecting numeric vectors of size up to 3,000,000 from each parallel run, which requires a lot of memory. What are good alternatives to transferring these large numeric vectors from the parallel runs (so that they can eventually be summed)?
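To make the pattern concrete, here is a simplified sketch of what happens now ('bam_path' and 'features' are placeholders, and countOverlaps() stands in for IntEREst's actual counting code):

library(BiocParallel)
library(GenomicAlignments)
library(Rsamtools)

bf = BamFile(bam_path, yieldSize = 1000000)   ## 1 million records per chunk
open(bf)

ITER = function() {
    aln = readGAlignments(bf)                 ## next chunk, read serially in the manager
    if (length(aln) == 0L) NULL else aln      ## NULL signals the end of the file
}

COUNT = function(aln, features) {
    countOverlaps(features, aln)              ## one count per exon/intron, up to ~3,000,000 values
}

## every chunk's count vector comes back to the manager and is kept in a list
## before the final sum; this is where the memory use piles up
res = bpiterate(ITER, COUNT, features = features, BPPARAM = MulticoreParam(10))
totals = Reduce(`+`, res)
close(bf)

(As far as I can tell, bpiterate() also has REDUCE/init arguments that combine the chunk results as they arrive, but the question stands for the general case where results have to come back from the workers.)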

I can think of 2 alternative methods:

1. Using temporary files: In the first versions of IntEREst, each run wrote its result to a temporary txt file, which was eventually read, analysed and then deleted. However, when I submitted the package to Bioconductor I was advised to avoid managing this kind of data transfer through temporary files.

2. Using databases: It is also possible to store and update the results in a database, I assume, but I am not sure how reliably it (e.g. SQLite) copes when, for instance, two or more parallel processes want to modify the same table in the database (a rough sketch of what I mean is below).
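For what it's worth, the kind of pattern I have in mind would be something like this (a rough sketch assuming the RSQLite package; the file name and table layout are made up):

library(DBI)
library(RSQLite)

db = "counts.sqlite"
con = dbConnect(SQLite(), db)
dbExecute(con, "CREATE TABLE IF NOT EXISTS counts (feature TEXT, n INTEGER)")
dbDisconnect(con)

write_counts = function(counts, db) {             ## called from each worker
    con = dbConnect(SQLite(), db)
    on.exit(dbDisconnect(con))
    dbExecute(con, "PRAGMA busy_timeout = 5000")  ## wait on a locked database instead of failing
    dbWriteTable(con, "counts",
                 data.frame(feature = names(counts), n = as.integer(counts)),
                 append = TRUE)
}

My understanding is that SQLite serialises writers via file locking, so concurrent workers mostly end up waiting on the lock rather than corrupting the table, but I am not sure how well that scales.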

Any ideas on these issues or suggestions of better alternatives would be very useful and much appreciated.

Cheers,

Ali Oghabian


_______________________________________________
Bioc-devel using r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel




