Thanks Michael, that's a very smart solution! This is exactly what I
wanted! :-)

Regards,
Karolis


On Mon, Jun 9, 2014 at 5:05 PM, Michael Lawrence <lawrence.michael@gene.com>
wrote:

> Hi Karolis,
>
> I hope you don't mind that I've cc'd the Bioc help list for this reply.
> The "arbitrary" select mode is not meant to be random; it's deterministic
> and is just whatever the algorithm finds first (not necessarily according
> to the order in the subject). Not all of the select modes are supported for
> every combination of object types. In your "real world" example, you have a
> GAlignments and a GRangesList. For that particular combination, only "all"
> and "first" are supported.
>
> To solve your problem, you should probably use "all" and then select
> randomly from the resulting Hits object. Something like this:
>
> hits <- findOverlaps(bam_aligns, exonRanges)
> perm <- sample(length(hits))
> features <- rep(NA_integer_, queryLength(hits))
> features[queryHits(hits)[perm]] <- subjectHits(hits)[perm]
>
> Then see tabulate() for counting the reads for each feature. You might
> also want to look into summarizeOverlaps() and its custom function mode
> feature.
>
> Michael
>
> On Mon, Jun 9, 2014 at 2:39 AM, Karolis Uziela <
> karolis.uziela@scilifelab.se> wrote:
>
>> Dear Mr. Michael Lawrence,
>>
>> I am using findOverlaps function from GenomicRanges to assign reads to
>> custom features in reference genome. I am concerned about the cases where a
>> read overlaps with several features. In this case, I would like the read to
>> be assigned exactly to one feature which is picked at random from all
>> overlapping ones.
>>
>> According to the manual, this behaviour can be controlled using "select"
>> parameter. If select="first", the value of the findOverlaps should be a
>> vector of query length, containing the index of the first overlapping
>> interval in the subject. Analogically, if select="last", the function
>> should return the indices of last overlapping interval. Finally, if
>> select="arbitrary", the function should return the indices of "arbitrary"
>> overlapping interval, but it is not explained clearly what "arbitrary"
>> means. I assumed that "arbitrary" option will return an index of an
>> interval picked at random from all overlapping ones, which is the behaviour
>> that I want, however, it does not seem to be the case. From a toy example
>> (see EXAMPLE 1), it seems that select="arbitrary" always returns the same
>> value as select="last". Moreover, in a real world example (see EXAMPLE 2),
>> where query is human genes and subject is reads scanned from a sample bam
>> file (attached), it seems that "arbitrary" option does not work at all.
>>
>> Could you, please, explain what was the supposed behaviour of
>> select="arbitrary" function and why it does not work in a real-world
>> example? Moreover, can you give me advice, on how to achieve the behaviour
>> that I want?
>>
>> Regards,
>> Karolis Uziela
>>
>> *============= EXAMPLE 1 ==============*
>> library(GenomicRanges)
>> > query <- IRanges(c(1, 4, 9), c(5, 7, 10))
>> > subject <- IRanges(c(2, 2, 10), c(2, 3, 12))
>> > findOverlaps(query, subject, select="last")
>> [1]  2 NA  3
>> > findOverlaps(query, subject, select="first")
>> [1]  1 NA  3
>> > findOverlaps(query, subject, select="arbitrary")
>> [1]  2 NA  3
>> > findOverlaps(query, subject, select="arbitrary")
>> [1]  2 NA  3
>> > findOverlaps(query, subject, select="arbitrary")
>> [1]  2 NA  3
>>
>> *============= EXAMPLE 2 ==============*
>> library(GenomicFeatures)
>> library(GenomicRanges)
>> library(Rsamtools)
>> txdb <- makeTranscriptDbFromBiomart(biomart="ensembl",
>> dataset="hsapiens_gene_ensembl")
>> exonRanges <- exonsBy(txdb, "gene")
>> bam_aligns <- readBamGappedAlignments("input1.bam")
>>  counts <- findOverlaps(bam_aligns, exonRanges, select="arbitrary")
>>
>> Error in match.arg(select) : 'arg' should be one of “all”, “first”
>> *============= Session info ==============*
>> > sessionInfo()
>> R version 2.14.1 (2011-12-22)
>> Platform: x86_64-pc-linux-gnu (64-bit)
>>
>>  locale:
>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>  [7] LC_PAPER=C                 LC_NAME=C
>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>> [1] Rsamtools_1.6.3       Biostrings_2.22.0     GenomicFeatures_1.6.9
>> [4] AnnotationDbi_1.16.19 Biobase_2.14.0        GenomicRanges_1.6.7
>> [7] IRanges_1.12.6        BiocInstaller_1.2.1
>>
>> loaded via a namespace (and not attached):
>>  [1] biomaRt_2.10.0     bitops_1.0-5       BSgenome_1.22.0    DBI_0.2-5
>>
>>  [5] RCurl_1.95-3       RSQLite_0.11.2     rtracklayer_1.14.4
>> tools_2.14.1
>>  [9] XML_3.95-0.1       zlibbioc_1.0.1
>>
>>
>> --
>>
>> Karolis Uziela
>>
>> PhD student
>> Science for Life Laboratory
>> Box 1031
>> 17121 Solna, Sweden
>> Mob. +46 729 120 395
>>
>
>


-- 

Karolis Uziela

PhD student
Science for Life Laboratory
Box 1031
17121 Solna, Sweden
Mob. +46 729 120 395

	[[alternative HTML version deleted]]