[BioC] Determining an overlapping annotation data subset(overlap/overlaps)

Mon Aug 6 18:13:47 CEST 2007

Hi Stephen,

Don't know if it does what you want (and it isn't a one-liner), but here
it is anyway:

> a<-data.frame(id=1:4, start=seq(10, 40, 10), end=seq(15, 45, 10))
> b<-data.frame(id=5:8, start=c(11,24,44,55), end=c(14,26,45,57))
> a  # large sequence features
  id start end
1  1    10  15
2  2    20  25
3  3    30  35
4  4    40  45
> b  # smaller sequence features
  id start end
1  5    11  14
2  6    24  26
3  7    44  45
4  8    55  57
> bool.matrix<-NULL
> # row i of bool.matrix: is feature b[i,] entirely inside each interval of a?
> for(i in 1:nrow(b)) {
+   bool.matrix<-rbind(bool.matrix, b$start[i] >= a$start & b$end[i] <= a$end)
+ }
> colnames(bool.matrix)<-a$id
> rownames(bool.matrix)<-b$id
> bool.matrix
      1     2     3     4
5  TRUE FALSE FALSE FALSE
6 FALSE FALSE FALSE FALSE
7 FALSE FALSE FALSE  TRUE
8 FALSE FALSE FALSE FALSE
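
The matrix above is only an intermediate result: the question asks for the rows of a that contain at least one feature from b. As a minimal sketch (not part of the original reply), the same containment test can be built without a loop using outer(), and the columns with at least one TRUE then select the rows of a. The names `contained` and `hits` are made up for illustration.

```r
# Same toy data as in the transcript above
a <- data.frame(id = 1:4, start = seq(10, 40, 10), end = seq(15, 45, 10))
b <- data.frame(id = 5:8, start = c(11, 24, 44, 55), end = c(14, 26, 45, 57))

# contained[i, j] is TRUE when feature b[i, ] lies entirely within a[j, ]
contained <- outer(b$start, a$start, ">=") & outer(b$end, a$end, "<=")
dimnames(contained) <- list(as.character(b$id), as.character(a$id))

# Rows of a that contain at least one feature from b
hits <- a[colSums(contained) > 0, ]
hits
#   id start end
# 1  1    10  15
# 4  4    40  45
```

This builds an nrow(b) by nrow(a) matrix, so it is only reasonable for modest table sizes; for genome-scale data a dedicated interval package would probably be a better fit.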

Cheers,
Alex
------------------------------------
Alex Lam
Roslin Institute (Edinburgh)
Roslin
Midlothian EH25 9PS
Great Britain

Phone +44 131 5274471
Web   http://www.roslin.ac.uk

-----Original Message-----
From: bioconductor-bounces at stat.math.ethz.ch
[mailto:bioconductor-bounces at stat.math.ethz.ch] On Behalf Of Stephen
Montgomery
Sent: 06 August 2007 13:52
To: bioconductor at stat.math.ethz.ch
Subject: [BioC] Determining an overlapping annotation data
subset(overlap/overlaps)

Hello Bioconductor -

Apologies, as this is a fairly rookie bioinformatics-based R question, but
I am trying to determine whether there is an R one-liner to extract the
subset of a data frame whose intervals contain annotation stored in
another data frame. (For example, extracting genomic intervals which
contain certain features/annotation.)

Such that:
If I have dataframe "A" possessing an "id", "start", and "end"; And
dataframe "B" also possessing an "id", "start", and "end"; The output is
all the rows of A which contain an entry of B (B$start, B$end) within
A$start and A$end.

I have tried my own fairly uninformed variants like this, to no avail:

A[length(B[B$start <= A$end & B$end >= A$start]) > 0,]

I fear the solution will be trivial but as yet it has eluded me. :/

Thanks for any help!  (Theoretically, I can also see doing this in its
own function by creating a vector of counts for each member of "A" and
then reporting those that are non-zero but I was wondering if there was
a more succinct and likely efficient way)
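
The counting idea described above could be sketched roughly as follows. The helper name `count_hits` and its `contained_only` argument are hypothetical, and the toy data are borrowed from the reply at the top of this thread. Note that the prose asks for features of B lying within A$start and A$end, whereas the attempted one-liner tests for any overlap, so the sketch offers both conditions.

```r
# Toy data (same values as in the reply)
A <- data.frame(id = 1:4, start = seq(10, 40, 10), end = seq(15, 45, 10))
B <- data.frame(id = 5:8, start = c(11, 24, 44, 55), end = c(14, 26, 45, 57))

# Hypothetical helper: for each row of A, count the matching rows of B.
# contained_only = TRUE  : B feature must lie entirely inside the A interval
# contained_only = FALSE : any overlap between the two intervals counts
count_hits <- function(A, B, contained_only = TRUE) {
  vapply(seq_len(nrow(A)), function(j) {
    if (contained_only) {
      sum(B$start >= A$start[j] & B$end <= A$end[j])
    } else {
      sum(B$start <= A$end[j] & B$end >= A$start[j])
    }
  }, integer(1))
}

# Report the rows of A with a non-zero count
A[count_hits(A, B) > 0, ]
```

With these data the containment test keeps ids 1 and 4, while the overlap test (contained_only = FALSE) also keeps id 2, because feature 6 (24 to 26) runs past the end of interval 2 (20 to 25).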

Thanks again,
Stephen

Stephen Montgomery, B.A.Sc., Ph.D.
Postdoctoral Researcher, Team 16
Wellcome Trust Sanger Institute
Hinxton, Cambridge CB10 1SA
Phone: 44-1223-834244 (ext 7297)
Skype: stephen.b.montgomery
