[R] Assigning genes to CBS segmented output:
Martin Morgan
mtmorgan at fhcrc.org
Wed Oct 5 01:35:40 CEST 2011
On 10/04/2011 02:44 PM, Angel Russo wrote:
> Hi All,
>
> I have an CBS segmentation algorithm output for 10 tumor samples each from 2
> different tumors.
>
> Now, I am in an urgent need to assign gene (followed by all genes present)
> that belong to a particular segment after I removed all the CNVs from
> segment data. The format of the data is:
>
> Sample Chromosome Start End Num_Probes Segment_Mean
> Sample1A-TA 1 51598 76187 15 -1.115
Hi Angel -- In Bioconductor
http://bioconductor.org
for some model organism create a data frame of known Entrez genes and
their begin / end locations. Start by installing necessary software and
data packages
source('http://bioconductor.org/biocLite.R")
biocLite(c('org.Hs.eg.db', "GenomicRanges'))
then load the library with annotations about genic coordinates
library(org.Hs.eg.db)
anno = merge(toTable(org.Hs.egCHRLOC), toTable(org.Hs.egCHRLOCEND))
leading to
> head(anno)
gene_id Chromosome start_location end_location
1 10000 1 -243666483 -244006553
2 10000 1 -243666483 -244006553
3 10000 1 -243651534 -244006553
4 10000 1 -243651534 -244006553
5 100008586 X 49217770 49223847
6 100008586 X 49217770 49332715
For the simple question 'which genes are located on chromosome A
starting at X and going to Y' you could
subset(geno, Chromosome=="A" &
abs(start_location) > X &
abs(end_location) < Y)
This could also be done through the 'biomaRt' package or GenomicFeatures
/ TxDb packages. To get this for many segments filter 'anno' to remove
funky genes, e.g., those that have negative length(!)
idx = with(anno, abs(start_location) > abs(end_location))
anno = anno[!idx,]
manipulate this to a GRanges object;
library(GenomicRanges)
gr = with(anno, GRanges(Chromosome,
IRanges(abs(start_location), abs(end_location)),
names=gene_id))
convert your CBS result into a GRanges
seg = with(CBS, GRanges(Chromosome, IRanges(Start, End)))
then find overlaps
olap = findOverlaps(gr, seg)
the 'gr' is called the 'query', 'seg' is called the 'subject'.
queryHits(olap) and subjectHits(olap) give equal-length vectors
describing which queries overlap which subjects. You could group gene
names by segment with
split(names(gr)[queryHits(olap)], subjectHits(olap))
An important issue is to use the same genome build for annotations as
you used for segmentation. Hope that helps / provides some hints for
getting from A to B.
Martin
>
> Could anyone suggest an R library or code or method that I can quickly use
> to get the genes assigned to CBS output.
>
> Thanks so much,
> Angel
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: M1-B861
Telephone: 206 667-2793
More information about the R-help
mailing list