[R] Assigning genes to CBS segmented output:

Wed Oct 5 01:35:40 CEST 2011

On 10/04/2011 02:44 PM, Angel Russo wrote:
> Hi All,
>
> I have an CBS segmentation algorithm output for 10 tumor samples each from 2
> different tumors.
>
> Now, I am in an urgent need to assign gene (followed by all genes present)
> that belong to a particular segment after I removed all the CNVs from
> segment data. The format of the data is:
>
> Sample  Chromosome      Start   End     Num_Probes      Segment_Mean
> Sample1A-TA  1       51598   76187   15      -1.115

Hi Angel -- In Bioconductor

   http://bioconductor.org

for some model organism create a data frame of known Entrez genes and 
their begin / end locations. Start by installing necessary software and 
data packages

   source('http://bioconductor.org/biocLite.R")
   biocLite(c('org.Hs.eg.db', "GenomicRanges'))

then load the library with annotations about genic coordinates

   library(org.Hs.eg.db)
   anno = merge(toTable(org.Hs.egCHRLOC), toTable(org.Hs.egCHRLOCEND))

leading to

 > head(anno)
     gene_id Chromosome start_location end_location
1     10000          1     -243666483   -244006553
2     10000          1     -243666483   -244006553
3     10000          1     -243651534   -244006553
4     10000          1     -243651534   -244006553
5 100008586          X       49217770     49223847
6 100008586          X       49217770     49332715

For the simple question 'which genes are located on chromosome A 
starting at X and going to Y' you could

   subset(geno, Chromosome=="A" &
                  abs(start_location) > X &
                    abs(end_location) < Y)

This could also be done through the 'biomaRt' package or GenomicFeatures 
/ TxDb packages. To get this for many segments filter 'anno' to remove 
funky genes, e.g., those that have negative length(!)

   idx = with(anno, abs(start_location) > abs(end_location))
   anno = anno[!idx,]

manipulate this to a GRanges object;

   library(GenomicRanges)
   gr = with(anno, GRanges(Chromosome,
                     IRanges(abs(start_location), abs(end_location)),
                     names=gene_id))

convert your CBS result into a GRanges

   seg = with(CBS, GRanges(Chromosome, IRanges(Start, End)))

then find overlaps

   olap = findOverlaps(gr, seg)

the 'gr' is called the 'query', 'seg' is called the 'subject'. 
queryHits(olap) and subjectHits(olap) give equal-length vectors 
describing which queries overlap which subjects. You could group gene 
names by segment with

   split(names(gr)[queryHits(olap)], subjectHits(olap))

An important issue is to use the same genome build for annotations as 
you used for segmentation. Hope that helps / provides some hints for 
getting from A to B.

Martin

>
> Could anyone suggest an R library or code or method that I can quickly use
> to get the genes assigned to CBS output.
>
> Thanks so much,
> Angel
>
> 	[[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793