[BioC] Question about using Biostrings & BSgenome

J.delasHeras at ed.ac.uk J.delasHeras at ed.ac.uk
Sat Sep 20 15:15:47 CEST 2008


Hi Joern,

that was useful, thank you! I have some new homework to do now. :-)

As for what I'm after exactly... it'll be various things at various  
times, but I can give you one very specific example right now.
I have a human promoter array in my hands (and soon a mouse one). Each
probeset covers a region of around 2.2kb upstream and 0.5kb downstream
of the TSS.
Now... in reality, some genes have multiple TSSs... sometimes they are  
close, sometimes far apart. Also, a probeset may be longer than the
expected 2.7kb, for instance if you have two genes going in different
directions starting in a short region. I want to dissect all this out.
I want to find all the genes, all the TSSs, and create "my own"  
probesets (from the probes available to me in the array) based on  
these TSSs and covering regions that I define myself (I may choose to
create probesets just +/-400bp around the TSS, and others perhaps
covering the 1kb region located -1000 to -2000bp from the TSS) etc.
And later on I may have another requirement, depending on my findings  
and whatever I may be looking for.
So I need to locate the TSSs. Then I have to decide for each gene with  
multiple TSSs, which ones are just too close to make any significant  
difference to my results so that I can treat them as one, and which  
ones are further apart so that I treat them as distinct (different  
promoter regions for a single gene). I would do that based on the TSS  
locations (and orientation), so it seems simple enough. Then with  
those locations, I can search the array annotation and figure out  
which ones are located within the subareas I want. I can do that based  
on positions alone, but I'd like to have the actual sequences (not  
just the probes, but the whole region) because in some cases I am  
looking for particular motifs, and even something simple like  
restriction sites...
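
The two steps described above (merging TSSs that lie close together, then defining regions relative to each remaining TSS) could be sketched in base R roughly as follows. This is only an illustration under stated assumptions: the `tss` table, the function names `clusterTSS` and `tssWindow`, and the 500bp merging threshold are all hypothetical, and each cluster is arbitrarily represented by its most upstream TSS.

```r
# Hypothetical TSS table: gene names and coordinates are made up.
tss <- data.frame(
  gene   = c("geneA", "geneA", "geneA", "geneB"),
  chr    = c("chr1", "chr1", "chr1", "chr1"),
  strand = c("+", "+", "+", "-"),
  pos    = c(10000, 10150, 15000, 52000),
  stringsAsFactors = FALSE
)

# Merge TSSs of the same gene/chromosome/strand that lie within 'maxGap' bp
# of the previous TSS (assumed threshold). Each cluster is represented by
# its most upstream TSS. Expects at least one row.
clusterTSS <- function(tss, maxGap = 500) {
  tss <- tss[order(tss$gene, tss$chr, tss$strand, tss$pos), ]
  key <- paste(tss$gene, tss$chr, tss$strand)
  # a new cluster starts when the key changes or the gap is too large
  newGroup <- c(TRUE, key[-1] != key[-length(key)] | diff(tss$pos) > maxGap)
  tss$cluster <- cumsum(newGroup)
  reps <- lapply(split(tss, tss$cluster), function(d) {
    out <- d[1, c("gene", "chr", "strand")]
    # "most upstream" means lowest coordinate on "+", highest on "-"
    out$pos  <- if (d$strand[1] == "+") min(d$pos) else max(d$pos)
    out$nTSS <- nrow(d)
    out
  })
  do.call(rbind, reps)
}

# Genomic window from 'from' to 'to' relative to the TSS
# (negative = upstream), taking the strand into account.
tssWindow <- function(pos, strand, from, to) {
  plus <- strand == "+"
  data.frame(start = ifelse(plus, pos + from, pos - to),
             end   = ifelse(plus, pos + to,   pos - from))
}

clusters <- clusterTSS(tss, maxGap = 500)
# +/-400bp around each representative TSS
win400 <- tssWindow(clusters$pos, clusters$strand, from = -400, to = 400)
# the 1kb region located -1000 to -2000bp from each TSS
winUp  <- tssWindow(clusters$pos, clusters$strand, from = -2000, to = -1000)
```

The resulting start/end coordinates could then be passed to something like getSeq() to retrieve the actual sequence of each region.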

For promoter arrays this won't apply, but I also have tiling arrays  
for a couple of human chromosomes, and in this case I'll find it  
interesting to separate probesets from exons, introns... I want to  
sometimes consider a region of x bp around the 5' end of the  
transcript and another around the 3'...
I already have some annotation provided, but I think it's probably
easier to look it up myself (from the probe locations & their given
sequence) and that way create the annotation I find useful for my
purposes, rather than adapting whatever was given to me. Especially as it
seems (on paper) a relatively simple procedure that can now be done
entirely from R.
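
As a rough idea of what building such an annotation from probe locations might look like, here is a minimal base R sketch. Everything in it is an assumption for illustration: the `regions` and `probes` tables, the 50bp probe length, the function name `assignProbes`, and the rule that a probe must lie entirely within a region to be assigned to it.

```r
# Hypothetical regions and probes; names and coordinates are made up.
regions <- data.frame(
  name  = c("geneA_exon1", "geneA_intron1", "geneA_exon2"),
  chr   = "chr21",
  start = c(1000, 1201, 3001),
  end   = c(1200, 3000, 3300),
  stringsAsFactors = FALSE
)
probes <- data.frame(
  id    = c("p1", "p2", "p3", "p4"),
  chr   = "chr21",
  start = c(1050, 1180, 2000, 5000),
  stringsAsFactors = FALSE
)
probeLen <- 50  # assumed probe length in bp

# For each probe, return the name of the first region that fully
# contains it, or NA if no region does (e.g. the probe spans a boundary).
assignProbes <- function(probes, regions, probeLen) {
  sapply(seq_len(nrow(probes)), function(i) {
    hit <- which(regions$chr == probes$chr[i] &
                 regions$start <= probes$start[i] &
                 regions$end >= probes$start[i] + probeLen - 1)
    if (length(hit) > 0) regions$name[hit[1]] else NA_character_
  })
}

probes$region <- assignProbes(probes, regions, probeLen)
```

Here p2 stays unassigned because it straddles the exon/intron boundary; whether such probes should be dropped or assigned to one side is a design decision that depends on the analysis.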

I will come up with more detailed questions probably once I start  
applying these tools to my problems.

Jose

Quoting Joern Toedling <toedling at ebi.ac.uk>:

> Hello,
>
> Biostrings and BSgenome can certainly be used to retrieve genomic
> sequences. For instance, here's a very basic function I have used many
> times to retrieve the sequence of short genome segments on either strand
> of budding yeast.
>
> getYeastSeq <- function(chr, start, end, strand="+"){
>   # expects a single segment: one chromosome number, one start, one end
>   stopifnot(length(chr)==1, length(start)==1, length(end)==1)
>   require("BSgenome.Scerevisiae.UCSC.sacCer1")
>   strand <- match.arg(strand, c("+","-"))
>   # build the UCSC sequence name; chromosome 17 is mapped to "chrM"
>   seqName <- gsub("17", "M", paste("chr", chr, sep=""))
>   thisSeq <- gsub("[[:space:]]", "",
>     as.character(getSeq(Scerevisiae, seqName, start=start, end=end)))
>   # for the minus strand, return the reverse complement
>   if (strand=="-")
>     thisSeq <- as.character(reverseComplement(DNAString(thisSeq)))
>   return(thisSeq)
> }#getYeastSeq
>
> getYeastSeq(chr=2, start=200000, end=200020) ## test
>
> Biostrings offers many utility functions to work with DNA sequences. And
> you can always convert the sequences into character vectors and use
> basic R operations on those. I am not sure what other games you have in
> mind when you say "play", but I guess a more precise question about
> whether you can do XYZ with Biostrings or any other Bioconductor package
> will result in a more informative answer.
>
> Regards,
> Joern
>
>
> J.delasHeras at ed.ac.uk wrote:
>>
>> I haven't yet used either of these packages, but it looks like
>> something I may want to look at.
>>
>> I was wondering if I can use these packages together with something
>> like 'BSgenome.Hsapiens.UCSC.hg18' to extract sequences around every
>> TSS, for instance.
>> I have a couple of different oligo array designs, both in human and
>> mouse, and I would like to subset probes according to a number of
>> criteria, such as "promoter", "intergenic", etc...
>> I'm not yet familiar with these packages but I suspect they will
>> provide all the tools I need to extract and "play" with genomic
>> sequences.
>>
>> Am I right?
>>
>> Does anybody have some examples to help me get a better overview, beyond
>> those in the vignettes?
>>
>> Thanks.
>>
>> Jose
>>
>
> --
> Joern Toedling
> EMBL - European Bioinformatics Institute
> Wellcome Trust Genome Campus
> Hinxton, Cambridge CB10 1SD
> United Kingdom
> Phone  +44(0)1223 492566
> Email  toedling at ebi.ac.uk
>
>
>



-- 
Dr. Jose I. de las Heras                      Email: J.delasHeras at ed.ac.uk
The Wellcome Trust Centre for Cell Biology    Phone: +44 (0)131 6513374
Institute for Cell & Molecular Biology        Fax:   +44 (0)131 6507360
Swann Building, Mayfield Road
University of Edinburgh
Edinburgh EH9 3JR
UK

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.


