[BioC] Help with promoter analysis
Alex Gutteridge
alexg at ruggedtextile.com
Mon Feb 27 12:34:17 CET 2012
On 27.02.2012 11:17, Davy wrote:
> Hi all,
> Hoping someone could give me a bit of direction here.
>
> I have a set of genes which are all members of the same pathway.
>
> I want to identify if there are any transcription factor binding sites
> (TFBS) in the "promoters" (so far defined as 5kb upstream of the TSS)
> that are more common to genes among the pathway.
>
> I have managed to get the 5kb upstream using biomaRt (although the
> query throws an intermittent error, moaning about the upstream_flank
> filter, doesn't happen all the time, it's weird!)
>
> I also managed to download all the JASPAR matrices, parse the file for
> only human ones and convert them into position weight matrices.
>
> Lastly, I have produced a table of counts of each human TFBS motif in
> each of my genes using countPWM(pwm, seq, min.score="90%")
>
> This is as far as I have gotten and am simply wondering what do I do
> next. From some reading the hypergeometric distribution is used in this
> situation but I am not sure what metrics to place in as the white balls
> drawn, total white balls, black balls etc., for those of you familiar
> with the hypergeometric distribution.
>
> I read that perhaps I should compare to a background set of genes, some
> sources say all other genes. This seems like overkill.
>
> Any help is appreciated.
> Cheers,
> Davy
Hi Davy,
Your second paragraph is a little vague ('more common' than what?). But
if the question is whether a given motif appears more often in your
pathway genes than one would expect by chance from a random sampling of
genes from the genome, then the hypergeometric test seems appropriate.
The nature of the white/black balls depends a little on how you
initially selected your genes and the precise question you wish to ask,
but essentially it will be:
White balls: All genes in the genome (or other background set) that
contain your motif
Black balls: All genes in the genome (or other background set) that
don't contain your motif
Balls drawn: All genes in your pathway
White balls drawn: Genes in your pathway that contain the motif
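As a rough sketch of how these four quantities could be derived from a table of motif counts: the matrix `bg_counts` and the vector `pathway_genes` below are made-up toy data with hypothetical names (nothing here comes from the original post). They stand in for a genes-by-motifs count table, such as one built from countPWM results over a background gene set, and for the list of pathway genes.

```r
# Toy genes-x-motifs count table for a hypothetical background set of 8 genes.
# In practice each cell would hold a motif hit count for one gene's promoter.
bg_counts <- matrix(
  c(2, 0,
    0, 1,
    1, 0,
    0, 0,
    3, 0,
    0, 0,
    0, 2,
    1, 0),
  ncol = 2, byrow = TRUE,
  dimnames = list(paste0("gene", 1:8), c("motifA", "motifB"))
)

# Hypothetical pathway membership (a subset of the background genes).
pathway_genes <- c("gene1", "gene3", "gene5")

# Assumption: a gene "contains" the motif if it has at least one hit.
has_motif <- bg_counts[, "motifA"] > 0

white_balls       <- sum(has_motif)                 # background genes with the motif
black_balls       <- sum(!has_motif)                # background genes without it
balls_drawn       <- length(pathway_genes)          # genes in the pathway
white_balls_drawn <- sum(has_motif[pathway_genes])  # pathway genes with the motif
```

The "at least one hit" threshold is only one possible definition of "contains the motif"; a stricter cutoff on the count would change all four numbers.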
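So if 1000 genes contain the motif, there are 30,000 genes in the
genome, 20 genes in the pathway and 10 genes in the pathway contain the
motif then the call to phyper would be:
> phyper(10, 1000, 30000-1000, 20, lower.tail=FALSE)
[1] 6.820356e-12
Note that with lower.tail=FALSE, phyper returns P(X > 10), i.e. the
probability of strictly more than 10 motif-containing genes in the
pathway. For the probability of 10 or more, pass 10-1 as the first
argument instead.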
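To run the same kind of test for every motif in a table, something along these lines might work. This is only a sketch: `motif_enrichment` is a hypothetical helper, the tallies for motifB and motifC are invented, and the Benjamini-Hochberg correction via p.adjust is an added assumption, not something from the post. The helper subtracts 1 from the observed count so that the p-value includes the observed value, i.e. P(X >= observed).

```r
# Hypothetical helper: upper-tail hypergeometric p-value, P(X >= white_drawn).
# phyper(q, ..., lower.tail = FALSE) gives P(X > q), hence the "- 1".
motif_enrichment <- function(white_drawn, white, black, drawn) {
  phyper(white_drawn - 1, white, black, drawn, lower.tail = FALSE)
}

n_genes <- 30000   # background size, as in the example above
n_path  <- 20      # genes in the pathway

# Invented per-motif tallies; motifA mirrors the numbers in the example above.
tallies <- data.frame(
  motif        = c("motifA", "motifB", "motifC"),
  bg_with      = c(1000, 5000, 200),   # background genes containing the motif
  pathway_with = c(10, 4, 0)           # pathway genes containing the motif
)

# One p-value per motif (row).
tallies$p <- mapply(
  motif_enrichment,
  white_drawn = tallies$pathway_with,
  white       = tallies$bg_with,
  black       = n_genes - tallies$bg_with,
  drawn       = n_path
)

# Adjust for testing several motifs at once (Benjamini-Hochberg).
tallies$p_adj <- p.adjust(tallies$p, method = "BH")
tallies
```

With many JASPAR motifs tested against the same gene set, some form of multiple-testing adjustment is usually worth considering, though the choice of method depends on the question being asked.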
--
Alex Gutteridge