[BioC] example of working from a database dump
Vincent Carey 525-2265
stvjc at channing.harvard.edu
Wed Sep 24 13:52:19 MEST 2003
On Wed, 24 Sep 2003, Kaushik, Narendra K wrote:
> I have file in this format:
>
> Sequence ID Gene_info Avag_diff1 Avg_diff2..............Avg_diif95
> fffffffff Gene1 45 60 56
> -------- ------ ------ ------- ...............-----
> 9000 Gene9000 34 45 56
>
> Avg_diff1--- Avg_diff95 etc values of gene expression data from N=95 chips.
> This is not Affy hgu95av2 Chip. we have custom made chip oligo synthesize
> in situ and are processed by PM-MM method.
>
> Narendra
>
there are certain conventions that need to be obeyed. suppose
i edit your data snippet as follows
SequenceID Gene_info Avag_diff1 Avg_diff2 Avg_diif95
fffffffff Gene1 45 60 56
9000 Gene9000 34 45 56
and now those three lines are the contents of file "tab".
in R, the command
> ERtab <- read.table("tab", header=TRUE)
assigns the data to the object ERtab, which is a data.frame.
now you can do some statistics:
> apply(ERtab[,3:5],2,mean)
Avag.diff1 Avg.diff2 Avg.diif95
39.5 52.5 56.0
so with a little bit of massaging of a non-standard data snippet
and two R commands i have learned something about the data.
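you will need to learn some R. a few things to keep in mind:
1) no embedded blanks in variable (column) names;
2) embedded underscores in column names are translated to "." (as in the
output above; newer versions of R may keep the underscore);
3) read.table is powerful and will distinguish between numeric and
character data.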
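to make the points above concrete, here is a minimal self-contained sketch. it writes the three-line snippet to a temporary file (the tempfile is my stand-in for the file "tab"), reads it back, and looks at the column names and column types that read.table chose. the names chip_means and col_classes are only for illustration, and the exact spelling of the column names you see will depend on your R version.

```r
# recreate the three-line example in a temporary file (stand-in for "tab")
tab <- tempfile()
writeLines(c("SequenceID Gene_info Avag_diff1 Avg_diff2 Avg_diif95",
             "fffffffff Gene1 45 60 56",
             "9000 Gene9000 34 45 56"), tab)

# header = TRUE tells read.table that the first line holds column names
ERtab <- read.table(tab, header = TRUE)

# column names as R stored them (underscores may show up as "." or "_")
print(names(ERtab))

# read.table guesses a type per column: the two id columns are not numeric,
# the three expression columns are
col_classes <- sapply(ERtab, class)
print(col_classes)

# per-chip means over the expression columns, selected by position
chip_means <- colMeans(ERtab[, 3:5])
print(chip_means)
```

selecting the expression columns by position (3:5) rather than by name avoids depending on how the names were translated.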
you may have trouble reading in all 95 columns unless you have
lots of RAM. once you have the matrix of numbers you can
consider structuring the data in the exprSet class. clearly there
are some interesting a priori distinctions among the 95 chips. these
should be encoded in the phenoData component of the exprSet.
read the Biobase and affy vignettes to learn more about this.
there may also be facilities in limma and the marray* tools
that can help you with your custom chips.
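as a rough illustration of the idea behind the expression matrix plus phenoData pairing, here is a sketch in base R only. it does not use the exprSet class or any Biobase function (see the vignettes for the real interface); it just shows a genes-by-chips matrix alongside a per-chip annotation table whose row names match the matrix column names. the "group" labels A/B/B are hypothetical, invented for the example.

```r
# recreate the three-line example in a temporary file (stand-in for "tab")
tab <- tempfile()
writeLines(c("SequenceID Gene_info Avag_diff1 Avg_diff2 Avg_diif95",
             "fffffffff Gene1 45 60 56",
             "9000 Gene9000 34 45 56"), tab)
ERtab <- read.table(tab, header = TRUE)

# expression values as a genes x chips matrix, rows labelled by gene
expr <- as.matrix(ERtab[, 3:5])
rownames(expr) <- as.character(ERtab[[2]])

# one row of annotation per chip; the group labels are made up for illustration
pheno <- data.frame(group = c("A", "B", "B"), row.names = colnames(expr))

# the chip annotation must line up with the matrix columns
stopifnot(identical(rownames(pheno), colnames(expr)))

# with the annotation in place, per-group summaries are straightforward:
# split the column indices by group and average each gene within a group
group_means <- sapply(split(seq_len(ncol(expr)), pheno$group),
                      function(j) rowMeans(expr[, j, drop = FALSE]))
print(group_means)
```

keeping the chip annotation in its own table, keyed by chip name, is what lets you later ask questions like "which genes differ between groups" without re-deriving which column belongs to which condition.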
suppose you simply don't have enough RAM to read in the data.
you will have to divide and conquer in an appropriate way.
it may be that you can cut the data up into chunks of genes,
with all 95 chips for each gene in the chunk. or you can
cut it up into chunks of chips, with all 9000 genes for each chip
in the chunk. you need to think about filtering
to make this manageable if your computing resources are
insufficient to deal with the whole dataset. R has all the
tools you need to do this -- you can work with scan, e.g.,
to get subsets of records in the file. or you may have operating
system facilities that help with file decomposition.
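the two ways of cutting up the file can be sketched as follows, again on the three-line example written to a temporary file. for chunks of genes, scan is given skip and nlines so that only a block of records is read. for chunks of chips, read.table is given a colClasses vector in which "NULL" marks columns to be skipped. the helper read_gene_chunk and its arguments are names i made up for the sketch, and the what= template assumes two id columns followed by three numeric columns; for the real file it would need 95 numeric entries.

```r
# recreate the three-line example in a temporary file (stand-in for "tab")
tab <- tempfile()
writeLines(c("SequenceID Gene_info Avag_diff1 Avg_diff2 Avg_diif95",
             "fffffffff Gene1 45 60 56",
             "9000 Gene9000 34 45 56"), tab)

# chunk of genes: skip the header line plus `offset` records, then read `n` records
# (hypothetical helper; `what` describes the type of each field in a record)
read_gene_chunk <- function(file, offset, n) {
  rec <- scan(file, what = list("", "", 0, 0, 0),
              skip = 1 + offset, nlines = n, quiet = TRUE)
  m <- do.call(cbind, rec[3:5])   # numeric fields become matrix columns
  rownames(m) <- rec[[2]]         # label rows by the gene field
  m
}
second_gene <- read_gene_chunk(tab, offset = 1, n = 1)
print(second_gene)

# chunk of chips: keep the two id columns and the first chip only;
# a colClasses entry of "NULL" makes read.table skip that column
first_chip <- read.table(tab, header = TRUE,
                         colClasses = c("character", "character",
                                        "numeric", "NULL", "NULL"))
print(first_chip)
```

either way only part of the file is held in memory at once, so summaries can be computed chunk by chunk and combined afterwards.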