[BioC] affy hugene 2.1 st
Christian Stratowa [guest]
guest at bioconductor.org
Thu Aug 30 00:03:44 CEST 2012
Dear Dario,
For your purposes you can use package "xps", which I have just tested with the Human Gene 2.0 ST Array Data Set which is available for download from:
http://www.affymetrix.com/support/downloads/demo_data/human2_0.zip
1, However, first you need to download the corresponding Affymetrix library files and annotation files for HuGene-2_1-st. You need these files to create the ROOT scheme file as follows:
### new R session: load library xps
library(xps)
### define directories:
# directory containing Affymetrix library files
libdir <- "/Volumes/GigaDrive/Affy/libraryfiles"
# directory containing Affymetrix annotation files
anndir <- "/Volumes/GigaDrive/Affy/Annotation"
# directory to store ROOT scheme files
scmdir <- "/Volumes/GigaDrive/CRAN/Workspaces/Schemes"
# HuGene-2_1-st:
# use corrected annotation files
scheme.hugene21st.na32 <- import.exon.scheme("hugene21stv1", filedir = file.path(scmdir, "na32"), file.path(libdir, "HuGene-2_1-st", "HuGene-2_1-st.clf"), file.path(libdir, "HuGene-2_1-st", "HuGene-2_1-st.pgf"), file.path(anndir, "HuGene-2_1-st-v1.na32.hg19.probeset.csv", "HuGene-2_1-st-v1.na32.hg19.probeset.corr.csv"), file.path(anndir, "HuGene-2_1-st-v1.na32.hg19.transcript.csv", "HuGene-2_1-st-v1.na32.hg19.transcript.corr.csv"))
Since the Affymetrix annotation files for the new HuGene_2.x arrays have missing AFFX controls, you need first to add these controls. For this purpose I have created a Perl script (shown below) which adds the missing AFFX probesets and creates the corrected annotation files:
- HuGene-2_1-st-v1.na32.hg19.probeset.corr.csv
- HuGene-2_1-st-v1.na32.hg19.transcript.corr.csv
Note: Affymetrix has promised to add the missing AFFX controls in version na33 of the annotation files.
Alternatively, I can send you the finished ROOT scheme file "hugene21stv1.root", however it has a size of 52 MB.
2, After the creation of the ROOT scheme file "hugene21stv1.root" you are ready to import the CEL-files as follows:
### new R session: load library xps
library(xps)
### define directories:
# directory of ROOT scheme files
scmdir <- "/Volumes/GigaDrive/CRAN/Workspaces/Schemes/na32"
# directory to store ROOT raw data files
datdir <- "/Volumes/GigaDrive/CRAN/Workspaces/ROOTData"
# directory containing Tissues CEL files
celdir <- "/Volumes/GigaDrive/ChipData/Exon/HuGene2/human2.0/HuGene2.1_Plate"
### HuGene-2_1-st data: import raw data
# first, import ROOT scheme file
scheme.genome <- root.scheme(file.path(scmdir, "hugene21stv1.root"))
# subset of CEL files to import
celfiles <- c("Liver_HuGene-2_1_GT_Rep1_A03_MC.CEL", "Liver_HuGene-2_1_GT_Rep2_D06_MC.CEL", "Liver_HuGene-2_1_GT_Rep3_F02_MC.CEL", "Spleen_HuGene-2_1_GT_Rep1_A11_MC.CEL", "Spleen_HuGene-2_1_GT_Rep2_C07_MC.CEL", "Spleen_HuGene-2_1_GT_Rep3_F04_MC.CEL")
# rename CEL files
celnames <- c("LiverRep1", "LiverRep2", "LiverRep3", "SpleenRep1", "SpleenRep2", "SpleenRep3")
# import CEL files
data.genome <- import.data(scheme.genome, "HuTissuesGenome21", filedir=datdir, celdir=celdir, celfiles=celfiles, celnames=celnames)
3, Now you are ready to convert the data to expression levels using RMA:
### new R session: load library xps
library(xps)
### first, load ROOT scheme file and ROOT data file
scmdir <- "/Volumes/GigaDrive/CRAN/Workspaces/Schemes/na32"
scheme.genome <- root.scheme(file.path(scmdir, "hugene21stv1.root"))
datdir <- "/Volumes/GigaDrive/CRAN/Workspaces/ROOTData"
data.genome <- root.data(scheme.genome, paste(datdir, "HuTissuesGenome21_cel.root",sep="/"))
### preprocess raw data ###
datdir <- getwd()
# 1. RMA
data.rma <- rma(data.genome, "HuGene21RMAcore", filedir=datdir, tmpdir="", background="antigenomic", normalize=TRUE, exonlevel="core+affx")
# 2. DABG detection call
call.dabg <- dabg.call(data.genome, "HuGene21DABGcore", filedir=datdir, exonlevel="core+affx")
# get data.frames
expr.rma <- validData(data.rma)
pval.dabg <- pvalData(call.dabg)
pres.dabg <- presCall(call.dabg)
# density plots
hist(data.rma)
# boxplots
boxplot(data.rma)
# export expression data
export.expr(data.rma, treename = "*", treetype="mdp", varlist="fUnitName:fName:fSymbol:fLevel", outfile="HuGene21RMAcoreNamesSymbols.txt", sep="\t", as.dataframe=FALSE, verbose=TRUE)
I hope this info is helpful for you; below you find the Perl script.
Best regards,
Christian
_._._._._._._._._._._._._._._._._._
C.h.r.i.s.t.i.a.n S.t.r.a.t.o.w.a
V.i.e.n.n.a A.u.s.t.r.i.a
e.m.a.i.l: cstrato at aon.at
_._._._._._._._._._._._._._._._._._
### BEGIN perlscript "HuGene21_update_AFFX.pl" ###
#!/usr/bin/perl
#
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# Perl script to update AFFX controls of HuGene-2_1-st annotation files
#
# Copyright (c) 2012-2012 Christian Stratowa, Vienna, Austria.
# All rights reserved.
#
# save HuGene-2_1-st pgf-file and annotation files in current directory
# and run:
# > perl HuGene21_update_AFFX.pl
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
use strict;
use warnings;
# get current working dir
use Cwd;
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# intialize constants
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# input file names
my $in_pgf = "/Volumes/GigaDrive/Affy/libraryfiles/HuGene-2_1-st/HuGene-2_1-st.pgf";
my $in_annot_tc = "/Volumes/GigaDrive/Affy/Annotation/HuGene-2_1-st-v1.na32.hg19.transcript.csv/HuGene-2_1-st-v1.na32.hg19.transcript.csv";
my $in_annot_ps = "/Volumes/GigaDrive/Affy/Annotation/HuGene-2_1-st-v1.na32.hg19.probeset.csv/HuGene-2_1-st-v1.na32.hg19.probeset.csv";
# output file names
my $out_affx = "HuGene21.affx.csv";
my $out_annot_tc = "HuGene-2_1-st-v1.na32.hg19.transcript.corr.csv";
my $out_annot_ps = "HuGene-2_1-st-v1.na32.hg19.probeset.corr.csv";
# predefined strings
my $na = "---";
my $beg_assignment_tc = "--- // --- // ";
my $end_assignment_tc = " // --- // --- // --- // --- // --- // ---";
my $beg_assignment_ps = "--- // ";
my $end_assignment_ps = " // --- // --- // --- // ---";
# variables
my @array;
my $idx;
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# read pgf-file and put control->affx into array
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
print("reading pgf-file and storing control->affx in array ... ");
open(INFILE, $in_pgf) or die("Couldn't read $in_pgf: $!");
# fill array with [probeset_id,mrna_assignment,category, line_nr]
$idx = 0;
while (my $line = <INFILE>) {
$idx++;
if ($line =~ /control->affx/) {
chomp($line);
$line =~ s/\r//; # remove optional carriage return character
my @tmp = split(/\t/, $line);
push @array, [@tmp, $idx];
}#if
}#while
push @array, [0, "NA",, "NA", $idx+1];
close(INFILE) or die("Couldn't close $in_pgf: $!");
# replace "line_nr" with "total_probes"
for (my $i=0; $i<$#array; $i++) {
$array[$i][3] = ($array[$i+1][3] - $array[$i][3] - 1)/2;
#very dirty workaround (would need to find number of lines between probeset_ids)
# if ($array[$i][3] > 100) {$array[$i][3] = $array[$i-1][3];}
if ($array[$i][3] > 100) {$array[$i][3] = 20;}
}#for
print("done.\n");
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# write control->affx array to out_affx (for testing purposes only)
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
print("writing control->affx to $out_affx ... ");
open(OUTFILE, ">$out_affx") or die("Couldn't open $out_affx: $!");
for (my $i=0; $i<$#array; $i++) {
my $tmp = join("\",\"", @{$array[$i]});
# print(OUTFILE "\"$tmp\"\n");
print(OUTFILE "\"$tmp\"\r\n");
}#for
close(OUTFILE) or die("Couldn't close $out_affx: $!");
print("done.\n");
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# update control->affx lines of transcript annotation file out_annot_tc
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
print("appending control->affx lines to $out_annot_tc ... ");
open(OUTFILE, ">$out_annot_tc") or die("Couldn't open $out_annot_tc: $!");
open(INFILE, $in_annot_tc) or die("Couldn't read $in_annot_tc: $!");
# delete old control->affx lines
while (<INFILE>) {
if (/control->affx/) {next;}
print(OUTFILE $_);
}#while
# append new control->affx lines
for (my $i=0; $i<$#array; $i++) {
my $afx = join("", $beg_assignment_tc, $array[$i][2], $end_assignment_tc);
my $tmp = join("\",\"", $array[$i][0], $array[$i][0], $na, $na, 0,0, $array[$i][3],$na, $afx, $na, $na, $na, $na, $na, $na, $na, $na, $array[$i][1]);
print(OUTFILE "\"$tmp\"\n");
# print(OUTFILE "\"$tmp\"\r\n");
}#for
close(INFILE) or die("Couldn't close $in_annot_tc: $!");
close(OUTFILE) or die("Couldn't close $out_annot_tc: $!");
print("done.\n");
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# update control->affx lines of probeset annotation file out_annot_ps
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
print("appending control->affx lines to $out_annot_ps ... ");
open(OUTFILE, ">$out_annot_ps") or die("Couldn't open $out_annot_ps: $!");
open(INFILE, $in_annot_ps) or die("Couldn't read $in_annot_ps: $!");
# delete old control->affx lines
while (<INFILE>) {
if (/control->affx/) {next;}
print(OUTFILE $_);
}#while
# append new control->affx lines
for (my $i=0; $i<$#array; $i++) {
my $afx = join("", $beg_assignment_ps, $array[$i][2], $end_assignment_ps);
my $tmp = join("\",\"", $array[$i][0], $na, $na, 0, 0, $array[$i][3], 0, 0, 0, $na, $afx, 0, 0, 0, 0, $na, $na, $na, $na, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, $array[$i][1]);
print(OUTFILE "\"$tmp\"\n");
# print(OUTFILE "\"$tmp\"\r\n");
}#for
close(INFILE) or die("Couldn't close $in_annot_ps: $!");
close(OUTFILE) or die("Couldn't close $out_annot_ps: $!");
print("done.\n");
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
### END perlscript "HuGene21_update_AFFX.pl" ###
On 8/20/12 7:36 AM, Dario Greco wrote:
> dear friends,
> we are starting a project using the new affymetrix human gene 2.1 st chips.
> i would like to know:
> 1) does anyone have yet any experience with them? any opinion/particular note analysing them?
> 2) what is the bioc roadmap for including the cdf/annotation packages for this?
> 3) what is the roadmap for the alternative cdf packages?
>
> thanks you so much for your kind reply.
> cheers
> dario
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
-- output of sessionInfo():
> sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
locale:
[1] C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] xps_1.17.1
loaded via a namespace (and not attached):
[1] tools_2.15.0
>
--
Sent via the guest posting facility at bioconductor.org.
More information about the Bioconductor
mailing list