[BioC] affy hugene 2.1 st

Thu Aug 30 00:03:44 CEST 2012

Dear Dario,

For your purposes you can use package "xps", which I have just tested with the Human Gene 2.0 ST Array Data Set which is available for download from:
http://www.affymetrix.com/support/downloads/demo_data/human2_0.zip

1, However, first you need to download the corresponding Affymetrix library files and annotation files for HuGene-2_1-st. You need these files to create the ROOT scheme file as follows:

### new R session: load library xps
library(xps)

### define directories:
# directory containing Affymetrix library files
libdir <- "/Volumes/GigaDrive/Affy/libraryfiles"
# directory containing Affymetrix annotation files
anndir <- "/Volumes/GigaDrive/Affy/Annotation"
# directory to store ROOT scheme files
scmdir <- "/Volumes/GigaDrive/CRAN/Workspaces/Schemes"

# HuGene-2_1-st:
# use corrected annotation files
scheme.hugene21st.na32 <- import.exon.scheme("hugene21stv1", filedir = file.path(scmdir, "na32"), file.path(libdir, "HuGene-2_1-st", "HuGene-2_1-st.clf"), file.path(libdir, "HuGene-2_1-st", "HuGene-2_1-st.pgf"), file.path(anndir, "HuGene-2_1-st-v1.na32.hg19.probeset.csv", "HuGene-2_1-st-v1.na32.hg19.probeset.corr.csv"), file.path(anndir, "HuGene-2_1-st-v1.na32.hg19.transcript.csv", "HuGene-2_1-st-v1.na32.hg19.transcript.corr.csv"))

Since the Affymetrix annotation files for the new HuGene_2.x arrays have missing AFFX controls, you need first to add these controls. For this purpose I have created a Perl script (shown below) which adds the missing AFFX probesets and creates the corrected annotation files:
- HuGene-2_1-st-v1.na32.hg19.probeset.corr.csv
- HuGene-2_1-st-v1.na32.hg19.transcript.corr.csv
Note: Affymetrix has promised to add the missing AFFX controls in version na33 of the annotation files.

Alternatively, I can send you the finished ROOT scheme file "hugene21stv1.root", however it has a size of 52 MB.

2, After the creation of the ROOT scheme file "hugene21stv1.root" you are ready to import the CEL-files as follows:

### new R session: load library xps
library(xps)

### define directories:
# directory of ROOT scheme files
scmdir <- "/Volumes/GigaDrive/CRAN/Workspaces/Schemes/na32"
# directory to store ROOT raw data files
datdir <- "/Volumes/GigaDrive/CRAN/Workspaces/ROOTData"
# directory containing Tissues CEL files
celdir <- "/Volumes/GigaDrive/ChipData/Exon/HuGene2/human2.0/HuGene2.1_Plate"

### HuGene-2_1-st data: import raw data
# first, import ROOT scheme file
scheme.genome <- root.scheme(file.path(scmdir, "hugene21stv1.root"))

# subset of CEL files to import
celfiles <- c("Liver_HuGene-2_1_GT_Rep1_A03_MC.CEL", "Liver_HuGene-2_1_GT_Rep2_D06_MC.CEL", "Liver_HuGene-2_1_GT_Rep3_F02_MC.CEL", "Spleen_HuGene-2_1_GT_Rep1_A11_MC.CEL", "Spleen_HuGene-2_1_GT_Rep2_C07_MC.CEL", "Spleen_HuGene-2_1_GT_Rep3_F04_MC.CEL")
# rename CEL files
celnames <- c("LiverRep1", "LiverRep2", "LiverRep3", "SpleenRep1", "SpleenRep2", "SpleenRep3")
# import CEL files
data.genome <- import.data(scheme.genome, "HuTissuesGenome21", filedir=datdir, celdir=celdir, celfiles=celfiles, celnames=celnames)

3, Now you are ready to convert the data to expression levels using RMA:

### new R session: load library xps
library(xps)

### first, load ROOT scheme file and ROOT data file
scmdir <- "/Volumes/GigaDrive/CRAN/Workspaces/Schemes/na32"
scheme.genome <- root.scheme(file.path(scmdir, "hugene21stv1.root"))
datdir <- "/Volumes/GigaDrive/CRAN/Workspaces/ROOTData"
data.genome <- root.data(scheme.genome, paste(datdir, "HuTissuesGenome21_cel.root",sep="/"))

### preprocess raw data ###
datdir <- getwd()

# 1. RMA
data.rma <- rma(data.genome, "HuGene21RMAcore", filedir=datdir, tmpdir="", background="antigenomic", normalize=TRUE, exonlevel="core+affx")

# 2. DABG detection call
call.dabg <- dabg.call(data.genome, "HuGene21DABGcore", filedir=datdir, exonlevel="core+affx")

# get data.frames
expr.rma <- validData(data.rma)
pval.dabg <- pvalData(call.dabg)
pres.dabg <- presCall(call.dabg)

# density plots
hist(data.rma)

# boxplots
boxplot(data.rma)

# export expression data
export.expr(data.rma, treename = "*", treetype="mdp", varlist="fUnitName:fName:fSymbol:fLevel", outfile="HuGene21RMAcoreNamesSymbols.txt", sep="\t", as.dataframe=FALSE, verbose=TRUE)

I hope this info is helpful for you; below you find the Perl script.

Best regards,
Christian
_._._._._._._._._._._._._._._._._._
C.h.r.i.s.t.i.a.n   S.t.r.a.t.o.w.a
V.i.e.n.n.a           A.u.s.t.r.i.a
e.m.a.i.l:        cstrato at aon.at
_._._._._._._._._._._._._._._._._._

### BEGIN perlscript "HuGene21_update_AFFX.pl" ###

#!/usr/bin/perl
#
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# Perl script to update AFFX controls of HuGene-2_1-st annotation files
#
# Copyright (c) 2012-2012 Christian Stratowa, Vienna, Austria.
# All rights reserved.
#
# save HuGene-2_1-st pgf-file and annotation files in current directory
# and run:
# > perl HuGene21_update_AFFX.pl
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

use strict;
use warnings;
# get current working dir
use Cwd;

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# intialize constants
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

# input file names
my $in_pgf      = "/Volumes/GigaDrive/Affy/libraryfiles/HuGene-2_1-st/HuGene-2_1-st.pgf";
my $in_annot_tc = "/Volumes/GigaDrive/Affy/Annotation/HuGene-2_1-st-v1.na32.hg19.transcript.csv/HuGene-2_1-st-v1.na32.hg19.transcript.csv";
my $in_annot_ps = "/Volumes/GigaDrive/Affy/Annotation/HuGene-2_1-st-v1.na32.hg19.probeset.csv/HuGene-2_1-st-v1.na32.hg19.probeset.csv";

# output file names
my $out_affx     = "HuGene21.affx.csv";
my $out_annot_tc = "HuGene-2_1-st-v1.na32.hg19.transcript.corr.csv";
my $out_annot_ps = "HuGene-2_1-st-v1.na32.hg19.probeset.corr.csv";

# predefined strings
my $na                = "---";
my $beg_assignment_tc = "--- // --- // ";
my $end_assignment_tc = " // --- // --- // --- // --- // --- // ---";
my $beg_assignment_ps = "--- // ";
my $end_assignment_ps = " // --- // --- // --- // ---";

# variables
my @array;

my $idx;

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# read pgf-file and put control->affx into array
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

print("reading pgf-file and storing control->affx in array ... ");

open(INFILE, $in_pgf) or die("Couldn't read $in_pgf: $!");

# fill array with [probeset_id,mrna_assignment,category, line_nr]
$idx = 0;
while (my $line = <INFILE>) {
   $idx++;
   if ($line =~ /control->affx/) {
      chomp($line);
      $line =~ s/\r//;  # remove optional carriage return character
      my @tmp = split(/\t/, $line);
      push @array, [@tmp, $idx];
   }#if
}#while
push @array, [0, "NA",, "NA", $idx+1];

close(INFILE)  or die("Couldn't close $in_pgf: $!");

# replace "line_nr" with "total_probes"
for (my $i=0; $i<$#array; $i++) {
   $array[$i][3] = ($array[$i+1][3] - $array[$i][3] - 1)/2;
   #very dirty workaround (would need to find number of lines between probeset_ids)
#   if ($array[$i][3] > 100) {$array[$i][3] = $array[$i-1][3];}
   if ($array[$i][3] > 100) {$array[$i][3] = 20;}
}#for

print("done.\n");

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# write control->affx array to out_affx (for testing purposes only)
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

print("writing control->affx to $out_affx ... ");

open(OUTFILE, ">$out_affx") or die("Couldn't open $out_affx: $!");

for (my $i=0; $i<$#array; $i++) {
   my $tmp = join("\",\"", @{$array[$i]});
#   print(OUTFILE "\"$tmp\"\n");
   print(OUTFILE "\"$tmp\"\r\n");
}#for

close(OUTFILE) or die("Couldn't close $out_affx: $!");

print("done.\n");

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# update control->affx lines of transcript annotation file out_annot_tc
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

print("appending control->affx lines to $out_annot_tc ... ");

open(OUTFILE, ">$out_annot_tc") or die("Couldn't open $out_annot_tc: $!");
open(INFILE, $in_annot_tc)    or die("Couldn't read $in_annot_tc: $!");

# delete old control->affx lines
while (<INFILE>) {
   if (/control->affx/) {next;}
   print(OUTFILE $_);
}#while

# append new control->affx lines
for (my $i=0; $i<$#array; $i++) {
   my $afx = join("", $beg_assignment_tc, $array[$i][2], $end_assignment_tc);
   my $tmp = join("\",\"", $array[$i][0], $array[$i][0], $na, $na, 0,0, $array[$i][3],$na, $afx, $na, $na, $na, $na, $na, $na, $na, $na, $array[$i][1]);
   print(OUTFILE "\"$tmp\"\n");
#   print(OUTFILE "\"$tmp\"\r\n");
}#for

close(INFILE)  or die("Couldn't close $in_annot_tc: $!");
close(OUTFILE) or die("Couldn't close $out_annot_tc: $!");

print("done.\n");

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# update control->affx lines of probeset annotation file out_annot_ps
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

print("appending control->affx lines to $out_annot_ps ... ");

open(OUTFILE, ">$out_annot_ps") or die("Couldn't open $out_annot_ps: $!");
open(INFILE, $in_annot_ps)      or die("Couldn't read $in_annot_ps: $!");

# delete old control->affx lines
while (<INFILE>) {
   if (/control->affx/) {next;}
   print(OUTFILE $_);
}#while

# append new control->affx lines
for (my $i=0; $i<$#array; $i++) {
   my $afx = join("", $beg_assignment_ps, $array[$i][2], $end_assignment_ps);
   my $tmp = join("\",\"", $array[$i][0], $na, $na, 0, 0, $array[$i][3], 0, 0, 0, $na, $afx, 0, 0, 0, 0, $na, $na, $na, $na, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  $array[$i][1]);
   print(OUTFILE "\"$tmp\"\n");
#   print(OUTFILE "\"$tmp\"\r\n");
}#for

close(INFILE)  or die("Couldn't close $in_annot_ps: $!");
close(OUTFILE) or die("Couldn't close $out_annot_ps: $!");

print("done.\n");

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

### END perlscript "HuGene21_update_AFFX.pl" ###

On 8/20/12 7:36 AM, Dario Greco wrote:
> dear friends,
> we are starting a project using the new affymetrix human gene 2.1 st chips.
> i would like to know:
> 1) does anyone have yet any experience with them? any opinion/particular note analysing them?
> 2) what is the bioc roadmap for including the cdf/annotation packages for this?
> 3) what is the roadmap for the alternative cdf packages?
>
> thanks you so much for your kind reply.
> cheers
> dario
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

 -- output of sessionInfo(): 

> sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] xps_1.17.1

loaded via a namespace (and not attached):
[1] tools_2.15.0
>

--
Sent via the guest posting facility at bioconductor.org.