[Bioc-sig-seq] edgeR datasets

Davis McCarthy davismcc.lists at gmail.com
Fri Aug 12 09:44:53 CEST 2011

Hi Jerome

I decided not to put these datasets up on my personal website as the
data are already distributed with the package. The files I used for
the analysis in the edgeR Users Guide were originally from GEO, but a
search today of those sample IDs returned no results (which is a
concern, but I'm no expert on GEO).

In any case, the data are distributed with the edgeR package, so the
only trick is manipulating the data into a form suitable for following
the analysis in the Users Guide. Example code for how to do this from
a fresh R session is shown below. I actually just hacked some of the
code from readDGE() to do this.

This should work generally, but I must point out that you have not
provided the output of sessionInfo() or told us what versions of
R/Bioconductor/edgeR you are using, so any advice that anyone
(including me) provides cannot be as precise as it might be if we had
all of the information. For instance, I don't know where any of those
.txt files are on your system, so can't really diagnose why readDGE
didn't work for you in this circumstance. readDGE() relies on in-built
functions like read.delim(), so it might be worth boning up on how
such R functions work to help you trouble-shoot such problems with
importing data.
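As a rough sketch of the kind of checks I mean (the file contents and
the temporary file below are made up for illustration, so substitute
your own paths), you can print your version information, confirm that R
can see the file, and read it by hand with read.delim() to see how
arguments such as skip change the result. This only illustrates how the
arguments behave; it is not a diagnosis of the error you reported.

```r
# Version information worth including in any post to the list
print(sessionInfo())

# Hypothetical two-column tag file written to a temporary directory,
# standing in for a file such as NC1.txt (tags and counts are made up)
tag_file <- file.path(tempdir(), "example_tags.txt")
toy <- data.frame(
    Tag_Sequence = c("AAAAAAAAAA", "AAAAAAAAAC", "AAAAAAAAAG",
                     "AAAAAAAAAT", "AAAAAAAACA", "AAAAAAAACC"),
    Count = c(17, 3, 5, 1, 2, 8),
    stringsAsFactors = FALSE
)
write.table(toy, tag_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Check where R is looking and whether the file is visible from there
print(getwd())
print(file.exists(tag_file))

# Read the file with default settings and inspect what came back
tab <- read.delim(tag_file, stringsAsFactors = FALSE)
str(tab)

# Read it again while skipping the first five lines: the header and
# four data rows are dropped, and the next row is taken as the header
tab_skip <- read.delim(tag_file, skip = 5, comment.char = "!",
                       stringsAsFactors = FALSE)
str(tab_skip)
```

Comparing str(tab) with str(tab_skip) shows why a skip value chosen for
one file layout may not suit a file with a different layout.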

Best wishes


Example code
library(edgeR)
# Load the SAGE datasets distributed with the edgeR package
data(NC1, NC2, Tu102, Tu98)
x <- list()
sets <- c("NC1","NC2","Tu102","Tu98")
# Sample information: one row per library, with its group
x$samples <- data.frame(files=as.character(sets),stringsAsFactors=FALSE)
x$samples$group <- factor(rep(c("Normal","Tumour"),each=2))
d <- taglist <- list()
d[[1]] <- NC1
d[[2]] <- NC2
d[[3]] <- Tu102
d[[4]] <- Tu98
# Collect the tag sequences of each library and check for duplicates
for(i in 1:4) {
    taglist[[i]] <- as.character(d[[i]][,1])
    if(any(duplicated(taglist[[i]]))) {
        stop(paste("Repeated tag sequences in",sets[i]))
    }
}
# Build a count matrix over the union of all tags (0 where a tag is absent)
tags <- unique(unlist(taglist))
ntags <- length(tags)
nfiles <- length(sets)
x$counts <- matrix(0,ntags,nfiles)
rownames(x$counts) <- tags
colnames(x$counts) <- sets
for (i in 1:nfiles) {
    aa <- match(taglist[[i]],tags)
    x$counts[aa,i] <- d[[i]][,2]
}
# Library sizes and default normalization factors
x$samples$lib.size <- colSums(x$counts)
x$samples$norm.factors <- 1
row.names(x$samples) <- colnames(x$counts)
x$genes <- NULL
# Create the DGEList object and compute normalization factors
d <- new("DGEList",x)
d <- calcNormFactors(d)
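
The core of the code above is the step that merges per-library tag
tables into a single count matrix over the union of all tags. A minimal
base-R sketch of that step alone, using two made-up libraries (lib_a and
lib_b and their contents are hypothetical, and edgeR is not needed to
run it), might look like this:

```r
# Two made-up libraries in the same two-column layout (tag, count)
lib_a <- data.frame(Tag_Sequence = c("AAAAAAAAAA", "CCCCCCCCCC"),
                    Count = c(17, 4), stringsAsFactors = FALSE)
lib_b <- data.frame(Tag_Sequence = c("CCCCCCCCCC", "GGGGGGGGGG"),
                    Count = c(2, 9), stringsAsFactors = FALSE)
libs <- list(A = lib_a, B = lib_b)

# Union of all tags seen in any library
tags <- unique(unlist(lapply(libs, function(l) l$Tag_Sequence),
                      use.names = FALSE))

# Start from a matrix of zeros so that tags absent from a library count as 0
counts <- matrix(0, length(tags), length(libs),
                 dimnames = list(tags, names(libs)))

# Fill each column by matching that library's tags to the rows
for (nm in names(libs)) {
    rows <- match(libs[[nm]]$Tag_Sequence, tags)
    counts[rows, nm] <- libs[[nm]]$Count
}
print(counts)

# Library sizes are the column totals
lib.size <- colSums(counts)
print(lib.size)
```

A tag that appears in only one library gets a zero in the other column,
which is what the match() call in the loop takes care of.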

On 12 August 2011 04:17, Jérôme Laroche <jerome.laroche at ibis.ulaval.ca> wrote:
> Hi,
> I try to replicate the analysis "Case study of SAGE data" presented on page 9 of edgeR document. I wonder if the mentioned datasets of Zhang et al. 1997 are available somewhere? The datasets are: GSM728.txt, GSM729.txt, GSM755.txt, GSM756.txt and particularly Targets.txt.
> I looked at the page http://sites.google.com/site/davismcc/useful-documents, but they do not seem to be there.
> I tried to work with the files that accompany the package (NC1.txt, NC2.txt, and Tu98.txt Tu102.txt) but I get an error message when I run the command:
>> d <- calcNormFactors (d)
> (Error in calcNormFactors (d) 'data matrix' Need to Be a matrix).
> All the files are in the form:
> Tag_Sequence    Count
> AAAAAAAAAA      17
> and the Targets.txt file is:
> files   group    description
> NC1.txt NC      Normal colon
> NC2.txt NC      Normal colon
> Tu98.txt        Tu      Primary colonrectal tumour
> Tu102.txt       Tu      Primary colonrectal tumour
> In fact, after running the commands:
>> targets <- read.delim(file = "Targets.txt", stringsAsFactors = FALSE)
>> d <- readDGE (targets, skip = 5, comment.char ="!")
> I do not get a column showing the normalization factors (1 for all files) as shown in the document.
> Also, when I run the command
>> dim(d)
> I get "NULL" as a result.
> Thank you for your help.
> Jerome
> Universite Laval, Quebec, Canada
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Davis J McCarthy
Research Technician
Bioinformatics Division
Walter and Eliza Hall Institute of Medical Research
1G Royal Parade, Parkville, Vic 3052, Australia
dmccarthy at wehi.edu.au
