[BioC] differential expression- edgeR

Mark Robinson mark.robinson at imls.uzh.ch
Thu Dec 26 20:08:57 CET 2013


Hi Ashutosh,

Please keep these exchanges on the BioC mailing list.  I've cc'd the list here.

> But when I exactly follow the script from edgeR manual on the data and on making DGE list, I get library size as NA. 

This depends on how the data is stored.  readDGE() expects read counts where absence of a tag id in a single sample means a 0 count for that sample, while ArrayExpress appears to store 0s as NAs (with aligned ids across the dataset).  So, I believe this is why your library sizes come out as NA (summing over columns with NAs present).  As you see below, that's why I added:

counts[is.na(counts)] <- 0

before I created the 'DGEList'.

> Also can you please explain or give reference to function "grp <- sapply(colnames(counts),function(u) strsplit(u,"_")[[1]][1])". At the moment, I just copied into the terminal but I want to know what it signifies?

It's just a way to convert this vector:

> colnames(counts)
[1] "DLCK.TG_1" "DLCK.TG_2" "DLCK.TG_3" "DLCK.TG_4" "WT_1"      "WT_3"     
[7] "WT_4"      "WT_6"     

into this:

> grp
DLCK.TG_1 DLCK.TG_2 DLCK.TG_3 DLCK.TG_4      WT_1      WT_3      WT_4      WT_6 
"DLCK.TG" "DLCK.TG" "DLCK.TG" "DLCK.TG"      "WT"      "WT"      "WT"      "WT" 

There are many other ways to do this, including:

grp <- gsub("_[0-9]","",colnames(counts))

but, these are typically customized to the dataset (and filename structure) on hand.

Best, Mark


On 26.12.2013, at 11:02, "Dhingra, Ashutosh" <a.dhingra at vumc.nl> wrote:

> Dear Mark, 
> 
> Thanks a lot for your kind help! It works with your script. I was also able to find a error in the target list because of which I was getting out of bound error. 
> 
> 
> But when I exactly follow the script from edgeR manual on the data and on making DGE list, I get library size as NA. 
> 
> Also can you please explain or give reference to function "grp <- sapply(colnames(counts),function(u) strsplit(u,"_")[[1]][1])". At the moment, I just copied into the terminal but I want to know what it signifies?
> 
> 
> 
> Best regards,
> Ashutosh
> ________________________________________
> From: Mark Robinson [mark.robinson at imls.uzh.ch]
> Sent: 23 December 2013 17:35
> To: Ashutosh [guest]
> Cc: bioconductor at r-project.org; Dhingra, Ashutosh
> Subject: Re: [BioC] differential expression- edgeR
> 
> Dear Ashutosh,
> 
> I wasn't able to (nicely) get exactly the same files from GEO, but here is an alternative:
> 
> See:
> http://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-71/samples/
> 
> f <- "ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/experiment/MTAB/E-MTAB-71/E-MTAB-71.raw.1.zip"
> bf <- basename(f)
> download.file(f, bf)  # download ZIP from ArrayExpress
> 
> unzip(bf)  # should put the 8 TXT files in current dir
> 
> tg <- dir(".","^DLCK|WT")
> 
> library("edgeR")
> counts <- readDGE(tg)$counts
> counts[is.na(counts)] <- 0
> 
> grp <- sapply(colnames(counts),function(u) strsplit(u,"_")[[1]][1])
> d <- DGEList(counts=counts, group=grp)
> 
> This should return:
> 
>> d
> An object of class "DGEList"
> $counts
>                  DLCK.TG_1 DLCK.TG_2 DLCK.TG_3 DLCK.TG_4 WT_1  WT_3 WT_4 WT_6
> AAAAAAAAAAAAAAAAA     22653      3059      1366      6574 7782 35096 6623 9633
> AAAAAAAAAAAAAAAAC        82        51        55        93  412   134  335  519
> AAAAAAAAAAAAAAAAG         2         3         7         9   59     5   45   84
> AAAAAAAAAAAAAAAAT       118       471       359       717 1842    94 2465 3311
> AAAAAAAAAAAAAAACA        67         4         4        12   17   108   12   21
> 844311 more rows ...
> 
> $samples
>            group lib.size norm.factors
> DLCK.TG_1 DLCK.TG   651172            1
> DLCK.TG_2 DLCK.TG  2685418            1
> DLCK.TG_3 DLCK.TG  3202246            1
> DLCK.TG_4 DLCK.TG  2460753            1
> WT_1           WT  3142262            1
> WT_3           WT   294909            1
> WT_4           WT  3517977            1
> WT_6           WT  3558260            1
> 
> [… proceed from here …]
> 
> Alternatively, from your original code, I think you want 'skip=1' and is it possible that you have an extra unwanted file in your 'targets' list ?
> 
> Anyways, hope that helps.
> 
> Best, Mark
> 
> 
> ----------
> Prof. Dr. Mark Robinson
> Bioinformatics, Institute of Molecular Life Sciences
> University of Zurich
> http://ow.ly/riRea
> 
> 
> 
> 
> 
> 
> On 23.12.2013, at 16:50, Ashutosh [guest] <guest at bioconductor.org> wrote:
> 
>> 
>> I am new to R and edgeR. I am trying to follow to example
>> 9. Case Study: deep-sequenced short tags from the edgeR manual.
>> 
>> I download the data from http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM272105
>> by clicking the full table and then saving webpage as .txt file containing
>> 
>> First 4 rows as below folled by fifth which contains the tag seq and count.
>> 
>> #SEQUENCE =
>> #COUNT =
>> #TPM = tags per million
>> SEQUENCE      COUNT   TPM
>> CATCGCCAGCGGGCACC     1       0.37
>> 
>> Now we I follow the steps of 9.3 reading data and creating the DGElist: After running
>> < d<- readDGE(targets, skip = 5, comment.char = "#"), I get
>> Error in taglist[[i]] : subscript out of bounds
>> 
>> Can anyone please help, how I can solve this issue.
>> 
>> Best regards,
>> Ashutosh
>> 
>> 
>> -- output of sessionInfo():
>> 
>> R version 3.0.2 (2013-09-25)
>> Platform: x86_64-pc-linux-gnu (64-bit)
>> 
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>> [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>> [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>> [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>> 
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>> 
>> other attached packages:
>> [1] edgeR_3.4.2  limma_3.18.7
>> 
>> 
>> --
>> Sent via the guest posting facility at bioconductor.org.
>> 
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> 



More information about the Bioconductor mailing list