[BioC] [flowCore] problems with read.flowSet

Finak, Greg gfinak at fhcrc.org
Thu Oct 6 15:31:55 CEST 2011


Hi, Stephan

This error means that your FCS files don't all have the same channel names on each dimension, or don't have the same number of channels.
A flowSet must consist of FCS files that have identical channels. You can use read.FCShearer() to read in the keywords of individual files to verify which ones don't match.

Greg Finak, PhD
Post-doctoral fellow
Fred Hutchinson Cancer Research Center


"bioconductor-request at r-project.org" <bioconductor-request at r-project.org> wrote:

Send Bioconductor mailing list submissions to
	bioconductor at r-project.org

To subscribe or unsubscribe via the World Wide Web, visit
	https://stat.ethz.ch/mailman/listinfo/bioconductor
or, via email, send a message with subject or body 'help' to
	bioconductor-request at r-project.org

You can reach the person managing the list at
	bioconductor-owner at r-project.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Bioconductor digest..."


Today's Topics:

   1. [flowCore] problems with read.flowSet (Stephan Fuhrmann)
   2. Re: GEO and Limma (Ovokeraye Achinike-Oduaran)
   3. Re: Chip-seq quality control (Lucia Peixoto)
   4. Re: [question] integration of microarray data from different
      sources in limma package (???)
   5. Re: limma data frame subsetting problem (Jabez Wilson)
   6. Re: limma data frame subsetting problem (axel.klenk at actelion.com)
   7. Re: heatmap.2 function and the scale option (Ale? Maver)
   8. RankProd related question (Wendy Qiao)
   9. how to deal with a 30G fastq file (wang peter)
  10. Re: how to deal with a 30G fastq file (Martin Morgan)
  11. Re: poe.mcmc - phenotypic label - coding (mjonczyk)


----------------------------------------------------------------------

Message: 1
Date: Wed, 5 Oct 2011 12:32:32 +0200
From: "Stephan Fuhrmann" <stephan.fuhrmann at charite.de>
To: bioconductor at r-project.org
Subject: [BioC] [flowCore] problems with read.flowSet
Message-ID:
	<692b9f576d4668f47e6b8256f34430f1.squirrel at webmail.charite.de>
Content-Type: text/plain;charset=iso-8859-1

dear all,

i tried to load 17 fcs-files using the read.flowSet-command:
flowData <- read.flowSet(path = "/Users/stephan/Desktop/FCS
files/IE-Studie/R_pp65_A", phenoData = (pData), transformation = FALSE)

this returned the following error:

Error in dim(colNames) <- c(ncol(from[[frameList[[1]]]]),
length(frameList)) :
  Dimensionen [Produkt 170] don?t match the length of the object [17]

i can?t work out whats going wrong. loading the example data works fine.
is there anything wrong with the fcs-files?

help is much appreciated!

stephan

> sessionInfo()
R version 2.13.1 (2011-07-08)
Platform: i386-apple-darwin9.8.0/i386 (32-bit)

locale:
[1] de_DE.UTF-8/de_DE.UTF-8/C/C/de_DE.UTF-8/de_DE.UTF-8

attached base packages:
 [1] tools     splines   grid      stats     graphics  grDevices utils   
datasets  methods
[10] base

other attached packages:
 [1] flowUtils_1.12.0      flowQ_1.12.0          latticeExtra_0.6-18  
RColorBrewer_1.0-5
 [5] parody_1.10.0         bioDist_1.24.1        KernSmooth_2.23-6    
outliers_0.14
 [9] flowMerge_2.0.3       multicore_0.1-7       foreach_1.3.2        
codetools_0.2-8
[13] iterators_1.0.5       snow_0.3-7            flowMeans_1.4.0      
flowFP_1.8.0
[17] flowClust_2.10.0      ellipse_0.3-5         mnormt_1.4-3         
flowViz_1.16.0
[21] lattice_0.19-33       xtable_1.5-6          sfsmisc_1.0-14       
cluster_1.14.0
[25] mvoutlier_1.9.3       robCompositions_1.5.0 car_2.0-11           
survival_2.36-9
[29] nnet_7.3-1            compositions_1.10-1   energy_1.3-0         
MASS_7.3-14
[33] boot_1.3-2            tensorA_0.36          rgl_0.92.798         
fda_2.2.7
[37] zoo_1.7-4             flowCore_1.18.0       rrcov_1.3-01         
pcaPP_1.9-44
[41] mvtnorm_0.9-9991      robustbase_0.7-6      Biobase_2.12.2       
flowType_0.99.4
[45] Rgraphviz_1.30.1      graph_1.30.0

loaded via a namespace (and not attached):
 [1] annotate_1.30.1      AnnotationDbi_1.14.1 DBI_0.2-5           
feature_1.2.8
 [5] geneplotter_1.30.0   ks_1.8.2             RSQLite_0.9-4       
RUnit_0.4.26
 [9] stats4_2.13.1        XML_3.4-3
> traceback()
5: asMethod(object)
4: as(list2env(from, new.env(hash = T, parent = emptyenv())), "flowSet")
3: asMethod(object)
2: as(flowSet, "flowSet")
1: read.flowSet(path = "/Users/stephan/Desktop/FCS files/IE-Studie/R_pp65_A",
       phenoData = (pData), transformation = FALSE)



------------------------------

Message: 2
Date: Wed, 5 Oct 2011 15:35:32 +0200
From: Ovokeraye Achinike-Oduaran <ovokeraye at gmail.com>
To: Sean Davis <sdavis2 at mail.nih.gov>, bioconductor at r-project.org
Subject: Re: [BioC] GEO and Limma
Message-ID:
	<CAMNjH+KfMWf4TU+fBKeARL_eHDtMRCFj3Dm9mAqzCm-kiSD6tw at mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1

Thanks Sean.

Avoks

On Tue, Oct 4, 2011 at 8:08 PM, Sean Davis <sdavis2 at mail.nih.gov> wrote:
> On Tue, Oct 4, 2011 at 2:04 PM, Ovokeraye Achinike-Oduaran
> <ovokeraye at gmail.com> wrote:
>> Thanks a bunch, Johannes and Sean. I'll try the GSE data again as soon
>> as I ?get in front of a comp. About the first question on creating a
>> model matrix ?for a 3x2 factorial dataset like gds3715, any ideas?
>
> You'll probably need a contrast matrix in there somewhere to pull out
> contrasts of interest. ?I'm not sure which you are interested in.
>
> Sean
>
>
>> Thanks again.
>>
>> Avoks
>>
>> On 10/4/11, Sean Davis <sdavis2 at mail.nih.gov> wrote:
>>> On Tue, Oct 4, 2011 at 10:28 AM, Ovokeraye Achinike-Oduaran
>>> <ovokeraye at gmail.com> wrote:
>>>> Hi all,
>>>>
>>>> I've been struggling a bit with GEO and limma. I've come up with 2
>>>> questions that I'ld appreciate some input on.
>>>>
>>>> Question 1:
>>>> I have a design matrix with 6columns for an expression set with 3
>>>> columns(3 levels, 2 agents). How do I create a model matrix to fit my
>>>> design matrix for this dataset (eg. gds3715 below)? For a straight
>>>> forward single factor analysis eg. gds161, the code below seems to
>>>> work fine.
>>>>
>>>> gds161dat = getGEO('GDS161',destdir=".")
>>>> gds161eset = GDS2eSet(gds161dat, do.log2=TRUE)
>>>> m = pData(gds161eset)$metabolism
>>>> design_gds161 = createDesignMatrix(gds161eset)
>>>> design_gds161 = model.matrix(~m)
>>>> fit = lmFit(gds161eset, design_gds161)
>>>> fit2 = eBayes(fit)
>>>> results = topTable(fit2, adjust ="BH", number = nrow(gds161eset))
>>>> excel = write.table(results, file = file.choose(), sep = ",")
>>>>
>>>> gds3715dat = getGEO('GDS3715',destdir=".")
>>>> gds3715eset = GDS2eSet(gds3715dat, do.log2=TRUE)
>>>> DIR = paste(pData(gds3715eset)$disease.state,
>>>> pData(gds3715eset)$agent, sep =".")
>>>> m = data.frame("fac1" = pData(gds3715eset)$disease.state, "fac2" =
>>>> pData(gds3715eset)$agent)
>>>> design_gds3715 = createDesignMatrix(gds3715eset)
>>>> design_gds3715 = model.matrix(~m)
>>>> fit = lmFit(gds3715eset, design_gds3715)
>>>> fit2 = eBayes(fit)
>>>> results = topTable(fit2, adjust ="BH", number = nrow(gds3715eset))
>>>> excel = write.table(results, file = file.choose(), sep = ",")
>>>>
>>>> Question 2:
>>>> How can I run a similar (limma) analysis with GSE files?
>>>> I can hardly figure out how to get from the expression set to the
>>>> analysis...I've read the Using the GEOquery Package documentation
>>>> several times over. I'm just have a hard time getting it to work.
>>>>
>>>> gse25724dat = getGEO('GSE25724', GSEMatrix = TRUE)
>>>
>>> Hi, Avoks. ?The return value from this call is a LIST of
>>> ExpressionSets, not a single ExpressionSet. ?You probably want the
>>> first ExpressionSet from the list and can get that:
>>>
>>> gse25724eSet = gse25724dat[[1]]
>>>
>>> GEOquery works this way because many GSEs (so-called superseries) have
>>> multiple series included within them, so returning a list is necessary
>>> to be general.
>>>
>>> Sean
>>>
>>>
>>>> This is supposed to give me an expressionset, I can't seem to figure
>>>> out how to analyze it with limma.
>>>>
>>>> Please help.
>>>>
>>>> Thanks.
>>>>
>>>> Avoks
>>>>
>>>> sessionInfo()
>>>> R version 2.13.1 (2011-07-08)
>>>> Platform: i386-pc-mingw32/i386 (32-bit)
>>>>
>>>> locale:
>>>> [1] LC_COLLATE=English_xxx. LC_CTYPE=English_xxx
>>>> [3] LC_MONETARY=English_xxx LC_NUMERIC=C
>>>> [5] LC_TIME=English_xxx
>>>>
>>>> attached base packages:
>>>> [1] stats ? ? graphics ?grDevices utils ? ? datasets ?methods ? base
>>>>
>>>> other attached packages:
>>>> [1] puma_2.4.0 ? ? ?mclust_3.4.10 ? affy_1.30.0 ? ? limma_3.8.3
>>>> [5] GEOquery_2.19.4 Biobase_2.12.2
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] affyio_1.20.0 ? ? ? ? preprocessCore_1.14.0 RCurl_1.6-10.1
>>>> [4] tools_2.13.1 ? ? ? ? ?XML_3.4-2.2
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>



------------------------------

Message: 3
Date: Wed, 5 Oct 2011 10:06:12 -0400
From: Lucia Peixoto <luciap at iscb.org>
To: Martin Morgan <mtmorgan at fhcrc.org>
Cc: "bioconductor at r-project.org" <bioconductor at r-project.org>
Subject: Re: [BioC] Chip-seq quality control
Message-ID:
	<CAHzS3XtSO2jEHXyXUFkM2e+417iq_z_ohhGE0HF6KNU_JEgyHQ at mail.gmail.com>
Content-Type: text/plain

Thanks very much for the suggestions
I will likely have more questions as I start the analysis

Lucia


On Tue, Oct 4, 2011 at 2:57 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:

> On 10/04/2011 07:33 AM, Ivan Gregoretti wrote:
> > Hello Lucia,
> >
> > A proper response to your post would take a lecture rather than an
> > email. I can't do that but I can bullet the main points. I think that
> > it will help you if you are indeed a newcomer to ChIP-seq.
> >
> > 1) Expect 10 million reads per sample for a genome the size of human.
>
> I'd run some basic QA on your lanes, via ShortRead::qa on the fastq files
> (or bam if fastq are not available); use FastqSampler if memory is tight
> (but in general if memory is tight the solution will be to find a larger
> computer).
>
> See http://bioconductor.org/help/**workflows/high-throughput-**sequencing/<http://bioconductor.org/help/workflows/high-throughput-sequencing/>for qa and perhaps other operations common to RNAseq / ChIPseq work flows
>
>
> >
> > 2) Stick to SAM/BAM formats so that you can use well known, publicly
> > available tools. Your best friend is called Picard.
>
> People can and do use R / Bioconductor for Picard-like tasks.
>
>
> > 3) Remove duplicates. Again, Picard is your best friend.
>
> > 4) Create WIG files for all samples, treatments and controls so that
> > you can display them simultaneously on any genome browser.
>
> here for interactive use I would rather use basic R plotting commands,
> avoiding the round-trip and allowing programmatic interaction.
>
>
> > 5) Find peaks with a well documented peak finder.
>
> probably a good suggestion for a one-off or common ChIP; the chipseq
> vignette
>
>  http://bioconductor.org/**packages/release/bioc/html/**chipseq.html<http://bioconductor.org/packages/release/bioc/html/chipseq.html>
>
> provides inspiration for more flexible analysis; packages under the ChIPseq
> biocViews term (Software --> AssayTechnologies -> HighThroughputSequencing->
> **ChIPSeq) might offer a solution tailored to your ChIP.
>
>
> > 6) Compute enrichment for all treatments relative to their controls.
>
> again the chipseq vignette is an alternative source.
>
>
> >
> > So, points 4 and 6 are your quality controls at this stage. Once you
> > know what a good immunoprecipitation looks like compared to a bad one,
> > you can start diving into the details. You can invent your own quality
>
> especially at getting a sense for good versus bad results the interactivity
> of R / Bioconductor seem essential.
>
> Martin
>
>
> > indicators. For instance, I compute the proportion of tags inside the
> > 1000 strongest peaks. I do that for BOTH treatment and controls.
> >
> > In my workflow, Bioconductor does not get involved until I reach point 6.
> >
> > Happy ChIPing.
> >
> > Ivan
> >
> >
> >
> >
> >
> > On Mon, Oct 3, 2011 at 5:17 PM, Lucia Peixoto<luciap at iscb.org>  wrote:
> >> Hi,
> >> I am new to Chip-seq, my experiment's sequencing has finished, and the
> read
> >> alignment is currently running
> >> The experiment  was done for histone acetylation, and I have two types
> of
> >> controls: input DNA and unmodified histone.
> >> I have two conditions and 6 biological replicates of each condition
> >> I wanted some advice on how to perform basic quality control on Chip-seq
> >> data using Bioconductor
> >> and also some ideas of which kinds of biases people usually observe and
> I
> >> should keep my eyes open for
> >> any advice will be greatly appreciated!
> >> thanks
> >>
> >> Lucia
> >>
> >>         [[alternative HTML version deleted]]
> >>
> >> ______________________________**_________________
> >> Bioconductor mailing list
> >> Bioconductor at r-project.org
> >> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
> >> Search the archives: http://news.gmane.org/gmane.**
> science.biology.informatics.**conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
> >>
> >
> > ______________________________**_________________
> > Bioconductor mailing list
> > Bioconductor at r-project.org
> > https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
> > Search the archives: http://news.gmane.org/gmane.**
> science.biology.informatics.**conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>
>
> --
> Computational Biology
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>
> Location: M1-B861
> Telephone: 206 667-2793
>

	[[alternative HTML version deleted]]



------------------------------

Message: 4
Date: Wed, 5 Oct 2011 23:50:01 +0900
From: ??? <sjlee at biosoft.kaist.ac.kr>
To: "Freudenberg, Johannes (NIH/NIEHS) [E]"
	<johannes.freudenberg at nih.gov>,	"bioconductor at stat.math.ethz.ch"
	<bioconductor at stat.math.ethz.ch>
Subject: Re: [BioC] [question] integration of microarray data from
	different sources in limma package
Message-ID: <4E8C6E99.5020703 at biosoft.kaist.ac.kr>
Content-Type: text/plain; charset="EUC-KR"

Hi Johannes,

I would like to explain my data in detail for your deep understanding.
There are three replicate data from two-channel microarray

Two of them were scanned by genepix scanner and the other was scanned by
agilent scanner.

All of them are originated from same microarray platform( agilent chip
). Just different scanner makes them provide different outputs, but its
original information is same.

However, when I want to use three microarray data for the analysis using
limma package, it is difficult to use built-in functions in the package
to normalize and analyze them for extracting DEGs(differentially
expressed genes).

For example,

library(limma)
setwd("~~")
targets_agilent = readTargets("targets_agilent.txt")
targets_genepix = readTargets("targets_genepix.txt")
rg_agilent = read.maimages(targets_agilent$FileName, source="agilent")
rg_genepix = read.maimages(targets_genepix$FileName, source="genepix")
rg_all = cbind( rg_agilent, rg_genepix )

When I tried to use rg_all object for further analysis, it provokes some
problems because cbind function just concetanates the data without
consideration of interal information.

Please let me know your precious opinion for this beginner in this field.

Thanks in advance.

Sincerely,

Sunjae LEE




2011-10-01 ?? 2:29, Freudenberg, Johannes (NIH/NIEHS) [E] ? ?:
> Hi Sunjae,
>
>> Is there any possibility to combine, normalize and analyze them without problems?
> I think the order should be "normalize, combine, and analyze (hopefully) without problems."  I don't really see a way (nor a good reason) for combining the different platforms first and then doing the pre-processing.
>
> Best regards,
> --Johannes
>
>
>
> -----Original Message-----
> From: ??? [mailto:sjlee at biosoft.kaist.ac.kr] 
> Sent: Tuesday, September 27, 2011 11:05 PM
> To: bioconductor at stat.math.ethz.ch
> Subject: [BioC] [question] integration of microarray data from different sources in limma package
>
> Dear developers,
>
> I want to ask some questions about microarray data pre-processing
>
> In the limma package, it supports reading microarray data from different source such as agilent(agilent feature extraction), genepix(genepix pro) format by fuction read.maimages().
>
> And cbind fuction enables us to combine RGList obtained from micorarray data from different source
>
> However, when we try to normalize the combined RGList data through normalizeBetweenArrays function and analyze differentially expressed gene through lmFit, eBayes function, it produces some problems of not-compatible integrations due to diffrenence of its sources. ( I just want to integrate microarray data from agilent and genepix format as soon as possible )
>
> Is there any possibility to combine, normalize and analyze them without problems?
>
> please let me know how to solve them.
>
>
>
> Sincerely,
>
> Sunjae LEE
>
> Ph.D. candidate, BISL, Dept. of Bio and Brain engineering, KAIST 335-Gwahangno, Yuseong-gu, Daejeon 305-701, Republic of Korea
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor


-- 
Sunjae LEE
Ph.D. candidate, BISL, Dept. of Bio and Brain engineering, KAIST
335-Gwahangno, Yuseong-gu, Daejeon 305-701, Republic of Korea
TEL +82-42-350-4353
(cell.) +82-10-3090-9050



------------------------------

Message: 5
Date: Wed, 5 Oct 2011 15:22:14 +0100 (BST)
From: Jabez Wilson <jabezwuk at yahoo.co.uk>
To: axel.klenk at actelion.com
Cc: bioconductor at r-project.org, bioconductor-bounces at r-project.org
Subject: Re: [BioC] limma data frame subsetting problem
Message-ID:
	<1317824534.79133.YahooMailClassic at web28515.mail.ukl.yahoo.com>
Content-Type: text/plain

Yes, thanks, Axel, I think that may well do the trick - assuming you meant rowMaxs() ;-)

You're right, it is an RGList not a DF.

Jab


--- On Wed, 5/10/11, axel.klenk at actelion.com <axel.klenk at actelion.com> wrote:


From: axel.klenk at actelion.com <axel.klenk at actelion.com>
Subject: Re: [BioC] limma data frame subsetting problem

Cc: bioconductor at r-project.org, bioconductor-bounces at r-project.org
Date: Wednesday, 5 October, 2011, 10:42


Dear Jabez, 

try:

RG[rowMax(RG$G) > 5000, ]

Is that what you want? 

And BTW, note that you are dealing with an RGList and not a data frame...

Cheers,

- axel


Axel Klenk
Research Informatician
Actelion Pharmaceuticals Ltd / Gewerbestrasse 16 / CH-4123 Allschwil / 
Switzerland




From:

To:
bioconductor at r-project.org
Date:
05.10.2011 11:15
Subject:
[BioC] limma data frame subsetting problem
Sent by:
bioconductor-bounces at r-project.org








Dear Bioconductors, I want to do something simple, which I cannot find the 
solution to. I have a limma data frame and I want to select a subset of 
the data frame based on whether the values in the "G" channel are > e.g. 
5000
As an example I use swirl data
targets <- readTargets("SwirlSample.txt"); RG <- read.maimages(targets, 
source="spot")
> head(RG$G)
       swirl.1   swirl.2    swirl.3    swirl.4
[1,] 22028.260 19278.770  2727.5600 19930.6500
[2,] 25613.200 21438.960  2787.0330 25426.5800
[3,] 22652.390 20386.470  2419.8810 16225.9500
[4,]  8929.286  6677.619   383.2381   786.9048
[5,]  8746.476  6576.292   901.0000   468.0476
[6,] 37010.080 23769.100 23377.9700 28399.0900


I can select the first 20 lines of the df by 
>RG[1:20,]
but I really want to select those lines of the df (and keep the df format 
intact) where the "G" value in any column is > a certain figure (e.g. 
5000)
However, 
> RG[RG$G>5000,]
Error in `[.RGList`(RG, RG$G > 5000, ) : 
  (subscript) logical subscript too long

I have no success with subset either:
> subset(RG, RG$G>5000,)
Error: Two subscripts required
> subset(RG$G, RG$G>5000)
Error in subset.matrix(RG$G, RG$G > 5000) : 
  (subscript) logical subscript too long


Do I need to write a loop to check each column of "G" seperately? Or is 
there a simpler solution?
TIA
Jabez

                 [[alternative HTML version deleted]]

_______________________________________________
Bioconductor mailing list
Bioconductor at r-project.org
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives: 
http://news.gmane.org/gmane.science.biology.informatics.conductor



The information of this email and in any file transmitted with it is strictly confidential and may be legally privileged.
It is intended solely for the addressee. If you are not the intended recipient, any copying, distribution or any other use of this email is prohibited and may be unlawful. In such case, you should please notify the sender immediately and destroy this email.
The content of this email is not legally binding unless confirmed by letter.
Any views expressed in this message are those of the individual sender, except where the message states otherwise and the sender is authorised to state them to be the views of the sender's company. For further information about Actelion please see our website at http://www.actelion.com 



	[[alternative HTML version deleted]]



------------------------------

Message: 6
Date: Wed, 5 Oct 2011 17:06:24 +0200
From: axel.klenk at actelion.com

Cc: bioconductor at r-project.org, bioconductor-bounces at r-project.org
Subject: Re: [BioC] limma data frame subsetting problem
Message-ID:
	<1002_1317827251_4E8C72B3_1002_4917_1_OF930BDEB2.27423C34-ONC1257920.00511A52-C1257920.005315A3 at actelion.com>
	
Content-Type: text/plain; charset="US-ASCII"

No, I meant rowMax() from package Biobase but obviously this functionality
has been implemented several times... rowMaxs() from fBasics (if that's 
the
one you've found) is less efficient but it will hardly make a difference 
for your
use case - as in fortune(98)... :-)

 - axel


Axel Klenk
Research Informatician
Actelion Pharmaceuticals Ltd / Gewerbestrasse 16 / CH-4123 Allschwil / 
Switzerland




From:

To:
axel.klenk at actelion.com
Cc:
bioconductor at r-project.org, bioconductor-bounces at r-project.org
Date:
05.10.2011 16:22
Subject:
Re: [BioC] limma data frame subsetting problem




Yes, thanks, Axel, I think that may well do the trick - assuming you meant 
rowMaxs() ;-)

You're right, it is an RGList not a DF.

Jab


--- On Wed, 5/10/11, axel.klenk at actelion.com <axel.klenk at actelion.com> 
wrote:

From: axel.klenk at actelion.com <axel.klenk at actelion.com>
Subject: Re: [BioC] limma data frame subsetting problem

Cc: bioconductor at r-project.org, bioconductor-bounces at r-project.org
Date: Wednesday, 5 October, 2011, 10:42

Dear Jabez, 

try:

RG[rowMax(RG$G) > 5000, ]

Is that what you want? 

And BTW, note that you are dealing with an RGList and not a data frame...

Cheers,

- axel


Axel Klenk
Research Informatician
Actelion Pharmaceuticals Ltd / Gewerbestrasse 16 / CH-4123 Allschwil / 
Switzerland




From:

To:
bioconductor at r-project.org
Date:
05.10.2011 11:15
Subject:
[BioC] limma data frame subsetting problem
Sent by:
bioconductor-bounces at r-project.org








Dear Bioconductors, I want to do something simple, which I cannot find the 

solution to. I have a limma data frame and I want to select a subset of 
the data frame based on whether the values in the "G" channel are > e.g. 
5000
As an example I use swirl data
targets <- readTargets("SwirlSample.txt"); RG <- read.maimages(targets, 
source="spot")
> head(RG$G)
       swirl.1   swirl.2    swirl.3    swirl.4
[1,] 22028.260 19278.770  2727.5600 19930.6500
[2,] 25613.200 21438.960  2787.0330 25426.5800
[3,] 22652.390 20386.470  2419.8810 16225.9500
[4,]  8929.286  6677.619   383.2381   786.9048
[5,]  8746.476  6576.292   901.0000   468.0476
[6,] 37010.080 23769.100 23377.9700 28399.0900


I can select the first 20 lines of the df by 
>RG[1:20,]
but I really want to select those lines of the df (and keep the df format 
intact) where the "G" value in any column is > a certain figure (e.g. 
5000)
However, 
> RG[RG$G>5000,]
Error in `[.RGList`(RG, RG$G > 5000, ) : 
  (subscript) logical subscript too long

I have no success with subset either:
> subset(RG, RG$G>5000,)
Error: Two subscripts required
> subset(RG$G, RG$G>5000)
Error in subset.matrix(RG$G, RG$G > 5000) : 
  (subscript) logical subscript too long


Do I need to write a loop to check each column of "G" seperately? Or is 
there a simpler solution?
TIA
Jabez

                 [[alternative HTML version deleted]]

_______________________________________________
Bioconductor mailing list
Bioconductor at r-project.org
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives: 
http://news.gmane.org/gmane.science.biology.informatics.conductor



The information of this email and in any file transmitted with it is 
strictly confidential and may be legally privileged.
It is intended solely for the addressee. If you are not the intended 
recipient, any copying, distribution or any other use of this email is 
prohibited and may be unlawful. In such case, you should please notify the 
sender immediately and destroy this email.
The content of this email is not legally binding unless confirmed by 
letter.
Any views expressed in this message are those of the individual sender, 
except where the message states otherwise and the sender is authorised to 
state them to be the views of the sender's company. For further 
information about Actelion please see our website at 
http://www.actelion.com 






The information of this email and in any file transmitted with it is strictly confidential and may be legally privileged.
It is intended solely for the addressee. If you are not the intended recipient, any copying, distribution or any other use of this email is prohibited and may be unlawful. In such case, you should please notify the sender immediately and destroy this email.
The content of this email is not legally binding unless confirmed by letter.
Any views expressed in this message are those of the individual sender, except where the message states otherwise and the sender is authorised to state them to be the views of the sender's company. For further information about Actelion please see our website at http://www.actelion.com



------------------------------

Message: 7
Date: Wed, 5 Oct 2011 19:33:36 +0200
From: Ale? Maver <ales.maver at gmail.com>
To: john herbert <arraystruggles at gmail.com>
Cc: bioconductor at r-project.org
Subject: Re: [BioC] heatmap.2 function and the scale option
Message-ID:
	<CAHzx6mcCNq6eHG=Vz9HFQH=yhSQwkSJTwdRaBmZrFejC1MgYiw at mail.gmail.com>
Content-Type: text/plain

Hi!
I think it depends on what information you have stored in rows and what in
columns. It also depends what data you have stored in the matrix.
For example, if your input values are deltaCt values (you mentioned you work
on RT-PCR results), they may differ substantially across genes, which may
cause your heatmap to look anomalous.

If you look at an exemplar matrix with dCt values for three genes in rows
(Gene1-Gene3) and ten samples in columns:
mat<-matrix(c(rnorm(n=10, mean=5, sd=1),
            rnorm(n=10, mean=6, sd=1),
            rnorm(n=10, mean=15, sd=1)), ncol=10, byrow=T,
dimnames=list(paste("Gene", c(1:3), sep=""), paste("Sample", c(1:10),
sep="")))

Here the distributions of measured values differ substantially between genes
(means 5,10 and 15). If you inspect the plot where scaling across columns is
used:
heatmap(mat, scale="column")

you will notice Gene3 to be always more yellow in color than Gene1 or Gene2,
as Gene3 values are always more than other two.

However, using scaling across rows brings out the differences between
samples  - separately for each gene (row), as value range is calculated for
each gene (row) seperately:
heatmap(mat, scale="row")

Hope this explains it,
Ales

2011/10/3 john herbert <arraystruggles at gmail.com>

> Dear Bioconductors,
> I am using the the heatmap.2 function to plot out a heatmap of 130
> genes from Q-PCR expression.
>
> If I plot them out scaling by row, I get a heatmap that looks a bit
> blotchy.
>
> However, if I scale by column, the heatmap looks a lot better.
>
> The clustering of samples is the same but I don't really know the
> theory behind why column scaling looks better than row scaling.
>
> Please can someone advise me on any theory behind this?
>
> The data is in a matrix, like microarray, with columns of different
> conditions and rows of genes.
>
> Thank you,
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>



-- 
Ales Maver, MD
Institute of Medical Genetics, Department of Obstetrics and Gynaecology
UMC Ljubljana
©lajmerjeva 3
SI-1000 Ljubljana
Slovenia

	[[alternative HTML version deleted]]



------------------------------

Message: 8
Date: Wed, 5 Oct 2011 23:00:33 -0400
From: Wendy Qiao <wendy2.qiao at gmail.com>
To: bioconductor at r-project.org
Subject: [BioC] RankProd related question
Message-ID:
	<CAH=jfhyGyJWSwamVaPtdOtahWc3pcPbDHxLjL+pLgvcBb5M39Q at mail.gmail.com>
Content-Type: text/plain

Hi all,

I need to analyze a combined dataset of microarrays from two sources. The
arrays for the first cell type is from source1 and the arrays for the other
cell types are from source2. I would like to use the advanced rank product
method to identify the unregulated genes. My "origin" vector and sample
vector are as following. However, by using the following code, I got an
error message as highlighted. I think that the arrays for cell type 1 must
from two sources. Does it mean that I cannot use the advanced RankProd? Does
anyone know the major difference between the normal and advanced RankProd
methods?

Thank you in advance,
Wendy


mergeD.origin   # origin vector
 [1] 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[54] 2 2 2 2 2 2 2 2

mergeD.cl   # sample vector
 [1] 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[54] 0 0 0 0 0 0 0 0

RP.adv.out <- RPadvance(mergeD,  mergeD.cl, mergeD.origin, num.perm = 100,
logged = FALSE, gene.names = rownames(mergeD), rand = 123)
 The data is from  2 different origins

Rank Product analysis for two-class case

Error in OriginxyCall(data, cl, origin) :
  Error: data from different origins should contain data from both classes

	[[alternative HTML version deleted]]



------------------------------

Message: 9
Date: Wed, 5 Oct 2011 22:04:45 -0500
From: wang peter <wng.peter at gmail.com>
To: bioconductor at r-project.org
Subject: [BioC] how to deal with a 30G fastq file
Message-ID:
	<CAKH7RFCAt40Fe9pXFFNhgov0zoS+si_2_d_TYgvS+tpM1tme8A at mail.gmail.com>
Content-Type: text/plain

IT is too slow to read them in the memory.
who can tell me if i need split them by other program or
call some R function to split them

thx

	[[alternative HTML version deleted]]



------------------------------

Message: 10
Date: Wed, 05 Oct 2011 20:39:12 -0700
From: Martin Morgan <mtmorgan at fhcrc.org>
To: wang peter <wng.peter at gmail.com>
Cc: bioconductor at r-project.org
Subject: Re: [BioC] how to deal with a 30G fastq file
Message-ID: <4E8D22E0.5010607 at fhcrc.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

On 10/05/2011 08:04 PM, wang peter wrote:
> IT is too slow to read them in the memory.
> who can tell me if i need split them by other program or
> call some R function to split them

ShortRead::FastqSampler streams the entire file but returns a subset 
(often faster than reading in the whole data) ShortRead::FastqStreamer 
(in development) iterates over the file --

   fq = FastqStreamer(<...>)
   while (length(res <- yield(fq)))
       # work, e.g., filter

A cheap hack is to force R to allocate a large amount of memory and then 
to run the operation

   replicate(10, raw(1e9)) ## that's alot
   dna = readFastq(...)

The 'withIds=FALSE' argument to readFastq can save a lot of time if ids 
are not necessary.

If the records are all 4 lines long it is very easy to split a file 
(untested code; the Linux pros would use awk for efficient processing; 
check out StackOverflow / Biostar)

   fl = file("foo.fastq", "r")
   idx = 0;
   while (isIncomplete(fl)) {
       recs = readLines(fl, n=1000000)
       writeLines(sprintf("fout-%d.fastq", idx), recs)
       idx <- idx + 1
   }

once split on Linux / Mac use library(multicore) or library(parallel) 
(R-2.14 or later) and

   mclapply(seq_len(idx), function(i) {
       fq = readFastq(sprintf("fout-%d.fastq", idx))
       ## work, then...
       TRUE
   })

to process in parallel (it doesn't make sense to try to read them in 
parallel and try to return them back to a 'master').

Martin

>
> thx
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor


-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793



------------------------------

Message: 11
Date: Thu, 06 Oct 2011 08:19:50 +0200
From: mjonczyk <mjonczyk at biol.uw.edu.pl>
To: Debashis Ghosh <ghoshd at psu.edu>
Cc: bioconductor at r-project.org
Subject: Re: [BioC] poe.mcmc - phenotypic label - coding
Message-ID: <1a80ce9d9686c922b1cca5ccaf0dd604 at biol.uw.edu.pl>
Content-Type: text/plain; charset=UTF-8; format=flowed

Dear Dr. Ghosh,

thank you for the clarification.

Best Wishes,
Maciej

---
Maciej Jonczyk,
Department of Plant Molecular Ecophysiology
Institute of Plant Experimental Biology
Faculty of Biology, University of Warsaw
02-096 Warszawa, Miecznikowa 1
Poland

On Wed, 5 Oct 2011 22:22:08 -0400, Debashis Ghosh wrote:
> Dear Dr. Jonczyk,
>
> It's an error in the text.  Hyungwon should be revising the vignette
> in future versions of metaArray.
>
> Debashis
>
> On Tue, Oct 4, 2011 at 8:09 AM, mjonczyk <mjonczyk at biol.uw.edu.pl> 
> wrote:
>> Dear List,
>>
>> I'd like to make sure I understand coding of phenotypic label in
>> poe.mcmc function of metaArray correctly.
>>
>> Vector of my arrays' treatments could be presented as:
>>
>> (ctrl, ctrl, ctrl, trt1, trt1, trt1) so as NN value I used vector 
>> (1,
>> 1, 1, 0, 0, 0).
>>
>> Documentation for either poe.mcmc, poe.em and fit.em say that "1" 
>> means
>> that e=0 (baseline expression),
>> and "0" is for other conditions.
>>
>> Although this three documentation files suggests the same coding,
>> fit.em states that:
>> "
>> cl: A vector of 0s and 1s. Use 1 for normal phenotype and 0 for
>> ? ? ? ? ? non-normal phenotype. Note that
>> *this is the opposite of POE MCMC.*
>> If all samples are of unknown phenotype or of the same
>> ? ? ? ? ? one, give vector of zeros. When class information is
>> ? ? ? ? ? provided, conditional estimation of the mixture is 
>> applied.
>> "
>> So is it simply an error in text or should I code my variants
>> differently?
>>
>> Best Wishes!
>>
>> --
>> Maciej Jonczyk,
>> Department of Plant Molecular Ecophysiology
>> Institute of Plant Experimental Biology
>> Faculty of Biology, University of Warsaw
>> 02-096 Warszawa, Miecznikowa 1
>> Poland
>>
>>
>>
>> --
>> This email was Anti Virus checked by Astaro Security Gateway.
>> http://www.astaro.com
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>



------------------------------

_______________________________________________
Bioconductor mailing list
Bioconductor at r-project.org
https://stat.ethz.ch/mailman/listinfo/bioconductor


End of Bioconductor Digest, Vol 104, Issue 6



More information about the Bioconductor mailing list