[BioC] FeatureExpressionSet using list.files() in place of read.xysfiles()

Benilton Carvalho beniltoncarvalho at gmail.com
Thu Jun 20 05:53:20 CEST 2013


Dear Franklin,

I'm not sure I follow your message, my apologies.

Now that you have your XYS files, I was expecting you to simply use:

library(oligo)
rawData = read.xysfiles(filelist, pkgname='pd.pdinfo.gpl11164.ndf.txt')

Doesn't this work for you? (The 'filelist' variable above contains the
names of your converted XYS files.)
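
In case it helps, here is a minimal sketch of how 'filelist' could be
collected with base R. The scratch directory and file names are made up
for illustration, and the read.xysfiles() call is left as a comment
because it needs oligo, the annotation package, and real XYS files.

```r
# Hypothetical scratch directory with empty placeholder files (illustration only)
xys_dir <- file.path(tempdir(), "xys_demo")
dir.create(xys_dir, showWarnings = FALSE)
invisible(file.create(file.path(xys_dir,
                                c("GSM01.txt", "GSM02.txt", "notes_txt.log"))))

# list.files() treats 'pattern' as a regular expression.
# "\\.txt$" keeps only names ending in ".txt"; the looser ".*.txt"
# would also match "notes_txt.log".
filelist <- list.files(xys_dir, pattern = "\\.txt$", full.names = TRUE)
loose    <- list.files(xys_dir, pattern = ".*.txt")

basename(filelist)

# With oligo and the annotation package installed, the import would then be:
# rawData <- oligo::read.xysfiles(filelist, pkgname = "pd.pdinfo.gpl11164.ndf.txt")
```

Using full.names = TRUE keeps the call independent of the current
working directory.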

Let me know what you find.

b

2013/6/19 Johnson, Franklin Theodore <franklin.johnson at email.wsu.edu>:
> Dear Dr. Carvalho,
>
> Thanks for the reply.
> I saw the FAQ thread on how to read in an annotation package made using pdInfoBuilder.
> For anyone having issues, it seems to be as straightforward as:
> #install pdinfo.gpl11164.ndf.txt
> install.packages("pd.pdinfo.gpl11164.ndf.txt", type="source", repos=NULL)
> Installing package into ‘C:/Users/ZHUGRP/Documents/R/win-library/3.0’
> (as ‘lib’ is unspecified)
> * installing *source* package 'pd.pdinfo.gpl11164.ndf.txt' ...
> ** R
> ** data
> ** inst
> ** preparing package for lazy loading
> ** help
> *** installing help indices
> ** building package indices
> ** testing if installed package can be loaded
> *** arch - i386
> *** arch - x64
> * DONE (pd.pdinfo.gpl11164.ndf.txt)
> ############################################################################################################
> I am currently trying to build the ExpressionFeatureSet from my converted PAIR -> XYS .txt files, which unfortunately contain the X/Y/signal columns only.
> NimbleScan expects .tiff files as input, and these were not available from NCBI/GEO. NimbleGen also did not respond to my inquiry about how to obtain XYS files from the available PAIR files. Using R, I'm testing 12 of the 24 tab-delimited XYS files, which also tests the annotation package made using pdInfoBuilder.
> #read in files from the working directory
> filelist=list.files(pattern=".*.txt")
>> filelist
>  [1] "GSM01.txt" "GSM02.txt" "GSM03.txt" "GSM04.txt" "GSM05.txt" "GSM06.txt" "GSM07.txt" "GSM08.txt" "GSM09.txt" "GSM10.txt" "GSM11.txt" "GSM12.txt"
> #read in each data file in filelist as a matrix to make EFS object
>> datalist=lapply(filelist, function(x)as.matrix(read.table(x, header=T, sep="\t", as.is=T)))
> #construct phenoData frame
>> theData=data.frame(Key=rep(c("Week0","Week-2","Week-4"), each=4))
>> rownames(theData)=basename(filelist)
>> pd=new("AnnotatedDataFrame", data=theData)
> ....
> However, the ExpressionFeatureSet (EFS) construction fails:
> hardline=new("ExpressionFeatureSet", datalist, phenoData=pd, annotation=library(pd.pdinfo.gpl11164.ndf.txt))
> Error in .names_found_unique(names(value), names(object)) :
>   'sampleNames' replacement list must have unique named elements corresponding to assayData element names
> To confirm,
>> sampleNames(datalist)
> [1] "X"  "Y"  "PM"
> So it seems the EFS expects a unique sample name for each file in filelist?
> How can I read multiple files into one EFS object, as is done with read.xysfiles? Is this doable?
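
One possible reading of the error: each element of datalist is a
three-column matrix (X, Y, PM) for a single array, whereas an
ExpressionSet-style object stores one matrix with a row per feature and
a column per sample. Below is a minimal base-R sketch of that reshaping.
The two small matrices are made up, and the sketch assumes every file
lists the features in the same order. It is not a substitute for
read.xysfiles(), which is still the simpler route.

```r
# Two made-up arrays standing in for the per-file matrices in 'datalist'
datalist <- list(
  cbind(X = c(1, 2, 3), Y = c(1, 1, 1), PM = c(110, 250, 90)),
  cbind(X = c(1, 2, 3), Y = c(1, 1, 1), PM = c(130, 240, 75))
)
names(datalist) <- c("GSM01.txt", "GSM02.txt")

# Assumption: all files list the same features in the same order
ref_xy <- datalist[[1]][, c("X", "Y")]
same_layout <- all(vapply(
  datalist,
  function(m) identical(m[, c("X", "Y")], ref_xy),
  logical(1)
))
stopifnot(same_layout)

# One column of PM intensities per sample; the column names are taken
# from the names of 'datalist', so they are unique per file
n_features <- nrow(datalist[[1]])
exprs_mat <- vapply(datalist, function(m) m[, "PM"], numeric(n_features))

dim(exprs_mat)
colnames(exprs_mat)
```

Note also that the 'annotation' argument is normally a character string
naming the annotation package, not the result of a library() call.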
>
> Is it necessary to execute datalist=lapply(filelist, function(x)as.matrix(read.table(x, header=T, sep="\t", as.is=T))) surrounded with parentheses to make the object TRUE, per se?
> i.e. (datalist=lapply(filelist, function(x)as.matrix(read.table(x, header=T, sep="\t", as.is=T))) )
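
On the parentheses: as far as I can tell, wrapping an assignment in ( )
only makes R print the assigned value; the object itself is the same
either way. A quick check with base R's withVisible(), using throwaway
vectors:

```r
# Plain assignment: the value is returned invisibly, so nothing is printed
plain <- withVisible(x <- c(1, 2, 3))

# Parenthesised assignment: same value, but marked visible (auto-printed)
wrapped <- withVisible((y <- c(1, 2, 3)))

plain$visible
wrapped$visible

# The stored objects do not differ
identical(x, y)
```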
> Best Regards,
> Franklin
>
> Great minds discuss ideas. Average minds discuss events. Small minds discuss people. -Eleanor Roosevelt
>
>
>
>
> ________________________________________
> From: Benilton Carvalho [beniltoncarvalho at gmail.com]
> Sent: Thursday, June 13, 2013 4:43 PM
> To: Johnson, Franklin Theodore
> Cc: bioconductor at r-project.org
> Subject: Re: [BioC] PAIR files -- feature set table
>
> Don't worry about that particular warning. Just install the package
> and try to read your XYS files.
>
> 2013/6/13 Johnson, Franklin Theodore <franklin.johnson at email.wsu.edu>:
>> Dr. Carvalho,
>>
>> Yes. I see what you mean.
>> Switching the columns helped: loading the featureSet table now inserted more
>> than 2 rows:
>>
>> Inserting 198661 rows into table featureSet... OK
>> However, the warning message did print again.
>>
>>
>> Warning message:
>> In is.na(ndfdata[["SIGNAL"]]) :
>>   is.na() applied to non-(list or vector) of type 'NULL'
>>
>> Below is the output + sessionInfo(), as I upgraded to R 3.0.1.
>>
>> #Begin R command line code:
>>
>>> makePdInfoPackage(arrays, destDir = getwd(), unlink=TRUE)
>> ==============================================================================================================================================================
>>
>>
>> Building annotation package for Nimblegen Expression Array
>> NDF: pdinfo_GPL11164.ndf.txt <-new .ndf file with PROBE_ID<->SEQ_ID
>> XYS: XYS.txt
>> ==============================================================================================================================================================
>> Parsing file: pdinfo_GPL11164.ndf.txt... OK
>>
>> Parsing file: XYS.txt... OK
>> Merging NDF and XYS files... OK
>> Preparing contents for featureSet table... OK
>> Preparing contents for bgfeature table... OK
>> Preparing contents for pmfeature table... OK
>> Creating package in E:/RANDOM/Test/Yanmin's Microarray Paper/Yanmin
>> Microarray RAW/pd.pdinfo.gpl11164.ndf.txt
>> Inserting 198661 rows into table featureSet... OK
>> Inserting 770599 rows into table pmfeature... OK
>>
>> Counting rows in featureSet
>> Counting rows in pmfeature
>> Creating index idx_pmfsetid on pmfeature... OK
>> Creating index idx_pmfid on pmfeature... OK
>> Creating index idx_fsfsetid on featureSet... OK
>> Saving DataFrame object for PM.
>> Done.
>> Warning message:
>> In is.na(ndfdata[["SIGNAL"]]) :
>>   is.na() applied to non-(list or vector) of type 'NULL'
>>
>>
>>> sessionInfo()
>> R version 3.0.1 (2013-05-16)
>> Platform: i386-w64-mingw32/i386 (32-bit)
>>
>> locale:
>> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United
>> States.1252    LC_MONETARY=English_United States.1252
>> [4] LC_NUMERIC=C                           LC_TIME=English_United
>> States.1252
>>
>> attached base packages:
>> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>> base
>>
>> other attached packages:
>> [1] pdInfoBuilder_1.24.0 oligo_1.24.0         oligoClasses_1.22.0
>> affxparser_1.32.1    RSQLite_0.11.4       DBI_0.2-7
>> Biobase_2.20.0
>> [8] BiocGenerics_0.6.0   BiocInstaller_1.10.2
>>
>> loaded via a namespace (and not attached):
>>  [1] affyio_1.28.0         Biostrings_2.28.0     bit_1.1-10
>> codetools_0.2-8       ff_2.2-11             foreach_1.4.1
>> GenomicRanges_1.12.4
>>  [8] IRanges_1.18.1        iterators_1.0.6       preprocessCore_1.22.0
>> splines_3.0.1         stats4_3.0.1          tools_3.0.1
>> zlibbioc_1.6.0
>>
>>
>>
>>>q()
>>
>>
>>
>> The pdInfo package built in destDir is identical to the one in my previous message.
>>
>> However the featureSet table now has more than 2 rows...
>>
>> Lastly, I tried multiple column combinations, as my merged file has (X.x, Y.x),
>> which seem to be identifiers for the probe IDs on the array, as well as
>> (X.y, Y.y), which seem to be the identifiers for the "SEQ_ID". I used X.x, Y.x
>> and PM, which gave the result I pasted above. All others had errors. I'm close,
>> but that warning message is annoying...
>>
>>
>>
>> Regards,
>>
>> Franklin
>>
>>
>>
>>
>>
>>
>> ________________________________________
>> From: Benilton Carvalho [beniltoncarvalho at gmail.com]
>> Sent: Wednesday, June 12, 2013 8:25 PM
>>
>> To: Johnson, Franklin Theodore
>> Cc: bioconductor at r-project.org
>> Subject: Re: [BioC] PAIR files -- feature set table
>>
>> That does not look ok.
>>
>> The problem is the count for the featureSet table... This table stores
>> the information for "genes" (or whatever the target for this
>> particular array is)... so, it is unlikely that you have a microarray
>> with only 2 "target units"... I'd expect something around the
>> thousands...
>>
>> pdInfoBuilder uses the information in SEQ_ID (in the NDF) to get the
>> target information (i.e., the contents for featureSet).
>>
>> Given that this is a custom array, I believe that the best idea is to
>> contact the person who designed it and ask more details about the
>> design (in particular, how many probesets and average number of probes
>> per probeset)...
>>
>> I've seen some designs in which the information that was expected to
>> be in SEQ_ID was actually stored in PROBE_ID (in such cases, the user
>> needs to create a backup copy of the NDF, and then move the contents
>> of PROBE_ID to SEQ_ID - and vice-versa).
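
The column swap described above could be sketched in base R as follows.
The toy data frame stands in for an NDF read with read.delim(), and the
file names in the comments are hypothetical.

```r
# Toy stand-in for: ndf <- read.delim("design.ndf", stringsAsFactors = FALSE)
ndf <- data.frame(
  PROBE_ID = c("Contig1_f_28", "Contig1_f_90", "Contig2_r_11"),
  SEQ_ID   = c("BLOCK1", "BLOCK1", "BLOCK1"),
  X        = c(343, 345, 347),
  Y        = c(9, 9, 9),
  stringsAsFactors = FALSE
)

# Keep an untouched copy before editing
# (on disk this would be: file.copy("design.ndf", "design.ndf.bak"))
ndf_backup <- ndf

# Exchange the contents of the two columns; the column names stay in place
ndf[c("PROBE_ID", "SEQ_ID")] <- ndf[c("SEQ_ID", "PROBE_ID")]

# Number of distinct targets the swapped table would now expose via SEQ_ID
length(unique(ndf$SEQ_ID))

# write.table(ndf, "design_swapped.ndf", sep = "\t", quote = FALSE, row.names = FALSE)
```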
>>
>> b
>>
>> 2013/6/12 Johnson, Franklin Theodore <franklin.johnson at email.wsu.edu>:
>>> Dear Dr. Carvalho,
>>>
>>> Recently, we had correspondence regarding makePdInfoPackage for a NimbleGen
>>> apple microarray.
>>> I was able to merge the ndf design and XYS files using PROBE_ID.
>>> As a reminder this is a custom array, and there are no SIGNAL==NAs for
>>> control probes.
>>> It seemed to work:
>>>> makePdInfoPackage(seed, destDir(""))
>>>
>>> ============================================================================================================================================================
>>> Building annotation package for Nimblegen Expression Array
>>> NDF: GPL11164.ndf
>>> XYS: XYS.txt
>>>
>>> ============================================================================================================================================================
>>> Parsing file: GPL11164.ndf... OK
>>> Parsing file: XYS.txt... OK
>>> Merging NDF and XYS files... OK
>>> Preparing contents for featureSet table... OK
>>> Preparing contents for bgfeature table... OK
>>> Preparing contents for pmfeature table... OK
>>> Creating package in
>>> C:/Users/franklin.johnson.PW50-WEN/Desktop/Test/Yanmin's Microarray
>>> Paper/Yanmin Microarray RAW/pd.gpl11164
>>> Inserting 2 rows into table featureSet... OK
>>> Inserting 765524 rows into table pmfeature... OK
>>> Inserting 5075 rows into table bgfeature... OK
>>> Counting rows in bgfeature
>>> Counting rows in featureSet
>>> Counting rows in pmfeature
>>> Creating index idx_bgfsetid on bgfeature... OK
>>> Creating index idx_bgfid on bgfeature... OK
>>> Creating index idx_pmfsetid on pmfeature... OK
>>> Creating index idx_pmfid on pmfeature... OK
>>> Creating index idx_fsfsetid on featureSet... OK
>>> Saving DataFrame object for PM.
>>> Saving DataFrame object for BG.
>>> Done.
>>> Warning message:
>>> In is.na(ndfdata[["SIGNAL"]]) :
>>> is.na() applied to non-(list or vector) of type 'NULL'
>>>>
>>>
>>> Despite this warning message, I see a pdInfo package directory with
>>> 4 subdirectories, c("data", "inst", "man", "R"), as well as
>>> sub-subdirectories in "inst", c("extdata", "Unit Tests"), in addition to
>>> two text files in the main directory, c("DESCRIPTION", "NAMESPACE"), all
>>> created in my destination folder.
>>> Before using "oligo", I wanted to confirm with you, if possible, that this
>>> package is viable to use with "oligo" even though a warning message that may
>>> not pertain to my custom-designed microarray was printed.
>>>
>>> Regards,
>>> Franklin
>>>
>>>
>>>
>>>
>>>
>>> ________________________________________
>>> From: Johnson, Franklin Theodore
>>> Sent: Friday, June 07, 2013 10:39 AM
>>> To: Benilton Carvalho
>>> Cc: bioconductor at r-project.org
>>> Subject: RE: [BioC] PAIR files -- feature set table
>>>
>>> Resending to bioconductor message thread:
>>>
>>> Dear Dr. Carvalho,
>>> Thanks for the response.
>>> As you suggested, I will look into the merge function using "Probe_ID".
>>> After reading in the data, I will start here: merge.datasets(dataset1,
>>> dataset2, by="key").
>>> Best Regards,
>>> Franklin
>>>
>>>
>>> ________________________________________
>>> From: Benilton Carvalho [beniltoncarvalho at gmail.com]
>>> Sent: Thursday, June 06, 2013 8:11 PM
>>> To: Johnson, Franklin Theodore
>>> Cc: bioconductor at r-project.org; franklin.johnson at wsu.edu
>>> Subject: Re: [BioC] PAIR files -- feature set table
>>>
>>> You will need to merge the PAIR and the NDF using the PROBE_ID column
>>> as key. This will allow you to get the X/Y coordinates needed to
>>> create the XYS as described on the other messages.
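
A minimal sketch of that merge in base R, under stated assumptions: the
two toy data frames stand in for the NDF and one PAIR file read with
read.delim(), the PAIR signal column is assumed to be called PM, and the
output layout (X, Y, SIGNAL, COUNT) is assumed from the usual XYS column
names. Any header line the real XYS format needs is not handled here.

```r
# Toy stand-ins; real tables would come from read.delim() on the NDF and PAIR files
ndf <- data.frame(
  PROBE_ID = c("probeA", "probeB", "probeC"),
  X        = c(343, 345, 347),
  Y        = c(9, 9, 9),
  stringsAsFactors = FALSE
)
pair <- data.frame(
  PROBE_ID = c("probeC", "probeA", "probeB"),
  PM       = c(812.5, 1034.0, 266.2),
  stringsAsFactors = FALSE
)

# Join on PROBE_ID so every signal picks up its X/Y coordinates
merged <- merge(ndf, pair, by = "PROBE_ID")

# Assumed XYS-style layout; COUNT is set to 1 as a placeholder
xys <- data.frame(X = merged$X, Y = merged$Y, SIGNAL = merged$PM, COUNT = 1)
xys <- xys[order(xys$Y, xys$X), ]

xys
```

If both inputs already carry X and Y columns, merge() will suffix them
as X.x/Y.x and X.y/Y.y, so it is worth selecting the coordinate columns
explicitly.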
>>>
>>> Regarding annotation, you may need to contact NimbleGen to request
>>> this information directly from them...
>>>
>>> benilton
>>>
>>> 2013/6/6 Johnson, Franklin Theodore <franklin.johnson at email.wsu.edu>:
>>>> Dear Dr. Carvalho,
>>>>
>>>> Many thanks for the reply.
>>>>
>>>> Actually, this is what my .ndf file looks like:
>>>>> head(ndf)
>>>>   PROBE_DESIGN_ID   CONTAINER DESIGN_NOTE SELECTION_CRITERIA SEQ_ID
>>>> 1  7552_0343_0009 Duplicate_1
>>>> 2  7552_0345_0009 Duplicate_2
>>>> 3  7552_0347_0009 Duplicate_1
>>>> 4  7552_0349_0009 Duplicate_2
>>>> 5  7552_0351_0009 Duplicate_2
>>>> 6  7552_0353_0009 Duplicate_1
>>>>                                                PROBE_SEQUENCE MISMATCH
>>>> MATCH_INDEX FEATURE_ID ROW_NUM COL_NUM PROBE_CLASS
>>>> 1  cttgactcttctaagttcaaaggtaactcaagtgaagctgtcagatatgatccttcca        0
>>>> 64535488   64535488       9     343
>>>> 2 cccaagcattaaaccttactcatatacttataatgcagccatcaagagtttgtgcaagg        0
>>>> 64799310   64799310       9     345
>>>> 3          agggaggctgaaagagagagtgaatggtccagctgggcataattgctgca        0
>>>> 64476989   64476989       9     347
>>>> 4          ttgttggtgggggtgttgcccttagtaccccagaccttgaagcagttaaa        0
>>>> 64862794   64862794       9     349
>>>> 5          gtgtggggccccctttctttaactggaacctttctttgaagcaatttggg        0
>>>> 64832726   64832726       9     351
>>>> 6          ttgtccaattccaacatgccgagacggcagggattgtgatcgtgttgttc        0
>>>> 64435686   64435686       9     353
>>>>                       PROBE_ID POSITION DESIGN_ID   X Y
>>>> 1    Contig19819_1_f_28_10_535        0      7552 343 9
>>>> 2 Malus_CN899188_2_f_147_1_755        0      7552 345 9
>>>> 3  Contig20738_8_r_1179_2_1432        0      7552 347 9
>>>> 4 Malus_CN880097_2_r_336_2_536        0      7552 349 9
>>>> 5 Malus_CN918117_2_f_632_1_781        0      7552 351 9
>>>> 6     Contig1991_1_f_71_2_1239        0      7552 353 9
>>>>
>>>> The pair files (.532 pair files only, one-color arrays) contain only the
>>>> probe ID and signal, after some text at the top describing the experiment.
>>>> My real issue is that I can further normalize and analyze the RMA files with
>>>> sva and limma, etc. However, I cannot annotate the probes without the array
>>>> annotation, as there are duplicates in the ndf file which are removed in the
>>>> RMA.pair files available on NCBI/GEO. So they do not match in any
>>>> annotation package I have tried to build.
>>>> So, I tried to go back and start from the raw pair files... this custom
>>>> array is really a "custom" array without
>>>> NimbleScan.
>>>>
>>>> Cheers,
>>>> Franklin
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> ________________________________________
>>>> From: Benilton Carvalho [beniltoncarvalho at gmail.com]
>>>> Sent: Wednesday, June 05, 2013 6:42 PM
>>>> To: FRANKLIN JOHNSON [guest]
>>>> Cc: bioconductor at r-project.org; franklin.johnson at wsu.edu; pdInfoBuilder
>>>> Maintainer
>>>> Subject: Re: [BioC] PAIR files -- feature set table
>>>>
>>>> It's an unfortunate mistake to have the pairFile *argument* in the
>>>> call (not in the slots section, but I see your point). :-( I'll make
>>>> sure that this is fixed.
>>>>
>>>> You need to convert the PAIR files to XYS...
>>>>
>>>> Some refs that should help you in the process:
>>>>
>>>> https://stat.ethz.ch/pipermail/bioconductor/2012-January/043186.html
>>>>
>>>> http://comments.gmane.org/gmane.science.biology.informatics.conductor/27547
>>>>
>>>> b
>>>>
>>>> 2013/6/5 FRANKLIN JOHNSON [guest] <guest at bioconductor.org>:
>>>>>
>>>>> Dear Maintainer,
>>>>>
>>>>> I downloaded available NimbleGen 'single channel' 532.PAIR files for a
>>>>> custom built expression microarray from NCBI/GEO (GPL11164). However, I get
>>>>> an error message when I try to make the annotation for this platform using
>>>>> pdInfoBuilder.
>>>>>
>>>>> In the pdInfoBuilder Reference Manual (June 5, 2013), under the
>>>>> NgsExpressionPDInfoPkgSeed method, there is a slot for pairFile, although
>>>>> showClasses("Ngs..") does not show a slot for this, only for XYS. Thus, I
>>>>> changed the .pair file extension to .xys.
>>>>>
>>>>> (ndf<- list.files(getwd(), pattern=".ndf", full.names=TRUE)) # read
>>>>> annotation file
>>>>> [1] "C:/Users/franklin.johnson.PW99-WEN/Desktop/Test/Yanmin's Microarray
>>>>> Paper/Yanmin Microarray RAW/GPL11164.ndf"
>>>>>
>>>>> (xys <- list.files(getwd(), pattern = ".xys", full.names = TRUE)[1])
>>>>> [1] "C:/Users/franklin.johnson.PW99-WEN/Desktop/Test/Yanmin's Microarray
>>>>> Paper/Yanmin Microarray RAW/GSM618107_14418002_532.xys"
>>>>>
>>>>> But, doing this resulted in an error message:
>>>>> seed <- new("NgsExpressionPDInfoPkgSeed", ndfFile = ndf, xysFile = xys,
>>>>> author = "FJ", organism = "Apple", species = "Malus x Domestica cv.GD")
>>>>>
>>>>> makePdInfoPackage(arrays, destDir = getwd())
>>>>>
>>>>> ============================================================================================================================================
>>>>> Building annotation package for Nimblegen Expression Array
>>>>> NDF: GPL11164.ndf
>>>>> XYS: GSM618107_14418002_532.xys
>>>>>
>>>>> ============================================================================================================================================
>>>>> Parsing file: GPL11164.ndf... OK
>>>>> Parsing file: GSM618107_14418002_532.xys... OK
>>>>> Merging NDF and XYS files... OK
>>>>> Preparing contents for featureSet table... Error in
>>>>> `[.data.frame`(ndfdata, , colsFS) : undefined columns selected
>>>>> In addition: Warning message:
>>>>> In is.na(ndfdata[["SIGNAL"]]) :
>>>>>   is.na() applied to non-(list or vector) of type 'NULL'
>>>>>
>>>>> The only files available from NCBI/GEO are 24 PAIR files and 1 ndf. It
>>>>> seems .xys has a different arrangement than .pair, thus .ndf is not
>>>>> applicable to annotate the .pair file? Any suggestions?
>>>>> Hope to hear from you soon.
>>>>> Franklin
>>>>>
>>>>>  -- output of sessionInfo():
>>>>>
>>>>>> sessionInfo()
>>>>> R version 3.0.1 (2013-05-16)
>>>>> Platform: x86_64-w64-mingw32/x64 (64-bit)
>>>>>
>>>>> locale:
>>>>> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United
>>>>> States.1252    LC_MONETARY=English_United States.1252
>>>>> [4] LC_NUMERIC=C                           LC_TIME=English_United
>>>>> States.1252
>>>>>
>>>>> attached base packages:
>>>>>  [1] tcltk     grid      parallel  stats     graphics  grDevices utils
>>>>> datasets  methods   base
>>>>>
>>>>> other attached packages:
>>>>>  [1] pdInfoBuilder_1.24.0 oligo_1.24.0         oligoClasses_1.22.0
>>>>> affxparser_1.32.1    RSQLite_0.11.4       DBI_0.2-7
>>>>>  [7] Mfuzz_2.18.0         DynDoc_1.38.0        widgetTools_1.38.0
>>>>> e1071_1.6-1          class_7.3-7          gplots_2.11.0.1
>>>>> [13] KernSmooth_2.23-10   caTools_1.14         gdata_2.12.0.2
>>>>> gtools_2.7.1         timecourse_1.32.0    MASS_7.3-26
>>>>> [19] Biobase_2.20.0       BiocGenerics_0.6.0   limma_3.16.5
>>>>> ggplot2_0.9.3.1      BiocInstaller_1.10.1
>>>>>
>>>>> loaded via a namespace (and not attached):
>>>>>  [1] affyio_1.28.0         Biostrings_2.28.0     bit_1.1-10
>>>>> bitops_1.0-5          codetools_0.2-8       colorspace_1.2-2
>>>>>  [7] dichromat_2.0-0       digest_0.6.3          ff_2.2-11
>>>>> foreach_1.4.0         GenomicRanges_1.12.4  gtable_0.1.2
>>>>> [13] IRanges_1.18.1        iterators_1.0.6       labeling_0.1
>>>>> marray_1.38.0         munsell_0.4           plyr_1.8
>>>>> [19] preprocessCore_1.22.0 proto_0.3-10          RColorBrewer_1.0-5
>>>>> reshape2_1.2.2        scales_0.2.3          splines_3.0.1
>>>>> [25] stats4_3.0.1          stringr_0.6.2         tkWidgets_1.38.0
>>>>> tools_3.0.1           zlibbioc_1.6.0
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Sent via the guest posting facility at bioconductor.org.
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>>
>>>


