[R] How to more efficiently read in a big matrix

affy snp affysnp at gmail.com
Sat Nov 10 06:41:09 CET 2007


Yes, I am showing the first 5 columns as an example. Thank you very much
for your suggestion. Let me check it out.

Allen

On Nov 10, 2007 12:39 AM, jim holtman <jholtman at gmail.com> wrote:
> Your data is mixed: numeric and characters/factors.  You can use
> skip=1 to skip the header line, but it looks like the rest is mixed.
> In your example there are only 5 columns; are you just showing the
> first 5 columns?  If the whole file follows the pattern that you show,
> then you would have a scan like:
>
> scan('yourfile', what=list('', 0, '', 0, ''), skip=1)
>
> You can extend the 'what' to the number of columns that you have; e.g.
>
> what=c(rep(c(list(''), list(0)), times=243), list(''))
>
>
>
>
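As a minimal sketch of the 'what' pattern described above, the following reads a tiny stand-in file laid out like the sample rows quoted in this thread. The temporary file, the variable names (n_pairs, col_names, cols, signal) and the two-pair layout are assumptions made for illustration; for the 487-column file n_pairs would be 243.

```r
# Small sample in the same layout as the rows quoted in the thread:
# one ID column, then (signal, call) pairs. Two pairs here, not 243.
sample_lines <- c(
  "probe_set WM_806_Signal_A WM_806_call WM_1716_Signal_A WM_1716_call",
  "SNP_A-1909444 1.59 B 1.48 B",
  "SNP_A-2237149 2.24 B 1.87 B",
  "SNP_A-2118217 2.04 AB 1.70 AB",
  "SNP_A-1866065 1.80 NoCall 1.39 A"
)
tmp <- tempfile(fileext = ".txt")   # hypothetical stand-in for the real file
writeLines(sample_lines, tmp)

n_pairs <- 2                        # assumed: 243 for the full file

# Column types: character ID first, then numeric/character alternating.
what <- c(list(""), rep(list(0, ""), times = n_pairs))

# Read the header separately for the column names, then skip it.
col_names <- scan(tmp, what = "", nlines = 1, quiet = TRUE)
cols <- scan(tmp, what = what, skip = 1, quiet = TRUE)
names(cols) <- col_names

# Collect the numeric signal columns into a matrix keyed by probe ID.
is_signal <- vapply(cols, is.numeric, logical(1))
signal <- do.call(cbind, cols[is_signal])
rownames(signal) <- cols[["probe_set"]]
```

scan() returns one vector per column, so the call columns stay available as character vectors in cols while only the numeric columns go into the matrix.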
> On Nov 10, 2007 12:29 AM, affy snp <affysnp at gmail.com> wrote:
> > Hi Jim,
> >
> > I tried scan() first and got
> >
> > > x <- scan(file="243_47mel_withnormal_expression_log2.txt", what=0)
> > Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
> >  scan() expected 'a real', got 'probe_set'
> >
> > So I guess it requires the file to be all numeric. But I do have row
> > names and a header.
> >
> > The real file looks like (I am listing the header and first 4 rows of the file):
> >
> > probe_set WM_806_Signal_A WM_806_call WM_1716_Signal_A WM_1716_call
> > SNP_A-1909444   1.59  B         1.48    B
> > SNP_A-2237149   2.24  B         1.87    B
> > SNP_A-2118217   2.04  AB       1.70   AB
> > SNP_A-1866065   1.80  NoCall  1.39   A
> >
> > So how can I get rid of the header and row.names to use scan()?
> >
> > Thanks!
> >
> > Allen
> >
> >
> >
> >
> > On Nov 10, 2007 12:18 AM, jim holtman <jholtman at gmail.com> wrote:
> > > Here is an example of reading in a file of 3M numbers (an 11MB text
> > > file) on my laptop:
> > >
> > > > system.time(x <- scan('/tempyy', what=0))
> > > Read 3000000 items
> > >    user  system elapsed
> > >    6.22    0.16    6.53
> > > > str(x)
> > >  num [1:3000000] 1 2 3 4 5 6 7 8 9 10 ...
> > > > gc()
> > >           used (Mb) gc trigger (Mb) max used (Mb)
> > > Ncells  169954  4.6     350000  9.4   350000  9.4
> > > Vcells 3102277 23.7    7803840 59.6  7200206 55.0
> > > > object.size(x)
> > > [1] 24000024
> > >
> > > This took about 7 seconds.  You have about 40X more data, so it should
> > > be interesting to see how it scales up.  The object size is 24MB, so
> > > 40X more is about 1GB.
> > >
> > >
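The scaling estimate above can be checked with a little arithmetic. This is only a rough sketch: it assumes every cell is stored as an 8-byte double (the call columns in the real file are text, so the true figure will differ), and the variable names are made up for illustration.

```r
# Dimensions given in the original question.
n_rows <- 238305
n_cols <- 487
bytes_per_double <- 8               # R stores numeric values as 8-byte doubles

n_values <- n_rows * n_cols

# Size if every cell were held as a double, in GiB (about 0.86).
all_numeric_gib <- n_values * bytes_per_double / 1024^3

# How much larger this is than the 3,000,000-value test vector.
scale_factor <- n_values / 3e6
```

That gives a factor of just under 40 and a bit under 1GB, in line with the estimate in the message.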
> > > On Nov 9, 2007 11:52 PM, affy snp <affysnp at gmail.com> wrote:
> > > > Hi Jim,
> > > >
> > > > Thanks a lot! I am currently running it on my laptop but without any
> > > > success. I could upload it to a server with 8GB of memory,
> > > > and it might be better to go from there.
> > > >
> > > > Actually, I could have the whole file split into two parts,
> > > > one with the 2nd to the 95th column, the other with
> > > > the rest of the columns. However, I need all rows for both
> > > > parts.
> > > >
> > > > The file is in txt format and around 480MB, so it is very large.
> > > > Yes, the values are numeric.
> > > >
> > > > I appreciate it!
> > > >
> > > > Allen
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On Nov 9, 2007 11:46 PM, jim holtman <jholtman at gmail.com> wrote:
> > > > > If they are all numeric, you can use 'scan' to read them in.  With
> > > > > that amount of data, you will need almost 1GB to contain the single
> > > > > object.  If you want to do any processing, you will probably need a
> > > > > machine with at least 3-4GB of physical memory, preferably a 64-bit
> > > > > version of R.  What type of computer are you using?  Do you really
> > > > > need all the data in at once, or can you process it in smaller batches
> > > > > (e.g., 20,000 rows at a time)?  So a little more detail on what you
> > > > > actually want to do with the data would be useful, since it does
> > > > > create a very large object.  BTW how large is the file you are reading
> > > > > and what is its format?  Have you considered a database with this
> > > > > amount of data?
> > > > >
> > > > >
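One way to process the data in smaller batches, as suggested above, is to keep a connection open and call scan() repeatedly with nlines. The sketch below assumes an all-numeric, whitespace-separated file with no header; the temporary file, the batch size of 4 and the running column means are placeholders chosen for illustration, not anything from the thread.

```r
# Small all-numeric stand-in file (hypothetical): 10 rows of 3 columns.
tmp <- tempfile(fileext = ".txt")
write(t(matrix(1:30, nrow = 10, ncol = 3)), tmp, ncolumns = 3)

n_cols <- 3
batch_rows <- 4                     # would be e.g. 20000 for the real file
col_sums <- numeric(n_cols)
rows_seen <- 0

con <- file(tmp, open = "r")
repeat {
  # scan() on an open connection continues where the last call stopped.
  vals <- scan(con, what = 0, nlines = batch_rows, quiet = TRUE)
  if (length(vals) == 0) break      # nothing left to read
  batch <- matrix(vals, ncol = n_cols, byrow = TRUE)
  col_sums <- col_sums + colSums(batch)
  rows_seen <- rows_seen + nrow(batch)
}
close(con)

col_means <- col_sums / rows_seen
```

Only one batch is held in memory at a time, so peak memory depends on batch_rows rather than on the size of the file.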
> > > > > On Nov 9, 2007 11:39 PM, affy snp <affysnp at gmail.com> wrote:
> > > > > > Dear list,
> > > > > >
> > > > > > I need to read in a big table with 487 columns and 238,305 rows (row names
> > > > > > and column names are supplied). Is there a fast way to read in the
> > > > > > table? I tried read.table() but it seems to take forever :(
> > > > > >
> > > > > > Thanks a lot!
> > > > > >
> > > > > > Best,
> > > > > >    Allen
> > > > > >
> > > > > > ______________________________________________
> > > > > > R-help at r-project.org mailing list
> > > > > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > > > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > > > > > and provide commented, minimal, self-contained, reproducible code.
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Jim Holtman
> > > > > Cincinnati, OH
> > > > > +1 513 646 9390
> > > > >
> > > > > What is the problem you are trying to solve?
> > > > >
> > > >
> > >
> > >
> > >
> > >
> >
>
>
>
>


