[R] A "subscript out of bounds" and "write.table" problem when manipulating a large dataset
Yong Wang
wangyong1 at gmail.com
Mon May 21 22:28:41 CEST 2007
Dear all,
Described below is a problem with a large data set (more than 2 GB after
unzipping, tab-delimited). I know R is not the
appropriate tool for such a task, but
I ran it on a server anyway and hit two straightforward problems.
1. count.fields can count the fields in every row. However, when I
tried to remove rows located beyond about 3/5 of the way through the data, R says
"subscript out of bounds". Is there an option that limits the maximum
size of data R can read in?
2. Because of careless coding, I wrote the original data back out unchanged, and
found that the rewritten tab-delimited file does not match the
original file.
I tried the code on a small data set (attached at the end) and had no
problems at all with it.
I would appreciate any tips and suggestions on how to remove the unwanted
rows from such a large data set.
Finally, thanks to everyone who answered the tab-delimited question I raised yesterday.
### code as follows ###
# `file` holds the path to the large tab-delimited file
data.mm <- read.table(file, header=TRUE, sep="\t", fill=TRUE)  # read in the large file
cf <- count.fields(file, sep="\t")  # count the fields on each line
n <- 23  # the CORRECT number of fields per row, i.e., the number of variable names
del <- which(cf != n)  # lines whose number of fields is not equal to 23
del <- del - 1  # cf also counts the header line, so subtracting 1 gives the
                # data rows I want to remove
data.mm <- data.mm[-del, ]  # try to remove the rows whose field count is not 23
### PROBLEM: R says "subscript out of bounds"
write.table(data.mm, file="mm_0206.txt",
            eol="\n", sep="\t",
            quote=FALSE, row.names=FALSE)  # since data.mm <- data.mm[-del, ] aborted,
                                           # this writes the original data to mm_0206.txt
### PROBLEM: the following two calls should then give the same output
table(cf)  # maximum number of fields is 23
table(count.fields("mm_0206.txt", sep="\t"))  # maximum number of fields is larger
                                              # than 23, and other counts differ too
# For example, the original data has x rows with 10 fields, but the written
# data has y rows with 10 fields.
# If the original data is not rewritten correctly, then a file in which all
# rows have equal length will probably not be written properly either, even
# supposing
#   data.mm <- data.mm[-del, ]
# executes successfully.
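The subsetting step above relies on the indices in del lining up with the rows that read.table actually built. A minimal sketch of how that could be checked on a toy file follows. The file contents and the names toy, rows_match and in_range are made up for illustration, and setting quote="" and comment.char="" in both calls is an assumption about one possible cause of a mismatch (quote or comment characters in the data), not a confirmed diagnosis.

```r
# Illustrative toy file; the contents are made up for this sketch.
toy <- tempfile(fileext = ".txt")
writeLines(c("a\tb\tc",
             "1\tit's\t3",   # apostrophe inside a field
             "4\t5",         # short row: only 2 fields
             "6\t7\t8"), toy)

n <- 3  # expected number of fields in the toy file

# Assumption: switch off quote and comment handling in BOTH calls so that
# they split the file into lines and fields in the same way.
dat <- read.table(toy, header = TRUE, sep = "\t", fill = TRUE,
                  quote = "", comment.char = "")
cf <- count.fields(toy, sep = "\t", quote = "", comment.char = "")

# cf has one entry per line including the header, so these should agree.
rows_match <- nrow(dat) == length(cf) - 1L

del <- which(cf != n) - 1L   # shift by one for the header line
in_range <- all(del >= 1L & del <= nrow(dat))

# Subset only when the indices are consistent and there is something to drop;
# dat[-integer(0), ] would select zero rows rather than all of them.
cleaned <- if (rows_match && in_range && length(del) > 0L) dat[-del, ] else dat
```

If rows_match is FALSE on the real file, the row numbers derived from cf cannot safely be used to index data.mm.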
### experimental data set as follows ###
V1 V2 V3 v4 v5 v6 v7 v8 v9
11 1 desc A 1 34 1-Sep-00 1 first mid last
12 2 desc B 6 56 2-Sep-00 1 First last
13 3 desc A 7 32 3-Sep-00 1 last
14 4 desc 4-Sep-00 0 first mid last
15 5 desc A 2 . 5-Sep-00 1 first mid last
16 6 desc B 9 3 6-Sep-00 0 last
17 7 A 6 65 7-Sep-00 first last
18 8 desc B 2 . 8-Sep-00 0 last
19 9 desc A 8 56 9-Sep-00 1 first last
20 10 desc B 5 89 10-Sep-00 0 first last
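For a file too large to handle comfortably as a data frame, one possible approach is to filter the raw lines in chunks without calling read.table at all. The sketch below is only that, a sketch: filter_by_field_count and its arguments are hypothetical names, and it counts tab characters literally, so it assumes that quotes should not be interpreted and that no field contains an embedded tab or newline.

```r
# Hypothetical helper: copy only the lines that have exactly `n`
# tab-separated fields from `infile` to `outfile`, reading in chunks.
filter_by_field_count <- function(infile, outfile, n, chunk_size = 10000L) {
  con_in <- file(infile, open = "r")
  on.exit(close(con_in), add = TRUE)
  con_out <- file(outfile, open = "w")
  on.exit(close(con_out), add = TRUE)

  kept <- 0L
  repeat {
    lines <- readLines(con_in, n = chunk_size)
    if (length(lines) == 0L) break
    # Number of fields = number of tab characters + 1
    # (quotes are NOT interpreted here, unlike count.fields' defaults).
    n_tabs <- lengths(regmatches(lines, gregexpr("\t", lines, fixed = TRUE)))
    ok <- (n_tabs + 1L) == n
    writeLines(lines[ok], con_out)
    kept <- kept + sum(ok)
  }
  kept  # number of lines written, header included if it has n fields
}
```

With the names used in the code above, a call might look like filter_by_field_count(file, "mm_0206.txt", n = 23), after which the filtered file could be read with read.table as before.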