[R] Variable length datafile import problem

Peter Ehlers ehlers at ucalgary.ca
Sat Feb 19 05:00:18 CET 2011


Ingo,

The awk solution may still be your best bet,
but here's an R way to do it. The idea is to
add a copy of the longest row (the one with the
most fields) at the top of the file, so that R
knows how many fields you need.
(read.table and friends check the first 5 lines
to determine how many columns are needed.)

## check how many fields in each row (tab-separated, as in your file)
cf <- count.fields("test.dat", sep = "\t")

## which row has most fields?
id <- which.max(cf)
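## (which.max returns the index of the first row with the
##  maximum number of fields; any maximal row would do)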

## read file as character string rows
dL <- readLines("test.dat")

## put copy of 'longest' row on top and write back
dL <- c(dL[id], dL)
writeLines(dL, "test1.dat")

## read as dataframe
d <- read.delim("test1.dat", header=FALSE)

## remove top row
d <- d[-1,]
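
If you'd rather not write a temporary copy of the file, a
one-step variant is to tell read.delim up front how many
columns to expect via col.names. This is only a sketch,
assuming the file really is tab-separated (as in your
read.table call) and that generic V1..Vn column names are
acceptable:

## alternative (no temporary file): supply enough column names
cf <- count.fields("test.dat", sep = "\t")
d  <- read.delim("test.dat", header = FALSE, fill = TRUE,
                 col.names = paste("V", seq_len(max(cf)), sep = ""))

Because read.table and friends take the column count from the
length of col.names when it exceeds what the first 5 lines
suggest, the padding row is no longer needed.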


Peter Ehlers


On 2011-02-18 00:16, Ingo Reinhold wrote:
> Hi John,
>
> It seems there is no easy way. I'll just precondition it with AWK as described here: http://www.mail-archive.com/r-help@stat.math.ethz.ch/msg53401.html
>
> There are some remarks in the thread that R is not supposed to read very large files, for "political" reasons. Maybe that's it.
>
> Many thanks again for the effort.
>
> Ingo
> ________________________________________
> From: John Kane [jrkrideau at yahoo.ca]
> Sent: Thursday, February 17, 2011 11:54 AM
> To: Ingo Reinhold
> Subject: RE: [R] Variable length datafile import problem
>
> Generally most of the gurus are on this list.  Hopefully someone will take an interest in the problem.
>
> I suspect that there may be some kind of weird value in the file that is upsetting the import.  Given the results I got when I removed the data past BD and then past AL, it seems that the problem might be within this range.
>
> You could try removing half the data between those columns and see what happens, then repeat if something turns up. It's tedious, but unless someone with a better grasp of variable-length data import can help, it's the best I can suggest.
>
> BTW, you only replied to me.  You should make sure to cc the list; otherwise readers won't realise that I am being of no help.
>
> If you still have the problem by Saturday, e-mail me or post to the list and I'll try to spend some more time messing about with the problem.
>
> Sorry to be of so little help.
> --- On Thu, 2/17/11, Ingo Reinhold<ingor at kth.se>  wrote:
>
>> From: Ingo Reinhold<ingor at kth.se>
>> Subject: RE: [R] Variable length datafile import problem
>> To: "John Kane"<jrkrideau at yahoo.ca>
>> Received: Thursday, February 17, 2011, 5:36 AM
>> Hi John,
>>
>> As it seems we're hitting the wall here, can you maybe
>> recommend another mailing list with "gurus" (as you put it)
>> that may be able to help?
>>
>> Regards,
>>
>> Ingo
>> ________________________________________
>> From: John Kane [jrkrideau at yahoo.ca]
>> Sent: Thursday, February 17, 2011 11:25 AM
>> To: Ingo Reinhold
>> Subject: RE: [R] Variable length datafile import problem
>>
>> Hi Ingo,
>>
>> I've had a bit of time to examine the file and I must say
>> that, at the moment, I have no idea what is going on.
>> I tried the old cut-the-file-into-pieces trick and just came
>> up with even more anomalous results.
>>
>> My first attempt removed all the data past column AL in an
>> OOo Calc spreadsheet.  This created a rectangular
>> dataset. It imported into R with no problem, with 38 columns
>> as expected.
>>
>> Then I deleted all the data from the original data file
>> (test.dat) removing all the data past column BD in an OOo
>> Calc spreadsheet.
>>
>> This imported a file with only 38 columns.
>>
>> Something very funny is happening but at the moment I have
>> no idea.
>>
>> --- On Wed, 2/16/11, Ingo Reinhold<ingor at kth.se>
>> wrote:
>>
>>> From: Ingo Reinhold<ingor at kth.se>
>>> Subject: RE: [R] Variable length datafile import problem
>>> To: "John Kane"<jrkrideau at yahoo.ca>
>>> Received: Wednesday, February 16, 2011, 1:59 AM
>>> Hi John,
>>>
>>> V1 should be just a character. However, I figured something
>>> out myself. The import looks OK in terms of columns when
>>> adding the flush=TRUE option.
>>>
>>> I am still very confused about the dimensions that the
>>> imported data shows. Loading my data file into something
>>> like an OOo spreadsheet shows me a maximum of about 245,
>>> which does not correspond to the 146 generated by R. Any
>>> idea where this saturation comes from?
>>>
>>> Thanks,
>>>
>>> Ingo
>>> ________________________________________
>>> From: John Kane [jrkrideau at yahoo.ca]
>>> Sent: Wednesday, February 16, 2011 1:57 AM
>>> To: Ingo Reinhold
>>> Subject: RE: [R] Variable length datafile import problem
>>>
>>> Is rawData$V1 intended to be factor or character?
>>>
>>> str(rawData) gives
>>> $ V1  : Factor w/ 54 levels "-232.0","-234.0",..: 41
>>> 41 41 41 41 41 41 41 41 41 ...
>>>
>>> If you were not expecting a factor you might try
>>> options(stringsAsFactors = FALSE) before importing the data.
>>>
>>> --- On Tue, 2/15/11, Ingo Reinhold<ingor at kth.se>
>>> wrote:
>>>
>>>> From: Ingo Reinhold<ingor at kth.se>
>>>> Subject: RE: [R] Variable length datafile import problem
>>>> To: "John Kane"<jrkrideau at yahoo.ca>
>>>> Received: Tuesday, February 15, 2011, 3:35 PM
>>>> Dear all,
>>>>
>>>> I have changed the file-ending with no change in the
>>>> result. I don't think that this should matter.
>>>>
>>>> http://dl.dropbox.com/u/2414056/Test.dat
>>>> is a test file which represents the structure I am
>>>> trying to read. So far I have used
>>>>
>>>> rawData=read.table("Test.txt", fill=TRUE, sep="\t", header=FALSE);
>>>>
>>>> When then looking at rawData$V1 this gives me a distorted
>>>> view of my original first column.
>>>>
>>>> Thanks,
>>>>
>>>> Ingo
>>>
>>>
>>>
>>
>>
>>
>>
>
>
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.


