[R] Read file

Mon Oct 4 04:55:41 CEST 2010

Hi Nilza,

Just to add to David's comments, if you are reading in your file with
read.table(..., fill=TRUE), and assuming that you haven't yet replaced
-9999 with NA, you don't need grep. You can just use the number of NAs
in each line to locate data blocks.

Date records have 3 NAs
Location records have 2 NAs
Data records have none.

my.data2<-read.table("d2010100100.txt",fill=TRUE,nrows=20000)
na.count <- apply( my.data2, 1, function(x) sum( is.na(x) ) )
date.recs <- which( na.count == 3)
num.stns <- length(date.recs)
stn.data.length <- c(diff(date.recs) - 2, nrow(my.data2) -
date.recs[num.stns] - 1)
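
As a rough sketch of how those counts could then be used to split the rows
into per-station blocks: the sample lines below are copied from Nilza's
message with the asterisks dropped, and the names sample.lines and blocks
are made up for illustration. It assumes every block is a date line, then a
location line, then data lines, and it recodes -9999 as NA only after the
NA counts have been taken.

```r
# Sample in the same layout as the posted file (asterisks omitted).
sample.lines <- c(
  "2010 10 01 00",
  "82599  -35.25  -5.91     52   1",
  "1008.0  -9999    115     3.1   298.6   294.6 64",
  "2010 10 01 00",
  "83649  -40.28 -20.26      4  7",
  "1011.0  -9999      0     0.0   298.4   296.1 64",
  "1000.0     96     40     5.7   297.9   295.1 32",
  " 925.0    782    325     3.1   295.4   294.1 32",
  " 850.0   1520    270     4.1   293.8   289.4 32",
  " 700.0   3171    240     8.7   284.1   279.1 32",
  " 500.0   5890    275     8.2   266.2   262.9 32",
  " 400.0   7600    335     9.8   255.4   242.4 32"
)

# fill=TRUE pads the short date and location lines with NA.
my.data2 <- read.table(text = sample.lines, fill = TRUE)

# Same idea as the apply() call above, written with rowSums().
na.count <- unname(rowSums(is.na(my.data2)))
date.recs <- which(na.count == 3)
num.stns <- length(date.recs)
stn.data.length <- c(diff(date.recs) - 2,
                     nrow(my.data2) - date.recs[num.stns] - 1)

# One list element per station: date string, station id, data rows.
blocks <- lapply(seq_len(num.stns), function(k) {
  d <- date.recs[k]
  ymdh <- as.integer(unlist(my.data2[d, 1:4]))
  lev <- my.data2[d + 1 + seq_len(stn.data.length[k]), ]
  lev[lev == -9999] <- NA   # safe now that the record types are known
  list(date    = sprintf("%04d%02d%02d%02d",
                         ymdh[1], ymdh[2], ymdh[3], ymdh[4]),
       station = my.data2$V1[d + 1],
       levels  = lev)
})
```

Each element of blocks then holds everything needed for one station, so
building the database rows becomes a loop over blocks.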

Michael

On 4 October 2010 13:05, David Winsemius <dwinsemius at comcast.net> wrote:
>
> On Oct 3, 2010, at 9:40 PM, Nilza BARROS wrote:
>
>> Hi, Michael
>> Thank you for your help. I have already done what you said.
>> But I am still having problems dealing with my data.
>>
>> I need to split the data according to station.
>>
>> I was able to identify where the station information starts using:
>>
>> my.data<-file("d2010100100.txt",open="rt")
>> indata <- readLines(my.data, n=20000)
>> i<-grep("^[837]",indata)  #station number
>
> That would give you the line numbers for any line that had an 8, _or_ a 3,
> _or_ a 7 as its first character. Was that your intent? My guess is that you
> did not really want the square brackets and should have been using "^837".
>
> ?regex  # Paragraph starting "A character class .... "
>
>> my.data2<-read.table("d2010100100.txt",fill=TRUE,nrows=20000)
>> stn<- my.data2$V1[i]
>
> That would give you the first column values for the lines you earlier
> selected.
>
>
>> ====
>
> This does not look like what I would expect as a value for stn. Is that what
> you wanted us to think this was?
>
> --
> David.
>
>
>> 2010 10 01 00
>> *82599  -35.25  -5.91     52   1
>> * 1008.0  -9999    115     3.1   298.6   294.6 64
>> 2010 10 01 00
>> *83649  -40.28 -20.26      4  7*
>> 1011.0  -9999      0     0.0   298.4   296.1 64
>> 1000.0     96     40     5.7   297.9   295.1 32
>>  925.0    782    325     3.1   295.4   294.1 32
>>  850.0   1520    270     4.1   293.8   289.4 32
>>  700.0   3171    240     8.7   284.1   279.1 32
>>  500.0   5890    275     8.2   266.2   262.9 32
>>  400.0   7600    335     9.8   255.4   242.4 32
>> ===========
>> As you can see in the data above, the line shows the number of levels (or
>> lines) for each station.
>> I need to catch these lines so as to be able to feed my database.
>> By the way, I didn't understand the regular expression you've used. I've
>> tried to run it but it did not work.
>>
>> Hope you can help me!
>> Best Regards,
>> Nilza
>>
>>
>>
>>
>>
>> On Sun, Oct 3, 2010 at 2:18 AM, Michael Bedward
>> <michael.bedward at gmail.com> wrote:
>>
>>> Hello Nilza,
>>>
>>> If your file is small you can read it into a character vector like this:
>>>
>>> indata <- readLines("foo.dat")
>>>
>>> If your file is very big you can read it in batches like this...
>>>
>>> MAXRECS <- 1000  # for example
>>> fcon <- file("foo.dat", open="r")
>>> indata <- readLines(fcon, n=MAXRECS)
>>>
>>> The number of lines read will be given by length(indata).
>>>
>>> You can tell that the end of the file has been reached when fewer lines
>>> than requested come back:
>>> length(indata) < MAXRECS
>>>
>>> If a leading "*" character is a flag for the start of a station data
>>> block you can find this in the indata vector with grepl...
>>>
>>> start.pos <- which(grepl("^\\s*\\*", indata))
>>>
>>> When you're finished reading the file...
>>> close(fcon)
>>>
>>> Hope this helps,
>>>
>>> Michael
>>>
>>>
>>> On 3 October 2010 13:31, Nilza BARROS <nilzabarros at gmail.com> wrote:
>>>>
>>>> Dear R-users,
>>>>
>>>> I would like to know how I could read a file with different line lengths.
>>>>
>>>> I need to read this file and create output to feed my database.
>>>> So after reading I'll need to create output like this
>>>>
>>>> "INSERT INTO TEMP (DATA,STATION,VAR1,VAR2) VALUES (20100910,837460, 39,390)"
>>>>
>>>> I mean, each line should be read, but I don't know how to do this when
>>>> the lines have different lengths.
>>>>
>>>> I really appreciate any help.
>>>>
>>>> Thanks.
>>>>
>>>>
>>>>
>>>> ====Below the file that should be read ===========
>>>>
>>>>
>>>> *2010 10 01 00
>>>> 83746  -43.25 -22.81      6  51*
>>>> 1012.0  -9999    320     1.5   299.1   294.4 64
>>>> 1000.0    114    250     4.1   298.4   294.8 32
>>>> 925.0    797      0     0.0   293.6   292.9 32
>>>> 850.0   1524    195     3.1   289.6   288.9 32
>>>> 700.0   3156    290    11.3   280.1   280.1 32
>>>> 500.0   5870    280    20.1   266.1   260.1 32
>>>> 400.0   7570    265    23.7   256.6   222.7 32
>>>> 300.0   9670    265    28.8   240.2   218.2 32
>>>> 250.0  10920    280    27.3   230.2   220.2 32
>>>> 200.0  12390    260    32.4   218.7   206.7 32
>>>> 176.0  -9999    255    37.6 -9999.0 -9999.0  8
>>>> 150.0  14180    245    35.5   205.1   196.1 32
>>>> 100.0  16560    300    17.0   195.2   186.2 32
>>>> *2010 10 01 00
>>>> 83768  -51.13 -23.33    569  41
>>>> * 1000.0     79  -9999 -9999.0 -9999.0 -9999.0 32
>>>> 946.0  -9999    270     1.0   295.8   292.1 64
>>>> 925.0    763     15     2.1   296.4   290.4 32
>>>> 850.0   1497    175     3.6   290.8   288.4 32
>>>> 700.0   3140    295     9.8   282.9   278.6 32
>>>> 500.0   5840    285    23.7   267.1   232.1 32
>>>> 400.0   7550    255    35.5   255.4   231.4 32
>>>> 300.0   9640    265    37.0   242.2   216.2 32
>>>>
>>>>
>>>> Best Regards,
>>>>
>>>> --
>>>> Abraço,
>>>> Nilza Barros
>>
>
>
> David Winsemius, MD
> West Hartford, CT
>
>