[R] Read file
Michael Bedward
michael.bedward at gmail.com
Mon Oct 4 04:55:41 CEST 2010
Hi Nilza,
Just to add to David's comments, if you are reading in your file with
read.table(..., fill=TRUE), and assuming that you haven't yet replaced
-9999 with NA, you don't need grep. You can just use the number of NAs
in each line to locate data blocks.
Date records have 3 NAs
Location records have 2 NAs
Data records have none.
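# Read the file; fill=TRUE pads short lines with NA
my.data2 <- read.table("d2010100100.txt", fill=TRUE, nrows=20000)
# Number of NAs in each row
na.count <- apply( my.data2, 1, function(x) sum( is.na(x) ) )
# Rows with 3 NAs are date records, one per station
date.recs <- which( na.count == 3 )
num.stns <- length(date.recs)
# Data rows per station: gap between date records minus the date and
# location rows; the last station runs to the end of the data read
stn.data.length <- c(diff(date.recs) - 2,
                     nrow(my.data2) - date.recs[num.stns] - 1)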
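Once the date records are located, each row can be tagged with its station block and the data rows turned into INSERT statements. What follows is only a sketch on a small inline sample. It assumes the "*" characters in the posted data are e-mail markup rather than file content, and the mapping of the first two data columns to VAR1 and VAR2 is a guess at what the database expects.

```r
# Inline sample in the same layout as the posted file (assumed layout)
txt <- "2010 10 01 00
82599 -35.25 -5.91 52 1
1008.0 -9999 115 3.1 298.6 294.6 64
2010 10 01 00
83649 -40.28 -20.26 4 7
1011.0 -9999 0 0.0 298.4 296.1 64
1000.0 96 40 5.7 297.9 295.1 32
925.0 782 325 3.1 295.4 294.1 32"

# col.names forces 7 columns, so shorter lines are padded with NA
dat <- read.table(text = txt, fill = TRUE, col.names = paste0("V", 1:7))

na.count <- unname(rowSums(is.na(dat)))
block    <- cumsum(na.count == 3)   # station block that each row belongs to
is.data  <- na.count == 0           # rows holding level data

# Date and station number for each block
date.rows <- which(na.count == 3)
stn.id    <- dat$V1[date.rows + 1]
stn.date  <- sprintf("%04d%02d%02d",
                     dat$V1[date.rows], dat$V2[date.rows], dat$V3[date.rows])

# Attach station and date to every data row
levels.df <- dat[is.data, ]
levels.df$station <- stn.id[block[is.data]]
levels.df$date    <- stn.date[block[is.data]]

# Hypothetical mapping: VAR1 = first data column, VAR2 = second
sql <- sprintf("INSERT INTO TEMP (DATA,STATION,VAR1,VAR2) VALUES (%s,%d,%s,%s)",
               levels.df$date, levels.df$station, levels.df$V1, levels.df$V2)
```

Replacing -9999 with NA is best left until after this classification, since extra NAs would change the per-row counts.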
Michael
On 4 October 2010 13:05, David Winsemius <dwinsemius at comcast.net> wrote:
>
> On Oct 3, 2010, at 9:40 PM, Nilza BARROS wrote:
>
>> Hi, Michael
>> Thank you for your help. I have already done what you said.
>> But I am still having problems dealing with my data.
>>
>> I need to split the data by station.
>>
>> I was able to identify where the station information starts using:
>>
>> my.data<-file("d2010100100.txt",open="rt")
>> indata <- readLines(my.data, n=20000)
>> i<-grep("^[837]",indata) #station number
>
> That would give you the line numbers for any line that had an 8, _or_ a 3,
> _or_ a 7 as its first character. Was that your intent? My guess is that you
> did not really want to use the square brackets and should have been using
> "^837".
>
> ?regex # Paragraph starting "A character class .... "
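The difference between the two patterns can be checked on a few lines taken from the posted data. This is only an illustration. The station numbers in the sample do not all begin with 837 (82599 and 83649 appear too), so the third pattern below is a hypothetical alternative that assumes station lines start with a five-digit integer followed by whitespace.

```r
x <- c("83746 -43.25 -22.81 6 51",
       "700.0 3156 290 11.3 280.1 280.1 32",
       "300.0 9670 265 28.8 240.2 218.2 32",
       "1012.0 -9999 320 1.5 299.1 294.4 64")

# Character class: first character is 8, 3 or 7
class.hits <- grep("^[837]", x)

# Literal prefix: line starts with the three characters "837"
literal.hits <- grep("^837", x)

# Assumed alternative: five digits then whitespace at the start of the line
id.hits <- grep("^\\s*[0-9]{5}\\s", x)
```

Here class.hits also picks up the data lines beginning 700.0 and 300.0, while the other two patterns select only the station line.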
>
>> my.data2<-read.table("d2010100100.txt",fill=TRUE,nrows=20000)
>> stn<- my.data2$V1[i]
>
> That would give you the first column values for the lines you earlier
> selected.
>
>
>> ====
>
> This does not look like what I would expect as a value for stn. Is that what
> you wanted us to think this was?
>
> --
> David.
>
>
>> 2010 10 01 00
>> *82599 -35.25 -5.91 52 1
>> * 1008.0 -9999 115 3.1 298.6 294.6 64
>> 2010 10 01 00
>> *83649 -40.28 -20.26 4 7*
>> 1011.0 -9999 0 0.0 298.4 296.1 64
>> 1000.0 96 40 5.7 297.9 295.1 32
>> 925.0 782 325 3.1 295.4 294.1 32
>> 850.0 1520 270 4.1 293.8 289.4 32
>> 700.0 3171 240 8.7 284.1 279.1 32
>> 500.0 5890 275 8.2 266.2 262.9 32
>> 400.0 7600 335 9.8 255.4 242.4 32
>> ===========
>> As you can see in the data above, the line shows the number of levels (or
>> lines) for each station.
>> I need to capture these lines so that I can feed my database.
>> By the way, I didn't understand the regular expression you've used. I've
>> tried to run it but it did not work.
>>
>> Hope you can help me!
>> Best Regards,
>> Nilza
>>
>>
>>
>>
>>
>> On Sun, Oct 3, 2010 at 2:18 AM, Michael Bedward
>> <michael.bedward at gmail.com> wrote:
>>
>>> Hello Nilza,
>>>
>>> If your file is small you can read it into a character vector like this:
>>>
>>> indata <- readLines("foo.dat")
>>>
>>> If your file is very big you can read it in batches like this...
>>>
>>> MAXRECS <- 1000 # for example
>>> fcon <- file("foo.dat", open="r")
>>> indata <- readLines(fcon, n=MAXRECS)
>>>
>>> The number of lines read will be given by length(indata).
>>>
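>>> The end of the file has been reached when readLines returns fewer than
>>> MAXRECS lines (possibly none). Note that isIncomplete( fcon ) reports
>>> whether the last read attempt was blocked, not whether the end of the
>>> file was reached.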
>>>
>>> If a leading "*" character is a flag for the start of a station data
>>> block you can find this in the indata vector with grepl...
>>>
>>> start.pos <- which( grepl("^\\s*\\*", indata) )
>>>
>>> When you're finished reading the file...
>>> close(fcon)
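The batch-reading steps above can be put together as a loop. This is a minimal sketch that writes a small temporary file so it runs on its own. The file contents, the batch size, and the n.read counter are placeholders for illustration, and the real per-batch processing would go where the counter is updated.

```r
# Temporary stand-in for the real data file
tmp <- tempfile(fileext = ".dat")
writeLines(sprintf("line %d", 1:25), tmp)

MAXRECS <- 10                      # batch size, for example
fcon <- file(tmp, open = "r")
n.read <- 0
repeat {
  indata <- readLines(fcon, n = MAXRECS)
  if (length(indata) == 0) break   # nothing left to read
  n.read <- n.read + length(indata)  # process the batch here
}
close(fcon)
unlink(tmp)
```

Because the connection stays open between calls, each readLines call continues from where the previous one stopped.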
>>>
>>> Hope this helps,
>>>
>>> Michael
>>>
>>>
>>> On 3 October 2010 13:31, Nilza BARROS <nilzabarros at gmail.com> wrote:
>>>>
>>>> Dear R-users,
>>>>
>>>> I would like to know how I could read a file with lines of different
>>>> lengths.
>>>>
>>>> I need to read this file and create an output to feed my database.
>>>> So after reading I'll need to create an output like this
>>>>
>>>> "INSERT INTO TEMP (DATA,STATION,VAR1,VAR2) VALUES (20100910,837460, 39,390)"
>>>>
>>>> I mean, each line should be read. But I don't know how to do this when
>>>> the lines have different lengths.
>>>>
>>>> I really appreciate any help.
>>>>
>>>> Thanks.
>>>>
>>>>
>>>>
>>>> ==== Below is the file that should be read ====
>>>>
>>>>
>>>> *2010 10 01 00
>>>> 83746 -43.25 -22.81 6 51*
>>>> 1012.0 -9999 320 1.5 299.1 294.4 64
>>>> 1000.0 114 250 4.1 298.4 294.8 32
>>>> 925.0 797 0 0.0 293.6 292.9 32
>>>> 850.0 1524 195 3.1 289.6 288.9 32
>>>> 700.0 3156 290 11.3 280.1 280.1 32
>>>> 500.0 5870 280 20.1 266.1 260.1 32
>>>> 400.0 7570 265 23.7 256.6 222.7 32
>>>> 300.0 9670 265 28.8 240.2 218.2 32
>>>> 250.0 10920 280 27.3 230.2 220.2 32
>>>> 200.0 12390 260 32.4 218.7 206.7 32
>>>> 176.0 -9999 255 37.6 -9999.0 -9999.0 8
>>>> 150.0 14180 245 35.5 205.1 196.1 32
>>>> 100.0 16560 300 17.0 195.2 186.2 32
>>>> *2010 10 01 00
>>>> 83768 -51.13 -23.33 569 41
>>>> * 1000.0 79 -9999 -9999.0 -9999.0 -9999.0 32
>>>> 946.0 -9999 270 1.0 295.8 292.1 64
>>>> 925.0 763 15 2.1 296.4 290.4 32
>>>> 850.0 1497 175 3.6 290.8 288.4 32
>>>> 700.0 3140 295 9.8 282.9 278.6 32
>>>> 500.0 5840 285 23.7 267.1 232.1 32
>>>> 400.0 7550 255 35.5 255.4 231.4 32
>>>> 300.0 9640 265 37.0 242.2 216.2 32
>>>>
>>>>
>>>> Best Regards,
>>>>
>>>> --
>>>> Abraço,
>>>> Nilza Barros
>>
>
>
> David Winsemius, MD
> West Hartford, CT
>
>