[R] Controlling number of numbers before R rewrites to "+e18" etc

jim holtman jholtman at gmail.com
Mon Oct 25 14:08:40 CEST 2010


You can always read a portion of the file and then write it out.  For
large files, I read in 10,000 lines, fix them up, write them out, and
then go back and process the next batch of lines.  You haven't
shown us a sample of your input/output or how you are
processing it.  Depending on what type of preprocessing needs to be
done to the data, Perl is also an option.  But most things I used to
use Perl for, I can do within R these days.
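
The batch approach described above could look roughly like the
following.  This is only a sketch: the temporary files, the tiny batch
size and the empty "fix up" step are placeholders I made up for
illustration, and your real processing would go where the comment
indicates.  Every column is read as character so that no digits are
lost along the way.

```r
# Sketch only: file names, batch size and the fix-up step are placeholders.
infile  <- tempfile(fileext = ".txt")
outfile <- tempfile(fileext = ".csv")

# Build a small stand-in input file: 25 lines of three long IDs each.
id_line <- "1234567890123456789012 987654321234567898765432 98765432123456789876543"
writeLines(rep(id_line, 25), infile)

batch_size <- 10   # something like 10000 for a real file

input  <- file(infile, open = "r")
output <- file(outfile, open = "w")

repeat {
    # Each call continues where the previous one stopped.
    lines <- readLines(input, n = batch_size)
    if (length(lines) == 0) break   # end of file reached

    # Parse the batch, keeping every column as character.
    batch <- read.table(text = lines, colClasses = "character")

    # ... fix up 'batch' here ...

    # Write the batch to the open output connection; quote = FALSE
    # writes the IDs as plain digit strings.
    write.table(batch, output, sep = ",", quote = FALSE,
                row.names = FALSE, col.names = FALSE)
}

close(input)
close(output)
```

Only one batch is held in memory at a time, so the memory needed
depends on the batch size rather than on the size of the whole file.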

Here is an example of reading in your IDs:

> x <- read.table(textConnection("1234567890123456789012 987654321234567898765432 98765432123456789876543
+ 1234567890123456789012 987654321234567898765432 98765432123456789876543
+ 1234567890123456789012 987654321234567898765432 98765432123456789876543
+ 1234567890123456789012 987654321234567898765432 98765432123456789876543
+ 1234567890123456789012 987654321234567898765432 98765432123456789876543
+ 1234567890123456789012 987654321234567898765432 98765432123456789876543
+ 1234567890123456789012 987654321234567898765432 98765432123456789876543")
+     , colClasses = rep('character', 3))
> closeAllConnections()
> str(x)
'data.frame':   7 obs. of  3 variables:
 $ V1: chr  "1234567890123456789012" "1234567890123456789012" "1234567890123456789012" "1234567890123456789012" ...
 $ V2: chr  "987654321234567898765432" "987654321234567898765432" "987654321234567898765432" "987654321234567898765432" ...
 $ V3: chr  "98765432123456789876543" "98765432123456789876543" "98765432123456789876543" "98765432123456789876543" ...
>     x
                      V1                       V2                      V3
1 1234567890123456789012 987654321234567898765432 98765432123456789876543
2 1234567890123456789012 987654321234567898765432 98765432123456789876543
3 1234567890123456789012 987654321234567898765432 98765432123456789876543
4 1234567890123456789012 987654321234567898765432 98765432123456789876543
5 1234567890123456789012 987654321234567898765432 98765432123456789876543
6 1234567890123456789012 987654321234567898765432 98765432123456789876543
7 1234567890123456789012 987654321234567898765432 98765432123456789876543
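
To see why the character route matters, here is a small sketch that
compares the two ways of holding such IDs and then round-trips the
character version through a csv file.  The two IDs are taken from the
example above; the temporary file and the variable names are just
assumptions for illustration.

```r
ids <- c("1234567890123456789012", "987654321234567898765432")

# As numbers: a double carries only about 15-17 significant digits,
# so printing the stored value back out does not reproduce the ID.
as_num <- as.numeric(ids)
back   <- sprintf("%.0f", as_num)
back == ids    # the trailing digits no longer match

# As character: write to a csv file and read it back unchanged.
x   <- data.frame(V1 = ids, stringsAsFactors = FALSE)
tmp <- tempfile(fileext = ".csv")
write.csv(x, tmp, row.names = FALSE)
y <- read.csv(tmp, colClasses = "character")
identical(x$V1, y$V1)
```

Note that colClasses is needed again when reading the csv back in,
otherwise the IDs would be converted to numbers on import.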



On Mon, Oct 25, 2010 at 4:41 AM, ZeMajik <zemajik at gmail.com> wrote:
> Thanks Jim, but I still got the problem that the pre-processing becomes way
> too computationally expensive. R seems to handle characters and factors much
> much worse than numeric IDs. I don't have enough RAM to even write the file
> when they are viewed as chars instead of numeric values!
>
> Anyone have any other ideas? Is it not possible to tell R not to rewrite
> upon import? It wouldn't matter if it only would write the correct IDs to
> the exported csv file, but it exports the abbreviated version which is of no
> use.
>
> Mike
>
> On Sat, Oct 23, 2010 at 3:56 AM, jim holtman <jholtman at gmail.com> wrote:
>>
>> Your best bet is to make sure that you read the IDs in as characters.
>> If they are being read in as floating point numbers, then there are
>> only about 15 digits of accuracy, so if you have IDs of 18-22 digits,
>> you will be missing data.  So if you are using read.table, then look at
>> colClasses to see how to do this.
>>
>> Provide a subset of your data and the statements that you are using to
>> read in the data.
>>
>> On Fri, Oct 22, 2010 at 1:15 PM, ZeMajik <zemajik at gmail.com> wrote:
>> > Hey,
>> >
>> > I'm using R as a pre-processor for a large dataset with IDs which are
>> > numeric (but have no numeric meaning, so they can be seen as factors).
>> > I do some data formatting and then write it out to a csv file.
>> >
>> > However, the problem is that the IDs are very long, 18-22 chars long more
>> > precisely. R is constantly rewriting these IDs to the abbreviated +eX form,
>> > which hinders me from exporting the data to the csv since the IDs are no
>> > longer intact.
>> > I've tried telling R that the ID column is a factor, but this results in two
>> > problems: 1) Since I have millions of rows and R is slower handling factors
>> > than numbers, my comp can't run the process in any kind of reasonable time,
>> > and 2) Some IDs STILL seem to be rewritten somehow. The second point made me
>> > believe that perhaps R is rewriting upon import?
>> >
>> > Does anyone have any tips on how to solve this problem?
>> >
>> > Thanks,
>> > Mike
>> >
>>
>>
>>
>> --
>> Jim Holtman
>> Cincinnati, OH
>> +1 513 646 9390
>>
>> What is the problem that you are trying to solve?
>
>



-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?


