[R] read.xls counts more rows than I really have in my Excel

Tue Feb 8 11:41:30 CET 2011

On 08-Feb-11 10:05:58, agent dunham wrote:
> Hi, 
> i'm using read.xls, and it counts more rows in my Excel
> than I really have. 
> 
> i've used both: 
> 
>  especie26 <- read.xls("especie26cargar.xls")
> 
>  datos26 <- read.xls("especie26cargar.xls", header = TRUE,
>     as.is = FALSE, na.strings = "NA", 
>     skip = 0, check.names = TRUE, fill = FALSE, 
>     strip.white = FALSE, blank.lines.skip = TRUE)
>  
> when i check dim(especie26) or dim(datos26) i got 858  17
> and it should be 667  17
> 
> thanks in advance !!!!
> -- 

You say "i'm using read.xls, and it counts more rows
in my Excel than I really have." I would be inclined
to the view that this should be "it counts more rows
in my Excel than I think I really have."

In other words, very probably there really are 858 rows
in your spreadsheet, even though you think there are 667
(and can only see 667).

As a check, I would suggest that you export the entire
worksheet into a CSV file, e.g. "especie26cargar.csv".
For the purposes of this check DO NOT highlight the data
area to be saved (since doing this would export your
possibly false impression of the data as well).

Then, using some suitable utility, count the number of
lines in the resulting CSV file "especie26cargar.csv".
I am prepared to bet that you will find 858 lines in
this file (plus 1 for the header).

In that case the excess (858 - 667) = 191 rows would
probably consist of rows containing only commas, like

  ,,,,,,,,,,,,,,,,

(i.e. 16 commas, for the 17 fields, though the nunber
might vary from row to row). 

The phenomenon on which I base this guess is that very often,
when people enter data into an Excel spreadsheet, they
somehow allow the cursor to wander outside the area of
the spreadsheet which is to be occupied by real data.
Internally (and invisibly to the user) Excel will then
attribute the extraneous cells into which the cursor had
wandered as being data cells with empty content. They
therefore continue to appear on screen as pure blanks,
but they have been invisibly added to the structure of
the spreadsheet.

A possible alternative to this check (but less definitive)
is to look at the dataframe which you get after you have
read the spreadsheet into R.

E.g., after

  especie26 <- read.xls("especie26cargar.xls")

on the basis tghat you should only have 667 data rows,
have a look at the result of, say,

  especie26[(670:680),]

or some other range of rows beyond (1:667). On the same
basis as above, I am betting that these rows probably
consist solely of NAs.

Why am I saying this? Because more times than I have any
wish to remember in detail I have encountered exactly this
problem in spreadsheets sloppily prepared by people who
believed what they saw on screen, totally unaware of
what Excel had really done to their data. Excel is a con
artist: it imparts an impression of success and confidence
to the user, which may be quite false.

By the way: By a similar mechanism, you can find that
different rows in the spreadsheet have different numbers
of fields.

Ted.
feel

--------------------------------------------------------------------
E-Mail: (Ted Harding) <ted.harding at wlandres.net>
Fax-to-email: +44 (0)870 094 0861
Date: 08-Feb-11                                       Time: 10:41:27
------------------------------ XFMail ------------------------------