[R] Getting an unexpected extra row when merging two dataframes

jim holtman jholtman at gmail.com
Thu Mar 30 02:18:40 CEST 2017


first of all when you read the data in you get 379 rows of data since
you did not say 'header = TRUE' in the read.table.  Here is what the
first 6 lines of you data are:

> dataset1 <- read.table('/users/jh52822/downloads/containertestdata.txt')
>
> str(dataset1)
'data.frame':   379 obs. of  2 variables:
 $ V1: Factor w/ 379 levels "1-Apr-00","1-Apr-01",..: 379 333 301 80
145 113 239 18 270 207 ...
 $ V2: Factor w/ 66 levels "10","11","12",..: 66 46 57 5 39 48 40 61 10 18 ...
> View(dataset1)
> head(dataset1)
           V1       V2
1 TransitDate Transits
2    1-Oct-85       55
3    1-Nov-85       66
4    1-Dec-85       14
5    1-Jan-86       48
6    1-Feb-86       57
>

You need to learn to use 'str' to look at the structure.  So when you
are converting the dates, you will get an NA because the first row has
"TransitDate".  Now if you had used 'header = TRUE', you data would
look like this:

> dataset1 <- read.table('/users/jh52822/downloads/containertestdata.txt',
+                     header = TRUE,
+                     as.is = TRUE  # prevent conversion to factors
+                     )
>
> str(dataset1)
'data.frame':   378 obs. of  2 variables:
 $ TransitDate: chr  "1-Oct-85" "1-Nov-85" "1-Dec-85" "1-Jan-86" ...
 $ Transits   : int  55 66 14 48 57 49 70 19 27 28 ...
> head(dataset1)
  TransitDate Transits
1    1-Oct-85       55
2    1-Nov-85       66
3    1-Dec-85       14
4    1-Jan-86       48
5    1-Feb-86       57
6    1-Mar-86       49
>

So try again.

Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.


On Wed, Mar 29, 2017 at 11:02 AM, Paul Bernal <paulbernal07 at gmail.com> wrote:
> Hello everyone,
>
> Hope you are all doing great. So I have two datasets:
>
> -dataset1Frame: which contains the historical number of transits from
> october 1st, 1985 up to march 1, 2017. It has two columns, one called
> TransitDate and the other called Transits. dataset1Frame is a table comming
> from an SQL Server Database.
>
> -TransitDateFrame: a made up dataframe that goes from october 1st, 1985 up
> to the last date available in dataset1Frame.
>
> Note: The reason why I made up TransitDataFrame is because, since sometimes
> dataset1Frame has missing observations (some dates do not exist), and I
> just want to make sure I have all the dates available from october 1, 1985
> up to the last available observation.
> The idea is to leave the transits that do exist as they come, and add the
> dates missing as aditional rows (observations) with a value of NA for the
> transits.
>
> That being said, here is the code:
>
>>install.packages("src/lubridate_1.6.0.zip", lib=".", repos=NULL,
> verbose=TRUE)
>>library(lubridate, lib.loc=".", verbose=TRUE)
>>library(forecast)
>>library(tseries)
>>library(stats)
>>library(stats4)
>
>>dataset1 <-read.table("CONTAINERTESTDATA.txt")
>
>
>>dataset1Frame<-data.frame(dataset1)
>
>>dataset1Frame$TransitDate<-as.Date(dataset1Frame$TransitDate, "%Y-%m-%d")
>
>>TransitDate<-seq(as.Date("1985-10-01"),
> as.Date(dataset1Frame[nrow(dataset1Frame),1]), "months")
>
>>TransitDate["Transits"]<-NA
>
>>TransitDateFrame<-data.frame(TransitDate)
>
>>NewTransitsFrame<-merge(dataset1Frame,TransitDateFrame, all.y=TRUE)
>
> #Output from resulting dataframes
>
>>TransitDateFrame
>
>>NewTransitsFrame
>
> Why is there an additional row(observation) with a value of NA if I
> specified that the dataframe should only go to the last observation? There
> should be 378 observations at the end and I get 379 observations instead.
>
> The reason I am doing it this way is because this is how I got to fill in
> the gaps in dates (whenever there are nonexistent observations/missing
> data).
>
> Any guidance will be greatly appreciated.
>
> I am attaching a .txt file as a reference,
>
> Best regards,
>
> Paul
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



More information about the R-help mailing list