[R] join/merge two data frames
Arnaud Gaboury
@rn@ud@g@boury @end|ng |rom gm@||@com
Wed Apr 2 22:30:22 CEST 2025
On Wed, 2025-04-02 at 15:41 -0400, avi.e.gross using gmail.com wrote:
> Arnaud,
>
> I won't comment on other aspects but want to ask how sure you are
> that your data is guaranteed to have a single row reflecting a
> closing price at 18:59:59 exactly?
>
> It may be true for your data source. I note that markets technically
> close at 4:00 PM, New York time, but many have after-hours trading,
> and there are days it closes early (such as 1 PM) and times when
> trading is halted.
>
> Generally, you can get closing prices (or other data) from other
> reliable sources and you could choose to merge data from such a
> source in rather than calculating them from your data. If you do want
> to use your data, one suggestion is to use the LAST record in each
> grouping for a day. I find that easy to do in dplyr by having a
> column containing the date info except for the time, and another
> containing the time in a sortable format. You can then sort the
> data.frame by the date and then time and then group your data.frame
> by the date and and select only last record in each group and you
> have the last time, whatever that may be.
>
> Again, this may not apply in your case. As you note, you are planning
> on doing many things, one step at a time, and early stages can set up
> your data.frames in ways that make later stages easier to do. As one
> example, you could create a column in early stages that marks if the
> current row is a closing row or not.
I work on the crypto currency market which in fact never close, even on
weekends and Xmas. So each exchange fix its own closing time and there
is NO official price. I decided to fix it at 19:00 UTC. I don't compute
the closing price but fetch it from Binance exchange with the Binancer
package[1].
Here is the command to get BTCUSDT (Bitcoin against USD) on 2024-12-16:
klines <- binance_klines('BTCUSDT', interval = '6h', start_time =
as.POSIXct("2024-12-17", tz = 'UTC'), end_time = as.POSIXct("2024-12-
17", tz = 'UTC'))
Result is:
klines <- structure(list(open_time = structure(c(1734307200,
1734328800,
1734350400, 1734372000, 1734393600), class = c("POSIXct", "POSIXt"
), tzone = ""), open = c(104463.99, 105028.01, 103758, 107078.55,
106058.65), high = c(106648, 105420.76, 107195.58, 107793.07,
107000), low = c(104259.48, 103625.78, 103333, 105480.02, 105657.34
), close = c(105028, 103757.99, 107078.55, 106058.66, 106817.43
), volume = c(10236.165136, 5623.21583, 14589.713338, 10853.308436,
4992.89346), close_time = structure(c(1734328799.999, 1734350399.999,
1734371999.999, 1734393599.999, 1734415199.999), class = c("POSIXct",
"POSIXt"), tzone = ""), quote_asset_volume = c(1077588575.22963,
588656892.646352, 1542338025.25396, 1156380907.50461, 531077258.225013
), trades = c(1914225L, 1110218L, 3148766L, 1874845L, 1208364L
), taker_buy_base_asset_volume = c(5181.465886, 2387.39941,
7576.364568,
5203.869636, 2543.63848), taker_buy_quote_asset_volume =
c(545582484.666663,
250042657.490346, 800715095.504901, 554210661.585612, 270539317.70439
), symbol = c("BTCUSDT", "BTCUSDT", "BTCUSDT", "BTCUSDT", "BTCUSDT"
)), row.names = c(NA, -5L), class = c("data.table", "data.frame"
), .internal.selfref = <pointer: 0x60ac8ea98850>)
[1]https://cran.r-project.org/web/packages/binancer/binancer.pdf
>
> -----Original Message-----
> From: R-help <r-help-bounces using r-project.org> On Behalf Of Arnaud
> Gaboury
> Sent: Wednesday, April 2, 2025 2:10 PM
> To: Ebert,Timothy Aaron <tebert using ufl.edu>; r-help using r-project.org
> Subject: Re: [R] join/merge two data frames
>
>
>
> On Wed, 2025-04-02 at 16:55 +0000, Ebert,Timothy Aaron wrote:
> > Your result data frame example makes no sense to me. The price and
> > executed_qty are the same for all symbols?
> >
> > To get it all into one data frame you need a common variable that
> > is
> > used to join the data frames.
> > My guess is that all_trade_sample$symbol has equivalents to the
> > variables in token_close_sample.
> >
> > You need to pivot one of the tables, and then full join them.
> > Please look at the join functions. In base R it is "merge()" and in
> > dyply it is full_join(), left_join() and similar.
> >
> > I will pivot all_trade_sample to make all of the elements in
> > "symbol"
> > into variables.
> >
> > I will do this in tidyverse. There are some parts of this that I
> > really like. However, it can as easily be done using base R.
> > Library(tidyr)
> > Library(dplyr)
> > token_close_long <- token_close_sample %>%
> > pivot_longer(cols = -time, names_to = "symbol", values_to =
> > "close_price")
> > #I rename time in token_close_long so that it will be preserved
> > colnames(token_close_long) <- c("time_close", "symbol",
> > "close_price")
> > combined <- full_join(all_trade_sample,token_close_long,
> > by="symbol")
> >
> > This generates errors. This is ok, but I need to know more about
> > all_trade_sample. Each symbol appears multiple times in
> > all_trade_sample. Each instance of "AAVEUSIX" has a different time
> > stamp. So maybe filter to eliminate all but the first (earliest)
> > time?
> >
> > The program looks something like this:
> > token_close_long <- token_close_sample %>%
> > pivot_longer(cols = -time, names_to = "symbol", values_to =
> > "close_price")
> >
> > #I will rename time in token_close_long so that it will be
> > preserved
> > colnames(token_close_long) <- c("time_close", "symbol",
> > "close_price")
> > combined <- full_join(all_trade_sample,token_close_long,
> > by="symbol")
> >
> > filtered_trades <- all_trade_sample |>
> > group_by(symbol) |>
> > slice_min(time, n = 1)|>
> > ungroup()
> >
> > combined <- full_join(filtered_trades,token_close_long,
> > by="symbol")
> >
> > I did not do something right, as there are four closing prices for
> > each symbol. However, the general approach should work even if it
> > needs a little modification to give the correct result. Your
> > closing
> > price is only relevant based on some other price (probably buying
> > price, but could be opening price).
> >
> > Tim
>
> Thank you Tim for your answer. I will have a close look at it later
> today. But in short:
> - I will do some work on my data later in the script. The tibble I
> want
> is far from being the result.
> - Time of closing price is everyday the same: 18:59:59. I will
> compute
> returns, volatility, correlations etc so I need a common reference.
> Prices for the trades change and are given by the exchange. I don't
> need them, won't do anything.
> - Yes, symbols may be the common variable.
> - I feel comfortable with tibbles and tidyverse
> - All_trade_sample has the goal to fetch price (price of asset when
> the
> trade is done) and executed_qty. btc_price is needed at one point of
> my
> calculation, but will not be shown in the very final table.
> - Multiple lines for one only symbol means I made multiple trades of
> this asset during the same day.
>
> I will test your suggestions.
> Thank you again.
>
>
> >
> > -----Original Message-----
> > From: R-help <r-help-bounces using r-project.org> On Behalf Of Arnaud
> > Gaboury
> > Sent: Wednesday, April 2, 2025 6:20 AM
> > To: r-help using r-project.org
> > Subject: [R] join/merge two data frames
> >
> > [External Email]
> >
> > I work on a trading journal for a portfolio of crypto currencies.
> > The
> > goal is to fetch from my account (binance exchange) the trades I
> > have
> > done and daily closing prices of my assets.
> > The first part (getting the data from exchange) are in two parts.
> >
> > 1- get the daily closing prices of my assets. Here is a sample of
> > my
> > data frame:
> > token_close_sample <- structure(list(time =
> > structure(c(1734371999.999, 1734458399.999, 1734544799.999,
> > 1734631199.999), tzone = "", class = c("POSIXct", "POSIXt")),
> > BTCUSDC
> > = c(107112.36, 107517.25, 104630.49, 98692.01 ), SUIUSDC =
> > c(4.7252,
> > 4.6923, 4.7017, 4.2422), ENAUSDC = c(1.1862, 1.1412, 1.0928,
> > 1.0256),
> > AAVEUSDC = c(388, 365.68, 373.15, 316.69 ), ETHUSDC = c(4034.74,
> > 3975.39, 3879.06, 3474.91), FTMUSDC = c(1.381, 1.3596, 1.2222,
> > 1.0445)), row.names = c(NA, -4L), class = c("tbl_df", "tbl",
> > "data.frame"))
> >
> > the tibble looks like this:
> > time BTCUSDC SUIUSDC ENAUSDC AAVEUSDC ETHUSDC
> > FTMUSDC
> > <dttm> <dbl> <dbl> <dbl> <dbl> <dbl>
> > <dbl>
> > 1 2024-12-16 18:59:59 107112. 4.73 1.19 388 4035.
> > 1.38
> > 2 2024-12-17 18:59:59 107517. 4.69 1.14 366. 3975.
> > 1.36
> > 3 2024-12-18 18:59:59 104630. 4.70 1.09 373. 3879.
> > 1.22
> > 4 2024-12-19 18:59:59 98692. 4.24 1.03 317. 3475.
> > 1.04
> >
> > 2- get my trades. Here is a sample:
> > all_trade_sample <- structure(list(time =
> > structure(c(1737335082.949,
> > 1737336735.697, 1738059550.671, 1738142709.422, 1738142709.422,
> > 1738169351.788 ), tzone = "UTC", class = c("POSIXct", "POSIXt")),
> > symbol = c("AAVEUSDC", "AAVEUSDC", "SUIUSDC", "AAVEUSDC",
> > "AAVEUSDC",
> > "ETHUSDC"), executed_qty = c(866.666, -834.998, 67649.3, -0.393, -
> > 0.393, 36.1158), price = c(0.003005, 0.003131, 0.000038,
> > 294.738321,
> > 294.738321, 0.03027), cummulative_quote_qty = c(262699.317950113, -
> > 263696.723173419, 263987.20719179, -115.83216, -115.83216,
> > 111456.491386979
> > ), day = structure(c(20108, 20108, 20116, 20117, 20117, 20117 ),
> > class = "Date")), row.names = c(NA, -6L), class = c("tbl_df",
> > "tbl",
> > "data.frame"))
> >
> > the tibble looks like this:
> > time symbol executed_qty price
> > cummulative_quote_qty day
> > <dttm> <chr> <dbl> <dbl>
> > <dbl> <date>
> > 1 2025-01-20 01:04:42 AAVEUSDC 867. 0.00300
> > 262699. 2025-01-20
> > 2 2025-01-20 01:32:15 AAVEUSDC -835. 0.00313
> > -
> > 263697. 2025-01-20
> > 3 2025-01-28 10:19:10 SUIUSDC 67649. 0.000038
> > 263987. 2025-01-28
> > 4 2025-01-29 09:25:09 AAVEUSDC -0.393 295.
> > -116. 2025-01-29
> > 5 2025-01-29 09:25:09 AAVEUSDC -0.393 295.
> > -116. 2025-01-29
> > 6 2025-01-29 16:49:11 ETHUSDC 36.1 0.0303
> > 111456. 2025-01-29
> >
> > Now, to finalize, I want to get all the info in one data frame so I
> > can compute daily valuation (with potential trades, or in/out of
> > asset).
> > The finalized tibble should look something like this:
> > result <- structure(list(time = structure(c(1734371999.999,
> > 1734458399.999, 1734544799.999, 1734631199.999, 1737335082.949,
> > 1737336735.697, 1738059550.671, 1734721199, 1734807599, 1734893999,
> > 1734980399, 1735066799, 1735153199, 1735239599, 1735325999,
> > 1738142709.422, 1735412399, 1738142709.422, 1738169351.788,
> > 1735498799), tzone = "", class = c("POSIXct", "POSIXt")), BTCUSDC =
> > c(107112.36, 107517.25, 104630.49, 98692.01,NA_real_, NA_real_,
> > 102000, 101500, 101700,100300,100400,102300,102300,103100,
> > NA_real_,
> > 99800, NA_real_, NA_real_,NA_real_, 99900 ), SUIUSDC = c(4.7252,
> > 4.6923, 4.7017, 4.2422, NA_real_, NA_real_, 4.25, 4.26, 4.7, 4.65,
> > 4.52, 4.23, 4.17, 4.34, NA_real_, 4.52, NA_real_,
> > NA_real_,NA_real_,
> > 4.44), ENAUSDC = c(1.1862, 1.1412, 1.0928, 1.0256, NA_real_,
> > NA_real_, 1.176, 1.16, 1.163, 1.183, 1.196, 1.165, 1.158, 1.142,
> > NA_real_, 1.196, NA_real_, NA_real_,NA_real_, 1.113), AAVEUSDC =
> > c(388, 365.68, 373.15, 316.69, NA_real_, NA_real_, 102000, 101500,
> > 101700,100300,100400,102300,102300,103100, NA_real_, 99800,
> > NA_real_,
> > NA_real_,NA_real_, 99900 ), ETHUSDC = c(4034.74, 3975.39, 3879.06,
> > 3474.91, NA_real_,NA_real_, 3420, 3410, 3412, 3367, 3388, 3355,
> > 3374,
> > 3392, NA_real_, 3401, NA_real_, NA_real_,NA_real_, 3411), FTMUSDC =
> > c(1.381,1.3596, 1.2222, 1.0445, NA_real_,NA_real_, 1.36, 1.368,
> > 1.342, 1.339, 1.436, 1.562, 1.53, 1.62, NA_real_, 1.31, NA_real_,
> > NA_real_,NA_real_, 1.58), executed_qty = c(NA_real_, NA_real_,
> > NA_real_, NA_real_, 230, 559, NA_real_, NA_real_,NA_real_,
> > NA_real_,
> > NA_real_,NA_real_, NA_real_, NA_real_, 200, NA_real_, 700, 200,
> > 350,
> > NA_real_), price = c( NA_real_, NA_real_, NA_real_, NA_real_, 2, 3,
> > NA_real_, NA_real_,NA_real_, NA_real_, NA_real_,NA_real_, NA_real_,
> > NA_real_, 10, NA_real_, 3, 4, 5, NA_real_) ), row.names = c(NA, -
> > 20L), class = c("tbl_df", "tbl", "data.frame"))
> >
> > The idea is to have:
> > - one line each day with closing price of my portfolio assets at
> > 18:59:59
> > - one line for each trade where I shall write 'asset', 'price' and
> > 'executed_qty'.
> >
> > I spent quite some time trying to figure out how to do it, but
> > couldn't. I don't even know if it is possible. Of course, I don't
> > want to add anything (price, date, asset...) by hand.
> > Thank you for any help.
> >
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 488 bytes
Desc: This is a digitally signed message part
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20250402/1c8210fd/attachment.sig>
More information about the R-help
mailing list