forked from ethereum/rig
-
Notifications
You must be signed in to change notification settings - Fork 0
/
explore_data.Rmd
373 lines (308 loc) · 12.8 KB
/
explore_data.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
---
title: "Exploring blocks, gas and transactions"
description: |
A focus on the recent high gas prices, towards understanding high congestion regimes for EIP 1559.
author:
- name: Barnabé Monnot
url: https://twitter.com/barnabemonnot
affiliation: Robust Incentives Group, Ethereum Foundation
affiliation_url: https://github.com/ethereum/rig
date: "`r Sys.Date()`"
output:
distill::distill_article:
toc: true
toc_depth: 3
---
While real world gallons of oil went negative, Ethereum gas prices have sustained a long period of high fees since the beginning of May. I wanted to dig in a bit deeper, with a view to understanding the fundamentals of the demand. Some of the charts below retrace steps that are very well-known to a lot of us -- these are mere restatements and updates. The data includes all blocks produced between May 4th, 2020, 13:22:16 UTC and May 19th, 2020, 19:57:17 UTC.
[Onur Solmaz](https://twitter.com/onurhsolmaz) from Casper Labs wrote [a very nice post](https://solmaz.io/2019/10/21/gas-price-fee-volatility/) arguing that since we observe daily cycles, there must be something more than one-off ICOs and Ponzis at play.
<aside>
I was [as surprised as he was](https://ethresear.ch/t/daily-demand-cycle-and-intraday-gas-price-volatility/6330) that there aren't more posts investigating the same.
</aside>
We will see these cycles here too, and a few more questions I thought were interesting (or at least, that I kinda knew the answer to but never had derived or played with myself). This is an excuse to play with my new [DAppNode](https://dappnode.io) full node, using the wonderful [ethereum-etl](https://github.com/blockchain-etl/ethereum-etl) package from Evgeny Medvedev to extract transaction and block details. This data will also be useful to calibrate good simulations for EIP 1559 (more on this soon!)
```{r setup, message = FALSE}
library(tidyverse)
library(here)
library(glue)
library(lubridate)
library(forecast)
library(infer)
library(matrixStats)
library(rmarkdown)
library(knitr)
library(skimr)
options(digits=10)
options(scipen = 999)
# Make the plots a bit less pixellated
knitr::opts_chunk$set(dpi = 300)
# A minimal theme I like (zero bonus point for using it though!)
newtheme <- theme_grey() + theme(
axis.text = element_text(size = 9),
axis.title = element_text(size = 12),
axis.line = element_line(colour = "#000000"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
legend.title = element_text(size = 12),
legend.text = element_text(size = 10),
legend.box.background = element_blank(),
legend.key = element_blank(),
strip.text.x = element_text(size = 10),
strip.background = element_rect(fill = "white")
)
theme_set(newtheme)
```
```{r}
start_block <- 10000001
end_block <- 10100000
suffix <- glue("-", start_block, "-", end_block)
```
```{r message = FALSE, eval=FALSE}
txs <- read_csv(here::here(glue("data/txs", suffix, ".csv")))
txs <- txs %>% select(-block_timestamp)
txs %>% glimpse()
```
```{r message=FALSE, eval=FALSE}
txs_receipts <- txs %>%
left_join(
read_csv(here::here(glue("data/rxs", suffix, ".csv"))),
by = c("hash" = "transaction_hash")) %>%
arrange(block_number)
saveRDS(txs_receipts, here::here(glue("data/txs", suffix, ".rds")))
```
```{r message=FALSE, cache=TRUE}
txs_receipts <- readRDS(here::here(glue("data/txs", suffix, ".rds"))) %>%
mutate(gas_fee = gas_price * gas_used) %>%
mutate(gas_price = gas_price / (10 ^ 9),
gas_fee = gas_fee / (10 ^ 9),
value = value / (10 ^ 18))
```
```{r message=FALSE, cache=TRUE}
blocks <- read_csv(here::here(glue("data/bxs", suffix, ".csv"))) %>%
mutate(block_date = as_datetime(timestamp),
prop_used = gas_used / gas_limit) %>%
rename(block_number = number) %>%
arrange(block_number)
gas_prices_per_block <- blocks %>%
select(block_number) %>%
left_join(
txs_receipts %>%
group_by(block_number) %>%
summarise(
min_gas_price = min(gas_price),
total_gas_used = sum(gas_used),
avg_gas_price = sum(gas_fee) / total_gas_used,
med_gas_price = weightedMedian(gas_price, w = gas_used),
max_gas_price = max(gas_price)
)
) %>%
select(-total_gas_used)
blocks <- blocks %>%
left_join(gas_prices_per_block)
```
```{r message=FALSE, cache=TRUE}
date_sample <- interval(ymd("2020-05-13"), ymd("2020-05-20"))
blocks_sample <- blocks %>%
filter(block_date %within% date_sample)
txs_sample <- txs_receipts %>%
semi_join(blocks_sample)
```
## Block properties
### Gas used by a block
Miners have some control over the gas limit of a block, but how much gas do blocks generally use?
```{r}
blocks %>%
ggplot() +
geom_histogram(aes(x = gas_used), bins = 1000, fill = "steelblue") +
scale_y_log10()
```
There are a few peaks, notably at 0 (the amount of gas used by an empty block) and towards the maximum gas limit set at 10,000,000. Let's zoom in on blocks that use more than 9,800,000 gas.
```{r}
blocks %>%
filter(gas_used >= 9.8 * 10^6) %>%
ggplot() +
geom_histogram(aes(x = gas_used), fill = "steelblue")
```
We can also look at the proportion of gas used, i.e., the amount of gas used by the block divided by the total gas available in that block. Taking a moving average over the last 500 blocks, we obtain the following plot.
```{r}
blocks_sample %>%
mutate(ma_prop_used = ma(prop_used, 500)) %>%
ggplot() +
geom_line(aes(x = block_date, y = ma_prop_used), colour = "#FED152") +
xlab("Block timestamp")
```
Where does the dip on May 15th come from? Empty blocks? We plot how many empty blocks are found in chunks of 2000 blocks.
```{r}
chunk_size <- 2000
blocks_sample %>%
mutate(block_chunk = block_number %/% chunk_size) %>%
filter(gas_used == 0) %>%
group_by(block_chunk) %>%
summarise(block_date = min(block_date),
`Empty blocks` = n()) %>%
ggplot() +
geom_point(aes(x = block_date, y = 1/2, size = `Empty blocks`),
alpha = 0.3, colour = "steelblue") +
scale_size_area(max_size = 12) +
theme(
axis.line.y = element_blank(),
axis.text.y = element_blank(),
axis.title.y = element_blank(),
axis.ticks.y = element_blank(),
) +
xlab("Block timestamp")
```
It doesn't seem so.
### Relationship between block size and gas used
Does the block weight (in _gas_) roughly correlate with the block size (in _bytes_)?
```{r}
cor.test(blocks$gas_used, blocks$size)
```
It does! But since most blocks have very high `gas_used` anyways, it pays to look a bit more closely.
```{r}
blocks %>%
ggplot() +
geom_point(aes(x = gas_used, y = size), alpha = 0.1, colour = "steelblue") +
scale_y_log10() +
xlab("Gas used per block") +
ylab("Block size (in bytes)")
```
We use a logarithmic scale for the y-axis. There is definitely a big spread around the 10 million gas limit. Does the block size correlate with the number of transactions instead then?
```{r cache=TRUE}
blocks_num_txs <- blocks %>%
left_join(
txs_receipts %>%
group_by(block_number) %>%
summarise(n = n())
) %>%
replace_na(list(n = 0))
```
```{r}
blocks_num_txs %>%
ggplot() +
geom_point(aes(x = n, y = size), alpha = 0.2, colour = "steelblue") +
xlab("Number of transactions per block") +
ylab("Block size (in bytes)")
```
A transaction has a minimum size, if only to include things like the sender and receiver addresses and the other necessary fields. This is why we pretty much only observe values above some diagonal. The largest blocks (in bytes) are not the ones with the most transactions.
## Gas prices
### Distribution of gas prices
First, some descriptive stats for the distribution of gas prices.
```{r}
quarts = c(0, 0.25, 0.5, 0.75, 1)
tibble(
`Quartile` = quarts,
) %>%
add_column(`Value` = quantile(txs_receipts$gas_price, quarts)) %>%
kable()
```
75% of included transactions post a gas price less than or equal to 31 Gwei! Plotting the distribution of gas prices under 2000 Gwei:
```{r}
txs_receipts %>%
filter(gas_price <= 2000) %>%
ggplot() +
geom_histogram(aes(x = gas_price), bins = 100, fill = "#F05431") +
scale_y_log10() +
xlab("Gas price (Gwei)")
```
The y-axis is in logarithmic scale. Notice these curious, regular peaks? Turns out people love round numbers (or their wallets do). Let's dig into this.
### Do users like default prices?
How do users set their gas prices? We can make the hypothesis that most rely on some oracle (e.g., the Eth Gas Station or their values appearing as Metamask defaults). We show next the 50 most frequent gas prices (in Gwei) and their frequency among included transactions.
```{r cache=TRUE}
gas_price_freqs <- txs_receipts %>%
group_by(gas_price) %>%
summarise(count = n()) %>%
arrange(-count) %>%
mutate(freq = count / nrow(txs_receipts), cumfreq = cumsum(freq),
`Gas price (Gwei)` = gas_price,
`Gas price (wei)` = round(gas_price * (10 ^ 9), 10)) %>%
mutate(frequency = str_c(round(freq * 100), "%"), cum_freq = str_c(round(cumfreq * 100), "%")) %>%
select(-freq, -cumfreq, -gas_price)
```
```{r}
paged_table(gas_price_freqs %>%
select(`Gas price (Gwei)`, `Gas price (wei)`, count, frequency, cum_freq) %>%
filter(row_number() <= 50))
```
Clearly round numbers dominate here!
### Evolution of gas prices
I wanted to see how the gas prices evolve over time. To compute the average gas price in a block, I do a weighted mean using `gas_used` as weight. I then compute the average gas price over 100 blocks by doing another weighted mean using the total gas used in the blocks.
```{r}
chunk_size <- 100
blocks_sample %>%
mutate(block_chunk = block_number %/% chunk_size) %>%
replace_na(list(
avg_gas_price = 0, gas_used = 0)) %>%
mutate(block_num = gas_used * avg_gas_price) %>%
group_by(block_chunk) %>%
summarise(avg_prop_used = mean(prop_used),
gas_used_chunk = sum(gas_used),
num_chunk = sum(block_num),
avg_gas_price = num_chunk / gas_used_chunk,
block_date = min(block_date)) %>%
ggplot() +
geom_line(aes(x = block_date, y = avg_gas_price), colour = "#F05431") +
xlab("Block timestamp")
```
We see a daily seasonality, with peaks and troughs corresponding to high congestion and low congestion hours of the day. How does this jive with other series we saw before? We now average over 200 blocks and present a comparison with the series of block proportion used.
```{r}
chunk_size <- 200
blocks_sample %>%
mutate(block_chunk = block_number %/% chunk_size) %>%
replace_na(list(
avg_gas_price = 0, gas_used = 0)) %>%
mutate(block_num = gas_used * avg_gas_price) %>%
group_by(block_chunk) %>%
summarise(gas_limit_chunk = sum(gas_limit),
gas_used_chunk = sum(gas_used),
num_chunk = sum(block_num),
avg_gas_price = num_chunk / gas_used_chunk,
block_date = min(block_date),
prop_used = gas_used_chunk / gas_limit_chunk) %>%
select(block_date, `Proportion used` = prop_used, `Average gas price` = avg_gas_price) %>%
pivot_longer(-block_date, names_to = "Series") %>%
ggplot() +
geom_line(aes(x = block_date, y = value, color = Series)) +
scale_color_manual(values = c("#F05431", "#FED152")) +
facet_grid(rows = vars(Series), scales = "free") +
xlab("Block timestamp")
```
Blocks massively unused right after a price peak? The mystery deepens.
### Timestamp difference between blocks
How much time elapses between two consecutive blocks? Miners are responsible for setting the timestamp, so it's not a perfectly objective value, but good enough!
```{r}
blocks %>%
mutate(time_difference = timestamp - lag(timestamp)) %>%
ggplot() +
geom_histogram(aes(x = time_difference), binwidth = 1, fill = "#BFCE80")
```
```{r cache=TRUE}
late_blocks <- blocks %>%
mutate(time_difference = timestamp - lag(timestamp),
late_block = time_difference >= 20) %>%
replace_na(list(gas_used = 0, avg_gas_price = 0)) %>%
drop_na()
mean_diff <- late_blocks %>%
specify(formula = avg_gas_price ~ late_block) %>%
calculate(stat = "diff in means", order = c(TRUE, FALSE))
```
```{r cache=TRUE}
null_distribution <- late_blocks %>%
specify(formula = avg_gas_price ~ late_block) %>%
hypothesize(null = "independence") %>%
generate(reps = 500, type = "permute") %>%
calculate(stat = "diff in means", order = c(TRUE, FALSE))
```
We can do a simple difference-in-means test to check whether the difference between the average gas price of late blocks (with timestamp difference greater than 20 seconds) and early blocks (lesser than 20 seconds) is significant.
```{r fig.cap="Mean gas price in \"late\" and \"early\" blocks"}
kable(late_blocks %>%
group_by(late_block) %>%
summarise(avg_gas_price = mean(avg_gas_price)))
```
<aside>
We find indeed that the average gas price of late blocks is significantly greater.
```{r}
null_distribution %>%
get_p_value(mean_diff, direction = "greater") %>%
kable()
```
</aside>