Fuel prices in New South Wales - Part 2

PUBLISHED ON JAN 24, 2018 — R

It is no secret that gasoline and diesel prices at the pump vary from day to day and from one place to the next. Few people are aware though that a typical pattern in fuel prices looks like this:

Newspaper articles have written about this pattern, web services exist that show where you are on the cycle, and you can compare main cities across Australia. In Sydney, the cycle is about 30 days long (more about that later), and the timing of the spike increase in fuel (which takes only 2-3 days) varies little from pump to pump.

I have lots of questions about this cycle! This post will take the next step in analyzing a large open source dataset on fuel prices across New South Wales. Read my previous post about obtaining and cleaning the dataset and some broad spatial patterns, and visit this github repository for an R package with the cleaned dataset (and some basic spatial attributes of the stations).

From this point on I will focus largely on service stations in the greater Sydney area. The reason is that the reporting frequency on fuel prices is higher than remote NSW - usually daily -and because there are lots of service stations in Sydney alone (>750). This analysis - that I work on in my free time - is still well in the exploratory data analysis stage. I have many ideas for follow-up analyses, modelling, and so on - but these will depend on how much time I will have to spend on this topic!

This is an R blog, so all the below will not just show results but also the code used to generate them. You can find this reproducible document by following this direct link..

The data

The data are stored in this R package (not on CRAN), once installed we can do:

library(fuelpricensw)
library(dplyr)

# Merge spatial attributes
fuel <- left_join(fuel, fuelstations, by = "Address")

The first step is to select all Sydney stations. I did this by drawing an awkward polygon that captures most the of more densely populated bits (because ‘Greater Sydney’ shapefiles include lots of national parks, many more remote areas, etc.), and deciding whether some station falls in the polygon.

library(ggmap)
library(sp)

# Map tile for background.
syd_map <- ggmap::get_map(c(lon = 151, lat=-33.8), zoom=10)

# Manually entered polygon for 'Sydney', excluding blue mountains etc.
syd_vert <- read.table(text="
                   lon lat
                   150.65 -34.1
                   150.65 -33.55
                   150.9 -33.55
                   151 -33.7
                   151.35 -33.7
                   151.27 -34.15
                   150.95 -33.95
                   150.8 -34.1",header=TRUE)


# Select stations in this polygon
in_syd <- sp::point.in.polygon(fuelstations$lon, fuelstations$lat,
                           syd_vert$lon, syd_vert$lat)
locsyd <- fuelstations[in_syd == 1,]

# Also add Brand (BP, shell, etc) to the locations dataframe.
fuelkey <- dplyr::select(fuel, Address, Brand, Postcode) %>% distinct
locsyd <- left_join(locsyd, fuelkey, by="Address")


# Sydney data
fuelsyd <- filter(fuel, Address %in% locsyd$Address)

A simple map shows the service stations, using the excellent ggmap.

ggmap(syd_map) + 
  geom_polygon(aes(x=lon,y=lat), data=syd_vert, alpha=0.2) +
  geom_point(aes(x=lon, y=lat), data=locsyd, size=0.8, col="red") +
  labs(x="", y="")

Summarizing the ups and downs in fuel prices

Next we are going to summarize the fuel price cycles into timings of ‘price hikes’ (the sudden increase in price), the time between cycles (time to reach the minimum counting from the maximum), the time between the minimum and the maximum, the rate of decrease of price between the cycles, and so on.

To do this, I am going to approximate the fuel price timeseries by a saw pattern - by only considering the minimum and maximum fuel prices reached for each cycle. See the below figure for a few examples.

The code I developed for this analysis is quite long, and pretty terrible. One issue is that the data have to be cleaned in a variety of ways (short spikes in the data mess up any simple approach).

In the following I will only consider one of 11 fuel types, U91 (Unleaded petrol, 91 octane), the most commonly reported fuel in the database.

# ... U91 only
fuelu91syd <- filter(fuelsyd, FuelCode == "U91")

# List of dataframes, one timeseries for each service station. 
# I still prefer this approach over purrr or whatever.
fuelu91syds <- split(fuelu91syd, fuelu91syd$Address)

We then apply our custom magic function to the timeseries for each of 773 service stations in Sydney.

# Using dplyr we can write some seriously concise code.
# Using fct_lump, combine rare brands into 'Other'
cycsyd <- lapply(fuelu91syds, make_cycledf) %>%
  bind_rows %>% 
  left_join(locsyd, by="Address") %>%
  filter(ndata_cycle > 5, price_hike > 5) %>% 
  mutate(Brand2 = forcats::fct_lump(as.factor(Brand), 6))

Here’s one of them:

The red points are the peak prices, just after the price hike has occurred, the blue symbols the minimum price reached during this cycle - just before the price hike.

Finally I prepare a dataset with the cycles themselves summarized, simply by averaging price hikes, median prices at the minimum of the cycles, and so on.

cycsyd_a <- group_by(cycsyd, Address) %>%
  summarize(price_hike_median = median(price_hike, na.rm=T),
            price_peak_median = median(price_peak, na.rm=T),
            price_low_median = median(price_low, na.rm=T),
            ndays_cycle_median = median(ndays_cycle),
            dpricedt = mean(dpricedt),
            lon = mean(lon), lat=mean(lat), 
            nr_5km = mean(nr_5km),
            Brand = first(Brand),
            Brand2 = first(Brand2),
            Postcode = first(Postcode)
  )

Now we can investigate the average ‘bottom prices’ (blue dots in the figure above), and the average ‘peak prices’ across Sydney, grouped by the different brands of service stations. Here I use the fun ggridges packages to make semi-overlapping grouped density plots.

library(reshape2)
library(forcats)
library(ggridges)

brand_colours <- RColorBrewer::brewer.pal(7,"Set3")

datsub <- dplyr::select(cycsyd_a, Brand2, price_peak_median, price_low_median) %>%
  melt(id.var = "Brand2") %>%
 mutate(variable = fct_recode(variable, "peak"="price_peak_median",
                              "low"="price_low_median"),
        Brand2 = reorder(as.factor(Brand2), value, median))  
                      
ggplot(datsub, aes(y=Brand2)) +
  geom_density_ridges(aes(x=value, fill=paste(Brand2, variable)),
                      color="dimgrey", alpha=0.8) +
  theme_tufte() + 
  labs(x="Price (c)", y="") +
  ggtitle("Bottom price                                   Peak price") +
  theme(legend.position="none")  + 
  scale_fill_manual(values=rep(brand_colours, each=2))
## Picking joint bandwidth of 0.909

The results quite clearly show that Metro Fuel is the cheapest of the lot, as well as Independent, which makes up a large chunk of the ‘Other’ category in the figure. Prices are otherwise remarkably similar between the brands associated with large corporations.

That’s it for now. Next up is a closer look at the timing of price hikes. Stay tuned!

TAGS: SPATIAL, TEMPORAL