Analysis of fuel prices in New South Wales

PUBLISHED ON JAN 24, 2018 — R

How do petrol prices vary across the state?

The price of fuel at service stations varies tremendously, both in space - across the large Australian state of New South Wales - and in time, with surprising patterns. In this post I begin my quest to understand how fuel price varies: what factors explain its fluctuation over time and its variation across space? Obviously it would be nice to know where to drive for cheaper fuel (what kind of areas are cheaper?), and when to fill up (should I wait until Thursday?).

Since August 2016, the NSW government has run the FuelCheck service, which allows monitoring of fuel prices in real time. Several apps tap into this publicly available API, allowing users to find the cheapest fuel in the neighborhood, or inspect some simple graphs of fuel price over time.

As an additional service, (nearly) daily prices of all types of fuel, all brands of service stations, and all locations across the state can be downloaded from the NSW Data portal, currently from August 2016 to October 2017. The dataset contains over one million records, for over 2000 service stations, and 11 fuel types.

Approach

This is the first in a series of posts on this hobby project to find out if we can predict fuel prices in space and time. What I want to know is:

  • Is fuel cheaper on a particular day of the week? It is a widely-held belief that fuel is more expensive on the day that everyone receives their weekly paycheck (Thursday), but do the data support this? A quick first look is sketched after this list.

  • Across the state, how does fuel price vary and why? Quick inspection of the data shows that remote areas are more expensive; how do we summarize that, and what else matters? As roughly 5 out of 7.5 million people in NSW live in the Sydney metropolitan area - that is two thirds of the population on 1.5% of the land area - I will look at Sydney and non-Sydney data separately for much of this analysis.
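
As a teaser for the first question, here is a minimal sketch, not the final analysis, of the median U91 price by day of the week. It assumes the cleaned fuel dataset introduced below, with columns Date, FuelCode and Price.

library(dplyr)
library(lubridate)

fuel %>%
  filter(FuelCode == "U91") %>%
  mutate(weekday = wday(Date, label = TRUE)) %>%
  group_by(weekday) %>%
  summarize(Price = median(Price, na.rm = TRUE))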

Instead of just showing results, these posts are very much about getting into the details of the R code to generate them. I will use dplyr, forcats, ggplot2, deldir (Voronoi polygons), SDraw (better Voronoi polygons), oz (quick maps of Australia), rgeos, and magicaxis throughout the code.

But before we get started, here is a taste of what’s to come:

The data show an interesting fluctuation in fuel prices, which can be viewed on this official site as well, but the fluctuations are on much longer timescales than weekly. Zooming in on a two-month period in 2017, we see the peculiar pattern of a very sudden jump in fuel prices, followed by a gradual decline over the period of a month or so.
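
That pattern can be reproduced from the cleaned data in a few lines. Here is a sketch, assuming the fuel dataset introduced below and picking July-August 2017 as an example two-month window:

library(dplyr)
library(ggplot2)

fuel %>%
  filter(FuelCode == "U91",
         Date >= as.Date("2017-07-01"),
         Date <= as.Date("2017-08-31")) %>%
  group_by(Date) %>%
  summarize(Price = median(Price, na.rm = TRUE)) %>%
  ggplot(aes(x = Date, y = Price)) +
  geom_line() +
  labs(x = "", y = "Median U91 price ($ cents)")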

Getting the Data

The FuelCheck service in New South Wales, Australia, provides real-time data on fuel prices at service stations across the state. This page contains background information, as well as monthly files with fuel prices for all service stations and all fuel types. The first step is to download all the xlsx files and save them locally. We then use readxl to read them all, and dplyr to tidy things up.

pacman::p_load(readxl, dplyr, lubridate, ggmap,
               SDraw, deldir, sp, rgeos, geosphere,
               janitor, magicaxis)

And then

# Don't have to run this, since the clean raw data is bundled
# in the fuelpricensw package (github: remkoduursma/fuelpricensw)
if(FALSE){
  # Read one monthly xlsx file. Some files have an extra header row;
  # in that case readxl names the second column "X__1", and we re-read
  # the file skipping the first row.
  readx <- function(x){
    res <- read_excel(x)
    if(names(res)[2] == "X__1"){
      res <- read_excel(x, skip=1)
    }
    
    res <- mutate(res, Postcode = as.numeric(Postcode))
    
    # Some files call the fuel type column 'FuelType' instead of 'FuelCode'.
    if("FuelType" %in% names(res)){
      res <- rename(res, FuelCode = FuelType)
    }
    
    return(res)
  }
  
  fuel <- lapply(dir("rawdata", pattern="xlsx", full.names = TRUE), readx) %>%
    bind_rows %>%
    mutate(DateTime = ymd_hms(PriceUpdatedDate),
           Date = as.Date(DateTime)) %>%
    dplyr::select(-PriceUpdatedDate) %>%
    filter(Price < 500)  # drop implausibly high prices (presumably data entry errors)

}

You can skip this step as I have bundled the clean dataset in the R package fuelpricensw, which is available on this GitHub repository.

devtools::install_github("remkoduursma/fuelpricensw")
library(fuelpricensw)
data(fuel)

Feature engineering: spatial components

The remainder of this first part on fuel price analysis will look at the spatial component of fuel prices, and we will calculate some spatial features based on the locations of the service stations. The first step, then, is to get the latitude and longitude for each station based on the address.

Here I used Google’s geocoding service, as made easily available in the ggmap package. The service does not always return a proper result, even though typing the same address into maps.google.com might give the right location.

latcache <- "data/Fuel_NSW_addresses_latlon.csv"
if(!file.exists(latcache)){

  # Unique addresses to look up.
  addr <- unique(fuel$Address)
  
  # After some failures, I found that extra info between () messes with 
  # the geocode service. Remove them.
  addr_re <- gsub("\\(.+\\)", "", addr)
  
  # Get rid of "Cnr of ...,", but not when address starts with it.
  addr_re <- gsub("(.+)(Cnr.+,)", "\\1", addr_re)
  
  # Add Australia though it seems unnecessary
  addr_re <- paste(addr_re, "Australia")
  
  # Now run the service.
  gcres <- geocode(addr_re, output="latlon")
  
  # Code not shown: run code twice on separate days,
  # since we go over the API use limit.
  write.csv(gcres, latcache, row.names=FALSE) 
  
}

The next step is to get rid of USA addresses: when geocode does not find a good match, Google returns some random address in the USA (even when we paste ‘Australia’ at the end!). Better approaches exist, such as using the geonames service to look up the country code and filtering on that, but this suffices here.

locs <- read.csv(latcache, stringsAsFactors = FALSE) %>%
  filter(lon > 120) %>%
  dplyr::select(-Address_geo)

Distance to nearest competitor

Suppose you drive up to a service station and wonder, what if I drive to the next station because it might be cheaper? If there are many choices for you, competition between service stations will be more intense and we should see a lower price at the pump. The next step is then to calculate the distance to the nearest station, as a potential predictor for fuel price.

library(sp)
library(rgeos)
library(geosphere)

# Copy of the plain dataframe 'locs', converted to a SpatialPointsDataFrame.
# Note that new variables are still assigned to 'locs', which remains
# a plain dataframe.
locs_sp <- locs
coordinates(locs_sp) <- ~lon+lat

# distm from geosphere: the correct way to calculate distances between
# lon/lat coordinates (returned in metres).
geo_dist <- distm(locs_sp)

# How many other service stations are <5km away (exclude self: x > 0)
countd <- function(x, dist) sum(x > 0 & x < dist)
locs$nr_5km <- apply(geo_dist, 1, countd, dist = 5000)

# Dist to nearest service station
min2 <- function(x)min(x[x > 0])  # exclude self; x > 0 
locs$dist_1 <- apply(geo_dist, 1, min2)

The distribution of distances to the nearest service station peaks at around 1km (1000m). The most remote service station appears to be 28 Columbus Street, Ivanhoe NSW 2878, though recall that geocoding failed for some stations, so we do not have a distance for every station.

hist(log10(locs$dist_1), breaks=100, axes=FALSE,
     main="", col="cornflowerblue",
     xlab="Distance to nearest service station (m)")
magicaxis::magaxis(side=1, unlog=1)
axis(2)
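
The most remote station mentioned above can be looked up directly from the nearest-neighbour distances; a one-liner, assuming 'locs' still carries the Address column from the geocoding step:

# Station with the largest distance to its nearest neighbour
locs[which.max(locs$dist_1), c("Address", "dist_1")]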

Area served: Voronoi polygons

Related to the above is the idea that each station ‘serves’ a particular area of the state. Think of a polygon around each service station: if you are inside this polygon, that service station is the closest one to your location. These polygons are known as Voronoi polygons, and are easily computed in R.

library(deldir)
voro_plain <- deldir(locs$lon, locs$lat)

I say easily, but the edge effects are massive! Clearly this is not the desired result, and a reminder to always visualize your analysis to make sure you did not do something stupid.

par(mar=c(3,3,1,1))
plot(voro_plain, wlines="tess", wpoints="real", 
     lty=1, col=c("black","grey","red","black","black"),
     cex=0.6, xlab="", ylab="")
box()

Instead we will use the New South Wales border to trim the Voronoi polygons. This is not an ideal solution either, since stations will be competing with stations on the other side of each border (for which we have no data), but it is an improvement on the above.

# Make NSW polygon
# Similar to oz::oz(), but coordinates are ordered by state.
oz2 <- read.csv("http://www.remkoduursma.com/files/ozdata.csv")
nsw <- filter(oz2, state == "NSW") %>%
  dplyr::select(long, lat) %>% as.data.frame

# Convert to a SpatialPolygons object
coordinates(nsw) <- ~long + lat
nswp <- Polygon(nsw)
nswpg <- SpatialPolygons(list(Polygons(list(NSW=nswp), "NSW")))

# Using a zero-width buffer cleans up many topology problems in R.
nswpg <- rgeos::gBuffer(nswpg, byid=TRUE, width=0)

# We use the coordinates returned by deldir to assign Voronoi
# areas to each service station, because for some reason a few dozen 
# polygons cannot be computed (so a simple cbind is not possible).
coorsx <- voro_plain$summary[,c("x","y")]
coordinates(coorsx) <- ~x+y

# Voronoi polygons with SDraw's voronoi.polygons()
library(SDraw)
vp <- voronoi.polygons(coorsx)
voro_buffer <- gIntersection(vp, nswpg, byid=TRUE)

# Now lookup area of each polygon
# We have to do this the hard way, because not all polygons
# are returned (perhaps some failed?).
get_area <- function(point, spoly){
  g <- gContains(spoly, point, byid=TRUE)
  if(any(g)){
    pol <- spoly[which(g)]
    if(!is.null(pol)){
      geosphere::areaPolygon(pol)
    }
  } else {
    NA
  }
}

# Loop through points, look up Voronoi polygon areas.
# Note that `apply` won't work with a SpatialPointsDataframe,
# at least not like this.
area <- c()
for(i in 1:length(locs_sp))area[i] <- get_area(locs_sp[i,], spoly=voro_buffer)

# Add to 'locs' dataframe with locations of Service Stations.
locs$area_voronoi <- area
plot(voro_buffer, col="lightgrey")
with(locs, points(lon, lat, pch=16, col=scales::alpha("red",0.5), cex=0.5))

The figure below shows each service station, colored by the ‘area served’ (i.e. the area of the Voronoi polygon), with more yellow indicating smaller areas.

library(ggmap)

nsw_map <- get_map(c(lon = 147.5, lat = -32.5), 
                    source = "google", zoom=6, maptype="terrain")
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=-32.5,147.5&zoom=6&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
ggmap(nsw_map) + 
  geom_point(aes(x=lon, y=lat, col=log10(area_voronoi+1)), data=locs) +
  scale_colour_gradientn(colours=rev(heat.colors(10))) + 
  theme(legend.position = "none") +
  geom_path(aes(x=long, y=lat), data=as.data.frame(nsw), col="darkgrey", lwd=0.6) +
  labs(x="", y="", caption="Service stations coloured by area served (Voronoi polygons)")

Remoteness, distance to coast

Eventually I want to build a model that predicts fuel price based on location and time of year. To do so, we have to start adding some features of interest. The Atlas of Living Australia provides a ‘remoteness index’, which seems interesting since at first sight fuel prices are much higher in more remote locations. Though the ALA provides API services, I did this the quick way: visiting this page, uploading a CSV with lat and lon, and downloading a CSV file with the remoteness index and the distance to the coast. You can read more about how the (unitless) remoteness index is calculated here.

library(dplyr)
library(janitor)

# Unique locations only, keeping stations with valid coordinates.
# The ALA service wants a Date added to the dataframe,
# since remoteness is a time-dependent variable. 
locsub <- filter(locs, !is.na(lon), !is.na(lat)) %>%
  mutate(eventDate = "2016-6-1")

# Save to disk...
write.csv(locsub, "data/NSW_fuel_locations_lonlatonly.csv", row.names=FALSE)

# ... so that it can be uploaded at the ALA:
# http://spatial.ala.org.au/webportal//#
# See that page for help on how to select variables. I just picked
# 'remoteness', and distance to coast, which I saved again locally:
remo <- read.csv("data/NSW_fuel_ALA_remoteness_locations.csv", stringsAsFactors = FALSE) %>%
  clean_names %>%
  dplyr::select(locality, latitude_original, longitude_original,
                remoteness_index, distance_to_coast) %>%
  rename(Address = locality,
         lat = latitude_original,
         lon = longitude_original,
         remoteness = remoteness_index,
         dist_to_coast = distance_to_coast) %>%
  mutate(dist_to_coast = 100 * dist_to_coast)
write.csv(remo, "data/NSW_fuel_ALA_remoteness_locations_cleaned.csv", row.names=F)

# And merge onto 'locs'.
remo_m <- dplyr::select(remo, Address, remoteness, dist_to_coast)
locs <- left_join(locs, remo_m, by="Address")

A plot of New South Wales, where every service station is colored by its ‘remoteness index’:

# Figure
library(oz)
oz(sections=c(4, 13:15), col="darkgrey")
cols <- colorRampPalette(c("yellow","darkorange","red"))(10)
with(locs, 
     points(lon,lat, pch=21, cex=0.95, col="white",
            bg=cols[cut(log(remoteness+1), 10)]))
title(main = "Darker colour = more remote", line=1, cex.main=0.8, 
      adj=0, font.main=3)

This map shares some features with the median fuel price, which I first calculate for each service station:

# The spatial features (lon, lat, remoteness, nr_5km, dist_1) live in
# 'locs', so join them onto the price data first.
u91_mean_dat <- filter(fuel, FuelCode == "U91") %>%
  left_join(locs, by = "Address") %>%
  group_by(Address) %>%
  dplyr::summarize(Price = median(Price, na.rm=TRUE),
                   lon = first(lon), lat = first(lat),
                   Brand = first(Brand),
                   remote = mean(remoteness),
                   nr_5km = mean(nr_5km),
                   dist_1 = mean(dist_1)
                   ) %>%
  filter(Price < 160) %>%
  mutate(Brand = forcats::fct_lump(as.factor(Brand), 6))

And finally we see some relationship between the median fuel price recorded for each station (for just U91 fuel, for now) and the remoteness index of the station. We also see some effect of the brand, with ‘Metro Fuel’ (which mostly serves Sydney and Newcastle) the cheapest, and Coles Express quickly becoming expensive in slightly more remote locations.

ggplot(u91_mean_dat, aes(x=remote, y=Price, group=Brand, col=Brand)) +
  geom_point() +
  geom_smooth(method="lm", se=F) +
  theme_bw() + 
  theme(
    legend.position = c(.95, .05),
    legend.justification = c("right", "bottom"),
    legend.box.just = "right",
    legend.margin = margin(6, 6, 6, 6)
  ) +
  scale_colour_manual(values=RColorBrewer::brewer.pal(7,"Set3")) +
  labs(x="Remoteness index", y="Median fuel price (U91) ($ cents)")

We see another interesting relationship with the number of service stations in a 5km radius: the fuel price drops quickly when more service stations are nearby. However, there is a lot of variation, particularly for low numbers of stations. Presumably other factors, like distance from the supplier and affluence of the region, come into play as well.
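
The plot behind that statement can be made along the same lines as the remoteness figure; a sketch using the same u91_mean_dat summary:

library(ggplot2)

ggplot(u91_mean_dat, aes(x = nr_5km, y = Price)) +
  geom_point(alpha = 0.4, col = "cornflowerblue") +
  geom_smooth(method = "loess", se = FALSE, col = "black") +
  theme_bw() +
  labs(x = "Number of service stations within 5 km",
       y = "Median fuel price (U91) ($ cents)")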

Conclusions

So far, I have spent most of my time cleaning and organizing the data, obtaining spatial coordinates of the stations, and doing some basic exploration of spatial patterns. I was able to show that fuel is cheaper in less remote, more competitive environments, but many questions remain about what drives the variation in fuel prices. To be continued.

TAGS: SPATIAL, TEMPORAL