So you want to query Australian electricity data?

A guide to AEMO's MMS dataset


This guide is for researchers who want to analyse data about Australia's electricity industry. Fortunately there is far more public data available for this sector than almost any other industry, or even any other country's electricity market. Unfortunately there are many undocumented aspects of the data which can trip up newcomers. The purpose of this post is to highlight some of those 'gotchas'.

The focus here is on how to dive deep into the data, for those who intend to spend hours performing bespoke analysis, mostly for queries which require data on a timescale smaller than 1 day. Before you spend the time on this, first check whether you can quickly obtain the data you want from Open Electricity. Other high-level statistics are available from the AEMO Data Dashboard and the IEA. With those sources you can quickly answer questions like:

If you want to answer something more bespoke and complex, this guide explains how.

This article also includes a quick overview of the design of the market. I will list some acronyms and terms which newcomers will not be familiar with.

I do not expect anyone to read this from top to bottom. Rather, you should read the first few sections, then skim the rest, using ctrl-f or an LLM to find relevant content.

Quickstart

What data is available?

Australia's National Electricity Market (NEM) includes Queensland, New South Wales, the ACT, Victoria, South Australia and Tasmania. The market is operated by the Australian Energy Market Operator (AEMO). (Typically "AEMO" is used in sentences on its own, instead of "the AEMO".) The NEM does not include Western Australia or the Northern Territory. Those are separate grids, with separate data. Western Australia's Wholesale Electricity Market (WEM) is also operated by AEMO, but the market rules, data schemas and data pipelines are different. This post is focused on Australia's NEM, and the data in the "Market Management System" (MMS) dataset.

The publicly available MMS data is very comprehensive. If you have experience with data from other electricity markets/grids, you will be pleasantly surprised by how much data there is. If you are an academic with a research question, and you have not yet chosen a particular grid, you should probably just choose Australia because the data availability is so good.

The data includes:

You can estimate the energy revenue of each generator. The exact invoice amount each participant is paid is not public. (Generators are paid for their energy, but they also pay a wide range of fees, e.g. for wind/solar forecast inaccuracies.) You can estimate this amount to within a few percent. However doing so requires a great deal of expertise, which is beyond the scope of this post. Estimating just energy revenue/cost is straightforward.

The MMS dataset does not include information about green products such as Australia carbon credit units (ACCUs). For that data you will need to search elsewhere.

This dataset also does not include any information about the costs incurred by each generator, only their output and revenue. For estimates of costs, you can look at the CSIRO's GenCost model, or AEMO's Integrated System Plan (ISP).

Most data is published publicly every 5 minutes. Some sets of commercially sensitive data are published with a deliberate delay of a few days (e.g. bids). The online dataset goes back to 2009, although some tables within that do not go back as far.

The Tables

AEMO's dataset contains many different 'tables'. They are called this because AEMO expects market participants to load the data into tables in a SQL database. (If you are an analyst querying CSV/parquet files with Pandas, these may each be a different dataframe. The concept is the same.) There are hundreds of tables available to market participants. Some of those are not available to the public (e.g. settlement data), but most are. The most important ones are:

Some other important tables are listed in the Nemosis wiki.

There are conceptually related groups of tables.

CO2 Emissions Data

Emissions data is elusive. The GENUNITS table contains emissions intensity per generator (which you can then join with power data to get emissions). I have found region-level daily emissions data in undocumented tables. The details of where to find these undocumented files, and an example query, are shown later.

Understanding the Market Structure

If you are already familiar with the market design and just want to read about the data, skip ahead to the next section.

Compared to other countries, Australia's electricity market is very simple. Other markets can feel like this:

xkcd comic

For example, Germany cannot directly subsidise renewables, because of EU state aid laws. So the government offers contracts for difference (CfDs) to shield investors from the risk of low wholesale prices. They block low price signals from reaching generators by inserting the taxpayer's wallet in the middle. The generators respond to the artificially-high prices which they see in the way you would expect any business to, by producing more. This drives the wholesale price down, which causes fossil fuel generators to produce less, which is the point of the policy. However the government does not like prices below zero, so they added carveouts to halt CfD payments during negative prices. Now solar/wind investors are re-exposed to the risk of low prices, even though shielding them from that risk is why these CfDs were created.

Australia's market is far simpler. The NEM is an "energy only" market. (Well, almost. The federal and state governments keep fiddling with the nicely designed market by adding some capacity products, CfDs and other patchwork schemes. See my master's thesis for an explanation of why capacity markets are unnecessary in a market with such a high ceiling price.) There is no day-ahead market. AEMO publishes forecasts every 5 minutes which guide the market participants, and then everyone is paid (or pays) the one energy price.

A note on terminology: where academic economists say "supply" and "demand", in the industry we tend to say "generation" and "load".

Prices

Energy prices have a legally defined maximum "ceiling". This is indexed each year, and is currently around 20,000 $/MWh. This is far higher than most countries. (To understand why this is a great thing, see my master's thesis.)

The minimum is below zero, at -1000 $/MWh (not indexed). Yes, prices can go negative. South Australia has the most negative electricity prices in the world. It is unusual to have a day in Australia without negative prices at some point. From a data science perspective, this means that taking logarithms of the price to do regressions does not work.

Electricity prices are very volatile. Price and revenue data is highly skewed. Outliers are of central importance. For example, it is normal for a generator to make as much profit in the top 10% highest-priced hours each year as the rest of the year combined. (I show this with an example query later.) I have seen a consultant's report commissioned by AEMO where they simply deleted intervals with negative prices and very high prices. Do not do this. This is incorrect and will give you meaningless results.

Predispatch is Not a Day-Ahead Market

The NEM does not have a day-ahead market.

"Trading days" start in the 4:00-4:05 interval, and finish in the 3:55-4:00 interval the next calendar day. Participants need to submit bids each trading day by 12:30pm. They can submit bids much further in advance. If no bids are submitted for a given day, AEMO copy-pastes the last bid which was submitted for that participant. AEMO takes these bids, and a prediction of load, rooftop solar output etc, and runs the same code they use for the actual real-time dispatch, but in advance. From this they can predict prices, and the power flows from each generator, across each transmission line, whether constraints are binding etc. This is called "predispatch". (Whereas "dispatch" is the actual live instruction and price every 5 minutes.) This prediction is updated every 5 minutes, for the next 24-48 hours. (The specific tables are described later.)

The crucial thing to understand here is that this is merely a prediction for planning purposes. No generator is paid at the predispatch price, so it is not a real price, nor a real market.

The 12:30pm cutoff is not as important as it sounds. After the 12:30pm cutoff, generators can still "rebid". The only two restrictions on "rebids" vs "bids" are:

Rebids can be submitted at any time. There is no "gate closure". For example, a generator can submit a rebid at 01:04:50 for the interval 01:05-01:10, and it will be accepted. (There are a few seconds of lag for IT reasons, not regulatory ones. So 01:04:59 may be too late.) From 01:05:00 AEMO will start their calculations based on all the bids and other information they have. They will publish dispatch instructions (power levels) and prices around 01:05:10 to 01:05:40. (Note that technically AEMO tells generators what to do several seconds after they are supposed to have started doing it.)

The timeline for bids and rebids is explained by Watt Clarity. Bid data is discussed in more detail later. If you want a more citeable source about the lack of a day-ahead market, check out this interesting paper by Gilmore, Nolan and Simshauser, where they estimate the levelised cost of FCAS.

Two-Dimensional Time

AEMO's data often contains predictions of the future. e.g. price, power flows, constraint binding. These predictions are updated often, yielding a history of different versions of predictions which apply to the same interval. In the data this means you may see two datetime columns.

Here is a screenshot of a paid application called ez2view (which takes MMS data and displays it graphically), by a company called Global Roam (the authors of Watt Clarity). The screenshot is from this article. Each cell in this table is a prediction of price from predispatch. Each column is a time interval which the prediction applies to. Each row is a time when the prediction was made. The screenshot graphically shows how predictions changed over time.

Screenshot of ez2view

The top-left corner of the graph is blank, because these would be "predictions" of the past (which do not exist). The bottom-right corner is blank because these are predictions of the distant future, which do not exist. (Or at least, not in this MMS table, which only goes out one hour. For predictions further out, see the price prediction section further down.) Every 5 minutes, AEMO publishes a new set of predictions, which appear as a newly inserted row up the top of this graph. Each row is shifted one interval to the right, because the start and end of the predictions are for intervals which will occur 5 minutes later.

For example, the third row is mostly blue (relatively high values). This means that around 10:51 (2 intervals before the one ending at 11:05) AEMO predicted high prices for the next 12 5-minute intervals (10:50-10:55 until 11:45-11:50). The red "30" value at the start of row 2 shows a relatively low value. The cell below (71) is blue. This shows that the final price of 30 was a surprise (30 is far lower than 71).

If you are looking at historical predictions and wanting to know what market participants expected at a certain time, you will need to find the latest row of data in this graph (e.g. RUN_DATETIME < t in P5MIN_REGIONSOLUTION, or sometimes LASTCHANGED) and then use a different column (e.g. INTERVAL_DATETIME or SETTLEMENTDATE) to see which interval it applied to. Later on there is an example query showing this.
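As a preview of that kind of query, here is a minimal sketch using Polars (introduced properly later), assuming you have already saved P5MIN_REGIONSOLUTION as Parquet with its datetime columns parsed; the file pattern and column selection are illustrative:

import polars as pl
from datetime import datetime

t = datetime(2024, 1, 1, 10, 50)  # the moment whose latest available prediction we want

latest_run = (
    pl.scan_parquet("*P5MIN_REGIONSOLUTION*.parquet")
    .filter(pl.col("RUN_DATETIME") <= t)                             # runs published at or before t
    .filter(pl.col("RUN_DATETIME") == pl.col("RUN_DATETIME").max())  # keep only the latest such run
    .select(["RUN_DATETIME", "INTERVAL_DATETIME", "REGIONID", "RRP"])
    .collect()
)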

Watt Clarity also explain this concept.

Scheduled Vs Non-Scheduled

A "scheduled" generator is one which submits bids to AEMO, and is told what power level to produce at by AEMO, every 5 minutes. Most big generators are scheduled. Most loads are non-scheduled (you do not talk to AEMO before turning on your dishwasher). Some large loads (such as smelters) are scheduled.

A "non-scheduled" generator generates whenever it wants. These tend to be small hydro plants and biowaste generators. If you find a generator mentioned in some tables but not others, it may be non-scheduled. To find out, check the NEM Registration and Exemption List.

A "semi-scheduled" generator is a type of scheduled generator where the instruction from AEMO to generate a certain power level is a one-sided upper limit. This is for wind and solar, which are allowed to produce less after an unexpected drop in wind/sunshine, but sometimes must waste extra power when there is an unexpected increase in wind/sunshine.

Rooftop solar is none of these categories. It is treated as negative load. This is described later in the Rooftop Solar section.

What is FCAS?

"Frequency Control and Ancillary Services" (FCAS) are additional markets for non-energy products. Of course they still involve moving electrons down wires to transmit watts and joules. However the objective of these services is not to transmit energy from a generator to an consumer's device, but rather to keep the grid running. You can think of them as "reliability services".

In Australia, generators providing FCAS are paid for being ready. If the FCAS capacity is not needed, they are still paid. If it is needed, the only additional payment is for the increase/decrease in energy they provide, at the normal energy price. This pricing approach is discussed in my master's thesis.

Similar services in other countries are called "balancing", "mFRR", "aFRR", "FCR", "raise", "lower" etc. The problems being solved are the same, but the mappings between products are not one-to-one. For a more detailed introduction to Australia's FCAS market, see this interesting paper by Gilmore, Nolan and Simshauser about estimating the levelised cost of FCAS.

The time spans of contingency FCAS products are:

The directions are:

To understand the relationship between power and frequency, see my bachelor's thesis.

In the data you will find combinations of these timescales and directions. For example, in DISPATCHPRICE, RAISE60SECRRP is the regional reference price of the 60 second raise product. The 1 second products were introduced relatively recently, so sometimes in the data and documentation those columns are not adjacent to the others. For data prior to 9 October 2023, those columns will be empty (or missing). If you use Nemosis and find it missing for recent data, use this trick to add it.

It seems that most academic researchers ignore FCAS, because it is technically complex. I believe this is unwise. FCAS is important, and increasingly so. For many years, batteries earned more profit from providing FCAS than energy. (See Watt Clarity.) In a world with a huge amount of zero-marginal-cost renewables, energy profits would be approximately zero most of the time. Most profit would come from the few periods of ceiling price spikes during scarcity, and from the FCAS services keeping the low-inertia system stable. If your research question is about what the future will look like, especially in high-renewables scenarios, you should consider FCAS.

There are some obscure ancillary services which are not FCAS (e.g. black start). These are typically handled through bespoke procurement on the timescale of years, and do not appear in the MMS dataset.

If you ever encounter the string ENOF, this refers to energy. (I suspect it stands for "ENergy OFfer", but that is just my guess. See the BIDTYPES table.)

How are Batteries Treated?

Battery data is often counterintuitive and inconsistent. I suspect this is because the rules about batteries have evolved over time. For example, the first battery (Hornsdale) is actually classified as a wind generator.

If you look at the list of registered participants (file "NEM Registration and Exemption List", sheet "PU and Scheduled Loads") you will see some assets listed with a "Dispatch Type" of "Bidirectional Unit". These are all batteries. Although one battery (Kennedy Energy Park) is listed as a "Generating Unit" (DUID KEPBG1) for discharging, and a "Load" (DUID KEPBL1) for charging. (This particular battery is actually non-scheduled.)

Sometimes generators show negative power levels. e.g. a solar farm needs to draw a trickle of power to 'keep the lights on' during the night. (This is for the local IT systems, but can also be iron losses in the transformer.) Less intuitive is that loads can have both positive and negative values. I still do not understand why this can happen, especially for a bidirectional unit which has been split. In these cases I tend to use heuristics to figure out the convention. (e.g. is the sum of negative values larger or smaller than the sum of positive values? Do I expect this DUID to generate more than it consumes, or consume more than it generates, and by roughly how much?)

As more and more sites colocate wind/solar with batteries (despite the lack of spot revenue benefit), analysing the data for these will become increasingly difficult.

Wholesale Demand response

The NEM does have a formal demand response market at the wholesale level. ("Wholesale Demand Response", or "WDR".)

The point of electricity retailers is that they hedge consumers against volatile spot prices. The purpose of the WDR market is to let consumers re-expose themselves to the spot market, selling "negawatts" at high spot prices. It is simple in theory, but complex in practice.

In terms of the data, you may see WDR participants (including aggregators) in the data where you expect generators. Since these are not real megawatts, you often need to drop these rows. e.g. If you want to find the fraction of energy generated by a particular fuel type, you should exclude WDR values in the denominator, because that energy is not real, it is counterfactual.

In practice the total volume of the WDR market (and other demand response schemes too) is smaller than in other countries.

Constraints

In Economics 101 we are taught that price and quantity are set by the intersection of supply and demand curves. In electricity markets the aggregate supply curve is explicitly uploaded to the market operator (e.g. AEMO). The market operator also has the demand curve from scheduled loads, and they add a prediction of non-scheduled load (i.e. most load) at the ceiling price for the rest of the aggregate demand curve. However the intersection of these two curves is rarely the actual price. In practice we are physically unable to get power from the cheapest generator to the marginal consumer. (If we always could, it would mean we are spending too much on network infrastructure.)

Sometimes this is because transmission wires only have a certain maximum capacity. Sometimes there are other more complicated constraints about voltage, fault levels, inertia etc.

AEMO accounts for these directly in their decision, as algebraic constraints considered by linear optimiser software. (This is a much more efficient approach than the typical European one, where constraints are ignored during bid clearing, then adjustments are made afterwards to account for them.)

Australia's grid is much more constrained than most grids. This is probably because it is very large and the population density is low. For many research questions, you should be considering constraints. If you see something unexpected in the data (e.g. generators bidding far below their marginal cost), you should ask yourself whether it is because of constraints.

Information about whether constraints bind each interval (or are forecast to bind) is in the data. This data tends to be very large. Only download it if you need it.

Understanding constraints is very difficult. e.g. what does F_I+ML_L5_0400 mean? The table GENCONDATA describes what each constraint is about. However the descriptions are quite difficult for the uninitiated to understand. AEMO has built a tool to explain each constraint in plain English. Unfortunately this is only available to market participants, not researchers.

GENCONSET and GENCONSETINVOKE contain data about when constraints are "invoked". "Invoked" means that it is considered by NEMDE. "Binding" means that the limit has been hit, and it is actively preventing a more economically efficient dispatch solution. A constraint which is invoked but non-binding happened to have no effect.

If you see Out = Nil, that means that this constraint is invoked when there is no outage. i.e. it is invoked the majority of the time.

Constraints with names that start with a # are private. Only one generator is impacted, so the details are not published, or not published live.

Green Products

"Large-scale green certificates" (LGCs) are a form of renewables subsidy for large renewable generators. In Europe these would be called "guarantees of origin". Each eligible generator earns one certificate per MWh generated, which they can sell. Retailers and large loads have an obligation to buy a certain amount each year, which drives the price up. AEMO's MMS data does not include LGC prices. (Although of course the price of LGCs drives bidding behavior. If you are trying to understand wind/solar bidding, you must consider LGCs.)

"Small-scale technology certificates" (STCs) are subsidies for smaller installations (typically rooftop solar). A wide range of subsidised and unsubsidised feed-in tariffs for rooftop solar exist, varying by region and installation age.

Unfortunately Australia does not have a carbon price. To the best of my knowledge, we are the only country to ever repeal a carbon price. (Arguably the carbon price cost the Gillard government the election in 2013.)

Other Terms and Acronyms

LOR stands for "Lack of Reserve". This is a prediction by AEMO that standard market clearing may not provide enough energy, resulting in load shedding. In most cases the market reacts to this prediction (e.g. cancelling generator maintenance outages to generate at high prices), so the feared outcome rarely happens. This is described more in this Hachiko article.

Where is the Documentation?

The official documentation for the meaning of each table, each column and their data type is here. The data types are Oracle SQL data types.

Sometimes the explanations in this documentation are useful. For example, the definition of UIGF in DISPATCHREGIONSUM is "Regional aggregated Unconstrained Intermittent Generation Forecast of Semi-scheduled generation (MW)". (Unconstrained means that AEMO is not telling wind and solar generators to turn down and 'spill' wind/sunshine. Semi-scheduled is what most wind and solar are. AEMO tells them what to generate, but it is only an upper limit. In contrast coal and gas are scheduled. They cannot generate less or more power than they are told.)

Some documentation is not useful. For example in table DISPATCHREGIONSUM the column EXCESSGENERATION is defined as "MW quantity of excess". What is "excess" energy? Is that when a solar/wind generator exceeds its UIGF forecast? Is it when a generator exceeds its dispatch target? Is it the net or gross export for the region? The writers of this schema documentation make strong assumptions about the reader's background knowledge.

When you find that documentation lacking, you can check the Nemosis Wiki and Watt Clarity.

For specific terms such as "bidirectional unit", "wholesale demand response" etc, you may need to search elsewhere on AEMO's website or the glossary of the NER to find answers. Note that AEMO sometimes restructure their website in a way which breaks bookmarks, search engine results and hyperlinks within their PDFs. Aside from that, their website often has very high-level information aimed at the general public, and some extremely niche detail, without much of a middle ground for researchers who want to understand concepts without trading in the market.

There is no machine-readable schema available (e.g. a json file containing all column names and types). I had written a crawler to scrape the metadata, but then AEMO changed the structure of this documentation page in a way that broke my crawler, and made it harder for humans to browse (by mixing tables on the same iframe so ctrl-f may find columns for the wrong table). If you are interested in a machine-readable schema, let me know. AEMO do publish SQL scripts (privately, for market participants only) which create empty tables with the right schema in Oracle or Microsoft SQL Server. You could parse those scripts to get the schema. (I have done that in the past.) Those scripts have subtle inconsistencies which make me suspect that they are hand written. If so, it seems possible that even internally, AEMO does not have a single machine-readable schema as their source of truth.

Where is the Data?

Most of the data is in the "MMSDM", on a public website called "Nemweb". The root URL is https://www.nemweb.com.au/Data_Archive/Wholesale_Electricity/MMSDM/. You can explore those folders to get a sense of the structure. The files you want are probably the ones like these:

https://www.nemweb.com.au/Data_Archive/Wholesale_Electricity/MMSDM/2025/MMSDM_2025_09/MMSDM_Historical_Data_SQLLoader/DATA/

(Note the year and month in the URL.) The filename mostly (but not always) corresponds to the table. e.g. PUBLIC_ARCHIVE#DISPATCHPRICE#FILE01#202509010000.zip contains data for the DISPATCHPRICE table. The large tables may be split into multiple files per month.

The other folders contain scripts and the same data in other forms used to load the data into an Oracle SQL database. This is what AEMO expects market participants to do. However it is incredibly complex and error-prone. For example some scripts make assumptions about column order, but the data files do not necessarily have a consistent column order across months. The schema itself has changed over the years, and the historical scripts do not necessarily cope with that well. From personal experience, my advice is to not touch that stuff unless you already have extensive experience with PDR Loader and already have an Oracle SQL database.

The P5MIN_* tables are in both DATA/ and P5MIN_ALL_DATA/. As far as I can tell, the files are identical, other than the first row (which is metadata) and the filename. Most PREDISPATCH* tables are extremely large. (PREDISPATCHPRICE is not particularly large though.) The data for these tables in DATA/ is just a subset of the full table. The full dataset for those tables is in PREDISP_ALL_DATA. Further down there is an example of finding and using PREDISPATCH and P5 data.

The NEMDE folder contains the detailed inner workings of the linear optimiser used by AEMO to decide how much each generator should produce. This is useful if you want to query "price setter" data to understand who set the price. Note that you should be cautious when drawing conclusions from that data. See the guide from Watt Clarity.

The live (5 minutes old) data is available in /Reports/CURRENT. The file formats are more complicated because they are several CSV files with different columns concatenated into one file. You probably do not need that unless you are actively operating in the market.

How to download the data?

Data Size

The biggest two tables add up to about as much as all the rest combined. So only download what you need.

For the full decade or more of historical data, most of the tables are hundreds of megabytes (as compressed CSVs). Price forecasts are tens of gigabytes. Bidding data is hundreds of gigabytes. Price setter data is also hundreds of gigabytes.

Nemosis

There are a few open source tools out there for downloading and parsing AEMO's data. Nick Gorman from UNSW has written a great library called Nemosis.

Here is an example of how to obtain the DISPATCHPRICE data as a Pandas dataframe:

from nemosis import dynamic_data_compiler

start_time = '2017/01/01 00:00:00'
end_time = '2017/01/01 00:05:00'
table = 'DISPATCHPRICE'
raw_data_cache = 'C:/Users/your_data_storage'

price_data = dynamic_data_compiler(start_time, end_time, table, raw_data_cache)

Personally I prefer Polars to Pandas. If you have not heard of Polars, it is like the new Pandas. It is about 30 times faster, and I find the syntax easier to read and write. (No more dreaded errors about "A value is trying to be set on a copy of a slice from a DataFrame", nor clunky indexing like df[df[col]] and df.loc[df['col'] == df['col'], 'col'].) For this, you can use the cache_compiler to just save the data as Parquet with Nemosis, then query those files on disk with Polars. (Use the same approach for DuckDB, Arrow, R etc.) For example:

from nemosis import cache_compiler
import polars as pl
import os

start_time = '2024/01/01 00:00:00'
end_time = '2024/01/01 00:05:00'
table = 'DISPATCHPRICE'
raw_data_cache = '/home/matthew/Data/nemosis'

# download zips of CSV files 
# extract them, convert them to parquet with nemosis
cache_compiler(start_time, end_time, table, raw_data_cache, fformat='parquet')

def parse_datetimes(
    lf, cols=["SETTLEMENTDATE"], format="%Y/%m/%d %H:%M:%S"
) -> pl.LazyFrame:
    for col in cols:
        lf = lf.with_columns(
            pl.col(col).str.strptime(pl.Datetime(time_unit="ms"), format=format)
        )
    return lf


# query the data with Polars
lf = (
    pl.scan_parquet(os.path.join(raw_data_cache, "*DISPATCHPRICE*.parquet"))
    .pipe(parse_datetimes)
)

print(lf.collect())

Note that I am passing a wildcard filename *DISPATCHPRICE*.parquet to polars. This is because Nemosis downloads all files into the same directory. If you are processing multiple tables (e.g. P5MIN_REGIONSOLUTION and DISPATCHPRICE) you need this string to tell Polars which files to read and which to ignore.

Nemosis currently saves datetimes as strings in the Parquet file. If using cache_compiler, you need to parse them into datetimes yourself. Since this happens for every table, I factored that into a simple function with .pipe(parse_datetimes).

Sometimes Nemosis excludes columns from the data. (It includes only the most important columns, as an optimisation.) If this happens (e.g. 1 second FCAS data), you can fix that with this workaround.

Nemosis has some limitations. I am working with the maintainer to make some improvements. In particular, it loads each file into memory as a Pandas dataframe (even when using cache_compiler). This means that you cannot process the very large files (e.g. bid data) on a normal laptop.

If Nemosis works for you, skip the next sections and jump to How to Query and Understand the Data. If not, the next few sections describe the details of how to download and parse AEMO's data files yourself.

PDR Batcher and PDR Loader

AEMO created this dataset and the Data Interchange for market participants (generators, retailers etc) in the 1990s, prior to REST APIs, TLS, parquet files, distributed systems, clouds etc. They expect that anyone who wants to read the data will download a pair of applications called "PDR Batcher" and "PDR Loader". ("PDR" stands for Participant Data Replicator.) Batcher downloads the files over FTP, inside a private VPN, then Loader loads them one row at a time into an Oracle SQL Database. The documentation is here.

There are several challenges with this approach. The first is that AEMO removed this application from their website. They took it down when the log4j vulnerability became public, because this software was affected. Then when they released a patched version months later, they did so only through the private FTP server. So if you are not a participant (e.g. you are an academic researcher), you will probably not be able to even obtain the binary.

Even if you have a copy, it is extremely difficult to get working unless you know someone with experience. If you are a researcher, you do not have the time to spend weeks learning how this bespoke system works, and debugging it when it does not. The system is designed as a thick client, instead of a clear abstract API. There are many obscure mapping tables and configuration options. (e.g. How do you tell PDR Batcher which tables should be append-only, and which ones should overwrite existing rows based on certain partition columns?) The latest release did add a lot of great functionality (e.g. natively connecting to cloud storage such as S3), but my view is that if you are just a researcher (and even for some market participants) it is not worth using. e.g. I know people who tried to backfill a fresh database with all publicly available data. They ran into many issues. They asked AEMO for help, and were told that there is no simple, generalised way to backfill all the public data.

PDR Loader operates one row at a time. If you want to backfill all bid or price prediction data for the last decade, that takes weeks, even if you use large, expensive servers. This was probably because the system is optimised for operational use cases, where generators insert a few rows at a time into a row-based database, and query mostly the last few rows in each table. However if you are a researcher doing analytic queries about historical data, you will probably do infrequent, batched insertions, and your queries will scan most of the rows in a table. So a column-based approach (e.g. parquet files) is probably more suitable. Running an Oracle or Microsoft SQL Server database with 1 TB of data in the cloud is expensive. On-premise options have their challenges too. (e.g. will your laptop have enough memory even if you run the queries you want on a SQL server installed locally, connected to a slow external hard drive?) Using a more modern, column-based approach (e.g. parquet files locally or even in the cloud) will be far cheaper and gives faster query results.

The main benefit of PDR Loader over a DIY approach is that it will figure out which tables are append-only, and which are update-insert. (Or rather, you need to somehow find the configuration file that tells PDR Loader how to do this.) That is, AEMO publishes new data all the time. Sometimes the new rows should be added above/below the old rows. Sometimes the old rows should be updated. It depends on the table. (Remember, there are hundreds of tables.) If you use a DIY approach (including Nemosis), you may need to deduplicate the data when you query it. This is described later in Deduplication.

If you do want to know more about how to run PDR Batcher and Loader, and the pros and cons, check out Hachiko's guide. As they put it: "It's not broken. However it's also not easy."

Connectivity

As mentioned earlier, AEMO expects market participants to download these files and private files over FTP inside a private VPN. I think the public website is intended for researchers, which is fantastic. (AEMO if you are reading this, thank you so much!) Although I have never seen AEMO state this goal, or enumerate conditions of use. Please do not hammer their website (e.g. downloading data you do not need, or doing an unreasonable number of parallel downloads). I live in fear that one day they will simply turn off Nemweb.

Nemweb is hosted in AWS's Sydney region (ap-southeast-2). So if you are connecting from the cloud, choose something in Sydney. (For example, I am living in France at the moment. Downloading the data to my laptop is slow. If I spin up a server in Sydney the download is far faster, and then I connect to that server with Jupyter over SSH.)

Nemweb has both an HTTP and HTTPS interface. Note that if you are crawling the HTML file pages, sometimes the encrypted HTTPS pages contain links to unencrypted HTTP pages. I think they have fixed this now. However I am mentioning it because if you are doing this in a network environment where outbound HTTP (port 80) is blocked and HTTPS (port 443) is allowed, then you will get unexpected timeouts which cannot be resolved through retries.

File Formats

Quick File Format Explanation For Monthly MMSDM Data

One benefit of Nemosis is that it handles unzipping and dropping metadata rows and columns for you. This section contains details of how to do this yourself.

For the monthly MMSDM data that this post focuses on, the files are zips of CSVs, with one CSV per zip file. (Pandas can do the unzipping for us. Polars cannot do this yet.)

A CSV file might look like:

C,SETP.WORLD,DVD_DISPATCHPRICE,AEMO,PUBLIC,2025/10/08,09:02:27,001759878147550,MONTHLY_ARCHIVE,001759878147550
I,DISPATCH,PRICE,5,SETTLEMENTDATE,RRP
D,DISPATCH,PRICE,5,"2025/09/01 00:05:00",1.1
D,DISPATCH,PRICE,5,"2025/09/01 00:10:00",2.2
C,"END OF REPORT",4

The content you care about is really:

SETTLEMENTDATE,RRP
"2025/09/01 00:05:00",1.1
"2025/09/01 00:10:00",2.2

The CSVs have a header and a footer. (These have a different number of columns to the main data, which some CSV parsing libraries do not like.) Most CSV parsers can skip the header easily. You are unlikely to need that information. The timestamp in the header tells you the exact time when AEMO generated the data, which might be useful for some niche queries. Skipping the footer can be harder. Few libraries have a skip_footer option. With some libraries you may need to read the file as text first to count the rows, then tell it to only read the first N - 1 rows. Alternatively you can tell it that the "comment" character is C, but only if that comment option applies only at the start of a line. I tend to parse these with Polars, in which case you can tell it that the comment prefix is C,"END OF REPORT",.

After stripping the header and footer, you will also need to drop the first 4 columns. These are metadata columns. (e.g. in this example, DISPATCH and PRICE tell you that this data is for the DISPATCHPRICE table. However most such mappings are less obvious. You should figure out which file goes where from the filename if you can.) The names of these metadata columns vary between tables. Note that there may be overlapping column names. Some tables have a name for the 2nd or 3rd column (which is metadata) which is the same as the name of another column later on (which is data). Some libraries struggle to cope with this.

Here is an example of how to read these files with Pandas. This is the same approach which Nemosis uses under the hood. The footer is parsed as an incomplete data row, with mostly NaNs, and then that last row is dropped after reading. Since the footer has 3 columns, and we drop the first 4 columns anyway because they are metadata columns, this works.

import pandas as pd

df = pd.read_csv(
    'SOMEFILE.ZIP',  # pandas can unzip if there is only one CSV inside
    header=1, # skip first metadata row
).iloc[:-1, 4:] # drop 4 metadata columns and footer

A safer way is to use skipfooter=1, but this is far slower because it requires engine='python'.

df = pd.read_csv(
    'SOMEFILE.ZIP', # pandas can unzip if there is only one CSV inside
    header=1, # skip first metadata row
    skipfooter=1, # skip footer
    engine='python', # far slower, but required for skipfooter
).iloc[:, 4:] # drop metadata columns

You could also read the whole file once (with Pandas or something else) to count the rows, then exclude the footer with nrows.
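That approach could look something like this sketch (slower, because the file is read twice, and it needs the CSV unzipped beforehand):

import pandas as pd

# count all lines: 1 file-metadata row + 1 column-header row + N data rows + 1 footer
with open('SOMEFILE.CSV') as f:
    n_lines = sum(1 for _ in f)

df = pd.read_csv(
    'SOMEFILE.CSV',
    header=1,           # skip first metadata row
    nrows=n_lines - 3,  # read only the data rows, so the footer is never parsed
).iloc[:, 4:]           # drop metadata columns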

If you prefer Polars to Pandas:

import polars as pl
import polars.selectors as cs

lf = (
    pl.scan_csv(
        'SOMEFILE.CSV', # must be unzipped beforehand
        skip_rows=1, # skip the header
        comment_prefix='C,"END OF REPORT",' # skip the footer
    )
    .select(~cs.by_index(range(4))) # drop metadata columns
)

In base R we can use read.csv with the same trick as for Pandas (reading the footer row as data, and then dropping it).

df <- read.csv(
  "/tmp/example.CSV",
  skip=1, # skip header metadata
  header=TRUE # next row has column names
)
df <- df[df[[1]] != "C", -c(1:4)] # drop footer row and metadata columns

To use read_csv from Tidyverse in R:

library(tidyverse)

read_csv("example.CSV", skip=1) |> 
  filter(I != "C") |> # filter out metadata footer
  select(-c(1:4)) # drop metadata columns

Note that in all of these examples the column schema is inferred from the data by the library you use, based on the first thousand rows or so. For large data this might not work. In some AEMO CSVs a particular column may have only integers for the first million rows, and then in row 1,000,001 the number has a decimal component so cannot be parsed as an integer. Many libraries (e.g. Pandas, Polars) will throw an error when they encounter this row, after spending a long time parsing all the prior rows. An easy, but slow way to resolve this is to tell your library to scan the entire file to infer data types. Otherwise you will need to hard-code the schema for that particular column or all columns.
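For example, with Polars (building on the earlier scan_csv snippet; the commented-out override column is just an illustration):

import polars as pl

lf = pl.scan_csv(
    'SOMEFILE.CSV',
    skip_rows=1,                          # skip the header
    comment_prefix='C,"END OF REPORT",',  # skip the footer
    infer_schema_length=None,             # scan the whole file to infer dtypes (slow but robust)
    # schema_overrides={"RRP": pl.Float64},  # or pin the problematic column(s) instead
)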

File Format Details And Other Data Files

For other AEMO data (e.g. the daily data) the files are zips of many zips of many CSVs. (I have never seen AEMO publish zips of zips of zips. If you are programmatically unzipping nested zips, watch out for zip bombs.)

The daily files have a more complicated file structure. You should avoid parsing these unless you really want live data (within 5 minutes). In case you do, here is an example of such a file.

C,SETP.WORLD,DVD_DISPATCHPRICE,AEMO,PUBLIC,2025/10/08,09:02:27,001759878147550,MONTHLY_ARCHIVE,001759878147550
I,DISPATCH,PRICE,5,SETTLEMENTDATE,RRP
D,DISPATCH,PRICE,5,"2025/09/01 00:05:00",1.1
D,DISPATCH,PRICE,5,"2025/09/01 00:10:00",2.2
I,DISPATCH,REGIONSUM,3,REGIONID,SETTLEMENTDATE,TOTALDEMAND
D,DISPATCH,REGIONSUM,3,NSW1,"2025/09/01 00:10:00",5.6
D,DISPATCH,REGIONSUM,3,QLD1,"2025/09/01 00:10:00",7.8
D,DISPATCH,REGIONSUM,3,VIC1,"2025/09/01 00:10:00",9.0
C,"END OF REPORT",9

The interesting thing here is that the number of columns changes throughout the file, even after removing the header and footer. This is because each CSV file is actually several CSV files concatenated together. Some libraries will refuse to read this CSV at all. There are two chunks of data you want to extract:

DISPATCHPRICE:

SETTLEMENTDATE,RRP
"2025/09/01 00:05:00",1.1
"2025/09/01 00:10:00",2.2

and DISPATCHREGIONSUM:

REGIONID,SETTLEMENTDATE,TOTALDEMAND
NSW1,"2025/09/01 00:10:00",5.6
QLD1,"2025/09/01 00:10:00",7.8
VIC1,"2025/09/01 00:10:00",9.0

To process this, you must read line by line, and look at the first character, which will be C, I or D.
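Here is a minimal sketch of that in Python; the variable names are mine, and you still need to map each (report, section) pair onto an MMS table name:

import csv

tables = {}          # (report, section) -> list of rows, the first row being the column names
current_key = None
with open("SOMEFILE.CSV", newline="") as f:
    for row in csv.reader(f):
        if row[0] == "C":                 # file header / footer metadata
            continue
        elif row[0] == "I":               # column names for the next chunk
            current_key = (row[1], row[2])
            tables[current_key] = [row[4:]]
        elif row[0] == "D":               # data row belonging to the current chunk
            tables[current_key].append(row[4:])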

Again it is worth pointing out that mapping DISPATCH and REGIONSUM to DISPATCHREGIONSUM sounds straightforward, but it is often not. Sometimes it would be something like DISPATCH_REGIONSUM, sometimes it is something completely different (especially for bids).

The 4th column is an integer. It is somehow related to the versioning of the schemas of each table. I do not understand it exactly, and have only seen one situation in all my years of working with this data where it mattered.

The final row is used as a checksum. The integer is the number of rows in the overall file (including the header and footer). I am unsure how rows with escaped newlines are counted. I normally just ignore this. The only time I have seen a checksum mismatch was with data in this format from someone other than AEMO, who calculated the checksum excluding the footer. This checksum was put in because this system was designed back in the 1990s, when protocols such as FTP could result in partial files being written to disk, and file processing commencing prior to the download finishing. With modern protocols such as HTTP over TCP, and especially with object storage such as AWS S3, getting only half a file is very unlikely. Even more so if you are splitting up your analysis script to first download all the files you want, then parse them, in the same process.

File Names

Many files have a # in the name. Depending on your shell and settings this character can be treated specially (and in URLs a literal # marks a fragment, which is why Nemweb URLs encode it as %23). To be safe, quote the filename: cat "my#file.CSV" rather than cat my#file.CSV. AEMO uses uppercase for file extensions. (.CSV and .ZIP, not .csv, .zip.)

Reusing the HTTP Connection

If you are downloading files with Requests in Python you may write something like this:

import requests

data = []
for year in range(2024, 2026):
    for month in range(1, 13):
        url = f"https://www.nemweb.com.au/Data_Archive/Wholesale_Electricity/MMSDM/{year}/MMSDM_{year}_{month:02d}/MMSDM_Historical_Data_SQLLoader/DATA/PUBLIC_ARCHIVE%23DISPATCHPRICE%23FILE01%23{year}{month:02d}010000.zip"
        resp = requests.get(url)
        resp.raise_for_status()
        data.append(resp.content)  # raw zip bytes

There's a small trick you can use to improve this: requests.Session! This looks like:

import requests

session = requests.Session()
data = []
for year in range(2024, 2026):
    for month in range(1, 13):
        url = f"https://www.nemweb.com.au/Data_Archive/Wholesale_Electricity/MMSDM/{year}/MMSDM_{year}_{month:02d}/MMSDM_Historical_Data_SQLLoader/DATA/PUBLIC_ARCHIVE%23DISPATCHPRICE%23FILE01%23{year}{month:02d}010000.zip"
        resp = session.get(url)
        resp.raise_for_status()
        data.append(resp.content)  # raw zip bytes

Reusing the HTTP connection speeds up the download and reduces AEMO's server load (so they are less likely to take down the data one day). (You should do this for other webscraping and APIs too.) I explained this in more detail here.

Retries

Nemweb's servers can be slow and unreliable. Try to add sleeps between requests, and have aggressive retries with delays and backoff. (You do not want to leave a script running overnight to download data, only to find that it gave up after 5 minutes because of a timeout.)

If you are using a requests.Session, this is simple:

from urllib3.util.retry import Retry
import requests

session = requests.Session()

retries = Retry(
    total=4,
    backoff_factor=2,
    status_forcelist=[403, 429, 500, 502, 503, 504], 
    allowed_methods=["HEAD", "GET", "OPTIONS"]
)
adapter = requests.adapters.HTTPAdapter(max_retries=retries)
for protocol in ['http://', 'https://']:
    session.mount(protocol, adapter)

Note that including 403 in a retry configuration is unusual. I added this because Nemweb often returns this as a throttling signal. So sleeping and retrying is the correct response.

How to Preprocess the Data?

Parsing Timestamps

Most timestamps are in the format %Y/%m/%d %H:%M:%S, e.g. "2025/11/10 19:30:00". AEMO data never contains just a date. Where something is logically a date not a datetime, it will appear as a datetime at midnight at the start of that day.

Special characters

DUID is the identifier for a generator. Some DUIDs contain funny characters, such as W/HOE#2 for Wivenhoe Power Station. This slash and hash can cause errors with misleading error messages. (e.g. if you are saving files to disk with hive partitioning on the DUID key, you may accidentally create an additional nested folder called HOE#2 within parent folder W.)

There are a few obscure tables which contain newline characters within the cell values. These are escaped with double quotes around the whole value. These are in tables which researchers do not normally look at, such as the Market Suspension Notices in MARKETNOTICEDATA. Most CSV parsing libraries can handle this. I am mentioning this just in case you are trying to read the data with something simple like:

with open('data.CSV', 'r') as f:
    headers = f.readline().strip().split(',')
    for data_row in f:
        cells = data_row.strip().split(',')

If you do that, it will work for the vast majority of tables, but not MARKETNOTICEDATA.

Constraint identifiers have many funny characters (e.g. #, >).

Running Out of Memory With Polars

Sometimes when you query data with Polars you may run out of memory, even when the query seems small. To prevent this:

Vertical scaling (switching to a larger computer) might not prevent you from running out of memory.

Polars uses all CPU cores by default. If the bigger computer has more cores, each core consumes memory, so the total memory usage will go up and you may still run out of memory.

You can tell Polars to use fewer cores with the POLARS_MAX_THREADS environment variable.
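For example, setting it from within Python before Polars is imported for the first time:

import os

os.environ["POLARS_MAX_THREADS"] = "4"  # must be set before the first `import polars`

import polars as pl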

How to Query and Understand the Data

Units

If no units are specified in the schema documentation or the column name itself, power is usually megawatts (MW), and energy is megawatt hours (MWh).

Deduplication

The MMS dataset has hundreds of tables. Some of them are append-only (e.g. DISPATCHPRICE). For some reference data the same dataset is republished every month, so that you can still have a usable dataset if you only download the latest month of zip files. For some data, some or all rows change regularly, in a way where they should overwrite previous rows. This is only a minority of tables. (AEMO normally errs on the side of providing higher-dimension data with version history). So you should deduplicate to grab the latest row for each entity. (e.g. sort by column LASTCHANGED then take the first row for each group of primary keys.)

One of the main benefits of PDR Loader is that it will do this insert/update for you. If you are doing something else (e.g. Nemosis) then I recommend doing some exploratory queries to check whether the unique primary keys are indeed unique.

The schema documentation lists "Primary Key Columns" and "Index Columns" for each table. Using Pandas/Polars/SQL you can deduplicate based on those. (Typically you should choose the one with the latest LASTCHANGED value.)
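A minimal Polars sketch of that deduplication, where the table glob and primary key columns are placeholders to be replaced with the ones documented for your table:

import polars as pl

lf = pl.scan_parquet("*SOMETABLE*.parquet")  # any table with a LASTCHANGED column
primary_keys = ["SETTLEMENTDATE", "DUID"]    # placeholder: use the table's documented primary key

deduped = (
    lf.sort("LASTCHANGED")
    .unique(subset=primary_keys, keep="last", maintain_order=True)  # keep the latest version of each row
)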

I have seen one particular month where the CSVs for some tables overlapped with the prior month by one interval.

If you have unexpected duplicates, check for an INTERVENTION column.

Missing Data

If you find a table in the schema documentation for which you cannot find the data (or it appears empty), it may be only published privately. (e.g. each generator sees the data for themselves. The public files are empty.) Sometimes the public/private classification applies in a row-wise way. (e.g. some constraint data is private.)

To find out if a table is public, check the documentation for that table. You can also check the list of monthly MMSDM files.

Sometimes columns are always empty. This could be because they have been deprecated, or maybe they will be populated in the future. Sometimes you just need to look at the actual data to find out if it is empty.

Sometimes data is just missing for one table for one month. (It may be present for the prior and subsequent month.) Different tables only go back historically to different starting points. This may be because that is when the relevant rules or schema changed.

If you cannot find data for a particular generator, that may be because it is non-scheduled.

Understanding Timestamps

Timestamps generally refer to the end of the period, not the start.

e.g. a SETTLEMENTDATE of "2025/01/02 03:05:00" refers to the period from 3:00 to 3:05.
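So if you ever need the start of the interval instead (e.g. for joins or plotting), subtract 5 minutes. A sketch assuming SETTLEMENTDATE has already been parsed into a datetime column (INTERVAL_START is my own name):

import polars as pl

lf = pl.scan_parquet("*DISPATCHPRICE*.parquet").with_columns(
    (pl.col("SETTLEMENTDATE") - pl.duration(minutes=5)).alias("INTERVAL_START")
)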

Pay attention to the documented definition of each field. For some power values it is the average across the 5 minute period. For most power values it is an instantaneous power value at the end or start of the period.

INITIALMW refers to the power at the start of the period. TOTALCLEARED refers to the power at the end of the interval. "Cleared" refers to what AEMO instructs or predicts. AEMO's plan is that generators will take the entire period to adjust their output, linearly. This is described in more detail in my thesis.

Some fields are logically dates not datetimes. However they will still appear in the data as datetimes, at midnight at the start of the day.

Timezones

All timestamps are in "market time", i.e. Australia/Brisbane, UTC+10, with no daylight savings.

This is true even for data which applies to regions in other time zones.

(Note that if Queensland ever adopts daylight savings, it is likely that a lot of IT systems in the electricity sector will break in a Y2K kind of way, because many people configure the timezone as Australia/Brisbane when it is technically supposed to be UTC+10.)

5 vs 30 minutes

The NEM operates on a 5 minute schedule. This was not always the case. Prior to October 1st 2021 there was a mix of 5 and 30 minutes. Bids were submitted with 5 minute granularity, and were evaluated every 5 minutes, to produce a "dispatch price" every 5 minutes, and tell generators what power level to generate at every 5 minutes. However generators were paid based on the half-hour average of 6 5-minute prices, called the "trading price". (This was a historical design choice, from when computers were less powerful.) This led to some perverse distortions, where after a ceiling price event (e.g. + 15,000 $/MWh), every generator would bid to the floor (- 1,000 $/MWh) for the remainder of the half hour, because the average would still be very high. So if you are querying data from prior to October 1st 2021, check each table to see whether the frequency of rows changes to half-hourly back then.

Even today, some tables still have a half-hour granularity. (e.g. some price forecasts, and rooftop solar power). So always check the data before writing your queries.

Region ID

Regions are the geographical states of the NEM.

Note that all the regions end in 1. If you find your query returns empty results, make sure you did not filter by just NSW, but instead NSW1. This 1 is because they thought that one day in the future regions might be split up.

The only time I have seen regions ending in something other than 1 is in ROOFTOP_PV_ACTUAL. That contains QLD1 (all of Queensland) as well as QLDC, QLDN, QLDS. My guess is that these are Central, Northern and Southern Queensland.

I have seen some older data with SNOWY1. This region was merged into NSW1 many years ago.

DUID, Genset ID etc

DUID is the identifier for each generator, battery, and scheduled load.

Watch out: DUIDs (generator identifiers) may contain funny characters such as / and #, which might need to be escaped.

For some data you have different granularities.

For example, the Coopers Gap Wind Farm has STATIONID = COOPGWF, containing one DUID (COOPGWF1). This contains two GENSETIDs (COOPGWF1 and COOPGWF2). Until 2024, the PARTICIPANTID was AGLPARFQ, i.e. AGL. Then it changed to COOPGWF. (This is a good example of how seemingly static reference data can change over time.)

Relevant tables for joining these together are DUDETAILSUMMARY, DUDETAIL, DUALLOC, STATION, STATIONOWNER. Most stations have only one DUID. Most DUIDs have only one Genset.

Note that these tables for joining IDs have a version history, so you should deduplicate to the latest record (sort by EFFECTIVEDATE descending then LASTCHANGED descending then take the first row in each group, or join with the timestamp).

Intervention

AEMO takes all generators' bids, transmission line constraints, demand forecasts etc, and they plug it into a big linear optimiser called the NEM Dispatch Engine (NEMDE), which finds the economically optimal solution. (e.g. Maybe the grid cannot get power from the cheapest generator to the consumer, so a more expensive generator elsewhere is used instead.) Sometimes the complexity of the electrical grid cannot be represented nicely in mathematics. In those cases AEMO manually intervenes to tweak the results. This is called "Intervention". The data often contains both the pure-math result, and the actual result. Interventions are rare, but the values in INTERVENTION==1 rows may be drastically different to the INTERVENTION==0 rows.

If a table contains an INTERVENTION column, you should keep only rows with INTERVENTION == 0.

The only exception I can think of is if your research question is about interventions.

There is a slightly different adjustment called "Direction", which is harder to see in the data. If you find 'out of merit' dispatch happening, Directions and Interventions are one reason why. e.g. A huge amount of gas generation in South Australia happens when the price is below the gas generators' bid/cost, because there is a requirement to always have a minimum amount of gas generation running. (As an aside, this metric is crudely defined, and this limit drastically limits the decarbonisation impact of additional solar and wind, yet was notably absent from the recent political discussions about nuclear power.)

RRP

RRP is the energy price.

It does not stand for "Recommended Retail Price". It stands for "Regional Reference Price".

ROP is the "Regional Override Price". This is a counterfactual price which was not used because some adjustment was made.

Within a region (e.g. within NSW), transmission constraints may hinder the ability to transmit power from one generator to a load in the same region. "Local prices" incorporate transmission constraints to provide the theoretical marginal cost of increasing power at each node in the network. However (for now) generators are not paid this price. This price is useful only to understand why generators may be constrained on (forced to generate even when paid less than their bid) or constrained off (forced to not generate even though their bid is lower than what they will be paid). The controversial "COGATI" proposal is to transform the NEM into a nodal network, such that generators are paid the local price, not regional price. That interesting debate is outside of the scope of this article. Just note that this has not come into effect at the time of writing (late 2025), and is unlikely to be implemented within the next few years.

You can find an explanation of price setting in Watt Clarity.

Import/Export

Interconnectors are the transmission links between the regions. Sometimes they are the kind of transmission line you would expect, with a few thick wires strung between a line of towers. Sometimes they are more of an abstraction over several smaller lines, as a "hub and spoke" simplification.

Interconnectors are bidirectional. The convention for the sign and direction is:

Positive export values mean that power is flowing away from Tasmania.

Negative values mean power flows in the opposite direction.

Losses

You may see terms such as "marginal loss factor" (MLF), "transmission loss factor" (TLF) or "distribution loss factor" (DLF).

Even within one region, power generated by one generator will be partially lost in transmission before reaching the consumer. MLF and TLF account for this. The real grid topology is very complex, so AEMO models the grid as a "hub and spoke" model, where all loads and generators are directly connected with individual, lossy lines to an imaginary central reference node within each region.

Loss factors are typically slightly smaller than 1. Sometimes they are exactly 1 (e.g. a generator connected directly to the transmission network instead of a distribution network will have a DLF of 1.) In rare cases they may be slightly above 1 (e.g. when a generator in a load center alleviates constraints).

AEMO has already applied these loss factors to most data, so it appears as if each generator were at the regional reference node, and you can generally ignore them. One case where you would care is if you want to know how much energy a generator provided including the energy lost in transmission, as opposed to how much usable energy arrived. For that you would divide the power values in tables such as DISPATCH_UNIT_SCADA by the loss factors in DUDETAIL (or DUDETAILSUMMARY). Another context is when dealing with private settlement data (e.g. SETGENDATA). This is described in more detail in Watt Clarity.
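A minimal sketch of that division with Polars, assuming you have already downloaded DISPATCH_UNIT_SCADA and DUDETAILSUMMARY to local Parquet files (the file paths and the output column name are my assumptions):

import polars as pl

scada = pl.read_parquet("DISPATCH_UNIT_SCADA.parquet")

# Current loss factors per DUID: keep the latest DUDETAILSUMMARY record.
loss_factors = (
    pl.read_parquet("DUDETAILSUMMARY.parquet")
    .sort("START_DATE", descending=True)
    .group_by("DUID", maintain_order=True)
    .first()
    .select("DUID", "TRANSMISSIONLOSSFACTOR", "DISTRIBUTIONLOSSFACTOR")
)

# Dividing by the combined loss factor approximates the power at the generator,
# i.e. including the energy that will later be lost in transmission.
scada = scada.join(loss_factors, on="DUID", how="left").with_columns(
    (
        pl.col("SCADAVALUE")
        / (pl.col("TRANSMISSIONLOSSFACTOR") * pl.col("DISTRIBUTIONLOSSFACTOR"))
    ).alias("SCADAVALUE_AT_GENERATOR")
)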

TRK Tables

There are many table names ending in TRK. These contain metadata about version history. (e.g. STATIONOWNERTRK contains version history metadata for STATIONOWNER.) You probably do not need to look at them.

Bids

Bidding data is quite difficult to query, mostly because it is a tremendously large volume. (Terabytes, when in uncompressed CSVs.) So if you are planning on using some simple Pandas code, think again. You probably need something more advanced. (Polars on a laptop might be sufficient, if you are careful with memory management. I listed some tips earlier.) I might write another blog post focusing on bidding data. Let me know if you are interested. For now, here are some preliminary notes.

I already described some of the timing and structure of bids in an earlier section.

The relevant tables are BIDDAYOFFER, BIDOFFERPERIOD, BIDPEROFFER_D, BIDDAYOFFER_D, BIDPEROFFER.

Sometimes the filenames for bidding data files on Nemweb are for a different bidding table to the contents of that file.

Check the column names against the schema documentation to confirm which table it is.

If you want just the price bands for a given day, find the row in table BIDDAYOFFER with the largest OFFERDATE, for a given tuple of DUID, BIDTYPE, DIRECTION, SETTLEMENTDATE.

If you want to look at the actual bid volumes, use BIDOFFERPERIOD. TRADINGDATE is a date (even if it looks like a datetime), referring to the trading day, which runs from the 4:00-4:05 am interval to the 3:55-4:00 am interval the next calendar day. So it does not align with calendar days. You should construct a datetime at 4 am on that date, then add 5 minutes multiplied by PERIODID. Then delete rows where the offer was lodged after the interval had already started (OFFERDATETIME >= INTERVAL_START). Then, to deduplicate, find the row with the largest OFFERDATETIME for each DUID, BIDTYPE, INTERVAL_START. Note that this last step requires a lot of memory, even if you deduplicate each month individually. (I err on the side of caution by deduplicating on a per-month basis, not per file, if there are many files per month.) Finally, join each row with BIDDAYOFFER to get the 10 price bands that the 10 volume columns correspond to.
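Here is a rough sketch of that recipe with Polars, for one month of data. The file paths are assumptions, I ignore the DIRECTION column (add it to the group keys if your data has it), and I have not worried about memory:

import polars as pl

volumes = (
    pl.scan_parquet("BIDOFFERPERIOD/2024-06/*.parquet")
    # Trading day starts at 04:00; PERIODID counts 5-minute intervals from there.
    .with_columns(
        (
            pl.col("TRADINGDATE").cast(pl.Datetime)
            + pl.duration(hours=4, minutes=5 * pl.col("PERIODID"))
        ).alias("INTERVAL_START")
    )
    # Drop offers lodged after the interval had already started.
    .filter(pl.col("OFFERDATETIME") < pl.col("INTERVAL_START"))
    # Keep only the latest rebid for each unit, bid type and interval.
    .sort("OFFERDATETIME", descending=True)
    .group_by(["DUID", "BIDTYPE", "INTERVAL_START"], maintain_order=True)
    .first()
)

prices = (
    pl.scan_parquet("BIDDAYOFFER/2024-06/*.parquet")
    # Latest price bands per unit, bid type and trading day.
    .sort("OFFERDATE", descending=True)
    .group_by(["DUID", "BIDTYPE", "SETTLEMENTDATE"], maintain_order=True)
    .first()
    .select("DUID", "BIDTYPE", "SETTLEMENTDATE", *[f"PRICEBAND{i}" for i in range(1, 11)])
)

bids = volumes.join(
    prices,
    left_on=["DUID", "BIDTYPE", "TRADINGDATE"],
    right_on=["DUID", "BIDTYPE", "SETTLEMENTDATE"],
).collect()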

Common Queries

Average Price

Here is an example of how to find the unweighted average price in each region, using Nemosis and Pandas. The energy prices are available in table DISPATCHPRICE.
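A minimal sketch of that query (I have assumed calendar year 2024 for the date range; the cache can be any local folder):

from nemosis import dynamic_data_compiler

cache = "nemosis_cache"  # Nemosis downloads and caches the raw files here
prices = dynamic_data_compiler(
    "2024/01/01 00:00:00", "2025/01/01 00:00:00", "DISPATCHPRICE", cache,
    select_columns=["SETTLEMENTDATE", "REGIONID", "RRP", "INTERVENTION"],
)
prices = prices[prices["INTERVENTION"] == 0]  # drop intervention re-runs
print(prices.groupby("REGIONID")["RRP"].mean())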

This yields:

INFO: Compiling data for table DISPATCHPRICE
INFO: Returning DISPATCHPRICE.
REGIONID
NSW1     96.937920
QLD1     82.126428
SA1     164.951991
TAS1    117.847834
VIC1     82.122759
Name: RRP, dtype: float64

I recommend explicitly checking that the dataframe starts and ends when you expect. It is easy to get an off-by-one error.

Revenue Skewness

Here is an example of how outliers drive generator revenue. The objective of this query is to find the numbers for the following claim: "Half of all generators' energy revenue each year comes from only x% of trading intervals." (Here I am excluding FCAS revenue.)

We can get energy prices from DISPATCHPRICE, and power per generator per interval from DISPATCH_UNIT_SCADA. (DISPATCHLOAD also contains per-generator, per-interval measurements. However it contains a lot more data too, so the files are far bigger.) The docs for DISPATCH_UNIT_SCADA say SCADAVALUE is the "Instantaneous MW reading from SCADA at the start of the Dispatch interval". I will use linear interpolation to join the dots. For most cases this is good enough. In practice the power level of generators does meander away from this diagonal line. (e.g. Data lags mean that generators may have a constant output for the first 30 seconds or so, then ramp to their next level.) If you want more exact data, you can use the 4 Second SCADA data. That is very large though.

For this example I will download the data with Nemosis, and then query it with Polars.

I sink the result to a Parquet file, then read it back. This may seem unnecessary, but it is a trick to reduce memory consumption. (This query uses a lot of memory for a normal-sized laptop.) Similarly, I did not strictly have to parse the timestamps, but doing so reduces the memory footprint of the join.
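A simplified sketch of this query (my own reconstruction, not the exact code): it assumes calendar year 2024, skips the Parquet-sink trick and the interpolation, and uses the registration list for the DUID-to-region mapping (the column names 'DUID' and 'Region' are assumptions about that spreadsheet, and it is subject to the Nemosis firewall caveat discussed in the fuel-type section below). Expect it to be slow and memory-hungry.

import polars as pl
from nemosis import dynamic_data_compiler, static_table

cache = "nemosis_cache"
start, end = "2024/01/01 00:00:00", "2025/01/01 00:00:00"

prices = pl.from_pandas(
    dynamic_data_compiler(start, end, "DISPATCHPRICE", cache,
                          select_columns=["SETTLEMENTDATE", "REGIONID", "RRP", "INTERVENTION"])
).filter(pl.col("INTERVENTION") == 0).drop("INTERVENTION")

scada = pl.from_pandas(
    dynamic_data_compiler(start, end, "DISPATCH_UNIT_SCADA", cache)
).select("SETTLEMENTDATE", "DUID", "SCADAVALUE")

regions = pl.from_pandas(
    static_table("Generators and Scheduled Loads", cache)
).select(pl.col("DUID"), pl.col("Region").alias("REGIONID")).unique(subset="DUID")

revenue_per_interval = (
    scada.join(regions, on="DUID", how="inner")
    .join(prices, on=["SETTLEMENTDATE", "REGIONID"], how="inner")
    # MW held for 5 minutes -> MWh, times $/MWh -> $ per generator per interval.
    .with_columns((pl.col("SCADAVALUE") * (5 / 60) * pl.col("RRP")).alias("REVENUE"))
    .group_by("REGIONID", "SETTLEMENTDATE")
    .agg(pl.col("REVENUE").sum())
)

frac_time = (
    revenue_per_interval.sort("REVENUE", descending=True)
    .with_columns(
        (pl.col("REVENUE").cum_sum().over("REGIONID")
         / pl.col("REVENUE").sum().over("REGIONID")).alias("CUM_FRAC")
    )
    # Fraction of intervals needed to reach half of each region's annual revenue.
    .group_by("REGIONID")
    .agg((pl.col("CUM_FRAC") < 0.5).mean().alias("FRAC_TIME"))
)
print(frac_time.sort("FRAC_TIME"))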

This yields:

REGIONID FRAC_TIME
SA1 0.05997
NSW1 0.085669
QLD1 0.108968
VIC1 0.113849
TAS1 0.131157

i.e. In 2024, generators in South Australia earned half of their revenue during the best 6% of the time.

Rooftop Solar

Unlike most generation, rooftop solar is not directly measured. (This is true even in Victoria, where smart meters are mandatory.) Instead AEMO estimates how much power is, was or will be generated by rooftop solar. Typically AEMO (and most other grid operators too) treat solar power as negative consumption, due to the unique data provenance. This leads to funny things like "negative" demand.

For most analysis, I recommend doing the work to get rooftop solar data, and adding it to large-scale generation.

Suppose you want to analyse some data from Australia's national electricity market, to see our current fuel mix. To figure out the fuel mix, you take the per-generator power from DISPATCH_UNIT_SCADA (or DISPATCHLOAD), and join that to a list of generators with fuel type and region (e.g. the participant registration list). Then you can do a simple group-by and sum up the energy. You will find that 8.4% of South Australia's generation in 2024 was from solar. However this is not correct. Once you add rooftop solar, you see that actually 28.4% of South Australia's generation in 2024 was from solar. Without including rooftop solar, you would be wrong by a factor of 3. For more information, see my post on LinkedIn.

Rooftop solar data takes a bit of work to get. It is 30 minute granularity, so do not forget to upsample it. If you append it to large-scale generation data without upsampling, you will have 0 for 5 out of every 6 intervals.

You should filter out REGIONIDs which do not end in 1, as described earlier.

There are also overlapping rows because data comes from several different estimation methods. In the TYPE column the possible values are DAILY, MEASUREMENT, SATELLITE. SATELLITE is the most accurate, so take that if it is available. MEASUREMENT is the next most accurate. Luckily this is reverse alphabetical order. QI is a quality indicator. I am unsure, but I think you should take the best TYPE, and then break ties by taking the highest QI.

As an example, here is how to get deduplicated 5-minute rooftop solar data with Polars:
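(A sketch rather than the original code: it assumes the ROOFTOP_PV_ACTUAL data has already been downloaded to local Parquet files with columns INTERVAL_DATETIME, REGIONID, TYPE, QI and POWER, with INTERVAL_DATETIME parsed as a datetime.)

import polars as pl

rooftop = (
    pl.scan_parquet("rooftop_pv_actual/*.parquet")
    # Keep only the real regions (drop sub-regions like QLDC, QLDN, ...).
    .filter(pl.col("REGIONID").str.ends_with("1"))
    # Best estimate first: SATELLITE > MEASUREMENT > DAILY happens to be reverse
    # alphabetical order; break ties with the highest QI.
    .sort(["TYPE", "QI"], descending=True)
    .group_by("REGIONID", "INTERVAL_DATETIME", maintain_order=True)
    .first()
    .collect()  # upsampling needs the whole (small) table in memory
)

# Upsample the 30-minute data to 5 minutes. POWER is in MW, so it is simply
# repeated (do not divide it by 6 - that would only be correct for energy).
rooftop_5min = (
    rooftop.sort("REGIONID", "INTERVAL_DATETIME")
    .upsample("INTERVAL_DATETIME", every="5m", group_by="REGIONID", maintain_order=True)
    .with_columns(pl.exclude("INTERVAL_DATETIME").forward_fill())
)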

Unfortunately upsampling from 30 minutes to 5 minutes generally requires that you have all the data in memory. For Polars this means you cannot stream, so I call .collect() first. This particular table is small enough that this is generally not a problem. If it is a problem, or just too fiddly, a workaround is to duplicate the data 6 times, adding 5 × N minutes to each timestamp, then concatenating them vertically. (Making sure that you duplicate power, not energy.)

Price Predictions

We can obtain AEMO's price predictions (including the history of predictions made at various times for a given period) from "predispatch" data. The predispatch process was explained earlier.

There are 3 relevant tables: DISPATCHPRICE, P5MIN_REGIONSOLUTION and PREDISPATCHPRICE. The latter two contain a history of price predictions. This can be tricky to get your head around at first. Read the section earlier about two-dimensional time.

You can read more about how far the data extends and the different granularities in Watt Clarity.

These tables overlap. P5MIN_REGIONSOLUTION contains 'predictions' made in the same interval they apply to, which is also what DISPATCHPRICE contains (except those are definite, final prices, not predictions). Given the timestamps, you would expect P5MIN_REGIONSOLUTION to be exactly the same as DISPATCHPRICE for rows where INTERVAL_DATETIME == RUN_DATETIME. However predispatch makes some slight approximations to speed the computation up, so it is not exactly the same algorithm as dispatch. Thus P5MIN_REGIONSOLUTION may be wrong for INTERVAL_DATETIME == RUN_DATETIME, and you should choose DISPATCHPRICE for those rows.

There is also an overlap between P5MIN_REGIONSOLUTION and PREDISPATCHPRICE (even before interpolating). When this happens, use P5MIN_REGIONSOLUTION. So for each interval the prediction was made for, for each interval the prediction was made in, if there are multiple rows, choose DISPATCHPRICE, then P5MIN_REGIONSOLUTION, then PREDISPATCHPRICE. This is a good approach for operational queries, where you just filter for predictions for the next interval.

For analytic queries, this data is large. In that case, doing a join and deduplication (e.g. vertical concatenation, then group by the two timestamp columns, sort by which table it came from etc) is very expensive (i.e. your laptop will run out of memory). So you can filter each table instead with conditions about the distance between the two time columns.
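For example, a sketch of that lead-time filtering with Polars. Here dispatch_prices, p5min_predictions and predispatch_prices are hypothetical pre-processed frames that you have already renamed to common columns (REGIONID, RUN_TIME = when the prediction was made, INTERVAL_TIME = the interval it applies to, RRP), and the 55-minute cutoff for the P5MIN horizon is my assumption:

import polars as pl

# Hypothetical pre-processed inputs, one per source, with harmonised columns.
dispatch_prices = pl.read_parquet("dispatch_prices.parquet")      # lead time is always 0
p5min_predictions = pl.read_parquet("p5min_predictions.parquet")
predispatch_prices = pl.read_parquet("predispatch_prices.parquet")

lead_minutes = (pl.col("INTERVAL_TIME") - pl.col("RUN_TIME")).dt.total_minutes()

# Partition by lead time so each source is used only where it takes precedence.
p5min = p5min_predictions.filter((lead_minutes > 0) & (lead_minutes <= 55))
predispatch = predispatch_prices.filter(lead_minutes > 55)

# No deduplication needed after the filters: each (REGIONID, RUN_TIME,
# INTERVAL_TIME) combination now comes from exactly one source.
combined = pl.concat([dispatch_prices, p5min, predispatch])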

If these last few paragraphs were unclear, just read the code in the example below.

Most PREDISPATCH* tables are very large. So only a subset of each PREDISPATCH* table is in the normal DATA folder on Nemweb. The full dataset is in the adjacent PREDISP_ALL_DATA folder (as mentioned earlier):

e.g.

https://www.nemweb.com.au/Data_Archive/Wholesale_Electricity/MMSDM/2025/MMSDM_2025_10/MMSDM_Historical_Data_SQLLoader/PREDISP_ALL_DATA/PUBLIC_ARCHIVE%23PREDISPATCHPRICE%23ALL%23FILE01%23202510010000.zip

This is true even for PREDISPATCHPRICE, even though it is a moderate size.

In this example I try to analyse how AEMO's price predictions get more/less accurate depending on how far in advance they are made. Specifically I compare whether the predictions are on the correct side of 0 $/MWh. I expected them to get monotonically better (i.e. downward sloping graph), but that is not the case. The rise around 24 hours shows that sometimes the closer we get to the time period AEMO is predicting, the worse the predictions get (for some regions, for some hours out, for this particular definition of "worse").

Graph

Grouping By Fuel Type

The fuel type (e.g. coal vs solar) of each generator can be found in the list of registered participants (file "NEM Registration and Exemption List", sheet "PU and Scheduled Loads"). Nemosis gives us an easy way to download this with the "Generators and Scheduled Loads" table. However, at the time of writing, AEMO has recently changed their firewall rules in a way which breaks Nemosis. The example below shows how to use Nemosis, and how to use a workaround.

There are multiple columns relating to fuel type, with quite varied, detailed, inconsistent and even misspelled values. You will probably need to hard code some if statements to classify these into something simpler.

Watch out: If a generator's fuel type includes the substring "gas", it might be "biogas" or "landfill gas", which is biofuel not natural gas. If a generator's fuel type includes the word "coal", it might be "coal seam gas", which is gas not coal.

This example uses Polars.
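(A sketch rather than the exact code: the fallback URL is a placeholder you should replace with the current link to the registration list on AEMO's website, the browser-style User-Agent header is one workaround that has worked for the firewall issue, and the column name "Fuel Source - Descriptor" may differ in your copy of the spreadsheet.)

import io

import pandas as pd
import polars as pl
import requests


def get_generator_list(cache="nemosis_cache"):
    try:
        from nemosis import static_table
        return static_table("Generators and Scheduled Loads", cache)
    except Exception:
        # Workaround: fetch the spreadsheet directly with a browser-like
        # User-Agent. REGISTRATION_LIST_URL is a placeholder - copy the current
        # link from AEMO's website. Reading .xls may also require xlrd, and you
        # may need skiprows= if the sheet has title rows above the header.
        REGISTRATION_LIST_URL = "https://..."
        resp = requests.get(REGISTRATION_LIST_URL,
                            headers={"User-Agent": "Mozilla/5.0"}, timeout=60)
        resp.raise_for_status()
        return pd.read_excel(io.BytesIO(resp.content), sheet_name="PU and Scheduled Loads")


gens = pl.from_pandas(get_generator_list())

# Collapse the messy fuel descriptions into a few buckets. Order matters:
# "coal seam" must be checked before "coal", and biogas/landfill before "gas".
fuel = pl.col("Fuel Source - Descriptor").str.to_lowercase()
gens = gens.with_columns(
    pl.when(fuel.str.contains("coal seam")).then(pl.lit("Gas"))
    .when(fuel.str.contains("coal")).then(pl.lit("Coal"))
    .when(fuel.str.contains("biogas|landfill|bagasse")).then(pl.lit("Biofuel"))
    .when(fuel.str.contains("gas")).then(pl.lit("Gas"))
    .when(fuel.str.contains("solar")).then(pl.lit("Solar"))
    .when(fuel.str.contains("wind")).then(pl.lit("Wind"))
    .when(fuel.str.contains("water|hydro")).then(pl.lit("Hydro"))
    .when(fuel.str.contains("batter")).then(pl.lit("Battery"))
    .otherwise(pl.lit("Other"))
    .alias("FUEL_SIMPLE")
)
print(gens.group_by("FUEL_SIMPLE").agg(pl.len()))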

Note that rooftop solar is not included here because it does not have a DUID. Some generators appear here, but not elsewhere, because they are non-scheduled. Some generators appear elsewhere but not here because they have been decommissioned.

Emissions Data

The emissions intensity of generators (a static value representing the average emissions in tonnes of CO2e per MWh) is published in the GENUNITS table in the usual MMSDM location. This also contains columns about the provenance of this emissions intensity data. This is per-genset (GENSETID), which means there may be more than one value per DUID. (See the hierarchy earlier.) Most power data is DUID level, so we need to aggregate somehow. This example shows one way to do that.
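A sketch of one way to do that aggregation with Polars, assuming GENUNITS and DUALLOC are already in local Parquet files (the file paths and the choice of an unweighted mean are my assumptions):

import polars as pl

genunits = pl.read_parquet("GENUNITS.parquet")
dualloc = pl.read_parquet("DUALLOC.parquet")

# Keep only the latest version of each DUID <-> GENSETID mapping.
dualloc_latest = (
    dualloc.sort(["EFFECTIVEDATE", "VERSIONNO"], descending=True)
    .group_by(["DUID", "GENSETID"], maintain_order=True)
    .first()
    .select("DUID", "GENSETID")
)

# One intensity per DUID: a simple unweighted mean across its gensets.
# (A capacity-weighted mean would be better where gensets differ in size.)
duid_emissions = (
    genunits.join(dualloc_latest, on="GENSETID", how="inner")
    .group_by("DUID")
    .agg(pl.col("CO2E_EMISSIONS_FACTOR").mean())
)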

You could join this with generator-level power data to get generator-level emissions. In theory you can do this at a 5 minute granularity. However you should smooth things out to a larger timescale before interpreting, because this is average emissions intensity. e.g. Coal generators emit a lot of CO2 when starting up, hours before they export their first MWh, and the efficiency depends on the power level. (More generally, you should be very wary of any causal inference at the 5 minute level. Fossil fuel generators and even wind/solar/batteries make decisions over time with dynamic constraints. Intervals are not really independent.)

There is daily region-level emissions data (and per-generator emissions intensity data again) in the CDEII subdirectory of the realtime data folder. (CDEII appears to stand for Carbon Dioxide Equivalent Intensity Index. I do not know why it is not published in the usual MMSDM folder.)

https://www.nemweb.com.au/REPORTS/CURRENT/CDEII/

Whilst most other realtime data in /CURRENT/ is harder to parse than the monthly data, thankfully this data is only one table per CSV (with a header, footer and 4 metadata columns), like the monthly data.

Unlike most data on Nemweb, these files are small, and some are published as uncompressed CSVs, not zips of CSVs. This means that most libraries can read them from the URL directly. (It is normally good practice to download the files and then read them, so you do not put load on AEMO's servers and wait for a download every time you run a query.)

CO2EII_AVAILABLE_GENERATORS.CSV is per-generator emissions intensity. These figures may differ slightly from those in GENUNITS. I do not know what the adjacent CO2EII_AVAILABLE_GENERATORS_YYYY_*.CSV files are in the CDEII folder. This file structure and versioning approach is different to all other AEMO data.
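For example, Pandas can read the per-generator file straight from the URL, following the header/footer/metadata layout described above:

import pandas as pd

url = "https://www.nemweb.com.au/REPORTS/CURRENT/CDEII/CO2EII_AVAILABLE_GENERATORS.CSV"
# Skip AEMO's one-line header and footer, then drop the 4 metadata columns.
df = pd.read_csv(url, skiprows=1, skipfooter=1, engine="python")
df = df.iloc[:, 4:]
print(df.head())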

CO2EII_SUMMARY_RESULTS.CSV contains daily emissions data per region. (Despite the fact that the timestamp contains an hour and minute part, this is daily data. The hour and minute are always midnight.) The columns are not documented. Based on what ballpark the numbers are in, my belief is that:

The monthly data does not include any tables called CDEII, but does include something with CO2E in the name (BILLING_CO2E_PUBLICATION). This appears to be the same data as CO2EII_SUMMARY_RESULTS.CSV, but with a few extra columns (and only for one month). I would not be surprised if the emissions numbers themselves differed slightly. If so, my guess is that the monthly one is more accurate, because it is "billing" quality.

If you find out more, please reach out to me so I can add more detail here for others. (LinkedIn, or raise an issue on GitHub.)

I Feel Your Pain

As you delve into the world of AEMO's data, you will come across countless inconsistencies which will drive you mad. You need to be prepared to roll with the punches. Just remember to be thankful that we have this data at all. I do not know of any other grid or industry with this much public data.

Examples of inconsistencies:

Further Reading

AEMO implements the National Electricity Rules (NER) written by the Australian Energy Market Commission. These rules are available here.

Watt Clarity is a blog by a market consultant (Global Roam). They have:

Whilst they are a business trying to sell consulting services and market reports, I find the free content on their site to be very high quality, useful and unbiased.

The benefits of Parquet over CSV are described in R for Data Science. (Apache Arrow has similar benefits.)

Any opinions in this post are my own, and do not necessarily reflect that of my employer. But you already knew that.

(In fact, at the time of writing I have no employer. I have recently finished my ../masters-thesis and have not yet started my new role.)

In this post I have made several criticisms of AEMO. However I am still very grateful that they publish this data at all. This is far better than any other grid operator I have encountered. Whilst the documentation and consistency is sometimes lacking, it is better than most comparable datasets.

Where To Go Next

If you have questions (or corrections), reach out to me on LinkedIn, or raise an issue on the GitHub repo for this website. You can also try asking on the Nemosis mailing list.