Climate Model Data Organization

After choosing the set of models and model experiments you’re interested in working with, the next step is to understand how the data is organized so that you can actually do something concrete with it! This can be difficult for novice users, since climate model output uses a specialized file format called netCDF and the variables can be non-trivial to understand (see the Model Naming Conventions page for some details on this).

File Format: Network Common Data Form

By far the most common way to store climate model output is in a format called the Network Common Data Form, or netCDF. This format was developed by researchers in Boulder, CO in the US, and is designed to make accessing very large datasets as easy and fast as possible.

What is a netCDF file?

A netCDF file is just like any other type of data file: it contains data, or the quantity you’re interested in, as well as metadata describing things about the data that you might need in order to properly work with it. Examples of common metadata include:

  • The units of the variable within the data file (for instance, is temperature in Celsius or Kelvin?)
  • The calendar being used by the relevant climate model (it’s not always a “normal” calendar – sometimes they leave out leap years and things like that)
  • The plain English name of the variable
  • Any coordinates that are needed to interpret the data: latitude, longitude, and time are the most common examples, but sometimes depth or altitude is also necessary
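
To make the calendar point concrete, here is a minimal Python sketch (standard library only) of how the length of a model year depends on the calendar recorded in the metadata. The calendar names are the ones used by the CF metadata conventions. The helper functions `days_in_year` and `total_days` are hypothetical names made up for this illustration, and the leap-year rule used for the “standard” calendar assumes years after 1582.

```python
import calendar


def days_in_year(year, cal="standard"):
    """Number of days in `year` under a CF-style calendar name."""
    if cal in ("noleap", "365_day"):
        return 365  # leap days are never included
    if cal in ("all_leap", "366_day"):
        return 366  # every year has a leap day
    if cal == "360_day":
        return 360  # twelve 30-day months
    if cal in ("standard", "gregorian", "proleptic_gregorian"):
        # Assumption: year is after 1582, so the Gregorian rule applies.
        return 366 if calendar.isleap(year) else 365
    raise ValueError(f"unrecognized calendar: {cal!r}")


def total_days(first_year, last_year, cal="standard"):
    """Total days from the start of first_year to the end of last_year."""
    return sum(days_in_year(y, cal) for y in range(first_year, last_year + 1))


print(total_days(1850, 2014, "standard"))  # 60265
print(total_days(1850, 2014, "noleap"))    # 60225
```

The two calendars differ by 40 days over 1850 to 2014, which is why it pays to check the calendar before converting model time steps into dates.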

The nice thing about netCDF files, and why they are so common in climate research, is that they are what’s called “self-describing”. This means that the metadata travels with the file: if you take a subset of the times in the file or convert the units to different ones, standard netCDF tools generally write the updated information inside the file, so that you don’t lose track of anything.

A self-describing file is also really good at describing the coordinates of the data! Those also travel with the data if the file is modified, and will be automatically updated if the data is averaged or altered along a coordinate dimension (for example, a depth or time average).

How do I see what’s inside a netCDF file?

When you first start working with information in netCDF format, it can be a bit disorienting since it’s not immediately obvious how to see what’s in there. However, there are packages which can do this in all the major programming languages: see the Tutorials page for examples in R and Python. Most GIS applications also have the ability to open and display netCDF data!

If you don’t want to deal with a language like R or Python, but still want to quickly display the contents of a netCDF file, there are also a couple of other options:

  • Panoply is a software package developed by the NASA Goddard Institute for Space Studies, which has an intuitive graphical interface to allow you to display information from netCDF files.
  • Ncview is a similar package developed at UC San Diego; the interface is a bit clunkier, but will also allow you to quickly make simple visualizations.

Although it might be a bit less intuitive to start, we highly recommend going through the process of using a programming language (R, Python, Matlab, whatever!) to work with netCDF information, since this will give you a much more powerful toolkit to manipulate and analyze whatever data you’re interested in. 
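
As a rough sketch of what that looks like in Python, the snippet below prints a file’s header using the third-party xarray package. It assumes xarray (and a netCDF backend such as netCDF4) is installed. The file path and the helper names `summarize_attrs` and `print_header` are hypothetical, chosen only for this example.

```python
def summarize_attrs(name, attrs):
    """Format one variable's key metadata as a single readable line."""
    long_name = attrs.get("long_name", "(no long_name)")
    units = attrs.get("units", "(no units)")
    return f"{name}: {long_name} [{units}]"


def print_header(path):
    """Print the dimensions, coordinates, and per-variable metadata of a file."""
    import xarray as xr  # third-party; assumed to be installed

    with xr.open_dataset(path) as ds:
        print(ds)  # overview: dimensions, coordinates, variables, attributes
        for name, var in ds.variables.items():
            print(summarize_attrs(name, var.attrs))


# Example call (hypothetical path, so it is left commented out):
# print_header("path/to/your_file.nc")
```

If the netCDF command-line utilities are installed, running `ncdump -h` on a file prints the same kind of header information without writing any code.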

Data Organization: Variables

The concept of a climate model variable is also discussed on the Model Naming Conventions page; in brief, a variable is any quantity calculated by a climate model. Variables can be generated by any component of the model (atmosphere, ocean, land, etc.), and can have different sizes depending on the model grid resolution, the dimension of the data, and the time averaging period.

Lists of the available climate model variables in two different commonly-used data collections can be found here:

Generally speaking, the name of a climate model variable will be a relatively short string of letters derived from the “normal English” name in some way that the authors thought was fairly intuitive (you can see for yourself if you agree with that). For example, in the “standard” set of variable names used for the CMIP project, some commonly used variables are:

  • “tas”: near-surface air temperature (think “temperature at the air surface”)
  • “pr” : precipitation
  • “psl”: pressure at sea level

You can look at the URL linked above to see the translation between the “normal English” name and the shorthand variable name, or follow the directions in the previous section under “How do I see what’s inside a netCDF file?” to look at a particular file that you’re interested in.
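
The three short names above can be kept in a small lookup table alongside their units, which also makes unit conversions explicit. This is a minimal sketch: the long names and units are the ones these variables normally carry in CMIP output, but you should confirm them against the header of your own file. All function names here are hypothetical.

```python
# Short name -> (long name, units) for the three variables listed above.
CMIP_VARIABLES = {
    "tas": ("Near-Surface Air Temperature", "K"),
    "pr": ("Precipitation", "kg m-2 s-1"),
    "psl": ("Sea Level Pressure", "Pa"),
}


def describe(short_name):
    """Return a readable one-line description of a short variable name."""
    long_name, units = CMIP_VARIABLES[short_name]
    return f"{short_name} = {long_name} [{units}]"


def kelvin_to_celsius(kelvin):
    return kelvin - 273.15


def pr_to_mm_per_day(flux):
    """kg m-2 s-1 -> mm/day (1 kg of water spread over 1 m2 is 1 mm deep)."""
    return flux * 86400.0  # seconds per day


def pa_to_hpa(pascals):
    return pascals / 100.0


print(describe("tas"))  # tas = Near-Surface Air Temperature [K]
```

Writing the conversions as named functions makes it harder to, say, plot precipitation in kg m-2 s-1 and mistake it for mm/day.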

Best Practices: when you’re getting started working with a new dataset, it’s ALWAYS a good idea to look through the file header (or “metadata”) to make sure you’re looking at the thing you think you’re looking at! It’s pretty easy to get tripped up by things like calendars and units if you don’t do this at the beginning – so might as well save yourself some time! The instructions above should get you started with being able to do this, or you can look through the Tutorials section for more detailed information.

Data Organization: File Structure

The other important topic to understand when downloading and working with climate model output is how the data are organized into various files. When you think about it, it’s pretty obvious that this is a huge task: after all, there are hundreds of different variables being output from any given model every time it’s run. Then those variables are being saved at some time frequency (sometimes daily, sometimes monthly, etc), and most often each climate model experiment extends for decades or centuries. All of that adds up to an enormous amount of information! 

Model Experiments

As mentioned on the “Model Naming Conventions” page, there are various ways to set up and configure climate models. Some common ones include:

  • Historical: simulating the climate over the recent observational period (typically post-1850 or so)
  • Future projections: using different scenarios of future climate change to simulate projected climate out to the end of the 21st century

More detail on these model experiments can also be found on the “CMIP and Other MIPs” page!

For now, the main thing to understand is that you’ll need to choose the experiment you’re interested in before downloading or accessing the data for your variables of interest. Depending on the method of data access you’re working with, this may require looking up the name associated with that particular experiment; common experiments of interest include:

  • “historical”
  • “ssp370”, “ssp585”, or other members of the SSP set of CMIP6-era future projections
  • “rcp45”, “rcp85”, or other members of the RCP set of CMIP5-era future projections

Check out “Model Naming Conventions” for a refresher on these and other experiments, or view the list of CMIP6 experiments here!
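
The experiment names above follow a pattern that can be decoded programmatically. The sketch below assumes the usual reading of these labels: for the SSPs, the first digit is the socioeconomic pathway and the last two digits are the approximate end-of-century radiative forcing in tenths of W/m²; for the RCPs, the two digits are the forcing. The function name `decode_experiment` is hypothetical, and labels outside these three patterns (for example “ssp534-over”) are deliberately not handled.

```python
import re


def decode_experiment(experiment_id):
    """Split an experiment name into its family, CMIP era, and forcing level."""
    if experiment_id == "historical":
        return {"family": "historical", "era": "CMIP5 or CMIP6", "forcing_w_m2": None}

    match = re.fullmatch(r"ssp(\d)(\d\d)", experiment_id)
    if match:
        return {
            "family": f"SSP{match.group(1)}",
            "era": "CMIP6",
            "forcing_w_m2": int(match.group(2)) / 10,
        }

    match = re.fullmatch(r"rcp(\d\d)", experiment_id)
    if match:
        return {
            "family": "RCP",
            "era": "CMIP5",
            "forcing_w_m2": int(match.group(1)) / 10,
        }

    raise ValueError(f"experiment name not covered by this sketch: {experiment_id!r}")


print(decode_experiment("ssp370"))
print(decode_experiment("rcp85"))
```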

Model Components

After choosing a model, an experiment, and a variable, actually finding the data may still require a bit of digging. Why? Because data is often organized according to the model component that generated it. In other words, data for atmospheric variables is often stored in the “atmosphere” area, data for oceanic variables in an “ocean” area, and so on.

Luckily, this part usually isn’t too difficult! It’s typically pretty intuitive to figure out which model component your variable of interest is coming from: if it seems like it should be the atmosphere, for example, it probably is. You can also refer back to lists of variables if necessary:
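
In CMIP6 output, one place the component shows up is the “table” label that forms the second part of each filename (for example “Amon” for monthly atmosphere data). A minimal sketch of looking up the component from that label follows. The mapping covers only a handful of common monthly tables, and the function name `component_of` is hypothetical.

```python
# A few common CMIP6 monthly table labels and the component they belong to.
TABLE_TO_COMPONENT = {
    "Amon": "atmosphere",
    "Omon": "ocean",
    "Lmon": "land",
    "SImon": "sea ice",
}


def component_of(filename):
    """Guess the model component from the table label in a CMIP6-style filename."""
    parts = filename.split("_")
    if len(parts) < 2:
        return "unknown"  # not a CMIP6-style name
    return TABLE_TO_COMPONENT.get(parts[1], "unknown")


print(component_of("tas_Amon_CanESM5_historical_r10i1p1f1_gn_185001-201412.nc"))
```

For the filename shown, this prints “atmosphere”, which matches the intuition that surface air temperature is an atmospheric variable.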

Subsetting and Time Series 

The last thing to be aware of is that since these data are BIG (multiple gigabytes for each individual variable), they are sometimes split into MULTIPLE files even for a given model experiment and variable. 

There is no single universally accepted way to do this, since it’s usually done based on wanting to cap the total file size rather than fixing the number of model years included in a given file. So you’ll have to be aware of this when you’re looking at your data files – the file name will typically contain the starting and ending years covered by the data it contains, and if you’re trying to look at a longer period than that, then you’ll need to make sure you have the other files containing the rest of your time period.

In other words: make sure you grab all the files that have the variable you need in them! There might be more than one!
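
One way to check this before starting an analysis is to read the start and end dates out of each filename and confirm that, together, the files cover the period you need with no gaps. The sketch below assumes monthly files whose names end in a “YYYYMM-YYYYMM.nc” range; daily files typically use longer date stamps and would need a different pattern. The function names and the “ExampleModel” filenames are hypothetical.

```python
import re

# Matches a trailing monthly time range such as "_185001-201412.nc".
_RANGE = re.compile(r"_(\d{6})-(\d{6})\.nc$")


def month_index(yyyymm):
    """Convert 'YYYYMM' to a running month count so ranges are easy to compare."""
    return int(yyyymm[:4]) * 12 + (int(yyyymm[4:]) - 1)


def time_range(filename):
    """Return (first_month, last_month) as month indices for one file."""
    match = _RANGE.search(filename)
    if match is None:
        raise ValueError(f"no YYYYMM-YYYYMM range found in {filename!r}")
    return month_index(match.group(1)), month_index(match.group(2))


def covers(filenames, start, end):
    """True if the files jointly cover start..end ('YYYYMM') without gaps."""
    needed = month_index(start)  # earliest month not yet covered
    last = month_index(end)
    for first, final in sorted(time_range(f) for f in filenames):
        if first > needed:
            return False  # gap before this file begins
        needed = max(needed, final + 1)
        if needed > last:
            return True
    return needed > last


files = [
    "tas_Amon_ExampleModel_historical_r1i1p1f1_gn_185001-194912.nc",
    "tas_Amon_ExampleModel_historical_r1i1p1f1_gn_195001-201412.nc",
]
print(covers(files, "185001", "201412"))      # True
print(covers(files[:1], "185001", "201412"))  # False: second file is missing
```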

A Practical Example: Surface Air Temperature Trends

Here is a brief example of the steps you would go through in order to find information on surface air temperature and calculate a long-term trend. Much more detailed examples of how to do this can be found in the walkthrough tutorials:

  1. Choose a climate model
    First, figure out which model – or models – you’d like to look at data from. This choice can sometimes be somewhat arbitrary, since there is often no a priori reason to expect that one model will be definitively better than another.

    Ways this choice is often made include:
    – familiarity with a particular model and its behavior
    – the presence of multiple ensemble members (for more on that, see the “Large Ensembles” page)
    – wanting to span a particular range of plausible outcomes for a given region

For these purposes, let’s arbitrarily select the Canadian Earth System Model version 5, or CanESM5 – this is a model that was used to submit data for CMIP6.

  2. Select an experiment (or experiments)
    Next, think about the type of model experiment you would like to use for your analysis. In this case, we’re interested in looking at long-term trends in temperature, so let’s pick the historical simulation.
  3. Locate data

Now it’s time to go find the data! In this case we started with the CMIP6 website, but you could also do this through other sources such as Pangeo.

A detailed explanation of how to find this particular file can be found in the CMIP6 website walkthrough; but after searching for CanESM5 data, the historical experiment which was run for CMIP6, and the surface air temperature variable “tas”, we find that there are multiple historical simulations available.

A typical filename for these experiments is:
tas_Amon_CanESM5_historical_r10i1p1f1_gn_185001-201412.nc

From the end of the filename, we can then tell that the time period this particular file covers is January 1850 (“185001”) through December 2014 (“201412”). So if this is the time period we are interested in for the trend calculation, we’re all done! If a longer time period is desired, we would then go ahead and download data from one of the future projections, or SSPs. More on this can be found in the CMIP6 website walkthrough!
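
The same decoding can be done in code, which is handy when sorting through many files. The sketch below splits a CMIP6-style filename on underscores into its parts and also unpacks the ensemble member label (“r10i1p1f1” means realization 10, initialization 1, physics 1, forcing 1). The field names in `CmipFile` are informal labels chosen for this example, the function names are hypothetical, and the code assumes a filename with exactly seven underscore-separated parts ending in a time range.

```python
import re
from typing import NamedTuple


class CmipFile(NamedTuple):
    variable: str
    table: str
    model: str
    experiment: str
    variant: str
    grid: str
    start: str
    end: str


def parse_cmip6_filename(name):
    """Split a CMIP6-style filename into its underscore-separated fields."""
    stem = name[:-3] if name.endswith(".nc") else name
    parts = stem.split("_")
    if len(parts) != 7:
        raise ValueError(f"expected 7 fields, found {len(parts)} in {name!r}")
    start, end = parts[6].split("-")
    return CmipFile(*parts[:6], start, end)


def parse_variant(label):
    """Unpack an ensemble member label such as 'r10i1p1f1' into integers."""
    match = re.fullmatch(r"r(\d+)i(\d+)p(\d+)f(\d+)", label)
    if match is None:
        raise ValueError(f"unexpected variant label: {label!r}")
    keys = ("realization", "initialization", "physics", "forcing")
    return dict(zip(keys, (int(g) for g in match.groups())))


info = parse_cmip6_filename(
    "tas_Amon_CanESM5_historical_r10i1p1f1_gn_185001-201412.nc"
)
print(info.model, info.experiment, info.start, info.end)
print(parse_variant(info.variant))
```

Comparing `info.start` and `info.end` against the period you want is a quick way to confirm whether this one file is enough or whether a future-projection file is also needed.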