Intake catalogs#
In order to make the DKRZ CMIP data pool more FAIR, we support the Python package intake-esm, which allows you to work with collections of climate data easily and quickly.
We provide a tutorial here: https://tutorials.dkrz.de/intake.html
The official intake-esm page: https://intake-esm.readthedocs.io/
Features
display catalogs as clearly structured tables inside Jupyter notebooks for easy investigation
import intake
# Open the DKRZ CMIP6 disk catalog from its .json descriptor file
col = intake.open_esm_datastore("/work/ik1017/Catalogs/dkrz_cmip6_disk.json")
# The catalog's file list is exposed as a pandas DataFrame
col.df.head()
 | activity_id | institution_id | source_id | experiment_id | member_id | table_id | variable_id | grid_label | dcpp_init_year | version | ... | frequency | time_reduction | long_name | units | realm | level_type | time_min | time_max | format | uri
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | AerChemMIP | BCC | BCC-ESM1 | hist-piAer | r1i1p1f1 | AERmon | c2h6 | gn | NaN | v20200511 | ... | mon | mean | C2H6 Volume Mixing Ratio | mol mol-1 | aerosol | alevel | 185001.0 | 201412 | netcdf | /work/ik1017/CMIP6/data/CMIP6/AerChemMIP/BCC/B... |
1 | AerChemMIP | BCC | BCC-ESM1 | hist-piAer | r1i1p1f1 | AERmon | c2h6 | gn | NaN | v20200511 | ... | mon | mean | C2H6 Volume Mixing Ratio | mol mol-1 | aerosol | alevel | 185001.0 | 201412.nc.modified | netcdf | /work/ik1017/CMIP6/data/CMIP6/AerChemMIP/BCC/B... |
2 | AerChemMIP | BCC | BCC-ESM1 | hist-piAer | r1i1p1f1 | AERmon | c3h6 | gn | NaN | v20200511 | ... | mon | mean | C3H6 volume mixing ratio | mol mol-1 | aerosol | alevel | 185001.0 | 201412 | netcdf | /work/ik1017/CMIP6/data/CMIP6/AerChemMIP/BCC/B... |
3 | AerChemMIP | BCC | BCC-ESM1 | hist-piAer | r1i1p1f1 | AERmon | c3h8 | gn | NaN | v20200511 | ... | mon | mean | C3H8 volume mixing ratio | mol mol-1 | aerosol | alevel | 185001.0 | 201412 | netcdf | /work/ik1017/CMIP6/data/CMIP6/AerChemMIP/BCC/B... |
4 | AerChemMIP | BCC | BCC-ESM1 | hist-piAer | r1i1p1f1 | AERmon | cdnc | gn | NaN | v20200522 | ... | mon | mean | Cloud Liquid Droplet Number Concentration | m-3 | aerosol | alevel | 185001.0 | 201412 | netcdf | /work/ik1017/CMIP6/data/CMIP6/AerChemMIP/BCC/B... |
5 rows × 26 columns
col.esmcat.description
"This is a ESM-collection for CMIP6 data on DKRZ's disk storage system"
Features
browse through the catalog and select your data without being on the pool file system
⇨ A pythonic, reproducible alternative to complex find commands or GUI searches. No need for file systems and filenames.
# Search the catalog for monthly near-surface air temperature from one MPI-ESM1-2-HR historical run
tas = col.search(experiment_id="historical", source_id="MPI-ESM1-2-HR", variable_id="tas", table_id="Amon", member_id="r1i1p1f1")
tas
/work/ik1017/Catalogs/dkrz_cmip6_disk catalog with 1 dataset(s) from 33 asset(s):
 | unique
---|---
activity_id | 1 |
institution_id | 1 |
source_id | 1 |
experiment_id | 1 |
member_id | 1 |
table_id | 1 |
variable_id | 1 |
grid_label | 1 |
dcpp_init_year | 0 |
version | 1 |
time_range | 33 |
path | 33 |
opendap_url | 33 |
project | 1 |
simulation_id | 1 |
grid_id | 1 |
frequency | 1 |
time_reduction | 1 |
long_name | 1 |
units | 1 |
realm | 1 |
level_type | 0 |
time_min | 33 |
time_max | 33 |
format | 1 |
uri | 33 |
derived_variable_id | 0 |
Features
open climate data as an analysis-ready dictionary of xarray datasets
Forget about annoying temporary merging and reformatting steps!
# Load the search results into a dictionary of xarray datasets, chunked along time
tas.to_dataset_dict(xarray_open_kwargs={"chunks": {"time": 1}})
--> The keys in the returned dictionary of datasets are constructed as follows:
'activity_id.source_id.experiment_id.table_id.grid_label'
{'CMIP.MPI-ESM1-2-HR.historical.Amon.gn': <xarray.Dataset>
Dimensions: (time: 1980, bnds: 2, lat: 192, lon: 384, member_id: 1,
dcpp_init_year: 1)
Coordinates:
* time (time) datetime64[ns] 1850-01-16T12:00:00 ... 2014-12-16T...
time_bnds (time, bnds) datetime64[ns] dask.array<chunksize=(1, 2), meta=np.ndarray>
* lat (lat) float64 -89.28 -88.36 -87.42 ... 87.42 88.36 89.28
lat_bnds (lat, bnds) float64 dask.array<chunksize=(192, 2), meta=np.ndarray>
* lon (lon) float64 0.0 0.9375 1.875 2.812 ... 357.2 358.1 359.1
lon_bnds (lon, bnds) float64 dask.array<chunksize=(384, 2), meta=np.ndarray>
height float64 2.0
* member_id (member_id) object 'r1i1p1f1'
* dcpp_init_year (dcpp_init_year) float64 nan
Dimensions without coordinates: bnds
Data variables:
tas (member_id, dcpp_init_year, time, lat, lon) float32 dask.array<chunksize=(1, 1, 1, 192, 384), meta=np.ndarray>
Attributes: (12/64)
Conventions: CF-1.7 CMIP-6.2
activity_id: CMIP
branch_method: standard
branch_time_in_child: 0.0
branch_time_in_parent: 0.0
contact: cmip6-mpi-esm@dkrz.de
... ...
intake_esm_attrs:time_reduction: mean
intake_esm_attrs:long_name: Near-Surface Air Temperature
intake_esm_attrs:units: K
intake_esm_attrs:realm: atmos
intake_esm_attrs:_data_format_: netcdf
intake_esm_dataset_key: CMIP.MPI-ESM1-2-HR.historical.Amon.gn}
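Once loaded, individual datasets are picked out of the returned dictionary by their key. A minimal sketch, assuming the call above is first assigned to a variable:

dsets = tas.to_dataset_dict(xarray_open_kwargs={"chunks": {"time": 1}})
# Keys follow the pattern 'activity_id.source_id.experiment_id.table_id.grid_label'
ds = dsets["CMIP.MPI-ESM1-2-HR.historical.Amon.gn"]
# The data stays lazy (dask arrays) until you compute, plot, or save it
ds["tas"].isel(time=0)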
Features
display catalogs as clearly structured tables inside Jupyter notebooks for easy investigation
browse through the catalog and select your data without being on the pool file system
open climate data as an analysis-ready dictionary of xarray datasets
⇨ intake-esm reduces the data access and data preparation tasks on the analyst's side
Catalog content#
The catalog is a combination of

- a list of files (at DKRZ compressed as .csv.gz) in which each line contains a file path as an index plus column values describing that file. The columns of the catalog should be selected such that a dataset in the project's data repository can be uniquely identified, i.e., all elements of the project's Data Reference Syntax (DRS) should be covered (see the project's documentation for more information about the DRS).
- a .json formatted descriptor file for the list, which contains additional settings that tell intake how to interpret the data.

According to our policy, both files have the same name and are available in the same directory.
print("What is this catalog about? \n" + col.esmcat.description)
#
print("The path to the list of files: "+ col.esmcat.catalog_file)
What is this catalog about?
This is a ESM-collection for CMIP6 data on DKRZ's disk storage system
The path to the list of files: /work/ik1017/Catalogs/dkrz_cmip6_disk.csv.gz
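Because the file list is an ordinary compressed table, you can also inspect it directly with pandas, independently of intake. A small sketch, assuming read access to the path printed above:

import pandas as pd
# pandas decompresses .csv.gz transparently; this is the same table intake loads
files = pd.read_csv("/work/ik1017/Catalogs/dkrz_cmip6_disk.csv.gz", low_memory=False)
print(files.columns.tolist())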
Creation of the .csv.gz list:

- A file list is created based on a find shell command on the project directory in the data pool.
- For the column values, filenames and paths are parsed according to the project's path_template and filename_template (a sketch of this parsing step follows below). These templates need to be constructed with the attribute values requested and required by the project.
- Filenames that cannot be parsed are sorted out.
- Depending on the project, additional columns can be created from the project's specifications. E.g., for CMIP6, we added an OpenDAP column which allows users to access data from everywhere via http.
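To illustrate the parsing step, here is a minimal, hypothetical sketch of turning a CMIP6 file path into catalog columns. The attribute list and helper function are simplifications (the real path_template is project configuration), and the example path is constructed for illustration from the catalog's first row:

from pathlib import Path

# Hypothetical, simplified path template for CMIP6 directory levels
PATH_ATTRS = ["mip_era", "activity_id", "institution_id", "source_id", "experiment_id",
              "member_id", "table_id", "variable_id", "grid_label", "version"]

def parse_cmip6_path(path):
    """Map the trailing directory levels of a CMIP6 path onto DRS attributes."""
    parts = Path(path).parent.parts
    if len(parts) < len(PATH_ATTRS):
        return None  # path does not match the template and would be sorted out
    return dict(zip(PATH_ATTRS, parts[-len(PATH_ATTRS):]))

example = ("/work/ik1017/CMIP6/data/CMIP6/AerChemMIP/BCC/BCC-ESM1/hist-piAer/"
           "r1i1p1f1/AERmon/c2h6/gn/v20200511/"
           "c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_185001-201412.nc")
print(parse_cmip6_path(example))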
Configuration of the .json descriptor:

- It makes the catalog self-descriptive by defining all the information necessary to understand the .csv.gz file.
- It specifies the headers of the columns - in the case of CMIP6, each column is linked to a Controlled Vocabulary.
col.esmcat.attributes[0]
Attribute(column_name='project', vocabulary='')
- It defines how to open the data in as analysis-ready a form as possible with the underlying xarray tool (these rules can be inspected as shown below):
  - which column of the .csv.gz file contains the path or link to the files
  - what the data format is
  - how to aggregate files into a dataset:
    - which column is used as a new dimension of the xarray dataset via merge
    - which dimension to concat along when a file is opened
    - additional options for the open function
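The aggregation rules of a loaded catalog can be inspected programmatically. A small sketch, assuming a recent intake-esm version where the descriptor is exposed as col.esmcat:

# Which columns group assets into one dataset (these form the dictionary keys)
print(col.esmcat.aggregation_control.groupby_attrs)
# The aggregation rules themselves: their type, the column they act on, and extra options
for agg in col.esmcat.aggregation_control.aggregations:
    print(agg.type, agg.attribute_name, agg.options)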
Jobs we do for you#
- We make all catalogs available under /pool/data/Catalogs/ and in the cloud.
- We create and update the content of the projects' catalogs regularly with scripts that run automatically as cron jobs. We set the update frequency so that the data of the project is refreshed sufficiently quickly.
  - The updated catalog replaces the outdated one.
  - The updated catalog is uploaded to the DKRZ swift cloud.
- We plan to provide a catalog that tracks data which is removed by the update.
!ls /work/ik1017/Catalogs/dkrz_*.json
/work/ik1017/Catalogs/dkrz_cmip5_archive.json
/work/ik1017/Catalogs/dkrz_cmip5_disk.json
/work/ik1017/Catalogs/dkrz_cmip5_disk_netcdf.json
/work/ik1017/Catalogs/dkrz_cmip6_cloud.json
/work/ik1017/Catalogs/dkrz_cmip6_disk.json
/work/ik1017/Catalogs/dkrz_cmip6_disk_netcdf.json
/work/ik1017/Catalogs/dkrz_cmip6_swift_zarr.json
/work/ik1017/Catalogs/dkrz_cmip-data-pool_disk_netcdf.json
/work/ik1017/Catalogs/dkrz_cordex_disk.json
/work/ik1017/Catalogs/dkrz_dyamond-winter_disk.json
/work/ik1017/Catalogs/dkrz_era5_disk.json
/work/ik1017/Catalogs/dkrz_mpige_disk.json
/work/ik1017/Catalogs/dkrz_nextgems_disk.json
/work/ik1017/Catalogs/dkrz_palmod2_disk.json
import pandas as pd
#pd.options.display.max_colwidth = 100
services = pd.DataFrame.from_dict({"CMIP6" : {
"Update Frequency" : "Daily",
"On cloud" : "Yes", #"https://swift.dkrz.de/v1/dkrz_a44962e3ba914c309a7421573a6949a6/intake-esm/mistral-cmip6.json",
"Path to catalog" : "/pool/data/Catalogs/dkrz_cmip6_disk.json",
"OpenDAP" : "Yes",
"Retraction Tracking" : "Yes",
"Minimum required Memory" : "10GB",
}, "CMIP5": {
"Update Frequency" : "On demand",
"On cloud" : "Yes", #"https://swift.dkrz.de/v1/dkrz_a44962e3ba914c309a7421573a6949a6/intake-esm/mistral-cmip5.json",
"Path to catalog" : "/pool/data/Catalogs/dkrz_cmip5_disk.json",
"OpenDAP" : "Yes",
"Retraction Tracking" : "",
"Minimum required Memory" : "5GB",
}, "CORDEX": {
"Update Frequency" : "Monthly",
"On cloud" : "Yes", #"https://swift.dkrz.de/v1/dkrz_a44962e3ba914c309a7421573a6949a6/intake-esm/mistral-cordex.json",
"Path to catalog" : "/pool/data/Catalogs/dkrz_cordex_disk.json",
"OpenDAP" : "No",
"Retraction Tracking" : "",
"Minimum required Memory" : "5GB",
}, "ERA5": {
"Update Frequency" : "On demand",
"On cloud" : "Yes",
"Path to catalog" : "/pool/data/Catalogs/dkrz_era5_disk.json",
"OpenDAP" : "No",
"Retraction Tracking" : "--",
"Minimum required Memory" : "5GB",
}, "MPI-GE": {
"Update Frequency" : "On demand",
"On cloud" : "Yes",# "https://swift.dkrz.de/v1/dkrz_a44962e3ba914c309a7421573a6949a6/intake-esm/mistral-MPI-GE.json
"Path to catalog" : "/pool/data/Catalogs/dkrz_mpige_disk.json",
"OpenDAP" : "",
"Retraction Tracking" : "--",
"Minimum required Memory" : "No minimum",
}}, orient = "index")
servicestb=services.style.set_properties(**{
'font-size': '14pt',
})
servicestb
 | Update Frequency | On cloud | Path to catalog | OpenDAP | Retraction Tracking | Minimum required Memory
---|---|---|---|---|---|---
CMIP6 | Daily | Yes | /pool/data/Catalogs/dkrz_cmip6_disk.json | Yes | Yes | 10GB
CMIP5 | On demand | Yes | /pool/data/Catalogs/dkrz_cmip5_disk.json | Yes | | 5GB
CORDEX | Monthly | Yes | /pool/data/Catalogs/dkrz_cordex_disk.json | No | | 5GB
ERA5 | On demand | Yes | /pool/data/Catalogs/dkrz_era5_disk.json | No | -- | 5GB
MPI-GE | On demand | Yes | /pool/data/Catalogs/dkrz_mpige_disk.json | | -- | No minimum
Best practices and recommendations#
- Intake can make your scripts reusable.
  - Instead of working with local copies or edited versions of files, always start from a globally defined catalog which everyone can access.
  - Save the subset of the catalog which you work on as a new catalog instead of as a subset of files (see the sketch after this list). It can be hard to find out why data is no longer included in recent catalog versions, especially if retraction tracking is not enabled.
- Intake helps you to avoid downloading data by reducing the necessary temporary steps which can cause temporary output.
- Check for new ingests by just repeating your script - it will open the most recent catalog.
- Only load datasets with to_dataset_dict using the argument xarray_open_kwargs={"chunks": {"time": 1}}. Otherwise, the chunk sizes can make you exceed your memory limits.
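Saving a subset as a new catalog can be done with intake-esm's serialize method. A minimal sketch, using the tas subset from above; the target directory is a placeholder you would replace:

# Write the subset as a new catalog (a .json descriptor plus a file list)
tas.serialize(name="my_tas_subset", directory="/path/to/my/catalogs")
# Anyone with access can later reopen exactly this selection:
# intake.open_esm_datastore("/path/to/my/catalogs/my_tas_subset.json")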
Technical requirements for usage#
Memory:
Depending on the project's volume, the catalogs can be large. If you need to work with the total catalog, you require at least 10GB of memory.
On jupyterhub.dkrz.de, start the notebook server with matching resources (a quick way to check the catalog's footprint is sketched below).
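You can estimate the footprint of a loaded catalog directly with pandas. A quick sketch, using the col object from the cells above:

# Rough check: deep memory footprint of the catalog DataFrame, in GB
mem_gb = col.df.memory_usage(deep=True).sum() / 1e9
print(f"The catalog occupies about {mem_gb:.2f} GB in memory")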
Software:
Intake works on the basis of xarray and pandas.
On jupyterhub.dkrz.de, use one of the recent kernels:
unstable
bleeding edge
Load the catalog#
import intake
# Open the catalog selected from the services table above (here: the CMIP6 disk catalog)
collection = intake.open_esm_datastore(services.loc["CMIP6", "Path to catalog"])