Retrieving data from ESGF with Intake-ESGF

This notebook explains how to use Intake-ESGF to retrieve data from ESGF. You do not need to know anything about Intake to run it. Intake-ESGF searches the ESGF index nodes, including Globus nodes, compares the results with your local database, and starts a thread pool to download files in parallel. It is both efficient and simple to use.

On Levante, this notebook works well with the /work/bm1344/conda-envs/py_312 environment.

  1. Configure Intake-ESGF with your local cache, the esg dataroot, and specific ESGF indices.

  2. Set up a request dictionary

  3. Start retrieving

from ipywidgets import FloatProgress
import intake_esgf

Configure Intake-ESGF

For the parameters, you can use:

  • a /scratch path for your local_cache

  • the CMIP Data Pool trunk as the esg_dataroot

  • one of the high-priority ESGF index nodes, e.g. esgf.ceda.ac.uk, to make sure that all data are found

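intake_esgf.conf.set(
    # Location to which downloaded files are written
    local_cache="/work/ik1017/Ingest/requests/US2",
    # Read-only location that is checked for data that already exist locally
    esg_dataroot="/work/ik1017/CMIP6/data",
    # all_indices=True,
    # Query only the listed index node
    indices={"esgf.ceda.ac.uk": True},
)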
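Because downloads are written to local_cache, it can help to confirm beforehand that the chosen directory exists and is writable. The following is a minimal sketch using only the Python standard library; the helper name `check_cache_dir` is hypothetical and not part of Intake-ESGF, and the temporary directory only stands in for your own cache path.

```python
import os
import tempfile
from pathlib import Path


def check_cache_dir(path):
    """Return True if `path` is (or can be created as) a writable directory.

    Hypothetical helper for illustration; not part of Intake-ESGF.
    """
    cache = Path(path)
    try:
        # Create the directory and any missing parents; no error if it exists.
        cache.mkdir(parents=True, exist_ok=True)
    except OSError:
        # Covers missing permissions and the case where `path` is a file.
        return False
    return cache.is_dir() and os.access(cache, os.W_OK)


# Demonstration on a throwaway directory; substitute your own cache path.
with tempfile.TemporaryDirectory() as tmp:
    demo_ok = check_cache_dir(Path(tmp) / "esgf_cache")
print(demo_ok)
```

Running such a check before calling intake_esgf.conf.set avoids discovering a permission problem only after a long search has finished.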

Set up a request

Make sure that your request does not grow too large. A total size of 1-10 TB should be fine. You can assume an average speed of 50 MB/s, depending on the data node you retrieve from. This results in about 4 TB per day.
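The estimate above can be reproduced with a few lines of arithmetic. This is a rough sketch that assumes a constant transfer speed and decimal units (1 TB = 1,000,000 MB); the function names are hypothetical, and real speeds vary with the data node.

```python
def daily_volume_tb(speed_mb_per_s):
    """Volume in TB transferred in 24 hours at a constant speed in MB/s."""
    seconds_per_day = 24 * 60 * 60
    return speed_mb_per_s * seconds_per_day / 1_000_000


def transfer_days(total_tb, speed_mb_per_s=50.0):
    """Days needed to transfer `total_tb` terabytes at the given speed."""
    return total_tb / daily_volume_tb(speed_mb_per_s)


print(daily_volume_tb(50))           # 4.32
print(round(transfer_days(10), 1))   # 2.3
```

At 50 MB/s, a 10 TB request therefore takes a little over two days.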

high23=dict(
    project="CMIP6",
    activity_id=["CFMIP","CMIP","DAMIP"],
    institution_id=["CMCC","NCAR","NOAA-GFDL"],
    source_id=["CESM2","CESM2-FV2","CESM2-WACCM","CESM2-WACCM-FV2","CMCC-CM2-SR5","GFDL-CM4"],
    experiment_id=["1pctCO2","abrupt-2xCO2","abrupt-4xCO2","esm-piControl","hist-GHG","hist-aer","hist-nat","historical","piControl"],
    variant_label=["r10i1p1f1","r11i1p1f1","r1i1p1f1","r1i2p2f1","r2i1p1f1","r3i1p1f1","r3i1p2f1","r4i1p1f1","r5i1p1f1","r7i1p1f1","r8i1p1f1","r9i1p1f1"],
    table_id=["Oday"],
    variable_id=["chlos","tossq"],
    grid_label=["gn","gr"]
)
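One quick way to see how a request grows is to multiply the number of values given for each facet. The sketch below is plain Python and does not query ESGF; `max_combinations` is a hypothetical helper and the `example` dictionary is a shortened illustration. The result is only an upper bound on the number of facet combinations, because many combinations do not exist on ESGF.

```python
from math import prod


def max_combinations(request):
    """Upper bound on the number of facet combinations in a request dict.

    List or tuple values contribute their length, scalar values count as one.
    Hypothetical helper for illustration; not part of Intake-ESGF.
    """
    return prod(
        len(value) if isinstance(value, (list, tuple)) else 1
        for value in request.values()
    )


# Shortened example request, for illustration only.
example = dict(
    project="CMIP6",
    source_id=["CESM2", "GFDL-CM4"],
    experiment_id=["historical", "piControl", "1pctCO2"],
    variable_id=["chlos", "tossq"],
)
print(max_combinations(example))  # 12
```

Adding a single value to any facet multiplies this bound, so it is worth trimming facet lists before starting a retrieval.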

Start retrieving

With the following three commands, you start your retrieval:

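# Create the catalog with the configuration set above
cat = intake_esgf.catalog.ESGFCatalog()
# Search the configured index nodes for the requested facets
subset = cat.search(**high23)
# Download the files that are not available locally and open them as datasets
dsdict = subset.to_dataset_dict(add_measures=False)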

If you think the data you retrieved would be valuable for all DKRZ users, please contact supportATdkrz.de so that we can take over the data and add it to the CMIP Data Pool.