{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## The DKRZ CMIP Data Pool\n", "\n", "This beginner-level demonstration notebook introduces you to the Data Pool at DKRZ. Using the recent phase 6 of the Coupled Model Intercomparison Project ([CMIP6](https://www.wcrp-climate.org/wgcm-cmip/wgcm-cmip6)) as an example, you will learn\n", "- how you benefit from the CMIP Data Pool (CDP)\n", "- how to approach CMIP data\n", "- how to use the Python packages `intake-esm`, `xarray` and `pandas` to investigate the CMIP Data Pool" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "This notebook can be executed on [DKRZ's jupyterhub platform](https://jupyterhub.dkrz.de/). For a detailed introduction to `jupyterhub` and `intake`, we recommend the DKRZ tech talks\n", "\n", "- [jupyterhub](https://indico.dkrz.de/event/33/) by Sofiane Bendoukha\n", "- [intake](https://indico.dkrz.de/event/31/) by Aaron Spring\n", "\n", "Customizing the code, however, requires only *basic* Python knowledge." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Introduction\n", "\n", "The Scientific Steering Committee has thankfully granted 5 PB of disk space on the Lustre file system for the CMIP Data Pool for 2023. 
DKRZ has run and maintained this common storage place since 2016.\n", "\n", " > 📒 The [DKRZ CMIP data pool](https://cmip-data-pool.dkrz.de) contains frequently needed flagship collections of climate model data, is hosted as part of the DKRZ data infrastructure and supports scientists in high-volume climate data collection, access and processing.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The [notebook sources](https://gitlab.dkrz.de/data-infrastructure-services/cmip-data-pool/-/blob/master/analysis-support/notebooks/CMIP6-Data-Pool.ipynb) for the doc pages are available in this [gitlab-repo](https://gitlab.dkrz.de/data-infrastructure-services/cmip-data-pool).\n", "\n", "**Important news and updates** will be announced\n", "- on the [DKRZ user portal](https://doc.dkrz.de/)\n", "- via a mailing list. Subscribe to ✉ [cmip-data-poolATlists.dkrz.de](https://lists.dkrz.de/mailman/listinfo/cmip-data-pool)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "⭐ Highlighted CDP climate model data collections are:\n", " - [CMIP6](https://www.wcrp-climate.org/wgcm-cmip/wgcm-cmip6): As of May 2021, DKRZ provided **Europe's largest** data pool, with a volume of 4 PB, for the recent phase of the Coupled Model Intercomparison Project.\n", " - [CORDEX](https://cordex.org/): The data for the Coordinated Regional Downscaling Experiment amount to about 600 TB across different projects.\n", " - [CMIP5](https://www.wcrp-climate.org/wgcm-cmip/wgcm-cmip5): The fifth phase of CMIP." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "An example of a project which is also in the *data pool*, but not covered by the term CMIP6⁺:\n", " - [ERA5](https://www.dkrz.de/up/services/data-management/projects-and-cooperations/era): Weather data from the [European Centre for Medium-Range Weather Forecasts](http://www.ecmwf.int), based on reanalysed and homogenised observation data." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "from IPython.display import HTML, display, Markdown, IFrame\n", "display(Markdown(\"Time series of three different data pool disk space measures. DKRZ has published about 1.5 PB; 2.5 PB are data replicated from other data nodes. An average CMIP6 dataset contains about 5 files and covers 4 GB.\"))\n", "IFrame(src=\"https://swift.dkrz.de/v1/dkrz_a44962e3ba914c309a7421573a6949a6/Pool-Statistics/pool-timeseries-hvplot.html\",width=\"900\",height=\"550\",frameborder=\"0\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "display(Markdown(\"We develop, prepare and provide [jupyter notebook demonstrations](https://gitlab.dkrz.de/data-infrastructure-services/tutorials-and-use-cases)
\" \n", " \"- as tutorials for software packages and applications *starting from scratch*
\"\n", " \"- for frequent use cases, such as plotting `tas` for one member of two experiments simulated by the German ESMs.\"))\n", "IFrame(src=\"https://swift.dkrz.de/v1/dkrz_a44962e3ba914c309a7421573a6949a6/plots/globalmean-yearlymean-tas.html\",width=\"1000\",height=\"650\",frameborder=\"0\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Why do we host the CDP? 🤔\n", "\n", "👉 The key benefit of the data pool is that the data is available on Lustre (`/work`) so that **all DKRZ users** with a current account have access. There is less need for local copies or data downloads. 👈\n", "\n", "### Where can I find the data pool?\n", "\n", "The data pool can be accessed through different portals.\n", "\n", "- Server-side on the file system, e.g. under `/pool/data/CMIP6/data`\n", " - All Levante users with a current account have permission to do that.\n", " - Fastest way to work with the data\n", " \n", "```bash\n", "\n", "# Browsing with Linux commands\n", "ls /pool/data/CMIP6/data/ -x\n", "echo \"\"\n", "# For which MIPs did MPI-ESM1-2-XR produce data?\n", "find /pool/data/CMIP6/data/ -maxdepth 3 -name MPI-ESM1-2-XR -type d\n", "\n", "# Using the FreVA CMIP-ESMVal tool\n", "module load cmip6-dicad/1.0 \n", "freva --databrowser --help\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "- Web-based from remote: 🕸\n", " - Published data via the fail-safe [Earth System Grid Federation data portal](http://esgf-data.dkrz.de)\n", " - Partly available in the [Copernicus Climate Data Store](https://cds.climate.copernicus.eu/#!/home)\n", "- Regularly updated Intake-esm catalogs made publicly available in the [cloud](https://swiftbrowser.dkrz.de/public/dkrz_a44962e3ba914c309a7421573a6949a6/intake-esm/ ) or in `/pool/data/Catalogs` 📒" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ 
"### Understanding CMIP6 data\n", "\n", "🧑‍🏫 **The goal of CMIP6**\n", "\n", "In order to evaluate and compare climate models, a globally organized intercomparison project is periodically conducted.\n", "CMIP6 tackles three major questions:\n", "\n", "- How does the Earth system respond to forcing? 🚂\n", "- What are the origins and consequences of systematic model biases? 🐞\n", "- How can we assess future climate changes given internal climate variability, predictability, and uncertainties in scenarios? 🌡" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Metadata: Required Attributes and Controlled Vocabularies\n", "\n", "CDP data is **self-descriptive** as it contains extensive and controlled metadata. This metadata is made available through the search facets of the data portals and catalogs.\n", "\n", "📜\n", "\n", "Besides the technical requirements, the CMIP data standard defines **required attributes** in so-called [**Controlled Vocabularies (CV)**](https://github.com/WCRP-CMIP/CMIP6_CVs). While some values are predefined, models and institutions have to be registered before they become valid values of the corresponding attributes. For many attributes, both a short form with `_id` and a longer description exist." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Important required attributes:\n", "\n", "- `activity_id`: A CMIP6-endorsed MIP that investigates a specific research question. It defines `experiment`s and requests data for them.\n", "- `source_id` : An ID for the Earth System Model used to produce the data.\n", "- `experiment_id`: The experiment which was conducted by the `source_id`.\n", "- `member_id` : The ensemble simulation member of the `experiment_id`. All members should be statistically equal. 
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Investigating the CMIP6 data pool with `intake-esm` ⛵" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**Features**\n", "\n", "- display catalogs as clearly structured tables inside jupyter notebooks for easy investigation" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "import intake\n", "cloudpath=[\"https://www.dkrz.de/s/intake\"]\n", "poolpath=\"/pool/data/Catalogs/dkrz_cmip6_disk.json\"\n", "#cdp = intake.open_catalog(cloudpath)\n", "#col = cdp.dkrz_cmip6_disk\n", "col = intake.open_esm_datastore(\"/work/ik1017/Catalogs/dkrz_cmip6_disk.json\")\n", "col.df.head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**Features**\n", "\n", "- browse through the catalog and select your data without being on the pool file system\n", "\n", "⇨ A pythonic, reproducible alternative to complex `find` commands or GUI searches. No need for file systems and file names." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "tas = col.search(experiment_id=\"historical\", source_id=\"MPI-ESM1-2-HR\", variable_id=\"tas\", table_id=\"Amon\", member_id=\"r1i1p1f1\")\n", "tas" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**Features**\n", "\n", "- open climate data in an analysis-ready dictionary of `xarray` datasets\n", "\n", "Forget about temporary merging and reformatting steps!" 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "tas.to_dataset_dict()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### `Intake` best practices:\n", "\n", "- `Intake` can make your scripts **reusable**.\n", " - Instead of working with local copies or edited versions of files, always start from a globally defined catalog which everyone can access\n", " - Save the subset of the catalog which you work on as a new catalog instead of a subset of files\n", "- Check for new ingests by just __repeating__ your script - it will open the most recent catalog.\n", "- Only load datasets into xarray with `to_dataset_dict` if they do not exceed your memory limits" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Let's get an overview of the CMIP6 data pool by\n", "- finding the number of unique values of attributes\n", "- **grouping** and **plotting** the names and sizes of different entries\n", "\n", "The resulting statistics show the percentage of the **number of files**." 
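, "\n",
"\n",
"The grouping step can be tried out without access to the pool. The following is a minimal sketch using only the Python standard library; the `rows` list is a made-up stand-in for catalog rows (one entry per file) and the helper `file_share` is hypothetical, i.e. not part of `intake-esm` or `pandas`:\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"# Made-up catalog rows, one dict per file (real attribute names, invented entries)\n",
"rows = [\n",
"    {'activity_id': 'CMIP', 'source_id': 'MPI-ESM1-2-HR'},\n",
"    {'activity_id': 'CMIP', 'source_id': 'MPI-ESM1-2-HR'},\n",
"    {'activity_id': 'CMIP', 'source_id': 'EC-Earth3'},\n",
"    {'activity_id': 'ScenarioMIP', 'source_id': 'EC-Earth3'},\n",
"]\n",
"\n",
"def file_share(rows, attribute):\n",
"    # Count the files per attribute value and convert the counts to percentages\n",
"    counts = Counter(row[attribute] for row in rows)\n",
"    total = sum(counts.values())\n",
"    return {value: 100 * n / total for value, n in counts.most_common()}\n",
"\n",
"shares = file_share(rows, 'activity_id')\n",
"print(len(shares), 'unique values:', shares)\n",
"```\n",
"\n",
"With the real catalog, `col.df.groupby(attribute).size()` provides the same kind of per-value file counts."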
] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "unique_activities=col.unique(\"activity_id\")\n", "print(list(unique_activities[\"activity_id\"].values()))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "def pieplot(gbyelem):\n", "    # Group by the attribute, sort by number of files and select the ten largest groups\n", "    size = col.df.groupby([gbyelem]).size().sort_values(ascending=False)\n", "    size10 = size.head(10).copy()\n", "    # Replace the 10th entry by the sum of all remaining groups (positional access)\n", "    size10.iloc[9] = size.iloc[9:].sum()\n", "    size10 = size10.rename(index={size10.index[9]: 'all other'})\n", "    # Return a pie plot\n", "    return size10.plot.pie(figsize=(18,8), ylabel='', autopct='%.2f', fontsize=16)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "pieplot(\"activity_id\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "unique_sources=col.unique(\"source_id\")\n", "print(\"Number of unique Earth system models in the CMIP6 data pool: \"+str(list(unique_sources[\"source_id\"].values())[0]))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "pieplot(\"source_id\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "unique_members=col.unique(\"member_id\")\n", "list(unique_members[\"member_id\"].values())[1][0:3]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "#### Data Reference Syntax\n", "\n", "An atomic **Dataset** contains all files which cover the entire time span of a single variable of a single simulation. 
A dataset can therefore consist of multiple files.\n", "\n", "The Data Reference Syntax (DRS) is a set of *required attributes* which **uniquely identify** and describe a dataset. The DRS usually includes all attributes used in the path templates, so that both terms are used synonymously. The DRS elements are arranged into a **hierarchical** path template for CMIP6:\n", "\n", "CMIP6: `mip_era`/`activity_id`/`institution_id`/`source_id`/`experiment_id`/`member_id`/`table_id`/`variable_id`/`grid_label`/`version`" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "*Be careful when browsing through the CMIP6 data tree!*\n", "\n", "**Unique** in CMIP6 data hierarchy:\n", "- `experiment_id` (only in one `activity_id`)\n", "- `variable_id` in `table_id` : Both combined represent the **CMIP Variable**\n", "- Only one `version` for one dataset should be published" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# Searching for the MIP which defines the experiment 'historical':\n", "\n", "cat = col.search(experiment_id=\"historical\")\n", "cat.unique(\"activity_id\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# Searching for all tables which contain the variable 'tas':\n", "\n", "cat = col.search(variable_id=\"tas\")\n", "cat.unique(\"table_id\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Not **Unique** in CMIP6 data hierarchy:\n", "- `institution_id` for both `source_id` + `experiment_id` ( + `member_id` )\n", "\n", "No requirements for `member_id`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# Searching for all institution_ids that use the model 'MPI-ESM1-2-HR' to produce 'ssp585' results:\n", 
"\n", "cat = col.search(source_id=\"MPI-ESM1-2-HR\", experiment_id=\"ssp585\")\n", "cat.unique(\"institution_id\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# Searching for all experiment_ids produced with ESM 'EC-Earth3' and as ensemble member 'r1i1p1f1':\n", "\n", "cat = col.search(source_id=\"EC-Earth3\", member_id=\"r1i1p1f1\")\n", "cat.unique(\"experiment_id\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# Searching for all valid ensemble member_ids produced with ESM 'EC-Earth3' for experiment 'abrupt-4xCO2'\n", "\n", "cat = col.search(source_id=\"EC-Earth3\", experiment_id=\"abrupt-4xCO2\")\n", "cat.unique(\"member_id\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "⇨ **Do not** search for `institution_id`, `table_id` and `member_id` unless you are sure about what you are doing.\n", "Instead, begin by searching for\n", "`experiment_id`, `source_id`, `variable_id`." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### How can I find the variables I need? 🔎\n", "\n", "1. **[Search](https://cfconventions.org/Data/cf-standard-names/77/build/cf-standard-name-table.html) for the matching `standard_name`**\n", "\n", "Most of the data in the data pool is compliant with the [Climate and Forecast Convention](https://cfconventions.org/). This convention defines `standard_names`, which need to be assigned to variables as a variable attribute inside the data. As a reliable description of the variable, the `standard_name` is a bridge to the shorter variable identifier name in the data, the so-called `short_name`. This short name is saved in the data catalogs, which can be searched." 
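, "\n",
"\n",
"The bridge from `standard_name` to `short_name` can be pictured as a simple lookup. The following is a minimal sketch with a hand-picked excerpt of pairs; the dictionary `SHORT_TO_STANDARD` and the helper `short_names_for` are hypothetical illustrations, and the CMIP6 data request remains the authoritative list:\n",
"\n",
"```python\n",
"# Hand-picked excerpt of short_name -> standard_name pairs (illustrative only)\n",
"SHORT_TO_STANDARD = {\n",
"    'tas': 'air_temperature',\n",
"    'tasmax': 'air_temperature',\n",
"    'tasmin': 'air_temperature',\n",
"    'ta': 'air_temperature',\n",
"    'pr': 'precipitation_flux',\n",
"    'psl': 'air_pressure_at_mean_sea_level',\n",
"}\n",
"\n",
"def short_names_for(standard_name):\n",
"    # Collect every short_name that carries the given standard_name, sorted alphabetically\n",
"    return sorted(s for s, std in SHORT_TO_STANDARD.items() if std == standard_name)\n",
"\n",
"candidates = short_names_for('air_temperature')\n",
"print(candidates)\n",
"```\n",
"\n",
"Each returned short name can then be passed as `variable_id` to a catalog search."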
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "2. **[Search](http://clipc-services.ceda.ac.uk/dreq/mipVars.html) for corresponding `short_name`s in the CMIP6 data request**\n", "\n", "E.g., you get many results for `air_temperature`. CMIP contains multiple definitions for one ‘physical’ variable like air_temperature; most of them are specific diagnostics of that variable, like `tasmin` and `tasmax`. Sometimes, output for a specific level is provided as a separate variable, e.g. `ta500`. This can be the case if not all levels are requested for a specific frequency.\n", "\n", "Best practice in [ESGF](http://esgf-data.dkrz.de)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "3. **Search for the fitting `mip_table`**\n", "\n", "Each `mip_table` is a combination of requirements for an output `variable_id` including\n", "- frequency\n", "- time cell methods (average or instantaneous)\n", "- vertical level (e.g. interpolated on pressure levels)\n", "- grid\n", "- realm (e.g. atmosphere model output or ocean model output)\n", "\n", "These requirements are set according to the interests of the MIPs. Variables with similar requirements are collected in one MIP table, which can be identified by its `table_id`." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### The data infrastructure for the DKRZ CDP\n", "\n", "In order to tackle the challenges of data provision and dissemination for a 4 PB repository, a [state-of-the-art data infrastructure](https://www.dkrz.de/c6de/dicad/dicad?set_language=en&cl=en) has been developed around that pool. In the following, we highlight three aspects of the data workflow." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "You benefit from the DKRZ CDP because\n", "\n", "- its data is standardized and quality controlled 🛂\n", "- it is a curated, updated, published and cataloged data repository 👩‍🏭\n", "- it prevents data duplication and downloads into local workspaces, which are inefficient, expensive and a waste of storage resources 🗑" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Data quality\n", "\n", "CMIP6 data is only available in a common and **reliable** [data format](https://goo.gl/neswPr)\n", "- No adaptations needed for output of specific models\n", "- Makes data **interoperable** 📠, which enables evaluation software such as [ESMValTool](https://www.esmvaltool.org/)\n", "\n", "🏅 CMIP6 data was **quality controlled** with [PrePARE](https://cmor.llnl.gov/mydoc_cmip6_validator/) before being published" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "CMIP6 data is **transparent** about occurring errors\n", "- Search the [errata](https://errata.es-doc.org/) database for origins of suspicious analysis results ⚠\n", "\n", "If you find an error, please inform the modeling group, either via the contact in the citation or, if available, via the `contact` attribute in the file." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Data publication\n", "\n", "- Extended **documentation** of how the simulations were conducted is provided in the [ES-Doc](https://explore.es-doc.org/) database\n", "- [**Persistent Identifiers**](https://esgf-data.dkrz.de/projects/esgf-dkrz/pid) (PIDs) ensure long-term web access to dataset information\n", "- [**Citation information**](https://cmip6cite.wdc-climate.de) and DOIs for all published datasets are easily retrievable\n", "\n", "One method to retrieve a citation from the data is via the attribute `further_info_url`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import xarray\n", "# Open the first file of the last search result (positional access)\n", "random_file=xarray.open_dataset(cat.df[\"uri\"].iloc[0])\n", "random_file.attrs[\"further_info_url\"]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "When using data provided in the framework of the DKRZ CMIP Data Pool as basis for a publication, we ask you to add the following text to the Acknowledgements section:\n", "\n", "*“We acknowledge the CMIP6 community for providing the climate model data, retained and globally distributed in the framework of the ESGF. The CMIP6 data of this study were replicated and made available for this study by the DKRZ.”*" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Contacts\n", "\n", "- **Requests** for replication and retraction: data-poolATdkrz.de 🎫\n", "- **News and updates** will be announced\n", " - on the new [DKRZ User Portal](https://doc.dkrz.de/)\n", " - via the mailing list ✉ [cmip-data-poolATlists.dkrz.de](https://lists.dkrz.de/mailman/listinfo/cmip-data-pool)\n", "\n", "This notebook is a collaborative effort by the DM Data Infrastructure team.\n", "\n", "🙂 Thank you for your attention!" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "python3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" } }, "nbformat": 4, "nbformat_minor": 4 }