{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Intake catalogs\n", "\n", "To make the DKRZ CMIP data pool more [FAIR](https://www.dkrz.de/up/services/data-management/LTA/fairness), we support the **Python package** `intake-esm`, which allows you to **use collections of climate data easily and quickly**.\n", "\n", "We provide a tutorial here:\n", "https://tutorials.dkrz.de/intake.html\n", "\n", "The official `intake-esm` page:\n", "https://intake-esm.readthedocs.io/" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**Features**\n", "\n", "- display catalogs as clearly structured tables inside Jupyter notebooks for easy investigation" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "import intake\n", "# Open the DKRZ CMIP6 catalog from its JSON descriptor and preview the first rows\n", "col = intake.open_esm_datastore(\"/work/ik1017/Catalogs/dkrz_cmip6_disk.json\")\n", "col.df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "col.esmcat.description" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**Features**\n", "\n", "- browse through the catalog and select your data without being on the pool file system\n", "\n", "⇨ A pythonic, reproducible alternative to complex `find` commands or GUI searches. No need to deal with file systems and file names." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "tas = col.search(experiment_id=\"historical\", source_id=\"MPI-ESM1-2-HR\", variable_id=\"tas\", table_id=\"Amon\", member_id=\"r1i1p1f1\")\n", "tas" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**Features**\n", "\n", "- open climate data in an analysis-ready dictionary of `xarray` datasets\n", "\n", "Forget about annoying temporary merging and reformatting steps!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "tas.to_dataset_dict(cdf_kwargs={\"chunks\":{\"time\":1}})" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**Features**\n", "\n", "- display catalogs as clearly structured tables inside Jupyter notebooks for easy investigation\n", "- browse through the catalog and select your data without being on the pool file system\n", "- open climate data in an analysis-ready dictionary of `xarray` datasets\n", "\n", "⇨ `intake-esm` reduces the data access and data preparation tasks on the analyst's side" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Catalog content\n", "\n", "The catalog is a combination of\n", "\n", "- a list of files (at DKRZ compressed as `.csv.gz`) where each line contains a file path as an index and column values that describe that file\n", "    - The columns of the catalog should be selected such that a dataset in the project's data repository can be *uniquely identified*. That is, all elements of the project's Data Reference Syntax (DRS) should be covered (see the project's documentation for more information about the DRS).\n", "- a `.json` formatted descriptor file for the list, which contains additional settings that tell `intake` how to interpret the data. 
\n", "\n", "According to our policy, both files have the same name and are available in the same directory." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "print(\"What is this catalog about? \\n\" + col.esmcat.description)\n", "#\n", "print(\"The path to the list of files: \"+ col.esmcat.catalog_file)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**Creation of the `.csv.gz` list:**\n", "\n", "1. A file list is created based on a `find` shell command on the project directory in the data pool.\n", "2. For the column values, file names and paths are parsed according to the project's `path_template` and `filename_template`. These templates need to be constructed with attribute values requested and required by the project.\n", "    - File names that cannot be parsed are discarded.\n", "3. Depending on the project, additional columns can be created by adding project-specific information.\n", "    - E.g., for CMIP6, we added an `OpenDAP` column which allows users to access data from anywhere via `http`." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**Configuration of the `.json` descriptor:**\n", "\n", "Makes the catalog **self-descriptive** by defining all information necessary to understand the `.csv.gz` file\n", "\n", "- Specifications for the *headers* of the columns. In the case of CMIP6, each column is linked to a *Controlled Vocabulary*." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "col.esmcat.attributes[0]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Defines how to `open` the data as **analysis-ready** as possible with the underlying `xarray` tool:\n", "\n", "- which column of the `.csv.gz` file contains the path or link to the files\n", "- what the data format is\n", "- how to **aggregate** files to a dataset\n", "    - set a column to be used as a new dimension for the `xarray` dataset by `merge`\n", "    - when a file is opened, which dimension is used for `concat`?\n", "    - additional options for the `open` function" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Jobs we do for you\n", "\n", "- We **make all catalogs available** under `/pool/data/Catalogs/` and in the [cloud](https://swiftbrowser.dkrz.de/public/dkrz_a44962e3ba914c309a7421573a6949a6/intake-esm/)\n", "- We **create and update** the content of each project's catalog regularly by running automatically executed scripts, so-called _cronjobs_. We set the update frequency so that each catalog reflects changes in the project's data sufficiently quickly.\n", "    - The updated catalog __replaces__ the outdated one.\n", "    - The updated catalog is __uploaded__ to the DKRZ swift cloud.\n", "    - We plan to provide a catalog that tracks data which is __removed__ by the update." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!ls /work/ik1017/Catalogs/dkrz_*.json" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" }, "tags": [ "hide-input" ] }, "outputs": [], "source": [ "import pandas as pd\n", "#pd.options.display.max_colwidth = 100\n", "services = pd.DataFrame.from_dict({\"CMIP6\" : {\n", " \"Update Frequency\" : \"Daily\",\n", " \"On cloud\" : \"Yes\", #\"https://swift.dkrz.de/v1/dkrz_a44962e3ba914c309a7421573a6949a6/intake-esm/mistral-cmip6.json\",\n", " \"Path to catalog\" : \"/pool/data/Catalogs/dkrz_cmip6_disk.json\",\n", " \"OpenDAP\" : \"Yes\",\n", " \"Retraction Tracking\" : \"Yes\",\n", " \"Minimum required Memory\" : \"10GB\",\n", "}, \"CMIP5\": {\n", " \"Update Frequency\" : \"On demand\",\n", " \"On cloud\" : \"Yes\", #\"https://swift.dkrz.de/v1/dkrz_a44962e3ba914c309a7421573a6949a6/intake-esm/mistral-cmip5.json\",\n", " \"Path to catalog\" : \"/pool/data/Catalogs/dkrz_cmip5_disk.json\",\n", " \"OpenDAP\" : \"Yes\",\n", " \"Retraction Tracking\" : \"\",\n", " \"Minimum required Memory\" : \"5GB\",\n", "}, \"CORDEX\": {\n", " \"Update Frequency\" : \"Monthly\",\n", " \"On cloud\" : \"Yes\", #\"https://swift.dkrz.de/v1/dkrz_a44962e3ba914c309a7421573a6949a6/intake-esm/mistral-cordex.json\",\n", " \"Path to catalog\" : \"/pool/data/Catalogs/dkrz_cordex_disk.json\",\n", " \"OpenDAP\" : \"No\",\n", " \"Retraction Tracking\" : \"\",\n", " \"Minimum required Memory\" : \"5GB\",\n", "}, \"ERA5\": {\n", " \"Update Frequency\" : \"On demand\",\n", " \"On cloud\" : \"Yes\",\n", " \"Path to catalog\" : \"/pool/data/Catalogs/dkrz_era5_disk.json\",\n", " \"OpenDAP\" : \"No\",\n", " \"Retraction Tracking\" : \"--\",\n", " \"Minimum required Memory\" : \"5GB\",\n", "}, \"MPI-GE\": {\n", " \"Update Frequency\" : \"On demand\",\n", " \"On cloud\" : \"Yes\",# 
\"https://swift.dkrz.de/v1/dkrz_a44962e3ba914c309a7421573a6949a6/intake-esm/mistral-MPI-GE.json\n", " \"Path to catalog\" : \"/pool/data/Catalogs/dkrz_mpige_disk.json\",\n", " \"OpenDAP\" : \"\",\n", " \"Retraction Tracking\" : \"--\",\n", " \"Minimum required Memory\" : \"No minimum\",\n", "}}, orient = \"index\")\n", "servicestb=services.style.set_properties(**{\n", " 'font-size': '14pt',\n", "})\n", "\n", "servicestb" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Best practices and recommendations:\n", "\n", "- `Intake` can make your scripts **reusable**.\n", "    - Instead of working with local copies or edited versions of files, always start from a globally defined catalog which everyone can access.\n", "    - Save the subset of the catalog which you work on as a new catalog instead of as a subset of files. It can be hard to find out why data is no longer included in recent catalog versions, especially if retraction tracking is not enabled.\n", "- `Intake` helps you to __avoid downloading data__ by reducing the intermediate steps that would otherwise produce temporary output.\n", "- Check for new ingests by simply __rerunning__ your script; it will open the most recent catalog.\n", "- When loading datasets into `xarray` with `to_dataset_dict`, always pass the argument `cdf_kwargs={\"chunks\":{\"time\":1}}`. Otherwise, the chunks can become so large that your memory limit is exceeded." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "### Technical requirements for usage\n", "\n", "- Memory:\n", "    - Depending on the project's volume, the catalogs can be big. 
If you need to work with the entire catalog, you need at least **10GB** of memory.\n", "    - On jupyterhub.dkrz.de, start the notebook server with matching resources.\n", "- Software:\n", "    - `Intake` works on the basis of `xarray` and `pandas`.\n", "    - On jupyterhub.dkrz.de, use one of the recent kernels:\n", "        - unstable\n", "        - bleeding edge" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "### Load the catalog" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "#import intake\n", "#collection = intake.open_esm_datastore(services[\"Path to catalog\"][0])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "### Next step:\n", "\n", "- https://tutorials.dkrz.de/intake.html\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "python3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.10" } }, "nbformat": 4, "nbformat_minor": 4 }