{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Intake catalogs\n",
    "\n",
    "In order to make the DKRZ CMIP data pool more [FAIR](https://www.dkrz.de/up/services/data-management/LTA/fairness), we support the **python package** `intake-esm` which allows you to **use collections of climate data easily and fast**. \n",
    "\n",
    "We provide a tutorial here:\n",
    "https://tutorials.dkrz.de/intake.html\n",
    "\n",
    "The offical `intake-esm` page:\n",
    "https://intake-esm.readthedocs.io/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "**Features**\n",
    "\n",
    "- display catalogs as clearly structured tables inside jupyter notebooks for easy investigation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "import intake\n",
    "col = intake.open_esm_datastore(\"/work/ik1017/Catalogs/dkrz_cmip6_disk.json\")\n",
    "col.df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "col.esmcat.description"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "**Features**\n",
    "\n",
    "- browse through the catalog and select your data without being on the pool file system\n",
    "\n",
    "⇨ A pythonic reproducable alternative compared to complex `find` commands or GUI searches. No need for Filesystems and filenames."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "tas = col.search(experiment_id=\"historical\", source_id=\"MPI-ESM1-2-HR\", variable_id=\"tas\", table_id=\"Amon\", member_id=\"r1i1p1f1\")\n",
    "tas"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "**Features**\n",
    "\n",
    "- open climate data in an analysis ready dictionary of `xarray` datasets\n",
    "\n",
    "Forget about annoying temporary merging and reformatting steps!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "tas.to_dataset_dict(cdf_kwargs={\"chunks\":{\"time\":1}})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "**Features**\n",
    "\n",
    "- display catalogs as clearly structured tables inside jupyter notebooks for easy investigation\n",
    "- browse through the catalog and select your data without being on the pool file system\n",
    "- open climate data in an analysis ready dictionary of `xarray` datasets\n",
    "\n",
    "⇨ `intake-esm` reduces the data access and data preparation tasks on analysists side"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Catalog content\n",
    "\n",
    "The catalog is a combination of\n",
    "\n",
    "- a list of files (at dkrz compressed as `.csv.gz`) where each line contains a filepath as an index and column values to describe that file\n",
    "    - The columns of the catalog should be selected such that  a dataset in the project's data repository can be *uniquely identified*. I.e., all elements of the project's Data Reference Syntax should be covered (See the project's documentation for more information about the DRS) .\n",
    "- a `.json` formatted descriptor file for the list which contains additional settings which tell `intake` how to interprete the data. \n",
    "\n",
    "According to our policy, both files have the same name and are available in the same directory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "print(\"What is this catalog about? \\n\" + col.esmcat.description)\n",
    "#\n",
    "print(\"The path to the list of files: \"+ col.esmcat.catalog_file)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "**Creation of the `.csv.gz` list :**\n",
    "\n",
    "1. A file list is created based on a `find` shell command on the project directory in the data pool.\n",
    "2. For the column values, filenames and Pathes are parsed according to the project's `path_template` and `filename_template`. These templates need to be constructed with attribute values requested and required by the project.\n",
    "    - Filenames that cannot be parsed are sorted out\n",
    "3. Depending on the project, additional columns can be created by adding project's specifications.\n",
    "    - E.g., for CMIP6, we added a `OpenDAP` column which allows users to access data from everywhere via `http`"
   ]
  },
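  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "The parsing step above can be sketched roughly as follows. This is only an illustration under stated assumptions: the template `<source_id>/<experiment_id>/<variable_id>_<table_id>.nc`, the pattern `PATH_PATTERN`, the helper `parse_paths` and the example paths are hypothetical and much simpler than a real project's `path_template` and `filename_template`.\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "# Hypothetical, simplified template (an assumption for this sketch):\n",
    "#   <source_id>/<experiment_id>/<variable_id>_<table_id>.nc\n",
    "PATH_PATTERN = re.compile(\n",
    "    r'(?:^|/)(?P<source_id>[^/]+)/(?P<experiment_id>[^/]+)/'\n",
    "    r'(?P<variable_id>[^_/]+)_(?P<table_id>[^_/]+)\\.nc$'\n",
    ")\n",
    "\n",
    "\n",
    "def parse_paths(paths):\n",
    "    # Return one catalog row (a dict) per path that matches the template.\n",
    "    rows = []\n",
    "    for path in paths:\n",
    "        match = PATH_PATTERN.search(path)\n",
    "        if match is None:\n",
    "            continue  # paths that cannot be parsed are sorted out\n",
    "        rows.append({'path': path, **match.groupdict()})\n",
    "    return rows\n",
    "\n",
    "\n",
    "# Made-up example paths: one matches the template, one does not\n",
    "catalog = parse_paths([\n",
    "    'pool/MPI-ESM1-2-HR/historical/tas_Amon.nc',\n",
    "    'pool/MPI-ESM1-2-HR/historical/README.txt',\n",
    "])\n",
    "for row in catalog:\n",
    "    print(row)\n",
    "```\n",
    "\n",
    "In this sketch, each named group of the pattern becomes one column of the catalog, and the file path itself is kept as the index column."
   ]
  },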
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "**Configuration of the `.json` descriptor:**\n",
    "\n",
    "Makes the catalog **self-descriptive** by defining all necessary information to understand the `.csv.gz` file\n",
    "\n",
    "- Specifications for the *headers* of the columns - in case of CMIP6, each column is linked to a *Controlled Vocabulary*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "col.esmcat.attributes[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "Defines how to `open` the data as **analysis ready** as possible with the underlaying `xarray` tool:\n",
    "\n",
    "- which column of the `.csv.gz` file contains the path or link to the files\n",
    "- what is the data format\n",
    "- how to **aggregate** files to a dataset\n",
    "    - set a column to be used as a new dimension for the xarray by `merge`\n",
    "    - when opened a file, what is `concat` dimension?\n",
    "    - additional options for the `open` function"
   ]
  },
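  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "As a rough illustration of the items above, the relevant part of a descriptor could look like the following Python dictionary. The key names follow the ESM collection specification used by `intake-esm`, as far as we are aware. The name `descriptor_excerpt`, the column names and the option values are assumptions chosen for illustration and are not copied from a DKRZ catalog.\n",
    "\n",
    "```python\n",
    "import json\n",
    "\n",
    "# Illustrative values only, not copied from a DKRZ catalog\n",
    "descriptor_excerpt = {\n",
    "    'assets': {\n",
    "        'column_name': 'path',  # column holding the path or link to a file\n",
    "        'format': 'netcdf',     # data format of the files\n",
    "    },\n",
    "    'aggregation_control': {\n",
    "        'variable_column_name': 'variable_id',\n",
    "        # files sharing these column values end up in the same dataset\n",
    "        'groupby_attrs': ['source_id', 'experiment_id', 'table_id'],\n",
    "        'aggregations': [\n",
    "            # merge different variables into one dataset\n",
    "            {'type': 'union', 'attribute_name': 'variable_id'},\n",
    "            # concatenate files along the existing time dimension\n",
    "            {\n",
    "                'type': 'join_existing',\n",
    "                'attribute_name': 'time_range',\n",
    "                'options': {'dim': 'time'},\n",
    "            },\n",
    "            # stack ensemble members along a new dimension\n",
    "            {'type': 'join_new', 'attribute_name': 'member_id'},\n",
    "        ],\n",
    "    },\n",
    "}\n",
    "\n",
    "# Show the excerpt in the JSON form used by a descriptor file\n",
    "print(json.dumps(descriptor_excerpt, indent=2))\n",
    "```\n",
    "\n",
    "The actual settings of an opened catalog can be inspected through `col.esmcat`."
   ]
  },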
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Jobs we do for you\n",
    "\n",
    "- We **make all catalogs available** under `/pool/data/Catalogs/` and in the [cloud](https://swiftbrowser.dkrz.de/public/dkrz_a44962e3ba914c309a7421573a6949a6/intake-esm/)\n",
    "- We **create and update** the content of project's catalogs regularly by running scripts which are automatically executed and called _cronjobs_. We set the creation frequency so that the data of the project is updated sufficently quickly.\n",
    "    - The updated catalog __replaces__ the outdated one. \n",
    "    - The updated catalog is __uploaded__ to the DKRZ swift cloud \n",
    "    - We plan to provide a catalog that tracks data which is __removed__ by the update."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!ls /work/ik1017/Catalogs/dkrz_*.json"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    },
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "#pd.options.display.max_colwidth = 100\n",
    "services = pd.DataFrame.from_dict({\"CMIP6\" : {\n",
    "    \"Update Frequency\" : \"Daily\",\n",
    "    \"On cloud\" : \"Yes\", #\"https://swift.dkrz.de/v1/dkrz_a44962e3ba914c309a7421573a6949a6/intake-esm/mistral-cmip6.json\",\n",
    "    \"Path to catalog\" : \"/pool/data/Catalogs/dkrz_cmip6_disk.json\",\n",
    "    \"OpenDAP\" : \"Yes\",\n",
    "    \"Retraction Tracking\" : \"Yes\",\n",
    "    \"Minimum required Memory\" : \"10GB\",\n",
    "}, \"CMIP5\": {\n",
    "    \"Update Frequency\" : \"On demand\",\n",
    "    \"On cloud\" : \"Yes\", #\"https://swift.dkrz.de/v1/dkrz_a44962e3ba914c309a7421573a6949a6/intake-esm/mistral-cmip5.json\",\n",
    "    \"Path to catalog\" : \"/pool/data/Catalogs/dkrz_cmip5_disk.json\",\n",
    "    \"OpenDAP\" : \"Yes\",\n",
    "    \"Retraction Tracking\" : \"\",\n",
    "    \"Minimum required Memory\" : \"5GB\",\n",
    "}, \"CORDEX\": {\n",
    "    \"Update Frequency\" : \"Monthly\",\n",
    "    \"On cloud\" : \"Yes\", #\"https://swift.dkrz.de/v1/dkrz_a44962e3ba914c309a7421573a6949a6/intake-esm/mistral-cordex.json\",\n",
    "    \"Path to catalog\" : \"/pool/data/Catalogs/dkrz_cordex_disk.json\",\n",
    "    \"OpenDAP\" : \"No\",\n",
    "    \"Retraction Tracking\" : \"\",\n",
    "    \"Minimum required Memory\" : \"5GB\",\n",
    "}, \"ERA5\": {\n",
    "    \"Update Frequency\" : \"On demand\",\n",
    "    \"On cloud\" : \"Yes\",\n",
    "    \"Path to catalog\" : \"/pool/data/Catalogs/dkrz_era5_disk.json\",\n",
    "    \"OpenDAP\" : \"No\",\n",
    "    \"Retraction Tracking\" : \"--\",\n",
    "    \"Minimum required Memory\" : \"5GB\",\n",
    "}, \"MPI-GE\": {\n",
    "    \"Update Frequency\" : \"On demand\",\n",
    "    \"On cloud\" : \"Yes\",# \"https://swift.dkrz.de/v1/dkrz_a44962e3ba914c309a7421573a6949a6/intake-esm/mistral-MPI-GE.json\n",
    "    \"Path to catalog\" : \"/pool/data/Catalogs/dkrz_mpige_disk.json\",\n",
    "    \"OpenDAP\" : \"\",\n",
    "    \"Retraction Tracking\" : \"--\",\n",
    "    \"Minimum required Memory\" : \"No minimum\",\n",
    "}}, orient  = \"index\")\n",
    "servicestb=services.style.set_properties(**{\n",
    "    'font-size': '14pt',\n",
    "})\n",
    "\n",
    "servicestb"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Best practises and recommendations:\n",
    "\n",
    "- `Intake` can make your scripts **reusable**.\n",
    "    - Instead of working with local copy or editions of files, always start from a globally defined catalog which everyone can access. \n",
    "    - Save the subset of the catalog which you work on as a new catalog instead of a subset of files. It can be hard to find out why data is not included anymore in recent catalog versions, especially if retraction tracking is not enabled.\n",
    "- `Intake` helps you to __avoid downloading data__ by reducing necessary temporary steps which can cause temporary output.\n",
    "- Check for new ingests by just __repeating__ your script - it will open the most recent catalog.\n",
    "- Only load datasets with `to_dataset_dict` into xarrray with the argument `cdf_kwargs={\"chunks\":{\"time\":1}}`. Otherwise, the chunnk will let your memory exceed limits."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "### Technical requirements for usage\n",
    "\n",
    "- Memory:\n",
    "    - Depending on the project's volume, the catalogs can be big. If you need to work with the total catalog, you require at least **10GB** memory.\n",
    "    - On jupyterhub.dkrz.de, start the notebook server with matching ressources.\n",
    "- Software:\n",
    "    - `Intake` works on the basis of `xarray` and `pandas`.\n",
    "    - On jupyterhub.dkrz.de , use one of the recent kernels:\n",
    "        - unstable\n",
    "        - bleeding edge"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "### Load the catalog"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "outputs": [],
   "source": [
    "#import intake\n",
    "#collection = intake.open_esm_datastore(services[\"Path to catalog\"][0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "### Next step:\n",
    "\n",
    "- https://tutorials.dkrz.de/intake.html\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "python3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}