## Intake catalogs

To make the DKRZ CMIP data pool more [FAIR](https://www.dkrz.de/up/services/data-management/LTA/fairness), we support the **Python package** `intake-esm`, which lets you **use collections of climate data quickly and easily**.

We provide a tutorial here:
https://tutorials.dkrz.de/intake.html

The official `intake-esm` page:
https://intake-esm.readthedocs.io/

**Features**

- display catalogs as clearly structured tables inside Jupyter notebooks for easy investigation

In [None]:
import intake
col = intake.open_esm_datastore("/work/ik1017/Catalogs/dkrz_cmip6_disk.json")
col.df.head()

In [None]:
col.esmcat.description

**Features**

- browse through the catalog and select your data without being on the pool file system

⇨ A pythonic, reproducible alternative to complex `find` commands or GUI searches. No need to know file systems or filenames.

In [None]:
tas = col.search(experiment_id="historical", source_id="MPI-ESM1-2-HR", variable_id="tas", table_id="Amon", member_id="r1i1p1f1")
tas

**Features**

- open climate data in an analysis-ready dictionary of `xarray` datasets

Forget about annoying temporary merging and reformatting steps!

In [None]:
tas.to_dataset_dict(cdf_kwargs={"chunks":{"time":1}})

**Features**

- display catalogs as clearly structured tables inside Jupyter notebooks for easy investigation
- browse through the catalog and select your data without being on the pool file system
- open climate data in an analysis-ready dictionary of `xarray` datasets

⇨ `intake-esm` reduces the data access and data preparation work on the analyst's side

### Catalog content

The catalog is a combination of

- a list of files (at DKRZ compressed as `.csv.gz`) where each line contains a file path as an index and column values that describe that file
 - The columns of the catalog should be selected such that a dataset in the project's data repository can be *uniquely identified*, i.e., all elements of the project's Data Reference Syntax (DRS) should be covered. See the project's documentation for more information about the DRS.
- a `.json` formatted descriptor file for the list, which contains additional settings that tell `intake` how to interpret the data.

According to our policy, both files have the same name and are available in the same directory.

In [None]:
print("What is this catalog about? \n" + col.esmcat.description)
#
print("The path to the list of files: "+ col.esmcat.catalog_file)

**Creation of the `.csv.gz` list:**

1. A file list is created based on a `find` shell command on the project directory in the data pool.
2. For the column values, filenames and paths are parsed according to the project's `path_template` and `filename_template`. These templates need to be constructed from the attribute values requested and required by the project.
 - Filenames that cannot be parsed are discarded
3. Depending on the project, additional columns can be created from project-specific information.
 - E.g., for CMIP6, we added an `OpenDAP` column which allows users to access data from anywhere via `http`
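
The parsing in step 2 can be sketched as follows. This is a minimal illustration, not the script that builds the DKRZ catalogs: the two template strings, the example path, and the helpers `template_to_regex` and `parse_path` are assumptions modelled on the public CMIP6 Data Reference Syntax. Paths that do not fit the templates yield `None`, which corresponds to discarding them.

```python
import re

# Assumed templates following the CMIP6 DRS; the real project templates may differ.
PATH_TEMPLATE = (
    "{mip_era}/{activity_id}/{institution_id}/{source_id}/{experiment_id}/"
    "{member_id}/{table_id}/{variable_id}/{grid_label}/{version}"
)
FILENAME_TEMPLATE = (
    "{variable_id}_{table_id}_{source_id}_{experiment_id}_"
    "{member_id}_{grid_label}_{time_range}.nc"
)


def template_to_regex(template):
    """Turn '{name}' placeholders into named regex groups."""
    # re.split with a capturing group alternates literal text and placeholder names
    parts = re.split(r"\{(\w+)\}", template)
    pattern, seen = "", set()
    for i, part in enumerate(parts):
        if i % 2 == 0:
            pattern += re.escape(part)  # literal separators such as "/" or "_"
        elif part in seen:
            pattern += f"(?P={part})"  # repeated name must match the same text
        else:
            seen.add(part)
            pattern += f"(?P<{part}>[^/_]+)"
    return re.compile(pattern)


DRS_REGEX = template_to_regex(PATH_TEMPLATE + "/" + FILENAME_TEMPLATE)


def parse_path(relative_path):
    """Return the column values for one file, or None if it cannot be parsed."""
    match = DRS_REGEX.fullmatch(relative_path)
    return match.groupdict() if match else None


# Illustrative path, relative to the project directory
example = (
    "CMIP6/CMIP/MPI-M/MPI-ESM1-2-HR/historical/r1i1p1f1/Amon/tas/gn/v20190710/"
    "tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_185001-185412.nc"
)
record = parse_path(example)
print(record)
```

Each returned dictionary becomes one row of the file list, with the file path as the index.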

**Configuration of the `.json` descriptor:**

The descriptor makes the catalog **self-descriptive** by defining all information needed to understand the `.csv.gz` file:

- Specifications for the column *headers*: in the case of CMIP6, each column is linked to a *Controlled Vocabulary*.

In [None]:
col.esmcat.attributes[0]

The descriptor also defines how to `open` the data in as **analysis-ready** a form as possible with the underlying `xarray` tool:

- which column of the `.csv.gz` file contains the path or link to the files
- what is the data format
- how to **aggregate** files to a dataset
 - which column is used as a new dimension of the `xarray` dataset when files are combined by `merge`
 - which dimension is the `concat` dimension when a file is opened
 - additional options for the `open` function
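
To make these settings concrete, here is a hedged sketch of what such a descriptor can look like, written as a Python dictionary and dumped to JSON. The key names follow the ESM collection specification that `intake-esm` reads, as far as we know it; the `id`, the file name, the column name `path`, and all values are illustrative assumptions and not the content of the DKRZ descriptor. The vocabulary links are left empty on purpose.

```python
import json

# Illustrative descriptor; every value below is an assumption for demonstration.
descriptor = {
    "esmcat_version": "0.1.0",
    "id": "example_cmip6_catalog",  # hypothetical name
    "description": "Illustrative descriptor, not the DKRZ original",
    "catalog_file": "example_cmip6_catalog.csv.gz",  # hypothetical file list
    # Column headers, each optionally linked to a controlled vocabulary
    "attributes": [
        {"column_name": "source_id", "vocabulary": ""},
        {"column_name": "experiment_id", "vocabulary": ""},
        {"column_name": "variable_id", "vocabulary": ""},
    ],
    # Which column holds the path or link, and the data format
    "assets": {"column_name": "path", "format": "netcdf"},
    # How files are aggregated into datasets
    "aggregation_control": {
        "variable_column_name": "variable_id",
        "groupby_attrs": ["activity_id", "source_id", "experiment_id", "table_id"],
        "aggregations": [
            # merge different variables into one dataset
            {"type": "union", "attribute_name": "variable_id"},
            # concatenate files along the existing time dimension
            {
                "type": "join_existing",
                "attribute_name": "time_range",
                "options": {"dim": "time"},
            },
            # turn a column into a new dimension
            {
                "type": "join_new",
                "attribute_name": "member_id",
                "options": {"coords": "minimal"},
            },
        ],
    },
}

descriptor_json = json.dumps(descriptor, indent=2)
print(descriptor_json)
```

Compare this sketch with the output of `col.esmcat` for a real catalog to see the settings actually in use.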

### Jobs we do for you

- We **make all catalogs available** under `/pool/data/Catalogs/` and in the [cloud](https://swiftbrowser.dkrz.de/public/dkrz_a44962e3ba914c309a7421573a6949a6/intake-esm/)
- We **create and update** the content of the projects' catalogs regularly by running scripts that are executed automatically as _cronjobs_. We set the update frequency so that the catalog reflects changes to the project's data sufficiently quickly.
 - The updated catalog __replaces__ the outdated one. 
 - The updated catalog is __uploaded__ to the DKRZ swift cloud 
 - We plan to provide a catalog that tracks data which is __removed__ by the update.

In [None]:
!ls /work/ik1017/Catalogs/dkrz_*.json

In [None]:
import pandas as pd
#pd.options.display.max_colwidth = 100
services = pd.DataFrame.from_dict({"CMIP6" : {
 "Update Frequency" : "Daily",
 "On cloud" : "Yes", #"https://swift.dkrz.de/v1/dkrz_a44962e3ba914c309a7421573a6949a6/intake-esm/mistral-cmip6.json",
 "Path to catalog" : "/pool/data/Catalogs/dkrz_cmip6_disk.json",
 "OpenDAP" : "Yes",
 "Retraction Tracking" : "Yes",
 "Minimum required Memory" : "10GB",
}, "CMIP5": {
 "Update Frequency" : "On demand",
 "On cloud" : "Yes", #"https://swift.dkrz.de/v1/dkrz_a44962e3ba914c309a7421573a6949a6/intake-esm/mistral-cmip5.json",
 "Path to catalog" : "/pool/data/Catalogs/dkrz_cmip5_disk.json",
 "OpenDAP" : "Yes",
 "Retraction Tracking" : "",
 "Minimum required Memory" : "5GB",
}, "CORDEX": {
 "Update Frequency" : "Monthly",
 "On cloud" : "Yes", #"https://swift.dkrz.de/v1/dkrz_a44962e3ba914c309a7421573a6949a6/intake-esm/mistral-cordex.json",
 "Path to catalog" : "/pool/data/Catalogs/dkrz_cordex_disk.json",
 "OpenDAP" : "No",
 "Retraction Tracking" : "",
 "Minimum required Memory" : "5GB",
}, "ERA5": {
 "Update Frequency" : "On demand",
 "On cloud" : "Yes",
 "Path to catalog" : "/pool/data/Catalogs/dkrz_era5_disk.json",
 "OpenDAP" : "No",
 "Retraction Tracking" : "--",
 "Minimum required Memory" : "5GB",
}, "MPI-GE": {
 "Update Frequency" : "On demand",
 "On cloud" : "Yes",# "https://swift.dkrz.de/v1/dkrz_a44962e3ba914c309a7421573a6949a6/intake-esm/mistral-MPI-GE.json
 "Path to catalog" : "/pool/data/Catalogs/dkrz_mpige_disk.json",
 "OpenDAP" : "",
 "Retraction Tracking" : "--",
 "Minimum required Memory" : "No minimum",
}}, orient = "index")
servicestb=services.style.set_properties(**{
 'font-size': '14pt',
})

servicestb

### Best practices and recommendations:

- `Intake` can make your scripts **reusable**.
 - Instead of working with local copies or edited versions of files, always start from a globally defined catalog which everyone can access.
 - Save the subset of the catalog that you work on as a new catalog instead of as a subset of files. It can be hard to find out why data is no longer included in recent catalog versions, especially if retraction tracking is not enabled.
- `Intake` helps you to __avoid downloading data__ by reducing the intermediate steps that would otherwise produce temporary output.
- Check for new ingests by just __repeating__ your script - it will open the most recent catalog.
- When loading datasets into `xarray` with `to_dataset_dict`, always pass the argument `cdf_kwargs={"chunks":{"time":1}}`. Otherwise, the default chunking can make your memory usage exceed its limits.
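
The recommendation to save a catalog subset rather than a set of files can be sketched with plain `pandas`. This is a simplified stand-in, assuming a toy table: the rows, the paths, and the file name `my_subset.csv.gz` are made up, and only the file list is written, not the `.json` descriptor. `intake-esm` provides a `serialize` method on the catalog object for storing a complete catalog; check its documentation for the exact arguments.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for a catalog table; rows and paths are made up.
catalog = pd.DataFrame(
    {
        "source_id": ["MPI-ESM1-2-HR", "MPI-ESM1-2-HR", "MPI-ESM1-2-LR"],
        "experiment_id": ["historical", "ssp585", "historical"],
        "variable_id": ["tas", "tas", "pr"],
        "path": ["/hypothetical/a.nc", "/hypothetical/b.nc", "/hypothetical/c.nc"],
    }
)

# Select the rows you work with (the same idea as a catalog search)
is_selected = (catalog["experiment_id"] == "historical") & (
    catalog["variable_id"] == "tas"
)
subset = catalog[is_selected]

# Store the selection as a compressed file list instead of copying data files
out_dir = Path(tempfile.mkdtemp())
subset_file = out_dir / "my_subset.csv.gz"
subset.to_csv(subset_file, index=False)  # gzip compression is inferred from ".gz"

# Anyone with the file can restore exactly the same selection later
restored = pd.read_csv(subset_file)
print(restored)
```

Keeping the subset as a table records which files an analysis used, even after the main catalog has been updated.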

### Technical requirements for usage

- Memory:
 - Depending on the project's volume, the catalogs can be big. If you need to work with the total catalog, you require at least **10GB** memory.
 - On jupyterhub.dkrz.de, start the notebook server with matching resources.
- Software:
 - `Intake` works on the basis of `xarray` and `pandas`.
 - On jupyterhub.dkrz.de, use one of the recent kernels:
  - unstable
  - bleeding edge

### Load the catalog

In [None]:
#import intake
#collection = intake.open_esm_datastore(services.loc["CMIP6", "Path to catalog"])

### Next step:

- https://tutorials.dkrz.de/intake.html
