{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## CEDA CF Checker\n",
    "\n",
    "The [CF Checker](https://github.com/cedadev/cf-checker) is a software tool provided by [CEDA](https://www.ceda.ac.uk/) (Centre for Environmental Data Analysis) to verify that netCDF files comply with the [CF conventions](https://cfconventions.org/).\n",
    "\n",
    "> The CF conventions have been adopted by a number of projects and groups as a primary standard. The conventions define metadata that provide a definitive description of what the data in each variable represents, and the spatial and temporal properties of the data."
   ]
  },
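  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1. Installation and Preparation\n",
    "\n",
    "We recommend installing the CF Checker with `pip`; the package is published on PyPI as `cfchecker` and provides the `cfchecks` command used in this notebook.\n",
    "\n",
    "The following cell prepends the directory of the running Python executable to `PATH` so that a `cfchecks` script installed into the same environment is found by the shell."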
   ]
  },
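  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import sys\n",
    "\n",
    "# Prepend the directory of the running Python executable to PATH so that\n",
    "# command-line tools installed into the same environment are found.\n",
    "bin_dir = os.path.dirname(sys.executable)\n",
    "os.environ[\"PATH\"] = bin_dir + os.pathsep + os.environ[\"PATH\"]"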
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. Settings\n",
    "\n",
    "Specify the file or dataset to be tested in `testfile`. Shell wildcards such as `*` are allowed.\n",
    "\n",
    "The CF Checker takes three tables as input: the CF standard name table, the area type table, and the standardized region list. Each table is versioned independently. You can set the version numbers directly in `table_dict`, or set the switch `update_versions=True` so that the latest version numbers are read from the CF conventions website. If you set the switch `download_tables=True`, the tables are downloaded to the `CF` subdirectory of the working directory `working_dir`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "testfile = \"/work/ik1017/CMIP6//data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp370/r1i1p1f1/Amon/tas/gn/v20190710/*\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "update_versions = False\n",
    "download_tables = False"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "table_dict = {\n",
    "    \"cf-standard-name-table\": {\n",
    "        \"version\": 76,\n",
    "        \"page\": \"http://cfconventions.org/Data/cf-standard-names/current/build/cf-standard-name-table.html\",\n",
    "    },\n",
    "    \"area-type-table\": {\n",
    "        \"version\": 9,\n",
    "        \"page\": \"http://cfconventions.org/Data/area-type-table/current/build/area-type-table.html\",\n",
    "    },\n",
    "    \"standardized-region-list\": {\n",
    "        \"version\": 4,\n",
    "        \"page\": \"http://cfconventions.org/Data/standardized-region-list/standardized-region-list.current.html\",\n",
    "    },\n",
    "}\n",
    "working_dir = \"./\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3. Initialization\n",
    "\n",
    "If `update_versions` is `True`, we download each table's web page with the `requests` package, parse it with `BeautifulSoup`, and read the latest version number from it. We then build a download `url` for each table from its version number. If `download_tables` is `True`, we also download the tables to the working directory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "from bs4 import BeautifulSoup"
   ]
  },
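  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_recent_versions(page):\n",
    "    \"\"\"Return the first version number listed on a CF table web page.\"\"\"\n",
    "    response = requests.get(page)\n",
    "    response.raise_for_status()\n",
    "    # Name the parser explicitly; otherwise BeautifulSoup guesses and warns.\n",
    "    parsed_html = BeautifulSoup(response.content, \"html.parser\")\n",
    "    # The page text is expected to contain \"Version <number>,\".\n",
    "    return int(str(parsed_html).split(\"Version\")[1].split(\",\")[0])"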
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if update_versions:\n",
    "    for table in table_dict.values():\n",
    "        table[\"version\"] = get_recent_versions(table[\"page\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "table_dict"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Download URL templates; {0} is replaced by the table version.\n",
    "url_templates = {\n",
    "    \"cf-standard-name-table\": \"http://cfconventions.org/Data/cf-standard-names/{0}/src/cf-standard-name-table.xml\",\n",
    "    \"area-type-table\": \"http://cfconventions.org/Data/area-type-table/{0}/src/area-type-table.xml\",\n",
    "    \"standardized-region-list\": \"http://cfconventions.org/Data/standardized-region-list/standardized-region-list.{0}.xml\",\n",
    "}\n",
    "for tablename, template in url_templates.items():\n",
    "    table_dict[tablename][\"url\"] = template.format(table_dict[tablename][\"version\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "table_dir = os.path.join(working_dir, \"CF\")\n",
    "if download_tables:\n",
    "    # Make sure the target directory exists before writing the tables.\n",
    "    os.makedirs(table_dir, exist_ok=True)\n",
    "\n",
    "for tablename, table in table_dict.items():\n",
    "    table[\"local_path\"] = os.path.join(\n",
    "        table_dir, \"{0}-{1}.xml\".format(tablename, table[\"version\"])\n",
    "    )\n",
    "    if download_tables:\n",
    "        response = requests.get(table[\"url\"])\n",
    "        response.raise_for_status()\n",
    "        with open(table[\"local_path\"], \"wb\") as file:\n",
    "            file.write(response.content)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "table_dict"
   ]
  },
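  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4. Application\n",
    "\n",
    "We run the CF Checker command `cfchecks` in a shell with `subprocess` and capture all output. The options `-a`, `-r`, and `-s` take the locations of the area type table, the standardized region list, and the standard name table, here given as URLs."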
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import subprocess\n",
    "\n",
    "# shell=True lets the shell expand wildcards in `testfile`.\n",
    "a = subprocess.run(\n",
    "    \"cfchecks -a {0} -r {1} -s {2} {3}\".format(\n",
    "        table_dict[\"area-type-table\"][\"url\"],\n",
    "        table_dict[\"standardized-region-list\"][\"url\"],\n",
    "        table_dict[\"cf-standard-name-table\"][\"url\"],\n",
    "        testfile,\n",
    "    ),\n",
    "    capture_output=True,\n",
    "    shell=True,\n",
    ")"
   ]
  },
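  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5. Results\n",
    "\n",
    "We write the `stdout` to a file in the `cf-checker-results` subdirectory of `working_dir`, named after the first checked file. Additionally, we search the `stdout` for three patterns (`CHECKING NetCDF FILE`, `WARNINGS given`, and `ERRORS detected`) to create a **summary** of the CF Checker results."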
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "stdout_lines = a.stdout.decode(\"utf-8\").split(\"\\n\")\n",
    "\n",
    "\n",
    "def extract(pattern):\n",
    "    \"\"\"Return the field following the first colon of every line containing `pattern`.\"\"\"\n",
    "    return [line.split(\":\")[1].strip() for line in stdout_lines if pattern in line]\n",
    "\n",
    "\n",
    "files = extract(\"CHECKING NetCDF FILE\")\n",
    "warnings = extract(\"WARNINGS given\")\n",
    "errors = extract(\"ERRORS detected\")"
   ]
  },
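  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import shutil\n",
    "# Start from an empty results directory inside the working directory.\n",
    "results_dir = os.path.join(working_dir, \"cf-checker-results\")\n",
    "shutil.rmtree(results_dir, ignore_errors=True)\n",
    "os.makedirs(results_dir)\n",
    "with open(os.path.join(results_dir, os.path.basename(files[0])), \"w\") as file:\n",
    "    file.write(a.stdout.decode(\"utf-8\"))"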
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "result_dict = {\n",
    "    file: {\"warnings\": n_warnings, \"errors\": n_errors}\n",
    "    for file, n_warnings, n_errors in zip(files, warnings, errors)\n",
    "}\n",
    "print(result_dict)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "python3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}