{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "***\n",
    "<span style=\"color:#008385\">\n",
    "\n",
    "**15-448: Machine Learning in a Nutshell**, *CMU-Qatar* Spring'20\n",
    "\n",
    "**Gianni A. Di Caro**, www.giannidicaro.com\n",
    "\n",
    "<u>Lab Test</u> \n",
    "</span>\n",
    "***"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "toc": true
   },
   "source": [
    "<h1>Table of Contents<span class=\"tocSkip\"></span></h1>\n",
    "<div class=\"toc\"><ul class=\"toc-item\"><li><span><a href=\"#Lab-Test-2:-Practice-with-the-ML-pipeline-and-python-tools-for-a-classification-task\" data-toc-modified-id=\"Lab-Test-2:-Practice-with-the-ML-pipeline-and-python-tools-for-a-classification-task-1\"><span class=\"toc-item-num\">1&nbsp;&nbsp;</span>Lab Test 2: Practice with the ML pipeline and python tools for a classification task</a></span><ul class=\"toc-item\"><li><span><a href=\"#Classify-wheat-varieties-by-the-properties-of-their-seeds\" data-toc-modified-id=\"Classify-wheat-varieties-by-the-properties-of-their-seeds-1.1\"><span class=\"toc-item-num\">1.1&nbsp;&nbsp;</span>Classify wheat varieties by the properties of their seeds</a></span></li><li><span><a href=\"#Read-and-inspect-the-data\" data-toc-modified-id=\"Read-and-inspect-the-data-1.2\"><span class=\"toc-item-num\">1.2&nbsp;&nbsp;</span>Read and inspect the data</a></span></li><li><span><a href=\"#Exploratory-Data-Analysis\" data-toc-modified-id=\"Exploratory-Data-Analysis-1.3\"><span class=\"toc-item-num\">1.3&nbsp;&nbsp;</span>Exploratory Data Analysis</a></span></li><li><span><a href=\"#Fit-a-k-NN-classfier\" data-toc-modified-id=\"Fit-a-k-NN-classfier-1.4\"><span class=\"toc-item-num\">1.4&nbsp;&nbsp;</span>Fit a k-NN classfier</a></span></li><li><span><a href=\"#Estimate-the-generalization-error-of-the-trained-model\" data-toc-modified-id=\"Estimate-the-generalization-error-of-the-trained-model-1.5\"><span class=\"toc-item-num\">1.5&nbsp;&nbsp;</span>Estimate the generalization error of the trained model</a></span></li><li><span><a href=\"#Select-the-best-model-(number-of-neighbors)-for-the-data\" data-toc-modified-id=\"Select-the-best-model-(number-of-neighbors)-for-the-data-1.6\"><span class=\"toc-item-num\">1.6&nbsp;&nbsp;</span>Select the best model (number of neighbors) for the data</a></span></li></ul></li></ul></div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Lab Test 2: Practice with the ML pipeline and python tools for a classification task"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "***\n",
    "## Classify wheat varieties by the properties of their seeds \n",
    "***"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The dataset in the file `seeds.tsv` contains attribute data about seeds belonging to three different varieties of wheat: Kama, Rosa and Canadian. For each variety that are **70 data entries.**\n",
    "\n",
    "High quality visualization of the internal seed structure was detected using a soft X-ray technique, and the images were recorded on 13x18 cm X-ray KODAK plates. \n",
    "\n",
    "\n",
    "<u>Features:</u> To construct the data, **seven geometric attributes** of wheat seeds (kernels, to be more precise) were measured from the images:\n",
    "\n",
    "1. area $A$\n",
    "2. perimeter $P$\n",
    "3. compactness $C = \\frac{4\\pi A}{P^2}$\n",
    "4. length of kernel\n",
    "5. width of kernel\n",
    "6. asymmetry coefficient\n",
    "7. length of kernel groove\n",
    "\n",
    "**All the attributes are real-valued continuous.**\n",
    "\n",
    "<u>Variety labels:</u> Kama (1), Rosa (2), Canadian (3)\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Read and inspect the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Use Pandas to read and store the dataset in a PandaFrame variable\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# inspect the dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# extract and print out the label names of the feature attributes, \n",
    "# one by one, on different lines, e.g.:\n",
    "# area\n",
    "# perimeter \n",
    "# ....\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# print out with code, a warning message if nan entries are present, and indicate how many are there\n",
    "# e.g.: if nans > 0: -> Warning, there are xx non-numeric entries!\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exploratory Data Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "#import the necessary modules for visualization\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "# EDA: let's give a look to the data\n",
    "# start by plotting the labeled data in the two-dimensional feature subspace defined by \n",
    "# compactnees (x) and asymmetry (y).\n",
    "# Give meaningful titles, xy labels, legends\n",
    "#\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Make some comments about the displayed data in terms of the features and the hardness/easiness to classify them (comments must be formatted in markdown!)\n",
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "# [EXTRAS] display all possible plots of two pairs of features\n",
    "#\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[EXTRAS] Make some comments about the displayed data in terms of the features and the hardness/easiness to classify them (comments must be formatted in markdown!)\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Fit a k-NN classfier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Employ the K-NN to fit a classifier to the data.\n",
    "# Set up the classifier. \n",
    "# Need to make choices for k and voting rule.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Justify your choices for `k` and `weights`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 87,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Fit the classifier to the training data\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Print out the empirical accuracy on the training data\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Is the accuracy good / bad? Try out different choices for k and weights and make some comments about what you have observed. Empirically select a value for `k, weights`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Estimate the generalization error of the trained model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "# For the selected (k, weights), estimate the generalization error\n",
    "# Use n-fold Cross-Validation with n = [2, 5, 10, 50]\n",
    "# For each folding, print out both the error on training and validation\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Print and comment the results: what can we say about the expected generalization error? \n",
    "\n",
    "What is the effect of `n` on the estimate?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Select the best model (number of neighbors) for the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Now, use n-fold Cross-Validation to SELECT the BEST k-NN model for the data. \n",
    "# In this case, this means using n-fold Cross-Validation to select the best value of k.\n",
    "# From the previous analysis, you can fix a value for n, the number of foldings\n",
    "#\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Comment the results and your model choice."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# [EXTRAS] Show the boundary decision regions (and the labeled training data) \n",
    "# for the learned classifier for two selected features of your choice."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.2"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": false,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": true,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}