{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "\n", "\n", "**15-448: Machine Learning in a Nutshell**, *CMU-Qatar* Spring'20\n", "\n", "**Gianni A. Di Caro**, www.giannidicaro.com\n", "\n", "Lab Test \n", "\n", "***" ] }, { "cell_type": "markdown", "metadata": { "toc": true }, "source": [ "

Table of Contents

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Lab Test 2: Practice with the ML pipeline and python tools for a classification task" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "## Classify wheat varieties by the properties of their seeds \n", "***" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dataset in the file `seeds.tsv` contains attribute data about seeds belonging to three different varieties of wheat: Kama, Rosa and Canadian. For each variety that are **70 data entries.**\n", "\n", "High quality visualization of the internal seed structure was detected using a soft X-ray technique, and the images were recorded on 13x18 cm X-ray KODAK plates. \n", "\n", "\n", "Features: To construct the data, **seven geometric attributes** of wheat seeds (kernels, to be more precise) were measured from the images:\n", "\n", "1. area $A$\n", "2. perimeter $P$\n", "3. compactness $C = \\frac{4\\pi A}{P^2}$\n", "4. length of kernel\n", "5. width of kernel\n", "6. asymmetry coefficient\n", "7. 
length of kernel groove\n", "\n", "**All the attributes are real-valued and continuous.**\n", "\n", "Variety labels: Kama (1), Rosa (2), Canadian (3)\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Read and inspect the data" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "# Use pandas to read the dataset and store it in a DataFrame variable\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# inspect the dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# extract and print out the label names of the feature attributes, \n", "# one by one, on different lines, e.g.:\n", "# area\n", "# perimeter \n", "# ....\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# programmatically print a warning message if NaN entries are present, and indicate how many there are\n", "# e.g.: if nans > 0: -> Warning, there are xx NaN entries!\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exploratory Data Analysis" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# import the necessary modules for visualization\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# EDA: let's take a look at the data\n", "# start by plotting the labeled data in the two-dimensional feature subspace defined by \n", "# compactness (x) and asymmetry (y).\n", "# Give meaningful titles, xy labels, legends\n", "#\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Comment on the displayed data in terms of the features and how hard or easy it is to classify the samples (comments must be 
formatted in markdown!)\n", "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# [EXTRAS] display scatter plots for all possible pairs of features\n", "#\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[EXTRAS] Comment on the displayed data in terms of the features and how hard or easy it is to classify the samples (comments must be formatted in markdown!)\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fit a k-NN classifier" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# Use k-NN to fit a classifier to the data.\n", "# Set up the classifier. \n", "# You need to choose k and the voting rule (weights).\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Justify your choices for `k` and `weights`" ] }, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "# Fit the classifier to the training data\n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# Print out the empirical accuracy on the training data\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Is the accuracy good or bad? Try out different choices for `k` and `weights` and comment on what you observe. 
Empirically select values for `k` and `weights`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Estimate the generalization error of the trained model" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "# For the selected (k, weights), estimate the generalization error\n", "# Use n-fold Cross-Validation with n in [2, 5, 10, 50]\n", "# For each folding, print out both the training error and the validation error\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Print and comment on the results: what can we say about the expected generalization error? \n", "\n", "What is the effect of `n` on the estimate?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Select the best model (number of neighbors) for the data" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "# Now, use n-fold Cross-Validation to SELECT the BEST k-NN model for the data. \n", "# In this case, this means using n-fold Cross-Validation to select the best value of k.\n", "# From the previous analysis, you can fix a value for n, the number of folds\n", "#\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Comment on the results and your model choice." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# [EXTRAS] Show the decision regions (and the labeled training data) \n", "# of the learned classifier for two selected features of your choice." 
] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.2" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": false, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": true, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }