{ "cells": [ { "cell_type": "markdown", "id": "f9c79cb4-000c-4e1a-83bd-fe716365833d", "metadata": {}, "source": [ "# Homework #1: Basics (NumPy, Pandas, Matplotlib, Scikit-Learn)\n", "\n", "## Learning Objectives\n", "- Understand the fundamentals of NumPy arrays.\n", "- Perform array creation and manipulation using real-world data.\n", "- Conduct basic operations on NumPy arrays.\n", "- Learn to create and manipulate DataFrames.\n", "- Perform data analysis using pandas.\n", "- Conduct basic data cleaning and preprocessing tasks.\n", "- Understand the basics of plotting with Matplotlib.\n", "- Create scatter plots and bar plots.\n", "- Customize plots with titles, labels, and legends.\n", "- Understand the basic workflow of scikit-learn.\n", "- Perform data preparation and scaling.\n", "- Implement a K-Nearest Neighbors model and evaluate its performance." ] }, { "cell_type": "markdown", "id": "66bb408d-f8ed-4f2e-a38e-ae8c3346d086", "metadata": {}, "source": [ "## Structure\n", "- [**Part 1: NumPy Basics**](#1)\n", "- [1.1 Array Creation and Manipulation](#1.1)\n", "- [1.2 Array Operations](#1.2)\n", "- [1.3 Pseudo Matrix Inversion](#1.3)\n", "
\n", "\n", "- [**Part 2: pandas Basics**](#2)\n", "- [2.1 DataFrame Creation and Manipulation](#2.1)\n", "- [2.2 Data Analysis](#2.2)\n", "
\n", "\n", "- [**Part 3: Matplotlib Basics**](#3)\n", "- [3.1 Scatter Plot](#3.1)\n", "
\n", "\n", "- [**Part 4: scikit-learn Basics**](#4)\n", "- [4.1 Data Preparation](#4.1)\n", "- [4.2 Model Training and Evaluation](#4.2)\n", "
\n", "\n", "- [**Part 5: Knowledge Check**](#5)" ] }, { "cell_type": "markdown", "id": "64efab4f-ad71-4ff4-ad26-517aff02c176", "metadata": {}, "source": [ "### Instructions\n", "- Read through each section carefully.\n", "- `Make sure you follow the task instructions carefully. Deviating from them will result in point deductions.`\n", "- Complete the code in areas marked `### YOUR CODE STARTS HERE` and `### YOUR CODE ENDS HERE`.\n", " - You will need to write around 100 lines of code to complete the assignment.\n", "- Run all code cells to see plots and results.\n", "- Ask for help if needed, but avoid code sharing.\n", "- Utilize online resources responsibly.\n", "- Ensure you have the necessary libraries installed (numpy, pandas, matplotlib, scikit-learn).\n", "- For more information, click on the task hints and expected outputs.\n", "\n", "### **Getting Help**\n", "\n", "- If you encounter issues or need assistance, here is the preferred approach to finding help:\n", "\n", " 1. **Search Online**: Begin by using online resources and forums to find answers to your questions. Make sure to solve problems on your own as much as possible without directly copying code.\n", " \n", " 2. **Consult Peers**: If online resources do not resolve your issue, you may reach out to fellow students for help. Please remember that sharing actual code is prohibited.\n", " \n", " 3. **Class Discord**: If peer consultation doesn't suffice, post your query in the class Discord to seek further assistance.\n", " \n", " 4. **Contact the TA**: If your issue still persists after exploring the above steps, reach out to the Teaching Assistant (TA).\n", " \n", " 5. 
**Instructor Assistance**: If you still require help and the TA has not been able to resolve your issue, you should contact the instructor.\n", " \n", " \n", "\n", "## Tutorials and Documentation Links For Help\n", "\n", "**Here are the links to the tutorials:**\n", "\n", "- [NumPy Tutorials](https://github.com/CuriousNeuralNerd/Tutorials/blob/fbdeb171ed16958e7571df0f1272c68ce3f4d3ab/numpy_basics.ipynb)\n", "- [pandas Tutorials](https://github.com/CuriousNeuralNerd/Tutorials/blob/fbdeb171ed16958e7571df0f1272c68ce3f4d3ab/pandas_basics.ipynb)\n", "- [Matplotlib Tutorials](https://github.com/CuriousNeuralNerd/Tutorials/blob/fbdeb171ed16958e7571df0f1272c68ce3f4d3ab/matplotlib_basics.ipynb)\n", "- [scikit-learn Tutorials](https://github.com/CuriousNeuralNerd/Tutorials/blob/fbdeb171ed16958e7571df0f1272c68ce3f4d3ab/scikit_learn_basics.ipynb)\n", "\n", "\n", "\n", "\n", "**Here are the links to the official documentation for the libraries:**\n", "\n", "- [NumPy Documentation](https://numpy.org/doc/stable/user/absolute_beginners.html)\n", "- [pandas Documentation](https://pandas.pydata.org/docs/user_guide/index.html#user-guide)\n", "- [Matplotlib Documentation](https://matplotlib.org/stable/users/explain/quick_start.html)\n", "- [SciPy Documentation](https://scipy.github.io/devdocs/tutorial/index.html)\n", "- [scikit-learn Documentation](https://scikit-learn.org/stable/)\n", "\n", "\n", "### **Grading Rubric**\n", " \n", "| **Part** | **Description** | **Weight** |\n", "|----------|-------------------------------|------------|\n", "| 1 | NumPy | **20%** |\n", "| 2 | Pandas | **20%** |\n", "| 3 | Matplotlib | **20%** |\n", "| 4 | Scikit-Learn | **20%** |\n", "| 5 | Knowledge Check | **20%** |\n", "\n", "\n", "\n", "## Submission Instructions\n", "\n", "To complete your submission for this homework, please follow these steps:\n", "\n", "1. **Save the Completed Notebook**: Ensure that all your code and written answers are finalized, then save the `.ipynb` file.\n", "2. 
**Export to PDF**: Additionally, export or save your completed notebook as a `PDF` file.\n", "3. **Submit Both Files**: Turn in both the `.ipynb` file and the `PDF` file. Make sure that both files are clearly labeled and include your Net ID (`HW_x_NetID`)." ] }, { "cell_type": "markdown", "id": "ecd779d5-d6f4-4897-8d0f-a34841231799", "metadata": {}, "source": [ "## Setup\n", "\n", "First, we import all the necessary libraries and load the dataset:" ] }, { "cell_type": "code", "execution_count": 2, "id": "a84ddb1a-407b-4cab-8b03-f30fcf3218eb", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "from matplotlib.ticker import ScalarFormatter\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.neighbors import KNeighborsRegressor\n", "from sklearn.metrics import mean_squared_error, r2_score\n", "\n", "# Load the dataset\n", "url = \"https://media.githubusercontent.com/media/CuriousNeuralNerd/data/main/housing_market.csv\"\n", "data = pd.read_csv(url)\n", "\n", "# Global variable for random state\n", "RANDOM_STATE = 42" ] }, { "cell_type": "markdown", "id": "2c3f0c04-fe90-42b4-bcf8-eb98d533b32e", "metadata": {}, "source": [ "## Data Handling\n", "\n", "#### Filtering Data for Tennessee\n", "**Task:**\n", "\n", "1. Filter the dataset to include only houses in Tennessee.\n", "2. Remove any rows with missing values in the `price`, `bed`, `bath`, and `house_size` columns.\n", "3. Sort the dataset from highest price to lowest price.\n", "4. Create a new DataFrame with reindexed values starting from the highest price.\n", "5. Variables you must use:\n", " - tn_houses\n", " - tn_houses_clean\n", " - tn_houses_sorted" ] }, { "cell_type": "markdown", "id": "e4d80669-e687-4ff7-beb1-9a2849118eba", "metadata": {}, "source": [ "