<style> h3 { border-bottom: 1px solid #ccc; } summary { font-weight: bolder; } summary:hover { text-decoration: underline; } section { margin-bottom: 2rem; padding: 0em 1em; padding-bottom: 1em; border-radius: 4px; } section.think { background-color: oklch(95% 0.01 0); border: 1px solid oklch(80% 0.01 0); color: oklch(40% 0.024 0); } section.conceptual { background-color: oklch(95% 0.01 150); border: 1px solid oklch(80% 0.01 150); color: oklch(40% 0.024 150); } section.design { background-color: oklch(95% 0.01 290); border: 1px solid oklch(80% 0.01 290); color: oklch(40% 0.024 290); } .todo { color: #ff00ff; border: 2px dashed #ff00ff; padding: 0em 1em; border-radius: 5px; margin-top: 1em; margin-bottom: 1em; //display: none; // UNCOMMENT TO HIDE TODOs } </style> # Project 1: Decision Tree (Spring 2025) **Key Dates** * **Design checks:** Sunday, March 2 - Friday, March 7 * **Plan Peer Reviews:** Wednesday, March 5 - Saturday, March 8 (Early). Friday, March 7 - Monday, March 10 (Regular) * **Final turn-in:** Friday, March 14 at 11:59 pm * **Peer Review Final Reflections:** Monday, March 17, 11:59pm :::info :eyes: **Looking for a stencil link?** **You will receive a customized stencil via email alongside your partner assignment**. Until you receive your partner assignment, you cannot clone the stencil. ::: :::success #### ✨Gearup: Friday, February 28 in Macmillan 115✨ <!-- a video overview of the project :movie_camera: --> For an overview of the project and a concrete walkthrough of decision trees, we highly recommend coming to our gearup: - [Gear-up recording](https://drive.google.com/file/d/1l8rNdR-I1_0YUz47ZB48wdMd96MHgVnZ/view) - [Gear-up slides](https://docs.google.com/presentation/d/1jiY__T-QaVdmfo1wbY2VtXTlkN_NRdqi/edit#slide=id.p21) ::: # Overview In this project, you will implement a classic machine learning algorithm that uses training data to make predictions in new situations. This algorithm uses a tree-like data structure to arrive at its predictions. 
Concretely, you will: - Practice designing with classes and interfaces - Practice using trees in a real-world application - Practice using plans to outline what your code needs to do - Understand a bit of how one approach to machine learning works ### Project timeline (and handout overview) This handout contains a lot of information, but **don’t be afraid**! All of this information is here to help you get oriented with the process. To start, here's an overview of what's in the handout and how it fits into the project timeline --- knowing what's here can help you navigate it effectively: **Background info**: Most of the information in this handout is background information on the framework of a decision tree, accompanied by design check questions (in boxes) that will help you understand what goes into making and using one. **Design Stage Tasks**: The first week of the project is focused on design. There are two parts to this, each of which will give you crucial feedback for building your implementation: - **Design Check**: You'll begin preparing your design and overall strategy for the project (based on some questions throughout the background sections) and meet with a TA to discuss it. As you work through the background sections, **record your answers to the design check questions, and then submit your work to the Gradescope Design Check submission at least 1 hour before your design check. Only one person needs to submit, but make sure to add your partner as a collaborator on the submission.** - **Planning via Peer Review**: You'll also write a *plan* for your solution and submit it for *peer review*: you'll have a chance to review other students' plans and receive feedback on your own! For more info on this part, see **[this section](#Peer-review-and-feedback)**. **Testing**: After the design-stage section is advice on testing. 
Definitely test as you go; it may have been tempting on homeworks to just implement everything and test it at the end, but with projects of this size, you want to know if you have errors before you get to the ‘end.’ (It won’t be anywhere near the end for you if you don’t test first---we (TAs, Kathi, Elijah) all speak from experience). **Implementation**: Finally, the handout discusses the implementation stage tasks. These are the steps you will take to develop your final submission. Don’t start too late! This is a more challenging assignment, and **your ability to complete it will greatly depend on the amount of time you give your brain to mull it over**. Design and testing as you go are essential to managing the overall time for the project. **Gearup**: If you prefer a different way of digesting the background information provided in this handout, come to or watch the recording of our Gearup! You still have to do the design check tasks, but we’ll be discussing these concepts out loud if that’s how you prefer to learn. Look [here](#Gearup-a-video-overview-of-the-project-) for info on how to watch the gearup. Best of luck! If you have any questions, don’t hesitate to reach out to us, and please check our FAQ/reading list post on Edstem for guidance. We are here to help, and we believe in you!! :smile: :::warning **Please keep in mind**: This project is a nontrivial jump in difficulty from the homeworks you've seen so far, which is why you have a partner, two weeks, and an intermediate meeting with a TA. 
However, note that **this project requires you to do some *design* of an OO program, which will take some time to think about--so don't expect to be able to just sit down and whip up an implementation!** We recommend you start early, talk with your partner, come to office hours, and give yourself time to work on it.<br><br> That said, know that you have all the concepts you need to implement the project: you've written recursive programs on lists and trees, you've written programs that traverse trees from top to bottom, you've written methods that call a recursive helper method. You've filtered through lists. You've practiced using hashtables. All of these pieces come together in this project. ::: <!-- ## Project timeline This project will proceed in two main stages, with you working with your partner in both: - In the **Design Stage**, you’ll work through the project handout, figure out how the intended algorithm works, and plan out the classes that you will need. You’ll follow a worksheet and answer some design questions in Gradescope, which you and your partner will review together with a TA during a **Design Check**. This occurs in the first week of the project. - In the **Implementation Stage**, you and your partner will write code that carries out your design. This occurs in the second week of the project. --> ## Stencil code On Friday, we will send you an email confirming your partner assignments that will include a personalized link to create your stencil. **Until you receive this email, you cannot clone the stencil.** Once you receive your partner assignments, **clone the stencil using the link you received via email** and create/join a team named `[partner 1 brown login]-[partner 2 brown login]`. For instructions on how to download the stencil and set it up in IntelliJ, see our [Stencil Setup Guide](https://docs.cs200.io/s/java-stencil-setup-guide). 
# Background ## An Intro to Machine Learning You may have heard of machine learning, but what is it really? You have learned how to write functions that perform computations. So far, the functions you’ve written consist of explicit steps that tell the computer how to do the computation. Imagine, for example, that we asked you to write a program to determine whether a particular job applicant should get an interview. You could write a function with a bunch of if-statements and boolean operations over information about the candidate (years of experience, prior recommendations, etc). But what if you have a lot of potential variables to consider? Doesn’t this function become pretty messy and hard to maintain? With machine learning, you don’t write this function explicitly. Instead, you use a program (the machine learning algorithm) to _generate one possible function for you_ based on data about previous candidates (this process is called _training a model_ in machine-learning terminology). Then, when you get a new applicant, you call the generated function and ask it whether to interview the candidate. Set aside how you feel about the (considerable) perils here: the point is that **machine learning is about generating functions that recommend decisions based on prior decisions using the same variables**. ### How is this relevant to Decision Tree? In this project, you will implement a classic machine learning algorithm and think about how to test it. Concretely, you will: - write a program that generates a function to make predictions based on data in a dataset. Your program will read in datasets from [CSV files](https://www.geeksforgeeks.org/csv-file-format/) using code that we will provide in the stencil. - use your function to make predictions - reflect on what it means to test a machine-learned function At the heart of the algorithm you’ll be implementing is a data structure known as a _decision tree_. 
This is a tree-shaped data structure that models a series of multiple-choice questions, answers to which are found at the leaves. Overall, this project practices the data structures, testing, and programming concepts we’ve covered so far. You need both object-oriented programming and recursion to get this working. *The implementations you've designed to traverse trees (binary search trees or ancestry trees), to use recursion to traverse lists, to use hashtables to effectively store and retrieve data, and to create classes for lists and trees should all give you insights into this project.* # Building decision trees by hand :::warning As you follow along with this example, you will see some conceptual questions. These questions are on the `Project 1 (Decision Tree) Design Check Conceptual` assignment on Gradescope and are automatically graded (to allow you to see whether you got a question right or wrong and resubmit if need be). ‼️**To help all partners form a conceptual foundation for this project, these questions should be submitted individually before your design check date.** ‼️ ::: To help explain what our algorithm will be able to do, let’s consider the following problem: You would like to generate a function/train a model to tell you whether a food item is a fruit or vegetable given certain attributes including `color`, `highProtein`, and `calories`. ## Food dataset A subset of the data is shown below: | **name** | **color** | **highProtein** | **calories** | **foodType** | | ------------ | --------- | --------------- | ------------ | ------------ | | spinach | green | true | low | vegetable | | orange | orange | false | high | fruit | | cucumber | green | false | low | vegetable | | carrot | orange | false | medium | vegetable | | sweet potato | orange | true | high | vegetable | | avocado | green | true | high | fruit | | banana | yellow | true | high | fruit | Each row contains values for the attributes `color`, `highProtein`, `calories`, and `foodType`. 
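For reference, here is how this table could be laid out as a CSV file of the kind your program will read in. (We're assuming a plain comma-separated layout with a header row naming the attributes; see the linked CSV explainer and the stencil's sample data files for the exact format used in this project.)

```csv
name,color,highProtein,calories,foodType
spinach,green,true,low,vegetable
orange,orange,false,high,fruit
cucumber,green,false,low,vegetable
carrot,orange,false,medium,vegetable
sweet potato,orange,true,high,vegetable
avocado,green,true,high,fruit
banana,yellow,true,high,fruit
```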
When you generate your function, you will choose which attribute it should make a prediction on (here, `foodType`). Effectively, this means that you are creating classes and methods to implement a function: ```python foodType(color, highProtein, calories) ``` that will return either `fruit` or `vegetable` as the `foodType` depending on the given values for `color`, `highProtein`  and `calories`.  (You’ll call the function slightly differently, but this is the idea.) This same setup arises in many applications: there could be a table that tracks who has been admitted to college based on different attributes, which movies ended up generating high revenue, and so on. In each case, the goal is to predict an outcome (e.g. getting admitted or generating high revenue) given the attributes of a new case (e.g. student or movie). ## Decision tree Decision trees provide an ordered way of examining each attribute of an item, leading ultimately to a decision/classification for that item. Below is a decision tree generated from the food dataset, designed to predict the `foodType` of foods not in the original table. Note that in this example, attributes are split on in random order along each branch.![](https://lh3.googleusercontent.com/6hQBBTIaJAQBBVJg5YSJKD32tfc6gqg5gnOPeElaG__gcmJFIXD8vNs6cqagEWjh22EOYZX45vOdrjfMmFiNyJFeFovGxnCiPIqcw0jbO7LGTpV62agkaUtxc-Pc7VWrlrspPp36BXnxNPJ5ZTdkMUk)Some things to note about the decision tree: - Each non-leaf **node** (represented as a blue oval) is labeled with an attribute from the dataset - A node for an attribute has outgoing **edges** corresponding to the possible values of that attribute - The leaves (orange and red boxes) represent the **decisions** or predicted value for the target attribute (here, `foodType`) and are labeled with the classification of a datum that has traveled down that path of the tree. All decisions are for the same target attribute. Leaves mean that you’ve found a prediction! 
- The **same attribute arises only once** along any path from the root to a decision. <section class="conceptual"> ### Conceptual question 1 The decision tree for a dataset should always match the results listed in the original dataset. To make sure that you know how to use the tree, describe how the tree would be used to report that avocados and oranges are both fruits, based on the attributes in the table. </section> ## Making predictions Let’s walk through two examples to show how the generated decision tree could be used to predict the value of the target attribute (`foodType`) for new data. We will refer to the food dataset and the graphic of the generated tree. First, let’s predict the `foodType` for a tangerine, which has orange color, is not high protein, and is high in calories. We traverse the decision tree, checking the values of the given attributes starting from `color` (the root node). We would first follow the `orange` edge down to the `highProtein` node. Since the tangerine does not have high protein, we would then follow the `false` edge down to the `calories` node, and since the tangerine has `high` calories, the function would predict that a tangerine has the value `fruit` for the attribute `foodType`. Now, consider broccoli, which is green, high in protein, and medium in calories. Again, we follow the tree. Taking the `green` color edge gets us to a `calories` node that has no edge for `medium`. _This is a value that was not covered by our training data_, so we need to use some information other than the paths in the tree to predict the `foodType`. The path we had taken so far limited us to green-colored items. We therefore consult the training dataset to identify the most common `foodType` among the green items in the training data (spinach, cucumber, and avocado). The function thus returns `vegetable`. Had there been a tie in the training data, we’d choose from the most-common values arbitrarily. 
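If it helps to see this walkthrough as code, here is a minimal, hypothetical sketch of the traversal in Java. Everything in it (the names `Node`, `Leaf`, `Split`, and `decide`, and the use of a `Map` for a row) is made up for illustration and does **not** match the stencil; the design check tasks later in this handout ask you to work out the real structure yourselves.

```java
import java.util.HashMap;
import java.util.Map;

// A sketch of the prediction traversal described above. The class and method
// names here are invented for illustration; the stencil's actual classes and
// signatures are different, and designing them is part of your job.
public class TraversalSketch {

    interface Node {
        String decide(Map<String, String> row);
    }

    // A leaf stores the decision reached at the end of a path.
    static class Leaf implements Node {
        private final String decision;
        Leaf(String decision) { this.decision = decision; }
        public String decide(Map<String, String> row) { return this.decision; }
    }

    // An internal node splits on one attribute: each value seen in training
    // maps to a subtree, and the default decision covers unseen values.
    static class Split implements Node {
        private final String attribute;
        private final Map<String, Node> edges;
        private final String defaultDecision;
        Split(String attribute, Map<String, Node> edges, String defaultDecision) {
            this.attribute = attribute;
            this.edges = edges;
            this.defaultDecision = defaultDecision;
        }
        public String decide(Map<String, String> row) {
            Node child = this.edges.get(row.get(this.attribute));
            return (child == null) ? this.defaultDecision : child.decide(row);
        }
    }

    // The green branch of the example tree: a calories node whose default
    // decision is "vegetable" (the most common foodType among green items).
    static Node greenBranch() {
        Map<String, Node> edges = new HashMap<>();
        edges.put("low", new Leaf("vegetable")); // spinach, cucumber
        edges.put("high", new Leaf("fruit"));    // avocado
        return new Split("calories", edges, "vegetable");
    }

    public static void main(String[] args) {
        // Broccoli's "medium" calories value has no edge, so the traversal
        // falls back on the node's default decision.
        Map<String, String> broccoli = Map.of("calories", "medium", "highProtein", "true");
        System.out.println(greenBranch().decide(broccoli)); // prints "vegetable"
    }
}
```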
<section class="conceptual"> ### Conceptual question 2 What would a function built on the above decision tree predict for green apples that are high in calories? What about red apples? </section> ## Building new attribute values into the tree <!--The broccoli example makes the algorithm sound complicated when given an unknown value. If you read that description and think in terms of processing steps, it sounds like we need to stop traversing, figure out what path we’re on, return to the dataset, etc. UGH!!--> Go back to the broccoli example, and the fact that there was no edge for its `medium` value of the `calories` attribute. Just above, we said we'd return the most common decision among the green items. Any green item whose `calories` value is something other than "low" and "high" (the two edges in the diagram) would return the same default value. To save later computation time, then, we will store that default value in each node. Here's a revised diagram showing the defaults within each node: ![decision tree with default values on nodes](https://docs.cs200.io/uploads/7a8f8b3d-b4aa-465b-8392-3d77b3de7634.png) <section class="conceptual"> ### Conceptual question 3 Go through the process for predicting the `foodType` of broccoli (green, high in protein, and medium in calories) using this tree. <details><summary>Details</summary> Which node’s default value do we end up using? This example illustrates a key algorithmic principle: pre-computing data can turn some seemingly algorithmic problems into simpler data lookups. When a series of computational steps feels messy, ask whether you could pre-compute and store some data to simplify the computation later. This is the same optimization that we did with the size/length function in homework 2. </details> </section> <section class="conceptual"> ### Conceptual question 4 On paper, work out the decision tree, with default values, to predict the outcome for the following dataset on college admissions. 
Consider the attributes in the order of GPA, letters of rec, high school, and extracurriculars. </section> Check your work using the Gradescope conceptual assignment. **Be sure to save this tree for use later!** | **GPA** | **high school** | **extra curriculars** | **letters of rec** | **outcome** | | ------- | --------------- | --------------------- | ------------------ | ----------- | | C | blue school | strong | strong | reject | | C | red school | weak | weak | reject | | A | blue school | weak | strong | accept | | A | red school | strong | weak | reject | | B | blue school | weak | strong | accept | | B | red school | strong | strong | accept | | B | red school | weak | strong | reject | :::info It may be helpful to copy this table into an excel or Google Sheets file. You can make use of the "Sort" capabilities under "Data" to organize the rows by value for an attribute to visualize how to form the decision tree. ::: ## On the accuracy of machine learning decisions :::info Put your answers to the Design Check tasks in a document (e.g., google docs or similar) that you work on and turn in as a group. You will submit a PDF version of this document. ::: Assume that the tree that you generated earlier regarding admissions decisions will now be used on a new set of students. 
For each of the following students, use the tree to determine whether they will be admitted (fill in the outcome column): | **Name** | **GPA** | **high school** | **extra curriculars** | **letters of rec** | **neighborhood** | **outcome** | | -------- | ------- | --------------- | --------------------- | ------------------ | ---------------- | ----------- | | student1 | B | blue school | strong | strong | Canyon | | | student2 | C | red school | strong | weak | Elman | | | student3 | B | red school | weak | strong | Elman | | | student4 | B | blue school | weak | strong | Canyon | | | student5 | A | red school | strong | strong | Elman | | <section class="design"> ### Design Check Task 1 What are some things you notice about who does and does not get admitted? Are there outcomes that surprise you or concern you? Look at the table. Do you spot any patterns between the outcomes and attributes that were not part of the original tree? <details><summary>Details</summary> This example points to what is called **soft-coded bias**. Even if identities like race, gender, socioeconomic status, or sexuality are not included as attributes when training decision trees, the generated prediction function can still produce biased outcomes dependent on those identities. This often occurs when an attribute included in the model is a close substitute for one of these identities. For example, a seemingly unrelated attribute like neighborhood can be strongly correlated with race. Essentially, if any attribute that you include in your model is correlated with one of these identities, your model is vulnerable to this sort of soft-coded bias. Often those creating algorithms or automating decision-making processes do not intend to cause harm, and sometimes they’re actively trying to do good. Today, a lot of resources, thought, and effort are being put into learning how to ‘do good better’. For the next task, we’ll examine different perspectives on this issue. 
</details> </section> <!-- Kathi removed because article link no longer works <section class="design"> ### Design check task 2 Scroll through this [interactive article](https://www.thinkwithgoogle.com/feature/ml-fairness-for-marketers/) in its entirety. Answer the following. <details><summary>Details</summary> #### Part 1: Write a short paragraph (2-3 sentences) reflecting on the following: 1. Who is this resource created for? 1. What did you learn from this article? Give at least one specific example. Now, let’s critique the Google article. Resources, especially those created by corporations, often have pro-corporate agendas that may gloss over harm they’ve caused and fail to mention reforms or solutions that could damage their bottom line. "Social good", "ethics", "responsibility" are terms that get thrown around a lot. But how can we tell whether institutions and people are actually working towards these values? In 2019, Ben Green wrote a short 4-page paper analyzing and critiquing computer scientists’ efforts to do social good thus far. We’ve provided an annotated copy to guide you through the paper and explain tricky concepts. Read through [Green’s paper](https://www.dropbox.com/s/ji1pnuxfnvurd9y/good_isnt_good_enough_annotated.pdf?dl=0). #### Part 2: Now, write a second paragraph (at least 8-10 sentences) addressing the following questions. 1. What does he mean by "reformist reforms"? Give an example of where you might have seen this in your life.  1. What motives could Google have in creating the first article you just scrolled through? What systems or frameworks do they take as a given? 1. What might a non-reformist approach to targeted advertising and ML fairness in marketing look like? Give an example. 1. List one or two ways we might be more mindful when reading and consuming content created by corporations. 
</details> </section> --> # Overview: what you will build The following code block shows an interaction with the final code that you’ll develop for this project _(Note: in the starter code, this has been translated to a test class to help you debug)_: ```java // creates a Dataset out of a CSV file List<Row> dataObjects = DecisionTreeCSVParser.parse("fruits-and-vegetables.csv"); List<String> attributeList = new ArrayList<>(dataObjects.get(0).getAttributes()); Dataset training = new Dataset(attributeList, dataObjects); // builds a TreeGenerator object and generates a tree for "foodType" TreeGenerator generator = new TreeGenerator(); generator.generateTree(training, "foodType"); // makes a new (partial) Row representing the tangerine from the example Row tangerine = new Row(); tangerine.setAttributeValue("color", "orange"); tangerine.setAttributeValue("highProtein", "false"); tangerine.setAttributeValue("calories", "high"); // uses the decision tree generated above to make a decision String decision = generator.getDecision(tangerine); System.out.println(decision); // prints "fruit" ``` **We will give you the** `DecisionTreeCSVParser` **and** `Row` **classes** (the former will read in a CSV file and produce a list of `Row` objects). In the implementation stage, you will need to: - design a `Dataset` class that holds a list of `Row`. - write a `generateTree` method (in `TreeGenerator`) that will generate a decision tree from a list of `Row` objects. - write a `getDecision` method (in `TreeGenerator`) that will produce a prediction on a `Row` object. The rest of the Design Stage work will have you plan out the classes and methods that you will need for these implementation tasks. # Design stage tasks :::warning Before proceeding to the tasks in this section, make sure you have answered the conceptual assignment questions 1-4 on Gradescope and received full points. 
This helps make sure that you understand the concepts behind decision trees well enough to start thinking about designing and implementing your code. Question 5 is still coming up, so don’t worry about it yet. ::: ## Coming up with testing datasets In order to set yourselves up for success in debugging this project, you should come up with some **training datasets** and **expected input-output pairs** that you can run your code on. By **datasets**, we mean example CSV files of training data that the program reads in, like the produce and college admission tables above (if you are unfamiliar with CSV files, check [this](https://www.geeksforgeeks.org/csv-file-format/) out). By **expected input-output pairs**, we mean examples of inputs (rows of attribute values) that you provide to the decision tree (after training it on one of your test datasets) and the outputs (decisions) that you expect the tree to make for those inputs. To help you get started, here are some criteria that you should use when coming up with your datasets and expected outcomes: - The datasets should be small enough for you to be able to draw out on paper the expected decision tree and the steps your code takes to generate the tree. _Note: your code will choose attributes to split on at random, which means that there are multiple possible trees that could be generated._ - At least some of the datasets or expected outcomes should be basic enough to test very simple cases (e.g. a dataset that causes your code to generate a tree of only one non-leaf node, to test that splitting on a single attribute is working properly). - At least some of the datasets/expected outcomes should test new attribute values (that is, paths down the tree that have to make use of the default decision of some node). Conversely, at least some of the datasets/expected outcomes should test situations where default decisions are not used at all. 
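To make the first two criteria concrete, here is one hypothetical example (the column names here are invented; you should design your own datasets): a dataset in which a single attribute completely determines the target should produce a tree with exactly one non-leaf node.

```csv
wrapped,snackType
false,healthy
true,junk
false,healthy
true,junk
```

Generating a tree for `snackType` from this data should yield a single `wrapped` node with two leaf edges, giving the expected input-output pairs (wrapped=true → junk) and (wrapped=false → healthy), with no default decisions involved.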
<section class="think"> ### Stop and Think Are there any other criteria that you feel are important when coming up with datasets and expected outcomes for this project? (You don’t have to turn in this answer, but it might help you think of some datasets). </section> Also keep in mind that your final set of tests should allow for _incremental, test-driven development._ This means that you should be able to tackle one small part of the project (such as writing a specific method in a specific class), test it, debug if necessary, and only move on to another part of the project when the test passes. More on this is written in the implementation section of this handout, but for now, you should recognize that you will likely have to write additional tests when you start implementing specific methods, so that you can test as you go! <section class="design"> ### Design Check Task 2 Based on the criteria above, come up with _at least_ two example datasets to test your project. Include at least two expected outcomes for the tree generated from each dataset. During the design check, your TA will ask you how these example datasets fit the criteria listed above. </section> To check your understanding of this task, you can use our `BasicDatasetTest.java` file and replace the lines that are indicated by the `TODO` comments. **At this point, please do not look through any of the other files in the starter code – we’ll guide you through them in the next section.** :::info Please also uncomment `line 35` of `BasicDatasetTest.java` in order for our autograder to run your tests against the wheat implementation: ```java // TODO: Uncomment this once you've implemented generateTree // this.testGenerator.generateTree(training, this.targetAttribute); ``` ::: ```java // TODO: replace with your own input file String trainingPath = "data/fruits-and-vegetables.csv"; // TODO: replace with your own target attribute String targetAttribute = "foodType"; ... 
// TODO: make your own rows based on your dataset Row tangerine = new Row("test row (tangerine)"); ... // TODO: make your own assertions based on the expected classifications Assert.assertEquals("fruit", testGenerator.getDecision(tangerine)); ``` The tests will not run on the local copy of your code (yet), but you can then submit this file and your dataset CSV to the **Project 1 (Decision Tree): Dataset Test** assignment of Gradescope. If it says it passes the wheat, your dataset is formatted correctly and your input-output pairs are correct. Some notes about this Gradescope assignment: - Remember to turn in your dataset CSV! In your code, the file path to it should be `data/your-csv-title.csv` (i.e. the path should not include any folders other than `data/`). - For your tests to work, the dataset has to be such that it gives the same output for a given input no matter the order in which attributes are chosen to build the tree. If you’re getting errors for the `getDecision` assertion, you should try making tests for a simpler dataset. - This autograder is only useful for checking that your datasets are reasonable (which you can do during the design check phase or the implementation phase). We’ve configured the code so that it will not allow you to run other methods on Dataset, etc, so it will not work for testing your own unit tests that call on other methods. **Include a copy of your dataset and input-output pairs as part of your design check submission.** ## Reading and understanding the provided code :::warning Do the following task **before** opening the starter code. ::: Based on the information above, you will be responsible for completing a `Dataset` class and a `TreeGenerator` class. In a later section of this document, you will reason about the computations that each class should do and how the classes should interact with each other. 
**For this next design check task, we want you to focus on the** `TreeGenerator` **class.** <section class="design"> ### Design Check Task 3 Use your understanding of decision trees and tree-like structures in general to reason about the additional classes and interfaces needed to represent a decision tree. <details><summary>Details</summary> Make a list of classes, any possible interface(s) they may implement, and what you think each one’s general functionality should be. A sentence or two will suffice per class and interface. For instance, what classes or interfaces should we use to represent nodes, edges, and leaves for our decision tree? What additional fields and methods should they have, if any? This is meant to be an exercise in calibrating your understanding of tree-like data structures, and the next two tasks guide you towards evaluating this understanding. You do not need to turn anything in for this task, but you will be asked to reflect on it below. </details> </section> <section class="design"> ### Design Check Task 4 Now look through the starter code and draw an association diagram of how the classes fit together. <details><summary>Details</summary> You can ignore `BasicDatasetTest`, `DecisionTreeCSVParser`, `DecisionTreeTester`, and `DecisionTreeTest`. For classes that are in the `src` folder, you do not have to list their fields (but we recommend reading over/taking notes on the public methods they have, since you might want to make use of them!) We have started this for you. For information on how to make an association diagram, check out [this guide](https://docs.cs200.io/containment-association-interface-inheritance-guide). <!--TODO For another example, check out the [Lecture 4 notes]() from the FP track! --> There is a file called `README.txt` that lists all of the files that are present in the starter code and gives a short summary of each one. 
It will be helpful to refer to this file as you familiarize yourself with the code for this project. Most files also have more documentation and guidance for you in their own comments. </details> </section> <section class="design"> ### Design Check Task 5 Write a few sentences that reflect on the differences, if any, between your answer to Design check task 3 and the code you saw in Design check task 4. Did your proposed tree classes and interfaces match those given to you in the starter code? Is there anything you don’t understand about the way the provided classes interact? </section> <section class="design"> ### Design Check Task 6 Now for each class, think about what information it might need to store in its fields/variables. Answer the following. <details><summary>Details</summary> Take a look at your containment/association diagram (from task 4) and your answers above and answer: - What field(s) should a `DecisionLeaf` have? - What field(s) should an `AttributeNode` have, besides the one we give you? - What field(s) should a `ValueEdge` have? - What field(s) should a `Dataset` have? </details> </section> ## Thinking through the methods Now, it’s time to sketch out how to actually construct the tree from the dataset. Generating the tree is a recursive process that works on smaller and smaller sets of rows from the original dataset. Let's see this on our running example: <section class="conceptual"> ### Conceptual question 5 We’ve numbered the nodes in the sample diagram. Indicate which rows from the dataset (by the corresponding food name) are being used to generate the portion of the tree rooted at each node. For example, at node 1 (the topmost node), all 7 food rows are being used. Identify the rows for the other three nodes (2-4) as well as the leaves (A-D). Check your work using the Gradescope conceptual assignment. 
![](https://docs.cs200.io/uploads/7a8f8b3d-b4aa-465b-8392-3d77b3de7634.png)

</section>

Now, it's time to plan the key methods, along with some notes on how you will turn those plans into code (we covered plans in class on Feb 16th). For guidelines on how to plan and what makes for a good plan, [refer to this document](https://docs.google.com/document/d/1673PEcn_6ikHpm96T0JR07nDwnOjVtvwEmaytKaZWXw/).

<section class="design">

### Design Check Task 7

Write a high-level **plan** for an implementation of `generateTree`. You might find the questions below helpful in guiding your planning.

The plan you write should outline the sequence of high-level computations that get performed to generate a tree from a dataset (list of rows). You are welcome to name helper functions and provide plans for them as part of your overall plan.

<details>
<summary>Details</summary>

Think about conceptual question 5:

- What input(s) is/are needed to generate a tree or subtree?
- Under what conditions do you make a leaf? How do you know which decision goes into a leaf?
- What determines the number of edges out of an internal (non-leaf) tree node?
- What information is used to generate the subtree at each edge?
- How might we compute default values at each node? (see illustration below for clarification)

![thinking through row subsets](https://docs.cs200.io/uploads/c19a2a9e-beed-428e-a66c-54da86199fcf.png)

</details>
</section>

<section class="design">

### Design Check Task 8

Now write a high-level **plan** for an implementation of `getDecision`. You should ensure that `getDecision` only uses the tree structure and pre-computed values from generating the tree to arrive at a decision.

</section>

<section class="design">

### Design Check Task 9

Based on your plans in tasks 7 and 8, propose a collection of method headers (names and types only) that you expect to need. Feel free to use these method headers in the plans themselves as well.
<details><summary>Details</summary>

For each method, does it belong in the `Dataset` class or the `TreeGenerator` class?

Note that we haven’t asked you about methods that belong in `AttributeNode`, `ValueEdge`, and `DecisionLeaf`. Our solution only has a constructor and `getDecision` in `AttributeNode` and `DecisionLeaf`, and a constructor and two getter methods in `ValueEdge`.

The idea is that the `TreeGenerator` should be doing the work of calling the constructors for these classes, and it should be doing this work recursively (i.e. the method to generate node 1 in the above tree should make recursive calls to generate nodes 2 and 3, because nodes 2 and 3 are the roots of their own decision trees). Think of the `TreeGenerator` as orchestrating how data flows from the dataset to the node/edge classes to build up a tree.

</details>
</section>

<section class="design">

### Design Check Task 10 (Submit for peer review!)

Create a document with *only* your answers to tasks 7-9 (plans and method headers) and submit it for peer review. Do **not** include your names in the document itself!

See **[this section](#Peer-review-and-feedback)** for instructions on how to submit your work for peer review, and an overview of the process.

If your team submits this document by 11:59 pm on Wednesday of the first week of the project, you'll get to see other groups' designs shortly thereafter. Everyone else will submit by the regular deadline on Friday and have access to others' plans after that. For details, see [here](#Peer-review-and-feedback).

</section>

# Implementation stage tasks

Unlike on homeworks or the design stage, we are not providing a list of tasks. The following list of what needs to be done, along with the plans you wrote in the Design stage, will guide your work.
:::warning
**Test your code each time you complete a small part of the project.** Some of these tests will be written as JUnit assertions, while others could be as simple as running constructors and basic methods in a `main` method. Testing as you go will save you tons of debugging time later. We really can't stress this point enough. There is a separate section of this handout on Testing and Debugging (further down, or use the table of contents at the top) to guide you here.
:::

**High-level TO-DO List:**

- Implement the `Dataset` constructor and the methods defined in the `IDataset` interface.
- Make sure you can read in a dataset from CSV and get a `Dataset` object back (following the example code in the `BasicDatasetTest` class we give you). Write some simple tests (say, on the size of the dataset) to make sure you’ve gotten what you expected.
- Create and implement classes for the components of the decision tree (nodes, leaves, edges).
- Use your plan to begin implementing `generateTree` in `TreeGenerator`, leaving space for helper methods that may or may not already be written (there will be errors!). A good way to do this is to start with simple cases (simple datasets) and work up to more complicated ones. Think about which of your design check test datasets will yield the simplest trees. Make sure that is working, then move on to slightly more complicated datasets.
- Write the helper methods for `generateTree` in the appropriate places.
- Write `getDecision`.

## Notes on Generating the decision tree

Hopefully the design check activities and plans helped you see how the tree generation should work. If you still need help seeing what that code would look like, consider the following steps:

- Determine whether you should create a decision leaf or an attribute node. You should make a decision leaf when all rows have the same value for the target attribute, or when there are no remaining attributes to split on.
- Otherwise, choose an attribute to split on from the input training data (not the target or a previously split attribute).
- Compute the default value for the current node and store it in the node.
- Create edges for each possible value of the attribute you chose.
- Recursively generate the child node for each edge using the appropriate subset of the training dataset.
- Attach each of these edges to the current node.

:::warning
Here are some possible pitfalls:

1. Your code does not stop to make a leaf when it should. You can usually tell that this has happened if the tree exhausts all possible attributes.
1. Your code didn’t check the leaf-creation conditions correctly.
1. Your code has a faulty equality check somewhere (using `==` when you should have used `.equals(...)`).
1. Your code stores a dataset partition in each `ValueEdge` and/or `AttributeNode`.
1. Your code splits on the same attribute multiple times in a given path down the tree.
:::

### Choosing a random attribute to split on

Your `TreeGenerator` must support three approaches for choosing the attribute to split on. To aid in testing your `TreeGenerator` class, we have provided implementations of `getAttributeToSplitOn` that select the attributes in either alphabetical or reverse alphabetical order. These should allow you to generate trees with a deterministic (fixed) structure, which may help you debug potential issues in your implementation.

It is your job to implement random attribute selection. The idea is that you will generate a random index into the list of attributes, then use the attribute at that index to split on.

```java
import java.util.Random;

Random random = new Random();
int upperBound = 10;
// generates a random number between 0 (inclusive) and 10 (exclusive)
// note that the upper bound must be greater than 0!
int randomNum = random.nextInt(upperBound);
```

:::warning
Possible pitfall: as you select attributes to split on, you will end up needing to manage which attributes you have used and which you have not. There's a subtlety here around mutable vs. immutable lists that many students have run into in the past. If your generated trees seem to be missing lots of nodes, review your approach, thinking about mutability vs. immutability...
:::

# Testing and test-driven development

As you work on this project, you should be completing it one small piece at a time, starting with writing tests for that piece. Only when you have tested and debugged the small piece should you move on to the next piece of the project. This is called test-driven development and provides a safety net of sorts as you work on a larger, more complex project.

This is how software engineering works in the real world. Multiple people make changes to the same project, and everyone is responsible for testing their addition before the changes are integrated with the rest of the project.

:::warning
<details><summary>Referencing your input files in tests.</summary>

In order to be able to run tests with input files, make sure that `decision-tree-[team name]` is the top-level project folder in IntelliJ: ![](https://lh4.googleusercontent.com/DaoU0Kw6rjAnblU2KRgi6QA_75I9VHUjIGHzndpmhTsBjerlgpDtQl_GoFd323Ezu7AHQpkmBW4TvNvgA6_3bB7hm--IkLUs9jLkFSgP5S3Xm72E1Uq6j_BF7bJEXKwQ2aA6Nl1MTRiGAn3At90dMQ0)

Then, you can reference the datasets relative to that folder, just like our starter code does, e.g. `data/fruits-and-vegetables.csv` or `data/mushrooms/training.csv`. If you’re getting a `FileNotFoundException`, this is the first thing to check!

:::danger
You must submit these files to Gradescope!
:::

</details>
:::

## Unit tests

You will use JUnit unit tests to guide your test-driven development. Follow the starting point you had in the design check tasks to write more tests using your example data.
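For instance, before writing JUnit tests against your real `Dataset` class, it can help to spell out the facts a first test should check on a tiny hand-made example. The sketch below is plain Java using stand-in `Map` rows rather than the stencil's actual row class (the attribute names `foodType`, `color`, etc. are made up for illustration); in your project, the same checks would become JUnit assertions:

```java
import java.util.List;
import java.util.Map;

public class TinyDatasetSanityCheck {
    public static void main(String[] args) {
        // Stand-in for two parsed CSV rows (your stencil has its own row class).
        List<Map<String, String>> rows = List.of(
                Map.of("name", "spinach", "color", "green", "foodType", "vegetable"),
                Map.of("name", "orange", "color", "orange", "foodType", "fruit"));

        // The kinds of facts a first unit test should pin down:
        System.out.println(rows.size());                 // 2
        System.out.println(rows.get(0).get("foodType")); // vegetable
    }
}
```

Once checks like these pass on small examples, move up to your design-check datasets before touching the larger provided ones.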
Since we can only run a small set of unit tests on your code via the Autograder (the project is fairly open-ended), it is _important_ for you to write unit tests as you go and use them to evaluate your code.

## System-level tests

For system testing, you will use the `DecisionTreeTester.java` file that we have provided in the stencil’s `src` directory. There are instructions in that file on exactly which lines to change to run different tests (we don’t recommend changing the other lines of the file; instead, make your own testing file if you want to use these tests to diagnose something in your project).

Within that tester, we are checking your code on several new datasets:

## Descriptions of the larger datasets on which we’ll be testing your work

- **Villains:** This dataset's attributes are a set of restaurants on College Hill, recording whether or not the person (datum/row) likes each restaurant, plus a boolean attribute `isVillain` for whether or not the person is a villain. The tests use your code to train a decision tree on the training data, then use it on the testing data to predict whether or not a person is a villain based on their restaurant preferences.
- **Mushrooms:** This dataset contains a bunch of attributes of mushrooms, one of which is `isPoisonous`. The tests use your code to train a decision tree on the training data, then use it on the testing data to predict whether or not a mushroom is poisonous based on its other attributes.
- **Candidates:** This dataset contains attributes of candidates applying for a job, one of which is whether or not they were hired. There are multiple different training datasets to choose from for this, so you might want to experiment and see which training dataset produces the highest accuracy on the testing data.
- **Songs:** This dataset contains a bunch of songs from the 2000s with their titles and artists removed.
Instead, they are represented by attributes including `topGenre`, `year`, and boolean attributes such as `isLoud`. The tests use your code to train a decision tree on the training data, then use it on the testing data to predict whether or not a song is popular based on its other attributes.

The tests that work with these datasets are located in the `DecisionTreeTester` class. For each dataset, the tester trains your decision tree on training data. Then, for each datum in the testing dataset, it compares the generated decision tree’s prediction against the datum’s actual value to measure accuracy.

Performance of your decision tree can be variable because attributes are randomly selected. Therefore, we do this many times (currently `NUM_ITERATIONS = 100`) and average the results to get the average accuracy of the trees generated by your algorithm. You should not need to modify the code (unless you want to increase the number of iterations to make the measurement more precise).

Your accuracy measured on testing data for the Mushrooms and Villains datasets should be above 70%. If the target attribute is a boolean, a decision tree that randomly selects one of the two possible values as its prediction would have an accuracy of around 50%. You should use this as a diagnostic: if the accuracy is lower than 50%, there are definitely bugs in your code. Your performance on the Songs dataset may be worse, but it still should be above 70%.

There is also a test that evaluates the decision trees on the same training data used to generate them. Your accuracy should be 95-100% for this test, and if it is not, there may be something wrong with your code.
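For a single iteration, the accuracy measurement described above comes down to the fraction of test rows whose predicted decision matches the actual one. Here is a small sketch with made-up prediction lists (not the tester's actual code):

```java
import java.util.List;

public class AccuracyDemo {
    public static void main(String[] args) {
        // Hypothetical tree predictions vs. actual target values for four test rows.
        List<String> predicted = List.of("true", "false", "true", "true");
        List<String> actual    = List.of("true", "false", "false", "true");

        int correct = 0;
        for (int i = 0; i < actual.size(); i++) {
            if (predicted.get(i).equals(actual.get(i))) {
                correct++;
            }
        }
        // 3 of 4 predictions match, so accuracy is 0.75.
        System.out.println((double) correct / actual.size()); // 0.75
    }
}
```

The provided tester repeats this over `NUM_ITERATIONS` generated trees and averages the results, since random attribute selection makes any single tree's accuracy noisy.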
## How we will use these datasets to grade your code

The Autograder will run the following tests with the following expectations:

- Small unit tests on the `Dataset` class that test the `IDataset` methods
- Small unit tests on the `TreeGenerator` class that test the `ITreeGenerator` methods
- A small dataset of our own (expect 70% accuracy on the testing data and 95% on the training data)
- Villains (expect 70% accuracy on the testing data and 95% on the training data)
- Mushrooms (expect 70% accuracy on the testing data). Since `Mushrooms` is a large dataset, if your tree implementation is very inefficient in its use of computation time/space, this test might result in an output that says you ran out of heap space. This is considered a failed test (see below for some tips to increase your code efficiency). For this reason, `Mushrooms` will be weighted slightly lower than `Villains` in our suite of tests.

## Debugging

### Stepping through the code

The `BasicDatasetTest.java` file can provide a good starting point for visualizing the tree. For example, to step through how your code is generating the tree using the debugger, place a breakpoint at the line where your tree gets generated:

![](https://lh3.googleusercontent.com/hJTyBQHA3ShoNR67cXXNZ1YzTINtgLBEmJmQsRKGPEhLXhAns-cGc2NuGl7l1pXCud9QHAhHIftHYSHOSuJEs2MRif35K7yTP0NnxPYVj_2mk7XVRoC7riZuIXnBpVShR5VEF4QiQcQZ0OvUo0vCDaM)

(remember that this code is in the setup method annotated with `@Before`, so it is run before every test)

Then, you can run a test using the debugger and step through the code that generates your tree:

![](https://lh6.googleusercontent.com/CnKaoJm_H107dbFCB54EkGMsMV1fBrSGPDwC_mQv258VlAxSeB6vnbcLgLTts09YfuICjXZPXSywDiW-WFC78DZa9sobriODdPWuABaCrn0W0bec1S4SYjQTg12UZfu-YyrBz6RcVNhllQ4_3InzLVs)

### Using the debugger to visualize a tree

After the tree has been generated, you can also visualize it using the debugger.
For example, if you want to see what the tree looks like right before calling `getDecision`, you can do so by placing a breakpoint at the appropriate line (line 40 in the screenshot below) and expanding the `TreeGenerator` variable (in this case, `testGenerator`) in the debug window _(some of this screenshot has been occluded to avoid giving away the structure of our solution. Your_ `AttributeNodes`_,_ `ValueEdges`_, etc. can and should have more fields than shown here)_:

You can also use this breakpoint as a starting point for stepping into the code and seeing how the tree makes the decision to classify the input row!

# Submission links and descriptions

- [**Project 1 Design Check Conceptual**](https://www.gradescope.com/courses/923570/assignments/5568227): for self-checking conceptual knowledge (done individually; part of design check grade)
- [**Project 1 Design Check**](https://www.gradescope.com/courses/923570/assignments/5568226): for written design check questions (part of design check grade)
- [**Project 1 Dataset Test**](https://www.gradescope.com/courses/923570/assignments/5568225): for self-checking dataset format (not graded)
- [**Peer Review Early Submission**](https://app.peerceptiv.com/course/60daf3a6-a765-4c09-aa06-f344588f04c5/assignment/79695323-8eb3-4ceb-ab7b-cab20ad1d225/dashboard): for earlier peer sharing of plans
- [**Peer Review Regular Submission**](https://app.peerceptiv.com/course/60daf3a6-a765-4c09-aa06-f344588f04c5/assignment/fc62dadd-c876-4bc6-9ba5-c144d08ba7a1/dashboard): for peer sharing of plans after the end of the design week
- [**Project 1 Implementation**](https://www.gradescope.com/courses/923570/assignments/5568228): for final hand-in of code and tests (part of implementation and style/design/testing/code structure grade)

# Peer review and feedback

As a "mini stage" that spans the end of design/start of implementation, you'll share plans (through Canvas), using **peer review** with other students to potentially get other
ideas of how to approach the problem. In general, peer review works like this:

- **Submit plans**: You will (as a group) upload your plans for key functions in your implementations before a due date. You will submit your plan via Peerceptiv, which is accessible via Canvas--see **[this document](https://docs.google.com/document/d/1_Iv8SMMfMT_HCmbcun0pFGphJQQoxz2k347P5vg7RNk/)** for submission instructions. There are two deadlines: an early deadline, and a regular deadline; the earlier you submit, the earlier you get feedback--for details, see the **Timeline** below.
- **Submit reviews**: After the plan deadline, the tool will assign **each individual student** (not group) two other plans to **review** (so your project pair collectively gets to see four other plan submissions). As an individual student, provide feedback through Peerceptiv. Reviews are due within a few days of you receiving plans, as noted on the timeline.
- **Get feedback, and reflect**: After the review deadline, you'll receive feedback on your plans from other students! For this part, you will need to log into Peerceptiv, acknowledge that you've received your feedback, and answer some reflection questions to tell us what you thought about the process. Your reflections are due shortly after the project deadline.

**Timeline**: Here are the deadlines for each part, for both the early and regular review schedules (all deadlines have a time of 11:59pm):

| | Plans due | Reviews due | Feedback by | Reflections due |
|-------------------|------------|-------------|-------------|-----------------|
|**Early schedule** | Wed, March 5| Sat, March 8| Sat, March 8 | Mon, March 17 |
|**Regular schedule** | <span style="color: #ff0000">Fri, March 7</span>| <span style="color: #ff0000">Mon, March 10</span>| Mon, March 10| Mon, March 17 |

<span style="color: #ff0000">*: Hard deadline (no late days!)</span>
Due to limitations of the Peerceptiv system, **late reviews cannot be accepted**--we have no way to grant extensions on this. To ensure that you and your peers get feedback, please be sure to submit on time!!!

### Reviews, anonymity, and professionalism

Peer reviews are meant to be **anonymous** (to students; course staff can still see your submission and your feedback). This means that when submitting documents or feedback to the system, you should NOT include your name or any other identifying information.

Please also remember our course policies on [diversity and professionalism](https://docs.google.com/document/d/14D6T2fpMeLx9-cAsoiQm4-eJF_SoowVNqanEZZcT2xU/edit?tab=t.0#heading=h.gpun8jyluwej) when reviewing and giving feedback.

# Your final submission

Include all of these files from `sol` for every submission. Only the last submission is graded. You should receive an email from Gradescope immediately after your submission.
- `AttributeNode.java`
- `Dataset.java`
- `DecisionLeaf.java`
- `DecisionTreeTest.java`
- `TreeGenerator.java`
- `ValueEdge.java`
- **Any test datasets that you have created**

:::warning
:rotating_light: **Super important**: When you do your final submission, **please make sure that your submission lists both you and your partner** if you are submitting as a group. This makes sure that we know which submission to grade! To check this, one member of your team should do the following (only necessary for your **final** submission):

1. Open up your submission on Gradescope
2. Select **"Group Members"** at the bottom, which should look like this: ![](https://docs.cs200.io/uploads/a137a6d6-c735-4198-93ca-00fea4f96dc6.png)
3. Type in your partner's name and select them from the list

Only one member of your group needs to do this, and **you can do it after the deadline without being marked late**. Please please please be sure to do this, though, as it saves us a lot of time when grading!
:::

# Grading

1. **Design check** (33.33%): The design check includes the conceptual assignment (autograded so that you can check yourself), the written assignment (which includes SRC, design tasks, and peer review, and will be manually graded), and the design check meeting (graded on good-effort attendance).
1. **Implementation + Code style checks (Autograded)** (33.33%): Please see the testing section for an overview of the system tests that we run. Due to the somewhat open-ended nature of the assignment, we are not doing wheat/chaff grading for testing. **Code style checks are counted under this part--for more info, see the [Java style guide](https://docs.cs200.io/java-style-guide).**
1. **Design, testing, and code structure** (33.33%): The TAs will manually grade your work according to some rubrics.
To earn the points for this part of the project score:

- Follow every part of our [Java style guide](https://docs.cs200.io/java-style-guide)
- Follow the design considerations for every task
- Follow good OO design practices (e.g., if a computation is being done on the data that belongs to a certain class, that computation should be done in methods inside of that class)
- Write thorough unit tests for each public method
- Create and use your own datasets for system testing
- Participate meaningfully in Peer Review, which means writing reviews with more content than a throwaway comment such as "looks good to me"

This is the first assignment for which you have submitted plans. Grading of plans will be based on whether you followed the "guide to writing good plans" in the [planning handout](https://docs.google.com/document/d/1673PEcn_6ikHpm96T0JR07nDwnOjVtvwEmaytKaZWXw/). You don't have to produce one exact set of plan steps to get full credit on the planning part. Just make a solid effort at it.

# Working with Partners

This is meant to be a "two people work together" project, not a "we split the project in half and each do a part" project. At the design check, both partners will be asked about pieces of the project. If only one partner answers all the questions, the silent partner may receive a lower grade on the design check. We have also been known to put questions about the project functions on the exams, as a way to encourage you to actually understand all aspects of the project.

If you are having problems with your project partner being non-responsive, not contributing to the work, or severely holding you back in some way, please reach out to Kathi. We have procedures in place to deal with these situations.

# FAQ

<!--TODO Please see the [Decision Tree FAQ]() post on Ed, where we will post answers to common questions. -->

<!-- # Version history

- No changes so far!
*** --> _Please let us know if you find any mistakes, inconsistencies, or confusing language in this or any other CS200 document by filling out our_ [_anonymous feedback form_](https://docs.google.com/forms/d/e/1FAIpQLSfe9cDwpzv7xCbbevbrAZacuB4MB8yxer8jiK0SK-CQH3RXuQ/viewform?usp=dialog)_!_