Project 1: Decision Tree

Key Dates

  • Design checks: Wednesday Feb 21 - Friday Feb 23
  • Plan Peer Reviews: Wednesday Feb 21 - Monday Feb 26
  • Final turn-in: Friday, March 1st at 11:59 pm
  • Peer Review Final Reflections: Monday March 4, 11:59pm

Looking for a stencil link? See here! You should wait to accept the assignment until after you have received your partner assignment.


Gearup Information: Tuesday, February 20th at 8pm in Rm 130, 85 Waterman St.
[Slides]: Will be posted after the gearup

[Recording]: Will be posted after the gearup

Overview

In this project, you will implement a classic machine learning algorithm that uses training data to make predictions in new situations. This algorithm uses a tree-like data structure to arrive at its predictions.

How to read a project handout

This handout contains a lot of information. Like, a lot. But don’t be afraid, young Padawans! Let’s go through what this handout contains, and how to use it effectively.

Background info: Most of the information in this handout is background information on the framework of a decision tree, accompanied by design check questions (in boxes) that will help you understand what goes into making and using one. Record your answers to the design check questions, and then submit your work to the Gradescope Design Check submission at least 1 hour before your design check. You’ll want to allocate a decent amount of time to go through this with your partner because your understanding of it is integral to your program’s success.

Gearup: If you prefer a different way of digesting the background information provided in this handout, come to or watch the recording of our Gearup! You still have to do the design check tasks, but we’ll be discussing these concepts out loud if that’s how you prefer to learn.

Design Stage Tasks: The first few days of the project are devoted to designing your approach. You’ll write a plan for your solution, give and receive feedback on other students’ plans (via Canvas), create some testing data, and generally set up a strategy for completing the project. You and your partner will meet with a TA to review your design during the first week.

Testing: After the design-stage section comes advice on testing. Definitely test as you go: it may have been tempting on homeworks to implement everything and test it at the end, but with projects of this size, you want to know if you mess something up long before you get to the ‘end.’ (If you don’t test as you go, the end will be much further away than you expect; we TAs, Kathi, and Elijah all speak from experience.)

Implementation: Finally, the handout discusses the implementation stage tasks. These are the steps you will take to develop your final submission. Don’t start too late! This is a hard assignment, and your ability to complete it will greatly depend on the amount of time you give your brain to mull it over. Design and testing as you go are essential to managing the overall time for the project.

Best of luck! If you have any questions, don’t hesitate to reach out to your design check TA. We are here to guide you!

This project is a nontrivial jump in difficulty from the homeworks so far. That’s why you have a partner, two weeks, and an intermediate meeting with a TA. The project has you do a nontrivial design of an OO program. You will need to spend time on this design. You can’t just sit down and whip out an implementation.

That said, we have covered the implementation concepts you need: you’ve written recursive programs on lists and trees, you’ve written programs that traverse trees from top to bottom, you’ve written methods that call a recursive helper method. You’ve filtered through lists. All of these pieces come together in this project.

Learning objectives

  • To practice designing with classes and interfaces
  • To practice using trees in a real-world application
  • To practice using plans to outline what your code needs to do
  • To understand a bit of how one approach to machine learning works

Project timeline

This project will proceed in two main stages, with you working with your partner in both:

  • In the Design Stage, you’ll work through the project handout, figure out how the intended algorithm works, and plan out the classes that you will need. You’ll follow a worksheet and answer some design questions in Gradescope, which you and your partner will review together with a TA during a Design Check. This occurs in the first week of the project.
  • In the Implementation Stage, you and your partner will write code that carries out your design. This occurs in the second week of the project.

Peer Feedback and Review

As a “mini stage” that spans the end of design/start of implementation, you’ll share plans (through Canvas), using peer review with other students to potentially get other ideas of how to approach the problem.

In the Peer Review system that we’re using (Peerceptiv, through Canvas), you’ll upload your plans for key functions before a due date. After the due date, the system will randomly assign you two plans from other (anonymous) students to review and comment on. Since we want you to be able to leverage peer review to think through your designs, we’re setting up both an early due date and a regular one: if you submit to the earlier one, you’ll get to see other students’ work earlier in the week:

Peer Review Dates

  • If your group completes Design Task 10 before Wed, Feb 21st at 11:59pm, submit your document to the “Early Submission” assignment link (you submit as a group)
  • If you don’t make the early deadline, you must submit your Task 10 document by the Design Stage deadline on Fri, Feb 23rd at 11:59pm.
  • After each of these deadlines, the tool will assign each individual student (not group) two other plans to review (so your project pair collectively gets to see four other plan submissions). As an individual student, provide feedback through Peerceptiv. These reviews are due by Mon, Feb 26 at 11:59pm
  • After Mon, Feb 26th, everyone should be able to see the reviews on their submissions. Each individual student will have to go in, acknowledge that they’ve seen the feedback, and answer a couple of reflection questions. These reflections are due by Mon, Mar 4th.

Peer review information + timeline

Peerceptiv is meant to be anonymous to students (course staff can still see your submission and your feedback). This means that when submitting documents or feedback to the system, you should NOT include your name or any other identifying information.

Please also remember our course policies on diversity and professionalism when reviewing and giving feedback.

Stencil code

Shortly after class on Friday, we will send you an email confirming your partner assignments that will include a personalized link to create your stencil.

We will post an announcement on Ed once these emails have been sent out, so you can be sure you won’t miss it.

Once you receive your partner assignments, clone the stencil from Github Classroom and create/join a team named [partner 1 brown login]-[partner 2 brown login]. For instructions on how to download the stencil and set it up in IntelliJ, see our Stencil Setup Guide.

Grading

  1. Design check (33.33%): The design check includes the conceptual assignment (autograded so that you can check yourself), the written assignment (which includes SRC, design tasks, and peer review and will be manually graded), and the design check meeting (graded on good-effort attendance).

  2. Implementation + Code style checks (Autograded) (33.33%): Please see the testing section for an overview on the system tests that we run. Due to the somewhat open-ended nature of the assignment, we are not doing wheat/chaff grading for testing.  Code style checks are counted under this part–for more info, see the Java style guide.

  3. Design, testing, and code structure (33.33%): The TAs will manually grade your work according to some rubrics. To earn the points for this part of the project score:

    • Follow every part of our Java style guide
    • Follow the design considerations for every task
    • Follow good OO design practices (i.e. if a computation is being done on the data that belongs to a certain class, that computation should be done in methods inside of that class)
    • Write thorough unit tests for each public method
    • Create and use your own datasets for system testing
    • Participate meaningfully in Peer Review, which means writing reviews with more content than a throwaway comment such as “looks good to me”

This is the first assignment for which you will submit plans. Grading of plans will be based on whether you followed the “guide to writing good plans” in the planning handout. You don’t have to produce one exact set of plan steps to get full credit on the planning part; just make a solid effort at it.

Background

You may have heard of machine learning, but what is it really?

You have learned how to write functions that perform computations. So far, the functions you’ve written consist of explicit steps that tell the computer how to do the computation. Imagine, for example, that we asked you to write a program to determine whether a particular job applicant should get an interview. You could write a function with a bunch of if-statements and boolean operations over information about the candidate (years of experience, prior recommendations, etc). But what if you have a lot of potential variables to consider? Doesn’t this function become pretty messy and hard to maintain?

With machine learning, you don’t write this function explicitly. Instead, you use a program (the machine learning algorithm) to generate one possible function for you based on data about previous candidates (this process is called training a model in machine-learning terminology). Then, when you get a new applicant, you call the generated function and ask it whether to interview the candidate. Set aside how you feel about the (considerable) perils here: the point is that machine learning is about generating functions that recommend decisions based on prior decisions using the same variables.

In this project, you will implement a classic machine learning algorithm and think about how to test it. Concretely, you will:

  • write a program that generates a function to make predictions based on data in a dataset. Your program will read in datasets from CSV files using code that we will provide in the stencil.
  • use your function to make predictions
  • reflect on what it means to test a machine-learned function

At the heart of the algorithm you’ll be implementing is a data structure known as a decision tree. This is a tree-shaped data structure that models a series of multiple-choice questions, answers to which are found at the leaves.

Overall, this project practices the data structure, testing, and programming concepts we’ve covered so far. You need both object-oriented programming and recursion to get this working. The implementations you’ve done to traverse trees (binary search trees or ancestry trees), to use recursion to traverse lists, and to create classes for lists and trees should all give you insights into this project.

Building decision trees by hand

As you follow along with this example, you will see some conceptual questions. These questions are automatically graded on Gradescope (to allow you to see whether you got a question right or wrong and resubmit if need be). To help all partners form a conceptual foundation for this project, these questions should be submitted individually before your design check date.

To help explain what our algorithm will be able to do, let’s consider the following problem: you would like to generate a function/train a model to tell you whether a food item is a fruit or vegetable given certain attributes including color, highProtein, and calories.

Food dataset

A subset of the data is shown below:

| name         | color  | highProtein | calories | foodType  |
|--------------|--------|-------------|----------|-----------|
| spinach      | green  | true        | low      | vegetable |
| orange       | orange | false       | high     | fruit     |
| cucumber     | green  | false       | low      | vegetable |
| carrot       | orange | false       | medium   | vegetable |
| sweet potato | orange | true        | high     | vegetable |
| avocado      | green  | true        | high     | fruit     |
| banana       | yellow | true        | high     | fruit     |

Each row contains values for the attributes color, highProtein, calories, and foodType. When you generate your function, you will choose which attribute it should make a prediction on (here, foodType). Effectively, this means that you are creating classes and methods to implement a function:

foodType(color, highProtein, calories)

that will return either fruit or vegetable as the foodType depending on the given values for color, highProtein, and calories. (You’ll call the function slightly differently, but this is the idea.)

This same setup arises in many applications: there could be a table that tracks who has been admitted to college based on different attributes, which movies ended up generating high revenue, and so on. In each case, the goal is to predict an outcome (e.g., getting admitted or generating high revenue) given the attributes of a new case (e.g., student or movie).

Decision tree

Decision trees provide an ordered way of examining each attribute of an item, leading ultimately to a decision/classification for that item. Below is a decision tree generated from the food dataset, designed to predict the foodType of foods not in the original table. Some things to note about the decision tree:

  • Each non-leaf node (represented as a blue oval) is labeled with an attribute from the dataset
  • A node for an attribute has outgoing edges corresponding to the possible values of that attribute
  • The leaves (orange and red boxes) represent the decisions or predicted value for the target attribute (here, foodType) and are labeled with the classification of a datum that has traveled down that path of the tree. All decisions are for the same target attribute. Leaves mean that you’ve found a prediction!
  • The same attribute arises only once along any path from the root to a decision.

Conceptual question 1

The decision tree for a dataset should always match the results listed in the original dataset. To make sure that you know how to use the tree, describe how the tree would be used to report that avocados and oranges are both fruits, based on the attributes in the table.

Making predictions

Let’s walk through two examples to show how the generated decision tree could be used to predict the value of the target attribute (foodType) for new data. We will refer to the fruit dataset and the graphic of the generated tree.

First, let’s predict the foodType for a tangerine, which has orange color, is not high protein, and is high in calories. We traverse the decision tree, checking the values of the given attributes starting from color (the root node). We would first follow the orange edge down to the highProtein node. Since the tangerine does not have high protein, we would then follow the false edge down to the calories node, and since the tangerine has high calories, the function would predict that a tangerine has the value fruit for the attribute foodType.

Now, consider broccoli, which is green, high in protein, and medium in calories. Again, we follow the tree. Taking the green color edge gets us to a calories node that has no edge for medium. This is a value that was not covered by our training data, so we need to use some information other than the paths in the tree to predict the foodType. The path we had taken so far limited us to green-colored items. We therefore consult the training dataset to identify the most common foodType among the green items in the training data (spinach, cucumber, and avocado). The function thus returns vegetable. Had there been a tie in the training data, we’d choose from the most-common values arbitrarily.

Conceptual question 2

What would a function built on the above decision tree predict for green apples that are high in calories? What about red apples?

Building new attribute values into the tree

Go back to the broccoli example, and the fact that there was no edge for its medium value for the calories attribute. Just above, we said we’d return the most common decision among green items. Any green item whose calories were something other than “low” and “high” (the two edges in the diagram) would return the same default value. To save later computation time, then, we will store that default value in each node. Here’s a revised diagram showing the defaults within each node:

[Diagram: decision tree with default values on nodes]

Conceptual question 3

Go through the process for predicting the foodType of broccoli (green, high in protein, and medium in calories) using this tree.

Details

Which node’s default value do we end up using?

This example illustrates a key algorithmic principle: pre-computing data can turn some seemingly algorithmic problems into simpler data lookups. When a series of computational steps feels messy, ask whether you could pre-compute and store some data to simplify the computation later. This is the same optimization that we did with the size/length function in homework 2.
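As a refresher on that homework 2 idea, here is a small illustrative class (not part of this project; all names here are made up) that pays the cost once, at construction time, so that asking for the size later is a plain field lookup:

// Illustrative only, not stencil code: a linked list node that stores the
// size of the list hanging off of it, so size() never has to traverse.
public class CountedList {
    private final String first;
    private final CountedList rest; // null marks the end of the list
    private final int size;         // pre-computed when the node is built

    public CountedList(String first, CountedList rest) {
        this.first = first;
        this.rest = rest;
        this.size = (rest == null) ? 1 : rest.size + 1;
    }

    public int size() {
        return this.size; // a lookup instead of a recursive walk
    }
}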

Conceptual question 4

On paper, work out the decision tree, with default values, to predict the outcome for the following dataset on college admissions. Consider the attributes in the order of GPA, letters of rec, high school, and extracurriculars.

Check your work using the Gradescope conceptual assignment. Be sure to save this tree for use later!

| GPA | high school | extra curriculars | letters of rec | outcome |
|-----|-------------|-------------------|----------------|---------|
| C   | blue school | strong            | strong         | reject  |
| C   | red school  | weak              | weak           | reject  |
| A   | blue school | weak              | strong         | accept  |
| A   | red school  | strong            | weak           | reject  |
| B   | blue school | weak              | strong         | accept  |
| B   | red school  | strong            | strong         | accept  |
| B   | red school  | weak              | strong         | reject  |

It may be helpful to copy this table into an Excel or Google Sheets file. You can use the “Sort” capabilities under “Data” to organize the rows by their value for an attribute and visualize how to form the decision tree.

On the accuracy of machine learning decisions

Put your answers to the Design check tasks in a document (e.g., google docs or similar) that you work on and turn in as a group. You will submit a PDF version of this document.

Assume that the tree that you generated earlier regarding admissions decisions will now be used on a new set of students. For each of the following students, use the tree to determine whether they will be admitted (fill in the outcome column):

| Name     | GPA | high school | extra curriculars | letters of rec | neighborhood | outcome |
|----------|-----|-------------|-------------------|----------------|--------------|---------|
| student1 | B   | blue school | strong            | strong         | Canyon       |         |
| student2 | C   | red school  | strong            | weak           | Elman        |         |
| student3 | B   | red school  | weak              | strong         | Elman        |         |
| student4 | B   | blue school | weak              | strong         | Canyon       |         |
| student5 | A   | red school  | strong            | strong         | Elman        |         |

Design check task 1

What are some things you notice about who does and does not get admitted? Are there outcomes that surprise you or concern you? Look at the table. Do you spot any patterns between the outcomes and attributes that were not part of the original tree?

Details

This example points to what is called soft-coded bias. Even if identities like race, gender, socioeconomic status, or sexuality are not included as attributes when training decision trees, the generated prediction function can still produce biased outcomes dependent on those identities. This often occurs when an attribute included in the model is a close substitute for one of these identities. For example, seemingly unrelated information like neighborhoods are strongly correlated with race. Essentially, if any attribute that you include in your model is correlated with one of these identities, your model is vulnerable to this sort of soft-coded bias.

Often those creating algorithms or automating decision-making processes do not intend to cause harm and sometimes they’re actively trying to do good. Today, a lot of resources, thought, and effort are being put into learning how to ‘do good better’. For the next task, we’ll examine different perspectives on this issue.

Overview of what you will be building

The following code block shows a sample interaction with the final code that you’ll develop for this project (Note: in the starter code, this has been translated to a test class to help you debug):

// creates a Dataset out of a CSV file
List<Row> dataObjects = DecisionTreeCSVParser.parse("fruits-and-vegetables.csv");
List<String> attributeList = new ArrayList<>(dataObjects.get(0).getAttributes());
Dataset training = new Dataset(attributeList, dataObjects);

// builds a TreeGenerator object and generates a tree for "foodType"
TreeGenerator generator = new TreeGenerator();
generator.generateTree(training, "foodType");

// makes a new (partial) Row representing the tangerine from the example
Row tangerine = new Row();
tangerine.setAttributeValue("color", "orange");
tangerine.setAttributeValue("highProtein", "false");
tangerine.setAttributeValue("calories", "high");

// uses the decision tree generated above to make a decision
String decision = generator.getDecision(tangerine);

System.out.println(decision); // prints "fruit"

We will give you the DecisionTreeCSVParser and Row classes (the former will read in a CSV file and produce a list of Row objects). In the implementation stage, you will need to:

  • design a Dataset class that holds a list of Row.
  • write a generateTree method (in TreeGenerator) that will generate a decision tree from a list of Row objects.
  • write a getDecision method (in TreeGenerator) that will produce a prediction on a Row object.

The rest of the Design Stage work will have you plan out the classes and methods that you will need for these implementation tasks.

Design stage tasks

Before proceeding to the tasks in this section, make sure you have answered the conceptual assignment questions 1-4 on Gradescope and received full points. This helps make sure that you understand the concepts behind decision trees well enough to start thinking about designing and implementing your code. Question 5 is still coming up, so don’t worry about it yet.

Coming up with testing datasets

In order to set yourselves up for success in debugging this project, you should come up with some training datasets and expected input-output pairs that you can run your code on. By datasets, we mean example CSV files of training data that the program reads in, like the produce and college admission tables above. By expected input-output pairs, we mean examples of inputs (rows of attribute values) that you provide to the decision tree (after training it on one of your test datasets) and the outputs (decisions) that you expect the tree to make for those inputs.

To help you get started, here are some criteria that you should use when coming up with your datasets and expected outcomes:

  • The datasets should be small enough for you to be able to draw out on paper the expected decision tree and the steps your code takes to generate the tree. Note: your code will choose attributes to split on at random, which means that there are multiple possible trees that could be generated.
  • At least some of the datasets or expected outcomes should be basic enough to test very simple cases (e.g. a dataset that causes your code to generate a tree of only one non-leaf node, to test that splitting on a single attribute is working properly).
  • At least some of the datasets/expected outcomes should test new attribute values (that is, paths down the tree that have to make use of the default decision of some node). Conversely, at least some of the datasets/expected outcomes should test situations where default decisions are not used at all.
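For reference, a training dataset is just a CSV file with a header row of attribute names followed by one line per datum; the food table from earlier would look roughly like this (a sketch of the format; check the provided data/fruits-and-vegetables.csv for the exact format the parser expects):

name,color,highProtein,calories,foodType
spinach,green,true,low,vegetable
orange,orange,false,high,fruit
cucumber,green,false,low,vegetable
carrot,orange,false,medium,vegetable
sweet potato,orange,true,high,vegetable
avocado,green,true,high,fruit
banana,yellow,true,high,fruit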

Stop and Think

Are there any other criteria that you feel are important when coming up with datasets and expected outcomes for this project? (You don’t have to turn in this answer, but it might help you think of some datasets).

Also keep in mind that your final set of tests should allow for incremental, test-driven development. This means that you should be able to tackle one small part of the project (such as writing a specific method in a specific class), test it, debug if necessary, and only move on to another part of the project when the test passes. More on this is written in the implementation section of this handout, but for now, you should recognize that you will likely have to write additional tests when you start implementing specific methods, so that you can test as you go!

Design check task 2

Based on the criteria above, come up with at least two example datasets to test your project. Include at least two expected outcomes for the tree generated from each dataset. During the design check, your TA will ask you how these example datasets fit the criteria listed above.

To check your understanding of this task, you can use our BasicDatasetTest.java file and replace the lines that are indicated by the TODO comments. At this point, please do not look through any of the other files in the starter code – we’ll guide you through them in the next section.

// TODO: replace with your own input file
String trainingPath = "data/fruits-and-vegetables.csv";
// TODO: replace with your own target attribute
String targetAttribute = "foodType";
...
// TODO: make your own rows based on your dataset
Row tangerine = new Row("test row (tangerine)");
...
// TODO: make your own assertions based on the expected classifications
Assert.assertEquals("fruit", testGenerator.getDecision(tangerine));

The tests will not run on the local copy of your code (yet), but you can then submit this file and your dataset CSV to the Project 1 (Decision Tree): Dataset Test assignment on Gradescope. If it says it passes the wheat, your dataset is formatted correctly and your input-output pairs are correct.

Some notes about this Gradescope assignment:

  • Remember to turn in your dataset CSV! In your code, the file path to it should be data/your-csv-title.csv (i.e. the path should not include any other folders).

  • For your tests to work, the dataset has to be such that it gives the same output for a given input no matter the order in which attributes are chosen to build the tree. If you’re getting errors for the getDecision assertion, you should try making tests for a simpler dataset.
  • This autograder is only useful for checking that your datasets are reasonable (which you can do during the design check phase or the implementation phase). We’ve configured the code so that it will not allow you to run other methods on Dataset, etc, so it will not work for testing your own unit tests that call on other methods.

Include a copy of your dataset and input-output pairs as part of your design check submission.

Reading and understanding the provided code

Do the following task before opening the starter code.

You may NOT use HashMaps as part of your implementation. This assignment is designed to test your use of lists, and using HashMaps bypasses that goal.

Based on the information above, you will be responsible for completing a Dataset class and a TreeGenerator class. In a later section of this document, you will reason about the computations that each class should do and how the classes should interact with each other. For this next design check task, we want you to focus on the TreeGenerator class.

Design check task 3

Use your understanding of decision trees and tree-like structures in general to reason about the additional classes and interfaces needed to represent a decision tree.

Details

Make a list of classes, any possible interface(s) they may implement, and what you think each one’s general functionality should be. A sentence or two will suffice per class and interface. For instance, what classes or interfaces should we use to represent nodes, edges, and leaves for our decision tree? What additional fields and methods should they have, if any? This is meant to be an exercise in calibrating your understanding of tree-like data structures, and the next two tasks guide you towards evaluating this understanding. You do not need to turn anything in for this task, but you will be asked to reflect on it below.

Design check task 4

Now look through the starter code and draw a containment/association diagram of how the classes fit together.

Details

You can ignore BasicDatasetTest, DecisionTreeCSVParser, DecisionTreeTester, and DecisionTreeTest. For classes that are in the src folder, you do not have to list their fields (but we recommend reading over/taking notes on the public methods they have, since you might want to make use of them!). We have started this for you.

For information on how to make a containment/association diagram, check out this guide.

There is a file called README.txt that lists all of the files that are present in the starter code and gives a short summary of each one. It will be helpful to refer to this file as you familiarize yourself with the code for this project. Most files also have more documentation and guidance for you in their own comments.

Design check task 5

Write a few sentences that reflect on the differences, if any, between your answer to Design check task 3 and the code you saw in Design check task 4. Did your proposed tree classes and interfaces match those given to you in the starter code? Is there anything you don’t understand about the way the provided classes interact?

Design check task 6

Now for each class, think about what information it might need to store in its fields/variables. Answer the following.

Details

Take a look at your containment/association diagram (from task 4) and your answers above and answer:

  • What field(s) should a DecisionLeaf have?
  • What field(s) should an AttributeNode have, besides the one we give you?
  • What field(s) should a ValueEdge have?
  • What field(s) should a Dataset have?

Thinking through the methods

Now, it’s time to sketch out how to actually construct the tree from the dataset. Generating the tree is a recursive process that works on smaller and smaller sets of rows from the original dataset. Let’s see this on our running example:

Conceptual question 5

We’ve numbered the nodes in the sample diagram. Indicate which rows from the dataset (by the corresponding food name) are being used to generate the portion of the tree rooted at each node. For example, at node 1 (the topmost node), all 7 food rows are being used. Identify the rows for the other three nodes (2-4) as well as the leaves (A-D). Check your work using the Gradescope conceptual assignment.

Now, it’s time to plan the key methods, along with some notes on how you will turn those plans into code (we covered plans in class on Feb 16th). For guidelines on how to plan and what makes for a good plan, refer to this document.

Design check task 7

Write a high-level plan for an implementation of generateTree. You might find the questions below helpful in guiding your planning. The plan you write should outline the sequence of high-level computations that get performed to generate a tree from a dataset (list of rows). You are welcome to name helper functions and provide plans for them as part of your overall plan.

Details

Think about conceptual question 5:

  • What input(s) is/are needed to generate a tree or subtree?
  • Under what conditions do you make a leaf? How do you know which decision goes into a leaf?
  • What determines the number of edges out of an internal (non-leaf) tree node?
  • What information is used to generate the subtree at each edge?
  • How might we compute default values at each node? (see illustration below for clarification)

[Diagram: thinking through row subsets]

Design check task 8

Now write a high-level plan for an implementation of getDecision. You should ensure that getDecision only uses the tree structure and pre-computed values from generating the tree to arrive at a decision.

Design check task 9

Based on your plans in tasks 7 and 8, propose a collection of method headers (names and types only) that you expect to need. Feel free to use these method headers in the plans themselves as well.

Details

For each method, does it belong in the Dataset class or the TreeGenerator class?

Note that we haven’t asked you about methods that belong in AttributeNode, ValueEdge, and DecisionLeaf. Our solution only has a constructor and getDecision in AttributeNode and DecisionLeaf, and a constructor and two getter methods in ValueEdge. The idea is that the TreeGenerator should be doing the work of calling the constructors for these classes, and it should be doing this work recursively (i.e. the method to generate node 1 in the above tree should make recursive calls to generate nodes 2 and 3, because nodes 2 and 3 are the roots of their own decision trees). Think of the TreeGenerator as orchestrating how data flows from the dataset into the node/edge classes to build up the tree.
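To make that division of labor concrete, here is what the leaf class might look like. Treat this as a sketch, not the stencil’s code: the interface name ITreeNode and the method signature are assumptions (check the starter code for the real node interface). An AttributeNode’s getDecision would, analogously, find the ValueEdge matching the datum’s value for its attribute and recur into that edge’s subtree, falling back to the node’s stored default value when no edge matches.

// A sketch, not stencil code: "ITreeNode" is a hypothetical name for
// whatever node interface the starter code actually defines.
public class DecisionLeaf implements ITreeNode {
    private final String decision; // e.g. "fruit" or "vegetable"

    public DecisionLeaf(String decision) {
        this.decision = decision;
    }

    @Override
    public String getDecision(Row forDatum) {
        return this.decision; // a leaf's answer doesn't depend on the datum
    }
}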

Design check task 10

At this point, please create a document with only your answers to tasks 7-9 (plans and method headers) and upload it to the peer review assignment in Canvas (the links are at the top of the handout). Do not include your names in the document itself!

If your team submits this document (to Peerceptiv) by 11:59 pm on Wednesday, you’ll get to see other groups’ designs shortly thereafter. Everyone else will submit by the regular deadline on Friday and will have access to others’ plans after that.

Implementation stage tasks

Unlike the homeworks and the design stage, the implementation stage does not come with a step-by-step list of tasks. The following list of what needs to be done, along with the plans you wrote in the Design stage, will guide your work.

Test your code each time you complete a small part of the project. Some of these tests will be written as JUnit assertions, while others could be as simple as running constructors and basic methods through a main class. Testing as you go will save you tons of debugging time later. We really can’t stress this point enough.

There is a separate section of this handout on Testing and Debugging (further down, or use the table of contents at the top) to guide you here.

High-level TO-DO List:

  • Implement the Dataset constructor and the methods defined in the IDataset interface
  • Make sure you can read in a dataset from CSV and get a Dataset object back (following the example code in the BasicDatasetTest class we give you). Write some simple tests (say on the size of the dataset) to make sure you’ve gotten what you expected.
  • Create/Implement classes for the components of the decision tree (nodes, leaves, edges)
  • Use your plan to begin implementing generateTree in TreeGenerator, leaving space for helper methods that may or may not already be written (there will be errors!). A good way to do this is to start with simple cases (simple datasets) and work up to more complicated ones. Think about which of your design check test datasets will yield the simplest trees. Make sure that is working, then move on to slightly more complicated datasets.
  • Write the helper methods for generateTree in the appropriate places.
  • Write getDecision.

You may NOT use HashMaps as part of your implementation. This assignment is designed to test your use of lists, and using HashMaps bypasses that goal.

Notes on Generating the decision tree

Hopefully the design check activities and plans helped you see how the tree generation should work. If you still need help seeing what that code would look like, consider the following steps (a code sketch follows the list):

  • Choose an attribute to split on from the input training data (not the target or a previously split attribute)
  • Determine whether you should create a decision leaf or an attribute node. You should make a decision leaf when all remaining rows have the same value for the target attribute, or when there are no remaining attributes to split on.
  • Compute the default value for the current node and store it in the node.
  • Create edges for each possible value of the attribute you chose.
  • Recursively generate the children nodes for these edges using the appropriate subset of the training dataset.
  • Attach each of these edges to the current node.
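Here is a rough sketch of how those steps might line up as a recursion. Treat it as structured pseudocode rather than stencil code: every helper named below (allSameValueFor, noAttributesLeft, mostCommonValueFor, distinctValuesFor, partition, and even where getAttributeToSplitOn lives) is a hypothetical placeholder for the methods you planned in design check tasks 7-9, and ITreeNode stands in for whatever node interface the starter code defines.

// Structured pseudocode, NOT stencil code: all helper names are
// hypothetical placeholders for the methods you designed yourself.
private ITreeNode generateSubtree(Dataset rows, String targetAttribute) {
    // base case: make a leaf when every row agrees on the target attribute,
    // or when there is nothing left to split on
    if (rows.allSameValueFor(targetAttribute) || rows.noAttributesLeft()) {
        return new DecisionLeaf(rows.mostCommonValueFor(targetAttribute));
    }

    // recursive case: pick an attribute, then build one edge per value of
    // that attribute seen in the current subset of rows
    String attribute = rows.getAttributeToSplitOn();
    String defaultValue = rows.mostCommonValueFor(targetAttribute);
    List<ValueEdge> edges = new ArrayList<>();
    for (String value : rows.distinctValuesFor(attribute)) {
        Dataset subset = rows.partition(attribute, value);
        edges.add(new ValueEdge(value, generateSubtree(subset, targetAttribute)));
    }
    // note: your design also needs to ensure the chosen attribute is not
    // reused deeper in the tree (see pitfall 5 below)
    return new AttributeNode(attribute, defaultValue, edges);
}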

Here are some possible pitfalls:

  1. Your code does not properly stop to make a leaf. You can usually tell this is happening if the tree exhausts all possible attributes.
  2. Your code does not check the leaf-creation conditions correctly.
  3. Your code has a faulty equality check somewhere (using == when you should have used .equals(...)).
  4. Your code stores a dataset partition in each ValueEdge and/or AttributeNode.
  5. Your code splits on the same attribute multiple times in a given path down the tree.

Choosing a random attribute to split on

Your TreeGenerator must support three approaches for choosing the attribute to split on. To aid in testing, we have provided implementations of getAttributeToSplitOn that select attributes in either alphabetical or reverse alphabetical order. These should allow you to generate trees with a deterministic (fixed) structure, which may help you debug potential issues in your implementation.

It is your job to implement the third approach: random attribute selection. The idea is that you will generate a random index into the list of attributes, then use the attribute at that index to split on.

import java.util.Random;

Random random = new Random();
int upperBound = 10;
// generates a random number between 0 (inclusive) and 10 (exclusive)
// note that the upper bound must be greater than 0!
int randomNum = random.nextInt(upperBound);
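// applying this to the project (hypothetical variable name): pick a random
// attribute by indexing into the list of remaining attributes, e.g.
// String attribute = attributes.get(random.nextInt(attributes.size()));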

Possible pitfall: as you select attributes to split on, you will end up needing to manage which attributes you have used and which you have not. There’s a subtlety here around mutable vs immutable lists that many students have run into in the past. If your generated trees seem to be missing lots of nodes, review your approach, thinking about mutability vs immutability …

Testing and test-driven development

As you work on this project, you should be completing it one small piece at a time, starting with writing tests for that piece. Only when you have tested and debugged the small piece should you move on to the next piece of the project. This is called test-driven development and provides a safety-net of sorts as you work on a larger, more complex project. This is how software engineering works in the real world. Multiple people make changes to the same project, and everyone is responsible for testing their addition before the changes are integrated with the rest of the project.

Referencing your input files in tests

In order to be able to run tests with input files, make sure that decision-tree-[team name] is the top-level project folder in your IntelliJ. Then, you can reference the datasets relative to that folder, just like our starter code does, e.g. data/fruits-and-vegetables.csv or data/mushrooms/training.csv. If you’re getting a FileNotFoundException, this is the first thing to check!

You must submit these files to Gradescope!

Unit tests

You will use JUnit unit tests to guide your test-driven development. Follow the starting point you had in the design check tasks to write more tests using your example data. Since we can only run a small set of unit tests on your code via the Autograder (the project is fairly open-ended), it is important for you to write unit tests as you go and use them to evaluate your code.
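For instance, an early test might check nothing more than that a parsed dataset has the size you expect. Here is a sketch along those lines, using only the stencil calls shown earlier in this handout; the CSV path and the expected count are hypothetical stand-ins for one of your own datasets:

import org.junit.Assert;
import org.junit.Test;

import java.util.List;

public class DatasetSizeTest {
    @Test
    public void testParsedRowCount() {
        // hypothetical CSV path: swap in one of your own design check datasets
        List<Row> rows = DecisionTreeCSVParser.parse("data/my-simple-dataset.csv");

        // replace 4 with however many data rows your CSV actually contains
        Assert.assertEquals(4, rows.size());
    }
}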

System-level tests

For system testing, you will use the DecisionTreeTester.java file that we have provided in the stencil’s src directory. There are instructions in that file on exactly which lines to change to run different tests (we don’t recommend changing the other lines of the file; instead, make your own testing file if you want to use these tests to diagnose something in your project). Within that tester, we are checking your code on several new datasets:

Descriptions of the larger datasets on which we’ll be testing your work

  • Villains: This dataset contains as attributes a bunch of restaurants on College Hill and whether or not the person/datum/row likes each restaurant, as well as a boolean attribute isVillain for whether or not the person is a villain. The tests use your code to train a decision tree on the training data, then use the tree on testing data to predict whether or not a person is a villain based on their restaurant preferences.
  • Mushrooms: This dataset contains a bunch of attributes of mushrooms, one of which is isPoisonous. The tests use your code to train a decision tree on the training data, then use the tree on testing data to predict whether or not a mushroom is poisonous based on its other attributes.
  • Candidates: This dataset contains attributes of candidates applying for a job, one of which is whether or not they were hired. There are multiple different training datasets to choose from for this one, so you might want to experiment and see which training dataset produces the highest accuracy on the testing data.
  • Songs: This dataset contains a bunch of songs from the 2000s with their titles and artists removed. Instead, they are represented by attributes including ‘topGenre’, ‘year’, and boolean attributes such as ‘isLoud’. The tests use your code to train a decision tree on the training data, then use the tree on testing data to predict whether or not a song is popular based on its other attributes.

The tests that work with these datasets are located in the DecisionTreeTester class. For each dataset, the tester trains your decision tree on training data. Then, for each datum in the testing dataset, it compares the generated decision tree’s prediction against the datum’s actual value to measure accuracy.

Performance of your decision tree can be variable because attributes are randomly selected. Therefore, we do this many times (currently NUM_ITERATIONS = 100) and average the results to get the average accuracy of the trees generated by your algorithm. You should not need to modify the code (unless you want to increase the number of iterations to make the measurement more precise).

Your accuracy measured on testing data for the Mushrooms and Villains datasets should be above 70%. If the target attribute is a boolean, a decision tree that randomly selects one of the two possible values as its prediction would have an accuracy of around 50%. You should use this as a diagnostic – if the accuracy is lower than 50%, there are definitely bugs in your code. Your performance on the songs dataset may be worse, but it still should be above 70%.

There is also a test that tests the decision trees on the same training data used to generate the trees. Your accuracy should be ~100% for this test, and if it is not, there may be something wrong with your code.

How we will use these datasets to grade your code

The Autograder will not simply rerun DecisionTreeTester as-is; instead, it will run the following tests with the following expectations:

  • Small unit tests on the Dataset class that test the IDataset methods
  • Small unit tests on the TreeGenerator class that test the ITreeGenerator methods
  • A small dataset of our own (expect 70% accuracy on the testing data and 95% on the training data)
  • Villains (expect 70% accuracy on the testing data and 95% on the training data)
  • Mushrooms (expect 70% accuracy on the testing data). Since Mushrooms is a large dataset, if your tree implementation is very inefficient in its use of computation time/space, this test might result in an output that says you ran out of heap space. This is considered a failed test (see below for some tips to increase your code efficiency). For this reason, Mushrooms will be weighted slightly lower than Villains in our suite of tests.

Debugging

Stepping through the code

The BasicDatasetTest.java file can provide a good starting point for visualizing the tree. For example, to step through how your code is generating the tree using the debugger, place a breakpoint at the line where your tree gets generated.

(remember that this code is in the setup method annotated with @Before, so it is run before every test)

Then, you can run a test using the debugger and step through the code that generates your tree.

Using the debugger to visualize a tree

After the tree has been generated, you can also visualize it using the debugger. For example, if you want to see what the tree looks like right before calling getDecision, you can do so by placing a breakpoint at the appropriate line (line 40 in the screenshot below) and expanding the TreeGenerator variable (in this case, testGenerator) in the debug window (some of this screenshot has been occluded to avoid giving away the structure of our solution. Your AttributeNodes, ValueEdges, etc can and should have more fields than shown here):

You can also use this breakpoint as a starting point for stepping into the code and seeing how the tree makes the decision to classify the input row!

Handing in

Include all of the following files in every submission. Only the last submission is graded. You should receive an email from Gradescope immediately after your submission.

  • AttributeNode.java
  • Dataset.java
  • DecisionLeaf.java
  • DecisionTreeTest.java
  • TreeGenerator.java
  • ValueEdge.java
  • Any test datasets that you have created

🚨 Super important: When you do your final submission, please make sure that your submission lists both you and your partner if you are submitting as a group. This makes sure that we know which submission to grade!

To check this, one member of your team should do the following (only necessary for your final submission):

  1. Open up your submission on Gradescope

  2. Select “Group Members” at the bottom

  3. Type in your partner’s name and select them from the list

Only one member of your group needs to do this, and you can do it after the deadline without being marked late. Please please please be sure to do this, though, as it saves us a lot of time when grading!

Working with Partners

This is meant to be a “two people work together” project, not “we split the project in half and each do a part”. At the design check, both partners will be asked about pieces of the project. If only one partner answers all the questions, the silent partner may receive a lower grade on the design check.

We have also been known to put questions about the project functions on the exams, as a way to encourage you to actually understand all aspects of the project.

If you are having problems with your project partner either being non-responsive, or not contributing to the work, or severely holding you back in some way, please reach out to Kathi. We have procedures in place to deal with these situations.

FAQ

Please let us know if you find any mistakes, inconsistencies, or confusing language in this or any other CS200 document by filling out our anonymous feedback form!