Key Dates
Looking for a stencil link? See here! You should wait to accept the assignment until after you have received your partner assignment.
We will post an announcement on Ed once these emails have been sent out, so you can be sure you won’t miss it.
Gearup Information: Tuesday, February 20th at 8pm in Rm 130, 85 Waterman St.
[Slides]: Will be posted after the gearup
[Recording]: Will be posted after the gearup
In this project, you will implement a classic machine learning algorithm that uses training data to make predictions in new situations. This algorithm uses a tree-like data structure to arrive at its predictions.
This handout contains a lot of information. Like, a lot. But don’t be afraid, young Padawans! Let’s go through what this handout contains, and how to use it effectively.
Background info: Most of the information in this handout is background information on the framework of a decision tree, accompanied by design check questions (in boxes) that will help you understand what goes into making and using one. Record your answers to the design check questions, and then submit your work to the Gradescope Design Check submission at least 1 hour before your design check. You’ll want to allocate a decent amount of time to go through this with your partner because your understanding of it is integral to your program’s success.
Gearup: If you prefer a different way of digesting the background information provided in this handout, come to or watch the recording of our Gearup! You still have to do the design check tasks, but we’ll be discussing these concepts out loud if that’s how you prefer to learn.
Design Stage Tasks: The first few days of the project are devoted to designing your approach. You’ll write a plan for your solution, give and receive feedback on other students’ plans (via Canvas), create some testing data, and generally set up a strategy for completing the project. You and your partner will meet with a TA to review your design during the first week.
Testing: After the design-stage section is advice on testing. Definitely test as you go - it may have been tempting on homeworks to just implement everything and test it at the end, but with projects of this size, you want to know if you mess something up far before you get to the ‘end.’ (It won’t be anywhere near the end for you if you don’t test first—we (TAs, Kathi, Elijah) all speak from experience).
Implementation: Finally, the handout discusses the implementation stage tasks. These are the steps you will take to develop your final submission. Don’t start too late! This is a hard assignment, and your ability to complete it will greatly depend on the amount of time you give your brain to mull it over. Design and testing as you go are essential to managing the overall time for the project.
Best of luck! If you have any questions, don’t hesitate to reach out to your design check TA. We are here to guide you!
This project is a nontrivial jump in difficulty from the homeworks so far. That’s why you have a partner, two weeks, and an intermediate meeting with a TA. The project has you do a nontrivial design of an OO program. You will need to spend time on this design. You can’t just sit down and whip out an implementation.
That said, we have covered the implementation concepts you need: you’ve written recursive programs on lists and trees, you’ve written programs that traverse trees from top to bottom, you’ve written methods that call a recursive helper method. You’ve filtered through lists. All of these pieces come together in this project.
This project will proceed in two main stages, with you working with your partner in both:
As a “mini stage” that spans the end of design/start of implementation, you’ll share plans (through Canvas), using peer review with other students to potentially get other ideas of how to approach the problem.
In the Peer Review system that we’re using (Peerceptiv, through Canvas), you’ll upload your plans for key functions before a due date. After the due date, the system will randomly assign you two plans from other (anonymous) students to review and comment on. Since we want you to be able to leverage peer review to think through your designs, we’re setting up both an early due date and a regular one: if you submit to the earlier one, you’ll get to see other students’ work earlier in the week:
Peerceptiv is meant to be anonymous (to students; course staff can still see your submission and your feedback). This means that when submitting documents or feedback to the system, you should NOT include your name or any other identifying information.
Please also remember our course policies on diversity and professionalism when reviewing and giving feedback.
Shortly after class on Friday, we will send you an email confirming your partner assignments that will include a personalized link to create your stencil.
Once you receive your partner assignments, clone the stencil from GitHub Classroom and create/join a team named as [partner 1 brown login]-[partner 2 brown login]. For instructions on how to download the stencil and set it up in IntelliJ, see our Stencil Setup Guide.
Design check (33.33%): The design check includes the conceptual assignment (autograded so that you can check yourself), the written assignment (which includes SRC, design tasks, and peer review and will be manually graded), and the design check meeting (graded on good-effort attendance).
Implementation + Code style checks (Autograded) (33.33%): Please see the testing section for an overview on the system tests that we run. Due to the somewhat open-ended nature of the assignment, we are not doing wheat/chaff grading for testing. Code style checks are counted under this part–for more info, see the Java style guide.
Design, testing, and code structure (33.33%): The TAs will manually grade your work according to some rubrics. To earn the points for this part of the project score:
This is the first assignment for which you have submitted plans. Grading of plans will be based on whether you followed the “guide to writing good plans” in the planning handout. You don’t have to produce one specific set of plan steps to get full credit on the planning part. Just make a solid effort at it.
You may have heard of machine learning, but what is it really?
You have learned how to write functions that perform computations. So far, the functions you’ve written consist of explicit steps that tell the computer how to do the computation. Imagine, for example, that we asked you to write a program to determine whether a particular job applicant should get an interview. You could write a function with a bunch of if-statements and boolean operations over information about the candidate (years of experience, prior recommendations, etc). But what if you have a lot of potential variables to consider? Doesn’t this function become pretty messy and hard to maintain?
With machine learning, you don’t write this function explicitly. Instead, you use a program (the machine learning algorithm) to generate one possible function for you based on data about previous candidates (this process is called training a model in machine-learning terminology). Then, when you get a new applicant, you call the generated function and ask it whether to interview the candidate. Set aside how you feel about the (considerable) perils here: the point is that machine learning is about generating functions that recommend decisions based on prior decisions using the same variables.
In this project, you will implement a classic machine learning algorithm and think about how to test it. Concretely, you will:
At the heart of the algorithm you’ll be implementing is a data structure known as a decision tree. This is a tree-shaped data structure that models a series of multiple-choice questions, answers to which are found at the leaves.
Overall, this project practices the data structure, testing, and programming concepts we’ve covered so far. You need both object-oriented programming and recursion to get this working. The implementations you’ve done to traverse trees (binary search trees or ancestry trees), to use recursion to traverse lists, and to create classes for lists and trees should all give you insights into this project.
As you follow along with this example, you will see some conceptual questions. These questions are automatically graded on Gradescope (to allow you to see whether you got a question right or wrong and resubmit if need be). To help all partners form a conceptual foundation for this project, these questions should be submitted individually before your design check date.
To help explain what our algorithm will be able to do, let’s consider the following problem: you would like to generate a function/train a model to tell you whether a food item is a fruit or vegetable given certain attributes including color, highProtein, and calories.
A subset of the data is shown below:
name | color | highProtein | calories | foodType |
---|---|---|---|---|
spinach | green | true | low | vegetable |
orange | orange | false | high | fruit |
cucumber | green | false | low | vegetable |
carrot | orange | false | medium | vegetable |
sweet potato | orange | true | high | vegetable |
avocado | green | true | high | fruit |
banana | yellow | true | high | fruit |
Each row contains values for the attributes color, highProtein, calories, and foodType. When you generate your function, you will choose which attribute it should make a prediction on (here, foodType). Effectively, this means that you are creating classes and methods to implement a function:

foodType(color, highProtein, calories)

that will return either fruit or vegetable as the foodType depending on the given values for color, highProtein, and calories. (You’ll call the function slightly differently, but this is the idea.)
This same setup arises in many applications: there could be a table that tracks who has been admitted to college based on different attributes, which movies ended up generating high revenue, and so on. In each case, the goal is to predict an outcome (e.g., getting admitted or generating high revenue) given the attributes of a new case (e.g., student or movie).
Decision trees provide an ordered way of examining each attribute of an item, leading ultimately to a decision/classification for that item. Below is a decision tree generated from the food dataset, designed to predict the foodType of foods not in the original table. Some things to note about the decision tree: leaves represent decisions (values of the target attribute foodType) and are labeled with the classification of a datum that has traveled down that path of the tree. All decisions are for the same target attribute. Leaves mean that you’ve found a prediction!

The decision tree for a dataset should always match the results listed in the original dataset. To make sure that you know how to use the tree, describe how the tree would be used to report that avocados and oranges are both fruits, based on the attributes in the table.
Let’s walk through two examples to show how the generated decision tree could be used to predict the value of the target attribute (foodType) for new data. We will refer to the fruit dataset and the graphic of the generated tree.

First, let’s predict the foodType for a tangerine, which has orange color, is not high protein, and is high in calories. We traverse the decision tree, checking the values of the given attributes starting from color (the root node). We would first follow the orange edge down to the highProtein node. Since the tangerine does not have high protein, we would then follow the false edge down to the calories node, and since the tangerine has high calories, the function would predict that a tangerine has the value fruit for the attribute foodType.
Now, consider broccoli, which is green, high in protein, and medium in calories. Again, we follow the tree. Taking the green color edge gets us to a calories node that has no edge for medium. This is a value that was not covered by our training data, so we need to use some information other than the paths in the tree to predict the foodType. The path we had taken so far limited us to green-colored items. We therefore consult the training dataset to identify the most common foodType among the green items in the training data (spinach, cucumber, and avocado). The function thus returns vegetable. Had there been a tie in the training data, we’d choose from the most-common values arbitrarily.
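The “most common value” fallback is just a majority vote over a subset of the training rows. Below is a minimal sketch of that computation, using plain lists of values rather than the project’s actual Row/Dataset classes (the names here are illustrative, and the sketch deliberately avoids HashMaps, which this assignment forbids):

```java
import java.util.List;

public class MajorityVote {
    // Returns the most common value in the list; ties are broken arbitrarily
    // (here, by whichever tied value is found first).
    static String mostCommon(List<String> values) {
        String best = null;
        int bestCount = 0;
        for (String candidate : values) {
            int count = 0;
            for (String value : values) {
                if (candidate.equals(value)) {
                    count++;
                }
            }
            if (count > bestCount) {
                best = candidate;
                bestCount = count;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // foodType values of the green rows (spinach, cucumber, avocado)
        List<String> greenFoodTypes = List.of("vegetable", "vegetable", "fruit");
        System.out.println(mostCommon(greenFoodTypes)); // prints "vegetable"
    }
}
```

The quadratic counting loop is perfectly fine at this scale, and it sidesteps needing a map from values to counts.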
What would a function built on the above decision tree predict for green apples that are high in calories? What about red apples?
Go back to the broccoli example, and the fact that there was no edge for its medium value for the calories attribute. Just above, we said we’d return the most common decision among green items. Any green item whose calories were something other than “low” and “high” (the two edges in the diagram) would return the same default value. To save later computation time, then, we will store that default value in each node. Here’s a revised diagram showing the defaults within each node:

Go through the process for predicting the foodType of broccoli (green, high in protein, and medium in calories) using this tree. Which node’s default value do we end up using?
This example illustrates a key algorithmic principle: pre-computing data can turn some seemingly algorithmic problems into simpler data lookups. When a series of computational steps feels messy, ask whether you could pre-compute and store some data to simplify the computation later. This is the same optimization that we did with the size/length function in homework 2.
On paper, work out the decision tree, with default values, to predict the outcome for the following dataset on college admissions. Consider the attributes in the order of GPA, letters of rec, high school, and extracurriculars.
Check your work using the Gradescope conceptual assignment. Be sure to save this tree for use later!
GPA | high school | extra curriculars | letters of rec | outcome |
---|---|---|---|---|
C | blue school | strong | strong | reject |
C | red school | weak | weak | reject |
A | blue school | weak | strong | accept |
A | red school | strong | weak | reject |
B | blue school | weak | strong | accept |
B | red school | strong | strong | accept |
B | red school | weak | strong | reject |
It may be helpful to copy this table into an Excel or Google Sheets file. You can make use of the “Sort” capabilities under “Data” to organize the rows by value for an attribute to visualize how to form the decision tree.
Put your answers to the Design check tasks in a document (e.g., google docs or similar) that you work on and turn in as a group. You will submit a PDF version of this document.
Assume that the tree that you generated earlier regarding admissions decisions will now be used on a new set of students. For each of the following students, use the tree to determine whether they will be admitted (fill in the outcome column):
Name | GPA | high school | extra curriculars | letters of rec | neighborhood | outcome |
---|---|---|---|---|---|---|
student1 | B | blue school | strong | strong | Canyon | |
student2 | C | red school | strong | weak | Elman | |
student3 | B | red school | weak | strong | Elman | |
student4 | B | blue school | weak | strong | Canyon | |
student5 | A | red school | strong | strong | Elman | |
What are some things you notice about who does and does not get admitted? Are there outcomes that surprise you or concern you? Look at the table. Do you spot any patterns between the outcomes and attributes that were not part of the original tree?
This example points to what is called soft-coded bias. Even if identities like race, gender, socioeconomic status, or sexuality are not included as attributes when training decision trees, the generated prediction function can still produce biased outcomes dependent on those identities. This often occurs when an attribute included in the model is a close substitute for one of these identities. For example, seemingly unrelated information like neighborhoods are strongly correlated with race. Essentially, if any attribute that you include in your model is correlated with one of these identities, your model is vulnerable to this sort of soft-coded bias.
Often those creating algorithms or automating decision-making processes do not intend to cause harm and sometimes they’re actively trying to do good. Today, a lot of resources, thought, and effort are being put into learning how to ‘do good better’. For the next task, we’ll examine different perspectives on this issue.
The following code block shows a sample interaction with the final code that you’ll develop for this project (Note: in the starter code, this has been translated to a test class to help you debug):
// creates a Dataset out of a CSV file
List<Row> dataObjects = DecisionTreeCSVParser.parse("fruits-and-vegetables.csv");
List<String> attributeList = new ArrayList<>(dataObjects.get(0).getAttributes());
Dataset training = new Dataset(attributeList, dataObjects);
// builds a TreeGenerator object and generates a tree for "foodType"
TreeGenerator generator = new TreeGenerator();
generator.generateTree(training, "foodType");
// makes a new (partial) Row representing the tangerine from the example
Row tangerine = new Row();
tangerine.setAttributeValue("color", "orange");
tangerine.setAttributeValue("highProtein", "false");
tangerine.setAttributeValue("calories", "high");
// uses the decision tree generated above to make a decision
String decision = generator.getDecision(tangerine);
System.out.println(decision); // prints "fruit"
We will give you the DecisionTreeCSVParser and Row classes (the former will read in a CSV file and produce a list of Row objects). In the implementation stage, you will need to:

- Complete a Dataset class that holds a list of Row.
- Write a generateTree method (in TreeGenerator) that will generate a decision tree from a list of Row objects.
- Write a getDecision method (in TreeGenerator) that will produce a prediction on a Row object.

The rest of the Design Stage work will have you plan out the classes and methods that you will need for these implementation tasks.
Before proceeding to the tasks in this section, make sure you have answered the conceptual assignment questions 1-4 on Gradescope and received full points. This helps make sure that you understand the concepts behind decision trees well enough to start thinking about designing and implementing your code. Question 5 is still coming up, so don’t worry about it yet.
In order to set yourselves up for success in debugging this project, you should come up with some training datasets and expected input-output pairs that you can run your code on. By datasets, we mean example CSV files of training data that the program reads in, like the produce and college admission tables above. By expected input-output pairs, we mean examples of inputs (rows of attribute values) that you provide to the decision tree (after training it on one of your test datasets) and the outputs (decisions) that you expect the tree to make for those inputs.
To help you get started, here are some criteria that you should use when coming up with your datasets and expected outcomes:
Are there any other criteria that you feel are important when coming up with datasets and expected outcomes for this project? (You don’t have to turn in this answer, but it might help you think of some datasets).
Also keep in mind that your final set of tests should allow for incremental, test-driven development. This means that you should be able to tackle one small part of the project (such as writing a specific method in a specific class), test it, debug if necessary, and only move on to another part of the project when the test passes. More on this is written in the implementation section of this handout, but for now, you should recognize that you will likely have to write additional tests when you start implementing specific methods, so that you can test as you go!
Based on the criteria above, come up with at least two example datasets to test your project. Include at least two expected outcomes for the tree generated from each dataset. During the design check, your TA will ask you how these example datasets fit the criteria listed above.
To check your understanding of this task, you can use our BasicDatasetTest.java file and replace the lines that are indicated by the TODO comments. At this point, please do not look through any of the other files in the starter code – we’ll guide you through them in the next section.
// TODO: replace with your own input file
String trainingPath = "data/fruits-and-vegetables.csv";
// TODO: replace with your own target attribute
String targetAttribute = "foodType";
...
// TODO: make your own rows based on your dataset
Row tangerine = new Row("test row (tangerine)");
...
// TODO: make your own assertions based on the expected classifications
Assert.assertEquals("fruit", testGenerator.getDecision(tangerine));
The tests will not run on the local copy of your code (yet), but you can then submit this file and your dataset CSV to the Project 1 (Decision Tree): Dataset Test assignment on Gradescope. If it says it passes the wheat, your dataset is formatted correctly and your input-output pairs are correct.
Some notes about this Gradescope assignment:

- Remember to turn in your dataset CSV! In your code, the file path to it should be data/your-csv-title.csv (i.e. the path should not include any additional folders).
Include a copy of your dataset and input-output pairs as part of your design check submission.
Do the following task before opening the starter code.
You may NOT use HashMaps as part of your implementation. This assignment is designed to test your use of lists, and using HashMaps bypasses that goal.

Based on the information above, you will be responsible for completing a Dataset class and a TreeGenerator class. In a later section of this document, you will reason about the computations that each class should do and how the classes should interact with each other. For this next design check task, we want you to focus on the TreeGenerator class.
Use your understanding of decision trees and tree-like structures in general to reason about the additional classes and interfaces needed to represent a decision tree.
Make a list of classes, any possible interface(s) they may implement, and what you think each one’s general functionality should be. A sentence or two will suffice per class and interface. For instance, what classes or interfaces should we use to represent nodes, edges, and leaves for our decision tree? What additional fields and methods should they have, if any? This is meant to be an exercise in calibrating your understanding of tree-like data structures, and the next two tasks guide you towards evaluating this understanding. You do not need to turn anything in for this task, but you will be asked to reflect on it below.
Now look through the starter code and draw a containment/association diagram of how the classes fit together.
You can ignore BasicDatasetTest, DecisionTreeCSVParser, DecisionTreeTester, and DecisionTreeTest. For classes that are in the src folder, you do not have to list their fields (but we recommend reading over/taking notes on the public methods they have, since you might want to make use of them!). We have started this for you.
For information on how to make a containment/association diagram, check out this guide.
There is a file called README.txt
that lists all of the files that are present in the starter code and gives a short summary of each one. It will be helpful to refer to this file as you familiarize yourself with the code for this project. Most files also have more documentation and guidance for you in their own comments.
Write a few sentences that reflect on the differences, if any, between your answer to Design check task 3 and the code you saw in Design check task 4. Did your proposed tree classes and interfaces match those given to you in the starter code? Is there anything you don’t understand about the way the provided classes interact?
Now for each class, think about what information it might need to store in its fields/variables. Answer the following.
Take a look at your containment/association diagram (from task 4) and your answers above and answer:
- What fields should DecisionLeaf have?
- What fields should AttributeNode have, besides the one we give you?
- What fields should ValueEdge have?
- What fields should Dataset have?

Now, it’s time to sketch out how to actually construct the tree from the dataset. Generating the tree is a recursive process that works on smaller and smaller sets of rows from the original dataset. Let’s see this on our running example:
We’ve numbered the nodes in the sample diagram. Indicate which rows from the dataset (by the corresponding food name) are being used to generate the portion of the tree rooted at each node. For example, at node 1 (the topmost node), all 7 food rows are being used. Identify the rows for the other three nodes (2-4) as well as the leaves (A-D). Check your work using the Gradescope conceptual assignment.
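The recursion works because each edge narrows the dataset: the subtree below an edge is generated from only the rows that share that edge’s value for the split attribute. Here is a rough sketch of that filtering step, with a simplified FoodRow stand-in for the project’s Row class (the real class and method names differ):

```java
import java.util.ArrayList;
import java.util.List;

public class Partition {
    // Simplified stand-in for the project's Row class.
    record FoodRow(String name, String color, String foodType) {}

    // Keeps only the rows whose color matches the given edge value; this
    // subset is what the recursive call below that edge would work on.
    static List<FoodRow> rowsWithColor(List<FoodRow> rows, String color) {
        List<FoodRow> subset = new ArrayList<>();
        for (FoodRow row : rows) {
            if (row.color().equals(color)) {
                subset.add(row);
            }
        }
        return subset;
    }

    public static void main(String[] args) {
        List<FoodRow> rows = List.of(
                new FoodRow("spinach", "green", "vegetable"),
                new FoodRow("orange", "orange", "fruit"),
                new FoodRow("cucumber", "green", "vegetable"),
                new FoodRow("avocado", "green", "fruit"));
        // The "green" edge out of the color node sees only three rows.
        System.out.println(rowsWithColor(rows, "green").size()); // prints "3"
    }
}
```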
Now, it’s time to plan the key methods, along with some notes on how you will turn those plans into code (we covered plans in class on Feb 16th). For guidelines on how to plan and what makes for a good plan, refer to this document.
Write a high-level plan for an implementation of generateTree. You might find the questions below helpful in guiding your planning. The plan you write should outline the sequence of high-level computations that get performed to generate a tree from a dataset (list of rows). You are welcome to name helper functions and provide plans for them as part of your overall plan.
Think about conceptual question 5:
Now write a high-level plan for an implementation of getDecision. You should ensure that getDecision only uses the tree structure and pre-computed values from generating the tree to arrive at a decision.
Based on your plans in tasks 7 and 8, propose a collection of method headers (names and types only) that you expect to need. Feel free to use these method headers in the plans themselves as well.
For each method, does it belong in the Dataset class or the TreeGenerator class?

Note that we haven’t asked you about methods that belong in AttributeNode, ValueEdge, and DecisionLeaf. Our solution only has a constructor and getDecision in AttributeNode and DecisionLeaf, and a constructor and two getter methods in ValueEdge. The idea is that the TreeGenerator should be doing the work of calling the constructors for these classes, and it should be doing this work recursively (i.e. the method to generate node 1 in the above tree should make recursive calls to generate nodes 2 and 3, because nodes 2 and 3 are the roots of their own decision trees). Think of the TreeGenerator as conducting the way that data flows from the dataset to the node/edge classes to build up a tree.
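To make the “roots of their own decision trees” idea concrete, here is a bare-bones sketch of how a leaf and an internal node can share one getDecision method, with the internal node recursing into a child. All names here are hypothetical simplifications; the starter code’s classes take a Row and support multiple edges per node:

```java
public class TreeShape {
    // Hypothetical simplification: decisions are made from a single
    // attribute value instead of a full Row.
    interface Node {
        String getDecision(String value);
    }

    // A leaf just reports the decision stored in it.
    record Leaf(String decision) implements Node {
        public String getDecision(String value) {
            return decision;
        }
    }

    // An internal node with one edge: follow the child if the value
    // matches the edge, otherwise fall back to the precomputed default.
    record OneEdgeNode(String edgeValue, Node child, String defaultDecision)
            implements Node {
        public String getDecision(String value) {
            if (edgeValue.equals(value)) {
                return child.getDecision(value); // recursive descent
            }
            return defaultDecision; // no matching edge: use the default
        }
    }

    public static void main(String[] args) {
        Node tree = new OneEdgeNode("green", new Leaf("vegetable"), "fruit");
        System.out.println(tree.getDecision("green"));  // prints "vegetable"
        System.out.println(tree.getDecision("yellow")); // prints "fruit"
    }
}
```

The key design point is that leaves and internal nodes answer the same question through a shared interface, so the caller never needs to know which kind of node it is holding.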
At this point, please create a document with only your answers to tasks 7-9 (plans and method headers) and upload it to the peer review assignment in Canvas (the links are at the top of the handout). Do not include your names in the document itself!
If your team submits this document (to Peerceptiv) by 11:59 pm on Wednesday, you’ll get to see other groups’ designs shortly thereafter. Everyone else will submit to the regular deadline on Friday and have access to others’ plans after that.
Unlike on homeworks or the design stage, we are not providing a list of tasks. The following list of what needs to be done, along with the plans you wrote in the Design stage, will guide your work.
Test your code each time you complete a small part of the project. Some of these tests will be written as JUnit assertions, while others could be as simple as running constructors and basic methods through a main class. Testing as you go will save you tons of debugging time later. We really can’t stress this point enough.
There is a separate section of this handout on Testing and Debugging (further down, or use the table of contents at the top) to guide you here.
High-level TO-DO List:

- Write the Dataset constructor and the methods defined in the IDataset interface.
- Read in one of your CSV files and get a Dataset object back (following the example code in the BasicDatasetTest class we give you). Write some simple tests (say on the size of the dataset) to make sure you’ve gotten what you expected.
- Write generateTree in TreeGenerator, leaving space for helper methods that may or may not already be written (there will be errors!). A good way to do this is to start with simple cases (simple datasets) and work up to more complicated ones. Think about which of your design check test datasets will yield the simplest trees. Make sure that is working, then move on to slightly more complicated datasets.
- Fill in the helper methods called by generateTree in the appropriate places.
- Implement getDecision.

You may NOT use HashMaps as part of your implementation. This assignment is designed to test your use of lists, and using HashMaps bypasses that goal.
Hopefully the design check activities and plans helped you see how the tree generation should work. If you still need help seeing what that code would look like, consider the following steps:
Here are some possible pitfalls:

- Using == when you should have used .equals(...)
- Mistakes in your ValueEdge and/or AttributeNode classes
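The == vs .equals(...) pitfall is worth a concrete look: two Strings with the same characters are not necessarily the same object, and attribute values parsed from a CSV are freshly constructed Strings, so == comparisons against them can silently fail:

```java
public class StringEquals {
    public static void main(String[] args) {
        String fromTree = "green";
        // Simulates a value read from a CSV: same characters, new object.
        String fromCsv = new String("green");

        System.out.println(fromTree == fromCsv);      // prints "false"
        System.out.println(fromTree.equals(fromCsv)); // prints "true"
    }
}
```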
There are three approaches for selecting the attribute to split on that you must support. To aid testing of your TreeGenerator class, we have provided implementations of getAttributeToSplitOn that select the attributes in either alphabetical order or reverse alphabetical order. This should allow you to generate trees with a deterministic (fixed) structure, which may help you debug potential issues in your implementation.

It is your job to implement random attribute splitting. The idea is that you will generate a random index into the list of attributes, then use the attribute at that index to split on.
import java.util.Random;
Random random = new Random();
int upperBound = 10;
// generates a random number between 0 (inclusive) and 10 (exclusive)
// note that the upper bound must be greater than 0!
int randomNum = random.nextInt(upperBound);
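Combining nextInt with a list’s size gives the random selection itself; for example (the attribute names here are illustrative):

```java
import java.util.List;
import java.util.Random;

public class RandomAttribute {
    // Picks one attribute uniformly at random from a non-empty list.
    static String pick(List<String> attributes, Random random) {
        // nextInt(size) always yields a valid index: 0 <= index < size
        return attributes.get(random.nextInt(attributes.size()));
    }

    public static void main(String[] args) {
        List<String> attributes = List.of("color", "highProtein", "calories");
        System.out.println(pick(attributes, new Random()));
    }
}
```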
Possible pitfall: as you select attributes to split on, you will end up needing to manage which attributes you have used and which you have not. There’s a subtlety here around mutable vs immutable lists that many students have run into in the past. If your generated trees seem to be missing lots of nodes, review your approach, thinking about mutability vs immutability …
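One way this mutability bug shows up: removing a used attribute from a list that is shared across recursive calls shrinks it for sibling branches too. Copying the list before modifying it keeps each branch independent. A small sketch of the two patterns:

```java
import java.util.ArrayList;
import java.util.List;

public class CopyBeforeModify {
    public static void main(String[] args) {
        List<String> attributes = new ArrayList<>(
                List.of("color", "highProtein", "calories"));

        // Bug pattern: handing `attributes` itself to every recursive call
        // and calling attributes.remove(...) mutates the one shared list,
        // so sibling branches see attributes go missing.

        // Safer pattern: give each branch its own copy to shrink. (Also note
        // that List.of(...) returns an immutable list, so remove(...) on it
        // would throw; the ArrayList copy is what makes removal legal.)
        List<String> remaining = new ArrayList<>(attributes);
        remaining.remove("color");

        System.out.println(attributes.size()); // prints "3" (untouched)
        System.out.println(remaining.size());  // prints "2"
    }
}
```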
As you work on this project, you should be completing it one small piece at a time, starting with writing tests for that piece. Only when you have tested and debugged the small piece should you move on to the next piece of the project. This is called test-driven development and provides a safety-net of sorts as you work on a larger, more complex project. This is how software engineering works in the real world. Multiple people make changes to the same project, and everyone is responsible for testing their addition before the changes are integrated with the rest of the project.
In order to be able to run tests with input files, make sure that decision-tree-[team name] is the top-level project folder in your IntelliJ. Then, you can reference the datasets relative to that folder, just like our starter code does, e.g. data/fruits-and-vegetables.csv or data/mushrooms/training.csv. If you’re getting a FileNotFoundException, this is the first thing to check!
You must submit these files to Gradescope!
You will use JUnit unit tests to guide your test-driven development. Follow the starting point you had in the design check tasks to write more tests using your example data. Since we can only run a small set of unit tests on your code via the Autograder (the project is fairly open-ended), it is important for you to write unit tests as you go and use them to evaluate your code.
For system testing, you will use the DecisionTreeTester.java file that we have provided in the stencil's src directory. There are instructions in that file on exactly which lines to change to run different tests (we don't recommend changing the other lines of the file; instead, make your own testing file if you want to use these tests to diagnose something in your project). Within that tester, we are checking your code on several new datasets:
- Villains: each person's restaurant preferences, with a boolean attribute isVillain for whether or not the person is a villain. The tests use your code to train a decision tree on the training data, then use it on testing data to predict whether or not a person is a villain based on their restaurant preferences.
- Mushrooms: each mushroom's attributes, with a boolean attribute isPoisonous. The tests use your code to train a decision tree on the training data, then use it on testing data to predict whether or not a mushroom is poisonous based on its other attributes.
- Songs: attributes such as 'topGenre' and 'year', plus boolean attributes such as 'isLoud'. The tests use your code to train a decision tree on the training data, then use it on testing data to predict whether or not a song is popular based on its other attributes.

The tests that work with these datasets are located in the DecisionTreeTester class. For each dataset, the tester trains your decision tree on training data. Then, for each datum in the testing dataset, it compares the generated decision tree's prediction against the datum's actual value to measure accuracy.
Performance of your decision tree can be variable because attributes are randomly selected. Therefore, we do this many times (currently NUM_ITERATIONS = 100) and average the results to get the average accuracy of the trees generated by your algorithm. You should not need to modify the code (unless you want to increase the number of iterations to make the measurement more precise).
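The measurement itself is simple arithmetic: accuracy is the fraction of test rows whose prediction matches the actual value, averaged over many train-and-test runs. The sketch below shows that loop with a coin-flip predictor standing in for a real trained tree (all names here are hypothetical, not from the stencil):

```java
import java.util.List;
import java.util.Random;

public class AccuracyAveragingSketch {
    // Stand-in for a trained tree's prediction; in the real tester this
    // would be your generated decision tree classifying the row.
    static String predict(List<String> row, Random random) {
        return random.nextBoolean() ? "true" : "false";
    }

    // Accuracy = fraction of test rows whose prediction matches the
    // row's actual target value.
    static double accuracy(List<List<String>> testRows, List<String> actual, Random random) {
        int correct = 0;
        for (int i = 0; i < testRows.size(); i++) {
            if (predict(testRows.get(i), random).equals(actual.get(i))) {
                correct++;
            }
        }
        return (double) correct / testRows.size();
    }

    public static void main(String[] args) {
        final int NUM_ITERATIONS = 100;
        List<List<String>> testRows = List.of(
            List.of("a"), List.of("b"), List.of("c"), List.of("d"));
        List<String> actual = List.of("true", "false", "true", "false");

        Random random = new Random(0); // fixed seed so the sketch is reproducible
        double total = 0.0;
        for (int i = 0; i < NUM_ITERATIONS; i++) {
            // each iteration would retrain the tree, then re-measure
            total += accuracy(testRows, actual, random);
        }
        // averaging smooths out run-to-run randomness; a coin-flip
        // predictor hovers near the 0.5 baseline for boolean targets
        System.out.println("average accuracy: " + total / NUM_ITERATIONS);
    }
}
```

This is also why a single low-accuracy run is not alarming, but a low average over 100 runs is.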
Your accuracy measured on testing data for the Mushrooms and Villains datasets should be above 70%. If the target attribute is a boolean, a decision tree that randomly selects one of the two possible values as its prediction would have an accuracy of around 50%. You should use this as a diagnostic: if the accuracy is lower than 50%, there are definitely bugs in your code. Your accuracy on the Songs dataset may be somewhat lower, but it should still be above 70%.
There is also a test that tests the decision trees on the same training data used to generate the trees. Your accuracy should be ~100% for this test, and if it is not, there may be something wrong with your code.
Instead, the Autograder will run the following tests with the following expectations:
- Tests on your Dataset class that test the IDataset methods.
- Tests on your TreeGenerator class that test the ITreeGenerator methods.
- System tests on the Villains and Mushrooms datasets. Since Mushrooms is a large dataset, if your tree implementation is very inefficient in its use of computation time/space, this test might result in an output that says you ran out of heap space. This is considered a failed test (see below for some tips to increase your code efficiency). For this reason, Mushrooms will be weighted slightly lower than Villains in our suite of tests.

The BasicDatasetTest.java file can provide a good starting point for visualizing the tree. For example, to step through how your code is generating the tree using the debugger, place a breakpoint at the line where your tree gets generated (remember that this code is in the setup method annotated with @Before, so it is run before every test):
Then, you can run a test using the debugger and step through the code that generates your tree:
After the tree has been generated, you can also visualize it using the debugger. For example, if you want to see what the tree looks like right before calling getDecision, you can do so by placing a breakpoint at the appropriate line (line 40 in the screenshot below) and expanding the TreeGenerator variable (in this case, testGenerator) in the debug window. Some of this screenshot has been occluded to avoid giving away the structure of our solution; your AttributeNodes, ValueEdges, etc. can and should have more fields than shown here:
You can also use this breakpoint as a starting point for stepping into the code and seeing how the tree makes the decision to classify the input row!
Include all of these files from sol for every submission. Only the last submission is graded. You should receive an email from Gradescope immediately after your submission.
AttributeNode.java
Dataset.java
DecisionLeaf.java
DecisionTreeTest.java
TreeGenerator.java
ValueEdge.java
Super important: When you do your final submission, please make sure that your submission lists both you and your partner if you are submitting as a group. This makes sure that we know which submission to grade!
To check this, one member of your team should do the following (only necessary for your final submission):
Open up your submission on Gradescope
Select “Group Members” at the bottom, which should look like this:
Type in your partner’s name and select them from the list
Only one member of your group needs to do this, and you can do it after the deadline without being marked late. Please, please, please be sure to do this, though, as it saves us a lot of time when grading!
This is meant to be a “two people work together” project, not “we split the project in half and each do a part”. At the design check, both partners will be asked about pieces of the project. If only one partner answers all the questions, the silent partner may receive a lower grade on the design check.
We have also been known to put questions about the project functions on the exams, as a way to encourage you to actually understand all aspects of the project.
If you are having problems with your project partner either being non-responsive, or not contributing to the work, or severely holding you back in some way, please reach out to Kathi. We have procedures in place to deal with these situations.
Please let us know if you find any mistakes, inconsistencies, or confusing language in this or any other CS200 document by filling out our anonymous feedback form!