Lecture 18 -- Graphs and Reachability

# Lecture 18 -- Graphs and Reachability ## Motivating Questions How could we write programs to check for routes between cities in a map of scheduled bus service? ## Conceptual Study Questions - What is a graph? - What are some options for capturing graphs in data structures? - Why do we need to track which cities we have visited when checking for routes? # Introducing Graphs The following diagram shows bus routes available on a New England regional bus line. ![diagram of bus routes](https://docs.cs200.io/uploads/9d480563-55ca-4c38-9d0c-e233376ad50a.png) This picture has two kinds of information: cities and bus trips. Data that have some sort of item and connections between them are called *graphs*. By this definition, trees are graphs. But in this example, we see that there is a *cycle*, in which one can go from Boston to Providence and back again. Graphs may feature cycles (which is what makes them interesting). We refer to the items as *Vertices* (also sometimes called *Nodes*), and the connections as *Edges*. # Data Structures for Graphs Unlike trees, graphs have no hierarchy (there is no \"top vertex\" of a graph using which we might refer to the rest of the graph). This indicates that we need to represent a graph as some collection of graphs, vertices, or both. What options might we have for data structures for graphs? 1. A list of Edge objects, where an Edge contains two Strings 2. A list of Edge objects, where an Edge contains two Vertex objects 3. A list of Vertex objects, which contains a list of other Vertices (the ones to which there are edges) 4. A 2D array (called an \"adjacency matrix\") in which cells are boolean values that are true if an edge exists between the vertices represented by the row and column of that cell 5. A HashMap from Vertex objects to a set of Vertex objects to which they have edges 6. A HashMap from names of cities to a list of the city names to which they have edges Each of these works well for some graph questions or some kinds of graphs, so which one you use in practice depends a lot on context. In our context, we are going to check (or find) routes (a sequence of edges) that could get us from one city to another. Option 3 works well for this. The array version requires you to search through the array to identify the outgoing edges of a vertex, instead of having a way to access them directly. The lists of edges have similar issues. Option 3 makes it easy to get from one vertex to its neighbors by following edges. Since we are not concerned about imposing an order on the collection of vertices, we will use a HashSet instead of a list to hold all of them so that we can check if a Graph contains a Vertex in constant time. # Classes for Vertices and Graphs First here is a class to capture a city in the route map ```=java= public class CityVertex { public LinkedList<CityVertex> toCities; public String name; public CityVertex(String nm) { this.name = nm; this.toCities = new LinkedList<CityVertex>(); } public void addEdge(CityVertex toVertex) { this.toCities.add(toVertex); } } ``` And here's the code to create our example graph as a series of `CityVertex` objects: ```=java= public class CityVertexTest { @Test public void testGraphExample() { CityVertex man = new CityVertex("Manchester"); CityVertex bos = new CityVertex("Boston"); CityVertex pvd = new CityVertex("Providence"); CityVertex wor = new CityVertex("Worcester"); CityVertex har = new CityVertex("Hartford"); man.addEdge(bos); bos.addEdge(pvd); bos.addEdge(wor); pvd.addEdge(bos); wor.addEdge(har); } } ``` Note that in this code, our vertices are named with separate variables (which makes it easier to set up the edges). In practice, we don't do this with large graphs (we instead do things like create vertices from, say, large csv tables about routes). In those cases, we need some sort of data structure to hold the collection of all vertices. Here is a graph class that does this. It uses a generic to capture the specific class to use for the vertices: ```=java= class Graph<T> { HashSet<Vertex<T>> vertices; public Graph() { this.vertices = new HashSet<Vertex<T>>(); } public void addVertex(Vertex<T> v) { this.vertices.add(v); } public void addEdge(Vertex<T> from, Vertex<T> to) { if (this.vertices.contains(to) && this.vertices.contains(from)) { from.addEdge(to); } // throw exception otherwise (code elided) } } ``` # Checking for Routes (Reachability) One common question to ask about a graph is whether there is a path from one vertex to another. A *path* is a sequence of edges (or vertices) that get from a desired starting vertex to a desired destination vertex. Let's call the method that checks for routes `canReach` (asking whether you can reach one node from another). Before we write the code, let's work out some assertions. ```=java= Assert.assertTrue(man.canReach(man)); Assert.assertTrue(man.canReach(pvd)); Assert.assertTrue(man.canReach(har)); Assert.assertFalse(har.canReach(wor)); ``` The one potentially controversial test here is the one from a city to itself. Should we consider it a route if we don't actually go anywhere? Depending on your application, either answer might make sense. For our purposes, we will treat this case as true, taking the interpretation that being at your destination is more important than whether you needed to travel to get there. ## A Plan for Checking Reachability Let's think about how we might check reachability from point A to point B in a graph. If we can get from A to B, then there must be some neighbor of B that gets used to do that. Looking at our example, to get from Boston to Hartford, we go through Worcester. Worcester is a neighbor of Boston and from Worcester we can get to Hartford. This intuition of breaking down the potential route around the neighbors of a Node gives us a plan for checking for routes: **Plan for getting from City A to City B** 1. if A and B are the same city, then A can reach B 2. otherwise, if any neighbor N of A can reach B, then A can reach B by going through N ## Implementing Reachability Now, let's write a method for this in the `CityVertex` class. ```=java= public class CityVertex { ... public boolean canReach(CityVertex dest) { // plan step 1 if (this == dest) { return true; } // plan step 2 for (CityVertex neighbor : this.toCities) { if (neighbor.canReachdest)) { return true; } } // out of neighbors to try, so no way to get to dest return false; } } ``` This code is a straightforward recursive traversal that follows our plan. ### Testing our code Let's check our assertions from earlier: ```=java= Assert.assertTrue(man.canReach(man)); Assert.assertTrue(man.canReach(pvd)); Assert.assertTrue(man.canReach(har)); Assert.assertFalse(har.canReach(wor)); ``` The first, second, and fourth pass, but the third results in a "StackOverflow error". This kind of error essentially means "you made so many method calls that Java has run out of room to keep track of them". Anytime you get a StackOverflow error, the most likely culprit is that you have an "infinite loop" in your code. This means you have some pattern of computation that is getting run over and over when it shouldn't be. To see what's happening there, let's hand-trace the calls to `canReach` that are involved in that assertion. Each level of indentation represents a recursive call. When two calls arise from the same `for` loop, they are aligned in terms of indentation (for example, lines 3 and 9, or lines 5 and 8) ```=java= man.canReach(har) bos.canReach(har) pvd.canReach(har) bos.canReach(har) pvd.canReach(har) bos.canReach(har) ... wor.canReach(har) // pending call wor.canReach(har) // pending call ``` The two calls involving `wor`, which are needed to actually get to `har` never happen because we keep bouncing back and forth between `bos` and `pvd` (which have edges to one another). Each time we try `canReach` from `bos`, it tries the first neighbor in its `neighbors` list, which happens to be `pvd`. Why is the first neighbor `pvd` and not `wor`? That's the order in which we added the edges: ```=java man.addEdge(bos); bos.addEdge(pvd); <-- we put pvd first bos.addEdge(wor); <-- then put wor second pvd.addEdge(bos); wor.addEdge(har); ``` In one sense, the infinite bouncing between boston and providence is dumb luck based on the order that we created the edges. But we shouldn't have code that runs or doesn't on dumb luck either. ## Breaking the cycle Looking at the trace of calls, we need to not start that second call to `bos.canReach()`. What could stop it, however? We'd need some way to keep track of which vertices we'd already tried searching from. What if we could modify the calls to `canReach` to also track which vertices had already been visited, say something like the following (putting those vertices we've tried in curly brackets): ```=java= man.canReachVisited(har, {}) bos.canReachVisited(har, {man}) pvd.canReachVisited(har, {man, bos}) // bos.canReachVisited(har) // don't try! we've already started from bos return false wor.canReachVisited(har, {man, bos, pvd}) har.canReachVisited(har, {man, bos, pvd, wor}) return true ``` If we never start the second call to `bos`, the infinite loop goes away and our code gets to try `wor`, from which the code eventually finds `har`. How do we build this in code? We will create a `HashSet<CityVertex>` and add vertices to it each time we start `canReach`. Here's the code: ```=java= public class CityVertex { ... // recursive helper for canReach with visited set private boolean canReachVisited(CityVertex dest, HashSet<CityVertex> visited) { if (this == dest) { return true; } visited.add(this); for (CityVertex neighbor : this.toCities) { if (!visited.contains(neighbor)) { if (neighbor.canReachVisited(dest, visited)) { return true; // We can reach neighbor, we can reach dest } } } return false; } // canReach that initializes the visited set public boolean canReach(CityVertex dest) { return this.canReachVisited(dest, new HashSet<CityVertex>()); } } ``` The public `canReach` method maintains the old method header, passing an empty `HashSet` into the recursive `canReachVisited` method. All of our assertions pass against this new version. Next lecture, we'll write a variation of this code and show how we can use it to look for routes (rather than just checking whether a route exists).