Visualizing request-flow comparison to aidperformance diagnosis in distributed systemsRaja R. Sambasivan, Ilari Shafer, Michelle L. Mazurek, Gregory R. GangerCMU-PDL-12-102May 2012Parallel Data LaboratoryCarnegie Mellon UniversityPittsburgh, PA 15213-3890AbstractDistributed systems are complex to develop and administer, and performance problem diagnosis is particularly challenging. Whenperformance decreases, the problem might be in any of the system’s many components or could be a result of poor interactionsamong them. Recent research has provided the ability to automatically identify a small set of most likely problem locations, leaving thediagnoser with the task of exploring just that set. This paper describes and evaluates three approaches for visualizing the results of aproven technique called “request-flow comparison” for identifying likely causes of performance decreases in a distributed system. Ouruser study provides a number of insights useful in guiding visualization tool design for distributed system diagnosis. For example, wefind that both an overlay-based approach (e.g., diff) and a side-by-side approach are effective, with tradeoffs for different users (e.g.,expert vs. not) and different problem types. We also find that an animation-based approach is confusing and difficult to use.Acknowledgements: We thank the members and companies of the PDL Consortium (including Actifio, APC, EMC, Emulex, Facebook, Fusionio, Google, Hewlett-Packard Labs, Hitachi, Intel, Microsoft Research, NEC Laboratories, NetApp, Oracle, Panasas, Riverbed, Samsung, Seagate,STEC, Symantec, VMWare, and Western Digital) for their interest, insights, feedback, and support. This research was sponsored in part by a Googleresearch award, NSF grant #CNS-1117567, and by Intel via the Intel Science and Technology Center for Cloud Computing (ISTC-CC). Ilari Shafer issupported in part by an NDSEG Fellowship, which is sponsored by the Department of Defense.

Keywords: distributed systems, performance diagnosis, request-flow comparison, user study, visualization

Figure 1: Comparing request-flow graphs. This side-by-side visualization, one of three interfaces we evaluate,illustrates the output of a diagnosis technique that compares graphs (called request-flow comparison). It shows thesetwo graphs juxtaposed horizontally, with dashed lines between matching nodes in both. The rightmost series of nodes inthe screenshot do not exist in the graph on the left, causing the yellow nodes to shift downward in the graph on the right.See Section 2 for further detail on why this change is important and what nodes in the graph represent.1IntroductionA distributed system is a set of software components running on multiple networked computers that collectivelyprovide some service or result. Examples now pervade all walks of life, as society uses distributed servicesto communicate (e.g., Google’s Gmail), shop (e.g., Amazon), entertain ourselves (e.g., YouTube), and soforth. Though such distributed systems often have simple interfaces and usually respond quickly, there isgreat complexity involved in developing them and maintaining their performance levels over time. Unexpectedperformance degradations arise frequently, and substantial human effort is involved in addressing them.When a performance degradation arises, the crucial first step in addressing it is figuring out what is causingit. The “root cause” might be any of the system’s software components, unexpected interactions betweenthem, or slowdowns in the network connecting them. Exploring the possibilities and identifying the most likelyroot causes has traditionally been an ad-hoc manual process, informed primarily by raw performance datacollected from individual components. As distributed systems have grown in scale and complexity, such ad-hocprocesses have grown less and less tenable.To help, recent research has proposed many tools for automatically localizing the many possible sources of anew problem to just a few potential culprits [5, 8, 9, 19–21, 25, 30, 31, 34, 36]. These tools do not identify the rootcause directly, but rather help developers build intuition about the problem and focus their diagnosis efforts.Though complete automation would be ideal, the complexity of modern systems and the problems that arise inthem ensure that this human-in-the-loop model will be dominant for the foreseeable future. As such, manyresearchers recognize the need for localization tools to present their results as clearly as possible [26, 29]. But1

apart from from a few select instances [23, 26], little research has been conducted on how to do so.As a step toward addressing this need, this paper presents a 20-person user study evaluating three interfaceswe built for visualizing the results of one powerful, proven technique called “request-flow comparison” [34]. Ouruser study uses real problems from a real distributed system. Request-flow comparison compares how thedistributed system services requests (e.g., “read this e-mail message” or “find books by this author”) duringtwo periods of operation: one where performance was fine (“before”) and the new one in which performancehas degraded (“after”). Each request serviced has a corresponding workflow within the system, representingthe order and timing of components involved; for example, a request to read e-mail might start at a front-endweb-server that parses the request, then be forwarded to the e-mail directory server for the specific user, thenbe forwarded to the storage server that holds the desired message, and then return to the web-server soit can respond to the requester. Figure 2 shows a similar example for a distributed storage system. Eachsuch request flow can be represented as a graph, and comparing the before and after graphs can providesignificant insight into performance degradations.Although request-flow comparison has been used to diagnose real problems observed in the Ursa Minordistributed storage system [1] as well as certain Google services, its utility has been limited by a clunkyinterface that presents results using text files and unsophisticated DOT [11] graphs that must be manually andpainstakingly compared with each other. The goal of this study is to identify what visualization techniqueswork best for presenting the results of request-flow comparison to their intended audience—developers andpeople knowledgeable about distributed systems.The interfaces we compared all try to show relevant differences between before-and-after pairs of directedacyclic graphs, which are the output of request-flow comparison. We built our own interfaces because ofdomain-specific requirements that precluded off-the-shelf solutions. For example, correspondences betweennodes of before-after pairs are not known a priori, so a significant challenge involved creating heuristics toidentify them. The side-by-side interface shows both graphs, adding correspondence lines between nodesthat represent the same activity in both graphs (see Figure 1). Diff presents a compact view by overlayingthe graphs and highlighting important differences. Finally, animation attempts to clarify differences by rapidlyswitching between the two graphs.Our user study results show diff and side-by-side perform comparably, with animation faring the worst.The choice between diff and side-by-side varies depending on users’ familiarity with software developmentpractices and with characteristics of the problem being diagnosed. Non-experts preferred side-by-side due toClient A151Client BDistributed System2Data LocationServerFrontendFile Server342Storage ServersFigure 2: Distributed storage system. To read a file, clients connect to this distributed system. A frontend file serverhandles their requests, but may need to access other components like the data location server and storage servers.For example, Client A makes a request that requires the blue messages 1–5, while the request initiated by Client Bonly produces the red messages 1–2. There may be many other paths through the system. Ursa Minor, the distributedstorage system discussed in this paper, has a similar architecture.2

its straightforward presentation. Diff’s more compact representation was favored by experts and advancedusers, but it engendered confusion in those less familiar with distributed systems. We also found that requestflow comparison’s failure modes sometimes did not match users’ expectations, highlighting the importance ofchoosing algorithms that match users’ mental models when creating automated diagnosis tools.The rest of this paper is organized as follows. Section 2 provides relevant background about request-flowcomparison. Section 3 describes the three interfaces and how node correspondences are determined.Section 4 describes the user study, and Section 5 describes the results. Section 6 describes key designlessons learned from the user study for further interface improvements. Section 7 describes related work andSection 8 concludes.2Request-flow comparisonRequest-flow comparison [34] is a technique for automatically localizing the root causes of performancedegradations in distributed systems, such as Ursa Minor (shown in Figure 2), GFS [14], and Bigtable [7]. Ituses the insight that such degradations often manifest as changes or differences in the workflow of individualrequests as they are serviced by the system. Exposing these differences and showing how they differ fromprevious behavior localizes the source of the problem and significantly guides developer effort.Request-flow comparison works by comparing request-flow graphs observed during two periods: one ofgood performance and one of poor performance. Nodes of these directed acyclic graphs show importantevents observed on different components during request processing, and edges show latency between theseevents (see Figure 3 for an example). Request-flow comparison groups the flows observed during bothperiods (often numbered in the hundreds of thousands or millions) into clusters, then identifies those from thepoor-performance period that appear to most contribute to the performance degradation. As output, it presentspairs of before-and-after graphs of these culprits, showing how they were processed before the performancechange versus after the change.1 Identifying differences between these pairs of graphs localizes the source ofthe problem and provides developers with starting points for their diagnosis efforts. To preserve context, entirerequest-flow graphs are presented with some, but not all, important differences highlighted.This technique identifies two important types of differences. Edge latency changes are differences in thetime required to execute successive events and represent slowdown in request processing. Request-flowcomparison attempts to identify these changes automatically, using hypothesis tests to identify edges withlatency distributions that have a statistically significant difference in the before and after periods. Similar testsare used in several automated diagnosis tools [19, 28, 34]. Since hypothesis tests will not identify all edgesworth investigating, developers must still examine the graphs manually to find additional such divergences.Structural changes are differences in the causal ordering of system events. Developers must contrast the twographs manually to identify them. Further details about request-flow comparison can be found in Sambasivanet al. [34].3Interface designTo compare the pairs of before/after graphs output by request-flow comparison, we built three interfacesdesigned to represent orthogonal approaches to representing differences. They are shown in Figure 4. Theinterfaces occupy three corners in the space of approaches to visualizing differences, as identified by a1In Sambasivan et al. [34], before graphs are called precursors and after graphs are called mutations.3

File read call3.99μ3.99μsFrontend file server lookup call28.48μ28.48μsFrontend file server cache miss30.78μ30.78μs54.35μ54.35μsSubSub -request callSubSub -request callSubSub -request replySubSub -request reply25.83μ25.83μs3233.13μ3233.13μsFile read replyFigure 3: Example request-flow graph. This graph shows the flow of a read request through the distributed storagesystem shown in Figure 2. Node names represent important events observed on the various components whilecompleting the required work. Edges show latencies between these events. Fan outs represent the start of parallelactivity, and synchronization points are indicated by fan ins. Due to space constraints, only the events observed on thefrontend file server are shown. The green dots abstract away messages exchanged between other components and thework done on them. Finally, the original node names, which had meaning only to the developers of the system, havebeen replaced with human-readable versions.taxonomy of comparison approaches [15]. The side-by-side interface is nearly a “juxtaposition,” which presentsindependent layouts. The diff interface is an “explicit encoding,” which highlights the differences betweenthe two graphs. Finally, the animation interface is closest to a “superposition” design that guides attention tochanges that “blink.” All are implemented in JavaScript, and use modified libraries from the Javascript InfoVisToolkit [6]. The rest of this section describes these interfaces.3.1Side-by-sideThe side-by-side interface (Figures 4a and 4d) computes independent layered layouts for the before andafter graphs and displays them beside each other horizontally. Nodes in the before graph are linked tocorresponding nodes in the after graph by dashed lines. This interface is analogous to a parallel coordinatesvisualization [18], with coordinates given by the locations of the nodes in the before and after graphs. Figure4a provides an example of this interface for two graphs containing nodes a,b,c, and d. Using this interface,latency changes can be identified by examining the relative slope of adjacent dashed lines: parallel linesindicate no change in latency, while increasing skew is indicative of longer response time. Structural changescan be identified by the presence of nodes in the before or after graph with no corresponding node in the othergraph.3.2DiffThe diff interface (Figures 4b and 4e) shows a single static image in an explicit encoding of the differencesbetween the before and after graphs, which are associated with the colors orange and blue respectively. Thelayout contains all nodes from both the before and after graphs. Nodes that exist only in the before graph areoutlined in orange and annotated with a minus sign; those that exist only in the after graph are outlined in blueand annotated with a plus sign. Nodes that exist in both graphs are not highlighted. This structural approach isakin to the output of a contextual diff tool [24] emphasizing insertions and deletions.We use the same orange and blue scheme to show latency changes, with edges that exist in only one graph4

(a) Side-by-side diagram(d) Side-by-side screenshot(b) Diff diagram(c) Animation diagram(e) Diff screenshot(f) Animation screenshotFigure 4: Three interfaces. This diagram illustrates the three approaches to visualizing differences in request-flowgraphs that we compare in this study. Figures a, b, and c provide samples for small synthetic graphs. Figures d, e, and fshow the interfaces applied to one of the real-world problems that was presented to users.shown in the appropriate color. Edges existing in both graphs produce a per-edge latency diff: orange andblue lines are inset together with different lengths. The ratio of the lengths is computed from the ratio of theedge latencies in before and after graphs, and the subsequent node is attached at the end of the longer line.3.3AnimationThe animation interface (Figures 4c and 4f) provides user-controllable switching between the before and aftergraphs. To provide a smooth transition, we interpolate the positions of nodes between the two graphs. Nodesthat exist in only one graph appear only on the appropriate terminal of the animation, becoming steadily moretransparent as the animation advances and vanishing completely by the other terminal. Users can start andstop the animation, as well as directly selecting a terminal or intermediate point of their choice.3.4Correspondence DeterminationAll of the interfaces described above require knowing correspondences between the before and after graphs,which are not known a priori. We must determine which nodes in the before graph map to which matchingnodes in the after graph, and by extension which nodes in each graph have no match in the other. Using graphstructure alone, this problem is hard in the formal sense [12], so we use an approximation technique.Each node has a distinguished name, and if the nodes are identical in the before and after graphs then theirnames are the same. The converse, however, is not true: a node name can appear multiple times in a trace,and insertions or deletions can have the same name as existing nodes. We exploit this naming property with a5

correspondence approximation approach based on string-edit distance. An overview of the process is shownin Figure 5.We first serialize each graph through a depth-first search, producing a string of objects. The two graphs shownat left in Figure 5, for example, are transformed into the strings abcd and abce. In each recursive step, wetraverse adjacencies in lexical order so that any reordered nodes from request flow graph collection (e.g.,interchanging nodes b and e in Figure 5(a)) do not produce different strings. This approach is reminiscent ofthat used in other graph difference comparison heuristics [12].We then calculate string-edit distance [39] between the resulting strings, with two components consideredequal if their names are equal, providing a correspondence between nodes. For example, Figure 5(c) showsthree corresponding nodes, one deletion, and one insertion. To obtain these correspondences from thememoization matrix shown in Figure 5(b), at each computation step we maintain a chain of the edits that led tothe distance shown at bottom right. We then trace the chain to its end.Finally, we map the differences computed above onto the graph union of the before and after graphs. Eachvertex in the graph union is tagged with one of three types used in displaying the three interfaces — in bothgraphs, in only the before graph, or in only the after graph. Edges likewise are tagged with one of these threetypes by comparing the adjacency lists of nodes in the two graphs. The graph union is used to computethe layout of the diff interface, and the three tags control the visualization in each interface (for example, theopacity of nodes in the animation interface).Of course, this approach is only approximate. Suppose, for example, that the node named e in Figure 5 wereinstead labeled d. The resulting serialized strings would both be abcd, and no nodes would be considered tobe insertions or deletions. We have found, however, that our technique works well in practice on request-flowgraphs, in which changes in structure are typically accompanied by changes in node names.3.5Common featuresAll three of our interfaces incorporate some common features, tailored specifically for request-flow graphs. Allgraphs are drawn with a layered layout based on the technique by Sugiyama et al [38]. This algorithm workswell for drawing graphs that can be presented as a hierarchy or a sequence of layers, a property satisfied byrequest-flow graphs. Layouts that modify this underlying approach enjoy widespread use [11]. Our interfacesFigure 5: Correspondence determination process. Here we show our method for finding correspondence, for thesame synthetic graph as shown in Figure 4. Starting from the before-and-after graph pair shown in (a), we perform adepth-first search to transform the request-flow graphs to strings. These strings are compared by finding their string-editdistance, as illustrated in (b). While computing the string-edit distance, we maintain information about the insertions,deletions, and correspondences between nodes, as shown in (c). Finally, we map these edits back to the original graphstructure, as shown in a diff-style format in (d).6

use a slightly modified version omitting some stages that are unnecessary for request flow graphs, such asdetecting and eliminating cycles (requests cannot move backward in time).To navigate the interface, users can pan the graph view by clicking and dragging or by using a vertical scrollbar. In large graphs, this allows for movement in the neighborhood of the current view or rapid traversal acrossthe entire graph. By using the wheel on a mouse, users can zoom in and out, up to a limit. We employrubber-banding for both the traversal and zoom features to prevent the interface from moving off the screen orbecoming considerably smaller than the viewing window.As mentioned in Section 2, the comparison tool automatically identifies certain edges as having changedstatistically significantly between the before and after graphs. The interfaces highlight these edges with a boldred outline.As drawn in each graph, the length of an edge relates to its latency. Because latencies within a singlerequest-flow graph can vary by orders of magnitude, we do not map latency directly to length; instead, we usea sigmoid-based scaling function that allows both longer and shorter edges to be visible in the same graph.When graphs contain join points, or locations where multiple parallel paths converge at the same node, alayered layout will likely produce edge lengths that no longer correspond to the values given by the originalscaling function. This occurs when one flow completes before another and must wait. Our interfaces illustratethe distinction between actual latency and connecting lines by using thinner lines for the latter. An example ofthis notation appears in the right-hand graph in Figure 4a, between nodes b and c.4User study overview & methodologyWe evaluated the three interfaces via a between-subjects user study, in which we asked participants tocomplete five assignments. Each assignment asked participants to find key performance-affecting differencesfor a before/after request-flow graph pair obtained from Ursa Minor (the distributed system shown in Figure 2).Four of the five assignments used graphs that were the output of request-flow comparison for real problemsobserved in the system. These problems are described in more detail in Sambasivan et al. [34].4.1ParticipantsOur tool’s target users are the developers of the distributed system being diagnosed. Our example taskscome from the Ursa Minor system, so we selected the seven Ursa Minor developers to whom we had accessas expert participants. However, due to the limited size of this pool and because we knew some of theseparticipants personally, we also recruited 13 additional non-expert participants. All of these non-experts weregenerally familiar with distributed systems but not with this system specifically. Although we advertised thestudy in undergraduate and graduate classes, as well as by posting fliers on and around our campus, all ournon-expert participants were graduate students in computer science, electrical and computer engineering, orinformation networking. The user study took about 1.5 hours and we paid participants 20.Potential non-expert participants were required to complete a pre-screening questionnaire that asked aboutkey undergraduate-level distributed systems concepts. To qualify, volunteers were required to indicate that theyunderstood what a request is in the context of a distributed system, along with at least two of five additionalconcepts: client/server architecture, concurrency and synchronization, remote procedure calls, batching, andcritical paths. Of the 33 volunteers who completed the questionnaire, 29 were deemed eligible; we selectedthe first 13 to respond as participants.7

During the user study, each participant was assigned, round-robin, to evaluate one of the three interfaces.Table 1 lists the participants, their demographic information, and the interface they were assigned.4.2Creating before/after graphs for the assignmentsOur user study contains five total assignments, each requiring participants to identify salient differencesbetween a before/after graph pair. To limit the length of the study and explore the space of possible differences,we removed a few repeated differences from real-problem graphs and added differences of different types.However, we were careful to preserve the inherent complexity of the graphs and the problems they represent.The only synthetic before/after pair was modified from a real request-flow graph observed in the system.Table 2 describes the various assignments and their properties.To make the request-flow graphs easier for participants to understand, we changed node labels, which describeevents observed during request processing, to more human-readable versions. For example, we changed thelabel “e10 t3 NFS CACHE READ HIT” to “Read that hit in the NFS server’s cache.” The original labelswere written by Ursa Minor developers and only have meaning to them. Finally, we omitted numbers indicatingedge lengths from the graphs to ensure participants used visual properties of our interfaces to find importantdifferences.4.3User study procedureThe study consisted of four parts: training, guided questions, emulation of real diagnoses, and interfacecomparison. Participants were encouraged to think aloud throughout the MMMM263338374433267MAvg NA13InterfaceSide-by-side (S)SSDiff (D)DAnimation (A)A3S, 2D, 2A(a) Participant demographics for 322236M, 2FAvg 25InterfaceSide-by-side (S)SSSDiff (D)DDDDAnimation (A)AAA4S, 5D, 4A(b) Participant demographics for non-expertsTable 1: Participant demographics. Our user study consisted of twenty participants, seven of whom were experts(developers of Ursa Minor) and thirteen of whom were non-experts (graduate students familiar with distributed systems).Interfaces were assigned round-robin. The ID encodes whether the participant was an expert (E) or non-expert (N) andthe the interface assigned (S side-by-side, D diff, A animation).8

PhaseAssignment Differencesand typeG1/Real4 statistically sig.5 other edge latency2/Real1 structural3/Synth.4 statistically sig.2 other edge latency3 structural94/1284/Real4 structural52/775/Real2 structural82/226EBefore/aftergraph sizes (nodes)122/1223/16Table 2: Information about the before/after graph pairs used for the assignments. Assignments 1–3 were used inthe guided questions phase (labeled G and described in Section 4.3.2); 4 and 5 were used to emulate real diagnoses(labeled E and described in Section 4.3.3). Four of the five assignments were the output of request-flow comparisonfor real problems seen in Ursa Minor. The assignments differed in whether they contained statistically significant edgelatency changes, other edge latency changes not identified automatically, or groups of structural changes. The graphsizes for the various assignments varied greatly.4.3.1TrainingIn the training phase, participants were shown the Ursa Minor diagram similar to the one in Figure 2. Theywere not required to understand details about the system, but only that it consists of four components that cancommunicate with each other over the network. We also presented them with a sample request-flow graphand described the meaning of nodes and edges. Finally, we trained each participant on her assigned interfaceby showing her a sample before/after graph pair and guiding her through tasks she would have to complete inlatter parts of the study. Participants were given ample time to ask questions and were told that we would beunable to answer further questions after the training portion.4.3.2Finding differences via guided questionsIn this phase of the study, we guided participants through the process of identifying differences, asking themto complete five focused tasks for each of three assignments. Rows 1–3 of Table 2 describe the graphs usedfor this part of the study.TASK 1: Find any edges with statistically significant latency changes. This task required participants to find allof the graph edges highlighted in red (that is, those automatically identified by the request-flow comparisontool as having statistically significant changes in latency distribution).TASK 2: Find any other edges with latency changes worth investigating. The request-flow comparison tool willnot identify all edges worth investigating. For example, edges with large changes in average latency that alsoexhibit high variance will not be identified. This task required participants to iterate through the graphs and findedges with notable latency changes that were not highlighted in red.TASK 3: Find any groups of structural changes. Participants were asked to identify added or deleted nodes inthe after graph. To reduce effort, we asked them to identify these changes in contiguous groups, rather thanby noting every changed node individually.TASK 4: Describe in a sentence or two what the changes you identified in the previous tasks represent. This9

task examines whether the interface enables participants to quickly develop an intuition about the problem inquestion. For example, all of the edge latency changes (statistically significant and otherwise) for the graphspresented in assignment 1 indicate a slowdown in network communication between machines and in writeactivity within one of Ursa Minor’s storage nodes. It is important that participants be able to identify thesethemes, as doing so is a crucial step toward understanding the root cause of the problem.TASK 5: Of the changes you identified in the previous tasks, identify which one most impac

Figure 1: Comparing request-flow graphs. This side-by-side visualization, one of three interfaces we evaluate, illustrates the output of a diagnosis techni