Exploring Complex, Dynamic Graph Data - Part 3

Since my last post, I’ve been reminded once again of the challenges presented by dynamic graph data. Earlier I wrote about how I’d hoped to exploit graph rewriting operations in Gremlin to efficiently tease out certain classes of communication event sequences in the Enron communication graph. Unfortunately, the queries to produce those event sequences are nontrivial in Gremlin, even with the abstraction provided through user-defined steps. At present, I think this is due to a mismatch between the representation I’ve chosen and the capabilities offered by Gremlin. There’s no doubt that Gremlin simplifies many operations one wants to perform on multi-node-type, multi-edge-type graphs. Time simply introduces additional wrinkles, because it imposes an ordering that must be respected to obtain valid traversals through the graph structure. The tests needed to ensure properly ordered traversals turn out to be the source of the complexity.
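
To make the ordering constraint concrete, here is a minimal sketch of a time-respecting traversal in Python with NetworkX (a stand-in for the Gremlin I actually wrote), assuming each edge carries a Unix timestamp attribute named t:

```python
import networkx as nx

def time_respecting_successors(g, node, t_min):
    """Yield (neighbor, timestamp) pairs reachable from node along
    edges stamped no earlier than t_min."""
    for _, nbr, data in g.out_edges(node, data=True):
        if data["t"] >= t_min:
            yield nbr, data["t"]

def time_respecting_paths(g, source, t_start, max_hops=3):
    """Enumerate simple paths whose edge timestamps never decrease."""
    stack = [(source, t_start, [source])]
    while stack:
        node, t, path = stack.pop()
        if len(path) - 1 >= max_hops:
            continue
        for nbr, t_edge in time_respecting_successors(g, node, t):
            if nbr in path:  # keep paths simple
                continue
            yield path + [nbr]
            stack.append((nbr, t_edge, path + [nbr]))

# a -> b at t=1 followed by b -> c at t=2 is a valid event sequence;
# an edge b -> c stamped t=0 would be silently excluded.
g = nx.MultiDiGraph()
g.add_edge("a", "b", t=1)
g.add_edge("b", "c", t=2)
print(list(time_respecting_paths(g, "a", t_start=0)))
# [['a', 'b'], ['a', 'b', 'c']]
```

Every hop re-checks a timestamp, and it is exactly that per-hop bookkeeping that makes the equivalent Gremlin queries balloon.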

To date, I’ve had a number of conversations with colleagues about exploring dynamic graph data. I have yet to uncover a mechanism by which one can explore the complexities of this type of data with relative ease. I suppose I should not be surprised, yet I am to some degree, given the volumes of data being produced. It seems we are not yet in a position to uncover the more complex dynamic patterns we expect to lurk in these datasets without serious effort and luck.

Lacking general approaches for efficient exploratory data analysis in this context, we need to be able to realize domain-specific analytics efficiently enough to test our hypotheses. That requires capabilities at every layer of the process: persistence, query, analysis, and visualization.

My experimentation has centered on composing different technology stacks to support this process. The first stack I experimented with was Neo4j + Gremlin + Python + Gephi. Neo4j was a natural mechanism for representing and persisting the Enron data along with the social metadata. Python was an obvious choice for conditioning the Enron data and populating Neo4j. Gremlin offered me the capability to subset and transform the Enron data graph easily and to export those results in GraphML form. Gephi allowed me to visualize the results easily and perform further operations on the data to enhance the signals I wanted to see.
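
As a rough illustration of the conditioning step, here is a sketch that parses raw messages with Python’s email module and accumulates timestamped sender-to-recipient edges. The corpus path and attribute names are assumptions, and I use NetworkX with a GraphML export as a lightweight stand-in for the Neo4j loading:

```python
import email
import email.utils
import os

import networkx as nx

g = nx.MultiDiGraph()
root = "maildir"  # assumed path to the raw Enron corpus

for dirpath, _, filenames in os.walk(root):
    for name in filenames:
        with open(os.path.join(dirpath, name), errors="replace") as f:
            msg = email.message_from_file(f)
        sender = email.utils.parseaddr(msg.get("From", ""))[1]
        parsed = email.utils.parsedate_tz(msg.get("Date", ""))
        if not sender or parsed is None:
            continue  # skip malformed headers rather than guess
        ts = email.utils.mktime_tz(parsed)  # seconds since the epoch
        # One timestamped edge per (sender, recipient) pair so that
        # downstream traversals can respect message ordering.
        for field in ("To", "Cc"):
            for _, addr in email.utils.getaddresses([msg.get(field, "")]):
                if addr:
                    g.add_edge(sender, addr, t=ts, field=field)

# GraphML preserves the attributes, so Gephi and other tools can use them.
nx.write_graphml(g, "enron.graphml")
```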

If one wants to move beyond a singular focus on a particular dimension of the data, such as graph structure, it’s imperative to explore other options for visualization. I find that a number of the visual forms I want to see require specialized visualizations. Ideally, I want different projections of the data available to me simultaneously in linked, interactive visualizations. For someone skilled in JavaScript, such visualizations are no longer so burdensome to create. Protovis has gone a long way toward minimizing that burden. D3, Mike Bostock’s latest creation, looks poised to build on Protovis’ success and go even further. Even if you do not envision yourself doing serious infovis development, I think it is worthwhile to pick up some Protovis skills. I find it useful for realizing more complex static visualizations in a browser, which gives me a better view of multiple dimensions at once.

Below is a snapshot of a communication ego network browser I put together one afternoon to let me visualize traffic patterns between the ego and alters. It is essentially a series of stacked bullet charts representing email traffic to and from the ego. The colored bars show total email counts. The mid gray bars show the number of emails with the recipient in the To field. The dark gray bars show the number of emails in threads. The number of relationships in this ego network far exceeds what is displayed. Since it is in a web browser, I can quickly scroll and scan the data to get a feel for the patterns.
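
The aggregation behind those bars is straightforward. Here is a sketch in plain Python, assuming each message has been conditioned into a record with from, to, cc, and thread fields (the record shape is an assumption):

```python
from collections import defaultdict

def ego_traffic(messages, ego):
    """Per-alter counts behind the stacked bullet charts: total emails
    exchanged, emails with the recipient in the To field, and emails
    belonging to a thread."""
    counts = defaultdict(lambda: {"total": 0, "in_to": 0, "threaded": 0})
    for m in messages:
        outgoing = m["from"] == ego
        if outgoing:
            alters = set(m["to"]) | set(m["cc"])
        elif ego in m["to"] or ego in m["cc"]:
            alters = {m["from"]}
        else:
            continue  # message does not touch this ego network
        for alter in alters:
            c = counts[alter]
            c["total"] += 1
            # "Recipient in the To field" means the alter for outgoing
            # mail and the ego for incoming mail.
            if (alter if outgoing else ego) in m["to"]:
                c["in_to"] += 1
            if m["thread"] is not None:
                c["threaded"] += 1
    return dict(counts)

msgs = [
    {"from": "ego@enron.com", "to": ["a@enron.com"], "cc": [], "thread": 7},
    {"from": "a@enron.com", "to": ["ego@enron.com"], "cc": [], "thread": None},
]
print(ego_traffic(msgs, "ego@enron.com"))
# {'a@enron.com': {'total': 2, 'in_to': 2, 'threaded': 1}}
```

Sorting the result by total traffic puts the heaviest relationships at the top, which is what makes the scroll-and-scan workflow effective.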

To support all of my analytical needs, there was little question that Python was the best choice for me. NumPy/SciPy/Matplotlib and NetworkX give Python a natural advantage on their own. With NLTK for natural language processing and a host of available machine learning and optimization packages, the scale tips even further.
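
As a small example of that leverage, a handful of lines takes the conditioned graph from structure to a statistical summary without leaving the language (the GraphML file name matches the earlier sketch and is an assumption):

```python
import matplotlib.pyplot as plt
import networkx as nx

# Load the conditioned communication graph exported earlier;
# "enron.graphml" is an assumed file name.
g = nx.read_graphml("enron.graphml")

# NetworkX supplies the structure, Matplotlib the statistical view.
degrees = [d for _, d in g.degree()]
plt.hist(degrees, bins=50, log=True)
plt.xlabel("degree")
plt.ylabel("count (log scale)")
plt.title("Degree distribution of the communication graph")
plt.show()
```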

If you are a (J)Rubyist who appreciates the power of Gremlin, keep an eye on a project called Pacer. Pacer brings Gremlin to JRuby, thus expanding development options. The TinkerPop crew has been busy and continues to develop new capabilities.

I’ll continue to experiment with different compositions as time permits and needs dictate, and I’ll share those discoveries here.

[Parts 1 and 2 of the series]