Web Usage Patterns

This text is adapted and updated from the full report (PDF) by Brian Shih.

Introduction

Most website traffic analysis systems consider the number of “page hits” or the access count for each page in the site. However, reporting these simple numbers eliminates the context of their place in the overall site visit. Other traffic analysis software can report on paths that individual users take, but do little to summarize this data into a meaningful and useful form. Our goal is to track user visits to a website, which are ordered sequences of page views, and then to extract trends, outliers, and other useful summary information from this data. Armed with this kind of rich statistical information, we can then apply it in new and interesting ways.

Development of our algorithm

We chose to represent our paths as strings of symbols where each symbol represents a page on the website. Given two different paths, we can use string similarity algorithms to find the “distance” between the two visits.
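As a minimal sketch of this encoding (the page names and the `encode_visits` helper are our own illustration, not part of the original tool), each distinct page can be assigned a single-character symbol, turning every visit into a plain string:

```python
def encode_visits(visits):
    # Assign each distinct page a single-character symbol so that a
    # visit (an ordered sequence of page views) becomes a plain string,
    # ready for string-distance comparison. Toy scheme: supports up to
    # 26 distinct pages.
    symbols = {}
    encoded = []
    for visit in visits:
        chars = []
        for page in visit:
            if page not in symbols:
                # Map the next unseen page to the next available letter.
                symbols[page] = chr(ord("A") + len(symbols))
            chars.append(symbols[page])
        encoded.append("".join(chars))
    return encoded, symbols

visits = [
    ["/index", "/about", "/contact"],
    ["/index", "/blog", "/about"],
]
strings, table = encode_visits(visits)
# strings == ["ABC", "ADB"]
```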

A widely used string comparison metric is the Levenshtein distance, or edit distance. The Levenshtein distance (LD) calculates the cost of changing one string into another, and is frequently used in applications such as spell checking, speech recognition, DNA analysis, and plagiarism detection.
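For reference, the classic dynamic-programming formulation of the Levenshtein distance looks like this (the function name is illustrative):

```python
def levenshtein(a, b):
    # dp[i][j] = cost of turning a[:i] into b[:j], where every
    # insertion, deletion, and substitution costs 1.
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all of a[:i]
    for j in range(n + 1):
        dp[0][j] = j          # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution
    return dp[m][n]

# levenshtein("kitten", "sitting") == 3
```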

Unfortunately, while paths in a graph can be represented as strings, there is appreciably more contextual meaning associated with the path than a simple string. This means that a perfectly acceptable algorithm for comparing the similarity of two strings might fail to fully capture important differences between two paths. We decided that this was the case with Levenshtein, but that with a few key changes we could make an algorithm that would better suit our needs. The changes are described in full in the complete report.

The Python code for our modified Levenshtein distance algorithm (MQS, for McBride-Quimby-Shih distance) is shown below.

def mqs_distance(input1, input2):
    # Given two paths as strings of symbols, input1 and input2.
    # Symbols that appear in both paths are cheaper to insert or
    # delete (cost 1) than symbols unique to one path (cost 2).
    common = set(input1) & set(input2)
    m, n = len(input1), len(input2)

    # (m+1) x (n+1) cost matrix; M[i][j] is the cost of turning the
    # first i symbols of input1 into the first j symbols of input2.
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        M[i][0] = M[i - 1][0] + (1 if input1[i - 1] in common else 2)
    for j in range(1, n + 1):
        M[0][j] = M[0][j - 1] + (1 if input2[j - 1] in common else 2)

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            del_cost = 1 if input1[i - 1] in common else 2
            ins_cost = 1 if input2[j - 1] in common else 2
            sub_cost = 0 if input1[i - 1] == input2[j - 1] else max(ins_cost, del_cost)
            M[i][j] = min(M[i - 1][j] + del_cost,
                          M[i][j - 1] + ins_cost,
                          M[i - 1][j - 1] + sub_cost)
    return M[m][n]

Visualization of results

Taking our algorithm, we wrote a Python program that parsed Apache log files, calculated the MQS distance between every pair of visits to the website, and graphed the results.
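A sketch of the kind of session extraction involved (the regular expression targets the default Apache common log format; grouping all requests from one host into a single visit is a simplifying heuristic of ours, not necessarily what the released scripts do):

```python
import re
from collections import defaultdict

# Default Apache common log format:
# host ident user [time] "request" status bytes
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST) (\S+) [^"]*" \d+ \S+')

def visits_from_log(lines):
    # Crudely treat all requests from one host as a single visit.
    paths = defaultdict(list)
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            host, page = match.groups()
            paths[host].append(page)
    return dict(paths)

sample = [
    '10.0.0.1 - - [01/Jan/2005:10:00:00 -0500] "GET /index HTTP/1.0" 200 512',
    '10.0.0.1 - - [01/Jan/2005:10:00:05 -0500] "GET /about HTTP/1.0" 200 256',
]
# visits_from_log(sample) == {"10.0.0.1": ["/index", "/about"]}
```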

Warning: the full-size images are pretty hefty. Consider yourself warned.

Original graph

Here is the original graph, built from the complete data harvested from the server logs of seanmcb.com. Each node is a single visit, or trip through the website; nodes are connected by lines indicating their MQS distance and spaced accordingly.

Obviously, this is very cluttered and hard to read, so we’ve applied some clustering to the graph.

Collapsed graph

Here, nodes that are MQS distance of zero apart are clustered into larger nodes where the number represents how many visits are in that cluster.
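Since the MQS distance between two visits is zero exactly when their symbol strings are identical (every insertion, deletion, or mismatched substitution has a positive cost), this collapse step amounts to counting duplicate paths. A sketch:

```python
from collections import Counter

def collapse_identical(visit_strings):
    # Visits at MQS distance zero have identical symbol strings, so
    # each cluster is simply a distinct path with its frequency.
    return Counter(visit_strings)

clusters = collapse_identical(["AB", "ABC", "AB", "AB", "ABC"])
# clusters == Counter({"AB": 3, "ABC": 2})
```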

Further graph minimizations allow us to cut down on the copious edges.

Minimum spanning tree

Here, we’ve constructed a minimum spanning tree of the graph, so only the lowest-MQS-distance edges are shown. This makes sense because we are interested in how similar visits are to each other. However, the graphing mode seems to place the nodes incorrectly in space (e.g. a node at the top of the tree could have a low MQS distance from a node at the bottom). Also note that the differences in the cluster numbers result from using two different subsets of the sample data: since processing time was a concern, we chose to operate on less data for certain graphs.
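The spanning tree itself can be built with a textbook algorithm. Here is a sketch using Prim's method on a dense, symmetric distance matrix (the distances below are made-up numbers, not measurements from our data):

```python
def minimum_spanning_tree(dist):
    # Prim's algorithm on a dense symmetric distance matrix.
    # Returns the tree as a list of (u, v) edges, grown from node 0.
    n = len(dist)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # Cheapest edge connecting the tree to a node outside it.
        u, v = min(((u, v) for u in in_tree
                    for v in range(n) if v not in in_tree),
                   key=lambda e: dist[e[0]][e[1]])
        edges.append((u, v))
        in_tree.add(v)
    return edges

# Pairwise MQS distances between four visit clusters (invented).
dist = [[0, 1, 4, 3],
        [1, 0, 2, 5],
        [4, 2, 0, 6],
        [3, 5, 6, 0]]
# minimum_spanning_tree(dist) keeps only the 3 cheapest connecting edges.
```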

Conclusion

We hope to improve our output in the future; we would like to eventually provide users with a report that itemizes the most common usage patterns, with descriptions of which page sequences they represent and how frequently they occur. Ideally, this data would be displayed in both an interactive visual format and as a text dump. At the moment, our output is primarily visual, though not interactive, and we do not yet link the displayed nodes with a ready description of the paths they represent; we simply haven’t had the time to implement anything better.

Even with small datasets on small webpages, we can already see useful patterns emerging. Our current results would be extremely useful in a reorganization of the sites from which the data was gathered, suggesting that our algorithm fulfills at least one of its proposed goals. Success in meeting the other main goal, predictive navigation, is hard to judge, because our data came from personal sites with very short average paths, where predictive navigation is significantly less useful. It seems fair to conclude that since we are successfully visually categorizing visits, once we have a concrete clustering algorithm we could implement something that would match a visit to stored archives of common visits and propose possible destinations based on similarity to the stored paths.
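As a rough illustration of that matching step, the sketch below ranks stored common paths against a partial visit and proposes the next page from the best match. It uses Python's difflib similarity ratio as a stand-in for MQS distance, and the stored paths are invented:

```python
from difflib import SequenceMatcher

def suggest_next_page(partial_visit, common_paths):
    # Rank the stored common paths by similarity to the visit so far
    # (difflib's ratio here stands in for an MQS-based measure), then
    # propose the page the best-matching path continues with.
    def score(path):
        return SequenceMatcher(None, partial_visit, path).ratio()
    best = max(common_paths, key=score)
    # If the best match extends the current visit, suggest its next symbol.
    if best.startswith(partial_visit) and len(best) > len(partial_visit):
        return best[len(partial_visit)]
    return None

common_paths = ["ABC", "ABD", "AXY"]
suggestion = suggest_next_page("AB", common_paths)
# suggestion == "C"
```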

Download

Download the Python code and try it for yourself. usagetrend.py takes care of the MQS calculation and graph minimizations, while usagetrend_graph.py handles the graphing. Note that it only works on default-format Apache server logs.

Or download the original report. (PDF)

Sean McBride
Engineer / Web Developer
Boston, MA - Billings, MT

"I will follow you into the dark."

Subscribe to My RSS Feed