Summary
Python allows every sysadmin to run (and learn) basic statistics on system data, replacing sed, awk, bc and gnuplot with a single, reusable and interactive framework. The talk is a case study where Python allowed us to highlight network performance issues in minutes using itertools, scipy and matplotlib. The presentation includes code snippets and a brief discussion of the plots.
Description
Agenda
- A latency issue
- Data distribution
- 30-second correlation with pearsonr
- Combining data
- Plotting and the power of color
A use case
- Network latency issues
- Correlate latency with other events
First statistics
We created our own parsing library
With the data in a dict like:
> table = {
>     'time': [1, 2, 3, ..],
>     'elapsed': [0.12, 12.43, ..],
>     'error': [2, 0, ..],
>     'size': [123, 3223, ..],
>     'peers': [2313, 2303, ..],
> }
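The talk does not show the parsing library itself. As a minimal sketch of how such a dict of columns could be built, assume a hypothetical log with one whitespace-separated record per line; the log format, the third sample row, and the names `SAMPLE_LOG`, `FIELDS`, `CASTS` and `parse` are all illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical log: time, elapsed seconds, errors, size, peers.
# The third row is made up to keep the example short but non-trivial.
SAMPLE_LOG = """\
1 0.12 2 123 2313
2 12.43 0 3223 2303
3 0.50 1 512 2310
"""

FIELDS = ("time", "elapsed", "error", "size", "peers")
# Assumed column types; a real parser depends on the actual log format.
CASTS = (int, float, int, int, int)


def parse(text):
    """Turn whitespace-separated lines into a dict of equal-length columns."""
    columns = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        for name, cast, raw in zip(FIELDS, CASTS, line.split()):
            columns[name].append(cast(raw))
    return dict(columns)


table = parse(SAMPLE_LOG)
print(table["elapsed"])
```

Storing one list per field, rather than one record per row, means each column can be passed directly to `max`, `pyplot.hist` or `pearsonr`.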
It's easy to get max, min, mean and standard deviation
> from statistics import mean, stdev
> for k, v in table.items():
>     print(k, max(v), min(v), mean(v), stdev(v))
Distribution
A distribution shows event frequency
> from matplotlib import pyplot
> pyplot.hist(table['elapsed'])
Time and Size distributions
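A self-contained sketch of the histogram step, using synthetic latencies because the original data is not part of this abstract. The exponential shape, the bin count and the output file name are assumptions chosen only for illustration.

```python
import random

import matplotlib
matplotlib.use("Agg")  # headless backend: render to a file, no display needed
from matplotlib import pyplot

random.seed(0)
# Synthetic latencies in seconds: mostly fast requests with a slow tail.
elapsed = [random.expovariate(2.0) for _ in range(1000)]

# hist() returns the per-bin counts, the bin edges and the drawn patches.
counts, edges, _ = pyplot.hist(elapsed, bins=30)
pyplot.xlabel("elapsed (s)")
pyplot.ylabel("number of requests")
pyplot.title("Elapsed time distribution")
pyplot.savefig("elapsed_hist.png")
pyplot.close()
```

The returned `counts` and `edges` can be inspected directly when the numbers behind the plot are needed.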
(Linear) Correlation
What's correlation
What's not correlation
pearsonr and probability
the catch with linear correlation
> from random import randint
> from scipy.stats import pearsonr
> a, b = range(0, 10), range(0, 20, 2)
> c = [randint(0, 10) for x in a]
> pearsonr(a, b), pearsonr(a, c)
> # e.g. (1.0, 0.0), (0.43, 0.2); c is random, so the second pair changes on every run
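To make the catch concrete, here is a small sketch with made-up data: a parabola over a symmetric range depends entirely on x, yet its Pearson coefficient is zero because the relation is not linear.

```python
from scipy.stats import pearsonr

# Made-up data: x is symmetric around zero.
x = list(range(-10, 11))
linear = [2 * v + 1 for v in x]   # perfectly linear in x
parabola = [v * v for v in x]     # fully determined by x, but not linear

r_linear, p_linear = pearsonr(x, linear)
r_parabola, p_parabola = pearsonr(x, parabola)

print("linear:   r=%.2f" % r_linear)
print("parabola: r=%.2f" % r_parabola)
```

A coefficient near zero therefore says only that no linear relation was found, not that the two fields are unrelated.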
Combinations
using itertools.combinations
net-fishing for correlations
> from itertools import combinations
> for f1, f2 in combinations(table, 2):
>     r, p_value = pearsonr(table[f1], table[f2])
>     print("the correlation between %s and %s is: %s" % (f1, f2, r))
>     print("the p-value (see manual) is: %s" % p_value)
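With more than a handful of fields, the printed output gets long. One possible refinement, sketched here on a made-up table, is to collect the results and sort them by the absolute value of r so that the strongest linear relations come first.

```python
from itertools import combinations

from scipy.stats import pearsonr

# Made-up values: size grows roughly with elapsed, error does not.
table = {
    "elapsed": [0.1, 0.2, 0.4, 0.8, 1.6, 3.2],
    "size": [100, 210, 390, 820, 1580, 3250],
    "error": [2, 0, 1, 0, 2, 1],
}

results = []
for f1, f2 in combinations(table, 2):
    r, p_value = pearsonr(table[f1], table[f2])
    results.append((f1, f2, r, p_value))

# Strongest linear correlation first, regardless of sign.
results.sort(key=lambda row: abs(row[2]), reverse=True)

for f1, f2, r, p_value in results:
    print(f"{f1} vs {f2}: r={r:+.2f} p={p_value:.3f}")
```

When fishing across many pairs, some will look correlated by chance alone, so the top of the list is a place to start plotting rather than a conclusion.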
Plot always
pearsonr finds only linear correlation
our eyes work better :P
so...plot always!
color is the 3rd dimension of a plot!
> from matplotlib.pyplot import scatter, title, xlabel, ylabel, legend
> from matplotlib.pyplot import savefig, close as closefig
>
> for f1, f2 in combinations(table, 2):
>     scatter(table[f1], table[f2], label="%s_%s" % (f1, f2))
>     # add legend and other labels
>     r, p = pearsonr(table[f1], table[f2])
>     title("Correlation: %s v %s, %s" % (f1, f2, r))
>     xlabel(f1)
>     ylabel(f2)
>     legend(loc='upper left')  # show the legend in a suitable corner
>     savefig(f1 + "_" + f2 + ".png")
>     closefig()
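One way to use color as a third dimension is the `c` argument of `scatter`, which maps a third field onto a colormap. The sketch below uses synthetic data; the relation between the fields, the colormap and the file name are assumptions for illustration only.

```python
import random

import matplotlib
matplotlib.use("Agg")  # headless backend: render to a file, no display needed
from matplotlib import pyplot

random.seed(1)
# Synthetic data: elapsed grows with size plus noise, peers is independent.
size = [random.randint(100, 5000) for _ in range(200)]
peers = [random.randint(2000, 2400) for _ in range(200)]
elapsed = [s / 1000 + random.random() for s in size]

# x and y carry two fields; the point color carries the third.
points = pyplot.scatter(size, elapsed, c=peers, cmap="viridis")
pyplot.colorbar(points, label="peers")
pyplot.xlabel("size")
pyplot.ylabel("elapsed (s)")
pyplot.title("elapsed vs size, colored by peers")
pyplot.savefig("size_elapsed_peers.png")
pyplot.close()
```

If the colors form bands or clusters, the third field is worth a closer look; a uniform mix suggests it has little to do with the other two.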
Wrap Up!
- do not use pearsonr alone to rule out a relation between events
- plots may serve better
- a scatter plot can show a system's throughput and rule out correlation between field A and field B
- continue collecting results