
## Statistics 101 for System Administrators

### Summary

Python lets every sysadmin run (and learn) basic statistics on system data, replacing sed, awk, bc and gnuplot with a single, reusable and interactive framework. The talk is a case study in which Python let us highlight some network performance points in minutes using itertools, scipy and matplotlib. The presentation includes code snippets and a brief discussion of the plots.

### Description

#### Agenda

• A latency issue
• Data distribution
• 30-second correlation with pearsonr
• Combining data
• Plotting and the power of color

#### A use case

• Network latency issues
• Correlate latency with other events

#### First statistics

• we created our parsing library

• using various recipes

• Having the data in a dict like

```
> table = {
>   'time': [ 1,2,3, ..],
>   'elapsed': [ 0.12, 12.43, ..],
>   'error': [ 2, 0, ..],
>   'size': [123,3223, ..],
>   'peers': [2313, 2303, ..],
> }
```
• It's easy to get max, min, mean and standard deviation

```
> import statistics as stats
> print([(k, max(v), min(v), stats.mean(v), stats.stdev(v)) for k, v in table.items()])
```
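
As a minimal sketch of the two steps above, the snippet below turns a few log lines into a column-oriented dict and summarizes each column. The log format, the sample values and the helper names `parse_lines` and `summarize` are assumptions made for illustration; the parsing library used in the talk is not shown here.

```python
import statistics

# Hypothetical log lines: time, elapsed, error, size, peers (whitespace-separated).
SAMPLE_LOG = """\
1 0.12 2 123 2313
2 12.43 0 3223 2303
3 0.50 1 512 2310
4 3.20 0 2048 2299
"""

FIELDS = ("time", "elapsed", "error", "size", "peers")


def parse_lines(text, fields=FIELDS):
    """Return a dict mapping each field name to the list of its values."""
    table = {name: [] for name in fields}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        for name, raw in zip(fields, line.split()):
            table[name].append(float(raw))
    return table


def summarize(table):
    """Return {field: (max, min, mean, sample standard deviation)}."""
    return {
        name: (max(v), min(v), statistics.mean(v), statistics.stdev(v))
        for name, v in table.items()
    }


table = parse_lines(SAMPLE_LOG)
for name, (hi, lo, avg, sd) in summarize(table).items():
    print("%-8s max=%.2f min=%.2f mean=%.2f stdev=%.2f" % (name, hi, lo, avg, sd))
```

Keeping one list per field, rather than one record per log line, means every column can be passed straight to `max`, `min` or a plotting function.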

#### Distribution

• A distribution shows event frequency

```
> from matplotlib import pyplot
> pyplot.hist(table['elapsed'])
```
• Time and Size distributions
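
To make "event frequency" concrete, here is a small sketch that counts how many values fall into equal-width bins, similar in spirit to the counting a histogram plot performs before drawing the bars. The `histogram` helper and the sample values are illustrative assumptions, not code from the talk.

```python
def histogram(values, bins=5):
    """Count values in `bins` equal-width intervals between min and max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid a zero width when all values are equal
    counts = [0] * bins
    for v in values:
        # The maximum value goes into the last bin instead of opening a new one.
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return counts, edges


# Hypothetical response times in seconds: mostly fast, with a slow tail.
elapsed = [0.1, 0.2, 0.2, 0.3, 0.3, 0.3, 0.4, 0.5, 2.0, 5.0]
counts, edges = histogram(elapsed, bins=5)
for count, left, right in zip(counts, edges, edges[1:]):
    print("%5.2f - %5.2f | %s" % (left, right, "#" * count))
```

With data like this, the text histogram makes the slow tail visible at a glance, which a single mean or maximum would hide.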

#### (Linear) Correlation

• What's correlation

• What's not correlation

• pearsonr and probability

• the catch with linear correlation

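```
> from random import randint
> from scipy.stats import pearsonr
> a, b = range(0,10), range(0,20, 2)
> c = [randint(0,10) for x in a]
> pearsonr(a, b), pearsonr(a,c)
> # e.g. (1.0, 0.0), (0.43, 0.2): each pair is (r, p-value); the second pair varies with the random data
```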
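
The catch with linear correlation can be illustrated without any library. The sketch below computes Pearson's r directly from its definition (the covariance divided by the product of the standard deviations); the `pearson_r` helper is written here for illustration only, and `scipy.stats.pearsonr` additionally returns a p-value. A perfectly linear pair scores 1, while a pair that is fully determined but not linear (y = x²) scores 0.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's linear correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


a = list(range(0, 10))
b = list(range(0, 20, 2))      # b = 2 * a: perfectly linear
print(pearson_r(a, b))         # close to 1.0

x = list(range(-5, 6))
y = [v ** 2 for v in x]        # y depends entirely on x, but not linearly
print(pearson_r(x, y))         # no linear relation, so r is zero
```

A value of r near zero therefore says only that no straight line fits the data, not that the two fields are unrelated.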

#### Combinations

• using itertools.combinations

• net-fishing for correlations

```
> from itertools import combinations
> for f1, f2 in combinations(table, 2):
>     r, p_value = pearsonr(table[f1], table[f2])
>     print("the correlation between %s and %s is: %s" % (f1, f2, r))
>     print("the probability of a given distribution (see manual) is: %s" % p_value)
```
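
When scanning every pair of fields like this, it helps to sort the pairs by the strength of the correlation and to keep in mind how many pairs were tested: with n fields there are n(n-1)/2 of them, and the more pairs are tried, the more likely it is that one looks correlated by chance. A minimal sketch follows, assuming hypothetical sample data and a small `pearson_r` helper standing in for the r value of `scipy.stats.pearsonr`; `rank_pairs` is likewise an illustrative name.

```python
from itertools import combinations
from math import sqrt


def pearson_r(xs, ys):
    """Plain-Python stand-in for the r value returned by scipy.stats.pearsonr."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))


def rank_pairs(table):
    """Return (field1, field2, r) for every pair of fields, strongest |r| first."""
    results = [
        (f1, f2, pearson_r(table[f1], table[f2]))
        for f1, f2 in combinations(table, 2)
    ]
    return sorted(results, key=lambda item: abs(item[2]), reverse=True)


# Hypothetical measurements, one list per field.
table = {
    "elapsed": [0.1, 0.4, 0.9, 1.6, 2.5],
    "size": [10, 40, 90, 160, 250],   # proportional to elapsed
    "error": [2, 0, 1, 0, 2],
    "peers": [5, 3, 4, 1, 2],
}

ranked = rank_pairs(table)
print("%d pairs tested" % len(ranked))
for f1, f2, r in ranked:
    print("%-8s %-8s r=%+.2f" % (f1, f2, r))
```

Sorting by the absolute value keeps strong negative correlations near the top, next to the strong positive ones.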

#### Plot always

• pearsonr finds only linear correlation

• our eyes work better :P

• so...plot always!

• color is the 3rd dimension of a plot!

```
> from matplotlib.pyplot import scatter, title, xlabel, ylabel, legend
> from matplotlib.pyplot import savefig, close as closefig
>
> for f1, f2 in combinations(table, 2):
>    scatter(table[f1], table[f2], label="%s_%s" % (f1,f2))
>    # add legend and other labels
>    r, p = pearsonr(table[f1], table[f2])
>    title("Correlation: %s v %s, %s" % (f1, f2, r))
>    xlabel(f1), ylabel(f2)
>    legend(loc='upper left') # show the legend in a suitable corner
>    savefig(f1 + "_" + f2 + ".png")
>    closefig()  # start the next pair on a fresh figure
```
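
One way to use color as a third dimension is to let a third field drive the color of each point. The sketch below is an illustration under assumptions: the sample data, the `scatter_with_color` helper and the output file name are all hypothetical, and it relies on the `c` and `cmap` arguments of matplotlib's `scatter`.

```python
import matplotlib
matplotlib.use("Agg")  # render to files; no display is needed on a server
from matplotlib import pyplot


def scatter_with_color(table, x, y, hue, filename):
    """Scatter table[x] against table[y], coloring each point by table[hue]."""
    points = pyplot.scatter(table[x], table[y], c=table[hue], cmap="viridis")
    pyplot.colorbar(points, label=hue)  # legend for the color scale
    pyplot.xlabel(x)
    pyplot.ylabel(y)
    pyplot.title("%s v %s, colored by %s" % (x, y, hue))
    pyplot.savefig(filename)
    pyplot.close()  # free the figure before drawing the next one


# Hypothetical measurements, one list per field.
table = {
    "size": [100, 200, 400, 800, 1600, 3200],
    "elapsed": [0.1, 0.2, 0.5, 0.9, 4.0, 9.5],
    "peers": [10, 12, 11, 13, 80, 95],
}
scatter_with_color(table, "size", "elapsed", "peers", "size_elapsed_by_peers.png")
```

In a plot like this, a cluster of points sharing one color hints that the third field, rather than the two on the axes, explains the pattern.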

#### Wrap Up!

• do not use pearsonr to rule out a relation between events
• plots may serve better
• a scatter plot can show a system's throughput and rule out correlation between fields A and fields B
• continue collecting results