# Differences

This shows you the differences between two versions of the page.

random:a_curious_curve [2013/12/16 04:53] (current)
grant created
Line 1: Line 1:
+======A Curious Curve======
+Last week I created a function to evaluate the range of values around the average for a set of data at percentiles one through a hundred.
+
+<code python>
+def fracrange(values):
+    """
+    Useful for calculating the answer to the question:
+    What's the range around the average if I look
+    at only XX% of the data?
+
+    Results is a list of tuples:
+    0. Percent of data considered.
+    1. Average value.
+    2. Average of min/max in range of data considered.
+    3. Min in range of data considered.
+    4. Max in range of data considered.
+    5. Distance from min to average.
+    6. Distance from max to average.
+
+    Description of algorithm:
+    1. Calculate average.
+    2. Calculate distance from average.
+    3. Sort by distance from average.
+       100% of the time and range, max/min of values
+    5. Throw out 1% most distant data points.
+        99% of the time and range, max/min of values
+    6. Repeat 5 until 1% of the time.
+    """
+
+    total = float(sum(values))
+    average = total / len(values)
+
+    distances = [(abs(average - value), value) for value in values]
+    distances.sort(key=lambda tup: tup)
+
+    fraction = 0.01 * len(values)
+
+    results = []
+
+    for percent in reversed(xrange(1, 101)):
+        while len(distances) > (fraction * percent):
+            distances.pop()
+        if len(distances) == 0:
+            continue
+        max_val = max(tup for tup in distances)
+        min_val = min(tup for tup in distances)
+        max_dist = abs(average - max_val)
+        min_dist = abs(average - min_val)
+        results.append((percent, average, (max_val + min_val) / 2.0,
+                        min_val, max_val, min_dist, max_dist))
+
+    return results
+</code>
+
+Once I had that function working, I spun up IPython and tested it on some simple data. Below, values, are simply val * rand() where val iterates from 1 to 1,000 and rand() produces a number between 0 and 1.
+
+<code>
+values = [val * random.random() for val in xrange(1, 1001)]
+xvals = list(xrange(1, 1001))
+plot(xvals, values)
+results = fracrange(values)
+errorbar([tup for tup in results], [tup for tup in results], yerr=[[tup for tup in results], [tup for tup in results]])
+</code>
+
+Here's the plot for values:
+
+{{val_times_rand_1_1000.png}}
+
+And here's the plot for showing the ranges produced by fracrange:
+
+{{errorbar_percentile_vs_range.png}}
+
+I thought this graph was quite curious when I first saw it. The average is the straight line you see around 250. The x-axis is the percentile of data considered. So you'd read the midpoint of the x-axis as 50% of values were between ~75 and ~415. Values are excluded from greatest-from-the-average to least in descending order.
+
+Curiously, there's an inflection point in the curve at about the ~85th percentile. Given the uniform distribution of random values, I think the average makes sense as approximately 1/4th the range. And the inflection point seems to occur where the values excluded are more than double the average. It makes sense that a greater number of values will be near zero rather than more than twice the average but why ~15%? That number eludes me.
+
+I did some further analysis to see if, on average, the inflection point was at the 85th percentile. Here's a histogram of the results with 100,000 values repeated 2,000 times:
+
+{{inflection_point_histogram.png}}
+
+Looks like a normal curve to me. But it's not centered on 85. My closest guess is somewhere around the 84.65 percentile.
+
+So given the way values was computed above, this data says a little less than 15% of values will be more than double one-half the range.