A Python-based descriptive statistical analysis tool. « Python recipes

A Python module implementing a class which can be used for computing numerical statistics for a given data set.

      """Descriptive statistical analysis tool.
"""

__author__ = "Chad J. Schroeder"

__revision__ = "$Id$"
__version__ = "0.1"

__all__ = [ "StatisticsException", "Statistics" ]

class StatisticsException(Exception):
   """Statistics Exception class."""
   pass

class Statistics(object):
   """Class for descriptive statistical analysis.

   Behavior:
      Computes numerical statistics for a given data set.

   Available public methods:

      None

   Available instance attributes:

          N: total number of elements in the data set
        sum: sum of all values (n) in the data set
        min: smallest value of the data set
        max: largest value of the data set
       mode: value(s) that appear(s) most often in the data set
       mean: arithmetic average of the data set
      range: difference between the largest and smallest value in the data set
     median: value which is in the exact middle of the data set
   variance: measure of the spread of the data set about the mean
     stddev: standard deviation - measure of the dispersion of the data set
             based on variance

   identification: Instance ID

   Raised Exceptions:    

      StatisticsException

   Bases Classes:

      object (builtin)

   Example Usage:

      x = [ -1, 0, 1 ]

      try:
         stats = Statistics(x)
      except StatisticsException, mesg:
         <handle exception>

      print "N: %s" % stats.N
      print "SUM: %s" % stats.sum
      print "MIN: %s" % stats.min
      print "MAX: %s" % stats.max
      print "MODE: %s" % stats.mode
      print "MEAN: %0.2f" % stats.mean
      print "RANGE: %s" % stats.range
      print "MEDIAN: %0.2f" % stats.median
      print "VARIANCE: %0.5f" % stats.variance
      print "STDDEV: %0.5f" % stats.stddev
      print "DATA LIST: %s" % stats.sample

   """
                                                                                
   def __init__(self, sample=[], population=False):
      """Statistics class initializer method."""

      # Raise an exception if the data set is empty.
      if (not sample):
         raise StatisticsException, "Empty data set!: %s" % sample

      # The data set (a list).
      self.sample = sample

      # Sample/Population variance determination flag.
      self.population = population

      self.N = len(self.sample)

      self.sum = float(sum(self.sample))

      self.min = min(self.sample)

      self.max = max(self.sample)

      self.range = self.max - self.min

      self.mean = self.sum/self.N

      # Inplace sort (list is now in ascending order).
      self.sample.sort()

      self.__getMode()
      self.__getMedian()
      self.__getVariance()
      self.__getStandardDeviation()

      # Instance identification attribute.
      self.identification = id(self)

   def __getMode(self):
      """Determine the most repeated value(s) in the data set."""

      # Initialize a dictionary to store frequency data.
      frequency = {}

      # Build dictionary: key - data set values; item - data frequency.
      for x in self.sample:
         if (x in frequency):
            frequency[x] += 1
         else:
            frequency[x] = 1

      # Create a new list containing the values of the frequency dict.  Convert
      # the list, which may have duplicate elements, into a set.  This will
      # remove duplicate elements.  Convert the set back into a sorted list
      # (in descending order).  The first element of the new list now contains
      # the frequency of the most repeated values(s) in the data set.
      # mode = sorted(list(set(frequency.values())), reverse=True)[0]
      # Or use the builtin - max(), which returns the largest item of a
      # non-empty sequence.
      mode = max(frequency.values())

      # If the value of mode is 1, there is no mode for the given data set.
      if (mode == 1):
         self.mode = []
         return

      # Step through the frequency dictionary, looking for values equaling
      # the current value of mode.  If found, append the value and its
      # associated key to the self.mode list.
      self.mode = [(x, mode) for x in frequency if (mode == frequency[x])]

   def __getMedian(self):
      """Determine the value which is in the exact middle of the data set."""

      if (self.N%2):		# Number of elements in data set is odd.
         self.median = float(self.sample[self.N/2])
      else:
         midpt = self.N/2	# Number of elements in data set is even.
         self.median = (self.sample[midpt-1] + self.sample[midpt])/2.0

   def __getVariance(self):
      """Determine the measure of the spread of the data set about the mean.
      Sample variance is determined by default; population variance can be
      determined by setting population attribute to True.
      """

      x = 0	# Summation variable.

      # Subtract the mean from each data item and square the difference.
      # Sum all the squared deviations.
      for item in self.sample:
         x += (item - self.mean)**2.0

      try:
         if (not self.population):
            # Divide sum of squares by N-1 (sample variance).
            self.variance = x/(self.N-1)
         else:
            # Divide sum of squares by N (population variance).
            self.variance = x/self.N
      except:
         self.variance = 0

   def __getStandardDeviation(self):
      """Determine the measure of the dispersion of the data set based on the
      variance.
      """

      from math import sqrt     # Mathematical functions.

      # Take the square root of the variance.
      self.stddev = sqrt(self.variance)

if __name__ == "__main__":

   import os               # Miscellaneous OS interfaces.
   import sys              # System-specific parameters and functions.

   # Self-test

   a = [ -1, 0, 1 ]
   b = [ -1.0, 0.0, 1.1 ]
   c = []
   d = [ 12.23 ]
   e = [ 12.23, 99.543, 66.08 ]
   f = [ -1, 0, 2, -2, 1, 3, 0, -3, 2 ]
   g = [ 0, 9, 1, 8, 2, 7, 3, 6, 4, 5 ]
   h = [ -1, -1 ]

   for x in a, b, c, d, e, f, g, h:
      try:
         stats = Statistics(x)
      except StatisticsException, mesg:
         print; print "Exception caught: %s" % mesg; print
         continue
      print
      print "N: %s" % stats.N
      print "SUM: %s" % stats.sum
      print "MIN: %s" % stats.min
      print "MAX: %s" % stats.max
      print "MODE: %s" % stats.mode
      print "MEAN: %0.2f" % stats.mean
      print "RANGE: %s" % stats.range
      print "MEDIAN: %0.2f" % stats.median
      print "VARIANCE: %0.5f" % stats.variance
      print "STDDEV: %0.5f" % stats.stddev
      print "DATA LIST: %s\n" % stats.sample
      print

   sys.exit(0)

      

This recipe implements a descriptive statistical analysis class. It's intended to aid in computing numerical statistics for a given data set. It's well documented and hopefully useful. Any corrections, ideas, or suggestions are welcome. Enjoy.

Tags: programs

5 comments

gyro funch 19 years ago # | flag

Preexisting solutions. Very useful, but ...

There is a similar module in development within the python cvs tree: python/nondist/sandbox/statistics/statistics.py

There is also a nice stats module at http://www.nmr.mgh.harvard.edu/Neural_Systems_Group/gary/python.html

SciPy (http://www.scipy.org) also has some statistics-related functions: http://www.scipy.org/documentation/apidocs/scipy/scipy.stats.html

Mikko Pekkarinen 19 years ago # | flag

Bug in variance computation. You use x both as loop variable and summation variable, so that the result is bogus. E.g. Statistics([-1, -1]) gives a negative variance (and thus an exception from sqrt).

Also, computing the mode is more complicated than necessary: why not use

mode = max(frequency.values())

(Converting to a set does not save anything; one must touch all the elements of the list anyway, so it's O(n). And even if one wants to do list(set(lst)) first, max(lst) is faster than sorted(lst)[-1] or sorted(lst, reverse=True)[0].)

Chad J. Schroeder (author) 19 years ago # | flag

Mode and variance. Thanks for pointing out the summation oversight and mode computation improvement. Modifications have been merged.

Kim Slezak 13 years, 11 months ago # | flag

Chad, thanks for the code. I had a project in VBA for ArcGIS that I'm moving to Python, this was what I needed. However, when I call it from the main module (terms may not be right since new to Python) I get the same stats for the 2nd - n runs as the 1st.

My code generates a list named lstYield of points within a specific polygon

stat_analysis.Statistics(lstYield)

then the median, max, and std dev will be saved to the table for that record in the polygon's attribute table, but like I said I get one viable run, then it repeats the same N and stats. I've checked the list going in, and it is changing. I tried setting the stats.N, stats.median, etc to 0 before getting the next list of points, then all variables return as 0. Any thoughts would be greatly appreciated. Kim

Kim Slezak 13 years, 11 months ago # | flag

Chad, figured it out, had to say blockStats = stat_analysis.Statistics(lstYield), then it ran just fine. Sorry for jumping the gun in posting question, but thanks again for the code!

◄	Python recipes (4591)	►
◄	Chad J. Schroeder's recipes (2)	►

A Python-based descriptive statistical analysis tool. (Python recipe) by Chad J. Schroeder
ActiveState Code (http://code.activestate.com/recipes/409413/)

5 comments

Tags

Required Modules

Other Information and Tasks

Accounts

Code Recipes

Feedback & Information

ActiveState

A Python-based descriptive statistical analysis tool. (Python recipe) by Chad J. Schroeder ActiveState Code (http://code.activestate.com/recipes/409413/)

5 comments

Tags

Required Modules

Other Information and Tasks

Accounts

Code Recipes

Feedback & Information

ActiveState

A Python-based descriptive statistical analysis tool. (Python recipe) by Chad J. Schroeder
ActiveState Code (http://code.activestate.com/recipes/409413/)