Simple linear regression « Python recipes

What?

It's about forecasting.
It's about calculating a linear function.
Details I found: http://www.faes.de/Basis/Basis-Statistik/Basis-Statistik-Korrelation-Re/basis-statistik-korrelation-re.html (in german)

Also I don't know how the formula have been created the practical part was very easy to me. I have verified one example (see code) using Open Office Calc (I've learned: you can display the formula for the trend line as well as the coefficient of correlation - great).

Why?

In recipe 578111 I'm printing out current error rate for different training sessions in mental arithmetic.
Anyway I would like to be able to given information - approximately - about how many you have improved since you train yourself.

What has changed?

Revision2: Didn't compile with Jython 2.5.3b1 because of not supported exception syntax. Now it does work without exception.
Revision3: Test data row for failure not removed.

      """
    @author    Thomas Lehmann
    @file      Main.py
    @brief     correlation and regression analyse

    Referring to document at (german):
    http://www.faes.de/Basis/Basis-Statistik/Basis-Statistik-Korrelation-Re/basis-statistik-korrelation-re.html
"""
import sys
import math

EPSILON = 0.0000001

class SimpleLinearRegression:
    """ tool class as help for calculating a linear function """
    def __init__(self, data):
        """ initializes members with defaults """
        self.data = data   # list of (x,y) pairs
        self.a    = 0      # "a" of y = a + b*x
        self.b    = 0      # "b" of y = a + b*x
        self.r    = 0      # coefficient of correlation

    def run(self):
        """ calculates coefficient of correlation and
            the parameters for the linear function """
        sumX, sumY, sumXY, sumXX, sumYY = 0, 0, 0, 0, 0
        n = float(len(self.data))

        for x, y in self.data:
            sumX  += x
            sumY  += y
            sumXY += x*y
            sumXX += x*x
            sumYY += y*y

        denominator = math.sqrt((sumXX - 1/n * sumX**2)*(sumYY - 1/n * sumY**2))
        if denominator < EPSILON:
            return False

        # coefficient of correlation
        self.r  = (sumXY - 1/n * sumX * sumY)
        self.r /= denominator

        # is there no relationship between 'x' and 'y'?
        if abs(self.r) < EPSILON:
            return False

        # calculating 'a' and 'b' of y = a + b*x
        self.b  = sumXY - sumX * sumY / n
        self.b /= (sumXX - sumX**2 / n)

        self.a  = sumY - self.b * sumX
        self.a /= n
        return True

    def function(self, x):
        """ linear function (be aware of current
            coefficient of correlation """
        return self.a + self.b * x

    def __repr__(self):
        """ current linear function for print """
        return "y = f(x) = %(a)f + %(b)f*x" % self.__dict__

def example():
    """ provides an example with error rates (one per session)
        @note linear function verified in open office calc """
    print("Simple linear regression v0.3 by Thomas Lehmann 2012")
    print("...Python %s" % sys.version.replace("\n", ""))
    data   = [(1.0, 18.0), (2, 15.0), (3, 19.0), (4, 10.0)]

    print("...data is %s" % data)

    linRegr = SimpleLinearRegression(data)
    if not linRegr.run():
        print("...error: failed to calculate parameters")
        return

    print("...the coefficient of correlation r = %f (r**2 is %f)" % (linRegr.r, linRegr.r**2))
    print("...parameter a of y = f(x) = a + b*x is %f" % linRegr.a)
    print("...parameter b of y = f(x) = a + b*x is %f" % linRegr.b)
    print("...linear function is then %s" % linRegr)
    print("...forecast of next value: f(5) = %f" % linRegr.function(5))

    firstY = linRegr.function(1)
    lastY  = linRegr.function(4)
    change = (lastY - firstY) / firstY * 100.0

    # keep in mind: reducing of error rate (inverse valuation)!
    if change < 0:
        print("...the trend is about %.1f%% improvement" % -change)
    else:
        print("...the trend is about %.1f%% to the worse" % change)

if __name__ == "__main__":
    example()

      

Example

The value for 'x' is the nth session. Each session has tasks and either you give the right answer or a wrong one. From this you can calculate the error rate 'y'. Let's say you have a session with 10 tasks and you fail two times then your error rate is 2 * 100 / 10 = 20% (the error rates after here are not taken from a real example):

C:\Python32\python.exe F:/Programming/Python/Projects/SimpleLinearRegression/Main.py
Simple linear regression v0.1 by Thomas Lehmann 2012
...Python 3.2.3 (default, Apr 11 2012, 07:15:24) [MSC v.1500 32 bit (Intel)]
...data is [(1.0, 18.0), (2, 15.0), (3, 19.0), (4, 10.0)]
...the coefficient of correlation r = -0.638877 (r**2 is 0.408163)
...parameter a of y = f(x) = a + b*x is 20.500000
...parameter b of y = f(x) = a + b*x is -2.000000
...linear function is then y = f(x) = 20.500000 + -2.000000*x
...forecast of next value: f(5) = 10.500000
...the trend is about 32.4% improvement

Tags: decrease, increase, linear, math, percental, regression, simple

6 comments

s_h_a_i_o 11 years, 11 months ago # | flag

There seems to be an error in line 43 self.r = (sumXY - 1/n * sumX * sumY) If I am correct, this is the covariate term E(XY)- E(X) E(Y), which should write self.r = (sumXY - 1/n * sumX * 1/n* sumY)

you may use assert(abs(self.r)<1.000) to detect such errors

s_h_a_i_o 11 years, 11 months ago # | flag

This also holds for denominator = math.sqrt((sumXX - 1/n * sumX2)*(sumYY - 1/n * sumY2))

=> denominator = math.sqrt((sumXX - 1/n 1/n * sumX2)(sumYY - 1/n*1/n * sumY2))

but since you did not write the extra 1/n on top and bottom, your initial computation is correct !

s_h_a_i_o 11 years, 11 months ago # | flag

In any case, linear regerssion can be performed fastly with ONE left matrix division, for any number of explanatory variables (not just 2 as in your code)

Thomas Lehmann (author) 11 years, 11 months ago # | flag

Well, interesting, I just referring to the formula from the link. Could you please give me another reference to show me your version? (I've additional used Open Office to verify the values)

Thomas Lehmann (author) 11 years, 11 months ago # | flag

I also found another page with same formula (without squared 'n'): http://math.hws.edu/javamath/ryan/Regression.html

And here: http://easycalculation.com/statistics/learn-regression.php Here the only difference seems to be that the formula as been extended in top and bottom by 'N' so that the division changes into a multiplication of 'N'. So it's the same formular as above.

But still no squaring of 'N'

s_h_a_i_o 11 years, 10 months ago # | flag

I used the formula regression beta= Var(X)^{-1} * Cov(X,Y) which can be computed as " Cov(X,Y) leftdivided by Var(X) ", and which holds for any number of explanatory variables X.

The formula holds when the model includes a constant term. In cas it does not, the formula becomes E(X^2)^{-1} * E(X*Y)

◄	Python recipes (4591)	►
◄	Thomas Lehmann's recipes (18)	►

Simple linear regression (Python recipe) by Thomas Lehmann
ActiveState Code (http://code.activestate.com/recipes/578129/)

6 comments

Tags

Required Modules

Other Information and Tasks

Accounts

Code Recipes

Feedback & Information

ActiveState

Simple linear regression (Python recipe) by Thomas Lehmann ActiveState Code (http://code.activestate.com/recipes/578129/)

6 comments

Tags

Required Modules

Other Information and Tasks

Accounts

Code Recipes

Feedback & Information

ActiveState

Simple linear regression (Python recipe) by Thomas Lehmann
ActiveState Code (http://code.activestate.com/recipes/578129/)