
What?

Although I don't know how the formula was derived, the practical part was very easy for me. I verified one example (see code) using OpenOffice Calc (what I learned: it can display the trend-line formula as well as the coefficient of correlation - great).

Why?

  • In recipe 578111 I print the current error rate for different training sessions in mental arithmetic.
  • I would also like to be able to give information - approximately - about how much you have improved since you started training.

What has changed?

  • Revision 2: Did not compile with Jython 2.5.3b1 because of unsupported exception syntax; now it works without raising an exception.
  • Revision 3: Removed a test data row (used to provoke the failure case) that had been left in.
Python, 96 lines
"""
    @author    Thomas Lehmann
    @file      Main.py
    @brief     correlation and regression analysis

    Referring to this document (German):
    http://www.faes.de/Basis/Basis-Statistik/Basis-Statistik-Korrelation-Re/basis-statistik-korrelation-re.html
"""
import sys
import math

EPSILON = 0.0000001

class SimpleLinearRegression:
    """ tool class as help for calculating a linear function """
    def __init__(self, data):
        """ initializes members with defaults """
        self.data = data   # list of (x,y) pairs
        self.a    = 0      # "a" of y = a + b*x
        self.b    = 0      # "b" of y = a + b*x
        self.r    = 0      # coefficient of correlation

    def run(self):
        """ calculates coefficient of correlation and
            the parameters for the linear function """
        sumX, sumY, sumXY, sumXX, sumYY = 0, 0, 0, 0, 0
        n = float(len(self.data))

        for x, y in self.data:
            sumX  += x
            sumY  += y
            sumXY += x*y
            sumXX += x*x
            sumYY += y*y

        denominator = math.sqrt((sumXX - 1/n * sumX**2)*(sumYY - 1/n * sumY**2))
        if denominator < EPSILON:
            return False

        # coefficient of correlation
        self.r  = (sumXY - 1/n * sumX * sumY)
        self.r /= denominator

        # is there no relationship between 'x' and 'y'?
        if abs(self.r) < EPSILON:
            return False

        # calculating 'a' and 'b' of y = a + b*x
        self.b  = sumXY - sumX * sumY / n
        self.b /= (sumXX - sumX**2 / n)

        self.a  = sumY - self.b * sumX
        self.a /= n
        return True

    def function(self, x):
        """ linear function (be aware of the current
            coefficient of correlation) """
        return self.a + self.b * x

    def __repr__(self):
        """ current linear function for print """
        return "y = f(x) = %(a)f + %(b)f*x" % self.__dict__

def example():
    """ provides an example with error rates (one per session)
        @note linear function verified in open office calc """
    print("Simple linear regression v0.3 by Thomas Lehmann 2012")
    print("...Python %s" % sys.version.replace("\n", ""))
    data   = [(1.0, 18.0), (2, 15.0), (3, 19.0), (4, 10.0)]

    print("...data is %s" % data)

    linRegr = SimpleLinearRegression(data)
    if not linRegr.run():
        print("...error: failed to calculate parameters")
        return

    print("...the coefficient of correlation r = %f (r**2 is %f)" % (linRegr.r, linRegr.r**2))
    print("...parameter a of y = f(x) = a + b*x is %f" % linRegr.a)
    print("...parameter b of y = f(x) = a + b*x is %f" % linRegr.b)
    print("...linear function is then %s" % linRegr)
    print("...forecast of next value: f(5) = %f" % linRegr.function(5))

    firstY = linRegr.function(1)
    lastY  = linRegr.function(4)
    change = (lastY - firstY) / firstY * 100.0

    # keep in mind: reducing of error rate (inverse valuation)!
    if change < 0:
        print("...the trend is about %.1f%% improvement" % -change)
    else:
        print("...the trend is about %.1f%% to the worse" % change)

if __name__ == "__main__":
    example()

Example

The value of 'x' is the nth session. Each session consists of tasks, and for each task you give either a right or a wrong answer. From this you can calculate the error rate 'y'. Say a session has 10 tasks and you fail two of them; your error rate is then 2 * 100 / 10 = 20% (the error rates below are not taken from a real example):
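The error-rate calculation described above can be sketched as a small helper (the session figures here are made-up values, not from a real training run):

```python
def error_rate(tasks: int, failures: int) -> float:
    """Percentage of failed tasks in one session."""
    return failures * 100.0 / tasks

# e.g. a session with 10 tasks and 2 wrong answers -> 20% error rate
print(error_rate(10, 2))  # 20.0
```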

C:\Python32\python.exe F:/Programming/Python/Projects/SimpleLinearRegression/Main.py
Simple linear regression v0.1 by Thomas Lehmann 2012
...Python 3.2.3 (default, Apr 11 2012, 07:15:24) [MSC v.1500 32 bit (Intel)]
...data is [(1.0, 18.0), (2, 15.0), (3, 19.0), (4, 10.0)]
...the coefficient of correlation r = -0.638877 (r**2 is 0.408163)
...parameter a of y = f(x) = a + b*x is 20.500000
...parameter b of y = f(x) = a + b*x is -2.000000
...linear function is then y = f(x) = 20.500000 + -2.000000*x
...forecast of next value: f(5) = 10.500000
...the trend is about 32.4% improvement
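The improvement figure in the last output line can be checked by hand from the fitted parameters (a worked check using the values printed above, not part of the recipe itself):

```python
# parameters a and b as printed by the run above
a, b = 20.5, -2.0
f = lambda x: a + b * x

first, last = f(1), f(4)                  # 18.5 and 12.5
change = (last - first) / first * 100.0   # negative -> error rate went down
print(round(change, 1))                   # -32.4 -> "about 32.4% improvement"
```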

6 comments

s_h_a_i_o 10 years ago  # | flag

There seems to be an error in line 43, self.r = (sumXY - 1/n * sumX * sumY). If I am correct, this is the covariance term E(XY) - E(X)E(Y), which should read self.r = (sumXY - 1/n * sumX * 1/n * sumY).

You may use assert abs(self.r) < 1.000 to detect such errors.

s_h_a_i_o 10 years ago  # | flag

This also holds for denominator = math.sqrt((sumXX - 1/n * sumX**2)*(sumYY - 1/n * sumY**2))

=> denominator = math.sqrt((sumXX - 1/n * 1/n * sumX**2)*(sumYY - 1/n * 1/n * sumY**2))

but since you did not write the extra 1/n on top and bottom, your initial computation is correct!
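The normalization question raised in these comments can be settled numerically: the extra 1/n factors cancel between numerator and denominator, so the recipe's expression for r agrees with the textbook Pearson form. A quick check on the example data:

```python
import math

data = [(1.0, 18.0), (2, 15.0), (3, 19.0), (4, 10.0)]
n = float(len(data))
sumX  = sum(x for x, _ in data)
sumY  = sum(y for _, y in data)
sumXY = sum(x * y for x, y in data)
sumXX = sum(x * x for x, _ in data)
sumYY = sum(y * y for _, y in data)

# form used in the recipe: a single 1/n inside each factor
r1 = (sumXY - sumX * sumY / n) / math.sqrt(
    (sumXX - sumX**2 / n) * (sumYY - sumY**2 / n))

# textbook Pearson form via deviations from the means
mx, my = sumX / n, sumY / n
cov = sum((x - mx) * (y - my) for x, y in data)
sx  = math.sqrt(sum((x - mx)**2 for x, _ in data))
sy  = math.sqrt(sum((y - my)**2 for _, y in data))
r2  = cov / (sx * sy)

print(r1, r2)  # both about -0.638877
```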

s_h_a_i_o 10 years ago  # | flag

In any case, linear regression can be performed fast with ONE left matrix division, for any number of explanatory variables (not just the two parameters as in your code).

Well, interesting. I was just referring to the formula from the link. Could you please give me another reference to show me your version? (I additionally used OpenOffice to verify the values.)

I also found another page with the same formula (without squared 'n'): http://math.hws.edu/javamath/ryan/Regression.html

And here: http://easycalculation.com/statistics/learn-regression.php - here the only difference seems to be that the formula has been multiplied top and bottom by 'N', so that the division changes into a multiplication by 'N'. So it's the same formula as above.

But still no squaring of 'N'.

s_h_a_i_o 9 years, 11 months ago  # | flag

I used the regression formula beta = Var(X)^{-1} * Cov(X,Y), which can be computed as "Cov(X,Y) left-divided by Var(X)", and which holds for any number of explanatory variables X.

The formula holds when the model includes a constant term. In case it does not, the formula becomes E(X^2)^{-1} * E(X*Y).
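The "one left matrix division" suggested above can be sketched for the simple case with a constant term by solving the 2x2 normal equations beta = (X'X)^{-1} X'y directly. This is a minimal hand-rolled version; with many explanatory variables you would hand the X matrix to a linear-algebra routine instead:

```python
def regress_with_intercept(data):
    """Solve beta = (X'X)^{-1} X'y for the design matrix X = [1, x],
    i.e. the 2x2 normal equations of y = a + b*x."""
    n = float(len(data))
    sx  = sum(x for x, _ in data)
    sy  = sum(y for _, y in data)
    sxx = sum(x * x for x, _ in data)
    sxy = sum(x * y for x, y in data)
    # X'X = [[n, sx], [sx, sxx]],  X'y = [sy, sxy]
    det = n * sxx - sx * sx          # determinant of X'X
    a = (sxx * sy - sx * sxy) / det  # intercept
    b = (n * sxy - sx * sy) / det    # slope
    return a, b

a, b = regress_with_intercept([(1.0, 18.0), (2, 15.0), (3, 19.0), (4, 10.0)])
print(a, b)  # 20.5 -2.0, matching the recipe's output
```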