What?
- It's about forecasting.
- It's about calculating a linear function.
- Details I found: http://www.faes.de/Basis/Basis-Statistik/Basis-Statistik-Korrelation-Re/basis-statistik-korrelation-re.html (in german)
Also I don't know how the formula have been created the practical part was very easy to me. I have verified one example (see code) using Open Office Calc (I've learned: you can display the formula for the trend line as well as the coefficient of correlation - great).
Why?
- In recipe 578111 I'm printing out current error rate for different training sessions in mental arithmetic.
- Anyway I would like to be able to given information - approximately - about how many you have improved since you train yourself.
What has changed?
- Revision2: Didn't compile with Jython 2.5.3b1 because of not supported exception syntax. Now it does work without exception.
- Revision3: Test data row for failure not removed.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 | """
@author Thomas Lehmann
@file Main.py
@brief correlation and regression analyse
Referring to document at (german):
http://www.faes.de/Basis/Basis-Statistik/Basis-Statistik-Korrelation-Re/basis-statistik-korrelation-re.html
"""
import sys
import math
EPSILON = 0.0000001
class SimpleLinearRegression:
""" tool class as help for calculating a linear function """
def __init__(self, data):
""" initializes members with defaults """
self.data = data # list of (x,y) pairs
self.a = 0 # "a" of y = a + b*x
self.b = 0 # "b" of y = a + b*x
self.r = 0 # coefficient of correlation
def run(self):
""" calculates coefficient of correlation and
the parameters for the linear function """
sumX, sumY, sumXY, sumXX, sumYY = 0, 0, 0, 0, 0
n = float(len(self.data))
for x, y in self.data:
sumX += x
sumY += y
sumXY += x*y
sumXX += x*x
sumYY += y*y
denominator = math.sqrt((sumXX - 1/n * sumX**2)*(sumYY - 1/n * sumY**2))
if denominator < EPSILON:
return False
# coefficient of correlation
self.r = (sumXY - 1/n * sumX * sumY)
self.r /= denominator
# is there no relationship between 'x' and 'y'?
if abs(self.r) < EPSILON:
return False
# calculating 'a' and 'b' of y = a + b*x
self.b = sumXY - sumX * sumY / n
self.b /= (sumXX - sumX**2 / n)
self.a = sumY - self.b * sumX
self.a /= n
return True
def function(self, x):
""" linear function (be aware of current
coefficient of correlation """
return self.a + self.b * x
def __repr__(self):
""" current linear function for print """
return "y = f(x) = %(a)f + %(b)f*x" % self.__dict__
def example():
""" provides an example with error rates (one per session)
@note linear function verified in open office calc """
print("Simple linear regression v0.3 by Thomas Lehmann 2012")
print("...Python %s" % sys.version.replace("\n", ""))
data = [(1.0, 18.0), (2, 15.0), (3, 19.0), (4, 10.0)]
print("...data is %s" % data)
linRegr = SimpleLinearRegression(data)
if not linRegr.run():
print("...error: failed to calculate parameters")
return
print("...the coefficient of correlation r = %f (r**2 is %f)" % (linRegr.r, linRegr.r**2))
print("...parameter a of y = f(x) = a + b*x is %f" % linRegr.a)
print("...parameter b of y = f(x) = a + b*x is %f" % linRegr.b)
print("...linear function is then %s" % linRegr)
print("...forecast of next value: f(5) = %f" % linRegr.function(5))
firstY = linRegr.function(1)
lastY = linRegr.function(4)
change = (lastY - firstY) / firstY * 100.0
# keep in mind: reducing of error rate (inverse valuation)!
if change < 0:
print("...the trend is about %.1f%% improvement" % -change)
else:
print("...the trend is about %.1f%% to the worse" % change)
if __name__ == "__main__":
example()
|
Example
The value for 'x' is the nth session. Each session has tasks and either you give the right answer or a wrong one. From this you can calculate the error rate 'y'. Let's say you have a session with 10 tasks and you fail two times then your error rate is 2 * 100 / 10 = 20% (the error rates after here are not taken from a real example):
C:\Python32\python.exe F:/Programming/Python/Projects/SimpleLinearRegression/Main.py
Simple linear regression v0.1 by Thomas Lehmann 2012
...Python 3.2.3 (default, Apr 11 2012, 07:15:24) [MSC v.1500 32 bit (Intel)]
...data is [(1.0, 18.0), (2, 15.0), (3, 19.0), (4, 10.0)]
...the coefficient of correlation r = -0.638877 (r**2 is 0.408163)
...parameter a of y = f(x) = a + b*x is 20.500000
...parameter b of y = f(x) = a + b*x is -2.000000
...linear function is then y = f(x) = 20.500000 + -2.000000*x
...forecast of next value: f(5) = 10.500000
...the trend is about 32.4% improvement
There seems to be an error in line 43 self.r = (sumXY - 1/n * sumX * sumY) If I am correct, this is the covariate term E(XY)- E(X) E(Y), which should write self.r = (sumXY - 1/n * sumX * 1/n* sumY)
you may use assert(abs(self.r)<1.000) to detect such errors
This also holds for denominator = math.sqrt((sumXX - 1/n * sumX2)*(sumYY - 1/n * sumY2))
=> denominator = math.sqrt((sumXX - 1/n 1/n * sumX2)(sumYY - 1/n*1/n * sumY2))
but since you did not write the extra 1/n on top and bottom, your initial computation is correct !
In any case, linear regerssion can be performed fastly with ONE left matrix division, for any number of explanatory variables (not just 2 as in your code)
Well, interesting, I just referring to the formula from the link. Could you please give me another reference to show me your version? (I've additional used Open Office to verify the values)
I also found another page with same formula (without squared 'n'): http://math.hws.edu/javamath/ryan/Regression.html
And here: http://easycalculation.com/statistics/learn-regression.php Here the only difference seems to be that the formula as been extended in top and bottom by 'N' so that the division changes into a multiplication of 'N'. So it's the same formular as above.
But still no squaring of 'N'
I used the formula regression beta= Var(X)^{-1} * Cov(X,Y) which can be computed as " Cov(X,Y) leftdivided by Var(X) ", and which holds for any number of explanatory variables X.
The formula holds when the model includes a constant term. In cas it does not, the formula becomes E(X^2)^{-1} * E(X*Y)