Linear regression is a simple and useful method for predicting values from a set of training data. The outcome of the regression is a best-fitting line function which, by definition, minimizes the sum of the squared errors. (When plotted on a two-dimensional coordinate system, each error is the vertical distance between an actual value Y and the predicted value Y' on the line.) In machine learning, this line equation Y' = b*x + A is typically solved with gradient descent, which approaches the solution iteratively. There is also a statistical approach that solves for the line directly, without an iterative algorithm.
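As a point of reference, the direct least-squares solution can be sketched in a few lines of plain Python. This is a minimal sketch, not part of the recipe below; the helper name `ols_line` and the sample points are illustrative:

```python
# Closed-form ordinary least squares:
#   slope b   = covariance(X, Y) / variance(X)
#   intercept A chosen so the line passes through the point of means
def ols_line(X, Y):
    m_x = sum(X) / len(X)
    m_y = sum(Y) / len(Y)
    b = (sum((x - m_x) * (y - m_y) for x, y in zip(X, Y))
         / sum((x - m_x) ** 2 for x in X))
    A = m_y - b * m_x
    return b, A

# Points lying exactly on y = 2x + 1 recover the slope and intercept:
print(ols_line([1, 2, 3, 4], [3, 5, 7, 9]))  # -> (2.0, 1.0)
```

The recipe below computes the same line a slightly different way, via the Pearson correlation coefficient and the sample standard deviations.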
This recipe is a pure-Python implementation of that statistical algorithm; it has no dependencies.
If you have pandas and numpy installed, you can check its results by uncommenting the assert lines.
import math


def fit(X, Y):
    def mean(Xs):
        return sum(Xs) / len(Xs)

    m_X = mean(X)
    m_Y = mean(Y)

    def std(Xs, m):
        # Sample standard deviation (n - 1 normalizer, matching pandas' default)
        normalizer = len(Xs) - 1
        return math.sqrt(sum(pow(x - m, 2) for x in Xs) / normalizer)
    # assert np.round(Series(X).std(), 6) == np.round(std(X, m_X), 6)

    def pearson_r(Xs, Ys):
        sum_xy = 0
        sum_sq_v_x = 0
        sum_sq_v_y = 0
        for (x, y) in zip(Xs, Ys):
            var_x = x - m_X
            var_y = y - m_Y
            sum_xy += var_x * var_y
            sum_sq_v_x += pow(var_x, 2)
            sum_sq_v_y += pow(var_y, 2)
        return sum_xy / math.sqrt(sum_sq_v_x * sum_sq_v_y)
    # assert np.round(Series(X).corr(Series(Y)), 6) == np.round(pearson_r(X, Y), 6)

    r = pearson_r(X, Y)
    # Slope and intercept of the best-fitting line: b = r * (std_Y / std_X)
    b = r * (std(Y, m_Y) / std(X, m_X))
    A = m_Y - b * m_X

    def line(x):
        return b * x + A

    return line


# Example: points on y = 2x give back that same line
# predict = fit([1, 2, 3], [2, 4, 6])
# predict(4)  # -> 8.0