# TLinearRegression class

This article gives the source code for a nice, clean implementation of linear regression by the linear, least-squares error algorithm.


The class provides properties returning the average for both the independent and dependent variables, as well as the variance, covariance, and correlation. A function method is provided to interpolate along the line defined by the computed slope and intercept.

The code is well documented, giving the basic theoretical outline and some notes as to what the various methods are doing.

An example project is also included.

{ ****************************************************************** }
{                                                                    }
{   Class implementing linear regression by linear-least-squares     }
{                                                                    }
{   Author's E-mail - mailto:[email protected]                       }
{   Other components at                                              }
{   http://www.connix.com/~btober/delphi.htm                         }
{                                                                    }
{ ****************************************************************** }

unit linregr;

{Linear Regression Class}

{

Note Regarding the Problem Statement

------------------------------------

This class is used for finding the best-fit slope (m) and y-intercept (b), as in the equation for a straight line

y := m*x + b

in the case where you have more than two (x,y) data point pairs, i.e., you have (Xi, Yi) for i:=1,2,...,n, with n>2. In this situation you can write n equations:

Y1 := m*X1 + b + E1

Y2 := m*X2 + b + E2

.

.

.

Yn := m*Xn + b + En

giving a system of n linear equations that you need to solve for two unknowns, namely, m and b, where the Ei are error terms. The linear-least-squares error algorithm employed here finds the best m and b such that the sum of the squared error terms is minimized.

The linear algebra behind the solution is to write an n-by-2 matrix A that has rows looking like [1, Xi] for i=1,..,n and a 2-by-1 vector u of unknowns with the intercept and slope, i.e., u = [b,m]', where (') indicates "take the transpose of". There's a bunch of other technical detail stuff like "finding an orthogonal basis for a sub-space", and "guaranteed unique solutions", etc., that I vaguely recall from my course in real vectors, but it boils down to writing

e'e = (y - Au)'(y - Au) = |y - Au|^2

to show the best solution is found by solving for u in the normal equations

A'Au = A'y

where, to summarize:

y = [Y1,Y2,...,Yn]'

u = [b,m]'

A = [[1,X1]', [1,X2]',...[1,Xn]']'

In the process of solving this you eventually get to the explicit solution

u = inv(A'A)(A'y)

where inv() means "take the inverse of". This is the equation implemented in this class. It accumulates the statistics "on the fly", so it works with any number of data points from two on up.

This is all well-known, straightforward stuff out of a textbook. I don't claim any credit for the math...I'm not inventing anything new here, just making a slick implementation.
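As a side check (not part of this unit, and with made-up data), here is a short Python sketch confirming that the explicit normal-equation solution u = inv(A'A)(A'y) agrees with the averaged (covariance) form the class actually computes:

```python
# Check that solving the 2x2 normal equations (A'A)u = A'y gives the
# same slope/intercept as the covariance formulas used by the class.
xs = [1.0, 2.0, 4.0, 7.0]   # made-up independent values
ys = [2.0, 3.0, 6.0, 10.0]  # made-up dependent values
n = len(xs)
Sx, Sy = sum(xs), sum(ys)
Sxx = sum(x * x for x in xs)
Sxy = sum(x * y for x, y in zip(xs, ys))

# A'A = [[n, Sx], [Sx, Sxx]] and A'y = [Sy, Sxy]'; solve by Cramer's rule
det = n * Sxx - Sx * Sx
b = (Sxx * Sy - Sx * Sxy) / det
m = (n * Sxy - Sx * Sy) / det

# Same answer from the averaged (covariance) form the class computes
avg_x, avg_y, avg_xx, avg_xy = Sx / n, Sy / n, Sxx / n, Sxy / n
m2 = (avg_xy - avg_x * avg_y) / (avg_xx - avg_x * avg_x)
b2 = avg_y - m2 * avg_x
print(abs(m - m2) < 1e-12 and abs(b - b2) < 1e-12)  # True
```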

Note Regarding Exceptions

-------------------------------

This class does not raise exceptions: your controlling application must trap 'divide by zero' errors that can occur in

function GetCorrelation

function GetIntercept

function GetSlope

I didn't build in this exception handling because it would entail a USES clause. As it is, this unit has NO explicit dependencies on ANY other units, and I couldn't bear to destroy that purity!

The 'divide by zero' circumstance can occur when accessing any of the derived, covariance-related properties if:

1. you have accumulated less than two data points, or

2. you have "degenerate" data, i.e., all the x's are the same (this corresponds to slope of infinity), in which case you will find that CovarianceXX will be zero (or within some small error range of zero).
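A quick Python illustration (made-up data, not part of the unit) of the second condition: when every x is the same, the variance term CovarianceXX collapses to zero, so the slope and intercept divisions would blow up.

```python
# With "degenerate" data (all x equal), CovXX = avg(x^2) - avg(x)^2
# collapses to zero, which is what makes slope/intercept undefined.
xs = [2.0] * 10
n = len(xs)
avg_x = sum(xs) / n
avg_xx = sum(x * x for x in xs) / n
cov_xx = avg_xx - avg_x * avg_x
print(cov_xx)  # 0.0
```

In real floating-point data the test should of course be against a small tolerance rather than exact zero.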

Note Regarding Normalizing Data

-------------------------------

It is recommended that you normalize your data before applying this linear-least-squares algorithm, if you can.

For purely multiplicative scaling, apply the following:

If you scale data input as (x',y') := (Ax,By), then you must

1) adjust computed intercept by b := b'/B, and

2) adjust computed slope by m := m'*A/B
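A short Python sketch (made-up data and scale factors, not part of the unit) verifying these two adjustments: fit on the scaled inputs (x',y') = (Ax,By), then recover the unscaled fit via b = b'/B and m = m'*A/B.

```python
# Verify the scaling adjustments: fit y' = m'*x' + b' on scaled data,
# then recover the unscaled slope and intercept.
def fit(xs, ys):
    # Plain covariance-form least-squares fit; returns (slope, intercept)
    n = len(xs)
    ax, ay = sum(xs) / n, sum(ys) / n
    axx = sum(x * x for x in xs) / n
    axy = sum(x * y for x, y in zip(xs, ys)) / n
    m = (axy - ax * ay) / (axx - ax * ax)
    return m, ay - m * ax

A, B = 10.0, 4.0                 # made-up scale factors
xs = [1.0, 2.0, 3.0, 5.0]        # made-up data
ys = [1.5, 3.0, 4.0, 7.5]
m, b = fit(xs, ys)                                      # unscaled fit
ms, bs = fit([A * x for x in xs], [B * y for y in ys])  # scaled fit
print(abs(m - ms * A / B) < 1e-12, abs(b - bs / B) < 1e-12)  # True True
```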

}

interface

type
  TRegressionRec=Record
    N     :Integer;  { Count of number of data points accumulated }
    AvgX  :Double;   { Average of independent variables }
    AvgY  :Double;   { Average of dependent variables }
    AvgXX :Double;   { Average of squared independent variables }
    AvgYY :Double;   { Average of squared dependent variables }
    AvgXY :Double;   { Average of cross product }
  end;

  TLinearRegression=class(TObject)
  private
    FRegressionRec:TRegressionRec;
    function GetCorrelation:Double;
    function GetCovarianceXX:Double;
    function GetCovarianceYY:Double;
    function GetCovarianceXY:Double;
    function GetIntercept:Double;
    function GetSlope:Double;
  public
    constructor Create;
    procedure Clear;
    function Append(X,Y:Double):Integer;
    {Accumulates another (x,y) data point and returns the new count}
    function Interpolate(x:Double):Double;
    property Count:Integer read FRegressionRec.N;
    property AverageX:Double read FRegressionRec.AvgX;
    property AverageY:Double read FRegressionRec.AvgY;
    property CovarianceXX:Double read GetCovarianceXX;
    property CovarianceYY:Double read GetCovarianceYY;
    property CovarianceXY:Double read GetCovarianceXY;
    property Correlation:Double read GetCorrelation;
    property Slope:Double read GetSlope;
    property Intercept:Double read GetIntercept;
  end;

implementation

function Accumulate
  (
  N:Integer;   {Sequence number of this new data point}
  a,           {The existing average (of N-1 data points)}
  x:Double     {The new data value to be included in running average}
  ):Double;
{Returns the running average}
begin
  Result:=(N-1)/N;
  Result:=a*Result + x/N;
end;

constructor TLinearRegression.Create;
begin
  inherited Create;
  Clear;
end;

{This function 'accumulates' another (x,y) data point}
function TLinearRegression.Append(X,Y:Double):Integer;
begin
  with FRegressionRec do
  begin
    Inc(N);
    AvgX  := Accumulate(N,AvgX,X);
    AvgY  := Accumulate(N,AvgY,Y);
    AvgXX := Accumulate(N,AvgXX,X*X);
    AvgYY := Accumulate(N,AvgYY,Y*Y);
    AvgXY := Accumulate(N,AvgXY,X*Y);
    Result:=N;
  end;
end;

procedure TLinearRegression.Clear;
begin
  with FRegressionRec do
  begin
    N := 0;
    AvgX := 0;
    AvgY := 0;
    AvgXX:= 0;
    AvgYY:= 0;
    AvgXY:= 0;
  end;
end;

function TLinearRegression.GetCovarianceXX:Double;
begin
  with FRegressionRec do
    Result := AvgXX-AvgX*AvgX;
end;

function TLinearRegression.GetCovarianceYY:Double;
begin
  with FRegressionRec do
    Result := AvgYY-AvgY*AvgY;
end;

function TLinearRegression.GetCovarianceXY:Double;
begin
  with FRegressionRec do
    Result := AvgXY-AvgX*AvgY;
end;

function TLinearRegression.GetCorrelation:Double;
var CovXX,CovYY:Double;
begin
  CovXX:=GetCovarianceXX;
  CovYY:=GetCovarianceYY;
  if CovXX = 0.0 then
    Result := 1.0   {Degenerate case with infinite slope}
  else if CovYY = 0.0 then
    Result := 1.0   {Data has zero slope}
  else
    Result := GetCovarianceXY/Sqrt(CovXX*CovYY);
end;

function TLinearRegression.GetIntercept:Double;
begin
  {Note that if CovXX is zero, then no unique y-intercept exists}
  with FRegressionRec do
    Result:=(AvgY*AvgXX-AvgX*AvgXY)/GetCovarianceXX;
end;

function TLinearRegression.GetSlope:Double;
begin
  {Note that if CovXX is zero, then slope is infinite}
  Result := GetCovarianceXY/GetCovarianceXX;
end;

function TLinearRegression.Interpolate(x:Double):Double;
begin
  Result:=x*Slope + Intercept;
end;

end.
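As a sanity check (this sketch is not part of the Delphi unit), the same arithmetic can be reproduced in Python: feed the Data1 points from the example project below through the running-average recurrence that Accumulate implements, then apply the covariance formulas from GetSlope, GetIntercept, and GetCorrelation. It should reproduce the "correct" results quoted in the example, r=0.891, m=0.513, b=39.658.

```python
# Reproduce the unit's running-average accumulation and covariance
# formulas on the Data1 set from the example project.
from math import sqrt

data = [(75, 81), (78, 73), (88, 85), (92, 85), (95, 89),
        (67, 73), (55, 66), (73, 81), (74, 81), (80, 81)]

n = 0
avg_x = avg_y = avg_xx = avg_yy = avg_xy = 0.0
for x, y in data:
    n += 1
    # Accumulate's recurrence: new_avg = old_avg*(n-1)/n + value/n
    acc = lambda a, v: a * (n - 1) / n + v / n
    avg_x, avg_y = acc(avg_x, x), acc(avg_y, y)
    avg_xx, avg_yy = acc(avg_xx, x * x), acc(avg_yy, y * y)
    avg_xy = acc(avg_xy, x * y)

cov_xx = avg_xx - avg_x * avg_x
cov_yy = avg_yy - avg_y * avg_y
cov_xy = avg_xy - avg_x * avg_y
m = cov_xy / cov_xx                                  # GetSlope
b = (avg_y * avg_xx - avg_x * avg_xy) / cov_xx       # GetIntercept
r = cov_xy / sqrt(cov_xx * cov_yy)                   # GetCorrelation
print(round(r, 3), round(m, 3), round(b, 3))  # 0.891 0.513 39.658
```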

Sample project file follows:

program Example;

uses
  WinCRT,linregr;

type
  TDataArray=Array[1..10,1..2] of Real;

const
  Data1:TDataArray=
    (
    (75,81),(78,73),(88,85),(92,85),(95,89),
    (67,73),(55,66),(73,81),(74,81),(80,81)
    );
  {
  These are the "correct" results for the data set above:
  r=0.891, m=0.513, b=39.658
  }
  Data2:TDataArray=
    ( {Zero slope}
    (75,1),(78,1),(88,1),(92,1),(95,1),
    (67,1),(55,1),(73,1),(74,1),(80,1)
    );
  Data3:TDataArray=
    ( {Infinite slope}
    (2,81),(2,73),(2,85),(2,85),(2,89),
    (2,73),(2,66),(2,81),(2,81),(2,81)
    );

procedure TestRegression(Data:TDataArray);
var i:Word;
begin
  with TLinearRegression.Create do
  begin
    {Each data point must be accumulated before the statistics are read}
    for i:=1 to 10 do
      Append(Data[i,1],Data[i,2]);
    try
      write(Correlation:12:3);
      write(Slope:12:3);
      writeln(Intercept:12:3);
      writeln('Average of X''s',AverageX:12:3);
      writeln('Average of Y''s',AverageY:12:3);
      writeln('Std Dev of X''s',sqrt(CovarianceXX):12:3);
      writeln('Std Dev of Y''s',sqrt(CovarianceYY):12:3);
    except
      writeln('(undefined for this data set)');
    end;
    writeln;
    Free;
  end;
end;

begin
  writeln('This is what the correlation, slope and intercept values should be:');
  writeln(0.891:12:3,0.513:12:3,39.658:12:3,#13);
  writeln('Sample results:');
  TestRegression(Data1);
  writeln(#13'Data with zero slope');
  TestRegression(Data2);
  writeln(#13'Degenerate data with infinite slope');
  TestRegression(Data3);
end.