TLinearRegression class

This article gives the source code for a nice, clean implementation of linear regression by the linear, least-squares error algorithm.

(This article originally appeared in the March 17, 2000 issue of The Unofficial Newsletter of Delphi Users.)






The class provides properties returning the average for both the independent and dependent variables, as well as the variance, covariance, and correlation. A function method is provided to interpolate along the line defined by the computed slope and intercept.



The code is well documented, giving the basic theoretical outline and some notes as to what the various methods are doing.



An example project is also included.



{ ****************************************************************** }

{ }

{ Class implementing linear regression by linear-least-squares }

{ }

{ Copyright © 2000, Berend M. Tober. All rights reserved. }

{ Author's E-mail - mailto:btober@computer.org }

{ Other components at }

{ http://www.connix.com/~btober/delphi.htm }

{ }

{ ****************************************************************** }

unit linregr;

{Linear Regression Class}

{



  Note Regarding the Problem Statement

  ------------------------------------



  This class is used for finding the best-fit slope (m) and y-intercept (b), as in the equation for a straight line



    y := m*x + b



  in the case where you have more than two (x,y) data point pairs, i.e., you have (Xi, Yi) for i:=1,2,...,n, with n>2. In this situation you can write n equations:



    Y1 := m*X1 + b + E1

    Y2 := m*X2 + b + E2

     .

     .

     .

    Yn := m*Xn + b + En



  giving a system of n linear equations that you need to solve for two unknowns, namely, m and b, where the Ei are error terms. The linear-least-squares error algorithm employed here finds the best m and b such that the sum of the squared error terms is minimized.



  The linear algebra behind the solution is to write an n-by-2 matrix A whose rows look like [1, Xi] for i=1,..,n and a 2-by-1 vector u of unknowns holding the intercept and slope, i.e., u = [b,m]', where (') indicates "take the transpose of". There's a bunch of other technical detail (finding an orthogonal basis for a subspace, guaranteed unique solutions, etc.) that I vaguely recall from my course in real vectors, but it boils down to writing the squared length of the error vector e = y - Au:



        e'e = (y - Au)'(y - Au) = |y - Au|^2



  and minimizing it with respect to u, which leads to the "normal equations"



        A'Au = A'y



  where, to summarize:



       y = [Y1,Y2,...,Yn]'

       u = [b,m]'

       A = [[1,X1]', [1,X2]',...[1,Xn]']'



  In the process of solving this you eventually get to the explicit solution



        u = inv(A'A)(A'y)



  where inv() means "take the inverse of". This is the equation implemented in this class. It accumulates the statistics "on the fly", so it works for any number of data points, two or more.



  This is all well-known, straightforward stuff out of a textbook. I don't claim any credit for the math...I'm not inventing anything new here, just making a slick implementation.





  Note Regarding Exceptions

  -------------------------------



  This class does not raise exceptions: your controlling application must trap 'divide by zero' errors that can occur in



        function GetCorrelation

        function GetIntercept

        function GetSlope



  I didn't build in this exception handling because it would entail a USES clause. As it is, this unit has NO explicit dependencies on ANY other units, and I couldn't bear to destroy that purity!



  The 'divide by zero' circumstance can occur when accessing any of the derived, covariance-related properties if:



  1. you have accumulated less than two data points, or

  2. you have "degenerate" data, i.e., all the x's are the same (this corresponds to a line of infinite slope), in which case you will find that CovarianceXX is zero (or within rounding error of zero).





  Note Regarding Normalizing Data

  -------------------------------



  It is recommended that you normalize data in advance of applying this linear-least-squares algorithm, if you can.



  For purely multiplicative scaling, apply the following:



  If you scale data input as (x',y') := (Ax,By), then you must

    1) adjust computed intercept by b := b'/B, and

    2) adjust computed slope by m := m'*A/B

}



interface

type

  TRegressionRec=Record
     N     :Integer;  { Count of data points accumulated }
     AvgX  :Double;   { Average of independent variables }
     AvgY  :Double;   { Average of dependent variables }
     AvgXX :Double;   { Average of squared independent variables }
     AvgYY :Double;   { Average of squared dependent variables }
     AvgXY :Double;   { Average of cross products }
  end;



  TLinearRegression=class(TObject)

    private

      FRegressionRec:TRegressionRec;

      function GetCorrelation:Double;

      function GetCovarianceXX:Double;

      function GetCovarianceYY:Double;

      function GetCovarianceXY:Double;

      function GetIntercept:Double;

      function GetSlope:Double;

    public

      constructor Create;

      procedure Clear;

      function Add(X,Y:Double):Integer;

      function Interpolate(x:Double):Double;



      property AverageX:Double read FRegressionRec.AvgX;
      property AverageY:Double read FRegressionRec.AvgY;
      property Count:Integer read FRegressionRec.N;
      property Correlation:Double read GetCorrelation;
      property CovarianceXX:Double read GetCovarianceXX;
      property CovarianceYY:Double read GetCovarianceYY;
      property CovarianceXY:Double read GetCovarianceXY;
      property Intercept:Double read GetIntercept;
      property Slope:Double read GetSlope;

  end;





implementation



function Accumulate(N: Integer;  {Sequence number of this new data point}
                    a: Double;   {The existing average (of N-1 data points)}
                    x: Double    {The new data value to include in the running average}
                   ): Double;
{Returns the running average}
begin
  Result := (N-1)/N;
  Result := a*Result + x/N;
end;



constructor TLinearRegression.Create;

begin

  inherited Create;

  Clear;

end;



function TLinearRegression.Add(X,Y:Double):Integer;

  {This function 'accumulates' another (x,y) data point}

begin

  with FRegressionRec DO

  begin

    Inc(N);



    AvgX := Accumulate(N,AvgX,X);

    AvgY := Accumulate(N,AvgY,Y);



    AvgXX := Accumulate(N,AvgXX,X*X);

    AvgYY := Accumulate(N,AvgYY,Y*Y);

    AvgXY := Accumulate(N,AvgXY,X*Y);



    Result:=N;

  end;

end;



procedure TLinearRegression.Clear;

begin

  with FRegressionRec do

  begin

    N := 0;

    AvgX := 0;

    AvgY := 0;

    AvgXX:= 0;

    AvgYY:= 0;

    AvgXY:= 0;

  end;

end;





function TLinearRegression.GetCovarianceXX:Double;

begin

  with FRegressionRec do

    Result := AvgXX-AvgX*AvgX;

end;





function TLinearRegression.GetCovarianceYY:Double;

begin

  with FRegressionRec do

    Result := AvgYY-AvgY*AvgY;

end;





function TLinearRegression.GetCovarianceXY:Double;

  begin

    with FRegressionRec do

      Result := AvgXY-AvgX*AvgY;

  end;



function TLinearRegression.GetCorrelation:Double;
  var CovXX,CovYY:Double;
  begin
    CovXX := GetCovarianceXX;
    CovYY := GetCovarianceYY;
    if CovXX = 0.0 then
      Result := 1.0      {Degenerate case with infinite slope}
    else if CovYY = 0.0 then
      Result := 1.0      {Data has zero slope}
    else
      Result := GetCovarianceXY/Sqrt(CovXX*CovYY);
  end;



function TLinearRegression.GetIntercept:Double;

  begin

  {Note that if CovXX is zero, then no unique y-intercept exists}

  with FRegressionRec do

    Result:=(AvgY*AvgXX-AvgX*AvgXY)/GetCovarianceXX;

  end;



function TLinearRegression.GetSlope:Double;

  begin

  {Note that if CovXX is zero, then slope is infinite}

  with FRegressionRec do

    Result := GetCovarianceXY/GetCovarianceXX;

  end;



function TLinearRegression.Interpolate(x:Double):Double;

begin

  with FRegressionRec do

    Result:=x*Slope + Intercept;

end;



end.
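As a quick numeric cross-check of the derivation in the unit's header comment, here is a short sketch (in Python, an editor's illustration rather than part of the original Delphi source). It accumulates running averages the same way the Pascal Accumulate function does, then confirms that the covariance-based slope and intercept used by GetSlope and GetIntercept agree with the explicit solution u = inv(A'A)(A'y), reproducing the r=0.891, m=0.513, b=39.658 results quoted for the sample data below.

```python
# Editor's cross-check (Python, not part of the original Delphi source).
# Accumulate running averages exactly as the Pascal Accumulate function does,
# then compare the covariance-based slope/intercept against the explicit
# normal-equations solution u = inv(A'A)(A'y).

def accumulate(n, a, x):
    """Fold the n-th value x into the running average a of n-1 values."""
    return a*(n - 1)/n + x/n

# The sample project's Data1 set.
data = [(75,81),(78,73),(88,85),(92,85),(95,89),
        (67,73),(55,66),(73,81),(74,81),(80,81)]

avg_x = avg_y = avg_xx = avg_yy = avg_xy = 0.0
for n, (x, y) in enumerate(data, start=1):
    avg_x  = accumulate(n, avg_x,  x)
    avg_y  = accumulate(n, avg_y,  y)
    avg_xx = accumulate(n, avg_xx, x*x)
    avg_yy = accumulate(n, avg_yy, y*y)
    avg_xy = accumulate(n, avg_xy, x*y)

# Covariance form, as in GetSlope, GetIntercept and GetCorrelation:
cov_xx = avg_xx - avg_x*avg_x
cov_yy = avg_yy - avg_y*avg_y
cov_xy = avg_xy - avg_x*avg_y
m = cov_xy/cov_xx
b = (avg_y*avg_xx - avg_x*avg_xy)/cov_xx
r = cov_xy/(cov_xx*cov_yy)**0.5

# Explicit normal-equations solution of the 2x2 system A'Au = A'y, where
# A'A = [[n, sum(x)], [sum(x), sum(x^2)]] and A'y = [sum(y), sum(xy)]'.
count = len(data)
sx  = sum(x for x, _ in data)
sy  = sum(y for _, y in data)
sxx = sum(x*x for x, _ in data)
sxy = sum(x*y for x, y in data)
det = count*sxx - sx*sx
b2  = (sxx*sy - sx*sxy)/det    # intercept
m2  = (count*sxy - sx*sy)/det  # slope

assert abs(m - m2) < 1e-7 and abs(b - b2) < 1e-7
print(round(r, 3), round(m, 3), round(b, 3))   # 0.891 0.513 39.658
```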





Sample project file follows:



program Example;



uses

  WinCRT,linregr;



type TDataArray=Array[1..10,1..2] of Real;

const



  Data1:TDataArray=

  (

    (75,81),

    (78,73),

    (88,85),

    (92,85),

    (95,89),

    (67,73),

    (55,66),

    (73,81),

    (74,81),

    (80,81)

  );

{
  These are the "correct" results for the data set above:
  r=0.891, m=0.513, b=39.658
}



  Data2:TDataArray=

  ( {Zero slope}

    (75,1),

    (78,1),

    (88,1),

    (92,1),

    (95,1),

    (67,1),

    (55,1),

    (73,1),

    (74,1),

    (80,1)

  );

  Data3:TDataArray=

  ( {Infinite slope}

    (2,81),

    (2,73),

    (2,85),

    (2,85),

    (2,89),

    (2,73),

    (2,66),

    (2,81),

    (2,81),

    (2,81)

  );





procedure TestRegression(Data:TDataArray);

  var i:Word;

  begin

  with TLinearRegression.Create do

    begin

    for i:= 1 to 10 do

      Add(Data[i,1],Data[i,2]);

    try

      write(Correlation:12:3);

      write(Slope:12:3);

      writeln(Intercept:12:3);

      writeln('Average of X''s',AverageX:12:3);

      writeln('Average of Y''s',AverageY:12:3);

      writeln('Std Dev of X''s',sqrt(CovarianceXX):12:3);

      writeln('Std Dev of Y''s',sqrt(CovarianceYY):12:3);

    except

    end;

    writeln;

    Free;

    end;

  end;





begin

    writeln('This is what the correlation, slope and intercept values should be:');

    writeln(0.891:12:3,0.513:12:3,39.658:12:3,#13);



    writeln('Sample results:');

    TestRegression(Data1);



    writeln(#13'Data with zero slope');

    TestRegression(Data2);



    writeln(#13'Degenerate data with infinite slope');

    TestRegression(Data3);

end.
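The normalization note in the unit's header comment can also be checked numerically. Here is a short sketch (Python, an editor's illustration, not part of the original Pascal; the scale factors A = 0.01 and B = 0.1 are arbitrary examples): fit the data, fit a scaled copy (x',y') = (Ax,By), and confirm that m = m'*A/B and b = b'/B recover the unscaled slope and intercept.

```python
# Check of the scaling rule from the unit's header comment: if the inputs are
# scaled as (x', y') = (A*x, B*y), the fit on the scaled data (slope m',
# intercept b') relates to the unscaled fit by m = m'*A/B and b = b'/B.
# Editor's sketch; A and B below are arbitrary example scale factors.

def fit(pairs):
    """Plain least-squares slope and intercept for a list of (x, y) pairs."""
    n = len(pairs)
    sx  = sum(x for x, _ in pairs)
    sy  = sum(y for _, y in pairs)
    sxx = sum(x*x for x, _ in pairs)
    sxy = sum(x*y for x, y in pairs)
    det = n*sxx - sx*sx
    return (n*sxy - sx*sy)/det, (sxx*sy - sx*sxy)/det   # (slope, intercept)

# The sample project's Data1 set.
data = [(75,81),(78,73),(88,85),(92,85),(95,89),
        (67,73),(55,66),(73,81),(74,81),(80,81)]
A, B = 0.01, 0.1

m, b = fit(data)                                # unscaled fit
m_s, b_s = fit([(A*x, B*y) for x, y in data])   # fit on scaled data

assert abs(m - m_s*A/B) < 1e-7   # m = m'*A/B
assert abs(b - b_s/B) < 1e-7     # b = b'/B
```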

 
