LeastSquares.gms : Demonstrate the use of numpy.linalg.lstsq on the diabetes test problem using linalg ols

Description

Use the libInclude linalg ols to solve the least squares regression problem.
This model demonstrates how to obtain the same statistical values as with
the LS solver (dropped in GAMS 34). It also demonstrates how to call ols in
a loop with subsets of the feature set p.


Category : GAMS Data Utilities library


Main file : LeastSquares.gms   includes :  LeastSquares.gms  ls.gdx  diabetes_data.gdx

$title Demonstrate the use of numpy.linalg.lstsq on the diabetes test problem using linalg ols (LeastSquares,SEQ=141)

$ontext
Use the libInclude linalg ols to solve the least squares regression problem.
This model demonstrates how to obtain the same statistical values as with
the LS solver (dropped in GAMS 34). It also demonstrates how to call ols in
a loop with subsets of the feature set p.
$offtext
Set i   'patients'
    p   'features';
    
Parameter
    A(i<,p<) 'features collected from a cohort of diabetes patients'
    y(i)     'quantitative measure of disease progression one year after baseline';
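* The '<' in A(i<,p<) triggers implicit set definition: the domain sets i
* and p are populated from the records of A when the data is loaded below.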

* Data from https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html
$gdxin diabetes_data.gdx
$load A y
$gdxin
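* The scikit-learn diabetes data set contains 442 patients and 10 baseline
* features (age, sex, bmi, blood pressure, and six blood serum measurements).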

$if not set EXPLICITINTERCEPT $set EXPLICITINTERCEPT 1
$ifThen %EXPLICITINTERCEPT%==1
$onMulti
set p / b0 Intercept /;
$offMulti
A(i,'b0') = 1;
$endIf
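* With an explicit intercept, a constant column b0 of ones is appended to A,
* so ols estimates the intercept like any other coefficient. To fit without
* it, set the double-dash parameter on the command line, e.g.
*    gams LeastSquares.gms --EXPLICITINTERCEPT=0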

* When the LS solver is available (GAMS versions before 34), use this code to create ls.gdx with the statistics
$ifThen not gamsVersion 340
variables x(p), sse 'sum of squared errors';

equation fit(i) 'equation to fit', sumsq;
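* The LS solver needs only a dummy nonbinding (=n=) objective row; the
* solver itself computes the OLS estimates and the statistics stored in
* ls.gdx for the comparison below.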

sumsq..   sse =n= 0;
fit(i)..  y(i) =e= sum(p, x(p)*A(i,p));

model leastsq /all/;
option lp = ls; solve leastsq using lp minimizing sse;
$exit
$endIf
$log *** Using ls.gdx from a previous run with the LS solver

Scalars
    rss     'sum of squared errors'
    sigma   'Standard error'
    r2      'R-Squared'
    df      'Degrees of freedom'
    resvar  'Residual variance';
Parameters
    estimate(p) 'Estimated coefficients'
    se(p)       'Standard errors'
    tval(p)     't values'
    pval(p)     'p values'
    resid(i)    'residuals'
    fitted(i)   'fitted values for dependent variable'
    covar(p,p)  'variance-covariance matrix';

$FuncLibIn stolib stodclib
Functions
    cdfT     / stolib.CDFStudentT  / 
    icdfT    / stolib.iCDFstudentT /;
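* The stochastic extrinsic function library stodclib provides the CDF and
* inverse CDF of Student's t distribution; ols uses them for the p values,
* and the loop below uses icdfT for the confidence intervals.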

$libInclude linalg ols -CDFStudentT=cdfT -iCDFStudentT=icdfT -covar -sigma -r2 -df -rss -resvar -se -tval -pval -fitted  -resid i p A y estimate
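* Positional arguments: row set i, feature set p, data matrix A, dependent
* variable y, and the output parameter estimate. Each -statistic option
* requests that the named result be returned in the identically named
* symbol declared above; -CDFStudentT and -iCDFStudentT pass in the
* t distribution functions defined above.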

Set alpha / "90%", "95%", "97.5%", "99%" /;
Parameter
    av(alpha) / "90%"   .9
                "95%"   .95
                "97.5%" .975
                "99%"   .99 /;

Parameter confint(alpha,p,*) 'confidence intervals';
Scalar t 'critical value of the t distribution';
loop(alpha,
  t = -icdfT((1-av(alpha))/2, df);
  confint(alpha, p, 'LO') = estimate(p) - t*se(p);
  confint(alpha, p, 'UP') = estimate(p) + t*se(p);
);
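* Two-sided interval: estimate +/- t*se, where t is the (1+av)/2 quantile
* of the t distribution with df degrees of freedom; by symmetry this equals
* -icdfT((1-av)/2,df) as computed above.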
execute_unload 'ls_np.gdx',confint,covar,df,estimate,fitted,pval,r2,resid,resvar,rss,se,sigma,tval;
execute.checkErrorLevel 'gdxdiff ls.gdx ls_np.gdx eps=1e-4 releps=1e-4 > %system.nullFile%';
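* gdxdiff returns a nonzero errorlevel when the files differ by more than
* the given tolerances; execute.checkErrorLevel then aborts the run, so the
* numpy results are verified against the stored LS solver results.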

* Now demonstrate dynamic use of OLS with subsets of features
Set pp(p)    'subset of features'
    iter     'iterations' / iter1*iter20 /
    bestp(p) 'best subset of features';
Parameter
    bestrss  'smallest rss found so far' / inf /
    beste(p) 'best estimate';
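* Random subset selection: each iteration keeps each feature with
* probability 0.5, refits, and remembers the subset with the smallest rss.
* -mergeType=REPLACE clears the previous iteration's results before each fit.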
    
loop(iter,
   option clear=pp; pp(p)$(uniform(0,1)<0.5) = yes;
$if %EXPLICITINTERCEPT%==1  pp('b0') = yes;
$  libInclude linalg ols -mergeType=REPLACE -rss i pp A y estimate
   put_utility 'log' / iter.tl:10 ': rss=' rss;
   if (bestrss>rss, bestp(p) = pp(p); bestrss = rss; beste(p) = estimate(p));
);
display bestrss, bestp, beste;