The Grid and Multi-Threading Solve Facility

Introduction

The GAMS Grid facility allows to take advantage of High Performance Computing Grids and systems with multiple CPUs. This language feature facilitates the management of asynchronous submission and collection of model solution tasks in a platform independent fashion. A simple architecture, relying on existing operating system functionality allows for rapid introduction of new environments and provides for an open research architecture.

A typical application uses a coarse grain approach involving hundreds or thousands of model solutions tasks which can be carried out in parallel. Examples include but are not limited to scenario analysis, Monte Carlo simulations, Lagrangian relaxation, decomposition algorithms and advanced solution approaches.

The grid features work on all GAMS platforms and have been tailored to many different environments, like the Condor Resource Manager, a system for high throughput computing from the University of Wisconsin-Madison. Researchers using Condor reported a delivery of 5000 CPU hours in 20 hours wall clock time.

Similarly, the GAMS Multi-Threading Solve Facility allows the asynchronous submission and collection of model solution tasks on a single, multi-threaded machine while using efficient in-memory communication between GAMS and the solver.

Disclaimer: The use of the term grid computing may be offensive to some purists in the computer science world. We use it very loosely to refer to a collection of computing components that allow us to provide high throughput to certain applications. One may also think of it as a resurrection of the commercial service bureau concept of some 30 years ago.

Caution: Although these features have been tested on all platforms and are part of our standard release we may change the approach and introduce alternative mechanisms in the future.

Acknowledgments: Prof. Monique Guignard-Spielberg and Antoine Sauré at that time at Wharton School at the University of Pennsylvania introduced us to parallel Lagrangian relaxation on the SUN Grid Environment. Prof. Michael Ferris from the University of Wisconsin-Madison adopted our original GAMS grid approach to the high throughput system Condor and helped to make this approach a practical proposition.

The Grid Facility: Basic Concepts

The Grid facility separates the solution process into several steps which then can be controlled separately. First we will review the steps taken during synchronous solution and then we will introduce the asynchronous or parallel solution steps.

When GAMS encounters a solve statement during execution it proceeds in three basic steps:

  1. Generation: The symbolic equations of the model are used to instantiate the model using the current state of the GAMS data base. This instance contains all information and services needed by a solution method to attempt a solution. This representation is independent of the solver and computing platform.
  2. Solution: The model instance is handed over to a solver and GAMS will wait until it terminates.
  3. Update: The detailed solution and statistics are passed to GAMS from the solver to update the GAMS data base.

In most cases, the time taken to generate the model and update the data base with the solution will be much smaller than the actual time spent in a specific solver. The model generation will take a few seconds, whereas the time to obtain an optimal solution may take a few minutes to several hours or even longer. If sequential model solutions do not depend on each other, we can solve in parallel and update the data base in a random order. All we need is a facility to generate models, submit them for solution and continue. At a convenient point in our GAMS program we will then look for the completed solutions and update the data base accordingly. To summarize, solving in parallel entails two steps:

  1. Submission Loop: In this phase we will generate and submit models for solutions that can be solved independently.
  2. Collection Loop: The solutions of the previously submitted models are collected as soon as a solution is available. It may be necessary to wait for some solutions to complete by pausing the execution for some time.

Note that we have assumed that there will be no errors in any of these steps. Of course, this will not always be the case and elaborate mechanisms are in place to make the operation fail-safe.

Note
For scenario analysis the solver GUSS might be of particular interest. The model [GUSSGRID] demonstrates how GUSS is used together with the Grid facility.

The Grid Facility: A First Example

In this section we will illustrate the use of the basic grid facility with the model [QMEANVAR]. This model traces an efficiency frontier for restructuring an investment portfolio. Each point on the frontier requires the solution of independent quadratic mixed integer models. The original solution loop is shown below:

loop(p(pp),
   ret.fx = rmin + (rmax-rmin)/(card(pp)+1)*ord(pp) ;
   solve minvar min var using miqcp ;
   xres(i,p)         = x.l(i);
   report(p,i,'inc') = xi.l(i);
   report(p,i,'dec') = xd.l(i); 
);

This loop will save the solutions of the model minvar for different returns ret. As the solutions do not depend on the order in which they are carried out, we can rewrite this loop to operate in parallel.

The Submission Loop

The first step for solving in parallel using the Grid facility is to write the submission loop:

Parameter h(pp)      'model handles';
minvar.solveLink = 3;
loop(p(pp),
   ret.fx = rmin + (rmax-rmin)/(card(pp)+1)*ord(pp) ;
   solve minvar min var using miqcp;
   h(pp) = minvar.handle;
);

The model attribute solveLink controls the behavior of the solve statement. The value of 3 (which is equivalent to the compile-time constant %solveLink.asyncGrid%) directs GAMS to generate the model and submit it for solution and then continue without waiting for the completion of the solution step. Thus with setting minvar.solveLink to 3 we activate grid computing.

A handle in the grid environment identifies the particular model and data instances available. The model attribute handle contains a unique identification of each submitted solution request and is typically stored in a parameter defined over a set that covers all model instances. The specific numerical values of handles are assigned by GAMS and may be used to recover solutions and manage models that are solved on the grid. In our example, the handle parameter is h and the set of all model instances is pp. The handle values that are stored in h are later used to collect the solutions once the solution processes are completed.

The Collection Loop

We collect the solutions with the following collection loop:

loop(pp$handleCollect(h(pp)),
   xres(i,pp)         = x.l(i);
   report(pp,i,'inc') = xi.l(i);
   report(pp,i,'dec') = xd.l(i);
);

Note that the dollar condition restricts the looping set pp to those elements which return a nonzero value to the function handleCollect(h(pp)). The function handleCollect tests the solution status for each element of the set pp. And if the solution is available, it is read into the GAMS data base. In this case the function returns a value of 1. If the solution is not ready to be retrieved, the value zero will be returned.

Observe that the collection loop above has one big flaw. If a solution has not been ready (that is if handleCollect equaled zero), it will not be retrieved. We need to call this loop several times until all solutions have been retrieved or we get tired of it and quit. We will use a repeat until construct and the handle parameter h to control the loop to look only for the solutions that have not been loaded yet. The code follows:

repeat
   loop(pp$handleCollect(h(pp)),
         xres(i,pp)         = x.l(i);
         report(pp,i,'inc') = xi.l(i);
         report(pp,i,'dec') = xd.l(i);
         display$handleDelete(h(pp)) 'trouble deleting handles' ;
         h(pp) = 0; 
   ) ;
   display$readyCollect(h, 100) 'Problem waiting for next instance to complete';
until card(h) = 0 or timeelapsed > 100;
xres(i,pp)$h(pp) = na;

Once we have extracted a solution we will set the handle parameter h to zero. In addition, we want to remove the instance from the system by calling the function handleDelete which returns zero if successful. No harm is done if it fails but we want to be notified via the conditional display statement. Before running the collection loop again, we may want to wait a while to give the system time to complete more solution steps. This is done with the function readyCollect which waits until another model instance with a handle in h is ready to be collected (or the optionally defined number of seconds has passed). The final wrinkle is to terminate if all model instances have been deleted from the system since their solutions were retrieved or after 100 seconds have elapsed, even if we did not get all solutions. This is accomplished with the function timeElapsed and is important, because if one of the solution steps fails our program would never terminate. Recall that the handle parameter h equals zero for all elements of the set pp whose related models have been solved and their solutions have been extracted. The last statement in the code above sets the results of the missed solves to na to signal the failed solve. The parameter h will now contain the handles of the failed solves for later analysis.

Alternatively, we could use the function handleStatus and collect the solutions that are stored in a GDX file. For example, we could write:

loop(pp$(handleStatus(h(pp)) = 2),
   minvar.handle = h(pp);
   execute_loadhandle minvar;
   xres(i,pp)         = x.l(i);
   report(pp,i,'inc') = xi.l(i);
   report(pp,i,'dec') = xd.l(i); 
);

The function handleStatus tests the solution process and returns the value 2 if the solution process has been completed and the results can be retrieved. The solution is stored in a GDX file which can be loaded in a way similar to other GDX solution points. First, we need to specify which solution to retrieve by setting the the model attribute minvar.handle to the appropriate value. Then we can use the statement execute_loadhandle minvar; to load the solution for the model minvar back into the GAMS data base.

Note
Except for the requirement of a model with a previously specified handle, the command execute_loadhandle operates like the procedure execute_loadpoint.

Using the function handleStatus and the command execute_loadhandle instead of the simpler handleCollect, adds one more layer of control to the final collection loop. Now we need one additional if statement inside the collection loop above:

repeat
   loop(pp$h(pp),
      if(handleStatus(h(pp)) = 2,
         minvar.handle = h(pp);
         execute_loadhandle minvar;
         xres(i,pp)         = x.l(i);
         report(pp,i,'inc') = xi.l(i);
         report(pp,i,'dec') = xd.l(i);
         display$handleDelete(h(pp)) 'trouble deleting handles' ;
         h(pp) = 0; 
      ); 
   ) ;
   display$readyCollect(h, 100) 'Problem waiting for next instance to complete';
until card(h) = 0 or timeelapsed > 100;
xres(i,pp)$h(pp) = na;

Finally, we are ready to run the modified model.

The Execution Log

The execution log will contain some new information that may be useful for more advanced applications:

     --- LOOPS pp = p1
     ---   46 rows 37 columns 119 non-zeroes
     ---   311 nl-code 7 nl-non-zeroes
     ---   14 discrete-columns
     --- Submitting model minvar with handle grid137000002
     --- Executing after solve
     ...
     --- GDXin=C:\answerv5\gams_srcdev\225j\grid137000003\gmsgrid.gdx
     --- Removing handle grid137000003

Note that the log contains some additional information about the submission, retrieval and removal of the solution instance. In the following sections we will make use of this additional information.

The execution log does not cotain any solver log. Though, if a logOption value is set, that triggers writing to a file (i.e. 2 or 4), the solver log of each job will be written to the corresponding folder in the Grid Directory. So, if the solver log should be inspected when the job is done, one needs to make sure, that these folders do not get cleaned automatically, e.g. by setting the gridDir command line parameter.

For a complete example for grid computing, see the grid enabled transport model [TRNSGRID].

Observe that we have made no assumptions about what kind of solvers and what kind of computing environment we will operate. The example above is completely platform and solver independent and it runs on a Windows laptop or on a massive grid network like the Condor system without any changes in the GAMS source code.

Advanced Use of Grid Features

In this section we will describe a few special application requirements and show how this can be handled with the current system. Some of those applications may involve thousands of model instances with solution times of many hours each. Some may fail and require resubmission. More complex examples require communication and the use of GAMS facilities like the Branch-and-Cut-and-Heuristic Facility (BCH), which submit other models from within a running solver.

Imagine a situation with thousands of model instances each taking between minutes and many hours to solve. We will break the master program into a submitting program, an inquire program and a final collection program. We will again use the model [QMEANVAR] to demonstrate the principle. We will split the code of the modified [QMEANVAR] GAMS code into three components: qsubmit, qcheck and qreport.

Very Long Job Durations: The Submitting Program

The file qsubmit.gms will include everything up to and including the new submission loop. To save the instances we will need a unique grid directory gdir and to restart the problem we will have to create a save file. For details on the save and restart facility in GAMS, see chapter The Save and Restart Feature. When running the first job, we will use the command line parameter save or its synonym s to create the required save file:

   > gams qsubmit s=submit gdir=c:\test\grid

Very Long Job Durations: The Inquire Program

The solution of all the model instances may take hours. From time to time we may run a quick inquiry job to learn about the status. The following program qcheck.gms will list the current status:

Parameter status(pp,*);
Scalar    handle;
Acronym   BadHandle, Waiting, Ready;
loop(pp,
   handle = handleStatus(h(pp));
   if(handle=0,
      handle = BadHandle;
   elseif handle=2,
      handle = Ready;
      minvar.handle = h(pp);
      execute_loadhandle minvar;
      status(pp,'solvestat') = minvar.solvestat;
      status(pp,'modelstat') = minvar.modelstat;
      status(pp,'seconds') = minvar.resusd;
   else
      handle = Waiting;
   );
   status(pp,'status') = handle;
);
display status;

For details on the model attributes referenced in the code above, see handle, solveStat, modelStat and resUsd. To run the program above, we will restart from the previous save file by using the command line parameter restart or its synonym r.

   > gams qcheck r=submit gdir=c:\test\grid

The output generated by the display statement may look like the following:

   ----   173 PARAMETER status

       solvestat   modelstat   seconds     status

   p1      1.000       1.000     0.328      Ready
   p2      1.000       1.000     0.171      Ready
   p3                                     Waiting
   p4                                     Waiting
   p5      1.000       1.000     0.046      Ready

We may want to do some more detailed analysis on one of the solved model instances. The respective program, called qanalyze.gms, may inlcude the following lines of code:

$if not set instance $abort --instance is missing
if(not handleStatus(h('%instance%')),
     abort$yes 'model instance %instance% not ready');
minvar.handle = h('%instance%');
execute_loadhandle minvar;
display x.l,xi.l,xd.l;
...

For information on dollar control options, see chapter Dollar Control Options, especially the detailed descriptions of the options $if and $abort. Note that instance is a compile-time variable. The program may be called using a double dash paramter, which defines and sets a GAMS compile-time variable:

    > gams qanalyze r=submit gdir=c:\test\grid --instance=p4

Very Long Job Durations: The Collection Program

Once all jobs are completed we are ready for the collection loop. For simplicity, we will not include the repeat loop, because we would not run the final collection program unless we were satisfied that we got most of the solutions that we wanted. The file qreport.gms could look like the following:

loop(pp$handleStatus(h(pp)),
   minvar.handle = h(pp);
   execute_loadhandle minvar;
   xres(i,pp)         = x.l(i);
   report(pp,i,'inc') = xi.l(i);
   report(pp,i,'dec') = xd.l(i);
   display$handleDelete(h(pp)) 'trouble deleting handles' ;
   h(pp) = 0; 
);
xres(i,pp)$h(pp) = na;
...

We will restart the program above from the save file that was created earlier:

   > gams qreport r=submit gdir=c:\test\grid 

Note that it is not necessary to run the job from the same directory that we have used for the initial submission; it is even possible to use a different operating system.

Summary of Grid Features

We introduced several GAMS features to facilitate the asynchronous or parallel execution of the solve statement. These GAMS features are summarized in the following subsections.

In addition to the features described below, the option or command line parameter ThreadsAsync was introduced. ThreadsAsync controls the number of threads or CPU cores that are used in multi-threading computing.

Grid Handle Functions

The grid handle functions are listed in Table 1. For details on functions in GAMS in general and complete lists of all GAMS functions, see section Functions. Note that the desired return values - the return values that indicate that no error has occurred - are marked with bold letters.

Function Description Return Values
handleCollect(HANDLE) Tests if the solve of the model instance identified by HANDLE is done: if so, it loads the solution into the GAMS data base. If the option asyncSolLst is active the solution listing is printed to the listing file.
Note that handleCollect ignores the setting of the option SolveOpt and always uses the default value merge.
0: The model instance was not ready or could not be loaded.
>0: The model instance solution has been loaded. (When using the Grid facility, this is always 1; when using the multi-threading option, this returns the thread ID used.)
handleStatus(HANDLE) Tests if the solve of the model instance identified by HANDLE is done.
Note that there are compile-time constants that are related to this function.
0: The model instance is not known to the system.
1: The model instance exists but the solution process is incomplete.
2: The solution process has terminated and the solution is ready for retrieval.
3: The solution process signaled completion but the solution cannot be retrieved.
handleDelete(HANDLE) Deletes the model instance identified by HANDLE and returns a numerical indicator of the status of the deletion. If the HANDLE given is not valid, an execution error is triggered. 0: The model instance has been removed.
1: The argument HANDLE is not a legal handle.
2: The model instance is not known to the system.
3: The deletion of the model instance encountered errors.
handleSubmit(HANDLE) Resubmits the model instance identified by HANDLE for solution. In case of a nonzero return an execution error is triggered. 0: The model instance has been resubmitted for solution.
1: The argument HANDLE is not a legal handle.
2: The model instance is not known to the system.
3: The completion signal could not be removed.
4: The resubmit procedure could not be found.
5: The resubmit process could not be started.
readyCollect(HANDLES[,maxWait]) Waits until a model solution is ready to be collected. HANDLES must be either a scalar or parameter containing one or more model handles or a model with its handle attribute. MaxWait specifies the maximum time to wait in seconds, the default value is +inf. 0: One or more of the requested jobs is/are ready.
1: There is neither an active job to wait for nor a job ready to be collected.
2: The handle symbol is empty.
3: The argument is not a legal handle.
4: User specified time-out (using a solveLink = 6 handle).
5: User specified time-out (using a solveLink = 3 handle).
8: Unknown error (should not happen).

Table 1: Grid Handle Functions

Note that GAMS might issue execution errors which could give additional information that may help to identify the source of problems. The function execError may be used to get and set the number of execution errors.

Grid Model Attributes

Model attributes are introduced in section Model Attributes. The following three model attributes are particularly relevant for grid computing:

Function Description
solveLink Specifies the solver linking conventions. The following values direct the solve statement to use grid computing or multi-thread computing: 3, 4, 6 and 7.
Note that the default for this model attribute can be set as command line parameter and option statement.
For more information, see the detailed description.
handle Specifies the current instance handle. This is used to identify a specific model instance and to provide additional information needed for the process signal management (compare subsections The Submission Loop and The Collection Loop above).
number Specifies the current instance number. Any time a solve is attempted for a model, the instance number is incremented by one and the handle is update accordingly. The instance number can be reset by the user which then resynchronizes the handle.

Table 2: Grid Handle Attributes

Grid Solution Retrieval

As an alternative to the function handleCollect a solution may be retrieved with the following statement:

execute_loadhandle mymodel;

This statement will update the GAMS data base with the status and solution for the current instance of mymodel. Note that the underlying mechanism is a GDX file. Except for the requirement of a model with a previously specified handle, this command operates like the procedure execute_loadpoint. If the option asyncSolLst is active the solution listing is printed to the listing file.

The Grid Directory

The instantiated (generated) models and their corresponding solutions are kept in unique directories that may be reached from the submitting system. Each GAMS job may have only one grid directory. By default, the grid directory is assumed to be the scratch directory. This may be overwritten by using the GAMS command line parameter GridDir, or short GDir. An example follows.

    > gams myprogram ... GDir=gridpath

If gridpath is not a fully qualified name, the name will be completed using the current directory. If the grid path does not exist, an error will be issued and the GAMS job will be terminated. A related GAMS parameter is ScrDir or short SD.

Recall the following default mechanism: When a GAMS job is started a unique process directory is created in the current directory. These directories are named 225a to 225zz. When a GAMS job terminates, the system will remove the process directory at the completion of a GAMS job. Any file that has not been created by the GAMS core system will be flagged. If the call gamskeep instead of gams is used, another exit script will be activated that results in the process directory to be kept.

Note that if we do not specify a scratch directory, the scratch directory will be the same as the process directory. If we do not specify a grid directory, the grid directory will be the same as the scratch directory.

Observe that if we assume that some of the model instances may fail or we want to break the GAMS program into several pieces to run as separate jobs, we need to be careful not to remove the model instance we have not completely processed. In such cases we have to use the parameter GridDir, so that we may access previously created model instances.

The Grid Facility: Architecture and Customization

The current Grid facility relies on very basic operating system features and does not attempt to offer real and direct job or process control. The file system is used to signal the completion of a submitted task and GAMS has currently no other way to interact with the submitted process directly, like forcing termination or change the priority of a submitted task. This approach has its obvious advantages and disadvantages. There are a number of attempts to use grid computing to provide value added commercial remote computing services.

When GAMS executes a solve with the option solveLink set to 3 it will perform the following steps:

  1. Create a subdirectory in the GridDir with the name gridnnn. Here nnn stands for the numeric value of the handle. The handle value is the internal symbol ID number x 1e6 + the model instance number. For example, in the [QMEANVAR] example the first grid subdirectory was grid137000002.
  2. Remove the completion signal in case the file already exists. Currently the signal is a file called finished. For example, grid137000002/finished.
  3. Create or replace a GDX file called gmsgrid.gdx which will contain a dummy solution with failed model and solver status. This file will be overwritten by the final step of the solution process and will be read when calling execute_loadhandle.
  4. Place all standard GAMS solver interface files into the above instance directory.
  5. Execute the submission wrapper called gmsgrid.cmd under Windows or gmsgrid.run under Unix. These submission scripts are usually located in the GAMS system directory, they may be located via the current path if they are not found in the GAMS system directory.

The grid submission script gmsgrid.cmd or gmsgrid.run is called with four arguments that are needed to make a standard GAMS solver call: the solver executable file name, the solver control file name, the solver scratch directory, and the solver name. The submission script then does the final submission to the operating system. This final script will perform the following steps:

  1. call the solver,
  2. call a utility that will create the final GDX file gmsgrid.gdx,
  3. set the completion signal finished.

If we want to use the function handleSubmit we will also have to create the script gmsrerun.cmd or gmsrerun.run. This script could later be used to resubmit the job.

For example, the default submission script for Windows is shown below:

@echo off

: gams grid submission script
:
: arg1 solver executable
:    2 control file
:    3 scratch directory
:    4 solver name
:
: gmscr_nx.exe processes the solution and produces 'gmsgrid.gdx'
:
: note: %3 will be the short name, this is neeeded because
:       the START command cannot handle spaces or "...'
:       before we use %~3  will strip surrounding "..."
:       makes the name short
:
: gmsrerun.cmd will resubmit runit.cmd

echo @echo off                      > %3runit.cmd
echo %1 %2 %4                      >> %3runit.cmd
echo gmscr_nx.exe %2               >> %3runit.cmd
echo echo OK ^> %3finished ^& exit >> %3runit.cmd

echo @start /b /belownormal %3runit.cmd ^> nul  > %3gmsrerun.cmd

start /b /belownormal %3runit.cmd > nul

exit

Grid Submission Testing

The grid submission process can be tested on any GAMS program without having to change the source text. The option solveLink=4 instructs the solve statement to use the grid submission process and then wait until the results are available. Note that the option solveLink may be set via a GAMS command line parameter, a GAMS option statement or via assignment to the model attribute. Once the model instance has been submitted for solution, GAMS will check if the job has been completed. It will keep checking twice the reslim seconds allocated for this optimization job and report a failure if this limit has been exceeded. After successful or failed retrieval of the solution, GAMS will remove the grid directory, unless we have used the call gamskeep or have set the GAMS command line parameter keep.

Multi-Threading

As we have described in this chapter, each solve is handled in its own process space with the Grid facility. Recall that the Grid facility is activated by setting the option or model attribute solveLink to 3 or 4. If solveLink is set to 6 (or the compile-time constant %solveLink.asyncThreads%) instead, a separate thread is used. This allows efficient in-memory communication between GAMS and the solver, like it is done if the option solveLink is set to 5 (or the compile-time constant %solveLink.loadLibrary%).

Apart from this, the multi-threading facility works in the same way as the Grid facility. The solve statement generates the model and passes it to the solver in a separate thread, then a handle of the model instance may be stored using the model attribute handle and the grid handle functions may be used to collect the solution and deal with the model instance, namely handleCollect, handleDelete, handleStatus and readyCollect.

Note that the option or command line parameter ThreadsAsync sets the maximum number of threads that should be used for the asynchronous solves.

The following matrix shows which solvers may be used with solveLink = 6 on which platform:

Solver x86 64bit
MS Windows
x86 64bit
Linux
x86 64bit
macOS
arm 64bit
macOS
CONOPT3 × × × ×
CONOPT4 × × × ×
CONVERT × × × ×
COPT × × × ×
CPLEX × × × ×
GUROBI × × × ×
HIGHS × × ×
IPOPT × × × ×
MOSEK × × × ×
ODHCPLEX × ×
SCIP × × × ×
SHOT × × × ×
SNOPT ×
Soplex × × × ×
XPRESS × × ×

If a solver is selected for which solveLink = 6 is not supported on the corresponding platform, solveLink = 3 will be used instead and it will be noted in the log.

Multi-threading Submission Testing

The multi-threading submission process may be tested on any GAMS program without having to change the source text. The option solveLink = 7 (or, equivalently, the compile-time constant %solveLink.threadsSimulate%) instructs the solve statement to use the multi-threading submission process, wait until the results are available and then load the solution into the GAMS data base. Once the model instance has been submitted for solution, GAMS will check if the job has been completed. It will keep checking twice the reslim seconds allocated for this optimization job and report a failure if this limit has been exceed. After successful or failed retrieval of the solution GAMS will remove the thread handle.