csp.gms : Closest String Problem

Description

The closest-string problem (CSP) finds a center string that minimizes
the Hamming distance between the center string and all other strings.
The Hamming distance counts the nonmatching characters. For example,
'CATCC' and 'CTTGC' have a Hamming distance of 2.

Three formulations are presented. Formulations 1 and 2 can only be used
for small problems. Formulation 3 is the most intuitive formulation
and works well with general purpose MIP codes.

It should be noted that the root node heuristic of CPLEX performs
very well. With an absolute gap of one or two, CPLEX finds solution within
a few seconds for all suggested sizes. For example, the setting below
will result in solutions that are either within 1.0 from he global optimum
or less than 1 percent:

option optCr = 0.01, optCa = 1.99;


Small Model of Type : MIP


Category : GAMS Model library


Main file : csp.gms

$title Closest String Problem (CSP,SEQ=306)

$onText
The closest-string problem (CSP) finds a center string that minimizes
the Hamming distance between the center string and all other strings.
The Hamming distance counts the nonmatching characters. For example,
'CATCC' and 'CTTGC' have a Hamming distance of 2.

Three formulations are presented. Formulations 1 and 2 can only be used
for small problems. Formulation 3 is the most intuitive formulation
and works well with general purpose MIP codes.

It should be noted that the root node heuristic of CPLEX performs
very well. With an absolute gap of one or two, CPLEX finds solution within
a few seconds for all suggested sizes. For example, the setting below
will result in solutions that are either within 1.0 from he global optimum
or less than 1 percent:

option optCr = 0.01, optCa = 1.99;


Meneses, C N, Lu, Z, Oliveira, C A S, and Pardalos, P M,
Optimal Solutions for the Closest-String Problem via Integer Programming.
INFORMS Journal on Computing 16, 4 (2004), 419-429.

Keywords: mixed integer linear programming, closest-string problem, computational biology
$offText

$eolCom //

Set
   n 'strings'
   m 'character sequence'
   a 'alphabet';

Parameter x(n,m) 'string values';

$if not set letters $set letters 26
$if not set strings $set strings  4
$if not set length  $set length   6

Set
   n / s1*s%strings% /
   m / c1*c%length%  /
   a / a1*a%letters% /;

* sample data from paper
Table x(n,m)
      c1 c2 c3 c4 c5 c6
   s1  4  9  6  6  5 18   // differ
   s2 13  5  4  9  1 14   // median
   s3 12  5 14  7 20  8   // length
   s4 13  5  4  9 21 13   // medium;

* recognize sample data
$if not %strings%+%length%+%letters% == 4+6+26
x(n,m) = uniformInt(1,card(a));

if(card(n)*card(m) > 50,
   option limCol = 0, limRow = 0, solPrint = off, resLim = 10, optCr = 0.01, optCa = 1.999;
else
   option resLim = 5, optCr = 0.0, optCa = 0.999;
);

* Formulation P1
Variable
   d      'maximum difference between t and x'
   t(m)   'reference string'
   z(n,m) 'string is different';

Binary Variable z;

Equation
   e1(n)   'lower bound for d'
   e2(n,m) 'lower bound on difference'
   e3(n,m) 'upper bound on difference';

e1(n)..   sum(m, z(n,m)) =l= d;

* x <> t
e2(n,m).. t(m) =l= t.up(m)*z(n,m) + x(n,m)*(1 - z(n,m));

e3(n,m).. t(m) =g= t.lo(m)*z(n,m) + x(n,m)*(1 - z(n,m));

t.lo(m) = smin(n, x(n,m));

t.up(m) = smax(n, x(n,m));

Model p1A / e1, e2, e3 /;

Parameter report 'summary report';
report(m,'t.lo') = t.lo(m);
report(m,'t.up') = t.up(m);

solve p1A min d using mip;

report(m,'p1A')            = t.l(m);
report('objective','p1A')  = p1a.objVal;
report('Est Global','p1A') = ceil(p1a.objEst - 1e-6);

* Formulation P2
Set ma(m,a) 'possible characters by position';
ma(m,a) = sum(n, ord(a) = x(n,m));

* display ma;

Binary Variable v(m,a) 'selection of characters';

Equation
   e4(m) 'select only one'
   e5(m) 'assign character value to t';

e4(m).. sum(ma(m,a), v(ma)) =e= 1;

e5(m).. t(m) =e= sum(ma(m,a), ord(a)*v(ma));

Model p2 / e1, e2, e3, e4, e5 /;

solve p2 min d using mip;

report(m,'p2')            = t.l(m);
report('objective','p2')  = p2.objval;
report('Est Global','p2') = ceil(p2.objEst - 1e-6);

* Formulation P3
Equation e6(n) 'count matching characters';

e6(n).. card(m) - sum(ma(m,a),(x(n,m) = ord(a))*v(ma)) =l= d;

Model p3 / e4, e6 /;

solve p3 min d using mip;

t.l(m) = sum(ma(m,a), ord(a)*v.l(ma));
report(m,'p3')            = t.l(m);
report('objective','p3')  = p3.objVal;
report('Est Global','p3') = ceil(p3.objEst - 1e-6);

display report;