close
Skip to main content
Log in

On codeword design in metric DNA spaces

  • Published:
Natural Computing Aims and scope Submit manuscript

Abstract

Finding a large set of single DNA strands that do not crosshybridize to themselves and/or to their complements is an important problem in DNA computing, self-assembly, and DNA memories. We describe a theoretical framework to analyze this problem, gauge its computational difficulty, and provide nearly optimal solutions. In this framework, codeword design is reduced to finding large sets of strands maximaly separated in a DNA space and the size of such sets depends on the geometry of these metric spaces. We show that codeword design is NP-complete using any single reasonable measure that approximates the Gibbs energy, thus practically excluding the possibility of finding any procedure to find maximal sets efficiently. Second, we extend a technique known as shuffling to provide a construction that yields provably nearly-maximal codes. Third, we propose a filtering process that removes strands creating pairs with low Gibbs energies, as approximated by the nearest-neighbor model. These two steps produce large codes of thermodynamic high quality. The proposed framework can be used to gain an understanding of the Gibbs energy landscapes for DNA strands on which much of DNA computing and self-assembly are based.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+
from $39.99 /Month
  • Starting from 10 chapters or articles per month
  • Access and download chapters and articles from more than 300k books and 2,500 journals
  • Cancel anytime
View plans

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Similar content being viewed by others

References

  • Adleman L (1994) Molecular computation of solutions of combinatorial problems. Science 266:1021–1024

    Article  Google Scholar 

  • Allawi H, SantaLucia J (1997) Thermodynamics and NMR of internal G.T mismatches in DNA. Biochemistry 36:10581–10594

    Article  Google Scholar 

  • Allawi H, SantaLucia J (1998a) Nearest neighbor thermodynamic parameters for internal G.A mismatches in DNA. Biochemistry 37:2170–2179

    Article  Google Scholar 

  • Allawi H, SantaLucia J (1998b) Thermodynamics of internal C.T mismatches in DNA. Nucleic Acids Res 26:2694–2701

    Article  Google Scholar 

  • Alon N, Spencer J (2000) The probabilistic method, 2nd edn. Wiley

  • Arita M, Kobayashi S (2002) DNA sequence design using templates. New Gen Comput 20:263–277

    Article  MATH  Google Scholar 

  • Baum E (1995) Building an associative memory vastly larger than the brain. Science 268:583–585

    Article  Google Scholar 

  • Bishop M, D’Yachkov A, Macula A, Renz T, Rykov V (2007) Free energy gap and statistical thermodynamic fidelity of DNA codes. J Comput Biol 14:1088–1104

    Article  MathSciNet  Google Scholar 

  • Brenneman A, Condon A (2002) Strand design for biomolecular computation. Theor Comput Sci 287:39–58

    Article  MATH  MathSciNet  Google Scholar 

  • Chen J, Deaton R, Wang J (2003) A DNA-based memory with in vitro learning and associative recall. In: Proc. DNA9, Springer-Verlag Lecture Notes in Computer Science, pp 145–156

  • Chen J, Deaton R, Garzon M, Kim J, Wood DHD, Wang Y (2004) Characterization of non-crosshybridizing DNA oligonucleotides manufactured in vitro. In: Proc. DNA10, Springer-Verlag Lecture Notes in Computer Science, pp 50–61

  • Cormen T, Leiserson C, Rivest R, Stein C (2001) Introduction to Algorithms, 2nd edn. The MIT Press

  • Deaton R, Garzon M, Murphy RE, Rose JA, Franceschetti DR, Stevens SE Jr (1998) The reliability and efficiency of a DNA computation. Phys Rev Lett 80

  • Deaton R, Chen J, Bi H, Rose J (2003) A software tool for generating non-crosshybridizing libraries of DNA oligonucleotides. In: Proc. DNA8, Springer-Verlag Lecture Notes in Computer Science, vol 2568. Springer-Verlag, London, pp 252–261

  • D’yachkov A, Macula A, Pogozelski W, Renz T, Rykov V, Torney D (2004) A weighted insertion-deletion stacked pair thermodynamic metric for DNA codes. In: Proc. DNA10, Springer-Verlag Lecture Notes in Computer Science, vol 3384, pp 90–103

  • Feldkamp U, Ruhe H, Banzhaf W (2003) Sofware tools for DNA sequence design. J Genetic Program Evol Mach 4:153–171

    Article  Google Scholar 

  • Frutos A, Condon A, Corn R (1997) Demonstration of a word design strategy for DNA computing on surface. Nucleic Acids Res 25:4748–4757

    Article  Google Scholar 

  • Garey M, Johnson D (1979) Computers and intractability: a guide to the theory of NP-completeness. W. H. Freeman

  • Garzon M, Deaton R (2004) Codeword design and information encoding in DNA ensembles. J Nat Comput 3:253–292

    Article  MATH  MathSciNet  Google Scholar 

  • Garzon M, Oehmen C (2002) Biomolecular computation in virtual test tubes. In: Proc. DNA7, Springer-Verlag Lecture Notes in Computer Science, vol 2340, pp 117–128

    MathSciNet  Google Scholar 

  • Garzon M, Deaton R, Neathery P, Murphy R, Franceschetti D, Stevens SE Jr (1997a) On the encoding problem for DNA computing. In: The third DIMACS workshop on DNA-based computing, pp 230–237

  • Garzon M, Neathery P, Deaton R, Murphy R, Franceschetti D, Stevens SE Jr (1997b) A new metric for DNA computing. In: Koza JR et al (eds) Proc. 2nd annual genetic programming conference. Morgan Kaufmann, pp 230–237

  • Garzon M, Blain D, Bobba K, Neel A, West M (2003) Self-assembly of DNA-like structures in silico. In: Garzon M (ed) Biomolecular machines and artificial evolution. Special Issue of the Journal of Genetic Programming and Evolvable Machines. Kluwer Academic Publishers, pp 185–200

  • Garzon M, Bobba K, Hyde B (2004) Digital information encoding on DNA. In: Aspects of molecular computing, Springer-Verlag Lecture Notes in Computer Science, vol 2590, pp 152–166

  • Garzon M, Phan V, Bobba K, Kontham R (2005a) Sensitivity and capacity of microarray encodings. In: Proc. DNA11, Springer-Verlag Lecture Notes in Computer Science, vol 3892, pp 81–95

  • Garzon M, Phan V, Bobba K, Kontham R (2005b) Sensitivity and capacity of microarray encodings. In: Proc. DNA11, Springer-Verlag Lecture Notes in Computer Science, vol 3892, pp 81–95

  • Garzon M, Phan V, Roy S, Neel A (2007) In search of optimal codes for DNA computing. In: Mao C, Yokomori T, Zhang B (eds) Proc. DNA11, Springer-Verlag Lecture Notes in Computer Science, vol 4287. Springer-Verlag, pp 143–156

  • King O (2003) Bounds for DNA codes with constant GC-content. J Combinatorics 10:R33

    Google Scholar 

  • Marathe A, Condon A, Corn R (1999) On combinatorial DNA word design. In: Winfree E, Gifford DK (eds) Proceedings 5th DIMACS workshop on DNA based computers. American Mathematical Society, pp 75–89

  • Mullis K (2001) The unusual origin of the polymerase chain reaction. The unusual origin of the polymerase chain reaction 262:56–61, 64–65

  • Phan V, Garzon M (2005) Information encoding using DNA. In: Proc. DNA10, Springer-Verlag Lecture Notes in Computer Science, vol, 3384, pp 281–292

  • Roman J (1995) The theory of error-correcting codes. Springer-Verlag, Berlin

    Google Scholar 

  • SantaLucia J (1998) A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics. Proc Natl Acad Sci 95:1460–1465

    Article  Google Scholar 

  • SantaLucia J, Hicks D (2004) Thermodynamics of DNA structural motifs. Annu Rev Biophys Biomol Struct 33:415–440

    Article  Google Scholar 

  • SantaLucia J Jr, Allawi H, Seneviratne P (1990) Improved nearest neighbor paramemeters for predicting duplex stability. Biochemistry 35:3555–3562

    Article  Google Scholar 

  • Seeman N (2003) DNA in a material world. Nature 421:427–431

    Article  MathSciNet  Google Scholar 

  • Shortreed M, Chang S, Hong D, Phillips M, Campion B, Tulpan D, Andronescu M, Condon A, Hoos H, Smith L (2005) A thermodynamic approach to designing structure-free combinatorial DNA word sets. Nucleic Acids Res 33:4965–4977

    Article  Google Scholar 

  • Tian Y, He Y, Chen Y, Yin P, Mao C (2005) Molecular devices—a DNAzyme that walks processively and autonomously along a one-dimensional track. Angewandte Chemie 44:4355–4358

    Article  Google Scholar 

  • Tulpan D, Andronescu M, Chang S, Shortreed M, Condon A, Hoos H, Smith L (2005) Thermodynamically based DNA strand design. Nucleic Acids Res 33:4951–4964

    Article  Google Scholar 

  • Watson J, Crick F (1953) Molecular structure of nucleic acids. A structure for deoxyribose nucleic acid. Nature 171

  • Wetmur J (1997) Physical chemistry of nucleic acid hybridization. In: Third annual DIMACS meeting on DNA based computers, pp 1–23

  • Yurke B, Turberfield A, Mills AP Jr, Neumann J (2003) A DNA-fuelled molecular machine made of DNA. Nature 406:605–608

    Google Scholar 

  • Zuker M, Mathews D, Turner D (1999) Algorithms and thermodynamics for RNA secondary structure prediction: a practical guide in rna biochemistry and biotechnology. In: Barciszewski J, Clark B (eds) NATO ASI series. Kluwer Academic Publishers

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Vinhthuy Phan.

Appendix

Appendix

In this section, we give a detailed proof that the h-measure is indeed a distance in the space of all poligos P n , for every n. A metric space is defined as a pair (S,d) consisting of a nonempty set S and a distance function d: S × S R (also called a metric) assigning a real number d(x,y) to every pair of elements x,y ∈ S so that three defining properties are satisfied for every triple x,y,z ∈ S, namely:

  1. 1.

    Nonnegativity: d(x,y) ≥ 0 and Strict nonnegativity: d(x,y) = 0 if and only if x = y;

  2. 2.

    Symmetry: d(x,y) = d(y,x);

  3. 3.

    Triangle inequality: d(x,z) ≤ d(x,y) + d(y,z).

The first property of a metric space fails for the h-measure, as was pointed above, because there are x’s such that 0 < h(x,x), for example x = aacc; as a result, the triangle inequality also fails because h(x,x′) + h(xx) = 0. To handle this problem, we pass to the appropriate poligo space consisting of subsets X defined by

$$ X =[x]=\{ y \in {\bf D}_{n}: h(x,y) = 0 \} = \{ x, x^\prime \}, $$
(3)

and defining the distance between poligos (still denoted h) as the set distances, i.e.,

$$ h(X,Y)=\mathop{\hbox{min}}\limits_{x \in X, y \in Y} \{ h(x,y) \}. $$
(4)

(We will continue to use the convention below that capital letters denote the poligos of corresponding n-mers given by lower case letters.)

Well known examples of a metric space are the binary hypercubes B n from information theory (Roman 1995) with the familiar Hamming distance obtained by counting the mismatched bits in a perfect alignment of x against y as binary strings

$$ H(x,y) = n - \sum \psi (x_i,y_i), $$

where ψ is the characteristic function with values 1 if the i th-bits x i ,y i are identical bits and 0 if they are different. It is easy to verify that H remains a distance if it is likewise defined on the DNA spaces D n . Moreover, H obviously satisfies several properties that will be used below, namely

$$ H(x,y) = H(x^r,y^r) = H(x^c,y^c) = H(x^\prime,y^\prime), $$
(5)

where r,c,′ are the reversal, pointwise Watson-Crick complement and their composition, the full Watson-Crick complementation operation ′, respectively. Also, H is additive over concatenation, i.e., for every four strings x,y,u,v with |x| = |y| and |u| = |v| (even if the two lengths are different), and for every permutation π of the indices i for the positions in the strands,

$$ H(ux,vy) = H(u,v) + h(x,y) \quad{\rm and }\ H(x,y) = H(\pi\cdot x, \pi\cdot y), $$
(6)

where π · x is the strand obtained by permuting the bases in x according to π, i.e., x i is moved to position π(i), i.e., (π · x)π(i) = x i .

The proof of Theorem 5.3 below will proceed in three stages. We will first establish that a particular case of the h-measure obtained by forcing alignments when calculating the h-measure (e.g., by appending much longer complementary primers to the ends of the strands in D n ) and given by

$$ h_0(x,y) = \hbox{min} \{ H(x,y), H(x,y^\prime) \} \, $$

makes a near-distance of the space D n of interest in its own right, i.e. it does satisfy the triangle inequality but only falls short of satisfiying h(x,y) > 0 for xy to be a full metric. (Note that h 0(x,x) = 0 and so, as noted above, the h 0-distance will then be a metric in poligo space P n .) Second, we will establish that similar measures h k are also near-metrics if given by

$$ h_k(x,y) = k + \hbox{min} \{ h_0(x[k], \sigma^k(y)[k]), h_0(x[k],\sigma^{-k}(y)[k]) \} \, (k > 0), $$
(7)

where x[k] (x[-k]) is the prefix (suffix, respectively) of x of length n − k + 1 overlapping y (or y′) after a shift of length k nucleotides to the right (left, respectively). Third, Theorem 5.3 will then follow because the h-distance can be obtained by passing to the appropriate poligo space consisting of subsets X of n-mer defined by X = {x,x′} and defining the distance between poligos as the set distance (4) above.

For the first stage, we establish that h 0 is a near-distance for every n ≥ 1.

Lemma 5.1

In D 2, if H(be,(df)′) = 0, then H(be,df) = 0 and be is a palindrome, or H(be,df) = 2.

Proof

If H(be,(df)′) = 0, then b = f′, e = d′, and H(be,df) = H(be,eb′) = H(be,(be)′). Thus, if the latter is 0, b = e′ and hence be = (be)′, a palindrome. H(be,df) = 1 would require b = e′ and eb′, which is impossible.□

It follows from property (6) that for every extension of strands x,z by four arbitrary nucleotides p,q,r,s,

$$ H(bxe,dzf)=H(x,z) + H(bd,ef) \;\; \hbox{and hence } $$
(8)
$$ H(bxe,(dzf)^\prime)=H(x,z^\prime) + H(be,(df)^\prime). $$
(9)

Since h 0(x,z) = min {H(x,z), H(x,z′)} ≤ H(x,z) and h 0(bd,ef) = min {H(bd,ef), H(bd,(ef)′)} ≤ H(bd,ef), and likewise for H(x,z′) and H(bd,(ef)′, it follows that

$$ \hbox{min} \{H(x,z), H(x,z^\prime)\} + \hbox{min} \{H(bd,ef), H(bd,(ef)^\prime)\} \le H(x,z) + H(bd,ef); $$

Likewise,

$$ \hbox{min} \{ H(x,z), H(x,z^\prime) \} + \hbox{min} \{ H(bd,ef), H(bd,(ef)^\prime) \} \le H(x,z^\prime) + H(bd,(ef)^\prime). $$

Therefore,

$$ h_0(x,z) + h_0(bd,ef) \le h_0(bxe,dzf) = \hbox{min} \{H(bxe,dzf), H(bxe,(dzf)^\prime) \}. $$
(10)

In fact, equality happens when both H(x,z) ≤ H(x,z′) and H(bd,ef) ≤ H(bd,(ef)′) because then H(x,z) +  H(bd,ef) ≤ H(x,z′) +  H(bd,(ef)′) and so H(bxe,dzf) = H(x,z) +  H(bd,ef); likewise when the opposite inequalities hold, or when H(bd,ef) = H(bd,(ef)′). Therefore, the inequality is strict only when we have H(x,z) ≤ H(x,z′) and H(bd,(ef)′) ≤ H(bd,ef), or vice versa, and if so, the difference between the left-hand and right-hand sides of the inequality is at most H(bd,ef), which is at most 2.

Theorem 5.2

D n is a near-metric space with the function h 0 in (4) for every n ≥ 1.

Proof

Nonnegativity and symmetry are clearly inherited from the corresponding properties of H. To establish the triangle inequality, we proceed by induction on n, the length of the strands. In D 1, h is identical to H and hence it satisfies the inequality. It is also easy to check the inequality hold for n = 2 as well, say by exhaustively checking all possible triples x,y,z. Assume inductively that the triangle inequality holds for h 0 in D m for all m < n. An arbitrary triple in D n (n > 2) can be written as bxe, pyq, dzf for some dimers be, pq,df such that

$$ h_0 (x,z) \le h_0(x,y) + h(y,z) \;\; \hbox{\rm and} \;\; h_0 (be,df) \le h_0(be,pq) + h(pq,df). $$

These inequalities in conjunction with inequality (8) imply that

$$ h_0(x,z)+h(be,df) \le h_0 (x,y) + h_0 (be,pq) + h_0 (y,z) + h_0 (pq,df) $$
(11)
$$ \le h_0(bxe,pyq) + h(pyq,dzf). $$
(12)

Now, by the remarks after inequality (10), the left-hand side is the desired h(bxe,dzf) except when H(x,z) ≤ H(x,z′) and H(bd,(ef)′) ≤ H(bd,ef), or vice versa. Assuming that the former is the case (the argument in the other case is identical) and considering that the integer value

$$ H(bxe,dzf) = \hbox{min} \{ H(x,z) + H(bd,ef), H(x,z^\prime) + H(bd,(ef)^\prime)\}, $$

the inequality can only fail if H(be,df) = 2 and H(be,(df)′) = 0 (so that the increase to obtain h(bxe,dzf) from h 0(x,z) +  h(be,df) is 1) but neither of the corresponding sums in the right-hand sums of (11) increases by at least one. In that case, again by Lemma 5.1, we have that H(bd,(pq)′) = 0 and H(pq,(ef)′) = 0, i.e. bd = (pq)′ = ef and hence H(db,ef) = 0, a contradiction. Therefore, an increase in going from the sum in the left-hand side of inequality~(11) to the desired h(bxe,dzf) forces an increase in one of the corresponding sums in the right-hand side and so the triangle inequality holds.□

For the second stage, observe that the measures h k defined by (7) is identical to the measure obtained by shifting y and y′ by |k| nucleotides and then computing the h 0-measure between the overlapping segments x[k] and y[k], i.e.,

$$ h_k(x,y) = \hbox{min} \{ |k| + H(x,\sigma^k(y)) , |k| + H(x,\sigma^k(y^\prime)) \} = |k| + h_0(x[k],y[k]). $$
(13)

Note that the poligos [x] in this case now consist of all strands having a common prefix (k > 0) or suffix (k < 0) of length n − k after a shift of |k| nucleotides to the right (or left, respectively) which are at h k measure 0 from one another.

For the third stage, observe that h can now be expressed as

$$ h(x,y)=\mathop{\hbox{min}}_{k \ge 0} \{ h_k(x,y) \}. $$
(14)

Theorem 5.3

Poligo space P n is a metric space with the h-distance.

Proof

It suffices to verify that the h-measure satisfies the triangle inequality for poligos because in the quotient space P n for h, a poligo X = [x] given by definition 3 must consist of n-mers y’s such that h(x,y) = 0. Since it if clear that kH(x,σ(y)) > 0 for k > 0, h(x,y) = 0 implies that h 0(x,y) = 0, i.e., that y = x′ and hence that X = {x,x′}. Therefore the appropriate poligos for h are precisely the poligos originally given for the h-measure. Moreover, strict nonnegativity of h follows from the previous observation that different poligos XY are disjoint and therefore h(X,Y) > 0 according to definition (4).

To verify the triangle inequality, consider three arbitrary strands x,y,z of length n, let k,j be the shifts where the minima for h(x,y) and h(y,z) are obtained in the expression (14), respectively. By equivalence (13), we thus need to show that

$$ h(x,z) \le h(x,y) + h(y,z)=|k| + h_0(x[k],y[k]) + |j| + h_0(y[j],z[j]). $$
(15)

Assuming that |j| ≤ |k| (the reverse case can be established likewise), inequality (14) and the triangle inequality for h 0 from Theorem 5.2 imply that

$$ \begin{aligned} h(x,z)\le&|j| + h_0(x[j],z[j]) \\ \le&|j| + h_0(x[j],y[j]) + h_0(y[j],z[j]) \\ \le&h(x,y) + h(y,z), \end{aligned} $$

since it is easy to verify that

$$ h_0(x[j],y[j]) \le h_0(x[k],y[k]) \le h_k(x,y) = h(x,y). $$

Rights and permissions

Reprints and permissions

About this article

Cite this article

Phan, V., Garzon, M.H. On codeword design in metric DNA spaces. Nat Comput 8, 571–588 (2009). https://doi.org/10.1007/s11047-008-9088-6

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue date:

  • DOI: https://doi.org/10.1007/s11047-008-9088-6

Keywords

Profiles

  1. Vinhthuy Phan
  2. Max H. Garzon