Abstract
Finding a large set of single DNA strands that do not crosshybridize to themselves and/or to their complements is an important problem in DNA computing, self-assembly, and DNA memories. We describe a theoretical framework to analyze this problem, gauge its computational difficulty, and provide nearly optimal solutions. In this framework, codeword design is reduced to finding large sets of strands maximaly separated in a DNA space and the size of such sets depends on the geometry of these metric spaces. We show that codeword design is NP-complete using any single reasonable measure that approximates the Gibbs energy, thus practically excluding the possibility of finding any procedure to find maximal sets efficiently. Second, we extend a technique known as shuffling to provide a construction that yields provably nearly-maximal codes. Third, we propose a filtering process that removes strands creating pairs with low Gibbs energies, as approximated by the nearest-neighbor model. These two steps produce large codes of thermodynamic high quality. The proposed framework can be used to gain an understanding of the Gibbs energy landscapes for DNA strands on which much of DNA computing and self-assembly are based.
Similar content being viewed by others
References
Adleman L (1994) Molecular computation of solutions of combinatorial problems. Science 266:1021–1024
Allawi H, SantaLucia J (1997) Thermodynamics and NMR of internal G.T mismatches in DNA. Biochemistry 36:10581–10594
Allawi H, SantaLucia J (1998a) Nearest neighbor thermodynamic parameters for internal G.A mismatches in DNA. Biochemistry 37:2170–2179
Allawi H, SantaLucia J (1998b) Thermodynamics of internal C.T mismatches in DNA. Nucleic Acids Res 26:2694–2701
Alon N, Spencer J (2000) The probabilistic method, 2nd edn. Wiley
Arita M, Kobayashi S (2002) DNA sequence design using templates. New Gen Comput 20:263–277
Baum E (1995) Building an associative memory vastly larger than the brain. Science 268:583–585
Bishop M, D’Yachkov A, Macula A, Renz T, Rykov V (2007) Free energy gap and statistical thermodynamic fidelity of DNA codes. J Comput Biol 14:1088–1104
Brenneman A, Condon A (2002) Strand design for biomolecular computation. Theor Comput Sci 287:39–58
Chen J, Deaton R, Wang J (2003) A DNA-based memory with in vitro learning and associative recall. In: Proc. DNA9, Springer-Verlag Lecture Notes in Computer Science, pp 145–156
Chen J, Deaton R, Garzon M, Kim J, Wood DHD, Wang Y (2004) Characterization of non-crosshybridizing DNA oligonucleotides manufactured in vitro. In: Proc. DNA10, Springer-Verlag Lecture Notes in Computer Science, pp 50–61
Cormen T, Leiserson C, Rivest R, Stein C (2001) Introduction to Algorithms, 2nd edn. The MIT Press
Deaton R, Garzon M, Murphy RE, Rose JA, Franceschetti DR, Stevens SE Jr (1998) The reliability and efficiency of a DNA computation. Phys Rev Lett 80
Deaton R, Chen J, Bi H, Rose J (2003) A software tool for generating non-crosshybridizing libraries of DNA oligonucleotides. In: Proc. DNA8, Springer-Verlag Lecture Notes in Computer Science, vol 2568. Springer-Verlag, London, pp 252–261
D’yachkov A, Macula A, Pogozelski W, Renz T, Rykov V, Torney D (2004) A weighted insertion-deletion stacked pair thermodynamic metric for DNA codes. In: Proc. DNA10, Springer-Verlag Lecture Notes in Computer Science, vol 3384, pp 90–103
Feldkamp U, Ruhe H, Banzhaf W (2003) Sofware tools for DNA sequence design. J Genetic Program Evol Mach 4:153–171
Frutos A, Condon A, Corn R (1997) Demonstration of a word design strategy for DNA computing on surface. Nucleic Acids Res 25:4748–4757
Garey M, Johnson D (1979) Computers and intractability: a guide to the theory of NP-completeness. W. H. Freeman
Garzon M, Deaton R (2004) Codeword design and information encoding in DNA ensembles. J Nat Comput 3:253–292
Garzon M, Oehmen C (2002) Biomolecular computation in virtual test tubes. In: Proc. DNA7, Springer-Verlag Lecture Notes in Computer Science, vol 2340, pp 117–128
Garzon M, Deaton R, Neathery P, Murphy R, Franceschetti D, Stevens SE Jr (1997a) On the encoding problem for DNA computing. In: The third DIMACS workshop on DNA-based computing, pp 230–237
Garzon M, Neathery P, Deaton R, Murphy R, Franceschetti D, Stevens SE Jr (1997b) A new metric for DNA computing. In: Koza JR et al (eds) Proc. 2nd annual genetic programming conference. Morgan Kaufmann, pp 230–237
Garzon M, Blain D, Bobba K, Neel A, West M (2003) Self-assembly of DNA-like structures in silico. In: Garzon M (ed) Biomolecular machines and artificial evolution. Special Issue of the Journal of Genetic Programming and Evolvable Machines. Kluwer Academic Publishers, pp 185–200
Garzon M, Bobba K, Hyde B (2004) Digital information encoding on DNA. In: Aspects of molecular computing, Springer-Verlag Lecture Notes in Computer Science, vol 2590, pp 152–166
Garzon M, Phan V, Bobba K, Kontham R (2005a) Sensitivity and capacity of microarray encodings. In: Proc. DNA11, Springer-Verlag Lecture Notes in Computer Science, vol 3892, pp 81–95
Garzon M, Phan V, Bobba K, Kontham R (2005b) Sensitivity and capacity of microarray encodings. In: Proc. DNA11, Springer-Verlag Lecture Notes in Computer Science, vol 3892, pp 81–95
Garzon M, Phan V, Roy S, Neel A (2007) In search of optimal codes for DNA computing. In: Mao C, Yokomori T, Zhang B (eds) Proc. DNA11, Springer-Verlag Lecture Notes in Computer Science, vol 4287. Springer-Verlag, pp 143–156
King O (2003) Bounds for DNA codes with constant GC-content. J Combinatorics 10:R33
Marathe A, Condon A, Corn R (1999) On combinatorial DNA word design. In: Winfree E, Gifford DK (eds) Proceedings 5th DIMACS workshop on DNA based computers. American Mathematical Society, pp 75–89
Mullis K (2001) The unusual origin of the polymerase chain reaction. The unusual origin of the polymerase chain reaction 262:56–61, 64–65
Phan V, Garzon M (2005) Information encoding using DNA. In: Proc. DNA10, Springer-Verlag Lecture Notes in Computer Science, vol, 3384, pp 281–292
Roman J (1995) The theory of error-correcting codes. Springer-Verlag, Berlin
SantaLucia J (1998) A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics. Proc Natl Acad Sci 95:1460–1465
SantaLucia J, Hicks D (2004) Thermodynamics of DNA structural motifs. Annu Rev Biophys Biomol Struct 33:415–440
SantaLucia J Jr, Allawi H, Seneviratne P (1990) Improved nearest neighbor paramemeters for predicting duplex stability. Biochemistry 35:3555–3562
Seeman N (2003) DNA in a material world. Nature 421:427–431
Shortreed M, Chang S, Hong D, Phillips M, Campion B, Tulpan D, Andronescu M, Condon A, Hoos H, Smith L (2005) A thermodynamic approach to designing structure-free combinatorial DNA word sets. Nucleic Acids Res 33:4965–4977
Tian Y, He Y, Chen Y, Yin P, Mao C (2005) Molecular devices—a DNAzyme that walks processively and autonomously along a one-dimensional track. Angewandte Chemie 44:4355–4358
Tulpan D, Andronescu M, Chang S, Shortreed M, Condon A, Hoos H, Smith L (2005) Thermodynamically based DNA strand design. Nucleic Acids Res 33:4951–4964
Watson J, Crick F (1953) Molecular structure of nucleic acids. A structure for deoxyribose nucleic acid. Nature 171
Wetmur J (1997) Physical chemistry of nucleic acid hybridization. In: Third annual DIMACS meeting on DNA based computers, pp 1–23
Yurke B, Turberfield A, Mills AP Jr, Neumann J (2003) A DNA-fuelled molecular machine made of DNA. Nature 406:605–608
Zuker M, Mathews D, Turner D (1999) Algorithms and thermodynamics for RNA secondary structure prediction: a practical guide in rna biochemistry and biotechnology. In: Barciszewski J, Clark B (eds) NATO ASI series. Kluwer Academic Publishers
Author information
Authors and Affiliations
Corresponding author
Appendix
Appendix
In this section, we give a detailed proof that the h-measure is indeed a distance in the space of all poligos P n , for every n. A metric space is defined as a pair (S,d) consisting of a nonempty set S and a distance function d: S × S → R (also called a metric) assigning a real number d(x,y) to every pair of elements x,y ∈ S so that three defining properties are satisfied for every triple x,y,z ∈ S, namely:
-
1.
Nonnegativity: d(x,y) ≥ 0 and Strict nonnegativity: d(x,y) = 0 if and only if x = y;
-
2.
Symmetry: d(x,y) = d(y,x);
-
3.
Triangle inequality: d(x,z) ≤ d(x,y) + d(y,z).
The first property of a metric space fails for the h-measure, as was pointed above, because there are x’s such that 0 < h(x,x), for example x = aacc; as a result, the triangle inequality also fails because h(x,x′) + h(x′ x) = 0. To handle this problem, we pass to the appropriate poligo space consisting of subsets X defined by
and defining the distance between poligos (still denoted h) as the set distances, i.e.,
(We will continue to use the convention below that capital letters denote the poligos of corresponding n-mers given by lower case letters.)
Well known examples of a metric space are the binary hypercubes B n from information theory (Roman 1995) with the familiar Hamming distance obtained by counting the mismatched bits in a perfect alignment of x against y as binary strings
where ψ is the characteristic function with values 1 if the i th-bits x i ,y i are identical bits and 0 if they are different. It is easy to verify that H remains a distance if it is likewise defined on the DNA spaces D n . Moreover, H obviously satisfies several properties that will be used below, namely
where r,c,′ are the reversal, pointwise Watson-Crick complement and their composition, the full Watson-Crick complementation operation ′, respectively. Also, H is additive over concatenation, i.e., for every four strings x,y,u,v with |x| = |y| and |u| = |v| (even if the two lengths are different), and for every permutation π of the indices i for the positions in the strands,
where π · x is the strand obtained by permuting the bases in x according to π, i.e., x i is moved to position π(i), i.e., (π · x)π(i) = x i .
The proof of Theorem 5.3 below will proceed in three stages. We will first establish that a particular case of the h-measure obtained by forcing alignments when calculating the h-measure (e.g., by appending much longer complementary primers to the ends of the strands in D n ) and given by
makes a near-distance of the space D n of interest in its own right, i.e. it does satisfy the triangle inequality but only falls short of satisfiying h(x,y) > 0 for x ≠ y to be a full metric. (Note that h 0(x,x) = 0 and so, as noted above, the h 0-distance will then be a metric in poligo space P n .) Second, we will establish that similar measures h k are also near-metrics if given by
where x[k] (x[-k]) is the prefix (suffix, respectively) of x of length n − k + 1 overlapping y (or y′) after a shift of length k nucleotides to the right (left, respectively). Third, Theorem 5.3 will then follow because the h-distance can be obtained by passing to the appropriate poligo space consisting of subsets X of n-mer defined by X = {x,x′} and defining the distance between poligos as the set distance (4) above.
For the first stage, we establish that h 0 is a near-distance for every n ≥ 1.
Lemma 5.1
In D 2, if H(be,(df)′) = 0, then H(be,df) = 0 and be is a palindrome, or H(be,df) = 2.
Proof
If H(be,(df)′) = 0, then b = f′, e = d′, and H(be,df) = H(be,e′ b′) = H(be,(be)′). Thus, if the latter is 0, b = e′ and hence be = (be)′, a palindrome. H(be,df) = 1 would require b = e′ and e ≠ b′, which is impossible.□
It follows from property (6) that for every extension of strands x,z by four arbitrary nucleotides p,q,r,s,
Since h 0(x,z) = min {H(x,z), H(x,z′)} ≤ H(x,z) and h 0(bd,ef) = min {H(bd,ef), H(bd,(ef)′)} ≤ H(bd,ef), and likewise for H(x,z′) and H(bd,(ef)′, it follows that
Likewise,
Therefore,
In fact, equality happens when both H(x,z) ≤ H(x,z′) and H(bd,ef) ≤ H(bd,(ef)′) because then H(x,z) + H(bd,ef) ≤ H(x,z′) + H(bd,(ef)′) and so H(bxe,dzf) = H(x,z) + H(bd,ef); likewise when the opposite inequalities hold, or when H(bd,ef) = H(bd,(ef)′). Therefore, the inequality is strict only when we have H(x,z) ≤ H(x,z′) and H(bd,(ef)′) ≤ H(bd,ef), or vice versa, and if so, the difference between the left-hand and right-hand sides of the inequality is at most H(bd,ef), which is at most 2.
Theorem 5.2
D n is a near-metric space with the function h 0 in (4) for every n ≥ 1.
Proof
Nonnegativity and symmetry are clearly inherited from the corresponding properties of H. To establish the triangle inequality, we proceed by induction on n, the length of the strands. In D 1, h is identical to H and hence it satisfies the inequality. It is also easy to check the inequality hold for n = 2 as well, say by exhaustively checking all possible triples x,y,z. Assume inductively that the triangle inequality holds for h 0 in D m for all m < n. An arbitrary triple in D n (n > 2) can be written as bxe, pyq, dzf for some dimers be, pq,df such that
These inequalities in conjunction with inequality (8) imply that
Now, by the remarks after inequality (10), the left-hand side is the desired h(bxe,dzf) except when H(x,z) ≤ H(x,z′) and H(bd,(ef)′) ≤ H(bd,ef), or vice versa. Assuming that the former is the case (the argument in the other case is identical) and considering that the integer value
the inequality can only fail if H(be,df) = 2 and H(be,(df)′) = 0 (so that the increase to obtain h(bxe,dzf) from h 0(x,z) + h(be,df) is 1) but neither of the corresponding sums in the right-hand sums of (11) increases by at least one. In that case, again by Lemma 5.1, we have that H(bd,(pq)′) = 0 and H(pq,(ef)′) = 0, i.e. bd = (pq)′ = ef and hence H(db,ef) = 0, a contradiction. Therefore, an increase in going from the sum in the left-hand side of inequality~(11) to the desired h(bxe,dzf) forces an increase in one of the corresponding sums in the right-hand side and so the triangle inequality holds.□
For the second stage, observe that the measures h k defined by (7) is identical to the measure obtained by shifting y and y′ by |k| nucleotides and then computing the h 0-measure between the overlapping segments x[k] and y[k], i.e.,
Note that the poligos [x] in this case now consist of all strands having a common prefix (k > 0) or suffix (k < 0) of length n − k after a shift of |k| nucleotides to the right (or left, respectively) which are at h k measure 0 from one another.
For the third stage, observe that h can now be expressed as
Theorem 5.3
Poligo space P n is a metric space with the h-distance.
Proof
It suffices to verify that the h-measure satisfies the triangle inequality for poligos because in the quotient space P n for h, a poligo X = [x] given by definition 3 must consist of n-mers y’s such that h(x,y) = 0. Since it if clear that k + H(x,σ(y)) > 0 for k > 0, h(x,y) = 0 implies that h 0(x,y) = 0, i.e., that y = x′ and hence that X = {x,x′}. Therefore the appropriate poligos for h are precisely the poligos originally given for the h-measure. Moreover, strict nonnegativity of h follows from the previous observation that different poligos X ≠ Y are disjoint and therefore h(X,Y) > 0 according to definition (4).
To verify the triangle inequality, consider three arbitrary strands x,y,z of length n, let k,j be the shifts where the minima for h(x,y) and h(y,z) are obtained in the expression (14), respectively. By equivalence (13), we thus need to show that
Assuming that |j| ≤ |k| (the reverse case can be established likewise), inequality (14) and the triangle inequality for h 0 from Theorem 5.2 imply that
since it is easy to verify that
□
Rights and permissions
About this article
Cite this article
Phan, V., Garzon, M.H. On codeword design in metric DNA spaces. Nat Comput 8, 571–588 (2009). https://doi.org/10.1007/s11047-008-9088-6
Received:
Accepted:
Published:
Issue date:
DOI: https://doi.org/10.1007/s11047-008-9088-6


