On codeword design in metric DNA spaces

Phan, Vinhthuy; Garzon, Max H.

doi:10.1007/s11047-008-9088-6


Published: 25 June 2008

Volume 8, pages 571–588 (2009)

Natural Computing

Vinhthuy Phan¹ & Max H. Garzon¹


Abstract

Finding a large set of single DNA strands that do not crosshybridize to themselves and/or to their complements is an important problem in DNA computing, self-assembly, and DNA memories. We describe a theoretical framework to analyze this problem, gauge its computational difficulty, and provide nearly optimal solutions. In this framework, codeword design is reduced to finding large sets of strands maximally separated in a DNA space, and the size of such sets depends on the geometry of these metric spaces. First, we show that codeword design is NP-complete using any single reasonable measure that approximates the Gibbs energy, thus practically excluding the possibility of finding any procedure to find maximal sets efficiently. Second, we extend a technique known as shuffling to provide a construction that yields provably nearly-maximal codes. Third, we propose a filtering process that removes strands creating pairs with low Gibbs energies, as approximated by the nearest-neighbor model. These two steps produce large codes of high thermodynamic quality. The proposed framework can be used to gain an understanding of the Gibbs energy landscapes for DNA strands on which much of DNA computing and self-assembly are based.


References

Adleman L (1994) Molecular computation of solutions of combinatorial problems. Science 266:1021–1024
Allawi H, SantaLucia J (1997) Thermodynamics and NMR of internal G.T mismatches in DNA. Biochemistry 36:10581–10594
Allawi H, SantaLucia J (1998a) Nearest neighbor thermodynamic parameters for internal G.A mismatches in DNA. Biochemistry 37:2170–2179
Allawi H, SantaLucia J (1998b) Thermodynamics of internal C.T mismatches in DNA. Nucleic Acids Res 26:2694–2701
Alon N, Spencer J (2000) The probabilistic method, 2nd edn. Wiley
Arita M, Kobayashi S (2002) DNA sequence design using templates. New Gen Comput 20:263–277
Baum E (1995) Building an associative memory vastly larger than the brain. Science 268:583–585
Bishop M, D’Yachkov A, Macula A, Renz T, Rykov V (2007) Free energy gap and statistical thermodynamic fidelity of DNA codes. J Comput Biol 14:1088–1104
Brenneman A, Condon A (2002) Strand design for biomolecular computation. Theor Comput Sci 287:39–58
Chen J, Deaton R, Wang J (2003) A DNA-based memory with in vitro learning and associative recall. In: Proc. DNA9, Springer-Verlag Lecture Notes in Computer Science, pp 145–156
Chen J, Deaton R, Garzon M, Kim J, Wood DHD, Wang Y (2004) Characterization of non-crosshybridizing DNA oligonucleotides manufactured in vitro. In: Proc. DNA10, Springer-Verlag Lecture Notes in Computer Science, pp 50–61
Cormen T, Leiserson C, Rivest R, Stein C (2001) Introduction to Algorithms, 2nd edn. The MIT Press
Deaton R, Garzon M, Murphy RE, Rose JA, Franceschetti DR, Stevens SE Jr (1998) The reliability and efficiency of a DNA computation. Phys Rev Lett 80
Deaton R, Chen J, Bi H, Rose J (2003) A software tool for generating non-crosshybridizing libraries of DNA oligonucleotides. In: Proc. DNA8, Springer-Verlag Lecture Notes in Computer Science, vol 2568. Springer-Verlag, London, pp 252–261
D’yachkov A, Macula A, Pogozelski W, Renz T, Rykov V, Torney D (2004) A weighted insertion-deletion stacked pair thermodynamic metric for DNA codes. In: Proc. DNA10, Springer-Verlag Lecture Notes in Computer Science, vol 3384, pp 90–103
Feldkamp U, Ruhe H, Banzhaf W (2003) Software tools for DNA sequence design. J Genetic Program Evol Mach 4:153–171
Frutos A, Condon A, Corn R (1997) Demonstration of a word design strategy for DNA computing on surface. Nucleic Acids Res 25:4748–4757
Garey M, Johnson D (1979) Computers and intractability: a guide to the theory of NP-completeness. W. H. Freeman
Garzon M, Deaton R (2004) Codeword design and information encoding in DNA ensembles. J Nat Comput 3:253–292
Garzon M, Oehmen C (2002) Biomolecular computation in virtual test tubes. In: Proc. DNA7, Springer-Verlag Lecture Notes in Computer Science, vol 2340, pp 117–128
Garzon M, Deaton R, Neathery P, Murphy R, Franceschetti D, Stevens SE Jr (1997a) On the encoding problem for DNA computing. In: The third DIMACS workshop on DNA-based computing, pp 230–237
Garzon M, Neathery P, Deaton R, Murphy R, Franceschetti D, Stevens SE Jr (1997b) A new metric for DNA computing. In: Koza JR et al (eds) Proc. 2nd annual genetic programming conference. Morgan Kaufmann, pp 230–237
Garzon M, Blain D, Bobba K, Neel A, West M (2003) Self-assembly of DNA-like structures in silico. In: Garzon M (ed) Biomolecular machines and artificial evolution. Special Issue of the Journal of Genetic Programming and Evolvable Machines. Kluwer Academic Publishers, pp 185–200
Garzon M, Bobba K, Hyde B (2004) Digital information encoding on DNA. In: Aspects of molecular computing, Springer-Verlag Lecture Notes in Computer Science, vol 2590, pp 152–166
Garzon M, Phan V, Bobba K, Kontham R (2005a) Sensitivity and capacity of microarray encodings. In: Proc. DNA11, Springer-Verlag Lecture Notes in Computer Science, vol 3892, pp 81–95
Garzon M, Phan V, Bobba K, Kontham R (2005b) Sensitivity and capacity of microarray encodings. In: Proc. DNA11, Springer-Verlag Lecture Notes in Computer Science, vol 3892, pp 81–95
Garzon M, Phan V, Roy S, Neel A (2007) In search of optimal codes for DNA computing. In: Mao C, Yokomori T, Zhang B (eds) Proc. DNA11, Springer-Verlag Lecture Notes in Computer Science, vol 4287. Springer-Verlag, pp 143–156
King O (2003) Bounds for DNA codes with constant GC-content. J Combinatorics 10:R33
Marathe A, Condon A, Corn R (1999) On combinatorial DNA word design. In: Winfree E, Gifford DK (eds) Proceedings 5th DIMACS workshop on DNA based computers. American Mathematical Society, pp 75–89
Mullis K (2001) The unusual origin of the polymerase chain reaction. Sci Am 262:56–61, 64–65
Phan V, Garzon M (2005) Information encoding using DNA. In: Proc. DNA10, Springer-Verlag Lecture Notes in Computer Science, vol 3384, pp 281–292
Roman J (1995) The theory of error-correcting codes. Springer-Verlag, Berlin
SantaLucia J (1998) A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics. Proc Natl Acad Sci 95:1460–1465
SantaLucia J, Hicks D (2004) Thermodynamics of DNA structural motifs. Annu Rev Biophys Biomol Struct 33:415–440
SantaLucia J Jr, Allawi H, Seneviratne P (1990) Improved nearest neighbor parameters for predicting duplex stability. Biochemistry 35:3555–3562
Seeman N (2003) DNA in a material world. Nature 421:427–431
Shortreed M, Chang S, Hong D, Phillips M, Campion B, Tulpan D, Andronescu M, Condon A, Hoos H, Smith L (2005) A thermodynamic approach to designing structure-free combinatorial DNA word sets. Nucleic Acids Res 33:4965–4977
Tian Y, He Y, Chen Y, Yin P, Mao C (2005) Molecular devices—a DNAzyme that walks processively and autonomously along a one-dimensional track. Angewandte Chemie 44:4355–4358
Tulpan D, Andronescu M, Chang S, Shortreed M, Condon A, Hoos H, Smith L (2005) Thermodynamically based DNA strand design. Nucleic Acids Res 33:4951–4964
Watson J, Crick F (1953) Molecular structure of nucleic acids. A structure for deoxyribose nucleic acid. Nature 171
Wetmur J (1997) Physical chemistry of nucleic acid hybridization. In: Third annual DIMACS meeting on DNA based computers, pp 1–23
Yurke B, Turberfield A, Mills AP Jr, Neumann J (2003) A DNA-fuelled molecular machine made of DNA. Nature 406:605–608
Zuker M, Mathews D, Turner D (1999) Algorithms and thermodynamics for RNA secondary structure prediction: a practical guide in rna biochemistry and biotechnology. In: Barciszewski J, Clark B (eds) NATO ASI series. Kluwer Academic Publishers


Author information

Authors and Affiliations

Department of Computer Science and the Bioinformatics Program, University of Memphis, Memphis, TN, 38152, USA
Vinhthuy Phan & Max H. Garzon


Corresponding author

Correspondence to Vinhthuy Phan.

Appendix

In this section, we give a detailed proof that the h-measure is indeed a distance in the space of all poligos P_n, for every n. A metric space is defined as a pair (S,d) consisting of a nonempty set S and a distance function d: S × S → R (also called a metric) assigning a real number d(x,y) to every pair of elements x,y ∈ S so that three defining properties are satisfied for every triple x,y,z ∈ S, namely:

1. Nonnegativity: d(x,y) ≥ 0, and strict nonnegativity: d(x,y) = 0 if and only if x = y;
2. Symmetry: d(x,y) = d(y,x);
3. Triangle inequality: d(x,z) ≤ d(x,y) + d(y,z).

The first property of a metric space fails for the h-measure, as was pointed out above, because there are x’s such that 0 < h(x,x), for example x = aacc; as a result, the triangle inequality also fails because h(x,x′) + h(x′,x) = 0 < h(x,x). To handle this problem, we pass to the appropriate poligo space consisting of subsets X defined by

$$ X =[x]=\{ y \in {\bf D}_{n}: h(x,y) = 0 \} = \{ x, x^\prime \}, $$

(3)

and defining the distance between poligos (still denoted h) as the set distances, i.e.,

$$ h(X,Y)=\mathop{\hbox{min}}\limits_{x \in X, y \in Y} \{ h(x,y) \}. $$

(4)

(We will continue to use the convention below that capital letters denote the poligos of corresponding n-mers given by lower case letters.)

Well-known examples of a metric space are the binary hypercubes Bⁿ from information theory (Roman 1995) with the familiar Hamming distance obtained by counting the mismatched bits in a perfect alignment of x against y as binary strings

$$ H(x,y) = n - \sum \psi (x_i,y_i), $$

where ψ is the characteristic function with value 1 if the i-th bits x_i, y_i are identical and 0 if they are different. It is easy to verify that H remains a distance if it is likewise defined on the DNA spaces D_n. Moreover, H obviously satisfies several properties that will be used below, namely

$$ H(x,y) = H(x^r,y^r) = H(x^c,y^c) = H(x^\prime,y^\prime), $$

(5)

where r, c and ′ are the reversal, the pointwise Watson-Crick complement, and their composition (the full Watson-Crick complementation operation), respectively. Also, H is additive over concatenation and invariant under permutations of the positions, i.e., for every four strings x,y,u,v with |x| = |y| and |u| = |v| (even if the two lengths are different), and for every permutation π of the indices i for the positions in the strands,

$$ H(ux,vy) = H(u,v) + H(x,y) \quad{\rm and }\ H(x,y) = H(\pi\cdot x, \pi\cdot y), $$

(6)

where π · x is the strand obtained by permuting the bases in x according to π, i.e., x_i is moved to position π(i), so that (π · x)_{π(i)} = x_i.
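The Hamming distance and the invariance properties (5) and (6) are easy to spot-check numerically. The following is a minimal sketch, not part of the original proof: the function names (`hamming`, `wc`, `permute`, `check_properties`) and the lowercase alphabet `acgt` are illustrative assumptions, and random sampling only illustrates the identities rather than proving them.

```python
import random

COMPLEMENT = {"a": "t", "t": "a", "c": "g", "g": "c"}

def hamming(x: str, y: str) -> int:
    """H(x,y): number of mismatched positions in a perfect alignment."""
    if len(x) != len(y):
        raise ValueError("strands must have equal length")
    return sum(a != b for a, b in zip(x, y))

def reverse(x: str) -> str:
    """x^r: the reversal of x."""
    return x[::-1]

def complement(x: str) -> str:
    """x^c: the pointwise Watson-Crick complement of x."""
    return "".join(COMPLEMENT[base] for base in x)

def wc(x: str) -> str:
    """x': reversal composed with the pointwise complement."""
    return complement(reverse(x))

def permute(pi: list, x: str) -> str:
    """pi . x: the base x[i] is moved to position pi[i]."""
    out = [""] * len(x)
    for i, base in enumerate(x):
        out[pi[i]] = base
    return "".join(out)

def random_strand(rng: random.Random, n: int) -> str:
    return "".join(rng.choice("acgt") for _ in range(n))

def check_properties(n: int = 6, m: int = 3, trials: int = 500, seed: int = 0) -> bool:
    """Spot-check properties (5) and (6) on randomly drawn strands."""
    rng = random.Random(seed)
    for _ in range(trials):
        x, y = random_strand(rng, n), random_strand(rng, n)
        u, v = random_strand(rng, m), random_strand(rng, m)  # a different length
        pi = list(range(n))
        rng.shuffle(pi)
        d = hamming(x, y)
        ok = (
            d == hamming(reverse(x), reverse(y))               # (5), reversal
            and d == hamming(complement(x), complement(y))     # (5), complement
            and d == hamming(wc(x), wc(y))                     # (5), full WC complement
            and hamming(u + x, v + y) == hamming(u, v) + d     # (6), additivity
            and d == hamming(permute(pi, x), permute(pi, y))   # (6), permutation
        )
        if not ok:
            return False
    return True

print(check_properties())
```

The same permutation is applied to both strands, which is why matched and mismatched positions are merely relabeled and the count is unchanged.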

The proof of Theorem 5.3 below will proceed in three stages. We will first establish that a particular case of the h-measure obtained by forcing alignments when calculating the h-measure (e.g., by appending much longer complementary primers to the ends of the strands in D_n) and given by

$$ h_0(x,y) = \hbox{min} \{ H(x,y), H(x,y^\prime) \} \, $$

is a near-distance on the space D_n that is of interest in its own right, i.e., it does satisfy the triangle inequality and only falls short of being a full metric by failing to satisfy h₀(x,y) > 0 for x ≠ y. (Note that h₀(x,x) = 0 and so, as noted above, the h₀-distance will then be a metric in poligo space P_n.) Second, we will establish that similar measures h_k are also near-metrics if given by

$$ h_k(x,y) = k + \hbox{min} \{ h_0(x[k], \sigma^k(y)[k]), h_0(x[k],\sigma^{-k}(y)[k]) \} \, (k > 0), $$

(7)

where x[k] (x[-k]) is the prefix (suffix, respectively) of x of length n − k + 1 overlapping y (or y′) after a shift of length k nucleotides to the right (left, respectively). Third, Theorem 5.3 will then follow because the h-distance can be obtained by passing to the appropriate poligo space consisting of subsets X of n-mers defined by X = {x,x′} and defining the distance between poligos as the set distance (4) above.
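To make the first and third stages concrete, here is a small sketch of the aligned measure h₀ and of the set distance (4) between poligos, with h₀ standing in as the base measure. The helper names (`h0`, `poligo`, `poligo_distance`, `poligo_space`) are illustrative assumptions, and the exhaustive loop over dimers only illustrates the claim for n = 2.

```python
from itertools import product

COMPLEMENT = {"a": "t", "t": "a", "c": "g", "g": "c"}

def hamming(x, y):
    """Hamming distance between equal-length strands."""
    return sum(a != b for a, b in zip(x, y))

def wc(y):
    """y': the reverse Watson-Crick complement of y."""
    return "".join(COMPLEMENT[base] for base in reversed(y))

def h0(x, y):
    """Aligned measure h0(x,y) = min{H(x,y), H(x,y')}."""
    return min(hamming(x, y), hamming(x, wc(y)))

def poligo(x):
    """The poligo X = {x, x'} as a hashable set."""
    return frozenset({x, wc(x)})

def poligo_distance(X, Y):
    """Set distance (4), here with h0 as the base measure."""
    return min(h0(x, y) for x in X for y in Y)

def poligo_space(n):
    """All distinct poligos over n-mers."""
    return {poligo("".join(s)) for s in product("acgt", repeat=n)}

# h0 alone is not strictly positive: aacc and ggtt are distinct strands at measure 0 ...
print(h0("aacc", "ggtt"))
# ... but they belong to one and the same poligo.
print(poligo("aacc") == poligo("ggtt"))
# On poligos, distance 0 occurs exactly when the two poligos coincide (checked for n = 2).
space = poligo_space(2)
print(all((poligo_distance(X, Y) == 0) == (X == Y) for X in space for Y in space))
```

Representing a poligo as a `frozenset` makes a palindromic strand, for which x = x′, collapse to a one-element set automatically.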

For the first stage, we establish that h₀ is a near-distance for every n ≥ 1.

Lemma 5.1

In D₂, if H(be,(df)′) = 0, then either H(be,df) = 0 and be is a palindrome, or H(be,df) = 2.

Proof

If H(be,(df)′) = 0, then b = f′, e = d′, and H(be,df) = H(be,e′b′) = H(be,(be)′). Thus, if the latter is 0, b = e′ and hence be = (be)′, a palindrome. H(be,df) = 1 would require b = e′ and e ≠ b′ (or vice versa), which is impossible because b = e′ is equivalent to e = b′. □
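Since D₂ contains only 16 dimers, Lemma 5.1 can also be confirmed by brute force. The sketch below enumerates every pair of dimers satisfying the hypothesis and sorts it into one of the two cases of the lemma; the function name `check_lemma` and the lowercase alphabet are illustrative assumptions.

```python
from itertools import product

COMPLEMENT = {"a": "t", "t": "a", "c": "g", "g": "c"}

def hamming(x, y):
    """Hamming distance between equal-length strands."""
    return sum(a != b for a, b in zip(x, y))

def wc(y):
    """y': the reverse Watson-Crick complement of y."""
    return "".join(COMPLEMENT[base] for base in reversed(y))

def check_lemma():
    """Return (#palindromic cases, #distance-2 cases); raise if the lemma fails."""
    dimers = ["".join(p) for p in product("acgt", repeat=2)]
    palindromic = far = 0
    for be, df in product(dimers, repeat=2):
        if hamming(be, wc(df)) != 0:
            continue  # hypothesis H(be,(df)') = 0 not met
        d = hamming(be, df)
        if d == 0 and be == wc(be):
            palindromic += 1  # first case: be is a palindrome
        elif d == 2:
            far += 1          # second case: H(be,df) = 2
        else:
            raise AssertionError(f"lemma fails for be={be}, df={df}")
    return palindromic, far

print(check_lemma())
```

The hypothesis forces df = (be)′, so exactly one df qualifies for each of the 16 choices of be.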

It follows from property (6) that for every extension of strands x,z by four arbitrary nucleotides b,e,d,f,

$$ H(bxe,dzf)=H(x,z) + H(be,df) \;\; \hbox{and hence } $$

(8)

$$ H(bxe,(dzf)^\prime)=H(x,z^\prime) + H(be,(df)^\prime). $$

(9)

Since h₀(x,z) = min {H(x,z), H(x,z′)} ≤ H(x,z) and h₀(be,df) = min {H(be,df), H(be,(df)′)} ≤ H(be,df), and likewise for H(x,z′) and H(be,(df)′), it follows that

$$ \hbox{min} \{H(x,z), H(x,z^\prime)\} + \hbox{min} \{H(be,df), H(be,(df)^\prime)\} \le H(x,z) + H(be,df). $$

Likewise,

$$ \hbox{min} \{ H(x,z), H(x,z^\prime) \} + \hbox{min} \{ H(be,df), H(be,(df)^\prime) \} \le H(x,z^\prime) + H(be,(df)^\prime). $$

Therefore,

$$ h_0(x,z) + h_0(be,df) \le h_0(bxe,dzf) = \hbox{min} \{H(bxe,dzf), H(bxe,(dzf)^\prime) \}. $$

(10)

In fact, equality holds when both H(x,z) ≤ H(x,z′) and H(be,df) ≤ H(be,(df)′), because then H(x,z) + H(be,df) ≤ H(x,z′) + H(be,(df)′) and so h₀(bxe,dzf) = H(x,z) + H(be,df) = h₀(x,z) + h₀(be,df); likewise when the opposite inequalities hold, or when H(be,df) = H(be,(df)′). Therefore, the inequality can be strict only when we have H(x,z) ≤ H(x,z′) and H(be,(df)′) ≤ H(be,df), or vice versa, and if so, the difference between the right-hand and left-hand sides of the inequality is at most H(be,df), which is at most 2.
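Inequality (10) and the bound of 2 on its slack can be checked exhaustively for short inner strands. The following sketch does so under the reading that the flanking dimers are be and df; the function name `check_inequality_10` is an illustrative assumption, and the check covers only the inner lengths passed to it.

```python
from itertools import product

COMPLEMENT = {"a": "t", "t": "a", "c": "g", "g": "c"}

def hamming(x, y):
    """Hamming distance between equal-length strands."""
    return sum(a != b for a, b in zip(x, y))

def wc(y):
    """y': the reverse Watson-Crick complement of y."""
    return "".join(COMPLEMENT[base] for base in reversed(y))

def h0(x, y):
    """Aligned measure h0(x,y) = min{H(x,y), H(x,y')}."""
    return min(hamming(x, y), hamming(x, wc(y)))

def check_inequality_10(m):
    """Check h0(x,z) + h0(be,df) <= h0(bxe,dzf) for all inner strands of length m.

    Returns the largest observed gap between the right- and left-hand sides.
    """
    inner = ["".join(p) for p in product("acgt", repeat=m)]
    dimers = ["".join(p) for p in product("acgt", repeat=2)]
    max_gap = 0
    for be, df in product(dimers, repeat=2):
        b, e = be
        d, f = df
        for x, z in product(inner, repeat=2):
            lhs = h0(x, z) + h0(be, df)
            rhs = h0(b + x + e, d + z + f)
            if lhs > rhs:
                raise AssertionError(f"(10) fails for {b + x + e}, {d + z + f}")
            max_gap = max(max_gap, rhs - lhs)
    return max_gap

print(check_inequality_10(1), check_inequality_10(2))
```

For example, with x = z = aa, be = aa and df = tt, the left-hand side is 0 while h₀(aaaa,taat) = 2, so the slack of 2 is actually attained.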

Theorem 5.2

D_n is a near-metric space with the function h₀ defined above, for every n ≥ 1.

Proof

Nonnegativity and symmetry are clearly inherited from the corresponding properties of H. To establish the triangle inequality, we proceed by induction on n, the length of the strands. In D₁, h is identical to H and hence it satisfies the inequality. It is also easy to check that the inequality holds for n = 2, say by exhaustively checking all possible triples x,y,z. Assume inductively that the triangle inequality holds for h₀ in D_m for all m < n. An arbitrary triple in D_n (n > 2) can be written as bxe, pyq, dzf for some dimers be, pq, df such that

$$ h_0 (x,z) \le h_0(x,y) + h_0(y,z) \;\; \hbox{\rm and} \;\; h_0 (be,df) \le h_0(be,pq) + h_0(pq,df). $$

These inequalities in conjunction with inequality (10) imply that

$$ h_0(x,z)+h_0(be,df) \le h_0 (x,y) + h_0 (be,pq) + h_0 (y,z) + h_0 (pq,df) $$

(11)

$$ \le h_0(bxe,pyq) + h_0(pyq,dzf). $$

(12)

Now, by the remarks after inequality (10), the left-hand side is the desired h₀(bxe,dzf) except when H(x,z) ≤ H(x,z′) and H(be,(df)′) ≤ H(be,df), or vice versa. Assuming that the former is the case (the argument in the other case is identical) and considering the integer value

$$ h_0(bxe,dzf) = \hbox{min} \{ H(x,z) + H(be,df), H(x,z^\prime) + H(be,(df)^\prime)\}, $$

the inequality can only fail if H(be,df) = 2 and H(be,(df)′) = 0 (so that the increase to obtain h₀(bxe,dzf) from h₀(x,z) + h₀(be,df) is 1) but neither of the corresponding sums on the right-hand side of (11) increases by at least one. In that case, again by Lemma 5.1, we have that H(be,(pq)′) = 0 and H(pq,(df)′) = 0, i.e., be = (pq)′ = df and hence H(be,df) = 0, a contradiction. Therefore, an increase in going from the sum on the left-hand side of inequality (11) to the desired h₀(bxe,dzf) forces an increase in one of the corresponding sums on the right-hand side, and so the triangle inequality holds. □
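The proof mentions that the base case can be settled by exhaustively checking all triples. A minimal sketch of such a check is given below; it precomputes a table of h₀ values and tests every triple of n-mers. The function name `triangle_holds` is an illustrative assumption, and a finite check for small n complements the induction rather than replacing it.

```python
from itertools import product

COMPLEMENT = {"a": "t", "t": "a", "c": "g", "g": "c"}

def hamming(x, y):
    """Hamming distance between equal-length strands."""
    return sum(a != b for a, b in zip(x, y))

def wc(y):
    """y': the reverse Watson-Crick complement of y."""
    return "".join(COMPLEMENT[base] for base in reversed(y))

def h0(x, y):
    """Aligned measure h0(x,y) = min{H(x,y), H(x,y')}."""
    return min(hamming(x, y), hamming(x, wc(y)))

def triangle_holds(n):
    """True if h0(x,z) <= h0(x,y) + h0(y,z) for every triple of n-mers."""
    strands = ["".join(p) for p in product("acgt", repeat=n)]
    # Precompute all pairwise values so the triple loop is a pure table lookup.
    table = {(x, y): h0(x, y) for x in strands for y in strands}
    return all(
        table[x, z] <= table[x, y] + table[y, z]
        for x, y, z in product(strands, repeat=3)
    )

for n in (1, 2, 3):
    print(n, triangle_holds(n))
```

The number of triples grows as 64ⁿ, so this brute-force approach is only practical for very short strands.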

For the second stage, observe that the measure h_k defined by (7) is identical to the measure obtained by shifting y and y′ by |k| nucleotides and then computing the h₀-measure between the overlapping segments x[k] and y[k], i.e.,

$$ h_k(x,y) = \hbox{min} \{ |k| + H(x,\sigma^k(y)) , |k| + H(x,\sigma^k(y^\prime)) \} = |k| + h_0(x[k],y[k]). $$

(13)

Note that the poligos [x] in this case now consist of all strands having a common prefix (k > 0) or suffix (k < 0) of length n − k after a shift of |k| nucleotides to the right (or left, respectively), which are at h_k-measure 0 from one another.

For the third stage, observe that h can now be expressed as

$$ h(x,y)=\mathop{\hbox{min}}_{k \ge 0} \{ h_k(x,y) \}. $$

(14)
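As an illustration of how (13) and (14) combine, the sketch below computes a shifted measure and minimizes it over all shifts. It is one possible reading only: it assumes that a shift by k leaves an overlap of n − |k| nucleotides, that h₀ is applied to the two overlapping segments as in the right-hand form of (13), and that shifts in both directions are included in the minimum. The names `h_k` and `h` are chosen here for convenience.

```python
COMPLEMENT = {"a": "t", "t": "a", "c": "g", "g": "c"}

def hamming(x, y):
    """Hamming distance between equal-length strands."""
    return sum(a != b for a, b in zip(x, y))

def wc(y):
    """y': the reverse Watson-Crick complement of y."""
    return "".join(COMPLEMENT[base] for base in reversed(y))

def h0(x, y):
    """Aligned measure h0(x,y) = min{H(x,y), H(x,y')}."""
    return min(hamming(x, y), hamming(x, wc(y)))

def h_k(x, y, k):
    """|k| + h0 on the overlap after shifting y by k (assumed reading of (13))."""
    n = len(x)
    if k >= 0:
        # y shifted right by k: the last n-k bases of x face the first n-k of y
        return k + h0(x[k:], y[:n - k])
    # y shifted left by |k|: the first n-|k| bases of x face the last n-|k| of y
    return -k + h0(x[:n + k], y[-k:])

def h(x, y):
    """Minimum of h_k over all shifts that leave a nonempty overlap, as in (14)."""
    n = len(x)
    return min(h_k(x, y, k) for k in range(-(n - 1), n))

# A shift by one nucleotide aligns the common segment cgg of the two strands.
print(h0("acgg", "cgga"), h("acgg", "cgga"))
```

Each shifted nucleotide costs one unit, so a shift only pays off when it removes more mismatches in the overlap than it adds in overhang.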

Theorem 5.3

Poligo space P_n is a metric space with the h-distance.

Proof

It suffices to verify that the h-measure satisfies the triangle inequality for poligos because in the quotient space P_n for h, a poligo X = [x] given by definition (3) must consist of the n-mers y such that h(x,y) = 0. Since it is clear that k + H(x,σ^k(y)) > 0 for k > 0, h(x,y) = 0 implies that h₀(x,y) = 0, i.e., that y = x or y = x′, and hence that X = {x,x′}. Therefore the appropriate poligos for h are precisely the poligos originally given for the h-measure. Moreover, strict nonnegativity of h follows from the previous observation that different poligos X ≠ Y are disjoint and therefore h(X,Y) > 0 according to definition (4).

To verify the triangle inequality, consider three arbitrary strands x,y,z of length n, and let k,j be the shifts where the minima for h(x,y) and h(y,z), respectively, are attained in expression (14). By equivalence (13), we thus need to show that

$$ h(x,z) \le h(x,y) + h(y,z)=|k| + h_0(x[k],y[k]) + |j| + h_0(y[j],z[j]). $$

(15)

Assuming that |j| ≤ |k| (the reverse case can be established likewise), expression (14) and the triangle inequality for h₀ from Theorem 5.2 imply that

$$ \begin{aligned} h(x,z)\le&|j| + h_0(x[j],z[j]) \\ \le&|j| + h_0(x[j],y[j]) + h_0(y[j],z[j]) \\ \le&h(x,y) + h(y,z), \end{aligned} $$

since it is easy to verify that

$$ h_0(x[j],y[j]) \le h_0(x[k],y[k]) \le h_k(x,y) = h(x,y). $$

□


About this article


Phan, V., Garzon, M.H. On codeword design in metric DNA spaces. Nat Comput 8, 571–588 (2009). https://doi.org/10.1007/s11047-008-9088-6


Received: 27 November 2007
Accepted: 13 May 2008
Published: 25 June 2008
Issue date: September 2009
DOI: https://doi.org/10.1007/s11047-008-9088-6
