Got it! This site uses cookies. You consent to this by clicking on "Got it!" or by continuing to use this website.nbsp; Note: This appears on each machine/browser from which this site is accessed.
These notes were originally presented at the 2012 Lua Conference as Incrementally developing and implementing Hirschberg's longest common subsequence algorithm using Lua.
They have been revised, updated, and changed for this page.
The LCS (Longest Common Subsequence) problem is a dual problem of the SED (Shortest Edit Distance) problem.
The solution to these problems are used in open source file comparison tools such as WinMerge and DiffMerge.
In 1974, Hirschberg published a reasonably space and time efficient solution to these problems.
2. Subsequence
Technical definition of a subsequence.
String C = c1c 2...c p is a subsequence of string A = a1a 2...a m if there as a mapping
F: [1, 2, ..., p] to [1, 2, ..., m]
such that F(i) = k only if c i is a k and F is a monotone strictly increasing function (that is, (F(i) = u) and (F(j) = v) and (i < j) imply that (u < v)).
3. Common subsequence
String C is a common subsequence of strings A and B iff
C is a subsequence of A and
C is a subsequence of B.
4. Problem
Given strings
A = a 1a 2...a m and
B = b 1b 2...b n
find string
C = c 1c 2...c p
such that C is a common subsequence of both A and B and p is maximized. C is then called a maximal common subsequence or Longest Common Subsequence.
5. Alphabet
Alphabets examples:
Characters (line comparison)
Lines (file comparison)
Nucleotides (DNA comparisons)
6. Example strings
Example strings:
a = "nematode-knowledge"
m = 18
b = "empty-bottle"
n = = 12
7. LCS
No connection lines cross.
In general there are more than one LCS (e.g., last "e").
8. Symbols
Symbols can be anything that can be matched.
Letters of an alphabet
Lines of text
Nucleotides (in DNA)
For example purposes, letters will be used.
9. DNA
DNA consists of a sequence of four different nucleotides.
Time and space efficiency depends on the algorithm used.
20. Possible matches
(a == "nematode-knowledge") and (m == 18)
(b == "empty-bottle") and (n == 12)
possible non-empty substring compares: m*n == 216
21. Start
Start from the end of both strings.
22. Compare
Compare both versions for symmetry:
Flip the order of the strings.
Forward or backward in strings.
String or reverse string.
2*2*2 = 8 approaches. All yield the same LCS.
23. Match
24. Non-Match (1)
25. Non-Match (2)
26. Next
27. Recursive top down backward
a 1 a 2 ... a m-1 a m
b 1 b 2 ... b n-1 x n
function lcs_1b(a, b)
local m = #a
local n = #b
if (m == 0) or (n == 0) then
return ""
elseif string.sub(a, m, m) == string.sub(b, n, n) then
return lcs_1b(string.sub(a, 1, m-1), string.sub(b, 1, n-1)) .. string.sub(a, m, m)
else
local a1 = lcs_1b(a, string.sub(b, 1, n-1))
local b1 = lcs_1b(string.sub(a, 1, m-1), b)
return math.max(#a1, #b1)
end
end
Time and space INEFFICIENT!!!
28. Recursive top down forward
a 1 a 2 ... a m-1 a m
b 1 b 2 ... b n-1 x n
function lcs_1f(a, b)
local m = #a
local n = #b
if (m == 0) or (n == 0) then
return ""
elseif string.sub(a, 1, 1) == string.sub(b, 1, 1) then
return string.sub(a, 1, 1) .. lcs_1f(string.sub(a, 2, m), string.sub(b, 2, n))
else
local a1 = lcs_1f(a, string.sub(b, 2, n))
local b1 = lcs_1f(string.sub(a, 2, m), b)
return math.max(#a1, #b1)
end
end
Time and space INEFFICIENT!!!
29. Maximum subsequence length
String rewriting involves copies and is inefficient.
Modify the algorithm to return the length of the maximal subsequence.
Improve the algorithm.
Extract the LCS from the results.
function lcs_2b(a, b)
local m = #a
local n = #b
if (m == 0) or (n == 0) then
return 0
elseif string.sub(a, m, m) == string.sub(b, n, n) then
return lcs_2b(string.sub(a, 1, m-1), string.sub(b, 1, n-1)) + 1
else
local a1 = lcs_2b(a, string.sub(b, 1, n-1))
local b1 = lcs_2b(string.sub(a, 1, m-1), b)
return math.max(a1, b1)
end
end
30. Output
a=[empty_bottle]
b=[nematode_knowledge]
c=[7]
31. Next step
Use a list to store the string symbols.
Pass the ending location.
32. Use a list for A and B
Use a list for A and B.
A = {}
setDefault(A, "")
for i=1,string.len(a) do
A[i] = string.sub(a, i, i)
end
B = {}
setDefault(B, "")
for j=1,string.len(b) do
B[j] = string.sub(b, j, j)
end
io.write("A=[")
for i,a in pairs(A) do
io.write(a)
end
print("]")
io.write("B=[")
for j,b in pairs(B) do
io.write(b)
end
print("]")
33. Modified code
function lcs_3b(A, i, B, j)
if (i == 0) or (j == 0) then
return 0
elseif A[i] == B[j] then
return lcs_3b(A, i-1, B, j-1) + 1
else
local a1 = lcs_3b(A, i, B, j-1)
local b1 = lcs_3b(A, i-1, B, j)
return math.max(a1, b1)
end
end
34. Call
c = lcs_3b(A, #A, B, #B)
print("c=[" .. c .. "]")
35. Observation
Observation: L(i, j) is a maximal possible length common subsequence of A 1i and B 1j.
Initialization of L, the Length matrix.
L = {}
for i=1,#A do
L[i] = {}
for j=1,#B do
L[i][j] = -1
end
end
For convenience, L is initially defined as -1 everywhere (explicitly or via default metatable method).
36. Initial L matrix
L= n e m a t o d e _ k n o w l e d g e
e -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
m -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
p -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
t -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
y -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
_ -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
b -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
o -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
t -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
t -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
l -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
e -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
37. Initial L matrix
38. Compute the L matrix
function lcs_4b(A, i, B, j, L)
local p
if (i == 0) or (j == 0) then
p = 0
else
if A[i] == B[j] then
p = lcs_4b(A, i-1, B, j-1, L) + 1
else
local a1 = lcs_4b(A, i, B, j-1, L)
local b1 = lcs_4b(A, i-1, B, j, L)
p = math.max(a1, b1)
end
L[i][j] = p
end
return p
end
The L matrix is computed.
39. Computed L matrix
L= n e m a t o d e _ k n o w l e d g e
e 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 -1
m 0 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 -1
p 0 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 -1
t 0 1 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 -1
y 0 1 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 -1
_ 0 1 2 2 3 3 3 3 4 4 4 4 4 4 4 4 4 -1
b 0 1 2 2 3 3 3 3 4 4 4 4 4 4 4 4 4 -1
o 0 1 2 2 -1 4 4 4 4 4 4 5 5 5 5 5 5 -1
t 0 1 2 2 3 4 4 4 4 4 4 5 5 5 5 5 5 -1
t -1 -1 -1 -1 3 4 4 4 4 4 4 5 5 5 5 5 5 -1
l -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 6 6 6 6 -1
e -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 7
40. Computed L matrix
41. Recover the LCS: approach
42. Recover the LCS: code
To recover the LCS from L, backtrack through the matrix.
function path_extract1(L, A, i, B, j)
if (i == 0) or (j == 0) then
return ""
elseif A[i] == B[j] then
return path_extract1(L, A, i-1, B, j-1) .. A[i]
else
local x1, x2
if j == 1 then
x1 = -1
else
x1 = L[i][j-1]
end
if i == 1 then
x2 = -1
else
x2 = L[i-1][j]
end
if x1 > x2 then
return path_extract1(L, A, i, B, j-1)
else
return path_extract1(L, A, i-1, B, j)
end
end
end
43. Call the extraction
Call as follows.
p = lcs_6b(A, 1, #A, B, 1, #B, L)
print("p=[" .. p .. "]")
c = path_extract1(L, A, #A, B, #B)
print("c=[" .. c .. "]")
This is time efficient but space inefficient!
44. Efficiency
The recursive solution is very inefficient.
Solution: Memoization.
function lcs_5b(A, i, B, j, L)
local p
if (i == 0) or (j == 0) then
p = 0
else
p = L[i][j]
if p < 0 then
if A[i] == B[j] then
p = lcs_5b(A, i-1, B, j-1, L) + 1
else
local a1 = lcs_5b(A, i, B, j-1, L)
local b1 = lcs_5b(A, i-1, B, j, L)
p = math.max(a1, b1)
end
L[i][j] = p
end
end
return p
end
The same L matrix is computed.
45. Add the start and stop indexes
function lcs_6b(A, i1, i2, B, j1, j2, L)
local p2
if (i2 < i1) or (j2 < j1) then
p = 0
else
p = L[i2][j2]
if p < 0 then
if A[i2] == B[j2] then
p = lcs_6b(A, i1, i2-1, B, j1, j2-1, L) + 1
else
local a1 = lcs_6b(A, i1, i2, B, j1, j2-1, L)
local b1 = lcs_6b(A, i1, i2-1, B, j1, j2, L)
p = math.max(a1, b1)
end
L[i2][j2] = p
end
end
return p
end
The same L matrix is computed.
Note: Recursion can be converted to iteration. Left as an exercise.
46. Source
Accessible 3-page paper with which to get started.
47. Source
(page 2/3)
48. Source
(page 3/3)
49. Source
A linear space algorithm for computing maximal common subsequences
D. S. Hirschberg, Princeton University
1975.
50. Hirschberg Approach
51. Hirschberg Approach
52. Hirschberg Approach
53. Hirschberg Approach
54. Hirschberg Approach
55. Animation
56. Hirschberg (full L, rows)
function dpa_traverse_4(L, A, B, i1, i3, j1, j3, dx)
for i=i1,i3,dx do
for j=j1,j3,dx do
if A[i] == B[j] then
if (i == i1) or (j == j1) then
L[i][j] = 1
else
L[i][j] = 1 + L[i-dx][j-dx]
end
else
local y1, y2
if i == i1 then
y1 = 0
else
y1 = L[i-dx][j]
end
if j == j1 then
y2 = 0
else
y2 = L[i][j-dx]
end
L[i][j] = math.max(y1, y2)
end
end
end
end
57. Hirschberg (full L, main)
function lcs_hirschberg_4(L, A, B, i1, i3, j1, j3, level)
if not L[level] then
L[level] = initL1(A, B)
hlist1[level] = {}
end
if j1 > j3 then
for i=i1,i3 do
extractPut1(A[i]," ",1)
end
elseif i1 == i3 then
table.insert(hlist1[level], 4 .. " " .. #ts1 - i1+ 1 .. " " .. #ts1 - i3 .. " " .. j1 .. " " .. j3 .. " 1.5 (" .. "990099" .. ") doBox4")
local j2 = 0
for j=j3,j1,-1 do
if (A[i1] == B[j]) and (j2 == 0) then
j2 = j
extractPut1(A[i1],B[j],1)
else
extractPut1(" ",B[j],1)
end
end
if j2 == 0 then
extractPut1(A[i1]," ",1)
else
table.insert(hlist1[level], 5 .. " " .. #ts1 - i1+ 1 .. " " .. #ts1 - i3 .. " " .. j2 .. " " .. j2 .. " 0.5 (" .. "00CC00" .. ") doBox4")
table.insert(hlist1[level], 0 .. " " .. j2 .. " " .. #ts1 - i1 .. " " .. string.byte(A[i1]) .. " " .. "(CCFFCC) putChar2")
end
else
local i2 = math.floor((i1+i3)/2)
dpa_traverse_4(L[level], A, B, i1, i2, j1, j3, 1)
dpa_traverse_4(L[level], A, B, i3, i2+1, j3, j1, -1)
local j2 = j1-1
local k1 = 0
for j=j1,j3 do
local k
k = L[level][i2][j] + L[level][i2+1][j]
if k > k1 then
k1 = k
j2 = j
end
end
table.insert(hlist1[level], 1 .. " " .. #ts1 - i1+ 1 .. " " .. #ts1 - i2 .. " " .. j1 .. " " .. j2 .. " 1.0 (0000CC) doBox4")
table.insert(hlist1[level], 1 .. " " .. #ts1 - i2 .. " " .. #ts1 - i3 .. " " .. j2+1 .. " " .. j3 .. " 1.0 (0000CC) doBox4")
table.insert(hlist1[level], 2 .. " " .. #ts1 - i1+ 1 .. " " .. #ts1 - i3 .. " " .. j1 .. " " .. j3 .. " 1.0 (000000) doBox4")
table.insert(hlist1[level], 3 .. " " .. #ts1 - i2 - 1 .. " " .. #ts1 - i2 + 1 .. " " .. j1 .. " " .. j3 .. " 1.0 (CC0000) doBox4")
lcs_hirschberg_4(L, A, B, i1, i2, j1, j2, level+1)
lcs_hirschberg_4(L, A, B, i2+1, i3, j2+1, j3, level+1)
end
end