Longest Common Subsequence


The Longest common subsequence is a very important problem in computer sequence. In this, we need to find the longest subsequence that is present in both the strings in the same order. We need to find the longest sequence which can be obtained by deleting some items from the first and second sequences.

This problem is different from the longest common substrings as in substrings we only need to worry about the consecutive positions. The elements in subsequence appear in the same relative order but they don’t need to be contiguous.  For example, “axy” is a subsequence of “abxcwxy”.

Let’s consider an example first to understand this problem.

X: AABCDY

Y: AYCYDY

 

The length of LCS is 4 [ACDY]

A naïve solution is to generate all the subsequence of X and Y and try to compare them. If the length of the string is n, then there are a total of 2n subsequences possible. Thus, the time complexity of this solution is O(n.2n). It will take O(n) time to compare both subsequences.

We are going to use recursion to solve this problem first. This will help us in understanding the DP solution. LCS problem has an optimal substructure and we can divide it into smaller and simple subproblems. These simple subproblems can be further divided into smaller subproblems until we reach the base case.

Let us consider two strings X and Y of length n and m. We are going to start comparing the last character. There are two possible cases:

  1. If the last element of both the sequence is the same, then we can remove the last element and find the LCS of the shortened sequence. After that, we can append the removed element to the LCS. 
  2. If the last element of both the sequence is different, then there are two cases possible. We can either remove one character from the X string or we can remove one character from the Y string. The answer will be the minimum of these two cases. Thus, we can create a recurrence relation.

X[i…n], Y[i…m]

if(X[n] == X[m])

     LCS(X[1…n], Y[1…m]) = LCS(X[1…n-1],Y[1…m-1]) + 1   

else                                                       

   LCS(X[1…n], Y[1…m]) = max( LCS(X[1…n-1],Y[1…m] ,

                                                      LCS(X[1…n],Y[1…m-1])

Example: Consider X: AABCDY, Y=AYCYD

The last character of both the strings is different. Thus, the LCS will either end with Y (last character of string X[1…n]) or it will end with some previous character. We have two possible cases here.

Case1: If the LCS is going to end with Y, then it can’t have D (last character of string Y[1...m]) in the end. Thus, we will remove the last character of string Y and calculate LCS(X[1..n] , Y[1..m-1]).

Case 2: If the LCS is not going to end with Y, then we can remove Y from the sequence. Thus, we need to calculate LCS(X[1…n-1], Y[1…m]). 

You can check the code below for understanding the recursive relation.



 

O/P: The length of the LCS is 4

 

The time complexity of this solution is O(2(n+m)). If the length of LCS is 0, then we need to check all the possible subsequences. The LCS problem has overlapping subproblems. Most recursive solutions exhibit these properties. If the recursive algorithm is solving the same subproblem again and again, then it has overlapping subproblems. You can check the below recursion tree of LCS(“AAXC”,”AAXD”):

                           (“AAXC”,”AAXD”)                                                       

                           /                   \                                        

            (“AAX”,”AAXD”)             (“AAXC”,”AAX”)                

                  /       \                          /       \               

(“AA”,”AAXD”) (“AAX”,”AAX”)(“AAX”,”AAX”) (“AAXC”,”AA”)              

In the above recursion tree, LCS(“AAX”,”AAX”) is getting calculated twice. We will find more overlapping subproblems if we draw the complete recursion tree. Thus, this problem has both overlapping and optimal subproblems. We can use DP for saving the results of subproblems. This will ensure that we don’t need to recalculate them.

We are going to use the bottom-up approach for filling the DP table. There are two possible cases:

 

if(X[i-1]==Y[j-1])            //the last character of string is same            

dp[i][j] = dp[i-1][j-1] + 1; // removing one character from both string else            

dp[i][j] = max( dp[i-1][j] (removing last character from string X),

dp[i][j-1] (removing last character from string Y) )   

You can check the code below for more explanation:



 

O/P: The length of the LCS is 4

The time and space complexity of this solution is O(n*m). You can also use a map for storing the result. The LCS problem is very important as it is used by the data compression programs. Also, it is used by version control systems like Git.

Exercise: Print the Longest common subsequence. 

Practice Problems:

Happy Coding 😊

By Programmers Army