The Longest Common Subsequence Algorithm: Exploring the Depths

Updated on 10/08/2024 | 432 Views

Introduction

Longest common subsequence (LCS) has been a keystone in computer science and data analysis for a long time. It came into being during the middle of the twentieth century as scientists began to search for ways to compare and analyze sequences of data more efficiently. Diverse applications involving LCS algorithms have emerged in fields such as bioinformatics and text processing, among others.

LCS algorithms play an important role in many real-life situations, such as DNA sequence alignment, plagiarism detection, and version control systems. In these cases, the algorithms help scholars and practitioners discover similarities between sequences, which yields useful insights and supports tasks such as data compression.

This guide will provide details on why LCS is important and how it can be computed by means of multiple methods, including those used in several disciplines.

Overview

The longest common subsequence problem involves determining the longest sequence that appears in both of two given sequences, in the same relative order but not necessarily contiguously.

But why is this important? Well, imagine you're comparing two versions of a document, trying to identify the changes made between them. Longest common subsequence algorithms make this task manageable by pinpointing the similarities and differences between the sequences.

Throughout this guide, I will explain the concept of LCS thoroughly. We will talk about LCS algorithms, their applications, and much more.

What is the longest common subsequence?

The Longest Common Subsequence (LCS) is a central idea in computer science and data analysis, concerned with finding what two ordered sequences of elements have in common.

Assume you have two sequences, X and Y. A subsequence of a sequence is obtained by deleting zero or more of its items while keeping the remaining items in their original order. In other words, you can remove entries from the sequence, but you cannot rearrange what is left. The remaining items do not need to be adjacent in the original sequence.

Now let’s go deeper into the meaning of the longest common subsequence.

A common subsequence is a subsequence of both X and Y. Its items appear in both sequences in the same relative order, although they may sit at different positions in each.

The Longest Common Subsequence goes one step further: it is not just any common subsequence but the longest one. That is, it contains the maximum number of items that appear in both X and Y in the same relative order. There can be more than one common subsequence of that maximum length.

Let's put this into perspective with an example. Suppose you have two sequences:

X = "ABCDGH" 

Y = "AEDFHR"

The common subsequences here include "A", "AD", "AH", "DH", and "ADH", among others.

In this case, the longest common subsequence is "ADH", with length 3, because no longer subsequence is common to both X and Y.
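To make the definition concrete, here is a minimal sketch that checks whether a candidate string is a subsequence of both X and Y. The helper name `is_subsequence` and the list of candidates are illustrative assumptions, not part of the original tutorial.

```python
def is_subsequence(candidate, sequence):
    """Return True if `candidate` can be obtained from `sequence` by deleting items."""
    it = iter(sequence)
    # `ch in it` consumes the iterator up to the first match,
    # so later characters must be found further to the right.
    return all(ch in it for ch in candidate)


X = "ABCDGH"
Y = "AEDFHR"

# Hypothetical candidates: one common subsequence, one that only fits X,
# and one whose characters are in the wrong order.
for candidate in ["ADH", "ADG", "HDA"]:
    common = is_subsequence(candidate, X) and is_subsequence(candidate, Y)
    print(candidate, "is common" if common else "is not common")
```

Only "ADH" passes both checks: "ADG" is a subsequence of X but not of Y, and "HDA" uses the right characters in the wrong order.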

The Longest Common Subsequence (LCS) Algorithms

LCS algorithms aim to find the longest subsequence common to two or more strings. Here are the major LCS algorithms and techniques.

1. Dynamic Programming Approach

Dynamic programming offers a powerful technique for solving the longest common subsequence (LCS) problem. This approach breaks the problem down into smaller, more manageable subproblems and reuses their results, which lets you calculate the LCS efficiently for larger sequences. In dynamic programming, there are two commonly used techniques:

(i) Tabulation method

The Tabulation Method is an application of dynamic programming used to solve the Longest Common Subsequence (LCS) problem. It involves building a table, often represented as a 2D array, to store the lengths of the LCS for different prefixes of the given sequences S1 and S2.

The length of the LCS can be computed efficiently by filling in the table iteratively, starting from the base cases and moving towards a solution for the entire sequences.

Here's a more detailed explanation of the Tabulation method:

Initialization: We start by creating a 2D array, usually denoted as a table, with dimensions (n+1) x (m+1), where n is the length of sequence S1 and m is the length of sequence S2. This extra row and column are used to represent the empty sequence (i.e., when either sequence has a length of 0).

Filling in the Table: We iterate over each cell of the table, starting from the first row and column and moving row by row. At each cell (i, j) of the table, we calculate the length of the LCS for the prefixes of S1 and S2 up to positions i and j, respectively.

If the current characters of S1 and S2 at positions i-1 and j-1 are the same, then we increment the length of the LCS by 1 compared to the LCS of the previous prefixes (i-1 and j-1).

If the characters are different, we take the maximum of the LCS obtained by excluding either the last character of S1 or the last character of S2.

Base Cases: The first row and column of the table represent the LCS lengths when one of the sequences is empty. These values are initialized to 0 because an empty sequence has no elements that are common to any other sequence.

Final Result: Once the table is filled, the value in the bottom-right cell represents the length of the LCS for the entire sequences S1 and S2.
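The steps above can be sketched in Python as follows. This is a minimal illustration rather than a definitive implementation; the function names `lcs_table` and `lcs_length_tabulated` are assumptions made for this example.

```python
def lcs_table(s1, s2):
    """Build the (n+1) x (m+1) table of LCS lengths for all prefixes of s1 and s2."""
    n, m = len(s1), len(s2)
    # Row 0 and column 0 stay at 0: they represent an empty prefix.
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s1[i - 1] == s2[j - 1]:
                # Matching characters extend the LCS of the shorter prefixes.
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                # Otherwise drop the last character of s1 or of s2.
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table


def lcs_length_tabulated(s1, s2):
    """Return the LCS length, read from the bottom-right cell of the table."""
    return lcs_table(s1, s2)[len(s1)][len(s2)]


print(lcs_length_tabulated("ABCDGH", "AEDFHR"))  # 3, the length of "ADH"
```

Filling the table takes time and memory proportional to (n+1) x (m+1), since each cell is computed once from at most three neighbouring cells.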

Let's consider an example to illustrate the tabulation method.

Suppose you have two sequences: "ABCBDAB" and "BDCAB." You can construct a table where each cell represents the length of the LCS for corresponding prefixes of the two sequences.

Initialization: We create a 2D array, table, with dimensions (n+1) x (m+1), where n is the length of sequence S1 ("ABCBDAB") and m is the length of sequence S2 ("BDCAB"). Since S1 has a length of 7 and S2 has a length of 5, our table will have dimensions 8 x 6.

Filling in the Table: We iterate over each cell of the table and fill it based on the characters of the sequences. Let's go through this step:

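S1\S2 | B | D | C | A | B
----- | - | - | - | - | -
A     | 0 | 0 | 0 | 1 | 1
B     | 1 | 1 | 1 | 1 | 2
C     | 1 | 1 | 2 | 2 | 2
B     | 1 | 1 | 2 | 2 | 3
D     | 1 | 2 | 2 | 2 | 3
A     | 1 | 2 | 2 | 3 | 3
B     | 1 | 2 | 2 | 3 | 4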

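Base Cases: When comparing an empty sequence with any other sequence, the LCS length is always 0. Hence, the first row and column of the full table are initialized to 0 (that row and column of zeros are not shown in the table above).

For every other cell, we compare the character of S1 for that row with the character of S2 for that column. If they match, the cell is 1 plus the value of its diagonal neighbour (up and to the left). Otherwise, it is the larger of the values in the cell above and the cell to the left.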

Final Result: The value in the bottom-right cell of the table represents the length of the LCS for the entire sequence. In this case, the LCS length is 4.

So, for the sequences "ABCBDAB" and "BDCAB", the LCS has length 4, and one such subsequence is "BCAB" ("BDAB" is another). This example demonstrates how the Tabulation method efficiently computes the length of the LCS by systematically filling in a table that covers every pair of prefixes of the input sequences.
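The table gives only the length, but the subsequence itself can be recovered by walking back from the bottom-right cell. The sketch below shows one way to do this; the function name `lcs_string` and the tie-breaking rule (prefer moving up when both neighbours are equal) are assumptions, and a different rule can return a different LCS of the same length.

```python
def lcs_string(s1, s2):
    """Return one longest common subsequence of s1 and s2."""
    n, m = len(s1), len(s2)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s1[i - 1] == s2[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back from the bottom-right cell, collecting matched characters.
    chars = []
    i, j = n, m
    while i > 0 and j > 0:
        if s1[i - 1] == s2[j - 1]:
            chars.append(s1[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1  # the value came from the cell above
        else:
            j -= 1  # the value came from the cell to the left
    return "".join(reversed(chars))


print(lcs_string("ABCBDAB", "BDCAB"))  # BCAB with this tie-breaking rule
```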

(ii) Memoization Method

The memoization method, also called the top-down approach, stores the results of subproblems in a cache or memoization table. It is typically implemented with recursion, where every recursive call first checks the memoization table before computing its result.

The memoization method is very effective when subproblems overlap. Overlapping subproblems would otherwise be solved many times during the computation, so memoizing them avoids redundant work and improves the algorithm's efficiency.

Additionally, the memoization method computes only the subproblems that the recursion actually reaches, rather than filling in every cell of the table. The stored results can also be reused afterwards, for example to look up the LCS length of any pair of prefixes that has already been solved.

Let's revisit the earlier example and use the memoization method to compute the length of the LCS between two sequences, "ABCBDAB" and "BDCAB". The code below stores intermediate results in a memo table and returns the length of the LCS:

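# Cache: (i, j) -> LCS length of s1[:i] and s2[:j].
# Clear it before calling lcs_length() with different sequences.
memo = {}

def lcs_length(s1, s2, i, j):
    # Base case: an empty prefix has no common subsequence.
    if i == 0 or j == 0:
        return 0
    # Reuse a previously computed result if one is available.
    if (i, j) in memo:
        return memo[(i, j)]
    if s1[i - 1] == s2[j - 1]:
        # Last characters match: extend the LCS of the shorter prefixes.
        memo[(i, j)] = 1 + lcs_length(s1, s2, i - 1, j - 1)
    else:
        # Otherwise drop the last character of one sequence or the other.
        memo[(i, j)] = max(lcs_length(s1, s2, i - 1, j), lcs_length(s1, s2, i, j - 1))
    return memo[(i, j)]

s1 = "ABCBDAB"
s2 = "BDCAB"
length = lcs_length(s1, s2, len(s1), len(s2))
print("Length of LCS:", length)
# Output: Length of LCS: 4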

In this code, the results of previously computed subproblems are stored in the memo dictionary. Before computing the LCS length for a pair of positions, the lcs_length() function checks the memoization table to avoid redundant work.

When run, it prints a length of 4, which is the length of the longest common subsequence between “ABCBDAB” and “BDCAB.” This example shows how memoization, that is, storing and reusing intermediate results, makes LCS algorithms more efficient: each pair (i, j) is solved at most once, so the work is bounded by the product of the two sequence lengths rather than growing exponentially.
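As a variation on the same idea, Python's standard library can manage the cache. The sketch below uses `functools.lru_cache` on an inner function so that each call gets a fresh cache; the names `lcs_length_cached` and `solve` are assumptions for this example.

```python
from functools import lru_cache


def lcs_length_cached(s1, s2):
    """Top-down LCS length with the cache handled by functools.lru_cache."""

    @lru_cache(maxsize=None)
    def solve(i, j):
        # LCS length of the prefixes s1[:i] and s2[:j].
        if i == 0 or j == 0:
            return 0
        if s1[i - 1] == s2[j - 1]:
            return 1 + solve(i - 1, j - 1)
        return max(solve(i - 1, j), solve(i, j - 1))

    return solve(len(s1), len(s2))


print(lcs_length_cached("ABCBDAB", "BDCAB"))  # 4
print(lcs_length_cached("ABCDGH", "AEDFHR"))  # 3
```

Because the cache lives inside each call, results from one pair of sequences cannot leak into another. Like any recursive version, this can hit Python's recursion limit on very long inputs, where the tabulation method is the safer choice.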

2. Brute-force approach: Naive recursive algorithm

One of the simplest ways to solve the longest common subsequence (LCS) problem is brute force, typically written as a naive recursive algorithm. It exhaustively explores the ways of pairing up elements of the two sequences to find the longest common subsequence.

Here’s how the naive recursive algorithm works:

Exploring all possibilities: The algorithm starts with the entire sequences and recursively narrows the search to shorter and shorter prefixes.

Comparing elements: At each step of the recursion, the algorithm compares the last elements of the two current prefixes. If they match, that element is included in the potential longest common subsequence.

Recursive calls: If the last elements match, the algorithm recurses on both prefixes with that element removed. If they differ, it makes two recursive calls, one dropping the last element of the first sequence and one dropping the last element of the second, and keeps the better result.

Backtracking: Once the recursion reaches an empty prefix, the calls return, and the results are combined on the way back up to give the longest common subsequence.

However, the straightforward approach provided by the naive recursive algorithm for solving LCS has significant downsides, especially when it comes to efficiency and scalability. As input sequences become larger and larger in terms of size, the time complexity of these algorithms grows exponentially, making them impractical for large datasets.

Consider an example that demonstrates how the naive recursive algorithm works.

Suppose we have two sequences:

Sequence 1: ABCDEFG

Sequence 2: BDCABF

The method starts with the whole sequences and compares their last elements:

If the last elements match, that element is counted as part of the LCS, and the algorithm continues with both sequences shortened by one element (here, that would be ABCDEF and BDCAB).

If they differ, as G and F do here, there are two choices: drop the last element of the first sequence (leaving ABCDEF and BDCABF) or drop the last element of the second (leaving ABCDEFG and BDCAB). The algorithm tries both and keeps the longer result. This process continues until all possible combinations have been explored and the longest common subsequence is determined.
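A minimal sketch of this naive recursion is shown below. It also counts the recursive calls to make the repeated work visible; the function name `lcs_naive` and the call counter are assumptions added for illustration.

```python
def lcs_naive(s1, s2):
    """Return (LCS length, number of recursive calls) using plain recursion."""
    calls = 0

    def solve(i, j):
        nonlocal calls
        calls += 1
        # An empty prefix has no common subsequence.
        if i == 0 or j == 0:
            return 0
        if s1[i - 1] == s2[j - 1]:
            # Last elements match: count them and shorten both prefixes.
            return 1 + solve(i - 1, j - 1)
        # Last elements differ: try dropping one from either sequence.
        return max(solve(i - 1, j), solve(i, j - 1))

    length = solve(len(s1), len(s2))
    return length, calls


length, calls = lcs_naive("ABCDEFG", "BDCABF")
print("LCS length:", length)  # 3, for example "BDF"
print("Recursive calls:", calls)
```

Nothing is cached here, so the same pair of prefixes can be solved many times. That repeated work is what makes the running time grow exponentially in the worst case.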

While the naive recursive algorithm provides a basic understanding of how to approach the LCS problem, more efficient algorithms, such as dynamic programming techniques, are typically preferred for practical applications due to their improved time complexity.

Practical Applications

Longest Common Subsequence (LCS) algorithms are used in many areas. Here are some of the main applications.

1. Bioinformatics

LCS algorithms in computational biology are essential for DNA alignment and similarity analysis. By detecting common subsequences in DNA sequences, scientists can learn evolutionary relationships, identify genetic mutations, and comprehend biological functions. For example, LCS algorithms help reveal evolutionary trends by comparing the genetic sequences of two different species.

2. Text processing

Text processing mainly relies on LCS algorithms to detect plagiarism and compare documents. These algorithms help discover any similarities among documents, thereby assisting in the identification of copied material or paraphrased texts. Some schools and publishers utilize LCS algorithms to validate the originality of academic papers and manuscripts.

3. Version control systems

Version control systems rely on LCS-style algorithms to identify changes between different file versions. By finding the longest common subsequence of lines between two versions, these systems can track modifications, merge changes, and help resolve conflicts in software development projects. For instance, the diff tools in platforms like Git build on this idea to show what changed between code revisions.
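To illustrate the idea, the sketch below produces a simple line-by-line diff from an LCS table. It is a teaching example under stated assumptions, not how Git or any particular tool is implemented; production diff tools use more refined algorithms. The function name `lcs_diff` and the sample file contents are hypothetical.

```python
def lcs_diff(old_lines, new_lines):
    """Return diff lines prefixed with '  ' (kept), '- ' (removed) or '+ ' (added)."""
    n, m = len(old_lines), len(new_lines)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if old_lines[i - 1] == new_lines[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back through the table; lines on the LCS are unchanged.
    diff = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and old_lines[i - 1] == new_lines[j - 1]:
            diff.append("  " + old_lines[i - 1])
            i -= 1
            j -= 1
        elif j > 0 and (i == 0 or table[i][j - 1] >= table[i - 1][j]):
            diff.append("+ " + new_lines[j - 1])
            j -= 1
        else:
            diff.append("- " + old_lines[i - 1])
            i -= 1
    diff.reverse()
    return diff


# Hypothetical file versions used only for this example.
old = ["def greet():", "    print('hi')", "greet()"]
new = ["def greet(name):", "    print('hi')", "    print(name)", "greet('Ada')"]
for line in lcs_diff(old, new):
    print(line)
```

Lines that belong to the longest common subsequence are reported as unchanged, and everything else is reported as a removal from the old version or an addition in the new one.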

4. Image Processing and Speech Recognition

Beyond these domains, LCS algorithms have applications in diverse fields, such as image processing and speech recognition. In image processing, LCS algorithms can be used to compare images and identify similar patterns or objects. In speech recognition, these algorithms aid in identifying common phonetic sequences, enhancing the accuracy of speech-to-text conversion systems.

Conclusion

In conclusion, the Longest Common Subsequence (LCS) concept has become indispensable across various domains. It is driving advancements in computer science and data analysis.

Understanding LCS algorithms gives you powerful tools for comparing and analyzing sequences of elements. These algorithms can help you uncover patterns and similarities in data and streamline critical processes like plagiarism detection, version control, and even image processing.

There is no doubt that LCS algorithms are essential tools for researchers, developers, and analysts alike.

FAQs

  1. What is meant by the longest common subsequence?

The longest common subsequence refers to the longest sequence of elements shared between two or more sequences while maintaining their relative order.

  2. What is the full form of LCS in DAA?

In the Design and Analysis of Algorithms (DAA) field, LCS stands for Longest Common Subsequence.

  3. What is the longest common subsequence diff algorithm?

The Longest Common Subsequence Diff Algorithm is a method that finds the longest common subsequence between two sequences and treats everything outside it as the differences between them.

  4. What is the longest common subsequence in the greedy approach?

A greedy approach builds a common subsequence by taking matching elements as soon as they are found. This does not guarantee the longest result for arbitrary inputs, which is why dynamic programming is the standard way to compute the LCS.

  5. What are the applications of LCS?

The applications of LCS are diverse and include bioinformatics for DNA sequence alignment, text processing for plagiarism detection, version control systems for identifying changes in files, and more.

  6. What is the longest common subsequence measure?

The longest common subsequence measure is a metric used to quantify the similarity between two sequences based on the length of their longest common subsequence.
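As a rough illustration of such a measure, the sketch below divides the LCS length by the length of the longer sequence. This normalization is an assumption chosen for the example, since other definitions exist, and the function names `lcs_length` and `lcs_similarity` are hypothetical.

```python
def lcs_length(a, b):
    """LCS length using two rows of the table instead of the full grid."""
    previous = [0] * (len(b) + 1)
    for x in a:
        current = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[len(b)]


def lcs_similarity(a, b):
    """Similarity in [0, 1]: LCS length divided by the longer length (assumed normalization)."""
    if not a and not b:
        return 1.0  # two empty sequences are treated as identical
    return lcs_length(a, b) / max(len(a), len(b))


print(lcs_similarity("ABCBDAB", "BDCAB"))  # 4 / 7
print(lcs_similarity("ABC", "ABC"))        # 1.0
```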

Rohan Vats

Passionate about building large scale web apps with delightful experiences. In pursuit of transforming engineers into leaders.
