Longest Duplicate Substring

HARD

Description

Given a string s, consider all duplicated substrings: (contiguous) substrings of s that occur 2 or more times. The occurrences may overlap.

Return any duplicated substring that has the longest possible length. If s does not have a duplicated substring, the answer is "".

Example 1:

Input: s = "banana"
Output: "ana"

Example 2:

Input: s = "abcd"
Output: ""

Constraints:

2 <= s.length <= 3 * 10⁴
s consists of lowercase English letters.

Approaches

Checkout 3 different approaches to solve Longest Duplicate Substring. Click on different approaches to view the approach and algorithm in detail.

Binary Search on Length with Rabin-Karp

A much more efficient approach uses binary search on the length of the substring combined with the Rabin-Karp algorithm for string matching. The problem has a monotonic property: if a duplicate substring of length k exists, a duplicate of any length less than k also exists. This allows us to binary search for the optimal length. For each candidate length, we use a rolling hash to efficiently check for duplicates in linear time.

Algorithm

The length of the longest duplicate substring can be anywhere from 0 to n-1. We can binary search for this length.
Define a search space for the length, low = 1, high = n-1.
In each step of the binary search, pick a mid length.
Use a helper function, search(length), to check if a duplicate substring of length mid exists.
The search(length) function uses the Rabin-Karp algorithm with a rolling hash.
- It slides a window of size length across the string.
- It calculates the hash of each substring in O(1) time (after an initial O(length) calculation) using the rolling hash technique.
- A HashSet is used to store hashes of seen substrings. If a hash collision occurs, we have found a potential duplicate.
- To be perfectly correct, upon a hash collision, one should compare the actual substrings to rule out false positives, though with a large prime modulus, this is rare.
If search(mid) finds a duplicate, we know a solution of at least length mid is possible. We store this potential answer and try for a longer one by setting low = mid + 1.
If search(mid) finds no duplicate, the length mid is too long. We must try a shorter one by setting high = mid - 1.
The final answer is the longest valid substring found during the search.

We can binary search for the answer, which is the length of the substring. The search space for the length L is [0, n-1]. For a given L, we need to efficiently check if there's any repeated substring of that length. This check can be performed in O(n) time using a rolling hash.

The rolling hash technique allows us to calculate the hash of a new substring from the previous one in O(1) time. We slide a window of length L over the string. We store the hashes of the substrings we've seen in a HashSet. If we compute a hash that's already in the set, we've likely found a duplicate. We then verify it's a true duplicate (not a hash collision) and proceed with the binary search.

If a duplicate of length mid is found, we know we might be able to do better, so we search in the upper half (low = mid + 1). If not, mid is too large, and we search in the lower half (high = mid - 1).

import java.util.HashSet;

class Solution {
    // Using prefix hashes for O(1) hash calculation of any substring
    public String longestDupSubstring(String S) {
        int n = S.length();
        int low = 1, high = n;
        String result = "";

        while (low <= high) {
            int mid = low + (high - low) / 2;
            String duplicate = search(S, mid);
            if (duplicate != null) {
                result = duplicate;
                low = mid + 1;
            } else {
                high = mid - 1;
            }
        }
        return result;
    }

    private String search(String s, int len) {
        long prime = 29;
        long power = 1;
        for (int i = 0; i < len; i++) {
            power = (power * prime);
        }

        HashSet<Long> seen = new HashSet<>();
        long currentHash = 0;
        for (int i = 0; i < s.length(); i++) {
            currentHash = (currentHash * prime) + (s.charAt(i) - 'a');
            if (i >= len) {
                currentHash -= (long)(s.charAt(i - len) - 'a') * power;
            }
            if (i >= len - 1) {
                if (seen.contains(currentHash)) {
                    // For this problem, simple hash check is often sufficient.
                    // A robust solution would verify the substring match here.
                    return s.substring(i - len + 1, i + 1);
                }
                seen.add(currentHash);
            }
        }
        return null;
    }
}

Complexity Analysis

Time Complexity: O(n log n). The binary search performs `O(log n)` iterations. Inside each iteration, the `search` function using rolling hash takes `O(n)` time to check for duplicates of a given length.Space Complexity: O(n). The `HashSet` used in the `search` function can store up to `O(n)` distinct hashes in the worst case.

Pros and Cons

Pros:

Significantly more efficient with O(n log n) time complexity.
Fast enough to pass the given constraints.
A standard and powerful technique for optimization problems.

Cons:

More complex to implement than the brute-force approach.
Requires understanding of binary search and hashing (Rabin-Karp).
Hash collisions are a potential issue, which might require verification or using multiple hash functions (double hashing) for robustness.

Code Solutions

Checking out 3 solutions in different languages for Longest Duplicate Substring. Click on different languages to view the code.

Video Solution

Watch the video walkthrough for Longest Duplicate Substring

Algorithms:

Binary Search

Patterns:

Sliding WindowRolling HashHash Function

Data Structures:

StringSuffix Array

Companies:

Coupang |