Remove Duplicates from String in Data Structures

Problem Statement

Given a string, we have to remove duplicate characters from the string such that each character appears only once (all the characters in the string should become unique). We can print the resultant string in any order.

Example

Input : "aaabbccd" Output : "abcd"

Explanation

As we can see the frequency of all the characters

Character	Frequency
a	3
b	2
c	2
d	1

After removing the duplicates, the frequency of all the characters became 1, so all the duplicate characters have been removed.

Character	Frequency
a	1
b	1
c	1
d	1

So the output is abcd

Constraints

$1 <= \text{|S|} <= 10^6$ ( $1 <= \text{Length of S} <= 10^6$ )
$\text{a} <= S[i] <= \text{z}$ (all the characters in the string will be in lower case)

Approach 1 (Using the Simple for loop)

This is the Brute-Force method to remove duplicates from string. In this method, we will use simple for loops to remove duplicates from string. In this algorithm, we will create an empty string and we will add the first occurrence of every character in our answer i.e. we will traverse our answer string and if the current character of the given string is present in our answer string, it means this is not the first occurrence of this character and it is already added to our answer. The algorithm and implementation of the approach are given below.

Algorithm:

Given a string s of length n
Create an empty string named ans = "", which will be our final answer
Run a for loop from i=0 to i=n-1
Run a nested for loop from j=0 to j=i-1
If at any point we found s[i]==s[j], we will break out of the loop (it means that this is not the first occurrence of this character and it is already added to our answer)
else if the inner loop ends without breaking, it means that we are visiting the character s[i] for the first time, so we will add it into our answer by ans=ans+s[i]

C++ Implementation

Java Implementation

Python Implementation

Output

Complexity Analysis

Time Complexity Analysis

In this algorithm, we are running a for loop from i=0 to i=n-1, which will perform i operations every time because we are running a loop from j=0 to j=i-1 every time. So in the worst case, the inner loop will take O(N) time complexity, and the outer loop will also take O(N) time complexity.

So the total worst-case time complexity for this approach to remove duplicates from string is O(N) * O(N) = O(N*N)

Space Complexity Analysis

In this approach we are using one array of characters to store the result i.e. ans, so we are using O(N) extra space.

Time Complexity: O(N*N) Space Complexity: O(N)

Approach 2 (Using a Set)

In this method, we are going to use the Set Data structure to remove duplicates from string. Finally, we will modify the given string such that the k elements from the starting are the unique elements, and we will return k which is the total number of unique elements in the string. (k is a positive integer which is the total number of unique characters in the string). Then we will print all the unique characters.

Note: Set is a data structure that stores only one occurrence of each element inserted into it.

Algorithm:

Given a string s of length n
Create a set named ans
Insert all the characters of the string into the set
Maintain a pointer named k and initialize it with 0
Start traversing the set, let x be the current element of the set and modify the string by performing the following operations.
- s[k] = x
- k++
Finally return the value of k, which is now the total number of unique characters in the string, so we will print the characters present at the index from i=0 to i=k-1

C++ Implementation

Java Implementation

Python Implementation

Output

Complexity Analysis

Time Complexity Analysis

In the first step, we are performing n operations by inserting all the characters of the string into the set which will take O(N) time.
Internally, the set uses hashing and takes O(NlogN) time for sorting the characters.
Finally, we are traversing the set which will take O(N) time in the worst case (if every element is unique)

So the total worst-case time complexity for this approach to remove duplicates from string is O(N)+O(NlogN)+O(N) = O(NlogN)

Space Complexity Analysis

In this approach, we are using a set and we are inserting all the characters of the string into the set. So we are using O(N) extra space in this approach.

Time Complexity: O(NlogN) Space Complexity: O(N)

Approach 3 (Using the Sorting Algorithm)

In this method, we will use sorting to remove duplicates from string. We will use the concept that in a sorted string all the characters which are equal to each other will be adjacent to each other, so we can easily check the adjacent elements of the string and store only the unique characters in our result. (The intuition behind this approach is to reduce the time complexity from O(N*N) to O(NlogN))

Algorithm:

Sort the string
Create an empty string named ans = "", which will be our final answer, and add the first character in the answer string
We know that the first element will always be unique and we have already added it to our answer string, so we maintain a pointer named k, which will start from 1 and keep track of the unique elements.
Run a loop from i=1 to i=n-1, if at any point s[i]!=s[i-1], then add s[i] into the answer string, else continue

C++ Implementation

Java Implementation

Python Implementation

Output

Complexity Analysis

Time Complexity Analysis

In the first step, we are sorting the string which takes O(NlogN) time
Then we are running the loop from 1 to n-1 which takes O(N) time

So the total worst-case time complexity for this approach to remove duplicates from string is O(NlogN)+O(N) = O(NlogN)

Space Complexity Analysis

In this approach we are using one array of characters to store the result i.e. ans, so we are using O(N) extra space.

Time Complexity: O(NlogN) Space Complexity: O(N)

Approach 4 (Using Hashing)

In this method, we are going to use the Hashing to remove duplicates from string. In this approach, we will create a map that will have a maximum size of 26 (because the given string only contains lower case characters specified in the problem statement). Then we will initialize the frequency 1 for each character present in the string. Then we will again traverse the string and if the count of any character is greater than 0, we will add this character to our answer and update its frequency to 0 so that next time we should not add it to our answer.

Algorithm:

Given a string s of length n
Create a map of <character,int> named m
Create an empty string named ans = "", which will be our final answer
Run a for loop from i=0 to i=n-1 and make the frequency of s[i] equal to 1
Now we will traverse the string again, and if m[s[i]] > 0 (frequency of s[i] is more than 0), we will add this character to our answer and make m[s[i]]=0
Finally print the ans string

C++ Implementation

Java Implementation

Python Implementation

Output

Complexity Analysis

Time Complexity Analysis

In the first step, we are performing n operations to initialize the frequency of every character with 1, which will take O(N) time.
In the second step, we are again performing n operations which will take O(N) time.

So the total worst-case time complexity for this approach to remove duplicates from string is O(N)+O(N) = O(N)

Space Complexity Analysis

In this approach, we are using a map and we are inserting 1 for all the characters of the string. So we are using O(N) extra space in this approach.

Time Complexity: O(N) Space Complexity: O(N)

Approach 5 (Using indexOf() Method)

In this approach, we will use the inbuilt methods already implemented in C++, Java, and Python languages. Just like in the first approach, we will find the first occurrence of each character and add it to our answer. But in this approach, we will find the first occurrence using the inbuilt methods.

Algorithm:

Given a string s of length n
Create an empty string named ans = "", which will be our final answer
Run a for loop from i=0 to i=n-1
Now for each character s[i], we will check if this character (s[i]) is already added to our answer or not.
If we are unable to find this character in our answer string, we will add this character to the answer string
Finally print the ans string

C++ Implementation

Java Implementation

Python Implementation

Output

Complexity Analysis

Time Complexity Analysis

In the outer loop, we are performing n operations (by checking the occurrence of each character in the answer string) which takes O(N) time
Inside the loop, we are checking if s[i] is already present in ans or not. The built-in methods will take the worst-case time complexity of O(N)

So the total worst-case time complexity for this approach to remove duplicates from string is O(N) * O(N) = $O(N^2)$

Space Complexity Analysis

In this approach we are using one array of characters to store the result i.e. ans, so we are using O(N) extra space.

Time Complexity: $O(N^2)$

Space Complexity: O(N)

Conclusion

In this quick tutorial, we have discussed 5 different approaches to remove duplicates from string

In Approach 1, we used simple for loops that took O(N*N) time complexity
In Approach 2, we used the Set data structure that took O(NLogN) time complexity
In Approach 3, we sorted the string which took O(NLogN) time complexity
In approach 4, we used hashing by using the map data structure that took O(N) time complexity
In approach 5, we used the built-in methods in C++, Java, and Python that took O(N*N) time complexity

FAQ

Q.What can be the best time complexity for removing the duplicates?

A. By using an unordered set, we can remove duplicates from the string in only O(N) time complexity and O(N) auxiliary space. This is because the unordered_set internally uses hashing which will take an average of O(1) time complexity for each element, so the total time complexity to remove the duplicates from the string comes out to be O(N).