HTML Entity Parser

MEDIUM

Description

HTML entity parser is the parser that takes HTML code as input and replace all the entities of the special characters by the characters itself.

The special characters and their entities for HTML are:

Quotation Mark: the entity is " and symbol character is ".
Single Quote Mark: the entity is ' and symbol character is '.
Ampersand: the entity is & and symbol character is &.
Greater Than Sign: the entity is > and symbol character is >.
Less Than Sign: the entity is < and symbol character is <.
Slash: the entity is &frasl; and symbol character is /.

Given the input text string to the HTML parser, you have to implement the entity parser.

Return the text after replacing the entities by the special characters.

Example 1:

Input: text = "&amp; is an HTML entity but &ambassador; is not."
Output: "& is an HTML entity but &ambassador; is not."
Explanation: The parser will replace the &amp; entity by &

Example 2:

Input: text = "and I quote: &quot;...&quot;"
Output: "and I quote: \"...\""

Constraints:

1 <= text.length <= 10⁵
The string may contain any possible characters out of all the 256 ASCII characters.

Approaches

Checkout 2 different approaches to solve HTML Entity Parser. Click on different approaches to view the approach and algorithm in detail.

Optimized Single Pass with StringBuilder

A more efficient approach is to parse the string in a single pass. We can iterate through the string, character by character, and build the result using a StringBuilder. When we encounter an ampersand &, we look ahead for a matching semicolon ;. If the substring between them corresponds to a known HTML entity, we append the special character; otherwise, we append the characters as they are. This avoids creating multiple intermediate strings.

Algorithm

Pre-populate a Map with the given HTML entity strings as keys and their corresponding special characters as values. This provides O(1) average time complexity for lookups.
Initialize a StringBuilder to efficiently build the result string.
Iterate through the input text using an index i.
At each character, check if it is an ampersand '&'.
If it is an '&', find the index j of the next semicolon ';'.
If a semicolon is found, extract the substring from i to j (inclusive).
Check if this substring exists as a key in our entity map.
- If it exists, append the mapped character to the StringBuilder and update the index i to j+1 to skip the processed entity.
- If it does not exist (e.g., &ambassador;), it's not a valid entity we need to parse. In this case, we treat the initial '&' as a literal character, append it to the StringBuilder, and increment i by 1.
If the character is not an '&', or if it was an '&' but did not form a valid entity, append the current character to the StringBuilder and increment i by 1.
After iterating through the entire string, convert the StringBuilder to a string and return it.

This optimized approach processes the string from left to right, making decisions locally without needing to re-scan the entire string. A StringBuilder is used for efficient string construction, and a HashMap provides fast lookups for entities. When an '&' is found, we check for a potential entity ending with ';'. If a valid, recognized entity is found, it's replaced. Otherwise, the characters are appended literally. This ensures that each character of the input string is processed only once.

import java.util.Map;
import java.util.HashMap;

class Solution {
    public String entityParser(String text) {
        Map<String, String> entityMap = new HashMap<>();
        entityMap.put("&quot;", "\"");
        entityMap.put("&apos;", "'");
        entityMap.put("&amp;", "&");
        entityMap.put("&gt;", ">");
        entityMap.put("&lt;", "<");
        entityMap.put("&frasl;", "/");

        StringBuilder sb = new StringBuilder();
        int i = 0;
        int n = text.length();
        while (i < n) {
            if (text.charAt(i) == '&') {
                int j = text.indexOf(';', i);
                // Check if a semicolon exists and the substring is a valid entity
                if (j != -1) {
                    String entity = text.substring(i, j + 1);
                    if (entityMap.containsKey(entity)) {
                        sb.append(entityMap.get(entity));
                        i = j + 1;
                        continue;
                    }
                }
            }
            // If not an entity or not starting with '&', append the character
            sb.append(text.charAt(i));
            i++;
        }
        return sb.toString();
    }
}

Complexity Analysis

Time Complexity: O(N), where N is the length of the string. We iterate through the string once. Inside the loop, `indexOf` and `substring` operations take time proportional to the length of the entity, which is a small constant. Map lookups are also constant on average. Thus, the overall complexity is linear.Space Complexity: O(N) in the worst case for the `StringBuilder`. The map uses constant extra space, as the number of entities is fixed.

Pros and Cons

Pros:

More efficient than the chained replace() approach as it processes the string in a single pass.
Avoids the overhead of creating multiple intermediate strings, saving memory and reducing garbage collection pressure.

Cons:

The code is slightly more complex to write and read compared to the chained replace() method.

Code Solutions

Checking out 3 solutions in different languages for HTML Entity Parser. Click on different languages to view the code.

Video Solution

Watch the video walkthrough for HTML Entity Parser

Data Structures:

Hash TableString