HTML Entity Parser
MEDIUMDescription
HTML entity parser is the parser that takes HTML code as input and replace all the entities of the special characters by the characters itself.
The special characters and their entities for HTML are:
- Quotation Mark: the entity is
"and symbol character is". - Single Quote Mark: the entity is
'and symbol character is'. - Ampersand: the entity is
&and symbol character is&. - Greater Than Sign: the entity is
>and symbol character is>. - Less Than Sign: the entity is
<and symbol character is<. - Slash: the entity is
⁄and symbol character is/.
Given the input text string to the HTML parser, you have to implement the entity parser.
Return the text after replacing the entities by the special characters.
Example 1:
Input: text = "& is an HTML entity but &ambassador; is not." Output: "& is an HTML entity but &ambassador; is not." Explanation: The parser will replace the & entity by &
Example 2:
Input: text = "and I quote: "..."" Output: "and I quote: \"...\""
Constraints:
1 <= text.length <= 105- The string may contain any possible characters out of all the 256 ASCII characters.
Approaches
Checkout 2 different approaches to solve HTML Entity Parser. Click on different approaches to view the approach and algorithm in detail.
Optimized Single Pass with StringBuilder
A more efficient approach is to parse the string in a single pass. We can iterate through the string, character by character, and build the result using a StringBuilder. When we encounter an ampersand &, we look ahead for a matching semicolon ;. If the substring between them corresponds to a known HTML entity, we append the special character; otherwise, we append the characters as they are. This avoids creating multiple intermediate strings.
Algorithm
- Pre-populate a
Mapwith the given HTML entity strings as keys and their corresponding special characters as values. This provides O(1) average time complexity for lookups. - Initialize a
StringBuilderto efficiently build the result string. - Iterate through the input
textusing an indexi. - At each character, check if it is an ampersand
'&'. - If it is an
'&', find the indexjof the next semicolon';'. - If a semicolon is found, extract the substring from
itoj(inclusive). - Check if this substring exists as a key in our entity map.
- If it exists, append the mapped character to the
StringBuilderand update the indexitoj+1to skip the processed entity. - If it does not exist (e.g.,
&ambassador;), it's not a valid entity we need to parse. In this case, we treat the initial'&'as a literal character, append it to theStringBuilder, and incrementiby 1.
- If it exists, append the mapped character to the
- If the character is not an
'&', or if it was an'&'but did not form a valid entity, append the current character to theStringBuilderand incrementiby 1. - After iterating through the entire string, convert the
StringBuilderto a string and return it.
This optimized approach processes the string from left to right, making decisions locally without needing to re-scan the entire string. A StringBuilder is used for efficient string construction, and a HashMap provides fast lookups for entities. When an '&' is found, we check for a potential entity ending with ';'. If a valid, recognized entity is found, it's replaced. Otherwise, the characters are appended literally. This ensures that each character of the input string is processed only once.
import java.util.Map;
import java.util.HashMap;
class Solution {
public String entityParser(String text) {
Map<String, String> entityMap = new HashMap<>();
entityMap.put(""", "\"");
entityMap.put("'", "'");
entityMap.put("&", "&");
entityMap.put(">", ">");
entityMap.put("<", "<");
entityMap.put("⁄", "/");
StringBuilder sb = new StringBuilder();
int i = 0;
int n = text.length();
while (i < n) {
if (text.charAt(i) == '&') {
int j = text.indexOf(';', i);
// Check if a semicolon exists and the substring is a valid entity
if (j != -1) {
String entity = text.substring(i, j + 1);
if (entityMap.containsKey(entity)) {
sb.append(entityMap.get(entity));
i = j + 1;
continue;
}
}
}
// If not an entity or not starting with '&', append the character
sb.append(text.charAt(i));
i++;
}
return sb.toString();
}
}
Complexity Analysis
Pros and Cons
- More efficient than the chained
replace()approach as it processes the string in a single pass. - Avoids the overhead of creating multiple intermediate strings, saving memory and reducing garbage collection pressure.
- The code is slightly more complex to write and read compared to the chained
replace()method.
Code Solutions
Checking out 3 solutions in different languages for HTML Entity Parser. Click on different languages to view the code.
class Solution {
public
String entityParser(String text) {
Map<String, String> d = new HashMap<>();
d.put(""", "\"");
d.put("'", "'");
d.put("&", "&");
d.put(">", ">");
d.put("<", "<");
d.put("⁄", "/");
StringBuilder ans = new StringBuilder();
int i = 0;
int n = text.length();
while (i < n) {
boolean found = false;
for (int l = 1; l < 8; ++l) {
int j = i + l;
if (j <= n) {
String t = text.substring(i, j);
if (d.containsKey(t)) {
ans.append(d.get(t));
i = j;
found = true;
break;
}
}
}
if (!found) {
ans.append(text.charAt(i++));
}
}
return ans.toString();
}
}
Video Solution
Watch the video walkthrough for HTML Entity Parser
Similar Questions
5 related questions you might find useful
Data Structures:
Subscribe to Scale Engineer newsletter
Learn about System Design, Software Engineering, and interview experiences every week.
No spam, unsubscribe at any time.