3 ways to remove emojis from text files in Java
To put it simply, emojis ruin application developers' lives.
Legacy applications that worked perfectly for years now cough up a lung as millennials express themselves through tiny images rather than standard, readable text.
Older databases can't handle them. Many GUIs can't render them. And XML parsers fail because they simply weren't built to process a "pile of poo" emoji. However, not all hope is lost. Here are the three best ways to remove emojis from text in Java:
- Whitelist valid input with a regular expression (regex);
- Blacklist emoji Unicode in Java with code points; and
- Use a Java emoji parser library.
The first line of defense in Java text manipulation is always to defer to a regular expression.
1. Use a regex whitelist to remove emojis from text
When a program validates text, a best practice is to specify what to allow, not what to exclude. Whitelists are preferable to blacklists. A good regex will remove emojis from text in Java but won't accidentally filter out valid characters. For example:
- Allow punctuation, marks and whitespace: \p{P} \p{Z} \p{M}
- Allow numbers: \p{N}
- Allow any language character set: \p{L}
- Allow invisible format characters and unpaired surrogates: \p{Cf} \p{Cs}
A developer should combine the above requirements to create a regex that filters emojis from any string. It will look like this:
[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]
When a developer applies this regex in code, it will look like this:
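A minimal sketch of that application follows. The class and method names (EmojiWhitelistFilter, removeEmojis) and the sample string are illustrative assumptions; only the pattern itself comes from the text above.

```java
import java.util.regex.Pattern;

final class EmojiWhitelistFilter {

    // Negated character class: matches anything that is NOT a letter, mark,
    // number, punctuation, separator, format character, surrogate or whitespace.
    private static final Pattern NOT_ALLOWED = Pattern.compile(
            "[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]");

    // Returns the input with every non-whitelisted character removed.
    static String removeEmojis(String input) {
        return NOT_ALLOWED.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // \uD83D\uDCA9 is the surrogate pair for U+1F4A9, the pile of poo emoji.
        String text = "Deploy failed \uD83D\uDCA9, retry at 10:30!";
        System.out.println(removeEmojis(text));
    }
}
```

Java's regex engine matches by code point, so a properly paired emoji is treated as a single symbol character and removed, even though \p{Cs} is on the whitelist. One side effect to keep in mind: characters in the symbol categories, such as $ and +, are not on the whitelist either, so this pattern strips them as well.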

2. Remove emojis from a Java string with a Unicode blacklist
If a regex whitelist isn't an option, a developer can explicitly blacklist a Unicode range of emojis in code. For example, a blacklist can target the 80 code points between U+1F600 and U+1F64F and strip any character that falls in that range.
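A minimal sketch of such a blacklist, working directly on code points, might look like the following. The class name, method name and sample string are assumptions made for illustration; only the U+1F600 to U+1F64F range comes from the text.

```java
final class EmojiBlacklistFilter {

    // Bounds of the blacklisted range (inclusive).
    private static final int RANGE_START = 0x1F600;
    private static final int RANGE_END = 0x1F64F;

    // Rebuilds the string from every code point that lies outside the range.
    static String removeEmoticons(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints()
             .filter(cp -> cp < RANGE_START || cp > RANGE_END)
             .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        // \uD83D\uDE00 is U+1F600 (grinning face), inside the range.
        // \uD83D\uDCA9 is U+1F4A9 (pile of poo), outside the range.
        String text = "Nice work \uD83D\uDE00 but this broke \uD83D\uDCA9";
        // The grinning face is removed; the pile of poo is left in place.
        System.out.println(removeEmoticons(text));
    }
}
```

Iterating with codePoints() rather than char by char keeps surrogate pairs together, so each emoji is compared against the range as one value.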

One problem with a blacklist is that emojis are spread across many ranges of Unicode code points. A filter limited to U+1F600 through U+1F64F will catch the smiley faces but will miss the offending poo emoji (U+1F4A9), along with the emojis in the U+1F680 to U+1F6FF range. It's a cat-and-mouse game to keep a blacklist up to date with the ever-evolving Unicode specification. While this method will remove emojis from text in Java, it takes extra time to implement and to maintain.

3. Use a Java emoji library
A third and highly recommended option is to use a Java emoji library such as the one from Vincent Durmont for more programmatic control. A developer can add Durmont's Java emoji parser to any Maven project by including the following entry in the POM:
<dependency>
  <groupId>com.vdurmont</groupId>
  <artifactId>emoji-java</artifactId>
  <version>5.1.1</version>
</dependency>
This Java emoji parser tool not only removes emojis, but also converts an emoji into its text-based alias. For example, the alias for the poo emoji is :hankey:. The switch makes the emoji more easily digestible by parsers or graphical interfaces that support only text rendering.

Emojis cause trouble with legacy Java applications that weren't developed with them in mind. But with these three strategies, a developer can easily remove emojis from text.