
3 ways to remove emojis from text files in Java

To put it simply, emojis ruin application developers' lives.

Legacy applications that worked perfectly for years now cough up a lung as millennials express themselves through tiny images rather than standard, readable text.

Older databases can't handle them. Many GUIs can't render them. And XML parsers fail because they simply weren't built to process a "pile of poo" emoji. However, not all hope is lost. Here are the three best ways to remove emojis from text in Java files:

  1. Whitelist valid input with a regular expression (regex);
  2. Blacklist emoji Unicode in Java with code points; and
  3. Use a Java emoji parser library.

The first line of defense in Java text manipulation is always to defer to a regular expression.

1. Use a regex whitelist to remove emojis from text

When a program validates text, a best practice is to specify what to allow, not what to exclude. Whitelists are preferable to blacklists. A good regex will remove emojis from text in Java but won't accidentally filter out valid characters. For example:

  • Allow punctuation, marks and whitespace: \p{P} \p{Z} \p{M}
  • Allow numbers: \p{N}
  • Allow any language character set: \p{L}
  • Allow invisible format characters and surrogates: \p{Cf} \p{Cs}

A developer should combine the above requirements to create a regex that filters emojis from any string. It will look like this:

[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]

When a developer applies this regex in code, it will look like this:

Figure: Whitelist emoji removal. A regular expression and the built-in replaceAll method of the String class can be used to remove emojis from text.
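As a minimal sketch of the whitelist approach, the pattern above can be compiled once and applied to any string. The class and method names (EmojiWhitelist, stripEmojis) are illustrative assumptions and do not come from the original listing:

```java
import java.util.regex.Pattern;

class EmojiWhitelist {

    // Matches any code point that is NOT a letter, mark, number, punctuation,
    // separator, format character, surrogate or whitespace.
    private static final Pattern NOT_ALLOWED =
            Pattern.compile("[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]");

    // Hypothetical helper: removes everything the whitelist does not allow.
    static String stripEmojis(String input) {
        return NOT_ALLOWED.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // U+1F600 (grinning face), built from its code point to avoid
        // source-encoding issues.
        String grin = new String(Character.toChars(0x1F600));
        System.out.println(stripEmojis("Hello " + grin + " world, caf\u00E9 123!"));
    }
}
```

One design consequence worth knowing: because the symbol categories are not on the whitelist, this pattern also drops characters such as + and $. If those are valid input, \p{Sm} and \p{Sc} can be added to the character class.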

2. Remove emojis in text from a Java string with a Unicode blacklist

If a regex whitelist doesn't suit the project, a developer can explicitly blacklist the Unicode range of emojis in code. The following example demonstrates a Unicode blacklist that removes the emojis found in the 80 code points between 0001F600 and 0001F64F:

Figure: Blacklist emoji removal. An alternative method to remove emojis from text is to blacklist the emoji Unicode code points.
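A blacklist over that single range might be sketched as follows. The class and method names (EmojiBlacklist, stripEmoticons) are assumptions made for illustration; the original listing may be structured differently:

```java
class EmojiBlacklist {

    // The range named in the text: 0001F600 to 0001F64F (80 code points).
    private static final int RANGE_START = 0x1F600;
    private static final int RANGE_END = 0x1F64F;

    // Hypothetical helper: drops every code point inside the blacklisted range.
    static String stripEmoticons(String input) {
        StringBuilder out = new StringBuilder(input.length());
        // Iterate by code point, not by char, because these emojis are
        // stored as surrogate pairs in a Java String.
        input.codePoints()
             .filter(cp -> cp < RANGE_START || cp > RANGE_END)
             .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        String grin = new String(Character.toChars(0x1F600)); // inside the range
        System.out.println(stripEmoticons("Build passed " + grin));
    }
}
```

Everything outside the range passes through untouched, so valid text is never removed by accident, but any emoji outside the range survives as well.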

One problem with a blacklist is that emojis are spread across many Unicode code point ranges. The range above catches the emoticon faces, but it misses emojis in other blocks, such as those in the 0001F680 to 0001F6FF range, and the offending poo emoji at 0001F4A9 also falls outside it. It's a cat and mouse game to keep a blacklist up to date with the ever-evolving Unicode specification. While this method will work to remove emojis from text in Java, it takes extra time to implement and maintain.
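To illustrate the maintenance burden, here is a hedged sketch of a blacklist that covers several Unicode blocks with a single regex. The class name, method name and choice of blocks are assumptions, and the list is deliberately incomplete:

```java
import java.util.regex.Pattern;

class EmojiRangeBlacklist {

    // Each range is one Unicode block. The list is illustrative, not complete.
    private static final Pattern EMOJI_BLOCKS = Pattern.compile(
            "[\\x{1F300}-\\x{1F5FF}"     // Miscellaneous Symbols and Pictographs
            + "\\x{1F600}-\\x{1F64F}"    // Emoticons
            + "\\x{1F680}-\\x{1F6FF}]"); // Transport and Map Symbols

    // Hypothetical helper: removes code points in any of the listed blocks.
    static String stripListedBlocks(String input) {
        return EMOJI_BLOCKS.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String poo = new String(Character.toChars(0x1F4A9));    // listed block
        String cowboy = new String(Character.toChars(0x1F920)); // unlisted block
        // The first emoji is removed; the second survives until its
        // block is added to the pattern.
        System.out.println(stripListedBlocks("a" + poo + "b" + cowboy + "c"));
    }
}
```

Every block that the Unicode standard adds, such as the one containing 0001F920 in this example, has to be appended to the pattern by hand.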

Figure: Remove emoji from text. Three ways to remove emojis from text: blacklists, whitelists and a Java emoji parser library.

3. Use a Java emoji library

A third and highly recommended option is to use a Java emoji library such as the one from Vincent Durmont for more programmatic control. A developer can add Durmont's Java emoji parser to any Maven project by including the following entry in the POM:

<dependency>
    <groupId>com.vdurmont</groupId>
    <artifactId>emoji-java</artifactId>
    <version>5.1.1</version>
</dependency>

This Java emoji parser tool not only removes emojis, but also converts an emoji into its text-based alias. For example, the alias for the poo emoji is :hankey:. The switch makes the emoji more easily digestible by parsers or graphical interfaces that support only text rendering.

Figure: Java emoji library. A Java emoji library can be used to remove emojis from text and to convert them into a readable format.

Emojis cause trouble with legacy Java applications that weren't developed with them in mind. But with these three strategies, a developer can easily remove emojis from text.
