What Happens When You Cluster Emojis 🤨: K-Means Edition

Learn how to cluster PNG images using Python and K-Means clustering. This is part of a series on image clustering techniques to develop and encode the Emojist language.

Finding Patterns in Emojis with Machine Learning

Our initial discussion of Emojist, an emoji-based language, began with the Emoji ABCs. To recap, so far the Unicode Consortium has grouped emojis into 10 top-level categories and 99 subcategories. Our Emoji AI can think for itself, so what if we just show her some images and let her group them herself? This post is the first in a series that uses clustering and other machine learning techniques to create a new kind of language specification.

Data

For any artificial intelligence or machine learning project the key input is data. In this case it is a collection of thousands of emojis. Many technology platforms have their own graphical style for each emoji. Emojist has standardized on the Twitter Emoji (twemoji) library, which offers support for 3,304 emojis. The emojis, rendered here as 72x72 PNGs, are used as inputs for this post.

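# Plot twemoji data set from (A)pple to (Z)ebra
# EMOJIS is a pandas DataFrame indexed by code point with a "filename" column
import matplotlib.pyplot as plt
from PIL import Image

fig, ax = plt.subplots(1, 2, figsize=(10, 4))
img = Image.open(EMOJIS.loc["1F34E", "filename"])
ax[0].imshow(img)

img = Image.open(EMOJIS.loc["1F993", "filename"])
ax[1].imshow(img)
plt.show()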

Plot of Apple (1F34E) and Zebra (1F993) twemoji art

Approach

From the scikit-learn User Guide:

The K-Means algorithm clusters data by trying to separate samples in n groups of equal variance, minimizing a criterion known as the inertia or within-cluster sum-of-squares (see below). This algorithm requires the number of clusters to be specified. It scales well to large number of samples and has been used across a large range of application areas in many different fields.

K-means is among the first clustering algorithms a data scientist typically applies to see whether the data points fall into naturally occurring groups. In this case the data points are the emoji PNG images. The images generated by the twemoji project are in “RGB with Alpha Channel” (RGBA) format, that is, color images with 4 data channels (Red, Green, Blue, and Alpha). To simplify this complex data we will reduce the color space to grayscale so the algorithm can focus on the contents of the image (i.e. shapes and patterns) rather than color. Next, we need to tell the algorithm how many clusters to look for. Since the Unicode Consortium has identified 10 top-level categories and 99 subcategories, those will be our cluster counts. Then we can analyze the cluster results to determine whether they make any sense and whether they align with what the Unicode Consortium has identified.
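To make the inertia criterion from the quote concrete, here is a minimal sketch on synthetic 2-D points rather than emoji pixels. The blob positions, sample counts, and variable names are assumptions made up for illustration. It fits K-Means with a fixed cluster count and recomputes the within-cluster sum-of-squares by hand to compare against the value scikit-learn reports.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data: three well-separated 2-D blobs (not emoji pixels).
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
points = np.vstack([c + rng.normal(scale=1.0, size=(50, 2)) for c in blob_centers])

# K-Means must be told how many clusters to look for.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)

# Inertia: sum of squared distances from each point to its assigned center.
assigned_centers = kmeans.cluster_centers_[kmeans.labels_]
manual_inertia = float(((points - assigned_centers) ** 2).sum())

print(f"scikit-learn inertia: {kmeans.inertia_:.3f}")
print(f"manual inertia:       {manual_inertia:.3f}")
```

The two numbers should agree up to floating-point error. The same quantity is what K-Means tries to minimize over the 5,184 pixel dimensions of each emoji.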

Transform to Grayscale

The Pillow library makes this a one-liner using the convert method. After the conversion, this is what the machine “sees” and will attempt to cluster according to the rules that guide K-Means.

# Plot "Apple" as grayscale with alpha channel (LA) and a grayscale-only (i.e. L == Luminance)
fig, ax = plt.subplots(1, 2, figsize=(10, 4))
img = Image.open(EMOJIS.loc["1F34E", "filename"]).convert('LA')
ax[0].imshow(img)

img = Image.open(EMOJIS.loc["1F34E", "filename"]).convert('L')
# A single-channel image needs an explicit gray colormap; matplotlib's default is a color map
ax[1].imshow(img, cmap="gray", vmin=0, vmax=255)
plt.show()

Emoji plot of a Grayscale Apple (1F34E)
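For readers curious about what the conversion does to each pixel, the sketch below builds a tiny hypothetical RGBA image in memory (not a twemoji file) and compares Pillow's 'L' output with the ITU-R 601-2 luma weights that the Pillow documentation gives for this conversion. The pixel values and variable names are assumptions chosen for illustration.

```python
import numpy as np
from PIL import Image

# Hypothetical 2x2 RGBA test image (not a twemoji file):
# opaque red, opaque green, opaque blue, and fully transparent red.
rgba = np.array(
    [[[255, 0, 0, 255], [0, 255, 0, 255]],
     [[0, 0, 255, 255], [255, 0, 0, 0]]],
    dtype=np.uint8,
)
img = Image.fromarray(rgba)  # a uint8 array with 4 channels is read as RGBA

# Grayscale values produced by Pillow.
gray = np.asarray(img.convert("L")).astype(float)

# Luma weights documented for Pillow's "L" mode: 0.299 R + 0.587 G + 0.114 B.
expected = rgba[..., 0] * 0.299 + rgba[..., 1] * 0.587 + rgba[..., 2] * 0.114

print("Pillow 'L' values:\n", gray)
print("Weighted sum of R, G, B:\n", expected.round(1))
```

In this check the transparent red pixel comes out the same as the opaque one, which suggests the alpha channel plays no part in the 'L' values. That is worth keeping in mind, since the transparent background of every emoji becomes part of what K-Means sees.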

# Prep each image for input to the K-means algorithm:
# flatten every 72x72 grayscale image into one row of 5,184 pixel values
import numpy as np

data_emoji = []
for _, img in EMOJIS.iterrows():
    data_emoji.append(list(Image.open(img["filename"]).convert('L').getdata()))

data_emoji = np.array(data_emoji)
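The shape of the resulting matrix is what matters to K-Means: one row per emoji and one column per pixel. The following sketch illustrates that layout with random in-memory images standing in for the emoji files. The images, the `SIZE` constant, and the variable names are assumptions for illustration, and it uses `np.asarray(...).ravel()` as an alternative way to flatten each image.

```python
import numpy as np
from PIL import Image

SIZE = 72  # edge length of the PNGs used in this post

# Hypothetical stand-ins for the emoji files: random in-memory RGBA images.
rng = np.random.default_rng(42)
fake_emojis = [
    Image.fromarray(rng.integers(0, 256, size=(SIZE, SIZE, 4), dtype=np.uint8))
    for _ in range(5)
]

# One row per image, one column per grayscale pixel (row-major order).
features = np.stack([np.asarray(im.convert("L")).ravel() for im in fake_emojis])
print(features.shape)  # (5, 5184)

# Any row can be reshaped back into a 72x72 picture,
# which is also how cluster centers can be displayed as images.
restored = features[0].reshape(SIZE, SIZE)
```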

Cluster Top Level Categories

Let’s see if K-Means can help the Emoji AI identify the ten top-level categories:

Name                 Emoji Count
Activities                    84
Animals & Nature             140
Component                      4
Flags                        269
Food & Drink                 129
Objects                      250
People & Body                347
Smileys & Emotion            151
Symbols                      220
Travel & Places              215

# Setup and run K-Means
import pandas as pd
from sklearn.cluster import KMeans

kmeans_top = KMeans(n_clusters=10, random_state=0)
clusters = kmeans_top.fit_predict(data_emoji)

# Add cluster label back to emojis indexed by id
EMOJIS.loc[:, "Top Level Cluster"] = pd.Series(kmeans_top.labels_, index=EMOJIS.index)

# Plot "average" image for each cluster
fig, ax = plt.subplots(2, 5, figsize=(15, 4))
centers = kmeans_top.cluster_centers_.reshape(10, 72, 72)
for axi, center in zip(ax.flat, centers):
    axi.set(xticks=[], yticks=[])
    axi.imshow(center, interpolation='nearest', cmap=plt.cm.binary)

Composite KMeans 10 Top-Level Emoji Clusters

# Plot sampling of three images from each cluster
for clst in sorted(EMOJIS["Top Level Cluster"].unique()):
    members = EMOJIS[EMOJIS["Top Level Cluster"] == clst]

    imgs = members.sample(min(3, len(members)), random_state=5)
    print(f"Cluster {clst} examples: " + ", ".join(imgs.index))

    fig, axs = plt.subplots(1, 3, figsize=(9, 3))
    for axi, (_, img) in zip(axs, imgs.iterrows()):
        axi.imshow(Image.open(img["filename"]))

    plt.show()

Cluster 0 examples: 1F397, 1F379, 1F686

Top-level Cluster 0 Sampled Emojis

Cluster 1 examples: 1F319, 1F1EB-1F1EE, 1F98A

Top-level Cluster 1 Sampled Emojis

Cluster 2 examples: 1F41E, 1F5E3, 1F9F2

Top-level Cluster 2 Sampled Emojis

Cluster 3 examples: 1F623, 1F625, 1F606

Top-level Cluster 3 Sampled Emojis

Cluster 4 examples: 1F9E9, 2744, 1F53B

Top-level Cluster 4 Sampled Emojis

Cluster 5 examples: 1F925, 262F, 1F37F

Top-level Cluster 5 Sampled Emojis

Cluster 6 examples: 1F233, 1F518, 1F1F5-1F1F9

Top-level Cluster 6 Sampled Emojis

Cluster 7 examples: 1F1EB-1F1F0, 1F1E6-1F1EE, 1F1F3-1F1FF

Top-level Cluster 7 Sampled Emojis

Cluster 8 examples: 1F1F3-1F1EE, 1F1FE-1F1EA, 1F1F5-1F1F8

Top-level Cluster 8 Sampled Emojis

Cluster 9 examples: 1F3DE, 1F305, 2697

Top-level Cluster 9 Sampled Emojis

Well, most of those clusters don’t seem to make much sense. However, there are a few interesting ones. Cluster 3 looks like it found ‘Smileys & Emotion’. Cluster 7 seems to have latched on to the flags of the former British Empire. Cluster 8 could be horizontally striped flags.
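Eyeballing samples is one way to judge the clusters. A more systematic check is to tabulate cluster labels against the known categories. The sketch below does this on a tiny made-up table. The "Category" column and every value in it are assumptions for illustration and are not taken from the real EMOJIS data.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Toy stand-in for EMOJIS; the "Category" column and its values are hypothetical.
toy = pd.DataFrame(
    {
        "Category": [
            "Flags", "Flags", "Flags",
            "Smileys & Emotion", "Smileys & Emotion",
            "Objects",
        ],
        "Top Level Cluster": [7, 7, 8, 3, 3, 7],
    }
)

# Rows are known categories, columns are K-Means clusters, cells are emoji counts.
table = pd.crosstab(toy["Category"], toy["Top Level Cluster"])
print(table)

# Adjusted Rand index: 1.0 means identical groupings,
# values near 0.0 mean agreement no better than chance.
score = adjusted_rand_score(toy["Category"], toy["Top Level Cluster"])
print(f"Adjusted Rand index: {score:.2f}")
```

A category whose row is concentrated in a single column was recovered well by K-Means, while a row spread across many columns was not.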

Cluster Sub Categories

Maybe the Emoji AI can identify some of the 99 sub-categories using K-Means…

# Setup and run K-Means
kmeans_sub = KMeans(n_clusters=99, random_state=912488, n_init=100, max_iter=1000)
clusters = kmeans_sub.fit_predict(data_emoji)

# Plot "average" image for each cluster
fig, ax = plt.subplots(11, 9, figsize=(50, 20))
centers = kmeans_sub.cluster_centers_.reshape(99, 72, 72)
for axi, center in zip(ax.flat, centers):
    axi.set(xticks=[], yticks=[])
    axi.imshow(center, interpolation='nearest', cmap=plt.cm.binary)

Composite KMeans 99 Sub-Categories Emoji Clusters

Flag varieties were nicely broken out and the algorithm clustered seemingly every different pattern that exists. The Emoji AI must have studied flags while preparing the Across the Globe riddles.

English Empire flag cluster

Many Stars and Stripes flag cluster

Three vertical stripes flag cluster

Left side cross shape flag cluster

Star and five stripe flag cluster

Cat faces and face emojis with glasses were nicely separated from normal emoji faces.

Cat face emoji cluster

Face emojis with glasses cluster

The Emoji AI also learned about the phases of the moon.

Moon emoji cluster

She may be a budding meteorologist too!

Weather emoji cluster

Last but not least, she is also learning about trends and graphs.

Graph emoji cluster

There were many more interesting clusters and plenty of others that didn’t make any sense. Try for yourself with the code snippets above and see what you come up with!
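When exploring 99 clusters, it can help to list each cluster's size and its most central member instead of sampling at random. Here is a minimal sketch of that idea on synthetic data. The feature matrix is a made-up stand-in for data_emoji, and the variable names are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for data_emoji: 60 "images" of 16 pixels in three loose groups.
rng = np.random.default_rng(1)
group_means = np.array([0.0, 100.0, 200.0])
features = np.vstack([m + rng.normal(scale=5.0, size=(20, 16)) for m in group_means])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(features)

# transform() returns each sample's distance to every cluster center.
distances = kmeans.transform(features)  # shape: (n_samples, n_clusters)

# For each cluster, the row index of the sample closest to its center.
representatives = distances.argmin(axis=0)

for cluster_id, row in enumerate(representatives):
    size = int((kmeans.labels_ == cluster_id).sum())
    print(f"Cluster {cluster_id}: {size} members, most central row = {row}")
```

With the real data, the row index could be mapped back to EMOJIS.index to find the emoji that best represents each cluster.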

Conclusion

K-Means is a simple clustering method for an initial evaluation of natural groupings. In this case, we found a couple of decent clusters that represented familiar concepts. To make the Emoji AI more intelligent we can use more advanced machine learning algorithms that capture more nuanced features. Stay tuned for our next post, where we apply machine learning to see if we can better cluster emojis and create a robustly defined taxonomy for use in Emojist.


Download Emoji Riddles™ on your favorite device and start learning Emojist!
