What Happens When You Cluster Emojis 🤨: K-Means Edition
Finding Patterns in Emojis with Machine Learning
Our initial discussion of Emojist, an emoji-based language, began with the Emoji ABCs. To recap, the Unicode Consortium has so far grouped emojis into 10 top-level categories and 99 subcategories. Our Emoji AI can think for itself, so what if we just show her some images and let her group them herself? This post is the first in a series that uses clustering and other machine learning techniques to create a new kind of language specification.
Data
For any artificial intelligence or machine learning project, the key input is data. In this case it is a collection of thousands of emojis. Many technology platforms have their own graphical style for each emoji. Emojist has standardized on the Twitter Emoji (twemoji) library, which offers support for 3,304 emojis. The emojis, rendered as 72x72 PNGs, are the inputs for this post.
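# Plot twemoji data set from (A)pple to (Z)ebra
# Assumes EMOJIS is a pandas DataFrame indexed by code point with a "filename" column
import matplotlib.pyplot as plt
from PIL import Image
fig, ax = plt.subplots(1, 2, figsize=(10, 4))
img = Image.open(EMOJIS.loc["1F34E", "filename"])  # red apple
ax[0].imshow(img)
img = Image.open(EMOJIS.loc["1F993", "filename"])  # zebra
ax[1].imshow(img)
plt.show()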
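The snippets in this post rely on an `EMOJIS` DataFrame whose construction is not shown. One possible way to build something similar is sketched below, assuming a local directory of PNG files named by code point (for example `1f34e.png`). The function name `build_emoji_index` and the directory path are hypothetical and not part of the original project.

```python
from pathlib import Path

import pandas as pd


def build_emoji_index(png_dir):
    """Return a DataFrame indexed by upper-case code point with a 'filename' column.

    Hypothetical helper: assumes each PNG is named after its code point.
    """
    paths = sorted(Path(png_dir).glob("*.png"))
    return pd.DataFrame(
        {"filename": [str(p) for p in paths]},
        index=pd.Index([p.stem.upper() for p in paths], name="codepoint"),
    )


# Hypothetical usage; "twemoji/72x72" is a placeholder path, not taken from the post.
# EMOJIS = build_emoji_index("twemoji/72x72")
```

With an index like this, lookups such as `EMOJIS.loc["1F34E", "filename"]` resolve to the file path of a single emoji.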
Approach
From the scikit-learn User Guide:
The K-Means algorithm clusters data by trying to separate samples in n groups of equal variance, minimizing a criterion known as the inertia or within-cluster sum-of-squares (see below). This algorithm requires the number of clusters to be specified. It scales well to large number of samples and has been used across a large range of application areas in many different fields.
K-means is among the first clustering algorithms a data scientist typically applies to see whether there are naturally occurring groups of data points. In this case the data points are the emoji PNG images. The images generated by the twemoji project are in “RGB with Alpha Channel” format, i.e. color images with four data channels (red, green, blue, and alpha). To simplify this complex data we will reduce the color space to grayscale so that the algorithm focuses on the contents of the image (i.e. shapes and patterns) rather than color. Next, we need to tell the algorithm how many clusters to look for. Since the Unicode Consortium has identified 10 top-level categories and 99 subcategories, those will be our cluster counts. Then we can analyze the cluster results to determine whether they make any sense and whether they align with what the Unicode Consortium has identified.
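To make the "inertia" criterion from the quote concrete, here is a minimal sketch on made-up 2-D data rather than emojis. It recomputes the within-cluster sum of squares by hand and compares it with the value scikit-learn reports. The helper name `inertia` and the toy blobs are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans


def inertia(points, labels, centers):
    """Within-cluster sum of squares: squared distance of each point to its assigned center."""
    diffs = points - centers[labels]
    return float(np.sum(diffs ** 2))


# Toy data: two well-separated blobs in 2-D (made up for illustration).
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
manual = inertia(points, model.labels_, model.cluster_centers_)
print(manual, model.inertia_)  # the two values should agree up to rounding
```

For the emoji data the same quantity is computed over 5,184-dimensional pixel vectors, so "close" means similar brightness at the same pixel positions, not similar meaning.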
Transform to Grayscale
The Pillow library makes this a one-liner using the convert method. After the conversion, this is what the machine “sees” and will attempt to cluster according to the rules that guide K-Means.
# Plot "Apple" as grayscale with alpha channel (LA) and a grayscale-only (i.e. L == Luminance)
fig, ax = plt.subplots(1, 2, figsize=(10, 4))
img = Image.open(EMOJIS.loc["1F34E", "filename"]).convert('LA')
ax[0].imshow(img)
img = Image.open(EMOJIS.loc["1F34E", "filename"]).convert('L')
ax[1].imshow(img)
plt.show()
# Prep each image for input to the K-means algorithm:
# one row per emoji, one column per grayscale pixel (72 * 72 = 5,184 values)
import numpy as np
data_emoji = []
for _, img in EMOJIS.iterrows():
    data_emoji.append(list(Image.open(img["filename"]).convert('L').getdata()))
data_emoji = np.array(data_emoji)
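One detail worth knowing: converting an RGBA image straight to `'L'` discards the alpha channel, so fully transparent pixels keep whatever RGB value happens to be stored underneath them. A possible variation, not the approach used in this post, is to composite each emoji onto a solid background first. The sketch below assumes a white background and uses a synthetic image; the helper name `to_grayscale_vector` is hypothetical.

```python
import numpy as np
from PIL import Image


def to_grayscale_vector(img, background=(255, 255, 255, 255)):
    """Composite an image onto a solid background, then flatten it to grayscale values."""
    rgba = img.convert("RGBA")
    canvas = Image.new("RGBA", rgba.size, background)
    flattened = Image.alpha_composite(canvas, rgba).convert("L")
    return np.asarray(flattened, dtype=np.uint8).ravel()


# Synthetic 72x72 example: fully transparent except for an opaque red square.
demo = Image.new("RGBA", (72, 72), (0, 0, 0, 0))
demo.paste((255, 0, 0, 255), (24, 24, 48, 48))
vector = to_grayscale_vector(demo)
print(vector.shape)
```

Whether this changes the clusters in practice would have to be checked against the real twemoji files.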
Cluster Top Level Categories
Let’s see if K-Means can help the Emoji AI identify the ten top-level categories:
Name | Emoji Count |
---|---|
Activities | 84 |
Animals & Nature | 140 |
Component | 4 |
Flags | 269 |
Food & Drink | 129 |
Objects | 250 |
People & Body | 347 |
Smileys & Emotion | 151 |
Symbols | 220 |
Travel & Places | 215 |
# Setup and run K-Means
import pandas as pd
from sklearn.cluster import KMeans
kmeans_top = KMeans(n_clusters=10, random_state=0)
clusters = kmeans_top.fit_predict(data_emoji)
# Add cluster label back to emojis indexed by id
EMOJIS.loc[:, "Top Level Cluster"] = pd.Series(kmeans_top.labels_, index=EMOJIS.index)
# Plot "average" image for each cluster
fig, ax = plt.subplots(2, 5, figsize=(15, 4))
centers = kmeans_top.cluster_centers_.reshape(10, 72, 72)
for axi, center in zip(ax.flat, centers):
    axi.set(xticks=[], yticks=[])
    axi.imshow(center, interpolation='nearest', cmap=plt.cm.binary)
# Plot sampling of three images from each cluster
for clst in sorted(EMOJIS["Top Level Cluster"].unique()):
    members = EMOJIS[EMOJIS["Top Level Cluster"] == clst]
    imgs = members.sample(min(3, len(members)), random_state=5)
    print(f"Cluster {clst} examples: " + ", ".join(imgs.index))
    fig, axs = plt.subplots(1, 3, figsize=(9, 3))
    for axi, (_, img) in zip(axs, imgs.iterrows()):
        axi.imshow(Image.open(img["filename"]))
    plt.show()
Cluster 0 examples: 1F397, 1F379, 1F686
Cluster 1 examples: 1F319, 1F1EB-1F1EE, 1F98A
Cluster 2 examples: 1F41E, 1F5E3, 1F9F2
Cluster 3 examples: 1F623, 1F625, 1F606
Cluster 4 examples: 1F9E9, 2744, 1F53B
Cluster 5 examples: 1F925, 262F, 1F37F
Cluster 6 examples: 1F233, 1F518, 1F1F5-1F1F9
Cluster 7 examples: 1F1EB-1F1F0, 1F1E6-1F1EE, 1F1F3-1F1FF
Cluster 8 examples: 1F1F3-1F1EE, 1F1FE-1F1EA, 1F1F5-1F1F8
Cluster 9 examples: 1F3DE, 1F305, 2697
Well, most of those clusters don’t seem to make much sense. However, there are a few interesting ones. Cluster 3 looks like it found “Smileys & Emotion”. Cluster 7 seems to have latched onto the flags of the former British Empire. Cluster 8 could be horizontally striped flags.
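Eyeballing samples is one way to judge the clusters. A more systematic check is to tabulate cluster labels against the known categories. The sketch below uses a tiny made-up table in place of `EMOJIS`. The `Category` column and its values are hypothetical, since the post does not show how categories are stored.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Toy stand-in for EMOJIS; the "Category" column and its values are hypothetical.
toy = pd.DataFrame(
    {
        "Category": [
            "Flags", "Flags", "Flags",
            "Smileys & Emotion", "Smileys & Emotion",
            "Objects",
        ],
        "Top Level Cluster": [7, 7, 8, 3, 3, 8],
    }
)

# Rows are known categories, columns are cluster ids, cells are emoji counts.
table = pd.crosstab(toy["Category"], toy["Top Level Cluster"])
print(table)

# Adjusted Rand index: 1.0 for identical groupings, close to 0.0 for random labels.
score = adjusted_rand_score(toy["Category"], toy["Top Level Cluster"])
print(round(score, 3))
```

A row whose counts fall mostly in one column suggests K-Means recovered that category. A row spread across many columns suggests it did not.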
Cluster Sub Categories
Maybe the Emoji AI can identify some of the 99 sub-categories using K-Means…
# Setup and run K-Means
kmeans_sub = KMeans(n_clusters=99, random_state=912488, n_init=100, max_iter=1000)
clusters = kmeans_sub.fit_predict(data_emoji)
# Plot "average" image for each cluster
fig, ax = plt.subplots(11, 9, figsize=(50, 20))
centers = kmeans_sub.cluster_centers_.reshape(99, 72, 72)
for axi, center in zip(ax.flat, centers):
    axi.set(xticks=[], yticks=[])
    axi.imshow(center, interpolation='nearest', cmap=plt.cm.binary)
Flag varieties were nicely broken out and the algorithm clustered seemingly every different pattern that exists. The Emoji AI must have studied flags while preparing the Across the Globe riddles.
Cat faces and face emojis with glasses were nicely separated from normal emoji faces.
The Emoji AI also learned about the phases of the moon.
She may be a budding meteorologist too!
Last but not least, she is also learning about trends and graphs.
There were many more interesting clusters and plenty of others that didn’t make any sense. Try it for yourself with the code snippets above and see what you come up with!
Conclusion
K-Means is a simple clustering method that works well for an initial evaluation of natural groupings. In this case, we found a couple of decent clusters that represented familiar concepts. To make the Emoji AI more intelligent, we can use more advanced machine learning algorithms that capture more nuanced features. Stay tuned for our next post on applying machine learning to see if we can better cluster emojis and create a robustly defined taxonomy for use in Emojist.