
What's the world's most efficient language?

January 23rd, 2022

Edit: This has now made it to the front page of HackerNews. I'm sure a lot of valid criticism will come from it. Do note that I'm more interested in the discussion than in the experiment itself (I'm already learning from the comments). I consistently state here that I'm not qualified to do a proper analysis and that I did this for fun. The "Deriving information" section ends with "...the only thing I'm really measuring is efficiency from the perspective of printer ink. But so be it, I'll measure that.". Also check out Limitations.

A while ago I was sitting on a plane and in a moment of boredom picked up the in-flight magazine.

The magazine had a little travel article, and it was written in English on one page and Thai on the other.

The Thai version was so much shorter that I started to wonder if it was more efficient. In other words, was it able to convey the same exact meaning to the reader with fewer "resources" than the English version?

The topic came up again when I was speaking to a Japanese man at a language exchange meetup. He said:

"You can technically write Japanese without Kanji, but it is a lot less efficient."

So what is language efficiency? And how can we measure it?

Thoughts from a nobody

One can't find a lot of resources about "language efficiency" very easily, and the majority of my findings related to spoken language. Here are two good reads on this topic for those interested:

But I was interested in written language, simply because something in me told me I could quantify it, without any formal linguistics knowledge.

As you'd imagine, quantifying the efficiency of a language is a complicated task, one that I'm not at all qualified to explore in a scientifically significant way.

But I thought I'd do a little experiment.

Could I gather the same text in various languages that I'm certain (as certain as you can be) conveys the exact same meaning, and then calculate from the language snippets how much information they contain?

If I could do that, I could arrive at the following measure of efficiency: meaning / amount of information.

Meaning is supposed to be a constant, thus the more information the body of text contains, the lower its "meaning per piece of information" ratio, making the language less efficient.

In essence, if we take information to be information = noise + signal, we're looking for the signal density of languages, or, conversely, the noise ratio - how much stuff is in there that doesn't need to be?
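As a rough illustration of the ratio above, here is a minimal sketch that treats meaning as a constant and compares languages by their information totals. The function name `efficiency`, the constant `MEANING`, and the choice of US English as a baseline are assumptions made for illustration; the three pixel totals are copied from the results table further down.

```python
# Sketch only: meaning is treated as a constant, so efficiency is
# inversely proportional to the amount of information (here, black pixels).
MEANING = 1.0  # arbitrary constant; only ratios between languages matter


def efficiency(information: float, meaning: float = MEANING) -> float:
    """Return meaning per unit of information."""
    if information <= 0:
        raise ValueError("information must be positive")
    return meaning / information


# Total pixels copied from the results table in this post.
total_pixels = {
    "Gujarati": 62814,
    "English (United States)": 82897,
    "Japanese": 136036,
}

baseline = efficiency(total_pixels["English (United States)"])
relative = {
    language: efficiency(pixels) / baseline
    for language, pixels in total_pixels.items()
}

for language, ratio in relative.items():
    print(f"{language}: {ratio:.2f}x the efficiency of US English")
```

Because meaning is held constant, its actual value cancels out in the comparison, which is why the ranking depends only on the amount of information.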

So how could I derive that information value?

Deriving information

When brainstorming about this, a few immediate indicators might come to mind, like total number of characters.

Characters, however, are not very uniform - they vary widely within and across languages. Consider:

Mandarin (Simplified)	Finnish
我爱咖啡	Minä rakastan kahvia

Both sentences say "I love coffee".

If we're counting characters, Mandarin blows Finnish out of the water in efficiency.

But most people looking at this will immediately notice that Mandarin characters are much more condensed. They make up for in "complexity" what the sentence is missing in length.

And that "complexity" is what we're looking to measure.

Imagine I gave you a task to try and find a pattern in the following 2 different images. The task is "done" when you either find a pattern or decide you can't come up with one.

Which would take your brain more time to conclude - A or B?

A	B

I believe the answer would be B for most of us.

B has more information, more data - in this case, non-background colored "pixels" - that our brain needs to process before it makes sense of what it is seeing.

The same should hold true of characters. When reading, in order for our brain to determine the meaning of the character it is seeing (if it even knows it), it needs to take in all the pixels making up that character and run them across its "database" of known patterns to try and find a hit.

So there's my information metric: pixels.

To derive the amount of information present in a snippet of text, we can count the total "used" pixels i.e. on a black-on-white representation, count the black pixels.

Now, at this point I will mention once again that a true analysis of efficiency should be much more nuanced, and it's not something I'd be able to undertake.

In fact, a friend, upon hearing about this idea, said the only thing I'm really measuring is efficiency from the perspective of printer ink. But so be it, I'll measure that.

Approach

This post is bound to be long, so I'll spare many of the details here. But essentially, in order to test this out, I did the following:

1: Select a snippet of text that I would be likely to find good translations for in various languages

For this I landed on the Google Privacy Policy. And not the whole thing either. A tiny piece. So tiny that a real scientist would laugh at the idea of this whole thing even being called "an experiment".

But you must keep in mind the name of this blog: "Sunday Afternoon". Stuff contained here is often done over a single weekend just for the fun of it, as was the case with this, so I needed to keep things simple.

Nevertheless, I picked the Google Privacy Policy because:

I could easily scrape it in hundreds of languages
It needs to convey the exact same meaning in all languages
It has legal implications, meaning if Google puts it up online in a language, it must have been thoroughly checked
Google probably knows a thing or two about translations

The exact snippet I picked was:

"When you use our services, you’re trusting us with your information. We understand this is a big responsibility and work hard to protect your information and put you in control. This Privacy Policy is meant to help you understand what information we collect, why we collect it, and how you can update, manage, export, and delete your information."

I verified that it said exactly this in English, Portuguese, Spanish, Finnish, German, and Icelandic (the last two with external help) by actually reading it. For the other languages, I checked the snippet with Google Translate.

2: Pull and parse the data

For this I got a list of all existing locales, and pulled all the HTML from each language's privacy policy from https://policies.google.com/privacy?hl=<locale>.

I then found the desired paragraph and extracted it into a separate file for each language.

For the more technical readers, the paragraph's CSS selector was consistent across all languages, which is how I managed to extract it. It's easy to confirm you extracted the right thing by popping the snippet into a translator.
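The fetching and extraction step might look roughly like the sketch below. It uses only Python's standard library. The class name `intro-paragraph` and all helper names are hypothetical stand-ins, since the actual CSS selector isn't given here, and the real script may well have used a different HTML library.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://policies.google.com/privacy"
TARGET_CLASS = "intro-paragraph"  # hypothetical; the real selector is not given


def policy_url(locale: str) -> str:
    """Build the privacy policy URL for one locale, e.g. ?hl=fi."""
    return f"{BASE_URL}?{urlencode({'hl': locale})}"


class ParagraphExtractor(HTMLParser):
    """Collect the text of the first <p> carrying the target class."""

    def __init__(self, target_class: str):
        super().__init__()
        self.target_class = target_class
        self.capturing = False
        self.done = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "p" and not self.done and self.target_class in classes:
            self.capturing = True

    def handle_endtag(self, tag):
        if tag == "p" and self.capturing:
            self.capturing = False
            self.done = True

    def handle_data(self, data):
        if self.capturing:
            self.chunks.append(data)


def extract_paragraph(html: str, target_class: str = TARGET_CLASS) -> str:
    """Return the target paragraph's text with whitespace normalized."""
    parser = ParagraphExtractor(target_class)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


def fetch_paragraph(locale: str) -> str:
    """Download one locale's policy page and extract the paragraph."""
    with urlopen(policy_url(locale)) as response:
        html = response.read().decode("utf-8")
    return extract_paragraph(html)


# Offline demonstration on a made-up HTML fragment.
sample = '<div><p class="intro-paragraph">Minä <b>rakastan</b> kahvia</p><p>other</p></div>'
print(extract_paragraph(sample))
```

Calling `fetch_paragraph("fi")` would hit the network, so the demonstration above runs the extractor on an inline fragment instead.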

3: Map out how many pixels each character takes

Once I had all the clean data, I iterated over every character in the dataset and drew an image for each using Python's Pillow library.

From that image I could then count the total number of black pixels and generate a map of the results.

Here are the basics of how this works:

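# Python

from PIL import Image, ImageDraw, ImageFont

# Example character; the real script iterates over every character in the dataset
letter = 'a'

# Arial Unicode at size 60 (adjust the font path for your system)
arial_unicode = ImageFont.truetype('/Library/Fonts/Arial Unicode.ttf', 60)
img = Image.new('RGB', (200, 200), 'white')

# Draw the character in black on the white canvas
draw = ImageDraw.Draw(img)
draw.text((75, 0), letter, font=arial_unicode, fill='#000000')

# Count pure black pixels only, i.e. those where R + G + B == 0
pixels = list(img.getdata())
total_black_pixels = sum(1 for rgb in pixels if sum(rgb) == 0)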

4: Build up the results

Having determined the black pixel value (information) for each character, I could then derive how much information (again, in my limited definition), each language's written representation was using to convey the same meaning.
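To make the aggregation concrete, here is a minimal sketch of summing per-character pixel counts into per-language totals. The values in `pixel_map` and the two tiny sample strings are made up for illustration, not measured. How whitespace was handled in the original analysis is not stated; this sketch assumes every character is counted, with a space contributing zero pixels.

```python
def summarize(text: str, pixel_map: dict) -> dict:
    """Total pixels, character count and mean pixels per character."""
    total_pixels = sum(pixel_map[char] for char in text)
    total_chars = len(text)
    mean = round(total_pixels / total_chars, 2) if total_chars else 0.0
    return {
        "total_pixels": total_pixels,
        "total_chars": total_chars,
        "mean_pixels_per_char": mean,
    }


# Made-up pixel counts, NOT measured values.
pixel_map = {"a": 200, "b": 260, " ": 0, "我": 900}

# Hypothetical language names and snippets, for illustration only.
snippets = {"lang_x": "ab a", "lang_y": "我"}

results = {name: summarize(text, pixel_map) for name, text in snippets.items()}

# Rank from fewest to most pixels, i.e. most to least "efficient".
ranking = sorted(results, key=lambda name: results[name]["total_pixels"])
for position, name in enumerate(ranking, start=1):
    print(position, name, results[name])
```

The last column of the results table is consistent with this kind of division: for Gujarati, 62814 / 418 ≈ 150.27.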

Some manual intervention was needed here, and I ended up looking through every picture of a character that the script generated to make sure it was valid. Two key things were removing from the results the languages for which generic squares were drawn because the font didn't support some or all of their characters (e.g. Amharic), and making sure each drawing contained the full character.
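Part of that check could in principle be automated. The sketch below flags drawings where black pixels touch the edge of the canvas, which suggests the character may have been cut off. `touches_border` is a hypothetical helper, not something from the original script; it works on the same flat list of RGB tuples that `list(img.getdata())` produces, so the canvas width and height have to be passed in. It does nothing to detect the generic squares drawn for unsupported characters.

```python
def touches_border(pixels, width, height):
    """Return True if any pure-black pixel lies on the canvas edge.

    `pixels` is a flat, row-major list of (R, G, B) tuples, as returned
    by list(img.getdata()) for an image of size (width, height).
    """
    if len(pixels) != width * height:
        raise ValueError("pixel list does not match the given dimensions")
    for index, rgb in enumerate(pixels):
        if sum(rgb) != 0:
            continue  # not a black pixel
        row, col = divmod(index, width)
        if row in (0, height - 1) or col in (0, width - 1):
            return True
    return False


WHITE, BLACK = (255, 255, 255), (0, 0, 0)

# 3x3 canvas with a single black pixel in the centre: fully contained.
contained = [WHITE] * 9
contained[4] = BLACK

# Same canvas with a black pixel in the bottom-right corner: possibly clipped.
clipped = [WHITE] * 9
clipped[8] = BLACK

print(touches_border(contained, 3, 3), touches_border(clipped, 3, 3))
```

A flagged drawing would still need a look by eye, since a glyph can touch the edge without actually being truncated.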

Results

The most efficient language prize in my little child experiment was Gujarati, followed by Hebrew, and then Arabic. Gujarati and Hebrew also had some of the lowest mean pixel/character ratios in the dataset.

The least efficient ones were Japanese, Malay, and Canadian French. You heard that right. Out of all the French dialects included in the dataset, Canadian was the only one with different wording. I'd be curious to hear from someone who speaks French about whether the Canadian version has words that are actually not used elsewhere or if it's just a matter of choice of words.

English, by the way, was eighth on the list.

Particularly interesting to me was comparing results from languages/dialects in the same language group, such as the French from France vs. French from Canada comparison mentioned above. Here are a few takeaways from the results:

American English > British English

American English performed slightly better than British English, and this one actually makes intuitive sense to me.

In most cases when there's a variation between the two, British English tends to be the one with additional letters. Think "color" vs. "colour", "traveled" vs. "travelled", etc.

These additional letters are almost certainly redundant, and therefore inefficient. Given that there is a variation of the language without these extra letters, one can infer that they are not very important if the objective is to convey meaning efficiently.

Consider the r in cart for instance. Without that r the word would clash with an existing word - cat, so the letter is significant in establishing meaning. The u in color is not, however.

The Chinese Language Group

This was a surprising one to me. Simplified Mandarin Chinese was expectedly more efficient than Traditional Mandarin Chinese, but both were beaten out by Cantonese using traditional characters.

Also, if you ever wanted to put a number to the complexity of Chinese characters, they are around 3x more "complex" (pixels/char ratio) than the average of the dataset.

Portuguese from Brazil > Portuguese from Portugal (to my utter delight)

When reading both sentences one can indeed notice inherent language differences, like the lack of the word "você" in Portuguese from Portugal. However, in some cases the differences were merely a word choice thing (e.g. "Entendemos" vs. "Compreendemos").

Spanish from Spain > Spanish from Latin America

The biggest disparity within a language group happened with Spanish, with Spanish from Spain ranking at 14 and Latin American Spanish landing at 32. The vast majority of differences are purely arbitrary word selections, though.

Simplified Results

Dialects were collapsed if they used the same exact sentence.

Position	Language	Total pixels	Total chars	Mean pixels/char
1	Gujarati	62814	418	150.27
2	Hebrew	63533	313	202.98
3	Arabic	66116	257	257.26
4	Kannada	73659	507	145.28
5	Lithuanian	78828	311	253.47
6	Thai	80055	333	240.41
7	Swedish	81817	343	238.53
8	English (United States)	82897	345	240.28
9	English (United Kingdom)	84774	351	241.52
10	Croatian/Bosnian	85245	344	247.81
11	Slovenian	87038	343	253.76
12	Czech	88275	338	261.17
13	Afrikaans	88286	364	242.54
14	Spanish (Spain)	88551	354	250.14
15	Bengali	88585	382	231.9
16	Slovak	88887	344	258.39
17	Polish	88961	346	257.11
18	Persian	89931	388	231.78
19	Tamil	90312	409	220.81
20	Italian	91061	381	239.01
21	Catalan	91401	378	241.8
22	Serbian	91404	372	245.71
23	Telugu	92619	459	201.78
24	Cantonese	92751	101	918.33
25	Danish	92778	382	242.87
26	Faroese	92778	382	242.87
27	Danish	92778	382	242.87
28	Estonian	92868	350	265.34
29	Urdu	93788	406	231.0
30	Punjabi	93788	406	231.0
31	Urdu	93788	406	231.0
32	Spanish (Latin America)	95226	376	253.26
33	Galician	95886	371	258.45
34	Zulu	95935	355	270.24
35	Swahili	96728	355	272.47
36	Basque	97208	368	264.15
37	Hungarian	97570	375	260.19
38	Latvian	98572	382	258.04
39	Finnish	99466	375	265.24
40	Norwegian	99588	415	239.97
41	Greek	99795	410	243.4
42	Vietnamese	100035	403	248.23
43	Portuguese (Brazil)	100543	391	257.14
44	Chinese (Simplified)	100774	124	812.69
45	Ukrainian	101512	357	284.35
46	Dutch	102685	399	257.36
47	Marathi	105126	394	266.82
48	Portuguese (Portugal)	105385	412	255.79
49	Serbian	106028	365	290.49
50	Icelandic	106357	417	255.05
51	Russian	108623	365	297.6
52	German	110538	431	256.47
53	Turkish	111808	444	251.82
54	Malagasy	113242	471	240.43
55	French	113242	471	240.43
56	Bulgarian	115494	397	290.92
57	Hindi	118592	500	237.18
58	Indonesian	118687	418	283.94
59	Chinese (Traditional)	120295	138	871.7
60	Filipino	123179	456	270.13
61	Korean	124819	231	540.34
62	French (Canada)	126583	503	251.66
63	Malay	128355	448	286.51
64	Japanese	136036	215	632.73

You can find the full results in a table format on this website here and the CSV results here. These results also include the sentence in each language.
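The collapsing of dialects mentioned above the table can be sketched as grouping locales by their exact sentence. The locale codes and placeholder sentences below are illustrative only, not the scraped data, and `collapse_identical` is a hypothetical helper name.

```python
from collections import defaultdict


def collapse_identical(sentences: dict) -> dict:
    """Group locale codes that share exactly the same sentence."""
    groups = defaultdict(list)
    for locale, sentence in sentences.items():
        groups[sentence].append(locale)
    # Map each distinct sentence to the sorted list of locales using it.
    return {sentence: sorted(locales) for sentence, locales in groups.items()}


# Placeholder sentences, not the real policy text.
sentences = {
    "fr-FR": "phrase A",
    "fr-CH": "phrase A",
    "fr-CA": "phrase B",
}

collapsed = collapse_identical(sentences)
for sentence, locales in collapsed.items():
    print("/".join(locales), "->", sentence)
```

Comparing sentences for exact equality means that any difference, even a single character, keeps two dialects as separate rows.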

Limitations

The limitations of this little Sunday experiment are many: the size of the snippet, the lack of extensive validation, the lack of consideration for variations in writing, the use of only one font that may bias towards certain language families, etc.

But perhaps the most interesting discussions regard how counting pixels may be limited as an approach to measuring efficiency and how we could measure it instead.

Maybe (probably) the sheer amount of "information" is not the only factor contributing to our ability to read text efficiently?

Maybe the fact that Chinese characters were originally representative drawings helps association in the brain despite the extra strokes?

Maybe spaces also play a role and are thus also a form of information?

Maybe my whole approach to quantifying "information" is wrong?

Either way, I'd be curious to explore this topic further, and would love to hear any thoughts others may have. Feel free to send those to yakko [dot] majuri [at] protonmail [dot] com if you like.

GitHub

You can find my loose snippets of code used for this analysis on the yakkomajuri/lang GitHub repo.