Mar 18, 2023

Localize web extensions (and apps)

Primer on how to use localization/pluralization

When writing a web application or wiring up a small web extension, people often use only one language to get a smooth start without wrapping their heads around the proper naming of things in a foreign language. In my experience, devs tend to use English as the primary language for a coding project, and sometimes the primary language of the involved devs, which may not be English, is the only language in a project. Both are fine to get a project running.

With a growing number of users it becomes more likely that the chosen language is neither the user's primary nor secondary language. The software needs to be localized. To achieve this, companies and devs tend to just replace the static text in an application or an extension with its counterpart in the foreign language, but proper localization is more than that: there are number and date formats as well as different plural rules for each language.

Locale specific strings

Just adding new languages (read: translated texts) to your Web Extension is pretty straightforward. All strings that need to be localized/translated are held in one JSON file named messages.json, in a well-defined directory structure: create a folder called _locales in your extension's root folder and create a subdirectory for each locale you want to support, named after the language subtag. A language subtag is, loosely speaking, the language code (e.g. en or de), not a country code. Each subdirectory contains the messages.json where the actual strings are stored.

For an extension that supports English and German, the structure is as follows:

├── manifest.json
└── _locales
    ├── en
    │   └── messages.json
    └── de
        └── messages.json

The manifest.json file is shown in this example as an indicator of the extension's root directory.

It's also possible to add the region code to the localization, e.g. en-US for US English and en-GB for British English. If you do so, make sure to use an underscore _ in the directory name to separate the language subtag from the region variant. So for an extension supporting British English ("colour") and US English ("color") the directory structure should look like this:

├── manifest.json
└── _locales
    ├── en_US
    │   └── messages.json
    └── en_GB
        └── messages.json

Be aware: this is a common error source that's not that easy to spot. While the directories are using the underscore as a delimiter for language subtag and region, the source code side is using the form with a dash, e.g. the directory is named en_US, while the locale is referenced in the code as en-US.
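
To make the difference between the two spellings explicit, here is a minimal sketch of a conversion helper. The function names localeTagToDirectory and directoryToLocaleTag are made up for this example and are not part of any extension API.

```javascript
// Hypothetical helper: maps a locale tag as used in code ("en-US")
// to the directory name used below _locales ("en_US").
function localeTagToDirectory(tag) {
  return tag.replace(/-/g, "_");
}

// The reverse direction, from directory name to locale tag.
function directoryToLocaleTag(directory) {
  return directory.replace(/_/g, "-");
}

console.log(localeTagToDirectory("en-US")); // "en_US"
console.log(directoryToLocaleTag("en_GB")); // "en-GB"
```

A helper like this is mainly useful in build scripts or tests that walk the _locales folder.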

The `messages.json` structure

The minimal set for a locale-specific string is a string identifier and a message. The message is the string value itself that differs from locale to locale, while the identifier is used in the source code to look up the specific string. In a JSON structure the minimal pair of identifier and message looks like this:

{
  "thisIsMyIdentifier": {
    "message": "This is my message!"
  }
}

To retrieve the value "This is my message!" from the locale file, the i18n API is used:

const text = browser.i18n.getMessage("thisIsMyIdentifier");

// logs "This is my message!"
console.log(text);

Fallback when a locale or key isn't found

The locale that is used depends on the language and region settings of the application your extension is running in, as well as on the locales your extension supports. If your extension is running on a system that's configured as en-US and the required identifier is defined in the _locales/en_US/messages.json file, then this value is retrieved. If the identifier isn't found there, the lookup mechanism checks for it in _locales/en/messages.json. When it isn't defined there either, the default locale is checked (it's defined via default_locale in the manifest.json). In the edge case that it's not defined there either, browser.i18n.getMessage() returns an empty string. Hint: since empty strings are bad for users of your extension or application, I recommend implementing checks or unit tests before shipping the extension.
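
The lookup order described above can be sketched as a small function. This is a simplified model for illustration, not the browser's actual implementation: resolveMessage and the catalogs object are assumptions of this example, with catalogs standing in for the parsed messages.json files keyed by directory name.

```javascript
// Simplified model of the lookup order, NOT the browser's implementation.
// `catalogs` maps directory names to parsed messages.json objects.
function resolveMessage(catalogs, uiLocale, defaultLocale, key) {
  const exact = uiLocale.replace(/-/g, "_"); // e.g. "en_US"
  const languageOnly = exact.split("_")[0]; // e.g. "en"
  const fallback = defaultLocale.replace(/-/g, "_");
  for (const name of [exact, languageOnly, fallback]) {
    const entry = catalogs[name] && catalogs[name][key];
    if (entry) return entry.message;
  }
  return ""; // mirrors the empty string described above
}

// Hypothetical sample data for three locale directories.
const catalogs = {
  en_US: { color: { message: "color" } },
  en: { title: { message: "Title" } },
  de: { about: { message: "Über" } },
};

console.log(resolveMessage(catalogs, "en-US", "de", "color")); // "color" (from en_US)
console.log(resolveMessage(catalogs, "en-US", "de", "title")); // "Title" (from en)
console.log(resolveMessage(catalogs, "en-US", "de", "about")); // "Über" (default locale)
console.log(resolveMessage(catalogs, "en-US", "de", "nope") === ""); // true
```

A model like this can also back a unit test that flags identifiers which would resolve to an empty string.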

Adding context information

With a growing number of supported locales in your extension it can get harder for translators to find out where the strings are used in the user interface. The description property is intended to give more context or translation hints; it cannot be read in the source code via the i18n API:

{
  "thisIsMyIdentifier": {
    "message": "This is my message!",
    "description": "Shown in the 'about' window"
  }
}

Using placeholders

Sometimes it is required to add dynamic information to a language specific string. Assume that you want to display a greeting in your extension, using the name of the user. There are multiple solutions to achieve this:

1) add a token to the string and replace it, e.g. "Hello, your-token".replace('your-token', 'World') → Hello, World
2) create two strings with identifiers and concatenate them
3) make use of placeholders

Solutions 1) and 2) are quite difficult to handle for translators, as they must know that either a token will be replaced or two or more strings are concatenated. This must be documented somewhere and can lead to errors in the translation. That's why one should avoid them and that's why placeholders were introduced to the i18n API and the messages.json file structure.

There are two types of placeholders: positional placeholders and named placeholders.

Consider a JSON structure:

{
  "helloText": {
    "message": "Hello, $1",
    "description": "Shown in the 'welcome' window, placeholder contains the user's name"
  }
}

The localized string can be used as follows:

const userName = "world";
const text = browser.i18n.getMessage("helloText", userName);

// "Hello, world"
console.log(text);

The positional placeholder $1 will be replaced with the first substitution argument of the getMessage call, i.e. the first argument after the identifier. This also works with multiple placeholders:

{
  "helloText": {
    "message": "$2, $1",
    "description": "Shown in the 'welcome' window, placeholder contains the user's name and the greeting"
  }
}

const userName = "world";
const greeting = "Hello";
const text = browser.i18n.getMessage("helloText", userName, greeting);

// "Hello, world"
console.log(text);

In the example above the two positional identifiers are used in a different order than the arguments in the function call. This makes it pretty flexible for translators to use the correct placeholder at any position in the translation. But you can imagine that it's hard for translators to recall the correct order of arguments for each translation. Was the greeting first, e.g. placeholder $1, or was it $2? This adds a great source of translation glitches – and that's why named placeholders were introduced.

Named placeholders are a way to decouple the position of the arguments of the getMessage call and make it easier to add translations. To use named placeholders the positional placeholders are mapped to tokens you can use in your translation, as stated in the following example:

{
  "helloText": {
    "message": "$GREETING$, $USERNAME$",
    "description": "Shown in the 'welcome' window, placeholder contains the user's name and the greeting",
    "placeholders": {
      "username": {
        "content": "$1",
        "example": "world"
      },
      "greeting": {
        "content": "$2",
        "example": "Hello"
      }
    }
  }
}

You can now use the name of the placeholder in UPPERCASE letters, enclosed in $ signs. When self-describing names are used as placeholder names, named placeholders make it way easier to create a translation. The example value for a placeholder is optional, but handy.

You can also create static content for a placeholder, like always replacing $GREETING$ with "Hello". This can be quite useful when you're using placeholders that don't change that often, e.g. URLs.

{
  "helloText": {
    "message": "$GREETING$, $USERNAME$",
    "description": "Shown in the 'welcome' window, placeholder contains the user's name and the greeting",
    "placeholders": {
      "username": {
        "content": "$1",
        "example": "world"
      },
      "greeting": {
        "content": "Hello"
      }
    }
  }
}
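
How named and positional placeholders interact can be illustrated with a small model. This is a simplified sketch, not the browser's implementation: the function expandMessage is made up for this example, and replacing a missing argument with an empty string is an assumption of the sketch.

```javascript
// Simplified model of placeholder expansion, NOT the browser's implementation.
// `entry` is one value from messages.json, `args` the substitution strings.
function expandMessage(entry, args = []) {
  const contents = {};
  // Placeholder names are matched case-insensitively in this model.
  for (const [name, def] of Object.entries(entry.placeholders || {})) {
    contents[name.toLowerCase()] = def.content;
  }
  // Step 1: replace named tokens such as $USERNAME$ with their content.
  const withNames = entry.message.replace(/\$(\w+)\$/g, (match, name) => {
    const content = contents[name.toLowerCase()];
    return content === undefined ? match : content;
  });
  // Step 2: replace positional tokens $1..$9 with the given arguments.
  // Assumption of this sketch: a missing argument becomes an empty string.
  return withNames.replace(/\$([1-9])/g, (match, index) => {
    const value = args[Number(index) - 1];
    return value === undefined ? "" : String(value);
  });
}

const helloText = {
  message: "$GREETING$, $USERNAME$",
  placeholders: {
    username: { content: "$1", example: "world" },
    greeting: { content: "Hello" },
  },
};

console.log(expandMessage(helloText, ["world"])); // "Hello, world"
console.log(expandMessage({ message: "$2, $1" }, ["world", "Hello"])); // "Hello, world"
```

The two steps show why translators can move $USERNAME$ around freely: the mapping to the argument position lives in the placeholders section, not in the message.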

Using placeholders in translations is a great improvement, as it enables your extension to be used by users all over the globe. But that's only one part to localize an extension. While placeholders are flexible enough to deal with dynamic arguments, you need plural rules to create a great user experience.

Plural rules for web extensions and applications

What are plural rules? Simply put, they describe how to build the plural form of a noun. Short example: you have one minute, but ten minutes. See the difference? There's an "s" at the end of the noun "minute" when there's more than one minute. That's what pluralization is about. There are different ways to handle pluralization:

you don't care at all
you make false assumptions
you're using an API for pluralization and handle it in the correct way.

Pluralization: the don't care approach

This approach is often used, as it is straightforward and simple: you just don't care about pluralization and use placeholders. From a coding perspective it's easy to handle, as you just use a static text snippet and replace a part of it:

const text = `Please wait for ${seconds} seconds`;

If you're planning multiple languages, you might have planned a snippet like this one. The plain text is held in a JSON structure…

{
  "myText": {
    "message": "Please wait for $seconds$ seconds",
    "placeholders": {
      "seconds": {
        "content": "$1",
        "example": "2"
      }
    }
  }
}

…and you then grab the text from your code:

// "text" will contain "Please wait for 5 seconds"
const text = browser.i18n.getMessage("myText", 5);

For simple tooling and MVPs this can be sufficient, but in the long run it doesn't work out. Why is this problematic? Because the code will produce some "glitches":

called with "0" it will return Please wait for 0 seconds (correct)
called with "2" it will return Please wait for 2 seconds (correct)
called with "1" it will return Please wait for 1 seconds (not correct)

The form for "1" (one) isn't correct, as it must read "Please wait for 1 second". Have you noticed? The correct form has no trailing "s": it's one second, but multiple seconds.

If you don't care at all, you can stop reading here (approach No. 1).

Pluralization: false assumptions

A naïve approach to the issue above ("one second", but "multiple seconds") is to have multiple language strings in your JSON file of locale strings, providing a separate text for each case, one for a single second and one for multiple seconds:

{
  "mySecond": {
    "message": "Please wait for $seconds$ second",
    "placeholders": {
      "seconds": {
        "content": "$1",
        "example": "2"
      }
    }
  },
  "mySeconds": {
    "message": "Please wait for $seconds$ seconds",
    "placeholders": {
      "seconds": {
        "content": "$1",
        "example": "2"
      }
    }
  }
}

The logic can then differentiate between these two and use the one that's needed:

let text = '';
if (howManySeconds === 1) {
  // get the text for a single second
  text = browser.i18n.getMessage('mySecond', howManySeconds);
} else {
  // get the text for multiple seconds
  text = browser.i18n.getMessage('mySeconds', howManySeconds);
}

You can also just change the "lookup key", depending on how many seconds you have:

const text = browser.i18n.getMessage(
  howManySeconds === 1 ? "mySecond" : "mySeconds",
  howManySeconds
);

While both approaches return the correct text ("Please wait for 1 second" and "Please wait for 10 seconds"), they are based on a false assumption: that there are only two forms, singular and plural, for a noun.

The code above only covers localization for specific language families, like Germanic, Latin/Greek or Romance. This covers English, German, Dutch, Danish, Italian, Portuguese and some others. So our approach is based on the false assumption that there are only two cases. In fact, there are 19 plural rules. An in-depth overview can be found on the Language Plural Rules page of the Unicode Consortium.

So let's check how one can reduce the complexity of how to create a correct plural form for any language.

Pluralization: using `Intl.PluralRules`

To reduce complexity and make localization as easy as possible, all modern browsers and ECMAScript/JS runtimes (like Node.js and Deno) support the Intl.PluralRules API. While the API itself is quite simple, working with it might get a little bit confusing.

An instance of a pluralizer can be created with the language code:

const pluralize = new Intl.PluralRules("en");

The basic usage seems to be pretty simple, as the instance of PluralRules only offers three methods:

resolvedOptions()
select()
selectRange()

To keep it short:

resolvedOptions() can be used to retrieve the locale and the plural formatting options that are actually used to build the plural form. The locale might differ from the one you used to initialize the pluralizer. This can happen when you pass a region in addition to the language, e.g. en is the language subtag for English, but en-GB is the language and region subtag for "British English", while en-US refers to "US English" (you know, that's colour and color). Depending on the runtime, the pluralizer may resolve this to en, as there's no difference in the plural forms of British English and US English.

select(n) is used to give an indicator what plural form to use for the number n.

selectRange(from, to) is used to retrieve the plural form indicator for a range that is between from and to.

But how do you use the pluralizer? For most cases, select() is the method of choice, as it indicates which plural form to use. Let's recall that it's 1 second but 10 seconds in English. Checking back with the pluralizer will give:

const pluralizer = new Intl.PluralRules("en");
const oneSecond = pluralizer.select(1);
const tenSeconds = pluralizer.select(10);

console.log(oneSecond); // "one"
console.log(tenSeconds); // "other"

Wait. What? one and other? What are these values about? They indicate which plural form one should use: the English language, for example, only has an exception for "one", like "1 second". So the form for one single second differs from all other forms. In Czech, for example, there are exceptions for one second (plural form: one), for 2 up to 4 seconds (plural form: few) and for all other cases: 0, 5, 6, 7, 8, 9, … (plural form: other).

The pluralizer knows six plural form indicators: zero, one, two, few, many and other. Most plural rules only know a subset of these indicators. Languages of the Germanic family such as English and German only use one and other, while Czech uses one, few, many and other.
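
To see which indicators a runtime actually reports for a locale, you can ask the pluralizer itself. The following is a minimal sketch; the helper name pluralCategoriesFor is made up for this example, and the output depends on the locale data shipped with your browser or runtime, so no expected output is shown.

```javascript
// Lists the plural form indicators the runtime reports for a locale.
// The result depends on the locale data shipped with the runtime.
function pluralCategoriesFor(locale) {
  return new Intl.PluralRules(locale).resolvedOptions().pluralCategories;
}

for (const locale of ["en", "de", "cs", "ar"]) {
  console.log(locale, pluralCategoriesFor(locale).join(", "));
}

// Shows the indicator chosen for a few sample numbers in Czech.
const czech = new Intl.PluralRules("cs");
for (const n of [0, 1, 2, 5]) {
  console.log(n, czech.select(n));
}
```

Running this once per supported locale is a quick way to find out which suffixes your localization files need.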

But I said that it's easy to use. So how to build up the plural form?

How to use the plural form indicator

To retrieve the correct localization string for any given plural form indicator it's necessary to rename the language strings:

{
  "my_second_one": {
    "message": "Please wait for $seconds$ second",
    "placeholders": {
      "seconds": {
        "content": "$1",
        "example": "2"
      }
    }
  },
  "my_second_other": {
    "message": "Please wait for $seconds$ seconds",
    "placeholders": {
      "seconds": {
        "content": "$1",
        "example": "2"
      }
    }
  }
}

The identifiers for the localization strings now contain the indicator as "suffix". Now it's possible to use the Intl.PluralRules API like so:

const pluralForm = new Intl.PluralRules("en").select(howManySeconds);
const text = browser.i18n.getMessage(`my_second_${pluralForm}`, howManySeconds);

The variable text now contains the string "Please wait for 1 second" or "Please wait for 10 seconds", depending on the value of howManySeconds, while the code remains readable and the effort to support more languages is minimal: just add additional JSON files holding the locale-specific strings.
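
The two lines above can be wrapped into a small reusable helper. This is a sketch under assumptions: getPluralMessage is a made-up name, the fallback to the "other" form is a design choice of this example, and getMessage is passed in so the code also runs outside an extension. Inside an extension you could pass browser.i18n.getMessage.bind(browser.i18n) as the lookup function and browser.i18n.getUILanguage() as the locale.

```javascript
// Hypothetical helper, not part of the i18n API.
// `getMessage` is injected so the sketch also runs outside an extension.
function getPluralMessage(getMessage, locale, baseKey, count) {
  const pluralForm = new Intl.PluralRules(locale).select(count);
  const text = getMessage(`${baseKey}_${pluralForm}`, String(count));
  // Assumption: fall back to the "other" form when a form is untranslated.
  return text !== "" ? text : getMessage(`${baseKey}_other`, String(count));
}

// Stand-in for browser.i18n.getMessage, backed by a plain object.
const messages = {
  my_second_one: "Please wait for $1 second",
  my_second_other: "Please wait for $1 seconds",
};
const fakeGetMessage = (key, value) =>
  key in messages ? messages[key].replace("$1", value) : "";

console.log(getPluralMessage(fakeGetMessage, "en", "my_second", 1));
// "Please wait for 1 second"
console.log(getPluralMessage(fakeGetMessage, "en", "my_second", 10));
// "Please wait for 10 seconds"
```

Injecting the lookup function keeps the helper easy to unit test, which fits the earlier hint about checking for empty strings before shipping.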

Best practices

While not all languages and language families use all plural forms, I recommend adding all plural form identifiers to the localization files and leaving the unused ones blank (maybe with a note in the description property that the item is unused in the specific locale). This way you can compare the identifiers for each localized message string across all files to check if you missed a translation. To get all keys from a specific locale message file, you can use cat and jq:

cat ./_locales/en/messages.json | jq -r 'keys'

By writing the keys of each locale to temporary files and comparing them, you can check which keys are still missing in newer localization files.
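
The same comparison can be sketched in JavaScript, for example as part of a unit test. The helper missingKeys and the sample objects are made up for this example; in a real project the objects would be the parsed messages.json files of two locales.

```javascript
// Hypothetical helper: returns the identifiers that exist in the
// reference catalog but are missing from the candidate catalog.
function missingKeys(reference, candidate) {
  const known = new Set(Object.keys(candidate));
  return Object.keys(reference)
    .filter((key) => !known.has(key))
    .sort();
}

// Sample data standing in for two parsed messages.json files.
const en = {
  my_second_one: { message: "Please wait for $1 second" },
  my_second_other: { message: "Please wait for $1 seconds" },
};
const de = {
  my_second_one: { message: "Bitte $1 Sekunde warten" },
};

console.log(missingKeys(en, de)); // ["my_second_other"]
console.log(missingKeys(de, en)); // []
```

Calling the helper in both directions also reveals stale identifiers that only exist in the newer file.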