Handling Special Characters in URL Slugs
Most slug generators work fine for plain English text. But the real world has accented characters, non-Latin scripts, emojis, and symbols. This guide covers how to handle all of them when generating URL slugs.
The Problem: URLs Have a Limited Character Set
URLs can only safely contain a small set of ASCII characters: letters (a-z), digits (0-9), and a few symbols like hyphens and underscores. Everything else must be either converted or percent-encoded (like %C3%A9 for é). Percent-encoded URLs are ugly, hard to read, and bad for SEO. The solution is transliteration.
Transliteration: The Core Technique
Transliteration replaces non-ASCII characters with their closest ASCII equivalents. Some common examples:
| Input | Transliterated | Language |
|---|---|---|
| é, è, ê, ë | e | French, Portuguese |
| ü, ö, ä | ue, oe, ae | German |
| ñ | n | Spanish |
| ß | ss | German |
| ç | c | French, Portuguese |
Our Slug Generator performs transliteration automatically when the “Transliterate” option is enabled.
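As a rough illustration of the table above, a lookup-based transliteration could look like the sketch below. The map contents and the `transliterate` name are assumptions made for this example, not the implementation behind any particular tool; a real map would need many more entries.

```javascript
// Hypothetical mapping table; extend it for the languages you need to support.
const TRANSLITERATION_MAP = new Map([
  ['ä', 'ae'], ['ö', 'oe'], ['ü', 'ue'],
  ['Ä', 'Ae'], ['Ö', 'Oe'], ['Ü', 'Ue'],
  ['ß', 'ss'],
  ['é', 'e'], ['è', 'e'], ['ê', 'e'], ['ë', 'e'],
  ['ñ', 'n'], ['ç', 'c'],
]);

// Replace each mapped character and leave everything else untouched.
// NFC normalization first, so decomposed input still matches the map keys.
function transliterate(text) {
  return Array.from(
    text.normalize('NFC'),
    (ch) => TRANSLITERATION_MAP.get(ch) ?? ch
  ).join('');
}

console.log(transliterate('Größe'));          // Groesse
console.log(transliterate('Señor François')); // Senor Francois
```

An explicit table is what makes language-specific rules possible: simple accent stripping would turn ü into u rather than ue, and would leave ß untouched.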
Unicode Normalization
Before transliteration, you need to normalize Unicode text. The character é can be stored two ways in Unicode:
- Composed (NFC): a single code point (U+00E9)
- Decomposed (NFD): the letter “e” followed by a combining accent mark (U+0065 + U+0301)
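The two forms can be compared directly in JavaScript. This is a small sketch using the built-in `String.prototype.normalize`; the variable names are only for illustration.

```javascript
const composed = '\u00E9';    // é as one code point (NFC form)
const decomposed = 'e\u0301'; // "e" followed by a combining acute accent (NFD form)

console.log(composed === decomposed);                  // false
console.log(composed.length, decomposed.length);       // 1 2
console.log(composed.normalize('NFD') === decomposed); // true
console.log(decomposed.normalize('NFC') === composed); // true
```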
Both look identical on screen but are different bytes. Normalizing to NFKD (compatibility decomposition) separates the base letter from its accent mark, making it easy to strip the accents:
// JavaScript
function removeAccents(text) {
  return text
    .normalize('NFKD')
    .replace(/[\u0300-\u036f]/g, '');
}
removeAccents('Café Résumé');
// Output: Cafe Resume
CJK Characters (Chinese, Japanese, Korean)
CJK characters don’t transliterate to Latin in a meaningful way. There are two common approaches:
- Use romanization (pinyin for Chinese, romaji for Japanese). Libraries like pinyin (npm) or kuroshiro can do this, but the results may not match user expectations.
- Keep the characters as-is. Modern browsers and search engines handle Unicode URLs well. Google can index URLs with CJK characters. The URL is percent-encoded when it is sent over the network or copied as plain text, but modern browsers and search results usually display the original characters.
For most use cases, the second approach is safer, because romanized CJK text can be unreadable to native speakers.
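Keeping the characters as-is might look like the following sketch. It relies on Unicode property escapes (`\p{L}`, `\p{M}`, `\p{N}`), which need the `u` flag; the `unicodeSlug` name and the sample title are assumptions for this example.

```javascript
// Hypothetical helper: keep letters, marks, and digits from any script,
// and collapse everything else into hyphens.
function unicodeSlug(text) {
  return text
    .normalize('NFC')
    .toLowerCase()
    .replace(/[^\p{L}\p{M}\p{N}]+/gu, '-') // any run of other characters -> one hyphen
    .replace(/^-+|-+$/g, '');              // trim leading and trailing hyphens
}

const slug = unicodeSlug('東京 旅行ガイド 2024');
console.log(slug);                     // 東京-旅行ガイド-2024
console.log(encodeURIComponent(slug)); // the percent-encoded form sent over the wire
```

Including `\p{M}` keeps combining marks, which scripts such as Devanagari need in order to stay readable.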
Emojis in URLs
While technically possible (emojis get percent-encoded), emojis in slugs are a bad idea:
- They become extremely long percent-encoded strings
- They break in many systems and APIs
- Search engines may not index them properly
- They can’t be typed manually
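The first point is easy to check with the built-in `encodeURIComponent`. The sample emojis below are arbitrary choices for illustration.

```javascript
const party = '\u{1F389}'; // party popper, a single code point
console.log(encodeURIComponent(party));        // %F0%9F%8E%89
console.log(encodeURIComponent(party).length); // 12

// A "family" emoji is three emojis joined by zero-width joiners (U+200D).
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}';
console.log(encodeURIComponent(family).length); // 54
```

One visible symbol can therefore add dozens of characters to a URL.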
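Always strip emojis during slug generation. A simple regex pattern removes the most common blocks (emoticons, transport symbols, miscellaneous symbols, and dingbats), though it does not cover every emoji:
text.replace(/[\u{1F600}-\u{1F6FF}\u{2600}-\u{26FF}\u{2700}-\u{27BF}]/gu, '')
Common Special Characters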
Here’s how a good slug generator handles common special characters:
| Character | Action | Reason |
|---|---|---|
| & | Replace with “and” or remove | Has special meaning in URLs |
| @, #, ?, = | Remove | Reserved URL characters |
| “ ” ‘ ’ | Remove | Punctuation, no semantic value in slug |
| — – | Replace with hyphen | Similar purpose, normalize to separator |
| / \\ | Replace with hyphen | Slashes create path segments |
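The rules in the table could be expressed as a few chained replacements. This is only a sketch: the `replaceSpecialChars` name and the order of the rules are assumptions, and it covers just the characters listed above.

```javascript
// Rules mirror the table above; name and rule order are illustrative.
function replaceSpecialChars(text) {
  return text
    .replace(/&/g, ' and ')                           // & -> "and"
    .replace(/[\u2014\u2013\/\\]/g, '-')              // em dash, en dash, slashes -> hyphen
    .replace(/[@#?=\u201C\u201D\u2018\u2019]/g, '');  // reserved characters and curly quotes -> removed
}

console.log(replaceSpecialChars('Q&A: “Input/Output” — FAQ #1?'));
// Q and A: Input-Output - FAQ 1
```

The result still contains spaces, capitals, and other punctuation, so a step like this would normally run before lowercasing and hyphenating the remaining text.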
Testing Your Slug Generator
Test with these tricky inputs to make sure your slug generator handles edge cases:
- “Café Menu — Special Édition!” → cafe-menu-special-edition
- “100% Free & Open Source” → 100-free-and-open-source or 100-free-open-source
- “ Lots of spaces ” → lots-of-spaces
- “---triple---hyphens---” → triple-hyphens
- Empty string → should return empty or a fallback
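To run these checks locally, a minimal ASCII-only `slugify` and a small table of cases might look like this. The function name, the choice to spell out "&" as "and", and the `fallback` parameter are assumptions for this sketch, not the behavior of any specific library or tool.

```javascript
// Illustrative slugify: accent stripping plus the cleanup rules discussed above.
function slugify(text, fallback = '') {
  const slug = text
    .normalize('NFKD')                                // split accented letters into base + mark
    .replace(/[\u0300-\u036f]/g, '')                  // drop the combining marks
    .toLowerCase()
    .replace(/&/g, ' and ')                           // spell out ampersands
    .replace(/[\u2018\u2019\u201C\u201D'"]/g, '')     // remove quotes and apostrophes
    .replace(/[^a-z0-9]+/g, '-')                      // any other run becomes one hyphen
    .replace(/^-+|-+$/g, '');                         // trim hyphens at both ends
  return slug || fallback;
}

// [input, expected] pairs taken from the list above.
const cases = [
  ['Café Menu — Special Édition!', 'cafe-menu-special-edition'],
  ['100% Free & Open Source', '100-free-and-open-source'],
  ['  Lots   of   spaces  ', 'lots-of-spaces'],
  ['---triple---hyphens---', 'triple-hyphens'],
  ['', ''],
];

for (const [input, expected] of cases) {
  const actual = slugify(input);
  const status = actual === expected ? 'ok  ' : 'FAIL';
  console.log(status, JSON.stringify(input), '->', JSON.stringify(actual));
}
```

Because the final replacement drops everything outside a-z and 0-9, this sketch also removes emojis and CJK characters; passing a fallback such as `slugify(title, 'untitled')` avoids empty slugs in that case.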
Try these examples directly in our Slugify Online tool and see how it handles each case.
Library Support
For a deeper comparison of slug libraries across languages, check our guide on how to slugify text in JavaScript, Python, and PHP. Each library handles special characters differently, so choose one that fits your use case.