Subtitle encoding and mojibake, explained
When subtitles turn into Ã©, Ð·Ð´, or rows of �, the timing is fine; the text is being read with the wrong character encoding. Here's why it happens, language by language, and how to fix it for good.
A text file doesn't store letters; it stores bytes. An encoding is the lookup table that turns those bytes back into characters. Get the table wrong and you get gibberish — even though every byte is intact and the timing is perfect. This is the single most common reason non-English subtitles look broken.
Bytes, code pages and UTF-8
For the first 128 values (plain English letters, digits, punctuation), almost every encoding agrees — that's ASCII. The trouble starts above 127. For decades, each region used its own code page to map those high bytes: Windows-1252 for Western Europe, Windows-1251 for Cyrillic, Windows-1256 for Arabic, ISO-8859-7 for Greek, and so on. Each maps the same byte to a different letter.
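To make the "same byte, different letter" point concrete, here is a minimal Python sketch. It decodes the single byte 0xE4 through four of the code pages named above and prints the Unicode name of each result. The codec names are Python's standard aliases; the byte was picked only for illustration.

```python
import unicodedata

# 0xE4 is above the ASCII range, so every code page is free to assign it
# a different character.
raw = bytes([0xE4])

decoded = {
    codec: raw.decode(codec)
    for codec in ("cp1252", "cp1251", "cp1256", "iso8859_7")
}

for codec, letter in decoded.items():
    # Print the character's Unicode name so the output is readable anywhere.
    print(f"{codec:>10}: {unicodedata.name(letter)}")

# Below 128 the tables agree: "A" is "A" in all four.
ascii_agrees = all(b"A".decode(codec) == "A" for codec in decoded)
```

One byte comes back as a Latin, a Cyrillic, an Arabic and a Greek letter, which is exactly why a player cannot tell from the bytes alone which table was intended.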
UTF-8 replaced this mess. It can represent every character in Unicode and is now the default on the web and in most modern software. The catch: SRT files carry no declaration of their own encoding, so a modern player typically assumes UTF-8. Feed it an old code-page file and it reads the bytes through the wrong table.
The two faces of broken text
There are two distinct failure modes, and they look different:
- Mojibake: readable-looking but wrong characters. UTF-8 Cyrillic read as Windows-1252 becomes Ð·Ð´Ñ€Ð°Ð²; a French é becomes Ã©. The bytes decoded to something, just the wrong something.
- The replacement character �: appears when a byte sequence is invalid for the assumed encoding, so the decoder gives up on it. Rows of � usually mean a multi-byte/single-byte mismatch: a legacy code-page file read as UTF-8, or a multi-byte file (like Chinese GBK) read through the wrong table.
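Both failure modes are easy to reproduce. The sketch below starts from the Russian fragment "здрав", a sample string chosen only for illustration, and decodes its bytes through the wrong table in each direction.

```python
russian = "здрав"

# Failure mode 1, mojibake: UTF-8 bytes pushed through the Windows-1252 table.
# Every byte maps to *some* character, so decoding succeeds but the text is wrong.
mojibake = russian.encode("utf-8").decode("cp1252")   # 'Ð·Ð´Ñ€Ð°Ð²'
french = "é".encode("utf-8").decode("cp1252")         # 'Ã©'

# Failure mode 2, replacement characters: Windows-1251 bytes read as UTF-8.
# The bytes are not valid UTF-8 sequences, so a lenient decoder substitutes
# U+FFFD for what it cannot interpret.
replaced = russian.encode("cp1251").decode("utf-8", errors="replace")

# A strict decoder raises instead of substituting.
try:
    russian.encode("cp1251").decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

Note that no bytes were changed in either case. Decoding the same bytes with the right codec returns the original text.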
Why it breaks, language by language
- Cyrillic (Russian, Ukrainian, Bulgarian…): old files are typically Windows-1251 or KOI8-R. Read as UTF-8 they turn mostly into �; read as Windows-1252 they become runs of accented Latin letters. Long strings of Ð and Ñ pairs mean the reverse case: a UTF-8 file read as Windows-1252.
- Arabic / Persian: Windows-1256. Misread, it produces scattered Latin accents and symbols. The right-to-left direction can also make a correctly decoded file look odd if the player lacks RTL support, which is a separate issue.
- Greek: ISO-8859-7 or Windows-1253. Easily confused with Cyrillic by automatic detectors because both remap the same byte ranges to letters.
- Central European (Polish, Czech, Hungarian…): Windows-1250. The accented letters (ł, ő, ě) are the ones that corrupt.
- Chinese, Japanese, Korean: multi-byte encodings (GBK, Big5, Shift-JIS, EUC-KR). When misread as a single-byte encoding you get a mix of garbage and �, because the byte pairs don't line up.
The BOM
A byte-order mark is an optional invisible marker at the very start of a file (the bytes EF BB BF for UTF-8) that announces its encoding. It helps some Windows players detect UTF-8, but a few older players display it as a stray ï»¿ at the start of the first subtitle, or refuse the file. The safe default is UTF-8 without a BOM; add one only if a specific player needs it. The encoding fixer lets you choose.
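If you want to check for a BOM yourself, a few lines of Python are enough. This is a minimal sketch using only the standard library; the helper name `strip_utf8_bom` is made up for this example and has nothing to do with how the encoding fixer works internally.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return the bytes without a leading UTF-8 BOM (EF BB BF), if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

sample = codecs.BOM_UTF8 + b"1\n00:00:01,000 --> 00:00:02,000\nHello\n"

without_bom = strip_utf8_bom(sample)

# Python's "utf-8-sig" codec does the same while decoding:
# it drops a leading BOM if there is one and otherwise behaves like "utf-8".
text = sample.decode("utf-8-sig")

# What a player shows if it reads the BOM bytes as Windows-1252 text.
stray = codecs.BOM_UTF8.decode("cp1252")   # 'ï»¿'
```

Going the other way, `text.encode("utf-8-sig")` writes the BOM back for the rare player that needs it.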
The sneaky one: double-encoded UTF-8
A particularly confusing failure: text that was already UTF-8 gets read as Windows-1252 and saved again. Now é shows as Ã©, an em dash (—) as â€”, and a curly apostrophe as â€™. The file is technically valid UTF-8, so naïve detectors declare it fine and leave it broken. The cure is to reverse the extra layer: re-encode the text as Windows-1252 and read it as UTF-8 once. Our tool detects this pattern automatically and fixes it.
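The reversal itself is a one-liner in Python. The sketch below wraps it in a hypothetical helper, `undo_double_encoding`, that leaves the text alone when the round trip fails, which is a crude way of saying "this was not double-encoded". It illustrates the idea only; the tool's actual detection logic is not described here and may well differ.

```python
def undo_double_encoding(text: str) -> str:
    """Reverse one layer of 'UTF-8 read as Windows-1252' damage, if present."""
    try:
        # Turn the mojibake back into the original bytes, then decode them
        # the way they should have been decoded in the first place.
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text contains characters Windows-1252 cannot represent,
        # or the recovered bytes are not valid UTF-8: assume it was fine.
        return text

fixed_e = undo_double_encoding("Ã©")       # 'é'
fixed_quote = undo_double_encoding("â€™")  # curly apostrophe
untouched = undo_double_encoding("café")   # already correct, returned as-is
```

One limitation to be aware of: Python's `cp1252` codec leaves five byte values (0x81, 0x8D, 0x8F, 0x90, 0x9D) undefined, so text damaged by a tool that mapped those bytes to control characters will not survive this exact round trip and needs extra handling.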
Why automatic detection is hard
Detecting a single-byte encoding from bytes alone is genuinely ambiguous: the same bytes are valid Cyrillic and valid Arabic and valid Greek — each just a different alphabet. Good detectors use letter frequency to guess which language the result most resembles, but short files don't give much to go on. That's why a trustworthy encoding fixer offers a live preview and a manual override: the machine guesses, you confirm with your eyes. Our encoding fixer ranks the candidates, shows the before-and-after per cue, and lets you pick the encoding if the guess is off.
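Here is a toy version of the letter-frequency idea, to show why it works on reasonable input and why it stays a guess. Everything in it is an assumption made for illustration: the `PROFILES` table holds rough "most common letters" lists for just three encodings, and the scoring is far cruder than what a real detector (or our encoding fixer) would use.

```python
# Rough, hand-picked lists of frequent letters per candidate encoding.
# These are illustrative assumptions, not measured statistics.
PROFILES = {
    "cp1251": "оеаинтсрвл",     # Russian
    "iso8859_7": "αοιετσνηυρ",  # Greek
    "cp1252": "etaoinshrdlu",   # English / Western European
}

def rank_encodings(data: bytes):
    """Decode with every candidate and rank by share of 'expected' letters."""
    ranked = []
    for codec, common in PROFILES.items():
        text = data.decode(codec, errors="replace")
        # Count letters, and count undecodable bytes (U+FFFD) as misses.
        symbols = [ch for ch in text.lower() if ch.isalpha() or ch == "\ufffd"]
        hits = sum(ch in common for ch in symbols)
        score = hits / len(symbols) if symbols else 0.0
        ranked.append((score, codec, text))
    ranked.sort(key=lambda item: item[0], reverse=True)
    return ranked

sample = "Привет, как дела?".encode("cp1251")
candidates = rank_encodings(sample)
best_score, best_codec, best_text = candidates[0]
```

For this sample the Windows-1251 reading scores highest because most of its letters are common Russian ones, while the Greek and Western readings are full of rare letters. With a two-word subtitle file the scores get much closer, which is the reason for keeping a human in the loop.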
Fixing it for good
- Open the garbled file in the encoding fixer.
- Check the preview. If the text reads correctly, you're done; if not, choose the encoding from the dropdown until it does.
- Download the result — clean UTF-8, which every modern player reads.
- If the file also has structural faults, follow up with the SRT repair tool.
Once a file is saved as UTF-8, the problem is gone permanently — there's no regional guesswork left for a player to get wrong.
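If you prefer to script the conversion once you know the source encoding, the whole job is "decode with the right table, encode as UTF-8". The function below is a minimal sketch with made-up names (`convert_to_utf8`, `add_bom`); it assumes you have already identified the source encoding by eye or with a detector.

```python
from pathlib import Path

def convert_to_utf8(src, dst, source_encoding, add_bom=False):
    """Re-save a subtitle file as UTF-8, optionally with a BOM."""
    # Work on raw bytes so line endings are carried over untouched.
    raw = Path(src).read_bytes()

    # Strict decoding: bytes that are invalid in the chosen encoding raise
    # UnicodeDecodeError instead of being silently replaced.
    text = raw.decode(source_encoding)

    # "utf-8-sig" prepends the BOM (EF BB BF); plain "utf-8" does not.
    target = "utf-8-sig" if add_bom else "utf-8"
    Path(dst).write_bytes(text.encode(target))
```

A call such as `convert_to_utf8("movie.srt", "movie.utf8.srt", "cp1251")` would convert a Windows-1251 file; the file names are placeholders. Keep in mind that single-byte code pages accept almost any byte, so a successful run does not prove the encoding was right. Check the result before deleting the original.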