Everyone in the world should be able to use their own language on phones and computers.
Source: Unicode website

Understanding Encodings (Did you know? I didn’t)

Rhys Jervis
2 min read · Jul 31, 2020


Have you ever tried to open a CSV file with pandas only to get an error?

‘utf-8’ codec can’t decode byte 0x99 in position 11: invalid start byte

Well, letters, numbers and other everyday characters we use such as $, + or # are stored as binary on a computer. The rules that allow the conversion between the binary and the characters we know and love are called encodings.

An encoding is simply a set of rules that allows the binary stored in a file to be converted into characters we know and vice versa. Read a file with the wrong set of rules and you get an error like the one above, or garbled text.
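To make that concrete, here is a minimal sketch of the same text stored under two different encodings. The sample string and variable names are my own, chosen only for illustration.

```python
text = "Price: 5€"

# The same characters become different bytes under different encodings.
utf8_bytes = text.encode("utf-8")
cp1252_bytes = text.encode("cp1252")
print(utf8_bytes)    # b'Price: 5\xe2\x82\xac'
print(cp1252_bytes)  # b'Price: 5\x80'

# Decoding with the matching encoding gives the original text back.
round_trip = utf8_bytes.decode("utf-8")
print(round_trip == text)  # True

# Decoding with a different encoding can "work" but produce garbled text.
wrong = utf8_bytes.decode("cp1252")
print(wrong)  # Price: 5â‚¬
```

Note that the last decode does not raise an error at all. It just quietly gives the wrong characters.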

Most files use encodings from the Unicode Consortium such as ‘UTF-8’. However, other encodings exist and it’s near impossible to tell which one a file uses.

At least, that’s what I thought.

Python’s standard library includes a package called encodings, where a set of the most common encodings can be found.

Importing encoding package
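The original snippet for this step is not reproduced here, so the following is a sketch of one way to do it, assuming the list comes from the `aliases` dictionary in `encodings.aliases`. The variable name `codec_names` is my own.

```python
import encodings.aliases

# aliases maps alternative spellings (e.g. "latin1") to canonical codec
# names (e.g. "latin_1"); the unique values are the canonical names.
codec_names = sorted(set(encodings.aliases.aliases.values()))

print(len(codec_names))
print(codec_names)
```

The exact set of names can differ between Python versions and platforms, so your output may not match the list below exactly.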

Encodings included in the package are:

cp1026, tis_620, cp1251, iso2022_jp_ext, utf_7, iso8859_13, cp862, cp1253, cp1257, iso8859_4, iso8859_15, hp_roman8, shift_jis_2004, utf_32, cp037, iso8859_5, big5hkscs, cp1258, tactis, euc_kr, cp1256, iso8859_10, quopri_codec, utf_16_be, utf_32_be, zlib_codec, cp857, cp500, base64_codec, cp273, shift_jisx0213, hz, mbcs, cp866, cp860, cp950, euc_jis_2004, mac_turkish, gb18030, iso8859_11, kz1048, cp1125, shift_jis, iso2022_jp_3, iso8859_8, cp864, cp1255, gbk, mac_roman, uu_codec, cp858, bz2_codec, ptcp154, iso2022_jp_2, iso2022_kr, utf_8, cp850, iso8859_14, cp1140, johab, utf_16_le, iso8859_2, big5, mac_iceland, iso2022_jp_2004, iso8859_16, cp1254, utf_16, cp775, iso8859_6, cp869, cp1252, cp861, cp1250, latin_1, ascii, cp855, iso8859_9, cp863, cp949, koi8_r, cp424, mac_cyrillic, cp865, iso2022_jp_1, mac_greek, iso8859_7, euc_jp, utf_32_le, cp932, gb2312, rot_13, cp852, iso8859_3, euc_jisx0213, hex_codec, iso2022_jp, mac_latin2, cp437

With Python’s try/except, you can iterate through all the values to find the right encoding.
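Here is a rough sketch of that loop. To keep it self-contained it decodes raw bytes rather than calling pandas, and the function name `find_candidate_encodings` and the sample bytes are hypothetical, made up for this example.

```python
import encodings.aliases


def find_candidate_encodings(raw_bytes):
    """Return the codec names that decode raw_bytes without raising."""
    candidates = []
    for name in sorted(set(encodings.aliases.aliases.values())):
        try:
            raw_bytes.decode(name)
        except (UnicodeError, LookupError):
            # UnicodeError: the bytes are not valid in this encoding.
            # LookupError: the codec is not a text encoding
            # (e.g. base64_codec) or is unavailable on this platform
            # (e.g. mbcs outside Windows).
            continue
        candidates.append(name)
    return candidates


# Made-up sample: a tiny CSV containing byte 0x99, which is not valid UTF-8.
sample = b"name,mark\nAcme\x99,1\n"
candidates = find_candidate_encodings(sample)
print(candidates)
```

Usually more than one encoding decodes without an error, so treat the result as a shortlist and check the decoded text by eye. For a real file you could read it with `open(path, "rb").read()` and then pass the encoding you settle on to `pandas.read_csv` through its `encoding` argument.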

This took me a while to find so I thought I’d share. Happy coding!!
