Understanding Encodings (Did you know? I didn’t)
Have you ever tried to use pandas to open a CSV file, only to get an error?
Well, letters, numbers and the other everyday characters we use, such as $, + or #, are stored as binary on a computer. The rules that convert between that binary and the characters we know and love are called encodings.
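To make that concrete, here is a small sketch using a sample string of my own (it is not from any real file). It shows the same text turning into different bytes under two encodings, and what happens when the bytes are read back with the wrong one.

```python
# The same text becomes different bytes depending on the encoding.
text = "café"
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin_1")
print(utf8_bytes)    # b'caf\xc3\xa9'
print(latin1_bytes)  # b'caf\xe9'

# Reading bytes back with the wrong encoding can silently garble the text...
print(utf8_bytes.decode("latin_1"))  # cafÃ©

# ...or fail outright, which is the kind of error a CSV reader can surface.
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError as err:
    print("decode failed:", err)
```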
Most files use encodings from the Unicode Consortium such as ‘UTF-8’. However, other encodings exist, and it’s nearly impossible to tell which one a given file uses.
At least, that’s what I thought.
Python’s standard library includes a package called encodings, where a set of the most common encodings can be found.
Encodings included in the package are:
cp1026, tis_620, cp1251, iso2022_jp_ext, utf_7, iso8859_13, cp862, cp1253, cp1257, iso8859_4, iso8859_15, hp_roman8, shift_jis_2004, utf_32, cp037, iso8859_5, big5hkscs, cp1258, tactis, euc_kr, cp1256, iso8859_10, quopri_codec, utf_16_be, utf_32_be, zlib_codec, cp857, cp500, base64_codec, cp273, shift_jisx0213, hz, mbcs, cp866, cp860, cp950, euc_jis_2004, mac_turkish, gb18030, iso8859_11, kz1048, cp1125, shift_jis, iso2022_jp_3, iso8859_8, cp864, cp1255, gbk, mac_roman, uu_codec, cp858, bz2_codec, ptcp154, iso2022_jp_2, iso2022_kr, utf_8, cp850, iso8859_14, cp1140, johab, utf_16_le, iso8859_2, big5, mac_iceland, iso2022_jp_2004, iso8859_16, cp1254, utf_16, cp775, iso8859_6, cp869, cp1252, cp861, cp1250, latin_1, ascii, cp855, iso8859_9, cp863, cp949, koi8_r, cp424, mac_cyrillic, cp865, iso2022_jp_1, mac_greek, iso8859_7, euc_jp, utf_32_le, cp932, gb2312, rot_13, cp852, iso8859_3, euc_jisx0213, hex_codec, iso2022_jp, mac_latin2, cp437
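One way to get a list like this yourself is to collect the unique values of the alias table in encodings.aliases. I am assuming that is roughly where the list above came from. The exact contents can differ between Python versions, and the variable name codec_names is just my own choice.

```python
import encodings.aliases

# encodings.aliases.aliases maps alternative spellings (e.g. "latin1")
# to codec module names (e.g. "latin_1"). The unique values are the codec names.
codec_names = sorted(set(encodings.aliases.aliases.values()))

print(len(codec_names))
print(codec_names[:10])
```

Note that a few entries (base64_codec, zlib_codec, rot_13 and friends) are not text encodings at all, so they will never be the answer for a CSV file.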
With Python’s try/except, you can iterate through all of these values to find the right encoding.
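Here is a minimal sketch of that loop. The helper name candidate_encodings and the sample bytes are made up for illustration, and it works on raw bytes rather than calling pandas so it runs with the standard library alone. It tries UTF-8 first, then every other codec name, and keeps the ones that decode without raising.

```python
import encodings.aliases

def candidate_encodings(raw, preferred=("utf_8",)):
    """Return the codec names that decode `raw` without raising (hypothetical helper)."""
    others = sorted(set(encodings.aliases.aliases.values()) - set(preferred))
    matches = []
    for name in list(preferred) + others:
        try:
            raw.decode(name)
        except (LookupError, UnicodeError):
            # LookupError: codec missing on this platform or not a text encoding.
            # UnicodeError: these bytes are not valid in this encoding.
            continue
        matches.append(name)
    return matches

# Made-up sample: text saved as cp1252, a common source of read errors.
sample = "naïve café, 5 €".encode("cp1252")
matches = candidate_encodings(sample)
print("utf_8" in matches)   # False
print("cp1252" in matches)  # True
```

For a real file you could read the bytes with open(path, "rb").read() and then pass a matching name to pandas.read_csv through its encoding parameter. One caveat: some encodings, such as latin_1, will decode any bytes without complaint, so several names usually match. Decoding without an error does not prove the text is right, so it is worth looking at the result before trusting it.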
This took me a while to find so I thought I’d share. Happy coding!!