Character encoding defines how raw data in a file is interpreted as characters. Trying to determine the character encoding of a file programmatically can yield three possible results:
- The result is exactly the same no matter which encoding you pick. This can happen when a file contains only basic ASCII characters (letters, numbers and a few basic "special" characters), since most Western encodings map these characters the same way.
- No valid encoding is found for the file. This can happen when the file is not really a text file to begin with, for example an image file or a .zip file.
- There are several ways to interpret the file. This is the interesting case because, in general, it is not possible to programmatically determine the correct encoding without knowing what the characters are supposed to be. This application solves the problem by presenting samples from the file where different encodings yield different results, so you can make an informed decision about which encoding is correct.
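The three outcomes above can be illustrated with a minimal Python sketch. The function name `classify`, the result labels, and the choice of candidate encodings are hypothetical and are not taken from this application; the sketch only uses Python's built-in `bytes.decode`.

```python
def classify(data: bytes, encodings):
    """Classify raw bytes against a list of candidate encodings.

    Hypothetical helper for illustration only; the labels are made up.
    """
    decoded = {}
    for enc in encodings:
        try:
            decoded[enc] = data.decode(enc)
        except UnicodeDecodeError:
            # This encoding cannot represent the data at all.
            pass

    if not decoded:
        return "none"        # no candidate encoding is valid
    if len(set(decoded.values())) == 1:
        return "identical"   # every valid encoding gives the same text
    return "ambiguous"       # valid encodings disagree on the text


candidates = ["utf-8", "cp1252"]
print(classify(b"hello", candidates))        # pure ASCII
print(classify(b"\x81\x8d", candidates))     # invalid in both candidates
print(classify(b"caf\xc3\xa9", candidates))  # valid in both, different text
```

In the last call the bytes decode as "café" under UTF-8 and as "cafÃ©" under cp1252. Both are valid, so only a human reader can say which one was intended.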
This application reads each line of the target file and tries each encoding. If an encoding is found to be invalid for any line, it is abandoned. A concise set of lines is collected to demonstrate the differences between the remaining encodings, and if there is more than one way to interpret the file using the available encodings, the samples are presented for viewing.
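The line-by-line process could be sketched roughly as follows. This is not the application's actual code: the names `surviving_encodings`, `collect_samples` and `max_samples` are hypothetical, the sketch makes two passes over the lines for simplicity, and it assumes the lines have already been split on newline bytes, which only works for ASCII-compatible encodings. The cap on the number of samples is a simple stand-in for however the application keeps its sample set concise.

```python
def _decodes(raw: bytes, enc: str) -> bool:
    """Return True if the raw line is valid in the given encoding."""
    try:
        raw.decode(enc)
        return True
    except UnicodeDecodeError:
        return False


def surviving_encodings(lines, encodings):
    """Abandon every encoding that is invalid for at least one line."""
    alive = list(encodings)
    for raw in lines:
        alive = [enc for enc in alive if _decodes(raw, enc)]
        if not alive:
            break  # nothing left to try
    return alive


def collect_samples(lines, encodings, max_samples=3):
    """Collect lines on which the surviving encodings disagree."""
    alive = surviving_encodings(lines, encodings)
    samples = []
    for raw in lines:
        variants = {enc: raw.decode(enc) for enc in alive}
        if len(set(variants.values())) > 1:
            samples.append(variants)
            if len(samples) == max_samples:
                break
    return alive, samples


lines = [b"plain ascii", b"caf\xc3\xa9", b"\xa4 5"]
alive, samples = collect_samples(lines, ["utf-8", "cp1252", "iso-8859-15"])
print(alive)
for sample in samples:
    for enc, text in sample.items():
        print(f"{enc}: {text}")
```

Here the third line is not valid UTF-8, so UTF-8 is abandoned. The two remaining encodings agree on the first two lines and differ only on the third, where byte 0xA4 is "¤" in cp1252 and "€" in ISO-8859-15, so that line is the only sample shown.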