13 April 2018

Find And Change Encoding on Files

This article describes how to find the current encoding of a file and shows, by example, how to change it to the encoding you want.


Find Current Encoding Of File

Use the following command to check what encoding is used in a file:

uchardet [filename]

Example:
$ uchardet myfile.txt
ISO-8859-7

Alternatively, use the file command:

file -[options] [filename]

Available Options:
-b, --brief : Don't print filename (brief mode)
-i, --mime : Print filetype and encoding

Example:
$ file -bi myfile.txt
text/plain; charset=iso-8859-7
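
In a script you usually want only the charset, not the whole MIME string. The following is a minimal sketch, assuming the file command from the examples above is installed; the function name charset_of and the sample file are made up for illustration.

```shell
# charset_of FILE: print only the charset part of `file -bi` output.
# (charset_of is a hypothetical helper, not part of file itself.)
charset_of() {
    file -bi "$1" | sed -n 's/.*charset=//p'
}

# Demo on a throwaway file containing "café" encoded as UTF-8.
sample=$(mktemp)
printf 'caf\303\251\n' > "$sample"
charset_of "$sample"
rm -f "$sample"
```

Keep in mind that both uchardet and file guess the encoding from the bytes they see, so treat the result as a hint, especially for short files.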

Change a File’s Encoding

Use the following command to change the encoding of a file:

iconv -f [encoding] -t [encoding] -o [newfilename] [filename]

Available Options:
-f, --from-code : Convert a file’s encoding from charset
-t, --to-code : Convert a file’s encoding to charset
-o, --output : Specify output file (instead of stdout)

Examples:
a) Change a file’s encoding from CP1251 (Windows-1251, Cyrillic) charset to UTF-8 (the result is printed to stdout):
$ iconv -f cp1251 -t utf-8 in.txt

b) Change a file’s encoding from ISO-8859-1 charset to UTF-8 and save it to out.txt:
$ iconv -f iso-8859-1 -t utf-8 -o out.txt in.txt

c) Change a file’s encoding from ASCII to UTF-8:
$ iconv -f ascii -t utf-8 -o out.txt in.txt

d) Change a file’s encoding from UTF-8 charset to ASCII:

UTF-8 can contain characters that cannot be encoded in ASCII. When iconv meets one, it stops with the error "illegal input sequence at position N" unless you pass the -c option, which omits characters that cannot be converted from the output.

$ iconv -c -f utf-8 -t ascii -o out.txt in.txt
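
The difference between the strict and the -c behaviour can be seen on a throwaway file. This is a minimal sketch; the sample text and the variable names are made up for illustration.

```shell
# A UTF-8 sample: "café" followed by a newline (é is the bytes C3 A9).
sample=$(mktemp)
printf 'caf\303\251\n' > "$sample"

# Without -c, iconv stops at the first character ASCII cannot represent
# and exits with a non-zero status.
if iconv -f utf-8 -t ascii "$sample" > /dev/null 2>&1; then
    strict_result="converted"
else
    strict_result="failed"
fi

# With -c the é is silently dropped. The exit status may still be non-zero
# on some iconv implementations, so only the output is inspected here.
lossy=$(iconv -c -f utf-8 -t ascii "$sample" 2>/dev/null || true)

echo "strict: $strict_result"
echo "lossy:  $lossy"
rm -f "$sample"
```

With the iconv from GNU libc this reports the strict conversion as failed and the lossy result as caf, with the é gone.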

You can lose characters:

Note that if you use iconv with the -c option, every character that cannot be converted is silently lost.

A related problem is common when moving files between Windows and Linux systems (it mostly concerns Windows machines that use Cyrillic). You copy a file from Windows to Linux, but when you open it on Linux you see "Êàêèå-òî êðàêîçÿáðû", because the CP1251 bytes are being displayed as Latin-1. Such strings can be easily converted from CP1251 (Windows-1251, Cyrillic) charset to UTF-8 with:

$ echo "Êàêèå-òî êðàêîçÿáðû" | iconv -t latin1 | iconv -f cp1251 -t utf-8
Какие-то кракозябры
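
The same two-step trick works on a whole file. Below is a minimal sketch, assuming the broken text is stored as UTF-8 and was misread as Latin-1 (if it was misread as Windows-1252 instead, use cp1252 in the first step); fix_mojibake is a made-up name.

```shell
# fix_mojibake FILE: undo "CP1251 bytes shown as Latin-1" and print UTF-8.
# Step 1 turns the wrongly decoded characters back into the original bytes,
# step 2 decodes those bytes as CP1251.
# (fix_mojibake is a hypothetical helper, not an iconv feature.)
fix_mojibake() {
    iconv -f utf-8 -t latin1 "$1" | iconv -f cp1251 -t utf-8
}
```

Usage: fix_mojibake broken.txt > fixed.txt. Unlike the one-liner above, the first step names its input encoding with -f instead of relying on the current locale.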

List All Charsets

Use the following command to list all the known charsets in your Linux system:

$ iconv -l
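
The list is long, so it helps to filter it. A small sketch, assuming an iconv that supports -l as above; has_charset is a hypothetical helper, and the cleanup steps only account for the separators (spaces, commas, trailing slashes) that common iconv builds print.

```shell
# has_charset NAME: succeed if NAME appears in the output of `iconv -l`
# (case-insensitive, whole-name match).
has_charset() {
    iconv -l | tr ' ,' '\n\n' | sed 's|/*$||' | grep -qixF -- "$1"
}

if has_charset UTF-8; then
    echo "UTF-8 is available"
fi
```

For a quick look by eye, piping the list through grep -i 1251 shows every name that mentions 1251.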

Mass Convert Files (Recursively) to another Encoding

$ find . -type f -print -exec iconv -f iso8859-2 -t utf-8 -o {}.converted {} \; -exec mv {}.converted {} \;

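This command writes each converted file to a temporary file with a '.converted' suffix and then moves it over the original, so be careful if you already have files with a '.converted' suffix (I don't think you have). The mv step runs only when iconv succeeds; after a failed conversion the original is left untouched, but the '.converted' file may be left behind.

find passes each file name to iconv and mv as a single argument, without going through a shell, so file names containing spaces are handled as they are; writing "{}" instead of {} and "{}.converted" instead of {}.converted is harmless but not required. Note that replacing {} inside a longer argument such as {}.converted works with GNU find but is not guaranteed by POSIX.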
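
A more defensive variant can be sketched as a small shell function. This is only a sketch under stated assumptions: convert_tree is a made-up name, mktemp is assumed to be available, and the "already valid UTF-8" check is a heuristic, since a file in another encoding can occasionally happen to be valid UTF-8 too.

```shell
# convert_tree DIR FROM: re-encode every regular file under DIR from the
# charset FROM to UTF-8, in place. (Hypothetical helper for illustration.)
convert_tree() {
    dir=$1
    from=$2
    find "$dir" -type f -exec sh -c '
        from=$1; shift
        for f in "$@"; do
            # Skip files that already decode cleanly as UTF-8
            # (this includes plain ASCII), so a second run is harmless.
            if iconv -f utf-8 -t utf-8 "$f" >/dev/null 2>&1; then
                continue
            fi
            # Write to a uniquely named temp file next to the original.
            tmp=$(mktemp "$f.XXXXXX") || exit 1
            if iconv -f "$from" -t utf-8 "$f" >"$tmp"; then
                mv "$tmp" "$f"
            else
                rm -f "$tmp"
                echo "skipped (conversion failed): $f" >&2
            fi
        done
    ' sh "$from" {} +
}
```

Usage: convert_tree . iso8859-2. Because mv replaces the original with the temporary file, the converted file ends up with the temporary file's permissions, so work on a backup copy of the tree first.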



