HowTo: Check and Change File Encoding In Linux
Linux administrators who work with web hosting know how important it is to keep the correct character encoding of HTML documents.
From the following article you’ll learn how to check a file’s encoding from the command-line in Linux.
You will also find the best solution to convert text files between different charsets.
I’ll also show the most common examples of how to convert a file’s encoding between CP1251 (Windows-1251, Cyrillic), UTF-8, ISO-8859-1 and ASCII charsets.
Check a File’s Encoding
Use the following command to check what encoding is used in a file:
Option | Description |
---|---|
-b, --brief | Don’t print the filename (brief mode) |
-i, --mime | Print the file type and encoding |
Check the encoding of the file in.txt :
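The tool in question is file with its MIME options; a minimal sketch (the sample content is ours, in.txt is the filename used above):

```shell
# Create a sample file and ask `file` for its MIME type and charset
echo 'Hello, world' > in.txt
file -i in.txt                    # e.g.: in.txt: text/plain; charset=us-ascii
file -b --mime-encoding in.txt    # brief form: prints only the charset
```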
Change a File’s Encoding
Use the following command to change the encoding of a file:
Option | Description |
---|---|
-f, --from-code | Convert a file’s encoding from this charset |
-t, --to-code | Convert a file’s encoding to this charset |
-o, --output | Specify the output file (instead of stdout) |
Change a file’s encoding from CP1251 (Windows-1251, Cyrillic) charset to UTF-8 :
Change a file’s encoding from ISO-8859-1 charset to UTF-8 and save it to out.txt:
Change a file’s encoding from ASCII to UTF-8 :
Change a file’s encoding from UTF-8 charset to ASCII :
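Sketches of the four conversions above with iconv (the file names and sample content are ours):

```shell
# Sample file: "Привет" stored as CP1251 (Windows-1251) bytes
printf '\317\360\350\342\345\362\n' > in.txt

# CP1251 → UTF-8
iconv -f CP1251 -t UTF-8 in.txt -o out.txt

# ISO-8859-1 → UTF-8, saving the result with -o
iconv -f ISO-8859-1 -t UTF-8 in.txt -o latin1-as-utf8.txt

# ASCII → UTF-8 (ASCII is a subset of UTF-8, so the bytes pass through unchanged)
echo 'plain ascii text' > ascii.txt
iconv -f ASCII -t UTF-8 ascii.txt -o ascii-utf8.txt

# UTF-8 → ASCII (errors out on non-ASCII input; see the -c option below)
iconv -f UTF-8 -t ASCII ascii-utf8.txt -o back.txt
```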
Illegal input sequence at position: As UTF-8 can contain characters that can’t be encoded with ASCII, iconv will generate the error message “illegal input sequence at position” unless you tell it to strip all non-ASCII characters using the -c option.
Option | Description |
---|---|
-c | Omit invalid characters from the output |
You can lose characters: Note that if you use iconv with the -c option, nonconvertible characters will be lost.
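For example (the sample text is ours):

```shell
# "café" in UTF-8; the é cannot be encoded in ASCII
printf 'caf\303\251\n' > utf8.txt

# Without -c this fails with an "illegal input sequence" error.
# With -c the é is silently dropped:
iconv -c -f UTF-8 -t ASCII utf8.txt    # prints: caf
```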
This concerns in particular Windows machines with Cyrillic.
You have copied some file from Windows to Linux, but when you open it in Linux, you see “Êàêèå-òî êðàêîçÿáðû” – WTF!?
Don’t panic – such strings can be easily converted from CP1251 (Windows-1251, Cyrillic) charset to UTF-8 with:
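A sketch of that fix (the file names and sample bytes are ours):

```shell
# A file of CP1251 bytes that renders as mojibake under UTF-8
printf '\312\340\352\350\345\n' > windows-file.txt   # "Какие" in CP1251

# Reinterpret it as CP1251 and re-encode to UTF-8
iconv -f CP1251 -t UTF-8 windows-file.txt -o fixed.txt
cat fixed.txt
```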
List All Charsets
List all the known charsets in your Linux system:
Option | Description |
---|---|
-l, --list | List known charsets |
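For instance:

```shell
iconv -l | head -n 5     # the full list is long; UTF-8, CP1251 and ISO-8859-1 are all in it
```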
8 Replies to “HowTo: Check and Change File Encoding In Linux”
Thank you very much. Your recipe helped a lot!
I am running Linux Mint 18.1 with Cinnamon 3.2. I had some Czech characters in file names (e.g. Pešek.m4a). The š appeared as a ? and the filename included a warning about invalid encoding. I used convmv to convert the filenames (from iso-8859-1) to utf-8, but the š now appears as a different character (a square with 009A in it). I tried the file command you recommended, and got the answer that the charset was binary. How do I solve this? I would like the filenames to include the correct utf-8 characters.
Thanks for your help–
Actually, there are two utilities for detecting an encoding. The first is file. It identifies file types and the Unicode encodings well, but it is unreliable with the ASCII-family encodings: it reports them all as if they were iso-8859-1, which is not the case. For those you need another utility, enca. Unlike file, it handles the ASCII-family encodings very well. I don’t know of a tool that handles both ASCII and Unicode equally well, but you could combine the two in a script of your own. Incidentally, enca can also convert encodings, though I don’t advise it, because iconv is best for that. It works with all types of encodings and far more besides, in many variations, including BCD encodings like EBCDIC (encodings from the 70s and 80s, from before DOS). Those systems are long gone, yet files in those encodings abound. I know nothing better than iconv for re-encoding. I suspect file fails to identify the ASCII encodings because no MIME types are registered for them. That’s a pity, because the ASCII encodings are the best ones.
There are many reasons for that, and I know of no sensible reason to use the Unicode encodings beyond “the USA decided so”, yet they are imposed on everyone, especially UTF-8. It is the worst text encoding there has ever been! The main reason to use ASCII rather than UTF-8 is that using anything else never makes sense, even on the web. Want symbols? Use symbol fonts, there are plenty of them. I see no problem. Why should I cater for Koreans, Arabs or Chinese? I don’t want to. Russian, or at a pinch English, has always been enough for me. What do I need their wretched languages and encodings for? Now, about the ASCII encodings. KOI8-R is a contrived encoding: the Russian letters are out of order there. Only two are decent, CP1251 and DOS866, depending on the purpose: for graphical environments, definitely CP1251; for proper pseudographics nothing better than DOS866 has been devised. They are not ideal, but almost. Another bad thing about UTF-8 for Russian text is that every letter takes two bytes. Then there is the endianness business, as in all the Unicode encodings: the order in which the bytes go, low byte first and then high (as in memory, by address, or like letters in written words), or the other way round, like digits in a number, high first and then low. And if a character is three, four or more bytes long (up to 16 in UTF-8), the number of complications grows geometrically! It is slow, too, since the length of each character must be computed every time by a rather involved algorithm! And we need none of that! Notice, moreover, that their English letters go in order, nothing skipped, and all fit in one byte. These are contrivances invented for the benefit of the chosen Americans; none of it concerns them. They dodged every problem in one stroke by writing their alphabet at the start of the table! But who gave them that right? And everyone else was shoved to the back, especially the Chinese! If you use CP1251, though, it works very fast, with no slowdowns or complications, just like the English letters.
Beyond that it’s a mess. True, these days we have to use this UTF-8. There are no systems left whose system encoding is an ASCII one; they’ve stopped making them, and all the system files are in UTF-8. If you want ASCII, you have to re-encode all the time. It didn’t use to be necessary. I hope our people will yet build a system of their own, without those American crutches…
Dear Anatoly, thank you very much for mentioning enca; it helped me a great deal today. Your post is racist and strange, but apparently this is a sore subject for you.
How to get terminal’s Character Encoding
I have changed my gnome-terminal’s character encoding to GBK (by default it is UTF-8), but how can I find out the current character encoding on my Linux system?
7 Answers 7
The terminal uses environment variables to determine which character set to use, therefore you can determine it by looking at those variables:
The locale command with no arguments will print the values of all of the relevant environment variables except LANGUAGE.
For current encoding:
For available locales:
For available encodings:
Check encoding and language:
Get all languages:
Change to pt_PT.utf8:
If you have Python:
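For instance, asking Python which encoding it will use for terminal output:

```python
import sys
import locale

# Encoding Python uses for stdout (the terminal, when not redirected)
print(sys.stdout.encoding)

# Encoding implied by the current locale settings
print(locale.getpreferredencoding())
```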
To my knowledge, no.
Circumstantial indications from $LC_CTYPE , locale and such might seem alluring, but these are completely separated from the encoding the terminal application (actually an emulator) happens to be using when displaying characters on the screen.
The only way to detect the encoding for sure is to output something only present in that encoding, e.g. ä, take a screenshot, analyze that image and check whether the character is correct.
So no, it’s not possible, sadly.
To see the current locale information, use the locale command; the behaviour is the same on, for example, RHEL 7.8.
Examination of https://invisible-island.net/xterm/ctlseqs/ctlseqs.html, the xterm control sequence documentation, shows that xterm follows the ISO 2022 standard for character-set switching. In particular, ESC % G selects UTF-8, so to force the terminal to use UTF-8 this command would need to be sent. I find no way of querying which character set is currently in use, but there are ways of discovering whether the terminal supports national replacement character sets.
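For example, emitting the ESC % G sequence from a shell (whether the terminal honours it depends on the emulator; many modern ones ignore ISO 2022 switching entirely):

```shell
# ESC % G = "select UTF-8" in ISO 2022: the bytes 0x1b 0x25 0x47
printf '\033%%G'
```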
However, from charsets(7), it doesn’t look like GBK (or GB2312) is an encoding supported by ISO 2022 and xterm doesn’t support it natively. So your best bet might be to use iconv to convert to UTF-8.
Further reading shows that a (significant) subset of GBK is EUC, which is an ISO 2022 code, so ISO 2022-capable terminals may be able to display GBK natively after all, but I can’t find any mention of activating this programmatically, so the terminal’s user interface would be the only recourse.
How can I find encoding of a file via a script on Linux?
I need to find the encoding of all files that are placed in a directory. Is there a way to find the encoding used?
The file command is not able to do this.
The encoding that is of interest to me is ISO 8859-1. If the encoding is anything else, I want to move the file to another directory.
17 Answers 17
It sounds like you’re looking for enca. It can guess and even convert between encodings. Just look at the man page.
Or, failing that, use file -i (Linux) or file -I (OS X). That will output MIME-type information for the file, which will also include the character-set encoding. I found a man-page for it, too 🙂
If you like to do this for a bunch of files
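For example (the sample file is ours):

```shell
echo 'hello' > sample.txt    # a file to inspect
file -i ./*                  # Linux; on OS X the flag is capitalized: file -I ./*
```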
uchardet — An encoding detector library ported from Mozilla.
Various Linux distributions (Debian, Ubuntu, openSUSE, Arch, etc.) provide binaries.
In Debian you can also use: encguess :
Here is an example script using file -I and iconv which works on Mac OS X.
For your question, you need to use mv instead of iconv :
To convert encoding from ISO 8859-1 to ASCII:
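A sketch of both steps, assuming GNU file and iconv; the directory name other/ and the file names are illustrative, not from the original answer:

```shell
mkdir -p enc-demo && cd enc-demo
printf 'caf\351\n' > latin1.txt   # "café" as ISO-8859-1 bytes
printf 'hello\n'  > plain.txt     # pure 7-bit ASCII

# Move every *.txt whose detected encoding is not ISO-8859-1 elsewhere
mkdir -p other
for f in *.txt; do
    enc=$(file -b --mime-encoding "$f")
    [ "$enc" = "iso-8859-1" ] || mv "$f" other/
done

# The ISO 8859-1 → ASCII conversion itself (accents approximated)
iconv -f ISO-8859-1 -t ASCII//TRANSLIT latin1.txt -o ascii.txt
```

Note that pure-ASCII files are detected as us-ascii, not iso-8859-1, so in this sketch they get moved too; tighten the test to taste.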
It is really hard to determine whether a file is ISO 8859-1. Text containing only 7-bit characters could be ISO 8859-1, but you cannot tell. With 8-bit characters, the upper-region characters exist in other encodings as well, so you would have to use a dictionary to get a better guess at which word is intended and determine from there which letter it must be. Finally, if you detect that the file might be UTF-8, then you can be fairly sure it is not ISO 8859-1.
Detecting an encoding is one of the hardest things to do, because nothing in the file itself tells you which one was used.
With Python, you can use the chardet module.
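chardet.detect(raw_bytes) returns a dict with the guessed encoding and a confidence score. The underlying idea (try candidate encodings and keep the first that decodes cleanly) can be sketched with the standard library alone; the helper name is ours:

```python
def guess_encoding(data):
    """Return the first candidate encoding that decodes the bytes cleanly."""
    for enc in ("ascii", "utf-8", "iso-8859-1"):
        try:
            data.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None

print(guess_encoding("naïve".encode("utf-8")))       # utf-8
print(guess_encoding("naïve".encode("iso-8859-1")))  # iso-8859-1
```

Unlike chardet, this trial ordering cannot distinguish statistically between 8-bit charsets; it only reports the first candidate that happens to decode.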
This is not something you can do in a foolproof way. One possibility would be to examine every character in the file to ensure that it doesn’t contain any characters in the ranges 0x00-0x1f or 0x7f-0x9f but, as I said, this may be true for any number of files, including at least one other variant of ISO 8859.
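That check can be written directly; the 0x7f-0x9f range covers the C1 controls, which are unused in ISO 8859-1 text (the function name is ours):

```python
def plausible_iso_8859_1(data):
    """True if no bytes fall in the control ranges unused by ISO 8859-1 text."""
    # allow tab, line feed and carriage return from the 0x00-0x1f range
    suspicious = (set(range(0x00, 0x20)) - {0x09, 0x0A, 0x0D}) | set(range(0x7F, 0xA0))
    return not any(b in suspicious for b in data)

print(plausible_iso_8859_1("café".encode("iso-8859-1")))  # True
print(plausible_iso_8859_1(b"\x00binary\x9f"))            # False
```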
Another possibility is to look for specific words in the file in all of the languages supported and see if you can find them.
So, for example, find the equivalent of the English «and», «but», «to», «of» and so on in all the supported languages of ISO 8859-1 and see if they have a large number of occurrences within the file.
I’m not talking about literal translation such as:
although that’s possible. I’m talking about common words in the target language (for all I know, Icelandic has no word for «and» — you’d probably have to use their word for «fish» [sorry that’s a little stereotypical. I didn’t mean any offense, just illustrating a point]).
How to Convert Files to UTF-8 Encoding in Linux
In this guide, we will describe what character encoding is and cover a few examples of converting files from one character encoding to another using a command-line tool. Finally, we will look at how to convert several files from any character set (charset) to UTF-8 encoding in Linux.
As you probably know already, a computer does not understand or store letters, numbers or anything else that we as humans can perceive, only bits. A bit has just two possible values: 0 or 1, true or false, yes or no. Everything else, such as letters, numbers and images, must be represented in bits for a computer to process.
In simple terms, character encoding is a way of telling a computer how to interpret raw zeroes and ones as actual characters, where each character is represented by a number. When we type text in a file, the words and sentences we form are built up from different characters, and characters are organized into a charset.
There are various encoding schemes out there such as ASCII, ANSI, Unicode among others. Below is an example of ASCII encoding.
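For instance, in ASCII the letter A is the number 65, stored as the bit pattern 01000001:

```python
print(ord('A'))                  # 65
print(format(ord('A'), '08b'))   # 01000001
print(chr(0b01000001))           # A
```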
In Linux, the iconv command line tool is used to convert text from one form of encoding to another.
You can check the encoding of a file using the file command, with the -i or --mime flag which enables printing of the MIME type string, as in the examples below:
Check File Encoding in Linux
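For example (input.file is the name used in this article; the content is ours):

```shell
echo 'Hello' > input.file
file -i input.file    # input.file: text/plain; charset=us-ascii
```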
The syntax for using iconv is as follows:
Where -f or --from-code specifies the input encoding and -t or --to-code specifies the output encoding.
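In sketch form (the file names are illustrative):

```shell
# iconv -f FROM-ENCODING -t TO-ENCODING inputfile [-o outputfile]
printf 'caf\351\n' > input.file                         # ISO-8859-1 bytes
iconv -f ISO-8859-1 -t UTF-8 input.file -o output.file
```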
To list all known coded character sets, run the command below:
List Coded Charsets in Linux
Convert Files from UTF-8 to ASCII Encoding
Next, we will learn how to convert from one encoding scheme to another. The command below converts from ISO-8859-1 to UTF-8 encoding.
Consider a file named input.file which contains some non-ASCII characters.
Let us start by checking the encoding of the characters in the file and then view the file contents. Then we can convert all the characters to ASCII encoding.
After running the iconv command, we then check the contents of the output file and the new encoding of the characters as below.
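A reconstruction of that sequence (the sample content is ours; the exact output of file varies by version):

```shell
printf 'na\303\257ve caf\303\251\n' > input.file   # UTF-8 sample text

file -i input.file     # reports charset=utf-8
cat input.file

iconv -f UTF-8 -t ASCII//TRANSLIT input.file -o output.file

file -i output.file    # now reports charset=us-ascii
cat output.file        # accents transliterated, or replaced with ? where no approximation exists
```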
Convert UTF-8 to ASCII in Linux
Note: If the string //IGNORE is appended to the target encoding, characters that can’t be converted are dropped, and an error is reported after the conversion.
Again, if the string //TRANSLIT is appended to the target encoding as in the example above (ASCII//TRANSLIT), characters are transliterated as needed where possible: when a character can’t be represented in the target character set, it is approximated through one or more similar-looking characters.
Any character that can’t be transliterated and is not in the target character set is replaced with a question mark (?) in the output.
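To make the difference concrete (the sample string is ours; note that glibc’s iconv exits with a non-zero status when //IGNORE actually drops something):

```shell
printf 'caf\303\251\n' > u.txt    # UTF-8 "café"

# //IGNORE: the é is dropped, and iconv reports an error exit status
iconv -f UTF-8 -t 'ASCII//IGNORE' u.txt || true    # prints: caf

# //TRANSLIT: the é is approximated (or replaced with ?) instead
iconv -f UTF-8 -t 'ASCII//TRANSLIT' u.txt
```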
Convert Multiple Files to UTF-8 Encoding
Coming back to our main topic, to convert multiple or all files in a directory to UTF-8 encoding, you can write a small shell script called encoding.sh as follows:
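The article’s original script isn’t reproduced here; below is a minimal sketch built around the FROM_ENCODING and TO_ENCODING variables it mentions (the .converted.txt output suffix is our choice):

```shell
#!/bin/bash
# encoding.sh - convert every *.txt in the current directory
FROM_ENCODING="ISO-8859-1"
TO_ENCODING="UTF-8"
for file in *.txt; do
    [ -e "$file" ] || continue    # the directory contains no *.txt files
    iconv -f "$FROM_ENCODING" -t "$TO_ENCODING" "$file" \
          -o "${file%.txt}.converted.txt"
done
```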
Save the file, then make the script executable. Run it from the directory where your files ( *.txt ) are located.
Important: You can also use this script for general conversion of multiple files from one given encoding to another; simply adjust the values of the FROM_ENCODING and TO_ENCODING variables, along with the output file name.
For more information, look through the iconv man page.
To sum up this guide: understanding character encoding, and how to convert from one encoding scheme to another, is necessary knowledge for every computer user, and even more so for programmers dealing with text.
Lastly, you can get in touch with us by using the comment section below for any questions or feedback.