Linux: find non-ASCII characters

identify files with non-ASCII or non-printable characters in file name

In a directory of about 80 GB containing approximately 700,000 files, there are some file names with non-English characters. Other than laboriously trawling through the file list, is there:

  • An easy way to list or otherwise identify these file names?
  • A way to generate printable non-English language characters — those characters that are not listed in the printable range of man ascii (so I can test that these files are being identified)?

4 Answers

Assuming that "foreign" means "not an ASCII character", you can use find with a pattern to find all files that do not have printable ASCII characters in their names:
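A sketch of such a command, assuming GNU find and a POSIX shell:

LC_ALL=C find . -name '*[! -~]*'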

(The space is the first printable character listed on http://www.asciitable.com/, and ~ is the last.)

The LC_ALL=C prefix is required (actually, LC_CTYPE=C and LC_COLLATE=C ), otherwise the character range would be interpreted incorrectly. See also the manual page glob(7) . Since LC_ALL=C causes find to interpret strings as ASCII, it will print multi-byte characters (such as π ) as question marks. To fix this, pipe the output to some program (e.g. cat ) or redirect it to a file.
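For example, piping through cat (same pattern as above):

LC_ALL=C find . -name '*[! -~]*' | cat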

Instead of specifying character ranges, [:print:] can also be used to select "printable characters". Be sure to set the C locale, or you will get quite (seemingly) arbitrary behavior.
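A sketch of that variant:

LC_ALL=C find . -name '*[![:print:]]*'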

find and delete files with non-ASCII names

I have some old migrated files whose names contain non-printable characters. I would like to find all files with such names and delete them completely from the system.

I would like to find all such files.

Here is an example screenshot of what I’m seeing when I do an ls in such folders:

I want to find these files with the non-printable characters and just delete them.

5 Answers

Non-ASCII characters

ASCII character codes range from 0x00 to 0x7F in hex. Therefore, any character with a code greater than 0x7F is a non-ASCII character. This includes the bulk of the characters in UTF-8 (ASCII codes are essentially a subset of UTF-8). For example, the Japanese character あ (U+3042) is encoded in UTF-8 as the three bytes 0xE3 0x81 0x82.

ASCII control characters

Out of the ASCII codes, 0x00 through 0x1F and 0x7F represent control characters such as ESC ( 0x1B ). These control characters were not originally intended to be printable even though some of them, like the line feed character 0x0A , can be interpreted and displayed.

On my system, ls displays all control characters as ? by default, unless I pass the --show-control-chars option. I’m guessing that the files you want to delete contain ASCII control characters, as opposed to non-ASCII characters. This is an important distinction: if you delete filenames containing non-ASCII characters, you may blow away legitimate files that just happen to be named in another language.

Regular expressions for character codes

POSIX

POSIX provides a very handy collection of character classes for dealing with these types of characters (thanks to bashophil for pointing this out):
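The classes most relevant here, with the byte ranges they cover in the C locale (a non-exhaustive list):

  [[:cntrl:]]   control characters (0x00-0x1F and 0x7F)
  [[:print:]]   printable characters (0x20-0x7E)
  [[:graph:]]   printable characters excluding space (0x21-0x7E)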

Perl Compatible Regular Expressions allow hexadecimal character codes using the syntax \xYY (or \x{YYYY} for values that do not fit in one byte).

For example, a PCRE regex for the Japanese character あ would be \xE3\x81\x82 when matching the raw UTF-8 bytes (or \x{3042} when the regex engine is running in UTF-8 mode).

In addition to the POSIX character classes listed above, PCRE also provides the [:ascii:] character class, which is a convenient shorthand for [\x00-\x7F] .

GNU’s version of grep supports PCRE using the -P flag, but BSD grep (on Mac OS X, for example) does not. Neither GNU nor BSD find supports PCRE regexes.

Finding the files

GNU find supports POSIX regexes (thanks to iscfrc for pointing out the pure find solution to avoid spawning additional processes). The following command will list all filenames (but not directory names) below the current directory that contain non-printable control characters:
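A sketch of such a command, assuming GNU find with -regextype support; the regex requires a control character in the last path component only:

LC_ALL=C find . -regextype posix-extended -type f -regex '.*/[^/]*[[:cntrl:]][^/]*'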

The regex is a little complicated because the -regex option has to match the entire file path, not just the filename, and because I’m assuming that we don’t want to blow away files with normal names simply because they are inside directories with names containing control characters.

To delete the matching files, simply pass the -delete option to find , after all other options (this is critical; passing -delete as the first option will blow away everything in your current directory):
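Along these lines, with -delete last:

LC_ALL=C find . -regextype posix-extended -type f -regex '.*/[^/]*[[:cntrl:]][^/]*' -delete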

I highly recommend running the command without the -delete first, so you can see what will be deleted before it’s too late.

If you also pass the -print option, you can see what is being deleted as the command runs:
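For instance:

LC_ALL=C find . -regextype posix-extended -type f -regex '.*/[^/]*[[:cntrl:]][^/]*' -print -delete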

To blow away any paths (files or directories) that contain control characters, the regex can be simplified and you can drop the -type option:
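A sketch of the simplified form:

LC_ALL=C find . -regextype posix-extended -regex '.*[[:cntrl:]].*' -print -delete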

With this command, if a directory name contains control characters, everything inside that directory will be deleted as well, even if none of the filenames inside contain control characters, because the full path of every file inside matches the regex.

Update: Finding both non-ASCII and control characters

It looks like your files contain both non-ASCII characters and ASCII control characters. As it turns out, [:ascii:] is not a POSIX character class, but it is provided by PCRE. I couldn’t find a POSIX regex to do this, so it’s Perl to the rescue. We’ll still use find to traverse our directory tree, but we’ll pass the results to Perl for processing.

To make sure we can handle filenames containing newlines (which seems likely in this case), we need to use the -print0 argument to find (supported by both the GNU and BSD versions); this separates records with a null character ( 0x00 ) instead of a newline, since the null byte is the only character that cannot appear in a file path on Linux. We need to pass the corresponding flag -0 to our Perl code so it knows how records are separated. The following command will print every path inside the current directory, recursively:
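A sketch of that plumbing (chomp strips the trailing null from each record before printing):

find . -print0 | perl -n0e 'chomp; print "$_\n"'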

Note that this command only spawns a single instance of the Perl interpreter, which is good for performance. The starting path argument (in this case, . for CWD ) is optional in GNU find but is required in BSD find on Mac OS X, so I’ve included it for the sake of portability.

Now for our regex. Here is a PCRE regex matching names that contain either non-ASCII or non-printable (i.e. control) characters (or both):
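One way to write it (the equivalent byte-range form [^\x20-\x7E] would also do):

[[:^ascii:][:cntrl:]]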

The following command will print all paths (directories or files) in the current directory that match this regex:
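For example:

find . -print0 | perl -n0e 'chomp; print "$_\n" if /[[:^ascii:][:cntrl:]]/'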

The chomp is necessary because it strips off the trailing null character from each path, which would otherwise match our regex. To delete the matching files and directories, we can use the following:
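A sketch of the destructive version; rm -rf is used so that both files and directories are handled, and the -- guard keeps odd names from being parsed as rm options:

find . -print0 | perl -n0e 'chomp; if (/[[:^ascii:][:cntrl:]]/) { print "$_\n"; system("rm", "-rf", "--", $_) }'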

This will also print out what is being deleted as the command runs (although control characters are interpreted so the output will not quite match the output of ls ).

Searching for non-ASCII characters

I have a file, a.out, which contains a number of lines. Each line is one character only, either the Unicode character U+2013 or a lower-case letter a-z .

Doing a file command on a.out elicits the result UTF-8 Unicode text.

The locale command reports:

If I issue the command grep -P -n "[^\x00-\xFF]" a.out I would expect only the lines containing U+2013 to be returned. And this is the case if I carry out the test under cygwin. The problem environment however is Oracle Linux Server release 6.5, and the issue is that the grep command returns no lines. If I issue grep -P -n "[\x00-\xFF]" a.out then all lines are returned.

I realise that "[grep -P] ... is highly experimental and grep -P may warn of unimplemented features," but no warnings are issued.

Am I missing something?

3 Answers

I recommend avoiding dodgy grep -P implementations and using the real thing. This works:
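A sketch of the one-liner, assuming the file is named a.out as in the question:

perl -CSD -ne 'print "$.: $_" if /\P{ASCII}/' a.out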

The -CSD option says that both the stdio trio (stdin, stdout, stderr) and disk files should be treated as UTF-8 encoded.

The $. represents the current record (line) number.

The $_ represents the current line.

The \P{ASCII} matches any code point that is not ASCII.

grep (and family) don’t do the Unicode processing needed to merge multi-byte sequences into a single character for regex matching, which is what you seem to want.

That implies that the UTF-8 encoding for U+2013 ( 0xe2 , 0x80 , 0x93 ) is not treated by grep as parts of a single printable character outside the given range.
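To match that character byte-wise anyway, a sketch like this works (the C locale forces grep -P into byte mode):

LC_ALL=C grep -P -n '\xE2\x80\x93' a.out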

The GNU grep manual’s description of -P does not mention Unicode or UTF-8. Rather, it says "Interpret the pattern as a Perl regular expression" (this does not mean that the result is identical to Perl, only that some of the backslash-escapes are similar).

Perl itself can be told to use UTF-8 encoding. However, the examples using Perl in Filtering invalid utf8 do not use that feature. Instead, the expressions (like those in the problematic grep) test only the individual bytes, not the complete character.

How to detect invalid UTF-8 Unicode/binary in a text file

I need to detect corrupted text files that contain invalid (non-ASCII) UTF-8, Unicode or binary characters.

What I have tried:
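For example, with iconv (the -c flag drops characters that cannot be converted):

iconv -f utf-8 -t utf-8 -c file.txt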

This converts a file from UTF-8 encoding to UTF-8 encoding, and -c is for skipping invalid UTF-8 characters. However, at the end those illegal characters still got printed. Are there any other solutions in bash on Linux, or in other languages?

9 Answers

Assuming you have your locale set to UTF-8 (see locale output), this works well to recognize invalid UTF-8 sequences:
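Assuming GNU grep and a UTF-8 locale (file.txt stands in for your file):

grep -axv '.*' file.txt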

  • -a, --text: treats the file as text; essentially this prevents grep from aborting once it finds an invalid byte sequence (i.e. one that is not UTF-8)
  • -v, --invert-match: inverts the match, showing the lines that are not matched
  • -x '.*' (--line-regexp): requires the pattern to match a complete line, consisting of any number of valid UTF-8 characters

Hence, the output consists of the lines that contain an invalid (non-UTF-8) byte sequence (because of the inverted match, -v ).

I would grep for non-ASCII characters.

With GNU grep with PCRE support (via -P , which is not always available; on FreeBSD you can use pcregrep from the pcre2 package) you can do:
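A sketch, where file.txt is a placeholder; the C locale makes the byte range match the individual bytes of any multi-byte character:

LC_ALL=C grep --color=auto -P -n '[\x80-\xFF]' file.txt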

Reference: How Do I grep For all non-ASCII Characters in UNIX. So, in fact, if you only want to check whether the file contains non-ASCII characters, you can just say:
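Something along these lines, using only the exit status:

LC_ALL=C grep -qP '[\x80-\xFF]' file.txt && echo "file.txt contains non-ASCII characters"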

To remove these characters, you can use:
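One way that keeps a .bak backup is an in-place edit; this sketch uses perl -i.bak rather than sed, since Perl's \x escapes inside bracket expressions are reliable:

perl -i.bak -pe 's/[^\x00-\x7F]//g' file.txt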

This will create a file.bak file as a backup, while the original file will have its non-ASCII characters removed. Reference: Remove non-ascii characters from csv.

Try this in order to find non-ASCII characters from the shell:
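A sketch that highlights lines containing bytes outside the ASCII range (the C locale makes the character classes byte-oriented):

LC_ALL=C grep --color=auto -n '[^[:cntrl:][:print:]]' file.txt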

What you are looking at is by definition corrupted. Apparently, you are displaying the file as it is rendered in Latin-1; the three characters ï¿½ represent the three byte values 0xEF 0xBF 0xBD. But those are the UTF-8 encoding of the Unicode REPLACEMENT CHARACTER U+FFFD, which is the result of attempting to convert bytes from an unknown or undefined encoding into UTF-8, and which would properly be displayed as � (if you have a browser from this century, you should see something like a black diamond with a question mark in it; but this also depends on the font you are using, etc.).

So your question about «how to detect» this particular phenomenon is easy; the Unicode code point U+FFFD is a dead giveaway, and the only possible symptom from the process you are implying.

These are not "invalid Unicode" or "invalid UTF-8" in the sense that this is a valid UTF-8 sequence which encodes a valid Unicode code point; it’s just that the semantics of this particular code point is "this is a replacement character for a character which could not be represented properly", i.e. invalid input.

As for how to prevent it in the first place, the answer is really simple, but also rather uninformative — you need to identify when and how the incorrect encoding took place, and fix the process which produced this invalid output.

To just remove the U+FFFD characters, try something like
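A sketch where corrupted.txt and cleaned.txt are placeholder names; -CSD makes Perl read and write UTF-8:

perl -CSD -pe 's/\x{FFFD}//g' < corrupted.txt > cleaned.txt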

but again, the proper solution is to not generate these erroneous outputs in the first place.

(You are not revealing the encoding of your example data. It is possible that it has an additional corruption. If what you are showing us is a copy/paste of the UTF-8 rendering of the data, it has been "double-encoded". In other words, somebody took — already corrupted, as per the above — UTF-8 text and told the computer to convert it from Latin-1 to UTF-8. Undoing that is easy; just convert it "back" to Latin-1. What you obtain should then be the original UTF-8 data before the superfluous incorrect conversion.)
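A sketch of that reverse conversion, with placeholder file names:

iconv -f utf-8 -t latin1 < double-encoded.txt > repaired.txt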

This Perl program should remove all non-ASCII characters:
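A sketch along the lines described below (the temporary file name is the one mentioned in the description):

use strict;
use warnings;

my $tmp = 'super-temporary-utf8-replacement-file-which-should-never-be-used-EVER';

for my $file (@ARGV) {
    open my $in,  '<', $file or die "cannot read $file: $!";
    open my $out, '>', $tmp  or die "cannot write $tmp: $!";
    while (my $line = <$in>) {
        $line =~ s/[[:^ascii:]]//g;    # delete every non-ASCII character
        print {$out} $line;
    }
    close $in;
    close $out;
    rename $tmp, $file or die "cannot rename $tmp over $file: $!";
}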

What this does is it takes files as input on the command-line, like so:
perl fixutf8.pl foo bar baz
Then, for each line, it replaces each instance of a non-ASCII character with nothing (deletion).
It then writes this modified line out to super-temporary-utf8-replacement-file-which-should-never-be-used-EVER (named so it doesn’t modify any other files).
Afterwards, it renames the temporary file to the name of the original one.

This accepts ALL ASCII characters (including DEL, NUL, CR, etc.), in case you have some special use for them. If you want only printable characters, simply replace :ascii: with :print: in s/// .

I hope this helps! Please let me know if this wasn’t what you were looking for.

The following C program detects invalid utf8 characters. It was tested and used on a linux system.

I am probably repeating what others have said already, but I think your invalid characters still get printed because they may be valid. The Universal Character Set is an attempt to cover the characters in worldwide use, so that robust software can be written which does not rely on a special character set.

So I think your problem may be one of the following two — assuming that your overall goal is to handle this (malicious) input from UTF files in general:

  1. There are invalid UTF-8 characters (better called invalid byte sequences — for this I’d like to refer to the corresponding Wikipedia article).
  2. There are characters with no equivalent in your current display font, which are substituted by a special symbol or shown as their binary ASCII equivalent (for this I would like to refer to the following SO post: UTF-8 special characters don't show up).

So in my opinion you have two possible ways to handle this:

  1. Transform all characters from UTF-8 into something handleable — e.g. ASCII — which can be done, for example, with iconv -f utf-8 -t ascii -o file_in_ascii.txt file_in_utf8.txt . But be careful: converting from the wider character space (UTF) into a smaller one might cause data loss.
  2. Handle UTF(-8) correctly — this is how the world writes text. If you think you might have to rely on ASCII characters because of some limiting post-processing step, stop and rethink. In most cases the post-processor already supports UTF, and it’s probably better to find out how to utilize it. You’re making your stuff future-proof and bullet-proof.

Handling UTF might seem tricky; the following steps may help you accomplish UTF readiness:
