This function checks whether a string contains only ASCII characters and, if it does not, reports which kind of extended text it appears to be.
    TEXT_CHARS = ("0000000111101100"  # 0x0X
                  "0000000000010000"  # 0x1X  For each character, this table gives
                  "1111111111111111"  # 0x2X  the kind of ASCII text file it can
                  "1111111111111111"  # 0x3X  belong to.
                  "1111111111111111"  # 0x4X
                  "1111111111111111"  # 0x5X
                  "1111111111111111"  # 0x6X  0 never appears in text
                  "1111111111111110"  # 0x7X  1 appears in plain ASCII text
                  "3333313333333333"  # 0x8X  2 appears in ISO-8859 text
                  "3333333333333333"  # 0x9X  3 appears in non-ISO extended
                  "2222222222222222"  # 0xaX    ASCII (Mac, IBM PC)
                  "2222222222222222"  # 0xbX
                  "2222222222222222"  # 0xcX
                  "2222222222222222"  # 0xdX  This table is copyrighted,
                  "2222222222222222"  # 0xeX  see the discussion part.
                  "2222222222222222") # 0xfX

    # Characters of class 1; they are deleted before the translation.
    PLAIN_ASCII = ''.join([chr(i) for i in range(256) if TEXT_CHARS[i] == '1'])

    # Python 2 code: s must be a byte string (str).
    def ascii_encoding(s):
        """ return 0 if the text s is not an ASCII text,
            1 if the text is a plain ASCII text,
            2 if the text is ISO-8859,
            3 if the text is a non-ISO extended text"""
        # Delete the plain ASCII characters, then map each remaining
        # character to its class digit ('0', '2' or '3').
        s = s.translate(TEXT_CHARS, PLAIN_ASCII)
        for i in "032":
            if i in s:
                return int(i)
        return 1

    #
    # some samples
    #
    print ascii_encoding("Hello world")
    print ascii_encoding("Sébastien Keim")
    print ascii_encoding("AZZ\x12BB")
It may seem inefficient to process the whole string with the translate method before doing any check, but this has the great advantage of minimizing the loops executed in Python code. So I expect this solution to be faster in the common case than any clever algorithm written in pure Python.
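The recipe as written targets Python 2 (print statement, two-argument str.translate). The following is a minimal Python 3 sketch of the same idea, assuming the input is a bytes object rather than a text string; it is an adaptation for illustration, not the author's code. It relies on bytes.translate(table, delete), which first deletes the bytes listed in delete and then maps the remaining bytes through the table.

```python
# Class table, one digit per byte value (same contents as the recipe's table).
TEXT_CHARS = (
    b"0000000111101100"  # 0x0X
    b"0000000000010000"  # 0x1X
    b"1111111111111111"  # 0x2X
    b"1111111111111111"  # 0x3X
    b"1111111111111111"  # 0x4X
    b"1111111111111111"  # 0x5X
    b"1111111111111111"  # 0x6X
    b"1111111111111110"  # 0x7X
    b"3333313333333333"  # 0x8X
    b"3333333333333333"  # 0x9X
    b"2222222222222222"  # 0xaX
    b"2222222222222222"  # 0xbX
    b"2222222222222222"  # 0xcX
    b"2222222222222222"  # 0xdX
    b"2222222222222222"  # 0xeX
    b"2222222222222222"  # 0xfX
)

# Every byte value whose class is 1 (plain ASCII text).
PLAIN_ASCII = bytes(i for i in range(256) if TEXT_CHARS[i] == ord("1"))


def ascii_encoding(data: bytes) -> int:
    """Return 0 (not text), 1 (plain ASCII), 2 (ISO-8859) or 3 (non-ISO extended)."""
    # Drop plain ASCII bytes, then replace each remaining byte by its class digit.
    remaining = data.translate(TEXT_CHARS, PLAIN_ASCII)
    # Check the classes in the same order as the recipe: 0, then 3, then 2.
    for digit in b"032":
        if digit in remaining:
            return digit - ord("0")
    return 1


print(ascii_encoding(b"Hello world"))                       # 1
print(ascii_encoding("Sébastien Keim".encode("latin-1")))   # 2
print(ascii_encoding(b"AZZ\x12BB"))                         # 0
```

Because class 3 is tested before class 2, a string that mixes both kinds of bytes is reported as 3.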
If you have to process large files, it may be worthwhile to check only a small part of the file first; this is enough to reject most binary files.
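A prefix check of that kind could look like the sketch below (Python 3). The function name looks_binary and the default sample_size of 4096 bytes are assumptions made for this illustration, not part of the recipe. A file is treated as binary when its sample contains at least one byte that the table marks as class 0.

```python
# Bytes that may appear in some kind of text (classes 1, 2 and 3 in the table):
# the accepted control characters, printable ASCII, and everything from 0x80 up.
TEXT_BYTES = bytes(
    sorted(
        {0x07, 0x08, 0x09, 0x0A, 0x0C, 0x0D, 0x1B}
        | set(range(0x20, 0x7F))
        | set(range(0x80, 0x100))
    )
)


def looks_binary(path, sample_size=4096):
    """Return True if the first sample_size bytes contain a non-text byte."""
    with open(path, "rb") as f:
        sample = f.read(sample_size)
    # Delete every text byte; anything left over is a class 0 byte.
    return bool(sample.translate(None, TEXT_BYTES))
```

A file that passes this quick test can still contain class 0 bytes further on, so the full check remains necessary when an exact answer matters.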
The function considers a string to be ASCII if all of its characters are either ASCII printing characters (again, according to the X3.4 standard, not isascii()) or any of the following controls: bell, backspace, tab, line feed, form feed, carriage return, esc, nextline.
I include bell because some programs (particularly shell scripts) use it literally, even though it is rare in normal text. I exclude vertical tab because it never seems to be used in real text. I also include, with hesitation, the X3.64/ECMA-43 control nextline (0x85), because that's what the dd EBCDIC->ASCII table maps the EBCDIC newline character to. It might be more appropriate to include it in the 8859 set instead of the ASCII set, but it's got to be included in *something* we recognize or EBCDIC files aren't going to be considered textual. Some old Unix source files use SO/SI (^N/^O) to shift between Greek and Latin characters, so these should possibly be allowed. But they make a real mess on VT100-style displays if they're not paired properly, so we are probably better off not calling them text.
A string is considered to be ISO-8859 text if its characters are all either ASCII, according to the above definition, or printing characters from the ISO-8859 8-bit extension, characters 0xA0 ... 0xFF.
Finally, a string is considered to be international text from some other character code if its characters are all either ISO-8859 (according to the above definition) or characters in the range 0x80 ... 0x9F, which ISO-8859 considers to be control characters but the IBM PC and Macintosh consider to be printing characters.
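As a cross-check of the three definitions above, the class table can be rebuilt from the stated ranges. This is a Python 3 sketch for illustration only; the helper name build_text_chars is hypothetical and does not appear in the "file" sources.

```python
def build_text_chars():
    """Rebuild the 256-entry class table from the rules in the discussion."""
    table = bytearray(b"0" * 256)           # default: never appears in text
    for i in range(0x20, 0x7F):             # ASCII printing characters
        table[i] = ord("1")
    for i in range(0xA0, 0x100):            # ISO-8859 printing characters
        table[i] = ord("2")
    for i in range(0x80, 0xA0):             # printing on IBM PC and Macintosh
        table[i] = ord("3")
    # Accepted controls: bell, backspace, tab, line feed, form feed,
    # carriage return, esc, nextline (0x85 overrides the class 3 range).
    for i in (0x07, 0x08, 0x09, 0x0A, 0x0C, 0x0D, 0x1B, 0x85):
        table[i] = ord("1")
    return bytes(table)


# One 16-character row per high nibble, in the same layout as the recipe's table.
rows = [build_text_chars()[i:i + 16].decode("ascii") for i in range(0, 256, 16)]
for high_nibble, row in enumerate(rows):
    print("%s  # 0x%xX" % (row, high_nibble))
```

The printed rows can be compared line by line with the literal table in the recipe.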
The transcoding table and the preceding discussion paragraphs come from the source code of the "file" package. Copyright (c) Ian F. Darwin, 1987. Written by Ian F. Darwin. Extensively modified by Eric Fischer in July, 2000, to handle character codes other than ASCII on a unified basis. Joerg Wunsch wrote the original support for 8-bit international characters. The following license applies to the table:
This software is not subject to any license of the American Telephone and Telegraph Company or of the Regents of the University of California.
Permission is granted to anyone to use this software for any purpose on any computer system, and to alter it and redistribute it freely, subject to the following restrictions:
The author is not responsible for the consequences of use of this software, no matter how awful, even if they arise from flaws in it.
The origin of this software must not be misrepresented, either by explicit claim or by omission. Since few users ever read sources, credits must appear in the documentation.
Altered versions must be plainly marked as such, and must not be misrepresented as being the original software. Since few users ever read sources, credits must appear in the documentation.
This notice may not be removed or altered.