
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

From: Stephen J. Turnbull <step...@xemacs.org>
Tue, 16 Sep 2014 12:34:36 +0900
Jim J. Jewett writes:

 > In terms of best-effort, it is reasonable to treat the smuggled bytes
 > as representing a character outside of your unicode repertoire

I have to disagree.  If you ever end up passing them to something that
validates or tries to reencode them without surrogateescape, BOOM!
These things are the text equivalent of IEEE NaNs.  If all you know
(as in the stdlib) is that you have "generic text", the only fairly
safe things to do with them are (1) delete them, (2) substitute an
appropriate replacement character for them, (3) pass the text
containing them verbatim to other code, and (4) reencode them using
the same codec they were read with.
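
A minimal sketch of the "BOOM" and the four fairly safe operations, for
illustration only.  The byte string and the helper is_smuggled() are
hypothetical; the range U+DC80..U+DCFF is where the surrogateescape
error handler puts undecodable bytes.

```python
# Hypothetical input: a Latin-1 filename read with the wrong codec.
raw = b"caf\xe9.txt"
text = raw.decode("ascii", errors="surrogateescape")  # 'caf\udce9.txt'


def is_smuggled(ch):
    """True for the lone surrogates that surrogateescape produces."""
    return "\udc80" <= ch <= "\udcff"


# BOOM: reencoding without surrogateescape fails on the smuggled byte.
try:
    text.encode("utf-8")
    boom = False
except UnicodeEncodeError:
    boom = True

# (1) delete them
deleted = "".join(ch for ch in text if not is_smuggled(ch))

# (2) substitute a replacement character (U+FFFD here)
replaced = "".join("\ufffd" if is_smuggled(ch) else ch for ch in text)

# (3) pass the text verbatim to other code
verbatim = text

# (4) reencode with the same codec and error handler it was read with
roundtrip = text.encode("ascii", errors="surrogateescape")
```

Only (4) recovers the original bytes; (1) and (2) lose information but
leave text that any strict codec can handle.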

 > -- so it won't ever match entirely valid strings, except perhaps
 > via a wildcard.  And it should still work for .endswith(<the same
 > invalid characters>).

Incorrect, I'm pretty sure, unless you know that both texts containing
<the same invalid code points> were read with the same codec.  E.g.,
consider two filenames encoded in ISO Cyrillic and ISO Hebrew, read
with (encoding='ascii', errors='surrogateescape').
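
A sketch of that case, assuming two hypothetical three-letter names
that happen to share the same bytes in ISO 8859-5 (Cyrillic) and
ISO 8859-8 (Hebrew).  The variable names are made up for illustration.

```python
# Two different texts, written on two differently configured systems.
cyrillic_text = "\u0440\u0441\u0442"  # Cyrillic er, es, te
hebrew_text = "\u05d0\u05d1\u05d2"    # Hebrew alef, bet, gimel

cyrillic_bytes = cyrillic_text.encode("iso8859_5")
hebrew_bytes = hebrew_text.encode("iso8859_8")

# Both are read back the way generic code would read them.
smuggled_cyr = cyrillic_bytes.decode("ascii", errors="surrogateescape")
smuggled_heb = hebrew_bytes.decode("ascii", errors="surrogateescape")

# The smuggled code points compare equal, so == and .endswith() match...
same_smuggled = smuggled_cyr == smuggled_heb
suffix_matches = smuggled_cyr.endswith(smuggled_heb[-1])

# ...although the texts they stand for have nothing in common.
same_text = cyrillic_text == hebrew_text
```

Here same_smuggled and suffix_matches are true while same_text is
false: the match says the bytes agree, not the characters.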

Apps that know the semantics of the text may DWIM/DTRT if they want
to, but FWIW-IMHO-YMMV-and-any-other-4-letter-caveat-acronyms-that-
may-apply Python and the stdlib shouldn't try to guess.

Guessing may be unavoidable, of course.

_______________________________________________
Python-Dev mailing list
Pyth...@python.org
https://mail.python.org/mailman/listinfo/python-dev
