Ian Bicking complains about unicode in Python and wants to change the default encoding in his Python application, and wonders why Python makes it so hard to change it. It's very tempting to change the default encoding, and I messed around with it too when I first explored Python unicode issues a few years ago. However, I now think that changing the default encoding in a Python application is not the right way to go. If you do so, you run the risk of writing applications or libraries that aren't going to work correctly on any other system.

For a slightly involved example, take the case of Silva and PlacelessTranslationService. Silva (a Zope-based CMS) went through a painful transition a few years ago to use unicode internally throughout. The ZPublisher can be configured to encode any unicode response to UTF-8. For input, we make sure everything is decoded into unicode. This all worked pretty well, though of course we did find 'leaks' every once in a while where we had overlooked an encoding or decoding step. The leaks are aggravated by the fact that Zope 2 isn't very unicode pure as a framework.

Then we installed PlacelessTranslationService. This had been developed for Plone, which does not use unicode the Python way. Instead, as I understand it, it stores its content as UTF-8, and then the codebase has a number of hacks to make it deal with unicode strings too. Not by changing the default encoding, but by overriding an important StringIO that gets used by the Zope Page Template engine to do something very similar -- encode to UTF-8 any unicode that gets passed to the page template engine.

Suddenly we were again in a mire of unicode-related bugs. Our assumption that the output of a page template was a unicode string was broken by PlacelessTranslationService, and this caused things to break in subtle ways. Desperate hacking ensued...
(Five 1.1's i18n support should eventually fix this)

Changing the default encoding is tempting, but you're really going to be in trouble as soon as you give code that does string concatenation to anyone else. Imagine you've written an XML processing library and you happily concatenate UTF-8 strings with unicode strings in its internals -- it'd almost certainly not work correctly as soon as I use it in my application, unless I change the default encoding as well.

The best way to deal with unicode is to make sure that everything that enters your application (from the filesystem, from the web, or a database) is decoded into unicode, and everything that leaves your application is encoded (preferably to UTF-8). Thinking it was easier before unicode came along is probably slightly deceptive -- you would've run into worse problems as soon as your system had to deal with more encodings than latin-1. String encoding issues just are hard.

That's not to say the situation with Python's unicode support isn't frustrating. I thought long and hard about this while I suffered through it, but I couldn't really think of a better solution than the route Python took. If Python didn't have to worry about backwards compatibility I'd suggest making all strings unicode as Java did, and introducing a separate type for storing bytes, but that wasn't possible.

I do agree that life might've been easier if the default encoding of Python had been set to UTF-8 instead of to ascii. On the one hand, the ascii default does catch more errors. On the other hand, if you're willing to give up easy backporting of code to older Python versions, I believe that if, say, Python 2.5 shipped with a default encoding of UTF-8, it wouldn't actually break anything. But if I did it for my Python, I'd have problems as soon as I gave my code to someone else.
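The decode-on-input, encode-on-output approach described in the post can be sketched roughly as follows. This is only an illustration, written in Python 3 syntax (where bytes and text are separate types) rather than the Python 2 the post is about; the names read_input, greet and write_output are made up for the example.

```python
# Illustrative sketch only (Python 3 syntax). All helper names are hypothetical.

def read_input(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode bytes as they enter the application."""
    return raw.decode(encoding)


def greet(name: str) -> str:
    """Internal code only ever sees text, so concatenation is safe."""
    return "Hello, " + name + "!"


def write_output(text: str, encoding: str = "utf-8") -> bytes:
    """Encode text as it leaves the application."""
    return text.encode(encoding)


raw_name = "Zo\u00eb".encode("utf-8")  # bytes as they might arrive from the web
response = write_output(greet(read_input(raw_name)))

# Mixing still-encoded UTF-8 bytes into text forces a decode somewhere.
# With an ascii default (as in Python 2) that decode fails on non-ASCII data:
try:
    raw_name.decode("ascii")
    mixing_failed = False
except UnicodeDecodeError:
    mixing_failed = True
```

The last few lines mimic what a library that concatenates UTF-8 byte strings with unicode strings would run into on a system that still has the ascii default.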
(4) Tue Aug 02 2005 17:38 Changing the Python default encoding considered harmful:
- Comments:
Posted by Ian Bicking at Tue Aug 02 2005 23:55
Well, I'm talking about it a bunch exactly because I think this would work much better if everyone up and started changing the default encoding to UTF-8. And since you can even monkeypatch this in (with the reload(sys) trick) I can "fix" systems that use my libraries. I'm not sure I will, but I'm not sure I won't either...

While setdefaultencoding might break other people's libraries, the status quo doesn't encourage reusability either. The problem with handling encoding at the boundary is that *everyone else's library is a boundary*. That's a brutal overhead to have to deal with -- every time I use someone else's code, even if there's no process boundary involved (i.e., no persistence, no sending stuff over the network), I run the (significant!) risk of introducing encoding bugs into the system.

I really don't mind Unicode in general, or even Python's implementation. But this rather subtle issue, one that isn't even really part of the core Unicode implementation in Python, constantly drives me nuts. It's frustrating, because Python unicode objects actually act *almost* perfectly given the constraints. Like when you run 'test'.encode('iso-8859-1') Python will decode the string (since it isn't Unicode) with the default encoding, then encode it with the given encoding. If the default encoding was UTF-8 this would be a great feature! But with ascii it doesn't really help at all.

Given the choice of a 3-line hack vs. fixing every Python library one by one... especially when "fixing" a Python library causes regressions when that library is used with any library that hasn't been "fixed"... it hurts to even think about it.
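The implicit step that 'test'.encode('iso-8859-1') triggers (decode with the default encoding, then encode with the requested one) can be modelled roughly like this. It is a sketch in Python 3 syntax, not Python 2's actual implementation; py2_style_encode and its default parameter are made-up names standing in for the interpreter's default encoding.

```python
# Rough model of what Python 2 does for str.encode(); the function name and
# the `default` parameter are hypothetical, for illustration only.

def py2_style_encode(data: bytes, target: str, default: str = "ascii") -> bytes:
    text = data.decode(default)  # implicit step: uses the default encoding
    return text.encode(target)   # explicit step: the encoding that was asked for


utf8_bytes = "caf\u00e9".encode("utf-8")

# Pure ASCII data passes through whatever the default is.
ascii_result = py2_style_encode(b"test", "iso-8859-1")

# With a UTF-8 default the call becomes a UTF-8 to Latin-1 transcoder.
latin1_result = py2_style_encode(utf8_bytes, "iso-8859-1", default="utf-8")

# With the ascii default the implicit decode fails on non-ASCII bytes.
try:
    py2_style_encode(utf8_bytes, "iso-8859-1")
    ascii_default_failed = False
except UnicodeDecodeError:
    ascii_default_failed = True
```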
Posted by Martijn Faassen at Wed Aug 03 2005 11:04
Everybody else's library is a boundary only if they somehow end up concatenating non-unicode strings with unicode strings, or do things like 'str()' to something that might be a unicode string. Modern Python libraries *should* be written to handle unicode correctly. Of course in reality many of them don't do this. :)

I haven't thought it through, but I worry your three line hack will also mean fixing a lot of libraries. Thinking about encodings correctly is difficult, and many libraries will end up doing it wrong.

Anyway, you'll probably get the best answers on this by suggesting changing the default encoding to UTF-8 in Python 2.5 on Python-dev... I'm also curious to see what the reasons were for picking ascii at the time. Perhaps it has something to do with not making promises that the system cannot fully keep: perhaps, for instance, the equivalence of ascii-only unicode strings and ascii-only classic strings as hash keys and the like would then be expected to work for *all* UTF-8 strings.
Posted by Ian Bicking at Wed Aug 03 2005 20:34
Incidentally, if I set the default encoding to UTF-8 this is true:

    str(u'\u0100') == u'\u0100'

But this isn't!

    hash(str(u'\u0100')) == hash(u'\u0100')

Ouch. But this is true:

    hash('test') == hash(u'test')

thar' be dragons there
Posted by Fredrik at Mon Aug 22 2005 12:09
"everyone else's library is a boundary"nope. a library can be 1) unicode aware, in which case you don't have to do anything, 2) unicode agnostic, in which case you don't have to do anything, or 3) not unicode compatible, in which case changing the default encoding, in almost all cases, won't help you at all."I'm also curious to see what the reasons were for picking ascii at the time"iso-8859-1 would have been conceptually cleaner (since it's a strict subset of unicode), but it wasn't politically correct. variable width encodings like utf-8 simply doesn't work. ascii was considered to be equally bad for everyone (except americans).(also note that despite what people who didn't design this are trying to tell you, Python's text model is designed to let you mix ascii-encoded text and unicode strings freely. it's all about characters, not bytes)
