.\" Man page generated from reStructuredText. . .TH "KITCHEN" "1" "Sep 05, 2016" "0.2" "kitchen" .SH NAME kitchen \- kitchen 1.2.4 . .nr rst2man-indent-level 0 . .de1 rstReportMargin \\$1 \\n[an-margin] level \\n[rst2man-indent-level] level margin: \\n[rst2man-indent\\n[rst2man-indent-level]] - \\n[rst2man-indent0] \\n[rst2man-indent1] \\n[rst2man-indent2] .. .de1 INDENT .\" .rstReportMargin pre: . RS \\$1 . nr rst2man-indent\\n[rst2man-indent-level] \\n[an-margin] . nr rst2man-indent-level +1 .\" .rstReportMargin post: .. .de UNINDENT . RE .\" indent \\n[an-margin] .\" old: \\n[rst2man-indent\\n[rst2man-indent-level]] .nr rst2man-indent-level -1 .\" new: \\n[rst2man-indent\\n[rst2man-indent-level]] .in \\n[rst2man-indent\\n[rst2man-indent-level]]u .. .INDENT 0.0 .TP .B Author Toshio Kuratomi .TP .B Date 19 March 2011 .TP .B Version 1.0.x .UNINDENT .sp We\(aqve all done it. In the process of writing a brand new application we\(aqve discovered that we need a little bit of code that we\(aqve invented before. Perhaps it\(aqs something to handle unicode text. Perhaps it\(aqs something to make a bit of python\-2.5 code run on python\-2.4. Whatever it is, it ends up being a tiny bit of code that seems too small to worry about pushing into its own module so it sits there, a part of your current project, waiting to be cut and pasted into your next project. And the next. And the next. And since that little bittybit of code proved so useful to you, it\(aqs highly likely that it proved useful to someone else as well. Useful enough that they\(aqve written it and copy and pasted it over and over into each of their new projects. .sp Well, no longer! Kitchen aims to pull these small snippets of code into a few python modules which you can import and use within your project. No more copy and paste! Now you can let someone else maintain and release these small snippets so that you can get on with your life. .sp This package forms the core of Kitchen. 
It contains some useful modules for using newer \fI\%python standard library\fP modules on older python versions, text manipulation, \fI\%PEP 386\fP versioning, and initializing \fBgettext\fP\&. With this package we\(aqre trying to provide a few useful features that don\(aqt have too many dependencies outside of the \fI\%python standard library\fP\&. We\(aqll be releasing other modules that drop into the kitchen namespace to add other features (possibly with larger deps) as time goes on. .SH REQUIREMENTS .sp We\(aqve tried to keep the core kitchen module\(aqs requirements lightweight. At the moment kitchen only requires .INDENT 0.0 .TP .B python 2.4 or later .UNINDENT .sp \fBWARNING:\fP .INDENT 0.0 .INDENT 3.5 Kitchen\-1.1.0 was the last release that supported python\-2.3.x .UNINDENT .UNINDENT .SS Soft Requirements .sp If found, these libraries will be used to make the implementation of some part of kitchen better in some way. If they are not present, the API that they enable will still exist but may function in a different manner. .INDENT 0.0 .TP .B \fI\%chardet\fP Used in \fBguess_encoding()\fP and \fBguess_encoding_to_xml()\fP to help guess encoding of byte strings being converted. If not present, unknown encodings will be converted as if they were \fBlatin1\fP .UNINDENT .SH OTHER RECOMMENDED LIBRARIES .sp These libraries implement commonly used functionality that everyone seems to invent. Rather than reinvent their wheel, I simply list the things that they do well for now. Perhaps if people can\(aqt find them normally, I\(aqll add them as requirements in \fBsetup.py\fP or link them into kitchen\(aqs namespace. For now, I just mention them here: .INDENT 0.0 .TP .B \fI\%bunch\fP Bunch is a dictionary that you can use attribute lookup as well as bracket notation to access. Setting it apart from most homebrewed implementations is the \fBbunchify()\fP function which will descend nested structures of lists and dicts, transforming the dicts to Bunch\(aqs. 
.TP .B \fI\%hashlib\fP Python 2.5 and forward have a \fBhashlib\fP library that provides secure hash functions to python. If you\(aqre developing for python2.4 though, you can install the standalone hashlib library and have access to the same functions. .TP .B \fI\%iterutils\fP The python documentation for \fBitertools\fP has some examples of other nice iterable functions that can be built from the \fBitertools\fP functions. This third\-party module creates those recipes as a module. .TP .B \fI\%ordereddict\fP Python 2.7 and forward have a \fBOrderedDict\fP that provides a \fBdict\fP whose items are ordered (and indexable) as well as named. .TP .B \fI\%unittest2\fP Python 2.7 has an updated \fBunittest\fP library with new functions not present in the \fI\%python standard library\fP for Python 2.6 or less. If you want to use those new functions but need your testing framework to be compatible with older Python the unittest2 library provides the update as an external module. .TP .B \fI\%nose\fP If you want to use a test discovery tool instead of the unittest framework, nosetests provides a simple to use way to do that. .UNINDENT .SH LICENSE .sp This python module is distributed under the terms of the \fI\%GNU Lesser General Public License Version 2 or later\fP\&. .sp \fBNOTE:\fP .INDENT 0.0 .INDENT 3.5 Some parts of this module are licensed under terms less restrictive than the LGPLv2+. If you separate these files from the work as a whole you are allowed to use them under the less restrictive licenses. The following is a list of the files that are known: .INDENT 0.0 .TP .B \fI\%Python 2 license\fP \fB_subprocess.py\fP, \fBtest_subprocess.py\fP, \fBdefaultdict.py\fP, \fBtest_defaultdict.py\fP, \fB_base64.py\fP, and \fBtest_base64.py\fP .UNINDENT .UNINDENT .UNINDENT .SH CONTENTS .SS Using kitchen to write good code .sp Kitchen\(aqs functions won\(aqt automatically make you a better programmer. You have to learn when and how to use them as well. 
This section of the documentation is intended to show you some of the ways that you can apply kitchen\(aqs functions to problems that may have arisen in your life. The goal of this section is to give you enough information to understand what the kitchen API can do for you and where in the Kitchen API docs to look for something that can help you with your next issue. Along the way, you might pick up the knack for identifying issues with your code before you publish it. And that \fIwill\fP make you a better coder. .SS Overcoming frustration: Correctly using unicode in python2 .sp In python\-2.x, there are two types that deal with text. .INDENT 0.0 .IP 1. 3 \fBstr\fP is for strings of bytes. These are very similar in nature to how strings are handled in C. .IP 2. 3 \fBunicode\fP is for strings of unicode code points\&. .UNINDENT .sp \fBNOTE:\fP .INDENT 0.0 .INDENT 3.5 \fBJust what the dickens is "Unicode"?\fP .sp One mistake that people encountering this issue for the first time make is confusing the \fBunicode\fP type and the encodings of unicode stored in the \fBstr\fP type. In python, the \fBunicode\fP type stores an abstract sequence of code points\&. Each code point represents a character or a component of one (a base letter or a combining accent, for instance)\&. By contrast, byte \fBstr\fP stores a sequence of bytes which can then be mapped to a sequence of code points\&. Each unicode encoding (UTF\-8, UTF\-7, UTF\-16, UTF\-32, etc) maps different sequences of bytes to the unicode code points\&. .sp What does that mean to you as a programmer? When you\(aqre dealing with text manipulations (finding the number of characters in a string or cutting a string on word boundaries) you should be dealing with \fBunicode\fP strings as they abstract characters in a manner that\(aqs appropriate for thinking of them as a sequence of letters that you will see on a page. 
When dealing with I/O, reading to and from the disk, printing to a terminal, sending something over a network link, etc, you should be dealing with byte \fBstr\fP as those devices are going to need to deal with concrete implementations of what bytes represent your abstract characters. .UNINDENT .UNINDENT .sp In the python2 world many APIs use these two classes interchangeably but there are several important APIs where only one or the other will do the right thing. When you give the wrong type of string to an API that wants the other type, you may end up with an exception being raised (\fBUnicodeDecodeError\fP or \fBUnicodeEncodeError\fP). However, these exceptions aren\(aqt always raised because python implicitly converts between types... \fIsometimes\fP\&. .SS Frustration #1: Inconsistent Errors .sp Although converting when possible seems like the right thing to do, it\(aqs actually the first source of frustration. A programmer can test out their program with a string like: \fBThe quick brown fox jumped over the lazy dog\fP and not encounter any issues. But when they release their software into the wild, someone enters the string: \fBI sat down for coffee at the café\fP and suddenly an exception is thrown. The reason? The mechanism that converts between the two types is only able to deal with ASCII characters. Once you throw non\-ASCII characters into your strings, you have to start dealing with the conversion manually. .sp So, if I manually convert everything to either byte \fBstr\fP or \fBunicode\fP strings, will I be okay? The answer is.... \fIsometimes\fP\&. .SS Frustration #2: Inconsistent APIs .sp The problem you run into when converting everything to byte \fBstr\fP or \fBunicode\fP strings is that you\(aqll be using someone else\(aqs API quite often (this includes the APIs in the \fI\%python standard library\fP) and find that the API will only accept byte \fBstr\fP or only accept \fBunicode\fP strings. 
Or worse, that the code will accept either when you\(aqre dealing with strings that consist solely of ASCII but throw an error when you give it a string that\(aqs got non\-ASCII characters. When you encounter these APIs you first need to identify which type will work better and then you have to convert your values to the correct type for that code. Thus the programmer that wants to proactively fix all unicode errors in their code needs to do two things: .INDENT 0.0 .IP 1. 3 You must keep track of what type your sequences of text are. Does \fBmy_sentence\fP contain \fBunicode\fP or \fBstr\fP? If you don\(aqt know that then you\(aqre going to be in for a world of hurt. .IP 2. 3 Anytime you call a function you need to evaluate whether that function will do the right thing with \fBstr\fP or \fBunicode\fP values. Sending the wrong value here will lead to a \fBUnicodeError\fP being thrown when the string contains non\-ASCII characters. .UNINDENT .sp \fBNOTE:\fP .INDENT 0.0 .INDENT 3.5 There is one mitigating factor here. The python community has been standardizing on using \fBunicode\fP in all its APIs. Although there are some APIs that you need to send byte \fBstr\fP to in order to be safe, (including things as ubiquitous as \fBprint()\fP as we\(aqll see in the next section), it\(aqs getting easier and easier to use \fBunicode\fP strings with most APIs. .UNINDENT .UNINDENT .SS Frustration #3: Inconsistent treatment of output .sp Alright, since the python community is moving to using \fBunicode\fP strings everywhere, we might as well convert everything to \fBunicode\fP strings and use that by default, right? Sounds good most of the time but there\(aqs at least one huge caveat to be aware of. Anytime you output text to the terminal or to a file, the text has to be converted into a byte \fBstr\fP\&. Python will try to implicitly convert from \fBunicode\fP to byte \fBstr\fP\&... 
but it will throw an exception if the text contains non\-ASCII characters: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C >>> string = unicode(raw_input(), \(aqutf8\(aq) café >>> log = open(\(aq/var/tmp/debug.log\(aq, \(aqw\(aq) >>> log.write(string) Traceback (most recent call last): File "<stdin>", line 1, in <module> UnicodeEncodeError: \(aqascii\(aq codec can\(aqt encode character u\(aq\exe9\(aq in position 3: ordinal not in range(128) .ft P .fi .UNINDENT .UNINDENT .sp Okay, this is simple enough to solve: Just convert to a byte \fBstr\fP and we\(aqre all set: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C >>> string = unicode(raw_input(), \(aqutf8\(aq) café >>> string_for_output = string.encode(\(aqutf8\(aq, \(aqreplace\(aq) >>> log = open(\(aq/var/tmp/debug.log\(aq, \(aqw\(aq) >>> log.write(string_for_output) >>> .ft P .fi .UNINDENT .UNINDENT .sp So that was simple, right? Well... there\(aqs one gotcha that makes things a bit harder to debug sometimes. When you attempt to write non\-ASCII \fBunicode\fP strings to a file\-like object you get a traceback every time. But what happens when you use \fBprint()\fP? The terminal is a file\-like object so it should raise an exception right? The answer to that is.... \fIsometimes\fP: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C $ python >>> print u\(aqcafé\(aq café .ft P .fi .UNINDENT .UNINDENT .sp No exception. Okay, we\(aqre fine then? 
.sp We are until someone does one of the following: .INDENT 0.0 .IP \(bu 2 Runs the script in a different locale: .INDENT 2.0 .INDENT 3.5 .sp .nf .ft C $ LC_ALL=C python >>> # Note: if you\(aqre using a good terminal program when running in the C locale >>> # The terminal program will prevent you from entering non\-ASCII characters >>> # python will still recognize them if you use the codepoint instead: >>> print u\(aqcaf\exe9\(aq Traceback (most recent call last): File "<stdin>", line 1, in <module> UnicodeEncodeError: \(aqascii\(aq codec can\(aqt encode character u\(aq\exe9\(aq in position 3: ordinal not in range(128) .ft P .fi .UNINDENT .UNINDENT .IP \(bu 2 Redirects output to a file: .INDENT 2.0 .INDENT 3.5 .sp .nf .ft C $ cat test.py #!/usr/bin/python \-tt # \-*\- coding: utf\-8 \-*\- print u\(aqcafé\(aq $ ./test.py >t Traceback (most recent call last): File "./test.py", line 4, in <module> print u\(aqcafé\(aq UnicodeEncodeError: \(aqascii\(aq codec can\(aqt encode character u\(aq\exe9\(aq in position 3: ordinal not in range(128) .ft P .fi .UNINDENT .UNINDENT .UNINDENT .sp Okay, the locale thing is a pain but understandable: the C locale doesn\(aqt understand any characters outside of ASCII so naturally attempting to display those won\(aqt work. Now why does redirecting to a file cause problems? It\(aqs because \fBprint()\fP in python2 is treated specially. Whereas the other file\-like objects in python always convert to ASCII unless you set them up differently, using \fBprint()\fP to output to the terminal will use the user\(aqs locale to convert before sending the output to the terminal. When \fBprint()\fP is not outputting to the terminal (being redirected to a file, for instance), \fBprint()\fP decides that it doesn\(aqt know what locale to use for that file and so it tries to convert to ASCII instead. .sp So what does this mean for you, as a programmer? 
Unless you have the luxury of controlling how your users use your code, you should always, always, always convert to a byte \fBstr\fP before outputting strings to the terminal or to a file. Python even provides you with a facility to do just this. If you know that every \fBunicode\fP string you send to a particular file\-like object (for instance, \fBstdout\fP) should be converted to a particular encoding you can use a \fBcodecs.StreamWriter\fP object to convert from a \fBunicode\fP string into a byte \fBstr\fP\&. In particular, \fBcodecs.getwriter()\fP will return a \fBStreamWriter\fP class that will help you to wrap a file\-like object for output. Using our \fBprint()\fP example: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C $ cat test.py #!/usr/bin/python \-tt # \-*\- coding: utf\-8 \-*\- import codecs import sys UTF8Writer = codecs.getwriter(\(aqutf8\(aq) sys.stdout = UTF8Writer(sys.stdout) print u\(aqcafé\(aq $ ./test.py >t $ cat t café .ft P .fi .UNINDENT .UNINDENT .SS Frustrations #4 and #5 \-\- The other shoes .sp In English, there\(aqs a saying "waiting for the other shoe to drop". It means that when one event (usually bad) happens, you come to expect another event (usually worse) to come after. In this case we have two other shoes. .SS Frustration #4: Now it doesn\(aqt take byte strings?! .sp If you wrap \fBsys.stdout\fP using \fBcodecs.getwriter()\fP and think you are now safe to print any variable without checking its type I am afraid I must inform you that you\(aqre not paying enough attention to Murphy\(aqs Law\&. The \fBStreamWriter\fP that \fBcodecs.getwriter()\fP provides will take \fBunicode\fP strings and transform them into byte \fBstr\fP before they get to \fBsys.stdout\fP\&. The problem is if you give it something that\(aqs already a byte \fBstr\fP it tries to transform that as well. To do that it tries to turn the byte \fBstr\fP you give it into \fBunicode\fP and then transform that back into a byte \fBstr\fP\&... 
and since it uses the ASCII codec to perform those conversions, chances are that it\(aqll blow up when making them: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C >>> import codecs >>> import sys >>> UTF8Writer = codecs.getwriter(\(aqutf8\(aq) >>> sys.stdout = UTF8Writer(sys.stdout) >>> print \(aqcafé\(aq Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/usr/lib64/python2.6/codecs.py", line 351, in write data, consumed = self.encode(object, self.errors) UnicodeDecodeError: \(aqascii\(aq codec can\(aqt decode byte 0xc3 in position 3: ordinal not in range(128) .ft P .fi .UNINDENT .UNINDENT .sp To work around this, kitchen provides an alternate version of \fBcodecs.getwriter()\fP that can deal with both byte \fBstr\fP and \fBunicode\fP strings. Use \fBkitchen.text.converters.getwriter()\fP in place of the \fBcodecs\fP version like this: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C >>> import sys >>> from kitchen.text.converters import getwriter >>> UTF8Writer = getwriter(\(aqutf8\(aq) >>> sys.stdout = UTF8Writer(sys.stdout) >>> print u\(aqcafé\(aq café >>> print \(aqcafé\(aq café .ft P .fi .UNINDENT .UNINDENT .SS Frustration #5: Exceptions .sp Okay, so we\(aqve gotten ourselves this far. We convert everything to \fBunicode\fP strings. We\(aqre aware that we need to convert back into byte \fBstr\fP before we write to the terminal. We\(aqve worked around the inability of the standard \fBgetwriter()\fP to deal with both byte \fBstr\fP and \fBunicode\fP strings. Are we all set? Well, there\(aqs at least one more gotcha: raising exceptions with a \fBunicode\fP message. 
Take a look: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C >>> class MyException(Exception): >>> pass >>> >>> raise MyException(u\(aqCannot do this\(aq) Traceback (most recent call last): File "<stdin>", line 1, in <module> __main__.MyException: Cannot do this >>> raise MyException(u\(aqCannot do this while at a café\(aq) Traceback (most recent call last): File "<stdin>", line 1, in <module> __main__.MyException: >>> .ft P .fi .UNINDENT .UNINDENT .sp No, I didn\(aqt truncate that last line; raising exceptions really cannot handle non\-ASCII characters in a \fBunicode\fP string and will output an exception without the message if the message contains them. What happens if we try to use the handy dandy \fBgetwriter()\fP trick to work around this? .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C >>> import sys >>> from kitchen.text.converters import getwriter >>> sys.stderr = getwriter(\(aqutf8\(aq)(sys.stderr) >>> raise MyException(u\(aqCannot do this\(aq) Traceback (most recent call last): File "<stdin>", line 1, in <module> __main__.MyException: Cannot do this >>> raise MyException(u\(aqCannot do this while at a café\(aq) Traceback (most recent call last): File "<stdin>", line 1, in <module> __main__.MyException>>> .ft P .fi .UNINDENT .UNINDENT .sp Not only did this also fail, it even swallowed the trailing newline that\(aqs normally there.... So how to make this work? Transform from \fBunicode\fP strings to byte \fBstr\fP manually before outputting: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C >>> from kitchen.text.converters import to_bytes >>> raise MyException(to_bytes(u\(aqCannot do this while at a café\(aq)) Traceback (most recent call last): File "<stdin>", line 1, in <module> __main__.MyException: Cannot do this while at a café >>> .ft P .fi .UNINDENT .UNINDENT .sp \fBWARNING:\fP .INDENT 0.0 .INDENT 3.5 If you use \fBcodecs.getwriter()\fP on \fBsys.stderr\fP, you\(aqll find that raising an exception with a byte \fBstr\fP is broken by the default \fBStreamWriter\fP as well. Don\(aqt do that or you\(aqll have no way to output non\-ASCII characters. 
If you want to use a \fBStreamWriter\fP to encode other things on stderr while still having working exceptions, use \fBkitchen.text.converters.getwriter()\fP\&. .UNINDENT .UNINDENT .SS Frustration #6: Inconsistent APIs Part deux .sp Sometimes you do everything right in your code but other people\(aqs code fails you. With unicode issues this happens more often than we want. A glaring example of this is when you get values back from a function that aren\(aqt consistently \fBunicode\fP string or byte \fBstr\fP\&. .sp An example from the \fI\%python standard library\fP is \fBgettext\fP\&. The \fBgettext\fP functions are used to help translate messages that you display to users in the users\(aq native languages. Since most languages contain letters outside of the ASCII range, the values that are returned contain unicode characters. \fBgettext\fP provides you with \fBugettext()\fP and \fBungettext()\fP to return these translations as \fBunicode\fP strings and \fBgettext()\fP, \fBngettext()\fP, \fBlgettext()\fP, and \fBlngettext()\fP to return them as encoded byte \fBstr\fP\&. Unfortunately, even though they\(aqre documented to return only one type of string or the other, the implementation has corner cases where the wrong type can be returned. .sp This means that even if you separate your \fBunicode\fP string and byte \fBstr\fP correctly before you pass your strings to a \fBgettext\fP function, afterwards, you might have to check that you have the right sort of string type again. .sp \fBNOTE:\fP .INDENT 0.0 .INDENT 3.5 \fBkitchen.i18n\fP provides alternate gettext translation objects that return only byte \fBstr\fP or only \fBunicode\fP string. .UNINDENT .UNINDENT .SS A few solutions .sp Now that we\(aqve identified the issues, can we define a comprehensive strategy for dealing with them? .SS Convert text at the border .sp If you get some piece of text from a library, read from a file, etc, turn it into a \fBunicode\fP string immediately. 
Since python is moving in the direction of \fBunicode\fP strings everywhere it\(aqs going to be easier to work with \fBunicode\fP strings within your code. .sp If your code is heavily involved with using things that are bytes, you can do the opposite and convert all text into byte \fBstr\fP at the border and only convert to \fBunicode\fP when you need it for passing to another library or performing string operations on it. .sp In either case, the important thing is to pick a default type for strings and stick with it throughout your code. When you mix the types it becomes much easier to operate on a string with a function that can only use the other type by mistake. .sp \fBNOTE:\fP .INDENT 0.0 .INDENT 3.5 In python3, the abstract unicode type becomes much more prominent. The type named \fBstr\fP is the equivalent of python2\(aqs \fBunicode\fP and python3\(aqs \fBbytes\fP type replaces python2\(aqs \fBstr\fP\&. Most APIs deal in the unicode type of string with just some pieces that are low level dealing with bytes. The implicit conversion between bytes and unicode is removed and whenever you want to make the conversion you need to do so explicitly. .UNINDENT .UNINDENT .SS When the data needs to be treated as bytes (or unicode) use a naming convention .sp Sometimes you\(aqre converting nearly all of your data to \fBunicode\fP strings but you have one or two values where you have to keep byte \fBstr\fP around. This is often the case when you need to use the value verbatim with some external resource. For instance, filenames or key values in a database. When you do this, use a naming convention for the data you\(aqre working with so you (and others reading your code later) don\(aqt get confused about what\(aqs being stored in the value. .sp If you need both a textual string to present to the user and a byte value for an exact match, consider keeping both versions around. You can either use two variables for this or a \fBdict\fP whose key is the byte value. 
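.sp
The \fBdict\fP approach can be sketched as follows. This is only an illustration under stated assumptions: \fBbuild_display_map()\fP and the variable names are hypothetical and are not part of kitchen, and the plain \fBdecode()\fP call stands in for whatever conversion your code already uses.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
def build_display_map(b_filenames, encoding="utf-8"):
    """Map each exact byte value to a text version for display.

    Hypothetical helper for illustration; not a kitchen function.
    """
    display_names = {}
    for b_filename in b_filenames:
        # errors="replace" substitutes U+FFFD for undecodable bytes, so
        # the text is safe to show even when the bytes are not valid.
        display_names[b_filename] = b_filename.decode(encoding, "replace")
    return display_names


# The second name ends in a byte (0xFF) that is not valid utf-8.
b_filenames = [b"/etc/shells", b"/var/tmp/file" + bytes(bytearray([0xFF]))]
display_names = build_display_map(b_filenames)

for b_filename in b_filenames:
    # The byte key is what you pass to the filesystem or database;
    # the text value is what you show to the user.
    label = display_names[b_filename]
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
The byte key is never altered, so the exact match still works even when the display text had to be mangled.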
.sp \fBNOTE:\fP .INDENT 0.0 .INDENT 3.5 You can use the naming convention used in kitchen as a guide for implementing your own naming convention. It prefixes byte \fBstr\fP variables of unknown encoding with \fBb_\fP and byte \fBstr\fP of known encoding with the encoding name like: \fButf8_\fP\&. If the default was to handle \fBstr\fP and only keep a few \fBunicode\fP values, those variables would be prefixed with \fBu_\fP\&. .UNINDENT .UNINDENT .SS When outputting data, convert back into bytes .sp When you go to send your data back outside of your program (to the filesystem, over the network, displaying to the user, etc) turn the data back into a byte \fBstr\fP\&. How you do this will depend on the expected output format of the data. For displaying to the user, you can use the user\(aqs default encoding using \fBlocale.getpreferredencoding()\fP\&. For entering into a file, your best bet is to pick a single encoding and stick with it. .sp \fBWARNING:\fP .INDENT 0.0 .INDENT 3.5 When using the encoding that the user has set (for instance, using \fBlocale.getpreferredencoding()\fP), remember that they may have their encoding set to something that can\(aqt display every single unicode character. That means when you convert from \fBunicode\fP to a byte \fBstr\fP you need to decide what should happen if a character cannot be represented in the user\(aqs encoding. For purposes of displaying messages to the user, it\(aqs usually okay to use the \fBreplace\fP encoding error handler to replace the invalid characters with a question mark or other symbol meaning the character couldn\(aqt be displayed. .UNINDENT .UNINDENT .sp You can use \fBkitchen.text.converters.getwriter()\fP to do this automatically for \fBsys.stdout\fP\&. When creating exception messages be sure to convert to bytes manually. 
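.sp
As a rough sketch of the manual conversion described above, using only the standard library: \fBencode_for_user()\fP is a hypothetical helper name, not a kitchen function, and the example passes the encoding explicitly so that the result does not depend on the locale of the machine running it.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
# -*- coding: utf-8 -*-
import locale


def encode_for_user(text, encoding=None, errors="replace"):
    """Encode text for display to the user.

    Hypothetical helper for illustration; not a kitchen function.
    """
    if encoding is None:
        # The encoding the user has configured; it may not be able to
        # represent every unicode character.
        encoding = locale.getpreferredencoding()
    # With errors="replace", characters the encoding cannot represent
    # become a question mark instead of raising UnicodeEncodeError.
    return text.encode(encoding, errors)


as_ascii = encode_for_user(u"café", encoding="ascii")
as_utf8 = encode_for_user(u"café", encoding="utf-8")
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Here \fBas_ascii\fP loses the accented letter to a question mark while \fBas_utf8\fP keeps it, which is the trade\-off the warning above describes.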
.SS When writing unittests, include non\-ASCII values and both unicode and str type .sp Unless you know that a specific portion of your code will only deal with ASCII, be sure to include non\-ASCII values in your unittests. Including a few characters from several different scripts is highly advised as well because some code may have special cased accented roman characters but not know how to handle characters used in Asian alphabets. .sp Similarly, unless you know that that portion of your code will only be given \fBunicode\fP strings or only byte \fBstr\fP be sure to try variables of both types in your unittests. When doing this, make sure that the variables are also non\-ASCII as python\(aqs implicit conversion will mask problems with pure ASCII data. In many cases, it makes sense to check what happens if byte \fBstr\fP and \fBunicode\fP strings that won\(aqt decode in the present locale are given. .SS Be vigilant about spotting poor APIs .sp Make sure that the libraries you use return only \fBunicode\fP strings or byte \fBstr\fP\&. Unittests can help you spot issues here by running many variations of data through your functions and checking that you\(aqre still getting the types of string that you expect. .SS Example: Putting this all together with kitchen .sp The kitchen library provides a wide array of functions to help you deal with byte \fBstr\fP and \fBunicode\fP strings in your program. Here\(aqs a short example that uses many kitchen functions to do its work: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C #!/usr/bin/python \-tt # \-*\- coding: utf\-8 \-*\- import locale import os import sys import unicodedata from kitchen.text.converters import getwriter, to_bytes, to_unicode from kitchen.i18n import get_translation_object if __name__ == \(aq__main__\(aq: # Setup gettext driven translations but use the kitchen functions so # we don\(aqt have the mismatched bytes\-unicode issues. 
translations = get_translation_object(\(aqexample\(aq) # We use _() for marking strings that we operate on as unicode # This is pretty much everything _ = translations.ugettext # And b_() for marking strings that we operate on as bytes. # This is limited to exceptions b_ = translations.lgettext # Setup stdout encoding = locale.getpreferredencoding() Writer = getwriter(encoding) sys.stdout = Writer(sys.stdout) # Load data. Format is filename\e0description # description should be utf\-8 but filename can be any legal filename # on the filesystem # Sample datafile.txt: # /etc/shells\ex00Shells available on caf\exc3\exa9.lan # /var/tmp/file\exff\ex00File with non\-utf8 data in the filename # # And to create /var/tmp/file\exff (under bash or zsh) do: # echo \(aqSome data\(aq > /var/tmp/file$\(aq\e377\(aq datafile = open(\(aqdatafile.txt\(aq, \(aqr\(aq) data = {} for line in datafile: # We\(aqre going to keep filename as bytes because we will need the # exact bytes to access files on a POSIX operating system. # description, we\(aqll immediately transform into unicode type. b_filename, description = line.split(\(aq\e0\(aq, 1) # to_unicode defaults to decoding output from utf\-8 and replacing # any problematic bytes with the unicode replacement character # We accept mangling of the description here knowing that our file # format is supposed to use utf\-8 in that field and that the # description will only be displayed to the user, not used as # a key value. description = to_unicode(description, \(aqutf\-8\(aq).strip() data[b_filename] = description datafile.close() # We\(aqre going to add a pair of extra fields onto our data to show the # length of the description and the filesize. We put those between # the filename and description because we haven\(aqt checked that the # description is free of NULLs. 
datafile = open(\(aqnewdatafile.txt\(aq, \(aqw\(aq) # Name filename with a b_ prefix to denote byte string of unknown encoding for b_filename in data: # Since we have the byte representation of filename, we can read any # filename if os.access(b_filename, os.F_OK): size = os.path.getsize(b_filename) else: size = 0 # Because the description is unicode type, we know the number of # characters corresponds to the length of the normalized unicode # string. length = len(unicodedata.normalize(\(aqNFC\(aq, data[b_filename])) # Print a summary to the screen # Note that we do not let implicit type conversion from str to # unicode transform b_filename into a unicode string. That might # fail as python would use the ASCII codec. Instead we use # to_unicode() to explicitly transform in a way that we know will # not traceback. print _(u\(aqfilename: %s\(aq) % to_unicode(b_filename) print _(u\(aqfile size: %s\(aq) % size print _(u\(aqdesc length: %s\(aq) % length print _(u\(aqdescription: %s\(aq) % data[b_filename] # First combine the unicode portion line = u\(aq%s\e0%s\e0%s\(aq % (size, length, data[b_filename]) # Since the filenames are bytes, turn everything else to bytes before combining # Turning into unicode first would be wrong as the bytes in b_filename # might not convert b_line = \(aq%s\e0%s\en\(aq % (b_filename, to_bytes(line)) # Just to demonstrate that getwriter will pass bytes through fine print b_(\(aqWrote: %s\(aq) % b_line datafile.write(b_line) datafile.close() # And just to show how to properly deal with an exception. # Note two things about this: # 1) We use the b_() function to translate the string. This returns a # byte string instead of a unicode string # 2) We\(aqre using the b_() function returned by kitchen. If we had # used the one from gettext we would need to convert the message to # a byte str first message = u\(aqDemonstrate the proper way to raise exceptions. 
Sincerely, \eu3068\eu3057\eu304a\(aq raise Exception(b_(message)) .ft P .fi .UNINDENT .UNINDENT .sp \fBSEE ALSO:\fP .INDENT 0.0 .INDENT 3.5 \fBkitchen.text.converters\fP .UNINDENT .UNINDENT .SS Designing Unicode Aware APIs .sp APIs that deal with byte \fBstr\fP and \fBunicode\fP strings are difficult to get right. Here are a few strategies with pros and cons of each. .SS Contents .INDENT 0.0 .IP \(bu 2 \fI\%Designing Unicode Aware APIs\fP .INDENT 2.0 .IP \(bu 2 \fI\%Take either bytes or unicode, output only unicode\fP .IP \(bu 2 \fI\%Take either bytes or unicode, output the same type\fP .IP \(bu 2 \fI\%Separate functions\fP .IP \(bu 2 \fI\%Deciding whether to take str or unicode when no value is returned\fP .INDENT 2.0 .IP \(bu 2 \fI\%Writing to external data\fP .IP \(bu 2 \fI\%Updating data structures\fP .UNINDENT .IP \(bu 2 \fI\%APIs to Avoid\fP .INDENT 2.0 .IP \(bu 2 \fI\%Returning unicode unless a conversion fails\fP .IP \(bu 2 \fI\%Ignoring values with no chance of recovery\fP .IP \(bu 2 \fI\%Raising a UnicodeException with no chance of recovery\fP .UNINDENT .IP \(bu 2 \fI\%Knowing your data\fP .INDENT 2.0 .IP \(bu 2 \fI\%Do you need to operate on both bytes and unicode?\fP .IP \(bu 2 \fI\%Can you restrict the encodings?\fP .INDENT 2.0 .IP \(bu 2 \fI\%Single byte encodings\fP .IP \(bu 2 \fI\%Multibyte encodings\fP .INDENT 2.0 .IP \(bu 2 \fI\%Fixed width\fP .IP \(bu 2 \fI\%Variable Width\fP .INDENT 2.0 .IP \(bu 2 \fI\%ASCII compatible\fP .IP \(bu 2 \fI\%Escaped\fP .IP \(bu 2 \fI\%Other\fP .UNINDENT .UNINDENT .UNINDENT .UNINDENT .UNINDENT .UNINDENT .SS Take either bytes or unicode, output only unicode .sp In this strategy, you allow the user to enter either \fBunicode\fP strings or byte \fBstr\fP but what you give back is always \fBunicode\fP\&. This strategy is easy for novice endusers to start using immediately as they will be able to feed either type of string into the function and get back a string that they can use in other places. 
.sp However, it does lead to the novice writing code that functions correctly when testing it with ASCII\-only data but fails when given data that contains non\-ASCII characters. Worse, if your API is not designed to be flexible, the consumer of your code won\(aqt be able to easily correct those problems once they find them. .sp Here\(aqs a good API that uses this strategy: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C from kitchen.text.converters import to_unicode def truncate(msg, max_length, encoding=\(aqutf8\(aq, errors=\(aqreplace\(aq): msg = to_unicode(msg, encoding, errors) return msg[:max_length] .ft P .fi .UNINDENT .UNINDENT .sp The call to \fBtruncate()\fP starts with the essential parameters for performing the task. It ends with two optional keyword arguments that define the encoding to use to transform from a byte \fBstr\fP to \fBunicode\fP and the strategy to use if undecodable bytes are encountered. The defaults may vary depending on the use cases you have in mind. When the output is generally going to be printed for the user to see, \fBerrors=\(aqreplace\(aq\fP is a good default. If you are constructing keys to a database, raising an exception (with \fBerrors=\(aqstrict\(aq\fP) may be a better default. In either case, having both parameters allows the person using your API to choose how they want to handle any problems. Having the values is also a clue to them that a conversion from byte \fBstr\fP to \fBunicode\fP string is going to occur. .sp \fBNOTE:\fP .INDENT 0.0 .INDENT 3.5 If you\(aqre targeting python\-3.1 and above, \fBerrors=\(aqsurrogateescape\(aq\fP may be a better default than \fBerrors=\(aqstrict\(aq\fP\&. You need to be mindful of a few things when using \fBsurrogateescape\fP though: .INDENT 0.0 .IP \(bu 2 \fBsurrogateescape\fP will cause issues if a non\-ASCII compatible encoding is used (for instance, UTF\-16 and UTF\-32). That makes it unhelpful in situations where a true general purpose method of encoding must be found.
\fI\%PEP 383\fP mentions that \fBsurrogateescape\fP was specifically designed with the limitations of translating using system locales (where ASCII compatibility is generally seen as inescapable) so you should keep that in mind. .IP \(bu 2 If you use \fBsurrogateescape\fP to decode from \fBbytes\fP to \fBunicode\fP you will need to use an error handler other than \fBstrict\fP to encode as the lone surrogate that this error handler creates makes for invalid unicode that must be handled when encoding. In Python\-3.1.2 or earlier, a bug in the encoder error handlers means that you can only use \fBsurrogateescape\fP to encode; anything else will throw an error. .UNINDENT .sp Evaluate your usages of the variables in question to see what makes sense. .UNINDENT .UNINDENT .sp Here\(aqs a bad example of using this strategy: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C from kitchen.text.converters import to_unicode def truncate(msg, max_length): msg = to_unicode(msg) return msg[:max_length] .ft P .fi .UNINDENT .UNINDENT .sp In this example, we don\(aqt have the optional keyword arguments for \fBencoding\fP and \fBerrors\fP\&. A user who uses this function is more likely to miss the fact that a conversion from byte \fBstr\fP to \fBunicode\fP is going to occur. And once an error is reported, they will have to look through their backtrace and think harder about where they want to transform their data into \fBunicode\fP strings instead of having the opportunity to control how the conversion takes place in the function itself.
Note that the user does have the ability to make this work by making the transformation to unicode themselves: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C from kitchen.text.converters import to_unicode msg = to_unicode(msg, encoding=\(aqeuc_jp\(aq, errors=\(aqignore\(aq) new_msg = truncate(msg, 5) .ft P .fi .UNINDENT .UNINDENT .SS Take either bytes or unicode, output the same type .sp This strategy is sometimes called polymorphic because the type of data that is returned is dependent on the type of data that is received. The concept is that when you are given a byte \fBstr\fP to process, you return a byte \fBstr\fP in your output. When you are given \fBunicode\fP strings to process, you return \fBunicode\fP strings in your output. .sp This can work well for end users as the ones that know about the difference between the two string types will already have transformed the strings to their desired type before giving it to this function. The ones that don\(aqt can remain blissfully ignorant (at least, as far as your function is concerned) as the function does not change the type. .sp In cases where the encoding of the byte \fBstr\fP is known or can be discovered based on the input data this works well. If you can\(aqt figure out the input encoding, however, this strategy can fail in any of the following cases: .INDENT 0.0 .IP 1. 3 It needs to do an internal conversion between byte \fBstr\fP and \fBunicode\fP string. .IP 2. 3 It cannot return the same data as either a \fBunicode\fP string or byte \fBstr\fP\&. .IP 3. 
3 You may need to deal with byte strings that are not byte\-compatible with ASCII .UNINDENT .sp First, a couple examples of using this strategy in a good way: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C def translate(msg, table): replacements = table.keys() new_msg = [] for index, char in enumerate(msg): if char in replacements: new_msg.append(table[char]) else: new_msg.append(char) return \(aq\(aq.join(new_msg) .ft P .fi .UNINDENT .UNINDENT .sp In this example, all of the strings that we use (except the empty string which is okay because it doesn\(aqt have any characters to encode) come from outside of the function. Due to that, the user is responsible for making sure that the \fBmsg\fP, and the keys and values in \fBtable\fP all match in terms of type (\fBunicode\fP vs \fBstr\fP) and encoding (You can do some error checking to make sure the user gave all the same type but you can\(aqt do the same for the user giving different encodings). You do not need to make changes to the string that require you to know the encoding or type of the string; everything is a simple replacement of one element in the array of characters in message with the character in table. .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C import json from kitchen.text.converters import to_unicode, to_bytes def first_field_from_json_data(json_string): \(aq\(aq\(aqReturn the first field in a json data structure. The format of the json data is a simple list of strings. 
\(aq["one", "two", "three"]\(aq \(aq\(aq\(aq if isinstance(json_string, unicode): # On all python versions, json.loads() returns unicode if given # a unicode string return json.loads(json_string)[0] # Byte str: figure out which encoding we\(aqre dealing with if \(aq\ex00\(aq not in json_string[:2]: encoding = \(aqutf8\(aq elif \(aq\ex00\ex00\ex00\(aq == json_string[:3]: encoding = \(aqutf\-32\-be\(aq elif \(aq\ex00\ex00\ex00\(aq == json_string[1:4]: encoding = \(aqutf\-32\-le\(aq elif \(aq\ex00\(aq == json_string[0] and \(aq\ex00\(aq == json_string[2]: encoding = \(aqutf\-16\-be\(aq else: encoding = \(aqutf\-16\-le\(aq data = json.loads(unicode(json_string, encoding)) return data[0].encode(encoding) .ft P .fi .UNINDENT .UNINDENT .sp In this example the function takes either a byte \fBstr\fP type or a \fBunicode\fP string that has a list in json format and returns the first field from it as the type of the input string. The first section of code is very straightforward; we receive a \fBunicode\fP string, parse it with a function, and then return the first field from our parsed data (which our function returned to us as json data). .sp The second portion that deals with byte \fBstr\fP is not so straightforward. Before we can parse the string we have to determine what characters the bytes in the string map to. If we didn\(aqt do that, we wouldn\(aqt be able to properly find which characters are present in the string. In order to do that we have to figure out the encoding of the byte \fBstr\fP\&. Luckily, the json specification states that all strings are unicode and encoded with one of UTF32be, UTF32le, UTF16be, UTF16le, or UTF\-8\&. It further defines the format such that the first two characters are always ASCII\&. Each of these has a different sequence of NULLs when they encode an ASCII character. We can use that to detect which encoding was used to create the byte \fBstr\fP\&. .sp Finally, we return the byte \fBstr\fP by encoding the \fBunicode\fP back to a byte \fBstr\fP\&.
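.sp
The NULL\-pattern check described above can also be sketched on its own. This is a minimal illustration rather than part of kitchen: it assumes Python 3 naming (where a byte string is \fBbytes\fP), \fBdetect_json_encoding()\fP is a hypothetical helper name, and, as in the text, the first two characters of the data are assumed to be ASCII with no byte order mark in front of them.

```python
def detect_json_encoding(data):
    """Guess the UTF encoding of JSON bytes from their leading null bytes.

    Hypothetical helper; assumes the first two characters are ASCII and
    that the data carries no byte order mark.
    """
    head = data[:4]
    if b'\x00' not in head[:2]:
        return 'utf-8'          # xx xx: no nulls in the first two bytes
    if head[:3] == b'\x00\x00\x00':
        return 'utf-32-be'      # 00 00 00 xx
    if head[1:4] == b'\x00\x00\x00':
        return 'utf-32-le'      # xx 00 00 00
    if head[0:1] == b'\x00' and head[2:3] == b'\x00':
        return 'utf-16-be'      # 00 xx 00 xx
    return 'utf-16-le'          # xx 00 xx 00


text = '["one", "two", "three"]'
for name in ('utf-8', 'utf-16-le', 'utf-16-be', 'utf-32-le', 'utf-32-be'):
    # Each encoding of the same text should be recognised by its null pattern
    assert detect_json_encoding(text.encode(name)) == name
```

Slices such as \fBhead[0:1]\fP are used instead of indexing because indexing \fBbytes\fP in Python 3 yields integers rather than one\-byte strings.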
.sp As you can see, in this example we have to convert from byte \fBstr\fP to \fBunicode\fP and back. But we know from the json specification that byte \fBstr\fP has to be one of a limited number of encodings that we are able to detect. That ability makes this strategy work. .sp Now for some examples of using this strategy in ways that fail: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C import unicodedata def first_char(msg): \(aq\(aq\(aqReturn the first character in a string\(aq\(aq\(aq if not isinstance(msg, unicode): try: msg = unicode(msg, \(aqutf8\(aq) except UnicodeError: msg = unicode(msg, \(aqlatin1\(aq) msg = unicodedata.normalize(\(aqNFC\(aq, msg) return msg[0] .ft P .fi .UNINDENT .UNINDENT .sp If you look at that code and think that there\(aqs something fragile and prone to breaking in the \fBtry: except:\fP block you are correct in being suspicious. This code will fail on multi\-byte character sets that aren\(aqt UTF\-8\&. It can also fail on data where the sequence of bytes is valid UTF\-8 but the bytes are actually of a different encoding. The reason this code fails is that we don\(aqt know what encoding the bytes are in and the code must convert from a byte \fBstr\fP to a \fBunicode\fP string in order to function. .sp In order to make this code robust we must know the encoding of \fBmsg\fP\&.
The only way to know that is to ask the user so the API must do that: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C import unicodedata def number_of_chars(msg, encoding=\(aqutf8\(aq, errors=\(aqstrict\(aq): if not isinstance(msg, unicode): msg = unicode(msg, encoding, errors) msg = unicodedata.normalize(\(aqNFC\(aq, msg) return len(msg) .ft P .fi .UNINDENT .UNINDENT .sp Another example of failure: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C import os def listdir(directory): files = os.listdir(directory) if isinstance(directory, str): return files # files could contain both bytes and unicode new_files = [] for filename in files: if not isinstance(filename, unicode): # What to do here? continue new_files.append(filename) return new_files .ft P .fi .UNINDENT .UNINDENT .sp This function illustrates the second failure mode. Here, not all of the possible values can be represented as \fBunicode\fP without knowing more about the encoding of each of the filenames involved. Since each filename could have a different encoding there\(aqs a few different options to pursue. We could make this function always return byte \fBstr\fP since that can accurately represent anything that could be returned. If we want to return \fBunicode\fP we need to at least allow the user to specify what to do in case of an error decoding the bytes to \fBunicode\fP\&.
We can also let the user specify the encoding to use for doing the decoding but that won\(aqt help in all cases since not all files will be in the same encoding (or even necessarily in any encoding): .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C import locale import os def listdir(directory, encoding=locale.getpreferredencoding(), errors=\(aqstrict\(aq): # Note: In python\-3.1+, surrogateescape may be a better default files = os.listdir(directory) if isinstance(directory, str): return files new_files = [] for filename in files: if not isinstance(filename, unicode): filename = unicode(filename, encoding=encoding, errors=errors) new_files.append(filename) return new_files .ft P .fi .UNINDENT .UNINDENT .sp Note that although we use \fBerrors\fP in this example as what to pass to the codec that decodes to \fBunicode\fP we could also have an \fBerrors\fP argument that decides other things to do like skip a filename entirely, return a placeholder (\fBNondisplayable filename\fP), or raise an exception. .sp This leaves us with one last failure to describe: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C def first_field(csv_string): \(aq\(aq\(aqReturn the first field in a comma separated values string.\(aq\(aq\(aq try: return csv_string[:csv_string.index(\(aq,\(aq)] except ValueError: return csv_string .ft P .fi .UNINDENT .UNINDENT .sp This code looks simple enough. The hidden error here is that we are searching for a comma character in a byte \fBstr\fP but not all encodings will use the same sequence of bytes to represent the comma. If you use an encoding that\(aqs not ASCII compatible on the byte level, then the literal comma \fB\(aq,\(aq\fP in the above code will match inappropriate bytes. 
Some examples of how it can fail: .INDENT 0.0 .IP \(bu 2 Will find the byte representing an ASCII comma in another character .IP \(bu 2 Will find the comma but leave trailing garbage bytes on the end of the string .IP \(bu 2 Will not match the character that represents the comma in this encoding .UNINDENT .sp There are two ways to solve this. You can either take the encoding value from the user or you can take the separator value from the user. Of the two, taking the encoding is the better option for two reasons: .INDENT 0.0 .IP 1. 3 Taking a separator argument doesn\(aqt clearly document for the API user that the reason they must give it is to properly match the encoding of the \fBcsv_string\fP\&. They\(aqre just as likely to think that it\(aqs simply a way to specify an alternate character (like ":" or "|") for the separator. .IP 2. 3 It\(aqs possible for a variable width encoding to reuse the same byte sequence for different characters in multiple sequences. .sp \fBNOTE:\fP .INDENT 3.0 .INDENT 3.5 UTF\-8 is resistant to this as any character\(aqs sequence of bytes will never be a subset of another character\(aqs sequence of bytes. 
.UNINDENT .UNINDENT .UNINDENT .sp With that in mind, here\(aqs how to improve the API: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C def first_field(csv_string, encoding=\(aqutf\-8\(aq, errors=\(aqreplace\(aq): if not isinstance(csv_string, unicode): u_string = unicode(csv_string, encoding, errors) is_unicode = False else: u_string = csv_string is_unicode = True try: field = u_string[:u_string.index(u\(aq,\(aq)] except ValueError: return csv_string if not is_unicode: field = field.encode(encoding, errors) return field .ft P .fi .UNINDENT .UNINDENT .sp \fBNOTE:\fP .INDENT 0.0 .INDENT 3.5 If you decide you\(aqll never encounter a variable width encoding that reuses byte sequences you can use this code instead: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C def first_field(csv_string, encoding=\(aqutf\-8\(aq): try: return csv_string[:csv_string.index(\(aq,\(aq.encode(encoding))] except ValueError: return csv_string .ft P .fi .UNINDENT .UNINDENT .UNINDENT .UNINDENT .SS Separate functions .sp Sometimes you want to be able to take either byte \fBstr\fP or \fBunicode\fP strings, perform similar operations on either one and then return data in the same format as was given.
Probably the easiest way to do that is to have separate functions for each and adopt a naming convention to show that one is for working with byte \fBstr\fP and the other is for working with \fBunicode\fP strings: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C def translate_b(msg, table): \(aq\(aq\(aqReplace values in str with other byte values like unicode.translate\(aq\(aq\(aq if not isinstance(msg, str): raise TypeError(\(aqmsg must be of type str\(aq) str_table = [chr(s) for s in xrange(0,256)] delete_chars = [] for chr_val in (k for k in table.keys() if isinstance(k, int)): if chr_val > 255: raise ValueError(\(aqKeys in table must not exceed 255\(aq) if table[chr_val] is None: delete_chars.append(chr(chr_val)) elif isinstance(table[chr_val], int): if table[chr_val] > 255: raise TypeError(\(aqtable values cannot be more than 255 or less than 0\(aq) str_table[chr_val] = chr(table[chr_val]) else: if not isinstance(table[chr_val], str): raise TypeError(\(aqcharacter mapping must return integer, None or str\(aq) str_table[chr_val] = table[chr_val] str_table = \(aq\(aq.join(str_table) delete_chars = \(aq\(aq.join(delete_chars) return msg.translate(str_table, delete_chars) def translate(msg, table): \(aq\(aq\(aqReplace values in a unicode string with other values\(aq\(aq\(aq if not isinstance(msg, unicode): raise TypeError(\(aqmsg must be of type unicode\(aq) return msg.translate(table) .ft P .fi .UNINDENT .UNINDENT .sp There\(aqs several things that we have to do in this API: .INDENT 0.0 .IP \(bu 2 Because the function names might not be enough of a clue to the user about the value types that are expected, we have to check that the types are correct. .IP \(bu 2 We keep the behaviour of the two functions as close to the same as possible, just with byte \fBstr\fP and \fBunicode\fP strings substituted for each other. .UNINDENT .SS Deciding whether to take str or unicode when no value is returned .sp Not all functions have a return value.
Sometimes a function is there to interact with something external to python, for instance, writing a file out to disk or a method exists to update the internal state of a data structure. One of the main questions with these APIs is whether to take byte \fBstr\fP, \fBunicode\fP string, or both. The answer depends on your use case but I\(aqll give some examples here. .SS Writing to external data .sp When your information is going to an external data source like writing to a file you need to decide whether to take in \fBunicode\fP strings or byte \fBstr\fP\&. Remember that most external data sources are not going to be dealing with unicode directly. Instead, they\(aqre going to be dealing with a sequence of bytes that may be interpreted as unicode. With that in mind, you either need to have the user give you a byte \fBstr\fP or convert to a byte \fBstr\fP inside the function. .sp Next you need to think about the type of data that you\(aqre receiving. If it\(aqs textual data, (for instance, this is a chat client and the user is typing messages that they expect to be read by another person) it probably makes sense to take in \fBunicode\fP strings and do the conversion inside your function. On the other hand, if this is a lower level function that\(aqs passing data into a network socket, it probably should be taking byte \fBstr\fP instead. .sp Just as noted in the API notes above, you should specify an \fBencoding\fP and \fBerrors\fP argument if you need to transform from \fBunicode\fP string to byte \fBstr\fP and you are unable to guess the encoding from the data itself. .SS Updating data structures .sp Sometimes your API is just going to update a data structure and not immediately output that data anywhere. Just as when writing external data, you should think about both what your function is going to do with the data eventually and what the caller of your function is thinking that they\(aqre giving you. 
Most of the time, you\(aqll want to take \fBunicode\fP strings and enter them into the data structure as \fBunicode\fP when the data is textual in nature. You\(aqll want to take byte \fBstr\fP and enter them into the data structure as byte \fBstr\fP when the data is not text. Use a naming convention so the user knows what\(aqs expected. .SS APIs to Avoid .sp There are a few APIs that are just wrong. If you catch yourself making an API that does one of these things, change it before anyone sees your code. .SS Returning unicode unless a conversion fails .sp This type of API usually deals with byte \fBstr\fP at some point and converts it to \fBunicode\fP because it\(aqs usually thought to be text. However, there are times when the bytes fail to convert to a \fBunicode\fP string. When that happens, this API returns the raw byte \fBstr\fP instead of a \fBunicode\fP string. One example of this is present in the \fI\%python standard library\fP: python2\(aqs \fBos.listdir()\fP: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C >>> import os >>> import locale >>> locale.getpreferredencoding() \(aqUTF\-8\(aq >>> os.mkdir(\(aq/tmp/mine\(aq) >>> os.chdir(\(aq/tmp/mine\(aq) >>> open(\(aqnonsense_char_\exff\(aq, \(aqw\(aq).close() >>> open(\(aqall_ascii\(aq, \(aqw\(aq).close() >>> os.listdir(u\(aq.\(aq) [u\(aqall_ascii\(aq, \(aqnonsense_char_\exff\(aq] .ft P .fi .UNINDENT .UNINDENT .sp The problem with APIs like this is that they cause failures that are hard to debug because they don\(aqt happen where the variables are set. 
For instance, let\(aqs say you take the filenames from \fBos.listdir()\fP and give it to this function: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C def normalize_filename(filename): \(aq\(aq\(aqChange spaces and dashes into underscores\(aq\(aq\(aq return filename.translate({ord(u\(aq \(aq):u\(aq_\(aq, ord(u\(aq\-\(aq):u\(aq_\(aq}) .ft P .fi .UNINDENT .UNINDENT .sp When you test this, you use filenames that all are decodable in your preferred encoding and everything seems to work. But when this code is run on a machine that has filenames in multiple encodings the filenames returned by \fBos.listdir()\fP suddenly include byte \fBstr\fP\&. And byte \fBstr\fP has a different \fBstring.translate()\fP function that takes different values. So the code raises an exception where it\(aqs not immediately obvious that \fBos.listdir()\fP is at fault. .SS Ignoring values with no chance of recovery .sp An early version of python3 attempted to fix the \fBos.listdir()\fP problem pointed out in the last section by returning all values that were decodable to \fBunicode\fP and omitting the filenames that were not. This led to the following output: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C >>> import os >>> import locale >>> locale.getpreferredencoding() \(aqUTF\-8\(aq >>> os.mkdir(\(aq/tmp/mine\(aq) >>> os.chdir(\(aq/tmp/mine\(aq) >>> open(b\(aqnonsense_char_\exff\(aq, \(aqw\(aq).close() >>> open(\(aqall_ascii\(aq, \(aqw\(aq).close() >>> os.listdir(\(aq.\(aq) [\(aqall_ascii\(aq] .ft P .fi .UNINDENT .UNINDENT .sp The issue with this type of code is that it is silently doing something surprising. The caller expects to get a full list of files back from \fBos.listdir()\fP\&. Instead, it silently ignores some of the files, returning only a subset. This leads to code that doesn\(aqt do what is expected that may go unnoticed until the code is in production and someone notices that something important is being missed.
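.sp
One way to avoid dropping entries silently is to hand the undecodable values back to the caller. The following is only a sketch under stated assumptions, not how python or kitchen handle this: it uses Python 3 naming (byte strings are \fBbytes\fP), and \fBsplit_decodable()\fP is a hypothetical helper name. The byte names could come, for instance, from calling \fBos.listdir()\fP with a \fBbytes\fP path, which returns \fBbytes\fP filenames in python3.

```python
def split_decodable(b_names, encoding='utf-8'):
    """Decode what can be decoded and hand back the rest untouched.

    Hypothetical helper: returns (decoded, undecodable) so that no
    name is dropped without the caller knowing about it.
    """
    decoded = []
    undecodable = []
    for b_name in b_names:
        try:
            decoded.append(b_name.decode(encoding))
        except UnicodeDecodeError:
            # Keep the raw bytes so the caller can decide what to do
            undecodable.append(b_name)
    return decoded, undecodable


names, leftovers = split_decodable([b'all_ascii', b'nonsense_char_\xff'])
assert names == ['all_ascii']
assert leftovers == [b'nonsense_char_\xff']
```

Because the two lists are returned separately, the caller never receives a list that mixes text and bytes, and nothing disappears without notice.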
.SS Raising a UnicodeException with no chance of recovery .sp Believe it or not, a few libraries exist that make it impossible to deal with unicode text without raising a \fBUnicodeError\fP\&. What seems to occur in these libraries is that the library has functions that expect to receive a \fBunicode\fP string. However, internally, those functions call other functions that expect to receive a byte \fBstr\fP\&. The programmer of the API was smart enough to convert from a \fBunicode\fP string to a byte \fBstr\fP but they did not give the user the chance to specify the encodings to use or how to deal with errors. This results in exceptions when the user passes in a byte \fBstr\fP because the initial function wants a \fBunicode\fP string and exceptions when the user passes in a \fBunicode\fP string because the function can\(aqt convert the string to bytes in the encoding that it\(aqs selected. .sp Do not put the user in the position of not being able to use your API without raising a \fBUnicodeError\fP with certain values. If you can only safely take \fBunicode\fP strings, document that byte \fBstr\fP is not allowed and vice versa. If you have to convert internally, make sure to give the caller of your function parameters to control the encoding and how to treat errors that may occur during the encoding/decoding process. If your code will raise a \fBUnicodeError\fP with non\-ASCII values no matter what, you should probably rethink your API. .SS Knowing your data .sp If you\(aqve read all the way down to this section without skipping you\(aqve seen several admonitions about the type of data you are processing affecting the viability of the various API choices. .sp Here\(aqs a few things to consider in your data: .SS Do you need to operate on both bytes and unicode? .sp Much of the data in libraries, programs, and the general environment outside of python is written where strings are sequences of bytes. 
So when we interact with data that comes from outside of python or data that is about to leave python it may make sense to only operate on the data as a byte \fBstr\fP\&. There\(aqs two times when this may make sense: .INDENT 0.0 .IP 1. 3 The user is intended to hand the data to the function and then the function takes care of sending the data outside of python (to the filesystem, over the network, etc). .IP 2. 3 The data is not representable as text. For instance, writing a binary file format. .UNINDENT .sp Even when your code is operating in this area you still need to think a little more about your data. For instance, it might make sense for the person using your API to pass in \fBunicode\fP strings and let the function convert that into the byte \fBstr\fP that it then sends over the wire. .sp There are also times when it might make sense to operate only on \fBunicode\fP strings. \fBunicode\fP represents text so anytime that you are working on textual data that isn\(aqt going to leave python it has the potential to be a \fBunicode\fP\-only API. However, there\(aqs two things that you should consider when designing a \fBunicode\fP\-only API: .INDENT 0.0 .IP 1. 3 As your API gains popularity, people are going to use your API in places that you may not have thought of. Corner cases in these other places may mean that processing bytes is desirable. .IP 2. 3 In python2, byte \fBstr\fP and \fBunicode\fP are often used interchangeably with each other. That means that people programming against your API may have received \fBstr\fP from some other API and it would be most convenient for their code if your API accepted it. .UNINDENT .sp \fBNOTE:\fP .INDENT 0.0 .INDENT 3.5 In python3, the separation between the text type and the byte type is clearer. So in python3, there\(aqs less need to have all APIs take both unicode and bytes. .UNINDENT .UNINDENT .SS Can you restrict the encodings?
.sp If you determine that you have to deal with byte \fBstr\fP you should realize that not all encodings are created equal. Each has different properties that may make it possible to provide a simpler API provided that you can reasonably tell the users of your API that they cannot use certain classes of encodings. .sp As one example, if you are required to find a comma (\fB,\fP) in a byte \fBstr\fP you have different choices based on what encodings are allowed. If you can reasonably restrict your API users to only giving ASCII compatible encodings you can do this simply by searching for the literal comma character because that character will be represented by the same byte sequence in all ASCII compatible encodings. .sp The following are some classes of encodings to be aware of as you decide how generic your code needs to be. .SS Single byte encodings .sp Single byte encodings can only represent 256 total characters. They encode the code points for a character to the equivalent number in a single byte. .sp Most single byte encodings are ASCII compatible\&. ASCII compatible encodings are the most likely to be usable without changes to code so this is good news. A notable exception to this is the \fI\%EBCDIC\fP family of encodings. .SS Multibyte encodings .sp Multibyte encodings use more than one byte to encode some characters. .SS Fixed width .sp Fixed width encodings have a set number of bytes to represent all of the characters in the character set. \fBUTF\-32\fP is an example of a fixed width encoding that uses four bytes per character and can express every unicode character. There are a number of problems with writing APIs that need to operate on fixed width, multibyte characters. To go back to our earlier example of finding a comma in a string, we have to realize that even in \fBUTF\-32\fP where the code point for ASCII characters is the same as in ASCII, the byte sequence for them is different.
So you cannot search for the literal byte character as it may pick up false positives and may break a byte sequence in an odd place. .SS Variable Width .SS ASCII compatible .sp UTF\-8 and the \fI\%EUC\fP family of encodings are examples of ASCII compatible multi\-byte encodings. They achieve this by adhering to two principles: .INDENT 0.0 .IP \(bu 2 All of the ASCII characters are represented by the byte that they are in the ASCII encoding. .IP \(bu 2 None of the ASCII byte sequences are reused in any other byte sequence for a different character. .UNINDENT .SS Escaped .sp Some multibyte encodings work by using only bytes from the ASCII encoding but when a particular sequence of those bytes is found, they are interpreted as meaning something other than their ASCII values. \fBUTF\-7\fP is one such encoding that can encode all of the unicode code points\&. For instance, here are some Japanese characters encoded as \fBUTF\-7\fP: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C >>> a = u\(aq\eu304f\eu3089\eu3068\eu307f\(aq >>> print a くらとみ >>> print a.encode(\(aqutf\-7\(aq) +ME8wiTBoMH8\- .ft P .fi .UNINDENT .UNINDENT .sp These encodings can be used when you need to encode unicode data that may contain non\-ASCII characters for inclusion in an ASCII only transport medium or file. .sp However, they are not ASCII compatible in the sense that we used earlier as the bytes that represent an ASCII character are being reused as part of other characters. If you were to search for a literal plus sign in this encoded string, you would run across many false positives, for instance. .SS Other .sp There are many other popular variable width encodings, for instance \fBUTF\-16\fP and \fBshift\-JIS\fP\&. Many of these are not ASCII compatible so you cannot search for a literal ASCII character without danger of false positives or false negatives. .SS Kitchen API .sp Kitchen is structured as a collection of modules. In its current configuration, Kitchen ships with the following modules.
Other addon modules that may drag in more dependencies can be found on the \fI\%project webpage\fP .SS Kitchen.i18n Module .sp I18N is an important piece of any modern program. Unfortunately, setting up i18n in your program is often a confusing process. The functions provided here aim to make the programming side of that a little easier. .sp Most projects will be able to do something like this when they startup: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C # myprogram/__init__.py: import os import sys from kitchen.i18n import easy_gettext_setup _, N_ = easy_gettext_setup(\(aqmyprogram\(aq, localedirs=( os.path.join(os.path.realpath(os.path.dirname(__file__)), \(aqlocale\(aq), os.path.join(sys.prefix, \(aqlib\(aq, \(aqlocale\(aq) )) .ft P .fi .UNINDENT .UNINDENT .sp Then, in other files that have strings that need translating: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C # myprogram/commands.py: from myprogram import _, N_ def print_usage(): print _(u"""available commands are: \-\-help Display help \-\-version Display version of this program \-\-bake\-me\-a\-cake as fast as you can """) def print_invitations(age): print _(\(aqPlease come to my party.\(aq) print N_(\(aqI will be turning %(age)s year old\(aq, \(aqI will be turning %(age)s years old\(aq, age) % {\(aqage\(aq: age} .ft P .fi .UNINDENT .UNINDENT .sp See the documentation of \fI\%easy_gettext_setup()\fP and \fI\%get_translation_object()\fP for more details. .INDENT 0.0 .INDENT 3.5 .sp \fBSEE ALSO:\fP .INDENT 0.0 .INDENT 3.5 .INDENT 0.0 .TP .B \fBgettext\fP for details of how the python gettext facilities work .TP .B \fI\%babel\fP The babel module for in depth information on gettext, message catalogs, and translating your app. babel provides some nice features for i18n on top of \fBgettext\fP .UNINDENT .UNINDENT .UNINDENT .UNINDENT .UNINDENT .SS Functions .sp \fI\%easy_gettext_setup()\fP should satisfy the needs of most users. 
\fI\%get_translation_object()\fP is designed to ease the way for anyone that needs more control. .INDENT 0.0 .TP .B kitchen.i18n.easy_gettext_setup(domain, localedirs=(), use_unicode=True) Set up translation functions for an application .INDENT 7.0 .TP .B Parameters .INDENT 7.0 .IP \(bu 2 \fBdomain\fP \-\- Name of the message domain. This should be a unique name that can be used to look up the message catalog for this app. .IP \(bu 2 \fBlocaledirs\fP \-\- Iterator of directories to look for message catalogs under. The first directory to exist is used regardless of whether messages for this domain are present. If none of the directories exist, fall back on \fBsys.prefix\fP + \fB/share/locale\fP\&. Default: No directories to search so we just use the fallback. .IP \(bu 2 \fBuse_unicode\fP \-\- If \fBTrue\fP return the \fBgettext\fP functions for \fBunicode\fP strings else return the functions for byte \fBstr\fP for the translations. Default is \fBTrue\fP\&. .UNINDENT .TP .B Returns tuple of the \fBgettext\fP function and \fBgettext\fP function for plurals .UNINDENT .sp Setting up \fBgettext\fP can be a little tricky because of lack of documentation. This function will set up \fBgettext\fP using the \fI\%Class\-based API\fP for you. For the simple case, you can pass just the domain, leave the other arguments at their defaults, and call it like this: .INDENT 7.0 .INDENT 3.5 .sp .nf .ft C _, N_ = easy_gettext_setup(\(aqmyprogram\(aq) .ft P .fi .UNINDENT .UNINDENT .sp This will get you two functions, \fB_()\fP and \fBN_()\fP that you can use to mark strings in your code for translation. \fB_()\fP is used to mark strings that don\(aqt need to worry about plural forms no matter what the value of the variable is. \fBN_()\fP is used to mark strings that do need to have a different form if a variable in the string is plural.
.sp \fBSEE ALSO:\fP .INDENT 7.0 .INDENT 3.5 .INDENT 0.0 .TP .B api\-i18n This module\(aqs documentation has examples of using \fB_()\fP and \fBN_()\fP .TP .B \fI\%get_translation_object()\fP for information on how to use \fBlocaledirs\fP to get the proper message catalogs both when in development and when installed to FHS compliant directories on Linux. .UNINDENT .UNINDENT .UNINDENT .sp \fBNOTE:\fP .INDENT 7.0 .INDENT 3.5 The gettext functions returned from this function should be superior to the ones returned from \fBgettext\fP\&. The traits that make them better are described in the \fI\%DummyTranslations\fP and \fI\%NewGNUTranslations\fP documentation. .UNINDENT .UNINDENT .sp Changed in version kitchen\-0.2.4: ; API kitchen.i18n 2.0.0 Changed \fI\%easy_gettext_setup()\fP to return the lgettext functions instead of gettext functions when use_unicode=False. .UNINDENT .INDENT 0.0 .TP .B kitchen.i18n.get_translation_object(domain, localedirs=(), languages=None, class_=None, fallback=True, codeset=None, python2_api=True) Get a translation object bound to the message catalogs .INDENT 7.0 .TP .B Parameters .INDENT 7.0 .IP \(bu 2 \fBdomain\fP \-\- Name of the message domain. This should be a unique name that can be used to look up the message catalog for this app or library. .IP \(bu 2 \fBlocaledirs\fP \-\- Iterator of directories to look for message catalogs under. The directories are searched in order for message catalogs\&. For each of the directories searched, we check for message catalogs in any language specified in \fBlanguages\fP\&. The message catalogs are used to create the Translation object that we return. The Translation object will attempt to look up the msgid in the first catalog that we found. If it\(aqs not in there, it will go through each subsequent catalog looking for a match. For this reason, the order in which you specify the \fBlocaledirs\fP may be important.
If no message catalogs are found, either return a \fI\%DummyTranslations\fP object or raise an \fBIOError\fP depending on the value of \fBfallback\fP\&. The default localedir from \fBgettext\fP which is \fBos.path.join(sys.prefix, \(aqshare\(aq, \(aqlocale\(aq)\fP on Unix is implicitly appended to the \fBlocaledirs\fP, making it the last directory searched. .IP \(bu 2 \fBlanguages\fP \-\- .sp Iterator of language codes to check for message catalogs\&. If unspecified, the user\(aqs locale settings will be used. .sp \fBSEE ALSO:\fP .INDENT 2.0 .INDENT 3.5 \fBgettext.find()\fP for information on what environment variables are used. .UNINDENT .UNINDENT .IP \(bu 2 \fBclass_\fP \-\- The class to use to extract translations from the message catalogs\&. Defaults to \fI\%NewGNUTranslations\fP\&. .IP \(bu 2 \fBfallback\fP \-\- If set to \fBFalse\fP, raise an \fBIOError\fP if no message catalogs are found. If \fBTrue\fP, the default, return a \fI\%DummyTranslations\fP object. .IP \(bu 2 \fBcodeset\fP \-\- Set the character encoding to use when returning byte \fBstr\fP objects. This is equivalent to calling \fBset_output_charset()\fP on the Translations object that is returned from this function. .IP \(bu 2 \fBpython2_api\fP \-\- When \fBTrue\fP (default), return Translation objects that use the python2 gettext api (\fBgettext()\fP and \fBlgettext()\fP return byte \fBstr\fP\&. \fBugettext()\fP exists and returns \fBunicode\fP strings). When \fBFalse\fP, return Translation objects that use the python3 gettext api (gettext returns \fBunicode\fP strings and lgettext returns byte \fBstr\fP\&. ugettext does not exist.) .UNINDENT .TP .B Returns Translation object to get \fBgettext\fP methods from .UNINDENT .sp If you need more flexibility than \fI\%easy_gettext_setup()\fP, use this function. It sets up a \fBgettext\fP Translation object and returns it to you. Then you can access any of the methods of the object that you need directly.
For instance, if you specifically need to access \fBlgettext()\fP: .INDENT 7.0 .INDENT 3.5 .sp .nf .ft C translations = get_translation_object(\(aqfoo\(aq) translations.lgettext(\(aqMy Message\(aq) .ft P .fi .UNINDENT .UNINDENT .sp This function is similar to the \fI\%python standard library\fP \fBgettext.translation()\fP but makes it better in two ways .INDENT 7.0 .IP 1. 3 .INDENT 3.0 .TP .B It returns \fI\%NewGNUTranslations\fP or \fI\%DummyTranslations\fP objects by default. These are superior to the \fBgettext.GNUTranslations\fP and \fBgettext.NullTranslations\fP objects because they are consistent in the string type they return and they fix several issues that can cause the \fI\%python standard library\fP objects to throw \fBUnicodeError\fP\&. .UNINDENT .IP 2. 3 .INDENT 3.0 .TP .B This function takes multiple directories to search for message catalogs\&. .UNINDENT .UNINDENT .sp The latter is important when setting up \fBgettext\fP in a portable manner. There is not a common directory for translations across operating systems so one needs to look in multiple directories for the translations. \fI\%get_translation_object()\fP is able to handle that if you give it a list of directories to search for catalogs: .INDENT 7.0 .INDENT 3.5 .sp .nf .ft C translations = get_translation_object(\(aqfoo\(aq, localedirs=( os.path.join(os.path.realpath(os.path.dirname(__file__)), \(aqlocale\(aq), os.path.join(sys.prefix, \(aqlib\(aq, \(aqlocale\(aq))) .ft P .fi .UNINDENT .UNINDENT .sp This will search for several different directories: .INDENT 7.0 .IP 1. 3 A directory named \fBlocale\fP in the same directory as the module that called \fI\%get_translation_object()\fP, .IP 2. 3 In \fB/usr/lib/locale\fP .IP 3. 
3 In \fB/usr/share/locale\fP (the fallback directory) .UNINDENT .sp This allows \fBgettext\fP to work on Windows and in development (where the message catalogs are typically in the toplevel module directory) and also when installed under Linux (where the message catalogs are installed in \fB/usr/share/locale\fP). You (or the system packager) just need to install the message catalogs in \fB/usr/share/locale\fP and remove the \fBlocale\fP directory from the module to make this work. ie: .INDENT 7.0 .INDENT 3.5 .sp .nf .ft C In development: ~/foo # Toplevel module directory ~/foo/__init__.py ~/foo/locale # With message catalogs below here: ~/foo/locale/es/LC_MESSAGES/foo.mo Installed on Linux: /usr/lib/python2.7/site\-packages/foo /usr/lib/python2.7/site\-packages/foo/__init__.py /usr/share/locale/ # With message catalogs below here: /usr/share/locale/es/LC_MESSAGES/foo.mo .ft P .fi .UNINDENT .UNINDENT .sp \fBNOTE:\fP .INDENT 7.0 .INDENT 3.5 This function will setup Translation objects that attempt to lookup msgids in all of the found message catalogs\&. This means if you have several versions of the message catalogs installed in different directories that the function searches, you need to make sure that \fBlocaledirs\fP specifies the directories so that newer message catalogs are searched first. It also means that if a newer catalog does not contain a translation for a msgid but an older one that\(aqs in \fBlocaledirs\fP does, the translation from that older catalog will be returned. .UNINDENT .UNINDENT .sp Changed in version kitchen\-1.1.0: ; API kitchen.i18n 2.1.0 Add more parameters to \fI\%get_translation_object()\fP so it can more easily be used as a replacement for \fBgettext.translation()\fP\&. Also change the way we use localedirs. We cycle through them until we find a suitable locale file rather than simply cycling through until we find a directory that exists. 
The new code is based heavily on the \fI\%python standard library\fP \fBgettext.translation()\fP function. .sp Changed in version kitchen\-1.2.0: ; API kitchen.i18n 2.2.0 Add python2_api parameter .UNINDENT .SS Translation Objects .sp The standard translation objects from the \fBgettext\fP module suffer from several problems: .INDENT 0.0 .IP \(bu 2 They can throw \fBUnicodeError\fP .IP \(bu 2 They can\(aqt find translations for non\-ASCII byte \fBstr\fP messages .IP \(bu 2 They may return either \fBunicode\fP string or byte \fBstr\fP from the same function even though the functions say they will only return \fBunicode\fP or only return byte \fBstr\fP\&. .UNINDENT .sp \fI\%DummyTranslations\fP and \fI\%NewGNUTranslations\fP were written to fix these issues. .INDENT 0.0 .TP .B class kitchen.i18n.DummyTranslations(fp=None, python2_api=True) Safer version of \fBgettext.NullTranslations\fP .sp This Translations class doesn\(aqt translate the strings and is intended to be used as a fallback when there were errors setting up a real Translations object. It\(aqs safer than \fBgettext.NullTranslations\fP in its handling of byte \fBstr\fP vs \fBunicode\fP strings. .sp Unlike \fBNullTranslations\fP, this Translation class will never throw a \fBUnicodeError\fP\&. The code that you have around a call to \fI\%DummyTranslations\fP might throw a \fBUnicodeError\fP but at least that will be in code you control and can fix. Also, unlike \fBNullTranslations\fP all of this Translation object\(aqs methods guarantee to return byte \fBstr\fP except for \fBugettext()\fP and \fBungettext()\fP which guarantee to return \fBunicode\fP strings. .sp When byte \fBstr\fP are returned, the strings will be encoded according to this algorithm: .INDENT 7.0 .IP 1. 3 If a fallback has been added, the fallback will be called first. You\(aqll need to consult the fallback to see whether it performs any encoding changes. .IP 2. 3 If a byte \fBstr\fP was given, the same byte \fBstr\fP will be returned. 
.IP 3. 3 If a \fBunicode\fP string was given and \fI\%set_output_charset()\fP has been called then we encode the string using the \fBoutput_charset\fP .IP 4. 3 If a \fBunicode\fP string was given and this is \fBgettext()\fP or \fBngettext()\fP and \fB_charset\fP was set output in that charset. .IP 5. 3 If a \fBunicode\fP string was given and this is \fBgettext()\fP or \fBngettext()\fP we encode it using \(aqutf\-8\(aq. .IP 6. 3 If a \fBunicode\fP string was given and this is \fBlgettext()\fP or \fBlngettext()\fP we encode using the value of \fBlocale.getpreferredencoding()\fP .UNINDENT .sp For \fBugettext()\fP and \fBungettext()\fP, we go through the same set of steps with the following differences: .INDENT 7.0 .IP \(bu 2 We transform byte \fBstr\fP into \fBunicode\fP strings for these methods. .IP \(bu 2 The encoding used to decode the byte \fBstr\fP is taken from \fI\%input_charset\fP if it\(aqs set, otherwise we decode using UTF\-8\&. .UNINDENT .INDENT 7.0 .TP .B input_charset is an extension to the \fI\%python standard library\fP \fBgettext\fP that specifies what charset a message is encoded in when decoding a message to \fBunicode\fP\&. This is used for two purposes: .UNINDENT .INDENT 7.0 .IP 1. 3 If the message string is a byte \fBstr\fP, this is used to decode the string to a \fBunicode\fP string before looking it up in the message catalog\&. .IP 2. 3 In \fBugettext()\fP and \fBungettext()\fP methods, if a byte \fBstr\fP is given as the message and is untranslated this is used as the encoding when decoding to \fBunicode\fP\&. This is different from \fB_charset\fP which may be set when a message catalog is loaded because \fI\%input_charset\fP is used to describe an encoding used in a python source file while \fB_charset\fP describes the encoding used in the message catalog file. 
.UNINDENT .sp Any characters that aren\(aqt able to be transformed from a byte \fBstr\fP to \fBunicode\fP string or vice versa will be replaced with a replacement character (ie: \fBu\(aq�\(aq\fP in unicode based encodings, \fB\(aq?\(aq\fP in other ASCII compatible encodings). .sp \fBSEE ALSO:\fP .INDENT 7.0 .INDENT 3.5 .INDENT 0.0 .TP .B \fBgettext.NullTranslations\fP For information about what methods are available and what they do. .UNINDENT .UNINDENT .UNINDENT .sp Changed in version kitchen\-1.1.0: ; API kitchen.i18n 2.1.0 * Although we had adapted \fBgettext()\fP, \fBngettext()\fP, \fBlgettext()\fP, and \fBlngettext()\fP to always return byte \fBstr\fP, we hadn\(aqt forced those byte \fBstr\fP to always be in a specified charset. We now make sure that \fBgettext()\fP and \fBngettext()\fP return byte \fBstr\fP encoded using \fBoutput_charset\fP if set, otherwise \fBcharset\fP and if neither of those, UTF\-8\&. With \fBlgettext()\fP and \fBlngettext()\fP \fBoutput_charset\fP if set, otherwise \fBlocale.getpreferredencoding()\fP\&. * Make setting \fI\%input_charset\fP and \fBoutput_charset\fP also set those attributes on any fallback translation objects. .sp Changed in version kitchen\-1.2.0: ; API kitchen.i18n 2.2.0 Add python2_api parameter to __init__() .INDENT 7.0 .TP .B set_output_charset(charset) Set the output charset .sp This serves two purposes. The normal \fBgettext.NullTranslations.set_output_charset()\fP does not set the output on fallback objects. On python\-2.3, \fBgettext.NullTranslations\fP objects don\(aqt contain this method. .UNINDENT .UNINDENT .INDENT 0.0 .TP .B class kitchen.i18n.NewGNUTranslations(fp=None, python2_api=True) Safer version of \fBgettext.GNUTranslations\fP .sp \fBgettext.GNUTranslations\fP suffers from two problems that this class fixes. .INDENT 7.0 .IP 1. 
3 \fBgettext.GNUTranslations\fP can throw a \fBUnicodeError\fP in \fBgettext.GNUTranslations.ugettext()\fP if the message being translated has non\-ASCII characters and there is no translation for it. .IP 2. 3 \fBgettext.GNUTranslations\fP can return byte \fBstr\fP from \fBgettext.GNUTranslations.ugettext()\fP and \fBunicode\fP strings from the other \fBgettext()\fP methods if the message being translated is the wrong type .UNINDENT .sp When byte \fBstr\fP are returned, the strings will be encoded according to this algorithm: .INDENT 7.0 .IP 1. 3 If a fallback has been added, the fallback will be called first. You\(aqll need to consult the fallback to see whether it performs any encoding changes. .IP 2. 3 If a byte \fBstr\fP was given, the same byte \fBstr\fP will be returned. .IP 3. 3 If a \fBunicode\fP string was given and \fBset_output_charset()\fP has been called then we encode the string using the \fBoutput_charset\fP .IP 4. 3 If a \fBunicode\fP string was given and this is \fBgettext()\fP or \fBngettext()\fP and a charset was detected when parsing the message catalog, output in that charset. .IP 5. 3 If a \fBunicode\fP string was given and this is \fBgettext()\fP or \fBngettext()\fP we encode it using UTF\-8\&. .IP 6. 3 If a \fBunicode\fP string was given and this is \fBlgettext()\fP or \fBlngettext()\fP we encode using the value of \fBlocale.getpreferredencoding()\fP .UNINDENT .sp For \fBugettext()\fP and \fBungettext()\fP, we go through the same set of steps with the following differences: .INDENT 7.0 .IP \(bu 2 We transform byte \fBstr\fP into \fBunicode\fP strings for these methods. .IP \(bu 2 The encoding used to decode the byte \fBstr\fP is taken from \fI\%input_charset\fP if it\(aqs set, otherwise we decode using UTF\-8 .UNINDENT .INDENT 7.0 .TP .B input_charset an extension to the \fI\%python standard library\fP \fBgettext\fP that specifies what charset a message is encoded in when decoding a message to \fBunicode\fP\&. 
This is used for two purposes: .UNINDENT .INDENT 7.0 .IP 1. 3 If the message string is a byte \fBstr\fP, this is used to decode the string to a \fBunicode\fP string before looking it up in the message catalog\&. .IP 2. 3 In \fBugettext()\fP and \fBungettext()\fP methods, if a byte \fBstr\fP is given as the message and is untranslated, this is used as the encoding when decoding to \fBunicode\fP\&. This is different from the \fB_charset\fP parameter that may be set when a message catalog is loaded because \fI\%input_charset\fP is used to describe an encoding used in a python source file while \fB_charset\fP describes the encoding used in the message catalog file. .UNINDENT .sp Any characters that aren\(aqt able to be transformed from a byte \fBstr\fP to \fBunicode\fP string or vice versa will be replaced with a replacement character (ie: \fBu\(aq�\(aq\fP in unicode based encodings, \fB\(aq?\(aq\fP in other ASCII compatible encodings). .sp \fBSEE ALSO:\fP .INDENT 7.0 .INDENT 3.5 .INDENT 0.0 .TP .B \fBgettext.GNUTranslations.gettext\fP For information about what methods this class has and what they do .UNINDENT .UNINDENT .UNINDENT .sp Changed in version kitchen\-1.1.0: ; API kitchen.i18n 2.1.0 Although we had adapted \fBgettext()\fP, \fBngettext()\fP, \fBlgettext()\fP, and \fBlngettext()\fP to always return byte \fBstr\fP, we hadn\(aqt forced those byte \fBstr\fP to always be in a specified charset. We now make sure that \fBgettext()\fP and \fBngettext()\fP return byte \fBstr\fP encoded using \fBoutput_charset\fP if set, otherwise \fBcharset\fP and if neither of those, UTF\-8\&. With \fBlgettext()\fP and \fBlngettext()\fP we use \fBoutput_charset\fP if set, otherwise \fBlocale.getpreferredencoding()\fP\&. .UNINDENT .SS Kitchen.text: unicode and utf8 and xml oh my! .sp The kitchen.text module contains functions that deal with text manipulation. .SS Kitchen.text.converters .sp Functions to handle conversion of byte \fBstr\fP and \fBunicode\fP strings.
.sp Changed in version kitchen: 0.2a2 ; API kitchen.text 2.0.0 Added \fI\%getwriter()\fP .sp Changed in version kitchen: 0.2.2 ; API kitchen.text 2.1.0 Added \fI\%exception_to_unicode()\fP, \fI\%exception_to_bytes()\fP, \fI\%EXCEPTION_CONVERTERS\fP, and \fI\%BYTE_EXCEPTION_CONVERTERS\fP .sp Changed in version kitchen: 1.0.1 ; API kitchen.text 2.1.1 Deprecated \fI\%BYTE_EXCEPTION_CONVERTERS\fP as we\(aqve simplified \fI\%exception_to_unicode()\fP and \fI\%exception_to_bytes()\fP to make it unnecessary .SS Byte Strings and Unicode in Python2 .sp Python2 has two string types, \fBstr\fP and \fBunicode\fP\&. \fBunicode\fP represents an abstract sequence of text characters. It can hold any character that is present in the unicode standard. \fBstr\fP can hold any byte of data. The operating system and python work together to display these bytes as characters in many cases but you should always keep in mind that the information is really a sequence of bytes, not a sequence of characters. In python2 these types are interchangeable a large amount of the time. They are one of the few pairs of types that automatically convert when used in equality: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C >>> # string is converted to unicode and then compared >>> "I am a string" == u"I am a string" True >>> # Other types, like int, don\(aqt have this special treatment >>> 5 == "5" False .ft P .fi .UNINDENT .UNINDENT .sp However, this automatic conversion tends to lull people into a false sense of security. As long as you\(aqre dealing with ASCII characters the automatic conversion will save you from seeing any differences. 
Once you start using characters that are not in ASCII, you will start getting \fBUnicodeError\fP and \fBUnicodeWarning\fP as the automatic conversions between the types fail: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C >>> "I am an ñ" == u"I am an ñ" __main__:1: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode \- interpreting them as being unequal False .ft P .fi .UNINDENT .UNINDENT .sp Why do these conversions fail? The reason is that the python2 \fBunicode\fP type represents an abstract sequence of unicode text known as code points\&. \fBstr\fP, on the other hand, really represents a sequence of bytes. Those bytes are converted by your operating system to appear as characters on your screen using a particular encoding (usually with a default defined by the operating system and customizable by the individual user.) Although ASCII characters are fairly standard in what bytes represent each character, the bytes outside of the ASCII range are not. In general, each encoding will map a different character to a particular byte. Newer encodings map individual characters to multiple bytes (which the older encodings will instead treat as multiple characters). In the face of these differences, python refuses to guess at an encoding and instead issues a warning or exception and refuses to convert. .sp \fBSEE ALSO:\fP .INDENT 0.0 .INDENT 3.5 .INDENT 0.0 .TP .B overcoming\-frustration For a longer introduction on this subject. .UNINDENT .UNINDENT .UNINDENT .SS Strategy for Explicit Conversion .sp So what is the best method of dealing with this weltering babble of incoherent encodings? The basic strategy is to explicitly turn everything into \fBunicode\fP when it first enters your program. Then, when you send it to output, you can transform the unicode back into bytes. Doing this allows you to control the encodings that are used and avoid getting tracebacks due to \fBUnicodeError\fP\&. 
Using the functions defined in this module, that looks something like this: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C >>> from kitchen.text.converters import to_unicode, to_bytes >>> name = raw_input(\(aqEnter your name: \(aq) Enter your name: Toshio くらとみ >>> name \(aqToshio \exe3\ex81\ex8f\exe3\ex82\ex89\exe3\ex81\exa8\exe3\ex81\exbf\(aq >>> type(name) <type \(aqstr\(aq> >>> unicode_name = to_unicode(name) >>> type(unicode_name) <type \(aqunicode\(aq> >>> unicode_name u\(aqToshio \eu304f\eu3089\eu3068\eu307f\(aq >>> # Do a lot of other things before needing to save/output again: >>> output = open(\(aqdatafile\(aq, \(aqw\(aq) >>> output.write(to_bytes(u\(aqName: %s\en\(aq % unicode_name)) .ft P .fi .UNINDENT .UNINDENT .sp A few notes: .sp Looking at line 6, you\(aqll notice that the input we took from the user was a byte \fBstr\fP\&. In general, anytime we\(aqre getting a value from outside of python (The filesystem, reading data from the network, interacting with an external command, reading values from the environment) we are interacting with something that will want to give us a byte \fBstr\fP\&. Some \fI\%python standard library\fP modules and third party libraries will automatically attempt to convert a byte \fBstr\fP to \fBunicode\fP strings for you. This is both a boon and a curse. If the library can guess correctly about the encoding that the data is in, it will return \fBunicode\fP objects to you without you having to convert. However, if it can\(aqt guess correctly, you may end up with one of several problems: .INDENT 0.0 .TP .B \fBUnicodeError\fP The library attempted to decode a byte \fBstr\fP into a \fBunicode\fP string, failed, and raised an exception. .TP .B Garbled data If the library returns the data after decoding it with the wrong encoding, the characters you see in the \fBunicode\fP string won\(aqt be the ones that you expect.
.TP .B A byte \fBstr\fP instead of \fBunicode\fP string Some libraries will return a \fBunicode\fP string when they\(aqre able to decode the data and a byte \fBstr\fP when they can\(aqt. This is generally the hardest problem to debug when it occurs. Avoid it in your own code and avoid, or open bugs against, upstream libraries that do this. See DesigningUnicodeAwareAPIs for strategies to do this properly. .UNINDENT .sp On line 8, we convert from a byte \fBstr\fP to a \fBunicode\fP string. \fI\%to_unicode()\fP does this for us. It has some error handling and sane defaults that make this a nicer function to use than calling \fBstr.decode()\fP directly: .INDENT 0.0 .IP \(bu 2 Instead of defaulting to the ASCII encoding which fails with all but the simple American English characters, it defaults to UTF\-8\&. .IP \(bu 2 Instead of raising an error if it cannot decode a value, it will replace the value with the unicode "Replacement character" symbol (\fB�\fP). .IP \(bu 2 If you happen to call this method with something that is not a \fBstr\fP or \fBunicode\fP, it will return an empty \fBunicode\fP string. .UNINDENT .sp All three of these can be overridden using different keyword arguments to the function. See the \fI\%to_unicode()\fP documentation for more information. .sp On line 15 we push the data back out to a file. Two things you should note here: .INDENT 0.0 .IP 1. 3 We deal with the strings as \fBunicode\fP until the last instant. The string format that we\(aqre using is \fBunicode\fP and the variable also holds \fBunicode\fP\&. People sometimes get into trouble when they mix a byte \fBstr\fP format with a variable that holds a \fBunicode\fP string (or vice versa) at this stage. .IP 2. 3 \fI\%to_bytes()\fP does the reverse of \fI\%to_unicode()\fP\&. In this case, we\(aqre using the default values which turn \fBunicode\fP into a byte \fBstr\fP using UTF\-8\&. Any errors are replaced with a \fB�\fP and sending a nonstring object yields an empty byte \fBstr\fP\&.
Just like \fI\%to_unicode()\fP, you can look at the documentation for \fI\%to_bytes()\fP to find out how to override any of these defaults. .UNINDENT .SS When to use an alternate strategy .sp The default strategy of decoding to \fBunicode\fP strings when you take data in and encoding to a byte \fBstr\fP when you send the data back out works great for most problems but there are a few times when you shouldn\(aqt: .INDENT 0.0 .IP \(bu 2 The values aren\(aqt meant to be read as text .IP \(bu 2 The values need to be byte\-for\-byte when you send them back out \-\- for instance if they are database keys or filenames. .IP \(bu 2 You are transferring the data between several libraries that all expect byte \fBstr\fP\&. .UNINDENT .sp In each of these instances, there is a reason to keep around the byte \fBstr\fP version of a value. Here\(aqs a few hints to keep your sanity in these situations: .INDENT 0.0 .IP 1. 3 Keep your \fBunicode\fP and \fBstr\fP values separate. Just like the pain caused when you have to use someone else\(aqs library that returns both \fBunicode\fP and \fBstr\fP you can cause yourself pain if you have functions that can return both types or variables that could hold either type of value. .IP 2. 3 Name your variables so that you can tell whether you\(aqre storing byte \fBstr\fP or \fBunicode\fP string. One of the first things you end up having to do when debugging is determine what type of string you have in a variable and what type of string you are expecting. Naming your variables consistently so that you can tell which type they are supposed to hold will save you from at least one of those steps. .IP 3. 3 When you get values initially, make sure that you\(aqre dealing with the type of value that you expect as you save it. You can use \fBisinstance()\fP or \fI\%to_bytes()\fP since \fI\%to_bytes()\fP doesn\(aqt do any modifications of the string if it\(aqs already a \fBstr\fP\&. 
When using \fI\%to_bytes()\fP for this purpose you might want to use: .INDENT 3.0 .INDENT 3.5 .sp .nf .ft C try: b_input = to_bytes(input_should_be_bytes_already, errors=\(aqstrict\(aq, nonstring=\(aqstrict\(aq) except Exception: handle_errors_somehow() .ft P .fi .UNINDENT .UNINDENT .sp The reason is that the default of \fI\%to_bytes()\fP will take characters that are illegal in the chosen encoding and transform them to replacement characters. Since the point of keeping this data as a byte \fBstr\fP is to keep the exact same bytes when you send it outside of your code, changing things to replacement characters should be raising red flags that something is wrong. Setting \fBerrors\fP to \fBstrict\fP will raise an exception which gives you an opportunity to fail gracefully. .IP 4. 3 Sometimes you will want to print out the values that you have in your byte \fBstr\fP\&. When you do this you will need to make sure that you transform \fBunicode\fP to \fBstr\fP before combining them. Also be sure that any other function calls (including \fBgettext\fP) are going to give you strings that are the same type. For instance: .INDENT 3.0 .INDENT 3.5 .sp .nf .ft C print to_bytes(_(\(aqUsername: %(user)s\(aq), \(aqutf\-8\(aq) % {\(aquser\(aq: b_username} .ft P .fi .UNINDENT .UNINDENT .UNINDENT .SS Gotchas and how to avoid them .sp Even when you have a good conceptual understanding of how python2 treats \fBunicode\fP and \fBstr\fP there are still some things that can surprise you. In most cases this is because, as noted earlier, python or one of the python libraries you depend on is trying to convert a value automatically and failing. Explicit conversion at the appropriate place usually solves that. .SS str(obj) .sp One common idiom for getting a simple, string representation of an object is to use: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C str(obj) .ft P .fi .UNINDENT .UNINDENT .sp Unfortunately, this is not safe. Sometimes str(obj) will return \fBunicode\fP\&.
Sometimes it will return a byte \fBstr\fP\&. Sometimes, it will attempt to convert from a \fBunicode\fP string to a byte \fBstr\fP, fail, and throw a \fBUnicodeError\fP\&. To be safe from all of these, first decide whether you need \fBunicode\fP or \fBstr\fP to be returned. Then use \fI\%to_unicode()\fP or \fI\%to_bytes()\fP to get the simple representation like this: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C u_representation = to_unicode(obj, nonstring=\(aqsimplerepr\(aq) b_representation = to_bytes(obj, nonstring=\(aqsimplerepr\(aq) .ft P .fi .UNINDENT .UNINDENT .SS print .sp python has a builtin \fBprint()\fP statement that outputs strings to the terminal. This originated in a time when python only dealt with byte \fBstr\fP\&. When \fBunicode\fP strings came about, some enhancements were made to the \fBprint()\fP statement so that it could print those as well. The enhancements make \fBprint()\fP work most of the time. However, the times when it doesn\(aqt work tend to make for cryptic debugging. .sp The basic issue is that \fBprint()\fP has to figure out what encoding to use when it prints a \fBunicode\fP string to the terminal. When python is attached to your terminal (ie, you\(aqre running the interpreter or running a script that prints to the screen) python is able to take the encoding value from your locale settings \fBLC_ALL\fP or \fBLC_CTYPE\fP and print the characters allowed by that encoding. On most modern Unix systems, the encoding is utf\-8 which means that you can print any \fBunicode\fP character without problem. .sp There are two common cases of things going wrong: .INDENT 0.0 .IP 1. 3 Someone has a locale set that does not accept all valid unicode characters. 
For instance: .INDENT 3.0 .INDENT 3.5 .sp .nf .ft C $ LC_ALL=C python >>> print u\(aq\eufffd\(aq Traceback (most recent call last): File "<stdin>", line 1, in <module> UnicodeEncodeError: \(aqascii\(aq codec can\(aqt encode character u\(aq\eufffd\(aq in position 0: ordinal not in range(128) .ft P .fi .UNINDENT .UNINDENT .sp This often happens when a script that you\(aqve written and debugged from the terminal is run from an automated environment like \fBcron\fP\&. It also occurs when you have written a script using a utf\-8 aware locale and released it for consumption by people all over the internet. Inevitably, someone is running with a locale that can\(aqt handle all unicode characters and you get a traceback reported. .IP 2. 3 You redirect output to a file. Python isn\(aqt using the values in \fBLC_ALL\fP unconditionally to decide what encoding to use. Instead it is using the encoding set for the terminal you are printing to which is set to accept different encodings by \fBLC_ALL\fP\&. If you redirect to a file, you are no longer printing to the terminal so \fBLC_ALL\fP won\(aqt have any effect. At this point, python will decide it can\(aqt find an encoding and fall back to ASCII which will likely lead to \fBUnicodeError\fP being raised. You can see this in a short script: .INDENT 3.0 .INDENT 3.5 .sp .nf .ft C #! /usr/bin/python \-tt print u\(aq\eufffd\(aq .ft P .fi .UNINDENT .UNINDENT .sp And then look at the difference between running it normally and redirecting to a file: .INDENT 3.0 .INDENT 3.5 .sp .nf .ft C $ ./test.py � $ ./test.py > t Traceback (most recent call last): File "test.py", line 3, in <module> print u\(aq\eufffd\(aq UnicodeEncodeError: \(aqascii\(aq codec can\(aqt encode character u\(aq\eufffd\(aq in position 0: ordinal not in range(128) .ft P .fi .UNINDENT .UNINDENT .UNINDENT .sp The short answer to dealing with this is to always use bytes when writing output.
You can do this by explicitly converting to bytes like this: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C from kitchen.text.converters import to_bytes u_string = u\(aq\eufffd\(aq print to_bytes(u_string) .ft P .fi .UNINDENT .UNINDENT .sp or you can wrap stdout and stderr with a \fBStreamWriter\fP\&. A \fBStreamWriter\fP is convenient in that you can assign it to encode for \fBsys.stdout\fP or \fBsys.stderr\fP and then have output automatically converted but it has the drawback of still being able to throw \fBUnicodeError\fP if the writer can\(aqt encode all possible unicode codepoints. Kitchen provides an alternate version which can be retrieved with \fI\%kitchen.text.converters.getwriter()\fP which will not raise a traceback in its standard configuration. .SS Unicode, str, and dict keys .sp The \fBhash()\fP of the ASCII characters is the same for \fBunicode\fP and byte \fBstr\fP\&. When you use them in \fBdict\fP keys, they evaluate to the same dictionary slot: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C >>> u_string = u\(aqa\(aq >>> b_string = \(aqa\(aq >>> hash(u_string), hash(b_string) (12416037344, 12416037344) >>> d = {} >>> d[u_string] = \(aqunicode\(aq >>> d[b_string] = \(aqbytes\(aq >>> d {u\(aqa\(aq: \(aqbytes\(aq} .ft P .fi .UNINDENT .UNINDENT .sp When you deal with key values outside of ASCII, \fBunicode\fP and byte \fBstr\fP evaluate unequally no matter what their character content or hash value: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C >>> u_string = u\(aqñ\(aq >>> b_string = u_string.encode(\(aqutf\-8\(aq) >>> print u_string ñ >>> print b_string ñ >>> d = {} >>> d[u_string] = \(aqunicode\(aq >>> d[b_string] = \(aqbytes\(aq >>> d {u\(aq\exf1\(aq: \(aqunicode\(aq, \(aq\exc3\exb1\(aq: \(aqbytes\(aq} >>> b_string2 = \(aq\exf1\(aq >>> hash(u_string), hash(b_string2) (30848092528, 30848092528) >>> d = {} >>> d[u_string] = \(aqunicode\(aq >>> d[b_string2] = \(aqbytes\(aq >>> d {u\(aq\exf1\(aq: \(aqunicode\(aq, \(aq\exf1\(aq: \(aqbytes\(aq} .ft P .fi .UNINDENT .UNINDENT .sp 
How do you work with this one? Remember rule #1: Keep your \fBunicode\fP and byte \fBstr\fP values separate. That goes for keys in a dictionary just like anything else. .INDENT 0.0 .IP \(bu 2 For any given dictionary, make sure that all your keys are either \fBunicode\fP or \fBstr\fP\&. \fBDo not mix the two.\fP If you\(aqre being given both \fBunicode\fP and \fBstr\fP but you don\(aqt need to preserve separate keys for each, I recommend using \fI\%to_unicode()\fP or \fI\%to_bytes()\fP to convert all keys to one type or the other like this: .INDENT 2.0 .INDENT 3.5 .sp .nf .ft C >>> from kitchen.text.converters import to_unicode >>> u_string = u\(aqone\(aq >>> b_string = \(aqtwo\(aq >>> d = {} >>> d[to_unicode(u_string)] = 1 >>> d[to_unicode(b_string)] = 2 >>> d {u\(aqtwo\(aq: 2, u\(aqone\(aq: 1} .ft P .fi .UNINDENT .UNINDENT .IP \(bu 2 These issues also apply to using dicts with tuple keys that contain a mixture of \fBunicode\fP and \fBstr\fP\&. Once again the best fix is to standardise on either \fBstr\fP or \fBunicode\fP\&. .IP \(bu 2 If you absolutely need to store values in a dictionary where the keys could be either \fBunicode\fP or \fBstr\fP you can use \fBStrictDict\fP which has separate entries for all \fBunicode\fP and byte \fBstr\fP and deals correctly with any \fBtuple\fP containing mixed \fBunicode\fP and byte \fBstr\fP\&. .UNINDENT .SS Functions .SS Unicode and byte str conversion .INDENT 0.0 .TP .B kitchen.text.converters.to_unicode(obj, encoding=\(aqutf\-8\(aq, errors=\(aqreplace\(aq, nonstring=None, non_string=None) Convert an object into a \fBunicode\fP string .INDENT 7.0 .TP .B Parameters .INDENT 7.0 .IP \(bu 2 \fBobj\fP \-\- Object to convert to a \fBunicode\fP string. This should normally be a byte \fBstr\fP .IP \(bu 2 \fBencoding\fP \-\- What encoding to try converting the byte \fBstr\fP as. Defaults to utf\-8 .IP \(bu 2 \fBerrors\fP \-\- If errors are found while decoding, perform this action. 
Defaults to \fBreplace\fP which replaces the invalid bytes with a character that means the bytes were unable to be decoded. Other values are the same as the error handling schemes in the \fI\%codec base classes\fP\&. For instance \fBstrict\fP which raises an exception and \fBignore\fP which simply omits the non\-decodable characters. .IP \(bu 2 \fBnonstring\fP \-\- .sp How to treat nonstring values. Possible values are: .INDENT 2.0 .TP .B simplerepr Attempt to call the object\(aqs "simple representation" method and return that value. Python\-2.3+ has two methods that try to return a simple representation: \fBobject.__unicode__()\fP and \fBobject.__str__()\fP\&. We first try to get a usable value from \fBobject.__unicode__()\fP\&. If that fails we try the same with \fBobject.__str__()\fP\&. .TP .B empty Return an empty \fBunicode\fP string .TP .B strict Raise a \fBTypeError\fP .TP .B passthru Return the object unchanged .TP .B repr Attempt to return a \fBunicode\fP string of the repr of the object .UNINDENT .sp Default is \fBsimplerepr\fP .IP \(bu 2 \fBnon_string\fP \-\- \fIDeprecated\fP Use \fBnonstring\fP instead .UNINDENT .TP .B Raises .INDENT 7.0 .IP \(bu 2 \fBTypeError\fP \-\- if \fBnonstring\fP is \fBstrict\fP and a non\-\fBbasestring\fP object is passed in or if \fBnonstring\fP is set to an unknown value .IP \(bu 2 \fBUnicodeDecodeError\fP \-\- if \fBerrors\fP is \fBstrict\fP and \fBobj\fP is not decodable using the given encoding .UNINDENT .TP .B Returns \fBunicode\fP string or the original object depending on the value of \fBnonstring\fP\&. .UNINDENT .sp Usually this should be used on a byte \fBstr\fP but it can take both byte \fBstr\fP and \fBunicode\fP strings intelligently. Nonstring objects are handled in different ways depending on the setting of the \fBnonstring\fP parameter. 
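.sp
As a minimal sketch of what the \fBerrors\fP values mean, the following uses only the standard library decoder rather than \fBto_unicode()\fP itself. It is written for Python 3, where \fBbytes.decode()\fP accepts the same error handler names. The sample bytes and variable names are illustrative assumptions.

```python
# Illustration of the codec error handlers named above, using only the
# standard library (this is not kitchen's own code).
data = b'caf\xe9'  # latin-1 encoded bytes, which are not valid utf-8

# 'replace' substitutes U+FFFD for the undecodable byte
replaced = data.decode('utf-8', 'replace')

# 'ignore' silently drops the undecodable byte
ignored = data.decode('utf-8', 'ignore')

# 'strict' raises UnicodeDecodeError instead of guessing
try:
    data.decode('utf-8', 'strict')
    strict_raised = False
except UnicodeDecodeError:
    strict_raised = True

print(repr(replaced), repr(ignored), strict_raised)
```

With \fBreplace\fP or \fBignore\fP the call always succeeds, but the result no longer matches the original text, which is why the choice of handler should follow from what the data is used for.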
.sp The default values of this function are set so as to always return a \fBunicode\fP string and never raise an error when converting from a byte \fBstr\fP to a \fBunicode\fP string. However, when you do not pass validly encoded text (or a nonstring object), you may end up with output that you don\(aqt expect. Be sure you understand the requirements of your data, not just ignore errors by passing it through this function. .sp Changed in version 0.2.1a2: Deprecated \fBnon_string\fP in favor of \fBnonstring\fP parameter and changed default value to \fBsimplerepr\fP .UNINDENT .INDENT 0.0 .TP .B kitchen.text.converters.to_bytes(obj, encoding=\(aqutf\-8\(aq, errors=\(aqreplace\(aq, nonstring=None, non_string=None) Convert an object into a byte \fBstr\fP .INDENT 7.0 .TP .B Parameters .INDENT 7.0 .IP \(bu 2 \fBobj\fP \-\- Object to convert to a byte \fBstr\fP\&. This should normally be a \fBunicode\fP string. .IP \(bu 2 \fBencoding\fP \-\- Encoding to use to convert the \fBunicode\fP string into a byte \fBstr\fP\&. Defaults to utf\-8\&. .IP \(bu 2 \fBerrors\fP \-\- .sp If errors are found while encoding, perform this action. Defaults to \fBreplace\fP which replaces the invalid bytes with a character that means the bytes were unable to be encoded. Other values are the same as the error handling schemes in the \fI\%codec base classes\fP\&. For instance \fBstrict\fP which raises an exception and \fBignore\fP which simply omits the non\-encodable characters. .IP \(bu 2 \fBnonstring\fP \-\- .sp How to treat nonstring values. Possible values are: .INDENT 2.0 .TP .B simplerepr Attempt to call the object\(aqs "simple representation" method and return that value. Python\-2.3+ has two methods that try to return a simple representation: \fBobject.__unicode__()\fP and \fBobject.__str__()\fP\&. We first try to get a usable value from \fBobject.__str__()\fP\&. If that fails we try the same with \fBobject.__unicode__()\fP\&. 
.TP .B empty Return an empty byte \fBstr\fP .TP .B strict Raise a \fBTypeError\fP .TP .B passthru Return the object unchanged .TP .B repr Attempt to return a byte \fBstr\fP of the \fBrepr()\fP of the object .UNINDENT .sp Default is \fBsimplerepr\fP\&. .IP \(bu 2 \fBnon_string\fP \-\- \fIDeprecated\fP Use \fBnonstring\fP instead. .UNINDENT .TP .B Raises .INDENT 7.0 .IP \(bu 2 \fBTypeError\fP \-\- if \fBnonstring\fP is \fBstrict\fP and a non\-\fBbasestring\fP object is passed in or if \fBnonstring\fP is set to an unknown value. .IP \(bu 2 \fBUnicodeEncodeError\fP \-\- if \fBerrors\fP is \fBstrict\fP and all of the bytes of \fBobj\fP are unable to be encoded using \fBencoding\fP\&. .UNINDENT .TP .B Returns byte \fBstr\fP or the original object depending on the value of \fBnonstring\fP\&. .UNINDENT .sp \fBWARNING:\fP .INDENT 7.0 .INDENT 3.5 If you pass a byte \fBstr\fP into this function the byte \fBstr\fP is returned unmodified. It is \fBnot\fP re\-encoded with the specified \fBencoding\fP\&. The easiest way to achieve that is: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C to_bytes(to_unicode(text), encoding=\(aqutf\-8\(aq) .ft P .fi .UNINDENT .UNINDENT .sp The initial \fI\%to_unicode()\fP call will ensure text is a \fBunicode\fP string. Then, \fI\%to_bytes()\fP will turn that into a byte \fBstr\fP with the specified encoding. .UNINDENT .UNINDENT .sp Usually, this should be used on a \fBunicode\fP string but it can take either a byte \fBstr\fP or a \fBunicode\fP string intelligently. Nonstring objects are handled in different ways depending on the setting of the \fBnonstring\fP parameter. .sp The default values of this function are set so as to always return a byte \fBstr\fP and never raise an error when converting from unicode to bytes. However, when you do not pass an encoding that can validly encode the object (or a non\-string object), you may end up with output that you don\(aqt expect. 
Be sure you understand the requirements of your data rather than just ignoring errors by passing it through this function. .sp Changed in version 0.2.1a2: Deprecated \fBnon_string\fP in favor of \fBnonstring\fP parameter and changed default value to \fBsimplerepr\fP .UNINDENT .INDENT 0.0 .TP .B kitchen.text.converters.getwriter(encoding) Return a \fBcodecs.StreamWriter\fP that resists tracing back. .INDENT 7.0 .TP .B Parameters \fBencoding\fP \-\- Encoding to use for transforming \fBunicode\fP strings into byte \fBstr\fP\&. .TP .B Return type \fBcodecs.StreamWriter\fP .TP .B Returns \fBStreamWriter\fP that you can instantiate to wrap output streams to automatically translate \fBunicode\fP strings into \fBencoding\fP\&. .UNINDENT .sp This is a reimplementation of \fBcodecs.getwriter()\fP that returns a \fBStreamWriter\fP that resists issuing tracebacks. The \fBStreamWriter\fP that is returned uses \fI\%kitchen.text.converters.to_bytes()\fP to convert \fBunicode\fP strings into byte \fBstr\fP\&. The departures from \fBcodecs.getwriter()\fP are: .INDENT 7.0 .IP 1. 3 The \fBStreamWriter\fP that is returned will take byte \fBstr\fP as well as \fBunicode\fP strings. Any byte \fBstr\fP will be passed through unmodified. .IP 2. 3 The default error handler for unknown bytes is to \fBreplace\fP the bytes with the unknown character (\fB?\fP in most ascii\-based encodings, \fB�\fP in the utf encodings) whereas \fBcodecs.getwriter()\fP defaults to \fBstrict\fP\&. 
Like \fBcodecs.StreamWriter\fP, the returned \fBStreamWriter\fP can have its error handler changed in code by setting \fBstream.errors = \(aqnew_handler_name\(aq\fP .UNINDENT .sp Example usage: .INDENT 7.0 .INDENT 3.5 .sp .nf .ft C $ LC_ALL=C python >>> import sys >>> from kitchen.text.converters import getwriter >>> UTF8Writer = getwriter(\(aqutf\-8\(aq) >>> unwrapped_stdout = sys.stdout >>> sys.stdout = UTF8Writer(unwrapped_stdout) >>> print \(aqcaf\exc3\exa9\(aq café >>> print u\(aqcaf\exe9\(aq café >>> ASCIIWriter = getwriter(\(aqascii\(aq) >>> sys.stdout = ASCIIWriter(unwrapped_stdout) >>> print \(aqcaf\exc3\exa9\(aq café >>> print u\(aqcaf\exe9\(aq caf? .ft P .fi .UNINDENT .UNINDENT .sp \fBSEE ALSO:\fP .INDENT 7.0 .INDENT 3.5 API docs for \fBcodecs.StreamWriter\fP and \fBcodecs.getwriter()\fP and \fI\%Print Fails\fP on the python wiki. .UNINDENT .UNINDENT .sp New in version kitchen: 0.2a2, API: kitchen.text 1.1.0 .UNINDENT .INDENT 0.0 .TP .B kitchen.text.converters.to_str(obj) \fIDeprecated\fP .sp This function converts something to a byte \fBstr\fP if it isn\(aqt one. It\(aqs used to call \fBstr()\fP or \fBunicode()\fP on the object to get its simple representation without danger of getting a \fBUnicodeError\fP\&. You should be using \fI\%to_unicode()\fP or \fI\%to_bytes()\fP explicitly instead. .sp If you need \fBunicode\fP strings: .INDENT 7.0 .INDENT 3.5 .sp .nf .ft C to_unicode(obj, nonstring=\(aqsimplerepr\(aq) .ft P .fi .UNINDENT .UNINDENT .sp If you need byte \fBstr\fP: .INDENT 7.0 .INDENT 3.5 .sp .nf .ft C to_bytes(obj, nonstring=\(aqsimplerepr\(aq) .ft P .fi .UNINDENT .UNINDENT .UNINDENT .INDENT 0.0 .TP .B kitchen.text.converters.to_utf8(obj, errors=\(aqreplace\(aq, non_string=\(aqpassthru\(aq) \fIDeprecated\fP .sp Convert \fBunicode\fP to an encoded utf\-8 byte \fBstr\fP\&. 
You should be using \fI\%to_bytes()\fP instead: .INDENT 7.0 .INDENT 3.5 .sp .nf .ft C to_bytes(obj, encoding=\(aqutf\-8\(aq, non_string=\(aqpassthru\(aq) .ft P .fi .UNINDENT .UNINDENT .UNINDENT .SS Transformation to XML .INDENT 0.0 .TP .B kitchen.text.converters.unicode_to_xml(string, encoding=\(aqutf\-8\(aq, attrib=False, control_chars=\(aqreplace\(aq) Take a \fBunicode\fP string and turn it into a byte \fBstr\fP suitable for xml .INDENT 7.0 .TP .B Parameters .INDENT 7.0 .IP \(bu 2 \fBstring\fP \-\- \fBunicode\fP string to encode into an XML compatible byte \fBstr\fP .IP \(bu 2 \fBencoding\fP \-\- encoding to use for the returned byte \fBstr\fP\&. Default is to encode to UTF\-8\&. If some of the characters in \fBstring\fP are not encodable in this encoding, the unknown characters will be entered into the output string using xml character references. .IP \(bu 2 \fBattrib\fP \-\- If \fBTrue\fP, quote the string for use in an xml attribute. If \fBFalse\fP (default), quote for use in an xml text field. .IP \(bu 2 \fBcontrol_chars\fP \-\- .sp control characters are not allowed in XML documents. When we encounter those we need to know what to do. Valid options are: .INDENT 2.0 .TP .B replace (default) Replace the control characters with \fB?\fP .TP .B ignore Remove the characters altogether from the output .TP .B strict Raise an \fBXmlEncodeError\fP when we encounter a control character .UNINDENT .UNINDENT .TP .B Raises .INDENT 7.0 .IP \(bu 2 \fBkitchen.text.exceptions.XmlEncodeError\fP \-\- If \fBcontrol_chars\fP is set to \fBstrict\fP and the string to be made suitable for output to xml contains control characters or if \fBstring\fP is not a \fBunicode\fP string then we raise this exception. .IP \(bu 2 \fBValueError\fP \-\- If \fBcontrol_chars\fP is set to something other than \fBreplace\fP, \fBignore\fP, or \fBstrict\fP\&. 
.UNINDENT .TP .B Return type byte \fBstr\fP .TP .B Returns representation of the \fBunicode\fP string as a valid XML byte \fBstr\fP .UNINDENT .sp XML files consist mainly of text encoded using a particular charset. XML also forbids the use of certain bytes in the encoded text (example: \fBASCII Null\fP). There are also special characters that must be escaped if they are present in the input (example: \fB<\fP). This function takes care of all of those issues for you. .sp There are a few different ways to use this function depending on your needs. The simplest invocation is like this: .INDENT 7.0 .INDENT 3.5 .sp .nf .ft C unicode_to_xml(u\(aqString with non\-ASCII characters: <"á と">\(aq) .ft P .fi .UNINDENT .UNINDENT .sp This will return the following to you, encoded in utf\-8: .INDENT 7.0 .INDENT 3.5 .sp .nf .ft C \(aqString with non\-ASCII characters: &lt;"á と"&gt;\(aq .ft P .fi .UNINDENT .UNINDENT .sp Pretty straightforward. Now, what if you need to encode your document in something other than utf\-8? For instance, \fBlatin\-1\fP? Let\(aqs see: .INDENT 7.0 .INDENT 3.5 .sp .nf .ft C unicode_to_xml(u\(aqString with non\-ASCII characters: <"á と">\(aq, encoding=\(aqlatin\-1\(aq) \(aqString with non\-ASCII characters: &lt;"á &#12392;"&gt;\(aq .ft P .fi .UNINDENT .UNINDENT .sp Because the \fBと\fP character is not available in the \fBlatin\-1\fP charset, it is replaced with \fB&#12392;\fP in our output. This is an xml character reference which represents the character at unicode codepoint \fB12392\fP, the \fBと\fP character. .sp When you want to reverse this, use \fI\%xml_to_unicode()\fP which will turn a byte \fBstr\fP into a \fBunicode\fP string and replace the xml character references with the unicode characters. .sp XML also has the quirk of not allowing control characters in its output. The \fBcontrol_chars\fP parameter allows us to specify what to do with those. 
For use cases that don\(aqt need absolute character by character fidelity (example: holding strings that will just be used for display in a GUI app later), the default value of \fBreplace\fP works well: .INDENT 7.0 .INDENT 3.5 .sp .nf .ft C unicode_to_xml(u\(aqString with disallowed control chars: \eu0000\eu0007\(aq) \(aqString with disallowed control chars: ??\(aq .ft P .fi .UNINDENT .UNINDENT .sp If you do need to be able to reproduce all of the characters at a later date (examples: if the string is a key value in a database or a path on a filesystem) you have many choices. Here are a few that rely on \fButf\-7\fP, a verbose encoding that encodes control characters (as well as non\-ASCII unicode values) to characters from within the ASCII printable characters. The good thing about doing this is that the code is pretty simple. You just need to use \fButf\-7\fP both when encoding the field for xml and when decoding it for use in your python program: .INDENT 7.0 .INDENT 3.5 .sp .nf .ft C unicode_to_xml(u\(aqString with unicode: と and control char: \eu0007\(aq, encoding=\(aqutf7\(aq) \(aqString with unicode: +MGg and control char: +AAc\-\(aq # [...] xml_to_unicode(\(aqString with unicode: +MGg and control char: +AAc\-\(aq, encoding=\(aqutf7\(aq) u\(aqString with unicode: と and control char: \eu0007\(aq .ft P .fi .UNINDENT .UNINDENT .sp As you can see, the \fButf\-7\fP encoding will transform even characters that would be representable in utf\-8\&. This can be a drawback if you want unicode characters in the file to be readable without being decoded first. 
You can work around this with increased complexity in your application code: .INDENT 7.0 .INDENT 3.5 .sp .nf .ft C encoding = \(aqutf\-8\(aq u_string = u\(aqString with unicode: と and control char: \eu0007\(aq try: # First attempt to encode to utf8 data = unicode_to_xml(u_string, encoding=encoding, control_chars=\(aqstrict\(aq) except XmlEncodeError: # Fallback to utf\-7 encoding = \(aqutf\-7\(aq data = unicode_to_xml(u_string, encoding=encoding, control_chars=\(aqstrict\(aq) # mytag is an illustrative tag name write_tag(\(aq<mytag encoding="%s">%s</mytag>\(aq % (encoding, data)) # [...] encoding = tag.attributes.encoding u_string = xml_to_unicode(u_string, encoding=encoding) .ft P .fi .UNINDENT .UNINDENT .sp Using code similar to that, you can have some fields encoded using your default encoding and fall back to \fButf\-7\fP if there are control characters present. .sp \fBNOTE:\fP .INDENT 7.0 .INDENT 3.5 If your goal is to preserve the control characters, you cannot save the entire file as \fButf\-7\fP and set the xml encoding parameter to \fButf\-7\fP\&. Because XML doesn\(aqt allow control characters, you have to encode those separately from any encoding work that the XML parser itself knows about. .UNINDENT .UNINDENT .sp \fBSEE ALSO:\fP .INDENT 7.0 .INDENT 3.5 .INDENT 0.0 .TP .B \fI\%bytes_to_xml()\fP if you\(aqre dealing with bytes that are non\-text or of an unknown encoding that you must preserve on a byte for byte level. .TP .B \fI\%guess_encoding_to_xml()\fP if you\(aqre dealing with strings in unknown encodings that you don\(aqt need to save with char\-for\-char fidelity. 
.UNINDENT .UNINDENT .UNINDENT .UNINDENT .INDENT 0.0 .TP .B kitchen.text.converters.xml_to_unicode(byte_string, encoding=\(aqutf\-8\(aq, errors=\(aqreplace\(aq) Transform a byte \fBstr\fP from an xml file into a \fBunicode\fP string .INDENT 7.0 .TP .B Parameters .INDENT 7.0 .IP \(bu 2 \fBbyte_string\fP \-\- byte \fBstr\fP to decode .IP \(bu 2 \fBencoding\fP \-\- encoding that the byte \fBstr\fP is in .IP \(bu 2 \fBerrors\fP \-\- What to do if not every character is valid in \fBencoding\fP\&. See the \fI\%to_unicode()\fP documentation for legal values. .UNINDENT .TP .B Return type \fBunicode\fP string .TP .B Returns string decoded from \fBbyte_string\fP .UNINDENT .sp This function attempts to reverse what \fI\%unicode_to_xml()\fP does. It takes a byte \fBstr\fP (presumably read in from an xml file) and expands all the html entities into unicode characters and decodes the byte \fBstr\fP into a \fBunicode\fP string. One thing it cannot do is restore any control characters that were removed prior to inserting into the file. If you need to keep such characters you need to use \fI\%xml_to_bytes()\fP and \fI\%bytes_to_xml()\fP or use one of the strategies documented in \fI\%unicode_to_xml()\fP instead. .UNINDENT .INDENT 0.0 .TP .B kitchen.text.converters.byte_string_to_xml(byte_string, input_encoding=\(aqutf\-8\(aq, errors=\(aqreplace\(aq, output_encoding=\(aqutf\-8\(aq, attrib=False, control_chars=\(aqreplace\(aq) Make sure a byte \fBstr\fP is validly encoded for xml output .INDENT 7.0 .TP .B Parameters .INDENT 7.0 .IP \(bu 2 \fBbyte_string\fP \-\- Byte \fBstr\fP to turn into valid xml output .IP \(bu 2 \fBinput_encoding\fP \-\- Encoding of \fBbyte_string\fP\&. Default \fButf\-8\fP .IP \(bu 2 \fBerrors\fP \-\- .sp How to handle errors encountered while decoding the \fBbyte_string\fP into \fBunicode\fP at the beginning of the process. 
Values are: .INDENT 2.0 .TP .B replace (default) Replace the invalid bytes with a \fB?\fP .TP .B ignore Remove the characters altogether from the output .TP .B strict Raise an \fBUnicodeDecodeError\fP when we encounter a non\-decodable character .UNINDENT .IP \(bu 2 \fBoutput_encoding\fP \-\- Encoding for the xml file that this string will go into. Default is \fButf\-8\fP\&. If all the characters in \fBbyte_string\fP are not encodable in this encoding, the unknown characters will be entered into the output string using xml character references. .IP \(bu 2 \fBattrib\fP \-\- If \fBTrue\fP, quote the string for use in an xml attribute. If \fBFalse\fP (default), quote for use in an xml text field. .IP \(bu 2 \fBcontrol_chars\fP \-\- .sp XML does not allow control characters\&. When we encounter those we need to know what to do. Valid options are: .INDENT 2.0 .TP .B replace (default) Replace the control characters with \fB?\fP .TP .B ignore Remove the characters altogether from the output .TP .B strict Raise an error when we encounter a control character .UNINDENT .UNINDENT .TP .B Raises .INDENT 7.0 .IP \(bu 2 \fBXmlEncodeError\fP \-\- If \fBcontrol_chars\fP is set to \fBstrict\fP and the string to be made suitable for output to xml contains control characters then we raise this exception. .IP \(bu 2 \fBUnicodeDecodeError\fP \-\- If errors is set to \fBstrict\fP and the \fBbyte_string\fP contains bytes that are not decodable using \fBinput_encoding\fP, this error is raised .UNINDENT .TP .B Return type byte \fBstr\fP .TP .B Returns representation of the byte \fBstr\fP in the output encoding with any bytes that aren\(aqt available in xml taken care of. .UNINDENT .sp Use this when you have a byte \fBstr\fP representing text that you need to make suitable for output to xml. There are several cases where this is the case. 
For instance, if you need to transform some strings encoded in \fBlatin\-1\fP to utf\-8 for output: .INDENT 7.0 .INDENT 3.5 .sp .nf .ft C utf8_string = byte_string_to_xml(latin1_string, input_encoding=\(aqlatin\-1\(aq) .ft P .fi .UNINDENT .UNINDENT .sp If you already have strings in the proper encoding you may still want to use this function to remove control characters: .INDENT 7.0 .INDENT 3.5 .sp .nf .ft C cleaned_string = byte_string_to_xml(string, input_encoding=\(aqutf\-8\(aq, output_encoding=\(aqutf\-8\(aq) .ft P .fi .UNINDENT .UNINDENT .sp \fBSEE ALSO:\fP .INDENT 7.0 .INDENT 3.5 .INDENT 0.0 .TP .B \fI\%unicode_to_xml()\fP for other ideas on using this function .UNINDENT .UNINDENT .UNINDENT .UNINDENT .INDENT 0.0 .TP .B kitchen.text.converters.xml_to_byte_string(byte_string, input_encoding=\(aqutf\-8\(aq, errors=\(aqreplace\(aq, output_encoding=\(aqutf\-8\(aq) Transform a byte \fBstr\fP from an xml file into a byte \fBstr\fP encoded in \fBoutput_encoding\fP .INDENT 7.0 .TP .B Parameters .INDENT 7.0 .IP \(bu 2 \fBbyte_string\fP \-\- byte \fBstr\fP to decode .IP \(bu 2 \fBinput_encoding\fP \-\- encoding that the byte \fBstr\fP is in .IP \(bu 2 \fBerrors\fP \-\- What to do if not every character is valid in \fBinput_encoding\fP\&. See the \fI\%to_unicode()\fP docstring for legal values. .IP \(bu 2 \fBoutput_encoding\fP \-\- Encoding for the output byte \fBstr\fP .UNINDENT .TP .B Returns byte \fBstr\fP decoded from \fBbyte_string\fP and encoded in \fBoutput_encoding\fP .UNINDENT .sp This function attempts to reverse what \fI\%byte_string_to_xml()\fP does. It takes a byte \fBstr\fP (presumably read in from an xml file), expands all the html entities into unicode characters, and returns the result as a byte \fBstr\fP in \fBoutput_encoding\fP\&. One thing it cannot do is restore any control characters that were removed prior to inserting into the file. If you need to keep such characters you need to use \fI\%xml_to_bytes()\fP and \fI\%bytes_to_xml()\fP or use one of the strategies documented in \fI\%unicode_to_xml()\fP instead. 
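.sp
The escaping and character reference behaviour described for these functions can be approximated with the standard library. This is a sketch for Python 3, not kitchen\(aqs implementation: the helper names \fBsketch_unicode_to_xml()\fP and \fBsketch_xml_to_unicode()\fP are hypothetical, control character handling is left out, and \fBhtml.unescape()\fP also expands HTML named entities that plain XML does not define.

```python
import html
from xml.sax.saxutils import escape


def sketch_unicode_to_xml(text, encoding='utf-8'):
    # Hypothetical helper: escape &, < and > first, then encode.
    # Characters the target encoding cannot represent are written
    # as numeric character references by 'xmlcharrefreplace'.
    return escape(text).encode(encoding, 'xmlcharrefreplace')


def sketch_xml_to_unicode(data, encoding='utf-8', errors='replace'):
    # Hypothetical helper: decode the bytes, then expand entities
    # and numeric character references back into characters.
    return html.unescape(data.decode(encoding, errors))


encoded = sketch_unicode_to_xml('<"á と">', encoding='latin-1')
decoded = sketch_xml_to_unicode(encoded, encoding='latin-1')
print(encoded)
print(decoded)
```

The round trip only restores what survived the encoding step, so anything dropped or replaced on the way into the file (such as control characters) cannot be recovered on the way out.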
.UNINDENT .INDENT 0.0 .TP .B kitchen.text.converters.bytes_to_xml(byte_string, *args, **kwargs) Return a byte \fBstr\fP encoded so it is valid inside of any xml file .INDENT 7.0 .TP .B Parameters .INDENT 7.0 .IP \(bu 2 \fBbyte_string\fP \-\- byte \fBstr\fP to transform .IP \(bu 2 \fB**kwargs\fP (\fI*args,\fP) \-\- extra arguments to this function are passed on to the function actually implementing the encoding. You can use this to tweak the output in some cases but, as a general rule, you shouldn\(aqt because the underlying encoding function is not guaranteed to remain the same. .UNINDENT .TP .B Return type byte \fBstr\fP consisting of all ASCII characters .TP .B Returns byte \fBstr\fP representation of the input. This will be encoded using base64. .UNINDENT .sp This function is made especially to put binary information into xml documents. .sp This function is intended for encoding things that must be preserved byte\-for\-byte. If you want to encode a byte string that\(aqs text and don\(aqt mind losing the actual bytes you probably want to try \fI\%byte_string_to_xml()\fP or \fI\%guess_encoding_to_xml()\fP instead. .sp \fBNOTE:\fP .INDENT 7.0 .INDENT 3.5 Although the current implementation uses \fBbase64.b64encode()\fP and there\(aqs no plans to change it, that isn\(aqt guaranteed. If you want to make sure that you can encode and decode these messages it\(aqs best to use \fI\%xml_to_bytes()\fP if you use this function to encode. .UNINDENT .UNINDENT .UNINDENT .INDENT 0.0 .TP .B kitchen.text.converters.xml_to_bytes(byte_string, *args, **kwargs) Decode a string encoded using \fI\%bytes_to_xml()\fP .INDENT 7.0 .TP .B Parameters .INDENT 7.0 .IP \(bu 2 \fBbyte_string\fP \-\- byte \fBstr\fP to transform. This should be a base64 encoded sequence of bytes originally generated by \fI\%bytes_to_xml()\fP\&. .IP \(bu 2 \fB**kwargs\fP (\fI*args,\fP) \-\- extra arguments to this function are passed on to the function actually implementing the encoding. 
You can use this to tweak the output in some cases but, as a general rule, you shouldn\(aqt because the underlying encoding function is not guaranteed to remain the same. .UNINDENT .TP .B Return type byte \fBstr\fP .TP .B Returns byte \fBstr\fP that\(aqs the decoded input .UNINDENT .sp If you\(aqve got fields in an xml document that were encoded with \fI\%bytes_to_xml()\fP then you want to use this function to undecode them. It converts a base64 encoded string into a byte \fBstr\fP\&. .sp \fBNOTE:\fP .INDENT 7.0 .INDENT 3.5 Although the current implementation uses \fBbase64.b64decode()\fP and there\(aqs no plans to change it, that isn\(aqt guaranteed. If you want to make sure that you can encode and decode these messages it\(aqs best to use \fI\%bytes_to_xml()\fP if you use this function to decode. .UNINDENT .UNINDENT .UNINDENT .INDENT 0.0 .TP .B kitchen.text.converters.guess_encoding_to_xml(string, output_encoding=\(aqutf\-8\(aq, attrib=False, control_chars=\(aqreplace\(aq) Return a byte \fBstr\fP suitable for inclusion in xml .INDENT 7.0 .TP .B Parameters .INDENT 7.0 .IP \(bu 2 \fBstring\fP \-\- \fBunicode\fP or byte \fBstr\fP to be transformed into a byte \fBstr\fP suitable for inclusion in xml. If string is a byte \fBstr\fP we attempt to guess the encoding. If we cannot guess, we fallback to \fBlatin\-1\fP\&. .IP \(bu 2 \fBoutput_encoding\fP \-\- Output encoding for the byte \fBstr\fP\&. This should match the encoding of your xml file. .IP \(bu 2 \fBattrib\fP \-\- If \fBTrue\fP, escape the item for use in an xml attribute. If \fBFalse\fP (default) escape the item for use in a text node. 
.UNINDENT .TP .B Returns utf\-8 encoded byte \fBstr\fP .UNINDENT .UNINDENT .INDENT 0.0 .TP .B kitchen.text.converters.to_xml(string, encoding=\(aqutf\-8\(aq, attrib=False, control_chars=\(aqignore\(aq) \fIDeprecated\fP: Use \fI\%guess_encoding_to_xml()\fP instead .UNINDENT .SS Working with exception messages .INDENT 0.0 .TP .B kitchen.text.converters.EXCEPTION_CONVERTERS = (<function <lambda>>, <function <lambda>>) .INDENT 7.0 .TP .B Tuple of functions to try to use to convert an exception into a string representation. Its main use is to extract a string (\fBunicode\fP or \fBstr\fP) from an exception object in \fI\%exception_to_unicode()\fP and \fI\%exception_to_bytes()\fP\&. The functions here will try the exception\(aqs \fBargs[0]\fP and the exception itself (roughly equivalent to \fIstr(exception)\fP) to extract the message. This is only a default and can be easily overridden when calling those functions. There are several reasons you might wish to do that. If you have exceptions where the best string representing the exception is not returned by the default functions, you can add another function to extract from a different field: .INDENT 7.0 .INDENT 3.5 .sp .nf .ft C from kitchen.text.converters import (EXCEPTION_CONVERTERS, exception_to_unicode) class MyError(Exception): def __init__(self, message): self.value = message c = [lambda e: e.value] c.extend(EXCEPTION_CONVERTERS) try: raise MyError(\(aqAn Exception message\(aq) except MyError, e: print exception_to_unicode(e, converters=c) .ft P .fi .UNINDENT .UNINDENT .sp Another reason would be if you\(aqre converting to a byte \fBstr\fP and you know the \fBstr\fP needs to be a non\-utf\-8 encoding. 
\fI\%exception_to_bytes()\fP defaults to utf\-8 but if you convert into a byte \fBstr\fP explicitly using a converter then you can choose a different encoding: .INDENT 7.0 .INDENT 3.5 .sp .nf .ft C from kitchen.text.converters import (EXCEPTION_CONVERTERS, exception_to_bytes, to_bytes) c = [lambda e: to_bytes(e.args[0], encoding=\(aqeuc_jp\(aq), lambda e: to_bytes(e, encoding=\(aqeuc_jp\(aq)] c.extend(EXCEPTION_CONVERTERS) try: do_something() except Exception, e: log = open(\(aqlogfile.euc_jp\(aq, \(aqa\(aq) log.write(\(aq%s\en\(aq % exception_to_bytes(e, converters=c)) log.close() .ft P .fi .UNINDENT .UNINDENT .sp Each function in this list should take the exception as its sole argument and return a string containing the message representing the exception. The functions may return the message as a byte \fBstr\fP, a \fBunicode\fP string, or even an object if you trust the object to return a decent string representation. The \fI\%exception_to_unicode()\fP and \fI\%exception_to_bytes()\fP functions will make sure to convert the string to the proper type before returning. .sp New in version 0.2.2. .UNINDENT .UNINDENT .INDENT 0.0 .TP .B kitchen.text.converters.BYTE_EXCEPTION_CONVERTERS = (<function <lambda>>, <function to_bytes>) \fIDeprecated\fP: Use \fI\%EXCEPTION_CONVERTERS\fP instead. .sp Tuple of functions to try to use to convert an exception into a string representation. This tuple is similar to the one in \fI\%EXCEPTION_CONVERTERS\fP but it\(aqs used with \fI\%exception_to_bytes()\fP instead. Ideally, these functions should do their best to return the data as a byte \fBstr\fP but the results will be run through \fI\%to_bytes()\fP before being returned. .sp New in version 0.2.2. .sp Changed in version 1.0.1: Deprecated as simplifications allow \fI\%EXCEPTION_CONVERTERS\fP to perform the same function. 
.UNINDENT .INDENT 0.0 .TP .B kitchen.text.converters.exception_to_unicode(exc, converters=(<function <lambda>>, <function <lambda>>)) Convert an exception object into a unicode representation .INDENT 7.0 .TP .B Parameters .INDENT 7.0 .IP \(bu 2 \fBexc\fP \-\- Exception object to convert .IP \(bu 2 \fBconverters\fP \-\- List of functions to use to convert the exception into a string. See \fI\%EXCEPTION_CONVERTERS\fP for the default value and an example of adding other converters to the defaults. The functions in the list are tried one at a time to see if they can extract a string from the exception. The first one to do so without raising an exception is used. .UNINDENT .TP .B Returns \fBunicode\fP string representation of the exception. The value extracted by the \fBconverters\fP will be converted into \fBunicode\fP before being returned using the utf\-8 encoding. If you know you need to use an alternate encoding, add a function that does that to the list of functions in \fBconverters\fP\&. .UNINDENT .sp New in version 0.2.2. .UNINDENT .INDENT 0.0 .TP .B kitchen.text.converters.exception_to_bytes(exc, converters=(<function <lambda>>, <function <lambda>>)) Convert an exception object into a byte \fBstr\fP representation .INDENT 7.0 .TP .B Parameters .INDENT 7.0 .IP \(bu 2 \fBexc\fP \-\- Exception object to convert .IP \(bu 2 \fBconverters\fP \-\- List of functions to use to convert the exception into a string. See \fI\%EXCEPTION_CONVERTERS\fP for the default value and an example of adding other converters to the defaults. The functions in the list are tried one at a time to see if they can extract a string from the exception. The first one to do so without raising an exception is used. .UNINDENT .TP .B Returns byte \fBstr\fP representation of the exception. The value extracted by the \fBconverters\fP will be converted into \fBstr\fP before being returned using the utf\-8 encoding. If you know you need to use an alternate encoding, add a function that does that to the list of functions in \fBconverters\fP\&. .UNINDENT .sp New in version 0.2.2.
.sp Changed in version 1.0.1: Code simplification allowed us to switch to using \fI\%EXCEPTION_CONVERTERS\fP as the default value of \fBconverters\fP\&. .UNINDENT .SS Format Text for Display .sp Functions related to displaying unicode text. Unicode characters don\(aqt all have the same width so we need helper functions for displaying them. .sp New in version 0.2: kitchen.display API 1.0.0 .INDENT 0.0 .TP .B kitchen.text.display.textual_width(msg, control_chars=\(aqguess\(aq, encoding=\(aqutf\-8\(aq, errors=\(aqreplace\(aq) Get the textual width of a string .INDENT 7.0 .TP .B Parameters .INDENT 7.0 .IP \(bu 2 \fBmsg\fP \-\- \fBunicode\fP string or byte \fBstr\fP to get the width of .IP \(bu 2 \fBcontrol_chars\fP \-\- .sp specify how to deal with control characters\&. Possible values are: .INDENT 2.0 .TP .B guess (default) will take a guess for control character widths. Most codes will return zero width. \fBbackspace\fP, \fBdelete\fP, and \fBclear delete\fP return \-1. \fBescape\fP currently returns \-1 as well but this is not guaranteed as it\(aqs not always correct .TP .B strict will raise \fBkitchen.text.exceptions.ControlCharError\fP if a control character is encountered .UNINDENT .IP \(bu 2 \fBencoding\fP \-\- If we are given a byte \fBstr\fP this is used to decode it into \fBunicode\fP string. Any characters that are not decodable in this encoding will get a value dependent on the \fBerrors\fP parameter. .IP \(bu 2 \fBerrors\fP \-\- How to treat errors encoding the byte \fBstr\fP to \fBunicode\fP string. Legal values are the same as for \fBkitchen.text.converters.to_unicode()\fP\&. The default value of \fBreplace\fP will cause undecodable byte sequences to have a width of one. \fBignore\fP will have a width of zero. .UNINDENT .TP .B Raises \fBControlCharError\fP \-\- if \fBmsg\fP contains a control character and \fBcontrol_chars\fP is \fBstrict\fP\&. .TP .B Returns Textual width of the \fBmsg\fP\&. 
This is the amount of space that the string will consume on a monospace display. It\(aqs measured in the number of cell positions or columns it will take up on a monospace display. This is \fBnot\fP the number of glyphs that are in the string. .UNINDENT .sp \fBNOTE:\fP .INDENT 7.0 .INDENT 3.5 This function can be wrong sometimes because Unicode does not specify a strict width value for all of the code points\&. In particular, we\(aqve found that some Tamil characters take up to four character cells but we return a lesser amount. .UNINDENT .UNINDENT .UNINDENT .INDENT 0.0 .TP .B kitchen.text.display.textual_width_chop(msg, chop, encoding=\(aqutf\-8\(aq, errors=\(aqreplace\(aq) Given a string, return it chopped to a given textual width .INDENT 7.0 .TP .B Parameters .INDENT 7.0 .IP \(bu 2 \fBmsg\fP \-\- \fBunicode\fP string or byte \fBstr\fP to chop .IP \(bu 2 \fBchop\fP \-\- Chop \fBmsg\fP if it exceeds this textual width .IP \(bu 2 \fBencoding\fP \-\- If we are given a byte \fBstr\fP, this is used to decode it into a \fBunicode\fP string. Any characters that are not decodable in this encoding will be assigned a width of one. .IP \(bu 2 \fBerrors\fP \-\- How to treat errors encoding the byte \fBstr\fP to \fBunicode\fP\&. Legal values are the same as for \fBkitchen.text.converters.to_unicode()\fP .UNINDENT .TP .B Return type \fBunicode\fP string .TP .B Returns \fBunicode\fP string of the \fBmsg\fP chopped at the given textual width .UNINDENT .sp This is what you want to use instead of \fB%.*s\fP, as it does the "right" thing with regard to UTF\-8 sequences, control characters, and characters that take more than one cell position. 
Eg: .INDENT 7.0 .INDENT 3.5 .sp .nf .ft C >>> # Wrong: only displays 8 characters because it is operating on bytes >>> print "%.*s" % (10, \(aqcafé ñunru!\(aq) café ñun >>> # Properly operates on graphemes >>> \(aq%s\(aq % (textual_width_chop(\(aqcafé ñunru!\(aq, 10)) café ñunru >>> # takes too many columns because the kanji need two cell positions >>> print \(aq1234567890\en%.*s\(aq % (10, u\(aq一二三四五六七八九十\(aq) 1234567890 一二三四五六七八九十 >>> # Properly chops at 10 columns >>> print \(aq1234567890\en%s\(aq % (textual_width_chop(u\(aq一二三四五六七八九十\(aq, 10)) 1234567890 一二三四五 .ft P .fi .UNINDENT .UNINDENT .UNINDENT .INDENT 0.0 .TP .B kitchen.text.display.textual_width_fill(msg, fill, chop=None, left=True, prefix=\(aq\(aq, suffix=\(aq\(aq) Expand a \fBunicode\fP string to a specified textual width or chop to same .INDENT 7.0 .TP .B Parameters .INDENT 7.0 .IP \(bu 2 \fBmsg\fP \-\- \fBunicode\fP string to format .IP \(bu 2 \fBfill\fP \-\- pad string until the textual width of the string is this length .IP \(bu 2 \fBchop\fP \-\- before doing anything else, chop the string to this length. Default: Don\(aqt chop the string at all .IP \(bu 2 \fBleft\fP \-\- If \fBTrue\fP (default) left justify the string and put the padding on the right. If \fBFalse\fP, pad on the left side. .IP \(bu 2 \fBprefix\fP \-\- Attach this string before the field we\(aqre filling .IP \(bu 2 \fBsuffix\fP \-\- Append this string to the end of the field we\(aqre filling .UNINDENT .TP .B Return type \fBunicode\fP string .TP .B Returns \fBmsg\fP formatted to fill the specified width. If no \fBchop\fP is specified, the string could exceed the fill length when completed. If \fBprefix\fP or \fBsuffix\fP are printable characters, the string could be longer than the fill width. .UNINDENT .sp \fBNOTE:\fP .INDENT 7.0 .INDENT 3.5 \fBprefix\fP and \fBsuffix\fP should be used for "invisible" characters like highlighting, color changing escape codes, etc. 
The fill characters are appended outside of any \fBprefix\fP or \fBsuffix\fP elements. This allows you to only highlight \fBmsg\fP inside of the field you\(aqre filling. .UNINDENT .UNINDENT .sp \fBWARNING:\fP .INDENT 7.0 .INDENT 3.5 \fBmsg\fP, \fBprefix\fP, and \fBsuffix\fP should all be representable as unicode characters. In particular, any escape sequences in \fBprefix\fP and \fBsuffix\fP need to be convertible to \fBunicode\fP\&. If you need to use byte sequences here rather than unicode characters, use \fI\%byte_string_textual_width_fill()\fP instead. .UNINDENT .UNINDENT .sp This function expands a string to fill a field of a particular textual width\&. Use it instead of \fB%*.*s\fP, as it does the "right" thing with regard to UTF\-8 sequences, control characters, and characters that take more than one cell position in a display. Example usage: .INDENT 7.0 .INDENT 3.5 .sp .nf .ft C >>> msg = u\(aq一二三四五六七八九十\(aq >>> # Wrong: This uses 10 characters instead of 10 cells: >>> u":%\-*.*s:" % (10, 10, msg[:9]) :一二三四五六七八九 : >>> # This uses 10 cells like we really want: >>> u":%s:" % (textual_width_fill(msg[:9], 10, 10)) :一二三四五: >>> # Wrong: Right aligned in the field, but too many cells >>> u"%20.10s" % (msg) 一二三四五六七八九十 >>> # Correct: Right aligned with proper number of cells >>> u"%s" % (textual_width_fill(msg, 20, 10, left=False)) 一二三四五 >>> # Wrong: Adding some escape characters to highlight the line but too many cells >>> u"%s%20.10s%s" % (prefix, msg, suffix) u\(aq 一二三四五六七八九十\(aq >>> # Correct highlight of the line >>> u"%s%s%s" % (prefix, display.textual_width_fill(msg, 20, 10, left=False), suffix) u\(aq 一二三四五\(aq >>> # Correct way to not highlight the fill >>> u"%s" % (display.textual_width_fill(msg, 20, 10, left=False, prefix=prefix, suffix=suffix)) u\(aq 一二三四五\(aq .ft P .fi .UNINDENT .UNINDENT .UNINDENT .INDENT 0.0 .TP .B kitchen.text.display.wrap(text, width=70, initial_indent=u\(aq\(aq, subsequent_indent=u\(aq\(aq, encoding=\(aqutf\-8\(aq, 
errors=\(aqreplace\(aq) Works like we want \fBtextwrap.wrap()\fP to work .INDENT 7.0 .TP .B Parameters .INDENT 7.0 .IP \(bu 2 \fBtext\fP \-\- \fBunicode\fP string or byte \fBstr\fP to wrap .IP \(bu 2 \fBwidth\fP \-\- textual width at which to wrap. Default: 70 .IP \(bu 2 \fBinitial_indent\fP \-\- string to use to indent the first line. Default: do not indent. .IP \(bu 2 \fBsubsequent_indent\fP \-\- string to use to indent subsequent lines. Default: do not indent .IP \(bu 2 \fBencoding\fP \-\- Encoding to use if \fBtext\fP is a byte \fBstr\fP .IP \(bu 2 \fBerrors\fP \-\- error handler to use if \fBtext\fP is a byte \fBstr\fP and contains some undecodable characters. .UNINDENT .TP .B Return type \fBlist\fP of \fBunicode\fP strings .TP .B Returns list of lines that have been text wrapped and indented. .UNINDENT .sp \fBtextwrap.wrap()\fP from the \fI\%python standard library\fP has two drawbacks that this attempts to fix: .INDENT 7.0 .IP 1. 3 It does not handle textual width\&. It only operates on bytes or characters, both of which are inadequate (due to multi\-byte and double width characters). .IP 2. 3 It mangles lists and blocks. .UNINDENT .UNINDENT .INDENT 0.0 .TP .B kitchen.text.display.fill(text, *args, **kwargs) Works like we want \fBtextwrap.fill()\fP to work .INDENT 7.0 .TP .B Parameters \fBtext\fP \-\- \fBunicode\fP string or byte \fBstr\fP to process .TP .B Returns \fBunicode\fP string with each line separated by a newline .UNINDENT .sp \fBSEE ALSO:\fP .INDENT 7.0 .INDENT 3.5 .INDENT 0.0 .TP .B \fI\%kitchen.text.display.wrap()\fP for other parameters that you can give this command. .UNINDENT .UNINDENT .UNINDENT .sp This function is a light wrapper around \fI\%kitchen.text.display.wrap()\fP\&. Where that function returns a \fBlist\fP of lines, this function returns one string with each line separated by a newline.
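The same list\-versus\-string relationship holds between textwrap.wrap() and textwrap.fill() in the python standard library, so it can be illustrated without kitchen installed. The sketch below uses the standard library functions as stand\-ins. It wraps by character count, not textual width, so it does not reproduce what the kitchen functions add for double width characters.

```python
import textwrap

text = ("Kitchen gathers small, frequently copied snippets "
        "into importable modules.")

# wrap() returns a list of lines ...
lines = textwrap.wrap(text, width=30,
                      initial_indent="* ", subsequent_indent="  ")

# ... and fill() returns the same lines joined by newlines.
filled = textwrap.fill(text, width=30,
                       initial_indent="* ", subsequent_indent="  ")

joined = "\n".join(lines)
```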
.UNINDENT .INDENT 0.0 .TP .B kitchen.text.display.byte_string_textual_width_fill(msg, fill, chop=None, left=True, prefix=\(aq\(aq, suffix=\(aq\(aq, encoding=\(aqutf\-8\(aq, errors=\(aqreplace\(aq) Expand a byte \fBstr\fP to a specified textual width or chop to same .INDENT 7.0 .TP .B Parameters .INDENT 7.0 .IP \(bu 2 \fBmsg\fP \-\- byte \fBstr\fP encoded in UTF\-8 that we want formatted .IP \(bu 2 \fBfill\fP \-\- pad \fBmsg\fP until the textual width is this long .IP \(bu 2 \fBchop\fP \-\- before doing anything else, chop the string to this length. Default: Don\(aqt chop the string at all .IP \(bu 2 \fBleft\fP \-\- If \fBTrue\fP (default) left justify the string and put the padding on the right. If \fBFalse\fP, pad on the left side. .IP \(bu 2 \fBprefix\fP \-\- Attach this byte \fBstr\fP before the field we\(aqre filling .IP \(bu 2 \fBsuffix\fP \-\- Append this byte \fBstr\fP to the end of the field we\(aqre filling .UNINDENT .TP .B Return type byte \fBstr\fP .TP .B Returns \fBmsg\fP formatted to fill the specified textual width\&. If no \fBchop\fP is specified, the string could exceed the fill length when completed. If \fBprefix\fP or \fBsuffix\fP are printable characters, the string could be longer than fill width. .UNINDENT .sp \fBNOTE:\fP .INDENT 7.0 .INDENT 3.5 \fBprefix\fP and \fBsuffix\fP should be used for "invisible" characters like highlighting, color changing escape codes, etc. The fill characters are appended outside of any \fBprefix\fP or \fBsuffix\fP elements. This allows you to only highlight \fBmsg\fP inside of the field you\(aqre filling. .UNINDENT .UNINDENT .sp \fBSEE ALSO:\fP .INDENT 7.0 .INDENT 3.5 .INDENT 0.0 .TP .B \fI\%textual_width_fill()\fP For example usage. This function has only two differences. .INDENT 7.0 .IP 1. 3 it takes byte \fBstr\fP for \fBprefix\fP and \fBsuffix\fP so you can pass in arbitrary sequences of bytes, not just unicode characters. .IP 2. 3 it returns a byte \fBstr\fP instead of a \fBunicode\fP string. 
.UNINDENT .UNINDENT .UNINDENT .UNINDENT .UNINDENT .SS Internal Data .sp There are a few internal functions and variables in this module. Code outside of kitchen shouldn\(aqt use them but people coding on kitchen itself may find them useful. .INDENT 0.0 .TP .B kitchen.text.display._COMBINING = ((768, 879), (1155, 1161), (1425, 1469), (1471, 1471), (1473, 1474), (1476, 1477), (1479, 1479), (1536, 1539), (1552, 1562), (1611, 1631), (1648, 1648), (1750, 1764), (1767, 1768), (1770, 1773), (1807, 1807), (1809, 1809), (1840, 1866), (1958, 1968), (2027, 2035), (2070, 2073), (2075, 2083), (2085, 2087), (2089, 2093), (2137, 2139), (2305, 2306), (2364, 2364), (2369, 2376), (2381, 2381), (2385, 2388), (2402, 2403), (2433, 2433), (2492, 2492), (2497, 2500), (2509, 2509), (2530, 2531), (2561, 2562), (2620, 2620), (2625, 2626), (2631, 2632), (2635, 2637), (2672, 2673), (2689, 2690), (2748, 2748), (2753, 2757), (2759, 2760), (2765, 2765), (2786, 2787), (2817, 2817), (2876, 2876), (2879, 2879), (2881, 2883), (2893, 2893), (2902, 2902), (2946, 2946), (3008, 3008), (3021, 3021), (3134, 3136), (3142, 3144), (3146, 3149), (3157, 3158), (3260, 3260), (3263, 3263), (3270, 3270), (3276, 3277), (3298, 3299), (3393, 3395), (3405, 3405), (3530, 3530), (3538, 3540), (3542, 3542), (3633, 3633), (3636, 3642), (3655, 3662), (3761, 3761), (3764, 3769), (3771, 3772), (3784, 3789), (3864, 3865), (3893, 3893), (3895, 3895), (3897, 3897), (3953, 3966), (3968, 3972), (3974, 3975), (3984, 3991), (3993, 4028), (4038, 4038), (4141, 4144), (4146, 4146), (4150, 4151), (4153, 4154), (4184, 4185), (4237, 4237), (4448, 4607), (4957, 4959), (5906, 5908), (5938, 5940), (5970, 5971), (6002, 6003), (6068, 6069), (6071, 6077), (6086, 6086), (6089, 6099), (6109, 6109), (6155, 6157), (6313, 6313), (6432, 6434), (6439, 6440), (6450, 6450), (6457, 6459), (6679, 6680), (6752, 6752), (6773, 6780), (6783, 6783), (6912, 6915), (6964, 6964), (6966, 6970), (6972, 6972), (6978, 6978), (6980, 6980), (7019, 7027), (7082, 
7082), (7142, 7142), (7154, 7155), (7223, 7223), (7376, 7378), (7380, 7392), (7394, 7400), (7405, 7405), (7616, 7654), (7676, 7679), (8203, 8207), (8234, 8238), (8288, 8291), (8298, 8303), (8400, 8432), (11503, 11505), (11647, 11647), (11744, 11775), (12330, 12335), (12441, 12442), (42607, 42607), (42620, 42621), (42736, 42737), (43014, 43014), (43019, 43019), (43045, 43046), (43204, 43204), (43232, 43249), (43307, 43309), (43347, 43347), (43443, 43443), (43456, 43456), (43696, 43696), (43698, 43700), (43703, 43704), (43710, 43711), (43713, 43713), (44013, 44013), (64286, 64286), (65024, 65039), (65056, 65062), (65279, 65279), (65529, 65531), (66045, 66045), (68097, 68099), (68101, 68102), (68108, 68111), (68152, 68154), (68159, 68159), (69702, 69702), (69817, 69818), (119141, 119145), (119149, 119170), (119173, 119179), (119210, 119213), (119362, 119364), (917505, 917505), (917536, 917631), (917760, 917999)) Internal table, provided by this module to list code points which combine with other characters and therefore should have no textual width\&. This is a sorted \fBtuple\fP of non\-overlapping intervals. Each interval is a \fBtuple\fP listing a starting code point and ending code point\&. Every code point between the two end points is a combining character. .sp \fBSEE ALSO:\fP .INDENT 7.0 .INDENT 3.5 .INDENT 0.0 .TP .B \fI\%_generate_combining_table()\fP for how this table is generated .UNINDENT .UNINDENT .UNINDENT .sp This table was last regenerated on python\-3.2.3 with \fBunicodedata.unidata_version\fP 6.0.0 .UNINDENT .INDENT 0.0 .TP .B kitchen.text.display._generate_combining_table() Combine Markus Kuhn\(aqs data with \fBunicodedata\fP to make combining char list .INDENT 7.0 .TP .B Return type \fBtuple\fP of tuples .TP .B Returns \fBtuple\fP of intervals of code points that are combining character. Each interval is a 2\-\fBtuple\fP of the starting code point and the ending code point for the combining characters. 
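Because the table is a sorted tuple of non\-overlapping intervals, membership can be tested with a binary search over the intervals. The following Python 3 sketch shows the idea. The function name interval_contains and the shortened SAMPLE_TABLE (the first four intervals of the table above) are assumptions for illustration, not kitchen\(aqs own code.

```python
def interval_contains(value, table):
    """Return True if value lies inside one of the (start, end) intervals.

    table must be sorted and non-overlapping, with inclusive end points.
    """
    if not table or value < table[0][0] or value > table[-1][1]:
        return False
    low, high = 0, len(table) - 1
    while low <= high:
        mid = (low + high) // 2
        start, end = table[mid]
        if value > end:
            low = mid + 1      # look in the upper half
        elif value < start:
            high = mid - 1     # look in the lower half
        else:
            return True        # start <= value <= end
    return False


# First four intervals of the combining table, used here as sample data.
SAMPLE_TABLE = ((768, 879), (1155, 1161), (1425, 1469), (1471, 1471))

is_combining_acute = interval_contains(0x301, SAMPLE_TABLE)  # U+0301
is_combining_a = interval_contains(ord("a"), SAMPLE_TABLE)
```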
.UNINDENT .sp In normal use, this function serves to tell how we\(aqre generating the combining char list. For speed reasons, we use this to generate a static list and just use that later. .sp Markus Kuhn\(aqs list of combining characters is more complete than what\(aqs in the python \fBunicodedata\fP library but the python \fBunicodedata\fP is synced against later versions of the unicode database .sp This is used to generate the \fI\%_COMBINING\fP table. .UNINDENT .INDENT 0.0 .TP .B kitchen.text.display._print_combining_table() Print out a new \fI\%_COMBINING\fP table .sp This will print a new \fI\%_COMBINING\fP table in the format used in \fBkitchen/text/display.py\fP\&. It\(aqs useful for updating the \fI\%_COMBINING\fP table with updated data from a new python as the format won\(aqt change from what\(aqs already in the file. .UNINDENT .INDENT 0.0 .TP .B kitchen.text.display._interval_bisearch(value, table) Binary search in an interval table. .INDENT 7.0 .TP .B Parameters .INDENT 7.0 .IP \(bu 2 \fBvalue\fP \-\- numeric value to search for .IP \(bu 2 \fBtable\fP \-\- Ordered list of intervals. This is a list of two\-tuples. The elements of the two\-tuple define an interval\(aqs start and end points. .UNINDENT .TP .B Returns If \fBvalue\fP is found within an interval in the \fBtable\fP return \fBTrue\fP\&. Otherwise, \fBFalse\fP .UNINDENT .sp This function checks whether a numeric value is present within a table of intervals. It checks using a binary search algorithm, dividing the list of values in half and checking against the values until it determines whether the value is in the table. .UNINDENT .INDENT 0.0 .TP .B kitchen.text.display._ucp_width(ucs, control_chars=\(aqguess\(aq) Get the textual width of a ucs character .INDENT 7.0 .TP .B Parameters .INDENT 7.0 .IP \(bu 2 \fBucs\fP \-\- integer representing a single unicode code point .IP \(bu 2 \fBcontrol_chars\fP \-\- .sp specify how to deal with control characters\&. 
Possible values are: .INDENT 2.0 .TP .B guess (default) will take a guess for control character widths. Most codes will return zero width. \fBbackspace\fP, \fBdelete\fP, and \fBclear delete\fP return \-1. \fBescape\fP currently returns \-1 as well but this is not guaranteed as it\(aqs not always correct .TP .B strict will raise \fBControlCharError\fP if a control character is encountered .UNINDENT .UNINDENT .TP .B Raises \fBControlCharError\fP \-\- if the code point is a unicode control character and \fBcontrol_chars\fP is set to \(aqstrict\(aq .TP .B Returns textual width of the character. .UNINDENT .sp \fBNOTE:\fP .INDENT 7.0 .INDENT 3.5 It\(aqs important to remember this is textual width and not the number of characters or bytes. .UNINDENT .UNINDENT .UNINDENT .INDENT 0.0 .TP .B kitchen.text.display._textual_width_le(width, *args) Optimize the common case when deciding which textual width is larger .INDENT 7.0 .TP .B Parameters .INDENT 7.0 .IP \(bu 2 \fBwidth\fP \-\- textual width to compare against. .IP \(bu 2 \fB*args\fP \-\- \fBunicode\fP strings to check the total textual width of .UNINDENT .TP .B Returns \fBTrue\fP if the total textual width of \fBargs\fP is less than or equal to \fBwidth\fP\&. Otherwise \fBFalse\fP\&. .UNINDENT .sp We often want to know "does X fit in Y". It takes a while to use \fI\%textual_width()\fP to calculate this. However, we know that each canonically composed \fBunicode\fP character always has a textual width of 1 or 2. With this we can take the following shortcuts: .INDENT 7.0 .IP 1. 3 If the number of canonically composed characters is more than width, the true textual width cannot be less than width. .IP 2. 3 If the number of canonically composed characters * 2 is less than the width then the textual width must be ok. .UNINDENT .sp The textual width of a canonically composed \fBunicode\fP string will always be greater than or equal to the number of \fBunicode\fP characters.
So we can first check if the number of composed \fBunicode\fP characters is less than the asked for width. If it is we can return \fBTrue\fP immediately. If not, then we must do a full textual width lookup. .UNINDENT .SS Miscellaneous functions for manipulating text .sp Collection of text functions that don\(aqt fit in another category. .sp Changed in version kitchen: 1.2.0, API: kitchen.text 2.2.0 Added \fI\%isbasestring()\fP, \fI\%isbytestring()\fP, and \fI\%isunicodestring()\fP to help tell which string type is which on python2 and python3 .INDENT 0.0 .TP .B kitchen.text.misc.byte_string_valid_encoding(byte_string, encoding=\(aqutf\-8\(aq) Detect if a byte \fBstr\fP is valid in a specific encoding .INDENT 7.0 .TP .B Parameters .INDENT 7.0 .IP \(bu 2 \fBbyte_string\fP \-\- Byte \fBstr\fP to test for bytes not valid in this encoding .IP \(bu 2 \fBencoding\fP \-\- encoding to test against. Defaults to UTF\-8\&. .UNINDENT .TP .B Returns \fBTrue\fP if there are no invalid UTF\-8 characters. \fBFalse\fP if an invalid character is detected. .UNINDENT .sp \fBNOTE:\fP .INDENT 7.0 .INDENT 3.5 This function checks whether the byte \fBstr\fP is valid in the specified encoding. It \fBdoes not\fP detect whether the byte \fBstr\fP actually was encoded in that encoding. If you want that sort of functionality, you probably want to use \fI\%guess_encoding()\fP instead. .UNINDENT .UNINDENT .UNINDENT .INDENT 0.0 .TP .B kitchen.text.misc.byte_string_valid_xml(byte_string, encoding=\(aqutf\-8\(aq) Check that a byte \fBstr\fP would be valid in xml .INDENT 7.0 .TP .B Parameters .INDENT 7.0 .IP \(bu 2 \fBbyte_string\fP \-\- Byte \fBstr\fP to check .IP \(bu 2 \fBencoding\fP \-\- Encoding of the xml file. Default: UTF\-8 .UNINDENT .TP .B Returns \fBTrue\fP if the string is valid. 
\fBFalse\fP if it would be invalid in the xml file .UNINDENT .sp In some cases you\(aqll have a whole bunch of byte strings and rather than transforming them to \fBunicode\fP and back to byte \fBstr\fP for output to xml, you will just want to make sure they work with the xml file you\(aqre constructing. This function will help you do that. Example: .INDENT 7.0 .INDENT 3.5 .sp .nf .ft C ARRAY_OF_MOSTLY_UTF8_STRINGS = [...] processed_array = [] for string in ARRAY_OF_MOSTLY_UTF8_STRINGS: if byte_string_valid_xml(string, \(aqutf\-8\(aq): processed_array.append(string) else: processed_array.append(guess_encoding_to_xml(string, output_encoding=\(aqutf\-8\(aq)) output_xml(processed_array) .ft P .fi .UNINDENT .UNINDENT .UNINDENT .INDENT 0.0 .TP .B kitchen.text.misc.guess_encoding(byte_string, disable_chardet=False) Try to guess the encoding of a byte \fBstr\fP .INDENT 7.0 .TP .B Parameters .INDENT 7.0 .IP \(bu 2 \fBbyte_string\fP \-\- byte \fBstr\fP to guess the encoding of .IP \(bu 2 \fBdisable_chardet\fP \-\- If this is True, we never attempt to use \fBchardet\fP to guess the encoding. This is useful if you need to have reproducibility whether \fBchardet\fP is installed or not. Default: \fBFalse\fP\&. .UNINDENT .TP .B Raises \fBTypeError\fP \-\- if \fBbyte_string\fP is not a byte \fBstr\fP type .TP .B Returns string containing a guess at the encoding of \fBbyte_string\fP\&. This is appropriate to pass as the encoding argument when encoding and decoding unicode strings. .UNINDENT .sp We start by attempting to decode the byte \fBstr\fP as UTF\-8\&. If this succeeds we tell the world it\(aqs UTF\-8 text. If it doesn\(aqt and \fBchardet\fP is installed on the system and \fBdisable_chardet\fP is False this function will use it to try detecting the encoding of \fBbyte_string\fP\&. If it is not installed or \fBchardet\fP cannot determine the encoding with a high enough confidence then we rather arbitrarily claim that it is \fBlatin\-1\fP\&.
Since every byte value is valid in \fBlatin\-1\fP, decoding from \fBlatin\-1\fP to \fBunicode\fP will not cause \fBUnicodeErrors\fP, although the output might be mangled. .UNINDENT .INDENT 0.0 .TP .B kitchen.text.misc.html_entities_unescape(string) Substitute unicode characters for HTML entities .INDENT 7.0 .TP .B Parameters \fBstring\fP \-\- \fBunicode\fP string to substitute out html entities .TP .B Raises \fBTypeError\fP \-\- if something other than a \fBunicode\fP string is given .TP .B Return type \fBunicode\fP string .TP .B Returns The plain text without html entities .UNINDENT .UNINDENT .INDENT 0.0 .TP .B kitchen.text.misc.isbasestring(obj) Determine if obj is a byte \fBstr\fP or \fBunicode\fP string .sp In python2 this is equivalent to isinstance(obj, basestring). In python3 it checks whether the object is an instance of str, bytes, or bytearray. This is an aid to porting code that needed to test whether an object was derived from basestring in python2 (commonly used in unicode\-bytes conversion functions). .INDENT 7.0 .TP .B Parameters \fBobj\fP \-\- Object to test .TP .B Returns True if the object is a \fBbasestring\fP\&. Otherwise, False. .UNINDENT .sp New in version Kitchen:: 1.2.0, API kitchen.text 2.2.0 .UNINDENT .INDENT 0.0 .TP .B kitchen.text.misc.isbytestring(obj) Determine if obj is a byte \fBstr\fP .sp In python2 this is equivalent to isinstance(obj, str). In python3 it checks whether the object is an instance of bytes or bytearray. .INDENT 7.0 .TP .B Parameters \fBobj\fP \-\- Object to test .TP .B Returns True if the object is a byte \fBstr\fP\&. Otherwise, False. .UNINDENT .sp New in version Kitchen:: 1.2.0, API kitchen.text 2.2.0 .UNINDENT .INDENT 0.0 .TP .B kitchen.text.misc.isunicodestring(obj) Determine if obj is a \fBunicode\fP string .sp In python2 this is equivalent to isinstance(obj, unicode). In python3 it checks whether the object is an instance of \fBstr\fP\&.
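The python3 behaviour described for the three string\-type checks can be written out with plain isinstance() calls. This is a sketch of the documented python3 semantics only. The function names are hypothetical stand\-ins chosen so they are not confused with the kitchen functions, and the python2 branches are omitted.

```python
def is_byte_string(obj):
    """True for bytes and bytearray objects."""
    return isinstance(obj, (bytes, bytearray))


def is_unicode_string(obj):
    """True for str objects, which hold unicode text on Python 3."""
    return isinstance(obj, str)


def is_base_string(obj):
    """True for any of str, bytes, or bytearray."""
    return isinstance(obj, (str, bytes, bytearray))


# Classify a few sample values as (base, byte, unicode) triples.
samples = ["caf\u00e9", b"caf\xc3\xa9", bytearray(b"abc"), 42]
results = [(is_base_string(s), is_byte_string(s), is_unicode_string(s))
           for s in samples]
```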
.INDENT 7.0 .TP .B Parameters \fBobj\fP \-\- Object to test .TP .B Returns True if the object is a \fBunicode\fP string. Otherwise, False. .UNINDENT .sp New in version Kitchen:: 1.2.0, API kitchen.text 2.2.0 .UNINDENT .INDENT 0.0 .TP .B kitchen.text.misc.process_control_chars(string, strategy=\(aqreplace\(aq) Look for and transform control characters in a string .INDENT 7.0 .TP .B Parameters .INDENT 7.0 .IP \(bu 2 \fBstring\fP \-\- string to search for and transform control characters within .IP \(bu 2 \fBstrategy\fP \-\- .sp XML does not allow ASCII control characters\&. When we encounter those we need to know what to do. Valid options are: .INDENT 2.0 .TP .B replace (default) Replace the control characters with \fB"?"\fP .TP .B ignore Remove the characters altogether from the output .TP .B strict Raise a \fBControlCharError\fP when we encounter a control character .UNINDENT .UNINDENT .TP .B Raises .INDENT 7.0 .IP \(bu 2 \fBTypeError\fP \-\- if \fBstring\fP is not a unicode string. .IP \(bu 2 \fBValueError\fP \-\- if the strategy is not one of replace, ignore, or strict. .IP \(bu 2 \fBkitchen.text.exceptions.ControlCharError\fP \-\- if the strategy is \fBstrict\fP and a control character is present in the \fBstring\fP .UNINDENT .TP .B Returns \fBunicode\fP string with no control characters in it. .UNINDENT .sp Changed in version kitchen: 1.2.0, API: kitchen.text 2.2.0 Strip out the C1 control characters in addition to the C0 control characters. .UNINDENT .INDENT 0.0 .TP .B kitchen.text.misc.str_eq(str1, str2, encoding=\(aqutf\-8\(aq, errors=\(aqreplace\(aq) Compare two strings, converting to byte \fBstr\fP if one is \fBunicode\fP .INDENT 7.0 .TP .B Parameters .INDENT 7.0 .IP \(bu 2 \fBstr1\fP \-\- First string to compare .IP \(bu 2 \fBstr2\fP \-\- Second string to compare .IP \(bu 2 \fBencoding\fP \-\- If we need to convert one string into a byte \fBstr\fP to compare, the encoding to use. Default is utf\-8\&. 
.IP \(bu 2 \fBerrors\fP \-\- What to do if we encounter errors when encoding the string. See the \fBkitchen.text.converters.to_bytes()\fP documentation for possible values. The default is \fBreplace\fP\&. .UNINDENT .UNINDENT .sp This function prevents \fBUnicodeError\fP (python\-2.4 or less) and \fBUnicodeWarning\fP (python 2.5 and higher) when we compare a \fBunicode\fP string to a byte \fBstr\fP\&. The errors normally arise because the conversion is done to ASCII\&. This function lets you convert to utf\-8 or another encoding instead. .sp \fBNOTE:\fP .INDENT 7.0 .INDENT 3.5 When we need to convert one of the strings from \fBunicode\fP in order to compare them we convert the \fBunicode\fP string into a byte \fBstr\fP\&. That means that strings can compare differently if you use different encodings for each. .UNINDENT .UNINDENT .sp Note that \fBstr1 == str2\fP is faster than this function if you can accept the following limitations: .INDENT 7.0 .IP \(bu 2 Limited to python\-2.5+ (otherwise a \fBUnicodeDecodeError\fP may be thrown) .IP \(bu 2 Will generate a \fBUnicodeWarning\fP if non\-ASCII byte \fBstr\fP is compared to \fBunicode\fP string. .UNINDENT .UNINDENT .SS UTF\-8 .sp Functions for operating on byte \fBstr\fP encoded as UTF\-8 .sp \fBNOTE:\fP .INDENT 0.0 .INDENT 3.5 In many cases, it is better to convert to \fBunicode\fP, operate on the strings, then convert back to UTF\-8\&. \fBunicode\fP type can handle many of these functions itself. For those that it doesn\(aqt (removing control characters from length calculations, for instance) the code to do so with a \fBunicode\fP type is often simpler. .UNINDENT .UNINDENT .sp \fBWARNING:\fP .INDENT 0.0 .INDENT 3.5 All of the functions in this module are deprecated. Most of them have been replaced with functions that operate on unicode values in \fBkitchen.text.display\fP\&. \fI\%kitchen.text.utf8.utf8_valid()\fP has been replaced with a function in \fBkitchen.text.misc\fP\&. 
.UNINDENT .UNINDENT .INDENT 0.0 .TP .B kitchen.text.utf8.utf8_text_fill(text, *args, **kwargs) \fBDeprecated\fP Similar to \fBtextwrap.fill()\fP but understands utf\-8 strings and doesn\(aqt screw up lists/blocks/etc. .sp Use \fBkitchen.text.display.fill()\fP instead. .UNINDENT .INDENT 0.0 .TP .B kitchen.text.utf8.utf8_text_wrap(text, width=70, initial_indent=\(aq\(aq, subsequent_indent=\(aq\(aq) \fBDeprecated\fP Similar to \fBtextwrap.wrap()\fP but understands utf\-8 data and doesn\(aqt screw up lists/blocks/etc .sp Use \fBkitchen.text.display.wrap()\fP instead .UNINDENT .INDENT 0.0 .TP .B kitchen.text.utf8.utf8_valid(msg) \fBDeprecated\fP Detect if a string is valid utf\-8 .sp Use \fBkitchen.text.misc.byte_string_valid_encoding()\fP instead. .UNINDENT .INDENT 0.0 .TP .B kitchen.text.utf8.utf8_width(msg) \fBDeprecated\fP Get the textual width of a utf\-8 string .sp Use \fBkitchen.text.display.textual_width()\fP instead. .UNINDENT .INDENT 0.0 .TP .B kitchen.text.utf8.utf8_width_chop(msg, chop=None) \fBDeprecated\fP Return a string chopped to a given textual width .sp Use \fBtextual_width_chop()\fP and \fBtextual_width()\fP instead: .INDENT 7.0 .INDENT 3.5 .sp .nf .ft C >>> msg = \(aqく ku ら ra と to み mi\(aq >>> # Old way: >>> utf8_width_chop(msg, 5) (5, \(aqく ku\(aq) >>> # New way >>> from kitchen.text.converters import to_bytes >>> from kitchen.text.display import textual_width, textual_width_chop >>> (textual_width(msg), to_bytes(textual_width_chop(msg, 5))) (5, \(aqく ku\(aq) .ft P .fi .UNINDENT .UNINDENT .UNINDENT .INDENT 0.0 .TP .B kitchen.text.utf8.utf8_width_fill(msg, fill, chop=None, left=True, prefix=\(aq\(aq, suffix=\(aq\(aq) \fBDeprecated\fP Pad a utf\-8 string to fill a specified width .sp Use \fBbyte_string_textual_width_fill()\fP instead .UNINDENT .INDENT 0.0 .TP .B \fBconverters\fP deals with converting text for different encodings and to and from XML .TP .B \fBdisplay\fP deals with issues with printing text to a screen .TP .B \fBmisc\fP is a catchall 
for text manipulation functions that don\(aqt seem to fit elsewhere .TP .B \fButf8\fP contains deprecated functions to manipulate utf8 byte strings .UNINDENT .SS Kitchen.collections .SS StrictDict .sp \fBkitchen.collections.StrictDict\fP provides a dictionary that treats \fBstr\fP and \fBunicode\fP as distinct key values. .INDENT 0.0 .TP .B class kitchen.collections.strictdict.StrictDict Map class that considers \fBunicode\fP and \fBstr\fP different keys .sp Ordinarily when you are dealing with a \fBdict\fP keyed on strings you want to have keys that have the same characters end up in the same bucket even if one key is \fBunicode\fP and the other is a byte \fBstr\fP\&. The normal \fBdict\fP type does this for ASCII characters (but not for anything outside of the ASCII range.) .sp Sometimes, however, you want to keep the two string classes strictly separate, for instance, if you\(aqre creating a single table that can map from \fBunicode\fP characters to \fBstr\fP characters and vice versa. This class will help you do that by making all \fBunicode\fP keys evaluate to a different key than all \fBstr\fP keys. .sp \fBSEE ALSO:\fP .INDENT 7.0 .INDENT 3.5 .INDENT 0.0 .TP .B \fBdict\fP for documentation on this class\(aqs methods. This class implements all the standard \fBdict\fP methods. Its treatment of \fBunicode\fP and \fBstr\fP keys as separate is the only difference. 
.UNINDENT .UNINDENT .UNINDENT .UNINDENT .SS Kitchen.iterutils Module .sp Functions to manipulate iterables .sp New in version Kitchen: 0.2.1a1 .sp \fIModule author: Toshio Kuratomi <\fI\%toshio@fedoraproject.org\fP>\fP .sp \fIModule author: Luke Macken <\fI\%lmacken@redhat.com\fP>\fP .INDENT 0.0 .TP .B kitchen.iterutils.isiterable(obj, include_string=False) Check whether an object is an iterable .INDENT 7.0 .TP .B Parameters .INDENT 7.0 .IP \(bu 2 \fBobj\fP \-\- Object to test whether it is an iterable .IP \(bu 2 \fBinclude_string\fP \-\- If \fBTrue\fP and \fBobj\fP is a byte \fBstr\fP or \fBunicode\fP string this function will return \fBTrue\fP\&. If set to \fBFalse\fP, byte \fBstr\fP and \fBunicode\fP strings will cause this function to return \fBFalse\fP\&. Default \fBFalse\fP\&. .UNINDENT .TP .B Returns \fBTrue\fP if \fBobj\fP is iterable, otherwise \fBFalse\fP\&. .UNINDENT .UNINDENT .INDENT 0.0 .TP .B kitchen.iterutils.iterate(obj, include_string=False) Generator that can be used to iterate over anything .INDENT 7.0 .TP .B Parameters .INDENT 7.0 .IP \(bu 2 \fBobj\fP \-\- The object to iterate over .IP \(bu 2 \fBinclude_string\fP \-\- If \fBTrue\fP, treat strings as iterables. Otherwise treat them as a single scalar value. Default \fBFalse\fP .UNINDENT .UNINDENT .sp This function will create an iterator out of any scalar or iterable. It is useful for turning a value you have been given into an iterable before operating on it. Iterables have their items returned; scalars are wrapped in a single\-item iterable. A string is treated as a scalar value unless the \fBinclude_string\fP parameter is set to \fBTrue\fP\&. 
Example usage: .INDENT 7.0 .INDENT 3.5 .sp .nf .ft C >>> list(iterate(None)) [None] >>> list(iterate([None])) [None] >>> list(iterate([1, 2, 3])) [1, 2, 3] >>> list(iterate(set([1, 2, 3]))) [1, 2, 3] >>> list(iterate(dict(a=\(aq1\(aq, b=\(aq2\(aq))) [\(aqa\(aq, \(aqb\(aq] >>> list(iterate(1)) [1] >>> list(iterate(iter([1, 2, 3]))) [1, 2, 3] >>> list(iterate(\(aqabc\(aq)) [\(aqabc\(aq] >>> list(iterate(\(aqabc\(aq, include_string=True)) [\(aqa\(aq, \(aqb\(aq, \(aqc\(aq] .ft P .fi .UNINDENT .UNINDENT .UNINDENT .SS Helpers for versioning software .SS PEP\-386 compliant versioning .sp \fI\%PEP 386\fP defines a standard format for version strings. This module contains a function for creating strings in that format. .INDENT 0.0 .TP .B kitchen.versioning.version_tuple_to_string(version_info) Return a \fI\%PEP 386\fP version string from a \fI\%PEP 386\fP style version tuple .INDENT 7.0 .TP .B Parameters \fBversion_info\fP \-\- Nested set of tuples that describes the version. See below for an example. .TP .B Returns a version string .UNINDENT .sp This function implements just enough of \fI\%PEP 386\fP to satisfy our needs. \fI\%PEP 386\fP defines a standard format for version strings and refers to a function that will be merged into the \fI\%python standard library\fP that transforms a tuple of version information into a standard version string. This function is an implementation of that function. Once that function becomes available in the \fI\%python standard library\fP we will start using it and deprecate this function. 
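.sp
As a rough illustration of the kind of conversion this entry describes, the sketch below joins a nested version tuple into a version string. It is not kitchen\(aqs implementation: \fBsketch_version_string()\fP is a hypothetical name, it assumes each marker tuple holds exactly one number, and it does no validation.

```python
def sketch_version_string(version_info):
    """Hypothetical helper: join a nested version tuple into a string.

    Assumes the layout ((Major, Minor, [Micro]), [(marker, number)], ...)
    and that every marker tuple holds exactly one number.
    """
    # The first tuple is the release number: (1, 0, 0) -> '1.0.0'
    release = '.'.join([str(part) for part in version_info[0]])
    suffix = []
    for marker, number in version_info[1:]:
        if marker in ('a', 'b', 'c', 'rc'):
            # Pre-release markers attach directly: '1.0.0' + 'a2'
            suffix.append('%s%s' % (marker, number))
        else:
            # post/dev markers are separated by a dot: '.dev3456'
            suffix.append('.%s%s' % (marker, number))
    return release + ''.join(suffix)


print(sketch_version_string(((1, 0, 0), ('a', 2), ('dev', 3456))))
```

In real code, prefer \fBkitchen.versioning.version_tuple_to_string()\fP; the sketch only shows how the pieces of the tuple line up with the pieces of the string.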
.sp \fBversion_info\fP takes the form that \fI\%PEP 386\fP\(aqs \fBNormalizedVersion.from_parts()\fP uses: .INDENT 7.0 .INDENT 3.5 .sp .nf .ft C ((Major, Minor, [Micros]), [(Alpha/Beta/rc marker, version)], [(post/dev marker, version)]) Ex: ((1, 0, 0), (\(aqa\(aq, 2), (\(aqdev\(aq, 3456)) .ft P .fi .UNINDENT .UNINDENT .sp It generates a \fI\%PEP 386\fP compliant version string: .INDENT 7.0 .INDENT 3.5 .sp .nf .ft C N.N[.N]+[{a|b|c|rc}N[.N]+][.postN][.devN] Ex: 1.0.0a2.dev3456 .ft P .fi .UNINDENT .UNINDENT .sp \fBWARNING:\fP .INDENT 7.0 .INDENT 3.5 This function does next to no error checking. It\(aqs up to the person defining the version tuple to make sure that the values make sense. If the \fI\%PEP 386\fP compliant version parser doesn\(aqt get released soon we\(aqll look at making this function check that the version tuple makes sense before transforming it into a string. .UNINDENT .UNINDENT .sp It\(aqs recommended that you use this function to keep a \fB__version_info__\fP tuple and \fB__version__\fP string in your modules. Why do we need both a tuple and a string? The string is often useful for putting into human readable locations like release announcements, version strings in tarballs, etc. Meanwhile the tuple is very easy for a computer to compare. 
For example, kitchen sets up its version information like this: .INDENT 7.0 .INDENT 3.5 .sp .nf .ft C from kitchen.versioning import version_tuple_to_string __version_info__ = ((0, 2, 1),) __version__ = version_tuple_to_string(__version_info__) .ft P .fi .UNINDENT .UNINDENT .sp Other programs that depend on a kitchen version between 0.2.1 and 0.3.0 can find whether the present version is okay with code like this: .INDENT 7.0 .INDENT 3.5 .sp .nf .ft C from kitchen import __version_info__, __version__ if __version_info__ < ((0, 2, 1),) or __version_info__ >= ((0, 3, 0),): print \(aqkitchen is present but not at the right version.\(aq print \(aqWe need at least version 0.2.1 and less than 0.3.0\(aq print \(aqCurrently found: kitchen\-%s\(aq % __version__ .ft P .fi .UNINDENT .UNINDENT .UNINDENT .SS Exceptions .sp Kitchen has a hierarchy of exceptions that should make it easy to catch many errors emitted by kitchen itself. .SS Base kitchen exceptions .sp Exception classes for kitchen and the root of the exception hierarchy for all kitchen modules. .INDENT 0.0 .TP .B exception kitchen.exceptions.KitchenError Base exception class for any error thrown directly by kitchen. .UNINDENT .SS Kitchen.text exceptions .sp Exception classes thrown by kitchen\(aqs text processing routines. .INDENT 0.0 .TP .B exception kitchen.text.exceptions.XmlEncodeError Exception thrown by error conditions when encoding an xml string. .UNINDENT .INDENT 0.0 .TP .B exception kitchen.text.exceptions.ControlCharError Exception thrown when an ascii control character is encountered. .UNINDENT .SS 1.0.0 Porting Guide .sp The 0.1 through 1.0.0 releases focused on bringing in functions from yum and python\-fedora. This porting guide tells how to port from those APIs to their kitchen replacements. .SS python\-fedora .TS center; |l|l|. 
_ T{ python\-fedora T} T{ kitchen replacement T} _ T{ \fBfedora.iterutils.isiterable()\fP T} T{ \fBkitchen.iterutils.isiterable()\fP [1] T} _ T{ \fBfedora.textutils.to_unicode()\fP T} T{ \fBkitchen.text.converters.to_unicode()\fP T} _ T{ \fBfedora.textutils.to_bytes()\fP T} T{ \fBkitchen.text.converters.to_bytes()\fP T} _ .TE .IP [1] 5 \fBisiterable()\fP has changed slightly in kitchen. The \fBinclude_string\fP parameter has switched its default value from \fBTrue\fP to \fBFalse\fP\&. So you need to change code like: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C >>> # Old code >>> isiterable(\(aqabcdef\(aq) True >>> # New code >>> isiterable(\(aqabcdef\(aq, include_string=True) True .ft P .fi .UNINDENT .UNINDENT .SS yum .TS center; |l|l|. _ T{ yum T} T{ kitchen replacement T} _ T{ \fByum.i18n.dummy_wrapper()\fP T} T{ \fBkitchen.i18n.DummyTranslations.ugettext()\fP [2] T} _ T{ \fByum.i18n.dummyP_wrapper()\fP T} T{ \fBkitchen.i18n.DummyTranslations.ungettext()\fP [2] T} _ T{ \fByum.i18n.utf8_width()\fP T} T{ \fBkitchen.text.display.textual_width()\fP T} _ T{ \fByum.i18n.utf8_width_chop()\fP T} T{ \fBkitchen.text.display.textual_width_chop()\fP and \fBkitchen.text.display.textual_width()\fP [3] [5] T} _ T{ \fByum.i18n.utf8_valid()\fP T} T{ \fBkitchen.text.misc.byte_string_valid_encoding()\fP T} _ T{ \fByum.i18n.utf8_text_wrap()\fP T} T{ \fBkitchen.text.display.wrap()\fP [4] T} _ T{ \fByum.i18n.utf8_text_fill()\fP T} T{ \fBkitchen.text.display.fill()\fP [4] T} _ T{ \fByum.i18n.to_unicode()\fP T} T{ \fBkitchen.text.converters.to_unicode()\fP [6] T} _ T{ \fByum.i18n.to_unicode_maybe()\fP T} T{ \fBkitchen.text.converters.to_unicode()\fP [6] T} _ T{ \fByum.i18n.to_utf8()\fP T} T{ \fBkitchen.text.converters.to_bytes()\fP [6] T} _ T{ \fByum.i18n.to_str()\fP T} T{ \fBkitchen.text.converters.to_unicode()\fP or \fBkitchen.text.converters.to_bytes()\fP [7] T} _ T{ \fByum.i18n.str_eq()\fP T} T{ \fBkitchen.text.misc.str_eq()\fP T} _ T{ \fByum.misc.to_xml()\fP T} T{ 
\fBkitchen.text.converters.unicode_to_xml()\fP or \fBkitchen.text.converters.byte_string_to_xml()\fP [8] T} _ T{ \fByum.i18n._()\fP T} T{ See: \fI\%Initializing Yum i18n\fP T} _ T{ \fByum.i18n.P_()\fP T} T{ See: \fI\%Initializing Yum i18n\fP T} _ T{ \fByum.i18n.exception2msg()\fP T} T{ \fBkitchen.text.converters.exception_to_unicode()\fP or \fBkitchen.text.converters.exception_to_bytes()\fP [9] T} _ .TE .IP [2] 5 These yum methods provided fallback support for \fBgettext\fP functions in case either \fBgaftonmode\fP was set or \fBgettext\fP failed to return an object. In kitchen, we can use the \fBkitchen.i18n.DummyTranslations\fP object to fulfill that role. Please see \fI\%Initializing Yum i18n\fP for more suggestions on how to do this. .IP [3] 5 The yum version of these functions returned a byte \fBstr\fP\&. The kitchen version listed here returns a \fBunicode\fP string. If you need a byte \fBstr\fP simply call \fBkitchen.text.converters.to_bytes()\fP on the result. .IP [4] 5 The yum version of these functions would return either a byte \fBstr\fP or a \fBunicode\fP string depending on what the input value was. The kitchen version always returns \fBunicode\fP strings. .IP [5] 5 \fByum.i18n.utf8_width_chop()\fP performed two functions. It returned the piece of the message that fit in a specified width and the width of that piece. In kitchen, you need to call two functions, one for each action: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C >>> # Old way >>> utf8_width_chop(msg, 5) (5, \(aqく ku\(aq) >>> # New way >>> from kitchen.text.display import textual_width, textual_width_chop >>> chopped = textual_width_chop(msg, 5) >>> (textual_width(chopped), chopped) (5, u\(aqく ku\(aq) .ft P .fi .UNINDENT .UNINDENT .IP [6] 5 If the yum version of \fBto_unicode()\fP or \fBto_utf8()\fP is given an object that is not a string, it returns the object itself. \fBkitchen.text.converters.to_unicode()\fP and \fBkitchen.text.converters.to_bytes()\fP default to returning the \fBsimplerepr\fP of the object instead. 
If you want the yum behaviour, set the \fBnonstring\fP parameter to \fBpassthru\fP: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C >>> from kitchen.text.converters import to_unicode >>> to_unicode(5) u\(aq5\(aq >>> to_unicode(5, nonstring=\(aqpassthru\(aq) 5 .ft P .fi .UNINDENT .UNINDENT .IP [7] 5 \fByum.i18n.to_str()\fP could return either a byte \fBstr\fP or a \fBunicode\fP string. In kitchen you can get the same effect but you get to choose whether you want a byte \fBstr\fP or a \fBunicode\fP string. Use \fBto_bytes()\fP for \fBstr\fP and \fBto_unicode()\fP for \fBunicode\fP\&. .IP [8] 5 \fByum.misc.to_xml()\fP was buggy as written. I think the intention was for you to be able to pass a byte \fBstr\fP or \fBunicode\fP string in and get out a byte \fBstr\fP that was valid to use in an xml file. The two kitchen functions \fBbyte_string_to_xml()\fP and \fBunicode_to_xml()\fP do that for each string type. .IP [9] 5 When porting \fByum.i18n.exception2msg()\fP to use kitchen, you should set up two wrapper functions to aid in your port. They\(aqll look like this: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C from kitchen.text.converters import EXCEPTION_CONVERTERS, \e BYTE_EXCEPTION_CONVERTERS, exception_to_unicode, \e exception_to_bytes def exception2umsg(e): \(aq\(aq\(aqReturn a unicode representation of an exception\(aq\(aq\(aq c = [lambda e: e.value] c.extend(EXCEPTION_CONVERTERS) return exception_to_unicode(e, converters=c) def exception2bmsg(e): \(aq\(aq\(aqReturn a utf8 encoded str representation of an exception\(aq\(aq\(aq c = [lambda e: e.value] c.extend(BYTE_EXCEPTION_CONVERTERS) return exception_to_bytes(e, converters=c) .ft P .fi .UNINDENT .UNINDENT .sp The reason to define this wrapper is that many of the exceptions in yum put the message in the \fBvalue\fP attribute of the \fBException\fP instead of adding it to the \fBargs\fP attribute. So the default \fBEXCEPTION_CONVERTERS\fP don\(aqt know where to find the message. 
The wrapper tells kitchen to check the \fBvalue\fP attribute for the message. The reason to define two wrappers may be less obvious. \fByum.i18n.exception2msg()\fP can return a \fBunicode\fP string or a byte \fBstr\fP depending on a combination of what attributes are present on the \fBException\fP and what locale the function is being run in. By contrast, \fBkitchen.text.converters.exception_to_unicode()\fP only returns \fBunicode\fP strings and \fBkitchen.text.converters.exception_to_bytes()\fP only returns byte \fBstr\fP\&. This is much safer, as it keeps code that can correctly handle only \fBunicode\fP or only byte \fBstr\fP from getting the wrong type when an input changes, but it means you need to examine the calling code when porting from \fByum.i18n.exception2msg()\fP and use the appropriate wrapper. .SS Initializing Yum i18n .sp Previously, yum had several pieces of code to initialize i18n. From the toplevel of \fByum/i18n.py\fP: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C try: \(aq\(aq\(aq Setup the yum translation domain and make _() and P_() translation wrappers available. using ugettext to make sure translated strings are in Unicode. \(aq\(aq\(aq import gettext t = gettext.translation(\(aqyum\(aq, fallback=True) _ = t.ugettext P_ = t.ungettext except: \(aq\(aq\(aq Something went wrong so we make a dummy _() wrapper there is just returning the same text \(aq\(aq\(aq _ = dummy_wrapper P_ = dummyP_wrapper .ft P .fi .UNINDENT .UNINDENT .sp With kitchen, this can be changed to this: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C from kitchen.i18n import easy_gettext_setup, DummyTranslations try: _, P_ = easy_gettext_setup(\(aqyum\(aq) except: translations = DummyTranslations() _ = translations.ugettext P_ = translations.ungettext .ft P .fi .UNINDENT .UNINDENT .sp \fBNOTE:\fP .INDENT 0.0 .INDENT 3.5 In overcoming\-frustration, it is mentioned that for some things (like exception messages), using the byte \fBstr\fP oriented functions is more appropriate. 
If this is desired, the setup portion is only a second call to \fBkitchen.i18n.easy_gettext_setup()\fP: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C b_, bP_ = easy_gettext_setup(\(aqyum\(aq, use_unicode=False) .ft P .fi .UNINDENT .UNINDENT .UNINDENT .UNINDENT .sp The second place where i18n is set up is in \fByum.YumBase._getConfig()\fP in \fByum/__init__.py\fP if \fBgaftonmode\fP is in effect: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C if startupconf.gaftonmode: global _ _ = yum.i18n.dummy_wrapper .ft P .fi .UNINDENT .UNINDENT .sp This can be changed to: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C if startupconf.gaftonmode: global _ _ = DummyTranslations().ugettext .ft P .fi .UNINDENT .UNINDENT .SS Conventions for contributing to kitchen .SS Style .INDENT 0.0 .IP \(bu 2 Strive to be \fI\%PEP 8\fP compliant .IP \(bu 2 Run \fBpylint\fP over the code and try to resolve most of its nitpicking .UNINDENT .SS Python 2.4 compatibility .sp At the moment, we\(aqre supporting python\-2.4 and above. Understand that there are a lot of python features that we cannot use because of this. .sp Sometimes modules in the \fI\%python standard library\fP can be added to kitchen so that they\(aqre available. When we do that we need to be careful of several things: .INDENT 0.0 .IP 1. 3 Keep the module in sync with the version in the python\-2.x trunk. Use \fBmaintainers/sync\-copied\-files.py\fP for this. .IP 2. 3 Sync the unittests as well as the module. .IP 3. 3 Be aware that not all modules are written to remain compatible with Python\-2.4 and might use python language features that were not present then (relative imports, with, try: with both except: and finally:, etc). These are not good candidates for importing into kitchen as they require more work to keep synced. .UNINDENT .SS Unittests .INDENT 0.0 .IP \(bu 2 At least smoketest your code (make sure a function will return expected values for one set of inputs). 
.IP \(bu 2 Note that even 100% coverage is not a guarantee of working code! Good tests also supply multiple inputs that exercise the code paths of called functions that are outside of your code. Example: .INDENT 2.0 .INDENT 3.5 .sp .nf .ft C def to_unicode(msg, encoding=\(aqutf8\(aq, errors=\(aqreplace\(aq): return unicode(msg, encoding, errors) # Smoketest only. This will give 100% coverage for your code (it # tests all of the code inside of to_unicode) but it leaves a lot of # room for errors as it doesn\(aqt test all combinations of arguments # that are then passed to the unicode() function. tools.ok_(to_unicode(\(aqabc\(aq) == u\(aqabc\(aq) # Better \-\- tests now cover non\-ascii characters and that error conditions # occur properly. There are a lot of other permutations that can be # added along these same lines. tools.ok_(to_unicode(u\(aqcafé\(aq.encode(\(aqutf8\(aq), \(aqutf8\(aq, \(aqreplace\(aq) == u\(aqcafé\(aq) tools.assert_raises(UnicodeError, to_unicode, u\(aqcafè ñunru\(aq.encode(\(aqlatin1\(aq), \(aqutf8\(aq, \(aqstrict\(aq) .ft P .fi .UNINDENT .UNINDENT .IP \(bu 2 We\(aqre using nose for unittesting. Rather than depend on unittest2 functionality, use the functions that nose provides. .IP \(bu 2 Remember to maintain python\-2.4 compatibility even in unittests. .UNINDENT .SS Docstrings and documentation .sp We use sphinx to build our documentation. We use the sphinx autodoc extension to pull docstrings out of the modules for API documentation. This means that docstrings for subpackages and modules should follow a certain pattern. The general structure is: .INDENT 0.0 .IP \(bu 2 Introductory material about a module in the module\(aqs top level docstring. .INDENT 2.0 .IP \(bu 2 Introductory material should begin with a level two title: an overbar and underbar of \(aq\-\(aq. .UNINDENT .IP \(bu 2 docstrings for every function. 
.INDENT 2.0 .IP \(bu 2 The first line is a short summary of what the function does .IP \(bu 2 This is followed by a blank line .IP \(bu 2 The next lines are a \fIfield list\fP giving information about the function\(aqs signature. We use the keywords: \fBarg\fP, \fBkwarg\fP, \fBraises\fP, \fBreturns\fP, and sometimes \fBrtype\fP\&. Use these to describe all arguments, keyword arguments, exceptions raised, and return values. .INDENT 2.0 .IP \(bu 2 Parameters that are \fBkwarg\fP should specify what their default behaviour is. .UNINDENT .UNINDENT .UNINDENT .SS Kitchen versioning .sp Currently the kitchen library is in early stages of development. While we\(aqre in this state, the main kitchen library uses the following pattern for version information: .INDENT 0.0 .IP \(bu 2 .INDENT 2.0 .TP .B Versions look like this: __version_info__ = ((0, 1, 2),) __version__ = \(aq0.1.2\(aq .UNINDENT .IP \(bu 2 The Major version number remains at 0 until we decide to make the first 1.0 release of kitchen. At that point, we\(aqre declaring that we have some confidence that we won\(aqt need to break backwards compatibility for a while. .IP \(bu 2 The Minor version increments for any backwards incompatible API changes. When this is updated, we reset micro to zero. .IP \(bu 2 The Micro version increments for any other changes (backwards compatible API changes, pure bugfixes, etc). .UNINDENT .sp \fBNOTE:\fP .INDENT 0.0 .INDENT 3.5 Versioning is only updated for releases that generate sdists and new uploads to the download directory. Usually we update the version information for the library just before release. By contrast, we update kitchen \fI\%Versioning\fP when an API change is made. When in doubt, look at the version information in the last release. .UNINDENT .UNINDENT .SS I18N .sp All strings that are used as feedback for users need to be translated. \fBkitchen\fP sets up several functions for this. 
\fB_()\fP is used for marking things that are shown to users via print, GUIs, or other "standard" methods. Strings for exceptions are marked with \fBb_()\fP\&. This function returns a byte \fBstr\fP which is needed for use with exceptions: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C from kitchen import _, b_ def print_message(msg, username): print _(\(aq%(user)s, your message of the day is: %(message)s\(aq) % { \(aqmessage\(aq: msg, \(aquser\(aq: username} raise Exception(b_(\(aqTest message\(aq)) .ft P .fi .UNINDENT .UNINDENT .sp This serves several purposes: .INDENT 0.0 .IP \(bu 2 It marks the strings to be extracted by an xgettext\-like program. .IP \(bu 2 \fB_()\fP is a function that will substitute available translations at runtime. .UNINDENT .sp \fBNOTE:\fP .INDENT 0.0 .INDENT 3.5 By using the \fB%()s with dict\fP style of string formatting, we make this string friendly to translators who may need to reorder the variables when they\(aqre translating the string. .UNINDENT .UNINDENT .sp \fIpaver\fP and \fIbabel\fP are used to extract the strings. .SS API updates .sp Kitchen strives to have a long deprecation cycle so that people have time to switch away from any APIs that we decide to discard. Discarded APIs should raise a \fBDeprecationWarning\fP and clearly state in the warning message and the docstring how to convert old code to use the new interface. An example of deprecating a function: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C import warnings from kitchen import _ from kitchen.text.converters import to_bytes, to_unicode from kitchen.text.new_module import new_function def old_function(param): \(aq\(aq\(aq**Deprecated** This function is deprecated. Use :func:\(gakitchen.text.new_module.new_function\(ga instead. 
If you want unicode strings as output, switch to:: >>> from kitchen.text.new_module import new_function >>> output = new_function(param) If you want byte strings, use:: >>> from kitchen.text.new_module import new_function >>> from kitchen.text.converters import to_bytes >>> output = to_bytes(new_function(param)) \(aq\(aq\(aq warnings.warn(_(\(aqkitchen.text.old_function is deprecated. Use\(aq \(aq kitchen.text.new_module.new_function instead\(aq), DeprecationWarning, stacklevel=2) as_unicode = isinstance(param, unicode) message = new_function(to_unicode(param)) if not as_unicode: message = to_bytes(message) return message .ft P .fi .UNINDENT .UNINDENT .sp If a particular API change is very intrusive, it may be better to create a new version of the subpackage and ship both the old version and the new version. .SS NEWS file .sp Update the \fBNEWS\fP file when you make a change that will be visible to the users. This is not a ChangeLog file so we don\(aqt need to list absolutely everything but it should give the user an idea of how this version differs from prior versions. API changes should be listed here explicitly. Bugfixes can be more general: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C \-\-\-\-\- 0.2.0 \-\-\-\-\- * Relicense to LGPLv2+ * Add kitchen.text.format module with the following functions: textual_width, textual_width_chop. * Rename the kitchen.text.utils module to kitchen.text.misc. use of the old names is deprecated but still available. * bugfixes applied to kitchen.pycompat24.defaultdict that fixes some tracebacks .ft P .fi .UNINDENT .UNINDENT .SS Kitchen subpackages .sp Kitchen itself is a namespace. The kitchen sdist (tarball) provides certain useful subpackages. .sp \fBSEE ALSO:\fP .INDENT 0.0 .INDENT 3.5 .INDENT 0.0 .TP .B \fI\%Kitchen addon packages\fP For information about subpackages not distributed in the kitchen sdist that install into the kitchen namespace. 
.UNINDENT .UNINDENT .UNINDENT .SS Versioning .sp Each subpackage should have its own version information which is independent of the other kitchen subpackages and the main kitchen library version. This is used so that code that depends on kitchen APIs can check the version information. The standard way to do this is to put something like this in the subpackage\(aqs \fB__init__.py\fP: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C from kitchen.versioning import version_tuple_to_string __version_info__ = ((1, 0, 0),) __version__ = version_tuple_to_string(__version_info__) .ft P .fi .UNINDENT .UNINDENT .sp \fB__version_info__\fP is documented in \fBkitchen.versioning\fP\&. The values of the first tuple should describe API changes to the module. There are at least three numbers present in the tuple: (Major, minor, micro). The major version number is for backwards incompatible changes (For instance, removing a function, or adding a new mandatory argument to a function). Whenever one of these occurs, you should increment the major number and reset minor and micro to zero. The second number is the minor version. Anytime new but backwards compatible changes are introduced this number should be incremented and the micro version number reset to zero. The micro version should be incremented when a change is made that does not change the API at all. This is a common case for bugfixes, for instance. .sp Version information beyond the first three parts of the first tuple may be useful for versioning but semantically have similar meaning to the micro version. .sp \fBNOTE:\fP .INDENT 0.0 .INDENT 3.5 We update the \fB__version_info__\fP tuple when the API is updated. This way there\(aqs less chance of forgetting to update the API version when a new release is made. However, we try to only increment the version numbers a single step for any release. 
So if kitchen\-0.1.0 has kitchen.text.__version__ == \(aq1.0.1\(aq, kitchen\-0.1.1 should have kitchen.text.__version__ == \(aq1.0.2\(aq or \(aq1.1.0\(aq or \(aq2.0.0\(aq. .UNINDENT .UNINDENT .SS Criteria for subpackages in kitchen .sp Subpackages within kitchen should meet these criteria: .INDENT 0.0 .IP \(bu 2 Generally useful or needed for other pieces of kitchen. .IP \(bu 2 No mandatory requirements outside of the \fI\%python standard library\fP\&. .INDENT 2.0 .IP \(bu 2 Optional requirements from outside the \fI\%python standard library\fP are allowed. Things with mandatory requirements are better placed in \fI\%kitchen addon packages\fP .UNINDENT .IP \(bu 2 Somewhat API stable \-\- this is not a hard requirement. We can change the kitchen api. However, it is better not to as people may come to depend on it. .sp \fBSEE ALSO:\fP .INDENT 2.0 .INDENT 3.5 \fI\%API Updates\fP .UNINDENT .UNINDENT .UNINDENT .SS Kitchen addon packages .sp Addon packages are very similar to subpackages integrated into the kitchen sdist. This section just lists some of the differences to watch out for. .SS setup.py .sp Your \fBsetup.py\fP should contain entries like this: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C # It\(aqs suggested to use a dotted name like this so the package is easily # findable on pypi: setup(name=\(aqkitchen.config\(aq, # Include kitchen in the keywords, again, for searching on pypi keywords=[\(aqkitchen\(aq, \(aqconfiguration\(aq], # This package lives in the directory kitchen/config packages=[\(aqkitchen.config\(aq], # [...] ) .ft P .fi .UNINDENT .UNINDENT .SS Package directory layout .sp Create a \fBkitchen\fP directory in the toplevel. Place the addon subpackage in there. 
For example: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C \&./ <== toplevel with README, setup.py, NEWS, etc kitchen/ kitchen/__init__.py kitchen/config/ <== subpackage directory kitchen/config/__init__.py .ft P .fi .UNINDENT .UNINDENT .SS Fake kitchen module .sp The \fB__init__.py\fP file in the \fBkitchen\fP directory is special. It won\(aqt be installed. It just needs to pull in the kitchen from the system so that you are able to test your module. You should be able to use this boilerplate: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C # Fake module. This is not installed. It\(aqs just made to import the real # kitchen modules for testing this module import pkgutil # Extend the __path__ with everything in the real kitchen module __path__ = pkgutil.extend_path(__path__, __name__) .ft P .fi .UNINDENT .UNINDENT .sp \fBNOTE:\fP .INDENT 0.0 .INDENT 3.5 \fBkitchen\fP needs to be findable by python for this to work. Installing it in the \fBsite\-packages\fP directory or adding it to the \fBPYTHONPATH\fP will work. .UNINDENT .UNINDENT .sp Your unittests should now be able to find both your submodule and the main kitchen module. .SS Versioning .sp It is recommended that addon packages version similarly to \fI\%Versioning\fP\&. The \fB__version_info__\fP and \fB__version__\fP strings can be changed independently of the version exposed by setup.py so that you have both an API version (\fB__version_info__\fP) and a release version that\(aqs easier for people to parse. However, you aren\(aqt required to do this and you could follow a different methodology if you want (for instance, \fI\%Kitchen versioning\fP). .SS Glossary .INDENT 0.0 .TP .B "Everything but the kitchen sink" An English idiom meaning to include nearly everything that you can think of. .TP .B API version Version that is meant for computer consumption. This version is parsable and comparable by computers. It contains information about a library\(aqs API so that computer software can decide whether it works with the software. 
.TP .B ASCII A character encoding that maps numbers to characters essential to American English. It maps 128 characters using 7 bits. .sp \fBSEE ALSO:\fP .INDENT 7.0 .INDENT 3.5 \fI\%http://en.wikipedia.org/wiki/ASCII\fP .UNINDENT .UNINDENT .TP .B ASCII compatible An encoding in which the particular byte that maps to a character in the \fI\%ASCII\fP character set is only used to map to that character. This excludes EBCDIC based encodings and many multi\-byte fixed and variable width encodings since they reuse the bytes that make up the \fI\%ASCII\fP encoding for other purposes. \fI\%UTF\-8\fP is notable as a variable width encoding that is \fI\%ASCII\fP compatible. .sp \fBSEE ALSO:\fP .INDENT 7.0 .INDENT 3.5 .INDENT 0.0 .TP .B \fI\%http://en.wikipedia.org/wiki/Variable\-width_encoding\fP For another explanation of various ways bytes are mapped to characters in a possibly incompatible manner. .UNINDENT .UNINDENT .UNINDENT .TP .B code points \fI\%code point\fP .TP .B code point A number that maps to a particular abstract character. Code points make it so that we have a number pointing to a character without worrying about implementation details of how those numbers are stored for the computer to read. Encodings define how the code points map to particular sequences of bytes on disk and in memory. .TP .B control characters \fI\%control character\fP .TP .B control character The set of characters in unicode that are used, not to display glyphs on the screen, but to tell the displaying program to do something. .sp \fBSEE ALSO:\fP .INDENT 7.0 .INDENT 3.5 \fI\%http://en.wikipedia.org/wiki/Control_character\fP .UNINDENT .UNINDENT .TP .B grapheme Characters or pieces of characters that you might write on a page to make words, sentences, or other pieces of text. .sp \fBSEE ALSO:\fP .INDENT 7.0 .INDENT 3.5 \fI\%http://en.wikipedia.org/wiki/Grapheme\fP .UNINDENT .UNINDENT .TP .B I18N I18N is an abbreviation for internationalization. 
It\(aqs often used to signify the need to translate words, number and date
formats, and other pieces of data in a computer program so that it will work
well for people who speak a language other than your own.
.TP
.B message catalogs
See \fI\%message catalog\fP\&.
.TP
.B message catalog
Message catalogs contain translations for user\-visible strings that are present
in your code. Normally, you need to mark the strings to be translated by
wrapping them in one of several \fBgettext\fP functions. These functions serve
two purposes:
.INDENT 7.0
.IP 1. 3
They allow automated tools to find which strings are supposed to be extracted
for translation.
.IP 2. 3
They perform the translation when the program is running.
.UNINDENT
.sp
\fBSEE ALSO:\fP
.INDENT 7.0
.INDENT 3.5
\fI\%babel\(aqs documentation\fP
.INDENT 0.0
.INDENT 3.5
for one method of extracting message catalogs from source code.
.UNINDENT
.UNINDENT
.UNINDENT
.UNINDENT
.TP
.B Murphy\(aqs Law
"Anything that can go wrong, will go wrong."
.sp
\fBSEE ALSO:\fP
.INDENT 7.0
.INDENT 3.5
\fI\%http://en.wikipedia.org/wiki/Murphy%27s_Law\fP
.UNINDENT
.UNINDENT
.TP
.B release version
Version that is meant for human consumption. This version is easy for a human
to look at to decide how a particular version relates to other versions of the
software.
.TP
.B textual width
The amount of horizontal space a character takes up on a monospaced screen.
The unit is the number of character cells, or columns, that the character
occupies.
.TP
.B UTF\-8
A character encoding that maps every unicode \fI\%code point\fP to a sequence of
bytes. It is compatible with \fI\%ASCII\fP\&. It uses a variable number of bytes to
encode all of unicode. ASCII characters take one byte. Characters from other
parts of unicode take two to four bytes. It is widespread as an encoding on the
internet and in Linux.
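.sp
The byte counts given for UTF\-8 can be checked with a short sketch. This is an illustration only, assuming Python 3, where \fBstr\fP holds unicode text. The \fButf8_len()\fP helper and the sample code points are hypothetical choices made for this example and are not part of kitchen.

```python
# Illustrative sketch (Python 3 assumed); utf8_len() is a hypothetical
# helper written for this example, not part of kitchen.
def utf8_len(code_point):
    """Return how many bytes UTF-8 uses to encode one code point."""
    return len(chr(code_point).encode("utf-8"))


# One sample code point from each UTF-8 length class.
samples = [
    (0x41, "LATIN CAPITAL LETTER A (ASCII)"),
    (0xE9, "LATIN SMALL LETTER E WITH ACUTE"),
    (0x20AC, "EURO SIGN"),
    (0x1F600, "GRINNING FACE"),
]

for code_point, name in samples:
    print("U+%04X %s: %d byte(s)" % (code_point, name, utf8_len(code_point)))
```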
.UNINDENT
.SH INDICES AND TABLES
.INDENT 0.0
.IP \(bu 2
genindex
.IP \(bu 2
modindex
.IP \(bu 2
search
.UNINDENT
.SH PROJECT PAGES
.sp
More information about the project can be found on the \fI\%project webpage\fP\&.
.sp
The latest published version of this documentation can be found on the
\fI\%documentation page\fP\&.
.SH COPYRIGHT
2016 Red Hat, Inc. and others
.\" Generated by docutils manpage writer.
.