.\" Man page generated from reStructuredText.
.
.TH "KITCHEN" "1" "Sep 05, 2016" "0.2" "kitchen"
.SH NAME
kitchen \- kitchen 1.2.4
.
.nr rst2man-indent-level 0
.
.de1 rstReportMargin
\\$1 \\n[an-margin]
level \\n[rst2man-indent-level]
level margin: \\n[rst2man-indent\\n[rst2man-indent-level]]
-
\\n[rst2man-indent0]
\\n[rst2man-indent1]
\\n[rst2man-indent2]
..
.de1 INDENT
.\" .rstReportMargin pre:
. RS \\$1
. nr rst2man-indent\\n[rst2man-indent-level] \\n[an-margin]
. nr rst2man-indent-level +1
.\" .rstReportMargin post:
..
.de UNINDENT
. RE
.\" indent \\n[an-margin]
.\" old: \\n[rst2man-indent\\n[rst2man-indent-level]]
.nr rst2man-indent-level -1
.\" new: \\n[rst2man-indent\\n[rst2man-indent-level]]
.in \\n[rst2man-indent\\n[rst2man-indent-level]]u
..
.INDENT 0.0
.TP
.B Author
Toshio Kuratomi
.TP
.B Date
19 March 2011
.TP
.B Version
1.0.x
.UNINDENT
.sp
We\(aqve all done it.  In the process of writing a brand new application we\(aqve
discovered that we need a little bit of code that we\(aqve invented before.
Perhaps it\(aqs something to handle unicode text.  Perhaps it\(aqs something to make
a bit of python\-2.5 code run on python\-2.4.  Whatever it is, it ends up being
a tiny bit of code that seems too small to worry about pushing into its own
module so it sits there, a part of your current project, waiting to be cut and
pasted into your next project.  And the next.  And the next.  And since that
little bit of code proved so useful to you, it\(aqs highly likely that it
proved useful to someone else as well.  Useful enough that they\(aqve written it
and copied and pasted it over and over into each of their new projects.
.sp
Well, no longer!  Kitchen aims to pull these small snippets of code into a few
python modules which you can import and use within your project.  No more copy
and paste!  Now you can let someone else maintain and release these small
snippets so that you can get on with your life.
.sp
This package forms the core of Kitchen.  It contains some useful modules for
using newer \fI\%python standard library\fP modules on older python versions, text manipulation,
\fI\%PEP 386\fP versioning, and initializing \fBgettext\fP\&.  With this package we\(aqre
trying to provide a few useful features that don\(aqt have too many dependencies
outside of the \fI\%python standard library\fP\&.  We\(aqll be releasing other modules that drop into the
kitchen namespace to add other features (possibly with larger deps) as time
goes on.
.SH REQUIREMENTS
.sp
We\(aqve tried to keep the core kitchen module\(aqs requirements lightweight.  At the
moment kitchen only requires
.INDENT 0.0
.TP
.B python
2.4 or later
.UNINDENT
.sp
\fBWARNING:\fP
.INDENT 0.0
.INDENT 3.5
Kitchen\-1.1.0 was the last release that supported python\-2.3.x
.UNINDENT
.UNINDENT
.SS Soft Requirements
.sp
If found, these libraries will be used to make the implementation of some part
of kitchen better in some way.  If they are not present, the API that they
enable will still exist but may function in a different manner.
.INDENT 0.0
.TP
.B \fI\%chardet\fP
Used in \fBguess_encoding()\fP and
\fBguess_encoding_to_xml()\fP to help guess
encoding of byte strings being converted.  If not present, unknown
encodings will be converted as if they were \fBlatin1\fP
.UNINDENT
.SH OTHER RECOMMENDED LIBRARIES
.sp
These libraries implement commonly used functionality that everyone seems to
invent.  Rather than reinvent their wheel, I simply list the things that they
do well for now.  Perhaps if people can\(aqt find them normally, I\(aqll add them as
requirements in \fBsetup.py\fP or link them into kitchen\(aqs namespace.  For
now, I just mention them here:
.INDENT 0.0
.TP
.B \fI\%bunch\fP
Bunch is a dictionary whose items you can access with attribute lookup as
well as bracket notation.  Setting it apart from most homebrewed implementations
is the \fBbunchify()\fP function which will descend nested structures of
lists and dicts, transforming the dicts to Bunches.
.TP
.B \fI\%hashlib\fP
Python 2.5 and forward have a \fBhashlib\fP library that provides secure
hash functions to python.  If you\(aqre developing for python2.4 though, you
can install the standalone hashlib library and have access to the same
functions.
.TP
.B \fI\%iterutils\fP
The python documentation for \fBitertools\fP has some examples
of other nice iterable functions that can be built from the
\fBitertools\fP functions.  This third\-party module creates those recipes
as a module.
.TP
.B \fI\%ordereddict\fP
Python 2.7 and forward have an \fBOrderedDict\fP that
provides a \fBdict\fP that remembers the order in which its items were
inserted.  The standalone ordereddict library provides the same class for
older versions of python.
.TP
.B \fI\%unittest2\fP
Python 2.7 has an updated \fBunittest\fP library with new functions not
present in the \fI\%python standard library\fP for Python 2.6 or less.  If you want to use those
new functions but need your testing framework to be compatible with older
Python the unittest2 library provides the update as an external module.
.TP
.B \fI\%nose\fP
If you want to use a test discovery tool instead of the unittest
framework, nosetests provides a simple to use way to do that.
.UNINDENT
.SH LICENSE
.sp
This python module is distributed under the terms of the
\fI\%GNU Lesser General Public License Version 2 or later\fP\&.
.sp
\fBNOTE:\fP
.INDENT 0.0
.INDENT 3.5
Some parts of this module are licensed under terms less restrictive
than the LGPLv2+.  If you separate these files from the work as a whole
you are allowed to use them under the less restrictive licenses.  The
following is a list of the files that are known:
.INDENT 0.0
.TP
.B \fI\%Python 2 license\fP
\fB_subprocess.py\fP, \fBtest_subprocess.py\fP,
\fBdefaultdict.py\fP, \fBtest_defaultdict.py\fP,
\fB_base64.py\fP, and \fBtest_base64.py\fP
.UNINDENT
.UNINDENT
.UNINDENT
.SH CONTENTS
.SS Using kitchen to write good code
.sp
Kitchen\(aqs functions won\(aqt automatically make you a better programmer.  You
have to learn when and how to use them as well.  This section of the
documentation is intended to show you some of the ways that you can apply
kitchen\(aqs functions to problems that may have arisen in your life.  The goal
of this section is to give you enough information to understand what the
kitchen API can do for you and where in the kitchen API docs to look
for something that can help you with your next issue.  Along the way,
you might pick up the knack for identifying issues with your code before you
publish it.  And that \fIwill\fP make you a better coder.
.SS Overcoming frustration: Correctly using unicode in python2
.sp
In python\-2.x, there are two types that deal with text.
.INDENT 0.0
.IP 1. 3
\fBstr\fP is for strings of bytes.  These are very similar in nature to
how strings are handled in C.
.IP 2. 3
\fBunicode\fP is for strings of unicode code points\&.
.UNINDENT
.sp
\fBNOTE:\fP
.INDENT 0.0
.INDENT 3.5
\fBJust what the dickens is "Unicode"?\fP
.sp
One mistake that people encountering this issue for the first time make is
confusing the \fBunicode\fP type and the encodings of unicode stored in
the \fBstr\fP type.  In python, the \fBunicode\fP type stores an
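abstract sequence of code points\&.  Each code point
represents an abstract character; one or more code points combine to form
a grapheme, the unit a reader sees as a single letter\&.  By contrast, byte \fBstr\fP stores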
a sequence of bytes which can then be mapped to a sequence of code
points\&.  Each unicode encoding (UTF\-8, UTF\-7, UTF\-16, UTF\-32,
etc) maps different sequences of bytes to the unicode code points\&.
.sp
What does that mean to you as a programmer?  When you\(aqre dealing with text
manipulations (finding the number of characters in a string or cutting
a string on word boundaries) you should be dealing with \fBunicode\fP
strings as they abstract characters in a manner that\(aqs appropriate for
thinking of them as a sequence of letters that you will see on a page.
When dealing with I/O, reading from and writing to the disk, printing to
a terminal, sending something over a network link, etc, you should be dealing
with byte \fBstr\fP as those devices are going to need to deal with
concrete implementations of what bytes represent your abstract characters.
.UNINDENT
.UNINDENT
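.sp
As a rough illustration of the note above, the following sketch encodes one
string with three different codecs.  It is written in python3 syntax (where
\fBstr\fP is the code point type and \fBbytes\fP is the byte type) so that it
can be run on a current interpreter.  The variable names are only illustrative
and are not part of kitchen.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
# python3 syntax: str holds code points, bytes holds encoded bytes.
text = "caf" + chr(0xE9)  # chr(0xE9) is LATIN SMALL LETTER E WITH ACUTE

utf8_text = text.encode("utf8")
utf16_text = text.encode("utf_16_le")
latin1_text = text.encode("latin1")

# Four code points, but the number of bytes depends on the encoding.
print(len(text), len(utf8_text), len(utf16_text), len(latin1_text))

# Decoding with the matching codec recovers the same code points.
assert utf8_text.decode("utf8") == text
assert utf16_text.decode("utf_16_le") == text
assert latin1_text.decode("latin1") == text
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
The same four code points occupy five bytes in UTF\-8, eight bytes in
little\-endian UTF\-16 and four bytes in latin1.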
.sp
In the python2 world many APIs use these two classes interchangeably but there
are several important APIs where only one or the other will do the right
thing.  When you give the wrong type of string to an API that wants the other
type, you may end up with an exception being raised (\fBUnicodeDecodeError\fP
or \fBUnicodeEncodeError\fP).  However, these exceptions aren\(aqt always raised
because python implicitly converts between types... \fIsometimes\fP\&.
.SS Frustration #1: Inconsistent Errors
.sp
Although converting when possible seems like the right thing to do, it\(aqs
actually the first source of frustration.  A programmer can test out their
program with a string like: \fBThe quick brown fox jumped over the lazy dog\fP
and not encounter any issues.  But when they release their software into the
wild, someone enters the string: \fBI sat down for coffee at the café\fP and
suddenly an exception is thrown.  The reason?  The mechanism that converts
between the two types is only able to deal with ASCII characters.
Once you throw non\-ASCII characters into your strings, you have to
start dealing with the conversion manually.
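.sp
The failure mode can be reproduced explicitly.  The sketch below is written in
python3 syntax, which has no implicit conversion, so it calls the ASCII codec
by hand to imitate what python2 does behind the scenes.  The helper name
\fBimplicit_encode()\fP is made up for this illustration.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
def implicit_encode(text):
    # Hypothetical helper: imitates the python2 implicit conversion,
    # which behaves like an explicit encode with the ascii codec.
    return text.encode("ascii")

test_input = "The quick brown fox jumped over the lazy dog"
user_input = "I sat down for coffee at the caf" + chr(0xE9)

implicit_encode(test_input)  # works: every character is ASCII

try:
    implicit_encode(user_input)
    failed = False
except UnicodeEncodeError:
    failed = True  # the non ASCII character cannot be converted

print(failed)
```
.ft P
.fi
.UNINDENT
.UNINDENT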
.sp
So, if I manually convert everything to either byte \fBstr\fP or
\fBunicode\fP strings, will I be okay?  The answer is.... \fIsometimes\fP\&.
.SS Frustration #2: Inconsistent APIs
.sp
The problem you run into when converting everything to byte \fBstr\fP or
\fBunicode\fP strings is that you\(aqll be using someone else\(aqs API quite
often (this includes the APIs in the \fI\%python standard library\fP) and find that the API will only
accept byte \fBstr\fP or only accept \fBunicode\fP strings.  Or worse,
that the code will accept either when you\(aqre dealing with strings that consist
solely of ASCII but throw an error when you give it a string that\(aqs
got non\-ASCII characters.  When you encounter these APIs you first
need to identify which type will work better and then you have to convert your
values to the correct type for that code.  Thus the programmer that wants to
proactively fix all unicode errors in their code needs to do two things:
.INDENT 0.0
.IP 1. 3
You must keep track of what type your sequences of text are.  Does
\fBmy_sentence\fP contain \fBunicode\fP or \fBstr\fP?  If you don\(aqt
know that then you\(aqre going to be in for a world of hurt.
.IP 2. 3
Anytime you call a function you need to evaluate whether that function will
do the right thing with \fBstr\fP or \fBunicode\fP values.  Sending
the wrong value here will lead to a \fBUnicodeError\fP being thrown when
the string contains non\-ASCII characters.
.UNINDENT
.sp
\fBNOTE:\fP
.INDENT 0.0
.INDENT 3.5
There is one mitigating factor here.  The python community has been
standardizing on using \fBunicode\fP in all its APIs.  Although there
are some APIs that you need to send byte \fBstr\fP to in order to be
safe, (including things as ubiquitous as \fBprint()\fP as we\(aqll see in the
next section), it\(aqs getting easier and easier to use \fBunicode\fP
strings with most APIs.
.UNINDENT
.UNINDENT
.SS Frustration #3: Inconsistent treatment of output
.sp
Alright, since the python community is moving to using \fBunicode\fP
strings everywhere, we might as well convert everything to \fBunicode\fP
strings and use that by default, right?  Sounds good most of the time but
there\(aqs at least one huge caveat to be aware of.  Anytime you output text to
the terminal or to a file, the text has to be converted into a byte
\fBstr\fP\&.  Python will try to implicitly convert from \fBunicode\fP to
byte \fBstr\fP\&... but it will throw an exception if the string contains
non\-ASCII characters:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
>>> string = unicode(raw_input(), \(aqutf8\(aq)
café
>>> log = open(\(aq/var/tmp/debug.log\(aq, \(aqw\(aq)
>>> log.write(string)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: \(aqascii\(aq codec can\(aqt encode character u\(aq\exe9\(aq in position 3: ordinal not in range(128)
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Okay, this is simple enough to solve:  Just convert to a byte \fBstr\fP and
we\(aqre all set:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
>>> string = unicode(raw_input(), \(aqutf8\(aq)
café
>>> string_for_output = string.encode(\(aqutf8\(aq, \(aqreplace\(aq)
>>> log = open(\(aq/var/tmp/debug.log\(aq, \(aqw\(aq)
>>> log.write(string_for_output)
>>>
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
So that was simple, right?  Well... there\(aqs one gotcha that makes things a bit
harder to debug sometimes.  When you attempt to write non\-ASCII
\fBunicode\fP strings to a file\-like object you get a traceback every time.
But what happens when you use \fBprint()\fP?  The terminal is a file\-like object
so it should raise an exception right?  The answer to that is....
\fIsometimes\fP:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
$ python
>>> print u\(aqcafé\(aq
café
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
No exception.  Okay, we\(aqre fine then?
.sp
We are until someone does one of the following:
.INDENT 0.0
.IP \(bu 2
Runs the script in a different locale:
.INDENT 2.0
.INDENT 3.5
.sp
.nf
.ft C
$ LC_ALL=C python
>>> # Note: if you\(aqre using a good terminal program when running in the C locale
>>> # The terminal program will prevent you from entering non\-ASCII characters
>>> # python will still recognize them if you use the codepoint instead:
>>> print u\(aqcaf\exe9\(aq
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: \(aqascii\(aq codec can\(aqt encode character u\(aq\exe9\(aq in position 3: ordinal not in range(128)
.ft P
.fi
.UNINDENT
.UNINDENT
.IP \(bu 2
Redirects output to a file:
.INDENT 2.0
.INDENT 3.5
.sp
.nf
.ft C
$ cat test.py
#!/usr/bin/python \-tt
# \-*\- coding: utf\-8 \-*\-
print u\(aqcafé\(aq
$ ./test.py  >t
Traceback (most recent call last):
  File "./test.py", line 4, in <module>
    print u\(aqcafé\(aq
UnicodeEncodeError: \(aqascii\(aq codec can\(aqt encode character u\(aq\exe9\(aq in position 3: ordinal not in range(128)
.ft P
.fi
.UNINDENT
.UNINDENT
.UNINDENT
.sp
Okay, the locale thing is a pain but understandable: the C locale doesn\(aqt
understand any characters outside of ASCII so naturally attempting to
display those won\(aqt work.  Now why does redirecting to a file cause problems?
It\(aqs because \fBprint()\fP in python2 is treated specially.  Whereas the other
file\-like objects in python always convert to ASCII unless you set
them up differently, using \fBprint()\fP to output to the terminal will use
the user\(aqs locale to convert before sending the output to the terminal.  When
\fBprint()\fP is not outputting to the terminal (being redirected to a file,
for instance), \fBprint()\fP decides that it doesn\(aqt know what locale to use
for that file and so it tries to convert to ASCII instead.
.sp
So what does this mean for you, as a programmer?  Unless you have the luxury
of controlling how your users use your code, you should always, always, always
convert to a byte \fBstr\fP before outputting strings to the terminal or to
a file.  Python even provides you with a facility to do just this.  If you
know that every \fBunicode\fP string you send to a particular file\-like
object (for instance, \fBstdout\fP) should be converted to a particular
encoding you can use a \fBcodecs.StreamWriter\fP object to convert from
a \fBunicode\fP string into a byte \fBstr\fP\&.  In particular,
\fBcodecs.getwriter()\fP will return a \fBStreamWriter\fP class
that will help you to wrap a file\-like object for output.  Using our
\fBprint()\fP example:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
$ cat test.py
#!/usr/bin/python \-tt
# \-*\- coding: utf\-8 \-*\-
import codecs
import sys

UTF8Writer = codecs.getwriter(\(aqutf8\(aq)
sys.stdout = UTF8Writer(sys.stdout)
print u\(aqcafé\(aq
$ ./test.py  >t
$ cat t
café
.ft P
.fi
.UNINDENT
.UNINDENT
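.sp
To see what the writer hands to the underlying file, the following sketch wraps
an in\-memory byte stream instead of \fBsys.stdout\fP\&.  It uses python3 syntax
so that it runs on a current interpreter; using \fBio.BytesIO\fP as a stand\-in
for the real output file is an assumption made for the sake of the example.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
import codecs
import io

# An in memory byte stream stands in for stdout or a log file.
raw = io.BytesIO()
UTF8Writer = codecs.getwriter("utf8")
writer = UTF8Writer(raw)

# The writer accepts text and passes encoded bytes to the stream.
writer.write("caf" + chr(0xE9))
encoded = raw.getvalue()
print(len(encoded))
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
The four characters arrive in the stream as five UTF\-8 bytes.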
.SS Frustrations #4 and #5 \-\- The other shoes
.sp
In English, there\(aqs a saying "waiting for the other shoe to drop".  It means
that when one event (usually bad) happens, you come to expect another event
(usually worse) to come after.  In this case we have two other shoes.
.SS Frustration #4: Now it doesn\(aqt take byte strings?!
.sp
If you wrap \fBsys.stdout\fP using \fBcodecs.getwriter()\fP and think you
are now safe to print any variable without checking its type, I am afraid
I must inform you that you\(aqre not paying enough attention to Murphy\(aqs
Law\&.  The \fBStreamWriter\fP that \fBcodecs.getwriter()\fP
provides will take \fBunicode\fP strings and transform them into byte
\fBstr\fP before they get to \fBsys.stdout\fP\&.  The problem is if you
give it something that\(aqs already a byte \fBstr\fP it tries to transform
that as well.  To do that it tries to turn the byte \fBstr\fP you give it
into \fBunicode\fP and then transform that back into a byte \fBstr\fP\&...
and since it uses the ASCII codec for the first of those conversions,
chances are that it\(aqll blow up when given non\-ASCII bytes:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
>>> import codecs
>>> import sys
>>> UTF8Writer = codecs.getwriter(\(aqutf8\(aq)
>>> sys.stdout = UTF8Writer(sys.stdout)
>>> print \(aqcafé\(aq
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.6/codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: \(aqascii\(aq codec can\(aqt decode byte 0xc3 in position 3: ordinal not in range(128)
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
To work around this, kitchen provides an alternate version of
\fBcodecs.getwriter()\fP that can deal with both byte \fBstr\fP and
\fBunicode\fP strings.  Use \fBkitchen.text.converters.getwriter()\fP in
place of the \fBcodecs\fP version like this:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
>>> import sys
>>> from kitchen.text.converters import getwriter
>>> UTF8Writer = getwriter(\(aqutf8\(aq)
>>> sys.stdout = UTF8Writer(sys.stdout)
>>> print u\(aqcafé\(aq
café
>>> print \(aqcafé\(aq
café
.ft P
.fi
.UNINDENT
.UNINDENT
.SS Frustration #5: Exceptions
.sp
Okay, so we\(aqve gotten ourselves this far.  We convert everything to
\fBunicode\fP strings.  We\(aqre aware that we need to convert back into byte
\fBstr\fP before we write to the terminal.  We\(aqve worked around the
inability of the standard \fBgetwriter()\fP to deal with both byte
\fBstr\fP and \fBunicode\fP strings.  Are we all set?  Well, there\(aqs at
least one more gotcha:  raising exceptions with a \fBunicode\fP message.
Take a look:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
>>> class MyException(Exception):
\&...     pass
\&...
>>> raise MyException(u\(aqCannot do this\(aq)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
__main__.MyException: Cannot do this
>>> raise MyException(u\(aqCannot do this while at a café\(aq)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
__main__.MyException:
>>>
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
No, I didn\(aqt truncate that last line; raising exceptions really cannot handle
non\-ASCII characters in a \fBunicode\fP string and will output an
exception without the message if the message contains them.  What happens if
we try to use the handy dandy \fBgetwriter()\fP trick
to work around this?
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
>>> import sys
>>> from kitchen.text.converters import getwriter
>>> sys.stderr = getwriter(\(aqutf8\(aq)(sys.stderr)
>>> raise MyException(u\(aqCannot do this\(aq)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
__main__.MyException: Cannot do this
>>> raise MyException(u\(aqCannot do this while at a café\(aq)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
__main__.MyException>>>
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Not only did this also fail, it even swallowed the trailing newline that\(aqs
normally there.... So how to make this work?  Transform from \fBunicode\fP
strings to byte \fBstr\fP manually before outputting:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
>>> from kitchen.text.converters import to_bytes
>>> raise MyException(to_bytes(u\(aqCannot do this while at a café\(aq))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
__main__.MyException: Cannot do this while at a café
>>>
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
\fBWARNING:\fP
.INDENT 0.0
.INDENT 3.5
If you use \fBcodecs.getwriter()\fP on \fBsys.stderr\fP, you\(aqll find
that raising an exception with a byte \fBstr\fP is broken by the
default \fBStreamWriter\fP as well.  Don\(aqt do that or you\(aqll
have no way to output non\-ASCII characters.  If you want to use
a \fBStreamWriter\fP to encode other things on stderr while
still having working exceptions, use
\fBkitchen.text.converters.getwriter()\fP\&.
.UNINDENT
.UNINDENT
.SS Frustration #6: Inconsistent APIs Part deux
.sp
Sometimes you do everything right in your code but other people\(aqs code fails
you.  With unicode issues this happens more often than we want.  A glaring
example of this is when you get values back from a function that aren\(aqt
consistently \fBunicode\fP string or byte \fBstr\fP\&.
.sp
An example from the \fI\%python standard library\fP is \fBgettext\fP\&.  The \fBgettext\fP functions
are used to help translate messages that you display to users in the users\(aq
native languages.  Since most languages contain letters outside of the
ASCII range, the values that are returned contain unicode characters.
\fBgettext\fP provides you with \fBugettext()\fP and
\fBungettext()\fP to return these translations as
\fBunicode\fP strings and \fBgettext()\fP,
\fBngettext()\fP,
\fBlgettext()\fP, and
\fBlngettext()\fP to return them as encoded byte
\fBstr\fP\&.  Unfortunately, even though they\(aqre documented to return only
one type of string or the other, the implementation has corner cases where the
wrong type can be returned.
.sp
This means that even if you separate your \fBunicode\fP string and byte
\fBstr\fP correctly before you pass your strings to a \fBgettext\fP
function, afterwards, you might have to check that you have the right sort of
string type again.
.sp
\fBNOTE:\fP
.INDENT 0.0
.INDENT 3.5
\fBkitchen.i18n\fP provides alternate gettext translation objects that
return only byte \fBstr\fP or only \fBunicode\fP string.
.UNINDENT
.UNINDENT
.SS A few solutions
.sp
Now that we\(aqve identified the issues, can we define a comprehensive strategy
for dealing with them?
.SS Convert text at the border
.sp
If you get some piece of text from a library, read from a file, etc, turn it
into a \fBunicode\fP string immediately.  Since python is moving in the
direction of \fBunicode\fP strings everywhere it\(aqs going to be easier to
work with \fBunicode\fP strings within your code.
.sp
If your code is heavily involved with using things that are bytes, you can do
the opposite and convert all text into byte \fBstr\fP at the border and
only convert to \fBunicode\fP when you need it for passing to another
library or performing string operations on it.
.sp
In either case, the important thing is to pick a default type for strings and
stick with it throughout your code.  When you mix the types it becomes much
easier to mistakenly pass a string to a function that can only handle the
other type.
.sp
\fBNOTE:\fP
.INDENT 0.0
.INDENT 3.5
In python3, the abstract unicode type becomes much more prominent.
The type named \fBstr\fP is the equivalent of python2\(aqs \fBunicode\fP and
python3\(aqs \fBbytes\fP type replaces python2\(aqs \fBstr\fP\&.  Most APIs deal
in the unicode type of string with just some pieces that are low level
dealing with bytes.  The implicit conversions between bytes and unicode
are removed and whenever you want to make the conversion you need to do so
explicitly.
.UNINDENT
.UNINDENT
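.sp
A short python3 sketch of what the note describes: mixing the two types is
refused outright, so the conversion has to be spelled out.  The variable names
are illustrative only.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
b_greeting = "hello ".encode("utf8")
name = "caf" + chr(0xE9)

try:
    b_greeting + name
    mixed_ok = True
except TypeError:
    mixed_ok = False  # python3 refuses to mix bytes and str

# Pick one type explicitly before combining.
as_text = b_greeting.decode("utf8") + name
as_bytes = b_greeting + name.encode("utf8")
```
.ft P
.fi
.UNINDENT
.UNINDENT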
.SS When the data needs to be treated as bytes (or unicode) use a naming convention
.sp
Sometimes you\(aqre converting nearly all of your data to \fBunicode\fP
strings but you have one or two values where you have to keep byte
\fBstr\fP around.  This is often the case when you need to use the value
verbatim with some external resource.  For instance, filenames or key values
in a database.  When you do this, use a naming convention for the data you\(aqre
working with so you (and others reading your code later) don\(aqt get confused
about what\(aqs being stored in the value.
.sp
If you need both a textual string to present to the user and a byte value for
an exact match, consider keeping both versions around.  You can either use two
variables for this or a \fBdict\fP whose key is the byte value.
.sp
\fBNOTE:\fP
.INDENT 0.0
.INDENT 3.5
You can use the naming convention used in kitchen as a guide for
implementing your own naming convention.  It prefixes byte \fBstr\fP
variables of unknown encoding with \fBb_\fP and byte \fBstr\fP of known
encoding with the encoding name like: \fButf8_\fP\&.  If the default was to
handle \fBstr\fP and only keep a few \fBunicode\fP values, those
variables would be prefixed with \fBu_\fP\&.
.UNINDENT
.UNINDENT
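.sp
A minimal sketch of such a convention, in python3 syntax.  The values and
variable names are invented for illustration; only the \fBb_\fP and
\fButf8_\fP prefixes come from the convention described above.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
# b_ prefix: bytes of unknown encoding, kept verbatim for the filesystem.
b_filename = bytes([0x66, 0x69, 0x6C, 0x65, 0xFF])
# utf8_ prefix: bytes known to be encoded as utf8.
utf8_description = ("File at the caf" + chr(0xE9)).encode("utf8")
# No prefix: the default text type.
description = utf8_description.decode("utf8")

# Keep both forms: exact bytes as the key, text for display only.
display_names = {b_filename: b_filename.decode("utf8", "replace")}
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
The display form may be lossy (here the invalid byte becomes the unicode
replacement character), which is why the exact bytes are kept as the key.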
.SS When outputting data, convert back into bytes
.sp
When you go to send your data back outside of your program (to the filesystem,
over the network, displaying to the user, etc) turn the data back into a byte
\fBstr\fP\&.  How you do this will depend on the expected output format of
the data.  For displaying to the user, you can use the user\(aqs default encoding
using \fBlocale.getpreferredencoding()\fP\&.  For entering into a file, your best
bet is to pick a single encoding and stick with it.
.sp
\fBWARNING:\fP
.INDENT 0.0
.INDENT 3.5
When using the encoding that the user has set (for instance, using
\fBlocale.getpreferredencoding()\fP), remember that they may have their
encoding set to something that can\(aqt display every single unicode
character.  That means when you convert from \fBunicode\fP to a byte
\fBstr\fP you need to decide what should happen if a character cannot be
represented in the user\(aqs encoding.  For purposes of displaying messages to
the user, it\(aqs usually okay to use the \fBreplace\fP encoding error handler
to replace the invalid characters with a question mark or other symbol
meaning the character couldn\(aqt be displayed.
.UNINDENT
.UNINDENT
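.sp
The effect of the \fBreplace\fP error handler can be sketched as follows, in
python3 syntax.  The ASCII codec is used here only because it makes the
replacement easy to see; what the user\(aqs preferred encoding turns out to be
depends on the environment.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
import locale

text = "caf" + chr(0xE9)

# With the ascii codec the unrepresentable character becomes "?".
ascii_bytes = text.encode("ascii", "replace")

# The same call with the preferred encoding of the user does not raise
# an encoding error, whatever that encoding can or cannot represent.
user_encoding = locale.getpreferredencoding()
user_bytes = text.encode(user_encoding, "replace")
```
.ft P
.fi
.UNINDENT
.UNINDENT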
.sp
You can use \fBkitchen.text.converters.getwriter()\fP to do this automatically
for \fBsys.stdout\fP\&.  When creating exception messages be sure to convert
to bytes manually.
.SS When writing unittests, include non\-ASCII values and both unicode and str type
.sp
Unless you know that a specific portion of your code will only deal with
ASCII, be sure to include non\-ASCII values in your unittests.
Including a few characters from several different scripts is highly advised as
well because some code may have special cased accented roman characters but
not know how to handle characters used in Asian scripts.
.sp
Similarly, unless you know that a portion of your code will only be given
\fBunicode\fP strings or only byte \fBstr\fP be sure to try variables
of both types in your unittests.  When doing this, make sure that the
variables are also non\-ASCII as python\(aqs implicit conversion will mask
problems with pure ASCII data.  In many cases, it makes sense to check
what happens if byte \fBstr\fP and \fBunicode\fP strings that won\(aqt
decode in the present locale are given.
.SS Be vigilant about spotting poor APIs
.sp
Make sure that the libraries you use return only \fBunicode\fP strings or
byte \fBstr\fP\&.  Unittests can help you spot issues here by running many
variations of data through your functions and checking that you\(aqre still
getting the types of string that you expect.
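.sp
The advice on unittests and on checking return types can be combined in one
test case.  The sketch below uses python3 syntax, and \fBto_text()\fP is
a hypothetical function under test, not something provided by kitchen.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
import unittest


def to_text(value, encoding="utf8"):
    # Hypothetical function under test: always returns the text type.
    if isinstance(value, bytes):
        return value.decode(encoding, "replace")
    return value


class ToTextTest(unittest.TestCase):
    # Non ASCII samples from several scripts: Latin, Greek and Japanese.
    samples = [
        "caf" + chr(0xE9),
        chr(0x3B1) + chr(0x3B2),
        chr(0x3068) + chr(0x3057),
    ]

    def test_text_input(self):
        for sample in self.samples:
            self.assertEqual(to_text(sample), sample)

    def test_byte_input(self):
        for sample in self.samples:
            self.assertEqual(to_text(sample.encode("utf8")), sample)

    def test_return_type(self):
        # Whatever goes in, only the text type may come out.
        for sample in self.samples:
            self.assertIsInstance(to_text(sample), str)
            self.assertIsInstance(to_text(sample.encode("utf8")), str)
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Run it with the \fBunittest\fP runner of your choice.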
.SS Example: Putting this all together with kitchen
.sp
The kitchen library provides a wide array of functions to help you deal with
byte \fBstr\fP and \fBunicode\fP strings in your program.  Here\(aqs
a short example that uses many kitchen functions to do its work:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
#!/usr/bin/python \-tt
# \-*\- coding: utf\-8 \-*\-
import locale
import os
import sys
import unicodedata

from kitchen.text.converters import getwriter, to_bytes, to_unicode
from kitchen.i18n import get_translation_object

if __name__ == \(aq__main__\(aq:
    # Setup gettext driven translations but use the kitchen functions so
    # we don\(aqt have the mismatched bytes\-unicode issues.
    translations = get_translation_object(\(aqexample\(aq)
    # We use _() for marking strings that we operate on as unicode
    # This is pretty much everything
    _ = translations.ugettext
    # And b_() for marking strings that we operate on as bytes.
    # This is limited to exceptions
    b_ = translations.lgettext

    # Setup stdout
    encoding = locale.getpreferredencoding()
    Writer = getwriter(encoding)
    sys.stdout = Writer(sys.stdout)

    # Load data.  Format is filename\e0description
    # description should be utf\-8 but filename can be any legal filename
    # on the filesystem
    # Sample datafile.txt:
    #   /etc/shells\ex00Shells available on caf\exc3\exa9.lan
    #   /var/tmp/file\exff\ex00File with non\-utf8 data in the filename
    #
    # And to create /var/tmp/file\exff (under bash or zsh) do:
    #   echo \(aqSome data\(aq > /var/tmp/file$\(aq\e377\(aq
    datafile = open(\(aqdatafile.txt\(aq, \(aqr\(aq)
    data = {}
    for line in datafile:
        # We\(aqre going to keep filename as bytes because we will need the
        # exact bytes to access files on a POSIX operating system.
        # description, we\(aqll immediately transform into unicode type.
        b_filename, description = line.split(\(aq\e0\(aq, 1)

        # to_unicode defaults to decoding its input as utf\-8 and replacing
        # any problematic bytes with the unicode replacement character
        # We accept mangling of the description here knowing that our file
        # format is supposed to use utf\-8 in that field and that the
        # description will only be displayed to the user, not used as
        # a key value.
        description = to_unicode(description, \(aqutf\-8\(aq).strip()
        data[b_filename] = description
    datafile.close()

    # We\(aqre going to add a pair of extra fields onto our data to show the
    # length of the description and the filesize.  We put those between
    # the filename and description because we haven\(aqt checked that the
    # description is free of NULLs.
    datafile = open(\(aqnewdatafile.txt\(aq, \(aqw\(aq)

    # Name filename with a b_ prefix to denote byte string of unknown encoding
    for b_filename in data:
        # Since we have the byte representation of filename, we can read any
        # filename
        if os.access(b_filename, os.F_OK):
            size = os.path.getsize(b_filename)
        else:
            size = 0
        # Because the description is unicode type,  we know the number of
        # characters corresponds to the length of the normalized unicode
        # string.
        length = len(unicodedata.normalize(\(aqNFC\(aq, data[b_filename]))

        # Print a summary to the screen
        # Note that we do not let implicit type conversion from str to
        # unicode transform b_filename into a unicode string.  That might
        # fail as python would use the ASCII codec.  Instead we use
        # to_unicode() to explicitly transform in a way that we know will
        # not traceback.
        print _(u\(aqfilename: %s\(aq) % to_unicode(b_filename)
        print _(u\(aqfile size: %s\(aq) % size
        print _(u\(aqdesc length: %s\(aq) % length
        print _(u\(aqdescription: %s\(aq) % data[b_filename]

        # First combine the unicode portion
        line = u\(aq%s\e0%s\e0%s\(aq % (size, length, data[b_filename])
        # Since the filenames are bytes, turn everything else to bytes before combining
        # Turning into unicode first would be wrong as the bytes in b_filename
        # might not convert
        b_line = \(aq%s\e0%s\en\(aq % (b_filename, to_bytes(line))

        # Just to demonstrate that getwriter will pass bytes through fine
        print b_(\(aqWrote: %s\(aq) % b_line
        datafile.write(b_line)
    datafile.close()

    # And just to show how to properly deal with an exception.
    # Note two things about this:
    # 1) We use the b_() function to translate the string.  This returns a
    #    byte string instead of a unicode string
    # 2) We\(aqre using the b_() function returned by kitchen.  If we had
    #    used the one from gettext we would need to convert the message to
    #    a byte str first
    message = u\(aqDemonstrate the proper way to raise exceptions.  Sincerely,  \eu3068\eu3057\eu304a\(aq
    raise Exception(b_(message))
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
\fBSEE ALSO:\fP
.INDENT 0.0
.INDENT 3.5
\fBkitchen.text.converters\fP
.UNINDENT
.UNINDENT
.SS Designing Unicode Aware APIs
.sp
APIs that deal with byte \fBstr\fP and \fBunicode\fP strings are
difficult to get right.  Here are a few strategies with pros and cons of each.
.SS Contents
.INDENT 0.0
.IP \(bu 2
\fI\%Designing Unicode Aware APIs\fP
.INDENT 2.0
.IP \(bu 2
\fI\%Take either bytes or unicode, output only unicode\fP
.IP \(bu 2
\fI\%Take either bytes or unicode, output the same type\fP
.IP \(bu 2
\fI\%Separate functions\fP
.IP \(bu 2
\fI\%Deciding whether to take str or unicode when no value is returned\fP
.INDENT 2.0
.IP \(bu 2
\fI\%Writing to external data\fP
.IP \(bu 2
\fI\%Updating data structures\fP
.UNINDENT
.IP \(bu 2
\fI\%APIs to Avoid\fP
.INDENT 2.0
.IP \(bu 2
\fI\%Returning unicode unless a conversion fails\fP
.IP \(bu 2
\fI\%Ignoring values with no chance of recovery\fP
.IP \(bu 2
\fI\%Raising a UnicodeException with no chance of recovery\fP
.UNINDENT
.IP \(bu 2
\fI\%Knowing your data\fP
.INDENT 2.0
.IP \(bu 2
\fI\%Do you need to operate on both bytes and unicode?\fP
.IP \(bu 2
\fI\%Can you restrict the encodings?\fP
.INDENT 2.0
.IP \(bu 2
\fI\%Single byte encodings\fP
.IP \(bu 2
\fI\%Multibyte encodings\fP
.INDENT 2.0
.IP \(bu 2
\fI\%Fixed width\fP
.IP \(bu 2
\fI\%Variable Width\fP
.INDENT 2.0
.IP \(bu 2
\fI\%ASCII compatible\fP
.IP \(bu 2
\fI\%Escaped\fP
.IP \(bu 2
\fI\%Other\fP
.UNINDENT
.UNINDENT
.UNINDENT
.UNINDENT
.UNINDENT
.UNINDENT
.SS Take either bytes or unicode, output only unicode
.sp
In this strategy, you allow the user to enter either \fBunicode\fP strings
or byte \fBstr\fP but what you give back is always \fBunicode\fP\&.  This
strategy is easy for novice end users to start using immediately as they will
be able to feed either type of string into the function and get back a string
that they can use in other places.
.sp
However, it does lead to the novice writing code that functions correctly when
testing it with ASCII\-only data but fails when given data that contains
non\-ASCII characters.  Worse, if your API is not designed to be
flexible, the consumer of your code won\(aqt be able to easily correct those
problems once they find them.
.sp
Here\(aqs a good API that uses this strategy:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
from kitchen.text.converters import to_unicode

def truncate(msg, max_length, encoding=\(aqutf8\(aq, errors=\(aqreplace\(aq):
    msg = to_unicode(msg, encoding, errors)
    return msg[:max_length]
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
The call to \fBtruncate()\fP starts with the essential parameters for
performing the task.  It ends with two optional keyword arguments that define
the encoding to use to transform from a byte \fBstr\fP to \fBunicode\fP
and the strategy to use if undecodable bytes are encountered.  The defaults
may vary depending on the use cases you have in mind.  When the output is
generally going to be printed for the user to see, \fBerrors=\(aqreplace\(aq\fP is
a good default.  If you are constructing keys to a database, raising an
exception (with \fBerrors=\(aqstrict\(aq\fP) may be a better default.  In either case,
having both parameters allows the person using your API to choose how they
want to handle any problems.  Having the values is also a clue to them that
a conversion from byte \fBstr\fP to \fBunicode\fP string is going to
occur.
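.sp
As a rough sketch of how the two \fBerrors\fP defaults differ in practice,
the following uses python3 and the plain \fBbytes.decode()\fP method in place
of kitchen\(aqs \fBto_unicode()\fP.  Both of those choices are assumptions made
to keep the example self\-contained; they are not how kitchen itself is
implemented.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
def truncate(msg, max_length, encoding="utf8", errors="replace"):
    # Decode byte strings; text strings pass through untouched
    if isinstance(msg, bytes):
        msg = msg.decode(encoding, errors)
    return msg[:max_length]

# The byte 0xff can never appear in valid UTF8 data
bad_bytes = bytes([0x61, 0xff, 0x62])

# The default handler swaps the bad byte for U+FFFD
shown = truncate(bad_bytes, 2)

# A caller building database keys can ask for an exception instead
try:
    truncate(bad_bytes, 2, errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Because both keyword arguments are exposed, the caller picks the behaviour
without having to wrap or reimplement the function.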
.sp
\fBNOTE:\fP
.INDENT 0.0
.INDENT 3.5
If you\(aqre targeting python\-3.1 and above, \fBerrors=\(aqsurrogateescape\(aq\fP may
be a better default than \fBerrors=\(aqstrict\(aq\fP\&.  You need to be mindful of
a few things when using \fBsurrogateescape\fP though:
.INDENT 0.0
.IP \(bu 2
\fBsurrogateescape\fP will cause issues if a non\-ASCII compatible
encoding is used (for instance, UTF\-16 and UTF\-32.)  That makes it
unhelpful in situations where a true general purpose method of encoding
must be found.  \fI\%PEP 383\fP mentions that \fBsurrogateescape\fP was
specifically designed with the limitations of translating using system
locales (where ASCII compatibility is generally seen as
inescapable) so you should keep that in mind.
.IP \(bu 2
If you use \fBsurrogateescape\fP to decode from \fBbytes\fP
to \fBunicode\fP you will need to use an error handler other than
\fBstrict\fP to encode as the lone surrogate that this error handler
creates makes for invalid unicode that must be handled when encoding.
In Python\-3.1.2 or earlier, a bug in the encoder error handlers means that
you can only use \fBsurrogateescape\fP to encode; anything else will throw
an error.
.UNINDENT
.sp
Evaluate your usages of the variables in question to see what makes sense.
.UNINDENT
.UNINDENT
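.sp
The second point in the note can be seen in a short python3 session.  This
is only a sketch; the sample bytes are made up for illustration.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
# A byte string that is not valid UTF8 (0xff never appears in UTF8)
raw = bytes([0x66, 0x6f, 0x6f, 0xff])

# Decoding smuggles the bad byte through as a lone surrogate
text = raw.decode("utf8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes
round_trip = text.encode("utf8", errors="surrogateescape")

# Encoding with the strict handler refuses the lone surrogate
try:
    text.encode("utf8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```
.ft P
.fi
.UNINDENT
.UNINDENT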
.sp
Here\(aqs a bad example of using this strategy:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
from kitchen.text.converters import to_unicode

def truncate(msg, max_length):
    msg = to_unicode(msg)
    return msg[:max_length]
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
In this example, we don\(aqt have the optional keyword arguments for
\fBencoding\fP and \fBerrors\fP\&.  A user who uses this function is more
likely to miss the fact that a conversion from byte \fBstr\fP to
\fBunicode\fP is going to occur.  And once an error is reported, they will
have to look through their backtrace and think harder about where they want to
transform their data into \fBunicode\fP strings instead of having the
opportunity to control how the conversion takes place in the function itself.
Note that the user does have the ability to make this work by making the
transformation to unicode themselves:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
from kitchen.text.converters import to_unicode

msg = to_unicode(msg, encoding=\(aqeuc_jp\(aq, errors=\(aqignore\(aq)
new_msg = truncate(msg, 5)
.ft P
.fi
.UNINDENT
.UNINDENT
.SS Take either bytes or unicode, output the same type
.sp
This strategy is sometimes called polymorphic because the type of data that is
returned is dependent on the type of data that is received.  The concept is
that when you are given a byte \fBstr\fP to process, you return a byte
\fBstr\fP in your output.  When you are given \fBunicode\fP strings to
process, you return \fBunicode\fP strings in your output.
.sp
This can work well for end users as the ones that know about the difference
between the two string types will already have transformed the strings to
their desired type before giving it to this function.  The ones that don\(aqt can
remain blissfully ignorant (at least, as far as your function is concerned) as
the function does not change the type.
.sp
In cases where the encoding of the byte \fBstr\fP is known or can be
discovered based on the input data this works well.  If you can\(aqt figure out
the input encoding, however, this strategy can fail in any of the following
cases:
.INDENT 0.0
.IP 1. 3
It needs to do an internal conversion between byte \fBstr\fP and
\fBunicode\fP string.
.IP 2. 3
It cannot return the same data as either a \fBunicode\fP string or byte
\fBstr\fP\&.
.IP 3. 3
It may need to deal with byte strings that are not byte\-compatible with
ASCII\&.
.UNINDENT
.sp
First, a couple of examples of using this strategy in a good way:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
def translate(msg, table):
    replacements = table.keys()
    new_msg = []
    for index, char in enumerate(msg):
        if char in replacements:
            new_msg.append(table[char])
        else:
            new_msg.append(char)

    return \(aq\(aq.join(new_msg)
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
In this example, all of the strings that we use (except the empty string which
is okay because it doesn\(aqt have any characters to encode) come from outside of
the function.  Due to that, the user is responsible for making sure that the
\fBmsg\fP, and the keys and values in \fBtable\fP all match in terms of
type (\fBunicode\fP vs \fBstr\fP) and encoding (You can do some error
checking to make sure the user gave all the same type but you can\(aqt do the
same for the user giving different encodings).  You do not need to make
changes to the string that require you to know the encoding or type of the
string; everything is a simple replacement of one element in the array of
characters in message with the character in table.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
import json
from kitchen.text.converters import to_unicode, to_bytes

def first_field_from_json_data(json_string):
    \(aq\(aq\(aqReturn the first field in a json data structure.

    The format of the json data is a simple list of strings.
    \(aq["one", "two", "three"]\(aq
    \(aq\(aq\(aq
    if isinstance(json_string, unicode):
        # On all python versions, json.loads() returns unicode if given
        # a unicode string
        return json.loads(json_string)[0]

    # Byte str: figure out which encoding we\(aqre dealing with
    if \(aq\ex00\(aq not in json_string[:2]:
        encoding = \(aqutf8\(aq
    elif \(aq\ex00\ex00\ex00\(aq == json_string[:3]:
        encoding = \(aqutf\-32\-be\(aq
    elif \(aq\ex00\ex00\ex00\(aq == json_string[1:4]:
        encoding = \(aqutf\-32\-le\(aq
    elif \(aq\ex00\(aq == json_string[0] and \(aq\ex00\(aq == json_string[2]:
        encoding = \(aqutf\-16\-be\(aq
    else:
        encoding = \(aqutf\-16\-le\(aq

    data = json.loads(unicode(json_string, encoding))
    return data[0].encode(encoding)
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
In this example the function takes either a byte \fBstr\fP type or
a \fBunicode\fP string that has a list in json format and returns the first
field from it as the type of the input string.  The first section of code is
very straightforward; we receive a \fBunicode\fP string, parse it with
\fBjson.loads()\fP, and then return the first field from the parsed data (which
\fBjson.loads()\fP returns to us as \fBunicode\fP).
.sp
The second portion that deals with byte \fBstr\fP is not so
straightforward.  Before we can parse the string we have to determine what
characters the bytes in the string map to.  If we didn\(aqt do that, we wouldn\(aqt
be able to properly find which characters are present in the string.  In order
to do that we have to figure out the encoding of the byte \fBstr\fP\&.
Luckily, the json specification states that all strings are unicode and
encoded with one of UTF32be, UTF32le, UTF16be, UTF16le, or UTF\-8\&.  It further
defines the format such that the first two characters are always
ASCII\&.  Each of these has a different sequence of NULLs when they
encode an ASCII character.  We can use that to detect which encoding
was used to create the byte \fBstr\fP\&.
.sp
Finally, we return the byte \fBstr\fP by encoding the \fBunicode\fP back
to a byte \fBstr\fP\&.
.sp
As you can see, in this example we have to convert from byte \fBstr\fP to
\fBunicode\fP and back.  But we know from the json specification that byte
\fBstr\fP has to be one of a limited number of encodings that we are able
to detect.  That ability makes this strategy work.
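.sp
For readers on python3, the same NULL\-pattern detection might look roughly
like the sketch below.  The helper name \fBdetect_json_encoding()\fP is
hypothetical, and the code assumes (as the text above does) that the first
two characters of the data are ASCII.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
import json

def detect_json_encoding(data):
    # data is a bytes object whose first two characters are ASCII
    if 0 not in data[:2]:
        return "utf8"
    if data[:3] == bytes(3):        # three NULLs, then the character
        return "utf_32_be"
    if data[1:4] == bytes(3):       # the character, then three NULLs
        return "utf_32_le"
    if data[0] == 0 and data[2] == 0:
        return "utf_16_be"
    return "utf_16_le"

def first_field_from_json_data(json_string):
    # Text in, text out
    if isinstance(json_string, str):
        return json.loads(json_string)[0]
    # Bytes in, bytes out, using the encoding we detected
    encoding = detect_json_encoding(json_string)
    data = json.loads(json_string.decode(encoding))
    return data[0].encode(encoding)
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
The order of the checks matters: the UTF\-32 patterns have to be ruled out
before the UTF\-16 ones because a UTF\-32 character also has a NULL in
the positions that the UTF\-16 test looks at.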
.sp
Now for some examples of using this strategy in ways that fail:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
import unicodedata
def first_char(msg):
    \(aq\(aq\(aqReturn the first character in a string\(aq\(aq\(aq
    if not isinstance(msg, unicode):
        try:
            msg = unicode(msg, \(aqutf8\(aq)
        except UnicodeError:
            msg = unicode(msg, \(aqlatin1\(aq)
    msg = unicodedata.normalize(\(aqNFC\(aq, msg)
    return msg[0]
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
If you look at that code and think that there\(aqs something fragile and prone to
breaking in the \fBtry: except:\fP block you are correct in being suspicious.
This code will fail on multi\-byte character sets that aren\(aqt UTF\-8\&.  It
can also fail on data where the sequence of bytes is valid UTF\-8 but
the bytes are actually of a different encoding.  The reason this code fails
is that we don\(aqt know what encoding the bytes are in and the code must convert
from a byte \fBstr\fP to a \fBunicode\fP string in order to function.
.sp
In order to make this code robust we must know the encoding of \fBmsg\fP\&.
The only way to know that is to ask the user so the API must do that:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
import unicodedata
def number_of_chars(msg, encoding=\(aqutf8\(aq, errors=\(aqstrict\(aq):
    if not isinstance(msg, unicode):
        msg = unicode(msg, encoding, errors)
    msg = unicodedata.normalize(\(aqNFC\(aq, msg)
    return len(msg)
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Another example of failure:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
import os
def listdir(directory):
    files = os.listdir(directory)
    if isinstance(directory, str):
        return files
    # files could contain both bytes and unicode
    new_files = []
    for filename in files:
        if not isinstance(filename, unicode):
            # What to do here?
            continue
        new_files.append(filename)
    return new_files
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
This function illustrates the second failure mode.  Here, not all of the
possible values can be represented as \fBunicode\fP without knowing more
about the encoding of each of the filenames involved.  Since each filename
could have a different encoding there are a few different options to pursue.  We
could make this function always return byte \fBstr\fP since that can
accurately represent anything that could be returned.  If we want to return
\fBunicode\fP we need to at least allow the user to specify what to do in
case of an error decoding the bytes to \fBunicode\fP\&.  We can also let the
user specify the encoding to use for doing the decoding but that won\(aqt help in
all cases since not all files will be in the same encoding (or even
necessarily in any encoding):
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
import locale
import os
def listdir(directory, encoding=locale.getpreferredencoding(), errors=\(aqstrict\(aq):
    # Note: In python\-3.1+, surrogateescape may be a better default
    files = os.listdir(directory)
    if isinstance(directory, str):
        return files
    new_files = []
    for filename in files:
        if not isinstance(filename, unicode):
            filename = unicode(filename, encoding=encoding, errors=errors)
        new_files.append(filename)
    return new_files
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Note that although we use \fBerrors\fP in this example as what to pass to
the codec that decodes to \fBunicode\fP we could also have an
\fBerrors\fP argument that decides other things to do like skip a filename
entirely, return a placeholder (\fBNondisplayable filename\fP), or raise an
exception.
.sp
This leaves us with one last failure to describe:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
def first_field(csv_string):
    \(aq\(aq\(aqReturn the first field in a comma separated values string.\(aq\(aq\(aq
    try:
        return csv_string[:csv_string.index(\(aq,\(aq)]
    except ValueError:
        return csv_string
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
This code looks simple enough.  The hidden error here is that we are searching
for a comma character in a byte \fBstr\fP but not all encodings will use
the same sequence of bytes to represent the comma.  If you use an encoding
that\(aqs not ASCII compatible on the byte level, then the literal comma
\fB\(aq,\(aq\fP in the above code will match inappropriate bytes.  Some examples of
how it can fail:
.INDENT 0.0
.IP \(bu 2
Will find the byte representing an ASCII comma in another character
.IP \(bu 2
Will find the comma but leave trailing garbage bytes on the end of the
string
.IP \(bu 2
Will not match the character that represents the comma in this encoding
.UNINDENT
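.sp
Each of the three failures listed above can be reproduced in python3.  The
encodings and sample characters below were picked only because they make
the problem easy to see; treat this as an illustrative sketch.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
comma = b","    # the single byte 0x2c

# 1) The comma byte shows up inside another character
#    (U+2C00 is encoded as the bytes 0x2c 0x00)
other_char = chr(0x2C00).encode("utf_16_be")
false_positive = comma in other_char

# 2) The comma is found but the slice ends in the middle of a character
data = "a,b".encode("utf_16_be")
field = data[:data.index(comma)]

# 3) The comma is a different byte entirely (EBCDIC code page 500)
ebcdic = "a,b".encode("cp500")
missed = comma not in ebcdic
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
In the second case \fBfield\fP is three bytes long, which cannot be valid
UTF\-16 because every UTF\-16 code unit is two bytes.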
.sp
There are two ways to solve this.  You can either take the encoding value from
the user or you can take the separator value from the user.  Of the two,
taking the encoding is the better option for two reasons:
.INDENT 0.0
.IP 1. 3
Taking a separator argument doesn\(aqt clearly document for the API user that
the reason they must give it is to properly match the encoding of the
\fBcsv_string\fP\&.  They\(aqre just as likely to think that it\(aqs simply a way
to specify an alternate character (like ":" or "|") for the separator.
.IP 2. 3
It\(aqs possible for a variable width encoding to reuse the same byte sequence
for different characters in multiple sequences.
.sp
\fBNOTE:\fP
.INDENT 3.0
.INDENT 3.5
UTF\-8 is resistant to this as any character\(aqs sequence of
bytes will never be a subset of another character\(aqs sequence of bytes.
.UNINDENT
.UNINDENT
.UNINDENT
.sp
With that in mind, here\(aqs how to improve the API:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
def first_field(csv_string, encoding=\(aqutf\-8\(aq, errors=\(aqreplace\(aq):
    if not isinstance(csv_string, unicode):
        u_string = unicode(csv_string, encoding, errors)
        is_unicode = False
    else:
        u_string = csv_string
        is_unicode = True

    try:
        field = u_string[:u_string.index(u\(aq,\(aq)]
    except ValueError:
        return csv_string

    if not is_unicode:
        field = field.encode(encoding, errors)
    return field
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
\fBNOTE:\fP
.INDENT 0.0
.INDENT 3.5
If you decide you\(aqll never encounter a variable width encoding that reuses
byte sequences you can use this code instead:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
def first_field(csv_string, encoding=\(aqutf\-8\(aq):
    try:
        return csv_string[:csv_string.index(u\(aq,\(aq.encode(encoding))]
    except ValueError:
        return csv_string
.ft P
.fi
.UNINDENT
.UNINDENT
.UNINDENT
.UNINDENT
.SS Separate functions
.sp
Sometimes you want to be able to take either byte \fBstr\fP or
\fBunicode\fP strings, perform similar operations on either one and then
return data in the same format as was given.  Probably the easiest way to do
that is to have separate functions for each and adopt a naming convention to
show that one is for working with byte \fBstr\fP and the other is for
working with \fBunicode\fP strings:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
def translate_b(msg, table):
    \(aq\(aq\(aqReplace values in str with other byte values like unicode.translate\(aq\(aq\(aq
    if not isinstance(msg, str):
        raise TypeError(\(aqmsg must be of type str\(aq)
    str_table = [chr(s) for s in xrange(0,256)]
    delete_chars = []
    for chr_val in (k for k in table.keys() if isinstance(k, int)):
        if chr_val > 255:
            raise ValueError(\(aqKeys in table must not exceed 255\(aq)
        if table[chr_val] is None:
            delete_chars.append(chr(chr_val))
        elif isinstance(table[chr_val], int):
            if not 0 <= table[chr_val] <= 255:
                raise ValueError(\(aqtable values cannot be more than 255 or less than 0\(aq)
            str_table[chr_val] = chr(table[chr_val])
        else:
            if not isinstance(table[chr_val], str):
                raise TypeError(\(aqcharacter mapping must return integer, None or str\(aq)
            str_table[chr_val] = table[chr_val]
    str_table = \(aq\(aq.join(str_table)
    delete_chars = \(aq\(aq.join(delete_chars)
    return msg.translate(str_table, delete_chars)

def translate(msg, table):
    \(aq\(aq\(aqReplace values in a unicode string with other values\(aq\(aq\(aq
    if not isinstance(msg, unicode):
        raise TypeError(\(aqmsg must be of type unicode\(aq)
    return msg.translate(table)
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
There are several things that we have to do in this API:
.INDENT 0.0
.IP \(bu 2
Because the function names alone might not tell the user which value types
are expected, we have to check that the types are correct.
.IP \(bu 2
We keep the behaviour of the two functions as close to the same as possible,
just with byte \fBstr\fP and \fBunicode\fP strings substituted for
each other.
.UNINDENT
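.sp
The same two rules carry over to python3, where the types are named
\fBbytes\fP and \fBstr\fP\&.  The pair of functions below is a hypothetical,
much smaller example than the one above; it is only meant to show the
naming convention and the type checks side by side.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
def underscore_b(msg):
    # Byte string variant: accepts and returns bytes only
    if not isinstance(msg, bytes):
        raise TypeError("msg must be of type bytes")
    return msg.replace(b" ", b"_")

def underscore(msg):
    # Text variant: accepts and returns str only
    if not isinstance(msg, str):
        raise TypeError("msg must be of type str")
    return msg.replace(" ", "_")
```
.ft P
.fi
.UNINDENT
.UNINDENT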
.SS Deciding whether to take str or unicode when no value is returned
.sp
Not all functions have a return value.  Sometimes a function is there to
interact with something external to python, for instance, writing a file out
to disk or a method exists to update the internal state of a data structure.
One of the main questions with these APIs is whether to take byte
\fBstr\fP, \fBunicode\fP string, or both.  The answer depends on your
use case but I\(aqll give some examples here.
.SS Writing to external data
.sp
When your information is going to an external data source like writing to
a file you need to decide whether to take in \fBunicode\fP strings or byte
\fBstr\fP\&.  Remember that most external data sources are not going to be
dealing with unicode directly.  Instead, they\(aqre going to be dealing with
a sequence of bytes that may be interpreted as unicode.  With that in mind,
you either need to have the user give you a byte \fBstr\fP or convert to
a byte \fBstr\fP inside the function.
.sp
Next you need to think about the type of data that you\(aqre receiving.  If it\(aqs
textual data (for instance, this is a chat client and the user is typing
messages that they expect to be read by another person), it probably makes sense to
take in \fBunicode\fP strings and do the conversion inside your function.
On the other hand, if this is a lower level function that\(aqs passing data into
a network socket, it probably should be taking byte \fBstr\fP instead.
.sp
Just as noted in the API notes above, you should specify an \fBencoding\fP
and \fBerrors\fP argument if you need to transform from \fBunicode\fP
string to byte \fBstr\fP and you are unable to guess the encoding from the
data itself.
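.sp
A minimal python3 sketch of such a function follows.  The name
\fBwrite_message()\fP and the use of an in\-memory \fBio.BytesIO\fP stream are
assumptions made for the sake of a runnable example.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
import io

def write_message(stream, msg, encoding="utf8", errors="strict"):
    # Text is converted here so the caller controls how it happens
    if isinstance(msg, str):
        msg = msg.encode(encoding, errors)
    # Byte strings are assumed to be ready for the wire already
    stream.write(msg)

buf = io.BytesIO()
# U+00E9 is a single byte in latin1 but two bytes in UTF8
write_message(buf, "caf" + chr(0xE9), encoding="latin1")
written = buf.getvalue()
```
.ft P
.fi
.UNINDENT
.UNINDENT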
.SS Updating data structures
.sp
Sometimes your API is just going to update a data structure and not
immediately output that data anywhere.  Just as when writing external data,
you should think about both what your function is going to do with the data
eventually and what the caller of your function is thinking that they\(aqre
giving you.  Most of the time, you\(aqll want to take \fBunicode\fP strings
and enter them into the data structure as \fBunicode\fP when the data is
textual in nature.  You\(aqll want to take byte \fBstr\fP and enter them into
the data structure as byte \fBstr\fP when the data is not text.  Use
a naming convention so the user knows what\(aqs expected.
.SS APIs to Avoid
.sp
There are a few APIs that are just wrong.  If you catch yourself making an API
that does one of these things, change it before anyone sees your code.
.SS Returning unicode unless a conversion fails
.sp
This type of API usually deals with byte \fBstr\fP at some point and
converts it to \fBunicode\fP because it\(aqs usually thought to be text.
However, there are times when the bytes fail to convert to a \fBunicode\fP
string.  When that happens, this API returns the raw byte \fBstr\fP instead
of a \fBunicode\fP string.  One example of this is present in the \fI\%python standard library\fP:
python2\(aqs \fBos.listdir()\fP:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
>>> import os
>>> import locale
>>> locale.getpreferredencoding()
\(aqUTF\-8\(aq
>>> os.mkdir(\(aq/tmp/mine\(aq)
>>> os.chdir(\(aq/tmp/mine\(aq)
>>> open(\(aqnonsense_char_\exff\(aq, \(aqw\(aq).close()
>>> open(\(aqall_ascii\(aq, \(aqw\(aq).close()
>>> os.listdir(u\(aq.\(aq)
[u\(aqall_ascii\(aq, \(aqnonsense_char_\exff\(aq]
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
The problem with APIs like this is that they cause failures that are hard to
debug because they don\(aqt happen where the variables are set.  For instance,
let\(aqs say you take the filenames from \fBos.listdir()\fP and give it to this
function:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
def normalize_filename(filename):
    \(aq\(aq\(aqChange spaces and dashes into underscores\(aq\(aq\(aq
    return filename.translate({ord(u\(aq \(aq):u\(aq_\(aq, ord(u\(aq\-\(aq):u\(aq_\(aq})
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
When you test this, you use filenames that all are decodable in your preferred
encoding and everything seems to work.  But when this code is run on a machine
that has filenames in multiple encodings the filenames returned by
\fBos.listdir()\fP suddenly include byte \fBstr\fP\&.  And byte \fBstr\fP
has a different \fBstring.translate()\fP function that takes different values.
So the code raises an exception where it\(aqs not immediately obvious that
\fBos.listdir()\fP is at fault.
.SS Ignoring values with no chance of recovery
.sp
An early version of python3 attempted to fix the \fBos.listdir()\fP problem
pointed out in the last section by returning all values that were decodable to
\fBunicode\fP and omitting the filenames that were not.  This led to the
following output:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
>>> import os
>>> import locale
>>> locale.getpreferredencoding()
\(aqUTF\-8\(aq
>>> os.mkdir(\(aq/tmp/mine\(aq)
>>> os.chdir(\(aq/tmp/mine\(aq)
>>> open(b\(aqnonsense_char_\exff\(aq, \(aqw\(aq).close()
>>> open(\(aqall_ascii\(aq, \(aqw\(aq).close()
>>> os.listdir(\(aq.\(aq)
[\(aqall_ascii\(aq]
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
The issue with this type of code is that it is silently doing something
surprising.  The caller expects to get a full list of files back from
\fBos.listdir()\fP\&.  Instead, it silently ignores some of the files, returning
only a subset.  This leads to code that doesn\(aqt do what is expected that may
go unnoticed until the code is in production and someone notices that
something important is being missed.
.SS Raising a UnicodeException with no chance of recovery
.sp
Believe it or not, a few libraries exist that make it impossible to deal
with unicode text without raising a \fBUnicodeError\fP\&.  What seems to occur
in these libraries is that the library has functions that expect to receive
a \fBunicode\fP string.  However, internally, those functions call other
functions that expect to receive a byte \fBstr\fP\&.  The programmer of the
API was smart enough to convert from a \fBunicode\fP string to a byte
\fBstr\fP but they did not give the user the chance to specify the
encodings to use or how to deal with errors.  This results in exceptions when
the user passes in a byte \fBstr\fP because the initial function wants
a \fBunicode\fP string and exceptions when the user passes in
a \fBunicode\fP string because the function can\(aqt convert the string to
bytes in the encoding that it\(aqs selected.
.sp
Do not put the user in the position of not being able to use your API without
raising a \fBUnicodeError\fP with certain values.  If you can only safely
take \fBunicode\fP strings, document that byte \fBstr\fP is not allowed
and vice versa.  If you have to convert internally, make sure to give the
caller of your function parameters to control the encoding and how to treat
errors that may occur during the encoding/decoding process.  If your code will
raise a \fBUnicodeError\fP with non\-ASCII values no matter what, you
should probably rethink your API.
.SS Knowing your data
.sp
If you\(aqve read all the way down to this section without skipping you\(aqve seen
several admonitions about the type of data you are processing affecting the
viability of the various API choices.
.sp
Here are a few things to consider in your data:
.SS Do you need to operate on both bytes and unicode?
.sp
Much of the data in libraries, programs, and the general environment outside
of python is written where strings are sequences of bytes.  So when we
interact with data that comes from outside of python or data that is about to
leave python it may make sense to only operate on the data as a byte
\fBstr\fP\&.  There are two times when this may make sense:
.INDENT 0.0
.IP 1. 3
The user is intended to hand the data to the function and then the function
takes care of sending the data outside of python (to the filesystem, over
the network, etc).
.IP 2. 3
The data is not representable as text.  For instance, writing a binary
file format.
.UNINDENT
.sp
Even when your code is operating in this area you still need to think a little
more about your data.  For instance, it might make sense for the person using
your API to pass in \fBunicode\fP strings and let the function convert that
into the byte \fBstr\fP that it then sends over the wire.
.sp
There are also times when it might make sense to operate only on
\fBunicode\fP strings.  \fBunicode\fP represents text so anytime that
you are working on textual data that isn\(aqt going to leave python it has the
potential to be a \fBunicode\fP\-only API.  However, there are two things that
you should consider when designing a \fBunicode\fP\-only API:
.INDENT 0.0
.IP 1. 3
As your API gains popularity, people are going to use your API in places
that you may not have thought of.  Corner cases in these other places may
mean that processing bytes is desirable.
.IP 2. 3
In python2, byte \fBstr\fP and \fBunicode\fP are often used
interchangeably with each other.  That means that people programming against
your API may have received \fBstr\fP from some other API and it would be
most convenient for their code if your API accepted it.
.UNINDENT
.sp
\fBNOTE:\fP
.INDENT 0.0
.INDENT 3.5
In python3, the separation between the text type and the byte type
is clearer.  So in python3, there\(aqs less need to have all APIs take
both unicode and bytes.
.UNINDENT
.UNINDENT
.SS Can you restrict the encodings?
.sp
If you determine that you have to deal with byte \fBstr\fP you should
realize that not all encodings are created equal.  Each has different
properties that may make it possible to provide a simpler API provided that
you can reasonably tell the users of your API that they cannot use certain
classes of encodings.
.sp
As one example, if you are required to find a comma (\fB,\fP) in a byte
\fBstr\fP you have different choices based on what encodings are allowed.
If you can reasonably restrict your API users to only giving ASCII
compatible encodings you can do this simply by searching for the literal
comma character because that character will be represented by the same byte
sequence in all ASCII compatible encodings.
.sp
The following are some classes of encodings to be aware of as you decide how
generic your code needs to be.
.SS Single byte encodings
.sp
Single byte encodings can only represent 256 total characters.  They encode
the code points for a character to the equivalent number in a single
byte.
.sp
Most single byte encodings are ASCII compatible\&.  ASCII
compatible encodings are the most likely to be usable without changes to code
so this is good news.  A notable exception to this is the \fI\%EBCDIC\fP
family of encodings.
.SS Multibyte encodings
.sp
Multibyte encodings use more than one byte to encode some characters.
.SS Fixed width
.sp
Fixed width encodings have a set number of bytes to represent all of the
characters in the character set.  \fBUTF\-32\fP is an example of a fixed width
encoding that uses four bytes per character and can express every unicode
character.  There are a number of problems with writing APIs that need to
operate on fixed width, multibyte characters.  To go back to our earlier
example of finding a comma in a string, we have to realize that even in
\fBUTF\-32\fP where the code point for ASCII characters is the
same as in ASCII, the byte sequence for them is different.  So you
cannot search for the literal byte character as it may pick up false
positives and may break a byte sequence in an odd place.
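.sp
One way to avoid breaking a byte sequence in an odd place is to compare only
at offsets that are a multiple of the character width.  The python3 helper
below, \fBfind_aligned()\fP, is a hypothetical sketch of that idea and
returns a byte offset rather than a character index.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
def find_aligned(data, needle, width=4):
    # Only compare at offsets that are a multiple of the character width
    for offset in range(0, len(data), width):
        if data[offset:offset + width] == needle:
            return offset
    return None

needle = ",".encode("utf_32_be")    # the bytes 00 00 00 2c

# Two characters, neither of which is a comma:
#   U+0100 is 00 00 01 00 and U+2C00 is 00 00 2c 00
data = (chr(0x0100) + chr(0x2C00)).encode("utf_32_be")

# A plain search matches across the boundary between the two characters
naive = data.find(needle)

# The aligned search correctly finds nothing
aligned = find_aligned(data, needle)
```
.ft P
.fi
.UNINDENT
.UNINDENT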
.SS Variable Width
.SS ASCII compatible
.sp
UTF\-8 and the \fI\%EUC\fP
family of encodings are examples of ASCII compatible multi\-byte
encodings.  They achieve this by adhering to two principles:
.INDENT 0.0
.IP \(bu 2
All of the ASCII characters are represented by the byte that they
are in the ASCII encoding.
.IP \(bu 2
None of the ASCII byte sequences are reused in any other byte
sequence for a different character.
.UNINDENT
.SS Escaped
.sp
Some multibyte encodings work by using only bytes from the ASCII
encoding but when a particular sequence of those bytes is found, they are
interpreted as meaning something other than their ASCII values.
\fBUTF\-7\fP is one such encoding that can encode all of the unicode
code points\&.  For instance, here are some Japanese characters encoded as
\fBUTF\-7\fP:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
>>> a = u\(aq\eu304f\eu3089\eu3068\eu307f\(aq
>>> print a
くらとみ
>>> print a.encode(\(aqutf\-7\(aq)
+ME8wiTBoMH8\-
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
These encodings can be used when you need to encode unicode data that may
contain non\-ASCII characters for inclusion in an ASCII only
transport medium or file.
.sp
However, they are not ASCII compatible in the sense that we used
earlier as the bytes that represent an ASCII character are being reused
as part of other characters.  If you were to search for a literal plus sign in
this encoded string, you would run across many false positives, for instance.
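.sp
The plus sign problem can be checked with the \fButf\-7\fP codec from the
python standard library.  This is only a sketch; the variable names are
invented for the illustration:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
# -*- coding: utf-8 -*-
text = u"くらとみ"    # contains no plus sign
encoded = text.encode("utf-7")

# Every byte of the encoded form is in the ASCII range...
only_ascii = all(byte < 0x80 for byte in bytearray(encoded))
# ...but the marker that starts the escaped run is a plus sign byte
false_positive = b"+" in encoded
# The data itself is intact when decoded with the right codec
round_trip = encoded.decode("utf-7") == text
```
.ft P
.fi
.UNINDENT
.UNINDENT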
.SS Other
.sp
There are many other popular variable width encodings, for instance \fBUTF\-16\fP
and \fBshift\-JIS\fP\&.  Many of these are not ASCII compatible so you
cannot search for a literal ASCII character without danger of false
positives or false negatives.
.SS Kitchen API
.sp
Kitchen is structured as a collection of modules.  In its current
configuration, Kitchen ships with the following modules.  Other addon modules
that may drag in more dependencies can be found on the \fI\%project webpage\fP\&.
.SS Kitchen.i18n Module
.sp
I18N is an important piece of any modern program.  Unfortunately,
setting up i18n in your program is often a confusing process.  The
functions provided here aim to make the programming side of that a little
easier.
.sp
Most projects will be able to do something like this when they start up:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
# myprogram/__init__.py:

import os
import sys

from kitchen.i18n import easy_gettext_setup

_, N_  = easy_gettext_setup(\(aqmyprogram\(aq, localedirs=(
        os.path.join(os.path.realpath(os.path.dirname(__file__)), \(aqlocale\(aq),
        os.path.join(sys.prefix, \(aqlib\(aq, \(aqlocale\(aq)
        ))
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Then, in other files that have strings that need translating:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
# myprogram/commands.py:

from myprogram import _, N_

def print_usage():
    print _(u"""available commands are:
    \-\-help              Display help
    \-\-version           Display version of this program
    \-\-bake\-me\-a\-cake    as fast as you can
        """)

def print_invitations(age):
    print _(\(aqPlease come to my party.\(aq)
    print N_(\(aqI will be turning %(age)s year old\(aq,
        \(aqI will be turning %(age)s years old\(aq, age) % {\(aqage\(aq: age}
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
See the documentation of \fI\%easy_gettext_setup()\fP and
\fI\%get_translation_object()\fP for more details.
.INDENT 0.0
.INDENT 3.5
.sp
\fBSEE ALSO:\fP
.INDENT 0.0
.INDENT 3.5
.INDENT 0.0
.TP
.B \fBgettext\fP
for details of how the python gettext facilities work
.TP
.B \fI\%babel\fP
The babel module for in depth information on gettext, message
catalogs, and translating your app.  babel provides some nice
features for i18n on top of \fBgettext\fP
.UNINDENT
.UNINDENT
.UNINDENT
.UNINDENT
.UNINDENT
.SS Functions
.sp
\fI\%easy_gettext_setup()\fP should satisfy the needs of most users.
\fI\%get_translation_object()\fP is designed to ease the way for anyone that
needs more control.
.INDENT 0.0
.TP
.B kitchen.i18n.easy_gettext_setup(domain, localedirs=(), use_unicode=True)
Setup translation functions for an application
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.IP \(bu 2
\fBdomain\fP \-\- Name of the message domain.  This should be a unique name
that can be used to lookup the message catalog for this app.
.IP \(bu 2
\fBlocaledirs\fP \-\- Iterator of directories to look for message
catalogs under.  The first directory to exist is used regardless of
whether messages for this domain are present.  If none of the
directories exist, fall back on \fBsys.prefix\fP + \fB/share/locale\fP\&.
Default: No directories to search so we just use the fallback.
.IP \(bu 2
\fBuse_unicode\fP \-\- If \fBTrue\fP return the \fBgettext\fP functions
for \fBunicode\fP strings else return the functions for byte
\fBstr\fP for the translations.  Default is \fBTrue\fP\&.
.UNINDENT
.TP
.B Returns
tuple of the \fBgettext\fP function and \fBgettext\fP function
for plurals
.UNINDENT
.sp
Setting up \fBgettext\fP can be a little tricky because of lack of
documentation.  This function will set up \fBgettext\fP using the
\fI\%Class\-based API\fP for you.
For the simple case, you can use the default arguments and call it like
this:
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
_, N_ = easy_gettext_setup(\(aqmyprogram\(aq)
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
This will get you two functions, \fB_()\fP and \fBN_()\fP that you can use
to mark strings in your code for translation.  \fB_()\fP is used to mark
strings that don\(aqt need to worry about plural forms no matter what the
value of the variable is.  \fBN_()\fP is used to mark strings that do need
to have a different form if a variable in the string is plural.
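.sp
To show how the two functions behave without installing any message
catalogs, here is a rough stand\-in built on \fBgettext.NullTranslations\fP
from the python standard library.  It is not what \fI\%easy_gettext_setup()\fP
returns, and \fBinvitation()\fP is a made up helper; the sketch only
illustrates when each function is appropriate:
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
```python
import gettext

# Stand-in for the pair returned by easy_gettext_setup(): with no message
# catalog loaded, NullTranslations hands back the English strings unchanged.
translations = gettext.NullTranslations()
_ = translations.gettext
N_ = translations.ngettext

def invitation(age):
    """Build the invitation text, choosing the plural form from age."""
    # No number is involved here, so _() is enough
    greeting = _("Please come to my party.")
    # The wording depends on the value of age, so N_() picks the form
    turning = N_("I will be turning %(age)s year old",
                 "I will be turning %(age)s years old", age) % {"age": age}
    return greeting + " " + turning
```
.ft P
.fi
.UNINDENT
.UNINDENT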
.sp
\fBSEE ALSO:\fP
.INDENT 7.0
.INDENT 3.5
.INDENT 0.0
.TP
.B api\-i18n
This module\(aqs documentation has examples of using \fB_()\fP and \fBN_()\fP
.TP
.B \fI\%get_translation_object()\fP
for information on how to use \fBlocaledirs\fP to get the
proper message catalogs both when in development and when
installed to FHS compliant directories on Linux.
.UNINDENT
.UNINDENT
.UNINDENT
.sp
\fBNOTE:\fP
.INDENT 7.0
.INDENT 3.5
The gettext functions returned from this function should be superior
to the ones returned from \fBgettext\fP\&.  The traits that make them
better are described in the \fI\%DummyTranslations\fP and
\fI\%NewGNUTranslations\fP documentation.
.UNINDENT
.UNINDENT
.sp
Changed in version kitchen\-0.2.4: ; API kitchen.i18n 2.0.0
Changed \fI\%easy_gettext_setup()\fP to return the lgettext
functions instead of gettext functions when use_unicode=False.

.UNINDENT
.INDENT 0.0
.TP
.B kitchen.i18n.get_translation_object(domain, localedirs=(), languages=None, class_=None, fallback=True, codeset=None, python2_api=True)
Get a translation object bound to the message catalogs
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.IP \(bu 2
\fBdomain\fP \-\- Name of the message domain.  This should be a unique name
that can be used to lookup the message catalog for this app or
library.
.IP \(bu 2
\fBlocaledirs\fP \-\- Iterator of directories to look for
message catalogs under.  The directories are searched in order
for message catalogs\&.  For each of the directories searched,
we check for message catalogs in any language specified
in \fBlanguages\fP\&.  The message catalogs are used to create
the Translation object that we return.  The Translation object will
attempt to lookup the msgid in the first catalog that we found.  If
it\(aqs not in there, it will go through each subsequent catalog looking
for a match.  For this reason, the order in which you specify the
\fBlocaledirs\fP may be important.  If no message catalogs
are found, either return a \fI\%DummyTranslations\fP object or raise
an \fBIOError\fP depending on the value of \fBfallback\fP\&.
The default localedir from \fBgettext\fP which is
\fBos.path.join(sys.prefix, \(aqshare\(aq, \(aqlocale\(aq)\fP on Unix is
implicitly appended to the \fBlocaledirs\fP, making it the last
directory searched.
.IP \(bu 2
\fBlanguages\fP \-\- 
.sp
Iterator of language codes to check for
message catalogs\&.  If unspecified, the user\(aqs locale settings
will be used.
.sp
\fBSEE ALSO:\fP
.INDENT 2.0
.INDENT 3.5
\fBgettext.find()\fP for information on what environment
variables are used.
.UNINDENT
.UNINDENT

.IP \(bu 2
\fBclass_\fP \-\- The class to use to extract translations from the
message catalogs\&.  Defaults to \fI\%NewGNUTranslations\fP\&.
.IP \(bu 2
\fBfallback\fP \-\- If set to \fBFalse\fP, raise an \fBIOError\fP if no
message catalogs are found.  If \fBTrue\fP, the default,
return a \fI\%DummyTranslations\fP object.
.IP \(bu 2
\fBcodeset\fP \-\- Set the character encoding to use when returning byte
\fBstr\fP objects.  This is equivalent to calling
\fBoutput_charset()\fP on the Translations
object that is returned from this function.
.IP \(bu 2
\fBpython2_api\fP \-\- When \fBTrue\fP (default), return Translation objects
that use the python2 gettext api
(\fBgettext()\fP and
\fBlgettext()\fP return byte
\fBstr\fP\&.  \fBugettext()\fP exists and
returns \fBunicode\fP strings).  When \fBFalse\fP, return
Translation objects that use the python3 gettext api (gettext returns
\fBunicode\fP strings and lgettext returns byte \fBstr\fP\&.
ugettext does not exist.)
.UNINDENT
.TP
.B Returns
Translation object to get \fBgettext\fP methods from
.UNINDENT
.sp
If you need more flexibility than \fI\%easy_gettext_setup()\fP, use this
function.  It sets up a \fBgettext\fP Translation object and returns it
to you.  Then you can access any of the methods of the object that you
need directly.  For instance, if you specifically need to access
\fBlgettext()\fP:
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
translations = get_translation_object(\(aqfoo\(aq)
translations.lgettext(\(aqMy Message\(aq)
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
This function is similar to the \fI\%python standard library\fP \fBgettext.translation()\fP but
improves on it in two ways:
.INDENT 7.0
.IP 1. 3
.INDENT 3.0
.TP
.B It returns \fI\%NewGNUTranslations\fP or \fI\%DummyTranslations\fP
objects by default.  These are superior to the
\fBgettext.GNUTranslations\fP and \fBgettext.NullTranslations\fP
objects because they are consistent in the string type they return and
they fix several issues that can cause the \fI\%python standard library\fP objects to throw
\fBUnicodeError\fP\&.
.UNINDENT
.IP 2. 3
.INDENT 3.0
.TP
.B This function takes multiple directories to search for
message catalogs\&.
.UNINDENT
.UNINDENT
.sp
The latter is important when setting up \fBgettext\fP in a portable
manner.  There is not a common directory for translations across operating
systems so one needs to look in multiple directories for the translations.
\fI\%get_translation_object()\fP is able to handle that if you give it
a list of directories to search for catalogs:
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
translations = get_translation_object(\(aqfoo\(aq, localedirs=(
     os.path.join(os.path.realpath(os.path.dirname(__file__)), \(aqlocale\(aq),
     os.path.join(sys.prefix, \(aqlib\(aq, \(aqlocale\(aq)))
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
This will search for message catalogs in several different directories:
.INDENT 7.0
.IP 1. 3
A directory named \fBlocale\fP in the same directory as the module
that called \fI\%get_translation_object()\fP,
.IP 2. 3
In \fB/usr/lib/locale\fP
.IP 3. 3
In \fB/usr/share/locale\fP (the fallback directory)
.UNINDENT
.sp
This allows \fBgettext\fP to work on Windows and in development (where the
message catalogs are typically in the toplevel module directory)
and also when installed under Linux (where the message catalogs
are installed in \fB/usr/share/locale\fP).  You (or the system packager)
just need to install the message catalogs in
\fB/usr/share/locale\fP and remove the \fBlocale\fP directory from the
module to make this work.  For example:
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
In development:
    ~/foo   # Toplevel module directory
    ~/foo/__init__.py
    ~/foo/locale    # With message catalogs below here:
    ~/foo/locale/es/LC_MESSAGES/foo.mo

Installed on Linux:
    /usr/lib/python2.7/site\-packages/foo
    /usr/lib/python2.7/site\-packages/foo/__init__.py
    /usr/share/locale/  # With message catalogs below here:
    /usr/share/locale/es/LC_MESSAGES/foo.mo
.ft P
.fi
.UNINDENT
.UNINDENT
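.sp
The directory search can be approximated with \fBgettext.find()\fP from the
python standard library.  This is a sketch under stated assumptions, not
kitchen\(aqs implementation: \fBfind_catalogs()\fP is a hypothetical helper,
and the fallback directory is assumed to be
\fBos.path.join(sys.prefix, \(aqshare\(aq, \(aqlocale\(aq)\fP:
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
```python
import gettext
import os
import sys

def find_catalogs(domain, localedirs, languages):
    """Return the paths of every catalog found, in search order."""
    # The gettext default directory is searched last
    search_path = list(localedirs) + [
        os.path.join(sys.prefix, "share", "locale")]
    found = []
    for localedir in search_path:
        # all=True makes gettext.find() return a list of matches
        # instead of only the first one
        found.extend(gettext.find(domain, localedir, languages, all=True))
    return found
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Calling \fBfind_catalogs(\(aqfoo\(aq, [\(aq/home/user/foo/locale\(aq], [\(aqes\(aq])\fP
would list the development catalog first and any installed catalog after it.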
.sp
\fBNOTE:\fP
.INDENT 7.0
.INDENT 3.5
This function will setup Translation objects that attempt to lookup
msgids in all of the found message catalogs\&.  This means if
you have several versions of the message catalogs installed
in different directories that the function searches, you need to make
sure that \fBlocaledirs\fP specifies the directories so that newer
message catalogs are searched first.  It also means that if
a newer catalog does not contain a translation for a msgid but an
older one that\(aqs in \fBlocaledirs\fP does, the translation from that
older catalog will be returned.
.UNINDENT
.UNINDENT
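.sp
The lookup order described in this note can be pictured with the fallback
mechanism of the python standard library \fBgettext\fP classes.  The
\fBDictTranslations\fP class and its messages are invented for this sketch
and stand in for real message catalogs:
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
```python
# -*- coding: utf-8 -*-
import gettext

class DictTranslations(gettext.NullTranslations):
    """Hypothetical in-memory catalog used only for this illustration."""

    def __init__(self, messages):
        gettext.NullTranslations.__init__(self)
        self._messages = messages

    def gettext(self, message):
        if message in self._messages:
            return self._messages[message]
        # Not in this catalog: defer to the next catalog, if there is one
        return gettext.NullTranslations.gettext(self, message)

# The newer catalog is searched first; the older one is its fallback
newer = DictTranslations({"Hello": "Hola"})
older = DictTranslations({"Hello": "Buenas", "Goodbye": "Adiós"})
newer.add_fallback(older)
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Here \fBnewer.gettext("Hello")\fP is answered by the newer catalog while
\fBnewer.gettext("Goodbye")\fP falls through to the older one, which is why
the order of \fBlocaledirs\fP matters.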
.sp
Changed in version kitchen\-1.1.0: ; API kitchen.i18n 2.1.0
Add more parameters to \fI\%get_translation_object()\fP so
it can more easily be used as a replacement for
\fBgettext.translation()\fP\&.  Also change the way we use localedirs.
We cycle through them until we find a suitable locale file rather
than simply cycling through until we find a directory that exists.
The new code is based heavily on the \fI\%python standard library\fP
\fBgettext.translation()\fP function.

.sp
Changed in version kitchen\-1.2.0: ; API kitchen.i18n 2.2.0
Add python2_api parameter

.UNINDENT
.SS Translation Objects
.sp
The standard translation objects from the \fBgettext\fP module suffer from
several problems:
.INDENT 0.0
.IP \(bu 2
They can throw \fBUnicodeError\fP
.IP \(bu 2
They can\(aqt find translations for non\-ASCII byte \fBstr\fP
messages
.IP \(bu 2
They may return either \fBunicode\fP string or byte \fBstr\fP from the
same function even though the functions say they will only return
\fBunicode\fP or only return byte \fBstr\fP\&.
.UNINDENT
.sp
\fI\%DummyTranslations\fP and \fI\%NewGNUTranslations\fP were written to fix
these issues.
.INDENT 0.0
.TP
.B class kitchen.i18n.DummyTranslations(fp=None, python2_api=True)
Safer version of \fBgettext.NullTranslations\fP
.sp
This Translations class doesn\(aqt translate the strings and is intended to
be used as a fallback when there were errors setting up a real
Translations object.  It\(aqs safer than \fBgettext.NullTranslations\fP in
its handling of byte \fBstr\fP vs \fBunicode\fP strings.
.sp
Unlike \fBNullTranslations\fP, this Translation class will
never throw a \fBUnicodeError\fP\&.  The code that you have
around a call to \fI\%DummyTranslations\fP might throw
a \fBUnicodeError\fP but at least that will be in code you
control and can fix.  Also, unlike \fBNullTranslations\fP all
of this Translation object\(aqs methods guarantee to return byte \fBstr\fP
except for \fBugettext()\fP and \fBungettext()\fP which guarantee to
return \fBunicode\fP strings.
.sp
When byte \fBstr\fP are returned, the strings will be encoded according
to this algorithm:
.INDENT 7.0
.IP 1. 3
If a fallback has been added, the fallback will be called first.
You\(aqll need to consult the fallback to see whether it performs any
encoding changes.
.IP 2. 3
If a byte \fBstr\fP was given, the same byte \fBstr\fP will
be returned.
.IP 3. 3
If a \fBunicode\fP string was given and \fI\%set_output_charset()\fP
has been called then we encode the string using the
\fBoutput_charset\fP
.IP 4. 3
If a \fBunicode\fP string was given and this is \fBgettext()\fP or
\fBngettext()\fP and \fB_charset\fP was set, output in that charset.
.IP 5. 3
If a \fBunicode\fP string was given and this is \fBgettext()\fP
or \fBngettext()\fP we encode it using \(aqutf\-8\(aq.
.IP 6. 3
If a \fBunicode\fP string was given and this is \fBlgettext()\fP
or \fBlngettext()\fP we encode using the value of
\fBlocale.getpreferredencoding()\fP
.UNINDENT
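.sp
Steps 3 through 6 amount to a small decision procedure.  The function
below is a hypothetical helper written for this illustration, not code from
kitchen; steps 1 and 2 (fallbacks and byte \fBstr\fP input) are left out
because they bypass the choice of encoding:
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
```python
import locale

def pick_output_encoding(method, output_charset=None, catalog_charset=None):
    """Choose the encoding for a unicode message (hypothetical helper)."""
    if output_charset:
        # Step 3: an explicitly set output charset always wins
        return output_charset
    if method in ("gettext", "ngettext"):
        # Steps 4 and 5: the catalog charset if known, otherwise UTF-8
        return catalog_charset or "utf-8"
    if method in ("lgettext", "lngettext"):
        # Step 6: fall back on the preferred encoding of the locale
        return locale.getpreferredencoding()
    raise ValueError("not a byte str returning method: %r" % (method,))
```
.ft P
.fi
.UNINDENT
.UNINDENT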
.sp
For \fBugettext()\fP and \fBungettext()\fP, we go through the same set of
steps with the following differences:
.INDENT 7.0
.IP \(bu 2
We transform byte \fBstr\fP into \fBunicode\fP strings for
these methods.
.IP \(bu 2
The encoding used to decode the byte \fBstr\fP is taken from
\fI\%input_charset\fP if it\(aqs set, otherwise we decode using
UTF\-8\&.
.UNINDENT
.INDENT 7.0
.TP
.B input_charset
is an extension to the \fI\%python standard library\fP \fBgettext\fP that specifies what
charset a message is encoded in when decoding a message to
\fBunicode\fP\&.  This is used for two purposes:
.UNINDENT
.INDENT 7.0
.IP 1. 3
If the message string is a byte \fBstr\fP, this is used to decode
the string to a \fBunicode\fP string before looking it up in the
message catalog\&.
.IP 2. 3
In \fBugettext()\fP and
\fBungettext()\fP methods, if a byte
\fBstr\fP is given as the message and is untranslated this is used
as the encoding when decoding to \fBunicode\fP\&.  This is different
from \fB_charset\fP which may be set when a message catalog
is loaded because \fI\%input_charset\fP is used to describe an encoding
used in a python source file while \fB_charset\fP describes the
encoding used in the message catalog file.
.UNINDENT
.sp
Any characters that aren\(aqt able to be transformed from a byte \fBstr\fP
to \fBunicode\fP string or vice versa will be replaced with
a replacement character (ie: \fBu\(aq�\(aq\fP in unicode based encodings, \fB\(aq?\(aq\fP in other
ASCII compatible encodings).
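.sp
The two replacement characters can be seen with the \fBreplace\fP error
handler of the python standard library codecs.  This sketch assumes that
handler produces the same characters this class does; the sample bytes are
made up:
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
```python
# -*- coding: utf-8 -*-
# "a" followed by 0xff, a byte that can never appear in valid UTF-8
undecodable = bytes(bytearray([0x61, 0xFF]))

# Decoding to unicode: the bad byte becomes U+FFFD
decoded = undecodable.decode("utf-8", "replace")

# Encoding to a charset that lacks the character: it becomes "?"
encoded = u"ñ".encode("ascii", "replace")
```
.ft P
.fi
.UNINDENT
.UNINDENT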
.sp
\fBSEE ALSO:\fP
.INDENT 7.0
.INDENT 3.5
.INDENT 0.0
.TP
.B \fBgettext.NullTranslations\fP
For information about what methods are available and what they do.
.UNINDENT
.UNINDENT
.UNINDENT
.sp
Changed in version kitchen\-1.1.0: ; API kitchen.i18n 2.1.0
* Although we had adapted \fBgettext()\fP, \fBngettext()\fP,
  \fBlgettext()\fP, and \fBlngettext()\fP to always return byte
  \fBstr\fP, we hadn\(aqt forced those byte \fBstr\fP to always be
  in a specified charset.  We now make sure that \fBgettext()\fP and
  \fBngettext()\fP return byte \fBstr\fP encoded using
  \fBoutput_charset\fP if set, otherwise \fBcharset\fP and if
  neither of those, UTF\-8\&.  With \fBlgettext()\fP and
  \fBlngettext()\fP \fBoutput_charset\fP if set, otherwise
  \fBlocale.getpreferredencoding()\fP\&.
* Make setting \fI\%input_charset\fP and \fBoutput_charset\fP also
  set those attributes on any fallback translation objects.

.sp
Changed in version kitchen\-1.2.0: ; API kitchen.i18n 2.2.0
Add python2_api parameter to __init__()

.INDENT 7.0
.TP
.B set_output_charset(charset)
Set the output charset
.sp
This serves two purposes.  The normal
\fBgettext.NullTranslations.set_output_charset()\fP does not set the
output on fallback objects.  On python\-2.3,
\fBgettext.NullTranslations\fP objects don\(aqt contain this method.
.UNINDENT
.UNINDENT
.INDENT 0.0
.TP
.B class kitchen.i18n.NewGNUTranslations(fp=None, python2_api=True)
Safer version of \fBgettext.GNUTranslations\fP
.sp
\fBgettext.GNUTranslations\fP suffers from two problems that this
class fixes.
.INDENT 7.0
.IP 1. 3
\fBgettext.GNUTranslations\fP can throw a
\fBUnicodeError\fP in
\fBgettext.GNUTranslations.ugettext()\fP if the message being
translated has non\-ASCII characters and there is no translation
for it.
.IP 2. 3
\fBgettext.GNUTranslations\fP can return byte \fBstr\fP from
\fBgettext.GNUTranslations.ugettext()\fP and \fBunicode\fP
strings from the other \fBgettext()\fP
methods if the message being translated is the wrong type
.UNINDENT
.sp
When byte \fBstr\fP are returned, the strings will be encoded
according to this algorithm:
.INDENT 7.0
.IP 1. 3
If a fallback has been added, the fallback will be called first.
You\(aqll need to consult the fallback to see whether it performs any
encoding changes.
.IP 2. 3
If a byte \fBstr\fP was given, the same byte \fBstr\fP will
be returned.
.IP 3. 3
If a \fBunicode\fP string was given and
\fBset_output_charset()\fP has been called then we encode the
string using the \fBoutput_charset\fP
.IP 4. 3
If a \fBunicode\fP string was given and this is \fBgettext()\fP
or \fBngettext()\fP and a charset was detected when parsing the
message catalog, output in that charset.
.IP 5. 3
If a \fBunicode\fP string was given and this is \fBgettext()\fP
or \fBngettext()\fP we encode it using UTF\-8\&.
.IP 6. 3
If a \fBunicode\fP string was given and this is \fBlgettext()\fP
or \fBlngettext()\fP we encode using the value of
\fBlocale.getpreferredencoding()\fP
.UNINDENT
.sp
For \fBugettext()\fP and \fBungettext()\fP, we go through the same set of
steps with the following differences:
.INDENT 7.0
.IP \(bu 2
We transform byte \fBstr\fP into \fBunicode\fP strings for these
methods.
.IP \(bu 2
The encoding used to decode the byte \fBstr\fP is taken from
\fI\%input_charset\fP if it\(aqs set, otherwise we decode using
UTF\-8\&.
.UNINDENT
.INDENT 7.0
.TP
.B input_charset
an extension to the \fI\%python standard library\fP \fBgettext\fP that specifies what
charset a message is encoded in when decoding a message to
\fBunicode\fP\&.  This is used for two purposes:
.UNINDENT
.INDENT 7.0
.IP 1. 3
If the message string is a byte \fBstr\fP, this is used to decode
the string to a \fBunicode\fP string before looking it up in the
message catalog\&.
.IP 2. 3
In \fBugettext()\fP and
\fBungettext()\fP methods, if a byte
\fBstr\fP is given as the message and is untranslated this is used as
the encoding when decoding to \fBunicode\fP\&.  This is different from
the \fB_charset\fP parameter that may be set when a message
catalog is loaded because \fI\%input_charset\fP is used to describe an
encoding used in a python source file while \fB_charset\fP describes
the encoding used in the message catalog file.
.UNINDENT
.sp
Any characters that aren\(aqt able to be transformed from a byte
\fBstr\fP to \fBunicode\fP string or vice versa will be replaced
with a replacement character (ie: \fBu\(aq�\(aq\fP in unicode based encodings,
\fB\(aq?\(aq\fP in other ASCII compatible encodings).
.sp
\fBSEE ALSO:\fP
.INDENT 7.0
.INDENT 3.5
.INDENT 0.0
.TP
.B \fBgettext.GNUTranslations.gettext\fP
For information about what methods this class has and what they do
.UNINDENT
.UNINDENT
.UNINDENT
.sp
Changed in version kitchen\-1.1.0: ; API kitchen.i18n 2.1.0
Although we had adapted \fBgettext()\fP, \fBngettext()\fP,
\fBlgettext()\fP, and \fBlngettext()\fP to always return
byte \fBstr\fP, we hadn\(aqt forced those byte \fBstr\fP to always
be in a specified charset.  We now make sure that \fBgettext()\fP and
\fBngettext()\fP return byte \fBstr\fP encoded using
\fBoutput_charset\fP if set, otherwise \fBcharset\fP and if
neither of those, UTF\-8\&.  With \fBlgettext()\fP and
\fBlngettext()\fP \fBoutput_charset\fP if set, otherwise
\fBlocale.getpreferredencoding()\fP\&.

.UNINDENT
.SS Kitchen.text: unicode and utf8 and xml oh my!
.sp
The kitchen.text module contains functions that deal with text manipulation.
.SS Kitchen.text.converters
.sp
Functions to handle conversion of byte \fBstr\fP and \fBunicode\fP
strings.
.sp
Changed in version kitchen: 0.2a2 ; API kitchen.text 2.0.0
Added \fI\%getwriter()\fP

.sp
Changed in version kitchen: 0.2.2  ; API kitchen.text 2.1.0
Added \fI\%exception_to_unicode()\fP,
\fI\%exception_to_bytes()\fP,
\fI\%EXCEPTION_CONVERTERS\fP,
and \fI\%BYTE_EXCEPTION_CONVERTERS\fP

.sp
Changed in version kitchen: 1.0.1 ; API kitchen.text 2.1.1
Deprecated \fI\%BYTE_EXCEPTION_CONVERTERS\fP as
we\(aqve simplified \fI\%exception_to_unicode()\fP and
\fI\%exception_to_bytes()\fP to make it unnecessary

.SS Byte Strings and Unicode in Python2
.sp
Python2 has two string types, \fBstr\fP and \fBunicode\fP\&.
\fBunicode\fP represents an abstract sequence of text characters.  It can
hold any character that is present in the unicode standard.  \fBstr\fP can
hold any byte of data.  The operating system and python work together to
display these bytes as characters in many cases but you should always keep in
mind that the information is really a sequence of bytes, not a sequence of
characters.  In python2 these types are interchangeable much of the
time.  They are one of the few pairs of types that automatically convert when
used in equality:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
>>> # string is converted to unicode and then compared
>>> "I am a string" == u"I am a string"
True
>>> # Other types, like int, don\(aqt have this special treatment
>>> 5 == "5"
False
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
However, this automatic conversion tends to lull people into a false sense of
security.  As long as you\(aqre dealing with ASCII characters the
automatic conversion will save you from seeing any differences.  Once you
start using characters that are not in ASCII, you will start getting
\fBUnicodeError\fP and \fBUnicodeWarning\fP as the automatic conversions
between the types fail:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
>>> "I am an ñ" == u"I am an ñ"
__main__:1: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode \- interpreting them as being unequal
False
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Why do these conversions fail?  The reason is that the python2
\fBunicode\fP type represents an abstract sequence of unicode text known as
code points\&.  \fBstr\fP, on the other hand, really represents
a sequence of bytes.  Those bytes are converted by your operating system to
appear as characters on your screen using a particular encoding (usually
with a default defined by the operating system and customizable by the
individual user.) Although ASCII characters are fairly standard in
what bytes represent each character, the bytes outside of the ASCII
range are not.  In general, each encoding will map a different character to
a particular byte.  Newer encodings map individual characters to multiple
bytes (which the older encodings will instead treat as multiple characters).
In the face of these differences, python refuses to guess at an encoding and
instead issues a warning or exception and refuses to convert.
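.sp
A small sketch of why guessing is unsafe: the same two bytes are one
character under one encoding and two different characters under another.
The variable names are invented for the illustration:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
# -*- coding: utf-8 -*-
raw = u"ñ".encode("utf-8")           # two bytes: 0xc3 0xb1

as_utf8 = raw.decode("utf-8")        # one character, the original
as_latin1 = raw.decode("latin-1")    # two characters, neither the original

# ASCII has no mapping for these bytes at all, so decoding fails
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
```
.ft P
.fi
.UNINDENT
.UNINDENT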
.sp
\fBSEE ALSO:\fP
.INDENT 0.0
.INDENT 3.5
.INDENT 0.0
.TP
.B overcoming\-frustration
For a longer introduction on this subject.
.UNINDENT
.UNINDENT
.UNINDENT
.SS Strategy for Explicit Conversion
.sp
So what is the best method of dealing with this weltering babble of incoherent
encodings?  The basic strategy is to explicitly turn everything into
\fBunicode\fP when it first enters your program.  Then, when you send it to
output, you can transform the unicode back into bytes.  Doing this allows you
to control the encodings that are used and avoid getting tracebacks due to
\fBUnicodeError\fP\&. Using the functions defined in this module, that looks
something like this:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
>>> from kitchen.text.converters import to_unicode, to_bytes
>>> name = raw_input(\(aqEnter your name: \(aq)
Enter your name: Toshio くらとみ
>>> name
\(aqToshio \exe3\ex81\ex8f\exe3\ex82\ex89\exe3\ex81\exa8\exe3\ex81\exbf\(aq
>>> type(name)
<type \(aqstr\(aq>
>>> unicode_name = to_unicode(name)
>>> type(unicode_name)
<type \(aqunicode\(aq>
>>> unicode_name
u\(aqToshio \eu304f\eu3089\eu3068\eu307f\(aq
>>> # Do a lot of other things before needing to save/output again:
>>> output = open(\(aqdatafile\(aq, \(aqw\(aq)
>>> output.write(to_bytes(u\(aqName: %s\en\(aq % unicode_name))
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
A few notes:
.sp
Looking at line 6, you\(aqll notice that the input we took from the user was
a byte \fBstr\fP\&.  In general, anytime we\(aqre getting a value from outside
of python (The filesystem, reading data from the network, interacting with an
external command, reading values from the environment) we are interacting with
something that will want to give us a byte \fBstr\fP\&.  Some \fI\%python standard library\fP
modules and third party libraries will automatically attempt to convert a byte
\fBstr\fP to \fBunicode\fP strings for you.  This is both a boon and
a curse.  If the library can guess correctly about the encoding that the data
is in, it will return \fBunicode\fP objects to you without you having to
convert.  However, if it can\(aqt guess correctly, you may end up with one of
several problems:
.INDENT 0.0
.TP
.B \fBUnicodeError\fP
The library attempted to decode a byte \fBstr\fP into
a \fBunicode\fP string, failed, and raised an exception.
.TP
.B Garbled data
If the library returns the data after decoding it with the wrong encoding,
the characters you see in the \fBunicode\fP string won\(aqt be the ones that
you expect.
.TP
.B A byte \fBstr\fP instead of \fBunicode\fP string
Some libraries will return a \fBunicode\fP string when they\(aqre able to
decode the data and a byte \fBstr\fP  when they can\(aqt.  This is
generally the hardest problem to debug when it occurs.  Avoid it in your
own code and try to avoid or open bugs against upstreams that do this. See
DesigningUnicodeAwareAPIs for strategies to do this properly.
.UNINDENT
.sp
On line 8, we convert from a byte \fBstr\fP to a \fBunicode\fP string.
\fI\%to_unicode()\fP does this for us.  It has some
error handling and sane defaults that make this a nicer function to use than
calling \fBstr.decode()\fP directly:
.INDENT 0.0
.IP \(bu 2
Instead of defaulting to the ASCII encoding which fails with all
but the simple American English characters, it defaults to UTF\-8\&.
.IP \(bu 2
Instead of raising an error if it cannot decode a value, it will replace
the value with the unicode "Replacement character" symbol (\fB�\fP).
.IP \(bu 2
If you happen to call this method with something that is not a \fBstr\fP
or \fBunicode\fP, it will return an empty \fBunicode\fP string.
.UNINDENT
.sp
All three of these can be overridden using different keyword arguments to the
function.  See the \fI\%to_unicode()\fP documentation for more information.
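.sp
The three defaults listed above can be sketched with nothing but the python
standard library.  \fBto_unicode_sketch()\fP is a hypothetical stand\-in
written for this page; it is not the implementation of
\fI\%to_unicode()\fP and supports none of its other keyword arguments:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
def to_unicode_sketch(obj, encoding="utf-8", errors="replace"):
    """Mimic the documented defaults of to_unicode() (illustration only)."""
    if isinstance(obj, type(u"")):
        # Already a unicode string: return it untouched
        return obj
    if isinstance(obj, bytes):
        # Byte str: decode as UTF-8 by default, replacing undecodable bytes
        return obj.decode(encoding, errors)
    # Anything that is not a string becomes an empty unicode string
    return u""
```
.ft P
.fi
.UNINDENT
.UNINDENT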
.sp
On line 15 we push the data back out to a file.  Two things you should note here:
.INDENT 0.0
.IP 1. 3
We deal with the strings as \fBunicode\fP until the last instant.  The
string format that we\(aqre using is \fBunicode\fP and the variable also
holds \fBunicode\fP\&.  People sometimes get into trouble when they mix
a byte \fBstr\fP format with a variable that holds a \fBunicode\fP
string (or vice versa) at this stage.
.IP 2. 3
\fI\%to_bytes()\fP, does the reverse of
\fI\%to_unicode()\fP\&.  In this case, we\(aqre using the default values which
turn \fBunicode\fP into a byte \fBstr\fP using UTF\-8\&.  Any
errors are replaced with a \fB�\fP and sending nonstring objects yields empty
\fBunicode\fP strings.  Just like \fI\%to_unicode()\fP, you can look at
the documentation for \fI\%to_bytes()\fP to find out how to override any of
these defaults.
.UNINDENT
.SS When to use an alternate strategy
.sp
The default strategy of decoding to \fBunicode\fP strings when you take
data in and encoding to a byte \fBstr\fP when you send the data back out
works great for most problems but there are a few times when you shouldn\(aqt:
.INDENT 0.0
.IP \(bu 2
The values aren\(aqt meant to be read as text
.IP \(bu 2
The values need to be byte\-for\-byte when you send them back out \-\- for
instance if they are database keys or filenames.
.IP \(bu 2
You are transferring the data between several libraries that all expect
byte \fBstr\fP\&.
.UNINDENT
.sp
In each of these instances, there is a reason to keep around the byte
\fBstr\fP version of a value.  Here are a few hints to keep your sanity in
these situations:
.INDENT 0.0
.IP 1. 3
Keep your \fBunicode\fP and \fBstr\fP values separate.  Just like the
pain caused when you have to use someone else\(aqs library that returns both
\fBunicode\fP and \fBstr\fP you can cause yourself pain if you have
functions that can return both types or variables that could hold either
type of value.
.IP 2. 3
Name your variables so that you can tell whether you\(aqre storing byte
\fBstr\fP or \fBunicode\fP string.  One of the first things you end
up having to do when debugging is determine what type of string you have in
a variable and what type of string you are expecting.  Naming your
variables consistently so that you can tell which type they are supposed to
hold will save you from at least one of those steps.
.IP 3. 3
When you get values initially, make sure that you\(aqre dealing with the type
of value that you expect as you save it.  You can use \fBisinstance()\fP
or \fI\%to_bytes()\fP since \fI\%to_bytes()\fP doesn\(aqt do any modifications of
the string if it\(aqs already a \fBstr\fP\&.  When using \fI\%to_bytes()\fP
for this purpose you might want to use:
.INDENT 3.0
.INDENT 3.5
.sp
.nf
.ft C
try:
    b_input = to_bytes(input_should_be_bytes_already, errors=\(aqstrict\(aq, nonstring=\(aqstrict\(aq)
except (UnicodeError, TypeError):
    handle_errors_somehow()
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
The reason is that the default of \fI\%to_bytes()\fP will take characters
that are illegal in the chosen encoding and transform them to replacement
characters.  Since the point of keeping this data as a byte \fBstr\fP is
to keep the exact same bytes when you send it outside of your code,
changing things to replacement characters should be raising red flags that
something is wrong.  Setting \fBerrors\fP to \fBstrict\fP will raise an
exception which gives you an opportunity to fail gracefully.
.IP 4. 3
Sometimes you will want to print out the values that you have in your byte
\fBstr\fP\&.  When you do this you will need to make sure that you
transform \fBunicode\fP to \fBstr\fP before combining them.  Also be
sure that any other function calls (including \fBgettext\fP) are going to
give you strings that are the same type.  For instance:
.INDENT 3.0
.INDENT 3.5
.sp
.nf
.ft C
print to_bytes(_(\(aqUsername: %(user)s\(aq), \(aqutf\-8\(aq) % {\(aquser\(aq: b_username}
.ft P
.fi
.UNINDENT
.UNINDENT
.UNINDENT
.SS Gotchas and how to avoid them
.sp
Even when you have a good conceptual understanding of how python2 treats
\fBunicode\fP and \fBstr\fP there are still some things that can
surprise you.  In most cases this is because, as noted earlier, python or one
of the python libraries you depend on is trying to convert a value
automatically and failing.  Explicit conversion at the appropriate place
usually solves that.
.SS str(obj)
.sp
One common idiom for getting a simple, string representation of an object is to use:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
str(obj)
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Unfortunately, this is not safe.  Sometimes str(obj) will return
\fBunicode\fP\&.  Sometimes it will return a byte \fBstr\fP\&.  Sometimes,
it will attempt to convert from a \fBunicode\fP string to a byte
\fBstr\fP, fail, and throw a \fBUnicodeError\fP\&.  To be safe from all of
these, first decide whether you need \fBunicode\fP or \fBstr\fP to be
returned.  Then use \fI\%to_unicode()\fP or \fI\%to_bytes()\fP to get the simple
representation like this:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
u_representation = to_unicode(obj, nonstring=\(aqsimplerepr\(aq)
b_representation = to_bytes(obj, nonstring=\(aqsimplerepr\(aq)
.ft P
.fi
.UNINDENT
.UNINDENT
.SS print
.sp
python has a builtin \fBprint()\fP statement that outputs strings to the
terminal.  This originated in a time when python only dealt with byte
\fBstr\fP\&.  When \fBunicode\fP strings came about, some enhancements
were made to the \fBprint()\fP statement so that it could print those as well.
The enhancements make \fBprint()\fP work most of the time.  However, the times
when it doesn\(aqt work tend to make for cryptic debugging.
.sp
The basic issue is that \fBprint()\fP has to figure out what encoding to use
when it prints a \fBunicode\fP string to the terminal.  When python is
attached to your terminal (i.e., you\(aqre running the interpreter or running
a script that prints to the screen) python is able to take the encoding value
from your locale settings \fBLC_ALL\fP or \fBLC_CTYPE\fP and print the
characters allowed by that encoding.  On most modern Unix systems, the
encoding is utf\-8 which means that you can print any \fBunicode\fP
character without problem.
.sp
There are two common cases of things going wrong:
.INDENT 0.0
.IP 1. 3
Someone has a locale set that does not accept all valid unicode characters.
For instance:
.INDENT 3.0
.INDENT 3.5
.sp
.nf
.ft C
$ LC_ALL=C python
>>> print u\(aq\eufffd\(aq
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: \(aqascii\(aq codec can\(aqt encode character u\(aq\eufffd\(aq in position 0: ordinal not in range(128)
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
This often happens when a script that you\(aqve written and debugged from the
terminal is run from an automated environment like \fBcron\fP\&.  It
also occurs when you have written a script using a utf\-8 aware
locale and released it for consumption by people all over the internet.
Inevitably, someone is running with a locale that can\(aqt handle all unicode
characters and you get a traceback reported.
.IP 2. 3
You redirect output to a file.  Python isn\(aqt using the values in
\fBLC_ALL\fP unconditionally to decide what encoding to use.  Instead
it is using the encoding set for the terminal you are printing to which is
set to accept different encodings by \fBLC_ALL\fP\&.  If you redirect
to a file, you are no longer printing to the terminal so \fBLC_ALL\fP
won\(aqt have any effect.  At this point, python will decide it can\(aqt find an
encoding and fall back to ASCII which will likely lead to
\fBUnicodeError\fP being raised.  You can see this in a short script:
.INDENT 3.0
.INDENT 3.5
.sp
.nf
.ft C
#! /usr/bin/python \-tt
print u\(aq\eufffd\(aq
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
And then look at the difference between running it normally and redirecting to a file:
.INDENT 3.0
.INDENT 3.5
.sp
.nf
.ft C
$ ./test.py
�
$ ./test.py > t
Traceback (most recent call last):
  File "test.py", line 3, in <module>
      print u\(aq\eufffd\(aq
UnicodeEncodeError: \(aqascii\(aq codec can\(aqt encode character u\(aq\eufffd\(aq in position 0: ordinal not in range(128)
.ft P
.fi
.UNINDENT
.UNINDENT
.UNINDENT
.sp
The short answer to dealing with this is to always use bytes when writing
output.  You can do this by explicitly converting to bytes like this:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
from kitchen.text.converters import to_bytes
u_string = u\(aq\eufffd\(aq
print to_bytes(u_string)
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
or you can wrap stdout and stderr with a \fBStreamWriter\fP\&.
A \fBStreamWriter\fP is convenient in that you can assign it to
encode for \fBsys.stdout\fP or \fBsys.stderr\fP and then have output
automatically converted but it has the drawback of still being able to throw
\fBUnicodeError\fP if the writer can\(aqt encode all possible unicode
codepoints.  Kitchen provides an alternate version which can be retrieved with
\fI\%kitchen.text.converters.getwriter()\fP which will not traceback in its
standard configuration.
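.sp
As a rough illustration of the difference an error handler makes, the sketch
below uses only the standard library \fBcodecs.getwriter()\fP rather than
kitchen.  It is written for Python 3, where \fBbytes\fP plays the role of the
byte \fBstr\fP described here, and the \fBBytesIO\fP buffer is only a stand\-in
for a redirected stdout:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
import codecs
import io

text = u'caf' + chr(0xe9)  # u'café', built without escape sequences

# The default handler is strict: the ascii writer raises on the accented char.
strict_out = codecs.getwriter('ascii')(io.BytesIO())
try:
    strict_out.write(text)
    strict_raised = False
except UnicodeEncodeError:
    strict_raised = True

# With errors='replace' the same write succeeds and substitutes '?'.
buf = io.BytesIO()
replace_out = codecs.getwriter('ascii')(buf, errors='replace')
replace_out.write(text)
replaced = buf.getvalue()
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Choosing \fBreplace\fP trades fidelity for robustness: the output no longer
contains the original character, but the program keeps running.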
.SS Unicode, str, and dict keys
.sp
The \fBhash()\fP of the ASCII characters is the same for
\fBunicode\fP and byte \fBstr\fP\&.  When you use them in \fBdict\fP
keys, they evaluate to the same dictionary slot:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
>>> u_string = u\(aqa\(aq
>>> b_string = \(aqa\(aq
>>> hash(u_string), hash(b_string)
(12416037344, 12416037344)
>>> d = {}
>>> d[u_string] = \(aqunicode\(aq
>>> d[b_string] = \(aqbytes\(aq
>>> d
{u\(aqa\(aq: \(aqbytes\(aq}
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
When you deal with key values outside of ASCII, \fBunicode\fP and
byte \fBstr\fP evaluate unequally no matter what their character content or
hash value:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
>>> u_string = u\(aqñ\(aq
>>> b_string = u_string.encode(\(aqutf\-8\(aq)
>>> print u_string
ñ
>>> print b_string
ñ
>>> d = {}
>>> d[u_string] = \(aqunicode\(aq
>>> d[b_string] = \(aqbytes\(aq
>>> d
{u\(aq\exf1\(aq: \(aqunicode\(aq, \(aq\exc3\exb1\(aq: \(aqbytes\(aq}
>>> b_string2 = \(aq\exf1\(aq
>>> hash(u_string), hash(b_string2)
(30848092528, 30848092528)
>>> d = {}
>>> d[u_string] = \(aqunicode\(aq
>>> d[b_string2] = \(aqbytes\(aq
>>> d
{u\(aq\exf1\(aq: \(aqunicode\(aq, \(aq\exf1\(aq: \(aqbytes\(aq}
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
How do you work with this one?  Remember rule #1:  Keep your \fBunicode\fP
and byte \fBstr\fP values separate.  That goes for keys in a dictionary
just like anything else.
.INDENT 0.0
.IP \(bu 2
For any given dictionary, make sure that all your keys are either
\fBunicode\fP or \fBstr\fP\&.  \fBDo not mix the two.\fP  If you\(aqre being
given both \fBunicode\fP and \fBstr\fP but you don\(aqt need to preserve
separate keys for each, I recommend using \fI\%to_unicode()\fP or
\fI\%to_bytes()\fP to convert all keys to one type or the other like this:
.INDENT 2.0
.INDENT 3.5
.sp
.nf
.ft C
>>> from kitchen.text.converters import to_unicode
>>> u_string = u\(aqone\(aq
>>> b_string = \(aqtwo\(aq
>>> d = {}
>>> d[to_unicode(u_string)] = 1
>>> d[to_unicode(b_string)] = 2
>>> d
{u\(aqtwo\(aq: 2, u\(aqone\(aq: 1}
.ft P
.fi
.UNINDENT
.UNINDENT
.IP \(bu 2
These issues also apply to using dicts with tuple keys that contain
a mixture of \fBunicode\fP and \fBstr\fP\&.  Once again the best fix
is to standardise on either \fBstr\fP or \fBunicode\fP\&.
.IP \(bu 2
If you absolutely need to store values in a dictionary where the keys could
be either \fBunicode\fP or \fBstr\fP you can use
\fBStrictDict\fP which has separate
entries for all \fBunicode\fP and byte \fBstr\fP and deals correctly
with any \fBtuple\fP containing mixed \fBunicode\fP and byte
\fBstr\fP\&.
.UNINDENT
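.sp
The first recommendation in the list above, converting every key to one type
before storing it, might be sketched as follows.  \fBnormalize_keys()\fP is
a hypothetical helper, not part of kitchen, and the sketch is written for
Python 3 where \fBbytes\fP stands in for the byte \fBstr\fP\&.  It decodes
strictly so that invalid bytes and colliding keys surface as errors instead of
being silently merged:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
def normalize_keys(mapping, encoding='utf-8'):
    """Return a copy of mapping whose byte keys are decoded to text."""
    normalized = {}
    for key, value in mapping.items():
        if isinstance(key, bytes):
            # Strict decoding: undecodable keys raise UnicodeDecodeError.
            key = key.decode(encoding)
        if key in normalized:
            raise KeyError('duplicate key after decoding: %r' % (key,))
        normalized[key] = value
    return normalized

mixed = {u'one': 1, b'two': 2}
clean = normalize_keys(mixed)
```
.ft P
.fi
.UNINDENT
.UNINDENT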
.SS Functions
.SS Unicode and byte str conversion
.INDENT 0.0
.TP
.B kitchen.text.converters.to_unicode(obj, encoding=\(aqutf\-8\(aq, errors=\(aqreplace\(aq, nonstring=None, non_string=None)
Convert an object into a \fBunicode\fP string
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.IP \(bu 2
\fBobj\fP \-\- Object to convert to a \fBunicode\fP string.  This should
normally be a byte \fBstr\fP
.IP \(bu 2
\fBencoding\fP \-\- What encoding to try converting the byte \fBstr\fP as.
Defaults to utf\-8
.IP \(bu 2
\fBerrors\fP \-\- If errors are found while decoding, perform this action.
Defaults to \fBreplace\fP which replaces the invalid bytes with
a character that means the bytes were unable to be decoded.  Other
values are the same as the error handling schemes in the \fI\%codec base
classes\fP\&.
For instance \fBstrict\fP which raises an exception and \fBignore\fP which
simply omits the non\-decodable characters.
.IP \(bu 2
\fBnonstring\fP \-\- 
.sp
How to treat nonstring values.  Possible values are:
.INDENT 2.0
.TP
.B simplerepr
Attempt to call the object\(aqs "simple representation"
method and return that value.  Python\-2.3+ has two methods that
try to return a simple representation: \fBobject.__unicode__()\fP
and \fBobject.__str__()\fP\&.  We first try to get a usable value
from \fBobject.__unicode__()\fP\&.  If that fails we try the same
with \fBobject.__str__()\fP\&.
.TP
.B empty
Return an empty \fBunicode\fP string
.TP
.B strict
Raise a \fBTypeError\fP
.TP
.B passthru
Return the object unchanged
.TP
.B repr
Attempt to return a \fBunicode\fP string of the repr of the
object
.UNINDENT
.sp
Default is \fBsimplerepr\fP

.IP \(bu 2
\fBnon_string\fP \-\- \fIDeprecated\fP Use \fBnonstring\fP instead
.UNINDENT
.TP
.B Raises
.INDENT 7.0
.IP \(bu 2
\fBTypeError\fP \-\- if \fBnonstring\fP is \fBstrict\fP and
a non\-\fBbasestring\fP object is passed in or if \fBnonstring\fP
is set to an unknown value
.IP \(bu 2
\fBUnicodeDecodeError\fP \-\- if \fBerrors\fP is \fBstrict\fP and
\fBobj\fP is not decodable using the given encoding
.UNINDENT
.TP
.B Returns
\fBunicode\fP string or the original object depending on the
value of \fBnonstring\fP\&.
.UNINDENT
.sp
Usually this should be used on a byte \fBstr\fP but it can take both
byte \fBstr\fP and \fBunicode\fP strings intelligently.  Nonstring
objects are handled in different ways depending on the setting of the
\fBnonstring\fP parameter.
.sp
The default values of this function are set so as to always return
a \fBunicode\fP string and never raise an error when converting from
a byte \fBstr\fP to a \fBunicode\fP string.  However, when you do
not pass validly encoded text (or a nonstring object), you may end up with
output that you don\(aqt expect.  Be sure you understand the requirements of
your data rather than just ignoring errors by passing it through this function.
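.sp
The \fBerrors\fP values follow the standard codec error handlers, so their
effect can be previewed with plain \fBdecode()\fP calls.  The sketch below
does not call kitchen; it is written for Python 3 and feeds \fBlatin\-1\fP
bytes to a utf\-8 decoder to provoke a decoding error:
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
```python
data = (u'caf' + chr(0xe9)).encode('latin-1')  # latin-1 bytes, invalid utf-8

replaced = data.decode('utf-8', 'replace')  # bad byte becomes U+FFFD
ignored = data.decode('utf-8', 'ignore')    # bad byte is dropped

try:
    data.decode('utf-8', 'strict')
    strict_raised = False
except UnicodeDecodeError:
    strict_raised = True
```
.ft P
.fi
.UNINDENT
.UNINDENT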
.sp
Changed in version 0.2.1a2: Deprecated \fBnon_string\fP in favor of \fBnonstring\fP parameter and changed
default value to \fBsimplerepr\fP

.UNINDENT
.INDENT 0.0
.TP
.B kitchen.text.converters.to_bytes(obj, encoding=\(aqutf\-8\(aq, errors=\(aqreplace\(aq, nonstring=None, non_string=None)
Convert an object into a byte \fBstr\fP
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.IP \(bu 2
\fBobj\fP \-\- Object to convert to a byte \fBstr\fP\&.  This should normally
be a \fBunicode\fP string.
.IP \(bu 2
\fBencoding\fP \-\- Encoding to use to convert the \fBunicode\fP string
into a byte \fBstr\fP\&.  Defaults to utf\-8\&.
.IP \(bu 2
\fBerrors\fP \-\- 
.sp
If errors are found while encoding, perform this action.
Defaults to \fBreplace\fP which replaces the unencodable characters with
a character that means they were unable to be encoded.  Other
values are the same as the error handling schemes in the \fI\%codec base
classes\fP\&.
For instance \fBstrict\fP which raises an exception and \fBignore\fP which
simply omits the non\-encodable characters.

.IP \(bu 2
\fBnonstring\fP \-\- 
.sp
How to treat nonstring values.  Possible values are:
.INDENT 2.0
.TP
.B simplerepr
Attempt to call the object\(aqs "simple representation"
method and return that value.  Python\-2.3+ has two methods that
try to return a simple representation: \fBobject.__unicode__()\fP
and \fBobject.__str__()\fP\&.  We first try to get a usable value
from \fBobject.__str__()\fP\&.  If that fails we try the same
with \fBobject.__unicode__()\fP\&.
.TP
.B empty
Return an empty byte \fBstr\fP
.TP
.B strict
Raise a \fBTypeError\fP
.TP
.B passthru
Return the object unchanged
.TP
.B repr
Attempt to return a byte \fBstr\fP of the \fBrepr()\fP of the
object
.UNINDENT
.sp
Default is \fBsimplerepr\fP\&.

.IP \(bu 2
\fBnon_string\fP \-\- \fIDeprecated\fP Use \fBnonstring\fP instead.
.UNINDENT
.TP
.B Raises
.INDENT 7.0
.IP \(bu 2
\fBTypeError\fP \-\- if \fBnonstring\fP is \fBstrict\fP and
a non\-\fBbasestring\fP object is passed in or if \fBnonstring\fP
is set to an unknown value.
.IP \(bu 2
\fBUnicodeEncodeError\fP \-\- if \fBerrors\fP is \fBstrict\fP and
\fBobj\fP contains characters that cannot be encoded using \fBencoding\fP\&.
.UNINDENT
.TP
.B Returns
byte \fBstr\fP or the original object depending on the value
of \fBnonstring\fP\&.
.UNINDENT
.sp
\fBWARNING:\fP
.INDENT 7.0
.INDENT 3.5
If you pass a byte \fBstr\fP into this function the byte
\fBstr\fP is returned unmodified.  It is \fBnot\fP re\-encoded with
the specified \fBencoding\fP\&.  The easiest way to achieve that is:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
to_bytes(to_unicode(text), encoding=\(aqutf\-8\(aq)
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
The initial \fI\%to_unicode()\fP call will ensure text is
a \fBunicode\fP string.  Then, \fI\%to_bytes()\fP will turn that into
a byte \fBstr\fP with the specified encoding.
.UNINDENT
.UNINDENT
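.sp
The decode\-then\-encode step described in the warning can be pictured with
the standard library alone.  \fBreencode()\fP is a hypothetical helper used
only for illustration, and the sketch is written for Python 3 where
\fBbytes\fP stands in for the byte \fBstr\fP:
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
```python
def reencode(data, source_encoding, target_encoding='utf-8'):
    """Decode byte data with one encoding, then encode it with another."""
    return data.decode(source_encoding).encode(target_encoding)

latin1_bytes = (u'caf' + chr(0xe9)).encode('latin-1')  # one byte for the accent
utf8_bytes = reencode(latin1_bytes, 'latin-1')         # two bytes for the accent
```
.ft P
.fi
.UNINDENT
.UNINDENT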
.sp
Usually, this should be used on a \fBunicode\fP string but it can take
either a byte \fBstr\fP or a \fBunicode\fP string intelligently.
Nonstring objects are handled in different ways depending on the setting
of the \fBnonstring\fP parameter.
.sp
The default values of this function are set so as to always return a byte
\fBstr\fP and never raise an error when converting from unicode to
bytes.  However, when you do not pass an encoding that can validly encode
the object (or a non\-string object), you may end up with output that you
don\(aqt expect.  Be sure you understand the requirements of your data rather
than just ignoring errors by passing it through this function.
.sp
Changed in version 0.2.1a2: Deprecated \fBnon_string\fP in favor of \fBnonstring\fP parameter
and changed default value to \fBsimplerepr\fP

.UNINDENT
.INDENT 0.0
.TP
.B kitchen.text.converters.getwriter(encoding)
Return a \fBcodecs.StreamWriter\fP that resists tracing back.
.INDENT 7.0
.TP
.B Parameters
\fBencoding\fP \-\- Encoding to use for transforming \fBunicode\fP strings
into byte \fBstr\fP\&.
.TP
.B Return type
\fBcodecs.StreamWriter\fP
.TP
.B Returns
\fBStreamWriter\fP that you can instantiate to wrap output
streams to automatically translate \fBunicode\fP strings into \fBencoding\fP\&.
.UNINDENT
.sp
This is a reimplementation of \fBcodecs.getwriter()\fP that returns
a \fBStreamWriter\fP that resists issuing tracebacks.  The
\fBStreamWriter\fP that is returned uses
\fI\%kitchen.text.converters.to_bytes()\fP to convert \fBunicode\fP
strings into byte \fBstr\fP\&.  The departures from
\fBcodecs.getwriter()\fP are:
.INDENT 7.0
.IP 1. 3
The \fBStreamWriter\fP that is returned will take byte
\fBstr\fP as well as \fBunicode\fP strings.  Any byte
\fBstr\fP will be passed through unmodified.
.IP 2. 3
The default error handler for unknown bytes is to \fBreplace\fP the bytes
with the unknown character (\fB?\fP in most ascii\-based encodings, \fB�\fP
in the utf encodings) whereas \fBcodecs.getwriter()\fP defaults to
\fBstrict\fP\&.  Like \fBcodecs.StreamWriter\fP, the returned
\fBStreamWriter\fP can have its error handler changed in
code by setting \fBstream.errors = \(aqnew_handler_name\(aq\fP
.UNINDENT
.sp
Example usage:
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
$ LC_ALL=C python
>>> import sys
>>> from kitchen.text.converters import getwriter
>>> UTF8Writer = getwriter(\(aqutf\-8\(aq)
>>> unwrapped_stdout = sys.stdout
>>> sys.stdout = UTF8Writer(unwrapped_stdout)
>>> print \(aqcaf\exc3\exa9\(aq
café
>>> print u\(aqcaf\exe9\(aq
café
>>> ASCIIWriter = getwriter(\(aqascii\(aq)
>>> sys.stdout = ASCIIWriter(unwrapped_stdout)
>>> print \(aqcaf\exc3\exa9\(aq
café
>>> print u\(aqcaf\exe9\(aq
caf?
.ft P
.fi
.UNINDENT
.UNINDENT
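.sp
The two departures from \fBcodecs.getwriter()\fP can be approximated with
a small subclass.  This is a sketch under stated assumptions, not kitchen\(aqs
implementation: \fBtolerant_getwriter()\fP and \fBTolerantWriter\fP are
hypothetical names, the code is written for Python 3, and it encodes with the
stock codec rather than with \fBto_bytes()\fP:
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
```python
import codecs
import io

def tolerant_getwriter(encoding):
    """Hypothetical helper: a writer class that defaults to errors='replace'."""
    base = codecs.getwriter(encoding)

    class TolerantWriter(base):
        def __init__(self, stream, errors='replace'):
            base.__init__(self, stream, errors)

        def write(self, obj):
            if isinstance(obj, bytes):
                # Byte input is passed through unmodified.
                self.stream.write(obj)
            else:
                base.write(self, obj)

    return TolerantWriter

buf = io.BytesIO()
out = tolerant_getwriter('ascii')(buf)
out.write(u'caf' + chr(0xe9))                    # text: accent replaced by '?'
out.write(b' | ')
out.write((u'caf' + chr(0xe9)).encode('utf-8'))  # bytes: written as given
result = buf.getvalue()
```
.ft P
.fi
.UNINDENT
.UNINDENT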
.sp
\fBSEE ALSO:\fP
.INDENT 7.0
.INDENT 3.5
API docs for \fBcodecs.StreamWriter\fP and \fBcodecs.getwriter()\fP
and \fI\%Print Fails\fP on the
python wiki.
.UNINDENT
.UNINDENT
.sp
New in version kitchen: 0.2a2, API: kitchen.text 1.1.0

.UNINDENT
.INDENT 0.0
.TP
.B kitchen.text.converters.to_str(obj)
\fIDeprecated\fP
.sp
This function converts something to a byte \fBstr\fP if it isn\(aqt one.
It\(aqs used to call \fBstr()\fP or \fBunicode()\fP on the object to get its
simple representation without danger of getting a \fBUnicodeError\fP\&.
You should be using \fI\%to_unicode()\fP or \fI\%to_bytes()\fP explicitly
instead.
.sp
If you need \fBunicode\fP strings:
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
to_unicode(obj, nonstring=\(aqsimplerepr\(aq)
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
If you need byte \fBstr\fP:
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
to_bytes(obj, nonstring=\(aqsimplerepr\(aq)
.ft P
.fi
.UNINDENT
.UNINDENT
.UNINDENT
.INDENT 0.0
.TP
.B kitchen.text.converters.to_utf8(obj, errors=\(aqreplace\(aq, non_string=\(aqpassthru\(aq)
\fIDeprecated\fP
.sp
Convert \fBunicode\fP to an encoded utf\-8 byte \fBstr\fP\&.
You should be using \fI\%to_bytes()\fP instead:
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
to_bytes(obj, encoding=\(aqutf\-8\(aq, nonstring=\(aqpassthru\(aq)
.ft P
.fi
.UNINDENT
.UNINDENT
.UNINDENT
.SS Transformation to XML
.INDENT 0.0
.TP
.B kitchen.text.converters.unicode_to_xml(string, encoding=\(aqutf\-8\(aq, attrib=False, control_chars=\(aqreplace\(aq)
Take a \fBunicode\fP string and turn it into a byte \fBstr\fP
suitable for xml
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.IP \(bu 2
\fBstring\fP \-\- \fBunicode\fP string to encode into an XML compatible byte
\fBstr\fP
.IP \(bu 2
\fBencoding\fP \-\- encoding to use for the returned byte \fBstr\fP\&.
Default is to encode to UTF\-8\&.  If some of the characters in
\fBstring\fP are not encodable in this encoding, the unknown
characters will be entered into the output string using xml character
references.
.IP \(bu 2
\fBattrib\fP \-\- If \fBTrue\fP, quote the string for use in an xml
attribute.  If \fBFalse\fP (default), quote for use in an xml text
field.
.IP \(bu 2
\fBcontrol_chars\fP \-\- 
.sp
control characters are not allowed in XML
documents.  When we encounter those we need to know what to do.  Valid
options are:
.INDENT 2.0
.TP
.B replace
(default) Replace the control characters with \fB?\fP
.TP
.B ignore
Remove the characters altogether from the output
.TP
.B strict
Raise an \fBXmlEncodeError\fP when
we encounter a control character
.UNINDENT

.UNINDENT
.TP
.B Raises
.INDENT 7.0
.IP \(bu 2
\fBkitchen.text.exceptions.XmlEncodeError\fP \-\- If \fBcontrol_chars\fP
is set to \fBstrict\fP and the string to be made suitable for output to
xml contains control characters or if \fBstring\fP is not
a \fBunicode\fP string then we raise this exception.
.IP \(bu 2
\fBValueError\fP \-\- If \fBcontrol_chars\fP is set to something other than
\fBreplace\fP, \fBignore\fP, or \fBstrict\fP\&.
.UNINDENT
.TP
.B Return type
byte \fBstr\fP
.TP
.B Returns
representation of the \fBunicode\fP string as a valid XML
byte \fBstr\fP
.UNINDENT
.sp
XML files consist mainly of text encoded using a particular charset.  XML
also denies the use of certain bytes in the encoded text (example: \fBASCII
Null\fP).  There are also special characters that must be escaped if they
are present in the input (example: \fB<\fP).  This function takes care of
all of those issues for you.
.sp
There are a few different ways to use this function depending on your
needs.  The simplest invocation is like this:
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
unicode_to_xml(u\(aqString with non\-ASCII characters: <"á と">\(aq)
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
This will return the following to you, encoded in utf\-8:
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
\(aqString with non\-ASCII characters: &lt;"á と"&gt;\(aq
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Pretty straightforward.  Now, what if you need to encode your document in
something other than utf\-8?  For instance, \fBlatin\-1\fP?  Let\(aqs
see:
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
unicode_to_xml(u\(aqString with non\-ASCII characters: <"á と">\(aq, encoding=\(aqlatin\-1\(aq)
\(aqString with non\-ASCII characters: &lt;"á &#12392;"&gt;\(aq
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Because the \fBと\fP character is not available in the \fBlatin\-1\fP charset,
it is replaced with \fB&#12392;\fP in our output.  This is an xml character
reference which represents the character at unicode codepoint \fB12392\fP, the
\fBと\fP character.
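.sp
The same two steps, escaping markup characters and falling back to character
references, can be reproduced with the standard library.  The sketch below is
an approximation for illustration rather than the code kitchen runs; it is
written for Python 3 and uses \fBxml.sax.saxutils.escape()\fP together with
the \fBxmlcharrefreplace\fP error handler:
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
```python
from xml.sax.saxutils import escape

# The text <"á と"> built from code points so the source stays ASCII.
text = u'<"' + chr(0xe1) + u' ' + chr(0x3068) + u'">'

escaped = escape(text)  # replaces &, < and > with entities
latin1_xml = escaped.encode('latin-1', 'xmlcharrefreplace')
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
The accented letter survives as a single \fBlatin\-1\fP byte, while the
character that \fBlatin\-1\fP cannot represent becomes \fB&#12392;\fP\&.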
.sp
When you want to reverse this, use \fI\%xml_to_unicode()\fP which will turn
a byte \fBstr\fP into a \fBunicode\fP string and replace the xml
character references with the unicode characters.
.sp
XML also has the quirk of not allowing control characters in its
output.  The \fBcontrol_chars\fP parameter allows us to specify what to
do with those.  For use cases that don\(aqt need absolute character by
character fidelity (example: holding strings that will just be used for
display in a GUI app later), the default value of \fBreplace\fP works well:
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
unicode_to_xml(u\(aqString with disallowed control chars: \eu0000\eu0007\(aq)
\(aqString with disallowed control chars: ??\(aq
.ft P
.fi
.UNINDENT
.UNINDENT
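.sp
A replacement strategy of this kind might look like the sketch below.  It
assumes the XML 1.0 rule that tab, newline and carriage return are the only
characters below U+0020 that are allowed; the exact set kitchen treats as
control characters is not stated here, and \fBreplace_control_chars()\fP is
a hypothetical name.  The code is written for Python 3:
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
```python
XML_ALLOWED_LOW = (0x09, 0x0a, 0x0d)  # tab, newline, carriage return

def replace_control_chars(text, replacement=u'?'):
    """Swap characters below U+0020 that XML 1.0 forbids for replacement."""
    return u''.join(
        replacement if ord(ch) < 0x20 and ord(ch) not in XML_ALLOWED_LOW else ch
        for ch in text
    )

sample = u'nul:' + chr(0) + u' bell:' + chr(7) + u' tab:' + chr(9)
cleaned = replace_control_chars(sample)
```
.ft P
.fi
.UNINDENT
.UNINDENT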
.sp
If you do need to be able to reproduce all of the characters at a later
date (examples: if the string is a key value in a database or a path on a
filesystem) you have many choices.  Here are a few that rely on \fButf\-7\fP,
a verbose encoding that encodes control characters (as well as
non\-ASCII unicode values) to characters from within the
ASCII printable characters.  The good thing about doing this is
that the code is pretty simple.  You just need to use \fButf\-7\fP both when
encoding the field for xml and when decoding it for use in your python
program:
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
unicode_to_xml(u\(aqString with unicode: と and control char: \eu0007\(aq, encoding=\(aqutf7\(aq)
\(aqString with unicode: +MGg and control char: +AAc\-\(aq
# [...]
xml_to_unicode(\(aqString with unicode: +MGg and control char: +AAc\-\(aq, encoding=\(aqutf7\(aq)
u\(aqString with unicode: と and control char: \eu0007\(aq
.ft P
.fi
.UNINDENT
.UNINDENT
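.sp
The property being relied on here is that \fButf\-7\fP output stays within
printable ASCII and decodes back to the original text.  That can be checked
with the standard codec alone; this sketch is written for Python 3 and does
not go through kitchen:
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
```python
text = u'unicode: ' + chr(0x3068) + u' control: ' + chr(7)

encoded = text.encode('utf-7')
decoded = encoded.decode('utf-7')

# Every byte of the encoded form is a printable ASCII character.
only_printable_ascii = all(32 <= byte < 127 for byte in encoded)
```
.ft P
.fi
.UNINDENT
.UNINDENT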
.sp
As you can see, the \fButf\-7\fP encoding will transform even characters that
would be representable in utf\-8\&.  This can be a drawback if you
want unicode characters in the file to be readable without being decoded
first.  You can work around this with increased complexity in your
application code:
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
from kitchen.text.converters import unicode_to_xml
from kitchen.text.exceptions import XmlEncodeError

encoding = \(aqutf\-8\(aq
u_string = u\(aqString with unicode: と and control char: \eu0007\(aq
try:
    # First attempt to encode to utf8; raise if control characters are present
    data = unicode_to_xml(u_string, encoding=encoding, control_chars=\(aqstrict\(aq)
except XmlEncodeError:
    # Fallback to utf\-7
    encoding = \(aqutf\-7\(aq
    data = unicode_to_xml(u_string, encoding=encoding, control_chars=\(aqstrict\(aq)
write_tag(\(aq<mytag encoding=%s>%s</mytag>\(aq % (encoding, data))
# [...]
encoding = tag.attributes.encoding
u_string = xml_to_unicode(u_string, encoding=encoding)
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Using code similar to that, you can have some fields encoded using your
default encoding and fallback to \fButf\-7\fP if there are control
characters present.
.sp
\fBNOTE:\fP
.INDENT 7.0
.INDENT 3.5
If your goal is to preserve the control characters, you cannot simply
save the entire file as \fButf\-7\fP and set the xml encoding parameter
to \fButf\-7\fP\&.  Because XML doesn\(aqt allow control characters,
you have to encode those separately from any encoding work that the XML
parser itself knows about.
.UNINDENT
.UNINDENT
.sp
\fBSEE ALSO:\fP
.INDENT 7.0
.INDENT 3.5
.INDENT 0.0
.TP
.B \fI\%bytes_to_xml()\fP
if you\(aqre dealing with bytes that are non\-text or of an unknown
encoding that you must preserve on a byte for byte level.
.TP
.B \fI\%guess_encoding_to_xml()\fP
if you\(aqre dealing with strings in unknown encodings that you don\(aqt
need to save with char\-for\-char fidelity.
.UNINDENT
.UNINDENT
.UNINDENT
.UNINDENT
.INDENT 0.0
.TP
.B kitchen.text.converters.xml_to_unicode(byte_string, encoding=\(aqutf\-8\(aq, errors=\(aqreplace\(aq)
Transform a byte \fBstr\fP from an xml file into a \fBunicode\fP
string
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.IP \(bu 2
\fBbyte_string\fP \-\- byte \fBstr\fP to decode
.IP \(bu 2
\fBencoding\fP \-\- encoding that the byte \fBstr\fP is in
.IP \(bu 2
\fBerrors\fP \-\- What to do if not every character is valid in
\fBencoding\fP\&.  See the \fI\%to_unicode()\fP documentation for legal
values.
.UNINDENT
.TP
.B Return type
\fBunicode\fP string
.TP
.B Returns
string decoded from \fBbyte_string\fP
.UNINDENT
.sp
This function attempts to reverse what \fI\%unicode_to_xml()\fP does.  It
takes a byte \fBstr\fP (presumably read in from an xml file) and
expands all the html entities into unicode characters and decodes the byte
\fBstr\fP into a \fBunicode\fP string.  One thing it cannot do is
restore any control characters that were removed prior to
inserting into the file.  If you need to keep such characters you need to
use \fI\%xml_to_bytes()\fP and \fI\%bytes_to_xml()\fP or use one of the
strategies documented in \fI\%unicode_to_xml()\fP instead.
.UNINDENT
.INDENT 0.0
.TP
.B kitchen.text.converters.byte_string_to_xml(byte_string, input_encoding=\(aqutf\-8\(aq, errors=\(aqreplace\(aq, output_encoding=\(aqutf\-8\(aq, attrib=False, control_chars=\(aqreplace\(aq)
Make sure a byte \fBstr\fP is validly encoded for xml output
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.IP \(bu 2
\fBbyte_string\fP \-\- Byte \fBstr\fP to turn into valid xml output
.IP \(bu 2
\fBinput_encoding\fP \-\- Encoding of \fBbyte_string\fP\&.  Default \fButf\-8\fP
.IP \(bu 2
\fBerrors\fP \-\- 
.sp
How to handle errors encountered while decoding the
\fBbyte_string\fP into \fBunicode\fP at the beginning of the
process.  Values are:
.INDENT 2.0
.TP
.B replace
(default) Replace the invalid bytes with a \fB?\fP
.TP
.B ignore
Remove the characters altogether from the output
.TP
.B strict
Raise a \fBUnicodeDecodeError\fP when we encounter
a non\-decodable character
.UNINDENT

.IP \(bu 2
\fBoutput_encoding\fP \-\- Encoding for the xml file that this string will go
into.  Default is \fButf\-8\fP\&.  If some of the characters in
\fBbyte_string\fP are not encodable in this encoding, the unknown
characters will be entered into the output string using xml character
references.
.IP \(bu 2
\fBattrib\fP \-\- If \fBTrue\fP, quote the string for use in an xml
attribute.  If \fBFalse\fP (default), quote for use in an xml text
field.
.IP \(bu 2
\fBcontrol_chars\fP \-\- 
.sp
XML does not allow control characters\&.  When
we encounter those we need to know what to do.  Valid options are:
.INDENT 2.0
.TP
.B replace
(default) Replace the control characters with \fB?\fP
.TP
.B ignore
Remove the characters altogether from the output
.TP
.B strict
Raise an error when we encounter a control character
.UNINDENT

.UNINDENT
.TP
.B Raises
.INDENT 7.0
.IP \(bu 2
\fBXmlEncodeError\fP \-\- If \fBcontrol_chars\fP is set to \fBstrict\fP and
the string to be made suitable for output to xml contains
control characters then we raise this exception.
.IP \(bu 2
\fBUnicodeDecodeError\fP \-\- If errors is set to \fBstrict\fP and the
\fBbyte_string\fP contains bytes that are not decodable using
\fBinput_encoding\fP, this error is raised
.UNINDENT
.TP
.B Return type
byte \fBstr\fP
.TP
.B Returns
representation of the byte \fBstr\fP in the output encoding with
any bytes that aren\(aqt available in xml taken care of.
.UNINDENT
.sp
Use this when you have a byte \fBstr\fP representing text that you need
to make suitable for output to xml.  This comes up in several situations.
For instance, if you need to transform some strings encoded
in \fBlatin\-1\fP to utf\-8 for output:
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
utf8_string = byte_string_to_xml(latin1_string, input_encoding=\(aqlatin\-1\(aq)
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
If you already have strings in the proper encoding you may still want to
use this function to remove control characters:
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
cleaned_string = byte_string_to_xml(string, input_encoding=\(aqutf\-8\(aq, output_encoding=\(aqutf\-8\(aq)
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
\fBSEE ALSO:\fP
.INDENT 7.0
.INDENT 3.5
.INDENT 0.0
.TP
.B \fI\%unicode_to_xml()\fP
for other ideas on using this function
.UNINDENT
.UNINDENT
.UNINDENT
.UNINDENT
.INDENT 0.0
.TP
.B kitchen.text.converters.xml_to_byte_string(byte_string, input_encoding=\(aqutf\-8\(aq, errors=\(aqreplace\(aq, output_encoding=\(aqutf\-8\(aq)
Transform a byte \fBstr\fP from an xml file into a byte \fBstr\fP
encoded in \fBoutput_encoding\fP
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.IP \(bu 2
\fBbyte_string\fP \-\- byte \fBstr\fP to decode
.IP \(bu 2
\fBinput_encoding\fP \-\- encoding that the byte \fBstr\fP is in
.IP \(bu 2
\fBerrors\fP \-\- What to do if not every character is valid in
\fBinput_encoding\fP\&.  See the \fI\%to_unicode()\fP docstring for legal
values.
.IP \(bu 2
\fBoutput_encoding\fP \-\- Encoding for the output byte \fBstr\fP
.UNINDENT
.TP
.B Returns
byte \fBstr\fP decoded from \fBbyte_string\fP and encoded in \fBoutput_encoding\fP
.UNINDENT
.sp
This function attempts to reverse what \fI\%unicode_to_xml()\fP does.  It
takes a byte \fBstr\fP (presumably read in from an xml file) and
expands all the html entities into unicode characters and decodes the
byte \fBstr\fP into a \fBunicode\fP string.  One thing it cannot do
is restore any control characters that were removed prior to
inserting into the file.  If you need to keep such characters you need to
use \fI\%xml_to_bytes()\fP and \fI\%bytes_to_xml()\fP or use one of the
strategies documented in \fI\%unicode_to_xml()\fP instead.
.UNINDENT
.INDENT 0.0
.TP
.B kitchen.text.converters.bytes_to_xml(byte_string, *args, **kwargs)
Return a byte \fBstr\fP encoded so it is valid inside of any xml
file
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.IP \(bu 2
\fBbyte_string\fP \-\- byte \fBstr\fP to transform
.IP \(bu 2
\fB*args, **kwargs\fP \-\- extra arguments to this function are passed on to
the function actually implementing the encoding.  You can use this to
tweak the output in some cases but, as a general rule, you shouldn\(aqt
because the underlying encoding function is not guaranteed to remain
the same.
.UNINDENT
.TP
.B Return type
byte \fBstr\fP consisting of all ASCII characters
.TP
.B Returns
byte \fBstr\fP representation of the input.  This will be encoded
using base64.
.UNINDENT
.sp
This function is made especially to put binary information into xml
documents.
.sp
This function is intended for encoding things that must be preserved
byte\-for\-byte.  If you want to encode a byte string that\(aqs text and don\(aqt
mind losing the actual bytes you probably want to try \fI\%byte_string_to_xml()\fP
or \fI\%guess_encoding_to_xml()\fP instead.
.sp
\fBNOTE:\fP
.INDENT 7.0
.INDENT 3.5
Although the current implementation uses \fBbase64.b64encode()\fP and
there are no plans to change it, that isn\(aqt guaranteed.  If you want to
make sure that you can encode and decode these messages, it\(aqs best to
decode with \fI\%xml_to_bytes()\fP if you use this function to encode.
.UNINDENT
.UNINDENT
.UNINDENT
.INDENT 0.0
.TP
.B kitchen.text.converters.xml_to_bytes(byte_string, *args, **kwargs)
Decode a string encoded using \fI\%bytes_to_xml()\fP
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.IP \(bu 2
\fBbyte_string\fP \-\- byte \fBstr\fP to transform.  This should be a base64
encoded sequence of bytes originally generated by \fI\%bytes_to_xml()\fP\&.
.IP \(bu 2
\fB*args, **kwargs\fP \-\- extra arguments to this function are passed on to
the function actually implementing the decoding.  You can use this to
tweak the output in some cases but, as a general rule, you shouldn\(aqt
because the underlying decoding function is not guaranteed to remain
the same.
.UNINDENT
.TP
.B Return type
byte \fBstr\fP
.TP
.B Returns
byte \fBstr\fP that\(aqs the decoded input
.UNINDENT
.sp
If you\(aqve got fields in an xml document that were encoded with
\fI\%bytes_to_xml()\fP then you want to use this function to decode them.
It converts a base64 encoded string into a byte \fBstr\fP\&.
.sp
\fBNOTE:\fP
.INDENT 7.0
.INDENT 3.5
Although the current implementation uses \fBbase64.b64decode()\fP and
there are no plans to change it, that isn\(aqt guaranteed.  If you want to
make sure that you can encode and decode these messages, it\(aqs best to
encode with \fI\%bytes_to_xml()\fP if you use this function to decode.
.UNINDENT
.UNINDENT
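.sp
As a rough illustration of the round trip described above, the following
python3 sketch calls the standard library \fBbase64\fP module directly.  The
helper names \fBencode_for_xml()\fP and \fBdecode_from_xml()\fP are
hypothetical stand\-ins and are not part of kitchen; real code should call
\fI\%bytes_to_xml()\fP and \fI\%xml_to_bytes()\fP so that it keeps working
if the underlying encoding ever changes.
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
```python
import base64


def encode_for_xml(data):
    # Hypothetical stand-in for bytes_to_xml(): base64 output is plain ASCII
    return base64.b64encode(data)


def decode_from_xml(encoded):
    # Hypothetical stand-in for xml_to_bytes(): reverse the base64 step
    return base64.b64decode(encoded)


# Bytes that could not be placed in an xml text node as they are
payload = bytes(bytearray([0, 1, 2, 60, 38, 128, 255]))
encoded = encode_for_xml(payload)
restored = decode_from_xml(encoded)
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
The encoded form contains only ASCII bytes and decoding it gives back the
original bytes unchanged.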
.UNINDENT
.INDENT 0.0
.TP
.B kitchen.text.converters.guess_encoding_to_xml(string, output_encoding=\(aqutf\-8\(aq, attrib=False, control_chars=\(aqreplace\(aq)
Return a byte \fBstr\fP suitable for inclusion in xml
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.IP \(bu 2
\fBstring\fP \-\- \fBunicode\fP or byte \fBstr\fP to be transformed into
a byte \fBstr\fP suitable for inclusion in xml.  If string is
a byte \fBstr\fP we attempt to guess the encoding.  If we cannot guess,
we fall back to \fBlatin\-1\fP\&.
.IP \(bu 2
\fBoutput_encoding\fP \-\- Output encoding for the byte \fBstr\fP\&.  This
should match the encoding of your xml file.
.IP \(bu 2
\fBattrib\fP \-\- If \fBTrue\fP, escape the item for use in an xml
attribute.  If \fBFalse\fP (default) escape the item for use in
a text node.
.IP \(bu 2
\fBcontrol_chars\fP \-\- What to do with control characters, which are not
allowed in xml.  Default: \fBreplace\fP\&.  See \fI\%unicode_to_xml()\fP for
the available strategies.
.UNINDENT
.TP
.B Returns
byte \fBstr\fP encoded in \fBoutput_encoding\fP (utf\-8 by default)
.UNINDENT
.UNINDENT
.INDENT 0.0
.TP
.B kitchen.text.converters.to_xml(string, encoding=\(aqutf\-8\(aq, attrib=False, control_chars=\(aqignore\(aq)
\fIDeprecated\fP: Use \fI\%guess_encoding_to_xml()\fP instead
.UNINDENT
.SS Working with exception messages
.INDENT 0.0
.TP
.B kitchen.text.converters.EXCEPTION_CONVERTERS = (<function <lambda>>, <function <lambda>>)
.INDENT 7.0
Tuple of functions to try to use to convert an exception into a string
representation.  Its main use is to extract a string (\fBunicode\fP or
\fBstr\fP) from an exception object in \fI\%exception_to_unicode()\fP and
\fI\%exception_to_bytes()\fP\&.  The functions here will try the exception\(aqs
\fBargs[0]\fP and the exception itself (roughly equivalent to
\fIstr(exception)\fP) to extract the message. This is only a default and can
be easily overridden when calling those functions.  There are several
reasons you might wish to do that.  If you have exceptions where the best
string representing the exception is not returned by the default
functions, you can add another function to extract from a different
field:
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
from kitchen.text.converters import (EXCEPTION_CONVERTERS,
        exception_to_unicode)

class MyError(Exception):
    def __init__(self, message):
        self.value = message

c = [lambda e: e.value]
c.extend(EXCEPTION_CONVERTERS)
try:
    raise MyError(\(aqAn Exception message\(aq)
except MyError, e:
    print exception_to_unicode(e, converters=c)
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Another reason would be if you\(aqre converting to a byte \fBstr\fP and
you know the \fBstr\fP needs to be a non\-utf\-8 encoding.
\fI\%exception_to_bytes()\fP defaults to utf\-8 but if you convert
into a byte \fBstr\fP explicitly using a converter then you can choose
a different encoding:
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
from kitchen.text.converters import (EXCEPTION_CONVERTERS,
        exception_to_bytes, to_bytes)
c = [lambda e: to_bytes(e.args[0], encoding=\(aqeuc_jp\(aq),
        lambda e: to_bytes(e, encoding=\(aqeuc_jp\(aq)]
c.extend(EXCEPTION_CONVERTERS)
try:
    do_something()
except Exception, e:
    log = open(\(aqlogfile.euc_jp\(aq, \(aqa\(aq)
    log.write(\(aq%s\en\(aq % exception_to_bytes(e, converters=c))
    log.close()
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Each function in this list should take the exception as its sole argument
and return a string containing the message representing the exception.
The functions may return the message as a byte \fBstr\fP,
a \fBunicode\fP string, or even an object if you trust the object to
return a decent string representation.  The \fI\%exception_to_unicode()\fP
and \fI\%exception_to_bytes()\fP functions will make sure to convert the
string to the proper type before returning.
.sp
New in version 0.2.2.

.UNINDENT
.UNINDENT
.INDENT 0.0
.TP
.B kitchen.text.converters.BYTE_EXCEPTION_CONVERTERS = (<function <lambda>>, <function to_bytes>)
\fIDeprecated\fP: Use \fI\%EXCEPTION_CONVERTERS\fP instead.
.sp
Tuple of functions to try to use to convert an exception into a string
representation.  This tuple is similar to the one in
\fI\%EXCEPTION_CONVERTERS\fP but it\(aqs used with \fI\%exception_to_bytes()\fP
instead.  Ideally, these functions should do their best to return the data
as a byte \fBstr\fP but the results will be run through
\fI\%to_bytes()\fP before being returned.
.sp
New in version 0.2.2.

.sp
Changed in version 1.0.1: Deprecated as simplifications allow \fI\%EXCEPTION_CONVERTERS\fP to
perform the same function.

.UNINDENT
.INDENT 0.0
.TP
.B kitchen.text.converters.exception_to_unicode(exc, converters=(<function <lambda>>, <function <lambda>>))
Convert an exception object into a unicode representation
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.IP \(bu 2
\fBexc\fP \-\- Exception object to convert
.IP \(bu 2
\fBconverters\fP \-\- List of functions to use to convert the exception into
a string.  See \fI\%EXCEPTION_CONVERTERS\fP for the default value and
an example of adding other converters to the defaults.  The functions
in the list are tried one at a time to see if they can extract
a string from the exception.  The first one to do so without raising
an exception is used.
.UNINDENT
.TP
.B Returns
\fBunicode\fP string representation of the exception.  The
value extracted by the \fBconverters\fP will be converted into
\fBunicode\fP before being returned using the utf\-8
encoding.  If you know you need to use an alternate encoding, add
a function that does that to the list of functions in
\fBconverters\fP\&.
.UNINDENT
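.sp
The lookup order described for \fBconverters\fP can be sketched as follows.
This is an illustration only: \fBfirst_successful()\fP is a hypothetical
helper written for python3, it is not kitchen\(aqs implementation, and it
leaves out the final conversion to \fBunicode\fP\&.
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
```python
def first_successful(exc, converters):
    # Try each converter in order; the first that does not raise wins
    for convert in converters:
        try:
            return convert(exc)
        except Exception:
            continue
    # Assumption: what happens when every converter fails is not covered here
    return None


class MyError(Exception):
    def __init__(self, message):
        self.value = message


# A custom attribute first, then the exception args, then str() as a last resort
converters = [lambda e: e.value, lambda e: e.args[0], str]

from_value = first_successful(MyError("An Exception message"), converters)
from_args = first_successful(ValueError("plain message"), converters)
from_str = first_successful(ValueError(), converters)
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Putting the most specific converter first means that generic fallbacks are
only consulted when the specific one raises.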
.sp
New in version 0.2.2.

.UNINDENT
.INDENT 0.0
.TP
.B kitchen.text.converters.exception_to_bytes(exc, converters=(<function <lambda>>, <function <lambda>>))
Convert an exception object into a byte \fBstr\fP representation
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.IP \(bu 2
\fBexc\fP \-\- Exception object to convert
.IP \(bu 2
\fBconverters\fP \-\- List of functions to use to convert the exception into
a string.  See \fI\%EXCEPTION_CONVERTERS\fP for the default value and
an example of adding other converters to the defaults.  The functions
in the list are tried one at a time to see if they can extract
a string from the exception.  The first one to do so without raising
an exception is used.
.UNINDENT
.TP
.B Returns
byte \fBstr\fP representation of the exception.  The value
extracted by the \fBconverters\fP will be converted into
\fBstr\fP before being returned using the utf\-8 encoding.
If you know you need to use an alternate encoding, add a function that
does that to the list of functions in \fBconverters\fP\&.
.UNINDENT
.sp
New in version 0.2.2.

.sp
Changed in version 1.0.1: Code simplification allowed us to switch to using
\fI\%EXCEPTION_CONVERTERS\fP as the default value of
\fBconverters\fP\&.

.UNINDENT
.SS Format Text for Display
.sp
Functions related to displaying unicode text.  Unicode characters don\(aqt all
have the same width so we need helper functions for displaying them.
.sp
New in version 0.2: kitchen.display API 1.0.0

.INDENT 0.0
.TP
.B kitchen.text.display.textual_width(msg, control_chars=\(aqguess\(aq, encoding=\(aqutf\-8\(aq, errors=\(aqreplace\(aq)
Get the textual width of a string
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.IP \(bu 2
\fBmsg\fP \-\- \fBunicode\fP string or byte \fBstr\fP to get the width of
.IP \(bu 2
\fBcontrol_chars\fP \-\- 
.sp
specify how to deal with control characters\&.
Possible values are:
.INDENT 2.0
.TP
.B guess
(default) will take a guess for control character
widths.  Most codes will return zero width.  \fBbackspace\fP,
\fBdelete\fP, and \fBclear delete\fP return \-1.  \fBescape\fP currently
returns \-1 as well but this is not guaranteed as it\(aqs not always
correct
.TP
.B strict
will raise \fBkitchen.text.exceptions.ControlCharError\fP
if a control character is encountered
.UNINDENT

.IP \(bu 2
\fBencoding\fP \-\- If we are given a byte \fBstr\fP this is used to
decode it into \fBunicode\fP string.  Any characters that are not
decodable in this encoding will get a value dependent on the
\fBerrors\fP parameter.
.IP \(bu 2
\fBerrors\fP \-\- How to treat errors decoding the byte \fBstr\fP to
\fBunicode\fP string.  Legal values are the same as for
\fBkitchen.text.converters.to_unicode()\fP\&.  The default value of
\fBreplace\fP will cause undecodable byte sequences to have a width of
one.  With \fBignore\fP they will have a width of zero.
.UNINDENT
.TP
.B Raises
\fBControlCharError\fP \-\- if \fBmsg\fP contains a control
character and \fBcontrol_chars\fP is \fBstrict\fP\&.
.TP
.B Returns
Textual width of the \fBmsg\fP\&.  This is the amount of
space that the string will consume on a monospace display.  It\(aqs
measured in the number of cell positions or columns it will take up on
a monospace display.  This is \fBnot\fP the number of glyphs that are in
the string.
.UNINDENT
.sp
\fBNOTE:\fP
.INDENT 7.0
.INDENT 3.5
This function can be wrong sometimes because Unicode does not specify
a strict width value for all of the code points\&.  In
particular, we\(aqve found that some Tamil characters take up to four
character cells but we return a lesser amount.
.UNINDENT
.UNINDENT
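.sp
To make the difference between characters and cells concrete, here is
a small python3 sketch built on the standard library \fBunicodedata\fP
module.  \fBapproximate_width()\fP is a hypothetical helper, not the
algorithm kitchen uses: it ignores control characters and relies on
\fBunicodedata\fP alone rather than on kitchen\(aqs own tables.
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
```python
import unicodedata


def approximate_width(text):
    # Rough estimate only: kitchen uses its own tables and control_chars rules
    total = 0
    for char in text:
        if unicodedata.combining(char):
            continue  # combining marks share the cell of the preceding character
        if unicodedata.east_asian_width(char) in ("W", "F"):
            total += 2  # wide and fullwidth characters take two cells
        else:
            total += 1
    return total


ascii_width = approximate_width("cafe")
kanji_width = approximate_width("一二三")
combined_width = approximate_width("e" + chr(0x301))
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Four ASCII letters take four cells, three kanji take six, and a letter
followed by a combining accent takes one.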
.UNINDENT
.INDENT 0.0
.TP
.B kitchen.text.display.textual_width_chop(msg, chop, encoding=\(aqutf\-8\(aq, errors=\(aqreplace\(aq)
Given a string, return it chopped to a given textual width
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.IP \(bu 2
\fBmsg\fP \-\- \fBunicode\fP string or byte \fBstr\fP to chop
.IP \(bu 2
\fBchop\fP \-\- Chop \fBmsg\fP if it exceeds this textual width
.IP \(bu 2
\fBencoding\fP \-\- If we are given a byte \fBstr\fP, this is used to
decode it into a \fBunicode\fP string.  Any characters that are not
decodable in this encoding will be assigned a width of one.
.IP \(bu 2
\fBerrors\fP \-\- How to treat errors decoding the byte \fBstr\fP to
\fBunicode\fP\&.  Legal values are the same as for
\fBkitchen.text.converters.to_unicode()\fP
.UNINDENT
.TP
.B Return type
\fBunicode\fP string
.TP
.B Returns
\fBunicode\fP string of the \fBmsg\fP chopped at the given
textual width
.UNINDENT
.sp
This is what you want to use instead of \fB%.*s\fP, as it does the "right"
thing with regard to UTF\-8 sequences, control characters,
and characters that take more than one cell position. Eg:
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
>>> # Wrong: only displays 8 characters because it is operating on bytes
>>> print "%.*s" % (10, \(aqcafé ñunru!\(aq)
café ñun
>>> # Properly operates on graphemes
>>> \(aq%s\(aq % (textual_width_chop(\(aqcafé ñunru!\(aq, 10))
café ñunru
>>> # takes too many columns because the kanji need two cell positions
>>> print \(aq1234567890\en%.*s\(aq % (10, u\(aq一二三四五六七八九十\(aq)
1234567890
一二三四五六七八九十
>>> # Properly chops at 10 columns
>>> print \(aq1234567890\en%s\(aq % (textual_width_chop(u\(aq一二三四五六七八九十\(aq, 10))
1234567890
一二三四五
.ft P
.fi
.UNINDENT
.UNINDENT
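.sp
The idea of chopping by cells rather than by characters can be sketched in
a few lines of python3.  Both helpers below are hypothetical and use
a simplified width rule based on \fBunicodedata\fP; they only illustrate the
technique and do not reproduce kitchen\(aqs handling of control characters or
byte \fBstr\fP input.
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
```python
import unicodedata


def cell_width(char):
    # Simplified per-character width; not the table kitchen actually uses
    if unicodedata.combining(char):
        return 0
    return 2 if unicodedata.east_asian_width(char) in ("W", "F") else 1


def chop_to_width(text, limit):
    used = 0
    kept = []
    for char in text:
        step = cell_width(char)
        if used + step > limit:
            break  # the next character would overflow the field
        kept.append(char)
        used += step
    return "".join(kept)


chopped = chop_to_width("一二三四五六七八九十", 10)
odd_limit = chop_to_width("一二三", 5)
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
With a limit of 10 cells only five kanji are kept.  With an odd limit of
5 the third kanji is dropped entirely because half a character cannot be
displayed.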
.UNINDENT
.INDENT 0.0
.TP
.B kitchen.text.display.textual_width_fill(msg, fill, chop=None, left=True, prefix=\(aq\(aq, suffix=\(aq\(aq)
Expand a \fBunicode\fP string to a specified textual width
or chop to same
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.IP \(bu 2
\fBmsg\fP \-\- \fBunicode\fP string to format
.IP \(bu 2
\fBfill\fP \-\- pad string until the textual width of the string is
this length
.IP \(bu 2
\fBchop\fP \-\- before doing anything else, chop the string to this length.
Default: Don\(aqt chop the string at all
.IP \(bu 2
\fBleft\fP \-\- If \fBTrue\fP (default) left justify the string and put the
padding on the right.  If \fBFalse\fP, pad on the left side.
.IP \(bu 2
\fBprefix\fP \-\- Attach this string before the field we\(aqre filling
.IP \(bu 2
\fBsuffix\fP \-\- Append this string to the end of the field we\(aqre filling
.UNINDENT
.TP
.B Return type
\fBunicode\fP string
.TP
.B Returns
\fBmsg\fP formatted to fill the specified width.  If no
\fBchop\fP is specified, the string could exceed the fill length
when completed.  If \fBprefix\fP or \fBsuffix\fP are printable
characters, the string could be longer than the fill width.
.UNINDENT
.sp
\fBNOTE:\fP
.INDENT 7.0
.INDENT 3.5
\fBprefix\fP and \fBsuffix\fP should be used for "invisible"
characters like highlighting, color changing escape codes, etc.  The
fill characters are appended outside of any \fBprefix\fP or
\fBsuffix\fP elements.  This allows you to only highlight
\fBmsg\fP inside of the field you\(aqre filling.
.UNINDENT
.UNINDENT
.sp
\fBWARNING:\fP
.INDENT 7.0
.INDENT 3.5
\fBmsg\fP, \fBprefix\fP, and \fBsuffix\fP should all be
representable as unicode characters.  In particular, any escape
sequences in \fBprefix\fP and \fBsuffix\fP need to be convertible
to \fBunicode\fP\&.  If you need to use byte sequences here rather
than unicode characters, use
\fI\%byte_string_textual_width_fill()\fP instead.
.UNINDENT
.UNINDENT
.sp
This function expands a string to fill a field of a particular
textual width\&.  Use it instead of \fB%*.*s\fP, as it does the
"right" thing with regard to UTF\-8 sequences, control
characters, and characters that take more than one cell position in
a display.  Example usage:
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
>>> msg = u\(aq一二三四五六七八九十\(aq
>>> # Wrong: This uses 10 characters instead of 10 cells:
>>> u":%\-*.*s:" % (10, 10, msg[:9])
:一二三四五六七八九 :
>>> # This uses 10 cells like we really want:
>>> u":%s:" % (textual_width_fill(msg[:9], 10, 10))
:一二三四五:

>>> # Wrong: Right aligned in the field, but too many cells
>>> u"%20.10s" % (msg)
          一二三四五六七八九十
>>> # Correct: Right aligned with proper number of cells
>>> u"%s" % (textual_width_fill(msg, 20, 10, left=False))
          一二三四五

>>> # Wrong: Adding some escape characters to highlight the line but too many cells
>>> prefix, suffix = u\(aq\ex1b[7m\(aq, u\(aq\ex1b[0m\(aq
>>> u"%s%20.10s%s" % (prefix, msg, suffix)
u\(aq\ex1b[7m          一二三四五六七八九十\ex1b[0m\(aq
>>> # Correct highlight of the line
>>> u"%s%s%s" % (prefix, textual_width_fill(msg, 20, 10, left=False), suffix)
u\(aq\ex1b[7m          一二三四五\ex1b[0m\(aq

>>> # Correct way to not highlight the fill
>>> u"%s" % (textual_width_fill(msg, 20, 10, left=False, prefix=prefix, suffix=suffix))
u\(aq          \ex1b[7m一二三四五\ex1b[0m\(aq
.ft P
.fi
.UNINDENT
.UNINDENT
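.sp
The placement of the padding relative to \fBprefix\fP and \fBsuffix\fP can
be sketched as below.  \fBfill_field()\fP and \fBcells()\fP are hypothetical
python3 helpers with a simplified width rule, and the sketch leaves out
\fBchop\fP; it is meant only to show that the padding is computed from
\fBmsg\fP alone and added outside the prefix and suffix.
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
```python
import unicodedata


def cells(text):
    # Simplified width: two cells for wide characters, one for the rest
    return sum(2 if unicodedata.east_asian_width(c) in ("W", "F") else 1
               for c in text)


def fill_field(msg, fill, left=True, prefix="", suffix=""):
    # Padding is computed from msg alone, so prefix and suffix are assumed
    # to be invisible (for example colour escape sequences)
    padding = " " * max(fill - cells(msg), 0)
    body = prefix + msg + suffix
    return body + padding if left else padding + body


left_aligned = fill_field("一二", 6)
# Visible markers stand in for escape codes to show where they end up
right_aligned = fill_field("一二", 6, left=False, prefix="<", suffix=">")
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
In the right aligned case the two padding spaces come before the prefix
marker, so only the message itself sits between the markers.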
.UNINDENT
.INDENT 0.0
.TP
.B kitchen.text.display.wrap(text, width=70, initial_indent=u\(aq\(aq, subsequent_indent=u\(aq\(aq, encoding=\(aqutf\-8\(aq, errors=\(aqreplace\(aq)
Works like we want \fBtextwrap.wrap()\fP to work
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.IP \(bu 2
\fBtext\fP \-\- \fBunicode\fP string or byte \fBstr\fP to wrap
.IP \(bu 2
\fBwidth\fP \-\- textual width at which to wrap.  Default: 70
.IP \(bu 2
\fBinitial_indent\fP \-\- string to use to indent the first line.  Default:
do not indent.
.IP \(bu 2
\fBsubsequent_indent\fP \-\- string to use to indent subsequent lines.
Default: do not indent
.IP \(bu 2
\fBencoding\fP \-\- Encoding to use if \fBtext\fP is a byte \fBstr\fP
.IP \(bu 2
\fBerrors\fP \-\- error handler to use if \fBtext\fP is a byte \fBstr\fP
and contains some undecodable characters.
.UNINDENT
.TP
.B Return type
\fBlist\fP of \fBunicode\fP strings
.TP
.B Returns
list of lines that have been text wrapped and indented.
.UNINDENT
.sp
\fBtextwrap.wrap()\fP from the \fI\%python standard library\fP has two drawbacks that this
attempts to fix:
.INDENT 7.0
.IP 1. 3
It does not handle textual width\&.  It operates only on bytes or
characters, both of which are inadequate (because of multi\-byte and
double\-width characters).
.IP 2. 3
It mangles the formatting of lists and blocks.
.UNINDENT
.UNINDENT
.INDENT 0.0
.TP
.B kitchen.text.display.fill(text, *args, **kwargs)
Works like we want \fBtextwrap.fill()\fP to work
.INDENT 7.0
.TP
.B Parameters
\fBtext\fP \-\- \fBunicode\fP string or byte \fBstr\fP to process
.TP
.B Returns
\fBunicode\fP string with each line separated by a newline
.UNINDENT
.sp
\fBSEE ALSO:\fP
.INDENT 7.0
.INDENT 3.5
.INDENT 0.0
.TP
.B \fI\%kitchen.text.display.wrap()\fP
for other parameters that you can give this command.
.UNINDENT
.UNINDENT
.UNINDENT
.sp
This function is a light wrapper around \fI\%kitchen.text.display.wrap()\fP\&.
Where that function returns a \fBlist\fP of lines, this function
returns one string with each line separated by a newline.
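.sp
The standard library has the same relationship between its own wrap and
fill functions, which gives an easy way to see what to expect.  The sketch
below uses \fBtextwrap\fP from the standard library, not kitchen, so it
measures characters rather than textual width.
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
```python
import textwrap

text = "The quick brown fox jumps over the lazy dog near the river bank"

# wrap() gives a list of lines, fill() gives the same lines as one string
lines = textwrap.wrap(text, width=20)
block = textwrap.fill(text, width=20)
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Splitting \fBblock\fP on newlines gives back exactly the list in
\fBlines\fP\&.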
.UNINDENT
.INDENT 0.0
.TP
.B kitchen.text.display.byte_string_textual_width_fill(msg, fill, chop=None, left=True, prefix=\(aq\(aq, suffix=\(aq\(aq, encoding=\(aqutf\-8\(aq, errors=\(aqreplace\(aq)
Expand a byte \fBstr\fP to a specified textual width or chop
to same
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.IP \(bu 2
\fBmsg\fP \-\- byte \fBstr\fP encoded in UTF\-8 that we want formatted
.IP \(bu 2
\fBfill\fP \-\- pad \fBmsg\fP until the textual width is this long
.IP \(bu 2
\fBchop\fP \-\- before doing anything else, chop the string to this length.
Default: Don\(aqt chop the string at all
.IP \(bu 2
\fBleft\fP \-\- If \fBTrue\fP (default) left justify the string and put the
padding on the right.  If \fBFalse\fP, pad on the left side.
.IP \(bu 2
\fBprefix\fP \-\- Attach this byte \fBstr\fP before the field we\(aqre
filling
.IP \(bu 2
\fBsuffix\fP \-\- Append this byte \fBstr\fP to the end of the field we\(aqre
filling
.UNINDENT
.TP
.B Return type
byte \fBstr\fP
.TP
.B Returns
\fBmsg\fP formatted to fill the specified textual
width\&.  If no \fBchop\fP is specified, the string could exceed the
fill length when completed.  If \fBprefix\fP or \fBsuffix\fP are
printable characters, the string could be longer than fill width.
.UNINDENT
.sp
\fBNOTE:\fP
.INDENT 7.0
.INDENT 3.5
\fBprefix\fP and \fBsuffix\fP should be used for "invisible"
characters like highlighting, color changing escape codes, etc.  The
fill characters are appended outside of any \fBprefix\fP or
\fBsuffix\fP elements.  This allows you to only highlight
\fBmsg\fP inside of the field you\(aqre filling.
.UNINDENT
.UNINDENT
.sp
\fBSEE ALSO:\fP
.INDENT 7.0
.INDENT 3.5
.INDENT 0.0
.TP
.B \fI\%textual_width_fill()\fP
For example usage.  This function differs in only two ways:
.INDENT 7.0
.IP 1. 3
it takes byte \fBstr\fP for \fBprefix\fP and
\fBsuffix\fP so you can pass in arbitrary sequences of
bytes, not just unicode characters.
.IP 2. 3
it returns a byte \fBstr\fP instead of a \fBunicode\fP
string.
.UNINDENT
.UNINDENT
.UNINDENT
.UNINDENT
.UNINDENT
.SS Internal Data
.sp
There are a few internal functions and variables in this module.  Code outside
of kitchen shouldn\(aqt use them but people coding on kitchen itself may find
them useful.
.INDENT 0.0
.TP
.B kitchen.text.display._COMBINING = ((768, 879), (1155, 1161), (1425, 1469), (1471, 1471), (1473, 1474), (1476, 1477), (1479, 1479), (1536, 1539), (1552, 1562), (1611, 1631), (1648, 1648), (1750, 1764), (1767, 1768), (1770, 1773), (1807, 1807), (1809, 1809), (1840, 1866), (1958, 1968), (2027, 2035), (2070, 2073), (2075, 2083), (2085, 2087), (2089, 2093), (2137, 2139), (2305, 2306), (2364, 2364), (2369, 2376), (2381, 2381), (2385, 2388), (2402, 2403), (2433, 2433), (2492, 2492), (2497, 2500), (2509, 2509), (2530, 2531), (2561, 2562), (2620, 2620), (2625, 2626), (2631, 2632), (2635, 2637), (2672, 2673), (2689, 2690), (2748, 2748), (2753, 2757), (2759, 2760), (2765, 2765), (2786, 2787), (2817, 2817), (2876, 2876), (2879, 2879), (2881, 2883), (2893, 2893), (2902, 2902), (2946, 2946), (3008, 3008), (3021, 3021), (3134, 3136), (3142, 3144), (3146, 3149), (3157, 3158), (3260, 3260), (3263, 3263), (3270, 3270), (3276, 3277), (3298, 3299), (3393, 3395), (3405, 3405), (3530, 3530), (3538, 3540), (3542, 3542), (3633, 3633), (3636, 3642), (3655, 3662), (3761, 3761), (3764, 3769), (3771, 3772), (3784, 3789), (3864, 3865), (3893, 3893), (3895, 3895), (3897, 3897), (3953, 3966), (3968, 3972), (3974, 3975), (3984, 3991), (3993, 4028), (4038, 4038), (4141, 4144), (4146, 4146), (4150, 4151), (4153, 4154), (4184, 4185), (4237, 4237), (4448, 4607), (4957, 4959), (5906, 5908), (5938, 5940), (5970, 5971), (6002, 6003), (6068, 6069), (6071, 6077), (6086, 6086), (6089, 6099), (6109, 6109), (6155, 6157), (6313, 6313), (6432, 6434), (6439, 6440), (6450, 6450), (6457, 6459), (6679, 6680), (6752, 6752), (6773, 6780), (6783, 6783), (6912, 6915), (6964, 6964), (6966, 6970), (6972, 6972), (6978, 6978), (6980, 6980), (7019, 7027), (7082, 7082), (7142, 7142), (7154, 7155), (7223, 7223), (7376, 7378), (7380, 7392), (7394, 7400), (7405, 7405), (7616, 7654), (7676, 7679), (8203, 8207), (8234, 8238), (8288, 8291), (8298, 8303), (8400, 8432), (11503, 11505), (11647, 11647), (11744, 11775), (12330, 
12335), (12441, 12442), (42607, 42607), (42620, 42621), (42736, 42737), (43014, 43014), (43019, 43019), (43045, 43046), (43204, 43204), (43232, 43249), (43307, 43309), (43347, 43347), (43443, 43443), (43456, 43456), (43696, 43696), (43698, 43700), (43703, 43704), (43710, 43711), (43713, 43713), (44013, 44013), (64286, 64286), (65024, 65039), (65056, 65062), (65279, 65279), (65529, 65531), (66045, 66045), (68097, 68099), (68101, 68102), (68108, 68111), (68152, 68154), (68159, 68159), (69702, 69702), (69817, 69818), (119141, 119145), (119149, 119170), (119173, 119179), (119210, 119213), (119362, 119364), (917505, 917505), (917536, 917631), (917760, 917999))
Internal table, provided by this module to list code points which
combine with other characters and therefore should have no textual
width\&.  This is a sorted \fBtuple\fP of non\-overlapping intervals.  Each
interval is a \fBtuple\fP listing a starting code point and ending
code point\&.  Every code point between the two end points is
a combining character.
.sp
\fBSEE ALSO:\fP
.INDENT 7.0
.INDENT 3.5
.INDENT 0.0
.TP
.B \fI\%_generate_combining_table()\fP
for how this table is generated
.UNINDENT
.UNINDENT
.UNINDENT
.sp
This table was last regenerated on python\-3.2.3 with
\fBunicodedata.unidata_version\fP 6.0.0
.UNINDENT
.INDENT 0.0
.TP
.B kitchen.text.display._generate_combining_table()
Combine Markus Kuhn\(aqs data with \fBunicodedata\fP to make combining
char list
.INDENT 7.0
.TP
.B Return type
\fBtuple\fP of tuples
.TP
.B Returns
\fBtuple\fP of intervals of code points that are
combining characters.  Each interval is a 2\-\fBtuple\fP of the
starting code point and the ending code point for the
combining characters.
.UNINDENT
.sp
In normal use, this function serves only to document how we generate the
combining char list.  For speed reasons, we use this to generate a static
list and just use that later.
.sp
Markus Kuhn\(aqs list of combining characters is more complete than what\(aqs in
the python \fBunicodedata\fP library but the python \fBunicodedata\fP is
synced against later versions of the unicode database.
.sp
This is used to generate the \fI\%_COMBINING\fP
table.
.UNINDENT
.INDENT 0.0
.TP
.B kitchen.text.display._print_combining_table()
Print out a new \fI\%_COMBINING\fP table
.sp
This will print a new \fI\%_COMBINING\fP table in the format used in
\fBkitchen/text/display.py\fP\&.  It\(aqs useful for updating the
\fI\%_COMBINING\fP table with updated data from a new python as the format
won\(aqt change from what\(aqs already in the file.
.UNINDENT
.INDENT 0.0
.TP
.B kitchen.text.display._interval_bisearch(value, table)
Binary search in an interval table.
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.IP \(bu 2
\fBvalue\fP \-\- numeric value to search for
.IP \(bu 2
\fBtable\fP \-\- Ordered list of intervals.  This is a list of two\-tuples.  The
elements of the two\-tuple define an interval\(aqs start and end points.
.UNINDENT
.TP
.B Returns
If \fBvalue\fP is found within an interval in the \fBtable\fP
return \fBTrue\fP\&.  Otherwise, \fBFalse\fP
.UNINDENT
.sp
This function checks whether a numeric value is present within a table
of intervals.  It checks using a binary search algorithm, dividing the
list of values in half and checking against the values until it determines
whether the value is in the table.
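.sp
A binary search over a sorted table of non\-overlapping intervals can be
sketched as follows.  \fBinterval_contains()\fP is a hypothetical name and
the code is an independent illustration of the technique, not a copy of
kitchen\(aqs function.  \fBSAMPLE\fP holds the first three intervals of the
\fI\%_COMBINING\fP table.
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
```python
def interval_contains(value, table):
    # table must be sorted and its (start, end) intervals must not overlap
    low, high = 0, len(table)
    while low < high:
        mid = (low + high) // 2
        start, end = table[mid]
        if value < start:
            high = mid  # continue in the lower half
        elif value > end:
            low = mid + 1  # continue in the upper half
        else:
            return True  # start <= value <= end
    return False


SAMPLE = ((768, 879), (1155, 1161), (1425, 1469))

inside = interval_contains(1160, SAMPLE)
on_boundary = interval_contains(879, SAMPLE)
in_gap = interval_contains(880, SAMPLE)
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Both end points of an interval count as part of it, so 879 is found while
880, which falls in the gap before the next interval, is not.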
.UNINDENT
.INDENT 0.0
.TP
.B kitchen.text.display._ucp_width(ucs, control_chars=\(aqguess\(aq)
Get the textual width of a ucs character
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.IP \(bu 2
\fBucs\fP \-\- integer representing a single unicode code point
.IP \(bu 2
\fBcontrol_chars\fP \-\- 
.sp
specify how to deal with control characters\&.
Possible values are:
.INDENT 2.0
.TP
.B guess
(default) will take a guess for control character
widths.  Most codes will return zero width.  \fBbackspace\fP,
\fBdelete\fP, and \fBclear delete\fP return \-1.  \fBescape\fP currently
returns \-1 as well but this is not guaranteed as it\(aqs not always
correct
.TP
.B strict
will raise \fBControlCharError\fP
if a control character is encountered
.UNINDENT

.UNINDENT
.TP
.B Raises
\fBControlCharError\fP \-\- if the code point is a unicode
control character and \fBcontrol_chars\fP is set to \(aqstrict\(aq
.TP
.B Returns
textual width of the character.
.UNINDENT
.sp
\fBNOTE:\fP
.INDENT 7.0
.INDENT 3.5
It\(aqs important to remember this is textual width and not the
number of characters or bytes.
.UNINDENT
.UNINDENT
.UNINDENT
.INDENT 0.0
.TP
.B kitchen.text.display._textual_width_le(width, *args)
Optimize the common case when deciding which textual width is
larger
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.IP \(bu 2
\fBwidth\fP \-\- textual width to compare against.
.IP \(bu 2
\fB*args\fP \-\- \fBunicode\fP strings to check the total textual
width of
.UNINDENT
.TP
.B Returns
\fBTrue\fP if the total textual width of \fBargs\fP is less than
or equal to \fBwidth\fP\&.  Otherwise \fBFalse\fP\&.
.UNINDENT
.sp
We often want to know "does X fit in Y".  It takes a while to use
\fI\%textual_width()\fP to calculate this.  However, we know that each
canonically composed \fBunicode\fP character has a textual width of
1 or 2.  With this we can
take the following shortcuts:
.INDENT 7.0
.IP 1. 3
If the number of canonically composed characters is more than width,
the true textual width cannot be less than width.
.IP 2. 3
If the number of canonically composed characters * 2 is less than the
width then the textual width must be ok.
.UNINDENT
.sp
The textual width of a canonically composed \fBunicode\fP string
will always be greater than or equal to the number of \fBunicode\fP
characters.  So we can first check whether the number of composed
\fBunicode\fP characters is greater than the requested width.  If it is, we
can return \fBFalse\fP immediately.  If neither shortcut decides the question, we must do a full
textual width lookup.
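.sp
The two shortcuts and the slow path can be sketched in python3 as below.
\fBwidth_le()\fP and \fBfull_width()\fP are hypothetical helpers;
\fBfull_width()\fP is a simplified stand\-in for \fI\%textual_width()\fP, and
the sketch takes the claim that every composed character is one or two
cells wide as an assumption.
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
```python
import unicodedata


def full_width(text):
    # Simplified stand-in for textual_width()
    return sum(2 if unicodedata.east_asian_width(c) in ("W", "F") else 1
               for c in text)


def width_le(limit, *strings):
    text = unicodedata.normalize("NFC", "".join(strings))
    if len(text) > limit:
        return False  # shortcut 1: at least one cell per character
    if len(text) * 2 <= limit:
        return True  # shortcut 2: at most two cells per character
    return full_width(text) <= limit  # slow path


too_many_chars = width_le(4, "abcde")
certainly_fits = width_le(10, "一二", "三四五")
lookup_fits = width_le(7, "abcd", "一")
lookup_overflows = width_le(7, "一二三四")
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
The first two calls are decided by counting characters alone.  The last two
fall between the shortcuts, so only they pay for the full width
calculation.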
.UNINDENT
.SS Miscellaneous functions for manipulating text
.sp
Collection of text functions that don\(aqt fit in another category.
.sp
Changed in version kitchen 1.2.0, API kitchen.text 2.2.0: Added \fI\%isbasestring()\fP,
\fI\%isbytestring()\fP, and
\fI\%isunicodestring()\fP to help tell which string type
is which on python2 and python3.

.INDENT 0.0
.TP
.B kitchen.text.misc.byte_string_valid_encoding(byte_string, encoding=\(aqutf\-8\(aq)
Detect if a byte \fBstr\fP is valid in a specific encoding
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.IP \(bu 2
\fBbyte_string\fP \-\- Byte \fBstr\fP to test for bytes not valid in this
encoding
.IP \(bu 2
\fBencoding\fP \-\- encoding to test against.  Defaults to UTF\-8\&.
.UNINDENT
.TP
.B Returns
\fBTrue\fP if there are no characters that are invalid in \fBencoding\fP\&.
\fBFalse\fP if an invalid character is detected.
.UNINDENT
.sp
\fBNOTE:\fP
.INDENT 7.0
.INDENT 3.5
This function checks whether the byte \fBstr\fP is valid in the
specified encoding.  It \fBdoes not\fP detect whether the byte
\fBstr\fP actually was encoded in that encoding.  If you want that
sort of functionality, you probably want to use
\fI\%guess_encoding()\fP instead.
.UNINDENT
.UNINDENT
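.sp
The distinction made in the note can be seen with a small python3 sketch.
\fBvalid_in_encoding()\fP is a hypothetical helper that treats "valid" as
"decodes without error"; it is an assumption about the general technique,
not kitchen\(aqs implementation.
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
```python
def valid_in_encoding(byte_string, encoding="utf-8"):
    # Valid simply means that decoding raises no error
    try:
        byte_string.decode(encoding)
    except UnicodeError:
        return False
    return True


latin1_bytes = "café".encode("latin-1")
valid_as_utf8 = valid_in_encoding(latin1_bytes)
valid_as_latin1 = valid_in_encoding(latin1_bytes, "latin-1")
valid_as_cp1252 = valid_in_encoding(latin1_bytes, "cp1252")
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
The same bytes are rejected as UTF\-8 but accepted as both latin\-1 and
cp1252, which is why a successful check does not tell you which encoding
was actually used.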
.UNINDENT
.INDENT 0.0
.TP
.B kitchen.text.misc.byte_string_valid_xml(byte_string, encoding=\(aqutf\-8\(aq)
Check that a byte \fBstr\fP would be valid in xml
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.IP \(bu 2
\fBbyte_string\fP \-\- Byte \fBstr\fP to check
.IP \(bu 2
\fBencoding\fP \-\- Encoding of the xml file.  Default: UTF\-8
.UNINDENT
.TP
.B Returns
\fBTrue\fP if the string is valid.  \fBFalse\fP if it would
be invalid in the xml file
.UNINDENT
.sp
In some cases you\(aqll have a whole bunch of byte strings and rather than
transforming them to \fBunicode\fP and back to byte \fBstr\fP for
output to xml, you will just want to make sure they work with the xml file
you\(aqre constructing.  This function will help you do that.  Example:
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
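from kitchen.text.converters import guess_encoding_to_xml
from kitchen.text.misc import byte_string_valid_xml

ARRAY_OF_MOSTLY_UTF8_STRINGS = [...]
processed_array = []
for string in ARRAY_OF_MOSTLY_UTF8_STRINGS:
    if byte_string_valid_xml(string, \(aqutf\-8\(aq):
        # Already safe to write into the xml file as is
        processed_array.append(string)
    else:
        # Guess the encoding and escape the string for xml
        processed_array.append(guess_encoding_to_xml(string, output_encoding=\(aqutf\-8\(aq))
output_xml(processed_array)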
.ft P
.fi
.UNINDENT
.UNINDENT
.UNINDENT
.INDENT 0.0
.TP
.B kitchen.text.misc.guess_encoding(byte_string, disable_chardet=False)
Try to guess the encoding of a byte \fBstr\fP
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.IP \(bu 2
\fBbyte_string\fP \-\- byte \fBstr\fP to guess the encoding of
.IP \(bu 2
\fBdisable_chardet\fP \-\- If this is True, we never attempt to use
\fBchardet\fP to guess the encoding.  This is useful if you need to
have reproducibility whether \fBchardet\fP is installed or not.
Default: \fBFalse\fP\&.
.UNINDENT
.TP
.B Raises
\fBTypeError\fP \-\- if \fBbyte_string\fP is not a byte \fBstr\fP type
.TP
.B Returns
string containing a guess at the encoding of
\fBbyte_string\fP\&.  This is appropriate to pass as the encoding
argument when encoding and decoding unicode strings.
.UNINDENT
.sp
We start by attempting to decode the byte \fBstr\fP as UTF\-8\&.
If this succeeds we tell the world it\(aqs UTF\-8 text.  If it doesn\(aqt
and \fBchardet\fP is installed on the system and \fBdisable_chardet\fP
is False this function will use it to try detecting the encoding of
\fBbyte_string\fP\&.  If it is not installed or \fBchardet\fP cannot
determine the encoding with a high enough confidence then we rather
arbitrarily claim that it is \fBlatin\-1\fP\&.  Since \fBlatin\-1\fP assigns a character
to every byte value, decoding from \fBlatin\-1\fP to \fBunicode\fP will not
cause \fBUnicodeErrors\fP although the output might be mangled.
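.sp
The decision order described above can be sketched in python3 as follows.
\fBguess_encoding_sketch()\fP is a hypothetical helper and the
\fBthreshold\fP of 0.9 is an assumed value chosen for illustration; the
confidence level kitchen actually requires is not stated here.
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
```python
try:
    import chardet
except ImportError:
    chardet = None


def guess_encoding_sketch(byte_string, disable_chardet=False, threshold=0.9):
    # Step 1: anything that decodes cleanly as UTF-8 is reported as UTF-8
    try:
        byte_string.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # Step 2: ask chardet, if it is installed and allowed
    if chardet is not None and not disable_chardet:
        result = chardet.detect(byte_string)
        # threshold is an assumed value, not taken from kitchen
        if result["encoding"] and result["confidence"] >= threshold:
            return result["encoding"]
    # Step 3: latin-1 can decode any byte sequence, so use it as a last resort
    return "latin-1"


utf8_guess = guess_encoding_sketch("café".encode("utf-8"), disable_chardet=True)
fallback_guess = guess_encoding_sketch(bytes(bytearray([233])), disable_chardet=True)
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Passing \fBdisable_chardet=True\fP keeps the result the same whether or not
\fBchardet\fP is installed.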
.UNINDENT
.INDENT 0.0
.TP
.B kitchen.text.misc.html_entities_unescape(string)
Substitute unicode characters for HTML entities
.INDENT 7.0
.TP
.B Parameters
\fBstring\fP \-\- \fBunicode\fP string to substitute out html entities
.TP
.B Raises
\fBTypeError\fP \-\- if something other than a \fBunicode\fP string is
given
.TP
.B Return type
\fBunicode\fP string
.TP
.B Returns
The plain text without html entities
.UNINDENT
.UNINDENT
.INDENT 0.0
.TP
.B kitchen.text.misc.isbasestring(obj)
Determine if obj is a byte \fBstr\fP or \fBunicode\fP string
.sp
In python2 this is equivalent to isinstance(obj, basestring).  In python3
it checks whether the object is an instance of str, bytes, or bytearray.
This is an aid to porting code that needed to test whether an object was
derived from basestring in python2 (commonly used in unicode\-bytes
conversion functions)
.INDENT 7.0
.TP
.B Parameters
\fBobj\fP \-\- Object to test
.TP
.B Returns
True if the object is a \fBbasestring\fP\&.  Otherwise False.
.UNINDENT
.sp
New in version Kitchen: 1.2.0, API kitchen.text 2.2.0

.UNINDENT
.INDENT 0.0
.TP
.B kitchen.text.misc.isbytestring(obj)
Determine if obj is a byte \fBstr\fP
.sp
In python2 this is equivalent to isinstance(obj, str).  In python3 it
checks whether the object is an instance of bytes or bytearray.
.INDENT 7.0
.TP
.B Parameters
\fBobj\fP \-\- Object to test
.TP
.B Returns
True if the object is a byte \fBstr\fP\&.  Otherwise, False.
.UNINDENT
.sp
New in version Kitchen: 1.2.0, API kitchen.text 2.2.0

.UNINDENT
.INDENT 0.0
.TP
.B kitchen.text.misc.isunicodestring(obj)
Determine if obj is a \fBunicode\fP string
.sp
In python2 this is equivalent to isinstance(obj, unicode).  In python3 it
checks whether the object is an instance of \fBstr\fP\&.
.INDENT 7.0
.TP
.B Parameters
\fBobj\fP \-\- Object to test
.TP
.B Returns
True if the object is a \fBunicode\fP string.  Otherwise, False.
.UNINDENT
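.sp
The python3 behaviour described for \fBisbasestring()\fP,
\fBisbytestring()\fP and \fBisunicodestring()\fP amounts to three
\fBisinstance()\fP checks.  The sketch below restates them with hypothetical
\fB_sketch\fP names; it is an illustration of the documented checks, not the
library source.
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
```python
def isbasestring_sketch(obj):
    # python3: text or any of the byte string types
    return isinstance(obj, (str, bytes, bytearray))


def isbytestring_sketch(obj):
    # python3: bytes or bytearray only
    return isinstance(obj, (bytes, bytearray))


def isunicodestring_sketch(obj):
    # python3: str is the unicode string type
    return isinstance(obj, str)


for value in ("text", b"bytes", bytearray(b"buffer"), 42):
    print(repr(value), isbasestring_sketch(value),
          isbytestring_sketch(value), isunicodestring_sketch(value))
```
.ft P
.fi
.UNINDENT
.UNINDENT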
.sp
New in version Kitchen: 1.2.0, API kitchen.text 2.2.0

.UNINDENT
.INDENT 0.0
.TP
.B kitchen.text.misc.process_control_chars(string, strategy=\(aqreplace\(aq)
Look for and transform control characters in a string
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.IP \(bu 2
\fBstring\fP \-\- string to search for and transform control characters
within
.IP \(bu 2
\fBstrategy\fP \-\- 
.sp
XML does not allow ASCII control
characters\&.  When we encounter those we need to know what to do.
Valid options are:
.INDENT 2.0
.TP
.B replace
(default) Replace the control characters
with \fB"?"\fP
.TP
.B ignore
Remove the characters altogether from the output
.TP
.B strict
Raise a \fBControlCharError\fP when
we encounter a control character
.UNINDENT

.UNINDENT
.TP
.B Raises
.INDENT 7.0
.IP \(bu 2
\fBTypeError\fP \-\- if \fBstring\fP is not a unicode string.
.IP \(bu 2
\fBValueError\fP \-\- if the strategy is not one of replace, ignore, or
strict.
.IP \(bu 2
\fBkitchen.text.exceptions.ControlCharError\fP \-\- if the strategy is
\fBstrict\fP and a control character is present in the
\fBstring\fP
.UNINDENT
.TP
.B Returns
\fBunicode\fP string with no control characters in
it.
.UNINDENT
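.sp
The three strategies can be illustrated with a small python3 sketch.  The
names \fBprocess_control_chars_sketch\fP, \fBControlCharErrorSketch\fP and
\fBCONTROL_CODES\fP are hypothetical, and the exact set of code points
treated as control characters is an assumption made for this example rather
than a statement of what kitchen uses.
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
```python
class ControlCharErrorSketch(Exception):
    """Hypothetical stand-in for kitchen.text.exceptions.ControlCharError."""


# Assumption: C0 codes other than tab, newline and carriage return,
# plus DEL and the C1 range
CONTROL_CODES = frozenset(
    [c for c in range(32) if c not in (9, 10, 13)] + [127] + list(range(128, 160))
)


def process_control_chars_sketch(string, strategy="replace"):
    if not isinstance(string, str):
        raise TypeError("string must be a unicode string")
    if strategy == "replace":
        # Map every control code to a question mark
        return string.translate(dict.fromkeys(CONTROL_CODES, "?"))
    if strategy == "ignore":
        # Mapping a code point to None deletes it
        return string.translate(dict.fromkeys(CONTROL_CODES))
    if strategy == "strict":
        if any(ord(char) in CONTROL_CODES for char in string):
            raise ControlCharErrorSketch("control character found")
        return string
    raise ValueError("strategy must be replace, ignore, or strict")


bell = chr(7)
print(process_control_chars_sketch("a" + bell + "b"))
print(process_control_chars_sketch("a" + bell + "b", strategy="ignore"))
```
.ft P
.fi
.UNINDENT
.UNINDENT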
.sp
Changed in version kitchen: 1.2.0, API: kitchen.text 2.2.0
Strip out the C1 control characters in addition to the C0 control
characters.

.UNINDENT
.INDENT 0.0
.TP
.B kitchen.text.misc.str_eq(str1, str2, encoding=\(aqutf\-8\(aq, errors=\(aqreplace\(aq)
Compare two strings, converting to byte \fBstr\fP if one is
\fBunicode\fP
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.IP \(bu 2
\fBstr1\fP \-\- First string to compare
.IP \(bu 2
\fBstr2\fP \-\- Second string to compare
.IP \(bu 2
\fBencoding\fP \-\- If we need to convert one string into a byte \fBstr\fP
to compare, the encoding to use.  Default is utf\-8\&.
.IP \(bu 2
\fBerrors\fP \-\- What to do if we encounter errors when encoding the string.
See the \fBkitchen.text.converters.to_bytes()\fP documentation for
possible values.  The default is \fBreplace\fP\&.
.UNINDENT
.UNINDENT
.sp
This function prevents \fBUnicodeError\fP (python\-2.4 or less) and
\fBUnicodeWarning\fP (python 2.5 and higher) when we compare
a \fBunicode\fP string to a byte \fBstr\fP\&.  The errors normally
arise because the conversion is done to ASCII\&.  This function
lets you convert to utf\-8 or another encoding instead.
.sp
\fBNOTE:\fP
.INDENT 7.0
.INDENT 3.5
When we need to convert one of the strings from \fBunicode\fP in
order to compare them we convert the \fBunicode\fP string into
a byte \fBstr\fP\&.  That means that strings can compare differently
if you use different encodings for each.
.UNINDENT
.UNINDENT
.sp
Note that \fBstr1 == str2\fP is faster than this function if you can accept
the following limitations:
.INDENT 7.0
.IP \(bu 2
Limited to python\-2.5+ (otherwise a \fBUnicodeDecodeError\fP may be
thrown)
.IP \(bu 2
Will generate a \fBUnicodeWarning\fP if non\-ASCII byte
\fBstr\fP is compared to \fBunicode\fP string.
.UNINDENT
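.sp
The comparison rule, including the effect of choosing a different
\fBencoding\fP, can be sketched in python3 as shown here.  The function name
\fBstr_eq_sketch\fP is hypothetical and the code only illustrates the idea of
encoding the text string before comparing; it is not the library
implementation.
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
```python
def str_eq_sketch(str1, str2, encoding="utf-8", errors="replace"):
    """Compare two strings, encoding the text one when the types differ."""
    if isinstance(str1, str) and not isinstance(str2, str):
        str1 = str1.encode(encoding, errors)
    elif isinstance(str2, str) and not isinstance(str1, str):
        str2 = str2.encode(encoding, errors)
    return str1 == str2


text = "café"
# Same characters, bytes encoded as UTF-8, compared using the default UTF-8
print(str_eq_sketch(text, text.encode("utf-8")))
# Bytes encoded as latin-1 do not match the UTF-8 encoding of the text
print(str_eq_sketch(text, text.encode("latin-1")))
# Telling the function to use latin-1 makes them compare equal again
print(str_eq_sketch(text, text.encode("latin-1"), encoding="latin-1"))
```
.ft P
.fi
.UNINDENT
.UNINDENT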
.UNINDENT
.SS UTF\-8
.sp
Functions for operating on byte \fBstr\fP encoded as UTF\-8
.sp
\fBNOTE:\fP
.INDENT 0.0
.INDENT 3.5
In many cases, it is better to convert to \fBunicode\fP, operate on the
strings, then convert back to UTF\-8\&.  \fBunicode\fP type can
handle many of these functions itself.  For those that it doesn\(aqt
(removing control characters from length calculations, for instance) the
code to do so with a \fBunicode\fP type is often simpler.
.UNINDENT
.UNINDENT
.sp
\fBWARNING:\fP
.INDENT 0.0
.INDENT 3.5
All of the functions in this module are deprecated.  Most of them have
been replaced with functions that operate on unicode values in
\fBkitchen.text.display\fP\&.  \fI\%kitchen.text.utf8.utf8_valid()\fP has
been replaced with a function in \fBkitchen.text.misc\fP\&.
.UNINDENT
.UNINDENT
.INDENT 0.0
.TP
.B kitchen.text.utf8.utf8_text_fill(text, *args, **kwargs)
\fBDeprecated\fP Similar to \fBtextwrap.fill()\fP but understands
utf\-8 strings and doesn\(aqt screw up lists/blocks/etc.
.sp
Use \fBkitchen.text.display.fill()\fP instead.
.UNINDENT
.INDENT 0.0
.TP
.B kitchen.text.utf8.utf8_text_wrap(text, width=70, initial_indent=\(aq\(aq, subsequent_indent=\(aq\(aq)
\fBDeprecated\fP Similar to \fBtextwrap.wrap()\fP but understands
utf\-8 data and doesn\(aqt screw up lists/blocks/etc
.sp
Use \fBkitchen.text.display.wrap()\fP instead
.UNINDENT
.INDENT 0.0
.TP
.B kitchen.text.utf8.utf8_valid(msg)
\fBDeprecated\fP Detect if a string is valid utf\-8
.sp
Use \fBkitchen.text.misc.byte_string_valid_encoding()\fP instead.
.UNINDENT
.INDENT 0.0
.TP
.B kitchen.text.utf8.utf8_width(msg)
\fBDeprecated\fP Get the textual width of a utf\-8 string
.sp
Use \fBkitchen.text.display.textual_width()\fP instead.
.UNINDENT
.INDENT 0.0
.TP
.B kitchen.text.utf8.utf8_width_chop(msg, chop=None)
\fBDeprecated\fP Return a string chopped to a given textual width
.sp
Use \fBtextual_width_chop()\fP and
\fBtextual_width()\fP instead:
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
>>> msg = \(aqく ku ら ra と to み mi\(aq
>>> # Old way:
>>> utf8_width_chop(msg, 5)
(5, \(aqく ku\(aq)
>>> # New way
>>> from kitchen.text.converters import to_bytes
>>> from kitchen.text.display import textual_width, textual_width_chop
>>> chopped = textual_width_chop(msg, 5)
>>> (textual_width(chopped), to_bytes(chopped))
(5, \(aqく ku\(aq)
.ft P
.fi
.UNINDENT
.UNINDENT
.UNINDENT
.INDENT 0.0
.TP
.B kitchen.text.utf8.utf8_width_fill(msg, fill, chop=None, left=True, prefix=\(aq\(aq, suffix=\(aq\(aq)
\fBDeprecated\fP Pad a utf\-8 string to fill a specified width
.sp
Use \fBbyte_string_textual_width_fill()\fP instead
.UNINDENT
.INDENT 0.0
.TP
.B \fBconverters\fP
deals with converting text for different encodings and to and from XML
.TP
.B \fBdisplay\fP
deals with issues with printing text to a screen
.TP
.B \fBmisc\fP
is a catchall for text manipulation functions that don\(aqt seem to fit
elsewhere
.TP
.B \fButf8\fP
contains deprecated functions to manipulate utf8 byte strings
.UNINDENT
.SS Kitchen.collections
.SS StrictDict
.sp
\fBkitchen.collections.StrictDict\fP provides a dictionary that treats
\fBstr\fP and \fBunicode\fP as distinct key values.
.INDENT 0.0
.TP
.B class kitchen.collections.strictdict.StrictDict
Map class that considers \fBunicode\fP and \fBstr\fP different keys
.sp
Ordinarily when you are dealing with a \fBdict\fP keyed on strings you
want to have keys that have the same characters end up in the same bucket
even if one key is \fBunicode\fP and the other is a byte \fBstr\fP\&.
The normal \fBdict\fP type does this for ASCII characters (but
not for anything outside of the ASCII range.)
.sp
Sometimes, however, you want to keep the two string classes strictly
separate, for instance, if you\(aqre creating a single table that can map
from \fBunicode\fP characters to \fBstr\fP characters and vice
versa.  This class will help you do that by making all \fBunicode\fP
keys evaluate to a different key than all \fBstr\fP keys.
.sp
\fBSEE ALSO:\fP
.INDENT 7.0
.INDENT 3.5
.INDENT 0.0
.TP
.B \fBdict\fP
for documentation on this class\(aqs methods.  This class implements
all the standard \fBdict\fP methods.  Its treatment of
\fBunicode\fP and \fBstr\fP keys as separate is the only
difference.
.UNINDENT
.UNINDENT
.UNINDENT
.UNINDENT
.SS Kitchen.iterutils Module
.sp
Functions to manipulate iterables
.sp
New in version Kitchen: 0.2.1a1

.sp
\fIModule author: Toshio Kuratomi <\fI\%toshio@fedoraproject.org\fP>\fP
.sp
\fIModule author: Luke Macken <\fI\%lmacken@redhat.com\fP>\fP
.INDENT 0.0
.TP
.B kitchen.iterutils.isiterable(obj, include_string=False)
Check whether an object is an iterable
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.IP \(bu 2
\fBobj\fP \-\- Object to test whether it is an iterable
.IP \(bu 2
\fBinclude_string\fP \-\- If \fBTrue\fP and \fBobj\fP is a byte
\fBstr\fP or \fBunicode\fP string this function will return
\fBTrue\fP\&.  If set to \fBFalse\fP, byte \fBstr\fP and
\fBunicode\fP strings will cause this function to return
\fBFalse\fP\&.  Default \fBFalse\fP\&.
.UNINDENT
.TP
.B Returns
\fBTrue\fP if \fBobj\fP is iterable, otherwise
\fBFalse\fP\&.
.UNINDENT
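.sp
The effect of \fBinclude_string\fP can be illustrated with a short python3
sketch.  \fBisiterable_sketch\fP is a hypothetical name and the body is one
plausible way to get the documented behaviour, not a copy of the kitchen
code.
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
```python
def isiterable_sketch(obj, include_string=False):
    # Strings are iterable, but are only reported as such on request
    if isinstance(obj, (str, bytes)):
        return include_string
    try:
        iter(obj)
    except TypeError:
        return False
    return True


print(isiterable_sketch([1, 2, 3]))
print(isiterable_sketch(5))
print(isiterable_sketch("abc"))
print(isiterable_sketch("abc", include_string=True))
```
.ft P
.fi
.UNINDENT
.UNINDENT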
.UNINDENT
.INDENT 0.0
.TP
.B kitchen.iterutils.iterate(obj, include_string=False)
Generator that can be used to iterate over anything
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.IP \(bu 2
\fBobj\fP \-\- The object to iterate over
.IP \(bu 2
\fBinclude_string\fP \-\- if \fBTrue\fP, treat strings as iterables.
Otherwise treat them as a single scalar value.  Default \fBFalse\fP
.UNINDENT
.UNINDENT
.sp
This function will create an iterator out of any scalar or iterable.  It
is useful for making a value given to you an iterable before operating on it.
Iterables have their items returned.  Scalars are transformed into iterables.
A string is treated as a scalar value unless the \fBinclude_string\fP
parameter is set to \fBTrue\fP\&.  Example usage:
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
>>> list(iterate(None))
[None]
>>> list(iterate([None]))
[None]
>>> list(iterate([1, 2, 3]))
[1, 2, 3]
>>> list(iterate(set([1, 2, 3])))
[1, 2, 3]
>>> list(iterate(dict(a=\(aq1\(aq, b=\(aq2\(aq)))
[\(aqa\(aq, \(aqb\(aq]
>>> list(iterate(1))
[1]
>>> list(iterate(iter([1, 2, 3])))
[1, 2, 3]
>>> list(iterate(\(aqabc\(aq))
[\(aqabc\(aq]
>>> list(iterate(\(aqabc\(aq, include_string=True))
[\(aqa\(aq, \(aqb\(aq, \(aqc\(aq]
.ft P
.fi
.UNINDENT
.UNINDENT
.UNINDENT
.SS Helpers for versioning software
.SS PEP\-386 compliant versioning
.sp
\fI\%PEP 386\fP defines a standard format for version strings.  This module
contains a function for creating strings in that format.
.INDENT 0.0
.TP
.B kitchen.versioning.version_tuple_to_string(version_info)
Return a \fI\%PEP 386\fP version string from a \fI\%PEP 386\fP style version tuple
.INDENT 7.0
.TP
.B Parameters
\fBversion_info\fP \-\- Nested set of tuples that describes the version.  See
below for an example.
.TP
.B Returns
a version string
.UNINDENT
.sp
This function implements just enough of \fI\%PEP 386\fP to satisfy our needs.
\fI\%PEP 386\fP defines a standard format for version strings and refers to
a function that will be merged into the \fI\%python standard library\fP that transforms a tuple
of version information into a standard version string.  This function is
an implementation of that function.  Once that function becomes available
in the \fI\%python standard library\fP we will start using it and deprecate this function.
.sp
\fBversion_info\fP takes the form that \fI\%PEP 386\fP\(aqs
\fBNormalizedVersion.from_parts()\fP uses:
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
((Major, Minor, [Micros]), [(Alpha/Beta/rc marker, version)],
    [(post/dev marker, version)])

Ex: ((1, 0, 0), (\(aqa\(aq, 2), (\(aqdev\(aq, 3456))
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
It generates a \fI\%PEP 386\fP compliant version string:
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
N.N[.N]+[{a|b|c|rc}N[.N]+][.postN][.devN]

Ex: 1.0.0a2.dev3456
.ft P
.fi
.UNINDENT
.UNINDENT
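.sp
To make the mapping from tuple to string concrete, here is a small python3
sketch that handles the tuple layout shown above.  The name
\fBversion_tuple_to_string_sketch\fP is hypothetical, and like the real
function it does no validation; treat it as an illustration of the format
rather than the library code.
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
```python
def version_tuple_to_string_sketch(version_info):
    # The first tuple holds the release numbers: Major, Minor, [Micros]
    parts = [".".join(str(number) for number in version_info[0])]
    # Any later tuples are (marker, version, ...) pairs
    for marker, *numbers in version_info[1:]:
        suffix = ".".join(str(number) for number in numbers)
        if marker in ("post", "dev"):
            # post and dev markers are separated from the rest by a dot
            parts.append("." + marker + suffix)
        else:
            # a, b, c and rc markers attach directly to the release number
            parts.append(marker + suffix)
    return "".join(parts)


print(version_tuple_to_string_sketch(((1, 0, 0), ("a", 2), ("dev", 3456))))
print(version_tuple_to_string_sketch(((0, 2, 1),)))
```
.ft P
.fi
.UNINDENT
.UNINDENT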
.sp
\fBWARNING:\fP
.INDENT 7.0
.INDENT 3.5
This function does next to no error checking.  It\(aqs up to the
person defining the version tuple to make sure that the values make
sense.  If the \fI\%PEP 386\fP compliant version parser doesn\(aqt get
released soon we\(aqll look at making this function check that the
version tuple makes sense before transforming it into a string.
.UNINDENT
.UNINDENT
.sp
It\(aqs recommended that you use this function to keep
a \fB__version_info__\fP tuple and \fB__version__\fP string in your
modules.  Why do we need both a tuple and a string?  The string is often
useful for putting into human readable locations like release
announcements, version strings in tarballs, etc.  Meanwhile the tuple is
very easy for a computer to compare. For example, kitchen sets up its
version information like this:
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
from kitchen.versioning import version_tuple_to_string
__version_info__ = ((0, 2, 1),)
__version__ = version_tuple_to_string(__version_info__)
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Other programs that depend on a kitchen version between 0.2.1 and 0.3.0
can find whether the present version is okay with code like this:
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
from kitchen import __version_info__, __version__
if __version_info__ < ((0, 2, 1),) or __version_info__ >= ((0, 3, 0),):
    print \(aqkitchen is present but not at the right version.\(aq
    print \(aqWe need at least version 0.2.1 and less than 0.3.0\(aq
    print \(aqCurrently found: kitchen\-%s\(aq % __version__
.ft P
.fi
.UNINDENT
.UNINDENT
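.sp
The range check relies on ordinary tuple comparison, which the following
python3 sketch isolates.  The helper \fBversion_ok\fP and its parameter names
are hypothetical and exist only for this illustration.
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
```python
def version_ok(version_info, minimum=((0, 2, 1),), too_new=((0, 3, 0),)):
    """Return True when minimum <= version_info < too_new."""
    return minimum <= version_info < too_new


print(version_ok(((0, 2, 1),)))
print(version_ok(((0, 2, 9),)))
print(version_ok(((0, 3, 0),)))
# A tuple with a pre-release marker is longer than the bare release tuple,
# so plain tuple comparison sorts it after that release
print(version_ok(((0, 2, 1), ("a", 1))))
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Because a longer tuple compares greater than its own prefix, a pre\-release
such as \fB((0, 2, 1), (\(aqa\(aq, 1))\fP passes a check against
\fB((0, 2, 1),)\fP\&.  Code that must exclude pre\-releases may need to
inspect the later elements of the tuple as well.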
.UNINDENT
.SS Exceptions
.sp
Kitchen has a hierarchy of exceptions that should make it easy to catch many
errors emitted by kitchen itself.
.SS Base kitchen exceptions
.sp
Exception classes for kitchen and the root of the exception hierarchy for
all kitchen modules.
.INDENT 0.0
.TP
.B exception kitchen.exceptions.KitchenError
Base exception class for any error thrown directly by kitchen.
.UNINDENT
.SS Kitchen.text exceptions
.sp
Exception classes thrown by kitchen\(aqs text processing routines.
.INDENT 0.0
.TP
.B exception kitchen.text.exceptions.XmlEncodeError
Exception thrown by error conditions when encoding an xml string.
.UNINDENT
.INDENT 0.0
.TP
.B exception kitchen.text.exceptions.ControlCharError
Exception thrown when an ascii control character is encountered.
.UNINDENT
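.sp
As an illustration of why a shared base class is convenient, the python3
sketch below defines local stand\-ins with the same names as the kitchen
exceptions.  That the text exceptions derive from \fBKitchenError\fP is an
assumption drawn from the description of it as the root of the hierarchy,
and \fBrisky()\fP is a hypothetical function used only to raise them.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
class KitchenError(Exception):
    """Local stand-in for kitchen.exceptions.KitchenError."""


class XmlEncodeError(KitchenError):
    """Local stand-in for kitchen.text.exceptions.XmlEncodeError."""


class ControlCharError(KitchenError):
    """Local stand-in for kitchen.text.exceptions.ControlCharError."""


def risky(kind):
    # Hypothetical helper that raises one of the specific errors
    if kind == "xml":
        raise XmlEncodeError("could not encode the string for xml")
    raise ControlCharError("found a control character")


caught = []
for kind in ("xml", "control"):
    try:
        risky(kind)
    except KitchenError as error:
        # One except clause handles every error in the hierarchy
        caught.append(type(error).__name__)

print(caught)
```
.ft P
.fi
.UNINDENT
.UNINDENT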
.SS 1.0.0 Porting Guide
.sp
The 0.1 through 1.0.0 releases focused on bringing in functions from yum and
python\-fedora.  This porting guide tells how to port from those APIs to their
kitchen replacements.
.SS python\-fedora
.TS
center;
|l|l|.
_
T{
python\-fedora
T}	T{
kitchen replacement
T}
_
T{
\fBfedora.iterutils.isiterable()\fP
T}	T{
\fBkitchen.iterutils.isiterable()\fP [1]
T}
_
T{
\fBfedora.textutils.to_unicode()\fP
T}	T{
\fBkitchen.text.converters.to_unicode()\fP
T}
_
T{
\fBfedora.textutils.to_bytes()\fP
T}	T{
\fBkitchen.text.converters.to_bytes()\fP
T}
_
.TE
.IP [1] 5
\fBisiterable()\fP has changed slightly in
kitchen.  The \fBinclude_string\fP parameter has switched its default value
from \fBTrue\fP to \fBFalse\fP\&.  So you need to change code like:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
>>> # Old code
>>> isiterable(\(aqabcdef\(aq)
True
>>> # New code
>>> isiterable(\(aqabcdef\(aq, include_string=True)
True
.ft P
.fi
.UNINDENT
.UNINDENT
.SS yum
.TS
center;
|l|l|.
_
T{
yum
T}	T{
kitchen replacement
T}
_
T{
\fByum.i18n.dummy_wrapper()\fP
T}	T{
\fBkitchen.i18n.DummyTranslations.ugettext()\fP [2]
T}
_
T{
\fByum.i18n.dummyP_wrapper()\fP
T}	T{
\fBkitchen.i18n.DummyTranslations.ungettext()\fP [2]
T}
_
T{
\fByum.i18n.utf8_width()\fP
T}	T{
\fBkitchen.text.display.textual_width()\fP
T}
_
T{
\fByum.i18n.utf8_width_chop()\fP
T}	T{
\fBkitchen.text.display.textual_width_chop()\fP
and \fBkitchen.text.display.textual_width()\fP [3] [5]
T}
_
T{
\fByum.i18n.utf8_valid()\fP
T}	T{
\fBkitchen.text.misc.byte_string_valid_encoding()\fP
T}
_
T{
\fByum.i18n.utf8_text_wrap()\fP
T}	T{
\fBkitchen.text.display.wrap()\fP [4]
T}
_
T{
\fByum.i18n.utf8_text_fill()\fP
T}	T{
\fBkitchen.text.display.fill()\fP [4]
T}
_
T{
\fByum.i18n.to_unicode()\fP
T}	T{
\fBkitchen.text.converters.to_unicode()\fP [6]
T}
_
T{
\fByum.i18n.to_unicode_maybe()\fP
T}	T{
\fBkitchen.text.converters.to_unicode()\fP [6]
T}
_
T{
\fByum.i18n.to_utf8()\fP
T}	T{
\fBkitchen.text.converters.to_bytes()\fP [6]
T}
_
T{
\fByum.i18n.to_str()\fP
T}	T{
\fBkitchen.text.converters.to_unicode()\fP
or \fBkitchen.text.converters.to_bytes()\fP [7]
T}
_
T{
\fByum.i18n.str_eq()\fP
T}	T{
\fBkitchen.text.misc.str_eq()\fP
T}
_
T{
\fByum.misc.to_xml()\fP
T}	T{
\fBkitchen.text.converters.unicode_to_xml()\fP
or \fBkitchen.text.converters.byte_string_to_xml()\fP [8]
T}
_
T{
\fByum.i18n._()\fP
T}	T{
See: \fI\%Initializing Yum i18n\fP
T}
_
T{
\fByum.i18n.P_()\fP
T}	T{
See: \fI\%Initializing Yum i18n\fP
T}
_
T{
\fByum.i18n.exception2msg()\fP
T}	T{
\fBkitchen.text.converters.exception_to_unicode()\fP
or \fBkitchen.text.converters.exception_to_bytes()\fP [9]
T}
_
.TE
.IP [2] 5
These yum methods provided fallback support for \fBgettext\fP
functions in case either \fBgaftonmode\fP was set or \fBgettext\fP failed
to return an object.  In kitchen, we can use the
\fBkitchen.i18n.DummyTranslations\fP object to fulfill that role.
Please see \fI\%Initializing Yum i18n\fP for more suggestions on how to do this.
.IP [3] 5
The yum version of these functions returned a byte \fBstr\fP\&.  The
kitchen version listed here returns a \fBunicode\fP string.  If you
need a byte \fBstr\fP simply call
\fBkitchen.text.converters.to_bytes()\fP on the result.
.IP [4] 5
The yum version of these functions would return either a byte
\fBstr\fP or a \fBunicode\fP string depending on what the input
value was.  The kitchen version always returns \fBunicode\fP strings.
.IP [5] 5
\fByum.i18n.utf8_width_chop()\fP performed two functions.  It
returned the piece of the message that fit in a specified width and the
width of that message.  In kitchen, you need to call two functions, one
for each action:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
>>> # Old way
>>> utf8_width_chop(msg, 5)
(5, \(aqく ku\(aq)
>>> # New way
>>> from kitchen.text.display import textual_width, textual_width_chop
>>> chopped = textual_width_chop(msg, 5)
>>> (textual_width(chopped), chopped)
(5, u\(aqく ku\(aq)
.ft P
.fi
.UNINDENT
.UNINDENT
.IP [6] 5
If the yum version of \fBto_unicode()\fP or
\fBto_utf8()\fP is given an object that is not a string, it
returns the object itself.  \fBkitchen.text.converters.to_unicode()\fP and
\fBkitchen.text.converters.to_bytes()\fP default to returning the
\fBsimplerepr\fP of the object instead.  If you want the yum behaviour, set
the \fBnonstring\fP parameter to \fBpassthru\fP:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
>>> from kitchen.text.converters import to_unicode
>>> to_unicode(5)
u\(aq5\(aq
>>> to_unicode(5, nonstring=\(aqpassthru\(aq)
5
.ft P
.fi
.UNINDENT
.UNINDENT
.IP [7] 5
\fByum.i18n.to_str()\fP could return either a byte \fBstr\fP or
a \fBunicode\fP string.  In kitchen you can get the same effect but you
get to choose whether you want a byte \fBstr\fP or a \fBunicode\fP
string.  Use \fBto_bytes()\fP for \fBstr\fP
and \fBto_unicode()\fP for \fBunicode\fP\&.
.IP [8] 5
\fByum.misc.to_xml()\fP was buggy as written.  I think the intention
was for you to be able to pass a byte \fBstr\fP or \fBunicode\fP
string in and get out a byte \fBstr\fP that was valid to use in an xml
file.  The two kitchen functions
\fBbyte_string_to_xml()\fP and
\fBunicode_to_xml()\fP do that for each string
type.
.IP [9] 5
When porting \fByum.i18n.exception2msg()\fP to use kitchen, you
should set up two wrapper functions to aid in your port.  They\(aqll look like
this:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
from kitchen.text.converters import EXCEPTION_CONVERTERS, \e
    BYTE_EXCEPTION_CONVERTERS, exception_to_unicode, \e
    exception_to_bytes
def exception2umsg(e):
    \(aq\(aq\(aqReturn a unicode representation of an exception\(aq\(aq\(aq
    c = [lambda e: e.value]
    c.extend(EXCEPTION_CONVERTERS)
    return exception_to_unicode(e, converters=c)
def exception2bmsg(e):
    \(aq\(aq\(aqReturn a utf8 encoded str representation of an exception\(aq\(aq\(aq
    c = [lambda e: e.value]
    c.extend(BYTE_EXCEPTION_CONVERTERS)
    return exception_to_bytes(e, converters=c)
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
The reason to define this wrapper is that many of the exceptions in yum
put the message in the \fBvalue\fP attribute of the \fBException\fP
instead of adding it to the \fBargs\fP attribute.  So the default
\fBEXCEPTION_CONVERTERS\fP don\(aqt know where to
find the message.  The wrapper tells kitchen to check the \fBvalue\fP
attribute for the message.  The reason to define two wrappers may be less
obvious.  \fByum.i18n.exception2msg()\fP can return a \fBunicode\fP
string or a byte \fBstr\fP depending on a combination of what
attributes are present on the \fBException\fP and what locale the
function is being run in.  By contrast,
\fBkitchen.text.converters.exception_to_unicode()\fP only returns
\fBunicode\fP strings and
\fBkitchen.text.converters.exception_to_bytes()\fP only returns byte
\fBstr\fP\&.  This is much safer as it keeps code that can only handle
\fBunicode\fP or only handle byte \fBstr\fP correctly from getting
the wrong type when an input changes but it means you need to examine the
calling code when porting from \fByum.i18n.exception2msg()\fP and use the
appropriate wrapper.
.SS Initializing Yum i18n
.sp
Previously, yum had several pieces of code to initialize i18n.  From the
toplevel of \fByum/i18n.py\fP:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
try:
    \(aq\(aq\(aq
    Setup the yum translation domain and make _() and P_() translation wrappers
    available.
    using ugettext to make sure translated strings are in Unicode.
    \(aq\(aq\(aq
    import gettext
    t = gettext.translation(\(aqyum\(aq, fallback=True)
    _ = t.ugettext
    P_ = t.ungettext
except:
    \(aq\(aq\(aq
    Something went wrong so we make a dummy _() wrapper there is just
    returning the same text
    \(aq\(aq\(aq
    _ = dummy_wrapper
    P_ = dummyP_wrapper
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
With kitchen, this can be changed to this:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
from kitchen.i18n import easy_gettext_setup, DummyTranslations
try:
    _, P_ = easy_gettext_setup(\(aqyum\(aq)
except:
    translations = DummyTranslations()
    _ = translations.ugettext
    P_ = translations.ungettext
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
\fBNOTE:\fP
.INDENT 0.0
.INDENT 3.5
In overcoming\-frustration, it is mentioned that for some
things (like exception messages), using the byte \fBstr\fP oriented
functions is more appropriate.  If this is desired, the setup portion is
only a second call to \fBkitchen.i18n.easy_gettext_setup()\fP:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
b_, bP_ = easy_gettext_setup(\(aqyum\(aq, use_unicode=False)
.ft P
.fi
.UNINDENT
.UNINDENT
.UNINDENT
.UNINDENT
.sp
The second place where i18n is set up is in \fByum.YumBase._getConfig()\fP in
\fByum/__init__.py\fP if \fBgaftonmode\fP is in effect:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
if startupconf.gaftonmode:
    global _
    _ = yum.i18n.dummy_wrapper
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
This can be changed to:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
if startupconf.gaftonmode:
    global _
    _ = DummyTranslations().ugettext
.ft P
.fi
.UNINDENT
.UNINDENT
.SS Conventions for contributing to kitchen
.SS Style
.INDENT 0.0
.IP \(bu 2
Strive to be \fI\%PEP 8\fP compliant
.IP \(bu 2
Run \fBpylint\fP over the code and try to resolve most of its nitpicking
.UNINDENT
.SS Python 2.4 compatibility
.sp
At the moment, we\(aqre supporting python\-2.4 and above.  Understand that there
are a lot of python features that we cannot use because of this.
.sp
Sometimes modules in the \fI\%python standard library\fP can be added to kitchen so that they\(aqre
available.  When we do that we need to be careful of several things:
.INDENT 0.0
.IP 1. 3
Keep the module in sync with the version in the python\-2.x trunk.  Use
\fBmaintainers/sync\-copied\-files.py\fP for this.
.IP 2. 3
Sync the unittests as well as the module.
.IP 3. 3
Be aware that not all modules are written to remain compatible with
Python\-2.4 and might use python language features that were not present
then (generator expressions, relative imports, decorators, with, try: with
both except: and finally:, etc.).  These are not good candidates for
importing into kitchen as they require more work to keep synced.
.UNINDENT
.SS Unittests
.INDENT 0.0
.IP \(bu 2
At least smoketest your code (make sure a function will return expected
values for one set of inputs).
.IP \(bu 2
Note that even 100% coverage is not a guarantee of working code!  Good tests
will realize that you need to also give multiple inputs that test the code
paths of called functions that are outside of your code.  Example:
.INDENT 2.0
.INDENT 3.5
.sp
.nf
.ft C
def to_unicode(msg, encoding=\(aqutf8\(aq, errors=\(aqreplace\(aq):
    return unicode(msg, encoding, errors)

# Smoketest only.  This will give 100% coverage for your code (it
# tests all of the code inside of to_unicode) but it leaves a lot of
# room for errors as it doesn\(aqt test all combinations of arguments
# that are then passed to the unicode() function.

tools.ok_(to_unicode(\(aqabc\(aq) == u\(aqabc\(aq)

# Better \-\- tests now cover non\-ascii characters and that error conditions
# occur properly.  There\(aqs a lot of other permutations that can be
# added along these same lines.
tools.ok_(to_unicode(u\(aqcafé\(aq.encode(\(aqutf8\(aq), \(aqutf8\(aq, \(aqreplace\(aq) == u\(aqcafé\(aq)
tools.assert_raises(UnicodeError, to_unicode, u\(aqcafè ñunru\(aq.encode(\(aqlatin1\(aq), \(aqutf8\(aq, \(aqstrict\(aq)
.ft P
.fi
.UNINDENT
.UNINDENT
.IP \(bu 2
We\(aqre using nose for unittesting.  Rather than depend on unittest2
functionality, use the functions that nose provides.
.IP \(bu 2
Remember to maintain python\-2.4 compatibility even in unittests.
.UNINDENT
.SS Docstrings and documentation
.sp
We use sphinx to build our documentation.  We use the sphinx autodoc extension
to pull docstrings out of the modules for API documentation.  This means that
docstrings for subpackages and modules should follow a certain pattern.  The
general structure is:
.INDENT 0.0
.IP \(bu 2
Introductory material about a module in the module\(aqs top level docstring.
.INDENT 2.0
.IP \(bu 2
Introductory material should begin with a level two title: an overbar and
underbar of \(aq\-\(aq.
.UNINDENT
.IP \(bu 2
docstrings for every function.
.INDENT 2.0
.IP \(bu 2
The first line is a short summary of what the function does
.IP \(bu 2
This is followed by a blank line
.IP \(bu 2
The next lines are a field list
(http://sphinx.pocoo.org/markup/desc.html#info\-field\-lists) giving
information about the function\(aqs signature.  We use the keywords:
\fBarg\fP, \fBkwarg\fP, \fBraises\fP, \fBreturns\fP, and sometimes \fBrtype\fP\&.  Use
these to describe all arguments, keyword arguments, exceptions raised,
and return values.
.INDENT 2.0
.IP \(bu 2
Parameters that are \fBkwarg\fP should specify what their default
behaviour is.
.UNINDENT
.UNINDENT
.UNINDENT
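.sp
A docstring following the function pattern above might look like the
python3 sketch below.  The function \fBpad_text\fP is hypothetical and was
invented only to give the field list something to describe.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
def pad_text(text, width=10):
    """Pad text with trailing spaces up to a given width

    :arg text: string to pad
    :kwarg width: minimum length of the result.  Default: 10
    :raises TypeError: if text is not a string
    :returns: text padded on the right with spaces
    :rtype: str
    """
    if not isinstance(text, str):
        raise TypeError("text must be a string")
    return text.ljust(width)


print(repr(pad_text("abc", width=5)))
```
.ft P
.fi
.UNINDENT
.UNINDENT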
.SS Kitchen versioning
.sp
Currently the kitchen library is in early stages of development.  While we\(aqre
in this state, the main kitchen library uses the following pattern for version
information:
.INDENT 0.0
.IP \(bu 2
.INDENT 2.0
.TP
.B Versions look like this:
__version_info__ = ((0, 1, 2),)
__version__ = \(aq0.1.2\(aq
.UNINDENT
.IP \(bu 2
The Major version number remains at 0 until we decide to make the first 1.0
release of kitchen.  At that point, we\(aqre declaring that we have some
confidence that we won\(aqt need to break backwards compatibility for a while.
.IP \(bu 2
The Minor version increments for any backwards incompatible API changes.
When this is updated, we reset micro to zero.
.IP \(bu 2
The Micro version increments for any other changes (backwards compatible API
changes, pure bugfixes, etc).
.UNINDENT
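.sp
The increment rules above can be written down as a short python3 sketch.
The helper \fBbump_version\fP and its \fBincompatible\fP flag are
hypothetical; they restate the rules for the pre\-1.0 scheme and are not
part of kitchen.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
def bump_version(version, incompatible):
    """Return the next (major, minor, micro) under the rules above."""
    major, minor, micro = version
    if incompatible:
        # Backwards incompatible API change: bump minor, reset micro
        return (major, minor + 1, 0)
    # Anything else (compatible API changes, bugfixes): bump micro
    return (major, minor, micro + 1)


print(bump_version((0, 1, 2), incompatible=False))
print(bump_version((0, 1, 2), incompatible=True))
```
.ft P
.fi
.UNINDENT
.UNINDENT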
.sp
\fBNOTE:\fP
.INDENT 0.0
.INDENT 3.5
Versioning is only updated for releases that generate sdists and new
uploads to the download directory.  Usually we update the version
information for the library just before release.  By contrast, we update
kitchen \fI\%Versioning\fP when an API change is made.  When in
doubt, look at the version information in the last release.
.UNINDENT
.UNINDENT
.SS I18N
.sp
All strings that are used as feedback for users need to be translated.
\fBkitchen\fP sets up several functions for this.  \fB_()\fP is used for
marking things that are shown to users via print, GUIs, or other "standard"
methods.  Strings for exceptions are marked with \fBb_()\fP\&.  This function
returns a byte \fBstr\fP which is needed for use with exceptions:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
from kitchen import _, b_

def print_message(msg, username):
    print _(\(aq%(user)s, your message of the day is:  %(message)s\(aq) % {
            \(aqmessage\(aq: msg, \(aquser\(aq: username}

    raise Exception(b_(\(aqTest message\(aq))
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
This serves several purposes:
.INDENT 0.0
.IP \(bu 2
It marks the strings to be extracted by an xgettext\-like program.
.IP \(bu 2
\fB_()\fP is a function that will substitute available translations at
runtime.
.UNINDENT
.sp
\fBNOTE:\fP
.INDENT 0.0
.INDENT 3.5
By using the \fB%()s with dict\fP style of string formatting, we make this
string friendly to translators that may need to reorder the variables when
they\(aqre translating the string.
.UNINDENT
.UNINDENT
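.sp
To see why the dict style helps translators, consider the python3 sketch
below.  The \fBreordered\fP template stands in for a hypothetical
translation that needs the variables in a different order; the same dict
fills in both templates.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
values = {"user": "toshio", "message": "Hello"}

original = "%(user)s, your message of the day is: %(message)s"
# Hypothetical translation that mentions the message before the user
reordered = "%(message)s is the message of the day for %(user)s"

print(original % values)
print(reordered % values)

# With positional formatting the argument order is fixed by the caller,
# so a template that swaps the two placeholders gets the values swapped
positional = "%s is the message of the day for %s"
print(positional % ("toshio", "Hello"))
```
.ft P
.fi
.UNINDENT
.UNINDENT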
.sp
paver (http://www.blueskyonmars.com/projects/paver/) and babel
(http://babel.edgewall.org/) are used to extract the strings.
.SS API updates
.sp
Kitchen strives to have a long deprecation cycle so that people have time to
switch away from any APIs that we decide to discard.  Discarded APIs should
raise a \fBDeprecationWarning\fP and clearly state in the warning message and
the docstring how to convert old code to use the new interface.  An example of
deprecating a function:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
import warnings

from kitchen import _
from kitchen.text.converters import to_bytes, to_unicode
from kitchen.text.new_module import new_function

def old_function(param):
    \(aq\(aq\(aq**Deprecated**

    This function is deprecated.  Use
    :func:\(gakitchen.text.new_module.new_function\(ga instead. If you want
    unicode strings as output, switch to::

        >>> from kitchen.text.new_module import new_function
        >>> output = new_function(param)

    If you want byte strings, use::

        >>> from kitchen.text.new_module import new_function
        >>> from kitchen.text.converters import to_bytes
        >>> output = to_bytes(new_function(param))
    \(aq\(aq\(aq
    warnings.warn(_(\(aqkitchen.text.old_function is deprecated.  Use\(aq
        \(aq kitchen.text.new_module.new_function instead\(aq),
        DeprecationWarning, stacklevel=2)

    as_unicode = isinstance(param, unicode)
    message = new_function(to_unicode(param))
    if not as_unicode:
        message = to_bytes(message)
    return message
.ft P
.fi
.UNINDENT
.UNINDENT
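.sp
One way to check that a deprecated function really emits its warning is to
record warnings while calling it.  The python3 sketch below uses only the
standard library \fBwarnings\fP module; \fBold_function\fP and
\fBnew_function\fP here are simplified, hypothetical stand\-ins for the
example above.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
import warnings


def new_function(param):
    return param.upper()


def old_function(param):
    """Deprecated: use new_function instead."""
    # stacklevel=2 attributes the warning to the caller of old_function
    warnings.warn("old_function is deprecated.  Use new_function instead",
                  DeprecationWarning, stacklevel=2)
    return new_function(param)


with warnings.catch_warnings(record=True) as caught:
    # Make sure the warning is not suppressed by the default filters
    warnings.simplefilter("always")
    result = old_function("abc")

print(result)
print(caught[0].category.__name__)
```
.ft P
.fi
.UNINDENT
.UNINDENT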
.sp
If a particular API change is very intrusive, it may be better to create a new
version of the subpackage and ship both the old version and the new version.
.SS NEWS file
.sp
Update the \fBNEWS\fP file when you make a change that will be visible to
the users.  This is not a ChangeLog file so we don\(aqt need to list absolutely
everything but it should give the user an idea of how this version differs
from prior versions.  API changes should be listed here explicitly.  Bugfixes
can be more general:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
\-\-\-\-\-
0.2.0
\-\-\-\-\-
* Relicense to LGPLv2+
* Add kitchen.text.format module with the following functions:
  textual_width, textual_width_chop.
* Rename the kitchen.text.utils module to kitchen.text.misc.  Use of the
  old name is deprecated but still available.
* Bugfixes applied to kitchen.pycompat24.defaultdict that fix some
  tracebacks
.ft P
.fi
.UNINDENT
.UNINDENT
.SS Kitchen subpackages
.sp
Kitchen itself is a namespace.  The kitchen sdist (tarball) provides certain
useful subpackages.
.sp
\fBSEE ALSO:\fP
.INDENT 0.0
.INDENT 3.5
.INDENT 0.0
.TP
.B \fI\%Kitchen addon packages\fP
For information about subpackages not distributed in the kitchen sdist
that install into the kitchen namespace.
.UNINDENT
.UNINDENT
.UNINDENT
.SS Versioning
.sp
Each subpackage should have its own version information which is independent
of the other kitchen subpackages and the main kitchen library version.  This
lets code that depends on kitchen APIs check the version information.  The
standard way to provide it is to put something like this in the subpackage\(aqs
\fB__init__.py\fP:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
from kitchen.versioning import version_tuple_to_string

__version_info__ = ((1, 0, 0),)
__version__ = version_tuple_to_string(__version_info__)
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
\fB__version_info__\fP is documented in \fBkitchen.versioning\fP\&.  The
values of the first tuple should describe API changes to the module.  There
are at least three numbers present in the tuple: (Major, minor, micro).  The
major version number is for backwards incompatible changes (for
instance, removing a function or adding a new mandatory argument to
a function).  Whenever one of these occurs, you should increment the major
number and reset minor and micro to zero.  The second number is the minor
version.  Any time new but backwards compatible changes are introduced, this
number should be incremented and the micro version number reset to zero.  The
micro version should be incremented when a change is made that does not change
the API at all.  This is a common case for bugfixes, for instance.
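.sp
The increment rules above can be summarized in a short sketch.  The
\fBbump\fP helper and the values accepted by its \fBchange\fP argument are
hypothetical names chosen for illustration; they are not part of kitchen:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
def bump(version, change):
    """Return the next (major, minor, micro) tuple for a kind of change.

    change is one of "incompatible", "compatible" or "bugfix".
    """
    major, minor, micro = version[:3]
    if change == "incompatible":
        # Backwards incompatible: increment major, reset minor and micro
        return (major + 1, 0, 0)
    if change == "compatible":
        # New but backwards compatible: increment minor, reset micro
        return (major, minor + 1, 0)
    if change == "bugfix":
        # No API change at all: increment micro
        return (major, minor, micro + 1)
    raise ValueError("unknown kind of change: %r" % (change,))
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
For example, starting from \fB(1, 0, 1)\fP, a bugfix yields \fB(1, 0, 2)\fP,
a compatible addition yields \fB(1, 1, 0)\fP, and an incompatible change
yields \fB(2, 0, 0)\fP\&.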
.sp
Version information beyond the first three parts of the first tuple may be
useful for versioning but semantically has a meaning similar to the micro
version.
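.sp
Code that depends on a subpackage can compare the first tuple of
\fB__version_info__\fP against the API version it was written for.  The
following is a minimal sketch, assuming a tuple of tuples whose first element
starts with (major, minor, micro).  The \fBapi_compatible\fP helper is
a hypothetical name, not a kitchen function:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
def api_compatible(version_info, required_major, required_minor=0):
    """Return True if version_info satisfies the required API version.

    version_info is a tuple of tuples such as ((1, 2, 0),).
    """
    major, minor = version_info[0][:2]
    # A different major number signals backwards incompatible changes
    if major != required_major:
        return False
    # Within one major version, newer minor versions only add features
    return minor >= required_minor
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
The micro number and any further parts are ignored here because they do not
describe API changes.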
.sp
\fBNOTE:\fP
.INDENT 0.0
.INDENT 3.5
We update the \fB__version_info__\fP tuple when the API is updated.
This way there\(aqs less chance of forgetting to update the API version when
a new release is made.  However, we try to only increment the version
numbers a single step for any release.  So if kitchen\-0.1.0 has
kitchen.text.__version__ == \(aq1.0.1\(aq, kitchen\-0.1.1 should have
kitchen.text.__version__ == \(aq1.0.2\(aq or \(aq1.1.0\(aq or \(aq2.0.0\(aq.
.UNINDENT
.UNINDENT
.SS Criteria for subpackages in kitchen
.sp
Subpackages within kitchen should meet these criteria:
.INDENT 0.0
.IP \(bu 2
Generally useful or needed for other pieces of kitchen.
.IP \(bu 2
No mandatory requirements outside of the \fI\%python standard library\fP\&.
.INDENT 2.0
.IP \(bu 2
Optional requirements from outside the \fI\%python standard library\fP are allowed.  Things with
mandatory requirements are better placed in \fI\%kitchen addon packages\fP\&.
.UNINDENT
.IP \(bu 2
Somewhat API stable \-\- this is not a hard requirement.  We can change the
kitchen api.  However, it is better not to as people may come to depend on
it.
.sp
\fBSEE ALSO:\fP
.INDENT 2.0
.INDENT 3.5
\fI\%API Updates\fP
.UNINDENT
.UNINDENT
.UNINDENT
.SS Kitchen addon packages
.sp
Addon packages are very similar to subpackages integrated into the kitchen
sdist.  This section just lists some of the differences to watch out for.
.SS setup.py
.sp
Your \fBsetup.py\fP should contain entries like this:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
# It\(aqs suggested to use a dotted name like this so the package is easily
# findable on pypi:
setup(name=\(aqkitchen.config\(aq,
    # Include kitchen in the keywords, again, for searching on pypi
    keywords=[\(aqkitchen\(aq, \(aqconfiguration\(aq],
    # This package lives in the directory kitchen/config
    packages=[\(aqkitchen.config\(aq],
    # [...]
)
.ft P
.fi
.UNINDENT
.UNINDENT
.SS Package directory layout
.sp
Create a \fBkitchen\fP directory in the toplevel.  Place the addon
subpackage in there.  For example:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
\&./                     <== toplevel with README, setup.py, NEWS, etc
kitchen/
kitchen/__init__.py
kitchen/config/        <== subpackage directory
kitchen/config/__init__.py
.ft P
.fi
.UNINDENT
.UNINDENT
.SS Fake kitchen module
.sp
The \fB__init__.py\fP in the \fBkitchen\fP directory is special.  It
won\(aqt be installed.  It just needs to pull in the kitchen from the system so
that you are able to test your module.  You should be able to use this
boilerplate:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
# Fake module.  This is not installed.  It\(aqs just made to import the real
# kitchen modules for testing this module
import pkgutil

# Extend the __path__ with everything in the real kitchen module
__path__ = pkgutil.extend_path(__path__, __name__)
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
\fBNOTE:\fP
.INDENT 0.0
.INDENT 3.5
\fBkitchen\fP needs to be findable by python for this to work.  Installing
it in the \fBsite\-packages\fP directory or adding it to the
\fBPYTHONPATH\fP will work.
.UNINDENT
.UNINDENT
.sp
Your unittests should now be able to find both your submodule and the main
kitchen module.
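.sp
The effect of the boilerplate can be seen in isolation with two throwaway
directory trees.  This is a minimal sketch for a current Python 3 interpreter
and is not part of kitchen.  The package name \fBdemo_ns\fP stands in for
\fBkitchen\fP, and \fBmake_package\fP is a hypothetical helper; one tree plays
the installed library and the other plays the addon checkout:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
import importlib
import os
import sys
import tempfile

BOILERPLATE = [
    "import pkgutil",
    "__path__ = pkgutil.extend_path(__path__, __name__)",
]


def make_package(root, *parts):
    # Create root/parts[0]/parts[1]/... with an __init__.py at each level
    path = root
    for depth, part in enumerate(parts):
        path = os.path.join(path, part)
        os.makedirs(path)
        with open(os.path.join(path, "__init__.py"), "w") as handle:
            if depth == 0:
                # Only the namespace level needs the boilerplate
                for line in BOILERPLATE:
                    print(line, file=handle)


# demo_ns stands in for kitchen; one tree plays the installed library
# and the other plays the addon checkout
installed = tempfile.mkdtemp()
checkout = tempfile.mkdtemp()
make_package(installed, "demo_ns", "text")
make_package(checkout, "demo_ns", "config")
sys.path[:0] = [checkout, installed]
importlib.invalidate_caches()

# Both imports succeed although the subpackages live in different trees
text = importlib.import_module("demo_ns.text")
config = importlib.import_module("demo_ns.config")
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
The copy of \fBdemo_ns\fP found first on \fBsys.path\fP extends its
\fB__path__\fP with the other copy, which is why both subpackages import.
The sketch leaves its temporary directories in place; a real test would
remove them afterwards.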
.SS Versioning
.sp
It is recommended that addon packages follow the scheme described in
\fI\%Versioning\fP\&.  The \fB__version_info__\fP and
\fB__version__\fP strings can be changed independently of the version
exposed by setup.py so that you have both an API version
(\fB__version_info__\fP) and a release version that\(aqs easier for people to
parse.  However, you aren\(aqt required to do this and you could follow
a different methodology if you want (for instance, \fI\%Kitchen versioning\fP)\&.
.SS Glossary
.INDENT 0.0
.TP
.B "Everything but the kitchen sink"
An English idiom meaning to include nearly everything that you can
think of.
.TP
.B API version
Version that is meant for computer consumption.  This version is
parsable and comparable by computers.  It contains information about
a library\(aqs API so that computer software can decide whether it works
with the software.
.TP
.B ASCII
A character encoding that maps numbers to characters essential to
American English.  It maps 128 characters using 7 bits.
.sp
\fBSEE ALSO:\fP
.INDENT 7.0
.INDENT 3.5
\fI\%http://en.wikipedia.org/wiki/ASCII\fP
.UNINDENT
.UNINDENT
.TP
.B ASCII compatible
An encoding in which the particular byte that maps to a character in
the \fI\%ASCII\fP character set is only used to map to that character.
This excludes EBCDIC based encodings and many multi\-byte fixed and
variable width encodings since they reuse the bytes that make up the
\fI\%ASCII\fP encoding for other purposes.  \fI\%UTF\-8\fP is notable
as a variable width encoding that is \fI\%ASCII\fP compatible.
.sp
\fBSEE ALSO:\fP
.INDENT 7.0
.INDENT 3.5
.INDENT 0.0
.TP
.B \fI\%http://en.wikipedia.org/wiki/Variable\-width_encoding\fP
For another explanation of various ways bytes are mapped to
characters in a possibly incompatible manner.
.UNINDENT
.UNINDENT
.UNINDENT
.TP
.B code points
\fI\%code point\fP
.TP
.B code point
A number that maps to a particular abstract character.  Code points
make it so that we have a number pointing to a character without
worrying about implementation details of how those numbers are stored
for the computer to read.  Encodings define how the code points map to
particular sequences of bytes on disk and in memory.
.TP
.B control characters
\fI\%control character\fP
.TP
.B control character
The set of characters in unicode that are used, not to display glyphs
on the screen, but to tell the displaying program to do something.
.sp
\fBSEE ALSO:\fP
.INDENT 7.0
.INDENT 3.5
\fI\%http://en.wikipedia.org/wiki/Control_character\fP
.UNINDENT
.UNINDENT
.TP
.B grapheme
Characters or pieces of characters that you might write on a page to
make words, sentences, or other pieces of text.
.sp
\fBSEE ALSO:\fP
.INDENT 7.0
.INDENT 3.5
\fI\%http://en.wikipedia.org/wiki/Grapheme\fP
.UNINDENT
.UNINDENT
.TP
.B I18N
I18N is an abbreviation for internationalization.  It\(aqs often used to
signify the need to translate words, number and date formats, and
other pieces of data in a computer program so that it will work well
for people who speak a language other than your own.
.TP
.B message catalogs
\fI\%message catalog\fP
.TP
.B message catalog
Message catalogs contain translations for user\-visible strings that
are present in your code.  Normally, you need to mark the strings to
be translated by wrapping them in one of several \fBgettext\fP
functions.  These functions serve two purposes:
.INDENT 7.0
.IP 1. 3
It allows automated tools to find which strings are supposed to be
extracted for translation.
.IP 2. 3
The functions perform the translation when the program is running.
.UNINDENT
.sp
\fBSEE ALSO:\fP
.INDENT 7.0
.INDENT 3.5
\fI\%babel\(aqs documentation\fP
.INDENT 0.0
.INDENT 3.5
for one method of extracting message catalogs from source
code.
.UNINDENT
.UNINDENT
.UNINDENT
.UNINDENT
.TP
.B Murphy\(aqs Law
"Anything that can go wrong, will go wrong."
.sp
\fBSEE ALSO:\fP
.INDENT 7.0
.INDENT 3.5
\fI\%http://en.wikipedia.org/wiki/Murphy%27s_Law\fP
.UNINDENT
.UNINDENT
.TP
.B release version
Version that is meant for human consumption.  This version is easy for
a human to look at to decide how a particular version relates to other
versions of the software.
.TP
.B textual width
The amount of horizontal space a character takes up on a monospaced
screen.  The unit is the number of character cells (columns) that the
character occupies.
.TP
.B UTF\-8
A character encoding that maps all unicode \fI\%code points\fP to a sequence
of bytes.  It is compatible with \fI\%ASCII\fP\&.  It uses a variable
number of bytes to encode all of unicode.  ASCII characters take one
byte.  Characters from other parts of unicode take two to four bytes.
It is widespread as an encoding on the internet and in Linux.
.UNINDENT
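.sp
The \fIASCII compatible\fP and \fIUTF\-8\fP entries can be illustrated with the
standard \fBencode\fP method of unicode strings.  This is a minimal sketch;
the sample code points are arbitrary choices for illustration, the
\fButf8_length\fP helper is a hypothetical name, and \fButf8\fP is the
Python codec alias for UTF\-8:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
try:
    unichr  # Python 2
except NameError:
    unichr = chr  # Python 3


def utf8_length(code_point):
    # Number of bytes the utf8 codec uses to store one code point
    return len(unichr(code_point).encode("utf8"))


# Code points for "A", e with acute accent, and the euro sign
lengths = [utf8_length(cp) for cp in (0x41, 0xE9, 0x20AC)]

# ASCII text encodes to the same bytes with either codec
sample = unichr(0x41) + unichr(0x42)
same_bytes = sample.encode("ascii") == sample.encode("utf8")
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Here \fBlengths\fP is \fB[1, 2, 3]\fP and \fBsame_bytes\fP is \fBTrue\fP\&.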
.SH PROJECT PAGES
.sp
More information about the project can be found on the \fI\%project webpage\fP\&.
.sp
The latest published version of this documentation can be found on the \fI\%documentation page\fP\&.
.SH COPYRIGHT
2016 Red Hat, Inc. and others
.\" Generated by docutils manpage writer.
.