In the previous notebook the Unicode string class str
was examined and was seen to have a Unicode character as its fundamental unit. The byte string class bytes
, on the other hand, uses a byte as its fundamental unit. The byte string was the foundation for text data in Python 2.
Categorize_Identifiers Module¶
This notebook will use the following functions dir2
, variables
and view
in the custom module categorize_identifiers
which is found in the same directory as this notebook file. dir2
is a variant of dir
that groups identifiers into a dict
under categories and variables
is an IPython-based variable inspector. view
is used to view a Collection
in more detail:
from categorize_identifiers import dir2, variables, view
Bytes Conception¶
A computer stores data using bits. A bit can be conceptualised as a single dip switch with the values Off and On as shown below. A single switch has the possible values 0
, 1
which is 2 ** 1
combinations, giving a total of 2
.
A single switch ranges between 0:2
. Since Python uses zero-order indexing, the lower bound 0
is included and the upper bound 2
is exclusive. i.e. up to and excluding 2
:
More typically 8
of these switches are combined into a single logical unit called a byte. A byte has 2 ** 8
combinations which is a total of 256
. i.e. a byte comprises 8 bits and has 0:256
combinations:
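These counts can be confirmed with a quick calculation:

```python
# Each switch (bit) doubles the number of combinations: 2 ** n_bits
for n_bits in (1, 8):
    print(f'{n_bits} bit(s): {2 ** n_bits} combinations')
```

This prints 2 combinations for a single bit and 256 for a full byte.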
Each dipswitch represents a power of 2
. The first dipswitch on the right hand side "8" represents the units (2 ** 0
), the second dipswitch from the right "7" represents the next power (2 ** 1
), the third dipswitch from the right "6" represents the next power (2 ** 2
) and so on… The number above can therefore be calculated as a decimal number using:
+ 0 * (2 ** 7) \
+ 1 * (2 ** 6) \
+ 1 * (2 ** 5) \
+ 0 * (2 ** 4) \
+ 1 * (2 ** 3) \
+ 0 * (2 ** 2) \
+ 0 * (2 ** 1) \
+ 0 * (2 ** 0)
104
The bytes
above can be expressed as a binary number using the prefix 0b
, this prefix is used to distinguish the base 2 from the base 10 which is used by default for an int
. Notice that syntax highlighting highlights the base 2 prefix. The decimal int
will be returned in the cell output:
0b01101000
104
Leading zeros are normally omitted:
0b1101000
104
Although in this context, it is useful to show the leading zeros, so all 8 bits in the byte can be visualised.
A bytes
instance is essentially a collection of individual bytes:
Each byte above can be represented as a binary number and grouped into a tuple
:
(0b01101000, 0b01100101, 0b01101100, 0b01101100, 0b01101111)
(104, 101, 108, 108, 111)
This tuple
collection can be cast into bytes
giving text information:
bytes((0b01101000, 0b01100101, 0b01101100, 0b01101100, 0b01101111))
b'hello'
If this bytes
instance is viewed, notice the value for each index is a byte which is represented by an int
in decimal:
view(bytes((0b01101000, 0b01100101, 0b01101100, 0b01101100, 0b01101111)))
Index Type Size Value 0 int 1 104 1 int 1 101 2 int 1 108 3 int 1 108 4 int 1 111
ASCII Characters¶
Recall that the American Standard Code for Information Interchange (ASCII) maps each byte to a physical command or English character. The physical commands were used to control primitive computers that were essentially typewriter based:
There are 128 ASCII values which span the first half of the byte; the first 33 of these are commands:
byte | hex | num | command |
---|---|---|---|
00000000 | 00 | 000 | null |
00000001 | 01 | 001 | start of heading |
00000010 | 02 | 002 | start of text |
00000011 | 03 | 003 | end of text |
00000100 | 04 | 004 | end of transmission |
00000101 | 05 | 005 | enquiry |
00000110 | 06 | 006 | acknowledge |
00000111 | 07 | 007 | bell |
00001000 | 08 | 008 | backspace |
00001001 | 09 | 009 | horizontal tab |
00001010 | 0a | 010 | new line |
00001011 | 0b | 011 | vertical tab |
00001100 | 0c | 012 | form feed |
00001101 | 0d | 013 | carriage return |
00001110 | 0e | 014 | shift out |
00001111 | 0f | 015 | shift in |
00010000 | 10 | 016 | data link escape |
00010001 | 11 | 017 | device control 1 |
00010010 | 12 | 018 | device control 2 |
00010011 | 13 | 019 | device control 3 |
00010100 | 14 | 020 | device control 4 |
00010101 | 15 | 021 | negative acknowledge |
00010110 | 16 | 022 | synchronous idle |
00010111 | 17 | 023 | end of transmission block |
00011000 | 18 | 024 | cancel |
00011001 | 19 | 025 | end of medium |
00011010 | 1a | 026 | substitute |
00011011 | 1b | 027 | escape |
00011100 | 1c | 028 | file separator |
00011101 | 1d | 029 | group separator |
00011110 | 1e | 030 | record separator |
00011111 | 1f | 031 | unit separator |
00100000 | 20 | 032 | space |
The remaining values, spanning up to half a byte, contain the characters most commonly used in the English language.
byte | hex | num | character |
---|---|---|---|
00100001 | 21 | 033 | ! |
00100010 | 22 | 034 | " |
00100011 | 23 | 035 | # |
00100100 | 24 | 036 | $ |
00100101 | 25 | 037 | % |
00100110 | 26 | 038 | & |
00100111 | 27 | 039 | ' |
00101000 | 28 | 040 | ( |
00101001 | 29 | 041 | ) |
00101010 | 2a | 042 | * |
00101011 | 2b | 043 | + |
00101100 | 2c | 044 | , |
00101101 | 2d | 045 | - |
00101110 | 2e | 046 | . |
00101111 | 2f | 047 | / |
00110000 | 30 | 048 | 0 |
00110001 | 31 | 049 | 1 |
00110010 | 32 | 050 | 2 |
00110011 | 33 | 051 | 3 |
00110100 | 34 | 052 | 4 |
00110101 | 35 | 053 | 5 |
00110110 | 36 | 054 | 6 |
00110111 | 37 | 055 | 7 |
00111000 | 38 | 056 | 8 |
00111001 | 39 | 057 | 9 |
00111010 | 3a | 058 | : |
00111011 | 3b | 059 | ; |
00111100 | 3c | 060 | < |
00111101 | 3d | 061 | = |
00111110 | 3e | 062 | > |
00111111 | 3f | 063 | ? |
01000000 | 40 | 064 | @ |
01000001 | 41 | 065 | A |
01000010 | 42 | 066 | B |
01000011 | 43 | 067 | C |
01000100 | 44 | 068 | D |
01000101 | 45 | 069 | E |
01000110 | 46 | 070 | F |
01000111 | 47 | 071 | G |
01001000 | 48 | 072 | H |
01001001 | 49 | 073 | I |
01001010 | 4a | 074 | J |
01001011 | 4b | 075 | K |
01001100 | 4c | 076 | L |
01001101 | 4d | 077 | M |
01001110 | 4e | 078 | N |
01001111 | 4f | 079 | O |
01010000 | 50 | 080 | P |
01010001 | 51 | 081 | Q |
01010010 | 52 | 082 | R |
01010011 | 53 | 083 | S |
01010100 | 54 | 084 | T |
01010101 | 55 | 085 | U |
01010110 | 56 | 086 | V |
01010111 | 57 | 087 | W |
01011000 | 58 | 088 | X |
01011001 | 59 | 089 | Y |
01011010 | 5a | 090 | Z |
01011011 | 5b | 091 | [ |
01011100 | 5c | 092 | \ |
01011101 | 5d | 093 | ] |
01011110 | 5e | 094 | ^ |
01011111 | 5f | 095 | _ |
01100000 | 60 | 096 | ` |
01100001 | 61 | 097 | a |
01100010 | 62 | 098 | b |
01100011 | 63 | 099 | c |
01100100 | 64 | 100 | d |
01100101 | 65 | 101 | e |
01100110 | 66 | 102 | f |
01100111 | 67 | 103 | g |
01101000 | 68 | 104 | h |
01101001 | 69 | 105 | i |
01101010 | 6a | 106 | j |
01101011 | 6b | 107 | k |
01101100 | 6c | 108 | l |
01101101 | 6d | 109 | m |
01101110 | 6e | 110 | n |
01101111 | 6f | 111 | o |
01110000 | 70 | 112 | p |
01110001 | 71 | 113 | q |
01110010 | 72 | 114 | r |
01110011 | 73 | 115 | s |
01110100 | 74 | 116 | t |
01110101 | 75 | 117 | u |
01110110 | 76 | 118 | v |
01110111 | 77 | 119 | w |
01111000 | 78 | 120 | x |
01111001 | 79 | 121 | y |
01111010 | 7a | 122 | z |
01111011 | 7b | 123 | { |
01111100 | 7c | 124 | | |
01111101 | 7d | 125 | } |
01111110 | 7e | 126 | ~ |
01111111 | 7f | 127 | delete |
Recall the string
module contains the printable ASCII characters:
import string
In the bytes
class, the translation table for an ASCII character is always the same. Therefore in the formal representation, instead of displaying the byte value for that character, the ASCII character itself is shown:
string.printable
'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'
Initialisation Signature¶
The initialisation signature for the bytes
class can be examined:
bytes?
Init signature: bytes(self, /, *args, **kwargs) Docstring: bytes(iterable_of_ints) -> bytes bytes(string, encoding[, errors]) -> bytes bytes(bytes_or_buffer) -> immutable copy of bytes_or_buffer bytes(int) -> bytes object of size given by the parameter initialized with null bytes bytes() -> empty bytes object Construct an immutable array of bytes from: - an iterable yielding integers in range(256) - a text string encoded using the specified encoding - any object implementing the buffer API. - an integer Type: type Subclasses: bytes_
For the bytes
string class, the initialisation signature shows 5 alternative ways of supplying instance data:
bytes(self, /, *args, **kwargs)
bytes(iterable_of_ints) -> bytes
bytes(string, encoding[, errors]) -> bytes
bytes(bytes_or_buffer) -> immutable copy of bytes_or_buffer
bytes(int) -> bytes object of size given by the parameter initialized with null bytes
bytes() -> empty bytes object
If the first way is examined:
bytes(self, /, *args, **kwargs)
- The parentheses ( ) are used to call the class and supply any necessary input arguments.
- The comma , is used as a delimiter to separate out any input arguments.
- self denotes this instance. In other words a byte string can be constructed from an existing byte string instance; this is a special case as the byte string is a fundamental datatype.
- Any input argument before a / must be provided positionally.
- *args indicates a variable number of additional positional input arguments. These are typically not used for initialisation of the bytes string class.
- **kwargs indicates a variable number of additional named input arguments. These are typically not used for initialisation of the bytes string class.
A bytes
instance can be instantiated by supplying an existing bytes
instance self
to the bytes
class:
bytes(b'Hello World!')
b'Hello World!'
However because the bytes
class is a fundamental datatype it can also be instantiated shorthand using the following:
b'Hello World!'
b'Hello World!'
All of the characters above in the bytes
instance are ASCII printable characters. Therefore each byte value in the bytes
instance above is represented by its corresponding ASCII character:
view(b'Hello World!')
Index Type Size Value 0 int 1 72 1 int 1 101 2 int 1 108 3 int 1 108 4 int 1 111 5 int 1 32 6 int 1 87 7 int 1 111 8 int 1 114 9 int 1 108 10 int 1 100 11 int 1 33
string.printable
'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'
A bytes
instance can be initialised using an iterable such as a tuple
of int
instances:
bytes(iterable_of_ints, /) -> bytes
For this to work each int
must be a valid byte value and recall that a byte looks like the following:
Since a byte has 2 ** 8
combinations, which is a total of 256
the range is 0:256
inclusive of the lower bound and exclusive of the upper bound. Therefore the maximum value for an int
instance is 255
. Note that a trailing comma is required to distinguish a single element tuple
from a numeric calculation using parenthesis:
num = (97)
archive = (97, )
variables()
Type | Size/Shape | Value | |
---|---|---|---|
Instance Name | |||
num | int | 97 | |
archive | tuple | 1 | (97,) |
From the above ASCII table the int
instance 97
corresponds to the character a
and this can be seen when a tuple
containing it is cast to bytes:
bytes((97, ))
b'a'
When an int
equals or exceeds the upper bound 256
(which is exclusive due to zero-order indexing) a ValueError
displays:
bytes((256, ))
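As a minimal sketch, the exception can be caught to confirm the valid range:

```python
# 256 is outside range(256), so the bytes constructor raises ValueError
try:
    bytes((256,))
except ValueError as error:
    print(error)
```

The highest valid element is 255, which corresponds to the byte 0b11111111.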
Normally the tuple
will contain more than one int
instance and each of these will be a valid byte value:
integers = (104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100, 33)
bytes(integers)
b'hello world!'
Recall that the decimal int
instance 104
can be represented in binary. The bin
function will return this binary number as a Unicode str
instance:
bin(104)
'0b1101000'
For conceptual clarity it is helpful to see the leading zeros using the str
instance methods removeprefix
and zfill
, alongside str
instance concatenation:
'0b' + bin(104).removeprefix('0b').zfill(8)
'0b01101000'
The Unicode str
instance corresponding to this byte can be retrieved using the chr
function:
chr(104)
'h'
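The ord function performs the inverse mapping of chr, returning the int code for a single character:

```python
# ord maps a single character to its integer code; chr is its inverse
print(ord('h'))       # 104
print(chr(ord('h')))  # h
```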
This bytes
instance consists of 12
individual byte units. The different representation for each byte unit can be examined below:
for number in integers:
print(str(number).center(8), end=' ')
print()
for number in integers:
print(bin(number).removeprefix('0b').zfill(8), end=' ')
print()
for number in integers:
print(chr(number).center(8), end=' ')
104 101 108 108 111 32 119 111 114 108 100 33 01101000 01100101 01101100 01101100 01101111 00100000 01110111 01101111 01110010 01101100 01100100 00100001 h e l l o w o r l d !
When every byte is a printable ASCII character it will be displayed instead of the byte sequence:
bytes(integers)
b'hello world!'
When the byte is not printable, for example the int
instance 0
which corresponds to the ASCII non-printable command NULL, is represented using a hexadecimal escape sequence:
bytes((0,))
b'\x00'
All the whitespace characters, with the exception of the space, are represented using an escape character. For the commonly used tab, newline and carriage return characters, there are the escape characters \t
(decimal 9), \n
(decimal 10) and \r
(decimal 13). The vertical tab and form feed are less commonly used and represented by their hexadecimal escape sequences \x0b
(decimal 11) and \x0c
(decimal 12):
string.whitespace
' \t\n\r\x0b\x0c'
integers = (9, 10, 11, 12, 13)
bytes(integers)
b'\t\n\x0b\x0c\r'
Binary 0b00001100
is not very human-readable and therefore it is easy for a human to make transcription errors when dealing with binary. To make a byte more human readable the hexadecimal number system is introduced. In hexadecimal the byte is essentially split into 2 halves and each half byte is represented as a hexadecimal character:
Recall binary has 2
digits and the prefix 0b
, decimal has 10
digits and no prefix because it is the most commonly used numbering system. Hexadecimal has 16
digits and the prefix 0x
.
Hexadecimal takes the first 10
digits from decimal and supplements them with the first 6
letters in the alphabet. The number of combinations in half a byte is:
2 ** 4
16
As a consequence each hexadecimal character perfectly maps to a 4 bit (half a byte) binary sequence:
(0b) binary | (0x) hexadecimal character | decimal character |
---|---|---|
0000 | 0 | 0 |
0001 | 1 | 1 |
0010 | 2 | 2 |
0011 | 3 | 3 |
0100 | 4 | 4 |
0101 | 5 | 5 |
0110 | 6 | 6 |
0111 | 7 | 7 |
1000 | 8 | 8 |
1001 | 9 | 9 |
1010 | a | 10 |
1011 | b | 11 |
1100 | c | 12 |
1101 | d | 13 |
1110 | e | 14 |
1111 | f | 15 |
Although uppercase and lowercase can be used to represent a hexadecimal character, notice that the Python interpreter prefers lowercase:
integers = (11, 12)
bytes(integers)
b'\x0b\x0c'
A human is more likely to make a transcription error when reading uppercase hexadecimal sequences. For example:
'ABB4AB8A'
when reading the above quickly, notice the similarity between A and 4 and B and 8. The lowercase characters are more clearly distinguished:
'abb4ab8a'
The following bytes
instance is the binary number:
0b00001100
12
The value returned in the cell output displays the decimal integer. The hex
function can be used to cast a decimal int
into a Unicode str
of a hexadecimal character:
hex(0b00001100)
'0xc'
The hexadecimal value is displayed without the leading zero, so the first half byte is not shown. This can be added for clarity:
'0x' + hex(12).removeprefix('0x').zfill(2)
'0x0c'
'0b' + bin(12).removeprefix('0b').zfill(8)
'0b00001100'
And for clarity:
print(bin(12).removeprefix('0b').zfill(8)[:4], bin(12).removeprefix('0b').zfill(8)[4:])
print(hex(12).removeprefix('0x').zfill(2)[:1].center(4), hex(12).removeprefix('0x').zfill(2)[1:].center(4))
0000 1100 0 c
When the byte sequence contains a byte that maps to a whitespace character, a non-printable command or is unmapped to an ASCII character there is no corresponding character to display and an escape sequence instead displays. This can be seen when using the first 32
integers and integers above 127
:
integers = (0, 1, 2, 29, 30, 31, 128, 129, 130, 253, 254, 255)
bytes(integers)
b'\x00\x01\x02\x1d\x1e\x1f\x80\x81\x82\xfd\xfe\xff'
Notice that each character is inserted using its own hexadecimal escape sequence prefix \x
and this instruction expects two hexadecimal characters so a leading zero must be included where applicable.
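As a sketch of an alternative, the same two-character hexadecimal form can be produced with an f-string format specification:

```python
# The format spec 02x pads each value to two lowercase hex digits,
# matching the \x escape sequences in the bytes representation
for value in (0, 31, 128, 255):
    print(rf'\x{value:02x}', end=' ')  # \x00 \x1f \x80 \xff
```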
When a bytes
instance contains byte sequences that map to characters these characters will be displayed instead of the hexadecimal escape sequence:
integers = (32, 65, 80, 120)
bytes(integers)
b' APx'
The tab \t
, newline \n
, carriage return \r
and backslash \\
itself all have single escape characters (the \
is shown as \\
as the \
is used to insert an escape character):
integers = (9, 10, 13, 92)
bytes(integers)
b'\t\n\r\\'
A bytes
instance containing all of these initially appears confusing:
integers = (0, 1, 2, 9, 10, 13, 29, 30, 31, 32, 65, 92, 97, 128, 129, 130, 253, 254, 255)
bytes_string = bytes(integers)
variables()
Type | Size/Shape | Value | |
---|---|---|---|
Instance Name | |||
num | int | 97 | |
archive | tuple | 1 | (97,) |
integers | tuple | 19 | (0, 1, 2, 9, 10, 13, 29, 30, 31, 32, 65, 92, 97, 128, 129, 130, 253, 254, 255) |
number | int | 33 | |
bytes_string | bytes | 19 | b'\x00\x01\x02\t\n\r\x1d\x1e\x1f A\\a\x80\x81\x82\xfd\xfe\xff' |
view(bytes_string)
Index Type Size Value 0 int 1 0 1 int 1 1 2 int 1 2 3 int 1 9 4 int 1 10 5 int 1 13 6 int 1 29 7 int 1 30 8 int 1 31 9 int 1 32 10 int 1 65 11 int 1 92 12 int 1 97 13 int 1 128 14 int 1 129 15 int 1 130 16 int 1 253 17 int 1 254 18 int 1 255
For this reason the bytes
class has the method hex
which returns a Unicode str
instance of the hexadecimal values without any of the escape sequences:
bytes_string.hex()
'000102090a0d1d1e1f20415c61808182fdfeff'
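From Python 3.8 onwards the hex method also accepts an optional separator, which makes the byte boundaries easier to read:

```python
# A separator string groups the hexadecimal output per byte (Python 3.8+)
print(b'\x00\x01\x02'.hex(' '))  # 00 01 02
print(b'hello'.hex('-'))         # 68-65-6c-6c-6f
```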
Note that the bytes
class's hex
method differs from the builtins
function hex
which provides a 0x
prefix:
bytes((12, )).hex()
'0c'
hex(12)
'0xc'
The builtins
function hex
can process a single large integer that exceeds 1 byte:
hex(256)
'0x100'
Whereas the bytes
class's hex
method processes multiple integers that are within the constraints of a byte:
bytes((12, 34)).hex()
'0c22'
The following bytes
instance can be represented as a str
of hexadecimal characters with 2 hexadecimal characters for each byte using the bytes
class method hex
:
b'hello'.hex()
'68656c6c6f'
Notice when each of the ASCII characters is supplied using a hexadecimal escape sequence, the default representation simplifies the output displaying the ASCII character:
b'\x68\x65\x6c\x6c\x6f'
b'hello'
The bytes
class has the class method fromhex
which is an alternative constructor to create a bytes
instance from a Unicode str
instance of hexadecimal characters:
bytes.fromhex('68656c6c6f')
b'hello'
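The two methods are inverses; as a small sketch, a round trip returns the original bytes instance, and fromhex also permits spaces between byte pairs:

```python
# fromhex reverses hex; spaces between byte pairs are ignored
data = b'hello'
assert bytes.fromhex(data.hex()) == data
print(bytes.fromhex('68 65 6c 6c 6f'))  # b'hello'
```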
The different ways of representing each byte in the bytes
instance can be examined using:
integers = (104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100, 33)
bytes(integers)
for number in integers:
print(chr(number).center(8), end=' ')
print()
for number in integers:
print(str(number).center(8), end=' ')
print()
for number in integers:
print(bin(number).removeprefix('0b').zfill(8), end=' ')
print()
for number in integers:
print((hex(number).removeprefix('0x')).center(8), end=' ')
print()
for number in integers:
print((r'0x' + hex(number).removeprefix('0x')).center(8), end=' ')
print()
for number in integers:
print((r'\x' + hex(number).removeprefix('0x')).center(8), end=' ')
h e l l o w o r l d ! 104 101 108 108 111 32 119 111 114 108 100 33 01101000 01100101 01101100 01101100 01101111 00100000 01110111 01101111 01110010 01101100 01100100 00100001 68 65 6c 6c 6f 20 77 6f 72 6c 64 21 0x68 0x65 0x6c 0x6c 0x6f 0x20 0x77 0x6f 0x72 0x6c 0x64 0x21 \x68 \x65 \x6c \x6c \x6f \x20 \x77 \x6f \x72 \x6c \x64 \x21
A bytes
instance can be instantiated from a Unicode str
however the named parameter encoding
needs to be supplied, which gives the instructions to encode a Unicode character outside the ASCII range. When the simplest encoding 'ascii'
is supplied all characters in the Unicode str
must be within the ASCII range. Generally the current standard 'utf-8'
is used which is adaptable and encodes each Unicode character in the Unicode str
to 1-4 bytes in the returned bytes
instance:
bytes(string, /, encoding[, errors]) -> bytes
bytes('hello', encoding='ascii')
b'hello'
bytes('hello', encoding='utf-8')
b'hello'
Not supplying encoding
gives a TypeError
:
bytes('hello')
Supplying a Unicode character that is not ASCII while specifying ASCII encoding will give a UnicodeEncodeError
:
bytes('α', encoding='ascii')
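The optional errors parameter from the initialisation signature controls how unencodable characters are handled; a sketch of two common policies:

```python
# 'replace' substitutes ? for each unencodable character,
# 'ignore' silently drops unencodable characters
print(bytes('αβc', encoding='ascii', errors='replace'))  # b'??c'
print(bytes('αβc', encoding='ascii', errors='ignore'))   # b'c'
```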
A bytes
instance can be cast from an existing bytes
instance:
bytes(bytes_or_buffer, /) -> immutable copy of bytes_or_buffer
A bytes
instance can be instantiated by casting a bytearray
. The bytearray
is the mutable counterpart to the bytes
class:
bytearray_string = bytearray(b'hello')
bytes_string = bytes(bytearray_string)
variables()
Type | Size/Shape | Value | |
---|---|---|---|
Instance Name | |||
num | int | 97 | |
archive | tuple | 1 | (97,) |
integers | tuple | 12 | (104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100, 33) |
number | int | 33 | |
bytes_string | bytes | 5 | b'hello' |
bytearray_string | bytearray | 5 | bytearray(b'hello') |
A NULL bytes
instance can also be initialised from an int
. The int
is used to specify the number of NULL bytes
and is not cast into an individual byte as seen when an int
is provided via a tuple
:
bytes(int, /) -> bytes object of size given by the parameter initialized with null bytes
For example a bytes
instance occupying 1 byte can be instantiated:
bytes(1)
b'\x00'
And another one occupying 4 bytes can be instantiated:
bytes(4)
b'\x00\x00\x00\x00'
Using the bytes
class without providing any instantiation data will create an empty bytes
instance:
bytes() -> empty bytes object
bytes()
b''
Initialising data and then populating is more commonly used for mutable datatypes. For an immutable datatype, the instance cannot be modified and the instance name instead gets reassigned to a new instance.
Encoding and Decoding¶
When a bytes
instance was instantiated from a Unicode str
instance, an encoding translation table was selected; that is, a table that maps a byte sequence to a specific character. There have been many encoding standards developed throughout the years and by default the current standard 'utf-8'
should be used. The Unicode str
class is always 'utf-8'
and as a consequence is far easier to use than the bytes
class for most text applications:
encoding | bytes per character | bits per character | byte order | byte order marker BOM |
---|---|---|---|---|
'utf-8' | 1, 2, 3, 4 | 8, 16, 24, 32 | big endian | |
'utf-8-sig' | 1, 2, 3, 4 | 8, 16, 24, 32 | big endian | efbbbf |
'utf-32' | 4 | 32 | little endian | fffe0000 |
'utf-32-le' | 4 | 32 | little endian | |
'utf-32-be' | 4 | 32 | big endian | |
'utf-16' | 2 | 16 | little endian | fffe |
'utf-16-le' | 2 | 16 | little endian | |
'utf-16-be' | 2 | 16 | big endian | |
'latin1' | 1 | 8 | ||
'ascii' | 1 | 8 |
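The byte order markers in the table above can be verified by encoding an empty Unicode str with each scheme; only the BOM (if any) remains:

```python
# Encoding an empty str leaves only the byte order marker, if the
# encoding scheme uses one
for encoding in ('utf-8', 'utf-8-sig', 'utf-16', 'utf-16-le', 'utf-32'):
    print(f'{encoding}: {bytes("", encoding=encoding).hex()}')
```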
ASCII¶
'ascii'
is the most basic translation table and each character is encoded over 1
byte using only half of the possible values. ASCII is restricted to a small subset of English characters:
a_ascii = b'a'
a_ascii
b'a'
hex(ord('a'))
'0x61'
The 'ascii'
encoding scheme was originally developed using 7 bits with:
2 ** 7
128
for this reason the ASCII values span over the range 0:128
(up to and excluding the upper bound of 128
). This covers half the possible values of a byte:
2 ** 8
256
Extended ASCII Variants¶
In the 1990s there were numerous regional translation tables which mapped the second half of the byte to regional characters.
In the UK, 'latin1'
was used which includes the £
sign:
gb = bytes('£123.45', encoding='latin1')
gb
b'\xa3123.45'
int('0xa3', base=16)
163
gb.decode(encoding='latin1')
'£123.45'
This regional encoding scheme spanned over the full byte allowing the commonly used regional characters.
The problem with early regional encoding was that operating systems and browsers were often configured to use a regional encoding scheme that differed from the encoding scheme the content itself was written in and as a result non-ASCII characters were often incorrectly substituted. This can be seen for example by decoding the bytes
instance above which was originally encoded in 'latin1'
with 'latin2'
, 'latin3'
, 'greek'
and 'cyrillic'
:
gb.decode(encoding='latin2')
'Ł123.45'
gb.decode(encoding='latin3')
'£123.45'
gb.decode(encoding='greek')
'£123.45'
gb.decode(encoding='cyrillic')
'Ѓ123.45'
All of these formats should be considered as legacy formats.
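This substitution behaviour can be reproduced in a loop over the legacy codec names used above:

```python
# 0xa3 maps to a different regional character in each legacy table
gb = bytes('£123.45', encoding='latin1')
for encoding in ('latin1', 'latin2', 'greek', 'cyrillic'):
    print(f'{encoding}: {gb.decode(encoding=encoding)}')
```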
UTF-16¶
The Unicode Transformation Format 'utf-16'
was a previous standard where each character occupied 2
bytes which is 2 * 8
bits and is where the name 16
comes from. Using 2 bytes instead of 1 byte per character increases the number of possible combinations to:
2 ** 16
65536
When we count using numbers we use big endian, for example the number twelve is represented using two decimal digits:
12
This is big endian and the most significant digit 1 which corresponds to 10 is stated first followed by the digit 2 which corresponds to 2 units.
This number twelve could also be represented in little endian:
21
In little endian, the least significant digit, the digit 2 which corresponds to 2 units is stated first followed by the most significant digit 1 which corresponds to 10.
When the ASCII character a
is encoded using the 'ascii'
translation table it occupies a single byte, which recall is represented using two hexadecimal characters:
bytes('\x61', encoding='ascii')
b'a'
For 'utf-16'
, each character must occupy two bytes. For an ASCII character the single byte that was encoded in 'ascii'
is taken as the least significant byte and is accompanied by the NULL byte which acts as the most significant byte. In big endian the most significant byte is placed first followed by the least significant byte:
b'\x00\x61'.hex()
'0061'
In little endian the least significant byte is instead placed first, followed by the most significant byte:
b'\x61\x00'.hex()
'6100'
If the two byte instances are examined, the default representation assumes the bytes
instance is using 'ascii'
encoding and so the NULL byte displays as an escape character and the 'ascii'
character displays:
b'\x00\x61' # big endian
b'\x00a'
b'\x61\x00' # little endian
b'a\x00'
If the Unicode str
instance 'abc'
is examined and encoded in 'utf-16-be'
:
bytes('abc', encoding='utf-16-be')
b'\x00a\x00b\x00c'
Which looks like the following when the bytes corresponding to ASCII characters aren't processed:
b'\x00\x61\x00\x62\x00\x63' # big endian
b'\x00a\x00b\x00c'
If the Unicode str
instance 'abc'
is instead encoded in 'utf-16-le'
:
bytes('abc', encoding='utf-16-le')
b'a\x00b\x00c\x00'
b'\x61\x00\x62\x00\x63\x00' # little endian
b'a\x00b\x00c\x00'
When 'utf-16'
was introduced there was a deviation in the way processors handled characters that spanned over multiple bytes. Some processors used big endian and others used little endian. Intel, the most dominant processor manufacturer at the time, favoured little endian. As there was confusion between the two variants of 'utf-16'
, Microsoft favoured addition of a Byte Order Marker (BOM). The BOM is at the start of the bytes
instance and like every character in 'utf-16'
will span over two bytes (4 hexadecimal characters):
bytes('abc', encoding='utf-16-le').hex()
'610062006300'
bytes('abc', encoding='utf-16').hex()
'fffe610062006300'
The BOM can be examined by casting an empty str
instance:
bytes('', encoding='utf-16')
b'\xff\xfe'
bytes('', encoding='utf-16').hex()
'fffe'
The str
instance corresponding to the Greek letter alpha can be encoded in 'utf-16-le'
:
alpha_be = bytes('α', encoding='utf-16-le')
The hexadecimal values can be examined:
alpha_be.hex()
'b103'
alpha_be
b'\xb1\x03'
This bytes
instance can be decoded back to the original str
instance using the correct encoding:
alpha_be.decode(encoding='utf-16-le')
'α'
If the incorrect decoding is used the wrong character is selected:
alpha_be.decode(encoding='utf-16-be')
'넃'
This is equivalent to:
b'\x03\xb1'.decode(encoding='utf-16-le')
'넃'
If a single byte encoding is used, each byte will be represented as a different character. One of the characters is a non-printable ASCII character so displays as \x03
:
alpha_be.decode(encoding='latin1')
'±\x03'
With 16 bits there are:
2 ** 16
65536
combinations. This is not enough to cover the characters from all the languages in the world.
UTF-32¶
Therefore 'utf-32'
was developed which spans over 32 bits which is 4 bytes:
2 ** 32
4294967296
Like 'utf-16'
there are BOM variations:
bytes('\x61', encoding='utf-32-be'), bytes('\x61', encoding='utf-32-be').hex()
(b'\x00\x00\x00a', '00000061')
bytes('\x61', encoding='utf-32-le'), bytes('\x61', encoding='utf-32-le').hex()
(b'a\x00\x00\x00', '61000000')
bytes('\x61', encoding='utf-32-be'), bytes('\x61', encoding='utf-32').hex()
(b'\x00\x00\x00a', 'fffe000061000000')
This gives groupings of 4 bytes, which is 8 hexadecimal characters:
word = 'abαβ悤悥🦒🦓'
be = bytes(word, encoding='utf-32-be').hex()
le = bytes(word, encoding='utf-32-le').hex()
bom_le = bytes(word, encoding='utf-32').hex()
print('char', end=': ')
for i in word:
print(i, end=' ')
print()
print('utf-32-be', end=': ')
for i in range(0, len(be), 8):
print(be [i:i+8], end=' ')
print()
print('utf-32-le', end=': ')
for i in range(0, len(le), 8):
print(le [i:i+8], end=' ')
print()
print('utf-32', end=': ')
for i in range(0, len(bom_le), 8):
print(bom_le [i:i+8], end=' ')
char: a b α β 悤 悥 🦒 🦓 utf-32-be: 00000061 00000062 000003b1 000003b2 000060a4 000060a5 0001f992 0001f993 utf-32-le: 61000000 62000000 b1030000 b2030000 a4600000 a5600000 92f90100 93f90100 utf-32: fffe0000 61000000 62000000 b1030000 b2030000 a4600000 a5600000 92f90100 93f90100
UTF-8¶
The main drawback of 'utf-32'
is that it requires a lot more memory per character, has byte order issues and each ASCII character needs to be accompanied by 3 NULL bytes. The current standard 'utf-8'
was developed as an adaptable format; characters span over 1-4 bytes and the byte order is always big endian:
print('1 byte:', end=' ')
for unicode_char in 'abcde':
print(bytes(unicode_char, encoding='utf-8').hex(), end=' ')
print()
print('2 bytes:', end=' ')
for unicode_char in 'αβγδε':
print(bytes(unicode_char, encoding='utf-8').hex(), end=' ')
print()
print('3 bytes:', end=' ')
for unicode_char in '悤悥悦悧您':
print(bytes(unicode_char, encoding='utf-8').hex(), end=' ')
print()
print('4 bytes:', end=' ')
for unicode_char in '🦒🦓🦔🦕🦖':
print(bytes(unicode_char, encoding='utf-8').hex(), end=' ')
1 byte: 61 62 63 64 65 2 bytes: ceb1 ceb2 ceb3 ceb4 ceb5 3 bytes: e682a4 e682a5 e682a6 e682a7 e682a8 4 bytes: f09fa692 f09fa693 f09fa694 f09fa695 f09fa696
Generally:
- 1 byte is the 'ascii' subset.
- 2 bytes are used for extended European characters. 'utf-16' is not a subset as 'utf-8' uses a byte pattern that differs from 'utf-16'.
- 3 bytes are used for additional languages.
- 4 bytes are used for emojis. 'utf-32' is not a subset as 'utf-8' uses a byte pattern that differs from 'utf-32'.
Under the hood the start of the first byte is used to identify whether 1, 2, 3 or 4 bytes are used to encode a character:
1 byte: 0XXXXXXX (ASCII)
2 bytes: 110XXXXX 10XXXXXX
3 bytes: 1110XXXX 10XXXXXX 10XXXXXX
4 bytes: 11110XXX 10XXXXXX 10XXXXXX 10XXXXXX
4 of the example characters above can be cast into binary and seen to follow the above pattern:
('1 byte', bin(0x61).removeprefix('0b').zfill(1 * 8))
('1 byte', '01100001')
('2 bytes', bin(0xceb1).removeprefix('0b').zfill(2 * 8))
('2 bytes', '1100111010110001')
('3 bytes', bin(0xe682a4).removeprefix('0b').zfill(3 * 8))
('3 bytes', '111001101000001010100100')
('4 bytes', bin(0xf09fa692).removeprefix('0b').zfill(4 * 8))
('4 bytes', '11110000100111111010011010010010')
UTF-8-Sig¶
Since 'utf-8'
is always big endian and characters encoded over multiple bytes have a distinctive byte pattern, there is generally no need for a BOM.
Despite 'utf-8'
not requiring a BOM, Microsoft often include one in their products using the variation 'utf-8-sig'
and it may therefore be seen in data exported from popular Microsoft applications such as Notepad or Excel. The BOM can be seen by comparing the casting of an empty Unicode string to a byte string using 'utf-8'
and 'utf-8-sig'
respectively:
bytes('', encoding='utf-8').hex()
''
bytes('', encoding='utf-8-sig').hex()
'efbbbf'
'utf-8' is the current standard and should be used by default. The Unicode string str class is locked to 'utf-8' and is much easier to work with, as there is no need to worry about the encoding.
When decoding byte data from another source, 'utf-8' should be tried by default. If an unwanted BOM appears at the start of the decoded text, the data was probably produced by a Microsoft product using 'utf-8-sig'.
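A minimal sketch of this behaviour, using a hand-constructed byte string with a BOM:

```python
# Prepend the 3 byte BOM to some UTF-8 encoded text:
data = bytes.fromhex('efbbbf') + 'hello'.encode('utf-8')

# Decoding with 'utf-8' leaves the BOM as '\ufeff' at the start:
print(repr(data.decode('utf-8')))      # '\ufeffhello'

# Decoding with 'utf-8-sig' strips the BOM:
print(repr(data.decode('utf-8-sig')))  # 'hello'
```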
Hardware¶
The bytes class was the main text class in Python 2. As 'utf-8' became widely adopted as an encoding standard, the text datatype was redeveloped so that a Unicode character, rather than a byte, is the fundamental unit. Python 3 introduced major changes over Python 2, in particular to the default text class: the str class is the default in Python 3 and should be used in most applications, while the older bytes class is still used when communicating directly with hardware. The bytes instance below can be transmitted over a serial port:
for number in b'hello world!':
print(bin(number).removeprefix('0b').zfill(8), end='')
011010000110010101101100011011000110111100100000011101110110111101110010011011000110010000100001
A serial port, for example, has a signal pin configured to transmit or receive a digital time trace. A baud rate of 9600 means 9600 bits are processed per second. When a bit is 0, the voltage on the signal pin is LOW and when a bit is 1 the voltage is HIGH.
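As a rough worked example, assuming the common 8N1 serial framing (each byte is sent with 1 start bit and 1 stop bit, an assumption not stated above):

```python
baud_rate = 9600             # bits processed per second
bits_per_frame = 1 + 8 + 1   # start bit + 8 data bits + stop bit (8N1)

bytes_per_second = baud_rate // bits_per_frame
print(bytes_per_second)      # 960
```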
bytes instances are therefore still used when directly interfacing with hardware, for example an Arduino. In such applications it is recommended to decode a bytes instance to a Unicode str instance as early as possible in a Python program, and to cast the str instance back to a bytes instance as late as possible before transmitting, reducing the possibility of encoding issues.
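This decode-early, encode-late pattern can be sketched without real hardware; receive_bytes and send_bytes below are hypothetical stand-ins for serial I/O such as pyserial's read and write:

```python
def receive_bytes():
    # Hypothetical stand-in for e.g. serial_port.readline()
    return b'temperature=21.5\n'

def send_bytes(payload):
    # Hypothetical stand-in for e.g. serial_port.write(payload)
    assert isinstance(payload, bytes)

# Decode to str as early as possible:
text = receive_bytes().decode('utf-8')

# Work with str for the bulk of the program:
reply = text.strip().upper()

# Encode back to bytes as late as possible, just before transmission:
send_bytes(reply.encode('utf-8'))
print(reply)  # TEMPERATURE=21.5
```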
Bytes Identifiers¶
If the identifiers of the bytes
class and the str
class are examined, they are seen to be largely consistent:
dir2(bytes, object, unique_only=True)
{'method': ['capitalize', 'center', 'count', 'decode', 'endswith', 'expandtabs', 'find', 'fromhex', 'hex', 'index', 'isalnum', 'isalpha', 'isascii', 'isdigit', 'islower', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'maketrans', 'partition', 'removeprefix', 'removesuffix', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill'], 'datamodel_method': ['__add__', '__buffer__', '__bytes__', '__contains__', '__getitem__', '__getnewargs__', '__iter__', '__len__', '__mod__', '__mul__', '__rmod__', '__rmul__']}
dir2(str, object, unique_only=True)
{'method': ['capitalize', 'casefold', 'center', 'count', 'encode', 'endswith', 'expandtabs', 'find', 'format', 'format_map', 'index', 'isalnum', 'isalpha', 'isascii', 'isdecimal', 'isdigit', 'isidentifier', 'islower', 'isnumeric', 'isprintable', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'maketrans', 'partition', 'removeprefix', 'removesuffix', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill'], 'datamodel_method': ['__add__', '__contains__', '__getitem__', '__getnewargs__', '__iter__', '__len__', '__mod__', '__mul__', '__rmod__', '__rmul__']}
The unique identifiers can be examined for each class. The str
method 'encode'
casts a str
instance to a bytes
instance using a specified translation table to encode with. The bytes
method 'decode'
casts a bytes
instance to a str
instance using a specified translation table to decode with. The class method fromhex
and the instance method hex
can be used to cast from a Unicode str
instance of hexadecimal characters and to return
a Unicode str
instance of hexadecimal characters respectively:
dir2(bytes, str, unique_only=True)
{'method': ['decode', 'fromhex', 'hex'], 'datamodel_method': ['__buffer__', '__bytes__']}
dir2(str, bytes, unique_only=True)
{'method': ['casefold', 'encode', 'format', 'format_map', 'isdecimal', 'isidentifier', 'isnumeric', 'isprintable']}
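These unique methods round-trip between the two classes:

```python
s = 'αβ'

b = s.encode('utf-8')    # str -> bytes
print(b)                 # b'\xce\xb1\xce\xb2'

h = b.hex()              # bytes -> str of hexadecimal characters
print(h)                 # ceb1ceb2

b2 = bytes.fromhex(h)    # str of hexadecimal characters -> bytes
s2 = b2.decode('utf-8')  # bytes -> str
print(s2 == s)           # True
```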
Some str
methods related to formatting and methods which check groupings of Unicode characters don't have a counterpart available in the bytes
class.
The shared identifiers behave consistently between the two classes. Counterparts to str methods that return a str instance will instead return a bytes instance, or a single byte, which recall is an int in the range 0:256. Suppose the following four instances are instantiated:
english_b = b'abcde'
greek_b = bytes('αβγδε', encoding='UTF-8')
english_s = 'abcde'
greek_s = 'αβγδε'
The variables can be viewed:
variables(['english_b', 'english_s', 'greek_b', 'greek_s'])
Type | Size/Shape | Value | |
---|---|---|---|
Instance Name | |||
english_b | bytes | 5 | b'abcde' |
english_s | str | 5 | abcde |
greek_b | bytes | 10 | b'\xce\xb1\xce\xb2\xce\xb3\xce\xb4\xce\xb5' |
greek_s | str | 5 | αβγδε |
And examined:
view(english_b)
Index Type Size Value 0 int 1 97 1 int 1 98 2 int 1 99 3 int 1 100 4 int 1 101
view(english_s)
Index Type Size Value 0 str 1 a 1 str 1 b 2 str 1 c 3 str 1 d 4 str 1 e
view(greek_b)
Index Type Size Value 0 int 1 206 1 int 1 177 2 int 1 206 3 int 1 178 4 int 1 206 5 int 1 179 6 int 1 206 7 int 1 180 8 int 1 206 9 int 1 181
view(greek_s)
Index Type Size Value 0 str 1 α 1 str 1 β 2 str 1 γ 3 str 1 δ 4 str 1 ε
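The shared methods can be tried on instances like these (redefined below so the snippet is self-contained); on the bytes versions they take and return bytes, note the b prefix on the arguments:

```python
english_b = b'abcde'
english_s = 'abcde'

print(english_s.upper())              # ABCDE
print(english_b.upper())              # b'ABCDE'

print(english_s.replace('c', 'x'))    # abxde
print(english_b.replace(b'c', b'x'))  # b'abxde'

print(english_s.split('c'))           # ['ab', 'de']
print(english_b.split(b'c'))          # [b'ab', b'de']
```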
Notice that the len of english_s and greek_s is consistently 5, as each contains 5 Unicode characters:
len(english_s)
5
len(greek_s)
5
Notice that the len of english_b is also 5, as each ASCII character spans 1 byte; however greek_b has a length of 10, as each Greek character spans 2 bytes:
len(english_b)
5
len(greek_b)
10
When indexed, a str instance returns the Unicode character at that index. A bytes instance on the other hand returns the byte, in the form of an int:
greek_s[0]
'α'
greek_b[0]
206
greek_b.hex()
'ceb1ceb2ceb3ceb4ceb5'
0xce
206
Slicing will instead return a bytes
instance. The difference can be seen when 1
byte is selected from a slice:
greek_b[:1]
b'\xce'
The slicing syntax is otherwise consistent with the str class:
greek_b[:2:]
b'\xce\xb1'
greek_b[::2]
b'\xce\xce\xce\xce\xce'
greek_b[1::2]
b'\xb1\xb2\xb3\xb4\xb5'
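Since each Greek character above spans 2 bytes, stepping through greek_b in consecutive 2 byte slices recovers the individual characters:

```python
greek_b = bytes('αβγδε', encoding='utf-8')

# Each consecutive 2 byte slice is one encoded character:
chars = [greek_b[i:i + 2] for i in range(0, len(greek_b), 2)]
print(chars)  # [b'\xce\xb1', b'\xce\xb2', b'\xce\xb3', b'\xce\xb4', b'\xce\xb5']

# Each slice decodes back to a single Unicode character:
print([c.decode('utf-8') for c in chars])  # ['α', 'β', 'γ', 'δ', 'ε']
```

Note this relies on knowing in advance that every character in the string is 2 bytes wide; decoding the whole bytes instance with `decode('utf-8')` is the general approach.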