In the previous notebook the Unicode string class str
was examined and was seen to have a Unicode character as its fundamental unit. The byte string class bytes
, on the other hand, uses a byte as its fundamental unit. The byte string was the foundation for text data in Python 2.
Categorize_Identifiers Module¶
This notebook will use the following functions dir2
, variables
and view
in the custom module categorize_identifiers
which is found in the same directory as this notebook file. dir2
is a variant of dir
that groups identifiers into a dict
under categories and variables
is an IPython-based variable inspector. view
is used to view a Collection
in more detail:
from categorize_identifiers import dir2, variables, view
Bytes Conception¶
A computer stores data using bits. A bit can be conceptualised as a single dip switch with the values Off and On as shown below. A single switch has the possible values 0
, 1
which is 2 ** 1
combinations, giving a total of 2
.
A single switch ranges between 0:2
. Since Python uses zero-order indexing, the lower bound 0
is included and the upper bound 2
is exclusive. i.e. up to and excluding 2
:
More typically 8
of these switches are combined into a single logical unit called a byte. A byte has 2 ** 8
combinations which is a total of 256
. i.e. a byte comprises 8 bits and has 0:256
combinations:
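These counts can be confirmed with a quick calculation:

```python
# Each switch (bit) doubles the number of combinations: 2 ** n_bits
for n_bits in (1, 8):
    print(f'{n_bits} bit(s): {2 ** n_bits} combinations')
```

This prints 2 combinations for a single bit and 256 for a full byte.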
Each dipswitch represents a power of 2
. The first dipswitch on the right hand side "8" represents the units (2 ** 0
), the second dipswitch from the right "7" represents the next power (2 ** 1
), the third dipswitch from the right "6" represents the next power (2 ** 2
) and so on… The number above can therefore be calculated as a decimal number using:
+ 0 * (2 ** 7) \
+ 1 * (2 ** 6) \
+ 1 * (2 ** 5) \
+ 0 * (2 ** 4) \
+ 1 * (2 ** 3) \
+ 0 * (2 ** 2) \
+ 0 * (2 ** 1) \
+ 0 * (2 ** 0)
104
The bytes
above can be expressed as a binary number using the prefix 0b
, this prefix is used to distinguish the base 2 from the base 10 which is used by default for an int
. Notice that syntax highlighting highlights the base 2 prefix. The decimal int
will be returned in the cell output:
0b01101000
104
Leading zeros are normally omitted:
0b1101000
104
Although in this context, it is useful to show the leading zeros, so all 8 bits in the byte can be visualised.
A bytes
instance is essentially a collection of individual bytes:
Each byte above can be represented as a binary number and grouped into a tuple
:
(0b01101000, 0b01100101, 0b01101100, 0b01101100, 0b01101111)
(104, 101, 108, 108, 111)
This tuple
collection can be cast into bytes
giving text information:
bytes((0b01101000, 0b01100101, 0b01101100, 0b01101100, 0b01101111))
b'hello'
If this bytes
instance is viewed, notice the value for each index is a byte which is represented by an int
in decimal:
view(bytes((0b01101000, 0b01100101, 0b01101100, 0b01101100, 0b01101111)))
Index Type Size Value 0 int 1 104 1 int 1 101 2 int 1 108 3 int 1 108 4 int 1 111
ASCII Characters¶
Recall that the American Standard Code for Information Interchange (ASCII) maps each byte to a physical command or English character. The physical commands were used to control primitive computers that were essentially typewriter based:
There are 128 ASCII values which span the first half of the byte; the first 33 of these are commands:
byte | hex | num | command |
---|---|---|---|
00000000 | 00 | 000 | null |
00000001 | 01 | 001 | start of heading |
00000010 | 02 | 002 | start of text |
00000011 | 03 | 003 | end of text |
00000100 | 04 | 004 | end of transmission |
00000101 | 05 | 005 | enquiry |
00000110 | 06 | 006 | acknowledge |
00000111 | 07 | 007 | bell |
00001000 | 08 | 008 | backspace |
00001001 | 09 | 009 | horizontal tab |
00001010 | 0a | 010 | new line |
00001011 | 0b | 011 | vertical tab |
00001100 | 0c | 012 | form feed |
00001101 | 0d | 013 | carriage return |
00001110 | 0e | 014 | shift out |
00001111 | 0f | 015 | shift in |
00010000 | 10 | 016 | data link escape |
00010001 | 11 | 017 | device control 1 |
00010010 | 12 | 018 | device control 2 |
00010011 | 13 | 019 | device control 3 |
00010100 | 14 | 020 | device control 4 |
00010101 | 15 | 021 | negative acknowledge |
00010110 | 16 | 022 | synchronous idle |
00010111 | 17 | 023 | end of transmission block |
00011000 | 18 | 024 | cancel |
00011001 | 19 | 025 | end of medium |
00011010 | 1a | 026 | substitute |
00011011 | 1b | 027 | escape |
00011100 | 1c | 028 | file separator |
00011101 | 1d | 029 | group separator |
00011110 | 1e | 030 | record separator |
00011111 | 1f | 031 | unit separator |
00100000 | 20 | 032 | space |
The remaining values, spanning up to half a byte, contain the characters most commonly used in the English language.
byte | hex | num | character |
---|---|---|---|
00100001 | 21 | 033 | ! |
00100010 | 22 | 034 | " |
00100011 | 23 | 035 | # |
00100100 | 24 | 036 | $ |
00100101 | 25 | 037 | % |
00100110 | 26 | 038 | & |
00100111 | 27 | 039 | ' |
00101000 | 28 | 040 | ( |
00101001 | 29 | 041 | ) |
00101010 | 2a | 042 | * |
00101011 | 2b | 043 | + |
00101100 | 2c | 044 | , |
00101101 | 2d | 045 | - |
00101110 | 2e | 046 | . |
00101111 | 2f | 047 | / |
00110000 | 30 | 048 | 0 |
00110001 | 31 | 049 | 1 |
00110010 | 32 | 050 | 2 |
00110011 | 33 | 051 | 3 |
00110100 | 34 | 052 | 4 |
00110101 | 35 | 053 | 5 |
00110110 | 36 | 054 | 6 |
00110111 | 37 | 055 | 7 |
00111000 | 38 | 056 | 8 |
00111001 | 39 | 057 | 9 |
00111010 | 3a | 058 | : |
00111011 | 3b | 059 | ; |
00111100 | 3c | 060 | < |
00111101 | 3d | 061 | = |
00111110 | 3e | 062 | > |
00111111 | 3f | 063 | ? |
01000000 | 40 | 064 | @ |
01000001 | 41 | 065 | A |
01000010 | 42 | 066 | B |
01000011 | 43 | 067 | C |
01000100 | 44 | 068 | D |
01000101 | 45 | 069 | E |
01000110 | 46 | 070 | F |
01000111 | 47 | 071 | G |
01001000 | 48 | 072 | H |
01001001 | 49 | 073 | I |
01001010 | 4a | 074 | J |
01001011 | 4b | 075 | K |
01001100 | 4c | 076 | L |
01001101 | 4d | 077 | M |
01001110 | 4e | 078 | N |
01001111 | 4f | 079 | O |
01010000 | 50 | 080 | P |
01010001 | 51 | 081 | Q |
01010010 | 52 | 082 | R |
01010011 | 53 | 083 | S |
01010100 | 54 | 084 | T |
01010101 | 55 | 085 | U |
01010110 | 56 | 086 | V |
01010111 | 57 | 087 | W |
01011000 | 58 | 088 | X |
01011001 | 59 | 089 | Y |
01011010 | 5a | 090 | Z |
01011011 | 5b | 091 | [ |
01011100 | 5c | 092 | \ |
01011101 | 5d | 093 | ] |
01011110 | 5e | 094 | ^ |
01011111 | 5f | 095 | _ |
01100000 | 60 | 096 | ` |
01100001 | 61 | 097 | a |
01100010 | 62 | 098 | b |
01100011 | 63 | 099 | c |
01100100 | 64 | 100 | d |
01100101 | 65 | 101 | e |
01100110 | 66 | 102 | f |
01100111 | 67 | 103 | g |
01101000 | 68 | 104 | h |
01101001 | 69 | 105 | i |
01101010 | 6a | 106 | j |
01101011 | 6b | 107 | k |
01101100 | 6c | 108 | l |
01101101 | 6d | 109 | m |
01101110 | 6e | 110 | n |
01101111 | 6f | 111 | o |
01110000 | 70 | 112 | p |
01110001 | 71 | 113 | q |
01110010 | 72 | 114 | r |
01110011 | 73 | 115 | s |
01110100 | 74 | 116 | t |
01110101 | 75 | 117 | u |
01110110 | 76 | 118 | v |
01110111 | 77 | 119 | w |
01111000 | 78 | 120 | x |
01111001 | 79 | 121 | y |
01111010 | 7a | 122 | z |
01111011 | 7b | 123 | { |
01111100 | 7c | 124 | | |
01111101 | 7d | 125 | } |
01111110 | 7e | 126 | ~ |
01111111 | 7f | 127 | delete |
Recall the string
module contains the printable ASCII characters:
import string
In the bytes
class, the translation table for an ASCII character is always the same. Therefore in the formal representation, instead of displaying the byte value for that character, the ASCII character itself is shown:
string.printable
'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'
Initialisation Signature¶
The initialisation signature for the bytes
class can be examined:
bytes?
Init signature: bytes(self, /, *args, **kwargs) Docstring: bytes(iterable_of_ints) -> bytes bytes(string, encoding[, errors]) -> bytes bytes(bytes_or_buffer) -> immutable copy of bytes_or_buffer bytes(int) -> bytes object of size given by the parameter initialized with null bytes bytes() -> empty bytes object Construct an immutable array of bytes from: - an iterable yielding integers in range(256) - a text string encoded using the specified encoding - any object implementing the buffer API. - an integer Type: type Subclasses: bytes_
For the bytes
string class, the initialisation signature shows 5 alternative ways of supplying instance data:
bytes(self, /, *args, **kwargs)
bytes(iterable_of_ints) -> bytes
bytes(string, encoding[, errors]) -> bytes
bytes(bytes_or_buffer) -> immutable copy of bytes_or_buffer
bytes(int) -> bytes object of size given by the parameter initialized with null bytes
bytes() -> empty bytes object
If the first way is examined:
bytes(self, /, *args, **kwargs)
- The parentheses ( ) are used to call the class and supply any necessary input arguments.
- The comma , is used as a delimiter to separate out any input arguments.
- self denotes this instance. In other words a byte string can be constructed from an existing byte string instance; this is a special case as the byte string is a fundamental datatype.
- Any input argument before a / must be provided positionally.
- *args indicates a variable number of additional positional input arguments. These are typically not used for initialisation of the bytes string class.
- **kwargs indicates a variable number of additional named input arguments. These are typically not used for initialisation of the bytes string class.
A bytes
instance can be instantiated by supplying an existing bytes
instance self
to the bytes
class:
bytes(b'Hello World!')
b'Hello World!'
However because the bytes
class is a fundamental datatype it can also be instantiated shorthand using the following:
b'Hello World!'
b'Hello World!'
All of the characters above in the bytes
instance are ASCII printable characters. Therefore each byte value in the bytes
instance above is represented by its corresponding ASCII character:
view(b'Hello World!')
Index Type Size Value 0 int 1 72 1 int 1 101 2 int 1 108 3 int 1 108 4 int 1 111 5 int 1 32 6 int 1 87 7 int 1 111 8 int 1 114 9 int 1 108 10 int 1 100 11 int 1 33
string.printable
'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'
A bytes
instance can be initialised using an iterable such as a tuple
of int
instances:
bytes(iterable_of_ints, /) -> bytes
For this to work each int
must be a valid byte value and recall that a byte looks like the following:
Since a byte has 2 ** 8
combinations, which is a total of 256
the range is 0:256
inclusive of the lower bound and exclusive of the upper bound. Therefore the maximum value for an int
instance is 255
. Note that a trailing comma is required to distinguish a single element tuple
from a numeric calculation using parenthesis:
num = (97)
archive = (97, )
variables()
Type | Size/Shape | Value | |
---|---|---|---|
Instance Name | |||
num | int | 97 | |
archive | tuple | 1 | (97,) |
From the above ASCII table the int
instance 97
corresponds to the character a
and this can be seen when a tuple
containing it is cast to bytes:
bytes((97, ))
b'a'
When an int
equals or exceeds the upper bound 256
(which is exclusive due to zero-order indexing) a ValueError
displays:
bytes((256, ))
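As a minimal sketch, the exception can be caught to confirm the valid range:

```python
# 256 is outside range(256), so the bytes constructor raises ValueError
try:
    bytes((256,))
except ValueError as error:
    print(error)
```

The highest valid element is 255, which corresponds to the byte 0b11111111.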
Normally the tuple
will contain more than one int
instance and each of these will be a valid byte value:
integers = (104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100, 33)
bytes(integers)
b'hello world!'
Recall that the decimal int
instance 104
can be represented in binary. The bin
function will return this binary number as a Unicode str
instance:
bin(104)
'0b1101000'
For conceptual clarity it is helpful to see the leading zeros using the str
instance methods removeprefix
and zfill
, alongside str
instance concatenation:
'0b' + bin(104).removeprefix('0b').zfill(8)
'0b01101000'
The Unicode str
instance corresponding to this byte can be retrieved using the chr
function:
chr(104)
'h'
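The ord function performs the inverse mapping of chr, returning the int code for a single character:

```python
# ord maps a single character to its integer code; chr is its inverse
print(ord('h'))       # 104
print(chr(ord('h')))  # h
```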
This bytes
instance consists of 12
individual byte units. The different representation for each byte unit can be examined below:
for number in integers:
print(str(number).center(8), end=' ')
print()
for number in integers:
print(bin(number).removeprefix('0b').zfill(8), end=' ')
print()
for number in integers:
print(chr(number).center(8), end=' ')
104 101 108 108 111 32 119 111 114 108 100 33 01101000 01100101 01101100 01101100 01101111 00100000 01110111 01101111 01110010 01101100 01100100 00100001 h e l l o w o r l d !
When every byte is a printable ASCII character it will be displayed instead of the byte sequence:
bytes(integers)
b'hello world!'
When the byte is not printable, for example the int
instance 0
which corresponds to the ASCII non-printable command NULL, is represented using a hexadecimal escape sequence:
bytes((0,))
b'\x00'
All the whitespace characters, with the exception of the space, are represented using an escape character. For the commonly used tab, newline and carriage return characters, there are the escape characters \t
(decimal 9), \n
(decimal 10) and \r
(decimal 13). The vertical tab and form feed are less commonly used and represented by their hexadecimal escape sequences \x0b
(decimal 11) and \x0c
(decimal 12):
string.whitespace
' \t\n\r\x0b\x0c'
integers = (9, 10, 11, 12, 13)
bytes(integers)
b'\t\n\x0b\x0c\r'
Binary 0b00001100
is not very human-readable and therefore it is easy for a human to make transcription errors when dealing with binary. To make a byte more human readable the hexadecimal number system is introduced. In hexadecimal the byte is essentially split into 2 halves and each half byte is represented as a hexadecimal character:
Recall binary has 2
digits and the prefix 0b
, decimal has 10
digits and no prefix because it is the most commonly used numbering system. Hexadecimal has 16
digits and the prefix 0x
.
Hexadecimal takes the first 10
digits from decimal and supplements them with the first 6
letters in the alphabet. The number of combinations in half a byte is:
2 ** 4
16
As a consequence each hexadecimal character perfectly maps to a 4 bit (half a byte) binary sequence:
(0b) binary | (0x) hexadecimal character | decimal character |
---|---|---|
0000 | 0 | 0 |
0001 | 1 | 1 |
0010 | 2 | 2 |
0011 | 3 | 3 |
0100 | 4 | 4 |
0101 | 5 | 5 |
0110 | 6 | 6 |
0111 | 7 | 7 |
1000 | 8 | 8 |
1001 | 9 | 9 |
1010 | a | 10 |
1011 | b | 11 |
1100 | c | 12 |
1101 | d | 13 |
1110 | e | 14 |
1111 | f | 15 |
Although uppercase and lowercase can be used to represent a hexadecimal character, notice that the Python interpreter prefers lowercase:
integers = (11, 12)
bytes(integers)
b'\x0b\x0c'
A human is more likely to make a transcription error when reading uppercase hexadecimal sequences. For example:
'ABB4AB8A'
when reading the above quickly, notice the similarity between A and 4 and B and 8. The lowercase characters are more clearly distinguished:
'abb4ab8a'
The following bytes
instance is the binary number:
0b00001100
12
The value returned in the cell output displays the decimal integer. The hex
function can be used to cast a decimal int
into a Unicode str
of a hexadecimal character:
hex(0b00001100)
'0xc'
The hexadecimal value is displayed without the leading zero, so the first half byte is not shown. This can be added for clarity:
'0x' + hex(12).removeprefix('0x').zfill(2)
'0x0c'
'0b' + bin(12).removeprefix('0b').zfill(8)
'0b00001100'
And for clarity:
print(bin(12).removeprefix('0b').zfill(8)[:4], bin(12).removeprefix('0b').zfill(8)[4:])
print(hex(12).removeprefix('0x').zfill(2)[:1].center(4), hex(12).removeprefix('0x').zfill(2)[1:].center(4))
0000 1100 0 c
When the byte sequence contains a byte that maps to a whitespace character, a non-printable command or is unmapped to an ASCII character there is no corresponding character to display and an escape sequence instead displays. This can be seen when using the first 32
integers and integers above 127
:
integers = (0, 1, 2, 29, 30, 31, 128, 129, 130, 253, 254, 255)
bytes(integers)
b'\x00\x01\x02\x1d\x1e\x1f\x80\x81\x82\xfd\xfe\xff'
Notice that each character is inserted using its own hexadecimal escape sequence prefix \x
and this instruction expects two hexadecimal characters so a leading zero must be included where applicable.
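As a sketch of an alternative, the same two-character hexadecimal form can be produced with an f-string format specification:

```python
# The format spec 02x pads each value to two lowercase hex digits,
# matching the \x escape sequences in the bytes representation
for value in (0, 31, 128, 255):
    print(rf'\x{value:02x}', end=' ')  # \x00 \x1f \x80 \xff
```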
When a bytes
instance contains byte sequences that map to characters these characters will be displayed instead of the hexadecimal escape sequence:
integers = (32, 65, 80, 120)
bytes(integers)
b' APx'
The tab \t
, newline \n
, carriage return \r
and backslash \\
itself all have single escape characters (the \
is shown as \\
as the \
is used to insert an escape character):
integers = (9, 10, 13, 92)
bytes(integers)
b'\t\n\r\\'
A bytes
instance containing all of these initially appears confusing:
integers = (0, 1, 2, 9, 10, 13, 29, 30, 31, 32, 65, 92, 97, 128, 129, 130, 253, 254, 255)
bytes_string = bytes(integers)
variables()
Type | Size/Shape | Value | |
---|---|---|---|
Instance Name | |||
num | int | 97 | |
archive | tuple | 1 | (97,) |
integers | tuple | 19 | (0, 1, 2, 9, 10, 13, 29, 30, 31, 32, 65, 92, 97, 128, 129, 130, 253, 254, 255) |
number | int | 33 | |
bytes_string | bytes | 19 | b'\x00\x01\x02\t\n\r\x1d\x1e\x1f A\\a\x80\x81\x82\xfd\xfe\xff' |
view(bytes_string)
Index Type Size Value 0 int 1 0 1 int 1 1 2 int 1 2 3 int 1 9 4 int 1 10 5 int 1 13 6 int 1 29 7 int 1 30 8 int 1 31 9 int 1 32 10 int 1 65 11 int 1 92 12 int 1 97 13 int 1 128 14 int 1 129 15 int 1 130 16 int 1 253 17 int 1 254 18 int 1 255
For this reason the bytes
class has the method hex
which returns a Unicode str
instance of the hexadecimal values without any of the escape sequences:
bytes_string.hex()
'000102090a0d1d1e1f20415c61808182fdfeff'
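From Python 3.8 onwards the hex method also accepts an optional separator, which makes the byte boundaries easier to read:

```python
# A separator string groups the hexadecimal output per byte (Python 3.8+)
print(b'\x00\x01\x02'.hex(' '))  # 00 01 02
print(b'hello'.hex('-'))         # 68-65-6c-6c-6f
```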
Note that the bytes
class's hex
method differs from the builtins
function hex
which provides a 0x
prefix:
bytes((12, )).hex()
'0c'
hex(12)
'0xc'
The builtins
function hex
can process a single large integer that exceeds 1 byte:
hex(256)
'0x100'
Whereas the bytes
class's hex
method processes multiple integers that are within the constraints of a byte:
bytes((12, 34)).hex()
'0c22'
The following bytes
instance can be represented as a str
of hexadecimal characters with 2 hexadecimal characters for each byte using the bytes
class method hex
:
b'hello'.hex()
'68656c6c6f'
Notice when each of the ASCII characters is supplied using a hexadecimal escape sequence, the default representation simplifies the output displaying the ASCII character:
b'\x68\x65\x6c\x6c\x6f'
b'hello'
The bytes
class has the class method fromhex
which is an alternative constructor to create a bytes
instance from a Unicode str
instance of hexadecimal characters:
bytes.fromhex('68656c6c6f')
b'hello'
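The two methods are inverses; as a small sketch, a round trip returns the original bytes instance, and fromhex also permits spaces between byte pairs:

```python
# fromhex reverses hex; spaces between byte pairs are ignored
data = b'hello'
assert bytes.fromhex(data.hex()) == data
print(bytes.fromhex('68 65 6c 6c 6f'))  # b'hello'
```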
The different ways of representing each byte in the bytes
instance can be examined using:
integers = (104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100, 33)
bytes(integers)
for number in integers:
print(chr(number).center(8), end=' ')
print()
for number in integers:
print(str(number).center(8), end=' ')
print()
for number in integers:
print(bin(number).removeprefix('0b').zfill(8), end=' ')
print()
for number in integers:
print((hex(number).removeprefix('0x')).center(8), end=' ')
print()
for number in integers:
print((r'0x' + hex(number).removeprefix('0x')).center(8), end=' ')
print()
for number in integers:
print((r'\x' + hex(number).removeprefix('0x')).center(8), end=' ')
h e l l o w o r l d ! 104 101 108 108 111 32 119 111 114 108 100 33 01101000 01100101 01101100 01101100 01101111 00100000 01110111 01101111 01110010 01101100 01100100 00100001 68 65 6c 6c 6f 20 77 6f 72 6c 64 21 0x68 0x65 0x6c 0x6c 0x6f 0x20 0x77 0x6f 0x72 0x6c 0x64 0x21 \x68 \x65 \x6c \x6c \x6f \x20 \x77 \x6f \x72 \x6c \x64 \x21
A bytes
instance can be instantiated from a Unicode str
however the named parameter encoding
needs to be supplied, which gives the instructions to encode a Unicode character outside the ASCII range. When the simplest encoding 'ascii'
is supplied all characters in the Unicode str
must be within the ASCII range. Generally the current standard 'utf-8'
is used which is adaptable and encodes each Unicode character in the Unicode str
to 1-4 bytes in the returned bytes
instance:
bytes(string, /, encoding[, errors]) -> bytes
bytes('hello', encoding='ascii')
b'hello'
bytes('hello', encoding='utf-8')
b'hello'
Not supplying encoding
gives a TypeError
:
bytes('hello')
Supplying a Unicode character that is not ASCII while specifying ASCII encoding will give a UnicodeEncodeError
:
bytes('α', encoding='ascii')
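The optional errors parameter from the initialisation signature controls how unencodable characters are handled; a sketch of two common policies:

```python
# 'replace' substitutes ? for each unencodable character,
# 'ignore' silently drops unencodable characters
print(bytes('αβc', encoding='ascii', errors='replace'))  # b'??c'
print(bytes('αβc', encoding='ascii', errors='ignore'))   # b'c'
```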
A bytes
instance can be cast from an existing bytes
instance:
bytes(bytes_or_buffer, /) -> immutable copy of bytes_or_buffer
A bytes
instance can be instantiated by casting a bytearray
. The bytearray
is the mutable counterpart to the bytes
class:
bytearray_string = bytearray(b'hello')
bytes_string = bytes(bytearray_string)
variables()
Type | Size/Shape | Value | |
---|---|---|---|
Instance Name | |||
num | int | 97 | |
archive | tuple | 1 | (97,) |
integers | tuple | 12 | (104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100, 33) |
number | int | 33 | |
bytes_string | bytes | 5 | b'hello' |
bytearray_string | bytearray | 5 | bytearray(b'hello') |
A NULL bytes
instance can also be initialised from an int
. The int
is used to specify the number of NULL bytes
and is not cast into an individual byte as seen when an int
is provided via a tuple
:
bytes(int, /) -> bytes object of size given by the parameter initialized with null bytes
For example a bytes
instance occupying 1 byte can be instantiated:
bytes(1)
b'\x00'
And another one occupying 4 bytes can be instantiated:
bytes(4)
b'\x00\x00\x00\x00'
Using the bytes
class without providing any instantiation data will create an empty bytes
instance:
bytes() -> empty bytes object
bytes()
b''
Initialising data and then populating is more commonly used for mutable datatypes. For an immutable datatype, the instance cannot be modified and the instance name instead gets reassigned to a new instance.
Encoding and Decoding¶
When a bytes
instance was instantiated from a Unicode str
instance, an encoding translation table was selected; that is, a table that maps a byte sequence to a specific character. There have been many encoding standards developed throughout the years and by default the current standard 'utf-8'
should be used. The Unicode str
class is always 'utf-8'
and as a consequence is far easier to use than the bytes
class for most text applications:
encoding | bytes per character | bits per character | byte order | byte order marker BOM |
---|---|---|---|---|
'utf-8' | 1, 2, 3, 4 | 8, 16, 24, 32 | big endian | |
'utf-8-sig' | 1, 2, 3, 4 | 8, 16, 24, 32 | big endian | efbbbf |
'utf-32' | 4 | 32 | little endian | fffe0000 |
'utf-32-le' | 4 | 32 | little endian | |
'utf-32-be' | 4 | 32 | big endian | |
'utf-16' | 2 | 16 | little endian | fffe |
'utf-16-le' | 2 | 16 | little endian | |
'utf-16-be' | 2 | 16 | big endian | |
'latin1' | 1 | 8 | ||
'ascii' | 1 | 8 |
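The byte order markers in the table above can be verified by encoding an empty Unicode str with each scheme; only the BOM (if any) remains:

```python
# Encoding an empty str leaves only the byte order marker, if the
# encoding scheme uses one
for encoding in ('utf-8', 'utf-8-sig', 'utf-16', 'utf-16-le', 'utf-32'):
    print(f'{encoding}: {bytes("", encoding=encoding).hex()}')
```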
ASCII¶
'ascii'
is the most basic translation table and each character is encoded over 1
byte using only half of the possible values. ASCII is restricted to a small subset of English characters:
a_ascii = b'a'
a_ascii
b'a'
hex(ord('a'))
'0x61'
The 'ascii'
encoding scheme was originally developed using 7 bits with:
2 ** 7
128
for this reason the ASCII values span over the range 0:128
(up to and excluding the upper bound of 128
). This covers half the possible values of a byte:
2 ** 8
256
Extended ASCII Variants¶
In the 1990s there were numerous regional translation tables which mapped the second half of the byte to regional characters.
In the UK, 'latin1'
was used which includes the £
sign:
gb = bytes('£123.45', encoding='latin1')
gb
b'\xa3123.45'
int('0xa3', base=16)
163
gb.decode(encoding='latin1')
'£123.45'
This regional encoding scheme spanned over the full byte allowing the commonly used regional characters.
The problem with early regional encoding was that operating systems and browsers were often configured to use a regional encoding scheme that differed from the encoding scheme the content itself was written in and as a result non-ASCII characters were often incorrectly substituted. This can be seen for example by decoding the bytes
instance above which was originally encoded in 'latin1'
with 'latin2'
, 'latin3'
, 'greek'
and 'cyrillic'
:
gb.decode(encoding='latin2')
'Ł123.45'
gb.decode(encoding='latin3')
'£123.45'
gb.decode(encoding='greek')
'£123.45'
gb.decode(encoding='cyrillic')
'Ѓ123.45'
All of these formats should be considered as legacy formats.
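This substitution behaviour can be reproduced in a loop over the legacy codec names used above:

```python
# 0xa3 maps to a different regional character in each legacy table
gb = bytes('£123.45', encoding='latin1')
for encoding in ('latin1', 'latin2', 'greek', 'cyrillic'):
    print(f'{encoding}: {gb.decode(encoding=encoding)}')
```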
UTF-16¶
The Unicode Transformation Format 'utf-16'
was a previous standard where each character occupied 2
bytes which is 2 * 8
bits and is where the name 16
comes from. Using 2 bytes instead of 1 byte per character increases the number of possible combinations to:
2 ** 16
65536
When we count using numbers we use big endian, for example the number twelve is represented using two decimal digits:
12
This is big endian and the most significant digit 1 which corresponds to 10 is stated first followed by the digit 2 which corresponds to 2 units.
This number twelve could also be represented in little endian:
21
In little endian, the least significant digit, the digit 2 which corresponds to 2 units is stated first followed by the most significant digit 1 which corresponds to 10.
When the ASCII character a
is encoded using the 'ascii'
translation table it occupies a single byte, which recall is represented using two hexadecimal characters:
bytes('\x61', encoding='ascii')
b'a'
For 'utf-16'
, each character must occupy two bytes. For an ASCII character the single byte that was encoded in 'ascii'
is taken as the least significant byte and is accompanied by the NULL byte which acts as the most significant byte. In big endian the most significant byte is placed first followed by the least significant byte:
b'\x00\x61'.hex()
'0061'
In little endian the least significant byte is instead placed first, followed by the most significant byte:
b'\x61\x00'.hex()
'6100'
If the two byte instances are examined, the default representation assumes the bytes
instance is using 'ascii'
encoding and so the NULL byte displays as an escape character and the 'ascii'
character displays:
b'\x00\x61' # big endian
b'\x00a'
b'\x61\x00' # little endian
b'a\x00'
If the Unicode str
instance 'abc'
is examined and encoded in 'utf-16-be'
:
bytes('abc', encoding='utf-16-be')
b'\x00a\x00b\x00c'
Which looks like the following when the bytes corresponding to ASCII characters aren't processed:
b'\x00\x61\x00\x62\x00\x63' # big endian
b'\x00a\x00b\x00c'
If the Unicode str
instance 'abc'
is instead encoded in 'utf-16-le'
:
bytes('abc', encoding='utf-16-le')
b'a\x00b\x00c\x00'
b'\x61\x00\x62\x00\x63\x00' # little endian
b'a\x00b\x00c\x00'
When 'utf-16'
was introduced there was a deviation in the way processors handled characters that spanned over multiple bytes. Some processors used big endian and others used little endian. Intel, the most dominant processor manufacturer at the time, favoured little endian. As there was confusion between the two variants of 'utf-16'
, Microsoft favoured addition of a Byte Order Marker (BOM). The BOM is at the start of the bytes
instance and like every character in 'utf-16'
will span over two bytes (4 hexadecimal characters):
bytes('abc', encoding='utf-16-le').hex()
'610062006300'
bytes('abc', encoding='utf-16').hex()
'fffe610062006300'
The BOM can be examined by casting an empty str
instance:
bytes('', encoding='utf-16')
b'\xff\xfe'
bytes('', encoding='utf-16').hex()
'fffe'
The str
instance corresponding to the Greek letter alpha can be encoded in 'utf-16-le'
:
alpha_be = bytes('α', encoding='utf-16-le')
The hexadecimal values can be examined:
alpha_be.hex()
'b103'
alpha_be
b'\xb1\x03'
This bytes
instance can be decoded back to the original str
instance using the correct encoding:
alpha_be.decode(encoding='utf-16-le')
'α'
If the incorrect decoding is used the wrong character is selected:
alpha_be.decode(encoding='utf-16-be')
'넃'
This is equivalent to:
b'\x03\xb1'.decode(encoding='utf-16-le')
'넃'
If a single byte encoding is used, each byte will be represented as a different character. One of the characters is a non-printable ASCII character so displays as \x03
:
alpha_be.decode(encoding='latin1')
'±\x03'
With 16 bits there are:
2 ** 16
65536
combinations. This is not enough to cover the characters from all the languages in the world.
UTF-32¶
Therefore 'utf-32'
was developed which spans over 32 bits which is 4 bytes:
2 ** 32
4294967296
Like 'utf-16'
there are BOM variations:
bytes('\x61', encoding='utf-32-be'), bytes('\x61', encoding='utf-32-be').hex()
(b'\x00\x00\x00a', '00000061')
bytes('\x61', encoding='utf-32-le'), bytes('\x61', encoding='utf-32-le').hex()
(b'a\x00\x00\x00', '61000000')
bytes('\x61', encoding='utf-32-be'), bytes('\x61', encoding='utf-32').hex()
(b'\x00\x00\x00a', 'fffe000061000000')
This gives groupings of 4 bytes, which is 8 hexadecimal characters:
word = 'abαβ悤悥🦒🦓'
be = bytes(word, encoding='utf-32-be').hex()
le = bytes(word, encoding='utf-32-le').hex()
bom_le = bytes(word, encoding='utf-32').hex()
print('char', end=': ')
for i in word:
print(i, end=' ')
print()
print('utf-32-be', end=': ')
for i in range(0, len(be), 8):
print(be [i:i+8], end=' ')
print()
print('utf-32-le', end=': ')
for i in range(0, len(le), 8):
print(le [i:i+8], end=' ')
print()
print('utf-32', end=': ')
for i in range(0, len(bom_le), 8):
print(bom_le [i:i+8], end=' ')
char: a b α β 悤 悥 🦒 🦓 utf-32-be: 00000061 00000062 000003b1 000003b2 000060a4 000060a5 0001f992 0001f993 utf-32-le: 61000000 62000000 b1030000 b2030000 a4600000 a5600000 92f90100 93f90100 utf-32: fffe0000 61000000 62000000 b1030000 b2030000 a4600000 a5600000 92f90100 93f90100
UTF-8¶
The main drawback of 'utf-32'
is that it requires a lot more memory per character, has byte order issues and each ASCII character needs to be accompanied by 3 NULL bytes. The current standard 'utf-8'
was developed as an adaptable format; characters span over 1-4 bytes and the byte order is always big endian:
print('1 byte:', end=' ')
for unicode_char in 'abcde':
print(bytes(unicode_char, encoding='utf-8').hex(), end=' ')
print()
print('2 bytes:', end=' ')
for unicode_char in 'αβγδε':
print(bytes(unicode_char, encoding='utf-8').hex(), end=' ')
print()
print('3 bytes:', end=' ')
for unicode_char in '悤悥悦悧您':
print(bytes(unicode_char, encoding='utf-8').hex(), end=' ')
print()
print('4 bytes:', end=' ')
for unicode_char in '🦒🦓🦔🦕🦖':
print(bytes(unicode_char, encoding='utf-8').hex(), end=' ')
1 byte: 61 62 63 64 65 2 bytes: ceb1 ceb2 ceb3 ceb4 ceb5 3 bytes: e682a4 e682a5 e682a6 e682a7 e682a8 4 bytes: f09fa692 f09fa693 f09fa694 f09fa695 f09fa696
Generally:
- 1 byte is the 'ascii' subset.
- 2 bytes are used for extended European characters. 'utf-16' is not a subset as 'utf-8' uses a byte pattern that differs from 'utf-16'.
- 3 bytes are used for additional languages.
- 4 bytes are used for emojis. 'utf-32' is not a subset as 'utf-8' uses a byte pattern that differs from 'utf-32'.
Under the hood the start of the first byte is used to identify whether 1, 2, 3 or 4 bytes are used to encode a character:
1 byte: 0XXXXXXX (ASCII)
2 bytes: 110XXXXX 10XXXXXX
3 bytes: 1110XXXX 10XXXXXX 10XXXXXX
4 bytes: 11110XXX 10XXXXXX 10XXXXXX 10XXXXXX
4 of the example characters above can be cast into binary and seen to follow the above pattern:
('1 byte', bin(0x61).removeprefix('0b').zfill(1 * 8))
('1 byte', '01100001')
('2 bytes', bin(0xceb1).removeprefix('0b').zfill(2 * 8))
('2 bytes', '1100111010110001')
('3 bytes', bin(0xe682a4).removeprefix('0b').zfill(3 * 8))
('3 bytes', '111001101000001010100100')
('4 bytes', bin(0xf09fa692).removeprefix('0b').zfill(4 * 8))
('4 bytes', '11110000100111111010011010010010')
UTF-8-Sig¶
Since 'utf-8'
is always big endian and characters encoded over multiple bytes have a distinctive byte pattern, there is generally no need for a BOM.
Despite 'utf-8'
not requiring a BOM, Microsoft often include one in their products using the variation 'utf-8-sig'
and it may therefore be seen in data exported from popular Microsoft applications such as Notepad or Excel. The BOM can be seen by comparing the casting of an empty Unicode string to a byte string using 'utf-8'
and 'utf-8-sig'
respectively:
bytes('', encoding='utf-8').hex()
''
bytes('', encoding='utf-8-sig').hex()
'efbbbf'
'utf-8' is the current standard and should be used by default. The Unicode string str class is locked to 'utf-8' and is much easier to work with, as there is no need to worry about the encoding.
When decoding byte data from another source, 'utf-8' should be tried by default. If an unwanted BOM appears at the start of the decoded text, the data was probably produced by a Microsoft product using 'utf-8-sig'.
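A minimal sketch of this behaviour, using a hand-constructed byte string with a BOM:

```python
# Prepend the 3 byte BOM to some UTF-8 encoded text:
data = bytes.fromhex('efbbbf') + 'hello'.encode('utf-8')

# Decoding with 'utf-8' leaves the BOM as '\ufeff' at the start:
print(repr(data.decode('utf-8')))      # '\ufeffhello'

# Decoding with 'utf-8-sig' strips the BOM:
print(repr(data.decode('utf-8-sig')))  # 'hello'
```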
Hardware¶
The bytes class was the main text class in Python 2. As 'utf-8' became widely adopted as an encoding standard, the text datatype was redeveloped so that a Unicode character, rather than a byte, is the fundamental unit. Python 3 introduced major changes over Python 2, in particular to the default text class: the str class is the default in Python 3 and should be used in most applications, while the older bytes class is still used when communicating directly with hardware. The bytes instance below can be transmitted over a serial port:
for number in b'hello world!':
print(bin(number).removeprefix('0b').zfill(8), end='')
011010000110010101101100011011000110111100100000011101110110111101110010011011000110010000100001
A serial port, for example, has a signal pin configured to transmit or receive a digital time trace. A baud rate of 9600 means 9600 bits are processed per second. When a bit is 0, the voltage on the signal pin is LOW and when a bit is 1 the voltage is HIGH.
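As a rough worked example, assuming the common 8N1 serial framing (each byte is sent with 1 start bit and 1 stop bit, an assumption not stated above):

```python
baud_rate = 9600             # bits processed per second
bits_per_frame = 1 + 8 + 1   # start bit + 8 data bits + stop bit (8N1)

bytes_per_second = baud_rate // bits_per_frame
print(bytes_per_second)      # 960
```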
bytes instances are therefore still used when directly interfacing with hardware, for example an Arduino. In such applications it is recommended to decode a bytes instance to a Unicode str instance as early as possible in a Python program, and to cast the str instance back to a bytes instance as late as possible before transmitting, reducing the possibility of encoding issues.
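This decode-early, encode-late pattern can be sketched without real hardware; receive_bytes and send_bytes below are hypothetical stand-ins for serial I/O such as pyserial's read and write:

```python
def receive_bytes():
    # Hypothetical stand-in for e.g. serial_port.readline()
    return b'temperature=21.5\n'

def send_bytes(payload):
    # Hypothetical stand-in for e.g. serial_port.write(payload)
    assert isinstance(payload, bytes)

# Decode to str as early as possible:
text = receive_bytes().decode('utf-8')

# Work with str for the bulk of the program:
reply = text.strip().upper()

# Encode back to bytes as late as possible, just before transmission:
send_bytes(reply.encode('utf-8'))
print(reply)  # TEMPERATURE=21.5
```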
Bytes Identifiers¶
If the identifiers of the bytes
class and the str
class are examined, they are seen to be largely consistent:
dir2(bytes, object, unique_only=True)
{'method': ['capitalize', 'center', 'count', 'decode', 'endswith', 'expandtabs', 'find', 'fromhex', 'hex', 'index', 'isalnum', 'isalpha', 'isascii', 'isdigit', 'islower', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'maketrans', 'partition', 'removeprefix', 'removesuffix', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill'], 'datamodel_method': ['__add__', '__buffer__', '__bytes__', '__contains__', '__getitem__', '__getnewargs__', '__iter__', '__len__', '__mod__', '__mul__', '__rmod__', '__rmul__']}
dir2(str, object, unique_only=True)
{'method': ['capitalize', 'casefold', 'center', 'count', 'encode', 'endswith', 'expandtabs', 'find', 'format', 'format_map', 'index', 'isalnum', 'isalpha', 'isascii', 'isdecimal', 'isdigit', 'isidentifier', 'islower', 'isnumeric', 'isprintable', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'maketrans', 'partition', 'removeprefix', 'removesuffix', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill'], 'datamodel_method': ['__add__', '__contains__', '__getitem__', '__getnewargs__', '__iter__', '__len__', '__mod__', '__mul__', '__rmod__', '__rmul__']}
The unique identifiers can be examined for each class. The str
method 'encode'
casts a str
instance to a bytes
instance using a specified translation table to encode with. The bytes
method 'decode'
casts a bytes
instance to a str
instance using a specified translation table to decode with. The class method fromhex
and the instance method hex
can be used to cast from a Unicode str
instance of hexadecimal characters and to return
a Unicode str
instance of hexadecimal characters respectively:
dir2(bytes, str, unique_only=True)
{'method': ['decode', 'fromhex', 'hex'], 'datamodel_method': ['__buffer__', '__bytes__']}
dir2(str, bytes, unique_only=True)
{'method': ['casefold', 'encode', 'format', 'format_map', 'isdecimal', 'isidentifier', 'isnumeric', 'isprintable']}
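These unique methods round-trip between the two classes:

```python
s = 'αβ'

b = s.encode('utf-8')    # str -> bytes
print(b)                 # b'\xce\xb1\xce\xb2'

h = b.hex()              # bytes -> str of hexadecimal characters
print(h)                 # ceb1ceb2

b2 = bytes.fromhex(h)    # str of hexadecimal characters -> bytes
s2 = b2.decode('utf-8')  # bytes -> str
print(s2 == s)           # True
```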
Some str
methods related to formatting and methods which check groupings of Unicode characters don't have a counterpart available in the bytes
class.
The shared identifiers behave consistently between the two classes. Counterparts to str methods that return a str instance will instead return a bytes instance, or a single byte, which recall is an int in the range 0:256. Suppose the following four instances are instantiated:
english_b = b'abcde'
greek_b = bytes('αβγδε', encoding='UTF-8')
english_s = 'abcde'
greek_s = 'αβγδε'
The variables can be viewed:
variables(['english_b', 'english_s', 'greek_b', 'greek_s'])
Type | Size/Shape | Value | |
---|---|---|---|
Instance Name | |||
english_b | bytes | 5 | b'abcde' |
english_s | str | 5 | abcde |
greek_b | bytes | 10 | b'\xce\xb1\xce\xb2\xce\xb3\xce\xb4\xce\xb5' |
greek_s | str | 5 | αβγδε |
And examined:
view(english_b)
Index Type Size Value 0 int 1 97 1 int 1 98 2 int 1 99 3 int 1 100 4 int 1 101
view(english_s)
Index Type Size Value 0 str 1 a 1 str 1 b 2 str 1 c 3 str 1 d 4 str 1 e
view(greek_b)
Index Type Size Value 0 int 1 206 1 int 1 177 2 int 1 206 3 int 1 178 4 int 1 206 5 int 1 179 6 int 1 206 7 int 1 180 8 int 1 206 9 int 1 181
view(greek_s)
Index Type Size Value 0 str 1 α 1 str 1 β 2 str 1 γ 3 str 1 δ 4 str 1 ε
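The shared methods can be tried on instances like these (redefined below so the snippet is self-contained); on the bytes versions they take and return bytes, note the b prefix on the arguments:

```python
english_b = b'abcde'
english_s = 'abcde'

print(english_s.upper())              # ABCDE
print(english_b.upper())              # b'ABCDE'

print(english_s.replace('c', 'x'))    # abxde
print(english_b.replace(b'c', b'x'))  # b'abxde'

print(english_s.split('c'))           # ['ab', 'de']
print(english_b.split(b'c'))          # [b'ab', b'de']
```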
Notice that the len of english_s and greek_s is consistently 5, as each contains 5 Unicode characters:
len(english_s)
5
len(greek_s)
5
Notice that the len of english_b is also 5, as each ASCII character spans 1 byte; however greek_b has a length of 10, as each Greek character spans 2 bytes:
len(english_b)
5
len(greek_b)
10
When indexed, a str instance returns the Unicode character at that index. A bytes instance on the other hand returns the byte, in the form of an int:
greek_s[0]
'α'
greek_b[0]
206
greek_b.hex()
'ceb1ceb2ceb3ceb4ceb5'
0xce
206
Slicing will instead return a bytes
instance. The difference can be seen when 1
byte is selected from a slice:
greek_b[:1]
b'\xce'
The slicing syntax is otherwise consistent with the str class:
greek_b[:2:]
b'\xce\xb1'
greek_b[::2]
b'\xce\xce\xce\xce\xce'
greek_b[1::2]
b'\xb1\xb2\xb3\xb4\xb5'
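Since each Greek character above spans 2 bytes, stepping through greek_b in consecutive 2 byte slices recovers the individual characters:

```python
greek_b = bytes('αβγδε', encoding='utf-8')

# Each consecutive 2 byte slice is one encoded character:
chars = [greek_b[i:i + 2] for i in range(0, len(greek_b), 2)]
print(chars)  # [b'\xce\xb1', b'\xce\xb2', b'\xce\xb3', b'\xce\xb4', b'\xce\xb5']

# Each slice decodes back to a single Unicode character:
print([c.decode('utf-8') for c in chars])  # ['α', 'β', 'γ', 'δ', 'ε']
```

Note this relies on knowing in advance that every character in the string is 2 bytes wide; decoding the whole bytes instance with `decode('utf-8')` is the general approach.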