str is an abbreviation of string. An instance of the str class is an immutable string of Unicode characters.

Categorize_Identifiers Module¶

This notebook uses the functions dir2, variables and view from the custom module categorize_identifiers, which is found in the same directory as this notebook file. dir2 is a variant of dir that groups identifiers into a dict under categories, variables is an IPython-based variable inspector, and view is used to view a Collection in more detail:

In [1]:

from categorize_identifiers import dir2, variables, view

Initialisation Signature¶

The initialisation signature of the str class may be printed using:

In [2]:

str?

Init signature: str(self, /, *args, **kwargs)
Docstring:     
str(object='') -> str
str(bytes_or_buffer[, encoding[, errors]]) -> str

Create a new string object from the given object. If encoding or
errors is specified, then the object must expose a data buffer
that will be decoded using the given encoding and error handler.
Otherwise, returns the result of object.__str__() (if defined)
or repr(object).
encoding defaults to sys.getdefaultencoding().
errors defaults to 'strict'.
Type:           type
Subclasses:     StrEnum, DeferredConfigString, FoldedCase, _rstr, _ScriptTarget, _ModuleTarget, LSString, include, Keys, InputMode, ...

The purpose of the initialisation signature is to provide the data required to initialise a new instance. For the str class, the initialisation signature shows three alternative ways of supplying the required instance data.

If the first way is examined:

str(self, /, *args, **kwargs)

To recap:

The parentheses ( ) are used to call a function and supply any necessary input arguments.
The comma , is used as a delimiter to separate input arguments.
In Python self denotes the instance itself, here the new str instance being initialised. It is supplied automatically and is not typed when calling the class. str is a fundamental datatype and also has a shorthand way of instantiation.

Any input argument before a / must be provided positionally.
*args indicates a variable number of additional positional input arguments.
**kwargs indicates a variable number of additional named input arguments. The arguments that the str class actually accepts are listed in the two forms at the top of the docstring.

A str instance can be constructed from an existing str instance, provided positionally:

In [3]:

str('hello')

Out[3]:

'hello'

However, because str is a fundamental datatype, it is normally instantiated using the following shorthand:

In [4]:

'hello'

Out[4]:

'hello'

The characters in a str instance must be enclosed in quotations. These are used to distinguish a str of characters from an instance name.

Notice the difference in the syntax colour highlighting between the str instance (top) and the instance name (bottom). The instance name does not exist, so the Python interpreter will raise a NameError when attempting to look it up:

'hello'

hello
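The NameError can be observed directly, without halting the notebook, by wrapping the lookup in a try block. This is a minimal sketch; it assumes that nothing has been assigned to the instance name hello in the current session:

```python
# A quoted sequence of characters is a str instance
quoted = 'hello'

# An unquoted sequence of characters is looked up as an instance name.
# The lookup fails here because nothing has been assigned to hello
try:
    hello
    lookup_failed = False
except NameError as error:
    lookup_failed = True
    print(type(error).__name__, error)
```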

In VSCode the Variables button can be selected to view Variables present. In this notebook, the custom function variables will instead be used which has a similar form:

In [5]:

variables()

Out[5]:

	Type	Size/Shape	Value
Instance Name

If the following code is input:

In [6]:

'hello'

Out[6]:

'hello'

Notice the value 'hello' is returned to the cell output. When a value is only returned to the cell output, it is not assigned to an instance name.

In [7]:

variables()

Out[7]:

	Type	Size/Shape	Value
Instance Name

This Python str instance has no instance name and therefore cannot be reselected. Conceptualise an instance name as a label which points to the str instance and is therefore used to select the str instance.

A str instance can be assigned to an instance name during instantiation:

In [8]:

greeting = 'hello'

Notice now that the cell has no output. Instead the str instance is stored under the instance name greeting, which now displays in Variables:

In [9]:

variables()

Out[9]:

	Type	Size/Shape	Value
Instance Name
greeting	str	5	hello

The value of the str instance can be referenced via the instance name:

In [10]:

greeting

Out[10]:

'hello'

In the above cell, the Python interpreter recognised the instance name and used it to look up the str instance. Because the value retrieved was not assigned to another instance name, it is shown in the cell output.

If the instance is instead assigned to another instance name:

In [11]:

greeting2 = greeting

Then in the Variable Explorer, the str instance 'hello' is shown with two different instance names greeting and greeting2:

In [12]:

variables()

Out[12]:

	Type	Size/Shape	Value
Instance Name
greeting	str	5	hello
greeting2	str	5	hello

These two instance names act as aliases of one another. If an instance name is conceptualised as a label, then this str instance has two labels. If either instance name is used, the same value is retrieved:

In [13]:

greeting

Out[13]:

'hello'

In [14]:

greeting2

Out[14]:

'hello'

The == operator checks whether the values retrieved from each instance name are equal. Because both instance names point to the same str instance, the boolean True is returned:

In [15]:

greeting == greeting2

Out[15]:

True

Each instance in Python has a unique identification, which can be checked using the id function:

In [16]:

id(greeting)

Out[16]:

2064235241776

In [17]:

id(greeting2)

Out[17]:

2064235241776

Notice that the id is the same, because both these instance names are references to the same str instance. Therefore the following is True:

In [18]:

greeting is greeting2

Out[18]:

True

Recall that this is shorthand for:

In [19]:

id(greeting) == id(greeting2)

Out[19]:

True
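The distinction between equality and identity only becomes visible when two separate instances hold the same value. The sketch below uses list instances for the second half. This is an assumption made for illustration, since an interpreter may reuse a single instance for equal str values:

```python
greeting = 'hello'
greeting2 = greeting  # second label on the same str instance

# is compares identity and agrees with the comparison of the two ids
same_instance = greeting is greeting2
same_id = id(greeting) == id(greeting2)
print(same_instance, same_id)

# Two separately created list instances can be equal without being identical
letters = ['h', 'i']
letters_copy = list(letters)  # new list instance with the same items
print(letters == letters_copy)
print(letters is letters_copy)
```

Equality (==) compares values, whereas identity (is) asks whether both instance names are labels on one and the same instance.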

The delete statement del can be used to delete an instance name. Note that deleting an instance name only deletes a label, leaving the instance unchanged:

In [20]:

del greeting

Notice that the instance name greeting is deleted i.e. this label is removed. However the label greeting2 is still present and the instance 'hello' is unaltered:

In [21]:

variables()

Out[21]:

	Type	Size/Shape	Value
Instance Name
greeting2	str	5	hello
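The effect of del on a single label can also be checked in plain Python, without the custom variables function. A minimal sketch:

```python
greeting = 'hello'
greeting2 = greeting

del greeting  # removes the label greeting only

# The deleted instance name can no longer be looked up
try:
    greeting
    greeting_exists = True
except NameError:
    greeting_exists = False

print(greeting_exists)
print(greeting2)  # the instance is still reachable via the other label
```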

If del is used to also delete the instance name greeting2:

In [22]:

del greeting2

In [23]:

variables()

Out[23]:

	Type	Size/Shape	Value
Instance Name

Then there are no instance names for the str instance 'hello'. When an instance has no instance name it cannot be referenced and is considered orphaned. Orphaned instances are automatically cleaned up by Python's garbage collection.

If a new instance is created:

In [24]:

greeting = 'Hello World'

Then the instance name displays on variables:

In [25]:

variables(show_id=True)

Out[25]:

	Type	Size/Shape	Value	ID
Instance Name
greeting	str	11	Hello World	2064295823280

If a reassignment is carried out:

In [26]:

greeting = 'hi'

The instance name remains on Variables but the instance it points to has changed. In other words, the label greeting has been peeled off the old str instance 'Hello World' and placed on the new str instance 'hi'. The old str instance now has no instance name, and therefore no reference, so it is orphaned and is cleaned up by Python's garbage collection:

In [27]:

variables(show_id=True)

Out[27]:

	Type	Size/Shape	Value	ID
Instance Name
greeting	str	2	hi	140727149708216

Reassignment moves the instance name from the old str instance to the new str instance and does not change either str instance. A str instance is immutable and cannot be modified after it has been instantiated.
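Immutability and reassignment can be contrasted in a few lines. In this sketch the instance name original is an extra label, introduced only for illustration, that keeps the first instance reachable after greeting is reassigned:

```python
greeting = 'Hello World'
original = greeting  # keep a second label on the first instance

# Attempting to modify a character in place fails
try:
    greeting[0] = 'J'
    modified_in_place = True
except TypeError as error:
    modified_in_place = False
    print(error)

# Reassignment moves the label greeting to a different instance
greeting = 'hi'
print(greeting)
print(original)  # the first instance is unchanged
```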

The initialisation signature of the str class shows instantiation using a named keyword input argument object which has a default value of an empty str:

str(object='') -> str

This is used to cast instances of other Python builtin classes to str instances:

In [28]:

str(object='hello')

Out[28]:

'hello'

In [29]:

str(object=b'hello')

Out[29]:

"b'hello'"

In [30]:

str(object=bytearray(b'hello'))

Out[30]:

"bytearray(b'hello')"

In [31]:

str(object=2)

Out[31]:

'2'

In [32]:

str(object=True)

Out[32]:

'True'

In [33]:

str(object=3.14)

Out[33]:

'3.14'

If object is not supplied, it takes on its default value and an empty str instance is returned:

In [34]:

str()

Out[34]:

''
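The second form in the docstring, str(bytes_or_buffer[, encoding[, errors]]), decodes the bytes instead of returning their formal representation. A short sketch contrasting the two forms; the byte values are chosen only for illustration:

```python
data = b'hello'

# Without an encoding, the formal representation of the bytes is returned
as_repr = str(data)
print(as_repr)

# With an encoding, the bytes are decoded into Unicode characters
decoded = str(data, encoding='utf-8')
print(decoded)

# errors selects the error handler used for bytes that cannot be decoded
invalid = b'hello\xff'  # \xff is not valid UTF-8
ignored = str(invalid, encoding='utf-8', errors='ignore')
print(ignored)
```

With the default errors='strict', the last call would instead raise a UnicodeDecodeError.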

Spacing and PEP8¶

If the following is examined:

instance = str(object='hello')

Notice the assignment operator is used to assign a value to a named parameter within the function call and the return value of the function call is also assigned to an instance name.

Notice the subtlety in the above spacing. Within a function call spacing is typically used to visually separate out input arguments:

func(a=1, b=2, c=3)

Outside the function call, spacing is used to visually emphasise an operator:

result = 2 * 3

Operators within a function call are not visually separated as the spacing is used to visually separate out the parameters:

result = func(a=1, b=2, c=2*3)

The code below will work but is harder to read:

result=func(a=1,b=2,c=2*3)

result=func(a = 1,b = 2,c = 2 * 3)
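A runnable version of the call above is sketched below. The function func and its parameters a, b and c are hypothetical and exist only to illustrate the spacing:

```python
def func(a, b, c):
    """Return the sum of the three inputs (hypothetical helper)."""
    return a + b + c


# No spaces around = for named input arguments, one space after each comma
result = func(a=1, b=2, c=2*3)
print(result)
```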

More details are given in Python Enhancement Proposal 8 (PEP 8): Style Guide for Python Code.

The use of Python formatters such as autopep8 was previously discussed in the tutorial on installing VSCode.

String Quotations¶

In Python single and double quotations can be used to enclose the characters in a str instance and are seen as equivalent:

In [35]:

"Hello World!"

Out[35]:

'Hello World!'

In [36]:

'Hello World!'

Out[36]:

'Hello World!'

Notice that the Python interpreter itself prefers single quotations and the value returned to the cell output in each case is the printed formal representation and is enclosed in single quotations.

The ' is a formatting character in a str instance and is used to enclose the characters of the str itself. Consider an attempt to construct a str that contains a str literal:

'greeting = 'hello world!''

Notice that the syntax highlighting above displays:

'greeting = ' as a str
hello as an instance name
world! as an instance name
'' as an empty string

This results in a SyntaxError.

The \ is another formatting character that is used to insert an escape character or escape character sequence. \' will incorporate the single quotation into the str:

In [37]:

'greeting = \'hello world!\''

Out[37]:

"greeting = 'hello world!'"

Notice that the str returned in the cell output is now enclosed in double quotations and is more readable. The main purpose of the double quotations is to make it easier to create a str instance which includes a str literal.
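The same str can be written without escape characters by enclosing it in double quotations. A short check that both spellings produce an equal str instance:

```python
escaped = 'greeting = \'hello world!\''
double_quoted = "greeting = 'hello world!'"

print(escaped == double_quoted)

# The formal representation switches to double quotations when the
# str contains a single quotation and no double quotation
print(repr(escaped))
```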

Triple double quotations are typically used for a multiline string. Double quotations are preferred over single quotations for multiline str instances as they are commonly used as docstrings and a docstring has a high probability of including a str literal. A very basic function can be created which takes in two input str instances and prints them within a formatted str instance:

In [38]:

def fun(string1='hello', string2='world'):
    print(f'{string1} {string2}')

The function can be tested:

In [39]:

fun()

hello world

In [40]:

fun(string1='bye')

bye world

Because it has no docstring, it has no documentation:

In [41]:

fun?

Signature: fun(string1='hello', string2='world')
Docstring: <no docstring>
File:      c:\users\phili\appdata\local\temp\ipykernel_3712\1566935369.py
Type:      function

A docstring is normally added at the start of the function's code block. Although this docstring is only a single line, it is typically input using triple double quotations:

In [42]:

def fun(string1='hello', string2='world'):
    """Prints string1 string2"""
    print(f'{string1} {string2}')

In [43]:

fun?

Signature: fun(string1='hello', string2='world')
Docstring: Prints string1 string2
File:      c:\users\phili\appdata\local\temp\ipykernel_3712\3973104799.py
Type:      function

The triple double quotations allow it to be readily expanded later on with optional str literals:

In [44]:

def fun(string1='hello', string2='world'):
    """Prints string1 string2
    For example fun(string1='hello', string2='world') prints hello world"""
    print(f'{string1} {string2}')

In [45]:

fun?

Signature: fun(string1='hello', string2='world')
Docstring:
Prints string1 string2
For example fun(string1='hello', string2='world') prints hello world
File:      c:\users\phili\appdata\local\temp\ipykernel_3712\1096554159.py
Type:      function
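Outside IPython, where the ? syntax is unavailable, the docstring can be read from the function itself. A minimal sketch using inspect.getdoc from the standard library, which also removes the indentation of the continuation line:

```python
import inspect


def fun(string1='hello', string2='world'):
    """Prints string1 string2
    For example fun(string1='hello', string2='world') prints hello world"""
    print(f'{string1} {string2}')


# The raw docstring is stored on the function as the __doc__ attribute
raw_doc = fun.__doc__

# inspect.getdoc returns the docstring with its indentation cleaned up
doc = inspect.getdoc(fun)
print(doc)
```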

Python Enhancement Proposal 8 (PEP 8): Style Guide for Python Code explicitly makes no recommendation for quotation style:

In Python, single-quoted strings and double-quoted strings are the same. This PEP does not make a recommendation for this. Pick a rule and stick to it.

However, the Python interpreter, Python itself and the Python documentation prefer single quotations over double quotations. Double quotations are used when the str instance contains a str literal. A docstring (which is likely to be updated later to include a str literal) uses triple double quotations. When getting started, it is generally good practice to make your code look as close as possible to the code in the official Python documentation, as these tutorials attempt to do. The popular third-party libraries numpy, matplotlib, scipy and sklearn in the scientific stack are written using a consistent quotation style.

Python has a popular opinionated autoformatter, black, which unfortunately has a preference for double quotations, differing from the style used in Python itself. Moreover, black is used for the development of some popular third-party libraries such as pandas and seaborn, which are also in the scientific stack. The quotation style of the official documentation for libraries in the scientific stack is therefore unfortunately inconsistent. Finally, because of the popularity of pandas in particular, double quotations tend to be more prevalent in data science tutorials.

Identifiers¶

Two str instances can be instantiated:

In [46]:

greeting = 'hello'
farewell = 'bye'

In [47]:

variables()

Out[47]:

	Type	Size/Shape	Value
Instance Name
greeting	str	5	hello
farewell	str	3	bye

The dir function can be used to view a list of identifiers from an instance:

In [48]:

dir(greeting)

Out[48]:

['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getnewargs__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmod__',
 '__rmul__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'capitalize',
 'casefold',
 'center',
 'count',
 'encode',
 'endswith',
 'expandtabs',
 'find',
 'format',
 'format_map',
 'index',
 'isalnum',
 'isalpha',
 'isascii',
 'isdecimal',
 'isdigit',
 'isidentifier',
 'islower',
 'isnumeric',
 'isprintable',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'maketrans',
 'partition',
 'removeprefix',
 'removesuffix',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'splitlines',
 'startswith',
 'strip',
 'swapcase',
 'title',
 'translate',
 'upper',
 'zfill']

These aren't grouped by category. Grouping can be done by using the custom function dir2:

In [49]:

dir2(greeting)

{'method': ['capitalize',
            'casefold',
            'center',
            'count',
            'encode',
            'endswith',
            'expandtabs',
            'find',
            'format',
            'format_map',
            'index',
            'isalnum',
            'isalpha',
            'isascii',
            'isdecimal',
            'isdigit',
            'isidentifier',
            'islower',
            'isnumeric',
            'isprintable',
            'isspace',
            'istitle',
            'isupper',
            'join',
            'ljust',
            'lower',
            'lstrip',
            'maketrans',
            'partition',
            'removeprefix',
            'removesuffix',
            'replace',
            'rfind',
            'rindex',
            'rjust',
            'rpartition',
            'rsplit',
            'rstrip',
            'split',
            'splitlines',
            'startswith',
            'strip',
            'swapcase',
            'title',
            'translate',
            'upper',
            'zfill'],
 'datamodel_attribute': ['__doc__'],
 'datamodel_method': ['__add__',
                      '__class__',
                      '__contains__',
                      '__delattr__',
                      '__dir__',
                      '__eq__',
                      '__format__',
                      '__ge__',
                      '__getattribute__',
                      '__getitem__',
                      '__getnewargs__',
                      '__getstate__',
                      '__gt__',
                      '__hash__',
                      '__init__',
                      '__init_subclass__',
                      '__iter__',
                      '__le__',
                      '__len__',
                      '__lt__',
                      '__mod__',
                      '__mul__',
                      '__ne__',
                      '__new__',
                      '__reduce__',
                      '__reduce_ex__',
                      '__repr__',
                      '__rmod__',
                      '__rmul__',
                      '__setattr__',
                      '__sizeof__',
                      '__str__',
                      '__subclasshook__']}

Notice the same identifier names display when the other instance is examined:

In [50]:

dir2(farewell)

{'method': ['capitalize',
            'casefold',
            'center',
            'count',
            'encode',
            'endswith',
            'expandtabs',
            'find',
            'format',
            'format_map',
            'index',
            'isalnum',
            'isalpha',
            'isascii',
            'isdecimal',
            'isdigit',
            'isidentifier',
            'islower',
            'isnumeric',
            'isprintable',
            'isspace',
            'istitle',
            'isupper',
            'join',
            'ljust',
            'lower',
            'lstrip',
            'maketrans',
            'partition',
            'removeprefix',
            'removesuffix',
            'replace',
            'rfind',
            'rindex',
            'rjust',
            'rpartition',
            'rsplit',
            'rstrip',
            'split',
            'splitlines',
            'startswith',
            'strip',
            'swapcase',
            'title',
            'translate',
            'upper',
            'zfill'],
 'datamodel_attribute': ['__doc__'],
 'datamodel_method': ['__add__',
                      '__class__',
                      '__contains__',
                      '__delattr__',
                      '__dir__',
                      '__eq__',
                      '__format__',
                      '__ge__',
                      '__getattribute__',
                      '__getitem__',
                      '__getnewargs__',
                      '__getstate__',
                      '__gt__',
                      '__hash__',
                      '__init__',
                      '__init_subclass__',
                      '__iter__',
                      '__le__',
                      '__len__',
                      '__lt__',
                      '__mod__',
                      '__mul__',
                      '__ne__',
                      '__new__',
                      '__reduce__',
                      '__reduce_ex__',
                      '__repr__',
                      '__rmod__',
                      '__rmul__',
                      '__setattr__',
                      '__sizeof__',
                      '__str__',
                      '__subclasshook__']}

This is because both greeting and farewell are instances of the str class:

In [51]:

type(greeting)

Out[51]:

str

In [52]:

type(farewell)

Out[52]:

str

And the identifiers are defined in the str class:

In [53]:

dir2(str)

{'method': ['capitalize',
            'casefold',
            'center',
            'count',
            'encode',
            'endswith',
            'expandtabs',
            'find',
            'format',
            'format_map',
            'index',
            'isalnum',
            'isalpha',
            'isascii',
            'isdecimal',
            'isdigit',
            'isidentifier',
            'islower',
            'isnumeric',
            'isprintable',
            'isspace',
            'istitle',
            'isupper',
            'join',
            'ljust',
            'lower',
            'lstrip',
            'maketrans',
            'partition',
            'removeprefix',
            'removesuffix',
            'replace',
            'rfind',
            'rindex',
            'rjust',
            'rpartition',
            'rsplit',
            'rstrip',
            'split',
            'splitlines',
            'startswith',
            'strip',
            'swapcase',
            'title',
            'translate',
            'upper',
            'zfill'],
 'datamodel_attribute': ['__doc__'],
 'datamodel_method': ['__add__',
                      '__class__',
                      '__contains__',
                      '__delattr__',
                      '__dir__',
                      '__eq__',
                      '__format__',
                      '__ge__',
                      '__getattribute__',
                      '__getitem__',
                      '__getnewargs__',
                      '__getstate__',
                      '__gt__',
                      '__hash__',
                      '__init__',
                      '__init_subclass__',
                      '__iter__',
                      '__le__',
                      '__len__',
                      '__lt__',
                      '__mod__',
                      '__mul__',
                      '__ne__',
                      '__new__',
                      '__reduce__',
                      '__reduce_ex__',
                      '__repr__',
                      '__rmod__',
                      '__rmul__',
                      '__setattr__',
                      '__sizeof__',
                      '__str__',
                      '__subclasshook__']}
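That the three listings are the same can be confirmed with the builtin dir function, without relying on the custom dir2 function. A minimal sketch:

```python
greeting = 'hello'
farewell = 'bye'

# Both instances list the same identifiers
same_for_instances = dir(greeting) == dir(farewell)
print(same_for_instances)

# These are also the identifiers listed for the str class itself
same_as_class = set(dir(greeting)) == set(dir(str))
print(same_as_class)

# A method called from an instance is the method defined in the class
print(greeting.upper())
print(str.upper(greeting))
```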

If the class's method resolution order is examined:

In [54]:

str.mro()

Out[54]:

[str, object]

Notice that there is a list instance containing the classes str and object. This means the str instance has all the object based datamodel identifiers:

In [55]:

dir2(str, object, consistent_only=True)

{'datamodel_attribute': ['__doc__'],
 'datamodel_method': ['__class__',
                      '__delattr__',
                      '__dir__',
                      '__eq__',
                      '__format__',
                      '__ge__',
                      '__getattribute__',
                      '__getstate__',
                      '__gt__',
                      '__hash__',
                      '__init__',
                      '__init_subclass__',
                      '__le__',
                      '__lt__',
                      '__ne__',
                      '__new__',
                      '__reduce__',
                      '__reduce_ex__',
                      '__repr__',
                      '__setattr__',
                      '__sizeof__',
                      '__str__',
                      '__subclasshook__']}

Alongside the following additions:

In [56]:

dir2(str, object, unique_only=True)

{'method': ['capitalize',
            'casefold',
            'center',
            'count',
            'encode',
            'endswith',
            'expandtabs',
            'find',
            'format',
            'format_map',
            'index',
            'isalnum',
            'isalpha',
            'isascii',
            'isdecimal',
            'isdigit',
            'isidentifier',
            'islower',
            'isnumeric',
            'isprintable',
            'isspace',
            'istitle',
            'isupper',
            'join',
            'ljust',
            'lower',
            'lstrip',
            'maketrans',
            'partition',
            'removeprefix',
            'removesuffix',
            'replace',
            'rfind',
            'rindex',
            'rjust',
            'rpartition',
            'rsplit',
            'rstrip',
            'split',
            'splitlines',
            'startswith',
            'strip',
            'swapcase',
            'title',
            'translate',
            'upper',
            'zfill'],
 'datamodel_method': ['__add__',
                      '__contains__',
                      '__getitem__',
                      '__getnewargs__',
                      '__iter__',
                      '__len__',
                      '__mod__',
                      '__mul__',
                      '__rmod__',
                      '__rmul__']}
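The two groupings above can be approximated with set operations on the builtin dir function. This is only a sketch of the idea and is not a description of how the custom dir2 function is implemented:

```python
str_identifiers = set(dir(str))
object_identifiers = set(dir(object))

# Identifiers that str shares with object
shared = sorted(str_identifiers & object_identifiers)

# Identifiers that str has and object does not
additions = sorted(str_identifiers - object_identifiers)

print('__repr__' in shared, '__len__' in additions, 'upper' in additions)

# Every identifier of object is also available on str
print(object_identifiers <= str_identifiers)
```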

The method resolution order is an instruction to preferentially use the method defined in the str class and to fall back on the method defined in the object class when it is not defined in the str class. More details about these two classes can be seen using help:

In [57]:

help(str)

Help on class str in module builtins:

class str(object)
 |  str(object='') -> str
 |  str(bytes_or_buffer[, encoding[, errors]]) -> str
 |
 |  Create a new string object from the given object. If encoding or
 |  errors is specified, then the object must expose a data buffer
 |  that will be decoded using the given encoding and error handler.
 |  Otherwise, returns the result of object.__str__() (if defined)
 |  or repr(object).
 |  encoding defaults to sys.getdefaultencoding().
 |  errors defaults to 'strict'.
 |
 |  Methods defined here:
 |
 |  __add__(self, value, /)
 |      Return self+value.
 |
 |  __contains__(self, key, /)
 |      Return bool(key in self).
 |
 |  __eq__(self, value, /)
 |      Return self==value.
 |
 |  __format__(self, format_spec, /)
 |      Return a formatted version of the string as described by format_spec.
 |
 |  __ge__(self, value, /)
 |      Return self>=value.
 |
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |
 |  __getitem__(self, key, /)
 |      Return self[key].
 |
 |  __getnewargs__(...)
 |
 |  __gt__(self, value, /)
 |      Return self>value.
 |
 |  __hash__(self, /)
 |      Return hash(self).
 |
 |  __iter__(self, /)
 |      Implement iter(self).
 |
 |  __le__(self, value, /)
 |      Return self<=value.
 |
 |  __len__(self, /)
 |      Return len(self).
 |
 |  __lt__(self, value, /)
 |      Return self<value.
 |
 |  __mod__(self, value, /)
 |      Return self%value.
 |
 |  __mul__(self, value, /)
 |      Return self*value.
 |
 |  __ne__(self, value, /)
 |      Return self!=value.
 |
 |  __repr__(self, /)
 |      Return repr(self).
 |
 |  __rmod__(self, value, /)
 |      Return value%self.
 |
 |  __rmul__(self, value, /)
 |      Return value*self.
 |
 |  __sizeof__(self, /)
 |      Return the size of the string in memory, in bytes.
 |
 |  __str__(self, /)
 |      Return str(self).
 |
 |  capitalize(self, /)
 |      Return a capitalized version of the string.
 |
 |      More specifically, make the first character have upper case and the rest lower
 |      case.
 |
 |  casefold(self, /)
 |      Return a version of the string suitable for caseless comparisons.
 |
 |  center(self, width, fillchar=' ', /)
 |      Return a centered string of length width.
 |
 |      Padding is done using the specified fill character (default is a space).
 |
 |  count(...)
 |      S.count(sub[, start[, end]]) -> int
 |
 |      Return the number of non-overlapping occurrences of substring sub in
 |      string S[start:end].  Optional arguments start and end are
 |      interpreted as in slice notation.
 |
 |  encode(self, /, encoding='utf-8', errors='strict')
 |      Encode the string using the codec registered for encoding.
 |
 |      encoding
 |        The encoding in which to encode the string.
 |      errors
 |        The error handling scheme to use for encoding errors.
 |        The default is 'strict' meaning that encoding errors raise a
 |        UnicodeEncodeError.  Other possible values are 'ignore', 'replace' and
 |        'xmlcharrefreplace' as well as any other name registered with
 |        codecs.register_error that can handle UnicodeEncodeErrors.
 |
 |  endswith(...)
 |      S.endswith(suffix[, start[, end]]) -> bool
 |
 |      Return True if S ends with the specified suffix, False otherwise.
 |      With optional start, test S beginning at that position.
 |      With optional end, stop comparing S at that position.
 |      suffix can also be a tuple of strings to try.
 |
 |  expandtabs(self, /, tabsize=8)
 |      Return a copy where all tab characters are expanded using spaces.
 |
 |      If tabsize is not given, a tab size of 8 characters is assumed.
 |
 |  find(...)
 |      S.find(sub[, start[, end]]) -> int
 |
 |      Return the lowest index in S where substring sub is found,
 |      such that sub is contained within S[start:end].  Optional
 |      arguments start and end are interpreted as in slice notation.
 |
 |      Return -1 on failure.
 |
 |  format(...)
 |      S.format(*args, **kwargs) -> str
 |
 |      Return a formatted version of S, using substitutions from args and kwargs.
 |      The substitutions are identified by braces ('{' and '}').
 |
 |  format_map(...)
 |      S.format_map(mapping) -> str
 |
 |      Return a formatted version of S, using substitutions from mapping.
 |      The substitutions are identified by braces ('{' and '}').
 |
 |  index(...)
 |      S.index(sub[, start[, end]]) -> int
 |
 |      Return the lowest index in S where substring sub is found,
 |      such that sub is contained within S[start:end].  Optional
 |      arguments start and end are interpreted as in slice notation.
 |
 |      Raises ValueError when the substring is not found.
 |
 |  isalnum(self, /)
 |      Return True if the string is an alpha-numeric string, False otherwise.
 |
 |      A string is alpha-numeric if all characters in the string are alpha-numeric and
 |      there is at least one character in the string.
 |
 |  isalpha(self, /)
 |      Return True if the string is an alphabetic string, False otherwise.
 |
 |      A string is alphabetic if all characters in the string are alphabetic and there
 |      is at least one character in the string.
 |
 |  isascii(self, /)
 |      Return True if all characters in the string are ASCII, False otherwise.
 |
 |      ASCII characters have code points in the range U+0000-U+007F.
 |      Empty string is ASCII too.
 |
 |  isdecimal(self, /)
 |      Return True if the string is a decimal string, False otherwise.
 |
 |      A string is a decimal string if all characters in the string are decimal and
 |      there is at least one character in the string.
 |
 |  isdigit(self, /)
 |      Return True if the string is a digit string, False otherwise.
 |
 |      A string is a digit string if all characters in the string are digits and there
 |      is at least one character in the string.
 |
 |  isidentifier(self, /)
 |      Return True if the string is a valid Python identifier, False otherwise.
 |
 |      Call keyword.iskeyword(s) to test whether string s is a reserved identifier,
 |      such as "def" or "class".
 |
 |  islower(self, /)
 |      Return True if the string is a lowercase string, False otherwise.
 |
 |      A string is lowercase if all cased characters in the string are lowercase and
 |      there is at least one cased character in the string.
 |
 |  isnumeric(self, /)
 |      Return True if the string is a numeric string, False otherwise.
 |
 |      A string is numeric if all characters in the string are numeric and there is at
 |      least one character in the string.
 |
 |  isprintable(self, /)
 |      Return True if the string is printable, False otherwise.
 |
 |      A string is printable if all of its characters are considered printable in
 |      repr() or if it is empty.
 |
 |  isspace(self, /)
 |      Return True if the string is a whitespace string, False otherwise.
 |
 |      A string is whitespace if all characters in the string are whitespace and there
 |      is at least one character in the string.
 |
 |  istitle(self, /)
 |      Return True if the string is a title-cased string, False otherwise.
 |
 |      In a title-cased string, upper- and title-case characters may only
 |      follow uncased characters and lowercase characters only cased ones.
 |
 |  isupper(self, /)
 |      Return True if the string is an uppercase string, False otherwise.
 |
 |      A string is uppercase if all cased characters in the string are uppercase and
 |      there is at least one cased character in the string.
 |
 |  join(self, iterable, /)
 |      Concatenate any number of strings.
 |
 |      The string whose method is called is inserted in between each given string.
 |      The result is returned as a new string.
 |
 |      Example: '.'.join(['ab', 'pq', 'rs']) -> 'ab.pq.rs'
 |
 |  ljust(self, width, fillchar=' ', /)
 |      Return a left-justified string of length width.
 |
 |      Padding is done using the specified fill character (default is a space).
 |
 |  lower(self, /)
 |      Return a copy of the string converted to lowercase.
 |
 |  lstrip(self, chars=None, /)
 |      Return a copy of the string with leading whitespace removed.
 |
 |      If chars is given and not None, remove characters in chars instead.
 |
 |  partition(self, sep, /)
 |      Partition the string into three parts using the given separator.
 |
 |      This will search for the separator in the string.  If the separator is found,
 |      returns a 3-tuple containing the part before the separator, the separator
 |      itself, and the part after it.
 |
 |      If the separator is not found, returns a 3-tuple containing the original string
 |      and two empty strings.
 |
 |  removeprefix(self, prefix, /)
 |      Return a str with the given prefix string removed if present.
 |
 |      If the string starts with the prefix string, return string[len(prefix):].
 |      Otherwise, return a copy of the original string.
 |
 |  removesuffix(self, suffix, /)
 |      Return a str with the given suffix string removed if present.
 |
 |      If the string ends with the suffix string and that suffix is not empty,
 |      return string[:-len(suffix)]. Otherwise, return a copy of the original
 |      string.
 |
 |  replace(self, old, new, count=-1, /)
 |      Return a copy with all occurrences of substring old replaced by new.
 |
 |        count
 |          Maximum number of occurrences to replace.
 |          -1 (the default value) means replace all occurrences.
 |
 |      If the optional argument count is given, only the first count occurrences are
 |      replaced.
 |
 |  rfind(...)
 |      S.rfind(sub[, start[, end]]) -> int
 |
 |      Return the highest index in S where substring sub is found,
 |      such that sub is contained within S[start:end].  Optional
 |      arguments start and end are interpreted as in slice notation.
 |
 |      Return -1 on failure.
 |
 |  rindex(...)
 |      S.rindex(sub[, start[, end]]) -> int
 |
 |      Return the highest index in S where substring sub is found,
 |      such that sub is contained within S[start:end].  Optional
 |      arguments start and end are interpreted as in slice notation.
 |
 |      Raises ValueError when the substring is not found.
 |
 |  rjust(self, width, fillchar=' ', /)
 |      Return a right-justified string of length width.
 |
 |      Padding is done using the specified fill character (default is a space).
 |
 |  rpartition(self, sep, /)
 |      Partition the string into three parts using the given separator.
 |
 |      This will search for the separator in the string, starting at the end. If
 |      the separator is found, returns a 3-tuple containing the part before the
 |      separator, the separator itself, and the part after it.
 |
 |      If the separator is not found, returns a 3-tuple containing two empty strings
 |      and the original string.
 |
 |  rsplit(self, /, sep=None, maxsplit=-1)
 |      Return a list of the substrings in the string, using sep as the separator string.
 |
 |        sep
 |          The separator used to split the string.
 |
 |          When set to None (the default value), will split on any whitespace
 |          character (including \n \r \t \f and spaces) and will discard
 |          empty strings from the result.
 |        maxsplit
 |          Maximum number of splits (starting from the left).
 |          -1 (the default value) means no limit.
 |
 |      Splitting starts at the end of the string and works to the front.
 |
 |  rstrip(self, chars=None, /)
 |      Return a copy of the string with trailing whitespace removed.
 |
 |      If chars is given and not None, remove characters in chars instead.
 |
 |  split(self, /, sep=None, maxsplit=-1)
 |      Return a list of the substrings in the string, using sep as the separator string.
 |
 |        sep
 |          The separator used to split the string.
 |
 |          When set to None (the default value), will split on any whitespace
 |          character (including \n \r \t \f and spaces) and will discard
 |          empty strings from the result.
 |        maxsplit
 |          Maximum number of splits (starting from the left).
 |          -1 (the default value) means no limit.
 |
 |      Note, str.split() is mainly useful for data that has been intentionally
 |      delimited.  With natural text that includes punctuation, consider using
 |      the regular expression module.
 |
 |  splitlines(self, /, keepends=False)
 |      Return a list of the lines in the string, breaking at line boundaries.
 |
 |      Line breaks are not included in the resulting list unless keepends is given and
 |      true.
 |
 |  startswith(...)
 |      S.startswith(prefix[, start[, end]]) -> bool
 |
 |      Return True if S starts with the specified prefix, False otherwise.
 |      With optional start, test S beginning at that position.
 |      With optional end, stop comparing S at that position.
 |      prefix can also be a tuple of strings to try.
 |
 |  strip(self, chars=None, /)
 |      Return a copy of the string with leading and trailing whitespace removed.
 |
 |      If chars is given and not None, remove characters in chars instead.
 |
 |  swapcase(self, /)
 |      Convert uppercase characters to lowercase and lowercase characters to uppercase.
 |
 |  title(self, /)
 |      Return a version of the string where each word is titlecased.
 |
 |      More specifically, words start with uppercased characters and all remaining
 |      cased characters have lower case.
 |
 |  translate(self, table, /)
 |      Replace each character in the string using the given translation table.
 |
 |        table
 |          Translation table, which must be a mapping of Unicode ordinals to
 |          Unicode ordinals, strings, or None.
 |
 |      The table must implement lookup/indexing via __getitem__, for instance a
 |      dictionary or list.  If this operation raises LookupError, the character is
 |      left untouched.  Characters mapped to None are deleted.
 |
 |  upper(self, /)
 |      Return a copy of the string converted to uppercase.
 |
 |  zfill(self, width, /)
 |      Pad a numeric string with zeros on the left, to fill a field of the given width.
 |
 |      The string is never truncated.
 |
 |  ----------------------------------------------------------------------
 |  Static methods defined here:
 |
 |  __new__(*args, **kwargs) from builtins.type
 |      Create and return a new object.  See help(type) for accurate signature.
 |
 |  maketrans(...)
 |      Return a translation table usable for str.translate().
 |
 |      If there is only one argument, it must be a dictionary mapping Unicode
 |      ordinals (integers) or characters to Unicode ordinals, strings or None.
 |      Character keys will be then converted to ordinals.
 |      If there are two arguments, they must be strings of equal length, and
 |      in the resulting dictionary, each character in x will be mapped to the
 |      character at the same position in y. If there is a third argument, it
 |      must be a string, whose characters will be mapped to None in the result.

In [58]:

help(object)

Help on class object in module builtins:

class object
 |  The base class of the class hierarchy.
 |
 |  When called, it accepts no arguments and returns a new featureless
 |  instance that has no instance attributes and cannot be given any.
 |
 |  Built-in subclasses:
 |      anext_awaitable
 |      async_generator
 |      async_generator_asend
 |      async_generator_athrow
 |      ... and 90 other subclasses
 |
 |  Methods defined here:
 |
 |  __delattr__(self, name, /)
 |      Implement delattr(self, name).
 |
 |  __dir__(self, /)
 |      Default dir() implementation.
 |
 |  __eq__(self, value, /)
 |      Return self==value.
 |
 |  __format__(self, format_spec, /)
 |      Default object formatter.
 |
 |      Return str(self) if format_spec is empty. Raise TypeError otherwise.
 |
 |  __ge__(self, value, /)
 |      Return self>=value.
 |
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |
 |  __getstate__(self, /)
 |      Helper for pickle.
 |
 |  __gt__(self, value, /)
 |      Return self>value.
 |
 |  __hash__(self, /)
 |      Return hash(self).
 |
 |  __init__(self, /, *args, **kwargs)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |
 |  __le__(self, value, /)
 |      Return self<=value.
 |
 |  __lt__(self, value, /)
 |      Return self<value.
 |
 |  __ne__(self, value, /)
 |      Return self!=value.
 |
 |  __reduce__(self, /)
 |      Helper for pickle.
 |
 |  __reduce_ex__(self, protocol, /)
 |      Helper for pickle.
 |
 |  __repr__(self, /)
 |      Return repr(self).
 |
 |  __setattr__(self, name, value, /)
 |      Implement setattr(self, name, value).
 |
 |  __sizeof__(self, /)
 |      Size of object in memory, in bytes.
 |
 |  __str__(self, /)
 |      Return str(self).
 |
 |  ----------------------------------------------------------------------
 |  Class methods defined here:
 |
 |  __init_subclass__(...) from builtins.type
 |      This method is called when a class is subclassed.
 |
 |      The default implementation does nothing. It may be
 |      overridden to extend subclasses.
 |
 |  __subclasshook__(...) from builtins.type
 |      Abstract classes can override this to customize issubclass().
 |
 |      This is invoked early on by abc.ABCMeta.__subclasscheck__().
 |      It should return True, False or NotImplemented.  If it returns
 |      NotImplemented, the normal algorithm is used.  Otherwise, it
 |      overrides the normal algorithm (and the outcome is cached).
 |
 |  ----------------------------------------------------------------------
 |  Static methods defined here:
 |
 |  __new__(*args, **kwargs) from builtins.type
 |      Create and return a new object.  See help(type) for accurate signature.
 |
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |
 |  __class__ = <class 'type'>
 |      type(object) -> the object's type
 |      type(name, bases, dict, **kwds) -> a new type

Datamodel Identifiers¶

The str class has the object-based datamodel identifiers. Recall from the previous tutorial that these define the behaviour of the following builtins identifiers:

Datamodel Identifier	Builtins Identifier	Builtins Identifier Type	Description
__new__			constructs the instance self
__init__			initialises an instance with instance data (automatically invoked after __new__ when the class is called)
__doc__	?	operator	view the docstring or initialisation signature docstring if a class
__class__	type	class	display the class type of an instance
__dir__	dir	function	list the directory of identifiers
__repr__	repr	function	formal str representation
__str__	str	class	informal str representation
__hash__	hash	function	hash value if immutable, if mutable __hash__ = None and the hash function cannot be used
__getattribute__	getattr	function	access an attribute (immutable)
__setattr__	setattr	function	set an attribute (mutable)
__delattr__	delattr	function	delete an attribute (mutable)
__eq__	==	operator	check if self is equal to value
__ne__	!=	operator	check if self is not equal to value
__lt__	<	operator	check if self is less than value
__le__	<=	operator	check if self is less than or equal to value
__gt__	>	operator	check if self is greater than value
__ge__	>=	operator	check if self is greater than or equal to value
__sizeof__	sys.getsizeof	function	check the size of the instance in bytes

The identifiers used by the pickle module or for subclassing are not mentioned here and were covered in the previous tutorial on the object class.
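
As a brief sketch of the mapping in the table above, each builtins identifier can be compared against a direct call to its datamodel method. The instance name example is an assumption introduced here purely for illustration:

```python
import sys

example = 'Hello'  # hypothetical instance used only for this sketch

# Each builtin or operator delegates to the corresponding datamodel method
same_repr = repr(example) == example.__repr__()
same_str = str(example) == example.__str__()
same_hash = hash(example) == example.__hash__()
same_eq = (example == 'Hello') == example.__eq__('Hello')
same_lt = (example < 'World') == example.__lt__('World')

# sys.getsizeof returns at least the value reported by __sizeof__
# (it may add garbage collector overhead for some types)
size_ok = sys.getsizeof(example) >= example.__sizeof__()

print(same_repr, same_str, same_hash, same_eq, same_lt, size_ok)
```

In day to day code the builtins identifier is preferred and the datamodel method is rarely called directly.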

These are supplemented by the following datamodel methods:

In [59]:

dir2(str, object, unique_only=True, print_output=False)['datamodel_method']

Out[59]:

['__add__',
 '__contains__',
 '__getitem__',
 '__getnewargs__',
 '__iter__',
 '__len__',
 '__mod__',
 '__mul__',
 '__rmod__',
 '__rmul__']

The str follows the design pattern of an immutable Collection. A Collection has the following datamodel identifiers:

Datamodel Identifier	Builtins Identifier	Builtins Identifier Type	Description
__len__	len	function	the number of Unicode characters in a str
__contains__	in	keyword	check if str contains a substr
__getitem__	[]		uses square brackets to index into a str
__iter__	iter	function	returns a str iterator
__add__	+	operator	concatenates two str instances
__mul__	*	operator	replicates a str by multiplication with an int instance `'hello' * 2`
__rmul__	*	operator	replicates a str by reverse multiplication with an int instance `2 * 'hello'`
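
The Collection identifiers in the table above can be sketched on a short str instance. The instance name word is an assumption used only for this example and each comment names the datamodel method that is expected to be involved:

```python
word = 'hello'  # hypothetical instance used only for this sketch

length = len(word)              # word.__len__()
has_ell = 'ell' in word         # word.__contains__('ell')
first = word[0]                 # word.__getitem__(0)
chars = list(iter(word))        # word.__iter__()
joined = word + ' world'        # word.__add__(' world')
doubled = word * 2              # word.__mul__(2)
doubled_reflected = 2 * word    # word.__rmul__(2)

print(length, has_ell, first, chars, joined, doubled, doubled_reflected)
```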

There are also some str specific additions:

Datamodel Identifier	Builtins Identifier	Builtins Identifier Type	Description
__mod__	%	operator	create a formatted str by inserting values from a tuple into the placeholders of the str `'%s and %s make %s' % (2, 3, 5)`
__rmod__	%	operator	reflected form of __mod__ where self is the value and the other operand is the format str `'World'.__rmod__('Hello %s!')`
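
A minimal sketch of printf-style formatting with the % operator follows. Each placeholder needs a conversion type such as %s, and the instance names template, formatted, single and reflected are assumptions introduced for illustration:

```python
template = '%s and %s make %s'

# template.__mod__((2, 3, 5)) fills each %s placeholder in order
formatted = template % (2, 3, 5)

# A single value does not need to be wrapped in a tuple
single = 'Hello %s!' % 'World'

# Reflected form: self is the value and the argument is the format str
reflected = 'World'.__rmod__('Hello %s!')

print(formatted)
print(single)
print(reflected)
```

A tuple on the left of the % operator is not supported, so the format str is normally written on the left.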

The __getnewargs__ datamodel method is used by the pickle module to serialise the str.

Using ? on the str class shows the docstring of the __init__ signature:

In [60]:

str?

Init signature: str(self, /, *args, **kwargs)
Docstring:     
str(object='') -> str
str(bytes_or_buffer[, encoding[, errors]]) -> str

Create a new string object from the given object. If encoding or
errors is specified, then the object must expose a data buffer
that will be decoded using the given encoding and error handler.
Otherwise, returns the result of object.__str__() (if defined)
or repr(object).
encoding defaults to sys.getdefaultencoding().
errors defaults to 'strict'.
Type:           type
Subclasses:     StrEnum, DeferredConfigString, FoldedCase, _rstr, _ScriptTarget, _ModuleTarget, LSString, include, Keys, InputMode, ...

The datamodel identifier __new__ constructs the instance greeting and invokes the __init__ signature to provide the str with the required instance data:

In [61]:

greeting = 'Hello\tWorld!'

Using ? with a str instance gives the same docstring as the str class but also displays instance-specific details:

In [62]:

greeting?

Type:        str
String form: Hello	World!
Length:      12
Docstring:  
str(object='') -> str
str(bytes_or_buffer[, encoding[, errors]]) -> str

Create a new string object from the given object. If encoding or
errors is specified, then the object must expose a data buffer
that will be decoded using the given encoding and error handler.
Otherwise, returns the result of object.__str__() (if defined)
or repr(object).
encoding defaults to sys.getdefaultencoding().
errors defaults to 'strict'.

One of these details is the type, which can also be checked directly:

In [63]:

type(greeting)

Out[63]:

str

Formal (repr) and Informal (str) str¶

The print function prints out the informal str form:

In [64]:

print(greeting)

Hello	World!

Recall that there is the formal and informal str representation and the difference between these can be seen when an instance is printed (above) and examined in the cell output below:

In [65]:

greeting

Out[65]:

'Hello\tWorld!'

The informal str (__str__ datamodel method) defines the behaviour of the str class. Casting a str instance to a str instance leaves it unchanged:

In [66]:

str(greeting)

Out[66]:

'Hello\tWorld!'

Therefore the two are equivalent:

In [67]:

print(str(greeting))

Hello	World!

In [68]:

print(greeting)

Hello	World!

The formal repr (__repr__ datamodel method) defines the behaviour of the repr function:

In [69]:

repr(greeting)

Out[69]:

"'Hello\\tWorld!'"

Notice that printing this shows the formal str representation, which is the form used to instantiate a new str instance:

In [70]:

print(repr(greeting))

'Hello\tWorld!'
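
The relationship between the two representations can be sketched as a round trip. This example assumes the standard library function ast.literal_eval is an acceptable way to evaluate the formal representation back into a str instance:

```python
import ast

greeting = 'Hello\tWorld!'

informal = str(greeting)    # unchanged, the tab is a single character
formal = repr(greeting)     # adds quotes and escapes the tab as \t

# The formal representation is longer: two quotes plus the extra
# character needed to spell the tab as a backslash followed by t
extra_characters = len(formal) - len(greeting)

# Evaluating the formal representation recreates an equal str instance
round_trip = ast.literal_eval(formal)

print(extra_characters, round_trip == greeting)
```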

Indexing and Slicing (len, contains, getitem)¶

The length function len returns the number of Unicode characters in the str:

In [71]:

len(greeting)

Out[71]:

12

Notice that \t is used to represent a single Unicode character. The custom function view, imported earlier from the custom module categorize_identifiers, can be used to view the str instance in more detail.

Notice that the str uses zero-based indexing where each index is an int. The "first" index, known as the start index, is 0 and the index increases in int steps of 1 up to but excluding the stop index, which is the length of the collection. The last index is therefore 1 less than the length of the str instance.

Notice that the datatype for each character is itself a str and each of these str instances have a length of 1 corresponding to a value that is a single Unicode character:

In [72]:

view(greeting)

Index 	 Type                 	 Size   	 Value                         
0 	 str                  	 1      	 H                              	
1 	 str                  	 1      	 e                              	
2 	 str                  	 1      	 l                              	
3 	 str                  	 1      	 l                              	
4 	 str                  	 1      	 o                              	
5 	 str                  	 1      	 	                              	
6 	 str                  	 1      	 W                              	
7 	 str                  	 1      	 o                              	
8 	 str                  	 1      	 r                              	
9 	 str                  	 1      	 l                              	
10 	 str                  	 1      	 d                              	
11 	 str                  	 1      	 !

Square brackets can be used to select an index:

In [73]:

greeting[0]

Out[73]:

'H'

In [74]:

greeting[len(greeting)-1]

Out[74]:

'!'

In [75]:

greeting[11]

Out[75]:

'!'
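
Because the stop index is excluded, indexing with the length itself is out of range. The following sketch, which introduces the flag out_of_range as an assumption for illustration, shows the IndexError that results:

```python
greeting = 'Hello\tWorld!'

last_index = len(greeting) - 1
last_char = greeting[last_index]

# The length itself is one past the last valid index
try:
    greeting[len(greeting)]
    out_of_range = False
except IndexError:
    out_of_range = True

print(last_index, last_char, out_of_range)
```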

The slice class can be used to select a substr using a slice:

In [76]:

slice?

Init signature: slice(self, /, *args, **kwargs)
Docstring:     
slice(stop)
slice(start, stop[, step])

Create a slice object.  This is used for extended slicing (e.g. a[0:10:2]).
Type:           type
Subclasses:

To select the first word the following slice can be used:

slice(0, 5, 1)

Note that because zero-based indexing is used, the start bound is inclusive and the stop bound is exclusive. A slice is therefore selected up to but excluding the stop bound:

Index	Type	Size	Value
0	str	1	H
1	str	1	e
2	str	1	l
3	str	1	l
4	str	1	o
5

In [77]:

start = 0
stop = 5
step = 1

In [78]:

greeting[slice(start, stop, step)]

Out[78]:

'Hello'

Because the default step is 1:

In [79]:

greeting[slice(start, stop)]

Out[79]:

'Hello'

Because the default start is 0:

In [80]:

greeting[slice(stop)]

Out[80]:

'Hello'

Slicing is usually done shorthand using colons to separate out the start, stop and step values:

In [81]:

greeting[start:stop:step]

Out[81]:

'Hello'

Because the default step is 1, this can be simplified to:

In [82]:

greeting[start:stop:]

Out[82]:

'Hello'

The last colon can also be dropped:

In [83]:

greeting[start:stop]

Out[83]:

'Hello'

Because the default start is 0, this can be simplified to:

In [84]:

greeting[:stop]

Out[84]:

'Hello'

The default stop is the length of the str and therefore the following returns the whole str:

In [85]:

greeting[:]

Out[85]:

'Hello\tWorld!'

Normally numbers are used in the slices directly:

In [86]:

greeting[0:5:1]

Out[86]:

'Hello'

In [87]:

greeting[6:]

Out[87]:

'World!'

The shorthand notation is generally preferred; however, a slice instance is sometimes assigned to a constant to make code more readable:

In [88]:

FIRST_WORD = slice(0, 5, 1)
greeting[FIRST_WORD]

Out[88]:

'Hello'
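
Named slice constants are particularly readable when a str has fixed-width fields. The record layout and the constant names DATE, LEVEL and MESSAGE below are hypothetical and chosen only to sketch the idea:

```python
# Hypothetical fixed-width record used only for this sketch
record = '2024-01-15 INFO  Started'

DATE = slice(0, 10)           # characters 0 to 9
LEVEL = slice(11, 15)         # characters 11 to 14
MESSAGE = slice(17, None)     # character 17 to the end of the str

date = record[DATE]
level = record[LEVEL]
message = record[MESSAGE]

print(date, level, message)
```

A stop of None in the slice class corresponds to leaving the stop blank in the shorthand notation.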

The index before 0 is -1 and is taken to be the last Unicode character in the str. Conceptualise the str wrapping around itself so that a negative index can be assigned to each character in the str until the "first" character is reached, which has a negative index equal to minus the length of the str instance:

In [89]:

view(greeting, neg_index=True)
view(greeting)

Index 	 Type                 	 Size   	 Value                         
-12 	 str                  	 1      	 H                              	
-11 	 str                  	 1      	 e                              	
-10 	 str                  	 1      	 l                              	
-9 	 str                  	 1      	 l                              	
-8 	 str                  	 1      	 o                              	
-7 	 str                  	 1      	 	                              	
-6 	 str                  	 1      	 W                              	
-5 	 str                  	 1      	 o                              	
-4 	 str                  	 1      	 r                              	
-3 	 str                  	 1      	 l                              	
-2 	 str                  	 1      	 d                              	
-1 	 str                  	 1      	 !                              	
Index 	 Type                 	 Size   	 Value                         
0 	 str                  	 1      	 H                              	
1 	 str                  	 1      	 e                              	
2 	 str                  	 1      	 l                              	
3 	 str                  	 1      	 l                              	
4 	 str                  	 1      	 o                              	
5 	 str                  	 1      	 	                              	
6 	 str                  	 1      	 W                              	
7 	 str                  	 1      	 o                              	
8 	 str                  	 1      	 r                              	
9 	 str                  	 1      	 l                              	
10 	 str                  	 1      	 d                              	
11 	 str                  	 1      	 !

A negative step such as -1 can also be used. Notice this reverses the character order in the str instance:

In [90]:

greeting[::-1]

Out[90]:

'!dlroW\tolleH'

The default start is therefore index -1 and the default stop is -len(greeting)-1, because zero-based indexing is still used, which is inclusive of the start bound and exclusive of the stop bound:

In [91]:

start = -1
stop = -len(greeting) - 1
step = -1
greeting[start:stop:step]

Out[91]:

'!dlroW\tolleH'

In [92]:

greeting[-1:-len(greeting)-1:-1]

Out[92]:

'!dlroW\tolleH'
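
The defaults for a slice can be inspected with the slice method indices, which fills in start, stop and step for a collection of a given length. The sketch below is an illustration under the assumption that range is a reasonable way to list the indexes visited:

```python
greeting = 'Hello\tWorld!'
n = len(greeting)

# Each negative index selects the same character as the index plus the length
pairs_match = all(greeting[i - n] == greeting[i] for i in range(n))

# slice.indices fills in the defaults for a str of length n
forward = slice(None, None, 1).indices(n)
backward = slice(None, None, -1).indices(n)

# range(*backward) lists the indexes visited by the reversed slice; the
# stop value here is a plain boundary for range, not a wraparound index
visited = list(range(*backward))
reversed_greeting = ''.join(greeting[i] for i in visited)

print(forward, backward, reversed_greeting == greeting[::-1])
```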

The __contains__ datamodel method defines the behaviour of the in keyword:

In [93]:

greeting.__contains__?

Signature:      greeting.__contains__(key, /)
Call signature: greeting.__contains__(*args, **kwargs)
Type:           method-wrapper
String form:    <method-wrapper '__contains__' of str object at 0x000001E0A1A51B70>
Docstring:      Return bool(key in self).

It can be used to check whether a substr is present within a str:

In [94]:

greeting.__contains__('Hello')

Out[94]:

True

It is more common to use the in keyword to perform this check:

In [95]:

'Hello' in greeting

Out[95]:

True

In [96]:

'hello' in greeting

Out[96]:

False
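
The check is case-sensitive. One possible approach to a case-insensitive check, sketched here as an assumption rather than the only option, is to apply the str method casefold to both instances first:

```python
greeting = 'Hello\tWorld!'

exact = 'hello' in greeting
insensitive = 'hello'.casefold() in greeting.casefold()

# not in gives the opposite result to in
absent = 'Goodbye' not in greeting

print(exact, insensitive, absent)
```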

Iteration (iter) and looping¶

If the str instance letters (plural) is instantiated:

In [97]:

letters = 'Hello World!'

In [98]:

view(letters)

Index 	 Type                 	 Size   	 Value                         
0 	 str                  	 1      	 H                              	
1 	 str                  	 1      	 e                              	
2 	 str                  	 1      	 l                              	
3 	 str                  	 1      	 l                              	
4 	 str                  	 1      	 o                              	
5 	 str                  	 1      	                                	
6 	 str                  	 1      	 W                              	
7 	 str                  	 1      	 o                              	
8 	 str                  	 1      	 r                              	
9 	 str                  	 1      	 l                              	
10 	 str                  	 1      	 d                              	
11 	 str                  	 1      	 !

It can be cast into an iterator using iter:

In [99]:

forward = iter(letters)

forward is a str ASCII iterator that iterates through a str of ASCII characters, displaying a single character at a time:

In [100]:

forward

Out[100]:

<str_ascii_iterator at 0x1e0a1a90f70>

The iterator has a number of datamodel identifiers:

In [101]:

dir2(forward, object, unique_only=True)

{'datamodel_method': ['__iter__',
                      '__length_hint__',
                      '__next__',
                      '__setstate__']}

The most important one is __next__ which controls the behaviour of the builtins function next. next is used to advance to the next value in the iterator. An iterator displays a single value at a time and each previous value is consumed when advanced:

In [102]:

next(forward)

Out[102]:

'H'

In [103]:

next(forward)

Out[103]:

'e'

In [104]:

next(forward)

Out[104]:

'l'

In each case the return value can be assigned to an instance name, here letter (note singular):

In [105]:

letter = next(forward)

In [106]:

letter

Out[106]:

'l'

next can continue to be used on the ASCII iter instance until all the letters are exhausted. In other words next can be called on the ASCII iter instance len(letters) times. Alternatively all of the remaining elements in an iter instance can be consumed by casting using the tuple class:

In [107]:

tuple(forward)

Out[107]:

('o', ' ', 'W', 'o', 'r', 'l', 'd', '!')
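
Once every value has been consumed the iterator is exhausted. The sketch below uses a short hypothetical str so that the exhaustion can be seen quickly. Calling next again raises StopIteration unless a default value is supplied as a second argument:

```python
letters = 'Hi!'  # short hypothetical str used only for this sketch
forward = iter(letters)

consumed = [next(forward), next(forward), next(forward)]

# The iterator is now exhausted
try:
    next(forward)
    exhausted = False
except StopIteration:
    exhausted = True

# A default can be supplied to avoid the exception
fallback = next(forward, None)

# The str itself is untouched, so a new iterator starts from the beginning
again = tuple(iter(letters))

print(consumed, exhausted, fallback, again)
```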

A range instance can be constructed using len(letters). Note the similarities between the range class and the slice class:

In [108]:

range?

Init signature: range(self, /, *args, **kwargs)
Docstring:     
range(stop) -> range object
range(start, stop[, step]) -> range object

Return an object that produces a sequence of integers from start (inclusive)
to stop (exclusive) by step.  range(i, j) produces i, i+1, i+2, ..., j-1.
start defaults to 0, and stop is omitted!  range(4) produces 0, 1, 2, 3.
These are exactly the valid indices for a list of 4 elements.
When step is given, it specifies the increment (or decrement).
Type:           type
Subclasses:

In [109]:

slice?

Init signature: slice(self, /, *args, **kwargs)
Docstring:     
slice(stop)
slice(start, stop[, step])

Create a slice object.  This is used for extended slicing (e.g. a[0:10:2]).
Type:           type
Subclasses:

In [110]:

indexes = range(len(letters))

The range instance is not an iter instance and does not have the identifier __next__ but each index in it can be viewed by casting to a tuple:

In [111]:

dir2(indexes, object, unique_only=True)

{'attribute': ['start', 'step', 'stop'],
 'method': ['count', 'index'],
 'datamodel_method': ['__bool__',
                      '__contains__',
                      '__getitem__',
                      '__iter__',
                      '__len__',
                      '__reversed__']}

In [112]:

tuple(indexes)

Out[112]:

(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
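
The similarity between range and slice can be sketched by selecting every second character in two ways. The instance names from_range and from_slice are assumptions introduced for illustration:

```python
letters = 'Hello World!'

# A range produces the indexes themselves
indexes = range(0, len(letters), 2)
from_range = ''.join(letters[index] for index in indexes)

# A slice with the same start, stop and step selects the same characters
from_slice = letters[0:len(letters):2]

print(tuple(indexes), from_range, from_slice)
```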

A for loop can be constructed from it:

In [113]:

for index in indexes:
    print(index)

0
1
2
3
4
5
6
7
8
9
10
11

Notice the instructions in the for loop body were repeated 12 times and the index printed was updated on each loop iteration.

The str instance letters can be cast into an iter instance and next can be used to advance through the iterator within the for loop:

In [114]:

forward = iter(letters)

for index in indexes:
    print(next(forward))

H
e
l
l
o
 
W
o
r
l
d
!

Creating an iter instance and advancing through all its elements in a for loop is a common task and is simplified using the syntax below:

In [115]:

for letter in letters:
    print(letter)

H
e
l
l
o
 
W
o
r
l
d
!
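
What the for loop does on behalf of the programmer can be approximated with a while loop. This is a conceptual sketch only, not the interpreter's actual implementation, and the list collected is an assumption added to record each loop iteration:

```python
letters = 'Hello'
collected = []

iterator = iter(letters)          # the for loop calls iter once
while True:
    try:
        letter = next(iterator)   # and next once per loop iteration
    except StopIteration:
        break                     # exhaustion ends the loop
    collected.append(letter)      # the loop body

print(collected)
```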

Note sometimes it is useful to have both the index and the letter being looped through, this can be done using the enumerate class:

In [116]:

enumerate?

Init signature: enumerate(iterable, start=0)
Docstring:     
Return an enumerate object.

  iterable
    an object supporting iteration

The enumerate object yields pairs containing a count (from start, which
defaults to zero) and a value yielded by the iterable argument.

enumerate is useful for obtaining an indexed list:
    (0, seq[0]), (1, seq[1]), (2, seq[2]), ...
Type:           type
Subclasses:

In [117]:

enumerated_letters = enumerate(letters)

In [118]:

enumerated_letters

Out[118]:

<enumerate at 0x1e0a1ab9d50>

Note that an enumerate instance is also an iter instance and has the datamodel identifier __next__:

In [119]:

dir2(enumerated_letters, object, unique_only=True)

{'datamodel_method': ['__class_getitem__', '__iter__', '__next__']}

When next is used a tuple is output:

In [120]:

next(enumerated_letters)

Out[120]:

(0, 'H')

This can be unpacked to two variables using an explicit tuple instance:

In [121]:

(index, letter) = next(enumerated_letters)

In [122]:

index

Out[122]:

1

In [123]:

letter

Out[123]:

'e'

However it is more common to use implicit tuple unpacking:

In [124]:

index, letter = next(enumerated_letters)

In [125]:

index

Out[125]:

2

In [126]:

letter

Out[126]:

'l'

A for loop can be constructed with two loop variables using the enumerate instance:

In [127]:

for index, letter in enumerate(letters):
    print(f'{index}: {letter}')

0: H
1: e
2: l
3: l
4: o
5:  
6: W
7: o
8: r
9: l
10: d
11: !
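
The init signature of enumerate also has the parameter start, which changes the first count. A brief sketch using a short hypothetical str:

```python
letters = 'abc'  # short hypothetical str used only for this sketch

# Counting begins at 1 instead of the default 0
numbered = list(enumerate(letters, start=1))

for position, letter in numbered:
    print(f'{position}: {letter}')
```

Note that start only changes the count that is paired with each letter. It does not skip any letters in the str.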

Sometimes this is useful when the index and letter are both required:

In [128]:

for index, letter in enumerate(letters):
    print(index * letter)

e
ll
lll
oooo
     
WWWWWW
ooooooo
rrrrrrrr
lllllllll
dddddddddd
!!!!!!!!!!!

Immutability and hash (hash)¶

The __hash__ datamodel identifier is not equal to None:

In [129]:

str.__hash__ == None

Out[129]:

False

This means the str is immutable. Recall immutable means once an instance is created, it cannot be modified. As a consequence each method has a return value which returns a new instance, normally a new str instance and leaves the original str unmodified:

In [130]:

greeting = 'Hello World!'

In [131]:

greeting[-1:-len(greeting)-1:-1] #return value shown in cell output

Out[131]:

'!dlroW olleH'

In [132]:

greeting # unchanged

Out[132]:

'Hello World!'

As mentioned above reassignment should not be confused with mutability.

In [133]:

greeting = 'Hello World!'

In [134]:

hash(greeting), id(greeting)

Out[134]:

(-7437652338063058407, 2064296737520)

When reassignment is used, the operation on the right is carried out first, in this case the operation enclosed in parentheses. The instance data 'Hello World!' is used. The return value of this operation '!dlroW olleH' is then assigned to the instance name greeting on the left:

In [135]:

greeting = (greeting[-1:-len(greeting)-1:-1])

In [136]:

hash(greeting), id(greeting)

Out[136]:

(-2364074818600270120, 2064296728944)

Therefore the instance name greeting, which can be conceptualised as a label, has been peeled off the old instance and is now affixed to the new instance:

In [137]:

greeting

Out[137]:

'!dlroW olleH'
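
The difference between reassignment and mutation can be sketched by keeping a second label on the old instance. The instance name original and the flag item_assignment_allowed are assumptions introduced for illustration:

```python
greeting = 'Hello World!'
original = greeting           # a second label on the same instance

# Attempting to modify a character in place is not allowed
try:
    greeting[0] = 'J'
    item_assignment_allowed = True
except TypeError:
    item_assignment_allowed = False

# Reassignment moves the label greeting to a new instance
greeting = greeting[::-1]

print(item_assignment_allowed, original, greeting)
```

The old instance is unchanged and still reachable through the label original.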

Because a str is immutable and therefore hashable it can be used as a key in a mapping such as a dict, which recall has the form:

{key: value,
 key: value,
 key: value}

A dict can be conceptualised as a collection of storage locations and an immutable key is used to access each storage location which then gives a reference to an object. The key must be immutable as a key that is modified will no longer fit the lock and therefore cannot be used.
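
The requirement for a hashable key can be sketched by comparing a str key with a list key. The dict instance colors below is a made up example and not part of matplotlib:

```python
colors = {}
colors['red'] = (1, 0, 0)   # a str is hashable, so it is accepted as a key
colors[(1, 0, 0)] = 'red'   # a tuple of hashable items is also accepted

try:
    colors[['r', 'e', 'd']] = (1, 0, 0)  # a list is mutable and therefore unhashable
except TypeError as error:
    unhashable_message = f'{error}'
    print(f'TypeError: {unhashable_message}')

print(list.__hash__ is None)
```

The mutable list has its __hash__ datamodel identifier set to None, which is the opposite of the result seen earlier for the str class.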

Because str instances are immutable they are commonly used as keys. An example is given in the 2 dict instances below:

In [138]:

from matplotlib.colors import BASE_COLORS, CSS4_COLORS

In [139]:

BASE_COLORS

Out[139]:

{'b': (0, 0, 1),
 'g': (0, 0.5, 0),
 'r': (1, 0, 0),
 'c': (0, 0.75, 0.75),
 'm': (0.75, 0, 0.75),
 'y': (0.75, 0.75, 0),
 'k': (0, 0, 0),
 'w': (1, 1, 1)}

In [140]:

CSS4_COLORS

Out[140]:

{'aliceblue': '#F0F8FF',
 'antiquewhite': '#FAEBD7',
 'aqua': '#00FFFF',
 'aquamarine': '#7FFFD4',
 'azure': '#F0FFFF',
 'beige': '#F5F5DC',
 'bisque': '#FFE4C4',
 'black': '#000000',
 'blanchedalmond': '#FFEBCD',
 'blue': '#0000FF',
 'blueviolet': '#8A2BE2',
 'brown': '#A52A2A',
 'burlywood': '#DEB887',
 'cadetblue': '#5F9EA0',
 'chartreuse': '#7FFF00',
 'chocolate': '#D2691E',
 'coral': '#FF7F50',
 'cornflowerblue': '#6495ED',
 'cornsilk': '#FFF8DC',
 'crimson': '#DC143C',
 'cyan': '#00FFFF',
 'darkblue': '#00008B',
 'darkcyan': '#008B8B',
 'darkgoldenrod': '#B8860B',
 'darkgray': '#A9A9A9',
 'darkgreen': '#006400',
 'darkgrey': '#A9A9A9',
 'darkkhaki': '#BDB76B',
 'darkmagenta': '#8B008B',
 'darkolivegreen': '#556B2F',
 'darkorange': '#FF8C00',
 'darkorchid': '#9932CC',
 'darkred': '#8B0000',
 'darksalmon': '#E9967A',
 'darkseagreen': '#8FBC8F',
 'darkslateblue': '#483D8B',
 'darkslategray': '#2F4F4F',
 'darkslategrey': '#2F4F4F',
 'darkturquoise': '#00CED1',
 'darkviolet': '#9400D3',
 'deeppink': '#FF1493',
 'deepskyblue': '#00BFFF',
 'dimgray': '#696969',
 'dimgrey': '#696969',
 'dodgerblue': '#1E90FF',
 'firebrick': '#B22222',
 'floralwhite': '#FFFAF0',
 'forestgreen': '#228B22',
 'fuchsia': '#FF00FF',
 'gainsboro': '#DCDCDC',
 'ghostwhite': '#F8F8FF',
 'gold': '#FFD700',
 'goldenrod': '#DAA520',
 'gray': '#808080',
 'green': '#008000',
 'greenyellow': '#ADFF2F',
 'grey': '#808080',
 'honeydew': '#F0FFF0',
 'hotpink': '#FF69B4',
 'indianred': '#CD5C5C',
 'indigo': '#4B0082',
 'ivory': '#FFFFF0',
 'khaki': '#F0E68C',
 'lavender': '#E6E6FA',
 'lavenderblush': '#FFF0F5',
 'lawngreen': '#7CFC00',
 'lemonchiffon': '#FFFACD',
 'lightblue': '#ADD8E6',
 'lightcoral': '#F08080',
 'lightcyan': '#E0FFFF',
 'lightgoldenrodyellow': '#FAFAD2',
 'lightgray': '#D3D3D3',
 'lightgreen': '#90EE90',
 'lightgrey': '#D3D3D3',
 'lightpink': '#FFB6C1',
 'lightsalmon': '#FFA07A',
 'lightseagreen': '#20B2AA',
 'lightskyblue': '#87CEFA',
 'lightslategray': '#778899',
 'lightslategrey': '#778899',
 'lightsteelblue': '#B0C4DE',
 'lightyellow': '#FFFFE0',
 'lime': '#00FF00',
 'limegreen': '#32CD32',
 'linen': '#FAF0E6',
 'magenta': '#FF00FF',
 'maroon': '#800000',
 'mediumaquamarine': '#66CDAA',
 'mediumblue': '#0000CD',
 'mediumorchid': '#BA55D3',
 'mediumpurple': '#9370DB',
 'mediumseagreen': '#3CB371',
 'mediumslateblue': '#7B68EE',
 'mediumspringgreen': '#00FA9A',
 'mediumturquoise': '#48D1CC',
 'mediumvioletred': '#C71585',
 'midnightblue': '#191970',
 'mintcream': '#F5FFFA',
 'mistyrose': '#FFE4E1',
 'moccasin': '#FFE4B5',
 'navajowhite': '#FFDEAD',
 'navy': '#000080',
 'oldlace': '#FDF5E6',
 'olive': '#808000',
 'olivedrab': '#6B8E23',
 'orange': '#FFA500',
 'orangered': '#FF4500',
 'orchid': '#DA70D6',
 'palegoldenrod': '#EEE8AA',
 'palegreen': '#98FB98',
 'paleturquoise': '#AFEEEE',
 'palevioletred': '#DB7093',
 'papayawhip': '#FFEFD5',
 'peachpuff': '#FFDAB9',
 'peru': '#CD853F',
 'pink': '#FFC0CB',
 'plum': '#DDA0DD',
 'powderblue': '#B0E0E6',
 'purple': '#800080',
 'rebeccapurple': '#663399',
 'red': '#FF0000',
 'rosybrown': '#BC8F8F',
 'royalblue': '#4169E1',
 'saddlebrown': '#8B4513',
 'salmon': '#FA8072',
 'sandybrown': '#F4A460',
 'seagreen': '#2E8B57',
 'seashell': '#FFF5EE',
 'sienna': '#A0522D',
 'silver': '#C0C0C0',
 'skyblue': '#87CEEB',
 'slateblue': '#6A5ACD',
 'slategray': '#708090',
 'slategrey': '#708090',
 'snow': '#FFFAFA',
 'springgreen': '#00FF7F',
 'steelblue': '#4682B4',
 'tan': '#D2B48C',
 'teal': '#008080',
 'thistle': '#D8BFD8',
 'tomato': '#FF6347',
 'turquoise': '#40E0D0',
 'violet': '#EE82EE',
 'wheat': '#F5DEB3',
 'white': '#FFFFFF',
 'whitesmoke': '#F5F5F5',
 'yellow': '#FFFF00',
 'yellowgreen': '#9ACD32'}

Note in each case the key is an easy to remember letter or English word and the value it corresponds to is a harder to remember tuple of the format (r, g, b) or hexadecimal value of the form '#rrggbb'.
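
The relationship between the two value formats can be sketched with a small conversion function. hex_to_rgb is a hypothetical helper written for this example (matplotlib supplies its own conversion functions) and css4_excerpt is a two item excerpt of the dict shown above:

```python
def hex_to_rgb(hex_color):
    """Convert a str of the form '#rrggbb' to a tuple (r, g, b) of floats between 0 and 1."""
    digits = hex_color.removeprefix('#')
    # Each pair of hexadecimal characters is one byte, an integer between 0 and 255
    return tuple(int(digits[i:i + 2], 16) / 255 for i in (0, 2, 4))


css4_excerpt = {'red': '#FF0000', 'white': '#FFFFFF'}

print(hex_to_rgb(css4_excerpt['red']))
print(hex_to_rgb(css4_excerpt['white']))
```

In both cases the easy to remember str key is used to look up the value, which is then converted to the other format.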

Because the name of an identifier can be expressed as a str, the function getattr can be used to access an identifier by supplying its name as a str:

In [141]:

getattr(str, '__len__')

Out[141]:

<slot wrapper '__len__' of 'str' objects>

In [142]:

str.__len__

Out[142]:

<slot wrapper '__len__' of 'str' objects>

The counterparts setattr and delattr, which modify an object, cannot be used because a str is immutable and therefore an attribute cannot be changed or deleted.
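
The behaviour of setattr with a str can be sketched as follows. The attribute name 'language' is made up for this example, and the exact wording of the error messages may differ between Python versions, so only the error class is recorded:

```python
greeting = 'Hello World!'

try:
    setattr(greeting, 'language', 'English')  # equivalent to greeting.language = 'English'
except AttributeError as error:
    instance_error = type(error).__name__

try:
    setattr(str, 'language', 'English')  # the builtin class itself cannot be modified either
except TypeError as error:
    class_error = type(error).__name__

print(instance_error, class_error)
```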

Comparison Operators (__gt__, __ge__, __lt__, __le__, __eq__ and __ne__)¶

Early computers were based on a typewriter that essentially prints English characters onto a sheet of paper. In order to achieve such a task a number of non-printable commands such as the carriage return (moving the carriage back to the left), the line feed (moving the piece of paper up by the height of a line) and the form feed (advancing the paper to the next page) are required, as well as the printable characters such as the English letters, numbers, and whitespace.

Each command has to be mapped physically into the computer's memory. Fundamentally the computer can only store data in the form of a bit, which is essentially a digital switch.

A single switch has the possible values 0 and 1, which is 2 ** 1 combinations, a total of 2. Note the combination 0 is included so 0:2 is inclusive of the lower bound 0 and exclusive of the upper bound 2.

More typically 8 of these switches are combined into a single logical unit called a byte. A byte has 2 ** 8 combinations, a total of 256. Note the combination 0 is included so 0:256 is inclusive of the lower bound 0 and exclusive of the upper bound 256.

One of the most popular sets of commands was developed in America and is known as the American Standard Code for Information Interchange (ASCII). The first 33 combinations (0 to 32) correspond to non-printable characters such as the carriage return and form feed as previously discussed, in addition to a number of additional hardware related commands and the space.

Each bit can be 0 or 1 and the byte sequence corresponds to the physical position of the 8 switches. As binary is not human readable the hexadecimal system is also used, which has 16 characters 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b, c, d, e, f. 2 ** 4 is 16 combinations and therefore each half of the byte is represented by its own hexadecimal character. These numbering systems are shown alongside the number in decimal.

byte	hex	num	command
00000000	00	000	null
00000001	01	001	start of heading
00000010	02	002	start of text
00000011	03	003	end of text
00000100	04	004	end of transmission
00000101	05	005	enquiry
00000110	06	006	acknowledge
00000111	07	007	bell
00001000	08	008	backspace
00001001	09	009	horizontal tab
00001010	0a	010	new line
00001011	0b	011	vertical tab
00001100	0c	012	form feed
00001101	0d	013	carriage return
00001110	0e	014	shift out
00001111	0f	015	shift in
00010000	10	016	data link escape
00010001	11	017	device control 1
00010010	12	018	device control 2
00010011	13	019	device control 3
00010100	14	020	device control 4
00010101	15	021	negative acknowledge
00010110	16	022	synchronous idle
00010111	17	023	end of transmission block
00011000	18	024	cancel
00011001	19	025	end of medium
00011010	1a	026	substitute
00011011	1b	027	escape
00011100	1c	028	file separator
00011101	1d	029	group separator
00011110	1e	030	record separator
00011111	1f	031	unit separator
00100000	20	032	space

The remaining combinations, up to 127 (half of the 256 combinations available in a byte), contain the characters most commonly used in the English language.

byte	hex	num	character
00100001	21	033	!
00100010	22	034	"
00100011	23	035	#
00100100	24	036	$
00100101	25	037	%
00100110	26	038	&
00100111	27	039	'
00101000	28	040	(
00101001	29	041	)
00101010	2a	042	*
00101011	2b	043	+
00101100	2c	044	,
00101101	2d	045	-
00101110	2e	046	.
00101111	2f	047	/
00110000	30	048	0
00110001	31	049	1
00110010	32	050	2
00110011	33	051	3
00110100	34	052	4
00110101	35	053	5
00110110	36	054	6
00110111	37	055	7
00111000	38	056	8
00111001	39	057	9
00111010	3a	058	:
00111011	3b	059	;
00111100	3c	060	<
00111101	3d	061	=
00111110	3e	062	>
00111111	3f	063	?
01000000	40	064	@
01000001	41	065	A
01000010	42	066	B
01000011	43	067	C
01000100	44	068	D
01000101	45	069	E
01000110	46	070	F
01000111	47	071	G
01001000	48	072	H
01001001	49	073	I
01001010	4a	074	J
01001011	4b	075	K
01001100	4c	076	L
01001101	4d	077	M
01001110	4e	078	N
01001111	4f	079	O
01010000	50	080	P
01010001	51	081	Q
01010010	52	082	R
01010011	53	083	S
01010100	54	084	T
01010101	55	085	U
01010110	56	086	V
01010111	57	087	W
01011000	58	088	X
01011001	59	089	Y
01011010	5a	090	Z
01011011	5b	091	[
01011100	5c	092	\
01011101	5d	093	]
01011110	5e	094	^
01011111	5f	095	_
01100000	60	096	`
01100001	61	097	a
01100010	62	098	b
01100011	63	099	c
01100100	64	100	d
01100101	65	101	e
01100110	66	102	f
01100111	67	103	g
01101000	68	104	h
01101001	69	105	i
01101010	6a	106	j
01101011	6b	107	k
01101100	6c	108	l
01101101	6d	109	m
01101110	6e	110	n
01101111	6f	111	o
01110000	70	112	p
01110001	71	113	q
01110010	72	114	r
01110011	73	115	s
01110100	74	116	t
01110101	75	117	u
01110110	76	118	v
01110111	77	119	w
01111000	78	120	x
01111001	79	121	y
01111010	7a	122	z
01111011	7b	123	{
01111100	7c	124	|
01111101	7d	125	}
01111110	7e	126	~
01111111	7f	127	DEL
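
The columns in the tables above can be reproduced for any ASCII character using ord and a formatted str. ascii_row is a hypothetical helper written for this example:

```python
def ascii_row(character):
    """Return the byte, hex, num and character columns for a single ASCII character."""
    number = ord(character)
    # 08b gives 8 binary digits, 02x gives 2 hexadecimal digits and 03d gives 3 decimal digits
    # each is padded with leading zeros
    return f'{number:08b}\t{number:02x}\t{number:03d}\t{character}'


for character in 'Az!':
    print(ascii_row(character))
```

Each row matches the corresponding row in the table, which shows that the three numbering systems are only different ways of displaying the same integer.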

The Unicode str maps each Unicode character to a numeric combination known as a code point. This numeric combination is recognised by a human as a decimal integer but stored on a computer using bits. By default Python encodes a str using a single encoding table, the Unicode Transformation Format 'utf-8'. 'utf-8' uses 8 bits (1 byte) for each ASCII character and 2-4 bytes for each additional character outside the ASCII range.

The function sys.getsizeof, which uses the __sizeof__ datamodel method, returns the number of bytes occupied by the str instance. Note that there is a base memory allocation for a str instance:

In [143]:

import sys
sys.getsizeof('') # 41

Out[143]:

Then memory allocation for each character in the str instances:

In [144]:

sys.getsizeof('a') # 41 + 1

Out[144]:

In [145]:

sys.getsizeof('ab') # 41 + (2 * 1)

Out[145]:

Use of non-English characters requires a higher memory overhead and requires a larger number of bytes per character:

In [146]:

sys.getsizeof('α') # 41 + 17 + (1 * 2)

Out[146]:

In [147]:

sys.getsizeof('αβ') # 41 + 17 + (2 * 2)

Out[147]:

Python also has additional text classes such as the bytes class which can use additional encoding tables, usually from older standards which will be explored in the next notebook.
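
As a brief preview, the number of bytes 'utf-8' uses per character can be sketched with the str method encode, which returns a bytes instance. The four sample characters are chosen only for illustration, and the characters are displayed by their code point to avoid depending on the font of the console:

```python
samples = ['a', 'α', '€', '😀']

for character in samples:
    encoded = character.encode('utf-8')  # a bytes instance
    print(hex(ord(character)), len(encoded), encoded.hex())

byte_lengths = [len(character.encode('utf-8')) for character in samples]
print(byte_lengths)
```

The ASCII character occupies 1 byte and the remaining characters occupy 2, 3 and 4 bytes respectively. Note these are the sizes of the encoded bytes, which are not the same as the in-memory sizes reported by sys.getsizeof.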

Each character is ordinal. The characters 'a' and 'A' are ASCII characters:

In [148]:

ord('a')

Out[148]:

In [149]:

ord('A')

Out[149]:

Because these are ASCII they are stored over a single byte. Recall a single byte has the following number of combinations:

In [150]:

2 ** (1 * 8)

Out[150]:

The character 'α' is non-ASCII and has a value that exceeds this and is therefore stored over multiple bytes:

In [151]:

ord('α')

Out[151]:

In this case, the Greek letter is stored over 2 bytes:

In [152]:

2 ** (2 * 8)

Out[152]:

Because the str instance is ordinal, the six comparison operators can be used to compare the numeric values of str instances:

In [153]:

'a' > 'A'

Out[153]:

True

The above is essentially a comparison between the two ordinal values:

In [154]:

97 > 65

Out[154]:

True

This can be used with longer str instances:

In [155]:

'apples' > 'bananas'

Out[155]:

False

A check is made letter by letter:

In [156]:

'a' > 'b'

Out[156]:

False

If the first letters are equal, the second letters are compared:

In [157]:

'aa' > 'ab'

Out[157]:

False

The 6 comparison operators can be used:

In [158]:

'aa' < 'aa', 'aa' <= 'aa', 'aa' == 'aa', 'aa' >= 'aa', 'aa' > 'aa', 'aa' != 'aa'

Out[158]:

(False, True, True, True, False, False)

In [159]:

'aa' < 'ab', 'aa' <= 'ab', 'aa' == 'ab', 'aa' >= 'ab', 'aa' > 'ab', 'aa' != 'ab'

Out[159]:

(True, True, False, False, False, True)
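
These comparisons are what the builtins function sorted uses to order str instances. The following sketch uses a made up list of fruits; the keyword argument key of sorted and the method str.casefold are standard:

```python
fruits = ['bananas', 'Cherries', 'apples', 'apple']

# The default ordering compares ordinal values, so uppercase 'C' (67) is before lowercase 'a' (97)
default_order = sorted(fruits)

# key=str.casefold compares caseless versions but returns the original str instances
caseless_order = sorted(fruits, key=str.casefold)

print(default_order)
print(caseless_order)
```

Note 'apple' is ordered before 'apples' because all of its letters match and it runs out of letters first.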

Instance Methods¶

If the str instance greeting is instantiated:

In [160]:

greeting = 'Hello World!'

Most of the additional identifiers available to it are instance methods:

In [161]:

dir2(greeting, print_output=False)['method']

Out[161]:

['capitalize',
 'casefold',
 'center',
 'count',
 'encode',
 'endswith',
 'expandtabs',
 'find',
 'format',
 'format_map',
 'index',
 'isalnum',
 'isalpha',
 'isascii',
 'isdecimal',
 'isdigit',
 'isidentifier',
 'islower',
 'isnumeric',
 'isprintable',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'maketrans',
 'partition',
 'removeprefix',
 'removesuffix',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'splitlines',
 'startswith',
 'strip',
 'swapcase',
 'title',
 'translate',
 'upper',
 'zfill']

Recall that the identifiers themselves are defined in the str class:

In [162]:

dir2(str, print_output=False)['method']

Out[162]:

['capitalize',
 'casefold',
 'center',
 'count',
 'encode',
 'endswith',
 'expandtabs',
 'find',
 'format',
 'format_map',
 'index',
 'isalnum',
 'isalpha',
 'isascii',
 'isdecimal',
 'isdigit',
 'isidentifier',
 'islower',
 'isnumeric',
 'isprintable',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'maketrans',
 'partition',
 'removeprefix',
 'removesuffix',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'splitlines',
 'startswith',
 'strip',
 'swapcase',
 'title',
 'translate',
 'upper',
 'zfill']

Instance methods are accessed via an instance and therefore have access to the instance data. The docstring of the capitalize method can be examined from a str instance:

In [163]:

greeting.capitalize?

Signature: greeting.capitalize()
Docstring:
Return a capitalized version of the string.

More specifically, make the first character have upper case and the rest lower
case.
Type:      builtin_function_or_method

Or it can be examined from the class str itself:

In [164]:

str.capitalize?

Signature: str.capitalize(self, /)
Docstring:
Return a capitalized version of the string.

More specifically, make the first character have upper case and the rest lower
case.
Type:      method_descriptor

Note that the identifier name is in American English:

Word	English Dialect
capitalize	American
capitalise	British

When the method capitalize is called from an instance, it has access to the instance data. As a consequence this method requires no additional data to operate, which is why its parentheses are otherwise empty.

greeting.capitalize()

In contrast when the method is called from the class itself, it has no instance data to work from therefore an instance must be provided. In Python self means this instance:

str.capitalize(self, /)

self occurs before an / and therefore must be provided positionally.

As the str is immutable the method has a return value and returns a new str instance that has been capitalised:

Docstring:
Return a capitalized version of the string.

When the method is called from an instance:

In [165]:

greeting.capitalize()

Out[165]:

'Hello world!'

The new capitalised str instance displays in the cell output. This is a new instance and the original instance is unchanged in variables:

In [166]:

variables()

Out[166]:

	Type	Size/Shape	Value
Instance Name
greeting	str	12	Hello World!
farewell	str	3	bye
start	int		-1
stop	int		-13
step	int		-1
letters	str	12	Hello World!
letter	str	1	!
indexes	range	12	range(0, 12)
index	int		11
BASE_COLORS	dict	8	{'b': (0, 0, 1), 'g': (0, 0.5, 0), 'r': (1, 0, 0), 'c': (0, 0.75, 0.75), 'm': (0.75, 0, 0.75), 'y': (0.75, 0.75, 0), 'k': (0, 0, 0), 'w': (1, 1, 1)}
CSS4_COLORS	dict	148	{'aliceblue': '#F0F8FF', 'antiquewhite': '#FAEBD7', 'aqua': '#00FFFF', 'aquamarine': '#7FFFD4', 'azure': '#F0FFFF', 'beige': '#F5F5DC', 'bisque': '#FFE4C4', 'black': '#000000', 'blanchedalmond': '…

Since this new instance is not assigned an instance name it has no references and is automatically removed by Python's garbage collection. It can be assigned to an instance name using:

In [167]:

cap_greeting = greeting.capitalize()

Notice no cell output as the new instance is now assigned to the instance name instead of being shown in the cell output. This can be seen in Variables:

In [168]:

variables()

Out[168]:

	Type	Size/Shape	Value
Instance Name
greeting	str	12	Hello World!
farewell	str	3	bye
start	int		-1
stop	int		-13
step	int		-1
letters	str	12	Hello World!
letter	str	1	!
indexes	range	12	range(0, 12)
index	int		11
BASE_COLORS	dict	8	{'b': (0, 0, 1), 'g': (0, 0.5, 0), 'r': (1, 0, 0), 'c': (0, 0.75, 0.75), 'm': (0.75, 0, 0.75), 'y': (0.75, 0.75, 0), 'k': (0, 0, 0), 'w': (1, 1, 1)}
CSS4_COLORS	dict	148	{'aliceblue': '#F0F8FF', 'antiquewhite': '#FAEBD7', 'aqua': '#00FFFF', 'aquamarine': '#7FFFD4', 'azure': '#F0FFFF', 'beige': '#F5F5DC', 'bisque': '#FFE4C4', 'black': '#000000', 'blanchedalmond': '…
cap_greeting	str	12	Hello world!

If the instance method is invoked from a class, the instance self must be provided positionally as the first input argument:

In [169]:

str.capitalize(farewell)

Out[169]:

'Bye'

Failure to supply an instance will result in a TypeError. This can be seen by inputting the following into the blank code cell below:

str.capitalize()
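
The two ways of calling the method, and the TypeError, can be sketched in a single cell. The error is caught here so the remaining code still runs, and the wording of the message may vary between Python versions:

```python
greeting = 'Hello World!'

# Calling from the instance and calling from the class with the instance give the same result
assert greeting.capitalize() == str.capitalize(greeting)

try:
    str.capitalize()  # no instance is supplied for self
except TypeError as error:
    missing_instance_message = f'{error}'
    print(f'TypeError: {missing_instance_message}')
```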

Case Methods¶

The str case method capitalize has already been examined:

In [170]:

greeting.capitalize?

Signature: greeting.capitalize()
Docstring:
Return a capitalized version of the string.

More specifically, make the first character have upper case and the rest lower
case.
Type:      builtin_function_or_method

In [171]:

greeting.capitalize()

Out[171]:

'Hello world!'

There are associated identifiers such as:

lower
casefold
upper
title
swapcase

The docstrings of these can all be examined:

In [172]:

greeting.lower?

Signature: greeting.lower()
Docstring: Return a copy of the string converted to lowercase.
Type:      builtin_function_or_method

In [173]:

greeting.casefold?

Signature: greeting.casefold()
Docstring: Return a version of the string suitable for caseless comparisons.
Type:      builtin_function_or_method

In [174]:

greeting.upper?

Signature: greeting.upper()
Docstring: Return a copy of the string converted to uppercase.
Type:      builtin_function_or_method

In [175]:

greeting.title?

Signature: greeting.title()
Docstring:
Return a version of the string where each word is titlecased.

More specifically, words start with uppercased characters and all remaining
cased characters have lower case.
Type:      builtin_function_or_method

In [176]:

greeting.swapcase?

Signature: greeting.swapcase()
Docstring: Convert uppercase characters to lowercase and lowercase characters to uppercase.
Type:      builtin_function_or_method

All of these case identifiers only require instance data and return a new str instance:

In [178]:

'hEllo wOrld'.lower()

Out[178]:

'hello world'

In [179]:

'hEllo wOrld'.casefold()

Out[179]:

'hello world'

In [180]:

'hEllo wOrld'.upper()

Out[180]:

'HELLO WORLD'

In [181]:

'hEllo wOrld'.swapcase()

Out[181]:

'HeLLO WoRLD'

In [182]:

'hEllo wOrld'.title()

Out[182]:

'Hello World'
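
One detail of title is that any uncased character, including an apostrophe, is treated as the start of a new word. The sketch below compares it with the function capwords from the string module, which only splits on whitespace; the example phrase is made up:

```python
import string

phrase = "they're here"

titled = phrase.title()             # the apostrophe starts a new word
capped = string.capwords(phrase)    # only whitespace separates words

print(titled)
print(capped)
```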

casefold is similar to lower but has more support for non-English characters, as seen with the additional German characters and the Greek characters where some of the lower case characters have variants:

In [183]:

'ÄäÜüÖöẞß'.lower()

Out[183]:

'ääüüöößß'

In [184]:

'ÄäÜüÖöẞß'.casefold()

Out[184]:

'ääüüöössss'

In [185]:

'ΑαΒβΓγΔδΕεΖζΗηΘθΙιΚκΛλΜμΝνΞξΟοΠπΡρΣσςΤτΥυΦφΧχΨψΩω'.lower()

Out[185]:

'ααββγγδδεεζζηηθθιικκλλμμννξξοοππρρσσςττυυφφχχψψωω'

In [186]:

'ΑαΒβΓγΔδΕεΖζΗηΘθΙιΚκΛλΜμΝνΞξΟοΠπΡρΣσςΤτΥυΦφΧχΨψΩω'.casefold()

Out[186]:

'ααββγγδδεεζζηηθθιικκλλμμννξξοοππρρσσσττυυφφχχψψωω'
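
The practical use of casefold is a caseless comparison. As a small sketch, the two spellings of the German word for street below are only matched when casefold is used; the instance names are made up for this example:

```python
street_a = 'Straße'
street_b = 'STRASSE'

lower_match = street_a.lower() == street_b.lower()           # 'straße' compared to 'strasse'
casefold_match = street_a.casefold() == street_b.casefold()  # 'strasse' compared to 'strasse'

print(lower_match, casefold_match)
```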

Boolean Identifiers¶

A number of identifiers are used to examine a specific property of a str and return a boolean of True if it has that property and False otherwise:

In [187]:

greeting.isupper?

Signature: greeting.isupper()
Docstring:
Return True if the string is an uppercase string, False otherwise.

A string is uppercase if all cased characters in the string are uppercase and
there is at least one cased character in the string.
Type:      builtin_function_or_method

In [188]:

greeting.islower?

Signature: greeting.islower()
Docstring:
Return True if the string is a lowercase string, False otherwise.

A string is lowercase if all cased characters in the string are lowercase and
there is at least one cased character in the string.
Type:      builtin_function_or_method

In [189]:

greeting.istitle?

Signature: greeting.istitle()
Docstring:
Return True if the string is a title-cased string, False otherwise.

In a title-cased string, upper- and title-case characters may only
follow uncased characters and lowercase characters only cased ones.
Type:      builtin_function_or_method

For example:

In [190]:

'HELLO'.isupper()

Out[190]:

True

In [191]:

'Hello'.isupper()

Out[191]:

False

In [192]:

'hello'.islower()

Out[192]:

True

In [193]:

'Hello'.islower()

Out[193]:

False

In [194]:

'Hello'.istitle()

Out[194]:

True
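
The docstrings mention cased characters, which matters when a str contains digits or punctuation. The following sketch collects a few edge cases in a dict, the keys are only descriptions used for this example:

```python
edge_cases = {
    'digits only': '1234'.isupper(),                 # no cased character
    'upper with digits': 'HELLO 123!'.isupper(),     # all cased characters are uppercase
    'empty': ''.islower(),                           # no cased character
    'second word lowercase': 'Hello world'.istitle(),
    'title with digit': 'Hello World 2'.istitle(),
}

for description, result in edge_cases.items():
    print(f'{description}: {result}')
```

Uncased characters are ignored, however at least one cased character has to be present for isupper or islower to return True.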

Valid Identifier Names¶

The str method isidentifier will check to see if the str is valid for an identifier name. This can be useful to check before assignment of an instance to an instance name:

In [195]:

greeting.isidentifier?

Signature: greeting.isidentifier()
Docstring:
Return True if the string is a valid Python identifier, False otherwise.

Call keyword.iskeyword(s) to test whether string s is a reserved identifier,
such as "def" or "class".
Type:      builtin_function_or_method

A lowercase str instance without spaces or special characters can be checked to see if the identifier is an acceptable identifier name:

In [196]:

'hello'.isidentifier()

Out[196]:

True

This means the following is acceptable:

hello = 'some string'

In [197]:

hello = 'some string'

In [198]:

variables()

Out[198]:

	Type	Size/Shape	Value
Instance Name
greeting	str	12	Hello World!
farewell	str	3	bye
start	int		-1
stop	int		-13
step	int		-1
letters	str	12	Hello World!
letter	str	1	!
indexes	range	12	range(0, 12)
index	int		11
BASE_COLORS	dict	8	{'b': (0, 0, 1), 'g': (0, 0.5, 0), 'r': (1, 0, 0), 'c': (0, 0.75, 0.75), 'm': (0.75, 0, 0.75), 'y': (0.75, 0.75, 0), 'k': (0, 0, 0), 'w': (1, 1, 1)}
CSS4_COLORS	dict	148	{'aliceblue': '#F0F8FF', 'antiquewhite': '#FAEBD7', 'aqua': '#00FFFF', 'aquamarine': '#7FFFD4', 'azure': '#F0FFFF', 'beige': '#F5F5DC', 'bisque': '#FFE4C4', 'black': '#000000', 'blanchedalmond': '…
cap_greeting	str	12	Hello world!
hello	str	11	some string

A space is not acceptable and attempted use of such an identifier will give a SyntaxError:

In [199]:

'hello world'.isidentifier()

Out[199]:

False

This means the following is not acceptable:

hello world = 'some string'

because the Python interpreter sees two instance names to the left of the assignment operator.

An underscore is acceptable and identifier names generally use snake_case:

In [200]:

'hello_world'.isidentifier()

Out[200]:

True

This means the following is acceptable:

hello_world = 'some string'

In [201]:

hello_world = 'some string'

In [202]:

variables()

Out[202]:

	Type	Size/Shape	Value
Instance Name
greeting	str	12	Hello World!
farewell	str	3	bye
start	int		-1
stop	int		-13
step	int		-1
letters	str	12	Hello World!
letter	str	1	!
indexes	range	12	range(0, 12)
index	int		11
BASE_COLORS	dict	8	{'b': (0, 0, 1), 'g': (0, 0.5, 0), 'r': (1, 0, 0), 'c': (0, 0.75, 0.75), 'm': (0.75, 0, 0.75), 'y': (0.75, 0.75, 0), 'k': (0, 0, 0), 'w': (1, 1, 1)}
CSS4_COLORS	dict	148	{'aliceblue': '#F0F8FF', 'antiquewhite': '#FAEBD7', 'aqua': '#00FFFF', 'aquamarine': '#7FFFD4', 'azure': '#F0FFFF', 'beige': '#F5F5DC', 'bisque': '#FFE4C4', 'black': '#000000', 'blanchedalmond': '…
cap_greeting	str	12	Hello world!
hello	str	11	some string
hello_world	str	11	some string

Numbers can be included in an identifier name:

In [203]:

'hello_world2'.isidentifier()

Out[203]:

True

This means the following is acceptable:

hello_world2 = 'some string'

In [204]:

hello_world2 = 'some string'

In [205]:

variables()

Out[205]:

	Type	Size/Shape	Value
Instance Name
greeting	str	12	Hello World!
farewell	str	3	bye
start	int		-1
stop	int		-13
step	int		-1
letters	str	12	Hello World!
letter	str	1	!
indexes	range	12	range(0, 12)
index	int		11
BASE_COLORS	dict	8	{'b': (0, 0, 1), 'g': (0, 0.5, 0), 'r': (1, 0, 0), 'c': (0, 0.75, 0.75), 'm': (0.75, 0, 0.75), 'y': (0.75, 0.75, 0), 'k': (0, 0, 0), 'w': (1, 1, 1)}
CSS4_COLORS	dict	148	{'aliceblue': '#F0F8FF', 'antiquewhite': '#FAEBD7', 'aqua': '#00FFFF', 'aquamarine': '#7FFFD4', 'azure': '#F0FFFF', 'beige': '#F5F5DC', 'bisque': '#FFE4C4', 'black': '#000000', 'blanchedalmond': '…
cap_greeting	str	12	Hello world!
hello	str	11	some string
hello_world	str	11	some string
hello_world2	str	11	some string

However an identifier cannot begin with a number and the attempted use of such an identifier will give a SyntaxError:

In [206]:

'2hello_world'.isidentifier()

Out[206]:

False

This means the following is not acceptable:

2hello_world = 'some string'

Python thinks the identifier is a number but this number contains letters which are unrecognised in the context of a numeric decimal system.

Special characters cannot be used as part of an identifier as they are recognised by Python as operators. Including them in an identifier will give a SyntaxError:

In [207]:

'hello-world2'.isidentifier()

Out[207]:

False

This means the following is not acceptable:

hello-world2 = 'some string'

because the Python interpreter is seeing an operation to carry out subtraction.

Upper case identifiers can be used but generally PascalCase is reserved for a class name:

In [208]:

'PascalCase'.isidentifier()

Out[208]:

True

This means the following is acceptable:

PascalCase = 'some string'

However this naming convention is normally reserved for a class.

In [209]:

PascalCase = 'some string'

In [210]:

variables()

Out[210]:

	Type	Size/Shape	Value
Instance Name
greeting	str	12	Hello World!
farewell	str	3	bye
start	int		-1
stop	int		-13
step	int		-1
letters	str	12	Hello World!
letter	str	1	!
indexes	range	12	range(0, 12)
index	int		11
BASE_COLORS	dict	8	{'b': (0, 0, 1), 'g': (0, 0.5, 0), 'r': (1, 0, 0), 'c': (0, 0.75, 0.75), 'm': (0.75, 0, 0.75), 'y': (0.75, 0.75, 0), 'k': (0, 0, 0), 'w': (1, 1, 1)}
CSS4_COLORS	dict	148	{'aliceblue': '#F0F8FF', 'antiquewhite': '#FAEBD7', 'aqua': '#00FFFF', 'aquamarine': '#7FFFD4', 'azure': '#F0FFFF', 'beige': '#F5F5DC', 'bisque': '#FFE4C4', 'black': '#000000', 'blanchedalmond': '…
cap_greeting	str	12	Hello world!
hello	str	11	some string
hello_world	str	11	some string
hello_world2	str	11	some string
PascalCase	str	11	some string

All capitals identifiers can be used but generally ALL_CAPS is reserved for a constant:

In [211]:

'ALL_CAPS'.isidentifier()

Out[211]:

True

This means the following is acceptable:

ALL_CAPS = 'some string'

and the capitalisation states that this instance name is intended to be a constant, that should not be reassigned later on in the code:

In [212]:

ALL_CAPS = 'some string'

In [213]:

variables()

Out[213]:

	Type	Size/Shape	Value
Instance Name
greeting	str	12	Hello World!
farewell	str	3	bye
start	int		-1
stop	int		-13
step	int		-1
letters	str	12	Hello World!
letter	str	1	!
indexes	range	12	range(0, 12)
index	int		11
BASE_COLORS	dict	8	{'b': (0, 0, 1), 'g': (0, 0.5, 0), 'r': (1, 0, 0), 'c': (0, 0.75, 0.75), 'm': (0.75, 0, 0.75), 'y': (0.75, 0.75, 0), 'k': (0, 0, 0), 'w': (1, 1, 1)}
CSS4_COLORS	dict	148	{'aliceblue': '#F0F8FF', 'antiquewhite': '#FAEBD7', 'aqua': '#00FFFF', 'aquamarine': '#7FFFD4', 'azure': '#F0FFFF', 'beige': '#F5F5DC', 'bisque': '#FFE4C4', 'black': '#000000', 'blanchedalmond': '…
cap_greeting	str	12	Hello world!
hello	str	11	some string
hello_world	str	11	some string
hello_world2	str	11	some string
PascalCase	str	11	some string
ALL_CAPS	str	11	some string

An instance name shouldn't match any of the identifiers in __builtins__, otherwise it will override the builtin (until the kernel is restarted), which will lead to confusion when an attempt is made to use the builtin.

One mistake that beginners often make is to reassign the class name to an instance:

In [214]:

str = 'hello'

Then when they attempt to use the str class, the instance is returned instead:

In [215]:

str

Out[215]:

'hello'

To rectify this issue str can be reassigned from the builtins module:

In [216]:

str = __builtins__.str

In [217]:

str

Out[217]:

str
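
An alternative sketch uses the standard module builtins, which can be imported explicitly. This is generally more portable in a script file because __builtins__ is a CPython implementation detail that may be either a module or a dict depending on where the code runs. The instance names shadowed and restored are only used for illustration:

```python
import builtins

str = 'hello'        # the instance name shadows the builtin class
shadowed = str

str = builtins.str   # the name is reassigned to the class from the builtins module
restored = str

print(shadowed)
print(restored)
```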

Another mistake beginners make when working with modules is to give the file they are working on the same name as the module they are trying to learn. When they attempt to import the module they are trying to learn, they accidentally import the file they are working on instead, which typically flags up an error mentioning a circular import.

There are some identifiers which are reserved. These can be seen by importing the keyword module. pprint will also be imported to allow pretty printing of a Collection:

In [218]:

import keyword
import pprint

The list instance kwlist can be examined:

In [219]:

pprint.pprint(keyword.kwlist)

['False',
 'None',
 'True',
 'and',
 'as',
 'assert',
 'async',
 'await',
 'break',
 'class',
 'continue',
 'def',
 'del',
 'elif',
 'else',
 'except',
 'finally',
 'for',
 'from',
 'global',
 'if',
 'import',
 'in',
 'is',
 'lambda',
 'nonlocal',
 'not',
 'or',
 'pass',
 'raise',
 'return',
 'try',
 'while',
 'with',
 'yield']

If a keyword is reassigned a SyntaxError will display:

with = 'hello'

In [ ]:

There is also the soft keyword list softkwlist:

In [220]:

pprint.pprint(keyword.softkwlist)

['_', 'case', 'match', 'type']

case and match were introduced in Python 3.10 (type followed in Python 3.12) and should be regarded as keywords for new code. They are only soft keywords to allow backwards compatibility with code written for older Python versions.

In an interactive session _ by default gives the last output. However _ is also commonly used to indicate skipping of an object, for example during tuple unpacking.
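
The str method isidentifier and the function keyword.iskeyword can be combined to check a proposed instance name before it is used. is_usable_name is a hypothetical helper written for this sketch and the candidate names are made up:

```python
import keyword


def is_usable_name(name):
    """Return True if name is a valid identifier that is not a reserved keyword."""
    return name.isidentifier() and not keyword.iskeyword(name)


for candidate in ['hello_world', '2hello', 'hello-world', 'with', 'match']:
    print(f'{candidate!r}: {is_usable_name(candidate)}')
```

Note that 'with' passes isidentifier and is only rejected by the keyword check, whereas the soft keyword 'match' is accepted by both.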

As each character maps to a numeric code point it is ordinal. The builtins function ord will return the ordinal value of a character as a decimal integer:

In [221]:

ord?

Signature: ord(c, /)
Docstring: Return the Unicode code point for a one-character string.
Type:      builtin_function_or_method

For example the ordinal value of the str instance '3' can be checked, and the builtins function chr carries out the reverse conversion:

In [222]:

ord('3')

Out[222]:

In [223]:

chr(51)

Out[223]:

'3'

Notice the difference in syntax highlighting between the str of the number '3' and the number 51. This number can be converted into a binary string or hex string using the builtins bin and hex functions respectively:

In [224]:

bin?

Signature: bin(number, /)
Docstring:
Return the binary representation of an integer.

>>> bin(2796202)
'0b1010101010101010101010'
Type:      builtin_function_or_method

In [225]:

hex?

Signature: hex(number, /)
Docstring:
Return the hexadecimal representation of an integer.

>>> hex(12648430)
'0xc0ffee'
Type:      builtin_function_or_method

For example:

In [226]:

bin(ord('3'))

Out[226]:

'0b110011'

This can be conceptualised as the following with the leading zeros:

In [227]:

'0b' + bin(ord('3')).removeprefix('0b').zfill(8)

Out[227]:

'0b00110011'

Note the prefix 0b indicates a binary number and that bin does not display the two leading zeros. The hex function gives the hexadecimal representation:

In [228]:

hex(ord('3'))

Out[228]:

'0x33'

Note the prefix 0x indicates a hexadecimal number. For comparison, the decimal number 16 in binary is:

In [229]:

bin(16)

Out[229]:

'0b10000'
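The zero-padded form shown earlier can also be produced with a format specification instead of stripping and refilling the string. This is a small sketch; the variable names are illustrative:

```python
code_point = ord('3')

padded_bits = format(code_point, '08b')   # 8 binary digits, no prefix
prefixed_bits = f'{code_point:#010b}'     # the width of 10 includes the 0b prefix
hex_digits = f'{code_point:02x}'          # 2 hexadecimal digits, no prefix

round_trip = chr(int(padded_bits, 2))     # from binary text back to the character

print(padded_bits, prefixed_bits, hex_digits, round_trip)
```

The # option adds the prefix and the leading 0 before the width requests zero padding.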

The string module¶

The string module contains a number of useful strings which group characters. It can be imported using:

In [230]:

import string

The identifiers can be viewed:

In [231]:

dir2(string, object, unique_only=True)

{'attribute': ['ascii_letters',
               'ascii_lowercase',
               'ascii_uppercase',
               'digits',
               'hexdigits',
               'octdigits',
               'printable',
               'punctuation',
               'whitespace'],
 'method': ['capwords'],
 'upper_class': ['Formatter', 'Template'],
 'datamodel_attribute': ['__all__',
                         '__builtins__',
                         '__cached__',
                         '__file__',
                         '__loader__',
                         '__name__',
                         '__package__',
                         '__spec__'],
 'internal_attribute': ['_re', '_sentinel_dict', '_string'],
 'internal_method': ['_ChainMap']}

Most of the identifiers are attributes and in this case are str instances. ascii_letters is a str instance containing all English letters:

In [232]:

string.ascii_letters

Out[232]:

'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

This can be split into lowercase and uppercase using the str instances ascii_lowercase and ascii_uppercase respectively:

In [233]:

string.ascii_lowercase

Out[233]:

'abcdefghijklmnopqrstuvwxyz'

In [234]:

string.ascii_uppercase

Out[234]:

'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

digits is a str instance that contains the 10 digits used in the decimal system:

In [235]:

string.digits

Out[235]:

'0123456789'

hexdigits is a str instance that contains the characters that can be used for hexadecimal. Note that the lowercase and uppercase forms, for example a and A, represent the same hexadecimal digit:

In [236]:

string.hexdigits

Out[236]:

'0123456789abcdefABCDEF'

printable is a str instance that contains the printable characters:

In [237]:

string.printable

Out[237]:

'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'

punctuation is a str instance that contains all the punctuation characters:

In [238]:

string.punctuation

Out[238]:

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

whitespace is a str instance containing the whitespace characters:

In [239]:

string.whitespace

Out[239]:

' \t\n\r\x0b\x0c'

With the exception of the space, these are shown using escape sequences which will be further explored in a moment.
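These str instances are typically used for membership checks. The following sketch assigns a single character to one of the groupings; the function name classify and its labels are hypothetical and chosen only for illustration:

```python
import string

def classify(char):
    """Return the name of the string module grouping that contains char."""
    if char in string.ascii_letters:
        return 'letter'
    if char in string.digits:
        return 'digit'
    if char in string.punctuation:
        return 'punctuation'
    if char in string.whitespace:
        return 'whitespace'
    return 'other'  # anything outside the ASCII groupings

labels = [classify(char) for char in 'a7!\tα']
print(labels)
```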

Now that the character groupings in the string module have been seen, the boolean identifiers of the str class can be examined. These boolean identifiers all act upon instance data and return a bool. Their docstrings are:

In [240]:

greeting.isprintable?

Signature: greeting.isprintable()
Docstring:
Return True if the string is printable, False otherwise.

A string is printable if all of its characters are considered printable in
repr() or if it is empty.
Type:      builtin_function_or_method

In [241]:

greeting.isascii?

Signature: greeting.isascii()
Docstring:
Return True if all characters in the string are ASCII, False otherwise.

ASCII characters have code points in the range U+0000-U+007F.
Empty string is ASCII too.
Type:      builtin_function_or_method

In [242]:

greeting.isalnum?

Signature: greeting.isalnum()
Docstring:
Return True if the string is an alpha-numeric string, False otherwise.

A string is alpha-numeric if all characters in the string are alpha-numeric and
there is at least one character in the string.
Type:      builtin_function_or_method

In [243]:

greeting.isalpha?

Signature: greeting.isalpha()
Docstring:
Return True if the string is an alphabetic string, False otherwise.

A string is alphabetic if all characters in the string are alphabetic and there
is at least one character in the string.
Type:      builtin_function_or_method

In [244]:

greeting.isspace?

Signature: greeting.isspace()
Docstring:
Return True if the string is a whitespace string, False otherwise.

A string is whitespace if all characters in the string are whitespace and there
is at least one character in the string.
Type:      builtin_function_or_method

In [245]:

greeting.isdecimal?

Signature: greeting.isdecimal()
Docstring:
Return True if the string is a decimal string, False otherwise.

A string is a decimal string if all characters in the string are decimal and
there is at least one character in the string.
Type:      builtin_function_or_method

In [246]:

greeting.isdigit?

Signature: greeting.isdigit()
Docstring:
Return True if the string is a digit string, False otherwise.

A string is a digit string if all characters in the string are digits and there
is at least one character in the string.
Type:      builtin_function_or_method

In [247]:

greeting.isnumeric?

Signature: greeting.isnumeric()
Docstring:
Return True if the string is a numeric string, False otherwise.

A string is numeric if all characters in the string are numeric and there is at
least one character in the string.
Type:      builtin_function_or_method

For example:

In [248]:

'hello Γειά σου 123'.isprintable()

Out[248]:

True

In [249]:

'hello Γειά σου 123'.isascii()

Out[249]:

False

In [250]:

'hello 123 !'.isascii()

Out[250]:

True

In [251]:

'hello 123 !'.isalnum()

Out[251]:

False

In [252]:

'hello123'.isalnum()

Out[252]:

True

In [253]:

'hello123'.isalpha()

Out[253]:

False

In [254]:

'hello'.isalpha()

Out[254]:

True

In [255]:

'hello'.isspace()

Out[255]:

False

The boolean numeric str methods isdecimal, isdigit and isnumeric have subtle differences. These can be seen by examining the response of the methods for each of the following number groupings:

In [256]:

numeric_groups = {'ascii': '0123456789', 
                  'font1': '𝟶𝟷𝟸𝟹𝟺𝟻𝟼𝟽𝟾𝟿', 
                  'font2': '𝟬𝟭𝟮𝟯𝟰𝟱𝟲𝟳𝟴𝟵', 
                  'font3': '𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡', 
                  'subscript': '₀₁₂₃₄₅₆₇₈₉',
                  'superscript': '⁰¹²³⁴⁵⁶⁷⁸⁹',
                  'circled1': '➀➁➂➃➄➅➆➇➈',
                  'circled2': '➉',
                  'fractions': '½⅓¼⅕⅙⅐⅛⅑⅒⅔¾⅖⅗⅘⅚⅜⅝⅞⅟↉', 
                  'asciihex': '0123456789abcdef', }

In [257]:

for group in numeric_groups:
    print(group, numeric_groups[group], numeric_groups[group].isdecimal())

ascii 0123456789 True
font1 𝟶𝟷𝟸𝟹𝟺𝟻𝟼𝟽𝟾𝟿 True
font2 𝟬𝟭𝟮𝟯𝟰𝟱𝟲𝟳𝟴𝟵 True
font3 𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡 True
subscript ₀₁₂₃₄₅₆₇₈₉ False
superscript ⁰¹²³⁴⁵⁶⁷⁸⁹ False
circled1 ➀➁➂➃➄➅➆➇➈ False
circled2 ➉ False
fractions ½⅓¼⅕⅙⅐⅛⅑⅒⅔¾⅖⅗⅘⅚⅜⅝⅞⅟↉ False
asciihex 0123456789abcdef False

In [258]:

for group in numeric_groups:
    print(group, numeric_groups[group], numeric_groups[group].isdigit())

ascii 0123456789 True
font1 𝟶𝟷𝟸𝟹𝟺𝟻𝟼𝟽𝟾𝟿 True
font2 𝟬𝟭𝟮𝟯𝟰𝟱𝟲𝟳𝟴𝟵 True
font3 𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡 True
subscript ₀₁₂₃₄₅₆₇₈₉ True
superscript ⁰¹²³⁴⁵⁶⁷⁸⁹ True
circled1 ➀➁➂➃➄➅➆➇➈ True
circled2 ➉ False
fractions ½⅓¼⅕⅙⅐⅛⅑⅒⅔¾⅖⅗⅘⅚⅜⅝⅞⅟↉ False
asciihex 0123456789abcdef False

In [259]:

for group in numeric_groups:
    print(group, numeric_groups[group], numeric_groups[group].isnumeric())

ascii 0123456789 True
font1 𝟶𝟷𝟸𝟹𝟺𝟻𝟼𝟽𝟾𝟿 True
font2 𝟬𝟭𝟮𝟯𝟰𝟱𝟲𝟳𝟴𝟵 True
font3 𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡 True
subscript ₀₁₂₃₄₅₆₇₈₉ True
superscript ⁰¹²³⁴⁵⁶⁷⁸⁹ True
circled1 ➀➁➂➃➄➅➆➇➈ True
circled2 ➉ True
fractions ½⅓¼⅕⅙⅐⅛⅑⅒⅔¾⅖⅗⅘⅚⅜⅝⅞⅟↉ True
asciihex 0123456789abcdef False

In [260]:

for group in numeric_groups:
    print(group, numeric_groups[group], numeric_groups[group].isalnum())

ascii 0123456789 True
font1 𝟶𝟷𝟸𝟹𝟺𝟻𝟼𝟽𝟾𝟿 True
font2 𝟬𝟭𝟮𝟯𝟰𝟱𝟲𝟳𝟴𝟵 True
font3 𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡 True
subscript ₀₁₂₃₄₅₆₇₈₉ True
superscript ⁰¹²³⁴⁵⁶⁷⁸⁹ True
circled1 ➀➁➂➃➄➅➆➇➈ True
circled2 ➉ True
fractions ½⅓¼⅕⅙⅐⅛⅑⅒⅔¾⅖⅗⅘⅚⅜⅝⅞⅟↉ True
asciihex 0123456789abcdef True

The boolean identifiers are often used for checks, and these checks are used to create conditions and set up loops for example.
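As a sketch of such a check, the hypothetical function parse_whole_number below only converts a str to an int when isdecimal returns True. isdecimal is chosen over isdigit because characters such as a superscript are digits but cannot be converted by int:

```python
def parse_whole_number(text):
    """Return int(text) if text contains only decimal characters, otherwise None."""
    if text.isdecimal():
        return int(text)
    return None

candidates = ('42', '²', '½', 'abc', '')
parsed = [parse_whole_number(text) for text in candidates]
print(parsed)
```

The empty str is rejected because isdecimal requires at least one character.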

Escape Characters¶

The \ is a special symbol used to insert an escape character. The most commonly used escape characters have the form:

In [261]:

print('|  |') # no escape character

|  |

In [262]:

print('| \t |') # the tab

| 	 |

In [263]:

print('| \n |') # the new line

| 
 |

In [264]:

print('| \\ |') # the backslash itself

| \ |

In [265]:

print('| \' |') # the single quotation

| ' |

In [266]:

print('| \" |') # the double quotation

| " |

An ASCII character, or any character whose ordinal value fits within a single byte, can be inserted using the escape sequence \x followed by 2 hexadecimal digits:

In [267]:

hex(ord('!'))

Out[267]:

'0x21'

In [268]:

'\x21' # a byte (2 hexadecimal digits)

Out[268]:

'!'

In [269]:

print('| \x09 |') # the tab as a byte (2 hexadecimal digits)

| 	 |

Note both hexadecimal digits have to be provided, otherwise the byte is incomplete and a SyntaxError is raised.

The most commonly used Unicode characters outside of the ASCII range have ordinal values that fit within 2 bytes and can therefore be inserted using the escape sequence \u followed by 4 hexadecimal digits. For example:

In [270]:

hex(ord('α'))

Out[270]:

'0x3b1'

In [271]:

'\u03b1' # a Unicode character (4 hexadecimal digits, 2 hexadecimal digits × 2 bytes)

Out[271]:

'α'

Note all four hexadecimal digits have to be provided, otherwise the escape sequence is incomplete. The next line of code shows a common problem when attempting to input a Windows path:

'c:\users\philip'

In the above the Python interpreter sees the first \ as an instruction to insert an escape character. u is an instruction to expect a Unicode escape sequence and therefore the Python interpreter attempts to read the next four characters sers as hexadecimal digits. Recall that hexadecimal has the 16 digits 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b, c, d, e, f. Although e is a valid hexadecimal digit, s and r are not, and therefore a SyntaxError is flagged up.

To insert a Windows path \\ should be used to indicate insertion of the escape character \:

'c:\\users\\philip'

Note that the hex form is normally used to represent a byte that is not printable. If the 6 whitespace characters are examined in more detail this can be seen:

In [272]:

string.whitespace

Out[272]:

' \t\n\r\x0b\x0c'

name             character  byte
space            ' '        '\x20'
tab              '\t'       '\x09'
new line         '\n'       '\x0a'
carriage return  '\r'       '\x0d'
vertical tab                '\x0b'
form feed                   '\x0c'

In [273]:

' ' == '\x20'

Out[273]:

True

In [274]:

'\t' == '\x09'

Out[274]:

True

In [275]:

'\n' == '\x0a'

Out[275]:

True

In [276]:

'\r' == '\x0d'

Out[276]:

True

It is not common to do so, however each ASCII character in a string can also be inserted as an escape character:

In [277]:

'\x68\x65\x6c\x6c\x6f\x20\x77\x6f\x72\x6c\x64\x21'

Out[277]:

'hello world!'
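The escape sequences above can be generated from the ordinal value of each character. The helper to_hex_escapes is a hypothetical name used only for this sketch:

```python
def to_hex_escapes(text):
    """Return the two digit hexadecimal escape sequence of each ASCII character."""
    if not text.isascii():
        raise ValueError('only ASCII characters fit into a single byte escape')
    # \\x inserts a literal backslash followed by x, 02x gives 2 hexadecimal digits
    return ''.join(f'\\x{ord(char):02x}' for char in text)

escaped = to_hex_escapes('hello world!')
print(escaped)
```

The returned str contains the backslashes literally, so printing it displays the text that would be typed into a string literal.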

The unicodedata module can be imported:

In [278]:

import unicodedata

Its identifiers can be viewed using:

In [279]:

dir2(unicodedata, object, unique_only=True)

{'attribute': ['ucd_3_2_0', 'unidata_version'],
 'method': ['bidirectional',
            'category',
            'combining',
            'decimal',
            'decomposition',
            'digit',
            'east_asian_width',
            'is_normalized',
            'lookup',
            'mirrored',
            'name',
            'normalize',
            'numeric'],
 'upper_class': ['UCD'],
 'datamodel_attribute': ['__file__',
                         '__loader__',
                         '__name__',
                         '__package__',
                         '__spec__'],
 'internal_attribute': ['_ucnhash_CAPI']}

The Unicode version can be checked using:

In [280]:

unicodedata.unidata_version

Out[280]:

'15.0.0'

And once the version number is known, more details about the supported characters can be examined using the Unicode Documentation.

Any Unicode character can be inserted using the escape sequence \U followed by 8 hexadecimal digits, which span 4 bytes. For example:

In [281]:

'\U0000303a'

Out[281]:

'〺'
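A character can also be referred to by its Unicode name instead of its hexadecimal value. The following sketch uses the name, lookup, category and numeric functions listed above together with the \N{...} named escape sequence; the variable names are illustrative:

```python
import unicodedata

alpha = '\N{GREEK SMALL LETTER ALPHA}'         # named escape sequence
alpha_name = unicodedata.name(alpha)            # character to Unicode name
alpha_lookup = unicodedata.lookup(alpha_name)   # Unicode name back to character
digit_category = unicodedata.category('3')      # Nd means decimal digit
half_value = unicodedata.numeric('½')           # numeric value of a fraction character

print(alpha, alpha_name, digit_category, half_value)
```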

Translation Table¶

A translation table can be created for use with the instance method translate:

In [282]:

greeting.translate?

Signature: greeting.translate(table, /)
Docstring:
Replace each character in the string using the given translation table.

  table
    Translation table, which must be a mapping of Unicode ordinals to
    Unicode ordinals, strings, or None.

The table must implement lookup/indexing via __getitem__, for instance a
dictionary or list.  If this operation raises LookupError, the character is
left untouched.  Characters mapped to None are deleted.
Type:      builtin_function_or_method

maketrans is a static method, which is essentially a function that is bound to neither the instance nor the class. This function merely exists in the namespace of the class as this is the most logical place to find it (conceptualise the class as a Python module):

In [283]:

str.maketrans?

Docstring:
Return a translation table usable for str.translate().

If there is only one argument, it must be a dictionary mapping Unicode
ordinals (integers) or characters to Unicode ordinals, strings or None.
Character keys will be then converted to ordinals.
If there are two arguments, they must be strings of equal length, and
in the resulting dictionary, each character in x will be mapped to the
character at the same position in y. If there is a third argument, it
must be a string, whose characters will be mapped to None in the result.
Type:      builtin_function_or_method

In [284]:

greektolatin = str.maketrans('αβγδε', 'abcde')
greektolatin

Out[284]:

{945: 97, 946: 98, 947: 99, 948: 100, 949: 101}

In [285]:

hex(945)

Out[285]:

'0x3b1'

In [286]:

hex(97)

Out[286]:

'0x61'

This translation table can be used on the example str instance to replace the Greek letters (keys) with the Latin letters (values):

In [287]:

'αββγγγδδδδεεεεε'.translate(greektolatin)

Out[287]:

'abbcccddddeeeee'
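The docstring of maketrans also describes a third argument and a single dict argument. A brief sketch of both forms follows; the example strings are made up for illustration:

```python
import string

# third argument: each of these characters is mapped to None and therefore deleted
remove_punctuation = str.maketrans('', '', string.punctuation)
cleaned = 'Hello, World!'.translate(remove_punctuation)

# single dict argument: keys are characters, values can be longer strings or None
expand = str.maketrans({'&': 'and', '!': None})
expanded = 'fish & chips!'.translate(expand)

print(cleaned)
print(expanded)
```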

File Paths and Raw Strings¶

In a Python string, the \ is a special character that is an instruction to insert an escape character. Unfortunately the \ is also the default directory separator used for a file path in Windows.

To incorporate a \ into a str instance \\ has to be used; the first \ is an instruction to insert an escape character and the second \ states that the escape character to be inserted is the \ itself:

In [288]:

windows_file_path = 'C:\\Users\\Philip'

This problem does not occur on Linux because / is used as a directory separator in a file path:

In [289]:

linux_file_path = '/users/philip'

Windows can also use / as an alternative directory separator however when copying file paths from Windows Explorer for example, the default separator \ will be used.

Compare the cell output, which shows the formal representation of the str instance, with the output of the print function:

In [290]:

windows_file_path

Out[290]:

'C:\\Users\\Philip'

In [291]:

print(windows_file_path)

C:\Users\Philip

In Windows the file path is of the form C:\Users\Philip using the default separator \. A SyntaxError is raised when it is used directly in a str instance, because \U is read as the start of an 8 digit Unicode escape sequence:

windows_file_path = 'C:\Users\Philip'

For the file path to be recognised as a Python string each \ has to be converted into a \\:

windows_file_path = 'C:\\Users\\Philip'

This can be quite cumbersome for long file paths. Python also has a raw string which does not process escape characters and any \ is recognised as being part of the str instance. A raw str has the prefix r or R:

In [292]:

raw_windows_file_path1 = r'C:\Users\Philip'

In [293]:

raw_windows_file_path2 = R'C:\Users\Philip'

Both r and R give the same raw str instance:

In [294]:

raw_windows_file_path1 == raw_windows_file_path2

Out[294]:

True

In [295]:

raw_windows_file_path2

Out[295]:

'C:\\Users\\Philip'

In [296]:

print(raw_windows_file_path2)

C:\Users\Philip

The subtle difference between the two is in the syntax highlighting applied by the editor. Uppercase R shows no formatting around the special characters which is appropriate for the file path. Lowercase r on the other hand shows syntax highlighting following the escape character and is used to construct regular expressions which will be briefly mentioned in a later section.
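A raw str holding a Windows path can be handed to the standard pathlib module, which avoids manual handling of the separator. This is a sketch using PureWindowsPath, which only manipulates the text of the path and so behaves the same on any operating system; the Documents folder is a made-up example:

```python
from pathlib import PureWindowsPath

windows_path = PureWindowsPath(r'C:\Users\Philip')

parts = windows_path.parts                  # the drive and root followed by each folder
forward_slashes = windows_path.as_posix()   # the same path written with /
documents = windows_path / 'Documents'      # the / operator joins path components

print(parts)
print(forward_slashes)
print(documents)
```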

Find and Index¶

Previously indexing using an int or a slice was discussed:

In [297]:

greeting

Out[297]:

'Hello World!'

In [298]:

greeting[0]

Out[298]:

'H'

In [299]:

greeting[:5]

Out[299]:

'Hello'

The str instance methods index and find perform the inverse operation and retrieve the positive index corresponding to the first occurrence of a character or the start of a substring:

In [300]:

greeting.find?

Docstring:
S.find(sub[, start[, end]]) -> int

Return the lowest index in S where substring sub is found,
such that sub is contained within S[start:end].  Optional
arguments start and end are interpreted as in slice notation.

Return -1 on failure.
Type:      builtin_function_or_method

In [301]:

greeting.index?

Docstring:
S.index(sub[, start[, end]]) -> int

Return the lowest index in S where substring sub is found,
such that sub is contained within S[start:end].  Optional
arguments start and end are interpreted as in slice notation.

Raises ValueError when the substring is not found.
Type:      builtin_function_or_method

These two instance methods behave identically upon success:

In [302]:

greeting.find('l')

Out[302]:

2

In [303]:

greeting.index('l')

Out[303]:

2

However they return -1 and raise a ValueError respectively upon failure:

In [304]:

greeting.find('L')

Out[304]:

-1

greeting.index('L')

In [ ]:

These instance methods take start and stop input arguments, consistent with the slice and range classes seen earlier, which can be used to restrict the search range. For example to find the index of all the occurrences of 'l':

In [305]:

greeting.find('l')

Out[305]:

2

In [306]:

greeting.find('l', 2+1)

Out[306]:

3

In [307]:

greeting.find('l', 3+1)

Out[307]:

9

In [308]:

greeting.find('l', 9+1)

Out[308]:

-1
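The repeated calls above follow a pattern that can be written as a loop. The function name find_all is hypothetical and is only a sketch of the approach:

```python
def find_all(text, sub):
    """Return a list of every index at which sub starts in text."""
    indices = []
    start = text.find(sub)
    while start != -1:                     # find returns -1 once nothing is left
        indices.append(start)
        start = text.find(sub, start + 1)  # resume one past the last match
    return indices

greeting = 'Hello World!'
l_indices = find_all(greeting, 'l')
print(l_indices)
```

Because the search resumes one character after each match, overlapping occurrences of a longer substring are also reported.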

A substring can also be searched for, as opposed to a single character. The index returned is that of the start of the substring:

In [309]:

greeting.find('World')

Out[309]:

6

In [310]:

greeting.find('W')

Out[310]:

6

The index and find methods search the str instance for a substring from the left to the right. These are complemented by the reverse find and reverse index, rfind and rindex respectively which search from right to left:

In [311]:

greeting.rfind('l')

Out[311]:

9

In [312]:

greeting.rfind('l', 0, 9)

Out[312]:

3

In [313]:

greeting.rfind('l', 0, 3)

Out[313]:

2

In [314]:

greeting.rfind('l', 0, 2)

Out[314]:

-1

In [315]:

greeting.rfind('l')

Out[315]:

9

The str instance method count returns the number of times a substring str instance is found in the str instance:

In [316]:

greeting.count('l')

Out[316]:

3

The bool based str identifiers startswith and endswith return a bool that indicates whether the str instance starts with a prefix or ends with a suffix. These also have consistent start and stop input arguments which can be used to restrict the search range:

In [317]:

greeting.startswith?

Docstring:
S.startswith(prefix[, start[, end]]) -> bool

Return True if S starts with the specified prefix, False otherwise.
With optional start, test S beginning at that position.
With optional end, stop comparing S at that position.
prefix can also be a tuple of strings to try.
Type:      builtin_function_or_method

In [318]:

greeting.endswith?

Docstring:
S.endswith(suffix[, start[, end]]) -> bool

Return True if S ends with the specified suffix, False otherwise.
With optional start, test S beginning at that position.
With optional end, stop comparing S at that position.
suffix can also be a tuple of strings to try.
Type:      builtin_function_or_method

In [319]:

greeting

Out[319]:

'Hello World!'

In [320]:

greeting.startswith('hello')

Out[320]:

False

In [321]:

greeting.startswith('hello', 1)

Out[321]:

False

In [322]:

greeting.endswith('!')

Out[322]:

True

In [323]:

greeting.endswith('!', 0, 11)

Out[323]:

False
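The docstrings above also mention that the prefix or suffix can be a tuple of strings. A short sketch, using made-up file names, shows this alongside the start argument:

```python
filenames = ['report.csv', 'notes.txt', 'data.TSV', 'image.png']

# a tuple of suffixes matches if any one of them matches
tables = [name for name in filenames if name.lower().endswith(('.csv', '.tsv'))]

# start moves the position at which the prefix is tested
greeting = 'Hello World!'
world_at_6 = greeting.startswith('World', 6)

print(tables, world_at_6)
```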

The str instance method replace can be used to replace an old substring with a new substring. It has an optional argument count which has a default value of -1 and this means it allows for all replacements by default:

In [324]:

greeting.replace?

Signature: greeting.replace(old, new, count=-1, /)
Docstring:
Return a copy with all occurrences of substring old replaced by new.

  count
    Maximum number of occurrences to replace.
    -1 (the default value) means replace all occurrences.

If the optional argument count is given, only the first count occurrences are
replaced.
Type:      builtin_function_or_method

In [325]:

greeting

Out[325]:

'Hello World!'

In [326]:

greeting.replace('hello', 'bye')

Out[326]:

'Hello World!'

In [327]:

greeting.replace('l', 'L')

Out[327]:

'HeLLo WorLd!'

In [328]:

greeting.replace('l', 'L', 1)

Out[328]:

'HeLlo World!'

The re module¶

The regular expressions module is used for advanced pattern searching:

In [329]:

text = 'Email example@example.com, example2@example.com Telephone 0000000000 Website https://www.example.com'

For example a regular expression using r can be created for an email, number and website:

In [330]:

email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
number_pattern = r'\b\d{10}\b'
website_pattern = r'https?://(?:www\.)?[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'

Notice the difference in syntax highlighting when uppercase R is used:

In [331]:

email_pattern = R'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
number_pattern = R'\b\d{10}\b'
website_pattern = R'https?://(?:www\.)?[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'

The regular expression module can be imported:

In [332]:

import re

In [333]:

dir2(re, object, unique_only=True)

{'constant': ['A',
              'ASCII',
              'DEBUG',
              'DOTALL',
              'I',
              'IGNORECASE',
              'L',
              'LOCALE',
              'M',
              'MULTILINE',
              'NOFLAG',
              'S',
              'T',
              'TEMPLATE',
              'U',
              'UNICODE',
              'VERBOSE',
              'X'],
 'module': ['copyreg', 'enum', 'functools'],
 'method': ['compile',
            'escape',
            'findall',
            'finditer',
            'fullmatch',
            'match',
            'purge',
            'search',
            'split',
            'sub',
            'subn',
            'template'],
 'lower_class': ['error'],
 'upper_class': ['Match', 'Pattern', 'RegexFlag', 'Scanner'],
 'datamodel_attribute': ['__all__',
                         '__builtins__',
                         '__cached__',
                         '__file__',
                         '__loader__',
                         '__name__',
                         '__package__',
                         '__path__',
                         '__spec__',
                         '__version__'],
 'internal_attribute': ['_MAXCACHE',
                        '_MAXCACHE2',
                        '_cache',
                        '_cache2',
                        '_casefix',
                        '_compiler',
                        '_constants',
                        '_parser',
                        '_special_chars_map',
                        '_sre'],
 'internal_method': ['_compile', '_compile_template', '_pickle']}

The re.findall function can be used to search for all non-overlapping occurrences of a pattern:

In [334]:

re.findall?

Signature: re.findall(pattern, string, flags=0)
Docstring:
Return a list of all non-overlapping matches in the string.

If one or more capturing groups are present in the pattern, return
a list of groups; this will be a list of tuples if the pattern
has more than one group.

Empty matches are included in the result.
File:      c:\users\phili\anaconda3\envs\vscode-env\lib\re\__init__.py
Type:      function

For example a search for the email_pattern can be made in text:

In [335]:

email_search = re.findall(email_pattern, text)

The results can be seen in the output list instance:

In [336]:

email_search

Out[336]:

['example@example.com', 'example2@example.com']

A search can also be made for the number_pattern and website_pattern:

In [337]:

number_search = re.findall(number_pattern, text)

In [338]:

number_search

Out[338]:

['0000000000']

In [339]:

website_search = re.findall(website_pattern, text)

In [340]:

website_search

Out[340]:

['https://www.example.com']
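findall only returns the matched text. When the position of each match is also needed, the search and finditer functions listed above return Match instances. The sketch below repeats the text so that it is self-contained; number_regex and email_regex are illustrative names for patterns compiled with re.compile:

```python
import re

text = 'Email example@example.com, example2@example.com Telephone 0000000000 Website https://www.example.com'

number_regex = re.compile(r'\b\d{10}\b')    # compile once and reuse
first_number = number_regex.search(text)    # a Match instance, or None if not found
number_text = first_number.group()          # the matched text
number_span = first_number.span()           # the (start, stop) indices of the match

# finditer yields one Match instance per occurrence
email_regex = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')
email_starts = [found.start() for found in email_regex.finditer(text)]

print(number_text, number_span, email_starts)
```

The span can be used directly as a slice of text to recover the match.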

The print function¶

The print function has previously been used with its default named parameters. More details about these can be seen in the docstring:

In [341]:

print?

Signature: print(*args, sep=' ', end='\n', file=None, flush=False)
Docstring:
Prints the values to a stream, or to sys.stdout by default.

sep
  string inserted between values, default a space.
end
  string appended after the last value, default a newline.
file
  a file-like object (stream); defaults to the current sys.stdout.
flush
  whether to forcibly flush the stream.
Type:      builtin_function_or_method

*args indicates that a variable number of positional input arguments are used. sep and end are named input arguments which have a default value of a space and a new line respectively. file and flush are for advanced purposes when the print stream is to be directed for example to a file instead of a cell output:

print(*args, sep=' ', end='\n', file=None, flush=False)

The effect of overriding the default value of sep can be seen:

In [342]:

print('the', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog')

the brown fox jumps over the lazy dog

In [343]:

print('the', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', sep='')

thebrownfoxjumpsoverthelazydog

The effect of overriding the default value of end can be seen:

In [344]:

print('the', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog')
print('the', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog')

the brown fox jumps over the lazy dog
the brown fox jumps over the lazy dog

In [345]:

print('the', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', end='')
print('the', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog')

the brown fox jumps over the lazy dogthe brown fox jumps over the lazy dog
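The file parameter accepts any file-like object. As a minimal sketch, an in-memory io.StringIO stream is used here in place of a real file so that the effect of sep and end on the printed text can be inspected:

```python
import io

buffer = io.StringIO()   # an in-memory text stream

print('the', 'lazy', 'dog', sep='-', end='!\n', file=buffer)
print('second line', file=buffer)

captured = buffer.getvalue()   # everything printed to the stream so far
print(repr(captured))
```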

Formatted Strings¶

Supposing a str body has the form:

In [346]:

body = 'The string to 0 is 1 2!'

And there are three str instances:

In [347]:

var0 = 'print'
var1 = 'hello'
var2 = 'world'

The objective of a formatted string is to insert these instances into the str body so that a formatted str instance of the following form is returned:

In [348]:

'The string to print is hello world!'

Out[348]:

'The string to print is hello world!'

If the docstring of the str method format is examined:

In [349]:

body.format?

Docstring:
S.format(*args, **kwargs) -> str

Return a formatted version of S, using substitutions from args and kwargs.
The substitutions are identified by braces ('{' and '}').
Type:      builtin_function_or_method

Then it can be seen that substitutions are identified by braces so the str body should be modified to have the following form:

In [350]:

body = 'The string to {0} is {1} {2}!'

Notice the syntax highlighting clearly distinguishes these placeholders.

*args represents a variable number of positional input arguments. When inserting instances into the str body, the number of positional input arguments should match the number of placeholders in the str body. Now the format method can be used:

In [351]:

body.format(var0, var1, var2)

Out[351]:

'The string to print is hello world!'

The str instance body can alternatively be setup to contain named variables:

In [352]:

body = 'The string to {var0_} is {var1_} {var2_}!'

**kwargs represents a variable number of named keyword input arguments which should match the named keyword input arguments in the str instance body:

In [353]:

body.format(var0_=var0, var1_=var1, var2_=var2)

Out[353]:

'The string to print is hello world!'

The two lines above can be combined:

In [354]:

'The string to {var0_} is {var1_} {var2_}!'.format(var0_=var0, var1_=var1, var2_=var2)

Out[354]:

'The string to print is hello world!'

It is more common for the placeholders to be given the same name as the instances to be inserted:

In [355]:

'The string to {var0} is {var1} {var2}!'.format(var0=var0, var1=var1, var2=var2)

Out[355]:

'The string to print is hello world!'

Notice in the above that each instance name is used 3 times which is pretty cumbersome. A shorthand way of writing the expression above is to use the prefix f or F which means formatted string:

In [356]:

f'The string to {var0} is {var1} {var2}!'

Out[356]:

'The string to print is hello world!'

In [357]:

F'The string to {var0} is {var1} {var2}!'

Out[357]:

'The string to print is hello world!'

There is no difference for uppercase and lowercase in formatted str instances and the syntax highlighting is the same in either case.

If the object datamodel method __format__ is examined:

In [358]:

object.__format__?

Signature: object.__format__(self, format_spec, /)
Docstring:
Default object formatter.

Return str(self) if format_spec is empty. Raise TypeError otherwise.
Type:      method_descriptor

Notice there is a format specification format_spec:

In [359]:

greeting

Out[359]:

'Hello World!'

The format specification for a str instance has the form:

'0ns'

where n is an integer, s means str and 0 is used to fill in blank spaces.

In [360]:

greeting.__format__('s')

Out[360]:

'Hello World!'

In [361]:

greeting.__format__('22s')

Out[361]:

'Hello World!          '

In [362]:

greeting.__format__('022s')

Out[362]:

'Hello World!0000000000'

The formatter specifier options differ for each datatype. Normally a colon is used to include the format specifier beside the variable in the formatted str:

In [363]:

f'The string to {var0:s} is {var1} {var2}!'

Out[363]:

'The string to print is hello world!'

The str format specifier can specify an integer number of characters:

In [364]:

f'The string to {var0:10s} is {var1} {var2}!'

Out[364]:

'The string to print      is hello world!'

If prefixed with 0 then trailing spaces will be displayed using 0:

In [365]:

f'The string to {var0:010s} is {var1:s} {var2:s}!'

Out[365]:

'The string to print00000 is hello world!'
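The format specification also accepts an alignment option and a fill character placed before the width. The sketch below shows the three alignment options <, > and ^; the variable names are illustrative:

```python
greeting = 'Hello World!'

left = f'{greeting:<22}'       # left aligned, the default for a str
right = f'{greeting:>22}'      # right aligned
centred = f'{greeting:*^22}'   # centred, with * as the fill character

print(repr(left))
print(repr(right))
print(repr(centred))
```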

In the above str instances were inserted into a str instance body. It is more common to insert numeric variables into the str instance body:

In [366]:

num1 = 1
num2 = 0.0000123456789
num3 = 12.3456789

In [367]:

f'The numbers are {num1}, {num2} and {num3}.'

Out[367]:

'The numbers are 1, 1.23456789e-05 and 12.3456789.'

The format specifier for an integer decimal (d) can be used:

In [368]:

f'The numbers are {num1:d}, {num2} and {num3}.'

Out[368]:

'The numbers are 1, 1.23456789e-05 and 12.3456789.'

In [369]:

f'The numbers are {num1:5d}, {num2} and {num3}.'

Out[369]:

'The numbers are     1, 1.23456789e-05 and 12.3456789.'

In [370]:

f'The numbers are {num1:05d}, {num2} and {num3}.'

Out[370]:

'The numbers are 00001, 1.23456789e-05 and 12.3456789.'

In [371]:

f'The numbers are {num1: 05d}, {num2} and {num3}.'

Out[371]:

'The numbers are  0001, 1.23456789e-05 and 12.3456789.'

Again the number of characters that the number should occupy in the string can be specified. Unlike the str format specifier, the padding is leading as opposed to trailing. If prefixed with a 0, the padding is shown as 0.

Notice in the last example that one of the five characters is a space, because the space is part of the format specifier and reserves a position for the sign of the number.

The general format specifier g can be used for a float instance:

In [372]:

f'The numbers are {num1}, {num2:g} and {num3:g}.'

Out[372]:

'The numbers are 1, 1.23457e-05 and 12.3457.'

The e can be used for float exponential format:

In [373]:

f'The numbers are {num1}, {num2:e} and {num3:e}.'

Out[373]:

'The numbers are 1, 1.234568e-05 and 1.234568e+01.'

The number of places after the decimal point can be specified:

In [374]:

f'The numbers are {num1}, {num2:0.3e} and {num3:0.3e}.'

Out[374]:

'The numbers are 1, 1.235e-05 and 1.235e+01.'

A fixed format can also be used:

In [375]:

f'The numbers are {num1}, {num2:f} and {num3:f}.'

Out[375]:

'The numbers are 1, 0.000012 and 12.345679.'

Once again the number of digits after the decimal point can be specified:

In [376]:

f'The numbers are {num1}, {num2:0.3f} and {num3:0.3f}.'

Out[376]:

'The numbers are 1, 0.000 and 12.346.'

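float instances can use the general (g), exponential (e) and fixed (f) format specifiers. For the e and f format specifiers, the precision .3 rounds to 3 digits past the decimal point; for the g format specifier it rounds to 3 significant digits. The leading 0 in 0.3 can be omitted.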
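The following sketch compares the same precision across the three float format specifiers, reusing the value of num3 from above:

```python
num3 = 12.3456789

fixed = f'{num3:.3f}'        # 3 digits after the decimal point
exponential = f'{num3:.3e}'  # 3 digits after the decimal point of the mantissa
general = f'{num3:.3g}'      # 3 significant digits in total

print(fixed, exponential, general)  # 12.346 1.235e+01 12.3
```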

If the keys in a dict instance match the instance names in the str body:

In [377]:

numbers = {'num1': 1, 'num2': 0.0000123456789, 'num3': 12.3456789}

In [378]:

body = 'The numbers are {num1:d}, {num2:.3e} and {num3:.3e}.'

The format_map method can be used with the mapping to insert the instances:

In [379]:

body.format_map?

Docstring:
S.format_map(mapping) -> str

Return a formatted version of S, using substitutions from mapping.
The substitutions are identified by braces ('{' and '}').
Type:      builtin_function_or_method

In [380]:

body.format_map(numbers)

Out[380]:

'The numbers are 1, 1.235e-05 and 1.235e+01.'
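As a hedged aside, the same result can usually be obtained with the str method format by unpacking the dict into keyword arguments. The sketch below repeats the instances from the cells above; the names from_mapping and from_keywords are illustrative only:

```python
numbers = {'num1': 1, 'num2': 0.0000123456789, 'num3': 12.3456789}
body = 'The numbers are {num1:d}, {num2:.3e} and {num3:.3e}.'

from_mapping = body.format_map(numbers)
from_keywords = body.format(**numbers)  # the dict is unpacked into keyword arguments

print(from_mapping == from_keywords)  # True
```

format_map uses the mapping directly without copying it into keyword arguments, which matters mainly for dict subclasses.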

Notice that the syntax for a format specifier {variable:format_spec} is similar to the form of a Python dict instance {key:value}. However spacing to the right of the colon is often present in a dictionary {key: value} and does not change the value. If a space is added to the formatting specifier, it is incorporated into the formatting specifier.

The older style of formatted str instance uses the datamodel identifier __mod__ (dunder mod), which controls the behaviour of the % operator. Older style string formatting also uses % as the placeholder, as opposed to the braces {}:

In [381]:

body = 'The numbers are %d, %0.3f and %0.3g.' 
nums = (1, 0.0000123456789, 12.3456789)

In [382]:

body.__mod__?

Signature:      body.__mod__(value, /)
Call signature: body.__mod__(*args, **kwargs)
Type:           method-wrapper
String form:    <method-wrapper '__mod__' of str object at 0x000001E0A227B960>
Docstring:      Return self%value.

In [383]:

body % nums

Out[383]:

'The numbers are 1, 0.000 and 12.3.'
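The older style also accepts a mapping on the right hand side of the % operator, in which case each placeholder names its key in parentheses. A minimal sketch, reusing the numbers dict from earlier:

```python
numbers = {'num1': 1, 'num2': 0.0000123456789, 'num3': 12.3456789}

# %(key)format_spec looks up key in the mapping on the right of %
body = 'The numbers are %(num1)d, %(num2)0.3e and %(num3)0.3e.'
result = body % numbers

print(result)  # The numbers are 1, 1.235e-05 and 1.235e+01.
```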

Multiline Strings¶

A str instance can be written over multiple lines using triple double quotations:

In [384]:

multiline = """the quick brown fox jumps over the lazy dog
the quick brown fox jumps over the lazy dog
the quick brown fox jumps over the lazy dog
the quick brown fox jumps over the lazy dog"""

In [385]:

multiline

Out[385]:

'the quick brown fox jumps over the lazy dog\nthe quick brown fox jumps over the lazy dog\nthe quick brown fox jumps over the lazy dog\nthe quick brown fox jumps over the lazy dog'

In [386]:

print(multiline)

the quick brown fox jumps over the lazy dog
the quick brown fox jumps over the lazy dog
the quick brown fox jumps over the lazy dog
the quick brown fox jumps over the lazy dog

Note that any spacing added will be incorporated into the multiline str instance:

In [387]:

multiline = """
            the quick brown fox jumps over the lazy dog
            the quick brown fox jumps over the lazy dog
            the quick brown fox jumps over the lazy dog
            the quick brown fox jumps over the lazy dog
            """

In [388]:

multiline

Out[388]:

'\n            the quick brown fox jumps over the lazy dog\n            the quick brown fox jumps over the lazy dog\n            the quick brown fox jumps over the lazy dog\n            the quick brown fox jumps over the lazy dog\n            '

In [389]:

print(multiline)

            the quick brown fox jumps over the lazy dog
            the quick brown fox jumps over the lazy dog
            the quick brown fox jumps over the lazy dog
            the quick brown fox jumps over the lazy dog
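If the indentation is only there to keep the source code tidy, it can be removed afterwards. The sketch below uses dedent from the standard textwrap module, which is not otherwise covered in this notebook, followed by the str method strip:

```python
import textwrap

multiline = """
            the quick brown fox jumps over the lazy dog
            the quick brown fox jumps over the lazy dog
            """

# dedent removes the whitespace common to the start of every line
# strip then removes the leading and trailing newline
dedented = textwrap.dedent(multiline).strip()

print(dedented)
```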

Triple double quotations are preferred over triple single quotations because multiline str instances are commonly used for docstrings. A docstring is often written briefly during development and expanded later, and the expanded text frequently mentions str literals enclosed in single quotations:

In [390]:

print?

Signature: print(*args, sep=' ', end='\n', file=None, flush=False)
Docstring:
Prints the values to a stream, or to sys.stdout by default.

sep
  string inserted between values, default a space.
end
  string appended after the last value, default a newline.
file
  a file-like object (stream); defaults to the current sys.stdout.
flush
  whether to forcibly flush the stream.
Type:      builtin_function_or_method

In [391]:

doc = """Prints the values

sep
  string inserted between values, default a space ' '.
end
  string appended after the last value, default a newline '\\n'."""

In [392]:

print(doc)

Prints the values

sep
  string inserted between values, default a space ' '.
end
  string appended after the last value, default a newline '\n'.

Center and Justify¶

A str instance can be centered and justified using the str methods center, ljust and rjust:

In [393]:

greeting.center?

Signature: greeting.center(width, fillchar=' ', /)
Docstring:
Return a centered string of length width.

Padding is done using the specified fill character (default is a space).
Type:      builtin_function_or_method

In [394]:

greeting.ljust?

Signature: greeting.ljust(width, fillchar=' ', /)
Docstring:
Return a left-justified string of length width.

Padding is done using the specified fill character (default is a space).
Type:      builtin_function_or_method

In [395]:

greeting.rjust?

Signature: greeting.rjust(width, fillchar=' ', /)
Docstring:
Return a right-justified string of length width.

Padding is done using the specified fill character (default is a space).
Type:      builtin_function_or_method

In [396]:

len(greeting)

Out[396]:

12

In [397]:

greeting.center(20)

Out[397]:

'    Hello World!    '

In [398]:

greeting.center(20, 'X')

Out[398]:

'XXXXHello World!XXXX'

In [399]:

greeting.ljust(20, 'X')

Out[399]:

'Hello World!XXXXXXXX'

In [400]:

greeting.rjust(20, 'X')

Out[400]:

'XXXXXXXXHello World!'
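The same padding can be expressed in a formatted string using the alignment characters ^, < and > preceded by a fill character. A minimal sketch, assuming greeting is the 12 character instance used above and that the padding divides evenly between both sides:

```python
greeting = 'Hello World!'  # assumed value, 12 characters

centered = f'{greeting:X^20}'  # fill character X, centre aligned, width 20
left = f'{greeting:X<20}'      # left aligned
right = f'{greeting:X>20}'     # right aligned

print(centered == greeting.center(20, 'X'))  # True
print(left == greeting.ljust(20, 'X'))       # True
print(right == greeting.rjust(20, 'X'))      # True
```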

The opposite operation can be carried out using the str methods lstrip (left strip) and rstrip (right strip). By default these strip whitespace; if chars is specified, any leading or trailing characters found in chars are stripped instead. Note that chars is treated as a set of characters, not as a prefix or suffix:

In [401]:

padded_greeting = greeting.center(20)

In [402]:

padded_greeting

Out[402]:

'    Hello World!    '

In [403]:

padded_greeting.lstrip?

Signature: padded_greeting.lstrip(chars=None, /)
Docstring:
Return a copy of the string with leading whitespace removed.

If chars is given and not None, remove characters in chars instead.
Type:      builtin_function_or_method

In [404]:

padded_greeting.rstrip?

Signature: padded_greeting.rstrip(chars=None, /)
Docstring:
Return a copy of the string with trailing whitespace removed.

If chars is given and not None, remove characters in chars instead.
Type:      builtin_function_or_method

In [405]:

padded_greeting.lstrip()

Out[405]:

'Hello World!    '

In [406]:

padded_greeting.rstrip()

Out[406]:

'    Hello World!'

In [407]:

padded_greeting.lstrip().rstrip()

Out[407]:

'Hello World!'

In [408]:

padded_greeting = greeting.center(20, 'X')

In [409]:

padded_greeting

Out[409]:

'XXXXHello World!XXXX'

In [410]:

padded_greeting.lstrip('X').rstrip('X')

Out[410]:

'Hello World!'

The associated str methods removeprefix and removesuffix are more precise and will only remove a specified prefix or suffix:

In [411]:

padded_greeting.removeprefix?

Signature: padded_greeting.removeprefix(prefix, /)
Docstring:
Return a str with the given prefix string removed if present.

If the string starts with the prefix string, return string[len(prefix):].
Otherwise, return a copy of the original string.
Type:      builtin_function_or_method

In [412]:

padded_greeting.removesuffix?

Signature: padded_greeting.removesuffix(suffix, /)
Docstring:
Return a str with the given suffix string removed if present.

If the string ends with the suffix string and that suffix is not empty,
return string[:-len(suffix)]. Otherwise, return a copy of the original
string.
Type:      builtin_function_or_method

In [413]:

padded_greeting

Out[413]:

'XXXXHello World!XXXX'

In [414]:

padded_greeting.removeprefix('X')

Out[414]:

'XXXHello World!XXXX'
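The difference matters most when more than one character is supplied. In the sketch below the file name is hypothetical; lstrip keeps removing any leading character that appears in its argument, whereas removeprefix removes the exact prefix once:

```python
filename = 'test_tests.py'  # hypothetical file name

# lstrip treats 'test_' as the set of characters {'t', 'e', 's', '_'}
stripped = filename.lstrip('test_')

# removeprefix removes the exact substring 'test_' once
removed = filename.removeprefix('test_')

print(stripped)  # .py
print(removed)   # tests.py
```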

Earlier the ordinal value of the string '3' was examined. The prefix '0b' of its binary representation can be removed using removeprefix:

In [415]:

string_3 = bin(ord('3'))

In [416]:

string_3

Out[416]:

'0b110011'

In [417]:

string_3 = bin(ord('3')).removeprefix('0b')

In [418]:

string_3

Out[418]:

'110011'

There is also the zero fill string method zfill which is used to zero fill a string and is mainly intended for str instances of numeric values:

In [419]:

string_3.zfill?

Signature: string_3.zfill(width, /)
Docstring:
Pad a numeric string with zeros on the left, to fill a field of the given width.

The string is never truncated.
Type:      builtin_function_or_method

Since this binary number represents a byte, which has 8 bits, the width can be set to 8:

In [420]:

string_3.zfill(8)

Out[420]:

'00110011'
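As an alternative sketch, the builtin format function with the format specifier '08b' produces the zero filled binary string in one step, without the '0b' prefix. The example also shows how zfill treats a leading sign:

```python
binary = format(ord('3'), '08b')  # binary, zero padded to a width of 8
filled = bin(ord('3')).removeprefix('0b').zfill(8)

print(binary)            # 00110011
print(binary == filled)  # True

# zfill keeps a leading sign in front of the zeros
print('-42'.zfill(5))    # -0042
```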

Binary Operators¶

__add__ is a binary datamodel method used to concatenate two str instances:

In [421]:

greeting.__add__?

Signature:      greeting.__add__(value, /)
Call signature: greeting.__add__(*args, **kwargs)
Type:           method-wrapper
String form:    <method-wrapper '__add__' of str object at 0x000001E0A216D7B0>
Docstring:      Return self+value.

In [422]:

'hello' + 'world'

Out[422]:

'helloworld'

In [423]:

'hello' + ' ' + 'world'

Out[423]:

'hello world'

__mul__ is a binary datamodel method used to replicate the characters in a str instance using an int instance:

In [424]:

greeting.__mul__?

Signature:      greeting.__mul__(value, /)
Call signature: greeting.__mul__(*args, **kwargs)
Type:           method-wrapper
String form:    <method-wrapper '__mul__' of str object at 0x000001E0A216D7B0>
Docstring:      Return self*value.

In [425]:

greeting * 3

Out[425]:

'Hello World!Hello World!Hello World!'

The reverse multiplication datamodel method is also defined:

In [426]:

greeting.__rmul__?

Signature:      greeting.__rmul__(value, /)
Call signature: greeting.__rmul__(*args, **kwargs)
Type:           method-wrapper
String form:    <method-wrapper '__rmul__' of str object at 0x000001E0A216D7B0>
Docstring:      Return value*self.

This makes the multiplication of a str instance and an int instance around the * operator commutative:

In [427]:

3 * greeting

Out[427]:

'Hello World!Hello World!Hello World!'

Binary operators are frequently used with assignment:

In [428]:

variables(['greeting',], show_id=True)

Out[428]:

	Type	Size/Shape	Value	ID
Instance Name
greeting	str	12	Hello World!	2064303708080

Recall that the operation on the right of the assignment operator is carried out first, using the original instance. The return value, a new str instance, is then reassigned to the original instance name:

In [429]:

greeting = greeting + ' world!'

In [430]:

variables(['greeting',], show_id=True)

Out[430]:

	Type	Size/Shape	Value	ID
Instance Name
greeting	str	19	Hello World! world!	2064304980336

A binary operator for example addition + can be combined with the assignment operator = resulting in the "inplace" addition operator +=. Because the str instance is immutable the operation is not in place but is equivalent to the order of the two separate operations concatenation and then reassignment as shown above:

In [431]:

greeting += ' world!'

In [432]:

variables(['greeting',], show_id=True)

Out[432]:

	Type	Size/Shape	Value	ID
Instance Name
greeting	str	26	Hello World! world! world!	2064296437168
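The change of ID can also be confirmed without the variable inspector. In this minimal sketch a second instance name, original, is introduced for illustration and keeps referring to the unmodified str instance:

```python
greeting = 'Hello World!'
original = greeting       # a second name for the same instance

greeting += ' world!'     # builds a new str and rebinds the name greeting

print(original)              # Hello World!
print(greeting)              # Hello World! world!
print(greeting is original)  # False
```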

Splitting and Joining Strings¶

A number of str methods are available for splitting and joining str instances. These generally involve casting to a Python collection such as a tuple of str instances or a list of str instances.

For example, the str methods partition and rpartition (right partition) partition a str instance into a three element tuple of str instances: the substring before the separator, the separator itself and the substring after the separator. partition searches for the separator from the left and rpartition searches from the right. To make this more obvious, the following str instance will be instantiated:

In [433]:

greeting = 'hello|world|!'

In [434]:

greeting.partition?

Signature: greeting.partition(sep, /)
Docstring:
Partition the string into three parts using the given separator.

This will search for the separator in the string.  If the separator is found,
returns a 3-tuple containing the part before the separator, the separator
itself, and the part after it.

If the separator is not found, returns a 3-tuple containing the original string
and two empty strings.
Type:      builtin_function_or_method

In [435]:

greeting.partition('|')

Out[435]:

('hello', '|', 'world|!')

In [436]:

greeting.rpartition?

Signature: greeting.rpartition(sep, /)
Docstring:
Partition the string into three parts using the given separator.

This will search for the separator in the string, starting at the end. If
the separator is found, returns a 3-tuple containing the part before the
separator, the separator itself, and the part after it.

If the separator is not found, returns a 3-tuple containing two empty strings
and the original string.
Type:      builtin_function_or_method

In [437]:

greeting.rpartition('|')

Out[437]:

('hello|world', '|', '!')
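Because a three element tuple is always returned, the result can be unpacked directly. The sketch below uses a hypothetical key=value line to show the difference between the two methods and the behaviour when the separator is absent:

```python
setting = 'path=C:/data=old'  # hypothetical key=value line

key, sep, value = setting.partition('=')     # splits at the first '='
head, sep_r, tail = setting.rpartition('=')  # splits at the last '='

print((key, value))  # ('path', 'C:/data=old')
print((head, tail))  # ('path=C:/data', 'old')

# when the separator is absent the original string comes first
print('novalue'.partition('='))  # ('novalue', '', '')
```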

More generally the str instance methods split and join can be used to split a str instance into a list of str instances or join a list of str instances up into a single str instance. For example if the following sentence is created:

In [438]:

sentence = 'the fat black cat sat on the mat!'

The str instance method split can be examined:

In [439]:

sentence.split?

Signature: sentence.split(sep=None, maxsplit=-1)
Docstring:
Return a list of the substrings in the string, using sep as the separator string.

  sep
    The separator used to split the string.

    When set to None (the default value), will split on any whitespace
    character (including \n \r \t \f and spaces) and will discard
    empty strings from the result.
  maxsplit
    Maximum number of splits (starting from the left).
    -1 (the default value) means no limit.

Note, str.split() is mainly useful for data that has been intentionally
delimited.  With natural text that includes punctuation, consider using
the regular expression module.
Type:      builtin_function_or_method

Since the sentence is to be split on whitespace, the input arguments can be left unspecified so that they take their default values. This gives a list of str instances:

In [440]:

words = sentence.split()

In [441]:

variables(['sentence', 'words'], show_id=True)

Out[441]:

	Type	Size/Shape	Value	ID
Instance Name
sentence	str	33	the fat black cat sat on the mat!	2064304679920
words	list	8	['the', 'fat', 'black', 'cat', 'sat', 'on', 'the', 'mat!']	2064305096496

There is also the str method rsplit (right split). The difference is subtle and the two methods only behave differently when maxsplit is assigned a new value:

In [442]:

sentence.rsplit?

Signature: sentence.rsplit(sep=None, maxsplit=-1)
Docstring:
Return a list of the substrings in the string, using sep as the separator string.

  sep
    The separator used to split the string.

    When set to None (the default value), will split on any whitespace
    character (including \n \r \t \f and spaces) and will discard
    empty strings from the result.
  maxsplit
    Maximum number of splits (starting from the left).
    -1 (the default value) means no limit.

Splitting starts at the end of the string and works to the front.
Type:      builtin_function_or_method

In [443]:

words_r = sentence.rsplit()

In [444]:

variables(['sentence', 'words', 'words_r'], show_id=True)

Out[444]:

	Type	Size/Shape	Value	ID
Instance Name
sentence	str	33	the fat black cat sat on the mat!	2064304679920
words	list	8	['the', 'fat', 'black', 'cat', 'sat', 'on', 'the', 'mat!']	2064305097168
words_r	list	8	['the', 'fat', 'black', 'cat', 'sat', 'on', 'the', 'mat!']	2064305098960

The difference can be seen when maxsplit is used:

In [445]:

words = sentence.split(' ', maxsplit=3)

In [446]:

words_r = sentence.rsplit(' ', maxsplit=3)

In [447]:

variables(['sentence', 'words', 'words_r'], show_id=True)

Out[447]:

	Type	Size/Shape	Value	ID
Instance Name
sentence	str	33	the fat black cat sat on the mat!	2064304679920
words	list	4	['the', 'fat', 'black', 'cat sat on the mat!']	2064305097392
words_r	list	4	['the fat black cat sat', 'on', 'the', 'mat!']	2064305161872
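Note that supplying ' ' as sep is not identical to leaving sep as None. A minimal sketch, using an illustrative str instance containing a double space:

```python
spaced = 'the  fat cat'  # two spaces between 'the' and 'fat'

by_whitespace = spaced.split()  # runs of whitespace count as one separator
by_space = spaced.split(' ')    # every single space is a separator

print(by_whitespace)  # ['the', 'fat', 'cat']
print(by_space)       # ['the', '', 'fat', 'cat']
```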

To join the words, the str method join can be called from a delimiter str instance:

In [448]:

delimiter = ' '

In [449]:

delimiter.join?

Signature: delimiter.join(iterable, /)
Docstring:
Concatenate any number of strings.

The string whose method is called is inserted in between each given string.
The result is returned as a new string.

Example: '.'.join(['ab', 'pq', 'rs']) -> 'ab.pq.rs'
Type:      builtin_function_or_method

In [450]:

variables(show_id=True).loc[['delimiter', 'words']]

Out[450]:

	Type	Size/Shape	Value	ID
Instance Name
delimiter	str	1		140727149724840
words	list	4	['the', 'fat', 'black', 'cat sat on the mat!']	2064305101312

In [451]:

delimiter.join(words)

Out[451]:

'the fat black cat sat on the mat!'

join is typically called directly from a str literal, for example a space:

In [452]:

' '.join(words)

Out[452]:

'the fat black cat sat on the mat!'

In [453]:

'|'.join(words)

Out[453]:

'the|fat|black|cat sat on the mat!'
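join expects every item in the iterable to be a str instance. A short sketch, with an illustrative list of int instances that are cast to str before joining:

```python
numbers = [1, 2, 3]

# join only accepts str instances, so each int is cast first
joined = ', '.join(str(number) for number in numbers)
print(joined)  # 1, 2, 3

# split and join reverse one another for a single space delimiter
sentence = 'the fat black cat sat on the mat!'
print(' '.join(sentence.split(' ')) == sentence)  # True
```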

If a multiline str instance is created:

In [454]:

paragraph = """The quick brown fox jumps over the lazy dog
The quick brown fox jumps over the lazy dog
The quick brown fox jumps over the lazy dog
The quick brown fox jumps over the lazy dog"""

In [455]:

paragraph

Out[455]:

'The quick brown fox jumps over the lazy dog\nThe quick brown fox jumps over the lazy dog\nThe quick brown fox jumps over the lazy dog\nThe quick brown fox jumps over the lazy dog'

There is an associated str method splitlines, which splits the str into a list using the newline. It has an input argument keepends which defaults to False and therefore excludes the newline character:

In [456]:

paragraph.splitlines?

Signature: paragraph.splitlines(keepends=False)
Docstring:
Return a list of the lines in the string, breaking at line boundaries.

Line breaks are not included in the resulting list unless keepends is given and
true.
Type:      builtin_function_or_method

In [457]:

paragraph.splitlines()

Out[457]:

['The quick brown fox jumps over the lazy dog',
 'The quick brown fox jumps over the lazy dog',
 'The quick brown fox jumps over the lazy dog',
 'The quick brown fox jumps over the lazy dog']
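The effect of keepends, and the difference from splitting on '\n' directly, can be sketched with a short illustrative str instance that ends in a newline:

```python
text = 'line one\nline two\n'  # note the trailing newline

print(text.splitlines())               # ['line one', 'line two']
print(text.splitlines(keepends=True))  # ['line one\n', 'line two\n']
print(text.split('\n'))                # ['line one', 'line two', '']
```

splitlines does not produce a trailing empty str for the final newline, whereas split('\n') does.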

If the multiline string is created with tabs:

In [458]:

paragraph = """\tThe quick brown fox jumps over the lazy dog
\tThe quick brown fox jumps over the lazy dog
\tThe quick brown fox jumps over the lazy dog
\tThe quick brown fox jumps over the lazy dog"""

The tabs can be replaced by a specified number of spaces using the str method expandtabs:

In [459]:

paragraph.expandtabs?

Signature: paragraph.expandtabs(tabsize=8)
Docstring:
Return a copy where all tab characters are expanded using spaces.

If tabsize is not given, a tab size of 8 characters is assumed.
Type:      builtin_function_or_method

In [460]:

paragraph.expandtabs(4)

Out[460]:

'    The quick brown fox jumps over the lazy dog\n    The quick brown fox jumps over the lazy dog\n    The quick brown fox jumps over the lazy dog\n    The quick brown fox jumps over the lazy dog'

In [461]:

print(paragraph)

	The quick brown fox jumps over the lazy dog
	The quick brown fox jumps over the lazy dog
	The quick brown fox jumps over the lazy dog
	The quick brown fox jumps over the lazy dog

In [462]:

print(paragraph.expandtabs(4))

    The quick brown fox jumps over the lazy dog
    The quick brown fox jumps over the lazy dog
    The quick brown fox jumps over the lazy dog
    The quick brown fox jumps over the lazy dog
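A tab is not always replaced by exactly tabsize spaces; it advances to the next tab stop. A minimal sketch with an illustrative two line str instance:

```python
rows = 'a\tb\nabc\td'

# each tab advances to the next column that is a multiple of tabsize
expanded = rows.expandtabs(4)

print(expanded.splitlines())  # ['a   b', 'abc d']
```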

Bytes Related Identifiers¶

The bytes class is another text based class. Instead of having the fundamental unit of a Unicode character, it has the fundamental unit of a byte.

The str method encode encodes the str instance to a bytes instance. By default the 'utf-8' translation table is used, but another translation table can be specified using encoding:

In [463]:

greeting.encode?

Signature: greeting.encode(encoding='utf-8', errors='strict')
Docstring:
Encode the string using the codec registered for encoding.

encoding
  The encoding in which to encode the string.
errors
  The error handling scheme to use for encoding errors.
  The default is 'strict' meaning that encoding errors raise a
  UnicodeEncodeError.  Other possible values are 'ignore', 'replace' and
  'xmlcharrefreplace' as well as any other name registered with
  codecs.register_error that can handle UnicodeEncodeErrors.
Type:      builtin_function_or_method

Since each English ASCII character is stored as a single byte, the English character is used to represent its corresponding byte and therefore the two instances look similar:

In [464]:

greeting.encode()

Out[464]:

b'hello|world|!'
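The reverse operation is the bytes method decode. The following sketch round trips the greeting and shows one practical difference between the two classes, namely what indexing returns:

```python
greeting = 'hello|world|!'

encoded = greeting.encode()  # bytes, 'utf-8' by default
decoded = encoded.decode()   # back to str, 'utf-8' by default

print(decoded == greeting)  # True

# indexing a str gives a str, indexing a bytes gives an int
print(greeting[0])  # h
print(encoded[0])   # 104
```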

Recall ASCII characters are encoded over the values 0:128, which is half of the 256 values available in a byte. Legacy translation tables use the second half of a byte for additional characters. The £ sign, for example, is not an ASCII character. In 'latin1' it spans a single byte:

In [465]:

'£'.encode(encoding='latin1')

Out[465]:

b'\xa3'

In [466]:

0xa3

Out[466]:

163

In 'utf-16' each of these characters spans 2 bytes. There are variations of utf-16 depending on the byte order. The byte order, or endianness, can be conceptualised by writing the number twelve (in decimal) as 12 (big endian) or 21 (little endian).

Humans normally write numbers using big endian but Intel processors work using little endian. As a consequence there are 2 variations of utf-16, 'utf-16-be' and 'utf-16-le'. A third variant, 'utf-16', prefixes the encoded text with a 2 byte BOM. The BOM is the byte order mark, used to quickly identify the byte order, which is little endian in the output below:

In [467]:

'£'.encode(encoding='utf-16-be')

Out[467]:

b'\x00\xa3'

In [468]:

'£'.encode(encoding='utf-16-le')

Out[468]:

b'\xa3\x00'

In [469]:

'£'.encode(encoding='utf-16')

Out[469]:

b'\xff\xfe\xa3\x00'

In [470]:

'££'.encode(encoding='utf-16')

Out[470]:

b'\xff\xfe\xa3\x00\xa3\x00'
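The BOM is only written once, at the start of the bytes instance, and is consumed again when decoding. The sketch below builds the little endian form explicitly using the constant BOM_UTF16_LE from the standard codecs module, so that it does not depend on the byte order of the machine running it:

```python
import codecs

little = '£'.encode('utf-16-le')  # no BOM
with_bom = codecs.BOM_UTF16_LE + little

print(codecs.BOM_UTF16_LE)                # b'\xff\xfe'
print(with_bom)                           # b'\xff\xfe\xa3\x00'

# the 'utf-16' codec reads the BOM to select the byte order
print(with_bom.decode('utf-16') == '£')   # True
print(little.decode('utf-16-le') == '£')  # True
```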

The current standard is 'utf-8', which uses a different byte combination to the previous translation tables and uses 2 bytes to encode the £ sign:

In [471]:

'£'.encode(encoding='utf-8')

Out[471]:

b'\xc2\xa3'
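When a translation table has no entry for a character, the errors input argument of encode decides what happens. A minimal sketch using the 'ascii' translation table, which cannot represent the £ sign; the instance name pound is illustrative:

```python
pound = '£'

try:
    pound.encode(encoding='ascii')  # errors='strict' by default
except UnicodeEncodeError as error:
    print(type(error).__name__)     # UnicodeEncodeError

print(pound.encode('ascii', errors='ignore'))             # b''
print(pound.encode('ascii', errors='replace'))            # b'?'
print(pound.encode('ascii', errors='xmlcharrefreplace'))  # b'&#163;'
```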

The Greek letters also require 2 bytes each. None of the characters in the str instance below, except for the spaces and the exclamation mark, are ASCII characters, and each is therefore represented by two hexadecimal escape sequences:

In [472]:

greek_greeting = 'Γειά σου Κόσμε!'

In [473]:

greek_greeting.encode(encoding='utf-8')

Out[473]:

b'\xce\x93\xce\xb5\xce\xb9\xce\xac \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xcf\x8c\xcf\x83\xce\xbc\xce\xb5!'

In [474]:

'Γ'.encode(encoding='utf-8')

Out[474]:

b'\xce\x93'
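A consequence is that the length of a str instance, counted in characters, generally differs from the length of its encoded bytes instance. The sketch below normalises the greeting with the standard unicodedata module first, an extra step assumed here so that each accented letter is a single character:

```python
import unicodedata

# NFC normalisation keeps each accented letter as a single character
greek_greeting = unicodedata.normalize('NFC', 'Γειά σου Κόσμε!')
encoded = greek_greeting.encode(encoding='utf-8')

print(len(greek_greeting))  # 15 characters
print(len(encoded))         # 27 bytes: 12 Greek letters * 2 + 3 ASCII characters
```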

The bytes class and the concept of encoding will be covered in more detail in the next notebook.
