The name of the str class is an abbreviation of string. A str instance is an immutable sequence of Unicode characters.
Categorize_Identifiers Module¶
This notebook uses the functions dir2, variables and view from the custom module categorize_identifiers, which is found in the same directory as this notebook file. dir2 is a variant of dir that groups identifiers into a dict under categories, variables is an IPython-based variable inspector and view is used to view a Collection in more detail:
from categorize_identifiers import dir2, variables, view
Initialisation Signature¶
The initialisation signature of the str
class may be printed using:
str?
Init signature: str(self, /, *args, **kwargs)
Docstring:
str(object='') -> str
str(bytes_or_buffer[, encoding[, errors]]) -> str

Create a new string object from the given object. If encoding or
errors is specified, then the object must expose a data buffer
that will be decoded using the given encoding and error handler.
Otherwise, returns the result of object.__str__() (if defined)
or repr(object).
encoding defaults to sys.getdefaultencoding().
errors defaults to 'strict'.
Type: type
Subclasses: StrEnum, DeferredConfigString, FoldedCase, _rstr, _ScriptTarget, _ModuleTarget, LSString, include, Keys, InputMode, ...
The purpose of the initialisation signature is to provide the data required to initialise a new instance. For the str
class, the initialisation signature shows three alternative ways of supplying the required instance data.
If the first way is examined:
str(self, /, *args, **kwargs)
To recap:
- The parentheses ( ) are used to call a function and supply any necessary input arguments.
- The comma , is used as a delimiter to separate input arguments.
- In Python self is used to denote this instance. In other words a str instance is constructed from an existing str. This is a special case as a str is a fundamental datatype and has a shorthand way of instantiation.
- Any input argument before a / must be provided positionally.
- *args indicates a variable number of additional positional input arguments. These are typically not used for the str class.
- **kwargs indicates a variable number of additional named input arguments. These are typically not used for the str class.
self
can be provided positionally using an existing str
instance:
str('hello')
'hello'
However, because str is a fundamental datatype, an instance is normally created using the following shorthand:
'hello'
'hello'
The characters in a str
instance must be enclosed in quotations. These are used to distinguish a str
of characters from an instance name.
Notice the difference in the syntax colour highlighting between the str
instance (top) and the instance name (below). The instance name does not exist and the Python interpreter will flag a NameError
when attempting to look it up:
'hello'
hello
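The difference between a quoted str and an unquoted instance name can also be checked in code. The sketch below assumes that no instance name hello has been assigned beforehand and catches the NameError so that its message can be inspected:

```python
# A quoted sequence of characters is a str instance.
quoted = 'hello'

# Without quotations, hello is treated as an instance name and looked up.
# This assumes that no instance name hello has been assigned beforehand.
try:
    hello
except NameError as error:
    message = str(error)

print(type(quoted).__name__)
print(message)
```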
In VSCode the Variables button can be selected to view Variables present. In this notebook, the custom function variables
will instead be used which has a similar form:
variables()
Instance Name | Type | Size/Shape | Value
---|---|---|---
If the following code is input:
'hello'
'hello'
Notice the value 'hello' is returned to the cell output. A value that is only returned to the cell output is not assigned to an instance name:
variables()
Instance Name | Type | Size/Shape | Value
---|---|---|---
This str instance has no instance name and therefore cannot be reselected. Conceptualise an instance name as a label which points to the str instance and is therefore used to select the str instance.
A str
instance can be assigned to an instance name during instantiation:
greeting = 'hello'
Notice now that the cell has no output. Instead the str instance is stored under the instance name greeting and this displays in Variables:
variables()
Instance Name | Type | Size/Shape | Value
---|---|---|---
greeting | str | 5 | hello
The value of the str
instance can be referenced via the instance name:
greeting
'hello'
In the above cell, the Python interpreter recognised the instance name. This instance name was used to point to the str
instance and the value retrieved was not assigned to another instance name and is therefore shown in the cell output.
If the instance is instead assigned to another instance name:
greeting2 = greeting
Then in the Variable Explorer, the str
instance 'hello'
is shown with two different instance names greeting
and greeting2
:
variables()
Instance Name | Type | Size/Shape | Value
---|---|---|---
greeting | str | 5 | hello
greeting2 | str | 5 | hello
These two instance names act as aliases of one another. If an instance name is conceptualised as a label, then this str instance has two labels. If either instance name is used, the same value is retrieved:
greeting
'hello'
greeting2
'hello'
A check is made to see if the value retrieved from each instance name is equal. Because they are the same str
instance, the boolean True
is returned:
greeting == greeting2
True
Each instance in Python has a unique identity, which can be checked using the id function:
id(greeting)
2064235241776
id(greeting2)
2064235241776
Notice that the id is the same, because both these instance names are references to the same str
instance. Therefore the following is True
:
greeting is greeting2
True
Recall that this is shorthand for:
id(greeting) == id(greeting2)
True
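The distinction between equality and identity can be summarised in a short sketch. The == operator compares values, whereas is compares identities. The instance name rebuilt is introduced here purely for illustration; whether an equal str built at runtime shares its identity with greeting is an implementation detail, so that is deliberately not checked:

```python
greeting = 'hello'
greeting2 = greeting  # a second label on the same str instance

# Build an equal value at runtime from two pieces.
pieces = ['hel', 'lo']
rebuilt = ''.join(pieces)

same_value = greeting == rebuilt          # compares the characters
same_instance = greeting is greeting2     # compares the identities
same_id = id(greeting) == id(greeting2)   # what is checks behind the scenes

print(same_value, same_instance, same_id)
```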
The delete statement del
can be used to delete an instance name. Note that deleting an instance name only deletes a label, leaving the instance unchanged:
del greeting
Notice that the instance name greeting
is deleted i.e. this label is removed. However the label greeting2
is still present and the instance 'hello'
is unaltered:
variables()
Instance Name | Type | Size/Shape | Value
---|---|---|---
greeting2 | str | 5 | hello
If del
is used to also delete the instance name greeting2
:
del greeting2
variables()
Instance Name | Type | Size/Shape | Value
---|---|---|---
Then there are no instance names for the str instance 'hello'. When an instance has no instance name it cannot be referenced and is considered orphaned. Orphaned instances are automatically cleaned up by Python's garbage collection.
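The effect of del on a label can be confirmed without a variable inspector. The following is a minimal sketch; the flag greeting_exists is a name introduced here for illustration:

```python
greeting = 'hello'
greeting2 = greeting

del greeting  # removes the label only, not the str instance

# Looking up the deleted instance name now fails.
try:
    greeting
except NameError:
    greeting_exists = False
else:
    greeting_exists = True

print(greeting_exists)
print(greeting2)  # the remaining label still reaches the instance
```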
If a new instance is created:
greeting = 'Hello World'
Then the instance name displays on variables:
variables(show_id=True)
Instance Name | Type | Size/Shape | Value | ID
---|---|---|---|---
greeting | str | 11 | Hello World | 2064295823280
If a reassignment is carried out:
greeting = 'hi'
The instance name remains on Variables but the instance it points to has changed. In other words the label greeting has been peeled off the old str instance 'Hello World' and placed on the new str instance 'hi'. The old str instance now has no instance name and therefore no reference. Because it is orphaned, it is cleaned up by Python's garbage collection:
variables(show_id=True)
Instance Name | Type | Size/Shape | Value | ID
---|---|---|---|---
greeting | str | 2 | hi | 140727149708216
Reassignment moves the instance name from the old str
instance to the new str
instance and does not change either str
instance. A str
instance is immutable and cannot be modified after it has been instantiated.
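Immutability can be seen by attempting to change a character in place. The sketch below catches the resulting TypeError and then shows that a str method returns a new instance instead of modifying the original; the instance name shouted is introduced only for illustration:

```python
greeting = 'Hello World'

# Attempting to change a character in place raises a TypeError.
try:
    greeting[0] = 'J'
except TypeError as error:
    message = str(error)

# Methods return a new str instance and leave the original unchanged.
shouted = greeting.upper()

print(message)
print(greeting)
print(shouted)
```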
The initialisation signature of the str
class shows instantiation using a named keyword input argument object
which has a default value of an empty str
:
str(object='') -> str
This is used to cast instances of other Python builtin classes to str instances:
str(object='hello')
'hello'
str(object=b'hello')
"b'hello'"
str(object=bytearray(b'hello'))
"bytearray(b'hello')"
str(object=2)
'2'
str(object=True)
'True'
str(object=3.14)
'3.14'
If object is not supplied, it takes its default value and an empty str instance is returned:
str()
''
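The docstring also lists a second form, str(bytes_or_buffer[, encoding[, errors]]), which decodes bytes into characters instead of returning their formal representation. A minimal sketch, using example byte values chosen here for illustration:

```python
encoded = b'caf\xc3\xa9'  # UTF-8 bytes for the four characters c, a, f, é

# Without an encoding, the formal representation of the bytes is returned.
as_repr = str(encoded)

# With an encoding, the bytes are decoded into characters.
decoded = str(encoded, encoding='utf-8')

# errors='replace' substitutes undecodable bytes instead of raising an error.
lenient = str(b'caf\xe9', encoding='utf-8', errors='replace')

print(as_repr)
print(decoded)
print(lenient)
```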
Spacing and PEP8¶
If the following is examined:
instance = str(object='hello')
Notice the assignment operator is used to assign a value to a named parameter within the function call and the return
value of the function call is also assigned to an instance name.
Notice the subtlety in the above spacing. Within a function call, spacing is typically used to visually separate input arguments:
func(a=1, b=2, c=3)
Outside the function call, spacing is used to visually emphasise an operator:
result = 2 * 3
Operators within a function call are not surrounded by spaces, as the spacing is used to visually separate the parameters:
result = func(a=1, b=2, c=2*3)
The code below is equivalent but is harder to read:
result=func(a=1,b=2,c=2*3)
result=func(a = 1,b = 2,c = 2 * 3)
More details are given in Python Enhancement Proposal 8 (PEP 8): Style Guide for Python Code.
Use of the Python formatters such as autopep8 was previously discussed in the tutorial on installing VSCode.
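The spacing conventions can be tried out with a runnable sketch. The function func below is hypothetical and simply adds its three input arguments; note that the names of named input arguments are written without quotations:

```python
def func(a, b, c):
    """Hypothetical function that adds its three input arguments."""
    return a + b + c


# PEP 8 style: no spaces around = or * inside the call,
# a space after each comma and spaces around the outer assignment.
result = func(a=1, b=2, c=2*3)

# The same call without any spacing is valid but harder to read.
compact=func(a=1,b=2,c=2*3)

print(result, compact)
```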
String Quotations¶
In Python single and double quotations can be used to enclose the characters in a str
instance and are seen as equivalent:
"Hello World!"
'Hello World!'
'Hello World!'
'Hello World!'
Notice that the Python interpreter itself prefers single quotations: the value returned to the cell output in each case is the formal representation, which is enclosed in single quotations.
The ' is a formatting character in a str instance and is used to enclose the characters of the str itself. If an attempt is made to construct a str containing a str literal using only single quotations:
'greeting = 'hello world!''
Notice that the syntax highlighting above displays:
- 'greeting = ' as a str
- hello as an instance name
- world! as an instance name
- '' as an empty string
This results in a SyntaxError.
The \
is another formatting character that is used to insert an escape character or escape character sequence. \'
will incorporate the single quotation into the str
:
'greeting = \'hello world!\''
"greeting = 'hello world!'"
Notice that the str
returned in the cell output is now enclosed in double quotations and is more readable. The main purpose of the double quotations is to make it easier to create a str
instance which includes a str
literal.
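The two ways of including a single quotation can be compared directly. The sketch below checks that the escaped form and the double-quoted form produce equal str instances and uses repr to obtain the formal representation that is shown in the cell output:

```python
escaped = 'greeting = \'hello world!\''
double_quoted = "greeting = 'hello world!'"

# Both spellings produce the same characters.
same = escaped == double_quoted

# The formal representation switches to double quotations when the
# str contains a single quotation and uses single quotations otherwise.
shown = repr(escaped)
plain = repr("Hello World!")

print(same)
print(shown)
print(plain)
```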
Triple double quotations are typically used for a multiline string. Double quotations are preferred over single quotations for multiline str
instances as they are commonly used as docstrings and a docstring has a high probability of including a str
literal. A very basic function can be created which takes in two input str
instances and prints them within a formatted str
instance:
def fun(string1='hello', string2='world'):
    print(f'{string1} {string2}')
The function can be tested:
fun()
hello world
fun(string1='bye')
bye world
Because it has no docstring, it has no documentation:
fun?
Signature: fun(string1='hello', string2='world')
Docstring: <no docstring>
File: c:\users\phili\appdata\local\temp\ipykernel_3712\1566935369.py
Type: function
A docstring is normally added at the start of the function's code block and although this is only a single line, it is typically input using triple double quotations:
def fun(string1='hello', string2='world'):
    """Prints string1 string2"""
    print(f'{string1} {string2}')
fun?
Signature: fun(string1='hello', string2='world')
Docstring: Prints string1 string2
File: c:\users\phili\appdata\local\temp\ipykernel_3712\3973104799.py
Type: function
The triple double quotations allow it to be readily expanded later on with optional str
literals:
def fun(string1='hello', string2='world'):
    """Prints string1 string2
    For example fun(string1='hello', string2='world') prints hello world"""
    print(f'{string1} {string2}')
fun?
Signature: fun(string1='hello', string2='world')
Docstring:
Prints string1 string2
For example fun(string1='hello', string2='world') prints hello world
File: c:\users\phili\appdata\local\temp\ipykernel_3712\1096554159.py
Type: function
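Outside of IPython, where the ? syntax is not available, the docstring can be read from the function itself. A minimal sketch, assuming a blank line is placed between the summary line and the example, which is a common docstring layout:

```python
import inspect


def fun(string1='hello', string2='world'):
    """Prints string1 string2

    For example fun(string1='hello', string2='world') prints hello world
    """
    print(f'{string1} {string2}')


raw = fun.__doc__               # the docstring as stored on the function
cleaned = inspect.getdoc(fun)   # the docstring with indentation cleaned up
summary = cleaned.splitlines()[0]

print(summary)
```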
Python Enhancement Proposal 8 (PEP 8): Style Guide for Python Code does not explicitly make a recommendation for quotation style:
In Python, single-quoted strings and double-quoted strings are the same. This PEP does not make a recommendation for this. Pick a rule and stick to it.
However the Python interpreter, Python itself and the official Python documentation prefer single quotations over double quotations. Double quotations are used when the str instance contains a str literal. A docstring (which is likely to be updated later to include a str literal) uses triple double quotations. When getting started, it is generally good practice to make your code look as close as possible to the code in the official Python documentation, as these tutorials attempt to do. The popular third-party libraries numpy, matplotlib, scipy and sklearn in the scientific stack are written using a quotation style consistent with this.
Python has a popular opinionated autoformatter black
which unfortunately has a preference for double quotations, differing from the style used in Python itself. Moreover black
is used for the development of some popular third-party libraries such as pandas
and seaborn
which are also in the scientific stack. The quotation style of the official documentation for libraries in the scientific stack is therefore unfortunately inconsistent. Finally, because of the popularity of pandas in particular, double quotations tend to be more prevalent in data science tutorials.
Identifiers¶
Two str
instances can be instantiated:
greeting = 'hello'
farewell = 'bye'
variables()
Instance Name | Type | Size/Shape | Value
---|---|---|---
greeting | str | 5 | hello
farewell | str | 3 | bye
The dir
function can be used to view a list of identifiers from an instance:
dir(greeting)
['__add__', '__class__', '__contains__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmod__', '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'capitalize', 'casefold', 'center', 'count', 'encode', 'endswith', 'expandtabs', 'find', 'format', 'format_map', 'index', 'isalnum', 'isalpha', 'isascii', 'isdecimal', 'isdigit', 'isidentifier', 'islower', 'isnumeric', 'isprintable', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'maketrans', 'partition', 'removeprefix', 'removesuffix', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']
These aren't grouped by category. This can be done by using the custom function dir2:
dir2(greeting)
{'method': ['capitalize', 'casefold', 'center', 'count', 'encode', 'endswith', 'expandtabs', 'find', 'format', 'format_map', 'index', 'isalnum', 'isalpha', 'isascii', 'isdecimal', 'isdigit', 'isidentifier', 'islower', 'isnumeric', 'isprintable', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'maketrans', 'partition', 'removeprefix', 'removesuffix', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill'], 'datamodel_attribute': ['__doc__'], 'datamodel_method': ['__add__', '__class__', '__contains__', '__delattr__', '__dir__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmod__', '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__']}
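For readers without access to the categorize_identifiers module, the grouping idea can be approximated using only builtins. The function categorize below is a hypothetical stand-in and is not the implementation of dir2; it assumes that an identifier is a datamodel identifier when its name begins and ends with a double underscore and that it is a method when it is callable:

```python
def categorize(obj):
    """Hypothetical stand-in for dir2, not the categorize_identifiers code."""
    groups = {}
    for name in dir(obj):
        # Assumption: dunder names are treated as datamodel identifiers.
        datamodel = name.startswith('__') and name.endswith('__')
        # Assumption: anything callable is treated as a method.
        kind = 'method' if callable(getattr(obj, name)) else 'attribute'
        category = f'datamodel_{kind}' if datamodel else kind
        groups.setdefault(category, []).append(name)
    return groups


groups = categorize('hello')
print(sorted(groups))
print(groups['datamodel_attribute'])
```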
Notice the same identifier names display when the other instance is examined:
dir2(farewell)
{'method': ['capitalize', 'casefold', 'center', 'count', 'encode', 'endswith', 'expandtabs', 'find', 'format', 'format_map', 'index', 'isalnum', 'isalpha', 'isascii', 'isdecimal', 'isdigit', 'isidentifier', 'islower', 'isnumeric', 'isprintable', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'maketrans', 'partition', 'removeprefix', 'removesuffix', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill'], 'datamodel_attribute': ['__doc__'], 'datamodel_method': ['__add__', '__class__', '__contains__', '__delattr__', '__dir__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmod__', '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__']}
This is because both greeting and farewell are instances of the str class:
type(greeting)
str
type(farewell)
str
And the identifiers are defined in the str
class:
dir2(str)
{'method': ['capitalize', 'casefold', 'center', 'count', 'encode', 'endswith', 'expandtabs', 'find', 'format', 'format_map', 'index', 'isalnum', 'isalpha', 'isascii', 'isdecimal', 'isdigit', 'isidentifier', 'islower', 'isnumeric', 'isprintable', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'maketrans', 'partition', 'removeprefix', 'removesuffix', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill'], 'datamodel_attribute': ['__doc__'], 'datamodel_method': ['__add__', '__class__', '__contains__', '__delattr__', '__dir__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmod__', '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__']}
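Because the identifiers are defined in the class, a method can be called either from an instance or from the class with the instance supplied as self. A short sketch using the upper method:

```python
greeting = 'hello'
farewell = 'bye'

# Both instances list the same identifiers because both are str instances.
same_identifiers = dir(greeting) == dir(farewell)

# Calling from the instance supplies self implicitly.
from_instance = greeting.upper()

# Calling from the class requires self to be supplied explicitly.
from_class = str.upper(greeting)

print(same_identifiers, from_instance, from_class)
```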
If the class's method resolution order is examined:
str.mro()
[str, object]
Notice that a list instance containing the classes str and object is returned. This means the str class inherits all the object-based datamodel identifiers:
dir2(str, object, consistent_only=True)
{'datamodel_attribute': ['__doc__'], 'datamodel_method': ['__class__', '__delattr__', '__dir__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__']}
Alongside the following additions:
dir2(str, object, unique_only=True)
{'method': ['capitalize', 'casefold', 'center', 'count', 'encode', 'endswith', 'expandtabs', 'find', 'format', 'format_map', 'index', 'isalnum', 'isalpha', 'isascii', 'isdecimal', 'isdigit', 'isidentifier', 'islower', 'isnumeric', 'isprintable', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'maketrans', 'partition', 'removeprefix', 'removesuffix', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill'], 'datamodel_method': ['__add__', '__contains__', '__getitem__', '__getnewargs__', '__iter__', '__len__', '__mod__', '__mul__', '__rmod__', '__rmul__']}
The method resolution order is an instruction to preferentially use the method defined in the str class and to fall back on the method defined in the object class when it is not defined in the str class. More details about these two classes can be seen using help:
help(str)
Help on class str in module builtins: class str(object) | str(object='') -> str | str(bytes_or_buffer[, encoding[, errors]]) -> str | | Create a new string object from the given object. If encoding or | errors is specified, then the object must expose a data buffer | that will be decoded using the given encoding and error handler. | Otherwise, returns the result of object.__str__() (if defined) | or repr(object). | encoding defaults to sys.getdefaultencoding(). | errors defaults to 'strict'. | | Methods defined here: | | __add__(self, value, /) | Return self+value. | | __contains__(self, key, /) | Return bool(key in self). | | __eq__(self, value, /) | Return self==value. | | __format__(self, format_spec, /) | Return a formatted version of the string as described by format_spec. | | __ge__(self, value, /) | Return self>=value. | | __getattribute__(self, name, /) | Return getattr(self, name). | | __getitem__(self, key, /) | Return self[key]. | | __getnewargs__(...) | | __gt__(self, value, /) | Return self>value. | | __hash__(self, /) | Return hash(self). | | __iter__(self, /) | Implement iter(self). | | __le__(self, value, /) | Return self<=value. | | __len__(self, /) | Return len(self). | | __lt__(self, value, /) | Return self<value. | | __mod__(self, value, /) | Return self%value. | | __mul__(self, value, /) | Return self*value. | | __ne__(self, value, /) | Return self!=value. | | __repr__(self, /) | Return repr(self). | | __rmod__(self, value, /) | Return value%self. | | __rmul__(self, value, /) | Return value*self. | | __sizeof__(self, /) | Return the size of the string in memory, in bytes. | | __str__(self, /) | Return str(self). | | capitalize(self, /) | Return a capitalized version of the string. | | More specifically, make the first character have upper case and the rest lower | case. | | casefold(self, /) | Return a version of the string suitable for caseless comparisons. | | center(self, width, fillchar=' ', /) | Return a centered string of length width. 
| | Padding is done using the specified fill character (default is a space). | | count(...) | S.count(sub[, start[, end]]) -> int | | Return the number of non-overlapping occurrences of substring sub in | string S[start:end]. Optional arguments start and end are | interpreted as in slice notation. | | encode(self, /, encoding='utf-8', errors='strict') | Encode the string using the codec registered for encoding. | | encoding | The encoding in which to encode the string. | errors | The error handling scheme to use for encoding errors. | The default is 'strict' meaning that encoding errors raise a | UnicodeEncodeError. Other possible values are 'ignore', 'replace' and | 'xmlcharrefreplace' as well as any other name registered with | codecs.register_error that can handle UnicodeEncodeErrors. | | endswith(...) | S.endswith(suffix[, start[, end]]) -> bool | | Return True if S ends with the specified suffix, False otherwise. | With optional start, test S beginning at that position. | With optional end, stop comparing S at that position. | suffix can also be a tuple of strings to try. | | expandtabs(self, /, tabsize=8) | Return a copy where all tab characters are expanded using spaces. | | If tabsize is not given, a tab size of 8 characters is assumed. | | find(...) | S.find(sub[, start[, end]]) -> int | | Return the lowest index in S where substring sub is found, | such that sub is contained within S[start:end]. Optional | arguments start and end are interpreted as in slice notation. | | Return -1 on failure. | | format(...) | S.format(*args, **kwargs) -> str | | Return a formatted version of S, using substitutions from args and kwargs. | The substitutions are identified by braces ('{' and '}'). | | format_map(...) | S.format_map(mapping) -> str | | Return a formatted version of S, using substitutions from mapping. | The substitutions are identified by braces ('{' and '}'). | | index(...) 
| S.index(sub[, start[, end]]) -> int | | Return the lowest index in S where substring sub is found, | such that sub is contained within S[start:end]. Optional | arguments start and end are interpreted as in slice notation. | | Raises ValueError when the substring is not found. | | isalnum(self, /) | Return True if the string is an alpha-numeric string, False otherwise. | | A string is alpha-numeric if all characters in the string are alpha-numeric and | there is at least one character in the string. | | isalpha(self, /) | Return True if the string is an alphabetic string, False otherwise. | | A string is alphabetic if all characters in the string are alphabetic and there | is at least one character in the string. | | isascii(self, /) | Return True if all characters in the string are ASCII, False otherwise. | | ASCII characters have code points in the range U+0000-U+007F. | Empty string is ASCII too. | | isdecimal(self, /) | Return True if the string is a decimal string, False otherwise. | | A string is a decimal string if all characters in the string are decimal and | there is at least one character in the string. | | isdigit(self, /) | Return True if the string is a digit string, False otherwise. | | A string is a digit string if all characters in the string are digits and there | is at least one character in the string. | | isidentifier(self, /) | Return True if the string is a valid Python identifier, False otherwise. | | Call keyword.iskeyword(s) to test whether string s is a reserved identifier, | such as "def" or "class". | | islower(self, /) | Return True if the string is a lowercase string, False otherwise. | | A string is lowercase if all cased characters in the string are lowercase and | there is at least one cased character in the string. | | isnumeric(self, /) | Return True if the string is a numeric string, False otherwise. | | A string is numeric if all characters in the string are numeric and there is at | least one character in the string. 
| | isprintable(self, /) | Return True if the string is printable, False otherwise. | | A string is printable if all of its characters are considered printable in | repr() or if it is empty. | | isspace(self, /) | Return True if the string is a whitespace string, False otherwise. | | A string is whitespace if all characters in the string are whitespace and there | is at least one character in the string. | | istitle(self, /) | Return True if the string is a title-cased string, False otherwise. | | In a title-cased string, upper- and title-case characters may only | follow uncased characters and lowercase characters only cased ones. | | isupper(self, /) | Return True if the string is an uppercase string, False otherwise. | | A string is uppercase if all cased characters in the string are uppercase and | there is at least one cased character in the string. | | join(self, iterable, /) | Concatenate any number of strings. | | The string whose method is called is inserted in between each given string. | The result is returned as a new string. | | Example: '.'.join(['ab', 'pq', 'rs']) -> 'ab.pq.rs' | | ljust(self, width, fillchar=' ', /) | Return a left-justified string of length width. | | Padding is done using the specified fill character (default is a space). | | lower(self, /) | Return a copy of the string converted to lowercase. | | lstrip(self, chars=None, /) | Return a copy of the string with leading whitespace removed. | | If chars is given and not None, remove characters in chars instead. | | partition(self, sep, /) | Partition the string into three parts using the given separator. | | This will search for the separator in the string. If the separator is found, | returns a 3-tuple containing the part before the separator, the separator | itself, and the part after it. | | If the separator is not found, returns a 3-tuple containing the original string | and two empty strings. 
| | removeprefix(self, prefix, /) | Return a str with the given prefix string removed if present. | | If the string starts with the prefix string, return string[len(prefix):]. | Otherwise, return a copy of the original string. | | removesuffix(self, suffix, /) | Return a str with the given suffix string removed if present. | | If the string ends with the suffix string and that suffix is not empty, | return string[:-len(suffix)]. Otherwise, return a copy of the original | string. | | replace(self, old, new, count=-1, /) | Return a copy with all occurrences of substring old replaced by new. | | count | Maximum number of occurrences to replace. | -1 (the default value) means replace all occurrences. | | If the optional argument count is given, only the first count occurrences are | replaced. | | rfind(...) | S.rfind(sub[, start[, end]]) -> int | | Return the highest index in S where substring sub is found, | such that sub is contained within S[start:end]. Optional | arguments start and end are interpreted as in slice notation. | | Return -1 on failure. | | rindex(...) | S.rindex(sub[, start[, end]]) -> int | | Return the highest index in S where substring sub is found, | such that sub is contained within S[start:end]. Optional | arguments start and end are interpreted as in slice notation. | | Raises ValueError when the substring is not found. | | rjust(self, width, fillchar=' ', /) | Return a right-justified string of length width. | | Padding is done using the specified fill character (default is a space). | | rpartition(self, sep, /) | Partition the string into three parts using the given separator. | | This will search for the separator in the string, starting at the end. If | the separator is found, returns a 3-tuple containing the part before the | separator, the separator itself, and the part after it. | | If the separator is not found, returns a 3-tuple containing two empty strings | and the original string. 
| | rsplit(self, /, sep=None, maxsplit=-1) | Return a list of the substrings in the string, using sep as the separator string. | | sep | The separator used to split the string. | | When set to None (the default value), will split on any whitespace | character (including \n \r \t \f and spaces) and will discard | empty strings from the result. | maxsplit | Maximum number of splits (starting from the left). | -1 (the default value) means no limit. | | Splitting starts at the end of the string and works to the front. | | rstrip(self, chars=None, /) | Return a copy of the string with trailing whitespace removed. | | If chars is given and not None, remove characters in chars instead. | | split(self, /, sep=None, maxsplit=-1) | Return a list of the substrings in the string, using sep as the separator string. | | sep | The separator used to split the string. | | When set to None (the default value), will split on any whitespace | character (including \n \r \t \f and spaces) and will discard | empty strings from the result. | maxsplit | Maximum number of splits (starting from the left). | -1 (the default value) means no limit. | | Note, str.split() is mainly useful for data that has been intentionally | delimited. With natural text that includes punctuation, consider using | the regular expression module. | | splitlines(self, /, keepends=False) | Return a list of the lines in the string, breaking at line boundaries. | | Line breaks are not included in the resulting list unless keepends is given and | true. | | startswith(...) | S.startswith(prefix[, start[, end]]) -> bool | | Return True if S starts with the specified prefix, False otherwise. | With optional start, test S beginning at that position. | With optional end, stop comparing S at that position. | prefix can also be a tuple of strings to try. | | strip(self, chars=None, /) | Return a copy of the string with leading and trailing whitespace removed. 
| | If chars is given and not None, remove characters in chars instead. | | swapcase(self, /) | Convert uppercase characters to lowercase and lowercase characters to uppercase. | | title(self, /) | Return a version of the string where each word is titlecased. | | More specifically, words start with uppercased characters and all remaining | cased characters have lower case. | | translate(self, table, /) | Replace each character in the string using the given translation table. | | table | Translation table, which must be a mapping of Unicode ordinals to | Unicode ordinals, strings, or None. | | The table must implement lookup/indexing via __getitem__, for instance a | dictionary or list. If this operation raises LookupError, the character is | left untouched. Characters mapped to None are deleted. | | upper(self, /) | Return a copy of the string converted to uppercase. | | zfill(self, width, /) | Pad a numeric string with zeros on the left, to fill a field of the given width. | | The string is never truncated. | | ---------------------------------------------------------------------- | Static methods defined here: | | __new__(*args, **kwargs) from builtins.type | Create and return a new object. See help(type) for accurate signature. | | maketrans(...) | Return a translation table usable for str.translate(). | | If there is only one argument, it must be a dictionary mapping Unicode | ordinals (integers) or characters to Unicode ordinals, strings or None. | Character keys will be then converted to ordinals. | If there are two arguments, they must be strings of equal length, and | in the resulting dictionary, each character in x will be mapped to the | character at the same position in y. If there is a third argument, it | must be a string, whose characters will be mapped to None in the result.
help(object)
Help on class object in module builtins: class object | The base class of the class hierarchy. | | When called, it accepts no arguments and returns a new featureless | instance that has no instance attributes and cannot be given any. | | Built-in subclasses: | anext_awaitable | async_generator | async_generator_asend | async_generator_athrow | ... and 90 other subclasses | | Methods defined here: | | __delattr__(self, name, /) | Implement delattr(self, name). | | __dir__(self, /) | Default dir() implementation. | | __eq__(self, value, /) | Return self==value. | | __format__(self, format_spec, /) | Default object formatter. | | Return str(self) if format_spec is empty. Raise TypeError otherwise. | | __ge__(self, value, /) | Return self>=value. | | __getattribute__(self, name, /) | Return getattr(self, name). | | __getstate__(self, /) | Helper for pickle. | | __gt__(self, value, /) | Return self>value. | | __hash__(self, /) | Return hash(self). | | __init__(self, /, *args, **kwargs) | Initialize self. See help(type(self)) for accurate signature. | | __le__(self, value, /) | Return self<=value. | | __lt__(self, value, /) | Return self<value. | | __ne__(self, value, /) | Return self!=value. | | __reduce__(self, /) | Helper for pickle. | | __reduce_ex__(self, protocol, /) | Helper for pickle. | | __repr__(self, /) | Return repr(self). | | __setattr__(self, name, value, /) | Implement setattr(self, name, value). | | __sizeof__(self, /) | Size of object in memory, in bytes. | | __str__(self, /) | Return str(self). | | ---------------------------------------------------------------------- | Class methods defined here: | | __init_subclass__(...) from builtins.type | This method is called when a class is subclassed. | | The default implementation does nothing. It may be | overridden to extend subclasses. | | __subclasshook__(...) from builtins.type | Abstract classes can override this to customize issubclass(). | | This is invoked early on by abc.ABCMeta.__subclasscheck__(). 
| It should return True, False or NotImplemented. If it returns | NotImplemented, the normal algorithm is used. Otherwise, it | overrides the normal algorithm (and the outcome is cached). | | ---------------------------------------------------------------------- | Static methods defined here: | | __new__(*args, **kwargs) from builtins.type | Create and return a new object. See help(type) for accurate signature. | | ---------------------------------------------------------------------- | Data and other attributes defined here: | | __class__ = <class 'type'> | type(object) -> the object's type | type(name, bases, dict, **kwds) -> a new type
Datamodel Identifiers¶
The str
class has the object
based datamodel identifiers. Recall from the previous tutorial that these define the behaviour of the following builtins
identifiers:
Datamodel Identifier | Builtins Identifier | Builtins Identifier Type | Description |
---|---|---|---|
__new__ | | | constructs the instance self |
__init__ | | | initialise an instance with instance data (automatically invoked by __new__) |
__doc__ | ? | operator | view the docstring or initialisation signature docstring if a class |
__class__ | type | class | display the class type of an instance |
__dir__ | dir | function | list the directory of identifiers |
__repr__ | repr | function | formal str representation |
__str__ | str | class | informal str representation |
__hash__ | hash | function | hash value if immutable, if mutable __hash__ = None and the hash function cannot be used |
__getattribute__ | getattr | function | access an attribute (immutable) |
__setattr__ | setattr | function | set an attribute (mutable) |
__delattr__ | delattr | function | delete an attribute (mutable) |
__eq__ | == | operator | check if self is equal to value |
__ne__ | != | operator | check if self is not equal to value |
__lt__ | < | operator | check if self is less than value |
__le__ | <= | operator | check if self is less than or equal to value |
__gt__ | > | operator | check if self is greater than value |
__ge__ | >= | operator | check if self is greater than or equal to value |
__sizeof__ | sys.getsizeof | function | check the size of the instance in bytes |
The identifiers used by the pickle module or for subclassing are not mentioned here and were covered in the previous tutorial on the object
class.
These are supplemented by the following datamodel methods:
dir2(str, object, unique_only=True, print_output=False)['datamodel_method']
['__add__', '__contains__', '__getitem__', '__getnewargs__', '__iter__', '__len__', '__mod__', '__mul__', '__rmod__', '__rmul__']
The str
follows the design pattern of an immutable Collection
. A Collection
has the following datamodel identifiers:
Datamodel Identifier | Builtins Identifier | Builtins Identifier Type | Description |
---|---|---|---|
__len__ | len | function | the number of Unicode characters in a str |
__contains__ | in | keyword | check if str contains a substr |
__getitem__ | [] | | uses square brackets to index into a str |
__iter__ | iter | function | returns a str iterator |
__add__ | + | operator | concatenates two str instances |
__mul__ | * | operator | replicates a str by multiplication with an int instance 'hello' * 2 |
__rmul__ | * | operator | replicates a str by reverse multiplication with an int instance 2 * 'hello' |
There are also some str
specific additions:
Datamodel Identifier | Builtins Identifier | Builtins Identifier Type | Description |
---|---|---|---|
__mod__ | % | operator | create a formatted str by inserting values from a tuple into the placeholders of the str '%s and %s make %s' % (2, 3, 5) |
__rmod__ | % | operator | reflected form of %, tried when the left operand does not itself support %; it returns NotImplemented when the left operand is not a str, so (2, 3, 5) % '%s and %s make %s' raises a TypeError |
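As a minimal sketch of the % operator in practice (the instance names template, summed and report are illustrative only), note that each placeholder needs a conversion type such as %s or %d, and a literal percent sign is written as %%:

```python
# printf-style formatting: each placeholder needs a conversion type
template = '%s and %s make %s'
summed = template % (2, 3, 5)
print(summed)  # 2 and 3 make 5

# %d formats an int, %.2f a float to two decimal places, %% a literal percent sign
report = '%d items at %.2f each (%d%% off)' % (3, 1.5, 10)
print(report)  # 3 items at 1.50 each (10% off)
```

The original template is left unmodified and a new str instance is returned each time.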
The __getnewargs__
datamodel method is used by the pickle module
to serialise the str
.
Using ?
on the str
class shows the docstring of the __init__
signature:
str?
Init signature: str(self, /, *args, **kwargs) Docstring: str(object='') -> str str(bytes_or_buffer[, encoding[, errors]]) -> str Create a new string object from the given object. If encoding or errors is specified, then the object must expose a data buffer that will be decoded using the given encoding and error handler. Otherwise, returns the result of object.__str__() (if defined) or repr(object). encoding defaults to sys.getdefaultencoding(). errors defaults to 'strict'. Type: type Subclasses: StrEnum, DeferredConfigString, FoldedCase, _rstr, _ScriptTarget, _ModuleTarget, LSString, include, Keys, InputMode, ...
The datamodel identifier __new__
constructs the instance greeting
and invokes the __init__
signature to provide the str
with the required instance data:
greeting = 'Hello\tWorld!'
Using ?
with the str
instances gives the same docstring from the str
class but displays instance specific details:
greeting?
Type: str String form: Hello World! Length: 12 Docstring: str(object='') -> str str(bytes_or_buffer[, encoding[, errors]]) -> str Create a new string object from the given object. If encoding or errors is specified, then the object must expose a data buffer that will be decoded using the given encoding and error handler. Otherwise, returns the result of object.__str__() (if defined) or repr(object). encoding defaults to sys.getdefaultencoding(). errors defaults to 'strict'.
Such as the type
:
type(greeting)
str
formal (__repr__) and informal (__str__) str¶
The print function prints out the informal str
form:
print(greeting)
Hello World!
Recall that there is the formal and informal str
representation and the difference between these can be seen when an instance is printed (above) and examined in the cell output below:
greeting
'Hello\tWorld!'
The informal str
(__str__
datamodel method) defines the behaviour of the str
class. Casting a str
instance to a str
instance leaves it unchanged:
str(greeting)
'Hello\tWorld!'
Therefore the two are equivalent:
print(str(greeting))
Hello World!
print(greeting)
Hello World!
The formal repr
(__repr__
datamodel method) defines the behaviour of the repr
function:
repr(greeting)
"'Hello\\tWorld!'"
Notice that printing this shows the formal str
representation, quotations and escape characters included, which is the form used to instantiate a new str
instance:
print(repr(greeting))
'Hello\tWorld!'
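The relationship between the formal representation and instantiation can be sketched as a round trip. This is an illustrative example using ast.literal_eval from the standard library, which is not used elsewhere in this notebook; the names formal and rebuilt are assumptions:

```python
import ast

greeting = 'Hello\tWorld!'
formal = repr(greeting)

# The formal representation includes the two quotes and the two-character escape \t
print(len(greeting), len(formal))  # 12 15

# Evaluating the formal representation as a literal recreates an equal str
rebuilt = ast.literal_eval(formal)
print(rebuilt == greeting)  # True
```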
Indexing and Slicing (__len__, __contains__, __getitem__)¶
The length function len
returns the number of Unicode Characters in the str
:
len(greeting)
12
Notice that \t
is used to represent a single Unicode character. The custom function view
can be imported from the custom module categorize_identifiers
to view the str
instance in more detail:
Notice that the str
uses zero-based indexing where each index is an int
. Notice that the "first" index known as the start index is 0
and increases in int
steps of 1
up to but excluding the stop index which is the length of the collection. The last index is therefore 1 less than the length of the str
instance.
Notice that the datatype for each character is itself a str
and each of these str
instances have a length of 1 corresponding to a value that is a single Unicode character:
view(greeting)
Index Type Size Value 0 str 1 H 1 str 1 e 2 str 1 l 3 str 1 l 4 str 1 o 5 str 1 6 str 1 W 7 str 1 o 8 str 1 r 9 str 1 l 10 str 1 d 11 str 1 !
Square brackets can be used to select an index:
greeting[0]
'H'
greeting[len(greeting)-1]
'!'
greeting[11]
'!'
The slice
class can be used to select a substr using a slice:
slice?
Init signature: slice(self, /, *args, **kwargs) Docstring: slice(stop) slice(start, stop[, step]) Create a slice object. This is used for extended slicing (e.g. a[0:10:2]). Type: type Subclasses:
To select the first word the following slice can be used:
slice(0, 5, 1)
Note that because zero-based indexing is used, the start bound is inclusive and the stop bound is exclusive. A slice is therefore selected up to but excluding the stop bound:
Index | Type | Size | Value |
---|---|---|---|
0 | str | 1 | H |
1 | str | 1 | e |
2 | str | 1 | l |
3 | str | 1 | l |
4 | str | 1 | o |
5 |
start = 0
stop = 5
step = 1
greeting[slice(start, stop, step)]
'Hello'
Because the default step is 1:
greeting[slice(start, stop)]
'Hello'
Because the default start is 0:
greeting[slice(stop)]
'Hello'
Slicing is usually done shorthand using colons to separate out the start, stop and step values:
greeting[start:stop:step]
'Hello'
Because the default step is 1, this can be simplified to:
greeting[start:stop:]
'Hello'
The last colon can also be dropped:
greeting[start:stop]
'Hello'
Because the default start is 0 this can be simplified to:
greeting[:stop]
'Hello'
The default stop is the length of the str
and therefore the following returns the whole str
:
greeting[:]
'Hello\tWorld!'
Normally numbers are used in the slices directly:
greeting[0:5:1]
'Hello'
greeting[6:]
'World!'
The shorthand notation is generally preferred however a slice is sometimes used with a constant to make code more readable:
FIRST_WORD = slice(0, 5, 1)
greeting[FIRST_WORD]
'Hello'
The index before 0
is -1
and is taken to be the last Unicode character in the str
. Conceptualise the str
wrapping around itself and a negative index can be prescribed to each index in the str
until the "first" index is reached which has a negative index of the length of the str
instance:
view(greeting, neg_index=True)
view(greeting)
Index Type Size Value -12 str 1 H -11 str 1 e -10 str 1 l -9 str 1 l -8 str 1 o -7 str 1 -6 str 1 W -5 str 1 o -4 str 1 r -3 str 1 l -2 str 1 d -1 str 1 ! Index Type Size Value 0 str 1 H 1 str 1 e 2 str 1 l 3 str 1 l 4 str 1 o 5 str 1 6 str 1 W 7 str 1 o 8 str 1 r 9 str 1 l 10 str 1 d 11 str 1 !
When a negative step is used, such as -1
, notice this reverses the character order in the str
instance:
greeting[::-1]
'!dlroW\tolleH'
The default start is therefore index -1
and the default stop is -len(greeting)-1
because zero-based indexing is still used, which is inclusive of the start bound and exclusive of the stop bound:
start = -1
stop = -len(greeting) - 1
step = -1
greeting[start:stop:step]
'!dlroW\tolleH'
greeting[-1:-len(greeting)-1:-1]
'!dlroW\tolleH'
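The default bounds described above can be checked with the indices method of the slice class, which resolves omitted bounds for a given length. This is a minimal sketch and the names forward, backward and clipped are illustrative:

```python
greeting = 'Hello\tWorld!'

# slice.indices(length) resolves omitted bounds for a sequence of that length
forward = slice(None, None, 1).indices(len(greeting))
backward = slice(None, None, -1).indices(len(greeting))
print(forward)   # (0, 12, 1)
print(backward)  # (11, -1, -1)

# Slice bounds beyond the end are clipped instead of raising an IndexError
clipped = greeting[6:100]
print(clipped)  # World!
```

In the backward case the resolved stop of -1 is a position one before index 0, not the negative index of the last character, which is why the shorthand form has to spell the stop as -len(greeting)-1.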
The __contains__
datamodel method defines the behaviour of the in
keyword:
greeting.__contains__?
Signature: greeting.__contains__(key, /) Call signature: greeting.__contains__(*args, **kwargs) Type: method-wrapper String form: <method-wrapper '__contains__' of str object at 0x000001E0A1A51B70> Docstring: Return bool(key in self).
It can be used to check whether a substr is present within a str
:
greeting.__contains__('Hello')
True
It is more common to use the in
keyword to perform this check:
'Hello' in greeting
True
'hello' in greeting
False
Iteration (__iter__) and looping¶
If the str
instance letters
(plural) is instantiated:
letters = 'Hello World!'
view(letters)
Index Type Size Value 0 str 1 H 1 str 1 e 2 str 1 l 3 str 1 l 4 str 1 o 5 str 1 6 str 1 W 7 str 1 o 8 str 1 r 9 str 1 l 10 str 1 d 11 str 1 !
It can be cast into an iterator using iter
:
forward = iter(letters)
forward
is a str
ASCII iterator that iterates through a str
of ASCII characters, displaying a single character at a time:
forward
<str_ascii_iterator at 0x1e0a1a90f70>
The iterator has a number of datamodel identifiers:
dir2(forward, object, unique_only=True)
{'datamodel_method': ['__iter__', '__length_hint__', '__next__', '__setstate__']}
The most important one is __next__
which controls the behaviour of the builtins
function next
. next
is used to advance to the next value in the iterator. An iterator displays a single value at a time and each previous value is consumed when advanced:
next(forward)
'H'
next(forward)
'e'
next(forward)
'l'
In each case assignment can be used, to the instance name letter
(note singular):
letter = next(forward)
letter
'l'
next
can continue to be used on the ASCII iter
instance until all the letters are exhausted. In other words next
can be called on the ASCII iter
instance len(letters)
times. Alternatively all of the remaining elements in an iter
instance can be consumed by casting using the tuple
class:
tuple(forward)
('o', ' ', 'W', 'o', 'r', 'l', 'd', '!')
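What happens once an iterator is exhausted can be sketched as follows. The flag name exhausted is illustrative; the optional second argument of next is a default returned in place of raising StopIteration:

```python
letters = 'Hello World!'
forward = iter(letters)

# Consume every value in the iterator
consumed = tuple(forward)

# The iterator is now exhausted; supplying a default avoids StopIteration
print(next(forward, None))  # None

# Without a default, StopIteration is raised
exhausted = False
try:
    next(forward)
except StopIteration:
    exhausted = True
print(exhausted)  # True
```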
A range
instance can be constructed using len(letters)
. Note the similarities between the range
class and the slice
class:
range?
Init signature: range(self, /, *args, **kwargs) Docstring: range(stop) -> range object range(start, stop[, step]) -> range object Return an object that produces a sequence of integers from start (inclusive) to stop (exclusive) by step. range(i, j) produces i, i+1, i+2, ..., j-1. start defaults to 0, and stop is omitted! range(4) produces 0, 1, 2, 3. These are exactly the valid indices for a list of 4 elements. When step is given, it specifies the increment (or decrement). Type: type Subclasses:
slice?
Init signature: slice(self, /, *args, **kwargs) Docstring: slice(stop) slice(start, stop[, step]) Create a slice object. This is used for extended slicing (e.g. a[0:10:2]). Type: type Subclasses:
indexes = range(len(letters))
The range
instance is not an iter
instance and does not have the identifier __next__
but each index in it can be viewed by casting to a tuple
:
dir2(indexes, object, unique_only=True)
{'attribute': ['start', 'step', 'stop'], 'method': ['count', 'index'], 'datamodel_method': ['__bool__', '__contains__', '__getitem__', '__iter__', '__len__', '__reversed__']}
tuple(indexes)
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
A for
loop can be constructed from it:
for index in indexes:
print(index)
0 1 2 3 4 5 6 7 8 9 10 11
Notice the instructions in the for
loop body were repeated 12 times and the index
printed was updated each loop iteration.
The str
instance letters
can be cast into an iter
instance and next
can be used to advance through the iterator within the for
loop:
forward = iter(letters)
for index in indexes:
print(next(forward))
H e l l o W o r l d !
Creating an iter
instance and advancing through all its elements in a for
loop is a common task and is simplified using the syntax below:
for letter in letters:
print(letter)
H e l l o W o r l d !
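Conceptually the simplified syntax behaves like the following while loop. This is a sketch of the equivalent steps rather than the interpreter's actual implementation, and the names iterator and collected are illustrative:

```python
letters = 'Hello'
collected = []

iterator = iter(letters)         # iter is called once before looping
while True:
    try:
        letter = next(iterator)  # next is called on every pass
    except StopIteration:        # exhaustion ends the loop cleanly
        break
    collected.append(letter)

print(collected)  # ['H', 'e', 'l', 'l', 'o']
```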
Note sometimes it is useful to have both the index and the letter being looped through, this can be done using the enumerate
class:
enumerate?
Init signature: enumerate(iterable, start=0) Docstring: Return an enumerate object. iterable an object supporting iteration The enumerate object yields pairs containing a count (from start, which defaults to zero) and a value yielded by the iterable argument. enumerate is useful for obtaining an indexed list: (0, seq[0]), (1, seq[1]), (2, seq[2]), ... Type: type Subclasses:
enumerated_letters = enumerate(letters)
enumerated_letters
<enumerate at 0x1e0a1ab9d50>
Note that an enumerate
instance is also an iter
instance and has the datamodel identifier __next__
:
dir2(enumerated_letters, object, unique_only=True)
{'datamodel_method': ['__class_getitem__', '__iter__', '__next__']}
When next
is used a tuple
is output:
next(enumerated_letters)
(0, 'H')
This can be unpacked to two variables using an explicit tuple
instance:
(index, letter) = next(enumerated_letters)
index
1
letter
'e'
However it is more common to use implicit tuple
unpacking:
index, letter = next(enumerated_letters)
index
2
letter
'l'
A for
loop can be constructed with two loop variables using the enumerate
instance:
for index, letter in enumerate(letters):
print(f'{index}: {letter}')
0: H 1: e 2: l 3: l 4: o 5: 6: W 7: o 8: r 9: l 10: d 11: !
Sometimes this is useful when the index and letter are both required:
for index, letter in enumerate(letters):
print(index * letter)
e ll lll oooo WWWWWW ooooooo rrrrrrrr lllllllll dddddddddd !!!!!!!!!!!
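The enumerate docstring above also mentions a start parameter. A small sketch (the name numbered is illustrative) shows counting from 1 instead of 0:

```python
letters = 'abc'

# start shifts the count only; the letters themselves are unchanged
numbered = list(enumerate(letters, start=1))
print(numbered)  # [(1, 'a'), (2, 'b'), (3, 'c')]
```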
Immutability and hash (__hash__)¶
The __hash__
datamodel identifier is not equal to None
:
str.__hash__ == None
False
This means the str
is immutable. Recall immutable means once an instance is created, it cannot be modified. As a consequence each method has a return
value which returns a new instance, normally a new str
instance and leaves the original str
unmodified:
greeting = 'Hello World!'
greeting[-1:-len(greeting)-1:-1] #return value shown in cell output
'!dlroW olleH'
greeting # unchanged
'Hello World!'
As mentioned above reassignment should not be confused with mutability.
greeting = 'Hello World!'
hash(greeting), id(greeting)
(-7437652338063058407, 2064296737520)
When reassignment is used, the operation on the right is carried out first, in this case the operation highlighted in parenthesis. The instance data 'Hello World!'
is used. The return
value of this operation '!dlroW olleH'
is then assigned to the instance name greeting
on the left:
greeting = (greeting[-1:-len(greeting)-1:-1])
hash(greeting), id(greeting)
(-2364074818600270120, 2064296728944)
Therefore the instance name greeting
which can be conceptualised as a label has been unpeeled from the old instance and now is affixed to the new instance:
greeting
'!dlroW olleH'
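Immutability can also be seen directly by attempting to modify a character in place. In this minimal sketch the names message and jello are illustrative:

```python
greeting = 'Hello World!'

# Item assignment is not supported by an immutable str
message = ''
try:
    greeting[0] = 'J'
except TypeError as error:
    message = str(error)
print(message)  # 'str' object does not support item assignment

# A modified copy is instead built as a new instance
jello = 'J' + greeting[1:]
print(jello)     # Jello World!
print(greeting)  # Hello World!
```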
Because a str
is immutable and therefore hashable, it can be used as a key in a mapping such as a dict
which recall has the form:
{key: value,
key: value,
key: value}
A dict
can be conceptualised as a collection of storage locations and an immutable key is used to access each storage location which then gives a reference to an object
. The key must be immutable as a key that is modified will no longer fit the lock and therefore cannot be used.
Because str
instances are immutable they are commonly used as keys. An example is given in the two dict
instances below:
from matplotlib.colors import BASE_COLORS, CSS4_COLORS
BASE_COLORS
{'b': (0, 0, 1), 'g': (0, 0.5, 0), 'r': (1, 0, 0), 'c': (0, 0.75, 0.75), 'm': (0.75, 0, 0.75), 'y': (0.75, 0.75, 0), 'k': (0, 0, 0), 'w': (1, 1, 1)}
CSS4_COLORS
{'aliceblue': '#F0F8FF', 'antiquewhite': '#FAEBD7', 'aqua': '#00FFFF', 'aquamarine': '#7FFFD4', 'azure': '#F0FFFF', 'beige': '#F5F5DC', 'bisque': '#FFE4C4', 'black': '#000000', 'blanchedalmond': '#FFEBCD', 'blue': '#0000FF', 'blueviolet': '#8A2BE2', 'brown': '#A52A2A', 'burlywood': '#DEB887', 'cadetblue': '#5F9EA0', 'chartreuse': '#7FFF00', 'chocolate': '#D2691E', 'coral': '#FF7F50', 'cornflowerblue': '#6495ED', 'cornsilk': '#FFF8DC', 'crimson': '#DC143C', 'cyan': '#00FFFF', 'darkblue': '#00008B', 'darkcyan': '#008B8B', 'darkgoldenrod': '#B8860B', 'darkgray': '#A9A9A9', 'darkgreen': '#006400', 'darkgrey': '#A9A9A9', 'darkkhaki': '#BDB76B', 'darkmagenta': '#8B008B', 'darkolivegreen': '#556B2F', 'darkorange': '#FF8C00', 'darkorchid': '#9932CC', 'darkred': '#8B0000', 'darksalmon': '#E9967A', 'darkseagreen': '#8FBC8F', 'darkslateblue': '#483D8B', 'darkslategray': '#2F4F4F', 'darkslategrey': '#2F4F4F', 'darkturquoise': '#00CED1', 'darkviolet': '#9400D3', 'deeppink': '#FF1493', 'deepskyblue': '#00BFFF', 'dimgray': '#696969', 'dimgrey': '#696969', 'dodgerblue': '#1E90FF', 'firebrick': '#B22222', 'floralwhite': '#FFFAF0', 'forestgreen': '#228B22', 'fuchsia': '#FF00FF', 'gainsboro': '#DCDCDC', 'ghostwhite': '#F8F8FF', 'gold': '#FFD700', 'goldenrod': '#DAA520', 'gray': '#808080', 'green': '#008000', 'greenyellow': '#ADFF2F', 'grey': '#808080', 'honeydew': '#F0FFF0', 'hotpink': '#FF69B4', 'indianred': '#CD5C5C', 'indigo': '#4B0082', 'ivory': '#FFFFF0', 'khaki': '#F0E68C', 'lavender': '#E6E6FA', 'lavenderblush': '#FFF0F5', 'lawngreen': '#7CFC00', 'lemonchiffon': '#FFFACD', 'lightblue': '#ADD8E6', 'lightcoral': '#F08080', 'lightcyan': '#E0FFFF', 'lightgoldenrodyellow': '#FAFAD2', 'lightgray': '#D3D3D3', 'lightgreen': '#90EE90', 'lightgrey': '#D3D3D3', 'lightpink': '#FFB6C1', 'lightsalmon': '#FFA07A', 'lightseagreen': '#20B2AA', 'lightskyblue': '#87CEFA', 'lightslategray': '#778899', 'lightslategrey': '#778899', 'lightsteelblue': '#B0C4DE', 'lightyellow': '#FFFFE0', 'lime': 
'#00FF00', 'limegreen': '#32CD32', 'linen': '#FAF0E6', 'magenta': '#FF00FF', 'maroon': '#800000', 'mediumaquamarine': '#66CDAA', 'mediumblue': '#0000CD', 'mediumorchid': '#BA55D3', 'mediumpurple': '#9370DB', 'mediumseagreen': '#3CB371', 'mediumslateblue': '#7B68EE', 'mediumspringgreen': '#00FA9A', 'mediumturquoise': '#48D1CC', 'mediumvioletred': '#C71585', 'midnightblue': '#191970', 'mintcream': '#F5FFFA', 'mistyrose': '#FFE4E1', 'moccasin': '#FFE4B5', 'navajowhite': '#FFDEAD', 'navy': '#000080', 'oldlace': '#FDF5E6', 'olive': '#808000', 'olivedrab': '#6B8E23', 'orange': '#FFA500', 'orangered': '#FF4500', 'orchid': '#DA70D6', 'palegoldenrod': '#EEE8AA', 'palegreen': '#98FB98', 'paleturquoise': '#AFEEEE', 'palevioletred': '#DB7093', 'papayawhip': '#FFEFD5', 'peachpuff': '#FFDAB9', 'peru': '#CD853F', 'pink': '#FFC0CB', 'plum': '#DDA0DD', 'powderblue': '#B0E0E6', 'purple': '#800080', 'rebeccapurple': '#663399', 'red': '#FF0000', 'rosybrown': '#BC8F8F', 'royalblue': '#4169E1', 'saddlebrown': '#8B4513', 'salmon': '#FA8072', 'sandybrown': '#F4A460', 'seagreen': '#2E8B57', 'seashell': '#FFF5EE', 'sienna': '#A0522D', 'silver': '#C0C0C0', 'skyblue': '#87CEEB', 'slateblue': '#6A5ACD', 'slategray': '#708090', 'slategrey': '#708090', 'snow': '#FFFAFA', 'springgreen': '#00FF7F', 'steelblue': '#4682B4', 'tan': '#D2B48C', 'teal': '#008080', 'thistle': '#D8BFD8', 'tomato': '#FF6347', 'turquoise': '#40E0D0', 'violet': '#EE82EE', 'wheat': '#F5DEB3', 'white': '#FFFFFF', 'whitesmoke': '#F5F5F5', 'yellow': '#FFFF00', 'yellowgreen': '#9ACD32'}
Note in each case the key is an easy to remember letter or English word and the value it corresponds to is a harder to remember tuple
of the format (r, g, b)
or hexadecimal value of the form '#rrggbb'
.
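The key lookup can be sketched without matplotlib using a small hand-written dict. The name colors is illustrative and its values are copied from the BASE_COLORS output above:

```python
# A small dict with str keys; values copied from BASE_COLORS above
colors = {'r': (1, 0, 0), 'g': (0, 0.5, 0), 'b': (0, 0, 1)}
red = colors['r']
print(red)  # (1, 0, 0)

# A mutable list is unhashable and is rejected as a key
list_key_rejected = False
try:
    bad = {['r']: (1, 0, 0)}
except TypeError:
    list_key_rejected = True
print(list_key_rejected)  # True
```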
Because a str
is immutable, the function getattr
can be used to access the identifier as a str
:
getattr(str, '__len__')
<slot wrapper '__len__' of 'str' objects>
str.__len__
<slot wrapper '__len__' of 'str' objects>
The mutable counterparts setattr
and delattr
cannot be used because a str
is immutable and therefore an attribute cannot be changed or deleted.
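A short sketch of this behaviour on a str instance follows. The attribute name language and the two flag names are hypothetical and chosen only for illustration:

```python
greeting = 'Hello World!'

# A new attribute cannot be set on a str instance
set_failed = False
try:
    setattr(greeting, 'language', 'English')
except AttributeError:
    set_failed = True

# An existing identifier cannot be deleted from a str instance
del_failed = False
try:
    delattr(greeting, 'upper')
except AttributeError:
    del_failed = True

print(set_failed, del_failed)  # True True
```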
Comparison Operators (__gt__, __ge__, __lt__, __le__, __eq__ and __ne__)¶
Early computers were based on a typewriter, which essentially prints English characters onto a sheet of paper. To achieve such a task, a number of non-printable commands such as the carriage return (moving the carriage back to the left) and the form feed (advancing the paper to the top of the next page) are required, as well as the printable characters such as the English letters, numbers, and whitespace.
Each command has to be mapped physically into the computer's memory. Fundamentally the computer can only store data in the form of a bit, which is essentially a digital switch.
A single switch has the possible values 0
, 1
which is 2 ** 1
combinations which is a total of 2
. Note the combination 0
is included so 0:2
is inclusive of the lower bound 0
and exclusive of the upper bound 2
.
More typically 8
of these switches are combined into a single logical unit called a byte. A byte has 2 ** 8
combinations which is a total of 256
. Note the combination 0
is included so 0:256
is inclusive of the lower bound 0
and exclusive of the upper bound 256
.
One of the most popular sets of commands was developed in America and is known as the American Standard Code for Information Interchange (ASCII). The first 33
combinations correspond to non-printable characters such as the carriage return and form feed as previously discussed in addition to a number of additional hardware related commands.
Each bit can be 0
or 1
and the byte sequence corresponds to the physical position of the 8
switches. As binary is not human readable, the hexadecimal system is also used, which has 16
characters 0
, 1
, 2
, 3
, 4
, 5
, 6
, 7
, 8
, 9
, a
, b
, c
, d
, e
, f
. 2 ** 4
is 16
combinations and therefore each half of the byte is represented by its own hexadecimal character. These numbering systems are shown alongside the number in decimal.
byte | hex | num | command |
---|---|---|---|
00000000 | 00 | 000 | null |
00000001 | 01 | 001 | start of heading |
00000010 | 02 | 002 | start of text |
00000011 | 03 | 003 | end of text |
00000100 | 04 | 004 | end of transmission |
00000101 | 05 | 005 | enquiry |
00000110 | 06 | 006 | acknowledge |
00000111 | 07 | 007 | bell |
00001000 | 08 | 008 | backspace |
00001001 | 09 | 009 | horizontal tab |
00001010 | 0a | 010 | new line |
00001011 | 0b | 011 | vertical tab |
00001100 | 0c | 012 | form feed |
00001101 | 0d | 013 | carriage return |
00001110 | 0e | 014 | shift out |
00001111 | 0f | 015 | shift in |
00010000 | 10 | 016 | data link escape |
00010001 | 11 | 017 | device control 1 |
00010010 | 12 | 018 | device control 2 |
00010011 | 13 | 019 | device control 3 |
00010100 | 14 | 020 | device control 4 |
00010101 | 15 | 021 | negative acknowledge |
00010110 | 16 | 022 | synchronous idle |
00010111 | 17 | 023 | end of transmission block |
00011000 | 18 | 024 | cancel |
00011001 | 19 | 025 | end of medium |
00011010 | 1a | 026 | substitute |
00011011 | 1b | 027 | escape |
00011100 | 1c | 028 | file separator |
00011101 | 1d | 029 | group separator |
00011110 | 1e | 030 | record separator |
00011111 | 1f | 031 | unit separator |
00100000 | 20 | 032 | space |
The remaining combinations, up to 127 (half of the 256 combinations available in a byte), contain the characters most commonly used in the English language.
byte | hex | num | character |
---|---|---|---|
00100001 | 21 | 033 | ! |
00100010 | 22 | 034 | " |
00100011 | 23 | 035 | # |
00100100 | 24 | 036 | $ |
00100101 | 25 | 037 | % |
00100110 | 26 | 038 | & |
00100111 | 27 | 039 | ' |
00101000 | 28 | 040 | ( |
00101001 | 29 | 041 | ) |
00101010 | 2a | 042 | * |
00101011 | 2b | 043 | + |
00101100 | 2c | 044 | , |
00101101 | 2d | 045 | - |
00101110 | 2e | 046 | . |
00101111 | 2f | 047 | / |
00110000 | 30 | 048 | 0 |
00110001 | 31 | 049 | 1 |
00110010 | 32 | 050 | 2 |
00110011 | 33 | 051 | 3 |
00110100 | 34 | 052 | 4 |
00110101 | 35 | 053 | 5 |
00110110 | 36 | 054 | 6 |
00110111 | 37 | 055 | 7 |
00111000 | 38 | 056 | 8 |
00111001 | 39 | 057 | 9 |
00111010 | 3a | 058 | : |
00111011 | 3b | 059 | ; |
00111100 | 3c | 060 | < |
00111101 | 3d | 061 | = |
00111110 | 3e | 062 | > |
00111111 | 3f | 063 | ? |
01000000 | 40 | 064 | @ |
01000001 | 41 | 065 | A |
01000010 | 42 | 066 | B |
01000011 | 43 | 067 | C |
01000100 | 44 | 068 | D |
01000101 | 45 | 069 | E |
01000110 | 46 | 070 | F |
01000111 | 47 | 071 | G |
01001000 | 48 | 072 | H |
01001001 | 49 | 073 | I |
01001010 | 4a | 074 | J |
01001011 | 4b | 075 | K |
01001100 | 4c | 076 | L |
01001101 | 4d | 077 | M |
01001110 | 4e | 078 | N |
01001111 | 4f | 079 | O |
01010000 | 50 | 080 | P |
01010001 | 51 | 081 | Q |
01010010 | 52 | 082 | R |
01010011 | 53 | 083 | S |
01010100 | 54 | 084 | T |
01010101 | 55 | 085 | U |
01010110 | 56 | 086 | V |
01010111 | 57 | 087 | W |
01011000 | 58 | 088 | X |
01011001 | 59 | 089 | Y |
01011010 | 5a | 090 | Z |
01011011 | 5b | 091 | [ |
01011100 | 5c | 092 | \ |
01011101 | 5d | 093 | ] |
01011110 | 5e | 094 | ^ |
01011111 | 5f | 095 | _ |
01100000 | 60 | 096 | ` |
01100001 | 61 | 097 | a |
01100010 | 62 | 098 | b |
01100011 | 63 | 099 | c |
01100100 | 64 | 100 | d |
01100101 | 65 | 101 | e |
01100110 | 66 | 102 | f |
01100111 | 67 | 103 | g |
01101000 | 68 | 104 | h |
01101001 | 69 | 105 | i |
01101010 | 6a | 106 | j |
01101011 | 6b | 107 | k |
01101100 | 6c | 108 | l |
01101101 | 6d | 109 | m |
01101110 | 6e | 110 | n |
01101111 | 6f | 111 | o |
01110000 | 70 | 112 | p |
01110001 | 71 | 113 | q |
01110010 | 72 | 114 | r |
01110011 | 73 | 115 | s |
01110100 | 74 | 116 | t |
01110101 | 75 | 117 | u |
01110110 | 76 | 118 | v |
01110111 | 77 | 119 | w |
01111000 | 78 | 120 | x |
01111001 | 79 | 121 | y |
01111010 | 7a | 122 | z |
01111011 | 7b | 123 | { |
01111100 | 7c | 124 | | |
01111101 | 7d | 125 | } |
01111110 | 7e | 126 | ~ |
01111111 | 7f | 127 | DEL |
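Any row of the tables above can be reproduced with the builtins format, int, chr and ord. This is a minimal sketch for the row of the character 'A' and the instance names are illustrative:

```python
number = 65

# Zero-padded binary (8 digits), hexadecimal (2 digits) and decimal (3 digits)
byte_form = format(number, '08b')
hex_form = format(number, '02x')
num_form = format(number, '03d')
print(byte_form, hex_form, num_form)  # 01000001 41 065

# Converting back from each base gives the same number
print(int(byte_form, 2), int(hex_form, 16))  # 65 65

# chr and ord convert between the number and the character
character = chr(number)
print(character, ord(character))  # A 65
```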
The Unicode str
uses a single encoding table, the Unicode Transformation Format 'utf-8'
and this encodes a single Unicode character to a numeric combination. This numeric combination is recognised by a human as a decimal integer but stored on a computer using bits. 'utf-8'
uses 8 bits (1 byte) for each ASCII character and 2-4 bytes for each additional character outside the ASCII range.
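The number of bytes per character can be checked by encoding single characters with the encode method. Note this sketch measures the encoded bytes form and not the memory occupied by the str instance itself; the example characters and the name sizes are illustrative:

```python
# Number of bytes used by 'utf-8' for characters in different ranges
sizes = {}
for character in ('a', 'α', '€', '😀'):
    encoded = character.encode('utf-8')
    sizes[character] = len(encoded)
    print(character, len(encoded), encoded.hex())

# a 1 61
# α 2 ceb1
# € 3 e282ac
# 😀 4 f09f9880
```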
sys.getsizeof
returns the number of bytes occupied by the str
instance. Note that there is a base memory allocation for a str
instance:
import sys
sys.getsizeof('') # 41
41
Then memory allocation for each character in the str
instances:
sys.getsizeof('a') # 41 + 1
42
sys.getsizeof('ab') # 41 + (2 * 1)
43
Use of non-English characters requires a higher memory overhead and requires a larger number of bytes per character:
sys.getsizeof('α') # 41 + 17 + (1 * 2)
60
sys.getsizeof('αβ') # 41 + 17 + (2 * 2)
62
Python also has additional text classes such as the bytes
class which can use additional encoding tables, usually from older standards which will be explored in the next notebook.
Each character is ordinal, the characters 'a'
and 'A'
are ASCII characters:
ord('a')
97
ord('A')
65
Because these are ASCII they are stored over a single byte. Recall a single byte has the following number of combinations:
2 ** (1 * 8)
256
The character 'α'
is non-ASCII and has a value that exceeds this and is therefore stored over multiple bytes:
ord('α')
945
In this case, the Greek letter is stored over 2 bytes:
2 ** (2 * 8)
65536
Because the str
instance is ordinal, the six comparison operators can be used to compare the numeric values of str
instances:
'a' > 'A'
True
The above is essentially a comparison between the two ordinal values:
97 > 65
True
This can be used with longer str
instances:
'apples' > 'bananas'
False
A check is made letter by letter:
'a' > 'b'
False
If the first letters are equal, the second letters are compared:
'aa' > 'ab'
False
The 6 comparison operators can be used:
'aa' < 'aa', 'aa' <= 'aa', 'aa' == 'aa', 'aa' >= 'aa', 'aa' > 'aa', 'aa' != 'aa'
(False, True, True, True, False, False)
'aa' < 'ab', 'aa' <= 'ab', 'aa' == 'ab', 'aa' >= 'ab', 'aa' > 'ab', 'aa' != 'ab'
(True, True, False, False, False, True)
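Two consequences of the letter by letter comparison are worth sketching. A shorter str that is a prefix of a longer one compares as smaller, and because uppercase letters have lower ordinal values than lowercase letters they sort first. The list words is illustrative, and str.casefold is used here as one possible key for a case-insensitive sort:

```python
# A prefix compares as smaller than the longer str
print('app' < 'apple')  # True

words = ['banana', 'Cherry', 'apple']

# Default sorting uses the ordinal values, so 'C' (67) comes before 'a' (97)
by_ordinal = sorted(words)
print(by_ordinal)  # ['Cherry', 'apple', 'banana']

# Supplying a key compares the casefolded form instead
by_casefold = sorted(words, key=str.casefold)
print(by_casefold)  # ['apple', 'banana', 'Cherry']
```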
Instance Methods¶
If the str
instance greeting
is instantiated:
greeting = 'Hello World!'
Most of the additional identifiers available to it are instance methods:
dir2(greeting, print_output=False)['method']
['capitalize', 'casefold', 'center', 'count', 'encode', 'endswith', 'expandtabs', 'find', 'format', 'format_map', 'index', 'isalnum', 'isalpha', 'isascii', 'isdecimal', 'isdigit', 'isidentifier', 'islower', 'isnumeric', 'isprintable', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'maketrans', 'partition', 'removeprefix', 'removesuffix', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']
Recall that the identifiers themselves are defined in the str
class:
dir2(str, print_output=False)['method']
['capitalize', 'casefold', 'center', 'count', 'encode', 'endswith', 'expandtabs', 'find', 'format', 'format_map', 'index', 'isalnum', 'isalpha', 'isascii', 'isdecimal', 'isdigit', 'isidentifier', 'islower', 'isnumeric', 'isprintable', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'maketrans', 'partition', 'removeprefix', 'removesuffix', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']
Instance methods are accessed via an instance and therefore have access to the instance data. The docstring of the capitalize
method can be examined from a str
instance:
greeting.capitalize?
Signature: greeting.capitalize() Docstring: Return a capitalized version of the string. More specifically, make the first character have upper case and the rest lower case. Type: builtin_function_or_method
Or it can be examined from the class str
itself:
str.capitalize?
Signature: str.capitalize(self, /) Docstring: Return a capitalized version of the string. More specifically, make the first character have upper case and the rest lower case. Type: method_descriptor
Note that the identifier name is in American English:
Word | English Dialect |
---|---|
capitalize | American |
capitalise | British |
When the method capitalize
is called from an instance, it has access to the instance data. As a consequence, this method requires no additional data to operate, which is why its parentheses are empty:
greeting.capitalize()
In contrast, when the method is called from the class itself, it has no instance data to work from and therefore an instance must be provided. In Python self
means this instance:
str.capitalize(self, /)
self
occurs before a /
and therefore must be provided positionally.
As the str
is immutable the method has a return
value and returns a new str
instance that has been capitalised:
Docstring:
Return a capitalized version of the string.
When the method is called from an instance:
greeting.capitalize()
'Hello world!'
The new capitalised str
instance displays in the cell output. This is a new instance and the original instance is unchanged in variables:
variables()
Type | Size/Shape | Value | |
---|---|---|---|
Instance Name | |||
greeting | str | 12 | Hello World! |
farewell | str | 3 | bye |
start | int | -1 | |
stop | int | -13 | |
step | int | -1 | |
letters | str | 12 | Hello World! |
letter | str | 1 | ! |
indexes | range | 12 | range(0, 12) |
index | int | 11 | |
BASE_COLORS | dict | 8 | {'b': (0, 0, 1), 'g': (0, 0.5, 0), 'r': (1, 0, 0), 'c': (0, 0.75, 0.75), 'm': (0.75, 0, 0.75), 'y': (0.75, 0.75, 0), 'k': (0, 0, 0), 'w': (1, 1, 1)} |
CSS4_COLORS | dict | 148 | {'aliceblue': '#F0F8FF', 'antiquewhite': '#FAEBD7', 'aqua': '#00FFFF', 'aquamarine': '#7FFFD4', 'azure': '#F0FFFF', 'beige': '#F5F5DC', 'bisque': '#FFE4C4', 'black': '#000000', 'blanchedalmond': '… |
Since this new instance is not assigned to an instance name, it has no references and is automatically removed by Python's garbage collection. It can be assigned to an instance name using:
cap_greeting = greeting.capitalize()
Notice there is no cell output, as the new instance is now assigned to the instance name instead of being shown in the cell output. This can be seen in variables:
variables()
Type | Size/Shape | Value | |
---|---|---|---|
Instance Name | |||
greeting | str | 12 | Hello World! |
farewell | str | 3 | bye |
start | int | -1 | |
stop | int | -13 | |
step | int | -1 | |
letters | str | 12 | Hello World! |
letter | str | 1 | ! |
indexes | range | 12 | range(0, 12) |
index | int | 11 | |
BASE_COLORS | dict | 8 | {'b': (0, 0, 1), 'g': (0, 0.5, 0), 'r': (1, 0, 0), 'c': (0, 0.75, 0.75), 'm': (0.75, 0, 0.75), 'y': (0.75, 0.75, 0), 'k': (0, 0, 0), 'w': (1, 1, 1)} |
CSS4_COLORS | dict | 148 | {'aliceblue': '#F0F8FF', 'antiquewhite': '#FAEBD7', 'aqua': '#00FFFF', 'aquamarine': '#7FFFD4', 'azure': '#F0FFFF', 'beige': '#F5F5DC', 'bisque': '#FFE4C4', 'black': '#000000', 'blanchedalmond': '… |
cap_greeting | str | 12 | Hello world! |
If the instance method is invoked from a class, the instance self
must be provided positionally as the first input argument:
str.capitalize(farewell)
'Bye'
Failure to supply an instance will result in a TypeError
. This can be seen by inputting the following into the blank code cell below:
str.capitalize()
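The two ways of calling the method can be compared side by side. The following is a minimal sketch that also catches the TypeError, so the cell runs without stopping. The wording of the error message is not relied upon as it may differ between Python versions:

```python
greeting = 'Hello World!'

# Calling from the instance and from the class gives the same result.
from_instance = greeting.capitalize()
from_class = str.capitalize(greeting)

# Calling from the class without supplying an instance raises a TypeError.
try:
    str.capitalize()
    raised = False
except TypeError as error:
    raised = True
    print('TypeError:', error)
```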
Case Methods¶
The str
case method capitalize
has already been examined:
greeting.capitalize?
Signature: greeting.capitalize() Docstring: Return a capitalized version of the string. More specifically, make the first character have upper case and the rest lower case. Type: builtin_function_or_method
greeting.capitalize()
'Hello world!'
There are associated identifiers such as:
lower
casefold
upper
title
swapcase
The docstrings of these can all be examined:
greeting.lower?
Signature: greeting.lower() Docstring: Return a copy of the string converted to lowercase. Type: builtin_function_or_method
greeting.casefold?
Signature: greeting.casefold() Docstring: Return a version of the string suitable for caseless comparisons. Type: builtin_function_or_method
greeting.upper?
Signature: greeting.upper() Docstring: Return a copy of the string converted to uppercase. Type: builtin_function_or_method
greeting.title?
Signature: greeting.title() Docstring: Return a version of the string where each word is titlecased. More specifically, words start with uppercased characters and all remaining cased characters have lower case. Type: builtin_function_or_method
greeting.swapcase?
Signature: greeting.swapcase() Docstring: Convert uppercase characters to lowercase and lowercase characters to uppercase. Type: builtin_function_or_method
All of these case identifiers only require instance data and return a new str
instance:
'hEllo wOrld'.lower()
'hello world'
'hEllo wOrld'.casefold()
'hello world'
'hEllo wOrld'.upper()
'HELLO WORLD'
'hEllo wOrld'.swapcase()
'HeLLO WoRLD'
'hEllo wOrld'.title()
'Hello World'
casefold is similar to lower but is more aggressive, as it is intended for caseless comparisons. The difference is seen with some non-English characters, such as the German ß and the Greek final sigma ς, which are lower case variants that casefold replaces:
'ÄäÜüÖöẞß'.lower()
'ääüüöößß'
'ÄäÜüÖöẞß'.casefold()
'ääüüöössss'
'ΑαΒβΓγΔδΕεΖζΗηΘθΙιΚκΛλΜμΝνΞξΟοΠπΡρΣσςΤτΥυΦφΧχΨψΩω'.lower()
'ααββγγδδεεζζηηθθιικκλλμμννξξοοππρρσσςττυυφφχχψψωω'
'ΑαΒβΓγΔδΕεΖζΗηΘθΙιΚκΛλΜμΝνΞξΟοΠπΡρΣσςΤτΥυΦφΧχΨψΩω'.casefold()
'ααββγγδδεεζζηηθθιικκλλμμννξξοοππρρσσσττυυφφχχψψωω'
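The difference matters when two str instances are compared without regard to case. A minimal sketch, using the German word Straße as an illustrative example:

```python
# lower leaves 'ß' unchanged, so the two spellings do not match.
lower_match = 'Straße'.lower() == 'STRASSE'.lower()

# casefold expands 'ß' to 'ss', so the two spellings match.
casefold_match = 'Straße'.casefold() == 'STRASSE'.casefold()

print(lower_match, casefold_match)
```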
Boolean Identifiers¶
A number of identifiers are used to examine a specific property of a str
and return a boolean of True
if it has that property and False
otherwise:
greeting.isupper?
Signature: greeting.isupper() Docstring: Return True if the string is an uppercase string, False otherwise. A string is uppercase if all cased characters in the string are uppercase and there is at least one cased character in the string. Type: builtin_function_or_method
greeting.islower?
Signature: greeting.islower() Docstring: Return True if the string is a lowercase string, False otherwise. A string is lowercase if all cased characters in the string are lowercase and there is at least one cased character in the string. Type: builtin_function_or_method
greeting.istitle?
Signature: greeting.istitle() Docstring: Return True if the string is a title-cased string, False otherwise. In a title-cased string, upper- and title-case characters may only follow uncased characters and lowercase characters only cased ones. Type: builtin_function_or_method
For example:
'HELLO'.isupper()
True
'Hello'.isupper()
False
'hello'.islower()
True
'Hello'.islower()
False
'Hello'.istitle()
True
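The docstrings mention cased characters. The following short sketch (the instance names are only illustrative) shows that uncased characters such as digits, spaces and punctuation are ignored, but that at least one cased character is required:

```python
# Uncased characters are ignored by the check...
mixed = 'HELLO 123!'.isupper()

# ...but at least one cased character must be present.
no_cased_upper = '123'.isupper()
no_cased_lower = '123'.islower()

# Every word must start with an uppercase character to be title-cased.
title_checks = ('Hello World'.istitle(), 'Hello world'.istitle())
```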
Valid Identifier Names¶
The str
method isidentifier
will check to see if the str
is valid for an identifier name. This can be useful to check before assignment of an instance to an instance name:
greeting.isidentifier?
Signature: greeting.isidentifier() Docstring: Return True if the string is a valid Python identifier, False otherwise. Call keyword.iskeyword(s) to test whether string s is a reserved identifier, such as "def" or "class". Type: builtin_function_or_method
A lowercase str
instance without spaces or special characters can be checked to see if the identifier is an acceptable identifier name:
'hello'.isidentifier()
True
This means the following is acceptable:
hello = 'some string'
hello = 'some string'
variables()
Type | Size/Shape | Value | |
---|---|---|---|
Instance Name | |||
greeting | str | 12 | Hello World! |
farewell | str | 3 | bye |
start | int | -1 | |
stop | int | -13 | |
step | int | -1 | |
letters | str | 12 | Hello World! |
letter | str | 1 | ! |
indexes | range | 12 | range(0, 12) |
index | int | 11 | |
BASE_COLORS | dict | 8 | {'b': (0, 0, 1), 'g': (0, 0.5, 0), 'r': (1, 0, 0), 'c': (0, 0.75, 0.75), 'm': (0.75, 0, 0.75), 'y': (0.75, 0.75, 0), 'k': (0, 0, 0), 'w': (1, 1, 1)} |
CSS4_COLORS | dict | 148 | {'aliceblue': '#F0F8FF', 'antiquewhite': '#FAEBD7', 'aqua': '#00FFFF', 'aquamarine': '#7FFFD4', 'azure': '#F0FFFF', 'beige': '#F5F5DC', 'bisque': '#FFE4C4', 'black': '#000000', 'blanchedalmond': '… |
cap_greeting | str | 12 | Hello world! |
hello | str | 11 | some string |
A space is not acceptable and attempted use of such an identifier will give a SyntaxError
:
'hello world'.isidentifier()
False
This means the following is not acceptable:
hello world = 'some string'
because the Python interpreter sees two instance names to the left of the assignment operator.
An underscore is acceptable and identifier names generally use snake_case
:
'hello_world'.isidentifier()
True
This means the following is acceptable:
hello_world = 'some string'
hello_world = 'some string'
variables()
Type | Size/Shape | Value | |
---|---|---|---|
Instance Name | |||
greeting | str | 12 | Hello World! |
farewell | str | 3 | bye |
start | int | -1 | |
stop | int | -13 | |
step | int | -1 | |
letters | str | 12 | Hello World! |
letter | str | 1 | ! |
indexes | range | 12 | range(0, 12) |
index | int | 11 | |
BASE_COLORS | dict | 8 | {'b': (0, 0, 1), 'g': (0, 0.5, 0), 'r': (1, 0, 0), 'c': (0, 0.75, 0.75), 'm': (0.75, 0, 0.75), 'y': (0.75, 0.75, 0), 'k': (0, 0, 0), 'w': (1, 1, 1)} |
CSS4_COLORS | dict | 148 | {'aliceblue': '#F0F8FF', 'antiquewhite': '#FAEBD7', 'aqua': '#00FFFF', 'aquamarine': '#7FFFD4', 'azure': '#F0FFFF', 'beige': '#F5F5DC', 'bisque': '#FFE4C4', 'black': '#000000', 'blanchedalmond': '… |
cap_greeting | str | 12 | Hello world! |
hello | str | 11 | some string |
hello_world | str | 11 | some string |
Numbers can be included in an identifier name:
'hello_world2'.isidentifier()
True
This means the following is acceptable:
hello_world2 = 'some string'
hello_world2 = 'some string'
variables()
Type | Size/Shape | Value | |
---|---|---|---|
Instance Name | |||
greeting | str | 12 | Hello World! |
farewell | str | 3 | bye |
start | int | -1 | |
stop | int | -13 | |
step | int | -1 | |
letters | str | 12 | Hello World! |
letter | str | 1 | ! |
indexes | range | 12 | range(0, 12) |
index | int | 11 | |
BASE_COLORS | dict | 8 | {'b': (0, 0, 1), 'g': (0, 0.5, 0), 'r': (1, 0, 0), 'c': (0, 0.75, 0.75), 'm': (0.75, 0, 0.75), 'y': (0.75, 0.75, 0), 'k': (0, 0, 0), 'w': (1, 1, 1)} |
CSS4_COLORS | dict | 148 | {'aliceblue': '#F0F8FF', 'antiquewhite': '#FAEBD7', 'aqua': '#00FFFF', 'aquamarine': '#7FFFD4', 'azure': '#F0FFFF', 'beige': '#F5F5DC', 'bisque': '#FFE4C4', 'black': '#000000', 'blanchedalmond': '… |
cap_greeting | str | 12 | Hello world! |
hello | str | 11 | some string |
hello_world | str | 11 | some string |
hello_world2 | str | 11 | some string |
However an identifier cannot begin with a number and the attempted use of such an identifier will give a SyntaxError
:
'2hello_world'.isidentifier()
False
This means the following is not acceptable:
2hello_world = 'some string'
The Python interpreter reads the leading digit as the start of a number, but this number then contains letters which are unrecognised in the context of the decimal numeric system.
Special characters cannot be used as part of an identifier as they are recognised by Python as operators. Including them in an identifier will give a SyntaxError
:
'hello-world2'.isidentifier()
False
This means the following is not acceptable:
hello-world2 = 'some string'
because the Python interpreter is seeing an operation to carry out subtraction.
Upper case identifiers can be used but generally PascalCase
is reserved for a class name:
'PascalCase'.isidentifier()
True
This means the following is acceptable:
PascalCase = 'some string'
However this naming convention is normally reserved for a class.
PascalCase = 'some string'
variables()
Type | Size/Shape | Value | |
---|---|---|---|
Instance Name | |||
greeting | str | 12 | Hello World! |
farewell | str | 3 | bye |
start | int | -1 | |
stop | int | -13 | |
step | int | -1 | |
letters | str | 12 | Hello World! |
letter | str | 1 | ! |
indexes | range | 12 | range(0, 12) |
index | int | 11 | |
BASE_COLORS | dict | 8 | {'b': (0, 0, 1), 'g': (0, 0.5, 0), 'r': (1, 0, 0), 'c': (0, 0.75, 0.75), 'm': (0.75, 0, 0.75), 'y': (0.75, 0.75, 0), 'k': (0, 0, 0), 'w': (1, 1, 1)} |
CSS4_COLORS | dict | 148 | {'aliceblue': '#F0F8FF', 'antiquewhite': '#FAEBD7', 'aqua': '#00FFFF', 'aquamarine': '#7FFFD4', 'azure': '#F0FFFF', 'beige': '#F5F5DC', 'bisque': '#FFE4C4', 'black': '#000000', 'blanchedalmond': '… |
cap_greeting | str | 12 | Hello world! |
hello | str | 11 | some string |
hello_world | str | 11 | some string |
hello_world2 | str | 11 | some string |
PascalCase | str | 11 | some string |
All-capitals identifiers can be used but generally ALL_CAPS
is reserved for a constant:
'ALL_CAPS'.isidentifier()
True
This means the following is acceptable:
ALL_CAPS = 'some string'
and the capitalisation indicates that this instance name is intended to be a constant that should not be reassigned later on in the code:
ALL_CAPS = 'some string'
variables()
Type | Size/Shape | Value | |
---|---|---|---|
Instance Name | |||
greeting | str | 12 | Hello World! |
farewell | str | 3 | bye |
start | int | -1 | |
stop | int | -13 | |
step | int | -1 | |
letters | str | 12 | Hello World! |
letter | str | 1 | ! |
indexes | range | 12 | range(0, 12) |
index | int | 11 | |
BASE_COLORS | dict | 8 | {'b': (0, 0, 1), 'g': (0, 0.5, 0), 'r': (1, 0, 0), 'c': (0, 0.75, 0.75), 'm': (0.75, 0, 0.75), 'y': (0.75, 0.75, 0), 'k': (0, 0, 0), 'w': (1, 1, 1)} |
CSS4_COLORS | dict | 148 | {'aliceblue': '#F0F8FF', 'antiquewhite': '#FAEBD7', 'aqua': '#00FFFF', 'aquamarine': '#7FFFD4', 'azure': '#F0FFFF', 'beige': '#F5F5DC', 'bisque': '#FFE4C4', 'black': '#000000', 'blanchedalmond': '… |
cap_greeting | str | 12 | Hello world! |
hello | str | 11 | some string |
hello_world | str | 11 | some string |
hello_world2 | str | 11 | some string |
PascalCase | str | 11 | some string |
ALL_CAPS | str | 11 | some string |
An instance name shouldn't match any of the identifiers in __builtins__
otherwise it will shadow the builtin (until the instance name is deleted or the kernel is restarted), which leads to confusion when the identifier from builtins
is later used.
One mistake that beginners often make is to reassign the class name to an instance:
str = 'hello'
Then when they attempt to use the str
class they get the instance instead:
str
'hello'
To rectify this issue str
can be reassigned from the builtins
module:
str = __builtins__.str
str
str
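The same recovery can be sketched in a form that also works outside of a notebook, by importing the standard builtins module rather than relying on __builtins__. The instance names shadowed and restored are only illustrative:

```python
import builtins

str = 'hello'             # the mistake: str now refers to an instance
shadowed = str            # 'hello', not the class

str = builtins.str        # restore the class from the builtins module
restored = str(123)       # the class can be used again

print(shadowed, restored)
```

In a notebook, deleting the instance name with del str has a similar effect, as the builtin is found again once the shadowing name is gone.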
Another mistake beginners make when working with modules is to give their own script file the same name as the module they are trying to learn. This means that when they attempt to import the module they are trying to learn, they accidentally import the file they are working on instead, typically flagging up a circular ImportError
.
There are some identifiers which are reserved. These can be seen by importing the keyword
module. pprint
will also be imported to allow pretty printing of a Collection
:
import keyword
import pprint
The list
instance kwlist
can be examined:
pprint.pprint(keyword.kwlist)
['False', 'None', 'True', 'and', 'as', 'assert', 'async', 'await', 'break', 'class', 'continue', 'def', 'del', 'elif', 'else', 'except', 'finally', 'for', 'from', 'global', 'if', 'import', 'in', 'is', 'lambda', 'nonlocal', 'not', 'or', 'pass', 'raise', 'return', 'try', 'while', 'with', 'yield']
If an attempt is made to assign to a keyword, a SyntaxError
will display:
with = 'hello'
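As the docstring of isidentifier suggests, the method can be combined with keyword.iskeyword to check a proposed instance name. The following is a minimal sketch where is_usable_name is a hypothetical helper and not part of the standard library:

```python
import keyword


def is_usable_name(name):
    """Hypothetical helper: True if name is an identifier and not a keyword."""
    return name.isidentifier() and not keyword.iskeyword(name)


candidates = ['hello', 'hello_world', '2hello_world', 'hello-world2', 'with']
results = {name: is_usable_name(name) for name in candidates}
print(results)
```

Note that 'with'.isidentifier() is True on its own, which is why the second check is needed.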
There is also the soft keyword list softkwlist
:
pprint.pprint(keyword.softkwlist)
['_', 'case', 'match', 'type']
case
and match
were introduced in Python 3.10 and should be regarded as keywords for new code. They are only soft keywords so that existing code which uses them as identifier names continues to work.
_
in an interactive session by default refers to the last displayed output. However _
is also commonly used to indicate skipping of an object
during tuple
unpacking for example.
As each character maps to a numeric value it is ordinal. The builtins ordinal function ord
will return the ordinal numeric value of the character in decimal:
ord?
Signature: ord(c, /) Docstring: Return the Unicode code point for a one-character string. Type: builtin_function_or_method
For example the ordinal value of the str
instance '3'
can be checked:
ord('3')
51
The builtins function chr performs the reverse operation and returns the character that corresponds to an ordinal value:
chr(51)
'3'
Notice the difference in syntax highlighting between the str
of the number '3'
and the number 51
. This number can be converted into a binary string or hex string using the builtins bin
and hex
functions respectively:
bin?
Signature: bin(number, /) Docstring: Return the binary representation of an integer. >>> bin(2796202) '0b1010101010101010101010' Type: builtin_function_or_method
hex?
Signature: hex(number, /) Docstring: Return the hexadecimal representation of an integer. >>> hex(12648430) '0xc0ffee' Type: builtin_function_or_method
For example:
bin(ord('3'))
'0b110011'
This can be conceptualised as the following with the leading zeros:
'0b' + bin(ord('3')).removeprefix('0b').zfill(8)
'0b00110011'
Note the prefix 0b indicates a binary number and that bin does not display the two leading zeros. The hexadecimal form is:
hex(ord('3'))
'0x33'
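Note the prefix 0x indicates a hexadecimal number. Each hexadecimal digit has 16 possible values and therefore corresponds to 4 binary digits, as 16 is 2 ** 4: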
bin(16)
'0b10000'
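Padded binary and hexadecimal strings can also be produced with a format specification instead of string methods. The following is a minimal sketch using the builtins format function and a formatted string, with int used to convert back. The instance names are only illustrative:

```python
value = ord('3')                       # 51

padded_binary = format(value, '08b')   # 8 binary digits, no prefix
prefixed_binary = f'{value:#010b}'     # '0b' prefix plus 8 digits (10 characters)
prefixed_hex = f'{value:#04x}'         # '0x' prefix plus 2 digits (4 characters)

# int converts the strings back when the base is supplied.
from_binary = int('0b110011', 2)
from_hex = int('0x33', 16)

print(padded_binary, prefixed_binary, prefixed_hex, from_binary, from_hex)
```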
The string module¶
The string
module contains a number of useful strings which group characters. It can be imported using:
import string
The identifiers can be viewed:
dir2(string, object, unique_only=True)
{'attribute': ['ascii_letters', 'ascii_lowercase', 'ascii_uppercase', 'digits', 'hexdigits', 'octdigits', 'printable', 'punctuation', 'whitespace'], 'method': ['capwords'], 'upper_class': ['Formatter', 'Template'], 'datamodel_attribute': ['__all__', '__builtins__', '__cached__', '__file__', '__loader__', '__name__', '__package__', '__spec__'], 'internal_attribute': ['_re', '_sentinel_dict', '_string'], 'internal_method': ['_ChainMap']}
Most of the identifiers are attributes and in this case are str
instances. ascii_letters
is a str
instance containing all English letters:
string.ascii_letters
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
This can be split into lowercase and uppercase using the str
instances ascii_lowercase
and ascii_uppercase
respectively:
string.ascii_lowercase
'abcdefghijklmnopqrstuvwxyz'
string.ascii_uppercase
'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
digits
is a str
instance that contains the 10
digits used in the decimal system:
string.digits
'0123456789'
hexdigits
is a str
instance that contains the characters used to write the 16
hexadecimal digits. The letters appear in both cases because the lowercase and uppercase forms, for example a
and A
, are aliases of one another:
string.hexdigits
'0123456789abcdefABCDEF'
printable
is a str
instance that contains the printable characters:
string.printable
'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'
punctuation
is a str
instance that contains all the punctuation characters:
string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
whitespace
is a str
instance containing the whitespace characters:
string.whitespace
' \t\n\r\x0b\x0c'
With the exception of the space, these are shown using escape sequences which will be further explored in a moment.
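These groupings are convenient for membership checks. The following is a minimal sketch, where text is an illustrative example, that removes punctuation and counts the characters in each group:

```python
import string

text = 'Hello, World! 123'

# Keep only the characters that are not in string.punctuation.
no_punctuation = ''.join(c for c in text if c not in string.punctuation)

# Count how many characters fall into each group.
letter_count = sum(c in string.ascii_letters for c in text)
digit_count = sum(c in string.digits for c in text)

print(no_punctuation, letter_count, digit_count)
```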
Now that the ASCII groupings within the string
module have been seen, the additional boolean identifiers can be examined. These boolean identifiers all act upon instance data and return a bool
. Their docstrings are:
greeting.isprintable?
Signature: greeting.isprintable() Docstring: Return True if the string is printable, False otherwise. A string is printable if all of its characters are considered printable in repr() or if it is empty. Type: builtin_function_or_method
greeting.isascii?
Signature: greeting.isascii() Docstring: Return True if all characters in the string are ASCII, False otherwise. ASCII characters have code points in the range U+0000-U+007F. Empty string is ASCII too. Type: builtin_function_or_method
greeting.isalnum?
Signature: greeting.isalnum() Docstring: Return True if the string is an alpha-numeric string, False otherwise. A string is alpha-numeric if all characters in the string are alpha-numeric and there is at least one character in the string. Type: builtin_function_or_method
greeting.isalpha?
Signature: greeting.isalpha() Docstring: Return True if the string is an alphabetic string, False otherwise. A string is alphabetic if all characters in the string are alphabetic and there is at least one character in the string. Type: builtin_function_or_method
greeting.isspace?
Signature: greeting.isspace() Docstring: Return True if the string is a whitespace string, False otherwise. A string is whitespace if all characters in the string are whitespace and there is at least one character in the string. Type: builtin_function_or_method
greeting.isdecimal?
Signature: greeting.isdecimal() Docstring: Return True if the string is a decimal string, False otherwise. A string is a decimal string if all characters in the string are decimal and there is at least one character in the string. Type: builtin_function_or_method
greeting.isdigit?
Signature: greeting.isdigit() Docstring: Return True if the string is a digit string, False otherwise. A string is a digit string if all characters in the string are digits and there is at least one character in the string. Type: builtin_function_or_method
greeting.isnumeric?
Signature: greeting.isnumeric() Docstring: Return True if the string is a numeric string, False otherwise. A string is numeric if all characters in the string are numeric and there is at least one character in the string. Type: builtin_function_or_method
For example:
'hello Γειά σου 123'.isprintable()
True
'hello Γειά σου 123'.isascii()
False
'hello 123 !'.isascii()
True
'hello 123 !'.isalnum()
False
'hello123'.isalnum()
True
'hello123'.isalpha()
False
'hello'.isalpha()
True
'hello'.isspace()
False
The boolean numeric str
instance methods have subtle differences. These can be seen by examining the response of the methods for each of the following number groupings:
numeric_groups = {'ascii': '0123456789',
                  'font1': '𝟶𝟷𝟸𝟹𝟺𝟻𝟼𝟽𝟾𝟿',
                  'font2': '𝟬𝟭𝟮𝟯𝟰𝟱𝟲𝟳𝟴𝟵',
                  'font3': '𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡',
                  'subscript': '₀₁₂₃₄₅₆₇₈₉',
                  'superscript': '⁰¹²³⁴⁵⁶⁷⁸⁹',
                  'circled1': '➀➁➂➃➄➅➆➇➈',
                  'circled2': '➉',
                  'fractions': '½⅓¼⅕⅙⅐⅛⅑⅒⅔¾⅖⅗⅘⅚⅜⅝⅞⅟↉',
                  'asciihex': '0123456789abcdef', }
for group in numeric_groups:
    print(group, numeric_groups[group], numeric_groups[group].isdecimal())
ascii 0123456789 True
font1 𝟶𝟷𝟸𝟹𝟺𝟻𝟼𝟽𝟾𝟿 True
font2 𝟬𝟭𝟮𝟯𝟰𝟱𝟲𝟳𝟴𝟵 True
font3 𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡 True
subscript ₀₁₂₃₄₅₆₇₈₉ False
superscript ⁰¹²³⁴⁵⁶⁷⁸⁹ False
circled1 ➀➁➂➃➄➅➆➇➈ False
circled2 ➉ False
fractions ½⅓¼⅕⅙⅐⅛⅑⅒⅔¾⅖⅗⅘⅚⅜⅝⅞⅟↉ False
asciihex 0123456789abcdef False
for group in numeric_groups:
    print(group, numeric_groups[group], numeric_groups[group].isdigit())
ascii 0123456789 True
font1 𝟶𝟷𝟸𝟹𝟺𝟻𝟼𝟽𝟾𝟿 True
font2 𝟬𝟭𝟮𝟯𝟰𝟱𝟲𝟳𝟴𝟵 True
font3 𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡 True
subscript ₀₁₂₃₄₅₆₇₈₉ True
superscript ⁰¹²³⁴⁵⁶⁷⁸⁹ True
circled1 ➀➁➂➃➄➅➆➇➈ True
circled2 ➉ False
fractions ½⅓¼⅕⅙⅐⅛⅑⅒⅔¾⅖⅗⅘⅚⅜⅝⅞⅟↉ False
asciihex 0123456789abcdef False
for group in numeric_groups:
    print(group, numeric_groups[group], numeric_groups[group].isnumeric())
ascii 0123456789 True
font1 𝟶𝟷𝟸𝟹𝟺𝟻𝟼𝟽𝟾𝟿 True
font2 𝟬𝟭𝟮𝟯𝟰𝟱𝟲𝟳𝟴𝟵 True
font3 𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡 True
subscript ₀₁₂₃₄₅₆₇₈₉ True
superscript ⁰¹²³⁴⁵⁶⁷⁸⁹ True
circled1 ➀➁➂➃➄➅➆➇➈ True
circled2 ➉ True
fractions ½⅓¼⅕⅙⅐⅛⅑⅒⅔¾⅖⅗⅘⅚⅜⅝⅞⅟↉ True
asciihex 0123456789abcdef False
for group in numeric_groups:
    print(group, numeric_groups[group], numeric_groups[group].isalnum())
ascii 0123456789 True
font1 𝟶𝟷𝟸𝟹𝟺𝟻𝟼𝟽𝟾𝟿 True
font2 𝟬𝟭𝟮𝟯𝟰𝟱𝟲𝟳𝟴𝟵 True
font3 𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡 True
subscript ₀₁₂₃₄₅₆₇₈₉ True
superscript ⁰¹²³⁴⁵⁶⁷⁸⁹ True
circled1 ➀➁➂➃➄➅➆➇➈ True
circled2 ➉ True
fractions ½⅓¼⅕⅙⅐⅛⅑⅒⅔¾⅖⅗⅘⅚⅜⅝⅞⅟↉ True
asciihex 0123456789abcdef True
The boolean identifiers are often used for checks, and these checks are used to create conditions and set up loops, for example.
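One practical consequence of these differences is seen when a str instance is converted to an int. The following is a minimal sketch using the superscript two and the fraction one half as illustrative characters, with the standard unicodedata module used to read the value each character represents:

```python
import unicodedata

# Superscript two counts as a digit but not as a decimal character.
is_digit = '²'.isdigit()
is_decimal = '²'.isdecimal()

# int only accepts decimal characters, so the conversion fails.
try:
    int('²')
    converted = True
except ValueError:
    converted = False

# unicodedata can still report the value the character represents.
digit_value = unicodedata.digit('²')
half_value = unicodedata.numeric('½')

print(is_digit, is_decimal, converted, digit_value, half_value)
```

When the intention is to guard a conversion with int, isdecimal is therefore the stricter and safer of the three checks.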
Escape Characters¶
The \
is a special symbol used to insert an escape character. The most commonly used escape characters have the form:
print('| |') # no escape character
| |
print('| \t |') # the tab
| |
print('| \n |') # the new line
| 
 |
print('| \\ |') # the backslash itself
| \ |
print('| \' |') # the single quotation
| ' |
print('| \" |') # the double quotation
| " |
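Although an escape sequence is typed using two characters, it inserts a single character into the str instance. A short sketch, with illustrative instance names, that checks the lengths and compares the formal representation with the printed form:

```python
# Each escape sequence is two characters in the code but one in the string.
lengths = [len('\t'), len('\n'), len('\\'), len('\''), len('\"')]

# repr shows the escape sequence whereas print interprets it.
text = 'one\ttwo'
formal = repr(text)

print(lengths)
print(formal)
print(text)
```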
An ASCII character, or any character within the range of a single byte, can be inserted using the escape sequence \x followed by 2 hexadecimal digits:
hex(ord('!'))
'0x21'
'\x21' # a byte (2 hexadecimal digits)
'!'
print('| \x09 |') # the tab as a byte (2 hexadecimal digits)
| |
Note the two hexadecimal digits have to be provided as otherwise there is an incomplete byte specified.
The most commonly used Unicode characters outside of the ASCII range span 2 bytes and can therefore be inserted using the escape sequence \u followed by 4 hexadecimal digits. For example:
hex(ord('α'))
'0x3b1'
'\u03b1' # a Unicode character (4 hexadecimal digits, 2 hexadecimal digits × 2 bytes)
'α'
Note that all four hexadecimal digits have to be provided, otherwise the escape sequence is incomplete. The next line of code shows a common problem when attempting to input a Windows path:
'c:\users\philip'
In the above, the first \
is seen by the Python interpreter as an instruction to insert an escape character. u
is an instruction to expect a Unicode escape sequence and therefore the Python interpreter attempts to read the next four characters sers
as hexadecimal digits. However s
and r
are not valid hexadecimal digits (only e
is). Recall that the hexadecimal system has only the 16 digits 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b, c, d, e, f
and therefore a SyntaxError
is flagged up.
To insert a Windows path \\
should be used to indicate insertion of the escape character \
:
'c:\\users\\philip'
Note that the hex form is normally used to represent a byte that is not printable. If the 6 whitespace characters are examined in more detail this can be seen:
string.whitespace
' \t\n\r\x0b\x0c'
name | escape character | byte |
---|---|---|
space | ' ' | '\x20' |
tab | '\t' | '\x09' |
new line | '\n' | '\x0a' |
carriage return | '\r' | '\x0d' |
vertical tab | | '\x0b' |
form feed | | '\x0c' |
' ' == '\x20'
True
'\t' == '\x09'
True
'\n' == '\x0a'
True
'\r' == '\x0d'
True
It is not common to do so, however each ASCII character in a string can also be inserted as an escape character:
'\x68\x65\x6c\x6c\x6f\x20\x77\x6f\x72\x6c\x64\x21'
'hello world!'
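The hexadecimal digits used in the escape sequences above can be generated from the ordinal values. The following is a minimal sketch (hex_pairs and round_trip are illustrative names) that builds the pairs of digits and then uses bytes.fromhex to recover the text:

```python
text = 'hello world!'

# Two hexadecimal digits per character, as used after each \x above.
hex_pairs = [f'{ord(character):02x}' for character in text]

# bytes.fromhex reverses the process and decoding as ASCII gives the text back.
round_trip = bytes.fromhex(''.join(hex_pairs)).decode('ascii')

print(hex_pairs)
print(round_trip)
```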
The unicodedata
module can be imported:
import unicodedata
Its identifiers can be viewed using:
dir2(unicodedata, object, unique_only=True)
{'attribute': ['ucd_3_2_0', 'unidata_version'], 'method': ['bidirectional', 'category', 'combining', 'decimal', 'decomposition', 'digit', 'east_asian_width', 'is_normalized', 'lookup', 'mirrored', 'name', 'normalize', 'numeric'], 'upper_class': ['UCD'], 'datamodel_attribute': ['__file__', '__loader__', '__name__', '__package__', '__spec__'], 'internal_attribute': ['_ucnhash_CAPI']}
The Unicode version can be checked using:
unicodedata.unidata_version
'15.0.0'
And once the version number is known, more details about the supported characters can be examined using the Unicode Documentation.
A Unicode character can also span 4 bytes and can therefore be inserted using the escape sequence \U followed by 8 hexadecimal digits. For example:
'\U0000303a'
'〺'
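The different escape sequences are alternative spellings of the same character. The following sketch compares them for the Greek letter alpha and also uses the \N escape sequence with the Unicode name, alongside the name and lookup functions of the unicodedata module:

```python
import unicodedata

# Four spellings of the same character.
spellings = ['α', '\u03b1', '\U000003b1', '\N{GREEK SMALL LETTER ALPHA}']
all_equal = len(set(spellings)) == 1

# name and lookup convert between a character and its Unicode name.
alpha_name = unicodedata.name('α')
alpha = unicodedata.lookup('GREEK SMALL LETTER ALPHA')

print(all_equal, alpha_name, alpha)
```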
Translation Table¶
A translation table can be created for use with the instance method translate
:
greeting.translate?
Signature: greeting.translate(table, /) Docstring: Replace each character in the string using the given translation table. table Translation table, which must be a mapping of Unicode ordinals to Unicode ordinals, strings, or None. The table must implement lookup/indexing via __getitem__, for instance a dictionary or list. If this operation raises LookupError, the character is left untouched. Characters mapped to None are deleted. Type: builtin_function_or_method
maketrans
is a static method, which is essentially a function that is bound to neither the instance nor the class. This function merely exists in the namespace of the class as this is the most logical place to find it (conceptualise the class as a Python module):
str.maketrans?
Docstring: Return a translation table usable for str.translate(). If there is only one argument, it must be a dictionary mapping Unicode ordinals (integers) or characters to Unicode ordinals, strings or None. Character keys will be then converted to ordinals. If there are two arguments, they must be strings of equal length, and in the resulting dictionary, each character in x will be mapped to the character at the same position in y. If there is a third argument, it must be a string, whose characters will be mapped to None in the result. Type: builtin_function_or_method
greektolatin = str.maketrans('αβγδε', 'abcde')
greektolatin
{945: 97, 946: 98, 947: 99, 948: 100, 949: 101}
hex(945)
'0x3b1'
hex(97)
'0x61'
This translation table can be used on the example str
instance to replace the Greek letters (keys) with the Latin letters (values):
'αββγγγδδδδεεεεε'.translate(greektolatin)
'abbcccddddeeeee'
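The docstring also describes a third argument and a single dict argument. The following is a minimal sketch of both forms, where the example text and the instance names are only illustrative:

```python
import string

# Third argument: characters to delete (these are mapped to None).
remove_punctuation = str.maketrans('', '', string.punctuation)
cleaned = 'Hello, World!'.translate(remove_punctuation)

# Single dict argument: a character can be mapped to a longer string.
expand = str.maketrans({'ß': 'ss', 'ä': 'ae'})
expanded = 'Straße'.translate(expand)

print(cleaned, expanded)
```

The dict form is needed here because the two-string form requires both strings to have equal length.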
File Paths and Raw Strings¶
In a Python string, the \
is a special character that is an instruction to insert an escape character. Unfortunately the \
is also the default directory separator used for a file path in Windows.
To incorporate an \
into a str
instance \\
has to be used; the first \
is an instruction to insert an escape character and the second \
states that the escape character to be inserted is the \
itself:
windows_file_path = 'C:\\Users\\Philip'
This problem does not occur on Linux because /
is used as a directory separator in a file path:
linux_file_path = '/users/philip'
Windows can also use /
as an alternative directory separator however when copying file paths from Windows Explorer for example, the default separator \
will be used.
Compare the difference between the cell output and the output from the print
function:
windows_file_path
'C:\\Users\\Philip'
print(windows_file_path)
C:\Users\Philip
In Windows the file path is of the form 'C:\Users\Philip'
using the default separator \
and a SyntaxError
displays when it is used:
windows_file_path = 'C:\Users\Philip'
For the file path to be recognised as a Python string each \
has to be converted into a \\
:
windows_file_path = 'C:\\Users\\Philip'
This can be quite cumbersome for long file paths. Python also has a raw string which does not process escape characters and any \
is recognised as being part of the str
instance. A raw str
has the prefix r
or R
:
raw_windows_file_path1 = r'C:\Users\Philip'
raw_windows_file_path2 = R'C:\Users\Philip'
Both r
and R
give the same raw str
instance:
raw_windows_file_path1 == raw_windows_file_path2
True
raw_windows_file_path2
'C:\\Users\\Philip'
print(raw_windows_file_path2)
C:\Users\Philip
The subtle difference between the two is in the syntax highlighting applied by some code editors. Uppercase R
shows no formatting around the special characters which is appropriate for the file path. Lowercase r
on the other hand shows syntax highlighting following the escape character and is used to construct regular expressions which will be briefly mentioned in the next section.
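The equivalence between the raw and the escaped forms can be sketched as follows. The sketch also uses PureWindowsPath from the standard pathlib module, which is not otherwise used in this notebook, to split the path into its parts on any operating system:

```python
from pathlib import PureWindowsPath

raw = r'C:\Users\Philip'
escaped = 'C:\\Users\\Philip'
same = raw == escaped

# In a raw string \n is two characters, not a new line.
raw_length = len(r'\n')

# PureWindowsPath parses a Windows path without touching the file system.
path = PureWindowsPath(raw)
parts = path.parts
forward = path.as_posix()

print(same, raw_length, parts, forward)
```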
Find and Index
Previously indexing using an int
or a slice
was discussed:
greeting
'Hello World!'
greeting[0]
'H'
greeting[:5]
'Hello'
The str
instance methods index
and find
perform the inverse operation and return the non-negative index corresponding to the first occurrence of a character or the start of a substring:
greeting.find?
Docstring: S.find(sub[, start[, end]]) -> int Return the lowest index in S where substring sub is found, such that sub is contained within S[start:end]. Optional arguments start and end are interpreted as in slice notation. Return -1 on failure. Type: builtin_function_or_method
greeting.index?
Docstring: S.index(sub[, start[, end]]) -> int Return the lowest index in S where substring sub is found, such that sub is contained within S[start:end]. Optional arguments start and end are interpreted as in slice notation. Raises ValueError when the substring is not found. Type: builtin_function_or_method
These two instance methods behave identically upon success:
greeting.find('l')
2
greeting.index('l')
2
However, they return -1
and raise ValueError
respectively upon failure:
greeting.find('L')
-1
greeting.index('L')
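Because index raises an exception instead of returning a sentinel value, a failed search is normally handled with try and except. A minimal sketch, where position is a hypothetical name introduced for illustration:

```python
greeting = 'Hello World!'

# find reports failure with -1, which must be checked explicitly
found = greeting.find('L')        # -1

# index reports failure by raising ValueError
try:
    position = greeting.index('L')
except ValueError:
    position = None               # no uppercase 'L' in the greeting
```

find is convenient when a missing substring is an expected outcome, whereas index is useful when a missing substring should stop the program.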
These instance methods take consistent start
and end
input arguments, like the start and stop of the slice
and range
classes seen earlier, which can be used to restrict the search range. For example to find the index of every occurrence of 'l'
:
greeting.find('l')
2
greeting.find('l', 2+1)
3
greeting.find('l', 3+1)
9
greeting.find('l', 9+1)
-1
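The repeated calls above can be collected into a loop. The following is a sketch only; find_all is a hypothetical helper and is not a str method:

```python
def find_all(text, sub):
    """Return a list of every index at which sub starts in text."""
    indexes = []
    start = text.find(sub)
    while start != -1:                      # -1 means no further match
        indexes.append(start)
        start = text.find(sub, start + 1)   # resume just past the last match
    return indexes


greeting = 'Hello World!'
l_indexes = find_all(greeting, 'l')   # [2, 3, 9]
```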
A substring can also be searched for instead of a single character:
greeting.find('World')
6
greeting.find('W')
6
The index
and find
methods search the str
instance for a substring from the left to the right. These are complemented by the reverse find and reverse index, rfind
and rindex
respectively which search from right to left:
greeting.rfind('l')
9
greeting.rfind('l', 0, 9)
3
greeting.rfind('l', 0, 3)
2
greeting.rfind('l', 0, 2)
-1
greeting.rfind('l')
9
The str
instance method count
returns the number of times a substring str
instance is found in the str
instance:
greeting.count('l')
3
The str
methods startswith
and endswith
return a bool
indicating whether the str
instance starts or ends with a specified prefix
or suffix
. These also have consistent start
and end
input arguments which can be used to restrict the search range:
greeting.startswith?
Docstring: S.startswith(prefix[, start[, end]]) -> bool Return True if S starts with the specified prefix, False otherwise. With optional start, test S beginning at that position. With optional end, stop comparing S at that position. prefix can also be a tuple of strings to try. Type: builtin_function_or_method
greeting.endswith?
Docstring: S.endswith(suffix[, start[, end]]) -> bool Return True if S ends with the specified suffix, False otherwise. With optional start, test S beginning at that position. With optional end, stop comparing S at that position. suffix can also be a tuple of strings to try. Type: builtin_function_or_method
greeting
'Hello World!'
greeting.startswith('hello')
False
greeting.startswith('hello', 1)
False
greeting.endswith('!')
True
greeting.endswith('!', 0, 11)
False
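The docstrings above mention that the prefix or suffix can also be a tuple of str instances. A short sketch of this, together with a case-insensitive check that lowers the text first:

```python
greeting = 'Hello World!'

# A tuple is accepted: True if any one of the prefixes matches
any_prefix = greeting.startswith(('hello', 'Hello'))     # True

# The comparison is case-sensitive, so normalise the case when needed
ignore_case = greeting.lower().startswith('hello')       # True

# A tuple of suffixes works the same way
any_suffix = greeting.endswith(('?', '!'))               # True
```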
The str
instance method replace
can be used to replace an old
substring with a new
substring. It has an optional argument count
which has a default value of -1
and this means it allows for all replacements by default:
greeting.replace?
Signature: greeting.replace(old, new, count=-1, /) Docstring: Return a copy with all occurrences of substring old replaced by new. count Maximum number of occurrences to replace. -1 (the default value) means replace all occurrences. If the optional argument count is given, only the first count occurrences are replaced. Type: builtin_function_or_method
greeting
'Hello World!'
greeting.replace('hello', 'bye')
'Hello World!'
greeting.replace('l', 'L')
'HeLLo WorLd!'
greeting.replace('l', 'L', 1)
'HeLlo World!'
The re module
The regular expressions module is used for advanced pattern searching:
text = 'Email example@example.com, example2@example.com Telephone 0000000000 Website https://www.example.com'
For example a regular expression using r
can be created for an email, number and website:
email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
number_pattern = r'\b\d{10}\b'
website_pattern = r'https?://(?:www\.)?[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
Notice the difference in syntax highlighting when uppercase R
is used:
email_pattern = R'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
number_pattern = R'\b\d{10}\b'
website_pattern = R'https?://(?:www\.)?[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
The regular expression module can be imported:
import re
dir2(re, object, unique_only=True)
{'constant': ['A', 'ASCII', 'DEBUG', 'DOTALL', 'I', 'IGNORECASE', 'L', 'LOCALE', 'M', 'MULTILINE', 'NOFLAG', 'S', 'T', 'TEMPLATE', 'U', 'UNICODE', 'VERBOSE', 'X'], 'module': ['copyreg', 'enum', 'functools'], 'method': ['compile', 'escape', 'findall', 'finditer', 'fullmatch', 'match', 'purge', 'search', 'split', 'sub', 'subn', 'template'], 'lower_class': ['error'], 'upper_class': ['Match', 'Pattern', 'RegexFlag', 'Scanner'], 'datamodel_attribute': ['__all__', '__builtins__', '__cached__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__version__'], 'internal_attribute': ['_MAXCACHE', '_MAXCACHE2', '_cache', '_cache2', '_casefix', '_compiler', '_constants', '_parser', '_special_chars_map', '_sre'], 'internal_method': ['_compile', '_compile_template', '_pickle']}
The re.findall
function can be used to search for all non-overlapping occurrences of a pattern:
re.findall?
Signature: re.findall(pattern, string, flags=0) Docstring: Return a list of all non-overlapping matches in the string. If one or more capturing groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result. File: c:\users\phili\anaconda3\envs\vscode-env\lib\re\__init__.py Type: function
For example a search for the email_pattern
can be made in text
:
email_search = re.findall(email_pattern, text)
The results can be seen in the output list
instance:
email_search
['example@example.com', 'example2@example.com']
A search can also be made for the number_pattern
and website_pattern
:
number_search = re.findall(number_pattern, text)
number_search
['0000000000']
website_search = re.findall(website_pattern, text)
website_search
['https://www.example.com']
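When only the first occurrence or the position of a match is needed, re.search and re.finditer return Match instances instead of plain str instances. The following is a sketch; simple_email is a deliberately simplified pattern introduced for illustration and is not a full email validator:

```python
import re

text = 'Email example@example.com, example2@example.com Telephone 0000000000'

# Compile once so the pattern can be reused
number_pattern = re.compile(r'\b\d{10}\b')

first = number_pattern.search(text)   # Match instance, or None if not found
number = first.group()                # the matched text
start, stop = first.span()            # indexes usable in a slice

# finditer yields one Match instance per non-overlapping match
simple_email = re.compile(r'[\w.]+@[\w.]+')
emails = [match.group() for match in simple_email.finditer(text)]
```

The span can be used to slice the original str, so text[start:stop] gives back the matched number.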
The print function
The print
function has previously been used with its default named parameters. More details about these can be seen in the docstring:
print?
Signature: print(*args, sep=' ', end='\n', file=None, flush=False) Docstring: Prints the values to a stream, or to sys.stdout by default. sep string inserted between values, default a space. end string appended after the last value, default a newline. file a file-like object (stream); defaults to the current sys.stdout. flush whether to forcibly flush the stream. Type: builtin_function_or_method
*args
indicates that a variable number of positional input arguments are used. sep
and end
are named input arguments which have a default value of a space and a new line respectively. file
and flush
are for advanced purposes when the print stream is to be directed for example to a file instead of a cell output:
print(*args, sep=' ', end='\n', file=None, flush=False)
The effect of overriding the default value of sep
can be seen:
print('the', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog')
the brown fox jumps over the lazy dog
print('the', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', sep='')
thebrownfoxjumpsoverthelazydog
The effect of overriding the default value of end
can be seen:
print('the', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog')
print('the', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog')
the brown fox jumps over the lazy dog
the brown fox jumps over the lazy dog
print('the', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', end='')
print('the', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog')
the brown fox jumps over the lazy dogthe brown fox jumps over the lazy dog
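The file parameter accepts any file-like object. As a sketch, an in-memory text stream from the standard io module can stand in for a file, which makes the effect of sep and end visible as a str; the name buffer is introduced for illustration:

```python
import io

buffer = io.StringIO()   # in-memory, file-like text stream

# sep is placed between the values and end is appended after the last value
print('the', 'lazy', 'dog', sep='-', end='!\n', file=buffer)
print('done', file=buffer)

captured = buffer.getvalue()   # 'the-lazy-dog!\ndone\n'
```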
Formatted Strings
Supposing a str
body has the form:
body = 'The string to 0 is 1 2!'
And there are three str
instances:
var0 = 'print'
var1 = 'hello'
var2 = 'world'
The objective of a formatted string is to insert these instances into the str
body so that a formatted str
instance of the following form is returned:
'The string to print is hello world!'
'The string to print is hello world!'
If the docstring of the str
method format
is examined:
body.format?
Docstring: S.format(*args, **kwargs) -> str Return a formatted version of S, using substitutions from args and kwargs. The substitutions are identified by braces ('{' and '}'). Type: builtin_function_or_method
Then it can be seen that substitutions are identified by braces so the str
body should be modified to have the following form:
body = 'The string to {0} is {1} {2}!'
Notice the syntax highlighting clearly distinguishes these placeholders.
*args
represents a variable number of positional input arguments. When inserting instances into the str
body, the number of positional input arguments should match the number of placeholders in the str
body. Now the format
method can be used:
body.format(var0, var1, var2)
'The string to print is hello world!'
The str
instance body can alternatively be setup to contain named variables:
body = 'The string to {var0_} is {var1_} {var2_}!'
**kwargs
represents a variable number of named keyword input arguments which should match the named keyword input arguments in the str
instance body
:
body.format(var0_=var0, var1_=var1, var2_=var2)
'The string to print is hello world!'
The two lines above can be combined:
'The string to {var0_} is {var1_} {var2_}!'.format(var0_=var0, var1_=var1, var2_=var2)
'The string to print is hello world!'
It is more common for the placeholders to be given the same names as the instances to be inserted
:
'The string to {var0} is {var1} {var2}!'.format(var0=var0, var1=var1, var2=var2)
'The string to print is hello world!'
Notice in the above that each instance name is used 3 times which is pretty cumbersome. A shorthand way of writing the expression above is to use the prefix f
or F
which means formatted string:
f'The string to {var0} is {var1} {var2}!'
'The string to print is hello world!'
F'The string to {var0} is {var1} {var2}!'
'The string to print is hello world!'
There is no difference for uppercase and lowercase in formatted str
instances and the syntax highlighting is the same in either case.
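The braces in a formatted str can hold any expression, not only an instance name. A brief sketch using the instances defined above:

```python
var0 = 'print'
var1 = 'hello'
var2 = 'world'

# Method calls are evaluated before insertion
shout = f'{var1.capitalize()} {var2.upper()}!'     # 'Hello WORLD!'

# Arithmetic expressions are also allowed
total = f'{len(var1) + len(var2)} characters'      # '10 characters'

# !r inserts the formal representation, including the quotations
quoted = f'The string to {var0!r}'                 # "The string to 'print'"
```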
If the object
datamodel method __format__
is examined:
object.__format__?
Signature: object.__format__(self, format_spec, /) Docstring: Default object formatter. Return str(self) if format_spec is empty. Raise TypeError otherwise. Type: method_descriptor
Notice there is a format specification format_spec
:
greeting
'Hello World!'
The format specification for a str
instance has the form:
'0ns'
where n
is an integer, s
means str
and 0
is used to fill in blank spaces.
greeting.__format__('s')
'Hello World!'
greeting.__format__('22s')
'Hello World!          '
greeting.__format__('022s')
'Hello World!0000000000'
The formatter specifier options differ for each datatype. Normally a colon is used to include the format specifier beside the variable in the formatted str
:
f'The string to {var0:s} is {var1} {var2}!'
'The string to print is hello world!'
The str
format specifier can specify an integer number of characters:
f'The string to {var0:10s} is {var1} {var2}!'
'The string to print      is hello world!'
If the width is prefixed with 0
then the trailing padding is filled with 0 instead of spaces
:
f'The string to {var0:010s} is {var1:s} {var2:s}!'
'The string to print00000 is hello world!'
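The format specification also supports an explicit alignment character, optionally preceded by a fill character. A short sketch using the builtin format function, which calls the __format__ method; the name word is introduced for illustration:

```python
word = 'print'

left = format(word, '<10s')      # 'print     '  (default for str)
right = format(word, '>10s')     # '     print'
centre = format(word, '^11s')    # '   print   '
starred = format(word, '*^11s')  # '***print***' with * as the fill character

# The same specifiers are used after the colon in a formatted str
sentence = f'[{word:>10s}]'      # '[     print]'
```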
In the examples above, str
instances were inserted into a str
instance body. It is more common to insert numeric variables into the str
instance body:
num1 = 1
num2 = 0.0000123456789
num3 = 12.3456789
f'The numbers are {num1}, {num2} and {num3}.'
'The numbers are 1, 1.23456789e-05 and 12.3456789.'
The format specifier for an integer decimal (d
) can be used:
f'The numbers are {num1:d}, {num2} and {num3}.'
'The numbers are 1, 1.23456789e-05 and 12.3456789.'
f'The numbers are {num1:5d}, {num2} and {num3}.'
'The numbers are     1, 1.23456789e-05 and 12.3456789.'
f'The numbers are {num1:05d}, {num2} and {num3}.'
'The numbers are 00001, 1.23456789e-05 and 12.3456789.'
f'The numbers are {num1: 05d}, {num2} and {num3}.'
'The numbers are  0001, 1.23456789e-05 and 12.3456789.'
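Again the number of characters that the number should occupy in the string can be specified. Unlike the str
formatter, the padding is leading instead of trailing. If the width is prefixed with a 0
, the padding is shown as 0
.
Notice in the last example that one of the five characters is a space, because the space is part of the format specifier and reserves a position for the sign of a positive number. Compare this with the previous example, where the space is absent. The general format specifier g can be used for a float and displays 6 significant figures by default: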
f'The numbers are {num1}, {num2:g} and {num3:g}.'
'The numbers are 1, 1.23457e-05 and 12.3457.'
The e
can be used for float
exponential format:
f'The numbers are {num1}, {num2:e} and {num3:e}.'
'The numbers are 1, 1.234568e-05 and 1.234568e+01.'
The number of places after the decimal point can be specified:
f'The numbers are {num1}, {num2:0.3e} and {num3:0.3e}.'
'The numbers are 1, 1.235e-05 and 1.235e+01.'
A fixed format can also be used:
f'The numbers are {num1}, {num2:f} and {num3:f}.'
'The numbers are 1, 0.000012 and 12.345679.'
Once again the number of digits after the decimal point can be specified:
f'The numbers are {num1}, {num2:0.3f} and {num3:0.3f}.'
'The numbers are 1, 0.000 and 12.346.'
float
instances can use the general (g
), exponential (e
) and fixed (f
) format specifiers. The precision 0.3
specifies 3
digits past the decimal point for e and f, and 3 significant figures for g.
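The effect of the same precision on each float format specifier can be compared side by side. A minimal sketch, with num introduced for illustration:

```python
num = 12.3456789

general = f'{num:.3g}'       # '12.3'       3 significant figures
exponential = f'{num:.3e}'   # '1.235e+01'  3 digits after the decimal point
fixed = f'{num:.3f}'         # '12.346'     3 digits after the decimal point

# A width can be combined with the precision; numbers are right aligned
padded = f'{num:10.3f}'      # '    12.346'
```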
If the keys in a dict
instance match the instance names in the str
body:
numbers = {'num1': 1, 'num2': 0.0000123456789, 'num3': 12.3456789}
body = 'The numbers are {num1:d}, {num2:.3e} and {num3:.3e}.'
The format_map
method can be used with the mapping to insert the instances:
body.format_map?
Docstring: S.format_map(mapping) -> str Return a formatted version of S, using substitutions from mapping. The substitutions are identified by braces ('{' and '}'). Type: builtin_function_or_method
body.format_map(numbers)
'The numbers are 1, 1.235e-05 and 1.235e+01.'
Notice that the syntax for a format specifier {variable:format_spec}
is similar to the form of a Python dict
instance {key:value}
. However spacing to the right of the colon is often present in a dictionary {key: value}
and does not change the value. If a space is added to the formatting specifier, it is incorporated into the formatting specifier.
The older style of formatted str
instances uses the datamodel identifier __mod__
(dunder mod) which controls the behaviour of the operator %
and in the case of older style string formatting also uses the %
as a placeholder instead of the braces {}
:
body = 'The numbers are %d, %0.3f and %0.3g.'
nums = (1, 0.0000123456789, 12.3456789)
body.__mod__?
Signature: body.__mod__(value, /) Call signature: body.__mod__(*args, **kwargs) Type: method-wrapper String form: <method-wrapper '__mod__' of str object at 0x000001E0A227B960> Docstring: Return self%value.
body % nums
'The numbers are 1, 0.000 and 12.3.'
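The older % style and the format method share most of their format specifiers, so a body can usually be translated from one style to the other. A sketch showing that both give the same result here:

```python
nums = (1, 0.0000123456789, 12.3456789)

# Older style: % placeholders, values supplied as a tuple
old_style = 'The numbers are %d, %0.3f and %0.3g.' % nums

# Newer style: brace placeholders, the tuple is unpacked with *
new_style = 'The numbers are {:d}, {:0.3f} and {:0.3g}.'.format(*nums)

same = old_style == new_style   # True
```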
Multiline Strings
A str
instance can be written over multiple lines using triple double quotations:
multiline = """the quick brown fox jumps over the lazy dog
the quick brown fox jumps over the lazy dog
the quick brown fox jumps over the lazy dog
the quick brown fox jumps over the lazy dog"""
multiline
'the quick brown fox jumps over the lazy dog\nthe quick brown fox jumps over the lazy dog\nthe quick brown fox jumps over the lazy dog\nthe quick brown fox jumps over the lazy dog'
print(multiline)
the quick brown fox jumps over the lazy dog
the quick brown fox jumps over the lazy dog
the quick brown fox jumps over the lazy dog
the quick brown fox jumps over the lazy dog
Note that any spacing added will be incorporated into the multiline str
instance:
multiline = """
    the quick brown fox jumps over the lazy dog
    the quick brown fox jumps over the lazy dog
    the quick brown fox jumps over the lazy dog
    the quick brown fox jumps over the lazy dog
    """
multiline
'\n    the quick brown fox jumps over the lazy dog\n    the quick brown fox jumps over the lazy dog\n    the quick brown fox jumps over the lazy dog\n    the quick brown fox jumps over the lazy dog\n    '
print(multiline)

    the quick brown fox jumps over the lazy dog
    the quick brown fox jumps over the lazy dog
    the quick brown fox jumps over the lazy dog
    the quick brown fox jumps over the lazy dog
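When the added spacing is unwanted, for example when a multiline str is indented to line up with surrounding code, it can be removed afterwards. One possible approach, not used elsewhere in this notebook, is textwrap.dedent from the standard library; the names indented and cleaned are introduced for illustration:

```python
import textwrap

indented = """
    the quick brown fox
    jumps over the lazy dog
    """

# dedent removes the leading whitespace common to every line,
# strip then removes the leading and trailing newlines
cleaned = textwrap.dedent(indented).strip()
# 'the quick brown fox\njumps over the lazy dog'
```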
Triple double quotations are preferred over triple single quotations because multiline str
instances are commonly used for docstrings. Docstrings are commonly written briefly during development and expanded later to include single-quoted str
literals:
print?
Signature: print(*args, sep=' ', end='\n', file=None, flush=False) Docstring: Prints the values to a stream, or to sys.stdout by default. sep string inserted between values, default a space. end string appended after the last value, default a newline. file a file-like object (stream); defaults to the current sys.stdout. flush whether to forcibly flush the stream. Type: builtin_function_or_method
doc = """Prints the values
sep
string inserted between values, default a space ' '.
end
string appended after the last value, default a newline '\\n'."""
print(doc)
Prints the values
sep
string inserted between values, default a space ' '.
end
string appended after the last value, default a newline '\n'.
Center and Justify
A str
instance can be centered and justified using the str
methods center
, ljust
and rjust
:
greeting.center?
Signature: greeting.center(width, fillchar=' ', /) Docstring: Return a centered string of length width. Padding is done using the specified fill character (default is a space). Type: builtin_function_or_method
greeting.ljust?
Signature: greeting.ljust(width, fillchar=' ', /) Docstring: Return a left-justified string of length width. Padding is done using the specified fill character (default is a space). Type: builtin_function_or_method
greeting.rjust?
Signature: greeting.rjust(width, fillchar=' ', /) Docstring: Return a right-justified string of length width. Padding is done using the specified fill character (default is a space). Type: builtin_function_or_method
len(greeting)
12
greeting.center(20)
'    Hello World!    '
greeting.center(20, 'X')
'XXXXHello World!XXXX'
greeting.ljust(20, 'X')
'Hello World!XXXXXXXX'
greeting.rjust(20, 'X')
'XXXXXXXXHello World!'
The opposite operation can be carried out using the str
methods left strip and right strip, lstrip
and rstrip
respectively, which strip whitespace by default or otherwise every leading or trailing character found in a specified set of characters:
padded_greeting = greeting.center(20)
padded_greeting
'    Hello World!    '
padded_greeting.lstrip?
Signature: padded_greeting.lstrip(chars=None, /) Docstring: Return a copy of the string with leading whitespace removed. If chars is given and not None, remove characters in chars instead. Type: builtin_function_or_method
padded_greeting.rstrip?
Signature: padded_greeting.rstrip(chars=None, /) Docstring: Return a copy of the string with trailing whitespace removed. If chars is given and not None, remove characters in chars instead. Type: builtin_function_or_method
padded_greeting.lstrip()
'Hello World!    '
padded_greeting.rstrip()
'    Hello World!'
padded_greeting.lstrip().rstrip()
'Hello World!'
padded_greeting = greeting.center(20, 'X')
padded_greeting
'XXXXHello World!XXXX'
padded_greeting.lstrip('X').rstrip('X')
'Hello World!'
The associated str
methods removeprefix
and removesuffix
are more precise and will only remove a specified prefix
or suffix
:
padded_greeting.removeprefix?
Signature: padded_greeting.removeprefix(prefix, /) Docstring: Return a str with the given prefix string removed if present. If the string starts with the prefix string, return string[len(prefix):]. Otherwise, return a copy of the original string. Type: builtin_function_or_method
padded_greeting.removesuffix?
Signature: padded_greeting.removesuffix(suffix, /) Docstring: Return a str with the given suffix string removed if present. If the string ends with the suffix string and that suffix is not empty, return string[:-len(suffix)]. Otherwise, return a copy of the original string. Type: builtin_function_or_method
padded_greeting
'XXXXHello World!XXXX'
padded_greeting.removeprefix('X')
'XXXHello World!XXXX'
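The difference between lstrip and removeprefix matters when the text after the prefix begins with characters that are also in the prefix. A small sketch, where the website strings are made up for illustration:

```python
padded_greeting = 'XXXXHello World!XXXX'

# strip combines lstrip and rstrip
stripped = padded_greeting.strip('X')              # 'Hello World!'

# lstrip treats its argument as a set of characters to remove
over_stripped = 'www.wiki.org'.lstrip('w.')        # 'iki.org'

# removeprefix removes the exact prefix once, and only if present
exact = 'www.wiki.org'.removeprefix('www.')        # 'wiki.org'
unchanged = 'wiki.org'.removeprefix('www.')        # 'wiki.org'
```

removeprefix and removesuffix require Python 3.9 or newer.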
Earlier the ordinal value of the string '3'
was examined. The prefix '0b'
can be removed using removeprefix:
string_3 = bin(ord('3'))
string_3
'0b110011'
string_3 = bin(ord('3')).removeprefix('0b')
string_3
'110011'
There is also the zero fill string method zfill
which is used to zero fill a string and is mainly intended for str
instances of numeric values:
string_3.zfill?
Signature: string_3.zfill(width, /) Docstring: Pad a numeric string with zeros on the left, to fill a field of the given width. The string is never truncated. Type: builtin_function_or_method
Since this binary number represents a byte, which has 8
bits, the width can be set to 8
:
string_3.zfill(8)
'00110011'
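The same zero filled binary str can be produced in a single step with a format specifier, which avoids the prefix altogether. A sketch for comparison; both approaches give the same result:

```python
# Two step approach: remove the prefix, then zero fill
two_step = bin(ord('3')).removeprefix('0b').zfill(8)   # '00110011'

# One step approach: 0 fill, width 8, binary presentation type b
one_step = format(ord('3'), '08b')                     # '00110011'

# The same specifier inside a formatted str
in_fstring = f'{ord("3"):08b}'                         # '00110011'
```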
Binary Operators
__add__
is a binary datamodel method used to concatenate two str
instances:
greeting.__add__?
Signature: greeting.__add__(value, /) Call signature: greeting.__add__(*args, **kwargs) Type: method-wrapper String form: <method-wrapper '__add__' of str object at 0x000001E0A216D7B0> Docstring: Return self+value.
'hello' + 'world'
'helloworld'
'hello' + ' ' + 'world'
'hello world'
__mul__
is a binary datamodel method used to replicate the characters in a str
instance using an int
instance:
greeting.__mul__?
Signature: greeting.__mul__(value, /) Call signature: greeting.__mul__(*args, **kwargs) Type: method-wrapper String form: <method-wrapper '__mul__' of str object at 0x000001E0A216D7B0> Docstring: Return self*value.
greeting * 3
'Hello World!Hello World!Hello World!'
The reverse multiplication datamodel method is also defined:
greeting.__rmul__?
Signature: greeting.__rmul__(value, /) Call signature: greeting.__rmul__(*args, **kwargs) Type: method-wrapper String form: <method-wrapper '__rmul__' of str object at 0x000001E0A216D7B0> Docstring: Return value*self.
Which makes the multiplication of the str
instance and int
instance around the *
operator commutative:
3 * greeting
'Hello World!Hello World!Hello World!'
Binary operators are frequently used with assignment:
variables(['greeting',], show_id=True)
Instance Name | Type | Size/Shape | Value | ID
---|---|---|---|---
greeting | str | 12 | Hello World! | 2064303708080
Recall the operation on the right of the assignment operator is carried out first using the original instance. The value returned by this operation is then assigned to the original instance name:
greeting = greeting + ' world!'
variables(['greeting',], show_id=True)
Instance Name | Type | Size/Shape | Value | ID
---|---|---|---|---
greeting | str | 19 | Hello World! world! | 2064304980336
A binary operator for example addition +
can be combined with the assignment operator =
resulting in the "inplace" addition operator +=
. Because the str
instance is immutable the operation is not in place but is equivalent to the order of the two separate operations concatenation and then reassignment as shown above:
greeting += ' world!'
variables(['greeting',], show_id=True)
Instance Name | Type | Size/Shape | Value | ID
---|---|---|---|---
greeting | str | 26 | Hello World! world! world! | 2064296437168
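The changing ID in the tables above can be confirmed without the variable inspector by keeping a second name bound to the original instance. A minimal sketch, where original is a name introduced for illustration:

```python
greeting = 'Hello World!'
original = greeting            # second name bound to the same instance

greeting += ' world!'          # creates a new str and rebinds greeting

original_unchanged = original == 'Hello World!'   # True, nothing was mutated
different_instance = greeting is not original     # True, greeting is a new instance
```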
Splitting and Joining Strings
A number of str
methods are available for splitting and joining str
instances. These generally return or accept a Python collection such as a tuple
of str
instances or a list
of str
instances.
For example the str
instance method partition
and right partition rpartition
will partition a str
instance into a three element tuple
of three str
instances; the substring before the partition, the partition substring and the substring after the partition respectively. To make it more obvious the following str
instance will be instantiated:
greeting = 'hello|world|!'
greeting.partition?
Signature: greeting.partition(sep, /) Docstring: Partition the string into three parts using the given separator. This will search for the separator in the string. If the separator is found, returns a 3-tuple containing the part before the separator, the separator itself, and the part after it. If the separator is not found, returns a 3-tuple containing the original string and two empty strings. Type: builtin_function_or_method
greeting.partition('|')
('hello', '|', 'world|!')
greeting.rpartition?
Signature: greeting.rpartition(sep, /) Docstring: Partition the string into three parts using the given separator. This will search for the separator in the string, starting at the end. If the separator is found, returns a 3-tuple containing the part before the separator, the separator itself, and the part after it. If the separator is not found, returns a 3-tuple containing two empty strings and the original string. Type: builtin_function_or_method
greeting.rpartition('|')
('hello|world', '|', '!')
More generally the str
instance methods split
and join
can be used to split a str
instance into a list
of str
instances or join a list
of str
instances up into a single str
instance. For example if the following sentence is created:
sentence = 'the fat black cat sat on the mat!'
The str
instance method split
can be examined:
sentence.split?
Signature: sentence.split(sep=None, maxsplit=-1) Docstring: Return a list of the substrings in the string, using sep as the separator string. sep The separator used to split the string. When set to None (the default value), will split on any whitespace character (including \n \r \t \f and spaces) and will discard empty strings from the result. maxsplit Maximum number of splits (starting from the left). -1 (the default value) means no limit. Note, str.split() is mainly useful for data that has been intentionally delimited. With natural text that includes punctuation, consider using the regular expression module. Type: builtin_function_or_method
Since the separator is whitespace, the input arguments can be left unspecified so that they take their default values. This gives a list
of str
instances:
words = sentence.split()
variables(['sentence', 'words'], show_id=True)
Instance Name | Type | Size/Shape | Value | ID
---|---|---|---|---
sentence | str | 33 | the fat black cat sat on the mat! | 2064304679920
words | list | 8 | ['the', 'fat', 'black', 'cat', 'sat', 'on', 'the', 'mat!'] | 2064305096496
There is also the str
instance method right split rsplit
. The difference is subtle and the methods behave differently only when maxsplit
is assigned a value that limits the number of splits:
sentence.rsplit?
Signature: sentence.rsplit(sep=None, maxsplit=-1) Docstring: Return a list of the substrings in the string, using sep as the separator string. sep The separator used to split the string. When set to None (the default value), will split on any whitespace character (including \n \r \t \f and spaces) and will discard empty strings from the result. maxsplit Maximum number of splits (starting from the left). -1 (the default value) means no limit. Splitting starts at the end of the string and works to the front. Type: builtin_function_or_method
words_r = sentence.rsplit()
variables(['sentence', 'words', 'words_r'], show_id=True)
Instance Name | Type | Size/Shape | Value | ID
---|---|---|---|---
sentence | str | 33 | the fat black cat sat on the mat! | 2064304679920
words | list | 8 | ['the', 'fat', 'black', 'cat', 'sat', 'on', 'the', 'mat!'] | 2064305097168
words_r | list | 8 | ['the', 'fat', 'black', 'cat', 'sat', 'on', 'the', 'mat!'] | 2064305098960
The difference can be seen when maxsplit
is used:
words = sentence.split(' ', maxsplit=3)
words_r = sentence.rsplit(' ', maxsplit=3)
variables(['sentence', 'words', 'words_r'], show_id=True)
Instance Name | Type | Size/Shape | Value | ID
---|---|---|---|---
sentence | str | 33 | the fat black cat sat on the mat! | 2064304679920
words | list | 4 | ['the', 'fat', 'black', 'cat sat on the mat!'] | 2064305097392
words_r | list | 4 | ['the fat black cat sat', 'on', 'the', 'mat!'] | 2064305161872
To join the words, the str
method join
can be called from a delimiter str
instance:
delimiter = ' '
delimiter.join?
Signature: delimiter.join(iterable, /) Docstring: Concatenate any number of strings. The string whose method is called is inserted in between each given string. The result is returned as a new string. Example: '.'.join(['ab', 'pq', 'rs']) -> 'ab.pq.rs' Type: builtin_function_or_method
variables(show_id=True).loc[['delimiter', 'words']]
Instance Name | Type | Size/Shape | Value | ID
---|---|---|---|---
delimiter | str | 1 |   | 140727149724840
words | list | 4 | ['the', 'fat', 'black', 'cat sat on the mat!'] | 2064305101312
delimiter.join(words)
'the fat black cat sat on the mat!'
join
is typically called directly from a delimiter str
literal:
' '.join(words)
'the fat black cat sat on the mat!'
'|'.join(words)
'the|fat|black|cat sat on the mat!'
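split and join are inverse operations when the same single delimiter is used. Note also that join only accepts str instances, so other datatypes have to be cast first. A brief sketch, with numbers introduced for illustration:

```python
sentence = 'the fat black cat sat on the mat!'

# Splitting and rejoining on the same delimiter restores the original
words = sentence.split(' ')
round_trip = ' '.join(words) == sentence        # True

# join raises TypeError for non-str items, so cast each item to str
numbers = [1, 2, 3]
joined_numbers = ', '.join(str(number) for number in numbers)   # '1, 2, 3'
```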
If a multiline str
instance is created:
paragraph = """The quick brown fox jumps over the lazy dog
The quick brown fox jumps over the lazy dog
The quick brown fox jumps over the lazy dog
The quick brown fox jumps over the lazy dog"""
paragraph
'The quick brown fox jumps over the lazy dog\nThe quick brown fox jumps over the lazy dog\nThe quick brown fox jumps over the lazy dog\nThe quick brown fox jumps over the lazy dog'
There is an associated str
method splitlines
, which splits the str
into a list
using the newline. It has an input argument keepends
which defaults to False
and therefore excludes the newline character:
paragraph.splitlines?
Signature: paragraph.splitlines(keepends=False) Docstring: Return a list of the lines in the string, breaking at line boundaries. Line breaks are not included in the resulting list unless keepends is given and true. Type: builtin_function_or_method
paragraph.splitlines()
['The quick brown fox jumps over the lazy dog', 'The quick brown fox jumps over the lazy dog', 'The quick brown fox jumps over the lazy dog', 'The quick brown fox jumps over the lazy dog']
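Setting keepends to True retains the newline at the end of each line, which means the lines can be joined back together without a delimiter. A minimal sketch using a shorter str, short_paragraph, introduced for illustration:

```python
short_paragraph = 'line one\nline two\nline three'

without_ends = short_paragraph.splitlines()
# ['line one', 'line two', 'line three']

with_ends = short_paragraph.splitlines(keepends=True)
# ['line one\n', 'line two\n', 'line three']

# With the line endings kept, an empty delimiter restores the original
restored = ''.join(with_ends) == short_paragraph    # True
```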
If the multiline string is created with tabs:
paragraph = """\tThe quick brown fox jumps over the lazy dog
\tThe quick brown fox jumps over the lazy dog
\tThe quick brown fox jumps over the lazy dog
\tThe quick brown fox jumps over the lazy dog"""
The tabs can be replaced by a specified number of spaces using the str
method expandtabs
:
paragraph.expandtabs?
Signature: paragraph.expandtabs(tabsize=8) Docstring: Return a copy where all tab characters are expanded using spaces. If tabsize is not given, a tab size of 8 characters is assumed. Type: builtin_function_or_method
paragraph.expandtabs(4)
'    The quick brown fox jumps over the lazy dog\n    The quick brown fox jumps over the lazy dog\n    The quick brown fox jumps over the lazy dog\n    The quick brown fox jumps over the lazy dog'
print(paragraph)
        The quick brown fox jumps over the lazy dog
        The quick brown fox jumps over the lazy dog
        The quick brown fox jumps over the lazy dog
        The quick brown fox jumps over the lazy dog
print(paragraph.expandtabs(4))
    The quick brown fox jumps over the lazy dog
    The quick brown fox jumps over the lazy dog
    The quick brown fox jumps over the lazy dog
    The quick brown fox jumps over the lazy dog
Bytes Related Identifiers
The bytes
class is another text based class. Instead of having the fundamental unit of a Unicode character, it has the fundamental unit of a byte.
The str
instance method encode
encodes the str
to a bytes
instance. By default the 'utf-8'
translation table is used, but the str
instance can be encoded using another translation table:
greeting.encode?
Signature: greeting.encode(encoding='utf-8', errors='strict') Docstring: Encode the string using the codec registered for encoding. encoding The encoding in which to encode the string. errors The error handling scheme to use for encoding errors. The default is 'strict' meaning that encoding errors raise a UnicodeEncodeError. Other possible values are 'ignore', 'replace' and 'xmlcharrefreplace' as well as any other name registered with codecs.register_error that can handle UnicodeEncodeErrors. Type: builtin_function_or_method
Since each ASCII character is encoded as a single byte, the bytes instance displays each such byte as its corresponding ASCII character and therefore the two instances look similar:
greeting.encode()
b'hello|world|!'
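The errors parameter described in the docstring decides what happens when a character has no representation in the chosen encoding. A minimal sketch, assuming a hypothetical str instance price and the 'ascii' codec, which cannot represent the £ sign:

```python
# 'price' is a hypothetical example string, not defined elsewhere in this notebook.
price = '£5'

# errors='strict' (the default) raises a UnicodeEncodeError.
try:
    price.encode(encoding='ascii')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# The other handlers drop or substitute the unencodable character instead.
ignored = price.encode(encoding='ascii', errors='ignore')             # b'5'
replaced = price.encode(encoding='ascii', errors='replace')           # b'?5'
xml_ref = price.encode(encoding='ascii', errors='xmlcharrefreplace')  # b'&#163;5'

print(strict_failed, ignored, replaced, xml_ref)
```

The 'ignore' and 'replace' handlers lose information, so the original str cannot be recovered from the resulting bytes.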
Recall ASCII characters are encoded over the values 0:128, which is the lower half of the 256 values a byte can take. Legacy translation tables use the upper half of the range, 128:256, for additional characters. The £ sign for example is not an ASCII character. In 'latin1' it is encoded as a single byte:
'£'.encode(encoding='latin1')
b'\xa3'
0xa3
163
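Indexing into a bytes instance returns each byte as an int, which makes the split between the two halves of the byte range easy to check. A short sketch using only builtins:

```python
# Every ASCII character encodes to a single byte with a value below 128.
ascii_values = list('hello'.encode(encoding='ascii'))
print(ascii_values)  # [104, 101, 108, 108, 111]

# In 'latin1' the £ sign is a single byte in the upper half of the range.
pound_value = '£'.encode(encoding='latin1')[0]
print(pound_value, hex(pound_value))  # 163 0xa3

# For 'latin1' the byte value equals the Unicode code point of the character.
print(ord('£') == pound_value)  # True
```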
In 'utf-16' each of the most commonly used characters spans 2 bytes, and the remaining characters span 4 bytes. There are variations of utf-16 depending on the byte order. The byte order, or endianness, can be conceptualised by writing the number twelve (in decimal) as 12 (big endian) or 21 (little endian).
Humans normally write numbers using big endian but Intel processors work using little endian. Because both byte orders are in use, there are 2 explicit variations, 'utf-16-be' and 'utf-16-le'. The plain 'utf-16' variant prefixes the data with a 2 byte BOM. The BOM is the byte order mark, which is used to identify the byte order. When encoding, Python uses the byte order of the machine, which is little endian in the output below:
'£'.encode(encoding='utf-16-be')
b'\x00\xa3'
'£'.encode(encoding='utf-16-le')
b'\xa3\x00'
'£'.encode(encoding='utf-16')
b'\xff\xfe\xa3\x00'
'££'.encode(encoding='utf-16')
b'\xff\xfe\xa3\x00\xa3\x00'
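The relationship between the three outputs above can be checked with the int method to_bytes and the BOM constants in the standard library codecs module. This is a sketch for illustration, and the variable names are hypothetical:

```python
import codecs
import sys

code_point = ord('£')  # 163, which is 0x00a3 as a 2 byte value

# The same 2 byte integer laid out in each byte order.
big = code_point.to_bytes(2, byteorder='big')        # b'\x00\xa3'
little = code_point.to_bytes(2, byteorder='little')  # b'\xa3\x00'
print(big == '£'.encode(encoding='utf-16-be'))
print(little == '£'.encode(encoding='utf-16-le'))

# The 'utf-16' codec writes a BOM followed by the data in native byte order.
native_bom = codecs.BOM_UTF16_LE if sys.byteorder == 'little' else codecs.BOM_UTF16_BE
print(sys.byteorder, '£'.encode(encoding='utf-16').startswith(native_bom))

# Decoding with 'utf-16' reads the BOM, selects the byte order and drops the BOM.
decoded = (codecs.BOM_UTF16_BE + big).decode(encoding='utf-16')
print(decoded)  # £

# Characters outside the Basic Multilingual Plane need 4 bytes (a surrogate pair).
emoji_length = len('😀'.encode(encoding='utf-16-le'))
print(emoji_length)  # 4
```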
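The current standard is 'utf-8', which is a variable width encoding. ASCII characters are encoded as a single byte, identical to their ASCII value, and all other characters use 2 to 4 bytes. The £ sign is encoded using 2 bytes, which differ from the bytes used by the previous translation tables: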
'£'.encode(encoding='utf-8')
b'\xc2\xa3'
The Greek letters also require 2 bytes each. Each of the characters in the str instance below, except for the spaces and the exclamation mark, is not an ASCII character and is therefore represented by two hexadecimal escape sequences:
greek_greeting = 'Γειά σου Κόσμε!'
greek_greeting.encode(encoding='utf-8')
b'\xce\x93\xce\xb5\xce\xb9\xce\xac \xcf\x83\xce\xbf\xcf\x85 \xce\x9a\xcf\x8c\xcf\x83\xce\xbc\xce\xb5!'
'Γ'.encode(encoding='utf-8')
b'\xce\x93'
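The 2 byte pattern can be reproduced by hand. UTF-8 stores code points from U+0080 to U+07FF in the bit layout 110xxxxx 10xxxxxx, where the x bits are the 11 bits of the code point. A minimal sketch, where utf8_two_byte is a hypothetical helper written only for this illustration:

```python
def utf8_two_byte(character):
    """Hand-encode a character in the range U+0080 to U+07FF as 2 UTF-8 bytes.

    Hypothetical helper for illustration; str.encode does this already.
    """
    code_point = ord(character)
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError('character does not use the 2 byte UTF-8 form')
    first = 0b11000000 | (code_point >> 6)         # 110xxxxx: upper 5 bits
    second = 0b10000000 | (code_point & 0b111111)  # 10xxxxxx: lower 6 bits
    return bytes([first, second])

print(utf8_two_byte('£'))  # b'\xc2\xa3'
print(utf8_two_byte('Γ'))  # b'\xce\x93'

# The number of bytes grows with the code point.
lengths = {character: len(character.encode(encoding='utf-8')) for character in 'A£Γ€😀'}
print(lengths)  # {'A': 1, '£': 2, 'Γ': 2, '€': 3, '😀': 4}
```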
The bytes
class and the concept of encoding will be covered in more detail in the next notebook.