The name pandas derives from "panel data" and is also a play on "Python data analysis". It is a library built around three main data structures:
- the Index class
- the Series class
- the DataFrame class
Most Index instances are numeric, that is zero-based integers in steps of one (a RangeIndex).
The Index, similar to a tuple, list or 1darray, has a single dimension which can be represented either as a row:

index | 0 | 1 | 2 | 3 |
---|---|---|---|---|
Or as a column when convenient:
index |
---|
0 |
1 |
2 |
3 |
The Series class has a value at each index and a name; it is essentially a numpy 1darray with an associated name and Index. A Series is normally represented as a column (notice the Index associated with the Series is also displayed as a column):
index | 'x' |
---|---|
0 | 1.1 |
1 | 2.1 |
2 | 3.1 |
3 | 4.1 |
The DataFrame class is essentially a grouping of Series instances that share the same Index:
index | 'x' | 'y' |
---|---|---|
0 | 1.1 | 1.2 |
1 | 2.1 | 2.2 |
2 | 3.1 | 3.2 |
3 | 4.1 | 4.2 |
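As a minimal sketch of how these three structures relate (pandas is imported properly in the Importing Libraries section below; it is assumed here):

import pandas as pd

idx = pd.Index([0, 1, 2, 3])                               # the Index
x = pd.Series([1.1, 2.1, 3.1, 4.1], index=idx, name='x')   # a named Series
y = pd.Series([1.2, 2.2, 3.2, 4.2], index=idx, name='y')
df = pd.DataFrame({'x': x, 'y': y})                        # a DataFrame groups Series sharing an Index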
Categorize_Identifiers Module¶
This notebook will use the functions dir2, variables and view from the custom module categorize_identifiers, which is found in the same directory as this notebook file. dir2 is a variant of dir that groups identifiers into a dict under categories, variables is an IPython-based variable inspector and view is used to view a Collection in more detail:
from categorize_identifiers import dir2, variables, view
The following will be imported in order to simplify the output of dir2:
import operator
reverse_operators = ['radd', 'rmod', 'rmul', 'rpow', 'rsub', 'rtruediv']
import obsolete
Importing Libraries¶
To use the data science libraries they need to be imported:
import numpy as np
import pandas as pd
Once imported the identifiers can be viewed:
dir2(pd, exclude_external_modules=True, drop_internal=True, exclude_identifier_list=obsolete.pandas_depreciation)
{'attribute': ['annotations', 'options'], 'constant': ['IndexSlice', 'NA', 'NaT'], 'module': ['api', 'arrays', 'compat', 'core', 'errors', 'io', 'offsets', 'pandas', 'plotting', 'testing', 'tseries', 'util'], 'method': ['array', 'bdate_range', 'concat', 'crosstab', 'cut', 'date_range', 'describe_option', 'eval', 'factorize', 'from_dummies', 'get_dummies', 'get_option', 'infer_freq', 'interval_range', 'isna', 'isnull', 'json_normalize', 'lreshape', 'melt', 'merge', 'merge_asof', 'merge_ordered', 'notna', 'notnull', 'period_range', 'pivot', 'pivot_table', 'qcut', 'read_clipboard', 'read_csv', 'read_excel', 'read_feather', 'read_fwf', 'read_hdf', 'read_html', 'read_json', 'read_orc', 'read_parquet', 'read_pickle', 'read_sas', 'read_spss', 'read_sql', 'read_sql_query', 'read_sql_table', 'read_stata', 'read_table', 'read_xml', 'reset_option', 'set_eng_float_format', 'set_option', 'show_versions', 'test', 'timedelta_range', 'to_datetime', 'to_numeric', 'to_pickle', 'to_timedelta', 'unique', 'wide_to_long'], 'lower_class': ['option_context'], 'upper_class': ['ArrowDtype', 'BooleanDtype', 'Categorical', 'CategoricalDtype', 'CategoricalIndex', 'DataFrame', 'DateOffset', 'DatetimeIndex', 'DatetimeTZDtype', 'ExcelFile', 'ExcelWriter', 'Flags', 'Float32Dtype', 'Float64Dtype', 'Grouper', 'HDFStore', 'Index', 'Int16Dtype', 'Int32Dtype', 'Int64Dtype', 'Int8Dtype', 'Interval', 'IntervalDtype', 'IntervalIndex', 'MultiIndex', 'NamedAgg', 'Period', 'PeriodDtype', 'PeriodIndex', 'RangeIndex', 'Series', 'SparseDtype', 'StringDtype', 'Timedelta', 'TimedeltaIndex', 'Timestamp', 'UInt16Dtype', 'UInt32Dtype', 'UInt64Dtype', 'UInt8Dtype'], 'datamodel_attribute': ['__all__', '__builtins__', '__cached__', '__doc__', '__docformat__', '__file__', '__git_version__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__version__']}
In general pd uses object-oriented programming (OOP) as opposed to functional programming. Most of the functions within pd are used to read in data from a file and output a pd.DataFrame instance.
The classes are all in PascalCase. The main classes are:
- pd.Index
- pd.Series
- pd.DataFrame
The OOP approach means methods are normally called from instances to analyse or manipulate the data the instance holds. The identifiers for the three main classes can be examined; the relationship between the identifiers of these classes will be examined first, before each class is examined in detail later.
The pd.Series can be conceptualised as a np.ndarray with a name and therefore has consistent attributes, methods and numerical datamodel methods:
dir2(pd.Series, np.ndarray, consistent_only=True, exclude_external_modules=True, drop_internal=True, exclude_identifier_list=obsolete.series_depreciation)
{'attribute': ['dtype', 'flags', 'nbytes', 'ndim', 'shape', 'size'], 'constant': ['T'], 'method': ['all', 'any', 'argmax', 'argmin', 'argsort', 'astype', 'clip', 'copy', 'cumprod', 'cumsum', 'dot', 'item', 'max', 'mean', 'min', 'prod', 'repeat', 'round', 'searchsorted', 'squeeze', 'std', 'sum', 'take', 'tolist', 'transpose', 'var'], 'datamodel_attribute': ['__array_priority__', '__doc__', '__hash__'], 'datamodel_method': ['__abs__', '__add__', '__and__', '__array__', '__array_ufunc__', '__bool__', '__class__', '__contains__', '__copy__', '__deepcopy__', '__delattr__', '__delitem__', '__dir__', '__divmod__', '__eq__', '__float__', '__floordiv__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__iadd__', '__iand__', '__ifloordiv__', '__imod__', '__imul__', '__init__', '__init_subclass__', '__int__', '__invert__', '__ior__', '__ipow__', '__isub__', '__iter__', '__itruediv__', '__ixor__', '__le__', '__len__', '__lt__', '__matmul__', '__mod__', '__mul__', '__ne__', '__neg__', '__new__', '__or__', '__pos__', '__pow__', '__radd__', '__rand__', '__rdivmod__', '__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__', '__rmatmul__', '__rmod__', '__rmul__', '__ror__', '__rpow__', '__rsub__', '__rtruediv__', '__rxor__', '__setattr__', '__setitem__', '__setstate__', '__sizeof__', '__str__', '__sub__', '__subclasshook__', '__truediv__', '__xor__']}
A number of the __builtins__ functions are also available as pd.Series methods:
dir2(pd.Series, __builtins__, consistent_only=True, exclude_identifier_list=obsolete.series_depreciation)
{'method': ['abs', 'all', 'any', 'divmod', 'filter', 'map', 'max', 'min', 'pow', 'round', 'sum'], 'lower_class': ['list', 'str'], 'datamodel_attribute': ['__doc__']}
Many of the datamodel operators defined for numeric operations are also available as methods:
dir2(pd.Series, operator, consistent_only=True)
{'attribute': ['index'], 'method': ['abs', 'add', 'eq', 'floordiv', 'ge', 'gt', 'le', 'lt', 'mod', 'mul', 'ne', 'pow', 'sub', 'truediv'], 'datamodel_attribute': ['__doc__'], 'datamodel_method': ['__abs__', '__add__', '__and__', '__contains__', '__delitem__', '__eq__', '__floordiv__', '__ge__', '__getitem__', '__gt__', '__iadd__', '__iand__', '__ifloordiv__', '__imod__', '__imul__', '__invert__', '__ior__', '__ipow__', '__isub__', '__itruediv__', '__ixor__', '__le__', '__lt__', '__matmul__', '__mod__', '__mul__', '__ne__', '__neg__', '__or__', '__pos__', '__pow__', '__setitem__', '__sub__', '__truediv__', '__xor__']}
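These method forms behave like the corresponding operators but accept additional arguments; a small sketch with throwaway Series:

s1 = pd.Series([1.0, 2.0, 3.0])
s2 = pd.Series([10.0, 20.0])
s1 + s2                      # indexes are aligned; position 2 has no pair so becomes NaN
s1.add(s2, fill_value=0.0)   # the method form can fill the missing value before adding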
The pd.Series has additional attributes such as:

- values which returns the underlying values, normally as a np.ndarray (1darray) when numeric.
- name which is a str instance corresponding to the Series name.
- index which returns the pd.Index instance.
- at, iat, loc and iloc which are used for additional indexing purposes.
- attrs which is used to store optional metadata.
- The following attributes, which return a bool of True when (see the sketch after this list):
  - hasnans if any element is NaN or None
  - empty if no values are supplied
  - is_unique if every value is unique
  - is_monotonic_increasing if every value >= the previous value
  - is_monotonic_decreasing if every value <= the previous value
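A minimal sketch of these bool attributes (throwaway values chosen for illustration):

import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 2.0, np.nan])
s.hasnans                                       # True, the last element is NaN
s.is_unique                                     # False, 2.0 appears twice
pd.Series([1, 2, 3]).is_monotonic_increasing    # True, each value >= the previous
pd.Series([3, 2, 1]).is_monotonic_decreasing    # True, each value <= the previous
pd.Series(dtype=float).empty                    # True, no values supplied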
The pd.Series has additional methods. Many of these are statistical methods with names similar to functions found in the statistics module. There are also a large number of to_* methods which convert the pd.Series to another datatype or file format.
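For example (a throwaway Series for illustration):

s = pd.Series([1.1, 2.1, 3.1], name='x')
s.to_list()    # [1.1, 2.1, 3.1], a builtins list
s.to_numpy()   # array([1.1, 2.1, 3.1]), a np.ndarray
s.to_frame()   # a single-column pd.DataFrame named 'x'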
There are also lowercase classes:

- str which groups together str identifiers
- cat which groups together category identifiers
- dt which groups together datetime (datetime.datetime or datetime.timedelta) identifiers
- plot which groups together pyplot based functions which can be used on the values
- list which groups together list[pyarrow] identifiers (pyarrow features are still experimental)

The lowercase classes are appropriate when the pd.Series datatype matches one of the class types above.
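For example the str lowercase class is available when the Series holds str instances:

s = pd.Series(['the fat black cat', 'sat on the mat'])
s.str.upper()         # applies str.upper elementwise
s.str.contains('cat') # elementwise boolean check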
dir2(pd.Series, [np.ndarray, operator], unique_only=True,
exclude_external_modules=True, exclude_identifier_list=reverse_operators+obsolete.series_depreciation, drop_internal=True)
{'attribute': ['array', 'at', 'attrs', 'axes', 'dtypes', 'empty', 'hasnans', 'iat', 'iloc', 'is_monotonic_decreasing', 'is_monotonic_increasing', 'is_unique', 'loc', 'name', 'values'], 'method': ['add_prefix', 'add_suffix', 'agg', 'aggregate', 'align', 'apply', 'asfreq', 'asof', 'at_time', 'autocorr', 'between', 'between_time', 'bfill', 'case_when', 'combine', 'combine_first', 'compare', 'convert_dtypes', 'corr', 'count', 'cov', 'cummax', 'cummin', 'describe', 'diff', 'div', 'divide', 'divmod', 'drop', 'drop_duplicates', 'droplevel', 'dropna', 'duplicated', 'equals', 'ewm', 'expanding', 'explode', 'factorize', 'ffill', 'fillna', 'filter', 'first', 'first_valid_index', 'get', 'groupby', 'head', 'hist', 'idxmax', 'idxmin', 'infer_objects', 'info', 'interpolate', 'isin', 'isna', 'isnull', 'items', 'keys', 'kurt', 'kurtosis', 'last', 'last_valid_index', 'map', 'mask', 'median', 'memory_usage', 'mode', 'multiply', 'nlargest', 'notna', 'notnull', 'nsmallest', 'nunique', 'pad', 'pct_change', 'pipe', 'pop', 'product', 'quantile', 'rank', 'rdiv', 'rdivmod', 'reindex', 'reindex_like', 'rename', 'rename_axis', 'reorder_levels', 'replace', 'resample', 'reset_index', 'rfloordiv', 'rolling', 'sample', 'sem', 'set_axis', 'set_flags', 'shift', 'skew', 'sort_index', 'sort_values', 'subtract', 'swaplevel', 'tail', 'to_clipboard', 'to_csv', 'to_dict', 'to_excel', 'to_frame', 'to_hdf', 'to_json', 'to_latex', 'to_list', 'to_markdown', 'to_numpy', 'to_period', 'to_pickle', 'to_sql', 'to_string', 'to_timestamp', 'to_xarray', 'transform', 'truncate', 'tz_convert', 'tz_localize', 'unique', 'unstack', 'update', 'value_counts', 'where', 'xs'], 'lower_class': ['cat', 'dt', 'list', 'plot', 'sparse', 'str', 'struct'], 'datamodel_attribute': ['__annotations__', '__dict__', '__module__', '__pandas_priority__', '__weakref__'], 'datamodel_method': ['__column_consortium_standard__', '__finalize__', '__getattr__', '__nonzero__', '__round__']}
A pd.DataFrame is essentially a grouping of pd.Series instances and therefore has many identifiers in common.

The attributes dtypes and values are used as plural equivalents to dtype and array seen in the pd.Series. dtypes reflects the fact that each pd.Series in the pd.DataFrame instance can have a different datatype. However when every pd.Series is numeric, values is a 2d ndarray instance. Most of the statistical methods will also only work when every pd.Series in the pd.DataFrame instance is numeric; it is more common to use statistical functions on a pd.Series.
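A short sketch (columns chosen for illustration):

df = pd.DataFrame({'x': [1.1, 2.1], 'y': [1, 2]})
df.dtypes        # one dtype per Series: x is float64, y is int64
df.values        # a 2d np.ndarray; the mixed dtypes are upcast to float64
df['x'].mean()   # statistical methods are more commonly used on a single Series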
dir2(pd.DataFrame, pd.Series, consistent_only=True, drop_internal=True, exclude_identifier_list=obsolete.df_depreciation)
{'attribute': ['at', 'attrs', 'axes', 'dtypes', 'empty', 'flags', 'iat', 'iloc', 'index', 'loc', 'ndim', 'shape', 'size', 'values'], 'constant': ['T'], 'method': ['abs', 'add', 'add_prefix', 'add_suffix', 'agg', 'aggregate', 'align', 'all', 'any', 'apply', 'asfreq', 'asof', 'astype', 'at_time', 'backfill', 'between_time', 'bfill', 'clip', 'combine', 'combine_first', 'compare', 'convert_dtypes', 'copy', 'corr', 'count', 'cov', 'cummax', 'cummin', 'cumprod', 'cumsum', 'describe', 'diff', 'div', 'divide', 'dot', 'drop', 'drop_duplicates', 'droplevel', 'dropna', 'duplicated', 'eq', 'equals', 'ewm', 'expanding', 'explode', 'ffill', 'fillna', 'filter', 'first_valid_index', 'floordiv', 'ge', 'get', 'groupby', 'gt', 'head', 'hist', 'idxmax', 'idxmin', 'infer_objects', 'info', 'interpolate', 'isin', 'isna', 'isnull', 'items', 'keys', 'kurt', 'kurtosis', 'last_valid_index', 'le', 'lt', 'map', 'mask', 'max', 'mean', 'median', 'memory_usage', 'min', 'mod', 'mode', 'mul', 'multiply', 'ne', 'nlargest', 'notna', 'notnull', 'nsmallest', 'nunique', 'pad', 'pct_change', 'pipe', 'pop', 'pow', 'prod', 'product', 'quantile', 'radd', 'rank', 'rdiv', 'reindex', 'reindex_like', 'rename', 'rename_axis', 'reorder_levels', 'replace', 'resample', 'reset_index', 'rfloordiv', 'rmod', 'rmul', 'rolling', 'round', 'rpow', 'rsub', 'rtruediv', 'sample', 'sem', 'set_axis', 'set_flags', 'shift', 'skew', 'sort_index', 'sort_values', 'squeeze', 'std', 'sub', 'subtract', 'sum', 'swaplevel', 'tail', 'take', 'to_clipboard', 'to_csv', 'to_dict', 'to_excel', 'to_hdf', 'to_json', 'to_latex', 'to_markdown', 'to_numpy', 'to_period', 'to_pickle', 'to_sql', 'to_string', 'to_timestamp', 'to_xarray', 'transform', 'transpose', 'truediv', 'truncate', 'tz_convert', 'tz_localize', 'unstack', 'update', 'value_counts', 'var', 'where', 'xs'], 'lower_class': ['plot', 'sparse'], 'datamodel_attribute': ['__annotations__', '__array_priority__', '__dict__', '__doc__', '__hash__', '__module__', '__pandas_priority__', '__weakref__'], 'datamodel_method': ['__abs__', '__add__', '__and__', '__array__', '__array_ufunc__', '__bool__', '__class__', '__contains__', '__copy__', '__deepcopy__', '__delattr__', '__delitem__', '__dir__', '__divmod__', '__eq__', '__finalize__', '__floordiv__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__iadd__', '__iand__', '__ifloordiv__', '__imod__', '__imul__', '__init__', '__init_subclass__', '__invert__', '__ior__', '__ipow__', '__isub__', '__iter__', '__itruediv__', '__ixor__', '__le__', '__len__', '__lt__', '__matmul__', '__mod__', '__mul__', '__ne__', '__neg__', '__new__', '__nonzero__', '__or__', '__pos__', '__pow__', '__radd__', '__rand__', '__rdivmod__', '__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__', '__rmatmul__', '__rmod__', '__rmul__', '__ror__', '__round__', '__rpow__', '__rsub__', '__rtruediv__', '__rxor__', '__setattr__', '__setitem__', '__setstate__', '__sizeof__', '__str__', '__sub__', '__subclasshook__', '__truediv__', '__xor__']}
The pd.DataFrame class only has a small number of additional identifiers, such as the attribute columns which holds the column labels, each typically a str instance corresponding to the name of a pd.Series available in the pd.DataFrame. The pd.DataFrame instance is mutable and the insert method can be used to insert another pd.Series instance. join and merge are used to join and merge pd.DataFrame instances (see the sketch below).

The style attribute is an additional attribute which can be used to control the style of the pd.DataFrame instance when displayed in a cell output.
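A small sketch of insert and join (throwaway frames for illustration):

df = pd.DataFrame({'x': [1.1, 2.1]})
df.insert(loc=1, column='y', value=[1.2, 2.2])   # mutates df in place, adding column 'y'
other = pd.DataFrame({'z': [1.3, 2.3]})
df.join(other)                                   # returns a new DataFrame joined on the index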
dir2(pd.DataFrame, pd.Series, unique_only=True, drop_internal=True, exclude_identifier_list=obsolete.df_depreciation)
{'attribute': ['columns', 'style'], 'method': ['assign', 'boxplot', 'corrwith', 'eval', 'from_dict', 'from_records', 'insert', 'isetitem', 'iterrows', 'itertuples', 'join', 'melt', 'merge', 'pivot', 'pivot_table', 'query', 'select_dtypes', 'set_index', 'stack', 'to_feather', 'to_html', 'to_orc', 'to_parquet', 'to_records', 'to_stata', 'to_xml'], 'datamodel_method': ['__arrow_c_stream__', '__dataframe__', '__dataframe_consortium_standard__']}
The identifiers unique to a pd.Series instance are singular identifiers, typically only meaningful for a single column of data. These include the lowercase classes used to work with a specific datatype across a pd.Series:
dir2(pd.Series, pd.DataFrame, unique_only=True, drop_internal=True, exclude_identifier_list=obsolete.series_depreciation)
{'attribute': ['array', 'dtype', 'hasnans', 'is_monotonic_decreasing', 'is_monotonic_increasing', 'is_unique', 'name', 'nbytes'], 'method': ['argmax', 'argmin', 'argsort', 'autocorr', 'between', 'case_when', 'divmod', 'factorize', 'item', 'rdivmod', 'repeat', 'searchsorted', 'to_frame', 'to_list', 'tolist', 'unique'], 'lower_class': ['cat', 'dt', 'list', 'str', 'struct'], 'datamodel_method': ['__column_consortium_standard__', '__float__', '__int__']}
Each pd.Series is associated with a pd.Index and every pd.Series in a pd.DataFrame uses the same pd.Index. Many identifiers of the pd.Index can be seen to be consistent with a pd.Series as it is also based on an np.ndarray:
dir2(pd.Index, pd.Series, consistent_only=True, exclude_external_modules=True, drop_internal=True, exclude_identifier_list=obsolete.index_depreciation)
{'attribute': ['array', 'dtype', 'empty', 'hasnans', 'is_monotonic_decreasing', 'is_monotonic_increasing', 'is_unique', 'name', 'nbytes', 'ndim', 'shape', 'size', 'values'], 'constant': ['T'], 'method': ['all', 'any', 'argmax', 'argmin', 'argsort', 'asof', 'astype', 'copy', 'diff', 'drop', 'drop_duplicates', 'droplevel', 'dropna', 'duplicated', 'equals', 'factorize', 'fillna', 'groupby', 'infer_objects', 'isin', 'isna', 'isnull', 'item', 'map', 'max', 'memory_usage', 'min', 'notna', 'notnull', 'nunique', 'ravel', 'reindex', 'rename', 'repeat', 'round', 'searchsorted', 'shift', 'sort_values', 'take', 'to_frame', 'to_list', 'to_numpy', 'tolist', 'transpose', 'unique', 'value_counts', 'view', 'where'], 'lower_class': ['str'], 'datamodel_attribute': ['__annotations__', '__array_priority__', '__dict__', '__doc__', '__hash__', '__module__', '__pandas_priority__', '__weakref__'], 'datamodel_method': ['__abs__', '__add__', '__and__', '__array__', '__array_ufunc__', '__bool__', '__class__', '__contains__', '__copy__', '__deepcopy__', '__delattr__', '__dir__', '__divmod__', '__eq__', '__floordiv__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__iadd__', '__init__', '__init_subclass__', '__invert__', '__iter__', '__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__', '__neg__', '__new__', '__nonzero__', '__or__', '__pos__', '__pow__', '__radd__', '__rand__', '__rdivmod__', '__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__', '__rmod__', '__rmul__', '__ror__', '__rpow__', '__rsub__', '__rtruediv__', '__rxor__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__sub__', '__subclasshook__', '__truediv__', '__xor__']}
The pd.Index class only has a small number of identifiers not found in a pd.Series instance. The pd.Index is typically numeric, but an Index (and each level of a multilevel index) can be assigned a name, and the attribute names retrieves these. Typically each index value should be unique for the purpose of indexing; the attribute has_duplicates is used to check for duplicate values and the pd.Series or pd.DataFrame method reset_index can be used in this case to return a unique index of numeric values. More complicated indexes can be multilevel and the nlevels attribute gives the number of levels in the index:
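A short sketch of these (an index with duplicates, for illustration):

s = pd.Series([1.1, 2.1, 3.1], index=pd.Index(['a', 'a', 'b'], name='letter'))
s.index.names            # FrozenList(['letter'])
s.index.has_duplicates   # True, 'a' appears twice
s.index.nlevels          # 1, a flat single-level index
s.reset_index()          # DataFrame with a fresh RangeIndex; the old index becomes a column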
dir2(pd.Index, pd.Series, unique_only=True, exclude_external_modules=True, drop_internal=True, exclude_identifier_list=obsolete.index_depreciation)
{'attribute': ['has_duplicates', 'inferred_type', 'names', 'nlevels'], 'method': ['append', 'asof_locs', 'delete', 'difference', 'get_indexer', 'get_indexer_for', 'get_indexer_non_unique', 'get_level_values', 'get_loc', 'get_slice_bound', 'identical', 'insert', 'intersection', 'is_', 'join', 'putmask', 'set_names', 'slice_indexer', 'slice_locs', 'sort', 'sortlevel', 'symmetric_difference', 'to_flat_index', 'to_series', 'union'], 'datamodel_method': ['__array_wrap__']}
Series¶
The initialisation signature for the pd.Series class can be examined:
pd.Series?
Init signature: pd.Series( data=None, index=None, dtype: 'Dtype | None' = None, name=None, copy: 'bool | None' = None, fastpath: 'bool | lib.NoDefault' = <no_default>, ) -> 'None' Docstring: One-dimensional ndarray with axis labels (including time series). Labels need not be unique but must be a hashable type. The object supports both integer- and label-based indexing and provides a host of methods for performing operations involving the index. Statistical methods from ndarray have been overridden to automatically exclude missing data (currently represented as NaN). Operations between Series (+, -, /, \*, \*\*) align values based on their associated index values-- they need not be the same length. The result index will be the sorted union of the two indexes. Parameters ---------- data : array-like, Iterable, dict, or scalar value Contains data stored in Series. If data is a dict, argument order is maintained. index : array-like or Index (1d) Values must be hashable and have the same length as `data`. Non-unique index values are allowed. Will default to RangeIndex (0, 1, 2, ..., n) if not provided. If data is dict-like and index is None, then the keys in the data are used as the index. If the index is not None, the resulting Series is reindexed with the index values. dtype : str, numpy.dtype, or ExtensionDtype, optional Data type for the output Series. If not specified, this will be inferred from `data`. See the :ref:`user guide <basics.dtypes>` for more usages. name : Hashable, default None The name to give to the Series. copy : bool, default False Copy input data. Only affects Series or 1d ndarray input. See examples. Notes ----- Please reference the :ref:`User Guide <basics.series>` for more information. Examples -------- Constructing Series from a dictionary with an Index specified >>> d = {'a': 1, 'b': 2, 'c': 3} >>> ser = pd.Series(data=d, index=['a', 'b', 'c']) >>> ser a 1 b 2 c 3 dtype: int64 The keys of the dictionary match with the Index values, hence the Index values have no effect. >>> d = {'a': 1, 'b': 2, 'c': 3} >>> ser = pd.Series(data=d, index=['x', 'y', 'z']) >>> ser x NaN y NaN z NaN dtype: float64 Note that the Index is first build with the keys from the dictionary. After this the Series is reindexed with the given Index values, hence we get all NaN as a result. Constructing Series from a list with `copy=False`. >>> r = [1, 2] >>> ser = pd.Series(r, copy=False) >>> ser.iloc[0] = 999 >>> r [1, 2] >>> ser 0 999 1 2 dtype: int64 Due to input data type the Series has a `copy` of the original data even though `copy=False`, so the data is unchanged. Constructing Series from a 1d ndarray with `copy=False`. >>> r = np.array([1, 2]) >>> ser = pd.Series(r, copy=False) >>> ser.iloc[0] = 999 >>> r array([999, 2]) >>> ser 0 999 1 2 dtype: int64 Due to input data type the Series has a `view` on the original data, so the data is changed as well. File: c:\users\phili\anaconda3\envs\vscode-env\lib\site-packages\pandas\core\series.py Type: type Subclasses: SubclassedSeries
The main keyword input arguments are:
- data
- index
- dtype
- name
If these are not supplied, an empty Series instance with an empty index, no name and the generic object datatype is instantiated:
pd.Series()
Series([], dtype: object)
Normally data is supplied in the form of a numpy 1darray:
pd.Series(data=np.array([1, 2, 3]))
0 1 1 2 2 3 dtype: int32
Since an ndarray itself can be initialised from a list, this can be abbreviated to:
pd.Series(data=[1, 2, 3])
0 1 1 2 2 3 dtype: int64
When dtype=None, the data type will be inferred from the data:
pd.Series(data=[1., 2., 3.])
0 1.0 1 2.0 2 3.0 dtype: float64
from datetime import datetime, timedelta
pd.Series(data=[datetime.now(),
datetime.now() + timedelta(days=1),
datetime.now() + timedelta(days=2)])
0 2024-02-15 01:23:45.711586 1 2024-02-16 01:23:45.711586 2 2024-02-17 01:23:45.711586 dtype: datetime64[ns]
pd.Series(data=['a', 'b', 'c'])
0 a 1 b 2 c dtype: object
Anything with a str in it is classed as non-numeric and is given the generic dtype object (meaning each element can be any Python object).
The dtype can be manually overridden when supplying the numpy 1darray by using the np.array input argument dtype:
pd.Series(data=np.array([1., 2., 3.], dtype=np.int32))
0 1 1 2 2 3 dtype: int32
Or alternatively by using the Series keyword input argument dtype:
pd.Series(data=[1., 2., 3.], dtype=np.int32)
0 1 1 2 2 3 dtype: int32
Notice that the index is zero-based numeric in integer steps of 1 by default (a RangeIndex). This can be manually changed by use of the keyword input argument index, providing an Index instance, ndarray instance or list instance of index values:
pd.Series(index=['a', 'b', 'c'], data=[1., 2., 3.], dtype=np.int32)
a 1 b 2 c 3 dtype: int32
A Series usually also has a name:
pd.Series(index=['a', 'b', 'c'], data=[1., 2., 3.], dtype=np.int32, name='x')
a 1 b 2 c 3 Name: x, dtype: int32
Normally the data and name are supplied and the index and dtype are inferred:
pd.Series([1.1, 2.1, 3.1, 4.1], name='x')
0 1.1 1 2.1 2 3.1 3 4.1 Name: x, dtype: float64
DataFrame¶
The initialisation signature for the pd.DataFrame class can be examined:
pd.DataFrame?
Init signature: pd.DataFrame( data=None, index: 'Axes | None' = None, columns: 'Axes | None' = None, dtype: 'Dtype | None' = None, copy: 'bool | None' = None, ) -> 'None' Docstring: Two-dimensional, size-mutable, potentially heterogeneous tabular data. Data structure also contains labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure. Parameters ---------- data : ndarray (structured or homogeneous), Iterable, dict, or DataFrame Dict can contain Series, arrays, constants, dataclass or list-like objects. If data is a dict, column order follows insertion-order. If a dict contains Series which have an index defined, it is aligned by its index. This alignment also occurs if data is a Series or a DataFrame itself. Alignment is done on Series/DataFrame inputs. If data is a list of dicts, column order follows insertion-order. index : Index or array-like Index to use for resulting frame. Will default to RangeIndex if no indexing information part of input data and no index provided. columns : Index or array-like Column labels to use for resulting frame when data does not have them, defaulting to RangeIndex(0, 1, 2, ..., n). If data contains column labels, will perform column selection instead. dtype : dtype, default None Data type to force. Only a single dtype is allowed. If None, infer. copy : bool or None, default None Copy data from inputs. For dict data, the default of None behaves like ``copy=True``. For DataFrame or 2d ndarray input, the default of None behaves like ``copy=False``. If data is a dict containing one or more Series (possibly of different dtypes), ``copy=False`` will ensure that these inputs are not copied. .. versionchanged:: 1.3.0 See Also -------- DataFrame.from_records : Constructor from tuples, also record arrays. DataFrame.from_dict : From dicts of Series, arrays, or dicts. read_csv : Read a comma-separated values (csv) file into DataFrame. read_table : Read general delimited file into DataFrame. read_clipboard : Read text from clipboard into DataFrame. Notes ----- Please reference the :ref:`User Guide <basics.dataframe>` for more information. Examples -------- Constructing DataFrame from a dictionary. >>> d = {'col1': [1, 2], 'col2': [3, 4]} >>> df = pd.DataFrame(data=d) >>> df col1 col2 0 1 3 1 2 4 Notice that the inferred dtype is int64. >>> df.dtypes col1 int64 col2 int64 dtype: object To enforce a single dtype: >>> df = pd.DataFrame(data=d, dtype=np.int8) >>> df.dtypes col1 int8 col2 int8 dtype: object Constructing DataFrame from a dictionary including Series: >>> d = {'col1': [0, 1, 2, 3], 'col2': pd.Series([2, 3], index=[2, 3])} >>> pd.DataFrame(data=d, index=[0, 1, 2, 3]) col1 col2 0 0 NaN 1 1 NaN 2 2 2.0 3 3 3.0 Constructing DataFrame from numpy ndarray: >>> df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), ... columns=['a', 'b', 'c']) >>> df2 a b c 0 1 2 3 1 4 5 6 2 7 8 9 Constructing DataFrame from a numpy ndarray that has labeled columns: >>> data = np.array([(1, 2, 3), (4, 5, 6), (7, 8, 9)], ... dtype=[("a", "i4"), ("b", "i4"), ("c", "i4")]) >>> df3 = pd.DataFrame(data, columns=['c', 'a']) ... 
>>> df3 c a 0 3 1 1 6 4 2 9 7 Constructing DataFrame from dataclass: >>> from dataclasses import make_dataclass >>> Point = make_dataclass("Point", [("x", int), ("y", int)]) >>> pd.DataFrame([Point(0, 0), Point(0, 3), Point(2, 3)]) x y 0 0 0 1 0 3 2 2 3 Constructing DataFrame from Series/DataFrame: >>> ser = pd.Series([1, 2, 3], index=["a", "b", "c"]) >>> df = pd.DataFrame(data=ser, index=["a", "c"]) >>> df 0 a 1 c 3 >>> df1 = pd.DataFrame([1, 2, 3], index=["a", "b", "c"], columns=["x"]) >>> df2 = pd.DataFrame(data=df1, index=["a", "c"]) >>> df2 x a 1 c 3 File: c:\users\phili\anaconda3\envs\vscode-env\lib\site-packages\pandas\core\frame.py Type: type Subclasses: SubclassedDataFrame
The keyword input arguments for a DataFrame instance are similar to those found for a Series instance, however because a DataFrame is a collection of Series most of these are plural:

- data (plural)
- index (singular)
- columns (plural of name)
- dtype (singular: a single dtype applied to every column)
pd.DataFrame(data=[[1.1, 1.2],
[2.1, 2.2],
[3.1, 3.2],
[4.1, 4.2]],
index=['a', 'b', 'c', 'd'],
columns=('x', 'y'),
dtype=np.float64)
x | y | |
---|---|---|
a | 1.1 | 1.2 |
b | 2.1 | 2.2 |
c | 3.1 | 3.2 |
d | 4.1 | 4.2 |
Per the initialisation signature, only a single dtype is allowed and it is applied to every Series instance in the DataFrame instance; a collection of different dtypes cannot be supplied here. Per-column datatypes are instead set after construction, for example with the astype method.
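A short sketch of setting per-column dtypes after construction:

df = pd.DataFrame(data=[[1.1, 1.2], [2.1, 2.2]], columns=('x', 'y'))
df.astype({'x': np.int32, 'y': np.float64}).dtypes   # a dict maps each column name to a dtype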
Once again, normally the dtype and index are inferred:
pd.DataFrame(data=[[1.1, 1.2],
[2.1, 2.2],
[3.1, 3.2],
[4.1, 4.2]],
columns=('x', 'y'))
x | y | |
---|---|---|
0 | 1.1 | 1.2 |
1 | 2.1 | 2.2 |
2 | 3.1 | 3.2 |
3 | 4.1 | 4.2 |
It is common to supply columns and data in the form of a mapping. Each key should be a str instance which will become the column name and each value should be a np.ndarray (1d) or list instance which corresponds to the data for that pd.Series:
pd.DataFrame({'x': np.array([1.1, 2.1, 3.1, 3.1]),
'y': np.array([1.2, 2.2, 3.2, 4.2])})
x | y | |
---|---|---|
0 | 1.1 | 1.2 |
1 | 2.1 | 2.2 |
2 | 3.1 | 3.2 |
3 | 3.1 | 4.2 |
Reading Data from Files¶
The Series and DataFrame instances previously examined were created from scratch using builtins datatypes. It is also common to read data in from another source. The pandas library therefore has a number of functions for reading in data from external files. The function names all have a read_ prefix followed by the file type:
for identifier in dir(pd):
if identifier.startswith('read_'):
print(identifier)
read_clipboard read_csv read_excel read_feather read_fwf read_gbq read_hdf read_html read_json read_orc read_parquet read_pickle read_sas read_spss read_sql read_sql_query read_sql_table read_stata read_table read_xml
Some of the more common formats will be explored.
Comma Separated Values File¶
CSV is an abbreviation for Comma Separated Values.
The file format uses:

- a comma , as the column separator, which is where the name comma separated values comes from
- a newline \n as the row separator
A csv with escape characters explicitly shown looks as follows:
string,integer,bool,float,date,time,category\n
the fat black cat,4,TRUE,0.86,24/07/2023,11:36:00,A\n
sat on the mat,4,TRUE,0.86,25/07/2023,12:36:00,A\n
"twinkle, twinkle",2,TRUE,-1.14,26/07/2023,13:36:00,B\n
little star,2,TRUE,-1.14,27/07/2023,14:36:00,B\n
how I wonder,3,FALSE,-0.14,28/07/2023,15:36:00,B\n
what you are,4,TRUE,0.86,29/07/2023,16:36:00,B\n
This can be written to a csv file:
%%writefile ./files/Book1.csv
string,integer,bool,float,date,time,category
the fat black cat,4,TRUE,0.86,24/07/2023,11:36:00,A
sat on the mat,4,TRUE,0.86,25/07/2023,12:36:00,A
"twinkle, twinkle",2,TRUE,-1.14,26/07/2023,13:36:00,B
little star,2,TRUE,-1.14,27/07/2023,14:36:00,B
how I wonder,3,FALSE,-0.14,28/07/2023,15:36:00,B
what you are,4,TRUE,0.86,29/07/2023,16:36:00,B
Overwriting ./files/Book1.csv
When opened in a program such as Microsoft Excel, these delimiters display as a grid.
Notice that the comma in twinkle, twinkle is not a delimiter but part of the str. For this reason "twinkle, twinkle" is enclosed in quotation marks.
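A small sketch of this quoting behaviour, reading csv text directly from an io.StringIO buffer rather than a file:

import io
csv_text = 'quote,n\n"twinkle, twinkle",2\nlittle star,2\n'
pd.read_csv(io.StringIO(csv_text))   # the quoted comma stays inside the first field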
The pd.read_csv function is used to read in a csv file, returning a DataFrame instance:
pd.read_csv?
Signature: pd.read_csv( filepath_or_buffer: 'FilePath | ReadCsvBuffer[bytes] | ReadCsvBuffer[str]', *, sep: 'str | None | lib.NoDefault' = <no_default>, delimiter: 'str | None | lib.NoDefault' = None, header: "int | Sequence[int] | None | Literal['infer']" = 'infer', names: 'Sequence[Hashable] | None | lib.NoDefault' = <no_default>, index_col: 'IndexLabel | Literal[False] | None' = None, usecols: 'UsecolsArgType' = None, dtype: 'DtypeArg | None' = None, engine: 'CSVEngine | None' = None, converters: 'Mapping[Hashable, Callable] | None' = None, true_values: 'list | None' = None, false_values: 'list | None' = None, skipinitialspace: 'bool' = False, skiprows: 'list[int] | int | Callable[[Hashable], bool] | None' = None, skipfooter: 'int' = 0, nrows: 'int | None' = None, na_values: 'Hashable | Iterable[Hashable] | Mapping[Hashable, Iterable[Hashable]] | None' = None, keep_default_na: 'bool' = True, na_filter: 'bool' = True, verbose: 'bool | lib.NoDefault' = <no_default>, skip_blank_lines: 'bool' = True, parse_dates: 'bool | Sequence[Hashable] | None' = None, infer_datetime_format: 'bool | lib.NoDefault' = <no_default>, keep_date_col: 'bool | lib.NoDefault' = <no_default>, date_parser: 'Callable | lib.NoDefault' = <no_default>, date_format: 'str | dict[Hashable, str] | None' = None, dayfirst: 'bool' = False, cache_dates: 'bool' = True, iterator: 'bool' = False, chunksize: 'int | None' = None, compression: 'CompressionOptions' = 'infer', thousands: 'str | None' = None, decimal: 'str' = '.', lineterminator: 'str | None' = None, quotechar: 'str' = '"', quoting: 'int' = 0, doublequote: 'bool' = True, escapechar: 'str | None' = None, comment: 'str | None' = None, encoding: 'str | None' = None, encoding_errors: 'str | None' = 'strict', dialect: 'str | csv.Dialect | None' = None, on_bad_lines: 'str' = 'error', delim_whitespace: 'bool | lib.NoDefault' = <no_default>, low_memory: 'bool' = True, memory_map: 'bool' = False, float_precision: "Literal['high', 'legacy'] | None" = None, storage_options: 'StorageOptions | None' = None, dtype_backend: 'DtypeBackend | lib.NoDefault' = <no_default>, ) -> 'DataFrame | TextFileReader' Docstring: Read a comma-separated values (csv) file into DataFrame. Also supports optionally iterating or breaking of the file into chunks. Additional help can be found in the online docs for `IO Tools <https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html>`_. Parameters ---------- filepath_or_buffer : str, path object or file-like object Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, gs, and file. For file URLs, a host is expected. A local file could be: file://localhost/path/to/table.csv. If you want to pass in a path object, pandas accepts any ``os.PathLike``. By file-like object, we refer to objects with a ``read()`` method, such as a file handle (e.g. via builtin ``open`` function) or ``StringIO``. sep : str, default ',' Character or regex pattern to treat as the delimiter. If ``sep=None``, the C engine cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will be used and automatically detect the separator from only the first valid row of the file by Python's builtin sniffer tool, ``csv.Sniffer``. In addition, separators longer than 1 character and different from ``'\s+'`` will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: ``'\r\t'``. 
delimiter : str, optional Alias for ``sep``. header : int, Sequence of int, 'infer' or None, default 'infer' Row number(s) containing column labels and marking the start of the data (zero-indexed). Default behavior is to infer the column names: if no ``names`` are passed the behavior is identical to ``header=0`` and column names are inferred from the first line of the file, if column names are passed explicitly to ``names`` then the behavior is identical to ``header=None``. Explicitly pass ``header=0`` to be able to replace existing names. The header can be a list of integers that specify row locations for a :class:`~pandas.MultiIndex` on the columns e.g. ``[0, 1, 3]``. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped). Note that this parameter ignores commented lines and empty lines if ``skip_blank_lines=True``, so ``header=0`` denotes the first line of data rather than the first line of the file. names : Sequence of Hashable, optional Sequence of column labels to apply. If the file contains a header row, then you should explicitly pass ``header=0`` to override the column names. Duplicates in this list are not allowed. index_col : Hashable, Sequence of Hashable or False, optional Column(s) to use as row label(s), denoted either by column labels or column indices. If a sequence of labels or indices is given, :class:`~pandas.MultiIndex` will be formed for the row labels. Note: ``index_col=False`` can be used to force pandas to *not* use the first column as the index, e.g., when you have a malformed file with delimiters at the end of each line. usecols : Sequence of Hashable or Callable, optional Subset of columns to select, denoted either by column labels or column indices. If list-like, all elements must either be positional (i.e. integer indices into the document columns) or strings that correspond to column names provided either by the user in ``names`` or inferred from the document header row(s). If ``names`` are given, the document header row(s) are not taken into account. For example, a valid list-like ``usecols`` parameter would be ``[0, 1, 2]`` or ``['foo', 'bar', 'baz']``. Element order is ignored, so ``usecols=[0, 1]`` is the same as ``[1, 0]``. To instantiate a :class:`~pandas.DataFrame` from ``data`` with element order preserved use ``pd.read_csv(data, usecols=['foo', 'bar'])[['foo', 'bar']]`` for columns in ``['foo', 'bar']`` order or ``pd.read_csv(data, usecols=['foo', 'bar'])[['bar', 'foo']]`` for ``['bar', 'foo']`` order. If callable, the callable function will be evaluated against the column names, returning names where the callable function evaluates to ``True``. An example of a valid callable argument would be ``lambda x: x.upper() in ['AAA', 'BBB', 'DDD']``. Using this parameter results in much faster parsing time and lower memory usage. dtype : dtype or dict of {Hashable : dtype}, optional Data type(s) to apply to either the whole dataset or individual columns. E.g., ``{'a': np.float64, 'b': np.int32, 'c': 'Int64'}`` Use ``str`` or ``object`` together with suitable ``na_values`` settings to preserve and not interpret ``dtype``. If ``converters`` are specified, they will be applied INSTEAD of ``dtype`` conversion. .. versionadded:: 1.5.0 Support for ``defaultdict`` was added. Specify a ``defaultdict`` as input where the default determines the ``dtype`` of the columns which are not explicitly listed. engine : {'c', 'python', 'pyarrow'}, optional Parser engine to use. 
The C and pyarrow engines are faster, while the python engine is currently more feature-complete. Multithreading is currently only supported by the pyarrow engine. .. versionadded:: 1.4.0 The 'pyarrow' engine was added as an *experimental* engine, and some features are unsupported, or may not work correctly, with this engine. converters : dict of {Hashable : Callable}, optional Functions for converting values in specified columns. Keys can either be column labels or column indices. true_values : list, optional Values to consider as ``True`` in addition to case-insensitive variants of 'True'. false_values : list, optional Values to consider as ``False`` in addition to case-insensitive variants of 'False'. skipinitialspace : bool, default False Skip spaces after delimiter. skiprows : int, list of int or Callable, optional Line numbers to skip (0-indexed) or number of lines to skip (``int``) at the start of the file. If callable, the callable function will be evaluated against the row indices, returning ``True`` if the row should be skipped and ``False`` otherwise. An example of a valid callable argument would be ``lambda x: x in [0, 2]``. skipfooter : int, default 0 Number of lines at bottom of file to skip (Unsupported with ``engine='c'``). nrows : int, optional Number of rows of file to read. Useful for reading pieces of large files. na_values : Hashable, Iterable of Hashable or dict of {Hashable : Iterable}, optional Additional strings to recognize as ``NA``/``NaN``. If ``dict`` passed, specific per-column ``NA`` values. By default the following values are interpreted as ``NaN``: " ", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN", "-NaN", "-nan", "1.#IND", "1.#QNAN", "<NA>", "N/A", "NA", "NULL", "NaN", "None", "n/a", "nan", "null ". keep_default_na : bool, default True Whether or not to include the default ``NaN`` values when parsing the data. Depending on whether ``na_values`` is passed in, the behavior is as follows: * If ``keep_default_na`` is ``True``, and ``na_values`` are specified, ``na_values`` is appended to the default ``NaN`` values used for parsing. * If ``keep_default_na`` is ``True``, and ``na_values`` are not specified, only the default ``NaN`` values are used for parsing. * If ``keep_default_na`` is ``False``, and ``na_values`` are specified, only the ``NaN`` values specified ``na_values`` are used for parsing. * If ``keep_default_na`` is ``False``, and ``na_values`` are not specified, no strings will be parsed as ``NaN``. Note that if ``na_filter`` is passed in as ``False``, the ``keep_default_na`` and ``na_values`` parameters will be ignored. na_filter : bool, default True Detect missing value markers (empty strings and the value of ``na_values``). In data without any ``NA`` values, passing ``na_filter=False`` can improve the performance of reading a large file. verbose : bool, default False Indicate number of ``NA`` values placed in non-numeric columns. skip_blank_lines : bool, default True If ``True``, skip over blank lines rather than interpreting as ``NaN`` values. parse_dates : bool, list of Hashable, list of lists or dict of {Hashable : list}, default False The behavior is as follows: * ``bool``. If ``True`` -> try parsing the index. Note: Automatically set to ``True`` if ``date_format`` or ``date_parser`` arguments have been passed. * ``list`` of ``int`` or names. e.g. If ``[1, 2, 3]`` -> try parsing columns 1, 2, 3 each as a separate date column. * ``list`` of ``list``. e.g. If ``[[1, 3]]`` -> combine columns 1 and 3 and parse as a single date column. 
Values are joined with a space before parsing. * ``dict``, e.g. ``{'foo' : [1, 3]}`` -> parse columns 1, 3 as date and call result 'foo'. Values are joined with a space before parsing. If a column or index cannot be represented as an array of ``datetime``, say because of an unparsable value or a mixture of timezones, the column or index will be returned unaltered as an ``object`` data type. For non-standard ``datetime`` parsing, use :func:`~pandas.to_datetime` after :func:`~pandas.read_csv`. Note: A fast-path exists for iso8601-formatted dates. infer_datetime_format : bool, default False If ``True`` and ``parse_dates`` is enabled, pandas will attempt to infer the format of the ``datetime`` strings in the columns, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by 5-10x. .. deprecated:: 2.0.0 A strict version of this argument is now the default, passing it has no effect. keep_date_col : bool, default False If ``True`` and ``parse_dates`` specifies combining multiple columns then keep the original columns. date_parser : Callable, optional Function to use for converting a sequence of string columns to an array of ``datetime`` instances. The default uses ``dateutil.parser.parser`` to do the conversion. pandas will try to call ``date_parser`` in three different ways, advancing to the next if an exception occurs: 1) Pass one or more arrays (as defined by ``parse_dates``) as arguments; 2) concatenate (row-wise) the string values from the columns defined by ``parse_dates`` into a single array and pass that; and 3) call ``date_parser`` once for each row using one or more strings (corresponding to the columns defined by ``parse_dates``) as arguments. .. deprecated:: 2.0.0 Use ``date_format`` instead, or read in as ``object`` and then apply :func:`~pandas.to_datetime` as-needed. date_format : str or dict of column -> format, optional Format to use for parsing dates when used in conjunction with ``parse_dates``. The strftime to parse time, e.g. :const:`"%d/%m/%Y"`. See `strftime documentation <https://docs.python.org/3/library/datetime.html #strftime-and-strptime-behavior>`_ for more information on choices, though note that :const:`"%f"` will parse all the way up to nanoseconds. You can also pass: - "ISO8601", to parse any `ISO8601 <https://en.wikipedia.org/wiki/ISO_8601>`_ time string (not necessarily in exactly the same format); - "mixed", to infer the format for each element individually. This is risky, and you should probably use it along with `dayfirst`. .. versionadded:: 2.0.0 dayfirst : bool, default False DD/MM format dates, international and European format. cache_dates : bool, default True If ``True``, use a cache of unique, converted dates to apply the ``datetime`` conversion. May produce significant speed-up when parsing duplicate date strings, especially ones with timezone offsets. iterator : bool, default False Return ``TextFileReader`` object for iteration or getting chunks with ``get_chunk()``. chunksize : int, optional Number of lines to read from the file per chunk. Passing a value will cause the function to return a ``TextFileReader`` object for iteration. See the `IO Tools docs <https://pandas.pydata.org/pandas-docs/stable/io.html#io-chunking>`_ for more information on ``iterator`` and ``chunksize``. compression : str or dict, default 'infer' For on-the-fly decompression of on-disk data. 
If 'infer' and 'filepath_or_buffer' is path-like, then detect compression from the following extensions: '.gz', '.bz2', '.zip', '.xz', '.zst', '.tar', '.tar.gz', '.tar.xz' or '.tar.bz2' (otherwise no compression). If using 'zip' or 'tar', the ZIP file must contain only one data file to be read in. Set to ``None`` for no decompression. Can also be a dict with key ``'method'`` set to one of {``'zip'``, ``'gzip'``, ``'bz2'``, ``'zstd'``, ``'xz'``, ``'tar'``} and other key-value pairs are forwarded to ``zipfile.ZipFile``, ``gzip.GzipFile``, ``bz2.BZ2File``, ``zstandard.ZstdDecompressor``, ``lzma.LZMAFile`` or ``tarfile.TarFile``, respectively. As an example, the following could be passed for Zstandard decompression using a custom compression dictionary: ``compression={'method': 'zstd', 'dict_data': my_compression_dict}``. .. versionadded:: 1.5.0 Added support for `.tar` files. .. versionchanged:: 1.4.0 Zstandard support. thousands : str (length 1), optional Character acting as the thousands separator in numerical values. decimal : str (length 1), default '.' Character to recognize as decimal point (e.g., use ',' for European data). lineterminator : str (length 1), optional Character used to denote a line break. Only valid with C parser. quotechar : str (length 1), optional Character used to denote the start and end of a quoted item. Quoted items can include the ``delimiter`` and it will be ignored. quoting : {0 or csv.QUOTE_MINIMAL, 1 or csv.QUOTE_ALL, 2 or csv.QUOTE_NONNUMERIC, 3 or csv.QUOTE_NONE}, default csv.QUOTE_MINIMAL Control field quoting behavior per ``csv.QUOTE_*`` constants. Default is ``csv.QUOTE_MINIMAL`` (i.e., 0) which implies that only fields containing special characters are quoted (e.g., characters defined in ``quotechar``, ``delimiter``, or ``lineterminator``. doublequote : bool, default True When ``quotechar`` is specified and ``quoting`` is not ``QUOTE_NONE``, indicate whether or not to interpret two consecutive ``quotechar`` elements INSIDE a field as a single ``quotechar`` element. escapechar : str (length 1), optional Character used to escape other characters. comment : str (length 1), optional Character indicating that the remainder of line should not be parsed. If found at the beginning of a line, the line will be ignored altogether. This parameter must be a single character. Like empty lines (as long as ``skip_blank_lines=True``), fully commented lines are ignored by the parameter ``header`` but not by ``skiprows``. For example, if ``comment='#'``, parsing ``#empty\na,b,c\n1,2,3`` with ``header=0`` will result in ``'a,b,c'`` being treated as the header. encoding : str, optional, default 'utf-8' Encoding to use for UTF when reading/writing (ex. ``'utf-8'``). `List of Python standard encodings <https://docs.python.org/3/library/codecs.html#standard-encodings>`_ . encoding_errors : str, optional, default 'strict' How encoding errors are treated. `List of possible values <https://docs.python.org/3/library/codecs.html#error-handlers>`_ . .. versionadded:: 1.3.0 dialect : str or csv.Dialect, optional If provided, this parameter will override values (default or not) for the following parameters: ``delimiter``, ``doublequote``, ``escapechar``, ``skipinitialspace``, ``quotechar``, and ``quoting``. If it is necessary to override values, a ``ParserWarning`` will be issued. See ``csv.Dialect`` documentation for more details. on_bad_lines : {'error', 'warn', 'skip'} or Callable, default 'error' Specifies what to do upon encountering a bad line (a line with too many fields). 
Allowed values are : - ``'error'``, raise an Exception when a bad line is encountered. - ``'warn'``, raise a warning when a bad line is encountered and skip that line. - ``'skip'``, skip bad lines without raising or warning when they are encountered. .. versionadded:: 1.3.0 .. versionadded:: 1.4.0 - Callable, function with signature ``(bad_line: list[str]) -> list[str] | None`` that will process a single bad line. ``bad_line`` is a list of strings split by the ``sep``. If the function returns ``None``, the bad line will be ignored. If the function returns a new ``list`` of strings with more elements than expected, a ``ParserWarning`` will be emitted while dropping extra elements. Only supported when ``engine='python'`` .. versionchanged:: 2.2.0 - Callable, function with signature as described in `pyarrow documentation <https://arrow.apache.org/docs/python/generated/pyarrow.csv.ParseOptions.html #pyarrow.csv.ParseOptions.invalid_row_handler>`_ when ``engine='pyarrow'`` delim_whitespace : bool, default False Specifies whether or not whitespace (e.g. ``' '`` or ``'\t'``) will be used as the ``sep`` delimiter. Equivalent to setting ``sep='\s+'``. If this option is set to ``True``, nothing should be passed in for the ``delimiter`` parameter. .. deprecated:: 2.2.0 Use ``sep="\s+"`` instead. low_memory : bool, default True Internally process the file in chunks, resulting in lower memory use while parsing, but possibly mixed type inference. To ensure no mixed types either set ``False``, or specify the type with the ``dtype`` parameter. Note that the entire file is read into a single :class:`~pandas.DataFrame` regardless, use the ``chunksize`` or ``iterator`` parameter to return the data in chunks. (Only valid with C parser). memory_map : bool, default False If a filepath is provided for ``filepath_or_buffer``, map the file object directly onto memory and access the data directly from there. Using this option can improve performance because there is no longer any I/O overhead. float_precision : {'high', 'legacy', 'round_trip'}, optional Specifies which converter the C engine should use for floating-point values. The options are ``None`` or ``'high'`` for the ordinary converter, ``'legacy'`` for the original lower precision pandas converter, and ``'round_trip'`` for the round-trip converter. storage_options : dict, optional Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to ``urllib.request.Request`` as header options. For other URLs (e.g. starting with "s3://", and "gcs://") the key-value pairs are forwarded to ``fsspec.open``. Please see ``fsspec`` and ``urllib`` for more details, and for more examples on storage options refer `here <https://pandas.pydata.org/docs/user_guide/io.html? highlight=storage_options#reading-writing-remote-files>`_. dtype_backend : {'numpy_nullable', 'pyarrow'}, default 'numpy_nullable' Back-end data type applied to the resultant :class:`DataFrame` (still experimental). Behaviour is as follows: * ``"numpy_nullable"``: returns nullable-dtype-backed :class:`DataFrame` (default). * ``"pyarrow"``: returns pyarrow-backed nullable :class:`ArrowDtype` DataFrame. .. versionadded:: 2.0 Returns ------- DataFrame or TextFileReader A comma-separated values (csv) file is returned as two-dimensional data structure with labeled axes. See Also -------- DataFrame.to_csv : Write DataFrame to a comma-separated values (csv) file. 
read_table : Read general delimited file into DataFrame. read_fwf : Read a table of fixed-width formatted lines into DataFrame. Examples -------- >>> pd.read_csv('data.csv') # doctest: +SKIP File: c:\users\phili\anaconda3\envs\vscode-env\lib\site-packages\pandas\io\parsers\readers.py Type: function
The pd.read_csv function has a large number of input parameters. Note that most of these are named input parameters and are therefore assigned to a default value which is consistent with the default behaviour of a csv file. When the file is in the expected format only the filepath_or_buffer needs to be specified and this is normally provided positionally:
df = pd.read_csv('./files/Book1.csv')
df
string | integer | bool | float | date | time | category | |
---|---|---|---|---|---|---|---|
0 | the fat black cat | 4 | True | 0.86 | 24/07/2023 | 11:36:00 | A |
1 | sat on the mat | 4 | True | 0.86 | 25/07/2023 | 12:36:00 | A |
2 | twinkle, twinkle | 2 | True | -1.14 | 26/07/2023 | 13:36:00 | B |
3 | little star | 2 | True | -1.14 | 27/07/2023 | 14:36:00 | B |
4 | how I wonder | 3 | False | -0.14 | 28/07/2023 | 15:36:00 | B |
5 | what you are | 4 | True | 0.86 | 29/07/2023 | 16:36:00 | B |
Notice the Series names are in the expected format and are taken from the first line of the file. A csv does not have an index by default and so a RangeIndex has automatically been generated:
df.axes
[RangeIndex(start=0, stop=6, step=1), Index(['string', 'integer', 'bool', 'float', 'date', 'time', 'category'], dtype='object')]
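If one of the columns in the file should instead be used as the index, the index_col parameter can be supplied (column 0, the 'string' column, is used here for illustration):

pd.read_csv('./files/Book1.csv', index_col=0)   # column 0 becomes the row index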
Tab Delimited Text File¶
A text file has the file extension txt and is very similar to a csv file but uses \t instead of , as the delimiter:
string\tinteger\tbool\tfloat\tdate\ttime\tcategory\n
the fat black cat\t4\tTRUE\t0.86\t24/07/2023\t11:36:00\tA\n
sat on the mat\t4\tTRUE\t0.86\t25/07/2023\t12:36:00\tA\n
"twinkle, twinkle"\t2\tTRUE\t-1.14\t26/07/2023\t13:36:00\tB\n
little star\t2\tTRUE\t-1.14\t27/07/2023\t14:36:00\tB\n
how I wonder\t3\tFALSE\t-0.14\t28/07/2023\t15:36:00\tB\n
what you are\t4\tTRUE\t0.86\t29/07/2023\t16:36:00\tB\n
%%writefile ./files/Book2.txt
string integer boolean floatingpoint date time category
the fat black cat 4 TRUE 0.86 24/07/2023 11:36:00 A
sat on the mat 4 TRUE 0.86 25/07/2023 12:36:00 A
"twinkle, twinkle" 2 TRUE -1.14 26/07/2023 13:36:00 B
little star 2 TRUE -1.14 27/07/2023 14:36:00 B
how I wonder 3 FALSE -0.14 28/07/2023 15:36:00 B
what you are 4 TRUE 0.86 29/07/2023 16:36:00 B
Overwriting ./files/Book2.txt
The same pd.read_csv function is used to read in a txt file. However this function by default looks for a , as the delimiter to move onto the next column and, as no comma is present, the data is all shown in a single column:
df = pd.read_csv(r'.\files\Book2.txt')
df
string\tinteger\tboolean\tfloatingpoint\tdate\ttime\tcategory | |
---|---|
0 | the fat black cat\t4\tTRUE\t0.86\t24/07/2023\t11:36:00\tA |
1 | sat on the mat\t4\tTRUE\t0.86\t25/07/2023\t12:36:00\tA |
2 | twinkle, twinkle\t2\tTRUE\t-1.14\t26/07/2023\t13:36:00\tB |
3 | little star\t2\tTRUE\t-1.14\t27/07/2023\t14:36:00\tB |
4 | how I wonder\t3\tFALSE\t-0.14\t28/07/2023\t15:36:00\tB |
5 | what you are\t4\tTRUE\t0.86\t29/07/2023\t16:36:00\tB |
If the delimiter is specified as '\t' the data will instead be read in properly:
df = pd.read_csv(r'.\files\Book2.txt', delimiter='\t')
df
string | integer | boolean | floatingpoint | date | time | category | |
---|---|---|---|---|---|---|---|
0 | the fat black cat | 4 | True | 0.86 | 24/07/2023 | 11:36:00 | A |
1 | sat on the mat | 4 | True | 0.86 | 25/07/2023 | 12:36:00 | A |
2 | twinkle, twinkle | 2 | True | -1.14 | 26/07/2023 | 13:36:00 | B |
3 | little star | 2 | True | -1.14 | 27/07/2023 | 14:36:00 | B |
4 | how I wonder | 3 | False | -0.14 | 28/07/2023 | 15:36:00 | B |
5 | what you are | 4 | True | 0.86 | 29/07/2023 | 16:36:00 | B |
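As an aside, the related function pd.read_table defaults to a tab delimiter, so the following minimal sketch should be equivalent to the call above:
pd.read_table('./files/Book2.txt')  # sep defaults to '\t'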
JavaScript Object Notation¶
If the data is in the form of a JSON str
instance:
json_string = '{"string":{"0":"the fat black cat","1":"sat on the mat","2":"twinkle, twinkle","3":"little star","4":"how I wonder","5":"what you are"},"integer":{"0":4,"1":4,"2":2,"3":2,"4":3,"5":4},"boolean":{"0":true,"1":true,"2":true,"3":true,"4":false,"5":true},"floatingpoint":{"0":0.86,"1":0.86,"2":-1.14,"3":-1.14,"4":-0.14,"5":0.86},"date":{"0":"24\\/07\\/2023","1":"25\\/07\\/2023","2":"26\\/07\\/2023","3":"27\\/07\\/2023","4":"28\\/07\\/2023","5":"29\\/07\\/2023"},"time":{"0":"11:36:00","1":"12:36:00","2":"13:36:00","3":"14:36:00","4":"15:36:00","5":"16:36:00"},"category":{"0":"A","1":"A","2":"B","3":"B","4":"B","5":"B"}}'
It can be wrapped in a StringIO instance:
import io
io.StringIO(json_string)
<_io.StringIO at 0x1aaad271780>
And then read in using the pd.read_json
function:
pd.read_json(io.StringIO(json_string))
string | integer | boolean | floatingpoint | date | time | category | |
---|---|---|---|---|---|---|---|
0 | the fat black cat | 4 | True | 0.86 | 2023-07-24 | 11:36:00 | A |
1 | sat on the mat | 4 | True | 0.86 | 2023-07-25 | 12:36:00 | A |
2 | twinkle, twinkle | 2 | True | -1.14 | 2023-07-26 | 13:36:00 | B |
3 | little star | 2 | True | -1.14 | 2023-07-27 | 14:36:00 | B |
4 | how I wonder | 3 | False | -0.14 | 2023-07-28 | 15:36:00 | B |
5 | what you are | 4 | True | 0.86 | 2023-07-29 | 16:36:00 | B |
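The column-oriented structure of the JSON str can be confirmed with the standard library; a minimal sketch showing that each pd.Series name maps to a dict of index keys to values:
import json
json.loads(json_string)['integer']  # {'0': 4, '1': 4, '2': 2, '3': 2, '4': 3, '5': 4}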
Microsoft Excel File¶
A Microsoft Excel File, which has the file extension .xlsx (or .xls for older files), is a collection of sheets. The data in each individual sheet is similar to an individual csv file, although obscured because Excel applies formatting options to the data:
The related function read_excel is used to read in the data from an Excel File. The delimiter is predefined in an Excel File; however an Excel File can have multiple sheets, so the keyword input argument sheet_name is available. This defaults to 0, the first sheet, which in a new workbook is normally named 'Sheet1':
pd.read_excel?
Signature: pd.read_excel( io, sheet_name: 'str | int | list[IntStrT] | None' = 0, *, header: 'int | Sequence[int] | None' = 0, names: 'SequenceNotStr[Hashable] | range | None' = None, index_col: 'int | str | Sequence[int] | None' = None, usecols: 'int | str | Sequence[int] | Sequence[str] | Callable[[str], bool] | None' = None, dtype: 'DtypeArg | None' = None, engine: "Literal['xlrd', 'openpyxl', 'odf', 'pyxlsb', 'calamine'] | None" = None, converters: 'dict[str, Callable] | dict[int, Callable] | None' = None, true_values: 'Iterable[Hashable] | None' = None, false_values: 'Iterable[Hashable] | None' = None, skiprows: 'Sequence[int] | int | Callable[[int], object] | None' = None, nrows: 'int | None' = None, na_values=None, keep_default_na: 'bool' = True, na_filter: 'bool' = True, verbose: 'bool' = False, parse_dates: 'list | dict | bool' = False, date_parser: 'Callable | lib.NoDefault' = <no_default>, date_format: 'dict[Hashable, str] | str | None' = None, thousands: 'str | None' = None, decimal: 'str' = '.', comment: 'str | None' = None, skipfooter: 'int' = 0, storage_options: 'StorageOptions | None' = None, dtype_backend: 'DtypeBackend | lib.NoDefault' = <no_default>, engine_kwargs: 'dict | None' = None, ) -> 'DataFrame | dict[IntStrT, DataFrame]' Docstring: Read an Excel file into a ``pandas`` ``DataFrame``. Supports `xls`, `xlsx`, `xlsm`, `xlsb`, `odf`, `ods` and `odt` file extensions read from a local filesystem or URL. Supports an option to read a single sheet or a list of sheets. Parameters ---------- io : str, bytes, ExcelFile, xlrd.Book, path object, or file-like object Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. A local file could be: ``file://localhost/path/to/table.xlsx``. If you want to pass in a path object, pandas accepts any ``os.PathLike``. By file-like object, we refer to objects with a ``read()`` method, such as a file handle (e.g. via builtin ``open`` function) or ``StringIO``. .. deprecated:: 2.1.0 Passing byte strings is deprecated. To read from a byte string, wrap it in a ``BytesIO`` object. sheet_name : str, int, list, or None, default 0 Strings are used for sheet names. Integers are used in zero-indexed sheet positions (chart sheets do not count as a sheet position). Lists of strings/integers are used to request multiple sheets. Specify ``None`` to get all worksheets. Available cases: * Defaults to ``0``: 1st sheet as a `DataFrame` * ``1``: 2nd sheet as a `DataFrame` * ``"Sheet1"``: Load sheet with name "Sheet1" * ``[0, 1, "Sheet5"]``: Load first, second and sheet named "Sheet5" as a dict of `DataFrame` * ``None``: All worksheets. header : int, list of int, default 0 Row (0-indexed) to use for the column labels of the parsed DataFrame. If a list of integers is passed those row positions will be combined into a ``MultiIndex``. Use None if there is no header. names : array-like, default None List of column names to use. If file contains no header row, then you should explicitly pass header=None. index_col : int, str, list of int, default None Column (0-indexed) to use as the row labels of the DataFrame. Pass None if there is no such column. If a list is passed, those columns will be combined into a ``MultiIndex``. If a subset of data is selected with ``usecols``, index_col is based on the subset. Missing values will be forward filled to allow roundtripping with ``to_excel`` for ``merged_cells=True``. 
To avoid forward filling the missing values use ``set_index`` after reading the data instead of ``index_col``. usecols : str, list-like, or callable, default None * If None, then parse all columns. * If str, then indicates comma separated list of Excel column letters and column ranges (e.g. "A:E" or "A,C,E:F"). Ranges are inclusive of both sides. * If list of int, then indicates list of column numbers to be parsed (0-indexed). * If list of string, then indicates list of column names to be parsed. * If callable, then evaluate each column name against it and parse the column if the callable returns ``True``. Returns a subset of the columns according to behavior above. dtype : Type name or dict of column -> type, default None Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32} Use ``object`` to preserve data as stored in Excel and not interpret dtype, which will necessarily result in ``object`` dtype. If converters are specified, they will be applied INSTEAD of dtype conversion. If you use ``None``, it will infer the dtype of each column based on the data. engine : {'openpyxl', 'calamine', 'odf', 'pyxlsb', 'xlrd'}, default None If io is not a buffer or path, this must be set to identify io. Engine compatibility : - ``openpyxl`` supports newer Excel file formats. - ``calamine`` supports Excel (.xls, .xlsx, .xlsm, .xlsb) and OpenDocument (.ods) file formats. - ``odf`` supports OpenDocument file formats (.odf, .ods, .odt). - ``pyxlsb`` supports Binary Excel files. - ``xlrd`` supports old-style Excel files (.xls). When ``engine=None``, the following logic will be used to determine the engine: - If ``path_or_buffer`` is an OpenDocument format (.odf, .ods, .odt), then `odf <https://pypi.org/project/odfpy/>`_ will be used. - Otherwise if ``path_or_buffer`` is an xls format, ``xlrd`` will be used. - Otherwise if ``path_or_buffer`` is in xlsb format, ``pyxlsb`` will be used. - Otherwise ``openpyxl`` will be used. converters : dict, default None Dict of functions for converting values in certain columns. Keys can either be integers or column labels, values are functions that take one input argument, the Excel cell content, and return the transformed content. true_values : list, default None Values to consider as True. false_values : list, default None Values to consider as False. skiprows : list-like, int, or callable, optional Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file. If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise. An example of a valid callable argument would be ``lambda x: x in [0, 2]``. nrows : int, default None Number of rows to parse. na_values : scalar, str, list-like, or dict, default None Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', '<NA>', 'N/A', 'NA', 'NULL', 'NaN', 'None', 'n/a', 'nan', 'null'. keep_default_na : bool, default True Whether or not to include the default NaN values when parsing the data. Depending on whether ``na_values`` is passed in, the behavior is as follows: * If ``keep_default_na`` is True, and ``na_values`` are specified, ``na_values`` is appended to the default NaN values used for parsing. * If ``keep_default_na`` is True, and ``na_values`` are not specified, only the default NaN values are used for parsing. 
* If ``keep_default_na`` is False, and ``na_values`` are specified, only the NaN values specified ``na_values`` are used for parsing. * If ``keep_default_na`` is False, and ``na_values`` are not specified, no strings will be parsed as NaN. Note that if `na_filter` is passed in as False, the ``keep_default_na`` and ``na_values`` parameters will be ignored. na_filter : bool, default True Detect missing value markers (empty strings and the value of na_values). In data without any NAs, passing ``na_filter=False`` can improve the performance of reading a large file. verbose : bool, default False Indicate number of NA values placed in non-numeric columns. parse_dates : bool, list-like, or dict, default False The behavior is as follows: * ``bool``. If True -> try parsing the index. * ``list`` of int or names. e.g. If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column. * ``list`` of lists. e.g. If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column. * ``dict``, e.g. {'foo' : [1, 3]} -> parse columns 1, 3 as date and call result 'foo' If a column or index contains an unparsable date, the entire column or index will be returned unaltered as an object data type. If you don`t want to parse some cells as date just change their type in Excel to "Text". For non-standard datetime parsing, use ``pd.to_datetime`` after ``pd.read_excel``. Note: A fast-path exists for iso8601-formatted dates. date_parser : function, optional Function to use for converting a sequence of string columns to an array of datetime instances. The default uses ``dateutil.parser.parser`` to do the conversion. Pandas will try to call `date_parser` in three different ways, advancing to the next if an exception occurs: 1) Pass one or more arrays (as defined by `parse_dates`) as arguments; 2) concatenate (row-wise) the string values from the columns defined by `parse_dates` into a single array and pass that; and 3) call `date_parser` once for each row using one or more strings (corresponding to the columns defined by `parse_dates`) as arguments. .. deprecated:: 2.0.0 Use ``date_format`` instead, or read in as ``object`` and then apply :func:`to_datetime` as-needed. date_format : str or dict of column -> format, default ``None`` If used in conjunction with ``parse_dates``, will parse dates according to this format. For anything more complex, please read in as ``object`` and then apply :func:`to_datetime` as-needed. .. versionadded:: 2.0.0 thousands : str, default None Thousands separator for parsing string columns to numeric. Note that this parameter is only necessary for columns stored as TEXT in Excel, any numeric columns will automatically be parsed, regardless of display format. decimal : str, default '.' Character to recognize as decimal point for parsing string columns to numeric. Note that this parameter is only necessary for columns stored as TEXT in Excel, any numeric columns will automatically be parsed, regardless of display format.(e.g. use ',' for European data). .. versionadded:: 1.4.0 comment : str, default None Comments out remainder of line. Pass a character or characters to this argument to indicate comments in the input file. Any data between the comment string and the end of the current line is ignored. skipfooter : int, default 0 Rows at the end to skip (0-indexed). storage_options : dict, optional Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. 
For HTTP(S) URLs the key-value pairs are forwarded to ``urllib.request.Request`` as header options. For other URLs (e.g. starting with "s3://", and "gcs://") the key-value pairs are forwarded to ``fsspec.open``. Please see ``fsspec`` and ``urllib`` for more details, and for more examples on storage options refer `here <https://pandas.pydata.org/docs/user_guide/io.html? highlight=storage_options#reading-writing-remote-files>`_. dtype_backend : {'numpy_nullable', 'pyarrow'}, default 'numpy_nullable' Back-end data type applied to the resultant :class:`DataFrame` (still experimental). Behaviour is as follows: * ``"numpy_nullable"``: returns nullable-dtype-backed :class:`DataFrame` (default). * ``"pyarrow"``: returns pyarrow-backed nullable :class:`ArrowDtype` DataFrame. .. versionadded:: 2.0 engine_kwargs : dict, optional Arbitrary keyword arguments passed to excel engine. Returns ------- DataFrame or dict of DataFrames DataFrame from the passed in Excel file. See notes in sheet_name argument for more information on when a dict of DataFrames is returned. See Also -------- DataFrame.to_excel : Write DataFrame to an Excel file. DataFrame.to_csv : Write DataFrame to a comma-separated values (csv) file. read_csv : Read a comma-separated values (csv) file into DataFrame. read_fwf : Read a table of fixed-width formatted lines into DataFrame. Notes ----- For specific information on the methods used for each Excel engine, refer to the pandas :ref:`user guide <io.excel_reader>` Examples -------- The file can be read using the file name as string or an open file object: >>> pd.read_excel('tmp.xlsx', index_col=0) # doctest: +SKIP Name Value 0 string1 1 1 string2 2 2 #Comment 3 >>> pd.read_excel(open('tmp.xlsx', 'rb'), ... sheet_name='Sheet3') # doctest: +SKIP Unnamed: 0 Name Value 0 0 string1 1 1 1 string2 2 2 2 #Comment 3 Index and header can be specified via the `index_col` and `header` arguments >>> pd.read_excel('tmp.xlsx', index_col=None, header=None) # doctest: +SKIP 0 1 2 0 NaN Name Value 1 0.0 string1 1 2 1.0 string2 2 3 2.0 #Comment 3 Column types are inferred but can be explicitly specified >>> pd.read_excel('tmp.xlsx', index_col=0, ... dtype={'Name': str, 'Value': float}) # doctest: +SKIP Name Value 0 string1 1.0 1 string2 2.0 2 #Comment 3.0 True, False, and NA values, and thousands separators have defaults, but can be explicitly specified, too. Supply the values you would like as strings or lists of strings! >>> pd.read_excel('tmp.xlsx', index_col=0, ... na_values=['string1', 'string2']) # doctest: +SKIP Name Value 0 NaN 1 1 NaN 2 2 #Comment 3 Comment lines in the excel input file can be skipped using the ``comment`` kwarg. >>> pd.read_excel('tmp.xlsx', index_col=0, comment='#') # doctest: +SKIP Name Value 0 string1 1.0 1 string2 2.0 2 None NaN File: c:\users\phili\anaconda3\envs\vscode-env\lib\site-packages\pandas\io\excel\_base.py Type: function
df = pd.read_excel('./files/Book3.xlsx', sheet_name='Sheet1')
df
string | integer | boolean | floatingpoint | date | time | category | |
---|---|---|---|---|---|---|---|
0 | the fat black cat | 4 | True | 0.86 | 2023-07-24 11:30:00 | 11:36:00 | A |
1 | sat on the mat | 4 | True | 0.86 | 2023-07-25 00:00:00 | 12:36:00 | A |
2 | twinkle, twinkle | 2 | True | -1.14 | 2023-07-26 00:00:00 | 13:36:00 | B |
3 | little star | 2 | True | -1.14 | 2023-07-27 00:00:00 | 14:36:00 | B |
4 | how I wonder | 3 | False | -0.14 | 2023-07-28 00:00:00 | 15:36:00 | B |
5 | what you are | 4 | True | 0.86 | 2023-07-29 00:00:00 | 16:36:00 | B |
Sheets in an Excel File are ordered and the ordering of the sheets is analogous to a RangeIndex ('Sheet1' corresponds to an index of 0 because of zero-based indexing). This parameter is often supplied positionally:
df = pd.read_excel('./files/Book3.xlsx', 0)
df
string | integer | boolean | floatingpoint | date | time | category | |
---|---|---|---|---|---|---|---|
0 | the fat black cat | 4 | True | 0.86 | 2023-07-24 11:30:00 | 11:36:00 | A |
1 | sat on the mat | 4 | True | 0.86 | 2023-07-25 00:00:00 | 12:36:00 | A |
2 | twinkle, twinkle | 2 | True | -1.14 | 2023-07-26 00:00:00 | 13:36:00 | B |
3 | little star | 2 | True | -1.14 | 2023-07-27 00:00:00 | 14:36:00 | B |
4 | how I wonder | 3 | False | -0.14 | 2023-07-28 00:00:00 | 15:36:00 | B |
5 | what you are | 4 | True | 0.86 | 2023-07-29 00:00:00 | 16:36:00 | B |
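As the docstring above notes, supplying sheet_name=None reads every sheet and returns a dict of DataFrames keyed by sheet name; a minimal sketch (assuming Book3.xlsx contains only 'Sheet1'):
sheets = pd.read_excel('./files/Book3.xlsx', sheet_name=None)
list(sheets.keys())  # e.g. ['Sheet1']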
In a .xlsx file, a column is normally formatted for the appropriate datatype, for example a datetime format, and the .xlsx file contains this formatting information. The information can be used by the read_excel function:
df = pd.read_excel('./files/Book3.xlsx', 0, parse_dates=True)
df
string | integer | boolean | floatingpoint | date | time | category | |
---|---|---|---|---|---|---|---|
0 | the fat black cat | 4 | True | 0.86 | 2023-07-24 11:30:00 | 11:36:00 | A |
1 | sat on the mat | 4 | True | 0.86 | 2023-07-25 00:00:00 | 12:36:00 | A |
2 | twinkle, twinkle | 2 | True | -1.14 | 2023-07-26 00:00:00 | 13:36:00 | B |
3 | little star | 2 | True | -1.14 | 2023-07-27 00:00:00 | 14:36:00 | B |
4 | how I wonder | 3 | False | -0.14 | 2023-07-28 00:00:00 | 15:36:00 | B |
5 | what you are | 4 | True | 0.86 | 2023-07-29 00:00:00 | 16:36:00 | B |
The date column was parsed as a date correctly but the time column was not parsed as a time column:
df.dtypes
string object integer int64 boolean bool floatingpoint float64 date datetime64[ns] time object category object dtype: object
Each value in the df['time'] pd.Series instance can be cast to a str instance:
df['time'].astype('str')
0 11:36:00 1 12:36:00 2 13:36:00 3 14:36:00 4 15:36:00 5 16:36:00 Name: time, dtype: object
The pd.Series method astype has limited support for converting to a timedelta64 and instead the pd function to_timedelta needs to be used:
pd.to_timedelta(df['time'].astype('str'))
0 0 days 11:36:00 1 0 days 12:36:00 2 0 days 13:36:00 3 0 days 14:36:00 4 0 days 15:36:00 5 0 days 16:36:00 Name: time, dtype: timedelta64[ns]
This can be reassigned to the pd.Series instance name df['time'], which updates the df instance in place:
df['time'] = pd.to_timedelta(df['time'].astype('str'))
df
string | integer | boolean | floatingpoint | date | time | category | |
---|---|---|---|---|---|---|---|
0 | the fat black cat | 4 | True | 0.86 | 2023-07-24 11:30:00 | 0 days 11:36:00 | A |
1 | sat on the mat | 4 | True | 0.86 | 2023-07-25 00:00:00 | 0 days 12:36:00 | A |
2 | twinkle, twinkle | 2 | True | -1.14 | 2023-07-26 00:00:00 | 0 days 13:36:00 | B |
3 | little star | 2 | True | -1.14 | 2023-07-27 00:00:00 | 0 days 14:36:00 | B |
4 | how I wonder | 3 | False | -0.14 | 2023-07-28 00:00:00 | 0 days 15:36:00 | B |
5 | what you are | 4 | True | 0.86 | 2023-07-29 00:00:00 | 0 days 16:36:00 | B |
df.dtypes
string object integer int64 boolean bool floatingpoint float64 date datetime64[ns] time timedelta64[ns] category object dtype: object
The df['category']
pd.Series
instance can be cast to a 'category'
datatype:
df['category'].astype('category')
0 A 1 A 2 B 3 B 4 B 5 B Name: category, dtype: category Categories (2, object): ['A', 'B']
df['category'] = df['category'].astype('category')
df
string | integer | boolean | floatingpoint | date | time | category | |
---|---|---|---|---|---|---|---|
0 | the fat black cat | 4 | True | 0.86 | 2023-07-24 11:30:00 | 0 days 11:36:00 | A |
1 | sat on the mat | 4 | True | 0.86 | 2023-07-25 00:00:00 | 0 days 12:36:00 | A |
2 | twinkle, twinkle | 2 | True | -1.14 | 2023-07-26 00:00:00 | 0 days 13:36:00 | B |
3 | little star | 2 | True | -1.14 | 2023-07-27 00:00:00 | 0 days 14:36:00 | B |
4 | how I wonder | 3 | False | -0.14 | 2023-07-28 00:00:00 | 0 days 15:36:00 | B |
5 | what you are | 4 | True | 0.86 | 2023-07-29 00:00:00 | 0 days 16:36:00 | B |
df.dtypes
string object integer int64 boolean bool floatingpoint float64 date datetime64[ns] time timedelta64[ns] category category dtype: object
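Once the datatype is 'category', the cat accessor becomes available; a minimal sketch listing the categories:
df['category'].cat.categories  # Index(['A', 'B'], dtype='object')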
To reorder the columns, indexing is quite commonly used:
df[['string', 'integer', 'boolean', 'floatingpoint', 'date', 'time', 'category']]
string | integer | boolean | floatingpoint | date | time | category | |
---|---|---|---|---|---|---|---|
0 | the fat black cat | 4 | True | 0.86 | 2023-07-24 11:30:00 | 0 days 11:36:00 | A |
1 | sat on the mat | 4 | True | 0.86 | 2023-07-25 00:00:00 | 0 days 12:36:00 | A |
2 | twinkle, twinkle | 2 | True | -1.14 | 2023-07-26 00:00:00 | 0 days 13:36:00 | B |
3 | little star | 2 | True | -1.14 | 2023-07-27 00:00:00 | 0 days 14:36:00 | B |
4 | how I wonder | 3 | False | -0.14 | 2023-07-28 00:00:00 | 0 days 15:36:00 | B |
5 | what you are | 4 | True | 0.86 | 2023-07-29 00:00:00 | 0 days 16:36:00 | B |
The instance name df
can be reassigned to this output:
df = df[['string', 'integer', 'boolean', 'floatingpoint', 'date', 'time', 'category']]
df
string | integer | boolean | floatingpoint | date | time | category | |
---|---|---|---|---|---|---|---|
0 | the fat black cat | 4 | True | 0.86 | 2023-07-24 11:30:00 | 0 days 11:36:00 | A |
1 | sat on the mat | 4 | True | 0.86 | 2023-07-25 00:00:00 | 0 days 12:36:00 | A |
2 | twinkle, twinkle | 2 | True | -1.14 | 2023-07-26 00:00:00 | 0 days 13:36:00 | B |
3 | little star | 2 | True | -1.14 | 2023-07-27 00:00:00 | 0 days 14:36:00 | B |
4 | how I wonder | 3 | False | -0.14 | 2023-07-28 00:00:00 | 0 days 15:36:00 | B |
5 | what you are | 4 | True | 0.86 | 2023-07-29 00:00:00 | 0 days 16:36:00 | B |
Writing DataFrames to Objects and Files¶
The DataFrame
class has a number of methods for writing to files:
for identifier in dir(df):
    if identifier.startswith('to_'):
        print(identifier)
to_clipboard to_csv to_dict to_excel to_feather to_gbq to_hdf to_html to_json to_latex to_markdown to_numpy to_orc to_parquet to_period to_pickle to_records to_sql to_stata to_string to_timestamp to_xarray to_xml
Python Dictionary¶
A pd.DataFrame
instance can be written to a dict
instance using:
df.to_dict()
{'string': {0: 'the fat black cat', 1: 'sat on the mat', 2: 'twinkle, twinkle', 3: 'little star', 4: 'how I wonder', 5: 'what you are'}, 'integer': {0: 4, 1: 4, 2: 2, 3: 2, 4: 3, 5: 4}, 'boolean': {0: True, 1: True, 2: True, 3: True, 4: False, 5: True}, 'floatingpoint': {0: 0.8599999999999999, 1: 0.8599999999999999, 2: -1.1400000000000001, 3: -1.1400000000000001, 4: -0.14000000000000012, 5: 0.8599999999999999}, 'date': {0: Timestamp('2023-07-24 11:30:00'), 1: Timestamp('2023-07-25 00:00:00'), 2: Timestamp('2023-07-26 00:00:00'), 3: Timestamp('2023-07-27 00:00:00'), 4: Timestamp('2023-07-28 00:00:00'), 5: Timestamp('2023-07-29 00:00:00')}, 'time': {0: Timedelta('0 days 11:36:00'), 1: Timedelta('0 days 12:36:00'), 2: Timedelta('0 days 13:36:00'), 3: Timedelta('0 days 14:36:00'), 4: Timedelta('0 days 15:36:00'), 5: Timedelta('0 days 16:36:00')}, 'category': {0: 'A', 1: 'A', 2: 'B', 3: 'B', 4: 'B', 5: 'B'}}
This is read into a pd.DataFrame
instance using:
pd.DataFrame(data=df.to_dict())
string | integer | boolean | floatingpoint | date | time | category | |
---|---|---|---|---|---|---|---|
0 | the fat black cat | 4 | True | 0.86 | 2023-07-24 11:30:00 | 0 days 11:36:00 | A |
1 | sat on the mat | 4 | True | 0.86 | 2023-07-25 00:00:00 | 0 days 12:36:00 | A |
2 | twinkle, twinkle | 2 | True | -1.14 | 2023-07-26 00:00:00 | 0 days 13:36:00 | B |
3 | little star | 2 | True | -1.14 | 2023-07-27 00:00:00 | 0 days 14:36:00 | B |
4 | how I wonder | 3 | False | -0.14 | 2023-07-28 00:00:00 | 0 days 15:36:00 | B |
5 | what you are | 4 | True | 0.86 | 2023-07-29 00:00:00 | 0 days 16:36:00 | B |
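to_dict also has an orient keyword input argument; as a minimal sketch, orient='records' returns a list with one dict per row instead of one dict per pd.Series:
df.to_dict(orient='records')[0]  # the first row as a dict of column name to value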
Comma Separated Values File¶
A pd.DataFrame
can be written to a file using:
df.to_csv('./files/Book4.csv')
Notice that the index was also written to the file, meaning that if this file is read back into a pd.DataFrame instance using the defaults, an unnamed pd.Series corresponding to that index is read in:
pd.read_csv('./files/Book4.csv')
Unnamed: 0 | string | integer | boolean | floatingpoint | date | time | category | |
---|---|---|---|---|---|---|---|---|
0 | 0 | the fat black cat | 4 | True | 0.86 | 2023-07-24 11:30:00 | 0 days 11:36:00 | A |
1 | 1 | sat on the mat | 4 | True | 0.86 | 2023-07-25 00:00:00 | 0 days 12:36:00 | A |
2 | 2 | twinkle, twinkle | 2 | True | -1.14 | 2023-07-26 00:00:00 | 0 days 13:36:00 | B |
3 | 3 | little star | 2 | True | -1.14 | 2023-07-27 00:00:00 | 0 days 14:36:00 | B |
4 | 4 | how I wonder | 3 | False | -0.14 | 2023-07-28 00:00:00 | 0 days 15:36:00 | B |
5 | 5 | what you are | 4 | True | 0.86 | 2023-07-29 00:00:00 | 0 days 16:36:00 | B |
This column can instead be assigned to the pd.Index instance associated with the pd.DataFrame using the keyword input argument index_col:
pd.read_csv('./files/Book4.csv', index_col=0)
string | integer | boolean | floatingpoint | date | time | category | |
---|---|---|---|---|---|---|---|
0 | the fat black cat | 4 | True | 0.86 | 2023-07-24 11:30:00 | 0 days 11:36:00 | A |
1 | sat on the mat | 4 | True | 0.86 | 2023-07-25 00:00:00 | 0 days 12:36:00 | A |
2 | twinkle, twinkle | 2 | True | -1.14 | 2023-07-26 00:00:00 | 0 days 13:36:00 | B |
3 | little star | 2 | True | -1.14 | 2023-07-27 00:00:00 | 0 days 14:36:00 | B |
4 | how I wonder | 3 | False | -0.14 | 2023-07-28 00:00:00 | 0 days 15:36:00 | B |
5 | what you are | 4 | True | 0.86 | 2023-07-29 00:00:00 | 0 days 16:36:00 | B |
Alternatively the pd.DataFrame
instance can be exported without the pd.Index
information:
df.to_csv?
Signature: df.to_csv( path_or_buf: 'FilePath | WriteBuffer[bytes] | WriteBuffer[str] | None' = None, *, sep: 'str' = ',', na_rep: 'str' = '', float_format: 'str | Callable | None' = None, columns: 'Sequence[Hashable] | None' = None, header: 'bool_t | list[str]' = True, index: 'bool_t' = True, index_label: 'IndexLabel | None' = None, mode: 'str' = 'w', encoding: 'str | None' = None, compression: 'CompressionOptions' = 'infer', quoting: 'int | None' = None, quotechar: 'str' = '"', lineterminator: 'str | None' = None, chunksize: 'int | None' = None, date_format: 'str | None' = None, doublequote: 'bool_t' = True, escapechar: 'str | None' = None, decimal: 'str' = '.', errors: 'OpenFileErrors' = 'strict', storage_options: 'StorageOptions | None' = None, ) -> 'str | None' Docstring: Write object to a comma-separated values (csv) file. Parameters ---------- path_or_buf : str, path object, file-like object, or None, default None String, path object (implementing os.PathLike[str]), or file-like object implementing a write() function. If None, the result is returned as a string. If a non-binary file object is passed, it should be opened with `newline=''`, disabling universal newlines. If a binary file object is passed, `mode` might need to contain a `'b'`. sep : str, default ',' String of length 1. Field delimiter for the output file. na_rep : str, default '' Missing data representation. float_format : str, Callable, default None Format string for floating point numbers. If a Callable is given, it takes precedence over other numeric formatting parameters, like decimal. columns : sequence, optional Columns to write. header : bool or list of str, default True Write out the column names. If a list of strings is given it is assumed to be aliases for the column names. index : bool, default True Write row names (index). index_label : str or sequence, or False, default None Column label for index column(s) if desired. If None is given, and `header` and `index` are True, then the index names are used. A sequence should be given if the object uses MultiIndex. If False do not print fields for index names. Use index_label=False for easier importing in R. mode : {'w', 'x', 'a'}, default 'w' Forwarded to either `open(mode=)` or `fsspec.open(mode=)` to control the file opening. Typical values include: - 'w', truncate the file first. - 'x', exclusive creation, failing if the file already exists. - 'a', append to the end of file if it exists. encoding : str, optional A string representing the encoding to use in the output file, defaults to 'utf-8'. `encoding` is not supported if `path_or_buf` is a non-binary file object. compression : str or dict, default 'infer' For on-the-fly compression of the output data. If 'infer' and 'path_or_buf' is path-like, then detect compression from the following extensions: '.gz', '.bz2', '.zip', '.xz', '.zst', '.tar', '.tar.gz', '.tar.xz' or '.tar.bz2' (otherwise no compression). Set to ``None`` for no compression. Can also be a dict with key ``'method'`` set to one of {``'zip'``, ``'gzip'``, ``'bz2'``, ``'zstd'``, ``'xz'``, ``'tar'``} and other key-value pairs are forwarded to ``zipfile.ZipFile``, ``gzip.GzipFile``, ``bz2.BZ2File``, ``zstandard.ZstdCompressor``, ``lzma.LZMAFile`` or ``tarfile.TarFile``, respectively. As an example, the following could be passed for faster compression and to create a reproducible gzip archive: ``compression={'method': 'gzip', 'compresslevel': 1, 'mtime': 1}``. .. versionadded:: 1.5.0 Added support for `.tar` files. 
May be a dict with key 'method' as compression mode and other entries as additional compression options if compression mode is 'zip'. Passing compression options as keys in dict is supported for compression modes 'gzip', 'bz2', 'zstd', and 'zip'. quoting : optional constant from csv module Defaults to csv.QUOTE_MINIMAL. If you have set a `float_format` then floats are converted to strings and thus csv.QUOTE_NONNUMERIC will treat them as non-numeric. quotechar : str, default '\"' String of length 1. Character used to quote fields. lineterminator : str, optional The newline character or character sequence to use in the output file. Defaults to `os.linesep`, which depends on the OS in which this method is called ('\\n' for linux, '\\r\\n' for Windows, i.e.). .. versionchanged:: 1.5.0 Previously was line_terminator, changed for consistency with read_csv and the standard library 'csv' module. chunksize : int or None Rows to write at a time. date_format : str, default None Format string for datetime objects. doublequote : bool, default True Control quoting of `quotechar` inside a field. escapechar : str, default None String of length 1. Character used to escape `sep` and `quotechar` when appropriate. decimal : str, default '.' Character recognized as decimal separator. E.g. use ',' for European data. errors : str, default 'strict' Specifies how encoding and decoding errors are to be handled. See the errors argument for :func:`open` for a full list of options. storage_options : dict, optional Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to ``urllib.request.Request`` as header options. For other URLs (e.g. starting with "s3://", and "gcs://") the key-value pairs are forwarded to ``fsspec.open``. Please see ``fsspec`` and ``urllib`` for more details, and for more examples on storage options refer `here <https://pandas.pydata.org/docs/user_guide/io.html? highlight=storage_options#reading-writing-remote-files>`_. Returns ------- None or str If path_or_buf is None, returns the resulting csv format as a string. Otherwise returns None. See Also -------- read_csv : Load a CSV file into a DataFrame. to_excel : Write DataFrame to an Excel file. Examples -------- Create 'out.csv' containing 'df' without indices >>> df = pd.DataFrame({'name': ['Raphael', 'Donatello'], ... 'mask': ['red', 'purple'], ... 'weapon': ['sai', 'bo staff']}) >>> df.to_csv('out.csv', index=False) # doctest: +SKIP Create 'out.zip' containing 'out.csv' >>> df.to_csv(index=False) 'name,mask,weapon\nRaphael,red,sai\nDonatello,purple,bo staff\n' >>> compression_opts = dict(method='zip', ... archive_name='out.csv') # doctest: +SKIP >>> df.to_csv('out.zip', index=False, ... compression=compression_opts) # doctest: +SKIP To write a csv file to a new folder or nested folder you will first need to create it using either Pathlib or os: >>> from pathlib import Path # doctest: +SKIP >>> filepath = Path('folder/subfolder/out.csv') # doctest: +SKIP >>> filepath.parent.mkdir(parents=True, exist_ok=True) # doctest: +SKIP >>> df.to_csv(filepath) # doctest: +SKIP >>> import os # doctest: +SKIP >>> os.makedirs('folder/subfolder', exist_ok=True) # doctest: +SKIP >>> df.to_csv('folder/subfolder/out.csv') # doctest: +SKIP File: c:\users\phili\anaconda3\envs\vscode-env\lib\site-packages\pandas\core\generic.py Type: method
df.to_csv('./files/Book5.csv', index=False)
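Reading this file back with the defaults confirms that no unnamed column appears and a fresh RangeIndex is generated; a minimal check:
pd.read_csv('./files/Book5.csv').axes[0]  # RangeIndex(start=0, stop=6, step=1)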
Tab Delimited Text File¶
To save to a text file, the separator needs to be specified:
df.to_csv('./files/Book6.txt', sep='\t', index=False)
This is read into a pd.DataFrame
instance using:
pd.read_csv('./files/Book6.txt', sep='\t')
string | integer | boolean | floatingpoint | date | time | category | |
---|---|---|---|---|---|---|---|
0 | the fat black cat | 4 | True | 0.86 | 2023-07-24 11:30:00 | 0 days 11:36:00 | A |
1 | sat on the mat | 4 | True | 0.86 | 2023-07-25 00:00:00 | 0 days 12:36:00 | A |
2 | twinkle, twinkle | 2 | True | -1.14 | 2023-07-26 00:00:00 | 0 days 13:36:00 | B |
3 | little star | 2 | True | -1.14 | 2023-07-27 00:00:00 | 0 days 14:36:00 | B |
4 | how I wonder | 3 | False | -0.14 | 2023-07-28 00:00:00 | 0 days 15:36:00 | B |
5 | what you are | 4 | True | 0.86 | 2023-07-29 00:00:00 | 0 days 16:36:00 | B |
JavaScript Object Notation¶
A pd.DataFrame
can be written to a JSON str
using:
df.to_json()
'{"string":{"0":"the fat black cat","1":"sat on the mat","2":"twinkle, twinkle","3":"little star","4":"how I wonder","5":"what you are"},"integer":{"0":4,"1":4,"2":2,"3":2,"4":3,"5":4},"boolean":{"0":true,"1":true,"2":true,"3":true,"4":false,"5":true},"floatingpoint":{"0":0.86,"1":0.86,"2":-1.14,"3":-1.14,"4":-0.14,"5":0.86},"date":{"0":1690198200000,"1":1690243200000,"2":1690329600000,"3":1690416000000,"4":1690502400000,"5":1690588800000},"time":{"0":41760000,"1":45360000,"2":48960000,"3":52560000,"4":56160000,"5":59760000},"category":{"0":"A","1":"A","2":"B","3":"B","4":"B","5":"B"}}'
Note that this is the same form that can be read back using the pd.read_json function:
pd.read_json(io.StringIO(df.to_json()))
string | integer | boolean | floatingpoint | date | time | category | |
---|---|---|---|---|---|---|---|
0 | the fat black cat | 4 | True | 0.86 | 2023-07-24 11:30:00 | 41760000 | A |
1 | sat on the mat | 4 | True | 0.86 | 2023-07-25 00:00:00 | 45360000 | A |
2 | twinkle, twinkle | 2 | True | -1.14 | 2023-07-26 00:00:00 | 48960000 | B |
3 | little star | 2 | True | -1.14 | 2023-07-27 00:00:00 | 52560000 | B |
4 | how I wonder | 3 | False | -0.14 | 2023-07-28 00:00:00 | 56160000 | B |
5 | what you are | 4 | True | 0.86 | 2023-07-29 00:00:00 | 59760000 | B |
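Note that the time column round-trips as integer milliseconds rather than a timedelta64. As a minimal sketch, it can be converted back using pd.to_timedelta with unit='ms':
pd.to_timedelta(pd.read_json(io.StringIO(df.to_json()))['time'], unit='ms')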
Microsoft Excel File¶
Recall that an Excel file has sheets. Suppose there are three pd.DataFrame instances to be written to three separate sheets. For convenience these can be created by indexing into the pd.DataFrame instance df with a list of str instances, where each str instance corresponds to the name of a pd.Series:
df
string | integer | boolean | floatingpoint | date | time | category | |
---|---|---|---|---|---|---|---|
0 | the fat black cat | 4 | True | 0.86 | 2023-07-24 11:30:00 | 0 days 11:36:00 | A |
1 | sat on the mat | 4 | True | 0.86 | 2023-07-25 00:00:00 | 0 days 12:36:00 | A |
2 | twinkle, twinkle | 2 | True | -1.14 | 2023-07-26 00:00:00 | 0 days 13:36:00 | B |
3 | little star | 2 | True | -1.14 | 2023-07-27 00:00:00 | 0 days 14:36:00 | B |
4 | how I wonder | 3 | False | -0.14 | 2023-07-28 00:00:00 | 0 days 15:36:00 | B |
5 | what you are | 4 | True | 0.86 | 2023-07-29 00:00:00 | 0 days 16:36:00 | B |
df1 = df[['string', 'integer', 'time']]
df1
string | integer | time | |
---|---|---|---|
0 | the fat black cat | 4 | 0 days 11:36:00 |
1 | sat on the mat | 4 | 0 days 12:36:00 |
2 | twinkle, twinkle | 2 | 0 days 13:36:00 |
3 | little star | 2 | 0 days 14:36:00 |
4 | how I wonder | 3 | 0 days 15:36:00 |
5 | what you are | 4 | 0 days 16:36:00 |
df2 = df[['string', 'integer', 'boolean']]
df2
string | integer | boolean | |
---|---|---|---|
0 | the fat black cat | 4 | True |
1 | sat on the mat | 4 | True |
2 | twinkle, twinkle | 2 | True |
3 | little star | 2 | True |
4 | how I wonder | 3 | False |
5 | what you are | 4 | True |
df3 = df[['string', 'date', 'category']]
df3
string | date | category | |
---|---|---|---|
0 | the fat black cat | 2023-07-24 11:30:00 | A |
1 | sat on the mat | 2023-07-25 00:00:00 | A |
2 | twinkle, twinkle | 2023-07-26 00:00:00 | B |
3 | little star | 2023-07-27 00:00:00 | B |
4 | how I wonder | 2023-07-28 00:00:00 | B |
5 | what you are | 2023-07-29 00:00:00 | B |
The pd.DataFrame
method to_excel
allows the writing of multiple pd.DataFrame
instances to individual sheets within an Excel File:
df.to_excel?
Signature: df.to_excel( excel_writer: 'FilePath | WriteExcelBuffer | ExcelWriter', *, sheet_name: 'str' = 'Sheet1', na_rep: 'str' = '', float_format: 'str | None' = None, columns: 'Sequence[Hashable] | None' = None, header: 'Sequence[Hashable] | bool_t' = True, index: 'bool_t' = True, index_label: 'IndexLabel | None' = None, startrow: 'int' = 0, startcol: 'int' = 0, engine: "Literal['openpyxl', 'xlsxwriter'] | None" = None, merge_cells: 'bool_t' = True, inf_rep: 'str' = 'inf', freeze_panes: 'tuple[int, int] | None' = None, storage_options: 'StorageOptions | None' = None, engine_kwargs: 'dict[str, Any] | None' = None, ) -> 'None' Docstring: Write object to an Excel sheet. To write a single object to an Excel .xlsx file it is only necessary to specify a target file name. To write to multiple sheets it is necessary to create an `ExcelWriter` object with a target file name, and specify a sheet in the file to write to. Multiple sheets may be written to by specifying unique `sheet_name`. With all data written to the file it is necessary to save the changes. Note that creating an `ExcelWriter` object with a file name that already exists will result in the contents of the existing file being erased. Parameters ---------- excel_writer : path-like, file-like, or ExcelWriter object File path or existing ExcelWriter. sheet_name : str, default 'Sheet1' Name of sheet which will contain DataFrame. na_rep : str, default '' Missing data representation. float_format : str, optional Format string for floating point numbers. For example ``float_format="%.2f"`` will format 0.1234 to 0.12. columns : sequence or list of str, optional Columns to write. header : bool or list of str, default True Write out the column names. If a list of string is given it is assumed to be aliases for the column names. index : bool, default True Write row names (index). index_label : str or sequence, optional Column label for index column(s) if desired. If not specified, and `header` and `index` are True, then the index names are used. A sequence should be given if the DataFrame uses MultiIndex. startrow : int, default 0 Upper left cell row to dump data frame. startcol : int, default 0 Upper left cell column to dump data frame. engine : str, optional Write engine to use, 'openpyxl' or 'xlsxwriter'. You can also set this via the options ``io.excel.xlsx.writer`` or ``io.excel.xlsm.writer``. merge_cells : bool, default True Write MultiIndex and Hierarchical Rows as merged cells. inf_rep : str, default 'inf' Representation for infinity (there is no native representation for infinity in Excel). freeze_panes : tuple of int (length 2), optional Specifies the one-based bottommost row and rightmost column that is to be frozen. storage_options : dict, optional Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to ``urllib.request.Request`` as header options. For other URLs (e.g. starting with "s3://", and "gcs://") the key-value pairs are forwarded to ``fsspec.open``. Please see ``fsspec`` and ``urllib`` for more details, and for more examples on storage options refer `here <https://pandas.pydata.org/docs/user_guide/io.html? highlight=storage_options#reading-writing-remote-files>`_. .. versionadded:: 1.2.0 engine_kwargs : dict, optional Arbitrary keyword arguments passed to excel engine. See Also -------- to_csv : Write DataFrame to a comma-separated values (csv) file. 
ExcelWriter : Class for writing DataFrame objects into excel sheets. read_excel : Read an Excel file into a pandas DataFrame. read_csv : Read a comma-separated values (csv) file into DataFrame. io.formats.style.Styler.to_excel : Add styles to Excel sheet. Notes ----- For compatibility with :meth:`~DataFrame.to_csv`, to_excel serializes lists and dicts to strings before writing. Once a workbook has been saved it is not possible to write further data without rewriting the whole workbook. Examples -------- Create, write to and save a workbook: >>> df1 = pd.DataFrame([['a', 'b'], ['c', 'd']], ... index=['row 1', 'row 2'], ... columns=['col 1', 'col 2']) >>> df1.to_excel("output.xlsx") # doctest: +SKIP To specify the sheet name: >>> df1.to_excel("output.xlsx", ... sheet_name='Sheet_name_1') # doctest: +SKIP If you wish to write to more than one sheet in the workbook, it is necessary to specify an ExcelWriter object: >>> df2 = df1.copy() >>> with pd.ExcelWriter('output.xlsx') as writer: # doctest: +SKIP ... df1.to_excel(writer, sheet_name='Sheet_name_1') ... df2.to_excel(writer, sheet_name='Sheet_name_2') ExcelWriter can also be used to append to an existing Excel file: >>> with pd.ExcelWriter('output.xlsx', ... mode='a') as writer: # doctest: +SKIP ... df1.to_excel(writer, sheet_name='Sheet_name_3') To set the library that is used to write the Excel file, you can pass the `engine` keyword (the default engine is automatically chosen depending on the file extension): >>> df1.to_excel('output1.xlsx', engine='xlsxwriter') # doctest: +SKIP File: c:\users\phili\anaconda3\envs\vscode-env\lib\site-packages\pandas\core\generic.py Type: method
To write pd.DataFrame instances to multiple sheets, a pd.ExcelWriter instance has to be instantiated, which creates a blank Excel File:
writer = pd.ExcelWriter(path='./files/Book7.xlsx')
The DataFrame method to_excel can then be used to instruct the ExcelWriter instance to write each pd.DataFrame instance to a specified sheet:
df1.to_excel(excel_writer=writer, sheet_name='sheet1')
df2.to_excel(excel_writer=writer, sheet_name='sheet2')
df3.to_excel(excel_writer=writer, sheet_name='sheet3')
Details about the sheets being written can be seen using the ExcelWriter
attribute sheets
which is a mapping where the key is the sheet name and the value is the sheet being written:
writer.sheets
{'sheet1': <xlsxwriter.worksheet.Worksheet at 0x1aaadbe8f80>, 'sheet2': <xlsxwriter.worksheet.Worksheet at 0x1aaad6cec60>, 'sheet3': <xlsxwriter.worksheet.Worksheet at 0x1aaada6c050>}
Finally the ExcelWriter instance can be closed. This saves the file and releases the Excel spreadsheet from Python:
writer.close()
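The sheets written above can be checked by reading one back by name; a minimal sketch (recall the index was included when writing, so an unnamed column is expected):
pd.read_excel('./files/Book7.xlsx', sheet_name='sheet2')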
The identifiers of the ExcelWriter
class can be examined:
dir2(pd.ExcelWriter, drop_internal=True)
{'attribute': ['book', 'date_format', 'datetime_format', 'engine', 'if_sheet_exists', 'sheets', 'supported_extensions'], 'method': ['check_extension', 'close'], 'datamodel_attribute': ['__annotations__', '__dict__', '__doc__', '__module__', '__orig_bases__', '__parameters__', '__weakref__'], 'datamodel_method': ['__class__', '__class_getitem__', '__delattr__', '__dir__', '__enter__', '__eq__', '__exit__', '__format__', '__fspath__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__']}
Notice it has the datamodel identifiers __enter__ and __exit__ which means it can be used within a with code block. The with code block will automatically close the ExcelWriter instance when the block ends and is the safest way to create the file, write multiple sheets to it and close the file:
with pd.ExcelWriter('./files/Book8.xlsx') as writer:
df1.to_excel(writer, sheet_name='df1', index=False)
df2.to_excel(writer, sheet_name='df2', index=False)
df3.to_excel(writer, sheet_name='df3', index=False)
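The docstring above also shows that an ExcelWriter can append to an existing workbook with mode='a'; a hedged sketch, assuming the openpyxl engine is installed (appending requires it) and that the sheet name 'df_all' does not already exist in the workbook:
with pd.ExcelWriter('./files/Book8.xlsx', mode='a', engine='openpyxl') as writer:
    df.to_excel(writer, sheet_name='df_all', index=False)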
Markdown¶
A pd.DataFrame
instance can be exported into a markdown str
using:
df.to_markdown()
'| | string | integer | boolean | floatingpoint | date | time | category |\n|---:|:------------------|----------:|:----------|----------------:|:--------------------|:----------------|:-----------|\n| 0 | the fat black cat | 4 | True | 0.86 | 2023-07-24 11:30:00 | 0 days 11:36:00 | A |\n| 1 | sat on the mat | 4 | True | 0.86 | 2023-07-25 00:00:00 | 0 days 12:36:00 | A |\n| 2 | twinkle, twinkle | 2 | True | -1.14 | 2023-07-26 00:00:00 | 0 days 13:36:00 | B |\n| 3 | little star | 2 | True | -1.14 | 2023-07-27 00:00:00 | 0 days 14:36:00 | B |\n| 4 | how I wonder | 3 | False | -0.14 | 2023-07-28 00:00:00 | 0 days 15:36:00 | B |\n| 5 | what you are | 4 | True | 0.86 | 2023-07-29 00:00:00 | 0 days 16:36:00 | B |'
When this is printed, it has the following form:
print(df.to_markdown())
| | string | integer | boolean | floatingpoint | date | time | category | |---:|:------------------|----------:|:----------|----------------:|:--------------------|:----------------|:-----------| | 0 | the fat black cat | 4 | True | 0.86 | 2023-07-24 11:30:00 | 0 days 11:36:00 | A | | 1 | sat on the mat | 4 | True | 0.86 | 2023-07-25 00:00:00 | 0 days 12:36:00 | A | | 2 | twinkle, twinkle | 2 | True | -1.14 | 2023-07-26 00:00:00 | 0 days 13:36:00 | B | | 3 | little star | 2 | True | -1.14 | 2023-07-27 00:00:00 | 0 days 14:36:00 | B | | 4 | how I wonder | 3 | False | -0.14 | 2023-07-28 00:00:00 | 0 days 15:36:00 | B | | 5 | what you are | 4 | True | 0.86 | 2023-07-29 00:00:00 | 0 days 16:36:00 | B |
And when this printed format is copied into a markdown cell, it looks like:
string | integer | boolean | floatingpoint | date | time | category | |
---|---|---|---|---|---|---|---|
0 | the fat black cat | 4 | True | 0.86 | 24/07/2023 | 11:36:00 | A |
1 | sat on the mat | 4 | True | 0.86 | 25/07/2023 | 12:36:00 | A |
2 | twinkle, twinkle | 2 | True | -1.14 | 26/07/2023 | 13:36:00 | B |
3 | little star | 2 | True | -1.14 | 27/07/2023 | 14:36:00 | B |
4 | how I wonder | 3 | False | -0.14 | 28/07/2023 | 15:36:00 | B |
5 | what you are | 4 | True | 0.86 | 29/07/2023 | 16:36:00 | B |
Series Identifiers¶
The identifiers for a pd.Series
are:
dir2(pd.Series, drop_internal=True, exclude_identifier_list=obsolete.series_depreciation)
{'attribute': ['array', 'at', 'attrs', 'axes', 'dtype', 'dtypes', 'empty', 'flags', 'hasnans', 'iat', 'iloc', 'index', 'is_monotonic_decreasing', 'is_monotonic_increasing', 'is_unique', 'loc', 'name', 'nbytes', 'ndim', 'shape', 'size', 'values'], 'constant': ['T'], 'method': ['abs', 'add', 'add_prefix', 'add_suffix', 'agg', 'aggregate', 'align', 'all', 'any', 'apply', 'argmax', 'argmin', 'argsort', 'asfreq', 'asof', 'astype', 'at_time', 'autocorr', 'between', 'between_time', 'bfill', 'case_when', 'clip', 'combine', 'combine_first', 'compare', 'convert_dtypes', 'copy', 'corr', 'count', 'cov', 'cummax', 'cummin', 'cumprod', 'cumsum', 'describe', 'diff', 'div', 'divide', 'divmod', 'dot', 'drop', 'drop_duplicates', 'droplevel', 'dropna', 'duplicated', 'eq', 'equals', 'ewm', 'expanding', 'explode', 'factorize', 'ffill', 'fillna', 'filter', 'first', 'first_valid_index', 'floordiv', 'ge', 'get', 'groupby', 'gt', 'head', 'hist', 'idxmax', 'idxmin', 'infer_objects', 'info', 'interpolate', 'isin', 'isna', 'isnull', 'item', 'items', 'keys', 'kurt', 'kurtosis', 'last', 'last_valid_index', 'le', 'lt', 'map', 'mask', 'max', 'mean', 'median', 'memory_usage', 'min', 'mod', 'mode', 'mul', 'multiply', 'ne', 'nlargest', 'notna', 'notnull', 'nsmallest', 'nunique', 'pad', 'pct_change', 'pipe', 'pop', 'pow', 'prod', 'product', 'quantile', 'radd', 'rank', 'rdiv', 'rdivmod', 'reindex', 'reindex_like', 'rename', 'rename_axis', 'reorder_levels', 'repeat', 'replace', 'resample', 'reset_index', 'rfloordiv', 'rmod', 'rmul', 'rolling', 'round', 'rpow', 'rsub', 'rtruediv', 'sample', 'searchsorted', 'sem', 'set_axis', 'set_flags', 'shift', 'skew', 'sort_index', 'sort_values', 'squeeze', 'std', 'sub', 'subtract', 'sum', 'swaplevel', 'tail', 'take', 'to_clipboard', 'to_csv', 'to_dict', 'to_excel', 'to_frame', 'to_hdf', 'to_json', 'to_latex', 'to_list', 'to_markdown', 'to_numpy', 'to_period', 'to_pickle', 'to_sql', 'to_string', 'to_timestamp', 'to_xarray', 'tolist', 'transform', 'transpose', 'truediv', 'truncate', 'tz_convert', 'tz_localize', 'unique', 'unstack', 'update', 'value_counts', 'var', 'where', 'xs'], 'lower_class': ['cat', 'dt', 'list', 'plot', 'sparse', 'str', 'struct'], 'datamodel_attribute': ['__annotations__', '__array_priority__', '__dict__', '__doc__', '__hash__', '__module__', '__pandas_priority__', '__weakref__'], 'datamodel_method': ['__abs__', '__add__', '__and__', '__array__', '__array_ufunc__', '__bool__', '__class__', '__column_consortium_standard__', '__contains__', '__copy__', '__deepcopy__', '__delattr__', '__delitem__', '__dir__', '__divmod__', '__eq__', '__finalize__', '__float__', '__floordiv__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__iadd__', '__iand__', '__ifloordiv__', '__imod__', '__imul__', '__init__', '__init_subclass__', '__int__', '__invert__', '__ior__', '__ipow__', '__isub__', '__iter__', '__itruediv__', '__ixor__', '__le__', '__len__', '__lt__', '__matmul__', '__mod__', '__mul__', '__ne__', '__neg__', '__new__', '__nonzero__', '__or__', '__pos__', '__pow__', '__radd__', '__rand__', '__rdivmod__', '__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__', '__rmatmul__', '__rmod__', '__rmul__', '__ror__', '__round__', '__rpow__', '__rsub__', '__rtruediv__', '__rxor__', '__setattr__', '__setitem__', '__setstate__', '__sizeof__', '__str__', '__sub__', '__subclasshook__', '__truediv__', '__xor__']}
The initialisation signature for a pd.Series
can be examined:
pd.Series?
Init signature: pd.Series( data=None, index=None, dtype: 'Dtype | None' = None, name=None, copy: 'bool | None' = None, fastpath: 'bool | lib.NoDefault' = <no_default>, ) -> 'None' Docstring: One-dimensional ndarray with axis labels (including time series). Labels need not be unique but must be a hashable type. The object supports both integer- and label-based indexing and provides a host of methods for performing operations involving the index. Statistical methods from ndarray have been overridden to automatically exclude missing data (currently represented as NaN). Operations between Series (+, -, /, \*, \*\*) align values based on their associated index values-- they need not be the same length. The result index will be the sorted union of the two indexes. Parameters ---------- data : array-like, Iterable, dict, or scalar value Contains data stored in Series. If data is a dict, argument order is maintained. index : array-like or Index (1d) Values must be hashable and have the same length as `data`. Non-unique index values are allowed. Will default to RangeIndex (0, 1, 2, ..., n) if not provided. If data is dict-like and index is None, then the keys in the data are used as the index. If the index is not None, the resulting Series is reindexed with the index values. dtype : str, numpy.dtype, or ExtensionDtype, optional Data type for the output Series. If not specified, this will be inferred from `data`. See the :ref:`user guide <basics.dtypes>` for more usages. name : Hashable, default None The name to give to the Series. copy : bool, default False Copy input data. Only affects Series or 1d ndarray input. See examples. Notes ----- Please reference the :ref:`User Guide <basics.series>` for more information. Examples -------- Constructing Series from a dictionary with an Index specified >>> d = {'a': 1, 'b': 2, 'c': 3} >>> ser = pd.Series(data=d, index=['a', 'b', 'c']) >>> ser a 1 b 2 c 3 dtype: int64 The keys of the dictionary match with the Index values, hence the Index values have no effect. >>> d = {'a': 1, 'b': 2, 'c': 3} >>> ser = pd.Series(data=d, index=['x', 'y', 'z']) >>> ser x NaN y NaN z NaN dtype: float64 Note that the Index is first build with the keys from the dictionary. After this the Series is reindexed with the given Index values, hence we get all NaN as a result. Constructing Series from a list with `copy=False`. >>> r = [1, 2] >>> ser = pd.Series(r, copy=False) >>> ser.iloc[0] = 999 >>> r [1, 2] >>> ser 0 999 1 2 dtype: int64 Due to input data type the Series has a `copy` of the original data even though `copy=False`, so the data is unchanged. Constructing Series from a 1d ndarray with `copy=False`. >>> r = np.array([1, 2]) >>> ser = pd.Series(r, copy=False) >>> ser.iloc[0] = 999 >>> r array([999, 2]) >>> ser 0 999 1 2 dtype: int64 Due to input data type the Series has a `view` on the original data, so the data is changed as well. File: c:\users\phili\anaconda3\envs\vscode-env\lib\site-packages\pandas\core\series.py Type: type Subclasses: SubclassedSeries
A pd.Series can be instantiated by supplying the data parameter with a list of values and providing a name:
s = pd.Series([4, 3, 2, 2], index=['a', 'b', 'c', 'd'], name='s')
t = pd.Series(name='t')
u = pd.Series([object(), 'hello', 1, None], name='u')
v = pd.Series([1.1, 2.2, 3.1, 4.1], name='v')
w = pd.Series([-1, 2, -3, 4], name='w')
x = pd.Series([1.1, -2.1, -2.1, None], name='x')
y = pd.Series(['A', 'A', 'B', 'B'], index=['a', 'b', 'c', 'd'], dtype='category', name='y')
z = pd.Series(['A', 'B', 'C', 'D'], index=['a', 'b', 'c', 'd'], name='z')
Attributes¶
When a pd.Series is displayed in the cell output, the values, index, name and dtype are shown:
v
0 1.1 1 2.2 2 3.1 3 4.1 Name: v, dtype: float64
These are also separately available as attributes:
v.index
RangeIndex(start=0, stop=4, step=1)
v.values
array([1.1, 2.2, 3.1, 4.1])
v.dtype
dtype('float64')
v.name
'v'
A pd.Series is based on a 1d np.ndarray and has dimension-related attributes. It always has a dimension of 1:
v.ndim
1
The size
attribute gives the number of values:
v.size
4
The shape attribute is a 1-element tuple containing the size:
v.shape
(4,)
Boolean Attributes¶
The pd.Series
has a number of bool
attributes. The hasnans
attribute returns True
if any value is None
or NaN
and False
otherwise:
v = pd.Series([1.1, 2.2, 3.1, 4.1], name='v')
x = pd.Series([1.1, -2.1, -2.1, None], name='x', dtype=float)
v.hasnans
False
x.hasnans
True
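hasnans should be equivalent to checking whether any value is NA; a minimal sketch:
x.isna().any()  # True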
The empty
attribute returns True
if there are no values and False
otherwise:
t = pd.Series(name='t')
u = pd.Series([[], [], [], []], name='u')
t.empty
True
u.empty
False
The is_unique
attribute returns True
if all values are unique and False
otherwise:
v = pd.Series([1.1, 2.2, 3.1, 4.1], name='v')
x = pd.Series([1.1, -2.1, -2.1, None], name='x', dtype=float)
v.is_unique
True
x.is_unique
False
The attribute is_monotonic_increasing returns True if each value is greater than or equal to the previous value and False otherwise. Likewise the attribute is_monotonic_decreasing returns True if each value is less than or equal to the previous value and False otherwise:
s = pd.Series([4, 3, 2, 2], index=['a', 'b', 'c', 'd'], name='s')
v = pd.Series([1.1, 2.2, 3.1, 4.1], name='v')
w = pd.Series([-1.1, 2.1, -3.2, 4.1], name='w')
y = pd.Series(['A', 'A', 'B', 'B'], index=['a', 'b', 'c', 'd'], name='y')
v.is_monotonic_increasing
True
w.is_monotonic_increasing
False
y.is_monotonic_increasing
True
s.is_monotonic_decreasing
True
Indexing¶
The pd.Series
instance v
has a default pd.RangeIndex
:
v
0 1.1 1 2.2 2 3.1 3 4.1 Name: v, dtype: float64
A row can be accessed from the pd.Series
by indexing into it using square brackets and providing the int
value from the index
:
v[0]
1.1
Indexing of rows can also be done using a slice:
v[0:2]
0 1.1 1 2.2 Name: v, dtype: float64
A row can also be accessed using the attribute iloc (integer location) followed by square brackets and the row number:
v.iloc[0]
1.1
Since iloc always expects a numeric value it can also be used for slicing:
v.iloc[0:3]
0 1.1 1 2.2 2 3.1 Name: v, dtype: float64
A single element can be accessed using iat (integer at):
v.iat[0]
1.1
Because iat is used only to select a single element, a slice cannot be used.
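Like a list, iloc also accepts negative positions counting from the end; a minimal sketch:
v.iloc[-1]  # 4.1, the last value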
The pd.Series
instance z
has a pd.Index
where each index value has a name
:
z
a A b B c C d D Name: z, dtype: object
A row can be accessed from the pd.Series
by indexing into it using square brackets and providing the str
value from the index
:
z['a']
'A'
Label-based slicing such as `z['a':'c']` is also supported (note the stop label is inclusive), and a `list` of these `str` instances can be provided:
z[['a', 'c']]
a A c C Name: z, dtype: object
When the `name` of one of the values from the `index` is a valid identifier:
'a'.isidentifier()
True
It is available as an attribute of the `pd.Series` instance:
z.a
'A'
The `loc` (location) attribute can be used with a `str` instance that corresponds to the `name` of an index value:
z.loc['a']
'A'
A `list` of `str` instances, each corresponding to a `name` of a value from the `index`, can also be provided:
z.loc[['a','c']]
a A c C Name: z, dtype: object
The `at` attribute can be used to read a single element using a `str` instance corresponding to a `name` of a value from the `index`:
z.at['a']
'A'
Although `z` has a `pd.Index` where each value has an associated `name`, indexing can still be done numerically using `iloc` with a single numeric value or slice, or `iat` with a single numeric value:
z
a A b B c C d D Name: z, dtype: object
z.iloc[0:2]
a A b B Name: z, dtype: object
z.iat[0]
'A'
When a `pd.RangeIndex` is used:
v
0 1.1 1 2.2 2 3.1 3 4.1 Name: v, dtype: float64
The attributes `loc` and `at` can still be used. `loc` can take a single numeric value, a `list` of numeric values or a label-based slice (inclusive of the stop label); `at` can take a single numeric value only:
v.loc[[0, 1]]
0 1.1 1 2.2 Name: v, dtype: float64
v.at[0]
1.1
Normally for such a scenario `iloc` and `iat` would be used.
String Methods¶
When the data supplied is numeric or `None`, its datatype will be determined to be numeric, in this case `'float64'` corresponding to the `np.float64` class:
x = pd.Series([1.1, -2.1, -2.1, None], name='x')
x.dtype
dtype('float64')
When the data includes a non-numeric `object`, its datatype will be determined to be `'O'` meaning `object`:
u = pd.Series([object(), 'hello', 1, None], name='u')
u.dtype
dtype('O')
There is unfortunately no `str` datatype. Therefore when the data supplied are all `str` instances, the datatype is determined to be `'O'` meaning `object`:
z = pd.Series(['A', 'B', 'C', 'D'], index=['a', 'b', 'c', 'd'], name='z')
z.dtype
dtype('O')
The `apply` method can be used to apply an anonymous lambda function which will check whether each value is a `str`:
z.apply(lambda value: isinstance(value, str))
a True b True c True d True Name: z, dtype: bool
The `all` method can be applied on the `pd.Series` which has a `bool` `dtype` to return a scalar `bool` that is `True` if all values are `str` instances:
z.apply(lambda value: isinstance(value, str)).all()
True
When all values in the `pd.Series` are `str` instances, the `str` class is accessible as an attribute and groups together identifiers that are consistent with the builtins class `str`:
dir2(z.str, drop_internal=True)
{'method': ['capitalize', 'casefold', 'cat', 'center', 'contains', 'count', 'decode', 'encode', 'endswith', 'extract', 'extractall', 'find', 'findall', 'fullmatch', 'get', 'get_dummies', 'index', 'isalnum', 'isalpha', 'isdecimal', 'isdigit', 'islower', 'isnumeric', 'isspace', 'istitle', 'isupper', 'join', 'len', 'ljust', 'lower', 'lstrip', 'match', 'normalize', 'pad', 'partition', 'removeprefix', 'removesuffix', 'repeat', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'slice', 'slice_replace', 'split', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'wrap', 'zfill'], 'datamodel_attribute': ['__annotations__', '__dict__', '__doc__', '__frozen', '__module__', '__weakref__'], 'datamodel_method': ['__class__', '__delattr__', '__dir__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__']}
A `str` method such as `zfill` can be used:
z.str.zfill(10)
a 000000000A b 000000000B c 000000000C d 000000000D Name: z, dtype: object
Instead of the datamodel methods, some of the builtins functions such as `len` are also listed as a `str` method:
z.str.zfill(10).str.len()
a 10 b 10 c 10 d 10 Name: z, dtype: int64
Notice the above syntax starts with a `pd.Series` instance `z` where each value is a `str` instance. It uses the `str` class to access the method `zfill`; because this is a method it requires parenthesis for the function call and is supplied the argument `10`:
z.str.zfill(10)
This returns a new `pd.Series` instance, here called `another_series`, where each value is once again a `str` instance. It then uses the `str` class again to access the method `len`; because this is a method it requires parenthesis for the function call. This time no additional data is required as the method operates on the value provided, which can be thought of as `self`.
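For clarity, the illustrative name `another_series` can be assigned explicitly:
another_series = z.str.zfill(10)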
another_series.str.len()
Stacked this looks like:
z.str.zfill(10).str.len()
Because `len` is supplied as a `str` method, it is not necessary to apply `len` using an anonymous `lambda` function:
z.str.zfill(10).apply(lambda value: len(value))
a 10 b 10 c 10 d 10 Name: z, dtype: int64
Note when `len` is applied to the `pd.Series` instance itself, it will return the length of the `pd.Series`:
len(z)
4
An anonymous `lambda` function can be used to apply an operator. Recall for a `str`, the `+` operator is set up for concatenation and the `*` operator is set up for replication using an `int` instance:
z.apply(lambda value: 5 * value)
a AAAAA b BBBBB c CCCCC d DDDDD Name: z, dtype: object
When every value in a `pd.Series` is a `str` instance, the `__mul__` and `__add__` datamodel operators are broadcast and behave like they do with `str` instances. This means the `*` operator can be used to replicate a `str` using an `int`:
5 * z
a AAAAA b BBBBB c CCCCC d DDDDD Name: z, dtype: object
In addition to a scalar, a `pd.Series` of equal length that has a matching `index` and a `dtype` of `int` can be used:
s = pd.Series([4, 3, 2, 2], index=['a', 'b', 'c', 'd'], name='s')
z * s
a AAAA b BBB c CC d DD dtype: object
This is commutative because the `str` class has the `__rmul__` datamodel identifier defined:
s * z
a AAAA b BBB c CC d DD dtype: object
The `+` operator can be used to concatenate another `str` instance:
z + ' world!'
a A world! b B world! c C world! d D world! Name: z, dtype: object
A `pd.Series` of equal length that has a matching `index` and a `dtype` of `str` can also be concatenated:
z + z
a AA b BB c CC d DD Name: z, dtype: object
Numeric Methods¶
When the values in a `pd.Series` instance are numeric, the numeric datamodel operators are broadcast:
v = pd.Series([1.1, 2.2, 3.1, 4.1], name='v')
w = pd.Series([-1, 2, -3, 4], name='w')
x = pd.Series([1.1, -2.1, -2.1, None], name='x')
This allows similar numeric operations as seen for a 1d `np.ndarray`:
v * 5
0 5.5 1 11.0 2 15.5 3 20.5 Name: v, dtype: float64
v + w
0 0.1 1 4.2 2 0.1 3 8.1 dtype: float64
Notice that any numeric operation involving `None` displays `NaN` (`np.nan`):
-x + w
0 -2.1 1 4.1 2 -0.9 3 NaN dtype: float64
The datamodel method `__abs__` is defined, which means the builtins function `abs` can be used to retrieve the absolute values (stripping the sign):
abs(w)
0 1 1 2 2 3 3 4 Name: w, dtype: int64
This is also available as a method for convenience:
w.abs()
0 1 1 2 2 3 3 4 Name: w, dtype: int64
Many of the functions in the `statistics` module are broadcast across a `pd.Series`. The `statistics` module can be imported:
import statistics
If `w` is examined:
w = pd.Series([-1, 2, -3, 4], name='w')
The mean can be calculated manually using:
w.sum() / len(w)
0.5
Or using the statistical function `mean`:
statistics.mean(w)
0.5
Commonly used statistical functions are available as methods:
w.mean()
0.5
Then the difference from the mean can be calculated:
w - w.mean()
0 -1.5 1 1.5 2 -3.5 3 3.5 Name: w, dtype: float64
Squared:
(w - w.mean()) ** 2
0 2.25 1 2.25 2 12.25 3 12.25 Name: w, dtype: float64
Summed and divided by the length of `w` minus 1, giving the sample variance:
sum((w - w.mean()) ** 2) / (len(w) - 1)
9.666666666666666
The square root of this can be taken to get the standard deviation:
(sum((w - w.mean()) ** 2) / (len(w) - 1)) ** 0.5
3.1091263510296048
Alternatively the `statistics.stdev` function can be used:
statistics.stdev(w)
3.109126351029605
This is available as a method with the slightly different name `std`:
w.std()
3.1091263510296048
Outwith the commonly used numeric datamodel identifiers and statistical functions, mathematical functions are carried out on a `pd.Series` instance by using the relevant `np` function. This supplies the `values` to the `np` function in the form of an `np.ndarray` and then returns a `pd.Series` of the new values using the existing `pd.Series` `name`:
v = pd.Series([1.1, 2.2, 3.1, 4.1], name='v')
np.floor(v)
0 1.0 1 2.0 2 3.0 3 4.0 Name: v, dtype: float64
Category Methods¶
A `pd.Series` can have a `'category'` `dtype`:
y = pd.Series(['A', 'A', 'B', 'B'], index=['a', 'b', 'c', 'd'], dtype='category', name='y')
y
a A b A c B d B Name: y, dtype: category Categories (2, object): ['A', 'B']
A `pd.Series` of the `'category'` `dtype` has access to the `cat` class which groups together category based identifiers:
dir2(y.cat, drop_internal=True)
{'attribute': ['categories', 'codes', 'ordered'], 'method': ['add_categories', 'as_ordered', 'as_unordered', 'remove_categories', 'remove_unused_categories', 'rename_categories', 'reorder_categories', 'set_categories'], 'datamodel_attribute': ['__annotations__', '__dict__', '__doc__', '__frozen', '__module__', '__weakref__'], 'datamodel_method': ['__class__', '__delattr__', '__dir__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__']}
For example the categories can be set to ordered using the `cat` method `as_ordered`:
y.cat.as_ordered()
a A b A c B d B Name: y, dtype: category Categories (2, object): ['A' < 'B']
The order can be changed using the `cat` method `reorder_categories`; for example `'A'` can be greater than `'B'` in a test result. Once again identifiers can be stacked to achieve this purpose:
y.cat.as_ordered().cat.reorder_categories(['B', 'A'])
a A b A c B d B Name: y, dtype: category Categories (2, object): ['B' < 'A']
The updated `pd.Series` can be reassigned to `y`:
y = y.cat.as_ordered().cat.reorder_categories(['B', 'A'])
Now the equality operators can be used, returning a `pd.Series` of `dtype` `bool`:
y == 'A'
a True b True c False d False Name: y, dtype: bool
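Because the categories are now ordered (`'B' < 'A'`), ordering comparisons such as greater than are also supported (a brief sketch):
y > 'B'
a True b True c False d False Name: y, dtype: bool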
This can be used to index into a `pd.Series` that has the same `index`:
s = pd.Series([4, 3, 2, 2], index=['a', 'b', 'c', 'd'], name='s')
s[y == 'A']
a 4 b 3 Name: s, dtype: int64
If a `pd.Series` instance is created with random scores from `0` to `100`:
np.random.seed(0)
scores = pd.Series(np.random.randint(0, 100, size=10), name='scores')
scores
0 44 1 47 2 64 3 67 4 67 5 9 6 83 7 21 8 36 9 87 Name: scores, dtype: int32
The `pd.cut` function can be used to cut the values into categories using boundaries for bins. Note for 4 categories, 5 boundary values are supplied as a lower and an upper bound is specified for each bin. The categories are automatically named using the cut-off values. Note the `(` in `(-0.001, 50.0]` means exclusive of the lower bound and the `]` means inclusive of the upper bound:
pd.cut(scores, bins=[0, 50, 60, 70, 101])
0 (0, 50] 1 (0, 50] 2 (60, 70] 3 (60, 70] 4 (60, 70] 5 (0, 50] 6 (70, 101] 7 (0, 50] 8 (0, 50] 9 (70, 101] Name: scores, dtype: category Categories (4, interval[int64, right]): [(0, 50] < (50, 60] < (60, 70] < (70, 101]]
This `pd.Series` can be assigned to `grades`:
grades = pd.cut(scores, bins=[0, 50, 60, 70, 101], include_lowest=True)
Because it is the `'category'` `dtype`, the `cat` class can be accessed and the `categories` attribute examined:
grades.cat.categories
IntervalIndex([(-0.001, 50.0], (50.0, 60.0], (60.0, 70.0], (70.0, 101.0]], dtype='interval[float64, right]')
This can be assigned to `old_cats`:
old_cats = grades.cat.categories
The new category names can be grouped together in a `list`, `new_cats`:
new_cats = ['F', 'C', 'B', 'A']
This allows zipping of the `old_cats` with the `new_cats` and casting to a `dict` instance:
cats_mapping = dict(zip(old_cats, new_cats))
cats_mapping
{Interval(-0.001, 50.0, closed='right'): 'F', Interval(50.0, 60.0, closed='right'): 'C', Interval(60.0, 70.0, closed='right'): 'B', Interval(70.0, 101.0, closed='right'): 'A'}
The `cat` class can be accessed and the `rename_categories` method used with this `dict` instance:
grades.cat.rename_categories(cats_mapping)
0 F 1 F 2 B 3 B 4 B 5 F 6 A 7 F 8 F 9 A Name: scores, dtype: category Categories (4, object): ['F' < 'C' < 'B' < 'A']
The instance name `grades` can be reassigned to this new `pd.Series`:
grades = grades.cat.rename_categories(cats_mapping)
Plot Methods¶
A `pd.Series` of numeric values has access to the `plot` class which can be used to create matplotlib (`mpl`) plots from the data in the `pd.Series`:
s = pd.Series([4, 3, 2, 2], index=['a', 'b', 'c', 'd'], name='s')
The identifiers for the `plot` class can be examined:
dir2(s.plot, drop_internal=True)
{'method': ['area', 'bar', 'barh', 'box', 'density', 'hexbin', 'hist', 'kde', 'line', 'pie', 'scatter'], 'datamodel_attribute': ['__annotations__', '__dict__', '__doc__', '__module__', '__weakref__'], 'datamodel_method': ['__call__', '__class__', '__delattr__', '__dir__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__']}
The underlying values determine what plot types are most sensible:
s.plot.line()
<Axes: >
s.plot.bar()
<Axes: >
s.plot.pie()
<Axes: ylabel='s'>
v = pd.Series([1.1, 2.2, 3.1, 4.1], name='v')
The plot types can be configured using `mpl`-consistent input parameters, which is covered in the next tutorial.
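As a brief taster (a minimal sketch; `color` and `figsize` are examples of such standard `mpl`-style parameters):
s.plot.bar(color='green', figsize=(4, 3))
<Axes: >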
Datetime and TimeDelta¶
In `pandas`, date and time intervals are based upon the datatypes `datetime64` and `timedelta64` from the `numpy` library.
The `datetime64` class is normally initialised using a timestamp string of the following format:
np.datetime64('YYYY-MM-DD')
np.datetime64('YYYY-MM-DDThh:mm:ss.μμμμμμ')
For example:
np.datetime64('2008-03-12')
numpy.datetime64('2008-03-12')
np.datetime64('2008-03-12T14:30:15.123456')
numpy.datetime64('2008-03-12T14:30:15.123456')
The `timedelta64` is normally initialised using a quantity `X` followed by a unit string `'U'`:
np.timedelta64(X, 'U')
These are usually combined using addition:
np.timedelta64(1, 'D') + np.timedelta64(1, 'h') + np.timedelta64(1, 's') + np.timedelta64(1, 'ms')
numpy.timedelta64(90001001,'ms')
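Subtracting one `np.datetime64` instance from another also gives an `np.timedelta64`:
np.datetime64('2008-03-13') - np.datetime64('2008-03-12')
numpy.timedelta64(1,'D')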
These can be used to make an `Index` or `Series` using the `np.arange` function:
start_time = np.datetime64('2008-03-12')
end_time = np.datetime64('2008-03-13')
time_interval = np.timedelta64(1, 'h')
np.arange(start_time, end_time, time_interval)
array(['2008-03-12T00', '2008-03-12T01', '2008-03-12T02', '2008-03-12T03', '2008-03-12T04', '2008-03-12T05', '2008-03-12T06', '2008-03-12T07', '2008-03-12T08', '2008-03-12T09', '2008-03-12T10', '2008-03-12T11', '2008-03-12T12', '2008-03-12T13', '2008-03-12T14', '2008-03-12T15', '2008-03-12T16', '2008-03-12T17', '2008-03-12T18', '2008-03-12T19', '2008-03-12T20', '2008-03-12T21', '2008-03-12T22', '2008-03-12T23'], dtype='datetime64[h]')
These times can be cast into a `pd.Index` instance:
pd.Index(np.arange(start_time, end_time, time_interval))
DatetimeIndex(['2008-03-12 00:00:00', '2008-03-12 01:00:00', '2008-03-12 02:00:00', '2008-03-12 03:00:00', '2008-03-12 04:00:00', '2008-03-12 05:00:00', '2008-03-12 06:00:00', '2008-03-12 07:00:00', '2008-03-12 08:00:00', '2008-03-12 09:00:00', '2008-03-12 10:00:00', '2008-03-12 11:00:00', '2008-03-12 12:00:00', '2008-03-12 13:00:00', '2008-03-12 14:00:00', '2008-03-12 15:00:00', '2008-03-12 16:00:00', '2008-03-12 17:00:00', '2008-03-12 18:00:00', '2008-03-12 19:00:00', '2008-03-12 20:00:00', '2008-03-12 21:00:00', '2008-03-12 22:00:00', '2008-03-12 23:00:00'], dtype='datetime64[s]', freq=None)
Or a `pd.Series` instance:
pd.Series(np.arange(start_time, end_time, time_interval), name='times')
0 2008-03-12 00:00:00 1 2008-03-12 01:00:00 2 2008-03-12 02:00:00 3 2008-03-12 03:00:00 4 2008-03-12 04:00:00 5 2008-03-12 05:00:00 6 2008-03-12 06:00:00 7 2008-03-12 07:00:00 8 2008-03-12 08:00:00 9 2008-03-12 09:00:00 10 2008-03-12 10:00:00 11 2008-03-12 11:00:00 12 2008-03-12 12:00:00 13 2008-03-12 13:00:00 14 2008-03-12 14:00:00 15 2008-03-12 15:00:00 16 2008-03-12 16:00:00 17 2008-03-12 17:00:00 18 2008-03-12 18:00:00 19 2008-03-12 19:00:00 20 2008-03-12 20:00:00 21 2008-03-12 21:00:00 22 2008-03-12 22:00:00 23 2008-03-12 23:00:00 Name: times, dtype: datetime64[s]
It is more common to have a `pd.Index` of times, for example when recording a number of observations at given times:
times = pd.Index(np.arange(start_time, end_time, time_interval), name='times')
times
DatetimeIndex(['2008-03-12 00:00:00', '2008-03-12 01:00:00', '2008-03-12 02:00:00', '2008-03-12 03:00:00', '2008-03-12 04:00:00', '2008-03-12 05:00:00', '2008-03-12 06:00:00', '2008-03-12 07:00:00', '2008-03-12 08:00:00', '2008-03-12 09:00:00', '2008-03-12 10:00:00', '2008-03-12 11:00:00', '2008-03-12 12:00:00', '2008-03-12 13:00:00', '2008-03-12 14:00:00', '2008-03-12 15:00:00', '2008-03-12 16:00:00', '2008-03-12 17:00:00', '2008-03-12 18:00:00', '2008-03-12 19:00:00', '2008-03-12 20:00:00', '2008-03-12 21:00:00', '2008-03-12 22:00:00', '2008-03-12 23:00:00'], dtype='datetime64[s]', name='times', freq=None)
The `Index` method `tz_localize` can be used to specify a timezone `tz` when timezone information has been omitted. For example if these observations took place in Australia:
times.tz_localize(tz='Australia/Sydney')
DatetimeIndex(['2008-03-12 00:00:00+11:00', '2008-03-12 01:00:00+11:00', '2008-03-12 02:00:00+11:00', '2008-03-12 03:00:00+11:00', '2008-03-12 04:00:00+11:00', '2008-03-12 05:00:00+11:00', '2008-03-12 06:00:00+11:00', '2008-03-12 07:00:00+11:00', '2008-03-12 08:00:00+11:00', '2008-03-12 09:00:00+11:00', '2008-03-12 10:00:00+11:00', '2008-03-12 11:00:00+11:00', '2008-03-12 12:00:00+11:00', '2008-03-12 13:00:00+11:00', '2008-03-12 14:00:00+11:00', '2008-03-12 15:00:00+11:00', '2008-03-12 16:00:00+11:00', '2008-03-12 17:00:00+11:00', '2008-03-12 18:00:00+11:00', '2008-03-12 19:00:00+11:00', '2008-03-12 20:00:00+11:00', '2008-03-12 21:00:00+11:00', '2008-03-12 22:00:00+11:00', '2008-03-12 23:00:00+11:00'], dtype='datetime64[s, Australia/Sydney]', name='times', freq=None)
The associated method `tz_convert` can be used to convert these times into the corresponding times in another timezone:
times.tz_localize(tz='Australia/Sydney').tz_convert(tz='Europe/London')
DatetimeIndex(['2008-03-11 13:00:00+00:00', '2008-03-11 14:00:00+00:00', '2008-03-11 15:00:00+00:00', '2008-03-11 16:00:00+00:00', '2008-03-11 17:00:00+00:00', '2008-03-11 18:00:00+00:00', '2008-03-11 19:00:00+00:00', '2008-03-11 20:00:00+00:00', '2008-03-11 21:00:00+00:00', '2008-03-11 22:00:00+00:00', '2008-03-11 23:00:00+00:00', '2008-03-12 00:00:00+00:00', '2008-03-12 01:00:00+00:00', '2008-03-12 02:00:00+00:00', '2008-03-12 03:00:00+00:00', '2008-03-12 04:00:00+00:00', '2008-03-12 05:00:00+00:00', '2008-03-12 06:00:00+00:00', '2008-03-12 07:00:00+00:00', '2008-03-12 08:00:00+00:00', '2008-03-12 09:00:00+00:00', '2008-03-12 10:00:00+00:00', '2008-03-12 11:00:00+00:00', '2008-03-12 12:00:00+00:00'], dtype='datetime64[s, Europe/London]', name='times', freq=None)
Unlike `np` which is timezone agnostic, `pd` can perform calculations involving `datetime64` data with a timezone:
times.tz_localize(tz='Australia/Sydney')
DatetimeIndex(['2008-03-12 00:00:00+11:00', '2008-03-12 01:00:00+11:00', '2008-03-12 02:00:00+11:00', '2008-03-12 03:00:00+11:00', '2008-03-12 04:00:00+11:00', '2008-03-12 05:00:00+11:00', '2008-03-12 06:00:00+11:00', '2008-03-12 07:00:00+11:00', '2008-03-12 08:00:00+11:00', '2008-03-12 09:00:00+11:00', '2008-03-12 10:00:00+11:00', '2008-03-12 11:00:00+11:00', '2008-03-12 12:00:00+11:00', '2008-03-12 13:00:00+11:00', '2008-03-12 14:00:00+11:00', '2008-03-12 15:00:00+11:00', '2008-03-12 16:00:00+11:00', '2008-03-12 17:00:00+11:00', '2008-03-12 18:00:00+11:00', '2008-03-12 19:00:00+11:00', '2008-03-12 20:00:00+11:00', '2008-03-12 21:00:00+11:00', '2008-03-12 22:00:00+11:00', '2008-03-12 23:00:00+11:00'], dtype='datetime64[s, Australia/Sydney]', name='times', freq=None)
times.tz_localize(tz='Australia/Sydney') + np.timedelta64(1, 'h')
DatetimeIndex(['2008-03-12 01:00:00+11:00', '2008-03-12 02:00:00+11:00', '2008-03-12 03:00:00+11:00', '2008-03-12 04:00:00+11:00', '2008-03-12 05:00:00+11:00', '2008-03-12 06:00:00+11:00', '2008-03-12 07:00:00+11:00', '2008-03-12 08:00:00+11:00', '2008-03-12 09:00:00+11:00', '2008-03-12 10:00:00+11:00', '2008-03-12 11:00:00+11:00', '2008-03-12 12:00:00+11:00', '2008-03-12 13:00:00+11:00', '2008-03-12 14:00:00+11:00', '2008-03-12 15:00:00+11:00', '2008-03-12 16:00:00+11:00', '2008-03-12 17:00:00+11:00', '2008-03-12 18:00:00+11:00', '2008-03-12 19:00:00+11:00', '2008-03-12 20:00:00+11:00', '2008-03-12 21:00:00+11:00', '2008-03-12 22:00:00+11:00', '2008-03-12 23:00:00+11:00', '2008-03-13 00:00:00+11:00'], dtype='datetime64[s, Australia/Sydney]', name='times', freq=None)
Care however needs to be taken as clock changes lead to ambiguous times. In the UK, for example, one of the biannual clock changes can be examined:
start_time = np.datetime64('2023-10-28T11:00:00')
end_time = np.datetime64('2023-10-29T03:00:00')
time_interval = np.timedelta64(30, 'm')
times = pd.Index(np.arange(start_time, end_time, time_interval))
times
DatetimeIndex(['2023-10-28 11:00:00', '2023-10-28 11:30:00', '2023-10-28 12:00:00', '2023-10-28 12:30:00', '2023-10-28 13:00:00', '2023-10-28 13:30:00', '2023-10-28 14:00:00', '2023-10-28 14:30:00', '2023-10-28 15:00:00', '2023-10-28 15:30:00', '2023-10-28 16:00:00', '2023-10-28 16:30:00', '2023-10-28 17:00:00', '2023-10-28 17:30:00', '2023-10-28 18:00:00', '2023-10-28 18:30:00', '2023-10-28 19:00:00', '2023-10-28 19:30:00', '2023-10-28 20:00:00', '2023-10-28 20:30:00', '2023-10-28 21:00:00', '2023-10-28 21:30:00', '2023-10-28 22:00:00', '2023-10-28 22:30:00', '2023-10-28 23:00:00', '2023-10-28 23:30:00', '2023-10-29 00:00:00', '2023-10-29 00:30:00', '2023-10-29 01:00:00', '2023-10-29 01:30:00', '2023-10-29 02:00:00', '2023-10-29 02:30:00'], dtype='datetime64[s]', freq=None)
When these recorded times are localised as UK times, the `tz_localize` method does not know which of the two duplicate times is meant, i.e. whether these times were recorded before or after the clock change:
times.tz_localize(tz='Europe/London', ambiguous='NaT')
DatetimeIndex(['2023-10-28 11:00:00+01:00', '2023-10-28 11:30:00+01:00', '2023-10-28 12:00:00+01:00', '2023-10-28 12:30:00+01:00', '2023-10-28 13:00:00+01:00', '2023-10-28 13:30:00+01:00', '2023-10-28 14:00:00+01:00', '2023-10-28 14:30:00+01:00', '2023-10-28 15:00:00+01:00', '2023-10-28 15:30:00+01:00', '2023-10-28 16:00:00+01:00', '2023-10-28 16:30:00+01:00', '2023-10-28 17:00:00+01:00', '2023-10-28 17:30:00+01:00', '2023-10-28 18:00:00+01:00', '2023-10-28 18:30:00+01:00', '2023-10-28 19:00:00+01:00', '2023-10-28 19:30:00+01:00', '2023-10-28 20:00:00+01:00', '2023-10-28 20:30:00+01:00', '2023-10-28 21:00:00+01:00', '2023-10-28 21:30:00+01:00', '2023-10-28 22:00:00+01:00', '2023-10-28 22:30:00+01:00', '2023-10-28 23:00:00+01:00', '2023-10-28 23:30:00+01:00', '2023-10-29 00:00:00+01:00', '2023-10-29 00:30:00+01:00', 'NaT', 'NaT', '2023-10-29 02:00:00+00:00', '2023-10-29 02:30:00+00:00'], dtype='datetime64[s, Europe/London]', freq=None)
In this case, casting to a `pd.Series` and using the `pd.Series` method `interpolate` with `'linear'` works because the `np.arange` function was used, which spaces the values linearly:
pd.Series(times.tz_localize(tz='Europe/London', ambiguous='NaT')).interpolate('linear')
0 2023-10-28 11:00:00+01:00 1 2023-10-28 11:30:00+01:00 2 2023-10-28 12:00:00+01:00 3 2023-10-28 12:30:00+01:00 4 2023-10-28 13:00:00+01:00 5 2023-10-28 13:30:00+01:00 6 2023-10-28 14:00:00+01:00 7 2023-10-28 14:30:00+01:00 8 2023-10-28 15:00:00+01:00 9 2023-10-28 15:30:00+01:00 10 2023-10-28 16:00:00+01:00 11 2023-10-28 16:30:00+01:00 12 2023-10-28 17:00:00+01:00 13 2023-10-28 17:30:00+01:00 14 2023-10-28 18:00:00+01:00 15 2023-10-28 18:30:00+01:00 16 2023-10-28 19:00:00+01:00 17 2023-10-28 19:30:00+01:00 18 2023-10-28 20:00:00+01:00 19 2023-10-28 20:30:00+01:00 20 2023-10-28 21:00:00+01:00 21 2023-10-28 21:30:00+01:00 22 2023-10-28 22:00:00+01:00 23 2023-10-28 22:30:00+01:00 24 2023-10-28 23:00:00+01:00 25 2023-10-28 23:30:00+01:00 26 2023-10-29 00:00:00+01:00 27 2023-10-29 00:30:00+01:00 28 2023-10-29 01:20:00+01:00 29 2023-10-29 01:10:00+00:00 30 2023-10-29 02:00:00+00:00 31 2023-10-29 02:30:00+00:00 dtype: datetime64[s, Europe/London]
DataFrame Identifiers¶
The identifiers for a `pd.DataFrame` are:
dir2(pd.DataFrame, drop_internal=True, exclude_identifier_list=obsolete.df_depreciation)
{'attribute': ['at', 'attrs', 'axes', 'columns', 'dtypes', 'empty', 'flags', 'iat', 'iloc', 'index', 'loc', 'ndim', 'shape', 'size', 'style', 'values'], 'constant': ['T'], 'method': ['abs', 'add', 'add_prefix', 'add_suffix', 'agg', 'aggregate', 'align', 'all', 'any', 'apply', 'asfreq', 'asof', 'assign', 'astype', 'at_time', 'backfill', 'between_time', 'bfill', 'boxplot', 'clip', 'combine', 'combine_first', 'compare', 'convert_dtypes', 'copy', 'corr', 'corrwith', 'count', 'cov', 'cummax', 'cummin', 'cumprod', 'cumsum', 'describe', 'diff', 'div', 'divide', 'dot', 'drop', 'drop_duplicates', 'droplevel', 'dropna', 'duplicated', 'eq', 'equals', 'eval', 'ewm', 'expanding', 'explode', 'ffill', 'fillna', 'filter', 'first_valid_index', 'floordiv', 'from_dict', 'from_records', 'ge', 'get', 'groupby', 'gt', 'head', 'hist', 'idxmax', 'idxmin', 'infer_objects', 'info', 'insert', 'interpolate', 'isetitem', 'isin', 'isna', 'isnull', 'items', 'iterrows', 'itertuples', 'join', 'keys', 'kurt', 'kurtosis', 'last_valid_index', 'le', 'lt', 'map', 'mask', 'max', 'mean', 'median', 'melt', 'memory_usage', 'merge', 'min', 'mod', 'mode', 'mul', 'multiply', 'ne', 'nlargest', 'notna', 'notnull', 'nsmallest', 'nunique', 'pad', 'pct_change', 'pipe', 'pivot', 'pivot_table', 'pop', 'pow', 'prod', 'product', 'quantile', 'query', 'radd', 'rank', 'rdiv', 'reindex', 'reindex_like', 'rename', 'rename_axis', 'reorder_levels', 'replace', 'resample', 'reset_index', 'rfloordiv', 'rmod', 'rmul', 'rolling', 'round', 'rpow', 'rsub', 'rtruediv', 'sample', 'select_dtypes', 'sem', 'set_axis', 'set_flags', 'set_index', 'shift', 'skew', 'sort_index', 'sort_values', 'squeeze', 'stack', 'std', 'sub', 'subtract', 'sum', 'swaplevel', 'tail', 'take', 'to_clipboard', 'to_csv', 'to_dict', 'to_excel', 'to_feather', 'to_hdf', 'to_html', 'to_json', 'to_latex', 'to_markdown', 'to_numpy', 'to_orc', 'to_parquet', 'to_period', 'to_pickle', 'to_records', 'to_sql', 'to_stata', 'to_string', 'to_timestamp', 'to_xarray', 'to_xml', 'transform', 'transpose', 'truediv', 'truncate', 'tz_convert', 'tz_localize', 'unstack', 'update', 'value_counts', 'var', 'where', 'xs'], 'lower_class': ['plot', 'sparse'], 'datamodel_attribute': ['__annotations__', '__array_priority__', '__dict__', '__doc__', '__hash__', '__module__', '__pandas_priority__', '__weakref__'], 'datamodel_method': ['__abs__', '__add__', '__and__', '__array__', '__array_ufunc__', '__arrow_c_stream__', '__bool__', '__class__', '__contains__', '__copy__', '__dataframe__', '__dataframe_consortium_standard__', '__deepcopy__', '__delattr__', '__delitem__', '__dir__', '__divmod__', '__eq__', '__finalize__', '__floordiv__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__iadd__', '__iand__', '__ifloordiv__', '__imod__', '__imul__', '__init__', '__init_subclass__', '__invert__', '__ior__', '__ipow__', '__isub__', '__iter__', '__itruediv__', '__ixor__', '__le__', '__len__', '__lt__', '__matmul__', '__mod__', '__mul__', '__ne__', '__neg__', '__new__', '__nonzero__', '__or__', '__pos__', '__pow__', '__radd__', '__rand__', '__rdivmod__', '__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__', '__rmatmul__', '__rmod__', '__rmul__', '__ror__', '__round__', '__rpow__', '__rsub__', '__rtruediv__', '__rxor__', '__setattr__', '__setitem__', '__setstate__', '__sizeof__', '__str__', '__sub__', '__subclasshook__', '__truediv__', '__xor__']}
The initialisation signature for a `pd.DataFrame` can be examined:
pd.DataFrame?
Init signature: pd.DataFrame( data=None, index: 'Axes | None' = None, columns: 'Axes | None' = None, dtype: 'Dtype | None' = None, copy: 'bool | None' = None, ) -> 'None' Docstring: Two-dimensional, size-mutable, potentially heterogeneous tabular data. Data structure also contains labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure. Parameters ---------- data : ndarray (structured or homogeneous), Iterable, dict, or DataFrame Dict can contain Series, arrays, constants, dataclass or list-like objects. If data is a dict, column order follows insertion-order. If a dict contains Series which have an index defined, it is aligned by its index. This alignment also occurs if data is a Series or a DataFrame itself. Alignment is done on Series/DataFrame inputs. If data is a list of dicts, column order follows insertion-order. index : Index or array-like Index to use for resulting frame. Will default to RangeIndex if no indexing information part of input data and no index provided. columns : Index or array-like Column labels to use for resulting frame when data does not have them, defaulting to RangeIndex(0, 1, 2, ..., n). If data contains column labels, will perform column selection instead. dtype : dtype, default None Data type to force. Only a single dtype is allowed. If None, infer. copy : bool or None, default None Copy data from inputs. For dict data, the default of None behaves like ``copy=True``. For DataFrame or 2d ndarray input, the default of None behaves like ``copy=False``. If data is a dict containing one or more Series (possibly of different dtypes), ``copy=False`` will ensure that these inputs are not copied. .. versionchanged:: 1.3.0 See Also -------- DataFrame.from_records : Constructor from tuples, also record arrays. DataFrame.from_dict : From dicts of Series, arrays, or dicts. read_csv : Read a comma-separated values (csv) file into DataFrame. read_table : Read general delimited file into DataFrame. read_clipboard : Read text from clipboard into DataFrame. Notes ----- Please reference the :ref:`User Guide <basics.dataframe>` for more information. Examples -------- Constructing DataFrame from a dictionary. >>> d = {'col1': [1, 2], 'col2': [3, 4]} >>> df = pd.DataFrame(data=d) >>> df col1 col2 0 1 3 1 2 4 Notice that the inferred dtype is int64. >>> df.dtypes col1 int64 col2 int64 dtype: object To enforce a single dtype: >>> df = pd.DataFrame(data=d, dtype=np.int8) >>> df.dtypes col1 int8 col2 int8 dtype: object Constructing DataFrame from a dictionary including Series: >>> d = {'col1': [0, 1, 2, 3], 'col2': pd.Series([2, 3], index=[2, 3])} >>> pd.DataFrame(data=d, index=[0, 1, 2, 3]) col1 col2 0 0 NaN 1 1 NaN 2 2 2.0 3 3 3.0 Constructing DataFrame from numpy ndarray: >>> df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), ... columns=['a', 'b', 'c']) >>> df2 a b c 0 1 2 3 1 4 5 6 2 7 8 9 Constructing DataFrame from a numpy ndarray that has labeled columns: >>> data = np.array([(1, 2, 3), (4, 5, 6), (7, 8, 9)], ... dtype=[("a", "i4"), ("b", "i4"), ("c", "i4")]) >>> df3 = pd.DataFrame(data, columns=['c', 'a']) ... 
>>> df3 c a 0 3 1 1 6 4 2 9 7 Constructing DataFrame from dataclass: >>> from dataclasses import make_dataclass >>> Point = make_dataclass("Point", [("x", int), ("y", int)]) >>> pd.DataFrame([Point(0, 0), Point(0, 3), Point(2, 3)]) x y 0 0 0 1 0 3 2 2 3 Constructing DataFrame from Series/DataFrame: >>> ser = pd.Series([1, 2, 3], index=["a", "b", "c"]) >>> df = pd.DataFrame(data=ser, index=["a", "c"]) >>> df 0 a 1 c 3 >>> df1 = pd.DataFrame([1, 2, 3], index=["a", "b", "c"], columns=["x"]) >>> df2 = pd.DataFrame(data=df1, index=["a", "c"]) >>> df2 x a 1 c 3 File: c:\users\phili\anaconda3\envs\vscode-env\lib\site-packages\pandas\core\frame.py Type: type Subclasses: SubclassedDataFrame
A `pd.DataFrame` can be instantiated by supplying a `dict` as `data`, where each key is a `pd.Series` `name` and each value contains the corresponding `pd.Series` values:
df1 = pd.DataFrame({'w': [0, 1, 2, 3],
'x': [4, 5, 6, 7],
'y': [8, 9, 10, 11],
'z': [12, 13, 14, 15]})
df1
w | x | y | z | |
---|---|---|---|---|
0 | 0 | 4 | 8 | 12 |
1 | 1 | 5 | 9 | 13 |
2 | 2 | 6 | 10 | 14 |
3 | 3 | 7 | 11 | 15 |
Attributes¶
A `pd.DataFrame` instance has 2 dimensions so `ndim` is:
df1.ndim
2
The attribute `axes` displays a `list` of each axis:
df1.axes
[RangeIndex(start=0, stop=4, step=1), Index(['w', 'x', 'y', 'z'], dtype='object')]
The detail for each axis can be viewed:
df1.index
RangeIndex(start=0, stop=4, step=1)
Note the `columns` axis is an `Index` instance. The `dtype` is `'O'` because it contains `str` instances:
df1.columns
Index(['w', 'x', 'y', 'z'], dtype='object')
`dtypes` groups the datatype for each `pd.Series` in the `pd.DataFrame`; note that this is itself a `pd.Series` instance:
df1.dtypes
w int64 x int64 y int64 z int64 dtype: object
The `pd.DataFrame` instance also has a `shape` and `size`:
df1.shape
(4, 4)
df1.size
16
The `pd.DataFrame` instance has an optional attribute `attrs` which is an empty `dict` by default:
df1.attrs
{}
This can be used to provide a description and a reference, for example:
df1.attrs = {'description': 'This is a DataFrame',
'documentation': 'https://pandas.pydata.org/'}
df1.attrs
{'description': 'This is a DataFrame', 'documentation': 'https://pandas.pydata.org/'}
The `pd.DataFrame` attribute `style` will display a styled `DataFrame` in the cell output:
df1.style
w | x | y | z | |
---|---|---|---|---|
0 | 0 | 4 | 8 | 12 |
1 | 1 | 5 | 9 | 13 |
2 | 2 | 6 | 10 | 14 |
3 | 3 | 7 | 11 | 15 |
A number of identifiers are available to be called from `style` to change how the `pd.DataFrame` instance displays:
dir2(df1.style, drop_internal=True)
{'attribute': ['caption', 'cell_context', 'cell_ids', 'columns', 'concatenated', 'css', 'ctx', 'ctx_columns', 'ctx_index', 'data', 'env', 'hidden_columns', 'hidden_rows', 'hide_column_names', 'hide_columns_', 'hide_index_', 'hide_index_names', 'index', 'loader', 'table_attributes', 'table_styles', 'template_html', 'template_html_style', 'template_html_table', 'template_latex', 'template_string', 'tooltips', 'uuid', 'uuid_len'], 'method': ['apply', 'apply_index', 'applymap', 'applymap_index', 'background_gradient', 'bar', 'clear', 'concat', 'export', 'format', 'format_index', 'from_custom_template', 'hide', 'highlight_between', 'highlight_max', 'highlight_min', 'highlight_null', 'highlight_quantile', 'map', 'map_index', 'pipe', 'relabel_index', 'set_caption', 'set_properties', 'set_sticky', 'set_table_attributes', 'set_table_styles', 'set_td_classes', 'set_tooltips', 'set_uuid', 'text_gradient', 'to_excel', 'to_html', 'to_latex', 'to_string', 'use'], 'datamodel_attribute': ['__dict__', '__doc__', '__module__', '__weakref__'], 'datamodel_method': ['__class__', '__copy__', '__deepcopy__', '__delattr__', '__dir__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__']}
These can often be stacked:
df1.style.format(precision=3).set_caption('DataFrame Instance').relabel_index(['a', 'b', 'c', 'd'])
w | x | y | z | |
---|---|---|---|---|
a | 0 | 4 | 8 | 12 |
b | 1 | 5 | 9 | 13 |
c | 2 | 6 | 10 | 14 |
d | 3 | 7 | 11 | 15 |
Indexing¶
df1 = pd.DataFrame({'w': [0, 1, 2, 3],
'x': [4, 5, 6, 7],
'y': [8, 9, 10, 11],
'z': [12, 13, 14, 15]})
When each `pd.Series` `name` is a valid identifier:
'w'.isidentifier()
True
It is available as an attribute:
df1.w
0 0 1 1 2 2 3 3 Name: w, dtype: int64
A `pd.Series` can be indexed from a `pd.DataFrame` by using square brackets and the `str` instance corresponding to the `pd.Series` `name`:
df1['x']
0 4 1 5 2 6 3 7 Name: x, dtype: int64
A single value can be specified by indexing into the returned `pd.Series`:
df1['x'][1]
5
The data supplied to `df1` was a `dict` instance:
mapping = {'w': [0, 1, 2, 3],
'x': [4, 5, 6, 7],
'y': [8, 9, 10, 11],
'z': [12, 13, 14, 15]}
Notice the consistency when indexing into the `dict` using a key:
mapping['x']
[4, 5, 6, 7]
And then indexing into that `list` to get a single element:
mapping['x'][1]
5
Providing a `list` of `str` instances, where each `str` corresponds to a `pd.Series` `name`, returns a `pd.DataFrame` instance:
df1[['x', 'z']]
x | z | |
---|---|---|
0 | 4 | 12 |
1 | 5 | 13 |
2 | 6 | 14 |
3 | 7 | 15 |
The `iloc` attribute is instead used to retrieve a row by specifying the numeric value in the `index`. The row is returned as a `pd.Series` and because the default representation for a `pd.Series` is a column, the row is displayed as a column:
df1.iloc[2]
w 2 x 6 y 10 z 14 Name: 2, dtype: int64
Slicing can also be used to return a `pd.DataFrame` instance:
df1.iloc[2:4]
w | x | y | z | |
---|---|---|---|---|
2 | 2 | 6 | 10 | 14 |
3 | 3 | 7 | 11 | 15 |
The `at` attribute is used to return a single value, specifying the `index` `name` (when applicable) and the column `name`:
df1.at[0, 'x']
4
If a second `pd.DataFrame` instance with a labelled `pd.Index` is used:
df2 = pd.DataFrame({'w': [0, 1, 2, 3],
'x': [4, 5, 6, 7],
'y': [8, 9, 10, 11],
'z': [12, 13, 14, 15]},
index=('a', 'b', 'c', 'd'))
df2
w | x | y | z | |
---|---|---|---|---|
a | 0 | 4 | 8 | 12 |
b | 1 | 5 | 9 | 13 |
c | 2 | 6 | 10 | 14 |
d | 3 | 7 | 11 | 15 |
The difference between the indexing attributes can be examined, such as indexing with a column `name`:
df2['x']
a 4 b 5 c 6 d 7 Name: x, dtype: int64
df2[['x', 'y']]
x | y | |
---|---|---|
a | 4 | 8 |
b | 5 | 9 |
c | 6 | 10 |
d | 7 | 11 |
Indexing a row using an `index` `name`:
df2.loc['b']
w 1 x 5 y 9 z 13 Name: b, dtype: int64
Indexing a row using an `index` `int`:
df2.iloc[1]
w 1 x 5 y 9 z 13 Name: b, dtype: int64
df2.iloc[1:3]
w | x | y | z | |
---|---|---|---|---|
b | 1 | 5 | 9 | 13 |
c | 2 | 6 | 10 | 14 |
Indexing all rows and selecting columns using `int` positions:
df2.iloc[:, [0, 1]]
w | x | |
---|---|---|
a | 0 | 4 |
b | 1 | 5 |
c | 2 | 6 |
d | 3 | 7 |
Specifying an element using `at` with a row `name` followed by a column `name`:
df2.at['b', 'y']
9
Specifying an element using `iat` with a row `int` followed by a column `int`:
df2.iat[1, 2]
9
Under the hood `iloc` can be conceptualised as using a reset index:
df2
w | x | y | z | |
---|---|---|---|---|
a | 0 | 4 | 8 | 12 |
b | 1 | 5 | 9 | 13 |
c | 2 | 6 | 10 | 14 |
d | 3 | 7 | 11 | 15 |
df2.reset_index()
index | w | x | y | z | |
---|---|---|---|---|---|
0 | a | 0 | 4 | 8 | 12 |
1 | b | 1 | 5 | 9 | 13 |
2 | c | 2 | 6 | 10 | 14 |
3 | d | 3 | 7 | 11 | 15 |
df2.iloc[2]
w 2 x 6 y 10 z 14 Name: c, dtype: int64
Or by examining the underlying `np.ndarray` instance and indexing numerically:
df2.values
array([[ 0, 4, 8, 12], [ 1, 5, 9, 13], [ 2, 6, 10, 14], [ 3, 7, 11, 15]], dtype=int64)
df2.values[2]
array([ 2, 6, 10, 14], dtype=int64)
df2.iloc[2, 1]
6
df2.values[2, 1]
6
If `df1` is examined:
df1
w | x | y | z | |
---|---|---|---|---|
0 | 0 | 4 | 8 | 12 |
1 | 1 | 5 | 9 | 13 |
2 | 2 | 6 | 10 | 14 |
3 | 3 | 7 | 11 | 15 |
Notice a `pd.Series` can be retrieved using the attribute or by indexing:
df1.x
0 4 1 5 2 6 3 7 Name: x, dtype: int64
df1['x']
0 4 1 5 2 6 3 7 Name: x, dtype: int64
Attempting to access an attribute that does not exist will raise an `AttributeError`. Indexing using square brackets can be used to create a new key in a `dict` instance; the key then essentially becomes an instance name for the value:
mapping['letters'] = ['A', 'B', 'C', 'D']
mapping
{'w': [0, 1, 2, 3], 'x': [4, 5, 6, 7], 'y': [8, 9, 10, 11], 'z': [12, 13, 14, 15], 'letters': ['A', 'B', 'C', 'D']}
Similar syntax can be used for a `pd.DataFrame` instance:
df1['letters'] = pd.Series(['A', 'B', 'C', 'D'], name='letters')
df1
w | x | y | z | letters | |
---|---|---|---|---|---|
0 | 0 | 4 | 8 | 12 | A |
1 | 1 | 5 | 9 | 13 | B |
2 | 2 | 6 | 10 | 14 | C |
3 | 3 | 7 | 11 | 15 | D |
Since the new `pd.Series` `name` is a valid identifier name, it becomes an attribute:
df1.letters
0 A 1 B 2 C 3 D Name: letters, dtype: object
Information¶
The example `pd.DataFrame` instance `df3` is very small:
df3 = pd.DataFrame({'w': [0, 1, 2, 3],
'x': [4, 5, None, 7],
'y': [8, None, 10, 11],
'z': [12, 13, 14, 15]})
df3
w | x | y | z | |
---|---|---|---|---|
0 | 0 | 4.0 | 8.0 | 12 |
1 | 1 | 5.0 | NaN | 13 |
2 | 2 | NaN | 10.0 | 14 |
3 | 3 | 7.0 | 11.0 | 15 |
However many `pd.DataFrame` instances have hundreds of rows of data and a preview can be made using the `pd.DataFrame` method `head` or `tail`. These methods optionally take an `int` as a positional input argument giving the number of rows to show; this defaults to `5` if unspecified:
df3.head(2)
w | x | y | z | |
---|---|---|---|---|
0 | 0 | 4.0 | 8.0 | 12 |
1 | 1 | 5.0 | NaN | 13 |
df3.tail(3)
w | x | y | z | |
---|---|---|---|---|
1 | 1 | 5.0 | NaN | 13 |
2 | 2 | NaN | 10.0 | 14 |
3 | 3 | 7.0 | 11.0 | 15 |
The `info` method gives details about each `pd.Series` such as the `name` and the number of non-null values as well as the `dtype`:
df3.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 4 entries, 0 to 3 Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 w 4 non-null int64 1 x 3 non-null float64 2 y 3 non-null float64 3 z 4 non-null int64 dtypes: float64(2), int64(2) memory usage: 260.0 bytes
When a `pd.Series` is numeric, descriptive statistics can be computed on it:
df3.describe()
w | x | y | z | |
---|---|---|---|---|
count | 4.000000 | 3.000000 | 3.000000 | 4.000000 |
mean | 1.500000 | 5.333333 | 9.666667 | 13.500000 |
std | 1.290994 | 1.527525 | 1.527525 | 1.290994 |
min | 0.000000 | 4.000000 | 8.000000 | 12.000000 |
25% | 0.750000 | 4.500000 | 9.000000 | 12.750000 |
50% | 1.500000 | 5.000000 | 10.000000 | 13.500000 |
75% | 2.250000 | 6.000000 | 10.500000 | 14.250000 |
max | 3.000000 | 7.000000 | 11.000000 | 15.000000 |
If another `pd.DataFrame` instance is examined with some values that are `None`:
df3 = pd.DataFrame({'w': [0, 1, 2, 3],
'x': [4, 5, None, 7],
'y': [8, None, 10, 11],
'z': [12, 13, 14, 15]})
Then the `pd.Series` with `None` are cast into `float64` instances and have a non-null count lower than the `size` of the `pd.Series`:
df3.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 4 entries, 0 to 3 Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 w 4 non-null int64 1 x 3 non-null float64 2 y 3 non-null float64 3 z 4 non-null int64 dtypes: float64(2), int64(2) memory usage: 260.0 bytes
The `isna` method (alias `isnull`) will return a `bool` `pd.DataFrame` instance:
df3.isna()
w | x | y | z | |
---|---|---|---|---|
0 | False | False | False | False |
1 | False | False | True | False |
2 | False | True | False | False |
3 | False | False | False | False |
It is quite common to use this with the `pd.DataFrame` method `any` to return a `pd.Series`:
df3.isna().any()
w False x True y True z False dtype: bool
This can be used to see which `pd.Series` instances need to be examined to remove or fill in missing values.
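For example, this `bool` `pd.Series` can be used with `loc` to select only the columns that contain missing values (a minimal sketch using `df3` from above):
df3.loc[:, df3.isna().any()]
x | y | 
---|---|---|
0 | 4.0 | 8.0 |
1 | 5.0 | NaN |
2 | NaN | 10.0 |
3 | 7.0 | 11.0 |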
The `notna` method (alias `notnull`) will return the inverted `bool` `pd.DataFrame` instance:
df3.notna()
w | x | y | z | |
---|---|---|---|---|
0 | True | True | True | True |
1 | True | True | False | True |
2 | True | False | True | True |
3 | True | True | True | True |
Mutability¶
The three classes are mutable and therefore the `__hash__` identifier is `None`:
pd.Index.__hash__ == None
True
pd.Series.__hash__ == None
True
pd.DataFrame.__hash__ == None
True
This is because they are based on the `np.ndarray` which is also mutable:
np.ndarray.__hash__ == None
True
Most of the methods are immutable by default, returning a new instance. A large number of `pd.Series` and consistent `pd.DataFrame` methods have the parameter `inplace` which can be used to change whether the method behaves immutably or mutably:
dir2(pd.Series, parameter='inplace', exclude_identifier_list=obsolete.series_depreciation)
{'method': ['bfill', 'clip', 'drop', 'drop_duplicates', 'dropna', 'ffill', 'fillna', 'interpolate', 'mask', 'pad', 'rename', 'rename_axis', 'replace', 'reset_index', 'sort_index', 'sort_values', 'where']}
For example the `pd.DataFrame` instance `df3` can be examined:
df3
w | x | y | z | |
---|---|---|---|---|
0 | 0 | 4.0 | 8.0 | 12 |
1 | 1 | 5.0 | NaN | 13 |
2 | 2 | NaN | 10.0 | 14 |
3 | 3 | 7.0 | 11.0 | 15 |
Notice that `x` has a `NaN` value (`np.nan`):
df3['x']
0 4.0 1 5.0 2 NaN 3 7.0 Name: x, dtype: float64
The method `fillna` can be used to fill in a missing value using a specified value:
df3['x'].fillna(0) # immutable returns a new pd.Series
0 4.0 1 5.0 2 0.0 3 7.0 Name: x, dtype: float64
The method `bfill` can be used to fill in a missing value using the next value (backward fill):
df3['x'].bfill() # immutable returns a new pd.Series
0 4.0 1 5.0 2 7.0 3 7.0 Name: x, dtype: float64
The method `ffill` can be used to fill in a missing value using the previous value (forward fill):
df3['x'].ffill() # immutable returns a new pd.Series
0 4.0 1 5.0 2 5.0 3 7.0 Name: x, dtype: float64
The method `interpolate` can be used to interpolate values using `'linear'`, `'quadratic'` or `'cubic'` interpolation (there is not enough data for `'cubic'` in this sample):
df3['x'].interpolate('linear') # immutable returns a new pd.Series
0 4.0 1 5.0 2 6.0 3 7.0 Name: x, dtype: float64
df3['x'].interpolate('quadratic') # immutable returns a new pd.Series
0 4.0 1 5.0 2 6.0 3 7.0 Name: x, dtype: float64
Notice that `df3` is unchanged:
df3 # not updated
w | x | y | z | |
---|---|---|---|---|
0 | 0 | 4.0 | 8.0 | 12 |
1 | 1 | 5.0 | NaN | 13 |
2 | 2 | NaN | 10.0 | 14 |
3 | 3 | 7.0 | 11.0 | 15 |
When the named parameter `inplace` is set to the `bool` value `True`, the method becomes mutable and updates the `pd.Series` with name `'x'` found on the `pd.DataFrame` instance `df3` inplace:
df3['x'].ffill(inplace=True) # mutable no return value
df3 # updated in place
w | x | y | z | |
---|---|---|---|---|
0 | 0 | 4.0 | 8.0 | 12 |
1 | 1 | 5.0 | NaN | 13 |
2 | 2 | 5.0 | 10.0 | 14 |
3 | 3 | 7.0 | 11.0 | 15 |
A mutable method should not be used with reassignment because it has no return value; the value being reassigned is therefore `None`:
df3['y'] = df3['y'].ffill(inplace=True)
The above attempts to use a mutable method, which returns the value `None`, and then reassign `None` to the `pd.Series` instance `df3['y']`.
This is one of the most common mistakes beginners make. Currently the operation is carried out with a `FutureWarning`; however, this behaviour will be prevented in pandas 3.
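The recommended pattern is to instead use the immutable method and reassign the returned `pd.Series` (a minimal sketch):
df3['y'] = df3['y'].ffill() # immutable method, reassign the return value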
Because the `pd.Index`, `pd.Series` and `pd.DataFrame` are mutable, an item can be accessed:
df3.at[1, 'y']
nan
And reassigned:
df3.at[1, 'y'] = 40
df3
w | x | y | z | |
---|---|---|---|---|
0 | 0 | 4.0 | 8.0 | 12 |
1 | 1 | 5.0 | 40.0 | 13 |
2 | 2 | 5.0 | 10.0 | 14 |
3 | 3 | 7.0 | 11.0 | 15 |
The `del` keyword only works for deleting a `pd.Series` from a `pd.DataFrame`:
del df3['y']
df3
w | x | z | |
---|---|---|---|
0 | 0 | 4.0 | 12 |
1 | 1 | 5.0 | 13 |
2 | 2 | 5.0 | 14 |
3 | 3 | 7.0 | 15 |
The `pd.DataFrame` has mutable methods consistent with a mutable `Collection` such as a `list`. `insert` can be used to insert a new `pd.Series` at a specified column position:
df3.insert(loc=2, column='y2', value=[20, 21, 22, 23])
df3
w | x | y2 | z | |
---|---|---|---|---|
0 | 0 | 4.0 | 20 | 12 |
1 | 1 | 5.0 | 21 | 13 |
2 | 2 | 5.0 | 22 | 14 |
3 | 3 | 7.0 | 23 | 15 |
`pop` can be used to pop off a `pd.Series` using a `str` instance that corresponds to the `pd.Series` `name`. Recall that `pop` is one of the few mutable methods that has a `return` value. The `return` value is the popped `pd.Series` and the `pd.DataFrame` instance is mutated inplace:
df3.pop('z') # mutable method WITH return value of pd.Series
0 12 1 13 2 14 3 15 Name: z, dtype: int64
df3 # mutated inplace
w | x | y2 | |
---|---|---|---|
0 | 0 | 4.0 | 20 |
1 | 1 | 5.0 | 21 |
2 | 2 | 5.0 | 22 |
3 | 3 | 7.0 | 23 |
Note there is no `append` method because indexing with a new `pd.Series` name appends the `pd.Series` to the end of the `pd.DataFrame`:
df3['z2'] = pd.Series([30, 31, 32, 33], name='z')
Restoring `df3` to the following:
df3 = pd.DataFrame({'w': [0, 1, 2, 3],
'x': [4, 5, None, 7],
'y': [8, None, 10, 11],
'z': [12, 13, 14, 15]})
The `pd.DataFrame` method `dropna` can be used. Notice any observations with a `NaN` (`None` or `np.nan`) value are dropped:
df3.dropna()
w | x | y | z | |
---|---|---|---|---|
0 | 0 | 4.0 | 8.0 | 12 |
3 | 3 | 7.0 | 11.0 | 15 |
Alternatively an observation can be dropped using:
df3.drop(0)
w | x | y | z | |
---|---|---|---|---|
1 | 1 | 5.0 | NaN | 13 |
2 | 2 | NaN | 10.0 | 14 |
3 | 3 | 7.0 | 11.0 | 15 |
df3.drop([0, 3])
w | x | y | z | |
---|---|---|---|---|
1 | 1 | 5.0 | NaN | 13 |
2 | 2 | NaN | 10.0 | 14 |
If the following `pd.DataFrame` instance is instantiated:
df4 = pd.DataFrame({'w': [0, 0, 4, 4],
'x': [1, 1, 5, 5],
'y': [2, 2, 6, 6],
'z': [3, 3, 7, 7]})
Notice index `0` and `1` have duplicate data and index `2` and `3` also have duplicate data:
df4
w | x | y | z | |
---|---|---|---|---|
0 | 0 | 1 | 2 | 3 |
1 | 0 | 1 | 2 | 3 |
2 | 4 | 5 | 6 | 7 |
3 | 4 | 5 | 6 | 7 |
The duplicates can be dropped using:
df4.drop_duplicates()
w | x | y | z | |
---|---|---|---|---|
0 | 0 | 1 | 2 | 3 |
2 | 4 | 5 | 6 | 7 |
The index is now `0` and `2` as the observations that were at position `1` and `3` on the `index` have been dropped. The index can be reset using the `reset_index` method:
df4.drop_duplicates().reset_index()
index | w | x | y | z | |
---|---|---|---|---|---|
0 | 0 | 0 | 1 | 2 | 3 |
1 | 2 | 4 | 5 | 6 | 7 |
The old `index` is now a `pd.Series` with `name` `'index'`; it can also be dropped:
df4.drop_duplicates().reset_index(drop=True)
w | x | y | z | |
---|---|---|---|---|
0 | 0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 | 7 |
If the `pd.DataFrame` instance `df4` is examined:
df4
w | x | y | z | |
---|---|---|---|---|
0 | 0 | 1 | 2 | 3 |
1 | 0 | 1 | 2 | 3 |
2 | 4 | 5 | 6 | 7 |
3 | 4 | 5 | 6 | 7 |
Its `len` can be determined:
len(df4)
4
Notice that this is `1` more than the last index value. `loc` can be used to access this new value on the `index` and assign it to a `dict` instance that corresponds to data from a new observation:
df4.loc[len(df4)] = {'w': 9, 'x': 99, 'y': 999, 'z': 999}
df4
w | x | y | z | |
---|---|---|---|---|
0 | 0 | 1 | 2 | 3 |
1 | 0 | 1 | 2 | 3 |
2 | 4 | 5 | 6 | 7 |
3 | 4 | 5 | 6 | 7 |
4 | 9 | 99 | 999 | 999 |
`iloc` cannot be used for this purpose as it cannot access a numeric index value that does not yet exist. Sometimes it is more convenient to provide a `list` of new values:
df4.loc[len(df4)] = [8, 88, 888, 888]
df4
w | x | y | z | |
---|---|---|---|---|
0 | 0 | 1 | 2 | 3 |
1 | 0 | 1 | 2 | 3 |
2 | 4 | 5 | 6 | 7 |
3 | 4 | 5 | 6 | 7 |
4 | 9 | 99 | 999 | 999 |
5 | 8 | 88 | 888 | 888 |
If the following `pd.DataFrame` instance is instantiated:
df5 = pd.DataFrame({'w': [2, 2, 1, 1],
'x': [3, 3, 3, 4],
'y': [5, 6, 8, 8],
'z': ['a', 'b', 'c', 'd']})
df5
w | x | y | z | |
---|---|---|---|---|
0 | 2 | 3 | 5 | a |
1 | 2 | 3 | 6 | b |
2 | 1 | 3 | 8 | c |
3 | 1 | 4 | 8 | d |
The values in a `pd.DataFrame` can be sorted using the `sort_values` method, providing a `list` of `str` instances that each correspond to the `name` of a `pd.Series` to sort by:
df5.sort_values(by=['w', 'x', 'y'], inplace=True) # mutable
df5 # modified inplace
w | x | y | z | |
---|---|---|---|---|
2 | 1 | 3 | 8 | c |
3 | 1 | 4 | 8 | d |
0 | 2 | 3 | 5 | a |
1 | 2 | 3 | 6 | b |
The `pd.DataFrame` method `sort_index` can alternatively be used to sort by the values of the `index`, which restores the original order of `df5`:
df5.sort_index() # return value
w | x | y | z | |
---|---|---|---|---|
0 | 2 | 3 | 5 | a |
1 | 2 | 3 | 6 | b |
2 | 1 | 3 | 8 | c |
3 | 1 | 4 | 8 | d |
df5 # not modified
w | x | y | z | |
---|---|---|---|---|
2 | 1 | 3 | 8 | c |
3 | 1 | 4 | 8 | d |
0 | 2 | 3 | 5 | a |
1 | 2 | 3 | 6 | b |
The `index` is often reset after sorting:
df5.reset_index(drop=True)
w | x | y | z | |
---|---|---|---|---|
0 | 1 | 3 | 8 | c |
1 | 1 | 4 | 8 | d |
2 | 2 | 3 | 5 | a |
3 | 2 | 3 | 6 | b |
If two `pd.DataFrame` instances have the same `index` but different `pd.Series`, they can be joined using the `join` method:
dfleft = pd.DataFrame({'w1': [1, 2, 3, 4],
'x1': [1, 2, 3, 4],
'y1': [1, 2, 3, 4],
'z1': [1, 2, 3, 4]})
dfright = pd.DataFrame({'w2': [5, 6, 7, 8],
'x2': [5, 6, 7, 8],
'y2': [5, 6, 7, 8]})
dfleft.join(dfright)
w1 | x1 | y1 | z1 | w2 | x2 | y2 | |
---|---|---|---|---|---|---|---|
0 | 1 | 1 | 1 | 1 | 5 | 5 | 5 |
1 | 2 | 2 | 2 | 2 | 6 | 6 | 6 |
2 | 3 | 3 | 3 | 3 | 7 | 7 | 7 |
3 | 4 | 4 | 4 | 4 | 8 | 8 | 8 |
If two `pd.DataFrame` instances have the same `pd.Series` but different observations, they can be concatenated using the `pd.concat` function:
dftop = pd.DataFrame({'w': [1, 2, 3, 4],
'x': [1, 2, 3, 4],
'y': [1, 2, 3, 4],
'z': [1, 2, 3, 4]})
dfbot = pd.DataFrame({'w': [5, 6, 7, 8],
'x': [5, 6, 7, 8],
'y': [5, 6, 7, 8],
'z': [5, 6, 7, 8]})
pd.concat([dftop, dfbot])
w | x | y | z | |
---|---|---|---|---|
0 | 1 | 1 | 1 | 1 |
1 | 2 | 2 | 2 | 2 |
2 | 3 | 3 | 3 | 3 |
3 | 4 | 4 | 4 | 4 |
0 | 5 | 5 | 5 | 5 |
1 | 6 | 6 | 6 | 6 |
2 | 7 | 7 | 7 | 7 |
3 | 8 | 8 | 8 | 8 |
This outputs a new `pd.DataFrame` instance. Note there are now two observations with the `int` value `0` on the index:
pd.concat([dftop, dfbot]).loc[0]
w | x | y | z | |
---|---|---|---|---|
0 | 1 | 1 | 1 | 1 |
0 | 5 | 5 | 5 | 5 |
To rectify this, the named parameter `ignore_index` can be assigned to the `bool` instance `True`:
pd.concat([dftop, dfbot], ignore_index=True)
w | x | y | z | |
---|---|---|---|---|
0 | 1 | 1 | 1 | 1 |
1 | 2 | 2 | 2 | 2 |
2 | 3 | 3 | 3 | 3 |
3 | 4 | 4 | 4 | 4 |
4 | 5 | 5 | 5 | 5 |
5 | 6 | 6 | 6 | 6 |
6 | 7 | 7 | 7 | 7 |
7 | 8 | 8 | 8 | 8 |
The `pd.DataFrame` method `align` is used for aligning two `pd.DataFrame` instances side by side within a `tuple`:
dfa = pd.DataFrame({'x': [1, 2],
'y': [5, 6],
'z': [1, 2]})
dfb = pd.DataFrame({'w': [5, 6, 7, 8],
'x': [5, 6, 7, 8],
'z': [5, 6, 7, 8]})
Any rows or columns found in one but not the other are filled with `NaN`:
dfa.align(dfb)
( w x y z 0 NaN 1.0 5.0 1.0 1 NaN 2.0 6.0 2.0 2 NaN NaN NaN NaN 3 NaN NaN NaN NaN, w x y z 0 5 5 NaN 5 1 6 6 NaN 6 2 7 7 NaN 7 3 8 8 NaN 8)
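The `join` parameter of `align` (default `'outer'`) controls which labels are kept; for example `'inner'` keeps only the row and column labels common to both (a quick sketch):
dfa.align(dfb, join='inner')
( x z 0 1 1 1 2 2, x z 0 5 5 1 6 6)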
Axis¶
If the following `pd.DataFrame` is instantiated:
df6 = pd.DataFrame({'w': [0, 1, None, 3],
'x': [1, 1, None, 4],
'y': [None, None, 9, 9],
'z': [3, 4, 9, 9]},
index=('a', 'b', 'c', 'd'))
df6
w | x | y | z | |
---|---|---|---|---|
a | 0.0 | 1.0 | NaN | 3 |
b | 1.0 | 1.0 | NaN | 4 |
c | NaN | NaN | 9.0 | 9 |
d | 3.0 | 4.0 | 9.0 | 9 |
Notice when `ffill` is used, by default it carries out the operation along the index of each `pd.Series`, giving consistent behaviour to the equivalent `pd.Series` method:
df6.ffill()
w | x | y | z | |
---|---|---|---|---|
a | 0.0 | 1.0 | NaN | 3 |
b | 1.0 | 1.0 | NaN | 4 |
c | 1.0 | 1.0 | 9.0 | 9 |
d | 3.0 | 4.0 | 9.0 | 9 |
Many of the `pd.DataFrame` methods have an `axis` parameter which can instead be used to specify the `axis` to perform the operation on:
dir2(pd.DataFrame, parameter='axis', exclude_identifier_list=obsolete.pandas_depreciation)
{'method': ['add', 'add_prefix', 'add_suffix', 'agg', 'aggregate', 'align', 'all', 'any', 'apply', 'at_time', 'backfill', 'between_time', 'bfill', 'clip', 'corrwith', 'count', 'cummax', 'cummin', 'cumprod', 'cumsum', 'diff', 'div', 'divide', 'drop', 'droplevel', 'dropna', 'eq', 'ewm', 'expanding', 'ffill', 'fillna', 'filter', 'floordiv', 'ge', 'groupby', 'gt', 'idxmax', 'idxmin', 'interpolate', 'kurt', 'kurtosis', 'le', 'lt', 'mask', 'max', 'mean', 'median', 'min', 'mod', 'mode', 'mul', 'multiply', 'ne', 'nunique', 'pad', 'pow', 'prod', 'product', 'quantile', 'radd', 'rank', 'rdiv', 'reindex', 'rename', 'rename_axis', 'reorder_levels', 'resample', 'rfloordiv', 'rmod', 'rmul', 'rolling', 'rpow', 'rsub', 'rtruediv', 'sample', 'sem', 'set_axis', 'shift', 'skew', 'sort_index', 'sort_values', 'squeeze', 'std', 'sub', 'subtract', 'sum', 'swaplevel', 'take', 'to_period', 'to_timestamp', 'transform', 'truediv', 'truncate', 'tz_convert', 'tz_localize', 'var', 'where', 'xs']}
df6
w | x | y | z | |
---|---|---|---|---|
a | 0.0 | 1.0 | NaN | 3 |
b | 1.0 | 1.0 | NaN | 4 |
c | NaN | NaN | 9.0 | 9 |
d | 3.0 | 4.0 | 9.0 | 9 |
The `axis` parameter can be used to instruct the method to operate along the `'index'`, which is the default behaviour:
df6.ffill(axis='index') # operates along the index
w | x | y | z | |
---|---|---|---|---|
a | 0.0 | 1.0 | NaN | 3 |
b | 1.0 | 1.0 | NaN | 4 |
c | 1.0 | 1.0 | 9.0 | 9 |
d | 3.0 | 4.0 | 9.0 | 9 |
Alternatively it can be used to instruct the method to operate along the `'columns'` (this requires that the `dtypes` of each `Series` are consistent, usually numeric):
df6.dtypes
w float64 x float64 y float64 z int64 dtype: object
df6.ffill(axis='columns') # operates along columns
w | x | y | z | |
---|---|---|---|---|
a | 0.0 | 1.0 | 1.0 | 3.0 |
b | 1.0 | 1.0 | 1.0 | 4.0 |
c | NaN | NaN | 9.0 | 9.0 |
d | 3.0 | 4.0 | 9.0 | 9.0 |
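The complementary bfill method takes the same axis parameter; a minimal sketch filling each NaN from the next valid value along the index:
df6.bfill(axis='index')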
Category Methods¶
A pd.DataFrame
instance can be created using four pd.Series
, using 'float64'
, 'category'
, 'category'
and 'int32'
as the respective dtype
:
np.random.seed(0)
w = pd.Series(np.random.random(size=10), name='w', dtype='float64')
x = pd.Series(np.random.choice(['A', 'B', 'C'], size=10), name='x', dtype='category')
y = pd.Series(np.random.choice(['a', 'b', 'c'], size=10), name='y', dtype='category')
z = pd.Series(np.random.randint(0, 10, size=10), name='z', dtype='int32')
df7 = pd.DataFrame()
df7['w'] = w
df7['x'] = x
df7['y'] = y
df7['z'] = z
df7
w | x | y | z | |
---|---|---|---|---|
0 | 0.548814 | C | a | 1 |
1 | 0.715189 | C | b | 9 |
2 | 0.602763 | A | c | 9 |
3 | 0.544883 | B | a | 0 |
4 | 0.423655 | B | c | 4 |
5 | 0.645894 | B | a | 7 |
6 | 0.437587 | B | b | 3 |
7 | 0.891773 | A | b | 2 |
8 | 0.963663 | B | c | 7 |
9 | 0.383442 | A | a | 2 |
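As an aside, an equivalent pd.DataFrame can be assembled in a single step by concatenating the four pd.Series along the columns axis (a sketch using a hypothetical name df7_alt; the column names come from each pd.Series name):
df7_alt = pd.concat([w, x, y, z], axis=1)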
The cat accessor can be used to access the as_ordered method, and the cat accessor of the pd.Series returned can then be used to access the reorder_categories method:
df7['x'] = df7['x'].cat.as_ordered().cat.reorder_categories(['C', 'B', 'A'])
df7['y'] = df7['y'].cat.as_ordered().cat.reorder_categories(['c', 'b', 'a'])
df7
w | x | y | z | |
---|---|---|---|---|
0 | 0.548814 | C | a | 1 |
1 | 0.715189 | C | b | 9 |
2 | 0.602763 | A | c | 9 |
3 | 0.544883 | B | a | 0 |
4 | 0.423655 | B | c | 4 |
5 | 0.645894 | B | a | 7 |
6 | 0.437587 | B | b | 3 |
7 | 0.891773 | A | b | 2 |
8 | 0.963663 | B | c | 7 |
9 | 0.383442 | A | a | 2 |
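The new ordering can be verified via the cat accessor (a minimal sketch):
df7['x'].cat.categories  # the category order is now ['C', 'B', 'A']
df7['x'].cat.ordered     # True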
This allows use of comparison operators with one of the pd.Series instances that has a dtype of 'category':
df7['x'] >= 'B'
0    False
1    False
2     True
3     True
4     True
5     True
6     True
7     True
8     True
9     True
Name: x, dtype: bool
df7['y'] >= 'b'
0     True
1     True
2    False
3     True
4    False
5     True
6     True
7     True
8    False
9     True
Name: y, dtype: bool
Conditions can be joined together using the & operator, the | operator or the ^ operator. Note parentheses should be used to enclose each condition:
(df7['x'] >= 'B') & (df7['y'] >= 'b')
0    False
1    False
2    False
3     True
4    False
5     True
6     True
7     True
8    False
9     True
dtype: bool
The and and or keywords are not supported and raise a ValueError.
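A minimal sketch of the failure mode:
try:
    (df7['x'] >= 'B') and (df7['y'] >= 'b')
except ValueError as error:
    print(error)  # the truth value of a Series is ambiguous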
This pd.Series
instance with dtype
of bool
can be used to index into the pd.DataFrame
instance df7
:
df7[(df7['x'] >= 'B') & (df7['y'] >= 'b')]
w | x | y | z | |
---|---|---|---|---|
3 | 0.544883 | B | a | 0 |
5 | 0.645894 | B | a | 7 |
6 | 0.437587 | B | b | 3 |
7 | 0.891773 | A | b | 2 |
9 | 0.383442 | A | a | 2 |
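The complement of a condition can be selected with the ~ operator (a minimal sketch):
df7[~((df7['x'] >= 'B') & (df7['y'] >= 'b'))]  # rows where the combined condition is False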
The pd.DataFrame method groupby can also be used to group by each category in a pd.Series with a dtype of 'category', creating a groupby object:
gbo = df7.groupby(df7['x'], observed=True)
The identifiers of this DataFrameGroupBy
can be examined:
type(gbo)
pandas.core.groupby.generic.DataFrameGroupBy
dir2(gbo, drop_internal=True)
{'attribute': ['dtypes', 'groups', 'indices', 'ndim', 'ngroups', 'w', 'x', 'y', 'z'], 'method': ['agg', 'aggregate', 'all', 'any', 'apply', 'bfill', 'boxplot', 'corr', 'corrwith', 'count', 'cov', 'cumcount', 'cummax', 'cummin', 'cumprod', 'cumsum', 'describe', 'diff', 'ewm', 'expanding', 'ffill', 'fillna', 'filter', 'first', 'get_group', 'head', 'hist', 'idxmax', 'idxmin', 'last', 'max', 'mean', 'median', 'min', 'ngroup', 'nth', 'nunique', 'ohlc', 'pct_change', 'pipe', 'plot', 'prod', 'quantile', 'rank', 'resample', 'rolling', 'sample', 'sem', 'shift', 'size', 'skew', 'std', 'sum', 'tail', 'take', 'transform', 'value_counts', 'var'], 'datamodel_attribute': ['__annotations__', '__dict__', '__doc__', '__module__', '__orig_bases__', '__parameters__', '__weakref__'], 'datamodel_method': ['__class__', '__class_getitem__', '__delattr__', '__dir__', '__eq__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__']}
A method such as count
can be used to count the number of values in each category:
gbo.count()
w | y | z | |
---|---|---|---|
x | |||
C | 2 | 2 | 2 |
B | 5 | 5 | 5 |
A | 3 | 3 | 3 |
This is typically done in one line:
df7.groupby(df7['x'], observed=True).count()
w | y | z | |
---|---|---|---|
x | |||
C | 2 | 2 | 2 |
B | 5 | 5 | 5 |
A | 3 | 3 | 3 |
Note a pd.DataFrame instance is output, with the same count duplicated for each pd.Series. If the groupby method is instead called from a pd.Series instance, a pd.Series will be returned:
df7['x'].groupby(df7['x'], observed=True).count()
x
C    2
B    5
A    3
Name: x, dtype: int64
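For a simple per-category count, the value_counts method gives a similar pd.Series directly (a sketch; note it sorts by descending count by default rather than by category order):
df7['x'].value_counts()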
When a DataFrameGroupBy
is created from a pd.DataFrame
instance and the describe
method is used, it will display statistics for each category in each numeric pd.Series
:
df7.groupby(df7['x'], observed=True).describe()
w | z | |||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | mean | std | min | 25% | 50% | 75% | max | count | mean | std | min | 25% | 50% | 75% | max | |
x | ||||||||||||||||
C | 2.0 | 0.632001 | 0.117646 | 0.548814 | 0.590407 | 0.632001 | 0.673595 | 0.715189 | 2.0 | 5.000000 | 5.656854 | 1.0 | 3.0 | 5.0 | 7.0 | 9.0 |
B | 5.0 | 0.603136 | 0.220693 | 0.423655 | 0.437587 | 0.544883 | 0.645894 | 0.963663 | 5.0 | 4.200000 | 2.949576 | 0.0 | 3.0 | 4.0 | 7.0 | 7.0 |
A | 3.0 | 0.625993 | 0.254961 | 0.383442 | 0.493102 | 0.602763 | 0.747268 | 0.891773 | 3.0 | 4.333333 | 4.041452 | 2.0 | 2.0 | 2.0 | 5.5 | 9.0 |
It is common to call the groupby method from a numeric pd.Series instance and group by another pd.Series which has a dtype of 'category':
df7['w'].groupby(df7['x'], observed=True).describe()
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
x | ||||||||
C | 2.0 | 0.632001 | 0.117646 | 0.548814 | 0.590407 | 0.632001 | 0.673595 | 0.715189 |
B | 5.0 | 0.603136 | 0.220693 | 0.423655 | 0.437587 | 0.544883 | 0.645894 | 0.963663 |
A | 3.0 | 0.625993 | 0.254961 | 0.383442 | 0.493102 | 0.602763 | 0.747268 | 0.891773 |
The same pattern is used with many of the statistical DataFrameGroupBy methods:
df7['w'].groupby(df7['x'], observed=True).mean()
x
C    0.632001
B    0.603136
A    0.625993
Name: w, dtype: float64
df7['w'].groupby(df7['x'], observed=True).std()
x
C    0.117646
B    0.220693
A    0.254961
Name: w, dtype: float64
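Several statistics can also be requested at once with the agg method, which returns a pd.DataFrame with one column per statistic (a minimal sketch):
df7['w'].groupby(df7['x'], observed=True).agg(['mean', 'std', 'count'])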