The name pandas derives from "panel data" and is also a play on "Python data analysis". It is a library built around three main data structures:
- the Index class
- the Series class
- the DataFrame class
Most Index instances are numeric, that is zero-based integers in steps of one (a RangeIndex).
The Index, similar to a tuple, list or 1darray, has a single dimension which can be represented either as a row:

index | 0 | 1 | 2 | 3 |
---|---|---|---|---|
Or as a column when convenient:
index |
---|
0 |
1 |
2 |
3 |
The Series class has a value at each index and a name; it is essentially a numpy 1darray with an associated name and Index. A Series is normally represented as a column (notice the Index associated with the Series is also displayed as a column):
index | 'x' |
---|---|
0 | 1.1 |
1 | 2.1 |
2 | 3.1 |
3 | 4.1 |
The DataFrame class is essentially a grouping of Series instances that share the same Index:
index | 'x' | 'y' |
---|---|---|
0 | 1.1 | 1.2 |
1 | 2.1 | 2.2 |
2 | 3.1 | 3.2 |
3 | 4.1 | 4.2 |
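As a minimal sketch of how these three structures relate (pandas is imported properly in the Importing Libraries section below; it is assumed here):

import pandas as pd

idx = pd.Index([0, 1, 2, 3])                               # the Index
x = pd.Series([1.1, 2.1, 3.1, 4.1], index=idx, name='x')   # a named Series
y = pd.Series([1.2, 2.2, 3.2, 4.2], index=idx, name='y')
df = pd.DataFrame({'x': x, 'y': y})                        # a DataFrame groups Series sharing an Index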
Categorize_Identifiers Module¶
This notebook will use the functions dir2, variables and view from the custom module categorize_identifiers, which is found in the same directory as this notebook file. dir2 is a variant of dir that groups identifiers into a dict under categories, variables is an IPython-based variable inspector and view is used to view a Collection in more detail:
from categorize_identifiers import dir2, variables, view
The following will be imported in order to simplify the output of dir2:
import operator
reverse_operators = ['radd', 'rmod', 'rmul', 'rpow', 'rsub', 'rtruediv']
import obsolete
Importing Libraries¶
To use the data science libraries they need to be imported:
import numpy as np
import pandas as pd
Once imported the identifiers can be viewed:
dir2(pd, exclude_external_modules=True, drop_internal=True, exclude_identifier_list=obsolete.pandas_depreciation)
{'attribute': ['annotations', 'options'], 'constant': ['IndexSlice', 'NA', 'NaT'], 'module': ['api', 'arrays', 'compat', 'core', 'errors', 'io', 'offsets', 'pandas', 'plotting', 'testing', 'tseries', 'util'], 'method': ['array', 'bdate_range', 'concat', 'crosstab', 'cut', 'date_range', 'describe_option', 'eval', 'factorize', 'from_dummies', 'get_dummies', 'get_option', 'infer_freq', 'interval_range', 'isna', 'isnull', 'json_normalize', 'lreshape', 'melt', 'merge', 'merge_asof', 'merge_ordered', 'notna', 'notnull', 'period_range', 'pivot', 'pivot_table', 'qcut', 'read_clipboard', 'read_csv', 'read_excel', 'read_feather', 'read_fwf', 'read_hdf', 'read_html', 'read_json', 'read_orc', 'read_parquet', 'read_pickle', 'read_sas', 'read_spss', 'read_sql', 'read_sql_query', 'read_sql_table', 'read_stata', 'read_table', 'read_xml', 'reset_option', 'set_eng_float_format', 'set_option', 'show_versions', 'test', 'timedelta_range', 'to_datetime', 'to_numeric', 'to_pickle', 'to_timedelta', 'unique', 'wide_to_long'], 'lower_class': ['option_context'], 'upper_class': ['ArrowDtype', 'BooleanDtype', 'Categorical', 'CategoricalDtype', 'CategoricalIndex', 'DataFrame', 'DateOffset', 'DatetimeIndex', 'DatetimeTZDtype', 'ExcelFile', 'ExcelWriter', 'Flags', 'Float32Dtype', 'Float64Dtype', 'Grouper', 'HDFStore', 'Index', 'Int16Dtype', 'Int32Dtype', 'Int64Dtype', 'Int8Dtype', 'Interval', 'IntervalDtype', 'IntervalIndex', 'MultiIndex', 'NamedAgg', 'Period', 'PeriodDtype', 'PeriodIndex', 'RangeIndex', 'Series', 'SparseDtype', 'StringDtype', 'Timedelta', 'TimedeltaIndex', 'Timestamp', 'UInt16Dtype', 'UInt32Dtype', 'UInt64Dtype', 'UInt8Dtype'], 'datamodel_attribute': ['__all__', '__builtins__', '__cached__', '__doc__', '__docformat__', '__file__', '__git_version__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__version__']}
In general pd uses object-oriented programming (OOP) as opposed to functional programming. Most of the functions within pd are used to read in data from a file and output a pd.DataFrame instance.
The classes are all in PascalCase. The main classes are:
- pd.Index
- pd.Series
- pd.DataFrame
The OOP approach means methods are normally called from instances to analyse or manipulate the data the instance holds. The identifiers for the three main classes can be examined; the relationship between the identifiers of these classes will be examined first, before each class is examined in detail later.
The pd.Series can be conceptualised as a np.ndarray with a name and therefore has consistent attributes, methods and numerical datamodel methods:
dir2(pd.Series, np.ndarray, consistent_only=True, exclude_external_modules=True, drop_internal=True, exclude_identifier_list=obsolete.series_depreciation)
{'attribute': ['dtype', 'flags', 'nbytes', 'ndim', 'shape', 'size'], 'constant': ['T'], 'method': ['all', 'any', 'argmax', 'argmin', 'argsort', 'astype', 'clip', 'copy', 'cumprod', 'cumsum', 'dot', 'item', 'max', 'mean', 'min', 'prod', 'repeat', 'round', 'searchsorted', 'squeeze', 'std', 'sum', 'take', 'tolist', 'transpose', 'var'], 'datamodel_attribute': ['__array_priority__', '__doc__', '__hash__'], 'datamodel_method': ['__abs__', '__add__', '__and__', '__array__', '__array_ufunc__', '__bool__', '__class__', '__contains__', '__copy__', '__deepcopy__', '__delattr__', '__delitem__', '__dir__', '__divmod__', '__eq__', '__float__', '__floordiv__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__iadd__', '__iand__', '__ifloordiv__', '__imod__', '__imul__', '__init__', '__init_subclass__', '__int__', '__invert__', '__ior__', '__ipow__', '__isub__', '__iter__', '__itruediv__', '__ixor__', '__le__', '__len__', '__lt__', '__matmul__', '__mod__', '__mul__', '__ne__', '__neg__', '__new__', '__or__', '__pos__', '__pow__', '__radd__', '__rand__', '__rdivmod__', '__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__', '__rmatmul__', '__rmod__', '__rmul__', '__ror__', '__rpow__', '__rsub__', '__rtruediv__', '__rxor__', '__setattr__', '__setitem__', '__setstate__', '__sizeof__', '__str__', '__sub__', '__subclasshook__', '__truediv__', '__xor__']}
A number of the __builtins__ functions are also available as pd.Series methods:
dir2(pd.Series, __builtins__, consistent_only=True, exclude_identifier_list=obsolete.series_depreciation)
{'method': ['abs', 'all', 'any', 'divmod', 'filter', 'map', 'max', 'min', 'pow', 'round', 'sum'], 'lower_class': ['list', 'str'], 'datamodel_attribute': ['__doc__']}
Many of the datamodel operators defined for numeric operations are also available as methods:
dir2(pd.Series, operator, consistent_only=True)
{'attribute': ['index'], 'method': ['abs', 'add', 'eq', 'floordiv', 'ge', 'gt', 'le', 'lt', 'mod', 'mul', 'ne', 'pow', 'sub', 'truediv'], 'datamodel_attribute': ['__doc__'], 'datamodel_method': ['__abs__', '__add__', '__and__', '__contains__', '__delitem__', '__eq__', '__floordiv__', '__ge__', '__getitem__', '__gt__', '__iadd__', '__iand__', '__ifloordiv__', '__imod__', '__imul__', '__invert__', '__ior__', '__ipow__', '__isub__', '__itruediv__', '__ixor__', '__le__', '__lt__', '__matmul__', '__mod__', '__mul__', '__ne__', '__neg__', '__or__', '__pos__', '__pow__', '__setitem__', '__sub__', '__truediv__', '__xor__']}
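These method forms behave like the corresponding operators but accept additional arguments; a small sketch with throwaway Series:

s1 = pd.Series([1.0, 2.0, 3.0])
s2 = pd.Series([10.0, 20.0])
s1 + s2                      # indexes are aligned; position 2 has no pair so becomes NaN
s1.add(s2, fill_value=0.0)   # the method form can fill the missing value before adding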
The pd.Series has additional attributes such as:

- values which returns the underlying values, normally as a np.ndarray (1darray) when numeric.
- name which is a str instance corresponding to the Series name.
- index which returns the pd.Index instance.
- at, iat, loc and iloc which are used for additional indexing purposes.
- attrs which is used to store optional metadata.
- The following attributes, which return a bool of True when (see the sketch after this list):
  - hasnans if any element is NaN or None
  - empty if no values are supplied
  - is_unique if every value is unique
  - is_monotonic_increasing if every value >= the previous value
  - is_monotonic_decreasing if every value <= the previous value
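A minimal sketch of these bool attributes (throwaway values chosen for illustration):

import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 2.0, np.nan])
s.hasnans                                       # True, the last element is NaN
s.is_unique                                     # False, 2.0 appears twice
pd.Series([1, 2, 3]).is_monotonic_increasing    # True, each value >= the previous
pd.Series([3, 2, 1]).is_monotonic_decreasing    # True, each value <= the previous
pd.Series(dtype=float).empty                    # True, no values supplied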
The pd.Series has additional methods. Many of these are statistical methods with names similar to functions found in the statistics module. There are also a large number of to_* methods which convert the pd.Series to another datatype or file format.
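For example (a throwaway Series for illustration):

s = pd.Series([1.1, 2.1, 3.1], name='x')
s.to_list()    # [1.1, 2.1, 3.1], a builtins list
s.to_numpy()   # array([1.1, 2.1, 3.1]), a np.ndarray
s.to_frame()   # a single-column pd.DataFrame named 'x'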
There are also lowercase classes:

- str which groups together str identifiers
- cat which groups together category identifiers
- dt which groups together datetime (datetime.datetime or datetime.timedelta) identifiers
- plot which groups together pyplot based functions which can be used on the values
- list which groups together list[pyarrow] identifiers (pyarrow features are still experimental)

The lowercase classes are appropriate when the pd.Series datatype matches one of the class types above.
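For example the str lowercase class is available when the Series holds str instances:

s = pd.Series(['the fat black cat', 'sat on the mat'])
s.str.upper()         # applies str.upper elementwise
s.str.contains('cat') # elementwise boolean check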
dir2(pd.Series, [np.ndarray, operator], unique_only=True,
exclude_external_modules=True, exclude_identifier_list=reverse_operators+obsolete.series_depreciation, drop_internal=True)
{'attribute': ['array', 'at', 'attrs', 'axes', 'dtypes', 'empty', 'hasnans', 'iat', 'iloc', 'is_monotonic_decreasing', 'is_monotonic_increasing', 'is_unique', 'loc', 'name', 'values'], 'method': ['add_prefix', 'add_suffix', 'agg', 'aggregate', 'align', 'apply', 'asfreq', 'asof', 'at_time', 'autocorr', 'between', 'between_time', 'bfill', 'case_when', 'combine', 'combine_first', 'compare', 'convert_dtypes', 'corr', 'count', 'cov', 'cummax', 'cummin', 'describe', 'diff', 'div', 'divide', 'divmod', 'drop', 'drop_duplicates', 'droplevel', 'dropna', 'duplicated', 'equals', 'ewm', 'expanding', 'explode', 'factorize', 'ffill', 'fillna', 'filter', 'first', 'first_valid_index', 'get', 'groupby', 'head', 'hist', 'idxmax', 'idxmin', 'infer_objects', 'info', 'interpolate', 'isin', 'isna', 'isnull', 'items', 'keys', 'kurt', 'kurtosis', 'last', 'last_valid_index', 'map', 'mask', 'median', 'memory_usage', 'mode', 'multiply', 'nlargest', 'notna', 'notnull', 'nsmallest', 'nunique', 'pad', 'pct_change', 'pipe', 'pop', 'product', 'quantile', 'rank', 'rdiv', 'rdivmod', 'reindex', 'reindex_like', 'rename', 'rename_axis', 'reorder_levels', 'replace', 'resample', 'reset_index', 'rfloordiv', 'rolling', 'sample', 'sem', 'set_axis', 'set_flags', 'shift', 'skew', 'sort_index', 'sort_values', 'subtract', 'swaplevel', 'tail', 'to_clipboard', 'to_csv', 'to_dict', 'to_excel', 'to_frame', 'to_hdf', 'to_json', 'to_latex', 'to_list', 'to_markdown', 'to_numpy', 'to_period', 'to_pickle', 'to_sql', 'to_string', 'to_timestamp', 'to_xarray', 'transform', 'truncate', 'tz_convert', 'tz_localize', 'unique', 'unstack', 'update', 'value_counts', 'where', 'xs'], 'lower_class': ['cat', 'dt', 'list', 'plot', 'sparse', 'str', 'struct'], 'datamodel_attribute': ['__annotations__', '__dict__', '__module__', '__pandas_priority__', '__weakref__'], 'datamodel_method': ['__column_consortium_standard__', '__finalize__', '__getattr__', '__nonzero__', '__round__']}
A pd.DataFrame is essentially a grouping of pd.Series instances and therefore has many identifiers in common.

The attributes dtypes and values are used as plural equivalents to dtype and array seen in the pd.Series. dtypes reflects the fact that each pd.Series in the pd.DataFrame instance can have a different datatype. However when every pd.Series is numeric, values is a 2d ndarray instance. Most of the statistical methods will also only work when every pd.Series in the pd.DataFrame instance is numeric; it is more common to use statistical functions on a pd.Series.
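A short sketch (columns chosen for illustration):

df = pd.DataFrame({'x': [1.1, 2.1], 'y': [1, 2]})
df.dtypes        # one dtype per Series: x is float64, y is int64
df.values        # a 2d np.ndarray; the mixed dtypes are upcast to float64
df['x'].mean()   # statistical methods are more commonly used on a single Series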
dir2(pd.DataFrame, pd.Series, consistent_only=True, drop_internal=True, exclude_identifier_list=obsolete.df_depreciation)
{'attribute': ['at', 'attrs', 'axes', 'dtypes', 'empty', 'flags', 'iat', 'iloc', 'index', 'loc', 'ndim', 'shape', 'size', 'values'], 'constant': ['T'], 'method': ['abs', 'add', 'add_prefix', 'add_suffix', 'agg', 'aggregate', 'align', 'all', 'any', 'apply', 'asfreq', 'asof', 'astype', 'at_time', 'backfill', 'between_time', 'bfill', 'clip', 'combine', 'combine_first', 'compare', 'convert_dtypes', 'copy', 'corr', 'count', 'cov', 'cummax', 'cummin', 'cumprod', 'cumsum', 'describe', 'diff', 'div', 'divide', 'dot', 'drop', 'drop_duplicates', 'droplevel', 'dropna', 'duplicated', 'eq', 'equals', 'ewm', 'expanding', 'explode', 'ffill', 'fillna', 'filter', 'first_valid_index', 'floordiv', 'ge', 'get', 'groupby', 'gt', 'head', 'hist', 'idxmax', 'idxmin', 'infer_objects', 'info', 'interpolate', 'isin', 'isna', 'isnull', 'items', 'keys', 'kurt', 'kurtosis', 'last_valid_index', 'le', 'lt', 'map', 'mask', 'max', 'mean', 'median', 'memory_usage', 'min', 'mod', 'mode', 'mul', 'multiply', 'ne', 'nlargest', 'notna', 'notnull', 'nsmallest', 'nunique', 'pad', 'pct_change', 'pipe', 'pop', 'pow', 'prod', 'product', 'quantile', 'radd', 'rank', 'rdiv', 'reindex', 'reindex_like', 'rename', 'rename_axis', 'reorder_levels', 'replace', 'resample', 'reset_index', 'rfloordiv', 'rmod', 'rmul', 'rolling', 'round', 'rpow', 'rsub', 'rtruediv', 'sample', 'sem', 'set_axis', 'set_flags', 'shift', 'skew', 'sort_index', 'sort_values', 'squeeze', 'std', 'sub', 'subtract', 'sum', 'swaplevel', 'tail', 'take', 'to_clipboard', 'to_csv', 'to_dict', 'to_excel', 'to_hdf', 'to_json', 'to_latex', 'to_markdown', 'to_numpy', 'to_period', 'to_pickle', 'to_sql', 'to_string', 'to_timestamp', 'to_xarray', 'transform', 'transpose', 'truediv', 'truncate', 'tz_convert', 'tz_localize', 'unstack', 'update', 'value_counts', 'var', 'where', 'xs'], 'lower_class': ['plot', 'sparse'], 'datamodel_attribute': ['__annotations__', '__array_priority__', '__dict__', '__doc__', '__hash__', '__module__', '__pandas_priority__', '__weakref__'], 'datamodel_method': ['__abs__', '__add__', '__and__', '__array__', '__array_ufunc__', '__bool__', '__class__', '__contains__', '__copy__', '__deepcopy__', '__delattr__', '__delitem__', '__dir__', '__divmod__', '__eq__', '__finalize__', '__floordiv__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__iadd__', '__iand__', '__ifloordiv__', '__imod__', '__imul__', '__init__', '__init_subclass__', '__invert__', '__ior__', '__ipow__', '__isub__', '__iter__', '__itruediv__', '__ixor__', '__le__', '__len__', '__lt__', '__matmul__', '__mod__', '__mul__', '__ne__', '__neg__', '__new__', '__nonzero__', '__or__', '__pos__', '__pow__', '__radd__', '__rand__', '__rdivmod__', '__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__', '__rmatmul__', '__rmod__', '__rmul__', '__ror__', '__round__', '__rpow__', '__rsub__', '__rtruediv__', '__rxor__', '__setattr__', '__setitem__', '__setstate__', '__sizeof__', '__str__', '__sub__', '__subclasshook__', '__truediv__', '__xor__']}
The pd.DataFrame class only has a small number of additional identifiers, such as the attribute columns which holds the column labels, each typically a str instance corresponding to the name of a pd.Series available in the pd.DataFrame. The pd.DataFrame instance is mutable and the insert method can be used to insert another pd.Series instance. join and merge are used to join and merge pd.DataFrame instances (see the sketch below).

The style attribute is an additional attribute which can be used to control the style of the pd.DataFrame instance when displayed in a cell output.
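A small sketch of insert and join (throwaway frames for illustration):

df = pd.DataFrame({'x': [1.1, 2.1]})
df.insert(loc=1, column='y', value=[1.2, 2.2])   # mutates df in place, adding column 'y'
other = pd.DataFrame({'z': [1.3, 2.3]})
df.join(other)                                   # returns a new DataFrame joined on the index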
dir2(pd.DataFrame, pd.Series, unique_only=True, drop_internal=True, exclude_identifier_list=obsolete.df_depreciation)
{'attribute': ['columns', 'style'], 'method': ['assign', 'boxplot', 'corrwith', 'eval', 'from_dict', 'from_records', 'insert', 'isetitem', 'iterrows', 'itertuples', 'join', 'melt', 'merge', 'pivot', 'pivot_table', 'query', 'select_dtypes', 'set_index', 'stack', 'to_feather', 'to_html', 'to_orc', 'to_parquet', 'to_records', 'to_stata', 'to_xml'], 'datamodel_method': ['__arrow_c_stream__', '__dataframe__', '__dataframe_consortium_standard__']}
The identifiers unique to a pd.Series instance are singular identifiers, typically only meaningful for a single column of data. These include the lowercase classes used to work with a specific datatype across a pd.Series:
dir2(pd.Series, pd.DataFrame, unique_only=True, drop_internal=True, exclude_identifier_list=obsolete.series_depreciation)
{'attribute': ['array', 'dtype', 'hasnans', 'is_monotonic_decreasing', 'is_monotonic_increasing', 'is_unique', 'name', 'nbytes'], 'method': ['argmax', 'argmin', 'argsort', 'autocorr', 'between', 'case_when', 'divmod', 'factorize', 'item', 'rdivmod', 'repeat', 'searchsorted', 'to_frame', 'to_list', 'tolist', 'unique'], 'lower_class': ['cat', 'dt', 'list', 'str', 'struct'], 'datamodel_method': ['__column_consortium_standard__', '__float__', '__int__']}
Each pd.Series is associated with a pd.Index and every pd.Series in a pd.DataFrame uses the same pd.Index. Many identifiers of the pd.Index can be seen to be consistent with a pd.Series as it is also based on an np.ndarray:
dir2(pd.Index, pd.Series, consistent_only=True, exclude_external_modules=True, drop_internal=True, exclude_identifier_list=obsolete.index_depreciation)
{'attribute': ['array', 'dtype', 'empty', 'hasnans', 'is_monotonic_decreasing', 'is_monotonic_increasing', 'is_unique', 'name', 'nbytes', 'ndim', 'shape', 'size', 'values'], 'constant': ['T'], 'method': ['all', 'any', 'argmax', 'argmin', 'argsort', 'asof', 'astype', 'copy', 'diff', 'drop', 'drop_duplicates', 'droplevel', 'dropna', 'duplicated', 'equals', 'factorize', 'fillna', 'groupby', 'infer_objects', 'isin', 'isna', 'isnull', 'item', 'map', 'max', 'memory_usage', 'min', 'notna', 'notnull', 'nunique', 'ravel', 'reindex', 'rename', 'repeat', 'round', 'searchsorted', 'shift', 'sort_values', 'take', 'to_frame', 'to_list', 'to_numpy', 'tolist', 'transpose', 'unique', 'value_counts', 'view', 'where'], 'lower_class': ['str'], 'datamodel_attribute': ['__annotations__', '__array_priority__', '__dict__', '__doc__', '__hash__', '__module__', '__pandas_priority__', '__weakref__'], 'datamodel_method': ['__abs__', '__add__', '__and__', '__array__', '__array_ufunc__', '__bool__', '__class__', '__contains__', '__copy__', '__deepcopy__', '__delattr__', '__dir__', '__divmod__', '__eq__', '__floordiv__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__iadd__', '__init__', '__init_subclass__', '__invert__', '__iter__', '__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__', '__neg__', '__new__', '__nonzero__', '__or__', '__pos__', '__pow__', '__radd__', '__rand__', '__rdivmod__', '__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__', '__rmod__', '__rmul__', '__ror__', '__rpow__', '__rsub__', '__rtruediv__', '__rxor__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__sub__', '__subclasshook__', '__truediv__', '__xor__']}
The pd.Index class only has a small number of identifiers not found in a pd.Series instance. The pd.Index is typically numeric, but an Index (and each level of a multilevel index) can be assigned a name, and the attribute names retrieves these. Typically each index value should be unique for the purpose of indexing; the attribute has_duplicates is used to check for duplicate values and the pd.Series or pd.DataFrame method reset_index can be used in this case to return a unique index of numeric values. More complicated indexes can be multilevel and the nlevels attribute gives the number of levels in the index:
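A short sketch of these (an index with duplicates, for illustration):

s = pd.Series([1.1, 2.1, 3.1], index=pd.Index(['a', 'a', 'b'], name='letter'))
s.index.names            # FrozenList(['letter'])
s.index.has_duplicates   # True, 'a' appears twice
s.index.nlevels          # 1, a flat single-level index
s.reset_index()          # DataFrame with a fresh RangeIndex; the old index becomes a column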
dir2(pd.Index, pd.Series, unique_only=True, exclude_external_modules=True, drop_internal=True, exclude_identifier_list=obsolete.index_depreciation)
{'attribute': ['has_duplicates', 'inferred_type', 'names', 'nlevels'], 'method': ['append', 'asof_locs', 'delete', 'difference', 'get_indexer', 'get_indexer_for', 'get_indexer_non_unique', 'get_level_values', 'get_loc', 'get_slice_bound', 'identical', 'insert', 'intersection', 'is_', 'join', 'putmask', 'set_names', 'slice_indexer', 'slice_locs', 'sort', 'sortlevel', 'symmetric_difference', 'to_flat_index', 'to_series', 'union'], 'datamodel_method': ['__array_wrap__']}
Series¶
The initialisation signature for the pd.Series class can be examined:
pd.Series?
Init signature: pd.Series( data=None, index=None, dtype: 'Dtype | None' = None, name=None, copy: 'bool | None' = None, fastpath: 'bool | lib.NoDefault' = <no_default>, ) -> 'None' Docstring: One-dimensional ndarray with axis labels (including time series). Labels need not be unique but must be a hashable type. The object supports both integer- and label-based indexing and provides a host of methods for performing operations involving the index. Statistical methods from ndarray have been overridden to automatically exclude missing data (currently represented as NaN). Operations between Series (+, -, /, \*, \*\*) align values based on their associated index values-- they need not be the same length. The result index will be the sorted union of the two indexes. Parameters ---------- data : array-like, Iterable, dict, or scalar value Contains data stored in Series. If data is a dict, argument order is maintained. index : array-like or Index (1d) Values must be hashable and have the same length as `data`. Non-unique index values are allowed. Will default to RangeIndex (0, 1, 2, ..., n) if not provided. If data is dict-like and index is None, then the keys in the data are used as the index. If the index is not None, the resulting Series is reindexed with the index values. dtype : str, numpy.dtype, or ExtensionDtype, optional Data type for the output Series. If not specified, this will be inferred from `data`. See the :ref:`user guide <basics.dtypes>` for more usages. name : Hashable, default None The name to give to the Series. copy : bool, default False Copy input data. Only affects Series or 1d ndarray input. See examples. Notes ----- Please reference the :ref:`User Guide <basics.series>` for more information. Examples -------- Constructing Series from a dictionary with an Index specified >>> d = {'a': 1, 'b': 2, 'c': 3} >>> ser = pd.Series(data=d, index=['a', 'b', 'c']) >>> ser a 1 b 2 c 3 dtype: int64 The keys of the dictionary match with the Index values, hence the Index values have no effect. >>> d = {'a': 1, 'b': 2, 'c': 3} >>> ser = pd.Series(data=d, index=['x', 'y', 'z']) >>> ser x NaN y NaN z NaN dtype: float64 Note that the Index is first build with the keys from the dictionary. After this the Series is reindexed with the given Index values, hence we get all NaN as a result. Constructing Series from a list with `copy=False`. >>> r = [1, 2] >>> ser = pd.Series(r, copy=False) >>> ser.iloc[0] = 999 >>> r [1, 2] >>> ser 0 999 1 2 dtype: int64 Due to input data type the Series has a `copy` of the original data even though `copy=False`, so the data is unchanged. Constructing Series from a 1d ndarray with `copy=False`. >>> r = np.array([1, 2]) >>> ser = pd.Series(r, copy=False) >>> ser.iloc[0] = 999 >>> r array([999, 2]) >>> ser 0 999 1 2 dtype: int64 Due to input data type the Series has a `view` on the original data, so the data is changed as well. File: c:\users\phili\anaconda3\envs\vscode-env\lib\site-packages\pandas\core\series.py Type: type Subclasses: SubclassedSeries
The main keyword input arguments are:
- data
- index
- dtype
- name
If these are not supplied, an empty Series instance with an empty index, no name and the generic object datatype is instantiated:
pd.Series()
Series([], dtype: object)
Normally data is supplied in the form of a numpy 1darray:
pd.Series(data=np.array([1, 2, 3]))
0 1 1 2 2 3 dtype: int32
Since an ndarray itself can be initialised from a list, this can be abbreviated to:
pd.Series(data=[1, 2, 3])
0 1 1 2 2 3 dtype: int64
When dtype=None, the data type will be inferred from the data:
pd.Series(data=[1., 2., 3.])
0 1.0 1 2.0 2 3.0 dtype: float64
from datetime import datetime, timedelta
pd.Series(data=[datetime.now(),
datetime.now() + timedelta(days=1),
datetime.now() + timedelta(days=2)])
0 2024-02-15 01:23:45.711586 1 2024-02-16 01:23:45.711586 2 2024-02-17 01:23:45.711586 dtype: datetime64[ns]
pd.Series(data=['a', 'b', 'c'])
0 a 1 b 2 c dtype: object
Anything with a str in it is classed as non-numeric and is given the generic dtype object (meaning each element can be any Python object).
The dtype can be manually overridden when supplying the numpy 1darray by using the np.array input argument dtype:
pd.Series(data=np.array([1., 2., 3.], dtype=np.int32))
0 1 1 2 2 3 dtype: int32
Or alternatively by using the Series keyword input argument dtype:
pd.Series(data=[1., 2., 3.], dtype=np.int32)
0 1 1 2 2 3 dtype: int32
Notice that the index is zero-based numeric in integer steps of 1 by default (a RangeIndex). This can be manually changed by use of the keyword input argument index, providing an Index instance, ndarray instance or list instance of index values:
pd.Series(index=['a', 'b', 'c'], data=[1., 2., 3.], dtype=np.int32)
a 1 b 2 c 3 dtype: int32
A Series usually also has a name:
pd.Series(index=['a', 'b', 'c'], data=[1., 2., 3.], dtype=np.int32, name='x')
a 1 b 2 c 3 Name: x, dtype: int32
Normally the data and name are supplied and the index and dtype are inferred:
pd.Series([1.1, 2.1, 3.1, 4.1], name='x')
0 1.1 1 2.1 2 3.1 3 4.1 Name: x, dtype: float64
DataFrame¶
The initialisation signature for the pd.DataFrame class can be examined:
pd.DataFrame?
Init signature: pd.DataFrame( data=None, index: 'Axes | None' = None, columns: 'Axes | None' = None, dtype: 'Dtype | None' = None, copy: 'bool | None' = None, ) -> 'None' Docstring: Two-dimensional, size-mutable, potentially heterogeneous tabular data. Data structure also contains labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure. Parameters ---------- data : ndarray (structured or homogeneous), Iterable, dict, or DataFrame Dict can contain Series, arrays, constants, dataclass or list-like objects. If data is a dict, column order follows insertion-order. If a dict contains Series which have an index defined, it is aligned by its index. This alignment also occurs if data is a Series or a DataFrame itself. Alignment is done on Series/DataFrame inputs. If data is a list of dicts, column order follows insertion-order. index : Index or array-like Index to use for resulting frame. Will default to RangeIndex if no indexing information part of input data and no index provided. columns : Index or array-like Column labels to use for resulting frame when data does not have them, defaulting to RangeIndex(0, 1, 2, ..., n). If data contains column labels, will perform column selection instead. dtype : dtype, default None Data type to force. Only a single dtype is allowed. If None, infer. copy : bool or None, default None Copy data from inputs. For dict data, the default of None behaves like ``copy=True``. For DataFrame or 2d ndarray input, the default of None behaves like ``copy=False``. If data is a dict containing one or more Series (possibly of different dtypes), ``copy=False`` will ensure that these inputs are not copied. .. versionchanged:: 1.3.0 See Also -------- DataFrame.from_records : Constructor from tuples, also record arrays. DataFrame.from_dict : From dicts of Series, arrays, or dicts. read_csv : Read a comma-separated values (csv) file into DataFrame. read_table : Read general delimited file into DataFrame. read_clipboard : Read text from clipboard into DataFrame. Notes ----- Please reference the :ref:`User Guide <basics.dataframe>` for more information. Examples -------- Constructing DataFrame from a dictionary. >>> d = {'col1': [1, 2], 'col2': [3, 4]} >>> df = pd.DataFrame(data=d) >>> df col1 col2 0 1 3 1 2 4 Notice that the inferred dtype is int64. >>> df.dtypes col1 int64 col2 int64 dtype: object To enforce a single dtype: >>> df = pd.DataFrame(data=d, dtype=np.int8) >>> df.dtypes col1 int8 col2 int8 dtype: object Constructing DataFrame from a dictionary including Series: >>> d = {'col1': [0, 1, 2, 3], 'col2': pd.Series([2, 3], index=[2, 3])} >>> pd.DataFrame(data=d, index=[0, 1, 2, 3]) col1 col2 0 0 NaN 1 1 NaN 2 2 2.0 3 3 3.0 Constructing DataFrame from numpy ndarray: >>> df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), ... columns=['a', 'b', 'c']) >>> df2 a b c 0 1 2 3 1 4 5 6 2 7 8 9 Constructing DataFrame from a numpy ndarray that has labeled columns: >>> data = np.array([(1, 2, 3), (4, 5, 6), (7, 8, 9)], ... dtype=[("a", "i4"), ("b", "i4"), ("c", "i4")]) >>> df3 = pd.DataFrame(data, columns=['c', 'a']) ... 
>>> df3 c a 0 3 1 1 6 4 2 9 7 Constructing DataFrame from dataclass: >>> from dataclasses import make_dataclass >>> Point = make_dataclass("Point", [("x", int), ("y", int)]) >>> pd.DataFrame([Point(0, 0), Point(0, 3), Point(2, 3)]) x y 0 0 0 1 0 3 2 2 3 Constructing DataFrame from Series/DataFrame: >>> ser = pd.Series([1, 2, 3], index=["a", "b", "c"]) >>> df = pd.DataFrame(data=ser, index=["a", "c"]) >>> df 0 a 1 c 3 >>> df1 = pd.DataFrame([1, 2, 3], index=["a", "b", "c"], columns=["x"]) >>> df2 = pd.DataFrame(data=df1, index=["a", "c"]) >>> df2 x a 1 c 3 File: c:\users\phili\anaconda3\envs\vscode-env\lib\site-packages\pandas\core\frame.py Type: type Subclasses: SubclassedDataFrame
The keyword input arguments for a DataFrame instance are similar to those found for a Series instance, however because a DataFrame is a collection of Series most of these are plural:

- data (plural)
- index (singular)
- columns (plural of name)
- dtype (singular: a single dtype applied to every column)
pd.DataFrame(data=[[1.1, 1.2],
[2.1, 2.2],
[3.1, 3.2],
[4.1, 4.2]],
index=['a', 'b', 'c', 'd'],
columns=('x', 'y'),
dtype=np.float64)
x | y | |
---|---|---|
a | 1.1 | 1.2 |
b | 2.1 | 2.2 |
c | 3.1 | 3.2 |
d | 4.1 | 4.2 |
Per the initialisation signature, only a single dtype is allowed and it is applied to every Series instance in the DataFrame instance; a collection of different dtypes cannot be supplied here. Per-column datatypes are instead set after construction, for example with the astype method.
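A short sketch of setting per-column dtypes after construction:

df = pd.DataFrame(data=[[1.1, 1.2], [2.1, 2.2]], columns=('x', 'y'))
df.astype({'x': np.int32, 'y': np.float64}).dtypes   # a dict maps each column name to a dtype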
Once again, normally the dtype and index are inferred:
pd.DataFrame(data=[[1.1, 1.2],
[2.1, 2.2],
[3.1, 3.2],
[4.1, 4.2]],
columns=('x', 'y'))
x | y | |
---|---|---|
0 | 1.1 | 1.2 |
1 | 2.1 | 2.2 |
2 | 3.1 | 3.2 |
3 | 4.1 | 4.2 |
It is common to supply columns and data in the form of a mapping. Each key should be a str instance which will become the column name and each value should be a np.ndarray (1d) or list instance which corresponds to the data for that pd.Series:
pd.DataFrame({'x': np.array([1.1, 2.1, 3.1, 3.1]),
'y': np.array([1.2, 2.2, 3.2, 4.2])})
x | y | |
---|---|---|
0 | 1.1 | 1.2 |
1 | 2.1 | 2.2 |
2 | 3.1 | 3.2 |
3 | 3.1 | 4.2 |
Reading Data from Files¶
The Series and DataFrame instances previously examined were created from scratch using builtins datatypes. It is also common to read data in from another source. The pandas library therefore has a number of functions for reading in data from external files. The function names all have a read_ prefix followed by the file type:
for identifier in dir(pd):
if identifier.startswith('read_'):
print(identifier)
read_clipboard read_csv read_excel read_feather read_fwf read_gbq read_hdf read_html read_json read_orc read_parquet read_pickle read_sas read_spss read_sql read_sql_query read_sql_table read_stata read_table read_xml
Some of the more common formats will be explored.
Comma Separated Values File¶
CSV is an abbreviation for Comma Separated Values.
The file format uses:

- a comma , as the column separator, which is where the name comma separated values comes from
- a newline \n as the row separator
A csv with escape characters explicitly shown looks as follows:
string,integer,bool,float,date,time,category\n
the fat black cat,4,TRUE,0.86,24/07/2023,11:36:00,A\n
sat on the mat,4,TRUE,0.86,25/07/2023,12:36:00,A\n
"twinkle, twinkle",2,TRUE,-1.14,26/07/2023,13:36:00,B\n
little star,2,TRUE,-1.14,27/07/2023,14:36:00,B\n
how I wonder,3,FALSE,-0.14,28/07/2023,15:36:00,B\n
what you are,4,TRUE,0.86,29/07/2023,16:36:00,B\n
This can be written to a csv file:
%%writefile ./files/Book1.csv
string,integer,bool,float,date,time,category
the fat black cat,4,TRUE,0.86,24/07/2023,11:36:00,A
sat on the mat,4,TRUE,0.86,25/07/2023,12:36:00,A
"twinkle, twinkle",2,TRUE,-1.14,26/07/2023,13:36:00,B
little star,2,TRUE,-1.14,27/07/2023,14:36:00,B
how I wonder,3,FALSE,-0.14,28/07/2023,15:36:00,B
what you are,4,TRUE,0.86,29/07/2023,16:36:00,B
Overwriting ./files/Book1.csv
When opened in a program such as Microsoft Excel, these delimiters display as a grid.
Notice that the comma in twinkle, twinkle is not a delimiter but part of the str. For this reason "twinkle, twinkle" is enclosed in quotation marks.
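A small sketch of this quoting behaviour, reading csv text directly from an io.StringIO buffer rather than a file:

import io
csv_text = 'quote,n\n"twinkle, twinkle",2\nlittle star,2\n'
pd.read_csv(io.StringIO(csv_text))   # the quoted comma stays inside the first field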
The pd.read_csv function is used to read in a csv file, returning a DataFrame instance:
pd.read_csv?
Signature: pd.read_csv( filepath_or_buffer: 'FilePath | ReadCsvBuffer[bytes] | ReadCsvBuffer[str]', *, sep: 'str | None | lib.NoDefault' = <no_default>, delimiter: 'str | None | lib.NoDefault' = None, header: "int | Sequence[int] | None | Literal['infer']" = 'infer', names: 'Sequence[Hashable] | None | lib.NoDefault' = <no_default>, index_col: 'IndexLabel | Literal[False] | None' = None, usecols: 'UsecolsArgType' = None, dtype: 'DtypeArg | None' = None, engine: 'CSVEngine | None' = None, converters: 'Mapping[Hashable, Callable] | None' = None, true_values: 'list | None' = None, false_values: 'list | None' = None, skipinitialspace: 'bool' = False, skiprows: 'list[int] | int | Callable[[Hashable], bool] | None' = None, skipfooter: 'int' = 0, nrows: 'int | None' = None, na_values: 'Hashable | Iterable[Hashable] | Mapping[Hashable, Iterable[Hashable]] | None' = None, keep_default_na: 'bool' = True, na_filter: 'bool' = True, verbose: 'bool | lib.NoDefault' = <no_default>, skip_blank_lines: 'bool' = True, parse_dates: 'bool | Sequence[Hashable] | None' = None, infer_datetime_format: 'bool | lib.NoDefault' = <no_default>, keep_date_col: 'bool | lib.NoDefault' = <no_default>, date_parser: 'Callable | lib.NoDefault' = <no_default>, date_format: 'str | dict[Hashable, str] | None' = None, dayfirst: 'bool' = False, cache_dates: 'bool' = True, iterator: 'bool' = False, chunksize: 'int | None' = None, compression: 'CompressionOptions' = 'infer', thousands: 'str | None' = None, decimal: 'str' = '.', lineterminator: 'str | None' = None, quotechar: 'str' = '"', quoting: 'int' = 0, doublequote: 'bool' = True, escapechar: 'str | None' = None, comment: 'str | None' = None, encoding: 'str | None' = None, encoding_errors: 'str | None' = 'strict', dialect: 'str | csv.Dialect | None' = None, on_bad_lines: 'str' = 'error', delim_whitespace: 'bool | lib.NoDefault' = <no_default>, low_memory: 'bool' = True, memory_map: 'bool' = False, float_precision: "Literal['high', 'legacy'] | None" = None, storage_options: 'StorageOptions | None' = None, dtype_backend: 'DtypeBackend | lib.NoDefault' = <no_default>, ) -> 'DataFrame | TextFileReader' Docstring: Read a comma-separated values (csv) file into DataFrame. Also supports optionally iterating or breaking of the file into chunks. Additional help can be found in the online docs for `IO Tools <https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html>`_. Parameters ---------- filepath_or_buffer : str, path object or file-like object Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, gs, and file. For file URLs, a host is expected. A local file could be: file://localhost/path/to/table.csv. If you want to pass in a path object, pandas accepts any ``os.PathLike``. By file-like object, we refer to objects with a ``read()`` method, such as a file handle (e.g. via builtin ``open`` function) or ``StringIO``. sep : str, default ',' Character or regex pattern to treat as the delimiter. If ``sep=None``, the C engine cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will be used and automatically detect the separator from only the first valid row of the file by Python's builtin sniffer tool, ``csv.Sniffer``. In addition, separators longer than 1 character and different from ``'\s+'`` will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: ``'\r\t'``. 
delimiter : str, optional Alias for ``sep``. header : int, Sequence of int, 'infer' or None, default 'infer' Row number(s) containing column labels and marking the start of the data (zero-indexed). Default behavior is to infer the column names: if no ``names`` are passed the behavior is identical to ``header=0`` and column names are inferred from the first line of the file, if column names are passed explicitly to ``names`` then the behavior is identical to ``header=None``. Explicitly pass ``header=0`` to be able to replace existing names. The header can be a list of integers that specify row locations for a :class:`~pandas.MultiIndex` on the columns e.g. ``[0, 1, 3]``. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped). Note that this parameter ignores commented lines and empty lines if ``skip_blank_lines=True``, so ``header=0`` denotes the first line of data rather than the first line of the file. names : Sequence of Hashable, optional Sequence of column labels to apply. If the file contains a header row, then you should explicitly pass ``header=0`` to override the column names. Duplicates in this list are not allowed. index_col : Hashable, Sequence of Hashable or False, optional Column(s) to use as row label(s), denoted either by column labels or column indices. If a sequence of labels or indices is given, :class:`~pandas.MultiIndex` will be formed for the row labels. Note: ``index_col=False`` can be used to force pandas to *not* use the first column as the index, e.g., when you have a malformed file with delimiters at the end of each line. usecols : Sequence of Hashable or Callable, optional Subset of columns to select, denoted either by column labels or column indices. If list-like, all elements must either be positional (i.e. integer indices into the document columns) or strings that correspond to column names provided either by the user in ``names`` or inferred from the document header row(s). If ``names`` are given, the document header row(s) are not taken into account. For example, a valid list-like ``usecols`` parameter would be ``[0, 1, 2]`` or ``['foo', 'bar', 'baz']``. Element order is ignored, so ``usecols=[0, 1]`` is the same as ``[1, 0]``. To instantiate a :class:`~pandas.DataFrame` from ``data`` with element order preserved use ``pd.read_csv(data, usecols=['foo', 'bar'])[['foo', 'bar']]`` for columns in ``['foo', 'bar']`` order or ``pd.read_csv(data, usecols=['foo', 'bar'])[['bar', 'foo']]`` for ``['bar', 'foo']`` order. If callable, the callable function will be evaluated against the column names, returning names where the callable function evaluates to ``True``. An example of a valid callable argument would be ``lambda x: x.upper() in ['AAA', 'BBB', 'DDD']``. Using this parameter results in much faster parsing time and lower memory usage. dtype : dtype or dict of {Hashable : dtype}, optional Data type(s) to apply to either the whole dataset or individual columns. E.g., ``{'a': np.float64, 'b': np.int32, 'c': 'Int64'}`` Use ``str`` or ``object`` together with suitable ``na_values`` settings to preserve and not interpret ``dtype``. If ``converters`` are specified, they will be applied INSTEAD of ``dtype`` conversion. .. versionadded:: 1.5.0 Support for ``defaultdict`` was added. Specify a ``defaultdict`` as input where the default determines the ``dtype`` of the columns which are not explicitly listed. engine : {'c', 'python', 'pyarrow'}, optional Parser engine to use. 
The C and pyarrow engines are faster, while the python engine is currently more feature-complete. Multithreading is currently only supported by the pyarrow engine. .. versionadded:: 1.4.0 The 'pyarrow' engine was added as an *experimental* engine, and some features are unsupported, or may not work correctly, with this engine. converters : dict of {Hashable : Callable}, optional Functions for converting values in specified columns. Keys can either be column labels or column indices. true_values : list, optional Values to consider as ``True`` in addition to case-insensitive variants of 'True'. false_values : list, optional Values to consider as ``False`` in addition to case-insensitive variants of 'False'. skipinitialspace : bool, default False Skip spaces after delimiter. skiprows : int, list of int or Callable, optional Line numbers to skip (0-indexed) or number of lines to skip (``int``) at the start of the file. If callable, the callable function will be evaluated against the row indices, returning ``True`` if the row should be skipped and ``False`` otherwise. An example of a valid callable argument would be ``lambda x: x in [0, 2]``. skipfooter : int, default 0 Number of lines at bottom of file to skip (Unsupported with ``engine='c'``). nrows : int, optional Number of rows of file to read. Useful for reading pieces of large files. na_values : Hashable, Iterable of Hashable or dict of {Hashable : Iterable}, optional Additional strings to recognize as ``NA``/``NaN``. If ``dict`` passed, specific per-column ``NA`` values. By default the following values are interpreted as ``NaN``: " ", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN", "-NaN", "-nan", "1.#IND", "1.#QNAN", "<NA>", "N/A", "NA", "NULL", "NaN", "None", "n/a", "nan", "null ". keep_default_na : bool, default True Whether or not to include the default ``NaN`` values when parsing the data. Depending on whether ``na_values`` is passed in, the behavior is as follows: * If ``keep_default_na`` is ``True``, and ``na_values`` are specified, ``na_values`` is appended to the default ``NaN`` values used for parsing. * If ``keep_default_na`` is ``True``, and ``na_values`` are not specified, only the default ``NaN`` values are used for parsing. * If ``keep_default_na`` is ``False``, and ``na_values`` are specified, only the ``NaN`` values specified ``na_values`` are used for parsing. * If ``keep_default_na`` is ``False``, and ``na_values`` are not specified, no strings will be parsed as ``NaN``. Note that if ``na_filter`` is passed in as ``False``, the ``keep_default_na`` and ``na_values`` parameters will be ignored. na_filter : bool, default True Detect missing value markers (empty strings and the value of ``na_values``). In data without any ``NA`` values, passing ``na_filter=False`` can improve the performance of reading a large file. verbose : bool, default False Indicate number of ``NA`` values placed in non-numeric columns. skip_blank_lines : bool, default True If ``True``, skip over blank lines rather than interpreting as ``NaN`` values. parse_dates : bool, list of Hashable, list of lists or dict of {Hashable : list}, default False The behavior is as follows: * ``bool``. If ``True`` -> try parsing the index. Note: Automatically set to ``True`` if ``date_format`` or ``date_parser`` arguments have been passed. * ``list`` of ``int`` or names. e.g. If ``[1, 2, 3]`` -> try parsing columns 1, 2, 3 each as a separate date column. * ``list`` of ``list``. e.g. If ``[[1, 3]]`` -> combine columns 1 and 3 and parse as a single date column. 
Values are joined with a space before parsing. * ``dict``, e.g. ``{'foo' : [1, 3]}`` -> parse columns 1, 3 as date and call result 'foo'. Values are joined with a space before parsing. If a column or index cannot be represented as an array of ``datetime``, say because of an unparsable value or a mixture of timezones, the column or index will be returned unaltered as an ``object`` data type. For non-standard ``datetime`` parsing, use :func:`~pandas.to_datetime` after :func:`~pandas.read_csv`. Note: A fast-path exists for iso8601-formatted dates. infer_datetime_format : bool, default False If ``True`` and ``parse_dates`` is enabled, pandas will attempt to infer the format of the ``datetime`` strings in the columns, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by 5-10x. .. deprecated:: 2.0.0 A strict version of this argument is now the default, passing it has no effect. keep_date_col : bool, default False If ``True`` and ``parse_dates`` specifies combining multiple columns then keep the original columns. date_parser : Callable, optional Function to use for converting a sequence of string columns to an array of ``datetime`` instances. The default uses ``dateutil.parser.parser`` to do the conversion. pandas will try to call ``date_parser`` in three different ways, advancing to the next if an exception occurs: 1) Pass one or more arrays (as defined by ``parse_dates``) as arguments; 2) concatenate (row-wise) the string values from the columns defined by ``parse_dates`` into a single array and pass that; and 3) call ``date_parser`` once for each row using one or more strings (corresponding to the columns defined by ``parse_dates``) as arguments. .. deprecated:: 2.0.0 Use ``date_format`` instead, or read in as ``object`` and then apply :func:`~pandas.to_datetime` as-needed. date_format : str or dict of column -> format, optional Format to use for parsing dates when used in conjunction with ``parse_dates``. The strftime to parse time, e.g. :const:`"%d/%m/%Y"`. See `strftime documentation <https://docs.python.org/3/library/datetime.html #strftime-and-strptime-behavior>`_ for more information on choices, though note that :const:`"%f"` will parse all the way up to nanoseconds. You can also pass: - "ISO8601", to parse any `ISO8601 <https://en.wikipedia.org/wiki/ISO_8601>`_ time string (not necessarily in exactly the same format); - "mixed", to infer the format for each element individually. This is risky, and you should probably use it along with `dayfirst`. .. versionadded:: 2.0.0 dayfirst : bool, default False DD/MM format dates, international and European format. cache_dates : bool, default True If ``True``, use a cache of unique, converted dates to apply the ``datetime`` conversion. May produce significant speed-up when parsing duplicate date strings, especially ones with timezone offsets. iterator : bool, default False Return ``TextFileReader`` object for iteration or getting chunks with ``get_chunk()``. chunksize : int, optional Number of lines to read from the file per chunk. Passing a value will cause the function to return a ``TextFileReader`` object for iteration. See the `IO Tools docs <https://pandas.pydata.org/pandas-docs/stable/io.html#io-chunking>`_ for more information on ``iterator`` and ``chunksize``. compression : str or dict, default 'infer' For on-the-fly decompression of on-disk data. 
If 'infer' and 'filepath_or_buffer' is path-like, then detect compression from the following extensions: '.gz', '.bz2', '.zip', '.xz', '.zst', '.tar', '.tar.gz', '.tar.xz' or '.tar.bz2' (otherwise no compression). If using 'zip' or 'tar', the ZIP file must contain only one data file to be read in. Set to ``None`` for no decompression. Can also be a dict with key ``'method'`` set to one of {``'zip'``, ``'gzip'``, ``'bz2'``, ``'zstd'``, ``'xz'``, ``'tar'``} and other key-value pairs are forwarded to ``zipfile.ZipFile``, ``gzip.GzipFile``, ``bz2.BZ2File``, ``zstandard.ZstdDecompressor``, ``lzma.LZMAFile`` or ``tarfile.TarFile``, respectively. As an example, the following could be passed for Zstandard decompression using a custom compression dictionary: ``compression={'method': 'zstd', 'dict_data': my_compression_dict}``. .. versionadded:: 1.5.0 Added support for `.tar` files. .. versionchanged:: 1.4.0 Zstandard support. thousands : str (length 1), optional Character acting as the thousands separator in numerical values. decimal : str (length 1), default '.' Character to recognize as decimal point (e.g., use ',' for European data). lineterminator : str (length 1), optional Character used to denote a line break. Only valid with C parser. quotechar : str (length 1), optional Character used to denote the start and end of a quoted item. Quoted items can include the ``delimiter`` and it will be ignored. quoting : {0 or csv.QUOTE_MINIMAL, 1 or csv.QUOTE_ALL, 2 or csv.QUOTE_NONNUMERIC, 3 or csv.QUOTE_NONE}, default csv.QUOTE_MINIMAL Control field quoting behavior per ``csv.QUOTE_*`` constants. Default is ``csv.QUOTE_MINIMAL`` (i.e., 0) which implies that only fields containing special characters are quoted (e.g., characters defined in ``quotechar``, ``delimiter``, or ``lineterminator``. doublequote : bool, default True When ``quotechar`` is specified and ``quoting`` is not ``QUOTE_NONE``, indicate whether or not to interpret two consecutive ``quotechar`` elements INSIDE a field as a single ``quotechar`` element. escapechar : str (length 1), optional Character used to escape other characters. comment : str (length 1), optional Character indicating that the remainder of line should not be parsed. If found at the beginning of a line, the line will be ignored altogether. This parameter must be a single character. Like empty lines (as long as ``skip_blank_lines=True``), fully commented lines are ignored by the parameter ``header`` but not by ``skiprows``. For example, if ``comment='#'``, parsing ``#empty\na,b,c\n1,2,3`` with ``header=0`` will result in ``'a,b,c'`` being treated as the header. encoding : str, optional, default 'utf-8' Encoding to use for UTF when reading/writing (ex. ``'utf-8'``). `List of Python standard encodings <https://docs.python.org/3/library/codecs.html#standard-encodings>`_ . encoding_errors : str, optional, default 'strict' How encoding errors are treated. `List of possible values <https://docs.python.org/3/library/codecs.html#error-handlers>`_ . .. versionadded:: 1.3.0 dialect : str or csv.Dialect, optional If provided, this parameter will override values (default or not) for the following parameters: ``delimiter``, ``doublequote``, ``escapechar``, ``skipinitialspace``, ``quotechar``, and ``quoting``. If it is necessary to override values, a ``ParserWarning`` will be issued. See ``csv.Dialect`` documentation for more details. on_bad_lines : {'error', 'warn', 'skip'} or Callable, default 'error' Specifies what to do upon encountering a bad line (a line with too many fields). 
Allowed values are : - ``'error'``, raise an Exception when a bad line is encountered. - ``'warn'``, raise a warning when a bad line is encountered and skip that line. - ``'skip'``, skip bad lines without raising or warning when they are encountered. .. versionadded:: 1.3.0 .. versionadded:: 1.4.0 - Callable, function with signature ``(bad_line: list[str]) -> list[str] | None`` that will process a single bad line. ``bad_line`` is a list of strings split by the ``sep``. If the function returns ``None``, the bad line will be ignored. If the function returns a new ``list`` of strings with more elements than expected, a ``ParserWarning`` will be emitted while dropping extra elements. Only supported when ``engine='python'`` .. versionchanged:: 2.2.0 - Callable, function with signature as described in `pyarrow documentation <https://arrow.apache.org/docs/python/generated/pyarrow.csv.ParseOptions.html #pyarrow.csv.ParseOptions.invalid_row_handler>`_ when ``engine='pyarrow'`` delim_whitespace : bool, default False Specifies whether or not whitespace (e.g. ``' '`` or ``'\t'``) will be used as the ``sep`` delimiter. Equivalent to setting ``sep='\s+'``. If this option is set to ``True``, nothing should be passed in for the ``delimiter`` parameter. .. deprecated:: 2.2.0 Use ``sep="\s+"`` instead. low_memory : bool, default True Internally process the file in chunks, resulting in lower memory use while parsing, but possibly mixed type inference. To ensure no mixed types either set ``False``, or specify the type with the ``dtype`` parameter. Note that the entire file is read into a single :class:`~pandas.DataFrame` regardless, use the ``chunksize`` or ``iterator`` parameter to return the data in chunks. (Only valid with C parser). memory_map : bool, default False If a filepath is provided for ``filepath_or_buffer``, map the file object directly onto memory and access the data directly from there. Using this option can improve performance because there is no longer any I/O overhead. float_precision : {'high', 'legacy', 'round_trip'}, optional Specifies which converter the C engine should use for floating-point values. The options are ``None`` or ``'high'`` for the ordinary converter, ``'legacy'`` for the original lower precision pandas converter, and ``'round_trip'`` for the round-trip converter. storage_options : dict, optional Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to ``urllib.request.Request`` as header options. For other URLs (e.g. starting with "s3://", and "gcs://") the key-value pairs are forwarded to ``fsspec.open``. Please see ``fsspec`` and ``urllib`` for more details, and for more examples on storage options refer `here <https://pandas.pydata.org/docs/user_guide/io.html? highlight=storage_options#reading-writing-remote-files>`_. dtype_backend : {'numpy_nullable', 'pyarrow'}, default 'numpy_nullable' Back-end data type applied to the resultant :class:`DataFrame` (still experimental). Behaviour is as follows: * ``"numpy_nullable"``: returns nullable-dtype-backed :class:`DataFrame` (default). * ``"pyarrow"``: returns pyarrow-backed nullable :class:`ArrowDtype` DataFrame. .. versionadded:: 2.0 Returns ------- DataFrame or TextFileReader A comma-separated values (csv) file is returned as two-dimensional data structure with labeled axes. See Also -------- DataFrame.to_csv : Write DataFrame to a comma-separated values (csv) file. 
read_table : Read general delimited file into DataFrame. read_fwf : Read a table of fixed-width formatted lines into DataFrame. Examples -------- >>> pd.read_csv('data.csv') # doctest: +SKIP File: c:\users\phili\anaconda3\envs\vscode-env\lib\site-packages\pandas\io\parsers\readers.py Type: function
The pd.read_csv function has a large number of input parameters. Note that most of these are named input parameters and are therefore assigned to a default value which is consistent with the default behaviour of a csv file. When the file is in the expected format only the filepath_or_buffer needs to be specified and this is normally provided positionally:
df = pd.read_csv('./files/Book1.csv')
df
string | integer | bool | float | date | time | category | |
---|---|---|---|---|---|---|---|
0 | the fat black cat | 4 | True | 0.86 | 24/07/2023 | 11:36:00 | A |
1 | sat on the mat | 4 | True | 0.86 | 25/07/2023 | 12:36:00 | A |
2 | twinkle, twinkle | 2 | True | -1.14 | 26/07/2023 | 13:36:00 | B |
3 | little star | 2 | True | -1.14 | 27/07/2023 | 14:36:00 | B |
4 | how I wonder | 3 | False | -0.14 | 28/07/2023 | 15:36:00 | B |
5 | what you are | 4 | True | 0.86 | 29/07/2023 | 16:36:00 | B |
Notice the Series names are in the expected format and are taken from the first line of the file. A csv does not have an index by default and so a RangeIndex has automatically been generated:
df.axes
[RangeIndex(start=0, stop=6, step=1), Index(['string', 'integer', 'bool', 'float', 'date', 'time', 'category'], dtype='object')]
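If one of the columns in the file should instead be used as the index, the index_col parameter can be supplied (column 0, the 'string' column, is used here for illustration):

pd.read_csv('./files/Book1.csv', index_col=0)   # column 0 becomes the row index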
Tab Delimited Text File¶
A text file has the file extension txt and is very similar to a csv file but uses \t instead of , as the delimiter:
string\tinteger\tbool\tfloat\tdate\ttime\tcategory\n
the fat black cat\t4\tTRUE\t0.86\t24/07/2023\t11:36:00\tA\n
sat on the mat\t4\tTRUE\t0.86\t25/07/2023\t12:36:00\tA\n
"twinkle, twinkle"\t2\tTRUE\t-1.14\t26/07/2023\t13:36:00\tB\n
little star\t2\tTRUE\t-1.14\t27/07/2023\t14:36:00\tB\n
how I wonder\t3\tFALSE\t-0.14\t28/07/2023\t15:36:00\tB\n
what you are\t4\tTRUE\t0.86\t29/07/2023\t16:36:00\tB\n
%%writefile ./files/Book2.txt
string integer boolean floatingpoint date time category
the fat black cat 4 TRUE 0.86 24/07/2023 11:36:00 A
sat on the mat 4 TRUE 0.86 25/07/2023 12:36:00 A
"twinkle, twinkle" 2 TRUE -1.14 26/07/2023 13:36:00 B
little star 2 TRUE -1.14 27/07/2023 14:36:00 B
how I wonder 3 FALSE -0.14 28/07/2023 15:36:00 B
what you are 4 TRUE 0.86 29/07/2023 16:36:00 B
Overwriting ./files/Book2.txt
The same pd.read_csv function is used to read in a txt file. However this function by default looks for a , as the delimiter to move onto the next column and, as no comma is present, the data is all shown in a single column:
df = pd.read_csv(r'.\files\Book2.txt')
df
string\tinteger\tboolean\tfloatingpoint\tdate\ttime\tcategory | |
---|---|
0 | the fat black cat\t4\tTRUE\t0.86\t24/07/2023\t11:36:00\tA |
1 | sat on the mat\t4\tTRUE\t0.86\t25/07/2023\t12:36:00\tA |
2 | twinkle, twinkle\t2\tTRUE\t-1.14\t26/07/2023\t13:36:00\tB |
3 | little star\t2\tTRUE\t-1.14\t27/07/2023\t14:36:00\tB |
4 | how I wonder\t3\tFALSE\t-0.14\t28/07/2023\t15:36:00\tB |
5 | what you are\t4\tTRUE\t0.86\t29/07/2023\t16:36:00\tB |
If the delimiter is specified as '\t' the data will instead be read in properly:
df = pd.read_csv(r'.\files\Book2.txt', delimiter='\t')
df
string | integer | boolean | floatingpoint | date | time | category | |
---|---|---|---|---|---|---|---|
0 | the fat black cat | 4 | True | 0.86 | 24/07/2023 | 11:36:00 | A |
1 | sat on the mat | 4 | True | 0.86 | 25/07/2023 | 12:36:00 | A |
2 | twinkle, twinkle | 2 | True | -1.14 | 26/07/2023 | 13:36:00 | B |
3 | little star | 2 | True | -1.14 | 27/07/2023 | 14:36:00 | B |
4 | how I wonder | 3 | False | -0.14 | 28/07/2023 | 15:36:00 | B |
5 | what you are | 4 | True | 0.86 | 29/07/2023 | 16:36:00 | B |
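As an aside, the related function pd.read_table defaults to a tab delimiter, so the following minimal sketch should be equivalent to the call above:
pd.read_table('./files/Book2.txt')  # sep defaults to '\t'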
JavaScript Object Notation¶
If the data is in the form of a JSON str
instance:
json_string = '{"string":{"0":"the fat black cat","1":"sat on the mat","2":"twinkle, twinkle","3":"little star","4":"how I wonder","5":"what you are"},"integer":{"0":4,"1":4,"2":2,"3":2,"4":3,"5":4},"boolean":{"0":true,"1":true,"2":true,"3":true,"4":false,"5":true},"floatingpoint":{"0":0.86,"1":0.86,"2":-1.14,"3":-1.14,"4":-0.14,"5":0.86},"date":{"0":"24\\/07\\/2023","1":"25\\/07\\/2023","2":"26\\/07\\/2023","3":"27\\/07\\/2023","4":"28\\/07\\/2023","5":"29\\/07\\/2023"},"time":{"0":"11:36:00","1":"12:36:00","2":"13:36:00","3":"14:36:00","4":"15:36:00","5":"16:36:00"},"category":{"0":"A","1":"A","2":"B","3":"B","4":"B","5":"B"}}'
It can be wrapped in a StringIO instance:
import io
io.StringIO(json_string)
<_io.StringIO at 0x1aaad271780>
And then read in using the pd.read_json
function:
pd.read_json(io.StringIO(json_string))
string | integer | boolean | floatingpoint | date | time | category | |
---|---|---|---|---|---|---|---|
0 | the fat black cat | 4 | True | 0.86 | 2023-07-24 | 11:36:00 | A |
1 | sat on the mat | 4 | True | 0.86 | 2023-07-25 | 12:36:00 | A |
2 | twinkle, twinkle | 2 | True | -1.14 | 2023-07-26 | 13:36:00 | B |
3 | little star | 2 | True | -1.14 | 2023-07-27 | 14:36:00 | B |
4 | how I wonder | 3 | False | -0.14 | 2023-07-28 | 15:36:00 | B |
5 | what you are | 4 | True | 0.86 | 2023-07-29 | 16:36:00 | B |
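The column-oriented structure of the JSON str can be confirmed with the standard library; a minimal sketch showing that each pd.Series name maps to a dict of index keys to values:
import json
json.loads(json_string)['integer']  # {'0': 4, '1': 4, '2': 2, '3': 2, '4': 3, '5': 4}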
Microsoft Excel File¶
A Microsoft Excel File, which has the file extension .xlsx (or .xls for older files), is a collection of sheets. The data in each individual sheet is similar to an individual csv file, although obscured because Excel applies formatting options to the data:
The related function read_excel is used to read in the data from an Excel File. The delimiter is predefined in an Excel File; however an Excel File can have multiple sheets, so the keyword input argument sheet_name is available. This defaults to 0, the first sheet, which in a new workbook is normally named 'Sheet1':
pd.read_excel?
Signature: pd.read_excel( io, sheet_name: 'str | int | list[IntStrT] | None' = 0, *, header: 'int | Sequence[int] | None' = 0, names: 'SequenceNotStr[Hashable] | range | None' = None, index_col: 'int | str | Sequence[int] | None' = None, usecols: 'int | str | Sequence[int] | Sequence[str] | Callable[[str], bool] | None' = None, dtype: 'DtypeArg | None' = None, engine: "Literal['xlrd', 'openpyxl', 'odf', 'pyxlsb', 'calamine'] | None" = None, converters: 'dict[str, Callable] | dict[int, Callable] | None' = None, true_values: 'Iterable[Hashable] | None' = None, false_values: 'Iterable[Hashable] | None' = None, skiprows: 'Sequence[int] | int | Callable[[int], object] | None' = None, nrows: 'int | None' = None, na_values=None, keep_default_na: 'bool' = True, na_filter: 'bool' = True, verbose: 'bool' = False, parse_dates: 'list | dict | bool' = False, date_parser: 'Callable | lib.NoDefault' = <no_default>, date_format: 'dict[Hashable, str] | str | None' = None, thousands: 'str | None' = None, decimal: 'str' = '.', comment: 'str | None' = None, skipfooter: 'int' = 0, storage_options: 'StorageOptions | None' = None, dtype_backend: 'DtypeBackend | lib.NoDefault' = <no_default>, engine_kwargs: 'dict | None' = None, ) -> 'DataFrame | dict[IntStrT, DataFrame]' Docstring: Read an Excel file into a ``pandas`` ``DataFrame``. Supports `xls`, `xlsx`, `xlsm`, `xlsb`, `odf`, `ods` and `odt` file extensions read from a local filesystem or URL. Supports an option to read a single sheet or a list of sheets. Parameters ---------- io : str, bytes, ExcelFile, xlrd.Book, path object, or file-like object Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. A local file could be: ``file://localhost/path/to/table.xlsx``. If you want to pass in a path object, pandas accepts any ``os.PathLike``. By file-like object, we refer to objects with a ``read()`` method, such as a file handle (e.g. via builtin ``open`` function) or ``StringIO``. .. deprecated:: 2.1.0 Passing byte strings is deprecated. To read from a byte string, wrap it in a ``BytesIO`` object. sheet_name : str, int, list, or None, default 0 Strings are used for sheet names. Integers are used in zero-indexed sheet positions (chart sheets do not count as a sheet position). Lists of strings/integers are used to request multiple sheets. Specify ``None`` to get all worksheets. Available cases: * Defaults to ``0``: 1st sheet as a `DataFrame` * ``1``: 2nd sheet as a `DataFrame` * ``"Sheet1"``: Load sheet with name "Sheet1" * ``[0, 1, "Sheet5"]``: Load first, second and sheet named "Sheet5" as a dict of `DataFrame` * ``None``: All worksheets. header : int, list of int, default 0 Row (0-indexed) to use for the column labels of the parsed DataFrame. If a list of integers is passed those row positions will be combined into a ``MultiIndex``. Use None if there is no header. names : array-like, default None List of column names to use. If file contains no header row, then you should explicitly pass header=None. index_col : int, str, list of int, default None Column (0-indexed) to use as the row labels of the DataFrame. Pass None if there is no such column. If a list is passed, those columns will be combined into a ``MultiIndex``. If a subset of data is selected with ``usecols``, index_col is based on the subset. Missing values will be forward filled to allow roundtripping with ``to_excel`` for ``merged_cells=True``. 
To avoid forward filling the missing values use ``set_index`` after reading the data instead of ``index_col``. usecols : str, list-like, or callable, default None * If None, then parse all columns. * If str, then indicates comma separated list of Excel column letters and column ranges (e.g. "A:E" or "A,C,E:F"). Ranges are inclusive of both sides. * If list of int, then indicates list of column numbers to be parsed (0-indexed). * If list of string, then indicates list of column names to be parsed. * If callable, then evaluate each column name against it and parse the column if the callable returns ``True``. Returns a subset of the columns according to behavior above. dtype : Type name or dict of column -> type, default None Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32} Use ``object`` to preserve data as stored in Excel and not interpret dtype, which will necessarily result in ``object`` dtype. If converters are specified, they will be applied INSTEAD of dtype conversion. If you use ``None``, it will infer the dtype of each column based on the data. engine : {'openpyxl', 'calamine', 'odf', 'pyxlsb', 'xlrd'}, default None If io is not a buffer or path, this must be set to identify io. Engine compatibility : - ``openpyxl`` supports newer Excel file formats. - ``calamine`` supports Excel (.xls, .xlsx, .xlsm, .xlsb) and OpenDocument (.ods) file formats. - ``odf`` supports OpenDocument file formats (.odf, .ods, .odt). - ``pyxlsb`` supports Binary Excel files. - ``xlrd`` supports old-style Excel files (.xls). When ``engine=None``, the following logic will be used to determine the engine: - If ``path_or_buffer`` is an OpenDocument format (.odf, .ods, .odt), then `odf <https://pypi.org/project/odfpy/>`_ will be used. - Otherwise if ``path_or_buffer`` is an xls format, ``xlrd`` will be used. - Otherwise if ``path_or_buffer`` is in xlsb format, ``pyxlsb`` will be used. - Otherwise ``openpyxl`` will be used. converters : dict, default None Dict of functions for converting values in certain columns. Keys can either be integers or column labels, values are functions that take one input argument, the Excel cell content, and return the transformed content. true_values : list, default None Values to consider as True. false_values : list, default None Values to consider as False. skiprows : list-like, int, or callable, optional Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file. If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise. An example of a valid callable argument would be ``lambda x: x in [0, 2]``. nrows : int, default None Number of rows to parse. na_values : scalar, str, list-like, or dict, default None Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', '<NA>', 'N/A', 'NA', 'NULL', 'NaN', 'None', 'n/a', 'nan', 'null'. keep_default_na : bool, default True Whether or not to include the default NaN values when parsing the data. Depending on whether ``na_values`` is passed in, the behavior is as follows: * If ``keep_default_na`` is True, and ``na_values`` are specified, ``na_values`` is appended to the default NaN values used for parsing. * If ``keep_default_na`` is True, and ``na_values`` are not specified, only the default NaN values are used for parsing. 
* If ``keep_default_na`` is False, and ``na_values`` are specified, only the NaN values specified ``na_values`` are used for parsing. * If ``keep_default_na`` is False, and ``na_values`` are not specified, no strings will be parsed as NaN. Note that if `na_filter` is passed in as False, the ``keep_default_na`` and ``na_values`` parameters will be ignored. na_filter : bool, default True Detect missing value markers (empty strings and the value of na_values). In data without any NAs, passing ``na_filter=False`` can improve the performance of reading a large file. verbose : bool, default False Indicate number of NA values placed in non-numeric columns. parse_dates : bool, list-like, or dict, default False The behavior is as follows: * ``bool``. If True -> try parsing the index. * ``list`` of int or names. e.g. If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column. * ``list`` of lists. e.g. If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column. * ``dict``, e.g. {'foo' : [1, 3]} -> parse columns 1, 3 as date and call result 'foo' If a column or index contains an unparsable date, the entire column or index will be returned unaltered as an object data type. If you don`t want to parse some cells as date just change their type in Excel to "Text". For non-standard datetime parsing, use ``pd.to_datetime`` after ``pd.read_excel``. Note: A fast-path exists for iso8601-formatted dates. date_parser : function, optional Function to use for converting a sequence of string columns to an array of datetime instances. The default uses ``dateutil.parser.parser`` to do the conversion. Pandas will try to call `date_parser` in three different ways, advancing to the next if an exception occurs: 1) Pass one or more arrays (as defined by `parse_dates`) as arguments; 2) concatenate (row-wise) the string values from the columns defined by `parse_dates` into a single array and pass that; and 3) call `date_parser` once for each row using one or more strings (corresponding to the columns defined by `parse_dates`) as arguments. .. deprecated:: 2.0.0 Use ``date_format`` instead, or read in as ``object`` and then apply :func:`to_datetime` as-needed. date_format : str or dict of column -> format, default ``None`` If used in conjunction with ``parse_dates``, will parse dates according to this format. For anything more complex, please read in as ``object`` and then apply :func:`to_datetime` as-needed. .. versionadded:: 2.0.0 thousands : str, default None Thousands separator for parsing string columns to numeric. Note that this parameter is only necessary for columns stored as TEXT in Excel, any numeric columns will automatically be parsed, regardless of display format. decimal : str, default '.' Character to recognize as decimal point for parsing string columns to numeric. Note that this parameter is only necessary for columns stored as TEXT in Excel, any numeric columns will automatically be parsed, regardless of display format.(e.g. use ',' for European data). .. versionadded:: 1.4.0 comment : str, default None Comments out remainder of line. Pass a character or characters to this argument to indicate comments in the input file. Any data between the comment string and the end of the current line is ignored. skipfooter : int, default 0 Rows at the end to skip (0-indexed). storage_options : dict, optional Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. 
For HTTP(S) URLs the key-value pairs are forwarded to ``urllib.request.Request`` as header options. For other URLs (e.g. starting with "s3://", and "gcs://") the key-value pairs are forwarded to ``fsspec.open``. Please see ``fsspec`` and ``urllib`` for more details, and for more examples on storage options refer `here <https://pandas.pydata.org/docs/user_guide/io.html? highlight=storage_options#reading-writing-remote-files>`_. dtype_backend : {'numpy_nullable', 'pyarrow'}, default 'numpy_nullable' Back-end data type applied to the resultant :class:`DataFrame` (still experimental). Behaviour is as follows: * ``"numpy_nullable"``: returns nullable-dtype-backed :class:`DataFrame` (default). * ``"pyarrow"``: returns pyarrow-backed nullable :class:`ArrowDtype` DataFrame. .. versionadded:: 2.0 engine_kwargs : dict, optional Arbitrary keyword arguments passed to excel engine. Returns ------- DataFrame or dict of DataFrames DataFrame from the passed in Excel file. See notes in sheet_name argument for more information on when a dict of DataFrames is returned. See Also -------- DataFrame.to_excel : Write DataFrame to an Excel file. DataFrame.to_csv : Write DataFrame to a comma-separated values (csv) file. read_csv : Read a comma-separated values (csv) file into DataFrame. read_fwf : Read a table of fixed-width formatted lines into DataFrame. Notes ----- For specific information on the methods used for each Excel engine, refer to the pandas :ref:`user guide <io.excel_reader>` Examples -------- The file can be read using the file name as string or an open file object: >>> pd.read_excel('tmp.xlsx', index_col=0) # doctest: +SKIP Name Value 0 string1 1 1 string2 2 2 #Comment 3 >>> pd.read_excel(open('tmp.xlsx', 'rb'), ... sheet_name='Sheet3') # doctest: +SKIP Unnamed: 0 Name Value 0 0 string1 1 1 1 string2 2 2 2 #Comment 3 Index and header can be specified via the `index_col` and `header` arguments >>> pd.read_excel('tmp.xlsx', index_col=None, header=None) # doctest: +SKIP 0 1 2 0 NaN Name Value 1 0.0 string1 1 2 1.0 string2 2 3 2.0 #Comment 3 Column types are inferred but can be explicitly specified >>> pd.read_excel('tmp.xlsx', index_col=0, ... dtype={'Name': str, 'Value': float}) # doctest: +SKIP Name Value 0 string1 1.0 1 string2 2.0 2 #Comment 3.0 True, False, and NA values, and thousands separators have defaults, but can be explicitly specified, too. Supply the values you would like as strings or lists of strings! >>> pd.read_excel('tmp.xlsx', index_col=0, ... na_values=['string1', 'string2']) # doctest: +SKIP Name Value 0 NaN 1 1 NaN 2 2 #Comment 3 Comment lines in the excel input file can be skipped using the ``comment`` kwarg. >>> pd.read_excel('tmp.xlsx', index_col=0, comment='#') # doctest: +SKIP Name Value 0 string1 1.0 1 string2 2.0 2 None NaN File: c:\users\phili\anaconda3\envs\vscode-env\lib\site-packages\pandas\io\excel\_base.py Type: function
df = pd.read_excel('./files/Book3.xlsx', sheet_name='Sheet1')
df
string | integer | boolean | floatingpoint | date | time | category | |
---|---|---|---|---|---|---|---|
0 | the fat black cat | 4 | True | 0.86 | 2023-07-24 11:30:00 | 11:36:00 | A |
1 | sat on the mat | 4 | True | 0.86 | 2023-07-25 00:00:00 | 12:36:00 | A |
2 | twinkle, twinkle | 2 | True | -1.14 | 2023-07-26 00:00:00 | 13:36:00 | B |
3 | little star | 2 | True | -1.14 | 2023-07-27 00:00:00 | 14:36:00 | B |
4 | how I wonder | 3 | False | -0.14 | 2023-07-28 00:00:00 | 15:36:00 | B |
5 | what you are | 4 | True | 0.86 | 2023-07-29 00:00:00 | 16:36:00 | B |
Sheets in an Excel File are ordered and the ordering of the sheets is analogous to a RangeIndex ('Sheet1' corresponds to an index of 0 because of zero-based indexing). This parameter is often supplied positionally:
df = pd.read_excel('./files/Book3.xlsx', 0)
df
string | integer | boolean | floatingpoint | date | time | category | |
---|---|---|---|---|---|---|---|
0 | the fat black cat | 4 | True | 0.86 | 2023-07-24 11:30:00 | 11:36:00 | A |
1 | sat on the mat | 4 | True | 0.86 | 2023-07-25 00:00:00 | 12:36:00 | A |
2 | twinkle, twinkle | 2 | True | -1.14 | 2023-07-26 00:00:00 | 13:36:00 | B |
3 | little star | 2 | True | -1.14 | 2023-07-27 00:00:00 | 14:36:00 | B |
4 | how I wonder | 3 | False | -0.14 | 2023-07-28 00:00:00 | 15:36:00 | B |
5 | what you are | 4 | True | 0.86 | 2023-07-29 00:00:00 | 16:36:00 | B |
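As the docstring above notes, supplying sheet_name=None reads every sheet and returns a dict of DataFrames keyed by sheet name; a minimal sketch (assuming Book3.xlsx contains only 'Sheet1'):
sheets = pd.read_excel('./files/Book3.xlsx', sheet_name=None)
list(sheets.keys())  # e.g. ['Sheet1']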
In a .xlsx file, a column is normally formatted for the appropriate datatype, for example a datetime format, and the .xlsx file contains this formatting information. The information can be used by the read_excel function:
df = pd.read_excel('./files/Book3.xlsx', 0, parse_dates=True)
df
string | integer | boolean | floatingpoint | date | time | category | |
---|---|---|---|---|---|---|---|
0 | the fat black cat | 4 | True | 0.86 | 2023-07-24 11:30:00 | 11:36:00 | A |
1 | sat on the mat | 4 | True | 0.86 | 2023-07-25 00:00:00 | 12:36:00 | A |
2 | twinkle, twinkle | 2 | True | -1.14 | 2023-07-26 00:00:00 | 13:36:00 | B |
3 | little star | 2 | True | -1.14 | 2023-07-27 00:00:00 | 14:36:00 | B |
4 | how I wonder | 3 | False | -0.14 | 2023-07-28 00:00:00 | 15:36:00 | B |
5 | what you are | 4 | True | 0.86 | 2023-07-29 00:00:00 | 16:36:00 | B |
The date column was parsed as a date correctly but the time column was not parsed as a time column:
df.dtypes
string object integer int64 boolean bool floatingpoint float64 date datetime64[ns] time object category object dtype: object
Each value in the df['time'] pd.Series instance can be cast to a str instance:
df['time'].astype('str')
0 11:36:00 1 12:36:00 2 13:36:00 3 14:36:00 4 15:36:00 5 16:36:00 Name: time, dtype: object
The pd.Series method astype has limited support for converting to a timedelta64 and instead the pd function to_timedelta needs to be used:
pd.to_timedelta(df['time'].astype('str'))
0 0 days 11:36:00 1 0 days 12:36:00 2 0 days 13:36:00 3 0 days 14:36:00 4 0 days 15:36:00 5 0 days 16:36:00 Name: time, dtype: timedelta64[ns]
This can be reassigned to the pd.Series instance name df['time'], which updates the df instance in place:
df['time'] = pd.to_timedelta(df['time'].astype('str'))
df
string | integer | boolean | floatingpoint | date | time | category | |
---|---|---|---|---|---|---|---|
0 | the fat black cat | 4 | True | 0.86 | 2023-07-24 11:30:00 | 0 days 11:36:00 | A |
1 | sat on the mat | 4 | True | 0.86 | 2023-07-25 00:00:00 | 0 days 12:36:00 | A |
2 | twinkle, twinkle | 2 | True | -1.14 | 2023-07-26 00:00:00 | 0 days 13:36:00 | B |
3 | little star | 2 | True | -1.14 | 2023-07-27 00:00:00 | 0 days 14:36:00 | B |
4 | how I wonder | 3 | False | -0.14 | 2023-07-28 00:00:00 | 0 days 15:36:00 | B |
5 | what you are | 4 | True | 0.86 | 2023-07-29 00:00:00 | 0 days 16:36:00 | B |
df.dtypes
string object integer int64 boolean bool floatingpoint float64 date datetime64[ns] time timedelta64[ns] category object dtype: object
The df['category']
pd.Series
instance can be cast to a 'category'
datatype:
df['category'].astype('category')
0 A 1 A 2 B 3 B 4 B 5 B Name: category, dtype: category Categories (2, object): ['A', 'B']
df['category'] = df['category'].astype('category')
df
string | integer | boolean | floatingpoint | date | time | category | |
---|---|---|---|---|---|---|---|
0 | the fat black cat | 4 | True | 0.86 | 2023-07-24 11:30:00 | 0 days 11:36:00 | A |
1 | sat on the mat | 4 | True | 0.86 | 2023-07-25 00:00:00 | 0 days 12:36:00 | A |
2 | twinkle, twinkle | 2 | True | -1.14 | 2023-07-26 00:00:00 | 0 days 13:36:00 | B |
3 | little star | 2 | True | -1.14 | 2023-07-27 00:00:00 | 0 days 14:36:00 | B |
4 | how I wonder | 3 | False | -0.14 | 2023-07-28 00:00:00 | 0 days 15:36:00 | B |
5 | what you are | 4 | True | 0.86 | 2023-07-29 00:00:00 | 0 days 16:36:00 | B |
df.dtypes
string object integer int64 boolean bool floatingpoint float64 date datetime64[ns] time timedelta64[ns] category category dtype: object
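Once the datatype is 'category', the cat accessor becomes available; a minimal sketch listing the categories:
df['category'].cat.categories  # Index(['A', 'B'], dtype='object')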
To reorder the columns, indexing is quite commonly used:
df[['string', 'integer', 'boolean', 'floatingpoint', 'date', 'time', 'category']]
string | integer | boolean | floatingpoint | date | time | category | |
---|---|---|---|---|---|---|---|
0 | the fat black cat | 4 | True | 0.86 | 2023-07-24 11:30:00 | 0 days 11:36:00 | A |
1 | sat on the mat | 4 | True | 0.86 | 2023-07-25 00:00:00 | 0 days 12:36:00 | A |
2 | twinkle, twinkle | 2 | True | -1.14 | 2023-07-26 00:00:00 | 0 days 13:36:00 | B |
3 | little star | 2 | True | -1.14 | 2023-07-27 00:00:00 | 0 days 14:36:00 | B |
4 | how I wonder | 3 | False | -0.14 | 2023-07-28 00:00:00 | 0 days 15:36:00 | B |
5 | what you are | 4 | True | 0.86 | 2023-07-29 00:00:00 | 0 days 16:36:00 | B |
The instance name df
can be reassigned to this output:
df = df[['string', 'integer', 'boolean', 'floatingpoint', 'date', 'time', 'category']]
df
string | integer | boolean | floatingpoint | date | time | category | |
---|---|---|---|---|---|---|---|
0 | the fat black cat | 4 | True | 0.86 | 2023-07-24 11:30:00 | 0 days 11:36:00 | A |
1 | sat on the mat | 4 | True | 0.86 | 2023-07-25 00:00:00 | 0 days 12:36:00 | A |
2 | twinkle, twinkle | 2 | True | -1.14 | 2023-07-26 00:00:00 | 0 days 13:36:00 | B |
3 | little star | 2 | True | -1.14 | 2023-07-27 00:00:00 | 0 days 14:36:00 | B |
4 | how I wonder | 3 | False | -0.14 | 2023-07-28 00:00:00 | 0 days 15:36:00 | B |
5 | what you are | 4 | True | 0.86 | 2023-07-29 00:00:00 | 0 days 16:36:00 | B |
Writing DataFrames to Objects and Files¶
The DataFrame
class has a number of methods for writing to files:
for identifier in dir(df):
    if identifier.startswith('to_'):
        print(identifier)
to_clipboard to_csv to_dict to_excel to_feather to_gbq to_hdf to_html to_json to_latex to_markdown to_numpy to_orc to_parquet to_period to_pickle to_records to_sql to_stata to_string to_timestamp to_xarray to_xml
Python Dictionary¶
A pd.DataFrame
instance can be written to a dict
instance using:
df.to_dict()
{'string': {0: 'the fat black cat', 1: 'sat on the mat', 2: 'twinkle, twinkle', 3: 'little star', 4: 'how I wonder', 5: 'what you are'}, 'integer': {0: 4, 1: 4, 2: 2, 3: 2, 4: 3, 5: 4}, 'boolean': {0: True, 1: True, 2: True, 3: True, 4: False, 5: True}, 'floatingpoint': {0: 0.8599999999999999, 1: 0.8599999999999999, 2: -1.1400000000000001, 3: -1.1400000000000001, 4: -0.14000000000000012, 5: 0.8599999999999999}, 'date': {0: Timestamp('2023-07-24 11:30:00'), 1: Timestamp('2023-07-25 00:00:00'), 2: Timestamp('2023-07-26 00:00:00'), 3: Timestamp('2023-07-27 00:00:00'), 4: Timestamp('2023-07-28 00:00:00'), 5: Timestamp('2023-07-29 00:00:00')}, 'time': {0: Timedelta('0 days 11:36:00'), 1: Timedelta('0 days 12:36:00'), 2: Timedelta('0 days 13:36:00'), 3: Timedelta('0 days 14:36:00'), 4: Timedelta('0 days 15:36:00'), 5: Timedelta('0 days 16:36:00')}, 'category': {0: 'A', 1: 'A', 2: 'B', 3: 'B', 4: 'B', 5: 'B'}}
This is read into a pd.DataFrame
instance using:
pd.DataFrame(data=df.to_dict())
string | integer | boolean | floatingpoint | date | time | category | |
---|---|---|---|---|---|---|---|
0 | the fat black cat | 4 | True | 0.86 | 2023-07-24 11:30:00 | 0 days 11:36:00 | A |
1 | sat on the mat | 4 | True | 0.86 | 2023-07-25 00:00:00 | 0 days 12:36:00 | A |
2 | twinkle, twinkle | 2 | True | -1.14 | 2023-07-26 00:00:00 | 0 days 13:36:00 | B |
3 | little star | 2 | True | -1.14 | 2023-07-27 00:00:00 | 0 days 14:36:00 | B |
4 | how I wonder | 3 | False | -0.14 | 2023-07-28 00:00:00 | 0 days 15:36:00 | B |
5 | what you are | 4 | True | 0.86 | 2023-07-29 00:00:00 | 0 days 16:36:00 | B |
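to_dict also has an orient keyword input argument; as a minimal sketch, orient='records' returns a list with one dict per row instead of one dict per pd.Series:
df.to_dict(orient='records')[0]  # the first row as a dict of column name to value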
Comma Separated Values File¶
A pd.DataFrame
can be written to a file using:
df.to_csv('./files/Book4.csv')
Notice that the index was also written to the file, meaning that if this file is read back into a pd.DataFrame instance using the defaults, an unnamed pd.Series corresponding to that index is read in:
pd.read_csv('./files/Book4.csv')
Unnamed: 0 | string | integer | boolean | floatingpoint | date | time | category | |
---|---|---|---|---|---|---|---|---|
0 | 0 | the fat black cat | 4 | True | 0.86 | 2023-07-24 11:30:00 | 0 days 11:36:00 | A |
1 | 1 | sat on the mat | 4 | True | 0.86 | 2023-07-25 00:00:00 | 0 days 12:36:00 | A |
2 | 2 | twinkle, twinkle | 2 | True | -1.14 | 2023-07-26 00:00:00 | 0 days 13:36:00 | B |
3 | 3 | little star | 2 | True | -1.14 | 2023-07-27 00:00:00 | 0 days 14:36:00 | B |
4 | 4 | how I wonder | 3 | False | -0.14 | 2023-07-28 00:00:00 | 0 days 15:36:00 | B |
5 | 5 | what you are | 4 | True | 0.86 | 2023-07-29 00:00:00 | 0 days 16:36:00 | B |
This column can instead be assigned to the pd.Index instance associated with the pd.DataFrame using the keyword input argument index_col:
pd.read_csv('./files/Book4.csv', index_col=0)
string | integer | boolean | floatingpoint | date | time | category | |
---|---|---|---|---|---|---|---|
0 | the fat black cat | 4 | True | 0.86 | 2023-07-24 11:30:00 | 0 days 11:36:00 | A |
1 | sat on the mat | 4 | True | 0.86 | 2023-07-25 00:00:00 | 0 days 12:36:00 | A |
2 | twinkle, twinkle | 2 | True | -1.14 | 2023-07-26 00:00:00 | 0 days 13:36:00 | B |
3 | little star | 2 | True | -1.14 | 2023-07-27 00:00:00 | 0 days 14:36:00 | B |
4 | how I wonder | 3 | False | -0.14 | 2023-07-28 00:00:00 | 0 days 15:36:00 | B |
5 | what you are | 4 | True | 0.86 | 2023-07-29 00:00:00 | 0 days 16:36:00 | B |
Alternatively the pd.DataFrame
instance can be exported without the pd.Index
information:
df.to_csv?
Signature: df.to_csv( path_or_buf: 'FilePath | WriteBuffer[bytes] | WriteBuffer[str] | None' = None, *, sep: 'str' = ',', na_rep: 'str' = '', float_format: 'str | Callable | None' = None, columns: 'Sequence[Hashable] | None' = None, header: 'bool_t | list[str]' = True, index: 'bool_t' = True, index_label: 'IndexLabel | None' = None, mode: 'str' = 'w', encoding: 'str | None' = None, compression: 'CompressionOptions' = 'infer', quoting: 'int | None' = None, quotechar: 'str' = '"', lineterminator: 'str | None' = None, chunksize: 'int | None' = None, date_format: 'str | None' = None, doublequote: 'bool_t' = True, escapechar: 'str | None' = None, decimal: 'str' = '.', errors: 'OpenFileErrors' = 'strict', storage_options: 'StorageOptions | None' = None, ) -> 'str | None' Docstring: Write object to a comma-separated values (csv) file. Parameters ---------- path_or_buf : str, path object, file-like object, or None, default None String, path object (implementing os.PathLike[str]), or file-like object implementing a write() function. If None, the result is returned as a string. If a non-binary file object is passed, it should be opened with `newline=''`, disabling universal newlines. If a binary file object is passed, `mode` might need to contain a `'b'`. sep : str, default ',' String of length 1. Field delimiter for the output file. na_rep : str, default '' Missing data representation. float_format : str, Callable, default None Format string for floating point numbers. If a Callable is given, it takes precedence over other numeric formatting parameters, like decimal. columns : sequence, optional Columns to write. header : bool or list of str, default True Write out the column names. If a list of strings is given it is assumed to be aliases for the column names. index : bool, default True Write row names (index). index_label : str or sequence, or False, default None Column label for index column(s) if desired. If None is given, and `header` and `index` are True, then the index names are used. A sequence should be given if the object uses MultiIndex. If False do not print fields for index names. Use index_label=False for easier importing in R. mode : {'w', 'x', 'a'}, default 'w' Forwarded to either `open(mode=)` or `fsspec.open(mode=)` to control the file opening. Typical values include: - 'w', truncate the file first. - 'x', exclusive creation, failing if the file already exists. - 'a', append to the end of file if it exists. encoding : str, optional A string representing the encoding to use in the output file, defaults to 'utf-8'. `encoding` is not supported if `path_or_buf` is a non-binary file object. compression : str or dict, default 'infer' For on-the-fly compression of the output data. If 'infer' and 'path_or_buf' is path-like, then detect compression from the following extensions: '.gz', '.bz2', '.zip', '.xz', '.zst', '.tar', '.tar.gz', '.tar.xz' or '.tar.bz2' (otherwise no compression). Set to ``None`` for no compression. Can also be a dict with key ``'method'`` set to one of {``'zip'``, ``'gzip'``, ``'bz2'``, ``'zstd'``, ``'xz'``, ``'tar'``} and other key-value pairs are forwarded to ``zipfile.ZipFile``, ``gzip.GzipFile``, ``bz2.BZ2File``, ``zstandard.ZstdCompressor``, ``lzma.LZMAFile`` or ``tarfile.TarFile``, respectively. As an example, the following could be passed for faster compression and to create a reproducible gzip archive: ``compression={'method': 'gzip', 'compresslevel': 1, 'mtime': 1}``. .. versionadded:: 1.5.0 Added support for `.tar` files. 
May be a dict with key 'method' as compression mode and other entries as additional compression options if compression mode is 'zip'. Passing compression options as keys in dict is supported for compression modes 'gzip', 'bz2', 'zstd', and 'zip'. quoting : optional constant from csv module Defaults to csv.QUOTE_MINIMAL. If you have set a `float_format` then floats are converted to strings and thus csv.QUOTE_NONNUMERIC will treat them as non-numeric. quotechar : str, default '\"' String of length 1. Character used to quote fields. lineterminator : str, optional The newline character or character sequence to use in the output file. Defaults to `os.linesep`, which depends on the OS in which this method is called ('\\n' for linux, '\\r\\n' for Windows, i.e.). .. versionchanged:: 1.5.0 Previously was line_terminator, changed for consistency with read_csv and the standard library 'csv' module. chunksize : int or None Rows to write at a time. date_format : str, default None Format string for datetime objects. doublequote : bool, default True Control quoting of `quotechar` inside a field. escapechar : str, default None String of length 1. Character used to escape `sep` and `quotechar` when appropriate. decimal : str, default '.' Character recognized as decimal separator. E.g. use ',' for European data. errors : str, default 'strict' Specifies how encoding and decoding errors are to be handled. See the errors argument for :func:`open` for a full list of options. storage_options : dict, optional Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to ``urllib.request.Request`` as header options. For other URLs (e.g. starting with "s3://", and "gcs://") the key-value pairs are forwarded to ``fsspec.open``. Please see ``fsspec`` and ``urllib`` for more details, and for more examples on storage options refer `here <https://pandas.pydata.org/docs/user_guide/io.html? highlight=storage_options#reading-writing-remote-files>`_. Returns ------- None or str If path_or_buf is None, returns the resulting csv format as a string. Otherwise returns None. See Also -------- read_csv : Load a CSV file into a DataFrame. to_excel : Write DataFrame to an Excel file. Examples -------- Create 'out.csv' containing 'df' without indices >>> df = pd.DataFrame({'name': ['Raphael', 'Donatello'], ... 'mask': ['red', 'purple'], ... 'weapon': ['sai', 'bo staff']}) >>> df.to_csv('out.csv', index=False) # doctest: +SKIP Create 'out.zip' containing 'out.csv' >>> df.to_csv(index=False) 'name,mask,weapon\nRaphael,red,sai\nDonatello,purple,bo staff\n' >>> compression_opts = dict(method='zip', ... archive_name='out.csv') # doctest: +SKIP >>> df.to_csv('out.zip', index=False, ... compression=compression_opts) # doctest: +SKIP To write a csv file to a new folder or nested folder you will first need to create it using either Pathlib or os: >>> from pathlib import Path # doctest: +SKIP >>> filepath = Path('folder/subfolder/out.csv') # doctest: +SKIP >>> filepath.parent.mkdir(parents=True, exist_ok=True) # doctest: +SKIP >>> df.to_csv(filepath) # doctest: +SKIP >>> import os # doctest: +SKIP >>> os.makedirs('folder/subfolder', exist_ok=True) # doctest: +SKIP >>> df.to_csv('folder/subfolder/out.csv') # doctest: +SKIP File: c:\users\phili\anaconda3\envs\vscode-env\lib\site-packages\pandas\core\generic.py Type: method
df.to_csv('./files/Book5.csv', index=False)
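Reading this file back with the defaults confirms that no unnamed column appears and a fresh RangeIndex is generated; a minimal check:
pd.read_csv('./files/Book5.csv').axes[0]  # RangeIndex(start=0, stop=6, step=1)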
Tab Delimited Text File¶
To save to a text file, the separator needs to be specified:
df.to_csv('./files/Book6.txt', sep='\t', index=False)
This is read into a pd.DataFrame
instance using:
pd.read_csv('./files/Book6.txt', sep='\t')
string | integer | boolean | floatingpoint | date | time | category | |
---|---|---|---|---|---|---|---|
0 | the fat black cat | 4 | True | 0.86 | 2023-07-24 11:30:00 | 0 days 11:36:00 | A |
1 | sat on the mat | 4 | True | 0.86 | 2023-07-25 00:00:00 | 0 days 12:36:00 | A |
2 | twinkle, twinkle | 2 | True | -1.14 | 2023-07-26 00:00:00 | 0 days 13:36:00 | B |
3 | little star | 2 | True | -1.14 | 2023-07-27 00:00:00 | 0 days 14:36:00 | B |
4 | how I wonder | 3 | False | -0.14 | 2023-07-28 00:00:00 | 0 days 15:36:00 | B |
5 | what you are | 4 | True | 0.86 | 2023-07-29 00:00:00 | 0 days 16:36:00 | B |
JavaScript Object Notation¶
A pd.DataFrame
can be written to a JSON str
using:
df.to_json()
'{"string":{"0":"the fat black cat","1":"sat on the mat","2":"twinkle, twinkle","3":"little star","4":"how I wonder","5":"what you are"},"integer":{"0":4,"1":4,"2":2,"3":2,"4":3,"5":4},"boolean":{"0":true,"1":true,"2":true,"3":true,"4":false,"5":true},"floatingpoint":{"0":0.86,"1":0.86,"2":-1.14,"3":-1.14,"4":-0.14,"5":0.86},"date":{"0":1690198200000,"1":1690243200000,"2":1690329600000,"3":1690416000000,"4":1690502400000,"5":1690588800000},"time":{"0":41760000,"1":45360000,"2":48960000,"3":52560000,"4":56160000,"5":59760000},"category":{"0":"A","1":"A","2":"B","3":"B","4":"B","5":"B"}}'
Note that this is the same form that can be read back using the pd.read_json function:
pd.read_json(io.StringIO(df.to_json()))
string | integer | boolean | floatingpoint | date | time | category | |
---|---|---|---|---|---|---|---|
0 | the fat black cat | 4 | True | 0.86 | 2023-07-24 11:30:00 | 41760000 | A |
1 | sat on the mat | 4 | True | 0.86 | 2023-07-25 00:00:00 | 45360000 | A |
2 | twinkle, twinkle | 2 | True | -1.14 | 2023-07-26 00:00:00 | 48960000 | B |
3 | little star | 2 | True | -1.14 | 2023-07-27 00:00:00 | 52560000 | B |
4 | how I wonder | 3 | False | -0.14 | 2023-07-28 00:00:00 | 56160000 | B |
5 | what you are | 4 | True | 0.86 | 2023-07-29 00:00:00 | 59760000 | B |
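Note that the time column round-trips as integer milliseconds rather than a timedelta64. As a minimal sketch, it can be converted back using pd.to_timedelta with unit='ms':
pd.to_timedelta(pd.read_json(io.StringIO(df.to_json()))['time'], unit='ms')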
Microsoft Excel File¶
Recall that an Excel file has sheets. Suppose there are three pd.DataFrame instances to be written to three separate sheets. For convenience these can be created by indexing into the pd.DataFrame instance df with a list of str instances, where each str instance corresponds to the name of a pd.Series:
df
string | integer | boolean | floatingpoint | date | time | category | |
---|---|---|---|---|---|---|---|
0 | the fat black cat | 4 | True | 0.86 | 2023-07-24 11:30:00 | 0 days 11:36:00 | A |
1 | sat on the mat | 4 | True | 0.86 | 2023-07-25 00:00:00 | 0 days 12:36:00 | A |
2 | twinkle, twinkle | 2 | True | -1.14 | 2023-07-26 00:00:00 | 0 days 13:36:00 | B |
3 | little star | 2 | True | -1.14 | 2023-07-27 00:00:00 | 0 days 14:36:00 | B |
4 | how I wonder | 3 | False | -0.14 | 2023-07-28 00:00:00 | 0 days 15:36:00 | B |
5 | what you are | 4 | True | 0.86 | 2023-07-29 00:00:00 | 0 days 16:36:00 | B |
df1 = df[['string', 'integer', 'time']]
df1
string | integer | time | |
---|---|---|---|
0 | the fat black cat | 4 | 0 days 11:36:00 |
1 | sat on the mat | 4 | 0 days 12:36:00 |
2 | twinkle, twinkle | 2 | 0 days 13:36:00 |
3 | little star | 2 | 0 days 14:36:00 |
4 | how I wonder | 3 | 0 days 15:36:00 |
5 | what you are | 4 | 0 days 16:36:00 |
df2 = df[['string', 'integer', 'boolean']]
df2
string | integer | boolean | |
---|---|---|---|
0 | the fat black cat | 4 | True |
1 | sat on the mat | 4 | True |
2 | twinkle, twinkle | 2 | True |
3 | little star | 2 | True |
4 | how I wonder | 3 | False |
5 | what you are | 4 | True |
df3 = df[['string', 'date', 'category']]
df3
string | date | category | |
---|---|---|---|
0 | the fat black cat | 2023-07-24 11:30:00 | A |
1 | sat on the mat | 2023-07-25 00:00:00 | A |
2 | twinkle, twinkle | 2023-07-26 00:00:00 | B |
3 | little star | 2023-07-27 00:00:00 | B |
4 | how I wonder | 2023-07-28 00:00:00 | B |
5 | what you are | 2023-07-29 00:00:00 | B |
The pd.DataFrame
method to_excel
allows the writing of multiple pd.DataFrame
instances to individual sheets within an Excel File:
df.to_excel?
Signature: df.to_excel( excel_writer: 'FilePath | WriteExcelBuffer | ExcelWriter', *, sheet_name: 'str' = 'Sheet1', na_rep: 'str' = '', float_format: 'str | None' = None, columns: 'Sequence[Hashable] | None' = None, header: 'Sequence[Hashable] | bool_t' = True, index: 'bool_t' = True, index_label: 'IndexLabel | None' = None, startrow: 'int' = 0, startcol: 'int' = 0, engine: "Literal['openpyxl', 'xlsxwriter'] | None" = None, merge_cells: 'bool_t' = True, inf_rep: 'str' = 'inf', freeze_panes: 'tuple[int, int] | None' = None, storage_options: 'StorageOptions | None' = None, engine_kwargs: 'dict[str, Any] | None' = None, ) -> 'None' Docstring: Write object to an Excel sheet. To write a single object to an Excel .xlsx file it is only necessary to specify a target file name. To write to multiple sheets it is necessary to create an `ExcelWriter` object with a target file name, and specify a sheet in the file to write to. Multiple sheets may be written to by specifying unique `sheet_name`. With all data written to the file it is necessary to save the changes. Note that creating an `ExcelWriter` object with a file name that already exists will result in the contents of the existing file being erased. Parameters ---------- excel_writer : path-like, file-like, or ExcelWriter object File path or existing ExcelWriter. sheet_name : str, default 'Sheet1' Name of sheet which will contain DataFrame. na_rep : str, default '' Missing data representation. float_format : str, optional Format string for floating point numbers. For example ``float_format="%.2f"`` will format 0.1234 to 0.12. columns : sequence or list of str, optional Columns to write. header : bool or list of str, default True Write out the column names. If a list of string is given it is assumed to be aliases for the column names. index : bool, default True Write row names (index). index_label : str or sequence, optional Column label for index column(s) if desired. If not specified, and `header` and `index` are True, then the index names are used. A sequence should be given if the DataFrame uses MultiIndex. startrow : int, default 0 Upper left cell row to dump data frame. startcol : int, default 0 Upper left cell column to dump data frame. engine : str, optional Write engine to use, 'openpyxl' or 'xlsxwriter'. You can also set this via the options ``io.excel.xlsx.writer`` or ``io.excel.xlsm.writer``. merge_cells : bool, default True Write MultiIndex and Hierarchical Rows as merged cells. inf_rep : str, default 'inf' Representation for infinity (there is no native representation for infinity in Excel). freeze_panes : tuple of int (length 2), optional Specifies the one-based bottommost row and rightmost column that is to be frozen. storage_options : dict, optional Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to ``urllib.request.Request`` as header options. For other URLs (e.g. starting with "s3://", and "gcs://") the key-value pairs are forwarded to ``fsspec.open``. Please see ``fsspec`` and ``urllib`` for more details, and for more examples on storage options refer `here <https://pandas.pydata.org/docs/user_guide/io.html? highlight=storage_options#reading-writing-remote-files>`_. .. versionadded:: 1.2.0 engine_kwargs : dict, optional Arbitrary keyword arguments passed to excel engine. See Also -------- to_csv : Write DataFrame to a comma-separated values (csv) file. 
ExcelWriter : Class for writing DataFrame objects into excel sheets. read_excel : Read an Excel file into a pandas DataFrame. read_csv : Read a comma-separated values (csv) file into DataFrame. io.formats.style.Styler.to_excel : Add styles to Excel sheet. Notes ----- For compatibility with :meth:`~DataFrame.to_csv`, to_excel serializes lists and dicts to strings before writing. Once a workbook has been saved it is not possible to write further data without rewriting the whole workbook. Examples -------- Create, write to and save a workbook: >>> df1 = pd.DataFrame([['a', 'b'], ['c', 'd']], ... index=['row 1', 'row 2'], ... columns=['col 1', 'col 2']) >>> df1.to_excel("output.xlsx") # doctest: +SKIP To specify the sheet name: >>> df1.to_excel("output.xlsx", ... sheet_name='Sheet_name_1') # doctest: +SKIP If you wish to write to more than one sheet in the workbook, it is necessary to specify an ExcelWriter object: >>> df2 = df1.copy() >>> with pd.ExcelWriter('output.xlsx') as writer: # doctest: +SKIP ... df1.to_excel(writer, sheet_name='Sheet_name_1') ... df2.to_excel(writer, sheet_name='Sheet_name_2') ExcelWriter can also be used to append to an existing Excel file: >>> with pd.ExcelWriter('output.xlsx', ... mode='a') as writer: # doctest: +SKIP ... df1.to_excel(writer, sheet_name='Sheet_name_3') To set the library that is used to write the Excel file, you can pass the `engine` keyword (the default engine is automatically chosen depending on the file extension): >>> df1.to_excel('output1.xlsx', engine='xlsxwriter') # doctest: +SKIP File: c:\users\phili\anaconda3\envs\vscode-env\lib\site-packages\pandas\core\generic.py Type: method
To write pd.DataFrame instances to multiple sheets, a pd.ExcelWriter instance has to be instantiated, which creates a blank Excel File:
writer = pd.ExcelWriter(path='./files/Book7.xlsx')
The DataFrame method to_excel can then be used to instruct the ExcelWriter instance to write each pd.DataFrame instance to a specified sheet:
df1.to_excel(excel_writer=writer, sheet_name='sheet1')
df2.to_excel(excel_writer=writer, sheet_name='sheet2')
df3.to_excel(excel_writer=writer, sheet_name='sheet3')
Details about the sheets being written can be seen using the ExcelWriter
attribute sheets
which is a mapping where the key is the sheet name and the value is the sheet being written:
writer.sheets
{'sheet1': <xlsxwriter.worksheet.Worksheet at 0x1aaadbe8f80>, 'sheet2': <xlsxwriter.worksheet.Worksheet at 0x1aaad6cec60>, 'sheet3': <xlsxwriter.worksheet.Worksheet at 0x1aaada6c050>}
Finally the ExcelWriter instance can be closed. This saves the file and releases the Excel spreadsheet from Python:
writer.close()
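The sheets written above can be checked by reading one back by name; a minimal sketch (recall the index was included when writing, so an unnamed column is expected):
pd.read_excel('./files/Book7.xlsx', sheet_name='sheet2')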
The identifiers of the ExcelWriter
class can be examined:
dir2(pd.ExcelWriter, drop_internal=True)
{'attribute': ['book', 'date_format', 'datetime_format', 'engine', 'if_sheet_exists', 'sheets', 'supported_extensions'], 'method': ['check_extension', 'close'], 'datamodel_attribute': ['__annotations__', '__dict__', '__doc__', '__module__', '__orig_bases__', '__parameters__', '__weakref__'], 'datamodel_method': ['__class__', '__class_getitem__', '__delattr__', '__dir__', '__enter__', '__eq__', '__exit__', '__format__', '__fspath__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__']}
Notice it has the datamodel identifiers __enter__ and __exit__ which means it can be used within a with code block. The with code block will automatically close the ExcelWriter instance when the block ends and is the safest way to create the file, write multiple sheets to it and close the file:
with pd.ExcelWriter('./files/Book8.xlsx') as writer:
df1.to_excel(writer, sheet_name='df1', index=False)
df2.to_excel(writer, sheet_name='df2', index=False)
df3.to_excel(writer, sheet_name='df3', index=False)
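The docstring above also shows that an ExcelWriter can append to an existing workbook with mode='a'; a hedged sketch, assuming the openpyxl engine is installed (appending requires it) and that the sheet name 'df_all' does not already exist in the workbook:
with pd.ExcelWriter('./files/Book8.xlsx', mode='a', engine='openpyxl') as writer:
    df.to_excel(writer, sheet_name='df_all', index=False)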
Markdown¶
A pd.DataFrame
instance can be exported into a markdown str
using:
df.to_markdown()
'| | string | integer | boolean | floatingpoint | date | time | category |\n|---:|:------------------|----------:|:----------|----------------:|:--------------------|:----------------|:-----------|\n| 0 | the fat black cat | 4 | True | 0.86 | 2023-07-24 11:30:00 | 0 days 11:36:00 | A |\n| 1 | sat on the mat | 4 | True | 0.86 | 2023-07-25 00:00:00 | 0 days 12:36:00 | A |\n| 2 | twinkle, twinkle | 2 | True | -1.14 | 2023-07-26 00:00:00 | 0 days 13:36:00 | B |\n| 3 | little star | 2 | True | -1.14 | 2023-07-27 00:00:00 | 0 days 14:36:00 | B |\n| 4 | how I wonder | 3 | False | -0.14 | 2023-07-28 00:00:00 | 0 days 15:36:00 | B |\n| 5 | what you are | 4 | True | 0.86 | 2023-07-29 00:00:00 | 0 days 16:36:00 | B |'
When this is printed, it has the following form:
print(df.to_markdown())
| | string | integer | boolean | floatingpoint | date | time | category | |---:|:------------------|----------:|:----------|----------------:|:--------------------|:----------------|:-----------| | 0 | the fat black cat | 4 | True | 0.86 | 2023-07-24 11:30:00 | 0 days 11:36:00 | A | | 1 | sat on the mat | 4 | True | 0.86 | 2023-07-25 00:00:00 | 0 days 12:36:00 | A | | 2 | twinkle, twinkle | 2 | True | -1.14 | 2023-07-26 00:00:00 | 0 days 13:36:00 | B | | 3 | little star | 2 | True | -1.14 | 2023-07-27 00:00:00 | 0 days 14:36:00 | B | | 4 | how I wonder | 3 | False | -0.14 | 2023-07-28 00:00:00 | 0 days 15:36:00 | B | | 5 | what you are | 4 | True | 0.86 | 2023-07-29 00:00:00 | 0 days 16:36:00 | B |
And when this printed format is copied into a markdown cell, it looks like:
string | integer | boolean | floatingpoint | date | time | category | |
---|---|---|---|---|---|---|---|
0 | the fat black cat | 4 | True | 0.86 | 24/07/2023 | 11:36:00 | A |
1 | sat on the mat | 4 | True | 0.86 | 25/07/2023 | 12:36:00 | A |
2 | twinkle, twinkle | 2 | True | -1.14 | 26/07/2023 | 13:36:00 | B |
3 | little star | 2 | True | -1.14 | 27/07/2023 | 14:36:00 | B |
4 | how I wonder | 3 | False | -0.14 | 28/07/2023 | 15:36:00 | B |
5 | what you are | 4 | True | 0.86 | 29/07/2023 | 16:36:00 | B |
Series Identifiers¶
The identifiers for a pd.Series
are:
dir2(pd.Series, drop_internal=True, exclude_identifier_list=obsolete.series_depreciation)
{'attribute': ['array', 'at', 'attrs', 'axes', 'dtype', 'dtypes', 'empty', 'flags', 'hasnans', 'iat', 'iloc', 'index', 'is_monotonic_decreasing', 'is_monotonic_increasing', 'is_unique', 'loc', 'name', 'nbytes', 'ndim', 'shape', 'size', 'values'], 'constant': ['T'], 'method': ['abs', 'add', 'add_prefix', 'add_suffix', 'agg', 'aggregate', 'align', 'all', 'any', 'apply', 'argmax', 'argmin', 'argsort', 'asfreq', 'asof', 'astype', 'at_time', 'autocorr', 'between', 'between_time', 'bfill', 'case_when', 'clip', 'combine', 'combine_first', 'compare', 'convert_dtypes', 'copy', 'corr', 'count', 'cov', 'cummax', 'cummin', 'cumprod', 'cumsum', 'describe', 'diff', 'div', 'divide', 'divmod', 'dot', 'drop', 'drop_duplicates', 'droplevel', 'dropna', 'duplicated', 'eq', 'equals', 'ewm', 'expanding', 'explode', 'factorize', 'ffill', 'fillna', 'filter', 'first', 'first_valid_index', 'floordiv', 'ge', 'get', 'groupby', 'gt', 'head', 'hist', 'idxmax', 'idxmin', 'infer_objects', 'info', 'interpolate', 'isin', 'isna', 'isnull', 'item', 'items', 'keys', 'kurt', 'kurtosis', 'last', 'last_valid_index', 'le', 'lt', 'map', 'mask', 'max', 'mean', 'median', 'memory_usage', 'min', 'mod', 'mode', 'mul', 'multiply', 'ne', 'nlargest', 'notna', 'notnull', 'nsmallest', 'nunique', 'pad', 'pct_change', 'pipe', 'pop', 'pow', 'prod', 'product', 'quantile', 'radd', 'rank', 'rdiv', 'rdivmod', 'reindex', 'reindex_like', 'rename', 'rename_axis', 'reorder_levels', 'repeat', 'replace', 'resample', 'reset_index', 'rfloordiv', 'rmod', 'rmul', 'rolling', 'round', 'rpow', 'rsub', 'rtruediv', 'sample', 'searchsorted', 'sem', 'set_axis', 'set_flags', 'shift', 'skew', 'sort_index', 'sort_values', 'squeeze', 'std', 'sub', 'subtract', 'sum', 'swaplevel', 'tail', 'take', 'to_clipboard', 'to_csv', 'to_dict', 'to_excel', 'to_frame', 'to_hdf', 'to_json', 'to_latex', 'to_list', 'to_markdown', 'to_numpy', 'to_period', 'to_pickle', 'to_sql', 'to_string', 'to_timestamp', 'to_xarray', 'tolist', 'transform', 'transpose', 'truediv', 'truncate', 'tz_convert', 'tz_localize', 'unique', 'unstack', 'update', 'value_counts', 'var', 'where', 'xs'], 'lower_class': ['cat', 'dt', 'list', 'plot', 'sparse', 'str', 'struct'], 'datamodel_attribute': ['__annotations__', '__array_priority__', '__dict__', '__doc__', '__hash__', '__module__', '__pandas_priority__', '__weakref__'], 'datamodel_method': ['__abs__', '__add__', '__and__', '__array__', '__array_ufunc__', '__bool__', '__class__', '__column_consortium_standard__', '__contains__', '__copy__', '__deepcopy__', '__delattr__', '__delitem__', '__dir__', '__divmod__', '__eq__', '__finalize__', '__float__', '__floordiv__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__iadd__', '__iand__', '__ifloordiv__', '__imod__', '__imul__', '__init__', '__init_subclass__', '__int__', '__invert__', '__ior__', '__ipow__', '__isub__', '__iter__', '__itruediv__', '__ixor__', '__le__', '__len__', '__lt__', '__matmul__', '__mod__', '__mul__', '__ne__', '__neg__', '__new__', '__nonzero__', '__or__', '__pos__', '__pow__', '__radd__', '__rand__', '__rdivmod__', '__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__', '__rmatmul__', '__rmod__', '__rmul__', '__ror__', '__round__', '__rpow__', '__rsub__', '__rtruediv__', '__rxor__', '__setattr__', '__setitem__', '__setstate__', '__sizeof__', '__str__', '__sub__', '__subclasshook__', '__truediv__', '__xor__']}
The initialisation signature for a pd.Series
can be examined:
pd.Series?
Init signature: pd.Series( data=None, index=None, dtype: 'Dtype | None' = None, name=None, copy: 'bool | None' = None, fastpath: 'bool | lib.NoDefault' = <no_default>, ) -> 'None' Docstring: One-dimensional ndarray with axis labels (including time series). Labels need not be unique but must be a hashable type. The object supports both integer- and label-based indexing and provides a host of methods for performing operations involving the index. Statistical methods from ndarray have been overridden to automatically exclude missing data (currently represented as NaN). Operations between Series (+, -, /, \*, \*\*) align values based on their associated index values-- they need not be the same length. The result index will be the sorted union of the two indexes. Parameters ---------- data : array-like, Iterable, dict, or scalar value Contains data stored in Series. If data is a dict, argument order is maintained. index : array-like or Index (1d) Values must be hashable and have the same length as `data`. Non-unique index values are allowed. Will default to RangeIndex (0, 1, 2, ..., n) if not provided. If data is dict-like and index is None, then the keys in the data are used as the index. If the index is not None, the resulting Series is reindexed with the index values. dtype : str, numpy.dtype, or ExtensionDtype, optional Data type for the output Series. If not specified, this will be inferred from `data`. See the :ref:`user guide <basics.dtypes>` for more usages. name : Hashable, default None The name to give to the Series. copy : bool, default False Copy input data. Only affects Series or 1d ndarray input. See examples. Notes ----- Please reference the :ref:`User Guide <basics.series>` for more information. Examples -------- Constructing Series from a dictionary with an Index specified >>> d = {'a': 1, 'b': 2, 'c': 3} >>> ser = pd.Series(data=d, index=['a', 'b', 'c']) >>> ser a 1 b 2 c 3 dtype: int64 The keys of the dictionary match with the Index values, hence the Index values have no effect. >>> d = {'a': 1, 'b': 2, 'c': 3} >>> ser = pd.Series(data=d, index=['x', 'y', 'z']) >>> ser x NaN y NaN z NaN dtype: float64 Note that the Index is first build with the keys from the dictionary. After this the Series is reindexed with the given Index values, hence we get all NaN as a result. Constructing Series from a list with `copy=False`. >>> r = [1, 2] >>> ser = pd.Series(r, copy=False) >>> ser.iloc[0] = 999 >>> r [1, 2] >>> ser 0 999 1 2 dtype: int64 Due to input data type the Series has a `copy` of the original data even though `copy=False`, so the data is unchanged. Constructing Series from a 1d ndarray with `copy=False`. >>> r = np.array([1, 2]) >>> ser = pd.Series(r, copy=False) >>> ser.iloc[0] = 999 >>> r array([999, 2]) >>> ser 0 999 1 2 dtype: int64 Due to input data type the Series has a `view` on the original data, so the data is changed as well. File: c:\users\phili\anaconda3\envs\vscode-env\lib\site-packages\pandas\core\series.py Type: type Subclasses: SubclassedSeries
A pd.Series can be instantiated by supplying the data parameter with a list of values and providing a name:
s = pd.Series([4, 3, 2, 2], index=['a', 'b', 'c', 'd'], name='s')
t = pd.Series(name='t')
u = pd.Series([object(), 'hello', 1, None], name='u')
v = pd.Series([1.1, 2.2, 3.1, 4.1], name='v')
w = pd.Series([-1, 2, -3, 4], name='w')
x = pd.Series([1.1, -2.1, -2.1, None], name='x')
y = pd.Series(['A', 'A', 'B', 'B'], index=['a', 'b', 'c', 'd'], dtype='category', name='y')
z = pd.Series(['A', 'B', 'C', 'D'], index=['a', 'b', 'c', 'd'], name='z')
Attributes¶
When a pd.Series is displayed in the cell output, the values, index, name and dtype are shown:
v
0 1.1 1 2.2 2 3.1 3 4.1 Name: v, dtype: float64
These are also separately available as attributes:
v.index
RangeIndex(start=0, stop=4, step=1)
v.values
array([1.1, 2.2, 3.1, 4.1])
v.dtype
dtype('float64')
v.name
'v'
A pd.Series is based on a 1d np.ndarray and has dimension-related attributes. It always has a dimension of 1:
v.ndim
1
The size
attribute gives the number of values:
v.size
4
The shape attribute is a 1-element tuple containing the size:
v.shape
(4,)
Boolean Attributes¶
The pd.Series
has a number of bool
attributes. The hasnans
attribute returns True
if any value is None
or NaN
and False
otherwise:
v = pd.Series([1.1, 2.2, 3.1, 4.1], name='v')
x = pd.Series([1.1, -2.1, -2.1, None], name='x', dtype=float)
v.hasnans
False
x.hasnans
True
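hasnans should be equivalent to checking whether any value is NA; a minimal sketch:
x.isna().any()  # True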
The empty
attribute returns True
if there are no values and False
otherwise:
t = pd.Series(name='t')
u = pd.Series([[], [], [], []], name='u')
t.empty
True
u.empty
False
The is_unique
attribute returns True
if all values are unique and False
otherwise:
v = pd.Series([1.1, 2.2, 3.1, 4.1], name='v')
x = pd.Series([1.1, -2.1, -2.1, None], name='x', dtype=float)
v.is_unique
True
x.is_unique
False
The attribute is_monotonic_increasing returns True if each value is greater than or equal to the previous value and False otherwise. Likewise the attribute is_monotonic_decreasing returns True if each value is less than or equal to the previous value and False otherwise:
s = pd.Series([4, 3, 2, 2], index=['a', 'b', 'c', 'd'], name='s')
v = pd.Series([1.1, 2.2, 3.1, 4.1], name='v')
w = pd.Series([-1.1, 2.1, -3.2, 4.1], name='w')
y = pd.Series(['A', 'A', 'B', 'B'], index=['a', 'b', 'c', 'd'], name='y')
v.is_monotonic_increasing
True
w.is_monotonic_increasing
False
y.is_monotonic_increasing
True
s.is_monotonic_decreasing
True
Indexing¶
The pd.Series
instance v
has a default pd.RangeIndex
:
v
0 1.1 1 2.2 2 3.1 3 4.1 Name: v, dtype: float64
A row can be accessed from the pd.Series
by indexing into it using square brackets and providing the int
value from the index
:
v[0]
1.1
Indexing of rows can also be done using a slice:
v[0:2]
0 1.1 1 2.2 Name: v, dtype: float64
A row can also be accessed using the attribute iloc (integer location) followed by square brackets and the row number:
v.iloc[0]
1.1
Since iloc always expects a numeric value it can also be used for slicing:
v.iloc[0:3]
0 1.1 1 2.2 2 3.1 Name: v, dtype: float64
A single element can be accessed using iat (integer at):
v.iat[0]
1.1
Because iat is used only to select a single element, a slice cannot be used.
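Like a list, iloc also accepts negative positions counting from the end; a minimal sketch:
v.iloc[-1]  # 4.1, the last value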
The pd.Series
instance z
has a pd.Index
where each index value has a name
:
z
a A b B c C d D Name: z, dtype: object
A row can be accessed from the pd.Series
by indexing into it using square brackets and providing the str
value from the index
:
z['a']
'A'
Label-based slicing such as `z['a':'c']` is also supported (note the stop label is inclusive), and a `list` of these `str` instances can be provided:
z[['a', 'c']]
a A c C Name: z, dtype: object
When the `name` of one of the values from the `index` is a valid identifier:
'a'.isidentifier()
True
It is available as an attribute of the `pd.Series` instance:
z.a
'A'
The `loc` (location) attribute can be used with a `str` instance that corresponds to the `name` of an index value:
z.loc['a']
'A'
A `list` of `str` instances, each corresponding to a `name` of a value from the `index`, can also be provided:
z.loc[['a','c']]
a A c C Name: z, dtype: object
The `at` attribute can be used to read a single element using a `str` instance corresponding to a `name` of a value from the `index`:
z.at['a']
'A'
Although `z` has a `pd.Index` where each value has an associated `name`, indexing can still be done numerically using `iloc` with a single numeric value or slice, or `iat` with a single numeric value:
z
a A b B c C d D Name: z, dtype: object
z.iloc[0:2]
a A b B Name: z, dtype: object
z.iat[0]
'A'
When a `pd.RangeIndex` is used:
v
0 1.1 1 2.2 2 3.1 3 4.1 Name: v, dtype: float64
The attributes `loc` and `at` can still be used. `loc` can take a single numeric value, a `list` of numeric values or a label-based slice (inclusive of the stop label); `at` can take a single numeric value only:
v.loc[[0, 1]]
0 1.1 1 2.2 Name: v, dtype: float64
v.at[0]
1.1
Normally for such a scenario `iloc` and `iat` would be used.
String Methods¶
When the data supplied is numeric or `None`, its datatype will be determined to be numeric, in this case `'float64'` corresponding to the `np.float64` class:
x = pd.Series([1.1, -2.1, -2.1, None], name='x')
x.dtype
dtype('float64')
When the data includes a non-numeric `object`, its datatype will be determined to be `'O'` meaning `object`:
u = pd.Series([object(), 'hello', 1, None], name='u')
u.dtype
dtype('O')
There is unfortunately no `str` datatype. Therefore when the data supplied are all `str` instances, the datatype is determined to be `'O'` meaning `object`:
z = pd.Series(['A', 'B', 'C', 'D'], index=['a', 'b', 'c', 'd'], name='z')
z.dtype
dtype('O')
The `apply` method can be used to apply an anonymous lambda function which will check whether each value is a `str`:
z.apply(lambda value: isinstance(value, str))
a True b True c True d True Name: z, dtype: bool
The `all` method can be applied on the `pd.Series` which has a `bool` `dtype` to return a scalar `bool` that is `True` if all values are `str` instances:
z.apply(lambda value: isinstance(value, str)).all()
True
When all values in the `pd.Series` are `str` instances, the `str` class is accessible as an attribute and groups together identifiers that are consistent with the builtins class `str`:
dir2(z.str, drop_internal=True)
{'method': ['capitalize', 'casefold', 'cat', 'center', 'contains', 'count', 'decode', 'encode', 'endswith', 'extract', 'extractall', 'find', 'findall', 'fullmatch', 'get', 'get_dummies', 'index', 'isalnum', 'isalpha', 'isdecimal', 'isdigit', 'islower', 'isnumeric', 'isspace', 'istitle', 'isupper', 'join', 'len', 'ljust', 'lower', 'lstrip', 'match', 'normalize', 'pad', 'partition', 'removeprefix', 'removesuffix', 'repeat', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'slice', 'slice_replace', 'split', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'wrap', 'zfill'], 'datamodel_attribute': ['__annotations__', '__dict__', '__doc__', '__frozen', '__module__', '__weakref__'], 'datamodel_method': ['__class__', '__delattr__', '__dir__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__']}
A `str` method such as `zfill` can be used:
z.str.zfill(10)
a 000000000A b 000000000B c 000000000C d 000000000D Name: z, dtype: object
Instead of the datamodel methods, some of the builtins functions such as `len` are also listed as a `str` method:
z.str.zfill(10).str.len()
a 10 b 10 c 10 d 10 Name: z, dtype: int64
Notice the above syntax starts with a `pd.Series` instance `z` where each value is a `str` instance. It uses the `str` class to access the method `zfill`; because this is a method it requires parenthesis for the function call and is supplied the argument `10`:
z.str.zfill(10)
This returns a new `pd.Series` instance, here called `another_series`, where each value is once again a `str` instance. It then uses the `str` class again to access the method `len`; because this is a method it requires parenthesis for the function call. This time no additional data is required as the method operates on the value provided, which can be thought of as `self`.
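For clarity, the illustrative name `another_series` can be assigned explicitly:
another_series = z.str.zfill(10)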
another_series.str.len()
Stacked this looks like:
z.str.zfill(10).str.len()
Because `len` is supplied as a `str` method, it is not necessary to apply `len` using an anonymous `lambda` function:
z.str.zfill(10).apply(lambda value: len(value))
a 10 b 10 c 10 d 10 Name: z, dtype: int64
Note when `len` is applied to the `pd.Series` instance itself, it will return the length of the `pd.Series`:
len(z)
4
An anonymous `lambda` function can be used to apply an operator. Recall for a `str`, the `+` operator is set up for concatenation and the `*` operator is set up for replication using an `int` instance:
z.apply(lambda value: 5 * value)
a AAAAA b BBBBB c CCCCC d DDDDD Name: z, dtype: object
When every value in a `pd.Series` is a `str` instance, the `__mul__` and `__add__` datamodel operators are broadcast and behave like they do with `str` instances. This means the `*` operator can be used to replicate a `str` using an `int`:
5 * z
a AAAAA b BBBBB c CCCCC d DDDDD Name: z, dtype: object
In addition to a scalar, a `pd.Series` of equal length that has a matching `index` and a `dtype` of `int` can be used:
s = pd.Series([4, 3, 2, 2], index=['a', 'b', 'c', 'd'], name='s')
z * s
a AAAA b BBB c CC d DD dtype: object
This is commutative because the `str` class has the `__rmul__` datamodel identifier defined:
s * z
a AAAA b BBB c CC d DD dtype: object
The `+` operator can be used to concatenate another `str` instance:
z + ' world!'
a A world! b B world! c C world! d D world! Name: z, dtype: object
A `pd.Series` of equal length that has a matching `index` and a `dtype` of `str` can also be concatenated:
z + z
a AA b BB c CC d DD Name: z, dtype: object
Numeric Methods¶
When the values in a `pd.Series` instance are numeric, the numeric datamodel operators are broadcast:
v = pd.Series([1.1, 2.2, 3.1, 4.1], name='v')
w = pd.Series([-1, 2, -3, 4], name='w')
x = pd.Series([1.1, -2.1, -2.1, None], name='x')
This allows similar numeric operations as seen for a 1d `np.ndarray`:
v * 5
0 5.5 1 11.0 2 15.5 3 20.5 Name: v, dtype: float64
v + w
0 0.1 1 4.2 2 0.1 3 8.1 dtype: float64
Notice that any numeric operation involving `None` displays `NaN` (`np.nan`):
-x + w
0 -2.1 1 4.1 2 -0.9 3 NaN dtype: float64
The datamodel method `__abs__` is defined, which means the builtins function `abs` can be used to retrieve the absolute values (stripping the sign):
abs(w)
0 1 1 2 2 3 3 4 Name: w, dtype: int64
This is also available as a method for convenience:
w.abs()
0 1 1 2 2 3 3 4 Name: w, dtype: int64
Many of the functions in the `statistics` module are broadcast across a `pd.Series`. The `statistics` module can be imported:
import statistics
If `w` is examined:
w = pd.Series([-1, 2, -3, 4], name='w')
The mean can be calculated manually using:
w.sum() / len(w)
0.5
Or using the statistical function `mean`:
statistics.mean(w)
0.5
Commonly used statistical functions are available as methods:
w.mean()
0.5
Then the difference from the mean can be calculated:
w - w.mean()
0 -1.5 1 1.5 2 -3.5 3 3.5 Name: w, dtype: float64
Squared:
(w - w.mean()) ** 2
0 2.25 1 2.25 2 12.25 3 12.25 Name: w, dtype: float64
Summed and divided by the length of `w` minus 1, giving the sample variance:
sum((w - w.mean()) ** 2) / (len(w) - 1)
9.666666666666666
The square root of this can be taken to get the standard deviation:
(sum((w - w.mean()) ** 2) / (len(w) - 1)) ** 0.5
3.1091263510296048
Alternatively the `statistics.stdev` function can be used:
statistics.stdev(w)
3.109126351029605
This is available as a method with the slightly different name `std`:
w.std()
3.1091263510296048
Outwith the commonly used numeric datamodel identifiers and statistical functions, mathematical functions are carried out on a `pd.Series` instance by using the relevant `np` function. This supplies the `values` to the `np` function in the form of an `np.ndarray` and then returns a `pd.Series` of the new values using the existing `pd.Series` `name`:
v = pd.Series([1.1, 2.2, 3.1, 4.1], name='v')
np.floor(v)
0 1.0 1 2.0 2 3.0 3 4.0 Name: v, dtype: float64
Category Methods¶
A `pd.Series` can have a `'category'` `dtype`:
y = pd.Series(['A', 'A', 'B', 'B'], index=['a', 'b', 'c', 'd'], dtype='category', name='y')
y
a A b A c B d B Name: y, dtype: category Categories (2, object): ['A', 'B']
A `pd.Series` of the `'category'` `dtype` has access to the `cat` class which groups together category based identifiers:
dir2(y.cat, drop_internal=True)
{'attribute': ['categories', 'codes', 'ordered'], 'method': ['add_categories', 'as_ordered', 'as_unordered', 'remove_categories', 'remove_unused_categories', 'rename_categories', 'reorder_categories', 'set_categories'], 'datamodel_attribute': ['__annotations__', '__dict__', '__doc__', '__frozen', '__module__', '__weakref__'], 'datamodel_method': ['__class__', '__delattr__', '__dir__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__']}
For example the categories can be set to ordered using the `cat` method `as_ordered`:
y.cat.as_ordered()
a A b A c B d B Name: y, dtype: category Categories (2, object): ['A' < 'B']
The order can be changed using the `cat` method `reorder_categories`; for example `'A'` can be greater than `'B'` in a test result. Once again identifiers can be stacked to achieve this purpose:
y.cat.as_ordered().cat.reorder_categories(['B', 'A'])
a A b A c B d B Name: y, dtype: category Categories (2, object): ['B' < 'A']
The updated `pd.Series` can be reassigned to `y`:
y = y.cat.as_ordered().cat.reorder_categories(['B', 'A'])
Now the equality operators can be used, returning a `pd.Series` of `dtype` `bool`:
y == 'A'
a True b True c False d False Name: y, dtype: bool
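Because the categories are now ordered (`'B' < 'A'`), ordering comparisons such as greater than are also supported (a brief sketch):
y > 'B'
a True b True c False d False Name: y, dtype: bool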
This can be used to index into a `pd.Series` that has the same `index`:
s = pd.Series([4, 3, 2, 2], index=['a', 'b', 'c', 'd'], name='s')
s[y == 'A']
a 4 b 3 Name: s, dtype: int64
If a `pd.Series` instance is created with random scores from `0` to `100`:
np.random.seed(0)
scores = pd.Series(np.random.randint(0, 100, size=10), name='scores')
scores
0 44 1 47 2 64 3 67 4 67 5 9 6 83 7 21 8 36 9 87 Name: scores, dtype: int32
The `pd.cut` function can be used to cut the values into categories using boundaries for bins. Note for 4 categories, 5 boundary values are supplied as a lower and an upper bound is specified for each bin. The categories are automatically named using the cut-off values. Note the `(` in `(-0.001, 50.0]` means exclusive of the lower bound and the `]` means inclusive of the upper bound:
pd.cut(scores, bins=[0, 50, 60, 70, 101])
0 (0, 50] 1 (0, 50] 2 (60, 70] 3 (60, 70] 4 (60, 70] 5 (0, 50] 6 (70, 101] 7 (0, 50] 8 (0, 50] 9 (70, 101] Name: scores, dtype: category Categories (4, interval[int64, right]): [(0, 50] < (50, 60] < (60, 70] < (70, 101]]
This `pd.Series` can be assigned to `grades`:
grades = pd.cut(scores, bins=[0, 50, 60, 70, 101], include_lowest=True)
Because it is the `'category'` `dtype`, the `cat` class can be accessed and the `categories` attribute examined:
grades.cat.categories
IntervalIndex([(-0.001, 50.0], (50.0, 60.0], (60.0, 70.0], (70.0, 101.0]], dtype='interval[float64, right]')
This can be assigned to `old_cats`:
old_cats = grades.cat.categories
The new category names can be grouped together in a `list`, `new_cats`:
new_cats = ['F', 'C', 'B', 'A']
This allows zipping of the `old_cats` with the `new_cats` and casting to a `dict` instance:
cats_mapping = dict(zip(old_cats, new_cats))
cats_mapping
{Interval(-0.001, 50.0, closed='right'): 'F', Interval(50.0, 60.0, closed='right'): 'C', Interval(60.0, 70.0, closed='right'): 'B', Interval(70.0, 101.0, closed='right'): 'A'}
The `cat` class can be accessed and the `rename_categories` method used with this `dict` instance:
grades.cat.rename_categories(cats_mapping)
0 F 1 F 2 B 3 B 4 B 5 F 6 A 7 F 8 F 9 A Name: scores, dtype: category Categories (4, object): ['F' < 'C' < 'B' < 'A']
The instance name `grades` can be reassigned to this new `pd.Series`:
grades = grades.cat.rename_categories(cats_mapping)
Plot Methods¶
A `pd.Series` of numeric values has access to the `plot` class which can be used to create matplotlib (`mpl`) plots from the data in the `pd.Series`:
s = pd.Series([4, 3, 2, 2], index=['a', 'b', 'c', 'd'], name='s')
The identifiers for the `plot` class can be examined:
dir2(s.plot, drop_internal=True)
{'method': ['area', 'bar', 'barh', 'box', 'density', 'hexbin', 'hist', 'kde', 'line', 'pie', 'scatter'], 'datamodel_attribute': ['__annotations__', '__dict__', '__doc__', '__module__', '__weakref__'], 'datamodel_method': ['__call__', '__class__', '__delattr__', '__dir__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__']}
The underlying values determine what plot types are most sensible:
s.plot.line()
<Axes: >
s.plot.bar()
<Axes: >
s.plot.pie()
<Axes: ylabel='s'>
v = pd.Series([1.1, 2.2, 3.1, 4.1], name='v')
The plot types can be configured using `mpl`-consistent input parameters, which is covered in the next tutorial.
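As a brief taster (a minimal sketch; `color` and `figsize` are examples of such standard `mpl`-style parameters):
s.plot.bar(color='green', figsize=(4, 3))
<Axes: >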
Datetime and TimeDelta¶
In `pandas`, date and time intervals are based upon the datatypes `datetime64` and `timedelta64` from the `numpy` library.
The `datetime64` class is normally initialised using a timestamp string of the following format:
np.datetime64('YYYY-MM-DD')
np.datetime64('YYYY-MM-DDThh:mm:ss.μμμμμμ')
For example:
np.datetime64('2008-03-12')
numpy.datetime64('2008-03-12')
np.datetime64('2008-03-12T14:30:15.123456')
numpy.datetime64('2008-03-12T14:30:15.123456')
The `timedelta64` is normally initialised using a quantity `X` followed by a unit string `'U'`:
np.timedelta64(X, 'U')
These are usually combined using addition:
np.timedelta64(1, 'D') + np.timedelta64(1, 'h') + np.timedelta64(1, 's') + np.timedelta64(1, 'ms')
numpy.timedelta64(90001001,'ms')
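Subtracting one `np.datetime64` instance from another also gives an `np.timedelta64`:
np.datetime64('2008-03-13') - np.datetime64('2008-03-12')
numpy.timedelta64(1,'D')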
These can be used to make an `Index` or `Series` using the `np.arange` function:
start_time = np.datetime64('2008-03-12')
end_time = np.datetime64('2008-03-13')
time_interval = np.timedelta64(1, 'h')
np.arange(start_time, end_time, time_interval)
array(['2008-03-12T00', '2008-03-12T01', '2008-03-12T02', '2008-03-12T03', '2008-03-12T04', '2008-03-12T05', '2008-03-12T06', '2008-03-12T07', '2008-03-12T08', '2008-03-12T09', '2008-03-12T10', '2008-03-12T11', '2008-03-12T12', '2008-03-12T13', '2008-03-12T14', '2008-03-12T15', '2008-03-12T16', '2008-03-12T17', '2008-03-12T18', '2008-03-12T19', '2008-03-12T20', '2008-03-12T21', '2008-03-12T22', '2008-03-12T23'], dtype='datetime64[h]')
These times can be cast into a `pd.Index` instance:
pd.Index(np.arange(start_time, end_time, time_interval))
DatetimeIndex(['2008-03-12 00:00:00', '2008-03-12 01:00:00', '2008-03-12 02:00:00', '2008-03-12 03:00:00', '2008-03-12 04:00:00', '2008-03-12 05:00:00', '2008-03-12 06:00:00', '2008-03-12 07:00:00', '2008-03-12 08:00:00', '2008-03-12 09:00:00', '2008-03-12 10:00:00', '2008-03-12 11:00:00', '2008-03-12 12:00:00', '2008-03-12 13:00:00', '2008-03-12 14:00:00', '2008-03-12 15:00:00', '2008-03-12 16:00:00', '2008-03-12 17:00:00', '2008-03-12 18:00:00', '2008-03-12 19:00:00', '2008-03-12 20:00:00', '2008-03-12 21:00:00', '2008-03-12 22:00:00', '2008-03-12 23:00:00'], dtype='datetime64[s]', freq=None)
Or a `pd.Series` instance:
pd.Series(np.arange(start_time, end_time, time_interval), name='times')
0 2008-03-12 00:00:00 1 2008-03-12 01:00:00 2 2008-03-12 02:00:00 3 2008-03-12 03:00:00 4 2008-03-12 04:00:00 5 2008-03-12 05:00:00 6 2008-03-12 06:00:00 7 2008-03-12 07:00:00 8 2008-03-12 08:00:00 9 2008-03-12 09:00:00 10 2008-03-12 10:00:00 11 2008-03-12 11:00:00 12 2008-03-12 12:00:00 13 2008-03-12 13:00:00 14 2008-03-12 14:00:00 15 2008-03-12 15:00:00 16 2008-03-12 16:00:00 17 2008-03-12 17:00:00 18 2008-03-12 18:00:00 19 2008-03-12 19:00:00 20 2008-03-12 20:00:00 21 2008-03-12 21:00:00 22 2008-03-12 22:00:00 23 2008-03-12 23:00:00 Name: times, dtype: datetime64[s]
It is more common to have a `pd.Index` of times, for example when recording a number of observations at given times:
times = pd.Index(np.arange(start_time, end_time, time_interval), name='times')
times
DatetimeIndex(['2008-03-12 00:00:00', '2008-03-12 01:00:00', '2008-03-12 02:00:00', '2008-03-12 03:00:00', '2008-03-12 04:00:00', '2008-03-12 05:00:00', '2008-03-12 06:00:00', '2008-03-12 07:00:00', '2008-03-12 08:00:00', '2008-03-12 09:00:00', '2008-03-12 10:00:00', '2008-03-12 11:00:00', '2008-03-12 12:00:00', '2008-03-12 13:00:00', '2008-03-12 14:00:00', '2008-03-12 15:00:00', '2008-03-12 16:00:00', '2008-03-12 17:00:00', '2008-03-12 18:00:00', '2008-03-12 19:00:00', '2008-03-12 20:00:00', '2008-03-12 21:00:00', '2008-03-12 22:00:00', '2008-03-12 23:00:00'], dtype='datetime64[s]', name='times', freq=None)
The `Index` method `tz_localize` can be used to specify a timezone `tz` when timezone information has been omitted. For example if these observations took place in Australia:
times.tz_localize(tz='Australia/Sydney')
DatetimeIndex(['2008-03-12 00:00:00+11:00', '2008-03-12 01:00:00+11:00', '2008-03-12 02:00:00+11:00', '2008-03-12 03:00:00+11:00', '2008-03-12 04:00:00+11:00', '2008-03-12 05:00:00+11:00', '2008-03-12 06:00:00+11:00', '2008-03-12 07:00:00+11:00', '2008-03-12 08:00:00+11:00', '2008-03-12 09:00:00+11:00', '2008-03-12 10:00:00+11:00', '2008-03-12 11:00:00+11:00', '2008-03-12 12:00:00+11:00', '2008-03-12 13:00:00+11:00', '2008-03-12 14:00:00+11:00', '2008-03-12 15:00:00+11:00', '2008-03-12 16:00:00+11:00', '2008-03-12 17:00:00+11:00', '2008-03-12 18:00:00+11:00', '2008-03-12 19:00:00+11:00', '2008-03-12 20:00:00+11:00', '2008-03-12 21:00:00+11:00', '2008-03-12 22:00:00+11:00', '2008-03-12 23:00:00+11:00'], dtype='datetime64[s, Australia/Sydney]', name='times', freq=None)
The associated method `tz_convert` can be used to convert these times into the corresponding times in another timezone:
times.tz_localize(tz='Australia/Sydney').tz_convert(tz='Europe/London')
DatetimeIndex(['2008-03-11 13:00:00+00:00', '2008-03-11 14:00:00+00:00', '2008-03-11 15:00:00+00:00', '2008-03-11 16:00:00+00:00', '2008-03-11 17:00:00+00:00', '2008-03-11 18:00:00+00:00', '2008-03-11 19:00:00+00:00', '2008-03-11 20:00:00+00:00', '2008-03-11 21:00:00+00:00', '2008-03-11 22:00:00+00:00', '2008-03-11 23:00:00+00:00', '2008-03-12 00:00:00+00:00', '2008-03-12 01:00:00+00:00', '2008-03-12 02:00:00+00:00', '2008-03-12 03:00:00+00:00', '2008-03-12 04:00:00+00:00', '2008-03-12 05:00:00+00:00', '2008-03-12 06:00:00+00:00', '2008-03-12 07:00:00+00:00', '2008-03-12 08:00:00+00:00', '2008-03-12 09:00:00+00:00', '2008-03-12 10:00:00+00:00', '2008-03-12 11:00:00+00:00', '2008-03-12 12:00:00+00:00'], dtype='datetime64[s, Europe/London]', name='times', freq=None)
Unlike `np` which is timezone agnostic, `pd` can perform calculations involving `datetime64` data with a timezone:
times.tz_localize(tz='Australia/Sydney')
DatetimeIndex(['2008-03-12 00:00:00+11:00', '2008-03-12 01:00:00+11:00', '2008-03-12 02:00:00+11:00', '2008-03-12 03:00:00+11:00', '2008-03-12 04:00:00+11:00', '2008-03-12 05:00:00+11:00', '2008-03-12 06:00:00+11:00', '2008-03-12 07:00:00+11:00', '2008-03-12 08:00:00+11:00', '2008-03-12 09:00:00+11:00', '2008-03-12 10:00:00+11:00', '2008-03-12 11:00:00+11:00', '2008-03-12 12:00:00+11:00', '2008-03-12 13:00:00+11:00', '2008-03-12 14:00:00+11:00', '2008-03-12 15:00:00+11:00', '2008-03-12 16:00:00+11:00', '2008-03-12 17:00:00+11:00', '2008-03-12 18:00:00+11:00', '2008-03-12 19:00:00+11:00', '2008-03-12 20:00:00+11:00', '2008-03-12 21:00:00+11:00', '2008-03-12 22:00:00+11:00', '2008-03-12 23:00:00+11:00'], dtype='datetime64[s, Australia/Sydney]', name='times', freq=None)
times.tz_localize(tz='Australia/Sydney') + np.timedelta64(1, 'h')
DatetimeIndex(['2008-03-12 01:00:00+11:00', '2008-03-12 02:00:00+11:00', '2008-03-12 03:00:00+11:00', '2008-03-12 04:00:00+11:00', '2008-03-12 05:00:00+11:00', '2008-03-12 06:00:00+11:00', '2008-03-12 07:00:00+11:00', '2008-03-12 08:00:00+11:00', '2008-03-12 09:00:00+11:00', '2008-03-12 10:00:00+11:00', '2008-03-12 11:00:00+11:00', '2008-03-12 12:00:00+11:00', '2008-03-12 13:00:00+11:00', '2008-03-12 14:00:00+11:00', '2008-03-12 15:00:00+11:00', '2008-03-12 16:00:00+11:00', '2008-03-12 17:00:00+11:00', '2008-03-12 18:00:00+11:00', '2008-03-12 19:00:00+11:00', '2008-03-12 20:00:00+11:00', '2008-03-12 21:00:00+11:00', '2008-03-12 22:00:00+11:00', '2008-03-12 23:00:00+11:00', '2008-03-13 00:00:00+11:00'], dtype='datetime64[s, Australia/Sydney]', name='times', freq=None)
Care however needs to be taken as clock changes lead to ambiguous times. In the UK, for example, one of the biannual clock changes can be examined:
start_time = np.datetime64('2023-10-28T11:00:00')
end_time = np.datetime64('2023-10-29T03:00:00')
time_interval = np.timedelta64(30, 'm')
times = pd.Index(np.arange(start_time, end_time, time_interval))
times
DatetimeIndex(['2023-10-28 11:00:00', '2023-10-28 11:30:00', '2023-10-28 12:00:00', '2023-10-28 12:30:00', '2023-10-28 13:00:00', '2023-10-28 13:30:00', '2023-10-28 14:00:00', '2023-10-28 14:30:00', '2023-10-28 15:00:00', '2023-10-28 15:30:00', '2023-10-28 16:00:00', '2023-10-28 16:30:00', '2023-10-28 17:00:00', '2023-10-28 17:30:00', '2023-10-28 18:00:00', '2023-10-28 18:30:00', '2023-10-28 19:00:00', '2023-10-28 19:30:00', '2023-10-28 20:00:00', '2023-10-28 20:30:00', '2023-10-28 21:00:00', '2023-10-28 21:30:00', '2023-10-28 22:00:00', '2023-10-28 22:30:00', '2023-10-28 23:00:00', '2023-10-28 23:30:00', '2023-10-29 00:00:00', '2023-10-29 00:30:00', '2023-10-29 01:00:00', '2023-10-29 01:30:00', '2023-10-29 02:00:00', '2023-10-29 02:30:00'], dtype='datetime64[s]', freq=None)
When these recorded times are localised as UK times, the `tz_localize` method does not know which of the two duplicate times is meant, i.e. whether these times were recorded before or after the clock change:
times.tz_localize(tz='Europe/London', ambiguous='NaT')
DatetimeIndex(['2023-10-28 11:00:00+01:00', '2023-10-28 11:30:00+01:00', '2023-10-28 12:00:00+01:00', '2023-10-28 12:30:00+01:00', '2023-10-28 13:00:00+01:00', '2023-10-28 13:30:00+01:00', '2023-10-28 14:00:00+01:00', '2023-10-28 14:30:00+01:00', '2023-10-28 15:00:00+01:00', '2023-10-28 15:30:00+01:00', '2023-10-28 16:00:00+01:00', '2023-10-28 16:30:00+01:00', '2023-10-28 17:00:00+01:00', '2023-10-28 17:30:00+01:00', '2023-10-28 18:00:00+01:00', '2023-10-28 18:30:00+01:00', '2023-10-28 19:00:00+01:00', '2023-10-28 19:30:00+01:00', '2023-10-28 20:00:00+01:00', '2023-10-28 20:30:00+01:00', '2023-10-28 21:00:00+01:00', '2023-10-28 21:30:00+01:00', '2023-10-28 22:00:00+01:00', '2023-10-28 22:30:00+01:00', '2023-10-28 23:00:00+01:00', '2023-10-28 23:30:00+01:00', '2023-10-29 00:00:00+01:00', '2023-10-29 00:30:00+01:00', 'NaT', 'NaT', '2023-10-29 02:00:00+00:00', '2023-10-29 02:30:00+00:00'], dtype='datetime64[s, Europe/London]', freq=None)
In this case, casting to a `pd.Series` and using the `pd.Series` method `interpolate` with `'linear'` works because the `np.arange` function was used, which spaces the values linearly:
pd.Series(times.tz_localize(tz='Europe/London', ambiguous='NaT')).interpolate('linear')
0 2023-10-28 11:00:00+01:00 1 2023-10-28 11:30:00+01:00 2 2023-10-28 12:00:00+01:00 3 2023-10-28 12:30:00+01:00 4 2023-10-28 13:00:00+01:00 5 2023-10-28 13:30:00+01:00 6 2023-10-28 14:00:00+01:00 7 2023-10-28 14:30:00+01:00 8 2023-10-28 15:00:00+01:00 9 2023-10-28 15:30:00+01:00 10 2023-10-28 16:00:00+01:00 11 2023-10-28 16:30:00+01:00 12 2023-10-28 17:00:00+01:00 13 2023-10-28 17:30:00+01:00 14 2023-10-28 18:00:00+01:00 15 2023-10-28 18:30:00+01:00 16 2023-10-28 19:00:00+01:00 17 2023-10-28 19:30:00+01:00 18 2023-10-28 20:00:00+01:00 19 2023-10-28 20:30:00+01:00 20 2023-10-28 21:00:00+01:00 21 2023-10-28 21:30:00+01:00 22 2023-10-28 22:00:00+01:00 23 2023-10-28 22:30:00+01:00 24 2023-10-28 23:00:00+01:00 25 2023-10-28 23:30:00+01:00 26 2023-10-29 00:00:00+01:00 27 2023-10-29 00:30:00+01:00 28 2023-10-29 01:20:00+01:00 29 2023-10-29 01:10:00+00:00 30 2023-10-29 02:00:00+00:00 31 2023-10-29 02:30:00+00:00 dtype: datetime64[s, Europe/London]
DataFrame Identifiers¶
The identifiers for a `pd.DataFrame` are:
dir2(pd.DataFrame, drop_internal=True, exclude_identifier_list=obsolete.df_depreciation)
{'attribute': ['at', 'attrs', 'axes', 'columns', 'dtypes', 'empty', 'flags', 'iat', 'iloc', 'index', 'loc', 'ndim', 'shape', 'size', 'style', 'values'], 'constant': ['T'], 'method': ['abs', 'add', 'add_prefix', 'add_suffix', 'agg', 'aggregate', 'align', 'all', 'any', 'apply', 'asfreq', 'asof', 'assign', 'astype', 'at_time', 'backfill', 'between_time', 'bfill', 'boxplot', 'clip', 'combine', 'combine_first', 'compare', 'convert_dtypes', 'copy', 'corr', 'corrwith', 'count', 'cov', 'cummax', 'cummin', 'cumprod', 'cumsum', 'describe', 'diff', 'div', 'divide', 'dot', 'drop', 'drop_duplicates', 'droplevel', 'dropna', 'duplicated', 'eq', 'equals', 'eval', 'ewm', 'expanding', 'explode', 'ffill', 'fillna', 'filter', 'first_valid_index', 'floordiv', 'from_dict', 'from_records', 'ge', 'get', 'groupby', 'gt', 'head', 'hist', 'idxmax', 'idxmin', 'infer_objects', 'info', 'insert', 'interpolate', 'isetitem', 'isin', 'isna', 'isnull', 'items', 'iterrows', 'itertuples', 'join', 'keys', 'kurt', 'kurtosis', 'last_valid_index', 'le', 'lt', 'map', 'mask', 'max', 'mean', 'median', 'melt', 'memory_usage', 'merge', 'min', 'mod', 'mode', 'mul', 'multiply', 'ne', 'nlargest', 'notna', 'notnull', 'nsmallest', 'nunique', 'pad', 'pct_change', 'pipe', 'pivot', 'pivot_table', 'pop', 'pow', 'prod', 'product', 'quantile', 'query', 'radd', 'rank', 'rdiv', 'reindex', 'reindex_like', 'rename', 'rename_axis', 'reorder_levels', 'replace', 'resample', 'reset_index', 'rfloordiv', 'rmod', 'rmul', 'rolling', 'round', 'rpow', 'rsub', 'rtruediv', 'sample', 'select_dtypes', 'sem', 'set_axis', 'set_flags', 'set_index', 'shift', 'skew', 'sort_index', 'sort_values', 'squeeze', 'stack', 'std', 'sub', 'subtract', 'sum', 'swaplevel', 'tail', 'take', 'to_clipboard', 'to_csv', 'to_dict', 'to_excel', 'to_feather', 'to_hdf', 'to_html', 'to_json', 'to_latex', 'to_markdown', 'to_numpy', 'to_orc', 'to_parquet', 'to_period', 'to_pickle', 'to_records', 'to_sql', 'to_stata', 'to_string', 'to_timestamp', 'to_xarray', 'to_xml', 'transform', 'transpose', 'truediv', 'truncate', 'tz_convert', 'tz_localize', 'unstack', 'update', 'value_counts', 'var', 'where', 'xs'], 'lower_class': ['plot', 'sparse'], 'datamodel_attribute': ['__annotations__', '__array_priority__', '__dict__', '__doc__', '__hash__', '__module__', '__pandas_priority__', '__weakref__'], 'datamodel_method': ['__abs__', '__add__', '__and__', '__array__', '__array_ufunc__', '__arrow_c_stream__', '__bool__', '__class__', '__contains__', '__copy__', '__dataframe__', '__dataframe_consortium_standard__', '__deepcopy__', '__delattr__', '__delitem__', '__dir__', '__divmod__', '__eq__', '__finalize__', '__floordiv__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__iadd__', '__iand__', '__ifloordiv__', '__imod__', '__imul__', '__init__', '__init_subclass__', '__invert__', '__ior__', '__ipow__', '__isub__', '__iter__', '__itruediv__', '__ixor__', '__le__', '__len__', '__lt__', '__matmul__', '__mod__', '__mul__', '__ne__', '__neg__', '__new__', '__nonzero__', '__or__', '__pos__', '__pow__', '__radd__', '__rand__', '__rdivmod__', '__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__', '__rmatmul__', '__rmod__', '__rmul__', '__ror__', '__round__', '__rpow__', '__rsub__', '__rtruediv__', '__rxor__', '__setattr__', '__setitem__', '__setstate__', '__sizeof__', '__str__', '__sub__', '__subclasshook__', '__truediv__', '__xor__']}
The initialisation signature for a `pd.DataFrame` can be examined:
pd.DataFrame?
Init signature: pd.DataFrame( data=None, index: 'Axes | None' = None, columns: 'Axes | None' = None, dtype: 'Dtype | None' = None, copy: 'bool | None' = None, ) -> 'None' Docstring: Two-dimensional, size-mutable, potentially heterogeneous tabular data. Data structure also contains labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure. Parameters ---------- data : ndarray (structured or homogeneous), Iterable, dict, or DataFrame Dict can contain Series, arrays, constants, dataclass or list-like objects. If data is a dict, column order follows insertion-order. If a dict contains Series which have an index defined, it is aligned by its index. This alignment also occurs if data is a Series or a DataFrame itself. Alignment is done on Series/DataFrame inputs. If data is a list of dicts, column order follows insertion-order. index : Index or array-like Index to use for resulting frame. Will default to RangeIndex if no indexing information part of input data and no index provided. columns : Index or array-like Column labels to use for resulting frame when data does not have them, defaulting to RangeIndex(0, 1, 2, ..., n). If data contains column labels, will perform column selection instead. dtype : dtype, default None Data type to force. Only a single dtype is allowed. If None, infer. copy : bool or None, default None Copy data from inputs. For dict data, the default of None behaves like ``copy=True``. For DataFrame or 2d ndarray input, the default of None behaves like ``copy=False``. If data is a dict containing one or more Series (possibly of different dtypes), ``copy=False`` will ensure that these inputs are not copied. .. versionchanged:: 1.3.0 See Also -------- DataFrame.from_records : Constructor from tuples, also record arrays. DataFrame.from_dict : From dicts of Series, arrays, or dicts. read_csv : Read a comma-separated values (csv) file into DataFrame. read_table : Read general delimited file into DataFrame. read_clipboard : Read text from clipboard into DataFrame. Notes ----- Please reference the :ref:`User Guide <basics.dataframe>` for more information. Examples -------- Constructing DataFrame from a dictionary. >>> d = {'col1': [1, 2], 'col2': [3, 4]} >>> df = pd.DataFrame(data=d) >>> df col1 col2 0 1 3 1 2 4 Notice that the inferred dtype is int64. >>> df.dtypes col1 int64 col2 int64 dtype: object To enforce a single dtype: >>> df = pd.DataFrame(data=d, dtype=np.int8) >>> df.dtypes col1 int8 col2 int8 dtype: object Constructing DataFrame from a dictionary including Series: >>> d = {'col1': [0, 1, 2, 3], 'col2': pd.Series([2, 3], index=[2, 3])} >>> pd.DataFrame(data=d, index=[0, 1, 2, 3]) col1 col2 0 0 NaN 1 1 NaN 2 2 2.0 3 3 3.0 Constructing DataFrame from numpy ndarray: >>> df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), ... columns=['a', 'b', 'c']) >>> df2 a b c 0 1 2 3 1 4 5 6 2 7 8 9 Constructing DataFrame from a numpy ndarray that has labeled columns: >>> data = np.array([(1, 2, 3), (4, 5, 6), (7, 8, 9)], ... dtype=[("a", "i4"), ("b", "i4"), ("c", "i4")]) >>> df3 = pd.DataFrame(data, columns=['c', 'a']) ... 
>>> df3 c a 0 3 1 1 6 4 2 9 7 Constructing DataFrame from dataclass: >>> from dataclasses import make_dataclass >>> Point = make_dataclass("Point", [("x", int), ("y", int)]) >>> pd.DataFrame([Point(0, 0), Point(0, 3), Point(2, 3)]) x y 0 0 0 1 0 3 2 2 3 Constructing DataFrame from Series/DataFrame: >>> ser = pd.Series([1, 2, 3], index=["a", "b", "c"]) >>> df = pd.DataFrame(data=ser, index=["a", "c"]) >>> df 0 a 1 c 3 >>> df1 = pd.DataFrame([1, 2, 3], index=["a", "b", "c"], columns=["x"]) >>> df2 = pd.DataFrame(data=df1, index=["a", "c"]) >>> df2 x a 1 c 3 File: c:\users\phili\anaconda3\envs\vscode-env\lib\site-packages\pandas\core\frame.py Type: type Subclasses: SubclassedDataFrame
A `pd.DataFrame` can be instantiated by supplying a `dict` as `data`, where each key is a `pd.Series` `name` and each value contains the corresponding `pd.Series` values:
df1 = pd.DataFrame({'w': [0, 1, 2, 3],
'x': [4, 5, 6, 7],
'y': [8, 9, 10, 11],
'z': [12, 13, 14, 15]})
df1
w | x | y | z | |
---|---|---|---|---|
0 | 0 | 4 | 8 | 12 |
1 | 1 | 5 | 9 | 13 |
2 | 2 | 6 | 10 | 14 |
3 | 3 | 7 | 11 | 15 |
Attributes¶
A `pd.DataFrame` instance has 2 dimensions so `ndim` is:
df1.ndim
2
The attribute `axes` displays a `list` of each axis:
df1.axes
[RangeIndex(start=0, stop=4, step=1), Index(['w', 'x', 'y', 'z'], dtype='object')]
The detail for each axis can be viewed:
df1.index
RangeIndex(start=0, stop=4, step=1)
Note the `columns` axis is an `Index` instance. The `dtype` is `'O'` because it contains `str` instances:
df1.columns
Index(['w', 'x', 'y', 'z'], dtype='object')
`dtypes` groups the datatype for each `pd.Series` in the `pd.DataFrame`; note that this is itself a `pd.Series` instance:
df1.dtypes
w int64 x int64 y int64 z int64 dtype: object
The `pd.DataFrame` instance also has a `shape` and `size`:
df1.shape
(4, 4)
df1.size
16
The `pd.DataFrame` instance has an optional attribute `attrs` which is an empty `dict` by default:
df1.attrs
{}
This can be used to provide a description and a reference, for example:
df1.attrs = {'description': 'This is a DataFrame',
'documentation': 'https://pandas.pydata.org/'}
df1.attrs
{'description': 'This is a DataFrame', 'documentation': 'https://pandas.pydata.org/'}
The `pd.DataFrame` attribute `style` will display a styled `DataFrame` in the cell output:
df1.style
w | x | y | z | |
---|---|---|---|---|
0 | 0 | 4 | 8 | 12 |
1 | 1 | 5 | 9 | 13 |
2 | 2 | 6 | 10 | 14 |
3 | 3 | 7 | 11 | 15 |
A number of identifiers are available to be called from `style` to change how the `pd.DataFrame` instance displays:
dir2(df1.style, drop_internal=True)
{'attribute': ['caption', 'cell_context', 'cell_ids', 'columns', 'concatenated', 'css', 'ctx', 'ctx_columns', 'ctx_index', 'data', 'env', 'hidden_columns', 'hidden_rows', 'hide_column_names', 'hide_columns_', 'hide_index_', 'hide_index_names', 'index', 'loader', 'table_attributes', 'table_styles', 'template_html', 'template_html_style', 'template_html_table', 'template_latex', 'template_string', 'tooltips', 'uuid', 'uuid_len'], 'method': ['apply', 'apply_index', 'applymap', 'applymap_index', 'background_gradient', 'bar', 'clear', 'concat', 'export', 'format', 'format_index', 'from_custom_template', 'hide', 'highlight_between', 'highlight_max', 'highlight_min', 'highlight_null', 'highlight_quantile', 'map', 'map_index', 'pipe', 'relabel_index', 'set_caption', 'set_properties', 'set_sticky', 'set_table_attributes', 'set_table_styles', 'set_td_classes', 'set_tooltips', 'set_uuid', 'text_gradient', 'to_excel', 'to_html', 'to_latex', 'to_string', 'use'], 'datamodel_attribute': ['__dict__', '__doc__', '__module__', '__weakref__'], 'datamodel_method': ['__class__', '__copy__', '__deepcopy__', '__delattr__', '__dir__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__']}
These can often be stacked:
df1.style.format(precision=3).set_caption('DataFrame Instance').relabel_index(['a', 'b', 'c', 'd'])
w | x | y | z | |
---|---|---|---|---|
a | 0 | 4 | 8 | 12 |
b | 1 | 5 | 9 | 13 |
c | 2 | 6 | 10 | 14 |
d | 3 | 7 | 11 | 15 |
Indexing¶
df1 = pd.DataFrame({'w': [0, 1, 2, 3],
'x': [4, 5, 6, 7],
'y': [8, 9, 10, 11],
'z': [12, 13, 14, 15]})
When each `pd.Series` `name` is a valid identifier:
'w'.isidentifier()
True
It is available as an attribute:
df1.w
0 0 1 1 2 2 3 3 Name: w, dtype: int64
A `pd.Series` can be indexed from a `pd.DataFrame` by using square brackets and the `str` instance corresponding to the `pd.Series` `name`:
df1['x']
0 4 1 5 2 6 3 7 Name: x, dtype: int64
A single value can be specified by indexing into the returned `pd.Series`:
df1['x'][1]
5
The data supplied to `df1` was a `dict` instance:
mapping = {'w': [0, 1, 2, 3],
'x': [4, 5, 6, 7],
'y': [8, 9, 10, 11],
'z': [12, 13, 14, 15]}
Notice the consistency when indexing into the `dict` using a key:
mapping['x']
[4, 5, 6, 7]
And then indexing into that `list` to get a single element:
mapping['x'][1]
5
Providing a `list` of `str` instances, where each `str` corresponds to a `pd.Series` `name`, returns a `pd.DataFrame` instance:
df1[['x', 'z']]
x | z | |
---|---|---|
0 | 4 | 12 |
1 | 5 | 13 |
2 | 6 | 14 |
3 | 7 | 15 |
The `iloc` attribute is instead used to retrieve a row by specifying the numeric value in the `index`. The row is returned as a `pd.Series` and because the default representation for a `pd.Series` is a column, the row is displayed as a column:
df1.iloc[2]
w 2 x 6 y 10 z 14 Name: 2, dtype: int64
Slicing can also be used to return a `pd.DataFrame` instance:
df1.iloc[2:4]
w | x | y | z | |
---|---|---|---|---|
2 | 2 | 6 | 10 | 14 |
3 | 3 | 7 | 11 | 15 |
The `at` attribute is used to return a single value, specifying the `index` `name` (when applicable) and the column `name`:
df1.at[0, 'x']
4
If a second `pd.DataFrame` instance with a labelled `pd.Index` is used:
df2 = pd.DataFrame({'w': [0, 1, 2, 3],
'x': [4, 5, 6, 7],
'y': [8, 9, 10, 11],
'z': [12, 13, 14, 15]},
index=('a', 'b', 'c', 'd'))
df2
w | x | y | z | |
---|---|---|---|---|
a | 0 | 4 | 8 | 12 |
b | 1 | 5 | 9 | 13 |
c | 2 | 6 | 10 | 14 |
d | 3 | 7 | 11 | 15 |
The difference between the indexing attributes can be examined, such as indexing with a column `name`:
df2['x']
a 4 b 5 c 6 d 7 Name: x, dtype: int64
df2[['x', 'y']]
x | y | |
---|---|---|
a | 4 | 8 |
b | 5 | 9 |
c | 6 | 10 |
d | 7 | 11 |
Indexing a row using an `index` `name`:
df2.loc['b']
w 1 x 5 y 9 z 13 Name: b, dtype: int64
Indexing a row using an `index` `int`:
df2.iloc[1]
w 1 x 5 y 9 z 13 Name: b, dtype: int64
df2.iloc[1:3]
w | x | y | z | |
---|---|---|---|---|
b | 1 | 5 | 9 | 13 |
c | 2 | 6 | 10 | 14 |
Indexing all rows and selecting columns using `int` positions:
df2.iloc[:, [0, 1]]
w | x | |
---|---|---|
a | 0 | 4 |
b | 1 | 5 |
c | 2 | 6 |
d | 3 | 7 |
Specifying an element using `at` with a row `name` followed by a column `name`:
df2.at['b', 'y']
9
Specifying an element using `iat` with a row `int` followed by a column `int`:
df2.iat[1, 2]
9
Under the hood `iloc` can be conceptualised as using a reset index:
df2
w | x | y | z | |
---|---|---|---|---|
a | 0 | 4 | 8 | 12 |
b | 1 | 5 | 9 | 13 |
c | 2 | 6 | 10 | 14 |
d | 3 | 7 | 11 | 15 |
df2.reset_index()
index | w | x | y | z | |
---|---|---|---|---|---|
0 | a | 0 | 4 | 8 | 12 |
1 | b | 1 | 5 | 9 | 13 |
2 | c | 2 | 6 | 10 | 14 |
3 | d | 3 | 7 | 11 | 15 |
df2.iloc[2]
w 2 x 6 y 10 z 14 Name: c, dtype: int64
Or by examining the underlying `np.ndarray` instance and indexing numerically:
df2.values
array([[ 0, 4, 8, 12], [ 1, 5, 9, 13], [ 2, 6, 10, 14], [ 3, 7, 11, 15]], dtype=int64)
df2.values[2]
array([ 2, 6, 10, 14], dtype=int64)
df2.iloc[2, 1]
6
df2.values[2, 1]
6
If `df1` is examined:
df1
w | x | y | z | |
---|---|---|---|---|
0 | 0 | 4 | 8 | 12 |
1 | 1 | 5 | 9 | 13 |
2 | 2 | 6 | 10 | 14 |
3 | 3 | 7 | 11 | 15 |
Notice a `pd.Series` can be retrieved using the attribute or by indexing:
df1.x
0 4 1 5 2 6 3 7 Name: x, dtype: int64
df1['x']
0 4 1 5 2 6 3 7 Name: x, dtype: int64
Attempting to access an attribute that does not exist will raise an `AttributeError`. Indexing using square brackets can be used to create a new key in a `dict` instance; the key then essentially becomes an instance name for the value:
mapping['letters'] = ['A', 'B', 'C', 'D']
mapping
{'w': [0, 1, 2, 3], 'x': [4, 5, 6, 7], 'y': [8, 9, 10, 11], 'z': [12, 13, 14, 15], 'letters': ['A', 'B', 'C', 'D']}
Similar syntax can be used for a `pd.DataFrame` instance:
df1['letters'] = pd.Series(['A', 'B', 'C', 'D'], name='letters')
df1
w | x | y | z | letters | |
---|---|---|---|---|---|
0 | 0 | 4 | 8 | 12 | A |
1 | 1 | 5 | 9 | 13 | B |
2 | 2 | 6 | 10 | 14 | C |
3 | 3 | 7 | 11 | 15 | D |
Since the new `pd.Series` `name` is a valid identifier name, it becomes an attribute:
df1.letters
0 A 1 B 2 C 3 D Name: letters, dtype: object
Information¶
The example `pd.DataFrame` instance `df3` is very small:
df3 = pd.DataFrame({'w': [0, 1, 2, 3],
'x': [4, 5, None, 7],
'y': [8, None, 10, 11],
'z': [12, 13, 14, 15]})
df3
w | x | y | z | |
---|---|---|---|---|
0 | 0 | 4.0 | 8.0 | 12 |
1 | 1 | 5.0 | NaN | 13 |
2 | 2 | NaN | 10.0 | 14 |
3 | 3 | 7.0 | 11.0 | 15 |
However many `pd.DataFrame` instances have hundreds of rows of data and a preview can be made using the `pd.DataFrame` method `head` or `tail`. These methods optionally take an `int` as a positional input argument giving the number of rows to show; this defaults to `5` if unspecified:
df3.head(2)
w | x | y | z | |
---|---|---|---|---|
0 | 0 | 4.0 | 8.0 | 12 |
1 | 1 | 5.0 | NaN | 13 |
df3.tail(3)
w | x | y | z | |
---|---|---|---|---|
1 | 1 | 5.0 | NaN | 13 |
2 | 2 | NaN | 10.0 | 14 |
3 | 3 | 7.0 | 11.0 | 15 |
The `info` method gives details about each `pd.Series` such as the `name` and the number of non-null values as well as the `dtype`:
df3.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 4 entries, 0 to 3 Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 w 4 non-null int64 1 x 3 non-null float64 2 y 3 non-null float64 3 z 4 non-null int64 dtypes: float64(2), int64(2) memory usage: 260.0 bytes
When a `pd.Series` is numeric, descriptive statistics can be computed on it:
df3.describe()
w | x | y | z | |
---|---|---|---|---|
count | 4.000000 | 3.000000 | 3.000000 | 4.000000 |
mean | 1.500000 | 5.333333 | 9.666667 | 13.500000 |
std | 1.290994 | 1.527525 | 1.527525 | 1.290994 |
min | 0.000000 | 4.000000 | 8.000000 | 12.000000 |
25% | 0.750000 | 4.500000 | 9.000000 | 12.750000 |
50% | 1.500000 | 5.000000 | 10.000000 | 13.500000 |
75% | 2.250000 | 6.000000 | 10.500000 | 14.250000 |
max | 3.000000 | 7.000000 | 11.000000 | 15.000000 |
If another `pd.DataFrame` instance is examined with some values that are `None`:
df3 = pd.DataFrame({'w': [0, 1, 2, 3],
'x': [4, 5, None, 7],
'y': [8, None, 10, 11],
'z': [12, 13, 14, 15]})
Then the `pd.Series` with `None` are cast into `float64` instances and have a non-null count lower than the `size` of the `pd.Series`:
df3.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 4 entries, 0 to 3 Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 w 4 non-null int64 1 x 3 non-null float64 2 y 3 non-null float64 3 z 4 non-null int64 dtypes: float64(2), int64(2) memory usage: 260.0 bytes
The `isna` method (alias `isnull`) will return a `bool` `pd.DataFrame` instance:
df3.isna()
w | x | y | z | |
---|---|---|---|---|
0 | False | False | False | False |
1 | False | False | True | False |
2 | False | True | False | False |
3 | False | False | False | False |
It is quite common to use this with the `pd.DataFrame` method `any` to return a `pd.Series`:
df3.isna().any()
w False x True y True z False dtype: bool
This can be used to see which `pd.Series` instances need to be examined to remove or fill in missing values.
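For example, this `bool` `pd.Series` can be used with `loc` to select only the columns that contain missing values (a minimal sketch using `df3` from above):
df3.loc[:, df3.isna().any()]
x | y | 
---|---|---|
0 | 4.0 | 8.0 |
1 | 5.0 | NaN |
2 | NaN | 10.0 |
3 | 7.0 | 11.0 |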
The `notna` method (alias `notnull`) will return the inverted `bool` `pd.DataFrame` instance:
df3.notna()
w | x | y | z | |
---|---|---|---|---|
0 | True | True | True | True |
1 | True | True | False | True |
2 | True | False | True | True |
3 | True | True | True | True |
Mutability¶
The three classes are mutable and therefore the `__hash__` identifier is `None`:
pd.Index.__hash__ == None
True
pd.Series.__hash__ == None
True
pd.DataFrame.__hash__ == None
True
This is because they are based on the `np.ndarray` which is also mutable:
np.ndarray.__hash__ == None
True
Most of the methods are immutable by default, returning a new instance. A large number of `pd.Series` and consistent `pd.DataFrame` methods have the parameter `inplace` which can be used to change whether the method behaves immutably or mutably:
dir2(pd.Series, parameter='inplace', exclude_identifier_list=obsolete.series_depreciation)
{'method': ['bfill', 'clip', 'drop', 'drop_duplicates', 'dropna', 'ffill', 'fillna', 'interpolate', 'mask', 'pad', 'rename', 'rename_axis', 'replace', 'reset_index', 'sort_index', 'sort_values', 'where']}
For example the `pd.DataFrame` instance `df3` can be examined:
df3
w | x | y | z | |
---|---|---|---|---|
0 | 0 | 4.0 | 8.0 | 12 |
1 | 1 | 5.0 | NaN | 13 |
2 | 2 | NaN | 10.0 | 14 |
3 | 3 | 7.0 | 11.0 | 15 |
Notice that `x` has a `NaN` value (`np.nan`):
df3['x']
0 4.0 1 5.0 2 NaN 3 7.0 Name: x, dtype: float64
The method `fillna` can be used to fill in a missing value using a specified value:
df3['x'].fillna(0) # immutable returns a new pd.Series
0 4.0 1 5.0 2 0.0 3 7.0 Name: x, dtype: float64
The method `bfill` can be used to fill in a missing value using the next value (backward fill):
df3['x'].bfill() # immutable returns a new pd.Series
0 4.0 1 5.0 2 7.0 3 7.0 Name: x, dtype: float64
The method `ffill` can be used to fill in a missing value using the previous value (forward fill):
df3['x'].ffill() # immutable returns a new pd.Series
0 4.0 1 5.0 2 5.0 3 7.0 Name: x, dtype: float64
The method `interpolate` can be used to interpolate values using `'linear'`, `'quadratic'` or `'cubic'` interpolation (there is not enough data for `'cubic'` in this sample):
df3['x'].interpolate('linear') # immutable returns a new pd.Series
0 4.0 1 5.0 2 6.0 3 7.0 Name: x, dtype: float64
df3['x'].interpolate('quadratic') # immutable returns a new pd.Series
0 4.0 1 5.0 2 6.0 3 7.0 Name: x, dtype: float64
Notice that `df3` is unchanged:
df3 # not updated
w | x | y | z | |
---|---|---|---|---|
0 | 0 | 4.0 | 8.0 | 12 |
1 | 1 | 5.0 | NaN | 13 |
2 | 2 | NaN | 10.0 | 14 |
3 | 3 | 7.0 | 11.0 | 15 |
When the named parameter `inplace` is set to the `bool` value `True`, the method becomes mutable and updates the `pd.Series` with name `'x'` found on the `pd.DataFrame` instance `df3` inplace:
df3['x'].ffill(inplace=True) # mutable no return value
df3 # updated in place
w | x | y | z | |
---|---|---|---|---|
0 | 0 | 4.0 | 8.0 | 12 |
1 | 1 | 5.0 | NaN | 13 |
2 | 2 | 5.0 | 10.0 | 14 |
3 | 3 | 7.0 | 11.0 | 15 |
A mutable method should not be used with reassignment because it has no return value; the value being reassigned is therefore `None`:
df3['y'] = df3['y'].ffill(inplace=True)
The above attempts to use a mutable method, which returns the value `None`, and then reassign `None` to the `pd.Series` instance `df3['y']`.
This is one of the most common mistakes beginners make. Currently the operation is carried out with a `FutureWarning`; however, this behaviour will be prevented in pandas 3.
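The recommended pattern is to instead use the immutable method and reassign the returned `pd.Series` (a minimal sketch):
df3['y'] = df3['y'].ffill() # immutable method, reassign the return value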
Because the `pd.Index`, `pd.Series` and `pd.DataFrame` are mutable, an item can be accessed:
df3.at[1, 'y']
nan
And reassigned:
df3.at[1, 'y'] = 40
df3
w | x | y | z | |
---|---|---|---|---|
0 | 0 | 4.0 | 8.0 | 12 |
1 | 1 | 5.0 | 40.0 | 13 |
2 | 2 | 5.0 | 10.0 | 14 |
3 | 3 | 7.0 | 11.0 | 15 |
The `del` keyword only works for deleting a `pd.Series` from a `pd.DataFrame`:
del df3['y']
df3
w | x | z | |
---|---|---|---|
0 | 0 | 4.0 | 12 |
1 | 1 | 5.0 | 13 |
2 | 2 | 5.0 | 14 |
3 | 3 | 7.0 | 15 |
The `pd.DataFrame` has mutable methods consistent with a mutable `Collection` such as a `list`. `insert` can be used to insert a new `pd.Series` at a specified column position:
df3.insert(loc=2, column='y2', value=[20, 21, 22, 23])
df3
w | x | y2 | z | |
---|---|---|---|---|
0 | 0 | 4.0 | 20 | 12 |
1 | 1 | 5.0 | 21 | 13 |
2 | 2 | 5.0 | 22 | 14 |
3 | 3 | 7.0 | 23 | 15 |
`pop` can be used to pop off a `pd.Series` using a `str` instance that corresponds to the `pd.Series` `name`. Recall that `pop` is one of the few mutable methods that has a `return` value. The `return` value is the popped `pd.Series` and the `pd.DataFrame` instance is mutated inplace:
df3.pop('z') # mutable method WITH return value of pd.Series
0 12 1 13 2 14 3 15 Name: z, dtype: int64
df3 # mutated inplace
w | x | y2 | |
---|---|---|---|
0 | 0 | 4.0 | 20 |
1 | 1 | 5.0 | 21 |
2 | 2 | 5.0 | 22 |
3 | 3 | 7.0 | 23 |
Note there is no `append` method because indexing with a new `pd.Series` name appends the `pd.Series` to the end of the `pd.DataFrame`:
df3['z2'] = pd.Series([30, 31, 32, 33], name='z')
Restoring `df3` to the following:
df3 = pd.DataFrame({'w': [0, 1, 2, 3],
'x': [4, 5, None, 7],
'y': [8, None, 10, 11],
'z': [12, 13, 14, 15]})
The `pd.DataFrame` method `dropna` can be used. Notice any observations with a `NaN` (`None` or `np.nan`) value are dropped:
df3.dropna()
w | x | y | z | |
---|---|---|---|---|
0 | 0 | 4.0 | 8.0 | 12 |
3 | 3 | 7.0 | 11.0 | 15 |
Alternatively an observation can be dropped using:
df3.drop(0)
w | x | y | z | |
---|---|---|---|---|
1 | 1 | 5.0 | NaN | 13 |
2 | 2 | NaN | 10.0 | 14 |
3 | 3 | 7.0 | 11.0 | 15 |
df3.drop([0, 3])
w | x | y | z | |
---|---|---|---|---|
1 | 1 | 5.0 | NaN | 13 |
2 | 2 | NaN | 10.0 | 14 |
If the following `pd.DataFrame` instance is instantiated:
df4 = pd.DataFrame({'w': [0, 0, 4, 4],
'x': [1, 1, 5, 5],
'y': [2, 2, 6, 6],
'z': [3, 3, 7, 7]})
Notice index `0` and `1` have duplicate data and index `2` and `3` also have duplicate data:
df4
w | x | y | z | |
---|---|---|---|---|
0 | 0 | 1 | 2 | 3 |
1 | 0 | 1 | 2 | 3 |
2 | 4 | 5 | 6 | 7 |
3 | 4 | 5 | 6 | 7 |
The duplicates can be dropped using:
df4.drop_duplicates()
w | x | y | z | |
---|---|---|---|---|
0 | 0 | 1 | 2 | 3 |
2 | 4 | 5 | 6 | 7 |
The index is now `0` and `2` as the observations that were at position `1` and `3` on the `index` have been dropped. The index can be reset using the `reset_index` method:
df4.drop_duplicates().reset_index()
index | w | x | y | z | |
---|---|---|---|---|---|
0 | 0 | 0 | 1 | 2 | 3 |
1 | 2 | 4 | 5 | 6 | 7 |
The old `index` is now a `pd.Series` with `name` `'index'`; it can also be dropped:
df4.drop_duplicates().reset_index(drop=True)
w | x | y | z | |
---|---|---|---|---|
0 | 0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 | 7 |
If the `pd.DataFrame` instance `df4` is examined:
df4
w | x | y | z | |
---|---|---|---|---|
0 | 0 | 1 | 2 | 3 |
1 | 0 | 1 | 2 | 3 |
2 | 4 | 5 | 6 | 7 |
3 | 4 | 5 | 6 | 7 |
Its `len` can be determined:
len(df4)
4
Notice that this is `1` more than the last index value. `loc` can be used to access this new value on the `index` and assign it to a `dict` instance that corresponds to data from a new observation:
df4.loc[len(df4)] = {'w': 9, 'x': 99, 'y': 999, 'z': 999}
df4
w | x | y | z | |
---|---|---|---|---|
0 | 0 | 1 | 2 | 3 |
1 | 0 | 1 | 2 | 3 |
2 | 4 | 5 | 6 | 7 |
3 | 4 | 5 | 6 | 7 |
4 | 9 | 99 | 999 | 999 |
`iloc` cannot be used for this purpose as it cannot access a numeric index value that does not yet exist. Sometimes it is more convenient to provide a `list` of new values:
df4.loc[len(df4)] = [8, 88, 888, 888]
df4
w | x | y | z | |
---|---|---|---|---|
0 | 0 | 1 | 2 | 3 |
1 | 0 | 1 | 2 | 3 |
2 | 4 | 5 | 6 | 7 |
3 | 4 | 5 | 6 | 7 |
4 | 9 | 99 | 999 | 999 |
5 | 8 | 88 | 888 | 888 |
If the following `pd.DataFrame` instance is instantiated:
df5 = pd.DataFrame({'w': [2, 2, 1, 1],
'x': [3, 3, 3, 4],
'y': [5, 6, 8, 8],
'z': ['a', 'b', 'c', 'd']})
df5
w | x | y | z | |
---|---|---|---|---|
0 | 2 | 3 | 5 | a |
1 | 2 | 3 | 6 | b |
2 | 1 | 3 | 8 | c |
3 | 1 | 4 | 8 | d |
The values in a `pd.DataFrame` can be sorted using the `sort_values` method, providing a `list` of `str` instances that each correspond to the `name` of a `pd.Series` to sort by:
df5.sort_values(by=['w', 'x', 'y'], inplace=True) # mutable
df5 # modified inplace
w | x | y | z | |
---|---|---|---|---|
2 | 1 | 3 | 8 | c |
3 | 1 | 4 | 8 | d |
0 | 2 | 3 | 5 | a |
1 | 2 | 3 | 6 | b |
The `pd.DataFrame` method `sort_index` can alternatively be used to sort by the values of the `index`, which restores the original order of `df5`:
df5.sort_index() # return value
w | x | y | z | |
---|---|---|---|---|
0 | 2 | 3 | 5 | a |
1 | 2 | 3 | 6 | b |
2 | 1 | 3 | 8 | c |
3 | 1 | 4 | 8 | d |
df5 # not modified
w | x | y | z | |
---|---|---|---|---|
2 | 1 | 3 | 8 | c |
3 | 1 | 4 | 8 | d |
0 | 2 | 3 | 5 | a |
1 | 2 | 3 | 6 | b |
The `index` is often reset after sorting:
df5.reset_index(drop=True)
w | x | y | z | |
---|---|---|---|---|
0 | 1 | 3 | 8 | c |
1 | 1 | 4 | 8 | d |
2 | 2 | 3 | 5 | a |
3 | 2 | 3 | 6 | b |
If two `pd.DataFrame` instances have the same `index` but different `pd.Series`, they can be joined using the `join` method:
dfleft = pd.DataFrame({'w1': [1, 2, 3, 4],
'x1': [1, 2, 3, 4],
'y1': [1, 2, 3, 4],
'z1': [1, 2, 3, 4]})
dfright = pd.DataFrame({'w2': [5, 6, 7, 8],
'x2': [5, 6, 7, 8],
'y2': [5, 6, 7, 8]})
dfleft.join(dfright)
w1 | x1 | y1 | z1 | w2 | x2 | y2 | |
---|---|---|---|---|---|---|---|
0 | 1 | 1 | 1 | 1 | 5 | 5 | 5 |
1 | 2 | 2 | 2 | 2 | 6 | 6 | 6 |
2 | 3 | 3 | 3 | 3 | 7 | 7 | 7 |
3 | 4 | 4 | 4 | 4 | 8 | 8 | 8 |
If two `pd.DataFrame` instances have the same `pd.Series` but different observations, they can be concatenated using the `pd.concat` function:
dftop = pd.DataFrame({'w': [1, 2, 3, 4],
'x': [1, 2, 3, 4],
'y': [1, 2, 3, 4],
'z': [1, 2, 3, 4]})
dfbot = pd.DataFrame({'w': [5, 6, 7, 8],
'x': [5, 6, 7, 8],
'y': [5, 6, 7, 8],
'z': [5, 6, 7, 8]})
pd.concat([dftop, dfbot])
w | x | y | z | |
---|---|---|---|---|
0 | 1 | 1 | 1 | 1 |
1 | 2 | 2 | 2 | 2 |
2 | 3 | 3 | 3 | 3 |
3 | 4 | 4 | 4 | 4 |
0 | 5 | 5 | 5 | 5 |
1 | 6 | 6 | 6 | 6 |
2 | 7 | 7 | 7 | 7 |
3 | 8 | 8 | 8 | 8 |
This outputs a new `pd.DataFrame` instance. Note there are now two observations with the `int` value `0` on the index:
pd.concat([dftop, dfbot]).loc[0]
w | x | y | z | |
---|---|---|---|---|
0 | 1 | 1 | 1 | 1 |
0 | 5 | 5 | 5 | 5 |
To rectify this, the named parameter `ignore_index` can be assigned to the `bool` instance `True`:
pd.concat([dftop, dfbot], ignore_index=True)
w | x | y | z | |
---|---|---|---|---|
0 | 1 | 1 | 1 | 1 |
1 | 2 | 2 | 2 | 2 |
2 | 3 | 3 | 3 | 3 |
3 | 4 | 4 | 4 | 4 |
4 | 5 | 5 | 5 | 5 |
5 | 6 | 6 | 6 | 6 |
6 | 7 | 7 | 7 | 7 |
7 | 8 | 8 | 8 | 8 |
The `pd.DataFrame` method `align` is used for aligning two `pd.DataFrame` instances side by side within a `tuple`:
dfa = pd.DataFrame({'x': [1, 2],
'y': [5, 6],
'z': [1, 2]})
dfb = pd.DataFrame({'w': [5, 6, 7, 8],
'x': [5, 6, 7, 8],
'z': [5, 6, 7, 8]})
Any rows or columns found in one but not the other are filled with `NaN`:
dfa.align(dfb)
( w x y z 0 NaN 1.0 5.0 1.0 1 NaN 2.0 6.0 2.0 2 NaN NaN NaN NaN 3 NaN NaN NaN NaN, w x y z 0 5 5 NaN 5 1 6 6 NaN 6 2 7 7 NaN 7 3 8 8 NaN 8)
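The `join` parameter of `align` (default `'outer'`) controls which labels are kept; for example `'inner'` keeps only the row and column labels common to both (a quick sketch):
dfa.align(dfb, join='inner')
( x z 0 1 1 1 2 2, x z 0 5 5 1 6 6)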
Axis¶
If the following `pd.DataFrame` is instantiated:
df6 = pd.DataFrame({'w': [0, 1, None, 3],
'x': [1, 1, None, 4],
'y': [None, None, 9, 9],
'z': [3, 4, 9, 9]},
index=('a', 'b', 'c', 'd'))
df6
w | x | y | z | |
---|---|---|---|---|
a | 0.0 | 1.0 | NaN | 3 |
b | 1.0 | 1.0 | NaN | 4 |
c | NaN | NaN | 9.0 | 9 |
d | 3.0 | 4.0 | 9.0 | 9 |
Notice when `ffill` is used, by default it carries out the operation along the index of each `pd.Series`, giving consistent behaviour to the equivalent `pd.Series` method:
df6.ffill()
w | x | y | z | |
---|---|---|---|---|
a | 0.0 | 1.0 | NaN | 3 |
b | 1.0 | 1.0 | NaN | 4 |
c | 1.0 | 1.0 | 9.0 | 9 |
d | 3.0 | 4.0 | 9.0 | 9 |
Many of the `pd.DataFrame` methods have an `axis` parameter which can instead be used to specify the `axis` to perform the operation on:
dir2(pd.DataFrame, parameter='axis', exclude_identifier_list=obsolete.pandas_depreciation)
{'method': ['add', 'add_prefix', 'add_suffix', 'agg', 'aggregate', 'align', 'all', 'any', 'apply', 'at_time', 'backfill', 'between_time', 'bfill', 'clip', 'corrwith', 'count', 'cummax', 'cummin', 'cumprod', 'cumsum', 'diff', 'div', 'divide', 'drop', 'droplevel', 'dropna', 'eq', 'ewm', 'expanding', 'ffill', 'fillna', 'filter', 'floordiv', 'ge', 'groupby', 'gt', 'idxmax', 'idxmin', 'interpolate', 'kurt', 'kurtosis', 'le', 'lt', 'mask', 'max', 'mean', 'median', 'min', 'mod', 'mode', 'mul', 'multiply', 'ne', 'nunique', 'pad', 'pow', 'prod', 'product', 'quantile', 'radd', 'rank', 'rdiv', 'reindex', 'rename', 'rename_axis', 'reorder_levels', 'resample', 'rfloordiv', 'rmod', 'rmul', 'rolling', 'rpow', 'rsub', 'rtruediv', 'sample', 'sem', 'set_axis', 'shift', 'skew', 'sort_index', 'sort_values', 'squeeze', 'std', 'sub', 'subtract', 'sum', 'swaplevel', 'take', 'to_period', 'to_timestamp', 'transform', 'truediv', 'truncate', 'tz_convert', 'tz_localize', 'var', 'where', 'xs']}
df6
w | x | y | z | |
---|---|---|---|---|
a | 0.0 | 1.0 | NaN | 3 |
b | 1.0 | 1.0 | NaN | 4 |
c | NaN | NaN | 9.0 | 9 |
d | 3.0 | 4.0 | 9.0 | 9 |
The `axis` parameter can be used to instruct the method to operate along the `'index'`, which is the default behaviour:
df6.ffill(axis='index') # operates along the index
w | x | y | z | |
---|---|---|---|---|
a | 0.0 | 1.0 | NaN | 3 |
b | 1.0 | 1.0 | NaN | 4 |
c | 1.0 | 1.0 | 9.0 | 9 |
d | 3.0 | 4.0 | 9.0 | 9 |
Alternatively it can be used to instruct the method to operate along the `'columns'` (this requires that the `dtypes` of each `Series` are consistent, usually numeric):
df6.dtypes
w float64 x float64 y float64 z int64 dtype: object
df6.ffill(axis='columns') # operates along columns
w | x | y | z | |
---|---|---|---|---|
a | 0.0 | 1.0 | 1.0 | 3.0 |
b | 1.0 | 1.0 | 1.0 | 4.0 |
c | NaN | NaN | 9.0 | 9.0 |
d | 3.0 | 4.0 | 9.0 | 9.0 |
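The complementary bfill method takes the same axis parameter; a minimal sketch filling each NaN from the next valid value along the index:
df6.bfill(axis='index')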
Category Methods¶
A pd.DataFrame
instance can be created using four pd.Series
, using 'float64'
, 'category'
, 'category'
and 'int32'
as the respective dtype
:
np.random.seed(0)
w = pd.Series(np.random.random(size=10), name='w', dtype='float64')
x = pd.Series(np.random.choice(['A', 'B', 'C'], size=10), name='x', dtype='category')
y = pd.Series(np.random.choice(['a', 'b', 'c'], size=10), name='y', dtype='category')
z = pd.Series(np.random.randint(0, 10, size=10), name='z', dtype='int32')
df7 = pd.DataFrame()
df7['w'] = w
df7['x'] = x
df7['y'] = y
df7['z'] = z
df7
w | x | y | z | |
---|---|---|---|---|
0 | 0.548814 | C | a | 1 |
1 | 0.715189 | C | b | 9 |
2 | 0.602763 | A | c | 9 |
3 | 0.544883 | B | a | 0 |
4 | 0.423655 | B | c | 4 |
5 | 0.645894 | B | a | 7 |
6 | 0.437587 | B | b | 3 |
7 | 0.891773 | A | b | 2 |
8 | 0.963663 | B | c | 7 |
9 | 0.383442 | A | a | 2 |
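As an aside, an equivalent pd.DataFrame can be assembled in a single step by concatenating the four pd.Series along the columns axis (a sketch using a hypothetical name df7_alt; the column names come from each pd.Series name):
df7_alt = pd.concat([w, x, y, z], axis=1)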
The cat accessor can be used to access the as_ordered method, and the cat accessor of the pd.Series returned can then be used to access the reorder_categories method:
df7['x'] = df7['x'].cat.as_ordered().cat.reorder_categories(['C', 'B', 'A'])
df7['y'] = df7['y'].cat.as_ordered().cat.reorder_categories(['c', 'b', 'a'])
df7
w | x | y | z | |
---|---|---|---|---|
0 | 0.548814 | C | a | 1 |
1 | 0.715189 | C | b | 9 |
2 | 0.602763 | A | c | 9 |
3 | 0.544883 | B | a | 0 |
4 | 0.423655 | B | c | 4 |
5 | 0.645894 | B | a | 7 |
6 | 0.437587 | B | b | 3 |
7 | 0.891773 | A | b | 2 |
8 | 0.963663 | B | c | 7 |
9 | 0.383442 | A | a | 2 |
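The new ordering can be verified via the cat accessor (a minimal sketch):
df7['x'].cat.categories  # the category order is now ['C', 'B', 'A']
df7['x'].cat.ordered     # True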
This allows use of comparison operators with one of the pd.Series instances that has a dtype of 'category':
df7['x'] >= 'B'
0    False
1    False
2     True
3     True
4     True
5     True
6     True
7     True
8     True
9     True
Name: x, dtype: bool
df7['y'] >= 'b'
0     True
1     True
2    False
3     True
4    False
5     True
6     True
7     True
8    False
9     True
Name: y, dtype: bool
Conditions can be joined together using the & operator, the | operator or the ^ operator. Note parentheses should be used to enclose each condition:
(df7['x'] >= 'B') & (df7['y'] >= 'b')
0    False
1    False
2    False
3     True
4    False
5     True
6     True
7     True
8    False
9     True
dtype: bool
The and and or keywords are not supported and raise a ValueError.
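A minimal sketch of the failure mode:
try:
    (df7['x'] >= 'B') and (df7['y'] >= 'b')
except ValueError as error:
    print(error)  # the truth value of a Series is ambiguous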
This pd.Series
instance with dtype
of bool
can be used to index into the pd.DataFrame
instance df7
:
df7[(df7['x'] >= 'B') & (df7['y'] >= 'b')]
w | x | y | z | |
---|---|---|---|---|
3 | 0.544883 | B | a | 0 |
5 | 0.645894 | B | a | 7 |
6 | 0.437587 | B | b | 3 |
7 | 0.891773 | A | b | 2 |
9 | 0.383442 | A | a | 2 |
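The complement of a condition can be selected with the ~ operator (a minimal sketch):
df7[~((df7['x'] >= 'B') & (df7['y'] >= 'b'))]  # rows where the combined condition is False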
The pd.DataFrame method groupby can also be used to group by each category in a pd.Series with a dtype of 'category', creating a groupby object:
gbo = df7.groupby(df7['x'], observed=True)
The identifiers of this DataFrameGroupBy
can be examined:
type(gbo)
pandas.core.groupby.generic.DataFrameGroupBy
dir2(gbo, drop_internal=True)
{'attribute': ['dtypes', 'groups', 'indices', 'ndim', 'ngroups', 'w', 'x', 'y', 'z'], 'method': ['agg', 'aggregate', 'all', 'any', 'apply', 'bfill', 'boxplot', 'corr', 'corrwith', 'count', 'cov', 'cumcount', 'cummax', 'cummin', 'cumprod', 'cumsum', 'describe', 'diff', 'ewm', 'expanding', 'ffill', 'fillna', 'filter', 'first', 'get_group', 'head', 'hist', 'idxmax', 'idxmin', 'last', 'max', 'mean', 'median', 'min', 'ngroup', 'nth', 'nunique', 'ohlc', 'pct_change', 'pipe', 'plot', 'prod', 'quantile', 'rank', 'resample', 'rolling', 'sample', 'sem', 'shift', 'size', 'skew', 'std', 'sum', 'tail', 'take', 'transform', 'value_counts', 'var'], 'datamodel_attribute': ['__annotations__', '__dict__', '__doc__', '__module__', '__orig_bases__', '__parameters__', '__weakref__'], 'datamodel_method': ['__class__', '__class_getitem__', '__delattr__', '__dir__', '__eq__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__']}
A method such as count
can be used to count the number of values in each category:
gbo.count()
w | y | z | |
---|---|---|---|
x | |||
C | 2 | 2 | 2 |
B | 5 | 5 | 5 |
A | 3 | 3 | 3 |
This is typically done in one line:
df7.groupby(df7['x'], observed=True).count()
w | y | z | |
---|---|---|---|
x | |||
C | 2 | 2 | 2 |
B | 5 | 5 | 5 |
A | 3 | 3 | 3 |
Note a pd.DataFrame instance is output, with the same count duplicated for each pd.Series. If the groupby method is instead called from a pd.Series instance, a pd.Series will be returned:
df7['x'].groupby(df7['x'], observed=True).count()
x
C    2
B    5
A    3
Name: x, dtype: int64
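For a simple per-category count, the value_counts method gives a similar pd.Series directly (a sketch; note it sorts by descending count by default rather than by category order):
df7['x'].value_counts()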
When a DataFrameGroupBy
is created from a pd.DataFrame
instance and the describe
method is used, it will display statistics for each category in each numeric pd.Series
:
df7.groupby(df7['x'], observed=True).describe()
w | z | |||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | mean | std | min | 25% | 50% | 75% | max | count | mean | std | min | 25% | 50% | 75% | max | |
x | ||||||||||||||||
C | 2.0 | 0.632001 | 0.117646 | 0.548814 | 0.590407 | 0.632001 | 0.673595 | 0.715189 | 2.0 | 5.000000 | 5.656854 | 1.0 | 3.0 | 5.0 | 7.0 | 9.0 |
B | 5.0 | 0.603136 | 0.220693 | 0.423655 | 0.437587 | 0.544883 | 0.645894 | 0.963663 | 5.0 | 4.200000 | 2.949576 | 0.0 | 3.0 | 4.0 | 7.0 | 7.0 |
A | 3.0 | 0.625993 | 0.254961 | 0.383442 | 0.493102 | 0.602763 | 0.747268 | 0.891773 | 3.0 | 4.333333 | 4.041452 | 2.0 | 2.0 | 2.0 | 5.5 | 9.0 |
It is common to call the groupby method from a numeric pd.Series instance and group by another pd.Series which has a dtype of 'category':
df7['w'].groupby(df7['x'], observed=True).describe()
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
x | ||||||||
C | 2.0 | 0.632001 | 0.117646 | 0.548814 | 0.590407 | 0.632001 | 0.673595 | 0.715189 |
B | 5.0 | 0.603136 | 0.220693 | 0.423655 | 0.437587 | 0.544883 | 0.645894 | 0.963663 |
A | 3.0 | 0.625993 | 0.254961 | 0.383442 | 0.493102 | 0.602763 | 0.747268 | 0.891773 |
The same pattern is used with many of the statistical DataFrameGroupBy methods:
df7['w'].groupby(df7['x'], observed=True).mean()
x
C    0.632001
B    0.603136
A    0.625993
Name: w, dtype: float64
df7['w'].groupby(df7['x'], observed=True).std()
x
C    0.117646
B    0.220693
A    0.254961
Name: w, dtype: float64
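Several statistics can also be requested at once with the agg method, which returns a pd.DataFrame with one column per statistic (a minimal sketch):
df7['w'].groupby(df7['x'], observed=True).agg(['mean', 'std', 'count'])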