Python Programming for Data Science

The Anaconda Data Science Python Distribution

To get started with Python, we need to install Anaconda. The Anaconda Python distribution comes with the Python programming language, the Spyder 5 IDE, the JupyterLab 3 IDE and a multitude of data science libraries such as numpy, pandas, matplotlib and seaborn, which these guides will explore.

These first guides give detailed installation instructions for the Anaconda or Miniconda 2021-11 Python Distribution on Windows 11 (equally applicable to Windows 10) and Linux Ubuntu 22.04 LTS (equally applicable to other modern distros).

These installation guides also explain how to use the Spyder 5 IDE and the JupyterLab 3 IDE, as well as how to use the conda package manager to manage conda packages and work with conda environments. They aim to clear up much of the confusion new users have when they first come to Python or Anaconda.

Procedural Programming

This beginner guide looks at the inbuilt Python programming language and the concept of basic procedural programming. In procedural programming, code is executed line by line in the order specified, for example within a script file. These guides will use the Scientific PYthon Development EnviRonment (Spyder 5), which is one of the best Integrated Development Environments (IDEs), particularly for beginners, due to its simple but powerful user interface and very versatile variable explorer.
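
As a minimal sketch of procedural execution, the short script below runs each statement in order, from top to bottom. The variable names and values are hypothetical and chosen only for illustration.

```python
# Each statement is executed in the order it appears in the script.
length = 3                      # hypothetical example value
width = 4                       # hypothetical example value
area = length * width           # uses the two values assigned above
message = f"Area: {area}"       # builds a string from the result
print(message)
```

Swapping the order of the lines, for example computing area before length is assigned, would raise a NameError, because each line can only use names created by earlier lines.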

Code Blocks and Debugging

So far we have only looked at procedural programming, where every line is executed in order, line by line. It is now worthwhile understanding the concept of code blocks. The Spyder 5 Debugger is used in this guide to visualize how these code blocks operate. if, elif, else code blocks can be used to execute code dependent on a condition. A for loop code block can be used to repeat a block of code over an iterable object, and a while loop can be used to repeat code while a condition is satisfied. Functions can be used to compartmentalize code, particularly code that is going to be used several times. We also discuss how functions have their own local environment (namespace). Finally, we discuss the try, except, else, finally code blocks, which are used for error handling.
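
The code blocks described above can be sketched as follows. This is an illustrative example rather than code taken from the guide; the function names classify and safe_divide and all the values are hypothetical.

```python
def classify(number):
    """Return a label for number using an if, elif, else block."""
    if number < 0:
        return "negative"
    elif number == 0:
        return "zero"
    else:
        return "positive"


# for loop: repeat the indented block once per item in an iterable.
labels = []
for value in [-2, 0, 5]:
    labels.append(classify(value))

# while loop: repeat the indented block while the condition is True.
countdown = 3
steps = 0
while countdown > 0:
    countdown -= 1
    steps += 1

attempts = []  # records every division attempted, successful or not


def safe_divide(numerator, denominator):
    """Divide two numbers, returning None instead of raising on zero."""
    try:
        result = numerator / denominator
    except ZeroDivisionError:
        # Runs only if the try block raised this specific error.
        result = None
    finally:
        # Runs whether or not an error occurred.
        attempts.append((numerator, denominator))
    # result is a local name: it exists only inside this function.
    return result


print(labels)
print(steps)
print(safe_divide(6, 3))
print(safe_divide(1, 0))
```

Stepping through this script in a debugger shows which indented blocks are entered and which are skipped for each input.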

Object Orientated Programming (OOP)

Python is an Object Orientated Programming (OOP) language where everything we interact with is an object. Each object has a class, which initially can be conceptualised as an abstract blueprint which defines how to create a new object and outlines the properties and functionality behind an object. The properties can be thought of as data belonging to the object and are known as attributes. The functionality can be thought of as functions belonging to the object, known as methods. We use the dot syntax to access an attribute or method from an instance of a class. Because methods are functions, they have to be called using parentheses (which enclose any mandatory positional or optional keyword input arguments). Object Orientated Programming is used all over Python, and this beginner tutorial on OOP using the inbuilt Python programming language will help you understand the workings behind commonly used Python libraries.

object.attribute
object.method()
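
The dot syntax above can be illustrated with a small class. The Rectangle class below is a hypothetical example written for this sketch, not something provided by Python itself.

```python
class Rectangle:
    """Blueprint describing how to create a rectangle object."""

    def __init__(self, width, height):
        # Attributes: data belonging to each instance.
        self.width = width
        self.height = height

    def area(self):
        # Method: a function belonging to the instance.
        return self.width * self.height


rect = Rectangle(3, 4)      # create an instance from the class
print(rect.width)           # attribute access, no parentheses
print(rect.area())          # method call, parentheses required

# Inbuilt objects work the same way: upper is a method of the str class.
greeting = "hello"
print(greeting.upper())
```

Writing rect.area without parentheses refers to the method itself rather than calling it, which is a common source of confusion for beginners.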

The numpy Library

Python's inbuilt data structures such as lists, tuples and dicts are not optimised for numeric operations. For numeric operations we should instead use the numpy library, which is based around the ndarray data structure. numpy is an abbreviation for numeric Python. numpy should be considered the primary data science library, as most other data science libraries build upon it. Having a basic understanding of numpy will help when it comes to looking at the other data science libraries, particularly pandas.
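
A brief sketch of the difference between a list and an ndarray is shown below. It assumes numpy is installed (it is included with Anaconda); the values are hypothetical.

```python
import numpy as np

values = [1, 2, 3, 4]

# Multiplying a list repeats it; it does not multiply each element.
repeated_list = values * 2

# Converting the list to an ndarray gives element-by-element arithmetic.
array = np.array(values)
doubled_array = array * 2

print(repeated_list)
print(doubled_array)
print(array.mean())     # numeric methods are built into the ndarray
```

The same operator therefore behaves differently depending on the class of the object it is applied to, which is why numeric work is normally done with ndarrays rather than lists.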

The pandas Library

The pandas library is built around three data structures. The index is normally a range of integers, comparable to a numpy arange array, or sometimes a list of strings. The series is essentially a numpy array where each value is tied to a corresponding index value, and the series has a name. The series can be conceptualised as a column, with each value in the series shown to the right-hand side of its corresponding index value. Finally, there is the dataframe, which is a collection of series that all share a common index. The dataframe is conceptually analogous to an Excel spreadsheet, and data manipulation that can be done in Excel can generally be done programmatically in pandas. The dataframe is one of the most commonly used data structures within data science.
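
The three data structures can be sketched with a small example. It assumes pandas is installed (it is included with Anaconda); the column names and values are hypothetical.

```python
import pandas as pd

# A dataframe built from a dict: each key becomes a named series (column).
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cat"],      # hypothetical data
    "height_m": [1.6, 1.7, 1.8],
})

# With no index supplied, pandas creates a default integer range index.
print(df.index)

# Selecting one column returns a series that shares the dataframe's index.
heights = df["height_m"]
print(heights.name)
print(heights)

# A spreadsheet-style calculation applied to a whole column at once.
df["height_cm"] = df["height_m"] * 100
print(df)
```

Printing heights shows the index values on the left and the series values on the right, which matches the column picture described above.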

The matplotlib Library

This guide will look at the use of matplotlib, which is an abbreviation for the matrix plotting library, i.e. a library which plots data from numpy ndarrays. When using matplotlib, the pyplot module (an abbreviation for Python plot) is typically used. This guide will first explore the use of pyplot via procedural programming and then look at using object orientated programming (OOP), which increases flexibility. This guide looks at common 2D plots such as the line plot, scatterplot, bar graph, histogram, pie chart, boxplot and violinplot, and 3D plots such as contour and surface plots.
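
The two pyplot styles mentioned above can be sketched as follows. This assumes numpy and matplotlib are installed (both are included with Anaconda); the data and labels are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 50)   # 50 evenly spaced points
y = np.sin(x)

# Procedural style: pyplot functions act on the current figure and axes.
plt.figure()
plt.plot(x, y)
plt.xlabel("x")
plt.ylabel("sin(x)")

# Object orientated style: keep references to the figure and axes objects
# and call their methods directly.
fig, ax = plt.subplots()
(line,) = ax.plot(x, y)
ax.set_xlabel("x")
ax.set_ylabel("sin(x)")
ax.set_title("Sine wave")
```

In a script, calling plt.show() afterwards displays both figures. The object orientated form is usually easier to work with once a figure contains several axes, because each axes object can be modified independently.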

The seaborn Library

This guide will look at the use of seaborn, which is a wrapper around the matplotlib plotting library optimised for the pandas dataframe data structure. seaborn includes functions to set a consistent style and palette across plots. The seaborn plotting library includes a number of plot types. These are output either as a matplotlib AxesSubplot (axes level) or as a seaborn FacetGrid (figure level). seaborn splits the data in a dataframe using the categories in a categorical series to give, for example, multiple lines in a line plot.
