
`pandas` is a Python library that makes it easy to read, export, and work with relational data. This lesson will expand on its functionality and usage.

Use `pip install pandas==0.24.2` to install the same version that we are using.

We typically `import pandas as pd` to refer to the library using the abbreviated form.

From the official documentation, a Series is a one-dimensional ndarray with axis labels. An `ndarray` is a special data type found in the `numpy` library that defines an array of fixed-size elements. In simpler terms, a Series is a column in a table or spreadsheet in which every value has the same data type. Each Series has an index used to indicate the axis labels.

We can create a Series using `pd.Series(['some', 'array', 'object'])`.

We can look up Series values using their axis labels or their integer positions. If not specified, the Series axis labels (otherwise known as the Series index) default to integers. We can set the index to strings too.

```
sample_series = pd.Series(['some', 'array', 'object'], index=list('abc'))
# positional indexing: this returns the first value, which is 'some'
sample_series[0]
# label indexing: this also returns the first value 'some'
sample_series['a']
```

Printed out, our `sample_series` shows each value alongside its index label (`a`, `b`, `c`).

We can slice a Series to grab a range of values. Slicing behavior is different when using the axis labels -- contrary to usual Python slices, both the start and the endpoint are included!

```
# positional slicing: this returns the first two values
sample_series[:2]
# label slicing: this also returns the first two values
sample_series[:'b']
```

DataFrames are used to define two-dimensional data. Rows are labelled using indices and columns are labelled using column headers. Each column can be interpreted as a Series. We can create a DataFrame using `pd.DataFrame({'column 1': [1, 1], 'column 2': [2, 2]})`.

Alternatively, we can also read tabular data into DataFrames.

```
# Read in a CSV file
csv_dataframe = pd.read_csv('my_csv_file.csv')
# Read in an Excel file
xls_dataframe = pd.read_excel('my_xls_file.xls')
```
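The filenames above are placeholders, so as a self-contained sanity check we can sketch a round trip instead: write a small DataFrame out to a temporary CSV file with `to_csv`, then read it back with `read_csv`.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'column 1': [1, 1], 'column 2': [2, 2]})

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'sample.csv')
    # index=False keeps the row labels out of the file
    df.to_csv(path, index=False)
    roundtrip = pd.read_csv(path)

# the DataFrame we read back matches the one we wrote out
print(roundtrip.equals(df))
```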

We can index DataFrame columns using square brackets. Let's use the very simple DataFrame we created as an example.

```
sample_dataframe = pd.DataFrame({'column 1': [1, 1], 'column 2': [2, 2]})
# get the column 'column 1'
sample_dataframe['column 1']
```

For more complex indexing, we can use `.iloc` or `.loc`.

Because we did not specify the axis labels for the rows, they adopted the default integer values. As such, the positional labels and axis labels are the same for this DataFrame.

We can retrieve the first row like this:

```
# positional: .iloc selects by integer location
sample_dataframe.iloc[0, :]
# label-based: .loc selects by axis label
sample_dataframe.loc[0, :]
```

Let's create another DataFrame to illustrate some of the functionalities. We can pretend that this data is taken from a company that distributes education materials.

```
data = pd.DataFrame({'customer_id': [1, 2, 3, 4, 5, 6, 7, 8],
                     'age': [29, 43, 22, 82, 41, 33, 63, 57],
                     'email_linked': [True, True, False, True, False, False, True, True],
                     'occupation': ['teacher', 'highschool teacher', 'student', 'retired',
                                    'tutor', 'unemployed', 'entrepreneur', 'professor']})
```

For larger DataFrames, we can use `.head(n)` to look at the first *n* rows. To see the last few rows, we can perform a similar operation using `.tail(n)`. Neither is necessary for our small dataset, but we can still demonstrate using `data.head(3)`.
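A minimal sketch, re-creating the `data` DataFrame from above so the snippet runs on its own:

```python
import pandas as pd

data = pd.DataFrame({'customer_id': [1, 2, 3, 4, 5, 6, 7, 8],
                     'age': [29, 43, 22, 82, 41, 33, 63, 57],
                     'email_linked': [True, True, False, True, False, False, True, True],
                     'occupation': ['teacher', 'highschool teacher', 'student', 'retired',
                                    'tutor', 'unemployed', 'entrepreneur', 'professor']})

first_three = data.head(3)  # rows for customers 1, 2 and 3
last_three = data.tail(3)   # rows for customers 6, 7 and 8
```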

Assume that we want to run an email campaign. We start by extracting relevant columns to conduct our campaign.

```
# use double brackets to index multiple columns, single brackets for one column
email_data = data[['customer_id', 'email_linked']]
```

Not all of our customers have emails, so we definitely want to exclude those that don't.

```
# the condition goes inside the square brackets
email_data = email_data[email_data['email_linked']]
```
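To sketch the effect of this filter (using a trimmed-down version of the `data` DataFrame from above): five of the eight customers have a linked email, so the filtered frame should keep exactly five rows.

```python
import pandas as pd

data = pd.DataFrame({'customer_id': [1, 2, 3, 4, 5, 6, 7, 8],
                     'email_linked': [True, True, False, True, False, False, True, True]})

email_data = data[['customer_id', 'email_linked']]
# boolean indexing keeps only the rows where email_linked is True
email_data = email_data[email_data['email_linked']]

print(len(email_data))  # five customers have a linked email
```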

Let's write a very simple function to determine whether a customer is an educator. Here is how we define an educator:

```
def is_educator(occupation):
    # match anyone whose title mentions 'teacher', plus a few other roles
    return ('teacher' in occupation.lower()
            or occupation.lower() in ['tutor', 'professor', 'lecturer'])
```

We can apply this function to the *occupation* column to create a new column.

`data['is_educator'] = data['occupation'].apply(is_educator)`
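Putting the pieces together (re-creating the relevant column from `data` so the sketch is self-contained), four of our eight customers count as educators:

```python
import pandas as pd

def is_educator(occupation):
    return ('teacher' in occupation.lower()
            or occupation.lower() in ['tutor', 'professor', 'lecturer'])

data = pd.DataFrame({'occupation': ['teacher', 'highschool teacher', 'student', 'retired',
                                    'tutor', 'unemployed', 'entrepreneur', 'professor']})

# apply runs is_educator once per value in the occupation column
data['is_educator'] = data['occupation'].apply(is_educator)

print(int(data['is_educator'].sum()))  # teacher, highschool teacher, tutor, professor
```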

We can also transform all the rows in each column of a DataFrame. This requires that we set `axis=0` (which is also the default setting). For example, we can write a column-wise function that keeps only the columns containing no missing values. This is for demonstration only -- there are better ways to handle missing values (see the official pandas documentation for more detail).

```
def remove_missing_columns(col):
    # keep the column only if it contains no missing values
    if not col.isnull().values.any():
        return col

data.apply(remove_missing_columns, axis=0)
```

We can also apply a function to transform every column in a row. This requires that we set `axis=1`.

```
def is_educator_above_50(row):
    # combines two fields from the same row: age and occupation
    return row['age'] > 50 and is_educator(row['occupation'])

data['is_educator_above_50'] = data.apply(is_educator_above_50, axis=1)
```

Groupby operations are useful for analyzing pandas objects and engineering new features from large amounts of data. All groupby operations can be broken down into the following steps: the object is split into groups based on some criteria, a summary statistic is calculated for each group, and the results are combined into a larger object. We can use a groupby operation to calculate the average age for each occupation.

The code for this operation is very simple: `data.groupby(by=['occupation']).mean()['age']`

The *by* parameter indicates how groups are determined, *mean()* is the statistic of interest, and indexing by *age* grabs the group statistic for age. The output is a Series where the *occupation* values are the axis labels.
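A self-contained sketch using the relevant columns of our `data` DataFrame (since each occupation appears only once here, each group's mean age is simply that customer's age):

```python
import pandas as pd

data = pd.DataFrame({'age': [29, 43, 22, 82, 41, 33, 63, 57],
                     'occupation': ['teacher', 'highschool teacher', 'student', 'retired',
                                    'tutor', 'unemployed', 'entrepreneur', 'professor']})

# split by occupation, compute the mean of each group, keep the age statistic
average_age = data.groupby(by=['occupation']).mean()['age']

print(average_age['teacher'])  # the lone teacher is 29
```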

We can split groups using multiple parameters, for instance a combination of occupation and whether or not their email is linked: `data.groupby(by=['email_linked', 'occupation']).mean()['age']`

In addition to *mean*, there are other built-in functions that we can apply to each group: *min*, *max*, *count* and *sum*, just to name a few. We can also use `agg()` to apply any custom functions. The aggregation method is also useful for returning multiple summary statistics.

For instance, `data.groupby(by=['occupation']).agg(['mean', 'sum'])['age']` will return the average age as well as the sum of ages for each group.
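To sketch a custom aggregation, we can pass our own function to `agg()` alongside a built-in one. The `age_range` helper below is hypothetical, not part of pandas; it measures the spread between the oldest and youngest customer in each group.

```python
import pandas as pd

data = pd.DataFrame({'age': [29, 43, 22, 82, 41, 33, 63, 57],
                     'email_linked': [True, True, False, True, False, False, True, True]})

def age_range(ages):
    # custom aggregation: spread between oldest and youngest in the group
    return ages.max() - ages.min()

# each group gets one column per aggregation; the custom column is
# named after the function ('age_range')
summary = data.groupby(by=['email_linked']).agg(['mean', age_range])['age']
```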

These attributes help us explore and get familiar with new DataFrames:

- `data.columns` returns a list of all the columns
- `data.shape` returns the dimensions in the form (number of rows, number of columns)
- `data.dtypes` returns the data type of each column
- `data.index` returns the range of the index values
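Running these attributes against the `data` DataFrame from earlier (re-created here so the snippet stands alone):

```python
import pandas as pd

data = pd.DataFrame({'customer_id': [1, 2, 3, 4, 5, 6, 7, 8],
                     'age': [29, 43, 22, 82, 41, 33, 63, 57],
                     'email_linked': [True, True, False, True, False, False, True, True],
                     'occupation': ['teacher', 'highschool teacher', 'student', 'retired',
                                    'tutor', 'unemployed', 'entrepreneur', 'professor']})

print(list(data.columns))  # the four column labels
print(data.shape)          # (8, 4): 8 rows, 4 columns
print(data.dtypes)         # one dtype per column
print(data.index)          # RangeIndex(start=0, stop=8, step=1)
```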