Extra Topics

Base Python

Try / Except - Robustness

Errors and warnings are very common while developing code, and they are an important part of the learning process. In some cases, they can also be useful in designing an algorithm. For example, suppose we have a stream of user-entered data that is supposed to contain each user's age in years. You might expect a few errors or nonsense entries.

It would be useful to convert these values to a numeric type so we can compute the average age of our users, but we also want to build something that can set non-numeric values aside. We can attempt the conversion and tell Python how to handle any errors with a try-except statement:
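A minimal sketch of this idea, using a hypothetical list of user-entered age strings:

```python
# Hypothetical stream of user-entered ages; some entries are not numeric.
entries = ["34", "19", "twenty", "41", "?", "27"]

ages = []
rejected = []
for entry in entries:
    try:
        ages.append(float(entry))   # attempt the conversion
    except ValueError:              # runs only if the conversion fails
        rejected.append(entry)      # set the non-numeric value aside

print(sum(ages) / len(ages))  # average of the valid ages
print(rejected)               # the entries we set aside
```

Because the `except` clause names `ValueError` specifically, any other kind of error would still be raised normally rather than silently swallowed.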

User-defined Functions

While Python (and its available packages) provide a wide variety of functions, sometimes it's useful to create your own. Python's syntax for defining a function is as follows:

def <function_name> ( <arguments> ):
    <code depending on arguments>
    return <value>

The mean function below returns the mean of a list of numbers. (Base Python does not include a function for the mean.)
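A sketch of such a mean function, using only built-in tools:

```python
def mean(numbers):
    """Return the arithmetic mean of a list of numbers."""
    return sum(numbers) / len(numbers)

print(mean([2, 4, 6, 8]))  # 5.0
```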

Pandas

Merging and Concatenating Datasets

Concatenating Datasets

With the datasets above, it seems clear that these DataFrames would be best combined by stacking them on top of each other, appending one to the other as additional rows or observations. Pandas' pd.concat lets us concatenate a list of DataFrames into a single DataFrame.
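A minimal sketch with two made-up DataFrames standing in for the datasets above:

```python
import pandas as pd

# Two hypothetical DataFrames with the same columns
df1 = pd.DataFrame({"name": ["Ana", "Ben"], "age": [34, 29]})
df2 = pd.DataFrame({"name": ["Cal", "Dee"], "age": [41, 23]})

# Stack df2 below df1; ignore_index=True renumbers the rows 0..3
stacked = pd.concat([df1, df2], ignore_index=True)
print(stacked)
```

Without `ignore_index=True`, the result keeps each original DataFrame's row labels, so the index would contain duplicates (0, 1, 0, 1).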

Merging

When joining column-wise, we usually can't just concatenate our DataFrames together; instead, we use certain key variables to make sure the same observations end up in the same row.

We'll use the pd.merge function to merge datasets on key column variables.

pd.merge automatically uses all column names that appear in both datasets as keys. We can also specify key variables:
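A sketch with hypothetical DataFrames that share a "name" column, showing both the automatic and the explicit form:

```python
import pandas as pd

ages = pd.DataFrame({"name": ["Ana", "Ben", "Cal"], "age": [34, 29, 41]})
cities = pd.DataFrame({"name": ["Ana", "Ben", "Cal"],
                       "city": ["Oslo", "Lima", "Kyiv"]})

# With no key specified, pd.merge keys on every shared column name
# ("name" is the only shared column here); on= makes the key explicit
merged = pd.merge(ages, cities, on="name")
print(merged)
```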

Pandas also includes a DataFrame method version of merge:
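The method form is called on the left DataFrame and gives the same result; a small sketch with hypothetical data:

```python
import pandas as pd

ages = pd.DataFrame({"name": ["Ana", "Ben"], "age": [34, 29]})
cities = pd.DataFrame({"name": ["Ana", "Ben"], "city": ["Oslo", "Lima"]})

# Equivalent to pd.merge(ages, cities, on="name")
merged = ages.merge(cities, on="name")
print(merged)
```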

Note that there is also a join method that joins on the DataFrames' indices rather than on key columns. It can be useful, but merge is usually more versatile.

In the examples above, the two DataFrames share the same "name" key values. However, when the values don't completely match, we can use how to choose which values get kept and which values get dropped.
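A sketch of the common how options, with hypothetical DataFrames whose key values only partially overlap:

```python
import pandas as pd

left = pd.DataFrame({"name": ["Ana", "Ben"], "age": [34, 29]})
right = pd.DataFrame({"name": ["Ben", "Cal"], "city": ["Lima", "Kyiv"]})

# "inner" (the default) keeps only keys found in both: just Ben
inner = pd.merge(left, right, on="name", how="inner")

# "outer" keeps every key from either side; missing values become NaN
outer = pd.merge(left, right, on="name", how="outer")

# "left" keeps all keys from the left DataFrame ("right" mirrors this)
left_join = pd.merge(left, right, on="name", how="left")
```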

The "cross" option pairs every row of one DataFrame with every row of the other (a Cartesian product): every possible pair of names appears.
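A minimal sketch of a cross join (note that how="cross" takes no key columns, and requires a reasonably recent pandas version):

```python
import pandas as pd

a = pd.DataFrame({"name": ["Ana", "Ben"]})
b = pd.DataFrame({"team": ["red", "blue"]})

# Every row of a paired with every row of b: 2 x 2 = 4 rows
pairs = pd.merge(a, b, how="cross")
print(pairs)
```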

Reshaping Data

Reshaping a DataFrame can have lots of benefits across data cleanup, analysis, and communication. Here are three different ways to structure the same data.
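A sketch of moving between two of those structures, "wide" and "long", using hypothetical scores per name per year:

```python
import pandas as pd

# Wide form: one row per name, one column per year (hypothetical data)
wide = pd.DataFrame({"name": ["Ana", "Ben"],
                     "2020": [1.0, 2.0],
                     "2021": [3.0, 4.0]})

# Wide -> long: melt stacks the year columns into key/value pairs,
# one row per (name, year) combination
long = wide.melt(id_vars="name", var_name="year", value_name="score")

# Long -> wide: pivot undoes the melt
back = long.pivot(index="name", columns="year", values="score").reset_index()
```

Which shape is "best" depends on the task: long form suits grouping and plotting, while wide form often reads better in a report table.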

"Big Data" and iteration in pandas

Pandas can read CSV files in smaller chunks to help deal with files that are too large to fit into RAM.

In the code below, setting chunksize and iterator=True generates a flow of 1000-row chunks from the main dataset. This isn't really necessary for our 6109-row dataset, but it might be critical when working with a 61-million-row dataset.
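A self-contained sketch of the chunked-reading pattern; an in-memory string stands in for a large file on disk, and the chunk size is shrunk so the behavior is visible on a tiny dataset:

```python
import io
import pandas as pd

# Stand-in for a large file on disk; in practice you would pass
# a filename and a much larger chunksize (e.g. 1000) to read_csv
csv_text = "x\n" + "\n".join(str(i) for i in range(10))

reader = pd.read_csv(io.StringIO(csv_text), chunksize=3, iterator=True)

total = 0
for chunk in reader:      # each chunk is a DataFrame of up to 3 rows
    total += len(chunk)   # summarize, filter, or append each piece here
print(total)
```

Each chunk is an ordinary DataFrame, so any per-chunk work (filtering, aggregating, appending to a running result) uses the usual pandas tools, while only one chunk sits in memory at a time.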

Alternative: The csv package

Python also comes with a built-in csv package for reading Comma-Separated Values (CSV) files. This can sometimes be easier to work with if you don't need the extra functionality of pandas or would prefer base Python objects to work with!

This package provides two major ways to read csv files: csv.reader and csv.DictReader.

The syntax for each command is similar:
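A minimal sketch of both readers; an in-memory string stands in for an open file object (both readers accept any file-like object):

```python
import csv
import io

# Stand-in for an open csv file
text = "name,age\nAna,34\nBen,29\n"

# csv.reader yields each row as a list of strings; the first row
# is the header itself
rows = list(csv.reader(io.StringIO(text)))

# csv.DictReader uses the header row as keys and yields each
# remaining row as a dictionary of strings
records = list(csv.DictReader(io.StringIO(text)))

print(rows[0])     # ['name', 'age']
print(records[0])  # {'name': 'Ana', 'age': '34'}
```

Note that both readers give back strings; unlike pandas, the csv package does not guess column types for you.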

Notice that each process reads the csv file row by row - this can easily be adapted with an if condition to filter out specific rows from a dataset that might be too large to open all at once.

Let's take a look at the differences between the output from each of these processes.

Read more about the csv package here: https://docs.python.org/3/library/csv.html

Learn more