Python: Session 2¶

University Libraries at the University of North Carolina at Chapel Hill

Today we will introduce:

  • Pseudocode
  • Exception handling
  • User-defined functions
  • Working with Tabular Data in Pandas
  • Data visualization
  • Other helpful packages
  • How to continue learning

Pseudocode and Comments¶

Pseudocode¶

As you get started coding in Python, there will be many, many tasks and steps you aren't familiar with! As you learn new functions and approaches, you'll get better and better at searching for help online and reviewing documentation. Learning to write and use pseudocode where appropriate can help organize your plan for any individual script.

Pseudocode is essentially a first draft of your code, written in English for human consumption, though with the tools of your programming language in mind. For example, let's say we have a list of words, and we want to find only the words with 3 or more vowels. Our pseudocode might look something like this:

1. Create a list of words.
2. Create a list of vowels.
3. Create an output list to store the words with 3 or more vowels.
4. Use a loop to iterate through our list of words.
    4a. Use a counter to keep track of how many vowels are found in a word.
    4b. Use a loop to iterate through the letters of each word.
        * If the letter exists in the list of vowels, add 1 to the counter.
    4c. If the counter finds 3 or more vowels, add the word to our output list.
    

This process can divide a complicated task into more digestible parts. You may not know how to complete the different steps yet, but you'll often have better luck finding help online for smaller tasks like these than for your overall goal or project.

Comments¶

Recall that Python ignores anything following a # as a comment. Comments are a vital part of your code, as they leave notes about how or why you're doing something. As you gain experience, you'll use comments in different ways.

Comments can also provide a link between pseudocode and real code. Once you've written your pseudocode, use comments to put the major steps into your code file itself. Then fill in the gaps with actual code as you figure it out.

Here's our pseudocode for the vowel counting task, entered as comments:

In [9]:
#1. Create a list of words.

#2. Create a list of vowels.

#3. Create an output list to store the words with 3 or more vowels.

#4. Use a loop to iterate through our list of words.

    #4a. Use a counter to keep track of how many vowels are found in a word.
    
    #4b. Use a loop to iterate through the letters of each word.
        
        #If the letter exists in the list of vowels, add 1 to the counter.
            
    #4c. If the counter is greater than or equal to 3, add the word and number of vowels to our output list.

And here's how it looks once we've added our code:

In [40]:
#1. Create a list of words.
my_words=["statement", "toy", "cars", "shoes", "ear", "busy", 
              "magnificent", "brainy", "healthy", "narrow", "join", 
              "decay", "dashing", "river", "gather", "stop", "satisfying", 
              "holistic", "reply", "steady", "event", "house", "amused", 
              "soak", "increase"]

#2. Create a list of vowels.
vowels=["a", "e", "i", "o", "u", "y"]

#3. Create an output list to store the words with 3 or more vowels.
output=[]

#4. Use a loop to iterate through our list of words.
for word in my_words:
    
    #4a. Use a counter to keep track of how many vowels are found in a word.
    count = 0
    
    #4b. Use a loop to iterate through the letters of each word.
    for char in word:
        
        #If the letter exists in the list of vowels, add 1 to the counter.
        if char in vowels:
            count = count + 1
            
    #4c. If the counter finds 3 or more vowels, add the word and number of vowels
    #to our output list.
    if count >= 3:
        output.append([word, count])

Exception Handling: Try / Except¶

Errors and warnings are very common while developing code, and they're an important part of the learning process. In some cases, they can also be useful in designing an algorithm. For example, suppose we have a stream of user-entered data that is supposed to contain each user's age in years. You might expect to get a few errors or nonsense entries.

In [62]:
user_ages=["34", "27", "54", "19", "giraffe", "15", "83", "61", "43", "91", "sixteen"]

It would be useful to convert these values to a numeric type so we can get the average age of our users, but we want to build something that sets non-numeric values aside. We can attempt the conversion and give Python instructions for handling errors with a try-except statement:

In [63]:
ages = []
problems = []

for age in user_ages:
    try:
        a = int(age)
        ages.append(a)
    except ValueError: # int() raises a ValueError for non-numeric strings
        problems.append(age)
        
print(ages)
print(problems)
[34, 27, 54, 19, 15, 83, 61, 43, 91]
['giraffe', 'sixteen']
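
With the problem entries set aside, a quick follow-up (a sketch reusing the ages list built above) computes the average age:

avg = sum(ages) / len(ages) # total of the valid ages divided by their count
print(round(avg, 2)) # 47.44 for the ages above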

User-defined Functions¶

While Python (and its available packages) provide a wide variety of functions, sometimes it's useful to create your own. Python's syntax for defining a function is as follows:

def <function_name> ( <arguments> ):
    <code depending on arguments>
    return <value>
        

The mean function below returns the mean of a list of numbers. (Python has no built-in function for the mean, though the standard library's statistics module provides one.)

In [80]:
def mean(number_list):
    s = sum(number_list)
    n = len(number_list)
    m = s/n
    return m

numbers = [1,23,89,5,67,3,1]
print(mean(numbers))
27.0

Using Other Packages¶

So far, we've learned how to work with Python's base data structures like lists and dictionaries. However, much of the real-world data we encounter comes in the form of spreadsheets and tables - structures that include rows and columns. To work easily with these data types, we will need third-party packages that extend Python's functionality. Pandas is one of those packages.

Whenever you need to install a package, you need to use the Miniforge Prompt or a terminal window, NOT Python itself. On Windows, open the Miniforge Prompt from the Miniforge3 folder in the Start Menu; on Mac, open a terminal from Applications > Utilities > Terminal.

Installing packages known to conda can be done with the conda install <package name> command in your Miniforge Prompt window. Otherwise, you may need to use a different package manager, e.g. pip install <package name>.

More information about managing packages in Python is available here.

Working With Tabular Data in Pandas¶

We will be working with a synthetic dataset created for practicing in Python. It doesn't include any real information; instead it is made up of fake circulation data about imaginary books, libraries, and towns.

Download the csv file library_data.csv. I've stored my copy in the same folder as this Jupyter Notebook. Remember that Jupyter Notebooks automatically set your working directory to the folder where the .ipynb is saved. You'll have to save the document at least once to set your directory, but once there you can use relative paths.

pd.read_csv reads the tabular data from a Comma Separated Values (csv) file into a DataFrame object.

In [3]:
import pandas as pd

df = pd.read_csv("library_data.csv")

# Having trouble saving your file to the right location? Try uncommenting and running the line of code below.
#df = pd.read_csv("https://github.com/UNC-Libraries-data/Python/raw/main/Session2/library_data.csv")

Exploring a Data Frame¶

Attributes¶

A good first step in understanding our DataFrame is to examine some of its basic attributes. Attributes contain values that help us understand and use the dataframe.

Here we use the .shape attribute to determine how many rows and columns (in that order) are available. .columns provides the column names for the DataFrame.

In [11]:
df.shape
Out[11]:
(5914, 7)
In [15]:
df.columns
Out[15]:
Index(['id', 'library', 'town', 'town pop', 'title', 'genre', 'borrow_date'], dtype='object')

Methods¶

Much of the functionality for working with dataframes comes in the form of methods. Methods are specialized functions that only work for a certain type of object - in this case, dataframes.

We can look at the first 5 rows in the dataset directly with the .head() method. Alternatively, we can get a random sample of rows using .sample(). Note that we supply the parameter n to specify how many rows we want to sample.

In [17]:
df.head()
Out[17]:
id library town town pop title genre borrow_date
0 491 Heritage Heights Library Hilltop Springs 75000 Secrets of the Old Manor Mystery 1/1/2020
1 3553 Meadowbrook Commons Library Meadowbrook 45000 The Enchanted Forest Fantasy 1/1/2020
2 3384 Riverside Reading Room Riverdale 150000 The Dragon's Crown Fantasy 1/2/2020
3 2267 Heritage Heights Library Hilltop Springs 75000 Letters to Forever Romance 1/2/2020
4 4298 Lakeside Reading Center Lake Haven 15000 Mathematics Made Simple Non-Fiction 1/2/2020
In [18]:
df.sample(n=5)
Out[18]:
id library town town pop title genre borrow_date
4531 1332 Heritage Heights Library Hilltop Springs 75000 Android's Promise Science Fiction 11/21/2023
3634 1943 Riverside Reading Room Riverdale 150000 The Bookshop Romance Romance 3/5/2023
4805 2481 Heritage Heights Library Hilltop Springs 75000 Echoes of Yesterday Literary Fiction 2/4/2024
2516 168 Meadowbrook Commons Library Meadowbrook 45000 The Vanishing at Midnight Mystery 3/8/2022
2767 2024 Pine Valley Library Pine Valley 25000 Summer Hearts Romance 5/21/2022

A full list of attributes and methods for DataFrames is available in the documentation.

Indexing¶

We'll often want to select certain rows or columns from a large dataframe. As with elements in a list, this can be accomplished using indexing. There are some limitations, however. For example, we can use numbers in square brackets to select certain rows, but doing so always returns all the columns in our dataset:

In [21]:
df[0:3]
Out[21]:
id library town town pop title genre borrow_date
0 491 Heritage Heights Library Hilltop Springs 75000 Secrets of the Old Manor Mystery 1/1/2020
1 3553 Meadowbrook Commons Library Meadowbrook 45000 The Enchanted Forest Fantasy 1/1/2020
2 3384 Riverside Reading Room Riverdale 150000 The Dragon's Crown Fantasy 1/2/2020

We can select rows for specific columns using the column names. If we want to select multiple columns, we must list them in their own nested set of square brackets.

In [23]:
df["title"][5:10]
Out[23]:
5    Secrets of the Old Manor
6          The Dragon's Crown
7                Time Paradox
8       Murder on Pine Street
9           The Perfect Crime
Name: title, dtype: object
In [26]:
df[["title", "genre", "borrow_date"]][20:24]
Out[26]:
title genre borrow_date
20 The Perfect Crime Thriller 1/6/2020
21 Secrets of the Old Manor Mystery 1/6/2020
22 The Silent Witness Thriller 1/7/2020
23 Wizard's First Rule Fantasy 1/7/2020

Typing all those names out gets tiring after a while, though. What if we try to select a column by number instead? As the sketch below shows, that produces an error. This is where the attributes .iloc and .loc become useful.
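
A hypothetical cell reconstructing the attempt (the original cell isn't shown):

df[2] # KeyError: plain brackets treat a single number as a column label, not a position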

.iloc¶

If we use the .iloc attribute before our brackets, pandas accepts two numbers separated by a comma. The first number is for rows and the second for columns. Below, we select the row at position 2 and the column at position 6 (remember that Python counts from 0).

In [34]:
df.iloc[2,6]
Out[34]:
'1/2/2020'

We can also use a colon to select multiple rows or columns at once. Note the examples below.

In [40]:
df.iloc[:,1] # All rows of column 1
Out[40]:
0          Heritage Heights Library
1       Meadowbrook Commons Library
2            Riverside Reading Room
3          Heritage Heights Library
4           Lakeside Reading Center
                   ...             
5909         Riverside Reading Room
5910       Heritage Heights Library
5911       Heritage Heights Library
5912    Meadowbrook Commons Library
5913       Heritage Heights Library
Name: library, Length: 5914, dtype: object
In [36]:
df.iloc[0:3,:] # Rows 0-2 of all columns
Out[36]:
id library town town pop title genre borrow_date
0 491 Heritage Heights Library Hilltop Springs 75000 Secrets of the Old Manor Mystery 1/1/2020
1 3553 Meadowbrook Commons Library Meadowbrook 45000 The Enchanted Forest Fantasy 1/1/2020
2 3384 Riverside Reading Room Riverdale 150000 The Dragon's Crown Fantasy 1/2/2020
In [39]:
df.iloc[120:126,1:4] # Rows 120-125 of columns 1-3
Out[39]:
library town town pop
120 Riverside Reading Room Riverdale 150000
121 Riverside Reading Room Riverdale 150000
122 Riverside Reading Room Riverdale 150000
123 Meadowbrook Commons Library Meadowbrook 45000
124 Riverside Reading Room Riverdale 150000
125 Heritage Heights Library Hilltop Springs 75000
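
.loc¶

The .loc attribute works the same way but selects by label instead of position. A minimal sketch (note that, unlike .iloc, a .loc slice includes its endpoint):

df.loc[0:2, ["library", "town"]] # rows labeled 0 through 2 of two named columns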

Series¶

We can think of a DataFrame as a collection of rows and columns, where each row represents an "observation" and each column contains a specific type of information collected about each observation. In Pandas, our columns are stored as Series objects. A DataFrame, like a dictionary, can be thought of as a named collection of objects, but in this case, the objects are Series.

Series have their own set of attributes and methods, just like DataFrames. One of the most useful methods for categorical variables is .value_counts(), which provides a frequency table.

In [44]:
# How many books have been borrowed at each library?
df.library.value_counts()
Out[44]:
library
Heritage Heights Library       1864
Riverside Reading Room         1853
Meadowbrook Commons Library    1239
Pine Valley Library             649
Lakeside Reading Center         309
Name: count, dtype: int64

Filtering¶

To filter our dataset based on a logical condition (true or false), we will use nested square brackets. Note the example below.

  • The inner statement, df["borrow_date"]=="6/23/2024", selects the borrow_date column and checks whether each value equals "6/23/2024"
  • The outer statement, df[ ... ], uses the resulting Series of True/False values to select rows
  • Combined, these two commands return all of the data in rows where the value of the borrow_date field equals "6/23/2024"
In [48]:
# What books were borrowed on 6/23/2024?
df[df["borrow_date"] == "6/23/2024"]
Out[48]:
id library town town pop title genre borrow_date
5283 1271 Riverside Reading Room Riverdale 150000 The Last Space Colony Science Fiction 6/23/2024
5284 2906 Heritage Heights Library Hilltop Springs 75000 The Victorian Secret Historical Fiction 6/23/2024
5285 4864 Heritage Heights Library Hilltop Springs 75000 The Inventor's Life Biography 6/23/2024
5286 4551 Heritage Heights Library Hilltop Springs 75000 Life of Einstein Biography 6/23/2024
5287 3051 Meadowbrook Commons Library Meadowbrook 45000 Ancient Promises Historical Fiction 6/23/2024

Pandas Training on LinkedIn Learning¶

  • Get free access to LinkedIn Learning through UNC
  • Data Analysis with Python and Pandas
  • Getting Started with Pandas and Advanced Pandas

Data Visualization¶

There are many libraries for data visualization within Python. We'll introduce three of them to show some examples of data visualization in Python while answering questions about our library dataset.

Simple bar chart using Matplotlib¶

What are the top 5 most borrowed books?

In [72]:
# Count how often each book has been borrowed
top5 = df.title.value_counts()

# Get only the top 5
top5 = top5.head()

top5
Out[72]:
title
Secrets of the Old Manor     392
The Bookshop Romance         305
Summer Hearts                283
The Detective's Last Case    252
Mathematics Made Simple      231
Name: count, dtype: int64
In [73]:
# load matplotlib
import matplotlib.pyplot as plt

# Create horizontal bar chart
plt.barh(y = top5.index, width = top5.values) # set the x and y axis
plt.gca().invert_yaxis() # display bars in descending order
plt.xlabel("Times Borrowed") # label the x axis
plt.ylabel("Title") # label the y axis
plt.title("Top 5 Most Borrowed Books") # create a chart title
Out[73]:
Text(0.5, 1.0, 'Top 5 Most Borrowed Books')
[Output: horizontal bar chart titled "Top 5 Most Borrowed Books"]

Faceted bar charts using Seaborn¶

What are the top 5 most borrowed books in each library?

In [42]:
# Group the df by library
top5bylib = df.groupby("library")

# Count the number of times each book is borrowed in each library. This gives us a series.
top5bylib = top5bylib.title.value_counts()

# Now that we have a series, we need to group it by library again and get the top 5 in each group
top5bylib = top5bylib.groupby("library").nlargest(5)

# Turn the series into a dataframe to make it easier to work with in Seaborn
top5bylib = top5bylib.reset_index(level=0, drop=True).to_frame().reset_index()

top5bylib.head(15)
Out[42]:
library title count
0 Heritage Heights Library The Detective's Last Case 102
1 Heritage Heights Library The Silent Witness 80
2 Heritage Heights Library Secrets of the Old Manor 79
3 Heritage Heights Library Murder on Pine Street 76
4 Heritage Heights Library The Missing Manuscript 76
5 Lakeside Reading Center Secrets of the Old Manor 68
6 Lakeside Reading Center The Bookshop Romance 55
7 Lakeside Reading Center Philosophy Today 42
8 Lakeside Reading Center The Last Space Colony 40
9 Lakeside Reading Center Life of Einstein 39
10 Meadowbrook Commons Library The Vanishing at Midnight 82
11 Meadowbrook Commons Library Deadly Deadline 80
12 Meadowbrook Commons Library Secrets of the Old Manor 74
13 Meadowbrook Commons Library The Victorian Secret 73
14 Meadowbrook Commons Library The Silent Witness 69
In [101]:
# load seaborn
import seaborn as sns

# create a "grid" object using the FacetGrid function. 
grid = sns.FacetGrid(data = top5bylib, col = "library", col_wrap = 3, hue = "library", sharey = False, aspect = 1.5)

# specify which chart we want to use on the grid and supply the variables for the x and y axis.
fig = grid.map_dataframe(sns.barplot, y = "title", x = "count")
[Output: faceted horizontal bar charts of the top 5 most borrowed books at each library]

Interactive Line Chart Using Bokeh¶

What are the borrowing trends for Secrets of the Old Manor over time?

In [40]:
# Use a filter to select the title we want to focus on
somtrend = df[df["title"] == "Secrets of the Old Manor"]

# Count the number of times the book is borrowed on each date. This gives us a series.
somtrend = somtrend.borrow_date.value_counts()

# Turn the series into a dataframe
somtrend = somtrend.to_frame().reset_index()

# Give borrow_date a datetime format
somtrend["borrow_date"] = pd.to_datetime(somtrend["borrow_date"])

# Group and sum by month for a smoother line. This gives us a series again.
somtrend = somtrend.groupby(somtrend["borrow_date"].dt.to_period("M"))["count"].sum()

# Turn the series into a dataframe again
somtrend = somtrend.to_frame().reset_index()

# Sort dataframe in chronological order for Bokeh to display it correctly
somtrend = somtrend.sort_values(by="borrow_date")

somtrend.head()
Out[40]:
borrow_date count
0 2020-01 10
1 2020-02 9
2 2020-03 8
3 2020-04 6
4 2020-05 10
In [43]:
# load bokeh modules
from bokeh.plotting import figure, show, output_notebook

# set up bokeh for working in jupyter notebooks
output_notebook()

# set up the size of our plot and format the x axis for dates
p = figure(height = 300, width = 600, x_axis_type = "datetime")

# add a line to our plot
p.line(source = somtrend, x = "borrow_date", y = "count", line_width = 2)

# show the plot
show(p)
[Output: BokehJS loads, followed by an interactive line chart of monthly borrow counts]

Other Helpful Libraries¶

Data Packages

  • NumPy for numerical computation in Python
  • scikit-learn for data analysis and machine learning
  • Polars for high-performance dataframes designed for large-scale data processing
  • DuckDB for creating a SQL database

Other Utilities

  • Beautiful Soup for parsing HTML and XML
  • NLTK for text analysis
  • Pillow for images
  • joblib or multiprocessing for running parallel/concurrent jobs

Numpy¶

Numpy provides the mathematical functionality (e.g. large arrays, linear algebra, random numbers, etc.) behind many popular statistical and machine learning tasks in Python. It is a dependency for many of the other packages we have discussed and will discuss, including pandas. One of the foundational objects in numpy is the array:

In [46]:
import numpy as np
import pandas as pd

a_list = [[1,2],[3,4],[5,6],[7,8]] #list of ROWS
an_array = np.array(a_list, ndmin = 2)

an_array
Out[46]:
array([[1, 2],
       [3, 4],
       [5, 6],
       [7, 8]])

We can use numpy to do many numerical tasks, for example creating random values:

In [47]:
np.random.rand(2,2)
Out[47]:
array([[0.37438578, 0.47633254],
       [0.41351988, 0.08877861]])

scikit-learn¶

scikit-learn provides a consolidated interface for machine learning in Python:

  • functions for splitting data into training and testing components
  • cross validation for model tuning
  • supervised and unsupervised modeling
  • model fit assessment and comparison

Read more about using sklearn.

The following example comes from scikit-learn's Linear Regression Example page.

In [49]:
import numpy as np
from sklearn import linear_model, datasets
import matplotlib.pyplot as plt

# Load the diabetes dataset
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)

# Use only one feature
diabetes_X = diabetes_X[:, np.newaxis, 2]

# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Split the targets into training/testing sets
diabetes_y_train = diabetes_y[:-20]
diabetes_y_test = diabetes_y[-20:]

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)

# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test, color="black")
plt.plot(diabetes_X_test, diabetes_y_pred, color="blue", linewidth=3)

plt.xticks(())
plt.yticks(())

plt.show()
[Output: scatter plot of the test data with the fitted regression line]

Polars (dataframes for large-scale data processing)¶

The Polars library offers an alternative to Pandas dataframes that often performs much faster and uses less RAM for dataframe operations. Polars is implemented in Rust, while Pandas is built on NumPy, which is generally slower and uses more memory for these workloads. Polars can also efficiently run operations in parallel, adding to its performance advantage.

In [50]:
import sys
import time
import polars as pl
import pandas as pd

# how long does it take Polars to load in a CSV?
start_pl = time.time()
df_pl = pl.read_csv("library_data.csv")
end_pl = time.time()
print(f'Seconds for Polars to load in the library_data CSV: {end_pl-start_pl}')

# how long does it take Pandas to load in a CSV?
start_pd = time.time()
df_pd= pd.read_csv("library_data.csv")
end_pd = time.time()
print(f'Seconds for Pandas to load in the library_data CSV: {end_pd-start_pd}')

# how much faster is Polars?
print(f'Polars is {round((end_pd-start_pd)/(end_pl-start_pl), 2)}x faster')
Seconds for Polars to load in the library_data CSV: 0.016093730926513672
Seconds for Pandas to load in the library_data CSV: 0.021875381469726562
Polars is 1.36x faster
In [51]:
# compare the memory size of the Polars and Pandas dataframes
# (caveat: sys.getsizeof only counts the Python wrapper object; Polars keeps
# its data in Rust-managed memory, so its true footprint is larger than shown)
print(f'The Polars dataframe takes up {sys.getsizeof(df_pl)} bytes.')
print(f'The Pandas dataframe takes up {sys.getsizeof(df_pd)} bytes.')
The Polars dataframe takes up 48 bytes.
The Pandas dataframe takes up 1972339 bytes.

DuckDB (for creating a SQL Database)¶

DuckDB is a great library for setting up a SQL database with Python. It has no external dependencies and is very memory efficient, making it a fast alternative to PostgreSQL, MySQL, or SQLite.

In [57]:
import duckdb

duckdb.read_csv("library_data.csv")
duckdb.sql("SELECT id, library, town, title FROM 'library_data.csv' WHERE town = 'Riverdale' LIMIT 10")
Out[57]:
┌───────┬────────────────────────┬───────────┬──────────────────────────┐
│  id   │        library         │   town    │          title           │
│ int64 │        varchar         │  varchar  │         varchar          │
├───────┼────────────────────────┼───────────┼──────────────────────────┤
│  3384 │ Riverside Reading Room │ Riverdale │ The Dragon's Crown       │
│  1449 │ Riverside Reading Room │ Riverdale │ Time Paradox             │
│   921 │ Riverside Reading Room │ Riverdale │ Murder on Pine Street    │
│   931 │ Riverside Reading Room │ Riverdale │ Murder on Pine Street    │
│  4122 │ Riverside Reading Room │ Riverdale │ Philosophy Today         │
│   886 │ Riverside Reading Room │ Riverdale │ Murder on Pine Street    │
│  2404 │ Riverside Reading Room │ Riverdale │ The Great American Novel │
│  2488 │ Riverside Reading Room │ Riverdale │ Echoes of Yesterday      │
│  5680 │ Riverside Reading Room │ Riverdale │ The Perfect Crime        │
│  5767 │ Riverside Reading Room │ Riverdale │ Night Watch              │
├───────┴────────────────────────┴───────────┴──────────────────────────┤
│ 10 rows                                                     4 columns │
└───────────────────────────────────────────────────────────────────────┘

BeautifulSoup (for parsing HTML or XML data)¶

Python's built-in urllib.request package makes it relatively easy to download the underlying HTML from a web page. Note that the from <package> import <function> notation used here allows you to selectively import only the parts of a package you need.

Be sure to check the terms of service for any website before scraping! We're scraping our own materials here to be safe!

In [58]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Scrape Python 1 materials!
page = urlopen("https://unc-libraries-data.github.io/Python/Intro/Introduction.html")
html = page.read()

# Parse the HTML
soup = BeautifulSoup(html,"html.parser")
[x.text for x in soup.find_all("h2")] # find all second-level headers
Out[58]:
['Why Python?¶',
 'Getting Started¶',
 'Data Types and Variables¶',
 'Flow Control¶',
 'More Data Types¶',
 'Review¶',
 'Pseudocode and Comments¶',
 'User-defined Functions¶',
 'Coming up¶',
 'References and Resources¶']

NLTK (for text analysis)¶

The Natural Language Toolkit (nltk) provides a wide array of tools for processing and analyzing text. This includes operations like splitting text into sentences or words ("tokenization"), tagging them with their part of speech, classification, and more.

In [78]:
import nltk
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
import matplotlib.pyplot as plt

text = "This is an example sentence. This sentence is a simple example." 

words = word_tokenize(text) # tokenize the text
fdist = FreqDist(words) # count how often each word occurs

# Show the frequencies
plt.figure()
plt.barh(fdist.keys(), fdist.values())
plt.xticks(range(1,3))
plt.xlabel('Frequency')
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\tuesday\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
Out[78]:
Text(0.5, 0, 'Frequency')
[Output: horizontal bar chart of word frequencies in the example text]

PIL (Pillow)¶

Pillow is the updated version of the old Python Imaging Library (PIL), which provides fundamental tools for working with images. Pillow can work with many common formats (some of which may require extra packages or other dependencies) to automate a wide variety of image transformations.

Note: While pillow is the name you use to install the package, you import its functions with import PIL.

Both display and imshow can be used to preview the image. Currently, display raises some errors due to the image's modes, so we're using imshow to avoid them. Read more about modes.

In [66]:
from PIL import Image
from urllib.request import urlretrieve
from matplotlib.pyplot import imshow
from IPython.display import display

# download the unc logo
urlretrieve("https://identity.unc.edu/wp-content/uploads/sites/885/2019/01/UNC_logo_webblue-e1517942350314.png",
           "UNC_logo.png")

# open and display the image
UNC = Image.open("UNC_logo.png")
imshow(UNC)
Out[66]:
<matplotlib.image.AxesImage at 0x13ffeb64f80>
[Output: the UNC logo]
In [67]:
# resize the image and make it grayscale
# note: // divides two numbers and rounds the result down to get an integer
UNC_gray = UNC.convert('LA').resize((UNC.width//2,UNC.height//2))
imshow(UNC_gray)
Out[67]:
<matplotlib.image.AxesImage at 0x13ffd3c1e20>
[Output: the UNC logo, grayscale and at half size]

Parallel Processing with joblib¶

As you move into more complicated processes in Python (or apply your code to many objects or files), processing time can become a major factor. Fortunately, most modern laptops have multiple processor cores that can do separate things at the same time, but Python only uses one core by default. If you have a set of loop iterations that don't depend on each other (e.g. processing lots of files one after another), you can split them up between cores to greatly increase speed.

The joblib package provides a straightforward way to split loops up between your computer's cores for faster performance on complicated code. Note that parallelization may not benefit you much, and may even slow you down, for very quick jobs, because setting up and consolidating information from separate cores creates overhead costs.
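
A minimal sketch of the pattern, where slow_square is a made-up stand-in for your own per-item work:

from joblib import Parallel, delayed
import time

def slow_square(x):
    time.sleep(1) # stand-in for an expensive per-item task
    return x * x

# run 8 independent iterations across 4 cores instead of one after another
results = Parallel(n_jobs=4)(delayed(slow_square)(i) for i in range(8))
print(results) # [0, 1, 4, 9, 16, 25, 36, 49]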

How to continue learning¶

  • Exercises and solutions for practicing Python
  • Exercises and solutions for practicing Pandas
  • Practice Python
  • Python Projects - Beginner to Advanced
  • Python Projects You Can Build
  • Automate the Boring Stuff with Python
  • Python Programming for the Humanities
  • Python Data Science Handbook This free ebook emphasizes Numpy, Scipy, Matplotlib, Pandas and other data analysis packages in Python, assuming some familiarity with the basic principles of the language.
    • Whirlwind Tour of Python
  • Are you used to working in R? Check out this Data manipulation R-Python conversion guide and The Struggles I Had When Switching from R to Python