Python: Session 2¶
University Libraries at the University of North Carolina at Chapel Hill
Today we will introduce:
- Pseudocode
- Exception handling
- User-defined functions
- Working with Tabular Data in Pandas
- Data visualization
- Other helpful packages
- How to continue learning
Pseudocode and Comments¶
Pseudocode¶
As you get started coding in Python, there will be many many tasks and steps you aren't familiar with! As you learn new functions and approaches, you'll become better and better at searching for help online and reviewing documentation. Learning to write and use pseudocode where appropriate can help organize your plan for any individual script.
Pseudocode is essentially a first draft of your code, written in English for human consumption, though with the tools of your programming language in mind. For example, let's say we have a list of words, and we want to find only the words with 3 or more vowels. Our pseudocode might look something like this:
1. Create a list of words.
2. Create a list of vowels.
3. Create an output list to store the words with 3 or more vowels.
4. Use a loop to iterate through our list of words.
4a. Use a counter to keep track of how many vowels are found in a word.
4b. Use a loop to iterate through the letters of each word.
* If the letter exists in the list of vowels, add 1 to the counter.
4c. If the counter is 3 or more, add the word to our output list.
This process can divide a complicated task into more digestible parts. You may not know how to complete the different steps yet, but you'll often have better luck finding existing help online on smaller tasks like these than with your overall goal or project.
Comments¶
Recall that Python ignores anything following a # as a comment. Comments are a vital part of your code, as they leave notes about how or why you're doing something. As you gain experience, you'll use comments in different ways.
Comments can also provide a link between pseudocode and real code. Once you've written your pseudocode, use comments to put the major steps into your code file itself. Then fill in the gaps with actual code as you figure it out.
Here's our possible pseudocode for the vowel counting task:
#1. Create a list of words.
#2. Create a list of vowels.
#3. Create an output list to store the words with 3 or more vowels.
#4. Use a loop to iterate through our list of words.
#4a. Use a counter to keep track of how many vowels are found in a word.
#4b. Use a loop to iterate through the letters of each word.
#If the letter exists in the list of vowels, add 1 to the counter.
#4c. If the counter is greater than or equal to 3, add the word and number of vowels to our output list.
And here's how it looks once we've added our code:
#1. Create a list of words.
my_words=["statement", "toy", "cars", "shoes", "ear", "busy",
          "magnificent", "brainy", "healthy", "narrow", "join",
          "decay", "dashing", "river", "gather", "stop", "satisfying",
          "holistic", "reply", "steady", "event", "house", "amused",
          "soak", "increase"]
#2. Create a list of vowels.
vowels=["a", "e", "i", "o", "u", "y"]
#3. Create an output list to store the words with 3 or more vowels.
output=[]
#4. Use a loop to iterate through our list of words.
for word in my_words:
    #4a. Use a counter to keep track of how many vowels are found in a word.
    count = 0
    #4b. Use a loop to iterate through the letters of each word.
    for char in word:
        #If the letter exists in the list of vowels, add 1 to the counter.
        if char in vowels:
            count = count + 1
    #4c. If the counter is greater than or equal to 3, add the word and number of vowels
    #to our output list.
    if count >= 3:
        output.append([word, count])
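Once the step-by-step version works, the same pseudocode can be folded into a small reusable function. The sketch below is one possible condensed form, not the workshop's own solution. The names count_vowels and sample_words are hypothetical, the word list is shortened for illustration, and y is treated as a vowel to match the list above.

```python
# Hypothetical helper: count how many characters of a word are vowels.
def count_vowels(word, vowels="aeiouy"):
    # sum() adds 1 for every character that appears in the vowels string
    return sum(1 for char in word if char in vowels)

# A shortened word list, used only for illustration
sample_words = ["statement", "toy", "magnificent", "ear"]

# Keep [word, count] pairs for words with 3 or more vowels
output = [[word, count_vowels(word)] for word in sample_words
          if count_vowels(word) >= 3]

print(output)
```

With this shortened list, the script prints [['statement', 3], ['magnificent', 4]].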
Exception Handling: Try / Except¶
Errors and warnings are very common while developing code, and they are an important part of the learning process. In some cases, they can also be useful in designing an algorithm. For example, suppose we have a stream of user-entered data that is supposed to contain each user's age in years. You might expect to get a few errors or nonsense entries.
user_ages=["34", "27", "54", "19", "giraffe", "15", "83", "61", "43", "91", "sixteen"]
It would be useful to convert these values to a numeric type to get the average age of our users, but we want to build something that can set non-numeric values aside. We can attempt to convert to numeric and give Python instructions for errors with a try-except statement:
ages = []
problems = []
for age in user_ages:
    try:
        a = int(age)
        ages.append(a)
    except ValueError:  # int() raises ValueError for text that is not a whole number
        problems.append(age)
print(ages)
print(problems)
[34, 27, 54, 19, 15, 83, 61, 43, 91]
['giraffe', 'sixteen']
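To decide which exception to name after except, it can help to look at what Python actually raises for a bad value. The sketch below is illustrative only (storing the exception name alongside each problem value is an addition, not part of the workshop code). It records the type of error raised by int() for each non-numeric entry, then uses the cleaned list to compute the average age.

```python
user_ages = ["34", "27", "54", "19", "giraffe", "15", "83", "61", "43", "91", "sixteen"]

ages = []
problems = []
for age in user_ages:
    try:
        ages.append(int(age))
    except ValueError as err:
        # err is the exception object; record its type name next to the bad value
        problems.append((age, type(err).__name__))

# Average of the values that converted cleanly
average_age = sum(ages) / len(ages)

print(problems)
print(round(average_age, 1))
```

Naming a specific exception keeps unrelated mistakes, such as a misspelled variable name, from being silently set aside.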
User-defined Functions¶
While Python (and its available packages) provide a wide variety of functions, sometimes it's useful to create your own. Python's syntax for defining a function is as follows:
def <function_name> ( <arguments> ):
    <code depending on arguments>
    return <value>
The mean function below returns the mean of a list of numbers. (Python's built-in functions do not include a mean, although the standard library's statistics module provides one.)
def mean(number_list):
    s = sum(number_list)
    n = len(number_list)
    m = s/n
    return m
numbers = [1,23,89,5,67,3,1]
print(mean(numbers))
27.0
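Functions can also guard against bad input. As a sketch under assumptions (safe_mean is a hypothetical name, not part of the workshop materials), the version below raises a clear error for an empty list instead of failing with a ZeroDivisionError, and its result is compared with statistics.mean from Python's standard library.

```python
import statistics

def safe_mean(number_list):
    # An empty list would cause division by zero, so stop early with a clear message
    if len(number_list) == 0:
        raise ValueError("cannot take the mean of an empty list")
    return sum(number_list) / len(number_list)

numbers = [1, 23, 89, 5, 67, 3, 1]
print(safe_mean(numbers))         # our own function
print(statistics.mean(numbers))   # the standard library's version, for comparison

try:
    safe_mean([])
except ValueError as err:
    print("Error:", err)
```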
Using Other Packages¶
So far, we've learned how to work with Python's base data structures like lists and dictionaries. However, much of the real-world data we encounter comes in the form of spreadsheets and tables - structures that include rows and columns. To work easily with these data types, we will need to use third party packages that extend Python's functionality. Pandas is one of those packages.
Whenever you need to install a package, use the Miniforge Prompt or a terminal window, NOT Python itself. On Windows, the Miniforge Prompt can be reached through the Start Menu folder for Miniforge3. On a Mac, open a terminal from Applications > Utilities > Terminal.
Packages known to conda can be installed with the conda install <package name> command in your Miniforge Prompt window. Otherwise, you may need to use a different package manager, for example pip install <package name>.
More information about managing packages in Python is available here.
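After installing, one way to confirm from inside Python that a package can be found is sketched below. This check is an addition rather than part of the original workshop, and the helper name is_installed is hypothetical. It relies on importlib.util.find_spec from the standard library.

```python
import importlib.util

def is_installed(package_name):
    # find_spec returns None when Python cannot locate a top-level package
    return importlib.util.find_spec(package_name) is not None

for name in ["pandas", "matplotlib", "seaborn"]:
    status = "installed" if is_installed(name) else "not found"
    print(f"{name}: {status}")
```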
Working With Tabular Data in Pandas¶
We will be working with a synthetic dataset that has been created for practicing Python. It doesn't include any real information, but is instead made up of fake circulation data about imaginary books, libraries and towns.
Download the csv file library_data.csv. I've stored my copy in the same folder as this Jupyter Notebook. Remember that Jupyter Notebooks automatically set your working directory to the folder where the .ipynb is saved. You'll have to save the document at least once to set your directory, but once there you can use relative paths.
pd.read_csv reads the tabular data from a Comma Separated Values (csv) file into a DataFrame object.
import pandas as pd
df = pd.read_csv("library_data.csv")
# Having trouble saving your file to the right location? Try uncommenting and running the line of code below.
#df = pd.read_csv("https://github.com/UNC-Libraries-data/Python/raw/main/Session2/library_data.csv")
Exploring a Data Frame¶
Attributes¶
A good first step in understanding our DataFrame is to examine some of its basic attributes. Attributes contain values that help us understand and use the dataframe.
Here we use the .shape attribute to determine how many rows and columns (in that order) are available. The .columns attribute provides the column names for the DataFrame.
df.shape
(5914, 7)
df.columns
Index(['id', 'library', 'town', 'town pop', 'title', 'genre', 'borrow_date'], dtype='object')
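The same attributes can be tried on a small dataframe built by hand. The sketch below uses made-up rows (the name small_df and its values are assumptions for illustration, not part of library_data.csv) and also shows .dtypes, which reports the data type of each column.

```python
import pandas as pd

# A tiny, made-up dataframe with the same kinds of columns
small_df = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C"],
    "genre": ["Mystery", "Fantasy", "Mystery"],
    "town pop": [75000, 45000, 150000],
})

print(small_df.shape)           # (rows, columns)
print(list(small_df.columns))   # column names as a plain list
print(small_df.dtypes)          # data type of each column
```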
Methods¶
Much of the functionality for working with dataframes comes in the form of methods. Methods are specialized functions that only work for a certain type of object - in this case, dataframes.
We can look at the first 5 rows in the dataset directly with the .head() method. Alternatively, we can get a random sample of rows using .sample(). Note that we supply the parameter n to specify how many rows we want to sample.
df.head()
|  | id | library | town | town pop | title | genre | borrow_date |
---|---|---|---|---|---|---|---|
0 | 491 | Heritage Heights Library | Hilltop Springs | 75000 | Secrets of the Old Manor | Mystery | 1/1/2020 |
1 | 3553 | Meadowbrook Commons Library | Meadowbrook | 45000 | The Enchanted Forest | Fantasy | 1/1/2020 |
2 | 3384 | Riverside Reading Room | Riverdale | 150000 | The Dragon's Crown | Fantasy | 1/2/2020 |
3 | 2267 | Heritage Heights Library | Hilltop Springs | 75000 | Letters to Forever | Romance | 1/2/2020 |
4 | 4298 | Lakeside Reading Center | Lake Haven | 15000 | Mathematics Made Simple | Non-Fiction | 1/2/2020 |
df.sample(n=5)
|  | id | library | town | town pop | title | genre | borrow_date |
---|---|---|---|---|---|---|---|
4531 | 1332 | Heritage Heights Library | Hilltop Springs | 75000 | Android's Promise | Science Fiction | 11/21/2023 |
3634 | 1943 | Riverside Reading Room | Riverdale | 150000 | The Bookshop Romance | Romance | 3/5/2023 |
4805 | 2481 | Heritage Heights Library | Hilltop Springs | 75000 | Echoes of Yesterday | Literary Fiction | 2/4/2024 |
2516 | 168 | Meadowbrook Commons Library | Meadowbrook | 45000 | The Vanishing at Midnight | Mystery | 3/8/2022 |
2767 | 2024 | Pine Valley Library | Pine Valley | 25000 | Summer Hearts | Romance | 5/21/2022 |
A full list of attributes and methods for DataFrames is available in the documentation.
Indexing¶
We'll often want to select certain rows or columns from a large dataframe. As with elements in a list, this can be accomplished using indexing. There are some limitations, however. For example, we can use numbers in square brackets to select certain rows, but doing so always returns all the columns in our dataset:
df[0:3]
|  | id | library | town | town pop | title | genre | borrow_date |
---|---|---|---|---|---|---|---|
0 | 491 | Heritage Heights Library | Hilltop Springs | 75000 | Secrets of the Old Manor | Mystery | 1/1/2020 |
1 | 3553 | Meadowbrook Commons Library | Meadowbrook | 45000 | The Enchanted Forest | Fantasy | 1/1/2020 |
2 | 3384 | Riverside Reading Room | Riverdale | 150000 | The Dragon's Crown | Fantasy | 1/2/2020 |
We can select rows for specific columns using the column names. If we want to select multiple columns, we must list them in their own nested set of square brackets.
df["title"][5:10]
5    Secrets of the Old Manor
6          The Dragon's Crown
7                Time Paradox
8       Murder on Pine Street
9           The Perfect Crime
Name: title, dtype: object
df[["title", "genre", "borrow_date"]][20:24]
|  | title | genre | borrow_date |
---|---|---|---|
20 | The Perfect Crime | Thriller | 1/6/2020 |
21 | Secrets of the Old Manor | Mystery | 1/6/2020 |
22 | The Silent Witness | Thriller | 1/7/2020 |
23 | Wizard's First Rule | Fantasy | 1/7/2020 |
Typing all those names out gets tiring after a while, though. What if we try to select a column by number instead? With plain square brackets, pandas treats a single number as a column label rather than a position, so this produces an error for our dataset. This is where the attributes .iloc and .loc become useful.
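As a minimal sketch of that error, using a small made-up dataframe rather than the workshop data (small_df is a hypothetical name): asking for column 0 with plain brackets raises a KeyError, because pandas looks for a column labeled 0 and finds none.

```python
import pandas as pd

small_df = pd.DataFrame({
    "title": ["Book A", "Book B"],
    "genre": ["Mystery", "Fantasy"],
})

try:
    small_df[0]                    # there is no column labeled 0
except KeyError as err:
    print("KeyError:", err)

# Selecting by position works with .iloc instead
first_column = small_df.iloc[:, 0]
print(first_column.tolist())
```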
.iloc¶
If we use the .iloc attribute before our brackets, pandas accepts two numbers separated by a comma. The first number is for rows and the second for columns. Because positions are counted from 0, the example below selects the third row (position 2) and the seventh column (position 6).
df.iloc[2,6]
'1/2/2020'
We can also use a colon to select multiple rows or columns at once. Note the examples below.
df.iloc[:,1] # All rows of column 1
0          Heritage Heights Library
1       Meadowbrook Commons Library
2            Riverside Reading Room
3          Heritage Heights Library
4           Lakeside Reading Center
                   ...
5909         Riverside Reading Room
5910       Heritage Heights Library
5911       Heritage Heights Library
5912    Meadowbrook Commons Library
5913       Heritage Heights Library
Name: library, Length: 5914, dtype: object
df.iloc[0:3,:] # Rows 0-2 of all columns
|  | id | library | town | town pop | title | genre | borrow_date |
---|---|---|---|---|---|---|---|
0 | 491 | Heritage Heights Library | Hilltop Springs | 75000 | Secrets of the Old Manor | Mystery | 1/1/2020 |
1 | 3553 | Meadowbrook Commons Library | Meadowbrook | 45000 | The Enchanted Forest | Fantasy | 1/1/2020 |
2 | 3384 | Riverside Reading Room | Riverdale | 150000 | The Dragon's Crown | Fantasy | 1/2/2020 |
df.iloc[120:126,1:4] # Rows 120-125 of columns 1-3
|  | library | town | town pop |
---|---|---|---|
120 | Riverside Reading Room | Riverdale | 150000 |
121 | Riverside Reading Room | Riverdale | 150000 |
122 | Riverside Reading Room | Riverdale | 150000 |
123 | Meadowbrook Commons Library | Meadowbrook | 45000 |
124 | Riverside Reading Room | Riverdale | 150000 |
125 | Heritage Heights Library | Hilltop Springs | 75000 |
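The .loc attribute mentioned earlier works the same way but uses labels instead of positions. The sketch below is an illustration on made-up data (small_df is hypothetical). One difference worth noting: a label slice with .loc includes its end label, while a position slice with .iloc stops before its end position.

```python
import pandas as pd

small_df = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C", "Book D"],
    "genre": ["Mystery", "Fantasy", "Romance", "Mystery"],
    "town": ["Riverdale", "Meadowbrook", "Riverdale", "Pine Valley"],
})

# Rows labeled 0 through 2 (inclusive) and two columns chosen by name
by_label = small_df.loc[0:2, ["title", "genre"]]

# Rows at positions 0 and 1 (position 2 is excluded) and the first two columns
by_position = small_df.iloc[0:2, 0:2]

print(by_label)
print(by_position)
```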
Series¶
We can think of a DataFrame as a collection of rows and columns where each row represents an "observation" and each column contains a specific type of information collected about each observation. In Pandas, our columns are stored as Series objects. A DataFrame, like a dictionary, can be thought of as a named collection of objects, but in this case, the objects are Series.
Series have their own set of attributes and methods just like DataFrames. One of the most useful methods for categorical variables is .value_counts() which provides a frequency table.
# How many books have been borrowed at each library?
df.library.value_counts()
library
Heritage Heights Library       1864
Riverside Reading Room         1853
Meadowbrook Commons Library    1239
Pine Valley Library             649
Lakeside Reading Center         309
Name: count, dtype: int64
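value_counts() can also report proportions instead of raw counts through its normalize parameter. A short sketch on made-up data (the libraries Series below is an assumption for illustration):

```python
import pandas as pd

libraries = pd.Series(["Heritage", "Riverside", "Heritage", "Pine Valley"], name="library")

counts = libraries.value_counts()                 # raw frequencies
shares = libraries.value_counts(normalize=True)   # fractions that sum to 1

print(counts)
print(shares)
```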
Filtering¶
To filter our dataset based on a logical condition (true or false), we place one set of square brackets inside another. Note the example below.
- The inner statement, df["borrow_date"]=="6/23/2024", looks up the borrow_date column and checks whether each value equals "6/23/2024".
- The outer statement, df[ ... ], uses the resulting column of True/False values to select rows.
- When combined, these two commands return all of the data contained in rows where the value of the borrow_date field is equal to "6/23/2024".
# What books were borrowed on 6/23/2024?
df[df["borrow_date"] == "6/23/2024"]
|  | id | library | town | town pop | title | genre | borrow_date |
---|---|---|---|---|---|---|---|
5283 | 1271 | Riverside Reading Room | Riverdale | 150000 | The Last Space Colony | Science Fiction | 6/23/2024 |
5284 | 2906 | Heritage Heights Library | Hilltop Springs | 75000 | The Victorian Secret | Historical Fiction | 6/23/2024 |
5285 | 4864 | Heritage Heights Library | Hilltop Springs | 75000 | The Inventor's Life | Biography | 6/23/2024 |
5286 | 4551 | Heritage Heights Library | Hilltop Springs | 75000 | Life of Einstein | Biography | 6/23/2024 |
5287 | 3051 | Meadowbrook Commons Library | Meadowbrook | 45000 | Ancient Promises | Historical Fiction | 6/23/2024 |
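Conditions can be combined with & (and) or | (or), as long as each condition is wrapped in its own parentheses. The sketch below uses a small made-up dataframe (small_df and its rows are assumptions), and .isin() is shown as a shortcut for matching any value in a list.

```python
import pandas as pd

small_df = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C", "Book D"],
    "genre": ["Mystery", "Fantasy", "Mystery", "Romance"],
    "town": ["Riverdale", "Riverdale", "Meadowbrook", "Riverdale"],
})

# Rows where BOTH conditions are true
riverdale_mysteries = small_df[(small_df["genre"] == "Mystery") & (small_df["town"] == "Riverdale")]

# Rows whose genre matches any value in the list
mystery_or_romance = small_df[small_df["genre"].isin(["Mystery", "Romance"])]

print(riverdale_mysteries)
print(mystery_or_romance)
```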
Pandas Training on LinkedIn Learning¶
- Get free access to LinkedIn Learning through UNC
- Data Analysis with Python and Pandas
- Getting Started with Pandas and Advanced Pandas
Data Visualization¶
# Count how often each book has been borrowed
top5 = df.title.value_counts()
# Get only the top 5
top5 = top5.head()
top5
title
Secrets of the Old Manor     392
The Bookshop Romance         305
Summer Hearts                283
The Detective's Last Case    252
Mathematics Made Simple      231
Name: count, dtype: int64
# load matplotlib
import matplotlib.pyplot as plt
# Create horizontal bar chart
plt.barh(y = top5.index, width = top5.values) # set the x and y axis
plt.gca().invert_yaxis() # display bars in descending order
plt.xlabel("Times Borrowed") # label the x axis
plt.ylabel("Title") # label the y axis
plt.title("Top 5 Most Borrowed Books") # create a chart title
Text(0.5, 1.0, 'Top 5 Most Borrowed Books')
Faceted bar charts using Seaborn¶
What are the top 5 most borrowed books in each library?
# Group the df by library
top5bylib = df.groupby("library")
# Count the number of times each book is borrowed in each library. This gives us a series.
top5bylib = top5bylib.title.value_counts()
# Now that we have a series, we need to group it by library again and get the top 5 in each group
top5bylib = top5bylib.groupby("library").nlargest(5)
# Turn the series into a dataframe to make it easier to work with in Seaborn
top5bylib = top5bylib.reset_index(level=0, drop=True).to_frame().reset_index()
top5bylib.head(15)
|  | library | title | count |
---|---|---|---|
0 | Heritage Heights Library | The Detective's Last Case | 102 |
1 | Heritage Heights Library | The Silent Witness | 80 |
2 | Heritage Heights Library | Secrets of the Old Manor | 79 |
3 | Heritage Heights Library | Murder on Pine Street | 76 |
4 | Heritage Heights Library | The Missing Manuscript | 76 |
5 | Lakeside Reading Center | Secrets of the Old Manor | 68 |
6 | Lakeside Reading Center | The Bookshop Romance | 55 |
7 | Lakeside Reading Center | Philosophy Today | 42 |
8 | Lakeside Reading Center | The Last Space Colony | 40 |
9 | Lakeside Reading Center | Life of Einstein | 39 |
10 | Meadowbrook Commons Library | The Vanishing at Midnight | 82 |
11 | Meadowbrook Commons Library | Deadly Deadline | 80 |
12 | Meadowbrook Commons Library | Secrets of the Old Manor | 74 |
13 | Meadowbrook Commons Library | The Victorian Secret | 73 |
14 | Meadowbrook Commons Library | The Silent Witness | 69 |
# load seaborn
import seaborn as sns
# create a "grid" object using the FacetGrid function.
grid = sns.FacetGrid(data = top5bylib, col = "library", col_wrap = 3, hue = "library", sharey = False, aspect = 1.5)
# specify which chart we want to use on the grid and supply the variables for the x and y axis.
fig = grid.map_dataframe(sns.barplot, y = "title", x = "count")
Interactive Line Chart Using Bokeh¶
What are the borrowing trends for Secrets of the Old Manor over time?
# Use a filter to select the title we want to focus on
somtrend = df[df["title"] == "Secrets of the Old Manor"]
# Count the number of times the book is borrowed on each date. This gives us a series.
somtrend = somtrend.borrow_date.value_counts()
# Turn the series into a dataframe
somtrend = somtrend.to_frame().reset_index()
# Give borrow_date a datetime format
somtrend["borrow_date"] = pd.to_datetime(somtrend["borrow_date"])
# Group and sum by month for a smoother line. This gives us a series again.
somtrend = somtrend.groupby(somtrend["borrow_date"].dt.to_period("M"))["count"].sum()
# Turn the series into a dataframe again
somtrend = somtrend.to_frame().reset_index()
# Sort dataframe in chronological order for Bokeh to display it correctly
somtrend = somtrend.sort_values(by="borrow_date")
somtrend.head()
|  | borrow_date | count |
---|---|---|
0 | 2020-01 | 10 |
1 | 2020-02 | 9 |
2 | 2020-03 | 8 |
3 | 2020-04 | 6 |
4 | 2020-05 | 10 |
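The date handling in the middle of that chain can be tried in isolation. The sketch below uses a few made-up dates (the dates Series is an assumption) to show pd.to_datetime turning text into real dates and .dt.to_period("M") collapsing each date to its month. The format argument is optional here; it is included to make the expected month/day/year layout explicit.

```python
import pandas as pd

dates = pd.Series(["1/1/2020", "1/15/2020", "2/3/2020"])

# Convert text such as "1/15/2020" into datetime values
parsed = pd.to_datetime(dates, format="%m/%d/%Y")

# Reduce each date to its year and month
months = parsed.dt.to_period("M")

# Count how many dates fall in each month, in chronological order
per_month = months.value_counts().sort_index()
print(per_month)
```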
# load bokeh modules
from bokeh.plotting import figure, show, output_notebook
# set up bokeh for working in jupyter notebooks
output_notebook()
# set up the size of our plot and format the x axis for dates
p = figure(height = 300, width = 600, x_axis_type = "datetime")
# add a line to our plot
p.line(source = somtrend, x = "borrow_date", y = "count", width = 2)
# show the plot
show(p)
Other Helpful Libraries¶
Data Packages
- NumPy for numerical computation in Python
- scikit-learn for data analysis and machine learning
- Polars for dataframes designed for large-scale data processing performance
- DuckDB for creating a SQL database
Other Utilities
- Beautiful Soup for parsing HTML etc
- NLTK for text analysis
- Pillow for Images
- JobLib or Multiprocessing for running parallel/concurrent jobs
Numpy¶
Numpy provides the mathematical functionality (e.g. large arrays, linear algebra, random numbers, etc.) for many popular statistical and machine learning tasks in Python. It is a dependency for many of the other packages we have discussed and will discuss, including pandas. One of the foundational objects in numpy is the array:
import numpy as np
import pandas as pd
a_list = [[1,2],[3,4],[5,6],[7,8]] #list of ROWS
an_array = np.array(a_list, ndmin = 2)
an_array
array([[1, 2],
       [3, 4],
       [5, 6],
       [7, 8]])
We can use numpy to do many numerical tasks, for example creating random values:
np.random.rand(2,2)
array([[0.37438578, 0.47633254],
       [0.41351988, 0.08877861]])
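Arrays also support arithmetic on every element at once, which plain lists do not. A brief sketch, reusing the same values as the array above (the specific operations are chosen only for illustration):

```python
import numpy as np

an_array = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])

print(an_array.shape)          # (4, 2): four rows, two columns
print(an_array * 10)           # multiply every element by 10
print(an_array.mean(axis=0))   # mean of each column
print(an_array[:, 1].sum())    # sum of the second column
```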
scikit-learn¶
scikit-learn provides a consolidated interface for machine learning in Python:
- functions for splitting data into training and testing components
- cross validation for model tuning
- supervised and unsupervised modeling
- model fit assessment and comparison
Read more about using sklearn.
The following example comes from scikit-learn's Linear Regression Example page:
from sklearn import linear_model, datasets
import matplotlib.pyplot as plt
# Load the diabetes dataset
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
# Use only one feature
diabetes_X = diabetes_X[:, np.newaxis, 2]
# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
# Split the targets into training/testing sets
diabetes_y_train = diabetes_y[:-20]
diabetes_y_test = diabetes_y[-20:]
# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)
# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test, color="black")
plt.plot(diabetes_X_test, diabetes_y_pred, color="blue", linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
Polars (dataframes for large-scale data processing)¶
The Polars library offers an alternative to Pandas dataframes that is often much faster and uses less RAM for dataframe operations. Polars is written in Rust, while Pandas is built largely on NumPy, and for many dataframe workloads the Polars design has lower overhead. Polars can also run operations in parallel across processor cores, which adds to its performance advantage on larger datasets.
import sys
import time
import polars as pl
import pandas as pd
# how long does it take Polars to load in a CSV?
start_pl = time.time()
df_pl = pl.read_csv("library_data.csv")
end_pl = time.time()
print(f'Seconds for Polars to load in the library_data CSV: {end_pl-start_pl}')
# how long does it take Pandas to load in a CSV?
start_pd = time.time()
df_pd= pd.read_csv("library_data.csv")
end_pd = time.time()
print(f'Seconds for Pandas to load in the library_data CSV: {end_pd-start_pd}')
# how much faster is Polars?
print(f'Polars is {round((end_pd-start_pd)/(end_pl-start_pl), 2)}x faster')
Seconds for Polars to load in the library_data CSV: 0.016093730926513672
Seconds for Pandas to load in the library_data CSV: 0.021875381469726562
Polars is 1.36x faster
# compare the memory size of the Polars and Pandas dataframes
print(f'The Polars dataframe takes up {sys.getsizeof(df_pl)} bytes.')
print(f'The Pandas dataframe takes up {sys.getsizeof(df_pd)} bytes.')
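The Polars dataframe takes up 48 bytes.
The Pandas dataframe takes up 1972339 bytes.
Note that these two numbers are not directly comparable. sys.getsizeof reports the size of the Python object itself. For Pandas that figure includes the underlying data, but the Polars dataframe is a thin Python wrapper around data held in Rust-managed memory, so 48 bytes reflects the wrapper alone. The Polars estimated_size() method gives an estimate of the data's actual footprint.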
DuckDB (for creating a SQL Database)¶
DuckDB is an in-process SQL database engine that is easy to set up from Python. It has no external dependencies and is designed for analytical queries, where it is often faster than PostgreSQL, MySQL, or SQLite.
import duckdb
duckdb.read_csv("library_data.csv")
duckdb.sql("SELECT id, library, town, title FROM 'library_data.csv' WHERE town = 'Riverdale' LIMIT 10")
┌───────┬────────────────────────┬───────────┬──────────────────────────┐
│  id   │        library         │   town    │          title           │
│ int64 │        varchar         │  varchar  │         varchar          │
├───────┼────────────────────────┼───────────┼──────────────────────────┤
│  3384 │ Riverside Reading Room │ Riverdale │ The Dragon's Crown       │
│  1449 │ Riverside Reading Room │ Riverdale │ Time Paradox             │
│   921 │ Riverside Reading Room │ Riverdale │ Murder on Pine Street    │
│   931 │ Riverside Reading Room │ Riverdale │ Murder on Pine Street    │
│  4122 │ Riverside Reading Room │ Riverdale │ Philosophy Today         │
│   886 │ Riverside Reading Room │ Riverdale │ Murder on Pine Street    │
│  2404 │ Riverside Reading Room │ Riverdale │ The Great American Novel │
│  2488 │ Riverside Reading Room │ Riverdale │ Echoes of Yesterday      │
│  5680 │ Riverside Reading Room │ Riverdale │ The Perfect Crime        │
│  5767 │ Riverside Reading Room │ Riverdale │ Night Watch              │
├───────┴────────────────────────┴───────────┴──────────────────────────┤
│ 10 rows                                                     4 columns │
└───────────────────────────────────────────────────────────────────────┘
BeautifulSoup (for parsing HTML or XML data)¶
Python's built-in urllib.request package makes it relatively easy to download the underlying html from a web page. Note that the from <package> import <function> notation used here allows you to selectively import only parts of a package as needed.
Be sure to check the terms of service for any website before scraping! We're scraping our own materials here to be safe!
from urllib.request import urlopen
from bs4 import BeautifulSoup
# Scrape Python 1 materials!
page = urlopen("https://unc-libraries-data.github.io/Python/Intro/Introduction.html")
html = page.read()
# Parse the HTML
soup = BeautifulSoup(html,"html.parser")
[x.text for x in soup.find_all("h2")] # find all second-level headers
['Why Python?¶', 'Getting Started¶', 'Data Types and Variables¶', 'Flow Control¶', 'More Data Types¶', 'Review¶', 'Pseudocode and Comments¶', 'User-defined Functions¶', 'Coming up¶', 'References and Resources¶']
NLTK (for text analysis)¶
The Natural Language Toolkit (nltk) provides a wide array of tools for processing and analyzing text. This includes operations like splitting text into sentences or words ("tokenization"), tagging them with their part of speech, classification, and more.
import nltk
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
text = "This is an example sentence. This sentence is a simple example."
words = word_tokenize(text) # tokenize the text
fdist = FreqDist(words) # count how often each word occurs
# Show the frequencies
plt.figure()
plt.barh(fdist.keys(), fdist.values())
plt.xticks(range(1,3))
plt.xlabel('Frequency')
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\tuesday\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
Text(0.5, 0, 'Frequency')
PIL (Pillow)¶
Pillow is the updated version of the old Python Imaging Library (PIL), which provides fundamental tools for working with images. Pillow can work with many common formats (some of which may require extra packages or other dependencies) to automate a wide variety of image transformations.
Note: While pillow is the name you use to install the package, you import it under the name PIL (for example, from PIL import Image).
Both display and imshow can be used to preview the image. Currently, display returns some errors due to the image's modes, so we're using imshow to avoid these errors. Read more about modes.
from PIL import Image
from urllib.request import urlretrieve
from matplotlib.pyplot import imshow
from IPython.display import display
# download the unc logo
urlretrieve("https://identity.unc.edu/wp-content/uploads/sites/885/2019/01/UNC_logo_webblue-e1517942350314.png",
"UNC_logo.png")
# open and display the image
UNC = Image.open("UNC_logo.png")
imshow(UNC)
<matplotlib.image.AxesImage at 0x13ffeb64f80>
# resize the image and make it grayscale
# note: // divides two numbers and rounds the result down to get an integer
UNC_gray = UNC.convert('LA').resize((UNC.width//2,UNC.height//2))
imshow(UNC_gray)
<matplotlib.image.AxesImage at 0x13ffd3c1e20>
Parallel Processing with joblib¶
As you move into more complicated processes in Python (or applying code to a wide variety of objects or files), processing time can become a major factor. Fortunately, most modern laptops have multiple processor cores that can do separate things at the same time. Python only uses one core by default. If you have a set of loops that don't depend on each other (e.g. processing lots of files one after another), you could split up your many loop iterations between processors to greatly increase speed.
The joblib package provides a straightforward way to split loops up between your computer's cores for faster performance on complicated code. Note that parallelization may not benefit you much, or may even slow things down, for very quick jobs because setting up and consolidating information from separate cores creates overhead costs.
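As a minimal sketch of the idea, assuming joblib is installed: Parallel and delayed are joblib's tools for spreading independent function calls across cores. The function square and the choice of n_jobs=2 are illustrative assumptions; a real task would do much more work per call.

```python
from joblib import Parallel, delayed

def square(x):
    # Stand-in for a longer task, such as processing one file
    return x * x

# Ordinary loop: one call after another on a single core
serial_results = [square(x) for x in range(8)]

# Same calls split across 2 workers; results come back in input order
parallel_results = Parallel(n_jobs=2)(delayed(square)(x) for x in range(8))

print(serial_results == parallel_results)
```

For a job this small the parallel version will likely be slower than the plain loop, for the overhead reasons noted above.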
How to continue learning¶
- Exercises and solutions for practicing Python
- Exercises and solutions for practicing Pandas
- Practice Python
- Python Projects - Beginner to Advanced
- Python Projects You Can Build
- Automate the Boring Stuff with Python
- Python Programming for the Humanities
- Python Data Science Handbook This free ebook emphasizes Numpy, Scipy, Matplotlib, Pandas and other data analysis packages in Python, assuming some familiarity with the basic principles of the language.
- Are you used to working in R? Check out this Data manipulation R-Python conversion guide and The Struggles I Had When Switching from R to Python