Useful Packages¶
Some of these packages may NOT be included in your installation. Whenever you need to install a package, use the Miniforge Prompt or a terminal window, NOT Python itself. On Windows, open the Miniforge Prompt from the Miniforge3 folder in the Start Menu; on Mac, open a terminal from Applications > Utilities > Terminal.
Packages known to conda can be installed with the conda install <package name> command in your Miniforge Prompt window. Otherwise you may need to use a different manager, like pip install <package name>.
More information about managing packages in Python is available here.
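Before installing anything, it can help to check from inside Python whether a package is already available. The following is a minimal sketch using only the standard library; the helper name is_installed is our own and not part of any package.

```python
from importlib import util

def is_installed(module_name):
    """Return True if Python can find a module with this import name."""
    return util.find_spec(module_name) is not None

for name in ["numpy", "polars", "duckdb"]:
    if is_installed(name):
        print(f"{name} is installed")
    else:
        # Installation itself still happens in the Miniforge Prompt or Terminal
        print(f"{name} is missing: try conda install {name} or pip install {name}")
```

Note that the import name and the install name sometimes differ (for example, Beautiful Soup is imported as bs4), so a "missing" result is a hint to check the package's documentation rather than a final answer.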
Data Packages
- NumPy for numerical computation in Python
- scikit-learn for data analysis and machine learning
- Polars for dataframes designed for large-scale data processing performance
- DuckDB for creating a SQL database
Other Utilities
- Beautiful Soup for parsing HTML etc
- NLTK for text analysis
- Pillow for Images
- JobLib or Multiprocessing for running parallel/concurrent jobs
Conda envs¶
conda provides the option to create separate Python environments inside your installation. All of the code below should be run in the Miniforge Prompt or Terminal:
(PC) Start Menu > Miniforge3 > Miniforge Prompt
(Mac) Finder > Applications > Utilities > Terminal
conda create --name myenv python=3.5
creates an environment called myenv with Python version 3.5 instead of your main installation version.
conda activate myenv
makes this environment active. From here you can install packages and open software (e.g. spyder will open Spyder after installation).
conda deactivate
deactivates the active environment and returns to your base environment.
Conda environments are a great place to test out code, or run code that has very specific requirements. It's generally a good idea to be careful about vastly changing your environment (e.g. upgrading to a new version of Python), because it can break your project code! Environments provide a great way to test before making the change in your main environment.
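If you are ever unsure which environment a Python session is using, you can ask Python itself. This is a small sketch using the standard library; the exact folder it prints depends on your own installation.

```python
import sys

# sys.prefix is the folder of the environment this Python is running from;
# inside an activated conda env it typically ends with the env name (e.g. "myenv")
environment_folder = sys.prefix
python_version = sys.version.split()[0]

print("Environment folder:", environment_folder)
print("Python version:", python_version)
```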
Numpy¶
Numpy provides the mathematical functionality (e.g. large arrays, linear algebra, random numbers, etc.) for many popular statistical and machine learning tasks in Python. It is a dependency for many of the packages we discuss below, including pandas. One of the foundational objects in numpy is the array:
import numpy as np
import pandas as pd
a_list = [[1,2],[3,4]] #list of ROWS
an_array = np.array(a_list, ndmin = 2)
a_dataframe = pd.DataFrame(a_list)
a_list
[[1, 2], [3, 4]]
an_array
array([[1, 2], [3, 4]])
a_dataframe
|   | 0 | 1 |
|---|---|---|
| 0 | 1 | 2 |
| 1 | 3 | 4 |
However, arrays in numpy are constrained to a single data type, unlike lists or DataFrames.
import numpy as np
import pandas as pd
a_list = [[1,"cat"],[3,"dog"]] #list of ROWS
an_array = np.array(a_list, ndmin = 2)
a_dataframe = pd.DataFrame(a_list)
pd.DataFrame(a_list).dtypes
0     int64
1    object
dtype: object
pd.DataFrame(an_array).dtypes
0    object
1    object
dtype: object
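To see the single-data-type rule directly, we can inspect the array's dtype. This is a short sketch; the variable names mixed and numbers are our own.

```python
import numpy as np

# Mixing integers and strings forces numpy to pick one type for everything
mixed = np.array([[1, "cat"], [3, "dog"]])
print(mixed.dtype.kind)    # "U" means a unicode string dtype
print(repr(mixed[0, 0]))   # the integer 1 has been converted to a string

# An all-integer list keeps an integer dtype
numbers = np.array([[1, 2], [3, 4]])
print(numbers.dtype.kind)  # "i" means a signed integer dtype
```

Because the integers were converted to strings when the array was created, both columns show up as non-numeric once the array is passed to pandas.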
We can use numpy to do many numerical tasks, for example creating random values or matrices/DataFrames:
np.random.rand(2,2)
array([[0.25377815, 0.99224832], [0.15266403, 0.31636709]])
np.zeros((3,4))
array([[0., 0., 0., 0.], [0., 0., 0., 0.], [0., 0., 0., 0.]])
np.ones((4,3,2))
array([[[1., 1.], [1., 1.], [1., 1.]], [[1., 1.], [1., 1.], [1., 1.]], [[1., 1.], [1., 1.], [1., 1.]], [[1., 1.], [1., 1.], [1., 1.]]])
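Random values change every time the code runs, which can make results hard to reproduce. One common approach, sketched here, is to create a random number generator with a fixed seed; the seed value 42 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=42)   # the seed value 42 is arbitrary
first = rng.random((2, 2))             # 2x2 array of values in [0, 1)

# A second generator with the same seed produces the same values
rng_again = np.random.default_rng(seed=42)
second = rng_again.random((2, 2))

print(np.array_equal(first, second))
```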
Arrays make many mathematical operations easier than base Python. For example if we want to add a single value to every element of a list, we could try:
[1,2]+3
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[11], line 1
----> 1 [1,2]+3

TypeError: can only concatenate list (not "int") to list
To accomplish this in base Python, we instead need to use a comprehension (maybe even with an if statement if the data types vary!):
base_list = [1,2]
[k+3 for k in base_list]
[4, 5]
base_list = [1,2,"three"]
[k+3 for k in base_list if type(k)==int] #checking for "int" works here because every numeric value in the list is an integer
[k+3 for k in base_list if str(k).isnumeric()] #.isnumeric() is only available for str objects, so each value is converted with str() first
[4, 5]
With numpy arrays, we can use:
np.array([1,2])+3
array([4, 5])
or
arr = np.array([1,2])
arr += 3
print(arr)
[4 5]
Since pandas DataFrames are built on numpy arrays, the same approach works for them:
pd.DataFrame([1,2])+3
|   | 0 |
|---|---|
| 0 | 4 |
| 1 | 5 |
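The same idea extends beyond adding a single value. As a brief sketch, here are a few other operations that apply to whole arrays at once:

```python
import numpy as np

arr = np.array([[1, 2], [3, 4]])

doubled = arr * 2                 # multiply every element by 2
elementwise = arr * arr           # multiply matching elements together
matrix_product = arr @ arr        # matrix multiplication (linear algebra)
column_means = arr.mean(axis=0)   # average down each column

print(doubled)
print(elementwise)
print(matrix_product)
print(column_means)
```

Note the difference between * (element by element) and @ (matrix multiplication), which give different results for the same inputs.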
SciPy adds an array of mathematical and statistical functions that work with numpy objects.
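As one small illustration of SciPy working with numpy objects, the sketch below solves a pair of linear equations. The equations themselves are made up for the example.

```python
import numpy as np
from scipy import linalg

# Solve the system:  2x + y = 3  and  x + 3y = 5
A = np.array([[2.0, 1.0], [1.0, 3.0]])   # coefficients, one row per equation
b = np.array([3.0, 5.0])                 # right-hand sides

solution = linalg.solve(A, b)            # numpy array holding x and y
print(solution)
```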
Pandas and Data Visualization Packages¶
See our dedicated lesson on Pandas (linked above).
scikit-learn¶
scikit-learn provides a consolidated interface for machine learning in Python:
- functions for splitting data into training and testing components
- cross validation for model tuning
- supervised and unsupervised modeling
- model fit assessment and comparison
Read more about using sklearn. Digging into the application of machine learning is beyond the scope of our workshop series.
The following example comes from scikit-learn's Linear Regression Example page:
# !conda install scikit-learn
from sklearn import linear_model, datasets
import matplotlib.pyplot as plt
import numpy as np # needed for np.newaxis below
# Load the diabetes dataset
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
# Use only one feature
diabetes_X = diabetes_X[:, np.newaxis, 2]
# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
# Split the targets into training/testing sets
diabetes_y_train = diabetes_y[:-20]
diabetes_y_test = diabetes_y[-20:]
# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)
# The coefficients
print("Coefficients: \n", regr.coef_)
Coefficients: [938.23786125]
# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test, color="black")
plt.plot(diabetes_X_test, diabetes_y_pred, color="blue", linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
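The example above splits the data by hand. As a hedged sketch of the splitting and cross validation tools listed earlier, the same model can be set up with train_test_split and cross_val_score; the test_size and random_state values here are arbitrary choices of ours, not part of the original example.

```python
from sklearn import datasets, linear_model
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = datasets.load_diabetes(return_X_y=True)
X = X[:, [2]]  # keep the same single feature, as a 2D column

# Randomly hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = linear_model.LinearRegression()
model.fit(X_train, y_train)
print("Test R^2:", r2_score(y_test, model.predict(X_test)))

# 5-fold cross validation: fit and score the model on 5 different splits
cv_scores = cross_val_score(linear_model.LinearRegression(), X, y, cv=5)
print("Cross-validated R^2:", cv_scores.mean())
```

A random split with a fixed random_state is usually preferable to taking the last 20 rows, since it does not depend on the order of the data.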
Polars (dataframes for large-scale data processing)¶
The Polars library offers an alternative to Pandas dataframes that often runs much faster and uses less RAM for dataframe operations. Polars is written in Rust and stores data in the Apache Arrow columnar format, while Pandas has traditionally been built on NumPy arrays. Polars can also run many operations in parallel across your processor cores, which adds to its performance.
# !pip install polars
import sys
import time
import polars as pl
import pandas as pd
# how long does it take Polars to load in a CSV?
start_pl = time.time()
df_pl = pl.read_csv("protest_data.csv")
end_pl = time.time()
print(f'Seconds for Polars to load in the protest data CSV: {end_pl-start_pl}')
# how long does it take Pandas to load in a CSV?
start_pd = time.time()
df_pd= pd.read_csv("protest_data.csv")
end_pd = time.time()
print(f'Seconds for Pandas to load in the protest data CSV: {end_pd-start_pd}')
# how much faster is Polars?
print(f'Polars is {round((end_pd-start_pd)/(end_pl-start_pl), 2)}x faster')
Seconds for Polars to load in the protest data CSV: 0.08365845680236816
Seconds for Pandas to load in the protest data CSV: 0.12293887138366699
Polars is 1.47x faster
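A single timing like the one above can vary quite a bit from run to run. One way to get a steadier number, sketched here with only the standard library, is to repeat the load several times and keep the fastest run. The helper best_time and the throwaway sample file are our own inventions for illustration.

```python
import csv
import os
import tempfile
import time

def best_time(load, repeats=5):
    """Run load() several times and return the fastest run in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()   # a clock intended for measuring durations
        load()
        timings.append(time.perf_counter() - start)
    return min(timings)

# Write a small throwaway CSV so the sketch runs without protest_data.csv
tmp_path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(tmp_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "value"])
    writer.writerows((i, i * 2) for i in range(10_000))

def load_with_csv_module():
    """Read the whole file into a list of rows."""
    with open(tmp_path, newline="") as f:
        return list(csv.reader(f))

seconds = best_time(load_with_csv_module)
print(f"Fastest of 5 runs: {seconds:.4f} seconds")
```

To compare the two libraries this way, you could pass lambda: pl.read_csv("protest_data.csv") and lambda: pd.read_csv("protest_data.csv") to best_time in turn.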
# compare the memory size of the Polars and Pandas dataframes
print(f'The Polars dataframe takes up {sys.getsizeof(df_pl)} bytes.')
print(f'The Pandas dataframe takes up {sys.getsizeof(df_pd)} bytes.')
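The Polars dataframe takes up 48 bytes.
The Pandas dataframe takes up 28690152 bytes.
Note that these two numbers are not directly comparable: for the Polars dataframe, sys.getsizeof only measures the small Python wrapper object, not the data it holds, while for Pandas it reports the full in-memory size of the data. df_pl.estimated_size() gives an estimate of the memory used by the Polars data, and df_pd.memory_usage(deep=True).sum() gives the equivalent for Pandas.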
DuckDB (for creating a SQL Database)¶
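DuckDB is a great library for running SQL on your data from Python. It is an in-process database with no external dependencies and no separate server to set up, and it is designed for fast analytical queries. For data analysis tasks, this often makes it a quicker option than PostgreSQL, MySQL, or SQLite.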
#!pip install duckdb
import duckdb
duckdb.read_csv("protest_data.csv")
duckdb.sql("SELECT * FROM 'protest_data.csv' WHERE YEAR > 2010 LIMIT 10")
10 rows, 31 columns. Selected columns are shown here; the remaining columns (including the long sources and notes text) are omitted for width.

| id | country | year | region | protest | protestnumber | location | participants | protesteridentity | protesterdemand1 | stateresponse1 |
|---|---|---|---|---|---|---|---|---|---|---|
| 202011001 | Canada | 2011 | North America | 1 | 1 | Montreal, Quebec | 300 | quebec separatists | political behavior, process | ignore |
| 202012001 | Canada | 2012 | North America | 1 | 1 | Quebec | 1000s | university students | price increases, tax policy | crowd dispersal |
| 202013000 | Canada | 2013 | North America | 0 | 0 | NULL | NULL | NULL | NULL | NULL |
| 202014000 | Canada | 2014 | North America | 0 | 0 | NULL | NULL | NULL | NULL | NULL |
| 202015001 | Canada | 2015 | North America | 1 | 1 | Whitby | 50+ | protesters | political behavior, process | ignore |
| 202016001 | Canada | 2016 | North America | 1 | 1 | Parliament Hill, Ottowa; national | 250-300 | taxi drivers against uber | labor wage dispute | ignore |
| 202016002 | Canada | 2016 | North America | 1 | 2 | Montreal Pierre Elliot Trudeau International Airport; national | hundreds | taxi drivers against uber | labor wage dispute | ignore |
| 202016003 | Canada | 2016 | North America | 1 | 3 | Toronto, Ontario | more than 200 | canadian union of public employees | labor wage dispute | ignore |
| 202016004 | Canada | 2016 | North America | 1 | 4 | Toronto, Ontario | hundreds | black lives matter | police brutality | crowd dispersal |
| 202016005 | Canada | 2016 | North America | 1 | 5 | Toronto, Ontario | dozens | women | political behavior, process | arrests |
BeautifulSoup (for parsing HTML or XML data)¶
Python's built-in urllib.request package makes it relatively easy to download the underlying html from a web page. Note that the from <package> import <function> notation used here allows you to selectively import only parts of a package as needed.
Be sure to check the terms of service for any website before scraping! We're scraping our own materials here to be safe!
from urllib.request import urlopen
from bs4 import BeautifulSoup
page = urlopen("https://unc-libraries-data.github.io/Python/Intro/Introduction.html") #The Python 1 materials!
html = page.read()
print(html[:300]) #print only the first 300 bytes (page.read() returns bytes, hence the b'...' below)
b'<!DOCTYPE html>\n<html>\n<head><meta charset="utf-8" />\n\n<title>Introduction</title>\n\n<script src="https://cdnjs.cloudflare.com/ajax/libs/require.js/2.1.10/require.min.js"></script>\n<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/2.0.3/jquery.min.js"></script>\n\n\n\n<style type="text/css">\n '
soup=BeautifulSoup(html,"html.parser")
[x.text for x in soup.find_all("h2")] # find all h2 (second-level headers)
['Why Python?¶', 'Getting Started¶', 'Data Types and Variables¶', 'Flow Control¶', 'More Data Types¶', 'Review¶', 'Pseudocode and Comments¶', 'User-defined Functions¶', 'Coming up¶', 'References and Resources¶']
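Beautiful Soup can pull out more than header text. As a sketch, the example below parses a small made-up page (so it runs without a network connection) and collects both the headers and the link addresses; the example.com URLs are placeholders.

```python
from bs4 import BeautifulSoup

# A small made-up page so the sketch runs without a network connection
html = """
<html><body>
  <h2>Getting Started</h2>
  <p>Read the <a href="https://example.com/intro">introduction</a> first.</p>
  <h2>Flow Control</h2>
  <p>Then see the <a href="https://example.com/loops">loops lesson</a>.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Text of every second-level header
headers = [h.get_text(strip=True) for h in soup.find_all("h2")]

# Link text paired with the address stored in each tag's href attribute
links = [(a.get_text(), a.get("href")) for a in soup.find_all("a")]

print(headers)
print(links)
```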
APIs¶
APIs (Application Programming Interfaces) provide a structured way to request data over the internet. APIs are generally a better option than web scraping because:
- they provide structured data instead of often inconsistent HTML or other formats you'll have to dig through to find relevant information
- this often has the added benefit of not taxing a website as much as web scraping!
- they provide a clearer path to retrieving information with permission
An API call is just a specific type of web address that you can use Python to help you generate (or cycle through many options):
For example: https://api.weather.gov/points/35.9132,-79.0558
This link pulls up the National Weather Service information for a particular lat-long pair (for Chapel Hill). The forecast field leads us to a new link:
https://api.weather.gov/gridpoints/LWX/96,70/forecast
We can use Python to request and parse the content from these links, but often we can find a wrapper someone else has created to do some of that work for us!
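Without a wrapper, a request like this could be sketched with the standard library alone. The helper names points_url and fetch_json are our own, and the User-Agent text is a placeholder you should replace with your own contact information; the live requests are left commented out so the code runs offline.

```python
import json
from urllib.request import Request, urlopen

def points_url(lat, lon):
    """Build the api.weather.gov "points" address for a lat-long pair."""
    return f"https://api.weather.gov/points/{lat},{lon}"

def fetch_json(url, user_agent="my-workshop-script (me@example.com)"):
    """Request a URL and parse the JSON response into Python dictionaries."""
    # The User-Agent header identifies who is making the request
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request) as response:
        return json.load(response)

url = points_url(35.9132, -79.0558)
print(url)

# Uncomment to make the live requests (needs an internet connection):
# point = fetch_json(url)
# forecast = fetch_json(point["properties"]["forecast"])
```

Generating the address with a function makes it easy to cycle through many lat-long pairs in a loop.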
Remember that we can install packages in the Miniforge Prompt or Terminal:
- (PC) Start Menu > Miniforge3 > Miniforge Prompt
- (Mac) Finder > Applications > Utilities > Terminal
Then run the following to install the package:
pip install noaa_sdk
# !pip install noaa_sdk
from noaa_sdk import NOAA
from time import sleep
n = NOAA()
forecast = n.points_forecast(35.9132,-79.0558, type='forecastGridData')
sleep(5) #pause for 5 seconds to prevent repeatedly using the API
This provides us with a pretty complicated set of nested dictionaries that we can parse to find specific values:
rain_chance = forecast["properties"]["probabilityOfPrecipitation"]["values"]
pd.DataFrame(rain_chance).plot.line(x="validTime",y="value", figsize=(20,3))
<Axes: xlabel='validTime'>
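To make the nesting easier to follow, here is a sketch that walks through a trimmed-down response offline. The sample dictionary is hypothetical: it copies the key names used above, and we are assuming that each validTime holds a start time followed by "/" and a duration.

```python
import pandas as pd

# Hypothetical, trimmed-down response with the same nesting as above
sample = {
    "properties": {
        "probabilityOfPrecipitation": {
            "values": [
                {"validTime": "2024-01-01T06:00:00+00:00/PT1H", "value": 10},
                {"validTime": "2024-01-01T07:00:00+00:00/PT1H", "value": 25},
                {"validTime": "2024-01-01T08:00:00+00:00/PT2H", "value": 40},
            ]
        }
    }
}

# Each [...] steps one level deeper into the nested dictionaries
values = sample["properties"]["probabilityOfPrecipitation"]["values"]
rain = pd.DataFrame(values)

# validTime holds "start/duration"; keep the start and parse it as a timestamp
rain["start"] = pd.to_datetime(rain["validTime"].str.split("/").str[0])
print(rain[["start", "value"]])
```

Parsing the start time into a real timestamp lets plots and filters treat the column as dates rather than plain text.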
NLTK (text analysis)¶
The Natural Language Toolkit (nltk) provides a wide array of tools for processing and analyzing text. This includes operations like splitting text into sentences or words ("tokenization"), tagging them with their part of speech, classification, and more.
Let's take the example sentence: "The quick brown fox jumps over the lazy dog." and convert it into individual words.
import nltk
#the code below is necessary for word_tokenize and parts of speech to work
# nltk.download("punkt")
# nltk.download('averaged_perceptron_tagger')
sentence = "The quick brown fox jumps over the lazy dog."
words = nltk.word_tokenize(sentence)
print(words)
['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
Now we can extract the part of speech for each word in the sentence. Note that this function, like many of the functions in NLTK, uses machine learning to classify each word and therefore may have some level of error!
nltk.pos_tag(words)
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('.', '.')]
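Once text is tagged, ordinary Python tools can summarize it. As a sketch, the tagged output above is copied in directly here so the code runs without downloading any NLTK data:

```python
from collections import Counter

# The tagged output shown above, copied in so this runs without NLTK data
tagged = [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'),
          ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
          ('dog', 'NN'), ('.', '.')]

# Count how many times each tag appears
tag_counts = Counter(tag for word, tag in tagged)

# Collect the words whose tag starts with "NN" (the noun tags)
nouns = [word for word, tag in tagged if tag.startswith("NN")]

print(tag_counts)
print(nouns)
```

Notice that "brown" ends up in the list of nouns, which is an instance of the classification error mentioned above.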
The meanings of these part-of-speech tags can be looked up with:
# nltk.download('tagsets')
# nltk.help.upenn_tagset()
PIL (Pillow)¶
Pillow is the updated version of the old Python Imaging Library (PIL), which provides fundamental tools for working with images. Pillow can work with many common formats (some of which may require extra packages or other dependencies) to automate a wide variety of image transformations.
Note: While pillow is how you install the package, you import functions with import PIL.
Both display and imshow can be used to preview the image. Currently, display returns some errors due to the image's modes, so we're using imshow to avoid these errors. Read more about modes.
from PIL import Image
from urllib.request import urlretrieve
from matplotlib.pyplot import imshow
from IPython.display import display
#downloading an image locally
urlretrieve("https://identity.unc.edu/wp-content/uploads/sites/885/2019/01/UNC_logo_webblue-e1517942350314.png",
"UNC_logo.png")
UNC = Image.open("UNC_logo.png")
#note: // divides two numbers and rounds the result down to get an integer
UNC_gray = UNC.convert('LA').resize((UNC.width//2,UNC.height//2))
imshow(UNC)
<matplotlib.image.AxesImage at 0x1877cb5b1d0>
Mode "LA" is grayscale, preserving transparent image areas.
imshow(UNC_gray)
<matplotlib.image.AxesImage at 0x1877a9f0290>
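The same conversion can be tried without downloading anything. In this sketch a made-up solid-color image stands in for the logo, which makes it easy to check the mode and size before and after; the color values are arbitrary.

```python
from PIL import Image

# A made-up 200x100 solid-color, fully opaque image stands in for the logo
img = Image.new("RGBA", (200, 100), (75, 156, 210, 255))

# Convert to grayscale with transparency ("LA"), then halve the width and height
gray = img.convert("LA").resize((img.width // 2, img.height // 2))

print(img.mode, img.size)
print(gray.mode, gray.size)
print(gray.getpixel((0, 0)))  # a (luminance, alpha) pair
```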
Parallel Processing with joblib¶
As you move into more complicated processes in Python (or applying code to a wide variety of objects or files), processing time can become a major factor. Fortunately, most modern laptops have multiple processor cores that can do separate things at the same time. Python only uses one core by default. If you have a set of loops that don't depend on each other (e.g. processing lots of files one after another), you could split up your many loop iterations between processors to greatly increase speed.
The joblib package provides a straightforward way to split loops up between your computer's cores for faster performance on complicated code. Note that parallelization may not benefit you much, or may even hurt you, for very quick jobs because setting up and consolidating information from separate cores creates overhead costs.
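As a minimal sketch of what this looks like, the example below runs the same independent task as an ordinary loop and then with joblib. The function sum_of_squares is a made-up stand-in for real work, and the choice of 2 workers is arbitrary.

```python
from joblib import Parallel, delayed

def sum_of_squares(n):
    """Stand-in for an expensive, independent task (e.g. processing one file)."""
    return sum(i * i for i in range(n))

inputs = [10_000, 20_000, 30_000, 40_000]

# Ordinary loop: one task after another on a single core
serial = [sum_of_squares(n) for n in inputs]

# Same work split across 2 workers; results come back in input order
parallel = Parallel(n_jobs=2)(delayed(sum_of_squares)(n) for n in inputs)

print(serial == parallel)
```

For tasks this small the parallel version is likely to be slower than the plain loop because of the overhead described above; the benefit appears when each task takes much longer.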