Using the Industry Documents Library API Python Wrapper

This tutorial demonstrates the basic usage of the Python wrapper for the UCSF Industry Documents Library API. The wrapper's source is available in its GitHub repository. To learn more about the underlying service, see the Industry Documents Library API documentation.

In [1]:

# if you have not installed the wrapper library yet, uncomment the following line to install it via pip
#!pip install industryDocumentsWrapper

We start by importing the main class, IndustryDocsSearch, and then assigning an instance of it to a variable.

In [1]:

from industryDocumentsWrapper import IndustryDocsSearch
import polars as pl

In [2]:

wrapper = IndustryDocsSearch()

We construct a simple query for email documents in the JUUL Labs Collection, restricted to the State of North Carolina case. By default, the query returns the first 1000 documents. If you want more or fewer results, pass the argument `n` (e.g. `n=50000`).

In [3]:

# if you want to change the number of returned results, uncomment the following line and adjust 'n' to fit your needs
# wrapper.query(q='(case:"State of North Carolina" AND collection:"JUUL Labs Collection" AND type:Email)', n=50000)
wrapper.query(q='(case:"State of North Carolina" AND collection:"JUUL Labs Collection" AND type:Email)')

100/1000 documents collected
200/1000 documents collected
300/1000 documents collected
400/1000 documents collected
500/1000 documents collected
600/1000 documents collected
700/1000 documents collected
800/1000 documents collected
900/1000 documents collected
1000/1000 documents collected
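
The query string above follows a `field:value` pattern, with the terms joined by `AND` and wrapped in parentheses. As a minimal sketch, a small helper can assemble such a string from a dictionary. Note that `build_query` is a hypothetical helper written for this tutorial, not part of the wrapper, and it only covers the simple AND-only case shown here.

```python
def build_query(filters):
    """Combine field/value pairs into one AND-joined query string."""
    terms = []
    for field, value in filters.items():
        # Quote values that contain whitespace so they are treated as a phrase
        if any(ch.isspace() for ch in value):
            value = f'"{value}"'
        terms.append(f"{field}:{value}")
    return "(" + " AND ".join(terms) + ")"


# Rebuild the query used in the cell above
q = build_query({
    "case": "State of North Carolina",
    "collection": "JUUL Labs Collection",
    "type": "Email",
})
print(q)
```

The resulting string can then be passed as the `q` argument to `wrapper.query`.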

In [4]:

len(wrapper.results)

Out[4]:

1000

Let's look at an example of the metadata returned for each item:

In [4]:

wrapper.results[0]

Out[4]:

{'id': 'ffbb0284',
 'collection': ['JUUL Labs Collection'],
 'collectioncode': ['juul'],
 'custodian': ['Marand, Ashley'],
 'availability': ['public', 'no restrictions'],
 'source': '[{"type":"plaintext","title":"University Libraries, University of North Carolina at Chapel Hill"}]',
 'datesent': '2015 July 07',
 'redactedby': ['UCSF'],
 'datereceived': '2015 July 07',
 'filename': 'Re: Changes to JUULvapor.com Events',
 'filepath': ['\\Marand, Ashley\\Ashley_Marand_ashley@pax.com_25.pst\\Top of Personal Folders\\Inbox\\Re: Changes to JUULvapor.com Events'],
 'case': ['State of North Carolina, ex rel. Joshua H. Stein, Attorney General,  v. JUUL Labs, Inc'],
 'title': 'Re: Changes to JUULvapor.com Events',
 'author': ['Lee Garvey <lee@pax.com>'],
 'documentdate': '2015 July 07',
 'type': ['email'],
 'pages': 1,
 'recipient': ['Ashley Marand <"ashley marand <ashley@pax.com>">'],
 'brand': ['Juul'],
 'bates': 'JLI00489744',
 'redacted': 'yes',
 'dateaddeducsf': '2024 January 25'}
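
Some of these fields need a little processing before analysis: in this record the dates are strings such as '2015 July 07', and 'source' is a JSON-encoded string rather than a list. The sketch below parses both using only the standard library. It assumes the formats seen in this one example record; other records may use different date formats or omit fields entirely.

```python
import json
from datetime import datetime

# A few fields copied from the example record above
record = {
    "id": "ffbb0284",
    "documentdate": "2015 July 07",
    "source": '[{"type":"plaintext","title":"University Libraries, University of North Carolina at Chapel Hill"}]',
}

# Assumption: dates look like "YYYY Month DD", as in this record
document_date = datetime.strptime(record["documentdate"], "%Y %B %d").date()

# 'source' is a JSON string, so decode it into a list of dictionaries
sources = json.loads(record["source"])
source_titles = [s["title"] for s in sources]

print(document_date)
print(source_titles)
```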

We can save the results in either JSON or Parquet format.

In [5]:

wrapper.save('test.parquet', format='parquet')

Now we may want to analyze the content of the documents returned by our search. You can either download the entire zipped JUUL collection and look up the content by the id column, or use a prepared dataset. In this case, we have created a parquet file with all the email documents from the North Carolina JUUL case for convenience. See this tutorial for how to create your own dataset.

In [6]:

# Step 1: Load the results from our query and the email data
query_df = pl.read_parquet('test.parquet')
df = pl.read_parquet('../../data/juul_unc_emails.parquet')

In [8]:

# Step 2: Join the dataframes on the 'id' column to get the OCR content
joined_df = query_df.join(df.select(['id', 'ocr_text']), on='id', how='left')

# Step 3: Let's make sure our join worked correctly
print(f"Original query_df shape: {query_df.shape}")
print(f"Joined dataframe shape: {joined_df.shape}")
joined_df.head(3)

Original query_df shape: (1000, 25)
Joined dataframe shape: (1000, 26)

Out[8]:

shape: (3, 26)

id	collection	collectioncode	custodian	availability	source	datesent	redactedby	datereceived	filename	filepath	case	title	author	documentdate	type	pages	recipient	brand	bates	redacted	dateaddeducsf	topic	copied	attachment	ocr_text
str	list[str]	list[str]	list[str]	list[str]	str	str	list[str]	str	str	list[str]	list[str]	str	list[str]	str	list[str]	i64	list[str]	list[str]	str	str	str	str	list[str]	list[str]	str
"ffbb0284"	["JUUL Labs Collection"]	["juul"]	["Marand, Ashley"]	["public", "no restrictions"]	"[{"type":"plaintext","title":"…	"2015 July 07"	["UCSF"]	"2015 July 07"	"Re: Changes to JUULvapor.com E…	["\Marand, Ashley\Ashley_Marand_ashley@pax.com_25.pst\Top of Personal Folders\Inbox\Re: Changes to JUULvapor.com Events"]	["State of North Carolina, ex rel. Joshua H. Stein, Attorney General, v. JUUL Labs, Inc"]	"Re: Changes to JUULvapor.com E…	["Lee Garvey <lee@pax.com>"]	"2015 July 07"	["email"]	1	["Ashley Marand <"ashley marand <ashley@pax.com>">"]	["Juul"]	"JLI00489744"	"yes"	"2024 January 25"	null	null	null	"From: To: Sent: Subject: …
"ffbb0285"	["JUUL Labs Collection"]	["juul"]	["Goulart, Tania;Long-Rotstein, Kelly"]	["public", "no restrictions"]	"[{"type":"plaintext","title":"…	"2016 March 09"	["UCSF"]	"2016 March 09"	"Re: [Update] B2B Portal - Soli…	["\Long, Kelly\Kelly Long\Kelly_Long_kelly@pax.com_108.pst\Top of Personal Folders\Inbox\Re: [Update] B2B Portal - Solidify Phase 1 Requirements", "\Goulart, Tania\Tania_Goulart_tania@pax.com_17.pst\Top of Personal Folders\Inbox\Re: [Update] B2B Portal - Solidify Phase 1 Requirements", "\Goulart, Tania\Tania Goulart_Email\Tania_Goulart_Email_tania@juul.com_34.pst\Top of Personal Folders\Inbox\Re: [Update] B2B Portal - Solidify Phase 1 Requirements"]	["State of North Carolina, ex rel. Joshua H. Stein, Attorney General, v. JUUL Labs, Inc"]	"Re: [Update] B2B Portal - Soli…	["Kelly Long <kelly@pax.com>"]	"2016 March 09"	["email"]	6	["Tania Goulart <"tania goulart <tania@pax.com>">"]	["Juul"]	"JLI03638986"	"yes"	"2024 January 25"	"Marketing"	null	null	"From: To: Sent: Subject: …
"ffbb0287"	["JUUL Labs Collection"]	["juul"]	["Berrier, Jon;Burbidge, Cole;Cruise, Daniel;David, Matthew;Davis, Victoria;Esquea, Jim;Foster, Heather;Gould, Ashley;Honig, Jake;Kwong, Ted;Sillin, Nat;Taylor, Jessica;Troy, Tevi;Winterton, Grant"]	["public", "no restrictions"]	"[{"type":"plaintext","title":"…	"2019 April 14"	null	null	"Event Tracker: 4/15/19"	["\Berrier, Jon\Jon Berrier_Email_All through 4-1-2020\Jon_Berrier_Email_All_through_4-1-2020--jon@juul.com_0 .mbox\flag-alerts\1630919068580615622-e1dfdbc1-2666-47fa-9c9e-213410ca6652.mbox.eml\Event Tracker: 4/15/19", "\Burbidge, Cole\Cole_Burbidge_Email_cburbidge@juul.com_0 .mbox\flag-alerts\1630919069155943942-5ab412c5-d9ab-4e4d-87af-a4825f80ec94.mbox.eml\Event Tracker: 4/15/19", … "\Winterton, Grant\Grant_Winterton_Email_4-28-2020--grant@juul.com_3 .mbox\TRASH\1630919067683967005-cd8516e1-a624-48aa-9f4e-bf7dd5d40223.mbox.eml\Event Tracker: 4/15/19"]	["State of North Carolina, ex rel. Joshua H. Stein, Attorney General, v. JUUL Labs, Inc"]	"Event Tracker: 4/15/19"	["Flag Alerts <alerts@fmaalerts.com>"]	"2019 April 14"	["email"]	1	["juulalerts@flagmediaanalytics.com"]	["Juul"]	"JLI05453876"	null	"2024 January 25"	null	null	null	"From: To: Sent: Subject: …
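
A left join keeps every row from the query results and adds `ocr_text` where a matching `id` exists, leaving it null otherwise. That is why the joined frame has the same 1000 rows and exactly one extra column. The plain-Python sketch below mirrors that behaviour on made-up toy rows (the ids and text are illustrative, not from the dataset), and assumes each `id` appears at most once in the OCR data.

```python
# Toy stand-ins for query_df and df; these ids and texts are made up
query_rows = [
    {"id": "a1", "title": "First"},
    {"id": "b2", "title": "Second"},
]
ocr_rows = [{"id": "a1", "ocr_text": "example text"}]

# Index the OCR text by id for fast lookup
ocr_by_id = {row["id"]: row["ocr_text"] for row in ocr_rows}

# Left join: keep every query row, use None where no OCR text matches
joined = [{**row, "ocr_text": ocr_by_id.get(row["id"])} for row in query_rows]

# Ids without OCR text, worth checking before any text analysis
missing = [row["id"] for row in joined if row["ocr_text"] is None]
print(len(joined), missing)
```

On the real data, counting the null values in the `ocr_text` column of `joined_df` gives the same check.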

In [9]:

# Step 4: Save the joined dataframe to a new parquet file
joined_df.write_parquet('juul_query_with_ocr.parquet')

That's it! Now you are able to construct your own queries in the Industry Documents Library. For more information about using this Python package, visit the GitHub repository.
