Using the Industry Documents Library API Python Wrapper
This tutorial demonstrates the basic usage of the Python wrapper library for the UCSF Industry Documents Library API. See the library's GitHub repository for the source code, and the Industry Documents Library API documentation to learn more about the API itself.
# if you have not yet installed the wrapper library, uncomment the following line to install it via pip
#!pip install industryDocumentsWrapper
We can start by importing the main class IndustryDocsSearch and creating an instance of it, which we assign to a variable.
from industryDocumentsWrapper import IndustryDocsSearch
import polars as pl
wrapper = IndustryDocsSearch()
We construct a simple query for email documents from the JUUL Labs Collection, restricted to the State of North Carolina case. By default, the query returns the first 1000 documents. If you want more or fewer results, you can also pass in the argument n (e.g. n=50000).
# if you want to change the number of returned results, uncomment the following line and adjust 'n' to fit your needs
# wrapper.query(q='(case:"State of North Carolina" AND collection:"JUUL Labs Collection" AND type:Email)', n=50000)
wrapper.query(q='(case:"State of North Carolina" AND collection:"JUUL Labs Collection" AND type:Email)')
100/1000 documents collected 200/1000 documents collected 300/1000 documents collected 400/1000 documents collected 500/1000 documents collected 600/1000 documents collected 700/1000 documents collected 800/1000 documents collected 900/1000 documents collected 1000/1000 documents collected
len(wrapper.results)
1000
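The query string above follows a simple pattern: field:value clauses joined with AND, with multi-word values wrapped in double quotes. As a minimal sketch, a small helper can assemble such strings from a dictionary. The build_query function is hypothetical and is not part of the wrapper library; it only reproduces the pattern used in this tutorial.

```python
def build_query(fields: dict[str, str]) -> str:
    """Join field:value clauses with AND, quoting values that contain spaces.

    Hypothetical helper for illustration; not part of industryDocumentsWrapper.
    """
    clauses = []
    for field, value in fields.items():
        # Multi-word values are wrapped in double quotes, as in the tutorial's query.
        term = f'"{value}"' if " " in value else value
        clauses.append(f"{field}:{term}")
    return "(" + " AND ".join(clauses) + ")"


q = build_query({
    "case": "State of North Carolina",
    "collection": "JUUL Labs Collection",
    "type": "Email",
})
print(q)
```

The resulting string can then be passed as the q argument, for example wrapper.query(q=q).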
Let's look at an example of the metadata returned for each item:
wrapper.results[0]
{'id': 'ffbb0284', 'collection': ['JUUL Labs Collection'], 'collectioncode': ['juul'], 'custodian': ['Marand, Ashley'], 'availability': ['public', 'no restrictions'], 'source': '[{"type":"plaintext","title":"University Libraries, University of North Carolina at Chapel Hill"}]', 'datesent': '2015 July 07', 'redactedby': ['UCSF'], 'datereceived': '2015 July 07', 'filename': 'Re: Changes to JUULvapor.com Events', 'filepath': ['\\Marand, Ashley\\Ashley_Marand_ashley@pax.com_25.pst\\Top of Personal Folders\\Inbox\\Re: Changes to JUULvapor.com Events'], 'case': ['State of North Carolina, ex rel. Joshua H. Stein, Attorney General, v. JUUL Labs, Inc'], 'title': 'Re: Changes to JUULvapor.com Events', 'author': ['Lee Garvey <lee@pax.com>'], 'documentdate': '2015 July 07', 'type': ['email'], 'pages': 1, 'recipient': ['Ashley Marand <"ashley marand <ashley@pax.com>">'], 'brand': ['Juul'], 'bates': 'JLI00489744', 'redacted': 'yes', 'dateaddeducsf': '2024 January 25'}
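Note that some fields in this record are encoded as strings rather than native types: source holds a JSON-encoded list, and the date fields use a "year month-name day" layout. The sketch below decodes both with the standard library, using values copied from the record above. It assumes other records follow the same formats, which may not hold for every document.

```python
import json
from datetime import datetime

# A trimmed copy of the record shown above (only the fields used here).
record = {
    "id": "ffbb0284",
    "source": '[{"type":"plaintext","title":"University Libraries, University of North Carolina at Chapel Hill"}]',
    "documentdate": "2015 July 07",
}

# 'source' is a JSON string, so it has to be decoded before use.
sources = json.loads(record["source"])
source_titles = [entry["title"] for entry in sources]

# Dates look like "2015 July 07": four-digit year, full month name, day.
doc_date = datetime.strptime(record["documentdate"], "%Y %B %d").date()

print(source_titles)
print(doc_date.isoformat())
```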
We can save the results in either JSON or Parquet format.
wrapper.save('test.parquet', format='parquet')
Now, we may want to analyze the content of the documents returned by our search. You can either download the entire zipped JUUL collection and look up the content by the id column, or use a prepared dataset. In this case, we have created a parquet file with all the email documents from the North Carolina JUUL case for our convenience. See this tutorial for how to create your own dataset.
# Step 1: Load the results from our query and the email data
query_df = pl.read_parquet('test.parquet')
df = pl.read_parquet('../../data/juul_unc_emails.parquet')
# Step 2: Join the dataframes on the 'id' column to get the OCR content
joined_df = query_df.join(df.select(['id', 'ocr_text']), on='id', how='left')
# Step 3: Let's make sure our join worked correctly
print(f"Original query_df shape: {query_df.shape}")
print(f"Joined dataframe shape: {joined_df.shape}")
joined_df.head(3)
Original query_df shape: (1000, 25) Joined dataframe shape: (1000, 26)
id | collection | collectioncode | custodian | availability | source | datesent | redactedby | datereceived | filename | filepath | case | title | author | documentdate | type | pages | recipient | brand | bates | redacted | dateaddeducsf | topic | copied | attachment | ocr_text |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
str | list[str] | list[str] | list[str] | list[str] | str | str | list[str] | str | str | list[str] | list[str] | str | list[str] | str | list[str] | i64 | list[str] | list[str] | str | str | str | str | list[str] | list[str] | str |
"ffbb0284" | ["JUUL Labs Collection"] | ["juul"] | ["Marand, Ashley"] | ["public", "no restrictions"] | "[{"type":"plaintext","title":"… | "2015 July 07" | ["UCSF"] | "2015 July 07" | "Re: Changes to JUULvapor.com E… | ["\Marand, Ashley\Ashley_Marand_ashley@pax.com_25.pst\Top of Personal Folders\Inbox\Re: Changes to JUULvapor.com Events"] | ["State of North Carolina, ex rel. Joshua H. Stein, Attorney General, v. JUUL Labs, Inc"] | "Re: Changes to JUULvapor.com E… | ["Lee Garvey <lee@pax.com>"] | "2015 July 07" | ["email"] | 1 | ["Ashley Marand <"ashley marand <ashley@pax.com>">"] | ["Juul"] | "JLI00489744" | "yes" | "2024 January 25" | null | null | null | "From: To: Sent: Subject: … |
"ffbb0285" | ["JUUL Labs Collection"] | ["juul"] | ["Goulart, Tania;Long-Rotstein, Kelly"] | ["public", "no restrictions"] | "[{"type":"plaintext","title":"… | "2016 March 09" | ["UCSF"] | "2016 March 09" | "Re: [Update] B2B Portal - Soli… | ["\Long, Kelly\Kelly Long\Kelly_Long_kelly@pax.com_108.pst\Top of Personal Folders\Inbox\Re: [Update] B2B Portal - Solidify Phase 1 Requirements", "\Goulart, Tania\Tania_Goulart_tania@pax.com_17.pst\Top of Personal Folders\Inbox\Re: [Update] B2B Portal - Solidify Phase 1 Requirements", "\Goulart, Tania\Tania Goulart_Email\Tania_Goulart_Email_tania@juul.com_34.pst\Top of Personal Folders\Inbox\Re: [Update] B2B Portal - Solidify Phase 1 Requirements"] | ["State of North Carolina, ex rel. Joshua H. Stein, Attorney General, v. JUUL Labs, Inc"] | "Re: [Update] B2B Portal - Soli… | ["Kelly Long <kelly@pax.com>"] | "2016 March 09" | ["email"] | 6 | ["Tania Goulart <"tania goulart <tania@pax.com>">"] | ["Juul"] | "JLI03638986" | "yes" | "2024 January 25" | "Marketing" | null | null | "From: To: Sent: Subject: … |
"ffbb0287" | ["JUUL Labs Collection"] | ["juul"] | ["Berrier, Jon;Burbidge, Cole;Cruise, Daniel;David, Matthew;Davis, Victoria;Esquea, Jim;Foster, Heather;Gould, Ashley;Honig, Jake;Kwong, Ted;Sillin, Nat;Taylor, Jessica;Troy, Tevi;Winterton, Grant"] | ["public", "no restrictions"] | "[{"type":"plaintext","title":"… | "2019 April 14" | null | null | "Event Tracker: 4/15/19" | ["\Berrier, Jon\Jon Berrier_Email_All through 4-1-2020\Jon_Berrier_Email_All_through_4-1-2020--jon@juul.com_0 .mbox\flag-alerts\1630919068580615622-e1dfdbc1-2666-47fa-9c9e-213410ca6652.mbox.eml\Event Tracker: 4/15/19", "\Burbidge, Cole\Cole_Burbidge_Email_cburbidge@juul.com_0 .mbox\flag-alerts\1630919069155943942-5ab412c5-d9ab-4e4d-87af-a4825f80ec94.mbox.eml\Event Tracker: 4/15/19", … "\Winterton, Grant\Grant_Winterton_Email_4-28-2020--grant@juul.com_3 .mbox\TRASH\1630919067683967005-cd8516e1-a624-48aa-9f4e-bf7dd5d40223.mbox.eml\Event Tracker: 4/15/19"] | ["State of North Carolina, ex rel. Joshua H. Stein, Attorney General, v. JUUL Labs, Inc"] | "Event Tracker: 4/15/19" | ["Flag Alerts <alerts@fmaalerts.com>"] | "2019 April 14" | ["email"] | 1 | ["juulalerts@flagmediaanalytics.com"] | ["Juul"] | "JLI05453876" | null | "2024 January 25" | null | null | null | "From: To: Sent: Subject: … |
# Step 4: Save the joined dataframe to a new parquet file
joined_df.write_parquet('juul_query_with_ocr.parquet')
That's it! Now you are able to construct your own queries in the Industry Documents Library. For more information about using this Python package, visit the GitHub repository.