Owner Yann: (Experimental) Demo Google Cloud application flow for COVID-19 data extraction from PDFs

Aziz Ketari c10749a8f1 testing updates 4 years ago

COVID-19 public dataset on GCP from cases in Italy

by the Italian Society of Medical and Interventional Radiology (ISMIR)

This repository contains all the code required to extract relevant information from PDF documents published by ISMIR, and to store the raw data in a relational database and the entities in a NoSQL database.

In particular, you will use the Google Cloud Vision API and Translation API before storing the information in BigQuery. Separately, you will also use specific NER models (from scispaCy) to extract medical domain-specific entities and store them in a NoSQL database (namely Datastore) on Google Cloud Platform.

Looking for more context behind this dataset? Check out this article.

Google Cloud Architecture of the pipeline: Batch mode (Streaming mode coming soon ...)

Quick sneak peek at the entity dataset in Datastore:


Installation

You can replicate this pipeline directly on your local machine or in Cloud Shell on GCP.

Requirements:

  • Set the project you will be working on:

    gcloud config set project PROJECT_ID

  • Enable APIs:

    gcloud services enable vision.googleapis.com
    gcloud services enable translate.googleapis.com
    gcloud services enable datastore.googleapis.com
    gcloud services enable bigquery.googleapis.com
    
  • Install package requirements:

    cd ~/covid19_ISMIR
    pip3 install --user -r requirements.txt
    

Note:

You will also need to download a NER model for the second part of this pipeline. See scispaCy's full selection of available models here. If you follow this installation guide, the steps below download and install a model for you automatically.
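Before running the entity-extraction part, it can help to confirm that the model is actually importable. The following is a minimal sketch, assuming the model installs as a Python package named `en_core_sci_lg` (the name of the archive installed later in this guide); the helper names `is_model_installed` and `load_ner_model` are hypothetical and not part of this repository.

```python
import importlib.util


def is_model_installed(package_name: str) -> bool:
    """Return True if a top-level package with this name can be imported."""
    return importlib.util.find_spec(package_name) is not None


def load_ner_model(package_name: str = "en_core_sci_lg"):
    """Load the scispaCy model, failing early with a readable message."""
    if not is_model_installed(package_name):
        raise RuntimeError(
            f"{package_name} is not installed; run the download step first."
        )
    import spacy  # imported lazily so the check above works without spaCy

    return spacy.load(package_name)
```

Calling `load_ner_model()` after the download step should return a spaCy pipeline whose `doc.ents` holds the detected entities.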

Extracting data

  • Step 1: Set the value of each variable in env_variables.sh, then source the file.

    Assumption: You have already created and downloaded the JSON key for your Google Cloud service account. Useful link

    source env_variables.sh
    
  • Step 2: Download the required files to your bucket and install the required model on your local machine
    (this step takes ~10 min)

    Optional: If you have already downloaded the scispaCy model, modify ./content/download_content.sh so that it does not repeat that download

    source ./content/download_content.sh
    pip install -U ./scispacy_models/en_core_sci_lg-0.2.4.tar.gz
    
  • Step 3: Start extracting text from the PDF documents

    python3 ./scripts/extraction.py
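For readers curious what the extraction step involves, here is a rough sketch of asynchronous PDF text detection with the Vision API. It is not the code in `extraction.py`; it assumes `google-cloud-vision` 2.x or newer, PDFs already uploaded to a bucket, and a hypothetical output layout (`gs://<bucket>/json/<pdf name>/`). The function names are illustrative only.

```python
from typing import Tuple


def split_gcs_uri(uri: str) -> Tuple[str, str]:
    """Split 'gs://bucket/path/to/file.pdf' into (bucket, object path)."""
    if not uri.startswith("gs://"):
        raise ValueError(f"not a gs:// URI: {uri}")
    bucket, _, blob = uri[len("gs://"):].partition("/")
    return bucket, blob


def output_prefix_for(pdf_uri: str, output_folder: str = "json") -> str:
    """Derive the gs:// prefix where OCR results for one PDF are written.

    The folder layout is an assumption made for this sketch.
    """
    bucket, blob = split_gcs_uri(pdf_uri)
    stem = blob.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    return f"gs://{bucket}/{output_folder}/{stem}/"


def ocr_pdf(pdf_uri: str, timeout_s: int = 420) -> str:
    """Run asynchronous document text detection on one PDF stored in GCS."""
    from google.cloud import vision  # lazy import: needs google-cloud-vision

    client = vision.ImageAnnotatorClient()
    destination = output_prefix_for(pdf_uri)
    request = vision.AsyncAnnotateFileRequest(
        features=[
            vision.Feature(type_=vision.Feature.Type.DOCUMENT_TEXT_DETECTION)
        ],
        input_config=vision.InputConfig(
            gcs_source=vision.GcsSource(uri=pdf_uri),
            mime_type="application/pdf",
        ),
        output_config=vision.OutputConfig(
            gcs_destination=vision.GcsDestination(uri=destination),
            batch_size=2,  # pages per output JSON file
        ),
    )
    operation = client.async_batch_annotate_files(requests=[request])
    operation.result(timeout=timeout_s)  # block until the OCR job finishes
    return destination
```

The Vision API writes its results as JSON files under the returned prefix, so a later step would read those files back from the bucket to get the page text.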

Pre-processing data

Following the extraction of text, it's time to translate it from Italian to English and curate it.

python3 ./scripts/preprocessing.py
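As an illustration of the translation step, the sketch below splits long text into smaller pieces and sends each one to the Translation API. It is not the code in `preprocessing.py`; it assumes the `google-cloud-translate` package (v2 client), and the 4000-character chunk size is an arbitrary choice for this example, not a documented API limit.

```python
from typing import List


def chunk_text(text: str, max_chars: int = 4000) -> List[str]:
    """Split text on whitespace into chunks of at most max_chars characters.

    A single word longer than max_chars is kept whole in its own chunk.
    """
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > max_chars and current:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def translate_it_to_en(text: str, max_chars: int = 4000) -> str:
    """Translate Italian text to English, one chunk at a time."""
    from google.cloud import translate_v2 as translate  # lazy import

    client = translate.Client()
    translated = []
    for chunk in chunk_text(text, max_chars):
        result = client.translate(
            chunk,
            source_language="it",
            target_language="en",
            format_="text",  # ask for plain text rather than HTML
        )
        translated.append(result["translatedText"])
    return " ".join(translated)
```

Chunking on whitespace keeps words intact, at the cost of occasionally splitting a sentence across two requests.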

Storing data

Following the pre-processing, it's time to store the data in a more searchable format: a data warehouse - BigQuery - for the text, and a NoSQL database - Datastore - for the (UMLS) medical entities.

python3 ./scripts/storing.py
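To make the two storage targets concrete, here is a hedged sketch of writing one case to both. It is not the code in `storing.py`: the column names (`case_id`, `it_raw_txt`, `eng_txt`), the Datastore kind `case`, and the entity labels are all hypothetical, and the BigQuery table is assumed to exist already with a matching schema.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_entities(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (entity text, label) pairs by label, dropping duplicates."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for text, label in pairs:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)


def store_case(
    case_id: str,
    raw_text: str,
    eng_text: str,
    entity_pairs: Iterable[Tuple[str, str]],
    table_id: str,  # "project.dataset.table"
    kind: str = "case",  # hypothetical Datastore kind
) -> None:
    """Write the text to BigQuery and the grouped entities to Datastore."""
    from google.cloud import bigquery, datastore  # lazy imports

    # Hypothetical column names; they must match the table schema.
    row = {"case_id": case_id, "it_raw_txt": raw_text, "eng_txt": eng_text}
    errors = bigquery.Client().insert_rows_json(table_id, [row])
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")

    ds_client = datastore.Client()
    entity = datastore.Entity(key=ds_client.key(kind, case_id))
    entity.update(group_entities(entity_pairs))
    ds_client.put(entity)
```

Grouping by label gives each Datastore entity one list-valued property per entity type, which keeps lookups by case simple.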

Test

Last but not least, you can query your databases using this script.

python3 ./scripts/retrieving.py
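A query against both stores might look roughly like the sketch below. It is not the code in `retrieving.py`; the column name `case_id` and the Datastore kind passed in are hypothetical, and the functions assume default application credentials are configured.

```python
from typing import Any, Dict, List


def build_case_query(table_id: str) -> str:
    """Build a parameterized query for one case (hypothetical column name)."""
    return f"SELECT * FROM `{table_id}` WHERE case_id = @case_id"


def fetch_case_text(table_id: str, case_id: str) -> List[Dict[str, Any]]:
    """Return the BigQuery rows stored for one case."""
    from google.cloud import bigquery  # lazy import

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("case_id", "STRING", case_id)
        ]
    )
    rows = client.query(build_case_query(table_id), job_config=job_config).result()
    return [dict(row.items()) for row in rows]


def fetch_case_entities(kind: str, case_id: str) -> Dict[str, Any]:
    """Return the Datastore entity for one case, or {} if it is missing."""
    from google.cloud import datastore  # lazy import

    client = datastore.Client()
    entity = client.get(client.key(kind, case_id))
    return dict(entity) if entity is not None else {}
```

Passing the case identifier as a query parameter, rather than formatting it into the SQL string, avoids quoting problems and injection.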


Contributing

To get started...

Step 1

Step 2

  • HACK AWAY! 🔨🔨🔨

Step 3

  • 🔃 Create a new pull request

Citing


License
