@@ -1,5 +1,7 @@
-# COVID-19 public dataset on GCP from cases in Italy
-> Medical notes and entities from TRUE patient cases publicly available on BigQuery and Datastore!
+cyborg-ai fork of Aziz Ketari's proposal to extract and classify real COVID-19 patient data.  
+https://www.linkedin.com/in/aziz-ketari
+### COVID-19 public dataset on GCP from cases in Italy
+Medical notes and entities from true patient cases publicly available on BigQuery and Datastore!
 
 This repository contains all the code required to extract relevant information from pdf documents published by ISMIR
 and store raw data in a relational database and entities in a No-SQL database.
@@ -13,13 +15,7 @@ store them in a NoSQL db (namely Datastore) on Google Cloud Platform.
 
 Google Cloud Architecture of the pipeline:
 
-
-Quick sneak peak on the Entity dataset on Datastore:
-
-
----- 
-
-## Installation
+#### Installation
 You can replicate this pipeline directly on your local machine or on the cloud shell on GCP.
 
 **Requirements:**
@@ -35,7 +31,7 @@ source ./content/env_variables.sh
 
 - Set the project that you will be working on:
 
-`gcloud config set project $PROJECT_ID`
+```gcloud config set project $PROJECT_ID```
 
 - Enable APIs:
 ```
@@ -46,9 +42,9 @@ gcloud services enable bigquery.googleapis.com
 ```
 
 - Install package requirements:
-> Make sure you have a python version >=3.6.0. Otherwise you will face some version errors [Useful link](https://stackoverflow.com/questions/47273260/google-cloud-compute-engine-change-to-python-3-6)
+Make sure you have Python >=3.6.0; otherwise the installation fails with a version error such as the one below ([useful link](https://stackoverflow.com/questions/47273260/google-cloud-compute-engine-change-to-python-3-6)).
 
-`ERROR: Package 'scispacy' requires a different Python: 3.5.3 not in '>=3.6.0'`
+```ERROR: Package 'scispacy' requires a different Python: 3.5.3 not in '>=3.6.0'```
 
 ```
 pip3 install --user -r requirements.txt
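Review note on the `>=3.6.0` requirement in the hunk above: the scispacy error only surfaces once `pip3 install` is already running, so a script could check the interpreter version up front instead. The following is a minimal sketch, not code from this repository; `MIN_PYTHON` and `meets_minimum` are hypothetical names, and only the 3.6.0 threshold comes from the README.

```python
import sys

# Minimum interpreter version stated in the README (scispacy needs >=3.6.0).
MIN_PYTHON = (3, 6, 0)


def meets_minimum(version=None, minimum=MIN_PYTHON):
    """Return True if `version` is at least `minimum`.

    `version` defaults to the running interpreter; hypothetical helper,
    not part of the repository.
    """
    if version is None:
        version = sys.version_info
    # Compare (major, minor, micro) tuples lexicographically.
    return tuple(version[:3]) >= tuple(minimum)


if __name__ == "__main__":
    if not meets_minimum():
        # %-formatting keeps this file parseable on old interpreters.
        sys.exit("Python %d.%d.%d or newer is required" % MIN_PYTHON)
```

The sketch avoids f-strings on purpose, since they are a syntax error on the pre-3.6 interpreters the check is meant to catch.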
@@ -61,64 +57,50 @@ available models [here](https://allenai.github.io/scispacy/). If you follow this
 will automatically download a model for you and install it.
 
 
-## Extracting data
+#### Extracting data
 
 - **Step 1:** Download the required files to your bucket and load the required model on your local machine  
 (this step will take ~10 min)
-> Optional: If you have already downloaded the scispacy models, you should modify the file ./content/download_content.sh to not repeat that step
+Optional: If you have already downloaded the scispacy models, you should modify the file ./content/download_content.sh to not repeat that step
 ```
 source ./content/download_content.sh
-```
+````
 
 - **Step 2:** Start the extraction of text from the pdf documents  
 
-`python3 ./scripts/extraction.py`
+```
+python3 ./scripts/extraction.py
+```
 
-## Pre-processing data
+#### Pre-processing data
 Following the extraction of text, it's time to translate it from Italian to English and curate it.
 
-`python3 ./scripts/preprocessing.py`
+```
+python3 ./scripts/preprocessing.py
+```
 
-## Storing data
+#### Storing data
 Following the pre-processing, it's time to store the data in a more searchable format: a data warehouse -
 [BigQuery](https://cloud.google.com/bigquery) - for the text, and a No-SQL database -
 [Datastore](https://cloud.google.com/datastore) - for the (UMLS) medical entities.
 
-`python3 ./scripts/storing.py True True [Model_of_your_choice]`
+```
+python3 ./scripts/storing.py True True [Model_of_your_choice]
+```
 
-## Test
+#### Test
 Last but not least, this script will run a few test cases and display the results. Feel free to modify the test cases.
 
-`python3 ./scripts/retrieving.py`
-
----- 
-
-## Contributing
-> To get started...
-
-### Step 1
-- **Option 1**
-    - 🍴 Fork this repo!
-
-- **Option 2**
-    - 👯 Clone this repo to your local machine using https://github.com/azizketari/covid19_ISMIR.git
-
-### Step 2
-- **HACK AWAY!** 🔨🔨🔨
-
-### Step 3
-- 🔃 Create a new pull request
-
----- 
+```
+python3 ./scripts/retrieving.py
+```
 
-## Citing
+#### Citing
 
 - [ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing by Mark Neumann and Daniel King and
-Iz Beltagy and Waleed Ammar (2019)](https://www.semanticscholar.org/paper/ScispaCy%3A-Fast-and-Robust-Models-for-Biomedical-Neumann-King/de28ec1d7bd38c8fc4e8ac59b6133800818b4e29)
-
----- 
+Iz Beltagy and Waleed Ammar (2019)](https://www.semanticscholar.org/paper/ScispaCy%3A-Fast-and-Robust-Models-for-Biomedical-Neumann-King/de28ec1d7bd38c8fc4e8ac59b6133800818b4e29)
 
-## License
+#### License
 [](http://badges.mit-license.org)
 
 - [MIT License](https://opensource.org/licenses/mit-license.php)
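Review note: after this change the README runs four scripts in a fixed order (extraction, pre-processing, storing, retrieving). A minimal sketch of chaining them and stopping at the first failure is shown below. It assumes the environment setup and `download_content.sh` steps have already been done. `build_commands`, `run_pipeline`, and the `runner` parameter are hypothetical names; the script paths and the two `True` arguments are copied from the README as-is, and the model name is left for the caller to supply.

```python
import subprocess
import sys

# Script paths and arguments as they appear in the README, in section order.
STEPS = [
    ["./scripts/extraction.py"],
    ["./scripts/preprocessing.py"],
    ["./scripts/storing.py", "True", "True"],  # model name appended below
    ["./scripts/retrieving.py"],
]


def build_commands(model, python=sys.executable):
    """Build one argv list per README step (hypothetical helper)."""
    commands = []
    for step in STEPS:
        cmd = [python] + step
        if step[0].endswith("storing.py"):
            # The README passes the chosen model as the last argument.
            cmd.append(model)
        commands.append(cmd)
    return commands


def run_pipeline(model, runner=subprocess.call):
    """Run the steps in order; return the first non-zero exit code, else 0."""
    for cmd in build_commands(model):
        code = runner(cmd)
        if code != 0:
            return code  # stop early so later steps never see partial data
    return 0
```

Passing a custom `runner` makes it possible to dry-run the sequence (for example, a function that only prints each command and returns 0) before spending time on the real extraction.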