|
@@ -1,5 +1,7 @@
|
|
|
-# COVID-19 public dataset on GCP from cases in Italy
|
|
|
-> Medical notes and entities from TRUE patient cases publicly available on BigQuery and Datastore!
|
|
|
+cyborg-ai fork of Aziz Ketari's proposal to extract and classify real COVID-19 patient data.
|
|
|
+https://www.linkedin.com/in/aziz-ketari
|
|
|
+### COVID-19 public dataset on GCP from cases in Italy
|
|
|
+Medical notes and entities from true patient cases publicly available on BigQuery and Datastore!
|
|
|
|
|
|
This repository contains all the code required to extract relevant information from pdf documents published by ISMIR
|
|
|
and store raw data in a relational database and entities in a No-SQL database.
|
|
@@ -13,13 +15,7 @@ store them in a NoSQL db (namely Datastore) on Google Cloud Platform.
|
|
|
|
|
|
Google Cloud Architecture of the pipeline:
|
|
|
![Batch mode (Streaming mode coming soon ...)](./content/images/covid19_repo_architecture_3_24_2020.png)
|
|
|
-
|
|
|
-Quick sneak peak on the Entity dataset on Datastore:
|
|
|
-![](./content/images/datastore_snapshot.gif)
|
|
|
-
|
|
|
----
|
|
|
-
|
|
|
-## Installation
|
|
|
+#### Installation
|
|
|
You can replicate this pipeline directly on your local machine or on the cloud shell on GCP.
|
|
|
|
|
|
**Requirements:**
|
|
@@ -35,7 +31,7 @@ source ./content/env_variables.sh
|
|
|
|
|
|
- Set the project that you will be working on:
|
|
|
|
|
|
-`gcloud config set project $PROJECT_ID`
|
|
|
+```
+gcloud config set project $PROJECT_ID
+```
|
|
|
|
|
|
- Enable APIs:
|
|
|
```
|
|
@@ -46,9 +42,9 @@ gcloud services enable bigquery.googleapis.com
|
|
|
```
|
|
|
|
|
|
- Install package requirements:
|
|
|
-> Make sure you have a python version >=3.6.0. Otherwise you will face some version errors [Useful link](https://stackoverflow.com/questions/47273260/google-cloud-compute-engine-change-to-python-3-6)
|
|
|
+Make sure you have a Python version >= 3.6.0, otherwise you will run into version errors like the one below ([useful link](https://stackoverflow.com/questions/47273260/google-cloud-compute-engine-change-to-python-3-6)).
|
|
|
|
|
|
-`ERROR: Package 'scispacy' requires a different Python: 3.5.3 not in '>=3.6.0'`
|
|
|
+```
+ERROR: Package 'scispacy' requires a different Python: 3.5.3 not in '>=3.6.0'
+```
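+
+A quick way to double-check the interpreter before installing (run it with the same `python3` you will use below):
+
+```
+import sys
+
+# scispacy (per the error above) needs Python >= 3.6
+assert sys.version_info >= (3, 6), "Python 3.6+ required, found " + sys.version
+```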
|
|
|
|
|
|
```
|
|
|
pip3 install --user -r requirements.txt
|
|
@@ -61,64 +57,50 @@ available models [here](https://allenai.github.io/scispacy/). If you follow this
|
|
|
will automatically download a model for you and install it.
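+
+Once installed, a scispacy model loads like any other spaCy model. A minimal sketch to confirm the install worked (assuming the `en_core_sci_sm` model, one of the smaller scispacy models):
+
+```
+import spacy
+
+# Load the installed scispacy model (swap in whichever model you chose)
+nlp = spacy.load("en_core_sci_sm")
+
+# Run it on a short sample note and print the biomedical entities it finds
+doc = nlp("The patient presented with fever and bilateral interstitial pneumonia.")
+print([(ent.text, ent.label_) for ent in doc.ents])
+```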
|
|
|
|
|
|
|
|
|
-## Extracting data
|
|
|
+#### Extracting data
|
|
|
|
|
|
- **Step 1:** Download the required files to your bucket and load the required model in your local
|
|
|
(this step will take ~10 min)
|
|
|
-> Optional: If you have already downloaded the scispacy models, you should modify the file ./content/download_content.sh to not repeat that step
|
|
|
+Optional: If you have already downloaded the scispacy models, modify ./content/download_content.sh so that it does not repeat that step.
|
|
|
```
|
|
|
source ./content/download_content.sh
|
|
|
-```
|
|
|
+```
|
|
|
|
|
|
- **Step 2:** Start the extraction of text from the pdf documents
|
|
|
|
|
|
-`python3 ./scripts/extraction.py`
|
|
|
+```
|
|
|
+python3 ./scripts/extraction.py
|
|
|
+```
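+
+`extraction.py` is what actually walks the downloaded PDFs. Purely as an illustration of the idea (the script itself may use a different library or a GCP OCR service), extracting raw text from a single local PDF could look like this with `pypdf`:
+
+```
+from pypdf import PdfReader
+
+# Hypothetical example: read one case report PDF and join the text of all pages
+reader = PdfReader("sample_case.pdf")
+raw_text = "\n".join(page.extract_text() or "" for page in reader.pages)
+print(raw_text[:500])
+```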
|
|
|
|
|
|
-## Pre-processing data
|
|
|
+#### Pre-processing data
|
|
|
Following the extraction of text, it's time to translate it from Italian to English and curate it.
|
|
|
|
|
|
-`python3 ./scripts/preprocessing.py`
|
|
|
+```
|
|
|
+python3 ./scripts/preprocessing.py
|
|
|
+```
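+
+One straightforward way to do the Italian-to-English step is the Cloud Translation API. A minimal sketch (assuming the `google-cloud-translate` client library and application credentials are set up; the actual `preprocessing.py` also curates the text and may differ in detail):
+
+```
+from google.cloud import translate_v2 as translate
+
+# Hypothetical example: translate one extracted sentence from Italian to English
+client = translate.Client()
+result = client.translate("Paziente con febbre e tosse persistente.",
+                          source_language="it", target_language="en")
+print(result["translatedText"])  # e.g. "Patient with fever and persistent cough."
+```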
|
|
|
|
|
|
-## Storing data
|
|
|
+#### Storing data
|
|
|
Following the pre-processing, it's time to store the data in a more searchable format: a data warehouse -
|
|
|
[BigQuery](https://cloud.google.com/bigquery) - for the text, and a No-SQL database -
|
|
|
[Datastore](https://cloud.google.com/datastore) - for the (UMLS) medical entities.
|
|
|
|
|
|
-`python3 ./scripts/storing.py True True [Model_of_your_choice]`
|
|
|
+```
|
|
|
+python3 ./scripts/storing.py True True [Model_of_your_choice]
|
|
|
+```
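+
+The Datastore half of this step boils down to turning each scispacy entity into a Datastore entity. A rough sketch under assumed names (the `MedicalEntity` kind and its properties are illustrative; the real `storing.py` also writes the full text to BigQuery and takes the model name as an argument):
+
+```
+import spacy
+from google.cloud import datastore
+
+# Extract biomedical entities from one (already translated) note
+nlp = spacy.load("en_core_sci_sm")
+doc = nlp("Patient admitted with acute respiratory distress and dry cough.")
+
+# Write each entity under a hypothetical "MedicalEntity" kind
+client = datastore.Client()
+for ent in doc.ents:
+    entity = datastore.Entity(key=client.key("MedicalEntity"))
+    entity.update({"text": ent.text, "label": ent.label_})
+    client.put(entity)
+```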
|
|
|
|
|
|
-## Test
|
|
|
+#### Test
|
|
|
Last but not least, this script will run a few test cases and display the results. Feel free to modify the test cases.
|
|
|
|
|
|
-`python3 ./scripts/retrieving.py`
|
|
|
-
|
|
|
----
|
|
|
-
|
|
|
-## Contributing
|
|
|
-> To get started...
|
|
|
-
|
|
|
-### Step 1
|
|
|
-- **Option 1**
|
|
|
- - 🍴 Fork this repo!
|
|
|
-
|
|
|
-- **Option 2**
|
|
|
- - 👯 Clone this repo to your local machine using https://github.com/azizketari/covid19_ISMIR.git
|
|
|
-
|
|
|
-### Step 2
|
|
|
-- **HACK AWAY!** 🔨🔨🔨
|
|
|
-
|
|
|
-### Step 3
|
|
|
-- 🔃 Create a new pull request
|
|
|
-
|
|
|
----
|
|
|
+```
|
|
|
+python3 ./scripts/retrieving.py
|
|
|
+```
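+
+The retrieval script queries what the previous step stored. A minimal sketch of such a test query against Datastore (kind and property names are the same hypothetical ones used above; adjust them to whatever was actually stored):
+
+```
+from google.cloud import datastore
+
+# Fetch a handful of stored entities whose text matches a search term
+client = datastore.Client()
+query = client.query(kind="MedicalEntity")
+query.add_filter("text", "=", "fever")
+for entity in query.fetch(limit=5):
+    print(dict(entity))
+```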
|
|
|
|
|
|
-## Citing
|
|
|
+#### Citing
|
|
|
|
|
|
- [ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing by Mark Neumann and Daniel King and
|
|
|
-Iz Beltagy and Waleed Ammar (2019)](https://www.semanticscholar.org/paper/ScispaCy%3A-Fast-and-Robust-Models-for-Biomedical-Neumann-King/de28ec1d7bd38c8fc4e8ac59b6133800818b4e29)
|
|
|
-
|
|
|
----
|
|
|
+Iz Beltagy and Waleed Ammar (2019)](https://www.semanticscholar.org/paper/ScispaCy%3A-Fast-and-Robust-Models-for-Biomedical-Neumann-King/de28ec1d7bd38c8fc4e8ac59b6133800818b4e29)
|
|
|
|
|
|
-## License
|
|
|
+#### License
|
|
|
[![License](http://img.shields.io/:license-mit-blue.svg?style=flat-square)](http://badges.mit-license.org)
|
|
|
|
|
|
- [MIT License](https://opensource.org/licenses/mit-license.php)
|