
update doc

Urs Kehrli 4 years ago
commit 16d33060e9
1 changed file with 29 additions and 47 deletions
README.md

@@ -1,5 +1,7 @@
-# COVID-19 public dataset on GCP from cases in Italy
-> Medical notes and entities from TRUE patient cases publicly available on BigQuery and Datastore!
+cyborg-ai fork of Aziz Ketari's proposal to extract and classify real COVID-19 patient data.  
+https://www.linkedin.com/in/aziz-ketari
+### COVID-19 public dataset on GCP from cases in Italy
+Medical notes and entities from true patient cases publicly available on BigQuery and Datastore!
 
 
 This repository contains all the code required to extract relevant information from pdf documents published by ISMIR 
 and store raw data in a relational database and entities in a No-SQL database.
@@ -13,13 +15,7 @@ store them in a NoSQL db (namely Datastore) on Google Cloud Platform.
 
 
 Google Cloud Architecture of the pipeline:
 ![Batch mode (Streaming mode coming soon ...)](./content/images/covid19_repo_architecture_3_24_2020.png)
-
-Quick sneak peak on the Entity dataset on Datastore:
-![](./content/images/datastore_snapshot.gif)
-
----
-
-## Installation
+#### Installation
 You can replicate this pipeline directly on your local machine or on the cloud shell on GCP.
 
 **Requirements:**
@@ -35,7 +31,7 @@ source ./content/env_variables.sh
 
 
 - Set the project that you will be working on:
 
-`gcloud config set project $PROJECT_ID`
+```
+gcloud config set project $PROJECT_ID
+```
 
 
 - Enable APIs:
 ```
@@ -46,9 +42,9 @@ gcloud services enable bigquery.googleapis.com
 ```
 
 - Install package requirements:
-> Make sure you have a python version >=3.6.0. Otherwise you will face some version errors [Useful link](https://stackoverflow.com/questions/47273260/google-cloud-compute-engine-change-to-python-3-6)
+Make sure you have Python >= 3.6.0; otherwise you will run into version errors such as the one below ([useful link](https://stackoverflow.com/questions/47273260/google-cloud-compute-engine-change-to-python-3-6)).
 
-`ERROR: Package 'scispacy' requires a different Python: 3.5.3 not in '>=3.6.0'`
+```
+ERROR: Package 'scispacy' requires a different Python: 3.5.3 not in '>=3.6.0'
+```
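+
+As a quick sanity check before installing (a convenience snippet, not part of this repository), you can verify the interpreter version from Python itself:
+
+```
+import sys
+
+# scispacy needs Python >= 3.6; fail fast with a readable message on older interpreters.
+assert sys.version_info >= (3, 6), "Python >= 3.6 required, found " + sys.version
+```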
 
 
 ```
 pip3 install --user -r requirements.txt
@@ -61,64 +57,50 @@ available models [here](https://allenai.github.io/scispacy/). If you follow this
 will automatically download a model for you and install it.
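+
+Once a model is installed, you can check that it loads and extracts entities with a minimal sketch like the one below (this assumes the en_core_sci_sm model name; swap in whichever scispacy model you installed):
+
+```
+import spacy
+
+# Load an installed scispaCy model and run it over a short clinical sentence.
+nlp = spacy.load("en_core_sci_sm")
+doc = nlp("The patient presented with fever, dry cough and bilateral interstitial pneumonia.")
+
+# scispaCy models expose the detected biomedical entities on doc.ents.
+print([(ent.text, ent.label_) for ent in doc.ents])
+```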
 
 
 
 
-## Extracting data
+#### Extracting data
 
 
 - **Step 1:** Download the required files to your bucket and load the required model into your local environment  
 (this step will take ~10 min)
-> Optional: If you have already downloaded the scispacy models, you should modify the file ./content/download_content.sh to not repeat that step
+Optional: If you have already downloaded the scispacy models, modify ./content/download_content.sh so that this step is not repeated.
 ```
 source ./content/download_content.sh
-```
+```
 
 
 - **Step 2:** Start the extraction of text from the pdf documents  
 
-`python3 ./scripts/extraction.py`
+```
+python3 ./scripts/extraction.py
+```
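+
+extraction.py does the real work against the ISMIR bulletins. Purely as an illustration of the idea (this is not necessarily the library the script uses; pdfminer.six and the file name are assumptions), a standalone PDF-to-text sketch could look like this:
+
+```
+from pdfminer.high_level import extract_text
+
+# Pull the raw text layer out of a single PDF (hypothetical local file name).
+text = extract_text("ismir_bulletin_sample.pdf")
+print(text[:500])  # preview the first 500 characters
+```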
 
 
-## Pre-processing data
+#### Pre-processing data
 Following the extraction of text, it's time to translate it from Italian to English and curate it.
 
-`python3 ./scripts/preprocessing.py`
+```
+python3 ./scripts/preprocessing.py
+```
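+
+For a sense of what the translation step involves, here is a minimal sketch using the Cloud Translation API client (google-cloud-translate); preprocessing.py may do this differently, and the sample sentence is illustrative, so treat it as an illustration only:
+
+```
+from google.cloud import translate_v2 as translate
+
+# Translate a short Italian sample sentence to English (sample text, not from the dataset).
+client = translate.Client()
+result = client.translate("Paziente con febbre e tosse secca.",
+                          source_language="it", target_language="en")
+print(result["translatedText"])
+```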
 
 
-## Storing data
+#### Storing data
 Following the pre-processing, it's time to store the data in a more searchable format: a data warehouse - 
 [BigQuery](https://cloud.google.com/bigquery) - for the text, and a No-SQL database - 
 [Datastore](https://cloud.google.com/datastore) - for the (UMLS) medical entities. 
 
-`python3 ./scripts/storing.py True True [Model_of_your_choice]`
+```
+python3 ./scripts/storing.py True True [Model_of_your_choice]
+```
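+
+storing.py is the pipeline's actual loader. As a rough sketch of the two sinks it targets (BigQuery for text, Datastore for entities), with hypothetical project, dataset, table, and kind names:
+
+```
+from google.cloud import bigquery, datastore
+
+# Append one translated note to a BigQuery table (hypothetical project.dataset.table).
+bq_client = bigquery.Client()
+errors = bq_client.insert_rows_json(
+    "your-project.covid19.notes",
+    [{"case_id": "case_001", "text": "Patient with fever and dry cough."}],
+)
+assert not errors, errors
+
+# Store one extracted medical entity in Datastore (hypothetical kind and fields).
+ds_client = datastore.Client()
+entity = datastore.Entity(key=ds_client.key("MedicalEntity"))
+entity.update({"case_id": "case_001", "term": "fever"})
+ds_client.put(entity)
+```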
 
 
-## Test
+#### Test
 Last but not least, this script will run a few test cases and display the results. Feel free to modify the test cases.
 
-`python3 ./scripts/retrieving.py`
-
----
-
-## Contributing
-> To get started...
-
-### Step 1
-- **Option 1**
-    - 🍴 Fork this repo!    
-
-- **Option 2**
-    - 👯 Clone this repo to your local machine using https://github.com/azizketari/covid19_ISMIR.git
-    
-### Step 2
-- **HACK AWAY!** 🔨🔨🔨
-
-### Step 3
-- 🔃 Create a new pull request
-
----
+```
+python3 ./scripts/retrieving.py
+```
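+
+retrieving.py runs the repository's own test queries. A hedged sketch of one possible lookup against Datastore (assuming the hypothetical MedicalEntity kind from the storing sketch above) would be:
+
+```
+from google.cloud import datastore
+
+# Fetch a few stored entities whose extracted term is "fever".
+client = datastore.Client()
+query = client.query(kind="MedicalEntity")
+query.add_filter("term", "=", "fever")
+for entity in query.fetch(limit=5):
+    print(dict(entity))
+```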
 
 
-## Citing
+#### Citing
 
 
 - [ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing by Mark Neumann and Daniel King and 
-Iz Beltagy and Waleed Ammar (2019)](https://www.semanticscholar.org/paper/ScispaCy%3A-Fast-and-Robust-Models-for-Biomedical-Neumann-King/de28ec1d7bd38c8fc4e8ac59b6133800818b4e29)
-  
----
+Iz Beltagy and Waleed Ammar (2019)](https://www.semanticscholar.org/paper/ScispaCy%3A-Fast-and-Robust-Models-for-Biomedical-Neumann-King/de28ec1d7bd38c8fc4e8ac59b6133800818b4e29)
   
   
-## License
+#### License
 [![License](http://img.shields.io/:license-mit-blue.svg?style=flat-square)](http://badges.mit-license.org)
 
 - [MIT License](https://opensource.org/licenses/mit-license.php)