Aziz Ketari 4 years ago
parent
commit
649c1e5ac9
4 changed files with 33 additions and 33 deletions
  1. README.md (+10 -11)
  2. env_variables.sh (+7 -7)
  3. requirements.txt (+1 -0)
  4. scripts/utils/preprocessing_fcn.py (+15 -15)

+ 10 - 11
README.md

@@ -25,9 +25,15 @@ You can replicate this pipeline directly on your local machine or on the cloud s
 **Requirements:**
 - Clone this repo to your local machine using https://github.com/azizketari/covid19_ISMIR.git
 - You need a Google Cloud project and IAM rights to create service accounts.
+- Create and download the JSON key associated with your service account. Useful [link](https://cloud.google.com/iam/docs/creating-managing-service-account-keys#iam-service-account-keys-create-python)
+- Modify the value of each variable in the env_variables.sh file, then run
+```
+source env_variables.sh
+```
+
 - Set the project that you will be working on:
 
-`gcloud config set project PROJECT_ID`
+`gcloud config set project $PROJECT_ID`
 
 - Enable APIs:
 ```
@@ -44,23 +50,16 @@ cd ~/covid19_ISMIR
 pip3 install --user -r requirements.txt
 ```
 
-
 Note:
 
-You will also need to download a NER model for the second part of this pipeline. See Scispacy full selection of 
+You will also need to download a Named Entity Recognition model for the second part of this pipeline. See Scispacy's full selection of 
 available models [here](https://allenai.github.io/scispacy/). If you follow this installation guide, the steps 
 will automatically download a model for you and install it.
 
 
 ## Extracting data
 
-- **Step 1:** Modify the values to each variables in env_variables.sh file then run
-> Assumption: You have already created/downloaded the json key to your Google Cloud Service Account. Useful [link](https://cloud.google.com/iam/docs/creating-managing-service-account-keys#iam-service-account-keys-create-python)
-```
-source env_variables.sh
-```
-
-- **Step 2:** Download the required files to your bucket and load the required model in your local  
+- **Step 1:** Download the required files to your bucket and load the required model on your local machine  
 (this step will take ~10 min)
 > Optional: If you have already downloaded the scispacy model, modify the file ./content/download_content.sh so that step is not repeated
 ```
@@ -68,7 +67,7 @@ source ./content/download_content.sh
 pip install -U ./scispacy_models/en_core_sci_lg-0.2.4.tar.gz
 ```
 
-- **Step 3:** Start the extraction of text from the pdf documents  
+- **Step 2:** Start the extraction of text from the pdf documents  
 
 `python3 ./scripts/extraction.py`
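The optional note in Step 1 can be sketched as a small guard, reusing the tarball path from the pip install line above; the check itself is illustrative and not part of the repo:

```shell
# Illustrative guard: only re-fetch the scispacy model when the tarball
# installed in Step 1 is not already on disk.
MODEL_TARBALL="./scispacy_models/en_core_sci_lg-0.2.4.tar.gz"
if [ -f "$MODEL_TARBALL" ]; then
    STATUS="cached"          # safe to skip the download in download_content.sh
else
    STATUS="needs-download"  # run: source ./content/download_content.sh
fi
echo "$STATUS"
```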
 

+ 7 - 7
env_variables.sh

@@ -1,7 +1,7 @@
-SA_KEY_PATH="path/to/service_account.json",
-PROJECT_ID="unique_project_id",
-BUCKET_NAME="bucket_contains_data",
-LOCATION="compute_region",
-BQ_DATASET_NAME="covid19",
-BQ_TABLE_NAME="ISMIR",
-TEST_CASE="case14" # lowercase any case from 1 to 49 (e.g case1, case32 ...)
+export SA_KEY_PATH="path/to/service_account.json"
+export PROJECT_ID="unique_project_id"
+export BUCKET_NAME="bucket_contains_data"
+export LOCATION="compute_region"
+export BQ_DATASET_NAME="covid19"
+export BQ_TABLE_NAME="ISMIR"
+export TEST_CASE="case14" # lowercase; any case from 1 to 49 (e.g. case1, case32 ...)
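As a quick sanity check after `source env_variables.sh`, the exported variables should be visible in the current shell. A minimal sketch, where the inline export stands in for the sourced file and `${VAR:?}` is the POSIX expansion that aborts with a message when the variable is unset:

```shell
# Stand-in for `source env_variables.sh` in this sketch:
export PROJECT_ID="unique_project_id"

# ${VAR:?message} exits with an error if VAR is unset or empty,
# so a misconfigured environment fails fast instead of later in the scripts.
echo "Project: ${PROJECT_ID:?PROJECT_ID is not set}"
```

Note that the file must be sourced, not executed: running it as `./env_variables.sh` would set the variables in a subshell only, and they would not reach the Python scripts started afterwards.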

+ 1 - 0
requirements.txt

@@ -4,6 +4,7 @@ google-cloud-bigquery==1.24.0
 google-cloud-datastore==1.11.0
 google-cloud-translate==2.0.1
 google-cloud-vision==1.0.0
+google-oauth2-tool==0.0.3
 googleapis-common-protos==1.51.0
 pandas==1.0.3
 scispacy==0.2.4

+ 15 - 15
scripts/utils/preprocessing_fcn.py

@@ -1,27 +1,27 @@
 from google.cloud import storage, translate, vision
-from google.oauth2 import service_account
+#from google.oauth2 import service_account
 import logging
 import os
 
 from google.protobuf import json_format
 
 # DEVELOPER: change path to key
-project_id = os.getenv('PROJECT_ID')
-bucket_name = os.getenv('BUCKET_NAME')
-location = os.getenv('LOCATION')
-key_path = os.getenv('SA_KEY_PATH')
+# project_id = os.getenv('PROJECT_ID')
+# bucket_name = os.getenv('BUCKET_NAME')
+# location = os.getenv('LOCATION')
+# key_path = os.getenv('SA_KEY_PATH')
 
 # DEVELOPER: change path to key
-credentials = service_account.Credentials.from_service_account_file(key_path)
-
-storage_client = storage.Client(credentials=credentials,
-                                project_id=credentials.project_id)
-
-translate_client = translate.Client(credentials=credentials,
-                                    project_id=credentials.project_id)
-
-vision_client = vision.Client(credentials=credentials,
-                              project_id=credentials.project_id)
+# credentials = service_account.Credentials.from_service_account_file(key_path)
+#
+# storage_client = storage.Client(credentials=credentials,
+#                                 project_id=credentials.project_id)
+#
+# translate_client = translate.Client(credentials=credentials,
+#                                     project_id=credentials.project_id)
+#
+# vision_client = vision.Client(credentials=credentials,
+#                               project_id=credentials.project_id)
 
 
 def async_detect_document(vision_client, gcs_source_uri, gcs_destination_uri, batch_size=20):
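The hard-coded client setup above is replaced by environment-driven configuration. A minimal sketch of that pattern, assuming only the variable names from the diff; the `load_config` helper and the placeholder values are illustrative, not part of the repo:

```python
import os

# Variable names as exported by env_variables.sh in this commit.
REQUIRED_VARS = ("PROJECT_ID", "BUCKET_NAME", "LOCATION", "SA_KEY_PATH")

def load_config():
    """Read the pipeline configuration from the environment
    (populated by `source env_variables.sh`) and fail loudly
    if anything is missing."""
    cfg = {name: os.getenv(name) for name in REQUIRED_VARS}
    missing = [name for name, value in cfg.items() if not value]
    if missing:
        raise EnvironmentError(f"Missing environment variables: {missing}")
    return cfg

# Illustrative placeholders mirroring env_variables.sh; setdefault leaves
# any values already exported in the real shell untouched.
for name, placeholder in zip(REQUIRED_VARS,
                             ("unique_project_id", "bucket_contains_data",
                              "compute_region", "path/to/service_account.json")):
    os.environ.setdefault(name, placeholder)

print(load_config()["PROJECT_ID"])
```

If the commented-out clients are later re-enabled, the Google client libraries can also pick up credentials from the `GOOGLE_APPLICATION_CREDENTIALS` environment variable instead of an explicit key path passed in code.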