

REACH Batch Article URL Analyser

A Node.js server app to analyse the brand safety of newspaper articles (text and images).

Local Development instructions:

The application uses the dotenv framework to inject environment variables into the application; these are passed via .env files in the root folder. The project currently contains the following env files (a loading sketch follows the table):

| file                        | description                                            |
| :-------------------------: | :----------------------------------------------------: |
| .env                        | local development                                      |
| .env.demo                   | demo environment (currently not used)                  |
| .env.dev.Ad.Safety.analyser | ad safety analyser environment (currently not used)    |
| .env.dev.appnexus           | app nexus dev environment (currently not used)         |
| .env.prod.appnexus          | app nexus production environment (currently not used)  |
| .env.dev.BERTHA             | bertha environment                                     |
| .env.dev.stable             | stable environment                                     |
| .env.prod.reach             | production environment                                 |
| .env.sc.                    | (currently not used)                                   |
| .env.telegraph              | (currently not used)                                   |
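
How the file is selected is not preserved in this README; below is a minimal sketch, assuming the env file is chosen via NODE_ENV as in the launch instructions further down:

    // Minimal sketch (not the app's actual bootstrap code): pick the .env file
    // based on NODE_ENV, falling back to the plain .env for local development.
    const dotenv = require('dotenv');

    const suffix = process.env.NODE_ENV; // e.g. 'prod.appnexus' or 'dev.BERTHA'
    dotenv.config({ path: suffix ? `.env.${suffix}` : '.env' });

    console.log('NLU url:', process.env.natural_language_understanding_url);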

Environment service configuration

Within each environment the following services must be configured:

  • NLU (Natural Language Understanding)
  • VR (Visual Recognition)
  • WDS (Watson Discovery)
  • Cloud storage bucket
  • DB (PostgreSQL)

App Launch instructions:

Run node app.js or npm start.

To launch with a specific environment, set NODE_ENV, e.g. prod: NODE_ENV=prod.appnexus node app.js, or bertha: NODE_ENV=dev.BERTHA node app.js.

This starts a server at http://localhost:6003 (see the server log for the port number).

How To create a new environment file:

To create a new environment file, obtain the following credentials from the respective Watson services:

  • NLU:
    natural_language_understanding_apikey = << key from credentials in ibm console >> 
    natural_language_understanding_url = https://gateway-lon.watsonplatform.net/natural-language-understanding/api
    natural_language_understanding_version = 2019-02-01
Change the url if the NLU instance is in a region other than London.
  • VR:
    visual_recognition_apikey = << key from credentials in ibm console >>  
    visual_recognition_url = https://gateway.watsonplatform.net/visual-recognition/api  
    visual_recognition_version = 2019-02-01  
Change the url if the VR instance is in a region other than London.
  • WDS:
    discovery_url = https://gateway-lon.watsonplatform.net/discovery/api  
    discovery_apikey = << key from credentials in ibm console >>  
    discovery_version = 2019-02-01  
    discovery_collectionid =  
    discovery_environmentid =
  • Cloud storage bucket:
    Create a set of credentials for IBM Cloud Storage, with HMAC credentials, and convert the values to base64 strings (see the sketch after this list):
    cloud_storage_enpoint_url = https://s3.eu-gb.cloud-object-storage.appdomain.cloud  
    cloud_storage_apikey = << key from credentials in ibm console >>
    cloud_storage_resource_instance_id = << key from credentials in ibm console >>  
    cloud_storage_access_key_id = << key from credentials in ibm console >>  
    cloud_storage_secret_access_key = << key from credentials in ibm console >>  
    cloud_storage_bucket = << input storage bucket >>  
    cloud_storage_reports = << output storage bucket >>
  • DB: get the db credentials from the IBM console; connections are configured as follows:
    postgreSQL_connectionString = postgres://user:password@0af45143-13f5-40ee-a847-2aea727b42fd.bmo1leol0d54tib7un7g.databases.appdomain.cloud:port/db?sslmode=verify-full
    postgreSQL_certificate_base64 = << pem ssl certificate string >>
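
The conversion command itself is missing above; a minimal Node.js sketch for the base64 step, assuming plain UTF-8 credential values:

    // Hypothetical helper for the base64 conversion step above:
    // encode a credential value (e.g. the PEM certificate) as base64.
    const toBase64 = value => Buffer.from(value, 'utf8').toString('base64');

    console.log(toBase64('<< pem ssl certificate string >>'));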

Other environment variables

| variable              | description                                                                                                                                                                          |
| :-------------------: | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
| write_to_db           | enables writing nlu findings into the db to be cached; the default value is present in the env file                                                                                  |
| read_from_db_cache    | enables reading cached nlu findings from the db; the default value is present in the env file                                                                                        |
| write_to_log          | creates a CSV log file with a line for each analyzed article, containing the rating of the article                                                                                   |
| write_rules_to_log    | also logs the rating of each rule; applies only if write_to_log = true                                                                                                               |
| write_to_cache        | stores result JSONs as files on the server, used as a cache for future requests                                                                                                      |
| analyze_images        | enables/disables analyzing images identified in HTML articles                                                                                                                        |
| recalculation_rate    | recalculation rate for new rulesets                                                                                                                                                  |
| sleep_interval        | time interval in seconds that the processor waits between lookups for new input files; default values are in the env files                                                          |
| selected_process_mode | batch file processing mode, configured in default.json; this defines the input/output format and any filtering that might be needed. It defaults to default, with more modes available |
| max_small_file_size   | file size threshold below which the whole file is processed in one go instead of as a stream; if not present it defaults to 20kb                                                    |
| articles_parallel     | number of articles to process in parallel; if not present, defaults to 30                                                                                                            |
| NODE_ENV              | used to change the psql file to use the test db for e2e tests                                                                                                                        |
| LOCAL_DEV             | used to set the db to localhost, useful for local development                                                                                                                        |
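
dotenv exposes every value as a string, so boolean and numeric flags need explicit parsing; a hypothetical sketch (helper names invented, not the app's actual code) of reading the flags above:

    // Hypothetical helpers for parsing env flags; dotenv delivers strings only.
    const boolFlag = (name, fallback) =>
      process.env[name] === undefined ? fallback : process.env[name] === 'true';
    const intFlag = (name, fallback) =>
      process.env[name] === undefined ? fallback : parseInt(process.env[name], 10);

    const writeToDb = boolFlag('write_to_db', false);
    const articlesParallel = intFlag('articles_parallel', 30); // default from the table above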

Processing modes:

Processing Mode config:

All existing processing modes currently live in config/default.json under processMode.

The processing currently supports the following flags:

| flag                           | description                                                                                                                                                                             |
| :----------------------------: | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
| name                           | name of the process mode                                                                                                                                                                |
| inputFormat                    | expected file format for the input                                                                                                                                                      |
| outputFormat                   | expected file format for the output file                                                                                                                                                |
| saveArticles                   | saves articles as a list, with the format selected as above                                                                                                                             |
| saveReport                     | saves a report for the processed file; this includes the processed file, how many articles failed, the total number processed, and the status. Reports are currently always saved in json |
| outputArticleErrors            | if the output format is json, the error for each failing article can be output along with the input used, allowing better debugging                                                    |
| removeUnmatchedFilteredArticle | if true, and the matchers in articleFilterOutput return undefined (but do not fail), the articles are removed from the output; the default is false                                    |
| inputTransformation            | uses the Jexl framework to apply transforms to the input                                                                                                                                |
| articleFilterOutput            | uses the Jexl framework; see below for the format                                                                                                                                       |
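
For illustration, a hypothetical processMode entry combining the flags above (shown as a JS object; every value is invented, not copied from default.json):

    // Hypothetical shape of one processMode entry in config/default.json;
    // all values here are invented for illustration.
    const exampleProcessMode = {
      name: 'default',
      inputFormat: 'csv',
      outputFormat: 'json',
      saveArticles: true,
      saveReport: true,                                // the report itself is always json
      outputArticleErrors: true,
      removeUnmatchedFilteredArticle: false,
      inputTransformation: 'url|urlExtraneousRemoval', // Jexl expression
      articleFilterOutput: []                          // see the next section
    };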

Article Filter output format:

This part of the config allows extracting parts of the article response and filtering it; it also allows applying transforms to the data in those fields. It works by defining an array of key elements that should be present in the output: each entry carries a key/value pair to output a static value, or a matcher to select only parts of the original object.

| key        | description                                                                                                                                                                                              |
| :--------: | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
| key        | destination json key for the object (if the format is csv, this value is omitted)                                                                                                                        |
| value      | used to output a static value associated with the key                                                                                                                                                    |
| matcher    | Jexl matcher used to filter objects                                                                                                                                                                      |
| transforms | Jexl supports adding transforms to a matcher; a static list of transformer functions lives in the jexlTransforms.js class and is loaded into Jexl. To apply them in order, match the name of each function in the array |
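
A hypothetical set of articleFilterOutput entries combining these keys (field values invented for illustration):

    // Hypothetical articleFilterOutput entries; all values are invented.
    const articleFilterOutput = [
      { key: 'url', matcher: 'article.url', transforms: ['urlExtraneousRemoval'] },
      { key: 'provider', value: 'reach' } // static value, no matcher needed
    ];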

Please see default.json for transform and matcher examples, and the Jexl page for the expression syntax.

If we need to add more transformer functions to the config, we add a new function to jexlTransforms:

    const urlExtraneousRemoval = () => {
      return {
        name: 'urlExtraneousRemoval',
        method: urlString => URLStringBuilder.buildRemovingExtraneous(urlString)
      };
    };

Here the name returned is the name of the transform function, and the method is the function added to Jexl's list of transforms. Then add the new transformer as an export:

    module.exports = {
      urlExtraneousRemoval
    };

This function is then available in the transforms arrays for both inputTransformation and articleFilterOutput.
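
The loading step is not shown in this README; a minimal sketch of registering the exported transforms with Jexl (the project's actual loader may differ):

    // Minimal sketch of loading the exported transforms into Jexl.
    const jexl = require('jexl');
    const transforms = require('./jexlTransforms');

    Object.values(transforms).forEach(factory => {
      const { name, method } = factory();
      jexl.addTransform(name, method); // Jexl's standard transform registration
    });

    // The transform can then be referenced inside a Jexl expression, e.g.:
    // jexl.evalSync('url|urlExtraneousRemoval', { url: 'https://example.com?x=1' })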

Code structure:

The processor

The batch processor service is an express.js server app with no routing; its single goal is to poll IBM Cloud Storage for new files to process. It works by starting the services in ./app.js and calling processor.init(). The service fetches objects from IBM Cloud Storage, checking their name and etag to detect changes (in case the same file is uploaded multiple times); see the sketch after the list below.

  • server/processor/processor.js: responsible for waiting for and polling new files from IBM Cloud Storage to be analysed.
  • server/processor/processorOrchestrator.js: responsible for reading a file with multiple urls from IBM Cloud Storage, preparing a batch, and saving it using either a stream or an object put.
  • server/processor/controllers/processorOrchestrator.js: responsible for reading a file with multiple urls from IBM Cloud Storage and saving it, using either a stream or an object put.
  • server/processor/controllers/report/: contains controllers to create report or article streams.
  • server/processor/controllers/jsonFilter/: applies transforms and/or filtering to article objects as per the processing mode configuration.
  • server/processor/controllers/objectOutputBuilder.js: used to convert objects to their desired output format.
  • server/processor/controllers/storageCache.js: used to build a local cache of processed cloud storage objects, as a means to complement the db article_process table.

Once a batch ends up in the articleQueue file, each article is processed individually through the brand-safety-tools orchestrator file.
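
A minimal sketch of the name-plus-etag change detection described above (function and variable names are hypothetical; the real processor.js will differ):

    // Hypothetical sketch of name + etag change detection; `listObjects`
    // stands in for the actual cloud storage client call.
    const seen = new Map(); // object name -> last processed etag

    async function pollOnce(listObjects, processFile) {
      for (const { name, etag } of await listObjects()) {
        if (seen.get(name) === etag) continue; // unchanged since the last run
        await processFile(name);
        seen.set(name, etag); // a re-upload changes the etag, so it reprocesses
      }
    }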

Unit tests

Unit tests are written using the Jest framework and can be run with npm test in the terminal. An HTML coverage report is available in link.

The folder structure is as follows:

| path         | description                                                                                                                        |
| :----------: | :--------------------------------------------------------------------------------------------------------------------------------: |
| test         | root test folder                                                                                                                   |
| test/data    | sample files to run the batch processor locally, or to be used in the unit tests; these are purely for development/demo purposes  |
| test/e2e     | end to end tests; paths under this folder mirror the path of the tested file in the server folder                                 |
| test/helpers | helper files used to set up the tests                                                                                              |
| test/mocks   | reusable jest mock files                                                                                                           |
| test/unit    | unit tests                                                                                                                         |

Tests structure:

A unit test should attempt to test one condition of the class/module under test. Test names should follow:

    test('<method name>() <condition to test>, <expected return value>')

An example:

    test('filter() with config, articleFilterOutput and removeUnmatchedFilteredArticle set to false, throws an error', () => {})

describe() blocks should aggregate tests in the following order of preference:

  • aggregate tests within a test file.
  • aggregate a complex/finicky logical scenario.
  • aggregate tests around a method.

A test file should never test, nor contain describe() blocks that test, more than one file.
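
A hypothetical test file following these conventions (the module path and the behaviour being asserted are invented for illustration):

    // Hypothetical example of the structure above; the module under test and
    // its behaviour are assumed, not taken from this project.
    const { filter } = require('../../server/processor/controllers/jsonFilter/jsonFilter');

    describe('jsonFilter', () => {
      describe('filter()', () => {
        test('filter() with an empty articleFilterOutput, returns the article unchanged', () => {
          const article = { url: 'https://example.com' };
          expect(filter(article, { articleFilterOutput: [] })).toEqual(article);
        });
      });
    });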