apify-schema-tools

v3.1.0

Published

3 months ago

Apify schema managing tools.


gullmar

apify schema tools cli typescript json-schema synchronization validation

Apify Schema Tools

This tool is intended for developers of Apify Actors.

It generates JSON schemas and TypeScript types for the Actor input and dataset from a single source of truth, with a few extra features.

As a quick example, assume you have a project that looks like this:

my-project
├── .actor
│   ├── actor.json
│   ├── dataset_schema.json
│   └── input_schema.json
└── src-schemas
    ├── dataset-item.json <-- source file for dataset
    └── input.json        <-- source file for input

After running this script, you will have:

my-project
├── .actor
│   ├── actor.json
│   ├── dataset_schema.json <-- updated with the definitions from src-schemas
│   └── input_schema.json   <-- updated with the definitions from src-schemas
├── src
│   └── generated
│       ├── dataset.ts     <-- TypeScript types generated from src-schemas
│       ├── input-utils.ts <-- utilities to fill input default values
│       └── input.ts       <-- TypeScript types generated from src-schemas
└── src-schemas
    ├── dataset-item.json
    └── input.json

Quickstart

These instructions will quickly get you to the point where you can use apify-schema-tools to generate your schemas and TypeScript types.

Let's assume you are starting from a new project created from an Apify template.

Install apify-schema-tools:

npm i -D apify-schema-tools

Initialize your project with default settings:

npx apify-schema-tools init

This command will:

Create a src-schemas folder with input.json and dataset-item.json files.
Create the necessary .actor files if they don't exist.
Add configuration to your package.json.
Add a generate script to your package.json.

Generate JSON schemas and TypeScript types from the source schemas:

npx apify-schema-tools sync

Now, you will be able to use TypeScript types and utilities in your project:

import { Actor } from 'apify';

import type { DatasetItem } from './generated/dataset.ts';
import type { Input } from './generated/input.ts';
import { getInputWithDefaultValues, type InputWithDefaults } from './generated/input-utils.ts';

await Actor.init();

const input: InputWithDefaults = getInputWithDefaultValues(await Actor.getInput<Input>());

'...'

await Actor.pushData<DatasetItem>({
    title: '...',
    url: '...',
    text: '...',
    timestamp: '...',
});

await Actor.exit();

Configuration

You can configure apify-schema-tools in two ways:

Using package.json configuration

The init command automatically adds configuration to your package.json. You can also manually add an apify-schema-tools section to customize the behavior:

{
  "name": "my-actor",
  "version": "1.0.0",
  "apify-schema-tools": {
    "input": ["input", "dataset"],
    "output": ["json-schemas", "ts-types"],
    "srcInput": "src-schemas/input.json",
    "srcDataset": "src-schemas/dataset-item.json",
    "outputTSDir": "src/generated",
    "includeInputUtils": true
  }
}
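As a rough illustration of how this section relates to the command-line defaults, here is a minimal sketch that overlays the package.json section on the default values listed in the sync help output. The names SchemaToolsConfig, DEFAULT_CONFIG, and resolveConfig are hypothetical and not part of the tool's API, and the assumption that missing keys fall back to those defaults is ours.

```typescript
// Hypothetical type mirroring the "apify-schema-tools" section of package.json.
interface SchemaToolsConfig {
    input: Array<'input' | 'dataset'>;
    output: Array<'json-schemas' | 'ts-types'>;
    srcInput: string;
    srcDataset: string;
    outputTSDir: string;
    includeInputUtils: boolean;
}

// Default values as documented in the `sync --help` output.
const DEFAULT_CONFIG: SchemaToolsConfig = {
    input: ['input', 'dataset'],
    output: ['json-schemas', 'ts-types'],
    srcInput: 'src-schemas/input.json',
    srcDataset: 'src-schemas/dataset-item.json',
    outputTSDir: 'src/generated',
    includeInputUtils: true,
};

// Assumption: keys missing from package.json fall back to the defaults.
function resolveConfig(packageJson: Record<string, unknown>): SchemaToolsConfig {
    const section = (packageJson['apify-schema-tools'] ?? {}) as Partial<SchemaToolsConfig>;
    return { ...DEFAULT_CONFIG, ...section };
}
```

For example, a package.json that only sets "outputTSDir" would, under this assumption, keep every other default.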

Using command-line arguments

You can also pass options directly to the sync command. You can check which options are available:

$ npx apify-schema-tools --help
usage: apify-schema-tools [-h] {init,sync,check} ...

Apify Schema Tools - Generate JSON schemas and TypeScript files for Actor input and output dataset.

positional arguments:
  {init,sync,check}
    init             Initialize the Apify Schema Tools project with default settings.
    sync             Generate JSON schemas and TypeScript files from the source schemas.
    check            Check the schemas for consistency and correctness.

optional arguments:
  -h, --help         show this help message and exit

$ npx apify-schema-tools sync --help
usage: apify-schema-tools sync [-h] [-i [{input,dataset} ...]] [-o [{json-schemas,ts-types} ...]] [--src-input SRC_INPUT] [--src-dataset SRC_DATASET] [--add-input ADD_INPUT] [--add-dataset ADD_DATASET] [--input-schema INPUT_SCHEMA] [--dataset-schema DATASET_SCHEMA] [--output-ts-dir OUTPUT_TS_DIR]
                               [--deep-merge] [--include-input-utils {true,false}]

optional arguments:
  -h, --help            show this help message and exit
  -i [{input,dataset} ...], --input [{input,dataset} ...]
                        specify which sources to use for generation (default: input,dataset)
  -o [{json-schemas,ts-types} ...], --output [{json-schemas,ts-types} ...]
                        specify what to generate (default: json-schemas,ts-types)
  --src-input SRC_INPUT
                        path to the input schema source file (default: src-schemas/input.json)
  --src-dataset SRC_DATASET
                        path to the dataset schema source file (default: src-schemas/dataset-item.json)
  --add-input ADD_INPUT
                        path to an additional schema to merge into the input schema (default: undefined)
  --add-dataset ADD_DATASET
                        path to an additional schema to merge into the dataset schema (default: undefined)
  --input-schema INPUT_SCHEMA
                        the path of the destination input schema file (default: .actor/input_schema.json)
  --dataset-schema DATASET_SCHEMA
                        the path of the destination dataset schema file (default: .actor/dataset_schema.json)
  --output-ts-dir OUTPUT_TS_DIR
                        path where to save generated TypeScript files (default: src/generated)
  --deep-merge          whether to deep merge additional schemas into the main schema (default: false)
  --include-input-utils {true,false}
                        include input utilities in the generated TypeScript files: 'input' input and 'ts-types' output are required (default: true)

Setting up your project manually

If you prefer to set up your project manually instead of using the init command, you can follow these steps:

Create a src-schemas folder:

mkdir src-schemas

Create the files input.json and dataset-item.json inside the src-schemas folder. Here is some example content for each:

{
  "title": "Input schema for Web Scraper",
  "type": "object",
  "schemaVersion": 1,
  "properties": {
    "startUrls": {
      "type": "array",
      "title": "Start URLs",
      "description": "List of URLs to scrape",
      "default": [],
      "editor": "requestListSources",
      "items": {
        "type": "object",
        "properties": {
          "url": { "type": "string" }
        }
      }
    }
  },
  "required": ["startUrls"],
  "additionalProperties": false
}

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Dataset schema for Web Scraper",
  "type": "object",
  "properties": {
    "title": {
      "type": "string",
      "title": "Title",
      "description": "Page title"
    },
    "url": {
      "type": "string",
      "title": "URL",
      "description": "Page URL"
    },
    "text": {
      "type": "string",
      "title": "Text content",
      "description": "Extracted text"
    },
    "timestamp": {
      "type": "string",
      "title": "Timestamp",
      "description": "When the data was scraped"
    }
  },
  "required": ["title", "url"]
}

Create the file .actor/dataset_schema.json with the following minimal content:

{
    "actorSpecification": 1,
    "fields": {},
    "views": {}
}

Link the dataset schema in .actor/actor.json:

{
    "actorSpecification": 1,
    "...": "...",
    "input": "./input_schema.json",
    "storages": {
        "dataset": "./dataset_schema.json"
    },
    "...": "..."
}

Generate JSON schemas and TypeScript types from the source schemas:

npx apify-schema-tools sync

Resolving conflicts

The sync command includes interactive conflict resolution to help you handle schema inconsistencies. When the tool detects conflicts between your source schemas and existing target schemas, it will prompt you to choose which version to keep.

When conflicts are detected

Conflicts occur when there are differences between your source schema files and the schemas that would be generated in the target locations. Common scenarios include:

The source and the target schema have a different title or description.
The same property has a different title or description in the source and target schemas.
Properties that exist in the target schema are missing from the source schema.
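The three scenarios above can be pictured with a small comparison routine. This is only a sketch of the idea, not the tool's actual implementation: SchemaNode, Conflict, and findConflicts are hypothetical names, and the sketch only looks at title, description, and properties.

```typescript
// Hypothetical, simplified view of a schema node.
interface SchemaNode {
    title?: string;
    description?: string;
    properties?: Record<string, SchemaNode>;
    [key: string]: unknown;
}

type Conflict =
    | { kind: 'differs'; path: string[]; source: string; target: string }
    | { kind: 'removed'; path: string[] };

function findConflicts(source: SchemaNode, target: SchemaNode, path: string[] = []): Conflict[] {
    const conflicts: Conflict[] = [];
    // Scenarios 1 and 2: title or description differ at this level.
    for (const field of ['title', 'description'] as const) {
        const s = source[field];
        const t = target[field];
        if (s !== undefined && t !== undefined && s !== t) {
            conflicts.push({ kind: 'differs', path: [...path, field], source: s, target: t });
        }
    }
    const sourceProps = source.properties ?? {};
    const targetProps = target.properties ?? {};
    for (const name of Object.keys(targetProps)) {
        const sourceProp = sourceProps[name];
        if (sourceProp === undefined) {
            // Scenario 3: the property exists in the target but not in the source.
            conflicts.push({ kind: 'removed', path: [...path, 'properties', name] });
        } else {
            conflicts.push(...findConflicts(sourceProp, targetProps[name], [...path, 'properties', name]));
        }
    }
    return conflicts;
}
```

Joining a conflict's path with " > " gives a label in the same style as the prompts shown in the interactive mode.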

Interactive mode (default behavior)

By default, when conflicts are detected, the tool will prompt you interactively to resolve each conflict:

⚠️  Field [properties > startUrls > description] in the source schema differs from 
the target schema. Choose which to keep: (Use arrow keys)
❯ [source] List of URLs to scrape
  [target] List of URLs to parse

⚠️  Property "searchTerm" was removed from the source schema. What do you want to do? (Use arrow keys)
❯ Confirm deletion
  Restore field

Non-interactive modes

For automated scripts or CI/CD pipelines, you can use these options:

Force mode (`--force`)

Automatically resolves all conflicts by preferring the source schema:

npx apify-schema-tools sync --force

This will:

Always use values from the source schema when there are conflicts
Remove properties that exist in target but not in source
Overwrite target schemas without prompting

Fail on conflict (`--fail-on-conflict`)

Stops execution and exits with an error code when conflicts are detected:

npx apify-schema-tools sync --fail-on-conflict
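To compare the three behaviors side by side, here is a minimal sketch of a resolution policy. It is an illustration under our own assumptions, not the tool's code: Mode, Choice, and resolveConflict are hypothetical names, and ask stands in for the interactive prompt.

```typescript
type Mode = 'interactive' | 'force' | 'fail-on-conflict';
type Choice = 'source' | 'target';

// `ask` is a hypothetical callback that stands in for the interactive prompt.
function resolveConflict(mode: Mode, fieldPath: string, ask: (fieldPath: string) => Choice): Choice {
    switch (mode) {
        case 'force':
            // Always prefer the source schema, without prompting.
            return 'source';
        case 'fail-on-conflict':
            // Stop immediately; a CLI would turn this into a non-zero exit code.
            throw new Error(`Conflict detected at ${fieldPath}`);
        case 'interactive':
            // Default behavior: let the user decide.
            return ask(fieldPath);
    }
}
```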

Checking if the schemas are in sync with the source schemas

The check command allows you to verify that your generated schemas and TypeScript files are up-to-date with your source schemas. This is particularly useful in CI/CD pipelines to ensure that developers haven't forgotten to run the generation after making changes to the source schemas.

npx apify-schema-tools check

The check command will:

Compare the current generated files with what would be generated from the source schemas
Exit with code 0 if everything is in sync
Exit with code 1 if there are differences, showing you which files are out of sync

You can add this to your CI pipeline to automatically detect when schemas need to be regenerated:

{
  "scripts": {
    "generate": "apify-schema-tools sync",
    "check-schemas": "apify-schema-tools check",
    "test": "npm run check-schemas && npm run test:unit"
  }
}

The check command accepts the same configuration options as the sync command, either through package.json configuration or command-line arguments, ensuring it checks the same files that would be generated.
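Conceptually, the check boils down to comparing what would be generated with what is on disk. The sketch below illustrates that idea with in-memory file contents; CheckResult and checkInSync are hypothetical names, and the real command may compare files differently.

```typescript
interface CheckResult {
    exitCode: 0 | 1;
    outOfSync: string[];
}

// `expected` maps each output path to the content that would be generated;
// `existing` maps each path to the content currently on disk (undefined if missing).
function checkInSync(
    expected: Record<string, string>,
    existing: Record<string, string | undefined>,
): CheckResult {
    const outOfSync = Object.keys(expected).filter((path) => existing[path] !== expected[path]);
    return { exitCode: outOfSync.length === 0 ? 0 : 1, outOfSync };
}
```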

Ignoring descriptions while checking (`--ignore-descriptions`)

The check command can ignore the title and description fields of the source and target schemas and of their properties. This lets you edit descriptions and change how your Actor appears on the Apify platform without running this tool to synchronize the schemas, while still checking for semantic correctness:

npx apify-schema-tools check --ignore-descriptions

The next time someone runs the sync command, they will be prompted to resolve the conflicts in the descriptions.
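One way to picture this option is as a pre-processing step that drops every title and description before the comparison. The following is a sketch under that assumption; stripDescriptions and sameIgnoringDescriptions are hypothetical helpers, not functions exported by the tool. Note that keys inside a "properties" map are property names, so a dataset field that happens to be called "title" must be kept.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Recursively removes "title" and "description" from schema nodes.
// `isPropertyMap` is true when `node` is the value of a "properties" key,
// in which case its keys are property names and must not be removed.
function stripDescriptions(node: Json, isPropertyMap = false): Json {
    if (Array.isArray(node)) return node.map((item) => stripDescriptions(item));
    if (node === null || typeof node !== 'object') return node;
    const result: { [key: string]: Json } = {};
    for (const key of Object.keys(node)) {
        if (!isPropertyMap && (key === 'title' || key === 'description')) continue;
        result[key] = stripDescriptions(node[key], !isPropertyMap && key === 'properties');
    }
    return result;
}

// Simplification: JSON.stringify is sensitive to key order.
function sameIgnoringDescriptions(a: Json, b: Json): boolean {
    return JSON.stringify(stripDescriptions(a)) === JSON.stringify(stripDescriptions(b));
}
```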

Extra features

Keep only allowed properties in Input schema

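The tool removes from the generated input schema the properties that are not allowed in a given context. As an example, when type is "array", the property items is forbidden if editor is different from "select".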
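Here is a minimal sketch of that single rule, to make the behavior concrete. InputProperty and dropForbiddenItems are hypothetical names, and the sketch covers only the rule stated above, not the full set of rules the tool may apply.

```typescript
// Hypothetical, simplified shape of an input schema property.
interface InputProperty {
    type: string;
    editor?: string;
    items?: unknown;
    [key: string]: unknown;
}

// Returns a copy of the property without "items" when the rule forbids it.
function dropForbiddenItems(property: InputProperty): InputProperty {
    if (property.type === 'array' && property.editor !== 'select') {
        const copy: InputProperty = { ...property };
        delete copy.items;
        return copy;
    }
    return property;
}
```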

Merge a second schema into the main one

This feature is useful when working in monorepos. It allows you to define a single common schema across all the Actors in the repo, and to add or override the title, the description, and some properties when necessary.

To use it, use the parameters --add-input and --add-dataset, e.g.:

npx apify-schema-tools sync \
  --input input,dataset \
  --output json-schemas,ts-types \
  --src-input ../src-schemas/input.json \
  --src-dataset ../src-schemas/dataset-item.json \
  --add-input src-schemas/input.json \
  --add-dataset src-schemas/dataset-item.json

You can also define the order of the properties in the merged schema. To do so, add a position field to the properties. The script will follow these rules:

Properties without a position, or with the same position, keep the order in which they appear in the source schemas, with the ones from the additional schema after the ones from the base schema.
If both properties with and without position exist, the ones without position will appear at the end.
The position will be overwritten if a property is overwritten.

An example:

# Source input schema
{
  "title": "My input schema",
  "description": "My input properties",
  "type": "object",
  "properties": {
    "a": { "type": "string", "position": 3 },
    "b": { "type": "string" }, // will be last, because it has no position
    "c": { "type": "string", "position": 1 }
  },
  "required": ["a"],
  "additionalProperties": false
}

# Additional input schema
{
  "description": "My input properties, a bit changed", // will override the description
  "type": "object",
  "properties": {
    "c": { "type": "boolean", "position": 5 }, // will override also the position
    "d": { "type": "string", "position": 1 } // will be first
  },
  "required": ["c", "d"], // will be merged with the source required properties
  "additionalProperties": false
}

# Final input schema
{
  "title": "My input schema",
  "description": "My input properties, a bit changed",
  "type": "object",
  "properties": {
    "d": { "type": "string" },
    "a": { "type": "string" },
    "c": { "type": "boolean" },
    "b": { "type": "string" }
  },
  "required": ["a", "c", "d"],
  "additionalProperties": false
}
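The merge and ordering rules can be sketched as follows. This is an illustration written to reproduce the example above, not the tool's implementation: Property, Schema, and mergeSchemas are hypothetical names, and the merge is shallow (a property from the additional schema replaces the base definition entirely).

```typescript
interface Property {
    position?: number;
    [key: string]: unknown;
}

interface Schema {
    title?: string;
    description?: string;
    properties?: Record<string, Property>;
    required?: string[];
    [key: string]: unknown;
}

function mergeSchemas(base: Schema, additional: Schema): Schema {
    // Base properties come first; additional ones override or are appended.
    const merged: Record<string, Property> = { ...base.properties, ...additional.properties };
    const rank = (name: string): number => merged[name].position ?? Number.POSITIVE_INFINITY;
    const ordered = Object.keys(merged)
        .map((name, index) => ({ name, index }))
        .sort((x, y) => {
            const rx = rank(x.name);
            const ry = rank(y.name);
            // Compare explicitly: subtracting two Infinity values would give NaN.
            if (rx !== ry) return rx < ry ? -1 : 1;
            // Same or missing position: keep the original order.
            return x.index - y.index;
        });
    const properties: Record<string, Property> = {};
    for (const { name } of ordered) {
        const copy: Property = { ...merged[name] };
        delete copy.position; // "position" does not appear in the final schema
        properties[name] = copy;
    }
    // Union of the required lists, without duplicates.
    const required = [...(base.required ?? []), ...(additional.required ?? [])]
        .filter((name, index, all) => all.indexOf(name) === index);
    return { ...base, ...additional, properties, required };
}
```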

Use the option --deep-merge to merge object properties and array items, instead of overwriting every definition.
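To illustrate the difference from the default behavior, here is a minimal sketch of a recursive merge. It assumes that "deep merge" means nested objects (such as the properties of an object or the items of an array) are merged key by key, while all other values, including arrays, are replaced by the value from the additional schema; the tool's actual semantics may differ. JsonValue, JsonObject, and deepMerge are hypothetical names.

```typescript
type JsonValue = null | boolean | number | string | JsonValue[] | { [key: string]: JsonValue };
type JsonObject = { [key: string]: JsonValue };

function isObject(value: JsonValue | undefined): value is JsonObject {
    return typeof value === 'object' && value !== null && !Array.isArray(value);
}

// Assumption: objects are merged recursively, everything else is overwritten.
function deepMerge(base: JsonObject, additional: JsonObject): JsonObject {
    const result: JsonObject = { ...base };
    for (const key of Object.keys(additional)) {
        const baseValue = result[key];
        const additionalValue = additional[key];
        result[key] = isObject(baseValue) && isObject(additionalValue)
            ? deepMerge(baseValue, additionalValue)
            : additionalValue;
    }
    return result;
}
```

With a shallow merge, an "address" property defined in both schemas would be replaced entirely; with this sketch, its nested properties from both schemas are kept.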
