strapi-plugin-semantic-search

v1.1.0

Published

8 months ago

Intelligent semantic search plugin for Strapi 5 powered by OpenAI embeddings. Automatically generates embeddings for your content and provides powerful semantic search capabilities.

0High
0Medium
0Low

samuelumoren365

strapi plugin semantic-search search embedding openai ai vector-search similarity content-discovery strapi5

Semantic Search Plugin for Strapi 5

Intelligent content discovery powered by OpenAI embeddings

Transform your Strapi CMS into an AI-powered content platform with automatic embedding generation and semantic search capabilities.

🆕 What's New in v1.1.0

Configurable Field Mapping - Customize which fields to process per content type
Enhanced Validation - Comprehensive configuration validation with helpful error messages
Improved Logging - Better visibility into plugin configuration and processing
Flexible Content Types - Support for any content type with custom field mappings

Features

Automatic Embedding Generation - Content embeddings created automatically on save
Semantic Search APIs - RESTful endpoints for intelligent content discovery
Multi-Content Type Search - Search across different content types simultaneously
High Performance - Sub-300ms search responses with vector similarity
Production Ready - Comprehensive error handling and rate limiting
Analytics - Monitor embedding coverage and search performance
Zero Dependencies - No external vector databases required

Installation

1. Install Dependencies

cd src/plugins/semantic-search
npm install openai@^5.8.2 axios@^1.10.0

2. Configure Environment

Add your OpenAI API key to .env:

OPENAI_API_KEY=your_openai_api_key_here

3. Enable Plugin

Update config/plugins.ts:

export default ({ env }) => ({
  'semantic-search': {
    enabled: true,
    resolve: './src/plugins/semantic-search'
  },
});

4. Add Embedding Fields

Add embedding fields to your content types:

{
  "embedding": {
    "type": "json"
  },
  "embeddingMetadata": {
    "type": "json"
  }
}

5. Restart Strapi

npm run develop

Configuration

Basic Configuration

By default, the plugin processes these content types and fields:

// Default configuration
{
  'api::article.article': ['title', 'content', 'summary'],
  'api::blog.blog': ['title', 'body', 'excerpt']
}

Custom Configuration

Configure which content types and fields to process by updating config/plugins.js:

// config/plugins.js
module.exports = ({ env }) => ({
  'semantic-search': {
    enabled: true,
    resolve: './src/plugins/semantic-search',
    config: {
      contentTypes: {
        'api::article.article': ['title', 'content', 'summary', 'tags'],
        'api::blog.blog': ['title', 'body', 'excerpt', 'category'],
        'api::product.product': ['name', 'description', 'features', 'benefits'],
        'api::course.course': ['title', 'overview', 'learningOutcomes']
      }
    }
  }
});

Configuration Options

| Option | Type | Description | |--------|------|-------------| | contentTypes | Object | Maps content type UIDs to arrays of field names |

Content Type Format: Use Strapi's UID format: api::collection-name.collection-name

Field Names:

Must be valid field names from your content type schema
Can include any text-based fields (text, textarea, richtext, etc.)
Rich text fields are automatically converted to plain text for embedding

Configuration Validation

The plugin validates your configuration on startup:

✅ Valid content type format: api::article.article
✅ Valid field arrays: ['title', 'content']
✅ Non-empty field names: No empty strings or invalid types
❌ Invalid formats: Missing api:: prefix, empty fields, etc.

Invalid configurations will log warnings and fallback to defaults.

Usage

Automatic Embedding Generation

Embeddings are generated automatically when you create or update content. The plugin extracts text from the fields you've configured for each content type (see Configuration section above).

Search API

Single Content Type Search

POST /api/semantic-search/search

Request:

{
  "query": "artificial intelligence and machine learning",
  "contentType": "api::article.article", 
  "limit": 10,
  "threshold": 0.1
}

Response:

{
  "success": true,
  "data": {
    "query": "artificial intelligence and machine learning",
    "contentType": "api::article.article",
    "results": [
      {
        "id": 1,
        "title": "Deep Learning Fundamentals",
        "similarityScore": 0.8945,
        "content": "...",
        "createdAt": "2025-01-15T10:30:00.000Z"
      }
    ],
    "metadata": {
      "totalResults": 5,
      "queryProcessing": {
        "embeddingDimensions": 1536
      }
    }
  }
}

Multi-Content Type Search

POST /api/semantic-search/multi-search

Request:

{
  "query": "productivity and remote work",
  "contentTypes": ["api::article.article", "api::blog.blog"],
  "limit": 15,
  "aggregateResults": true
}

Embedding Statistics

GET /api/semantic-search/stats

Response:

{
  "success": true,
  "data": {
    "api::article.article": {
      "total": 50,
      "withEmbeddings": 50,
      "coverage": "100.00%"
    },
    "api::blog.blog": {
      "total": 25, 
      "withEmbeddings": 23,
      "coverage": "92.00%"
    }
  }
}

Configuration

Supported Content Types

The plugin automatically processes these content types:

Any content type that starts with api::
Excludes admin and plugin content types
Configurable in the lifecycle registration

Text Field Mapping

Default fields processed for embedding generation:

const textFields = [
  'title', 'name', 'content', 'body', 
  'summary', 'description', 'excerpt'
];

Search Parameters

| Parameter | Type | Default | Description | |-----------|------|---------|-------------| | query | string | required | Search query text | | contentType | string | required | Strapi content type UID | | limit | number | 10 | Maximum results (max: 50) | | threshold | number | 0.1 | Minimum similarity score | | filters | object | {} | Additional database filters |

Architecture

Plugin Structure

semantic-search/
├── package.json           # Plugin metadata and dependencies
├── strapi-server.js       # Plugin entry point
└── server/
    ├── index.js           # Server exports
    └── src/
        ├── index.js       # Main plugin logic
        ├── controllers/   # API request handlers
        │   ├── index.js
        │   └── search-controller.js
        ├── services/      # Business logic
        │   ├── index.js
        │   ├── embedding-service.js    # OpenAI integration
        │   ├── vector-service.js       # Similarity calculations
        │   └── search-service.js       # Search orchestration
        └── routes/        # API endpoint definitions
            └── index.js

Data Flow

Content Creation/Update → Lifecycle hook triggered
Text Extraction → Combine relevant text fields
OpenAI API Call → Generate 1536-dimension embedding
Database Storage → Save embedding in JSON field
Search Request → Convert query to embedding
Similarity Search → Calculate cosine similarity
Result Ranking → Sort by similarity score

Vector Storage

Embeddings are stored as JSON in your existing database:

{
  "embedding": [0.1234, -0.5678, 0.9012, ...], // 1536 dimensions
  "embeddingMetadata": {
    "model": "text-embedding-ada-002",
    "generatedAt": "2025-01-15T10:30:00.000Z",
    "dimensions": 1536,
    "processedText": "Machine learning fundamentals...",
    "originalLength": 1250,
    "processedLength": 1180
  }
}

Similarity Scores

Understanding similarity score ranges:

| Score Range | Relevance | Description | |-------------|-----------|-------------| | 0.85 - 1.0 | Highly Relevant | Direct topic match | | 0.75 - 0.85 | Relevant | Related concepts | | 0.65 - 0.75 | Somewhat Relevant | Tangential connection | | 0.1 - 0.65 | Low Relevance | Weak semantic relation |

Performance

Benchmarks

Embedding Generation: 1-3 seconds per document
Search Latency: ~250ms end-to-end
Vector Comparison: ~50ms for 1000 documents
Memory Usage: ~1.5KB per embedding
Storage Overhead: ~6KB per document (embedding + metadata)

Optimization Tips

Batch Processing: Process multiple embeddings in parallel
Caching: Cache frequently searched embeddings
Indexing: Add database indexes on frequently filtered fields
Text Preprocessing: Optimize text extraction for your content types

Development

Local Development

# Install plugin dependencies
cd src/plugins/semantic-search
npm install

# Return to project root
cd ../../../

# Start development server
npm run develop

Testing

# Test embedding generation
curl -X POST http://localhost:1337/api/semantic-search/search \
-H "Content-Type: application/json" \
-d '{"query": "test query", "contentType": "api::article.article"}'

# Check statistics
curl http://localhost:1337/api/semantic-search/stats

Debugging

Enable debug logging in your Strapi configuration:

// config/logger.js
module.exports = {
  level: 'debug',
  transports: [
    {
      type: 'console',
      options: {
        pool: true,
        format: 'combined',
        level: 'debug'
      }
    }
  ]
};

Production Deployment

Environment Variables

# Required
OPENAI_API_KEY=your_production_openai_key

# Optional
NODE_ENV=production

Security Considerations

API Authentication: Enable authentication for search endpoints
Rate Limiting: Configure appropriate rate limits
Input Validation: Validate search queries and parameters
API Key Security: Secure OpenAI API key storage

Monitoring

Monitor these key metrics:

Embedding generation success rate
Search response times
OpenAI API usage and costs
Similarity score distributions
Search query patterns

Extending the Plugin

Custom Content Type Support

Add support for additional content types:

// In registerEmbeddingLifecycles function
const contentTypes = [
  'api::article.article',
  'api::blog.blog', 
  'api::product.product',  // Add your content type
  'api::course.course'     // Add another content type
];

Custom Text Extraction

Modify text field extraction:

// In extractTextContent function
const textFields = [
  'title', 'content', 'summary',
  'customField',      // Add custom field
  'description'       // Add more fields as needed
];

Advanced Filtering

Add custom filtering logic:

// In search service
const searchWithCustomFilters = async (query, contentType, customFilters) => {
  const filters = {
    ...customFilters,
    publishedAt: { $notNull: true },  // Only published content
    featured: true                     // Only featured content
  };
  
  return await semanticSearch(query, contentType, { filters });
};

Troubleshooting

Common Issues

1. Plugin not loading:

Check config/plugins.ts configuration
Verify plugin path resolution
Check for syntax errors in plugin files

2. Embeddings not generating:

Verify OpenAI API key is valid
Check network connectivity to OpenAI
Review content type field configuration

3. Search returning no results:

Verify embedding field exists on content type
Check similarity threshold (try lowering to 0.1)
Ensure content has embeddings generated

4. Performance issues:

Monitor OpenAI API rate limits
Check database query performance
Consider caching strategies

Debug Commands

# Check plugin loading
tail -f logs/strapi.log | grep "semantic-search"

# Test OpenAI connectivity
curl -X POST https://api.openai.com/v1/embeddings \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-H "Content-Type: application/json" \
-d '{"input": "test", "model": "text-embedding-ada-002"}'

# Validate embedding storage
curl http://localhost:1337/api/semantic-search/stats

License

MIT License

Contributing

This plugin is part of the semantic search demo project. Contributions and improvements are welcome.

Built with OpenAI Embeddings and Strapi 5

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme

Semantic Search Plugin for Strapi 5

🆕 What's New in v1.1.0

Features

Installation

1. Install Dependencies

2. Configure Environment

3. Enable Plugin

4. Add Embedding Fields

5. Restart Strapi

Configuration

Basic Configuration

Custom Configuration

Configuration Options

Configuration Validation

Usage

Automatic Embedding Generation

Search API

Single Content Type Search

Multi-Content Type Search

Embedding Statistics

Configuration

Supported Content Types

Text Field Mapping

Search Parameters

Architecture

Plugin Structure

Data Flow

Vector Storage

Similarity Scores

Performance

Benchmarks

Optimization Tips

Development

Local Development

Testing

Debugging

Production Deployment

Environment Variables

Security Considerations

Monitoring

Extending the Plugin

Custom Content Type Support

Custom Text Extraction

Advanced Filtering

Related Documentation

Troubleshooting

Common Issues

Debug Commands

License

Contributing