# har-to-llm

A command-line tool and library for converting HAR (HTTP Archive) files to LLM-friendly formats.
## Installation

```bash
npm install -g har-to-llm
```

Or use with npx:

```bash
npx har-to-llm ./file.har
```

## Usage
### Basic Usage

```bash
# Convert HAR file to markdown format (default)
har-to-llm ./file.har
# Convert to JSON format
har-to-llm ./file.har --format json
# Save output to file
har-to-llm ./file.har --output output.md
```

### Filtering Options

```bash
# Filter by HTTP methods
har-to-llm ./file.har --methods GET,POST
# Filter by status codes
har-to-llm ./file.har --status 200,201,404
# Filter by domains
har-to-llm ./file.har --domains api.example.com,api.github.com
# Exclude domains
har-to-llm ./file.har --exclude-domains google-analytics.com,facebook.com
# Filter by request duration (in milliseconds)
har-to-llm ./file.har --min-duration 100 --max-duration 5000
```

### Deduplication Options

```bash
# Default behavior: automatically remove semantically similar requests
har-to-llm ./file.har
# Disable deduplication to keep all requests
har-to-llm ./file.har --no-deduplicate
# Verbose output shows deduplication statistics
har-to-llm ./file.har --verbose
```

Note: By default, the tool uses semantic deduplication optimized for LLM training.
Semantic deduplication removes requests that follow the same pattern:
- Same HTTP method
- Same URL pattern (ignoring specific IDs: `/users/1`, `/users/2` → `/users/{id}`)
- Same query parameter structure (names match, values can differ)
- Same request body structure (JSON keys match, values can differ)
- Same header structure (header names match, values can differ)
Examples of semantic deduplication:
- `GET /users/1`, `GET /users/2`, `GET /users/3` → keeps only `GET /users/{id}`
- `POST /users` with different user data → keeps only one example
- `PUT /users/1` with different update data → keeps only one example
This ensures LLM training data contains unique API patterns without redundancy.
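
For illustration, the URL-pattern step of this deduplication can be sketched as follows. This is not har-to-llm's internal code; `normalizeUrlPattern` and its ID heuristics are assumptions made for the example:

```typescript
// Illustrative sketch only -- not har-to-llm's actual implementation.
// Collapses ID-like path segments so /users/1 and /users/2 share one pattern.
function normalizeUrlPattern(rawUrl: string): string {
  const url = new URL(rawUrl);
  const path = url.pathname
    .split('/')
    .map((segment) =>
      // Treat purely numeric segments and UUID-shaped segments as IDs.
      /^\d+$/.test(segment) || /^[0-9a-f-]{36}$/i.test(segment)
        ? '{id}'
        : segment
    )
    .join('/');
  // Only query parameter names contribute to the pattern; values are ignored.
  const paramNames = [...url.searchParams.keys()].sort().join(',');
  return `${url.host}${path}?${paramNames}`;
}

// All three normalize to "api.example.com/users/{id}?", so a deduplicator
// keyed on this pattern would keep only one representative entry:
normalizeUrlPattern('https://api.example.com/users/1');
normalizeUrlPattern('https://api.example.com/users/2');
normalizeUrlPattern('https://api.example.com/users/3');
```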
### Header Filtering

The tool automatically filters out headers that are not useful for API implementation.
Excluded headers include:
- Browser-specific: `User-Agent`, `Accept`, `Accept-Language`, `Accept-Encoding`, `Cache-Control`, `Origin`, `Referer`
- Network: `Connection`, `Keep-Alive`, `Transfer-Encoding`, `Content-Length`, `Date`, `Server`
- Security: `X-Frame-Options`, `X-Content-Type-Options`, `X-XSS-Protection`, `Strict-Transport-Security`
- Caching: `ETag`, `Last-Modified`, `If-Modified-Since`, `If-None-Match`
- Analytics: `X-Forwarded-For`, `X-Real-IP`, `X-Requested-With`
- CDN/Proxy: `CF-Ray`, `X-Cache`, `X-Amz-Cf-Id`
Kept headers include:
- Authentication: `Authorization`, `X-API-Key`
- Content: `Content-Type`
- Custom API headers: `X-Custom-Header`, `X-Rate-Limit-*`, `X-Request-ID`
- Response headers: `Location`, `Set-Cookie`
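
As a rough illustration of this rule, a deny-list predicate along these lines would reproduce the behavior described above (a sketch, not the package's code):

```typescript
// Illustrative sketch only -- approximates the header filtering described above.
const EXCLUDED_HEADERS = new Set([
  'user-agent', 'accept', 'accept-language', 'accept-encoding', 'cache-control',
  'origin', 'referer', 'connection', 'keep-alive', 'transfer-encoding',
  'content-length', 'date', 'server', 'x-frame-options',
  'x-content-type-options', 'x-xss-protection', 'strict-transport-security',
  'etag', 'last-modified', 'if-modified-since', 'if-none-match',
  'x-forwarded-for', 'x-real-ip', 'x-requested-with',
  'cf-ray', 'x-cache', 'x-amz-cf-id',
]);

// Names are compared case-insensitively; anything not on the deny list
// (Authorization, Content-Type, custom X-* API headers) is kept.
function keepHeader(name: string): boolean {
  return !EXCLUDED_HEADERS.has(name.toLowerCase());
}

keepHeader('Authorization'); // true  -- needed to reproduce the API call
keepHeader('User-Agent');    // false -- browser noise
```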
### Output Formats

- `markdown` (default): Human-readable markdown format
- `json`: Structured JSON data
- `text`: Simple text summary
- `curl`: cURL commands for replaying requests
- `conversation`: Conversation format for LLM training
- `structured`: Detailed structured data with summary
### Examples

```bash
# Get only successful API calls in JSON format
har-to-llm ./file.har --format json --status 200,201,204 --methods GET,POST,PUT,DELETE
# Generate cURL commands for debugging
har-to-llm ./file.har --format curl --output commands.sh
# Create conversation log for LLM training
har-to-llm ./file.har --format conversation --output training-data.md
# Show summary only
har-to-llm ./file.har --summary
# Verbose output with filtering
har-to-llm ./file.har --verbose --domains api.example.com --min-duration 500
# Keep all requests including semantically similar ones
har-to-llm ./file.har --no-deduplicate --verbose
```

### Programmatic Usage

```typescript
import { HARConverter, Formatters } from 'har-to-llm';
import * as fs from 'fs';
// Read HAR file
const harContent = fs.readFileSync('./file.har', 'utf8');
const harData = JSON.parse(harContent);
// Convert entries (with automatic semantic deduplication and header filtering)
const conversations = harData.log.entries.map(entry =>
  HARConverter.convertEntry(entry)
);
// Filter entries with semantic deduplication (default for LLM training)
const filteredEntries = HARConverter.filterEntries(harData.log.entries, {
  methods: ['GET', 'POST'],
  statusCodes: [200, 201],
  domains: ['api.example.com'],
  deduplicate: true // semantic deduplication
});
// Filter entries without deduplication
const allEntries = HARConverter.filterEntries(harData.log.entries, {
  methods: ['GET', 'POST'],
  deduplicate: false
});
// Manual semantic deduplication for LLM training
const semanticallyUnique = HARConverter.deduplicateEntries(harData.log.entries);
// Manual exact deduplication
const exactlyUnique = HARConverter.removeExactDuplicates(harData.log.entries);
// Generate different formats
const markdown = Formatters.toMarkdown(conversations);
const json = Formatters.toJSON(conversations);
const curl = Formatters.toCurlCommands(conversations);
// Get summary
const summary = HARConverter.generateSummary(harData.log.entries);
```
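
Putting those calls together, a minimal end-to-end sketch using only the API shown above (file paths are placeholders):

```typescript
import { HARConverter, Formatters } from 'har-to-llm';
import * as fs from 'fs';

// Read a HAR capture, keep deduplicated api.example.com traffic,
// convert each entry, and write a markdown report.
const har = JSON.parse(fs.readFileSync('./session.har', 'utf8'));
const entries = HARConverter.filterEntries(har.log.entries, {
  domains: ['api.example.com'],
  deduplicate: true
});
const conversations = entries.map(entry => HARConverter.convertEntry(entry));
fs.writeFileSync('./api-patterns.md', Formatters.toMarkdown(conversations));
```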
## Output Formats

### Markdown Format

````markdown
# HTTP Conversations
## Request 1
**Timestamp:** 2023-12-01T10:30:00.000Z
**Duration:** 150ms
### Request
**Method:** GET
**URL:** https://api.example.com/users/1
**Headers:**
- authorization: Bearer token123
- content-type: application/json
### Response
**Status:** 200 OK
**Headers:**
- content-type: application/json
**Body:**
```json
{
"id": 1,
"name": "John Doe"
}
```
````

### JSON Format

```json
[
  {
    "request": {
      "method": "GET",
      "url": "https://api.example.com/users/1",
      "headers": {
        "authorization": "Bearer token123",
        "content-type": "application/json"
      },
      "queryParams": {},
      "body": null,
      "contentType": null
    },
    "response": {
      "status": 200,
      "statusText": "OK",
      "headers": {
        "content-type": "application/json"
      },
      "body": "{\"id\":1,\"name\":\"John Doe\"}",
      "contentType": "application/json"
    },
    "timestamp": "2023-12-01T10:30:00.000Z",
    "duration": 150
  }
]
```

### cURL Format

```bash
# GET https://api.example.com/users/1
curl -X GET -H "authorization: Bearer token123" -H "content-type: application/json" "https://api.example.com/users/1"
```
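
Building such a command from a converted entry is mechanical; here is a minimal sketch (not the package's formatter, and ignoring shell-escaping edge cases; the `ConvertedRequest` shape mirrors the JSON format above):

```typescript
// Illustrative sketch: turn a converted request into a cURL command.
interface ConvertedRequest {
  method: string;
  url: string;
  headers: Record<string, string>;
  body: string | null;
}

function toCurl(req: ConvertedRequest): string {
  const headerFlags = Object.entries(req.headers)
    .map(([name, value]) => `-H "${name}: ${value}"`)
    .join(' ');
  const bodyFlag = req.body ? ` -d '${req.body}'` : '';
  return `curl -X ${req.method} ${headerFlags}${bodyFlag} "${req.url}"`;
}
```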
## Features

- ✅ Convert HAR files to multiple LLM-friendly formats
- ✅ Filter requests by method, status code, domain, and duration
- ✅ Semantic deduplication optimized for LLM training
- ✅ Automatic filtering of useless headers
- ✅ Generate cURL commands for request replay
- ✅ Create conversation logs for LLM training
- ✅ Provide detailed summaries and statistics
- ✅ Support for both CLI and programmatic usage
- ✅ TypeScript support with full type definitions
## Requirements
- Node.js 16.0.0 or higher
## License
MIT
## Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request
## Changelog

### 1.0.0
- Initial release
- Support for multiple output formats
- Filtering capabilities
- CLI and programmatic APIs
- Semantic deduplication for LLM training
- Automatic header filtering
