@builder.io/gpt-crawler

v1.4.0

Crawl a site to generate knowledge files to create your own custom GPT

GPT Crawler

Crawl a site to generate knowledge files to create your own custom GPT from one or multiple URLs

Gif showing the crawl run

Example

Here is a custom GPT that I quickly made to help answer questions about how to use and integrate Builder.io by simply providing the URL to the Builder docs.

This project crawled the docs and generated the file that I uploaded as the basis for the custom GPT.

Try it out yourself by asking questions about how to integrate Builder.io into a site.

Note that you may need a paid ChatGPT plan to access this feature.

Get started

Running locally

Clone the repository

Be sure you have Node.js >= 16 installed.

git clone https://github.com/builderio/gpt-crawler

Install dependencies

npm i

Configure the crawler

Open config.ts and edit the url and selector properties to match your needs.

For example, to crawl the Builder.io docs to make our custom GPT, you can use:

export const defaultConfig: Config = {
  url: "https://www.builder.io/c/docs/developers",
  match: "https://www.builder.io/c/docs/**",
  selector: `.docs-builder-container`,
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
};
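To get a feel for which links a match pattern like the one above would select, here is a minimal sketch of glob matching. It is an illustration only, not the crawler's actual matcher: the helper names globToRegExp and matchesGlob are hypothetical, and the rules ("**" matches anything, "*" stays within one path segment) are an assumption about typical glob semantics.

```typescript
// Illustrative glob matcher; NOT the crawler's own implementation.
// Assumption: "**" matches any characters (including "/"),
// while a single "*" matches within one path segment only.
function globToRegExp(glob: string): RegExp {
  let pattern = "";
  for (let i = 0; i < glob.length; i++) {
    const ch = glob.charAt(i);
    if (ch === "*") {
      if (glob.charAt(i + 1) === "*") {
        pattern += ".*";
        i++; // skip the second "*"
      } else {
        pattern += "[^/]*";
      }
    } else {
      // Escape regex metacharacters so they match literally.
      pattern += ch.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
    }
  }
  return new RegExp(`^${pattern}$`);
}

function matchesGlob(url: string, glob: string): boolean {
  return globToRegExp(glob).test(url);
}

const match = "https://www.builder.io/c/docs/**";

// A docs page falls under the pattern, so it would be followed.
const followed = matchesGlob("https://www.builder.io/c/docs/developers", match);

// A page outside /c/docs/ does not match, so it would be skipped.
const skipped = matchesGlob("https://www.builder.io/blog", match);
```

In this sketch, followed is true and skipped is false. If the crawl visits fewer pages than expected, an overly narrow match pattern is worth checking first.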

See config.ts for all available options. Here is a sample of the common configuration options:

type Config = {
  /** URL to start the crawl. If a sitemap is provided, it is used instead and all pages in the sitemap are downloaded */
  url: string;
  /** Pattern to match against for links on a page to subsequently crawl */
  match: string;
  /** Selector to grab the inner text from */
  selector: string;
  /** Don't crawl more than this many pages */
  maxPagesToCrawl: number;
  /** File name for the finished data */
  outputFileName: string;
  /** Optional resources to exclude
   *
   * @example
   * ['png','jpg','jpeg','gif','svg','css','js','ico','woff','woff2','ttf','eot','otf','mp4','mp3','webm','ogg','wav','flac','aac','zip','tar','gz','rar','7z','exe','dmg','apk','csv','xls','xlsx','doc','docx','pdf','epub','iso','dmg','bin','ppt','pptx','odt','avi','mkv','xml','json','yml','yaml','rss','atom','swf','txt','dart','webp','bmp','tif','psd','ai','indd','eps','ps','zipx','srt','wasm','m4v','m4a','webp','weba','m4b','opus','ogv','ogm','oga','spx','ogx','flv','3gp','3g2','jxr','wdp','jng','hief','avif','apng','avifs','heif','heic','cur','ico','ani','jp2','jpm','jpx','mj2','wmv','wma','aac','tif','tiff','mpg','mpeg','mov','avi','wmv','flv','swf','mkv','m4v','m4p','m4b','m4r','m4a','mp3','wav','wma','ogg','oga','webm','3gp','3g2','flac','spx','amr','mid','midi','mka','dts','ac3','eac3','weba','m3u','m3u8','ts','wpl','pls','vob','ifo','bup','svcd','drc','dsm','dsv','dsa','dss','vivo','ivf','dvd','fli','flc','flic','flic','mng','asf','m2v','asx','ram','ra','rm','rpm','roq','smi','smil','wmf','wmz','wmd','wvx','wmx','movie','wri','ins','isp','acsm','djvu','fb2','xps','oxps','ps','eps','ai','prn','svg','dwg','dxf','ttf','fnt','fon','otf','cab']
   */
  resourceExclusions?: string[];
  /** Optional maximum file size in megabytes to include in the output file */
  maxFileSize?: number;
  /** Optional maximum number of tokens to include in the output file */
  maxTokens?: number;
};
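As a sketch of how the optional fields fit together, the config below sets all three. The values (a short exclusion list, 1 MB, 5000 tokens) are illustrative assumptions rather than recommended defaults, and the Config type is repeated locally without its comments only so the snippet compiles on its own.

```typescript
// Local copy of the Config type from above so this snippet is self-contained.
type Config = {
  url: string;
  match: string;
  selector: string;
  maxPagesToCrawl: number;
  outputFileName: string;
  resourceExclusions?: string[];
  maxFileSize?: number;
  maxTokens?: number;
};

// Illustrative values only; tune them for your own site.
const customConfig: Config = {
  url: "https://www.builder.io/c/docs/developers",
  match: "https://www.builder.io/c/docs/**",
  selector: `.docs-builder-container`,
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
  // Exclude a few common static resources.
  resourceExclusions: ["png", "jpg", "svg", "css", "js"],
  // Cap the output file size, in megabytes.
  maxFileSize: 1,
  // Cap the number of tokens in the output file.
  maxTokens: 5000,
};
```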

Run your crawler

npm start

Alternative methods

Running in a container with Docker

To obtain output.json from a containerized run, go into the containerapp directory and modify config.ts as shown above. The output.json file should be generated in the data folder. Note: the outputFileName property in the config.ts file in the containerapp directory is configured to work with the container.

Running as an API

To run the app as an API server, first run npm install to install the dependencies. The server is written in Express JS.

Run npm run start:server to start the server. By default, the server runs on port 3000.

To run the crawler, send a POST request to the /crawl endpoint with the config JSON as the request body. The API docs are served with Swagger at the /api-docs endpoint.
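A request to the /crawl endpoint could be put together roughly as follows. This is a sketch under assumptions: the helper buildCrawlRequest and the CrawlConfig and CrawlRequest types are hypothetical names, the base URL assumes the default port 3000 on localhost, and the JSON Content-Type header is assumed rather than documented here. The response format is not described in this README, so the sketch stops at building the request.

```typescript
// Minimal subset of the crawler config, repeated here so the snippet is self-contained.
type CrawlConfig = {
  url: string;
  match: string;
  selector: string;
  maxPagesToCrawl: number;
  outputFileName: string;
};

// Hypothetical shape describing the HTTP request to send.
type CrawlRequest = {
  endpoint: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
};

// Builds the POST request for /crawl. baseUrl assumes the default port 3000.
function buildCrawlRequest(
  config: CrawlConfig,
  baseUrl: string = "http://localhost:3000",
): CrawlRequest {
  return {
    endpoint: `${baseUrl}/crawl`,
    method: "POST",
    headers: { "Content-Type": "application/json" }, // assumed content type
    body: JSON.stringify(config),
  };
}

const request = buildCrawlRequest({
  url: "https://www.builder.io/c/docs/developers",
  match: "https://www.builder.io/c/docs/**",
  selector: ".docs-builder-container",
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
});

// To send it (in a runtime with a global fetch, such as Node 18+ or a browser):
// const res = await fetch(request.endpoint, {
//   method: request.method,
//   headers: request.headers,
//   body: request.body,
// });
```

If you change the port in .env, pass the matching base URL as the second argument.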

To modify the environment, copy .env.example to .env and set values such as the port to override the server's defaults.

Upload your data to OpenAI

The crawl will generate a file called output.json at the root of this project. Upload that to OpenAI to create your custom assistant or custom GPT.

Create a custom GPT

Use this option for UI access to your generated knowledge that you can easily share with others.

Note: you may need a paid ChatGPT plan to create and use custom GPTs right now

  1. Go to https://chat.openai.com/
  2. Click your name in the bottom left corner
  3. Choose "My GPTs" in the menu
  4. Choose "Create a GPT"
  5. Choose "Configure"
  6. Under "Knowledge" choose "Upload a file" and upload the file you generated
  7. If you get an error about the file being too large, you can split it into multiple files and upload them separately using the maxFileSize option in config.ts, or use tokenization to reduce the size of the file with the maxTokens option in config.ts.
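To illustrate the idea behind splitting a large output into several smaller files (step 7 above), here is a minimal sketch of size-based chunking. It is not the crawler's own splitting logic: the helper splitBySize is hypothetical, and it measures size as the character length of each item's JSON, which only approximates the size in bytes.

```typescript
// Groups items into chunks whose combined JSON length stays within maxChars.
// An item larger than maxChars still gets a chunk of its own.
function splitBySize<T>(items: T[], maxChars: number): T[][] {
  const chunks: T[][] = [];
  let current: T[] = [];
  let currentSize = 0;

  for (const item of items) {
    const size = JSON.stringify(item).length;
    // Start a new chunk when adding this item would exceed the limit.
    if (current.length > 0 && currentSize + size > maxChars) {
      chunks.push(current);
      current = [];
      currentSize = 0;
    }
    current.push(item);
    currentSize += size;
  }

  if (current.length > 0) {
    chunks.push(current);
  }
  return chunks;
}

// Three sample records of 51 characters each once serialized.
const pages = [
  { text: "a".repeat(40) },
  { text: "b".repeat(40) },
  { text: "c".repeat(40) },
];

// With a limit of 110 characters, the first two records share a chunk.
const chunks = splitBySize(pages, 110);
```

Here chunks holds two groups, the first with two records and the second with one. Each group could then be written to its own file and uploaded separately.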

Gif of how to upload a custom GPT

Create a custom assistant

Use this option for API access to your generated knowledge that you can integrate into your product.

  1. Go to https://platform.openai.com/assistants
  2. Click "+ Create"
  3. Choose "upload" and upload the file you generated

Gif of how to upload to an assistant

Contributing

Know how to make this project better? Send a PR!