dumpduck
v0.1.1
Import PostgreSQL .dump files into DuckDB
Dumpduck - import postgres .dump files into duckdb
Status: Alpha - this works for small .dumps but isn't performant for large databases yet. But I architected it so that it should be quite possible to optimize.
Example usage:
- Read from a local SQL file and import into a local duckdb file:
pnpx dumpduck backup.sql out.duckdb
- Stream from a remote .dump file, translate it with pg_restore on the fly, and stream it into MotherDuck. This uses little RAM and little disk space, perfect for GitHub Actions cron jobs with huge backups:
pnpx dumpduck https://example.com/backup.dump md:
- Read from stdin:
cat backup.dump | pnpx dumpduck :stdin: out.duckdb
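The MotherDuck case above can run straight from CI. A minimal GitHub Actions sketch - the workflow name, schedule, dump URL, and the MOTHERDUCK_TOKEN secret name are all assumptions for illustration, not something dumpduck prescribes:

```yaml
# Hypothetical nightly backup job; adjust names, URLs, and secrets to your setup.
name: nightly-backup
on:
  schedule:
    - cron: "0 3 * * *"  # every day at 03:00 UTC
jobs:
  backup:
    runs-on: ubuntu-latest
    steps:
      # pg_restore is only needed if the source is a custom-format dump
      - run: sudo apt-get update && sudo apt-get install -y postgresql-client
      - run: npm install -g pnpm
      - run: pnpx dumpduck https://example.com/backup.dump md:
        env:
          # duckdb's MotherDuck integration reads this token from the environment
          motherduck_token: ${{ secrets.MOTHERDUCK_TOKEN }}
```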
Inspired by https://github.com/NINAnor/pg_dedump
The workflow is basically:
- We are handed a file path, an http(s) URL, or a stream such as stdin. We convert all of these to a stream of bytes, for consistency.
- Then, we detect the format of this byte stream. If it is the custom binary format from e.g. pg_dump --format='custom', then we pipe it through pg_restore to convert it back to plain SQL. If you need to do this, then of course you will need pg_restore available on your system.
- Spin up a temporary, in-memory postgres instance using the awesome pglite project (a port of postgres to a 3MB WASM file), and expose it as a local server.
- Connect to the given duckdb database, and then ATTACH 'host=0.0.0.0 port=9077 sslmode=disable' AS pg (TYPE postgres); to attach to that postgres server.
- Take the stream of raw SQL text and do very rough parsing on it, splitting it into individual statements.
- Take the stream of SQL statements and process them one at a time:
  - The first section of SQL statements in a dump are e.g. CREATE TABLE and CREATE TYPE statements. Pass each of these through to pglite.
  - As soon as we hit the first COPY statement, we have moved on to the data section of the dump file. Pause for a moment, and in the duckdb instance do COPY FROM DATABASE pg TO <current db>. This leverages the type mapping of the duckdb postgres extension to copy over all the finished types, tables, and views from pglite into duckdb. There is no data in the pglite tables yet, so this does no data copying.
  - Now that the duckdb database is ready to accept data, we process each of the COPY statements. For each one, we intercept the rows of data and batch them into chunks of 1k lines. We load each batch into pglite using its native COPY syntax. This guarantees that any weird CSV formatting in the dump file is actually read by postgres, the program that is designed to handle it, instead of being converted to something that duckdb understands (I tried this and it was hard). After loading a batch into pglite, from duckdb we INSERT INTO <dst> SELECT * FROM pg.<dst>; to move it over, and then TRUNCATE; VACUUM; the pglite table to reclaim space.
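The per-COPY batching step can be sketched as a generator that groups incoming data lines into chunks before handing each chunk to pglite. This is an illustrative sketch, not dumpduck's actual API; the function name and the choice of a sync generator are assumptions:

```typescript
// Group COPY data lines into batches of up to `batchSize` lines.
// In the dump format, a line containing just "\." terminates a COPY
// data section, so we stop there.
function* batchLines(
  lines: Iterable<string>,
  batchSize = 1000,
): Generator<string[]> {
  let batch: string[] = [];
  for (const line of lines) {
    if (line === "\\.") {
      // End-of-data marker: flush whatever is pending and stop.
      if (batch.length > 0) yield batch;
      return;
    }
    batch.push(line);
    if (batch.length >= batchSize) {
      yield batch;
      batch = [];
    }
  }
  if (batch.length > 0) yield batch; // flush the final partial batch
}
```

Each yielded batch would then be wrapped in a single COPY ... FROM STDIN round-trip to pglite, followed by the INSERT INTO / TRUNCATE dance described above.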
Note how this entire thing is very carefully designed to be streaming.
We only read through the input stream once, and we never materialize large chunks of data.
This means that you can do pnpx dumpduck https://example.com/backup.dump md: to
do backups on tiny github action runners that have low ram and disk space,
but great network connectivity.
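The format-detection step fits this streaming design because it only needs the first few bytes: pg_dump's custom format begins with the magic bytes "PGDMP", while a plain dump is just SQL text. A minimal sketch (the helper name is hypothetical, and dumpduck's real detection may differ):

```typescript
// pg_dump --format=custom output starts with the ASCII magic "PGDMP";
// anything else is assumed to be plain SQL and needs no pg_restore pass.
const PGDMP_MAGIC = Buffer.from("PGDMP");

function isCustomFormat(firstBytes: Buffer): boolean {
  return (
    firstBytes.length >= PGDMP_MAGIC.length &&
    firstBytes.subarray(0, PGDMP_MAGIC.length).equals(PGDMP_MAGIC)
  );
}
```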
Limitations and Problems
Due to https://github.com/duckdb/duckdb-postgres/issues/429,
if your postgres dump has any custom datatypes,
then these will get improperly copied over to the duckdb file.
You will be able to use the duckdb file fine, but any
schema.sql files generated from a subsequent EXPORT statement will be garbled.
If you have custom datatypes in postgres AND need to EXPORT and re-IMPORT the duckdb database, you will be affected. Otherwise, you should be fine.
