dumpduck
v0.1.1
Import PostgreSQL .dump files into DuckDB
Dumpduck - import postgres .dump files into duckdb
Status: Alpha - this works for small .dumps but isn't performant for large databases yet. But I architected it so that it should be quite possible to optimize.
Example usage:
- Read from a local SQL file and import into a local duckdb file:
pnpx dumpduck backup.sql out.duckdb
- Stream from a remote .dump file, translate it with pg_restore on the fly, and stream it into MotherDuck. This uses little RAM and little disk space, perfect for GitHub Actions cron jobs with huge backups:
pnpx dumpduck https://example.com/backup.dump md:
- Read from stdin:
cat backup.dump | pnpx dumpduck :stdin: out.duckdb
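The MotherDuck case above can run straight from CI. A minimal GitHub Actions sketch - the workflow name, schedule, dump URL, and the MOTHERDUCK_TOKEN secret name are all assumptions for illustration, not something dumpduck prescribes:

```yaml
# Hypothetical nightly backup job; adjust names, URLs, and secrets to your setup.
name: nightly-backup
on:
  schedule:
    - cron: "0 3 * * *"  # every day at 03:00 UTC
jobs:
  backup:
    runs-on: ubuntu-latest
    steps:
      # pg_restore is only needed if the source is a custom-format dump
      - run: sudo apt-get update && sudo apt-get install -y postgresql-client
      - run: npm install -g pnpm
      - run: pnpx dumpduck https://example.com/backup.dump md:
        env:
          # duckdb's MotherDuck integration reads this token from the environment
          motherduck_token: ${{ secrets.MOTHERDUCK_TOKEN }}
```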
Inspired by https://github.com/NINAnor/pg_dedump
The workflow is basically:
- We are handed a file path, an http(s) URL, or a stream such as stdin. We convert all of these to a stream of bytes, for consistency.
- Then, we detect the format of this byte stream. If it is the custom binary format from e.g. pg_dump --format='custom', then we pipe it through pg_restore to convert it back to plain SQL. If you need to do this, then of course you will need pg_restore available on your system.
- Spin up a temporary, in-memory postgres instance using the awesome pglite project (a port of postgres to a 3MB WASM file), and expose it as a local server.
- Connect to the given duckdb database, and then ATTACH 'host=0.0.0.0 port=9077 sslmode=disable' AS pg (TYPE postgres); to attach to that postgres server.
- Take the stream of raw SQL text and do very rough parsing on it, splitting it into individual statements.
- Take the stream of SQL statements and process them one at a time:
  - The first section of SQL statements in a dump are e.g. CREATE TABLE and CREATE TYPE statements. Pass each of these through to pglite.
  - As soon as we hit the first COPY statement, we have moved on to the data section of the dump file. Pause for a moment, and in the duckdb instance do COPY FROM DATABASE pg TO <current db>. This leverages the type mapping of the duckdb postgres extension to copy over all the finished types, tables, and views from pglite into duckdb. There is no data in the pglite tables yet, so this does no data copying.
  - Now that the duckdb database is ready to accept data, we process each of the COPY statements. For each one, we intercept the rows of data and batch them into chunks of 1k lines. We load each batch into pglite using its native COPY syntax. This guarantees that any weird CSV formatting in the dump file is actually read by postgres, the program that is designed to handle it, instead of being converted to something that duckdb understands (I tried this and it was hard). After loading a batch into pglite, from duckdb we INSERT INTO <dst> SELECT * FROM pg.<dst>; to move it over, and then TRUNCATE; VACUUM; the pglite table to reclaim space.
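The per-COPY batching step can be sketched as a generator that groups incoming data lines into chunks before handing each chunk to pglite. This is an illustrative sketch, not dumpduck's actual API; the function name and the choice of a sync generator are assumptions:

```typescript
// Group COPY data lines into batches of up to `batchSize` lines.
// In the dump format, a line containing just "\." terminates a COPY
// data section, so we stop there.
function* batchLines(
  lines: Iterable<string>,
  batchSize = 1000,
): Generator<string[]> {
  let batch: string[] = [];
  for (const line of lines) {
    if (line === "\\.") {
      // End-of-data marker: flush whatever is pending and stop.
      if (batch.length > 0) yield batch;
      return;
    }
    batch.push(line);
    if (batch.length >= batchSize) {
      yield batch;
      batch = [];
    }
  }
  if (batch.length > 0) yield batch; // flush the final partial batch
}
```

Each yielded batch would then be wrapped in a single COPY ... FROM STDIN round-trip to pglite, followed by the INSERT INTO / TRUNCATE dance described above.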
Note how this entire thing is very carefully designed to be streaming.
We only read through the input stream once, and we never materialize large chunks of data.
This means that you can do pnpx dumpduck https://example.com/backup.dump md: to
do backups on tiny github action runners that have low ram and disk space,
but great network connectivity.
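The format-detection step fits this streaming design because it only needs the first few bytes: pg_dump's custom format begins with the magic bytes "PGDMP", while a plain dump is just SQL text. A minimal sketch (the helper name is hypothetical, and dumpduck's real detection may differ):

```typescript
// pg_dump --format=custom output starts with the ASCII magic "PGDMP";
// anything else is assumed to be plain SQL and needs no pg_restore pass.
const PGDMP_MAGIC = Buffer.from("PGDMP");

function isCustomFormat(firstBytes: Buffer): boolean {
  return (
    firstBytes.length >= PGDMP_MAGIC.length &&
    firstBytes.subarray(0, PGDMP_MAGIC.length).equals(PGDMP_MAGIC)
  );
}
```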
Limitations and Problems
Due to https://github.com/duckdb/duckdb-postgres/issues/429,
if your postgres dump has any custom datatypes,
then these will get improperly copied over to the duckdb file.
You will be able to use the duckdb file fine, but any
schema.sql files generated from a subsequent EXPORT statement will be garbled.
If you have custom datatypes in postgres AND need to EXPORT and re-IMPORT the duckdb database, you will be affected. Otherwise, you should be fine.
