
hypersync-lancedb-pipe

A serverless embedded streaming OLAP data pipeline that leverages historical blockchain data from Hypersync and the mutable columnar storage format Lance.

Since Lance is designed to be mutable, it is possible to create an embedded streaming pipeline from the same data source. The main advantage of this streaming approach is that it requires no parquet glob file management, which reduces the complexity of streaming to that of batch processing. The other main benefit is that LanceDB integrates tightly with both polars and duckdb: LanceDB accepts polars dataframes as inputs, so polars can serve as a preprocessing tool in a more flexible ETL pipeline.
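For example, writing a preprocessed polars dataframe into a Lance table is a single call. The sketch below assumes the lancedb and polars packages; the database path, table name, and columns are illustrative, not this package's actual schema.

```python
import lancedb
import polars as pl

# Connect to (or create) an embedded LanceDB database on local disk.
db = lancedb.connect("data/sample-lancedb")

# Preprocess raw records with polars before they are written to Lance.
df = pl.DataFrame(
    {
        "block_number": [18_000_000, 18_000_001],
        "timestamp": [1_700_000_000, 1_700_000_012],
        "tx_count": [142, 98],
    }
)

# LanceDB accepts the polars dataframe directly; no parquet glob files are involved.
table = db.create_table("blocks", data=df, mode="overwrite")

# Later batches can be appended to the same table as they stream in.
table.add(df.with_columns(pl.col("block_number") + 2))
```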

Since LanceDB leverages the Apache Arrow standard, there is a lot of flexibility in querying the database - such as scanning larger-than-memory datasets with polars lazyframes and its dataframe API, or using an embedded OLAP engine like duckdb for faster queries and a SQL API.
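A rough sketch of both query paths, assuming the table created above; exposing the underlying Lance dataset via to_lance() is one way to reach the Arrow interop, and all names are illustrative.

```python
import duckdb
import lancedb
import polars as pl

db = lancedb.connect("data/sample-lancedb")
table = db.open_table("blocks")

# Lazy, larger-than-memory scan through the polars dataframe API,
# backed by the Arrow-compatible Lance dataset.
lazy = pl.scan_pyarrow_dataset(table.to_lance())
recent = lazy.filter(pl.col("block_number") >= 18_000_001).collect()
print(recent)

# The same table queried with duckdb's embedded SQL engine via Arrow interop;
# duckdb picks up the `dataset` variable by name (a replacement scan).
dataset = table.to_lance()
print(duckdb.sql("SELECT count(*) AS n_rows FROM dataset").fetchall())
```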

Install with pip

pip install hypersync-lancedb-pipe

Getting Started

  1. This repository uses rye to manage dependencies and the virtual environment. To install rye, refer to the instructions here.
  2. Once rye is installed, run rye sync to install dependencies and set up the virtual environment, which defaults to the name .venv.
  3. Activate the virtual environment with the command source .venv/bin/activate.

Running the Pipeline

There are example scripts in the scripts folder that demonstrate the versatility of the lancedb writer; a sketch of the shared sync pattern follows the list below.

  • Run historical_sync.py to backfill data from a historical block number. Assumes no table exists yet.
  • Run head_sync.py to sync the database to the head of the chain. Assumes the table already exists.
  • Run backfill_sync.py to perform a backfill sync from the earliest block number. Assumes the table already exists.
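All three scripts share the same basic pattern: find the highest block already stored, fetch the next range of blocks, and append the batch to the existing Lance table. The sketch below illustrates that pattern only; fetch_blocks is a hypothetical stand-in for the Hypersync query the real scripts perform, and the column names are illustrative.

```python
import lancedb
import polars as pl


def fetch_blocks(start_block: int, end_block: int) -> pl.DataFrame:
    # Hypothetical placeholder: the real scripts fill this batch with a Hypersync query.
    return pl.DataFrame(
        schema={"block_number": pl.Int64, "timestamp": pl.Int64, "tx_count": pl.Int64}
    )


db = lancedb.connect("data/sample-lancedb")
table = db.open_table("blocks")

# Resume from the highest block already stored in the table.
last_synced = (
    pl.scan_pyarrow_dataset(table.to_lance())
    .select(pl.col("block_number").max())
    .collect()
    .item()
)

# Fetch the next range and append it to the same Lance table.
batch = fetch_blocks(last_synced + 1, last_synced + 10_000)
if batch.height > 0:
    table.add(batch)
```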