Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Append Functionality for Streaming Data in LazyFrames #16204

Open
vultix opened this issue May 13, 2024 · 0 comments
Open

Add Append Functionality for Streaming Data in LazyFrames #16204

vultix opened this issue May 13, 2024 · 0 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@vultix
Copy link

vultix commented May 13, 2024

Description

I'm currently using polars to process real-time stock market data. As new market data is made available, I currently have to recompute the entire historical dataset. With millions of rows of historical data, this is hugely inefficient.

This issue proposes a mechanism to append new data into an existing streaming lazyframe, only computing the new values.

Example use case

# Start with simple pricing data
price_data = pl.DataFrame({"price": [1.0, 2.0, 3.0, 4.0]})

# Add some computation that can be done in a streaming, append-only manner
lazy_frame = price_data.lazy().with_columns(pl.col('price').pct_change().alias('pct_change'))

# Process the "historical" pricing data
current_data = lazy_frame.collect(streaming=True)
# shape: (4, 2)
# ┌───────┬────────────┐
# │ price ┆ pct_change │
# │ ---   ┆ ---        │
# │ f64   ┆ f64        │
# ╞═══════╪════════════╡
# │ 1.0   ┆ null       │
# │ 2.0   ┆ 1.0        │
# │ 3.0   ┆ 0.5        │
# │ 4.0   ┆ 0.333333   │
# └───────┴────────────┘

# Later, new pricing information is made available
new_price_data = pl.DataFrame({"price": [5.0]})

# Calculate the output dataframe for the new price, no need to recompute all old data 
lazy_frame.stream_append(new_price_data)
# shape: (1, 2)
# ┌───────┬────────────┐
# │ price ┆ pct_change │
# │ ---   ┆ ---        │
# │ f64   ┆ f64        │
# ╞═══════╪════════════╡
# │ 5.0   ┆ 0.25       │
# └───────┴────────────┘

Proposed new functionality

  • Streaming LazyFrame Wrapper: Introduce a wrapper that maintains the state of the streaming engine within a LazyFrame.
  • Incremental Processing: Allow collection of the next n rows, processing/consuming only the data required to output those rows.
  • Append New Data: Enable appending new rows/dataframes directly to the streaming engine
@vultix vultix added the enhancement New feature or an improvement of an existing feature label May 13, 2024
@vultix vultix changed the title Implement Append Functionality for Streaming Data in LazyFrames Add Append Functionality for Streaming Data in LazyFrames May 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature
Projects
None yet
Development

No branches or pull requests

1 participant