bqetl stage deploy uses ever-increasing memory #4340

Open
sean-rose opened this issue Sep 22, 2023 · 4 comments

sean-rose commented Sep 22, 2023

bqetl stage deploy seems to use ever-increasing amounts of memory as it runs, which can become an issue when deploying lots of artifacts to staging.

For example, PR #4223's deploy-changes-to-stage CI job is attempting to deploy 401 artifacts to staging but keeps getting killed, apparently due to running out of memory.

Even increasing the memory resources from 4 GB to 16 GB wasn't enough.

@sean-rose sean-rose self-assigned this Sep 22, 2023
sean-rose commented

In PR #4223 I tried two things: no longer sharing one BigQuery client across all schema deployment threads, and closing BigQuery clients once the bqetl CLI is done with them. Neither reduced the memory usage of bqetl stage deploy.
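
For reference, the two changes described above follow a common pattern that can be sketched as below. This is only an illustration, not the code from PR #4223: StandInClient is a hypothetical stand-in for a BigQuery client (it only tracks whether it is open), and deploy_schema, deploy_all, and the use of ThreadPoolExecutor are assumptions made for the example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from contextlib import closing


class StandInClient:
    """Hypothetical stand-in for a BigQuery client; it only tracks open/closed state."""

    open_count = 0  # number of clients currently open
    _lock = threading.Lock()

    def __init__(self):
        with StandInClient._lock:
            StandInClient.open_count += 1

    def update_schema(self, table):
        return f"deployed {table}"

    def close(self):
        with StandInClient._lock:
            StandInClient.open_count -= 1


def deploy_schema(table):
    # Each task creates its own client instead of sharing a global one,
    # and closing() guarantees close() runs even if the call raises.
    with closing(StandInClient()) as client:
        return client.update_schema(table)


def deploy_all(tables, max_workers=4):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(deploy_schema, tables))
```

After deploy_all returns, StandInClient.open_count is back at zero, which shows that no client outlives its task. As noted above, doing the equivalent in bqetl did not fix the memory growth, so the clients are probably not the main cause.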

@sean-rose sean-rose removed their assignment Sep 22, 2023
@sean-rose sean-rose mentioned this issue Sep 22, 2023

scholtzan commented Sep 26, 2023

I've been doing some memory profiling on a smaller stage deploy with memray. It seems like

def get_stable_table_schemas() -> List[SchemaFile]:

is using up the most memory. This function pulls in stable table schemas from mozilla-pipeline-schemas, and it seems to do so for every query that gets deployed to stage.

I'll see if we can just pull in the schemas once and then keep reusing them across the queries that need to be deployed.

scholtzan commented

Just to dump this here:

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Location                                                                                                                                 ┃             <Total Memory> ┃             Total Memory % ┃                Own Memory ┃              Own Memory % ┃          Allocation Count ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ _bootstrap_inner at /Users/anna/.pyenv/versions/3.10.6/lib/python3.10/threading.py                                                       │                   93.696MB │                     44.03% │                    0.000B │                     0.00% │                      1496 │
│ _bootstrap at /Users/anna/.pyenv/versions/3.10.6/lib/python3.10/threading.py                                                             │                   93.696MB │                     44.03% │                    0.000B │                     0.00% │                      1496 │
│ _deploy_schema at /Users/anna/mydata/bigquery-etl/bigquery_etl/cli/stage.py                                                              │                   93.690MB │                     44.02% │                    0.000B │                     0.00% │                      1485 │
│ mapstar at /Users/anna/.pyenv/versions/3.10.6/lib/python3.10/multiprocessing/pool.py                                                     │                   93.690MB │                     44.02% │                    0.000B │                     0.00% │                      1485 │
│ worker at /Users/anna/.pyenv/versions/3.10.6/lib/python3.10/multiprocessing/pool.py                                                      │                   93.690MB │                     44.02% │                    0.000B │                     0.00% │                      1485 │
│ update at /Users/anna/mydata/bigquery-etl/bigquery_etl/cli/query.py                                                                      │                   93.689MB │                     44.02% │                    0.000B │                     0.00% │                      1483 │
│ get_dependency_graph at /Users/anna/mydata/bigquery-etl/bigquery_etl/dependency.py                                                       │                   93.689MB │                     44.02% │                    0.000B │                     0.00% │                      1482 │
│ extract_table_references_without_views at /Users/anna/mydata/bigquery-etl/bigquery_etl/dependency.py                                     │                   93.678MB │                     44.02% │                    0.000B │                     0.00% │                      1476 │
│ load at /Users/anna/.pyenv/versions/3.10.6/lib/python3.10/json/__init__.py                                                               │                   88.768MB │                     41.71% │                    0.000B │                     0.00% │                      1385 │
│ invoke at /Users/anna/.pyenv/versions/3.10.6/lib/python3.10/site-packages/click/core.py                                                  │                   83.437MB │                     39.21% │                  640.000B │                     0.00% │                      5364 │
│ _get_references at /Users/anna/mydata/bigquery-etl/bigquery_etl/dependency.py                                                            │                   82.640MB │                     38.83% │                   9.766KB │                     0.00% │                      1440 │
│ get_stable_table_schemas at /Users/anna/mydata/bigquery-etl/bigquery_etl/schema/stable_table_schema.py                                   │                   50.580MB │                     23.77% │                  600.000B │                     0.00% │                      1340 │
│ run at /Users/anna/.pyenv/versions/3.10.6/lib/python3.10/threading.py                                                                    │                   50.561MB │                     23.76% │                   5.852KB │                     0.00% │                      1360 │
│ loads at /Users/anna/.pyenv/versions/3.10.6/lib/python3.10/json/__init__.py                                                              │                   50.530MB │                     23.74% │                   6.371MB │                     2.99% │                      1333 │
│ decode at /Users/anna/.pyenv/versions/3.10.6/lib/python3.10/json/decoder.py                                                              │                   44.159MB │                     20.75% │                   1.089KB │                     0.00% │                      1332 │
│ raw_decode at /Users/anna/.pyenv/versions/3.10.6/lib/python3.10/json/decoder.py                                                          │                   44.158MB │                     20.75% │                  44.158MB │                    20.75% │                      1330 │
│ _read_chunked at /Users/anna/.pyenv/versions/3.10.6/lib/python3.10/http/client.py                                                        │                    3.824MB │                      1.80% │                   3.824MB │                     1.80% │                         1 │
│ read at /Users/anna/.pyenv/versions/3.10.6/lib/python3.10/http/client.py                                                                 │                    3.824MB │                      1.80% │                    0.000B │                     0.00% │                         1 │
│ compile at /Users/anna/.pyenv/versions/3.10.6/lib/python3.10/re.py                                                                       │                    3.356MB │                      1.58% │                    0.000B │                     0.00% │                       322 │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴────────────────────────────┴────────────────────────────┴───────────────────────────┴───────────────────────────┴───────────────────────────┘

It looks like building the dependency graph, which happens on every schema update call, is using the most memory. We can't really cache the dependency graph since we override references with the stage deploy project.
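
The table above comes from memray. A rough way to check the same kind of effect with only the standard library is tracemalloc. The sketch below is not the profiling setup used here: parse_schemas is a hypothetical stand-in for parsing a schema payload, and repeated() assumes, purely for illustration, that every parsed copy stays alive at the same time.

```python
import json
import tracemalloc


def parse_schemas(n_tables=200):
    """Hypothetical stand-in for parsing a large schema payload."""
    raw = json.dumps(
        [{"name": f"t{i}", "fields": list(range(50))} for i in range(n_tables)]
    )
    return json.loads(raw)


def peak_bytes(func, *args):
    """Run func and return (result, peak traced allocation in bytes)."""
    tracemalloc.start()
    try:
        result = func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def repeated(n_calls):
    # Assumption for illustration: each call parses again and every copy stays alive.
    return [parse_schemas() for _ in range(n_calls)]


def shared(n_calls):
    # Parse once and hand the same object to every consumer.
    schemas = parse_schemas()
    return [schemas for _ in range(n_calls)]
```

Comparing peak_bytes(repeated, 10) with peak_bytes(shared, 10) shows how re-parsing per call drives the peak up with the number of calls, while the shared variant stays roughly flat.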

sean-rose commented

Thanks for investigating!

I was thinking it might help to have the bqetl stage deploy command call bqetl query schema update and bqetl query schema deploy once each, passing them all paths, rather than calling them repeatedly for each individual table. Those commands would need to be modified to accept multiple tables.
