Make pandas/io/sql.py work with sqlalchemy 2.0 #48576

Merged
merged 13 commits into from
Feb 9, 2023

Conversation

cdcadman
Contributor

@cdcadman cdcadman commented Sep 15, 2022

I have run the following with sqlalchemy 2.0.0 to confirm compatibility:

  • pytest pandas/tests/io/test_sql.py
  • mypy pandas\io\sql.py pandas\tests\io\test_sql.py --follow-imports=silent

@datapythonista datapythonista added the Testing (pandas testing functions or related to the test suite), IO SQL (to_sql, read_sql, read_sql_query), and Warnings (warnings that appear or should be added to pandas) labels Sep 16, 2022
Member

@datapythonista datapythonista left a comment

Thanks for working on this @cdcadman

I'm sorry I'm not so familiar with the SQLAlchemy changes, but do you mind explaining why we should filter RemovedIn20Warnings instead of updating the code? I assume we want to be compatible with both 1.4 and 2.0, which are not compatible with each other. But it feels like it would make more sense, for now, to use if statements to support both versions.

I may surely be missing something, if you can please expand on why this approach, that would be great. Thanks!

pandas/io/sql.py Outdated
Comment on lines 1419 to 1422
elif "statement" in kwargs:
statement = kwargs["statement"]
else:
statement = None
Member

Isn't this the same as statement = kwargs.get("statement")?

Contributor Author

Yes, and I just made a new commit to address this.
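
A quick check of the equivalence raised in this review comment, as a minimal sketch. The function names are hypothetical, and the original `elif` is written as a plain `if` here because the preceding branch is not shown. `dict.get` returns `None` when the key is absent, which is what the explicit branch did.

```python
def statement_verbose(kwargs):
    # Explicit membership test, mirroring the branch under review.
    if "statement" in kwargs:
        statement = kwargs["statement"]
    else:
        statement = None
    return statement


def statement_concise(kwargs):
    # dict.get returns None by default when the key is missing.
    return kwargs.get("statement")


samples = [{}, {"statement": "select 1"}, {"statement": None}, {"other": 1}]
same = all(statement_verbose(k) == statement_concise(k) for k in samples)
```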

@cdcadman
Contributor Author

I'm sorry I'm not so familiar with the SQLAlchemy changes, but do you mind explaining why we should filter RemovedIn20Warnings instead of updating the code? I assume we want to be compatible with both 1.4 and 2.0, which are not compatible with each other. But it feels like it would make more sense, for now, to use if statements to support both versions.

My approach is to update pandas/io/sql.py so that it can support both versions. I also had to update pandas/tests/io/test_sql.py, because pandas users will need to run DataFrame.to_sql within a transaction if they pass a sqlalchemy Connection under sqlalchemy 2.0. This is how I've approached the transition to sqlalchemy 2.0 in my own work: by continuing to use 1.4 while ensuring that my code would still work under 2.0.

The RemovedIn20Warning provides a way of testing that the code will work under sqlalchemy 2.0, while the tests run under sqlalchemy 1.4. Another approach is to pass future=True to sqlalchemy.create_engine in order to create 2.0-style engines for testing. I hadn't thought of expanding the tests to run under both 1.4 and 2.0 sqlalchemy engines and connections, but that might be a better approach to testing the code during the transition period, since it does not require the SQLALCHEMY_WARN_20 environment variable to be set prior to the start of the tests.

@cdcadman cdcadman marked this pull request as draft September 16, 2022 16:11
@datapythonista
Member

Thanks a lot for all the information @cdcadman, that's useful. There is something I still don't fully understand. Let me explain in detail, and please correct me if I'm wrong, I can surely be missing something. Let me use this example:

# 1.4
session.query(User).get(42)

# 2.0
session.get(User, 42)

I see two different cases.

Case A: 2.0 syntax also works in 1.4

This case is easy, we should simply replace one by the other, right?

Case B: 2.0 syntax doesn't work on 1.4

Then, while we want to keep compatibility with 1.4, we should have something like (code won't work, just to illustrate):

if sqlalchemy.__version__ < '2.0':
    session.query(User).get(42)
else:
    session.get(User, 42)

When we test a pandas function, if we have all the code written this way, once we finished all the changes, we shouldn't have any warning being raised, so we wouldn't need to filter any warning in our code.
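
One reason the illustrative check above would not work as written is that comparing version strings is lexicographic, so `"10.0" < "2.0"` is true. A minimal sketch of a numeric gate follows. `version_tuple` and `uses_2_0_api` are hypothetical helpers, not pandas or SQLAlchemy functions; real code would more likely rely on a dedicated parser such as `packaging.version.Version`.

```python
import re


def version_tuple(version: str) -> tuple:
    """Return the leading dotted integers of a version string."""
    # Hypothetical helper: "2.0.0b2" -> (2, 0, 0), "1.4.16" -> (1, 4, 16).
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def uses_2_0_api(version: str) -> bool:
    # Pre-releases such as "2.0.0b2" count as 2.0 for this purpose.
    return version_tuple(version) >= (2,)


# Lexicographic string comparison gets this case wrong.
string_compare_is_wrong = "10.0" < "2.0"
```

With this gate, `uses_2_0_api("1.4.16")` is false and `uses_2_0_api("2.0.0b2")` is true.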

If we agree on this (please let me know if I missed anything), then the next question is how we make these changes. If I understand correctly, there is this flag you mention that makes SQLAlchemy start raising warnings about deprecations. I guess we can activate it locally, address the warnings, and then open PRs with just the fixes. Or we can activate it in our CI, so the warnings are visible in our tests. Either way is probably fine.

If the above is correct, then I don't understand why we need to filter warnings in our code. Do you mind explaining what I'm missing please?

@cdcadman
Contributor Author

@datapythonista I agree with everything you wrote. I don't think we should censor RemovedIn20Warning. The only warnings filter I put into any code is this code in pandas/tests/io/test_sql.py:

if SQLALCHEMY_INSTALLED:
    # This only matters if the environment variable SQLALCHEMY_WARN_20 is set to 1.
    pytestmark = pytest.mark.filterwarnings("error::sqlalchemy.exc.RemovedIn20Warning")

This causes tests to fail if they raise RemovedIn20Warning.
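
The effect of an `error::` filter can be sketched with the standard `warnings` module alone. `RemovedIn20StandIn` is a hypothetical stand-in class used so the example runs without SQLAlchemy, and `run_strict` is likewise only for illustration: once the category is escalated, the warning is raised as an exception instead of being reported.

```python
import warnings


class RemovedIn20StandIn(DeprecationWarning):
    """Hypothetical stand-in for sqlalchemy.exc.RemovedIn20Warning."""


def legacy_call():
    # Emits the stand-in warning, then finishes normally.
    warnings.warn("legacy API", RemovedIn20StandIn, stacklevel=2)
    return "ok"


def run_strict(func):
    # Escalate only the stand-in category to an error, and only
    # inside this block, so other warnings are unaffected.
    with warnings.catch_warnings():
        warnings.simplefilter("error", category=RemovedIn20StandIn)
        try:
            return func()
        except RemovedIn20StandIn as exc:
            return f"failed: {exc}"


outcome = run_strict(legacy_call)
```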

Your comment about checking sqlalchemy versions reminded me that it is possible within sqlalchemy version 1.4 to obtain a 2.0-style sqlalchemy engine by passing future=True to sqlalchemy.create_engine. Any situation which raises RemovedIn20Warning should fail if it runs with a 2.0-style engine. Testing with 2.0-style engines is better than testing based on the RemovedIn20Warning. For example, when I tried running the tests after adding future=True to create_engine, one of the tests froze, and to fix it I had to use the connection as a context manager.

So I'd like to redo this PR in the next few days. I don't expect pandas/io/sql.py to change, but I will do something different with the test file:

  • Don't change test code to eliminate RemovedIn20Warning, so that I can ensure the existing tests still pass with 1.x-style sqlalchemy engines (future=False). That will provide evidence that downstream users' code won't break based on the edits I make to pandas/io/sql.py.
  • Add tests with 2.0-style engines. I can think of 3 different ways to do this:
    • Use pytest.mark.parametrize("future", [True, False]) on existing tests and classes. I will probably do this.
    • Create duplicate fixtures and classes for sqlalchemy 2.0 tests.
    • Create a duplicate test file test_sql_future.py.

@cdcadman
Contributor Author

Update: pandas/io/sql.py will have to change significantly from how I currently have it in this PR.

@cdcadman cdcadman changed the title Eliminate RemovedIn20Warning and other errors from pandas/tests/io/test_sql.py Make pandas/io/sql.py work with 2.0-style sqlalchemy connectables Sep 20, 2022
@cdcadman cdcadman marked this pull request as ready for review September 20, 2022 13:12
@cdcadman
Contributor Author

@datapythonista I modified this PR's code, title, and original comment, and it is again ready for review. Since I am now testing 2.0-style sqlalchemy connectables instead of checking for warnings, I think this can close #40686. Since I did make a lot of changes, would you prefer that I start a new PR instead?

@mroeschke
Member

Is it known when SQLAlchemy 2.0 will be released?

If it comes before pandas 2.0 (late 2022/early 2023), maybe we can just bump the minimum version of SQLAlchemy to 2.0 and adopt the new syntax. I'm not in love with duplicating all the testing and having to support 2 versions of an optional dependency with different syntax.

xref: #44823

@cdcadman
Contributor Author

@mroeschke I haven't seen a release date for sqlalchemy 2.0. Instead of duplicating all the tests, I could make the future argument of create_engine depend on either a global constant in test_sql.py or an environment variable. The future argument will be supported in sqlalchemy 2.0 and required to be True. I could also put this PR on hold until the beta release of sqlalchemy 2.0 comes out.

@@ -2387,12 +2438,14 @@ class _TestMySQLAlchemy:

flavor = "mysql"
port = 3306
future = False
Member

I think what we usually do to manage two versions of the same library is to add a flag in pandas.compat, and sometimes a function to handle both versions behavior, that then we call from our code. Do you think this could be helpful and avoid the test duplication and passing the future argument?

Contributor Author

I think the test duplication with the future argument is different from what pandas.compat provides. The future argument duplicates the type of sqlalchemy connectables that can be passed to read_sql_query and DataFrame.to_sql. Currently, the tests are already duplicated to ensure that both sqlalchemy.engine.Engine and sqlalchemy.engine.Connection can be passed to these methods, and the fact that these are both allowed also makes the code in pandas.io.sql more complicated. In sqlalchemy 1.4, each of these connectables has a subclass, sqlalchemy.future.Engine and sqlalchemy.future.Connection. In pandas.io.sql, I did not differentiate between future and non-future, but the test duplication ensures that all 4 of these sqlalchemy connectables work. Once pandas decides to require sqlalchemy 2.0, the future/non-future duplication will be unnecessary.

@fangchenli
Member

The 2.0 beta is out.

sqlalchemy/sqlalchemy#8631

@cdcadman cdcadman marked this pull request as draft October 27, 2022 15:36
@cdcadman cdcadman changed the title Make pandas/io/sql.py work with 2.0-style sqlalchemy connectables Make pandas/io/sql.py work with sqlalchemy 2.0 Oct 27, 2022
@cdcadman
Contributor Author

I'm planning to make some changes to this PR. Firstly, I noticed that pandas.io.sql.execute is documented, right above this line: https://pandas.pydata.org/docs/user_guide/io.html?highlight=sql%20execute#engine-connection-examples . As it stands, my PR would make this return a context manager instead of a Results Iterable, and I don't think I need to make this change, so I will change it back.

I plan to make SQLDatabase accept only a SQLAlchemy Connection and not an Engine. I would change pandasSQL_builder into a generator function, decorated by contextlib.contextmanager, so that it can dispose of the Engine that is created if the connectable is a string. A new argument, need_txn, will be set to True by to_sql, and False otherwise. An advantage of this approach is that I can begin a transaction if the connectable is a Connection which is not already in a transaction.

@contextmanager
def pandasSQL_builder(
    con,
    schema: str | None = None,
    need_txn: bool = False,
) -> Iterator[SQLDatabase] | Iterator[SQLiteDatabase]:
    """
    Convenience function to return the correct PandasSQL subclass based on the
    provided parameters.  Also creates a sqlalchemy connection and transaction
    if necessary.
    """
    import sqlite3
    import warnings

    if isinstance(con, sqlite3.Connection) or con is None:
        yield SQLiteDatabase(con)
    else:
        sqlalchemy = import_optional_dependency("sqlalchemy", errors="ignore")

        if sqlalchemy is not None and isinstance(con, (str, sqlalchemy.engine.Connectable)):
            with _sqlalchemy_con(con, need_txn) as con:
                yield SQLDatabase(con, schema=schema)
        elif isinstance(con, str) and sqlalchemy is None:
            raise ImportError("Using URI string without sqlalchemy installed.")
        else:

            warnings.warn(
                "pandas only supports SQLAlchemy connectable (engine/connection) or "
                "database string URI or sqlite3 DBAPI2 connection. "
                "Other DBAPI2 objects are not tested. Please consider using SQLAlchemy.",
                UserWarning,
                stacklevel=find_stack_level(),
            )
            yield SQLiteDatabase(con)


@contextmanager
def _sqlalchemy_con(connectable, need_txn: bool):
    """Create a sqlalchemy connection and a transaction if necessary."""
    import sqlalchemy

    if isinstance(connectable, str):
        engine = sqlalchemy.create_engine(connectable)
        try:
            with engine.connect() as con:
                if need_txn:
                    with con.begin():
                        yield con
                else:
                    yield con
        finally:
            engine.dispose()
    elif isinstance(connectable, sqlalchemy.engine.Engine):
        with connectable.connect() as con:
            if need_txn:
                with con.begin():
                    yield con
            else:
                yield con
    else:
        if need_txn and not connectable.in_transaction():
            with connectable.begin():
                yield connectable
        else:
            yield connectable
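
The ownership rule in the proposal above (clean up only what the helper itself created, and start a transaction only when the caller has not) can be tried without SQLAlchemy. The following is a sketch of the same pattern using the standard-library `sqlite3` module; `sqlite_con` is a hypothetical analogue of `_sqlalchemy_con`, not code from this PR.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def sqlite_con(target, need_txn: bool):
    """Yield a sqlite3 connection, closing it only if we opened it."""
    owns_connection = isinstance(target, str)
    con = sqlite3.connect(target) if owns_connection else target
    try:
        if need_txn and not con.in_transaction:
            # This helper started the unit of work, so it finishes it:
            # commit on success, roll back on any error.
            try:
                yield con
            except BaseException:
                con.rollback()
                raise
            else:
                con.commit()
        else:
            # The caller already manages a transaction, or none is needed.
            yield con
    finally:
        if owns_connection:
            con.close()


caller_con = sqlite3.connect(":memory:")
caller_con.execute("create table test (A integer)")

# A failure inside the block rolls the insert back.
try:
    with sqlite_con(caller_con, need_txn=True) as con:
        con.execute("insert into test values (1)")
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass
rows_after_failure = caller_con.execute("select count(*) from test").fetchone()[0]

# A clean exit commits, and the caller's connection stays open.
with sqlite_con(caller_con, need_txn=True) as con:
    con.execute("insert into test values (1)")
rows_after_success = caller_con.execute("select count(*) from test").fetchone()[0]
```

If the caller has already begun a transaction, the helper leaves commit and rollback to the caller, which matches the last branch of `_sqlalchemy_con`.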

In test_sql.py, I will take out the test classes which pass Engines to SQLDatabase, which will reduce the number of tests.

As I was looking over the tests, I noticed this interesting behavior related to transactions. I like having the ability to rollback a DataFrame.to_sql call to help maintain data integrity, but if a sqlite3.Connection is used, pandas commits the transaction. If a sqlalchemy.engine.Connection is used, then pandas does not commit the transaction. Maybe this is fine, because someone who wants the sqlalchemy behavior with a sqlite database can just make a sqlalchemy connection to the database. Here is an example script:

import sqlite3
from pandas import DataFrame
from sqlalchemy import create_engine

with sqlite3.connect(":memory:") as con:
    con.execute("create table test (A integer, B integer)")
    row_count = con.execute("insert into test values (2, 4), (5, 10)").rowcount
    if row_count > 1:
        con.rollback()
    print(con.execute("select count(*) from test").fetchall()[0][0]) # prints 0

with sqlite3.connect(":memory:") as con:
    con.execute("create table test (A integer, B integer)")
    row_count = DataFrame({'A': [2, 5], 'B': [4, 10]}).to_sql('test', con, if_exists='append', index=False)
    if row_count > 1:
        con.rollback() # does nothing, because pandas already committed the transaction.
    print(con.execute("select count(*) from test").fetchall()[0][0]) # prints 2
    
with create_engine("sqlite:///:memory:").connect() as con:
    with con.begin():
        con.exec_driver_sql("create table test (A integer, B integer)")
    try:
        with con.begin():
            row_count = DataFrame({'A': [2, 5], 'B': [4, 10]}).to_sql('test', con, if_exists='append', index=False)
            assert row_count < 2
    except AssertionError:
        pass
    print(con.exec_driver_sql("select count(*) from test").fetchall()[0][0]) # prints 0

@cdcadman cdcadman marked this pull request as ready for review October 31, 2022 07:31
@cdcadman
Contributor Author

@datapythonista @mroeschke Based on your feedback, I took out all the test duplication, and instead ran the tests with sqlalchemy 2.0.0b2 installed to ensure that this can close #40686. This PR will allow pandas to work with sqlalchemy 1.4.16 (the documented minimum version) and higher, even after sqlalchemy 2.0 is released. I found a note here (written 10/13/2022) on the timing of sqlalchemy 2.0: https://www.sqlalchemy.org/blog/2022/10/13/sqlalchemy-2.0.0b1-released/

we will likely move from beta releases into release candidates as well, anticipating a 2.0 final release after some months.

@mroeschke
Member

Thanks for your work on this @cdcadman. IMO I would still be partial to just supporting sqlalchemy 2.0 syntax/tests/min version when it becomes available.

@cdcadman
Contributor Author

cdcadman commented Nov 1, 2022

@mroeschke As a pandas/sqlalchemy user, it would be really helpful to get these changes into pandas sooner, so that I can get my code ready for sqlalchemy 2.0 sooner. This PR is making the kind of changes envisioned in the first paragraph of the sqlalchemy 2.0 migration document: https://docs.sqlalchemy.org/en/14/changelog/migration_20.html#overview .

The SQLAlchemy 2.0 transition presents itself in the SQLAlchemy 1.4 release as a series of steps that allow an application of any size or complexity to be migrated to SQLAlchemy 2.0 using a gradual, iterative process. Lessons learned from the Python 2 to Python 3 transition have inspired a system that intends to as great a degree as possible to not require any “breaking” changes, or any change that would need to be made universally or not at all.

It's understandable if you are concerned that these changes will break existing code running on sqlalchemy 1.4. I made three types of changes to pandas/tests/io/test_sql.py:

  • Changed a test involving a PostgreSQL timezone-aware data type which doesn't exist in Sqlite or MySQL. This test was failing for me in the main branch, so I fixed it before making other changes. It xfails against Sqlite and MySQL.
  • I changed the private API within pandas.io.sql, and that broke many tests that were testing the private API.
  • I changed some tests so that they can run in sqlalchemy 2.0, even though they ran fine in sqlalchemy 1.4.

Let me know if there's anything I can do to help get this merged sooner.

@mroeschke
Member

mroeschke commented Nov 3, 2022

Thanks @cdcadman. I misunderstood that sqlalchemy 1.4 already has 2.0 functionality, so we can adopt 2.0 syntax/functionality while pandas min version is 1.4.

I made three types of changes to pandas/tests/io/test_sql.py:

Could you split these into 3 separate PRs? It's difficult for me to determine which changes correspond with a certain objective. Generally, 1 PR targeting 1 type of change is easier for review.

@cdcadman
Contributor Author

@mroeschke , I split this into two commits. The tests pass after the first commit with sqlalchemy 1.4.44, but I had to modify test_sql.py to make it work with sqlalchemy 2.0.0b3.

pandas/io/sql.py Outdated
@@ -1454,8 +1467,16 @@ def run_transaction(self):
yield self.con

def execute(self, *args, **kwargs):
- """Simple passthrough to SQLAlchemy connectable"""
- return self.con.execute(*args, **kwargs)
+ """Almost a simple passthrough to SQLAlchemy Connection"""
Member

Just curious what 2.0 changes made this more complex

Contributor Author

The changes to the execute method in sqlalchemy 2.0 are described here: https://docs.sqlalchemy.org/en/14/changelog/migration_20.html#execute-method-more-strict-execution-options-are-more-prominent . My goal was to allow end users to pass either a string or a sqlalchemy expression (like table.insert()). Currently, either one works, but passing a string emits a RemovedIn20Warning.

Member

Gotcha. Looks to be that _convert_params is packing the sql and params into args and then you're unpacking it here.

Might be better if execute is always defined as def execute(self, sql: str | expression, *args, **kwargs) for better clarity?

Contributor Author

Yes, and I can also remove **kwargs based on the existing tests. Maybe I can replace *args as well. I will work on this and add some tests.

Did you have a specific type in mind for expression? I don't know what to use without importing something from sqlalchemy.sql.expression (ClauseElement | Executable)

Member

Yes, and I can also remove **kwargs based on the existing tests. Maybe I can replace *args as well. I will work on this and add some tests.

This would be great, thank you.

Did you have a specific type in mind for expression?

The docs for read_sql note that it should be a SQLAlchemy Selectable (select or text object), so maybe just an expression.Select or expression.TextClause (I don't know if we have tests for either).

Contributor Author

I've created a separate commit which makes the signature of PandasSQL.execute into def execute(self, sql: str | Select | TextClause, params=None).
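
The string-versus-expression distinction discussed in this thread can be sketched as follows, assuming SQLAlchemy 1.4 or later is installed and using an in-memory SQLite database. `run_both_ways` is a hypothetical helper and not part of pandas. It shows the two ways of running SQL text that avoid passing a bare string to `Connection.execute`: wrapping the string in `text()`, or handing it to the driver with `exec_driver_sql`.

```python
from sqlalchemy import create_engine, text


def run_both_ways(url="sqlite:///:memory:"):
    # Hypothetical helper: runs the same statement through both entry points.
    engine = create_engine(url)
    try:
        with engine.connect() as con:
            # A string wrapped in text() is an executable SQL expression.
            via_text = con.execute(text("select 1 + 1")).scalar()
            # exec_driver_sql passes the raw string straight to the DBAPI driver.
            via_driver = con.exec_driver_sql("select 1 + 1").scalar()
    finally:
        # Release the pooled connections held by the engine created here.
        engine.dispose()
    return via_text, via_driver


results = run_both_ways()
```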

@joshcita

joshcita commented Feb 3, 2023

Hi devs, if you run into problems when reading data, try this:

from sqlalchemy.sql import text

# ... your code ...
# When reading through your engine or connection, wrap the query in text():
df = pd.read_sql(text(query), con=conn)

@cdcadman
Contributor Author

cdcadman commented Feb 3, 2023

@phofl I think I addressed all the issues that got linked. Let me know if I missed anything. I think test coverage of read_sql and its variants is really good. Tests for DataFrame.to_sql are mostly indirect, so I added a direct test.

@mroeschke
Member

Could you remove the <1.4.46 in the yaml files in ci/deps directory?

Then we don't need the SQLALCHEMY_WARN_20=1 in run_tests.sh

@cdcadman
Contributor Author

cdcadman commented Feb 9, 2023

@mroeschke I reverted all the changes from this commit: fa78ea8. I think everything is working, since I only got one unrelated test failure: pandas/tests/io/parser/common/test_file_buffer_url.py::test_file_descriptor_leak[c_high]

Member

@mroeschke mroeschke left a comment

LGTM. cc @phofl merge when ready

@phofl phofl merged commit c73dc7f into pandas-dev:main Feb 9, 2023
@phofl
Member

phofl commented Feb 9, 2023

thx @cdcadman very nice!

@phofl phofl added this to the 2.0 milestone Feb 9, 2023
@cdcadman
Contributor Author

cdcadman commented Feb 9, 2023

Thanks @phofl and @mroeschke.

@amotl

amotl commented Mar 13, 2023

Hi there,

thank you so much for fixing this compatibility issue. As already referenced elsewhere (see below), and as a maintainer of the crate-python package, may I humbly ask if you plan to also ship this fix with another patch release for pandas 1.5.x?

With kind regards,
Andreas.

References

@amotl

amotl commented Mar 13, 2023

Ah, I see. That topic was discussed at #49857 (comment) ff., and the outcome is apparently that pandas 1.5.x will never support SQLAlchemy 2.x, right?

@phofl
Member

phofl commented Mar 13, 2023

Correct, there probably won’t be another 1.5.x release anyway

Successfully merging this pull request may close these issues.

ENH: Upgrade Pandas SQLAlchemy code to be compatible with 2.0+ syntax