Make pandas/io/sql.py work with sqlalchemy 2.0 #48576

cdcadman · 2022-09-15T23:28:30Z

closes ENH: Upgrade Pandas SQLAlchemy code to be compatible with 2.0+ syntax #40686
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

I have run the following with sqlalchemy 2.0.0 to confirm compatibility:

pytest pandas/tests/io/test_sql.py
mypy pandas\io\sql.py pandas\tests\io\test_sql.py --follow-imports=silent

datapythonista

Thanks for working on this @cdcadman

I'm sorry I'm not so familiar with the SQLAlchemy changes, but do you mind explaining why we should filter RemovedIn20Warnings instead of updating the code? I assume we want to be compatible with both 1.4 and 2.0, which are not compatible with each other. But it feels like it'd make more sense, for now, to use if statements to support both versions.

I may surely be missing something, if you can please expand on why this approach, that would be great. Thanks!

datapythonista · 2022-09-16T09:24:04Z

pandas/io/sql.py

+        elif "statement" in kwargs:
+            statement = kwargs["statement"]
+        else:
+            statement = None


Isn't this the same as statement = kwargs.get("statement")?

Yes, and I just made a new commit to address this.

cdcadman · 2022-09-16T11:06:13Z

I'm sorry I'm not so familiar with the SQLAlchemy changes, but do you mind explaining why we should filter RemovedIn20Warnings instead of updating the code? I assume we want to be compatible with both 1.4 and 2.0, which are not compatible with each other. But it feels like it'd make more sense, for now, to use if statements to support both versions.

My approach is to update pandas/io/sql.py so that it can support both versions. I also had to update pandas/tests/io/test_sql.py, because pandas users will need to run DataFrame.to_sql within a transaction if they pass a sqlalchemy Connection under sqlalchemy 2.0. This is how I've approached the transition to sqlalchemy 2.0 in my own work: by continuing to use 1.4 while ensuring that my code would still work under 2.0.

The RemovedIn20Warning provides a way of testing that the code will work under sqlalchemy 2.0, while the tests run under sqlalchemy 1.4. Another approach is to pass future=True to sqlalchemy.create_engine in order to create 2.0-style engines for testing. I hadn't thought of expanding the tests to run under both 1.4 and 2.0 sqlalchemy engines and connections, but that might be a better approach to testing the code during the transition period, since it does not require the SQLALCHEMY_WARN_20 environment variable to be set prior to the start of the tests.

datapythonista · 2022-09-17T11:57:36Z

Thanks a lot for all the information @cdcadman, that's useful. There is something I still don't fully understand. Let me explain in detail, and please correct me if I'm wrong, I can surely be missing something. Let me use this example:

# 1.4
session.query(User).get(42)

# 2.0
session.get(User, 42)

I see two different cases.

Case A: 2.0 syntax also works in 1.4

This case is easy, we should simply replace one by the other, right?

Case B: 2.0 syntax doesn't work on 1.4

Then, while we want to keep compatibility with 1.4, we should have something like (code won't work, just to illustrate):

if sqlalchemy.__version__ < '2.0':
    session.query(User).get(42)
else:
    session.get(User, 42)

When we test a pandas function, if we have all the code written this way, once we finished all the changes, we shouldn't have any warning being raised, so we wouldn't need to filter any warning in our code.
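The version check above is flagged as illustrative only. One reason it would not work as written is that comparing version strings is lexicographic, so multi-digit components sort incorrectly. A minimal sketch of comparing on numeric components instead, assuming hypothetical helpers `version_tuple` and `is_sqlalchemy_2` (neither is part of pandas or SQLAlchemy):

```python
from __future__ import annotations

import re


def version_tuple(version: str) -> tuple[int, ...]:
    """Hypothetical helper: leading numeric components of a version string.

    Pre-release suffixes such as the "b3" in "2.0.0b3" are ignored.
    """
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def is_sqlalchemy_2(version: str) -> bool:
    """True when the given version string is 2.0 or later."""
    return version_tuple(version) >= (2, 0)


# Plain string comparison misorders multi-digit components.
assert "10.0" < "2.0"
assert version_tuple("10.0") > version_tuple("2.0")

assert not is_sqlalchemy_2("1.4.44")
assert is_sqlalchemy_2("2.0.0b3")
```

In real code the argument would be `sqlalchemy.__version__`; a dedicated version-parsing library would handle pre-release ordering more carefully than this sketch does.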

If we agree on this (please let me know if I missed anything), then the next question is how we make these changes. If I understand correctly, the flag you mention makes SQLAlchemy start raising warnings about deprecations. I guess it can be activated locally to address the warnings, and then PRs can be opened with just the fixes. Or we can activate it in our CI, so the warnings are visible in our tests. Either way is probably fine.

If the above is correct, then I don't understand why we need to filter warnings in our code. Do you mind explaining what I'm missing please?

cdcadman · 2022-09-17T14:17:11Z

@datapythonista I agree with everything you wrote. I don't think we should censor RemovedIn20Warning. The only warnings filter I put into any code is this code in pandas/tests/io/test_sql.py:

if SQLALCHEMY_INSTALLED:
    # This only matters if the environment variable SQLALCHEMY_WARN_20 is set to 1.
    pytestmark = pytest.mark.filterwarnings("error::sqlalchemy.exc.RemovedIn20Warning")

This causes tests to fail if they raise RemovedIn20Warning.
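The effect of that filter can be reproduced with the standard library alone. In this minimal sketch, `RemovedIn20Warning` is a local stand-in class rather than the real `sqlalchemy.exc.RemovedIn20Warning`, and `legacy_call` is a hypothetical function:

```python
import warnings


class RemovedIn20Warning(DeprecationWarning):
    """Local stand-in for sqlalchemy.exc.RemovedIn20Warning (assumption)."""


def legacy_call() -> str:
    """Hypothetical function that goes through a deprecated code path."""
    warnings.warn("legacy API used", RemovedIn20Warning, stacklevel=2)
    return "ok"


def fails_under_error_filter() -> bool:
    """Return True if legacy_call raises once the warning is escalated."""
    with warnings.catch_warnings():
        # Same idea as pytest's "error::<category>" filter:
        # raise the warning as an exception instead of reporting it.
        warnings.simplefilter("error", category=RemovedIn20Warning)
        try:
            legacy_call()
        except RemovedIn20Warning:
            return True
    return False


print(fails_under_error_filter())
```

Because the filter is scoped by `catch_warnings`, the escalation does not leak into code that runs after the block.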

Your comment about checking sqlalchemy versions reminded me that it is possible within sqlalchemy version 1.4 to obtain a 2.0-style sqlalchemy engine by passing future=True to sqlalchemy.create_engine. Any situation which raises RemovedIn20Warning should fail if it runs with a 2.0-style engine. Testing with 2.0-style engines is better than testing based on the RemovedIn20Warning. For example, when I tried running the tests after adding future=True to create_engine, one of the tests froze, and to fix it I had to use the connection as a context manager.

So I'd like to redo this PR in the next few days. I don't expect pandas/io/sql.py to change, but I will do something different with the test file:

- Don't change test code to eliminate RemovedIn20Warning, so that I can ensure the existing tests still pass with 1.x-style sqlalchemy engines (future=False). That will provide evidence that downstream users' code won't break based on the edits I make to pandas/io/sql.py.
- Add tests with 2.0-style engines. I can think of 3 different ways to do this:
  - Use pytest.mark.parametrize("future", [True, False]) on existing tests and classes. I will probably do this.
  - Create duplicate fixtures and classes for sqlalchemy 2.0 tests.
  - Create a duplicate test file test_sql_future.py.

cdcadman · 2022-09-19T08:34:18Z

Update: pandas/io/sql.py will have to change significantly from how I currently have it in this PR.

cdcadman · 2022-09-23T17:10:26Z

@datapythonista I modified this PR's code, title, and original comment, and it is again ready for review. Since I am now testing 2.0-style sqlalchemy connectables instead of checking for warnings, I think this can close #40686. Since I did make a lot of changes, would you prefer that I start a new PR instead?

mroeschke · 2022-09-23T17:30:40Z

Is it known when SQLAlchemy 2.0 will be released?

If it comes before pandas 2.0 (late 2022/early 2023), maybe we can just bump the minimum version of SQLAlchemy to 2.0 and adopt the new syntax. I'm not in love with duplicating all the testing and having to support 2 versions of an optional dependency with different syntax.

xref: #44823

cdcadman · 2022-09-23T21:55:38Z

@mroeschke I haven't seen a release date for sqlalchemy 2.0. Instead of duplicating all the tests, I could make the future argument of create_engine depend on either a global constant in test_sql.py or an environment variable. The future argument will be supported in sqlalchemy 2.0 and required to be True. I could also put this PR on hold until the beta release of sqlalchemy 2.0 comes out.
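A minimal sketch of the environment-variable option, assuming a hypothetical variable name `PANDAS_TEST_SQLALCHEMY_FUTURE` (nothing in this PR defines it). The returned dict is meant to be passed as keyword arguments to `sqlalchemy.create_engine`:

```python
from __future__ import annotations

import os
from collections.abc import Mapping

# Hypothetical variable name, chosen only for this illustration.
ENV_VAR = "PANDAS_TEST_SQLALCHEMY_FUTURE"


def engine_kwargs(environ: Mapping[str, str] = os.environ) -> dict[str, bool]:
    """Build keyword arguments for create_engine from the environment.

    future=True requests a 2.0-style engine; any value other than "1"
    (including an unset variable) keeps the 1.x-style default.
    """
    return {"future": environ.get(ENV_VAR, "0") == "1"}


print(engine_kwargs({ENV_VAR: "1"}))
print(engine_kwargs({}))
```

Taking the mapping as a parameter keeps the helper easy to exercise without touching the real process environment.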

datapythonista · 2022-09-25T13:09:57Z

pandas/tests/io/test_sql.py

@@ -2387,12 +2438,14 @@ class _TestMySQLAlchemy:

    flavor = "mysql"
    port = 3306
+    future = False


I think what we usually do to manage two versions of the same library is to add a flag in pandas.compat, and sometimes a function to handle both versions behavior, that then we call from our code. Do you think this could be helpful and avoid the test duplication and passing the future argument?

I think the test duplication with the future argument is different from what pandas.compat provides. The future argument duplicates the type of sqlalchemy connectables that can be passed to read_sql_query and DataFrame.to_sql. Currently, the tests are already duplicated to ensure that both sqlalchemy.engine.Engine and sqlalchemy.engine.Connection can be passed to these methods, and the fact that these are both allowed also makes the code in pandas.io.sql more complicated. In sqlalchemy 1.4, each of these connectables has a subclass, sqlalchemy.future.Engine and sqlalchemy.future.Connection. In pandas.io.sql, I did not differentiate between future and non-future, but the test duplication ensures that all 4 of these sqlalchemy connectables work. Once pandas decides to require sqlalchemy 2.0, the future/non-future duplication will be unnecessary.

fangchenli · 2022-10-26T14:58:46Z

The 2.0 beta is out.

sqlalchemy/sqlalchemy#8631

cdcadman · 2022-10-28T11:56:52Z

I'm planning to make some changes to this PR. Firstly, I noticed that pandas.io.sql.execute is documented, right above this line: https://pandas.pydata.org/docs/user_guide/io.html?highlight=sql%20execute#engine-connection-examples . As it stands, my PR would make this return a context manager instead of a Results Iterable, and I don't think I need to make this change, so I will change it back.

I plan to make SQLDatabase accept only a SQLAlchemy Connection and not an Engine. I would change pandasSQL_builder into a generating function, decorated by contextlib.contextmanager, so that it can dispose of the Engine that is created if the connectable is a string. A new argument, need_txn will be set to True by to_sql, and otherwise be False. An advantage of this approach is that I can begin a transaction if the connectable is a Connection which is not already in a transaction.

@contextmanager
def pandasSQL_builder(
    con,
    schema: str | None = None,
    need_txn: bool = False,
) -> Iterator[SQLDatabase] | Iterator[SQLiteDatabase]:
    """
    Convenience function to return the correct PandasSQL subclass based on the
    provided parameters.  Also creates a sqlalchemy connection and transaction
    if necessary.
    """
    import sqlite3
    import warnings

    if isinstance(con, sqlite3.Connection) or con is None:
        yield SQLiteDatabase(con)
    else:
        sqlalchemy = import_optional_dependency("sqlalchemy", errors="ignore")

        if sqlalchemy is not None and isinstance(con, (str, sqlalchemy.engine.Connectable)):
            with _sqlalchemy_con(con, need_txn) as con:
                yield SQLDatabase(con, schema=schema)
        elif isinstance(con, str) and sqlalchemy is None:
            raise ImportError("Using URI string without sqlalchemy installed.")
        else:

            warnings.warn(
                "pandas only supports SQLAlchemy connectable (engine/connection) or "
                "database string URI or sqlite3 DBAPI2 connection. "
                "Other DBAPI2 objects are not tested. Please consider using SQLAlchemy.",
                UserWarning,
                stacklevel=find_stack_level(),
            )
            yield SQLiteDatabase(con)


@contextmanager
def _sqlalchemy_con(connectable, need_txn: bool):
    """Create a sqlalchemy connection and a transaction if necessary."""
    import sqlalchemy

    if isinstance(connectable, str):
        engine = sqlalchemy.create_engine(connectable)
        try:
            with engine.connect() as con:
                if need_txn:
                    with con.begin():
                        yield con
                else:
                    yield con
        finally:
            engine.dispose()
    elif isinstance(connectable, sqlalchemy.engine.Engine):
        with connectable.connect() as con:
            if need_txn:
                with con.begin():
                    yield con
            else:
                yield con
    else:
        if need_txn and not connectable.in_transaction():
            with connectable.begin():
                yield connectable
        else:
            yield connectable
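The same ownership pattern can be tried without SQLAlchemy. The sketch below uses the standard-library `sqlite3` module as a stand-in and is not the pandas implementation; `managed_connection` and `_txn_if_needed` are hypothetical names. A string is treated as a database path whose connection is opened and closed by the helper, while an existing connection is left open for the caller.

```python
from __future__ import annotations

import sqlite3
from collections.abc import Iterator
from contextlib import contextmanager


@contextmanager
def _txn_if_needed(
    con: sqlite3.Connection, need_txn: bool
) -> Iterator[sqlite3.Connection]:
    """Wrap the block in a transaction unless one is already open."""
    if need_txn and not con.in_transaction:
        with con:  # commits on success, rolls back on exception
            yield con
    else:
        yield con


@contextmanager
def managed_connection(
    connectable: str | sqlite3.Connection, need_txn: bool = False
) -> Iterator[sqlite3.Connection]:
    """Yield a connection, closing it only if it was created here."""
    if isinstance(connectable, str):
        con = sqlite3.connect(connectable)
        try:
            with _txn_if_needed(con, need_txn) as active:
                yield active
        finally:
            con.close()
    else:
        with _txn_if_needed(connectable, need_txn) as active:
            yield active


with managed_connection(":memory:", need_txn=True) as con:
    con.execute("create table test (A integer)")
    con.execute("insert into test values (1)")
    count = con.execute("select count(*) from test").fetchone()[0]
print(count)
```

If the caller passes a connection that is already in a transaction, the helper neither commits nor rolls back, which leaves that decision with the caller.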

In test_sql.py, I will take out the test classes which pass Engines to SQLDatabase, which will reduce the number of tests.

As I was looking over the tests, I noticed this interesting behavior related to transactions. I like having the ability to rollback a DataFrame.to_sql call to help maintain data integrity, but if a sqlite3.Connection is used, pandas commits the transaction. If a sqlalchemy.engine.Connection is used, then pandas does not commit the transaction. Maybe this is fine, because someone who wants the sqlalchemy behavior with a sqlite database can just make a sqlalchemy connection to the database. Here is an example script:

import sqlite3
from pandas import DataFrame
from sqlalchemy import create_engine

with sqlite3.connect(":memory:") as con:
    con.execute("create table test (A integer, B integer)")
    row_count = con.execute("insert into test values (2, 4), (5, 10)").rowcount
    if row_count > 1:
        con.rollback()
    print(con.execute("select count(*) from test").fetchall()[0][0]) # prints 0

with sqlite3.connect(":memory:") as con:
    con.execute("create table test (A integer, B integer)")
    row_count = DataFrame({'A': [2, 5], 'B': [4, 10]}).to_sql('test', con, if_exists='append', index=False)
    if row_count > 1:
        con.rollback() # does nothing, because pandas already committed the transaction.
    print(con.execute("select count(*) from test").fetchall()[0][0]) # prints 2
    
with create_engine("sqlite:///:memory:").connect() as con:
    with con.begin():
        con.exec_driver_sql("create table test (A integer, B integer)")
    try:
        with con.begin():
            row_count = DataFrame({'A': [2, 5], 'B': [4, 10]}).to_sql('test', con, if_exists='append', index=False)
            assert row_count < 2
    except AssertionError:
        pass
    print(con.exec_driver_sql("select count(*) from test").fetchall()[0][0]) # prints 0

cdcadman · 2022-10-31T13:14:31Z

@datapythonista @mroeschke Based on your feedback, I took out all the test duplication, and instead ran the tests with sqlalchemy 2.0.0b2 installed to ensure that this can close #40686. This PR will allow pandas to work with sqlalchemy 1.4.16 (the documented minimum version) and higher, even after sqlalchemy 2.0 is released. I found a note here (written 10/13/2022) on the timing of sqlalchemy 2.0: https://www.sqlalchemy.org/blog/2022/10/13/sqlalchemy-2.0.0b1-released/

we will likely move from beta releases into release candidates as well, anticipating a 2.0 final release after some months.

mroeschke · 2022-10-31T17:20:27Z

Thanks for your work on this @cdcadman. IMO I would still be partial to just supporting sqlalchemy 2.0 syntax/tests/min version when it becomes available.

cdcadman · 2022-11-01T16:20:38Z

@mroeschke As a pandas/sqlalchemy user, it would be really helpful to get these changes into pandas sooner, so that I can get my code ready for sqlalchemy 2.0 sooner. This PR is making the kind of changes envisioned in the first paragraph of the sqlalchemy 2.0 migration document: https://docs.sqlalchemy.org/en/14/changelog/migration_20.html#overview .

The SQLAlchemy 2.0 transition presents itself in the SQLAlchemy 1.4 release as a series of steps that allow an application of any size or complexity to be migrated to SQLAlchemy 2.0 using a gradual, iterative process. Lessons learned from the Python 2 to Python 3 transition have inspired a system that intends to as great a degree as possible to not require any “breaking” changes, or any change that would need to be made universally or not at all.

It's understandable if you are concerned that these changes will break existing code running on sqlalchemy 1.4. I made three types of changes to pandas/tests/io/test_sql.py:

- I changed a test involving a PostgreSQL timezone-aware data type which doesn't exist in SQLite or MySQL. This test was failing for me in the main branch, so I fixed it before making other changes. It xfails against SQLite and MySQL.
- I changed the private API within pandas.io.sql, and that broke many tests that were testing the private API.
- I changed some tests so that they can run in sqlalchemy 2.0, even though they ran fine in sqlalchemy 1.4.

Let me know if there's anything I can do to help get this merged sooner.

mroeschke · 2022-11-03T17:45:05Z

Thanks @cdcadman. I misunderstood that sqlalchemy 1.4 already has 2.0 functionality, so we can adopt 2.0 syntax/functionality while pandas min version is 1.4.

I made three types of changes to pandas/tests/io/test_sql.py:

Could you split these into 3 separate PRs? It's difficult for me to determine which changes correspond with a certain objective. Generally, 1 PR targeting 1 type of change is easier for review.

cdcadman · 2022-11-19T04:06:05Z

@mroeschke , I split this into two commits. The tests pass after the first commit with sqlalchemy 1.4.44, but I had to modify test_sql.py to make it work with sqlalchemy 2.0.0b3.

mroeschke · 2022-11-21T19:34:02Z

pandas/io/sql.py

@@ -1454,8 +1467,16 @@ def run_transaction(self):
        yield self.con

    def execute(self, *args, **kwargs):
-        """Simple passthrough to SQLAlchemy connectable"""
-        return self.con.execute(*args, **kwargs)
+        """Almost a simple passthrough to SQLAlchemy Connection"""


Just curious what 2.0 changes made this more complex

The changes to the execute method in sqlalchemy 2.0 are described here: https://docs.sqlalchemy.org/en/14/changelog/migration_20.html#execute-method-more-strict-execution-options-are-more-prominent . My goal was to allow end users to pass either a string or a sqlalchemy expression (like table.insert()). Currently, either one works, but passing a string emits a RemovedIn20Warning.

Gotcha. Looks to be that _convert_params is packing the sql and params into args and then you're unpacking it here.

Might be better if execute is always defined as def execute(self, sql: str | expression, *args, **kwargs) for better clarity?

Yes, and I can also remove **kwargs based on the existing tests. Maybe I can replace *args as well. I will work on this and add some tests.

Did you have a specific type in mind for expression? I don't know what to use without importing something from sqlalchemy.sql.expression (ClauseElement | Executable)

Yes, and I can also remove **kwargs based on the existing tests. Maybe I can replace *args as well. I will work on this and add some tests.

This would be great, thank you.

Did you have a specific type in mind for expression?

The docs for read_sql note that it should be a SQLAlchemy Selectable (select or text object), so maybe just an expression.Select or expression.TextClause (I don't know if we have tests for either).

I've created a separate commit which makes the signature of PandasSQL.execute into def execute(self, sql: str | Select | TextClause, params=None).

joshcita · 2023-02-03T03:06:33Z

Hi devs, if you have a problem when you try to read data, try wrapping your query in text():
from sqlalchemy.sql import text

# ... your code ...
# pass the text()-wrapped query together with your connection
df = pd.read_sql(text(query), con=conn)

cdcadman · 2023-02-03T04:52:13Z

@phofl I think I addressed all the issues that got linked. Let me know if I missed anything. I think test coverage of read_sql and its variants is really good. Tests for DataFrame.to_sql are mostly indirect, so I added a direct test.

mroeschke · 2023-02-08T21:41:10Z

Could you remove the <1.4.46 in the yaml files in ci/deps directory?

Then we don't need the SQLALCHEMY_WARN_20=1 in run_tests.sh

cdcadman · 2023-02-09T14:38:01Z

@mroeschke I reverted all the changes from this commit: fa78ea8. I think everything is working, since I only got one unrelated test failure: pandas/tests/io/parser/common/test_file_buffer_url.py::test_file_descriptor_leak[c_high]

mroeschke

LGTM. cc @phofl merge when ready

phofl · 2023-02-09T17:35:22Z

thx @cdcadman very nice!

cdcadman · 2023-02-09T18:33:32Z

Thanks @phofl and @mroeschke.

amotl · 2023-03-13T16:21:33Z

Hi there,

thank you so much for fixing this compatibility issue. As already referenced elsewhere (see below), and as a maintainer of the crate-python package, may I humbly ask if you plan to also ship this fix with another patch release for pandas 1.5.x?

With kind regards,
Andreas.

References

amotl · 2023-03-13T16:22:21Z

Ah, I see. That topic was discussed at #49857 (comment) ff., and the outcome is apparently that pandas 1.5.x will never support SQLAlchemy 2.x, right?

phofl · 2023-03-13T16:23:08Z

Correct, there probably won’t be another 1.5.x release anyway

datapythonista added the Testing, IO SQL, and Warnings labels Sep 16, 2022

datapythonista reviewed Sep 16, 2022

cdcadman marked this pull request as draft September 16, 2022 16:11

cdcadman changed the title ~~Eliminate RemovedIn20Warning and other errors from pandas/tests/io/test_sql.py~~ Make pandas/io/sql.py work with 2.0-style sqlalchemy connectables Sep 20, 2022

cdcadman marked this pull request as ready for review September 20, 2022 13:12

datapythonista reviewed Sep 25, 2022

cdcadman marked this pull request as draft October 27, 2022 15:36

cdcadman changed the title ~~Make pandas/io/sql.py work with 2.0-style sqlalchemy connectables~~ Make pandas/io/sql.py work with sqlalchemy 2.0 Oct 27, 2022

cdcadman marked this pull request as ready for review October 31, 2022 07:31

cdcadman mentioned this pull request Nov 4, 2022

Refactor sqlalchemy code in pandas.io.sql to help prepare for sqlalchemy 2.0. #49531

Merged

5 tasks

This was referenced Nov 14, 2022

TYP:Replace union of subclasses with base class. #49587

Merged

TST: Refactor sql test classes. #49757

Merged

mroeschke reviewed Nov 21, 2022

mroeschke reviewed Nov 21, 2022

Chuck Cadman added 3 commits February 2, 2023 17:29

DOC: Update for sqlalchemy 2.0 (#51105)

cff5c95

TST: Add test of to_sql (#51086)

6f90992

TST: Add reference (#51015)

c7c4f63

Fix merge conflict.

112feae

benrutter mentioned this pull request Feb 8, 2023

BUG: ObjectNotExecutableError reading from MySQL with read_sql and SQL string after Sqlalchemy 2.0.0 release #51061

Open

3 tasks

mroeschke reviewed Feb 8, 2023

Chuck Cadman added 4 commits February 8, 2023 15:36

Remove restriction on sqlalchemy version.

9fa404c

Remove SQLALCHEMY_WARN_20 environment variable from tests.

214c6a0

Add comment on what self.exit_stack does.

1950f8f

Update user guide for sqlalchemy 2.0

293f422

mroeschke approved these changes Feb 9, 2023

phofl approved these changes Feb 9, 2023

phofl merged commit c73dc7f into pandas-dev:main Feb 9, 2023

phofl added this to the 2.0 milestone Feb 9, 2023

cdcadman deleted the sql_fixes branch February 9, 2023 18:34

joshua-oss mentioned this pull request Feb 12, 2023

Problem with any SQL query due to pandas/SQLAlchemy opendp/smartnoise-sdk#529

Closed

This was referenced Feb 13, 2023

Bump sqlalchemy from 1.4.46 to 2.0.1 sanger/lighthouse#713

Merged

Bump sqlalchemy from 1.4.46 to 2.0.1 sanger/crawler#738

Merged

datajoely mentioned this pull request Mar 13, 2023

Fail to read simple .db file with sqlite3 kedro-org/kedro#2312

Closed

This was referenced Apr 4, 2023

Unpin sqlalchemy once issue is fixed huggingface/datasets#5477

Closed

Unpins sqlAlchemy huggingface/datasets#5595

Closed
