Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Python] Type checking support #32609

Open
asfimport opened this issue Aug 8, 2022 · 26 comments
Open

[Python] Type checking support #32609

asfimport opened this issue Aug 8, 2022 · 26 comments

Comments

@asfimport
Copy link

mypy and static type checking

As of Python3.6, it has been possible to introduce typing information in the code. This became immensely popular in a short period of time. Shortly after, the tool mypy arrived and this has become the industry standard for static type checking inside Python. It is able to check very quickly for invalid types which makes it possible to serve as a pre-commit. It has raised many bugs that I did not see myself and has been a very valuable tool.

Now what does this mean for PyArrow?

When we run mypy on code that uses PyArrow, you will get error message as follows:

some_util_using_pyarrow/hdfs_utils.py:5: error: Skipping analyzing "pyarrow": module is installed, but missing library stubs or py.typed marker
some_util_using_pyarrow/hdfs_utils.py:9: error: Skipping analyzing "pyarrow": module is installed, but missing library stubs or py.typed marker
some_util_using_pyarrow/hdfs_utils.py:11: error: Skipping analyzing "pyarrow.fs": module is installed, but missing library stubs or py.typed marker

More information is available here: https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-library-stubs-or-py-typed-marker

You can solve this in three ways:

  1. Ignore the message. This, however, will put all types from PyArrow to Any, making it unable to find user errors with the PyArrow library

  2. Create a Python stub file. This is what previously used to be the standard, however, it no longer a popular option. This is because stubs are extra, next to the source code, while you can also inline the code with type hints, which brings me to our third option.

  3. Create a py.typed file and use inline type hints. This is the most popular option today because it requires no extra files (except for the py.typed file), allows all the type hints to be with the code (like now in the documentation) and not only provides your users but also the developers of the library themselves with type hints (and hinting of issues inside your IDE).

     

    My personal opinion already shines through the options, it is 3 as this has shortly become the industry standard since the introduction.

    What should we do?

    I'd very much like to work on this, however, I don't feel like wasting time. Therefore, I am raising this ticket to see if this had been considered before or if we just didn't get to this yet.

    I'd like to open the discussion here:

  4. Do you agree with number ARROW-10: Fix mismatch of javadoc names and method parameters #3 as type hints.

  5. Should we remove the documentation annotations for the type hints given they will be inside the functions? Or should we keep it and specify it in the code? Which would make it double.

     

Reporter: Jorrick Sleijster

Note: This issue was originally created as ARROW-17335. Please see the migration documentation for further details.

@asfimport
Copy link
Author

Antoine Pitrou / @pitrou:
Much of our code is in Cython. If you put the type annotations inline, any change or addition to typing info will require a recompile.

(also I'm assuming that Cython is compatible with type annotations, but I'm not 100% sure)

@asfimport
Copy link
Author

@asfimport
Copy link
Author

Jorrick Sleijster:
Hi @pitrou,

Thanks for reaching out. I checked out the code indeed and I have to say, it's super clean and I was overwhelmed.

I have no worked with Cython code bases before but if we just changed the \*.py files with type annotations, that would that also require a recompilation?

 

Let's take this line for example:

def ls(self, path, detail=False):

def ls(self, path, detail=False):

That would become

def ls(self, path: str, detail: bool = False) -> List[Dict[str, Any]]:

Would that change require recompilation?

@asfimport
Copy link
Author

Antoine Pitrou / @pitrou:
No, that wouldn't require recompilation, but some APIs are implemented in Cython.

@asfimport
Copy link
Author

David Li / @lidavidm:
Arguably also duplicates ARROW-8175

@asfimport
Copy link
Author

David Li / @lidavidm:
And cython/cython#3818 is relevant

@asfimport
Copy link
Author

Jorrick Sleijster:
Well it's not really a duplicate of ARROW-8175.

The difference lies in the fact that that ticket is focused perform type checking on the PyArrow code base and ensuring all the types are valid inside the library.

My ticket is about using the PyArrow code base as a library and ensuring we can type check projects that are using PyArrow by using type annotations on functions specified inside the PyArrow codebase.

 

I think PySpark 3.2.2 was a nice example of having stubs: https://github.com/apache/spark/tree/v3.2.2/python/pyspark

I'm pretty sure they created them manually though (and note: this is a Java bindings and not C, but I don't think that's a lot of difference in terms of stubs).

However, they changed it in their latest version to ditch the pyi files. I think this is because they have a lot more percentage of the code in Python compared to Java.

@asfimport
Copy link
Author

Jorrick Sleijster:
I suppose we can just give it a start and see how far we get?

@asfimport
Copy link
Author

Antoine Pitrou / @pitrou:
Yes, we can probably give it a start. Probably start with the most basic utilities, for example in memory.pxi.

@asfimport
Copy link
Author

Joris Van den Bossche / @jorisvandenbossche:
AFAIK it is not (yet) possible to do inline type annotations in cython code (for type checking purposes, see the links in #6676 as well), so I think that basically means we need to use the stub file approach?

(but I certainly agree it's fine to give this a go with a small subset, see how that looks, and discuss further from there)

@asfimport
Copy link
Author

Joris Van den Bossche / @jorisvandenbossche:

Well it's not really a duplicate of ARROW-8175.

The difference lies in the fact that that ticket is focused perform type checking on the PyArrow code base and ensuring all the types are valid inside the library.

My ticket is about using the PyArrow code base as a library and ensuring we can type check projects that are using PyArrow by using type annotations on functions specified inside the PyArrow codebase.

It's indeed not exactly the same. But in practice, I think both aspects are very much related and we could (should?) do those at the same time. If we start adding type annotations so that pyarrow can used by other projects that are type-checked, it would be good that at the same time we also check that those type annotations we are adding are correct (although, based on my limited experience with this, just running mypy on the code base is always a bit limited I suppose, as it doesn't guarantee the type checks are actually correct? (it only might find some incorrect ones))

@asfimport
Copy link
Author

Jorrick Sleijster:
I think you make a good point Joris but as you mention, I don't think we can use inline type annotations :'(. Therefore, we'd have to use generated stubs, which we can't use for checking whether the underlying code actually has the right types.

I think we will therefore have to wait (or take action ourselves upstream) until mypy or cython implements decent support for Python stub generation.

Hence, I think it's better to threat them separate for now and start of with stub generation, which can then later be replaced by a better implementation once available.

@asfimport
Copy link
Author

Joris Van den Bossche / @jorisvandenbossche:
Mypy doesn't use pyi files when eg doing mypy pyarrow?

@asfimport
Copy link
Author

Jorrick Sleijster:
It does, in fact, the pyi files are the python-interfaces / python stubs as mentioned here: https://mypy.readthedocs.io/en/stable/stubs.html.

As far as I can see from this page, these are only used for external project checking it's types against this project and not for internal project checking.

@asfimport
Copy link
Author

Uwe Korn / @xhochy:
My initial efforts with regards to typing the code base stopped as the inline type annotations (and their automatic extraction into pyi) is the crucial component here. All the important data structures of pyarrow are implemented in Cython, only a very small fraction of the externally visible API is in plain Python. Thus as long as the referenced Cython issue isn't solved, I don't think it makes sense to progress on the Arrow side.

@asfimport
Copy link
Author

Jorrick Sleijster:
Yes I guess it will also be a lot of work to keep it up to date over time if we do major refactors.

Looker at other projects, this seems like a very cumbersome and tedious thing to work on.

For example, this one took a very long time to get right as well; pysam-developers/pysam#1008

I suppose the best it to postpone until python/mypy#7542 is implemented then.

@asfimport
Copy link
Author

Joris Van den Bossche / @jorisvandenbossche:
FYI pandas does use pyi files next to cython files that have to be kept updated manually if the cython file changes. But, in pandas the cython code is mostly for internal functionality, not for public facing classes and methods.

@wjones127
Copy link
Member

Not planning on looking at this soon. But if someone does, one approach would be to use this branch of stubgen (python/mypy#13284) so that includes the docstrings, and then one can fill in the types based on them. I've heard from @mariusvniekerk that this workflow has helped when he generated type stubs for our Flight submodule for his own project.

On detecting regressions in type stubs for Cython, I think we should be able to catch these as long as we are running type checking on our test suites.

@Fokko
Copy link
Contributor

Fokko commented Feb 9, 2023

Great to see the discussion here. I would be in favor of having type annotations. It will make the code more robust, and also helps the user to see what arguments can be passed in.

I'm working on #33974 and figured that also type checking will help to make sure that the docstrings are up to date (if you update a type, you should update the docstring as well).

@adriangb
Copy link

adriangb commented Mar 4, 2023

Working with the pyarrow library this really sticks out as a sore thumb. Good type stubs and docstrings are IMO just as valuable as API documentation (which the project does very well) because you don't need to leave your IDE and open 10 tabs to find the information.

A couple thoughts:

  • I think this is a case of something is better than nothing. I would rather have information about the most common / stable APIs even if there is no info or slightly incorrect info for experimental APIs.
  • Although it would be manual work if you made type stubs and used them in your own tests at least to some extent.changes that require an update would at least get flagged in CI (as pointed about above already).

@adriangb
Copy link

adriangb commented May 8, 2024

@jorisvandenbossche I'm sorry to ping you about this but since we've interacted before you felt like the least worst person to personally annoy. Could we get some thoughts from the team on this? I really think it would not be that hard to get started and improve the typing stubs over time. Based on the 👍🏻 in my comment above there's a good amount of interest and it's easy pickings for 3rd party contributors.

@wjones127
Copy link
Member

I think a good place to start would be to choose a submodule with a decently limited API (maybe fs or flight? Or maybe just data types?) and start there. I think the things we want to demonstrate is:

  1. How much work is it to do? Once we have one example PR of how to add type stubs, that becomes the template for others to make contributions to them. These can become a organized list of good-first-issues that we can crowdsource contributions for to get good coverage quickly, after we have figured out the initial approach.
  2. Do we have a reliable way to validate them in CI? We would want to make sure the stubs are both accurate and cover the whole module.

@adriangb
Copy link

adriangb commented May 8, 2024

How much work is it to do? Once we have one example PR of how to add type stubs, that becomes the template for others to make contributions to them

I'm happy to try to make an initial PR. My approach would be to add a py.typed file and add a .pyi file for some module (happy to do whichever is chosen).

Do we have a reliable way to validate them in CI? We would want to make sure the stubs are both accurate and cover the whole module.

IMO the best way is to force typing into tests. That's also good dogfooding. But it can result in a lot of code churn (need to rewrite a lot of the tests I imagine). Type checkers do allow whitelisting files so it would probably make sense to pick a very small test file to whitelist and either fix all stubs it needs or add # type: ignores. That said, I do think something is better than nothing and these stubs won't impact runtime behavior so even if they are not totally accurate or are missing module members it's still better than no typing at all.

@ianmcook
Copy link
Member

FWIW: I recently added a method to PySpark to return a DataFrame as a PyArrow Table (apache/spark#45481). Now I'm trying to add support for going in the other direction (apache/spark#46529) but I'm stymied by type checking problems, including the problem described at #24376 (comment).

@paleolimbot
Copy link
Member

paleolimbot commented May 15, 2024

I've experimented a little with this in nanoarrow (where I'd at least like to get autocomplete for methods on some objects that have to be implemented in Cython) and found a few things that might be helpful:

  • mypy can generate stubs that include docstrings for class methods. This does not include documentation for the classes themselves, nor does it include functions (nor class) implemented in Cython, nor does it include any typing hints provided inline in Cython.
  • One low-maintenance intervention to improve the situation is to move or copy some function definitions to Python (e.g., def int32() -> DataType).
  • Pretty much every solution I found listed on google to generate stubs from Cython generates very bad stubs.

I would lean towards some kind of programmatic approach where type hinting is specified in a docstring or something to minimize the pain of keeping .pxi files synced. Off the top of my head, I would probably start with the mypy-generated stubs, parse it into an ast, do some transformation to add type hints based on some pattern in the docstring or parsing of the argument list (Cython methods provide the file number and line number of the definition). That is almost certainly a can of worms (but so are any alternatives I know about).

The PR adding very basic mypy stubs to nanoarrow is here: apache/arrow-nanoarrow#468

@westonpace
Copy link
Member

westonpace commented Jun 6, 2024

I'm happy to try to make an initial PR. My approach would be to add a py.typed file and add a .pyi file for some module (happy to do whichever is chosen).

@adriangb

Looking through this discussion I don't think anyone is opposed to an initial PR and it would be welcome. Picking a small module would be good to prove out the actual workflow.

I think the next step then is to provide typings for RecordBatch, Table, Schema, Field, and Array. I feel like these are the most essential types and if we had type hints for just these classes I feel like the user experience would be much better.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants