Interoperability with Django ORM #1137
Comments
We would indeed need to introduce some joblib level public API to make it possible for the users to register a callable used to initialize the workers, depending on the backend type (in particular to call |
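For illustration only, here is a minimal sketch of the "callable used to initialize the workers" idea, written against the standard library's `concurrent.futures.ProcessPoolExecutor` (which already accepts an `initializer`) rather than joblib, since joblib has no such public hook per the comment above. The names `init_worker`, `task`, `run_tasks`, and the settings path `myproject.settings` are hypothetical. The assumption that a Django project would call `django.setup()` inside the initializer is mine, not something stated in this thread.

```python
import os
from concurrent.futures import ProcessPoolExecutor

# Per-process flag; each worker process gets its own copy.
_WORKER_READY = False


def init_worker(settings_module):
    """Hypothetical hook that runs once in every worker process."""
    global _WORKER_READY
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", settings_module)
    # Assumption: in a real Django project this is where one would run
    #   import django; django.setup()
    # so the app registry is populated before any model is touched.
    _WORKER_READY = True


def task(x):
    """Stand-in for work that would normally use the ORM."""
    if not _WORKER_READY:
        raise RuntimeError("worker was not initialized")
    return x * x


def run_tasks(values, settings_module="myproject.settings"):
    # The initializer is called once per worker, before any task runs.
    with ProcessPoolExecutor(
        max_workers=2,
        initializer=init_worker,
        initargs=(settings_module,),
    ) as pool:
        return list(pool.map(task, values))


if __name__ == "__main__":
    print(run_tasks([1, 2, 3]))
```

With a hook like this, the per-process setup lives in one place instead of being repeated inside every parallelized function.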
This is related to #1071, for which we have no clean solution in joblib either. The stopgap workaround we used in scikit-learn is to use a custom alternative to the |
Thanks for your response. That would be helpful. In response to your first comment: there is also something more specific going on here. I'm not sure why, but I need to re-import all of the Django models within the parallelized function to avoid the error. I will check out the linked issue and see if I can roll a solution.
Hum, that's fishy. If Django provides a way to introspect all the (active) model classes, this could be used to ship the list of model classes to initialize in a custom |
That's what I am thinking. I'll let you know how it goes. It might have to do with the Python interpreter. The way imports have to be done in this forked context kind of breaks Python's assumptions about the accessibility of imported modules.
joblib uses loky and cloudpickle under the hood. They work fine with all the common objects in regular and dynamic modules (e.g. interactively defined modules, as long as the object is picklable by value). Feel free to open an issue on https://github.com/cloudpipe/cloudpickle with a minimal reproduction case with a custom Django model class if you can suggest one.
After a lot of tweaking and experimenting, I came up with nothing. I have successfully moved to Luigi for this part of my ETL pipeline. It has a higher initial startup time per task, but it allows me to parallelize this part of my workflow without sacrificing code quality.
There is no sense in posting a comprehensive example, because setting up a Django project requires a good number of files. I will give this simple example.
The way I have to write parallelized Django code (non-Pythonically) ...
The way I should be able to write parallelized Django code (Pythonically) ...
When I write it the Pythonic way, I get the following traceback, which makes basically no sense and provides no guidance on how to debug it.
As you can imagine, once a project gets pretty complex, I have to insert random imports all over the place until the error disappears. I know that Django does something fancy with the PATH variable in order to set up app paths for the ORM, and this issue is not unique to joblib. It happens in multiprocessing as well.
I would love it if we could do something to increase interoperability here, because the tracebacks are very frustrating.
Thanks