
METS Server #966

Merged
merged 50 commits into OCR-D:master from mets-server on Aug 22, 2023

Conversation

Member

@kba kba commented Dec 9, 2022

This PR is for early feedback; it is a proof-of-concept, not yet ready for wider use because it has not been tested systematically yet. (It is tested and largely consolidated now 🤞)

The METS server can be started with:

# Start server listening to a UNIX domain socket
ocrd workspace --mets-server-url /tmp/ws.sock -d /path/to/workspace server start
# Start server listening to TCP port
ocrd workspace --mets-server-url http://localhost:8123 -d /path/to/workspace server start

Note: If you want to use the TCP interface, you must prefix the --mets-server-url with http://, otherwise it will be interpreted as a UDS.
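For illustration, such a dispatch could look roughly like this (a sketch only, not necessarily how the PR implements the check):

from urllib.parse import urlparse

def is_uds_url(mets_server_url):
    # Anything without an http(s) scheme is treated as a path to a UNIX domain socket
    return urlparse(mets_server_url).scheme not in ('http', 'https')

assert is_uds_url('/tmp/ws.sock')
assert not is_uds_url('http://localhost:8123')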

Then you can do things like OcrdMets.find_files via HTTP (using HTTPie):

http get localhost:8123 file_grp=OCR-D-IMG "mimetype=//image/.*"

The same can be achieved through ocrd workspace:

ocrd workspace -U /tmp/ws.sock find -G OCR-D-IMG -m "//image/.*"

And processors also accept -U/--mets-server-url:

ocrd-tesserocr-recognize -U /tmp/ws.sock -I OCR-D-IMG-BIN -O TESS

Processors do still need to specify the workspace directory because they read files from disk. Should we go further and provide a way to download files from the METS server? We should not.

The implementation (in ocrd.mets_server) works like this:

  • OcrdMetsServer is the server component. It is provided with a workspace and calling its startup method creates a fastapi app that is run with uvicorn.
  • ClientSideOcrdMets is a replacement for OcrdMets that delegates to the OcrdMetsServer instead of the lxml DOM.
  • ClientSideOcrdFile is the equivalent replacement for OcrdFile
  • ClientSideOcrdAgent is the equivalent replacement for OcrdAgent
  • Messages between server and client are encoded as pydantic models (see the sketch after this list)
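For illustration, the server/model pattern described above might look roughly like this (the route, the model fields and the factory function are assumptions; the actual code in ocrd.mets_server may differ):

from typing import List, Optional
from fastapi import FastAPI
from pydantic import BaseModel

class OcrdFileModel(BaseModel):
    # mirrors the OcrdFile attributes that need to cross the wire
    file_grp: str
    file_id: str
    mimetype: Optional[str] = None
    page_id: Optional[str] = None
    url: Optional[str] = None

def create_app(workspace):
    app = FastAPI()

    @app.get('/file', response_model=List[OcrdFileModel])
    def find_files(file_grp: Optional[str] = None, mimetype: Optional[str] = None):
        # delegate to the in-memory OcrdMets instead of re-parsing mets.xml per request
        return [OcrdFileModel(file_grp=f.fileGrp, file_id=f.ID, mimetype=f.mimetype,
                              page_id=f.pageId, url=f.url)
                for f in workspace.mets.find_files(fileGrp=file_grp, mimetype=mimetype)]

    return app

# startup() would then run the app with uvicorn, e.g.
# uvicorn.run(create_app(workspace), uds='/tmp/ws.sock') for a UNIX domain socket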

OcrdWorkspace, ocrd_decorators etc. accept mets_server_url in addition to mets_url, mets_basename etc. If --mets-server-url is specified, OcrdWorkspace.mets is instantiated as a ClientSideOcrdMets instead of an OcrdMets. The idea behind this is that the change in behavior should be largely transparent: there are very few changes to OcrdWorkspace itself, only how the METS is accessed differs, and processor developers and users of ocrd workspace should not need to change anything to get METS server support.
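As a hypothetical illustration of that swap (the helper function and the ClientSideOcrdMets constructor signature are assumptions, not the actual Workspace internals):

from ocrd_models import OcrdMets

def make_mets(mets_path, mets_server_url=None):
    if mets_server_url:
        from ocrd.mets_server import ClientSideOcrdMets
        return ClientSideOcrdMets(mets_server_url)  # proxies all METS calls to the running server
    return OcrdMets(filename=mets_path)  # parses mets.xml into an lxml DOM locally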

As I said, this is a proof-of-concept at this stage; there are some hacky solutions in it and I haven't yet started to unit/integration test this properly. (There is now a basic test.)

I would appreciate feedback on

  • the fastapi implementation (this is good enough to move forward, I think)
  • whether downloading from the server should be A Thing (it should not)
  • whether we need both TCP and UNIX socket (yes, it adds little complexity but offers flexibility)
  • how best to test this (testing with a pytest fixture where the server is run via socket in a separate process, with parallel additions tested via process pools; see the sketch after this list)
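A rough sketch of that testing setup (the constructor arguments, the asset path and the fixture wiring are assumptions; the actual tests may differ):

import time
from multiprocessing import Pool, Process

import pytest
from ocrd import Resolver
from ocrd.mets_server import ClientSideOcrdMets, OcrdMetsServer

SOCKET = '/tmp/test-mets.sock'

@pytest.fixture
def mets_server(tmp_path):
    # run the METS server on a UNIX domain socket in a separate process
    ws = Resolver().workspace_from_url('tests/assets/kant_aufklaerung_1784/data/mets.xml',
                                       dst_dir=str(tmp_path))
    server = Process(target=OcrdMetsServer(workspace=ws, url=SOCKET).startup)
    server.start()
    time.sleep(1)  # crude: give uvicorn a moment to bind the socket
    yield SOCKET
    server.terminate()

def add_file(i):
    # every worker process gets its own client talking to the same server
    ClientSideOcrdMets(SOCKET).add_file('NEW', ID='FILE_%04d' % i, mimetype='image/png',
                                        pageId='PHYS_%04d' % i, local_filename='file_%d.png' % i)

def test_parallel_add(mets_server):
    with Pool(8) as pool:
        pool.map(add_file, range(50))
    assert len(ClientSideOcrdMets(mets_server).find_all_files(fileGrp='NEW')) == 50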

Contributor

@MehmedGIT MehmedGIT left a comment


Here is the initial portion of my feedback by just peeking over the code. More will come later this week.

Contributor

@joschrew joschrew left a comment


Hi, here is my review. I read through all the code and briefly tested the new feature.

This review comment has 3 parts: questions, UDS, tests. Feedback on fastapi is included in my code comments.

Questions:

  1. Maybe we/you could point out the idea of the Mets-Server again (then it is easier for me to recap): IIUC the idea is that all interactions with a workspace are done by the Mets-Server, so there is just one process (the server) accessing the METS file. What I am wondering is: when a user initiates multiple calls to the server, how is it ensured that the METS file is accessed in a "synchronized way"? Can't it be, for example, that the Mets-Server itself receives 2 requests to add a file to the METS file, that these calls "happen at the same time", and that the writes thus collide somehow? Or is this related somehow to the METS file caching? Or because Python only has one thread?

Processors do still need to specify the workspace directory because they read files from disk

I don't understand what you mean by this. Does it mean that, in addition to host/port, the path to the workspace has to be specified? If so, why? As far as I understand, the Mets-Server is the only one that needs to know where the METS file is stored.

whether downloading from the server should be A Thing

Sorry, I don't understand that either, can you try to clarify? Do you mean we should think about whether we should offer the functionality to download the METS file and the file groups and everything?

Unix sockets

I don't have much experience with unix sockets. What is the benefit in using sockets compared to using tcp? I am wondering if it's worth the added "complexity" because both (uds and tcp) offer nearly the same thing (iiuc). What I would do is go with tcp and drop the sockets at least for the beginning (although it seems to be nicely usable with uvicorn/fastapi).

Tests

I didn't think very much about the tests yet. I am wondering if it is time for that yet; I know about TDD, but I am not experienced with / used to doing that.
But in the webapi-implementation we already have some tests regarding fastapi. There we also use pytest-docker because we need mongodb (and later probably rabbit-mq) as well. I am not sure whether we need this for the METS server too; currently I don't think so.
Testing API endpoints with fastapi is, in my opinion, simple (using docker is a bit more complicated from my point of view). First, a fixture (not mandatory, but I think a fixture is the best way) for the fastapi instance/TestClient is needed: https://github.com/OCR-D/ocrd-webapi-implementation/blob/main/tests/conftest.py#L33-L36 (let's pretend the function in the example has no parameter), and then it can be used to execute HTTP methods: https://github.com/OCR-D/ocrd-webapi-implementation/blob/main/tests/conftest.py#L176-L177. The rest (starting the server etc.) is done by fastapi in the background.
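For example, a TestClient-based test could look roughly like this (the import path of the app and the endpoint are assumptions):

import pytest
from fastapi.testclient import TestClient
from ocrd.mets_server import app  # assumption: the FastAPI app object is importable like this

@pytest.fixture
def client():
    return TestClient(app)

def test_find_files(client):
    response = client.get('/file', params={'file_grp': 'OCR-D-IMG'})
    assert response.status_code == 200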

Contributor

MehmedGIT commented Dec 16, 2022

Hi Jonas. I will try to provide some insight on the questions you asked, as far as I understand them; Konstantin can then modify/extend the answers. In the end, hopefully, all of us will be on the same page.

  1. Maybe we/you could point out the idea of the Mets-Server again (then it is better for me to recap): IIUC the idea is that all interactions with a workspace are done by the Mets-Server. So there is just one process (the server) accessing the mets file.

That's correct.

What I am wondering is: when a user initiates multiple calls to the server, how is it ensured that the METS file is accessed in a "synchronized way"? Can't it be, for example, that the Mets-Server itself receives 2 requests to add a file to the METS file, that these calls "happen at the same time", and that the writes thus collide somehow? Or is this related somehow to the METS file caching? Or because Python only has one thread?

Good question. Still not completely clear to me either. On a low level, how does the Session from requests_unixsocket work? I have not checked that yet.
The plan is to store the changes in the server's RAM and save the content of the mets file to disk only when requested, or store it every few seconds depending on how it is configured (to reduce the amount of data lost if a crash happens). If we extend the mets caching (currently, we don't cache everything), we may also just store the compact cache on disk for recovery purposes. I assume the requests are accepted in an async way, but we do synchronize writes to the XML tree/cache simply by locking/unlocking the XML tree/cache.

Processors do still need to specify the workspace directory because they read files from disk

I don't understand what you mean by this. Does it mean that, in addition to host/port, the path to the workspace has to be specified? If so, why? As far as I understand, the Mets-Server only needs to know where the METS file is stored.

We have two different options for the mets server:

  1. UDS (socket file) - Consider the provided example:
    ocrd-tesserocr-recognize --socket /tmp/ws.sock -I OCR-D-IMG-BIN -O TESS
    Here the processor knows where to write, i.e., the socket file (stream-oriented). But the processor also needs to know where to read the files of fileGrp OCR-D-IMG-BIN from. So the path of the mets file has to be added to the example above. It should already be obvious that this works only when the mets server and the ocrd processor are running on the same host.

  2. TCP (host/port) - In this case, the processors still need to know the workspace_id (in webapi-impl terminology) from which they need to read. They don't know how/where the data is stored inside the server.

whether downloading from the server should be A Thing

Sorry, I don't understand that either, can you try to clarify? Do you mean we should think about whether we should offer the functionality to download the METS file and the file groups and everything?

I guess yes - I have the same understanding: whether the mets server should download missing files. Although I think this does not include downloading the mets file, just whatever is referenced inside the mets file.

Unix sockets
I don't have much experience with unix sockets. What is the benefit in using sockets compared to using tcp? I am wondering if it's worth the added "complexity" because both (uds and tcp) offer nearly the same thing (iiuc).

In short - a Unix socket is used for interprocess communication (IPC) and is an alternative to other native IPC mechanisms (e.g., pipes and FIFOs). In cases where the mets server and the ocrd processor(s) are running on the same host, the communication happens in a more efficient way (check here and here). The data transferred over the socket can also be further compressed - e.g., using a message scheme instead of relying on the slower JSON encoding (EDIT: message compression is possible over TCP as well, but it is usually not preferred if users interact with the server directly and need meaningful responses; JSON is ideal in that case). The mets server will be accessed by ocrd processors and not by users directly. The user is only interested in the latest mets file state once the processing is finished.

Another HPC-specific advantage is that it completely removes any possibility of port clashes with other running services. When I asked a more experienced person about starting a mets server inside the HPC, I was told to always use ports above 10000 to reduce the possibility of port clashes with internal services. However, this is still not safe enough in the long run, since it can still potentially clash with another user's service. The more mets server instances run inside the HPC, the higher the probability of a clash.
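For reference, a client can talk to a UDS-bound server with requests_unixsocket roughly like this (the endpoint path and parameters are assumptions):

from urllib.parse import quote
import requests_unixsocket

session = requests_unixsocket.Session()
socket_path = quote('/tmp/ws.sock', safe='')  # the socket path is percent-encoded into the URL
response = session.get('http+unix://%s/file' % socket_path, params={'file_grp': 'OCR-D-IMG'})
print(response.json())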

What I would do is go with tcp and drop the sockets at least for the beginning (although it seems to be nicely usable with uvicorn/fastapi).

I would prefer not to drop the sockets. However, I think we should separate the host/port from the socket concept a bit, so it is more obvious to the general user what these are. Currently, they are kind of under the same hood. Even a slightly better help description on the CLI may help here.

Member Author

kba commented Dec 19, 2022

What I am wondering is: when a user initiates multiple calls to the server, how is it ensured that the METS file is accessed in a "synchronized way"? Can't it be, for example, that the Mets-Server itself receives 2 requests to add a file to the METS file, that these calls "happen at the same time", and that the writes thus collide somehow? Or is this related somehow to the METS file caching? Or because Python only has one thread?

Good question. Still not completely clear to me as well. On a low level, how does the Session from requests_unixsocket work? I did not check that yet.

The METS server is single-threaded, so any concurrent requests will work on the same in-memory OcrdWorkspace/OcrdMets object. This behavior is unrelated to caching, though caching will speed up file search here as well of course.

whether downloading from the server should be A Thing

Sorry, I don't understand that either, can you try to clarify? Do you mean we should think about whether we should offer the functionality to download the METS file and the file groups and everything?

I guess yes - I have the same understanding: whether the mets server should download missing files. Although I think this does not include downloading the mets file, just whatever is referenced inside the mets file.

The way the METS server is currently implemented still requires processors to be run with a local copy of the workspace; only file adding and searching (and a few related methods) are implemented, and access to the actual files happens via the filesystem. I was wondering whether we should decouple this further, so that a processor can request a file and download it via the METS server, store it in a local scratch directory (or just keep it in memory if feasible), do the processing and upload the file again. That way there would be no need to have a full processor-local workspace at all.

The other interpretation also makes sense though, i.e. downloading missing files to the server-local workspace, like we do with the --download flag to ocrd workspace find. Perhaps we should add a query parameter download=1 to GET /file.
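A minimal sketch of that idea (hypothetical, not implemented in this PR; the endpoint shape is an assumption):

from fastapi import FastAPI, HTTPException

def create_app(workspace):
    app = FastAPI()

    @app.get('/file/{file_id}')
    def get_file(file_id: str, download: bool = False):
        files = workspace.mets.find_all_files(ID=file_id)
        if not files:
            raise HTTPException(status_code=404)
        if download:
            # resolve the file's URL to a path inside the server-local workspace
            workspace.download_file(files[0])
        return {'ID': files[0].ID, 'url': files[0].url, 'mimetype': files[0].mimetype}

    return app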

What I would do is go with tcp and drop the sockets at least for the beginning (although it seems to be nicely usable with uvicorn/fastapi).

I would prefer not to drop the sockets. However, I think we should separate the host/port from the socket concept a bit, so it is more obvious to the general user what these are. Currently, they are kind of under the same hood. Even a slightly better help description on the CLI may help here.

I would also prefer to keep both options for the reasons @MehmedGIT explained so eruditely above. If you have suggestions on how to better describe the features, I'll be happy to integrate them.

Contributor

MehmedGIT commented Dec 19, 2022

The way the METS server is currently implemented still requires processors to be run with a local copy of the workspace; only file adding and searching (and a few related methods) are implemented, and access to the actual files happens via the filesystem.

Since workspace, depending on the context, could mean both a mets file and images, I think we should stick to these separate terms when referring to workspaces. At least for me, it will help. So, reads/writes on the mets file happen through the Mets server, but ocrd processors still have the images stored locally. Although there is a mets file available locally as well, it is not used by the processors.

I was wondering whether we should decouple this further, so that a processor can request a file and download it via the METS server, store it in a local scratch directory (or just keep in memory if feasible), do the processing and upload the file again.

This sounds reasonable, considering that the processors should know nothing about the workspace structure but just process images. However, we should think about how to optimize that to avoid unnecessary transfers of images, e.g. uploading images from the local/network FS to the METS server and then sending them to the processors via the METS server.

Maybe we should first try to identify the spot where the METS Server fits best in this architecture. I think the METS Server fits best between the Processing Servers (workers) and the Network File System. It acts as a proxy when a path to an image has to be provided or an image has to be downloaded, so the Mets server will not have to store workspace images locally. Since I have not put much thought into this idea yet, maybe there are things I have not considered.

Contributor

@MehmedGIT MehmedGIT left a comment


Overall looks good. We already discussed some things in person, but for the record, here is a bug that is still there.

  1. Starting the METS Server: ocrd workspace -U http://localhost:8123 -d /home/mm/Desktop/example_ws2/data server start
  2. Then trying to run some processor: ocrd-cis-ocropy-binarize -U http://localhost:8123 -I DEFAULT -O OCR_BIN

Output:

17:29:31.758 CRITICAL root - initLogging was called multiple times. Source of latest call:
17:29:31.758 CRITICAL root -   File "/home/mm/Desktop/ocrd_core/ocrd-core/ocrd/ocrd/decorators/__init__.py", line 60, in ocrd_cli_wrap_processor
17:29:31.759 CRITICAL root -     initLogging()
Traceback (most recent call last):
  File "/home/mm/venv37-ocrd/bin/ocrd-cis-ocropy-binarize", line 8, in <module>
    sys.exit(ocrd_cis_ocropy_binarize())
  File "/home/mm/venv37-ocrd/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/mm/venv37-ocrd/lib/python3.7/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/mm/venv37-ocrd/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/mm/venv37-ocrd/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/mm/venv37-ocrd/lib/python3.7/site-packages/ocrd_cis/ocropy/cli.py", line 18, in ocrd_cis_ocropy_binarize
    return ocrd_cli_wrap_processor(OcropyBinarize, *args, **kwargs)
  File "/home/mm/Desktop/ocrd_core/ocrd-core/ocrd/ocrd/decorators/__init__.py", line 80, in ocrd_cli_wrap_processor
    workspace = resolver.workspace_from_url(mets, working_dir)
  File "/home/mm/Desktop/ocrd_core/ocrd-core/ocrd/ocrd/resolver.py", line 175, in workspace_from_url
    self.download_to_directory(dst_dir, mets_url, basename=mets_basename, if_exists='overwrite' if clobber_mets else 'skip')
  File "/home/mm/Desktop/ocrd_core/ocrd-core/ocrd/ocrd/resolver.py", line 82, in download_to_directory
    raise FileNotFoundError("File path passed as 'url' to download_to_directory does not exist: %s" % url)
FileNotFoundError: File path passed as 'url' to download_to_directory does not exist: /home/mm/Desktop/ocrd_core/mets.xml

Seems it still tries to find the mets.xml file in the default path.

I have tried to provide the correct mets path with --mets; however, then the produced output files are not reflected by the METS Server at all.

Member Author

kba commented Aug 21, 2023

EDIT: -d is for ocrd workspace, -w is for processors to specify workspace directory.

Overall looks good. We already discussed some things in person but for the record here is a bug that is still there.

  1. Starting the METS Server: ocrd workspace -U http://localhost:8123 -d /home/mm/Desktop/example_ws2/data server start
  2. Then trying to run some processor: ocrd-cis-ocropy-binarize -U http://localhost:8123 -I DEFAULT -O OCR_BIN

You also need to specify the path to the workspace to the processor, like so:

ocrd-cis-ocropy-binarize -U http://localhost:8123 -I DEFAULT -O OCR_BIN -w /home/mm/Desktop/example_ws2/data

Because writing out files other than mets.xml still happens on the client side.

But there were indeed a few problems in the code that I have fixed now (mostly that --mets-server-url was not passed properly, and I had a typo in the option too 🙄).

I have now tested it with https://content.staatsbibliothek-berlin.de/dc/PPN680203753.mets.xml:

  • ocrd workspace -U http://localhost:8123 -d /tmp/testws server start in one tmux pane
  • for i in $(seq --format='%02g' 1 10); do ocrd-tesserocr-recognize -P segmentation_level region -P textequiv_level word -P find_tables true -P model eng -I DEFAULT -O TESS -w /tmp/testws -U http://localhost:8123 -g PHYS_00$i & done in the other, i.e. process the first 10 pages in parallel

It takes a moment for the processes to start, but then they go brrrr.

Member Author

kba commented Aug 21, 2023

Tests now fail due to psf/requests#6226, but I have to investigate further tomorrow.

@MehmedGIT MehmedGIT self-requested a review August 22, 2023 13:57
Contributor

@MehmedGIT MehmedGIT left a comment


No more issues on my side.

@kba kba merged commit 5ce54a5 into OCR-D:master Aug 22, 2023
1 check passed
@kba kba deleted the mets-server branch August 22, 2023 16:13
@kba kba restored the mets-server branch August 23, 2023 12:49