Switch to SQLite DB storage #187

Loran425 · 2024-05-16T05:34:45Z

The only reason this is going up in this state is that it has sat on my computer too long already; maybe if it's out there I'll move faster on the actual code writing.

src\core\sql_library.py represents the in-memory data structures
src\core\create_db.sql represents the schema for the database

Incredibly rough draft that isn't close to done.
Only Location and Entry have had a cursory initial pass, and many features of the existing library are still missing.

First thought is that Library handles all instantiation and processing, with the remaining objects primarily acting as an in-memory cache to prevent slowdowns in the initial phases, when things like src\qt\modals\tag_database.py would otherwise try to read the database one tag at a time.
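
A minimal sketch of that caching idea, assuming a `tag` table with `id` and `name` columns (a hypothetical stand-in, not the actual schema in create_db.sql): the cache is filled with one query, so a UI such as the tag database modal never has to fetch tags one at a time.

```python
import sqlite3


def load_tag_cache(conn: sqlite3.Connection) -> dict[int, str]:
    """Read every tag in a single query and return an id -> name mapping."""
    rows = conn.execute("SELECT id, name FROM tag")
    return {tag_id: name for tag_id, name in rows}


# Hypothetical stand-in for the real schema; table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO tag (id, name) VALUES (?, ?)",
    [(1, "Archived"), (2, "Favorite")],
)

tag_cache = load_tag_cache(conn)
print(tag_cache[2])  # served from memory, no further database reads
```

The trade-off is that the cache has to be refreshed or updated whenever a tag is created, renamed, or deleted.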

Initial DB Schema Graphic

yedpodtrzitko · 2024-05-16T05:56:15Z

The thread about the DB is very tl;dr, so if you don't mind, some questions about the attached final DB schema:

What is the page attribute in the table entry_page for?
What's the difference between entry.path and location.path referred to via entry.location?
I don't see entry_attribute used in the code yet (assuming it's still very much WIP), so I'll ask the relevant questions once I see what that's about.

Loran425 · 2024-05-16T06:21:53Z

The thread about the DB is very tl;dr, so if you don't mind, some questions about the attached final DB schema:

What is the page attribute in the table entry_page for?

What's the difference between entry.path and location.path referred to via entry.location?

I don't see entry_attribute used in the code yet (assuming it's still very much WIP), so I'll ask the relevant questions once I see what that's about.

entry_page is part of the group (formerly collation) functionality, so page would be the page of that UI view the entry appears on.
Locations could probably be better described as directories. They allow two requested features:

multiple directories within a single library
allowing the TagStudio database and other TagStudio-generated files to be placed anywhere at creation time, not just in the root of the library folder.

entry_attribute replaces all references to fields and tags in the data storage, so it's essentially the storage for the attrs of the key:attr pairs. It maps entries to their metadata, with tags as keys (stored in the tag table) and the attrs stored in the entry_attribute table.

Thinking this through some more, multiple tags might be a missed case: this schema needs one row per tag, which would cause a primary key clash if a tag box (tag group) had more than one child tag. So it might need an integer primary key rather than the current Title_tag/Entry key.
It's been so long since I thought about this part that I forgot it's actually just tags that get assigned; the tag box grouping is then handled on the UI side, if I remember correctly.
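
To make the primary key concern concrete, here is a small sketch using two hypothetical tables (the table and column names are assumptions, not the real entry_attribute definition). With a composite key on the entry and title tag, a second child tag for the same pair is rejected; with a surrogate integer key, both rows fit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table keyed on (entry, title_tag): at most one row per pair.
conn.execute(
    """
    CREATE TABLE attr_composite (
        entry INTEGER NOT NULL,
        title_tag INTEGER NOT NULL,
        tag INTEGER,
        PRIMARY KEY (entry, title_tag)
    )
    """
)
conn.execute("INSERT INTO attr_composite VALUES (1, 10, 100)")
try:
    # A second child tag in the same tag box collides with the first row.
    conn.execute("INSERT INTO attr_composite VALUES (1, 10, 101)")
    clash = False
except sqlite3.IntegrityError:
    clash = True

# Same columns, but with a surrogate integer primary key.
conn.execute(
    """
    CREATE TABLE attr_surrogate (
        id INTEGER PRIMARY KEY,
        entry INTEGER NOT NULL,
        title_tag INTEGER NOT NULL,
        tag INTEGER
    )
    """
)
conn.executemany(
    "INSERT INTO attr_surrogate (entry, title_tag, tag) VALUES (?, ?, ?)",
    [(1, 10, 100), (1, 10, 101)],
)
row_count = conn.execute("SELECT COUNT(*) FROM attr_surrogate").fetchone()[0]
print(clash, row_count)
```

Another option would be to widen the composite key to include the child tag, which keeps duplicates out without a surrogate column.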

Loran425 · 2024-05-17T04:15:14Z

Question on ignored extensions, is the plan that files with those extensions are ignored by the database (no entries generated) or just hidden from the UI? Trying to see if that info should be stored as a UI settings item, or as another table in the DB (currently commented out)

CyanVoxel · 2024-05-17T04:20:10Z

Question on ignored extensions, is the plan that files with those extensions are ignored by the database (no entries generated) or just hidden from the UI? Trying to see if that info should be stored as a UI settings item, or as another table in the DB (currently commented out)

I was intending on them being hidden on the UI side so the library doesn't have to rescan whenever you make changes to the ignore list.
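
A rough sketch of what UI-side hiding could look like, assuming entries are available as path strings and the ignore list is a plain list of extensions (both are assumptions for illustration). Entries stay in the library; only the displayed list is filtered, so changing the ignore list needs no rescan.

```python
from pathlib import Path


def visible_entries(paths: list[str], ignored_extensions: list[str]) -> list[str]:
    """Return the paths whose extension is not on the ignore list."""
    # Normalise so ".TXT", "txt" and ".txt" are all treated the same.
    ignored = {ext.lower().lstrip(".") for ext in ignored_extensions}
    return [p for p in paths if Path(p).suffix.lower().lstrip(".") not in ignored]


entries = ["a/photo.PNG", "a/notes.txt", "b/clip.mp4", "b/README"]
shown = visible_entries(entries, [".txt", "mp4"])
print(shown)
```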

DannyAlas · 2024-05-17T20:33:44Z

This PR could get quite big (which is okay) as it's reimplementing many core features. But could we use this as a starting point to refactor some components out of here before continuing? Namely:

Decouple the Library from the storage backend

Let the storage backend handle data storage in the DB. The library can manage CRUD, caching, linking, and other management-related features but let the storage backend handle (and optimize) the implementation.

Note: I would also not create a Python object for each entry. If we expect hundreds of thousands to millions of files, I don't think that many GC-managed objects would be ideal.

Isolate the filesystem implementation from TagStudio internals

This would fix the inability to reference files if they move or are deleted. We should have a module that handles the management of files, e.g., their IDs, location, system metadata, etc., and provide an API for libraries to interact with them.

This would use inodes on *nix OSes, BY_HANDLE_FILE_INFORMATION on NTFS systems, and the respective equivalents on other systems internally to manage OS-specific metadata, and provide a single API for consumers such as libraries.
This would own the API implementation for things like watching for filesystem changes, such as in Automatic detection of filesystem changes #125, for example

Scopes and defaults

Instead of each library managing its implementation of Tags, their storage, and their defaults, have the Tag implementation be separate.

Since the storage is already abstracted, the Tag Manager can handle this by creating, managing, and storing tags and their relationships (not sure if this is a goal, but tag relationships could be more than just parent-child) in Global Scope. Then, the application could provide a UI to manage these (and import across libraries), and individual libraries can manage local tags and their file associations.

This would also allow for easy imports of tags and moving them around libraries in a user-friendly fashion. In the future, if we want user plugins for adding tags (like image classification or OCR plugins), that would interop with this API for adding tags and then the libraries API for linking them.

I'm happy to start on some of these (like filesystem and storage), but it's up to @CyanVoxel to see if he thinks this is a good direction.

Loran425 · 2024-05-18T17:53:21Z

Thanks for the comments. Yeah, this really would be a big one. I just wanted to get the discussion going and loop in some of the GitHub crowd.

This PR could get quite big (which is okay) as it's reimplementing many core features. But could we use this as a starting point to refactor some components out of here before continuing? Namely:

Decouple the Library from the storage backend

Let the storage backend handle data storage in the DB. The library can manage CRUD, caching, linking, and other management-related features but let the storage backend handle (and optimize) the implementation.

I believe this was one of the end goals for this, though it's definitely not touched on in the first stages.
To make sure I'm on the same page: this is basically saying the project architecture shifts so that the library acts as middleware? It never touches the disk and never touches the GUI; it just acts as the connection point/API for both storage backends and GUIs?

Note: I would also not create a Python object for each entry. If we expect hundreds of thousands to millions of files, I don't think that many GC-managed objects would be ideal.

Agreed, it was never the intention for an entire library to live in memory at once long term but since that's how it's currently implemented I was looking at incremental changes to make that more possible.
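
One incremental way to avoid holding the whole library in memory is to stream rows from the database in batches. The sketch below assumes an `entry` table with `id` and `path` columns (hypothetical names) and uses a generator, so only one batch of plain tuples is alive at a time.

```python
import sqlite3
from typing import Iterator


def iter_entry_rows(
    conn: sqlite3.Connection, batch_size: int = 1000
) -> Iterator[tuple[int, str]]:
    """Yield (id, path) rows in batches instead of loading every entry at once."""
    cursor = conn.execute("SELECT id, path FROM entry ORDER BY id")
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            return
        yield from batch


# Hypothetical stand-in for the entry table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entry (id INTEGER PRIMARY KEY, path TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO entry (path) VALUES (?)",
    ((f"img_{i}.png",) for i in range(2500)),
)

png_count = sum(1 for _id, path in iter_entry_rows(conn) if path.endswith(".png"))
print(png_count)
```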

Isolate the filesystem implementation from TagStudio internals

This would fix the inability to reference files if they move or are deleted. We should have a module that handles the management of files, e.g., their IDs, location, system metadata, etc., and provide an API for libraries to interact with them.

This would use inodes on *nix OSes, BY_HANDLE_FILE_INFORMATION on NTFS systems, and the respective equivalents on other systems internally to manage OS-specific metadata, and provide a single API for consumers such as libraries.

This would own the API implementation for things like watching for filesystem changes, such as in Automatic detection of filesystem changes #125, for example

This level of filesystem interaction is well beyond my existing knowledge, but I would be interested in learning about it. I'm not seeing a clear way for these metadata structures to resolve back to their file data, so that things like thumbnails and opening with system default viewers would work without falling back to system calls to resolve the filename. Or is the thought more that this implementation would scan a directory, resolve the filesystem IDs from the file names, and use that to translate internally between file names and OS-level file identifiers? (E.g. I move C:\users\loran425\downloads\test.png to C:\users\loran425\pictures\test.png: the file path has changed but the OS-level file identifier hasn't, so if I was scanning both downloads and pictures, the existing TagStudio metadata would automatically be applied because it's tied to that ID, not the path of the file?)

Scopes and defaults

Instead of each library managing its implementation of Tags, their storage, and their defaults, have the Tag implementation be separate.

Since the storage is already abstracted, the Tag Manager can handle this by creating, managing, and storing tags and their relationships (not sure if this is a goal, but tag relationships could be more than just parent-child) in Global Scope. Then, the application could provide a UI to manage these (and import across libraries), and individual libraries can manage local tags and their file associations.

This would also allow for easy imports of tags and moving them around libraries in a user-friendly fashion. In the future, if we want user plugins for adding tags (like image classification or OCR plugins), that would interop with this API for adding tags and then the libraries API for linking them.

I think this is sort of being shifted towards just by having the tags live in the database: there wouldn't be a list of defaults in the source code, it would instead be pulled from storage. The current defaults would just be created as defaults in the storage solution, since that simplifies the transition.
From what I've seen, Global Scope items haven't really been discussed. Multiple directories within the file system, and allowing the storage location and the entries to live in different places, have been discussed as likely improvements.
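
As a sketch of "defaults created in storage", assuming a `tag` table and two placeholder default tags (the names and IDs here are hypothetical): seeding with `INSERT OR IGNORE` makes the step safe to repeat and leaves tags the user has already changed untouched.

```python
import sqlite3

# Hypothetical defaults; the real list would come from the existing library code.
DEFAULT_TAGS = [(1, "Archived"), (2, "Favorite")]


def seed_default_tags(conn: sqlite3.Connection) -> None:
    """Create the tag table if needed and insert any missing default tags."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tag (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE)"
    )
    # Rows whose id or name already exists are skipped rather than overwritten.
    conn.executemany("INSERT OR IGNORE INTO tag (id, name) VALUES (?, ?)", DEFAULT_TAGS)
    conn.commit()


conn = sqlite3.connect(":memory:")
seed_default_tags(conn)
conn.execute("UPDATE tag SET name = 'Starred' WHERE id = 2")  # user renames a default
seed_default_tags(conn)  # running the seed again keeps the rename

names = [row[0] for row in conn.execute("SELECT name FROM tag ORDER BY id")]
print(names)
```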

I'm happy to start on some of these (like filesystem and storage), but it's up to @CyanVoxel to see if he thinks this is a good direction.

DannyAlas · 2024-05-19T21:49:04Z

I believe this was one of the end goals for this, though definitely not touched on in the first stages. To make sure I'm on the same page this is basically saying the project architecture shifts and now you have a library acting as middleware? it never touches the disk and never touches the GUI just acts as the connection point/API for both storage backends and GUIs?

Kind of; essentially, I'm saying to Separate Concerns. For now, abstract out the storage implementation specifics from the TagStudio Library class/implementation. We could do a Factory or Prototype pattern or just provide an Abstract implementation. The Library should be agnostic to the storage backend. Then each storage implementation would handle figuring out how actually to implement the methods. (and avoid tangling the GUI with any of this, it becomes a big hot mess really fast) See projects like Napari for an idea of structuring larger PyQt projects.

from abc import ABC, abstractmethod

# Tag, Entry and Association stand for the library's own types.
class StorageInterface(ABC):
    @abstractmethod
    def attach_tag_entry(self, tag: Tag, entry: Entry) -> None:
        pass
    @abstractmethod
    def link_tags(self, tag1: Tag, tag2: Tag, association: Association) -> None:
        pass
    ...
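
To illustrate how a backend might plug into such an interface, here is a self-contained sketch with an in-memory backend. `Tag`, `Entry`, `Association`, `InMemoryStorage`, and the `Library.tag_entry` method are all placeholders invented for this example; the interface follows the sketch above, using the spelling `attach_tag_entry`.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum


@dataclass(frozen=True)
class Tag:  # placeholder type
    id: int
    name: str


@dataclass(frozen=True)
class Entry:  # placeholder type
    id: int
    path: str


class Association(Enum):  # placeholder relationship kinds
    PARENT = "parent"
    ALIAS = "alias"


class StorageInterface(ABC):
    @abstractmethod
    def attach_tag_entry(self, tag: Tag, entry: Entry) -> None: ...

    @abstractmethod
    def link_tags(self, tag1: Tag, tag2: Tag, association: Association) -> None: ...


class InMemoryStorage(StorageInterface):
    """Toy backend; a SQLite backend would implement the same two methods."""

    def __init__(self) -> None:
        self.entry_tags: dict[int, set[int]] = {}
        self.tag_links: set[tuple[int, int, Association]] = set()

    def attach_tag_entry(self, tag: Tag, entry: Entry) -> None:
        self.entry_tags.setdefault(entry.id, set()).add(tag.id)

    def link_tags(self, tag1: Tag, tag2: Tag, association: Association) -> None:
        self.tag_links.add((tag1.id, tag2.id, association))


class Library:
    """Only talks to the interface, so the backend can be swapped."""

    def __init__(self, storage: StorageInterface) -> None:
        self._storage = storage

    def tag_entry(self, tag: Tag, entry: Entry) -> None:
        self._storage.attach_tag_entry(tag, entry)


storage = InMemoryStorage()
library = Library(storage)
library.tag_entry(Tag(1, "screenshot"), Entry(7, "Pictures/lol_screenshot.png"))
print(storage.entry_tags)
```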

or is the thought more that this implementation would scan a directory, resolve the filesystem ids from the file names and use that to internally translate between file names and OS level file identifiers?

We don't need to translate between file names and the ID. The file name, path, ID, and other metadata are already attached to the file. If we use the path as the identifier, we run into linking issues as files get moved around, and if we use a hash, when internal data is modified (like if you crop a photo), the hash changes.

The ID is a more consistent identifier (though it's not guaranteed to always stay the same: on Windows, for example, if the file moves to another drive, the volume ID, which is part of the whole ID, changes). Take, for example, the directory below, where Pictures is the monitored library directory.

Pictures/
├── Screen Shots/
│   └── lol_screenshot.png
└── Games/
    └── LOL/

Say all my tags are already associated with the PNG, and I then move the file under Games:

Pictures/
├── Screen Shots/
└── Games/
    └── LOL/
        └── lol_screenshot.png

We would lose the association, as the path has changed. This could get really bad if you're moving around more than just a few files after you've spent time tagging them. And if I happen to crop or modify the file in some way afterwards, pretty much any hash I know of (MD5, SHA, CRC64) would change (and hashes also get expensive to calculate as the file size grows). The ID would not change, which preserves the links. Not perfect, but I believe it's better.

An example implementation for this:

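import ctypes
import os
from ctypes import wintypes
from datetime import datetime, timezone

class BY_HANDLE_FILE_INFORMATION(ctypes.Structure):
    # ctypes.wintypes does not provide this struct, so it is declared here.
    _fields_ = [
        ("dwFileAttributes", wintypes.DWORD),
        ("ftCreationTime", wintypes.FILETIME),
        ("ftLastAccessTime", wintypes.FILETIME),
        ("ftLastWriteTime", wintypes.FILETIME),
        ("dwVolumeSerialNumber", wintypes.DWORD),
        ("nFileSizeHigh", wintypes.DWORD),
        ("nFileSizeLow", wintypes.DWORD),
        ("nNumberOfLinks", wintypes.DWORD),
        ("nFileIndexHigh", wintypes.DWORD),
        ("nFileIndexLow", wintypes.DWORD),
    ]

def _filetime_to_dt(ft):
    # FILETIME counts 100 ns ticks since 1601-01-01 (UTC).
    ticks = (ft.dwHighDateTime << 32) + ft.dwLowDateTime
    us = ticks // 10 - 11644473600000000  # microseconds since the Unix epoch
    return datetime.fromtimestamp(us / 1e6, tz=timezone.utc)

def _get_windows_metadata(file_path: str):
    k32 = ctypes.windll.kernel32  # only available on Windows
    # Declare signatures so 64-bit handles are not truncated to a C int.
    k32.CreateFileW.restype = wintypes.HANDLE
    k32.CreateFileW.argtypes = [wintypes.LPCWSTR, wintypes.DWORD, wintypes.DWORD,
                                wintypes.LPVOID, wintypes.DWORD, wintypes.DWORD, wintypes.HANDLE]
    k32.GetFileInformationByHandle.argtypes = [
        wintypes.HANDLE, ctypes.POINTER(BY_HANDLE_FILE_INFORMATION)]
    k32.CloseHandle.argtypes = [wintypes.HANDLE]
    try:
        # No access rights, all share modes, OPEN_EXISTING, FILE_FLAG_BACKUP_SEMANTICS.
        file_handle = k32.CreateFileW(
            file_path, 0x00, 0x01 | 0x02 | 0x04, None, 0x03, 0x02000000, None
        )
        if file_handle == wintypes.HANDLE(-1).value:  # INVALID_HANDLE_VALUE
            raise ctypes.WinError()
        try:
            info = BY_HANDLE_FILE_INFORMATION()
            if not k32.GetFileInformationByHandle(file_handle, ctypes.byref(info)):
                raise ctypes.WinError()
        finally:
            k32.CloseHandle(file_handle)  # close the handle even if the query fails
        return {
            "path": file_path,
            # Separators keep the three parts from running together ambiguously.
            "uid": f"{info.dwVolumeSerialNumber}:{info.nFileIndexHigh}:{info.nFileIndexLow}",
            "size": (info.nFileSizeHigh << 32) + info.nFileSizeLow,
            "creation_time": _filetime_to_dt(info.ftCreationTime),
            "last_access_time": _filetime_to_dt(info.ftLastAccessTime),
            "last_write_time": _filetime_to_dt(info.ftLastWriteTime)
        }
    except Exception as e:
        return {"error": str(e)}

def _get_unix_metadata(file_path):
    try:
        stats = os.stat(file_path)
        # st_ctime is the inode change time on most Unix systems, not the creation
        # time; st_birthtime is used where the platform provides it.
        created = getattr(stats, "st_birthtime", stats.st_ctime)
        return {
            "path": file_path,
            "uid": f"{stats.st_dev}:{stats.st_ino}",
            "size": stats.st_size,
            "creation_time": datetime.fromtimestamp(created),
            "last_access_time": datetime.fromtimestamp(stats.st_atime),
            "last_write_time": datetime.fromtimestamp(stats.st_mtime)
        }
    except Exception as e:
        return {"error": str(e)}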
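
The move-and-edit scenario from the Pictures example can be checked with a short experiment. This is a sketch rather than part of the proposal: it uses `os.stat` and treats `(st_dev, st_ino)` as the file ID, which holds for a rename within one filesystem; the helper names are made up for the example.

```python
import hashlib
import os
import tempfile


def file_id(path: str) -> tuple[int, int]:
    """Device plus inode (file index on Windows) as a path-independent ID."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)


def sha256_of(path: str) -> str:
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "Screen Shots"))
    os.makedirs(os.path.join(root, "Games", "LOL"))
    old_path = os.path.join(root, "Screen Shots", "lol_screenshot.png")
    new_path = os.path.join(root, "Games", "LOL", "lol_screenshot.png")

    with open(old_path, "wb") as fh:
        fh.write(b"original pixels")
    id_before, hash_before = file_id(old_path), sha256_of(old_path)

    os.rename(old_path, new_path)      # move within the same filesystem
    with open(new_path, "ab") as fh:   # modify the file in place
        fh.write(b" cropped")
    id_after, hash_after = file_id(new_path), sha256_of(new_path)

same_id = id_before == id_after
same_hash = hash_before == hash_after
print(same_id, same_hash)
```

One caveat worth keeping in mind: programs that save by writing a new file and renaming it over the old one produce a new inode, so the ID changes in that case even though the path does not.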

Loran425 · 2024-05-20T00:18:50Z

Kind of; essentially, I'm saying to Separate Concerns. For now, abstract out the storage implementation specifics from the TagStudio Library class/implementation. We could do a Factory or Prototype pattern or just provide an Abstract implementation. The Library should be agnostic to the storage backend. Then each storage implementation would handle figuring out how actually to implement the methods. (and avoid tangling the GUI with any of this, it becomes a big hot mess really fast) See projects like Napari for an idea of structuring larger PyQt projects.

I can see the flexibility gain of such a system, I'll look into the Abstract classes and Prototypes a bit more, I'll admit I tend to lean away from them because I'm not normally writing things that need plugins or configurable backends.

For napari I see they went prototypes but that repo is a lot to take in to try and understand the structure of what and why they might have done something. I'll see if I can look over it a bit more when I have more time.

We don't need to translate between file names and the ID. The file name, path, ID, and other metadata are already attached to the file. If we use the path as the identifier, we run into linking issues as files get moved around, and if we use a hash, when internal data is modified (like if you crop a photo), the hash changes.

I think I agree and am following on this. So to look up tags for a file, you would select the file, parse the system metadata, and use the system ID as the Entry ID, so that no matter where that file lives (Windows drive changes excluded) the tags and other metadata are applied correctly.
Or, in an active-use scenario, you have a GUI that loads a library. That library has a storage-system-agnostic way of retrieving the list of files that are part of the library (if a file moves outside the library then it won't be displayed, but unless the metadata was cleaned up it would relink once the file was returned to the library). Then, to collect the TagStudio-specific metadata, it at some point (instantiation, searching, or displaying tags) parses the file ID and requests the info from the library. So the GUI or another module of the Library still operates on directories and filenames to know where to look, but the internal referencing of the metadata is based on this file ID. Is that basically what you are recommending?

DannyAlas · 2024-05-20T01:04:39Z

I can see the flexibility gain of such a system, I'll look into the Abstract classes and Prototypes a bit more, I'll admit I tend to lean away from them because I'm not normally writing things that need plugins or configurable backends.
For napari I see they went prototypes but that repo is a lot to take in to try and understand the structure of what and why they might have done something. I'll see if I can look over it a bit more when I have more time.

Napari is a great project, and I recommend giving it a look, but it has a different goal. We don't need to copy its systems per se -- the idea is just that they've been able to manage the separation of concerns pretty well in a larger Python Qt project. PyQt is nice as it's really easy to get started and have an MVP fast, but as soon as it grows in complexity and in contributors, the difficulty can ramp up fast. Separation of concerns, types, and documentation all really help here.

I think I agree and am following on this. So to look up tags for a file, you would select the file, parse the system metadata, and use the system ID as the Entry ID, so that no matter where that file lives (Windows drive changes excluded) the tags and other metadata are applied correctly. Or, in an active-use scenario, you have a GUI that loads a library. That library has a storage-system-agnostic way of retrieving the list of files that are part of the library (if a file moves outside the library then it won't be displayed, but unless the metadata was cleaned up it would relink once the file was returned to the library). Then, to collect the TagStudio-specific metadata, it at some point (instantiation, searching, or displaying tags) parses the file ID and requests the info from the library. So the GUI or another module of the Library still operates on directories and filenames to know where to look, but the internal referencing of the metadata is based on this file ID. Is that basically what you are recommending?

Exactly! This should minimize relinking and broken link annoyances for the user. They can move files around, delete and restore them, have files with the same name, etc. all while the metadata (Tags) for the files are magically linked. (We'd want some recycle bin and archival features as well for deleting files).
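
A small sketch of the relinking flow described here, under the assumption that metadata is stored keyed by file ID and paths are rediscovered by scanning the library directory. `scan`, `relink`, and the `tags_by_id` mapping are hypothetical names for illustration.

```python
import os
import tempfile


def file_id(path: str) -> tuple[int, int]:
    st = os.stat(path)
    return (st.st_dev, st.st_ino)


def scan(root: str) -> dict[tuple[int, int], str]:
    """Map file ID -> current path for every file under root."""
    found = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            found[file_id(path)] = path
    return found


def relink(tags_by_id: dict, root: str) -> dict[str, set[str]]:
    """Return current path -> tags for stored IDs that are present under root."""
    current = scan(root)
    # IDs that are not found are simply skipped; their metadata stays stored.
    return {current[fid]: tags for fid, tags in tags_by_id.items() if fid in current}


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "Games"))
    old_path = os.path.join(root, "shot.png")
    with open(old_path, "wb") as fh:
        fh.write(b"data")
    tags_by_id = {file_id(old_path): {"screenshot"}}  # tagged before the move

    new_path = os.path.join(root, "Games", "shot.png")
    os.rename(old_path, new_path)  # user moves the file inside the library
    linked = relink(tags_by_id, root)
    moved_ok = linked == {new_path: {"screenshot"}}

print(moved_ok)
```

Hard links are one case this does not distinguish: two paths that share an inode resolve to a single ID, so only one of them would be listed.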

Loran425 · 2024-05-20T01:43:53Z

Napari is a great project, and I recommend giving it a look, but it has a different goal. We don't need to copy its systems per se -- the idea is just that they've been able to manage the separation of concerns pretty well in a larger Python Qt project. PyQt is nice as it's really easy to get started and have an MVP fast, but as soon as it grows in complexity and in contributors, the difficulty can ramp up fast. Separation of concerns, types, and documentation all really help here.

Yeah, I wouldn't think about copying verbatim, just looking for an understanding of the separation. After exploring for a little bit, and especially with the potential for future plugins, I'll be looking at protocols for this PR, but I'm still open to changes if there's a better suggestion.
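
For readers unfamiliar with protocols, here is a sketch of the general shape using `typing.Protocol`. The commit messages in this PR say that reads are passed an ID, that creates, updates, and deletes are passed objects, that `get_tags` returns a dict of tag_id to tag, and that `save` is part of the DataSource Protocol; every other name below (`get_tag`, `create_tag`, `DictDataSource`, and so on) is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Tag:  # placeholder type
    id: int
    name: str


class DataSource(Protocol):
    def get_tag(self, tag_id: int) -> Tag: ...      # reads take an ID
    def get_tags(self) -> dict[int, Tag]: ...       # or fetch everything
    def create_tag(self, tag: Tag) -> None: ...     # writes take objects
    def update_tag(self, tag: Tag) -> None: ...
    def delete_tag(self, tag: Tag) -> None: ...
    def save(self) -> None: ...


class DictDataSource:
    """Toy backend; it satisfies DataSource structurally, without inheriting from it."""

    def __init__(self) -> None:
        self._tags: dict[int, Tag] = {}
        self.saved = False

    def get_tag(self, tag_id: int) -> Tag:
        return self._tags[tag_id]

    def get_tags(self) -> dict[int, Tag]:
        return dict(self._tags)

    def create_tag(self, tag: Tag) -> None:
        self._tags[tag.id] = tag

    def update_tag(self, tag: Tag) -> None:
        self._tags[tag.id] = tag

    def delete_tag(self, tag: Tag) -> None:
        del self._tags[tag.id]

    def save(self) -> None:
        self.saved = True


def rename_tag(source: DataSource, tag_id: int, new_name: str) -> None:
    """Library-side logic that depends only on the protocol."""
    tag = source.get_tag(tag_id)
    source.update_tag(Tag(tag.id, new_name))
    source.save()


source = DictDataSource()
source.create_tag(Tag(1, "Favorite"))
rename_tag(source, 1, "Starred")
print(source.get_tags())
```

Compared with an abstract base class, a protocol lets a plugin supply a backend without importing or subclassing anything from the library; a type checker verifies the match by method signatures alone.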

Exactly! This should minimize relinking and broken link annoyances for the user. They can move files around, delete and restore them, have files with the same name, etc. all while the metadata (Tags) for the files are magically linked. (We'd want some recycle bin and archival features as well for deleting files).

Not going to lie, that sounds pretty appealing, I'm sure there are still some cases that this won't catch but we would have those either way. I'll probably start working that way unless I hear direction otherwise or there are solid points against this.

Loran425 · 2024-05-31T03:36:12Z

Trying to separate the library from the data source is just leading me to either keep the library functioning on JSON or implement some naive ORM. As a result I'm closing this in favor of #190. I think there are still some good ideas for identification and deduplication that came out of the discussion, but I am not going to be able to match the performance or maintainability of code written with SQLAlchemy. That library, as stated, also opens the backend system to a variety of SQL dialects.

Draft to enable communication

e9d2e71

CyanVoxel added the enhancement, refactor, quality of life, and library system labels May 16, 2024

CyanVoxel added this to the SQLite Database Migration milestone May 16, 2024

Loran425 added 2 commits May 16, 2024 17:12

Merge remote-tracking branch 'upstream/main' into feature/SQLite-data…

e0bb6f5

…base

Removed unused table drops and table definitions

c5ad5f3

Add database versioning to the create_db.sql script

62b517e

JoshuaMaddy mentioned this pull request May 18, 2024

SQLite DB with SQLAlchemy #190

Draft

DannyAlas mentioned this pull request May 20, 2024

A file system provider #197

Draft

Loran425 added 9 commits May 21, 2024 22:57

WIP Introduction of Datasource Protocol

cd100ee

Remove Alias as an object, Datasource should provide a list of strings

7f12f55

get_tags should return a dict of tag_id, tag not just a list of tags

32a3907

Add missing fetch to locations

f990a98

Add Sql for getting entries

5d5cef3

Add Sql for getting tags

d37f130

Tag_relations will be a component of tags in the library rather than …

f3dabf3

…a standalone object

Tags don't add parent/sub tags, libraries assign them to tags.

6627f07

Cleanup Comments in open library

7c042b8

Loran425 added 10 commits May 24, 2024 20:49

Forgot about boolean short-circuiting

83fcff3

Add ways to save or export the library

ec85267

Add `save` to the DataSource Protocol

add ValueError if SqliteLibrary doesn't have an entry with the given ID

300dae9

add to_json methods to library data structures

e122937

comply with json_typing structures

change error msg to stay consistent

4808d3a

add get_entry doc string to document possible errors

ea417a1

ruffen code

7c4a4f4

sorting fields no longer needed

9de2bd2

Entries limited to a single instance of each field no longer needed

96faa2b

No guesswork in the library

1b8693a

Loran425 changed the base branch from main to db-migration May 26, 2024 18:54

Loran425 added 17 commits May 26, 2024 12:57

Merge remote-tracking branch 'refs/remotes/upstream/db-migration' int…

de7066a

…o feature/SQLite-database

Constants Moved

cc45eec

Update library to_json

14013dd

Add entity deletion to protocol and DataSource

f3b2600

mypy fixes, swap missing files to a generator

c3906c5

refactor duplicate entries

f246b87

fix mypy typing

455e95a

Shorten Comment on Tag.Parents

fb89c83

Group get tag Functions

db91e49

Modify get_entry for datasource and caching

0f3c80c

remove_get_entry_from_index

5c4b42c

rename collation to group

226a0e8

Reorganize CRUD functions

f7f0561

Reorganize DataSource Protocol

5f64979

A DataSource can operate CRUD on 1 item at a time or get all. Reads are passed an ID; Creates, Updates & Deletes are passed objects (tag, entity, etc.).

Implement DataSource Protocol for SQLite

8f52416

Add missing values to tags

1e26cea

Update adding an entry to the library

1d3c9f7

Loran425 closed this May 31, 2024
