
How to reuse index/model definition in multiple processes #349

Open
askids opened this issue Apr 12, 2023 · 7 comments
Labels
enhancement New feature or request

Comments

askids commented Apr 12, 2023

Hi,

I am trying to see if I can reuse a model definition across multiple processes, with each process having its own copy of the collection data. Essentially, I am looking to implement a deduplication library that uses a standard model definition. So when multiple processes are running in parallel, I want each process to have its own copy of the created collection. a) Is it possible to use the index model this way? b) Is Redis.OM suitable for this use case, or should I consider alternate options for deduping?

The collection will be deleted at the end of the process. I plan to use the collection to provide restart capability, so that when I restart the process, I can skip any previously extracted data.

Thanks!

slorello89 (Member) commented

Not quite sure what you're looking to do here. The collection can be used in multiple processes. If you're trying to do an initial seed of the database, the thing to do would probably be to create a lock in Redis to prevent other processes from trying to seed it while your seed process is working. Does that make sense?

askids (Author) commented Apr 12, 2023

@slorello89 As mentioned in my post, the collection will not be shared by processes. Each process needs to get its own copy of the collection, which will use a common model. Each batch process creates a large data extract using different criteria, may connect to a different database, and writes the extracted data to its own file. Multiple such processes will be running in parallel. Think of it as just the extract part of an ETL process.

Let's say the source has 20M records. I will first split the data into ranges of 50K and then run multiple parallel extracts to pull data for each range and push it into an internal channel, which is then used to write the extracted data to a file. So for 20M records, I will have 400 ranges of IDs, with each range covering 50K records. Whenever I write to the file, I will also populate the collection.

So if the process fails in between, let's say due to a network issue or query timeout, I will use the collection to skip IDs that have already been written to the file. Since the output is a file and not a database, I don't have any commit sync points to restart from. As explained above, the extract process itself runs as a multithreaded process. So for some ID ranges, full data may have been written to the file; for other ranges, only partial data may have been written. I will have logic to skip completed ranges altogether. But for a range where only partial data was written to the file, I will use the collection to skip the records.

These are long-running processes, hence we don't want to run the extract from the top again. Also, this extract process will be built as a generic utility used by multiple teams and multiple processes. So I need each instance of the process to get its own collection, but using the same model. At the end of a successful run, we will delete that collection.

VagyokC4 commented

So I need each instance of the process to get its own collection, but using the same model. At the end of the successful run, we will delete that collection.

Sounds like you just want to give each index its own unique name and manage which processes are using which index names. Each process will get its own Redis.OM instance, so you just need to configure it to set the collection to use the right index prefix. While I haven't tested custom index names, it seems like it should be doable.
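For reference, the per-model index setup in Redis.OM for .NET looks roughly like this. This is a sketch with a hypothetical ExtractedRecord model; note the index name and prefix live in the attribute, so as written they are fixed at compile time:

```csharp
using System.Threading.Tasks;
using Redis.OM;
using Redis.OM.Modeling;

// Hypothetical model for the dedup use case; IndexName and Prefixes
// are attribute values, i.e. baked in at compile time.
[Document(StorageType = StorageType.Json,
          IndexName = "dedup-idx",
          Prefixes = new[] { "dedup" })]
public class ExtractedRecord
{
    [RedisIdField] [Indexed] public string Id { get; set; }
    [Indexed] public string RangeKey { get; set; }
}

public static class Example
{
    public static async Task RunAsync()
    {
        var provider = new RedisConnectionProvider("redis://localhost:6379");

        // Create the index for this model, then work with it as a collection.
        await provider.Connection.CreateIndexAsync(typeof(ExtractedRecord));
        var records = provider.RedisCollection<ExtractedRecord>();

        await records.InsertAsync(new ExtractedRecord { Id = "123", RangeKey = "0-50000" });

        // Drop the index at the end of the run.
        provider.Connection.DropIndex(typeof(ExtractedRecord));
    }
}
```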

slorello89 (Member) commented

Splitting them across different indexes would be the quickest way to go about it. You could also manage which ranges each process is responsible for across a single index (have proc 1 be responsible for 0-50k, proc 2 for 50k-100k, etc.), but you would need a way of coordinating that (something in Redis assigning the work could work). Also, the deeper you go into the result set, the more expensive it is, because access is O(N) for the total number of records being enumerated (this includes the records you're skipping in the LIMIT).
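The single-index variant described above — each process filtering to its assigned range rather than paging past other processes' records — might look like this in Redis.OM's LINQ syntax. This is a sketch; the SharedRecord model and its NumericId field are made up for illustration:

```csharp
using Redis.OM;
using Redis.OM.Modeling;

[Document(StorageType = StorageType.Json, Prefixes = new[] { "record" })]
public class SharedRecord
{
    [RedisIdField] public string Id { get; set; }
    [Indexed] public long NumericId { get; set; } // hypothetical numeric key
}

public static class RangeWorker
{
    public static void Run(long start, long end)
    {
        var provider = new RedisConnectionProvider("redis://localhost:6379");
        var records = provider.RedisCollection<SharedRecord>();

        // Filter server-side to this process's assigned range instead of
        // Skip/Take paging, so no process pays the O(N) LIMIT-offset cost
        // of enumerating past the other processes' ranges.
        var mine = records.Where(r => r.NumericId >= start && r.NumericId < end);

        foreach (var record in mine)
        {
            // ... process this range's records ...
        }
    }
}
```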

askids (Author) commented Apr 14, 2023

Maybe I was not very clear. The range splitting is happening within a single process. Also, there will be 60-70 such processes running at different times of the day. These processes are run via a generic utility, but each process runs with its own configuration. New processes could be added over time, and the consumer app will only add the configuration and expect the process to work. So I wouldn't know how many indexes to create upfront. That is why I need the ability to dynamically create the collection and then delete it once the process completes. The model used by the collection will be fixed, but the index/collection created for each process has to be unique.

askids (Author) commented Apr 14, 2023

So I need each instance of the process to get its own collection, but using the same model. At the end of the successful run, we will delete that collection.

Sounds like you just want to give each index its own unique name and manage which processes are using which index names. Each process will get its own Redis.OM instance, so you just need to configure it to set the collection to use the right index prefix. While I haven't tested custom index names, it seems like it should be doable.

Correct. Right now the prefix has to be hard-coded, which I can't do, as I will need to set this at runtime depending on the process that is running it. I can use some fixed prefix and append the fixed job ID that each process is configured with. This way, each instance of the process gets its own unique collection.

VagyokC4 commented

Correct. Right now the prefix has to be hard-coded, which I can't do, as I will need to set this at runtime depending on the process that is running it. I can use some fixed prefix and append the fixed job ID that each process is configured with. This way, each instance of the process gets its own unique collection.

Sounds like #242 is what you need. In that issue I describe a way to set the attributes at runtime, while you wait for this functionality to be wired up internally.
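Until something like #242 lands, one lower-level escape hatch (a sketch of my own, not the approach described in that issue) is to bypass the attribute entirely and issue the index commands through Redis.OM's raw Execute API, composing the index name and key prefix from the job ID at runtime. The jobId value and the single-field schema here are illustrative:

```csharp
using Redis.OM;

var provider = new RedisConnectionProvider("redis://localhost:6379");

var jobId = "job-42";                 // hypothetical per-process identifier
var indexName = $"dedup-idx:{jobId}"; // fixed prefix + job ID, as described above
var keyPrefix = $"dedup:{jobId}:";

// Create a per-process index over JSON documents with a runtime-chosen
// name and key prefix.
provider.Connection.Execute("FT.CREATE", indexName,
    "ON", "JSON", "PREFIX", "1", keyPrefix,
    "SCHEMA", "$.Id", "AS", "Id", "TAG");

// ... run the extract, writing JSON keys under keyPrefix ...

// At the end of a successful run, drop the index; the DD option also
// deletes the indexed documents.
provider.Connection.Execute("FT.DROPINDEX", indexName, "DD");
```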

@slorello89 slorello89 added the enhancement New feature or request label Aug 16, 2023