Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

infra: implement inactive locales content generation enhancement #2265

Merged
merged 13 commits into from
Oct 19, 2023

Conversation

xDivisionByZerox
Copy link
Member

@xDivisionByZerox xDivisionByZerox commented Jul 19, 2023

Description

This PR introduces a locale file normalization into the generateLocales script. This allows us to remove additional maintainace burdens by automatically checking (and fixing) for the following this in the same file:

  • duplicated locale entries
  • limiting the locale file entries to 1000
  • sorting the entries alphabetically

This PR also includes all to changes to all locale files which are affeected by the new rule. Honestly, I have no idea how you want to review this...

How to review

Please follow the steps @ST-DDT suggested in #2265 (comment):

The easiest way to review these is by reviewing the script and then resetting the locales to next and run the script again.
Then there shouldn't be any diff to the committed state.
Finally, some manual tests with some of the files.
So IMO there isn't a need to review them all manually.

@xDivisionByZerox xDivisionByZerox self-assigned this Jul 19, 2023
@xDivisionByZerox xDivisionByZerox added p: 1-normal Nothing urgent c: infra Changes to our infrastructure or project setup labels Jul 19, 2023
@xDivisionByZerox
Copy link
Member Author

The pipeline is failing to locales that were implemented poorly before. Look at:

import type { DateEntryDefinition } from '../../../definitions';
export default {
wide: [
'الأحَد',
'الإثنين',
'الثلاثاء',
'الأربعاء',
'الخميس',
'الجمعة',
'السبت',
],
} as DateEntryDefinition;

We simply casted the type here. What should be the expected behaviour here?

@matthewmayer
Copy link
Contributor

I'm not sure blanket sorting everything makes sense

You lose inherent ordering like days of the week.

And in many languages a naive sort based on Unicode character values doesn't reflect how strings are sorted.

Plus it causes a lot of churn meaning git blame is less useful.

@ST-DDT
Copy link
Member

ST-DDT commented Jul 19, 2023

IMO we should split these composite files into a folder and individual files.

@xDivisionByZerox xDivisionByZerox added the c: locale Permutes locale definitions label Jul 19, 2023
@xDivisionByZerox
Copy link
Member Author

IMO we should split these composite files into a folder and individual files.

That doesn't really answer the question I had. Should the generation script be able to handle invalid locales or no? Since the DateEntryDefinition requires the keys wide and abbr and we cast it, we currently have invalid data definitions for some locales.

@ST-DDT
Copy link
Member

ST-DDT commented Jul 19, 2023

How many invalid entries do we have?
If they are a select few, we should fix them instead of working around them in our code.
Even if we just do them in a non optimal way, it would probably still better than not having them at all. If we cannot find the data we can fall back to [].

@matthewmayer
Copy link
Contributor

matthewmayer commented Jul 19, 2023

I'm not sure blanket sorting everything makes sense

You lose inherent ordering like days of the week.

And in many languages a naive sort based on Unicode character values doesn't reflect how strings are sorted.

Plus it causes a lot of churn meaning git blame is less useful.

Additionally we may be losing useful information in the definitions by resorting.

For example while implementing #1704 it turned out the first_name file actually had all the male names first, then all the female names. So it was easy to split them. If this script had already been run prior to that, there wouldn't have been any easy way to separate the female and male names.

The number of files being touched here makes it unrealistic to manually review each one and see if any useful information is being discarded.

@xDivisionByZerox
Copy link
Member Author

Additionally we may be losing useful information in the definitions by resorting.

That might be. But the thing is that we already request our contributors to sort te locale entries. Some examples:

We generally want this. Sure, we might lose some context but gain standardization in return.

The number of files being touched here makes it unrealistic to manually review each one and see if any useful information is being discarded.

Agreed. But sadly this always happends if we do major changes to the locale structure. What do you suggest?

@ST-DDT
Copy link
Member

ST-DDT commented Jul 19, 2023

Additionally we may be losing useful information in the definitions by resorting.

AFAIK there aren't any data in our locales that are supposed to contain any extra information by their order.
There is currently the location.direction data that use the index to distinguish between cardinal and sub-cardinal directions, but IMO that can be easily resolved by splitting the data.

For example while implementing #1704 it turned out the first_name file actually had all the male names first, then all the female names.

Is there a file left that would still support this or all of them already split?
If they are already split, then we don't loose anything. (I'm not aware of any)
If not we can split them before the merge.

The number of files being touched here makes it unrealistic to manually review each one and see if any useful information is being discarded.

The easiest way to review these is by reviewing the script and then resetting the locales to next and run the script again.
Then there shouldn't be any diff to the committed state.
Finally, some manual tests with some of the files.
So IMO there isn't a need to review them all manually.

@xDivisionByZerox
Copy link
Member Author

Also your comment @matthewmayer: #1823 (review)

  • The entries should be sorted alphabetical order

I'm not saying that it is not allowed to change ones opinon, but you had this idea once as well^^

@xDivisionByZerox
Copy link
Member Author

The easiest way to review these is by reviewing the script and then resetting the locales to next and run the script again.
Then there shouldn't be any diff to the committed state.

Honestly, thats super big brain. Never thought about this kind of review 👀

@xDivisionByZerox
Copy link
Member Author

How many invalid entries do we have?
If they are a select few, we should fix them instead of working around them in our code.
Even if we just do them in a non optimal way, it would probably still better than not having them at all. If we cannot find the data we can fall back to [].

All that are currently fail in the CI. So 3 AFAICT.

@ST-DDT
Copy link
Member

ST-DDT commented Jul 19, 2023

All that are currently fail in the CI. So 3 AFAICT.

IMO 3 are easily fixable manually.
(We should do that in a separate PR though)

@matthewmayer
Copy link
Contributor

I agree new entries should always be in alphabetical order unless they are something with a logical order like days of the week. I'm just concerned about subtly losing information if old entries are sorted alphabetically.

@xDivisionByZerox
Copy link
Member Author

Blocked by #2293

@xDivisionByZerox xDivisionByZerox added the s: on hold Blocked by something or frozen to avoid conflicts label Aug 6, 2023
@xDivisionByZerox xDivisionByZerox force-pushed the infra/scripts/enhance-generate-locales branch from ba53fb7 to a42addb Compare August 28, 2023 18:03
@xDivisionByZerox xDivisionByZerox removed the s: on hold Blocked by something or frozen to avoid conflicts label Aug 28, 2023
@codecov
Copy link

codecov bot commented Aug 28, 2023

Codecov Report

Merging #2265 (cf5f72e) into next (a193693) will decrease coverage by 0.01%.
The diff coverage is n/a.

Additional details and impacted files
@@            Coverage Diff             @@
##             next    #2265      +/-   ##
==========================================
- Coverage   99.59%   99.58%   -0.01%     
==========================================
  Files        2823     2823              
  Lines      255517   255517              
  Branches     1106     1105       -1     
==========================================
- Hits       254475   254458      -17     
- Misses       1014     1031      +17     
  Partials       28       28              

see 2 files with indirect coverage changes

@matthewmayer
Copy link
Contributor

I don't see the advantage of having cardinal directions sorted in a "natural" order. There is no usecase in which this information is relevant to us. I can see the argument for the month. Since those values are a fixed number it might be mor usefull to use a map instead of an array there. This would allow reviewers to make easy assumtions about localized values.

I'd much rather this PR be introduced one method at a time so we can review to make sure if sorting makes sense or if we are losing any natural order. This PR is too big to review.

Why can't you do that if both lists (xx_CC and yy_CC) are sorted in the same why? Although, I'm not quite sure what you mean by "states in a country are numbered". Can you elaborate on this or give an example?

For example France departments have an numerical order

https://en.wikipedia.org/wiki/Departments_of_France?wprov=sfti1

If country xx_CC and yy_CC are sorted seperately then the Nth item in each list no longer lines up with the corresponding translation. So if state 1 is called Aflorida in xx and ZFlorida in YY they are no longer both in index position 1.

Ok, then just split your data into multiple soures as well:

Again we need to review the existing data to see if this is needed anywhere which is hard to do with a massive PR.

@ST-DDT
Copy link
Member

ST-DDT commented Oct 17, 2023

Team Decision

We will split this into multiple smaller PRs:

  1. (This PR) Add sort+unique script with every module ignored
  2. Enable one module at a time to check whether there are elements in there that need sorting
  3. Truncate to (shuffled) 1000k elements in all modules at once

Considerations:

A list of the top 1k elements is an atomic thing, instead of just replacing one element in it, you would replace the entire list with the new one.
For our usecase it is not important whether the top or the last element is chosen from that list, because it is just an entry.
Sorting makes the list somewhat easier to manage at least for latin like characters.
We can try sorting the data with the locale key and check if that might result in better sort results, but if it doesn't we will use simple sort.

@ST-DDT ST-DDT removed the s: needs decision Needs team/maintainer decision label Oct 17, 2023
@xDivisionByZerox xDivisionByZerox changed the title infra: generate locales enhance array content generation infra: implement inactive locales content generation enhancement Oct 17, 2023
@ST-DDT ST-DDT merged commit a882244 into next Oct 19, 2023
20 checks passed
@ST-DDT ST-DDT deleted the infra/scripts/enhance-generate-locales branch October 19, 2023 23:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c: infra Changes to our infrastructure or project setup c: locale Permutes locale definitions p: 1-normal Nothing urgent
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants