infra: implement inactive locales content generation enhancement #2265

xDivisionByZerox · 2023-07-19T19:27:32Z

Description

This PR introduces locale file normalization into the generateLocales script. This removes additional maintenance burden by automatically checking for (and fixing) the following things within each file:

duplicated locale entries
limiting the locale file entries to 1000
sorting the entries alphabetically
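
As a rough illustration of the three checks listed above, a minimal sketch could look like the following. The helper name `normalizeEntries`, the default limit parameter, and the order of the steps (sort first, then truncate) are assumptions for illustration, not the script's actual implementation.

```typescript
// Hypothetical helper: the name, signature, and step order are assumptions.
function normalizeEntries(
  entries: readonly string[],
  limit = 1000
): string[] {
  // A Set drops duplicated entries while keeping the first occurrence.
  const unique = [...new Set(entries)];
  // The default sort compares UTF-16 code units, so it is not locale-aware.
  unique.sort();
  // Keep at most `limit` entries.
  return unique.slice(0, limit);
}

const normalized = normalizeEntries(['pear', 'apple', 'pear', 'fig'], 2);
console.log(normalized); // ['apple', 'fig']
```

Truncating after sorting keeps only the alphabetically first entries, so a real implementation may prefer to pick the subset differently.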

This PR also includes the changes to all locale files that are affected by the new rules. Honestly, I have no idea how you want to review this...

How to review

Please follow the steps @ST-DDT suggested in #2265 (comment):

The easiest way to review these is by reviewing the script and then resetting the locales to next and run the script again.
Then there shouldn't be any diff to the committed state.
Finally, some manual tests with some of the files.
So IMO there isn't a need to review them all manually.

xDivisionByZerox · 2023-07-19T19:36:12Z

The pipeline is failing due to locales that were implemented poorly before. Look at:

faker/src/locales/ar/date/weekday.ts

Lines 1 to 13 in 8530a3e

    
    import type { DateEntryDefinition } from '../../../definitions';

    export default {
      wide: [
        'الأحَد',
        'الإثنين',
        'الثلاثاء',
        'الأربعاء',
        'الخميس',
        'الجمعة',
        'السبت',
      ],
    } as DateEntryDefinition;

We simply cast the type here. What should the expected behaviour be?

matthewmayer · 2023-07-19T19:48:48Z

I'm not sure blanket sorting everything makes sense

You lose inherent ordering like days of the week.

And in many languages a naive sort based on Unicode character values doesn't reflect how strings are sorted.

Plus it causes a lot of churn meaning git blame is less useful.

ST-DDT · 2023-07-19T20:02:09Z

IMO we should split these composite files into a folder and individual files.

xDivisionByZerox · 2023-07-19T20:46:16Z

IMO we should split these composite files into a folder and individual files.

That doesn't really answer my question. Should the generation script be able to handle invalid locales or not? Since DateEntryDefinition requires the keys wide and abbr and we cast it, we currently have invalid data definitions for some locales.
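
To make the problem concrete, here is a small sketch of how a type assertion can hide a missing key. The `DateEntryDefinition` interface below is a simplified stand-in that only models the two required keys mentioned above; it is not the real definition from the repository.

```typescript
// Simplified stand-in (assumption): only the two required keys are modelled.
interface DateEntryDefinition {
  wide: string[];
  abbr: string[];
}

// This compiles: the assertion silences the error about the missing `abbr`.
const casted = { wide: ['Sunday', 'Monday'] } as DateEntryDefinition;

// A plain type annotation would reject the same literal at compile time:
// const annotated: DateEntryDefinition = { wide: ['Sunday', 'Monday'] };
// -> error: property 'abbr' is missing

// At runtime `casted.abbr` is undefined, so consumers need a guard.
const abbr: string[] = casted.abbr ?? [];
console.log(abbr.length); // 0
```

Replacing the assertion with an annotation (or `satisfies`) would surface such gaps at compile time instead of in the generation script.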

ST-DDT · 2023-07-19T21:45:09Z

How many invalid entries do we have?
If they are a select few, we should fix them instead of working around them in our code.
Even if we just do them in a non-optimal way, it would probably still be better than not having them at all. If we cannot find the data we can fall back to [].

matthewmayer · 2023-07-19T22:15:36Z

I'm not sure blanket sorting everything makes sense

You lose inherent ordering like days of the week.

And in many languages a naive sort based on Unicode character values doesn't reflect how strings are sorted.

Plus it causes a lot of churn meaning git blame is less useful.

Additionally we may be losing useful information in the definitions by resorting.

For example while implementing #1704 it turned out the first_name file actually had all the male names first, then all the female names. So it was easy to split them. If this script had already been run prior to that, there wouldn't have been any easy way to separate the female and male names.

The number of files being touched here makes it unrealistic to manually review each one and see if any useful information is being discarded.

xDivisionByZerox · 2023-07-19T22:28:45Z

Additionally we may be losing useful information in the definitions by resorting.

That might be. But the thing is that we already ask our contributors to sort the locale entries. Some examples:

We generally want this. Sure, we might lose some context but gain standardization in return.

The number of files being touched here makes it unrealistic to manually review each one and see if any useful information is being discarded.

Agreed. But sadly this always happens when we make major changes to the locale structure. What do you suggest?

ST-DDT · 2023-07-19T22:29:22Z

Additionally we may be losing useful information in the definitions by resorting.

AFAIK there isn't any data in our locales that is supposed to carry extra information in its order.
There is currently the location.direction data, which uses the index to distinguish between cardinal and sub-cardinal directions, but IMO that can easily be resolved by splitting the data.

For example while implementing #1704 it turned out the first_name file actually had all the male names first, then all the female names.

Is there a file left that would still support this, or are all of them already split?
If they are already split, then we don't lose anything. (I'm not aware of any.)
If not, we can split them before the merge.

The number of files being touched here makes it unrealistic to manually review each one and see if any useful information is being discarded.

The easiest way to review these is by reviewing the script and then resetting the locales to next and run the script again.
Then there shouldn't be any diff to the committed state.
Finally, some manual tests with some of the files.
So IMO there isn't a need to review them all manually.

xDivisionByZerox · 2023-07-19T22:31:31Z

Also your comment @matthewmayer: #1823 (review)

The entries should be sorted alphabetical order

I'm not saying that it is not allowed to change one's opinion, but you had this idea once as well^^

xDivisionByZerox · 2023-07-19T22:33:42Z

The easiest way to review these is by reviewing the script and then resetting the locales to next and run the script again.
Then there shouldn't be any diff to the committed state.

Honestly, that's super big brain. Never thought about this kind of review 👀

xDivisionByZerox · 2023-07-19T22:34:33Z

How many invalid entries do we have?
If they are a select few, we should fix them instead of working around them in our code.
Even if we just do them in a non-optimal way, it would probably still be better than not having them at all. If we cannot find the data we can fall back to [].

All that are currently failing in the CI. So 3 AFAICT.

ST-DDT · 2023-07-19T22:35:54Z

All that are currently failing in the CI. So 3 AFAICT.

IMO 3 are easily fixable manually.
(We should do that in a separate PR though)

matthewmayer · 2023-07-19T23:14:00Z

I agree new entries should always be in alphabetical order unless they are something with a logical order like days of the week. I'm just concerned about subtly losing information if old entries are sorted alphabetically.

xDivisionByZerox · 2023-08-06T10:04:31Z

Blocked by #2293

codecov · 2023-08-28T19:19:53Z

Codecov Report

Merging #2265 (cf5f72e) into next (a193693) will decrease coverage by 0.01%.
The diff coverage is n/a.

Additional details and impacted files

@@            Coverage Diff             @@
##             next    #2265      +/-   ##
==========================================
- Coverage   99.59%   99.58%   -0.01%     
==========================================
  Files        2823     2823              
  Lines      255517   255517              
  Branches     1106     1105       -1     
==========================================
- Hits       254475   254458      -17     
- Misses       1014     1031      +17     
  Partials       28       28

see 2 files with indirect coverage changes

matthewmayer · 2023-10-14T13:28:44Z

I don't see the advantage of having cardinal directions sorted in a "natural" order. There is no use case in which this information is relevant to us. I can see the argument for the months. Since those values are a fixed number, it might be more useful to use a map instead of an array there. This would allow reviewers to make easy assumptions about localized values.

I'd much rather this PR be introduced one method at a time so we can review to make sure if sorting makes sense or if we are losing any natural order. This PR is too big to review.

Why can't you do that if both lists (xx_CC and yy_CC) are sorted in the same way? Although, I'm not quite sure what you mean by "states in a country are numbered". Can you elaborate on this or give an example?

For example, French departments have a numerical order

https://en.wikipedia.org/wiki/Departments_of_France?wprov=sfti1

If country xx_CC and yy_CC are sorted separately, then the Nth item in each list no longer lines up with the corresponding translation. So if state 1 is called Aflorida in xx and ZFlorida in yy, they are no longer both in index position 1.
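
A tiny sketch of that misalignment, using the two made-up names from above plus a second hypothetical state (`Btexas` / `Atexas`) so that there is something to reorder:

```typescript
// Hypothetical parallel lists: index i refers to the same state in both.
const xx = ['Aflorida', 'Btexas'];
const yy = ['ZFlorida', 'Atexas'];

// Sorting each list on its own, as a per-file normalization would.
const sortedXx = [...xx].sort(); // ['Aflorida', 'Btexas']
const sortedYy = [...yy].sort(); // ['Atexas', 'ZFlorida']

// Before sorting, index 0 pairs the two names of the same state.
const pairBefore = [xx[0], yy[0]]; // ['Aflorida', 'ZFlorida']
// After sorting, index 0 pairs names of two different states.
const pairAfter = [sortedXx[0], sortedYy[0]]; // ['Aflorida', 'Atexas']

console.log(pairBefore, pairAfter);
```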

Ok, then just split your data into multiple sources as well:

Again we need to review the existing data to see if this is needed anywhere which is hard to do with a massive PR.

ST-DDT · 2023-10-17T16:54:03Z

Team Decision

We will split this into multiple smaller PRs:

(This PR) Add sort+unique script with every module ignored
Enable one module at a time to check whether there are elements in there that need sorting
Truncate to (shuffled) 1000 elements in all modules at once
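
The first two steps could be sketched roughly as follows: a skip list that initially contains every module, from which modules are removed one PR at a time. The module names, the `skippedModules` set, and `normalizeModule` are hypothetical and only illustrate the rollout idea.

```typescript
// Hypothetical skip list: modules listed here are left untouched.
const skippedModules = new Set<string>(['date', 'location']);

// Hypothetical entry point: sort + unique only for modules not skipped.
function normalizeModule(moduleName: string, entries: string[]): string[] {
  if (skippedModules.has(moduleName)) {
    // Not yet reviewed: keep the existing order and content.
    return entries;
  }
  return [...new Set(entries)].sort();
}

console.log(normalizeModule('date', ['Monday', 'Friday'])); // unchanged
console.log(normalizeModule('animal', ['dog', 'cat', 'dog'])); // ['cat', 'dog']
```

Enabling a module then amounts to deleting its name from the skip list and reviewing the resulting diff for that module alone.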

Considerations:

A list of the top 1k elements is an atomic thing: instead of just replacing one element in it, you would replace the entire list with the new one.
For our use case it is not important whether the top or the last element is chosen from that list, because it is just an entry.
Sorting makes the list somewhat easier to manage, at least for Latin-like characters.
We can try sorting the data with the locale key and check if that might result in better sort results, but if it doesn't we will use simple sort.
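
For the last consideration, a minimal sketch of locale-aware sorting with a simple-sort fallback might look like this. `sortForLocale` is a hypothetical helper; it assumes the locale key can be turned into a BCP 47 tag by replacing underscores with hyphens, which may not hold for every key.

```typescript
// Hypothetical helper: tries Intl.Collator first, then falls back to simple sort.
function sortForLocale(
  entries: readonly string[],
  localeKey: string
): string[] {
  const copy = [...entries];
  try {
    // Intl expects hyphens (e.g. 'de-AT'), so underscores are replaced.
    const collator = new Intl.Collator(localeKey.replace(/_/g, '-'));
    return copy.sort(collator.compare);
  } catch {
    // Invalid language tag: use the simple code-unit sort instead.
    return copy.sort();
  }
}

const simple = ['z', 'ä', 'a'].sort(); // ['a', 'z', 'ä'] (code-unit order)
const localeAware = sortForLocale(['z', 'ä', 'a'], 'de');
console.log(simple, localeAware);
```

Comparing the two outputs per locale would show whether the collator actually produces better results than the simple sort.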

xDivisionByZerox self-assigned this Jul 19, 2023

xDivisionByZerox added p: 1-normal Nothing urgent c: infra Changes to our infrastructure or project setup labels Jul 19, 2023

xDivisionByZerox added the c: locale Permutes locale definitions label Jul 19, 2023

xDivisionByZerox force-pushed the infra/scripts/enhance-generate-locales branch from ee486a4 to ba53fb7 Compare August 5, 2023 23:08

xDivisionByZerox mentioned this pull request Aug 5, 2023

fix(locale): invalid date definitions #2293

Merged

xDivisionByZerox added the s: on hold Blocked by something or frozen to avoid conflicts label Aug 6, 2023

matthewmayer mentioned this pull request Aug 18, 2023

feat(location): update en county list #2238

Merged

xDivisionByZerox added 3 commits August 28, 2023 19:37

infra: generate locales enhance array content generation

364e79c

refactor: extract to own method

7fa1513

refactor: polishing

a42addb

xDivisionByZerox force-pushed the infra/scripts/enhance-generate-locales branch from ba53fb7 to a42addb Compare August 28, 2023 18:03

xDivisionByZerox removed the s: on hold Blocked by something or frozen to avoid conflicts label Aug 28, 2023

xDivisionByZerox added 2 commits August 28, 2023 20:15

refactor: simpler recursive data normalization

c353ac0

test: update snapshots

e6d6c29

ST-DDT removed the s: needs decision Needs team/maintainer decision label Oct 17, 2023

xDivisionByZerox added 3 commits October 18, 2023 00:57

infra: deactivate locale normalization for all modules

eca418e

refactor: revert all locale changes

22e613d

test: update snapshots

c487847

xDivisionByZerox changed the title ~~infra: generate locales enhance array content generation~~ infra: implement inactive locales content generation enhancement Oct 17, 2023

xDivisionByZerox requested review from ST-DDT and matthewmayer October 17, 2023 23:07

matthewmayer approved these changes Oct 18, 2023

View reviewed changes

ST-DDT approved these changes Oct 18, 2023

View reviewed changes

Merge branch 'next' into infra/scripts/enhance-generate-locales

cf5f72e

ST-DDT merged commit a882244 into next Oct 19, 2023
20 checks passed

ST-DDT deleted the infra/scripts/enhance-generate-locales branch October 19, 2023 23:15

ST-DDT mentioned this pull request Nov 5, 2023

refactor(locale): remove fr_CH data which is identical to fr #2526

Merged

matthewmayer mentioned this pull request Mar 15, 2024

refactor(word): reduce definitions to 1000 in all locales #2751

Merged

xDivisionByZerox mentioned this pull request Apr 7, 2024

refactor(locale): normalize animal data #2791

Merged

This was referenced Apr 15, 2024

refactor(locale): activate data normalization for airline #2828

Merged

refactor(locale): activate data normalization for color #2837

Merged

This was referenced May 8, 2024

refactor(locale): normalize science data #2886

Merged

refactor(locale): normalize company data #2889

Merged

xDivisionByZerox mentioned this pull request May 15, 2024

refactor(locale): normalize date data #2902

Merged

ST-DDT mentioned this pull request May 16, 2024

Check whether the locale data should use locale aware sorting #2905

Open

xDivisionByZerox mentioned this pull request May 19, 2024

refactor(locale): normalize finance data #2915

Merged

TroyTargaryen mentioned this pull request May 31, 2024

[Snyk] Upgrade @faker-js/faker from 8.0.2 to 8.4.1 TroyTargaryen/frodor-dev#5

Open
