Add ISO 639-3 language codes #293

Merged 1 commit on Apr 10, 2023

sandbergja commented Mar 8, 2023

Closes #292

Includes a new rake task to regenerate the language translation map as needed.

jrochkind commented Mar 8, 2023

Looks good, thanks for being so complete!

How do we know there are no conflicts between three-letter codes from LC MARC language codes vs ISO 639-3? Now and forever, even if ISO adds more codes?

Should anything in the generation task check to make sure there wasn't a conflict? (Or does it already, and I missed it?) If a code is used in both, whichever one ends up in the file last (ISO) will probably overwrite the earlier one. That is, with the current layout of the file, you'd probably get the ISO translation rather than the MARC one.
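
One way to answer the conflict question mechanically is to compare the two lists before they are merged. The following is a minimal sketch, assuming both lists have already been parsed into code-to-label hashes. The method name `label_conflicts` and the sample data are hypothetical (`qaa` and `qab` are local-use placeholder codes, not real list entries), and this is not necessarily what the PR's rake task does:

```ruby
# Hypothetical sample data standing in for the parsed MARC and ISO 639-3 lists.
marc_codes = {
  "eng" => "English",
  "qaa" => "Example language (MARC wording)"
}
iso_codes = {
  "eng" => "English",
  "qaa" => "Example language (ISO wording)",
  "qab" => "Another example language"
}

# Returns { code => [marc_label, iso_label] } for every code that appears
# in both lists with different labels.
def label_conflicts(marc, iso)
  (marc.keys & iso.keys).each_with_object({}) do |code, conflicts|
    conflicts[code] = [marc[code], iso[code]] unless marc[code] == iso[code]
  end
end

conflicts = label_conflicts(marc_codes, iso_codes)

# Report each conflict so a maintainer can decide which label should win.
conflicts.each do |code, (marc_label, iso_label)|
  warn "#{code}: MARC=#{marc_label.inspect} ISO=#{iso_label.inspect}"
end
```

A generation task could print these warnings, or fail outright, whenever `conflicts` is non-empty.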

@jrochkind

I found a partial answer to my own question! "The Library of Congress is maintenance agency for both lists, and the two are kept compatible in terms of code additions and deletions."

https://www.loc.gov/marc/languages/introduction.pdf

ISO 639-2 (Codes for the representation of names of languages -- Part 2: alpha-3 code) was based on the MARC Code List for Languages and published in 1998. In the 22 cases where the ISO 639-2 list has two alternative codes, the bibliographic code is the same as the MARC code. Language names in ISO 639-2 are not necessarily the same as those in MARC, particularly because of the practice of correlating the MARC language names with those used in Library of Congress Subject Headings. The MARC list includes references for unused forms of language names, while the ISO list has in some cases included alternative name forms, but many are lacking, since this practice of supplying alternate forms has only recently been implemented. In addition the MARC documentation includes a list of individual languages under collective codes or language groups, while the ISO list only includes the group codes themselves. The Library of Congress is maintenance agency for both lists, and the two are kept compatible in terms of code additions and deletions.

However...

In the 22 cases where the ISO 639-2 list has two alternative codes, the bibliographic code is the same as the MARC code. Language names in ISO 639-2 are not necessarily the same as those in MARC, particularly because of the practice of correlating the MARC language names with those used in Library of Congress Subject Headings.

Is it possible for the same code to exist in both lists with slightly different labels, or is that not done? If it does happen, this change might mean that some people using the new translation map get different labels than they did with the old one, before this PR. Or am I misunderstanding? If labels can change like that, is it a problem?

@sandbergja

I attempted to throw out any duplicate codes with this line: sandbergja@2f48bb7#diff-00faae62d158f145b9eb2fe759cfdd1119003521712b53b7e99af8f4a49349caR72

So the YAML file produced should contain each code only once, with the label coming from the LoC data rather than the ISO 639-3 data.
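
For readers without the diff at hand, the intended precedence can be sketched roughly as follows. This is an illustration under assumptions, not the linked line itself: the hash names and sample data are hypothetical, and `qaa`/`qab` are local-use placeholder codes.

```ruby
# Hypothetical sample data standing in for the parsed LoC (MARC) and ISO 639-3 lists.
marc_codes = {
  "eng" => "English",
  "qaa" => "Example language (MARC wording)"
}
iso_codes = {
  "eng" => "English",
  "qaa" => "Example language (ISO wording)",
  "qab" => "Another example language"
}

# Throw out any ISO entry whose code the LoC list already defines...
iso_only = iso_codes.reject { |code, _label| marc_codes.key?(code) }

# ...so that merging cannot overwrite a LoC label. Each code appears once,
# LoC entries first, followed by the ISO-only entries.
combined = marc_codes.merge(iso_only)

combined.each { |code, label| puts "#{code}: #{label}" }
```

Filtering before merging makes the precedence explicit, rather than relying on which list happens to be written to the file last.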

sandbergja commented Mar 8, 2023

Also, if it would be helpful, I could add some kind of regression test to confirm that the translation map doesn't contain duplicates, and that any label from LoC takes precedence.
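
Such a regression check might look something like the sketch below. It assumes the translation map is a flat YAML file of `code: label` pairs; the helper names and the inline sample are hypothetical. Because a parsed Hash cannot hold the same key twice, the duplicate check scans the raw text instead of the parsed result.

```ruby
require "yaml"

# Returns the top-level keys that occur more than once in a flat
# "key: value" YAML document. Comment and indented lines are ignored.
def duplicate_yaml_keys(yaml_text)
  keys = yaml_text.each_line.filter_map do |line|
    line[/\A['"]?([^\s'":#]+)['"]?\s*:/, 1]
  end
  keys.tally.select { |_key, count| count > 1 }.keys
end

# True when every LoC code keeps its LoC label in the translation map.
def loc_labels_win?(translation_map, marc_codes)
  marc_codes.all? { |code, label| translation_map[code] == label }
end

# Hypothetical stand-ins for the LoC data and the generated file contents.
marc_codes = {
  "eng" => "English",
  "qaa" => "Example language (MARC wording)"
}
generated = <<~YAML
  eng: English
  qaa: Example language (MARC wording)
  qab: Another example language
YAML

raise "duplicate codes found" unless duplicate_yaml_keys(generated).empty?
raise "a LoC label was overwritten" unless loc_labels_win?(YAML.safe_load(generated), marc_codes)
```

In a real test suite the two `raise` lines would become assertions, and `generated` would be read from the translation map file on disk.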

Successfully merging this pull request may close these issues.

Index ISO 639-3 codes