Documentation for covering grammars #490

Open
kylebgorman opened this issue Mar 22, 2023 · 5 comments
Labels: documentation, good first issue

Comments

@kylebgorman
Collaborator

We have no effective documentation for the covering grammars data library.

@kylebgorman added the documentation and good first issue labels on Mar 22, 2023
@neurlang

Hi, what is a covering grammar? I have language-to-IPA longest-prefix-match grammars for a number of languages:

English
Spanish
German
French
Italian
Arabic
Farsi
Luxembourgish
Dutch
Portuguese
Russian
Swedish
Czech
Slovak
Romanian
Finnish
Isan
Swahili
Esperanto
Icelandic
Norwegian
Jamaican
Japanese

It can generate a new grammar from TSV files like the ones you have.

@kylebgorman
Collaborator Author

A covering grammar is essentially a listing of, for each character, all the pronunciations it can take on. For this to work with our system it also has to be at the right level (broad or narrow) and actually match what's in the Wiktionary data. Maybe you could take a look at what we have here, which overlaps a few of your languages:

https://github.com/CUNY-CL/wikipron/tree/master/data/covering_grammar/tsv

and see how they differ, if at all. If they are broadly similar it might make sense to incorporate your data.

@neurlang

Checking Japanese in your dataset, I see that you have a mapping "あ"->"a̠" but also "ああ"->"a̠ː". Consider the source-language word "ああ" and the target-language word "a̠a̠": is this a valid covering grammar production in your case?

@kylebgorman
Collaborator Author

> Checking Japanese in your dataset, I see that you have a mapping "あ"->"a̠" but also "ああ"->"a̠ː". Consider the source-language word "ああ" and the target-language word "a̠a̠": is this a valid covering grammar production in your case?

Yes, it would be in that case. You can think of each pair in the mapping as a substitution; the "grammar" is then simply the closure over the union of all the substitutions.

It is a "covering" grammar because we know it's overly permissive, but it's simple enough to be specified as substitution pairs. It's useful in debugging, quality assurance, and the like.
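
To make the "closure over the union of substitutions" reading concrete, here is a minimal Python sketch. It is not WikiPron's actual implementation; the function name `covers` and the `PAIRS` list are hypothetical, and the only data used are the two Japanese pairs quoted above. It assumes no pair maps the empty string to the empty string.

```python
from functools import lru_cache

# Hypothetical substitution pairs: the two Japanese mappings quoted in this thread.
# "a\u0320" is "a" plus the combining minus sign below; "\u02D0" is the IPA length mark.
PAIRS = [("あ", "a\u0320"), ("ああ", "a\u0320\u02D0")]


def covers(source: str, target: str, pairs=PAIRS) -> bool:
    """Return True if target is reachable from source by concatenating substitutions.

    Every pair may be applied any number of times, in any order, which is
    what taking the closure over the union of the pairs amounts to.
    Assumes no pair has both sides empty (that would recurse forever).
    """

    @lru_cache(maxsize=None)
    def match(i: int, j: int) -> bool:
        # Both strings fully consumed: the pair (source, target) is accepted.
        if i == len(source) and j == len(target):
            return True
        # Try every substitution that fits at the current positions.
        return any(
            source.startswith(g, i)
            and target.startswith(p, j)
            and match(i + len(g), j + len(p))
            for g, p in pairs
        )

    return match(0, 0)


print(covers("ああ", "a\u0320a\u0320"))   # two applications of the first pair
print(covers("ああ", "a\u0320\u02D0"))    # one application of the second pair
```

Under these assumptions both calls print `True`: the grammar accepts either segmentation, which is the sense in which it is overly permissive.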

@neurlang

If it's the case that any choice like this is valid, then the idea of covering grammars is different from the one I use (longest prefix match grammar).

In my scenario I always take the longest match, so the source word "ああ" with the target word "a̠a̠" wouldn't be accepted by the grammar machine.

Since our ideas are distinct, you can't simply copy and paste my grammars and call it a day.
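
For contrast, a longest-prefix-match transducer can be sketched as follows. This is only one plausible reading of the approach described in this comment, not the actual grammar machine; `RULES` and `longest_prefix_transcribe` are hypothetical names, and the rule table reuses the two Japanese pairs from earlier in the thread.

```python
from typing import Optional

# Hypothetical rule table built from the two Japanese mappings quoted in this thread.
RULES = {"あ": "a\u0320", "ああ": "a\u0320\u02D0"}


def longest_prefix_transcribe(source: str, rules=RULES) -> Optional[str]:
    """Greedily rewrite source, always consuming the longest matching prefix.

    Returns None if some position has no matching rule.
    """
    max_len = max(map(len, rules))
    out = []
    i = 0
    while i < len(source):
        # Try the longest candidate first, then progressively shorter ones.
        for n in range(min(max_len, len(source) - i), 0, -1):
            chunk = source[i:i + n]
            if chunk in rules:
                out.append(rules[chunk])
                i += n
                break
        else:
            return None  # no rule applies at position i
    return "".join(out)


# Greedy matching picks the two-character rule, so the long vowel is the only output.
print(longest_prefix_transcribe("ああ") == "a\u0320\u02D0")
```

Because this sketch is deterministic, each source word yields at most one transcription, whereas the covering grammar accepts every segmentation that the substitution pairs allow.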
