
Should all TLDs be whitelisted? #19

Open
diasks2 opened this issue Jan 20, 2016 · 1 comment
diasks2 commented Jan 20, 2016

Here is the current list: http://data.iana.org/TLD/tlds-alpha-by-domain.txt

This will allow us to successfully pass the following spec:

it 'knows what is not a domain 1' do
  skip "NOT IMPLEMENTED"
  text = "this is a sentence.and no domain."
  pt = PragmaticTokenizer::Tokenizer.new(text, remove_domains: true)
  expect(pt.tokenize).to eq(
    ["this", "is", "a", "sentence", ".", "and", "no", "domain", "."]
  )
end
@diasks2 diasks2 changed the title Should all TLD domains be whitelisted? Should all TLDs be whitelisted? Jan 20, 2016
maia commented Jan 20, 2016

The longer I think about it, the more downsides I see in using the complete list of TLDs. TLDs like .glass, .global, .google, .green, etc. are more likely to appear as the first word of a new sentence (similar to the spec above) than as part of an actual domain.

What if this list is saved as a constant (similar to abbreviations, stop words, etc.), but with the option of passing an array of TLDs, e.g. remove_domains: ['com', 'net', 'org'], that restricts matching to only those? Then users could define the 3-5 TLDs that cover most of the domains they deal with, and prevent pragmatic_tokenizer from identifying too many non-domains as domains.
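The filtering idea above could be sketched roughly like this (a minimal illustration, not the actual pragmatic_tokenizer implementation; the `domain?` helper and `ALLOWED_TLDS` constant are hypothetical names):

```ruby
# Hypothetical sketch: only treat a token as a domain if its TLD is in a
# user-supplied whitelist, instead of matching against the full IANA list.
ALLOWED_TLDS = %w[com net org].freeze

def domain?(token, tlds = ALLOWED_TLDS)
  # Require a dotted, domain-shaped token, e.g. "example.com"
  return false unless token =~ /\A[a-z0-9\-]+(\.[a-z0-9\-]+)+\z/i
  # Check only the final label against the whitelist
  tlds.include?(token.split('.').last.downcase)
end

tokens = %w[example.com sentence.and example.green]
tokens.reject { |t| domain?(t) }
# => ["sentence.and", "example.green"]
```

With this restriction, "sentence.and" from the spec above is no longer misread as a domain, and ".green"-style TLDs only count when the user opts in.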
