How many stopwords for non-English languages?

    #1320794

    zodiac1978
    Member

    Hi!

    We just stumbled upon this:
    http://translate.wordpress.com/projects/wpcom/de/default?filters%5Bstatus%5D=either&filters%5Boriginal_id%5D=51374&filters%5Btranslation_id%5D=1897380

    How many stopwords should be used there? We have found many stopword lists, and they are much longer:
    http://www.phpbar.de/w/Stoppwortliste_deutsch
    http://wortschatz.uni-leipzig.de/Papers/top100de.txt

    Best regards
    Torsten

    The blog I need help with is netztaucher.wordpress.com.

    #1321047

    kathrynwp
    Staff

    Hi Torsten, I’ve passed your question along to the internationalization team and will get back to you as soon as I have some guidance on this. Thanks!

    #1321048

    zodiac1978
    Member

    Thx.

    #1321090

    zodiac1978
    Member

    Seems to be a tough question … ;-)

    #1321091

    kathrynwp
    Staff

    Hi Torsten, while I don’t have a definitive reply, one of our native German speakers said that “the top100de.txt list probably has a lot of duplicates, and it does have some weird words in it, like ‘percent’, ‘million’, ‘Mark’ (Germany’s pre-Euro currency).”

    #1321100

    zodiac1978
    Member

    And now? My question wasn’t “What is your opinion of these lists?”; my question was “How many stopwords are okay?”

    But forget it. Seems to be irrelevant for Automattic.
    http://www.alistapart.com/articles/translation-is-ux/

    #1321101

    rachelmcr
    Staff

    Hi Torsten,

    Looking at this from a translator’s perspective, I don’t think there’s a definitive answer to your question. A lot of translation is about making judgment calls, and in this case there isn’t one objective, correct way to create a list of stop words — you can take a look at how they are discussed on Wikipedia to see what I mean: Stop words

    The issue here is that the list of stop words is used to decide what words aren’t included in searches. As a native speaker, you (and the other German speakers who contribute to GlotPress) are in the best position to determine what stop words will make searches more useful in German. If you think the list of stop words needs to be expanded, I don’t see a problem with that — I think you should add as many stop words as you deem helpful and relevant. Cheers! :)
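
    To make the point about searches concrete, here is a minimal sketch of how a stopword list typically filters a search query. It is an illustration only: the function name `strip_stopwords` and the tiny German word list are assumptions made up for this example, not the list or the code that WordPress.com actually uses.

```python
import re

# Hypothetical, deliberately tiny German stopword list (illustration only).
GERMAN_STOPWORDS = {
    "der", "die", "das", "und", "oder", "in", "mit", "für", "ist", "ein", "eine",
}


def strip_stopwords(query, stopwords=GERMAN_STOPWORDS):
    """Return the query's words, lowercased, with stopwords removed."""
    words = re.findall(r"\w+", query.lower())
    return [w for w in words if w not in stopwords]


print(strip_stopwords("Die Geschichte der Stadt und das Rathaus"))
# prints ['geschichte', 'stadt', 'rathaus']
```

    A longer list removes more words from each query, so every added stopword is a word that searchers can no longer match on.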

    #1321102

    zodiac1978
    Member

    Hi Rachel,

    I already know the definition of stopwords. You don’t have to explain that to me. Thank you.
    The best way to determine the best stopwords would be an analysis of the search terms people actually use. But I don’t even know which search these stopwords are used for. You wrote: “If you think the list of stop words needs to be expanded” – on which data should I rely? I have no data. And you are not able to provide it.

    The quick answer seems to be: “the more the merrier”. And that was so hard to say…?
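
    The kind of analysis of search terms mentioned above could be sketched roughly as follows, assuming a log of search queries were available. The function name `stopword_candidates`, the `min_share` threshold, and the sample queries are all made up for illustration; no real search data is used here.

```python
import re
from collections import Counter


def stopword_candidates(queries, min_share=0.5):
    """Return words that occur in at least min_share of the queries,
    most frequent first."""
    counts = Counter()
    for query in queries:
        # Count each word once per query (document frequency).
        counts.update(set(re.findall(r"\w+", query.lower())))
    threshold = min_share * len(queries)
    return [word for word, n in counts.most_common() if n >= threshold]


# Made-up sample queries, standing in for a real search log.
sample_queries = [
    "die besten rezepte",
    "rezepte für die woche",
    "die geschichte der stadt",
    "urlaub in der stadt",
]

print(stopword_candidates(sample_queries, min_share=0.75))
# prints ['die']
```

    With the lower default threshold, content words such as "rezepte" and "stadt" also pass the cut in this sample, so frequency alone only yields candidates; a native speaker still has to decide which of them are real stopwords.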

The topic ‘How many stopwords for non-English languages?’ is closed to new replies.