Problem stemming a list of Turkish words with Snowballstemmer in Python

mel*_*lik 1 python turkish nlp list

I am trying to use a library called Snowballstemmer in Python, but it does not seem to work as expected. What could be the reason? Please see my code below.

My dataset:

    df = [['musteri', 'hizmetlerine', 'cabuk', 'baglaniyorum'],
          ['konuda', 'yardımcı', 'oluyorlar', 'islemlerimde']]

I applied the snowballstemmer package and imported TurkishStemmer:

    from snowballstemmer import TurkishStemmer
    turkStem = TurkishStemmer()
    data_words_nostops = [turkStem.stemWord(word) for word in df]
    data_words_nostops

    # Output:
    [['musteri', 'hizmetlerine', 'cabuk', 'baglaniyorum'],
     ['konuda', 'yardımcı', 'oluyorlar', 'islemlerimde']]

Unfortunately, this did not work: the words come back unstemmed. But when I apply it to a single word, it works as expected:

    turkStem.stemWord("islemlerimde")
    'islem'

What could the problem be? Any help would be appreciated.

Thanks.

lin*_*nqo 5

Did you mean to have a list of strings rather than a list of lists that contain strings?
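To see why the original comprehension never reaches the individual words, a quick check of what it iterates over may help. This is only a sketch that reuses the `df` from the question; the name `item_types` is introduced here for illustration.

```python
df = [
    ['musteri', 'hizmetlerine', 'cabuk', 'baglaniyorum'],
    ['konuda', 'yardımcı', 'oluyorlar', 'islemlerimde'],
]

# Iterating over df yields the two inner lists, not the eight words,
# so each `word` in the original comprehension is really a list.
item_types = [type(item).__name__ for item in df]
print(item_types)  # ['list', 'list']
```

Each element handed to `stemWord` in the question is therefore a whole list rather than a single string.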

When I reformatted the code as follows, I was able to get the stem of each word:

    from snowballstemmer import TurkishStemmer

    df = [
        'musteri',
        'hizmetlerine',
        'cabuk',
        'baglaniyorum',
        'konuda',
        'yardımcı',
        'oluyorlar',
        'islemlerimde'
    ]
    turkStem = TurkishStemmer()
    data_words_nostops = [turkStem.stemWord(word) for word in df]
    print(data_words_nostops)

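If you have a list of lists of strings (assuming that is the df you defined) and you want to flatten it into a single list of words, you can do the following: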

    df = [
        ['musteri', 'hizmetlerine', 'cabuk', 'baglaniyorum'],
        ['konuda', 'yardımcı', 'oluyorlar', 'islemlerimde']
    ]
    flattened_df = [item for sublist in df for item in sublist]

    # Output:
    # ['musteri', 'hizmetlerine', 'cabuk', 'baglaniyorum', 'konuda', 'yardımcı', 'oluyorlar', 'islemlerimde']

Credit for the above goes to this StackOverflow post.
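As a side note, the same flattening can probably also be written with `itertools.chain.from_iterable` from the standard library. This is just an alternative sketch, not something the original post used; the variable name `flattened_df` mirrors the snippet above.

```python
from itertools import chain

df = [
    ['musteri', 'hizmetlerine', 'cabuk', 'baglaniyorum'],
    ['konuda', 'yardımcı', 'oluyorlar', 'islemlerimde'],
]

# chain.from_iterable walks each inner list in order and yields its items;
# list() collects them into one flat list of words.
flattened_df = list(chain.from_iterable(df))
print(flattened_df)
```

Both forms flatten exactly one level of nesting, which is all this data needs.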

Alternatively, you can fix the loop so that it handles the original nested layout:

    from snowballstemmer import TurkishStemmer

    df = [
        ['musteri', 'hizmetlerine', 'cabuk', 'baglaniyorum'],
        ['konuda', 'yardımcı', 'oluyorlar', 'islemlerimde']
    ]
    turkStem = TurkishStemmer()
    all_stem_lists = []

    # Outer loop: one group (inner list) of words at a time.
    for word_group in df:
        output_stems = []
        # Inner loop: stem each individual string in the group.
        for word in word_group:
            stem = turkStem.stemWord(word)
            output_stems.append(stem)
        all_stem_lists.append(output_stems)

    print(all_stem_lists)
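The nested loop can also be condensed into a nested list comprehension that keeps the list-of-lists shape. The sketch below assumes a hypothetical helper `stem_nested` that accepts any per-word function, and it uses a stand-in `fake_stem` (simple truncation, not a real stemmer) so that it runs without the library installed.

```python
def stem_nested(word_groups, stem_word):
    """Apply stem_word to every string, keeping the list-of-lists shape."""
    return [[stem_word(word) for word in group] for group in word_groups]


def fake_stem(word):
    # Stand-in for illustration only: NOT a real stemmer, it just
    # truncates each word to its first five characters.
    return word[:5]


df = [
    ['musteri', 'hizmetlerine', 'cabuk', 'baglaniyorum'],
    ['konuda', 'yardımcı', 'oluyorlar', 'islemlerimde'],
]
result = stem_nested(df, fake_stem)
print(result)

# With snowballstemmer installed, the real stemmer method could be passed instead:
# from snowballstemmer import TurkishStemmer
# result = stem_nested(df, TurkishStemmer().stemWord)
```

Passing the stemming function as an argument keeps the traversal logic separate from the library, which also makes it easy to try on small inputs.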