Tags: python, nlp, nltk, chatbot, python-3.x
I am building a chatbot in Python. Code:
import nltk
import numpy as np
import random
import string

f = open('/home/hostbooks/ML/stewy/speech/chatbot.txt', 'r', errors='ignore')
raw = f.read()
raw = raw.lower()  # converts to lowercase

sent_tokens = nltk.sent_tokenize(raw)  # converts to list of sentences
word_tokens = nltk.word_tokenize(raw)  # converts to list of words

lemmer = nltk.stem.WordNetLemmatizer()

def LemTokens(tokens):
    return [lemmer.lemmatize(token) for token in tokens]

remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))

GREETING_INPUTS = ("hello", "hi", "greetings", "sup", "what's up", "hey", "hii")
GREETING_RESPONSES = ["hi", "hey", "*nods*", "hi there", "hello", "I am glad! You are talking to me"]

def greeting(sentence):
    for word in sentence.split():
        if word.lower() in GREETING_INPUTS:
            return random.choice(GREETING_RESPONSES)

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def response(user_response):
    robo_response = ''
    sent_tokens.append(user_response)
    TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english')
    tfidf = TfidfVec.fit_transform(sent_tokens)
    vals = cosine_similarity(tfidf[-1], tfidf)
    idx = vals.argsort()[0][-2]
    flat = vals.flatten()
    flat.sort()
    req_tfidf = flat[-2]
    if req_tfidf == 0:
        robo_response = robo_response + "I am sorry! I don't understand you"
        return robo_response
    else:
        robo_response = robo_response + sent_tokens[idx]
        return robo_response

flag = True
print("ROBO: My name is Robo. I will answer your queries about Chatbots. If you want to exit, type Bye!")

while flag == True:
    user_response = input()
    user_response = user_response.lower()
    if user_response != 'bye':
        if user_response == 'thanks' or user_response == 'thank you':
            flag = False
            print("ROBO: You are welcome..")
        else:
            if greeting(user_response) != None:
                print("ROBO: " + greeting(user_response))
            else:
                print("ROBO: ", end="")
                print(response(user_response))
                sent_tokens.remove(user_response)
    else:
        flag = False
        print("ROBO: Bye! take care..")
It works fine, but every exchange produces this warning:
/home/hostbooks/django1/myproject/lib/python3.6/site-packages/sklearn/feature_extraction/text.py:300: UserWarning: Your stop_words may be inconsistent with your preprocessing.
Tokenizing the stop words generated tokens ['ha', 'le', 'u', 'wa'] not in stop_words.
Here are some conversations from the terminal:
ROBO: a chatbot is a piece of software that conducts a conversation via auditory or textual methods.

what is india
/home/hostbooks/django1/myproject/lib/python3.6/site-packages/sklearn/feature_extraction/text.py:300: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['ha', 'le', 'u', 'wa'] not in stop_words. 'stop_words.' % sorted(inconsistent))
ROBO: india's wildlife, which has traditionally been viewed with tolerance in india's culture, is supported in these forests and other protected habitats.

what is a chatbot
/home/hostbooks/django1/myproject/lib/python3.6/site-packages/sklearn/feature_extraction/text.py:300: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['ha', 'le', 'u', 'wa'] not in stop_words. 'stop_words.' % sorted(inconsistent))
ROBO: a chatbot is a piece of software that conducts a conversation via auditory or textual methods.
The reason is that you are using a custom tokenizer together with the default stop_words='english', so when the features are extracted sklearn checks whether stop_words and tokenizer are consistent with each other.
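You can see where the reported tokens come from by running sklearn's built-in English stop word list through the same LemNormalize you pass as the tokenizer. A minimal sketch (it reuses your LemNormalize definition and mirrors the library's check by hand; the inconsistent name is mine):

import nltk
import string
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

lemmer = nltk.stem.WordNetLemmatizer()
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

def LemNormalize(text):
    # same normalizer as in the question
    return [lemmer.lemmatize(token)
            for token in nltk.word_tokenize(text.lower().translate(remove_punct_dict))]

# Collect tokens produced from the stop words that are not themselves stop words
inconsistent = set()
for w in ENGLISH_STOP_WORDS:
    for token in LemNormalize(w):
        if token not in ENGLISH_STOP_WORDS:
            inconsistent.add(token)

print(sorted(inconsistent))

The lemmatizer maps, for example, "has" to "ha" and "was" to "wa", and those outputs are not in the stop word list, which is exactly what the warning message reports.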
If you dig into the code of sklearn/feature_extraction/text.py, you will find this snippet performing the consistency check:
def _check_stop_words_consistency(self, stop_words, preprocess, tokenize):
    """Check if stop words are consistent

    Returns
    -------
    is_consistent : True if stop words are consistent with the preprocessor
                    and tokenizer, False if they are not, None if the check
                    was previously performed, "error" if it could not be
                    performed (e.g. because of the use of a custom
                    preprocessor / tokenizer)
    """
    if id(self.stop_words) == getattr(self, '_stop_words_id', None):
        # Stop words are were previously validated
        return None

    # NB: stop_words is validated, unlike self.stop_words
    try:
        inconsistent = set()
        for w in stop_words or ():
            tokens = list(tokenize(preprocess(w)))
            for token in tokens:
                if token not in stop_words:
                    inconsistent.add(token)
        self._stop_words_id = id(self.stop_words)

        if inconsistent:
            warnings.warn('Your stop_words may be inconsistent with '
                          'your preprocessing. Tokenizing the stop '
                          'words generated tokens %r not in '
                          'stop_words.' % sorted(inconsistent))
As you can see, it issues the warning whenever an inconsistency is found.
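One way to make the two sides consistent, and so silence the warning, is to pass a stop word list that has already been run through the same normalizer. A minimal sketch, reusing your LemNormalize (the processed_stop_words name is mine):

from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

# Normalize sklearn's built-in stop word list with the same function that
# will be used as the tokenizer, so both sides of the check agree.
processed_stop_words = set()
for w in ENGLISH_STOP_WORDS:
    processed_stop_words.update(LemNormalize(w))

TfidfVec = TfidfVectorizer(tokenizer=LemNormalize,
                           stop_words=list(processed_stop_words))

Alternatively, since the check is only a heuristic and the warning does not affect your results here, you can simply suppress it, e.g. with warnings.filterwarnings('ignore', message='Your stop_words may be inconsistent').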
Hope this helps.