Split sentence into words and non-white characters for POS Tagging

Question

This was the question I got from an onsite interview with a tech firm, and one that I think ultimately killed my chances.

You're given a sentence, and a dictionary that has words as keys and parts of speech as values.

The goal is to write a function that, given a sentence, replaces each word with its part of speech from the dictionary, preserving order. We can assume that every token in the sentence is present as a key in the dictionary.

For instance, let's assume that we're given the following inputs:

sentence = 'I am done; Look at that, cat!'

dictionary = {'!': 'sentinel', ',': 'sentinel', ';': 'sentinel',
              'I': 'pronoun', 'am': 'verb', 'done': 'verb',
              'Look': 'verb', 'at': 'preposition',
              'that': 'pronoun', 'cat': 'noun'}

output = 'pronoun verb verb sentinel verb preposition pronoun sentinel noun sentinel'

The tricky part was catching the sentinels (punctuation marks). If the sentence contained no sentinels, this could be done easily, for example by splitting on whitespace. Is there an easy way of doing it? Any library?

Answer 1

Div*_*ava 6

Python's regular expression module, re, can be used to split the sentence into tokens.

import re

sentence = 'I am done; Look at that, cat!'

dictionary = {'!': 'sentinel', ',': 'sentinel', ';': 'sentinel',
              'I': 'pronoun', 'am': 'verb', 'done': 'verb',
              'Look': 'verb', 'at': 'preposition',
              'that': 'pronoun', 'cat': 'noun'}

# Match a run of letters, or else any single non-whitespace character.
tags = []
for word in re.findall(r"[A-Za-z]+|\S", sentence):
    tags.append(dictionary[word])

print(' '.join(tags))

Output

pronoun verb verb sentinel verb preposition pronoun sentinel noun sentinel

The regular expression [A-Za-z]+|\S matches either a run of one or more ASCII letters, uppercase or lowercase ([A-Za-z]+), or, via | (alternation), any single non-whitespace character (\S). Because the letter alternative is tried first, words are kept whole, while each punctuation mark becomes its own token.
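To see what the pattern actually produces, here is a small sketch that compares it with plain whitespace splitting and wraps the lookup in a function, as the question asked for. The names tokenize, pos_tags, and TOKEN_PATTERN, and the 'unknown' fallback label, are my own assumptions for illustration and not part of the original answer.

```python
import re

# Same pattern as in the answer: a run of ASCII letters, or any single
# non-whitespace character (which is what catches punctuation).
TOKEN_PATTERN = re.compile(r"[A-Za-z]+|\S")


def tokenize(sentence):
    """Return words and punctuation marks as separate tokens, in order."""
    return TOKEN_PATTERN.findall(sentence)


def pos_tags(sentence, dictionary, unknown="unknown"):
    """Map each token to its tag; `unknown` is a hypothetical fallback label."""
    return " ".join(dictionary.get(token, unknown) for token in tokenize(sentence))


sentence = "I am done; Look at that, cat!"

# Plain whitespace splitting leaves punctuation glued to the words.
print(sentence.split())
# ['I', 'am', 'done;', 'Look', 'at', 'that,', 'cat!']

# The regex emits each punctuation mark as its own token.
print(tokenize(sentence))
# ['I', 'am', 'done', ';', 'Look', 'at', 'that', ',', 'cat', '!']
```

Note that [A-Za-z]+ only covers ASCII letters, so a word such as don't would come out as three tokens (don, ', t), and words containing digits or accented letters would be broken up as well. If the input may contain those, the letter class would probably need to be widened to suit the data.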

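This is a [regular expression](https://www.regexbuddy.com/regex.html). A regular expression is a special text string for describing a search pattern. You can think of regular expressions as wildcards on steroids. (2 upvotes)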
