Finding the most frequent n-gram among all words

Question

I came across the following programming interview problem:

Challenge 1: N-grams

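An N-gram is a sequence of N consecutive characters from a given word. For the word "pilot" there are three 3-grams: "pil", "ilo" and "lot". For a given set of words and an n-gram length, your task is to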

• write a function that finds the n-gram that is the most frequent one among all the words
• print the result to the standard output (stdout)
• if there are multiple n-grams having the same maximum frequency please print the one that is the smallest lexicographically (the first one according to the dictionary sorting order)

Note that your function will receive the following arguments:

• text
    ◦ which is a string containing words separated by whitespaces
• ngramLength
    ◦ which is an integer value giving the length of the n-gram

Data constraints

• the length of the text string will not exceed 250,000 characters
• all words are alphanumeric (they contain only English letters a-z, A-Z and numbers 0-9)

Efficiency constraints

• your function is expected to print the result in less than 2 seconds

Example

Input: text: "aaaab a0a baaab c", ngramLength: 3

Output: aaa

Explanation

For the input presented above, the 3-grams sorted by frequency are:

• "aaa" with a frequency of 3
• "aab" with a frequency of 2
• "a0a" with a frequency of 1
• "baa" with a frequency of 1

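If I have only one hour to solve the problem and I choose to solve it in C: is it a good idea to implement a hash table to count the frequency of the N-grams? I ask because the C standard library has no hash table implementation...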

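If so, I was thinking of implementing the hash table using separate chaining with ordered linked lists. Implementing these, though, cuts into the time available to solve the problem....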
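
For reference, a minimal sketch of what such a separate-chaining table might look like in C is given here. It is only an illustration under stated assumptions: the names `ngram_table`, `table_create`, `table_add`, `table_count` and `table_free` are hypothetical, the bucket count is arbitrary, the hash is FNV-1a, and the chains are plain unordered lists rather than the ordered lists mentioned above. Keys are not copied; each node just points into the input text.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NUM_BUCKETS 65536u /* arbitrary size chosen for this sketch */

struct node {
    const char *gram;   /* points into the input text, not a copy */
    size_t count;       /* occurrences seen so far */
    struct node *next;  /* next entry in the same bucket */
};

struct ngram_table {
    struct node *buckets[NUM_BUCKETS];
    size_t n;           /* n-gram length: keys are exactly n bytes */
};

/* 32-bit FNV-1a hash over the first n bytes of s. */
static uint32_t hash_ngram(const char *s, size_t n)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < n; i++) {
        h ^= (unsigned char)s[i];
        h *= 16777619u;
    }
    return h;
}

/* Allocate an empty table for n-grams of length n (NULL on failure). */
struct ngram_table *table_create(size_t n)
{
    struct ngram_table *t = malloc(sizeof *t);
    if (t == NULL)
        return NULL;
    for (size_t i = 0; i < NUM_BUCKETS; i++)
        t->buckets[i] = NULL;
    t->n = n;
    return t;
}

/* Count one occurrence of gram; returns the new count, 0 on failure. */
size_t table_add(struct ngram_table *t, const char *gram)
{
    struct node **slot = &t->buckets[hash_ngram(gram, t->n) % NUM_BUCKETS];
    for (struct node *p = *slot; p != NULL; p = p->next) {
        if (memcmp(p->gram, gram, t->n) == 0)
            return ++p->count;
    }
    struct node *fresh = malloc(sizeof *fresh);
    if (fresh == NULL)
        return 0;
    fresh->gram = gram;
    fresh->count = 1;
    fresh->next = *slot;  /* push onto the front of the chain */
    *slot = fresh;
    return 1;
}

/* Look up how often gram has been added (0 if never). */
size_t table_count(const struct ngram_table *t, const char *gram)
{
    const struct node *p = t->buckets[hash_ngram(gram, t->n) % NUM_BUCKETS];
    for (; p != NULL; p = p->next) {
        if (memcmp(p->gram, gram, t->n) == 0)
            return p->count;
    }
    return 0;
}

/* Release every node and the table itself. */
void table_free(struct ngram_table *t)
{
    for (size_t i = 0; i < NUM_BUCKETS; i++) {
        struct node *p = t->buckets[i];
        while (p != NULL) {
            struct node *next = p->next;
            free(p);
            p = next;
        }
    }
    free(t);
}
```

Finding the answer would then mean walking every bucket, keeping the entry with the highest count and breaking ties with `memcmp`. Even this stripped-down version is a fair amount of code to write and debug within an hour.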

Is this the fastest option?

Thank you!!!

Answer 1

bti*_*lly 5

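If implementation efficiency is what matters and you are using C, I would initialize an array of pointers to the starts of the n-grams in the string, use qsort to sort the pointers according to the n-gram they are part of, and then loop over that sorted array and figure out the counts.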

This should execute fast enough, and there is no need to write any fancy data structures.
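
A rough sketch of how the approach described above could look in C follows. It is one possible reading, not the answerer's own code: the function name `most_frequent_ngram`, the file-scope variable `g_ngram_len` and the output-buffer convention are all assumptions made for illustration. It also assumes "lexicographically smallest" means plain byte order as given by `memcmp` (digits before uppercase before lowercase in ASCII).

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* n-gram length shared with the comparator; qsort's callback takes no
   user-data argument, so a file-scope variable is used in this sketch. */
static size_t g_ngram_len;

/* Order two n-gram start pointers by the n bytes they point at. */
static int compare_ngrams(const void *a, const void *b)
{
    const char *const *pa = a;
    const char *const *pb = b;
    return memcmp(*pa, *pb, g_ngram_len);
}

/* Writes the most frequent n-gram into out (at least n + 1 bytes) and
   returns its frequency; returns 0 and an empty string if no word is
   long enough or allocation fails. */
size_t most_frequent_ngram(const char *text, size_t n, char *out)
{
    size_t len = strlen(text);
    out[0] = '\0';
    if (n == 0 || len < n)
        return 0;

    /* At most len - n + 1 n-grams can start inside the text. */
    const char **starts = malloc((len - n + 1) * sizeof *starts);
    if (starts == NULL)
        return 0;

    /* Record the start of every window of n non-whitespace characters,
       so that n-grams never span two words. */
    size_t count = 0;
    size_t run = 0; /* length of the current run of word characters */
    for (size_t i = 0; i < len; i++) {
        if (isspace((unsigned char)text[i]))
            run = 0;
        else if (++run >= n)
            starts[count++] = text + i + 1 - n;
    }

    g_ngram_len = n;
    qsort(starts, count, sizeof *starts, compare_ngrams);

    /* Equal n-grams are now adjacent. Scanning in ascending order and
       replacing the best only on a strictly larger count keeps the
       smallest n-gram among ties. */
    size_t best = 0;
    for (size_t i = 0; i < count; ) {
        size_t j = i + 1;
        while (j < count && memcmp(starts[i], starts[j], n) == 0)
            j++;
        if (j - i > best) {
            best = j - i;
            memcpy(out, starts[i], n);
            out[n] = '\0';
        }
        i = j;
    }

    free(starts);
    return best;
}
```

For the example from the question, calling `most_frequent_ngram("aaaab a0a baaab c", 3, out)` fills `out` with "aaa" and returns 3; the caller can then print `out` to stdout. The sort dominates the running time, at roughly O(M log M) comparisons of n bytes each for M n-grams, which should be comfortable for a 250,000-character input.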
