Can anyone help me? I have the following code:
private List<string> GenerateTerms(string[] docs)
{
    List<string> uniques = new List<string>();
    for (int i = 0; i < docs.Length; i++)
    {
        string[] tokens = docs[i].Split(' ');
        List<string> toktolist = new List<string>(tokens.ToList());
        var query = toktolist.GroupBy(word => word)
            .OrderByDescending(g => g.Count())
            .Select(g => g.Key)
            .Take(20000);
        foreach (string k in query)
        {
            if (!uniques.Contains(k))
                uniques.Add(k);
        }
    }
    return uniques;
}
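It generates terms from multiple documents, keeping the most frequent words of each document. I wrote the same procedure using a Dictionary. Both versions took 440 milliseconds. Surprisingly, though, when I used an ArrayList, as in the code below: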
private ArrayList GenerateTerms(string[] docs)
{
    Dictionary<string, int> yy = new Dictionary<string, int>();
    ArrayList uniques = new ArrayList();
    for (int i = 0; i < docs.Length; i++)
    {
        string[] tokens = docs[i].Split(' ');
        yy.Clear();
        for (int j = 0; j < tokens.Length; j++)
        {
            if (!yy.ContainsKey(tokens[j].ToString()))
                yy.Add(tokens[j].ToString(), 1);
            else
                yy[tokens[j].ToString()]++;
        }
        var sortedDict = (from entry in yy
                          orderby entry.Value descending
                          select entry).Take(20000).ToDictionary
                          (pair => pair.Key, pair => pair.Value);
        foreach (string k in sortedDict.Keys)
        {
            if (!uniques.Contains(k))
                uniques.Add(k);
        }
    }
    return uniques;
}
It took 350 milliseconds. Shouldn't ArrayList be slower than List and Dictionary? Please help me clear up this confusion.
Your code does a lot of unnecessary work and uses inefficient data structures.
Try this:
private List<string> GenerateTerms(string[] docs)
{
    var result = docs
        .SelectMany(doc => doc.Split(' ')
            .GroupBy(word => word)
            .OrderByDescending(g => g.Count())
            .Select(g => g.Key)
            .Take(20000))
        .Distinct()
        .ToList();
    return result;
}
A refactored version that is easier to read:
private List<string> GenerateTerms(string[] docs)
{
    return docs.SelectMany(doc => ProcessDocument(doc)).Distinct().ToList();
}

private IEnumerable<string> ProcessDocument(string doc)
{
    // Same per-document limit as the version above (20000 most frequent words).
    return doc.Split(' ')
        .GroupBy(word => word)
        .OrderByDescending(g => g.Count())
        .Select(g => g.Key)
        .Take(20000);
}
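
On the "inefficient data structures" point: in both versions from the question, uniques.Contains(k) is a linear search, so every candidate term is compared against everything collected so far. Distinct() avoids this by tracking seen items in a hash-based set. If you prefer to keep an explicit loop, the following is a rough sketch of the same idea with a HashSet<string>. The class name TermExtractor, the perDocLimit parameter, and the variable names are my own for illustration and are not part of the code above; I have not benchmarked it against your data.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TermExtractor
{
    // Hypothetical helper (not from the original answer): same per-document
    // ranking as above, but duplicates are filtered with a HashSet instead
    // of the linear List.Contains / ArrayList.Contains check.
    public static List<string> GenerateTerms(string[] docs, int perDocLimit = 20000)
    {
        var seen = new HashSet<string>();   // membership test, O(1) on average
        var uniques = new List<string>();   // keeps first-seen order of terms

        foreach (string doc in docs)
        {
            // Most frequent words of this document, capped at perDocLimit.
            var topTerms = doc.Split(' ')
                .GroupBy(word => word)
                .OrderByDescending(g => g.Count())
                .Select(g => g.Key)
                .Take(perDocLimit);

            foreach (string term in topTerms)
            {
                // HashSet.Add returns false when the term is already present.
                if (seen.Add(term))
                    uniques.Add(term);
            }
        }

        return uniques;
    }
}
```

Calling TermExtractor.GenerateTerms(docs) should return the same terms as the LINQ version, in first-seen order. The only real change is that the duplicate check no longer scans the whole result list for every term.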