你会使用什么分类技术?

2 c sorting algorithm cpu-word

如果您有65536个随机英语单词,每个单词的长度为1-32,您需要根据字典或外观等级计算外观和排序,您如何构建数据以及您将使用哪种排序技术来最快地处理它?

pax*_*blo 17

严肃地说,65,000字是一个微不足道的排序问题.除非你有重新梳理它很多每分钟的时间,我会建议你只需要使用qsort()内置的语言.毕竟,这就是它的用途.

我建议为alpha命令使用一个简单的char指针数组.为了维护频率顺序,您可以使用如下结构:

typedef struct {
    char *word;      // points to one of the strings.
    int frequency;   // counts the number of occurrences.
} tFreq;
Run Code Online (Sandbox Code Playgroud)

在另一个数组中,每当你创建或修改alpha排序的指针数组时,你可以完全填充这些数组(参见下面我的理由为什么这个看似低效的过程是合适的).

作为速度的一个例子,请考虑以下代码:

#include <stdio.h>
#define MAXWDS 66000

static char *words[MAXWDS];

static int compFn (const void *p1, const void *p2) {
    return strcmp (*((const char **)p1), *((const char **)p2));
}

int main() {
    char *pWord;
    int i, j, sz;
    time_t t0, t1;

    srand (time (0));
    for (i = 0; i < MAXWDS; i++) {
        sz = rand() % 32 + 1;
        pWord = words[i] = malloc (sz + 1);
        for (j = 0; j < sz; j++)
            pWord[j] = 'A' + (rand() % 26);
        pWord[sz] = '\0';
    }

    t0 = time(0);
    qsort (words, MAXWDS, sizeof (char*), compFn);
    t1 = time(0);

    printf ("Time taken for %7d elements was %2d second(s)\n", MAXWDS, t1 - t0);
    return 0;
}
Run Code Online (Sandbox Code Playgroud)

在3GHz双核英特尔芯片上,这里是MAXWDS的几个选择值的输出:

   MAXWDS   Output
---------   ------
   66,000   Time taken for   66000 elements was  0 second(s)
  100,000   Time taken for  100000 elements was  0 second(s)
  500,000   Time taken for  500000 elements was  0 second(s)
  600,000   Time taken for  600000 elements was  1 second(s)
1,000,000   Time taken for 1000000 elements was  1 second(s)
2,000,000   Time taken for 2000000 elements was  2 second(s)
3,000,000   Time taken for 3000000 elements was  5 second(s)
4,000,000   Time taken for 4000000 elements was  7 second(s)
5,000,000   Time taken for 5000000 elements was  9 second(s)
6,000,000   Time taken for 6000000 elements was 10 second(s)
7,000,000   Time taken for 7000000 elements was 11 second(s)
9,999,999   Time taken for 9999999 elements was 21 second(s)
Run Code Online (Sandbox Code Playgroud)

因此,正如您所看到的,qsort对于您正在讨论的数据集大小非常有效.

事实上,如下面的代码所示,实现维护两个排序列表的整个过程,可以准确地显示66,000个元素的无关紧要.基本前提是:

  • 根据需要修改alpha排序的字符串,然后对它们进行完整的alpha排序(t0 to t1).
  • 使用它们被排序的事实很容易将它们转移到另一个数组但每个单词只有一个元素和频率(t1 to t2).
  • 排序那个频率数组(t2 to t3).

以下代码显示了如何完成.唯一有点棘手的是从alpha阵列到频率阵列的转换.

#include <stdio.h>

#define MAXWDS 66000
typedef struct {
    char *word;
    int frequency;
} tFreq;
static char *words[MAXWDS];
static tFreq freq[MAXWDS];
static int numFreq;

static int compFn (const void *p1, const void *p2) {
    return strcmp (*((const char **)p1), *((const char **)p2));
}

static int compFn2 (const void *p1, const void *p2) {
    return ((tFreq*)p2)->frequency - ((tFreq*)p1)->frequency;
}
Run Code Online (Sandbox Code Playgroud)

 

int main() {
    char *pWord;
    int i, j, sz;
    time_t t0, t1, t2, t3;

    // Generate random words.
    srand (time (0));
    for (i = 0; i < MAXWDS; i++) {
        sz = rand() % 32 + 1;
        pWord = words[i] = malloc (sz + 1);
        for (j = 0; j < sz; j++)
            pWord[j] = 'A' + (rand() % 26);
        pWord[sz] = '\0';
    }

    t0 = time(0);

    // Alpha sort.
    qsort (words, MAXWDS, sizeof (char*), compFn);
    t1 = time(0);
Run Code Online (Sandbox Code Playgroud)

 

    // Pre-condition to simplify loop: make first word with zero frequency.

    freq[0].word = words[0];
    freq[0].frequency = 0;

    // Transfer to frequency array.

    for (i = numFreq = 0; i < MAXWDS; i++) {
        // If new word, add it and set frequency to 0.
        if (strcmp (freq[numFreq].word, words[i]) != 0) {
            numFreq++;
            freq[numFreq].word = words[i];
            freq[numFreq].frequency = 0;
        }

        // Increment frequency (for old and new words).
        freq[numFreq].frequency++;
    }
    numFreq++;
    t2 = time(0);

    // Sort frequency array.
    qsort (freq, numFreq, sizeof (tFreq), compFn2);
    t3 = time(0);
Run Code Online (Sandbox Code Playgroud)

 

    // Output stats.
    printf ("Time taken for      sorting %5d elements was %d seconds.\n",
        MAXWDS, t1 - t0);
    printf ("Time taken for transferring %5d elements was %d seconds.\n",
        numFreq, t2 - t1);
    printf ("Time taken for      sorting %5d elements was %d seconds.\n",
        numFreq, t3 - t2);
    printf ("Time taken for   everything %5s          was %d seconds.\n\n",
        "", t3 - t0);
    for (i = 0; i < 28; i++) {
        printf ("[%-15s] %5d\n", freq[i].word, freq[i].frequency);
    }

    return 0;
}
Run Code Online (Sandbox Code Playgroud)

并且66,000个随机字符串的输出是(前几个字符串在那里,所以你可以看到排序工作):

Time taken for      sorting 66000 elements was 0 seconds.
Time taken for transferring 62422 elements was 0 seconds.
Time taken for      sorting 62422 elements was 0 seconds.
Time taken for   everything                was 0 seconds.

[Z              ]   105
[H              ]    97
[X              ]    95
[P              ]    90
[D              ]    89
[K              ]    87
[T              ]    86
[J              ]    85
[G              ]    84
[F              ]    83
[Q              ]    83
[W              ]    81
[V              ]    81
[M              ]    80
[I              ]    79
[O              ]    78
[A              ]    78
[B              ]    75
[U              ]    74
[N              ]    73
[C              ]    73
[S              ]    70
[Y              ]    68
[L              ]    65
[E              ]    60
[R              ]    59
[NQ             ]     8
[XD             ]     8
Run Code Online (Sandbox Code Playgroud)

因此,即使您每次插入或删除一个值都执行这些操作,它们也没有明显的影响(除非显然,如果您每隔几秒钟执行一次以上,那么您会考虑批量处理改变效率).