怎么让KMP算法支持汉子检索？

suuse · 发表于 2009-1-8 17:18:04

/*
* =====================================================================================
*
*    Filename:  kmp1.c
*
* Description:  another kmp sample
*
*       Version:  1.0
*       Created:  2009年01月08日 11时40分51秒
*    Revision:  none
*    Compiler:  gcc
*
*       Author:  Dr. Fritz Mehner (mn), mehner@fh-swf.de
*       Company:  FH Südwestfalen, Iserlohn
*
* =====================================================================================
*/

#include <assert.h>
#include <stdio.h>
#include <time.h>
#include <stdio.h>
#define SEARCH_TIME 10000
/* The file contain words to search. */
#ifdef WINDOWS
#define FILE_PATH "C:\\search.txt"
#else
#define FILE_PATH "/home/xxx/search.txt"
/* The file contain words to search. */
#endif

char *
read_text_file ()
{
FILE *f;
char *text ,*text1;

text = (char *)malloc(100);
text1 = (char *)malloc(4096);
f = fopen(FILE_PATH, "rt");
while (fgets(text,100,f))
   {
text1 = strcat(text1,text);
   }
      printf("%s\n",text1);
fclose(f);
return text ;
}

void kmp_init(const char *patn, int len, char *next)
{
      int i, j;

      assert(patn != NULL && len > 0 && next != NULL);
      next[0] = 0;
      for (i = 1, j = 0; i < len; i ++) {
            while (j > 0 && patn[j] != patn)
                     j = next[j - 1];
            if (patn[j] == patn)
                     j += 1;
            next = j;
      }
}

int kmp_find(const char *text, int text_len, const char *patn,
            int patn_len, char *next)
{
      int i, j;

      assert(text != NULL && text_len > 0 && patn != NULL && patn_len > 0
                     && next != NULL);
      for (i = 0, j = 0; i < text_len; i ++ ) {
            while (j > 0 && text != patn[j])
                     j = next[j - 1];
            if (text == patn[j])
                     j += 1;
            if (j == patn_len)
                     return i + 1 - patn_len;
      }

      return -1;
}

int main(int argc, char *argv[])
{
      char *next;
      int i, pos, len = strlen(argv[1]);
time_t timep;
int seconds, seconds1;
char *text;

text = read_text_file();

      if (argc < 2) {
            printf("Usage: %s pattern\n", argv[0]);
            return 1;
      }

seconds = time((time_t*)NULL);

for(int x = 0; x < SEARCH_TIME ; x++){
      next = calloc(strlen(argv[1]), sizeof(int));
      kmp_init(argv[1], strlen(argv[1]), next);
      printf("next array:\n");
      for (i = 0; i < len; i ++)
            printf("\t%c", argv[1]);
      printf("\n");
      for (i = 0; i < len; i ++)
            printf("\t%d", next);
      printf("\n");

      pos = kmp_find(text, strlen(text), argv[1], strlen(argv[1]), next);
      printf("find result:\n");
      if (pos < 0) {
            printf("None found!\n");
      } else {
            printf("%s\n", argv[1]);
            for (i = 0; i < pos; i ++)
                     printf(" ");
            printf("^\n");
      }
}/* for */

seconds1 = time((time_t*)NULL);

printf("Here are the result of ..\n");
printf("time when start search : %d\n" , seconds);
printf("time when serch end : %d\n", seconds1);
printf("seconds elapsed : %d\n", seconds1-seconds);

return 0;

}

上面是挤出来的，很大部分抄了网上的。现在还不能解决汉字检索问题。代码写的很烂。我是C菜鸟，现在是想用C实现这个汉子检索的算法。帮下忙。有没有高手知道比这个更快的，推荐下，谢谢！

sunwt · 发表于 2009-1-11 21:39:58

kmp使用字节匹配，汉字也是表示成字节的，天生就支持啊，楼主什么意思？

littlepig · 发表于 2009-1-13 22:41:46

Post by sunwt;1937087
kmp使用字节匹配，汉字也是表示成字节的，天生就支持啊，楼主什么意思？

字节匹配并不意味这支持汉字，楼上再斟酌斟酌。

楼主的问题有难度，涉及编码问题。如果是汉字是用固定字节数来编码，那就简单，只要将指针移动一个字节改为那个固定的字节书，如果非固定字节数，就很难说了。这个要熟悉编码的的朋友出来澄清一下。

sunwt · 发表于 2009-1-14 09:53:55

楼上给我个提示？不明白为什么不能用字节匹配。

kmp是搜索整个短句的，但我们搜索英文来说按照字节（就是字母）分块，利用技巧加快匹配速度。拆成字母失去了它在短句中的意义。
同样，汉语搜索可以拆成单个汉字，也可以拆成编码的字节。

只要pattern是汉语，kmp匹配成功的肯定也是汉语，你说呢？

littlepig · 发表于 2009-1-14 12:11:53

假设,仅仅是假设,汉字``一,二,三''的编码分别为四个字节1111,2222,1122,那么单词:一二,应该是11112222,如果实现KMP的时候按字节匹配,将出现``一二''包含``三''的结果.是这意思不? 所以实现汉字或其它多字节编码时,不能按字节匹配.当然KMP还是适用，只是做比较的方式要改变少许，字节对字节的比较应该是不工作的。

sunwt · 发表于 2009-1-26 13:36:37

Yes, you are right. But this problem would not exist if you use utf8 encoding.

sorry for my scim not working after re-emerged

littlepig · 发表于 2009-1-27 00:33:18

Post by sunwt;1941520
Yes, you are right. But this problem would not exist if you use utf8 encoding.

sorry for my scim not working after re-emerged

嘿嘿，巧了！正好昨晚有空查了一些汉字的码表，的确，如果采用utf8，不会出现我以上描述的那种情形，于是字节对字节的KMP实现没有一点问题，我描述的那种情形目前来看只是逻辑上可能存在问题，但是实际上会不会发生，不知道。

花了点时间按照<<Introduction to algorithms>>中的伪代码实现了一个。凑合能转;) 我完全按照字节对字节的匹配，所以最后输出的匹配位置用的也是字节，并非汉字或者英文的字符数，检验的时候麻烦自己算清楚匹配的位置(而且调试过程中打印pi[]数组还是有一点点让人困惑的地方，不过不影响最后结果) 。

P.S 写算法导论的那帮大牛没有遵从另外一个大牛的劝告：numbering should start at zero. 伪代码中所有的计数都从1开始，在附件的代码中被改成了C惯例的从0开始。

P.S 这一段烂代码，爱用就用，无须顾忌，当然也没有任何保证。

P.S 建议楼主认真看算法导论中string matching那一章，KMP那一节，尤其是数学证明部分，理解了数学证明再看算法实现就会感觉非常简明直接。我N年前写过一个乱七八糟的中文笔记，如果需要，PM我，给你发一份，就不贴在这里丢人现眼了;)

		自动登录	找回密码
密码			注册

怎么让KMP算法 支持汉子检索？

本帖子中包含更多资源

怎么让KMP算法支持汉子检索？