Unicode support/Chinese character
 
 

I was testing a pattern with a Chinese character but got a message I don’t understand:
=====================================================
:testpattern (我) 我

  in authorizedIP.txt at 1: (我) 我
  entering strange word to dictionary 我
Failed
=====================================================

I also tried to put this one character pattern in a topic file but could not get a match.

Any idea?  Thanks

 

 
  [ # 1 ]

At the present time, foreign language support for ChatScript is, at best, spotty. With multibyte characters, such as many Asian languages use (Chinese included), it’s simply unavailable. Bruce and I both have looked into this issue without much success (unless he’s recently found a solution). Bruce is certainly aware of the issue, but I’m not sure where it sits on his priority list.

For what it’s worth, the vast majority of chatbot engines have problems with languages that their developers are not at least passingly familiar with or do not work with regularly. I’d love to get in contact with a C++ or PHP programmer whose native and “working” language is Chinese, and try to learn from them how to work within the character sets they use. I’m sure I would learn a lot, regardless of whether I learn their language or not. :)

By the way, I humbly suggest that you not consider this to be the “final answer”, since my involvement with ChatScript’s development is minimal, at best. I’m just giving you my take on the issue here.

 

 
  [ # 2 ]

Thanks for your quick reply.  In the past I was able to successfully implement a Chinese chatbot using AIML Program D.  Actually the requirement for supporting a language like Chinese is quite simple.  Essentially, I just separate Chinese characters with spaces and let the system do Unicode character matching.  In a way, it is much easier than English because you don’t have to deal with canonical forms, spell checking, or even grammar parsing - just pure, simple Unicode matching. ChatScript can display Chinese characters just fine - I am just having a problem with simple pattern matching.
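
For readers who want to try the spacing approach described above, here is a minimal Python sketch of the preprocessing step. It is not part of ChatScript or Program D; the function names `is_cjk` and `space_out_cjk` are made up for illustration, and the check assumes that the main CJK Unified Ideographs block (U+4E00 to U+9FFF) covers the characters you care about.

```python
def is_cjk(ch: str) -> bool:
    """True for characters in the main CJK Unified Ideographs block.

    Assumption: U+4E00..U+9FFF is enough for the input at hand; extension
    blocks and CJK punctuation are deliberately ignored in this sketch.
    """
    return "\u4e00" <= ch <= "\u9fff"


def space_out_cjk(text: str) -> str:
    """Surround every CJK character with spaces, then collapse extra whitespace."""
    pieces = []
    for ch in text:
        pieces.append(f" {ch} " if is_cjk(ch) else ch)
    # split() with no arguments drops leading/trailing and repeated whitespace
    return " ".join("".join(pieces).split())


print(space_out_cjk("我喜欢ChatScript"))  # 我 喜 欢 ChatScript
```

The same function would be applied both to the pattern text and to the user input, so that each character ends up as its own one-character "word" on both sides.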

 

 
  [ # 3 ]

Actually, the only KNOWN problem ChatScript has is that the console window of Visual C++ can’t display multibyte characters, so proving the system actually works is hard.  Currently the dictionary declines to store them, but I can turn that off and allow them to go in. It would probably allow you to proceed.  Or if you can compile ChatScript yourself, go to dictionarySystem.cpp, to the WORDP StoreWord function, and just comment out the code that gives you that message and exits…  then tell me where you hang up next.

 

 
  [ # 4 ]

Hey, amazingly it worked!  After commenting out the message in dictionarySystem.cpp I was able to get Chinese character patterns to match in console mode on Mac OS X.  The trick is to separate each Chinese character (which is really a single word in Chinese) with a space, in both the pattern and the input text.  ChatScript then treats each Chinese character as a separate one-character word and the pattern matches!

 

 
  [ # 5 ]

GREAT!  But is separating the characters required?  Wouldn’t one sometimes normally combine them?

 

 
  [ # 6 ]

One of the most annoying issues with a Chinese sentence is that you don’t know where a multi-character word ends.  In English (and most alphabetic languages) you have spaces to mark word boundaries.  In Chinese, a word may be single or multi-character.  If you interpret the word boundaries incorrectly, you may get a totally different meaning.  To avoid the whole headache, in the past I just inserted spaces between characters.  This works great most of the time.  However, I think it’s going to be problematic with the ~concept in ChatScript.  I will have to think about this in more depth.
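
To make the boundary problem concrete, here is a small Python sketch that lists every way a string can be split into words from a given word list. It is only an illustration, not how ChatScript or any real segmenter works; the function name `segmentations` and the four-word lexicon are assumptions chosen for the example.

```python
def segmentations(text, lexicon):
    """Return every way to split text into consecutive words found in lexicon."""
    if not text:
        return [[]]  # exactly one way to segment the empty string
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            for rest in segmentations(text[end:], lexicon):
                results.append([head] + rest)
    return results


# Hypothetical mini-lexicon: 研究 (research), 研究生 (graduate student),
# 生命 (life), 命 (fate)
lexicon = {"研究", "研究生", "生命", "命"}

for words in segmentations("研究生命", lexicon):
    print(" / ".join(words))
# 研究 / 生命
# 研究生 / 命
```

The same four characters read as "research / life" or as "graduate student / fate" depending on where the boundary falls, which is the ambiguity that per-character spacing sidesteps.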

 

 
  [ # 7 ]

Depends on how long the string of characters needs to be. Concepts already take sequences of words in double quotes (phrases) up to 4 long. One could extend the length of phrases allowed if one needed to. Of course there is the issue of where a phrase starts: a Chinese phrase “a b c d” might overlap input that was a mix of single- and multi-character words, and so match accidentally.
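
The accidental-match concern can be shown with a few lines of Python. This is a plain sliding-window comparison written for illustration, not ChatScript's actual matcher; `find_phrase` is a hypothetical helper, and the input is assumed to have already been spaced out one character per token.

```python
def find_phrase(tokens, phrase):
    """Return the start indexes where phrase occurs as a contiguous run in tokens."""
    n = len(phrase)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]


# Input: 发展中国家 ("developing country"), one token per character
tokens = "发 展 中 国 家".split()

# Phrase intended to mean the word 中国 ("China")
phrase = "中 国".split()

print(find_phrase(tokens, phrase))  # [2]
```

The phrase matches at index 2 even though the input never mentions China as a word: the two characters simply sit next to each other across a word boundary.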

 

 
  [ # 8 ]

I was trying to pass Chinese input via char* to the performchat API (v 6.5b),

but it always failed to match any of the rules I wrote in my .top file until I found this post.

Inserting spaces between adjacent Chinese characters fixed my issue.

 

 