AI Zone Admin Forum


What about UTF-8 support?

Good day!

I am interested in how to add support for other languages, such as Russian or Spanish, to ChatScript.

What is the best way to do it?

Thank you for any advice.


  [ # 1 ]

I’m not sure.

I’ve avoided other alphabets, not having sorted out how to ensure that multibyte characters are read correctly. The system is not set up for a Unicode representation, but multibyte should be feasible. If one knew the input always arrived as proper multibyte, in theory one could just remove the code in ReadALine in textutilities.cpp involving utfbad.
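For anyone who wants to check that incoming text really is proper multibyte before relaxing that code, here is a minimal sketch of a UTF-8 well-formedness test. IsValidUtf8 is a hypothetical helper name, not a ChatScript function, and this is not the logic ReadALine actually uses; it only illustrates the standard byte-pattern rules (lead byte, continuation bytes, no overlong forms, no surrogates).

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not part of ChatScript): returns true when the bytes
// in `s` form well-formed UTF-8.
bool IsValidUtf8(const std::string& s) {
    // Smallest code point that legitimately needs 1, 2, 3 or 4 bytes.
    static const std::uint32_t kMinForLength[4] = {0x0, 0x80, 0x800, 0x10000};
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;   // continuation bytes expected after the lead
        std::uint32_t cp = 0;    // code point being decoded
        if (lead < 0x80)                { extra = 0; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
        else return false;       // stray continuation byte or invalid lead

        if (extra > n - i - 1) return false;   // sequence cut off at the end
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;   // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < kMinForLength[extra]) return false;      // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                  // beyond Unicode
        i += extra + 1;
    }
    return true;
}
```

A line that passes this check could be handed on unchanged; what to do with a line that fails (reject it, or fall back to the existing handling) is a separate design decision.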

One would also, of course, need a different set of dictionary entries, and would have to shut down POSTag, which is currently built for English-language POS-tagging and parsing. Spelling correction would probably continue to work correctly. And I presume, but don’t know (not knowing those languages), that the code that converts proper names into single words and multi-word text numbers into single numbers would continue working.


  [ # 2 ]

I’d absolutely recommend working on UTF-8. From a European perspective, with some 30 countries each having their own extensions to the standard Latin alphabet, any non-UTF-8 encoding is a nightmare.

We could obviously discuss the future of Europe, but non-English-speaking countries are also getting increasingly popular. The BRIC countries (Brazil, Russia, India, which has dozens of different languages, and China) all use different alphabets.

If you manage to make ChatScript UTF-8 proof, it can be used all over the world, not only for English.


  [ # 3 ]

The next ChatScript update (1.27) will support UTF-8.


  [ # 4 ]

Way cool, Bruce! Thanks.


  [ # 5 ]

Not sure if this will help but I believe it to be one of the most comprehensive libraries for handling unicode and internationalisation. I’ve been using it for converting arbitrary text in an unknown encoding into UTF8.


  [ # 6 ]

It seems UTF-8 without a BOM works correctly right now. We’ve just removed the code in ReadALine (textutilities.cpp) involving utfbad.


  [ # 7 ]

Thank you, Andrew. And yes, ChatScript works without BOM marks. UTF-8 worked before, but I had trouble testing it, so I suppressed it. I have improved the code and re-enabled it for the 1.27 release.


  [ # 8 ]

My current version is 1.27, OS Linux.
I have a rule that includes two-byte characters:
“u: ( test ) ÄÖÜ”
But this rule isn’t working; the output is generated from another rule. Of course, when I change “ÄÖÜ” to single-byte characters (e.g. “smth”), the rule works.


  [ # 9 ]

This description makes no sense to me. The rule matches based on the pattern (test), and it shouldn’t matter what the output side is. Could you email me a sample topic file with the behavior to gowilcox at so that I can see the full context?


  [ # 10 ]

Sorry, it was my stupid fault. It’s working correctly with meaningful phrases (not just a set of 2-byte characters).


  [ # 11 ]


I had a short UTF-8-related conversation in the thread “problems with ChatScript-tutorial”, beginning on Dec. 8, 2011.




  [ # 12 ]

So… been working on UTF8. What a mess!

I have modified ChatScript to read UTF-8 files and ignore the BOM at the start.
I have modified the script compiler to generate files marked with the BOM at the start,
and fixed a bunch of code that wasn’t ready for multibyte characters.
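For readers following along, the BOM handling described here can be sketched roughly as below. The UTF-8 BOM is the three bytes EF BB BF. StripUtf8Bom and WithUtf8Bom are hypothetical names used only for illustration; they are not the functions in the ChatScript source.

```cpp
#include <string>

// The UTF-8 byte order mark: bytes EF BB BF.
static const char kUtf8Bom[] = "\xEF\xBB\xBF";

// Hypothetical helper: returns `text` without a leading UTF-8 BOM, if any.
// Meant to be applied to the first line read from a file.
std::string StripUtf8Bom(const std::string& text) {
    if (text.compare(0, 3, kUtf8Bom) == 0) return text.substr(3);
    return text;
}

// Hypothetical helper: returns `text` with a UTF-8 BOM at the start,
// adding one only when it is not already there.
std::string WithUtf8Bom(const std::string& text) {
    if (text.compare(0, 3, kUtf8Bom) == 0) return text;
    return std::string(kUtf8Bom) + text;
}
```

Stripping only at the very start of a file matters: the same three bytes in the middle of the text are an ordinary (if unusual) character and should be left alone.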

At this point I’d be done, except for ONE LITTLE PROBLEM: taking a string of characters, some of which may be UTF-8, and getting the Visual Studio C++ console window to display them correctly. I tried setting a code page. I tried converting the string to wide characters. But I haven’t been able to get the console output to display them. The server would be fine, because it would send back UTF-8 characters and the browser or receiver would be responsible for displaying them.

Any ideas?


  [ # 13 ]

A quick search on utf-8 and the windows console gave:


  [ # 14 ]

Joy is not mine. I went to the Windows command prompt and told it to type out my simple source file, which has an umlaut character as part of a topic. It printed wrong, of course. Then I tried chcp 1250 and chcp 65001 before the type command; that didn’t help. It still prints out wrong.


  [ # 15 ]

Is it possible, Bruce, that the font used by Windows for the command window doesn’t support UTF-8? I’ve done some testing with the command window on my Win 7 machine, and the default “raster font” doesn’t print all of the UTF-8 characters properly. I found that using the font Lucida Console worked for me, when using the command “copy D:\utf8.txt con”, which displays the contents of the file to the screen. Maybe this will prove useful?
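In case it helps, the in-program equivalent of “chcp 65001” is the Win32 call SetConsoleOutputCP with CP_UTF8. The sketch below is only an assumption about how the two pieces might fit together: EnableUtf8ConsoleOutput and PrintUtf8 are hypothetical names, and the code page alone is probably not enough, since the console font still has to contain the glyphs (Lucida Console in the test above).

```cpp
#include <cstdio>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: asks the Windows console to interpret program output
// as UTF-8 (code page 65001). On other platforms it does nothing and
// reports success.
bool EnableUtf8ConsoleOutput() {
#ifdef _WIN32
    return SetConsoleOutputCP(CP_UTF8) != 0;
#else
    return true;
#endif
}

// Hypothetical helper: writes the UTF-8 bytes to stdout unchanged.
void PrintUtf8(const char* text) {
    std::fputs(text, stdout);
    std::fflush(stdout);
}
```

A program would call EnableUtf8ConsoleOutput() once at startup and then PrintUtf8("\xC3\x84\xC3\x96\xC3\x9C\n") to try the umlauts from post 8. Whether they display correctly still depends on the console font, which would fit the chcp results reported in post 14.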

