AI Zone: chatbots.org

What about UTF-8 support?

Posted: Oct 28, 2011

Viktor Skliar

Member

Total posts: 21

Joined: Oct 25, 2011

Good day!

I am interested in how to add other languages, such as Russian or Spanish, to ChatScript.

What is the best way to do it?

Thank you for any advice.

Posted: Oct 28, 2011

[ # 1 ]

Bruce Wilcox

Moderator

Total posts: 2372

Joined: Jan 12, 2010

I’m not sure.

I’ve avoided other alphabets, having not sorted out how to ensure that multibyte characters are read correctly. The system is not set up for a Unicode representation, but multibyte should be feasible. If one knew the input was always arriving as proper multibyte text, in theory one could just remove the code in ReadALine in textutilities.cpp involving utfbad.
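For readers wondering what "getting multibyte properly" could mean in practice, here is a minimal sketch of a well-formedness check for UTF-8 input. It is not ChatScript code: the name IsValidUtf8 is hypothetical, and nothing here claims to reproduce what the utfbad logic in ReadALine actually does.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of ChatScript: returns true when `text`
// is well-formed UTF-8 (no stray continuation bytes, no truncated or
// overlong sequences, no surrogates, nothing above U+10FFFF).
bool IsValidUtf8(const std::string& text) {
    const unsigned char* p =
        reinterpret_cast<const unsigned char*>(text.data());
    const std::size_t n = text.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = p[i];
        std::size_t extra = 0;   // continuation bytes expected after lead
        unsigned long cp = 0;    // code point being assembled
        unsigned long min = 0;   // smallest code point legal at this length
        if (lead < 0x80)                { extra = 0; cp = lead;        min = 0; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; min = 0x10000; }
        else return false;       // stray continuation or invalid lead byte

        if (extra > n - i - 1) return false;   // sequence runs off the end
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = p[i + k];
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min) return false;                       // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                  // beyond Unicode
        i += extra + 1;
    }
    return true;
}
```

A reader loop could call this on each input line and fall back to rejecting or replacing the line when it returns false, rather than assuming every byte above 0x7F is bad.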

One would also, of course, end up needing a different set of dictionary entries, and would have to shut down POSTag, which is currently built for English-language POS-tagging and parsing. Spelling correction would probably continue to work correctly. And I presume, but don’t know (not knowing those languages), that the code that converts proper names into single words and multi-word text numbers into single numbers would continue working.

Posted: Nov 1, 2011

[ # 2 ]

Erwin van Lun

Senior member

Total posts: 971

Joined: Aug 14, 2006

I’d absolutely recommend working on UTF-8. From a European perspective, with 30 countries each having its own extensions to the standard Latin alphabet, any non-UTF-8 encoding is a nightmare.

We could obviously discuss the future of Europe, but non-English-speaking countries are also becoming increasingly important. The BRIC countries, Brazil, Russia, India (which has dozens of different languages!) and China, all use different alphabets.

If you manage to make ChatScript UTF-8 proof, it can be used all over the world, not only for English.

Posted: Nov 7, 2011

[ # 3 ]

Bruce Wilcox

Moderator

Total posts: 2372

Joined: Jan 12, 2010

The next ChatScript update (1.27) will support UTF-8.

Posted: Nov 7, 2011

[ # 4 ]

Dave Morton

Administrator

Total posts: 3111

Joined: Jun 14, 2010

Way cool, Bruce! Thanks.

Posted: Nov 7, 2011

[ # 5 ]

Andrew Smith

Senior member

Total posts: 473

Joined: Aug 28, 2010

Not sure if this will help, but I believe ICU to be one of the most comprehensive libraries for handling Unicode and internationalisation. I’ve been using it for converting arbitrary text in an unknown encoding into UTF-8.

http://site.icu-project.org/

Posted: Nov 7, 2011

[ # 6 ]

Aliera

Member

Total posts: 20

Joined: Oct 28, 2011

It seems UTF-8 without a BOM works correctly right now. We’ve just removed the code in ReadALine (textutilities.cpp) involving utfbad.

Posted: Nov 7, 2011

[ # 7 ]

Bruce Wilcox

Moderator

Total posts: 2372

Joined: Jan 12, 2010

Thank you, Andrew. And yes, ChatScript works without BOM marks. UTF-8 worked before, but I had trouble testing it, so I suppressed it. I have improved the code and re-enabled it for the 1.27 release.

Posted: Nov 24, 2011

[ # 8 ]

Aliera

Member

Total posts: 20

Joined: Oct 28, 2011

My current version is 1.27, on Linux.
I have a rule that includes two-byte characters:
“u: ( test ) ÄÖÜ”
But this rule isn’t working; the output is generated from another rule. Of course, when I change “ÄÖÜ” to single-byte characters (e.g. “smth”), the rule works.

Posted: Nov 24, 2011

[ # 9 ]

Bruce Wilcox

Moderator

Total posts: 2372

Joined: Jan 12, 2010

This description makes no sense to me. The rule matches based on the pattern (test), and it shouldn’t matter what the output side is. Could you email a sample topic file showing the behavior to gowilcox at gmail.com, so that I can see the full context?

Posted: Nov 29, 2011

[ # 10 ]

Aliera

Member

Total posts: 20

Joined: Oct 28, 2011

Sorry, it was my stupid fault. It’s working correctly with meaningful phrases (not just a set of 2-byte characters).

Posted: Dec 16, 2011

[ # 11 ]

Andreas Drescher

Experienced member

Total posts: 94

Joined: Dec 8, 2011

Hi,

I had a little UTF-8 related conversation in
“problems with ChatScript-tutorial”
beginning on Dec. 8, 2011.

Greetings

Andreas

Posted: Dec 21, 2011

[ # 12 ]

Bruce Wilcox

Moderator

Total posts: 2372

Joined: Jan 12, 2010

So… I’ve been working on UTF-8. What a mess!

I have modified ChatScript to read UTF-8 files and ignore the BOM at the start.
I have modified the script compiler to generate files marked with the BOM at the start,
and fixed a bunch of code that wasn’t ready for multibyte characters.
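As an illustration of the BOM handling described above, a minimal sketch follows. The helper names StripUtf8Bom and AddUtf8Bom are hypothetical and are not taken from the ChatScript source; the only fact assumed is that the UTF-8 byte order mark is the three bytes EF BB BF.

```cpp
#include <string>

// The UTF-8 byte order mark: bytes EF BB BF.
static const char kUtf8Bom[] = "\xEF\xBB\xBF";

// Hypothetical helper: if `line` begins with the UTF-8 BOM, remove it
// and return true; otherwise leave `line` untouched and return false.
// Intended for the first line read from a file.
bool StripUtf8Bom(std::string& line) {
    if (line.compare(0, 3, kUtf8Bom) == 0) {
        line.erase(0, 3);
        return true;
    }
    return false;
}

// Hypothetical helper: return `text` with a BOM in front, as a writer
// that marks its output files as UTF-8 might do.
std::string AddUtf8Bom(const std::string& text) {
    return std::string(kUtf8Bom) + text;
}
```

Calling StripUtf8Bom only on the first line of each file keeps the check cheap, and a file saved without a BOM passes through unchanged.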

At this point, I’d be done, except for ONE LITTLE PROBLEM: taking a string of characters, some of which may be UTF-8, and getting the Visual Studio C++ console window to display them correctly. I tried setting a codepage. I tried converting the string to wide characters. But I haven’t been able to get the console output to display them. The server would be fine, because it would send back UTF-8 characters and the browser or receiver would be responsible for displaying them.

Any ideas?
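One approach that is commonly used for this, sketched here under assumptions rather than as a confirmed fix for ChatScript: on Windows, convert the UTF-8 bytes to UTF-16 with MultiByteToWideChar and write them with WriteConsoleW, which bypasses the console codepage. The function name PrintUtf8 is hypothetical. The console font still has to contain the glyphs, and the Windows branch is only compiled when _WIN32 is defined.

```cpp
#include <cstdio>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: write UTF-8 text to standard output.
// Returns true when all of the text was written.
bool PrintUtf8(const std::string& utf8) {
#ifdef _WIN32
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD mode = 0;
    // GetConsoleMode succeeds only for a real console, not for a
    // redirected file or pipe.
    if (!utf8.empty() && GetConsoleMode(out, &mode)) {
        const int wideLen = MultiByteToWideChar(
            CP_UTF8, 0, utf8.data(), static_cast<int>(utf8.size()),
            nullptr, 0);
        if (wideLen <= 0) return false;   // conversion failed
        std::wstring wide(static_cast<size_t>(wideLen), L'\0');
        MultiByteToWideChar(
            CP_UTF8, 0, utf8.data(), static_cast<int>(utf8.size()),
            &wide[0], wideLen);
        DWORD written = 0;
        return WriteConsoleW(out, wide.data(),
                             static_cast<DWORD>(wideLen),
                             &written, nullptr) != 0;
    }
#endif
    // Redirected output, or a non-Windows terminal: pass the bytes
    // through unchanged and let the receiver decode them.
    return std::fwrite(utf8.data(), 1, utf8.size(), stdout) == utf8.size();
}
```

For example, PrintUtf8("\xC3\x84\xC3\x96\xC3\x9C\n") sends the bytes for ÄÖÜ; whether they render correctly still depends on the terminal and its font.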

Posted: Dec 21, 2011

[ # 13 ]

Jan Bogaerts

Senior member

Total posts: 697

Joined: Aug 5, 2010

A quick search on UTF-8 and the Windows console gave:
http://stackoverflow.com/questions/388490/unicode-characters-in-windows-command-line-how

Posted: Dec 21, 2011

[ # 14 ]

Bruce Wilcox

Moderator

Total posts: 2372

Joined: Jan 12, 2010

Joy is not mine. I went to the Windows command prompt window and told it to type out my simple source file, which has an umlaut character as part of a topic. It printed wrong, of course. Then I tried chcp 1250 and chcp 65001 before a type command; that didn’t help. It still prints out wrong.

Posted: Dec 21, 2011

[ # 15 ]

Dave Morton

Administrator

Total posts: 3111

Joined: Jun 14, 2010

Is it possible, Bruce, that the font used by Windows for the command window doesn’t support UTF-8? I’ve done some testing with the command window on my Win 7 machine, and the default “raster font” doesn’t print all of the UTF-8 characters properly. I found that using the font Lucida Console worked for me, when using the command “copy D:\utf8.txt con”, which displays the contents of the file to the screen. Maybe this will prove useful?
