
Trying to match a Polish-language pattern, but I can’t.
 
 

Hi, I have a problem that can maybe be solved with a little help from you. I’m trying to match patterns like:

u: (co robić) Some response here.
u: (mieć) Another response here.

But it seems that ChatScript cannot recognize patterns containing Polish national characters, which are: ążźćęłóńś.

My .top files are saved in UTF-8 format using the default Windows Notepad.

I even added some Polish words to a concept:

concept: ~polishwords VERB_PRESENT VERB (robić mieć)

but it didn’t change anything.

What else can be done?

 

 
  [ # 1 ]

Try putting a space after ( and before ). ChatScript likes spaces between elements because it likes to parse fast.

Type
:prepare
in the chat window before typing your text. You can see whether any substitutions are being done and how ChatScript is interpreting your text before matching against the patterns.

Put a simple pattern without special characters in the same topic and verify that your topic and pattern syntax are working.

Good luck!

 

 
  [ # 2 ]

Everything without special characters works fine. Also, when I take input from a text file using :source test.txt, it works even with the special characters. Any ideas why I cannot type my pattern text manually into the console and get it working?

 

 
  [ # 3 ]

My first thought was that Polish, being a Slavic language, was using a Cyrillic character set, but further research seems to indicate that this is not the case. Still, the ‘accented’ characters in Polish may not be directly handled properly in UTF-8 (though I could be wrong about this), which would probably have an effect on the parsing. Beyond that fairly wild guess, I have no clue.

 

 
  [ # 4 ]

But reading a file in UTF-8 using the :source command works perfectly. Maybe input from the console isn’t in UTF-8? How can I change that?
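One way to test the console-encoding theory outside ChatScript is to look at the bytes the same word produces under different encodings. Below is a minimal Python sketch, not part of ChatScript. The helper name encode_or_none is made up for illustration, and cp852 is included only as an example of a legacy code page that does contain Polish letters; which code page a given console actually uses is an assumption to verify.

```python
# Sketch: compare how one Polish word is represented in UTF-8 and in two
# legacy console code pages. encode_or_none is a hypothetical helper.

WORD = "robić"


def encode_or_none(text, codec):
    """Return text encoded with codec, or None if a character has no mapping."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError:
        return None


utf8_bytes = encode_or_none(WORD, "utf-8")    # ć takes two bytes in UTF-8
cp437_bytes = encode_or_none(WORD, "cp437")   # code page 437 has no ć at all
cp852_bytes = encode_or_none(WORD, "cp852")   # one byte per letter, not UTF-8

if __name__ == "__main__":
    print("utf-8:", utf8_bytes)
    print("cp437:", cp437_bytes)
    print("cp852:", cp852_bytes)
```

If the console hands ChatScript bytes in anything other than UTF-8, the typed word no longer equals the UTF-8 text in the .top file. That would be consistent with :source working while typed input fails.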

 

 
  [ # 5 ]

That’s a good question. I think it’s time to see what Bruce has to say about it. First, though, can you give us the ChatScript version that you’re using, as well as which OS/version you’re working in? This may be of use.

 

 
  [ # 6 ]

Windows 10 64-bit, ChatScript 6.8. I hope that Bruce can help me.

 

 
  [ # 7 ]

Actually, I might be able to help here. Take a look at this SUPERUSER.com question for a way to force the Windows 10 console to use UTF-8 by default. WARNING! This method involves using the Windows Registry Editor, and (especially in this instance) could render your computer unbootable, so if you’re not comfortable making changes to the registry, don’t do it!

I think that there might be a way to create a special shortcut that opens the console in such a way that UTF-8 is the default code page, but I haven’t found it yet. Still digging, though.

 

 
  [ # 8 ]

I’ve done a bit more research, and found that the accepted answer on that page may not be the best way to go here. What I’ve found to be effective for me in getting the Windows console to display Unicode characters properly is two-fold:

1.) Make sure that the open console window is using a TrueType font (either Lucida Console or Consolas) by clicking the windows icon in the upper-left corner, selecting Properties, going to the Font tab, and choosing one of the fonts I mentioned.

2.) In the console itself, enter “CHCP 65001” (no quotes) and hit enter. This changes the code page (character encoding) to UTF-8 (the default code page is 437).

From there, you should be able to use ChatScript in the console without a problem. You might be able to automate much of this by creating a shortcut to the console, editing the font within that shortcut, then creating a batch file that changes the code page before calling ChatScript. You could even have the shortcut for the console call this batch file every time it opens. If you need assistance with creating such a shortcut, please let me know, and I can walk you through it.
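To confirm that the code page change from step 2 actually took effect, the console can be asked which code pages it is using. This is a rough Python sketch, assuming Python is installed on the machine. GetConsoleCP and GetConsoleOutputCP are Win32 API functions, while the wrapper name console_code_pages is hypothetical. After CHCP 65001 I would expect both values to be 65001, but treat that as something to check rather than a guarantee.

```python
# Sketch: report the console's input and output code pages on Windows.
# console_code_pages is a hypothetical wrapper around two Win32 calls.
import ctypes
import sys


def console_code_pages():
    """Return (input_cp, output_cp) on Windows, or None on other systems."""
    if sys.platform != "win32":
        return None
    kernel32 = ctypes.windll.kernel32
    return kernel32.GetConsoleCP(), kernel32.GetConsoleOutputCP()


if __name__ == "__main__":
    pages = console_code_pages()
    if pages is None:
        print("Not running on Windows; nothing to check.")
    else:
        print("input code page: %d, output code page: %d" % pages)
        print("UTF-8 active:", pages == (65001, 65001))
```

Run it from the same console window that will start ChatScript, since the code page is a per-console setting.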

 

 
  [ # 9 ]

So I added the patterns to the introductions topic. I made sure to add the words to the topic keywords list as well.

I changed the console app to display the Lucida font. I removed the flags for do_substitute_system, do_spellcheck, and do_parse from $cs_token in the file “simplecontrol.top”. I issued the :build 0 and :build harry commands.

I typed :prepare and then co robić. Here is the result.

alaric: > co robić
TokenControl
DO_SUBSTITUTE_SYSTEM DO_NUMBER_MERGE DO_PROPERNAME_MERGE DO_DATE_MERGE DO_SPELLCHECK DO_INTERJECTION_SPLITTING DO_PARSE


Original User Input
co robic
Tokenized into
co  robic
Spelling changed into
co  robin
Actual used input
co robin(robic

The console app removes the special characters even though they are displayed. ChatScript then still performs a spellcheck or a substitution and matches against the word “robin”. This was after I removed “Robbi” from the dictionary and rebuilt it.

So you will have trouble matching special characters until these issues are resolved.
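To make the first half of that chain concrete, here is a small Python approximation of what the console appears to do to the input. It is only a sketch: it strips combining accents via Unicode decomposition, which is my assumption for illustrating the effect, not the console's actual conversion. The function name strip_accents is hypothetical.

```python
# Sketch: approximate the accent loss seen in the :prepare output by
# decomposing characters and dropping the combining marks.
import unicodedata


def strip_accents(text):
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(strip_accents("co robić"))  # the form that reaches the tokenizer
    print(strip_accents("mieć"))
    print(strip_accents("ł"))         # ł has no decomposition and is unchanged
```

Once the input has become “robic”, the pattern containing “robić” can no longer match, and the spellcheck then moves it further away to “robin”.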

When I add the pattern for “co robin” it matches and I get “Response 2.” returned.

u: ( co robić ) Response 1.
u: ( co robin ) Response 2.

I would like to know the exact steps to turn off spellcheck.

I issued :build german and the :prepare results showed the tokens from simplecontrol.top correctly. I created chatbot “Bob” as a copy of Harry. I issued :build Bob and still saw the incorrect list of tokens assigned to $cs_token in the :prepare results. I copied simplecontrol.top from German to Bob, issued :build Bob, and the incorrect list of tokens still showed for Bob.

I restarted ChatScript and issued :build bob, and the incorrect list of tokens from the $cs_token assignment in simplecontrol.top was still displayed in the :prepare results. It is as if it is ignoring the $cs_token assignment:

$cs_token = #NO_HYPHEN_END | #DO_INTERJECTION_SPLITTING | #DO_SUBSTITUTE_SYSTEM | #DO_NUMBER_MERGE | #DO_DATE_MERGE

This is not just a case of the :prepare results being wrong while the flags are right: the spellcheck really is performed, and “robic” is still matched to “robin”.

Regarding the console issue, it is probably best to set up and run the web interface.

 

 

 

 
  [ # 10 ]

So here is the thing… Bruce usually comments his code for a reason. Editing the simplecontrol.top file is kind of like driving in the snow for me; I don’t like it. It is easy to drive off the road. But as with those road signs that say “Road ends in 400 ft”, it is best to pay attention. So the very first line (not buried somewhere deep but literally the first line) of simplecontrol.top states:

# this function is executed once for every new user chatting with harry 

followed by:

outputmacro: harry()  # you get harry by default
$cs_token = #DO_INTERJECTION_SPLITTING | #DO_SUBSTITUTE_SYSTEM | #DO_NUMBER_MERGE | #DO_DATE_MERGE | #DO_PROPERNAME_MERGE | #DO_SPELLCHECK | #DO_PARSE

So when I log in as Bob or any other user name the $cs_token is reassigned and has the correct tokens.

 

 