AI Zone: chatbots.org

Preprocess user input sentence before system tokenization

Posted: Feb 28, 2017

He Yunchao

Member

Total posts: 4

Joined: Feb 9, 2017

As Chinese text has no separators between words, I wrote a custom function called ^cn_segment (in private code) to segment Chinese sentences. I can use this function in output to get the segmented text, for example:

^cn_segment(IloveChatScriptVeryMuch)

output:

I love ChatScript Very Much

(The function automatically inserts a blank space between words.)
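
For readers who want to see what a segmenter of this kind does, here is a minimal Python sketch. It is not the poster's ^cn_segment (that lives in private code); it assumes a greedy longest-match lookup against a small, hypothetical vocabulary, and the names `segment` and `VOCAB` are made up for illustration.

```python
def segment(text, vocabulary):
    """Insert spaces between words using greedy longest-match lookup."""
    max_len = max(len(word) for word in vocabulary)
    words = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, shrinking until a known word matches.
        for size in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            if candidate in vocabulary:
                break
        else:
            # No vocabulary entry starts here: emit a single character.
            candidate = text[pos]
        words.append(candidate)
        pos += len(candidate)
    return " ".join(words)


# Hypothetical toy vocabulary, chosen only to mirror the example in the post.
VOCAB = {"I", "love", "ChatScript", "Very", "Much"}

if __name__ == "__main__":
    print(segment("IloveChatScriptVeryMuch", VOCAB))
```

A production segmenter would use a real dictionary and a statistical model rather than this toy lookup, but the input and output shapes are the same: one unspaced string in, one space-separated string out.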

My problem is that I want every user input to be passed through the ^cn_segment function before ChatScript tokenization. (ChatScript tokenization cannot handle Chinese at present.)

How can I achieve that?

I have tried changing the %originalsentence variable, but it does not work.

Thanks.

Posted: Feb 28, 2017

[ # 1 ]

Tobias La

Senior member

Total posts: 149

Joined: Dec 17, 2015

Try the %originalinput variable.
You will not have separated sentences, because no tokenization is done.
Note that OOB data is stripped off and has to be processed separately.

Posted: Feb 28, 2017

[ # 2 ]

He Yunchao

Member

Total posts: 4

Joined: Feb 9, 2017

Thanks, Tobias La, but it did not work. Maybe my problem was not clear. Here is a more detailed explanation:

What I want to do:

The user can input a sentence without blank-space separators, for example: IloveChatScriptVeryMuch. I want this sentence to match patterns such as ( love ChatScript ).

What I have:

A function called ^cn_segment() for tokenizing the input sentence. For example, ^cn_segment(IloveChatScriptVeryMuch) will output: I love ChatScript Very Much.

My problem:

How can I use this function to tokenize each input before it is matched against patterns?

What I have tried:

I have tried changing the values of the %originalsentence and %originalinput variables in the control script, but it does not work.

outputmacro: test() # you get test by default
$cs_token = #DO_NUMBER_MERGE | #DO_DATE_MERGE | #DO_PROPERNAME_MERGE

^addtopic(~basic)
$cs_prepass = ~segmentation
$cs_control_main = ~control
# $cs_control_post = ~XPOSTPROCESS # uncomment to enable talk
$userprompt = ^"%user: >"
$botprompt = ^"Bot: "

table: defaultbot (^name)
^createfact(^name defaultbot defaultbot)
DATA:
test

topic: ~segmentation system ()
t: ( _* ) # memorize sentence to pass to Jieba segmentation
%originalinput = ^cn_segment('_0) # %originalsentence variable has also been tried

topic: ~control system ()
.....

Posted: Feb 28, 2017

[ # 3 ]

Tobias La

Senior member

Total posts: 149

Joined: Dec 17, 2015

Ah okay, yes, I misunderstood your problem.

So you basically want to replace or enhance the default tokenization with your own, don’t you?

Manipulating the input before tokenization is not possible, as far as I know.

However, as a workaround you could take the input as is, call your function ^cn_segment() to do your own tokenization, and then feed the result back into ChatScript with ^input($$yourFunctionReturn).

Then your control script will be called again as if your function's result were the user input.
In your control script you can then detect the revised input by checking if(%revisedinput) and do whatever you want.

So I guess you want to do something like this:

topic: ~YourControlScript system Keep Repeat ()
u: (_*)
    if(%revisedinput)
    {
        # insert your current control script here
    }
    else
    {
        $$tokenizedInput = ^cn_segment(_0)
        ^input($$tokenizedInput)
    }

This would enhance the default tokenization.
If you want to replace it, use %originalinput instead of _0.

This is a possible workaround that came to mind; maybe someone else knows a more direct solution.
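
The two-pass flow of this workaround can be modelled outside ChatScript. The Python sketch below is only an illustration of the control logic, not ChatScript code: `control`, `toy_segment`, and `respond` are hypothetical names, the `revised` flag stands in for %revisedinput, and the recursive call stands in for ^input().

```python
def respond(text):
    """Toy stand-in for pattern matching: look for 'love' followed by 'ChatScript'."""
    words = text.split()
    for first, second in zip(words, words[1:]):
        if (first, second) == ("love", "ChatScript"):
            return "I am glad you like it."
    return "I do not understand."


def toy_segment(text):
    """Hypothetical segmenter stub: knows exactly one unspaced sentence."""
    known = {"IloveChatScriptVeryMuch": "I love ChatScript Very Much"}
    return known.get(text, text)


def control(text, revised, segmenter, responder):
    """First pass segments the raw input and resubmits it; second pass answers."""
    if revised:
        # The input has already been segmented: run the normal control logic.
        return responder(text)
    # Raw input: segment it, then re-enter the control flow with the result.
    return control(segmenter(text), True, segmenter, responder)


if __name__ == "__main__":
    print(control("IloveChatScriptVeryMuch", False, toy_segment, respond))
```

The point of the flag is that the segmentation branch runs exactly once per volley; without it, the resubmitted input would be segmented and resubmitted again indefinitely.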

Posted: Feb 28, 2017

[ # 4 ]

He Yunchao

Member

Total posts: 4

Joined: Feb 9, 2017

Thanks very much, Tobias La. Your method works fine for me.

Posted: Feb 28, 2017

[ # 5 ]

Bruce Wilcox

Moderator

Total posts: 2372

Joined: Jan 12, 2010

There is no way to override the tokenization code. I will have to fix that.

Posted: Aug 9, 2017

[ # 6 ]

Anjing Wang

Member

Total posts: 16

Joined: Aug 8, 2017

Is the fix out?

I would also find this extremely useful, as I sometimes need to pass a control flag (such as PREFIX_REMOVE_ME_and_DO_Something) in front of the real user input to CS in order to gain some control over CS. Ideally, CS would strip this control flag, carry out what the flag is meant to control, and then process the real user input.
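
As an illustration of the stripping step described here, the following Python sketch separates a leading control flag from the real user input. It makes assumptions the post does not state: the flag is a single token separated from the rest by a space, and the stripping happens in a client before the text is handed to CS. The function name `split_control_flag` is hypothetical.

```python
def split_control_flag(raw_input, known_flags):
    """Return (flag, remaining_text); flag is None when no known flag leads the input."""
    head, _, rest = raw_input.partition(" ")
    if head in known_flags:
        return head, rest.lstrip()
    return None, raw_input


if __name__ == "__main__":
    flags = {"PREFIX_REMOVE_ME_and_DO_Something"}
    flag, text = split_control_flag(
        "PREFIX_REMOVE_ME_and_DO_Something hello there", flags
    )
    print(flag)
    print(text)
```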

Posted: Aug 21, 2017

[ # 7 ]

Bruce Wilcox

Moderator

Total posts: 2372

Joined: Jan 12, 2010

No fix so far. Not enough priority in my queue.

Posted: Sep 5, 2017

[ # 8 ]

Vlad Vetsh

Experienced member

Total posts: 43

Joined: Apr 9, 2017

Just wanted to say that this thread helped me very much.

It took me a while to understand that %originalinput is probably read-only.