Preprocess user input sentence before system tokenization

As Chinese text has no separators between words, I wrote a custom function called ^cn_segment (in private code) to segment Chinese sentences. I can use this function in output to get the segmented text, for example:



I love ChatScript Very Much

(Automatically adds blank space between words.)

My problem is that I want every user input to be passed through the ^cn_segment function before ChatScript tokenization. (ChatScript tokenization cannot handle Chinese at present.)

How can I do that?

I have tried changing the %originalsentence variable, but it does not work.



  [ # 1 ]

Try the %originalinput variable.
You will not have separated sentences, because no tokenization is done.
Note that OOB data is stripped off and has to be processed separately.


  [ # 2 ]

Thanks Tobias La, but it did not work. Maybe my problem was not clear. The following is a more detailed explanation:

What I want to do:

The user can input a sentence without blank-space separators, for example: IloveChatScriptVeryMuch. I want this sentence to match patterns such as ( love ChatScript ).

What I have:

A function called ^cn_segment() for tokenizing the input sentence. For example, ^cn_segment(IloveChatScriptVeryMuch) will output: I love ChatScript Very Much.
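For readers unfamiliar with word segmentation, here is a minimal Python sketch of one common approach, greedy forward maximum matching against a word list. It is only an illustration: the function name segment_max_match and the vocabulary are hypothetical, and nothing in this thread says that ^cn_segment works this way (a real segmenter such as Jieba is considerably more sophisticated).

```python
def segment_max_match(text, vocabulary):
    """Split unspaced text into words by greedy longest match against a word list.

    Hypothetical illustration only; not the poster's ^cn_segment implementation.
    """
    max_len = max(len(word) for word in vocabulary)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a known word is found.
        # A single character is always accepted so the loop cannot get stuck.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in vocabulary or size == 1:
                words.append(candidate)
                i += size
                break
    return " ".join(words)


# Assumed word list for the example sentence used in this thread.
VOCABULARY = {"I", "love", "ChatScript", "Very", "Much"}
segmented = segment_max_match("IloveChatScriptVeryMuch", VOCABULARY)
print(segmented)
```

With this word list the example sentence comes out space-separated, which is the form that a pattern such as ( love ChatScript ) needs.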

My Problem:

How can I use this function to tokenize each input before matching with patterns?

What I have tried:

I have tried to change the value of the %originalsentence and %originalinput variables in the control script, but it does not work.

outputmacro: test()  # you get test by default

$cs_prepass = ~segmentation
$cs_control_main = ~control
#  $cs_control_post = ~XPOSTPROCESS # uncomment to enable talk
$userprompt = ^"%user: >"
$botprompt = ^"Bot: "

table: defaultbot (^name)
^createfact(^name defaultbot defaultbot)

topic: ~segmentation system ()
t: ( _* ) # memorize sentence to pass to Jieba segmentation
%originalinput = ^cn_segment('_0)  # %originalsentence variable has also been tried

topic: ~control system ()


  [ # 3 ]

Ah okay, yes, I misunderstood your problem.

So you basically want to replace or enhance the default tokenization with your own, don't you?

Manipulating the input before tokenization is not possible as far as I know.

However, as a workaround, you could take the input as is, call your function ^cn_segment() to do your own tokenization, and then feed the result back to ChatScript with ^input($$yourFunctionReturn).

Then your control script will be called again as if your function's output were the user input.
In your control script you can then detect the revised input by checking if (%revisedinput) and do whatever you want.

So I guess you want to do something like this:

topic: ~YourControlScript system Keep Repeat ()
u: (_*)
# insert your current control script here
$$tokenizedInput = ^cn_segment(_0)
^input($$tokenizedInput)  # resubmit the segmented text as the new input

This would enhance the default tokenization.
If you want to replace it, use %originalinput instead of _0.

This is one possible workaround that came to mind; maybe someone else knows a more direct solution.
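The two-pass flow described in this post can be modelled outside ChatScript. The Python sketch below is only a rough simulation under stated assumptions: segment() is a hard-coded stand-in for ^cn_segment, the revised flag plays the role of %revisedinput, and the recursive call stands in for ^input. It does not reproduce how the ChatScript engine actually schedules revised input.

```python
def segment(text):
    """Hard-coded stand-in for ^cn_segment (hypothetical; a real segmenter goes here)."""
    known = {"IloveChatScriptVeryMuch": "I love ChatScript Very Much"}
    return known.get(text, text)


def contains_sequence(words, pattern):
    """Return True if the words of pattern appear consecutively in words."""
    n = len(pattern)
    return any(words[i:i + n] == pattern for i in range(len(words) - n + 1))


def control(user_input, revised=False):
    """Mimic a control script that segments the raw input once and re-enters."""
    if not revised:
        # First pass: segment the raw text and resubmit it as the new input.
        return control(segment(user_input), revised=True)
    # Second pass: the input is space-separated, so word patterns can match.
    if contains_sequence(user_input.split(), ["love", "ChatScript"]):
        return "matched ( love ChatScript )"
    return "no match"


print(control("IloveChatScriptVeryMuch"))
```

The point of the flag is that segmentation happens exactly once per user input; the resubmitted text goes straight to pattern matching instead of being segmented again.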


  [ # 4 ]

Thanks very much, Tobias La. Your method works fine for me.


  [ # 5 ]

There is no way to override the tokenization code. I will have to fix that.

