Preprocess user input sentence before system tokenization

As Chinese text has no separators between words, I wrote a custom function called ^cn_segment (in private code) to segment Chinese sentences. I can use this function in output to get the segmented text, for example:



I love ChatScript Very Much

(Automatically adds blank space between words.)

My problem is that I want every user input to be passed through the ^cn_segment function before ChatScript tokenization. (ChatScript tokenization cannot handle Chinese at present.)

How can I do that?

I have tried changing the %originalsentence variable, but it does not work.



  [ # 1 ]

Try the %originalinput variable.
You will not have separated sentences, because no tokenization has been done at that point.
Note that OOB data is stripped off and has to be processed separately.


  [ # 2 ]

Thanks Tobias La, but it did not work. Maybe my problem was not clear. The following is a more detailed explanation:

What I want to do:

The user can input a sentence without blank space separators, for example: IloveChatScriptVeryMuch. I want this sentence to match patterns such as ( love ChatScript ).

What I have:

A function called ^cn_segment() for tokenizing the input sentence. For example, ^cn_segment(IloveChatScriptVeryMuch) will output: I love ChatScript Very Much.
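To make the expected behaviour concrete, here is a minimal Python sketch of one way such a segmenter could work. The real ^cn_segment is private code (the script comment further down suggests it relies on Jieba), so this is only a stand-in: it uses greedy forward maximum matching against a small hypothetical vocabulary, and the names segment and VOCAB are assumptions made for illustration.

```python
def segment(text, vocabulary):
    """Greedy forward maximum matching.

    At each position, take the longest vocabulary word that starts there.
    If no word matches, emit a single character so the scan always advances.
    """
    longest = max((len(word) for word in vocabulary), default=1)
    words = []
    i = 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in vocabulary or size == 1:
                words.append(candidate)
                i += size
                break
    return " ".join(words)


# Hypothetical vocabulary covering only the example from the post.
VOCAB = {"I", "love", "ChatScript", "Very", "Much"}

print(segment("IloveChatScriptVeryMuch", VOCAB))  # I love ChatScript Very Much
```

A production segmenter would use a real dictionary and a statistical model rather than pure greedy matching, which can mis-split ambiguous strings.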

My Problem:

How can I use this function to tokenize each input before matching with patterns?

What I have tried:

I have tried changing the value of the %originalsentence and %originalinput variables in the control script, but it does not work.

outputmacro: test()  # you get test by default

$cs_prepass = ~segmentation
$cs_control_main = ~control
#  $cs_control_post = ~XPOSTPROCESS # uncomment to enable talk
$userprompt = ^"%user: >"
$botprompt = ^"Bot: "

table: defaultbot (^name)
^createfact(^name defaultbot defaultbot)

topic: ~segmentation system ()
t: ( _* ) # memorize sentence to pass to Jieba segmentation
%originalinput = ^cn_segment('_0)  # %originalsentence variable has also been tried

topic: ~control system ()


  [ # 3 ]

Ah okay, yes, I misunderstood your problem.

So you basically want to replace or enhance the default tokenization with your own, don't you?

Manipulating the input before tokenization is not possible as far as I know.

However, as a workaround you could take the input as is, call your function ^cn_segment() to do your own tokenization, and then feed the result back to ChatScript with ^input($$yourFunctionReturn)

Then your control script will be called again as if your function's return value were the user input.
In your control script you can then detect the revised input by checking if(%revisedinput) and do whatever you want.

So I guess you want to do something like this:

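topic: ~YourControlScript system Keep Repeat ()
u: (_*)
    if (!%revisedinput) {
        $$tokenizedInput = ^cn_segment(_0)
        ^input($$tokenizedInput)  # resubmit the segmented text as if the user had typed it
        ^fail(SENTENCE)           # stop processing the unsegmented sentence
    }
    # insert your current control script here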

This would enhance the default tokenization.
If you want to replace it, use %originalinput instead of _0.
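The two-pass flow of this workaround can also be modelled outside ChatScript. The Python sketch below is only a simulation under stated assumptions: cn_segment is a hypothetical stand-in for the private function, the revised argument plays the role of %revisedinput, and the recursive call plays the role of ^input. It is not ChatScript engine code.

```python
def cn_segment(text):
    # Hypothetical stand-in for the private ^cn_segment function:
    # it only knows the single example sentence from this thread.
    known = {"IloveChatScriptVeryMuch": "I love ChatScript Very Much"}
    return known.get(text, text)


def control(text, revised=False):
    """Model of the control topic.

    First pass (revised is False): segment the raw input and resubmit it.
    Second pass (revised is True): run ordinary matching on the segmented words.
    """
    if not revised:
        return control(cn_segment(text), revised=True)
    words = text.split()
    for i in range(len(words) - 1):
        if words[i:i + 2] == ["love", "ChatScript"]:
            return "matched ( love ChatScript )"
    return "no match"


print(control("IloveChatScriptVeryMuch"))
```

The point of the revised flag is that the resubmitted text is not segmented a second time, which would otherwise loop forever.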

This is one possible workaround that came to my mind; maybe someone else knows a more direct solution.


  [ # 4 ]

Thanks Tobias La very much. Your method works fine for me. Thanks.


  [ # 5 ]

There is no way to override the tokenization code. I will have to fix that.


  [ # 6 ]

Is the fix out?

I would also find this extremely useful, as I sometimes need to pass a control flag (such as PREFIX_REMOVE_ME_and_DO_Something) in front of the real user input to gain some control over CS. Ideally, CS would strip this control flag, carry out whatever the flag is meant to control, and then process the real user input.
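Until such a hook exists, the stripping could be done by whatever client sends text to CS. The Python sketch below shows the idea under stated assumptions: the flag name is the example from this post, and split_control_flag is a hypothetical helper, not part of ChatScript.

```python
# Example flag name taken from the post; any agreed prefix would do.
CONTROL_FLAG = "PREFIX_REMOVE_ME_and_DO_Something"


def split_control_flag(raw):
    """Return (flag_present, real_user_input) for one raw input line."""
    if raw.startswith(CONTROL_FLAG):
        # Drop the flag and any whitespace separating it from the real input.
        return True, raw[len(CONTROL_FLAG):].lstrip()
    return False, raw


flagged, text = split_control_flag(CONTROL_FLAG + " what is the weather")
print(flagged, text)
```

The caller would act on the flag first and then forward only the remaining text as the user input.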


  [ # 7 ]

No fix so far. It does not have enough priority in my queue.

