Original Sentence before tokenization
 
 

The system variables documentation provides the following:

%originalinput – all sentences user passed into volley, before adjusted in any way
%originalsentence – the current sentence after tokenization but before any adjustments

What I need is a mix of both.

I need the current sentence before it is adjusted in any way, that is, even before tokenization.
Even if I use something like:
$$tmptoken = $cs_token
$cs_token = 0
^retry(SENTENCE)

The output is still tokenized, so:

This is a sentence.

Goes to

This is a sentence .

My goal is that simply concatenating all the sentences gives me an exact copy of %originalinput.

The only way I can think of is to split the input manually at , . ? or ! but that seems unreliable and ugly to me.

 

 

 
  [ # 1 ]

There is currently no such thing as a mixture of both. In order to separate sentences, the tokenizer has to do work to decide where a sentence ends (not all periods end a sentence). While it could probably be coded into the tokenizer, I don’t see it as having universal utility.

 

 
  [ # 2 ]

So there is no way for me to reach my goal?

 

 
  [ # 3 ]

You have access to the raw original input. After that, either you process it yourself however you wish, or tokenization happens. I’m not planning on making tokenization hunt just for the end-of-sentence markers (disambiguating all other periods along the way but not actually adjusting them). You can write a script that bursts the original input into its tokens, find the ones that contain a period, a ? or an !, and write code to handle them however you want. It will be tedious code, but not that difficult.
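
As a rough illustration of the do-it-yourself approach described above, here is a minimal sketch in Python rather than ChatScript. It is not ChatScript's tokenizer and does not use any ChatScript API. The names split_sentences and ABBREVIATIONS are hypothetical, and the abbreviation list is a tiny stand-in for the real disambiguation work. The idea is to walk the whitespace-separated tokens of the raw input, keep every character (including spacing) attached to its token, and close a chunk whenever a token ends in . ? or ! and is not a known abbreviation.

```python
import re

# Hypothetical, deliberately tiny abbreviation list (an assumption for
# this sketch); a real one would need to be far more complete.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "e.g.", "i.e.", "etc."}


def split_sentences(raw):
    """Split raw input into sentence chunks without changing any character.

    Joining the returned chunks reproduces the input exactly.
    """
    if not raw.strip():
        # Empty or whitespace-only input: nothing to split.
        return [raw] if raw else []

    chunks = []
    current = ""
    # Each piece is one token together with the whitespace around it,
    # so the pieces cover the whole input with no gaps.
    for piece in re.findall(r"\s*\S+\s*", raw):
        current += piece
        token = piece.strip()
        ends_sentence = (
            token[-1] in ".?!" and token.lower() not in ABBREVIATIONS
        )
        if ends_sentence:
            chunks.append(current)
            current = ""
    if current:
        # Trailing text with no end-of-sentence marker.
        chunks.append(current)
    return chunks


raw = "This is a sentence. Is Dr. Smith here?  Yes!"
sentences = split_sentences(raw)
assert "".join(sentences) == raw
```

Because each chunk keeps its original spacing and punctuation, appending the chunks gives back the raw input unchanged. The sketch ignores cases such as a closing quote after the final period or abbreviations missing from the list, which is where the tedious part of the work lies.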

 

 