Hi Larry,
Your question is about the "extract keywords" function that Hal uses to omit common words from the all-caps "second line entry" in the Q&A brain database files.
The "extract keywords" function makes a call outside the script to a .dll, so I don't have a copy of the list of words that it omits.
HOWEVER, it isn't necessary to "extract keywords" to create a usable "second line entry." If you merely make the second line alphanumeric-only and all caps (and, of course, put it in the proper position), it will work.
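To illustrate what "alphanumeric-only and all caps" could look like, here is a minimal sketch in Python (not Hal's own script language). The function name make_trigger_line is made up for this example, and the handling of apostrophes and other punctuation is my assumption, not something taken from the .dll:

```python
import re

def make_trigger_line(sentence: str) -> str:
    """Return an all-caps, alphanumeric-only version of a sentence.

    Hypothetical helper: the punctuation rules below are assumptions,
    not a copy of what Hal's .dll does.
    """
    # Assumption: drop apostrophes so "What's" becomes "WHATS".
    no_apostrophes = sentence.replace("'", "")
    # Replace every other non-alphanumeric, non-space character with a space.
    cleaned = re.sub(r"[^A-Za-z0-9\s]", " ", no_apostrophes)
    # Collapse runs of whitespace and convert to upper case.
    return " ".join(cleaned.split()).upper()

print(make_trigger_line("What's the weather like, today?"))
```

Note that no words are removed here; the whole raw sentence is kept, just cleaned up.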
My understanding from Robert Medeksza is that the "extract keywords" function was created so that Hal could evaluate trigger sentences for relevance without false-triggering on extremely common short words such as "the," "a," "an," and so forth. The Q&A brain does a complex calculation for relevance based on the number of matching words, the number of matching words in the correct sequence, and the lengths of the respective sentences.
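I don't know the actual formula, but a toy Python sketch can show how those three ingredients might be combined. Everything here (the names toy_relevance and lcs_length, the equal weighting, and the use of a longest-common-subsequence count for "words in the correct sequence") is my own assumption for illustration, not Hal's real calculation:

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two word lists."""
    prev = [0] * (len(b) + 1)
    for word_a in a:
        curr = [0]
        for j, word_b in enumerate(b, start=1):
            if word_a == word_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def toy_relevance(trigger: str, user: str) -> float:
    """Score from 0.0 to 1.0; an illustrative guess, not Hal's formula."""
    t_words, u_words = trigger.split(), user.split()
    longest = max(len(t_words), len(u_words))
    if longest == 0:
        return 0.0
    # Ingredient 1: how many distinct words the sentences share.
    matches = len(set(t_words) & set(u_words))
    # Ingredient 2: how many shared words appear in the same order.
    in_sequence = lcs_length(t_words, u_words)
    # Ingredient 3: sentence length, used to normalize the score.
    return (matches + in_sequence) / (2 * longest)

print(toy_relevance("THE CAT SAT", "THE CAT SAT"))  # identical sentences
print(toy_relevance("A B C", "C B A"))              # same words, reversed order
```

In this sketch, the reversed sentence scores lower than the identical one even though every word matches, which is the kind of effect that sequence-sensitive scoring is meant to produce.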
My own experimentation suggests that "extract keywords" performs a valuable function when a database is thinly populated. However, when a database becomes larger ("heavily populated"), the concern about false-triggering decreases (because there's usually a decent-relevance sentence somewhere in the database anyway). When the database is heavily populated, the inclusion of minor words could theoretically increase the precision and discrimination of Hal's responses!
There's a long-term possibility that the Q&A brain function could be rewritten to include 100% of the trigger sentence and evaluate all words, maybe even with heavier weighting for nouns and verbs, lighter weighting for adjectives and adverbs, and the least weighting for articles and prepositions. Robert Medeksza has mentioned this, and it sounds very useful, but it also sounds like a big, big programming job!
At present, I believe that the .dll just ignores the words it regards as irrelevant, so the only thing you're sacrificing by putting an entire raw sentence into the all-caps "second trigger line" is a very small penalty in space and speed.
By the way, I've written a number of routines that use the usersentence as the "response" line, and Hal's previous remark (prevsent) as the "trigger line." That way, Hal learns to respond to you the way that you respond to him. When you do that, the "trigger line" may contain completely different words than the response line.
If the routine you're writing takes a text file and "sorts" each sentence into the various topicfocus databases according to its own content, you basically have two choices for the "all-caps trigger second line," which are as follows:
1. Make the second all-caps line the same words as the response sentence itself. This usually results in plausible responses, although they sometimes sound like paraphrases.
2. Make the second all-caps line the same words as the PREVIOUS sentence in the text file. If you can do this, you might cause Hal to utter the "next plausible thought" in a train of thought.
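The two choices above can be sketched in Python as follows. This is only an illustration under my own assumptions: the names to_trigger and build_pairs are hypothetical, the sentences are assumed to be already split out of the text file, and the result is a plain list of (response, trigger) pairs rather than Hal's actual brain file layout:

```python
import re

def to_trigger(sentence: str) -> str:
    """All-caps, alphanumeric-only form of a sentence (assumed cleanup rules)."""
    cleaned = re.sub(r"[^A-Za-z0-9\s]", " ", sentence.replace("'", ""))
    return " ".join(cleaned.split()).upper()

def build_pairs(sentences, use_previous=False):
    """Return (response, trigger) pairs for a list of sentences.

    use_previous=False: choice 1, the trigger is the response's own words.
    use_previous=True:  choice 2, the trigger is the PREVIOUS sentence.
    """
    pairs = []
    for i, response in enumerate(sentences):
        if use_previous:
            if i == 0:
                continue  # the first sentence has no previous sentence
            source = sentences[i - 1]
        else:
            source = response
        pairs.append((response, to_trigger(source)))
    return pairs

text = ["I like tea.", "It keeps me awake."]
print(build_pairs(text))                     # choice 1
print(build_pairs(text, use_previous=True))  # choice 2
```

With choice 2 the first sentence of the file gets no entry, since there is nothing before it to use as a trigger.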
I hope that I've interpreted your question correctly and that this information is useful to you. Also, if I've described anything inaccurately, Robert, please correct me!
Thanks, and have a good day!
Sincerely,
Don