This operator works together with the buildTermCorpus operator: buildTermCorpus builds the model, and matchSimilarFromCorpus matches text against the corpus stored in that model and adds the columns that were kept when the corpus was built.
- Click + on the parent node.
- Enter Match Similar From Corpus in the search field, then select the operator from the Results to open the operator form.
- In the Table drop-down, enter or select the name of the table to run this operator on.
- In the Model Name drop-down, enter or select the name of the model.
- In the Column drop-down, enter or select the name of the column that contains the text from which to extract TF-IDF features.
- In the Number of Matches field, enter the number of best matches to return.
- Click Run to view the result.
- Click Save to add the operator to the playbook.
- Click Cancel to discard the operator form.
Uses the processed corpus from buildTermCorpus and a new column of text to compute the cosine similarity between that text and the corpus.
matchSimilarFromCorpus(table: TableReference, modelName: String, column: String, numberOfMatches: Int*)
table (TableReference): Name of the table to run the operator on
modelName (String): Name of the model created by buildTermCorpus
column (String): Name of the column that contains the text from which to extract TF-IDF features
numberOfMatches (Int*): Optional. Number of best matches to return
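The optional numberOfMatches parameter suggests a top-N selection over similarity scores. The following Python sketch shows one way such a selection could work. The function name `top_matches` and the default of 1 are assumptions made for illustration; the operator's actual default and tie-breaking behavior are not documented here.

```python
import heapq


def top_matches(scores, number_of_matches=1):
    """Return up to number_of_matches (index, score) pairs, highest score first.

    `scores` holds one similarity score per corpus row. The default of 1 is an
    assumption for this sketch, not a documented default of the operator.
    """
    if number_of_matches < 1:
        raise ValueError("number_of_matches must be at least 1")
    # nlargest returns the pairs sorted from highest to lowest score.
    return heapq.nlargest(
        number_of_matches, enumerate(scores), key=lambda pair: pair[1]
    )
```

Using a heap avoids sorting the full score list when only a few matches are requested.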
Returns the greatest cosine similarity score, 'lhubcosineSimilarity', computed over the TF-IDF terms of the saved corpus. The score ranges from 0.0 (no match) to 1.0 (perfect match). The output also includes the columns named in the columnsKeep argument at corpus creation, each with the 'lhub' prefix.
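To make the score concrete, here is a minimal Python sketch of TF-IDF cosine matching against a small corpus. It is not the operator's implementation: whitespace tokenization, the smoothed IDF formula, and all function names (`build_corpus`, `tfidf_vector`, `cosine_similarity`, `best_match`) are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf_vector(tokens, idf):
    """Map tokens to a sparse {term: tf * idf} vector, skipping terms not in the corpus."""
    counts = Counter(t for t in tokens if t in idf)
    return {term: count * idf[term] for term, count in counts.items()}


def build_corpus(docs):
    """Compute IDF weights and TF-IDF vectors for whitespace-tokenized documents."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: an assumption, the operator's actual weighting may differ.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = [tfidf_vector(tokens, idf) for tokens in tokenized]
    return idf, vectors


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either vector is empty."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(text, idf, vectors):
    """Return (corpus_index, score) for the corpus row most similar to text."""
    query = tfidf_vector(text.split(), idf)
    scores = [cosine_similarity(query, vec) for vec in vectors]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]
```

Because TF-IDF weights are non-negative, the score stays between 0.0 and 1.0. Text identical to a corpus row scores approximately 1.0, and text that shares no terms with the corpus scores 0.0.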
Table and model name from the buildTermCorpus operator
|h a c d i j b|
|gg aa ff jj c i b|
|k o m p n l q|
// table = inputTable
// model name that was created by the buildTermCorpus operator = "corpusModel"
matchSimilarFromCorpus(inputTable, "corpusModel")
|h a c d i j b||x||more than 0.5|
|gg aa ff jj c i b||y||more than 0.5|
|k o m p n l q||z||apple||more than 0.5|
The label and domain columns come from corpusModel, whose parameters were set to keep the ["label", "domain"] columns; these columns are added to the output based on the matches. lhub_confidence is the confidence score of the best match (for example, the cosine similarity).
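The way kept columns end up in the output can be sketched in Python as follows. This is an illustration only: the function `attach_match`, the dictionary-based rows, and the exact output column names are assumptions, and the sample values are hypothetical.

```python
def attach_match(row, corpus_rows, match_index, score,
                 keep=("label", "domain"), prefix="lhub_"):
    """Copy the kept columns of the matched corpus row onto an input row.

    `prefix` and the "confidence" column name are assumptions for this sketch;
    the operator's real output naming may differ.
    """
    out = dict(row)  # leave the caller's row unmodified
    matched = corpus_rows[match_index]
    for name in keep:
        # A kept column missing from the corpus row becomes None.
        out[prefix + name] = matched.get(name)
    out[prefix + "confidence"] = score
    return out


# Hypothetical corpus row and score, used only to show the output shape.
corpus_rows = [{"text": "k o m p n l q", "label": "z", "domain": "apple"}]
result = attach_match({"text": "k o m p n l q"}, corpus_rows, 0, 0.9)
```

Here `result` carries the original text plus the prefixed label, domain, and confidence values taken from the matched corpus row.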