NAME
lsm - Latent Semantic Mapping tool
SYNOPSIS
lsm lsmcommand [commandoptions] mapfile [inputfiles]
DESCRIPTION
The Latent Semantic Mapping framework is a language independent, Unicode based technology that builds maps and uses them to classify texts into one of a number of categories.
lsm is a tool to create, manipulate, test, and dump Latent Semantic Mapping maps. It is designed to provide access to a large subset of the functionality of the Latent Semantic Mapping API, mainly for rapid prototyping and diagnostic purposes, but possibly also for simple shell script based applications of Latent Semantic Mapping.
COMMANDS
lsm provides a variety of commands (lsmcommand in the Synopsis), each of which often has a wealth of options (see COMMAND OPTIONS below). Command names may be abbreviated to unambiguous prefixes.
lsm create mapfile inputfiles
    Create a new LSM map from the specified inputfiles.
lsm update mapfile inputfiles
    Add the specified inputfiles to an existing LSM map.
lsm evaluate mapfile inputfiles
    Classify the specified inputfiles into the categories of the LSM map.
lsm dump mapfile [inputfiles]
    Without inputfiles, dump all words in the map with their counts. With inputfiles, dump, for each file, the words that appear in the map, their counts in the map, and their relative frequencies in the input file.
lsm info mapfile
    Bypass the Latent Semantic Mapping framework to extract and print information about the file and perform a number of consistency checks on it. (NOT IMPLEMENTED YET)
COMMAND OPTIONS
This section describes the commandoptions that are available for the lsm commands. Not all commands support all of these options; each option is only supported for commands where it makes sense. However, when a command has one of these options, you can count on the same meaning for the option as in other commands.
-category-delimiter delimiter
    Specify the delimiter to be used between categories in the inputfiles passed to the create and update commands.
    group
        Categories are separated by a `;' argument.
    file
        Each inputfile represents a separate category. This is the default if the -category-delimiter option is not given.
    line
        Each line represents a separate category.
    string
        Categories are separated by the specified string.
-dimensions dim
    Direct the create and update commands to use the given number of dimensions for computing the map (defaults to the number of categories). This option is useful to manage the size and computational overhead of maps with a large number of categories.
-hash
    Direct the create and update commands to write the map in a format that is not human readable with default file manipulation tools like cat or hexdump. This is useful in applications such as junk mail filtering, where input data may contain naughty words and where the contents of the map may tip off spammers what words to avoid.
-help
    List an overview of the options available for a command. Available for all commands.
-html
    Strip HTML codes from the inputfiles. Useful for mail and web input. Available for the create, update, evaluate, and dump commands.
-junk-mail
    When parsing the input files, apply heuristics to counteract common methods used by spammers to disguise incriminating words, such as:
        Zer0 1nt3rest l0ans    Substituting letters with digits
        W E A L T H            Adding spaces between letters
        m.o.r.t.g.a.g.e        Adding punctuation between letters
    Available for the create, update, evaluate, and dump commands.
-pairs
    If specified with the create command when building the map, store counts for pairs of words as well as the words themselves. This can increase accuracy for certain classes of problems, but will generate unreasonably large maps unless the vocabulary is fairly limited.
-stop-words stopwordfile
    If specified with the create command, stopwordfile is parsed and all words found are excluded from texts evaluated against the map. This is useful for excluding frequent, semantically meaningless words.
-sweep-cutoff threshold
-sweep-frequency days
    Available for the create and update commands. Once every specified number of days (7 by default), scan the map and remove from it any entries that have been in the map for at least 2 previous scans and whose total counts are smaller than threshold. The default threshold is 0, so by default the map is not scanned.
-text-delimiter delimiter
    Specify the delimiter to be used between texts in the inputfiles passed to the create, update, evaluate, and dump commands.
    file
        Each inputfile represents a separate text. This is the default if the -text-delimiter option is not given.
    line
        Each line represents a separate text.
    string
        Texts are separated by the specified string.
-triplets
    If specified with the create command when building the map, store counts for triplets and pairs of words as well as the words themselves. This can increase accuracy for certain classes of problems, but will generate unreasonably large maps unless the vocabulary is fairly limited.
-weight weight
    Scale counts of input words for the create and update commands by the specified weight, which may be a positive or negative floating point number.
EXAMPLES
lsm evaluate -html -junk-mail ~/Library/Mail/LSMMap2 msg*.txt
    Simulate the Mail.app junk mail filter by evaluating the specified files (each assumed to hold the raw text of one mail message) against the user's junk mail map.
lsm dump ~/Library/Mail/LSMMap2
    Dump the words accumulated in the junk mail map and their counts.
lsm create -category-delimiter=group cvsh *.c ';' *.h
    Create an LSM map trained to distinguish C header files from C source files.
lsm update -weight 2.0 -cat=group cvsh ';' ../xy/*.h
    Add some additional header files with an increased weight to the training.
lsm create -help
    List the options available for the lsm create command.
1.0                               2004-08-16                               LSM(1)
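The create and evaluate examples above can be combined into a small shell script. What follows is a minimal sketch and not part of lsm itself: the map name cvsh and the source file patterns are taken from the examples, while the run_lsm helper, the DRY_RUN variable, and the file name unknown.txt are assumptions introduced purely for illustration. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: train a map that separates C sources from C headers, then
# classify one file against it. run_lsm is a hypothetical helper.

run_lsm() {
    # Print the command instead of running it when DRY_RUN is 1 (the
    # default in this sketch) or when lsm is not installed.
    if [ "${DRY_RUN:-1}" = "1" ] || ! command -v lsm >/dev/null 2>&1; then
        echo "lsm $*"
    else
        lsm "$@"
    fi
}

# Two categories separated by a literal ';' argument. The quotes keep the
# shell from treating ';' as a command separator.
run_lsm create -category-delimiter=group cvsh *.c ';' *.h

# Classify a file of unknown type (unknown.txt is a placeholder name).
run_lsm evaluate cvsh unknown.txt
```

Setting DRY_RUN=0 in the environment makes the script invoke lsm for real, provided it is installed.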