ClearWSD
ClearWSD is a word sense disambiguation tool for the JVM, with core modules available under an Apache 2.0 license. It provides simple APIs for integration with other libraries, as well as a command-line interface (CLI) for non-programmatic use. It is modular, allowing for alternative implementations of sub-components such as parsers or resources used for feature extraction.
It is meant for use in both research and production settings. Main features include:
- State-of-the-art results in verb sense disambiguation over VerbNet classes
- Automatic optimization of feature subsets and hyperparameters
- Production-ready pre-trained models
- Easy training of new models using the CLI
- 1000+ sense predictions per second on a 2014 MacBook Pro
API
The easiest way to use ClearWSD in your project is through Maven: add the corresponding ClearWSD dependencies to your project's pom.xml.
Releases are distributed through Maven Central.
To try out ClearWSD in your project, you will need to include three modules, the first being clearwsd-core:
<dependency>
    <groupId>io.github.clearwsd</groupId>
    <artifactId>clearwsd-core</artifactId>
    <version>0.12.1</version>
</dependency>
and the second being a parser module, used for pre-processing and feature extraction. A wrapper for the NLP4J dependency parser is provided:
<dependency>
    <groupId>io.github.clearwsd</groupId>
    <artifactId>clearwsd-nlp4j</artifactId>
    <version>0.12.1</version>
</dependency>
Finally, to use pre-trained word sense disambiguation models (compatible with NLP4J), just add the following:
<dependency>
    <groupId>io.github.clearwsd</groupId>
    <artifactId>clearwsd-models</artifactId>
    <version>0.12.1</version>
</dependency>
You can then try out a pre-trained model (from OntoNotes) with the following:
import java.util.List;
import io.github.clearwsd.DefaultSensePredictor;
import io.github.clearwsd.SensePrediction;
import io.github.clearwsd.corpus.ontonotes.OntoNotesSense;
import io.github.clearwsd.parser.Nlp4jDependencyParser;
public class Test {

    public static void main(String[] args) {
        Nlp4jDependencyParser parser = new Nlp4jDependencyParser(); // load dependency parser
        DefaultSensePredictor<OntoNotesSense> wsd = DefaultSensePredictor.loadFromResource(
                "models/nlp4j-ontonotes.bin", parser); // load WSD model

        String sentence = "Mary took the bus to school (which " // 8 --> travel by means of
                + "took about 30 minutes), and studiously " // 3 --> require or necessitate
                + "took notes about the Bolsheviks " // 2 --> light verb usage
                + "taking over the Winter Palace"; // 9 --> claim or conquer, become in control of

        List<String> tokens = parser.tokenize(sentence); // split sentence into tokens

        // display sense predictions and their definitions
        for (SensePrediction<OntoNotesSense> prediction : wsd.predict(tokens)) {
            System.out.println(prediction.sense().getNumber() + " --> " + prediction.sense().getName());
        }
    }
}
Command Line Interface
ClearWSD provides a command-line interface for training, evaluation, and application of word sense disambiguation models.
To build ClearWSD, you will need Java 8 or above and Apache Maven.
On OS X/Linux, you can then build the project for CLI use:
git clone https://github.com/clearwsd/clearwsd.git
cd clearwsd
mvn package -DskipTests -P build-nlp4j-cli
To use the Stanford Parser wrapper module (GPL licensed) instead, use build-stanford-cli:
mvn package -DskipTests -P build-stanford-cli
You can see a help message and available options with the following command (assuming you have already built the project as described above):
java -jar clearwsd-cli-*.jar --help
Usage: WordSenseCLI [options]
Options:
-model, -m
Path to classifier model (for loading or saving)
-input, -i
Path to unlabeled input file for new predictions
-train, -t
Path to training data (required for training)
-valid, -dev, -v
Path to validation data
-cv, -folds
Number of cross-validation folds
Default: 0
-test
Path to test data
--itl, --interactive, --loop
Start an interactive test session on provided model (after training
and/or testing)
Default: false
--om
Output misses on evaluation data in separate files
Default: false
--reparse
Reparse, even if a parsed file of the same name already exists
Default: false
--help, --usage
Display usage
-corpus
Training/evaluation corpus type
Default: Semlink
Possible Values: [Semeval, Semlink]
-dataExt
Extension for training data file (only needed for Semeval XML corpora)
Default: .data.xml
-ext
Parse file extension, appended to input file names to save parses
Default: .dep
-inventory, -inv
Sense inventory
Possible Values: [VerbNet, WordNet, OntoNotes, Counting]
-inventoryPath
Sense inventory path (optional)
-keyExt
Extension for sense key file (only needed for Semeval XML corpora)
Default: .gold.key.txt
-output, -o
Path to output file where predictions on the input file are stored
Training
To train a new model, you must specify the path to a training data file with -train, as well as a path for the resulting saved model, using -model:
java -jar clearwsd-cli-*.jar -train path/to/training/file.txt -model path/to/save/model.bin
The default corpus (Semlink) expects files with an instance per line in the following format:
document_id <space> sentence_id <space> token# <space> lemma <space> sense_label <tab> sentence_text
sentence_text should be a single sentence containing the instance, with tokens separated by spaces:
example.txt 25 3 get comprehend-87.2-1 Oh , I get it .
example.txt 57 2 get get-13.5.1-1 Did you get that part ?
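If you generate or validate training files yourself, it can help to check each line against the format above. The following is a minimal sketch, not part of ClearWSD: the class name SemlinkLine and its fields are hypothetical, and it assumes that the five metadata fields are separated from the sentence by a single tab and that token# is a zero-based index into the space-separated tokens (which is what the two example lines suggest).

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not a ClearWSD class): one parsed line of Semlink-style training data. */
final class SemlinkLine {
    final String documentId;
    final int sentenceId;
    final int tokenIndex;      // assumed zero-based, as the README examples suggest
    final String lemma;
    final String senseLabel;
    final List<String> tokens;

    private SemlinkLine(String documentId, int sentenceId, int tokenIndex,
                        String lemma, String senseLabel, List<String> tokens) {
        this.documentId = documentId;
        this.sentenceId = sentenceId;
        this.tokenIndex = tokenIndex;
        this.lemma = lemma;
        this.senseLabel = senseLabel;
        this.tokens = tokens;
    }

    /** Parses "document_id sentence_id token# lemma sense_label<TAB>sentence_text". */
    static SemlinkLine parse(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("No tab before sentence text: " + line);
        }
        String[] fields = line.substring(0, tab).trim().split(" +");
        if (fields.length != 5) {
            throw new IllegalArgumentException("Expected 5 fields before the tab: " + line);
        }
        List<String> tokens = Arrays.asList(line.substring(tab + 1).trim().split(" +"));
        return new SemlinkLine(fields[0], Integer.parseInt(fields[1]),
                Integer.parseInt(fields[2]), fields[3], fields[4], tokens);
    }

    public static void main(String[] args) {
        SemlinkLine inst = parse("example.txt 25 3 get comprehend-87.2-1\tOh , I get it .");
        // prints "get -> comprehend-87.2-1 (token 3 = get)"
        System.out.println(inst.lemma + " -> " + inst.senseLabel
                + " (token " + inst.tokenIndex + " = " + inst.tokens.get(inst.tokenIndex) + ")");
    }
}
```

Looking up the token at tokenIndex, as main does, is a cheap sanity check that the index and the sentence text line up before training.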
Evaluation
The CLI provides several modes of evaluation/application. You can perform cross-validation, test on a specific dataset, apply a trained model to raw text, or try out a model interactively by typing in test sentences.
Cross Validation
Specify the number of folds with -cv. For example, -cv 5 performs 5-fold cross-validation:
java -jar clearwsd-cli-*.jar -train path/to/training/file.txt -cv 5
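For readers new to the technique, here is a generic sketch of what k-fold cross-validation does with the training instances. It is an illustration only: how ClearWSD itself assigns instances to folds (for example, whether it shuffles or stratifies by sense) is not described here, and the class KFoldSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Generic k-fold illustration; not ClearWSD's implementation. */
final class KFoldSketch {

    /** Assigns item i to fold i % k, so fold sizes differ by at most one. */
    static <T> List<List<T>> split(List<T> items, int k) {
        if (k < 2) {
            throw new IllegalArgumentException("k must be at least 2");
        }
        List<List<T>> folds = new ArrayList<>();
        for (int i = 0; i < k; i++) {
            folds.add(new ArrayList<T>());
        }
        for (int i = 0; i < items.size(); i++) {
            folds.get(i % k).add(items.get(i));
        }
        return folds;
    }

    /** Training data for one round: every fold except the held-out one. */
    static <T> List<T> trainingSet(List<List<T>> folds, int heldOut) {
        List<T> train = new ArrayList<>();
        for (int i = 0; i < folds.size(); i++) {
            if (i != heldOut) {
                train.addAll(folds.get(i));
            }
        }
        return train;
    }

    public static void main(String[] args) {
        List<String> instances = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            instances.add("instance-" + i);
        }
        List<List<String>> folds = split(instances, 5);
        // Each round trains on 4 folds (8 instances) and evaluates on the remaining 2.
        for (int held = 0; held < folds.size(); held++) {
            System.out.println("round " + held + ": train on "
                    + trainingSet(folds, held).size()
                    + ", evaluate on " + folds.get(held).size());
        }
    }
}
```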
Test Dataset
Specify a test file with -test:
java -jar clearwsd-cli-*.jar -test path/to/test/file.txt -model path/to/trained/model.bin
Application
To apply a trained model to new (raw) data, specify a path with -input. Optionally specify an output path with -output:
java -jar clearwsd-cli-*.jar -input path/to/raw/data.txt -output path/to/predictions.txt \
-model clearwsd-models/src/main/resources/models/nlp4j-ontonotes.bin
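If you want to launch this step from another Java program rather than a shell, one option is to assemble the same arguments for ProcessBuilder. The sketch below is an assumption-laden example, not a ClearWSD API: the class WsdCliCommand and the file names in main are hypothetical, and only the -input, -output, and -model flags from the help message are used. Note that ProcessBuilder does not expand shell globs, so the jar path must be the actual file name rather than clearwsd-cli-*.jar.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that assembles a CLI invocation; not part of ClearWSD. */
final class WsdCliCommand {

    /** Builds the argument list; outputPath may be null because -output is optional. */
    static List<String> applyModel(String jarPath, String modelPath,
                                   String inputPath, String outputPath) {
        List<String> cmd = new ArrayList<>(
                Arrays.asList("java", "-jar", jarPath, "-input", inputPath));
        if (outputPath != null) {
            cmd.add("-output");
            cmd.add(outputPath);
        }
        cmd.add("-model");
        cmd.add(modelPath);
        return cmd;
    }

    public static void main(String[] args) {
        // Placeholder paths for illustration only.
        List<String> cmd = applyModel("clearwsd-cli.jar", "model.bin", "raw.txt", "predictions.txt");
        System.out.println(String.join(" ", cmd));
        // To actually run it: new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```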
Interactive Testing
Use --loop or --itl to start an interactive command-line test loop, where you can input sentences and see predictions:
java -jar clearwsd-cli-*.jar --loop -model clearwsd-models/src/main/resources/models/nlp4j-verbnet-3.3.bin
After the parser and model finish loading, you should then be able to enter test sentences and see predicted senses:
Enter test input ("EXIT" to quit).
> please take notes
Please
take[25.2]
notes
> Take the train home.
Take[51.4.3]
the
train
home
> Take on the government
Take[98]
on
the
government
> Take the money out of the vault
Take[13.5.1]
the
money
out
of
the
vault
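If you capture output like the session above and want to post-process it, each line can be split into a token and an optional sense label. This is a rough sketch based only on the token[label] shape shown in the session; AnnotatedToken is a hypothetical class, and the format of files written with -output may differ.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: a token with an optional sense label, as in "Take[51.4.3]". */
final class AnnotatedToken {
    // A token followed by a bracketed label at the end of the line.
    private static final Pattern LABELED = Pattern.compile("(.+)\\[([^\\[\\]]+)\\]");

    final String token;
    final String sense; // null when the line carries no label

    private AnnotatedToken(String token, String sense) {
        this.token = token;
        this.sense = sense;
    }

    /** Parses one output line such as "take[25.2]" or "notes". */
    static AnnotatedToken parse(String line) {
        String text = line.trim();
        Matcher m = LABELED.matcher(text);
        if (m.matches()) {
            return new AnnotatedToken(m.group(1), m.group(2));
        }
        return new AnnotatedToken(text, null);
    }

    public static void main(String[] args) {
        for (String line : new String[] {"Take[13.5.1]", "the", "money"}) {
            AnnotatedToken t = parse(line);
            System.out.println(t.token + (t.sense == null ? "" : " => " + t.sense));
        }
    }
}
```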
License
Please refer to the LICENSE.txt file in each individual module.