Prerequisites
Instructions
tar -xvzf hide-1.0.tar.gz
cd path/to/hide-1.0/lib
tar -xvzf mallet-2.0-RC2.tar.gz
ln -s mallet-2.0-RC2 mallet
cd ./mallet
ant
Optional
An annotation tool can be used to annotate files and generate training data. We currently recommend Callisto.
cd /path/to/hide-1.0/
TRAIN - Training a new mark-up/de-identification model
HIDE-train.pl crfmodelfile [list of files]
The list of files is the set of marked-up SGML files on which you wish to train the CRF classifier.
MARK-UP - Using a mark-up/de-identification model to mark-up new files
HIDE-markup.pl crfmodelfile outputdir [list of files]
The list of files is the set of files you wish to mark up with SGML tags using the trained CRF model.
DE-ID
(from original text files)
HIDE crfmodelfile outputdir deidconfigfile [list of files]
The list of files is the set of files you wish to de-identify; the results are placed in the outputdir directory.
Remember to modify the deidconfigfile to fit the tags that you have used to mark up your data.
(from marked up files)
./lib/HIDE-DEID.pl deidconfigfile outputdir [list of files]
The list of files is the set of marked-up files whose SGML tags are to be replaced according to the configuration in the deidconfigfile; the results are placed in the outputdir directory.
Remember to modify the deidconfigfile to fit the tags that you have used to mark up your data.
The input and output of HIDE are entirely in SGML format. HIDE uses only the main tag name as the label for the enclosed word or phrase. SGML attributes are not supported, so <name type="sometype"> is interpreted the same as <name>.
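The labeling rule described above can be illustrated with a short Python sketch. This is not HIDE's own parser (HIDE is written in Perl); the function name labeled_spans and the sample sentence are hypothetical, and the pattern assumes simple, non-nested tags.

```python
import re

# Capture only the tag name of an opening tag; anything after the name
# (attributes such as type="...") is matched but discarded.
# Assumption: tags are simple and not nested inside each other.
TAGGED_SPAN = re.compile(r"<([A-Za-z][\w-]*)\b[^>]*>(.*?)</\1\s*>", re.DOTALL)

def labeled_spans(text):
    """Return (label, phrase) pairs, one per tagged span in the text."""
    return [(m.group(1), m.group(2)) for m in TAGGED_SPAN.finditer(text)]

# Hypothetical sample text, not taken from any real report.
sample = 'Seen by <name type="doctor">Jane Roe</name>, patient age <age>36</age>.'
print(labeled_spans(sample))
```

With this sketch, a span tagged <name type="doctor"> and a span tagged <name> both receive the label name, which mirrors the behavior described above.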
Example
This is a made up pathology report about <name>John Doe</name> he is a <age>36</age> year old male.
This report is in reference to <MRN>1234123-123</MRN>.
The above example can be used to train the CRF classifier. Note: one training example is unlikely to be enough for the classifier; you will want many examples to train the mark-up/de-identification model. Again, an annotation tool such as Callisto can be used.
The output from HIDE will be in the same format as the example above.
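Because training files are often annotated by hand, a missing slash in a closing tag is an easy mistake to make. The following Python sketch shows one way to check that every opening tag has a matching closing tag before training. It is an illustrative helper, not part of HIDE; the function name find_tag_problems is hypothetical, and it assumes tags are closed in the reverse order in which they were opened.

```python
import re

# Matches an opening or closing tag; group 1 is "/" for a closing tag,
# group 2 is the tag name. Attributes, if any, are skipped.
TAG = re.compile(r"<(/?)([A-Za-z][\w-]*)\b[^>]*>")

def find_tag_problems(text):
    """Return a list of human-readable tag pairing problems (empty if none)."""
    problems = []
    open_tags = []  # stack of tag names that are currently open
    for m in TAG.finditer(text):
        is_closing, name = m.group(1) == "/", m.group(2)
        if not is_closing:
            open_tags.append(name)
        elif open_tags and open_tags[-1] == name:
            open_tags.pop()
        else:
            problems.append(f"unexpected </{name}> at offset {m.start()}")
    # Anything left on the stack was opened but never closed.
    problems.extend(f"<{name}> is never closed" for name in open_tags)
    return problems

# Hypothetical inputs: the first is well formed, the second lacks a slash.
print(find_tag_problems("Patient <name>John Doe</name> is <age>36</age>."))
print(find_tag_problems("Patient <name>John Doe<name> is <age>36</age>."))
```

An empty list means the tags pair up. In the second call the malformed closing tag is read as a second opening tag, so two unclosed <name> tags are reported.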