Word Sense Disambiguation (WSD) Test Collection
Collaborations & Outside Resources

  Exploiting MeSH indexing in MEDLINE to generate a data set for word sense disambiguation  
  Exploiting MeSH indexing in MEDLINE to generate a data set for word sense disambiguation, Antonio Jimeno-Yepes, Bridget McInnes, Alan Aronson (PDF; BMC Bioinformatics link)

Evaluation of Word Sense Disambiguation (WSD) methods in the biomedical domain is difficult because the available resources are either too small or too focused on specific types of entities (e.g., diseases or genes). We have developed a method that automatically builds a WSD test collection using the Unified Medical Language System (UMLS) Metathesaurus and the manual MeSH indexing of MEDLINE.

The resulting data set is called MSH WSD and consists of 106 ambiguous abbreviations, 88 ambiguous terms, and 9 that are a combination of both, for a total of 203 ambiguous words. Each instance containing the ambiguous word was assigned a CUI from the 2009AB version of the UMLS. For each ambiguous term/abbreviation, the data set contains a maximum of 100 instances per sense obtained from MEDLINE, for a total of 37,888 ambiguity cases in 37,090 MEDLINE citations.

The "MSH WSD Data Set" contains the benchmark_mesh.txt file, which lists each ambiguous word and its candidate CUIs, and the term_pmid_cui file, which has one line per instance giving the ambiguous word, the PMID, and the disambiguated CUI. The data set also contains a file for each of the 203 ambiguous words containing the PMID, the citation text (title and abstract only), and the sense, named according to the benchmark file (M1, M2, ...). In the citation text, the instance of the ambiguous word considered for disambiguation is denoted by the e tag (e.g., <e>AA</e>). There is a README.txt file in the download which explains the files in more detail.
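
As a rough illustration of how the e tag might be used, the following Python sketch pulls the marked ambiguous word and some surrounding context out of a citation string. The function name, the `window` parameter, and the sample sentence are assumptions made for illustration; they are not part of the data set, and the actual file layout is the one described in README.txt.

```python
import re

# Matches the ambiguous word marked for disambiguation, e.g. <e>AA</e>.
E_TAG = re.compile(r"<e>(.*?)</e>")


def find_marked_instances(citation_text, window=30):
    """Return (word, left_context, right_context) for each <e>...</e> span.

    `window` is a hypothetical parameter: the number of characters of
    context kept on each side of the marked word.
    """
    results = []
    for match in E_TAG.finditer(citation_text):
        left = citation_text[max(0, match.start() - window):match.start()]
        right = citation_text[match.end():match.end() + window]
        results.append((match.group(1), left, right))
    return results


# Made-up sentence for illustration only; not taken from the data set.
sample = "Levels of <e>AA</e> were measured in plasma."
print(find_marked_instances(sample))
```

The context strings returned here could serve as input to whatever feature extraction a WSD method uses; a character window is only one possible choice.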

Please Note: Users are responsible for compliance with the UMLS Metathesaurus License Agreement.

To use this test collection, you must have accepted the terms of the UMLS Metathesaurus License Agreement, which requires you to respect the copyrights of the constituent vocabularies and to file a brief annual report on your use of the UMLS. You also must have activated a UMLS Terminology Services (UTS) account.

The 37,090 MEDLINE citations included in this "MSH WSD Data Set" are for exclusive use with the MSH WSD Data Set and cannot be redistributed. In addition, the citations were retrieved in July 2010 and represent a static view of MEDLINE at that time. The data set has been reformatted such that none of the MEDLINE ASCII element labels (e.g., "PMID-" or "TI -") remain and only the Title (TI) and Abstract (AB) elements were used.

MSH WSD Data Set (17 MB compressed, 53 MB uncompressed)

Antonio Jimeno-Yepes, U.S. National Library of Medicine (contact)
Bridget T. McInnes, University of Minnesota Twin Cities (contact)
 
 
  WSD Choices Linked to UMLS CUIs  
  WSD Choices Linked to UMLS CUIs v0.3 - Updated 30June2010 (14.7 KB)

Bridget T. McInnes, University of Minnesota Twin Cities (contact) has kindly provided us with these matchups between the various WSD Ambiguity choices and their corresponding UMLS CUIs. This is a gzipped tar file which has a directory containing a file for each of the 50 ambiguities showing the original choices and the UMLS CUI at the end of the list. Bridget is responsible for the 1999 mappings.

Mark Stevenson, University of Sheffield (contact) has kindly provided us with the 2007AB UMLS matchups between the various WSD Ambiguity choices and their corresponding UMLS CUIs.

Example:
M1|Adjustment <1> (Individual Adjustment)|inbe, Individual Behavior|C0376209
M2|Adjustment <3> (Adjustment Action)|ftcn, Functional Concept|C0456081
M3|adjustment <5> (Psychological adjustment)|menp, Mental Process|C0683269

PLEASE NOTE: The UMLS CUIs in these files are based on the 1999 and 2007AB UMLS data! Some changes do occur with every UMLS release and some changes may have occurred to these specific concepts since the releases of the 1999 and 2007AB UMLS data files.
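
The pipe-delimited example lines shown above can be split into their parts with a few lines of Python. This is a minimal sketch that assumes every line has exactly four fields, as in the example (sense label, original choice, semantic type, CUI); the class name, field names, and function name are hypothetical, and real files may contain lines that need extra handling.

```python
from typing import NamedTuple


class SenseChoice(NamedTuple):
    label: str          # sense label, e.g. "M1"
    description: str    # original WSD choice text
    semantic_type: str  # abbreviation and name, e.g. "inbe, Individual Behavior"
    cui: str            # UMLS CUI, the last field on the line


def parse_choice_line(line):
    """Split one 'label|choice|semantic type|CUI' line into a SenseChoice."""
    fields = [field.strip() for field in line.rstrip("\n").split("|")]
    if len(fields) != 4:
        # Assumption: four fields per line, as in the example above.
        raise ValueError(f"expected 4 pipe-separated fields, got {len(fields)}")
    return SenseChoice(*fields)


example = "M1|Adjustment <1> (Individual Adjustment)|inbe, Individual Behavior|C0376209"
choice = parse_choice_line(example)
print(choice.label, choice.cui)
```

Because the CUIs reflect the 1999 and 2007AB releases, any lookup of `choice.cui` against a newer UMLS release should allow for concepts that have since changed.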
 
 
  nlm2sval2 from Dr. Ted Pedersen at the University of Minnesota, Duluth  
  Now Available from Dr. Ted Pedersen at the University of Minnesota, Duluth:

A small utility package called nlm2sval2, which will take the WSD Test Collection and convert it into the Senseval-2 lexical sample format. nlm2sval2 is written in Perl, and is freely available from their data conversion page at the following URL: http://www.d.umn.edu/~tpederse/tools.html
 

Last Modified: October 18, 2012