The SemRep source code and associated annotated and test data sets are publicly available at our GitHub site: SemRep GitHub.
The Semantic MEDLINE Database (SemMedDB) is a repository of semantic predications (subject-predicate-object triples) extracted by SemRep, a semantic interpreter of biomedical text. SemMedDB currently contains approximately 96.3 million predications extracted from all PubMed citations (about 29.1 million citations) and forms the backbone of the Semantic MEDLINE application.
For details about the SemMedDB schema, click here.
To download SemMedDB, click here.
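To make the notion of a subject-predicate-object triple concrete, here is a minimal sketch in Python. The class name `Predication`, its field names, and the example values are assumptions made for illustration only; they do not reflect the actual SemMedDB schema, which is described at the schema link above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predication:
    """Hypothetical container for one subject-predicate-object triple."""
    subject: str
    predicate: str
    object_: str  # trailing underscore avoids shadowing the built-in `object`

    def as_triple(self):
        # Return the predication as a plain (subject, predicate, object) tuple.
        return (self.subject, self.predicate, self.object_)


# Made-up example values, not taken from SemMedDB.
example = Predication("Aspirin", "TREATS", "Headache")
print(example.as_triple())
```

A frozen dataclass is used here so that predications are hashable and can be collected in sets or used as dictionary keys, which is convenient when counting repeated triples.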
In early 2011, we conducted a gold standard annotation study in which we annotated a set of 500 sentences, randomly selected from MEDLINE abstracts, with semantic predications. The resulting annotations are intended mainly as an evaluation testbed for SemRep, but they can also be used by other information extraction systems based on UMLS domain knowledge. The study consisted of three phases: a) the practice phase, b) the main annotation phase, and c) the adjudication phase.
Here, we present two sets of annotations from the main phase as well as the adjudicated gold standard. For further details, please refer to our BMC Bioinformatics paper Constructing A Semantic Predication Gold Standard from the Biomedical Literature.
Annotator A: Main Phase (main_A.xml) (1.3 MB)
Annotator B: Main Phase (main_B.xml) (1.4 MB)
Annotator C: Adjudication (adjudicated.xml) (1.4 MB)
DTD file (annotations.dtd) (1.8 KB)
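Before working with the annotation files, it can help to see which elements they contain. The sketch below counts element names in an XML document using Python's standard library. It makes no assumption about the structure defined in annotations.dtd; the function name `count_tags` and the tiny sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(root):
    """Count how often each element name occurs under `root` (inclusive)."""
    return Counter(element.tag for element in root.iter())


# Made-up sample document; the real files follow annotations.dtd instead.
sample = ET.fromstring("<doc><s><p/><p/></s><s/></doc>")
print(count_tags(sample))

# For a downloaded file, the root could be obtained with, for example:
#   root = ET.parse("adjudicated.xml").getroot()
```

Running the same function on main_A.xml, main_B.xml, and adjudicated.xml gives a quick way to compare the overall size of the three annotation sets.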
To develop and evaluate a sortal anaphora resolution module, we annotated a corpus of 320 MEDLINE citations with pairwise sortal anaphora relations, each consisting of an anaphoric expression and its corresponding antecedent. Because we aimed for a general approach that takes all semantic types into account and thereby supports SemRep, we collected MEDLINE abstracts on a wide range of topics, including molecular biology and clinical medicine.
For further details, please refer to our BMC Bioinformatics paper Sortal anaphora resolution to enhance relation extraction from biomedical literature.
Sortal Anaphora dataset:
Sortal Anaphora Dataset
Biomedical knowledge claims are often expressed with extra-propositional entities such as hypotheses, speculations, or opinions, rather than explicit facts (assertions or propositions). Currently, SemRep extracts propositional content in the form of predications. We studied the feasibility of incorporating extra-propositional information by assessing the factuality level of SemRep predications. To this end, we annotated semantic predications extracted from 500 PubMed abstracts with seven factuality values (FACT, PROBABLE, POSSIBLE, DOUBTFUL, COUNTERFACT, UNCOMMITTED, and CONDITIONAL).
For further details, please refer to our PLoS ONE paper Assigning factuality values to semantic relations extracted from Biomedical Research Literature.
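The seven factuality values listed above can be represented as an enumeration. The following is a minimal sketch; the value names come from the description above, while the class name `Factuality` and the helper `parse_factuality` are hypothetical and not part of the dataset or of SemRep.

```python
from enum import Enum


class Factuality(Enum):
    """The seven factuality values used in the annotation."""
    FACT = "FACT"
    PROBABLE = "PROBABLE"
    POSSIBLE = "POSSIBLE"
    DOUBTFUL = "DOUBTFUL"
    COUNTERFACT = "COUNTERFACT"
    UNCOMMITTED = "UNCOMMITTED"
    CONDITIONAL = "CONDITIONAL"


def parse_factuality(label):
    # Normalize whitespace and case, then look the label up by value.
    # Raises ValueError for a label that is not one of the seven values.
    return Factuality(label.strip().upper())


print(parse_factuality(" probable "))
```

Using an enumeration rather than free-form strings means that a misspelled label fails immediately instead of silently creating an eighth category.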
While the UMLS provides the predication arguments and the linking predicates, indicator rules map syntactic elements in the text, such as verbs and nominalizations, to predicates in the UMLS Semantic Network (SN), e.g., TREATS, PREVENTS, and AFFECTS. The indicator file contains the indicator rules for the SN predicates that SemRep uses. At this time (v1.7), SemRep uses an earlier Prolog file. The new Java file was created for the Java implementation of SemRep (v1.8). As a result, some rules in the file are not yet implemented; they are included for future implementation once SemRep Java is complete. Indicator rules come in different types, from simple to multi-phrase, and the format varies by type.
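The idea behind the simplest kind of indicator rule can be sketched as a lookup from a word to a predicate. This is an illustration only: the table contents, the names `INDICATORS` and `predicate_for`, and the dictionary representation are all assumptions, and the sketch does not reproduce the format of the actual indicator file or cover multi-phrase rules.

```python
# Hypothetical single-word indicators: a verb or nominalization (lowercase)
# mapped to the predicate it signals. Not taken from the SemRep indicator file.
INDICATORS = {
    "treat": "TREATS",
    "treatment": "TREATS",
    "prevent": "PREVENTS",
    "prevention": "PREVENTS",
    "affect": "AFFECTS",
}


def predicate_for(lemma):
    """Return the predicate signaled by `lemma`, or None if there is no rule."""
    return INDICATORS.get(lemma.lower())


print(predicate_for("Treatment"))
print(predicate_for("bind"))
```

In this sketch a word with no matching rule yields `None`, so the caller decides whether to skip it; a fuller treatment would also need to check the syntactic context and the semantic types of the candidate arguments.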