The design of the term checker for ASD-STE100
This term checker is a customized version of LanguageTool. The grammar and the terms that the term checker validates are specified in 4 XML files. (The word term means the same as the word word in ASD-STE100.)
An important difference between a grammar checker and the term checker is as follows:
- A grammar checker finds incorrect text.
- The term checker ignores correct text.
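The difference can be shown with a minimal Python sketch. The word lists are hypothetical placeholders, not data from ASD-STE100 or from the term checker. A checker that finds known errors does not find a word that is not in its list. A checker that ignores only the known correct terms finds all other words.

```python
# Hypothetical word lists, for illustration only.
KNOWN_ERRORS = {"utilize"}               # a grammar checker looks for these
APPROVED_TERMS = {"use", "the", "tool"}  # the term checker ignores these


def grammar_style_check(words):
    """Return the words that match a known error."""
    return [w for w in words if w.lower() in KNOWN_ERRORS]


def term_style_check(words):
    """Return all the words that are not known to be correct."""
    return [w for w in words if w.lower() not in APPROVED_TERMS]


text = "Use the gizmo tool".split()
print(grammar_style_check(text))  # []
print(term_style_check(text))     # ['gizmo']
```

In this sketch, the unknown word gizmo is found only by the second function, because it is not in the list of approved terms.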
The term checker for ASD-STE100 is not Software as a Service (SaaS). You install LanguageTool on your computers. All the processing is on your computers.
The structure of LanguageTool
LanguageTool has rules for many languages. For each language, the rules are in the files disambiguation.xml and grammar.xml. (Some languages also have style rules in style.xml.) The term checker uses the English disambiguation.xml and grammar.xml to specify the location of the rules for the term checker.
The disambiguation files specify the terms, and the grammar files tell LanguageTool which terms to find.
The rules for the term checker are in 4 XML files:
- disambiguation-ste8.xml tells LanguageTool which STE terms are approved and which STE terms are not approved. Disambiguation rules identify the part of speech that a term has. For disambiguation, sequence is important. Thus, disambiguation-ste8.xml includes the content of disambiguation-projectterms.xml before the disambiguation rules.
- disambiguation-projectterms.xml tells LanguageTool which project terms are approved and which project terms are not approved. You will edit this file to include your organization's technical terms.
- grammar-ste8.xml tells LanguageTool to find STE terms that are not approved and STE terms that are used incorrectly.
- grammar-projectterms.xml tells LanguageTool to find project terms that are not approved and project terms that are used incorrectly. You can edit this file to give guidance to your technical writers about your technical terms.
Changes to LanguageTool
To make LanguageTool into a checker for STE, TechScribe makes these changes to files in LanguageTool:
- Replace \org\languagetool\resource\en\disambiguation.xml with TechScribe's version, which has a password-protected link to disambiguation-ste8-202y-mm-dd.xml on the TechScribe website.
- Replace \org\languagetool\resource\en\grammar.xml with TechScribe's version, which has a password-protected link to grammar-ste8-202y-mm-dd.xml on the TechScribe website.
- Replace \org\languagetool\rules\en\en-GB\grammar.xml with TechScribe's version of grammar.xml. This change is for spelling rules that are applicable only to British English. Refer to 'For British English, the term checker uses the Oxford spelling'.
- Replace \org\languagetool\rules\en\en-US\grammar.xml with TechScribe's version of grammar.xml. This change removes the rules that are not applicable to STE.
- Delete \org\languagetool\rules\en\style.xml.
  - The rules are not applicable to STE.
  - The parts of speech in the term checker have an unwanted effect on the LanguageTool rules. Refer to https://github.com/languagetool-org/languagetool/issues/8414.
- Delete \org\languagetool\rules\en\en-GB\style.xml. The rules are not applicable to STE.
- Delete \org\languagetool\rules\en\en-US\style.xml. The rules are not applicable to STE.
- Delete some terms from \org\languagetool\resource\en\multiwords.txt. LanguageTool gives each term in this file only one part of speech, which can cause an error with the analysis of text. Refer to https://github.com/languagetool-org/languagetool/issues/7779.
- Delete terms from org\languagetool\rules\en\en-GB\replace.txt. The file is necessary for LanguageTool, but the contents are not applicable to STE.
- Delete terms from org\languagetool\rules\en\en-US\replace.txt. The file is necessary for LanguageTool, but the contents are not applicable to STE.
The files for the term checker are not in the LanguageTool directory
You will create the installation directory in the installation step 'Create the directories and install LanguageTool'.
In LanguageTool, disambiguation.xml and grammar.xml contain all the rules for the grammar checks and the style checks. The 2 XML files are in the LanguageTool-n.n directory. Thus, if you use the stand-alone version of LanguageTool and the OpenOffice version, you have 2 sets of files.
The term checker files are not in the LanguageTool-n.n directory. This method has these advantages:
- Only 1 set of files is necessary. All the changes that you make in a file are available in all the different versions of LanguageTool.
- You can easily use different sets of rules for different projects.
- If you update LanguageTool, the term checker files are not deleted.
When you install the term checker, you will replace disambiguation.xml and grammar.xml with files from TechScribe. The new files contain links to the files for the term checker (disambiguation-ste8.xml, grammar-ste8.xml, disambiguation-projectterms.xml, and grammar-projectterms.xml).
The files disambiguation-ste8.xml and grammar-ste8.xml are in these locations:
- With the remote files version of the term checker, the files are always on the TechScribe website. The remote files version of the term checker is the version that you evaluate. (Although the files are on the TechScribe website, all the processing is done on your computer. LanguageTool does not send your data to TechScribe.)
- With the local files version of the term checker, the files are on your computer.
The disambiguation of terms
If ASD-STE100 shows a term as approved for one part of speech, and unapproved for a different part of speech, then the term checker finds the term if it is used incorrectly. For example, work is approved as a noun, but not as a verb:
In the text, "You must work quickly," work is a verb, not a noun. Thus, the term checker finds the term.
The design of disambiguation-ste8.xml
The disambiguation rules that are in disambiguation-ste8.xml analyse text and add a POS tag to show the part of speech that a word has.
The analysis uses pattern matching. For example, in the text "the X was," X is a noun. X cannot be a verb or some other part of speech. Each rule is applied in sequence. If text matches a pattern, the rule adds a part of speech to the matched term. If the pattern is not matched, the next rule is used.
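The pattern matching can be illustrated with a minimal Python sketch. The sketch is not the code in disambiguation-ste8.xml. The rule functions and the tag names NOUN and VERB are hypothetical. The sketch also assumes that the analysis of a word stops at the first rule that matches.

```python
def rule_the_x_was(words, i):
    """Pattern 'the X was': X is a noun."""
    if 0 < i < len(words) - 1:
        if words[i - 1].lower() == "the" and words[i + 1].lower() == "was":
            return "NOUN"
    return None


def rule_must_x(words, i):
    """Hypothetical pattern 'must X': X is a verb."""
    if i > 0 and words[i - 1].lower() == "must":
        return "VERB"
    return None


# The rules are applied in sequence, so the order of this list is important.
RULES = [rule_the_x_was, rule_must_x]


def disambiguate(words):
    """Return (word, tag) pairs. The tag is None if no rule matches."""
    tagged = []
    for i, word in enumerate(words):
        tag = None
        for rule in RULES:
            tag = rule(words, i)
            if tag is not None:
                break  # assumption: the first matching rule gives the tag
        tagged.append((word, tag))
    return tagged


print(disambiguate("the work was easy".split()))
# [('the', None), ('work', 'NOUN'), ('was', None), ('easy', None)]
print(disambiguate("You must work quickly".split()))
# [('You', None), ('must', None), ('work', 'VERB'), ('quickly', None)]
```

Because each rule sees only the adjacent words, the sketch has the same limit as an n-gram analysis: it cannot use information from the full sentence.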
The analysis is not always correct. The examples that follow are not STE. They are standard English:
- The analysis uses patterns (n-grams). For some text, full-sentence parsing is necessary:
  - A person's right to work to earn money. ['s = verb, right=verb]
  - A person's right to work to earn money is not disputed. ['s = possessive, right=noun]
- Semantics has an effect on the analysis:
  - To close the Awards Ceremony, the former cricket captain was praised. [former=adjective]
  - To improve the structural stability, the former attachment point was moved. [former=noun in a noun cluster]
The disambiguator is different to the LanguageTool disambiguator
LanguageTool has a disambiguator, but the term checker cannot use it:
- A word in ASD-STE100 does not always have the same parts of speech that the word has in standard English. For example, in standard English, the word your is a possessive pronoun, but in ASD-STE100, it is an adjective.
- The disambiguation of ASD-STE100 is not the same as the disambiguation of standard English. For example, think about the sentence "You can see the key positions in Figure 7." In standard English, the word key can be parsed as an adjective that modifies the noun positions. In ASD-STE100, key is parsed as a noun, because key is an approved technical name (noun) and is unknown as an adjective.
For more information about disambiguation, refer to these documents:
- Developing a Disambiguator (https://dev.languagetool.org/developing-a-disambiguator)
- Patterns in language for POS disambiguation in a style checker (www.techscribe.co.uk/ta/patterns-in-language-tcuk-2013.pdf)
The design of grammar-ste8.xml
The rules that are in grammar-ste8.xml tell LanguageTool the text to find. For example, the pattern that finds the verb work is as follows:
<token>work<exception postag_regexp="yes" postag="IS_(NOUN|NNP)|PROJECT_TN_NOUN_MULTI_WORD.*"/></token>
The XML code means, "find the word work, unless work is a noun, a proper noun, or a multi-word project noun."
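The logic of the token and its exception can be approximated in Python. This is a sketch, not the LanguageTool implementation. It assumes that the token match ignores case, that the regular expression must match a full POS tag, and that the exception applies if one or more of the POS tags of the word match. The tag IS_VERB and the suffix _2 in the examples are hypothetical.

```python
import re

# The POS tag regular expression from the exception in the rule.
EXCEPTION_POSTAG = re.compile(r"IS_(NOUN|NNP)|PROJECT_TN_NOUN_MULTI_WORD.*")


def is_found(word, postags):
    """Return True if the word is 'work' and no POS tag matches the exception."""
    if word.lower() != "work":
        return False
    return not any(EXCEPTION_POSTAG.fullmatch(tag) for tag in postags)


print(is_found("work", ["IS_VERB"]))                       # True
print(is_found("work", ["IS_NOUN"]))                       # False
print(is_found("work", ["PROJECT_TN_NOUN_MULTI_WORD_2"]))  # False
```

Thus, the result depends on the POS tags that the disambiguation rules add before the grammar rules are applied.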
To prevent 2 errors for 1 problem, rule STE_RULE_1_1_USE_APPROVED_WORDS tells LanguageTool to find only the unknown terms. Other rules in grammar-ste8.xml cause LanguageTool to find the unapproved terms and the terms that are used incorrectly.
Limits and defects in LanguageTool prevent the correct analysis of some text
LanguageTool does not always find the end of a sentence
LanguageTool does not always find the end of a sentence: https://github.com/languagetool-org/languagetool/issues/6318.
This defect in LanguageTool causes an error with the STE disambiguation of sentences in a list, if the sentences do not end with a full stop (period) and are not separated by empty lines.
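The effect can be imitated with a simplified sentence splitter in Python. The splitter is an assumption for illustration only. It is not the segmentation that LanguageTool uses. It divides text only after sentence-final punctuation or at an empty line, so 2 list items that have neither become 1 sentence.

```python
import re

# Divide after . ! or ? followed by white space, or at an empty line.
SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+|\n\s*\n")


def split_sentences(text):
    """Return the sentences that the simplified splitter finds."""
    return [s for s in SENTENCE_BREAK.split(text.strip()) if s]


print(len(split_sentences("Open the door\nClose the valve")))    # 1
print(len(split_sentences("Open the door.\nClose the valve.")))  # 2
print(len(split_sentences("Open the door\n\nClose the valve")))  # 2
```

When 2 list items are analysed as 1 sentence, patterns can match across the line break, and the parts of speech can be incorrect.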
LanguageTool incorrectly removes POS tags
In the conditions that follow, LanguageTool incorrectly removes POS tags from a word, adds the VBG POS tag to the word, and makes the quote mark part of the word:
- The word ends with in.
- The word plus the letter g can be a verb. Examples: basing, bulleting, tinting.
- The word is immediately followed by a single quote mark.
- There is no single quote mark immediately before the word.
These changes are done in the LanguageTool Java code. It is not possible to correct these errors in disambiguation-ste8.xml. These changes can cause the term checker to give an incorrect analysis.
For more examples, refer to https://github.com/languagetool-org/languagetool/issues/9001.
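The 4 conditions in the list above can be written as one check. The Python sketch that follows is not the LanguageTool Java code. The set VERB_FORMS is a hypothetical sample that contains only the examples from the list. A real check would need a full dictionary of verbs.

```python
# Hypothetical sample of verb forms that end with 'ing'.
VERB_FORMS = {"basing", "bulleting", "tinting"}


def is_affected(word, char_before, char_after):
    """Return True if all 4 conditions apply to the word."""
    return (
        word.lower().endswith("in")              # the word ends with 'in'
        and (word + "g").lower() in VERB_FORMS   # word + 'g' can be a verb
        and char_after == "'"                    # single quote mark after
        and char_before != "'"                   # no single quote mark before
    )


print(is_affected("basin", " ", "'"))  # True
print(is_affected("basin", "'", "'"))  # False
print(is_affected("cabin", " ", "'"))  # False
```

A check of this type can help to identify text for which the analysis of the term checker is possibly incorrect.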
Precision and recall
ASD-STE100 is for safety-critical documentation. The best possible analysis is if the term checker finds all the errors in a text (recall=1.0) and does not give incorrect warnings (precision=1.0). For an introduction to precision and recall, refer to Classification: Precision and Recall (https://developers.google.com/machine-learning/crash-course/classification/accuracy-precision-recall).
The precision and the recall are dependent on the text. (The rules for semantics usually give a warning.) Typical values are as follows:
Precision: 0.86.
Recall: 0.98.
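The calculation of the 2 values can be shown in Python. The counts in the example are hypothetical. They were selected only because they give the typical values and they are not measurements from the term checker.

```python
def precision(true_positives, false_positives):
    """The fraction of the warnings that are correct."""
    return true_positives / (true_positives + false_positives)


def recall(true_positives, false_negatives):
    """The fraction of the errors in the text that are found."""
    return true_positives / (true_positives + false_negatives)


# Hypothetical counts: 98 correct warnings, 16 incorrect warnings,
# and 2 errors that are not found.
tp, fp, fn = 98, 16, 2
print(round(precision(tp, fp), 2))  # 0.86
print(round(recall(tp, fn), 2))     # 0.98
```

With these counts, approximately 1 warning in 7 is incorrect, but only 2 errors in 100 are not found.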
Refer also to these documents:
- Building a controlled language lexicon for Danish (https://rauli.cbs.dk/index.php/LSP/article/download/2069/2068)
- A specification and validating parser for simplified technical Spanish (https://aclanthology.org/2003.eamt-1.15.pdf)
- A simple rule-based part of speech tagger (https://aclanthology.org/H92-1022/)