%%% spelling-doc.tex
%%% Copyright 2012, 2013 Stephan Hennig
%%
%% This work may be distributed and/or modified under the conditions of
%% the LaTeX Project Public License, either version 1.3 of this license
%% or (at your option) any later version.  The latest version of this
%% license is in http://www.latex-project.org/lppl.txt
%% and version 1.3 or later is part of all distributions of LaTeX
%% version 2005/12/01 or later.
%%
%% See file README for more information.
%%
\documentclass[11pt]{article}
\usepackage{fontspec}
\defaultfontfeatures{Ligatures=TeX}
\usepackage{multicol}
\usepackage[rgb, x11names]{xcolor}
\usepackage{listings}
\input{\jobname-lst-lua.tex}
\lstset{
  basicstyle=\ttfamily,
  columns=spaceflexible,
}
% Short-cut for non-language code snippets.
\lstMakeShortInline\|
% Short-cut for LaTeX code snippets.
\lstMakeShortInline[
language={[LaTeX]TeX},
basicstyle=\ttfamily,
]°
\lstdefinestyle{Lua}{
  language=[5.2]Lua,
  keywordstyle=\bfseries\color{Blue4},
  keywordstyle=[2]\bfseries\color{RoyalBlue3},
  keywordstyle=[3]\bfseries\color{Purple3},
  stringstyle=\bfseries\color{Coral4},
  commentstyle=\itshape\color{Green4},
}
\usepackage{xspace}
\usepackage{array}
\usepackage{booktabs}
\usepackage[latin, UKenglish]{babel}
\usepackage{hyperref}
\hypersetup{
  pdftitle={spelling},
  pdfauthor={Stephan Hennig},
  pdfkeywords={spell-checking, spelling, TeX, LuaTeX}
}
\hypersetup{
  english,% For \autoref.
  pdfstartview={XYZ null null null},% Zoom factor is determined by viewer.
  colorlinks,
  linkcolor=RoyalBlue3,
  urlcolor=Chocolate4,
  citecolor=DeepPink2
}
\usepackage{spelling}
\spellingreadbad{\jobname.bad}
\newcommand*{\pkg}{\textsf{spelling}}
\newcommand*{\acr}[1]{\mbox{\scshape#1}}
\newcommand*{\descr}[1]{〈\emph{#1}〉}
\newcommand*{\cmd}[1]{\mbox{\ttfamily\textbackslash#1}}
\newcommand*{\macro}[1]{\cmd{#1}\marginpar{\cmd{#1}}}
\newcommand*{\latinphrase}[1]{\foreignlanguage{latin}{\emph{#1}}}
\newcommand*{\lpcf}{\latinphrase{cf.}\xspace}
\newcommand*{\lpeg}{\latinphrase{e.\,g.}\xspace}
\newcommand*{\lpetc}{\latinphrase{etc.}\xspace}
\newcommand*{\lpie}{\latinphrase{i.\,e.}\xspace}
\begin{document}
\author{Stephan Hennig\thanks{sh2d@arcor.de}}
\title{\pkg\thanks{This document describes the \pkg\ package v0.41.}}
\maketitle


\begin{abstract}
  This package supports spell-checking of \TeX\ documents compiled with
  the Lua\TeX\ engine.  It can give visual feedback in \acr{pdf} output
  similar to \acr{wysiwyg} word processors.  The package relies on an
  external spell-checker application that can check a plain text file
  and output a list of bad spellings.  The package should work with most
  spell-checkers, even dumb, \TeX-unaware ones.

  \emph{Warning!  This package is in a very early state.  Everything may
    change!}
\end{abstract}

\begin{multicols}{2}
\small
% Set toc entries ragged right.  Trick taken from tocloft.pdf.
\makeatletter
\renewcommand{\@tocrmarg}{2.55em plus1fil}
\makeatother
\tableofcontents
\end{multicols}


\section{Introduction}
\label{sec:intro}

Ther%
\footnote{A footnote containing mispellings.}
%
are three main approaches to spell-checking \TeX\ documents:

\begin{enumerate}

\item checking spelling in the |.tex| source file,

\item converting a |.tex| file to another format, for which a proven
  spell-checking solution exists,

\item checking spelling after a |.tex| file has been processed by \TeX.

\end{enumerate}

All of these approaches have their strengths and weaknesses.  This
package follows the third approach, providing some unique features:

\begin{itemize}

\item In traditional solutions, text is extracted from typeset
  \acr{dvi}, \acr{ps} or \acr{pdf} files, including hyphenated words.
  To avoid (lots of) false positives being reported by the
  spell-checker, hyphenation needs to be switched off during the \TeX\
  run.  That is, one doesn't work on the original document any more.

  In contrast to that, the \pkg\ package works transparently on the
  original |.tex| source file.  Text is extracted \emph{during}
  typesetting, after Lua\TeX\ has applied its catcode and macro
  machinery, but before hyphenation takes place.

\item The \pkg\ package can highlight words with known incorrect
  spelling in \acr{pdf} output, giving visual feedback similar to
  \acr{wysiwyg} word processors.%
  \footnote{Currently, only colouring words is implemented.}

\end{itemize}


\section{Usage}
\label{sec:usage}

The \pkg\ package requires the Lua\TeX\ engine.  All functionality of
the package is implemented in Lua.  The \LaTeX\ interface, which is
described below, is effectively a wrapper around the Lua interface.

\emph{Implementing such wrappers for other formats shouldn't be too
  difficult.  The author is a \LaTeX-only user, though, and therefore
  grateful for contributions.  By the way, the \LaTeX\ package needs
  some polishing, too; \lpeg, a key-value interface would be desirable.
  Patches welcome!}


\subsection{Work-flow}
\label{sec:work-flow}

Here's a short outline of how using the \pkg\ package fits into the
general process of compiling a document with Lua\TeX:

\begin{enumerate}

\item After loading the package in the preamble of a |.tex| source file,
  a list of bad spellings is read from a file (if that file exists).

\item During the Lua\TeX\ run, text is extracted from pages and all
  words are checked against the list of bad spellings.  Words with a
  known incorrect spelling are highlighted in \acr{pdf} output.

\item At the end of the Lua\TeX\ run, in addition to the \acr{pdf} file,
  a text file is written, containing most of the text of the typeset
  document.

\item The text file is then checked by your favourite external
  spell-checker application, \lpeg, Aspell or Hunspell.  The
  spell-checker should be able to write a list of bad spellings to a
  file.  Otherwise, visual feedback in \acr{pdf} output won't work.

\item Visually minded people may now compile their document a second
  time.  This time, the new list of bad spellings is read in, and words
  with incorrect spelling found by the spell-checker should now be
  highlighted in \acr{pdf} output.  Users can then apply the necessary
  corrections to the |.tex| source file.

\end{enumerate}
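
As a rough illustration of the first step, a minimal document using the
package with its default settings might look as follows.  This is only a
sketch; the document class and the sample text are arbitrary choices.

\begin{lstlisting}[language={[LaTeX]TeX}]
\documentclass{article}
\usepackage{spelling}% requires the LuaTeX engine
\begin{document}
This sentence contains a mispelled word.
\end{document}
\end{lstlisting}
%
Compiled with Lua\LaTeX, such a document should produce, in addition to
the \acr{pdf} file, the text file that serves as spell-checker input.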

Users not interested in visual feedback (because their spell-checker
has an interactive mode only or because they prefer grabbing bad
spellings from a file directly) can also benefit from this package.
With it, Lua\TeX\ writes a plain text file that is particularly well
suited as spell-checker input, because it contains no hyphenated words,
no macros and no active characters.  That way, any spell-checker
application, even a \TeX-unaware one, can be used to check the spelling
of \TeX\ documents.


\subsection{Word lists}
\label{sec:wordlists}

As described above, after loading the \pkg\ package, a list of bad
spellings is read from a file \descr{jobname}|.spell.bad|, if that file
exists.  Words found in this file are stored in an internal list of bad
spellings and are later used for highlighting spelling mistakes in
\acr{pdf} output.  Additionally, a list of good spellings is read from a
file \descr{jobname}|.spell.good|, if that file exists.  Words found in
the latter file are stored in an internal list of good spellings.  The
file format for both files is one word per line.  Files must be in the
\acr{utf-8} encoding.  Letter case is significant.

A word in the document is highlighted if it occurs in the internal list
of bad spellings, but not in the internal list of good spellings.  That
is, known good spellings take precedence over known bad spellings.

Users can load additional files containing lists of bad or good
spellings with macros \macro{spellingreadbad} and
\macro{spellingreadgood}.  Argument to both macros is a file name.  If a
file cannot be found, a warning is written to the console and |log| file
and compilation continues.  As an example, the command

\begin{lstlisting}[language={[LaTeX]TeX}]
\spellingreadgood{myproject.whitelist}
\end{lstlisting}
%
reads words from a file |myproject.whitelist| and adds them to the list
of good spellings.

Known good spellings can be used to deal with words wrongly reported as
bad spellings by the spell-checker (false positives).  But note, most
spell-checkers also provide means to deal with unknown words via
additional dictionaries.  It is recommended to configure your
spell-checker to report as few false positives as possible.


\subsection{Match rules}
\label{sec:matchrules}

\emph{This section describes an advanced feature.  You may safely skip
  this section upon first reading.}

The \pkg\ package provides an additional way to deal with bad and good
spellings: match rules.  Match rules can be used to exploit regular
patterns within certain ‘words’.  Typical examples are bibliographic
references like \emph{Lin86}, which are often flagged by spell-checkers,
but need not be highlighted as they are generated by \TeX.

There are two kinds of rules, bad and good rules.  A rule is a Lua
function whose boolean return value indicates whether a word
\emph{matches} the rule.  A bad rule should return a true value for all
strings identified as bad spellings, otherwise a false value.  A good
rule should return a true value for all strings identified as good
spellings, otherwise a false value.  A word in the document is
highlighted if it matches any bad rule, but no good rule.

Function arguments are a \emph{raw} string and a \emph{stripped} string.
The raw string is a string representing a word as it is found in the
document possibly surrounded by punctuation characters.  The stripped
string is the same string with surrounding punctuation already stripped.

As an example, the rule in \autoref{lst:mr-three-letter-words} matches
all words consisting of exactly three letters.  The function matches the
stripped string against the Lua string pattern |^%a%a%a$| via function
|unicode.utf8.find| from the Selene Unicode library.  The latter
function is a \acr{utf-8} capable version of Lua's built-in function
|string.find|.  It returns |nil| (a false value) if there has been no
match and a number (a true value) if there has been a match.  The
pattern |%a| represents a character class matching a single letter.
Characters |^| and |$| are anchors for the beginning and the end of the
string in question.  Note that pattern |%a%a%a| without anchors would match
any string containing three letters in a row.  More information about
Lua string patterns can be found in the Lua reference manual%
\footnote{\url{http://www.lua.org/manual/5.2/manual.html\#6.4}}%
%
, the Selene Unicode library documentation%
\footnote{\url{https://github.com/LuaDist/slnunicode/blob/master/unitest}}
%
and in the Unicode standard%
\footnote{\url{http://www.unicode.org/Public/4.0-Update1/UCD-4.0.1.html\#General_Category_Values}}%
.

\suppressfloats[b]

\begin{lstlisting}[style=Lua, float, label=lst:mr-three-letter-words, caption={Matching three-letter words.}]
function three_letter_words(raw, stripped)
  return unicode.utf8.find(stripped, '^%a%a%a$')
end
\end{lstlisting}

\autoref{lst:mr-double-punctuation} shows a rule matching all ‘words’
containing double punctuation.  Note how the raw string is examined
instead of the stripped one.

\begin{lstlisting}[style=Lua, float, label=lst:mr-double-punctuation, caption={Matching double punctuation.}]
function double_punctuation(raw, stripped)
  return unicode.utf8.find(raw, '%p%p')
end
\end{lstlisting}

The rule in \autoref{lst:mr-bibtex-alpha} combines the results of three
string searches to match bibliographic references as generated by the
Bib\TeX\ style \emph{alpha}.

\begin{lstlisting}[style=Lua, float, label=lst:mr-bibtex-alpha, caption={Matching references generated by the Bib\TeX\ style \emph{alpha}.}]
function bibtex_alpha(raw, stripped)
  return unicode.utf8.find(stripped, '^%u%l%l?%d%d$')
    or unicode.utf8.find(stripped, '^%u%u%u?%u?%d%d$')
    or unicode.utf8.find(stripped, '^%u%u%u%+%d%d$')
end
\end{lstlisting}

Match rules have to be provided by means of a Lua module.  Such modules
can be loaded with the \macro{spellingmatchrules} command.  Argument is
a module name.  To tell bad rules from good rules, the table returned by
the module must follow this convention: Function identifiers
representing bad and good match rules are prefixed |bad_rule_| and
|good_rule_|, respectively.  The rest of an identifier is irrelevant.
Identifiers without these prefixes and identifiers not referring to
functions are ignored.

\autoref{lst:mr-module} shows an example module declaring the rules from
\autoref{lst:mr-three-letter-words} and
\autoref{lst:mr-double-punctuation} as \emph{bad} match rules and the
rule from \autoref{lst:mr-bibtex-alpha} as a \emph{good} match rule.
Note how function identifiers are made local and how exported function
identifiers are prefixed |bad_rule_| and |good_rule_|, while local
function identifiers have no prefixes.  When the module resides in a
file named |myproject.rules.lua|, it can be loaded in the preamble of a
document via
\begin{lstlisting}[language={[LaTeX]TeX}]
\spellingmatchrules{myproject.rules}
\end{lstlisting}

\begin{lstlisting}[style=Lua, float=p, label=lst:mr-module, caption={A Lua module containing two bad and one good match rule.}]
-- Module table.
local M = {}

-- Import Selene Unicode library.
local unicode = require('unicode')
-- Add short-cut.
local Ufind = unicode.utf8.find

-- Local function matching three letter words.
local function three_letter_words(raw, stripped)
  return Ufind(stripped, '^%a%a%a$')
end
-- Make this a bad rule.
M.bad_rule_three_letter_words = three_letter_words

local function double_punctuation(raw, stripped)
  return Ufind(raw, '%p%p')
end
M.bad_rule_double_punctuation = double_punctuation

local function bibtex_alpha(raw, stripped)
  return Ufind(stripped, '^%u%l%l?%d%d$')
    or Ufind(stripped, '^%u%u%u?%u?%d%d$')
    or Ufind(stripped, '^%u%u%u%+%d%d$')
end
M.good_rule_bibtex_alpha = bibtex_alpha

-- Export module table.
return M
\end{lstlisting}

How are match rules and lists of bad and good spellings related?
Internally, the lists of bad and good spellings are referred to by two
special default match rules that look up raw and stripped strings and
return a true value if either argument has been found in the
corresponding list.  Since good rules take precedence over bad rules, an
entry in the list of good spellings takes precedence over any
user-supplied bad rule.  Likewise, any user-supplied good rule takes
precedence over an entry in the list of bad spellings.

\paragraph{Some final remarks on match rules} It must be stressed that
the boolean return value of a match rule \emph{does not} indicate
whether a spelling is bad or good, but whether a word matches a certain
rule or not.  Whether it's a bad or a good spelling depends on the name
of the match rule in the module table.

Match rules are only called upon the first occurrence of a spelling in a
document.  Whether a spelling needs to be highlighted is then stored in
a cache table.  Subsequent occurrences of a spelling just need a table
look-up to determine highlighting status.  For that reason, it is safe
to do relatively expensive operations within a match rule without
affecting compilation time too much.  Nevertheless, match rules should
be stated as efficiently as possible.%
\footnote{Some Lua performance tips can be found in the book \emph{Lua
    Programming Gems} by Figueiredo, Celes and Ierusalimschy
  \emph{(eds.)}, 2008, ch.~2.  That chapter is also available online at
  \url{http://www.lua.org/gems/}.}

When written without care, match rules can easily produce false
positives as well as false negatives.  While false positives in bad
rules and false negatives in good rules can easily be spotted due to the
unexpected highlighting of words, the other cases are more problematic.
To avoid all kinds of false results, match rules should be stated as
specifically as possible.


\subsection{Highlighting spellling mistakes}
\label{sec:highlight}

\paragraph{Enabling/disabling} Highlighting spelling mistakes (words
with known incorrect spelling) in \acr{pdf} output can be toggled on and
off with command \macro{spellinghighlight}.  If the argument is |on|,
highlighting is enabled.  For other arguments, highlighting is disabled.
Highlighting is enabled by default.

\paragraph{Colour} The colour used for highlighting bad spellings can be
set with command \cmd{spellinghighlightcolor}.  Argument is a
colour statement in the \acr{pdf} language.  As an example, the colour
red in the \acr{rgb} colour space is represented by the statement %
|1 0 0 rg|.  In the \acr{cmyk} colour space, a reddish colour is
represented by |0 1 1 0 k|.  Default colour used for highlighting is %
|1 0 0 rg|, \lpie, red in the \acr{rgb} colour space.
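
As a hedged example, assuming the package has been loaded, the following
lines keep highlighting enabled and switch the highlighting colour to
blue in the \acr{rgb} colour space.  The colour value is an arbitrary
choice for illustration.

\begin{lstlisting}[language={[LaTeX]TeX}]
\spellinghighlight{on}
\spellinghighlightcolor{0 0 1 rg}% blue, RGB colour space
\end{lstlisting}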


\subsection{Text output}
\label{sec:textoutput}

\paragraph{Text file} After loading the \pkg\ package, at the end of the
Lua\TeX\ run, a text file is written that contains most of the document
text.  The text file is not a faithful rendering of the typeset document,
but serves as input for your favourite spell-checker application.  It
contains the document text in a simple format: paragraphs separated by
blank lines.  A paragraph is anything that, during typesetting, starts
with a |local_par| whatsit node in the node list representing a typeset
page of the original document, \lpeg, paragraphs in running text,
footnotes, marginal notes, (in-lined) °\parbox° commands or cells from
°p°-like table columns \lpetc

Paragraphs consist of words separated by spaces.  A word is the textual
representation of a chain of consecutive nodes of type |glyph|, |disc|
or |kern|.  Boxes are processed transparently.  That is, the \pkg\
package (highly imperfectly) tries to recognise as a single word what in
typeset output looks like a single word.  As an example, the \LaTeX\
code

\begin{center}
  \begin{tabular}{c}
\begin{lstlisting}[language={[LaTeX]TeX}]
foo\mbox{'s bar}s
\end{lstlisting}
  \end{tabular}
\end{center}
which is typeset as

\begin{center}
  foo\mbox{'s bar}s
\end{center}
is considered two words \textit{foo's} and \textit{bars}, instead of the
four words \textit{foo}, \textit{'s}, \textit{bar} and~\textit{s}.%
\footnote{This document has been compiled with a custom list of bad
  spellings, which is why the word \emph{foo's} should be highlighted.}

\paragraph{Enabling/disabling} Text output can be toggled on and off
with command \macro{spellingoutput}.  If the argument is |on|, text
output is enabled.  For other arguments, text output is disabled.  Text
output is enabled by default.

\paragraph{File name} \hspace{0pt plus 5em} Text output file name can be
configured via command \macro{spellingoutputname}.  Argument is the new
file name.  Default text output file name is
\descr{jobname}|.spell.txt|.

\paragraph{Line length} In text output, paragraphs can either be put on
a single line or broken into lines of a fixed length.  The behaviour can
be controlled via command \macro{spellingoutputlinelength}.  Argument is
a number.  If the number is less than~1, paragraphs are put on a single
line.  For larger arguments, the number specifies maximum line length.
Note that lines are broken at spaces only.  Words longer than maximum line
length are put on a single line exceeding maximum line length.  Default
line length is~72.
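
As a sketch, the text output settings described in this section could be
combined in the preamble like this.  The file name |myletter.words.txt|
is a made-up example.

\begin{lstlisting}[language={[LaTeX]TeX}]
\spellingoutput{on}
\spellingoutputname{myletter.words.txt}
% A value less than 1 puts every paragraph on a single line.
\spellingoutputlinelength{0}
\end{lstlisting}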


\subsection{Text extraction}
\label{sec:textextraction}

\paragraph{Enabling/disabling} Text extraction can be enabled and
disabled in the document via command \macro{spellingextract}.  If the
argument is |on|, text extraction is enabled.  For other arguments, text
extraction is disabled.  The command should be used in vertical mode,
\lpie, outside paragraphs.  If text extraction is disabled in the
document preamble, an empty text file is written at the end of the
Lua\TeX\ run.  Text extraction is enabled by default.

Note that text extraction and visual feedback are orthogonal features.  That
is, if text extraction is disabled for part of a document, \lpeg, a long
table, words with a known incorrect spelling are still highlighted in
that part.
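
For instance, text extraction might be disabled around a table as
sketched below.  The surrounding text and the table content are invented
for illustration; the blank lines are there so that the commands are
issued in vertical mode.

\begin{lstlisting}[language={[LaTeX]TeX}]
Some running text.

\spellingextract{off}
\begin{tabular}{ll}
  Lin86 & Knu84 \\
\end{tabular}

\spellingextract{on}
More running text.
\end{lstlisting}
%
With these settings, the table cells should be missing from the text
file, while known bad spellings within the table would still be
highlighted.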


\subsection{Code point mapping}
\label{sec:cp-mapping}

The text file written at the end of the Lua\TeX\ run (\lpcf
\autoref{sec:textoutput}) is in the \acr{utf-8} encoding.  Unicode
contains a wealth of code points with a special meaning, such as
ligatures, alternative letters, symbols \lpetc Unfortunately, not all
spell-checker applications are smart enough to correctly interpret all
Unicode code points that may occur in a document.  For that reason, a
code point mapping feature has been implemented that allows for mapping
certain Unicode code points that may appear in a node list to arbitrary
strings in text output.  A typical example is to map ligatures to the
characters corresponding to their constituent letters.  The default
mappings applied can be found in \autoref{tab:cp-mapping}.

\begin{table}
  \begin{minipage}{1.0\linewidth}
    \centering

    \newcommand*{\coltitle}[2]{%
      \normalfont%
      \vbox{
        \hbox{\strut#1}
        \hbox{\strut#2}
      }%
    }

    \begin{tabular}{>{\ttfamily}l>{\fontspec{Linux Libertine
            O}}l>{\ttfamily}l>{\ttfamily}l}
      \normalfont Unicode name & \coltitle{sample}{glyph\footnote{Sample
          glyphs are taken from font \emph{Linux Libertine~O}.}} &
      \coltitle{code}{point} & \coltitle{target}{characters}\\
    \addlinespace
    \toprule
    \addlinespace

    LATIN CAPITAL LIGATURE IJ     & ^^^^0132 & 0x0132 & IJ  \\
    LATIN SMALL LIGATURE IJ       & ^^^^0133 & 0x0133 & ij  \\
    LATIN CAPITAL LIGATURE OE     & ^^^^0152 & 0x0152 & OE  \\
    LATIN SMALL LIGATURE OE       & ^^^^0153 & 0x0153 & oe  \\
    LATIN SMALL LETTER LONG S     & ^^^^017f & 0x017f & s   \\
    \addlinespace
    LATIN SMALL LIGATURE FF       & ^^^^fb00 & 0xfb00 & ff  \\
    LATIN SMALL LIGATURE FI       & ^^^^fb01 & 0xfb01 & fi  \\
    LATIN SMALL LIGATURE FL       & ^^^^fb02 & 0xfb02 & fl  \\
    LATIN SMALL LIGATURE FFI      & ^^^^fb03 & 0xfb03 & ffi \\
    LATIN SMALL LIGATURE FFL      & ^^^^fb04 & 0xfb04 & ffl \\
    LATIN SMALL LIGATURE LONG S T & ^^^^fb05 & 0xfb05 & st  \\
    LATIN SMALL LIGATURE ST       & ^^^^fb06 & 0xfb06 & st  \\
  \end{tabular}

  \caption{Default code point mappings.}
  \label{tab:cp-mapping}

  \end{minipage}
\end{table}

Additional mappings can be declared by command \macro{spellingmapping}.
This command takes two arguments, a number that refers to the Unicode
code point, and a sequence of arbitrary characters that is the mapping
target.  The code point number must be in a format that can be parsed by
Lua.  The characters must be in the \acr{utf-8} encoding.

New mappings only have effect on the following document text.  The
command should therefore be used in the document preamble.  As an
example, the character |A| can be mapped to |Z| and \latinphrase{vice
  versa} with the following code:

\begin{lstlisting}[language={[LaTeX]TeX}]
\spellingmapping{65}{Z}% A => Z
\spellingmapping{90}{A}% Z => A
\end{lstlisting}

Another command \macro{spellingclearallmappings} can be used to remove
all existing code point mappings.
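
As a further sketch, all existing mappings (presumably including the
default ones) can be discarded and a single mapping declared anew.  The
example assumes that hexadecimal notation such as |0xfb01| is accepted,
since Lua can parse such numbers; code point and target are taken from
\autoref{tab:cp-mapping}.

\begin{lstlisting}[language={[LaTeX]TeX}]
\spellingclearallmappings
\spellingmapping{0xfb01}{fi}% fi ligature => f + i
\end{lstlisting}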


\subsection{Tables}
\label{sec:tables}

How do tables fit into the simple text file format that has only
paragraphs and blank lines as described in \autoref{sec:textoutput}?
What is a paragraph with regards to tables?  A whole table?  A row?  A
single cell?

By default, only text from cells in °p°(aragraph)-like columns is put in
its own paragraph, because the corresponding node list branches
contain a |local_par| whatsit node (\lpcf \autoref{sec:textoutput}).
The behaviour can be changed with the \macro{spellingtablepar} command.
This command takes as argument a number.  If the argument is~0, the
behaviour is as described above.  If the argument is~1, a blank line is
inserted before and after every table row (but at most once between
table rows).  If the argument is~2, a blank line is inserted before and
after every table cell.  By default, no blank lines are inserted.
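
A sketch of a possible setting, which should separate the rows of all
following tables by blank lines in text output:

\begin{lstlisting}[language={[LaTeX]TeX}]
\spellingtablepar{1}% blank line before and after every table row
\end{lstlisting}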


\section{LanguageTool support}
\label{sec:languagetool}

Installing spell-checkers and dictionaries can be a difficult task if
there are no pre-built packages available for an architecture.  That's
one reason the \pkg\ package is rather spell-checker agnostic and the
manual doesn't recommend a particular spell-checker application.
Another reason is that there is no single best spell-checker.  The only
recommendation the author makes is not to trust in one spell-checker,
but to use multiple spell-checkers at the same time, with different
dictionaries or, better yet, different checking engines under the hood.

Among the set of options available, LanguageTool%
\footnote{\url{http://www.languagetool.org/}}%
%
, a style and grammar checker that can also check spelling since
version~1.8, deserves some notice for its portability, ease of
installation and active development.  For these reasons, the \pkg\
package provides explicit LanguageTool support.  LanguageTool uses
Hunspell as the spell-checking engine, augmenting results with a
rule-based engine and a morphological analyser (depending on the language).
The \pkg\ package can parse LanguageTool's error reports in the
\acr{xml} format, pick those errors that are spelling related and use
them to highlight bad spellings.%
\footnote{Highlighting style and grammar errors found by LanguageTool
  should be possible, but requires major restructuring of the \pkg\
  package.}


\subsection{Installation}
\label{sec:lt-installation}

Here are some brief installation instructions for the stand-alone
version of LanguageTool (tested with LanguageTool~2.1).  The stand-alone
version contains a \acr{gui} as well as a command-line interface.  For
the \pkg\ package, the latter is needed.

\begin{enumerate}

\item LanguageTool is primarily written in Java.  Make sure a recent
  Java Runtime Environment (\acr{jre}) is installed.

\item\label{enum:run-java} Open a command-line and type

\begin{lstlisting}
java -version
\end{lstlisting}
%
  If you get an error message, find out the full path to the Java
  executable (called |java.exe| on Windows) for later reference.

\item Download the stand-alone version of LanguageTool (should be a
  \acr{zip} archive).

\item Uncompress the downloaded archive to a location of your choice.

\item Open a command-line in the directory containing file
  |languagetool-commandline.jar| and type

\begin{lstlisting}[escapeinside=°°]
°\descr{path to}°/java -jar languagetool-commandline.jar --help
\end{lstlisting}
%
  Prepending the path to the Java executable is optional, depending on
  the result in step~\ref{enum:run-java}.  If you now see a list of
  LanguageTool's command-line options rush by, all is well.

\item For easier access to LanguageTool, create a small batch script and
  put that somewhere into the |PATH|.

  \begin{itemize}

  \item For users of Unix-like systems, the script might look like

\begin{lstlisting}[escapeinside=°°]
#!/bin/sh
°\descr{path to}°/java -jar °\descr{path to}°/languagetool-commandline.jar "$@"
\end{lstlisting}
%
    where \texttt{\descr{path to}} should point to the Java executable
    (optional) and file |languagetool-commandline.jar| (mandatory).  If
    the script is named |lt.sh|, you should be able to run LanguageTool
    on the command shell by typing, \lpeg,

\begin{lstlisting}
sh lt.sh --version
\end{lstlisting}
%
    Don't forget to put the script into the |PATH|!  For other ways of
    making scripts executable, please consult the operating system
    documentation.

  \item For Windows users, the script might look like

\begin{lstlisting}[escapeinside=°°]
@echo off
°\descr{path to}°\java -jar °\descr{path to}°\languagetool-commandline.jar %*
\end{lstlisting}
%
    where \texttt{\descr{path to}} should point to the Java executable
    (optional) and file |languagetool-commandline.jar| (mandatory).  If
    the script is named |lt.bat|, you should be able to run LanguageTool
    on the command-line by typing, \lpeg,

\begin{lstlisting}
lt --version
\end{lstlisting}
%
    Don't forget to put the script into the |PATH|!

  \end{itemize}

\end{enumerate}


\subsection{Usage}
\label{sec:lt-usage}

The results of checking a text file with LanguageTool are written to an
error report, either in a human readable format or in a machine friendly
\acr{xml} format.  The \pkg\ package can only parse the latter format.
When it was said in \autoref{sec:wordlists} that the \pkg\ package reads
files \descr{jobname}|.spell.bad| and \descr{jobname}|.spell.good|, if
they exist, that was not the whole truth.  Additionally, a file
\descr{jobname}|.spell.xml| is read, if it exists.  This file should
contain a LanguageTool error report in the \acr{xml} format.  Additional
LanguageTool \acr{xml} error reports can be loaded via the
\macro{spellingreadLT} command.  Argument is a file name.  Macros
|\spellingreadLT|, |\spellingreadbad| and |\spellingreadgood| can be
used in combination in a \TeX\ file.
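
A preamble combining these macros might look as follows.  This is only a
sketch; all file names are hypothetical.

\begin{lstlisting}[language={[LaTeX]TeX}]
\usepackage{spelling}
\spellingreadLT{myproject.lt.xml}
\spellingreadbad{myproject.blacklist}
\spellingreadgood{myproject.whitelist}
\end{lstlisting}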

To check a text file and create an error report in the \acr{xml} format,
LanguageTool can be called on the command-line like this
\begin{lstlisting}[escapeinside=°°]
lt °\descr{options}° °\descr{input file}° > °\descr{error report}°
\end{lstlisting}
where \texttt{\descr{options}} is a list of options described below,
\texttt{\descr{input file}} is the text file written by the \pkg\
package in the first Lua\TeX\ run and \texttt{\descr{error report}} is
the file containing the error report.  Note how standard output is
redirected to a file via the |>| operator.  By default, LanguageTool
writes error reports to standard output, that is, the command-line.
Redirection is a feature most operating systems provide.

\begin{itemize}

\item Option |-l| determines the language (variant) of the file to
  check.  As an example, language variant US English can be selected via
  |-l en-US|.  The full list of languages supported by LanguageTool can
  be requested via option |--list|.

\item Option |-c| determines the encoding of the input file.  Since the
  text file written by the \pkg\ package is in the \acr{utf-8} encoding,
  this part should be |-c utf-8|.

\item By default, LanguageTool outputs error reports in a human readable
  format.  The \pkg\ package can only parse error reports in the
  \acr{xml} format.  If the |--api| option is present, LanguageTool
  outputs \acr{xml} data.

\item Users who don't want to highlight bad spellings, but prefer to
  study the list of bad spellings themselves, should refer to the |-u|
  option.  But note that with the latter option present, LanguageTool
  doesn't output pure \acr{xml} any more, even if the |--api| option is
  present.  Make sure such error reports aren't read by the \pkg\
  package.

\item If the |--help| option is present, LanguageTool shows more
  information about command-line options.

\end{itemize}

As an example, to compile a \LaTeX\ file |myletter.tex| written in
French that uses the \pkg\ package with standard settings to highlight
bad spellings and to use LanguageTool as a spell-checker, the following
commands should be typed on the command-line:

\begin{lstlisting}
lualatex myletter
lt --api -c utf-8 -l fr myletter.spell.txt > myletter.spell.xml
lualatex myletter
\end{lstlisting}


\section{Bugs}
\label{sec:bugs}

Note, this package is in a very early state.  Expect bugs!  Package
development is hosted at
\href{http://github.com/sh2d/spelling/}{\bfseries GitHub}.  The full
list of known bugs and feature requests can be found in the
\href{http://github.com/sh2d/spelling/issues/}{\bfseries issue tracker}.
New bugs should be reported there.

The most user-visible issues are listed below:

\begin{itemize}

\item There's no support for the Plain \TeX\ or Con\TeX t formats other
  than the \acr{api} of the package's Lua modules, yet
  (\href{https://github.com/sh2d/spelling/issues/1}{issue~1}).

\item Macros provided by the \LaTeX\ package have very long names.  A
  \emph{key-value} package option interface would be much more
  user-friendly
  (\href{https://github.com/sh2d/spelling/issues/2}{issue~2}).

\item There are a couple of issues with text extraction and highlighting
  incorrect spellings:

  \begin{itemize}

  \item Text in head and foot lines is neither extracted nor highlighted
    (\href{https://github.com/sh2d/spelling/issues/7}{issue~7}).

  \item The first word starting right after an |hlist|, \lpeg, the first
    word within an |\mbox|, is never highlighted.  It is extracted and
    written to the text file, though.  This might affect acronyms, names
    \lpetc (\href{https://github.com/sh2d/spelling/issues/6}{issue~6}).

  \item Bad spellings that are hyphenated at a page break are not
    highlighted
    (\href{https://github.com/sh2d/spelling/issues/10}{issue~10}).

  \end{itemize}


\end{itemize}

Patches welcome!

\bigskip
\emph{Happy \TeX ing!}


\end{document}


%%% Local Variables: 
%%% mode: latex
%%% TeX-PDF-mode: t
%%% TeX-master: t
%%% coding: utf-8
%%% End: