The use of English plurals in synthetic sentences is a case in point. In computing applications, for example, it is quite common to encounter error messages which jar because they do not correctly inflect for grammatical number:
    Compilation aborted: 1 errors were detected.

Individually, such inelegances are easily overcome (or, more accurately, the inelegance may be transferred from the interface to the code):
    print "Compilation aborted: $count ", ($count==1 ? "error was" : "errors were"), " detected.\n";

Unfortunately, in attempting to generate more complex text, some less tractable problems arise, notably the diversity of plural forms available in English. Consider the difficulty faced by a text generation system (machine or human) in forming plural versions of the following:
    Her criterion differs from mine.
    The Major General met the Governor General.
    Analysis of this aquarium's fish failed to determine its genus.
    That phalanx suffered a trauma.

This paper presents an algorithmic approach that provides (nearly) automatic plural inflections for such examples.
One might argue that this approach is economically rational, in that the extra cost and complexity involved in identifying and coding around that one special case outweighs the benefit of correctly handling it. This, of course, is the perennial excuse for ugly and ungainly interfaces, and quite unassailable in the estimation of the utilitarian mind.
    Number of errors: 1
    Number of errors: 10

A common (if somewhat clumsy) alternative is to bet both ways and structure the sentence so that it will read correctly in either grammatical number:
    1 error(s) found.
    10 error(s) found.

Evasion techniques such as these solve the problem of "canned" synthetic text, but do so either by craving the readers' indulgence (of threadbare English) or their complicity (in ignoring the inappropriate sense of a schizophrenic construction). However, in general text generation, such terse and artificial structures may be inappropriate or simply unachievable.
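    # Replace each "(singular/plural)" alternation in $word
    # with the alternative that agrees with $count.
    sub select_pl($$) {
        my ($word, $count) = @_;
        $word =~ s#\(([^)/]*)/([^)]*)\)# $count==1 ? $1 : $2 #ge;
        return $word;
    }

which allows the programmer to code synthetic text generation as follows: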
    print select_pl("$count error(/s) (was/were) found", $count);

This approach neatly solves the problem of correctly inflecting "canned" text for number, but is not easily adapted to handle the more general problems encountered when the text is not pre-determined.
More complex algorithms that cope with specific suffixes (-ss -> -sses, -y -> -ies, etc.) can be specified, but pure suffix-based approaches will still be prone to exceptions and meta-exceptions. For example: -y becomes -ies, except after a vowel (when it becomes -ys), except for soliloquy (which uses -ies).
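The layering of rule, exception, and meta-exception can be made concrete with a small sketch. The subroutine name plural_y and its one-word exception list are illustrative assumptions rather than part of the algorithms presented in this paper, and other words ending in -quy are deliberately left unhandled.

```perl
use strict;
use warnings;

# Hypothetical meta-exception list: "vowel + y" words that still take -ies.
my %takes_ies = map { $_ => 1 } qw(soliloquy);

# Pluralize a word ending in -y; returns undef if the word does not end in -y.
sub plural_y {
    my ($word) = @_;
    return undef unless $word =~ /y$/;
    if ($takes_ies{lc $word}) {                    # meta-exception
        (my $pl = $word) =~ s/y$/ies/;
        return $pl;
    }
    return $word . 's' if $word =~ /[aeiou]y$/;    # exception: vowel + y -> -ys
    (my $pl = $word) =~ s/y$/ies/;                 # general rule: -y -> -ies
    return $pl;
}

print join(', ', map { plural_y($_) } qw(ferry day soliloquy)), "\n";
# prints: ferries, days, soliloquies
```

Each further exception discovered would require another entry in the list, which is precisely why a purely suffix-based scheme does not scale.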
A usable pluralization algorithm must therefore cope with three categories of plural formation: universal defaults, general suffix-based rules, and specific exceptional cases. The following section examines each of these categories in more detail.
The rules themselves are well-known and need no elaboration. By default:
Certain types of adjectives also inflect in this way. For example, possessive adjectives that end in -'s or -' in the singular are made plural by forming the plural of the root word and appending an apostrophe (unless the root's plural does not itself end in -s, in which case -'s is appended). Hence cat's becomes cats', axis' becomes axes', whilst child's becomes children's.
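The possessive rule just described might be sketched as follows. The noun pluralizer here is only a stand-in (a three-entry lookup table plus the default -s rule), and both subroutine names are hypothetical.

```perl
use strict;
use warnings;

# Stand-in noun pluralizer: a tiny lookup table, then the default -s rule.
my %noun_plural = (child => 'children', axis => 'axes');
sub noun_plural {
    my ($noun) = @_;
    return $noun_plural{$noun} // $noun . 's';
}

# Pluralize a possessive adjective ending in -'s or -'.
sub possessive_plural {
    my ($adj) = @_;
    (my $owner = $adj) =~ s/'s?$// or return $adj;   # not a possessive: unchanged
    my $owners = noun_plural($owner);
    return $owners =~ /s$/ ? "$owners'" : "$owners's";
}

print join(', ', map { possessive_plural($_) } "cat's", "axis'", "child's"), "\n";
# prints: cats', axes', children's
```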
Other suffix categories arise because words of foreign origin (most commonly Ancient Greek or Latin) have retained a non-anglicized plural inflection. Hence criterion becomes criteria, nucleus becomes nuclei, and matrix becomes matrices. Dealing with such categories is complicated by the fact that many other imports have been wholly or partially anglicized. Hence although criterion always forms its plural with -a, ganglion may take either -s or -a (ganglions or ganglia), whilst bastion is always inflected with -s. Occasionally the anglicized and "classical" plural forms of a word may both be in common use, but with distinct meanings. Thus a copy-editor might remove appendices, whereas a surgeon would remove appendixes.
The correct inflection of words derived from Latin can be particularly complex, since the same suffix may form different Latinate plurals depending on the declension (or sometimes the part of speech) of the original. Thus the plural of stimulus (second declension) is stimuli, and that of genus (third declension) is genera. Status (fourth declension) is traditionally unchanged in the plural, whilst ignoramus (a first person plural Latin verb) has been wholly anglicized and becomes ignoramuses.
The only practical way to deal with such complexities in an algorithm
is to categorize words by both suffix and inflection, and to allow
for both anglicized and classical variants. Table 1 illustrates such categories.
Singular suffix | Anglicized plural | Classical plural | Example (see Appendix A for comprehensive lists of words in each category) |
-a | (none) | -ae | alga -> algae |
-a | -as | -ae | nova -> novas/novae |
-a | -as | -ata | dogma -> dogmas/dogmata |
-an | -en | (none) | woman -> women |
-ch | -ches | (none) | church -> churches |
-eau | -eaus | -eaux | chateau -> chateaus/chateaux |
-en | -ens | -ina | foramen -> foramens/foramina |
-ex | (none) | -ices | codex -> codices |
-ex | -exes | -ices | index -> indexes/indices |
-f(e) | -ves | (none) | wolf -> wolves, life -> lives |
-ieu | -ieus | -ieux | milieu -> milieus/milieux |
-is | (none) | -es | basis -> bases |
-is | -ises | -ides | iris -> irises/irides |
-ix | -ixes | -ices | matrix -> matrixes/matrices |
-nx | -nxes | -nges | phalanx -> phalanxes/phalanges |
-o | -oes | (none) | potato -> potatoes |
-o | -os | (none) | photo -> photos |
-o | (none) | -i | graffito -> graffiti |
-o | -os | -i | tempo -> tempos/tempi |
-on | (none) | -a | aphelion -> aphelia |
-on | -ons | -a | ganglion -> ganglions/ganglia |
-oo- | -ee- | (none) | foot -> feet, tooth -> teeth |
-oof | -oofs | -ooves | hoof -> hoofs/hooves |
-s | -s | (none) | series -> series |
-s | -ses | (none) | atlas -> atlases |
-sh | -shes | (none) | wish -> wishes |
-um | (none) | -a | bacterium -> bacteria |
-um | -ums | -a | medium -> mediums/media |
-us | (none) | -era | genus -> genera |
-us | (none) | -i | stimulus -> stimuli |
-us | -uses | -era | opus -> opuses/opera |
-us | -uses | -i | radius -> radiuses/radii |
-us | -uses | -ora | corpus -> corpuses/corpora |
-us | -uses | -us | status -> statuses/status |
-x | -xes | (none) | box -> boxes |
-y | -ies | (none) | ferry -> ferries |
-zoon | (none) | -zoa | protozoon -> protozoa |
(none) | -s | -im | cherub -> cherubs/cherubim |
This table is surprisingly comprehensive, though certainly not exhaustive. Indeed, specific dialects of English may define much larger sets of irregular plurals and may not recognize some of the entries in Table 2. Hence it is important that any algorithmic approach to pluralization be both extensible and adjustable, so that its output may be easily expanded or trimmed for a specific audience.
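One way to keep such categories extensible and adjustable is to hold them as data rather than code. The following is a minimal sketch under that assumption: the record layout, the subroutine name category_plural, and the abbreviated word lists are all illustrative.

```perl
use strict;
use warnings;

# Each category: singular suffix, anglicized plural suffix, classical plural
# suffix (undef where the table says "(none)"), and its member words.
# The word lists are abbreviated for illustration.
my @categories = (
    { sing => 'a',  angl => undef,  class => 'ae',  words => [qw(alga vertebra)] },
    { sing => 'a',  angl => 'as',   class => 'ata', words => [qw(dogma stigma)]  },
    { sing => 'us', angl => 'uses', class => 'i',   words => [qw(radius focus)]  },
    { sing => 'us', angl => 'uses', class => 'era', words => [qw(opus)]          },
);

# Return the plural of a categorized word, or undef if it is in no category.
sub category_plural {
    my ($word, $classical) = @_;
    for my $cat (@categories) {
        next unless grep { $_ eq lc $word } @{ $cat->{words} };
        # Prefer the requested variant, falling back to whichever exists.
        my $suffix = $classical ? ($cat->{class} // $cat->{angl})
                                : ($cat->{angl}  // $cat->{class});
        (my $stem = $word) =~ s/\Q$cat->{sing}\E$//;
        return $stem . $suffix;
    }
    return undef;
}

print category_plural('dogma', 0), ' ', category_plural('dogma', 1), "\n";
# prints: dogmas dogmata
```

Expanding or trimming the output for a particular audience then amounts to editing the word lists, with no change to the lookup code.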
The algorithms are based on the rules of English inflection described in the Oxford English Dictionary [5] (OED), Fowler's Modern English Usage [6], and A Practical English Grammar [1]. Where these sources disagree, the OED is taken to be definitive.
    VAX -> VAXen

To extend the power of this mechanism, each singular form can be specified as a (case-insensitive) regular expression, rather than a literal word to be matched. This allows the user to specify families of common inflections. For example, one might specify that all nouns ending in -x will be inflected to -xen (oxen, boxen, suffixen, etc.), regardless of the normal rules of English:
    (.*)x -> $1xen

Furthermore, if the user-defined table preserves a suitable ordering (perhaps "first-defined, last-tried"), then exceptions to such user-defined generic rules can also be specified. For example:
    (.*)x -> $1xen
    fox   -> foxes

As a final generalization, the plural form allows two variants (an anglicized plural and a "classical" alternative), separated by some delimiter - say "|". In such cases, the plural selected would depend on whether classical or anglicized plurals had been requested. For example, the previous generic rule might be rewritten to cater for "classical" usages:
    (.*)x -> $1xes | $1xen
    fox   -> foxes
    ox    -> oxen

Note that, where only one plural form is specified, it is used in both "anglicized" and "classical" modes.
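A user-defined table of this kind might be sketched as follows. The names define_rule and user_plural are hypothetical, and the "first-defined, last-tried" ordering is obtained simply by adding each new rule to the front of the list.

```perl
use strict;
use warnings;

my @user_rules;    # the most recently defined rule is tried first

# Register a rule: a singular pattern and an "anglicized|classical" plural.
# The plural may refer to capture groups of the pattern as $1, $2, ...
sub define_rule {
    my ($pattern, $plural) = @_;
    my ($angl, $class) = split /\s*\|\s*/, $plural;
    unshift @user_rules, [ qr/^(?:$pattern)$/i, $angl, $class // $angl ];
}

# Return the user-defined plural of $word, or undef if no rule matches.
sub user_plural {
    my ($word, $classical) = @_;
    for my $rule (@user_rules) {
        my ($re, $angl, $class) = @$rule;
        my @caps = $word =~ $re or next;
        my $plural = $classical ? $class : $angl;
        $plural =~ s{\$(\d+)}{ $caps[$1 - 1] // '' }ge;   # fill in $1, $2, ...
        return $plural;
    }
    return undef;
}

define_rule('(.*)x', '$1xes | $1xen');
define_rule('fox',   'foxes');
define_rule('ox',    'oxen');

print join(' ', map { user_plural($_, 1) } qw(box fox ox)), "\n";
# prints: boxen foxes oxen
```

Because the specific rules for fox and ox were defined after the generic one, they are consulted first and so act as exceptions to it.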
    1.  if the word matches a user-defined noun, return the user-specified plural form

    2.  if suffix(-fish) or suffix(-ois) or suffix(-sheep) or suffix(-deer) or suffix(-pox) or suffix(-[A-Z].*ese) or suffix(-itis) or category(-,-), return the original noun

    3.  if the word is a pronoun, return the specified plural of the pronoun
        if the word is of the form: "<preposition> <pronoun>", return "<preposition> <specified plural of pronoun>"

    4.  if the word has an irregular plural, return the specified plural

    5.  if suffix(-man), return inflection(-man,-men)
        if suffix(-[lm]ouse), return inflection(-ouse,-ice)
        if suffix(-tooth), return inflection(-tooth,-teeth)
        if suffix(-goose), return inflection(-goose,-geese)
        if suffix(-foot), return inflection(-foot,-feet)
        if suffix(-zoon), return inflection(-zoon,-zoa)
        if suffix(-[csx]is), return inflection(-is,-es)

    6.  if category(-ex,-ices), return inflection(-ex,-ices)
        if category(-um,-a), return inflection(-um,-a)
        if category(-on,-a), return inflection(-on,-a)
        if category(-a,-ae), return inflection(-a,-ae)

    7.  if in classical mode,
            if suffix(-trix), return inflection(-trix,-trices)
            if suffix(-eau), return inflection(-eau,-eaux)
            if suffix(-ieu), return inflection(-ieu,-ieux)
            if suffix(-..[iay]nx), return inflection(-nx,-nges)
            if category(-en,-ina), return inflection(-en,-ina)
            if category(-a,-ata), return inflection(-a,-ata)
            if category(-is,-ides), return inflection(-is,-ides)
            if category(-us,-i), return inflection(-us,-i)
            if category(-us,-us), return the original noun
            if category(-o,-i), return inflection(-o,-i)
            if category(-,-i), return inflection(-,-i)
            if category(-,-im), return inflection(-,-im)

    8.  if suffix(-[cs]h), return inflection(-h,-hes)
        if suffix(-ss), return inflection(-ss,-sses)

    9.  if suffix(-[aeo]lf) or suffix(-[^d]eaf) or suffix(-arf), return inflection(-f,-ves)
        if suffix(-[nlw]ife), return inflection(-fe,-ves)

    10. if suffix(-[aeiou]y), return inflection(-y,-ys)
        if suffix(-[A-Z].*y), return inflection(-y,-ys)
        if suffix(-y), return inflection(-y,-ies)

    11. if category(-o,-os) or suffix(-[aeiou]o), return inflection(-o,-os)
        if suffix(-o), return inflection(-o,-oes)

    12. if category(-general,-generals), return inflection(-l,-ls)
        if the word is of the form: "<word> general", return "<plural of word> general"
        if the word is of the form: "<word> <preposition> <words>", return "<plural of word> <preposition> <words>"

    13. otherwise, return inflection(-,-s)
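The purely suffix-based rules of the noun algorithm (those for -ch, -sh, -ss, -f, -fe, -y, and -o, together with the final default) lend themselves to a table-driven sketch. It assumes that suffix(-x) tests whether the word ends in the pattern x and that inflection(-x,-y) replaces a trailing x with y. The table layout and the name suffix_plural are illustrative, and every category lookup, irregular form, and user-defined rule is omitted.

```perl
use strict;
use warnings;

# Ordered (pattern, strip, append) triples; the first match wins.
my @suffix_rules = (
    [ qr/[cs]h$/,                   '',   'es'  ],   # church -> churches
    [ qr/ss$/,                      '',   'es'  ],   # class  -> classes
    [ qr/(?:[aeo]lf|[^d]eaf|arf)$/, 'f',  'ves' ],   # wolf   -> wolves
    [ qr/[nlw]ife$/,                'fe', 'ves' ],   # life   -> lives
    [ qr/[aeiou]y$/,                '',   's'   ],   # day    -> days
    [ qr/^[A-Z].*y$/,               '',   's'   ],   # Mary   -> Marys
    [ qr/y$/,                       'y',  'ies' ],   # ferry  -> ferries
    [ qr/[aeiou]o$/,                '',   's'   ],   # radio  -> radios
    [ qr/o$/,                       'o',  'oes' ],   # potato -> potatoes
);

sub suffix_plural {
    my ($word) = @_;
    for my $rule (@suffix_rules) {
        my ($re, $strip, $append) = @$rule;
        next unless $word =~ $re;
        (my $stem = $word) =~ s/\Q$strip\E$//;
        return $stem . $append;
    }
    return $word . 's';    # default: add -s
}

print join(' ', map { suffix_plural($_) } qw(church wolf life day ferry potato cat)), "\n";
# prints: churches wolves lives days ferries potatoes cats
```

Without the category(-o,-os) lookup this sketch would wrongly produce photoes, which illustrates why the algorithm consults its word categories before falling back on bare suffixes.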
    if the word matches a user-defined verb, return the user-specified plural form
    if the word has the form "<auxiliary> <words>" and <auxiliary> belongs to the category of irregular verbs, return "<specified plural of auxiliary> <words>"
    if the word belongs to the category of irregular verbs, return the specified plural form
    if suffix(-[cs]hes), return inflection(-hes,-h)
    if suffix(-[sx]es), return inflection(-es,-)
    if suffix(-zzes), return inflection(-es,-)
    if suffix(-ies), return inflection(-ies,-y)
    if suffix(-oes), return inflection(-oes,-o)
    if suffix(-[^s]s), return inflection(-s,-)
    if the word is in the ambiguous category, return the specified plural form
    otherwise, return the verb uninflected
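The verb rules above can be sketched in the same spirit. The four-entry table of irregular verbs and the name verb_plural are assumptions made for illustration; user-defined verbs and the ambiguous category are left out.

```perl
use strict;
use warnings;

# Abbreviated stand-in for the category of irregular verbs.
my %irregular_verb = (am => 'are', is => 'are', was => 'were', has => 'have');

sub verb_plural {
    my ($verb) = @_;
    # Compound form: inflect only a leading irregular auxiliary.
    if ($verb =~ /^(\S+)(\s+.+)$/ && exists $irregular_verb{lc $1}) {
        return $irregular_verb{lc $1} . $2;
    }
    return $irregular_verb{lc $verb} if exists $irregular_verb{lc $verb};
    for ($verb) {
        return $1    if /^(.*[cs]h)es$/;   # watches -> watch
        return $1    if /^(.*[sx])es$/;    # fixes   -> fix
        return $1    if /^(.*zz)es$/;      # buzzes  -> buzz
        return "$1y" if /^(.*)ies$/;       # carries -> carry
        return "$1o" if /^(.*)oes$/;       # vetoes  -> veto
        return $1    if /^(.*[^s])s$/;     # runs    -> run
    }
    return $verb;                          # otherwise uninflected
}

print join(', ', map { verb_plural($_) } 'watches', 'carries', 'runs', 'has been', 'ran'), "\n";
# prints: watch, carry, run, have been, ran
```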
    if the word matches a user-defined adjective, return the user-specified plural form
    if the word is "a" or "an", return "some"
    if the word is "this", return "these"
    if the word is "that", return "those"
    if the word is a personal possessive, return the specified plural form
    if suffix(-'s) or suffix(-'),
        if suffix(-'), let the noun <owner> be inflection(-',-)
        otherwise, let the noun <owner> be inflection(-'s,-)
        let the noun <owners> be the noun plural of <owner>
        if <owners> ends in -s, return "<owners>'"
        otherwise, return "<owners>'s"
    otherwise, return the adjective uninflected
    try step 1 of Algorithm 3
    try step 1 of Algorithm 2
    try step 1 of Algorithm 1
    try steps 2 through 4 of Algorithm 3
    try steps 2 through 5 of Algorithm 2
    if word is a noun ending in -s, try steps 2 through 13 of Algorithm 1
    try steps 4 and 5 of Algorithm 2
    try steps 2 through 13 of Algorithm 1
Note that this sequence represents a particular compromise in the face of inherently ambiguous input. Other compromises (which might perhaps more heavily favour the verb sense of a word) may also be defined, by selecting different subsets of the three algorithms or by changing the order in which the various subsets are used.
    It ate it -> They ate them

As a consequence of this ambiguity, the noun and unified algorithms cannot guarantee to inflect it correctly without additional context. This could be provided by an extra parameter (one which specifies the required case), or by simply defaulting to the nominative (it -> they) and accepting a small number of incorrect inflections.
Of course, where the necessary context is already provided (for example, when forming the plural of a dative or ablative: to it, from it, with it, etc.), the noun algorithm detects this (in step 3) and correctly returns the accusative plural form: to them, from them, with them, etc.
However, if a verb were to take common singular forms but different plurals (for example, the atrophying British usage: I will -> we shall, you will -> you will), then the algorithms presented above would be unable to determine the correct inflection without additional context (such as an extra "person" parameter).
The author is not currently aware of any other verbs in English which present this problem, but is not willing to assume ipso facto that none exist.
    I put the mice next to the cheese.
    I put the mouses next to the keyboards.
    Three basses were stolen from the band's trailer.
    Three bass were stolen from the band's fishpond.
    Several thoughts about leaving crossed my mind.
    Several thought about leaving across my lawn.

The algorithms presented above handle such words in two ways:
Finally, if the choice of a particular "usual inflection" is considered inappropriate for a particular application, it can always be changed by specifying an overriding user-defined inflection.
    1.  if <word1> equals <word2>, return true

    2.  using anglicized plurals...
            if the appropriate plural of <word1> equals <word2>, return true
            if the appropriate plural of <word2> equals <word1>, return true

    3.  using classical plurals...
            if the appropriate plural of <word1> equals <word2>, return true
            if the appropriate plural of <word2> equals <word1>, return true

    4.  if the words are nouns,
            for each noun category <c>...
                let <ss> be the singular suffix for category <c>
                let <sa> be the anglicized plural suffix for <c>
                let <sc> be the classical plural suffix for <c>
                if <sa> differs from <sc>,
                    let <stem1> be stem(<sa>) of <word1>
                    if <word2> equals inflect(-,<sc>) of <stem1>, return true
                    let <stem2> be stem(<sa>) of <word2>
                    if <word1> equals inflect(-,<sc>) of <stem2>, return true

    5.  if the words are adjectives,
            let <word1a> be stem(-'s) or stem(-') of <word1>
            let <word2a> be stem(-'s) or stem(-') of <word2>
            let <word1b> be stem(-s') of <word1>
            let <word2b> be stem(-s') of <word2>
            for each defined <w1> in (<word1a>, <word1b>)...
                for each defined <w2> in (<word2a>, <word2b>)...
                    apply step 4 to <w1> and <w2>
                    if step 4 returns true, return true

    6.  otherwise, return false
Note that, because steps 2 and 3 do not specify which pluralizing algorithm is used, Algorithm 5 is generic and may be readily adapted to deal with only nouns, verbs, or adjectives, or with all three at once. Such adaptations merely involve selecting the appropriate algorithm (Algorithms 1 through 4 respectively) with which to generate the "appropriate plural" forms. Where the algorithm is adapted to a particular part of speech, one or both of steps 4 and 5 may be omitted entirely, if inappropriate.
The exportable subroutines of Lingua::EN::Inflect provide plural inflections for English words. Plural forms of most nouns, many verbs, and some adjectives are provided. Where appropriate, "classical" variants are also provided. The module also offers pronunciation-based selection of indefinite articles (a and an), but discussion of those facilities is beyond the scope of this paper.
The PL_...() subroutines also take an optional second argument, which indicates the desired grammatical number of the word. If the "number" argument is supplied and is not 1 (or "one" or "a"), the plural form of the word is returned. If the "number" argument does indicate singularity, the (uninflected) word itself is returned. If the number argument is omitted, the plural form is returned unconditionally.
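The effect of the optional "number" argument can be illustrated by wrapping an arbitrary pluralizer. This is only a sketch of the behaviour described above, not the module's implementation: the names is_singular and with_count are hypothetical, and the case-insensitive matching of "one" and "a" is an assumption.

```perl
use strict;
use warnings;

# True if $count denotes exactly one (1, "one" or "a").
sub is_singular {
    my ($count) = @_;
    return $count =~ /^\s*(?:1|one|a)\s*$/i ? 1 : 0;
}

# Wrap any pluralizer so that it honours an optional count argument.
sub with_count {
    my ($pluralize) = @_;
    return sub {
        my ($word, $count) = @_;
        return $word if defined $count && is_singular($count);
        return $pluralize->($word);    # count omitted or not singular
    };
}

my $pl = with_count(sub { $_[0] . 's' });    # stand-in for a real pluralizer
print join(' ', $pl->('error', 1), $pl->('error', 3), $pl->('error')), "\n";
# prints: error errors errors
```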
The various subroutines are:
In classical mode, the non-anglicized plural form of a word (if one
exists) is preferred.
Hence, whereas dogma is normally inflected to dogmas,
if classical mode is active it becomes dogmata.
    def_noun 'cow'    => 'cows|kine';
    def_noun '(.+i)o' => '$1os|$1i';

    def_verb 'am'      => 'are',
             'ar(e|t)' => 'are',
             'is'      => 'are';

    def_adj 'dat' => 'dose';
    def_adj 'red' => 'red|gules';
    print "I saw $N ", PL_N($animal,$N), "\n";

Since the usual purpose of producing a plural is to make it agree with an explicit preceding count, Lingua::EN::Inflect provides an exportable subroutine (NO($;$)) which, given a word and an optional count, returns the count followed by the correctly inflected word. Hence the previous example can be rewritten:
    print "I saw ", NO($animal,$N), "\n";

In addition, if the count is zero (or some other expression which implies zero, such as "zero", "nil", etc.), the count is replaced by the string "no". Hence if $N had the value zero the previous example would print the somewhat more elegant:
    I saw no ducks

rather than:
    I saw 0 ducks

Note that the name of the subroutine is thus a pun: the subroutine returns either a No. (a number) or a "no", in front of the inflected word.
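The behaviour of NO() described above might be approximated as follows. The name count_phrase, its explicit pluralizer argument, and the exact set of strings treated as zero are assumptions of this sketch.

```perl
use strict;
use warnings;

# Return "<count> <word>", pluralizing unless the count is singular and
# replacing a zero count with "no". $pluralize is any noun pluralizer.
sub count_phrase {
    my ($word, $count, $pluralize) = @_;
    my $is_zero = $count =~ /^\s*(?:0|no|zero|nil)\s*$/i;
    my $is_one  = $count =~ /^\s*(?:1|one|a)\s*$/i;
    my $shown   = $is_zero ? 'no' : $count;
    return "$shown " . ($is_one ? $word : $pluralize->($word));
}

my $add_s = sub { $_[0] . 's' };    # stand-in pluralizer
print "I saw ", count_phrase('duck', $_, $add_s), "\n" for 0, 1, 2;
# prints: I saw no ducks / I saw 1 duck / I saw 2 ducks (one per line)
```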
    print PL_ADJ("This",$errors), PL_N(" error",$errors), PL_V(" was",$errors), " fatal.\n";

Lingua::EN::Inflect therefore provides an exportable subroutine (NUM($;$)) which may be used to set a persistent "default number" value. If such a value is set, it is subsequently used whenever an optional second "number" argument of a PL_...() subroutine is omitted. The default value thus set can subsequently be removed by calling NUM() with no arguments:
    NUM($errors);    # SET DEFAULT NUMBER
    print PL_ADJ("This"), PL_N(" error"), PL_V(" was"), " fatal.\n";
    NUM();           # CLEAR DEFAULT NUMBER

By default, NUM() returns its first argument, so that it may also be "inlined" in contexts like:
    print NUM($errors), PL_N(" error"), PL_V(" was"), " detected.\n";
    print PL_ADJ("This"), PL_N(" error"), PL_V(" was"), " fatal.\n" if $severity > 1;
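A persistent default of this kind needs nothing more than a variable shared between the setter and the pluralizers. In the sketch below, set_number and plural_noun are hypothetical stand-ins for NUM() and PL_N(), and the pluralizer itself is reduced to the default -s rule.

```perl
use strict;
use warnings;

my $default_number;    # undef means "no default set"

# Set (or, with no arguments, clear) the default number; returns its argument.
sub set_number {
    my ($count) = @_;
    $default_number = $count;
    return defined $count ? $count : '';
}

# Stand-in pluralizer that falls back on the default number.
sub plural_noun {
    my ($word, $count) = @_;
    $count = $default_number unless defined $count;
    return $word if defined $count && $count =~ /^(?:1|one|a)$/i;
    return $word . 's';
}

print set_number(1), plural_noun(' error'), "\n";    # prints "1 error"
print set_number(3), plural_noun(' error'), "\n";    # prints "3 errors"
set_number();                                        # clear the default
print plural_noun('error'), "\n";                    # prints "errors"
```

The inlined form relies on the arguments of print being evaluated from left to right, so that the default is in place before the pluralizer runs.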
To ameliorate this problem, Lingua::EN::Inflect provides an exportable string-interpolating subroutine (inflect($)) that recognizes calls to the various inflection subroutines within a string and interpolates them appropriately. Using inflect(), plurals can be interpolated directly into a string as follows:
    NUM($errors);
    print inflect "NO(error) PL_V(was) detected.\n";
    print inflect "PL_ADJ(This) PL_N(error) PL_V(was) fatal.\n" if $errors && $severity > 1;
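String interpolation of this sort can be approximated with a single substitution. The sketch below is not the module's inflect(): the name interpolate is hypothetical, the count is passed explicitly rather than taken from NUM(), only three call names are recognized, and the inflectors are small stand-in tables.

```perl
use strict;
use warnings;

my %verb = (was => 'were', is => 'are');
my %adj  = (This => 'These', this => 'these', That => 'Those', that => 'those');

# Stand-in inflectors, keyed by the name used inside the string.
my %inflector = (
    PL_N   => sub { $_[0] . 's' },
    PL_V   => sub { $verb{$_[0]} // $_[0] },
    PL_ADJ => sub { $adj{$_[0]}  // $_[0] },
);

# Replace each NAME(word) in $text with the form that agrees with $count.
sub interpolate {
    my ($text, $count) = @_;
    $text =~ s{\b(PL_N|PL_V|PL_ADJ)\(([^)]*)\)}
              { $count == 1 ? $2 : $inflector{$1}->($2) }ge;
    return $text;
}

print interpolate("PL_ADJ(This) PL_N(error) PL_V(was) fatal.\n", $_) for 1, 2;
# prints: This error was fatal.
#         These errors were fatal.
```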
The actual value returned by the various PL_eq_...() subroutines encodes which of the three equality rules succeeded: "eq" is returned if the strings were identical, "s:p" if the strings were singular and plural respectively, "p:s" for plural and singular, and "p:p" for two distinct plurals. Inequality is indicated by returning an empty string.
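A comparison that returns these codes might be sketched along the lines of Algorithm 5. Everything named here is an assumption for illustration: the subroutine number_insensitive_eq, the two stand-in pluralizers, and the three-entry table of anglicized and classical suffix pairs. The adjective step is omitted.

```perl
use strict;
use warnings;

# Abbreviated (anglicized suffix, classical suffix) pairs for noun categories.
my @variant_suffixes = ( [ 'as', 'ata' ], [ 'uses', 'i' ], [ 'exes', 'ices' ] );

# Compare two words modulo number. @pluralizers holds the pluralizing
# subroutines to try (for example, one anglicized and one classical).
sub number_insensitive_eq {
    my ($w1, $w2, @pluralizers) = @_;
    return 'eq' if $w1 eq $w2;
    for my $pl (@pluralizers) {
        return 's:p' if $pl->($w1) eq $w2;
        return 'p:s' if $pl->($w2) eq $w1;
    }
    # Two distinct plurals: swap the anglicized suffix for the classical one.
    for my $pair (@variant_suffixes) {
        my ($angl, $class) = @$pair;
        for my $try ([ $w1, $w2 ], [ $w2, $w1 ]) {
            my ($x, $y) = @$try;
            (my $stem = $x) =~ s/\Q$angl\E$// or next;
            return 'p:p' if $stem . $class eq $y;
        }
    }
    return '';
}

my %angl_plural  = (index => 'indexes', dogma => 'dogmas');
my %class_plural = (index => 'indices', dogma => 'dogmata');
my $anglicized = sub { $angl_plural{$_[0]}  // $_[0] . 's' };
my $classical  = sub { $class_plural{$_[0]} // $_[0] . 's' };

print join(',', map { number_insensitive_eq(@$_, $anglicized, $classical) }
                [qw(index indices)], [qw(indexes index)], [qw(indexes indices)]), "\n";
# prints: s:p,p:s,p:p
```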
It is possible to cater for differences in major usage patterns (for example, modern and classical inflections) and for local differences in dialect (via user-defined inflections). It is also possible to make use of the pluralization algorithms to efficiently detect pairs of words which differ only in grammatical number.
A free implementation of these algorithms is available, and provides additional features such as conditional pluralization (depending on a numerical parameter), setting of default number values, and interpolation of the various subroutines into strings.
Singular form | Anglicized plural | Classical plural |
beef | beefs | beeves |
brother | brothers | brethren |
child | (none) | children |
cow | cows | kine |
ephemeris | (none) | ephemerides |
genie | genies | genii |
money | moneys | monies |
mongoose | mongooses | (none) |
mythos | (none) | mythoi |
octopus | octopuses | octopodes |
ox | (none) | oxen |
soliloquy | soliloquies | (none) |
trilby | trilbys | (none) |
bison | flounder | pliers |
bream | gallows | proceedings |
breeches | graffiti | rabies |
britches | headquarters | salmon |
carp | herpes | scissors |
chassis | high-jinks | sea-bass |
clippers | homework | series |
cod | innings | shears |
contretemps | jackanapes | species |
corps | mackerel | swine |
debris | measles | trout |
diabetes | mews | tuna |
djinn | mumps | whiting |
eland | news | wildebeest |
elk | pincers |
acropolis | chaos | lens |
aegis | cosmos | mantis |
alias | dais | marquis |
asbestos | digitalis | metropolis |
atlas | epidermis | pathos |
bathos | ethos | pelvis |
bias | gas | polis |
caddis | glottis | rhinoceros |
cannabis | glottis | sassafras |
canvas | ibis | trellis |
act | fight | run |
bend | fire | saw |
bent | like | sink |
blame | look | sleep |
copy | make | thought |
cut | might | view |
drink | reach | will |
1st Person | 2nd Person | 3rd Person |
I -> we | you -> you, thou -> you/ye | she -> they, he -> they, it -> they, they -> they |
me -> us | you -> you, thee -> you/ye | her -> them, him -> them, it -> them, them -> them |
myself -> ourselves | yourself -> yourselves, thyself -> yourselves | herself -> themselves, himself -> themselves, itself -> themselves, themself -> themselves, oneself -> oneselves |
1st Person | 2nd Person | 3rd Person |
mine -> ours | yours -> yours, thine -> yours | hers -> theirs, his -> theirs, its -> theirs, theirs -> theirs |
1st Person | 2nd Person | 3rd Person |
my -> our | your -> your, thy -> your | her -> their, his -> their, its -> their, their -> their |
1st Person | 2nd Person | 3rd Person |
am -> are | are -> are | is -> are |
was -> were | were -> were | was -> were |
have -> have | have -> have | has -> have |
ate | had | sank |
could | made | shall |
did | must | should |
fought | ought | sought |
gave | put | spent |
alumna | alga | vertebra |
abscissa | formula | medusa |
amoeba | hydra | nebula |
antenna | hyperbola | nova |
aurora | lacuna | parabola |
anathema | enema | oedema |
bema | enigma | sarcoma |
carcinoma | gumma | schema |
charisma | lemma | soma |
diploma | lymphoma | stigma |
dogma | magma | stoma |
drama | melisma | trauma |
edema | miasma |
stamen | foramen | lumen |
codex | murex | silex |
apex | latex | vertex |
cortex | pontifex | vortex |
index | simplex |
iris | clitoris |
albino | generalissimo | manifesto |
archipelago | ghetto | medico |
armadillo | guano | octavo |
commando | inferno | photo |
ditto | jumbo | pro |
dynamo | lingo | quarto |
embryo | lumbago | rhino |
fiasco | magneto | stylo |
aphelion | hyperbaton | perihelion |
asyndeton | noumenon | phenomenon |
criterion | organon | prolegomenon |
agendum | datum | extremum |
bacterium | desideratum | stratum |
candelabrum | erratum | ovum |
aquarium | interregnum | quantum |
compendium | lustrum | rostrum |
consortium | maximum | spectrum |
cranium | medium | speculum |
curriculum | memorandum | stadium |
dictum | millennium | trapezium |
emporium | minimum | ultimatum |
encomium | momentum | vacuum |
gymnasium | optimum | velum |
honorarium | phylum |
focus | nimbus | succubus |
fungus | nucleolus | torus |
genius | radius | umbilicus |
incubus | stylus | uterus |
apparatus | impetus | prospectus |
cantus | nexus | sinus |
coitus | plexus | status |
hiatus |
afreet | afrit | efreet |
cherub | goy | seraph |
Adjutant | Lieutenant | Quartermaster |
Brigadier | Major |