Cloud OCR SDK Documentation

Regular Expressions

The ABBYY Cloud OCR SDK regular expression alphabet is described in the following table:

Item name                           Conventional regular expression sign  Usage examples and explanations
Any character                       .                                     c.t denotes words like “cat”, “cot”
Character from a character range    []                                    [b-d]ell denotes words like “bell”, “cell”, “dell”
                                                                          [ty]ell denotes the words “tell” and “yell”
Character out of a character range  [^]                                   [^y]ell denotes words like “dell”, “cell”, “tell”, but forbids “yell”
                                                                          [^n-s]ell denotes words like “bell”, “cell”, but forbids “nell”, “oell”, “pell”, “qell”, “rell”, and “sell”
Or                                  |                                     c(a|u)t denotes the words “cat” and “cut”
0 or more occurrences in a row      *                                     10* denotes the numbers 1, 10, 100, 1000, etc.
1 or more occurrences in a row      +                                     10+ allows the numbers 10, 100, 1000, etc., but forbids 1
Letter or digit                     [0-9a-zA-Z]                           [0-9a-zA-Z] allows a single character
                                                                          [0-9a-zA-Z]+ allows any word
Capital Latin letter                [A-Z]
Small Latin letter                  [a-z]
Capital Cyrillic letter             [А-Я]
Small Cyrillic letter               [а-я]
Digit                               [0-9]
Space                               \s
Character used by the system        @
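
The usage examples in the table can be checked mechanically. The sketch below uses Python's re module as a stand-in for the notation; it is not the Cloud OCR SDK engine, and the two dialects may differ in details (for example, in how @ and \s are treated). The names TABLE_EXAMPLES and check are hypothetical and exist only for this illustration.

```python
import re

# Each entry: (pattern, words the table says it denotes, words it forbids).
TABLE_EXAMPLES = [
    (r"c.t", ["cat", "cot"], ["ct"]),
    (r"[b-d]ell", ["bell", "cell", "dell"], ["tell"]),
    (r"[ty]ell", ["tell", "yell"], ["bell"]),
    (r"[^y]ell", ["dell", "cell", "tell"], ["yell"]),
    (r"[^n-s]ell", ["bell", "cell"], ["nell", "sell"]),
    (r"c(a|u)t", ["cat", "cut"], ["cot"]),
    (r"10*", ["1", "10", "100", "1000"], ["0"]),
    (r"10+", ["10", "100", "1000"], ["1"]),
    (r"[0-9a-zA-Z]+", ["word", "Abc123"], [""]),
]


def check(pattern, accepted, rejected):
    """Return True if every accepted word matches in full and no rejected word does."""
    compiled = re.compile(pattern)
    return (all(compiled.fullmatch(word) for word in accepted)
            and not any(compiled.fullmatch(word) for word in rejected))


# Map each pattern to whether Python's re agrees with the table.
results = {pattern: check(pattern, ok, bad) for pattern, ok, bad in TABLE_EXAMPLES}
```

fullmatch is used rather than search because each expression is meant to describe a whole word, not a fragment of one.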

Notes:

  • Some characters used in regular expressions are “auxiliary” (i.e. they are used for system purposes). As you can see from the list above, such characters include square brackets, periods, etc. If you wish to enter an auxiliary character as a normal one, put a backslash (\) before it. Example: [t-v]x+ denotes words like “tx”, “txx”, “txxx”, etc., “ux”, “uxx”, etc., but \[t-v\]x+ denotes words like “[t-v]x”, “[t-v]xx”, “[t-v]xxx” etc.
  • If you need to group certain regular expression elements, use parentheses. For example, (a|b)+|c denotes “c” and any combination such as “abbbaaabbb”, “ababab”, etc. (a word of any non-zero length made up of a's and b's in any order), whilst a|b+|c denotes “a”, “c”, and “b”, “bb”, “bbb”, etc.
  • Regular expressions do not strictly limit the set of characters in the output, i.e. the recognized value may contain characters that are not included in the regular expression. During recognition, all hypotheses for a word are checked against the specified regular expression. If a given recognition variant conforms to the expression, it has a higher probability of being selected as the final recognition output. But if no variant matches the regular expression, the result will not conform to the expression.
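
The three notes above can be illustrated with a short sketch. It again uses Python's re module as a stand-in, not the SDK itself. The function pick_result and its (text, score) hypothesis format are assumptions made for this example: the documentation only says that matching variants are more likely to be selected, so the rule below ("take the best-scoring match, otherwise the best-scoring variant") is a deliberate simplification.

```python
import re

# Note 1: a backslash turns an auxiliary character into an ordinary one.
range_pattern = re.compile(r"[t-v]x+")      # "tx", "uxx", "vxxx", ...
literal_pattern = re.compile(r"\[t-v\]x+")  # the literal text "[t-v]x", "[t-v]xx", ...

# Note 2: parentheses decide what the alternation and the + apply to.
grouped = re.compile(r"(a|b)+|c")   # "c", or any non-empty mix of a's and b's
ungrouped = re.compile(r"a|b+|c")   # "a", "c", or one or more b's


def pick_result(hypotheses, pattern):
    """Note 3, simplified: prefer a hypothesis that matches the pattern.

    hypotheses is a non-empty list of (text, score) pairs (hypothetical format).
    If nothing matches, the best-scoring hypothesis is returned anyway, so the
    result may not conform to the expression.
    """
    compiled = re.compile(pattern)
    ranked = sorted(hypotheses, key=lambda item: item[1], reverse=True)
    for text, _score in ranked:
        if compiled.fullmatch(text):
            return text
    return ranked[0][0]


# A matching variant wins even though another variant scored higher.
preferred = pick_result([("1O0", 0.9), ("100", 0.8)], r"10+")
# No variant matches, so the output does not conform to the expression.
fallback = pick_result([("IOO", 0.9), ("lOO", 0.5)], r"10+")
```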

Samples

Regular expression for dates

The number denoting day may consist of one digit (e.g. 1, 2 etc.) or two digits (e.g. 02, 12), but it cannot be zero (00 or 0). The regular expression for the day should then look like this: ((|0)[1-9])|([12][0-9])|(30)|(31).

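The number denoting the month follows the same rule: one digit (1 to 9), or two digits (01 to 09, 10, 11, 12). The regular expression for the month should look like this: ((|0)[1-9])|(10)|(11)|(12).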

The year may be written either with four digits beginning with 19 or 20, or with two digits. The regular expression for the year should look like this: (((19)|(20))[0-9][0-9])|([0-9][0-9]).

All that remains is to combine these three expressions and separate the numbers with periods (e.g. 1.03.1999). The period is an auxiliary character, so we must put a backslash (\) before it. The regular expression for the full date should then look like this:

(((|0)[1-9])|([12][0-9])|(30)|(31))\.(((|0)[1-9])|(10)|(11)|(12))\.((((19)|(20))[0-9][0-9])|([0-9][0-9]))
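
As a rough check of the composition, the sketch below builds the full date expression from the day, month, and year parts and tries it on a few strings. It uses Python's re module as a stand-in for the SDK, and the names DAY, MONTH, YEAR, FULL_DATE, and is_date are hypothetical. Each part is wrapped in parentheses before joining so that its alternatives do not spill over into the neighbouring parts.

```python
import re

# The three building blocks, copied from the text above.
DAY = r"((|0)[1-9])|([12][0-9])|(30)|(31)"
MONTH = r"((|0)[1-9])|(10)|(11)|(12)"
YEAR = r"(((19)|(20))[0-9][0-9])|([0-9][0-9])"

# Group each part, then join the groups with an escaped period.
FULL_DATE = "(" + DAY + r")\.(" + MONTH + r")\.(" + YEAR + ")"

_date_re = re.compile(FULL_DATE)


def is_date(text):
    """Return True if the whole string has the day.month.year shape."""
    return _date_re.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ["1.03.1999", "31.12.99", "00.03.1999", "32.01.2000", "1.13.1999"]:
        print(sample, is_date(sample))
```

The expression checks only the shape of the date, not the calendar: a string such as 31.02.1999 still matches.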

Regular expression for e-mail addresses

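A regular expression can also describe e-mail addresses. Because @ is a character used by the system, it must be preceded by a backslash (\), as must the hyphen and the period. The regular expression for an e-mail address should look like this: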

[a-zA-Z0-9_\-\.]+\@[a-zA-Z0-9\.\-]+\.[a-zA-Z]+
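
The e-mail expression can be tried out in the same way. This is a minimal sketch using Python's re module as a stand-in for the SDK; EMAIL and looks_like_email are hypothetical names. In Python the backslash before @ is not required, but it is harmless, so the expression is used exactly as written above.

```python
import re

# The expression from the text above, unchanged.
EMAIL = r"[a-zA-Z0-9_\-\.]+\@[a-zA-Z0-9\.\-]+\.[a-zA-Z]+"

_email_re = re.compile(EMAIL)


def looks_like_email(text):
    """Return True if the whole string has the name@domain.tld shape."""
    return _email_re.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ["john.doe@example.com", "user_1@mail-server.org",
                   "user@localhost", "no-at-sign.example.com"]:
        print(sample, looks_like_email(sample))
```

The expression is intentionally loose: it requires at least one period after the @ and letters only in the final part, but it does not validate addresses against the full e-mail syntax.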