Text Normalization Tool Interactive Demo
What to normalize in text

In mixed-case texts, capitalized words usually denote proper names (names of organizations, locations, people, artifacts, etc.), but there are special positions in the text where capitalization is expected. Such mandatory (ambiguous) positions include the first word in a sentence, words in all-capitalized titles or table entries, a capitalized word after a colon or open quote, the first capitalized word in a list entry, etc. Capitalized words in these and some other positions are ambiguous: they can stand for proper names, as in "White later said ...", or they can be just capitalized common words, as in "White elephants are ...". This distinction is important for almost all kinds of text analysis and document indexing.

Sentence boundary disambiguation (SBD) is an important part of developing virtually any practical text processing application: syntactic parsing, Information Extraction, Machine Translation, Text Alignment, Document Summarization, etc. Segmenting text into sentences is in most cases a simple matter: a period, an exclamation mark or a question mark usually signals a sentence boundary. However, there are cases when a period denotes a decimal point or is part of an abbreviation, and thus does not signal a sentence break. Furthermore, an abbreviation itself can be the last token in a sentence, in which case its period acts both as part of the abbreviation and as the end-of-sentence indicator (full stop).
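To make the two ambiguities above concrete, here is a minimal rule-based sketch in Python. It is not the method used by the Infogistics tool, which this page does not describe. The abbreviation lists and the function names (`is_boundary`, `split_sentences`, `ambiguous_capitals`) are hypothetical and chosen for illustration only. The sketch assumes whitespace-separated tokens, so a decimal point such as the one in "3.50" sits inside a token and never ends it.

```python
# Hypothetical, deliberately tiny abbreviation lists (illustration only).
# Abbreviations assumed never to end a sentence (e.g. titles before a name):
NON_FINAL_ABBREVS = {"mr", "mrs", "ms", "dr", "prof", "e.g", "i.e", "vs"}
# Abbreviations whose period may also serve as the full stop:
MAYBE_FINAL_ABBREVS = {"inc", "ltd", "co", "etc"}


def is_boundary(token, next_token):
    """Guess whether `token` ends a sentence, given the token after it."""
    core = token.rstrip("\"')")          # ignore closing quotes and brackets
    if core.endswith(("!", "?")):
        return True
    if not core.endswith("."):
        return False                     # "3.50" ends in a digit, not a period
    if next_token is None:
        return True                      # end of text
    word = core[:-1].lower()
    if word in NON_FINAL_ABBREVS:
        return False
    if word in MAYBE_FINAL_ABBREVS:
        # Ambiguous period: treat it as a full stop only when the next
        # token starts with an uppercase letter.
        return next_token[:1].isupper()
    return True


def split_sentences(text):
    """Split whitespace-tokenized text into sentences using is_boundary."""
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        next_token = tokens[i + 1] if i + 1 < len(tokens) else None
        if is_boundary(token, next_token):
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


def ambiguous_capitals(sentence):
    """Return capitalized words in positions where capitalization is expected:
    sentence-initial, directly after a colon, or after an opening quote."""
    tokens = sentence.split()
    flagged = []
    for i, token in enumerate(tokens):
        word = token.lstrip("\"'`")
        if not word[:1].isupper():
            continue
        after_quote = word != token
        after_colon = i > 0 and tokens[i - 1].endswith(":")
        if i == 0 or after_quote or after_colon:
            flagged.append(word.rstrip(".,;:!?\"')"))
    return flagged


if __name__ == "__main__":
    text = "Dr. White paid 3.50 dollars to Infogistics Ltd. White elephants are rare."
    for sentence in split_sentences(text):
        print(sentence, "->", ambiguous_capitals(sentence))
```

The sketch only flags ambiguous capitalized words. Deciding whether a flagged "White" is a surname or a common adjective needs further evidence, for example whether the same word appears capitalized in unambiguous positions elsewhere in the document.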
Copyright © 2000 2001 Infogistics Ltd. All rights reserved.