Infogistics' RealTerm


About RealTerm

search the Web with RealTerm technology

download RealTerm white-paper

download case study in HR domain

RealTerm: How it Works

RealTerm applies advanced statistical, linguistic, and conceptual analysis to automatically identify the major topics, and their sub-topics, contained in a document collection.

Term Extraction

First, RealTerm scans, in real time, the document summaries returned by a search engine and identifies the most important words and phrases (terms) that characterize these documents. Because documents may come from a variety of sources, RealTerm applies spelling-correction and phrase-unification algorithms that unify differently spelled or misspelled variants of the same phrase, e.g. "Mono Lisa" and "Mona Lisa", or "Gregoriy" and "Grigory".
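As a rough illustration of phrase unification, the sketch below groups spelling variants by string similarity using Python's standard difflib module. The function name unify_variants, the 0.75 threshold, and the choice of the first-seen spelling as the canonical form are assumptions made for this example; the page does not say which algorithms RealTerm actually uses.

```python
from difflib import SequenceMatcher


def unify_variants(phrases, threshold=0.75):
    """Map each phrase to the first-seen phrase it closely resembles.

    Hypothetical helper: similarity is difflib's ratio on lowercased text.
    """
    canonical = []  # one representative per group, in first-seen order
    mapping = {}
    for phrase in phrases:
        key = phrase.lower()
        for rep in canonical:
            if SequenceMatcher(None, key, rep.lower()).ratio() >= threshold:
                mapping[phrase] = rep  # close enough: treat as a variant
                break
        else:
            canonical.append(phrase)  # no close match: start a new group
            mapping[phrase] = phrase
    return mapping


variants = unify_variants(["Mona Lisa", "Mono Lisa", "Grigory", "Gregoriy", "Leonardo"])
```

Here "Mono Lisa" is mapped to "Mona Lisa" and "Gregoriy" to "Grigory", while "Leonardo" stays on its own. A production system would likely also weigh term frequency when choosing the canonical spelling.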

It also applies morphological and syntactic transformations to unify phrases according to the grammar of the language. For example, "linguistic and statistical method" and "statistical and linguistic methods", or "information retrieval" and "retrieval of information", can be unified according to the syntactic rules of English.
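The two kinds of transformation can be sketched as a normalization function that maps grammatical variants to one canonical key. This is a deliberately naive illustration: the singular and normalize helpers, the plain "-s" stripping rule, and the handling of a single "of" or "and" are all assumptions for this example, not RealTerm's actual rules.

```python
def singular(word):
    """Naive morphological step: strip a plain plural -s (assumption)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith(("ss", "is", "us")):
        return word[:-1]
    return word


def normalize(phrase):
    """Return a canonical key shared by grammatical variants of a phrase."""
    words = [singular(w) for w in phrase.lower().split()]
    # Syntactic step 1: "retrieval of information" -> "information retrieval"
    if "of" in words:
        i = words.index("of")
        words = words[i + 1:] + words[:i]
    # Syntactic step 2: put the two coordinated modifiers in alphabetical order
    if "and" in words:
        i = words.index("and")
        if 0 < i < len(words) - 1:
            first, second = sorted([words[i - 1], words[i + 1]])
            words[i - 1], words[i + 1] = first, second
    return " ".join(words)
```

With these rules, normalize("statistical and linguistic methods") and normalize("linguistic and statistical method") produce the same key, as do "retrieval of information" and "information retrieval".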

Document Clustering

In the next stage, the documents, together with the identified terms, are clustered into groups of related topics according to the content of their summaries. This is done by applying statistical algorithms that evaluate the strength of co-occurrence between terms and documents. As a result, groups of documents form topic clusters that can be described by the terms identified in those documents. In other words, RealTerm looks beyond the individual words appearing in the documents and groups them into topics on the basis of the information they contain.
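To make the idea of term-based clustering concrete, here is a minimal sketch that links documents whose term sets overlap and labels each group by its common terms. It uses Jaccard overlap with single-link grouping, which is a stand-in chosen for brevity; the page does not name the statistical algorithms RealTerm uses, and the 0.3 threshold and sample data are assumptions.

```python
from collections import Counter


def jaccard(a, b):
    """Overlap of two term sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(doc_terms, threshold=0.3):
    """Group documents with overlapping terms; label groups by common terms."""
    ids = list(doc_terms)
    parent = {d: d for d in ids}  # union-find forest over document ids

    def find(d):
        while parent[d] != d:
            parent[d] = parent[parent[d]]  # path halving
            d = parent[d]
        return d

    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if jaccard(doc_terms[a], doc_terms[b]) >= threshold:
                parent[find(b)] = find(a)  # link the two groups

    groups = {}
    for d in ids:
        groups.setdefault(find(d), []).append(d)

    clusters = []
    for docs in groups.values():
        counts = Counter(t for d in docs for t in doc_terms[d])
        # A term labels the cluster if it occurs in more than half of its documents.
        label = sorted(t for t, n in counts.items() if 2 * n > len(docs))
        clusters.append((docs, label))
    return clusters


docs = {
    "d1": {"mona lisa", "leonardo da vinci", "louvre"},
    "d2": {"mona lisa", "louvre", "paris"},
    "d3": {"heart disease", "myocardial infarction"},
    "d4": {"heart disease", "cholesterol"},
}
topics = cluster(docs)
```

On this sample, d1 and d2 form one cluster described by "louvre" and "mona lisa", and d3 and d4 form another described by "heart disease".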

Term Relations Identification

Finally, terms are arranged into hierarchical relations, from more general to more specific. This is done in two ways. Lexical analysis of term structure can indicate that one term is more general than another: for example, "Manchester United Football Club" is a specialization (a kind) of "Football Club". The other way of uncovering term relations is distributional statistical analysis, which can reveal that, for example, "myocardial infarction" often co-occurs with "heart disease" and can therefore be treated as its specialization.
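Both approaches can be sketched in a few lines. The lexical test below treats a term as a specialization when the general term is its word-level suffix; the distributional test uses a simple subsumption heuristic (the specific term almost always appears alongside the general one, which is itself more widespread). Function names, the 0.8 cutoff, and the sample data are assumptions for illustration, not RealTerm's documented method.

```python
def is_lexical_specialization(specific, general):
    """True if `general` is a proper word-level suffix of `specific`."""
    s, g = specific.lower().split(), general.lower().split()
    return bool(g) and len(s) > len(g) and s[-len(g):] == g


def is_distributional_specialization(specific, general, doc_terms, min_cond=0.8):
    """True if `specific` mostly occurs with `general`, which occurs more widely.

    doc_terms is a list of term sets, one per document.
    """
    with_specific = [terms for terms in doc_terms if specific in terms]
    with_general = [terms for terms in doc_terms if general in terms]
    if not with_specific:
        return False
    both = sum(1 for terms in with_specific if general in terms)
    # Estimate P(general | specific) and require the general term to be broader.
    return both / len(with_specific) >= min_cond and len(with_general) > len(with_specific)


corpus = [
    {"heart disease", "myocardial infarction"},
    {"heart disease", "myocardial infarction"},
    {"heart disease", "cholesterol"},
    {"heart disease"},
    {"football club"},
]
```

On this toy corpus, "myocardial infarction" qualifies as a specialization of "heart disease" but not the other way round, and "Manchester United Football Club" passes the lexical test against "Football Club".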

Knowledge-Base Support

When applied to a specific domain, RealTerm can make use of existing knowledge bases, thesauri, and word lists. For example, in the medical domain RealTerm can be used in conjunction with the MEDLINE Meta-Thesaurus, in which relations between many medical terms are already established.
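One plausible way to combine an existing thesaurus with automatically inferred relations is to let the curated entries take precedence. The sketch below assumes both sources are simple term-to-broader-term mappings; the merge_relations helper, that precedence rule, and the sample entries are hypothetical and are not taken from any real thesaurus.

```python
def merge_relations(inferred, curated):
    """Combine term -> broader-term maps; curated entries win on conflict."""
    merged = dict(inferred)
    merged.update(curated)
    return merged


# Hypothetical sample data, for illustration only.
inferred = {"myocardial infarction": "chest pain", "angina": "chest pain"}
curated = {"myocardial infarction": "heart disease"}
relations = merge_relations(inferred, curated)
```

After merging, "myocardial infarction" points to the curated broader term "heart disease", while the inferred relation for "angina" is kept because the thesaurus has no entry for it.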

The Result

Extracted terms are arranged into a semantic network that links them to documents and to other terms. This network supports RealTerm's topic-browsing functionality, which allows the user, within a few mouse clicks, to identify and aggregate all and only the relevant documents, regardless of whether they were listed among the first ten or the last hundred in the return set.
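The document-linking part of such a network can be approximated with an inverted index from terms to documents; each click on a topic then narrows the result by set intersection, independent of the original ranking. The build_index and browse names and the sample ranks below are assumptions for this sketch, which omits term-to-term links.

```python
from collections import defaultdict


def build_index(doc_terms):
    """Map each term to the set of documents (here: result ranks) containing it."""
    index = defaultdict(set)
    for doc_id, terms in doc_terms.items():
        for term in terms:
            index[term].add(doc_id)
    return index


def browse(index, selected_terms):
    """Return the documents that contain every selected term."""
    doc_sets = [index.get(term, set()) for term in selected_terms]
    return set.intersection(*doc_sets) if doc_sets else set()


# Keys are positions in the search engine's return set (hypothetical data).
results = {
    3: {"mona lisa", "louvre"},
    250: {"mona lisa", "forgery"},
    987: {"mona lisa", "louvre", "theft"},
}
index = build_index(results)
hits = browse(index, ["mona lisa", "louvre"])
```

Selecting "mona lisa" and then "louvre" returns the documents ranked 3 and 987 together, even though one of them sits near the end of the return set.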

This multi-stage process is in fact very fast. The engine can process 15,000 words per second on a 1 GHz Pentium computer. This speed allows it to generate the topic index for a return set of 1,000 documents, from the abstracts returned by the search engine, in 2-3 seconds.
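As a quick consistency check of these figures, the calculation below derives the average summary length they imply. It uses only the numbers quoted above and assumes the full 2-3 seconds is spent processing text at the stated rate.

```python
WORDS_PER_SECOND = 15_000  # stated engine throughput
DOCUMENTS = 1_000          # stated size of the return set


def words_per_summary(seconds):
    """Average summary length implied by processing all documents in `seconds`."""
    return WORDS_PER_SECOND * seconds / DOCUMENTS


low, high = words_per_summary(2), words_per_summary(3)  # 30.0 and 45.0
```

In other words, the quoted speed corresponds to summaries of roughly 30 to 45 words each, which is in line with the short abstracts a search engine typically returns.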
