I am going to give you two definitions of Latent Semantic Indexing (LSI). The reason is that LSI is derived from a mathematical technique for information retrieval and was originally used at universities to make searching large document databases more accurate. The first definition explains LSI and LSA (latent semantic analysis) from the educational perspective. The second covers how search engines (mainly Google) use LSI in their ranking algorithms to produce search results.
Latent Semantic Analysis (LSA) is a theory and method for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text. The underlying idea is that the aggregate of all the word contexts in which a given word does and does not appear provides a set of mutual constraints that sharply determines the similarity of meaning of words and sets of words to each other. The adequacy of LSA's reflection of human knowledge has been established in a variety of ways. For example, its scores overlap those of humans on standard vocabulary and subject matter tests; it mimics human word sorting and category judgments; it simulates word-word and passage-word lexical priming data; and, as has been reported, it accurately estimates passage coherence, the learnability of passages by individual students, and the quality and quantity of knowledge contained in an essay.
LSA can be construed in two ways:
(1) simply as a practical expedient for obtaining approximate estimates of the contextual-usage substitutability of words in larger text segments, and of the types of (as yet incompletely specified) meaning similarities among words and text segments that such relations may reflect, or
(2) as a model of the computational processes and representations underlying substantial portions of the acquisition and utilization of knowledge. We next sketch both views.
Regular keyword searches approach a document collection with a kind of accountant mentality: a document contains a given word or it does not, with no middle ground. We create a result set by looking through each document in turn for certain keywords and phrases, tossing aside any documents that do not contain them, and ordering the rest based on some ranking system. Each document stands alone in judgment before the search algorithm: there is no interdependence of any kind between documents, which are evaluated solely on their contents.
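To make the contrast concrete, here is a minimal sketch of that plain keyword search, written in Python. The function name, the sample documents, and the count-based ranking are all invented for illustration; the point is simply that each document is judged in isolation on exact matches.

```python
def keyword_search(query, documents):
    """Return documents containing every query term, ranked by total term count.

    Each document is examined on its own: it either contains a term
    or it does not, with no middle ground and no relationship to any
    other document in the collection.
    """
    terms = query.lower().split()
    results = []
    for doc_id, text in documents.items():
        words = text.lower().split()
        # Toss aside any document missing one of the query terms.
        if all(t in words for t in terms):
            # Rank the survivors by how often the query terms occur.
            score = sum(words.count(t) for t in terms)
            results.append((doc_id, score))
    return sorted(results, key=lambda r: r[1], reverse=True)

# Invented sample collection.
docs = {
    "a": "topology of an n-dimensional manifold",
    "b": "manifold learning and topology",
    "c": "introduction to accounting",
}

hits = keyword_search("manifold topology", docs)
```

Note that document "c" is discarded outright, and nothing about documents "a" and "b" influences how "c" is judged.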
Latent semantic indexing adds an important step to the document indexing process. In addition to recording which keywords a document contains, the method examines the document collection as a whole, to see which other documents contain some of those same words. LSI considers documents that have many words in common to be semantically close, and ones with few words in common to be semantically distant. This simple method correlates surprisingly well with how a human being, looking at content, might classify a document collection. Although the LSI algorithm does not understand anything about what the words mean, the patterns it notices can make it seem astonishingly intelligent.
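The "many words in common means semantically close" step can be sketched as a cosine similarity between bag-of-words vectors. This is only the overlap measure; full LSI goes further and applies a truncated singular value decomposition to the whole term-document matrix, which is not shown here. The sample texts and function names are invented.

```python
from collections import Counter
from math import sqrt

def vector(text):
    """Bag-of-words vector: word -> occurrence count."""
    return Counter(text.lower().split())

def similarity(a, b):
    """Cosine similarity between two documents' word vectors.

    Documents sharing many words score near 1.0 (semantically close);
    documents sharing no words score 0.0 (semantically distant).
    """
    va, vb = vector(a), vector(b)
    dot = sum(va[w] * vb[w] for w in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

close = similarity("manifold topology notes", "topology of a manifold")
far = similarity("manifold topology notes", "cake baking recipes")
```

The algorithm understands nothing about manifolds or cakes; the scores fall out purely from which words co-occur.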
When you search an LSI-indexed database, the search engine looks at similarity values it has calculated for every content word, and returns the documents that it thinks best fit the query. Because two documents may be semantically very close even if they do not share a particular keyword, LSI does not require an exact match to return useful results. Where a plain keyword search will fail if there is no exact match, LSI will often return relevant documents that do not contain the keyword at all.
To use an earlier example, let's say we use LSI to index our collection of mathematical articles. If the words n-dimensional, manifold and topology appear together in enough articles, the search algorithm will notice that the three terms are semantically close. A search for n-dimensional manifolds will therefore return a set of articles containing that phrase (the same result we would get with a regular search), but also articles that contain just the word topology. The search engine understands nothing about mathematics, but examining a sufficient number of documents teaches it that the three terms are related. It then uses that information to provide an expanded set of results with better recall than a plain keyword search.
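The effect described above can be approximated with simple co-occurrence counts. Real LSI derives these term associations from the SVD-reduced space rather than from raw co-occurrence, so treat this as a rough sketch of the behavior, not the algorithm itself; the article texts and helper names are invented.

```python
def cooccurring_terms(term, docs):
    """Terms that appear in the same documents as `term`."""
    related = set()
    for text in docs:
        words = set(text.lower().split())
        if term in words:
            related |= words
    related.discard(term)
    return related

def expanded_search(query_terms, docs):
    """Match documents containing a query term or any co-occurring term.

    This lets the search return relevant documents that do not
    contain the original keyword at all.
    """
    expanded = set(query_terms)
    for t in query_terms:
        expanded |= cooccurring_terms(t, docs)
    return [d for d in docs if set(d.lower().split()) & expanded]

# Invented mathematical articles; the last two never mention "manifold".
articles = [
    "n-dimensional manifold topology proofs",
    "topology of an n-dimensional manifold",
    "manifold topology and n-dimensional spaces",
    "introduction to point-set topology",
    "double-entry accounting basics",
]

hits = expanded_search({"manifold"}, articles)
```

Because "manifold" co-occurs with "topology" in enough articles, the point-set topology article is returned even though it never contains the query term, while the accounting article is not.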