Latent semantic analysis (LSA) is a mathematical technique used in natural language processing or computational linguistics for finding complex and hidden (latent) relations of meaning among words and the various contexts in which they are found. It is also used in signal processing to reduce noise. It is built on the basic ideas of association of elements with contexts and similarity in meaning as defined by similarity in shared contexts. Although it is a purely computational technique that is based only on co-occurrence patterns, it can produce results that mimic the performance of humans on certain standard language tests, including synonym-antonym matching, vocabulary and topic extraction.
What it can do
LSA is already being applied to various problems related to the automatic processing of huge amounts of information, such as Web page indexing, searching, filtering, the learning of individual user interests, judging the educational suitability of texts and other sophisticated tasks that have always relied on human intelligence. It has the potential to provide insight into the nature and function of human language processing and possibly cognition and learning in general. Ultimately, it may help breathe new life into the field of what has been called artificial intelligence, a field that might better be referred to as artificial mind. A particularly exciting implication of work that has already been done is a convincing suggestion of how language might be learned from a rather sparse set of training data. That puzzle is what Chomsky called Plato's problem, and it has been used as a strong argument for language being innate rather than learned.
How it works
In natural language processing, LSA scans a set of texts for elements that appear at least once in any of them. In general, the elements could be any unit of text and the texts (contexts) could be any unit of text that is larger than the elements. In practice, words serve as the elements; keyword lists, titles, sentences, abstracts, whole articles or even larger units of written language may serve as contexts.
The number of occurrences of each of the elements in each of the contexts is counted and the results are put in table form, with the elements labeling the rows and the contexts labeling the columns. Up to this point it is about the same as keyword indexing, but now some clever math is applied to the values in the table (which form a matrix, by the way) so that the words and contexts all become points in a semantic vector space. Points that lie close together in that space represent words or contexts that are similar in meaning. The process is much like statistical factor analysis, particularly cluster analysis, or the operation of an artificial neural network.
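As a concrete illustration, here is a minimal sketch of building that occurrence table, assuming Python with a recent scikit-learn; the three toy contexts are invented purely for the example:

```python
# Build the elements-by-contexts occurrence matrix for three toy contexts.
from sklearn.feature_extraction.text import CountVectorizer

contexts = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs make good pets",
]

vectorizer = CountVectorizer()
doc_term = vectorizer.fit_transform(contexts)   # contexts x elements, sparse counts
occurrence = doc_term.T.toarray()               # elements label the rows, contexts the columns
elements = vectorizer.get_feature_names_out()

for word, row in zip(elements, occurrence):
    print(f"{word:>6}  {row}")
```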
One very important effect of the mathematical transformation is that the data set is 'smoothed'. That is, some of the values in the matrix are changed so as to bring 'shapes' in the semantic space into better focus. This works in much the same way as the anti-aliasing algorithms that smooth out jagged lines on graphic displays. When done right (it is actually a little tricky), the result is similar to the inferences people make in associating words with topics. That inference resembles the identification and labeling of concepts, and it acts as an effective multiplier for learning in language.
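The smoothing can be sketched directly with a truncated singular value decomposition: keep only the largest singular values and rebuild the table. The counts below are made up solely to show the effect; notice how cells that were zero become small non-zero values, i.e., inferred associations:

```python
# Illustrate 'smoothing': keep only the k largest singular values and
# reconstruct the matrix. The counts are invented for demonstration.
import numpy as np

X = np.array([
    [2, 1, 0, 0],   # element 1 counts across four contexts
    [1, 0, 1, 0],   # element 2
    [0, 2, 1, 1],   # element 3
], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2                                   # number of dimensions to keep
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(np.round(X_k, 2))                 # zeros in X typically become small non-zero values
```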
Specifically, the method clusters elements and contexts into neighborhoods in semantic space. The smoothing effect means that an element is inferred to be likely to appear in the contexts in which the other elements of its cluster (concept) appear, and in contexts that are semantically close to those. For example, consider the elements 'porn' and 'pr0n' and a large number of Web articles (contexts) related to pornography. Some of the articles contain both terms (group A), some contain 'porn' but not 'pr0n' (group B), others 'pr0n' but not 'porn' (group C), and some contain neither element (group D).
Now a simple keyword search on 'porn' will return articles from group A and group B, but not C or D; similarly for a search on 'pr0n'. A latent semantic analysis, however, will turn up articles from all four groups as a result of a search on either or both of 'porn' and 'pr0n'. It does that by creating a model of the latent semantic relationships among the elements, among the contexts, and between each element and each context. The effect is similar to inferring that if you are interested in articles that contain 'porn', you will also be interested in those that contain 'pr0n', and even in many articles that fit the concept but contain neither word.
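Under the same assumptions as the sketches above (a made-up count matrix and a truncated SVD), a one-word query can be folded into the reduced space and compared with the contexts by cosine similarity; contexts from groups C and D then typically receive clearly non-zero relevance scores even though they never contain 'porn':

```python
import numpy as np

# Toy occurrence matrix: rows = elements, columns = contexts A-D (invented counts).
terms = ["porn", "pr0n", "video", "adult"]
docs  = ["A", "B", "C", "D"]
X = np.array([
    [2, 1, 0, 0],   # porn
    [1, 0, 2, 0],   # pr0n
    [1, 1, 1, 1],   # video
    [1, 0, 1, 1],   # adult
], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :].T      # rows of Vk = context vectors

# Fold a one-word query ('porn') into the reduced space as a pseudo-context.
q = np.zeros(len(terms)); q[terms.index("porn")] = 1.0
q_k = (q @ Uk) / sk

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

for name, d in zip(docs, Vk):
    print(name, round(cosine(q_k, d), 3))      # C and D score non-zero despite lacking 'porn'
```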
Note that in the above example, the element 'pr0n' isn't even a standard word. Other than what kinds of units are to serve as elements and contexts, LSA makes no linguistic assumptions and does not rely on any particular linguistic model, syntax or lexicon. It is thus fully applicable to any language, and can also be used in multilingual contexts without the need for translation.
The process
- Scan the set of texts (contexts) for words that appear at least once in at least one of the texts. Create a table of those words versus the contexts, with each cell containing the number of occurrences of the word in the context. This is the occurrence matrix.
- Weight the values for each word (i.e., each set of row values) to emphasize their relative importance to the individual contexts and to the set of contexts as a whole.
- Apply singular value decomposition (SVD) to the matrix (table) of values to reduce the number of dimensions in the semantic space and produce a more focused value matrix of the same size (a sketch of the weighting and decomposition steps follows this list).
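Here is that sketch, assuming log-entropy weighting (one common choice in the LSA literature, though the method does not mandate it) and NumPy's SVD; 'occurrence' is an elements-by-contexts count matrix like the one built earlier:

```python
# A sketch of the weighting and SVD steps; assumes more than one context.
import numpy as np

def log_entropy_weight(occurrence):
    X = occurrence.astype(float)
    n_contexts = X.shape[1]
    p = X / np.maximum(X.sum(axis=1, keepdims=True), 1e-12)   # P(context | element)
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(p > 0, p * np.log(p), 0.0)
    entropy = 1.0 + plogp.sum(axis=1) / np.log(n_contexts)    # global weight per element
    return np.log(X + 1.0) * entropy[:, None]                 # local weight x global weight

def lsa(occurrence, k):
    W = log_entropy_weight(occurrence)
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    element_vectors = U[:, :k] * s[:k]          # elements as points in the k-dim semantic space
    context_vectors = Vt[:k, :].T * s[:k]       # contexts as points in the same space
    return element_vectors, context_vectors
```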
The resulting table of values can be used as a set of indices of the relevance of the contexts to the respective words. It is important to note, however, that this process is not entirely automated. The third step, the SVD, involves a reduction in the number of dimensions of the vector space, but the success of the technique is sensitive to the final number of dimensions. There is an optimum number in each case, but there is still no algorithmic way to find that number.
Semantics without syntax
We can see that this approach is entirely semantic in nature. Words are associated with a kind of average of the sum of the word contexts in which they appear; no syntax or other language-dependent information is used. Words used primarily for syntactical structures that are not specific to a document's conceptual make-up are removed from consideration. In languages where the grammar includes word inflection, however, there may be many different word forms that have the same concept base. Examples include singular and plural nouns and part-of-speech conversions such as crux, crucify, crucifixion and crucifix. LSA will capture their semantic similarity only indirectly and incompletely. One way to remove this grammatical influence is to add stemming to the preprocessing to collapse the formal differences into single stem terms.
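A minimal sketch of that preprocessing step, assuming NLTK's Porter stemmer is available; simple inflectional variants collapse to a shared stem, while derivational families such as crux/crucify/crucifixion generally survive and need heavier normalization (e.g., lemmatization):

```python
# Collapse inflected word forms to stems before building the occurrence matrix.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
tokens = ["cats", "cat", "running", "runs", "crucify", "crucifixion"]
print([stemmer.stem(t) for t in tokens])
# e.g. ['cat', 'cat', 'run', 'run', 'crucifi', 'crucifixion'] (may vary by NLTK version)
```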
The failure to use the meaning that we can derive from grammar may seem like a major disadvantage, but for the purposes of topical search based on conceptual similarity, that information is not important. Our request is "Show me documents that are about X.", not "Show me documents that say this and that about X." If we demand that LSA understand the content of the document, then we ask too much.
But is it really semantics?
If we choose to see the context clustering that is done by LSA as a kind of concept formation and labeling, we must admit that these concepts are formal within language. The concepts have no direct physical referents; they are just language patterns that are associated with words (other, smaller language patterns). Those words are not grounded in any kind of non-language experience, as they are in the minds of animals, and they are not linked to any non-linguistic entity to which they may have import, such as animal emotion or an embodied sentient artificial mind.
Clever technique or deep theory?
A striking feature of LSA is that it is a simple mathematical model that emulates certain basic cognitive linguistic functions of the brain. Clearly the brain does not employ the matrix arithmetic algorithms used to implement LSA any more than it has a calculus engine that gives us the ability to catch a fly ball in deep left field. The similarity in results, however, does suggest that we should use the algorithms to examine what mechanisms the brain does have available that could implement the model or something very similar to it. Because the primary brain mechanism for knowledge formation seems to be high-dimensional associations formed in the neural networks of the nervous system, the natural choice for modeling brain function with reasonable physical validity is the artificial neural network (NN).
There is indeed a straightforward NN analog to the matrix calculations used in LSA: a rather simple three-layer network with full interconnection between layers one and two and between layers two and three. The linguistic elements (e.g., words) form the input layer vector and the contexts (e.g., documents) form the output layer vector. The middle (hidden) layer corresponds more or less to the dimensionality of the semantic space, so its size must be selected carefully for the network to produce meaningful results. Because that optimal dimensionality varies with the corpus and the task, to implement the basic function of LSA in an NN we must contrive a way to dynamically modulate the NN structure by external factors. To speculate, those factors may include the brain function of attention and various non-linguistic contexts (emotion, homeostatic urges and other such states of mind and body).
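That linear three-layer analog can be sketched directly from the SVD factors; the sizes and data below are illustrative only, and the two weight matrices are simply the truncated left and right singular vectors (the latter scaled by the singular values):

```python
# A rough numpy sketch of the three-layer linear network described above.
import numpy as np

def build_network(W, k):
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    W1 = U[:, :k].T                 # input layer (elements) -> hidden layer (k units)
    W2 = Vt[:k, :].T * s[:k]        # hidden layer -> output layer (contexts)
    return W1, W2

# Feeding a one-hot word vector through the network approximately reproduces
# that word's row of W, i.e. its profile over the contexts.
rng = np.random.default_rng(0)
W = rng.random((6, 4))              # made-up weighted counts: 6 words, 4 contexts
W1, W2 = build_network(W, k=2)
word = np.zeros(6); word[0] = 1.0   # one-hot input for word 0
hidden = W1 @ word                  # k-dimensional semantic representation
output = W2 @ hidden                # approximate context profile of word 0
print(np.round(output, 2), np.round(W[0], 2))
```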
The actual wiring of the brain (i.e., the complete natural neural network structure) is certainly too complex for behavior to be explained by any simple analysis of pathways on a physical level, particularly for the higher cognitive functions. We might be well guided in such an analysis, however, by mathematical models like LSA that produce empirical results akin to human cognitive behavior. Applying such models to the design of artificial neural networks is likely to aid in the understanding of cognitive brain functions and in the reverse engineering of those functions.
Sources
Landauer, Foltz and Laham, "An Introduction to Latent Semantic Analysis"
Landauer and Dumais, "A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction and Representation of Knowledge"
http://lsa.colorado.edu/papers.html
http://www.jakecovert.com/index.php/131/
http://javelina.cet.middlebury.edu/lsa/out/search_engines.htm