[2019 edit: I think in the year and a half since I wrote this, there has been some progress in this area, although I haven’t kept up to date with research in the field]
One thing that strikes me is that semantic similarity between documents is not directly analogous to semantic similarity between words: whereas the meaning of a word can be extracted from its surrounding context within a sentence (the word-vector approach), the meaning of a document comprises something more like the ideas it invokes (I believe the field that studies this is called semiotics), or perhaps an automatic “mental summary” of the document that results in such ideas. There are current attempts to construct document vectors, for example as a sum of the individual word vectors weighted by term frequency, but that takes a directly analogous approach (although using TF-IDF weighting does offset the case where documents with different semantic content would otherwise have similar sums of word vectors).
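To make the weighted-sum idea concrete, here is a minimal sketch of TF-IDF-weighted document vectors, assuming a toy hand-made set of 2-D word embeddings (the vectors and documents are hypothetical, purely for illustration):

```python
import math
from collections import Counter

# Hypothetical 2-D word embeddings (real ones would be learned, e.g. 300-D).
word_vectors = {
    "cat": [0.9, 0.1],
    "dog": [0.8, 0.2],
    "stock": [0.1, 0.9],
    "market": [0.2, 0.8],
}

# Two toy documents, already tokenized.
docs = [
    ["cat", "dog", "cat"],
    ["stock", "market", "market"],
]

def tfidf_weights(doc, corpus):
    """Term frequency within `doc` times inverse document frequency over `corpus`."""
    tf = Counter(doc)
    n_docs = len(corpus)
    weights = {}
    for term, count in tf.items():
        df = sum(1 for d in corpus if term in d)
        idf = math.log(n_docs / df) + 1.0  # +1 so ubiquitous terms keep nonzero weight
        weights[term] = (count / len(doc)) * idf
    return weights

def doc_vector(doc, corpus):
    """TF-IDF-weighted average of the word vectors of the terms in `doc`."""
    weights = tfidf_weights(doc, corpus)
    dim = len(next(iter(word_vectors.values())))
    vec = [0.0] * dim
    total = 0.0
    for term, w in weights.items():
        if term in word_vectors:
            for i, x in enumerate(word_vectors[term]):
                vec[i] += w * x
            total += w
    return [x / total for x in vec] if total else vec

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v0 = doc_vector(docs[0], docs)
v1 = doc_vector(docs[1], docs)
print(cosine(v0, v0))  # a document compared with itself: 1.0
print(cosine(v0, v1))  # documents about different topics score lower
```

The weakness the text points at is visible even here: the document vector is just an aggregate of word vectors, so two documents that invoke different ideas while using overlapping vocabulary can still end up close together.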
As I understand it, one goal of the field (so to speak) would be to generate some kind of representation (one on which operations can be performed: composition, similarity/distance, and so on) of any thought that has a verbal representation (a “thought vector”).