Latent Semantic Analysis in Perl
I’m currently a little into quantitative text analysis. Not big time, but a little. A nice end goal, which I have no intention of reaching, would be a script that can suggest when a cluster of articles on this site seems similar but doesn’t share a tag, or when there’s a tag that might as well not exist because it’s applied to a very broad range of articles. This would help me maintain the tags, which are currently assigned completely manually and, as the observant reader has noticed, rather arbitrarily.
TF-IDF
I have never really done quantitative text analysis before, so I gave myself a soft first task: compute tf-idf scores for the terms used in all articles.
The code to do this is littered with special cases (naïve efforts to strip out most HTML tags, skipping source code, Swedish articles, and drafts, picking out only terms of a minimum length, treating some \w characters as term boundaries, and so on) and probably not of interest to anyone else. The tf-idf computation itself is simple enough with a nested key–value dictionary to keep track of which articles contain each term.
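Stripped of all those special cases, the core of it is little more than this sketch (the names are invented for illustration; the real script isn’t published):

use v5.16;
use strict;
use warnings;

# Nested key-value dictionaries: how often each term occurs in each
# article, and how many articles each term occurs in at all.
my (%tf, %df);

sub count_article {
    my ($slug, $text) = @_;
    my %seen;
    for my $term (grep { length } split /\W+/, lc $text) {
        $tf{$slug}{$term}++;
        $df{$term}++ unless $seen{$term}++;
    }
}

sub tfidf {
    my ($slug, $term) = @_;
    my $n_articles = scalar keys %tf;
    return 0 unless $df{$term};
    my $idf = log( $n_articles / $df{$term} );
    return ($tf{$slug}{$term} // 0) * $idf;
}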
The script produces a text file that lists every term along with the articles in which it appears, showing the term’s total tf-idf score as well as its score in each article, like so:
ada          521.5157  (  171 x 3.0498)
    expressive-ada-2012-challenge                     179.93817
    unicode-strings-in-ada-2012                       118.94218
    guessing-game-ada-style                           106.74298
    timeout-blocking-requests-in-ada                   48.79679
    selective-delay-in-spark-and-ravenscar             30.49799
    ...
    how-much-does-an-experienced-programmer-use-goo     3.04980
    how-laziness-works                                  3.04980
    dotnet-on-non-windows-platforms-brief-historic-     3.04980
probability  488.2770  (  236 x 2.0690)
    what-is-probability                               175.86247
    unreasonable-effectiveness-of-conditional-proba    62.06911
    birthday-line-puzzle                               57.93117
    frequentism-and-false-dichotomies                  41.37940
    markov-chains-for-queueing-systems                 20.68970
    ...
    secure-dns-on-a-laptop-with-debian                  2.06897
    precel-like-excel-for-uncertain-values              2.06897
    survival-analysis-for-customer-retention            2.06897
code         447.9971  (  436 x 1.0275)
    how-laziness-works                                 60.62347
    expressive-ada-2012-challenge                      32.88052
    on-competing-with-c-using-haskell                  19.52281
    the-case-for-controlled-side-effects               18.49529
    guessing-game-ada-style                            18.49529
    ...
    documentation-reference-manual-vs-cookbook          1.02752
    current-email-solution-gpg-agent-offlineimap-not    1.02752
    why-perl                                            1.02752
emacs        393.4511  (  163 x 2.4138)
    why-you-should-buy-into-the-emacs-platform        120.69054
    migrating-away-from-use-package                    55.51765
    on-escape-meta-alt-control-shift-emacs             55.51765
    ...
    starting-spaced-repetition                          2.41381
    dotnet-on-non-windows-platforms-brief-historic-     2.41381
    swedish-colemak-hack                                2.41381
system       377.2019  (  332 x 1.1362)
    markov-chains-for-queueing-systems                 59.07981
    predicting-resource-usage-from-increased-traffic   42.03756
    statistical-process-control-a-practitioners-guid   28.40376
    ...
    sidescrolling-flight-simulator                      1.13615
    fallback-font-and-good-fonts-including-recommend    1.13615
    rethinking-text-input-on-touchscreens               1.13615
The list goes on for 10,000 terms. This is actually way cooler than I would have expected. It associates a word like “code” most strongly with not just articles that contain code, but articles that are, in some sense, about code.
The list of terms is ordered by highest total tf-idf, which means – if my intuition for this is correct – the top terms are ones that discriminate well between articles; they’re terms used often, but only in specific articles.
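If I read the listing right, the total score each term is sorted by is simply its raw count across all articles times its idf:

\[
\mathrm{total}(t) = \Big(\sum_{d} \mathrm{tf}(t, d)\Big) \cdot \mathrm{idf}(t)
\]

which is what the parenthesised “( 171 x 3.0498 )” next to “ada” above spells out.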
Significant Terms
Just over half of the 10,000 terms are used more than once. Here’s a rough histogram of number of terms used once, twice, and so on up to 10 times:
  1: 3850 ######################################
  2: 1530 ###############
  3:  864 #########
  4:  586 ######
  5:  383 ####
  6:  294 ###
  7:  255 ###
  8:  185 ##
  9:  173 ##
 10:  142 #
>10: 1671 #################
I don’t know if this is how the professionals do it, but to make things simpler for myself and reduce the dimensionality of the problem slightly, I picked out a subset of significant terms by keeping only terms that were used more than 10 times and had an idf value higher than 1 (the point being to focus on terms that hopefully appear in multiple places, yet aren’t meaningless terms like “time” or “that” which occur in every other article). This produced a list of 1800 terms.
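For what it’s worth, the filter itself is tiny; assuming hashes %uses and %idf (names invented for this sketch) carried over from the tf-idf step:

# Keep only terms used more than 10 times whose idf is above 1.
# %uses counts total occurrences per term; %idf holds idf per term.
my @significant =
    grep { $uses{$_} > 10 && $idf{$_} > 1 }
    keys %idf;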
Except where noted, all analysis in the rest of this article is done on those 1800 terms only.
Matrix Representation
Mainly for convenience, there’s another script that converts the lookup table of terms-to-documents above to a term–document matrix, where each row is a term, each column is a document, and the corresponding cell contains the tf-idf value of that term in that document.
To keep the associations between terms and articles, the first two lines of the file that contains the term–document matrix are the list of terms in the order they occur in the matrix, followed by the list of articles in the order they occur. It looks something like this (truncated):
prelude alice utctime req gpg lambda agile meal toys ...
what-is-probability statistical-process-control-a-pr ...
0 4.54329 0 0 0 0 0 0 0 0 0 0 0 141.38393 0 0 0 20.6 ...
0 222.62144 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 9.08659 0 0 0 0 ...
40.87633 0 199.90497 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
...
This file accidentally stores the term–document matrix transposed, meaning there’s a row for each document and a column for each term. Conventionally, it’s done the other way around.
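The later scripts read this file back in through a read_termdocmat helper in a small utils module. That module isn’t shown here, but a minimal version of the reader might look like this (file name and error handling invented for the sketch):

use PDL;

# Read the matrix file described above: the first line is the terms, the
# second line is the article slugs, and the rest is the (transposed)
# matrix itself, one row per document.
sub read_termdocmat {
    my $filename = shift // 'termdocmat.txt';   # invented default
    open my $fh, '<', $filename or die "$filename: $!";
    chomp( my $term_line = <$fh> );
    chomp( my $slug_line = <$fh> );
    my @terms = split ' ', $term_line;
    my @slugs = split ' ', $slug_line;
    my @rows;
    while (my $line = <$fh>) {
        push @rows, [ split ' ', $line ];
    }
    close $fh;
    return (\@terms, \@slugs, pdl(\@rows));
}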
K-means Clustering on Raw Data
After that, I tried k-means clustering on the raw term–document matrix. This approach produced garbage. It’s possible to vary how bad the garbage is by tweaking pre-processing rules and parameters, but it is garbage all the same.
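The actual clustering script isn’t shown here, but as a sketch of the sort of thing involved (names invented, and plain Euclidean distance on the tf-idf rows is just one possible choice), a hand-rolled k-means over the document rows of the matrix looks roughly like this:

use PDL;
use List::Util qw( shuffle );

# Rough k-means over the rows of the (transposed) term-document matrix,
# i.e. one row per document, one column per term.
sub k_means {
    my ($docs, $k, $iterations) = @_;
    my $ndocs  = $docs->dim(1);
    # Seed the centroids with $k randomly chosen documents.
    my @seed   = (shuffle 0 .. $ndocs - 1)[0 .. $k - 1];
    my $cent   = $docs->dice_axis(1, pdl(@seed))->copy;
    my $assign = zeroes $ndocs;
    for (1 .. $iterations) {
        # Assign every document to its nearest centroid.
        for my $d (0 .. $ndocs - 1) {
            my $doc   = $docs->slice(":,($d)");
            my $dists = (($cent - $doc->dummy(1, $k)) ** 2)->sumover;
            $assign->set($d, $dists->minimum_ind->sclr);
        }
        # Move each centroid to the mean of its documents.
        for my $c (0 .. $k - 1) {
            my $members = which($assign == $c);
            next unless $members->nelem;
            $cent->slice(":,($c)") .=
                $docs->dice_axis(1, $members)->transpose->average;
        }
    }
    return $assign;    # cluster index per document
}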
The main problem is that it finds a bunch of clusters of just one or two articles (which sometimes, to its credit, are related – but just as often they are not) and then piles almost all articles into a “misc” cluster. Here’s an example of two clusters of non-trivial size. Note how unrelated the articles in them seem to be.
Cluster 7:
    book-review-parallel-agile
    parser-combinators-parsing-for-haskell-beginners
    on-competing-with-c-using-haskell
    starting-to-commute-by-bicycle
    verifiable-software-development-estimations
    predicting-resource-usage-from-increased-traffic
    underrated-technique-sampling
    reading-notes-guide-for-ravenscar-in-high-integrity-systems
    purely-functional-avl-trees-in-common-lisp
    scales-of-plan-driven-development
    timeout-blocking-requests-in-ada

Cluster 8:
    dotnet-on-non-windows-platforms-brief-historic-summary
    haskell-time-library-tutorial
    matrix-is-everything-i-want-irc-to-be
    the-mystery-of-the-deterministic-super-shotgun
    emptying-the-dishwasher-with-systems-theory
For another taste of how devoid of insight the raw term–document matrix is, here are some words and how similar they are judged to be (cosine similarity):
(probability, probabilities) ~ 0.962887911214321
(probability, odds)          ~ 0.145700881235703
(probability, sampling)      ~ 0.0680276850235823
(probability, throughput)    ~ 0.100707873143784
and
(functional, haskell) ~ 0.392247216407078
(procedural, haskell) ~ 0.0174985128544403
(procedural, java)    ~ 0.277575389792799
It’s not that it’s completely wrong all the time, but it seems to be right the way a broken clock is: mainly by accident. Without an external reference, it’s hard to know when it’s right and when it’s wrong.
I had hoped for at least one non-garbage result, so I was a little bummed out, but I figured there ought to be a simple way to reduce the dimensionality and perhaps get more robust results.
Latent Semantic Analysis
I remembered how, back in my university days, I took apart a data set with principal component analysis and made some quite cool discoveries along the way (unfortunately, the assignment was for a data visualisation class, and I had very little time to polish the visualisation after I succeeded with the analysis, so my solution came close to failing). Maybe one can do something like that to the term–document matrix instead?
A quick web search reveals that this is an established technique called latent semantic analysis. And at first, it seems to sort of work. It comes up with some coherent themes I have written about, such as
Concept number 1, top terms:
    probability average probabilities event individual likely outcome
    usually rate events frac compute against answer week distribution
    mathrm theory late alice questions roughly statistical dice process
    variation per chance team train

Concept number 5, top terms:
    haskell parser parse parsing data day month parsers combinators html
    date combinator hackage blog monad attoparsec year maybe turtle
    libraries char parsed text report int convert readp mean pattern
    function

Concept number 6, top terms:
    development projects project authors task model agile design
    implementation complete parts programmers product software code
    deadline tasks parallel domain data cases driven book systems result
    system separate finished market critical

Concept number 7, top terms:
    run server response configuration request emacs servers running
    process events network manager extension requests remote dns users
    system command extend user define defined milliseconds send alice
    connect event land web

Concept number 9, top terms:
    functional code systems haskell image result performance distance
    components length class runs system simple programming sort classes
    loop calls function looks implementations ghc png recursion functions
    essentially level left knowledge
Not all concepts are coherent, but it looks like there’s something intelligent going on there.
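For reference, the machinery underneath is a truncated singular value decomposition of the log-transformed term–document matrix:

\[
X \approx T_k S_k D_k
\]

where S_k holds the k largest singular values on its diagonal, T_k maps terms to concepts, and D_k maps concepts to documents (these correspond to the $tk, $sk and $dk variables in the code at the end of this article). The term similarities below are cosine similarities between term vectors in T_k S_k, and the article similarities later on are cosine similarities between document vectors in S_k D_k.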
What really blew me away was when I started asking it about the similarity between terms:
(probability, probabilities) ~ 0.952301678239469
(probability, odds)          ~ 0.845109841511912
(probability, sampling)      ~ 0.567053242317574
(probability, throughput)    ~ 0.406783545108254
It has also learned something about programming:
(functional, haskell) ~ 0.66153907357199
(procedural, haskell) ~ 0.166412102018881
(procedural, java)    ~ 0.729791346849553
This is baby stuff, of course, but it’s the first time I’ve done it myself, and it’s fascinating.
Unfortunately, when I move closer to the original objective of discovering similarities between articles, this approach also seems to fail. I tweaked the parameters for what counts as a significant term to include in the analysis (specifically, I made it select more on semantic relevance, i.e. high idf, than on number of usages), which does seem to improve article similarity, but makes term similarity a lot worse.
Anyway, these are the best results I’ve had so far:
Closest five articles to the-reinforcing-nature-of-toil:
    quick-variance-computation                       0.778638685528717
    timeboxing-is-not-about-deadlines                0.762039700762091
    learning-some-logarithms                         0.75815903308222
    securing-a-debian-laptop-with-a-firewall         0.748513263154762
    dichotomisation-discards-data                    0.737257487469376

Closest five articles to emptying-the-dishwasher-with-systems-theory:
    verifiable-software-development-estimations      0.724795485238744
    starting-spaced-repetition                       0.719353302217007
    securing-a-debian-laptop-with-a-firewall         0.698859553750974
    ceasing-short-lived-maintenance-of-emacs-versor  0.666658448749324
    what-optimisations-are-not                       0.663293458890785

Closest five articles to on-competing-with-c-using-haskell:
    text-selection-behaves-as-swipe-in-weechat       0.871357916356826
    share-buybacks-are-indirect-investments          0.736398228385749
    emacs-magic-simple-pastebin                      0.664821850265228
    matrix-is-everything-i-want-irc-to-be            0.59431696836662
    confusing-uncertainty-for-information            0.592934531238207

Closest five articles to hidden-cost-of-heroics:
    resize-video-while-keeping-quality-high-with-ffmpeg    0.821504405028828
    why-code-review-matters-suble-often-non-breaking-bugs  0.760687478808574
    historic-mistakes-carriers-and-presses                 0.757533916663089
    mind-the-gap-when-pdsa                                 0.743727177824199
    recordmydesktop-videos-going-out-of-sync               0.676645616522803
This is fairly good, but does seem a bit … just-so. I can’t explain why this particular set of significant terms happens to yield okay-ish recommendations for these specific articles, nor am I very confident it performs well out of sample, given how much I’ve tweaked it to look good here.
Code
It seems unfair to leave you without at least a little code, so here’s some of the latent semantic analysis stuff in Perl:
use v5.16;
use strict;
use warnings;
use autodie qw( :all );

use PDL;
use PDL::NiceSlice;
use PDL::LinearAlgebra;

use utils qw( read_termdocmat cos_sim );

# Read in the term–document matrix.
my ($terms, $slugs, $x) = read_termdocmat();

# Log transform seems to yield better results
# than raw tf-idf values.
$x = log ($x + 1);

# Perform SVD using BLAS/LAPACK and extract out
# term–concept matrix, document–concept matrix,
# and singular values.
my ($v, $s_diag, $fat_u) = $x->msvd;
my $t = $fat_u(0:(nelem($s_diag)-1));
my $s = stretcher $s_diag;
my $d = $v->transpose;

# Truncate to 10 main concepts.
my $k = 10;
my $tk = $t(0:($k-1));
my $sk = $s(0:($k-1), 0:($k-1));
my $dk = $d(:, 0:($k-1));

# These are the concepts represented by each term.
my $concepts = $tk x $sk;

sub term_sim {
    # Ugly code I'm embarrassed to publish please don't hate me.
    my ($ta, $tb) = @_;
    my @terms = @$terms;
    my $ia = (grep { $terms[$_] eq $ta } 0..$#terms)[0];
    my $ib = (grep { $terms[$_] eq $tb } 0..$#terms)[0];
    my $ca = $concepts(:, ($ia));
    my $cb = $concepts(:, ($ib));
    my $c = cos_sim $ca, $cb;
    say "Similarity between $ta and $tb is $c";
}

term_sim "functional", "haskell";
term_sim "procedural", "haskell";
term_sim "procedural", "java";

# These are the concepts each article is about.
my $docconc = $sk x $dk;

sub doc_closest {
    # Ugly code I'm embarrassed to publish please don't hate me.
    my ($sp) = @_;
    my @slugs = @$slugs;
    my $ia = (grep { $slugs[$_] =~ $sp } 0..$#slugs)[0];
    my $ca = $docconc(($ia));
    my @dists;
    for my $ib (0..$#slugs) {
        my $cb = $docconc(($ib));
        my $c = cos_sim $ca, $cb;
        push @dists, $c;
    }
    say "";
    say "";
    say "Closest five articles to $slugs[$ia]:";
    my @clost = sort { $dists[$b] <=> $dists[$a] } 0..$#dists;
    ## Skipping 0th because it's going to be the article itself.
    for (@clost[1..5]) {
        say "    $slugs[$_] $dists[$_]";
    }
}

doc_closest qr/toil/i;
doc_closest qr/dishwasher/i;
doc_closest qr/competing/i;
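The read_termdocmat helper was sketched earlier; cos_sim, the other import from the utils module (not shown here), is just ordinary cosine similarity, along the lines of:

use PDL;

# Cosine similarity between two 1D piddles (a sketch; the real helper
# lives in the utils module that isn't published).
sub cos_sim {
    my ($u, $v) = @_;
    return inner($u, $v) / ( sqrt(inner($u, $u)) * sqrt(inner($v, $v)) );
}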
Next Steps
I think the main problem that needs to be solved before moving on is that of diagnostics. I can’t really say why certain parameters work better and others worse. I don’t have a good way of looking at articles the way the model sees them. I also don’t know how I would do that, because I still have some trouble interpreting the matrices involved in the latent semantic analysis.
Another approach is to just move on from linear algebra and crack open neural networks, and try to write a model based on word embeddings. I have no faith that would work better, but it is something I’ve always wanted to experiment with.