Latent Semantic Analysis in Perl
I’m currently a little into quantitative text analysis. Not big time, but a little. A nice end goal, which I have no intention of reaching, would be a script that can suggest when a cluster of articles on this site seems similar but doesn’t share a tag, or when there’s a tag that might as well not exist because it’s applied to a very broad range of articles. This would help me maintain the tags, which are currently assigned completely manually and, as the observant reader has noticed, rather arbitrarily.
TF-IDF
I have never really done quantitative text analysis before, so I gave myself a soft first task: compute tf-idf scores for the terms used in all articles.
The code to do this is littered with special cases (naïve efforts to strip out most HTML tags, skipping source code, Swedish articles, and drafts, picking out only terms of a minimum length, treating some \w characters as term boundaries, and so on) and probably not of interest to anyone else. The tf-idf computation itself is simple enough with a nested key–value dictionary to keep track of which articles contain each term.
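Stripped of all those special cases, the core of it is little more than this sketch (the names are invented for illustration; the real script isn’t published):

use v5.16;
use strict;
use warnings;

# Nested key-value dictionaries: how often each term occurs in each
# article, and how many articles each term occurs in at all.
my (%tf, %df);

sub count_article {
    my ($slug, $text) = @_;
    my %seen;
    for my $term (grep { length } split /\W+/, lc $text) {
        $tf{$slug}{$term}++;
        $df{$term}++ unless $seen{$term}++;
    }
}

sub tfidf {
    my ($slug, $term) = @_;
    my $n_articles = scalar keys %tf;
    return 0 unless $df{$term};
    my $idf = log( $n_articles / $df{$term} );
    return ($tf{$slug}{$term} // 0) * $idf;
}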
The script produces a text file that lists every term along with the articles in which it appears, showing the term’s total tf-idf score as well as its score in each article, like so:
ada          521.5157  (  171 x 3.0498)
    expressive-ada-2012-challenge                     179.93817
    unicode-strings-in-ada-2012                       118.94218
    guessing-game-ada-style                           106.74298
    timeout-blocking-requests-in-ada                   48.79679
    selective-delay-in-spark-and-ravenscar             30.49799
    ...
    how-much-does-an-experienced-programmer-use-goo     3.04980
    how-laziness-works                                  3.04980
    dotnet-on-non-windows-platforms-brief-historic-     3.04980
probability  488.2770  (  236 x 2.0690)
    what-is-probability                               175.86247
    unreasonable-effectiveness-of-conditional-proba    62.06911
    birthday-line-puzzle                               57.93117
    frequentism-and-false-dichotomies                  41.37940
    markov-chains-for-queueing-systems                 20.68970
    ...
    secure-dns-on-a-laptop-with-debian                  2.06897
    precel-like-excel-for-uncertain-values              2.06897
    survival-analysis-for-customer-retention            2.06897
code         447.9971  (  436 x 1.0275)
    how-laziness-works                                 60.62347
    expressive-ada-2012-challenge                      32.88052
    on-competing-with-c-using-haskell                  19.52281
    the-case-for-controlled-side-effects               18.49529
    guessing-game-ada-style                            18.49529
    ...
    documentation-reference-manual-vs-cookbook          1.02752
    current-email-solution-gpg-agent-offlineimap-not    1.02752
    why-perl                                            1.02752
emacs        393.4511  (  163 x 2.4138)
    why-you-should-buy-into-the-emacs-platform        120.69054
    migrating-away-from-use-package                    55.51765
    on-escape-meta-alt-control-shift-emacs             55.51765
    ...
    starting-spaced-repetition                          2.41381
    dotnet-on-non-windows-platforms-brief-historic-     2.41381
    swedish-colemak-hack                                2.41381
system       377.2019  (  332 x 1.1362)
    markov-chains-for-queueing-systems                 59.07981
    predicting-resource-usage-from-increased-traffic   42.03756
    statistical-process-control-a-practitioners-guid   28.40376
    ...
    sidescrolling-flight-simulator                      1.13615
    fallback-font-and-good-fonts-including-recommend    1.13615
    rethinking-text-input-on-touchscreens               1.13615
The list goes on for 10,000 terms. This is actually way cooler than I would have expected. It associates a word like “code” most strongly with not just articles that contain code, but articles that are, in some sense, about code.
The list of terms is ordered by highest total tf-idf, which means – if my intuition for this is correct – the top terms are ones that discriminate well between articles; they’re terms used often, but only in specific articles.
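If I read the listing right, the total score each term is sorted by is simply its raw count across all articles times its idf:

\[
\mathrm{total}(t) = \Big(\sum_{d} \mathrm{tf}(t, d)\Big) \cdot \mathrm{idf}(t)
\]

which is what the parenthesised “( 171 x 3.0498 )” next to “ada” above spells out.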
Significant Terms
Just over half of the 10,000 terms are used more than once. Here’s a rough histogram of number of terms used once, twice, and so on up to 10 times:
  1: 3850 ######################################
  2: 1530 ###############
  3:  864 #########
  4:  586 ######
  5:  383 ####
  6:  294 ###
  7:  255 ###
  8:  185 ##
  9:  173 ##
 10:  142 #
>10: 1671 #################
I don’t know if this is how the professionals do it, but to make things simpler for myself and reduce the dimensionality of the problem slightly, I picked out a subset of significant terms by keeping only terms that were used more than 10 times and had an idf value higher than 1 (the point being to focus on terms that hopefully appear in multiple places, yet aren’t meaningless terms like “time” or “that” which occur in every other article). This produced a list of 1800 terms.
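For what it’s worth, the filter itself is tiny; assuming hashes %uses and %idf (names invented for this sketch) carried over from the tf-idf step:

# Keep only terms used more than 10 times whose idf is above 1.
# %uses counts total occurrences per term; %idf holds idf per term.
my @significant =
    grep { $uses{$_} > 10 && $idf{$_} > 1 }
    keys %idf;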
Except where noted, all analysis in the rest of this article is done on those 1800 terms only.
Matrix Representation
Mainly for convenience, there’s another script that converts the lookup table of terms-to-documents above to a term–document matrix, where each row is a term, each column is a document, and the corresponding cell contains the tf-idf value of that term in that document.
To keep the associations between terms and articles, the first two lines of the file that contains the term–document matrix are the list of terms in the order they occur in the matrix, followed by the list of articles in the order they occur. It looks something like this (truncated):
prelude alice utctime req gpg lambda agile meal toys ...
what-is-probability statistical-process-control-a-pr ...
0 4.54329 0 0 0 0 0 0 0 0 0 0 0 141.38393 0 0 0 20.6 ...
0 222.62144 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 9.08659 0 0 0 0 ...
40.87633 0 199.90497 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
...
This file accidentally stores the term–document matrix transposed, meaning there’s a row for each document and a column for each term. Conventionally, it’s done the other way around.
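The later scripts read this file back in through a read_termdocmat helper in a small utils module. That module isn’t shown here, but a minimal version of the reader might look like this (file name and error handling invented for the sketch):

use PDL;

# Read the matrix file described above: the first line is the terms, the
# second line is the article slugs, and the rest is the (transposed)
# matrix itself, one row per document.
sub read_termdocmat {
    my $filename = shift // 'termdocmat.txt';   # invented default
    open my $fh, '<', $filename or die "$filename: $!";
    chomp( my $term_line = <$fh> );
    chomp( my $slug_line = <$fh> );
    my @terms = split ' ', $term_line;
    my @slugs = split ' ', $slug_line;
    my @rows;
    while (my $line = <$fh>) {
        push @rows, [ split ' ', $line ];
    }
    close $fh;
    return (\@terms, \@slugs, pdl(\@rows));
}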
K-means Clustering on Raw Data
After that, I tried k-means clustering on the raw term–document matrix. This approach produced garbage. It’s possible to vary how bad the garbage is by tweaking pre-processing rules and parameters, but it is garbage all the same.
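The actual clustering script isn’t shown here, but as a sketch of the sort of thing involved (names invented, and plain Euclidean distance on the tf-idf rows is just one possible choice), a hand-rolled k-means over the document rows of the matrix looks roughly like this:

use PDL;
use List::Util qw( shuffle );

# Rough k-means over the rows of the (transposed) term-document matrix,
# i.e. one row per document, one column per term.
sub k_means {
    my ($docs, $k, $iterations) = @_;
    my $ndocs  = $docs->dim(1);
    # Seed the centroids with $k randomly chosen documents.
    my @seed   = (shuffle 0 .. $ndocs - 1)[0 .. $k - 1];
    my $cent   = $docs->dice_axis(1, pdl(@seed))->copy;
    my $assign = zeroes $ndocs;
    for (1 .. $iterations) {
        # Assign every document to its nearest centroid.
        for my $d (0 .. $ndocs - 1) {
            my $doc   = $docs->slice(":,($d)");
            my $dists = (($cent - $doc->dummy(1, $k)) ** 2)->sumover;
            $assign->set($d, $dists->minimum_ind->sclr);
        }
        # Move each centroid to the mean of its documents.
        for my $c (0 .. $k - 1) {
            my $members = which($assign == $c);
            next unless $members->nelem;
            $cent->slice(":,($c)") .=
                $docs->dice_axis(1, $members)->transpose->average;
        }
    }
    return $assign;    # cluster index per document
}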
The main problem is that it finds a bunch of clusters of just one or two articles (which sometimes, to its credit, are related – but just as often they are not) and then piles almost all articles into a “misc” cluster. Here’s an example of two clusters of non-trivial size. Note how unrelated the articles in them seem to be.
Cluster 7:
    book-review-parallel-agile
    parser-combinators-parsing-for-haskell-beginners
    on-competing-with-c-using-haskell
    starting-to-commute-by-bicycle
    verifiable-software-development-estimations
    predicting-resource-usage-from-increased-traffic
    underrated-technique-sampling
    reading-notes-guide-for-ravenscar-in-high-integrity-systems
    purely-functional-avl-trees-in-common-lisp
    scales-of-plan-driven-development
    timeout-blocking-requests-in-ada

Cluster 8:
    dotnet-on-non-windows-platforms-brief-historic-summary
    haskell-time-library-tutorial
    matrix-is-everything-i-want-irc-to-be
    the-mystery-of-the-deterministic-super-shotgun
    emptying-the-dishwasher-with-systems-theory
For another taste of how devoid of insight the raw term–document matrix is, here are some words and how similar they are judged to be (cosine similarity):
(probability, probabilities) ~ 0.962887911214321
(probability, odds)          ~ 0.145700881235703
(probability, sampling)      ~ 0.0680276850235823
(probability, throughput)    ~ 0.100707873143784
and
(functional, haskell) ~ 0.392247216407078
(procedural, haskell) ~ 0.0174985128544403
(procedural, java)    ~ 0.277575389792799
It’s not that it’s completely wrong all the time, but it seems to be right the way a broken clock is: mainly by accident. Without an external reference, it’s hard to know when it’s right and when it’s wrong.
I had hoped for at least one non-garbage result, so I was a little bummed out, but I figured there ought to be a simple way to reduce the dimensionality and perhaps get more robust results.
Latent Semantic Analysis
I remembered how, back in my university days, I took apart a data set with principal component analysis and made some quite cool discoveries along the way (unfortunately, the assignment was for a data visualisation class, and I had very little time to polish the visualisation after I succeeded with the analysis, so my solution came close to failing). Maybe one can do something like that to the term–document matrix instead?
A quick web search reveals that this is an established technique called latent semantic analysis. And at first, it seems to sort of work. It comes up with some coherent themes I have written about, such as
Concept number 1, top terms:
    probability average probabilities event individual likely outcome
    usually rate events frac compute against answer week distribution
    mathrm theory late alice questions roughly statistical dice process
    variation per chance team train

Concept number 5, top terms:
    haskell parser parse parsing data day month parsers combinators html
    date combinator hackage blog monad attoparsec year maybe turtle
    libraries char parsed text report int convert readp mean pattern
    function

Concept number 6, top terms:
    development projects project authors task model agile design
    implementation complete parts programmers product software code
    deadline tasks parallel domain data cases driven book systems result
    system separate finished market critical

Concept number 7, top terms:
    run server response configuration request emacs servers running
    process events network manager extension requests remote dns users
    system command extend user define defined milliseconds send alice
    connect event land web

Concept number 9, top terms:
    functional code systems haskell image result performance distance
    components length class runs system simple programming sort classes
    loop calls function looks implementations ghc png recursion functions
    essentially level left knowledge
Not all concepts are coherent, but it looks like there’s something intelligent going on there.
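For reference, the machinery underneath is a truncated singular value decomposition of the log-transformed term–document matrix:

\[
X \approx T_k S_k D_k
\]

where S_k holds the k largest singular values on its diagonal, T_k maps terms to concepts, and D_k maps concepts to documents (these correspond to the $tk, $sk and $dk variables in the code at the end of this article). The term similarities below are cosine similarities between term vectors in T_k S_k, and the article similarities later on are cosine similarities between document vectors in S_k D_k.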
What really blew me away was when I started asking it about the similarity between terms:
(probability, probabilities) ~ 0.952301678239469
(probability, odds)          ~ 0.845109841511912
(probability, sampling)      ~ 0.567053242317574
(probability, throughput)    ~ 0.406783545108254
It has also learned something about programming:
(functional, haskell) ~ 0.66153907357199
(procedural, haskell) ~ 0.166412102018881
(procedural, java)    ~ 0.729791346849553
This is baby stuff, of course, but it’s the first time I’ve done it myself, and it’s fascinating.
Unfortunately, when I move closer to the original objective of discovering similarities between articles, this approach also seems to fail. I tweaked the parameters for what counts as a significant term to include in the analysis (specifically, I made it select more on semantic relevance, i.e. high idf, than on number of usages), which does seem to improve article similarity, but makes term similarity a lot worse.
Anyway, these are the best results I’ve had so far:
Closest five articles to the-reinforcing-nature-of-toil:
    quick-variance-computation                       0.778638685528717
    timeboxing-is-not-about-deadlines                0.762039700762091
    learning-some-logarithms                         0.75815903308222
    securing-a-debian-laptop-with-a-firewall         0.748513263154762
    dichotomisation-discards-data                    0.737257487469376

Closest five articles to emptying-the-dishwasher-with-systems-theory:
    verifiable-software-development-estimations      0.724795485238744
    starting-spaced-repetition                       0.719353302217007
    securing-a-debian-laptop-with-a-firewall         0.698859553750974
    ceasing-short-lived-maintenance-of-emacs-versor  0.666658448749324
    what-optimisations-are-not                       0.663293458890785

Closest five articles to on-competing-with-c-using-haskell:
    text-selection-behaves-as-swipe-in-weechat       0.871357916356826
    share-buybacks-are-indirect-investments          0.736398228385749
    emacs-magic-simple-pastebin                      0.664821850265228
    matrix-is-everything-i-want-irc-to-be            0.59431696836662
    confusing-uncertainty-for-information            0.592934531238207

Closest five articles to hidden-cost-of-heroics:
    resize-video-while-keeping-quality-high-with-ffmpeg    0.821504405028828
    why-code-review-matters-suble-often-non-breaking-bugs  0.760687478808574
    historic-mistakes-carriers-and-presses                 0.757533916663089
    mind-the-gap-when-pdsa                                 0.743727177824199
    recordmydesktop-videos-going-out-of-sync               0.676645616522803
This is fairly good, but does seem a bit … just-so. I can’t explain why this particular set of significant terms happens to yield okay-ish recommendations for these specific articles, nor am I very confident it performs well out of sample, given how much I’ve tweaked it to look good here.
Code
It seems unfair to leave you without at least a little code, so here’s some of the latent semantic analysis stuff in Perl:
use v5.16;
use strict;
use warnings;
use autodie qw( :all );

use PDL;
use PDL::NiceSlice;
use PDL::LinearAlgebra;

use utils qw( read_termdocmat cos_sim );

# Read in the term–document matrix.
my ($terms, $slugs, $x) = read_termdocmat();

# Log transform seems to yield better results
# than raw tf-idf values.
$x = log ($x + 1);

# Perform SVD using BLAS/LAPACK and extract out
# term–concept matrix, document–concept matrix,
# and singular values.
my ($v, $s_diag, $fat_u) = $x->msvd;
my $t = $fat_u(0:(nelem($s_diag)-1));
my $s = stretcher $s_diag;
my $d = $v->transpose;

# Truncate to 10 main concepts.
my $k = 10;
my $tk = $t(0:($k-1));
my $sk = $s(0:($k-1), 0:($k-1));
my $dk = $d(:, 0:($k-1));

# These are the concepts represented by each term.
my $concepts = $tk x $sk;

sub term_sim {
    # Ugly code I'm embarrassed to publish please don't hate me.
    my ($ta, $tb) = @_;
    my @terms = @$terms;
    my $ia = (grep { $terms[$_] eq $ta } 0..$#terms)[0];
    my $ib = (grep { $terms[$_] eq $tb } 0..$#terms)[0];
    my $ca = $concepts(:, ($ia));
    my $cb = $concepts(:, ($ib));
    my $c = cos_sim $ca, $cb;
    say "Similarity between $ta and $tb is $c";
}

term_sim "functional", "haskell";
term_sim "procedural", "haskell";
term_sim "procedural", "java";

# These are the concepts each article is about.
my $docconc = $sk x $dk;

sub doc_closest {
    # Ugly code I'm embarrassed to publish please don't hate me.
    my ($sp) = @_;
    my @slugs = @$slugs;
    my $ia = (grep { $slugs[$_] =~ $sp } 0..$#slugs)[0];
    my $ca = $docconc(($ia));
    my @dists;
    for my $ib (0..$#slugs) {
        my $cb = $docconc(($ib));
        my $c = cos_sim $ca, $cb;
        push @dists, $c;
    }
    say "";
    say "";
    say "Closest five articles to $slugs[$ia]:";
    my @clost = sort { $dists[$b] <=> $dists[$a] } 0..$#dists;
    ## Skipping 0th because it's going to be the article itself.
    for (@clost[1..5]) {
        say "    $slugs[$_] $dists[$_]";
    }
}

doc_closest qr/toil/i;
doc_closest qr/dishwasher/i;
doc_closest qr/competing/i;
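The read_termdocmat helper was sketched earlier; cos_sim, the other import from the utils module (not shown here), is just ordinary cosine similarity, along the lines of:

use PDL;

# Cosine similarity between two 1D piddles (a sketch; the real helper
# lives in the utils module that isn't published).
sub cos_sim {
    my ($u, $v) = @_;
    return inner($u, $v) / ( sqrt(inner($u, $u)) * sqrt(inner($v, $v)) );
}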
Next Steps
I think the main problem that needs to be solved before moving on is that of diagnostics. I can’t really say why certain parameters work better and others worse. I don’t have a good way of looking at articles the way the model sees them. I also don’t know how I would do that, because I still have some trouble interpreting the matrices involved in the latent semantic analysis.
Another approach is to just move on from linear algebra and crack open neural networks, and try to write a model based on word embeddings. I have no faith that would work better, but it is something I’ve always wanted to experiment with.