Tagnostic: Determine Tag Quality

Using tags is hard. It’s easy to accidentally fall into tagging anti-patterns, such as creating tags that are synonyms of existing tags, one-off tags for single pieces of content, or tags so vague as to be meaningless in practice. I know because I made those mistakes on this site!

Did you know, for example, that there were 14 tags on this site that were only applied to one article? Or that the notes tag was applied haphazardly to articles with little or no connection to each other? I sure didn’t know, although I did harbour suspicions.

I have taken steps to fix that problem, and along the way, I created some software to help out. This article announces Tagnostic! Here’s how it works.

Productivity boosts happen when computers understand human intention

What strikes me about using Tagnostic is how amazing its ergonomics are. It feels like having a discussion with the tag assignments: we look at a list of all tags, decide on a target for improvement based on a meaningful summary metric, and then drill down into its assignments. We make adjustments to the tag assignments, and since Tagnostic caches embeddings locally, regenerating the analysis is lightning fast. Back and forth it goes, and the tagging is iteratively refined.

This would not have been possible 15 years ago. Without embeddings1 Or some other way to numerically compare the similarity of intentions behind pieces of text., we would have had to resort to a panel of experts to provide an independent opinion on which articles a tag is relevant to, and that would have been both expensive and slow. It certainly would not feel like a conversation with the tag assignments: Imagine having waited a week for the panel to come up with a relevance order for all tags on this site, and then when they finally hand over the order, you go,

Hey, I’m just spitballing here, but what about a tag called beginner, for beginner material across many domains? How would you order articles according to relevance to that?

They would beat you up! Or at least ask for more money and take a couple of days to do it.

In contrast, Tagnostic takes a second to generate the new embedding and then barfs out a new list of articles. Then it does it again, and again, and for however long your OpenAI credits last – which is really freaking long when all you do is ask for embeddings!

Viewing how well tags agree with content

Tagnostic is a 150-line Perl script along with a 40-line shell script. They are executed together like this:

$ tagnostic.pl --agent=utils/tagnostic-agent.sh -- \
    --all
-0.02     8                notes
 0.03     1                 foss
 0.04     1              reading
 0.05     2    computing_science
 0.06     2                    c
 0.06     3                 perl
 0.07     3           technology
 0.08     3              science
 0.10     1           javascript
 0.10     1          probability
 0.10     1                 arts
 0.10     1                baduk
 0.12    11                 meta
 0.13     7    software_problems
 0.14     2              writing
 0.15     2              economy
 0.15     2            petpeeves
 0.15     2              cycling
 0.15     2                  awk
 0.16     3               design
 0.17     4                 unix
 0.17     8                  web
 0.18     3           typography
 0.18     3               dotnet
 0.18     3              swedish
 0.19     4                 lisp
 0.21     4            shorthand
 0.21     4                  vim
 0.22     5         work_culture
 0.23    72                 life
 0.25     6                  ada
 0.26     8             security
 0.29     8          photography
 0.30    14             sysadmin
 0.31    14           management
 0.31    11                maths
 0.33    12                emacs
 0.34    38             learning
 0.35    14  product_development
 0.39    40     meta_programming
 0.40    16              haskell
 0.42    20        organisations
 0.47    69          programming
 0.52    31          forecasting
 0.67    63           statistics

This is a complete listing of all published tags at the time I started working on Tagnostic. The first number is the agreement score – a correlation – measuring how well a tag agrees with the content it is assigned to. The second number is how many times that tag is used. As an example, the science tag is applied to 3 articles and it weakly agrees with their content: the correlation is +0.08.

From this, we can read off that the notes tag is the worst of them all, with an agreement that is actually slightly negative. We also see a few single-use tags that can probably be removed.

Diagnosing specific tags

To get an idea of what we can learn by zooming in on a specific tag, let’s use sysadmin as an example. Note that this is a fairly high-quality tag (+0.30 correlation), so we might not expect to learn much. We truncate the output to the last 25 lines.

$ tagnostic.pl --agent=tagnostic-agent.sh -- \
    sysadmin | tail -n 25
                display-backlight-keys-on-tp300la.org
                current-email-solution-gpg-agent-offlineimap-notmuch-alo...
                statistical-process-control-a-practitioners-guide.org
                trying-a-more-vanilla-fedora.org
                awk-state-machine-parser-pattern.org
                hidden-cost-of-heroics.org
       sysadmin the-reinforcing-nature-of-toil.org
       sysadmin imap-smtp-port-numbers-for-google-mail.org
                crash-only-software-on-the-desktop-please.org
                tindall-on-software-delays.org
                reading-notes-guide-for-ravenscar-in-high-integrity-syst...
       sysadmin checklist-for-renewing-gpg-subkeys.org
                the-parking-lot-drill.org
                stop-using-junior-and-senior.org
                temporarily-disabling-iptables.org
                why-you-should-buy-into-the-emacs-platform.org
       sysadmin rsync-net.org
                programmers-and-non-coding-work.org
                connecting-to-kth-eduroam-on-debian-stretch.org
       sysadmin system-observability-metrics-sampling-tracing.org
                fast-sql-for-inheritance-in-a-django-hierarchy.org
       sysadmin basic-computer-security-things-i-want-to-explore.org
       sysadmin basic-firewall-configuration-iptables.org
                securing-a-debian-laptop-with-a-firewall.org
       sysadmin an-update-a-week-keeps-the-hackers-away.org
       sysadmin passwordless-sudo.org
Agreement:  0.30

This gives a complete listing of published articles, sorted by how relevant they are to the tag sysadmin, as determined by the script. The left margin indicates which articles are actually tagged sysadmin.

I spot a lot of mistakes in applying this tag – and recall that this was one of the higher quality tags! We can see, for example, that the old Debian firewall article is not tagged, but really ought to be. There’s also an old article that’s more general computer usage than sysadmin stuff, yet it is tagged sysadmin.

Evaluating tag renaming

When I look at the diagnostic listing for the tag computing_science, I get the feeling that what the tag really captures is some sort of algorithms idea. We can ask the software whether the relevance of the tag would improve if we renamed it to algorithms instead.

$ tagnostic.pl --agent=tagnostic-agent.sh -- \
    computing_science algorithms | tail -n 25
                laws-of-software-evolution.org
                student-bootstrapped-t-distribution.org
                code-review-checklist.org
     algorithms awk-state-machine-parser-pattern.org
                precel-like-excel-for-uncertain-values.org
                programming-apprenticeships.org
     algorithms latent-semantic-analysis-in-perl.org
                problem-learning-and-information.org
                guessing-game-ada-style.org
                expressive-limitations-of-oop.org
                software-design-tree-and-program-families.org
                grep-sed-and-awk-the-right-tool-for-the-job.org
     algorithms reading-notes-guide-for-ravenscar-in-high-integrity-syste...
                bayesian-statistics.org
                how-much-does-an-experienced-programmer-use-google.org
     algorithms top-down-vs-bottom-up-programming.org
     algorithms optimise-the-expensive-first.org
     algorithms purely-functional-avl-trees-in-common-lisp.org
                abduction-is-not-induction.org
     algorithms myth-of-the-day-functional-programmers-dont-use-loops.org
     algorithms software-and-orders-of-magnitude.org
     algorithms computing-science-dictionary.org
     algorithms intuition-for-time-complexity-of-algorithms.org
Agreement:  0.26
Agreement (renamed):  0.37

Renaming this tag improves its relevance by quite a bit (from +0.26 to +0.37), so that seems like a sensible thing to do.

Evaluating new tag ideas

We don’t have to give Tagnostic an existing tag to diagnose: we can also just type in anything, and it will give us a listing of the articles most relevant to that word.

$ tagnostic.pl --agent=tagnostic-agent.sh -- \
    psychology | tail -n 10
                statistical-process-control-a-practitioners-guide.org
                publish-your-observations.org
                hidden-cost-of-heroics.org
                problem-learning-and-information.org
                improving-forecasting-accuracy-benchmark.org
                validate-your-skill.org
                abduction-is-not-induction.org
                frequentism-and-false-dichotomies.org
                confusing-uncertainty-for-information.org
Agreement: -1.00

Because there are no tag assignments for psychology (we just invented a new potential tag, after all), the agreement score is not meaningful. But what is valuable here is looking at this group of most relevant articles and deciding whether there is a point in creating a new psychology tag. My judgement is that there isn’t. There are at best five or so articles that could really use it, so meh.

Agreement is correlation between embedding and application

The fundamental insight that drove this implementation is that there are, in some sense, two sets of tag applications for any given tag:

  • One set of actual assignments, which are the articles that actually have a tag applied to them.
  • One imaginary, perfect assignment, which is when the same tag is applied to just the articles for which it is relevant.

In some sense, the quality of the actual tag assignment can be thought of as how closely it matches the perfect assignment. If we had lots of money to throw at this problem, we could ask a panel of experts to assign tags to articles independently of us, and then measure how large the difference is between the actual assignments and the expert assignments. A large difference indicates that two sensible people associate different things with a tag, which means the tag is vague and impossible to assign well. That’s a low-quality tag.

But if we don’t have lots of money to spend on this problem, we can approximate: we ask the computer to order articles by how relevant they are to a tag, using embeddings to determine relevance. Embeddings are a somewhat compact, numeric representation of what a text means, and, critically, they can be meaningfully compared to each other. Embeddings that have a small cosine distance are probably created from texts that are somehow related to each other.2 Simon Willison’s llm command-line tool makes it very convenient to ask OpenAI for the embeddings of any piece of text.
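
To make that comparison concrete, here is a minimal sketch of cosine similarity in Perl. This is not Tagnostic’s actual code; it assumes each embedding has already been parsed into an array reference of numbers (for instance, from the comma-separated list the agent script in Appendix B produces).

# Sketch, not Tagnostic's actual code: cosine similarity between two
# embeddings, each given as an array reference of numbers. Values near 1
# suggest the underlying texts are related; values near 0 suggest they
# are not. Cosine distance is simply 1 minus this value.
sub cosine_similarity {
    my ($u, $v) = @_;
    my ($dot, $norm_u, $norm_v) = (0, 0, 0);
    for my $i (0 .. $#$u) {
        $dot    += $u->[$i] * $v->[$i];
        $norm_u += $u->[$i] ** 2;
        $norm_v += $v->[$i] ** 2;
    }
    return $dot / (sqrt($norm_u) * sqrt($norm_v));
}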

What we get when we do this is that list of articles we saw when we ran Tagnostic with a specific tag to diagnose – in the previous example it was the tag sysadmin. What we want to see is agreement between how the tag is actually applied, and the panel of exper^W^W^W embedding-based order. If we see something else, the tag is perhaps too vague, or too sloppily applied, to be useful.

We don’t need to eyeball this agreement from an article list, though. We can quantify it through the Pearson correlation coefficient3 Actually the way it’s commonly calculated in this case is through the point-biserial correlation, but it’s the same thing, only an easier way to calculate it.. This gives a score between -1 and +1 depending on whether the tag as applied is anti-relevant (-1), irrelevant (0), or relevant (+1) according to the model used to derive the embeddings. This is the agreement score shown in the first listing.
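
In code, the agreement score is just the ordinary correlation formula applied to one binary variable (is the tag assigned to this article?) and one continuous variable (how similar is the article’s embedding to the tag’s embedding?). A minimal sketch, again not Tagnostic’s actual code and with names of my own choosing:

# Agreement between actual tag assignments and embedding-based relevance.
# @$applied holds 0 or 1 per article (tag assigned?); @$relevance holds
# the corresponding similarity scores. This is plain Pearson correlation,
# which with one dichotomous variable is the point-biserial correlation.
sub agreement {
    my ($applied, $relevance) = @_;
    my $n = scalar @$applied;
    my ($mean_a, $mean_r) = (0, 0);
    $mean_a += $_ / $n for @$applied;
    $mean_r += $_ / $n for @$relevance;
    my ($cov, $var_a, $var_r) = (0, 0, 0);
    for my $i (0 .. $n - 1) {
        $cov   += ($applied->[$i] - $mean_a) * ($relevance->[$i] - $mean_r);
        $var_a += ($applied->[$i] - $mean_a) ** 2;
        $var_r += ($relevance->[$i] - $mean_r) ** 2;
    }
    # A tag applied to no articles (or to all of them) makes the
    # denominator zero, so that degenerate case needs special handling.
    return $cov / sqrt($var_a * $var_r);
}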

Accuracy concerns are overblown

We could be concerned about accuracy – are the embeddings really good enough to capture the intent of an article in relation to a tag? Heck if I know – but it doesn’t matter. They don’t need to be. We were joking about “a panel of experts” but really, anyone off the street would probably suffice. The goal is not so much to find that ideal order as to uncover how large the variation is between different people when they try to assign a tag. We want to root out large discrepancies between the judgments of sensible people, because that indicates ambiguity, which is what we want to avoid.

The embeddings are not correct, but they are roughly correct often enough that they can be used for this purpose.

Appendix A: Adjustments based on analysis

I have been making a few changes in response to what I learned from Tagnostic. Here is a changelog:

  • Removed the following tags:
    • notes
    • foss
    • arts
  • Renamed the following tags:
    • computing_science is now algorithms
    • baduk is now games
  • Merged the following tags:
    • probability into statistics4 I’m not entirely happy with this, but I think it makes sense for now. The crowds that are interested in these likely overlap a lot.
  • Applied the following tags more broadly:
    • reading
    • algorithms
    • games
    • writing
  • Fixed a few gaps in various tag applications.

Just these few pieces of low-hanging fruit have brought the mean agreement score of the tags on this site from 0.22 to 0.26. I plan on continuing this work of refining the tags now that I have all the tooling I need for it.

Appendix B: The agent script for Entropic Thoughts

Since Tagnostic is a very dumb 150 lines of Perl, and the most interesting stuff happens in the agent script, I figured I might share the agent script I use for this site. Unfortunately, it is also rather uninteresting, because it in turn shells out the interesting work to the llm cli – which in turn forwards the request to OpenAI … yay, abstractions!

#!/bin/sh

script_dir=$(dirname "$(readlink -f "$0")")

case "$1" in

    list-content)
        find "$script_dir/../src/org" -type f | while read -r fn; do
            if grep -qE ':page:|:draft:' "$fn"; then
                continue
            fi
            tags=$(
                cat "$fn" \
                    | perl -lne '/FILETAGS: :([a-z:_]+):/ && print ($1 =~ tr/:/,/r)'
                )
            name=$(basename "$fn")
            hash=$(sha256sum "$fn" | cut -d' ' -f 1)
            echo "$name,$hash,$tags"
        done
        ;;

    embed-content)
        name=${2:?}
        # The first 500 lines should be enough to be accurate
        # but without breaking any context window limits.
        #
        # We also remove any lines consisting only of numbers
        # because they aren't helpful and become many tokens.
        cat "$script_dir/../src/org/$name" \
            | grep -vE '^[0-9., -]+$' \
            | head -n 500 \
            | llm embed -m 3-small \
            | sed 's/[][ ]//g'
        ;;

    embed-tag)
        tag=${2:?}
        llm embed -m 3-small -c "$tag" \
            | sed 's/[][ ]//g'
        ;;

    *)
        echo "Unrecognised subcommand."
        ;;

esac
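
For completeness, here is a rough sketch of how a caller might consume this agent protocol: parse the list-content output and use the content hash as a cache key, so embeddings are only requested when a file has changed. The variable names and the caching scheme are my illustration of one way to do it, not necessarily how Tagnostic itself is written.

# Illustration only: one way to consume the agent protocol above.
my $agent = 'utils/tagnostic-agent.sh';    # example path
my %cache;    # content hash => array ref of embedding numbers

open my $list, '-|', $agent, 'list-content'
    or die "cannot run agent: $!";
while (my $line = <$list>) {
    chomp $line;
    # Each line is: filename,content hash,comma-separated tags.
    my ($name, $hash, $tags) = split /,/, $line, 3;
    # Only ask for a fresh embedding when the content hash is not already
    # cached; persisting %cache between runs is what makes re-running the
    # analysis fast.
    unless (exists $cache{$hash}) {
        my $numbers = qx($agent embed-content $name);
        chomp $numbers;
        $cache{$hash} = [ split /,/, $numbers ];
    }
}
close $list;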