Tagnostic: Determine Tag Quality
Using tags is hard. It’s easy to accidentally fall into tagging anti-patterns: creating tags that are synonyms of existing tags, one-off tags for single pieces of content, or tags so vague as to be meaningless in practice. I know because I made all of those mistakes on this site!
Did you know, for example, that there were 14 tags on this site that were only
applied to one article? Or that the notes
tag was applied haphazardly to
articles with little or no connection to each other? I sure didn’t know,
although I did harbour suspicions.
I have taken steps to fix that problem, and along the way, I created some software to help out. This article announces Tagnostic! Here’s how it works.
Productivity boosts happen when computers understand human intention
What strikes me about using Tagnostic is how amazing its ergonomics are. It feels like having a discussion with the tag assignments: we look at a list of all tags, pick a target for improvement based on a meaningful summary metric, and then drill down into its assignments. We make adjustments to the tag assignments, and since Tagnostic caches embeddings locally, regenerating the analysis is lightning fast. Back and forth it goes, and the tag assignments are iteratively refined.
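The caching does not need to be sophisticated for this to work. Here is a minimal sketch of the idea in Perl – illustrative code, not the actual Tagnostic source – keyed on the content hash the agent reports for each article:

use strict;
use warnings;

# Minimal sketch of hash-keyed embedding caching. Illustrative
# only; not the actual Tagnostic code. The agent's list-content
# output includes a sha256 hash per article, so an embedding
# only needs recomputing when the content (and thus the hash)
# changes.
sub cached_embedding {
    my ($agent, $name, $hash) = @_;
    my $cache_file = ".tagnostic-cache/$hash";
    unless (-e $cache_file) {
        mkdir '.tagnostic-cache';   # no-op if it already exists
        my $embedding = qx($agent embed-content $name);
        open my $out, '>', $cache_file or die "cache write: $!";
        print {$out} $embedding;
        close $out;
    }
    open my $in, '<', $cache_file or die "cache read: $!";
    local $/;                       # slurp mode
    return [ split /,/, scalar <$in> ];
}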
This would not have been possible 15 years ago. Without embeddings (or some other way to numerically compare the similarity of intentions behind pieces of text), we would have had to resort to a panel of experts to provide an independent opinion on which articles a tag is relevant to, and that would have been both expensive and slow. It certainly would not feel like a conversation with the tag assignments: imagine having waited a week for the panel to come up with a relevance order for all tags on this site, and then when they finally hand over the order, you go,
Hey, I’m just spitballing here, but what about a tag called beginner, for beginner material across many domains? How would you order articles according to relevance to that?
They would beat you up! Or at least ask for more money and take a couple of days to do it.
In contrast, Tagnostic takes a second to generate the new embedding and then barfs out a new list of articles. Then it does it again, and again, and for however long your OpenAI credits last – which is really freaking long when all you do is ask for embeddings!
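That hypothetical question is, in fact, just one more invocation. The command below is illustrative – beginner is the made-up tag from the quote above, and the output is omitted:

$ tagnostic.pl --agent=tagnostic-agent.sh -- \
    beginner | tail -n 10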
Evaluating tag renaming
When I look at the diagnostic listing for the tag computing_science, I get the feeling that what the tag really captures is some sort of algorithms idea. We can ask the software whether the relevance of the tag would improve if we renamed it to algorithms instead.
$ tagnostic.pl --agent=tagnostic-agent.sh -- \
computing_science algorithms | tail -n 25
laws-of-software-evolution.org
student-bootstrapped-t-distribution.org
code-review-checklist.org                                    algorithms
awk-state-machine-parser-pattern.org
precel-like-excel-for-uncertain-values.org
programming-apprenticeships.org                              algorithms
latent-semantic-analysis-in-perl.org
problem-learning-and-information.org
guessing-game-ada-style.org
expressive-limitations-of-oop.org
software-design-tree-and-program-families.org
grep-sed-and-awk-the-right-tool-for-the-job.org              algorithms
reading-notes-guide-for-ravenscar-in-high-integrity-syste...
bayesian-statistics.org
how-much-does-an-experienced-programmer-use-google.org       algorithms
top-down-vs-bottom-up-programming.org                        algorithms
optimise-the-expensive-first.org                             algorithms
purely-functional-avl-trees-in-common-lisp.org
abduction-is-not-induction.org                               algorithms
myth-of-the-day-functional-programmers-dont-use-loops.org    algorithms
software-and-orders-of-magnitude.org                         algorithms
computing-science-dictionary.org                             algorithms
intuition-for-time-complexity-of-algorithms.org
Agreement: 0.26
Agreement (renamed): 0.37
Renaming this tag improves its relevance by quite a bit (from +0.26 to +0.37), so that seems like a sensible thing to do.
Evaluating new tag ideas
We don’t have to give Tagnostic an existing tag to diagnose: we can also just type in anything and it will give us a listing of the articles that are most relevant for that word.
$ tagnostic.pl --agent=tagnostic-agent.sh -- \
psychology | tail -n 10
statistical-process-control-a-practitioners-guide.org
publish-your-observations.org
hidden-cost-of-heroics.org
problem-learning-and-information.org
improving-forecasting-accuracy-benchmark.org
validate-your-skill.org
abduction-is-not-induction.org
frequentism-and-false-dichotomies.org
confusing-uncertainty-for-information.org
Agreement: -1.00
Because there are no tag assignments for psychology (we just invented a new potential tag, after all), the agreement score is not meaningful. But what is valuable here is looking at this group of most relevant articles and seeing whether there’s a point in creating a new psychology tag. My judgement is that there isn’t: there are at best five or so articles that could really use it, so meh.
Agreement is correlation between embedding and application
The fundamental insight that drove this implementation is that there are, in some sense, two sets of tag applications for any given tag:
- One set of actual assignments: the articles that really have the tag applied to them.
- One imaginary, perfect assignment: the same tag applied to exactly the articles for which it is relevant.
The quality of the actual tag assignment can then be thought of as how closely it matches the perfect one. If we had lots of money to throw at this problem, we could ask a panel of experts to assign tags to articles independently of us, and then measure how large the difference is between the actual assignments and the expert assignments. A large difference indicates that two sensible people associate different things with the tag, which means the tag is vague and impossible to assign well. That’s a low-quality tag.
But if we don’t have lots of money to spend on this problem, we can approximate: we ask the computer to order articles by how relevant they are to a tag, using embeddings to determine relevance. Embeddings are a somewhat compact numeric representation of what a text means, and, critically, they can be meaningfully compared to each other. Embeddings that have a small cosine distance are probably created from texts that are somehow related to each other. (Simon Willison’s llm command-line tool makes it very convenient to ask OpenAI for the embeddings of any piece of text.)
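To make “small cosine distance” concrete: the cosine similarity of two embedding vectors is their dot product divided by the product of their lengths. A minimal sketch in Perl – illustrative, not the actual Tagnostic code:

use strict;
use warnings;

# Cosine similarity between two embedding vectors, passed as
# arrayrefs of numbers. Near 1 means the underlying texts are
# probably related; near 0, unrelated.
sub cosine_similarity {
    my ($u, $v) = @_;
    my ($dot, $nu, $nv) = (0, 0, 0);
    for my $i (0 .. $#$u) {
        $dot += $u->[$i] * $v->[$i];
        $nu  += $u->[$i] ** 2;
        $nv  += $v->[$i] ** 2;
    }
    return $dot / (sqrt($nu) * sqrt($nv));
}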
What we get when we do this is that list of articles we saw when we ran Tagnostic with a specific tag to diagnose – in the previous example it was the tag computing_science. What we want to see is agreement between how the tag is actually applied and the panel of exper^W^W^W embedding-based order. If we see something else, the tag is perhaps too vague, or too sloppily applied, to be useful.
We don’t need to eyeball this agreement from an article list, though. We can quantify it through the Pearson correlation coefficient. (Actually, the way it’s commonly calculated in this case is the point-biserial correlation, but that’s the same thing, only easier to compute.) This gives a score between -1 and +1 depending on whether the tag as applied is anti-relevant (-1), irrelevant (0), or relevant (+1) according to the model used to derive the embeddings. This is the agreement score shown in the first listing.
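For the curious, the computation is small. A sketch in Perl of what an agreement score amounts to – illustrative, not the actual Tagnostic source:

use strict;
use warnings;
use List::Util qw(sum);

# Agreement score: Pearson correlation between a binary
# assignment vector (1 = article has the tag, 0 = it doesn't)
# and each article's relevance score (its cosine similarity to
# the tag embedding). With one binary variable this is exactly
# the point-biserial correlation.
sub agreement {
    my ($assigned, $relevance) = @_;   # two parallel arrayrefs
    my $n  = scalar @$assigned;
    my $ma = sum(@$assigned)  / $n;
    my $mr = sum(@$relevance) / $n;
    my ($cov, $va, $vr) = (0, 0, 0);
    for my $i (0 .. $n - 1) {
        my $da = $assigned->[$i]  - $ma;
        my $dr = $relevance->[$i] - $mr;
        $cov += $da * $dr;
        $va  += $da ** 2;
        $vr  += $dr ** 2;
    }
    return $cov / sqrt($va * $vr);
}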
Accuracy concerns are overblown
We could be concerned about accuracy – are the embeddings really good enough to capture the intent of an article in relation to a tag? Heck if I know – but it doesn’t matter. They don’t need to be. We were joking about “a panel of experts”, but really, anyone off the street would probably suffice. The goal is not so much finding the ideal order as it is uncovering how large the variation is when different people try to assign a tag. We want to root out large discrepancies between the judgments of sensible people, because such discrepancies indicate ambiguity, and ambiguity is what we want to avoid.
The embeddings are not correct, but they are roughly correct often enough that they can be used for this purpose.
Appendix A: Adjustments based on analysis
I have been making a few changes in response to what I learned from Tagnostic. Here is a changelog:
- Removed the following tags:
- notes
- foss
- arts
- Renamed the following tags:
- computing_science is now algorithms
- baduk is now games
- Merged the following tags:
- probability into statistics (I’m not entirely happy with this, but I think it makes sense for now. The crowds that are interested in these likely overlap a lot.)
- Applied the following tags more broadly:
- reading
- algorithms
- games
- writing
- Fixed a few gaps in various tag applications.
Just these few low-hanging fruit have brought the mean relevance score of the tags on this site from 0.22 to 0.26. I plan on continuing this work of refining the tags now that I have all the tooling I need for it.
Appendix B: The agent script for Entropic Thoughts
Since Tagnostic is a very dumb 150 lines of Perl, and the most interesting stuff happens in the agent script, I figured I might share the agent script I use for this site. Unfortunately, it is also rather uninteresting, because it shells out the interesting stuff to the llm cli – which in turn redirects the request to OpenAI … yay, abstractions!
#!/bin/sh
script_dir=$(dirname "$(readlink -f "$0")")

case "$1" in
    list-content)
        find "$script_dir/../src/org" -type f | while read -r fn; do
            if grep -qE ':page:|:draft:' "$fn"; then
                continue
            fi
            tags=$(
                cat "$fn" \
                    | perl -lne '/FILETAGS: :([a-z:_]+):/ && print ($1 =~ tr/:/,/r)'
            )
            name=$(basename "$fn")
            hash=$(sha256sum "$fn" | cut -d' ' -f 1)
            echo "$name,$hash,$tags"
        done
        ;;
    embed-content)
        name=${2:?}
        # The first 500 lines should be enough to be accurate
        # but without breaking any context window limits.
        #
        # We also remove any lines consisting only of numbers
        # because they aren't helpful and become many tokens.
        cat "$script_dir/../src/org/$name" \
            | grep -vE '^[0-9., -]+$' \
            | head -n 500 \
            | llm embed -m 3-small \
            | sed 's/[][ ]//g'
        ;;
    embed-tag)
        tag=${2:?}
        llm embed -m 3-small -c "$tag" \
            | sed 's/[][ ]//g'
        ;;
    *)
        echo "Unrecognised subcommand."
        ;;
esac
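For reference, the entire contract between Tagnostic and an agent is those three subcommands. Example invocations (the article name is one from the listings earlier):

$ ./tagnostic-agent.sh list-content            # one name,hash,tags line per article
$ ./tagnostic-agent.sh embed-content bayesian-statistics.org
$ ./tagnostic-agent.sh embed-tag algorithms    # both embed-* print comma-separated floats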