org-element Reimplements Emphasis Parsing
Org – which I use to author this site – has support for emphasis markers, much
like e.g. Markdown. For example, slashes indicate /italics/. Nothing wrong
with that. Works great.
Org also supports _underscores_ to get underlined text, but I never really
want to underline something, so for this site I have hijacked the underline
html rendering code to instead produce <abbr> tags, as in this very sentence
when I typed _html_ to get them pretty small caps. This also works well and
has done so for a while.
However, sometimes words in programming contain both abbreviation and normal
text at once, such as the case for SQLite and XMLHttpRequest. In those
instances, we want to declare part of the word as emphasised with underscores.
At the same time, we don’t want the hypothetical middle of
some_long_variable_name to be emphasised. There ought to be an easy way out,
though: zero-width, non-breaking spaces. If we put a zero-width, non-breaking
space between “SQL” and “ite”, we should be able to treat just the first bit as
an abbreviation, as such: _sql_ite.
Org doesn’t do this out of the box, but it does make it possible to add this by
modifying the org-emphasis-regexp-components variable. That variable contains
the characters that are allowed to surround emphasis markers (such as
underscores and slashes). If we add the zero-width, non-breaking space to that
list, the above works. This type of modification was previously encouraged,
but a lot of people couldn’t handle the responsibility, so it is no longer
advertised. Critically, though, it is still possible.
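As a rough sketch of what such a modification can look like: the variable is a list whose first two elements are the characters allowed before and after a marker, and org-set-emph-re recomputes the regexps derived from it. The code point below (U+2060, word joiner) and the my/ name are assumptions for illustration, since the exact character is not spelled out above.

```elisp
(require 'org)

;; Assumption: U+2060 (word joiner) stands in for the zero-width,
;; non-breaking space. Swap in whichever code point you actually type.
(defconst my/zero-width-nbsp "\u2060"
  "Hypothetical name for the character allowed around emphasis markers.")

;; The list has the shape (PRE POST BORDER BODY NEWLINE). PRE and POST are
;; character sets, so appending one character to each is enough.
(let ((components (copy-sequence org-emphasis-regexp-components)))
  (setf (nth 0 components) (concat (nth 0 components) my/zero-width-nbsp))
  (setf (nth 1 components) (concat (nth 1 components) my/zero-width-nbsp))
  ;; Store the new value and rebuild the emphasis regexps derived from it.
  (org-set-emph-re 'org-emphasis-regexp-components components))
```

This only affects the regexps Org uses in the buffer, e.g. for fontification; already open Org buffers may need to be reverted before they pick it up.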
This used to work well all the way through for this site, but it broke a few
years ago. At the time of writing – and for the past few years – when we export
such an Org document to html, the emphasis markers surrounded by zero-width,
non-breaking spaces are no longer treated as emphasis markers. I got annoyed
enough to dig into this today, and it turns out it doesn’t work because the
export framework for Org completely ignores the org-emphasis-regexp-components
variable. The export framework doesn’t deal directly with Emacs buffer content,
but rather calls out to org-element for parsing Org syntax. This, in turn,
defines its own regex for detecting emphasis markers!
This is in the function org-element--parse-generic-emphasis, and the relevant
bits, where it defines which characters are allowed to surround emphasis
markers, boil down to something like
(let ((opening-re
       (rx-to-string
        `(seq (or line-start (any space ?- ?' ...))
              ,mark
              (not space))))))
(let ((closing-re
       (rx-to-string
        `(seq (not space)
              (group ,mark)
              (or (any space ?- ?. ?, ...) line-end))))))
This builds the regexen from a backquoted rx form passed to rx-to-string, but critically, none of the characters listed as allowed before and after the marks bear any reference to the org-emphasis-regexp-components list that allowed customisation of emphasis surroundings.
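To poke at this outside of Org, here is a small sketch that rebuilds only the opening regexp from the excerpt and tries it on two strings. The my/ names are hypothetical, the character list is deliberately cut down to the three characters visible above (the real list is longer), and U+2060 is again an assumed stand-in for the zero-width, non-breaking space.

```elisp
(require 'rx)

(defun my/opening-re (mark)
  "Build an opening regexp for MARK from a truncated character list.
Only space, ?- and ?' are included; the list in Org itself is longer."
  (rx-to-string
   `(seq (or line-start (any space ?- ?')) ,mark (not space))))

(defun my/emphasis-can-open-p (text)
  "Return non-nil if an underscore somewhere in TEXT could open emphasis."
  (and (string-match-p (my/opening-re "_") text) t))

;; First string: underscore preceded by an ordinary space.
;; Second string: underscore preceded by a word joiner (U+2060).
(list (my/emphasis-can-open-p "an _sql_ example")
      (my/emphasis-can-open-p "x\u2060_sql_\u2060ite"))
```

If the word joiner does not count as whitespace in the current syntax table, I would expect only the first check to succeed, which would match the behaviour described here. Nothing in this sketch consults org-emphasis-regexp-components, which is the whole problem.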
What’s particularly annoying is that (a) this list of characters is hard-coded
into the org-element--parse-generic-emphasis function, and (b) that function
is really long. If the list had been a variable, we could have set that to
something else. If the function had been short and sweet, we would have been
able to advise around it or something. But as it stands today, the only way
forward is to overwrite that entire function with our own code, which locks us
out of future improvements and bugfixes to it.
Meh. Not great. I’ve submitted a bug report. We’ll see where it goes. This article is published merely so that other people who have the same problem can find out what went wrong faster than I did. If it helps, this was a semi-deliberate change introduced in late 2021.
Update a day later: I read more about this. Apparently Org is, more broadly, trying to transition into a more standardised markup format. This is why they no longer encourage modifying org-emphasis-regexp-components, why the exporting framework uses the org-element parser, and why the org-element parser is now more disconnected from the Emacs buffer content.
In this light, it seems defensible to not allow customising the allowed characters around emphasis marks, and maybe I’m the one using it wrong.
I also received a response to my bug report on the mailing list with a suggested solution: use zero-width spaces instead (since Emacs does not word wrap on them anyway) and then remove those zero-width spaces in the output, using something like
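(defun tw-export-remove-zero-width-space (text backend _info)
  "Remove zero width spaces from TEXT."
  ;; BACKEND has to be passed along explicitly: called with only 'org,
  ;; `org-export-derived-backend-p' has nothing to compare against and
  ;; always returns nil. When the filter returns nil, TEXT is left as-is.
  (unless (org-export-derived-backend-p backend 'org)
    (replace-regexp-in-string "\u200b" "" text)))

;; Append, so the filter runs after any other final-output filters.
(add-to-list 'org-export-filter-final-output-functions
             #'tw-export-remove-zero-width-space t)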
This works! And it seems like the sort of thing that might continue to work for the foreseeable future.
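Since zero-width spaces are awkward to type, a tiny helper makes this approach more comfortable. This is only a sketch: the command name and the key binding are made up for illustration and are not part of the suggestion from the mailing list.

```elisp
(require 'org)

(defun my/insert-zero-width-space ()
  "Insert a zero-width space (U+200B) at point."
  (interactive)
  (insert "\u200b"))

;; Hypothetical binding; C-c followed by a letter is reserved for users.
(define-key org-mode-map (kbd "C-c z") #'my/insert-zero-width-space)
```

With this in place, the SQLite example becomes: type _sql_, press C-c z, then continue with ite. Without the helper, C-x 8 RET followed by the character name ZERO WIDTH SPACE inserts the same character.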