org-element Reimplements Emphasis Parsing

kqr, published 2025-05-26

Org – which I use to author this site – has support for emphasis markers, much like e.g. Markdown. For example, slashes indicate /italics/. Nothing wrong with that. Works great.

Org also supports _underscores_ to get underlined text, but I never really want to underline something, so for this site I have hijacked the underline HTML rendering code to instead produce <abbr> tags, as in this very sentence, where I typed _html_ to get pretty small caps. This also works well and has done so for a while.

However, sometimes words in programming contain both abbreviation and normal text at once, as is the case for SQLite and XMLHttpRequest. In those instances, we want to declare part of the word as emphasised with underscores. However, we don’t want the hypothetical middle of some_long_variable_name to be emphasised. There ought to be an easy way out, though: zero-width, non-breaking spaces. If we put a zero-width, non-breaking space between “SQL” and “ite”, we should be able to treat just the first bit as an abbreviation, like so: _sql_ite.

Org doesn’t do this out of the box, but it does make it possible to add this, by modifying the org-emphasis-regexp-components variable. That variable contains the characters that are allowed to surround emphasis markers (such as underscores and slashes). If we add the zero-width, non-breaking space to that list, the above works. This type of modification was previously encouraged, but a lot of people couldn’t handle the responsibility, so it is no longer advertised. Critically, though, it is still possible.
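A minimal sketch of that modification, under a couple of assumptions that are mine rather than Org's: that the “zero-width, non-breaking space” is U+2060 WORD JOINER (U+FEFF would be handled the same way), and that it should be allowed both before and after a marker. The variable is a list of the form (PRE POST BORDER BODY NEWLINE), and org-set-emph-re rebuilds the regexps derived from it.

```emacs-lisp
(require 'org)

;; Assumption: U+2060 WORD JOINER plays the role of the "zero-width,
;; non-breaking space"; substitute whichever character you actually type.
(let* ((joiner "\u2060")
       ;; Work on a copy so the original list object is left alone.
       (components (copy-sequence org-emphasis-regexp-components)))
  ;; Element 0 is PRE (characters allowed before an opening marker),
  ;; element 1 is POST (characters allowed after a closing marker).
  (setf (nth 0 components) (concat (nth 0 components) joiner))
  (setf (nth 1 components) (concat (nth 1 components) joiner))
  ;; Store the new value and recompute the emphasis regexps from it.
  (org-set-emph-re 'org-emphasis-regexp-components components))
```

Org buffers that are already open may need M-x org-mode-restart before the new surroundings show up in fontification.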

This used to work well all the way through for this site, but it broke a few years ago. At the time of writing – and for the past few years – when we export such an Org document to HTML, the emphasis markers surrounded by zero-width, non-breaking spaces are no longer treated as emphasis markers. I got annoyed enough to dig into this today, and it turns out it doesn’t work because the export framework for Org completely ignores the org-emphasis-regexp-components variable. The export framework doesn’t deal directly with Emacs buffer content, but rather calls out to org-element for parsing Org syntax. This, in turn, defines its own regex for detecting emphasis markers!

This is in the function org-element--parse-generic-emphasis, and the relevant bits, where it defines which characters are allowed to surround emphasis markers, boil down to something like

(let ((opening-re (rx-to-string
         `(seq
            (or line-start (any space ?- ?' ...))
            ,mark (not space))))))

(let ((closing-re (rx-to-string
         `(seq
            (not space) (group ,mark)
            (or (any space ?- ?. ?, ...)
                line-end))))))

This seems to build the regexes from rx forms, but critically, none of the characters listed as allowed before and after the marks bear any reference to the org-emphasis-regexp-components list that allowed customisation of emphasis surroundings.
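One way to see this for yourself is to ask org-element directly how many underline objects it finds in a piece of text. This is only a sketch: my/underline-count is a hypothetical helper name, and U+2060 is again my assumption for the zero-width, non-breaking space.

```emacs-lisp
(require 'org)
(require 'org-element)

(defun my/underline-count (string)
  "Return the number of underline objects org-element finds in STRING."
  (with-temp-buffer
    (insert string)
    ;; The parser expects to run in an Org buffer.
    (org-mode)
    ;; Collect every underline object in the parse tree and count them.
    (length (org-element-map (org-element-parse-buffer)
                'underline #'identity))))

;; Marker followed by an ordinary space.
(my/underline-count "_sql_ ite")
;; Marker followed by the (assumed) zero-width joiner.
(my/underline-count "_sql_\u2060ite")
```

On an affected Org version, the first call should report one underline and the second none, whatever org-emphasis-regexp-components is set to.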

What’s particularly annoying is that (a) this list of characters is hard-coded into the org-element--parse-generic-emphasis function, and (b) that function is really long. If the list had been a variable, we could have set that to something else. If the function had been short and sweet, we would have been able to advise around it or something. But as it stands, the only way forward is to overwrite that entire function with our own code, which locks us out of future improvements and bugfixes to it.

Meh. Not great. I’ve submitted a bug report. We’ll see where it goes. This article is published merely so that other people who have the same problem can find out faster than I did what went wrong. If it helps, this was a semi-deliberate change introduced in late 2021.

Update a day later: I read more about this. Apparently Org is more broadly trying to transition into a more standardised markup format. This is why they no longer encourage modifying org-emphasis-regexp-components, and why the exporting framework uses the org-element parser, and why the org-element parser is now more disconnected from the Emacs buffer content.

In this light, it seems defensible to not allow customising the allowed characters around emphasis marks, and maybe I’m the one using it wrong.

I also received a response to my bug report on the mailing list with a suggested solution: use zero-width spaces instead (since Emacs does not word wrap on them anyway) and then remove those zero-width spaces in the output, using something like

(defun tw-export-remove-zero-width-space (text backend _info)
  "Remove zero width spaces from TEXT unless BACKEND derives from `org'."
  ;; Returning nil (the `org' case) leaves TEXT unchanged.
  (unless (org-export-derived-backend-p backend 'org)
    (replace-regexp-in-string "\u200b" "" text)))

(add-to-list 'org-export-filter-final-output-functions
             #'tw-export-remove-zero-width-space t)

This works! And it seems like the sort of thing that might continue to work for the foreseeable future.
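To check that the filter is actually in effect, one option is to export a small string and look for leftover zero-width spaces. This is a sketch only; my/zwsp-survives-export-p is a hypothetical helper name and not part of Org.

```emacs-lisp
(require 'ox-html)

(defun my/zwsp-survives-export-p (org-text)
  "Return t if a zero-width space remains after exporting ORG-TEXT to HTML."
  ;; The third argument asks for the body only, without the page template.
  (and (string-match-p "\u200b"
                       (org-export-string-as org-text 'html t))
       t))

;; Source text with a zero-width space between the marker and "ite".
(my/zwsp-survives-export-p "_sql_\u200bite")
```

With the filter installed, that last call should return nil. Without it, the zero-width space would be expected to pass through into the output.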