This is a somewhat complex React site, so the rules could be a little fragile.
Text content always seems to be inside .outer--content, and most h6 elements
are fluff like "read later" or pointers to other articles. However, h6.byline
and h6.post__title__kicker are relevant to the current article.
Figure captions are sometimes inside both figure and div.outer--content
elements, sometimes only inside figure, so take both and remove the
intersection.
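One reading of "take both and remove the intersection" is a de-duplicated union of the two selections. The sketch below shows that reading in Python on a well-formed snippet. It assumes captions are `figcaption` elements, and `collect_captions` is a hypothetical helper name, not something taken from the actual rules.

```python
import xml.etree.ElementTree as ET

def collect_captions(root):
    """Return captions found under figure or div.outer--content, each once."""
    # Assumption: captions are <figcaption> elements.
    in_figure = root.findall(".//figure//figcaption")
    in_content = root.findall(".//div[@class='outer--content']//figcaption")
    seen = set()
    result = []
    for node in in_figure + in_content:
        # Both queries return the same Element objects from the tree,
        # so object identity is enough to drop the overlap.
        if id(node) not in seen:
            seen.add(id(node))
            result.append(node)
    return result

SAMPLE = (
    "<body>"
    "<div class='outer--content'><figure><figcaption>A</figcaption></figure></div>"
    "<figure><figcaption>B</figcaption></figure>"
    "</body>"
)
captions = [c.text for c in collect_captions(ET.fromstring(SAMPLE))]
```

Caption A matches both selections but is kept only once.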
The figure elements sometimes contain multiple copies of images or
videos, and we just take them all. Math articles seem to use MathJax,
which we don't add.
This rewrite rule would change this:
<img title="<foo>">
to this:
<figure><img><figcaption><foo></figcaption></figure>
The image title needs to be properly escaped.
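A minimal sketch of the escaping step, written in Python purely for illustration (the source does not show the rule's implementation). `figure_for` is a hypothetical helper and the `src` value is made up.

```python
from html import escape

def figure_for(src: str, title: str) -> str:
    """Build a <figure> for an image, escaping the title for the caption."""
    # html.escape replaces &, < and > (and, by default, both quote
    # characters) with entities, so the title cannot inject markup.
    return (
        f'<figure><img src="{escape(src, quote=True)}">'
        f"<figcaption>{escape(title)}</figcaption></figure>"
    )

print(figure_for("comic.png", "<foo>"))
# <figure><img src="comic.png"><figcaption>&lt;foo&gt;</figcaption></figure>
```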
Searches tags for various `data-*` attributes and sets `img` tag `src` attribute appropriately. Falls back to searching `noscript` for `img` tags.
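The lookup order described above could be sketched roughly as follows, in Python for illustration. The attribute names in `CANDIDATE_ATTRS` are assumptions (the source only says "various `data-*` attributes"), and `resolve_src` is a hypothetical helper that takes an already-parsed attribute mapping plus the raw `noscript` contents.

```python
import re
from typing import Optional

# Hypothetical candidate list; the real rule may check other attributes.
CANDIDATE_ATTRS = ("data-src", "data-original", "data-lazy-src")

# Requires whitespace before "src" so that data-src is not mistaken for src.
IMG_SRC_RE = re.compile(r'<img\b[^>]*?\ssrc="([^"]+)"', re.IGNORECASE)

def resolve_src(attrs: dict, noscript_html: str = "") -> Optional[str]:
    """Pick an image URL from data-* attributes, else from a noscript img."""
    for name in CANDIDATE_ATTRS:
        value = attrs.get(name)
        if value:
            return value
    # Fallback: look for a plain <img src="..."> inside the noscript markup.
    match = IMG_SRC_RE.search(noscript_html)
    return match.group(1) if match else None
```

Returning `None` when nothing is found leaves the caller free to keep the tag unchanged.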
Includes unit tests.
* Only processes images with `src` **and** `title` attributes (others are ignored).
* Processes **all** images in the document (not just the first one).
* Wraps the image in a `figure` tag, with the contents of its title attribute in a `figcaption` tag.
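The three points above can be sketched as a single pass over the document. This is a regex-based illustration in Python, not the actual implementation: `add_image_titles` is a hypothetical name, attributes are assumed to be lower-case and double-quoted, and the original `img` tag is kept unchanged inside the `figure`.

```python
import re
from html import escape, unescape

# Matches an <img ...> tag, letting quoted attribute values contain ">".
IMG_TAG_RE = re.compile(r'<img\b(?:"[^"]*"|[^>"])*>', re.IGNORECASE)
ATTR_RE = re.compile(r'([\w-]+)="([^"]*)"')

def add_image_titles(document: str) -> str:
    """Wrap every img that has both src and title in a figure with a caption."""
    def wrap(match):
        tag = match.group(0)
        attrs = dict(ATTR_RE.findall(tag))
        # Images missing src or title are left untouched.
        if not attrs.get("src") or not attrs.get("title"):
            return tag
        # Normalise any existing entities, then escape for use as text.
        caption = escape(unescape(attrs["title"]))
        return f"<figure>{tag}<figcaption>{caption}</figcaption></figure>"
    # re.sub visits every match, so all images are processed, not just the first.
    return IMG_TAG_RE.sub(wrap, document)
```

A real implementation would more likely use an HTML parser than regular expressions; the regex keeps the sketch short.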
Updated xkcd rewriter unit test.
Added another xkcd rewriter unit test to check rendering of images without title attributes.