Commit Graph

49 Commits

Author SHA1 Message Date
jgbresson 7f6ce16d85 Add scraping rules for theverge.com 2022-10-16 11:58:35 -07:00
Adam B 4d847c6a74 Add scraping rule for royalroad.com
This is what I use for several stories I follow, and I thought it might be useful to other miniflux users.
2022-08-17 19:25:39 -07:00
Owen Valentine f404ddde91 Add swordscomic.com 2022-08-17 19:23:29 -07:00
Owen Valentine c8a3d953cf Add smbc-comics.com 2022-08-17 19:23:29 -07:00
Owen Valentine f851ecac78 Sort alphabetically 2022-08-17 19:23:29 -07:00
Frédéric Guillot cecab91298 Fix some linter issues 2022-08-08 22:06:38 -07:00
Gabe Cook bd1dc3149e Add explosm.net scraper rule 2022-07-30 20:10:52 -07:00
Jouni K. Seppänen bb0d2bf675 Add Youtube videos in Quanta articles
Some articles (especially the recent year-in-review ones) include a Youtube
video. The server-side rendered articles do not include the Youtube iframe,
but they do have a script that looks like

    <script type="text/javascript" data-reactid="6">
      window.__APOLLO_STATE__ = {
        ...
          youtube_id: "9uASADiYe_8",

We add a reformatting function that tries to detect obvious JavaScript code
that has a field or variable called youtube_id that has an 11-character
double-quoted value, and adds the referenced Youtube videos in the beginning of
the article. This is slightly more general than needed for Quanta, in the hope
that it could be useful for similar sites.
2022-01-03 10:10:13 -08:00
Jouni K. Seppänen dcf87bd642 Add scrape and rewrite rules for quantamagazine
This is a somewhat complex React site so the rules could be a little fragile.
Text content seems to be always inside .outer--content, and most h6 elements
are fluff like "read later" or pointers to other articles. However, h6.byline
and h6.post__title__kicker are relevant to the current article.

Figure captions are sometimes inside both figure and div.outer--content
elements, sometimes only inside figure, so take both and remove the
intersection.

The figure elements sometimes contain multiple copies of images or
videos, and we just take them all. Math articles seem to use Mathjax,
which we don't add.
2022-01-03 10:10:13 -08:00
Jouni K. Seppänen 2fedd8f234 Add scraper rule for ikiwiki.iki.fi
Feed: https://ikiwiki.iki.fi/feed.php?linkto=current&ns=uutiset%3Ablog&num=5

Example page: https://ikiwiki.iki.fi/uutiset/blog/20210923100421viiveita

(To clarify, I'm not a representative of iki.fi although I have an email address in the domain. This is a nonprofit association that offers email forwarding addresses, and the rss feed in question contains news for their members.)
2021-12-27 20:51:37 -08:00
hulb 01f678c3b1 add proxy arg in scraper.Fetch 2021-08-28 21:57:11 -07:00
Frédéric Guillot b7c229f30f Update scraper rule for theregister.com 2021-08-16 20:04:02 -07:00
Darius 9242350f0e
Add per feed cookies option 2021-03-22 20:27:58 -07:00
Frédéric Guillot ec3c604a83 Add option to allow self-signed or invalid certificates 2021-02-21 13:58:52 -08:00
Frédéric Guillot a352aff93b Remove deprecated io/ioutil package
Miniflux now requires at least Go 1.16 and io/util is deprecated.

https://golang.org/doc/go1.16#ioutil
2021-02-16 21:25:21 -08:00
Frédéric Guillot 31435ef83e Add rewrite rule to fix Medium.com images 2020-09-29 22:27:32 -07:00
Frédéric Guillot c394a61a4e Add Prometheus exporter 2020-09-27 20:04:48 -07:00
Frédéric Guillot 16b7b3bc3e http client: remove dependency on global config options 2020-09-27 14:37:46 -07:00
Manuel Müller ca918bc7e3 Added scraper rule for dilbert.com and turnoff.us 2020-06-10 20:15:46 -07:00
Corey McCaffrey 25d4b9fc0c Added scraper rule for financialsamurai.com
The default rule results in blank content.
2020-05-24 13:29:28 -07:00
Corey McCaffrey 0683074b8b Added scraper rule for TheOatmeal.com
The default rule does not show the comic posted to the feed. The comic image is in a div with id "comic".
2020-05-13 21:28:00 -07:00
Corey McCaffrey 8f6c07afd6 Added scraper rule for RayWenderlich.com
RayWenderlich.com is a popular developer's community for iOS and Android developers. The default rule results in "GROUP GROUP GROUP GROUP…" instead of the content posted on the blog.
2020-05-13 21:28:00 -07:00
Andrew Williams 9974e0f458 Addition of scraper rule for wdwnt.com
By default fetching original content for wdwnt.com results in a snippet of the comments section, this rule captures the article content.
2020-02-28 20:24:58 -08:00
cinput 8e1ed8bef3 Return outer HTML when scraping elements 2019-12-21 21:18:31 -08:00
somini 30f22fbd78 Update scraper rule for "Le Monde" 2019-12-19 18:35:29 -08:00
Neo Ng 90064a8cf0 Update scraper rule for openingsource.org 2019-11-28 19:40:26 -08:00
Tom Matthews 8b40778ee1 Add BBC News scraping rule 2018-12-13 20:25:30 -08:00
Frédéric Guillot 6f5d93cbbe Update scraper rule for lemonde.fr 2018-12-02 20:53:22 -08:00
Frédéric Guillot 311a133ab8 Refactor manual entry scraper 2018-12-02 20:51:06 -08:00
mapl e47188eab2 Update scraper rule for heise.de 2018-12-01 11:49:30 -08:00
Frédéric Guillot 3b6e44c331 Allow the scraper to parse XHTML documents
Only "text/html" was authorized before.
2018-11-03 13:44:13 -07:00
Frédéric Guillot 5870f04260 Simplify feed parser and format detection
- Avoid doing multiple buffer copies
- Move parser and format detection logic to its own package
2018-10-14 11:46:41 -07:00
Frédéric Guillot 9dc38a0803 Add missing package descriptions for GoDoc 2018-10-08 17:32:17 -07:00
Patrick 2538eea177 Add the possibility to override default user agent for each feed 2018-09-19 18:19:24 -07:00
Frédéric Guillot df2bebaf3d Update scraper rule for heise.de 2018-08-25 10:33:18 -07:00
Frédéric Guillot dbcc5d8a97 Use canonical imports 2018-08-24 21:56:39 -07:00
Frédéric Guillot 1eba1730d1 Move HTTP client to its own package 2018-04-28 10:51:07 -07:00
aniran 322b265d7a Scrape parent element for iframe
Current behavior: if you have an `iframe` scraper rule, `scrapContent`
tries to return the inner HTML of the `iframe`, which turns up blank.

New behavior: like `img` elements, if an `iframe` is matched by a scraper rule,
the parent element's inner HTML (i.e. the `iframe` is returned).
2018-04-27 17:57:22 -07:00
Frédéric Guillot 1d7fe892e1 Add scraper rule for darkreading.com 2018-01-06 13:25:12 -08:00
Frédéric Guillot 48aa0d07ef Add more scraper rules 2018-01-04 19:32:24 -08:00
Frédéric Guillot 3c3f397bf5 Make sure the scraper parse only HTML documents 2018-01-02 18:32:01 -08:00
Frédéric Guillot c454f67037 Add scraper rules for version2.dk and ing.dk 2017-12-27 19:44:23 -08:00
Frédéric Guillot d4839b5597 Add more scraper rules 2017-12-27 13:36:07 -08:00
Frédéric Guillot 1d8193b892 Add logger 2017-12-15 18:55:57 -08:00
Frédéric Guillot c6d9eb3614 Improve content scraper 2017-12-13 21:30:40 -08:00
Frédéric Guillot 84d912c979 Rewrite imports 2017-12-12 21:48:13 -08:00
Frédéric Guillot ef097f02fe Add the possibility to enable crawler for feeds 2017-12-12 19:19:36 -08:00
Frédéric Guillot 87ccad5c7f Add scraper rules 2017-12-10 20:51:04 -08:00
Frédéric Guillot 7a35c58f53 Add readability package to fetch original content 2017-12-10 19:01:38 -08:00