279 Commits

Author SHA1 Message Date
Frédéric Guillot 4468ef1410 Refactor category validation 2021-01-03 22:50:24 -08:00
Frédéric Guillot 291bf96d15 Do not strip tags for entry title
Some technical blogs have titles like "</some-title>" or "This is some <code>source code</code>".

Miniflux was removing these elements, which prevented the title from rendering correctly.
2021-01-03 11:44:07 -08:00
Frédéric Guillot f0610bdd9c Refactor feed creation to allow setting most fields via API
Allow API clients to create disabled feeds or define fields like "ignore_http_cache".
2021-01-02 16:48:22 -08:00
Frédéric Guillot 1908c84fbe Handle invalid French date 2020-12-02 20:59:14 -08:00
Frédéric Guillot f722fd1208 Handle invalid feeds with relative URLs 2020-12-02 20:58:18 -08:00
Pacman99 b8b6c74d86 Add rewrite rule replace for custom search and replace 2020-11-29 10:32:26 -08:00
Frédéric Guillot de7a613098 Calculate reading time during feed processing
The goal is to speed up the user interface.

Detecting the language based on the content is pretty slow.
2020-11-18 17:43:24 -08:00
Frédéric Guillot b1c9977711 Handle more invalid dates 2020-11-17 17:12:12 -08:00
Frédéric Guillot a108cb7808 Handle various invalid dates 2020-11-16 21:37:33 -08:00
Frédéric Guillot 246a48359c Do not follow redirects when trying known feed URLs
Some websites redirect unknown URLs to the home page.
As a result, the list of known URLs is returned to the subscription list.
We don't want the user to choose between invalid feed URLs.
2020-11-06 17:46:54 -08:00
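The behaviour described in the commit above can be sketched with Go's standard `net/http` client. This is a minimal illustration, not the Miniflux implementation: the helper names `newNoRedirectClient` and `probe` are hypothetical, and the test server only simulates a site that redirects unknown paths to its home page.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// newNoRedirectClient returns an HTTP client that hands back the redirect
// response itself instead of following it (hypothetical helper).
func newNoRedirectClient() *http.Client {
	return &http.Client{
		CheckRedirect: func(req *http.Request, via []*http.Request) error {
			return http.ErrUseLastResponse
		},
	}
}

// probe returns the status code seen for url without following redirects.
func probe(client *http.Client, url string) (int, error) {
	resp, err := client.Get(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

func main() {
	// Simulated website that redirects every unknown path to the home page.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path != "/" {
			http.Redirect(w, r, "/", http.StatusFound)
			return
		}
		fmt.Fprint(w, "home")
	}))
	defer srv.Close()

	status, err := probe(newNoRedirectClient(), srv.URL+"/feed.xml")
	if err != nil {
		panic(err)
	}
	// A 3xx status here means the candidate feed URL does not really exist.
	fmt.Println(status) // 302
}
```

With a client like this, a candidate URL that answers with a redirect can be discarded rather than offered to the user as a feed.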
Frédéric Guillot 40e983664c Trim spaces around icon URLs 2020-11-06 17:18:58 -08:00
Frédéric Guillot 4f358aa0f3 Do not escape HTML for Atom 1.0 text content during parsing
Avoid encoding single quotes to HTML entities (&#39;).

Feed contents are sanitized after parsing.
2020-10-30 23:41:33 -07:00
Frédéric Guillot b30a045a4e Refactor entry filtering
Avoid looping multiple times across entries
2020-10-19 22:18:41 -07:00
Frédéric Guillot b50778d3eb Add rewrite rule to use noscript content for images rendered with Javascript 2020-10-19 21:31:10 -07:00
Manuel Garrido 84b83fc3c8 Add feed filters (Keeplist and Blocklist) 2020-10-16 14:40:56 -07:00
Frédéric Guillot 3afdf25012 Do not proxy image data url 2020-10-14 22:26:54 -07:00
Frédéric Guillot 31435ef83e Add rewrite rule to fix Medium.com images 2020-09-29 22:27:32 -07:00
Frédéric Guillot d75ff0c5ab Add sanitizer support for responsive images
- Add support for picture HTML tag
- Add support for srcset, media, and sizes attributes to img and source tags
2020-09-28 23:22:08 -07:00
Frédéric Guillot c394a61a4e Add Prometheus exporter 2020-09-27 20:04:48 -07:00
Frédéric Guillot 16b7b3bc3e http client: remove dependency on global config options 2020-09-27 14:37:46 -07:00
Dave Marquard eb026ae4ac handle Pacific Daylight Time in addition to Pacific Standard Time 2020-09-22 19:47:36 -07:00
Frédéric Guillot 0d0395b4e3 Do not try to update a duplicated feed after a refresh 2020-09-20 23:42:18 -07:00
Frédéric Guillot e6c6ee441a Use a transaction to refresh and create entries
Also includes a few database improvements:

- Speed up entries clean up with an index and a goroutine
- Avoid the accumulation of enclosures for some feeds
2020-09-20 23:12:23 -07:00
Frédéric Guillot bfb96d536e Add workaround for parsing an invalid date 2020-09-14 21:23:26 -07:00
Kebin Liu cf7712acea Add HTTP proxy option for subscriptions 2020-09-09 23:28:54 -07:00
alex 0f258fd55b Make add_invidious_video rule applicable for different invidious instances 2020-09-06 13:41:42 -07:00
Frédéric Guillot fc75b0cd8e Add workaround to get YouTube feed from video page 2020-08-02 12:24:46 -07:00
Frédéric Guillot 7380c64141 Add workaround to find YouTube channel feeds
YouTube doesn't expose RSS links anymore for new-style URLs.
2020-08-02 11:37:07 -07:00
Frédéric Guillot 1d6b0491a7 Ignore <media:title> in RSS 2.0 feeds
In the vast majority of cases, the default entry title is correct.

Ignoring <media:title> avoids overriding the default title if they are different.
2020-06-29 18:24:06 -07:00
Gabriel Augendre e44b4b2540 Try known urls if no link alternate
I came across a few blogs that didn't have a link rel alternate
but offered an RSS/Atom feed.
This aims at solving this issue for "well known" feed urls, since
these urls are often the same.
2020-06-21 20:34:59 -07:00
Manuel Müller ca918bc7e3 Added scraper rule for dilbert.com and turnoff.us 2020-06-10 20:15:46 -07:00
Frédéric Guillot 6c6ca69141 Add feed option to ignore HTTP cache 2020-06-05 22:04:52 -07:00
Frédéric Guillot 7e5157f218 Rename alternative scheduler to entry_frequency 2020-05-25 15:12:47 -07:00
Shizun Ge cead85b165 Add alternative scheduler based on the number of entries 2020-05-25 14:06:56 -07:00
Corey McCaffrey 25d4b9fc0c Added scraper rule for financialsamurai.com
The default rule results in blank content.
2020-05-24 13:29:28 -07:00
Corey McCaffrey 0683074b8b Added scraper rule for TheOatmeal.com
The default rule does not show the comic posted to the feed. The comic image is in a div with id "comic".
2020-05-13 21:28:00 -07:00
Corey McCaffrey 8f6c07afd6 Added scraper rule for RayWenderlich.com
RayWenderlich.com is a popular developer's community for iOS and Android developers. The default rule results in "GROUP GROUP GROUP GROUP…" instead of the content posted on the blog.
2020-05-13 21:28:00 -07:00
Frédéric Guillot 619aa58fb3 Handle more invalid dates
Fixes #617
2020-04-25 20:15:18 -07:00
Frédéric Guillot 592151bdb6 Add support for Invidious
- Embed Invidious player for invidio.us feeds
- Add new rewrite rule to use Invidious player for Youtube feeds
2020-03-20 20:56:59 -07:00
Andrew Williams 9974e0f458 Addition of scraper rule for wdwnt.com
By default, fetching original content for wdwnt.com results in a snippet of the comments section; this rule captures the article content.
2020-02-28 20:24:58 -08:00
Frédéric Guillot 997e9422eb Ignore enclosures without URL 2020-01-30 21:18:49 -08:00
Frédéric Guillot 61f0c8aa66 Allow application/xhtml+xml links as comments URL in Atom replies 2020-01-04 16:07:06 -08:00
Frédéric Guillot bf632fad2e Allow only absolute URLs in comments URL
Some feeds are using invalid URLs (random text).
2020-01-04 15:54:16 -08:00
Kebin Liu 8cebd985a2 Use internal XML workarounds to detect feed format 2020-01-02 22:19:15 -08:00
Frédéric Guillot ac3c936820 Make sure whitelisted URI schemes are handled properly by the sanitizer 2020-01-02 11:03:51 -08:00
Frédéric Guillot 3debf75eb9 Normalize URL query string before executing HTTP requests
- Make sure query string parameters are encoded
- As opposed to the standard library, do not append an equal sign
for query parameters with an empty value
- Strip URL fragments as Web browsers do
2019-12-26 15:56:59 -08:00
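The three rules listed in the commit above can be sketched with the standard `net/url` package. This is only an illustration under stated assumptions: `normalizeURL` is a hypothetical name, and the sketch treats both `key` and `key=` as parameters with an empty value, which may differ from what Miniflux actually does.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// normalizeURL re-encodes query parameters, omits the equal sign for empty
// values, and strips the fragment (hypothetical helper, not Miniflux code).
func normalizeURL(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}

	// Browsers never send the fragment to the server.
	u.Fragment = ""

	if u.RawQuery != "" {
		var parts []string
		// Walk the raw pairs so the original parameter order is preserved.
		for _, pair := range strings.Split(u.RawQuery, "&") {
			if pair == "" {
				continue
			}
			kv := strings.SplitN(pair, "=", 2)
			key, err := url.QueryUnescape(kv[0])
			if err != nil {
				key = kv[0] // keep the raw text if it cannot be decoded
			}
			encoded := url.QueryEscape(key)
			if len(kv) == 2 && kv[1] != "" {
				value, err := url.QueryUnescape(kv[1])
				if err != nil {
					value = kv[1]
				}
				encoded += "=" + url.QueryEscape(value)
			}
			parts = append(parts, encoded)
		}
		u.RawQuery = strings.Join(parts, "&")
	}
	return u.String(), nil
}

func main() {
	out, err := normalizeURL("https://example.org/feed?q=caf%C3%A9&empty&tag=a%20b#top")
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // https://example.org/feed?q=caf%C3%A9&empty&tag=a+b
}
```

For comparison, `url.Values.Encode()` from the standard library would emit `empty=` for the second parameter.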
Frédéric Guillot 200b1c304b Improve Dublin Core support for RDF feeds 2019-12-23 14:45:58 -08:00
Frédéric Guillot 1b33bb3d1c Improve Podcast support (iTunes and Google Play feeds)
- Add support for Google Play XML namespace
- Improve existing iTunes namespace implementation
2019-12-23 13:51:42 -08:00
Frédéric Guillot 33fdb2c489 Add support for Atom 0.3 2019-12-22 22:42:00 -08:00
Frédéric Guillot cfb6ddfcea Add support for Atom 'replies' link relation
Show comments URL for Atom feeds as per RFC 4685.
See https://tools.ietf.org/html/rfc4685#section-4

Note that only the first link with type "text/html" is taken into consideration.
2019-12-22 18:03:04 -08:00
cinput 8e1ed8bef3 Return outer HTML when scraping elements 2019-12-21 21:18:31 -08:00
somini 30f22fbd78 Update scraper rule for "Le Monde" 2019-12-19 18:35:29 -08:00
Jebbs a155ab6deb Filter valid XML characters for UTF-8 XML documents before decoding
This change should reduce "illegal character code" XML errors.
2019-12-19 18:31:52 -08:00
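One way to filter characters as the commit above describes is to keep only runes allowed by the `Char` production of the XML 1.0 specification. The sketch below assumes the input is already UTF-8; `isValidXMLChar` and `filterValidXMLChars` are hypothetical names, not the functions used in Miniflux.

```go
package main

import (
	"fmt"
	"strings"
)

// isValidXMLChar reports whether r is allowed by the XML 1.0 Char production.
func isValidXMLChar(r rune) bool {
	return r == 0x09 || r == 0x0A || r == 0x0D ||
		(r >= 0x20 && r <= 0xD7FF) ||
		(r >= 0xE000 && r <= 0xFFFD) ||
		(r >= 0x10000 && r <= 0x10FFFF)
}

// filterValidXMLChars drops every rune that an XML 1.0 parser would reject.
func filterValidXMLChars(s string) string {
	return strings.Map(func(r rune) rune {
		if isValidXMLChar(r) {
			return r
		}
		return -1 // a negative value tells strings.Map to drop the rune
	}, s)
}

func main() {
	// NUL and vertical tab are illegal in XML 1.0; the newline is kept.
	fmt.Printf("%q\n", filterValidXMLChars("a\x00b\x0Bc\n")) // "abc\n"
}
```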
Frédéric Guillot a4ebb33cd5 Trim spaces for RDF entry links 2019-12-01 15:06:01 -08:00
Frédéric Guillot 120d6ec7d8 Do not rewrite YouTube description twice in "add_youtube_video" rule
This is already done before in <media:description>.
2019-11-30 22:56:06 -08:00
Frédéric Guillot 69aa650203 Add the possibility to add rules during feed creation 2019-11-29 11:27:58 -08:00
Frédéric Guillot 912a98788e Add support of media elements for Atom feeds 2019-11-28 23:55:40 -08:00
Frédéric Guillot f90e9dfab0 Add support of media elements for RSS 2 feeds 2019-11-28 21:33:32 -08:00
Frédéric Guillot c43c9458a9 Add rewrite functions: convert_text_link and nl2br 2019-11-28 21:33:12 -08:00
Neo Ng 90064a8cf0 Update scraper rule for openingsource.org 2019-11-28 19:40:26 -08:00
Tony Wang 2eb2441f2b Improve XML decoder to remove illegal characters 2019-10-22 20:32:35 -07:00
Tony Wang 5517eebafe Add new formats to date parser 2019-10-20 09:52:18 -07:00
Frédéric Guillot 36d7732234 Disable strict XML parsing
This change should improve parsing of broken XML feeds.

See https://golang.org/pkg/encoding/xml/#Decoder
2019-09-18 22:45:56 -07:00
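The `Strict` field of `xml.Decoder` referenced in the commit above can be illustrated as follows. This is a small sketch, not Miniflux code: the `item` type and `parseItem` helper are hypothetical, and the sample document is invented to contain a bare ampersand.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// item is a hypothetical, pared-down feed entry used only for this sketch.
type item struct {
	Title string `xml:"title"`
}

// parseItem decodes a single <item> element with strict parsing on or off.
func parseItem(doc string, strict bool) (item, error) {
	var it item
	decoder := xml.NewDecoder(strings.NewReader(doc))
	decoder.Strict = strict
	err := decoder.Decode(&it)
	return it, err
}

func main() {
	// The bare ampersand makes this document invalid XML.
	doc := `<item><title>R&D notes</title></item>`

	_, err := parseItem(doc, true)
	fmt.Println("strict parsing failed:", err != nil)

	it, err := parseItem(doc, false)
	fmt.Println("lenient parsing failed:", err != nil, "title:", it.Title)
}
```

In non-strict mode the decoder leaves the unrecognised `&D` sequence in the text as-is instead of returning a syntax error.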
Frédéric Guillot 934385ff55 Replace Travis by GitHub Actions 2019-09-15 11:48:15 -07:00
Frédéric Guillot 8d8f78241d Add native lazy loading for images and iframes
This feature is available only in Chrome >= 76 for now.

See https://web.dev/native-lazy-loading
2019-09-10 21:22:19 -07:00
Peter De Wachter b6f3160dbc add_mailto_subject: New rewrite function
Dinosaur Comics (qwantz.com) likes to hide jokes in mailto: links, but
miniflux's sanitizer strips those out.
2019-08-19 19:42:47 -07:00
Frédéric Guillot ac45307da6 Add test case for parsing HTML entities 2019-08-15 21:42:13 -07:00
Peter De Wachter ea2b6e3608 addImageTitle: Fix HTML injection
This rewrite rule would change this:

    <img title="<foo>">

to this:

    <figure><img><figcaption><foo></figcaption></figure>

The image title needs to be properly escaped.
2019-08-15 21:39:41 -07:00
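The fix described in the commit above amounts to escaping the title before placing it in the caption. A minimal sketch using the standard `html` package follows; `figureWithCaption` is a hypothetical helper that builds the markup by string concatenation, which is not necessarily how the Miniflux rewriter works.

```go
package main

import (
	"fmt"
	"html"
)

// figureWithCaption wraps an image in a <figure> and uses the escaped title
// as the caption, so markup inside the title cannot be injected.
func figureWithCaption(src, title string) string {
	return `<figure><img src="` + html.EscapeString(src) + `">` +
		`<figcaption>` + html.EscapeString(title) + `</figcaption></figure>`
}

func main() {
	fmt.Println(figureWithCaption("comic.png", "<foo>"))
	// <figure><img src="comic.png"><figcaption>&lt;foo&gt;</figcaption></figure>
}
```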
Peter De Wachter 3a39d110f0 Accept HTML entities when parsing XML
Every once in a while, one of my feeds would throw an XML parse error
because it used `&nbsp;` or some other HTML entity. I feel Miniflux
should be lenient here, and Go already has a handy hook to make this
work.
2019-08-15 21:26:07 -07:00
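The "handy hook" mentioned in the commit above is presumably the `Entity` field of `xml.Decoder`, which can be set to the predefined `xml.HTMLEntity` map. The sketch below shows that mechanism under this assumption; `parseTitle` and the sample document are hypothetical.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// parseTitle decodes the <title> of a single <item>, accepting HTML entities
// such as &nbsp; that are not defined by XML itself.
func parseTitle(doc string) (string, error) {
	var it struct {
		Title string `xml:"title"`
	}
	decoder := xml.NewDecoder(strings.NewReader(doc))
	decoder.Entity = xml.HTMLEntity // map of standard HTML entity names
	if err := decoder.Decode(&it); err != nil {
		return "", err
	}
	return it.Title, nil
}

func main() {
	title, err := parseTitle(`<item><title>Hello&nbsp;World</title></item>`)
	if err != nil {
		panic(err)
	}
	// &nbsp; is decoded to U+00A0 (no-break space).
	fmt.Println(title == "Hello\u00a0World") // true
}
```

Without the `Entity` map, the default strict decoder would reject `&nbsp;` as an invalid character entity.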
Ilya Glotov c840268678 Sort feed categories before serialization
A function is added to normalize feeds and their categories.
The test will ensure that the order is right.
2019-07-05 20:34:49 +03:00
Frédéric Guillot 129f1bf3da Add support for OPML v1 import 2019-03-26 20:09:31 -07:00
Jeremy Apthorp 304b43cb30 Add 'allow-popups' to iframe sandbox permissions 2019-03-26 18:26:56 -07:00
Frédéric Guillot 6764a420b0 Make parser compatible with Go 1.12
See changes in strings.Map(): https://golang.org/doc/go1.12#strings
2019-02-28 21:23:33 -08:00
Frédéric Guillot f3fc8b7072 Use feed ID instead of user ID to check entry URLs presence 2019-02-28 20:43:33 -08:00
Frédéric Guillot ed6ae7e0d2 Use preferably the published date for Atom feeds
YouTube feeds use the published date for the original creation date.
2019-01-29 20:01:36 -08:00
Peter De Wachter 0cdcec10ca More robust Atom text handling
Miniflux couldn't deal with XHTML Summary elements.

- Make Summary an 'atomContent' field
- Define an atomContentToString function rather than inlining it three times
- Also properly escape special characters in plain text fields.
2019-01-07 17:55:02 -08:00
Frédéric Guillot 56efd2eb3f Add workaround for non GMT dates (RFC822, RFC850, and RFC1123)
RFC822, RFC850, and RFC1123 are supposed to be always in GMT.

This is a workaround for dates defined in the PST timezone.
2018-12-26 20:24:38 -08:00
Frédéric Guillot 012138179c Add function storage.UpdateFeedError() 2018-12-15 13:04:38 -08:00
Tom Matthews 8b40778ee1 Add BBC News scraping rule 2018-12-13 20:25:30 -08:00
Frederic Guillot 61bfb3cfa8 Make password prompt compatible with Windows 2018-12-09 17:44:33 -08:00
Frédéric Guillot 1bc8535dbb Move image proxy filter to template functions 2018-12-02 21:09:53 -08:00
Frédéric Guillot 6f5d93cbbe Update scraper rule for lemonde.fr 2018-12-02 20:53:22 -08:00
Frédéric Guillot 311a133ab8 Refactor manual entry scraper 2018-12-02 20:51:06 -08:00
mapl e47188eab2 Update scraper rule for heise.de 2018-12-01 11:49:30 -08:00
Frédéric Guillot 487852f07e Replace daemon and scheduler package with service package 2018-11-11 15:32:48 -08:00
Frédéric Guillot 3b6e44c331 Allow the scraper to parse XHTML documents
Only "text/html" was authorized before.
2018-11-03 13:44:13 -07:00
Frédéric Guillot ae1dc1a91e Handle more encoding conversion edge cases 2018-10-29 23:00:03 -07:00
Frédéric Guillot 7d1b471d88 Add test case to check different feed encoding and HTTP headers 2018-10-29 19:04:36 -07:00
Frédéric Guillot 85d48c8a71 Add entries storage error to feed errors count 2018-10-21 11:44:29 -07:00
Frédéric Guillot b8f874a37d Simplify feed entries filtering
- Rename processor package to filter
- Remove boilerplate code
2018-10-14 22:33:19 -07:00
Frédéric Guillot 778346b0b0 Simplify feed fetcher
- Add browser package to handle HTTP errors
- Reduce code duplication
2018-10-14 21:43:48 -07:00
Frédéric Guillot 5870f04260 Simplify feed parser and format detection
- Avoid doing multiple buffer copies
- Move parser and format detection logic to its own package
2018-10-14 11:46:41 -07:00
Frédéric Guillot 9606126196 Convert text links and line feeds to HTML in YouTube channels 2018-10-08 20:47:10 -07:00
Frédéric Guillot 9dc38a0803 Add missing package descriptions for GoDoc 2018-10-08 17:32:17 -07:00
Frédéric Guillot 11dfcdd3d6 Fix typo in license header 2018-10-08 15:50:15 -07:00
Frédéric Guillot b1e8f534ef Simplify locale package usage (refactoring) 2018-09-22 15:04:55 -07:00
Frédéric Guillot beb7a0cfcb Use unique translation IDs instead of English text as key 2018-09-21 22:23:23 -07:00
Patrick 2538eea177 Add the possibility to override default user agent for each feed 2018-09-19 18:19:24 -07:00
Frédéric Guillot df2bebaf3d Update scraper rule for heise.de 2018-08-25 10:33:18 -07:00
Frédéric Guillot dbcc5d8a97 Use canonical imports 2018-08-24 21:56:39 -07:00
neepl 5365f31e90 Add support for published tag in Atom feeds 2018-07-17 21:52:05 -07:00
Frédéric Guillot a786e78aca Add embedly.com to iframe whitelist 2018-07-10 20:56:54 -07:00
dzaikos 6d25e02cb5 New `add_dynamic_image` rewriter for JavaScript-loaded images.
Searches tags for various `data-*` attributes and sets `img` tag `src` attribute appropriately. Falls back to searching `noscript` for `img` tags.

Includes unit tests.
2018-07-09 01:22:48 -04:00
dzaikos e1c56b2e53 Processor: Do rewriter before sanitizer for `entry.Content`.
Addresses #163.
2018-07-06 00:17:07 -04:00
Frédéric Guillot de1a4aad30 Add support for protocol relative YouTube URLs 2018-07-04 22:45:44 -07:00
dzaikos 7d4a195519 Sandbox iframes when sanitizing.
Updated iframe unit tests.

Refactored sanitizer.getExtraAttributes() to use `switch` instead of multiple `if` statements.
2018-07-03 12:55:18 -07:00
Frédéric Guillot 9c0f882ba0 Add specific 404 and 401 error messages 2018-06-30 12:42:12 -07:00
dzaikos 45d7105ed1 Refactor AddImageTitle rewriter.
* Only processes images with `src` **and** `title` attributes (others are ignored).
* Processes **all** images in the document (not just the first one).
* Wraps the image and its title attribute in a `figure` tag with the title attribute's contents in a `figcaption` tag.

Updated xkcd rewriter unit test.

Added another xkcd rewriter unit test to check rendering of images without title tags.
2018-06-26 17:50:18 -04:00
dzaikos c9131b0e89 Improve sanitizer to remove style tag contents.
See #157.

Refactored how blacklisted tags are handled so they're easier to manage in the future.
2018-06-24 19:53:23 -07:00
Dave Z d847b10e32 Improve sanitizer to remove script and noscript contents
These tags were removed but the content was rendered as escaped HTML.

See #157
2018-06-23 17:50:43 -07:00
Frédéric Guillot bddca15b69 Add new fields for feed username/password 2018-06-19 22:58:29 -07:00
Frédéric Guillot c719cf7df0 Rewrite iframe Youtube URLs to https://www.youtube-nocookie.com 2018-06-12 18:45:09 -07:00
Frédéric Guillot 0c2e5ff0dc Handle feeds with dates formatted as Unix timestamp 2018-05-08 20:41:24 -07:00
Frédéric Guillot 5cacae6cf2 Add API endpoint to import OPML file 2018-04-29 18:56:40 -07:00
Frédéric Guillot 1eba1730d1 Move HTTP client to its own package 2018-04-28 10:51:07 -07:00
aniran 322b265d7a Scrape parent element for iframe
Current behavior: if you have an `iframe` scraper rule, `scrapContent`
tries to return the inner HTML of the `iframe`, which turns up blank.

New behavior: like `img` elements, if an `iframe` is matched by a scraper rule,
the parent element's inner HTML (i.e. the `iframe`) is returned.
2018-04-27 17:57:22 -07:00
aniran 920dda79b7 Add soundcloud and bandcamp iframe sources 2018-04-27 17:55:58 -07:00
Frédéric Guillot dcbb5047b1 Add support for Dublin Core date in RDF feeds 2018-04-10 18:13:05 -07:00
Frédéric Guillot 02ba735ba9 Handle some non-english date formats 2018-04-09 21:27:15 -07:00
Frédéric Guillot e2d02bac5a Rename RSS parser getters 2018-04-09 20:38:12 -07:00
Frédéric Guillot f76093690c Get the right comments URL when having multiple namespaces 2018-04-09 20:30:55 -07:00
Frédéric Guillot 702256bcc0 Add unit test for comments url and French translation 2018-04-07 13:56:11 -07:00
Ben Brooks 538d08c16c Add CommentsURL to entry 2018-04-07 13:50:45 -07:00
Frédéric Guillot 6ea4da3bce Handle RSS author elements with inner HTML 2018-03-18 11:57:46 -07:00
Frédéric Guillot 482785c5e6 Convert enclosure size field to bigint 2018-03-14 20:09:06 -07:00
Frédéric Guillot ec08f45bf5 Fix broken OPML import with Go 1.10 2018-03-14 18:50:06 -07:00
Frédéric Guillot f110384f11 Improve parser error messages 2018-02-27 21:19:59 -08:00
Frédéric Guillot 953d0a2dc0 Support localized feed errors generated by background workers 2018-02-27 21:08:32 -08:00
Frédéric Guillot 9292d5d604 Handle Atom feeds with HTML title 2018-02-17 12:21:58 -08:00
Frédéric Guillot dda9114692 Improve error handling for HTTP client 2018-02-08 18:16:54 -08:00
Frédéric Guillot 7b0bfd9308 Strip invalid XML characters to avoid parsing errors 2018-02-07 20:57:56 -08:00
Frédéric Guillot c6fd9eb9b1 Remove period for feed errors 2018-02-07 19:10:36 -08:00
Frédéric Guillot 0fb87eba3f Improve error handling when the response is empty 2018-02-07 18:47:47 -08:00
Frédéric Guillot b78172033f Show API URL endpoints in user interface 2018-01-31 21:57:20 -08:00
Frédéric Guillot ffabb009b8 Do not override existing entries when the crawler is enabled 2018-01-20 14:04:19 -08:00
Frédéric Guillot 713b38e34c Handle more encoding edge cases
- Feeds with charset specified only in Content-Type header and not in XML document
- Feeds with charset specified in both places
- Feeds with charset specified only in XML document and not in HTTP header
2018-01-20 13:25:21 -08:00
Frédéric Guillot 3b62f904d6 Do not crawl existing entry URLs 2018-01-20 13:25:20 -08:00
Frédéric Guillot 9652dfa1fe Add more comments (GoDoc) 2018-01-11 19:21:20 -08:00
Frédéric Guillot 1d7fe892e1 Add scraper rule for darkreading.com 2018-01-06 13:25:12 -08:00
Frédéric Guillot 48aa0d07ef Add more scraper rules 2018-01-04 19:32:24 -08:00
Frédéric Guillot 7d278d49f1 Add content length check when refreshing feeds 2018-01-04 18:41:23 -08:00
Frédéric Guillot efac11e082 Handle more date formats 2018-01-03 18:59:29 -08:00
Frédéric Guillot ec63cbe7bb If the website URL is empty, assign the feed URL 2018-01-03 18:23:21 -08:00
Frédéric Guillot c39f2e1a8d Rename helper packages 2018-01-02 19:15:08 -08:00
Frédéric Guillot 3c3f397bf5 Make sure the scraper parses only HTML documents 2018-01-02 18:32:01 -08:00
Frédéric Guillot c454f67037 Add scraper rules for version2.dk and ing.dk 2017-12-27 19:44:23 -08:00
Frédéric Guillot d4839b5597 Add more scraper rules 2017-12-27 13:36:07 -08:00
Frédéric Guillot f6a5d7d6ed Add support for data URL favicons 2017-12-22 19:01:39 -08:00
Frédéric Guillot e7afec7eca Handle more date formats 2017-12-22 17:59:28 -08:00
Frédéric Guillot 1d8193b892 Add logger 2017-12-15 18:55:57 -08:00