Commit Graph

262 Commits

Author SHA1 Message Date
Artémis b585dab6b4
Add data-srcset support to the "add_dynamic_image" rewrite rule 2021-10-22 18:12:23 -07:00
Frank Steinborn 2dcabc840c Fix minor typo 2021-10-17 16:58:42 -07:00
Frédéric Guillot 5f9d6fd81b Handle srcset images with no space after comma 2021-10-13 21:31:08 -07:00
三三 34dd358eb0
Add Telegram integration 2021-09-07 20:04:22 -07:00
Lukas Dietrich 93596c1218 Add rewrite rule to remove dom elements 2021-09-06 09:47:05 -07:00
hulb 01f678c3b1 add proxy arg in scraper.Fetch 2021-08-28 21:57:11 -07:00
James Loh 2f6895e118 Fix finding JSON feeds with new MIME type
Version 1.1 of the JSON Feed specification (https://jsonfeed.org/version/1.1) defines the MIME type `application/feed+json` for feeds, which Miniflux wasn't searching for.
2021-08-21 13:01:08 -07:00
Frédéric Guillot b7c229f30f Update scraper rule for theregister.com 2021-08-16 20:04:02 -07:00
Alexandros Kosiaris b8b16c3bdf Add /rss/ in finder's wellKnownUrls
ATCOM netvolution WCM (and probably other CMSs), which powers several
high-profile, high-traffic Greek news sites among others,
publishes the RSS feed under /rss/. Add it to the list. It's generic
enough to assume that other software might do the same.

On a select set of 627 Greek news media sites (the infamous Petsas list),
adding this rule increased the number of discovered RSS feeds by
2.61% (from 498 to 511).
2021-07-22 19:46:40 -07:00
Dave Marquard fc766de02d use authors entry for json 1.1 feeds 2021-07-21 21:28:37 -07:00
Jan-Lukas Else 20cd023c07
Use runes instead of bytes to truncate JSON feed titles
This fix avoids breaking Unicode strings.

It solves this error:

pq: invalid byte sequence for encoding "UTF8": 0xf0 0x9f 0x9a 0x2e
2021-05-31 11:42:59 -07:00
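The difference between byte and rune truncation can be sketched as follows. This is a minimal illustration, not the code from the commit; the helper name `truncateRunes` is hypothetical.

```go
package main

import "fmt"

// truncateRunes shortens s to at most max Unicode code points.
// Slicing by bytes (s[:max]) could cut a multi-byte UTF-8 sequence
// in half, which is what PostgreSQL rejects as an invalid byte sequence.
// The name and signature are assumptions made for this sketch.
func truncateRunes(s string, max int) string {
	if max < 0 {
		max = 0
	}
	runes := []rune(s)
	if len(runes) <= max {
		return s
	}
	return string(runes[:max])
}

func main() {
	// The rocket emoji takes four bytes in UTF-8 but counts as one rune.
	fmt.Println(truncateRunes("Go 🚀 rocks", 4))
}
```

Converting to `[]rune` allocates, which is usually acceptable for short strings such as feed titles.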
Frédéric Guillot 5b8eb4735c Handle RSS feed title with encoded Unicode entities 2021-04-30 22:57:29 -07:00
yue 18e414ec45
Fix typo in reader/json/doc.go 2021-04-02 19:00:06 -07:00
Frédéric Guillot 6e2e2d1665 Setup golangci-lint Github Action 2021-03-22 21:34:48 -07:00
Darius 9242350f0e
Add per feed cookies option 2021-03-22 20:27:58 -07:00
Frédéric Guillot e60e0ba3c4 Add workaround to handle some invalid dates 2021-03-21 10:52:27 -07:00
Frédéric Guillot 5877048749 Improve handling of Atom text content with CDATA 2021-03-20 20:47:35 -07:00
Frédéric Guillot c8c1f05328 Add better support of Atom text constructs
- Note that Miniflux does not render entry title with HTML tags as of now
- Omit XHTML div element because it should not be part of the content
2021-03-19 22:05:00 -07:00
Frédéric Guillot 96f3e888cf Handle RDF feed with HTML encoded entry title
Example: http://rss.slashdot.org/Slashdot/slashdotMain
2021-03-19 18:49:51 -07:00
Frédéric Guillot 14888f1cb8 Fix incorrect parsing of Atom entry content of type HTML 2021-03-18 21:43:59 -07:00
Gabriel Augendre 1d80c12e18
Prevent Youtube scraping if entry already exists 2021-03-08 20:10:53 -08:00
hykhd 053b1d0f8d
Handle RSS feeds with CDATA in author item element 2021-02-28 12:26:52 -08:00
Frédéric Guillot ec3c604a83 Add option to allow self-signed or invalid certificates 2021-02-21 13:58:52 -08:00
Ilya Mateyko c3f871b49b Use YouTube video duration as read time
This feature works by scraping the YouTube website.

To enable it, set the FETCH_YOUTUBE_WATCH_TIME environment variable to
1.

Resolves #972.
2021-02-21 11:13:52 -08:00
hykhd 3cb04b2c56 update whitelist fix bilibili video 2021-02-20 10:29:42 -08:00
Frédéric Guillot a352aff93b Remove deprecated io/ioutil package
Miniflux now requires at least Go 1.16, and io/ioutil is deprecated.

https://golang.org/doc/go1.16#ioutil
2021-02-16 21:25:21 -08:00
Frédéric Guillot 04f9c456d5 Handle entry title with double encoded entities in Atom feeds 2021-02-14 11:19:21 -08:00
Frédéric Guillot 0413daf76b Remove iframe inner HTML contents
An iframe element never has fallback content, as it will always create a nested
browsing context, regardless of whether the specified initial contents are
successfully used.

https://www.w3.org/TR/2010/WD-html5-20101019/the-iframe-element.html#the-iframe-element
2021-02-13 14:00:21 -08:00
Frédéric Guillot 5043749b9f Add workaround for entry title with double encoded entities
Example: 'Text'
2021-02-13 13:33:59 -08:00
Nick Chitwood 793f475edd
Update date parser to fix another time zone issue
The Washington Post publishes its feeds with EST timestamps, which Miniflux parses as UTC, so entries show up 8 hours off.

See http://feeds.washingtonpost.com/rss/politics for an example.

This fix applies a similar workaround for EST/EDT as was done for PST/PDT.
2021-02-10 22:45:02 -08:00
Frédéric Guillot 864dd9f219 Allow images with data URLs
Only data URLs with an image/* MIME type are allowed.
2021-02-06 14:46:01 -08:00
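A check along these lines could look like the sketch below. The function name `isAllowedDataURL` is hypothetical, and the real sanitizer may apply stricter rules.

```go
package main

import (
	"fmt"
	"strings"
)

// isAllowedDataURL reports whether rawURL is a data URL whose media
// type starts with image/. Hypothetical helper for illustration only.
func isAllowedDataURL(rawURL string) bool {
	return strings.HasPrefix(strings.ToLower(rawURL), "data:image/")
}

func main() {
	fmt.Println(isAllowedDataURL("data:image/png;base64,iVBORw0KGgo="))
	fmt.Println(isAllowedDataURL("data:text/html;base64,PGh0bWw+"))
}
```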
Ilya Mateyko 4464802947 Reformat some Go files
When working on #994 I noticed that some Go files are not formatted with
`gofmt`.

This PR fixes this.
2021-01-27 18:13:58 -08:00
Frédéric Guillot 806b9545a9 Refactor feed validator 2021-01-04 14:47:25 -08:00
Frédéric Guillot 4468ef1410 Refactor category validation 2021-01-03 22:50:24 -08:00
Frédéric Guillot 291bf96d15 Do not strip tags for entry title
Some technical blogs have titles like "</some-title>" or "This is some <code>source code</code>".

Miniflux was removing these elements, which prevented the title from rendering correctly.
2021-01-03 11:44:07 -08:00
Frédéric Guillot f0610bdd9c Refactor feed creation to allow setting most fields via API
Allow API clients to create disabled feeds or define fields like "ignore_http_cache".
2021-01-02 16:48:22 -08:00
Frédéric Guillot 1908c84fbe Handle invalid French date 2020-12-02 20:59:14 -08:00
Frédéric Guillot f722fd1208 Handle invalid feeds with relative URLs 2020-12-02 20:58:18 -08:00
Pacman99 b8b6c74d86 Add rewrite rule replace for custom search and replace 2020-11-29 10:32:26 -08:00
Frédéric Guillot de7a613098 Calculate reading time during feed processing
The goal is to speed up the user interface.

Detecting the language based on the content is pretty slow.
2020-11-18 17:43:24 -08:00
Frédéric Guillot b1c9977711 Handle more invalid dates 2020-11-17 17:12:12 -08:00
Frédéric Guillot a108cb7808 Handle various invalid dates 2020-11-16 21:37:33 -08:00
Frédéric Guillot 246a48359c Do not follow redirects when trying known feed URLs
Some websites redirect unknown URLs to the home page.
As a result, the whole list of known URLs is returned to the subscription list.
We don't want the user to choose between invalid feed URLs.
2020-11-06 17:46:54 -08:00
Frédéric Guillot 40e983664c Trim spaces around icon URLs 2020-11-06 17:18:58 -08:00
Frédéric Guillot 4f358aa0f3 Do not escape HTML for Atom 1.0 text content during parsing
Avoid encoding single quotes to HTML entities (&#39;).

Feed contents are sanitized after parsing.
2020-10-30 23:41:33 -07:00
Frédéric Guillot b30a045a4e Refactor entry filtering
Avoid looping multiple times across entries
2020-10-19 22:18:41 -07:00
Frédéric Guillot b50778d3eb Add rewrite rule to use noscript content for images rendered with Javascript 2020-10-19 21:31:10 -07:00
Manuel Garrido 84b83fc3c8
Add feed filters (Keeplist and Blocklist) 2020-10-16 14:40:56 -07:00
Frédéric Guillot 3afdf25012 Do not proxy image data url 2020-10-14 22:26:54 -07:00
Frédéric Guillot 31435ef83e Add rewrite rule to fix Medium.com images 2020-09-29 22:27:32 -07:00
Frédéric Guillot d75ff0c5ab Add sanitizer support for responsive images
- Add support for picture HTML tag
- Add support for srcset, media, and sizes attributes to img and source tags
2020-09-28 23:22:08 -07:00
Frédéric Guillot c394a61a4e Add Prometheus exporter 2020-09-27 20:04:48 -07:00
Frédéric Guillot 16b7b3bc3e http client: remove dependency on global config options 2020-09-27 14:37:46 -07:00
Dave Marquard eb026ae4ac handle Pacific Daylight Time in addition to Pacific Standard Time 2020-09-22 19:47:36 -07:00
Frédéric Guillot 0d0395b4e3 Do not try to update a duplicated feed after a refresh 2020-09-20 23:42:18 -07:00
Frédéric Guillot e6c6ee441a Use a transaction to refresh and create entries
Also includes a few database improvements:

- Speed up entries clean up with an index and a goroutine
- Avoid the accumulation of enclosures for some feeds
2020-09-20 23:12:23 -07:00
Frédéric Guillot bfb96d536e Add workaround for parsing an invalid date 2020-09-14 21:23:26 -07:00
Kebin Liu cf7712acea
Add HTTP proxy option for subscriptions 2020-09-09 23:28:54 -07:00
alex 0f258fd55b
Make add_invidious_video rule applicable for different invidious instances 2020-09-06 13:41:42 -07:00
Frédéric Guillot fc75b0cd8e Add workaround to get YouTube feed from video page 2020-08-02 12:24:46 -07:00
Frédéric Guillot 7380c64141 Add workaround to find YouTube channel feeds
YouTube doesn't expose RSS links anymore for new-style URLs.
2020-08-02 11:37:07 -07:00
Frédéric Guillot 1d6b0491a7 Ignore <media:title> in RSS 2.0 feeds
In the vast majority of cases, the default entry title is correct.

Ignoring <media:title> avoids overriding the default title when the two differ.
2020-06-29 18:24:06 -07:00
Gabriel Augendre e44b4b2540 Try known urls if no link alternate
I came across a few blogs that didn't have a link rel alternate
but offered an RSS/Atom feed.
This aims to solve this issue for "well known" feed URLs, since
these URLs are often the same.
2020-06-21 20:34:59 -07:00
Manuel Müller ca918bc7e3 Added scraper rule for dilbert.com and turnoff.us 2020-06-10 20:15:46 -07:00
Frédéric Guillot 6c6ca69141 Add feed option to ignore HTTP cache 2020-06-05 22:04:52 -07:00
Frédéric Guillot 7e5157f218 Rename alternative scheduler to entry_frequency 2020-05-25 15:12:47 -07:00
Shizun Ge cead85b165
Add alternative scheduler based on the number of entries 2020-05-25 14:06:56 -07:00
Corey McCaffrey 25d4b9fc0c Added scraper rule for financialsamurai.com
The default rule results in blank content.
2020-05-24 13:29:28 -07:00
Corey McCaffrey 0683074b8b Added scraper rule for TheOatmeal.com
The default rule does not show the comic posted to the feed. The comic image is in a div with id "comic".
2020-05-13 21:28:00 -07:00
Corey McCaffrey 8f6c07afd6 Added scraper rule for RayWenderlich.com
RayWenderlich.com is a popular developer's community for iOS and Android developers. The default rule results in "GROUP GROUP GROUP GROUP…" instead of the content posted on the blog.
2020-05-13 21:28:00 -07:00
Frédéric Guillot 619aa58fb3 Handle more invalid dates
Fixes #617
2020-04-25 20:15:18 -07:00
Frédéric Guillot 592151bdb6 Add support for Invidious
- Embed Invidious player for invidio.us feeds
- Add new rewrite rule to use Invidious player for Youtube feeds
2020-03-20 20:56:59 -07:00
Andrew Williams 9974e0f458 Addition of scraper rule for wdwnt.com
By default, fetching original content for wdwnt.com results in a snippet of the comments section; this rule captures the article content.
2020-02-28 20:24:58 -08:00
Frédéric Guillot 997e9422eb Ignore enclosures without URL 2020-01-30 21:18:49 -08:00
Frédéric Guillot 61f0c8aa66 Allow application/xhtml+xml links as comments URL in Atom replies 2020-01-04 16:07:06 -08:00
Frédéric Guillot bf632fad2e Allow only absolute URLs in comments URL
Some feeds are using invalid URLs (random text).
2020-01-04 15:54:16 -08:00
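A minimal sketch of such a check with `net/url` follows. The name `isAbsoluteURL` is hypothetical, and requiring a host in addition to a scheme is an assumption of this example rather than something stated in the commit.

```go
package main

import (
	"fmt"
	"net/url"
)

// isAbsoluteURL reports whether raw parses as a URL that has both a
// scheme and a host. Hypothetical helper for illustration.
func isAbsoluteURL(raw string) bool {
	u, err := url.Parse(raw)
	return err == nil && u.IsAbs() && u.Host != ""
}

func main() {
	fmt.Println(isAbsoluteURL("https://example.org/post#comments"))
	fmt.Println(isAbsoluteURL("random text"))
}
```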
Kebin Liu 8cebd985a2 Use internal XML workarounds to detect feed format 2020-01-02 22:19:15 -08:00
Frédéric Guillot ac3c936820 Make sure whitelisted URI schemes are handled properly by the sanitizer 2020-01-02 11:03:51 -08:00
Frédéric Guillot 3debf75eb9 Normalize URL query string before executing HTTP requests
- Make sure query string parameters are encoded
- As opposed to the standard library, do not append an equal sign for query parameters with an empty value
- Strip URL fragments like Web browsers do
2019-12-26 15:56:59 -08:00
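The three rules can be sketched roughly as follows. `normalizeURL` is a hypothetical function written for this example; it sorts keys for deterministic output, which the commit message does not mention.

```go
package main

import (
	"fmt"
	"net/url"
	"sort"
	"strings"
)

// normalizeURL encodes query parameters, omits the "=" for empty
// values, and drops the fragment. Illustrative sketch only.
func normalizeURL(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	u.Fragment = ""
	u.RawFragment = ""

	values := u.Query()
	keys := make([]string, 0, len(values))
	for key := range values {
		keys = append(keys, key)
	}
	sort.Strings(keys) // assumption: stable ordering for this sketch

	var parts []string
	for _, key := range keys {
		for _, value := range values[key] {
			if value == "" {
				// url.Values.Encode would emit "key=" here.
				parts = append(parts, url.QueryEscape(key))
			} else {
				parts = append(parts, url.QueryEscape(key)+"="+url.QueryEscape(value))
			}
		}
	}
	u.RawQuery = strings.Join(parts, "&")
	return u.String(), nil
}

func main() {
	out, err := normalizeURL("https://example.org/feed?tag=go lang&debug=#section")
	fmt.Println(out, err)
}
```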
Frédéric Guillot 200b1c304b Improve Dublin Core support for RDF feeds 2019-12-23 14:45:58 -08:00
Frédéric Guillot 1b33bb3d1c Improve Podcast support (iTunes and Google Play feeds)
- Add support for Google Play XML namespace
- Improve existing iTunes namespace implementation
2019-12-23 13:51:42 -08:00
Frédéric Guillot 33fdb2c489 Add support for Atom 0.3 2019-12-22 22:42:00 -08:00
Frédéric Guillot cfb6ddfcea Add support for Atom 'replies' link relation
Show comments URL for Atom feeds as per RFC 4685.
See https://tools.ietf.org/html/rfc4685#section-4

Note that only the first link with type "text/html" is taken into consideration.
2019-12-22 18:03:04 -08:00
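Picking the first "replies" link of type "text/html" might look like the sketch below. The struct and function names are hypothetical and cover only the attributes needed for this example.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// atomLink and atomEntry model just enough of an Atom entry for this
// sketch; they are not Miniflux's actual types.
type atomLink struct {
	Rel  string `xml:"rel,attr"`
	Type string `xml:"type,attr"`
	Href string `xml:"href,attr"`
}

type atomEntry struct {
	Links []atomLink `xml:"link"`
}

// commentsURL returns the href of the first link with rel="replies"
// and type="text/html", or an empty string if there is none.
func commentsURL(entry atomEntry) string {
	for _, link := range entry.Links {
		if link.Rel == "replies" && link.Type == "text/html" {
			return link.Href
		}
	}
	return ""
}

// parseEntry decodes a single Atom entry element.
func parseEntry(data string) (atomEntry, error) {
	var entry atomEntry
	err := xml.Unmarshal([]byte(data), &entry)
	return entry, err
}

func main() {
	entry, err := parseEntry(`<entry xmlns="http://www.w3.org/2005/Atom">
<link rel="replies" type="application/atom+xml" href="https://example.org/post/comments.atom"/>
<link rel="replies" type="text/html" href="https://example.org/post#comments"/>
</entry>`)
	fmt.Println(commentsURL(entry), err)
}
```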
cinput 8e1ed8bef3 Return outer HTML when scraping elements 2019-12-21 21:18:31 -08:00
somini 30f22fbd78 Update scraper rule for "Le Monde" 2019-12-19 18:35:29 -08:00
Jebbs a155ab6deb Filter valid XML characters for UTF-8 XML documents before decoding
This change should reduce "illegal character code" XML errors.
2019-12-19 18:31:52 -08:00
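The XML 1.0 specification restricts documents to a fixed set of code points (the Char production). A filter based on that production can be sketched as follows; the function names are hypothetical and this is not necessarily how the commit implements it.

```go
package main

import (
	"fmt"
	"strings"
)

// isValidXMLChar follows the Char production of XML 1.0:
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
func isValidXMLChar(r rune) bool {
	return r == 0x09 || r == 0x0A || r == 0x0D ||
		(r >= 0x20 && r <= 0xD7FF) ||
		(r >= 0xE000 && r <= 0xFFFD) ||
		(r >= 0x10000 && r <= 0x10FFFF)
}

// stripInvalidXMLChars drops every rune that XML 1.0 does not allow.
// Returning a negative rune from strings.Map removes the character.
func stripInvalidXMLChars(s string) string {
	return strings.Map(func(r rune) rune {
		if isValidXMLChar(r) {
			return r
		}
		return -1
	}, s)
}

func main() {
	fmt.Printf("%q\n", stripInvalidXMLChars("title\x00 with\x0b control chars"))
}
```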
Frédéric Guillot a4ebb33cd5 Trim spaces for RDF entry links 2019-12-01 15:06:01 -08:00
Frédéric Guillot 120d6ec7d8 Do not rewrite Youtube description twice in "add_youtube_video" rule
This is already done before in <media:description>.
2019-11-30 22:56:06 -08:00
Frédéric Guillot 69aa650203 Add the possibility to add rules during feed creation 2019-11-29 11:27:58 -08:00
Frédéric Guillot 912a98788e Add support of media elements for Atom feeds 2019-11-28 23:55:40 -08:00
Frédéric Guillot f90e9dfab0 Add support of media elements for RSS 2 feeds 2019-11-28 21:33:32 -08:00
Frédéric Guillot c43c9458a9 Add rewrite functions: convert_text_link and nl2br 2019-11-28 21:33:12 -08:00
Neo Ng 90064a8cf0 Update scraper rule for openingsource.org 2019-11-28 19:40:26 -08:00
Tony Wang 2eb2441f2b Improve XML decoder to remove illegal characters 2019-10-22 20:32:35 -07:00
Tony Wang 5517eebafe Add new formats to date parser 2019-10-20 09:52:18 -07:00
Frédéric Guillot 36d7732234 Disable strict XML parsing
This change should improve parsing of broken XML feeds.

See https://golang.org/pkg/encoding/xml/#Decoder
2019-09-18 22:45:56 -07:00
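The effect of the `Strict` field on `encoding/xml`'s decoder can be illustrated with an unknown entity: strict mode rejects it, while lenient mode leaves it unresolved in the text. The `feedTitle` type and `parseTitle` function below are hypothetical names used only for this sketch.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// feedTitle captures only the title element; illustrative type.
type feedTitle struct {
	Title string `xml:"title"`
}

// parseTitle decodes data with the decoder's Strict flag set as requested.
func parseTitle(data string, strict bool) (string, error) {
	decoder := xml.NewDecoder(strings.NewReader(data))
	decoder.Strict = strict
	var doc feedTitle
	if err := decoder.Decode(&doc); err != nil {
		return "", err
	}
	return doc.Title, nil
}

func main() {
	input := "<feed><title>A &unknown; B</title></feed>"

	_, err := parseTitle(input, true)
	fmt.Println("strict error:", err)

	title, err := parseTitle(input, false)
	fmt.Println("lenient title:", title, err)
}
```

Lenient parsing trades validation for tolerance, so content parsed this way still needs sanitizing afterwards.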
Frédéric Guillot 934385ff55 Replace Travis by GitHub Actions 2019-09-15 11:48:15 -07:00
Frédéric Guillot 8d8f78241d Add native lazy loading for images and iframes
This feature is available only in Chrome >= 76 for now.

See https://web.dev/native-lazy-loading
2019-09-10 21:22:19 -07:00
Peter De Wachter b6f3160dbc add_mailto_subject: New rewrite function
Dinosaur Comics (qwantz.com) likes to hide jokes in mailto: links, but
Miniflux's sanitizer strips those out.
2019-08-19 19:42:47 -07:00
Frédéric Guillot ac45307da6 Add test case for parsing HTML entities 2019-08-15 21:42:13 -07:00