239 Commits

Author SHA1 Message Date
Frédéric Guillot 997e9422eb Ignore enclosures without URL 2020-01-30 21:18:49 -08:00
Frédéric Guillot 61f0c8aa66 Allow application/xhtml+xml links as comments URL in Atom replies 2020-01-04 16:07:06 -08:00
Frédéric Guillot bf632fad2e Allow only absolute URLs in comments URL
Some feeds are using invalid URLs (random text).
2020-01-04 15:54:16 -08:00
Kebin Liu 8cebd985a2 Use internal XML workarounds to detect feed format 2020-01-02 22:19:15 -08:00
Frédéric Guillot ac3c936820 Make sure whitelisted URI schemes are handled properly by the sanitizer 2020-01-02 11:03:51 -08:00
Frédéric Guillot 3debf75eb9 Normalize URL query string before executing HTTP requests
- Make sure query string parameters are encoded
- As opposed to the standard library, do not append equal sign
for query parameters with empty value
- Strip URL fragments like Web browsers
2019-12-26 15:56:59 -08:00
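
The three normalization rules listed in this commit can be sketched as follows. This is a minimal illustration, not the project's actual code: `normalizeURL` is a hypothetical name, and sorting the parameters by key is a simplification made here only to keep the output deterministic.

```go
package main

import (
	"fmt"
	"net/url"
	"sort"
	"strings"
)

// normalizeURL is a hypothetical helper written for this illustration.
// It encodes query parameters, omits "=" for empty values, and drops
// the URL fragment.
func normalizeURL(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}

	// Strip the fragment, as Web browsers do before sending a request.
	u.Fragment = ""

	// Sort keys so the result does not depend on map iteration order
	// (an assumption of this sketch, not something the commit states).
	values := u.Query()
	keys := make([]string, 0, len(values))
	for k := range values {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var parts []string
	for _, k := range keys {
		for _, v := range values[k] {
			if v == "" {
				// Unlike url.Values.Encode, do not append "=".
				parts = append(parts, url.QueryEscape(k))
			} else {
				parts = append(parts, url.QueryEscape(k)+"="+url.QueryEscape(v))
			}
		}
	}
	u.RawQuery = strings.Join(parts, "&")

	return u.String(), nil
}

func main() {
	out, err := normalizeURL("https://example.org/feed?q=a|b&empty=#section")
	fmt.Println(out, err)
}
```

Note that this sketch cannot tell `empty=` apart from `empty`, because `url.Values` stores both as an empty string.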
Frédéric Guillot 200b1c304b Improve Dublin Core support for RDF feeds 2019-12-23 14:45:58 -08:00
Frédéric Guillot 1b33bb3d1c Improve Podcast support (iTunes and Google Play feeds)
- Add support for Google Play XML namespace
- Improve existing iTunes namespace implementation
2019-12-23 13:51:42 -08:00
Frédéric Guillot 33fdb2c489 Add support for Atom 0.3 2019-12-22 22:42:00 -08:00
Frédéric Guillot cfb6ddfcea Add support for Atom 'replies' link relation
Show comments URL for Atom feeds as per RFC 4685.
See https://tools.ietf.org/html/rfc4685#section-4

Note that only the first link with type "text/html" is taken into consideration.
2019-12-22 18:03:04 -08:00
cinput 8e1ed8bef3 Return outer HTML when scraping elements 2019-12-21 21:18:31 -08:00
somini 30f22fbd78 Update scraper rule for "Le Monde" 2019-12-19 18:35:29 -08:00
Jebbs a155ab6deb Filter valid XML characters for UTF-8 XML documents before decoding
This change should reduce "illegal character code" XML errors.
2019-12-19 18:31:52 -08:00
Frédéric Guillot a4ebb33cd5 Trim spaces for RDF entry links 2019-12-01 15:06:01 -08:00
Frédéric Guillot 120d6ec7d8 Do not rewrite YouTube description twice in "add_youtube_video" rule
This is already done before in <media:description>.
2019-11-30 22:56:06 -08:00
Frédéric Guillot 69aa650203 Add the possibility to add rules during feed creation 2019-11-29 11:27:58 -08:00
Frédéric Guillot 912a98788e Add support of media elements for Atom feeds 2019-11-28 23:55:40 -08:00
Frédéric Guillot f90e9dfab0 Add support of media elements for RSS 2 feeds 2019-11-28 21:33:32 -08:00
Frédéric Guillot c43c9458a9 Add rewrite functions: convert_text_link and nl2br 2019-11-28 21:33:12 -08:00
Neo Ng 90064a8cf0 Update scraper rule for openingsource.org 2019-11-28 19:40:26 -08:00
Tony Wang 2eb2441f2b Improve XML decoder to remove illegal characters 2019-10-22 20:32:35 -07:00
Tony Wang 5517eebafe Add new formats to date parser 2019-10-20 09:52:18 -07:00
Frédéric Guillot 36d7732234 Disable strict XML parsing
This change should improve parsing of broken XML feeds.

See https://golang.org/pkg/encoding/xml/#Decoder
2019-09-18 22:45:56 -07:00
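
For readers unfamiliar with the `Strict` field of Go's `xml.Decoder`, here is a small sketch of the difference it makes. The `entry` struct and `decodeEntry` function are hypothetical names used only here, and the commit's real change may be wired differently.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// entry is a hypothetical struct used only for this illustration.
type entry struct {
	Title string `xml:"title"`
}

// decodeEntry parses a single element into an entry. When strict is
// false, the decoder tolerates some common mistakes such as a bare "&".
func decodeEntry(data string, strict bool) (entry, error) {
	var e entry
	decoder := xml.NewDecoder(strings.NewReader(data))
	decoder.Strict = strict
	err := decoder.Decode(&e)
	return e, err
}

func main() {
	broken := "<item><title>Tom & Jerry</title></item>"

	// Strict mode rejects the unescaped ampersand.
	_, err := decodeEntry(broken, true)
	fmt.Println("strict mode failed:", err != nil)

	// Non-strict mode keeps the ampersand as literal text.
	e, err := decodeEntry(broken, false)
	fmt.Println(e.Title, err)
}
```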
Frédéric Guillot 934385ff55 Replace Travis by GitHub Actions 2019-09-15 11:48:15 -07:00
Frédéric Guillot 8d8f78241d Add native lazy loading for images and iframes
This feature is available only in Chrome >= 76 for now.

See https://web.dev/native-lazy-loading
2019-09-10 21:22:19 -07:00
Peter De Wachter b6f3160dbc add_mailto_subject: New rewrite function
Dinosaur Comics (qwantz.com) likes to hide jokes in mailto: links, but
miniflux's sanitizer strips those out.
2019-08-19 19:42:47 -07:00
Frédéric Guillot ac45307da6 Add test case for parsing HTML entities 2019-08-15 21:42:13 -07:00
Peter De Wachter ea2b6e3608 addImageTitle: Fix HTML injection
This rewrite rule would change this:

    <img title="<foo>">

to this:

    <figure><img><figcaption><foo></figcaption></figure>

The image title needs to be properly escaped.
2019-08-15 21:39:41 -07:00
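
The kind of escaping this fix calls for can be sketched with the standard `html` package. This is a simplified illustration built on plain string concatenation; `figureWithCaption` is a hypothetical name, and the actual rewrite rule is not shown in this log.

```go
package main

import (
	"fmt"
	"html"
)

// figureWithCaption is a hypothetical helper for this illustration.
// It wraps an image in a <figure> and escapes the title so that markup
// inside it is shown as text instead of being injected as HTML.
func figureWithCaption(src, title string) string {
	return `<figure><img src="` + html.EscapeString(src) + `" alt=""/><figcaption>` +
		html.EscapeString(title) + `</figcaption></figure>`
}

func main() {
	fmt.Println(figureWithCaption("a.png", "<foo>"))
}
```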
Peter De Wachter 3a39d110f0 Accept HTML entities when parsing XML
Every once in a while, one of my feeds would throw an XML parse error
because it used `&nbsp;` or some other HTML entity. I feel Miniflux
should be lenient here, and Go already has a handy hook to make this
work.
2019-08-15 21:26:07 -07:00
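
The "handy hook" is presumably the `Entity` field of `xml.Decoder` combined with the predefined `xml.HTMLEntity` map; that is an assumption, since the log does not show the diff. A minimal sketch, with `feedItem` and `decodeWithHTMLEntities` as hypothetical names:

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// feedItem is a hypothetical struct used only for this illustration.
type feedItem struct {
	Title string `xml:"title"`
}

// decodeWithHTMLEntities parses XML while resolving HTML entities such
// as &nbsp;, which are not defined by XML itself.
func decodeWithHTMLEntities(data string) (feedItem, error) {
	var it feedItem
	decoder := xml.NewDecoder(strings.NewReader(data))
	decoder.Entity = xml.HTMLEntity
	err := decoder.Decode(&it)
	return it, err
}

func main() {
	it, err := decodeWithHTMLEntities("<item><title>Hello&nbsp;World</title></item>")
	// %q makes the non-breaking space (U+00A0) visible in the output.
	fmt.Printf("%q %v\n", it.Title, err)
}
```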
Ilya Glotov c840268678 Sort feed categories before serialization
Add a function that normalizes feeds and their categories.
A test ensures that the order is correct.
2019-07-05 20:34:49 +03:00
Frédéric Guillot 129f1bf3da Add support for OPML v1 import 2019-03-26 20:09:31 -07:00
Jeremy Apthorp 304b43cb30 Add 'allow-popups' to iframe sandbox permissions 2019-03-26 18:26:56 -07:00
Frédéric Guillot 6764a420b0 Make parser compatible with Go 1.12
See changes in strings.Map(): https://golang.org/doc/go1.12#strings
2019-02-28 21:23:33 -08:00
Frédéric Guillot f3fc8b7072 Use feed ID instead of user ID to check entry URLs presence 2019-02-28 20:43:33 -08:00
Frédéric Guillot ed6ae7e0d2 Prefer the published date for Atom feeds
YouTube feeds use the published date for the original creation date.
2019-01-29 20:01:36 -08:00
Peter De Wachter 0cdcec10ca More robust Atom text handling
Miniflux couldn't deal with XHTML Summary elements.

- Make Summary an 'atomContent' field
- Define an atomContentToString function rather than inlining it three times
- Also properly escape special characters in plain text fields.
2019-01-07 17:55:02 -08:00
Frédéric Guillot 56efd2eb3f Add workaround for non GMT dates (RFC822, RFC850, and RFC1123)
RFC822, RFC850, and RFC1123 are supposed to be always in GMT.

This is a workaround for dates defined in the PST timezone.
2018-12-26 20:24:38 -08:00
Frédéric Guillot 012138179c Add function storage.UpdateFeedError() 2018-12-15 13:04:38 -08:00
Tom Matthews 8b40778ee1 Add BBC News scraping rule 2018-12-13 20:25:30 -08:00
Frederic Guillot 61bfb3cfa8 Make password prompt compatible with Windows 2018-12-09 17:44:33 -08:00
Frédéric Guillot 1bc8535dbb Move image proxy filter to template functions 2018-12-02 21:09:53 -08:00
Frédéric Guillot 6f5d93cbbe Update scraper rule for lemonde.fr 2018-12-02 20:53:22 -08:00
Frédéric Guillot 311a133ab8 Refactor manual entry scraper 2018-12-02 20:51:06 -08:00
mapl e47188eab2 Update scraper rule for heise.de 2018-12-01 11:49:30 -08:00
Frédéric Guillot 487852f07e Replace daemon and scheduler package with service package 2018-11-11 15:32:48 -08:00
Frédéric Guillot 3b6e44c331 Allow the scraper to parse XHTML documents
Only "text/html" was authorized before.
2018-11-03 13:44:13 -07:00
Frédéric Guillot ae1dc1a91e Handle more encoding conversion edge cases 2018-10-29 23:00:03 -07:00
Frédéric Guillot 7d1b471d88 Add test case to check different feed encoding and HTTP headers 2018-10-29 19:04:36 -07:00
Frédéric Guillot 85d48c8a71 Add entries storage error to feed errors count 2018-10-21 11:44:29 -07:00
Frédéric Guillot b8f874a37d Simplify feed entries filtering
- Rename processor package to filter
- Remove boilerplate code
2018-10-14 22:33:19 -07:00
Frédéric Guillot 778346b0b0 Simplify feed fetcher
- Add browser package to handle HTTP errors
- Reduce code duplication
2018-10-14 21:43:48 -07:00
Frédéric Guillot 5870f04260 Simplify feed parser and format detection
- Avoid doing multiple buffer copies
- Move parser and format detection logic to its own package
2018-10-14 11:46:41 -07:00
Frédéric Guillot 9606126196 Convert text links and line feeds to HTML in YouTube channels 2018-10-08 20:47:10 -07:00
Frédéric Guillot 9dc38a0803 Add missing package descriptions for GoDoc 2018-10-08 17:32:17 -07:00
Frédéric Guillot 11dfcdd3d6 Fix typo in license header 2018-10-08 15:50:15 -07:00
Frédéric Guillot b1e8f534ef Simplify locale package usage (refactoring) 2018-09-22 15:04:55 -07:00
Frédéric Guillot beb7a0cfcb Use unique translation IDs instead of English text as key 2018-09-21 22:23:23 -07:00
Patrick 2538eea177 Add the possibility to override default user agent for each feed 2018-09-19 18:19:24 -07:00
Frédéric Guillot df2bebaf3d Update scraper rule for heise.de 2018-08-25 10:33:18 -07:00
Frédéric Guillot dbcc5d8a97 Use canonical imports 2018-08-24 21:56:39 -07:00
neepl 5365f31e90 Add support for published tag in Atom feeds 2018-07-17 21:52:05 -07:00
Frédéric Guillot a786e78aca Add embedly.com to iframe whitelist 2018-07-10 20:56:54 -07:00
dzaikos 6d25e02cb5 New `add_dynamic_image` rewriter for JavaScript-loaded images.
Searches tags for various `data-*` attributes and sets `img` tag `src` attribute appropriately. Falls back to searching `noscript` for `img` tags.

Includes unit tests.
2018-07-09 01:22:48 -04:00
dzaikos e1c56b2e53 Processor: Do rewriter before sanitizer for `entry.Content`.
Addresses #163.
2018-07-06 00:17:07 -04:00
Frédéric Guillot de1a4aad30 Add support for protocol relative YouTube URLs 2018-07-04 22:45:44 -07:00
dzaikos 7d4a195519 Sandbox iframes when sanitizing.
Updated iframe unit tests.

Refactored sanitizer.getExtraAttributes() to use `switch` instead of multiple `if` statements.
2018-07-03 12:55:18 -07:00
Frédéric Guillot 9c0f882ba0 Add specific 404 and 401 error messages 2018-06-30 12:42:12 -07:00
dzaikos 45d7105ed1 Refactor AddImageTitle rewriter.
* Only processes images with `src` **and** `title` attributes (others are ignored).
* Processes **all** images in the document (not just the first one).
* Wraps the image and its title attribute in a `figure` tag with the title attribute's contents in a `figcaption` tag.

Updated xkcd rewriter unit test.

Added another xkcd rewriter unit test to check rendering of images without title tags.
2018-06-26 17:50:18 -04:00
dzaikos c9131b0e89 Improve sanitizer to remove style tag contents.
See #157.

Refactored how blacklisted tags are handled so they're easier to manage in the future.
2018-06-24 19:53:23 -07:00
Dave Z d847b10e32 Improve sanitizer to remove script and noscript contents
These tags were removed but the content was rendered as escaped HTML.

See #157
2018-06-23 17:50:43 -07:00
Frédéric Guillot bddca15b69 Add new fields for feed username/password 2018-06-19 22:58:29 -07:00
Frédéric Guillot c719cf7df0 Rewrite iframe Youtube URLs to https://www.youtube-nocookie.com 2018-06-12 18:45:09 -07:00
Frédéric Guillot 0c2e5ff0dc Handle feeds with dates formatted as Unix timestamp 2018-05-08 20:41:24 -07:00
Frédéric Guillot 5cacae6cf2 Add API endpoint to import OPML file 2018-04-29 18:56:40 -07:00
Frédéric Guillot 1eba1730d1 Move HTTP client to its own package 2018-04-28 10:51:07 -07:00
aniran 322b265d7a Scrape parent element for iframe
Current behavior: if you have an `iframe` scraper rule, `scrapContent`
tries to return the inner HTML of the `iframe`, which turns up blank.

New behavior: like `img` elements, if an `iframe` is matched by a scraper rule,
the parent element's inner HTML (i.e. the `iframe`) is returned.
2018-04-27 17:57:22 -07:00
aniran 920dda79b7 Add soundcloud and bandcamp iframe sources 2018-04-27 17:55:58 -07:00
Frédéric Guillot dcbb5047b1 Add support for Dublin Core date in RDF feeds 2018-04-10 18:13:05 -07:00
Frédéric Guillot 02ba735ba9 Handle some non-english date formats 2018-04-09 21:27:15 -07:00
Frédéric Guillot e2d02bac5a Rename RSS parser getters 2018-04-09 20:38:12 -07:00
Frédéric Guillot f76093690c Get the right comments URL when having multiple namespaces 2018-04-09 20:30:55 -07:00
Frédéric Guillot 702256bcc0 Add unit test for comments url and French translation 2018-04-07 13:56:11 -07:00
Ben Brooks 538d08c16c Add CommentsURL to entry 2018-04-07 13:50:45 -07:00
Frédéric Guillot 6ea4da3bce Handle RSS author elements with inner HTML 2018-03-18 11:57:46 -07:00
Frédéric Guillot 482785c5e6 Convert enclosure size field to bigint 2018-03-14 20:09:06 -07:00
Frédéric Guillot ec08f45bf5 Fix broken OPML import with Go 1.10 2018-03-14 18:50:06 -07:00
Frédéric Guillot f110384f11 Improve parser error messages 2018-02-27 21:19:59 -08:00
Frédéric Guillot 953d0a2dc0 Support localized feed errors generated by background workers 2018-02-27 21:08:32 -08:00
Frédéric Guillot 9292d5d604 Handle Atom feeds with HTML title 2018-02-17 12:21:58 -08:00
Frédéric Guillot dda9114692 Improve error handling for HTTP client 2018-02-08 18:16:54 -08:00
Frédéric Guillot 7b0bfd9308 Strip invalid XML characters to avoid parsing errors 2018-02-07 20:57:56 -08:00
Frédéric Guillot c6fd9eb9b1 Remove period for feed errors 2018-02-07 19:10:36 -08:00
Frédéric Guillot 0fb87eba3f Improve error handling when the response is empty 2018-02-07 18:47:47 -08:00
Frédéric Guillot b78172033f Show API URL endpoints in user interface 2018-01-31 21:57:20 -08:00
Frédéric Guillot ffabb009b8 Do not override existing entries when the crawler is enabled 2018-01-20 14:04:19 -08:00
Frédéric Guillot 713b38e34c Handle more encoding edge cases
- Feeds with charset specified only in Content-Type header and not in XML document
- Feeds with charset specified in both places
- Feeds with charset specified only in XML document and not in HTTP header
2018-01-20 13:25:21 -08:00
Frédéric Guillot 3b62f904d6 Do not crawl existing entry URLs 2018-01-20 13:25:20 -08:00
Frédéric Guillot 9652dfa1fe Add more comments (GoDoc) 2018-01-11 19:21:20 -08:00
Frédéric Guillot 1d7fe892e1 Add scraper rule for darkreading.com 2018-01-06 13:25:12 -08:00
Frédéric Guillot 48aa0d07ef Add more scraper rules 2018-01-04 19:32:24 -08:00
Frédéric Guillot 7d278d49f1 Add content length check when refreshing feeds 2018-01-04 18:41:23 -08:00
Frédéric Guillot efac11e082 Handle more date formats 2018-01-03 18:59:29 -08:00
Frédéric Guillot ec63cbe7bb If the website URL is empty, assign the feed URL 2018-01-03 18:23:21 -08:00
Frédéric Guillot c39f2e1a8d Rename helper packages 2018-01-02 19:15:08 -08:00
Frédéric Guillot 3c3f397bf5 Make sure the scraper parses only HTML documents 2018-01-02 18:32:01 -08:00
Frédéric Guillot c454f67037 Add scraper rules for version2.dk and ing.dk 2017-12-27 19:44:23 -08:00
Frédéric Guillot d4839b5597 Add more scraper rules 2017-12-27 13:36:07 -08:00
Frédéric Guillot f6a5d7d6ed Add support for data URL favicons 2017-12-22 19:01:39 -08:00
Frédéric Guillot e7afec7eca Handle more date formats 2017-12-22 17:59:28 -08:00
Frédéric Guillot 1d8193b892 Add logger 2017-12-15 18:55:57 -08:00
Frédéric Guillot c6d9eb3614 Improve content scraper 2017-12-13 21:30:40 -08:00
Frédéric Guillot 827683ab59 Make sure that item URLs are absolute 2017-12-13 20:16:15 -08:00
Frédéric Guillot 84d912c979 Rewrite imports 2017-12-12 21:48:13 -08:00
Frédéric Guillot ef097f02fe Add the possibility to enable crawler for feeds 2017-12-12 19:19:36 -08:00
Frédéric Guillot 33445e5b68 Add the possibility to define rewrite rules for each feed 2017-12-11 22:16:32 -08:00
Frédéric Guillot 87ccad5c7f Add scraper rules 2017-12-10 20:51:04 -08:00
Frédéric Guillot 7a35c58f53 Add readability package to fetch original content 2017-12-10 19:01:38 -08:00
Frédéric Guillot 6f5350a497 Move packages http and url 2017-12-02 20:26:21 -08:00
Frédéric Guillot 2356ddad28 Add Pinboard integration 2017-12-02 19:32:14 -08:00
Frédéric Guillot fb2a73c91e Proxify image enclosures 2017-12-01 22:29:18 -08:00
Frédéric Guillot bb8e61c7c5 Make sure golint passes on the code base 2017-11-27 21:40:05 -08:00
Frédéric Guillot bd663b43a0 Improve HTML sanitizer 2017-11-25 18:08:59 -08:00
Frédéric Guillot 71bf7e4358 Improve API 2017-11-24 22:29:20 -08:00
Frédéric Guillot 2b641cc224 Improve feed parsers 2017-11-22 14:52:31 -08:00
Frédéric Guillot 99dfbdbb47 Convert feed encoding only if the charset is specified 2017-11-21 22:55:19 -08:00
Frédéric Guillot 5f0ae8196c Add timeout for HTTP client 2017-11-20 19:44:28 -08:00
Frédéric Guillot eb9f588216 Make sure RDF entries have a date 2017-11-20 19:25:30 -08:00
Frédéric Guillot d5838b6734 Move feed parsers packages in reader package 2017-11-20 19:17:04 -08:00
Frédéric Guillot c26787f476 Improve OPML package to be more idiomatic 2017-11-20 19:11:06 -08:00
Frédéric Guillot e91a9b4f13 Export only necessary structs in JsonFeed package 2017-11-20 18:57:54 -08:00
Frédéric Guillot 6618caca81 Use more idiomatic code for Atom parser 2017-11-20 18:50:16 -08:00
Frédéric Guillot 89307010ad Add parser for RDF feeds 2017-11-20 18:34:11 -08:00
Frédéric Guillot c5cd38de83 Add unit test for HTTP client response functions 2017-11-20 17:25:45 -08:00
Frédéric Guillot aecda64030 Make sure XML feeds are always encoded in UTF-8 2017-11-20 17:12:37 -08:00
Frédéric Guillot 0e6717b7c8 Ensure that LocalizedError are returned by parsers 2017-11-20 16:11:55 -08:00
Frédéric Guillot 557cf9c21d Handle RSS entries with Atom links 2017-11-20 15:48:26 -08:00
Frédéric Guillot cf8af56a99 Handle RSS feeds without entry links 2017-11-20 15:15:10 -08:00
Frédéric Guillot a76c2a8c22 Improve OPML import/export 2017-11-20 14:35:11 -08:00
Frédéric Guillot 8ffb773f43 First commit 2017-11-19 22:01:46 -08:00