79 Commits

Author SHA1 Message Date
neepl 5365f31e90 Add support for published tag in Atom feeds 2018-07-17 21:52:05 -07:00
Frédéric Guillot a786e78aca Add embedly.com to iframe whitelist 2018-07-10 20:56:54 -07:00
dzaikos 6d25e02cb5 New `add_dynamic_image` rewriter for JavaScript-loaded images.
Searches tags for various `data-*` attributes and sets `img` tag `src` attribute appropriately. Falls back to searching `noscript` for `img` tags.

Includes unit tests.
2018-07-09 01:22:48 -04:00
dzaikos e1c56b2e53 Processor: Do rewriter before sanitizer for `entry.Content`.
Addresses #163.
2018-07-06 00:17:07 -04:00
Frédéric Guillot de1a4aad30 Add support for protocol relative YouTube URLs 2018-07-04 22:45:44 -07:00
dzaikos 7d4a195519 Sandbox iframes when sanitizing.
Updated iframe unit tests.

Refactored sanitizer.getExtraAttributes() to use `switch` instead of multiple `if` statements.
2018-07-03 12:55:18 -07:00
Frédéric Guillot 9c0f882ba0 Add specific 404 and 401 error messages 2018-06-30 12:42:12 -07:00
dzaikos 45d7105ed1 Refactor AddImageTitle rewriter.
* Only processes images with `src` **and** `title` attributes (others are ignored).
* Processes **all** images in the document (not just the first one).
* Wraps the image and its title attribute in a `figure` tag with the title attribute's contents in a `figcaption` tag.

Updated xkcd rewriter unit test.

Added another xkcd rewriter unit test to check rendering of images without title attributes.
2018-06-26 17:50:18 -04:00
dzaikos c9131b0e89 Improve sanitizer to remove style tag contents.
See #157.

Refactored how blacklisted tags are handled so they're easier to manage in the future.
2018-06-24 19:53:23 -07:00
Dave Z d847b10e32 Improve sanitizer to remove script and noscript contents
These tags were removed but the content was rendered as escaped HTML.

See #157
2018-06-23 17:50:43 -07:00
Frédéric Guillot bddca15b69 Add new fields for feed username/password 2018-06-19 22:58:29 -07:00
Frédéric Guillot c719cf7df0 Rewrite iframe YouTube URLs to https://www.youtube-nocookie.com 2018-06-12 18:45:09 -07:00
Frédéric Guillot 0c2e5ff0dc Handle feeds with dates formatted as Unix timestamp 2018-05-08 20:41:24 -07:00
Frédéric Guillot 5cacae6cf2 Add API endpoint to import OPML file 2018-04-29 18:56:40 -07:00
Frédéric Guillot 1eba1730d1 Move HTTP client to its own package 2018-04-28 10:51:07 -07:00
aniran 322b265d7a Scrape parent element for iframe
Current behavior: if you have an `iframe` scraper rule, `scrapContent`
tries to return the inner HTML of the `iframe`, which turns up blank.

New behavior: like `img` elements, if an `iframe` is matched by a scraper rule,
the parent element's inner HTML (i.e. the `iframe`) is returned.
2018-04-27 17:57:22 -07:00
aniran 920dda79b7 Add soundcloud and bandcamp iframe sources 2018-04-27 17:55:58 -07:00
Frédéric Guillot dcbb5047b1 Add support for Dublin Core date in RDF feeds 2018-04-10 18:13:05 -07:00
Frédéric Guillot 02ba735ba9 Handle some non-English date formats 2018-04-09 21:27:15 -07:00
Frédéric Guillot e2d02bac5a Rename RSS parser getters 2018-04-09 20:38:12 -07:00
Frédéric Guillot f76093690c Get the right comments URL when having multiple namespaces 2018-04-09 20:30:55 -07:00
Frédéric Guillot 702256bcc0 Add unit test for comments URL and French translation 2018-04-07 13:56:11 -07:00
Ben Brooks 538d08c16c Add CommentsURL to entry 2018-04-07 13:50:45 -07:00
Frédéric Guillot 6ea4da3bce Handle RSS author elements with inner HTML 2018-03-18 11:57:46 -07:00
Frédéric Guillot 482785c5e6 Convert enclosure size field to bigint 2018-03-14 20:09:06 -07:00
Frédéric Guillot ec08f45bf5 Fix broken OPML import with Go 1.10 2018-03-14 18:50:06 -07:00
Frédéric Guillot f110384f11 Improve parser error messages 2018-02-27 21:19:59 -08:00
Frédéric Guillot 953d0a2dc0 Support localized feed errors generated by background workers 2018-02-27 21:08:32 -08:00
Frédéric Guillot 9292d5d604 Handle Atom feeds with HTML title 2018-02-17 12:21:58 -08:00
Frédéric Guillot dda9114692 Improve error handling for HTTP client 2018-02-08 18:16:54 -08:00
Frédéric Guillot 7b0bfd9308 Strip invalid XML characters to avoid parsing errors 2018-02-07 20:57:56 -08:00
Frédéric Guillot c6fd9eb9b1 Remove period for feed errors 2018-02-07 19:10:36 -08:00
Frédéric Guillot 0fb87eba3f Improve error handling when the response is empty 2018-02-07 18:47:47 -08:00
Frédéric Guillot b78172033f Show API URL endpoints in user interface 2018-01-31 21:57:20 -08:00
Frédéric Guillot ffabb009b8 Do not override existing entries when the crawler is enabled 2018-01-20 14:04:19 -08:00
Frédéric Guillot 713b38e34c Handle more encoding edge cases
- Feeds with charset specified only in Content-Type header and not in XML document
- Feeds with charset specified in both places
- Feeds with charset specified only in XML document and not in HTTP header
2018-01-20 13:25:21 -08:00
Frédéric Guillot 3b62f904d6 Do not crawl existing entry URLs 2018-01-20 13:25:20 -08:00
Frédéric Guillot 9652dfa1fe Add more comments (GoDoc) 2018-01-11 19:21:20 -08:00
Frédéric Guillot 1d7fe892e1 Add scraper rule for darkreading.com 2018-01-06 13:25:12 -08:00
Frédéric Guillot 48aa0d07ef Add more scraper rules 2018-01-04 19:32:24 -08:00
Frédéric Guillot 7d278d49f1 Add content length check when refreshing feeds 2018-01-04 18:41:23 -08:00
Frédéric Guillot efac11e082 Handle more date formats 2018-01-03 18:59:29 -08:00
Frédéric Guillot ec63cbe7bb If the website URL is empty, assign the feed URL 2018-01-03 18:23:21 -08:00
Frédéric Guillot c39f2e1a8d Rename helper packages 2018-01-02 19:15:08 -08:00
Frédéric Guillot 3c3f397bf5 Make sure the scraper parses only HTML documents 2018-01-02 18:32:01 -08:00
Frédéric Guillot c454f67037 Add scraper rules for version2.dk and ing.dk 2017-12-27 19:44:23 -08:00
Frédéric Guillot d4839b5597 Add more scraper rules 2017-12-27 13:36:07 -08:00
Frédéric Guillot f6a5d7d6ed Add support for data URL favicons 2017-12-22 19:01:39 -08:00
Frédéric Guillot e7afec7eca Handle more date formats 2017-12-22 17:59:28 -08:00
Frédéric Guillot 1d8193b892 Add logger 2017-12-15 18:55:57 -08:00
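
Note on commit 7d4a195519: the message says iframes are sandboxed during sanitizing and that `sanitizer.getExtraAttributes()` now uses a `switch` instead of multiple `if` statements. The sketch below shows roughly what such a `switch`-based lookup might look like. It is not the project's code: the function name `extraAttributes`, its signature, the attributes chosen for `a` tags, and the `sandbox` value are all assumptions made for illustration.

```go
package main

import "fmt"

// extraAttributes is a hypothetical stand-in for a sanitizer helper.
// For a given tag name it returns the attribute names the sanitizer would
// force onto the tag and the corresponding rendered attributes.
func extraAttributes(tagName string) (names []string, rendered []string) {
	switch tagName {
	case "a":
		// Assumed behavior: links open in a new tab without an opener reference.
		return []string{"rel", "target"},
			[]string{`rel="noopener noreferrer"`, `target="_blank"`}
	case "iframe":
		// Assumed sandbox value; the commit message does not state which
		// permissions are granted.
		return []string{"sandbox"},
			[]string{`sandbox="allow-scripts allow-same-origin"`}
	default:
		// No forced attributes for any other tag.
		return nil, nil
	}
}

func main() {
	for _, tag := range []string{"a", "iframe", "p"} {
		names, rendered := extraAttributes(tag)
		fmt.Println(tag, names, rendered)
	}
}
```

With a `switch`, each tag's forced attributes sit in a single case, so supporting another tag means adding one case rather than threading another `if` through the function.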