153 Commits

Author SHA1 Message Date
Frédéric Guillot 56efd2eb3f Add workaround for non-GMT dates (RFC822, RFC850, and RFC1123)
RFC822, RFC850, and RFC1123 dates are supposed to always be in GMT.

This is a workaround for dates defined in the PST timezone.
2018-12-26 20:24:38 -08:00
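
The workaround described in this commit can be illustrated with a minimal sketch. It is not the code from the commit: `parseFeedDate` and the fixed `pst` zone are hypothetical names, and treating "PST" as a constant UTC-8 offset is an assumption. The sketch relies on documented `time.Parse` behavior: an unknown zone abbreviation is recorded with a zero offset, so the wall-clock fields are rebuilt in the intended zone.

```go
package main

import (
	"fmt"
	"time"
)

// pst is a fixed UTC-8 zone standing in for "PST" (assumption: no daylight saving handling).
var pst = time.FixedZone("PST", -8*60*60)

// parseFeedDate is a hypothetical helper that tries the three layouts named in the commit.
func parseFeedDate(value string) (time.Time, error) {
	for _, layout := range []string{time.RFC1123, time.RFC822, time.RFC850} {
		t, err := time.Parse(layout, value)
		if err != nil {
			continue
		}
		if name, _ := t.Zone(); name == "PST" {
			// time.Parse may have attached a zero offset to "PST";
			// rebuild the same wall-clock time in the fixed UTC-8 zone.
			t = time.Date(t.Year(), t.Month(), t.Day(), t.Hour(), t.Minute(), t.Second(), 0, pst)
		}
		return t, nil
	}
	return time.Time{}, fmt.Errorf("unsupported date format: %q", value)
}

func main() {
	t, err := parseFeedDate("Wed, 26 Dec 2018 20:24:38 PST")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(t.UTC().Format(time.RFC3339)) // prints 2018-12-27T04:24:38Z
}
```

Rebuilding the time from its wall-clock fields keeps the result independent of the machine's local timezone database.
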
Frédéric Guillot 012138179c Add function storage.UpdateFeedError() 2018-12-15 13:04:38 -08:00
Tom Matthews 8b40778ee1 Add BBC News scraping rule 2018-12-13 20:25:30 -08:00
Frederic Guillot 61bfb3cfa8 Make password prompt compatible with Windows 2018-12-09 17:44:33 -08:00
Frédéric Guillot 1bc8535dbb Move image proxy filter to template functions 2018-12-02 21:09:53 -08:00
Frédéric Guillot 6f5d93cbbe Update scraper rule for lemonde.fr 2018-12-02 20:53:22 -08:00
Frédéric Guillot 311a133ab8 Refactor manual entry scraper 2018-12-02 20:51:06 -08:00
mapl e47188eab2 Update scraper rule for heise.de 2018-12-01 11:49:30 -08:00
Frédéric Guillot 487852f07e Replace daemon and scheduler package with service package 2018-11-11 15:32:48 -08:00
Frédéric Guillot 3b6e44c331 Allow the scraper to parse XHTML documents
Only "text/html" was authorized before.
2018-11-03 13:44:13 -07:00
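
A content-type check of the kind this commit describes might look like the sketch below. The function name `isAllowedContentType` is hypothetical, and the assumption is that XHTML documents are served as `application/xhtml+xml`; the commit itself does not show the actual check.

```go
package main

import (
	"fmt"
	"mime"
)

// isAllowedContentType is a hypothetical helper: it accepts HTML and XHTML
// responses and rejects everything else, ignoring parameters such as charset.
func isAllowedContentType(header string) bool {
	mediaType, _, err := mime.ParseMediaType(header)
	if err != nil {
		return false
	}
	switch mediaType {
	case "text/html", "application/xhtml+xml":
		return true
	}
	return false
}

func main() {
	for _, h := range []string{
		"text/html; charset=utf-8",
		"application/xhtml+xml",
		"application/json",
	} {
		fmt.Printf("%-28s %v\n", h, isAllowedContentType(h))
	}
}
```
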
Frédéric Guillot ae1dc1a91e Handle more encoding conversion edge cases 2018-10-29 23:00:03 -07:00
Frédéric Guillot 7d1b471d88 Add test case to check different feed encoding and HTTP headers 2018-10-29 19:04:36 -07:00
Frédéric Guillot 85d48c8a71 Add entries storage error to feed errors count 2018-10-21 11:44:29 -07:00
Frédéric Guillot b8f874a37d Simplify feed entries filtering
- Rename processor package to filter
- Remove boilerplate code
2018-10-14 22:33:19 -07:00
Frédéric Guillot 778346b0b0 Simplify feed fetcher
- Add browser package to handle HTTP errors
- Reduce code duplication
2018-10-14 21:43:48 -07:00
Frédéric Guillot 5870f04260 Simplify feed parser and format detection
- Avoid doing multiple buffer copies
- Move parser and format detection logic to its own package
2018-10-14 11:46:41 -07:00
Frédéric Guillot 9606126196 Convert text links and line feeds to HTML in YouTube channels 2018-10-08 20:47:10 -07:00
Frédéric Guillot 9dc38a0803 Add missing package descriptions for GoDoc 2018-10-08 17:32:17 -07:00
Frédéric Guillot 11dfcdd3d6 Fix typo in license header 2018-10-08 15:50:15 -07:00
Frédéric Guillot b1e8f534ef Simplify locale package usage (refactoring) 2018-09-22 15:04:55 -07:00
Frédéric Guillot beb7a0cfcb Use unique translation IDs instead of English text as key 2018-09-21 22:23:23 -07:00
Patrick 2538eea177 Add the possibility to override default user agent for each feed 2018-09-19 18:19:24 -07:00
Frédéric Guillot df2bebaf3d Update scraper rule for heise.de 2018-08-25 10:33:18 -07:00
Frédéric Guillot dbcc5d8a97 Use canonical imports 2018-08-24 21:56:39 -07:00
neepl 5365f31e90 Add support for published tag in Atom feeds 2018-07-17 21:52:05 -07:00
Frédéric Guillot a786e78aca Add embedly.com to iframe whitelist 2018-07-10 20:56:54 -07:00
dzaikos 6d25e02cb5 New `add_dynamic_image` rewriter for JavaScript-loaded images.
Searches tags for various `data-*` attributes and sets `img` tag `src` attribute appropriately. Falls back to searching `noscript` for `img` tags.

Includes unit tests.
2018-07-09 01:22:48 -04:00
dzaikos e1c56b2e53 Processor: Do rewriter before sanitizer for `entry.Content`.
Addresses #163.
2018-07-06 00:17:07 -04:00
Frédéric Guillot de1a4aad30 Add support for protocol relative YouTube URLs 2018-07-04 22:45:44 -07:00
dzaikos 7d4a195519 Sandbox iframes when sanitizing.
Updated iframe unit tests.

Refactored sanitizer.getExtraAttributes() to use `switch` instead of multiple `if` statements.
2018-07-03 12:55:18 -07:00
Frédéric Guillot 9c0f882ba0 Add specific 404 and 401 error messages 2018-06-30 12:42:12 -07:00
dzaikos 45d7105ed1 Refactor AddImageTitle rewriter.
* Only processes images with `src` **and** `title` attributes (others are ignored).
* Processes **all** images in the document (not just the first one).
* Wraps the image and its title attribute in a `figure` tag with the title attribute's contents in a `figcaption` tag.

Updated xkcd rewriter unit test.

Added another xkcd rewriter unit test to check rendering of images without title attributes.
2018-06-26 17:50:18 -04:00
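
The behavior listed in this commit can be approximated with the sketch below. It is a regex-based illustration only, not the refactored rewriter: the name `addImageTitle` is hypothetical, attributes are assumed to be double-quoted, and a real implementation would more likely use an HTML parser.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	imgTag    = regexp.MustCompile(`<img\b[^>]*>`)
	srcAttr   = regexp.MustCompile(`\ssrc="[^"]*"`)
	titleAttr = regexp.MustCompile(`\stitle="([^"]*)"`)
)

// addImageTitle is a hypothetical helper that wraps every image carrying both
// a src and a title attribute in a figure, with the title as its figcaption.
func addImageTitle(doc string) string {
	return imgTag.ReplaceAllStringFunc(doc, func(tag string) string {
		title := titleAttr.FindStringSubmatch(tag)
		if title == nil || !srcAttr.MatchString(tag) {
			return tag // images without src or title are left untouched
		}
		return "<figure>" + tag + "<figcaption>" + title[1] + "</figcaption></figure>"
	})
}

func main() {
	fmt.Println(addImageTitle(`<img src="a.png" title="Hello"><img src="b.png">`))
}
```
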
dzaikos c9131b0e89 Improve sanitizer to remove style tag contents.
See #157.

Refactored how blacklisted tags are handled so they're easier to manage in the future.
2018-06-24 19:53:23 -07:00
Dave Z d847b10e32 Improve sanitizer to remove script and noscript contents
These tags were removed but the content was rendered as escaped HTML.

See #157
2018-06-23 17:50:43 -07:00
Frédéric Guillot bddca15b69 Add new fields for feed username/password 2018-06-19 22:58:29 -07:00
Frédéric Guillot c719cf7df0 Rewrite iframe YouTube URLs to https://www.youtube-nocookie.com 2018-06-12 18:45:09 -07:00
Frédéric Guillot 0c2e5ff0dc Handle feeds with dates formatted as Unix timestamp 2018-05-08 20:41:24 -07:00
Frédéric Guillot 5cacae6cf2 Add API endpoint to import OPML file 2018-04-29 18:56:40 -07:00
Frédéric Guillot 1eba1730d1 Move HTTP client to its own package 2018-04-28 10:51:07 -07:00
aniran 322b265d7a Scrape parent element for iframe
Current behavior: if you have an `iframe` scraper rule, `scrapContent`
tries to return the inner HTML of the `iframe`, which turns up blank.

New behavior: like `img` elements, if an `iframe` is matched by a scraper rule,
the parent element's inner HTML (i.e. the `iframe`) is returned.
2018-04-27 17:57:22 -07:00
aniran 920dda79b7 Add soundcloud and bandcamp iframe sources 2018-04-27 17:55:58 -07:00
Frédéric Guillot dcbb5047b1 Add support for Dublin Core date in RDF feeds 2018-04-10 18:13:05 -07:00
Frédéric Guillot 02ba735ba9 Handle some non-English date formats 2018-04-09 21:27:15 -07:00
Frédéric Guillot e2d02bac5a Rename RSS parser getters 2018-04-09 20:38:12 -07:00
Frédéric Guillot f76093690c Get the right comments URL when having multiple namespaces 2018-04-09 20:30:55 -07:00
Frédéric Guillot 702256bcc0 Add unit test for comments url and French translation 2018-04-07 13:56:11 -07:00
Ben Brooks 538d08c16c Add CommentsURL to entry 2018-04-07 13:50:45 -07:00
Frédéric Guillot 6ea4da3bce Handle RSS author elements with inner HTML 2018-03-18 11:57:46 -07:00
Frédéric Guillot 482785c5e6 Convert enclosure size field to bigint 2018-03-14 20:09:06 -07:00
Frédéric Guillot ec08f45bf5 Fix broken OPML import with Go 1.10 2018-03-14 18:50:06 -07:00
Frédéric Guillot f110384f11 Improve parser error messages 2018-02-27 21:19:59 -08:00
Frédéric Guillot 953d0a2dc0 Support localized feed errors generated by background workers 2018-02-27 21:08:32 -08:00
Frédéric Guillot 9292d5d604 Handle Atom feeds with HTML title 2018-02-17 12:21:58 -08:00
Frédéric Guillot dda9114692 Improve error handling for HTTP client 2018-02-08 18:16:54 -08:00
Frédéric Guillot 7b0bfd9308 Strip invalid XML characters to avoid parsing errors 2018-02-07 20:57:56 -08:00
Frédéric Guillot c6fd9eb9b1 Remove period for feed errors 2018-02-07 19:10:36 -08:00
Frédéric Guillot 0fb87eba3f Improve error handling when the response is empty 2018-02-07 18:47:47 -08:00
Frédéric Guillot b78172033f Show API URL endpoints in user interface 2018-01-31 21:57:20 -08:00
Frédéric Guillot ffabb009b8 Do not override existing entries when the crawler is enabled 2018-01-20 14:04:19 -08:00
Frédéric Guillot 713b38e34c Handle more encoding edge cases
- Feeds with charset specified only in Content-Type header and not in XML document
- Feeds with charset specified in both places
- Feeds with charset specified only in XML document and not in HTTP header
2018-01-20 13:25:21 -08:00
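
The three cases listed in this commit can be sketched as a charset lookup. The helper name `detectCharset` is hypothetical, and two choices are assumptions rather than facts from the commit: the HTTP header wins when both places specify a charset, and UTF-8 is the fallback when neither does.

```go
package main

import (
	"bytes"
	"fmt"
	"mime"
	"regexp"
	"strings"
)

// xmlEncoding captures the encoding pseudo-attribute of a leading XML declaration.
var xmlEncoding = regexp.MustCompile(`^<\?xml[^>]*\sencoding=["']([^"']+)["']`)

// detectCharset is a hypothetical helper returning a lowercase charset name.
func detectCharset(contentType string, body []byte) string {
	// Cases 1 and 2: charset present in the Content-Type header (assumed to take precedence).
	if _, params, err := mime.ParseMediaType(contentType); err == nil {
		if cs := params["charset"]; cs != "" {
			return strings.ToLower(cs)
		}
	}
	// Case 3: charset only in the XML declaration.
	if m := xmlEncoding.FindSubmatch(bytes.TrimSpace(body)); m != nil {
		return strings.ToLower(string(m[1]))
	}
	return "utf-8" // assumed default when nothing is specified
}

func main() {
	body := []byte(`<?xml version="1.0" encoding="ISO-8859-1"?><rss/>`)
	fmt.Println(detectCharset("text/xml", body))
	fmt.Println(detectCharset("text/xml; charset=utf-8", body))
}
```
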
Frédéric Guillot 3b62f904d6 Do not crawl existing entry URLs 2018-01-20 13:25:20 -08:00
Frédéric Guillot 9652dfa1fe Add more comments (GoDoc) 2018-01-11 19:21:20 -08:00
Frédéric Guillot 1d7fe892e1 Add scraper rule for darkreading.com 2018-01-06 13:25:12 -08:00
Frédéric Guillot 48aa0d07ef Add more scraper rules 2018-01-04 19:32:24 -08:00
Frédéric Guillot 7d278d49f1 Add content length check when refreshing feeds 2018-01-04 18:41:23 -08:00
Frédéric Guillot efac11e082 Handle more date formats 2018-01-03 18:59:29 -08:00
Frédéric Guillot ec63cbe7bb If the website URL is empty, assign the feed URL 2018-01-03 18:23:21 -08:00
Frédéric Guillot c39f2e1a8d Rename helper packages 2018-01-02 19:15:08 -08:00
Frédéric Guillot 3c3f397bf5 Make sure the scraper parses only HTML documents 2018-01-02 18:32:01 -08:00
Frédéric Guillot c454f67037 Add scraper rules for version2.dk and ing.dk 2017-12-27 19:44:23 -08:00
Frédéric Guillot d4839b5597 Add more scraper rules 2017-12-27 13:36:07 -08:00
Frédéric Guillot f6a5d7d6ed Add support for data URL favicons 2017-12-22 19:01:39 -08:00
Frédéric Guillot e7afec7eca Handle more date formats 2017-12-22 17:59:28 -08:00
Frédéric Guillot 1d8193b892 Add logger 2017-12-15 18:55:57 -08:00
Frédéric Guillot c6d9eb3614 Improve content scraper 2017-12-13 21:30:40 -08:00
Frédéric Guillot 827683ab59 Make sure that item URLs are absolute 2017-12-13 20:16:15 -08:00
Frédéric Guillot 84d912c979 Rewrite imports 2017-12-12 21:48:13 -08:00
Frédéric Guillot ef097f02fe Add the possibility to enable crawler for feeds 2017-12-12 19:19:36 -08:00
Frédéric Guillot 33445e5b68 Add the possibility to define rewrite rules for each feed 2017-12-11 22:16:32 -08:00
Frédéric Guillot 87ccad5c7f Add scraper rules 2017-12-10 20:51:04 -08:00
Frédéric Guillot 7a35c58f53 Add readability package to fetch original content 2017-12-10 19:01:38 -08:00
Frédéric Guillot 6f5350a497 Move packages http and url 2017-12-02 20:26:21 -08:00
Frédéric Guillot 2356ddad28 Add Pinboard integration 2017-12-02 19:32:14 -08:00
Frédéric Guillot fb2a73c91e Proxify image enclosures 2017-12-01 22:29:18 -08:00
Frédéric Guillot bb8e61c7c5 Make sure golint passes on the code base 2017-11-27 21:40:05 -08:00
Frédéric Guillot bd663b43a0 Improve HTML sanitizer 2017-11-25 18:08:59 -08:00
Frédéric Guillot 71bf7e4358 Improve API 2017-11-24 22:29:20 -08:00
Frédéric Guillot 2b641cc224 Improve feed parsers 2017-11-22 14:52:31 -08:00
Frédéric Guillot 99dfbdbb47 Convert feed encoding only if the charset is specified 2017-11-21 22:55:19 -08:00
Frédéric Guillot 5f0ae8196c Add timeout for HTTP client 2017-11-20 19:44:28 -08:00
Frédéric Guillot eb9f588216 Make sure RDF entries have a date 2017-11-20 19:25:30 -08:00
Frédéric Guillot d5838b6734 Move feed parser packages into reader package 2017-11-20 19:17:04 -08:00
Frédéric Guillot c26787f476 Improve OPML package to be more idiomatic 2017-11-20 19:11:06 -08:00
Frédéric Guillot e91a9b4f13 Export only necessary structs in JsonFeed package 2017-11-20 18:57:54 -08:00
Frédéric Guillot 6618caca81 Use more idiomatic code for Atom parser 2017-11-20 18:50:16 -08:00
Frédéric Guillot 89307010ad Add parser for RDF feeds 2017-11-20 18:34:11 -08:00
Frédéric Guillot c5cd38de83 Add unit test for HTTP client response functions 2017-11-20 17:25:45 -08:00
Frédéric Guillot aecda64030 Make sure XML feeds are always encoded in UTF-8 2017-11-20 17:12:37 -08:00
Frédéric Guillot 0e6717b7c8 Ensure that LocalizedError are returned by parsers 2017-11-20 16:11:55 -08:00
Frédéric Guillot 557cf9c21d Handle RSS entries with Atom links 2017-11-20 15:48:26 -08:00