29 Commits

Author SHA1 Message Date
Corey McCaffrey 0683074b8b Added scraper rule for TheOatmeal.com
The default rule does not show the comic posted to the feed. The comic image is in a div with id "comic".
2020-05-13 21:28:00 -07:00
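
A site-specific rule like this one can be pictured as a lookup from a domain to a CSS selector. The following is a minimal sketch, not the project's actual code: the `rules` table and the `selectorFor` helper are hypothetical names, and only the `comic` id comes from the commit message above.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// rules maps a domain to the CSS selector that holds the main content.
// Hypothetical table for illustration; the "comic" id is taken from the
// commit message, everything else is an assumption.
var rules = map[string]string{
	"theoatmeal.com": "div#comic",
}

// selectorFor returns the selector registered for the URL's host,
// or an empty string when no site-specific rule exists.
func selectorFor(rawURL string) string {
	u, err := url.Parse(rawURL)
	if err != nil {
		return ""
	}
	// Normalize the host so "www.theoatmeal.com" and "theoatmeal.com" match.
	host := strings.TrimPrefix(strings.ToLower(u.Hostname()), "www.")
	return rules[host]
}

func main() {
	fmt.Println(selectorFor("https://www.theoatmeal.com/comics/example"))
}
```

When the lookup returns an empty string, a caller would presumably fall back to the default extraction rule.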
Corey McCaffrey 8f6c07afd6 Added scraper rule for RayWenderlich.com
RayWenderlich.com is a popular community for iOS and Android developers. The default rule results in "GROUP GROUP GROUP GROUP…" instead of the content posted on the blog.
2020-05-13 21:28:00 -07:00
Andrew Williams 9974e0f458 Addition of scraper rule for wdwnt.com
By default, fetching original content for wdwnt.com results in a snippet of the comments section; this rule captures the article content.
2020-02-28 20:24:58 -08:00
cinput 8e1ed8bef3 Return outer HTML when scraping elements 2019-12-21 21:18:31 -08:00
somini 30f22fbd78 Update scraper rule for "Le Monde" 2019-12-19 18:35:29 -08:00
Neo Ng 90064a8cf0 Update scraper rule for openingsource.org 2019-11-28 19:40:26 -08:00
Tom Matthews 8b40778ee1 Add BBC News scraping rule 2018-12-13 20:25:30 -08:00
Frédéric Guillot 6f5d93cbbe Update scraper rule for lemonde.fr 2018-12-02 20:53:22 -08:00
Frédéric Guillot 311a133ab8 Refactor manual entry scraper 2018-12-02 20:51:06 -08:00
mapl e47188eab2 Update scraper rule for heise.de 2018-12-01 11:49:30 -08:00
Frédéric Guillot 3b6e44c331 Allow the scraper to parse XHTML documents
Only "text/html" was authorized before.
2018-11-03 13:44:13 -07:00
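
The content-type check described in this commit could look roughly like the sketch below. It assumes XHTML is recognized by the standard `application/xhtml+xml` media type; the function name `isScrapable` is hypothetical and not taken from the repository.

```go
package main

import (
	"fmt"
	"mime"
)

// isScrapable reports whether a Content-Type header value describes a
// document the scraper may parse. Assumption: XHTML is identified by
// the "application/xhtml+xml" media type.
func isScrapable(contentType string) bool {
	// ParseMediaType strips parameters such as "charset=utf-8"
	// and lowercases the media type.
	mediaType, _, err := mime.ParseMediaType(contentType)
	if err != nil {
		return false
	}
	return mediaType == "text/html" || mediaType == "application/xhtml+xml"
}

func main() {
	fmt.Println(isScrapable("text/html; charset=utf-8"))
	fmt.Println(isScrapable("application/xhtml+xml"))
	fmt.Println(isScrapable("application/pdf"))
}
```

Parsing the header instead of comparing raw strings keeps the check tolerant of parameters and letter case.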
Frédéric Guillot 5870f04260 Simplify feed parser and format detection
- Avoid doing multiple buffer copies
- Move parser and format detection logic to its own package
2018-10-14 11:46:41 -07:00
Frédéric Guillot 9dc38a0803 Add missing package descriptions for GoDoc 2018-10-08 17:32:17 -07:00
Patrick 2538eea177 Add the possibility to override default user agent for each feed 2018-09-19 18:19:24 -07:00
Frédéric Guillot df2bebaf3d Update scraper rule for heise.de 2018-08-25 10:33:18 -07:00
Frédéric Guillot dbcc5d8a97 Use canonical imports 2018-08-24 21:56:39 -07:00
Frédéric Guillot 1eba1730d1 Move HTTP client to its own package 2018-04-28 10:51:07 -07:00
aniran 322b265d7a Scrape parent element for iframe
Current behavior: if you have an `iframe` scraper rule, `scrapContent`
tries to return the inner HTML of the `iframe`, which turns up blank.

New behavior: like `img` elements, if an `iframe` is matched by a scraper rule,
the parent element's inner HTML (i.e. the `iframe`) is returned.
2018-04-27 17:57:22 -07:00
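
The behavior described in this commit can be illustrated with a toy element tree. This is only a sketch of the selection logic: the `node` type and the `contentNode` helper are hypothetical stand-ins, not the HTML parser types the project uses.

```go
package main

import "fmt"

// node is a toy stand-in for a parsed HTML element (hypothetical type).
type node struct {
	tag    string
	parent *node
}

// contentNode picks the element whose inner HTML should be returned.
// Elements such as img and iframe carry no useful inner HTML, so the
// parent is used instead; its inner HTML contains the matched element.
func contentNode(matched *node) *node {
	switch matched.tag {
	case "img", "iframe":
		if matched.parent != nil {
			return matched.parent
		}
	}
	return matched
}

func main() {
	wrapper := &node{tag: "div"}
	frame := &node{tag: "iframe", parent: wrapper}
	article := &node{tag: "article", parent: wrapper}

	fmt.Println(contentNode(frame).tag)
	fmt.Println(contentNode(article).tag)
}
```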
Frédéric Guillot 1d7fe892e1 Add scraper rule for darkreading.com 2018-01-06 13:25:12 -08:00
Frédéric Guillot 48aa0d07ef Add more scraper rules 2018-01-04 19:32:24 -08:00
Frédéric Guillot 3c3f397bf5 Make sure the scraper parses only HTML documents 2018-01-02 18:32:01 -08:00
Frédéric Guillot c454f67037 Add scraper rules for version2.dk and ing.dk 2017-12-27 19:44:23 -08:00
Frédéric Guillot d4839b5597 Add more scraper rules 2017-12-27 13:36:07 -08:00
Frédéric Guillot 1d8193b892 Add logger 2017-12-15 18:55:57 -08:00
Frédéric Guillot c6d9eb3614 Improve content scraper 2017-12-13 21:30:40 -08:00
Frédéric Guillot 84d912c979 Rewrite imports 2017-12-12 21:48:13 -08:00
Frédéric Guillot ef097f02fe Add the possibility to enable crawler for feeds 2017-12-12 19:19:36 -08:00
Frédéric Guillot 87ccad5c7f Add scraper rules 2017-12-10 20:51:04 -08:00
Frédéric Guillot 7a35c58f53 Add readability package to fetch original content 2017-12-10 19:01:38 -08:00