Commit Graph

108 Commits

Author SHA1 Message Date
Ztec 9f3a8e7f1b Request builder: Allow the use of insecure TLS ciphers when `Allow self-signed or invalid certificates` is used
Some server on the wild are badly configured. Either by mistake or lack
of maintenance. Safe and unsafe Ciphers change overtime based on new
discoveries.

This proposition will include considered unsafe ciphers when `Allow self-signed or invalid certificates` is used.
It could be put into a separate option but, I felt this could fit in.

fix #2671
2024-06-13 20:23:37 -07:00
Ztec e54825bf02 Improve YouTube page feed detection
In order to be more resilient to YouTube URLs variation and
to address this feature_request: https://github.com/miniflux/v2/issues/2628
I've reworked a bit the way the YouTube feed extraction is done.

I've kept all the `FindSubscriptionsFromYouTube*` in order
to keep all the existing unit tests as-is ensuring little to no
regressions. By doing so, I had to call twice `youtubeURLIDExtractor`.
Small performance penalty for peace of mind in my opinion.

`youtubeURLIDExtractor` is made in a way only one kind
of page can be detected at a time. This mean I can
solve the "video in a playlist" feature_request
by prioritizing the playlist ID over the Video ID

Also, by using `url.Parse()` to get ids, it's safer
to url mangle and variation. The most common variation
being the `t=42` parameters that start the playback
at a given position. Previously, this kind of url
would not be detected as "YouTube URL".

I deliberately ignored the url parsing error
to keep previous behavior (skip the YouTube analysis and follow with the other analysis)

I also tried to keep debug logs the same as before as much as I could.

I manually tested all the YouTube cases (video,channel,playlist)
and they all work as expected except for the video. But this one
does not work either on main. The `meta` html tag that was searched for
does not seem to exist anymore.

fix: #2628
2024-06-13 20:18:47 -07:00
x 839fc3843a Add pitchfork.com scraping rule 2024-06-10 21:08:59 -07:00
x 0bab8fac8e Update theverge.com rewrite rule: fix duplicate image
See: https://github.com/miniflux/v2/issues/1979
2024-06-10 21:08:59 -07:00
Ankit Pandey b68b05c64c reader/processor: error out for improper rewrite regexp
It's possible to specify a rewrite regex that validates but doesn't compile such
as:

    rewrite("(((unmatched-capture-group"|"rewrite)))")

In case we encounter one, exit early instead of letting the server panic.
2024-06-01 10:37:02 -07:00
Zhizhen He ae432bc9c6
reader/readingtime: fix incorrect package name 2024-05-21 18:12:24 -07:00
Jan-Lukas Else a33b1adf13 Add description field to feed settings
This adds a new "description" field to the feed settings. This allows to
save custom description regarding a feed. It is also exported and
imported as "description" in OPML.
2024-05-06 15:40:36 -07:00
fin444 a631bd527d options: add FETCH_NEBULA_WATCH_TIME 2024-05-02 16:30:01 -07:00
Frédéric Guillot fb075b60b5 reader/processor: minifier is breaking HTML entry content 2024-04-23 20:31:52 -07:00
Frédéric Guillot 2c4c845cd2 http/response: add brotli compression support 2024-04-19 12:16:49 -07:00
Frédéric Guillot 771f9d2b5f reader/fetcher: add brotli content encoding support 2024-04-19 10:50:46 -07:00
jvoisin b205b5aad0 reader/processor: minimize the feed's entries html
Compress the html of feed entries before storing it. This should reduce the
size of the database a bit, but more importantly, reduce the amount of data
sent to clients

minify being [stupidly fast](https://github.com/tdewolff/minify/?tab=readme-ov-file#performance), the performance impact should be in the noise level.
2024-04-10 19:48:48 -07:00
Frédéric Guillot 38b80d96ea storage: change GetReadTime() function to use entries_feed_id_hash_key index 2024-04-09 20:37:30 -07:00
Frédéric Guillot fdd1b3f18e database: entry URLs can exceeds btree index size limit 2024-04-04 20:22:23 -07:00
Evan Elias Young 1b8c45d162 finder: Find feed from YouTube playlist
The feed from a YouTube playlist page is derived in practically the same way as a feed from a YouTube channel page.
2024-04-01 21:16:32 -07:00
jvoisin 19ce519836 reader/rewrite: add a rule for oglaf.com
By default, Oglaf show some disclaimer/warning about its content, and this
doesn't play well with rss readers, so let's rewrite it to show the actual
comic instead of a placeholder.
2024-04-01 21:05:01 -07:00
jvoisin f109e3207c reader/rss: don't add empty tags to RSS items
This commit adds a bunch of checks to prevent reader/rss from adding empty tags
to rss items, as well as some minor refactors like nested conditions and loops
unrolling.
2024-03-24 19:46:56 -07:00
Frédéric Guillot ad1d349a0c rss: use Channel tags only if there is no Item tags 2024-03-23 13:46:48 -07:00
jvoisin fc4bdf3ab0 Inline a one-liner function
No need to expose a symbol for this.
2024-03-20 17:21:30 -07:00
Frédéric Guillot 08640b27d5 Ensure enclosure URLs are always absolute 2024-03-19 21:57:46 -07:00
jvoisin 4be993e055 Minor refactoring of internal/reader/atom/atom_10_adapter.go
- Move the population of the feed's entries into a new function, to make
  `BuildFeed` easier to understand/separate concerns/implementation details
- Use `sort+compact` instead of `compact+sort` to remove duplicates
- Change `if !a { a = } if !a {a = }` constructs into `if !a { a = ; if !a {a = }}`.
  This reduce the number of comparisons, but also improves a tad the
  control-flow readability.
2024-03-19 20:41:44 -07:00
Jean Khawand a78d1c79da
Add `FILTER_ENTRY_MAX_AGE_DAYS` config option to limit fetching all feed items 2024-03-20 02:58:53 +00:00
Frédéric Guillot fa9697b972 Remove trailing space in SiteURL and FeedURL 2024-03-18 17:51:06 -07:00
jvoisin 91f5522ce0 Minor simplification of internal/reader/media/media.go
- Simplify a switch-case by moving a common condition above it.
- Remove a superfluous error-check: `strconv.ParseInt` returns `0` when passed
  an empty string.
2024-03-18 16:09:32 -07:00
Frédéric Guillot 8212f16aa2 atom: avoid debug message when the date is empty 2024-03-17 15:29:50 -07:00
Frédéric Guillot b1e73fafdf Enable go-critic linter and fix various issues detected 2024-03-17 13:52:34 -07:00
jvoisin c29ca0e313 Minor simplifications of the rewriter
- Online some one-line functions
- Transform a free-standing function into a method
- Massively simplify `removeClickbait`
- Use a proper constant instead of a magic number in `applyFuncOnTextContent`
2024-03-17 12:15:46 -07:00
jvoisin 02a074ed26 Compile block/keep regex only once per feed
No need to compile them once for matching on the url,
once per tag, once per title, once per author, … one time is enough.
It also simplify error handling, since while regexp compilation can fail,
matching can't.
2024-03-17 12:08:03 -07:00
Frédéric Guillot 309fdbb9fc Fix force refresh 2024-03-15 19:42:09 -07:00
Frédéric Guillot 4834e934f2 Remove some duplicated code in RSS parser 2024-03-15 18:40:06 -07:00
Frédéric Guillot dd4fb660c1 Refactor Atom parser to use an adapter 2024-03-15 17:27:16 -07:00
Frédéric Guillot 5948786b15 Add support for RSS <media:category> element 2024-03-13 21:35:39 -07:00
Frédéric Guillot 648b9a8f6f Refactor RSS Parser to use an adapter 2024-03-13 21:25:09 -07:00
Frédéric Guillot 8429c6b0ab Refactor JSON Feed parser to use an adapter 2024-03-12 22:37:14 -07:00
Frédéric Guillot 6bc4b35e38 Refactor RDF parser to use an adapter
Avoid tight coupling between `model.Feed` and the original XML RDF feed.
2024-03-12 20:54:05 -07:00
jvoisin 45d486b919 When detecting the format, detect its version as well
There is no need to detect the format and then the version when both can be
done at the same time.

Add a benchmark as well, on large and small atom and rss files.
2024-03-12 18:56:56 -07:00
Frédéric Guillot 6d97f8b458 Parse podcast categories 2024-03-11 22:30:27 -07:00
Frédéric Guillot f8e50947f2 Move iTunes and GooglePlay XML definitions to their own packages 2024-03-11 22:09:31 -07:00
Frédéric Guillot 9a637ce95e Refactor RSS parser to use default namespace
This change avoid some limitations of the Go XML parser regarding XML namespaces
2024-03-11 21:07:13 -07:00
jvoisin a074773e6c Use an io.ReadSeeker instead of an io.Reader to parse feeds
This will allow to make use of func (*Reader) Seek, instead of re-recreating a
new reader. It's a large commit for a small change, but anything to simply the
reader/buffer/ReadAll/… mess is a step in the right direction I think, and it
should enable more follow-up simplifications.
2024-03-06 20:13:39 -08:00
jvoisin 3d0126be0b Speed the sanitizer up a bit, again
- allow youtube urls to start with `www`
- use `strings.Builder` instead of a `bytes.Buffer`
- use a `strings.NewReader` instead of a `bytes.NewBufferString`
- sprinkles a couple of `continue` to make the code-flow more obvious
- inline calls to `inList`, and put their parameters in the right order
- simplify isPixelTracker
- simplify `isValidIframeSource`, by extracting the hostname and comparing it
  directly, instead of using the full url and checking if it starts with
  multiple variations of the same one (`//`, `http:`, `https://` multiplied by
  ``/`www.`)
- add a benchmark
2024-03-05 19:31:50 -08:00
jvoisin 111e3f2106 Reuse a Reader instead of copying to a buffer when parsing an atom feed 2024-03-04 17:36:10 -08:00
jvoisin 3339d9d3d7 Preallocate memory when exporting to OPML
This should marginally increase performance when export a large amount of feeds
to OPML.
2024-03-03 20:34:37 -08:00
jvoisin 347740dce1 Speed up removeUnlikelyCandidates
`.Not` returns a brand new Selection, copied element by element.
2024-02-29 19:38:43 -08:00
jvoisin ab85d4d678 Improve EstimateReadingTime's speed by a factor 7
- Refactorise the tests and add some
- Use 250 signs instead of the whole text
- Only check for Korean, Chinese and Japanese script
- Add a benchmark
- Use a more idiomatic control flow

```console
$ # main branch
$ go test -bench=.
goos: linux
goarch: amd64
pkg: miniflux.app/v2/internal/reader/readingtime
BenchmarkEstimateReadingTime-12              267           4821268 ns/op
PASS
ok      miniflux.app/v2/internal/reader/readingtime     1.754s
$ # speed_up_reading_time branch
$ go test -bench=.
goos: linux
goarch: amd64
pkg: miniflux.app/v2/internal/reader/readingtime
cpu: 12th Gen Intel(R) Core(TM) i7-1265U
BenchmarkEstimateReadingTime-12             1941            653312 ns/op
PASS
ok      miniflux.app/v2/internal/reader/readingtime     1.342s
$
```
2024-02-29 19:24:15 -08:00
jvoisin 31ac62f410 Don't compute reading-time when unused
If the user doesn't display reading times, there is no need to compute them.
This should speed things up a bit, since `whatlanggo.Detect` is abysmally slow.
2024-02-29 19:14:17 -08:00
Frédéric Guillot 97765b93a9 Revert "Minor internal/reader/readability/readability.go speedup"
This reverts commit 4db138d4b8.

```
panic: runtime error: index out of range [-1]

goroutine 49 [running]:
miniflux.app/v2/internal/reader/readability.getArticle.func1(0x8?, 0xc000b56570)
        /home/fred/repos/miniflux/v2/internal/reader/readability/readability.go:120 +0x2ac
github.com/PuerkitoBio/goquery.(*Selection).Each(0xc000b56510, 0xc000892fa8)
        /home/fred/go/pkg/mod/github.com/!puerkito!bio/goquery@v1.9.0/iteration.go:10 +0x62
miniflux.app/v2/internal/reader/readability.getArticle(0xc00044f1f0, 0xc000a04a50)
        /home/fred/repos/miniflux/v2/internal/reader/readability/readability.go:101 +0x15d
miniflux.app/v2/internal/reader/readability.ExtractContent({0x1005d00?, 0xc0001522d0?})
        /home/fred/repos/miniflux/v2/internal/reader/readability/readability.go:91 +0x211
miniflux.app/v2/internal/reader/scraper.ScrapeWebsite(0xc000893688?, {0xc0007ce720, 0x54}, {0x0, 0x0})
        /home/fred/repos/miniflux/v2/internal/reader/scraper/scraper.go:63 +0x859
miniflux.app/v2/internal/reader/processor.ProcessFeedEntries(0xc000133188, 0xc000502c40, 0xc0003e6360, 0x0)
        /home/fred/repos/miniflux/v2/internal/reader/processor/processor.go:77 +0x8ea
miniflux.app/v2/internal/reader/handler.RefreshFeed(0xc000133188, 0x10cf, 0x52d5c, 0x0)
        /home/fred/repos/miniflux/v2/internal/reader/handler/handler.go:301 +0x1485
miniflux.app/v2/internal/cli.refreshFeeds.func1(0x0)
        /home/fred/repos/miniflux/v2/internal/cli/refresh_feeds.go:59 +0x2d7
created by miniflux.app/v2/internal/cli.refreshFeeds in goroutine 1
        /home/fred/repos/miniflux/v2/internal/cli/refresh_feeds.go:50 +0x5d5
```
2024-02-29 19:06:03 -08:00
Frédéric Guillot c493f8921e Add missing regex anchor detected by CodeQL 2024-02-28 20:50:17 -08:00
jvoisin 4db138d4b8 Minor internal/reader/readability/readability.go speedup
- Don't use a capturing group in `divToPElementsRegexp`
- Remove a duplicate condition
- Replace a regex with a fixed-comparison and a `Contains`
2024-02-28 20:03:14 -08:00
jvoisin f12d5131b0 Divide the sanitization time by 3
Instead of having to allocate a ~100 keys map containing possibly dynamic
values (at least to the go compiler), allocate it once in a global variable.
This significantly speeds things up, by reducing the garbage
collector/allocator involvements.

Local synthetic benchmarks have shown a improvements from 38% of wall time to only
12%.
2024-02-28 20:00:13 -08:00