# HTTPDirFS - now with a permanent cache
Have you ever wanted to mount those HTTP directory listings as if they were a
partition? Look no further, this is your solution. HTTPDirFS stands for Hyper
Text Transfer Protocol Directory Filesystem.
The performance of the program is excellent, due to the use of the curl multi
interface. HTTP connections are reused, and HTTP pipelining is used when
available. The FUSE component itself also runs in multithreaded mode.
The permanent cache system caches all the files you have downloaded, so you
don't need to download those files again if you access them later. This
feature is enabled by the ``--cache`` flag, and it makes this filesystem much
faster than ``rclone mount``.

## Usage

    ./httpdirfs -c $CACHE_FOLDER -f $URL $YOUR_MOUNT_POINT

An example URL would be
[Debian CD Image Server](https://cdimage.debian.org/debian-cd/). The ``-f`` flag
keeps the program in the foreground, which is useful for monitoring which URL
the filesystem is visiting.
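
For example, to mount the Debian CD image server in the foreground with the
permanent cache enabled (``~/mnt`` here is just a placeholder for an existing
empty directory), you might run:

    ./httpdirfs --cache -f https://cdimage.debian.org/debian-cd/ ~/mnt
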
Useful options:

    -f                      foreground operation
    -s                      disable multi-threaded operation

    -u  --username          HTTP authentication username
    -p  --password          HTTP authentication password
    -P  --proxy             Proxy for libcurl, for more details refer to
                            https://curl.haxx.se/libcurl/c/CURLOPT_PROXY.html
        --proxy-username    Username for the proxy
        --proxy-password    Password for the proxy
        --cache             Enable cache, by default this is disabled
        --cache-location    Set a custom cache location, by default it is
                            located in ${XDG_CACHE_HOME}/httpdirfs
        --dl-seg-size       The size of each download segment in MB,
                            defaults to 8 MB.
        --max-seg-count     The maximum number of download segments a file
                            can have. By default it is set to 128*1024, which
                            means the maximum memory usage per file is 128 KB.
                            This allows caching files up to 1 TB in size,
                            assuming you are using the default segment size.
        --max-conns         The maximum number of network connections that
                            libcurl is allowed to make, defaults to 10.
        --retry-wait        The waiting interval in seconds before retrying an
                            HTTP request after encountering an error,
                            defaults to 5 seconds.
        --user-agent        The user agent string, defaults to "HTTPDirFS".
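
Combining some of these, mounting through a local SOCKS proxy with a custom
cache location might look like the following (the proxy address and paths are
placeholders, not defaults):

    ./httpdirfs -f --cache --cache-location /tmp/httpdirfs \
        -P socks5://127.0.0.1:1080 $URL $YOUR_MOUNT_POINT
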
## Permanent cache system
You can now cache all the files you have looked at permanently on your hard
drive by using the ``--cache`` flag. The files it caches persist across
sessions.
By default, the cache files are stored under ``${XDG_CACHE_HOME}/httpdirfs``,
which by default is ``${HOME}/.cache/httpdirfs``. Each HTTP directory gets its
own cache folder, named using the escaped URL of the HTTP directory. Once a
segment of a file has been downloaded, it won't be downloaded again.
Please note that, due to the way the permanent cache system is implemented, the
maximum download speed is around 15 MiB/s, as measured using my localhost as the
web server. However, after you have accessed a file once, accessing it again is
as fast as accessing your hard drive.
If you have any patches to make the initial download go faster, feel free to
submit a pull request.
The permanent cache system also relies on sparse allocation. Please make sure
your filesystem supports it; otherwise your hard drive / SSD might grind to a
halt. For a list of filesystems that support sparse allocation, please refer to
[Wikipedia](https://en.wikipedia.org/wiki/Comparison_of_file_systems#Allocation_and_layout_policies).
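
If you are not sure whether your filesystem allocates sparsely, one quick check
(using standard GNU coreutils, on any already-cached file) is to compare a cache
file's apparent size with the disk space it actually occupies:

    du -h --apparent-size $CACHED_FILE   # logical file size
    du -h $CACHED_FILE                   # blocks actually allocated

On a filesystem with sparse allocation, a partially downloaded cache file
occupies much less space than its apparent size.
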
## Configuration file support
There is now rudimentary config file support. The configuration file that the
program will read is ``${XDG_CONFIG_HOME}/httpdirfs/config``.
If ``${XDG_CONFIG_HOME}`` is not set, it will default to ``${HOME}/.config``. So
by default you need to put the configuration file at
``${HOME}/.config/httpdirfs/config``. You will have to create the sub-directory
and the configuration file yourself. In the configuration file, please supply
one option per line. For example:

    $ cat ${HOME}/.config/httpdirfs/config
    --username test
    --password test
    -f

## Compilation
This program was developed under Debian Stretch. If you are using the same
operating system as me, you need ``libgumbo-dev``, ``libfuse-dev``,
``libssl1.0-dev`` and ``libcurl4-openssl-dev``.
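
On a Debian-based system, installing the dependencies and building might look
like this (assuming the package names above and that you build with the
Makefile shipped in the repository):

    sudo apt install libgumbo-dev libfuse-dev libssl1.0-dev libcurl4-openssl-dev
    make
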
If you run Debian Stretch, and you have OpenSSL 1.0.2 installed, and you get
warnings like the ones below during compilation,

    network.c:70:22: warning: thread_id defined but not used [-Wunused-function]
     static unsigned long thread_id(void)
                          ^~~~~~~~~
    network.c:57:13: warning: lock_callback defined but not used [-Wunused-function]
     static void lock_callback(int mode, int type, char *file, int line)
                 ^~~~~~~~~~~~~
    /usr/bin/ld: warning: libcrypto.so.1.0.2, needed by /usr/lib/gcc/x86_64-linux-gnu/6/../../../x86_64-linux-gnu/libcurl.so, may conflict with libcrypto.so.1.1

then you need to check whether ``libssl1.0-dev`` has been installed properly. If
you get these compilation warnings, this program will occasionally crash when
you connect to an HTTPS website. This is because OpenSSL 1.0.2 needs those
functions for thread safety, whereas OpenSSL 1.1 does not. If you have
``libssl-dev`` rather than ``libssl1.0-dev`` installed, the callback functions
will not be linked properly.
If you have OpenSSL 1.1 and the associated development headers installed, then
you can safely ignore these warning messages. If you are on Debian Buster, you
will definitely get them, and you can safely ignore them.
## SSL Support
If you run the program in the foreground, when it starts up, it will output the
SSL engine version string. Please verify that your libcurl is linked against
OpenSSL, as the pthread mutex functions are designed for OpenSSL.
The SSL engine version string looks something like this:

    libcurl SSL engine: OpenSSL/1.0.2l

## The Technical Details
I noticed that most HTTP directory listings don't provide the file size for the
web page itself. I suppose this makes perfect sense, as the pages are generated
on the fly, whereas the actual files do have file sizes. So the listing pages
can be treated as folders, and the rest as files.
This program downloads the HTML web pages/files using
[libcurl](https://curl.haxx.se/libcurl/), then parses the listing pages using
[Gumbo](https://github.com/google/gumbo-parser), and presents them using
[libfuse](https://github.com/libfuse/libfuse).
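
The link-extraction step is simple enough to sketch. The snippet below is a
minimal, self-contained illustration of walking a page with Gumbo and printing
every link it finds; it is not code from HTTPDirFS itself, just the general
idea. Compile with ``gcc -o links links.c -lgumbo``.

    #include <stdio.h>
    #include <gumbo.h>

    /* Recursively walk the DOM tree and print the href of every <a> tag. */
    static void dump_links(GumboNode *node)
    {
        if (node->type != GUMBO_NODE_ELEMENT) {
            return;
        }
        if (node->v.element.tag == GUMBO_TAG_A) {
            GumboAttribute *href =
                gumbo_get_attribute(&node->v.element.attributes, "href");
            if (href) {
                printf("%s\n", href->value);
            }
        }
        GumboVector *children = &node->v.element.children;
        for (unsigned int i = 0; i < children->length; i++) {
            dump_links((GumboNode *) children->data[i]);
        }
    }

    int main(void)
    {
        /* A stand-in for a downloaded listing page. */
        const char *html = "<html><body><a href=\"foo/\">foo/</a>"
                           "<a href=\"bar.iso\">bar.iso</a></body></html>";
        GumboOutput *output = gumbo_parse(html);
        dump_links(output->root);
        gumbo_destroy_output(&kGumboDefaultOptions, output);
        return 0;
    }
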
I wrote the cache system myself. It was a Herculean effort. I am immensely proud
of it. The cache system stores the metadata and the downloaded files in two
separate directories. It uses bitmaps to record which segments of a file have
been downloaded. By bitmap, I mean ``uint8_t`` arrays, with each byte indicating
whether a 1 MiB segment has been downloaded. I could not be bothered to
implement proper bitmapping. The main challenge for the cache system was hunting
down various race conditions which caused metadata corruption, downloading the
same segment multiple times, and deadlocks.
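
To make the bitmap idea concrete, here is a toy sketch of the one-byte-per-segment
scheme (the ``CacheFile`` struct and function names are made up for illustration;
the real data structures in the source differ):

    #include <stdio.h>
    #include <stdint.h>
    #include <stdlib.h>

    #define SEG_SIZE (1024 * 1024)  /* 1 MiB per segment, as described above */

    typedef struct {
        long size;         /* file size in bytes */
        long nseg;         /* number of segments */
        uint8_t *seg_map;  /* one byte per segment: 0 = missing, 1 = downloaded */
    } CacheFile;

    static CacheFile *cachefile_new(long size)
    {
        CacheFile *cf = malloc(sizeof(CacheFile));
        cf->size = size;
        cf->nseg = (size + SEG_SIZE - 1) / SEG_SIZE;  /* round up */
        cf->seg_map = calloc(cf->nseg, sizeof(uint8_t));
        return cf;
    }

    /* Has the segment containing this byte offset been downloaded? */
    static int seg_exists(const CacheFile *cf, long offset)
    {
        return cf->seg_map[offset / SEG_SIZE];
    }

    /* Mark the segment containing this byte offset as downloaded. */
    static void seg_set(CacheFile *cf, long offset)
    {
        cf->seg_map[offset / SEG_SIZE] = 1;
    }

    int main(void)
    {
        CacheFile *cf = cachefile_new(10 * SEG_SIZE);  /* a 10 MiB file */
        seg_set(cf, 3 * SEG_SIZE + 12345);  /* bytes in segment 3 arrived */
        printf("segment 3 cached: %d\n", seg_exists(cf, 3 * SEG_SIZE));
        printf("segment 4 cached: %d\n", seg_exists(cf, 4 * SEG_SIZE));
        free(cf->seg_map);
        free(cf);
        return 0;
    }

A byte per segment wastes seven bits compared to proper bit-level bitmapping,
but it keeps the bookkeeping trivially simple.
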
## Acknowledgement
- First of all, I would like to thank
[Jerome Charaoui](https://github.com/jcharaoui) for being the Debian Maintainer
for this piece of software. Thank you so much for packaging it!
- I would like to thank
[Cosmin Gorgovan](https://scholar.google.co.uk/citations?user=S7UZ6MAAAAAJ&hl=en)
for the technical and moral support. Your wisdom is much appreciated!
- I would like to thank [-Archivist](https://www.reddit.com/user/-Archivist/)
for not providing FTP or WebDAV access to his server. This piece of software was
written in direct response to his appalling behaviour.