# HTTPDirFS - now with a permanent cache

Have you ever wanted to mount those HTTP directory listings as if they were a partition? Look no further, this is your solution. HTTPDirFS stands for Hyper Text Transfer Protocol Directory Filesystem.

The performance of the program is excellent, thanks to the use of the curl-multi interface: HTTP connections are reused, and HTTP pipelining is used when available. The FUSE component itself also runs in multithreaded mode.

The permanent cache system stores every file you have downloaded on disk, so it does not have to be downloaded again the next time you access it. This feature is enabled with the ``--cache`` flag and makes this filesystem much faster than ``rclone mount``.

## Usage

```
./httpdirfs -c $CACHE_FOLDER -f $URL $YOUR_MOUNT_POINT
```

An example URL would be [Debian CD Image Server](https://cdimage.debian.org/debian-cd/). The ``-f`` flag keeps the program in the foreground, which is useful for monitoring which URL the filesystem is visiting.

Useful options:

```
-f                     Run HTTPDirFS in foreground

-u  --username         HTTP authentication username
-p  --password         HTTP authentication password
-P  --proxy            Proxy for libcurl, for more details refer to
                       https://curl.haxx.se/libcurl/c/CURLOPT_PROXY.html
    --proxy-username   Username for the proxy
    --proxy-password   Password for the proxy
    --cache            Set a cache folder, by default this is disabled
    --dl-seg-size      The size of each download segment in MB,
                       default to 8MB.
    --max-seg-count    The maximum number of download segments a file
                       can have. By default it is set to 1048576, which
                       means the maximum memory usage per file is 1MB.
                       This allows caching files up to 8TB in size,
                       assuming you are using the default segment size.
    --max-conns        The maximum number of network connections that
                       libcurl is allowed to make, default to 10.
    --retry-wait       The waiting interval in seconds before making an
                       HTTP request, after encountering an error,
                       default to 5 seconds.
    --user-agent       The user agent string, default to "HTTPDirFS".
```

## Permanent cache system
You can now cache all the files you have looked at permanently on your hard drive by using the ``--cache`` flag. The files it caches persist across sessions. For example:

```
mkdir cache mnt
httpdirfs -f --cache cache http://cdimage.debian.org/debian-cd/ mnt
```

Once a segment of a file has been downloaded, it won't be downloaded again.

Please note that, due to the way the permanent cache system is implemented, the maximum download speed is around 15 MiB/s, as measured using my localhost as the web server. However, once you have accessed a file, reading it again is as fast as reading from your hard drive.

If you have any patches to make the initial download go faster, feel free to submit a pull request.

The permanent cache system also heavily relies on sparse allocation. Please make sure your filesystem supports it, otherwise your hard drive / SSD might grind to a halt.
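
A cache like this typically creates each data file at the full size of the remote file, leaving holes where segments have not been downloaded yet. The sketch below shows that general pattern on a POSIX filesystem; it is an illustration of the idea, not the actual httpdirfs code:

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Illustration of why sparse allocation matters: the cache data file is
 * created at the full size of the remote file, but on a filesystem with
 * sparse file support ftruncate() only records the length. Disk blocks
 * are allocated later, as pwrite() fills in downloaded segments at their
 * real offsets.
 */
int create_cache_data_file(const char *path, off_t remote_size)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, remote_size) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

On a filesystem that cannot store holes, extending the file to its full size up front forces every block to be allocated immediately, which is the slowdown warned about above.
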
## Configuration file support
There is now rudimentary config file support. The configuration file that the program will read is ``${XDG_CONFIG_HOME}/httpdirfs/config``. If ``${XDG_CONFIG_HOME}`` is not set, it will default to ``${HOME}/.config``. So by default you need to put the configuration file at ``${HOME}/.config/httpdirfs/config``. You will have to create the sub-directory and the configuration file yourself. In the configuration file, please supply one option per line. For example:

```
$ cat ${HOME}/.config/httpdirfs/config
--username test
--password test
-f
```

## Compilation
This program was developed under Debian Stretch. If you are using the same operating system as me, you need ``libgumbo-dev``, ``libfuse-dev``, ``libssl1.0-dev`` and ``libcurl4-openssl-dev``.
If you run Debian Stretch, have OpenSSL 1.0.2 installed, and get warnings like the ones below during compilation,

```
network.c:70:22: warning: thread_id defined but not used [-Wunused-function]
 static unsigned long thread_id(void)
                      ^~~~~~~~~
network.c:57:13: warning: lock_callback defined but not used [-Wunused-function]
 static void lock_callback(int mode, int type, char *file, int line)
             ^~~~~~~~~~~~~
/usr/bin/ld: warning: libcrypto.so.1.0.2, needed by /usr/lib/gcc/x86_64-linux-gnu/6/../../../x86_64-linux-gnu/libcurl.so, may conflict with libcrypto.so.1.1
```

then you need to check whether ``libssl1.0-dev`` has been installed properly. If you get these compilation warnings, the program will occasionally crash when you connect to an HTTPS website. This is because OpenSSL 1.0.2 needs those functions for thread safety, whereas OpenSSL 1.1 does not. If you have ``libssl-dev`` rather than ``libssl1.0-dev`` installed, those callback functions will not be linked properly.
If you have OpenSSL 1.1 and the associated development headers installed, then you can safely ignore these warning messages. If you are on Debian Buster, you will definitely get these warning messages, and you can safely ignore them.
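
Those warnings refer to the thread-safety hooks that OpenSSL 1.0.x expects an application to register. The sketch below is only an illustration of that mechanism, not the actual code in ``network.c``: when building against OpenSSL 1.1, the registration is compiled out, so the two callbacks end up defined but unused.

```c
#include <openssl/crypto.h>
#include <pthread.h>

static pthread_mutex_t *lock_array;

/* Identify the calling thread for OpenSSL 1.0.x. */
static unsigned long thread_id(void)
{
    return (unsigned long) pthread_self();
}

/* Lock or unlock one of OpenSSL's internal mutexes. */
static void lock_callback(int mode, int type, const char *file, int line)
{
    (void) file;
    (void) line;
    if (mode & CRYPTO_LOCK)
        pthread_mutex_lock(&lock_array[type]);
    else
        pthread_mutex_unlock(&lock_array[type]);
}

void crypto_lock_init(void)
{
#if OPENSSL_VERSION_NUMBER < 0x10100000L
    /* Only OpenSSL 1.0.x needs caller-supplied locking callbacks;
     * OpenSSL 1.1 handles locking internally, which is why these
     * functions are reported as unused when building against it. */
    lock_array = OPENSSL_malloc(CRYPTO_num_locks() * sizeof(pthread_mutex_t));
    for (int i = 0; i < CRYPTO_num_locks(); i++)
        pthread_mutex_init(&lock_array[i], NULL);
    CRYPTO_set_id_callback(thread_id);
    CRYPTO_set_locking_callback(lock_callback);
#endif
}
```
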
## SSL Support

If you run the program in the foreground, it will print the SSL engine version string when it starts up. Please verify that your libcurl is linked against OpenSSL, as the pthread mutex functions are designed for OpenSSL.

The SSL engine version string looks something like this:

```
libcurl SSL engine: OpenSSL/1.0.2l
```

## The Technical Details
I noticed that most HTTP directory listings don't provide a file size for the web page itself. I suppose this makes perfect sense, as they are generated on the fly, whereas the actual files do have file sizes. So the listing pages can be treated as folders, and the rest are treated as files.
This program downloads the HTML web pages/files using [libcurl](https://curl.haxx.se/libcurl/), then parses the listing pages using [Gumbo](https://github.com/google/gumbo-parser), and presents them using [libfuse](https://github.com/libfuse/libfuse).
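
As a rough, self-contained illustration of the Gumbo half of that pipeline (this is not code from httpdirfs, and the sample listing HTML is made up), the following extracts every link target from a page; link it with ``-lgumbo``:

```c
#include <stdio.h>
#include <gumbo.h>

/* Recursively print the href attribute of every <a> element. */
static void print_links(GumboNode *node)
{
    if (node->type != GUMBO_NODE_ELEMENT)
        return;
    if (node->v.element.tag == GUMBO_TAG_A) {
        GumboAttribute *href =
            gumbo_get_attribute(&node->v.element.attributes, "href");
        if (href)
            printf("%s\n", href->value);
    }
    GumboVector *children = &node->v.element.children;
    for (unsigned int i = 0; i < children->length; i++)
        print_links((GumboNode *) children->data[i]);
}

int main(void)
{
    /* A toy listing page; httpdirfs fetches real ones with libcurl. */
    const char *html =
        "<html><body><a href=\"../\">..</a>"
        "<a href=\"example.iso\">example.iso</a></body></html>";
    GumboOutput *output = gumbo_parse(html);
    print_links(output->root);
    gumbo_destroy_output(&kGumboDefaultOptions, output);
    return 0;
}
```
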
I wrote the cache system myself. It was a Herculean effort, and I am immensely proud of it. The cache system stores the metadata and the downloaded data in two separate directories. It uses bitmaps to record which segments of a file have been downloaded. By bitmap, I mean ``uint8_t`` arrays, with each byte recording whether one download segment has been downloaded; I could not be bothered to implement proper bitmapping. The main challenge for the cache system was hunting down the various race conditions which caused metadata corruption, repeated downloads of the same segment, and deadlocks.
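
As a rough illustration of that byte-per-segment bookkeeping (the struct and function names below are made up for this sketch and are not the actual httpdirfs data structures):

```c
#include <stdint.h>
#include <stdlib.h>

/* Illustrative only: one byte per download segment, non-zero = downloaded. */
typedef struct {
    uint64_t seg_size;  /* download segment size, e.g. the 8MB default */
    uint64_t nsegs;     /* number of segments in the file */
    uint8_t *seg_map;   /* the "bitmap": one byte per segment */
} CacheMeta;

static CacheMeta *meta_new(uint64_t file_size, uint64_t seg_size)
{
    CacheMeta *m = calloc(1, sizeof(CacheMeta));
    m->seg_size = seg_size;
    m->nsegs = (file_size + seg_size - 1) / seg_size;
    m->seg_map = calloc(m->nsegs, 1);
    return m;
}

/* Has the segment containing this byte offset been downloaded already? */
static int seg_downloaded(const CacheMeta *m, uint64_t offset)
{
    return m->seg_map[offset / m->seg_size];
}

/* Mark the segment containing this byte offset as downloaded. */
static void seg_mark_downloaded(CacheMeta *m, uint64_t offset)
{
    m->seg_map[offset / m->seg_size] = 1;
}
```
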
## Acknowledgement
- First of all, I would like to thank [Jerome Charaoui](https://github.com/jcharaoui) for being the Debian Maintainer for this piece of software. Thank you so much for packaging it!
- I would like to thank [Cosmin Gorgovan](https://scholar.google.co.uk/citations?user=S7UZ6MAAAAAJ&hl=en) for the technical and moral support. Your wisdom is much appreciated!
- I would like to thank [-Archivist](https://www.reddit.com/user/-Archivist/) for not providing FTP or WebDAV access to his server. This piece of software was written in direct response to his appalling behaviour.