How to set up your own distributed, redundant, and encrypted storage grid in a few easy steps

MartinD

Retired Staff
Verified Provider
[I was looking for a solution to a problem that this seems to cater for, so I figured I'd post it here in case others want something similar. This has been reproduced with permission from Joepie91 (original author) under the WTFPL license.]

If you have a few different VPSes, you'll most likely have a significant amount of unused storage space across all of them. This guide will be a quick introduction to setting up and using Tahoe-LAFS, a distributed, redundant, and encrypted storage system - some may call it 'cloud storage'.
What are the requirements?

    At least 2 VPSes required, at least 3 VPSes recommended. More is better.
    Each VPS should have at least 256MB RAM (for OpenVZ burstable), or 128MB RAM (for OpenVZ vSwap and other virtualization technologies with proper memory accounting).
    Reading comprehension and an hour of your time or so :)

What is Tahoe-LAFS?

From the Tahoe-LAFS website:

    Tahoe-LAFS is a Free and Open cloud storage system. It distributes your data across multiple servers. Even if some of the servers fail or are taken over by an attacker, the entire filesystem continues to function correctly, including preservation of your privacy and security.

How does Tahoe-LAFS work?

The short version: Tahoe-LAFS uses a RAID-like mechanism to store 'shares' (parts of a file) across the storage grid, according to the settings you specified. When a file is retrieved, all storage servers are asked for shares of this file, and the data is fetched from the servers that respond fastest. The requesting client then reconstructs the original file from the shares.
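To make the share idea a bit more concrete, here is a minimal sketch (not Tahoe-LAFS code) of the rule that a file can be rebuilt as long as the reachable servers hold enough distinct shares. The function name, the server names, and the placement dictionary are all made up for illustration.

```python
def is_recoverable(placement, online_servers, shares_needed):
    """Return True if the online servers hold enough distinct shares.

    placement maps a (hypothetical) server name to the set of share
    numbers stored on that server.
    """
    available = set()
    for server, shares in placement.items():
        if server in online_servers:
            available |= shares
    return len(available) >= shares_needed


# Hypothetical grid: 4 shares in total, any 2 of them rebuild the file.
placement = {
    "vps-a": {0},
    "vps-b": {1},
    "vps-c": {2, 3},
}

print(is_recoverable(placement, {"vps-a", "vps-b"}, shares_needed=2))  # True
print(is_recoverable(placement, {"vps-a"}, shares_needed=2))           # False
```

Note that vps-c alone would be enough here, since it holds two shares. That is exactly why the number of servers and the number of shares are configured separately later on.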

All shares are encrypted and checksummed; storage servers cannot possibly know or modify the contents of a share, or the file it derives from.

There are (roughly) two types of files: immutable (these cannot be changed afterwards) and mutable (these can be changed). Immutable files will result in a "read capability" (an encoded string that tells Tahoe-LAFS how to find it and how to decrypt it) and a "verify capability" (that can be used for verifying or repairing the file). A mutable file will also yield a "write capability" that can be used to modify the file. This way, it is possible to have a mutable file, but restrict the write capability to yourself, while sharing the read capability with others.

There is also a pseudo-filesystem with directories. You aren't required to use it, but it makes it possible to, for example, mount part of a Tahoe-LAFS filesystem via FUSE.

For more specifics, read this documentation entry.
How do I set it up?
1. Install dependencies

Follow the below instructions for all VPSes.

To install and run Tahoe-LAFS, you will need Python (with development files), setuptools, and the usual tools for compiling software. On Debian, this can be installed by running apt-get install python python-dev python-setuptools build-essential. If you use a different distro, your package manager or package names may differ.

Python setuptools comes with a Python package manager (or installer, rather) named easy_install. We'd rather have pip as our Python package manager, so we'll install that instead: easy_install pip.

After installing pip, we'll install the last dependency we need to install manually (pip install twisted), and then we can install Tahoe-LAFS itself: pip install allmydata-tahoe.

When you're done installing all of the above, you'll have to make a new user (adduser tahoe) that you're going to use to run Tahoe-LAFS under. From this point on, run all commands as the tahoe user.
2. Setting up an introducer

First of all, you'll need an 'introducer' - this is basically the central server that all other nodes connect to, to be made aware of other nodes in the storage grid. While the storage grid will continue to function if the introducer goes down, no new nodes will be discovered, and there will be no reconnections to nodes that went down until the introducer is back up.

Preferably, this introducer should be installed on a server that is not a storage node, but it's possible to run an introducer and a storage node alongside each other.

Run the following on the VPS you wish to use as an introducer, as the tahoe user:

tahoe create-introducer ~/.tahoe-introducer
tahoe start ~/.tahoe-introducer

Your introducer should now be started successfully. Read out the file ~/.tahoe-introducer/introducer.furl and note the entire contents down somewhere. You will need this later to connect the other nodes.
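If you'd rather not copy the FURL by hand, something like the sketch below could read it for you. It assumes the file sits at the path mentioned above and that a FURL starts with "pb://"; the function name is made up, and the location may differ between Tahoe-LAFS versions.

```python
from pathlib import Path


def read_introducer_furl(introducer_dir):
    """Read the FURL that the introducer wrote into its node directory."""
    furl_path = Path(introducer_dir).expanduser() / "introducer.furl"
    furl = furl_path.read_text().strip()
    # FURLs are Foolscap URLs and start with "pb://"; fail early otherwise.
    if not furl.startswith("pb://"):
        raise ValueError(f"{furl_path} does not look like a FURL")
    return furl


# Example (run as the tahoe user on the introducer VPS):
#   print(read_introducer_furl("~/.tahoe-introducer"))
```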
3. Setting up storage nodes

Now it's time to set up the actual storage nodes. This will involve a little more configuration than the introducer node. On each storage node, run the following command: tahoe create-node.

If all went well, a storage node should now be created. Now edit ~/.tahoe/tahoe.cfg in your editor of choice. I will explain all the important configuration values - you can leave the rest of the values unchanged. Note that the 'shares' settings all apply to uploads from that particular server - each machine connected to the network can pick its own encoding settings.

    nickname: The name for this particular storage node, as it will appear in the web panel.
    introducer.furl: The FURL for the introducer node - this is the address that you noted down before.
    shares.needed: The number of shares that are needed to reconstruct a file.
    shares.happy: The number of different servers that have to be available for storing shares, for an upload to succeed.
    shares.total: The total number of shares that should be created on upload. One storage node may hold more than one share, as long as it doesn't violate the shares.happy setting.
    reserved_space: The amount of space that should be reserved for other applications on this server. Read below for more information.
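As a rough illustration of where these values end up, the sketch below renders them as tahoe.cfg-style text. The section names ([node], [client], [storage]) are what I'd expect a generated tahoe.cfg to use, but treat them as an assumption and compare against your own file; the function name, nickname, and FURL placeholder are made up.

```python
import configparser
import io


def render_tahoe_cfg(nickname, introducer_furl, needed, happy, total, reserved_space):
    """Render the settings discussed above as tahoe.cfg-style text."""
    cfg = configparser.ConfigParser(interpolation=None)
    cfg.optionxform = str  # keep option names exactly as written
    cfg["node"] = {"nickname": nickname}
    cfg["client"] = {
        "introducer.furl": introducer_furl,
        "shares.needed": str(needed),
        "shares.happy": str(happy),
        "shares.total": str(total),
    }
    # Section name assumed; reserved_space takes values such as "1G".
    cfg["storage"] = {"enabled": "true", "reserved_space": reserved_space}
    out = io.StringIO()
    cfg.write(out)
    return out.getvalue()


print(render_tahoe_cfg("storage-1", "pb://<paste introducer.furl here>", 4, 6, 8, "1G"))
```

This only prints the text; it deliberately doesn't overwrite ~/.tahoe/tahoe.cfg, since the generated file contains other values that you should leave alone.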


Reserved space

Tahoe-LAFS has a somewhat interesting way of counting space - instead of keeping track of how much space it can use for itself, it tries to make sure that a certain amount of space stays available for other applications. In practice, this means that if another application fills up 1GB of disk space, this 1GB is subtracted from the space that Tahoe-LAFS can use, not from the reserved space. The end result is that Tahoe-LAFS is very conservative in the way it uses disk space. You can therefore typically set the reserved space to a very low value like 1GB to 5GB: by the time free space drops to that amount, you will still have plenty of time to clean up your VPS before the last gigabytes are used up by other applications.
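The accounting described above boils down to one subtraction. The sketch below is a simplified model of it, not what Tahoe-LAFS literally runs; the function name and the numbers are made up.

```python
import shutil


def space_tahoe_may_use(free_bytes, reserved_bytes):
    """Space left for new shares: whatever is free beyond the reserve."""
    return max(0, free_bytes - reserved_bytes)


GB = 1000 ** 3
reserved = 1 * GB

# Worked example: 20 GB free, then another application writes 1 GB.
print(space_tahoe_may_use(20 * GB, reserved) // GB)  # 19
print(space_tahoe_may_use(19 * GB, reserved) // GB)  # 18

# The same calculation for the disk that holds the current directory.
print(space_tahoe_may_use(shutil.disk_usage(".").free, reserved))
```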
Share settings

At first, share settings may seem very tricky to configure correctly. My advice would be to set it as the following:

    shares.total: about 80% of the number of servers you have available.
    shares.happy: 2 lower than shares.total
    shares.needed: half of shares.total

This means that if you have for example 10 storage servers, shares.total = 8, shares.happy = 6, shares.needed = 4.

Now you can't just set any arbitrary values here - your share settings will influence the 'expansion factor' - how many times more space you use than the file would take up on its own. You can calculate the expansion factor by doing shares.total / shares.needed - for example, with the above suggested setup the expansion factor would be 2, meaning that a 100MB file would take up 200MB of space.

The level of redundancy can be calculated quite easily as well: the number of servers you can lose while still being guaranteed access to your data is shares.happy - shares.needed (this assumes the worst-case scenario). In most cases, the number of servers you can lose will be shares.total - shares.needed. With the example above, that is 6 - 4 = 2 servers guaranteed, and 8 - 4 = 4 servers in most cases.
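The rule of thumb and the two formulas above can be put into a few lines of Python. This is only a sketch of the arithmetic in this guide; the function names are made up, and the clamping to a minimum of 1 for very small grids is my own addition.

```python
def suggest_share_settings(num_servers):
    """Rule of thumb: total = 80% of servers, happy = total - 2, needed = total / 2."""
    total = max(1, round(num_servers * 0.8))
    happy = max(1, total - 2)    # clamped so tiny grids stay valid
    needed = max(1, total // 2)
    return {"needed": needed, "happy": happy, "total": total}


def expansion_factor(settings):
    """How many times more space a file takes up on the grid."""
    return settings["total"] / settings["needed"]


def guaranteed_losses(settings):
    """Worst case: servers you can lose and still be sure to have your data."""
    return max(0, settings["happy"] - settings["needed"])


def typical_losses(settings):
    """Usual case: every share ended up on its own server."""
    return settings["total"] - settings["needed"]


s = suggest_share_settings(10)
print(s)                     # {'needed': 4, 'happy': 6, 'total': 8}
print(expansion_factor(s))   # 2.0
print(guaranteed_losses(s))  # 2
print(typical_losses(s))     # 4
```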
4. Starting your storage nodes

On each node, simply run the command tahoe start as the tahoe user, and you should be in business!
5. (optional) Install a local client

To more easily use Tahoe-LAFS, you may want to install a Tahoe-LAFS client on your local machine. To do this, you should basically follow the instructions in step 3 - however, instead of running tahoe create-node, you should run tahoe create-client. Configuring and starting works the same, but you don't need to fill in the reserved_space option (as you're not storing files).
Using your new storage grid

There are several ways to use your storage grid:
Via the web interface

Simply make sure you have a client (or storage node) installed, and point your browser at http://localhost:3456/ - you will see the web interface for Tahoe-LAFS, which will allow you to use it. The "More info" link on a directory page (or for a file) will give you the read, write, and verify capability URIs that you need to work with them using other methods.
Using Python

I recently started working on a Python module named pytahoe that you can use to easily interface with Tahoe-LAFS from a Python application or shell. To install it, simply run pip install pytahoe as root - you'll need to make sure that you have libfuse/libfuse2 installed. There is no real documentation for now other than in the code itself, but the code below gives you an idea of how it works:

>>> import pytahoe
>>> fs = pytahoe.Filesystem()
>>> d = fs.Directory("URI:DIR2:hnncfsbzsxv5fhdymxhycm3xc4:qjipiqg3bozb5evb6krdwfmsgks6j4ymivopgx7eoxcjb3avslqq")
>>> d.upload("devilskitchen.tar.gz")

The result of this is something like this.
Mounting a directory

You can also mount a directory as a local filesystem using FUSE (on OpenVZ, make sure your host supports FUSE). Right now, the easiest way appears to be using pytahoe (this can be done from a Python shell as well). Example:

>>> import pytahoe
>>> fs = pytahoe.Filesystem()
>>> d = fs.Directory("URI:DIR2:hnncfsbzsxv5fhdymxhycm3xc4:qjipiqg3bozb5evb6krdwfmsgks6j4ymivopgx7eoxcjb3avslqq")
>>> d.mount("/mnt/something")

Via the web API

If you're using something that is not Python, or want a bit more control over what you do, you may want to use the Tahoe-LAFS WebAPI directly - documentation for this can be found here.
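As a rough idea of what talking to the WebAPI looks like, here is a minimal sketch using only the Python standard library. It assumes the gateway runs at the address used for the web interface above, and relies on two WebAPI operations as I understand them from the documentation: PUT /uri to upload an immutable file (the response body is its capability) and GET /uri/<capability> to fetch it. The helper names are made up.

```python
import urllib.request
from urllib.parse import quote

GATEWAY = "http://localhost:3456"  # the web interface address used above


def file_url(cap, gateway=GATEWAY):
    """URL under which the gateway serves the object behind a capability."""
    return f"{gateway}/uri/{quote(cap, safe='')}"


def upload(data, gateway=GATEWAY):
    """PUT raw bytes to /uri; the response body is the new file's capability."""
    request = urllib.request.Request(f"{gateway}/uri", data=data, method="PUT")
    with urllib.request.urlopen(request) as response:
        return response.read().decode("ascii").strip()


def download(cap, gateway=GATEWAY):
    """GET the file contents that belong to a read capability."""
    with urllib.request.urlopen(file_url(cap, gateway)) as response:
        return response.read()


# Example (needs a running client or storage node on this machine):
#   cap = upload(b"hello grid")
#   print(cap, download(cap))
```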
Need more help?

There's plenty more (very clear) documentation on the Tahoe-LAFS website! :)

EDIT: For those interested in copying this guide - it's released under the WTFPL, meaning you can basically do with it whatever you want, including copying it elsewhere. Credits or a donation are both appreciated, but neither is required :)
 
 

drmike

100% Tier-1 Gogent
I hope Joepie91 comes on in and drops some other tutorials and wisdom. One of my old favorites from the other site!
 

willie

Active Member
Tahoe sounds cool but not really that practical because of the high redundancy, the slow (compared to SATA) internet connections between nodes, and the necessarily not-so-great security of the underlying VPSes. I keep thinking about how to do secure online storage and the answer I keep reaching is always colo servers with various forms of hardware-based security.
 

Shados

Professional Snake Miner
Tahoe sounds cool but not really that practical because of the high redundancy, the slow (compared to SATA) internet connections between nodes, and the necessarily not-so-great security of the underlying VPSes. I keep thinking about how to do secure online storage and the answer I keep reaching is always colo servers with various forms of hardware-based security.
Sounds like the problem you're trying to solve is "server local storage I can be certain my host can't access", whereas Tahoe is trying to solve the problem of "remote storage that can be accessed locally but be certain my remote host can't access" (among other goals).

Completely different things in mind, so it's hardly surprising that it doesn't work for you - you're thinking about using a screwdriver when you need a hammer :p.
 

MartinD

Retired Staff
Verified Provider
Also could have simply linked to https://raymii.org/s/tutorials/Tahoe_LAFS_Storage_Grid.html, which has nicer formatting and some additional info (Q&A).
Thanks! There are more copies out there, @mikho also has one. The License allows it, and yes, my formatting is nicer :)
Sorry - didn't know they existed. I was talking to Joe yesterday about this and he linked to his post elsewhere. I suggested he post it on VPSBoard too - he wasn't fussed, so I simply copied and pasted it from the source I was linked to. No harm intended!
 

perennate

New Member
Verified Provider
Some things to keep in mind

  • It's better to get servers that have similar disk sizes. Reasoning: if you have five servers with 500 GB and five servers with 50 GB, and you have (K,N_min)=(3,6) (i.e., replicate to at least six servers with three shares needed to reconstruct the data), then once the five servers with 50 GB are full your storage grid is full since you only have five servers with space left. If you have many more servers than N_min (the happy setting) then it's fine though.
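The disk-size point can be checked with a toy simulation. The model below is a deliberate simplification and an assumption on my part: every upload puts one 1 GB share on each of the six servers that currently have the most free space, which is not how Tahoe-LAFS actually picks servers. The function name is made up.

```python
def uploads_until_full(free_gb, happy, share_gb=1):
    """Count uploads until fewer than `happy` servers can take a share.

    Toy model: each upload puts one share of `share_gb` on each of the
    `happy` servers with the most free space. Real placement differs.
    """
    free = list(free_gb)
    uploads = 0
    while True:
        free.sort(reverse=True)
        if len(free) < happy or free[happy - 1] < share_gb:
            return uploads
        for i in range(happy):
            free[i] -= share_gb
        uploads += 1


mixed = [500] * 5 + [50] * 5  # the example above
even = [275] * 10             # same total capacity, evenly spread

print(uploads_until_full(mixed, happy=6))  # 250
print(uploads_until_full(even, happy=6))
```

In the mixed grid, the uploads stop as soon as the five small servers are full, leaving 250 GB unused on each of the big servers. The evenly sized grid fits considerably more uploads in the same total space.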
  • Backups: if you want to back up from a local filesystem to Tahoe-LAFS, you may want to check out https://github.com/mk-fg/lafs-backup-tool. "tahoe backup" and other commands are still in development.
  • Syncing: I don't know of any good way to sync local filesystem with the storage grid. Sometimes it can be a pain editing files directly from the mounted system. If you use SFTP interface + sshfs it may be better since sshfs has a built-in cache feature.
 