
ZFS System



dcrdev

I've been contemplating my setup recently and have decided that I'm going to be purchasing a new server in the coming weeks - since I built my original 'do it all' server I've learnt a few things, namely that btrfs is not reliable in a RAID configuration and integrates poorly with systemd. I don't want to go the hardware RAID route for various reasons, so I think the only sensible choice is to go ZFS.

 

I've been running Fedora Server, but would like to move to CentOS as I'd like something a bit more stable. Now, from what I've gathered, whilst ZFS on Linux is at parity with OpenZFS on BSD, the fact that it's not built into the kernel means that you have to use DKMS-style packages and have the module rebuilt on kernel upgrades; which is fine, but I've heard people have at times had issues that required an update to the DKMS package, i.e. the pool becomes unavailable until the issue is rectified.
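(For context, the sort of post-upgrade check that would be involved - a rough sketch, assuming the standard zfsonlinux DKMS packages:)

dkms status        # confirm the spl/zfs modules were rebuilt for the new kernel
modprobe zfs       # load the rebuilt module
zpool import -a    # re-import the pool(s) if they didn't come back automatically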

 

So I'm thinking of running my CentOS system in parallel to FreeNAS on a hypervisor like ESXi - I need the FreeNAS storage to be available to the CentOS VM, and I think iSCSI would be my best bet. I'm just wondering if anyone has a similar setup and if there are any a) performance hits or b) caveats I should be aware of.
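(From what I've read, the initiator setup on the CentOS side would be roughly this - a hedged sketch; the portal address and target IQN are placeholders:)

yum install iscsi-initiator-utils
iscsiadm -m discovery -t sendtargets -p 192.168.1.50                          # discover targets exported by FreeNAS
iscsiadm -m node -T iqn.2005-10.org.freenas.ctl:emby-storage -p 192.168.1.50 --login
# the LUN then shows up as a normal block device (e.g. /dev/sdX) to format and mount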


Swynol

I run a few ESXi machines. I've been thinking about moving my Emby server and storage onto ESXi but haven't had the balls to do it yet.

 

Another option you could look at is something like ESXi with a Windows 10 VM, then use StableBit DrivePool or Drive Bender to manage your storage and redundancy.

 

Never used CentOS so I can't compare, but I have used FreeNAS and it was great until I wanted to run some Windows software on my 24/7 machine. Ended up getting rid of FreeNAS for Windows 10.


dcrdev

Thanks - although I don't use Windows at home; nothing against it, I just prefer Unix-like and open source. I'm also really looking for a CoW filesystem - hence btrfs, but as I said that's not stable enough yet; well, at least RAID 5/6 isn't, due to the write hole bug, and RAID 10 is just too costly. Another reason I'd like to use FreeNAS is that I use SMB, NFS and AFS - keeping that all in sync is a chore, so a single configuration point would be brilliant.

 

I suppose if I were to be specific in my questions, it would boil down to this:

  1. Can I pass through a SATA device directly to a machine within ESXi, or would I have to go down the route of connecting the drives to an HBA and passing that through to the VM? I ask this because of course the ZFS checksumming and scrubbing operations won't work on a virtual filesystem.
  2. iSCSI is not something I've used before - would there be any speed degradation in using that as a storage medium for another VM?

The CentOS element just runs my web server, proxy server, DNS and of course Emby.


mastrmind11

FWIW, my NAS is a headless Ubuntu 16.04 build running ZFS, which Ubuntu has shipped as a kernel module since 15.10. It works flawlessly and ZFS is an amazing filesystem. I only have 4GB of ECC RAM and a Celeron CPU and have not run into any performance or parity issues.

 

bill@FileServer:~$ zpool status
  pool: storage
 state: ONLINE
  scan: scrub repaired 0 in 8h21m with 0 errors on Sun Jun 18 08:21:04 2017
config:

        NAME                                            STATE     READ WRITE CKSUM
        storage                                         ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            ata-WDC_WD2003FYPS-27Y2B0_WD-WCAVY5692219   ONLINE       0     0     0
            ata-WDC_WD20EFRX-68EUZN0_WD-WCC4M3FSNHS0    ONLINE       0     0     0
          mirror-1                                      ONLINE       0     0     0
            ata-WDC_WD2000F9YZ-09N20L1_WD-WMC1P0DAZ4K4  ONLINE       0     0     0
            ata-WDC_WD20EFRX-68EUZN0_WD-WCC4M6VDFZ8J    ONLINE       0     0     0
          mirror-2                                      ONLINE       0     0     0
            ata-ST3000DM001-1CH166_W1F40LNB             ONLINE       0     0     0
            ata-ST3000DM001-1CH166_Z1F3MBK4             ONLINE       0     0     0

errors: No known data errors
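For reference, a striped-mirror pool like this gets created in one go with something along these lines (a sketch - the by-id names below are placeholders, not my actual drives):

zpool create storage \
  mirror /dev/disk/by-id/<drive1> /dev/disk/by-id/<drive2> \
  mirror /dev/disk/by-id/<drive3> /dev/disk/by-id/<drive4> \
  mirror /dev/disk/by-id/<drive5> /dev/disk/by-id/<drive6>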


  • 1 month later...
dcrdev

Well, having done some testing in VirtualBox, I think I'm going to do ZFS on CentOS with the root on ZFS as well; on top of that, the root is going to be encrypted. I've managed to successfully pull this off in a VM - albeit with a patch to GRUB.

 

I've opted to build my new server myself - the rest of the parts are coming tomorrow; I'll probably post some pics soon, but the specs are:

 

Xeon E3-1245 v6

16GB Crucial ECC

Asrock E3C236D2I with IPMI

Noctua NH-L9I CPU Cooler

120mm Noctua PWM Exhaust Fan

Inwin MS04 Server Case

250GB Samsung 960 Pro NVMe Boot Drive

250GB Samsung 850 Evo Cache/Log Drive

4x 10TB Seagate IronWolf NAS Drives

FSP Flex ATX Platinum 500W PSU

 

Total cost has been about £2000


  • 2 weeks later...
dcrdev

Well, this was a massive pain in the arse to build - the case and drive cage made it impossible to work in, and I ended up snapping several of the panel LED cables and had to solder and rewrap them:

 

[Build photos]

 

Storage drives are yet to come - but after much tribulation I have a CentOS system on an NVMe drive, with EFI, that is encrypted and on a ZFS root.

$ zpool status
  pool: centos_hell01-serv01-core
 state: ONLINE
  scan: none requested
config:

	NAME                                         STATE     READ WRITE CKSUM
	centos_hell01-serv01-core                    ONLINE       0     0     0
	  luks-c024c16c-08a4-43df-809d-e6e7eb416b43  ONLINE       0     0     0

errors: No known data errors

$ zfs list
centos_hell01-serv01-core         18.0G   211G    29K  legacy
centos_hell01-serv01-core/ROOT    1.02G   211G  1.02G  legacy
centos_hell01-serv01-core/docker  7.76M   211G  7.76M  legacy
centos_hell01-serv01-core/home    29.5K   211G  29.5K  legacy
centos_hell01-serv01-core/swap    17.0G   228G    12K  -


This is my first foray into zfs - so far I'm liking it very much!

 

The only thing I'm having difficulty with is mixing legacy and ZFS-managed mounts - particularly the ordering. For example, Docker requires its own dataset if using the zfs storage driver, and it needs to be mounted at /var/lib/docker. With ZFS-managed mounts, Docker starts up first and therefore sets itself up on the empty directory, which has forced me to use legacy mounts across the board. When I get around to creating my storage pool, I plan to rely entirely on ZFS-managed mounts. There are one or two instances where I need to create bind mounts in fstab and I'm not sure how that will pan out if systemd attempts those mounts before the ZFS mount service kicks in. I'm sure I'll have massive amounts of fun figuring it out though.
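(Roughly what I've ended up doing for the Docker dataset - a sketch using my pool name; the fstab line is the relevant bit, since fstab mounts come up as part of local-fs.target before docker.service starts:)

zfs set mountpoint=legacy centos_hell01-serv01-core/docker
# /etc/fstab entry:
centos_hell01-serv01-core/docker  /var/lib/docker  zfs  defaults  0 0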


  • 2 weeks later...
dcrdev

So, a slight change of plan - it seems as though a combination of ZFS and an older kernel can't quite cope with the throughput of my Samsung SM961 NVMe drive. The NVMe kernel queue was getting flooded under heavy I/O with Docker on ZFS.

 

So I had to start again with XFS as the root filesystem - still going to go with ZFS on the array though, which should be fine as it's just spinning rust. Dammit, sometimes I wish I liked Ubuntu; it would make my life so much easier.


dcrdev

Goddammit, I swear I have the worst luck when it comes to storage - seriously, never ever buy StarTech SATA cables! This is the second batch I've bought this year that has been defective - I only bought them because they are round and fit more easily in small cases. I've spent all day swapping drives in and out of systems, only to learn that once again it's the bloody StarTech cables. It's really bad as well, because had I not been using Linux these errors would most likely have been silent on Windows.

 

On a more positive note, look what arrived in the mail today:

[Photo]

 

Also, Red Hat released a patch for my NVMe issue and I'm back on a ZFS root - I'm now beginning the very, very long task of transferring my data from the old server.

  pool: centos_hell01-serv01-core
 state: ONLINE
  scan: scrub repaired 0B in 0h0m with 0 errors on Sat Aug 26 23:50:34 2017
config:

	NAME                                         STATE     READ WRITE CKSUM
	centos_hell01-serv01-core                    ONLINE       0     0     0
	  luks-670f7816-5cdb-41d6-a057-ca3a73dc8483  ONLINE       0     0     0

errors: No known data errors

  pool: centos_hell01-serv01-data
 state: ONLINE
  scan: scrub repaired 0B in 0h0m with 0 errors on Sat Aug 26 23:50:36 2017
config:

	NAME                                           STATE     READ WRITE CKSUM
	centos_hell01-serv01-data                      ONLINE       0     0     0
	  mirror-0                                     ONLINE       0     0     0
	    luks-a886c3eb-e800-4c5e-98ad-0b4b9c0d7949  ONLINE       0     0     0
	    luks-5b7b50c4-cdbb-40af-ab64-b4cfef91f9da  ONLINE       0     0     0
	logs
	  luks-53331c18-8465-43c1-94fd-5d7101a7b8ba    ONLINE       0     0     0
	cache
	  luks-a0a50ac4-741a-46b4-8f67-30aefce6bc35    ONLINE       0     0     0

errors: No known data errors

dcrdev

I'm using rsync to transfer the files over SSH; transfer speeds are saturating the network.
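(The transfer itself is just a pull along these lines - a sketch, host and paths are placeholders:)

rsync -avHAX --progress -e ssh root@oldserver:/srv/media/ /storage/media/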

 

Obviously this isn't normal operation, but something is telling me I'll be getting more RAM fairly soon:

[Screenshot: memory usage]

 

Got another 20 hours of data transfer according to my calculations.


dcrdev

Media library is now moved across - imported into Emby in record time and just waiting on the chapter image extraction job.

 

Anyone know a quick way to import watched status from my old Emby instance?


puithove

Anyone know a quick way to import watched status from my old Emby instance?

 

The backup plugin


Sludge Vohaul

@dcrdev

 

Hi,

Just out of curiosity, is the zpool status screenshot in #8 your production setup?

If it is, I strongly advise that you read at least the relevant parts of the ZFS documentation on freenas.org and the forums, or the Solaris ZFS docs.

This setup is literally begging for data loss:

 

1. centos_hell01-serv01-core:

A pool consisting of one disk? Once (not if) the disk dies, everything is gone.

 

2. centos_hell01-serv01-data:

A mirror of two 6TB disks? Once one of the disks dies you have no redundancy.

As you have bought the disks at the same time ("Storage drives are yet to come" in #6) they're probably from the same production batch. Both will have similar hours of operation, so while resilvering a new drive, chances are pretty good that the other drive fails too, as resilvering stresses drives considerably.

And even if it does not fail, you still will not know whether your data is intact, as all you will have is one single drive to read the data and checksums from, and that drive may have read errors. This is not hypothetical, but computable: for example, at a typical consumer URE spec of one unrecoverable error per 10^14 bits, reading a full 6TB drive (~4.8x10^13 bits) during a resilver gives you very roughly a 40% chance of hitting at least one bad read.

 

If you plan to use this hardware for Emby in the first place, I'd suggest discarding the pool and the log drives and using all four to set up a RaidZ2 configuration.

 

And, as a final thought, what does your backup solution look like? ZFS does not replace backups, though it makes them quite painless.

 

--

sv

 

 

 

 


dcrdev


Hi,

 

Thanks for the feedback - although I have read the documentation and chose this setup on purpose.

 

centos_hell01-serv01-core is indeed a one-drive pool - I have installed the operating system on this drive and am literally only using ZFS for compression and snapshotting; no redundancy, obviously. I don't care that much about it - it's replicated onto the storage pool automatically with zfs send/receive and piped into a gzipped archive. These backups are held for 6 months and pruned automatically.
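(Roughly, the backup direction looks like this - a sketch; the snapshot name is illustrative and openssl prompts for the passphrase:)

zfs snapshot centos_hell01-serv01-core/ROOT@<yy-mm-dd>
zfs send centos_hell01-serv01-core/ROOT@<yy-mm-dd> | gzip | openssl enc -aes-256-cbc -a -out /storage/Backup/root-<yy-mm-dd>.gz.ssl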

 

centos_hell01-serv01-data is currently just a mirror; the other drive(s) you speak of are actually one SSD partitioned into two, hence the choice to use them for cache/logs. I have opted to go the striped/mirrored route, incrementally adding sets of mirrored vdevs over time. At the moment this consists of one vdev due to budget constraints, but it will be growing over the coming months. Also, the two 6TB drives are from different vendors and have different manufacturing dates - I did this on purpose.
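(Growing the pool later is then just something along these lines - the mapper names are placeholders:)

zpool add centos_hell01-serv01-data mirror /dev/mapper/luks-<new-uuid-1> /dev/mapper/luks-<new-uuid-2>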

 

Additionally, the storage array is automatically replicated to another machine. Ultimately (and I've tested this) I can recover this whole configuration from backup in around 7 hours with little to no input from myself.

 

Most of this automation comes from sanoid (https://github.com/jimsalterjrs/sanoid), although some of it relies on Python scripts I've written.


Sludge Vohaul

Hi,

 

 

centos_hell01-serv01-core is indeed a one-drive pool - I have installed the operating system on this drive and am literally only using ZFS for compression and snapshotting; no redundancy, obviously.

 

In my setup I have the OS (FreeNAS) on two mirrored USB sticks - if one dies I pull it out and push in a new one. Works fine (though you have the usual problems when the new 16GB stick is a few bytes smaller than the old one).

 

 

...it's replicated onto the storage pool automatically with zfs send/receive and piped into a gzipped archive. These backups are held for 6 months and pruned automatically.

So you send a snapshot to your storage pool and store it in a gzip archive?

Why don't you just turn on compression on your backup dataset on the storage pool and leave the snapshots there? Gzip will squeeze out maybe 10%-20% more than lz4, but lz4 compression is many times faster and built into ZFS - no need to handle any archive files, and you can access the snapshot contents directly without having to extract them first.
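What I mean is roughly this (a sketch - the dataset names are made up):

zfs create -o compression=lz4 datapool/backups
zfs send sourcepool/ROOT@thesnapshot | zfs recv datapool/backups/ROOT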

 

 

one SSD partitioned into two, hence the choice to use them for cache/logs

I'd say you gain nothing from this setup, but I'd ask the ZFS gurus on the freenas forums.

 

 

I have opted to go the striped/mirrored route, incrementally adding sets of mirrored vdevs over time...

 

But you then still end up with a pool made up of two-drive vdevs. If the wrong two drives die, your pool dies.

 

 

Most of this automation comes from sanoid https://github.com/jimsalterjrs/sanoid,...

I have not used these tools (and I don't mean to knock them), but having read the description I'd say everything should be doable with standalone FreeNAS. But as you said you're tied to Linux, you'll have to use such tools.


dcrdev

In my setup I have the OS (FreeNAS) on two mirrored USB sticks - if one dies I pull it out and push in a new one. Works fine (though you have the usual problems when the new 16GB stick is a few bytes smaller than the old one).

I wanted NVMe for my boot drive, and the motherboard I'm using only has one M.2 slot - if I wanted another one, I'd have to use up a PCIe slot. Like I said, I can recover from backup in no time - so I'm not really concerned.

 

So you send a snapshot to your storage pool and store it in a gzip archive?

Why don't you just turn on compression on your backup dataset on the storage pool and leave the snapshots there? Gzip will squeeze out maybe 10%-20% more than lz4, but lz4 compression is many times faster and built into ZFS - no need to handle any archive files, and you can access the snapshot contents directly without having to extract them first.

Compression is enabled on both pools, but having a handy single-file archive makes it easy to automatically sync to online storage (encrypted, of course). Also, I don't have to extract them per se - they are still block-level snapshots and I can restore them with a single command:

openssl enc -d -aes-256-cbc -a -in /storage/Backup/root-<yy-mm-dd>.gz.ssl | gunzip | zfs receive centos_hell01-serv01-core/ROOT

I'd say you gain nothing from this setup, but I'd ask the ZFS gurus on the freenas forums.

 

Have to disagree - there are huge, proven benefits to running a separate intent log on solid state.
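(For reference, the log and cache partitions were attached along these lines - a sketch, the mapper names are placeholders:)

zpool add centos_hell01-serv01-data log /dev/mapper/luks-<ssd-log-uuid>
zpool add centos_hell01-serv01-data cache /dev/mapper/luks-<ssd-cache-uuid>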

 

But you then still end up with a pool made up of two-drive vdevs. If the wrong two drives die, your pool dies.

 

That's a risk I'm willing to take.

 

I have not used these tools (and I don't mean to knock them), but having read the description I'd say everything should be doable with standalone FreeNAS. But as you said you're tied to Linux, you'll have to use such tools.

 

 

Exactly - I'm not using FreeNAS for good reason: this machine is doing much more than just storage.


Sludge Vohaul

openssl enc -d -aes-256-cbc -a -in /storage/Backup/root-<yy-mm-dd>.gz.ssl | gunzip | zfs receive centos_hell01-serv01-core/ROOT

Sorry, I am not sure whether I understand the above.

Your source data is a file (asd) which you have gzipped (asd.gz), then encrypted and stored the result base64-encoded (asd.gz.ssl)?

Why do you expensively compress it just to store it base64 encoded later?

 

Anyway, it probably suits your needs. I'd probably use an encrypted dataset and the default lz4 (could use gzip too if size was a concern). Both are implemented in ZFS and therefore transparent:

zfs send backuppool/dataset@thesnapshot | zfs recv datapool/dataset

Have to disagree - there are huge, proven benefits to running a separate intent log on solid state.

Yes, a ZIL and L2ARC can speed things up under certain use-case scenarios. For streaming, the L2ARC is useless; the ZIL might speed up synchronous writes to the pool. You've probably tested it yourself.

 

Exactly - I'm not using FreeNAS for good reason: this machine is doing much more than just storage.

As FreeNAS is built on top of FreeBSD, you still have all the virtualization power at hand. It would indeed be very silly to install stuff into the FreeNAS system itself. My FreeNAS box has been running jails with DNS, CUPS, Emby, Plex, Postgres and LMS for years without any dropouts.

You might find this interesting:

http://www.freenas.org/blog/yes-you-can-virtualize-freenas/


dcrdev

The file is gzipped because it's backed up automatically to an online storage provider, and it's just easier to do it the above way. Also, encrypted datasets only just got implemented in ZFS on Linux about a week ago, so that's part of it; I'm using dm-crypt for transparent encryption on my system.
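(The layering is dm-crypt underneath ZFS, roughly like this - a sketch with placeholder device names:)

cryptsetup luksFormat /dev/disk/by-id/<disk1>
cryptsetup luksOpen /dev/disk/by-id/<disk1> luks-<uuid-1>
# (repeat for the second disk, then build the pool on the mapper devices)
zpool create centos_hell01-serv01-data mirror /dev/mapper/luks-<uuid-1> /dev/mapper/luks-<uuid-2>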

 

Well, yes - the SLOG is particularly beneficial for my database server and GitLab instance; the L2ARC isn't really going to do much unless I run out of ARC.

 

Anyway I'm happy with CentOS!


  • 2 weeks later...

So I've lasted three kernel upgrades and a new release - I think we have stability! I've even managed to build a rescue initrd with ZFS built in, just in case.
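(The rescue image was built roughly like this - a sketch, assuming the zfs dracut module from the ZoL packages is installed:)

dracut --force --add zfs /boot/initramfs-rescue.img $(uname -r)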

 

Things to do now:

- Upgrade to 32GB RAM - I'm using 16GB DIMMs, so it shouldn't be much hassle.

- Buy another 2x 6TB drives / find a way to rebalance the array; I'm thinking send/receive to a new dataset and rename (see the sketch after this list).

- Investigate integrating libvirt with ZFS, so that new VMs automatically spin up a new zvol.

- Figure out how to do unattended boots with a USB key attached to LUKS; this functionality is partially broken with systemd/dracut.
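(The rebalance idea mentioned above, sketched out - dataset and snapshot names are placeholders:)

zfs snapshot -r centos_hell01-serv01-data/<dataset>@rebalance
zfs send -R centos_hell01-serv01-data/<dataset>@rebalance | zfs recv centos_hell01-serv01-data/<dataset>_new
zfs rename centos_hell01-serv01-data/<dataset> centos_hell01-serv01-data/<dataset>_old
zfs rename centos_hell01-serv01-data/<dataset>_new centos_hell01-serv01-data/<dataset>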

 

But yeah, this thing is ludicrously stable - I'm blown away given how much of a Frankenstein build it is, i.e. patched GRUB, patched grubby, out-of-tree filesystem, etc...

