Friday, May 09, 2025

Home Server Rebuild

Home Server Upgrades

I have had 2 servers at home running all of my infra, both of which were either free or budget builds and can charitably be described as ancient at this point. While I definitely learn plenty wrangling this stuff, they're not formally a homelab, and over time their role has grown from just a file server for backups and media centralization to things that are more critical to the daily function of the house, namely Home Assistant for automation. So I need these to be reliable and relatively maintenance-free, because I have plenty to do without also providing tech support or chasing flakiness in the home automation stuff for the residents of the house who will be annoyed if it's anything other than 100% transparent to them.

  • HP DL360 G7 with a single E5620, 32GB RAM, and 4x300G SAS SSDs in a hardware RAID. Runs a number of my Docker containers for stuff I'd potentially want to expose to the outside world (Minecraft servers, Home Assistant, etc). Ubuntu 22.04 LTS.
    • This isn't very power-efficient, and the RAID setup is on borrowed time: the cache card already failed once and was replaced with a spare, that spare long since disabled its cache because it doesn't like the battery, and it has had glitches where it freaks out and declares the file system read-only, which makes me think it's going to fail entirely. Every time that happens, I have to manually go into the RAID card and tell it to boot normally and recover things, which means pulling the system out of the rack and removing the GPU so it will use the onboard video (I never could get the GPU to put out a console display), all of which is a serious PITA. Since bypassing the RAID card to move to JBOD/software RAID, or replacing an actually failed card, was going to require a rebuild anyway, I figured I was better off just building a new system.
  • Dell PowerEdge T110 II (tower) with a single E3-1230 V2, 16GB of RAM, and 5x3TB of storage in ZFS. Runs TrueNAS, serves as the primary file server, and also hosts a Plex jail since that's where all of the media lives. I've been running this since it was FreeNAS (version 10, I think), and there are some older posts on the blog about previous iterations of this build.
    • TrueNAS Core (the version I'm running) is based on FreeBSD 13 and is basically at end of support. There are multiple point releases of FreeBSD 13 that never got integrated, and no plans for anything based on 14.
    • TrueNAS really wants you to move to Scale (Linux-based), and they've formally dropped support for jails and VMs on Core, so I can't even update to the (likely final) maintenance release of Core without risking breaking something.
    • There's an upgrade path from Core to Scale, but it requires installing to hard drives - no USB live-boot support anymore. I have no SATA ports left on this system, and it's more than a decade old, so it's not like it even has USB3 or anything else that could be used for more drives.

The goal was to replace the above with some combination of 1-2 systems with similar power usage (CPU TDP, etc.) while taking advantage of the performance-per-watt improvements that much newer chips would provide, preferably in similar form factors and allowing me to reuse the various disks involved. I run Folding at Home on my stuff when it's idle, and the heat generated ultimately feeds my heat pump water heater, especially in the winter, so there's some incentive to have hardware that is still efficient under heavy load. I'd decided I wanted to move away from an appliance OS in favor of regular Linux + OpenZFS, because I don't really need the UI for management and didn't want to end up in the same situation I'm in with TrueNAS again later. I considered either consolidating everything into one larger rackmount system or keeping 2 systems and using a purpose-built NAS case and motherboard (lots of SATA ports) to replace the Dell tower. Going to a 2RU system would mean finding a different switch that wasn't rack-mounted, because I only have a 4RU rack, and I got analysis paralysis on the DIY NAS route due to the sheer number of options.

The forcing function that got me to settle on a plan and move forward was that a Dell R330 with 4x4TB hard drives became available free as in beer, and I discovered that Dell still makes tower servers that are a newer generation of what I have, and found a T340 on eBay for under $400. The T340 has 8x3.5" hot-swap SATA/SAS bays, plus 3 more 5.25" bays and 2 onboard SATA ports. So after some discussion about the merits of various ZFS/RAID configurations, I ended up with the following:

  • Dell R330, E3-1230 v5, 32GB RAM, PERC H330 (in JBOD mode)
    • 4x200GB SAS SSDs (from the HP) in 2 ZFS mirror vDevs (basically RAID10) for storage
      • adapters to use them in the existing 3.5" bay/caddy
    • Linux software RAID1 for OS made up of 
      • 1x256GB SATA SSD in the optical drive bay via an adapter
      • 1x256GB NVME SSD in an "external" USB3 case plugged into the internal USB3 port 
  • Dell T340, E-2244G, 32GB RAM, PERC H330 (in JBOD mode)
    • 2x200GB SAS SSDs (spares for the HP) in Linux software RAID for the OS, using the same 2.5" -> 3.5" caddy adapters
    • ZFS storage pool consisting of the following mirror vDevs
      • 2x4TB SATA
      • 2x4TB SATA
      • 2x3TB SATA
      • 2x3TB SATA (via an HDD cage in the 5.25" bays and the onboard SATA ports)
    • eBay Nvidia Quadro P2200 
      • for FaH and Plex transcode
      • Replaced a Quadro K2000 that had been in the HP and died

Software considerations

I ended up using Debian 12 on both. I've been fairly happy with Ubuntu - that HP ran 18.04 LTS and seamlessly upgraded through 20.04 and 22.04 - but Canonical seems to be really focused on their paid support offerings: each new LTS release includes more nagging about the updates you could be getting but aren't, and with the latest LTS (24.04) they're pretty aggressively pushing Snap instead of apt for package management. There's a limit to how much change that's different largely for the sake of being different I'm willing to be forced to learn just to have the stuff I use at home Just Work, especially in the age of AI slop overtaking all the good documentation and search results. Yes, I have Opinions about systemd (more on that below), but for ease-of-use and compatibility reasons I'd rather live with the enemy I know than Prove a Point by finding a Linux flavor that doesn't use it and dealing with always having the "weird" OS anytime I need to do or install something.

As much as possible* lives in Docker. I was already running Home Assistant and Minecraft servers (Java and Bedrock) in Docker on the existing system. Plex was previously running in a FreeBSD jail on the TrueNAS box, so I built a container to do that on the new box. Both hosts run Watchtower to keep the containers automagically up to date. For the most part, migration involved deploying the appropriate software or containers on the new system and then either copying over the directories from the old system or doing a backup/restore - Minecraft was the former; Home Assistant and Unifi were the latter. Plex appears to store most of the relevant account settings online now, so it was just as easy to log in to the clean install, re-add my libraries, and let it rescan everything.
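
For reference, the Watchtower piece is just a container that watches the Docker socket; a minimal compose sketch (the image is the official containrrr/watchtower, and the once-a-day interval is an arbitrary example, not necessarily what I'm running):

    # docker-compose.yml snippet: Watchtower checks for updated images daily and cleans up old ones
    services:
      watchtower:
        image: containrrr/watchtower
        restart: unless-stopped
        volumes:
          - /var/run/docker.sock:/var/run/docker.sock
        command: --cleanup --interval 86400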

*with a couple of exceptions: 

  • Unifi - conflicting info about whether their docker container is really supported/which one to use
  • Folding at Home - at least the FAH-GPU container has gone unsupported. No updates in multiple years means the newer projects' work units fail on account of missing/outdated libraries, especially for GPU work. Plus there's now a completely new major rev of FAH (v8) that has no current Docker container. Also, I'm starting to think that GPU + Docker isn't worth the hassle.
  • Plex - I thought I was going to have to run this outside of a container; see the issues below.

Issues 

Software RAID for OS

OpenZFS is pretty well supported in Linux, but the prevailing wisdom is that you don't really want to try to boot your OS off of it, hence the Linux software RAID. But Debian's installer doesn't know how to set up software RAID in a guided partitioning run, and you end up with it failing to write grub to the disk. I started with instructions that told me to disable EFI boot, which worked on the R330 because its OS disks don't hang off the PERC, so it didn't matter whether the installer could see anything behind it. The T340's OS disks are SAS, which do hang off the PERC, and I lost several days convinced I had some sort of hardware problem, because neither the installer nor an installed copy of Debian would see the disks behind it despite recognizing the PERC in logs, lspci, etc. It turns out that despite both cards being H330s running the same (most current) firmware, there's something about the one in the T340 that makes it not work unless you're booting EFI. Once I found the equivalent instructions for EFI, things worked pretty well, though I have noticed that the grub script that copies the current boot image to the second disk is a bit fragile, especially if the disks ever change designations.
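
For the curious, the fragile bit is a script along these general lines (a sketch of the idea, not the exact script I'm using; note that it hard-codes /dev/sdb1 as the second disk's EFI system partition, which is exactly the assumption that breaks when device names move around - more on that below):

    #!/bin/sh
    # Sketch: keep the spare disk's EFI system partition in sync with the live one
    set -e
    MNT=/mnt/esp-backup
    mkdir -p "$MNT"
    mount /dev/sdb1 "$MNT"                 # assumes the spare ESP is /dev/sdb1
    rsync -a --delete /boot/efi/ "$MNT"/
    umount "$MNT"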

SMB/CIFS

Between Windows defaults changing so that it tries to use your Microsoft login for network shares (which among other things is almost guaranteed to be too long or contain illegal characters for an actual local user account) and guest login being disabled, I fought with smb.conf for entirely too long before I had a working network share. There's lots of conflicting info about how to do this, and the config files that TrueNAS's web UI generates don't translate directly.

This was what I finally found that made guest access work, though note that there's also a command you have to run on the Windows side. I still don't entirely understand why, but force user/group nobody within each share's configuration stanza doesn't seem to work.
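
For anyone fighting the same thing, a sketch of the general shape of a guest-readable share (share name and path are placeholders), along with the Windows-side insecure-guest-logon toggle, which is the kind of command I mean:

    # /etc/samba/smb.conf (sketch)
    [global]
        map to guest = Bad User
        guest account = nobody

    [media]
        path = /tank/media
        browseable = yes
        read only = yes
        guest ok = yes

    # On the Windows client, from an elevated PowerShell:
    Set-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Services\LanmanWorkstation\Parameters" -Name AllowInsecureGuestAuth -Value 1 -Type DWord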

Plex/FAH GPU Funtimes

I originally tried to run Plex in Docker, including exposing the GPU to it for transcoding. Once I actually completed that, not only did hardware transcoding not show up as an option in Plex, it appears that Docker (or FAH) expects exclusive use of the GPU, because the GPU slot in Folding at Home, which is installed on the OS itself, failed and declared the GPU unavailable. As soon as I stopped Docker and restarted the FAH client, everything was fine.

My next try was to install Plex directly on the OS, thinking that maybe Docker was complicating things and the OS should be able to handle two programs sharing the GPU. This also didn't get me hardware transcoding support in Plex, and it broke FAH's GPU access in the same way. So what I ended up doing was disabling the nvidia-container-runtime in Docker, starting the Plex container back up that way, and uninstalling the OS installation of Plex. Longer term, I had to uninstall nvidia-container-runtime altogether and redeploy a vanilla Plex container, because something was still trying to steal FAH's access to the GPU - things would work, but then fail before the work unit could actually complete.
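
If you're untangling something similar, the useful checks are which process currently holds the GPU and whether Docker still registers the nvidia runtime, and the removal is roughly an apt purge (a sketch; exact package names can vary by how the toolkit was installed):

    # See which processes currently hold the GPU (only the FAH client should show up)
    nvidia-smi

    # Check whether Docker still registers the nvidia runtime
    docker info | grep -i runtime

    # Removing the runtime shim entirely, then restarting Docker
    apt purge nvidia-container-runtime
    systemctl restart docker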

There is still some sort of race condition in service startup ordering that results in FAH starting before the GPU drivers are fully initialized, so it doesn't find the GPU and I have to restart the service a few minutes after the system reboots.
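
I haven't fixed this properly yet, but the likely fix is a systemd override that orders or delays the FAH unit until the driver is ready; a sketch, assuming the client runs as a unit named fahclient.service (the actual name varies by FAH version and packaging):

    # systemctl edit fahclient.service
    # (creates /etc/systemd/system/fahclient.service.d/override.conf)
    [Unit]
    # order after the NVIDIA persistence daemon, if it's installed
    After=nvidia-persistenced.service

    [Service]
    # crude but effective: give the driver a little extra time to finish initializing
    ExecStartPre=/bin/sleep 30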

Kernel and Virtualization

I was chasing some other random failures with the GPU and noticed DMAR/IOMMU errors in the kernel logs, and one of the first search results implied this is an issue with the 6.x kernel that can result in things like filesystem corruption if you don't disable VT-d (direct device access for PCI passthrough virtualization). Since this was roughly coincident with a new crop of ZFS errors that appeared after I fixed the thermal issues below, and I'm not using that flavor of virtualization anyway, I figured I'd better disable it. Dell's BIOS on the T340 doesn't have a separate option for VT-d, only enable/disable for virtualization technology as a whole, but it does have a setting for x2APIC mode, which is only necessary on machines with a lot of CPUs. It's disabled by default but was enabled on my machine (I thought the BIOS had already been reset to defaults, but apparently not). I tried disabling just that, but the same errors showed up pretty quickly, so I ended up disabling virtualization in the BIOS entirely. Docker still works, and everything else seems happy with it disabled - several days of stability and no kernel messages.
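
For the record, if the BIOS hadn't had a knob for it, the same thing can also be done from the kernel command line by disabling the Intel IOMMU driver entirely; a sketch of the Debian-side change:

    # /etc/default/grub
    GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=off"

    # regenerate the grub config and reboot
    update-grub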

Diskname juggling

I have 10 disks in the T340, and for whatever reason, every time I reboot the thing they play three-card monte with their /dev/sdX names. That makes it kind of hard to write a sane smartd.conf, especially since 2 of those disks are SAS (aka SCSI) and the rest are SATA, and the test schedule is a little different for the 2 OS SSDs vs. the aging spinning rust. I ended up having to use /dev/disk/by-id. The grub copy script I'm using references a specific partition on each of two /dev/sdX devices, so I need to spend some time thinking about the best way to switch that over to by-id references too. In the meantime, every time I see a kernel or kernel module (ZFS, nvidia drivers) get an update, I check where the OS disks have migrated to and update the script with the device names du jour.
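
The likely fix for the script is to resolve the stable by-id names to whatever kernel names they have at the moment instead of hard-coding /dev/sdX; a sketch (the ID strings are placeholders for the real OS disks):

    # Resolve a stable by-id name to whatever /dev/sdX it happens to be today
    OS_DISK2=$(readlink -f /dev/disk/by-id/wwn-0xEXAMPLE)
    echo "second OS disk is currently ${OS_DISK2}"
    # partitions are also available directly as by-id symlinks, e.g. wwn-0xEXAMPLE-part1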

Logging

At the risk of this devolving into a rant: systemd continues to make its users suffer from its dev team's "not invented here" syndrome. Put another way, it continues to violate the principle of least astonishment for anyone at all familiar with UNIX/Linux from the last couple of decades, due to its insistence on taking well-understood, documented, functional, and modular things and (often poorly) re-implementing them inside the ever-increasing monolith that is systemd. My current sticking point is the transition from regular old text logs to binary files and journalctl, which I hadn't hit on other iterations of Debian (raspi, etc.), so this was my first exposure. I've mostly figured it out at this point, so it's probably not worth the hassle of reconfiguring everything to use regular syslog, but fail2ban breaks because it can't find the usual auth.log, and the devs don't seem inclined to add a case to handle this on install. So you have to add "sshd_backend = systemd" to the [DEFAULT] section of /etc/fail2ban/paths-debian.conf manually.
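
Concretely:

    # /etc/fail2ban/paths-debian.conf
    [DEFAULT]
    sshd_backend = systemd

    # then restart fail2ban to pick up the change
    systemctl restart fail2ban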

Thermal Management

The T340 is kind of a stupid design in a couple of key ways: 

First, it only has one fan for everything, unless you count the power supply fans. Pictures and more details here. There's no forced airflow past the card slots, and no real way to add any, so all they really get is the slight negative pressure the PSU fans create. The HDD cage I added to the 5.25" bays has a fan on the front, so there's a little air blowing toward the cards now, but it's pretty anemic.

Second, both full-length PCIe slots are adjacent, with an x4 and an x8 slot below them. The PERC H330 lives in the top slot, and I put my GPU in the second slot. The H330 runs kind of hot anyway, and it only has a passive heatsink for cooling. But the real issue is that in my setup, the PERC's bottom plate is way too close to the GPU's heat center, and I managed to trigger a thermal shutdown on the PERC, which is pretty much something you never, ever want to do unless you enjoy calming down several extremely angry file systems after they unceremoniously lose access to most or all of their disks at the same time.

I thought about swapping the GPU and PERC between slots, but that would just block the PERC heatsink's airflow with the GPU and cook both cards. The current workaround is a 70mm fan on top of the PERC's heatsink, plus a PCIe extension cable so that I can move the GPU elsewhere. Since I don't need the display outputs (the onboard graphics still work if I need to actually see a console), I don't actually need it in a slot and can put it pretty much wherever it fits, as ghetto as that's likely to be. For its part, the GPU doesn't seem to notice that it's not in direct airflow; it still reports the same mid-60s C temps under full load.
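
(For anyone curious, the quick way to keep an eye on that is nvidia-smi's query mode; the fields and flags below are standard, and the 60-second interval is arbitrary:)

    # Report GPU temperature, load, and power draw once a minute
    nvidia-smi --query-gpu=temperature.gpu,utilization.gpu,power.draw --format=csv -l 60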

Also worth noting: because this sits on a shelf about 1.7m off the floor, adjacent to where the other server, my switch, and the UPS all hang in a vertical rack, the intake air is a little too warm. A small fan blowing cool air off the ground toward the front of the chassis dropped the intake temperature about 5-6 degrees C (from 35 to 29).

To Do List

I have achieved minimum viable product after what ended up being about 2 weeks of work. There are still a few near-term quality-of-life improvements I need to work on.

  • Backups - on the old system, a set of cron jobs backed up/zipped/copied important files from the 1RU system to a CIFS mount so they lived somewhere else too. On this setup it can probably just be ZFS snapshots that I push around (see the sketch after this list), but I have to figure all of that out. I also need to decide what to do about offsite backup for the stuff that isn't already covered in other ways, now that snapshots give me a more straightforward way to do that.
  • Login MOTD - Ubuntu has a pretty nice set of info it presents on login, including system load, updates pending, whether a restart is needed, etc, and I'd like to replicate that on my Debian boxes. Looks like this covers it.
  • Unattended upgrades - apply the security updates that are pending without me remembering to log in periodically to do it. 
  • Alerts - I'm doing SMART monitoring and ZFS scrubs and such, but right now it's all very much like Milhouse watching Bart's factory: "I saw the whole thing. First it started falling over, then it fell over." I need to give it a way to demand attention when monitoring detects a problem, rather than depending on me logging in to check a half-dozen things every so often.
  • Monitoring - one of the things I do miss about TrueNAS's nice web UI is the pretty pre-built charts for most things - network I/O, temps, memory, CPU, storage performance - which worked well for seeing how things behave normally and troubleshooting when they don't. Rolling my own means right now I have... CLI, and CLI, and also some CLI. Probably this means InfluxDB or Prometheus and Grafana, or maybe Netdata. It'd be nice to integrate it into my existing Home Assistant instance, but so far most of what I've found for watching hardware in Home Assistant assumes you want to monitor the hardware HA itself runs on when deployed as an appliance, not a couple of systems, one of which happens to be running an HA container.
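
On the backups point above, the rough shape of what I have in mind is snapshot-and-send between the two boxes; a sketch with made-up pool, dataset, snapshot, and host names (the incremental send assumes both sides already share the earlier snapshot):

    # On the R330: snapshot the dataset and push the delta to the T340
    zfs snapshot tank/containers@$(date +%Y%m%d)
    zfs send -i tank/containers@20250501 tank/containers@$(date +%Y%m%d) | \
        ssh t340 zfs receive -F backup/r330/containers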

References

Some of the stuff I found most helpful is linked inline, but here's some additional stuff that I used that I didn't already link elsewhere:

ZFS

For what appear to be primarily license-incompatibility reasons, you don't get ZFS on Linux without a bit of work. It's not difficult, but it's nice to have some good recipes. Also, I didn't completely understand ZFS when I initially set it up on the last machine, so I did some things wrong in a way that wasn't easy to fix without vacating and rebuilding the pool. This was an opportunity to do it more "right" this time, especially with regard to hierarchical datasets to enable easier snapshotting, so I needed some more background reading.
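
The short version of the recipe, as a sketch (the pool layout and dataset names are illustrative placeholders; the key detail is that the ZFS packages live in Debian's contrib component):

    # enable the contrib component in /etc/apt/sources.list, then:
    apt update
    apt install linux-headers-amd64 zfs-dkms zfsutils-linux

    # pool built from mirror vdevs, referenced by stable IDs
    zpool create tank \
        mirror /dev/disk/by-id/ata-DISK1 /dev/disk/by-id/ata-DISK2 \
        mirror /dev/disk/by-id/ata-DISK3 /dev/disk/by-id/ata-DISK4

    # hierarchical datasets so snapshots and backups can target sensible subtrees
    zfs create tank/media
    zfs create tank/media/video
    zfs create tank/backups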

LACP

I don't have any 10GbE on my switch yet, and while I could add a 2-port expansion card, it's probably overkill for my application right now. Since my T340 has 2 onboard and 2 card-based GbE ports, I did a 4x1GbE LAG.
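
On Debian with plain ifupdown, that comes down to the ifenslave package and a bond stanza in /etc/network/interfaces; a sketch with placeholder interface names and addressing (the switch side needs a matching 802.3ad LAG, and the member NICs may also need their own "inet manual" stanzas):

    # apt install ifenslave
    # /etc/network/interfaces (sketch)
    auto bond0
    iface bond0 inet static
        address 192.168.1.10/24
        gateway 192.168.1.1
        bond-slaves eno1 eno2 enp5s0f0 enp5s0f1
        bond-mode 802.3ad
        bond-miimon 100
        bond-xmit-hash-policy layer3+4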

Docker

I have created multiple datasets in my ZFS pools for ease of management and backups, and one of them is specifically for Docker, so I wanted to move Docker's storage from the OS drives to the right ZFS dataset. https://forums.docker.com/t/change-the-default-docker-storage-location/140455
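
That comes down to the data-root key in /etc/docker/daemon.json; a sketch, assuming the dataset is mounted at /tank/docker (the path is a placeholder):

    # /etc/docker/daemon.json
    {
      "data-root": "/tank/docker"
    }

    # move the existing data over and restart
    systemctl stop docker
    rsync -a /var/lib/docker/ /tank/docker/
    systemctl start docker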

SMART 

Needed something to base my config on so that SMART is monitoring everything and running periodic tests. https://dan.langille.org/2018/11/04/using-smartd-to-automatically-run-tests-on-your-drives/
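
What that ends up looking like is roughly this (a sketch; the device IDs are placeholders, the SAS SSDs get -d scsi, the SATA drives get -d sat, and the self-test schedules differ between the OS SSDs and the spinning rust):

    # /etc/smartd.conf (sketch)
    # SAS OS SSDs: short self-test nightly at 02:00
    /dev/disk/by-id/wwn-0xSSD1 -d scsi -a -s S/../.././02 -m root
    /dev/disk/by-id/wwn-0xSSD2 -d scsi -a -s S/../.././02 -m root
    # SATA spinning rust: short test nightly at 03:00, long test Saturdays at 04:00
    /dev/disk/by-id/ata-HDD1 -d sat -a -s (S/../.././03|L/../../6/04) -m root
    /dev/disk/by-id/ata-HDD2 -d sat -a -s (S/../.././03|L/../../6/04) -m root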

Sunday, January 26, 2025

Tesla Relationship Status: It's complicated!

In late 2020, I bought a Tesla Model Y because at the time it was the best combination of things to meet my needs in a vehicle that got halfway decent fuel economy. That whole discussion occurs here. I'd say I had 2-3 years of very few regrets about that purchase. But we are now midway through year 5, and, well, I probably should have sold it about 2.5 years ago when they were having supply problems and people were paying nearly new prices for lightly-used models.

The status report at 65K miles:

  • I'm on my third set of tires. The first set lasted almost 40K, the second wore out in less than 20K, and addressing that cost me another $4k on top of a new set of tires, because there are apparently some design flaws in the Model Y's suspension that wear out the bushings (the gutters for the windshield drip water directly onto the control arm bushings), making them loose and squeaky, and even after they're replaced with OEM parts it's almost impossible to get the camber settings back in spec. My mechanic thinks this is because as other parts of the suspension wear, the overall geometry changes just enough that there isn't enough adjustment range in the control arms, and the excessive negative camber eats up the inner edge of the tires. So after a bunch of research, he fitted a set of aftermarket control arms with more adjustment range.
  • It's now out of warranty. Drivetrain is still covered for a while yet, but all the ancillaries are not. 
  • My indicated range at a full charge has dropped from 315 miles when new to about 285, which I think is a combination of Tesla revising the calculations behind the projection and actual capacity loss. 10% in 5 years isn't terrible, but it's not great either.
  • Self-driving still isn't, but it is useful if you operate within the significant limitations of the system.

But that's not why I'm writing this post. I need to address the Elonphant in the room, or more accurately, the n4zi at the bar. I hadn't realized until recently what a nice thing it is that we barely know the names of the CEOs of the overwhelming majority of major automakers, and if we know anything about their controversial opinions, it might be something along the lines of whether they should still make cars with manual transmissions, where they stand on CAFE and the transition away from fossil fuels, or maybe some light thoughts about union-busting. But I know way more about the shitty opinions of Tesla's CEO than I care to, which unfortunately means I have to reconcile those against owning one of their products and what that potentially says about me. In hindsight, I was never really a True Believer in his cult of personality, but watching his unmasking - from generic juvenile edgelord, to anti-trans bigot, to radical right-wing ideologue actively trying to swing the results of the election to benefit himself and his fellow oligarchs, to actual n4zi - means my red line of "I don't want to be associated with anything that benefits him" was crossed probably 2 years ago, and it just keeps getting harder to ignore.

My problem comes when looking at what I can actually do about this. The obvious bit is not to support any of the companies he's associated with - vote with your wallet. That really only works with Twitter and Tesla, because I don't personally have much influence on my tax dollars going to SpaceX, and this unfortunately doesn't seem to be a problem for most of the folks that do. But I'm already pretty much there - I bought the Model Y rather than leased it, my low-interest financing is not through Tesla, and I have a bit more than a year left until it's paid for, so they've already gotten their money from that. I'm paying for the data connectivity, and supercharging when I'm on trips. Neither of those are more than rounding errors in terms of the overall money I've handed Tesla, and I only use actual Tesla service when there's no other option. Similarly, for many reasons I wouldn't replace it with another Tesla if I had to replace it tomorrow. 

But it's a lot more complicated when thinking about whether I should get rid of the one I currently own on principle, and it's something I've gone back and forth with myself about for a good portion of at least the last 18 months, with the last week bringing the angst back to the surface in a major way. The car is depreciating like crazy, both because lots of other people are unloading theirs for much better options and because the guy in charge has made the brand completely toxic. Looking at the values right now, which likely don't yet reflect the significantly heil hellish last week or so, I could still trade it in for more than I owe on it, but that would mean taking on either a lease or, even if I buy something used, a car payment for another several years, which seems like a bad plan with a kid about to enter college and the new admin speedrunning piloting the economy into the side of a mountain. Selling it also doesn't send a message to Tesla in any real way. So it comes down to what random people on the internet with Opinions on such things, or people who don't know me well enough to know where I stand, think of me while I continue driving it. I don't think either of those rates highly enough to make me do something different here, other than putting a few stickers on the car that make it a little clearer what I think. I wish I had the luxury of "f--- you money" where I didn't have to be pragmatic about this, especially because the Walter Mitty in me has some ideas for ways to make a clearer statement about the whole situation, but instead I guess I'll just admit this is a pretty privileged problem to have in the first place and conclude my whining on the internet about it.