Bitten by the SMR bug

Monday June 22 2020
linux virtualization

On the evening of Sunday June 21st my home network was working fine. By the morning of June 22nd, things seemed to be acting strangely. Local network traffic was working well enough and internet connectivity was present. But sometimes the network seemed to pause for a few seconds and then catch up.

I went through the usual things:

Reset the cable modem
Checked network cabling for signs of damage
Reset local network hardware like switches
Temporarily removed my ethernet surge protectors
Plugged my laptop directly into the cable modem to confirm the problem disappeared

So at this point I had not figured anything out. But I was able to rule out most of the network hardware.

My home internet connection goes through a virtual machine I set up using the virsh utility. Both the host & guest are running Ubuntu 18.04 with the Hardware Enablement stack. The virtual machine has multiple ethernet interfaces shared to it from the host. One is connected to the local network, the other one to the cable modem. This lets me do things like use iptables rules to provide NAT between my local network and the internet. I also use the tc utility to set up traffic queueing classes that attempt to provide an equal level of service to all devices on my local network.

I ran a ping from the host machine to the local IP of the VM. This utility just sends a packet to the target machine and requests a reply. This allows lots of simple metrics to be gathered, like the latency. The typical latency on a wired home network is less than 1 millisecond.

$ ping -i 0.22 192.168.12.90
64 bytes from 192.168.12.90: icmp_seq=345 ttl=64 time=0.253 ms
64 bytes from 192.168.12.90: icmp_seq=346 ttl=64 time=0.361 ms
64 bytes from 192.168.12.90: icmp_seq=347 ttl=64 time=0.360 ms
64 bytes from 192.168.12.90: icmp_seq=348 ttl=64 time=1374 ms
64 bytes from 192.168.12.90: icmp_seq=349 ttl=64 time=1151 ms
64 bytes from 192.168.12.90: icmp_seq=350 ttl=64 time=928 ms
64 bytes from 192.168.12.90: icmp_seq=351 ttl=64 time=704 ms
64 bytes from 192.168.12.90: icmp_seq=352 ttl=64 time=480 ms
64 bytes from 192.168.12.90: icmp_seq=353 ttl=64 time=257 ms
64 bytes from 192.168.12.90: icmp_seq=354 ttl=64 time=34.8 ms
64 bytes from 192.168.12.90: icmp_seq=355 ttl=64 time=0.285 ms
64 bytes from 192.168.12.90: icmp_seq=356 ttl=64 time=0.351 ms
64 bytes from 192.168.12.90: icmp_seq=357 ttl=64 time=0.400 ms
64 bytes from 192.168.12.90: icmp_seq=358 ttl=64 time=0.341 ms

From this I could see that some ping requests were answered in the usual time. But some took up to 1.374 seconds to come back. This latency was consistent with what I saw when trying to browse the internet. Most page loads were instant, but some seemed to be briefly interrupted before returning to normal. The actual bandwidth available was still the typical 100+ Mbps download speeds provided by my ISP.
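Spikes like these are easy to miss in a scrolling terminal. A minimal sketch of a filter that prints only the slow replies, assuming ping output in the format shown above; the function name slow_pings and the 100 ms default threshold are made up for illustration:

```shell
#!/bin/sh
# slow_pings: print only the ping replies slower than a threshold (in ms).
# The function name and the 100 ms default are illustrative assumptions.
slow_pings() {
    awk -v limit="${1:-100}" '
        # Reply lines look like: ... icmp_seq=348 ttl=64 time=1374 ms
        match($0, /time=[0-9.]+ ms/) {
            # Skip the 5 characters of "time=" and drop the trailing " ms".
            ms = substr($0, RSTART + 5, RLENGTH - 8) + 0
            if (ms > limit) print
        }'
}

# Usage: ping -i 0.22 192.168.12.90 | slow_pings 100
```

Fed the transcript above, this would keep the replies from icmp_seq=348 through icmp_seq=353 and drop the rest.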

At this point I was sort of stumped because the pause simply showed up randomly. I ended up running iotop in the virtual machine. This utility shows which processes are performing the most IO operations, like writing to the hard drive. This VM has basically no IO activity. The most common thing writing to the disk is the journald process which is part of systemd. I used the highly advanced troubleshooting technique of having one terminal displaying the ping command and the other one running iotop at the same time. I eventually decided that the IO activity by journald happened at the same time as the "pause". But of course there is no way journald is actually causing a problem with my internet connection or the local network.

To double check my theory I ran dd if=/dev/zero of=./fstest bs=4096 count=25000 in the guest VM. Basically all this does is write a bunch of zeroes into a file. Every time I ran this, the "pause" would happen. When I did the opposite and read the file using cat ./fstest > /dev/null, the problem did not reproduce. So the "pause" seemed to happen any time data was written to the filesystem. It appeared that writing to the disk just caused the VM to temporarily freeze.
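The write-then-read comparison can be wrapped up so it is easy to repeat while ping runs in another terminal. This is only a sketch: the function name write_then_read is hypothetical, and conv=fsync (a GNU dd option that flushes the file to disk before dd exits) is an addition that was not part of the command above.

```shell
#!/bin/sh
# write_then_read: write COUNT blocks of 4096 zero bytes to FILE, then read
# the file back. Name and defaults are illustrative assumptions.
write_then_read() {
    file="${1:-./fstest}"
    count="${2:-25000}"
    # conv=fsync is an addition here, so the write has reached the disk
    # by the time dd returns.
    dd if=/dev/zero of="$file" bs=4096 count="$count" conv=fsync 2>/dev/null || return 1
    bytes=$(wc -c < "$file")
    echo "wrote $((bytes)) bytes"
    # Read phase: the equivalent of cat ./fstest > /dev/null
    cat "$file" > /dev/null || return 1
    echo "read back ok"
}

# Usage: write_then_read ./fstest 25000
```

If the pause lines up with the write phase and never with the read phase, that points at writes specifically rather than disk access in general.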

Since I'm running a VM, the VM doesn't have a real disk attached to it. The virtual disk is configured in the XML like this

<disk type='block' device='disk'>
  <driver name='qemu' type='raw' cache='none' io='native'/>
  <source dev='/dev/corona/gateway_ii'/>
  <backingStore/>
  <target dev='vda' bus='virtio'/>
  <alias name='virtio-disk0'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x09' function='0x0'/>
</disk>

This exposes the host block device /dev/corona/gateway_ii to the virtual machine as a hard drive. However, even /dev/corona/gateway_ii is not a real device. It is built using LVM from multiple physical devices. You can run the lsblk command on the host to see the makeup of those devices. I've trimmed the output of the command down to just those relevant to the device used by my VM

$ lsblk
NAME                               MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                                  8:0    0   7.3T  0 disk 
├─corona-gateway_ii_rmeta_1        253:7    0     4M  0 lvm  
│ └─corona-gateway_ii              253:9    0    30G  0 lvm  
├─corona-gateway_ii_rimage_1       253:8    0    30G  0 lvm  
│ └─corona-gateway_ii              253:9    0    30G  0 lvm  
sdb                                  8:16   0   7.3T  0 disk 
├─corona-gateway_ii_rmeta_0        253:5    0     4M  0 lvm  
│ └─corona-gateway_ii              253:9    0    30G  0 lvm  
├─corona-gateway_ii_rimage_0       253:6    0    30G  0 lvm  
│ └─corona-gateway_ii              253:9    0    30G  0 lvm

So by looking at the output of lsblk we can see that the physical devices sda and sdb actually hold /dev/corona/gateway_ii. Multiple drives are used because this volume is created as a mirror in LVM, so there are at least two copies of the data at any point in time.
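On a host with many volumes, picking the backing disks out of the full lsblk tree by eye gets tedious. A rough sketch of doing it with awk, assuming the default tree-style output shown above; the function name disks_behind is made up for illustration:

```shell
#!/bin/sh
# disks_behind: read `lsblk` tree output on stdin and print each top-level
# disk that has the named device-mapper volume somewhere beneath it.
# The function name is an illustrative assumption.
disks_behind() {
    awk -v target="$1" '
        # Top-level entries start in column one with a plain name (sda, sdb).
        /^[a-z]/ { disk = $1; next }
        {
            # Child rows carry tree-drawing characters before the name, so
            # compare each field with any leading non-alphanumerics removed.
            for (i = 1; i <= NF; i++) {
                name = $i
                sub(/^[^A-Za-z0-9]+/, "", name)
                if (name == target && !(disk in seen)) {
                    seen[disk] = 1
                    print disk
                }
            }
        }'
}

# Usage: lsblk | disks_behind corona-gateway_ii
```

lsblk also has an -s (inverse) option that prints the dependency tree starting from a given device, which may be simpler when only one volume is of interest.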

A failing hard drive could cause a performance issue. This host machine actually has four mechanical hard drives and one solid state drive. I'm not a stranger to troubleshooting hard drives, so I went through my usual troubleshooting steps

Ran smartctl -t short /dev/sda to perform a self-test for each drive. They all passed the test.
Checked dmesg for any information about communication problems with a drive. There were none.
Used dd if=/dev/sda of=/dev/null to read random sections of the disks.
Powered off the machine, reseated all cables.

All this produced nothing useful. The problem persisted and I still did not know why. Obviously I would need to learn new troubleshooting techniques.

After searching around I found out there is a utility ioping written by Konstantin Khlebnikov. This is similar to the ping utility, but instead of measuring network traffic it just measures how long it takes to read a random 4KB section of a block device.

I started four copies of ioping running on the host machine monitoring /dev/sda, /dev/sdb, /dev/sdc, /dev/sdd respectively. These are my four mechanical drives that are all managed with LVM. I'm not exactly expecting impressive numbers, but reading a random spot on a hard drive usually takes less than 50 milliseconds.

For /dev/sdc & /dev/sdd I saw output like this

$ ioping /dev/sdc
4 KiB <<< /dev/sdc (block device 3.64 TiB): request=1 time=24.1 ms (warmup)
4 KiB <<< /dev/sdc (block device 3.64 TiB): request=2 time=12.3 ms
4 KiB <<< /dev/sdc (block device 3.64 TiB): request=3 time=12.0 ms
4 KiB <<< /dev/sdc (block device 3.64 TiB): request=4 time=22.4 ms
4 KiB <<< /dev/sdc (block device 3.64 TiB): request=5 time=18.5 ms
4 KiB <<< /dev/sdc (block device 3.64 TiB): request=6 time=16.5 ms

So a random read takes about 20 milliseconds. That is fast enough to be considered normal by me. Now let's look at the same output for /dev/sda

$ ioping /dev/sda
4 KiB <<< /dev/sda (block device 7.28 TiB): request=27 time=262.4 ms
4 KiB <<< /dev/sda (block device 7.28 TiB): request=28 time=1.01 s (slow)
4 KiB <<< /dev/sda (block device 7.28 TiB): request=29 time=63.0 ms (fast)
4 KiB <<< /dev/sda (block device 7.28 TiB): request=30 time=28.8 ms (fast)
4 KiB <<< /dev/sda (block device 7.28 TiB): request=31 time=23.8 ms (fast)
4 KiB <<< /dev/sda (block device 7.28 TiB): request=32 time=68.3 ms
4 KiB <<< /dev/sda (block device 7.28 TiB): request=33 time=217.0 ms
4 KiB <<< /dev/sda (block device 7.28 TiB): request=34 time=68.9 ms
4 KiB <<< /dev/sda (block device 7.28 TiB): request=35 time=353.3 ms

Wow! Yes, you're reading that correctly. Sometimes a random read takes more than 1 second on that drive. So this drive obviously has an issue! In fact, I graphed the latency just to see how bad it is.

This plot shows the latency over time of three of the hard drives

The plot above shows that the latency of /dev/sda and /dev/sdb is commonly 100x that of the other drive /dev/sdc. The purple line down at the very bottom is the latency of /dev/sdc, typically under 50 milliseconds. If just reading from the drive commonly takes more than 100 milliseconds, I could believe that the VM is suffering a huge performance impact when it has to write to one of the affected drives.
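The raw ioping output can also be boiled down to a one-line summary without plotting anything. A minimal sketch, assuming request lines in the format shown earlier and handling only the us, ms and s units; the function name slow_reads and the 100 ms default threshold are illustrative choices:

```shell
#!/bin/sh
# slow_reads: read ioping output on stdin, convert each request time to
# milliseconds, and report how many requests exceeded a threshold.
# The function name and the 100 ms default are illustrative assumptions.
slow_reads() {
    awk -v limit="${1:-100}" '
        match($0, /time=[0-9.]+ (us|ms|s)/) {
            # t[1] is the number, t[2] is the unit.
            split(substr($0, RSTART + 5, RLENGTH - 5), t, " ")
            ms = t[1] + 0
            if (t[2] == "us") ms = ms / 1000
            if (t[2] == "s")  ms = ms * 1000
            total++
            if (ms > limit) slow++
            if (ms > max) max = ms
        }
        END { printf "%d of %d requests over %d ms, worst %.1f ms\n", slow, total, limit, max }'
}

# Usage: ioping -c 100 /dev/sda | slow_reads 100
```

Run against the nine /dev/sda requests listed earlier, four of them are over 100 ms and the worst is the 1.01 s read.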

This is where things get really interesting. The problem basically just went away suddenly. The output of ioping for all drives went to a normal value of less than 20 milliseconds. I left it running for a while and it seemed OK. The "pauses" on the internet connection stopped. Given that I had changed nothing, I was not satisfied with this lack of any concrete explanation.

I googled the model number of one of the drives: ST8000DM004. At that point, I became very dismayed. The top result on Google is this post by Blocks & Files explaining that SMR has been sneaking into Seagate's consumer products. Earlier in the year I had upgraded all of the mechanical drives in my host machine and was quite surprised just how cheap storage had become. Apparently that is a double-edged sword. I had read all over the news that Western Digital had been doing this but I mistakenly assumed I was in the clear since I only use Seagate drives. Blocks & Files of course has a great write up about the Western Digital drives as well. There is even some discussion on the smartmontools issue tracker about this problem.

So I have SMR drives, how could this cause this problem? Conventional drives are apparently "CMR" (conventional magnetic recording) drives. SMR stands for "shingled magnetic recording". I'm not going to go into a technical deep dive about how magnetic recording works. But apparently SMR has a huge data density advantage over CMR drives. SMR is only really good for workloads that write infrequently. Cherry picking some of Western Digital's comments about their SMR products:

... workloads tend to be bursty in nature, leaving sufficient idle time for garbage collection and other maintenance operations.

In other words, these drives cannot have sustained write-heavy workloads. I have a bunch of VMs and other tasks that are all writing to these drives so I guess I had just gotten lucky up to this point. This would explain why the problem disappeared on its own. The drive needed time to shuffle things around internally before performance could get back to normal.

In order to confirm this, I needed to cause the problem to happen again. The same physical drives are used to create another huge device with LVM that is then mounted as an ext4 filesystem in the host machine. Using some shell scripts I created tons of medium sized files all over the filesystem. The latency reported by ioping for the drives /dev/sda and /dev/sdb went up immediately. The "pause" came back. So I am confident this is a problem caused by the fact that these drives use SMR technology.
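A write load of this kind takes only a few lines of shell to generate. The sketch below is not the original script: the function name make_files, the ten-subdirectory layout, and the default count and file size are all assumptions chosen for illustration.

```shell
#!/bin/sh
# make_files: create COUNT files of SIZE_MB mebibytes each, spread across
# ten subdirectories of DIR. Not the original script; the name, layout and
# defaults are illustrative assumptions.
make_files() {
    dir="$1"
    count="${2:-1000}"
    size_mb="${3:-16}"
    i=0
    while [ "$i" -lt "$count" ]; do
        sub="$dir/d$((i % 10))"
        mkdir -p "$sub" || return 1
        # bs=1048576 is 1 MiB, so count gives the file size in MiB.
        dd if=/dev/zero of="$sub/file_$i" bs=1048576 count="$size_mb" 2>/dev/null || return 1
        i=$((i + 1))
    done
    # Flush everything to disk so the drives actually see the writes.
    sync
}

# Usage (hypothetical mount point): make_files /mnt/bigvolume/smrtest 1000 16
```

With the defaults this writes roughly 16 GiB, so check free space first, and keep ioping running against the underlying drives in another terminal to watch the latency respond.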

I wish I had known these drives were SMR before I purchased them. But I guess I am stuck with them for now. I probably have degraded these drives by using them to host the VM images. I also frequently have used them to record log files from different tests I have performed. My temporary fix for this was very simple: I migrated the virtual machines' disks over to the SSD in the host machine. I plan to purchase some additional non-SMR drives to host the VM drive images.