SAS counters on Linux
There isn’t a lot of documentation on debugging SAS-specific storage problems under Linux, unfortunately. I’ve had a couple of issues recently that were tied to flaky SAS cables, and I mostly had to debug them by swapping cables and reasoning through which device was causing problems. Along the way I’ve learned a few helpful things.
First, Linux actually has counters for tracking SAS-level problems. They’re just really hard to find, and very few tools expose them. The kernel would love to tell you that the connection between your SAS expander and drive 23 is flaky, but it’s not going to volunteer that information. You need to go looking for it.
There are two packages that I’ve found essential for understanding how SAS devices are connected in a system. The first is sasutils, and the other is lsscsi. Unfortunately, as we’ll see in a minute, neither is actually sufficient.
Of the two, sasutils is probably more useful. It can show you your SAS topology (sas_discover), your devices (sas_devices), and most of the low-level kernel counters in a format suitable for Graphite via sas_counters. Unfortunately, I haven’t found sas_counters to be all that useful, for a couple of reasons. First, I don’t use Graphite for monitoring, so the information isn’t presented in the best format for me to use. But the bigger issue is that it doesn’t provide enough context: it’ll tell you which links have problems, but mapping those to actual devices is a manual process. I mean, can you tell which drives this is talking about?
$ sas_counters
...
sasutils.sas_counters.fs2.SAS9300-8e.0x500605b00c20d550.SAS2X36.expander_0x500304800000007f.phys.11.invalid_dword_count 0 1654394146
sasutils.sas_counters.fs2.SAS9300-8e.0x500605b00c20d550.SAS2X36.expander_0x500304800000007f.phys.11.loss_of_dword_sync_count 0 1654394146
sasutils.sas_counters.fs2.SAS9300-8e.0x500605b00c20d550.SAS2X36.expander_0x500304800000007f.phys.11.phy_reset_problem_count 0 1654394146
sasutils.sas_counters.fs2.SAS9300-8e.0x500605b00c20d550.SAS2X36.expander_0x500304800000007f.phys.11.running_disparity_error_count 0 1654394146
sasutils.sas_counters.fs2.SAS9300-8e.0x500605b00c20d550.SAS2X36.expander_0x500304800000007f.phys.8.invalid_dword_count 0 1654394146
sasutils.sas_counters.fs2.SAS9300-8e.0x500605b00c20d550.SAS2X36.expander_0x500304800000007f.phys.8.loss_of_dword_sync_count 0 1654394146
sasutils.sas_counters.fs2.SAS9300-8e.0x500605b00c20d550.SAS2X36.expander_0x500304800000007f.phys.8.phy_reset_problem_count 0 1654394146
sasutils.sas_counters.fs2.SAS9300-8e.0x500605b00c20d550.SAS2X36.expander_0x500304800000007f.phys.8.running_disparity_error_count 0 1654394146
...
I can tell which SAS card (SAS9300-8e, with address 0x500605b00c20d550) and which SAS expander, and even which physical link (SAS PHY) the disk is connected to, but I don’t know anything about the disk itself. Or whether this even is a disk: the links between hosts and expanders look identical at this level.
Actually figuring this out turns out to be a bit of an adventure.
In this case, sas_discover --addr -vvvvv | less can help a little bit. Searching for the SAS address provided (0x500304800000007f) will tell you that that’s expander-0:0. Which presumably means that we’re talking about SAS phy-0:0:11 and phy-0:0:8, but there’s no good way to turn that into a /dev/sdX drive using the information here.
You could go run lsscsi, and it’ll tell you that device 0:0:11 is /dev/sdn.
$ lsscsi
[0:0:0:0] disk HGST HUH721008AL5204 NE00 /dev/sdc
[0:0:1:0] disk HGST HUH721008AL5204 NE00 /dev/sdd
[0:0:2:0] disk HGST HUH721008AL5204 NE00 /dev/sde
[0:0:3:0] disk HGST HUH721008AL5204 NE00 /dev/sdf
[0:0:4:0] disk HGST HUH721008AL5204 NE00 /dev/sdg
[0:0:5:0] disk HGST H7280A520SUN8.0T PD51 /dev/sdh
[0:0:6:0] disk HGST HUH721008AL5204 NE00 /dev/sdi
[0:0:7:0] disk HGST H7280A520SUN8.0T PD51 /dev/sdj
[0:0:8:0] disk HGST H7280A520SUN8.0T PD51 /dev/sdk
[0:0:9:0] disk HGST H7280A520SUN8.0T PD51 /dev/sdl
[0:0:10:0] disk HGST H7280A520SUN8.0T PD51 /dev/sdm
[0:0:11:0] disk HGST H7280A520SUN8.0T PD51 /dev/sdn
[0:0:12:0] disk HGST H7280A520SUN8.0T PD51 /dev/sdo
Not too hard, right?
Yeah, except that’s wrong.
What lsscsi is telling you is that SAS end_device-0:0:11 is /dev/sdn. But phy-0:0:11 doesn’t necessarily map to end_device-0:0:11.
To get this right, you actually have to dig into Linux’s /sys filesystem by hand. We know that we’re interested in phy-0:0:11, so we can look in /sys/class/sas_phy/phy-0:0:11:
$ ls -l /sys/class/sas_phy/phy-0:0:11/
total 0
lrwxrwxrwx 1 root root 0 May 29 20:59 device -> ../../../phy-0:0:11
-r--r--r-- 1 root root 4096 May 29 20:59 device_type
-rw-r--r-- 1 root root 4096 May 29 20:59 enable
--w------- 1 root root 4096 May 29 20:59 hard_reset
-r--r--r-- 1 root root 4096 May 29 20:59 initiator_port_protocols
-r--r--r-- 1 root root 4096 May 29 20:59 invalid_dword_count
--w------- 1 root root 4096 May 29 20:59 link_reset
-r--r--r-- 1 root root 4096 May 29 20:59 loss_of_dword_sync_count
-rw-r--r-- 1 root root 4096 May 29 20:59 maximum_linkrate
-r--r--r-- 1 root root 4096 May 29 20:59 maximum_linkrate_hw
-rw-r--r-- 1 root root 4096 May 29 20:59 minimum_linkrate
-r--r--r-- 1 root root 4096 May 29 20:59 minimum_linkrate_hw
-r--r--r-- 1 root root 4096 May 29 20:59 negotiated_linkrate
-r--r--r-- 1 root root 4096 May 29 20:59 phy_identifier
-r--r--r-- 1 root root 4096 May 29 20:59 phy_reset_problem_count
drwxr-xr-x 2 root root 0 May 29 20:59 power
-r--r--r-- 1 root root 4096 May 29 20:59 running_disparity_error_count
-r--r--r-- 1 root root 4096 May 29 20:59 sas_address
lrwxrwxrwx 1 root root 0 May 29 15:49 subsystem -> ../../../../../../../../../../class/sas_phy
-r--r--r-- 1 root root 4096 May 29 20:59 target_port_protocols
-rw-r--r-- 1 root root 4096 May 29 15:48 uevent
That’s sort of useful; you can see the interesting error counters there (all of the *_count files) and negotiated_linkrate, which will tell you how fast this SAS link is. This is where sas_counters gets its data.
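Since these counters are just files, you don’t need any special tooling to read them; a few lines of shell will do. Here’s a quick sketch of my own (the function name is made up, not part of sasutils), pointed at the PHY from the example above:

```shell
# Dump every error counter (plus the negotiated link rate) for one PHY.
# Takes the sysfs PHY directory, e.g. /sys/class/sas_phy/phy-0:0:11.
phy_counters() {
    for f in "$1"/*_count "$1"/negotiated_linkrate; do
        if [ -r "$f" ]; then
            printf '%s: %s\n' "${f##*/}" "$(cat "$f")"
        fi
    done
}

phy_counters /sys/class/sas_phy/phy-0:0:11
```

Taking the directory as an argument makes it trivial to loop over every PHY in /sys/class/sas_phy/.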
To figure out where this PHY goes, you have to look in the device/ subdirectory:
$ ls -l /sys/class/sas_phy/phy-0:0:11/device/
total 0
lrwxrwxrwx 1 root root 0 May 31 10:43 port -> ../port-0:0:3
drwxr-xr-x 2 root root 0 May 29 21:00 power
drwxr-xr-x 3 root root 0 May 29 15:48 sas_phy
-rw-r--r-- 1 root root 4096 May 29 15:49 uevent
That tells us that this PHY is part of SAS port port-0:0:3. Ports are a logical connection between two SAS devices, and contain one or more PHYs. A 4-lane link between a SAS host and a SAS expander would be a single port but contain 4 PHYs (and hence 4 negotiated speeds, and 4 sets of error counters). We can follow the port link to see…
$ ls -l /sys/class/sas_phy/phy-0:0:11/device/port/
total 0
drwxr-xr-x 7 root root 0 May 29 15:48 end_device-0:0:3
lrwxrwxrwx 1 root root 0 May 29 20:59 phy-0:0:11 -> ../phy-0:0:11
drwxr-xr-x 2 root root 0 May 29 21:00 power
drwxr-xr-x 3 root root 0 May 29 15:48 sas_port
-rw-r--r-- 1 root root 4096 May 29 15:49 uevent
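As an aside, those per-port phy-* symlinks are also how you can count lane widths. Assuming /sys/class/sas_port mirrors the layout we just saw (each port’s device/ directory holding one symlink per member PHY; that layout is my assumption here), a sketch like this tallies them:

```shell
# Count the PHYs aggregated into each SAS port: a wide host<->expander
# link should report 4 (or 8), a single drive just 1. Assumes each
# port's device/ directory holds one phy-* symlink per lane.
# $1 optionally overrides /sys (handy for testing against a fake tree).
port_widths() {
    for port in "${1:-/sys}"/class/sas_port/port-*; do
        [ -d "$port" ] || continue
        printf '%s: %s PHYs\n' "${port##*/}" \
            "$(ls -d "$port"/device/phy-* 2>/dev/null | wc -l)"
    done
}

port_widths
```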
Hey, look: phy-0:0:11 actually goes to end_device-0:0:3, not end_device-0:0:11. But which block device is that? We could go back to lsscsi, or we could just fetch it directly from /sys. Fortunately, it’s sitting right here in a convenient place:
$ ls -l /sys/class/sas_phy/phy-0:0:11/device/port/end_device-0:0:3/
total 0
drwxr-xr-x 3 root root 0 May 29 15:48 bsg
drwxr-xr-x 2 root root 0 May 29 21:00 power
drwxr-xr-x 3 root root 0 May 29 15:48 sas_device
drwxr-xr-x 3 root root 0 May 29 15:48 sas_end_device
drwxr-xr-x 4 root root 0 May 29 15:48 target0:0:3
-rw-r--r-- 1 root root 4096 May 29 15:49 uevent
Uhm, not there yet. Keep going; follow the target0:0:3 directory to find…
$ ls -l /sys/class/sas_phy/phy-0:0:11/device/port/end_device-0:0:3/target0:0:3
total 0
drwxr-xr-x 8 root root 0 May 29 15:48 0:0:3:0
drwxr-xr-x 2 root root 0 May 29 21:00 power
lrwxrwxrwx 1 root root 0 May 29 15:49 subsystem -> ../../../../../../../../../../bus/scsi
-rw-r--r-- 1 root root 4096 May 29 15:48 uevent
Keep going; look in 0:0:3:0:
$ ls -l /sys/class/sas_phy/phy-0:0:11/device/port/end_device-0:0:3/target0:0:3/0:0:3:0
total 0
-r--r--r-- 1 root root 4096 May 29 20:59 blacklist
drwxr-xr-x 3 root root 0 May 29 15:48 block
drwxr-xr-x 3 root root 0 May 29 15:48 bsg
--w------- 1 root root 4096 May 29 20:59 delete
-r--r--r-- 1 root root 4096 May 29 20:59 device_blocked
-r--r--r-- 1 root root 4096 May 29 20:59 device_busy
-rw-r--r-- 1 root root 4096 May 29 20:59 dh_state
lrwxrwxrwx 1 root root 0 May 29 15:49 driver -> ../../../../../../../../../../../bus/scsi/drivers/sd
-rw-r--r-- 1 root root 4096 May 29 20:59 eh_timeout
lrwxrwxrwx 1 root root 0 May 29 15:49 'enclosure_device:Slot 04' -> '../../../../port-0:0:13/end_device-0:0:13/target0:0:13/0:0:13:0/enclosure/0:0:13:0/Slot 04'
-r--r--r-- 1 root root 4096 May 29 20:59 evt_capacity_change_reported
-r--r--r-- 1 root root 4096 May 29 20:59 evt_inquiry_change_reported
...
Getting closer; the answer is hiding in the block/ subdirectory:
$ ls -l /sys/class/sas_phy/phy-0:0:11/device/port/end_device-0:0:3/target0:0:3/0:0:3:0/block
total 0
drwxr-xr-x 11 root root 0 May 29 15:48 sdf
There we go: it’s /dev/sdf.
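All of that directory-hopping collapses into a single glob. Here’s a sketch of a helper (the function name is mine, not from any tool) that follows the same phy → port → end_device → target → device → block chain; the optional second argument substitutes a different sysfs root, which also makes it testable:

```shell
# Map a SAS PHY name (e.g. phy-0:0:11) to its /dev/sdX block device by
# following phy -> port -> end_device -> target -> device -> block.
# Prints nothing for PHYs that don't lead to a disk (host/expander
# links). $2 optionally overrides /sys.
phy_to_block() {
    for blk in "${2:-/sys}/class/sas_phy/$1/device/port"/end_device-*/target*/*/block/*; do
        if [ -e "$blk" ]; then
            echo "/dev/${blk##*/}"
        fi
    done
}

phy_to_block phy-0:0:11   # on the machine above, this prints /dev/sdf
```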
There’s almost certainly a better way to get this out of sasutils, but I haven’t had a lot of luck. Mapping from PHY name to block device isn’t something that any tool seems to want to do today.
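In the meantime, a shell loop over /sys gets most of the way there: report every nonzero counter, tagged with the block device the PHY leads to. This is a rough sketch (the names and output format are my own invention), not a finished tool:

```shell
# Walk every PHY on the system, print any error counter that is
# nonzero, and tag it with the block device the PHY leads to ("-" for
# host- or expander-facing PHYs). $1 optionally overrides /sys so the
# function can be exercised against a fake sysfs tree.
sas_error_report() {
    for phy in "${1:-/sys}"/class/sas_phy/phy-*; do
        [ -d "$phy" ] || continue
        dev="-"
        for blk in "$phy"/device/port/end_device-*/target*/*/block/*; do
            if [ -e "$blk" ]; then dev="/dev/${blk##*/}"; fi
        done
        for c in "$phy"/*_count; do
            [ -r "$c" ] || continue
            n=$(cat "$c")
            if [ "$n" -gt 0 ] 2>/dev/null; then
                printf '%s %s %s=%s\n' "${phy##*/}" "$dev" "${c##*/}" "$n"
            fi
        done
    done
}

sas_error_report
```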
Clearly, we need better tooling for this. I’m starting by trying to add SAS counters and a bit of topology to Prometheus’s node_exporter. The goal is to be able to track all of the PHY statistics in Prometheus (so you can graph errors over time and so forth), and then label the metrics with as much topology information as makes sense. The first PR is out now, and once that’s in I’ll start working on node_exporter itself.
I’m also working on a couple of other tools for managing large numbers of drives, but they’re further out. I have over 50 drives on my home file server, and it’s surprisingly hard to answer a few really basic questions, like:
- How many unused drives are in the system? Like new disks that weren’t added as ZFS spares, or old disks that weren’t removed.
- Which enclosure/bay holds drive sdXXX?
- What is the serial number of sdXXX? (Since this appears on the physical drive label, it’s actually useful for verifying that you’ve pulled the right disk.)
- What is sdXXX used for on this system?
Those are all working now. The tricky bits are mostly persistence issues:
- What is drive sdXXX’s history? (First seen on 2020-03-21, added to ZFS pool XXX on 2020-03-21, removed from ZFS pool on 2022-04-01)
- Mark disk XXX as bad and don’t let me forget. I’ve had a couple drives that smartctl says are okay, but have huge latency and only move a few MB/sec. When I see them, I usually swap in a spare, but that process can take days to complete, and it’s way too easy to forget to yank the drive and discard it.
- Which drive models have the lowest failure rates? I’m not Backblaze, but I’m probably pushing 1,000 drive-years of disk use at this point. I’ve lost so many ST3000DM001s over the years that I know that they’re terrible, but what about the other models?
- And what about drives in other systems? It’d be nice to pool this data across all of my machines, but now I need a server running and config files and security…