SAS counters on Linux
There isn’t a lot of documentation on debugging SAS-specific storage problems under Linux, unfortunately. I’ve had a couple of issues recently that were tied to flaky SAS cables, and I mostly had to debug them by swapping cables and reasoning through which device was causing problems. Along the way I’ve learned a few helpful things.
First, Linux actually has counters for tracking SAS-level problems. They’re just really hard to find, and very few tools expose them. The kernel would love to tell you that the connection between your SAS expander and drive 23 is flaky, but it’s not going to volunteer that information. You need to go looking for it.
There are two packages that I’ve found essential for understanding how SAS devices are connected in a system. The first is sasutils, and the other is lsscsi. Unfortunately, as we’ll see in a minute, neither is actually sufficient.
Of the two, sasutils is probably more useful. It can show you your SAS topology (sas_discover), your devices (sas_devices), and most of the low-level kernel counters in a format suitable for Graphite via sas_counters. Unfortunately, I haven’t found sas_counters to be all that useful, for a couple of reasons. First, I don’t use Graphite for monitoring, so the information isn’t presented in the best format for me to use. But the bigger issue is that it doesn’t provide enough context: it’ll tell you which links have problems, but mapping those to actual devices is a manual process. I mean, can you tell which drives this is talking about?
$ sas_counters
...
sasutils.sas_counters.fs2.SAS9300-8e.0x500605b00c20d550.SAS2X36.expander_0x500304800000007f.phys.11.invalid_dword_count 0 1654394146
sasutils.sas_counters.fs2.SAS9300-8e.0x500605b00c20d550.SAS2X36.expander_0x500304800000007f.phys.11.loss_of_dword_sync_count 0 1654394146
sasutils.sas_counters.fs2.SAS9300-8e.0x500605b00c20d550.SAS2X36.expander_0x500304800000007f.phys.11.phy_reset_problem_count 0 1654394146
sasutils.sas_counters.fs2.SAS9300-8e.0x500605b00c20d550.SAS2X36.expander_0x500304800000007f.phys.11.running_disparity_error_count 0 1654394146
sasutils.sas_counters.fs2.SAS9300-8e.0x500605b00c20d550.SAS2X36.expander_0x500304800000007f.phys.8.invalid_dword_count 0 1654394146
sasutils.sas_counters.fs2.SAS9300-8e.0x500605b00c20d550.SAS2X36.expander_0x500304800000007f.phys.8.loss_of_dword_sync_count 0 1654394146
sasutils.sas_counters.fs2.SAS9300-8e.0x500605b00c20d550.SAS2X36.expander_0x500304800000007f.phys.8.phy_reset_problem_count 0 1654394146
sasutils.sas_counters.fs2.SAS9300-8e.0x500605b00c20d550.SAS2X36.expander_0x500304800000007f.phys.8.running_disparity_error_count 0 1654394146
...
I can tell which SAS card (SAS9300-8e, with address 0x500605b00c20d550) and which SAS expander, and even which physical link (SAS PHY) the disk is connected to, but I don’t know anything about the disk itself. Or whether this even is a disk: the links between hosts and expanders look identical at this level.
Actually figuring this out turns out to be a bit of an adventure.
In this case, sas_discover --addr -vvvvv | less can help a little bit. Searching for the SAS address provided (0x500304800000007f) will tell you that that’s expander-0:0. Which presumably means that we’re talking about SAS phy-0:0:11 and phy-0:0:8, but there’s no good way to turn that into a /dev/sdX drive using the information here.
You could go run lsscsi, and it’ll tell you that device 0:0:11 is /dev/sdn.
$ lsscsi
[0:0:0:0] disk HGST HUH721008AL5204 NE00 /dev/sdc
[0:0:1:0] disk HGST HUH721008AL5204 NE00 /dev/sdd
[0:0:2:0] disk HGST HUH721008AL5204 NE00 /dev/sde
[0:0:3:0] disk HGST HUH721008AL5204 NE00 /dev/sdf
[0:0:4:0] disk HGST HUH721008AL5204 NE00 /dev/sdg
[0:0:5:0] disk HGST H7280A520SUN8.0T PD51 /dev/sdh
[0:0:6:0] disk HGST HUH721008AL5204 NE00 /dev/sdi
[0:0:7:0] disk HGST H7280A520SUN8.0T PD51 /dev/sdj
[0:0:8:0] disk HGST H7280A520SUN8.0T PD51 /dev/sdk
[0:0:9:0] disk HGST H7280A520SUN8.0T PD51 /dev/sdl
[0:0:10:0] disk HGST H7280A520SUN8.0T PD51 /dev/sdm
[0:0:11:0] disk HGST H7280A520SUN8.0T PD51 /dev/sdn
[0:0:12:0] disk HGST H7280A520SUN8.0T PD51 /dev/sdo
Not too hard, right?
Yeah, except that’s wrong.
What lsscsi is telling you is that SAS end_device-0:0:11 is /dev/sdn. But phy-0:0:11 doesn’t necessarily map to end_device-0:0:11.
To get this right, you actually have to dig into Linux’s /sys filesystem by hand. We know that we’re interested in phy-0:0:11, so we can look in /sys/class/sas_phy/phy-0:0:11:
$ ls -l /sys/class/sas_phy/phy-0:0:11/
total 0
lrwxrwxrwx 1 root root 0 May 29 20:59 device -> ../../../phy-0:0:11
-r--r--r-- 1 root root 4096 May 29 20:59 device_type
-rw-r--r-- 1 root root 4096 May 29 20:59 enable
--w------- 1 root root 4096 May 29 20:59 hard_reset
-r--r--r-- 1 root root 4096 May 29 20:59 initiator_port_protocols
-r--r--r-- 1 root root 4096 May 29 20:59 invalid_dword_count
--w------- 1 root root 4096 May 29 20:59 link_reset
-r--r--r-- 1 root root 4096 May 29 20:59 loss_of_dword_sync_count
-rw-r--r-- 1 root root 4096 May 29 20:59 maximum_linkrate
-r--r--r-- 1 root root 4096 May 29 20:59 maximum_linkrate_hw
-rw-r--r-- 1 root root 4096 May 29 20:59 minimum_linkrate
-r--r--r-- 1 root root 4096 May 29 20:59 minimum_linkrate_hw
-r--r--r-- 1 root root 4096 May 29 20:59 negotiated_linkrate
-r--r--r-- 1 root root 4096 May 29 20:59 phy_identifier
-r--r--r-- 1 root root 4096 May 29 20:59 phy_reset_problem_count
drwxr-xr-x 2 root root 0 May 29 20:59 power
-r--r--r-- 1 root root 4096 May 29 20:59 running_disparity_error_count
-r--r--r-- 1 root root 4096 May 29 20:59 sas_address
lrwxrwxrwx 1 root root 0 May 29 15:49 subsystem -> ../../../../../../../../../../class/sas_phy
-r--r--r-- 1 root root 4096 May 29 20:59 target_port_protocols
-rw-r--r-- 1 root root 4096 May 29 15:48 uevent
That’s sort of useful; you can see the interesting error counters there (all of the *_count files) and negotiated_linkrate, which will tell you how fast this SAS link is. This is where sas_counters gets its data.
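Since these counters are just files, you don’t need any special tooling to read them; a few lines of shell will do. Here’s a quick sketch of my own (the function name is made up, not part of sasutils), pointed at the PHY from the example above:

```shell
# Dump every error counter (plus the negotiated link rate) for one PHY.
# Takes the sysfs PHY directory, e.g. /sys/class/sas_phy/phy-0:0:11.
phy_counters() {
    for f in "$1"/*_count "$1"/negotiated_linkrate; do
        if [ -r "$f" ]; then
            printf '%s: %s\n' "${f##*/}" "$(cat "$f")"
        fi
    done
}

phy_counters /sys/class/sas_phy/phy-0:0:11
```

Taking the directory as an argument makes it trivial to loop over every PHY in /sys/class/sas_phy/.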
To figure out where this PHY goes, you have to look in the device/ subdirectory:
$ ls -l /sys/class/sas_phy/phy-0:0:11/device/
total 0
lrwxrwxrwx 1 root root 0 May 31 10:43 port -> ../port-0:0:3
drwxr-xr-x 2 root root 0 May 29 21:00 power
drwxr-xr-x 3 root root 0 May 29 15:48 sas_phy
-rw-r--r-- 1 root root 4096 May 29 15:49 uevent
That tells us that this PHY is part of SAS port port-0:0:3. Ports are a logical connection between two SAS devices, and contain one or more PHYs. A 4-lane link between a SAS host and a SAS expander would be a single port but contain 4 PHYs (and hence 4 negotiated speeds, and 4 sets of error counters). We can follow the port link to see…
$ ls -l /sys/class/sas_phy/phy-0:0:11/device/port/
total 0
drwxr-xr-x 7 root root 0 May 29 15:48 end_device-0:0:3
lrwxrwxrwx 1 root root 0 May 29 20:59 phy-0:0:11 -> ../phy-0:0:11
drwxr-xr-x 2 root root 0 May 29 21:00 power
drwxr-xr-x 3 root root 0 May 29 15:48 sas_port
-rw-r--r-- 1 root root 4096 May 29 15:49 uevent
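As an aside, those per-port phy-* symlinks are also how you can count lane widths. Assuming /sys/class/sas_port mirrors the layout we just saw (each port’s device/ directory holding one symlink per member PHY; that layout is my assumption here), a sketch like this tallies them:

```shell
# Count the PHYs aggregated into each SAS port: a wide host<->expander
# link should report 4 (or 8), a single drive just 1. Assumes each
# port's device/ directory holds one phy-* symlink per lane.
# $1 optionally overrides /sys (handy for testing against a fake tree).
port_widths() {
    for port in "${1:-/sys}"/class/sas_port/port-*; do
        [ -d "$port" ] || continue
        printf '%s: %s PHYs\n' "${port##*/}" \
            "$(ls -d "$port"/device/phy-* 2>/dev/null | wc -l)"
    done
}

port_widths
```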
Hey, look: phy-0:0:11 actually goes to end_device-0:0:3, not end_device-0:0:11. But which block device is that? We could go back to lsscsi, or we could just fetch it directly from /sys. Fortunately, it’s sitting right here in a convenient place:
$ ls -l /sys/class/sas_phy/phy-0:0:11/device/port/end_device-0:0:3/
total 0
drwxr-xr-x 3 root root 0 May 29 15:48 bsg
drwxr-xr-x 2 root root 0 May 29 21:00 power
drwxr-xr-x 3 root root 0 May 29 15:48 sas_device
drwxr-xr-x 3 root root 0 May 29 15:48 sas_end_device
drwxr-xr-x 4 root root 0 May 29 15:48 target0:0:3
-rw-r--r-- 1 root root 4096 May 29 15:49 uevent
Uhm, not there yet. Keep going; follow the target0:0:3 directory to find…
$ ls -l /sys/class/sas_phy/phy-0:0:11/device/port/end_device-0:0:3/target0:0:3
total 0
drwxr-xr-x 8 root root 0 May 29 15:48 0:0:3:0
drwxr-xr-x 2 root root 0 May 29 21:00 power
lrwxrwxrwx 1 root root 0 May 29 15:49 subsystem -> ../../../../../../../../../../bus/scsi
-rw-r--r-- 1 root root 4096 May 29 15:48 uevent
Keep going; look in 0:0:3:0:
$ ls -l /sys/class/sas_phy/phy-0:0:11/device/port/end_device-0:0:3/target0:0:3/0:0:3:0
total 0
-r--r--r-- 1 root root 4096 May 29 20:59 blacklist
drwxr-xr-x 3 root root 0 May 29 15:48 block
drwxr-xr-x 3 root root 0 May 29 15:48 bsg
--w------- 1 root root 4096 May 29 20:59 delete
-r--r--r-- 1 root root 4096 May 29 20:59 device_blocked
-r--r--r-- 1 root root 4096 May 29 20:59 device_busy
-rw-r--r-- 1 root root 4096 May 29 20:59 dh_state
lrwxrwxrwx 1 root root 0 May 29 15:49 driver -> ../../../../../../../../../../../bus/scsi/drivers/sd
-rw-r--r-- 1 root root 4096 May 29 20:59 eh_timeout
lrwxrwxrwx 1 root root 0 May 29 15:49 'enclosure_device:Slot 04' -> '../../../../port-0:0:13/end_device-0:0:13/target0:0:13/0:0:13:0/enclosure/0:0:13:0/Slot 04'
-r--r--r-- 1 root root 4096 May 29 20:59 evt_capacity_change_reported
-r--r--r-- 1 root root 4096 May 29 20:59 evt_inquiry_change_reported
...
Getting closer; the answer is hiding in the block/ subdirectory:
$ ls -l /sys/class/sas_phy/phy-0:0:11/device/port/end_device-0:0:3/target0:0:3/0:0:3:0/block
total 0
drwxr-xr-x 11 root root 0 May 29 15:48 sdf
There we go: it’s /dev/sdf.
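All of that directory-hopping collapses into a single glob. Here’s a sketch of a helper (the function name is mine, not from any tool) that follows the same phy → port → end_device → target → device → block chain; the optional second argument substitutes a different sysfs root, which also makes it testable:

```shell
# Map a SAS PHY name (e.g. phy-0:0:11) to its /dev/sdX block device by
# following phy -> port -> end_device -> target -> device -> block.
# Prints nothing for PHYs that don't lead to a disk (host/expander
# links). $2 optionally overrides /sys.
phy_to_block() {
    for blk in "${2:-/sys}/class/sas_phy/$1/device/port"/end_device-*/target*/*/block/*; do
        if [ -e "$blk" ]; then
            echo "/dev/${blk##*/}"
        fi
    done
}

phy_to_block phy-0:0:11   # on the machine above, this prints /dev/sdf
```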
There’s almost certainly a better way to get this out of sasutils, but I haven’t had a lot of luck. Mapping from PHY name to block device isn’t something that any tool seems to want to do today.
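In the meantime, a shell loop over /sys gets most of the way there: report every nonzero counter, tagged with the block device the PHY leads to. This is a rough sketch (the names and output format are my own invention), not a finished tool:

```shell
# Walk every PHY on the system, print any error counter that is
# nonzero, and tag it with the block device the PHY leads to ("-" for
# host- or expander-facing PHYs). $1 optionally overrides /sys so the
# function can be exercised against a fake sysfs tree.
sas_error_report() {
    for phy in "${1:-/sys}"/class/sas_phy/phy-*; do
        [ -d "$phy" ] || continue
        dev="-"
        for blk in "$phy"/device/port/end_device-*/target*/*/block/*; do
            if [ -e "$blk" ]; then dev="/dev/${blk##*/}"; fi
        done
        for c in "$phy"/*_count; do
            [ -r "$c" ] || continue
            n=$(cat "$c")
            if [ "$n" -gt 0 ] 2>/dev/null; then
                printf '%s %s %s=%s\n' "${phy##*/}" "$dev" "${c##*/}" "$n"
            fi
        done
    done
}

sas_error_report
```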
Clearly, we need better tooling for this. I’m starting by trying to add SAS counters and a bit of topology to Prometheus’s node_exporter. The goal is to be able to track all of the PHY statistics in Prometheus (so you can graph errors over time and so forth), and then label the metrics with as much topology information as makes sense. The first PR is out now, and once that’s in I’ll start working on node_exporter itself.
I’m also working on a couple of other tools for managing large numbers of drives, but they’re further out. I have over 50 drives on my home file server, and it’s surprisingly hard to answer a few really basic questions, like:
- How many unused drives are in the system? Like new disks that weren’t added as ZFS spares, or old disks that weren’t removed.
- Which enclosure/bay holds drive sdXXX?
- What is the serial number of sdXXX? (Since this appears on the physical drive label, it’s actually useful for verifying that you’ve pulled the right disk.)
- What is sdXXX used for on this system?
Those are all working now. The tricky bits are mostly persistence issues:
- What is drive sdXXX’s history? (First seen on 2020-03-21, added to ZFS pool XXX on 2020-03-21, removed from ZFS pool on 2022-04-01)
- Mark disk XXX as bad and don’t let me forget. I’ve had a couple drives that smartctl says are okay, but have huge latency and only move a few MB/sec. When I see them, I usually swap in a spare, but that process can take days to complete, and it’s way too easy to forget to yank the drive and discard it.
- Which drive models have the lowest failure rates? I’m not Backblaze, but I’m probably pushing 1,000 drive-years of disk use at this point. I’ve lost so many ST3000DM001s over the years that I know that they’re terrible, but what about the other models?
- And what about drives in other systems? It’d be nice to pool this data across all of my machines, but now I need a server running and config files and security…