Report from Day 1 of the Linux Storage and Filesystems Workshop, April 6-7, 2009
April 30th, 2009 | Published in Google Open Source
My karma was apparently very good three weeks ago. At the last minute I secured an invitation to the Linux Storage and Filesystems Workshop 2009 (LSF). This year the invitation-only workshop was hosted by the Linux Foundation on April 6-7 in San Francisco, CA. It was, as always, an intense two days of non-stop information exchange and decision making. I've attempted to summarize what I took away as the most interesting and important discussions of the first day. These are my opinions, and your mileage may vary.
The group of about 50 developers was nearly all present when Zach Brown (Oracle) welcomed us and reminded folks about the ground rules of the event. Share the brownies. Wash your hands, ...just kidding. This event remains small so folks can participate, and we were pretty "cozy" in the small conference room with five good-sized round tables. The first step was to turn off the projector. :)
Chris Mason and James Bottomley then did a great summary and "scoring" of promises made at LSF2008. The IO stack was up first and had some good initial scores, with high points for power management, request-based multipath, TRIM/ERASE support in BIOs, T10 DIF/DIX (complete), and FCoE (also complete). Chris Mason managed to nearly match that with 4/4 points for barriers, Btrfs (upstream but not yet stable), IPv6 NFS, and NFS RDMA.
The first problem/topic was how to cache device scanning in the kernel, or how to properly export an API for device scanning. The general problem is that the kernel exports this information through several different methods, and scanning is very time-consuming on large systems. This was followed by an Async IO and Direct IO discussion led by Zach Brown (Oracle) and Jeffrey Moyer (Red Hat). Zach has been the AIO maintainer "forever" and made it clear that AIO is actually asynchronous in only a very few circumstances, which happen to suit database developers.
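The circumstance that does stay asynchronous is, roughly, O_DIRECT I/O; on buffered files, io_submit() can simply block. Here's a minimal sketch of that pattern using the libaio userspace wrapper (the device path, sizes, and error handling are my illustrative placeholders, not anything from the talk):

```c
/* Minimal libaio sketch: one O_DIRECT read submitted asynchronously.
 * This is the narrow case that is truly async; drop O_DIRECT and
 * io_submit() may block doing buffered I/O. Build with -laio. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    /* "/dev/sda" is a placeholder; reading a raw disk needs root. */
    const char *path = (argc > 1) ? argv[1] : "/dev/sda";
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    /* O_DIRECT requires sector-aligned buffers, offsets, and lengths. */
    void *buf;
    if (posix_memalign(&buf, 4096, 4096)) return 1;

    io_context_t ctx = 0;                 /* must be zeroed before io_setup */
    if (io_setup(32, &ctx) < 0) { fprintf(stderr, "io_setup failed\n"); return 1; }

    struct iocb cb, *cbs[1] = { &cb };
    io_prep_pread(&cb, fd, buf, 4096, 0); /* read 4 KiB at offset 0 */
    if (io_submit(ctx, 1, cbs) != 1) { fprintf(stderr, "io_submit failed\n"); return 1; }

    /* ...do other work here while the read is in flight... */

    struct io_event ev;
    if (io_getevents(ctx, 1, 1, &ev, NULL) != 1) { fprintf(stderr, "io_getevents failed\n"); return 1; }
    printf("read completed: res=%ld\n", (long)ev.res);

    io_destroy(ctx);
    free(buf);
    close(fd);
    return 0;
}
```

The alignment requirements are part of why only database-style workloads tend to fit the truly async path.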
Joe Eykholt gave a summary of "FC/SCSI Targets": how to get initiator *and* target mode support from one FC HBA at the same time. Interesting stuff. Nick Bellinger gave a concise summary of the state of the "LIO/iSCSI" code and the "tgt" driver.
I was interested in Tejun Heo's "libata status and issues" discussion. First we talked about the status of several patches: mvsas updates, the "ATA Bus" transport class, SFF vs. native transport classes from my co-worker Gwendal Grignou, and a pile of power management patches from Kristen Accardi (Intel). Tejun then dove into the "Spurious Power Off" problem. The apparent cause is a brief loss of power from the PSU, which results in massive filesystem corruption. He has documented five incidents so far. Additional symptoms are "clicking" sounds and START/STOP count increments (reported via SMART data). Tejun suspects the filesystem is issuing a FLUSH to all disks simultaneously. We further speculated that the drives might be in a low-power state (possibly at a slower RPM) and suddenly all come to life. Currently no fix is available.
Some possible workarounds we considered:
- disable the write cache (and take a write performance hit on single-threaded workloads); see the sketch after this list
- disable power management
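In practice the first workaround is a one-liner with hdparm (-W0 disables the write cache; -B 255 handles the second by disabling APM). For the curious, here is a rough C sketch of what hdparm -W0 does under the hood: an ATA SET FEATURES command with the "disable write cache" subcommand, sent through the legacy HDIO_DRIVE_CMD ioctl. Treat it as an illustration rather than production code; the device path is a placeholder, and it needs root.

```c
/* Sketch: disable a drive's volatile write cache, roughly what
 * "hdparm -W0 /dev/sdX" does: ATA SET FEATURES (0xEF) with
 * subcommand 0x82 (disable write cache) via HDIO_DRIVE_CMD. */
#include <fcntl.h>
#include <linux/hdreg.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *dev = (argc > 1) ? argv[1] : "/dev/sda"; /* placeholder */
    int fd = open(dev, O_RDONLY | O_NONBLOCK);
    if (fd < 0) { perror("open"); return 1; }

    /* HDIO_DRIVE_CMD: args[0] = ATA command, args[2] = feature
     * register; the other bytes are unused for this command. */
    unsigned char args[4] = { 0xEF /* SET FEATURES */, 0,
                              0x82 /* disable write cache */, 0 };
    if (ioctl(fd, HDIO_DRIVE_CMD, args) < 0) {
        perror("HDIO_DRIVE_CMD");
        close(fd);
        return 1;
    }
    printf("write cache disabled on %s\n", dev);
    close(fd);
    return 0;
}
```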
He moved on to discuss ambiguities in the libata/block layer data structures: pairs of fields whose names differ only by a hard_ prefix and whose meanings are similar, but not the same.
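To make the ambiguity concrete, here is an abridged excerpt of the fields in question from struct request (include/linux/blkdev.h of that era; comments lightly paraphrased from the header). The plain fields track the driver's submission progress, while the hard_ twins track what the block layer considers completed:

```c
/* Abridged struct request excerpt, circa 2.6.29. Stand-in typedef so
 * the excerpt compiles outside the kernel tree. */
typedef unsigned long long sector_t;

struct request {
        /* ... */
        sector_t sector;                 /* next sector to submit */
        sector_t hard_sector;            /* next sector to complete */
        unsigned long nr_sectors;        /* sectors left to submit */
        unsigned long hard_nr_sectors;   /* sectors left to complete */
        unsigned int current_nr_sectors; /* left to submit, current segment */
        unsigned int hard_cur_sectors;   /* left to complete, current segment */
        /* ... */
};
```

Even the pairing is inconsistent (nr_sectors/hard_nr_sectors, but current_nr_sectors/hard_cur_sectors), which illustrates the problem nicely.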
One of the last issues was something I raised: can we reduce the CPU utilization of the block layer? I asked because several new flash technologies are under development, and they are all capable of 200+ *thousand* IOPS. The answer: Jens Axboe has been working on this since about December 2008 and has already committed his initial results to his own git tree. I just need to find the git tree and the proper branch now. :)
That wraps up day one. I hope you find the information useful. If you want to read about day two, please leave a comment; if demand warrants it, I'll cover it in a future post.