Report from Day 2 of the Linux Storage and Filesystem Workshop, April 6-7th, 2009
May 18th, 2009 | Published in Google Open Source
(The sequel to Report from Day 1)
The first official discussion of the day covered optimizing Solid State Disk (SSD) performance, led by Matthew "willy" Wilcox (Intel). SSD behavior, and how to "model" that behavior, is still of great interest: file system developers need to understand how the performance costs have shifted compared to rotating disks. Matthew noted that the Intel SSDs are adaptive and export the illusion of 512 byte sectors. Ric Wheeler (Red Hat) also raised concerns about the current use of fallocate() and how adding "TRIM" support to fallocate() would affect thin provisioning. The consensus was that any form of oversubscription of the hardware would cause problems and that the TRIM command might exacerbate those issues; even so, TRIM would have a measurable positive impact for most SSD users. Ted Ts'o expected ext4 to already issue TRIM properly at the right times.
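For context, here is a minimal sketch of the fallocate() call under discussion, assuming a Linux system with glibc's fallocate(2) wrapper (the file name and size are made up for illustration):

```c
/* Minimal sketch: preallocating file space with fallocate(2).
 * A filesystem with discard support could map later deallocation
 * of such space (e.g., on truncate or unlink) to TRIM commands. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("testfile", O_CREAT | O_WRONLY, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* Reserve 1 MiB without writing data; mode 0 also extends
     * the file size to offset + len. */
    if (fallocate(fd, 0, 0, 1 << 20) < 0)
        perror("fallocate");
    close(fd);
    return 0;
}
```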
Moving on, "4KB sector" hard disk support is mostly done; Martin Petersen did much of the recent work to prepare the Linux kernel for it. Performance issues are still lurking, however, when using FAT partition tables or anything else that affects the alignment of writes. The root cause of this mess is that the drives export 512 byte logical sectors despite implementing 4k physical sectors. They export 512 byte sectors in order to boot from older BIOSes and to function with "legacy" operating systems. But in order to get good performance with FAT partition tables, hardware vendors have "offset" the logical-to-physical block mapping so that logical sector 63 (the size of a "track" in ancient BIOSes) is well aligned physically. "Badly aligned" means a 4k write requires reading and writing parts of two 4k physical sectors, thus "burning" an extra rotation (the drive must read the data first and then write it out on the next rotation). Anyone using the full disk or some other partition table (e.g. GPT) will learn the joys of unnecessary complexity when they have to demonstrate and explain two levels of disk performance for the same application binary. The only way to avoid this mess is for HDD vendors to provide a way to directly use native 4k blocks.
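To make the alignment arithmetic concrete, here is a small sketch (my own illustration, not from the workshop) that checks whether a partition's starting LBA is 4k aligned, assuming 512 byte logical sectors and no vendor offset trickery:

```c
/* Sketch: is a partition's first logical sector 4 KiB aligned?
 * With the classic DOS/FAT layout starting at LBA 63, the byte
 * offset is 63 * 512 = 32256, which is not a multiple of 4096. */
#include <stdio.h>

int main(void)
{
    unsigned long long start_lba = 63;  /* classic DOS/FAT layout */
    unsigned long long byte_offset = start_lba * 512;

    if (byte_offset % 4096 == 0)
        printf("LBA %llu is 4k aligned\n", start_lba);
    else
        printf("LBA %llu is misaligned by %llu bytes\n",
               start_lba, byte_offset % 4096);
    return 0;
}
```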
The second issue with 512 byte block emulation is error handling. Performance will be horrid for small sequential writes within a single 4k physical sector unless the drive caches the intermediate writes and commits them in one go when the last sector arrives. But if the 8th write fails, the OS will think the previous 7 sectors are fine and simply rewrite the last sector; the previous 7x512 bytes are gone. With Write Cache Enable (WCE) turned on, as it is for most SATA drives, this problem already exists. The only new thing this speculation exposes is that disk vendors have strong incentives to violate the intent of "WCE off" despite the dire consequences.
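To illustrate the failure mode (my own sketch, not from the workshop; the device name is a placeholder), here are eight sequential 512 byte writes that all land in one 4k physical sector of such a drive:

```c
/* Illustration: eight 512-byte writes sharing one 4 KiB physical
 * sector on a 512-byte-emulation drive. A drive that coalesces them
 * in its write cache may report an error only on the final write,
 * even though the earlier, cached writes were also lost. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[512] = { 0 };
    int fd = open("/dev/sdX", O_WRONLY);  /* placeholder device */
    if (fd < 0) {
        perror("open");
        return 1;
    }
    for (int i = 0; i < 8; i++) {
        /* Each write is one logical sector; all eight map to the
         * same 4 KiB physical sector on the media. */
        if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
            fprintf(stderr, "write %d failed; earlier sectors may be "
                            "lost too if they were only cached\n", i + 1);
            break;
        }
    }
    close(fd);
    return 0;
}
```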
The last presentation I want to mention was "Virtual Machine I/O". The challenge is that block IO schedulers need to manage bandwidth across the various topologies typically seen in virtualized IO. Google's Nauman Rafique was one of the presenters, with a focus on a "proportional IO" implementation. Hannes Reinecke summarized the core problem nicely: IO scheduling is only needed when there is contention at the device level, so keep the mechanism that enforces scheduling at that level; different policies can then be implemented at higher levels as needed.
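As a toy illustration of the "proportional IO" idea (my own sketch, not the actual patches discussed): each group earns dispatch credit in proportion to its weight, and enforcement happens only at the contended device:

```c
/* Toy sketch of proportional IO sharing: a smooth weighted
 * round-robin over IO groups, so a group with weight 300 gets
 * three times the dispatch slots of a group with weight 100. */
#include <stdio.h>

struct io_group {
    const char *name;
    int weight;   /* relative bandwidth share */
    long credit;  /* accumulated scheduling credit */
};

/* Pick the next group to dispatch from (classic smooth WRR). */
static int pick_next(struct io_group *g, int n)
{
    int total = 0, best = 0;
    for (int i = 0; i < n; i++) {
        g[i].credit += g[i].weight;  /* earn credit by weight */
        total += g[i].weight;
        if (g[i].credit > g[best].credit)
            best = i;
    }
    g[best].credit -= total;  /* charge the winner for its slot */
    return best;
}

int main(void)
{
    struct io_group groups[] = {
        { "vm-a", 300, 0 },
        { "vm-b", 100, 0 },
    };
    for (int i = 0; i < 8; i++)
        printf("dispatch slot %d -> %s\n",
               i, groups[pick_next(groups, 2)].name);
    return 0;
}
```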
Later on, while talking with the group in a "hacking session", I backed up my assertion that this is not a new problem by showing my copy of the OLS 2004 schedule, where IO prioritization was mentioned both by Jens Axboe in his talk and in a BOF led by Werner Almsberger (http://abiss.sourceforge.net/). My advice was to solve and push the simplest piece first, before confusing everyone with grand designs and huge patches.
And I'll close with kudos to the Linux Foundation staff for pulling this off smoothly! Really. It was nice to see a small event handled so professionally and courteously.