
Linux XFS file system tuning for Nimble

Question asked by David Baril on Dec 5, 2016

Hello all.

 

There is very little information about tuning Linux hosts for EFFICIENT use of Nimble, other than a 1 MB first partition alignment, and using a pseudo n*4 KB stripe if the file system block size is 4 KB but the Nimble volume uses an 8 KB or larger block size.  Using the "noop" IO scheduler is also mentioned, and important.  Most of the existing documents are ext2/ext3 file system oriented ... and the available tuning options were much more limited.  In general, you often took the "defaults" because you had few other options.  The other somewhat consistent theme in Nimble's Linux configuration guidance is NOT to use a single iSCSI volume, but to use multiple iSCSI volumes.  This implies that there are host-side bottlenecks when using a single iSCSI volume.
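
To make the alignment arithmetic concrete, here is a minimal Python sketch. The helper names are hypothetical, and the 8 KB and 32 KB volume block sizes are only example values; it simply checks that a 1 MB partition offset lands on a block boundary and works out the smallest n*4 KB pseudo stripe unit that matches a larger volume block size.

```python
import math

KIB = 1024
MIB = 1024 * KIB

def is_aligned(offset_bytes, block_bytes):
    """Return True when offset_bytes is an exact multiple of block_bytes."""
    return offset_bytes % block_bytes == 0

def pseudo_stripe_unit(fs_block_bytes, volume_block_bytes):
    """Smallest size that is a multiple of both the file system block
    size and the volume block size (their least common multiple)."""
    return fs_block_bytes * volume_block_bytes // math.gcd(fs_block_bytes, volume_block_bytes)

# Example values only: a 1 MiB partition start, 4 KiB file system blocks.
partition_start = 1 * MIB
for volume_block in (4 * KIB, 8 * KIB, 32 * KIB):
    unit = pseudo_stripe_unit(4 * KIB, volume_block)
    print(volume_block // KIB, "KiB volume block:",
          "start aligned =", is_aligned(partition_start, volume_block),
          "stripe unit =", unit // KIB, "KiB")
```

A 1 MiB offset is a multiple of every power-of-two block size up to 1 MiB, which is why that single alignment rule covers all of the common volume block sizes.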

 

However, with the newer advanced file systems, like XFS and others, and newer options in the Linux IO stack, most host-side bottlenecks using a single volume (at least at 10GbE rates) no longer exist ... when properly configured.  Many system administrators would prefer to have a single volume doing 8x the work, rather than requiring that a logical file system be broken up into 8 iSCSI volumes and then host-striped using LVM ... as Nimble shows in several of their examples.

 

The need to use multiple Nimble volumes to avoid host bottlenecks and better exploit the Nimble performance may be true, but it may be an artifact of Nimble's deferring to Linux "defaults", which are NOT well "parallel-enabled". Linux CAN be configured for greater parallelism throughout the IO stack, and at the important file-system level ... at least for advanced file systems such as XFS.  Linux has the capability, if properly configured, to exploit the extreme performance of NVMe SSDs with hundreds of concurrent threads, for example, but this does not happen by deferring to the Linux and XFS "defaults". I know that this comparison is a bit apples and oranges ... but the point is that most of the known host-side bottlenecks up to several hundred thousand IOPS can be addressed with proper configuration of non-default settings.  A host configured this way is more than capable of exploiting the high performance of Nimble.

 

I understand that Nimble's CASL architecture, along with compression and variable stripe blocks, results in a dynamic striping topology, without the classic RAID5 or RAID6 write penalties on less-than-full-stripe writes.

 

However, several advanced Linux file systems allow significant tailoring to align critical data structures on some "hint". These advanced file systems expose more of their internal alignment ... not just the start of the volume, but also that of their sub-volumes, which are often called "allocation groups". These settings can also enable greater levels of parallelism, if properly exploited.

 

Let me also note that Nimble does not offer any topology-centric "hints" using the SCSI mode pages that RHEL/CentOS 7.x and other recent Linux flavors will query.  Much of the current disk IO and file system IO stack will automatically adjust to such topology hints.  In Nimble's case, no information is provided, and Linux can choose some less-than-optimal default values. Storage devices that return no advanced topology information are presumed to be less-capable "legacy" storage, and Linux effectively uses a "backward-compatibility" (slow) set of configuration options.
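
For anyone who wants to see what their own host ended up with, the block layer publishes the values it settled on under /sys/block/<device>/queue. The following is a minimal Python sketch for reading them; the function name and the "sdb" device name in the usage note are hypothetical, and the sysfs root is a parameter only so the helper can be exercised against a test directory.

```python
from pathlib import Path

# Standard block-queue attributes exposed by the Linux kernel in sysfs.
QUEUE_ATTRS = (
    "logical_block_size",
    "physical_block_size",
    "minimum_io_size",
    "optimal_io_size",
    "scheduler",
)

def read_queue_hints(device, sysfs_root="/sys/block"):
    """Return a dict of queue attribute -> string value (None if unreadable)."""
    queue_dir = Path(sysfs_root) / device / "queue"
    hints = {}
    for name in QUEUE_ATTRS:
        try:
            hints[name] = (queue_dir / name).read_text().strip()
        except OSError:
            # Attribute missing or not readable on this kernel/device.
            hints[name] = None
    return hints
```

Calling read_queue_hints("sdb") on a host returns the values as strings. An optimal_io_size of 0 means the device reported no preferred I/O size, and the scheduler entry shows the active scheduler in square brackets.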

 

I am specifically interested in the XFS file system, which has significant capabilities to enable multi-threaded IO, including to its own metadata, which traditionally was a bottleneck in ext2/3 file systems. XFS can effectively sub-divide a large file system into multiple sub-file systems (called allocation groups), with each of these allocation groups having its own independent metadata, allowing for true parallel file creates and space allocation.  Typical large XFS file systems have hundreds of allocation groups, and these allocation groups are "alignment" aware, and this alignment is NOT the start-of-volume alignment.  It is a multiple of the "Physical Extent" size, which can vary. The administrator can also specify the absolute size of an allocation group, which locks in a specific alignment boundary.
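
As an illustration of locking in an allocation group boundary, here is a small Python sketch that picks an allocation group size for a target group count and rounds it down to a chosen alignment unit. The function name, the 32-group target, and the 4 MiB alignment unit are assumptions for the example, not Nimble or XFS recommendations. The result is the kind of value one would then pass to mkfs.xfs as an explicit allocation group size.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

def aligned_ag_size(fs_bytes, ag_count, align_bytes):
    """Allocation group size for roughly ag_count groups, rounded DOWN
    to a multiple of align_bytes so every group starts on a boundary."""
    raw = fs_bytes // ag_count
    aligned = (raw // align_bytes) * align_bytes
    if aligned == 0:
        raise ValueError("alignment unit is larger than one allocation group")
    return aligned

# Example values only: 700 GiB file system, 32 groups, 4 MiB alignment.
size = aligned_ag_size(700 * GIB, 32, 4 * MIB)
print(size // MIB, "MiB per allocation group")
```

Rounding down rather than up means the file system may end up with one extra, smaller allocation group at the end, but every group boundary stays on a multiple of the alignment unit.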

 

The XFS stripe unit size and stripe width describe a traditional striping layout.  They also cause the critical blocks of metadata (like the beginning of a chunk of 64 inodes) to be aligned on a full stripe boundary.  There is also a critical piece of metadata called the metadata journal, which is a large circular buffer. The metadata journal has its own topology hints and size, and can even be assigned to a separate device.  The default journal settings are calculated at file system create time (about 1/2048th of the file system size).  A 700 GB file system has a ~341 MB default journal size, for example.  This can be easily adjusted up or down to exploit an underlying storage capability. The journal is 99% write.
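
The journal figure above is simple arithmetic, sketched here in Python. The 1/2048 ratio is the rule of thumb quoted in this post and is treated as an assumption; the real mkfs.xfs calculation also applies minimum and maximum limits that this sketch ignores.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB

def estimated_log_bytes(fs_bytes, ratio=2048):
    """Estimate the default journal size as fs_bytes / ratio.
    The ratio is an assumed rule of thumb, not read from mkfs.xfs."""
    return fs_bytes // ratio

# 700 GB in decimal units versus 700 GiB in binary units.
decimal_log = estimated_log_bytes(700 * 10**9)
binary_log = estimated_log_bytes(700 * GIB)
print(round(decimal_log / 10**6, 1), "MB for a 700 GB file system")
print(binary_log // MIB, "MiB for a 700 GiB file system")
```

The decimal case works out to roughly the figure quoted above; the binary case is slightly larger because a GiB is larger than a GB.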

 

The existing Nimble Linux recommendation for a Nimble volume with a 4 KB volume block size ... is nothing other than a 1 MB initial Linux volume alignment. This setting results in no other XFS alignment, other than the 4 KB XFS block size. Oh, there are other alignment and "size" settings, but none are being tailored for Nimble's capabilities.

 

There are many options in the Linux IO stack, and in XFS, that have no effect when changed while using "slow" storage.  It is not that this class of options is not useful, it is just that the bottleneck exists in the "slow" storage.  When you use higher performing storage, several of these options now make a difference, and may be required to exploit it.

 

I understand that the existing recommendations combined with the Linux default settings (which change version-to-version) "work", and you can quickly provision Nimble storage to Linux and use it.  It "works", but does it work "well" and "efficiently"? It seems like there are some opportunities for improvement if the administrator wants to take the extra step to specify some non-default settings.

 

What additional information or guidance can Nimble provide for further improving the XFS file system efficiency under Linux for Nimble storage? This will indirectly also help Docker containers under Linux.

 

Thank you for your help.

 

Dave B.
