Hadoop && ext3, ext4, XFS

Underlying File System Options

If you mount the disks with noatime, file access times aren't written back; this speeds up reads. There is also relatime, which stores some access time information but is not as slow as the classic atime behaviour. Remember that any access time information kept by Hadoop is independent of the atime attribute of individual blocks, so Hadoop does not care what your settings are here. If you are mounting disks purely for Hadoop, use noatime.
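One way to confirm which access-time policy a data disk actually ended up with is to read the mount options the kernel reports. The following is a minimal Python sketch that classifies /proc/mounts-style lines. The device names and mount points in SAMPLE are hypothetical, and treating the absence of both flags as classic atime updates is an assumption of this sketch.

```python
def atime_mode(options):
    """Classify a comma-separated mount-option string by access-time policy."""
    opts = set(options.split(","))
    if "noatime" in opts:
        return "noatime"
    if "relatime" in opts:
        return "relatime"
    # Assumption: neither flag listed means classic atime updates.
    return "atime"


def parse_mounts(text):
    """Map mount point -> atime mode for /proc/mounts-style text."""
    modes = {}
    for line in text.splitlines():
        fields = line.split()
        # Expected fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 4:
            modes[fields[1]] = atime_mode(fields[3])
    return modes


# Hypothetical sample; on a Linux node the real text is in /proc/mounts.
SAMPLE = """\
/dev/sda1 / ext3 rw,relatime 0 0
/dev/sdb1 /data/1 ext3 rw,noatime 0 0
"""

if __name__ == "__main__":
    for mount_point, mode in parse_mounts(SAMPLE).items():
        print(mount_point, mode)
```

On a live Linux node, the same function could be fed the contents of /proc/mounts instead of SAMPLE to spot data disks that are not yet mounted with noatime.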

Formatting and tuning options are important. Using tune2fs to set the reserved-block percentage to zero can save you over 25 gigabytes on a 1 terabyte disk. Also, because the underlying file system is going to hold many large files, you can get more space by lowering the number of inodes at format time.
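The space involved can be estimated with a little arithmetic. The sketch below is only an illustration: the 5% reserve, the 16384 bytes-per-inode ratio, and the 256-byte inode size are assumed values (typical mke2fs defaults, not figures from the text above), and a 1 TB disk is taken as 10^12 bytes. Under those assumptions the reserve alone is 50 GB, which is consistent with the "over 25 gigabytes" figure.

```python
def reserved_bytes(capacity_bytes, reserve_percent):
    """Bytes held back for root by the reserved-block percentage."""
    return capacity_bytes * reserve_percent // 100


def inode_table_bytes(capacity_bytes, bytes_per_inode, inode_size):
    """Approximate space taken by the inode tables at format time."""
    inode_count = capacity_bytes // bytes_per_inode
    return inode_count * inode_size


TB = 10**12  # assumption: 1 TB means 10^12 bytes

if __name__ == "__main__":
    # Assumed defaults: 5% reserve (tune2fs -m 0 would set it to zero).
    print("reserve: %.1f GB" % (reserved_bytes(TB, 5) / 1e9))
    # Assumed defaults: one 256-byte inode per 16384 bytes of disk.
    default_inodes = inode_table_bytes(TB, 16384, 256)
    # Hypothetical tuning: one inode per 1 MiB (mke2fs -i 1048576).
    sparse_inodes = inode_table_bytes(TB, 1048576, 256)
    print("inode tables, default ratio: %.1f GB" % (default_inodes / 1e9))
    print("inode tables, 1 MiB ratio:   %.1f GB" % (sparse_inodes / 1e9))
```

The bytes-per-inode value of 1 MiB is just an example; the right ratio depends on how many block files each data directory is expected to hold.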

Ext3

Yahoo! has publicly stated they use ext3. Regardless of the merits of the filesystem, that means that HDFS-on-ext3 has been publicly tested at a bigger scale than any other underlying filesystem that we know of.

XFS

From Bryan on the core-user list on 19 May 2009:

  • We use XFS for our data drives, and we've had somewhat mixed results. One of the biggest pros is that XFS has more free space than ext3, even with the reserved space settings turned all the way to 0. Another is that you can format a 1TB drive as XFS in about 0 seconds, versus minutes for ext3. This makes it really fast to kickstart our worker nodes. We have seen some weird stuff happen though when machines run out of memory, apparently because the XFS driver does something odd with kernel memory. When this happens, we end up having to do some fscking before we can get that node back online. As far as outright performance, I actually *did* do some tests of xfs vs ext3 performance on our cluster. If you just look at a single machine's local disk speed, you can write and read noticeably faster when using XFS instead of ext3. However, the reality is that this extra disk performance won't have much of an effect on your overall job completion performance, since you will find yourself network bottlenecked well in advance of even ext3's performance. The long and short of it is that we use XFS to speed up our new machine deployment, and that's it.

Ext4

The ext4 Linux filesystem uses delayed allocation of data, which makes it handle unplanned server shutdowns and power outages less well than classic ext3. Consider turning delayed allocation off (the nodelalloc mount option in /etc/fstab) unless you trust your UPS.
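A quick audit can show which ext4 entries would still mount with delayed allocation. This is a minimal Python sketch, assuming the standard whitespace-separated /etc/fstab layout and the nodelalloc mount option; the devices and mount points in SAMPLE_FSTAB are hypothetical.

```python
def ext4_with_delalloc(fstab_text):
    """Return mount points of ext4 entries that do not set nodelalloc."""
    flagged = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 4:
            continue  # not a complete fstab entry
        mount_point, fstype, options = fields[1], fields[2], fields[3]
        if fstype == "ext4" and "nodelalloc" not in options.split(","):
            flagged.append(mount_point)
    return flagged


# Hypothetical fstab contents for a worker node with two data disks.
SAMPLE_FSTAB = """\
# device    mount point  type  options              dump pass
/dev/sdb1   /data/1      ext4  noatime,nodelalloc   0    0
/dev/sdc1   /data/2      ext4  noatime              0    0
/dev/sdd1   /data/3      xfs   noatime              0    0
"""

if __name__ == "__main__":
    for mount_point in ext4_with_delalloc(SAMPLE_FSTAB):
        print("delayed allocation still enabled:", mount_point)
```

The check only inspects the configuration text; it does not verify what options a filesystem is currently mounted with.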

http://wiki.apache.org/hadoop/DiskSetup