Or how a single sector can make you ten times happier.
Today I’ll talk about a client issue: getting (extremely) slow write performance when backing up my laptop to a USB drive (a 200GB Samsung S1 Mini). All I usually do is boot the PC with SystemRescueCd, plug the USB disk in, run “ddrescue /dev/sda /mnt/externaldisk/laptop_disk_image.dd“, and let it run overnight. Except that this morning the backup wasn’t finished yet. What was wrong? Long post (for a simple solution) ahead…

The USB disk (shown below as /dev/sdb*) “feels” fast when reading and awfully slow when writing. The simplest way to benchmark an HDD is, of course, dd. Use it along with dstat (an essential tool for pinpointing performance issues, whatever they may be) and you’ll quickly gather some useful figures. Beware! dd can ruin all your data just by mistaking a “b” for an “a”: triple-check that you’re running it on the right devices!

A sequential write test:

balrog ~ # dd if=/dev/zero of=/mnt/temp/x.bin bs=16384 count=$((100*1024))
102400+0 records in
102400+0 records out
1677721600 bytes (1.7 GB) copied, 455.334 s, 3.7 MB/s

Only 3.7 MB/s, definitely slow. 🙁 Note that you shouldn’t use /dev/{random,urandom} as the input: they’re a bottleneck by themselves. /dev/zero, on the other hand, is super-fast. “dd if=/dev/zero of=/dev/null bs=16384 count=$((10000*1024))” (shoveling zeros into /dev/null) is bound only by the CPU, running at about 9.3 GB/s here.
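
One more caveat about dd benchmarks: when writing to a file on a mounted filesystem, dd may report the speed of filling the page cache rather than that of the disk. GNU dd’s conv=fdatasync makes it call fdatasync() before printing the stats, so the figure includes the actual flush to disk. A sketch (the output path is a placeholder, adjust it for your mount point):

```shell
# Write 1.6 GB of zeros and flush to disk before dd reports throughput.
# /mnt/temp/x.bin is a placeholder path for the USB disk's mount point.
dd if=/dev/zero of=/mnt/temp/x.bin bs=16384 count=$((100*1024)) conv=fdatasync
```

In this case the test ran long enough (455 s) for caching not to matter, but for short runs the difference can be dramatic.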

A sequential read test:

balrog ~ # dd if=/mnt/temp/x.bin of=/dev/null bs=16384 count=$((100*1024))
102400+0 records in
102400+0 records out
1677721600 bytes (1.7 GB) copied, 52.1106 s, 32.2 MB/s

30 MB/s, that’s the order of magnitude I was expecting (confirmed here).

If I repeat the write test and run dstat at the same time, I notice that there are no bursts or drops: the speed is constant.

balrog linux-2.6.36-gentoo-r5 # dstat -p -d -D sdb
---procs--- --dsk/sdb--
run blk new| read  writ
  0   0 1.0| 362k  888k    # <-- ignore the first sample
  0   0   0|   0     0
4.0 1.0 1.0|   0     0
1.0 2.0   0|   0   360k    # <-- "dd" starts
  0 2.0   0|   0  3360k
  0 2.0   0|   0  3240k
1.0 2.0   0|   0  3240k
  0 2.0   0|   0  3240k

Since reading works, the kernel and the USB host controller seem to get along well; the issue must lie on the disk’s side. I had no clue what was happening until I tried writing straight to the disk instead of to the first primary partition (i.e. /dev/sdb instead of /dev/sdb1), thus trashing the filesystem (I had no data to lose on that disk: no worries).

balrog ~ # dd if=/dev/zero of=/dev/sdb bs=16384 count=$((100*1024))
1677721600 bytes (1.7 GB) copied, 63.0382 s, 26.6 MB/s

Even though the gap between read and write throughput seems too large (almost an order of magnitude), this is starting to look like a filesystem blocksize/partition alignment issue. Some disks use a physical sector size (PSS) of 512 bytes. Others use 4096 bytes (4 KiB). Others use the latter but tell the OS that they’re using 512 bytes, or the OS simply can’t figure out the right physical sector size… And USB mass storage devices tell the OS almost nothing (hdparm won’t help this time)…

Filesystems (and other “structured storage” systems like, for instance, datafiles in databases) organize their data in blocks. The block size can sometimes be adjusted; 4096 bytes is a quite common value:

balrog ~ # tune2fs -l /dev/sdb1 | grep -i block.size
Block size:               4096

A sector is the smallest chunk of data that can be read from or written to a disk. If the sector size is 512 bytes and the filesystem block size is 4096, the filesystem driver will read/write batches of 8 sectors. Better said: the FS thinks it is dealing with 4k blocks, not knowing that lower-level functions will further split them into eight (if only logically).
Now consider another case: the PSS is 4096, but the drive acts as if it were 512. Physical sectors sit at absolute offsets sector_number*512*8 (0, 4096, 8192, …). What if a 4k write operation happens at offset 1*512*8-512? (3584 doesn’t look like a “bad” offset: as far as the OS is concerned, any multiple of 512 is fine.) The drive, being unable to write less than 4k or at arbitrary locations, will: read physical sector 0, read physical sector 1, modify the last 512-byte chunk of sector 0, modify the first three chunks of sector 1, then write both sectors back (or something similar). If things were properly aligned, a single write operation would have sufficed. Read speed, on the other hand, may be almost unaffected. Think about it: unless you’re dealing with tons of 4k files spread across sector pairs (i.e. two sectors read instead of one), large chunks of data are (hopefully) laid out sequentially on the disk. Reading 1GB plus 512 bytes, instead of 1GB alone, won’t change any benchmark.
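
The arithmetic above can be double-checked with a couple of lines of shell (numbers hard-coded from the example):

```shell
pss=4096                            # physical sector size
off=$(( 1*512*8 - 512 ))            # 3584: the misaligned write offset
len=4096                            # size of the write
first=$(( off / pss ))              # first physical sector touched
last=$(( (off + len - 1) / pss ))   # last physical sector touched
echo "write at $off touches physical sectors $first..$last"
# → write at 3584 touches physical sectors 0..1
```

Two physical sectors for a one-block write: that’s the read-modify-write penalty in a nutshell.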

What’s up with my partition?

balrog ~ # sfdisk -uS -l /dev/sdb    

Disk /dev/sdb: 24321 cylinders, 255 heads, 63 sectors/track
Units = sectors of 512 bytes, counting from 0

   Device Boot    Start       End   #sectors  Id  System
/dev/sdb1            63 390716864  390716802  83  Linux

sfdisk (one of fdisk‘s cousins) shows that the partition starts at byte 63*512=32256, a value that isn’t divisible by 4096. Sector 64, instead, is a good place to start an aligned partition:

63*512/4096 = 7.87
64*512/4096 = 8.00

Similarly, other partitions should start at sectors that are multiples of 8 (because 512*8=4096).
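
The multiple-of-8 rule is easy to turn into a quick check (a sketch; the helper function name is mine, and it assumes 512-byte logical sectors on a 4096-byte PSS disk):

```shell
# Is a start sector (in 512-byte units) aligned to a 4096-byte
# physical sector? It is iff it's a multiple of 8 (512*8 = 4096).
aligned() {
  if (( $1 % 8 == 0 )); then
    echo "sector $1: aligned"
  else
    echo "sector $1: misaligned"
  fi
}

aligned 63   # → sector 63: misaligned
aligned 64   # → sector 64: aligned
```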
This is the corrected partition table. Moving the partition forward (by a mere 512 bytes) causes a 10x write speed increase.

balrog ~ # sfdisk -uS -l /dev/sdb

Disk /dev/sdb: 24321 cylinders, 255 heads, 63 sectors/track
Units = sectors of 512 bytes, counting from 0

   Device Boot    Start       End   #sectors  Id  System
/dev/sdb1            64 390721967  390721904  83  Linux

You may still have a question, though: does aligning a partition mean that the contained filesystem is aligned too? Indeed, that should not be taken for granted.

A filesystem is made up of “your” data and “its” data (the latter being the internal structures needed to organize the former). In either case, a FS will try to pad/align things to the block size. Which is to say: if a partition is aligned to a given boundary, the filesystem (all of its blocks) will be aligned too.
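
In numbers: FS block n sits at absolute byte offset start_sector*512 + n*4096, so once start_sector*512 is a multiple of 4096, every block is. A quick sanity check with the two start sectors seen above:

```shell
# Absolute byte offset of FS block n, for a partition starting at a
# given sector (512-byte units) and a 4096-byte FS block size.
fs_block_offset() { echo $(( $1 * 512 + $2 * 4096 )); }

echo $(( $(fs_block_offset 63 0) % 4096 ))   # 3584: block 0 (and every block) misaligned
echo $(( $(fs_block_offset 64 0) % 4096 ))   # 0: block 0 (and every block) aligned
```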

You can find a description of the ext2 layout here. The partition starts with two sectors reserved for the boot loader. Then comes a 1k chunk holding the ext2 “superblock”. At offset 56 within the superblock we should find the ext2 magic number (0x53EF), and here it is:

giuliano@giuliano ~ $ dd if=/dev/sdb1 bs=512 skip=2 count=1 2>/dev/null | xxd -s +56 -l 2
0000038: 53ef                                     S.

With a 4k block size, the boot area and the superblock together fit in FS block 0, so the next block starts at byte 4096. From then on, everything happens (from the FS point of view) in chunks as big as the configured block size. My disk is a 4k-sector disk with a single partition aligned (as is the FS) to a 4K boundary, and the FS block size is 4K too. Can’t really do any better than that, besides choosing a filesystem that handles the given workload with fewer read/write operations, but I digress…