2013
04.26

File creation time on Linux

What if, on Linux, we wanted to know the creation time of a file? The stat command (and the system call it uses) returns three timestamps: access, modify and change. None of those is really what we are looking for; they only tell us when a file’s contents (or metadata) were last read or written.

Could it be that the filesystem stores more information than the standard POSIX interface exposes? The ext4 inode structure contains the i_crtime field, which is exactly what we are after: the file creation time. The field is only present on ext4 filesystems formatted with “large” (256-byte) inodes; the 128-byte inode size used in previous versions of ext can’t accommodate any extra metadata.
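
A quick way to check what a filesystem was formatted with is tune2fs; look for an “Inode size” of 256 (the device below is the same one used further down, adjust to taste):

sudo tune2fs -l /dev/mapper/precise64-root | grep -i 'inode size'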

How can we obtain the file creation time, then? It seems that the only user space tool able to do so is debugfs:

vagrant@muffin:~$ stat x.txt
  File: `x.txt'
  Size: 2               Blocks: 8          IO Block: 4096   regular file
Device: fc00h/64512d    Inode: 2885160     Links: 1
Access: (0664/-rw-rw-r--)  Uid: ( 1000/ vagrant)   Gid: ( 1000/ vagrant)
Access: 2013-04-25 19:34:22.923399413 +0000
Modify: 2013-04-25 19:34:06.940015907 +0000
Change: 2013-04-25 19:34:06.940015907 +0000
 Birth: -
vagrant@muffin:~$ sudo debugfs -R "stat <$(stat -c '%i' x.txt)>" /dev/mapper/precise64-root | grep crtime
debugfs 1.42 (29-Nov-2011)
crtime: 0x517984fc:9a020d64 -- Thu Apr 25 19:33:16 2013
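
The first hex word is simply the creation time as a Unix timestamp; the second holds the extra precision (nanoseconds plus a couple of extra epoch bits) that the larger ext4 inode makes room for. The former can be decoded with GNU date, which gives back the same Thu Apr 25 19:33:16 timestamp:

date -u -d @$((16#517984fc))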

debugfs is built around libext2fs. We can use that ourselves to read the entire contents of an inode and extract crtime from there. In debugfs, the do_stat function does just that. It involves only two calls into libext2fs: ext2fs_open and ext2fs_read_inode_full.

I decided to try interfacing Ruby with libext2fs, and the FFI gem made that possible without me having to write a single line of C… See below (or on GitHub) for a working example.

vagrant@muffin:~$ sudo ruby crtime.rb /dev/mapper/precise64-root x.txt
crtime: 0x517984fc:9a020d64 -- Thu Apr 25 19:33:16 +0000 2013
# Given a device and a file, returns the file creation time (crtime).
# Works only on ext4 filesystems with 256-byte inodes.
# Requires libext2fs (apt-get install e2fslibs) and root access.
# See:
#      http://www.108.bz/posts/it/file-creation-time-on-linux/
#      http://computer-forensics.sans.org/blog/2011/03/14/digital-forensics-understanding-ext4-part-2-timestamps
# -- giuliano@108.bz

require 'rubygems'
require 'ffi'

module Ext2Fs

    # See: /usr/include/ext2fs/ext2fs.h

    extend FFI::Library
    ffi_lib '/lib/x86_64-linux-gnu/libext2fs.so.2'

    EXT2_FLAG_SOFTSUPP_FEATURES = 0x8000
    EXT2_FLAG_64BITS            = 0x20000
    EXT2_NDIR_BLOCKS            = 12
    EXT2_IND_BLOCK              = EXT2_NDIR_BLOCKS
    EXT2_DIND_BLOCK             = (EXT2_IND_BLOCK + 1)
    EXT2_TIND_BLOCK             = (EXT2_DIND_BLOCK + 1)
    EXT2_N_BLOCKS               = (EXT2_TIND_BLOCK + 1)

    typedef :long,    :errcode_t
    typedef :pointer, :io_manager
    typedef :pointer, :ext2_filsys
    typedef :pointer, :ext2_filsys_ptr
    typedef :pointer, :ext2_inode
    typedef :pointer, :struct_ext2_inode_ptr
    typedef :uint32,  :ext2_ino_t

    attach_variable :unix_io_manager, :unix_io_manager, :pointer;

    # Top portion of struct_ext2_filsys
    class Ext2FilsysAbridged < FFI::Struct
        layout :magic,       :errcode_t,
               :io,          :pointer,
               :flags,       :int,
               :device_name, :string,
               :super,       :pointer,
               :blocksize,   :uint
    end

    class Ext2InodeLarge_linux1 < FFI::Struct
        layout :l_i_version, :uint32
    end
    class Ext2InodeLarge_hurd1 < FFI::Struct
        layout :h_i_translator, :uint32
    end
    class Ext2InodeLarge_osd1 < FFI::Union
        layout :linux1, Ext2InodeLarge_linux1,
               :hurd1,  Ext2InodeLarge_hurd1
    end

    class Ext2InodeLarge_linux2 < FFI::Struct
        layout :l_i_blocks_hi,     :uint16,
               :l_i_file_acl_high, :uint16,
               :l_i_uid_high,      :uint16,
               :l_i_gid_high,      :uint16,
               :l_i_checksum_lo,   :uint16,
               :l_i_reserved,      :uint16
    end
    class Ext2InodeLarge_hurd2 < FFI::Struct
        layout :h_i_frag,      :uint8,
               :h_i_fsize,     :uint8,
               :h_i_mode_high, :uint16,
               :h_i_uid_high,  :uint16,
               :h_i_gid_high,  :uint16,
               :h_i_author,    :uint32
    end
    class Ext2InodeLarge_osd2 < FFI::Union
        layout :linux2, Ext2InodeLarge_linux2,
               :hurd2,  Ext2InodeLarge_hurd2
    end

    class Ext2InodeLarge < FFI::Struct
        layout :i_mode,         :uint16,
               :i_uid,          :uint16,
               :i_size,         :uint32,
               :i_atime,        :uint32,
               :i_ctime,        :uint32,
               :i_mtime,        :uint32,
               :i_dtime,        :uint32,
               :i_gid,          :uint16,
               :i_links_count,  :uint16,
               :i_blocks,       :uint32,
               :i_flags,        :uint32,
               :osd1,           Ext2InodeLarge_osd1,
               :i_block,        [:uint32, EXT2_N_BLOCKS],
               :i_generation,   :uint32,
               :i_file_acl,     :uint32,
               :i_size_high,    :uint32,
               :i_fadd,         :uint32,
               :osd2,           Ext2InodeLarge_osd2,
               :i_extra_isize,  :uint16,
               :i_checksum_hi,  :uint16,
               :i_ctime_extra,  :uint32,
               :i_mtime_extra,  :uint32,
               :i_atime_extra,  :uint32,
               :i_crtime,       :uint32,
               :i_crtime_extra, :uint32,
               :i_version_hi,   :uint32
    end

    # extern errcode_t ext2fs_open(const char *name, int flags, int superblock,
    #                unsigned int block_size, io_manager manager,
    #                ext2_filsys *ret_fs);
    attach_function :ext2fs_open, [:string, :int, :int, :uint, :io_manager, :ext2_filsys_ptr], :errcode_t

    # extern errcode_t ext2fs_read_inode_full(ext2_filsys fs, ext2_ino_t ino,
    #                   struct ext2_inode * inode,
    #                   int bufsize);
    attach_function :ext2fs_read_inode_full, [:ext2_filsys, :ext2_ino_t, :struct_ext2_inode_ptr, :int], :errcode_t

end

if !(ARGV.length == 2 && File.readable?(ARGV[0]) && File.readable?(ARGV[1]))
    puts <<-EOM
    Usage: #{$0} device_with_ext4_filesystem filename
      Make sure the device and the file are readable
    EOM
    exit
else
    device   = ARGV[0]
    filename = ARGV[1]
end

current_fs_ptr = FFI::MemoryPointer.new :pointer
rc = Ext2Fs.ext2fs_open device,
                        Ext2Fs::EXT2_FLAG_SOFTSUPP_FEATURES | Ext2Fs::EXT2_FLAG_64BITS,
                        0, 0,
                        Ext2Fs.unix_io_manager, current_fs_ptr
fail "Error #{rc} on ext2fs_open" if rc != 0
current_fs = Ext2Fs::Ext2FilsysAbridged.new current_fs_ptr.read_pointer

# This is quite fragile, I should also check s_rev_level in struct ext2_super_block
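# (s_inode_size is a 16-bit field at byte offset 88 of struct ext2_super_block; the sum below is its index in an array of uint16s.)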
INODE_SIZE_OFFSET=13*2+6+4*2+2+1*2
inode_size = current_fs[:super].read_array_of_uint16(INODE_SIZE_OFFSET+1)[INODE_SIZE_OFFSET]
fail "inode size is not 256 bytes" if inode_size != 256

inode_buf_ptr = FFI::MemoryPointer.new :char, inode_size
inode_number = File.stat(filename).ino
rc = Ext2Fs.ext2fs_read_inode_full current_fs.pointer, inode_number, inode_buf_ptr, inode_size
fail "Error #{rc} on ext2fs_read_inode_full" if rc != 0
inode = Ext2Fs::Ext2InodeLarge.new inode_buf_ptr

printf "crtime: 0x%08x:%08x -- %s\n", inode[:i_crtime], inode[:i_crtime_extra], Time.at(inode[:i_crtime]).to_s
2012
10.20

VM cloning PowerShell script

If you need to clone a VM, in an automated and scheduled fashion, the script below might help.

The variables at the top let you specify the source VM name, the name its clone will get, the target VMware ESX host (even though, of course, the clone won’t be powered on), the datastore and the vCenter folder.

The script will refuse to run unless a clone already exists and is switched off, meaning that you’ll have to create the first one yourself. At each execution, if everything looks good, the previous clone is deleted and a new one takes its place. Any exception is caught and written to a log file. I use curl.exe to pipe this log file to a centralized alerting system: a small Ruby app that alerts us if anything goes wrong (errors, missed runs, …) with the various batch scripts we’ve got scattered around. I’ll probably blog about it later on.

For any unattended vCenter login/authentication you’ll need a credential store file. Have a look here to learn how to create one. Suitably protect this file because the password it contains is simply obfuscated (using a reversible algorithm) and not encrypted.

Run the script with:

C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe -PSConsoleFile "C:\Program Files (x86)\VMware\Infrastructure\vSphere PowerCLI\vim.psc1" c:\scripts\vmname-clone.ps1
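
To run it on a schedule, a Task Scheduler entry along these lines should do; the task name, time, account and the .cmd wrapper (containing the powershell.exe line above) are all examples:

schtasks /Create /TN vmname-clone /SC DAILY /ST 02:00 /RU SYSTEM /TR c:\scripts\vmname-clone.cmd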

$logfile         = 'c:\scripts\vmname-clone.log'
$clonename       = 'vmname-clone'
$credentialsfile = 'c:\scripts\vmname-clone.credentials.xml'
$clonesourcevm   = 'vmname'
$clonehost       = '10.0.123.123'
$cloneds         = 'NFSSLOW01'
$clonefolder     = 'Foldername'

function log {
    "$($args -join ' ')" | Out-File $logfile -Encoding ASCII -Append
}
Clear-Content $logfile

$vm = ''
$timestamp=$(Get-Date -f yyyyMMdd-HH:mm:ss)
log "start: $timestamp"
try {
    $c = Get-VICredentialStoreItem -file "$credentialsfile"
    Connect-VIServer -Server $c.Host -User $c.User -Password $c.Password -ErrorAction Stop
    $vm = Get-VM -name $clonename
    if ($vm.PowerState -ne 'PoweredOff') {
        throw 'PreviousCloneIssues'
    } else {
        log "removing: $vm"
        Remove-VM -ErrorAction Stop -DeleteFromDisk:$true -Confirm:$false $vm
        log "cloning: in progress"
        New-VM -VM "$clonesourcevm" -VMHost "$clonehost" -Name $clonename -Datastore "$cloneds" -Location "$clonefolder" -ErrorAction Stop;
    }
    $timestamp=$(Get-Date -f yyyyMMdd-HH:mm:ss)
    log "done: $timestamp"
}
catch {
    if (($_.Exception.GetType().FullName -eq 'System.Management.Automation.RuntimeException') -and
        ($_.FullyQualifiedErrorId -eq 'PreviousCloneIssues')) {
        log "error: Clone not found or not Powered Off, refusing to remove it"
    } else {
        log "error: unexpected"
    }
}

# c:\scripts\curl.exe -s -X PUT http://10.0.110.99:8080/vmclone/sample --data-binary "@c:\scripts\vmname-clone.log"
2012
07.28

A common way to hook an external script into Zabbix is the UserParameter directive. These kinds of checks have a limited amount of time to return their result (30 seconds at most), otherwise they just get killed by the Agent and return no data at all; if you didn’t take this condition into account (using .nodata() in your triggers’ expressions) actual problems might go undetected… In practice the deadline should be even shorter: you don’t want the Agent to spend too much time waiting for unresponsive services.
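
For reference, a UserParameter-based version of one of these checks would be declared in zabbix_agentd.conf with a line like this (the helper script is made up):

UserParameter=fetch.bmc.slave123,/usr/local/bin/check_bmc.sh 192.168.123.123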

The script below is a simple parallel HTTP/HTTPS monitor. It spawns up to the given number of threads, fetches the supplied URLs and looks for a matching string in each response. The Parallel gem for Ruby makes it incredibly simple to implement such a scheme. When all the checks have completed, their results are submitted back to Zabbix in one go, with a single call to zabbix_sender.

Why is parallelism important? Because waiting for a host to reply, or for a connection attempt to time out, is just a matter of, well, waiting. Your CPU is not really busy and can do some other work before the host decides to reply. Put it another way: if you can afford 10 threads to monitor 10 hosts with a 30-second response time, your whole check “run” will take 30 seconds total. With a single thread, the same run will take 5 minutes…

Here’s the script; we use it to check the availability of about 25 management interfaces (iLO or IPMI) in our Hadoop cluster.

Oh, one more thing: mind the Mutex. In a multi-threaded program, access to shared data must always be coordinated…

#!/usr/bin/env ruby
require 'rubygems'
require 'parallel'
require 'timeout'
require 'net/http'
 
MaxThreads = 10
MaxTime    = 30

checks = [
    {:key => 'fetch.bmc.slave123', :uri => 'http://192.168.123.123/page/login.html',  :match => 'STR_LOGIN_PASSWORD'},
    {:key => 'fetch.bmc.slave124', :uri => 'http://192.168.123.124/xmldata?item=All', :match => 'ProLiant'}
]

semaphore = Mutex.new
results = []

checker = lambda do |check|
    begin
        Timeout::timeout(MaxTime) do
            response = Net::HTTP.get_response(URI(check[:uri]))
            response.body =~ /(#{check[:match]})/s
            semaphore.synchronize { results.push({:key => check[:key], :v => ($1.nil? ? 0 : 1)}) }
        end
    rescue
        semaphore.synchronize { results.push({:key => check[:key], :v => 0}) }
    end
end
 
ZabbixSender        = File.join(File.dirname(__FILE__), 'zabbix_sender')
ZabbixSenderCmdLine = "#{ZabbixSender} -z 192.168.123.10 -s 'Zabbix Server' -i -"

Parallel.each(checks, :in_threads => MaxThreads, &checker)

data = ''
results.each do |i|
   data << "- #{i[:key]} #{i[:v]}\n"
end

Timeout::timeout(MaxTime) do
    IO.popen(ZabbixSenderCmdLine, :mode => 'w+', :external_encoding => Encoding::ASCII_8BIT) do |file|
        file.write data
    end
end
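
On the Zabbix side, the keys pushed by zabbix_sender must belong to items of type “Zabbix trapper” on the monitored host. The script itself can then be launched from cron; a minimal entry could look like this (path and schedule are just examples):

* * * * * /etc/zabbix/scripts/parallel_http_check.rb >/dev/null 2>&1
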
2011
05.26

The following info is blatantly stolen from this precious post, but the issue I faced is so odd that I wanted to stress it here myself. All credit goes to Philip Hofstetter and his blog.

[Edit: the following still applies even when, as described here, the Provisioning API is enabled.]
[Edit: the script below doesn’t handle Results Pagination. That means it will just return the first 200 or so queried objects. I’ve yet to complete it… Depending on your needs, you may just use Google Apps Manager instead.]

I was trying to fetch some info using the Google Data API and Python. At some point, I decided to move from simple authentication with user-supplied credentials to two-legged OAuth. The Contacts feed remained accessible, while trying to read Groups, Users or Nicknames (by means of the Provisioning API) failed with “Internal Error” 500 or “Authentication Failure”.

As Philip discovered, some feeds just won’t work unless access to them is explicitly granted.
Inconsistency 1: your API client name (the one you’ll use as “consumer_key” in OAuth, which will most likely match your Google Apps domain name) is already listed under “Manage this domain”, “Advanced tools”, “Manage third party OAuth Client access”. The wording “This client has access to all APIs” is clearly a lie.
Inconsistency 2: I followed Philip’s advice and manually added the (readonly) feeds/scopes, except that they don’t show up under “Manage API client access”. Yet they are somehow being honored (i.e. without the tweak, my script won’t work). Moreover, the “Authorize” operation should be done just once and encompass all of the scopes you need: you can’t simply add one later, since adding a single scope revokes access to the previous ones. This behaviour is different from Philip’s (in his screenshots, authorized scopes are indeed visible in the Google Apps domain control panel).
This is what I used:

https://apps-apis.google.com/a/feeds/group/#readonly,https://apps-apis.google.com/a/feeds/user/#readonly,https://apps-apis.google.com/a/feeds/nickname/#readonly

And this is the script:

#!/usr/bin/python

# $Id: list_groups_emails_oauth.py,v 1.3 2011/05/26 16:12:42 giuliano Exp giuliano $

import string
import gdata.apps.service
import gdata.apps.groups.service

consumer_key = 'yourdomain.com'
consumer_secret = 'yourOAuthkey'
sig_method = gdata.auth.OAuthSignatureMethod.HMAC_SHA1

service = gdata.apps.groups.service.GroupsService(domain=consumer_key)
service.SetOAuthInputParameters(sig_method, consumer_key, consumer_secret=consumer_secret, two_legged_oauth=True)
res = service.RetrieveAllGroups()
for entry in res:
    print 'group;' + string.lower(entry['groupId'])

service = gdata.apps.service.AppsService(domain=consumer_key)
service.SetOAuthInputParameters(sig_method, consumer_key, consumer_secret=consumer_secret, two_legged_oauth=True)

res = service.RetrieveAllUsers()
for entry in res.entry:
    print 'email;' + string.lower(entry.login.user_name) + '@' + consumer_key

res = service.RetrieveAllNicknames()
for entry in res.entry:
  if hasattr(entry, 'nickname'):
    print 'alias;' + string.lower(entry.nickname.name) + '@' + consumer_key
2011
05.24

Or how a single sector can make you ten times happier.
Today I’ll talk about a client issue: getting (extremely) slow write performance when backing up my laptop to a USB drive (a 200GB Samsung S1 Mini). All I usually do is boot the PC with SystemRescueCd, plug a USB disk in, run "ddrescue /dev/sda /mnt/externaldisk/laptop_disk_image.dd" and let it run overnight. Except that this morning the backup wasn’t finished yet. What’s wrong? Long post (for a simple solution) ahead…

The USB disk (shown below as /dev/sdb*) “feels” fast when reading and awfully slow when writing. The simplest way to do an HDD benchmark is, of course, dd. Use it along with dstat (an essential tool for pinpointing performance issues, whatever they may be) and you’ll quickly gather some useful figures. Beware! dd can ruin all your data just by mistaking a “b” for an “a”: triple-check and make sure that you’re running it on the right devices!

A sequential write test:

balrog ~ # dd if=/dev/zero of=/mnt/temp/x.bin bs=16384 count=$((100*1024))
102400+0 records in
102400+0 records out
1677721600 bytes (1.7 GB) copied, 455.334 s, 3.7 MB/s

3.7 MB/s only, definitely slow. 🙁 Note that you shouldn’t use /dev/{random,urandom} as the input file: they’re a bottleneck by themselves. /dev/zero, on the other hand, is super-fast. "dd if=/dev/zero of=/dev/null bs=16384 count=$((10000*1024))" (shove zeros into /dev/null) is bound only by the CPU, running at about 9.3 GB/s here.

A sequential read test:

balrog ~ # dd if=/mnt/temp/x.bin of=/dev/null bs=16384 count=$((100*1024))
102400+0 records in
102400+0 records out
1677721600 bytes (1.7 GB) copied, 52.1106 s, 32.2 MB/s

30 MB/s, that’s the order of magnitude I was expecting (confirmed here).

If I repeat the write test and, at the same time, run dstat, I notice that there are no bursts or drops: speed is constant.

balrog linux-2.6.36-gentoo-r5 # dstat -p -d -D sdb
---procs--- --dsk/sdb--
run blk new| read  writ
  0   0 1.0| 362k  888k    # <-- ignore the first sample
  0   0   0|   0     0
4.0 1.0 1.0|   0     0
1.0 2.0   0|   0   360k    # <-- "dd" starts
  0 2.0   0|   0  3360k
  0 2.0   0|   0  3240k
1.0 2.0   0|   0  3240k
  0 2.0   0|   0  3240k

Since reading works, the kernel and the USB host controller seem to get along well; the issue should lie on the disk’s side. I had no clue what was happening until I tried writing straight to the disk instead of to the first primary partition (i.e. /dev/sdb instead of /dev/sdb1), thus trashing the filesystem (I’ve got no data to lose on that disk: no worries).

balrog ~ # dd if=/dev/zero of=/dev/sdb bs=16384 count=$((100*1024))
1677721600 bytes (1.7 GB) copied, 63.0382 s, 26.6 MB/s

Even though the difference between read and write throughput seems too large (almost one order of magnitude), this is starting to look like a FS blocksize/partition alignment issue. Well, some disks use a physical sector size (PSS) of 512 bytes. Others use 4096 bytes (4 KiB). Others use the latter but tell the OS that they’re using 512 bytes, or, more simply, the OS can’t figure out the right physical sector size… And USB mass storage devices tell the OS almost nothing (hdparm won’t help this time)…

Filesystems (or other “structured storage” systems like, for instance, datafiles in databases) organize their data in blocks. The block size can sometimes be adjusted; 4096 bytes is a quite common value:

balrog ~ # tune2fs -l /dev/sdb1 | grep -i block.size
Block size:               4096

A sector is the smallest chunk of data that can be read from or written to a disk. If its size is 512 and the filesystem block size is 4096, the filesystem driver will read/write batches of 8 sectors. Better said: the FS thinks it is dealing with 4k blocks, not knowing that lower-level functions will further split them into eight (if only logically).
Consider another example: the PSS is 4096, but the drive acts as if it were 512. Physical sectors sit at absolute offsets that are multiples of 512*8 (0, 4096, 8192, …). What if a 4k write operation happens at offset 1*512*8-512? (3584: it doesn’t look like a “bad” offset; as far as the OS is concerned, any multiple of 512 is fine). The drive, being unable to write less than 4k and only at proper locations, will: read sector 0, read sector 1, modify the last 512-byte chunk of sector 0, modify the first seven chunks of sector 1, then write both sectors back (or something similar). If things were properly aligned, a single write operation would have sufficed. Read speed, on the other hand, may be almost unaffected. Think about it: unless you’re dealing with tons of 4k files spread across sector pairs (i.e. two sectors read instead of one), large chunks of data are (hopefully) laid out sequentially on the disc. Reading 1GB plus 512 bytes, instead of 1GB alone, won’t change your benchmark at all.

What’s up with my partition?

balrog ~ # sfdisk -uS -l /dev/sdb    

Disk /dev/sdb: 24321 cylinders, 255 heads, 63 sectors/track
Units = sectors of 512 bytes, counting from 0

   Device Boot    Start       End   #sectors  Id  System
/dev/sdb1            63 390716864  390716802  83  Linux

sfdisk (one of fdisk’s cousins) shows that the partition starts at byte 63*512=32256. This value isn’t divisible by 4096 (the division yields a non-integer result). Sector 64, instead, is a good place to start an aligned partition:

63*512/4096 = 7.87
64*512/4096 = 8.00

Similarly, other partitions should start at sectors that are multiples of 8 (because 512*8=4096).
This is the corrected partition table. Moving the partition forward (by a mere 512 bytes) causes a 10x write speed increase.

balrog ~ # sfdisk -uS -l /dev/sdb

Disk /dev/sdb: 24321 cylinders, 255 heads, 63 sectors/track
Units = sectors of 512 bytes, counting from 0

   Device Boot    Start       End   #sectors  Id  System
/dev/sdb1            64 390721967  390721904  83  Linux
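
For the record, with the sfdisk of that era the realignment itself can be scripted too. Something along these lines recreates a single Linux partition starting at sector 64 (destructive: it rewrites the partition table, and the old filesystem is toast anyway, so triple-check the device name):

echo '64,,83' | sfdisk -uS /dev/sdb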

You may still have a question though: does aligning a partition mean that the contained filesystem is aligned too? You’re right, that assumption should not be taken for granted.

A filesystem is made up of “your” data and “its” data (the latter being the internal structures needed to organize the former). In any case, a FS will try to pad/align everything to its block size. That is to say: if a partition is aligned to a given boundary, the filesystem (all of the blocks composing it) will be aligned too.

You can find a description of the ext2 layout here. The partition starts with two sectors reserved for the boot loader. Then comes a 1k chunk holding the ext2 “superblock”. At offset 56 within the superblock we should find the ext2 magic number (0xEF53, stored little-endian as the bytes 53 EF), and here it is:

giuliano@giuliano ~ $ dd if=/dev/sdb1 bs=512 skip=2 count=1 2>/dev/null | xxd -s +56 -l 2
0000038: 53ef                                     S.

The next byte after the superblock is byte 4096. From then on, everything happens (from the FS point of view) in chunks as big as the configured block size. My disk is a 4k-sector disk, formatted with a single partition aligned (as is the FS) to a 4K boundary; the FS block size is 4K too. You can’t really do any better than that, besides choosing a filesystem that manages to handle the given workload with fewer read/write operations, but I digress…

2011
04.20

So, Customer starts updating all of his VMware ESX hosts and things take a turn for the worse. VMs are crawling (ping response times from 0 to 1000ms), console access through the vSphere client doesn’t always work, and the hosts’ CPU percentage is unnaturally high. The cause is apparent: path thrashing.
Path thrashing happens when, for some reason, SCSI LUNs are continuously reassigned from one controller (Target) to another. ESX has a hard time “bouncing” I/O back and forth on the right Fibre Channel path. On Active/Passive SAN arrays a LUN can be “owned” by just one controller at a time. If the LUN owner has to be changed, because of a hardware failure (path, Controller, SFP/GBIC, FC switch, …) or because the Initiator asks for it, the LUN has to “trespass” (in EMC parlance), i.e. transition to another controller. The “command” to do so can be issued by the Initiator or internally by the storage subsystem.
Back to today’s case, I was dealing with an IBM DS4800 where LUNs flipped like mad between controller A and B. How to stop it quickly?

  • If anything, the flipping shows that failover works as expected (VMs don’t crash despite the chaos).
  • That said, I could just disconnect a controller. Not really: the same storage system hosts an Oracle RAC cluster, humming along happily, unaffected by the issue.
  • I need a way to selectively “hide” a controller from one or more hosts. I can do it easily by tweaking the SAN zoning configuration.

A Zone (much like a VLAN) is basically a group of WWNs (or ports). Objects in the Zone can only talk to each other. When creating Zones, it is common practice to “go minimal”: they should contain as few members as possible. I usually name them like this:
    Z_HOSTNAME_P1_DS4800_CA1_CB1
HBA Port 1 of HOSTNAME can see Controller A/Port 1 and Controller B/Port 1 of the DS4800.
Thus, going through each ESX server’s Zone, I just remove the Controller that the host shouldn’t see. Path thrashing is temporarily stopped.
The above rant serves mainly as a pro-zoning argument. “If every HBA port has to access every Controller’s port, why implement zoning?”. As you just read, zoning saved me from serious trouble today.
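If the fabric happens to be made of Brocade switches, for instance, pulling a controller port out of a zone is a matter of a few commands; the alias and config names below are made up, and the syntax is from memory, so double-check it:

zoneremove "Z_HOSTNAME_P1_DS4800_CA1_CB1", "DS4800_CB1"
cfgsave
cfgenable "PROD_CFG"
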
About the “real” issue, it was ultimately caused by a thing called “Auto Volume Transfer” (AVT)1. Let’s say that a LUN is assigned to controller A, but I/O for the LUN is issued to controller B. With AVT switched on the storage system will automatically transfer the LUN from A to B.
The Customer’s ESX servers are all (correctly) configured to use the “Most Recently Used” (MRU) path to a LUN. It seems that ESX, from a certain version on, issues I/O on the standby path, causing havoc if AVT is on. I can’t tell whether that’s because it is fooled into thinking that the storage is an Active/Active one, or because it just sort of periodically “probes” standby paths.
How do you switch AVT off? By using the DS “Storage Manager” and changing the ESX Hosts’ type from “Linux” (or whatever) to “LNXCLVMWARE”. This applies to all of the LSI-derived Storage Systems (IBM, SUN StorageTek, Engenio, …). The latter host type is the right one to use when hooking an ESX cluster to an IBM DS Storage System, although “Linux” seemed to do just fine on not-so-new ESX hosts, prior to version 4.1.x… When AVT is off, the Storage will decide to trespass LUNs only in the event of an internal hardware failure; normally, LUN ownership will be handled by the multipathing software on the Host.

More reading on the subject:

[1] Differences between the “Linux” and “LNXCLVMWARE” host types.
[2] How does Auto Volume Transfer (AVT) work? Courtesy of Google’s cache. Lists which SCSI commands trigger AVT.
[3] A really nice blog post about the same issue described here. (Found, of course, when I was writing mine)

  1. or even “Auto Disk Transfer” (ADT)
2011
04.12

Quick post to show you how DHCP reservations can be replicated between Windows servers. Why would you want to do that? Because often, to achieve DHCP service high availability, DHCP scopes are split evenly between servers. When a client PC is connected to the network, it sends out a broadcast to discover which DHCP servers are active on that particular Ethernet segment. Depending on their number, the PC will receive one or more answers, each offering an IP address. If a client is to be assigned a fixed IP, all of those offers should bear the same IP address. Hence, DHCP reservations need to be configured identically on every DHCP server covering the given scope. As far as I know, this has to be done by hand. To speed up the process, I use netsh (see Netsh commands for DHCP).

The command below will dump all of the reservations to a file named “reservations.txt”. findstr filters netsh output keeping just the info we need.

C:\Documents and Settings\Administrator> netsh dhcp server \\dhcpsrv1 scope 10.4.0.0 dump | findstr Add.reservedip > reservations.txt

Each line in “reservations.txt” should look like this:

Dhcp Server 10.4.1.1 Scope 10.4.0.0 Add reservedip 10.4.5.3 58b04576339a "pcname.domain.lan" "Reservation Comment" "BOTH"

10.4.1.1 is the IP address for dhcpsrv1, the “source” DHCP server.

Open “reservations.txt” in a text editor, check that everything is fine and substitute the source DHCP server IP with the target’s one (i.e.: 10.4.1.1 becomes 10.4.1.2), save the file and run:

C:\Documents and Settings\Administrator> netsh < reservations.txt
netsh>
Changed the current scope context to 10.4.0.0 scope.

Command completed successfully.
netsh>
Command completed successfully.
netsh>
[..]

That’s it; not a fancy trick, but it may be useful nonetheless. Just beware that, when there are thousands of clients, netsh could take a while to complete its job (especially the “dump” step)…

2011
03.30

JavaScript for Sysadmins, again

Following up on the previous post, let me show you other ways to trick web applications into doing what they were not designed to do: saving the Sysadmin some typing and avoiding errors. I’ll use JavaScript, jQuery, Greasemonkey and Perl to automate Firefox and implement a sort of dynamic form filling. Let’s start with a screencast:

The config I had to do (on a SonicWALL firewall) involves about 70 subnets, each similar to the others. Only the subnet’s addressing scheme changes, which makes the creation of VLANs/objects/rules a repetitive and error-prone task.

What’s happening in the screencast? VLAN sub-interfaces are being created automatically, without me having to type anything at all. How could that be? A simple Greasemonkey script calls a “web service” (AJAX style), fetching the needed data and filling the form for me (I just click the “OK” or “Cancel” buttons). Why this whole Greasemonkey/web service mess? Because here, as in the previous post, JavaScript code is basically being injected into a “page”. Pages (or tabs, or windows) are run by the browser in a sandbox: they can’t exchange data with each other. Thus page A (the webapp) can’t access code/data in page B (our code). Moreover, injected JavaScript gets lost when the page is closed (think ugly GUIs where dialog windows pop up just to be destroyed shortly after). We need a way (Greasemonkey) to re-inject the code each time our page is shown, and some external, long-lived entity to keep/update state (the web service).

Thanks to the HTTP::Server::Simple Perl module, building the web service is trivial. The only logic behind it is keeping track of the current VLAN’s index, iterating through each value on subsequent web service calls:

#!/usr/bin/perl
use strict;
package LameServer;
use base qw(HTTP::Server::Simple::CGI);

my @VLANS = qw(10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98);

my $COUNTER = 0;

sub handle_request {
    my ($self, $cgi) = @_;

        print "HTTP/1.0 200 OK\r\n";
        print "Content-type:text/plain\r\n\r\n";
        print STDERR "$VLANS[$COUNTER] index $COUNTER\n";
        print $VLANS[$COUNTER];
        $COUNTER++;
        $COUNTER = 0 if $COUNTER >= @VLANS;
}

1;

package main;

my $server = LameServer->new(3333);
$server->run();
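
Before wiring it to Greasemonkey, the service can be exercised by hand; each request should return the next VLAN in the list (10, then 11, and so on):

curl http://localhost:3333/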

And here’s the Greasemonkey script. It runs automatically when the “add VLAN” page pops up, does the sort-of-AJAX call, uses jQuery to properly fill the form.

// ==UserScript==
// @name           CurrentVLAN
// @namespace      dontcare
// @require        http://ajax.googleapis.com/ajax/libs/jquery/1.3.2/jquery.min.js
// @include        https://192.168.168.168/*
// ==/UserScript==

if (window.location.toString().match(/192.168.168.168.*editInterface/)) {
    GM_xmlhttpRequest({
        method: "GET",
        url: "http://localhost:3333/",
        onload: function(response) {
            window.GM_CurrentVLAN = parseInt(response.responseText);
            GM_log(window.GM_CurrentVLAN);
            }
    });
    function GM_wait() {
        if(typeof unsafeWindow.jQuery == 'undefined') {
            window.setTimeout(GM_wait,100);
        }
        else {
            $ = unsafeWindow.jQuery;
        }
    }
    GM_wait();

    window.setTimeout(function() {
            $('select[name=interface_Zone]').val('CZ');
            unsafeWindow.setIfaceContent(); // "Zone" onchange function.
            $('input[name=iface_vlan_tag]').val(window.GM_CurrentVLAN+100);
            $('select[name=iface_vlan_parent]').val('X0');
            $('input[name=lan_iface_lan_ip]').val("10.0."+window.GM_CurrentVLAN+".1");
            $('input[name=lan_iface_ping_mgmt]').attr('checked', true);
            }, 1000);
}

I built similar scripts to create static ARP entries and routes. Another one took care of firewall symbolic objects (names) but, as that part of the config is itself carried out by AJAX (in a non reloading window), I didn’t need Greasemonkey, just Firebug.
No kidding, the above tricks saved me half a day of tedium…

2011
03.19

FireQuery fun

Or how to toggle a thousand checkboxes by clicking none.
Have you ever wondered whether jQuery might be relevant to your everyday sysadmin job? It is: web-based GUIs proliferate and the Command Line is sooo nineties (this one’s not mine)…
Working on a SonicWALL/Aventail SSL VPN box, I was asked to simplify how permissions were mapped to Users. What a User could or couldn’t do was defined at the User level. Ok, let’s just:

  • Create an A/D group for each role/profile.
  • Configure permissions (rules, resources access, …) on various Communities (Aventail parlance for roles/profiles).
  • Put the right Users into the right A/D group.
  • Cleanup: generally, each Community should just have A/D groups assigned to it. Get rid of unnecessary Community memberships.

The first three tasks were easy (thanks also to the DS* commands). The fourth, uhm:

Removing members from a Community means unchecking each User. In my case, that meant more than one thousand clicks and the beginnings of carpal tunnel syndrome. At page three I started thinking about a less saddening way.

FireQuery is a Firefox extension that lets you “inject” jQuery into any webpage. Here’s what I did:

  • Went to the page depicted above (the one that lets you assign members to a Community).
  • Launched Firebug.
  • Inspected the DOM and noticed that each checkbox’s value begins with “AV”.
  • Used the debugger to see what’s going on when checking/unchecking members. Nothing really strange: just a bit of JavaScript to highlight a row depending on its checkbox’s status.
  • Hit the jQuerify button. FireQuery is needed because Aventail’s web-based GUI doesn’t use jQuery.
  • Went to Console, typed the JavaScript one-liner below and hit Run
$('input[value^="AV"]').attr('checked', false)

which translates to: use jQuery to select all of the checkboxes whose value starts with “AV”. Uncheck the selected checkboxes.

See? No useless clicking: a GUI has been CLI-fied.

2011
02.09

Today I ran into a weird issue while installing Oracle Grid Control Agent 10.2.0.3 on Linux. Right after typing “runInstaller”, OUI crashed with a segmentation fault… Let me talk about some of the troubleshooting maneuvers you may need to perform should you find yourself in similar trouble.

Here are the relevant details:

  • OS: Red Hat Enterprise Linux Server 5.3 x86-64
  • GC Agent: Oracle Enterprise Manager 10g Grid Control Release 3 (10.2.0.3) for Linux x86-64
  • GC Console: Oracle Enterprise Manager 10g Release 5 (10.2.0.5) Grid Control for Microsoft Windows 32-bit

And here’s the error message (the most interesting portions):

An unexpected exception has been detected in native code outside the VM.
Unexpected Signal : 11 occurred at PC=0xE44F46A7
Function=[Unknown.]
Library=(N/A)

[..]

Current Java thread:
        at sun.awt.motif.MToolkit.init(Native Method)
        at sun.awt.motif.MToolkit.<init>(Unknown Source)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)

[..]

Heap at VM Abort:
Heap
 def new generation   total 576K, used 84K [0xe6510000, 0xe65b0000, 0xe7090000)
  eden space 512K,   4% used [0xe6510000, 0xe65152f8, 0xe6590000)
  from space 64K, 100% used [0xe65a0000, 0xe65b0000, 0xe65b0000)
  to   space 64K,   0% used [0xe6590000, 0xe6590000, 0xe65a0000)
 tenured generation   total 6212K, used 4461K [0xe7090000, 0xe76a1000, 0xefb10000)
   the space 6212K,  71% used [0xe7090000, 0xe74eb5f8, 0xe74eb600, 0xe76a1000)
 compacting perm gen  total 5632K, used 5398K [0xefb10000, 0xf0090000, 0xf3b10000)
   the space 5632K,  95% used [0xefb10000, 0xf00558b0, 0xf0055a00, 0xf0090000)

Local Time = Tue Feb  8 09:45:48 2011
Elapsed Time = 1
#
# The exception above was detected in native code outside the VM
#
# Java VM: Java HotSpot(TM) Client VM (1.4.2_08-b03 mixed mode)
#

To go past this show-stopper I tried a few things…

The heap report produced by java at crash time seemed to indicate a memory shortage. By editing the “install/oraparam.ini” file, you can tweak how much RAM is available to OUI’s JVM: just alter the “JRE_MEMORY_OPTIONS” value.

#JRE_MEMORY_OPTIONS=" -mx150m"
JRE_MEMORY_OPTIONS=" -Xms512m -Xmx2048m"

This is also a safe place to put additional command line parameters: they’ll mostly be passed on to java’s command line. I said “mostly” because OUI’s wrapper/launcher seems to check some sort of allowed-parameters list and may refuse to go on if something doesn’t look right.

The “-XX:MaxPermSize=32m” is one of the knobs that doesn’t pass the sanity check. In order to run OUI’s JVM by hand, with the right parameters, just keep the first lines of runInstaller (the ones starting with ‘Arg:‘):

Arg:0:/tmp/OraInstall2011-02-08_04-55-33PM/jre/1.4.2/bin/java:
Arg:1:-Doracle.installer.library_loc=/tmp/OraInstall2011-02-08_04-55-33PM/oui/lib/linux:
Arg:2:-Doracle.installer.oui_loc=/tmp/OraInstall2011-02-08_04-55-33PM/oui:
Arg:3:-Doracle.installer.bootstrap=TRUE:
[..]
Arg:20:-timestamp:
Arg:21:2011-02-08_04-55-33PM:
Arg:22:-nowelcome:

Strip “^Arg:”, “^\d*:” and “:$”, add a trailing “ \”, and you’ll have an OUI-launching shell script you can alter at will.
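
Assuming those Arg: lines have been saved to a file, the transformation is a quick sed job (file names here are made up; remember to drop the backslash from the last line):

grep '^Arg:' oui_args.txt | sed -e 's/^Arg:[0-9]*://' -e 's/:$/ \\/' > run_oui_by_hand.sh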

Increasing the JVM’s memory had no effect: the heap report looked fine (usage percentages went down) but the crash was still there.

Another useful switch is “-XX:+ShowMessageBoxOnError“. It makes java halt on error, allowing us to attach a debugger and perform a stack backtrace, e.g.:

Unexpected Signal: 11, PC: 0x6d4626a7, PID: 4866
An error has just occurred.
To debug, use 'gdb /tmp/OraInstall2011-02-08_11-01-42AM/jre/1.4.2/bin/java 4866'; then switch to thread -136623920
#0  0xffffe410 in __kernel_vsyscall ()
#1  0xf7e462b6 in nanosleep () from /lib/libc.so.6
#2  0xf7e460df in sleep () from /lib/libc.so.6
#3  0xf7bdc6d7 in os::message_box ()
   from /tmp/OraInstall2011-02-08_11-01-42AM/jre/1.4.2/lib/i386/client/libjvm.so
#4  0xf7bd9c52 in os::handle_unexpected_exception ()
   from /tmp/OraInstall2011-02-08_11-01-42AM/jre/1.4.2/lib/i386/client/libjvm.so
#5  0xf7bddbf6 in JVM_handle_linux_signal ()
   from /tmp/OraInstall2011-02-08_11-01-42AM/jre/1.4.2/lib/i386/client/libjvm.so
#6  0xf7bdc9d8 in signalHandler ()
   from /tmp/OraInstall2011-02-08_11-01-42AM/jre/1.4.2/lib/i386/client/libjvm.so
#7  <signal handler called>
#8  0x6d4626a7 in ?? ()
#9  0x6d6d75b9 in XtToolkitInitialize () from /usr/lib/libXt.so.6

I also tried to “inject” a couple of newer JVMs into the stage directory. The quickest way is to borrow one from another installer.

[oracle@racnode01 orastage]$ find . -type d -name oracle.swd.jre -exec echo {} \; -exec ls {} \;
./Linux_x86_64_Grid_Control_full_102030/Disk1/stage/Components/oracle.swd.jre
1.4.2.8.0
./p6810189_10204_Linux-x86-64/Disk1/stage/Components/oracle.swd.jre
1.4.2.14.0

The server has a “working” directory where Oracle patches/products are stored before use. In my case, changing OUI’s JVM from 1.4.2.8 to 1.4.2.14 is a matter of copying:

./p6810189_10204_Linux-x86-64/Disk1/stage/Components/oracle.swd.jre/1.4.2.14.0

to:

./Linux_x86_64_Grid_Control_full_102030/Disk1/stage/Components/oracle.swd.jre

Then modifying the same “oraparam.ini” file mentioned before.

#JRE_LOCATION=../stage/Components/oracle.swd.jre/1.4.2.8.0/1/DataFiles
JRE_LOCATION=../stage/Components/oracle.swd.jre/1.4.2.14.0/1/DataFiles

You could also download a specific JRE from http://java.sun.com (sorry: from Oracle) and:

  • install the new JRE somewhere
  • unzip (-t) the “filegroup1.jar” file that corresponds to OUI’s “factory” JRE. Note how the directories are laid out (something like: “jre/1.4.2”). Modify the new JRE accordingly.
  • zip the new JRE, rename the resulting file to “filegroup1.jar” and copy it to the right place (see the sketch below).
  • modify oraparam.ini and choose the JVM version you’ll boot OUI into.
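
In shell terms, the repackaging boils down to something like the following; all paths and the JRE version are examples, not the exact commands I ran:

# run from .../Disk1/stage/Components/oracle.swd.jre
mkdir -p /tmp/newjre/jre/1.4.2                              # recreate the jre/1.4.2 layout
cp -a /opt/j2re1.4.2_19/. /tmp/newjre/jre/1.4.2/            # the JRE downloaded by hand
( cd /tmp/newjre && zip -qr filegroup1.jar jre )            # repack it as filegroup1.jar
mkdir -p ./1.4.2.19.0/1/DataFiles
cp /tmp/newjre/filegroup1.jar ./1.4.2.19.0/1/DataFiles/

Here’s how the stage area looked afterwards:
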
[oracle@racnode01 oracle.swd.jre]$ pwd
/opt/orastage/Linux_x86_64_Grid_Control_full_102030/Disk1/stage/Components/oracle.swd.jre
[oracle@racnode01 oracle.swd.jre]$ find . -type f
./1.4.2.8.0/1/DataFiles/filegroup1.jar   # <-- factory
./1.4.2.8.0/1/DataFiles/filegroup2.jar
./1.4.2.8.0/1/DataFiles/filegroup3.jar
./1.4.2.8.0/1/DataFiles/filegroup4.jar
./1.4.2.8.0/1/DataFiles/filegroup5.jar
./1.4.2.14.0/1/DataFiles/filegroup1.jar  # <-- stolen from patchset p6810189
./1.4.2.14.0/1/DataFiles/filegroup2.jar
./1.4.2.14.0/1/DataFiles/filegroup3.jar
./1.4.2.14.0/1/DataFiles/filegroup4.jar
./1.4.2.14.0/1/DataFiles/filegroup5.jar
./1.4.2.19.0/1/DataFiles/filegroup1.jar  # <-- downloaded by hand

Three different JREs, each of them segfaulting in the same spot, as we saw in the backtrace:

#9  0x6d6d75b9 in XtToolkitInitialize () from /usr/lib/libXt.so.6

Who’s the owner of libXt?

[root@racnode01 ~]# rpm -q --queryformat '%{NAME}-%{VERSION}-%{RELEASE} %{ARCH}\n' -f /usr/lib/libXt.so.6
libXt-1.0.2-3.1.fc6 i386

After making sure that none of the running processes was using that package’s contents, I decided to remove it (rpm -e --nodeps libXt-1.0.2-3.1.i386) and reinstall it. Surprisingly, OUI worked flawlessly after this last action. Too bad I can’t really explain why. 🙁 The libXt version didn’t change before/after the reinstall. I should diff it anyway against what’s left untouched on the other RAC cluster members. I’ll update the post when I have a stricter explanation…