Analysing TCP-based protocols often means dealing with TCP sessions (also called streams or flows).
A TCP connection, from an application's point of view, is much like a bidirectional file descriptor through which ordered data can be read or written. “On the wire”, though, no ordering is guaranteed: data is split into packets that may arrive out of order, be retransmitted, and be interleaved with other traffic. You can capture packets using a sniffer, but to make any sense of them you also need an analyzer able to do the reordering/reassembling job. Wireshark, for instance, doubles as a sniffer and an analyzer, backed by the ubiquitous libpcap.

Imagine having dumped/sniffed 1 GB worth of traffic. We would like to pinpoint a single TCP session and isolate it from the rest. Here’s how we could proceed:

  • Identify the source/destination addresses and source/destination ports we’re interested in, then throw away any packet that doesn’t match this 4-tuple. That’s basically what Wireshark does when you select a packet, right-click and hit “Follow TCP Stream”. As long as the same tuple doesn’t get reused for another, unrelated session, this method works just fine [1].
  • Reorder/reassemble packets.
  • Extract packets’ payload.
  • Present the payload in a way that makes sense. That depends on the L7 protocol. HTTP without keep-alive is strictly request/response: first print what the client sent to the server (outbound traffic), then what the server answered (inbound traffic). Other protocols may behave differently, and you may choose to separate inbound traffic from outbound, or rely on timing to correctly present the dialogue between peers.
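
The steps above can be sketched in a few lines of plain Perl. This is only a toy illustration, not what Wireshark or Net::Analysis actually do internally: the packets are hypothetical, already-parsed hashes (a real capture would come from libpcap), the field names and the reassemble() helper are made up for the example, and sequence-number wraparound, overlapping segments and SYN/FIN handling are all ignored.

```perl
use strict;
use warnings;

# Hypothetical, already-parsed packets, deliberately out of order.
# "seq" is a relative TCP sequence number (byte offset + 1).
my @packets = (
    { src => '10.0.0.2:40000', dst => '10.0.0.1:80',    seq => 17, payload => "Host: example.com\r\n\r\n" },
    { src => '10.0.0.1:80',    dst => '10.0.0.2:40000', seq => 1,  payload => "HTTP/1.0 200 OK\r\n\r\n" },
    { src => '10.0.0.9:40001', dst => '10.0.0.1:80',    seq => 1,  payload => "unrelated traffic" },
    { src => '10.0.0.2:40000', dst => '10.0.0.1:80',    seq => 1,  payload => "GET / HTTP/1.0\r\n" },
    { src => '10.0.0.2:40000', dst => '10.0.0.1:80',    seq => 1,  payload => "GET / HTTP/1.0\r\n" },  # retransmission
);

# Rebuild one direction of a session: keep only packets matching the
# tuple, drop duplicate sequence numbers, sort by sequence number and
# concatenate the payloads.
sub reassemble {
    my ($from, $to, @pkts) = @_;
    my %seen;
    my @ordered =
        sort { $a->{seq} <=> $b->{seq} }
        grep { !$seen{ $_->{seq} }++ }
        grep { $_->{src} eq $from && $_->{dst} eq $to } @pkts;
    return join '', map { $_->{payload} } @ordered;
}

my ($client, $server) = ('10.0.0.2:40000', '10.0.0.1:80');

# HTTP without keep-alive: show the request first, then the response.
my $request  = reassemble($client, $server, @packets);
my $response = reassemble($server, $client, @packets);
print $request, $response;
```

Each direction is reassembled separately because the two peers use independent sequence number spaces; that is also what makes the "request first, then response" presentation straightforward for simple HTTP.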

Besides Wireshark, there are tools that do just that and can also be automated. See TShark or tcpflow.

What if you want to script everything and build your own TCP analyzer? The Perl module Net::Analysis is surprisingly convenient for the task. It does the dirty job I described above and hands your code TCP sessions that are ready to be processed.

Practical goal: saving MP3 files streamed by Grooveshark. Disclaimer: I’m by no means pushing anyone to illegally download stuff. This is just a working, sensible, instructional example that uses a song freely available anyway (by Revolution Void; check them out, they’re great).

GroovesharkListener.pm extends Net::Analysis::Listener::HTTP. It sniffs all the traffic from/to port 80 and, as soon as it sees an HTTP response whose Content-Type contains “audio”, dumps the response body to a file and quits. Simple as that.

Put the module some place where Perl can find it and then launch (as root):

# perl -MNet::Analysis -e main GroovesharkListener 'port 80'
(starting live capture)
text/html; charset=UTF-8
[...some more cruft...]
Dumping 8481224 bytes to groovesharkgyzBy.mp3 be patient...

# id3v2 -l groovesharkgyzBy.mp3
id3v1 tag info for groovesharkgyzBy.mp3:
Title  : Invisible Walls                 Artist: Revolution Void              
Album  : Increase the Dosage             Year: 2004, Genre: Other (12)
Comment: http://www.jamendo.com/         Track: 1

That’s it. Just one more thing: Net::Analysis doesn’t allow you to select a specific network interface; it just picks the first available one. I wrote a small patch to address this shortcoming. It adds a “device=” parameter that you can use this way:

# perl -MNet::Analysis -e main GroovesharkListener,device=wlan1 'port 80'

And here’s what GroovesharkListener.pm looks like:

# choose a song
# run (as root or via sudo):
#   perl -MNet::Analysis -e main GroovesharkListener 'port 80'
# click "play" and wait for the file to be dumped...
#                             -- Giuliano - http://www.108.bz
package Net::Analysis::Listener::GroovesharkListener;
use strict;
use warnings;
use base qw(Net::Analysis::Listener::HTTP);
use File::Temp;

sub http_transaction {
    my ($self, $args) = @_;
    my ($http_req) = $args->{req};
    my ($http_resp) = $args->{resp};

    print $http_req->uri(), "\n";
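    # The Content-Type header may be missing; fall back to an empty string.
    my $content_type = $http_resp->header('Content-Type');
    $content_type = '' unless defined $content_type;
    print "$content_type\n";
    if ($content_type =~ /audio/i) {
        my $fh = File::Temp->new(TEMPLATE => 'groovesharkXXXXX',
            SUFFIX   => '.mp3',
            UNLINK   => 0);    # keep the file after the program exits
        binmode $fh;           # MP3 data is binary
        print "Dumping ".length($http_resp->content)." bytes to ".$fh->filename." be patient...\n";
        print $fh $http_resp->content;
        close $fh;
        exit 0;                # quit after the first audio response
    }
}

1;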

  1. Newer versions of Wireshark use the “tcp.stream eq x” display filter instead.