Flow characterization

Work in progress ...

Intruders often use non-standard ports, or standard ports in non-standard ways, to evade detection. This research will develop algorithms and software to identify IP sessions based on statistical metrics of the packet flows and an adaptive knowledge base of flows.

Most intrusion detection systems (IDSs) are based on recognizing known attack signatures and/or anomalous activity. Network-based IDSs look for attack signatures on standard service ports (DNS, IMAP, POP, SNMP, SYSLOG), monitor interactive activity on standard interactive service ports (TELNET, RLOGIN, RSH, FTP), or look for activity on ports used by known backdoors (NetBus, Back Orifice). In our experience over the last few years, these detection methods have been quite effective. However, we have had several intrusions in which the attacker used non-standard ports and avoided detection.

An attacker may have gained access to an internal system by capturing an account password at another site. The attacker can then access an internal system and install backdoors on non-standard ports for later access. An insider can do the same. Even the currently popular PC-based backdoors (NetBus and Back Orifice) are "port agile" -- the attacker can choose the port to use for the backdoor. These backdoors can later be used for interactive access, chat channels, or file transfers.

The attacker may also use standard services in non-standard ways. Firewalls may pass HTTP, mail, DNS, or ICMP traffic, and IDSs often ignore these services when they originate from the inside. The attacker can tunnel his own services through these standard ports, for example, transferring files in what look like DNS packets, or providing interactive service through ICMP echo packets.

At a large site, it is impractical for the IDS to examine the contents of every packet in an attempt to identify the actual service being provided by a flow of packets between two hosts/ports. The objective of this proposed research is to develop algorithms and software to identify a network flow between two hosts based on just a few features of that flow.

The research can be divided into three broad areas:
1) network traffic capture, data reduction, storage, and analysis
2) statistical analysis of flow characteristics
3) learning and decision systems for classifying flows

Network traffic is captured and a portion of each packet is saved for post-processing. The information retained for each packet includes time of arrival (to the microsecond), source address, source port, destination address, destination port, packet length, and TCP flags. This reduces a packet of hundreds of bytes to twenty or thirty bytes. (Even so, this summary information could amount to billions of bytes per day for a site like ORNL.) Since we do not rely on "content" to classify flows, our methods should also work for encrypted flows.
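
To make the size of such a summary concrete, here is a minimal sketch of a fixed-width record holding the fields listed above. The layout, the field names, and the helpers pack_summary and unpack_summary are assumptions made for illustration (IPv4 addresses, 16-bit packet length); the text does not describe the actual storage format.

```python
import socket
import struct
from typing import NamedTuple

# Hypothetical layout (an assumption, not the format used in this research):
# network byte order, no padding, IPv4 only.
#   seconds (4) + microseconds (4) + src addr (4) + dst addr (4)
#   + src port (2) + dst port (2) + packet length (2) + TCP flags (1)
RECORD = struct.Struct("!II4s4sHHHB")


class PacketSummary(NamedTuple):
    ts_sec: int      # arrival time, whole seconds
    ts_usec: int     # arrival time, microsecond part
    src_addr: str    # dotted-quad IPv4 address
    dst_addr: str
    src_port: int
    dst_port: int
    length: int      # packet length in bytes
    tcp_flags: int   # TCP flag bits


def pack_summary(p: PacketSummary) -> bytes:
    """Encode one packet summary as a fixed-width binary record."""
    return RECORD.pack(
        p.ts_sec, p.ts_usec,
        socket.inet_aton(p.src_addr), socket.inet_aton(p.dst_addr),
        p.src_port, p.dst_port, p.length, p.tcp_flags,
    )


def unpack_summary(buf: bytes) -> PacketSummary:
    """Decode a record produced by pack_summary."""
    sec, usec, src, dst, sport, dport, length, flags = RECORD.unpack(buf)
    return PacketSummary(sec, usec, socket.inet_ntoa(src),
                         socket.inet_ntoa(dst), sport, dport, length, flags)
```

With this particular layout each record occupies RECORD.size bytes, which falls in the "twenty or thirty bytes" range mentioned above.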

Flow plots

The following plots illustrate the differing characteristics of flows for various known network services. Each flow is drawn as an individual plot on a page of 80 plots, with all axes scaled identically so that flows are easy to compare. Each packet in a flow is plotted as a point and connected by a line to the point for the previous packet. The horizontal axis is the time between packets (log10) and the vertical axis is the packet size (log10). Points are colored to indicate which side of the connection sent the packet and whether it follows a packet from the same side or the other side. Note the rich structure, and the similarities within a service type, that are available for classifying flows. As one would expect, interactive traffic (presumably with a human on one end) has longer interpacket delays than batch services such as email or ftp.
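
The coordinates of such a plot can be derived from the packet summaries alone. The sketch below is one possible reading of the description above, not the plotting code used for these figures: the function name flow_points, the 0..3 category encoding, and the one-microsecond floor on delays (to avoid taking the log of zero) are all assumptions.

```python
import math

# Assumed floor for interpacket delay: timestamps have microsecond
# resolution, so a zero difference is treated as one microsecond.
MIN_DELAY = 1e-6


def flow_points(packets):
    """Convert one flow into plot points.

    packets: list of (time_sec, size_bytes, side) in arrival order,
             where side is 0 or 1 for the two ends of the connection.
    Returns one (log10_delay, log10_size, category) tuple for every
    packet after the first. The category (hypothetical encoding) is
    2 * side + 1 if the previous packet came from the other side,
    or 2 * side if it came from the same side.
    """
    points = []
    for (t_prev, _, side_prev), (t, size, side) in zip(packets, packets[1:]):
        delay = max(t - t_prev, MIN_DELAY)
        category = 2 * side + (1 if side != side_prev else 0)
        points.append((math.log10(delay), math.log10(max(size, 1)), category))
    return points
```

The four categories correspond to the four colors described above; the first packet of a flow has no predecessor and is therefore not plotted in this sketch.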

email port 25
These are various flows for email.

ftp data port 20
These are various flows for file transfer (data port).

telnet port 23
These are various flows for telnet.

ssh port 22
These are various flows for secure shell.

rlogin port 513
These are various flows for rlogin.

mystery flows
We captured two "unknown" flows from a compromised machine. Our visual classifier classified them as interactive, but not quite telnet-like. Later, with additional information, we were able to determine that they were a relayed "chat" session (irc) that the hacker was using. Attackers often use hosts as intermediaries, or relays, to hide their true network locations. The plots of the two flows are clearly similar; the colors are "reversed" because the relay swaps "source" and "destination".

Flow classification

We capture a flow structure by binning the relationship between packet size and interpacket delay. A few packet sizes are binned individually because of their high frequency and special use. The rest are binned in intervals.

Binning example
Bin structure example with 288 (4 x 72) bins. Each flow is characterized with a vector of 288 counts.
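
A minimal sketch of turning a flow into a 288-count vector follows. The text gives only the total (4 x 72); everything else here is an assumption chosen to reproduce that total: 4 packet categories (side of the connection, and same or other side as the previous packet), 8 size bins (two sizes binned individually plus six intervals), and 9 log-spaced delay bins. The specific sizes and boundaries are hypothetical.

```python
from bisect import bisect_right

# All constants below are illustrative assumptions, not the bins used
# in this research.
SPECIAL_SIZES = (40, 1500)               # sizes given their own bin
SIZE_EDGES = (64, 128, 256, 512, 1024)   # boundaries -> 6 size intervals
DELAY_EDGES = tuple(10.0 ** e for e in range(-5, 3))  # seconds -> 9 bins

N_CATEGORY = 4
N_SIZE = len(SPECIAL_SIZES) + len(SIZE_EDGES) + 1     # 8
N_DELAY = len(DELAY_EDGES) + 1                        # 9
N_BINS = N_CATEGORY * N_SIZE * N_DELAY                # 4 x 72 = 288


def size_bin(size):
    """Special sizes get bins 0..1; all other sizes fall into intervals."""
    if size in SPECIAL_SIZES:
        return SPECIAL_SIZES.index(size)
    return len(SPECIAL_SIZES) + bisect_right(SIZE_EDGES, size)


def delay_bin(delay):
    """Index of the log-spaced interval containing the interpacket delay."""
    return bisect_right(DELAY_EDGES, delay)


def flow_vector(events):
    """Count (category, size_bytes, delay_sec) events into N_BINS bins."""
    counts = [0] * N_BINS
    for category, size, delay in events:
        cell = size_bin(size) * N_DELAY + delay_bin(delay)
        counts[category * N_SIZE * N_DELAY + cell] += 1
    return counts
```

Each flow, whatever its length, is reduced to a vector of the same dimension, which is what allows flows to be compared with the techniques described next.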

To look at a low dimensional representation of the combined flow structures for all of the selected ports, we can use principal components (PC) or multidimensional scaling (MDS). PC is a technique that takes the high dimensional flow characteristics and provides orthogonal linear combinations that successively capture the directions of greatest variation. MDS is a non-linear technique that uses a distance matrix from the high dimensional representation and attempts to preserve those distances in a lower dimensional space. We obtained better separation of port traffic with MDS.
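
For readers unfamiliar with MDS, the following is a small pure-Python sketch of classical (Torgerson) MDS, which embeds points using the leading eigenvectors of the double-centered squared-distance matrix. It is an illustration only: the text does not say which MDS variant or which distance between flow vectors was used, and the function names and the power-iteration eigensolver are choices made here for self-containment.

```python
import math
import random


def double_center(dist):
    """Return B = -1/2 * J * D^2 * J for a symmetric distance matrix D."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    return [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
             for j in range(n)] for i in range(n)]


def top_eigenpairs(mat, k, iters=500, seed=0):
    """Leading k eigenpairs of a symmetric matrix by power iteration
    with deflation. Adequate for small examples; a numerical library
    would be the usual choice for real data."""
    n = len(mat)
    mat = [row[:] for row in mat]          # work on a copy
    rng = random.Random(seed)
    pairs = []
    for _ in range(k):
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [sum(mat[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:               # nothing left to extract
                break
            v = [x / norm for x in w]
        mv = [sum(mat[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(a * b for a, b in zip(v, mv))   # Rayleigh quotient
        pairs.append((lam, v))
        for i in range(n):                 # deflate the found component
            for j in range(n):
                mat[i][j] -= lam * v[i] * v[j]
    return pairs


def classical_mds(dist, k=2):
    """Embed n items in k dimensions from an n x n distance matrix."""
    pairs = top_eigenpairs(double_center(dist), k)
    # Non-positive eigenvalues (possible for non-Euclidean distances)
    # contribute a zero coordinate.
    return [[math.sqrt(max(lam, 0.0)) * v[i] for lam, v in pairs]
            for i in range(len(dist))]
```

Note that with plain Euclidean distances classical MDS yields the same configuration as principal components, so any difference between the two methods would come from the choice of distance or from a non-metric MDS variant.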

MDS on Combined Ports
A four dimensional representation of the combined flow structures.
The first two dimensions in this representation show a separation of flows that have a human (ports 23, 22, and 513) on one end from those that have a machine at both ends (ports 20 and 25). The two "unknown" flows (marked as port 0) from a compromised machine are clearly on the human side.

MDS on ports 20 and 25
Focusing the multidimensional scaling on the machine flows only shows that port 20 (file transfer) clusters into two distinct groups: one easily separated from port 25 and the other more difficult to separate.

MDS on ports 0, 22, 23, and 513
Applying multidimensional scaling to the flows with a human on one end shows some separation but also considerable overlap between the flows. Port 23 (telnet) also presents two distinct clusters. The two "unknown" flows (marked as port 0) would most likely be classified as port 22 (secure shell).

Upcoming tasks in this research are to automate the classification of flows using principal component analysis and model-based clustering. We will also investigate Markov chains for classifying flows. Most of our statistical analysis has used S-PLUS, but we have begun investigating XGobi and XGvis. UT graduate student Salma Abdulrahman has worked on speeding up the classification by coding the PCA and classifier in C. Her results can be found here.

Our preliminary tech report (11/27/00): postscript (6 MB) or pdf (5 MB)

Related work

Paxson and Zhang are presenting Detecting Backdoors and Detecting Stepping Stones at USENIX in 2000. Some of their techniques rely on content analysis with Paxson's Bro, but, closer to our work, they also describe "content-free" flow analyses that use a heuristic metric based on the ratio of small to large packets and their spacing within a flow.

Frank has a '94 paper, Artificial Intelligence and Intrusion Detection: Current and Future Directions, that describes classifying flows based on packet count, data volume, and duration of flow.

NIST's Artificial Neural Networks for Misuse Detection

paper on audio stream flows

Wide Area Internet Traffic Patterns and Characteristics

Last Modified thd@ornl.gov
back to Tom Dunigan's page or the ORNL home page