Work in progress ...
Intruders often use non-standard ports, or standard ports in non-standard ways, to bypass detection. This research will develop algorithms and software to identify IP sessions based on statistical metrics of the packet flows and an adaptive knowledge base of known flows.
Most intrusion detection systems (IDSs) are based on recognizing known attack signatures and/or anomalous activity. Network-based IDSs look for attack signatures on standard service ports (DNS, IMAP, POP, SNMP, SYSLOG), monitor interactive activity on standard interactive service ports (TELNET, RLOGIN, RSH, FTP), or look for activity on ports used by known backdoors (Netbus, BackOrifice). In our experience over the last few years, these detection methods have been quite effective. However, we have had several intrusions where the attacker used non-standard ports and avoided detection.
An attacker may have gained access to an internal system by capturing an account password at another site. The attacker can then access an internal system and install backdoors on non-standard ports for later access. An insider can do the same. Even the currently popular PC-based backdoors (Netbus and BackOrifice) are "port agile": the attacker can choose the port to use for the backdoor. These backdoors can later be used to provide interactive access, chat channels, or file transfers.
The attacker may also use standard services in non-standard ways. Firewalls may pass HTTP, mail, DNS, or ICMP traffic, and IDSs often ignore these services when they originate from the inside. The attacker can tunnel his own services through these standard ports, for example, transferring files in what look like DNS packets, or providing interactive service through ICMP echo packets.
At a large site, it is impractical for the IDS to examine the contents of every packet in an attempt to identify the actual service being provided by a flow of packets between two hosts/ports. The objective of this proposed research is to develop algorithms and software that identify the service carried by a network flow between two hosts based on just a few features of that flow.
The research can be divided into three broad areas:
1) network traffic capture, data reduction, storage, and analysis
2) statistical analysis of flow characteristics
3) learning and decision systems for classifying flows
Network traffic data is captured and some portion of each packet is saved for later post-processing. Information retained for each packet includes time of arrival (to the microsecond), source address, source port, destination address, destination port, packet length, and TCP flags. This reduces a packet of hundreds of bytes to twenty or thirty bytes. (Even so, this summary information could amount to billions of bytes per day for a site like ORNL.) Since we are not relying on "content" for classifying flows, our methods should work for encrypted flows.
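As a rough illustration of the data reduction described above, the sketch below packs the listed fields into a fixed-size record. The field widths, the field order, and the names RECORD, pack_summary, and unpack_summary are assumptions made for this example (IPv4 addresses only); they are not necessarily the format the capture software uses.

```python
import socket
import struct

# Hypothetical fixed-width layout for one packet summary (an assumption,
# not the project's actual storage format), in network byte order:
#   seconds, microseconds, source address, destination address,
#   source port, destination port, packet length, TCP flags
RECORD = struct.Struct("!II4s4sHHHB")


def pack_summary(ts_sec, ts_usec, src, sport, dst, dport, length, flags):
    """Reduce one packet's header fields to a fixed-size record."""
    return RECORD.pack(ts_sec, ts_usec,
                       socket.inet_aton(src), socket.inet_aton(dst),
                       sport, dport, length, flags)


def unpack_summary(record):
    """Recover the header fields from a packed record."""
    ts_sec, ts_usec, src, dst, sport, dport, length, flags = RECORD.unpack(record)
    return (ts_sec, ts_usec, socket.inet_ntoa(src), sport,
            socket.inet_ntoa(dst), dport, length, flags)
```

With this particular layout each record is 23 bytes, which falls in the "twenty or thirty bytes" range quoted above.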
Flow plots
The following plots illustrate the differing characteristics of flows for various known network services. Each flow is drawn as an individual plot on a page of 80 plots, with all axes scaled identically for easy comparison of flows. Each packet in a flow is plotted as a point and connected by a line to the point for the previous packet. The horizontal axis is the time between packets (log10) and the vertical axis is the packet size (log10). Packets are distinguished by color, which indicates the side of the connection that sent the packet and whether it follows a packet from the same or the other side of the connection. Note the rich structure, and the similarities within a service type, that are available for classifying flows. As one would expect, interactive traffic (presumably with a human on one end) has longer interpacket delays than batch-type services such as email or FTP.
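The plotted quantities can be sketched as follows. This is a minimal example, assuming each packet is given as a (time, size, side) tuple with side equal to 0 or 1; the function name plot_points, the numbering of the four color categories, and the clamping of zero delays to one microsecond are all choices made for this illustration.

```python
import math

MIN_DELAY = 1e-6  # assumed floor: timestamps are kept to the microsecond


def plot_points(packets):
    """packets: list of (time_seconds, size_bytes, side), side is 0 or 1.

    Returns one (log10_delay, log10_size, category) tuple for every packet
    after the first. The category (0..3) encodes the color: which side sent
    the packet and whether the previous packet came from the other side.
    """
    points = []
    for (t_prev, _, side_prev), (t, size, side) in zip(packets, packets[1:]):
        delay = max(t - t_prev, MIN_DELAY)       # avoid log10(0)
        follows_other = int(side != side_prev)   # 1 if the sender changed
        category = 2 * side + follows_other
        points.append((math.log10(delay), math.log10(max(size, 1)), category))
    return points
```

Connecting successive points with line segments then gives the trace described above for a single flow.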
Flow classification
We capture a flow structure by binning the relationship between packet
size and interpacket delay. A few packet sizes are binned
individually because of their high frequency and special use. The rest
are binned in intervals.
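The binning can be sketched roughly as below. The bin structure shown later on this page has 288 (4 x 72) bins; the example reproduces that count, but the particular special sizes, the interval edges, and the reading of 4 as packet categories and 72 as 9 size bins by 8 delay bins are assumptions made for illustration only.

```python
import bisect
import math

# All bin boundaries below are hypothetical, chosen only so that the
# vector has 4 x 72 = 288 entries.
SPECIAL_SIZES = (40, 41, 1500)            # sizes that get a bin of their own
SIZE_EDGES = (64, 128, 256, 512, 1024)    # interval edges for all other sizes
DELAY_EDGES = (-5, -4, -3, -2, -1, 0, 1)  # edges in log10(seconds)
N_SIZE = len(SPECIAL_SIZES) + len(SIZE_EDGES) + 1   # 3 + 6 = 9
N_DELAY = len(DELAY_EDGES) + 1                      # 8
N_CATEGORY = 4   # sending side x same/other side as the previous packet


def size_bin(size):
    """Special sizes are binned individually, the rest in intervals."""
    if size in SPECIAL_SIZES:
        return SPECIAL_SIZES.index(size)
    return len(SPECIAL_SIZES) + bisect.bisect_right(SIZE_EDGES, size)


def delay_bin(delay):
    """Bin the interpacket delay on a log10 scale."""
    return bisect.bisect_right(DELAY_EDGES, math.log10(max(delay, 1e-6)))


def flow_vector(packets):
    """packets: list of (time_seconds, size_bytes, side), side is 0 or 1.

    Returns the vector of bin counts that characterizes the flow.
    """
    counts = [0] * (N_CATEGORY * N_SIZE * N_DELAY)
    for (t_prev, _, side_prev), (t, size, side) in zip(packets, packets[1:]):
        category = 2 * side + int(side != side_prev)
        index = (category * N_SIZE + size_bin(size)) * N_DELAY + delay_bin(t - t_prev)
        counts[index] += 1
    return counts
```

Each flow, whatever its length, is thereby reduced to a count vector of the same fixed size, which is what makes flows directly comparable in the analysis that follows.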
To look at a low-dimensional representation of the combined flow
structures for all of the selected ports, we can use principal
components (PC) or multidimensional scaling (MDS). PC is a technique
that takes the high-dimensional flow characteristics and produces
orthogonal linear combinations that successively capture the directions
of greatest variation. MDS is a non-linear technique that uses a
distance matrix computed from the high-dimensional representation and
attempts to preserve those distances in a lower-dimensional space. We
obtained better separation of port traffic with MDS.
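The two projections can be sketched in a few lines of plain Python. This is an illustrative sketch only: it uses power iteration to find eigenvectors and swaps in classical (Torgerson) MDS as the simplest variant, whereas the non-linear MDS described above is likely a different variant. The function names are hypothetical, and the code assumes at least k clearly non-zero, positive eigenvalues.

```python
import math


def top_eigen(matrix, k, iters=500):
    """Top-k eigenpairs of a symmetric matrix by power iteration with deflation."""
    n = len(matrix)
    a = [row[:] for row in matrix]
    pairs = []
    for j in range(k):
        v = [math.sin(i + 1.0 + j) for i in range(n)]  # arbitrary deterministic start
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [sum(a[r][c] * v[c] for c in range(n)) for r in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        lam = sum(v[r] * a[r][c] * v[c] for r in range(n) for c in range(n))
        pairs.append((lam, v))
        # Deflate: remove the component just found before looking for the next.
        a = [[a[r][c] - lam * v[r] * v[c] for c in range(n)] for r in range(n)]
    return pairs


def pca_scores(rows, k):
    """Project each row (one flow vector) onto the first k principal components."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[c] for r in rows) / n for c in range(d)]
    x = [[r[c] - means[c] for c in range(d)] for r in rows]
    cov = [[sum(xi[p] * xi[q] for xi in x) / (n - 1) for q in range(d)]
           for p in range(d)]
    axes = [v for _, v in top_eigen(cov, k)]
    return [[sum(xi[c] * ax[c] for c in range(d)) for ax in axes] for xi in x]


def classical_mds(dist, k):
    """Embed n items in k dimensions from an n x n distance matrix."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in sq]
    total = sum(row) / n
    # Double-center the squared distances to get an inner-product matrix.
    b = [[-0.5 * (sq[i][j] - row[i] - row[j] + total) for j in range(n)]
         for i in range(n)]
    pairs = top_eigen(b, k)
    return [[math.sqrt(max(lam, 0.0)) * v[i] for lam, v in pairs] for i in range(n)]
```

When the distance matrix holds Euclidean distances, classical MDS gives the same configuration as the principal component scores, up to sign; the two differ once another distance (or a non-metric fitting criterion) is used. For vectors as long as the 288-count flow characterization, a numerical library would be a more practical choice than this pure-Python loop.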
Upcoming tasks in this research are to automate the classification
of flows based on principal component analysis and model-based clustering.
We will also investigate Markov chains for classifying flows.
Most of our statistical analysis has used S-PLUS, but we have begun investigating XGobi and XGvis.
UT grad student Salma Abdulrahman has worked on speeding up the classification by coding the PCA and classifier in C. Her results can be found here.
Our preliminary tech report (11/27/00): postscript (6 MB) or pdf (5 MB)
Related work
Paxson and Zhang are presenting "Detecting Backdoors"
and "Detecting Stepping Stones" at USENIX in 2000.
Some of their techniques rely on content analysis using Paxson's Bro,
but, closer to our work, there are also "content-free" flow analyses
that use a heuristic metric based on the ratio of small to large
packets and their spacing within a flow.
Frank has a '94 paper on
Artificial Intelligence and
Intrusion Detection: Current and Future Directions
that describes classifying flows based on packet count, data volume,
and duration of flow.
NIST's Artificial Neural Networks for Misuse Detection
paper on audio stream flows
Wide Area Internet Traffic Patterns and Characteristics
Bin structure example with 288 (4 x 72) bins. Each flow is
characterized with a vector of 288 counts.
A four-dimensional representation of the combined flow structures.
The first two dimensions in this representation show a separation
of flows that have a human (ports 23, 22, and 513) on one end from
those that have a machine at both ends (ports 20 and 25). The two
"unknown" flows (marked as port 0) from a compromised machine are
clearly on the human side.
Focusing the multidimensional scaling on the machine flows only
shows that port 20 (file transfer) clusters into two distinct
groups: one easily separated from port 25 and the other more difficult.
Applying multidimensional scaling to the flows with a human on one end
shows some separation but also considerable overlap between the flows.
Port 23 (telnet) also presents two distinct clusters. The two
"unknown" flows (marked as port 0) would most likely be classified as
port 22 (secure shell).
thd@ornl.gov
back to Tom Dunigan's page
or the ORNL home page