Abstract

Distributed Denial of Service (DDoS) attacks
cause revenue loss, productivity loss, reputation damage, data theft, and other
harm to large banking and business firms. This creates the need for good
prevention and detection techniques. This paper provides a better solution to
these problems using feature analysis. The statistical characteristics, or
parameters, of the incoming packets are Absolute time interval, Absolute
session count, Absolute session interval, Absolute page access count, Absolute
bandwidth consumption, and Absolute ratio of packet count. Incoming packets are
classified as normal or attack by applying the K-Means, J48, and Naïve Bayes
classification algorithms to normal and attack profiles built from previously
available datasets. The information gain algorithm is used to reduce the number
of parameters, which decreases the false positive and false negative errors and
increases detection efficiency. Performance also becomes more consistent after
information gain is applied: detection efficiency is 98% before and 99.5%
after. In this paper, CAIDA datasets are used for feature selection and
classification.

Introduction

A DoS attack is an intentional attempt by
malicious users to completely disrupt or degrade the availability of services
or resources to legitimate users. DoS attacks are of two types. The first is
the single-source attack; these are easily countered by several defense
mechanisms, since the source of the attack can simply be blocked. The second is
the multiple-source attack (DDoS), in which multiple systems are used to
perform the attack.

A Distributed Denial of Service (DDoS) attack, which makes a server respond
slowly to clients or even refuse their access, is one of the major threats and
will continue to be so in the future. A DDoS attack is an attempt to make an
online service unavailable by overwhelming it with traffic from multiple
sources. These attacks target a wide variety of important resources, from banks
to news websites, and present a major challenge to ensuring that people can
publish and access important information.

Most common DDoS attacks use a layered structure, as shown in Figure 1, in
which the attacker uses a client program to connect to the handlers. The
handlers are compromised systems that issue commands to the bots, or zombie
agents, which perform the DDoS attack. The bots are compromised by the attacker
through the handlers, using mechanisms such as Trojans and other malware.
During the attack, the attacker gives a command to the handlers, the handlers
command the bots, and the bots flood the victim with tremendous amounts of
traffic, consuming all of the victim's resources.

Fig 1: An illustration of a DDoS Attack

Types of DDoS Attacks

There are several types of DDoS attacks; some common ones (both past and
present) are:

UDP Flood: The User Datagram Protocol (UDP) is a sessionless networking
protocol, and this method is accordingly referred to as a UDP flood. Random
ports on the target machine are flooded with packets, causing it to listen for
an application on each port and report back with an ICMP packet.

SYN Flood: The attacker sends repeated spoofed connection requests from a
variety of sources to the target server. The server responds with a SYN-ACK
packet to complete the TCP connection, but instead of the connection being
closed, it is allowed to time out. Eventually, under a strong attack, the
host's resources are exhausted and the server goes offline.

HTTP Flood: In an HTTP flood DDoS attack, the attacker exploits seemingly
legitimate GET and POST requests to attack a web server or application. HTTP
floods do not use malformed packets, spoofing, or reflection techniques, and
require less bandwidth than other attacks to bring down the targeted site or
server. The attack is most effective when it forces the server or application
to allocate the maximum possible resources in response to every single request.

Ping Flood Attack: This is one of the simplest attacks: the attacker floods the
victim's computer with ICMP Echo Request (ping) packets. For each ping packet
from the attacker, the victim answers with a reply packet, so the attack
consumes both outgoing and incoming bandwidth. It is most effective when the
attacker has more bandwidth than the victim.

Reflected Attack: The attacker creates forged packets that are sent out to as
many computers as possible. When those computers receive the packets they
reply, but the replies carry a spoofed source address that routes them to the
target. All the computers reply at once, bogging the site down with requests
until the server's resources are exhausted.

Peer-to-Peer Attacks: In this type of attack, a peer-to-peer server provides
the opportunity for attackers. Instead of using a botnet to siphon traffic
towards the target, the attacker exploits a peer-to-peer server to route
traffic to the target website: people using the file-sharing hub are instead
sent to the target website until it is overwhelmed and knocked offline.

Slowloris: This type of DDoS attack can be difficult to mitigate. Slowloris is
a tool that allows an attacker to use very few resources during an attack.
Connections to the target machine are opened with partial requests and kept
open as long as possible, sending further HTTP headers at certain intervals but
never completing the requests. Ever more connections are thus held open until
the target website is forced offline.

          
On the basis of protocol, DDoS attacks can be further classified into
network/transport-level and application-level DDoS attacks.

Network/transport-level DDoS attack: In network-layer DDoS attacks, attackers
send a large number of bogus packets (packets with bogus payloads and invalid
SYN and ACK numbers) toward the victim server, and normally the attackers use
IP spoofing. In network-layer DDoS attacks, the victim server or IDS can easily
distinguish legitimate packets from DDoS packets. The transport layer is
especially vulnerable to Denial of Service (DoS) and Distributed Denial of
Service (DDoS) attacks; the two most popular transport-layer protocols are TCP
(Transmission Control Protocol) and UDP (User Datagram Protocol). At this
level, mostly TCP, UDP, ICMP, and DNS protocol packets are used to launch the
attacks.

Application-layer DDoS attack:

These attacks generally consume less bandwidth and are stealthier than
volumetric attacks, but they can have a similar impact on service, as they
target specific characteristics of well-known applications such as HTTP, DNS,
VoIP, or the Simple Mail Transfer Protocol (SMTP). They focus on disrupting
legitimate users' service by exhausting server resources. An application-level
DDoS attack overloads an application server, for example by making excessive
login, database-lookup, or search requests. Application attacks are harder to
detect than other kinds of DDoS attacks: since the connections are already
established, the requests may appear to come from legitimate users.

Request-Flooding Attacks: High rates of seemingly legitimate application
requests (such as HTTP GETs, DNS queries, and SIP INVITEs) deluge web servers
and degrade or disrupt their normal functioning.

Asymmetric Attacks: High-workload requests that take a heavy toll on server
resources such as CPU, memory, or disk space.

Repeated Single Attacks: An isolated high-workload request sent across many TCP
sessions; a stealthier way to combine asymmetric and request-flooding
layer-seven DDoS attacks.

Application-Exploit Attacks: The attack vectors here are vulnerabilities in
applications, for instance hidden-field manipulation, buffer overflows,
scripting vulnerabilities, cross-site scripting, cookie poisoning, and SQL
injection.

 

Related Work

Distributed Denial of Service (DDoS) attacks have become a common threat to
online businesses. With over 50,000 distinct attacks per week, DDoS attacks
have become a highly visible and costly form of cyber-crime. Detection
mechanisms can be classified into statistics-based and heuristics-based
detection algorithms. A statistics-based detection system (SBDS) characterizes
normal traffic/packet data and then generalizes the scope of "normal"; traffic
that falls outside this scope is treated as an attack (or anomaly). To improve
accuracy, an SBDS needs to learn as much as possible of the traffic pattern
that can be active on the network. Network traffic/packet information is
processed with machine learning algorithms, which differentiate attack traffic
from the established normal patterns of the network. All traffic is measured by
an anomaly score for the specific event, and when the score is higher than a
threshold, the detection system takes further action on the attack
traffic/packets.
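The scope-of-normal idea above can be sketched as a simple statistical profile; the benign packets-per-second samples and the 3-sigma threshold below are illustrative assumptions, not the exact scheme of any cited system.

```python
from math import sqrt

def fit_normal_profile(samples):
    """Learn the 'scope of normal' from benign traffic measurements."""
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / len(samples)
    return mean, sqrt(var) or 1.0  # guard against zero spread

def anomaly_score(value, mean, std):
    """How far an observation lies outside the normal profile, in std units."""
    return abs(value - mean) / std

# hypothetical benign packets-per-second samples
mean, std = fit_normal_profile([100, 110, 95, 105, 98, 102])
threshold = 3.0                                   # act on traffic above this
print(anomaly_score(900, mean, std) > threshold)  # flood-like burst -> True
```

Any sample whose score exceeds the threshold would be passed on for further action, as described above.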

A heuristics-based detection system (HBDS) applies logic drawn from statistical
analysis of the network traffic to its threshold decisions. An HBDS requires
fine tuning to adapt to the network traffic and to minimize false negatives and
false positives.

DDoS detection with an SBDS depends on its false negative and false positive
error rates. The detector can discriminate traffic that is more likely to be an
attack from normal traffic. However, some botnets, e.g. Mydoom, can bypass such
detection approaches, because the approaches consider only the transport and/or
network layers. Therefore, botnets that generate HTTP packets resembling
legitimate ones can avoid detection; a further weakness of these approaches is
their inability to handle legitimate traffic mixed with attack traffic.

Ahmed et al. use change-point analysis of the packet arrival rate of new source
IP addresses. The method is based on the non-parametric CUSUM technique.
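A minimal sketch of the change-point idea behind the non-parametric CUSUM technique; the per-interval counts of new source IPs, the target, the slack, and the alarm threshold are all hypothetical values chosen for illustration.

```python
def cusum(xs, target, slack):
    """One-sided CUSUM: accumulate drift above target + slack; a large
    statistic signals a change point in the series."""
    s, stats = 0.0, []
    for x in xs:
        s = max(0.0, s + x - target - slack)
        stats.append(s)
    return stats

# hypothetical per-interval counts of new source IP addresses
series = [3, 2, 4, 3, 2, 30, 35, 40]          # flood begins at index 5
stats = cusum(series, target=3.0, slack=1.0)
alarm = next(i for i, s in enumerate(stats) if s > 20.0)
print(alarm)  # -> 5
```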

Salem et al. proposed a solution for the early detection of flooding attacks in
backbone networks.

Bhattacharyya and Kalita presented a new framework for detecting anomalies by
employing a least mean square (LMS) filter and Pearson chi-square divergence on
random aggregations of flows in a 2-D sketch data structure. The method can
detect low-rate attacks as well as high-rate flooding attacks, with high
detection accuracy and a low false alarm rate.

Tang et al. proposed an efficient online detection scheme for Session
Initiation Protocol (SIP) flooding attacks that can detect both high-rate and
low-rate flooding attacks.

Xie and Yu created a DDoS detection scheme that monitors web flash-crowd
traffic in order to reveal dynamic shifts in normal burst traffic, which might
signal the onset of an application-layer DDoS attack during a flash-crowd
event.

Zargar et al. analyzed the scope of DDoS flooding attacks and categorized the
attacks and the available countermeasures based on where and when these methods
can prevent, detect, and respond to DDoS flooding attacks.

Exploration of the machine learning features considered to train and test the model

The required metrics need to be explored in contrast to packet patterns. From
the detailed exploration of the constraints observed in the existing
contemporary models discussed in the related work, it is evident that, in a
distributed environment, diversified packet flow is easy to achieve through
minimal time frames and session times. The arrival rates of human users,
including those behind a proxy server, appear to constitute non-patterned
(random) cases. Hence, to address this constraint, this manuscript devises a
novel set of metrics derived from the absolute time interval rather than from
session time and packet patterns.

Absolute time interval: This denotes the absolute time taken by the set of
sessions initiated within a given threshold time frame. This feature is
considered significant because an HTTP flood is the cumulative result of
multiple sessions and diversified packet flow. The remaining features are
explored relative to the defined absolute time interval.

Absolute session Count: This feature represents the average number of sessions
found in a defined absolute time interval. It is considered because the load on
any target web server is estimated by the number of sessions in a given time
interval.

Absolute session Interval: This feature represents the average time taken to
render each session in a defined absolute time interval. It is critical because
session time indicates the time spent by a source on the target web server,
whether with the intention of fair use or of attack.

Absolute Page access count: This feature represents the average number of
requests in a defined absolute time interval. It is also critical among the
considered features, since the page access count, together with the absolute
session interval, improves detection of the load on the target web server.

Ratio of Packet Count: This feature represents the average number of packets
from the divergent sources that initiate sessions in a defined absolute time
interval. Together with the session intervals, the ratio of packet count helps
detect the load on the server.

Ratio of Request between Intervals: This feature represents the average number
of requests between time intervals from the sources that initiate sessions in
an absolute time interval. It is critical because it captures the ratio of
session time per source on the web server.

Absolute Bandwidth Consumption: This feature represents the average bandwidth
consumed by the requests found in a defined absolute time interval. It is also
considered significant, since estimating bandwidth consumption is critical in
load assessment. The record structure is given below:

 

Absolute time interval | Absolute session Count | Absolute session Interval | Absolute Page access count | Ratio of Packet Count | Ratio of Request between Intervals | Absolute Bandwidth Consumption
 

The estimation of the absolute time interval and the other defined features is
as follows:

The sessions initiated within a given time-frame threshold are grouped; then,
for each group, the time spent to complete all the sessions in that group is
taken as the time interval of the corresponding group. The sum of the average
of these time intervals and the root mean square distance of the respective
session groups is considered the absolute time interval.

The number of sessions rendered in each absolute time interval is considered
the absolute session count of the respective absolute time interval.

The sum of the average session completion time of a given absolute time
interval and its root mean square distance is denoted the absolute session
completion time of the corresponding absolute time interval.

Similarly, the average page access time for a given absolute time interval and
its root mean square distance are aggregated, which denotes the absolute page
access time of the corresponding absolute time interval.

Further, the total number of pages rendered in a given absolute time interval
is considered the absolute page access count of the corresponding absolute time
interval.

The ratio of eminent sources to the total number of divergent sources found in
a given absolute time interval is considered the eminent source diversity ratio
of the corresponding absolute time interval.

The total bandwidth consumed by the requests found in a given absolute time
interval is denoted the absolute bandwidth consumption of the corresponding
absolute time interval.
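The grouping and mean-plus-RMS aggregation described above can be sketched as follows; the session-log field names ('start', 'end', 'pages', 'bytes') and the 10-second frame threshold are assumptions made for illustration.

```python
from math import sqrt

def absolute_features(sessions, frame=10.0):
    """Group sessions into threshold time frames and derive per-interval
    features as described above: mean + RMS of durations, session count,
    total pages, and total bandwidth."""
    groups = {}
    for s in sessions:
        groups.setdefault(int(s['start'] // frame), []).append(s)

    records = []
    for _, grp in sorted(groups.items()):
        durations = [s['end'] - s['start'] for s in grp]
        mean = sum(durations) / len(durations)
        rms = sqrt(sum(d * d for d in durations) / len(durations))
        records.append({
            'abs_time_interval': mean + rms,  # mean + RMS aggregation
            'abs_session_count': len(grp),
            'abs_page_access_count': sum(s['pages'] for s in grp),
            'abs_bandwidth': sum(s['bytes'] for s in grp),
        })
    return records

demo = [
    {'start': 0.0, 'end': 2.0, 'pages': 12, 'bytes': 3500},
    {'start': 1.0, 'end': 4.0, 'pages': 20, 'bytes': 6100},
    {'start': 12.0, 'end': 13.5, 'pages': 7, 'bytes': 1800},
]
for record in absolute_features(demo):
    print(record)
```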

The dataset preparation

This section explores the dataset preprocessing used to train the devised
model. The labeled transactions given for the training phase are partitioned
into flood and normal transaction sets, TF and TN. These partitioned sets are
then used to extract the features considered for the training phase. The
absolute time interval is defined for the corresponding datasets TF and TN, and
the features are then extracted from TF and TN to form the flood and normal
feature-record sets referred to in the remainder of the discussion. Each record
of the respective sets represents an absolute time interval and the
corresponding values of the other dependent features. The record structure is
as follows:

 

 # | Absolute session Count | Absolute session Interval | Absolute Page access count | Absolute Bandwidth Consumption
 1 | 60 | 2.84 | 1020 | 35.71
 2 | 51 | 2.59 |  663 | 24.75
 3 | 58 | 3.04 |  928 | 58.97
 4 | 52 | 3.16 |  780 | 67.77
 5 | 42 | 3.12 |  588 |  4.42
 6 | 58 | 3.08 |  754 | 29.84
 7 | 46 | 3.28 |  552 | 14.88
 8 | 53 | 2.76 |  795 | 76.31
 9 | 46 | 2.59 |  690 | 20.89
10 | 55 | 2.84 |  605 | 38.36
11 | 70 | 2.31 | 1190 | 98.45
12 | 67 | 1.78 | 1474 | 78.19
13 | 59 | 2.88 | 1121 | 73.13
14 | 59 | 3    | 1239 | 21.4
15 | 68 | 2.72 | 1428 | 35.91
16 | 67 | 1.54 | 1206 | 36.3
17 | 63 | 2.23 | 1197 | 73.2
18 | 70 | 2.15 | 1470 | 50.95
19 | 67 | 2.84 | 1407 | 65.04
20 | 54 | 1.58 | 1026 | 43.57
21 | 70 | 1.66 | 1470 | 83.28
22 | 66 | 2.55 | 1452 | 85.39
23 | 65 | 3.24 | 1105 | 34.71
24 | 63 | 1.7  | 1197 | 43.57
25 | 54 | 1.74 | 1134 | 17.83
26 | 57 | 2.67 | 1026 | 55.91
27 | 60 | 2.84 | 1020 | 35.75
28 | 58 | 3.04 |  928 | 58.97
29 | 59 | 2.88 | 1003 | 78.89
30 | 56 | 3.12 | 1008 | 31.99

   

These attributes will be referred to as the feature set in the remainder of the
article. The number of attributes in each record is 7, which is the size of the
feature set.

Further, the record sets formed for the respective transaction sets TF and TN
are used to train the selection strategy based on Jaccard similarity.
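The partitioning of labeled transactions into the flood (TF) and normal (TN) sets described in the dataset preparation, and the random train/test split used later in the classification step, might look as follows; the CSV column names and the 70/30 split ratio are assumptions.

```python
import csv
import random

def load_and_partition(path, label_field='label', flood_value='flood'):
    """Load labeled transactions from a CSV file and partition them into
    flood (TF) and normal (TN) transaction sets."""
    with open(path, newline='') as f:
        rows = list(csv.DictReader(f))
    tf = [r for r in rows if r[label_field] == flood_value]
    tn = [r for r in rows if r[label_field] != flood_value]
    return tf, tn

def train_test_split(records, train_ratio=0.7, seed=42):
    """Randomly split a record set into train and test subsets."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```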

Optimal Feature Selection using Jaccard Similarity:

Attributes that take similar values in both the flood and normal record sets
are usually not qualified to assess whether requests are prone to flood attack
or not. Moreover, the values obtained for these attributes are dynamic and vary
with the given training dataset. Hence it is necessary to identify the optimal
attributes for flood and normal transactions. This section explores optimal
feature selection for the given flood and normal transaction datasets, as
follows.

The similarity between the values taken by an attribute in the two datasets is
estimated using the Jaccard index. For each attribute, all values observed for
the attribute in the respective record sets are extracted as two sets, A and B,
and duplicates are removed. The distance of A towards B under the Jaccard index
is then found as

d(A, B) = 1 - |A ∩ B| / |A ∪ B|

and, similarly, the distance of B towards A. The equation identifies the ratio
of the elements common to both sets (their intersection) to the elements of
both sets taken together (their union), which is the similarity score under the
Jaccard index. This value is subtracted from the maximum similarity score (1)
to obtain the distance under the Jaccard index.

The attributes whose distance exceeds the distance threshold are then
considered the optimal attributes of the flood and normal sets respectively,
and are referred to in what follows as the optimal attribute sets.
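The selection procedure above can be sketched as follows; the 0.5 distance threshold is an illustrative assumption, since the text leaves the threshold as a parameter.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two value sets: 1 - |A∩B| / |A∪B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def select_optimal_attributes(flood_records, normal_records, threshold=0.5):
    """Keep attributes whose flood/normal value sets differ by more than
    the Jaccard-distance threshold."""
    selected = []
    for attr in flood_records[0]:
        flood_vals = {r[attr] for r in flood_records}
        normal_vals = {r[attr] for r in normal_records}
        if jaccard_distance(flood_vals, normal_vals) > threshold:
            selected.append(attr)
    return selected

# toy records: 'count' separates the sets, 'proto' does not
flood = [{'count': 70, 'proto': 'tcp'}, {'count': 66, 'proto': 'tcp'}]
normal = [{'count': 50, 'proto': 'tcp'}, {'count': 52, 'proto': 'tcp'}]
print(select_optimal_attributes(flood, normal))  # -> ['count']
```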

 

The optimal attributes selected are: Absolute session Count, Absolute session
Interval, Absolute Page access count, Absolute Bandwidth Consumption, and Ratio
of Packet Count. Their sample values are those listed in the dataset
preparation section.


Cluster the data using any clustering algorithm over the selected attributes:
Absolute session Count, Absolute session Interval, Absolute Page access count,
Absolute Bandwidth Consumption, and Ratio of Packet Count.
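A minimal pure-Python K-Means sketch over two of the selected attributes (Absolute session Count and Absolute Page access count); the sample points are pairs taken from the sample values above, and the choice of two clusters (normal-like vs. attack-like load) is an assumption.

```python
import random

def kmeans(points, k=2, iters=50, seed=0):
    """Plain K-Means: repeatedly assign points to the nearest centroid
    and recompute each centroid as the mean of its cluster."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        centroids = [tuple(sum(v) / len(c) for v in zip(*c)) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# (session count, page access count) pairs sampled from the prepared records
pts = [(51, 663), (46, 552), (52, 780), (70, 1470), (67, 1474), (68, 1428)]
centroids, clusters = kmeans(pts, k=2)
```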

 

 

 

 

 

 

 

 

Classification:

K-Means, J48, and Naïve Bayes are used as the classifier algorithms. Naïve
Bayes is a machine learning approach that uses the probabilities of all the
attributes to make a prediction. The algorithm makes a strong assumption: that
all the attributes are independent of one another. This assumption makes the
prediction not only accurate but also fast. The database is in spreadsheet
(CSV) format. The data is loaded from the database into the program and
randomly split into training and test datasets.
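A minimal Gaussian Naïve Bayes sketch in the spirit of the classifier described above: class priors combined with per-attribute normal densities, with attributes treated as independent. The two-attribute records and labels are illustrative, not drawn from the CAIDA datasets.

```python
from math import exp, pi, sqrt

class GaussianNB:
    """Tiny Gaussian Naive Bayes: every attribute is assumed independent
    and normally distributed within each class."""

    def fit(self, X, y):
        self.stats = {}
        for label in set(y):
            rows = [x for x, lbl in zip(X, y) if lbl == label]
            cols = list(zip(*rows))
            means = [sum(c) / len(c) for c in cols]
            variances = [max(1e-9, sum((v - m) ** 2 for v in c) / len(c))
                         for c, m in zip(cols, means)]
            self.stats[label] = (len(rows) / len(X), means, variances)
        return self

    def predict(self, x):
        def posterior(label):
            prior, means, variances = self.stats[label]
            p = prior
            for v, m, var in zip(x, means, variances):
                p *= exp(-(v - m) ** 2 / (2 * var)) / sqrt(2 * pi * var)
            return p
        return max(self.stats, key=posterior)

# (session count, page access count) records with hypothetical labels
X = [[51, 663], [46, 552], [52, 780], [70, 1470], [67, 1474], [68, 1428]]
y = ['normal', 'normal', 'normal', 'attack', 'attack', 'attack']
clf = GaussianNB().fit(X, y)
print(clf.predict([69, 1440]))  # -> attack
```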

Fig 2: Block diagram of the Proposed System.

 

 

Training phase

 

 

 

 
