Apache Server Configuration Guide for EC2 Instance - EC2InstanceHelper

Apache Server – Blocking the Bad Crawlers or Bots

Apache has long been the de facto web server on Linux, Unix, and their derivatives. It is stable and works well, but it is memory-intensive: each client connection is handled by its own worker (a separate process under the prefork MPM configured later in this guide), and every worker takes its own chunk of memory. With that in mind, there are many instances where blocking certain bots saves resources on the server side. We all welcome Googlebot and other well-behaved crawlers, but there are plenty of unscrupulous bots we would rather not serve, so here we discuss a method for blocking them at the server level.

Let’s issue the following command to gain root-level access.

sudo -i

Then change to the Apache log directory and list the log files.

[root@ip-172-31-26-197 ~]# cd /var/log/httpd/
[root@ip-172-31-26-197 httpd]# ls -lt
total 64316
-rwxrwx--x 1 root root 1973429 Aug 2 17:56 ssl_access_log
-rwxrwx--x 1 root root 2309539 Aug 2 17:56 ssl_request_log
-rwxrwx--x 1 root root 279932 Aug 2 17:46 access_log
-rwxrwx--x 1 root root 4551 Aug 2 13:19 ssl_error_log
-rwxrwx--x 1 root root 768 Jul 31 03:39 error_log


 

These files are useful for further investigation if it comes to that. With the default configuration on this distribution:

ssl_access_log = logs every HTTPS request served on port 443

ssl_request_log = logs the SSL/TLS protocol and cipher used for each HTTPS request on port 443

access_log = the file we are after; it records every plain HTTP request to the web/app resource, whatever the response status

error_log = contains server errors and diagnostic messages (ssl_error_log is its counterpart for HTTPS)

Looking for a specific BOT

Next, we look for some usual suspects in the access log. For example, PetalBot is an aggressive crawler that collects web resource information and data.

[root@ip-172-31-26-197 httpd]# grep -E "petalbot" access_log
127.0.0.1 - - [31/Jul/2022:05:56:43 +0000] "GET /robots.txt HTTP/1.1" 301 239 "-" "Mozilla/5.0 (compatible;PetalBot;+https://webmaster.petalsearch.com/site/petalbot)"
127.0.0.1 - - [31/Jul/2022:13:04:25 +0000] "GET /robots.txt HTTP/1.1" 301 239 "-" "Mozilla/5.0 (compatible;PetalBot;+https://webmaster.petalsearch.com/site/petalbot)"
127.0.0.1 - - [31/Jul/2022:17:33:49 +0000] "GET /robots.txt HTTP/1.1" 301 239 "-" "Mozilla/5.0 (compatible;PetalBot;+https://webmaster.petalsearch.com/site/petalbot)"
127.0.0.1 - - [31/Jul/2022:17:35:01 +0000] "GET /robots.txt HTTP/1.1" 301 239 "-" "Mozilla/5.0 (compatible;PetalBot;+https://webmaster.petalsearch.com/site/petalbot)"
127.0.0.1 - - [01/Aug/2022:01:16:43 +0000] "GET /robots.txt HTTP/1.1" 301 239 "-" "Mozilla/5.0 (compatible;PetalBot;+https://webmaster.petalsearch.com/site/petalbot)"
127.0.0.1 - - [01/Aug/2022:07:07:42 +0000] "GET /robots.txt HTTP/1.1" 301 239 "-" "Mozilla/5.0 (compatible;PetalBot;+https://webmaster.petalsearch.com/site/petalbot)"
127.0.0.1 - - [01/Aug/2022:22:23:42 +0000] "GET /robots.txt HTTP/1.1" 301 239 "-" "Mozilla/5.0 (compatible;PetalBot;+https://webmaster.petalsearch.com/site/petalbot)"
127.0.0.1 - - [02/Aug/2022:06:33:02 +0000] "GET /robots.txt HTTP/1.1" 301 239 "-" "Mozilla/5.0 (compatible;PetalBot;+https://webmaster.petalsearch.com/site/petalbot)"
[root@ip-172-31-26-197 httpd]#

 

A Generalized Search for Any Bot

[root@ip-172-31-26-197 httpd]# grep -E "bot|Bot" access_log
127.0.0.1 - - [31/Jul/2022:05:56:43 +0000] "GET /robots.txt HTTP/1.1" 301 239 "-" "Mozilla/5.0 (compatible;PetalBot;+https://webmaster.petalsearch.com/site/petalbot)"
127.0.0.1 - - [31/Jul/2022:06:15:10 +0000] "GET /robots.txt HTTP/1.1" 301 239 "-" "The Knowledge AI"
127.0.0.1 - - [31/Jul/2022:08:34:12 +0000] "GET /robots.txt HTTP/1.1" 301 239 "-" "Mozilla/5.0 (compatible; DotBot/1.2; +https://opensiteexplorer.org/dotbot; help@moz.com)"
127.0.0.1 - - [31/Jul/2022:09:26:41 +0000] "GET /robots.txt HTTP/1.1" 301 239 "-" "Mozilla/5.0 (compatible; DotBot/1.2; +https://opensiteexplorer.org/dotbot; help@moz.com)"
127.0.0.1 - - [31/Jul/2022:10:18:44 +0000] "GET /robots.txt HTTP/1.1" 301 239 "-" "Mozilla/5.0 (compatible; DotBot/1.2; +https://opensiteexplorer.org/dotbot; help@moz.com)"
127.0.0.1 - - [31/Jul/2022:11:02:22 +0000] "GET / HTTP/1.1" 301 229 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
127.0.0.1 - - [31/Jul/2022:11:03:15 +0000] "GET /robots.txt HTTP/1.1" 301 239 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
127.0.0.1 - - [31/Jul/2022:11:09:46 +0000] "GET /robots.txt HTTP/1.1" 301 239 "-" "Mozilla/5.0 (compatible; DotBot/1.2; +https://opensiteexplorer.org/dotbot; help@moz.com)"
127.0.0.1 - - [31/Jul/2022:11:12:06 +0000] "GET /robots.txt HTTP/1.1" 301 239 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)"
127.0.0.1 - - [31/Jul/2022:11:12:08 +0000] "GET / HTTP/1.1" 301 229 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)"
127.0.0.1 - - [31/Jul/2022:12:02:09 +0000] "GET /robots.txt HTTP/1.1" 301 239 "-" "Mozilla/5.0 (compatible; DotBot/1.2; +https://opensiteexplorer.org/dotbot; help@moz.com)"
127.0.0.1 - - [31/Jul/2022:12:16:55 +0000] "GET /robots.txt HTTP/1.1" 301 239 "-" "Mozilla/5.0 (compatible; coccocbot-image/1.0; +http://help.coccoc.com/searchengine)"
127.0.0.1 - - [31/Jul/2022:12:54:15 +0000] "GET /robots.txt HTTP/1.1" 301 239 "-" "Mozilla/5.0 (compatible; DotBot/1.2; +https://opensiteexplorer.org/dotbot; help@moz.com)"
127.0.0.1 - - [31/Jul/2022:13:04:25 +0000] "GET /robots.txt HTTP/1.1" 301 239 "-" "Mozilla/5.0 (compatible;PetalBot;+https://webmaster.petalsearch.com/site/petalbot)"
127.0.0.1 - - [31/Jul/2022:13:47:44 +0000] "GET /robots.txt HTTP/1.1" 301 239 "-" "Mozilla/5.0 (compatible; DotBot/1.2; +https://opensiteexplorer.org/dotbot; help@moz.com)"
127.0.0.1 - - [31/Jul/2022:14:45:18 +0000] "GET /robots.txt HTTP/1.1" 301 239 "-" "Mozilla/5.0 (compatible; DotBot/1.2; +https://opensiteexplorer.org/dotbot; help@moz.com)"
127.0.0.1 - - [31/Jul/2022:15:00:31 +0000] "GET /robots.txt HTTP/1.1" 301 239 "-" "Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)"

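Looking at this data should give you a good indication of which crawlers and bots are accessing your web app. Note that the pattern matches anywhere on the line, so requests for /robots.txt show up even when the user agent itself does not contain "bot" (the "The Knowledge AI" entry above is one such case).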

You might want to filter this data and make a list of the bots you want to keep and the ones you want to discard.
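To build that list, it can help to tally requests per user agent rather than read raw grep output. The following is a minimal Python sketch, assuming the log uses the combined format seen above (user agent as the last quoted field); the function name `count_bot_agents` and the inline sample are illustrative only, with the sample lines copied from the output shown earlier.

```python
import re
from collections import Counter

# In the combined log format the User-Agent is the last quoted field on the line.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')


def count_bot_agents(lines):
    """Count requests per User-Agent, keeping only agents whose name contains 'bot'."""
    counts = Counter()
    for line in lines:
        match = UA_PATTERN.search(line)
        if match and "bot" in match.group(1).lower():
            counts[match.group(1)] += 1
    return counts


# Sample lines taken from the grep output above.
sample = [
    '127.0.0.1 - - [31/Jul/2022:06:15:10 +0000] "GET /robots.txt HTTP/1.1" 301 239 "-" "The Knowledge AI"',
    '127.0.0.1 - - [31/Jul/2022:08:34:12 +0000] "GET /robots.txt HTTP/1.1" 301 239 "-" "Mozilla/5.0 (compatible; DotBot/1.2; +https://opensiteexplorer.org/dotbot; help@moz.com)"',
    '127.0.0.1 - - [31/Jul/2022:09:26:41 +0000] "GET /robots.txt HTTP/1.1" 301 239 "-" "Mozilla/5.0 (compatible; DotBot/1.2; +https://opensiteexplorer.org/dotbot; help@moz.com)"',
    '127.0.0.1 - - [31/Jul/2022:11:02:22 +0000] "GET / HTTP/1.1" 301 229 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"',
]

for agent, hits in count_bot_agents(sample).most_common():
    print(f"{hits:6d}  {agent}")
```

Because only the user-agent field is inspected, the "The Knowledge AI" line is not counted. To run it against the real log, pass an open file object for /var/log/httpd/access_log in place of `sample`.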

Add the following rewrite condition and rule to the appropriate host section of your Apache config file.

In our case, on the Amazon Linux distribution for EC2, the Apache config file is located at:

/etc/httpd/conf/httpd.conf

 

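# Enable mod_rewrite for this host
RewriteEngine On

# Answer 403 Forbidden when the User-Agent contains any of these strings (case-insensitive)
RewriteCond %{HTTP_USER_AGENT} (petal|dotbot|stripper|ninja|webspider|leacher|collector|grabber|webpictures) [NC]
RewriteRule ^ - [F]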

Note that the rewrite engine must be on (RewriteEngine On), which in turn requires mod_rewrite to be loaded.

This rule is a text-based condition: any request whose User-Agent header matches one of the listed patterns is denied access with a 403 Forbidden response.
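Before reloading Apache, it can be useful to check which of the user agents seen in the log the pattern would actually catch. The sketch below mirrors the alternation from the RewriteCond using Python's `re` module, with `re.IGNORECASE` standing in for the `[NC]` flag. It is only an approximation of Apache's matching (Apache uses PCRE, although for a plain alternation like this the two should agree), and the helper name `is_blocked` is made up for illustration.

```python
import re

# Same alternation as the RewriteCond; IGNORECASE plays the role of [NC].
BLOCKED = re.compile(
    r"petal|dotbot|stripper|ninja|webspider|leacher|collector|grabber|webpictures",
    re.IGNORECASE,
)


def is_blocked(user_agent):
    """Return True if the User-Agent contains one of the blocked substrings."""
    return BLOCKED.search(user_agent) is not None


# User agents copied from the access log output above.
agents = [
    "Mozilla/5.0 (compatible;PetalBot;+https://webmaster.petalsearch.com/site/petalbot)",
    "Mozilla/5.0 (compatible; DotBot/1.2; +https://opensiteexplorer.org/dotbot; help@moz.com)",
    "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
    "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)",
]

for ua in agents:
    verdict = "blocked" if is_blocked(ua) else "allowed"
    print(f"{verdict:8s} {ua}")
```

With this list, PetalBot and DotBot are caught while bingbot and MJ12bot pass, which shows that any additional crawler has to be added to the alternation explicitly.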

 

Apache General Configurations Guide

Apache has a lot of configuration options you can use to control the performance of your site. In this article, we’ll go over some of the most important ones that will help you make informed decisions about how often to serve pages from cache, how many clients can be connected at once, and more. We’ll also cover some changes that can be made with third-party modules if you want more fine-grained control over how your Apache server behaves.

 

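# Persistent connections
KeepAlive On
MaxKeepAliveRequests 20
KeepAliveTimeout 32
Timeout 30

# prefork MPM settings (Apache 2.4 also accepts the newer names
# MaxRequestWorkers and MaxConnectionsPerChild for the last two)
ServerLimit 150
MaxClients 150
MaxRequestsPerChild 2000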

 

Setting Keep Alive Off Makes Every Page Request into a Separate TCP Connection

When keep-alive is enabled, the server reuses a single TCP connection for multiple requests, so visitors do not pay the cost of setting up a new connection for every page, image, or stylesheet they request. With keep-alive off, the connection is closed after every response and the next request has to open a new one. Reusing connections can noticeably improve performance in the long run, at the cost of a worker being held while the connection sits idle.

This means that when keep-alive is enabled and a client makes multiple requests to your website in one session (for example, someone visits your home page and then clicks on an internal link), these requests are sent over the same TCP connection. If keep-alive is disabled (KeepAlive Off), each individual request is sent over its own TCP connection.

MaxKeepAliveRequests

MaxKeepAliveRequests is the maximum number of requests allowed over a single persistent connection. A value of 0 means unlimited.

For example, if you set MaxKeepAliveRequests to 100 and a client requests 101 web resources in a single session, the first 100 requests are served over one persistent connection; the server then closes it and the client has to open a new connection for the remaining request.
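The arithmetic behind the per-connection request limit (MaxKeepAliveRequests in the sample configuration above) can be sketched in a few lines of Python. This is a simplification that assumes keep-alive is on and the idle timeout never fires between requests; `connections_needed` is a hypothetical helper, not part of any Apache tooling.

```python
import math


def connections_needed(total_requests, max_keepalive_requests):
    """Minimum number of TCP connections a client needs for a run of requests.

    A limit of 0 is treated as 'unlimited', so one connection is enough.
    """
    if total_requests <= 0:
        return 0
    if max_keepalive_requests == 0:
        return 1
    return math.ceil(total_requests / max_keepalive_requests)


print(connections_needed(101, 100))  # 2: the 101st request needs a second connection
print(connections_needed(101, 20))   # 6: with the limit of 20 from the sample config
```

A lower limit recycles connections more often, which trades some connection setup overhead for workers being released sooner.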

KeepAliveTimeout

KeepAliveTimeout is measured in seconds and sets how long Apache waits for the next request on a persistent connection before closing it. The default in Apache 2.4 is 5 seconds, but you can increase or decrease this value as needed.

KeepAlive Off: disables persistent connections altogether

KeepAliveTimeout 10: waits up to 10 seconds before closing an idle connection

There is great information on Stackoverflow in this thread.

ServerLimit and MaxClients

The MaxClients directive sets the maximum number of requests the server will serve simultaneously; under the prefork MPM that equals the maximum number of child processes running at any one time. ServerLimit is the hard upper bound on MaxClients, so raising MaxClients beyond it also requires raising ServerLimit. Both are global options that apply to all virtual hosts rather than to any one of them. In Apache 2.4 MaxClients has been renamed MaxRequestWorkers, although the old name is still accepted.

MaxClients is a very important option because it determines how many simultaneous requests you can serve. Apache can handle hundreds or even thousands of clients concurrently, but if you set this value higher than your memory allows, the server starts swapping and performance degrades sharply.
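A common rule of thumb for choosing this value under prefork is to divide the memory you can give Apache by the memory one child process uses. The Python sketch below works that out; the figures (1024 MB of RAM, 256 MB reserved for the OS and other services, 25 MB per child) are made-up placeholders, so measure your own processes before relying on the result.

```python
def estimate_max_clients(total_ram_mb, reserved_mb, per_child_mb):
    """Estimate how many prefork children fit in memory without swapping."""
    usable_mb = total_ram_mb - reserved_mb
    if usable_mb <= 0 or per_child_mb <= 0:
        raise ValueError("memory figures must leave a positive budget")
    # Round down: a partially fitting process would push the box into swap.
    return usable_mb // per_child_mb


# Hypothetical small instance: (1024 - 256) // 25 = 30 children.
print(estimate_max_clients(total_ram_mb=1024, reserved_mb=256, per_child_mb=25))
```

If the estimate comes out well below the configured value, lowering MaxClients (and ServerLimit with it) is usually safer than letting the instance swap.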

MaxRequestsPerChild

MaxRequestsPerChild sets the maximum number of requests a child process will handle before it is terminated by the server. It is a global setting, meaning it affects all child processes. A value of 0 means a child never expires. In Apache 2.4 the directive is named MaxConnectionsPerChild, although the old name is still accepted.

MaxRequestsPerChild can be used to limit the memory usage of a child process. Once a child has served the configured number of requests it exits, and the parent starts a fresh one (with a new PID) when needed. This bounds the damage from memory leaks, which is especially useful when application code such as PHP runs inside the Apache child process: memory that has accumulated over many requests is returned to the system each time a child is recycled.

Apache mod_evasive

 

<IfModule mod_evasive24.c>
# The hash table size defines the number of top-level nodes for each child
# hash table. Increasing this number will provide faster performance by
# decreasing the number of iterations required to get to the record, but
# consume more memory for table space. You should increase this if you have
# a busy web server. The value you specify will automatically be tiered up
# to the next prime number in the primes list (see mod_evasive.c for a list
# of primes used).
DOSHashTableSize 3097

# If set, this email address will receive a notification whenever an IP
# address becomes blacklisted. A locking mechanism prevents continuous
# emails from being sent.
DOSEmailNotify admin@test.com

# NOTE: The following settings apply on a per-IP address basis.

# Blacklist a client that requests the same URI more than 10 times within a 2-second interval:
DOSPageInterval 2
DOSPageCount 10

# Allow up to 50 requests across the site per second:
DOSSiteInterval 1
DOSSiteCount 50

# Once the client is blacklisted, prevent them from accessing the site
# for 20 seconds:
DOSBlockingPeriod 20

DOSLogDir "/var/log/mod_evasive"

</IfModule>

mod_evasive is a module that helps protect your server against denial-of-service attempts, brute-force attacks, and overly aggressive crawlers by tracking request rates per IP address. When a client exceeds the configured page or site threshold, it is blacklisted for the blocking period and its requests are answered with an HTTP 403 status code (Forbidden); requests that stay under the thresholds pass through untouched.
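To make the threshold settings more concrete, here is a toy Python model of the per-page check using the DOSPageInterval 2, DOSPageCount 10, and DOSBlockingPeriod 20 values from the configuration above. It illustrates the general counting idea only and is not how mod_evasive is implemented (the real module works on a hash table in C and has additional behaviour); the class name `PageRateLimiter` and the IP address are invented for the example.

```python
class PageRateLimiter:
    """Toy per-IP, per-URI request counter with a temporary blacklist."""

    def __init__(self, page_interval=2, page_count=10, blocking_period=20):
        self.page_interval = page_interval
        self.page_count = page_count
        self.blocking_period = blocking_period
        self._windows = {}        # (ip, uri) -> (window_start, hits)
        self._blocked_until = {}  # ip -> time at which the block expires

    def allow(self, ip, uri, now):
        """Return True if the request may pass, False if it is refused."""
        if self._blocked_until.get(ip, 0) > now:
            return False
        start, hits = self._windows.get((ip, uri), (now, 0))
        if now - start >= self.page_interval:
            start, hits = now, 0  # interval elapsed: start a new window
        hits += 1
        self._windows[(ip, uri)] = (start, hits)
        if hits > self.page_count:
            self._blocked_until[ip] = now + self.blocking_period
            return False
        return True


limiter = PageRateLimiter()
# Twelve requests for the same page within about one second.
results = [limiter.allow("203.0.113.9", "/index.html", now=0.1 * i) for i in range(12)]
print(results.count(True), "allowed,", results.count(False), "refused")
```

Once blacklisted, the client is refused on every URI until the blocking period has passed, which is why a generous DOSBlockingPeriod can lock out legitimate users behind a shared IP address.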

 

Apache mod_expires

<IfModule mod_expires.c>
ExpiresActive On

# Images
ExpiresByType image/jpeg "access plus 1 year"
ExpiresByType image/gif "access plus 1 year"
ExpiresByType image/png "access plus 1 year"
ExpiresByType image/webp "access plus 1 year"

# Video
ExpiresByType video/webm "access plus 1 year"
ExpiresByType video/mp4 "access plus 1 year"
ExpiresByType video/mpeg "access plus 1 year"

# Fonts
ExpiresByType font/ttf "access plus 1 year"
ExpiresByType font/otf "access plus 1 year"
ExpiresByType font/woff "access plus 1 year"
ExpiresByType font/woff2 "access plus 1 year"
ExpiresByType application/font-woff "access plus 1 year"

# CSS, JavaScript
ExpiresByType text/css "access plus 1 month"
ExpiresByType text/javascript "access plus 1 month"
ExpiresByType application/javascript "access plus 1 month"

</IfModule>

 

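Apache comes with a module called mod_expires that lets you set expiration times on the files you serve. It does this by adding an Expires header and a Cache-Control: max-age header to each matching response, so browsers and proxies can reuse their cached copy instead of requesting the file again. In a rule such as "access plus 1 year", the expiry is counted from the moment the client accessed the file. On many distributions the module is loaded by default; you can check with httpd -M. Long lifetimes are safe for assets whose URL changes whenever their content changes; for files that are updated in place, choose a shorter period.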
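To see what a rule such as "access plus 1 year" turns into on the wire, the Python sketch below computes the two response headers for a given access time. It assumes a year is counted as 365 days and a month as 30 days, which is my understanding of how mod_expires converts these units; the function name `expires_headers` and the example timestamp are illustrative.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime

# Unit lengths in seconds (assumption: 30-day months, 365-day years).
UNIT_SECONDS = {
    "hour": 60 * 60,
    "day": 60 * 60 * 24,
    "month": 60 * 60 * 24 * 30,
    "year": 60 * 60 * 24 * 365,
}


def expires_headers(amount, unit, access_time):
    """Headers for an 'access plus <amount> <unit>' rule at a UTC access time."""
    max_age = amount * UNIT_SECONDS[unit]
    expires = access_time + timedelta(seconds=max_age)
    return {
        "Cache-Control": f"max-age={max_age}",
        "Expires": format_datetime(expires, usegmt=True),
    }


accessed = datetime(2022, 8, 2, 17, 56, tzinfo=timezone.utc)
for name, value in expires_headers(1, "year", accessed).items():
    print(f"{name}: {value}")
```

For the one-year image rule this gives max-age=31536000, and for the one-month CSS and JavaScript rules max-age=2592000.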

Further details can be found in the official Apache documentation.

You can configure the Apache server in a variety of ways to control performance.

  • You can use Apache mod_expires to cache content.
  • You can use Apache mod_headers to control the HTTP headers that are sent to the browser.

Conclusion

In this article, we’ve discussed ways to improve Apache server performance. Keep in mind that there are many more variables that can be configured. However, these are some of the most important ones and they should cover most use cases. As always with web servers, it’s a good idea to test different configurations on your own server before deciding which one works best for you!
