[[:en:documentation:pandorafms:introduction:01_introduction|Pandora FMS]] is a complex distributed application with several key elements, any of which can become a bottleneck if it is not sized and configured correctly. The purpose of this chapter is to help carry out a capacity study, **to analyze the //scalability// of Pandora FMS according to a specific set of parameters**. This study will help to find out the requirements that the installation should meet to support a given capacity.
Load tests are also used to observe the maximum capacity per server. In the current architecture model ([[:en:documentation:pandorafms:technical_reference:10_versions|version 3.0 or later]]), with "N" independent servers and a **[[:en:documentation:pandorafms:command_center:01_introduction|Command Center (Metaconsole)]]** installed, //scalability// tends to be linear, while with centralized models it grows exponentially.
The tests were performed on a **DELL PowerEdge T100®** server with an **Intel Core Duo®** 2.4 GHz processor and 2 GB of RAM, running **Ubuntu Server 8.04**; it provided the baseline for the tests in high capacity environments. The tests were performed on agent configurations fairly similar to those of the QUASAR TECHNOLOGIES project. The intention is not to replicate exactly the same volume of information that QUASAR TECHNOLOGIES will have, since the same hardware is not available, but to replicate a high capacity environment, similar to that of QUASAR TECHNOLOGIES, in order to evaluate the impact on performance over time and to identify other problems (mainly of usability) arising from handling large volumes of data.
- The "real time" information should be moved to the historical database within a maximum of 15 days, optimally for data older than one week. This guarantees a faster operation.
+
- The margin of maneuver in the optimal case is almost 50% of processing capacity, higher than expected considering this volume of information.
+
- The information fragmentation rate is key to determine the performance and capacity required for the environment where the system needs to be deployed.
Based on certain targets, calculated according to the previous point, it will be assumed that the estimated target is to see how the system behaves with a load of 100,000 modules, distributed among a total of 3,000 agents, that is, an average of 33 modules per agent.
A [[en:documentation:pandorafms:complex_environments_and_optimization:08_optimization#ks2_2_1|task will be created]] for ''pandora_xml_stress'', executed through **cron** or a manual script, containing 33 modules distributed with a configuration similar to this one:
The thresholds for these 15 modules will be configured so that they follow this pattern:
<file>
0-50 normal
50-74 warning
75- critical
</file>
New tokens will be added to the ''pandora_xml_stress'' configuration file in order to define the thresholds from the XML generation. Note that Pandora FMS only "adopts" the threshold definition when the module is created, **but not when it is updated with new data**.
It should be left running for at least 48 hours without any kind of interruption, and the following parameters should be monitored (with a Pandora FMS agent):
* Number of monitors in unknown status (''unknown''):
<code bash>
echo "select SUM(unknown_count) FROM tagente;" | mysql -u pandora -p<password> -D pandora | tail -1
</code>
(where ''<password>'' is the password of the ''pandora'' user.)
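For reference, this is a minimal sketch of how that check could be collected by a software agent module (the module name is illustrative and the password handling should be adapted to the environment):

<file>
module_begin
module_name Unknown_modules
module_type generic_data
module_exec echo "select SUM(unknown_count) FROM tagente;" | mysql -u pandora -p<password> -D pandora | tail -1
module_end
</file>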
The first executions should be used to "tune" the server and the **MySQL** configuration.
The script ''/usr/share/pandora_server/util/pandora_count.sh'' will be used to measure (while there are XML files pending to be processed) the packet processing rate. The goal is for all the generated packets (3,000) to be processed in less than 80 % of the time limit (5 minutes). This means that 3,000 packets have to be processed in 4 minutes, that is, about 12.5 packets per second.
* Maximum number of elements in the intermediate queue, ''max_queue_files'' (see the example after this list).
* Number of threads.
* Of course, all relevant **MySQL** parameters (very important).
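For reference only, a hedged example of where these values live (token names as in the standard ''pandora_server.conf''; the figures are illustrative, not a recommendation):

<file>
# /etc/pandora/pandora_server.conf
dataserver_threads 4
max_queue_files 5000
</file>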
<WRAP center round tip 90%>
An installation of Pandora FMS with a GNU/Linux server installed "by default" on a powerful machine may not exceed 5 to 6 packets per second; on a powerful machine, well "optimized" and "tuned", it can easily reach 30 to 40 packets per second. **It also depends a lot on the number of modules in each agent**.
</WRAP>
The system will be configured so that the database maintenance script at ''/usr/share/pandora_server/util/pandora_db.pl'' is executed every hour instead of every day:
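A possible crontab entry for this, assuming the default installation paths (adjust them to the environment):

<file>
# /etc/cron.d/pandora_db : run the database maintenance every hour
0 * * * * root perl /usr/share/pandora_server/util/pandora_db.pl /etc/pandora/pandora_server.conf
</file>

Once the test has been running for the planned time, evaluate the following points: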
- Is the system stable? Has it crashed? If there are problems, check the logs and the graphs of the metrics obtained (mainly memory).
- Evaluate the time trend of the metric "Number of monitors in unknown status". **There should be no significant trends or spikes**; they should be the exception. If they happen with a regularity of one hour, it is because there are problems with the concurrency of the database management process.
- Evaluate the metric "Average response time of the Pandora FMS database". **It should not grow over time but remain constant**.
- Evaluate the metric "CPU of ''pandora_server''": it should have frequent peaks, but with a constant trend, **not an increasing one**.
- Evaluate the metric "CPU of the MySQL server": it should remain constant with frequent peaks, but with a constant, **not increasing**, trend (one way to collect these CPU metrics with agent modules is sketched below).
If everything went well, the performance impact of alert execution should now be evaluated. Apply an alert to five specific modules of each agent (of type ''generic_data''), for the ''CRITICAL'' condition: something relatively lightweight, such as creating an event or writing to **syslog** (to avoid the impact of something with high latency, such as sending an email message).
You can optionally create an event correlation alert to generate an alert for any critical condition of any agent with one of these five modules.
Leave the system operating for 12 hours under these criteria and evaluate the impact, following the previous criteria.
<wrap #ks3_1_2 />
=== Evaluation of data purging and transfer ===

For this test, data older than 7 days is moved to the history database.
The system should be left running "on its own" for at least 10 days to evaluate long-term performance. A substantial "spike" may appear after 7 days, due to the movement of data to the history database. This degradation is <wrap hi>important</wrap> to take into account. If that much time is not available, it can be reproduced (with less "realism") by changing the purge interval to 2 days for events and 2 days for moving data to the history database, to evaluate this impact.
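To follow the effect of purging and of the transfer to the history database, the size of the main tables can be checked periodically; a minimal sketch, assuming the default ''pandora'' schema:

<code bash>
echo "SELECT COUNT(*) FROM tagente_datos; SELECT COUNT(*) FROM tevento;" | mysql -u pandora -p<password> -D pandora
</code>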
This applies specifically to the [[en:documentation:pandorafms:introduction:02_architecture|ICMP network server]]. In case of testing the Open network server, see the corresponding section on the (generic) network server.
Assuming that the server is already installed and configured, these are some key parameters for its performance:
It defines the number of pings that the system will perform per execution. If most pings take roughly the same time, the number can be raised to a considerably high value, such as 50 to 70.
If, on the other hand, the set of ping modules is heterogeneous and they are in very different networks, with very different latency times, it is not advisable to set a high value, because the test will take as long as the slowest one, so a relatively low number can be used, such as 15 to 20.
Obviously, the more threads it has, the more checks it will be able to execute. Adding up all the threads that Pandora FMS executes, they should not reach the range of 30 to 40. No more than 10 threads should be used here, although it depends a lot on the type of hardware and the GNU/Linux version being used.
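As a reference only, a hedged sketch of where these two parameters are usually tuned (the token names are taken from the standard ''pandora_server.conf'' and may differ between versions; the values are just examples):

<file>
# /etc/pandora/pandora_server.conf
block_size 50        # checks executed per run by the network server
network_threads 10   # threads of the network server
</file>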
Now "create" a fictitious number of modules ping type to test. Assume that you are going to test a total of 3000 modules of ping type. To do this, the best option is to choose a system in the network that would be able to support all pings (any GNU/Linux server would do it)
+
Now, you must "create" a fictitious number of ping type modules to test. It is assumed that you will test a total of 3000 ping modules. To do this, it is best to take a system on the network that is capable of supporting all pings (any GNU/Linux server can handle the task).
The following shellscript can be used to generate the CSV file to import (changing the destination IP address and the group ID):
<code bash>
A=3000
while [ $A -gt 0 ]
do
  # one CSV line per agent: name, IP address and group ID (placeholders, adjust as needed)
  echo "AGENT_$A,192.168.50.1,10" >> agents_to_import.csv
  A=`expr $A - 1`
done
</code>
Before starting this whole process, Pandora FMS must be up and running and being monitored, measuring the metrics from the previous point: CPU consumption (**pandora** and **mysql**), number of modules in unknown status and other interesting monitors.
Import the CSV to create 3,000 agents, which will take a few minutes. Then go to the first agent (''AGENT_3000'') and create a **PING** type module in it.
Then go to the bulk operations tool and copy that module to the remaining 2,999 agents.
Pandora FMS should then start processing those modules. Measure with the same metrics as in the previous case and evaluate how it evolves. The goal is to leave a system operable for the required number of ICMP modules without any of them reaching unknown status.
This is specifically about the SNMP Enterprise network server. In case of testing for the Open network server, see the section on the (generic) network server.
Here the assumption is simpler: the system is not expected to receive traps constantly; instead, the aim is to evaluate the response to a trap storm, some of which will generate alerts.
Once the environment has been set up, the following assumptions must be validated:
- Injection of traps at a constant rate (just add a ''sleep 1'' command to the trap-generation script, inside the **while** loop, to generate 1 trap per second; a sketch of such a script is shown after this list). Leave the system running for 48 hours and evaluate the impact on the server.
- Trap storm. Evaluate the situation before and during a trap storm, and the recovery after it.
- Effects on the system of a very large trap table (more than 50,000 entries). This includes the effect of running the database maintenance.
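Since the trap-generation script referred to above is not reproduced here, this is a minimal sketch of one (it assumes **net-snmp** is installed and that ''SERVER_IP'' points to the Pandora FMS trap console; the OIDs and the community are illustrative):

<code bash>
SERVER_IP=192.168.50.1   # placeholder: address of the Pandora FMS server
A=0
while [ $A -lt 10000 ]
do
  snmptrap -v 1 -c public $SERVER_IP .1.3.6.1.4.1.2789.2005 \
     192.168.2.3 6 666 1233433 .1.3.6.1.4.1.2789.2005.1 s "test trap $A"
  A=`expr $A + 1`
  # add "sleep 1" here for a constant rate of one trap per second
done
</code>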
- Normal event reception rate. This has already been tested in the data server, since an event is generated at each status change.
- Event generation storm. To do this, the generation of events will be forced via CLI, using the following command (with an existing group called "TestingGroup"):
<code bash>
# The CLI tool path below is the standard pandora_manage location; adjust it if needed.
/usr/share/pandora_server/util/pandora_manage.pl /etc/pandora/pandora_server.conf --create_event "Event test" system TestingGroup
</code>
That command, used in a loop like the one used to generate traps, can be used to generate dozens of events per second. It can be parallelized in a script with several instances to cause a higher number of insertions. This serves to simulate the behavior of the system during an event storm, so the system can be checked before, during and after an event storm.

<wrap #ks3_7 />
For this, another server, independent from Pandora FMS, will be used, through the WEB monitoring functionality. In a user session, the following tasks will be performed in this order, measuring how long they take:
- Log in to the console.
- Display a report (in HTML). The report should contain a couple of graphs and a couple of modules with SUM or AVERAGE report types. The interval of each item should be one week or five days.
- Display a combined graph (24 hours).
- Generate a report in PDF (a different report).
This test is performed with at least three different users. The task can be parallelized so that it is executed every minute; with 5 tasks (each one with its own user), the navigation of five simultaneous users would be simulated (a simple way to time individual requests with **curl** is sketched at the end of this section). Once the environment is set up, the following should be taken into account:
- The average speed of each module is relevant in order to identify "bottlenecks" related to other parallel activities, such as the execution of the maintenance //script//, etc.
- The CPU and memory impact on the server will be measured for each concurrent session.
- The impact of each simulated user session will be measured with respect to the average time of the remaining sessions, that is, an estimate of how many seconds of delay each extra simultaneous session adds.
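If the WEB monitoring functionality is not available for this, a rough approximation of the response time of each step can be obtained with **curl** (a sketch; the console URL is a hypothetical placeholder and authenticated pages require a valid session):

<code bash>
CONSOLE_URL="http://pandora.example.com/pandora_console"   # placeholder
curl -s -o /dev/null -w 'index.php loaded in %{time_total}s\n' "$CONSOLE_URL/index.php"
</code>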