Differences

This shows the differences between two versions of the page.

--- ja:documentation:07_technical_annexes:03_capacity_planning [2022/02/23 01:46] – [Evaluation of Data Purging and Transfer] junichi
+++ ja:documentation:07_technical_annexes:03_capacity_planning [Unknown date] (current) – removed - external edit (Unknown date) 127.0.0.1
@@ Line 1: / Line 1: @@
-====== Capacity Planning ======
-{{indexmenu_n>3}}
-[[ja:documentation:start|Back to the Pandora FMS documentation index]]
-===== Capacity Planning =====
-==== Overview ====
-[[:en:documentation:01_understanding:01_introduction|Pandora FMS]] is a fairly complex distributed application with several key components, any of which can become a bottleneck if it is not sized and configured correctly. The main aim of this study is **to detail how Pandora FMS scales with respect to a specific set of parameters**, so that the requirements for reaching a given capacity are known.
-Load tests were first carried out on a cluster-based system, with a single Pandora FMS server centralized on a database cluster. Load tests are also useful for observing the maximum capacity per server. In the current architecture model ([[:en:documentation:08_technical_reference:10_versions|v3.0 or later]]), with "N" independent servers and one **[[:en:documentation:06_metaconsole:01_introduction|Metaconsole]]**, scalability tends to be linear, whereas scalability in centralized models follows the kind of curve shown in the following graph:
-{{  :wiki:pfms-volumetric_and_capacity_studies-performance_hardware_resources.png  }}
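-=== Data Storage and Compression ===
-Pandora FMS compresses data in real time, which matters a great deal when calculating how much storage will be needed. First, compare how a traditional system stores data with the "asynchronous" way Pandora FMS stores it, as shown in the following diagram.
-{{ wiki:Estudio_vol01.png }}
-**Traditional system**
-With an average of 20 samples per day, one check needs 5 MB per year. With 50 checks per agent, that is 250 MB per agent per year.
-**Non-traditional, asynchronous system such as Pandora FMS**
-With an average of 0.1 state changes per day, one check needs 12.3 KB per year. With 50 checks per agent, that is 615 KB per agent per year.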
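-The per-agent figures follow directly from the per-check figures. A minimal sketch of that arithmetic, assuming decimal units (1 MB = 1000 KB); the constant and function names are illustrative and not part of Pandora FMS:
```python
# Yearly storage per check, in KB, as quoted in the text.
# Assumption: decimal units (1 MB = 1000 KB). Names are illustrative only.
TRADITIONAL_KB_PER_CHECK_YEAR = 5_000.0  # 5 MB at 20 samples per day
ASYNC_KB_PER_CHECK_YEAR = 12.3           # 12.3 KB at 0.1 changes per day


def yearly_storage_kb(kb_per_check_year: float, checks_per_agent: int) -> float:
    """Return the yearly storage needed by one agent, in KB."""
    return kb_per_check_year * checks_per_agent


traditional = yearly_storage_kb(TRADITIONAL_KB_PER_CHECK_YEAR, 50)
asynchronous = yearly_storage_kb(ASYNC_KB_PER_CHECK_YEAR, 50)

print(f"traditional:  {traditional / 1000:.0f} MB per agent per year")
print(f"asynchronous: {asynchronous:.0f} KB per agent per year")
```
-Changing the second argument shows how quickly the gap between the two storage models widens as agents carry more checks.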
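-=== Specific Terminology ===
-The following [[:en:documentation:01_understanding:03_glossary|glossary]] defines the terms **specific** to this study:
-  * **Information fragmentation**: the information Pandora FMS handles behaves in different ways. Some of it changes constantly (for example, CPU usage) and some of it is static (for example, the state of a service). Because Pandora FMS uses this to "compress" the information in the database, it is a critical factor for performance and capacity: the more fragmented the information, the more database space and processing capacity are needed to handle the same information.
-  * **Module**: the basic unit of information collected for monitoring. In some environments it is known as an event.
-  * **Interval**: the time that elapses between two data collections of one module.
-  * **Alert**: the notification Pandora FMS executes when data goes outside the defined thresholds or a module changes to CRITICAL or WARNING status.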
-==== Example of a Capacity Study ====
-{{  :wiki:pfms-performance-optimization.png?nolink&  }}
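-=== Background and Scope ===
-Consider a deployment planned in three stages:
-  * Stage 1: deployment of 500 agents.
-  * Stage 2: deployment of 3000 agents.
-  * Stage 3: deployment of 6000 agents.
-To size Pandora FMS correctly for these data volumes, you need to know exactly what kind of monitoring is planned. The following study uses the characteristics of a fictitious organization named "QUASAR TECHNOLOGIES":
-  * 90% of the monitoring is done through software agents.
-  * The systems are similar and can be grouped by technology/policy.
-  * Execution intervals differ between the modules/events to monitor.
-  * There is a large amount of asynchronous information (events, log items).
-  * There are many status checks on items that rarely change.
-  * There is little performance information overall.
-After a study of all the technologies and how to implement them (reviewing each system and how it is monitored), the following conclusions were reached:
-  * There is an average of 40 modules/events per system.
-  * The average monitoring interval is 1200 seconds (20 minutes).
-  * Some modules report every 5 minutes and others only once a week.
-  * Of all modules across all groups (240,000), the probability of a change on each check is 25%.
-  * The alert rate per module is 1.3 (1.3 alerts per module/event).
-  * The alert firing rate is assumed to be 1% (an estimate based on our experience).
-These conclusions are the basis for the estimate. They are summarized in an Excel spreadsheet for easier understanding: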
-{{  :wiki:pfms-volumetric_and_capacity_studies-estimation.png  }}
-Applying the necessary calculations to this initial data gives estimates of the database size, the number of modules per second that must be processed, and other essential parameters:
-{{  :wiki:pfms-volumetric_and_capacity_essential parameters.png?direct&  }}
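-The basic rates behind such an estimate can be derived from the averages listed in the conclusions. A rough sketch, assuming those averages apply uniformly to every agent; the spreadsheet's own formulas are not reproduced here, the 1% firing rate is not modeled, and all names are hypothetical:
```python
# Averages taken from the study's conclusions. Names are illustrative only.
MODULES_PER_AGENT = 40      # average modules/events per system
AVG_INTERVAL_S = 1200       # average monitoring interval, in seconds
CHANGE_PROBABILITY = 0.25   # chance that a check produces a changed value
ALERTS_PER_MODULE = 1.3     # alerts defined per module/event


def size_stage(agents: int) -> dict:
    """Estimate the basic load figures for a stage with the given agent count."""
    modules = agents * MODULES_PER_AGENT
    return {
        "modules": modules,
        "modules_per_second": modules / AVG_INTERVAL_S,
        "changes_per_second": modules * CHANGE_PROBABILITY / AVG_INTERVAL_S,
        "alerts_defined": round(modules * ALERTS_PER_MODULE),
    }


for agents in (500, 3000, 6000):
    print(agents, size_stage(agents))
```
-For stage 3 this yields the 240,000 modules mentioned in the conclusions, processed at an average of 200 modules per second.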
-=== Capacity Study ===
-Once the basic requirements for each phase are known (modules per second, total number of alerts, modules per day, and megabytes per month), the next step is a real stress test on a server similar to the production systems. (In this study, the test could not be run on a system similar to the production ones.)
-These stress tests show the processing capacity of Pandora FMS on a server and how its performance degrades over time. This is useful for the following purposes:
-  - Estimating the maximum scale the planned hardware can handle.
-  - Finding the storage limits and the point at which information should be moved to the history database.
-  - Finding the headroom against peak load when pending information piles up after a service outage or a planned stop.
-  - Finding the performance impact of a change in the rate at which the monitored information changes.
-  - Finding the impact of executing a large number of alerts.
-The tests were run on a **DELL PowerEdge T100®** server with a 2.4 GHz **Intel Core Duo®** processor and 2 GB of RAM, running **Ubuntu Server 8.04**. This server was the basis of our tests on High Availability environments. The tests used agent configurations very similar to those of the QUASAR TECHNOLOGIES project. The same hardware was not available, so the aim was to replicate a high-availability environment similar to that of QUASAR TECHNOLOGIES, in order to evaluate the performance impact over time and to identify other problems (mainly usability) that come from handling large data volumes.
-{{  :wiki:pfms-volumetric_and_capacity_probability_of_change.png  }}
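-The results were very positive: although heavily overloaded, the system processed a considerable volume of information (180,000 modules, 6000 agents, 120,000 alerts). The conclusions drawn from the test are:
-  - Real-time information should be moved to the history database within 15 days at most; ideally, move data older than one week. This guarantees faster operation.
-  - Given this volume of information, the headroom is close to 50% of the maximum processing capacity, which is higher than expected.
-  - The fragmentation rate of the information is key to determining the performance and capacity required for the environment where the system will be deployed.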
-==== Detailed Methodology ====
-The previous chapter was a "quick" study based only on "dataserver" modules. This section presents a more complete way of analyzing Pandora FMS capacity.
-As a starting point, the **worst-case scenario** is assumed wherever there is a choice. Where that is not possible, the "common case" is used. **The "best case" is never considered**, because that approach does not hold up.
-The next step is to calculate the system capacity by monitoring type, or based on the origin of the information.
-=== Data Server ===
-Following the targets seen in the previous section, suppose the goal is to see how the system behaves under a load of 100,000 modules distributed among 3000 agents, that is, an average of 33 modules per agent.
-[[:en:documentation:05_big_environments:08_optimization|Create a task]] for ''pandora_xml_stress'', run through **cron** or a manual script, with 33 modules distributed in a configuration similar to this one:
-  * 1 module of type string.
-  * 17 modules of type ''generic_proc''.
-  * 15 modules of type ''generic_data''.
-Define the 17 modules of type ''generic_proc'' as follows:
-<file>
-module_begin
-module_name Process Status X
-module_type generic_proc
-module_description Status of my super-important daemon / service / process
-module_exec type=RANDOM;variation=1;min=0;max=100
-module_end
-</file>
-The 15 modules of type ''generic_data'' need thresholds. First, configure them so that data of this type is generated:
-<file>
-module_exec type=SCATTER;prob=20;avg=10;min=0;max=100
-</file>
-Then configure the thresholds for these 15 modules so that they follow this pattern:
-<file>
0-49 normal
50-74 warning
75-100 critical
-</file>
-Add the following new tokens to the ''pandora_xml_stress'' configuration file so that the thresholds are defined from the generated XML. Note: Pandora FMS only "adopts" threshold definitions when a module is created, not when it is updated with new data.
-<file>
-module_min_critical 75
-module_min_warning 50
-</file>
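-The two tokens above split the 0-100 value range into three status bands. A simplified sketch of that mapping, assuming only minimum thresholds are set and that each lower bound is inclusive; ''classify'' is a hypothetical helper, not a Pandora FMS function:
```python
def classify(value: float, min_warning: float = 50.0, min_critical: float = 75.0) -> str:
    """Map a module value to a status band.

    Assumptions: only minimum thresholds are defined (no maximums), and a
    value equal to a threshold already belongs to the higher band.
    """
    if value >= min_critical:
        return "critical"
    if value >= min_warning:
        return "warning"
    return "normal"


for sample in (10, 49, 50, 74, 75, 100):
    print(sample, classify(sample))
```
-With data spread over 0-100, this shows which share of the generated values can be expected to land in each state.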
-Run ''pandora_xml_stress''.
-Let it run for at least 48 hours without interruption, and monitor the following parameters (with a Pandora FMS agent):
-Number of queued packages:
-<file>
-find /var/spool/pandora/data_in | wc -l
-</file>
-Pandora FMS server CPU usage:
-<file>
-ps aux | grep "/usr/bin/pandora_server" | grep -v grep | awk '{print $3}'
-</file>
-''pandora_server'' total memory usage:
-<code>
- ps aux | grep "/usr/bin/pandora_server" | grep -v grep | awk '{print $4}'
-</code>
-CPU used by **mysqld** (check the path of the running process; it depends on the MySQL distribution):
-<file>
-ps aux | grep "sbin/mysqld" | grep -v grep | awk '{print $3}'
-</file>
-Pandora FMS database average response time:
-<code>
-/usr/share/pandora_server/util/pandora_database_check.pl /etc/pandora/pandora_server.conf
-</code>
-Number of monitors in unknown state:
-<code>
-echo "select SUM(unknown_count) FROM tagente;" | mysql -u pandora -p<password> -D pandora | tail -1
-</code>
-(''<password>'' is the password of the ''pandora'' user.)
-The first runs are useful for "tuning" the server and the MySQL configuration.
-Use the script ''/usr/share/pandora_server/util/pandora_count.sh'' to measure the package processing rate (while there are XML files pending). The goal is for all generated packages (3000) to be processed within 80% of the time limit (5 minutes). This means 3000 packages must be processed in 4 minutes, so:
-<file>
3000 / (4x60) = 12.5
-</file>
-A processing rate of at least 12.5 packages per second is needed to be reasonably sure that Pandora FMS can process this information.
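-The same calculation can be repeated for other package counts or intervals. A small sketch, assuming the 80% safety margin used in this study; ''required_rate'' is a hypothetical helper name:
```python
def required_rate(packages: int, interval_s: int, safety_percent: int = 80) -> float:
    """Packages per second needed to clear `packages` within
    `safety_percent` percent of the agent interval."""
    window_s = interval_s * safety_percent / 100  # 300 s * 80 / 100 = 240 s
    return packages / window_s


rate = required_rate(packages=3000, interval_s=5 * 60)
print(f"minimum processing rate: {rate:.1f} packages/second")
```
-Any rate measured with ''pandora_count.sh'' below this value means packages will pile up in the incoming queue.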
-Elements to adjust:
-  * Number of threads.
-  * Maximum number of items in the intermediate queue (''max_queue_files'').
-  * All applicable MySQL parameters, of course (very important).
-<WRAP center round tip 60%>Why this matters: a Pandora FMS server installed "by default" on GNU/Linux, even on a powerful machine, may not exceed 5-6 packages per second, whereas a powerful machine that is well "optimized" and "tuned" can comfortably reach 30-40 packages per second. **It also depends heavily on the number of modules in each agent**.</WRAP>
-Configure the system so that the database maintenance script at ''/usr/share/pandora_server/util/pandora_db.pl'' runs every hour instead of every day:
-<file>
-mv /etc/cron.daily/pandora_db /etc/cron.hourly
-</file>
-Leave the system running with the package generator for at least 48 hours. Once this time has passed, evaluate the following points:
-  - Is the system stable? Has it gone down? If there are problems, check the logs and the graphs of the collected metrics (mainly memory).
-  - Evaluate the trend over time of the metric "Number of monitors in unknown state". **There should be no trend and no significant peaks**; peaks should be the exception. If they recur every hour, there is a concurrency problem with the database maintenance process.
-  - Evaluate the metric "Pandora FMS database average response time". **It should not increase over time but remain constant**.
-  - Evaluate the metric "pandora_server CPU". It should show many peaks but a constant trend, **not a rising one**.
-  - Evaluate the metric "MySQL server CPU". It should show many peaks but a constant trend, **not a rising one**.
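-Whether a spiky metric is "rising" is hard to judge by eye. One possible check is the least-squares slope of the collected samples. The sketch below is only an illustration of that idea: the helper names and the tolerance are assumptions, and it expects evenly spaced samples:
```python
from statistics import mean


def slope_per_sample(samples: list) -> float:
    """Least-squares slope of evenly spaced samples (units per sample)."""
    n = len(samples)
    if n < 2:
        raise ValueError("at least two samples are needed")
    x_mean = (n - 1) / 2
    y_mean = mean(samples)
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(samples))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den


def is_rising(samples: list, tolerance: float = 0.0) -> bool:
    """True when the fitted trend grows faster than `tolerance` per sample."""
    return slope_per_sample(samples) > tolerance


spiky_but_flat = [10, 90, 10, 90, 10]   # many peaks, constant trend
slowly_growing = [10, 12, 11, 14, 16]   # small peaks, rising trend
print(is_rising(spiky_but_flat), is_rising(slowly_growing))
```
-A series with many peaks but a flat fit passes, while a slow and steady increase is flagged even when individual samples look unremarkable.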
-== Evaluation of the Alert Impact ==
-If everything went well, the next step is to evaluate the performance impact of alert execution.
-Apply one alert to five specific modules of each agent (of type ''generic_data'') for the ''CRITICAL'' condition. Use a lightweight action, such as creating an event or writing to **syslog**, to avoid the distortion that a high-latency action (for example, sending an email) would introduce.
-Optionally, create one event correlation alert that fires for any critical condition of any agent on one of these five modules.
-Leave the system running for 12 hours under these conditions and evaluate the impact using the same criteria as before.
-== Evaluation of Data Purging and Transfer ==
-Suppose the data storage policy is:
-  * Delete events older than 72 hours.
-  * Move data older than 7 days to the history database.
-Leave the system running "on its own" for at least 10 days to evaluate long-term performance. A "peak" may appear after 7 days, when data starts moving to the history database. This degradation is <wrap hi>important</wrap> to take into account. If that much time is not available, the effect can be reproduced (with less "realism") by shortening the purge interval to 2 days for events and 2 days for moving data to history.