Replication
This service replicates data from an RDBMS to OpenSearch.
Note — impact on system performance: Replication has a direct impact on system performance. In particular, the number of threads (Replication → Settings → Replication_Threads) and the interval (replicationInterval) play a role: the more jobs run in a short time, the higher the CPU load, especially when a lot of data is replicated. A larger replicationBlockSize shortens the replication duration, but it also directly increases the memory requirements of Apache Karaf.
Preparation for data record updates
A few prerequisites should be in place before setting up the replications. Replication requires a column with strictly increasing timestamps in each table (log as well as child log), because this timestamp is used afterwards to find new or updated data records. If different processes, servers, etc. log into the DB in parallel and therefore with a possible time offset (even if only milliseconds), individual data records may never show up in the BPC because they are committed with a timestamp older than the latest one already replicated.
It is therefore recommended to:

- log the time of the DB server rather than that of the logging server, since different process servers may have slightly different clocks
- ensure that this column is actually rewritten on every insert and update

Both can be achieved simply via an additional (hidden) column with a default value and an associated trigger. This has the advantage that nothing changes at all for the logging process.
Adding a technical TIMESTAMP column for replication
Oracle
For the PM log and child log tables (without prefix), the creation procedure under Oracle looks like this, for example:
--LOG:
--Add a column (with 6 decimal places of precision); set the default value for inserts to the current time in UTC (this eliminates the need for adjustments in the logging process)
ALTER TABLE LOG ADD (DB_UPDATE_TS TIMESTAMP DEFAULT SYSTIMESTAMP AT TIME ZONE 'UTC' NOT NULL);
--Backfill existing log entries; otherwise replication will not handle them correctly:
--Either update them all in one statement (not recommended for tables with more than 1 million entries!):
UPDATE LOG SET DB_UPDATE_TS = TIMESTAMP;
Commit;
--or for large datasets and live systems that also need to perform operations on the tables, as an anonymous PL/SQL block with minimal undo tablespace requirements:
--In this case, an index should also be created on the timestamp column to avoid full table scans!
DECLARE
BEGIN
  FOR counter IN 0 .. 3650 LOOP
    --dbms_output.put_line(to_char(to_date('2010-01-01', 'YYYY-MM-DD') + counter, 'YYYY-MM-DD') || ' - ' || to_char(to_date('2010-01-01', 'YYYY-MM-DD') + 1 + counter, 'YYYY-MM-DD'));
    --LOG:
    update log set DB_UPDATE_TS = TIMESTAMP where TIMESTAMP between to_date('2010-01-01', 'YYYY-MM-DD') + counter and to_date('2010-01-01', 'YYYY-MM-DD') + 1 + counter;
    commit;
    --CHILDLOG:
    update childlog set DB_UPDATE_TS = TIMESTAMP where TIMESTAMP between to_date('2010-01-01', 'YYYY-MM-DD') + counter and to_date('2010-01-01', 'YYYY-MM-DD') + 1 + counter;
    commit;
  END LOOP;
END;
/
--Create a trigger that sets the column on every update (this eliminates the need for adjustments to the logging process)
CREATE OR REPLACE
TRIGGER LOG_DB_UPDATE_TS
BEFORE UPDATE ON LOG
REFERENCING NEW AS NEW OLD AS OLD
FOR EACH ROW
DECLARE
BEGIN
:NEW.DB_UPDATE_TS := SYSTIMESTAMP AT TIME ZONE 'UTC';
END;
/
--Create Index:
CREATE INDEX IDX_LOG_DBLU ON LOG(DB_UPDATE_TS) COMPUTE STATISTICS;
--Here's the whole thing again for Childlog:
ALTER TABLE CHILDLOG ADD (DB_UPDATE_TS TIMESTAMP DEFAULT SYSTIMESTAMP AT TIME ZONE 'UTC' NOT NULL);
--See the PL/SQL block above, if applicable!
UPDATE CHILDLOG SET DB_UPDATE_TS = TIMESTAMP;
commit;
CREATE OR REPLACE
TRIGGER CHILDLOG_DB_UPDATE_TS
BEFORE UPDATE ON CHILDLOG
REFERENCING NEW AS NEW OLD AS OLD
FOR EACH ROW
DECLARE
BEGIN
:NEW.DB_UPDATE_TS := SYSTIMESTAMP AT TIME ZONE 'UTC';
END;
/
CREATE INDEX IDX_CHILDLOG_DBLU ON CHILDLOG(DB_UPDATE_TS) COMPUTE STATISTICS;
--Finished
MSSQL
Note: A trigger for CHILDLOG is missing below. Normally it is not required, as child log entries are only ever inserted and never updated; a hedged sketch of such a trigger follows the script in case it is needed.
/* Add columns. Use DATETIME2 for higher precision than the legacy DATETIME type (which only offers roughly 3 ms precision). SYSDATETIME() returns a DATETIME2(7) value with 100-nanosecond precision, so use it instead of CURRENT_TIMESTAMP. */
ALTER TABLE [LOG] ADD DB_UPDATE_TS DATETIME2 DEFAULT SYSDATETIME() NOT NULL;
GO
ALTER TABLE [CHILDLOG] ADD DB_UPDATE_TS DATETIME2 DEFAULT SYSDATETIME() NOT NULL;
GO
/* Initialize columns */
BEGIN
UPDATE [LOG] SET DB_UPDATE_TS = [timestamp];
END
GO
BEGIN
UPDATE [CHILDLOG] SET DB_UPDATE_TS = [timestamp];
END
GO
/* Trigger */
CREATE TRIGGER LOG_DB_UPDATE_TS
ON [LOG]
AFTER UPDATE
AS
BEGIN
IF NOT UPDATE(DB_UPDATE_TS)
BEGIN
UPDATE t
SET t.DB_UPDATE_TS = SYSDATETIME()
FROM [LOG] AS t
INNER JOIN inserted AS i
ON t.PROCESSID = i.PROCESSID;
END
END
GO
/* Indices */
CREATE INDEX IDX_LOG_DBLU ON [LOG] (DB_UPDATE_TS);
CREATE INDEX IDX_CHILDLOG_DBLU ON [CHILDLOG] (DB_UPDATE_TS);
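Should child log entries ever be updated after all, a trigger analogous to the one for LOG can be created. The following is a minimal sketch only; the join column CHILDLOG_ID is a hypothetical placeholder and must be replaced by the actual key column of your CHILDLOG table:
/* Hedged sketch: same pattern as the LOG trigger above; CHILDLOG_ID is a placeholder for the real key column */
CREATE TRIGGER CHILDLOG_DB_UPDATE_TS
ON [CHILDLOG]
AFTER UPDATE
AS
BEGIN
IF NOT UPDATE(DB_UPDATE_TS)
BEGIN
UPDATE t
SET t.DB_UPDATE_TS = SYSDATETIME()
FROM [CHILDLOG] AS t
INNER JOIN inserted AS i
ON t.CHILDLOG_ID = i.CHILDLOG_ID;
END
END
GO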
MySQL
and for MySQL quite simply:
/* Adds a sufficiently precise timestamp managed by the database to the table */
ALTER TABLE LOG ADD DB_UPDATE_TS TIMESTAMP(6) NOT NULL DEFAULT CURRENT_TIMESTAMP(6) ON UPDATE CURRENT_TIMESTAMP(6);
ALTER TABLE CHILDLOG ADD DB_UPDATE_TS TIMESTAMP(6) NOT NULL DEFAULT CURRENT_TIMESTAMP(6) ON UPDATE CURRENT_TIMESTAMP(6);
PostgreSQL
and for PostgreSQL also via trigger:
ALTER TABLE log ADD db_update_ts timestamp NOT NULL DEFAULT (timezone('UTC', now()));
ALTER TABLE childlog ADD db_update_ts timestamp NOT NULL DEFAULT (timezone('UTC', now()));
CREATE OR REPLACE FUNCTION update_db_update_ts_column()
RETURNS TRIGGER AS $$
BEGIN
NEW.db_update_ts = timezone('UTC', now());
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER update_log_db_update_ts BEFORE INSERT OR UPDATE ON log FOR EACH ROW EXECUTE PROCEDURE update_db_update_ts_column();
CREATE TRIGGER update_childlog_db_update_ts BEFORE INSERT OR UPDATE ON childlog FOR EACH ROW EXECUTE PROCEDURE update_db_update_ts_column();
CREATE INDEX idx_log_dblu ON log (db_update_ts);
CREATE INDEX idx_childlog_dblu ON childlog (db_update_ts);
Configuration
Data sources
Data sources are connections to the individual databases that the replication jobs use/reference. They are set up under Backend Connections with the type "data_source" and are referenced via their component ID.
Interface
A separate interface is available for setting up the individual replication jobs under Settings → Replication → Components → Editor. Entries can be created, deleted, duplicated and also activated/deactivated individually via the interface.
Configuration parameters of the Replication module
The various parameters and associated functions are described below. These can be found under BPC Administration → Replication → General.
Configuration parameters of a replication job
The various parameters and associated functions are described below. These can be found under BPC Administration → Replication → Components. It is recommended to use the specialized interface under BPC Administration → Replication → Editor.
Basic settings
| Setting (Key) | Data type | Description |
|---|---|---|
| Replication_Enabled | boolean | Activate/deactivate the replication job. |
| Replication_StartDate | String | Replicate data records that are newer than this point in time; can also be specified as a relative value. |
Source
| Setting (Key) | Data type | Description |
|---|---|---|
| Source_DataSource | String | Data source to be used (ID of a Backend Connection of type "data_source"). |
| Source_Table | String | Table name of the source, or name of the CTE if the optional 'Source_CommonTableExpressionQuery' is used. |
| Source_Timezone | String | Time zone of the date fields in the source database table (applied internally via TimeZone.getTimeZone). Only applied to the actual data, not to the 'lastUpdateColumn' column. |
| Source_IdColumns | String | Columns used to build a unique document key in OpenSearch. |
| Source_LastUpdateColumn | String | Column used to determine the age of a data record. |
| Source_LastUpdateColumnTimezone | String | Time zone used in the source database table for the data in the 'Source_LastUpdateColumn' column. |
| Source_QueryTimeoutInSeconds | Integer | Defines how long the JDBC driver waits for a response from the DB. See also java.sql.Statement.setQueryTimeout(). |
| Source_CommonTableExpressionQuery | String | Can be used instead of database views. May only contain the SELECT statement of a Common Table Expression (CTE); 'Source_Table' must then contain the name of the CTE (see the sketch after this table). |
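As an illustration of such a CTE, the following minimal sketch joins a hypothetical PARTNER table to LOG so that the partner name becomes searchable; the PARTNER table and its columns are assumptions for this example, and 'Source_Table' would then be set to the chosen CTE name:
--Hypothetical example: only the SELECT statement below goes into Source_CommonTableExpressionQuery
SELECT l.*, p.NAME AS PARTNER_NAME
FROM LOG l
LEFT JOIN PARTNER p ON p.ID = l.PARTNERID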
Target
| Setting (Key) | Data type | Description |
|---|---|---|
| Target_Index | String | Target index in OpenSearch. |
| Target_CaseSensitivityOfFields | String | Specifies how the field names are created in OpenSearch with regard to case sensitivity. |
| Target_IndexCreationSettings | JSON | Assign additional settings to the target index when it is created, for example "Index Sorting". If Index Sorting is to be used, OpenSearch mappings must also be created for the specified sorting fields (LASTUPDATE in the example), see Target_IndexMappings. A sketch follows this table. |
| Target_IndexMappings | JSON | Assign a mapping to the target index when it is created. Should only be necessary in special cases, e.g. for the Index Sorting example mentioned above (see the sketch after this table). |
| Target_IndexDynamicTemplates | JSON | Assign customized dynamic mappings to the target index. If this is set, the global setting (see Core setting → Core_IndexDynamicTemplates) is not used. The OpenSearch documentation on Dynamic Templates describes the possibilities. Typical use case: all fields that OpenSearch recognizes as text fields (strings) receive a mapping ("all_textfields") in which the content is not analyzed (this saves storage space while the data can still be displayed), plus an exception ("special case") whereby the standard mapping is used for all text fields whose names end in 'name'. Dynamic templates can also be used to force the OpenSearch type of a database field when the automatic mapping picks an unsuitable one; concrete example: an Oracle column named 'NUMBER' of data type 'NUMBER(10,2)' is created in the OpenSearch mapping as type 'long' instead of 'float'. Sketches follow this table. |
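The JSON values for these settings depend on the installation; the following minimal sketches assume that the sorting field is called LASTUPDATE and is a date field. For Target_IndexCreationSettings, index sorting could be enabled like this:
{
  "index.sort.field": "LASTUPDATE",
  "index.sort.order": "desc"
}
The matching Target_IndexMappings entry for the sorting field could then look like this:
{
  "properties": {
    "LASTUPDATE": { "type": "date" }
  }
}
For Target_IndexDynamicTemplates, a sketch along the lines of the description above; the choice of a plain keyword type for un-analyzed text fields and of a text field with a raw keyword sub-field for '*name' fields is an assumption for illustration, not a product default (templates are evaluated in order, so the exception must come first):
[
  {
    "special_case": {
      "match_mapping_type": "string",
      "match": "*name",
      "mapping": {
        "type": "text",
        "fields": { "raw": { "type": "keyword" } }
      }
    }
  },
  {
    "all_textfields": {
      "match_mapping_type": "string",
      "mapping": { "type": "keyword" }
    }
  }
]
To force a specific type for a single database field (the NUMBER(10,2) example), a template such as the following could be used:
[
  {
    "number_as_float": {
      "match": "NUMBER",
      "mapping": { "type": "float" }
    }
  }
]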
Advanced Settings
Shadow copies
During a shadow copy, all documents of the OpenSearch index (see 'targetIndex') that are newer than the configured 'replicationStartDate' are copied to a new index at the specified time. At the end, the alias is switched to the new index and the previous index is deleted. This can be done once a week, for example, to get back to a lean index.
Additional note: documents marked as deleted in OpenSearch (whether deleted manually or by the tail sync) are of course not copied to the new index. Attention: the associated replication run is suspended until the shadow copy has been created.
| Setting (Key) | Data type | Description |
|---|---|---|
| ShadowCopy_Enabled | boolean | Enable the creation of shadow copies. |
| ShadowCopy_CronPattern | String | Cron-like pattern (Quartz Scheduler syntax) defining when the shadow copies are created (see the example after this table). Further examples and documentation can be found on the Quartz Scheduler website. |
| ShadowCopy_KeepCopiesCount | Integer | Number of shadow copies to keep. |
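As an illustration only (not a value taken from the product documentation), a Quartz pattern for a weekly shadow copy on Sunday night could look like this; the fields are seconds, minutes, hours, day-of-month, month and day-of-week:
0 0 3 ? * SUN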
Tail Sync
The 'Tail Sync' works on the current index (see 'targetIndex') and synchronizes older data (new/changed/deleted database records) with OpenSearch so that index and database match again. It can be carried out once every night, for example. Documents that are older than the replication start date (see 'replicationStartDate') are first deleted from the OpenSearch index. The sync then walks through the database in blocks (by default in 10-day steps) and compares them with OpenSearch: new or changed database records are transferred to the OpenSearch index, and documents that no longer exist in the database are deleted from OpenSearch. The tail sync works only with the configured 'idColumns' and 'lastUpdateColumn' fields (hint: a suitable DB index works wonders!); the complete data is only read for new or changed database records.
As an alternative or in addition to a time-controlled tail sync, this can also be started manually.
For this purpose, there is a button under Replication → Jobs for each replication job that starts a tail sync.
The settings replicationTailSyncEnabled and replicationTailSyncCronPattern have no effect on a manual start; all other settings are taken into account.
Note: The sync does not run up to the current date (see 'TailSync_RelativeEndDate' below).
| Setting (Key) | Data type | Description |
|---|---|---|
| TailSync_Enabled | boolean | Activate the 'Tail Sync'. |
| TailSync_CronPattern | String | Cron-like pattern (Quartz Scheduler syntax) defining when the Tail Sync is started (see the example after this table). Further examples and documentation can be found on the Quartz Scheduler website. |
| TailSync_BlockSize | Integer | Block size of the database driver. Note the information on block size and memory requirements in the performance note at the top of this page. |
| TailSync_RelativeStartDate | String | Relative start date; synchronization then starts from this date. Data that lies before the replication start date will still be deleted. This date is only used if it is after the regular start date and before the end date. |
| TailSync_RelativeEndDate | String | Relative end date; synchronization runs up to this point in time and no further. |
| TailSync_BlockDayRange | Integer | Number of days per block in which the data is processed. |
| TailSync_RelativeDeleteOlderThanDate | String | Relative deletion date; all documents that are older are deleted. If this option is not set, all documents that are older than the replication start date are deleted (see 'Replication_StartDate'). |
| TailSync_LoggingEnabled | boolean | Specifies whether the tail sync runs are logged in the bpc-tailsync-log index. The global logging setting is also taken into account (see 'Logging of replications' below). |
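As an illustration only, a Quartz pattern for a nightly tail sync at 01:30 could look like this (fields: seconds, minutes, hours, day-of-month, month, day-of-week):
0 30 1 * * ?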
Consistency Check
A simple consistency check is carried out after the replication runs: the number of documents in the source and in the target are compared. This is done for the period between 'replicationStartDate' and the latest date present in the target (OpenSearch).
Note: With a lot of data (tens of millions of records), this can lead to an increased load on the system, and an initial replication can be slowed down considerably. Only activate the consistency check once you are sure that the data is being fully replicated.
| Setting (Key) | Data type | Description |
|---|---|---|
| ConsistencyCheck_Frequency | Integer | The frequency with which the consistency check is performed. |
Lookup Joins
Lookup joins can be used to enrich the documents to be replicated with additional data, for example when the data to be replicated contains only a partner ID but the partner's name etc. is also required in the monitor.
In OpenSearch this is done by denormalization, i.e. the lookup data is written into the document itself and is then available like all other data (high-performance search, aggregations, …). Prerequisite: the lookup "tables" must be available as separate indexes; they can, for example, be transferred from a DB table via an additional replication.
The lookup data can be updated manually (BPC settings → Overview → Status → Replication → Jobs → Job → Synchronize lookup joins) and automatically using our OpenSearch BPC plugin (os-bpc-plugin).
Several lookup tables can be referenced; the possible values of an ENTRY are described below. Structure: "join": [ { ENTRY }, { ENTRY }, … ]
Note: For the value comparison to work, the column types in the database tables must be identical. For string fields, "*.raw" should be used as the lookupKeyField; the keyField, however, is specified without the ".raw" suffix.
[
{
"keyField": "PARTNER",
"lookupIndex": "lookup-partner",
"lookupKeyField": "ID.raw",
"resultFieldsPrefix": "partner_",
"resultFieldsExcluded": [ "ID", "LASTUPDATE" ]
},
{
"keyField": "MESSAGETYPE",
"lookupIndex": "lookup-messagetype",
"lookupKeyField": "ID.raw",
"resultFieldsPrefix": "messagetype_",
"resultFieldsExcluded": [ "ID", "LASTUPDATE" ]
}
]
| Field | Data type | Description |
|---|---|---|
| keyField | String | The key field in the table to be replicated. The value of this field is used to transfer the data from the lookup index. Example: PARTNER_ID |
| keyFieldValuesSeparator | String | If the keyField contains several values separated by a separator, that separator can be specified here; the lookup join is then performed for each individual value. Example: the lookup index contains the entries ID 42 (LONGNAME = centimeter), ID 20 (LONGNAME = gram) and ID 30 (LONGNAME = kilogram). If keyFieldValuesSeparator is '%%', resultFieldsPrefix is 'MENGENEINHEIT_' and the transferred key value is '42%%20%%30', the field MENGENEINHEIT_LONGNAME = centimeter%%gram%%kilogram is created. A configuration sketch for this case follows the table. |
| lookupIndex | String | The OpenSearch index with the lookup data. Example: lookup-partner |
| lookupKeyField | String | The key field in the lookup index that is matched against the value of keyField. Example: ID.raw |
| resultFieldsPrefix | String | The fields transferred from the lookup index are given a unique prefix so that there is no conflict with existing fields. Example: partner_ |
| resultFieldsIncluded | String Array | If not all fields from the lookup index are to be transferred, the fields to include can be listed here. Example: [ "FIRST_NAME", "LAST_NAME" ] |
| resultFieldsExcluded | String Array | If almost all fields are to be transferred, the fields to exclude can be listed here. Example: [ "ID", "UPDATED" ] |
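For the separator example above, a corresponding join entry could look roughly like this; the field name MENGENEINHEIT and the index name lookup-mengeneinheit are assumptions for illustration only:
{
"keyField": "MENGENEINHEIT",
"keyFieldValuesSeparator": "%%",
"lookupIndex": "lookup-mengeneinheit",
"lookupKeyField": "ID.raw",
"resultFieldsPrefix": "MENGENEINHEIT_",
"resultFieldsIncluded": [ "LONGNAME" ]
}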
Logging of replications
The runs of the replication jobs can be logged and displayed, for example, via the automatically created monitor "Replication Jobs Monitor".
Similarly, runs of the tail sync can also be logged and viewed in the monitor "Tail Sync Logs Monitor".
These monitors are created at startup if they are not available.
They cannot be permanently deleted.
The indices used, bpc-replicationjobs-log and bpc-tailsync-log, are also created automatically as soon as a run needs to be logged.
Logging can be switched on and off globally and for each replication job (see corresponding settings above). Logging is activated globally for both replications and tail syncs by default, but deactivated in the individual replication jobs. Therefore, no logging is performed by default.
The log entries for replication runs are collected and written in bulk to the index mentioned above after some time, so logging has little impact on the overall performance of the system. However, a great many entries can be generated in a very short time (millions in a few hours), so the cleanup should not run too rarely; keep an eye on the number of documents in the index under Settings → Core Services → Indices.
Since tail syncs take place much less frequently, the bpc-tailsync-log index should not grow as quickly when tail sync logging is enabled, but it is still regularly cleaned of old entries (see the configuration above).