TopBraid Data Platform

The TopBraid Data Platform is a high availability solution for TopBraid EDG servers. It enables continuous operation of business functions by replicating data across a cluster of EDG servers. Each EDG server is kept up to date and can respond to a client/application request. Together with a load balancer to direct requests, user and application access to the EDG data remains available even if some of the servers are offline.

This availability is achieved by having a cluster of EDG servers communicating with a Data Coordinator (DC) server that receives changes from any clustered server whenever it has updates. The DC server propagates the changes to all other EDG servers, keeping them all up-to-date. Additionally, whenever a new EDG server joins the cluster, it is first brought up-to-date before it starts servicing user/application requests. Each EDG server keeps a local, persistent cache of the data so that start-up involves only applying changes missed by that server.

TopBraid EDG Data Coordinator Server Block Diagram

TopBraid EDG Data Coordinator Server Block Diagram

Each EDG or EDG Explorer server has its own copy of a datastore of all the RDF graphs managed by the Data Platform. When changes to the data are made, the changes are recorded locally as RDF Patches.

At the end of any operation (i.e. the end of the operation’s HTTP request), any changes to the local datastore are sent to the DC server and saved to disk. Once the DC server confirms that the patches are saved, the EDG server commits its own datastore transaction and responds to the HTTP request. It is the state of the RDF Patch Log that determines the state of the EDG server for Data Platform-backed graphs.

When the EDG server receives an operation, it checks that its local datastore is up-to-date and, if not, fetches the necessary patches from the DC server and applies them locally.

The DC server can be a single machine, with file-backed persistence, or a cluster of servers with storage in an Amazon S3-compatible object store. The trade-offs between these configurations is discussed below.

The TopBraid Data Platform is based on the open source (Apache License) RDF Delta system.

Obtaining the Software

The Data Coordinator server can be downloaded from:

https://download.topquadrant.com/dp/

EDG and EDG Explorer already have the Data Platform client software included and require no extra software. You must purchase a license to run Data Platform from TopQuadrant. You will then be sent the appropriate number of licenses for your EDG installations.

Note

The version corresponding to the EDG version as specified in the table on the download page must be used.

System Requirements

The DC server is a Java webserver that coordinates changes across a cluster of EDG servers. For production, the DC server should be run on a separate machine from any of the clustered EDG servers. For development and experimentation, the DC and EDG servers can be run on a shared machine.

The DC server stores patches on-disk.

The disk storage must be backed up, which can be done by file backup on a live server.

The robustness of the system is determined by the robustness of the file storage; so choosing local disk in the DC-server is limiting.

A minimum of 8GB of RAM are required.

It is imperative that the DC server be monitored for low memory and disk space.

Configurations

In a high-availability configuration, two or more EDG servers provide the EDG service. Each has a complete copy of the replicated graphs.

Typically, a deployment will put a load balancer in front of the EDG servers so that the clients (web browsers or other applications) use the same URL to access either server. This also allows a deployment to move the EDG servers to different machines for maintenance and system upgrades.

There are two ways to configure the DC server:

  • A single server backed by its local file system

  • A cluster of servers backed by an Amazon S3-compatible object store

Single Server

The file-based DC server can use any OS-supported filesystem (local storage or remote disk-array) and the reliability and durability of files written are determined by the choice of filesystem. If the single server configuration is used, then operation is interrupted if the DC server is unavailable. The EDG servers keep running but will be unable to update data until the DC server restarts. DC server startup is very fast. There is no need to restart EDG servers. The advantage is the simplicity of operation so there is a tradeoff for small deployments of simplicity and continuous operation.

Server Cluster

For continuous operation, three or more DC servers run on separate machines. The servers use Apache Zookeeper to manage a system wide datastore of the patch state of the deployment. Storage of patches is in an Amazon S3-compatible object store. There are publicly available adapters to provide the Amazon S3 API over other storage choices such as Apache Cassandra.

Guidance for failover is found in the distribution of RDF Delta provided by TopQuadrant’s download site. See Tutorial. After reviewing the guidance the following playbook can be used for EDG and Data Platform cluster.

DataPlatform cluster playbook

Setup

Data Platform can be used as the default datastore for all of the collections in the workspace or as the storage for nominated asset collections, with an option selected when the asset collection is created.

Any projects uploaded to EDG that were supplied by TopQuadrant professional services for customizations or sample data will not be synced between EDG servers. These must be uploaded individually to each EDG server. EDG configuration files will also not be stored via Data Platform. Changes to configurations will need to be manually replicated on each EDG server.

This section is a short walkthrough of configuring EDG with Data Platform with file-backed storage of its patches. The Data Coordinator server is the RDF Delta patch log server.

Only a new EDG workspace can be configured to use Data Platform as its backing data storage. An existing workspace cannot be converted to using Data Platform by simply changing its database storage option. The data must be migrated manually.

File-based persistent patch storage

The simplest configuration of the Data Coordinator server uses the server’s file system for patch storage. This is an OS-supported file system (local storage or remote disk-array) and the reliability and durability of files written are determined by the choice of file system.

Data Platform default datastore steps

Perform the following steps in the order presented. Further details are in the following sections.

  1. Download and start the Data Coordinator server (see the following section).

  2. Install the first instance of EDG using the database properties below. Use a setup properties file to configure the EDG instance so the file can be copied and modified for the other EDG instances. See EDG Server Installation.

    databaseType = DataPlatform
    
    # The Data Coordinator server's DNS name or IP address
    dpServerURL = http://localhost:1066/
    
    isPrimaryNode = true
    
    # Location of the local copy of the shared data (rarely changed)
    dpZone = Zone
    
  3. Verify this configuration is working by creating a collection using the EDG instance. Then check the Base URI Management Admin Page page and, in the Repositories project, verify that the new collection’s file has an extension of .dpc.

  4. Shut down and clone this newly-created EDG instance to set up the other EDG instances. Copy the EDG setup properties file (typically specified as the Tomcat system property edg-setup) and the entire workspace directory tree (as specified by the workspacePath property in the EDG setup file). Note that the workspace directory tree cannot be shared by multiple EDG servers (e.g. via a network shared directory); each server must have its own workspace. Also note that each EDG server must have its own unique license file.

  5. Change the primary node flag in the setup properties files for the additional EDG instances to false so there is only a single primary node:

    isPrimaryNode = false
    
  6. Set up a backup mechanism and memory and disk space monitoring for the Data Coordinator server.

  7. Optionally, configure a Data Platform cluster. See the Server Cluster section, above.

Running the Data Coordinator server

Note: In production, the EDG and Data Coordinator servers should be run on physically separate machines (and not just separate virtual machines), so they will not both be brought down by a single hardware failure.

Run the Data Coordinator server with the following command:

java -jar delta-server.jar –-file –-base DIRECTORY --port 1066

where DIRECTORY is an existing and initially empty directory where the Data Coordinator server stores the patch logs as files.

Example Data Coordinator server output upon startup:

[2023-01-27 11:57:17] Delta      INFO  Provider: file
[2023-01-27 11:57:17] Config     INFO  Delta Server port=1066
[2023-01-27 11:57:17] Delta      INFO  RDF Delta 1.1.2 2022-08-17T12:45:57+0000
[2023-01-27 11:57:17] Delta      INFO    No data sources

Encrypted communication

The EDG and Data Coordinator servers can be configured to use encrypted data when communicating. This requires configuring the DC server to support HTTPS requests and configuring the EDG server to use HTTPS when communicating with the DC server.

The DC server is configured to support HTTPS requests via a command line option that specifies a separate Jetty configuration file. Internally, the DC server uses Eclipse Jetty to handle incoming HTTP requests. As a result, configuring the DC server to handle HTTPS requests is a matter of configuring its Jetty server. Use the --jetty command option when starting the DC server to specify the appropriate Jetty configuration file:

java -jar delta-server.jar –-file –-base DIRECTORY --jetty JETTY_XML

where DIRECTORY is an existing and initially empty directory where the DC server stores the patch logs as files and JETTY_XML is the Jetty configuration file. Below is a link to a template of a Jetty configuration file that can be used to enable the DC server to handle HTTPS requests:

Jetty XML file template

Edit the template file and modify three SslContextFactory$Server settings to allow Jetty to use the appropriate key store:

  • keyStorePath

  • keyStorePassword

  • keyManagerPassword

The EDG servers must then be configured to use HTTPS when communicating with the DC server. Change the Data Platform server URL in the setup properties files to specify https:

dpServerURL = https://localhost:1066/

Once the EDG servers are started, they will be communicating with the DC server via HTTPS.

Outages

If the Data Coordinator is not running, the associated EDG servers:

  • can view and query the shared collections

  • cannot update the shared collections

  • cannot create new shared collections

Simply restarting the Data Coordinator server will re-enable the update and sharing features. Restarting EDG is not necessary.

If an EDG server’s databaseType is configured to be DataPlatform, the Data Coordinator server must be running and available when the EDG server is started for the first time.

Backup & Restore

See also

See the Data Platform-related sections in EDG Backup and Restore: