- As with many other distributed systems, the network can have a large influence on performance: in many use cases and deployment scenarios, a Space operation is very likely to have to wait for a network round trip before it completes. Therefore, the faster the network between the ActiveSpaces nodes, the lower the latency of each Space operation.
- There can be no IP network address translation or one-way firewall between any two directly connected members of a Metaspace.
- If one of those conditions exists, the process can still connect to the metaspace by making a remote client connection.
- If multicast discovery is used and not all the members of the metaspace are attached to the same subnet, then either multicast routing must be enabled between the subnets or TCP discovery must be used.
- Unless the link is both low latency and redundant (for example, it is a MAN link), it is not advisable (although it is technically possible) to extend a single Metaspace between two separate sites linked by a WAN link.
- You can instead have a Metaspace at each site and write a small application to listen to changes on one site and replicate them on the other.
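The one-way replication idea in the last bullet can be sketched as follows. This is a minimal illustration, not ActiveSpaces code: `SiteStore`, `replicate`, and the `"put"`/`"take"` event names are hypothetical stand-ins for a metaspace at each site and its change notifications, and the forwarding happens in-process rather than over a WAN link.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical listener signature: (operation, key, record or None).
ChangeListener = Callable[[str, str, Optional[dict]], None]


class SiteStore:
    """In-memory stand-in for the metaspace at one site (NOT the ActiveSpaces API)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.records: Dict[str, dict] = {}
        self._listeners: List[ChangeListener] = []

    def add_listener(self, listener: ChangeListener) -> None:
        self._listeners.append(listener)

    def put(self, key: str, record: dict) -> None:
        self.records[key] = record
        self._notify("put", key, record)

    def take(self, key: str) -> Optional[dict]:
        record = self.records.pop(key, None)
        if record is not None:
            # Only notify when something was actually removed.
            self._notify("take", key, None)
        return record

    def _notify(self, op: str, key: str, record: Optional[dict]) -> None:
        for listener in self._listeners:
            listener(op, key, record)


def replicate(source: SiteStore, target: SiteStore) -> None:
    """Forward every change made on `source` to `target` (one direction only)."""

    def on_change(op: str, key: str, record: Optional[dict]) -> None:
        if op == "put" and record is not None:
            target.put(key, dict(record))  # copy so the sites do not share objects
        elif op == "take":
            target.take(key)

    source.add_listener(on_change)
```

With `site_a = SiteStore("A")`, `site_b = SiteStore("B")`, and `replicate(site_a, site_b)`, a `site_a.put("k1", {"v": 1})` also appears in `site_b.records`, and a later `site_a.take("k1")` removes it from both. A real bridge would presumably also need to queue changes and retry while the WAN link is down, which this sketch leaves out.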
Physical versus Virtual Considerations
Although running in a virtual environment is certainly supported, it is supported only so long as the following conditions are met:
- No “snapshots” or any other kind of operation that suspends or pauses (or moves) a virtual machine are run on the virtual machine.
- Multicast discovery is NOT used.
However, it is a best practice to use physical hosts rather than virtual hosts when possible (at least for seeders). ActiveSpaces does not benefit from being deployed in a virtual environment because it already has its own built-in virtualization mechanism (or rather, it provides virtual data store functionality), and deploying multiple virtual machines, each running ActiveSpaces processes, on a single physical server would:
- Potentially degrade the fault tolerance of the replication mechanism (if the physical machine goes down, all of the virtual machines it contains go down at the same time).
- Probably result in more of the physical CPU being used for virtualization overhead without providing any functional advantage (ActiveSpaces already pools the RAM and CPUs of those virtual machines).
Also remember that in many cases network bandwidth to the seeders is the overall limiting factor, and that virtual machines usually have to share a physical network interface with other virtual machines. You would therefore get more overall bandwidth from many small physical machines, each with its own network interface, than from one large physical host with fewer physical network interfaces than virtual machines. Even when the number of physical interfaces matches the number of virtual machines, you still incur the overhead of the virtualization layer’s internal (software) network switch.
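The bandwidth argument can be made concrete with a small calculation. The sketch below assumes that the seeders on one host share its physical interfaces evenly and ignores the software switch overhead, so it is a best case for the virtual deployment. The function name and all figures are illustrative assumptions, not measurements.

```python
def per_seeder_bandwidth_gbps(nic_count: int, nic_gbps: float, seeders_on_host: int) -> float:
    """Best-case bandwidth per seeder when the seeders on one host share its NICs evenly."""
    if nic_count < 1 or seeders_on_host < 1:
        raise ValueError("nic_count and seeders_on_host must be at least 1")
    return nic_count * nic_gbps / seeders_on_host


# Illustrative numbers only (assumptions, not measurements).
# Eight small physical hosts, each with one 10 Gbps interface and one seeder:
physical = per_seeder_bandwidth_gbps(nic_count=1, nic_gbps=10.0, seeders_on_host=1)
# One large host with two 10 Gbps interfaces shared by eight seeder VMs:
virtual = per_seeder_bandwidth_gbps(nic_count=2, nic_gbps=10.0, seeders_on_host=8)

print(f"physical host, per seeder:   {physical} Gbps")
print(f"virtual machine, per seeder: {virtual} Gbps")
```

Under these assumed figures each physical seeder has four times the bandwidth of each virtual one, before any virtualization overhead is counted.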
Note: These values and sizing guidelines are valid for version 2.0.2 of ActiveSpaces; they have changed in the past and will change again in the future depending on the version of ActiveSpaces.
Sizing of ActiveSpaces is along two independent axes: the amount of memory required to store the records and the indexes, and the number of CPU cores required to support a certain number of operations per second.
- For number of operations per second, a (very conservative) estimate is to expect 10,000 “per key operations” (that is, put, get, take, lock, unlock, and so on, but not browser creation or remote code invocation) per second per “seeder core” on a single space.
- For memory sizing, the amount of RAM required by a seeder to store a single record depends on the following factors:
- There are around 400 bytes of internal overhead per record stored in the space (including the key field index that is automatically created for each space).
- The amount of space it takes to store the fields of each record and associated overhead, according to the following list:
- Short: 2 bytes
- Int: 4 bytes
- Long: 8 bytes
- Float: 4 bytes
- Double: 8 bytes
- String: 8 bytes + UTF-8 encoded string (1 byte per char for basic ASCII) + 1 byte for null
- Blob: 8 bytes + byte length
- Datetime: 9 bytes
- The replication factor
- A 30% to 50% factor to account for runtime memory requirements (buffers, internal metaspace-related data structures, and “headroom” for possible memory fragmentation when data is constantly updated).
This results in the following formula (for example, using a 42% headroom factor):
Memory_of_record_in_bytes = (400 bytes + payload_bytes) x (1 + replication_count) x 1.42
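The sizing rules above can be turned into a small calculator. The sketch below only restates the figures given in this section (400 bytes of overhead, the per-type field sizes, the headroom factor, and 10,000 operations per second per seeder core). The function names are illustrative, and index overhead beyond the key fields index is not included.

```python
import math

# Fixed field sizes in bytes, from the list above.
FIXED_FIELD_BYTES = {
    "short": 2,
    "int": 4,
    "long": 8,
    "float": 4,
    "double": 8,
    "datetime": 9,
}

RECORD_OVERHEAD_BYTES = 400   # approximate internal overhead per record
OPS_PER_SEEDER_CORE = 10_000  # conservative per-key operations per second per core


def string_bytes(value: str) -> int:
    """8 bytes + UTF-8 encoded length + 1 byte for the null."""
    return 8 + len(value.encode("utf-8")) + 1


def blob_bytes(value: bytes) -> int:
    """8 bytes + the byte length of the blob."""
    return 8 + len(value)


def record_memory_bytes(payload_bytes: int, replication_count: int,
                        headroom: float = 0.42) -> float:
    """(400 + payload_bytes) x (1 + replication_count) x (1 + headroom)."""
    return ((RECORD_OVERHEAD_BYTES + payload_bytes)
            * (1 + replication_count)
            * (1 + headroom))


def seeder_cores_needed(ops_per_second: int) -> int:
    """Seeder cores needed for a target rate of per-key operations on one space."""
    return math.ceil(ops_per_second / OPS_PER_SEEDER_CORE)


# Example record: one long, one int, one 20-character ASCII string, one datetime.
payload = (FIXED_FIELD_BYTES["long"] + FIXED_FIELD_BYTES["int"]
           + string_bytes("x" * 20) + FIXED_FIELD_BYTES["datetime"])
per_record = record_memory_bytes(payload, replication_count=1)
```

For this example the payload is 50 bytes, so with one replica and 42% headroom each record needs about 1,278 bytes, and 10 million such records need roughly 12.8 GB across the seeders. A target of 45,000 per-key operations per second would call for `seeder_cores_needed(45_000)`, that is, 5 seeder cores under the conservative estimate.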
- If you are using indexes (beyond the key fields index), you also need to account for a per-entry overhead, also around 400 bytes on average. The exact size is almost impossible to calculate because it depends on the type of index, the distribution of the data, and the total number of entries being indexed.
- For example, the memory requirements for HASH index types do not grow exactly linearly, because hash tables grow by doubling their size as needed.
Conversely, a TREE index’s memory requirements grow in part depending on the depth of the tree, and some data sets generate trees of different depth depending on how the possible key space is used (for example, if a