TIBCO Activespaces – Best Practices – 9. Persistence

Persistence

Shared-all Persistence

  • You can register more than one instance of the implementation of the shared-all interface on the same space.
    • When more than one instance of the implementation of the shared-all persistence interface is registered, then onWrite and onRead operations are distributed amongst those instances.
      • The most efficient deployment mode is when every seeder on the space registers an implementation of the shared-all persistence interface. In this case, the instance registered by a seeder will handle the onWrite and onRead events for the keys that the seeder seeds.
    • Regardless of the number of persistence instances registered in a space, in shared-all persistence mode only one of those instances receives the onLoad event.

Shared-nothing Persistence

There is no need to implement any interface in order to use shared-nothing persistence, it is built-in to ActiveSpaces.

  • When using shared-nothing persistence you MUST ensure that all of the seeders on the space have a unique member name and that this member name remains the same between instantiations of the seeder(s).
  • In shared-nothing mode of persistence the re-loading of the data is distributed (unlike with shared-all persistence), but you must make sure that enough seeders (and the seeders that were the last members of the space when it went down) are connected to the metaspace before triggering recovery if you do not want to miss any persisted data being recovered.

Usage Patterns

Storing “Lookup Data”

You can use ActiveSpaces to store “lookup data’,” typically data that does not change very often and is not very large (such that a copy of the whole data set can fit in the memory of each node that needs access to it) but that you need to be able to read (on any key) as fast as possible.

  • In this case, you can use ActiveSpaces to store the data into a space with a replication degree of “ALL’, and make the applications that need access to this data be seeders on the space.
  • Because a seeder can service read request not only for seeded entries but also replicated entries at “in-process speed,”  a read request on any key in the space is serviced very fast.

Storing Documents

While ActiveSpaces stores tuples and tuples are a flat container, fields you can still use ActiveSpaces to store structured data.

  • For example, you can use a space as a map of objects in Java or .NET by creating a space with the key and value fields being both of the Blob type and storing serialized versions of the objects in those fields.
  • A Tuple object itself comes with very fast and efficient serialization and deserialization methods and can also be serialized and stored into a Blob field of another tuple.

For unstructured data (for example, XML or JSON documents) you can also store (and then query them) with ActiveSpaces as long as you are willing to write the bit of code required to extract from the document the key field(s) as well as the field(s) you want to query on (and have indexes built on).

  • You then store those key and indexable fields as separate fields in the tuple, followed by the whole document stored in a single String or Blob field.
  • You could also use TIBCO BusinesWorks to not just do this extracting of the key and indexed fields from the document, but also to implement a SOAP or REST interface to the applications wanting to store and query those documents in ActiveSpaces.
  • It would also be very efficient to store the document itself not as one very large String but to first compress it (using something like Zlib for example, which is known to compress XML/JSON documents very well) into a byte array and then store this compressed data into a Blob field.
    • This would result not only in much reduced memory requirements (unless you need to query every single field in the document, it is likely to allow you to be able to store more data than you have RAM available), but also higher throughput and lower latency of the operations since less time is spent serializing and sending data over the network when it is compressed.

Workload Distribution

There are multiple ways you can use ActiveSpaces to distribute workload, either as a workload distribution and process synchronization mechanism, or using it more as a distributed computation grid.

Space as a Queue

  • You can very simply use a space as a distributed (but unordered) queue by having any number of applications create “take browsers” on the space.
    • ActiveSpaces automatically takes care of all race conditions and corner cases such that a particular tuple is only “taken” (consumed) by one of those browsers. This provides you with “demand driven distribution” of the consumption of the tuples stored in the space (just like a “shared queue’)
    • You can even make this consumption more “fail-safe” by using a “lock browser” rather than a “take browser”: lock rather than consume the tuple, and if (and only if) the processing of the tuple is successful then do a take with option unlock to consume it.
  • Using “take browsers” to consume data from a Space will provide “demand-based distribution” if one of the consuming processes is faster than the others, because it will call “next()”more often on its take browser it will automatically consume more entries than the others.
  • Take a look at the request/reply example provided with ActiveSpaces to see how to use a space as a queue for distribution of requests.

Remote Invocation

You can use remote invocation to either direct an invocation request to a single process, or to trigger this invocation in parallel on a set of processes.

Remotely invoked classes can be passed a “context tuple” as well as return a “result tuple” back to the invocating code, those tuples are “free form’, meaning that they do not need to conform to any specific schema and can contain any fields (of any type) the programmers want.

Remote invocation comes in two forms:

  • Directed invocation
  • Broadcasted invocation

Directed Invocation

Directed invocation can be either to a specific metaspace member, or leveraging ActiveSpaces’ distribution algorithm, to one of the space’s seeders according to the value(s) of the key fields:

  • You can use Space and Metaspace membership listeners to keep a list of Space or Metaspace members in your application and then invoke classes directly on any member.
  • You can distribute invocations amongst the seeders of a space by invoking on a key tuple on that space.
  • ActiveSpaces then uses its internal distribution algorithm to direct the request to the seeder that seeds the key in question.
    • In the invoked code, you are guaranteed that a read on the key being invoked on will be local to the seeder.
      • You can use directed invocation to do “in-place data updating.” For example, if you want to increment a single field in a large tuple, rather than getting the whole tuple from the Space, incrementing the field, and doing a put of the whole tuple, it can be more efficient to invoke an “increment” class on the key, which means that the getting of the record, incrementation of the field, and subsequent updating of the record all happen locally on the seeder for the record in question.
      • The invocation happens regardless of data being actually stored in the space for the key being used.
    • You can even define a space on which you store no data at all, and which is used only to distribute invocation requests amongst the seeders of the space. In this case you only need to define the key field(s) that you will use as your distribution field(s).

Broadcasted Invocation (and Map/reduce Style Processing)

Broadcasted invocation triggers the execution of a class on multiple space members at the same time, the invocation can be sent to all the members of a Space or to all the Seeders of a Space, or even to all remotely connected client applications.

  • Broadcasted invocation does not require (or make use of) IP broadcast( or muliticast) packets over the network. While the invocation request is effectively “broadcasted” to a number of hosts, this “broadcasting” happens over the individual TCP connections established between those members.
  • Unlike directed invocation, which returns a single result tuple, broadcasted invocations return a collection of results (one per invoked member), each result possibly containing a result tuple.

Broadcasted invocation can be used to do fast parallelized processing of large data sets (map/reduce style processing).

  • In this case “large data sets” means “all or most of the records stored in the space’.

Let’s look at a practical example: calculating the average value of a field in all (or most) of the records stored in a Space:

  • Create a class implementing the MemberInvocable interface; in this class create a browser on the space (with a filter if needed) with a distribution scope of “SEEDED.”
    • Make sure the class is in the path of all the seeders for the Space.
  • Use this browser to iterate through all the (matching) records and to calculate the average value of the field.
  • Return the average value (and number of matching records).
    • This invocable class is effectively your “map” function.
  • On the invoking side, use InvokeSeeders to trigger the execution of this class in parallel on all the seeders of the Space. And compute an average value from the partial average values (and number of matching records) returned by the invocation.
    • This final calculation that happens on the invoker is effectively your “reduce” function.

This method is the fastest way to compute the average because:

  • The processing is distributed, with each seeder doing part of the processing.
    • There is no need for the programmer to worry about how many processes seed on the space.
  • The processing at each seeder will be fast because it happens only on “local data’.
    • When creating a browser of distribution scope “SEEDED’, the browser will iterate only over records that are stored in the process itself (and therefore very fast).

Distributed process synchronization

Broadcasted invocation can also be used for synchronizing distributed processing:

  • You can take a look at the ASPerf example to see how it is using a space (the “control space”) not to store any data but only to synchronize processing amongst a number of processes.

Locking of keys on a space can also be used to synchronize distributed processing:

  • Since the scope of a lock (i.e., the “owner” of a lock) is a thread, you can use locking on a space to synchronize processing not only between threads in a process but also between processes (and by extension between threads on different process).
  • Locking of a key in a space is not linked to data being stored in the space at that key (other than protecting against modification of the data): you can lock a key even if there is no record stored in the space for the key.
    • You can use a space not to store any data but only for locking and unlocking keys in order to synchronize distributed processing between threads and processes.
Advertisements

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s