Event-driven system and subscribers missing events


Discussion Overview

The discussion revolves around the challenges faced by stateful services in an event-driven system, particularly regarding the handling of missed or incorrectly processed events. Participants explore various strategies for ensuring that events are processed in the correct order and maintaining the integrity of the service state.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested

Main Points Raised

  • One participant suggests that subscribing services should have a mechanism to "reset" their state when an event is missed or incorrectly processed, potentially by retrieving lost events from a storage service.
  • Another participant proposes that subscribers could pull requests from a broker to ensure events are processed in the correct order, referencing ZeroMQ as a suitable architecture for such systems.
  • A different viewpoint emphasizes the importance of time synchronization in event processing, discussing how clock drift can affect event timestamps and suggesting that events should report the time of the last synchronization.
  • One participant elaborates on the use of an MQ broker where messages are queued and processed transactionally, which can prevent message loss and ensure correct order, although this may introduce latency under heavy loads.
  • Another participant distinguishes between "resyncing" and "resetting" state, proposing that periodic checkpoints could allow services to revert to a known good state and process forward from there if an event is missed.
  • There is also a suggestion that it may be possible to assess whether a missing event is still relevant based on the context of recent events, implying that not all missed events may require reprocessing.

Areas of Agreement / Disagreement

Participants express differing views on the best approach to handle missed events and the implications of state management. There is no consensus on a single solution, and multiple competing strategies are discussed.

Contextual Notes

Participants highlight various assumptions regarding the architecture of event-driven systems, the role of time synchronization, and the potential trade-offs of different approaches, such as transactional processing versus performance under load. These factors remain unresolved within the discussion.

SlurrerOfSpeech
Let's say I have a service that publishes events, like

e0 ("Bought 100 shares of AAPL")
e1 ("Bought 100 shares of T")
e2 ("Sold 500 shares of TSLA")

and there exist stateful services subscribing to the events and whose state depends on the events being processed successfully and in the correct order.

There are many things that can go wrong on the subscription side:
  • A subscribing service fails to process an event and is not able to try to re-process it, leading to a contaminated state.
  • A service "successfully" processes the event but, because of a bug in the processing, did not actually process it correctly. This is equivalent to the first bullet point.
Should the subscribing services have a way of "resetting" their state once such a problem occurs?

For example, let's say a service processed e0 and e2 but not e1 because e1 somehow got lost. Maybe the subscribing service keeps a record of the events it has processed and knows, once it sees e2, that it first needs to process e1, which it can get from some service that stores all the events.
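
A minimal sketch of that idea, assuming the publisher attaches a consecutive sequence number to every event. The `Event`, `EventStore`, and `Subscriber` names are hypothetical, and the store is just an in-memory stand-in for "some service that stores all the events":

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    seq: int          # assumed: consecutive sequence number set by the publisher
    payload: str


class EventStore:
    """Hypothetical stand-in for the service that stores every event."""

    def __init__(self, events):
        self._by_seq = {e.seq: e for e in events}

    def fetch(self, seq):
        return self._by_seq[seq]


@dataclass
class Subscriber:
    store: EventStore
    next_seq: int = 0                               # next sequence number expected
    processed: list = field(default_factory=list)   # record of what was applied

    def handle(self, event):
        if event.seq < self.next_seq:
            return                                  # already processed, ignore
        # Fill any gap from the store before applying the new event.
        for missing in range(self.next_seq, event.seq):
            self._apply(self.store.fetch(missing))
        self._apply(event)

    def _apply(self, event):
        self.processed.append(event.payload)
        self.next_seq = event.seq + 1


events = [
    Event(0, "Bought 100 shares of AAPL"),
    Event(1, "Bought 100 shares of T"),
    Event(2, "Sold 500 shares of TSLA"),
]
sub = Subscriber(store=EventStore(events))
sub.handle(events[0])
sub.handle(events[2])   # e1 never arrived; it is fetched from the store first
```

This only covers the "lost event" case. It does nothing for an event that was applied incorrectly because of a bug, since the subscriber has no way to notice that from sequence numbers alone.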
 
It might be better for the subscribers to pull the requests from a broker in this case so that events are processed in the correct order.

ZeroMQ was made for these kinds of systems:

http://zguide.zeromq.org/page:all

As you read down, the guide presents various architectures for microservices, with pros and cons.

http://zeromq.org/intro:read-the-manual
 
jedishrfu said:
It might be better for the subscribers to pull the requests from a broker in this case so that events are processed in the correct order.
And there you are in the middle of what I was doing the last ten years of my professional life - the problem of "time stamping" an event with the correct universal time. After all, there might be several brokers - how do you ensure that the time stamp of an event is correct?

Some years ago, I published an insight here (https://www.physicsforums.com/insights/time-synchronization-across-switched-ethernet/) which discussed the clock synchronization problem for various accuracy requirements. For human systems (like the broker problem), the NTP protocol (with an estimated synchronization accuracy of about 2 ms) is more than precise enough. The only problem is that the system clock will drift between synchronizations, and thus a timestamped event must somehow report the time of the last synchronization and the measured clock drift between the last two synchronizations.
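
As a rough illustration of what a consumer could do with those two extra fields, here is a sketch that turns them into an error bound on the timestamp. The field names, the drift unit (parts per million), and the rule "sync accuracy plus drift accumulated since the last sync" are all assumptions made for the example, not something taken from the insight:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StampedEvent:
    timestamp: float   # local clock reading when the event occurred (seconds)
    last_sync: float   # local clock reading at the last synchronization (seconds)
    drift_ppm: float   # drift measured between the last two syncs (parts per million)


def error_bound(event, sync_accuracy=0.002):
    """Assumed bound: sync accuracy (about 2 ms) plus drift accumulated since last sync."""
    elapsed = event.timestamp - event.last_sync
    return sync_accuracy + abs(event.drift_ppm) * 1e-6 * elapsed


def definitely_before(a, b):
    """True only if a precedes b even when both error bounds are allowed for."""
    return a.timestamp + error_bound(a) < b.timestamp - error_bound(b)


a = StampedEvent(timestamp=1000.000, last_sync=936.0, drift_ppm=50.0)
b = StampedEvent(timestamp=1000.003, last_sync=990.0, drift_ppm=50.0)
c = StampedEvent(timestamp=1000.020, last_sync=990.0, drift_ppm=50.0)

print(definitely_before(a, b))   # the 3 ms gap is inside the combined error bounds
print(definitely_before(a, c))   # the 20 ms gap is outside them
```

With these made-up numbers, events stamped 3 ms apart by different brokers cannot be ordered reliably, while events 20 ms apart can.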

For a more thorough discussion of time synchronization, read the insight.
 
I was referring to an MQ broker where producer programs write messages to a queue and consumer programs read messages from the queue in a transactional scheme. That way, if the consumer fails, it can restart without missing a transaction and still process the messages in the correct order. The transactional feature is important because a message won't be dropped from the queue until the transaction is completed. However, it may slow the system down if the message load is very heavy, as in stock ticker systems.
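
To make the transactional read concrete, here is a small in-memory sketch. `TransactionalQueue` is a hypothetical stand-in, not the API of any real MQ product: the message at the head of the queue is handed to the consumer and is only removed if the consumer's block finishes without raising.

```python
from collections import deque
from contextlib import contextmanager


class TransactionalQueue:
    """In-memory stand-in for a broker queue (hypothetical, not a real broker API)."""

    def __init__(self):
        self._messages = deque()

    def put(self, message):
        self._messages.append(message)

    def __len__(self):
        return len(self._messages)

    @contextmanager
    def transaction(self):
        """Yield the head message; drop it only if the block completes without error."""
        message = self._messages[0]
        yield message
        # Reached only when the consumer did not raise: commit by removing the message.
        self._messages.popleft()


queue = TransactionalQueue()
for text in ("Bought 100 shares of AAPL", "Bought 100 shares of T", "Sold 500 shares of TSLA"):
    queue.put(text)

processed = []
fail_once = {"Bought 100 shares of T"}   # simulate one consumer crash while handling e1

while len(queue):
    try:
        with queue.transaction() as message:
            if message in fail_once:
                fail_once.discard(message)
                raise RuntimeError("consumer crashed mid-transaction")
            processed.append(message)
    except RuntimeError:
        pass   # the message is still at the head of the queue and is retried next
```

Because the failed message stays at the head, nothing behind it is delivered until it succeeds, which is where both the ordering guarantee and the potential slowdown under heavy load come from.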

Nice insight, by the way. I think MQ systems and database systems have these notions embedded within them; at least, I'm pretty sure distributed partitioned database schemes need this to work correctly.
 
SlurrerOfSpeech said:
For example, let's say a service processed e0 and e2 but not e1 because e1 somehow got lost. Maybe the subscribing service keeps a record of the events it has processed and knows, once it sees e2, that it first needs to process e1, which it can get from some service that stores all the events.
This is more an issue of "resyncing" than "resetting". In general, it will not be possible to "unservice" e2 - but if that is possible, then you could unwind all transactions since the mis-step. A more likely solution would be to periodically checkpoint your servicer's state. So if I checkpoint at e100, e200, and e300 then discover at e377 that I missed e267, I can go back to e200 and process forward from that point.
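
The checkpoint-and-replay idea might look roughly like this. It is only a sketch: `CheckpointedService` is a hypothetical name, the state is a toy position table, and it assumes the full event log is still available somewhere to replay from:

```python
import copy


class CheckpointedService:
    """Keeps periodic snapshots of its state so it can rewind and replay."""

    def __init__(self, interval=100):
        self.interval = interval
        self.state = {}                 # toy state: symbol -> net share position
        self.checkpoints = {-1: {}}     # seq of last applied event -> snapshot

    def apply(self, seq, symbol, delta):
        self.state[symbol] = self.state.get(symbol, 0) + delta
        if seq % self.interval == 0:
            self.checkpoints[seq] = copy.deepcopy(self.state)

    def rewind_before(self, missed_seq):
        """Restore the latest checkpoint taken before missed_seq and return its seq."""
        seq = max(s for s in self.checkpoints if s < missed_seq)
        self.state = copy.deepcopy(self.checkpoints[seq])
        # Checkpoints taken after the gap are contaminated, so discard them.
        self.checkpoints = {s: snap for s, snap in self.checkpoints.items() if s <= seq}
        return seq


log = {seq: ("AAPL", 1) for seq in range(378)}   # hypothetical event log, e0..e377

svc = CheckpointedService(interval=100)
for seq in range(378):
    if seq != 267:                     # e267 is lost on the first pass
        svc.apply(seq, *log[seq])

restored = svc.rewind_before(267)      # latest checkpoint before the gap
for seq in range(restored + 1, 378):   # process forward from that point
    svc.apply(seq, *log[seq])
```

Here the service falls back to the e200 checkpoint and replays e201 through e377, this time including e267.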
It is also possible that you can determine whether the missing event matters anymore. If you are keeping a list of the most recent 20 events, losing an event before that will not matter.
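
For the "most recent 20 events" case, the relevance check can be as simple as comparing the missing sequence number with the oldest one still in the window. A sketch, assuming sequence-numbered events and using hypothetical helper names:

```python
from collections import deque

WINDOW = 20   # the service's state is just the 20 most recent events


def missing_event_matters(missing_seq, recent_seqs):
    """A missing event matters only if it would still fall inside the window."""
    return len(recent_seqs) < recent_seqs.maxlen or missing_seq > recent_seqs[0]


def window_after_losing(lost_seq, last_seq=377):
    """Build the window a service would hold after seeing e0..last_seq minus one event."""
    recent = deque(maxlen=WINDOW)
    for seq in range(last_seq + 1):
        if seq != lost_seq:
            recent.append(seq)
    return recent


print(missing_event_matters(267, window_after_losing(267)))   # long since evicted
print(missing_event_matters(370, window_after_losing(370)))   # would still be in the window
```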
 