I am a software engineer in IB. I’m sharing this in hopes of picking up some tips or insights from the many sharp users here on QuantNet. I also think this might give any students here a sense of what the “mess” looks like in real-world production systems. Thanks in advance for taking the time to read my post.
So, we have a system that relies on numerous upstream data sources—asset reference data, FI product risk sensitivities calibrated by other desks, external portfolio positions, equity volatility surfaces, index time series, and even some vendor APIs—and our system in turn feeds multiple downstream consumers (reporting infrastructure, brokerage desk, operations, trading desk, etc.). Because of these many dependencies, maintaining and supporting the system is challenging.
From time to time (actually quite frequently), we encounter issues with data accuracy, user configuration, or position handling. These problems can easily affect our system’s outputs, and investigating them is often quite challenging. When a dispute arises—whether from a desk internally or a client externally—we have to use our logs and cached intermediate data, combined with the dispute’s context, to deduce the root cause. In general, our engineering team’s investigation leads to multiple possible hypotheses, and we then need to confirm which one is correct by replicating the issue in our UAT environment.
Unfortunately, our UAT environment often lacks production-level data quality, so replicating issues can be cumbersome. To make it work, I frequently coordinate with the Quality Assurance team to synchronize data across environments. I’ve also built data migration tools to help streamline this (a minimal sketch of that kind of tool is below), but the entire process still takes considerable time—especially when clients or business desks consider every issue “urgent” or “super urgent.”
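For anyone curious what such migration tooling looks like, here is a minimal sketch in Python, assuming both environments run MongoDB. The connection strings, the `risk` database, the `business_date` field, and the collection names are all hypothetical stand-ins, not our actual schema:

```python
# Minimal sketch: copy one business date's slice of a collection from
# prod to UAT so an issue can be replayed there. Real tooling would
# also have to handle auth, batching, and entitlement/PII rules.
from pymongo import MongoClient

PROD_URI = "mongodb://prod-host:27017"  # hypothetical
UAT_URI = "mongodb://uat-host:27017"    # hypothetical

def sync_collection(coll_name: str, business_date: str) -> int:
    prod = MongoClient(PROD_URI)["risk"][coll_name]
    uat = MongoClient(UAT_URI)["risk"][coll_name]

    docs = list(prod.find({"business_date": business_date}))
    if not docs:
        return 0

    # Delete the same slice in UAT first, so reruns are idempotent.
    uat.delete_many({"business_date": business_date})
    uat.insert_many(docs)
    return len(docs)

if __name__ == "__main__":
    n = sync_collection("intermediate_results", "2024-03-15")
    print(f"copied {n} documents to UAT")
```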
To tackle these challenges, we have already put extra effort into several directions. A few of us engineers have instrumented the system to persist as much intermediate calculation data as possible in MongoDB. We’ve also created a variety of Python scripts to analyze that intermediate data, so we have a more detailed but still structured view of things for troubleshooting. Moreover, I’m designing a data override system that allows users to rerun calculations with their own corrected data, buying us time to fix upstream data problems. Rough sketches of both ideas follow.
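The intermediate-data capture is roughly the pattern below. To be clear, the `calc_steps` collection and its fields are invented for this post, not our production schema:

```python
# Illustrative only: persist each calculation step's inputs/outputs so
# a dispute can be traced without rerunning the whole batch. Names are
# invented for this post.
from datetime import datetime, timezone
from pymongo import MongoClient

steps = MongoClient("mongodb://localhost:27017")["risk"]["calc_steps"]

def record_step(run_id: str, step: str, inputs: dict, output) -> None:
    steps.insert_one({
        "run_id": run_id,
        "step": step,
        "inputs": inputs,
        "output": output,
        "ts": datetime.now(timezone.utc),
    })

# A troubleshooting script can later pull every step of a disputed run
# and diff it against a known-good run:
def dump_run(run_id: str):
    return list(steps.find({"run_id": run_id}).sort("ts", 1))
```

And the override idea, again with every name (`overrides`, `upstream_data`, `get_input`) made up for illustration: before a calculation reads an upstream value, it checks whether the user supplied a corrected one for that rerun.

```python
# Rough sketch of a user-level data override: a correction scoped to a
# specific rerun wins over the (possibly bad) upstream feed. All names
# here are illustrative.
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["risk"]

def get_input(key: str, business_date: str, run_id: str) -> float:
    # 1. A user-supplied correction for this rerun takes priority.
    override = db["overrides"].find_one(
        {"key": key, "business_date": business_date, "run_id": run_id}
    )
    if override is not None:
        return override["value"]
    # 2. Otherwise fall back to the upstream feed as-is.
    upstream = db["upstream_data"].find_one(
        {"key": key, "business_date": business_date}
    )
    if upstream is None:
        raise KeyError(f"no upstream value for {key} on {business_date}")
    return upstream["value"]
```

The property I care about here is that overrides never mutate the upstream data itself, so once the feed is fixed, deleting the override (or starting a fresh run_id) restores normal behavior.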
In the meantime, bureaucracy and office politics can come in and make the whole process harder... but that's not the topic of this post.