Embracing OpenTelemetry
At iSolutions, the collection of software metrics is paramount for providing exceptional customer support and adhering to our Service Level Agreements (SLAs). We gather a multitude of metrics from various systems and utilize Grafana for querying and visualization. Additionally, we maintain a proprietary alerting system that leverages some of these metrics to generate alerts.
Historically, we relied on Windows Performance Counters, which had served our initial needs for system and database metrics, to collect application-level and business metrics as well. As our requirements expanded, it became evident that we needed a more efficient and developer-friendly solution to support our growing metrics infrastructure.
Rationale for Migration
The decision to migrate from Windows Performance Counters was driven by several key factors:
- Lack of Flexibility: Windows Performance Counters were not sufficiently adaptable for business metrics.
- Frequency Limitations: They were not designed for high-frequency data writing.
- Developer Efficiency: The process of adding new metrics via Performance Counters was cumbersome, error-prone, and time-consuming, involving numerous manual steps. Developers thrive in environments where tools and systems are intuitive and efficient, a criterion that Performance Counters failed to meet.
Given these challenges, OpenTelemetry (OT) emerged as a natural choice for us.
The Migration Process
Our team developed an OpenTelemetry collector and replaced the performance counters in our code with the metric types provided by .NET. This migration presented several challenges:
- Compatibility Maintenance: It was essential to keep our existing Grafana dashboards and alerts functional. This required maintaining compatibility with the current implementation, preserving metric names, and retaining the semantics established by the Windows Performance Counters.
- Storage Considerations: Metrics were written to an SQL database, a setup that, while not ideal by some standards, had evolved naturally and functioned effectively for us.
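The two constraints above, preserving legacy metric names and continuing to write to SQL, can be illustrated with a minimal sketch. It is written in Python with the standard-library sqlite3 module purely for brevity and is not the actual implementation: the table layout, the `LEGACY_NAMES` mapping, the fallback category, and every metric name in it are hypothetical. The idea is that the exporting side translates each new instrument name back to the category and counter name that existing dashboards already query, so those queries keep working unchanged.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical mapping from new instrument names to the legacy
# (category, counter) pairs that existing dashboards and alerts query.
LEGACY_NAMES = {
    "orders.processed": ("Orders", "Processed Total"),
    "queue.length": ("Messaging", "Queue Length"),
}


def create_schema(conn):
    """Create an illustrative metrics table (assumed layout, not the real one)."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS metric_samples (
            category TEXT NOT NULL,
            counter TEXT NOT NULL,
            value REAL NOT NULL,
            recorded_at TEXT NOT NULL
        )
        """
    )


def export_sample(conn, instrument_name, value, recorded_at=None):
    """Store one sample under its legacy name when one exists."""
    # Metrics with no legacy equivalent fall back to an assumed default
    # category and keep their new name as the counter name.
    category, counter = LEGACY_NAMES.get(
        instrument_name, ("Application", instrument_name)
    )
    timestamp = (recorded_at or datetime.now(timezone.utc)).isoformat()
    conn.execute(
        "INSERT INTO metric_samples (category, counter, value, recorded_at) "
        "VALUES (?, ?, ?, ?)",
        (category, counter, float(value), timestamp),
    )
    conn.commit()


# Usage: one migrated metric and one metric that never had a legacy name.
conn = sqlite3.connect(":memory:")
create_schema(conn)
export_sample(conn, "orders.processed", 42)
export_sample(conn, "cache.hits", 7)
rows = conn.execute(
    "SELECT category, counter, value FROM metric_samples ORDER BY rowid"
).fetchall()
```

Keeping the translation in a single lookup on the exporting side means application code can use new-style names freely, while anything that reads the database continues to see the names it was built against.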
The work demanded extensive planning and code review. Even so, the migration was carried out smoothly over a few weeks without any service interruptions. We spent the following months refining the system, primarily by adding new features and automation.
Post-Migration Outcomes
Before the migration, adding a new metric involved a seven-step process, two pull requests, and a release of both the application and the monitoring services. This laborious process is now obsolete. Developers declare metrics directly in their code, and the metrics become available once the application is released, with everything else handled automatically.
The most significant benefit is that other teams have also adopted OpenTelemetry in their applications. Currently, all our systems utilize the OT standard. Additionally, we export some metrics to external providers like Dynatrace, which supports OpenTelemetry.
Conclusion
More than a year has passed since our transition to OpenTelemetry, and it has become an integral component of our platform. This integration has proven to be a prudent choice, resulting in the collection of more metrics, enhanced system integration, and a significantly improved developer experience.
Following the success of our metrics migration, we also transitioned our logging to the OT standard, yielding similarly positive results. This migration has delivered 'economies of scale,' an often undervalued benefit in our industry. We implemented the change once and utilized it universally, making metrics and logs collection a 'non-issue,' thereby allowing our developers to focus on other critical areas.
The benefits of this migration underscore the importance of investing in efficient and developer-friendly tools. At iSolutions, the adoption of OpenTelemetry has not only streamlined our operations but has also empowered our developers to deliver better solutions more effectively.