As we continue to hire more engineers at Red Canary, we’ve found that candidates come into interviews with wildly differing expectations about the technology and tools we use. As a consequence, I find myself frequently answering some form of the following questions:
- What’s your stack?
- What technologies do you use?
- What kind of tools are your engineers using to be productive?
I completely understand why engineers want to know these things before they decide to join. In this blog, I’m going to explain what I think future Canaries should know about our tech stack before joining, in the hope that it will help prospective Canaries better understand how we work here.
Where and how we code
All of our source code lives in GitHub, in private repositories under the redcanaryco organization. This makes it easy to quickly search for something or find historical context, often without having to check out the repositories locally.
A growing percentage of our engineers use VS Code, but I see several alternatives in use as well: Vim, RubyMine, and a few others. We track our work in the Atlassian product suite, using Jira and Confluence in a fairly typical way with Epics, Stories, Bugs, Sprints, Planning Boards, and Dashboards. Our support team uses Zendesk, which we cross-link with Jira when needed.
While we review pull requests in GitHub, we get a quick feedback loop from our continuous integration/deployment pipeline (managed by CircleCI), which builds our app into container images, runs our extensive test suite, and automatically deploys the code into our Kubernetes clusters.
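For readers unfamiliar with CircleCI, a build-and-test pipeline like the one described above is declared in a `.circleci/config.yml` file. This is a minimal illustrative sketch, not our actual configuration; the job names, image tag, and commands are assumptions:

```yaml
# Illustrative CircleCI 2.1 config: build the app, run the test suite.
version: 2.1
jobs:
  build_and_test:
    docker:
      - image: cimg/ruby:3.2   # example image; real pipelines pin their own
    steps:
      - checkout
      - run: bundle install
      - run: bundle exec rspec  # the "extensive test suite" step
workflows:
  pipeline:
    jobs:
      - build_and_test
```

A real pipeline would add jobs for building container images and triggering the Kubernetes deployment after tests pass.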
Programming languages, frameworks, and libraries
The majority of our data processing engine is written in Ruby and horizontally scales extremely well. The basic building blocks are described in a detailed walkthrough that our CEO Brian Beyer posted on YouTube a few years ago. These small, independent Ruby components are deployed as containers that we manage using Kubernetes. The components rely on a battle-tested pattern of using Amazon Simple Storage Service (S3), Simple Notification Service (SNS), and Simple Queue Service (SQS) to reliably and cost-efficiently move the data through the pipeline.
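The S3 + SNS/SQS pattern above boils down to: the queue message carries a pointer, the large telemetry payload lives in S3, and the message is deleted only after processing succeeds. Here's a minimal Ruby sketch of that shape; the class and method names are illustrative, not Red Canary's actual code, and in production the injected collaborators would wrap the AWS SDK clients:

```ruby
# Sketch of an SQS-driven pipeline worker. The queue message holds only
# an S3 key; the real payload is fetched from object storage.
class TelemetryWorker
  def initialize(queue:, store:)
    @queue = queue  # e.g. a thin wrapper around Aws::SQS::Client
    @store = store  # e.g. a thin wrapper around Aws::S3::Client
  end

  # Pull one message, process it, and ack (delete) only on success,
  # so a crash mid-processing lets SQS redeliver the message.
  def poll_once
    message = @queue.receive or return nil
    payload = @store.fetch(message[:s3_key])
    result  = process(payload)
    @queue.delete(message)
    result
  end

  def process(payload)
    payload.upcase  # stand-in for real detection logic
  end
end
```

Deleting the message only after `process` returns is what makes the pattern reliable: unacked messages reappear on the queue after the visibility timeout.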
The centerpiece of our product (the portal that customers access to review and respond to confirmed threat detections, read intelligence reporting, and more) is written in Ruby on Rails. We use Sidekiq for efficient background processing, the HTML Abstraction Markup Language (Haml) templating engine to keep our front end clean, and React to keep the portal easy to use and highly responsive.
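A Sidekiq job is just a plain Ruby class that includes `Sidekiq::Worker`, which is part of what makes background processing in Rails apps so pleasant. The sketch below is purely illustrative (the job name and payload are made up, not our actual code), and it defines a tiny stand-in for the `Sidekiq::Worker` mixin so it runs without the gem; in a real app you would `require 'sidekiq'` instead:

```ruby
# Stand-in for Sidekiq so this sketch is self-contained. Real Sidekiq
# serializes the arguments and enqueues the job to Redis; this version
# just runs the job inline.
module Sidekiq
  module Worker
    def self.included(base)
      base.extend(ClassMethods)
    end

    module ClassMethods
      def perform_async(*args)
        new.perform(*args)
      end
    end
  end
end

# Hypothetical background job: generate a report outside the web request.
class ThreatReportJob
  include Sidekiq::Worker

  def perform(detection_id)
    "generated report for detection #{detection_id}"
  end
end
```

In a Rails controller you'd call `ThreatReportJob.perform_async(detection.id)` and return to the user immediately, while a Sidekiq worker process picks the job up from Redis.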
A few of the data processing components we maintain are written in Go. We’re optimistic and curious about using Go more in our production environment, but we haven’t made the decision to abandon Ruby. There is an interesting trade-off between raw performance on one side and, on the other, a years-old, battle-tested codebase that’s easy to read and requires very little maintenance.
While we integrate with all the leading endpoint detection and response (EDR) providers such as VMware Carbon Black, Microsoft, CrowdStrike, SentinelOne, and others, we have our own industry-leading Linux sensor providing us with telemetry from the endpoints hosted in our customers’ data centers. This Linux EDR sensor is written in Rust, which has been a great programming language choice for us. It’s a high-quality component with optimized CPU and RAM usage that performs some fairly advanced telemetry sourcing via eBPF.
Our engine components and architecture
The majority of our production deployments are handled within several Kubernetes clusters running in Amazon Web Services Elastic Kubernetes Service (EKS) and (more recently) Azure Kubernetes Service (AKS). We use Argo CD to simplify the management of these deployments across several clusters and to provide a single pane of glass into what’s running where.
Most of our code runs in containers in Kubernetes deployments (as noted above), where the required number of pods is scaled up and down based primarily on the amount of incoming telemetry at any given time. To give you an idea of scope, we ingest as much as a petabyte of telemetry every day. In most cases, all this code executes on AWS Elastic Compute Cloud (EC2) spot instances, providing considerable cost savings to us at scale. We’ve recently seen great results with Kubernetes Event-driven Autoscaling (KEDA). Given the thousands of pods we run at once across our deployments, small fine-tuning improvements add up quickly. For example, a recent switch from x86 to ARM-based Graviton instances helped us improve our cost profile considerably.
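With KEDA, scaling on incoming telemetry rather than CPU usage is expressed as a `ScaledObject` that watches a queue. The manifest below is a rough sketch of the idea; the names, queue URL, and thresholds are purely illustrative, not our actual configuration:

```yaml
# Illustrative KEDA ScaledObject: scale a deployment on SQS queue depth.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: telemetry-processor
spec:
  scaleTargetRef:
    name: telemetry-processor   # the Deployment to scale
  minReplicaCount: 2
  maxReplicaCount: 100
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/telemetry-queue
        queueLength: "100"      # target messages per replica
        awsRegion: us-east-1
```

The appeal over a plain Horizontal Pod Autoscaler is that pod count tracks the work actually queued up, so capacity follows telemetry volume directly.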
Management, measurement, and logging
The infrastructure we use in AWS and Azure is built and managed using Terraform Cloud. This helps us avoid configuration drift and easily modify and extend the infrastructure we need.
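As a flavor of what that looks like, here is a heavily simplified HCL sketch: the organization, workspace, and resource names are made up for illustration, and a real configuration would define the referenced IAM role, variables, and networking:

```hcl
# Illustrative only: state lives in Terraform Cloud, and clusters are
# declared as code rather than clicked together in a console.
terraform {
  cloud {
    organization = "example-org"
    workspaces {
      name = "eks-clusters"
    }
  }
}

resource "aws_eks_cluster" "processing" {
  name     = "telemetry-processing"
  role_arn = aws_iam_role.eks.arn   # assumed to be defined elsewhere

  vpc_config {
    subnet_ids = var.private_subnet_ids
  }
}
```

Because the desired state is in version-controlled code and applied from one place, drift between what's declared and what's actually running is easy to detect and correct.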
Many of our data-driven decisions are based on the extensive metrics we collect via Prometheus and evaluate in Grafana dashboards. We use these dashboards to observe application behavior as well as for our extensive data processing cost analysis. To track and improve application performance, we use New Relic.
Our application logs go into a centralized, cloud-based Splunk deployment, while unexpected errors are sent to Sentry along with the respective stack traces and other relevant details, streamlining our bug and error triage considerably.
We don’t assume every engineer we hire comes with a major security background, but working across teams with so many smart security people makes it all but inevitable that some level of infosec awareness rubs off before you know it. It’s exciting to see people learn so much so quickly.
I hope this high-level overview was helpful. If this sounds like your speed, we’d love to talk to you.