
How I Automated Systems Migrations and IT Operations for a $2.5 Billion Private Equity Platform

Tags: Azure · Automation · IaC · Auth

Introduction

Despite being early in my career, I have accumulated a set of experiences that are both uncommon and high-impact. In a previous role, I served as the sole Systems Engineer for a private equity platform that sold for $2.5 billion shortly following my departure. My responsibilities spanned network architecture, security, and infrastructure—but the most defining work was designing and executing rapid migrations of legacy IT systems under tight timelines, minimal staffing, and incomplete information.

This is the story of that unique and dynamic experience, and how I used the second- and third-order effects of the changes I made to further optimize the company’s operations. This isn’t a story of well-written code. In fact, I’d be embarrassed to have some of the code I wrote attributed to my name without the context of the surrounding conditions. This is a story of pragmatic, decisive decision-making that directly impacted the operational capabilities of a multi-business platform.

1. Operating in Misaligned Systems

The first step in understanding how this process worked, and the circumstances I had to account for, is to consider what different industries prioritize. Businesses, at least successful ones, tend to refine their processes and priorities around their main profit centers. Accounting firms have strong financial data but may not run the most advanced marketing stack. A bakery likely has an excellent recipe book but pays little attention to its legal representation. Blue-collar service businesses, like the ones I was dealing with, focus on sales contacts and care very, very little about their network security, data pipelines, and backend infrastructure. Thus, I entered facing mild skepticism and even hostility: an agent of change with a vision for something greater, working for an audience with little tolerance for even minor mistakes.

Understanding my position in this situation, it was abundantly clear that I would need to be conservative with my requests, focusing on obtaining key data that I could extrapolate into more detailed structures on my own rather than relying on manual inputs from disparate office staff. The core goals of these projects were as follows:

  • Migrate all employee accounts (email, user logins, SSO) to a cloud-based IdP (Microsoft Entra in this case).
  • Migrate all endpoint management, security, and telemetry to a cloud-based solution.
  • Migrate all on-premises data and servers to the cloud.
  • Enroll all mobile devices into an MDM (Mobile Device Management) platform, and ensure users were issued logins securely and could log in ahead of time.

Additionally, I sought to audit and revamp each business’s account and permission systems. These businesses did not consistently audit their security posture and had very high turnover rates. The data as-is was wildly inaccurate, and there would never be a better time to correct it. All of this needed to be done with 1-2 weeks of asynchronous prep time and 2-3 days of actual execution, with an upper bound of 5 days for the largest businesses of ~750 employees.

2. Executing the Core Migration Process

The actual migration process can be split into three phases: endpoint bootstrapping, data preparation, and on-site execution.

The endpoint bootstrapping phase was straightforward and required minimal interaction past the first rollout. To facilitate the rollout of endpoint security, telemetry, migration scripts, and general utilities, each endpoint needed an agent installed. A local contact would install the agent for us in advance, which also let us compile a general inventory of physical assets. Most of my work went into writing the various configuration and deployment scripts that would be executed during the migration, as well as the API integrations discussed further in section 3, and I could update scripts as needed prior to each migration.
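As a flavor of the agent-side scripting, here is a minimal sketch of an inventory check-in, assuming a hypothetical internal endpoint; the URL and payload shape are illustrative, not the actual tooling:

```python
# Hypothetical inventory check-in run by the agent on each endpoint.
import json
import platform
import socket
import urllib.request

INVENTORY_URL = "https://example.internal/api/inventory"  # illustrative endpoint

def collect_inventory() -> dict:
    """Gather the basic hardware/OS facts we wanted per endpoint."""
    return {
        "hostname": socket.gethostname(),
        "os": platform.system(),
        "os_version": platform.version(),
        "machine": platform.machine(),
    }

def check_in() -> None:
    payload = json.dumps(collect_inventory()).encode("utf-8")
    req = urllib.request.Request(
        INVENTORY_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(f"Inventory accepted: HTTP {resp.status}")

if __name__ == "__main__":
    check_in()
```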

The next phase, data preparation, was also fairly simple, but probably the most critical. This was the step that let us remediate bad data, form a consistent plan for the on-site, and minimize complications during the most sensitive phase. It typically took 1 week, potentially 2 if the timing was suboptimal for the business. The process was kept as simple as possible, with the following philosophy:

  • Export a list of employees, security groups, and primary file directories.
  • If an employee is no longer present, highlight them in red. If an employee is missing, add them and highlight them in green.
  • Match every employee with their unique employee ID, provided by HR.
  • Provide a list of common file structures, let them add or remove a few if needed, and then sort people into those groups.
  • Create those structures ahead of time so staff can start moving files locally, minimizing confusion at the final cutover.

These companies did not have very complex data, but they did need a strong cleanup and clear guidelines to ensure financial and legal information was isolated from standard employees. They would never have reorganized this content on their own, and doing it preemptively eliminated a lot of meaningless toil managing permissions, as well as the panic when they realized a random employee had access to management information they weren’t supposed to see. The information we actually needed fit onto a couple of pages, and by having them fill it out and validate it, they felt in control. Once all of the data was validated, a copy would be moved to the cloud over the weekend, preventing any network slowdown during business hours and guaranteeing the delta sync was small during the final cutover. All user-specific information, logins, etc., would be sent out using expiring links.
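The reconciliation on my side was deliberately low-tech. A minimal sketch of the diff step in Python, assuming two CSV exports whose file names and column headers are illustrative:

```python
# Compare the HR roster (source of truth) against the legacy directory export.
import csv

def load_roster(path: str, key: str) -> dict:
    with open(path, newline="") as f:
        return {row[key]: row for row in csv.DictReader(f)}

hr = load_roster("hr_export.csv", "employee_id")          # HR system export
directory = load_roster("ad_export.csv", "employee_id")   # legacy directory export

# Accounts with no matching HR record: flag red (likely departed).
stale = sorted(set(directory) - set(hr))
# HR records with no account: flag green (needs provisioning).
missing = sorted(set(hr) - set(directory))

print("Flag red (no HR match):", stale)
print("Flag green (no account):", missing)
```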

The final phase was the actual execution of the migration. I, along with the VP, would go on-site for 2-3 days. The business would typically have one person available to assist with user instructions, so it was a small, focused operation. While the core migration scripts could all run in the background overnight, they could not migrate encrypted user data. Unfortunately, this included browser login information, the loss of which would greatly impact the business. So, while on-site, each user typically had to be reminded to back up their passwords before the scripts ran in the background, converting their local machine or domain accounts to Entra-backed logins. This took 10-20 minutes per user and often meant juggling multiple people at once: they needed assistance logging into new accounts, configuring MFA, and restoring credentials. There were occasionally edge cases to debug, which were my primary focus since I had designed the core migration process and written all of the scripts. There were also, inevitably, last-minute changes the businesses wanted, and I was the one in charge of making those modifications and ensuring that any data transfers, permissions, and syncs properly processed the new information.

There were even a few instances where a business-critical application was left completely unmentioned beforehand, requiring a last-minute strategy to decouple authentication from local services, move the application to the cloud if possible, and ensure that network configurations were updated for the server’s new home. For example:

It was the final day of a cutover. Everything had been going smoothly and was set to finish moving that night. Suddenly, the payroll person said, “my Excel doesn’t work.” I took a look and discovered she was talking about a strange macro that linked to a SQL Server instance running on a Windows VM in the company’s old Azure account. Nobody had the login; it had been built by a one-off contractor two states away. Authentication was tied to her old account, which was being retired, and this macro was used to process the company’s payroll. Nobody during any preparatory period had thought to mention it, even when we asked about all of their tooling. To top it all off, I had about 16 hours before my flight home. Fortunately, the solution was manageable and the userbase was small. I was able to copy the VM and replicate its configuration in our own Azure tenant, and eventually contacted the original contractor, who disconnected the logins from the old accounts and provided me with the admin credentials. The only lingering friction was the login itself: for a few days, users had to enter usernames with a randomized hash appended, a safeguard against conflicts during the migration process.
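For illustration, the temporary-username scheme amounted to something like the following; the suffix length and naming are illustrative:

```python
import secrets

def migration_username(base: str, domain: str) -> str:
    """Append a short random suffix so a migrated login can't collide
    with an account that still exists in the old tenant."""
    return f"{base}-{secrets.token_hex(3)}@{domain}"

print(migration_username("jsmith", "example.com"))  # e.g. jsmith-a41f9c@example.com
```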

A similar instance occurred at 2 a.m. on the final day of another cutover. I woke up with a gut feeling and checked to make sure the final delta copies were running smoothly. The main company file server had failed halfway through. Somebody had decided to completely reorganize the files at the last minute, which threw the delta off entirely and was causing every file to be deleted and recreated. The Microsoft data transfer tool was terrible, slow, and far more bottlenecked than the office’s upload pipe. I had to make a decision or everything would be broken when the business opened in 4 hours. I killed the job entirely; it wouldn’t finish on time. I started a fresh upload job processing a subset of the data that I knew the tool could upload in time. Then I quickly spun up a few cloud VMs and opened their firewalls to the office’s public IP. I split the remainder of the data, putting a chunk on each VM to take advantage of the office’s faster upload speed, and ran separate jobs on each VM, allowing concurrent uploads that bypassed the poor speed of the transfer tool.
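The splitting step was conceptually simple: partition the remaining top-level folders into roughly equal-sized chunks, one per VM. A minimal sketch in Python, with the share path and VM count as illustrative placeholders:

```python
import heapq
from pathlib import Path

def folder_size(path: Path) -> int:
    """Total size in bytes of all files under a folder."""
    return sum(p.stat().st_size for p in path.rglob("*") if p.is_file())

def partition(root: str, vm_count: int) -> list[list[Path]]:
    """Greedily assign top-level folders to VMs, largest first,
    so each VM gets a roughly equal number of bytes to upload."""
    sizes = {d: folder_size(d) for d in Path(root).iterdir() if d.is_dir()}
    heap = [(0, i) for i in range(vm_count)]  # (bytes assigned, vm index)
    buckets: list[list[Path]] = [[] for _ in range(vm_count)]
    for folder in sorted(sizes, key=sizes.get, reverse=True):
        assigned, i = heapq.heappop(heap)     # lightest-loaded VM so far
        buckets[i].append(folder)
        heapq.heappush(heap, (assigned + sizes[folder], i))
    return buckets

for i, bucket in enumerate(partition(r"\\fileserver\share", 3)):
    print(f"VM {i}: {[b.name for b in bucket]}")
```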

Keeping every component of this process stable, for hundreds of people, in a few days, required pragmatic thinking, constant context switching, and fast-paced decision making. Ultimately, while I dealt with some serious complications, they were never noticed by the business. Nearly every operation went smoothly, with approximately 15 businesses migrated over the course of a year.

3. Leveraging Identity as a System Backbone

I previously mentioned that during the data preparation phase I ensured every employee was matched with a unique employee ID taken from the HR system. While many initially believed this was a pedantic guarantee of accuracy, it was actually the fundamental link needed for a greater vision of automation and system integration. With Entra as the company’s primary identity platform, part of my job was also configuring Single Sign-On and SCIM provisioning for a variety of SaaS applications. SSO-only logins guaranteed uniqueness within those platforms, and the HR identifier provided the same guarantee between HR data and Entra, enabling HR information to percolate to downstream systems. Critical pieces of data included the following:

  • Business Units
  • Payroll Units/Cost Centers
  • Org Chart Hierarchy
  • Employment Status
  • Hire/Termination Dates

Populating these fields provided immense value for security, business operations, and financial savings.
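As a concrete illustration, pushing an HR record onto its matching Entra user reduces to a Microsoft Graph PATCH. This is a sketch: it assumes a token with User.ReadWrite.All is already in hand, and the hr_record field names are illustrative:

```python
import requests

GRAPH = "https://graph.microsoft.com/v1.0"

def sync_user(token: str, entra_user_id: str, hr_record: dict) -> None:
    """Patch HR-sourced attributes onto the matching Entra user."""
    body = {
        "employeeId": hr_record["employee_id"],
        "department": hr_record["business_unit"],
        "companyName": hr_record["company"],
        "jobTitle": hr_record["title"],
    }
    resp = requests.patch(
        f"{GRAPH}/users/{entra_user_id}",
        headers={"Authorization": f"Bearer {token}"},
        json=body,
        timeout=30,
    )
    resp.raise_for_status()  # Graph returns 204 No Content on success
```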

One key example was mobile devices and the associated phone bills. Approximately 2,000 employees had mobile devices, with about half carrying 2, for a total of roughly 3,000 service lines. Cell service was about $50/mo per line, so the expected combined bill was around $150,000 per month. Additionally, these were blue-collar employees with high turnover rates. If phones were returned and reissued, contact information would rot and inconvenience the operations staff. If they weren’t, the phone bill would quickly climb, with tens of thousands of dollars going toward unused lines in the possession of ex-employees, compounding each month. With a bit of Python, a few API calls to the MDM, and the backing of the HR dataflow, these problems were largely eliminated. Mobile devices were associated directly with an employee via the SCIM integration, which meant their contact information could automatically be linked back to Entra and kept up to date for business users. Even more importantly, this connection meant that phone lines could be tied directly to an employee, and thus to their employment status and payroll units. If someone left the company, we could guarantee the line was reissued or terminated almost immediately, killing the unintended growth of the bill. Each month, the massive 3,000+ line phone bill was no longer a giant mess of mislabelled cost centers: every line was tied to the employee who last used it, giving the finance department a clear breakdown of how much each company and business unit was spending on cellular service, automatically updated to account for edge cases such as inter-company transfers. For a large, disparate private equity platform, this information was near impossible to connect otherwise.
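The attribution logic itself was simple once the identifiers lined up. A minimal sketch, with illustrative data shapes (each line record carries the employee ID the MDM reported):

```python
from collections import defaultdict

LINE_COST = 50.00  # illustrative flat monthly cost per service line

def reconcile(lines: list[dict], employees: dict[str, dict]):
    """Split lines into (lines to reissue/terminate, monthly cost by business unit)."""
    to_terminate = []
    cost_by_unit: dict[str, float] = defaultdict(float)
    for line in lines:
        emp = employees.get(line["employee_id"])
        if emp is None or emp["status"] != "active":
            to_terminate.append(line["number"])  # last holder is gone: reissue or cancel
        else:
            cost_by_unit[emp["business_unit"]] += LINE_COST
    return to_terminate, dict(cost_by_unit)

lines = [{"number": "555-0100", "employee_id": "E123"}]
employees = {"E123": {"status": "active", "business_unit": "HVAC Ops"}}
print(reconcile(lines, employees))  # ([], {'HVAC Ops': 50.0})
```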

This same approach was applied to all SaaS bills, allowing reliable breakdowns, cancellation of unused licenses, and enrichment of whatever data proved useful. It was also linked to additional asset inventories: if an employee left the company, IT could easily identify their laptop, collect it, and ensure it was reissued, reducing unnecessary purchases. Even if HR forgot to document an employee’s departure or onboarding, IT would get an automated report, verify with them, and adjust access accordingly. The enriched data also fed uniform communication channels and mailing lists, which was crucial for HR training, location-specific information, and general coordination across the PE platform. Access to applications could be assigned by logical groupings instead of tedious, recurring one-offs.
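The automated report mentioned above reduces to a small diff between the two systems of record. A sketch, with illustrative record shapes:

```python
def discrepancies(entra_users: list[dict], hr: dict[str, dict]) -> list[str]:
    """Flag identity records that HR and IT disagree on, for manual review."""
    flagged = []
    for user in entra_users:
        record = hr.get(user["employee_id"])
        if record is None:
            flagged.append(f"{user['upn']}: active account with no HR record")
        elif record["status"] == "terminated" and user["account_enabled"]:
            flagged.append(f"{user['upn']}: terminated in HR but still enabled")
    return flagged
```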

It also enabled a more robust security posture. Using Cloudflare Zero Trust, access policies could incorporate multiple dimensions simultaneously:

  • Identity (Entra user)
  • Device state (managed, EDR-compliant)
  • Organizational context (business unit, role)

This level of policy granularity was only possible because identity, device state, and org metadata were unified.
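To make that concrete without reproducing Cloudflare’s actual policy schema, here is an illustrative sketch of the same multi-dimensional check expressed as plain Python; the group and unit names are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class AccessRequest:
    user_group: str        # Entra group membership
    device_managed: bool   # MDM enrollment state
    edr_compliant: bool    # EDR posture check
    business_unit: str     # HR-fed org metadata

def allow_finance_app(req: AccessRequest) -> bool:
    """Allow only managed, EDR-compliant devices used by Finance staff."""
    return (
        req.user_group == "finance-users"
        and req.device_managed
        and req.edr_compliant
        and req.business_unit == "Finance"
    )
```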

4. DevOps, Terraform, and Cloud Native Strategies

One of the final improvements I worked on before leaving this organization was the introduction of a more DevOps-oriented approach to infrastructure. While a large part of this position was modernizing older systems and moving them to SaaS and cloud-based solutions, it’s worth understanding that “moving to the cloud” means different things in different organizations. To those of us familiar with the cloud-native era, the cloud is about PaaS convenience, elastic scalability, stateless services, pay-as-you-go pricing, and IaC. For many in the business world, the cloud is about transitioning CapEx to OpEx. There is little regard for architectural decisions, so they recreate their on-premises infrastructure in the cloud, losing many of the benefits and paying excessive fees to keep servers running 24/7 that could sleep for most of the day. If you’ve taken anything away from this discussion so far, I’d imagine you realize I heavily dislike inefficiency, and am thus opposed to this approach unless there are no other options.

Many of the automations described in section 3 required additional data enrichment or a place to generate reports. For enrichment, overrides, and shared mappings, I used a single Azure SQL database; most of the data could be queried directly via API with sufficient performance, but this granted extra flexibility at low cost and complexity. For the runtime, I went with Azure Functions, which provided a cheap, isolated serverless environment. At the frequency most of these jobs ran, Azure Functions cost pennies each month - far preferable to any long-running service, and without any server maintenance overhead. Because these functions all operated similarly, I made templates and deployed them with Terraform. I’m a big fan of Terraform for many reasons, including its ease of use, reliability, and declarative syntax. I was able to leave good documentation on my departure in large part because Terraform made managing these resources consistent, and declarative syntax inherently conveys a certain amount of the author’s intent.
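Most of these jobs were little more than a timer-triggered function. A minimal sketch using the Azure Functions Python v2 programming model; the schedule and job body are illustrative:

```python
import logging
import azure.functions as func

app = func.FunctionApp()

@app.timer_trigger(schedule="0 0 6 * * *", arg_name="timer")  # daily at 06:00 UTC
def nightly_sync(timer: func.TimerRequest) -> None:
    """Pull the HR export, enrich it, and push updates downstream."""
    logging.info("Running nightly HR-to-downstream sync")
    # ...query the HR export, apply overrides from Azure SQL, call downstream APIs...
```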

However, one incident I found rather humorous was when the company engaged a third-party developer to build a Python app for some weather reporting. Upon initial completion, we were provided with some Python code and instructions on how to schedule it to run on a Windows VM. This raised a few red flags, so I inspected the code and, unsurprisingly, found hardcoded API keys. I restructured his code, containerized it with Docker, moved the API keys into an Azure Key Vault integration, and added a toggle to load them from an env file for local development. I wrote a Terraform config to deploy it via Azure Container Instances and showed him how to run it in Docker. It struck a great balance between efficiency and productivity, with him focusing on his application code with clean mappings to my infrastructure. He was quite happy, and seemed to appreciate the time I took to improve the development experience.
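The secret-loading toggle followed a common pattern: pull from Key Vault in production, fall back to a local .env file for development. A minimal sketch, with a hypothetical vault URL and secret name:

```python
import os

def get_secret(name: str) -> str:
    """Fetch a secret by its Key Vault name (e.g. 'weather-api-key').
    Locally, fall back to the equivalent env var (WEATHER_API_KEY)."""
    if os.getenv("USE_LOCAL_ENV") == "1":
        from dotenv import load_dotenv  # pip install python-dotenv
        load_dotenv()
        return os.environ[name.replace("-", "_").upper()]
    from azure.identity import DefaultAzureCredential
    from azure.keyvault.secrets import SecretClient
    client = SecretClient(
        vault_url="https://example-vault.vault.azure.net",  # hypothetical vault
        credential=DefaultAzureCredential(),
    )
    return client.get_secret(name).value

api_key = get_secret("weather-api-key")
```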

A few months later, some minor updates were needed to the formatting of the spreadsheet, so he was contacted again and provided the update shortly after. I was simultaneously impressed and dismayed when I received a link to a new repo he had made, which now contained a full Docker Compose file orchestrating 3 containers over AMQP for this simple Python script (with the .env file baked into the container image, but I digress). My poor minimalist deployment pipeline was not ready for this distributed excess! He was clearly improving and learning a lot, but this project probably wasn’t the place to do it. After the confusing process of explaining to my boss why a minor change in business logic had completely changed the architecture requirements, I spoke with the developer - he understood, and simplified it back down to a single container.

While the infrastructure of this business was not particularly complex, it’s important to remember that complexity is not always needed. Additional functionality, tooling, and abstraction can be very powerful, but they need to be applied in the right places. As I stated in the introduction, much of the code I wrote here was nothing fancy, but it had disproportionate business impact at very low cost. Sure, many of these API connections could have been wired into an elaborate caching mechanism or an event broker and seen faster response times on a technical level. However, that would have increased cost, taken development time I didn’t have, and provided zero benefit to the business. It’s fun to host a blog on Kubernetes, but it’s wise to do so on your own time!

Conclusion

I started this piece by saying it isn’t a story about well-written code. I stand by that. Pragmatism involves constraints and tradeoffs. The best decision in one context is a poor decision in another. Knowing when to automate and when to ask, when to over-prepare and when to improvise, when a single Azure Function was the right answer and when to talk a contractor down from three containers and a message broker - these were the experiences that showed I could make competent decisions. Fifteen businesses. Hundreds of employees each. A two-to-three-day window to move everything that kept them operational, with one engineer who designed the process, wrote the scripts, and was the one on the floor when something unexpected broke.

I’m not just proud that it worked. I’m proud that it kept working after I left. The HR dataflow kept attributing phone lines. The terminated-employee reports kept firing. The Terraform configs were documented well enough that someone else could reason about them. Nobody told me to build these things. I did it because I saw a problem and wanted to fix it, not coexist with it. I asked “how should this work?” rather than “how does this work?”, and it paid dividends. In many regards, I would not be surprised if my work helped raise the valuation of the platform. It may not have directly impacted the bottom line, but it helped grow a level of organizational maturity that would not have been present otherwise.