
Azure Cost Optimization: How we saved 30% for a SaaS

Unnati Mishra



Introduction

Cloud computing is now the backbone of many businesses, offering scalability, flexibility, and room for innovation. However, as organizations increasingly rely on cloud services, managing costs effectively has emerged as a significant challenge. According to the Flexera State of the Cloud report, organizations waste around 27% of their IaaS/PaaS spend and 24% of the associated software licensing spend in the cloud.

At CloudRaft, we recently partnered with an AI-enabled IT Service Management (ITSM) company facing exactly this issue. Through strategic optimizations and best practices, we reduced their Azure cloud costs by over 30%. In this blog post, we'll share the Azure cost management strategies we implemented and how you can apply them to optimize your own cloud spending.

How we optimized Microsoft Azure cloud costs

Our team implemented a multi-faceted approach to achieve significant Azure cloud savings for our client. We started with a detailed assessment of the services in use and identified cost-saving opportunities while balancing the performance and reliability of the system. The stack consists of Azure Kubernetes Service (AKS), managed databases such as PostgreSQL and Redis, Log Analytics, Virtual Machines, and various other Azure managed services. It's the typical stack of most SaaS companies today, so the learnings from this article will likely apply to you as well.

If you are looking for a free assessment, don't hesitate to contact us. Our team of experts will be happy to help you.

Now, let's dive into each of these strategies in detail:

Using bigger, latest-generation machines

One of the first changes we made was optimizing the Azure Kubernetes Service (AKS) node size for the various node pools. We found that several AKS-managed pods running in the kube-system namespace were consuming a large share of resources on nodes that had only 2 vCPUs and 8 GB of RAM; these pods accounted for 30 to 40% of the total CPU requests on each node. Services such as the Azure Monitor agent, the CNI plugin, and the CSI drivers are essential for the functioning of the cluster, so they can't be removed. Since they run on every node, it made sense to move to larger nodes and amortize this fixed overhead across more usable capacity.

We also noticed that the newer v5-generation machines (D8as_v5) offered a better cost-to-performance ratio, so we upgraded the node pools to the latest generation. This further reduced overall compute consumption on the cluster, letting us cut down the total amount of CPU and RAM allocated to it. We recommend using the Azure pricing calculator to identify the best instance size for you and, depending on your workload profile, selecting the right machine type.

This change alone accounted for a 10-15% reduction in cloud costs, thanks to reduced wastage and improved performance.
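
If you want to quantify this per-node overhead yourself, a Log Analytics query against the container insights tables (KubePodInventory and Perf) is a reasonable starting point. The sketch below assumes container insights is enabled on the cluster and uses its default schema; the InstanceName/ContainerName formats noted in the comments are assumptions worth verifying in your workspace.

```kusto
// Rough sketch: average CPU used by kube-system containers over the last day.
// Assumes container insights is enabled and its default schema is in place.
let kubeSystemContainers = KubePodInventory
    | where TimeGenerated > ago(1d)
    | where Namespace == "kube-system"
    // KubePodInventory.ContainerName is formatted as <podUid>/<containerName>
    | distinct ClusterId, ContainerName;
Perf
| where TimeGenerated > ago(1d)
| where ObjectName == "K8SContainer" and CounterName == "cpuUsageNanoCores"
// Perf.InstanceName is formatted as <clusterId>/<podUid>/<containerName>
| join kind=inner (
    kubeSystemContainers
    | extend InstanceName = strcat(ClusterId, "/", ContainerName)
  ) on InstanceName
| summarize AvgCpuMillicores = avg(CounterValue) / 1000000 by ContainerName
| order by AvgCpuMillicores desc
```

Comparing this system overhead against a node's allocatable CPU makes the case for larger nodes concrete: the same fixed cost is spread over far more schedulable capacity.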

Optimizing Azure Log Analytics Cost

Azure's Log Analytics service is an essential part of monitoring and debugging applications, but it can quickly become one of the costliest aspects of cloud management if not configured properly. Excessive logging in the services, production instances running at debug log level, and unstructured log data were all driving up Log Analytics costs.

We did a thorough analysis of the volume of logs generated by each AKS cluster and then analyzed the log type and log size for each resource group using KQL queries like this:

```kusto
union *
| summarize LogSizeBytes = sum(_BilledSize) by Name, ResourceId
| extend LogSizeMB = round(LogSizeBytes / 1024.0 / 1024.0, 2)
| extend ResourceIdParts = split(ResourceId, '/')
| extend ClusterName = tostring(ResourceIdParts[8])
| extend ResourceGroup = tostring(ResourceIdParts[4])
| extend ContainerName = tostring(ResourceIdParts[10])
| project Name, ClusterName, ResourceGroup, ContainerName, LogSizeMB, ResourceId
| order by LogSizeMB desc
```

We then cut back logging in the clusters that were generating the highest volumes, saving on both storage and Log Analytics ingestion fees. Switching to structured logs also made the data easier to process and analyze, lowering overall costs. Our observability service uses open-source tooling and reduces these costs significantly.
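
Beyond the per-resource breakdown above, a workspace-level view of billable ingestion per table is also useful for deciding where to trim. This is a standard query against the Usage table (Quantity is reported in MB):

```kusto
// Billable data ingested per table (DataType) over the last 30 days.
Usage
| where TimeGenerated > ago(30d)
| where IsBillable == true
| summarize IngestedGB = sum(Quantity) / 1024.0 by DataType
| order by IngestedGB desc
```

Tables such as ContainerLog or AzureDiagnostics near the top of this list usually point to the noisiest sources worth tuning first.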

Reservations for VMs and Databases

Azure Reservations help you save money by committing to one-year or three-year plans for a range of products. In exchange for the commitment, you get a discount on the resources you use: reservations can cut resource costs by up to 72% compared to pay-as-you-go prices. The reservation is purely a billing discount and doesn't affect the runtime state of your resources; once purchased, the discount automatically applies to matching resources.

We implemented reservations for all VMs, including the Kubernetes nodes, as well as the PostgreSQL instances. By reserving VMs and databases with consistent usage, we locked in the discounted pricing and optimized Azure VM costs.

Spot Node Pool for Kubernetes

Azure Spot Virtual Machines can be purchased at a much lower price in exchange for the risk that they may be evicted when Azure needs the capacity back. They are a good fit for non-critical workloads that can tolerate interruption, offering significant cost savings. We implemented spot node pools for the Kubernetes clusters.

Shifting non-essential workloads to a spot node pool reduced compute costs without impacting key operations and gave us the flexibility to run cost-sensitive workloads at a lower rate. We segmented the workloads according to our quality-of-service requirements and scheduled them on the spot pool accordingly.

Migrating from Elasticsearch to Loki

Our client was using Elasticsearch for log monitoring. It was overkill for the company's log aggregation needs, consuming too many resources and making the setup more expensive and less efficient than it needed to be.

Switching to Loki, which is designed to handle logs more efficiently, solved these issues. Loki takes a simpler approach: it indexes only the labels attached to log streams rather than the full log content, which means it needs far less memory and CPU. It also stores logs more compactly, reducing the storage costs that had been high with Elasticsearch. This change lowered costs while still meeting the company's log management needs.

Garbage Collection and Cleanups

Unused or idle cloud resources, such as VMs and unattached disks, can pile up and lead to unnecessary expenses. Regular cleanups are needed to remove stale resources.

To find unused resources, we reviewed each resource group, built a list of idle assets such as VMs, public IP addresses, and disks, and removed them. We also identified features that were enabled but never used, such as SFTP on storage accounts and managed Grafana and Prometheus instances for Kubernetes, all of which were adding to the cost. This proactive approach reduced unnecessary storage and compute spend.

While this manual approach is effective, it can be further enhanced through automation. Tools like Cloud Custodian enable the creation of policies that automatically identify, tag, and even remove unused resources based on predefined criteria.
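
For example, unattached managed disks can be listed across subscriptions with an Azure Resource Graph query (run it in Resource Graph Explorer in the portal). This is a rough sketch, not an exhaustive cleanup policy:

```kusto
// Unattached managed disks: candidates for snapshot-and-delete.
Resources
| where type =~ "microsoft.compute/disks"
| where tostring(properties.diskState) =~ "Unattached"
| project name, resourceGroup, skuName = tostring(sku.name), sizeGB = properties.diskSizeGB
```

A similar filter on microsoft.network/publicipaddresses with an empty ipConfiguration finds unassociated public IPs.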

Lifecycle Policies for Storage

Compliance data, while important, was not accessed frequently and didn't need to sit in expensive storage tiers. Such data can be moved to cheaper tiers automatically as it ages.

We added lifecycle policies for resources like compliance buckets. Automated lifecycle policies moved older, less-accessed data from hot to cold or archive storage, reducing storage expenses while maintaining compliance.

For example, newly generated data might be stored in hot or premium storage for quick access. But as this data ages and is accessed less frequently, the lifecycle policy automatically shifts it to cheaper storage tiers, such as cool or archive storage.

Rightsizing Virtual Machines and Databases

Often we over-provision compute to reduce the risk of performance issues and then never get around to right-sizing it. In our audit we discovered several VMs and databases that were over-provisioned, using more resources than necessary. We scaled the VMs down to more appropriate sizes and resized the databases to reflect the load they were actually handling. Right-sizing eliminated unnecessary resource usage and cut cloud expenses while maintaining performance levels. There are more opportunities if we look inside the databases and fine-tune the queries, but let's keep that for another day.
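
One way to spot over-provisioned VMs, assuming the machines send guest performance counters to a Log Analytics workspace (for example via the Azure Monitor agent), is to compare average and peak CPU per machine; a sketch:

```kusto
// Average and 95th-percentile CPU utilization per VM over the last 14 days.
// Machines sitting in the single digits are candidates for downsizing.
Perf
| where TimeGenerated > ago(14d)
| where ObjectName == "Processor" and CounterName == "% Processor Time" and InstanceName == "_Total"
| summarize AvgCpuPct = avg(CounterValue), P95CpuPct = percentile(CounterValue, 95) by Computer
| order by AvgCpuPct asc
```

The same idea applies to memory counters, and to database metrics, before committing to a resize.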

Cloud cost-saving best practices

In addition to the changes we made, we established a set of cloud cost-saving best practices to ensure we continue to operate cost-efficiently in the future.

  • Set up monitoring (cost alerts) and budgets: Azure offers tools to set budgets and create cost alerts. By monitoring spending more closely, we were able to identify wasteful areas and act on them before costs spiraled out of control. Set up daily, weekly, and monthly budget alerts, and use Azure Advisor for cost management suggestions (see the query sketch after this list).
  • Use Reservations wherever possible: As mentioned earlier, reservations for VMs and databases provide significant savings. Always assess whether your workloads can benefit from reserved instances, and combine Azure Reservations with Azure Hybrid Benefit for maximum savings.
  • Right-size Kubernetes Nodes: Always use appropriately sized nodes for your AKS clusters. Running nodes that are too large wastes resources, while nodes that are too small can result in inefficiencies due to throttling and memory contention. Implement node auto-scaling to match demand fluctuations.
  • Optimize logging and diagnostics: Not every log or diagnostic event is necessary. Be selective about what you collect, as the cost of observability can add up quickly. Use the Azure Monitor Agent (AMA) for more granular control over data collection.
  • Implement auto-scaling: Use Azure Autoscale for VMs, App Service plans, and other supported services. Design scaling rules based on performance metrics and time-based patterns.
  • Optimize storage usage: Implement Azure Blob Storage lifecycle policies so that aging data, such as compliance archives, is automatically moved to cheaper tiers.
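
As a starting point for the monitoring item above, Azure Advisor cost recommendations can be pulled at scale with an Azure Resource Graph query. This sketch assumes the standard Advisor schema exposed through the advisorresources table; verify the property paths against your tenant:

```kusto
// Azure Advisor cost recommendations across the subscriptions in scope.
advisorresources
| where type =~ "microsoft.advisor/recommendations"
| where properties.category == "Cost"
| project
    impactedResource = tostring(properties.resourceMetadata.resourceId),
    impact = tostring(properties.impact),
    problem = tostring(properties.shortDescription.problem)
```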

Conclusion

By implementing these strategies, we were able to achieve a 30% reduction in Azure cloud costs for our client. This not only resulted in significant savings but also improved overall system performance and efficiency.

Remember, cloud cost optimization is not a one-time effort but an ongoing process. By continuously monitoring, analyzing, and adjusting your cloud usage, you can ensure that you're getting the most value from your cloud investments while keeping costs under control. There are tools and SaaS services for managing costs, but they can't reason about architectural choices and aren't intelligent enough to give recommendations tailored to your specific needs.

This article is jointly written by Unnati Mishra and Ritesh Sonawane.

Engage us for a free assessment.

