You’re experiencing performance issues with Azure Storage, such as high I/O latency or throttling. This can manifest in several ways:
- Increased operation latency: Simple storage operations (uploading a blob, reading a table entity, listing a directory, etc.) take much longer than usual. For example, writes to Azure Blob Storage might spike from under 10 ms to hundreds of milliseconds, or even seconds in the worst cases. Applications may log slow responses when interacting with storage.
- Throttling errors or timeouts: You observe HTTP 503 or 500 status codes indicating “Server Busy” or timeout messages when performing storage operations. In the Azure client SDKs, this often appears as a StorageException with messages about throttling or a server timeout. Essentially, Azure Storage is rate-limiting your requests due to high volume.
- Azure Monitor metrics anomalies: The storage metrics indicate issues – e.g., the gap between Success E2E Latency and Success Server Latency is large, or the Transactions metric shows many requests being throttled (visible when splitting the metric by response type). The Availability metric may also dip if many requests fail or time out.
- Slow throughput even when scaling clients: Despite adding more clients or threads, overall throughput plateaus or decreases – often a sign of hitting a performance ceiling or scalability target.
Possible Root Causes:
- Hitting Azure Storage scalability targets (throttling): Each Azure Storage account has documented scalability targets (transaction rates, bandwidth, IOPS, etc.)[54]. If your workload approaches or exceeds these targets, the service starts throttling requests to maintain stability, which increases latency for those requests[54]. For instance, a standard general-purpose account targets roughly 20,000 requests per second across the account, a single Table storage partition targets about 2,000 transactions per second, and a single blob has its own upload throughput ceiling. Exceeding these limits results in 503/throttling responses. In essence, throttling is Azure’s way of saying “slow down”. Often a traffic spike or an inefficient access pattern (such as all clients hitting the same blob or partition) triggers these limits; a sketch of how throttling surfaces in a client SDK follows this list.
- Excessive client-side latency (network latency): If end-to-end latency is high but server latency is low, the delay is likely network-related[56]. For example, an on-premises client accessing an Azure storage account in a distant region will experience network latency that dwarfs Azure’s processing time[55]. A chart might show E2E latency of ~44 ms versus server latency of ~0.1 ms; most of that ~44 ms is network transit[55]. Contributing factors include physical distance, internet routing, and possibly VPN/ExpressRoute performance. Essentially, Azure can process the request quickly, but the data takes a long time to travel.
- Improper tier or storage type for the workload: Using Standard (HDD-backed) storage for a very high I/O workload might not suffice. Premium storage provides much higher IOPS and lower latency by using SSDs. If you stay on Standard and push it hard, you’ll see higher latency and more throttling than you would with Premium for the same workload[57]. Similarly, using the Cool access tier for frequently accessed blobs can incur slightly higher latency and certainly higher access costs – it’s meant for infrequent access. The Archive tier is far slower still, requiring hours to rehydrate before data can be read. If performance is critical, data should be in the Hot tier or on Premium accounts.
- Metadata or specific operation throttling (especially Azure Files): In some cases you are not hitting overall throughput limits, but certain operations get throttled. For example, Azure Files has a limit on metadata operations (such as file open/close and directory listing) of around 12K IOPS per share[58][59]. If you perform heavy directory listings or lots of small file open/close operations, you can hit this metadata IOPS limit and see throttling even though file content reads and writes are well within limits. The symptom here typically shows up as “Success with Metadata Throttling” in the metrics[59].
- Inefficient client usage patterns: For instance, a client application with no concurrency control can slam the storage account with thousands of requests per second in bursts, overwhelming the service even when the average rate is theoretically within limits. Not using exponential backoff on retries makes throttling worse – if many clients instantly retry throttled requests, they amplify the load (a “retry storm”).
- Network infrastructure bottlenecks: If access is through an intermediate network virtual appliance (NVA) or Azure Firewall, those can become bottlenecks. For example, if you route storage traffic through a firewall that can’t handle the throughput, it will add latency. Similarly, if a VM’s NIC has a bandwidth limit (based on VM size), you might hit that, causing throughput to cap and latency to increase.
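For reference, here is a minimal sketch of how throttling typically surfaces in a client application, assuming the Python azure-storage-blob v12 SDK; the account URL, credential, container, and blob names are placeholders, not values from this scenario.

```python
# Minimal sketch: surface throttling (HTTP 503 "Server Busy", 500 timeouts,
# or 429) as an explicit signal instead of a generic failure.
from azure.core.exceptions import HttpResponseError
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient(
    account_url="https://<account>.blob.core.windows.net",  # placeholder account
    credential="<account-key-or-sas>",                      # placeholder credential
)

try:
    container = service.get_container_client("telemetry")   # placeholder container
    container.upload_blob("sample.json", b'{"ok": true}', overwrite=True)
except HttpResponseError as err:
    if err.status_code in (429, 500, 503):
        # Likely throttling or server pressure: log it and let the SDK's
        # backoff-based retry policy (or your own) slow the request rate.
        print(f"Throttled or server busy: {err.status_code} {err.reason}")
    else:
        raise
```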
Diagnostic Methods:
Use a combination of Azure Monitor and service-specific diagnostics:
- Azure Monitor Metrics (Storage): Examine Success E2E Latency vs Success Server Latency. Microsoft recommends these as key indicators for performance issues[60]. If E2E latency is much higher than server latency, the gap is network and client-side time[56]. For example, the Azure Files documentation shows an on-premises access case of ~43.9 ms E2E vs 0.17 ms server latency, whereas same-region access was ~2.09 ms E2E vs 1.92 ms server[55][61]. A big gap means a network issue; a small gap with both values large means a server-side issue. This analysis helps pinpoint whether to focus on Azure’s side or the network. A minimal metrics-query sketch follows this list.
- Check for throttling events: In the Azure Portal metrics for the storage account, select the Transactions metric and apply a split by Response Type (or status code group). Look for categories like “ServerTimeout” or “ThrottlingError”. Azure has updated how these metrics are reported, but generally an increase in 429/503 responses indicates throttling. If using Azure Files, look at the specialized metrics Success with Metadata Warning/Throttling introduced for metadata IOPS monitoring[59]. A rise in “Success with Metadata Throttling” means metadata operations were throttled (the share’s metadata capacity was exceeded)[69]. Such metrics are clear evidence of hitting limits.
- Scalability target checks: Compare your workload against the documented limits (e.g., [54] or the current docs). If you have instrumentation on the client, check how many requests per second or how much bandwidth you are driving. For example, a general-purpose v2 storage account has a target throughput of up to ~60 MB/s per blob for reads, and a single Table partition targets roughly 2,000 transactions per second. If you approach these, Azure might throttle[54]. The Transactions metric can hint at this scale (e.g., if it shows 50k requests/min consistently, break that down per second and see whether you are near the limits).
- Client-side logs and telemetry: Look at your application logs or Application Insights traces if applicable. The client SDK will often log something like “Operation was throttled. Retry after X seconds.” or surface HTTP 429 (Too Many Requests). These can corroborate that throttling is happening and possibly show which operations are throttled. If possible, identify whether it is specific operations (like ListBlobs or PutBlob) or general.
- Network diagnostics: If you suspect network latency, measure it. Use tools like ping (though ICMP may be blocked) or, better, tcping on port 443 against your storage account endpoint to measure round-trip time; a simple handshake-timing sketch also follows this list. Azure Network Watcher’s Connection Monitor can continuously test from a VM to the storage endpoint and measure latency. Also check whether a Content Delivery Network (CDN) or caching layer is missing for globally distributed users (lack of a CDN causes higher latency for far-away users).
- Cloud-side logging: Azure Storage has an option (classic Storage Analytics logging) to record every request with detailed information, including server latency, end-to-end latency, and whether it was throttled. If it is enabled (check the storage account’s diagnostic settings), you can download the logs from the $logs container. Analyzing them can be heavy, but they explicitly show whether requests were throttled (HttpStatus==503, or by comparing the ServerLatency and E2ELatency fields).
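As a companion to the metrics guidance above, the following sketch pulls the two latency metrics programmatically. It assumes the azure-monitor-query and azure-identity packages and uses a placeholder resource ID for the blob service of a storage account.

```python
# Sketch: query SuccessE2ELatency vs SuccessServerLatency for the last hour,
# so the network/server gap discussed above can be inspected outside the portal.
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient, MetricAggregationType

# Placeholder resource ID for the blob service of a storage account.
RESOURCE_ID = (
    "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/"
    "Microsoft.Storage/storageAccounts/<account>/blobServices/default"
)

client = MetricsQueryClient(DefaultAzureCredential())
result = client.query_resource(
    RESOURCE_ID,
    metric_names=["SuccessE2ELatency", "SuccessServerLatency"],
    timespan=timedelta(hours=1),
    granularity=timedelta(minutes=5),
    aggregations=[MetricAggregationType.AVERAGE],
)

# The Transactions metric can be queried the same way and split by the
# ResponseType dimension to count throttled requests.
for metric in result.metrics:
    for series in metric.timeseries:
        for point in series.data:
            if point.average is not None:
                print(f"{metric.name} {point.timestamp}: {point.average:.1f} ms")
```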
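And as a quick stand-in for tcping when it is not available, this sketch times TCP handshakes to the storage endpoint on port 443. The endpoint name is a placeholder, and the handshake time approximates one network round trip, not full request latency.

```python
# Rough network round-trip check: time repeated TCP connects to the
# storage endpoint, similar to tcping.
import socket
import statistics
import time

HOST = "<account>.blob.core.windows.net"  # placeholder storage endpoint
PORT = 443

samples = []
for _ in range(10):
    start = time.perf_counter()
    with socket.create_connection((HOST, PORT), timeout=5):
        pass  # connection established; only the handshake time matters here
    samples.append((time.perf_counter() - start) * 1000)
    time.sleep(0.2)

print(f"TCP connect latency ms: min={min(samples):.1f} "
      f"median={statistics.median(samples):.1f} max={max(samples):.1f}")
```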
Once you have the data, you might find scenarios like: high E2E latency driven primarily by the network, which suggests moving data closer to consumers or adding a CDN; or high server latency with throttle codes, which suggests hitting service limits or internal contention.
Proven Solutions:
- Upgrade to Premium or a higher-performance tier: If the workload is I/O intensive, consider moving it to a Premium storage account. For example, premium block blob storage provides lower and more consistent latency and higher throughput for blob operations[57]. Microsoft explicitly suggests using premium for “high transaction rates and low latency beyond standard targets”[57]. For Azure Files, use FileStorage (Premium) shares if you need high IOPS (they allow provisioning IOPS/throughput as needed). Premium storage also typically has higher throttling thresholds. This is often the quickest fix if budget allows, as it significantly raises the ceiling. If you cannot move to Premium, make sure you are at least on a general-purpose v2 (GPv2) account rather than the legacy GPv1, so you have access to current features, tiers, and pricing.
- Partitioning and workload distribution: Redesign how you write data to avoid hotspots:
- Blob storage: If one blob is being written by many clients, that blob can become a hotspot. Consider uploading to multiple blobs in parallel, or use block blob parallel uploads with larger block sizes (larger blocks automatically activate high-throughput mode)[62]. For example, instead of 4 MB blocks, use 8 MB or larger blocks to get better upload throughput; a client-side sketch follows this list.
- Table storage: Distribute entities across different partition keys if one partition is seeing heavy traffic. For instance, rather than sending all logs to a single “TodayLogs” partition, hash them or partition by user to spread the load[70]. Azure’s guidance suggests using random prefixes or hashing to improve load distribution across partitions[70]; see the hashing sketch after this list.
- Azure Files: If hitting the single share limits, consider using multiple shares. Perhaps partition data by department or scenario across multiple file shares, so no single share gets all the load.
- Essentially, shard your data: use multiple storage accounts or partitions so that no single unit hits its limit. Yes, this adds complexity, but it is the scale-out approach for storage. If needed, scale out across several storage accounts and, for Tables, across partition key ranges.
- Reduce network latency: Bring data closer to consumers. If most of your users or services are in a certain region, ensure the storage account is also in that Azure region. If you have a global user base, leverage Azure CDN or Azure Front Door for caching and accelerating blob delivery[65]. For on-premises or multi-region scenarios, use features like Object Replication for blob (which can asynchronously copy blobs to another storage account in a different region) so reads can happen locally in each region[64]. Also, if not already using ExpressRoute for on-prem access, that can provide more consistent latency than going over public internet.
- Follow backoff guidance on throttling: Ensure your client applications use the exponential backoff retry strategy provided by the Azure Storage SDKs[66]. Most official SDKs do this by default – they catch a 503/ServerBusy response and wait before retrying, increasing the wait each time. If you implemented custom storage calls or tweaked the retry policy, review it. If you see immediate retries in your logs, switch to exponential backoff, and set a maximum retry count to avoid infinite retry loops (see the configuration sketch after this list). This will not eliminate throttling, but it mitigates its impact, allowing the system to catch up instead of being bombarded.
- Optimize metadata and listing operations: This is specific but important for Azure Files or scenarios with lots of small operations. If you identified metadata throttling (for Azure Files), one mitigation is metadata caching on the client side (for SMB on Windows, enabling directory caching, or using Azure File Sync, which caches frequently accessed files). Microsoft recommends enabling metadata caching for SSD-based Azure file shares to reduce the load from metadata operations[68]. Another approach is to batch operations or reduce their frequency – e.g., instead of listing a directory every second, cache the results for a short time in your app (see the caching sketch after this list). If you can control the workload pattern (such as staggering scheduled tasks), do so to avoid giant bursts of metadata operations in a short period.
- Increase concurrency thoughtfully: To get better throughput, you often increase concurrent threads or clients, but beyond a point this triggers throttling, which hurts overall throughput. There is an optimal concurrency level for a given workload; use load testing to find it (a simple sweep sketch follows this list). If going from 10 to 50 threads drastically reduces per-thread performance due to throttling, scale out the storage (shards or Premium) rather than just adding threads. In other words, avoid brute-forcing past Azure’s limits; increase capacity instead (sharding or Premium).
- Bandwidth considerations: If using VMs, choose VMs with sufficient NIC throughput. For example, some smaller VMs cap at, say, 500 Mbps. If your storage workload can exceed that, the VM itself becomes a bottleneck. Upgrading to a larger VM SKU (or spreading load across VMs) might be needed to fully utilize storage performance. Also check that you’re not saturating the outbound bandwidth of your virtual network or ExpressRoute circuit.
- Alternative services for extreme scenarios: For extremely high throughput needs (millions of requests per second range or very low latency), you might consider using Azure services tailored for that:
- Azure Cosmos DB (Table API) for globally distributed, high-scale key-value access (if Azure Table is not enough).
- Azure NetApp Files for very high-performance file storage (it provides consistent low latency for file operations).
- Azure Data Explorer or Data Lake for analytics heavy read scenarios, etc. These are beyond straightforward Azure Storage, but at a certain scale (or if you need features like guaranteed sub-ms latencies), they might be relevant.
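A sketch of the larger-block, parallel upload mentioned in the Blob storage bullet above, assuming the Python azure-storage-blob v12 SDK; the account URL, credential, container, blob, and file names are placeholders.

```python
from azure.storage.blob import BlobClient

blob = BlobClient(
    account_url="https://<account>.blob.core.windows.net",  # placeholder
    container_name="uploads",                               # placeholder
    blob_name="large-file.bin",                             # placeholder
    credential="<account-key-or-sas>",
    max_block_size=8 * 1024 * 1024,       # stage 8 MiB blocks instead of the 4 MiB default
    max_single_put_size=8 * 1024 * 1024,  # anything larger is uploaded as staged blocks
)

with open("large-file.bin", "rb") as data:
    # max_concurrency uploads several blocks of the same blob in parallel.
    blob.upload_blob(data, overwrite=True, max_concurrency=8)
```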
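A sketch of the partition-key hashing idea from the Table storage bullet above; the bucket count, entity shape, and the mention of azure-data-tables are illustrative assumptions.

```python
# Spread writes across N partition buckets instead of one hot partition.
import hashlib
from datetime import datetime, timezone

BUCKETS = 32  # assumption: tune to your write rate and query pattern

def partition_key_for(user_id: str) -> str:
    """Hash a natural key into one of BUCKETS partitions to avoid a hotspot."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return f"logs-{int(digest, 16) % BUCKETS:02d}"

entity = {
    "PartitionKey": partition_key_for("user-12345"),
    "RowKey": datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%f"),
    "Message": "example log line",
}
# The entity can then be written with azure-data-tables, e.g. TableClient.create_entity(entity).
```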
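A sketch of tightening retry behaviour as described in the backoff bullet above, assuming the Python azure-storage-blob v12 SDK, whose ExponentialRetry policy backs off exponentially between attempts; the account URL and credential are placeholders.

```python
from azure.storage.blob import BlobServiceClient, ExponentialRetry

# Back off exponentially (roughly initial_backoff plus increment_base ** attempt
# seconds, with jitter) and cap retries so a throttled call cannot loop forever.
retry = ExponentialRetry(initial_backoff=1, increment_base=2, retry_total=5)

service = BlobServiceClient(
    account_url="https://<account>.blob.core.windows.net",  # placeholder
    credential="<account-key-or-sas>",
    retry_policy=retry,
)
```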
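A small client-side cache sketch for the “cache the results” suggestion in the metadata bullet above; it assumes the azure-storage-file-share package, and the TTL, cache key, and helper name are illustrative.

```python
# Reuse a directory listing for a short TTL instead of re-listing every second,
# which reduces metadata IOPS against the share.
import time
from azure.storage.fileshare import ShareDirectoryClient

_cache: dict[str, tuple[float, list[str]]] = {}
TTL_SECONDS = 30.0  # assumption: acceptable staleness for this workload

def list_directory_cached(directory: ShareDirectoryClient, cache_key: str) -> list[str]:
    """Return a recent listing from the cache instead of re-listing on every call."""
    now = time.monotonic()
    cached = _cache.get(cache_key)
    if cached and now - cached[0] < TTL_SECONDS:
        return cached[1]  # served from cache: no metadata IOPS spent
    names = [item.name for item in directory.list_directories_and_files()]
    _cache[cache_key] = (now, names)
    return names
```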
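Finally, a sketch of the load-testing idea in the concurrency bullet above: sweep a few worker counts and note where aggregate throughput stops improving. The upload_one function is a placeholder for one representative storage operation.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def upload_one(i: int) -> None:
    """Placeholder: perform one representative storage operation (e.g. a small upload)."""
    ...

def ops_per_second(concurrency: int, operations: int = 200) -> float:
    """Run a fixed number of operations at the given concurrency and return ops/sec."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(upload_one, range(operations)))
    return operations / (time.perf_counter() - start)

for workers in (5, 10, 20, 50):
    # Stop scaling up once throughput flattens or throttling errors appear.
    print(f"{workers:>3} workers -> {ops_per_second(workers):.1f} ops/sec")
```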
After implementing fixes, monitor again:
- Throttling events should drop significantly. Ideally, your metrics will show near-zero throttle errors after distributing load or upgrading the tier. For example, where you previously saw continuous throttling, you might now see a flat line, indicating no throttling.
- Latency metrics should improve. If network latency was tackled (by moving region or using a CDN), the end-to-end latency for distant users should come down to near the server latency levels[61]. If Premium storage was adopted, you might see server latency reduce (as Premium typically has faster writes).
- Overall throughput and reliability will improve. You should be able to push more transactions through without errors. Where previously throughput flat-lined due to throttling, it can now increase with load until new limits are approached.
- In application logs, you should notice far fewer timeout or server busy errors reported by the storage client libraries.
Ultimately, the storage system should operate within its intended thresholds, providing consistent performance. Users (or dependent systems) will experience faster data access and fewer interruptions.
Sources: Azure Storage performance checklist[54][57], Azure Monitor metrics analysis for Files[56][55], Microsoft Q&A on throttling guidance[59][67], Azure documentation on scaling storage and best practices[62].
