Maximizing Efficiency in Milvus Cloud Native Vector Database Through Performance Profiling
- Aju John
- Oct 14
- 9 min read

The setup and benchmarking work described in this post were carried out by Raghu Shankar of Austin, TX, with consulting support from ADS. We welcome your questions or feedback; feel free to contact us.
Performance profiling with cloud-native observability tools is vital for optimizing any database system, and for a cloud-native solution like the Milvus Vector Database it is especially critical. As organizations increasingly depend on Generative AI and Retrieval-Augmented Generation (RAG), the ability to analyze and improve the efficiency of the underlying vector databases has never been more essential. This post explores the specifics of performance profiling in Milvus, providing insights and practical tips for AI professionals eager to enhance their AI-pipeline efficiency.
Understanding Milvus Vector Database
Milvus is an open-source vector database tailor-made to manage and search large-scale vector data effectively. It's particularly beneficial for applications in AI, such as image recognition, natural language processing, and recommendation systems. Its cloud-native architecture provides organizations with scalability and flexibility, making it a popular choice for leveraging vector data.
When profiling performance in Milvus, it's important to assess various metrics and parameters that impact efficiency. By grasping these metrics, AI experts can spot bottlenecks, streamline queries, and ultimately boost the performance of their applications.
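For readers who have not used the client API, the minimal sketch below shows the basic create/insert/search flow with pymilvus against a local standalone instance. The collection name, random data, and parameters are illustrative assumptions, not the benchmark's actual configuration.
```python
# Minimal pymilvus sketch (assumes a local Milvus standalone at the default port
# and pymilvus >= 2.4; the collection name and random data are illustrative only).
import random
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")

# Quick-setup collection: auto-generated schema with an "id" field and a
# 1,536-dimensional "vector" field, matching the dimensionality used in this test.
client.create_collection(collection_name="docs", dimension=1536)

# Insert a handful of random vectors (a real workload would insert embeddings).
rows = [{"id": i, "vector": [random.random() for _ in range(1536)]} for i in range(100)]
client.insert(collection_name="docs", data=rows)

# Approximate nearest-neighbor search: one query vector, top-10 results.
query = [random.random() for _ in range(1536)]
hits = client.search(collection_name="docs", data=[query], limit=10)
print(hits[0])
```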
Key Metrics for Performance Profiling
The key metrics used for this performance profiling are:
Core Database Performance Metrics
These metrics quantify the efficiency and quality of the vector database search operations:
• Queries per sec (QPS): Measures the throughput of the system, specifically concerning concurrent searches. Higher QPS is considered better.
• Recall (%): Measures the accuracy of the search results, specifically the percentage of returned results that match the ground-truth ("true", 100%) nearest neighbors. Higher recall is considered better. (A sketch showing how recall and QPS are computed from raw results follows these metric definitions.)
Latency and Duration Metrics
These metrics assess the speed and time consumed by various operations:
• Serial latency: Measures the time taken for search operations, specifically focusing on percentiles such as P99 and P95. Lower latency is considered better.
• Load duration: Measures the time taken to load the dataset, broken out into Load dataset duration and Index build & optimize duration. Lower duration is better.
• Total Load-index build duration: The combined time for both loading and indexing.
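To make the quality metrics concrete, here is a hedged Python sketch of how Recall and QPS can be derived from raw per-query results; the function names and input shapes are hypothetical, not taken from the benchmark tooling used in this exercise.
```python
# Hedged sketch: deriving Recall@k and QPS from raw benchmark output.
# `returned_ids` and `true_ids` are hypothetical per-query ID lists, not the
# output format of any particular benchmark tool.
import time

def recall_at_k(returned_ids, true_ids):
    """Average fraction of returned neighbors present in the ground-truth top-k."""
    per_query = [
        len(set(got) & set(truth)) / len(truth)
        for got, truth in zip(returned_ids, true_ids)
    ]
    return sum(per_query) / len(per_query)

def measure_qps(search_fn, queries):
    """Run queries back to back and return (QPS, per-query latencies in seconds)."""
    latencies = []
    start = time.perf_counter()
    for q in queries:
        t0 = time.perf_counter()
        search_fn(q)
        latencies.append(time.perf_counter() - t0)
    total = time.perf_counter() - start
    return len(queries) / total, latencies
```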
Observability Tools for Performance Profiling
This analysis draws from a performance profiling exercise on a cloud-native vector database, specifically Milvus, integrated with the observability stack described below.
The observability stack used for the performance profiling of the Cloud Native Vector Database includes several integrated components dedicated to instrumentation, collection, processing, and visualization. Integrating the observability stack into a Kubernetes environment was a major project in itself. Here is a list of the components and their brief descriptions:
• eBPF: Used for instrumentation and capturing application data for both flow and tracing streams. eBPF technology allows kernel profiling from user space without code changes.
• Cilium: Works alongside eBPF to feed network flow data into Hubble.
• Hubble: Collects network flow data (Flow 1) and forwards this information to Prometheus.
• Beyla [OTel]: The OpenTelemetry component paired with eBPF to capture Traces for Service Graphs & Span Metrics (Flow 2).
• Alloy: Acts as a central data processor (Receivers -> Processors -> Exporters), routing Traces to Tempo and Span Metrics to Prometheus.
• Prometheus: Collects and stores metrics in a time-series database, including flow data from Hubble and Span Metrics from Alloy.
• Tempo: The dedicated tracing system that receives and manages Traces exported by Alloy.
• Grafana: The primary visualization platform used to display metrics (CPU, Mem, Net, L7) and to build Service Graphs using data from Tempo and Prometheus.

To gain deep insight into the performance of our cloud-native vector database (Milvus), the observability stack was essential. Performance profiling began with raw data capture, where eBPF agents acted as the primary instrumentation and split their findings into two streams. The first stream, partnered with Cilium, captured network flow data and sent it to Hubble, which in turn forwarded its flow metrics to Prometheus. The second stream, paired with Beyla [OTel], focused on application traces and span metrics and routed them into Alloy. Acting as the central data processor (Receivers -> Processors -> Exporters), Alloy separated the traffic, exporting detailed Traces to Tempo (which manages them with its Front-end & Querier, Distributor, and Ingester, storing them in Tempo local storage) while simultaneously sending the Span Metrics to Prometheus. Finally, everything converged on Grafana, our visualization command center: it pulled CPU, Mem, Net, and L7 metrics directly from Prometheus and, by combining the Traces from Tempo with the Span Metrics from Prometheus, dynamically constructed the Service Graph.
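As a concrete example of consuming this pipeline, the hedged sketch below queries Prometheus over its standard HTTP API for the CPU usage (in cores) of the Milvus pods. The Prometheus service address, the cAdvisor metric name, and the pod-name regex are assumptions about this particular cluster and should be adapted to the actual deployment.
```python
# Hedged sketch: CPU usage (in cores) of the Milvus pods via Prometheus's HTTP
# query API. The service address, cAdvisor metric name, and pod regex are
# assumptions about this specific cluster; adjust them to match the deployment.
import requests

PROM_URL = "http://prometheus.monitoring.svc:9090"  # assumed in-cluster address
PROMQL = 'sum(rate(container_cpu_usage_seconds_total{pod=~"milvus-.*"}[1m]))'

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": PROMQL}, timeout=10)
resp.raise_for_status()
result = resp.json()["data"]["result"]
if result:
    # Instant vector: each sample's value is [unix_timestamp, "cores_as_string"].
    print(f"Milvus CPU usage: {float(result[0]['value'][1]):.1f} cores")
```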
The testing showcases Milvus’s ability to handle high-dimensional vector search workloads efficiently, confirming that its Queries per sec (QPS) scales linearly with compute cores.
Test Setup and Key Milvus Parameters
The testing environment utilized a Kubernetes 2-node cluster with local disk storage, where the worker nodes offered 40 cores and 64GB of memory. Milvus was configured in standalone mode using a Helm chart deployment. The test workload focused on a large dataset: 500K vectors with 1,536 dimensions.
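As a rough, hedged template for reproducing the load and index-build timings reported below, the sketch uses the pymilvus ORM API to insert 500K random 1,536-dimensional vectors in batches and then build an index; the schema, batch size, and HNSW parameters are illustrative choices rather than the exact benchmark configuration.
```python
# Hedged sketch: timing dataset load and index build with the pymilvus ORM API.
# The schema, batch size, and HNSW index parameters are illustrative choices,
# not the exact configuration used in this benchmark run.
import time
import numpy as np
from pymilvus import (
    connections, Collection, CollectionSchema, FieldSchema, DataType,
)

connections.connect(host="localhost", port="19530")

fields = [
    FieldSchema("id", DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema("vector", DataType.FLOAT_VECTOR, dim=1536),
]
coll = Collection("bench_500k", CollectionSchema(fields))

# Phase 1: load the dataset -- insert 500K random vectors in 10K-row batches.
t0 = time.perf_counter()
for _ in range(50):
    batch = np.random.random((10_000, 1536)).astype("float32")
    coll.insert([batch.tolist()])
coll.flush()
print(f"Load dataset duration: {time.perf_counter() - t0:.0f}s")

# Phase 2: build the index, then load the collection into memory for search.
t1 = time.perf_counter()
coll.create_index("vector", {"index_type": "HNSW", "metric_type": "L2",
                             "params": {"M": 16, "efConstruction": 200}})
coll.load()
print(f"Index build & optimize duration: {time.perf_counter() - t1:.0f}s")
```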
Key performance results for the 500K, 1,536D setup are summarized below:
Table 1: Milvus Performance Criteria (500K, 1536D)
Criteria | Value |
Load dataset duration, secs (mins) | 1,147s (19m) |
Index build & optimize, secs (mins) | 525s (9m) |
Total Load-index build, secs (mins) | 1,672s (28m) |
Queries per sec, QPS | 980 |
Recall% | 98% |
P99 latency | 12 ms |
High Throughput and Low Latency
The performance results highlight significant efficiency:
1. Throughput (QPS): Milvus achieved 980 Queries per sec (QPS).
This demonstrates highly linear performance scaling with cores. Analysis indicates that Milvus outperforms rival configurations by 50%, achieving this throughput while using approximately 21 cores and ~5GB of memory during the peak index-build and search phases.
2. Serial Latency: The system reported an excellent P99 latency of 12 ms.
Lower serial latency is generally better. (A sketch of how these latency percentiles can be measured under concurrent load follows this list.)
3. Recall: The recall percentage was measured at 98%.
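These throughput and latency figures come from concurrent search runs; the hedged sketch below shows one way to drive such a workload from Python and extract QPS plus P95/P99 latencies. The concurrency of 80 mirrors the maximum used in the benchmark; the client, collection name, and query count are assumptions carried over from the earlier sketch.
```python
# Hedged sketch: concurrent search driver reporting QPS and latency percentiles.
# `client` is the MilvusClient from the earlier sketch; the collection name,
# query count, and top-k are illustrative assumptions.
import random
import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def one_search(vec):
    t0 = time.perf_counter()
    client.search(collection_name="docs", data=[vec], limit=10)
    return time.perf_counter() - t0

queries = [[random.random() for _ in range(1536)] for _ in range(2000)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=80) as pool:  # 80 concurrent searchers
    latencies = list(pool.map(one_search, queries))
elapsed = time.perf_counter() - start

print(f"QPS: {len(queries) / elapsed:.0f}")
print(f"P95 latency: {np.percentile(latencies, 95) * 1000:.1f} ms")
print(f"P99 latency: {np.percentile(latencies, 99) * 1000:.1f} ms")
```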
Resource Utilization Deep Dive: CPU and Memory Trends
The system was observed across two primary phases: "Index Build & Optimize" and "Concurrent Searches". The following resource utilization charts illustrate the dynamic nature of the workload:
Table 2: Milvus Peak Resource Metrics
Metrics | Value |
Milvus cores peak | 28 |
Milvus Cache peak, MB | 115 |
Milvus memory max, GB | 6 |
Minio memory, GB | 40 |
Network Transmit max, MB/s | 180 |

CPU Core Usage: The Index Build & Optimize phase showed highly cyclical and dynamic CPU usage, peaking up to 21 cores. A pronounced bottleneck or spike occurred at the transition point to concurrent searching, where utilization briefly surged to 28 cores. During the Concurrent Searches phase, CPU usage stabilized after the initial spike.
Time Period | Index Build & Optimize | Concurrent Searches |
CPU Usage Trend | Highly cyclical/dynamic, peaking at 21 cores. | Brief spike to 28 cores, then stabilizes. |
Memory Usage: During the Index Build & Optimize phase, memory usage gradually increased, peaking at 6.6GB. When the system transitioned to Concurrent Searches, memory usage dropped significantly and stabilized at 3.6GB.
Time Period | Index Build & Optimize | Concurrent Searches |
Memory Usage Trend | Gradually increasing, peaking at 6.6GB. | Drops sharply, stabilizing at 3.6GB. |
Observability Practices for Optimizing Performance
The system employed an observability stack involving eBPF, Cilium, and Hubble for network metrics. The processing rate of flows per node provides insight into the underlying network activity generated by the Milvus benchmark.

The Hubble chart tracking "Flows processed Per Node" showed substantial variability (a query sketch for reproducing these rates follows the list below):
• Maximum Rate: The highest flow rate observed was 53 ops/sec.
• Average Rate: The overall average flow rate was approximately ~40 ops/sec.
The trend data reveals distinct spiking and bottleneck periods:
• Initial Spiking (approx. 11:40 to 12:20): This period is characterized by intense, rapid flow spikes, reaching the maximum observed rate.
• Bottleneck Period (approx. 12:20 to 13:00): Following the initial high activity, the rate dropped and sustained a relatively low and steady rate (around 36–37 ops/sec).
• Resumed Activity (approx. 13:00 onwards): The rate increased again, showing several distinct, sharp spikes that settled into a fluctuating, sustained rate around 40 ops/sec.
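The flow rates above were read from Hubble's Prometheus metrics. The hedged sketch below shows one way to reproduce the per-node chart with a PromQL query issued from Python; the metric name (hubble_flows_processed_total) is Hubble's standard flow counter, but the exact labels available depend on the Cilium/Hubble version and scrape configuration.
```python
# Hedged sketch: per-instance flow-processing rate from Hubble's Prometheus
# metrics. The service address, metric name, and labels are assumptions about
# this cluster; each "instance" roughly corresponds to one node's agent.
import requests

PROM_URL = "http://prometheus.monitoring.svc:9090"  # assumed in-cluster address
PROMQL = "sum by (instance) (rate(hubble_flows_processed_total[1m]))"

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": PROMQL}, timeout=10)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    instance = series["metric"].get("instance", "unknown")
    print(f"{instance}: {float(series['value'][1]):.1f} flows/sec")
```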
The overall analysis confirms that Milvus provides robust performance, achieving high throughput (980 QPS) and competitive latency (12 ms P99) while demonstrating a clear, linear relationship between QPS and core utilization.
Bottlenecks Exposed
Based on the charts and data from the Milvus performance analysis, the benchmark exposed several periods of high utilization and resource contention that can be categorized as bottlenecks or stress points within the system. These bottlenecks primarily relate to CPU usage during workload transitions and specific periods of reduced network flow processing.
The key exposed bottlenecks and resource stress points are:
CPU Contention during Transition to Search
The most pronounced spike in resource utilization, indicating a momentary bottleneck or extreme stress, occurred when the Milvus database transitioned from the "Index Build & Optimize" phase to the "Concurrent Searches" phase:
Peak Utilization: CPU core usage spiked significantly from its index-building steady state (21 cores) to a momentary peak of 28 cores at the precise point where concurrent searches began. This 28 core peak represents the highest core utilization observed for Milvus during the benchmark run.
Implication: This spike suggests a temporary surge in demand required to initialize or handle the maximum load of concurrent search tests (Max 80 concurrencies), which could point to a resource bottleneck during workload switching or initial query bursts.
Cyclical CPU Bottleneck during Index Building
The "Index Build & Optimize" phase itself presented a highly dynamic and potentially constrained environment:
Cyclical Load: CPU usage during indexing was highly cyclical and dynamic, continuously ramping up and dropping down, with peaks reaching up to 21 cores.
Implication: This cyclical pattern suggests that the index creation and optimization process involves oscillating load demands, which, while managed, demonstrate a repetitive cycle of high CPU demand (peaking at 21 cores) followed by periods of relative reduction, indicating the index building process is intensely CPU-bound in cycles.
Network Flow Bottleneck/Trough
Analysis of the "Flows processed Per Node" captured by Hubble (the network observability tool) clearly identifies a sustained period where flow processing capacity appears constrained or significantly reduced:
Trough Period Identified: Following an initial high-variability period (approx. 11:40 to 12:20), the flow rate dropped sharply and maintained a relatively low and steady rate (around 36–37 ops/sec) for a significant duration (approx. 12:20 to 13:00).
Contrast to Peak: This low sustained rate contrasts sharply with the maximum observed flow rate of 53 ops/sec.
Implication: This sustained trough suggests a period where the network or the flow processing component (Hubble) was unable to maintain the earlier high rates, potentially pointing to a network or monitoring bottleneck during that time interval.
Summary Thoughts
The performance analysis shows that the Milvus Cloud Native Vector Database is highly efficient, demonstrating highly linear performance scaling with cores and successfully achieving 980 Queries per sec (QPS) for the challenging 500K vector dataset with 1,536 dimensions. This superior throughput allows Milvus to outperform rival configurations by 50% while maintaining an impressive 98% Recall and exceptionally low 12 ms P99 latency. Resource utilization showed distinct behavior across phases: the initial "Index Build & Optimize" phase was memory-intensive, peaking at 6.6GB and utilizing up to 21 cores in a cyclical pattern, whereas the "Concurrent Searches" phase stabilized memory at a lower 3.6GB but featured a significant CPU utilization spike peaking at 28 cores during the transition. Critically, the integrated observability stack validated these dynamic operational demands by capturing highly variable internal network flows, including spikes up to a maximum of 53 ops/sec and noticeable bottleneck periods.
Suggested future work for better performance
Based on the performance analysis and the bottlenecks exposed by the benchmark, several areas for potential improvement, across both software and hardware configurations, can be suggested:
Hardware Improvement
The current setup utilizes a Kubernetes 2-node cluster with local disk storage, where worker nodes have 40 cores and 64GB of memory.
Addressing Peak CPU Contention: The most significant CPU spike occurs when transitioning to concurrent searches, peaking at 28 cores.
Suggestion: Increase dedicated core allocation or ensure more homogeneous core availability during transitions to handle the instantaneous load spike above the 21-core steady-state usage. While Milvus showed highly linear performance with cores, dedicating more compute resources could eliminate the transient bottleneck observed at 28 cores.
Storage Optimization: The setup uses a 2-node cluster with local disk.
Suggestion: Although Milvus showed good performance, shifting from local disk to a higher-performance, dedicated distributed storage solution (beyond Minio's current 40GB allocation) could improve persistence and data loading times, especially when scaling beyond a 2-node cluster or handling larger datasets (such as a 1M-vector dataset).
Software/Configuration Improvement Suggestions
Improvements can be targeted at optimizing the Milvus configuration, managing the workload transitions, and enhancing the observability stack's efficiency.
Optimizing Milvus Indexing/Search Transition: The severe CPU spike (to 28 cores) during the transition from "Index Build & Optimize" to "Concurrent Searches" suggests inefficiency in handling the shift in workload.
Suggestion: Fine-tune Milvus configuration parameters related to resource throttling or indexing completion handover. This might involve adjusting Kubernetes resource requests/limits (K8S resources, as mentioned in the setup) specifically for the search components to reserve burst capacity, thereby smoothing the transition and reducing the momentary CPU contention.
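One concrete, hedged way to reserve that burst capacity is a Helm values override that raises the CPU requests/limits on the standalone Milvus pod. The standalone.resources path below follows the public milvus-helm chart layout and is an assumption; verify it against the chart version actually deployed.
```python
# Hedged sketch: write a Helm values override that reserves CPU headroom for the
# Milvus standalone pod. The "standalone.resources" path is assumed to match the
# milvus-helm chart in use; confirm against your chart version before applying.
import yaml  # PyYAML

values_override = {
    "standalone": {
        "resources": {
            "requests": {"cpu": "24", "memory": "8Gi"},  # above the 21-core steady state
            "limits": {"cpu": "30", "memory": "12Gi"},   # headroom for the 28-core spike
        }
    }
}

with open("milvus-values-override.yaml", "w") as f:
    yaml.safe_dump(values_override, f)

# Apply with, e.g.:  helm upgrade milvus milvus/milvus -f milvus-values-override.yaml
```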
Addressing Network Flow Bottleneck (Hubble/Cilium): The Hubble flow chart showed a sustained period where the flow processing rate dropped substantially (trough period of 36–37 ops/sec).
Suggestion: Investigate the resource allocation and configuration of the observability components (Hubble and Cilium). If these components are resource-constrained, increasing their allocated CPU/memory might allow them to maintain a flow processing rate closer to the observed maximum of 53 ops/sec, ensuring more comprehensive real-time network visibility.
Python Library Upgrade: The benchmark setup notes a Python environment of version 3.11 (as of Oct 2025).
Suggestion: If subsequent versions of Python libraries or Milvus client SDKs offer performance improvements or better resource handling, upgrading the Python environment could yield incremental software gains.
Minio Optimization: Minio is allocated 40GB of memory.
Suggestion: Analyze the actual usage and I/O load on the Minio pod. If Minio itself contributes to latency or I/O contention, optimizing its configuration or increasing its dedicated resources might alleviate potential bottlenecks related to object storage interaction.