db: Smooth out IO from flushing L0 and compaction #2004
sumeerbhola left a comment
Reviewed 1 of 5 files at r1, 4 of 5 files at r2, all commit messages.
Reviewable status: 5 of 6 files reviewed, 2 unresolved discussions (waiting on @andrewbaptist)
smoother.go line 55 at r2 (raw file):
    // Every 100 iterations, update the estimated utilization under lock
    if totalSamples == numSamples {
        utilRunning := float64(sampleRunning) / float64(totalSamples)
s.countRunning and s.sleepingCount can each be > 1, so I don't quite understand the logic behind dividing by totalSamples, which is only being incremented by 1 for each tick.
If both sampleRunning and sampleSleeping are 0, we would still tick and compute util=0 and slowly shrink s.mu.estimatedUtilization to 0.1, yes? What will cause it to increase above 0.1? Seems to me that the sleeps will keep it at 0.1.
It seems to me that this smoother is not aware of the work backlog or the resource availability. Am I missing something?
smoother.go line 58 at r2 (raw file):
        utilSleeping := float64(sampleSleeping) / float64(totalSamples)
        // Add all the running time and half the sleeping time.
        util := utilRunning + utilSleeping/2
Why add utilSleeping/2?
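To make the arithmetic in the two quoted snippets concrete, here is a minimal sketch, not the PR's code. The function name `estimateUtil`, its signature, and the sample values are hypothetical; only the two divisions by `totalSamples` and the `utilRunning + utilSleeping/2` weighting come from the quoted lines. It illustrates the first question above: if the sampled counts can exceed 1 per tick, the estimate can exceed 1.

```go
package main

import "fmt"

// estimateUtil mirrors the arithmetic quoted from smoother.go: both sampled
// sums are divided by the number of ticks, and sleeping counts at half weight.
// The function name and signature are hypothetical.
func estimateUtil(sampleRunning, sampleSleeping, totalSamples int) float64 {
	utilRunning := float64(sampleRunning) / float64(totalSamples)
	utilSleeping := float64(sampleSleeping) / float64(totalSamples)
	return utilRunning + utilSleeping/2
}

func main() {
	// One worker observed running on 40 of 100 ticks and sleeping on 20.
	fmt.Println(estimateUtil(40, 20, 100))
	// Three workers observed running on every tick: the sum is 300 over
	// 100 ticks, so the result is above 1.
	fmt.Println(estimateUtil(300, 0, 100))
	// Nothing running or sleeping: the estimate is 0.
	fmt.Println(estimateUtil(0, 0, 100))
}
```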
Data on the performance impact of this change on KV50:
Unthrottled P99 (P90 of P99s over the window)
Throttled P99
During index creation P99
Throughput, CPU util, LSM health, and most other metrics are very similar.
prometheus-patched.tar.gz
patched_cockroach_workload_run_kv.log
[Image showing the difference in P99]
smoother: integrate into code

This PR adds a smoother that monitors the average time taken by flushes and compactions and paces subsequent flush and compaction loops, aiming for a consistent IO rate rather than a spiky one.
Spiky IO can saturate the underlying device, which then slows down writes to the WAL. Keeping the rate of flushing and compaction consistent greatly reduces P99 latency.