db: Smooth out IO from flushing L0 and compaction #2004

Draft
andrewbaptist wants to merge 1 commit into cockroachdb:master from andrewbaptist:20221012.io-smoother

@andrewbaptist

This PR adds a smoother that monitors the average time spent flushing and compacting, and paces future flush and compaction loops to keep the IO rate consistent rather than spiky.

Spiky IO can saturate the underlying device, which in turn slows down writes to the WAL. With a consistent rate of flushing and compaction, P99 latency is greatly reduced.
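
To make the pacing idea concrete, here is a minimal sketch in Go. It is not the code in this PR: the names `ewma` and `paceDelay`, the smoothing factor, and the target-utilization parameter are all assumptions chosen for illustration. It shows one way a running average of work duration could be turned into a sleep between loop iterations so that work occupies a roughly constant fraction of wall time.

```go
package main

import (
	"fmt"
	"time"
)

// ewma folds a new duration sample into a running average.
// alpha is a hypothetical smoothing factor in (0, 1].
func ewma(avg, sample time.Duration, alpha float64) time.Duration {
	return time.Duration((1-alpha)*float64(avg) + alpha*float64(sample))
}

// paceDelay returns how long to sleep after a unit of work that takes
// avgWork on average, so that work fills about targetUtil of wall time:
// avgWork / (avgWork + delay) = targetUtil.
// Out-of-range targets disable pacing (an assumption of this sketch).
func paceDelay(avgWork time.Duration, targetUtil float64) time.Duration {
	if targetUtil <= 0 || targetUtil >= 1 {
		return 0
	}
	return time.Duration(float64(avgWork) * (1 - targetUtil) / targetUtil)
}

func main() {
	avg := 100 * time.Millisecond
	// A slower flush or compaction pulls the average up.
	avg = ewma(avg, 200*time.Millisecond, 0.5)
	fmt.Println("average work:", avg)
	fmt.Println("sleep for 50% utilization:", paceDelay(avg, 0.5))
}
```

A real pacer would also need to account for backlog and device capacity, which this sketch deliberately ignores.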

@cockroach-teamcity

Member

This change is Reviewable

@andrewbaptist force-pushed the 20221012.io-smoother branch 15 times, most recently from c7480bb to 0c4fa6e, on October 13, 2022 15:27

@sumeerbhola left a comment

Contributor

Reviewed 1 of 5 files at r1, 4 of 5 files at r2, all commit messages.
Reviewable status: 5 of 6 files reviewed, 2 unresolved discussions (waiting on @andrewbaptist)


smoother.go line 55 at r2 (raw file):

				// Every 100 iterations, update the estimated utilization under lock
				if totalSamples == numSamples {
					utilRunning := float64(sampleRunning) / float64(totalSamples)

s.countRunning and s.sleepingCount can each be > 1, so I don't quite understand the logic behind dividing by totalSamples, which is only being incremented by 1 for each tick.

If both sampleRunning and sampleSleeping are 0, we would still tick and compute util=0 and slowly shrink s.mu.estimatedUtilization to 0.1, yes? What will cause it to increase above 0.1? Seems to me that the sleeps will keep it at 0.1.

It seems to me that this smoother is not aware of the work backlog or the resource availability. Am I missing something?
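
The first concern above can be illustrated with a small sketch. It assumes, as the comment implies but the quoted snippet does not show, that `sampleRunning` accumulates a per-tick count of running workers while `totalSamples` grows by one per tick; the function `sampledUtil` is a hypothetical stand-in, not code from `smoother.go`. Under that assumption the ratio is an average worker count, not a fraction, and it exceeds 1 whenever more than one worker runs at a time.

```go
package main

import "fmt"

// sampledUtil mimics the quoted division: the numerator accumulates the
// number of running workers seen on each tick, the denominator counts ticks.
func sampledUtil(runningPerTick []int) float64 {
	sampleRunning, totalSamples := 0, 0
	for _, n := range runningPerTick {
		sampleRunning += n // may add more than 1 per tick
		totalSamples++     // always adds exactly 1 per tick
	}
	if totalSamples == 0 {
		return 0
	}
	return float64(sampleRunning) / float64(totalSamples)
}

func main() {
	// 100 ticks with three workers running on every tick.
	ticks := make([]int, 100)
	for i := range ticks {
		ticks[i] = 3
	}
	fmt.Println(sampledUtil(ticks)) // prints 3, not a value in [0, 1]
}
```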


smoother.go line 58 at r2 (raw file):

					utilSleeping := float64(sampleSleeping) / float64(totalSamples)
					// Add all the running time and half the sleeping time.
					util := utilRunning + utilSleeping/2

Why add utilSleeping/2?
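
For reference, the quoted formula can be evaluated on concrete numbers. The sketch below wraps the three quoted lines in a hypothetical helper, `combinedUtil`; the helper name and the sample counts are assumptions for illustration, and the sketch does not explain the choice of weight.

```go
package main

import "fmt"

// combinedUtil reproduces the quoted arithmetic: all of the running
// fraction plus half of the sleeping fraction.
func combinedUtil(sampleRunning, sampleSleeping, totalSamples int) float64 {
	utilRunning := float64(sampleRunning) / float64(totalSamples)
	utilSleeping := float64(sampleSleeping) / float64(totalSamples)
	return utilRunning + utilSleeping/2
}

func main() {
	// 50 running samples and 50 sleeping samples out of 100 ticks:
	// 0.5 + 0.5/2 = 0.75.
	fmt.Println(combinedUtil(50, 50, 100))
}
```

With these inputs, a loop that is asleep half the time is still counted as 75% utilized.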

@andrewbaptist force-pushed the 20221012.io-smoother branch 5 times, most recently from 2a3cc9f to dedcb58, on October 14, 2022 14:55
@andrewbaptist

andrewbaptist commented Oct 14, 2022

Author

Data on the performance impact of this change on KV50

Unthrottled P99 (P90 of P99s over the window)
10ms -> 6ms

Throttled P99
130ms -> 84ms

During index creation P99
352ms -> 204ms

Throughput, CPU util, LSM health, and most other metrics are very similar.
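
As a quick check on the figures above, the relative P99 reductions can be computed directly from the before and after values. The helper `reduction` is a hypothetical name used only here. The three cases work out to roughly 40%, 35%, and 42%.

```go
package main

import "fmt"

// reduction returns the percentage drop from before to after.
func reduction(before, after float64) float64 {
	return (before - after) * 100 / before
}

func main() {
	cases := []struct {
		name          string
		before, after float64 // P99 latency in milliseconds
	}{
		{"unthrottled", 10, 6},
		{"throttled", 130, 84},
		{"index creation", 352, 204},
	}
	for _, c := range cases {
		fmt.Printf("%s: %.1f%% lower P99\n", c.name, reduction(c.before, c.after))
	}
}
```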

prometheus-patched.tar.gz
prometheus-orig.tar.gz

patched_cockroach_workload_run_kv.log
orig_cockroach_workload_run_kv.log

Image showing the difference in P99
https://docs.google.com/spreadsheets/d/1GmUOc69d9r-4GpS_n176Pttw0JhCzgOpAcdfs1ks_WI/edit#gid=1299047713

@andrewbaptist force-pushed the 20221012.io-smoother branch 6 times, most recently from 3852956 to c2f7610, on October 14, 2022 19:23
smoother: integrate into code