Fix race condition for Failed Test Replay #10679
daniel-mohedano wants to merge 6 commits into master
Debugger benchmarks
Summary: Found 0 performance improvements and 0 performance regressions! Performance is the same for 10 metrics, 5 unstable metrics.
Chart: reports - request duration [CI 0.99] : candidate=None, baseline=None
baseline: noprobe 308.756 µs, basic 267.282 µs, loop 8.957 ms
candidate: noprobe 312.384 µs, basic 276.009 µs, loop 8.967 ms
Benchmarks
Startup
Summary: Found 0 performance improvements and 0 performance regressions! Performance is the same for 63 metrics, 8 unstable metrics.
Startup time reports for petclinic
Chart: petclinic - global startup overhead: candidate=1.60.0-SNAPSHOT~0de9403e81, baseline=1.60.0-SNAPSHOT~969d21d507
tracing: Agent 1.065 s baseline, 1.065 s candidate; Total 10.936 s baseline, 10.937 s candidate
appsec: Agent 1.238 s baseline, 1.238 s candidate; Total 10.985 s baseline, 11.086 s candidate
iast: Agent 1.239 s baseline, 1.24 s candidate; Total 11.262 s baseline, 11.143 s candidate
profiling: Agent 1.197 s baseline, 1.193 s candidate; Total 10.969 s baseline, 10.954 s candidate
Chart: petclinic - break down per module: candidate=1.60.0-SNAPSHOT~0de9403e81, baseline=1.60.0-SNAPSHOT~969d21d507 (sections: tracing, appsec, iast, profiling)
Startup time reports for insecure-bank
Chart: insecure-bank - global startup overhead: candidate=1.60.0-SNAPSHOT~0de9403e81, baseline=1.60.0-SNAPSHOT~969d21d507
tracing: Agent 1.063 s baseline, 1.065 s candidate; Total 8.728 s baseline, 8.803 s candidate
iast: Agent 1.229 s baseline, 1.232 s candidate; Total 9.369 s baseline, 9.374 s candidate
Chart: insecure-bank - break down per module: candidate=1.60.0-SNAPSHOT~0de9403e81, baseline=1.60.0-SNAPSHOT~969d21d507 (sections: tracing, iast)
Load
Summary: Found 1 performance improvements and 4 performance regressions! Performance is the same for 9 metrics, 22 unstable metrics.
Request duration reports for insecure-bank
Chart: insecure-bank - request duration [CI 0.99] : candidate=1.60.0-SNAPSHOT~0de9403e81, baseline=1.60.0-SNAPSHOT~969d21d507
baseline: no_agent 1.194 ms, iast 3.116 ms, iast_FULL 5.737 ms, iast_GLOBAL 3.619 ms, profiling 1.973 ms, tracing 1.836 ms
candidate: no_agent 1.165 ms, iast 3.343 ms, iast_FULL 5.996 ms, iast_GLOBAL 3.661 ms, profiling 2.123 ms, tracing 1.802 ms
Request duration reports for petclinic
Chart: petclinic - request duration [CI 0.99] : candidate=1.60.0-SNAPSHOT~0de9403e81, baseline=1.60.0-SNAPSHOT~969d21d507
baseline: no_agent 18.009 ms, appsec 19.624 ms, code_origins 17.642 ms, iast 17.517 ms, profiling 18.635 ms, tracing 17.749 ms
candidate: no_agent 19.376 ms, appsec 18.516 ms, code_origins 17.856 ms, iast 17.678 ms, profiling 20.73 ms, tracing 17.409 ms
Dacapo
Summary: Found 0 performance improvements and 0 performance regressions! Performance is the same for 11 metrics, 1 unstable metrics.
Execution time for biojava
Chart: biojava - execution time [CI 0.99] : candidate=1.60.0-SNAPSHOT~0de9403e81, baseline=1.60.0-SNAPSHOT~969d21d507
baseline: no_agent 15.341 s, appsec 14.905 s, iast 18.211 s, iast_GLOBAL 17.974 s, profiling 14.709 s, tracing 14.669 s
candidate: no_agent 14.743 s, appsec 14.688 s, iast 18.228 s, iast_GLOBAL 17.773 s, profiling 14.964 s, tracing 15.102 s
Execution time for tomcat
Chart: tomcat - execution time [CI 0.99] : candidate=1.60.0-SNAPSHOT~0de9403e81, baseline=1.60.0-SNAPSHOT~969d21d507
baseline: no_agent 1.472 ms, appsec 3.802 ms, iast 2.25 ms, iast_GLOBAL 2.293 ms, profiling 2.1 ms, tracing 2.057 ms
candidate: no_agent 1.468 ms, appsec 3.783 ms, iast 2.256 ms, iast_GLOBAL 2.293 ms, profiling 2.12 ms, tracing 2.059 ms
cancelSchedule(this.lowRateScheduled);
// clear interrupt flag that could be set by JVM shutdown to allow to serialize and upload
// snapshots
Thread.interrupted();
Is there a catch for InterruptedException that is not re-interrupting the thread? I'm not sure that putting this here is the best option.
In hindsight, I realize that the interruption is not even on the thread performing the stop method, but rather on the one running the scheduled flush, so I don't think the interrupted call does anything here. I'll update the approach (and the InterruptedException can be caught in
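The point about which thread carries the interrupt flag can be illustrated in isolation: Thread.interrupted() reads and clears the flag of the calling thread only. The following is a minimal standalone sketch, not code from this PR; the class name InterruptFlagSketch and its worker thread are hypothetical stand-ins for the scheduler thread.

```java
import java.util.concurrent.atomic.AtomicBoolean;

class InterruptFlagSketch {
  /**
   * Interrupts a worker thread, then calls Thread.interrupted() on the caller.
   * Returns {callerFlag, workerFlag}.
   */
  static boolean[] run() {
    AtomicBoolean release = new AtomicBoolean(false);
    AtomicBoolean workerFlag = new AtomicBoolean(false);
    Thread worker = new Thread(() -> {
      // Spin without blocking calls, so nothing consumes the interrupt flag.
      while (!release.get()) {
        Thread.yield();
      }
      workerFlag.set(Thread.currentThread().isInterrupted());
    });
    worker.start();
    worker.interrupt(); // hypothetical stand-in for the shutdown sequence

    // Reads and clears the flag of the *calling* thread; the worker is unaffected.
    boolean callerFlag = Thread.interrupted();

    release.set(true);
    try {
      worker.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // restore the flag instead of swallowing it
      throw new IllegalStateException(e);
    }
    return new boolean[] {callerFlag, workerFlag.get()};
  }

  public static void main(String[] args) {
    boolean[] flags = run();
    System.out.println("caller interrupted: " + flags[0] + ", worker interrupted: " + flags[1]);
  }
}
```

The catch block also shows the usual way to handle InterruptedException without losing the interrupt: restore the flag before propagating.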
amarziali left a comment
Looks OK to me. I'll leave it to the DI team to give the formal approval. Thanks for the changes!
}
} catch (Exception e) {
ExceptionHelper.logException(LOGGER, e, "Error during snapshot serialization:");
} catch (Throwable e) {
Can we target Errors more specifically here?
If Moshi is raising AssertionError, let's focus on it.
Otherwise, catching Throwable here means we are also catching OutOfMemoryError, which I doubt is a good idea.
Good point, addressed in 0de9403 to reduce the scope to only AssertionError
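As a rough illustration of the narrowed catch, here is a minimal sketch, assuming a plain BlockingQueue and a serializer function standing in for Moshi. The names RequeueSketch and drainAndSerialize are hypothetical and this is not the code from 0de9403; it only shows the shape of the idea: an AssertionError puts the item back for a later retry, an ordinary Exception drops it, and other Errors such as OutOfMemoryError are left to propagate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;

class RequeueSketch {
  /** Drains the queue and serializes each item; retryable failures go back to the queue. */
  static <T> List<String> drainAndSerialize(
      BlockingQueue<T> queue, Function<T, String> serializer) {
    List<T> drained = new ArrayList<>();
    queue.drainTo(drained);
    List<String> serialized = new ArrayList<>();
    for (T item : drained) {
      try {
        serialized.add(serializer.apply(item));
      } catch (AssertionError e) {
        // Treated as retryable in this sketch: put the item back so a later flush can
        // try again. offer() returns false if a bounded queue has filled up meanwhile.
        queue.offer(item);
      } catch (Exception e) {
        // Ordinary serialization failure: the item is dropped.
      }
      // Any other Error (for example OutOfMemoryError) is deliberately not caught.
    }
    return serialized;
  }

  public static void main(String[] args) {
    BlockingQueue<String> queue = new ArrayBlockingQueue<>(10);
    queue.add("first");
    queue.add("second");
    List<String> out = drainAndSerialize(queue, s -> {
      if (s.equals("second")) {
        throw new AssertionError("simulated interrupted serialization");
      }
      return "{" + s + "}";
    });
    System.out.println("serialized=" + out + ", still queued=" + queue);
  }
}
```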
What Does This Do
Fix race condition in DebuggerSink.stop() that can cause DI snapshot loss during JVM shutdown, leading to flaky "test headless failed test replay" smoke tests.
During JVM shutdown, the periodic lowRateFlush on the dd-task-scheduler thread can race with the shutdown hook. The periodic flush drains snapshots from the BlockingQueue, but the thread's interrupt flag (set by the shutdown sequence) causes Moshi serialization to throw AssertionError: java.io.InterruptedIOException: interrupted. Since the snapshots have already been removed from the queue, they are permanently lost. The shutdown hook's subsequent flush then finds an empty queue and uploads nothing. This issue mostly surfaces in Failed Test Replay (and not Exception Replay) due to the nature of short-lived testing environments.
Evidence from CI logs of a failing run:
The fix reorders stop() to cancel periodic schedules before performing the final flush. Additionally, SnapshotSink.getSerializedSnapshots() now re-queues snapshots that fail with an AssertionError so the shutdown hook flush can retry them with a non-interrupted thread. Normal serialization Exceptions still drop the snapshot.
The PR also introduces some additional changes to stabilize the smoke test environment:
DYNAMIC_INSTRUMENTATION_UPLOAD_FLUSH_INTERVAL value to minimize the race conditions.
Motivation
This issue was causing flakes in the FTR JUnit Console smoke test, where sometimes the snapshots were not available after test execution.
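The reordering of stop() described above can be sketched in a few lines. This is a simplified model under stated assumptions, not the DebuggerSink implementation: SinkStopSketch, its String snapshots, and the in-memory uploaded list are all hypothetical, and a java.util.concurrent scheduler stands in for the agent's task scheduler.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

class SinkStopSketch {
  private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
  private final List<String> uploaded = Collections.synchronizedList(new ArrayList<>());
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor(r -> {
        Thread t = new Thread(r, "sketch-scheduler");
        t.setDaemon(true); // do not keep the JVM alive for the sketch
        return t;
      });
  private ScheduledFuture<?> periodicFlush;

  void start(long intervalMillis) {
    periodicFlush = scheduler.scheduleAtFixedRate(
        this::flush, intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
  }

  void add(String snapshot) {
    queue.offer(snapshot);
  }

  /** Drains the queue and "uploads" the batch; synchronized so flushes never overlap. */
  synchronized void flush() {
    List<String> batch = new ArrayList<>();
    queue.drainTo(batch);
    uploaded.addAll(batch);
  }

  void stop() {
    // 1. Cancel the periodic flush first, so no new run can drain the queue
    //    while the final flush is in progress.
    if (periodicFlush != null) {
      periodicFlush.cancel(false);
    }
    scheduler.shutdown();
    // 2. Only then perform the final flush on the calling thread.
    flush();
  }

  boolean periodicFlushCancelled() {
    return periodicFlush != null && periodicFlush.isCancelled();
  }

  List<String> uploaded() {
    return new ArrayList<>(uploaded);
  }

  public static void main(String[] args) {
    SinkStopSketch sink = new SinkStopSketch();
    sink.start(60_000);
    sink.add("snapshot-1");
    sink.stop();
    System.out.println("uploaded=" + sink.uploaded());
  }
}
```

Cancelling with cancel(false) does not wait for a periodic run that has already started, which is why the sketch also serializes flushes; a run that has already drained the queue and then fails is the case the re-queue step is meant to cover.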
Contributor Checklist
type: and (comp: or inst:) labels in addition to any other useful labels
close, fix, or any linking keywords when referencing an issue
Use solves instead, and assign the PR milestone to the issue
Jira ticket: [PROJ-IDENT]
Note: Once your PR is ready to merge, add it to the merge queue by commenting /merge. /merge -c cancels the queue request. /merge -f --reason "reason" skips all merge queue checks; please use this judiciously, as some checks do not run at the PR-level. For more information, see this doc.