
4 posts tagged with "bazel"


Memory Adventure

· 8 min read
Nils Wireklint

An adventure in finding a memory thief in Starlark-land

This is a summary of and follow-up to my talk at BazelCon 2023, with abridged code examples; the full instructions are available together with the code.

Problem Statement

First, we lament Bazel's out-of-memory errors and point out that the often useful Starlark stack trace does not always show up. Some allocation errors just crash Bazel without any indication of which allocation failed.

(Diagram: allocation)

This diagram illustrates a common problem with memory errors: the allocation that fails may not be the real problem, it is just the straw that breaks the camel's back. The real thief may already have allocated its memory.

We have seen many such errors when working with clients, and they typically hide in big corporate code bases, which complicates troubleshooting, discussion, and error reporting. So we created a synthetic repository to illustrate the problem and to have something to discuss. The code and instructions are available here.

Errors and poor performance in the analysis phase are particularly bad, because the analysis must always be done before any actions can be built. With big projects the number of configurations to build for can be very large, so one cannot rely on CI runners building the same configuration over and over to retain the analysis cache. Instead the analysis is on the critical path for all builds, especially if the actions themselves are cached remotely.

To illustrate (some of) the problem we have a reproduction repository with an example code base containing some Python and C programs. To introduce memory problems, and make it a little more complex, we add two rules: one CPU-intensive rule ("spinlock") and one memory-intensive aspect ("traverse"). The "traverse" aspect encodes the full dependency tree of all targets and writes that to a file with ctx.actions.write, so the allocations are tied to the Action object.
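To make this concrete, here is a minimal Starlark sketch of what such an aspect could look like. This is not the exact code from the reproduction repository: the provider name and the attribute handling are illustrative assumptions, while ctx.actions.write and the eat_memory output group match the build invocation shown below.

TreeInfo = provider(fields = ["description"])

def _traverse_impl(target, ctx):
    # Collect the descriptions of the direct dependencies the aspect has
    # already visited, and build one (potentially large) string for this node.
    deps = getattr(ctx.rule.attr, "deps", [])
    children = [dep[TreeInfo].description for dep in deps if TreeInfo in dep]
    description = "{} -> [{}]".format(target.label, ", ".join(children))

    # Handing the string to ctx.actions.write ties the allocation to the
    # Action object, which keeps it alive until the action executes.
    out = ctx.actions.declare_file(target.label.name + ".tree")
    ctx.actions.write(out, description)
    return [
        TreeInfo(description = description),
        OutputGroupInfo(eat_memory = depset([out])),
    ]

traverse = aspect(
    implementation = _traverse_impl,
    attr_aspects = ["deps"],
)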

Toolbox

We have a couple of tools available, many are discussed in the memory optimization guide, but we find that some problems can slip through the cracks.

First off, there are the post-build analysis tools in bazel:

  • bazel info
  • bazel dump --rules
  • bazel aquery --skyframe_state

These are a good starting point and have served us well on many occasions. But with this project they seem to miss some allocations; we will return to that later. Additionally, these tools will not give any information if the Bazel server crashes. You will need to increase the memory and run the same build again.

Then one can use Java tools to inspect what the JVM is doing:

The best approach here is to ask Bazel to save the heap if it crashes, so it can be analyzed post-mortem: bazel --heap_dump_on_oom
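Note that --heap_dump_on_oom is a startup option, so it goes before the command. A sketch of the invocation (the exact name and location of the dump file depend on the Bazel version):

$ bazel --heap_dump_on_oom build //...
# On an OutOfMemoryError the server writes an .hprof heap dump, which can
# then be opened post-mortem in a heap analyzer.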

And lastly, use Bazel's profiling information:

  • bazel --profile=profile.gz --generate_json_trace_profile --noslim_profile

This contains structured information and is written continuously to disk, so if Bazel crashes we can still parse it; we just need to discard partially truncated events.

Expected Memory Consumption

As the two rules write their string allocations to output files we get a clear picture of the expected RAM usage (or at least a lower bound).

$ bazel clean
$ bazel build \
--aspects @example//memory:eat.bzl%traverse \
--output_groups=default,eat_memory \
//...
# Memory intensive tree traversal (in KB)
$ find bazel-out/ -name '*.tree' | xargs du | cut -f1 | paste -sd '+' | bc
78504
# CPU intensive spinlocks (in KB)
$ find bazel-out/ -name '*.spinlock' | xargs du | cut -f1 | paste -sd '+' | bc
3400

Here is a table with the data:

| | Memory for each target | Total |
| ---- | ---- | ---- |
| Memory intensive | 0-17 MB | 79 MB |
| CPU intensive | 136 KB | 3.4 MB |

Reported Memory Consumption

Next, we check with the diagnostic tools.

$ bazel version
Bazelisk version: development
Build label: 6.4.0

Bazel dump --rules

$ bazel $STARTUP_FLAGS --host_jvm_args=-Xmx"10g" dump --rules
Warning: this information is intended for consumption by developers
only, and may change at any time. Script against it at your own risk!

RULE COUNT ACTIONS BYTES EACH
cc_library 4 17 524,320 131,080
native_binary 1 4 524,288 524,288
cc_binary 6 54 262,176 43,696
toolchain_type 14 0 0 0
toolchain 74 0 0 0
...

ASPECT COUNT ACTIONS BYTES EACH
traverse 85 81 262,432 3,087
spinlock14 35 66 524,112 14,974
spinlock15 35 66 0 0
...

First, there are some common rules that we do not care about here; then we have the aspects. traverse is the memory-intensive aspect, which is applied on the command line, and spinlock<N> are the CPU-intensive rules, with identical implementations that are just numbered (there are 25 of them).

It is a little surprising that only one of them has allocations. The action count for each aspect does not make sense either, as this is not a transitive aspect; it just runs a single action each time the rule is instantiated. The hypothesis is that this is a display problem, caused by code shared between the rules. There are 25 rules with 25 distinct implementation functions, but they in turn call the same function that registers the action. So the "count" and "actions" columns are glued together, while the "bytes" are reported for just one of the rules (it would be bad if this were double-counted).
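A hedged sketch of the code layout we suspect; the names and the body of the helper are made up for illustration, not taken from the repository:

def _run_spin(ctx):
    # Shared helper that registers the action; every numbered rule ends up here.
    out = ctx.actions.declare_file(ctx.label.name + ".spinlock")
    ctx.actions.run_shell(
        outputs = [out],
        # Placeholder command; the real rules do CPU-intensive work here.
        command = "echo spinning > {}".format(out.path),
    )
    return [DefaultInfo(files = depset([out]))]

def _spinlock14_impl(ctx):
    return _run_spin(ctx)

def _spinlock15_impl(ctx):
    return _run_spin(ctx)

spinlock14 = rule(implementation = _spinlock14_impl)
spinlock15 = rule(implementation = _spinlock15_impl)

If the memory accounting is keyed on the shared helper rather than on each numbered wrapper, the allocations would show up under a single name, which would explain the table above.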

Either way, the total number of bytes does not add up to what we expect. Compare the output to the lower bound determined before:

| | Memory for each target | Total | Reported Total |
| ---- | ---- | ---- | ---- |
| Memory intensive | 0-17 MB | 79 MB | 262 kB |
| CPU intensive | 136 KB | 3.4 MB | 524 kB |

Skylark Memory Profile

Note: This is not part of the video.

The Skylark memory profiler is much more advanced, and its data can be dumped after a successful build.

$ bazel $STARTUP_FLAGS --host_jvm_args=-Xmx"$mem" dump \
--skylark_memory="$dir/memory.pprof"
$ pprof manual/2023-10-30/10g-2/memory.pprof
Main binary filename not available.
Type: memory
Time: Oct 30, 2023 at 12:16pm (CET)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 2816.70kB, 73.34% of 3840.68kB total
Showing top 10 nodes out of 19
flat flat% sum% cum cum%
512kB 13.33% 13.33% 512kB 13.33% impl2
256.16kB 6.67% 20.00% 256.16kB 6.67% traverse_impl
256.11kB 6.67% 26.67% 256.11kB 6.67% _add_linker_artifacts_output_groups
256.09kB 6.67% 33.34% 256.09kB 6.67% alias
256.09kB 6.67% 40.00% 256.09kB 6.67% rule
256.08kB 6.67% 46.67% 256.08kB 6.67% to_list
256.06kB 6.67% 53.34% 256.06kB 6.67% impl7
256.04kB 6.67% 60.01% 256.04kB 6.67% _is_stamping_enabled
256.04kB 6.67% 66.67% 256.04kB 6.67% impl18
256.03kB 6.67% 73.34% 768.15kB 20.00% cc_binary_impl

Here the memory-intensive aspect shows up with 256 kB, which is in line with the output from bazel dump --rules, but does not reflect the big allocations we know it makes.

Eclipse Memory Analyzer

The final tool we have investigated is the Java heap analysis tool Eclipse Memory Analyzer, which can easily be used with Bazel's --heap_dump_on_oom flag. On the other hand, it is a little trickier to find a heap dump from a successful build.

(Figure: eclipse-analysis)

Here we see the (very) big allocation clear as day, but have no information about its provenance.

We have not found a way to track this back to a Starlark function, Skyframe evaluator, or anything else that could be cross-referenced with the profiling information.

Build Time

The next section of the talk shows the execution time of the build with varying memory limits.

(Figure: combined)

This is benchmarked with 5 data points for each memory limit, and the plot shows a failure if there was at least one crash among the data points. There is a region where the build succeeds more and more often but sometimes crashes, so the crash and no-crash graphs overlap a little; you want some leeway there to avoid flaky builds from occasional out-of-memory crashes.

We see that the Skymeld graph requires a lot less memory than a regular build. That is because our big allocations are all tied to Action objects: enabling Skymeld lets Bazel start executing actions as soon as they are ready, so the resident set of Action objects does not grow as large and the allocations can be freed much sooner.
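For reference, Skymeld was still an opt-in experiment in the Bazel versions used here; to the best of our knowledge it is enabled with the --experimental_merged_skyframe_analysis_execution flag, along these lines:

$ bazel build --experimental_merged_skyframe_analysis_execution //...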

Pessimization with limited memory

(Figure: pessimization)

We saw a hump in the build time for the Skymeld graph: the builds did succeed in the 300-400 MB range, but the build speed gradually increased, reaching a plateau at around 500 MB. This is a pattern we have seen before, where more RAM, or more efficient rules, can improve build performance.

This is probably because the memory pressure and the Java garbage collector interfere with the Skyframe work. See Benjamin Peterson's great talk about Skyframe for more information.

Future work

(Figure: example profile)

This section details future work on more tools and signals that we can extract from Bazel's profile information (--profile=profile.gz --generate_json_trace_profile --noslim_profile). Since it is written in the standard chrome://tracing format, it is easy to parse for both successful and failed builds.

This contains events for the garbage collector, and all executed Starlark functions.

These can be correlated to find which functions are active during, or before, garbage collection events. Additionally, one could collect this information for all failed builds, and see if some functions are overrepresented among the last active functions for each evaluator in the build.
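As a starting point, here is a hedged Python sketch of that kind of post-processing. It assumes only the standard chrome://tracing layout (a traceEvents array of events with name, ph, ts and dur fields); the filter string for garbage-collection events is an assumption that may need adjusting between Bazel versions.

import gzip
import json

def load_events(path):
    """Load trace events from a (possibly truncated) Bazel JSON profile."""
    with gzip.open(path, "rt") as f:
        raw = f.read()
    try:
        return json.loads(raw)["traceEvents"]
    except json.JSONDecodeError:
        # The build crashed mid-write: cut back to the last complete event
        # and close the array and object by hand before parsing again.
        cut = raw.rfind("},")
        if cut == -1:
            return []
        return json.loads(raw[:cut + 1] + "]}")["traceEvents"]

def spans_active_during(events, gc_filter="gc"):
    """Return names of complete-event spans that overlap a GC event."""
    gc_events = [e for e in events
                 if gc_filter in e.get("name", "").lower() and "ts" in e]
    spans = [e for e in events
             if e.get("ph") == "X" and "dur" in e
             and gc_filter not in e.get("name", "").lower()]
    active = set()
    for g in gc_events:
        g_start = g["ts"]
        g_end = g_start + g.get("dur", 0)
        for s in spans:
            if s["ts"] <= g_end and s["ts"] + s["dur"] >= g_start:
                active.add(s["name"])
    return active

if __name__ == "__main__":
    events = load_events("profile.gz")
    for name in sorted(spans_active_during(events)):
        print(name)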

BazelCon 2023

· One min read
Fredrik Medley

Meroton visited BazelCon 2023 in Munich October 24-25, 2023. During the conference, we held three talks:

Other talks that mentioned Buildbarn were:

We are thankful for all amazing chats with the community and are looking forward to BazelCon 2024.

Bazel 6 Errors when using Build without the Bytes

· 4 min read
Benjamin Ingberg

**UPDATE:** Bazel has a workaround for this issue that prevents the permanent build-failure loop from 6.1.0, and a proper fix with the introduction of --experimental_remote_cache_ttl in Bazel 7.


Starting from v6.0.0, Bazel crashes when building without the bytes, because it sets --experimental_action_cache_store_output_metadata when using --remote_download_minimal or --remote_download_toplevel.

Effectively this leads to Bazel getting stuck in a build failure loop when your remote cache evicts an item you need from the cache.

developer@machine:~$ bazel test @abseil-hello//:hello_test --remote_download_minimal

[0 / 6] [Prepa] BazelWorkspaceStatusAction stable-status.txt
ERROR: /home/developer/.cache/bazel/_bazel_developer/139b99b96c4ab6cba51221931a36e346/external/abseil-hello/BUILD.bazel:26:8: Linking external/abseil-hello/hello_test failed: (Exit 34): 42 errors during bulk transfer:
java.io.FileNotFoundException: /home/developer/.cache/bazel/_bazel_developer/139b99b96c4ab6cba51221931a36e346/execroot/cache_test/bazel-out/k8-fastbuild/bin/external/com_google_absl/absl/base/_objs/base/spinlock.pic.o (No such file or directory)
...
Target @abseil-hello//:hello_test failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 3.820s, Critical Path: 0.88s
INFO: 5 processes: 4 internal, 1 remote.
FAILED: Build did NOT complete successfully
@abseil-hello//:hello_test FAILED TO BUILD

The key here is (Exit 34): xx errors during bulk transfer. 34 is Bazel's error code for Remote Error.

The recommended solution is to set the flag explicitly to false, with --experimental_action_cache_store_output_metadata=false. To quickly solve the issue on your local machine you can run bazel clean. However, this will just push the error into the future.
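For example, the flag can be pinned in a .bazelrc next to the download-mode flag (a sketch; adapt to your own setup):

# .bazelrc
build --remote_download_minimal
build --experimental_action_cache_store_output_metadata=false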

The bug is independent of which remote cache system you use and is tracked at GitHub.

Background

When performing an analysis of what to build, Bazel will ask the remote cache which items have already been built. Bazel will only schedule build actions for items that do not already exist in the cache. If running a build without the bytes¹, the intermediary results will not be downloaded to the client.

Should the cached items be evicted, Bazel will run into an unrecoverable error. It wants the remote system to perform an action using inputs from the cache, but they have disappeared, and Bazel cannot upload them, as they were never downloaded to the client. The build will then dutifully crash (some work has been put into trying to resolve this on the Bazel side, but it has not been considered a priority).

This puts an implicit requirement on the remote cache implementation. Artifacts need to be saved for as long as Bazel needs them. The problem here is that this is an undefined period of time. Bazel will not proactively check if the item still exists, nor in any other manner inform the cache that it will need the item in the future.

Before v6.0.0

Bazel tied the lifetime of its record of which items already exist in the cache (the existence cache) to the analysis cache. Whenever the analysis cache was purged it would also drop the existence cache.

The analysis cache is purged quite frequently, so in practice it was rare for the existence cache to be out of date. Furthermore, since the existence cache was an in-memory cache, Bazel crashing would forcefully evict it, thereby fixing the issue.

After v6.0.0

With the --experimental_action_cache_store_output_metadata flag enabled by default the existence cache is instead committed to disk and never dropped during normal operation.

This means two things:

  1. The implied requirement on the remote cache is effectively infinite.
  2. Should this requirement not be met the build will fail. And since the existence cache is committed to disk Bazel will just fail again the next time you run it.

Currently the only user-facing way of purging the existence cache is to run bazel clean, which is generally considered an anti-pattern.

If you are using bb-clientd's --remote_output_service to run builds without the bytes (an alternative strategy to --remote_download_minimal), this will not affect you.

Footnotes

  1. When using Bazel with remote execution, builds are run in a remote server cluster. There is therefore no need for each developer to download the intermediate results of the build. Bazel calls this feature Remote Builds Without the Bytes. The progress of the feature can be tracked at GitHub.

Tips, Tricks & Non-Deterministic Builds

· 2 min read
Benjamin Ingberg

When you have a remote build and cache cluster it can sometimes be hard to track down what exactly is using all of your build resources. To help with this we have started a tips and tricks section in the documentation, where we will share methods we use to debug and resolve slow builds.

The first section is about build non-determinism. Ideally your build actions should produce the same output when run with the same input; in practice this is sometimes not the case. If you are lucky, a non-deterministic action won't even be noticed: as long as its inputs are unchanged it won't be rebuilt.

If you're not so lucky, the non-determinism stems from a bug in the implementation and you should definitely pay attention to it. But how do you know which actions, if any, are non-deterministic?

This is not trivial, but we have added a server-side feature which allows detection of non-determinism with virtually no effort on your part.

Once activated, it reruns a configured fraction of your actions and automatically flags them if they produce different outputs. The scheduling is done outside of your Bazel invocation, so your build throughput will be unaffected, at the cost of an increase in the resources consumed. We suggest 1%, which only increases your resource use by a trivial amount, but you could of course set it to 100%, which would double the cost of your builds.