
Updates to Buildbarn deployment repo as of February 2023

· 4 min read
Benjamin Ingberg

The example configuration project for Buildbarn, bb-deployments, has received updates.

This is a continuation of last year's update article and is a high-level summary of what has happened from April 2022 up to 2023-02-16.

Let ReferenceExpandingBlobAccess support GCS

ReferenceExpandingBlobAccess already supported S3, so support has now been extended to Google Cloud Storage buckets.

Support for prefetching Virtual Filesystems

Running workers with FUSE allows inputs for an action to be downloaded on demand. This significantly reduces the amount of data sent in order to run overspecified actions. However, it leads to poor performance for actions that read a lot of their inputs synchronously.

With the prefetcher most of these actions can be recognized and data which is likely to be needed can be downloaded ahead of time.

Support for sha256tree

Buildbarn has added support for sha256tree which uses sha256 hashing over a tree structure similar to blake3.

This algorithm allows large CAS objects to be chunked and decomposed with guaranteed data integrity while still using sha256 hardware instructions.

Completeness checking now streams REv2 Tree objects

This introduces a small change to the configuration schema. If you previously had this:

backend: { completenessChecking: ... },

You will now need to write something along these lines:

backend: {
  completenessChecking: {
    backend: ...,
    maximumTotalTreeSizeBytes: 64 * 1024 * 1024,
  },
},

See also the bb-storage commit 1b84fa8.

Postponed healthy service status

The healthy and serving status, i.e. HTTP /-/healthy and grpc_health_v1.HealthCheckResponse_SERVING, are now postponed until the whole service is up and running. Before, the healthy status was potentially reported before starting to listen to the gRPC ports. Kubernetes will now wait until the service is up before forwarding connections to it.

Server keepalive parameter options

The option buildbarn.configuration.grpc.ServerConfiguration.keepalive_parameters can be used for L4 load balancing, to control when to ask clients to reconnect. For default values, see keepalive.ServerParameters.
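As an illustrative Jsonnet sketch (the field names mirror keepalive.ServerParameters and are assumptions here; verify them against the grpc.proto of your bb-storage version), a server could ask clients to reconnect every five minutes so that an L4 load balancer gets a chance to rebalance connections:

```jsonnet
grpcServers: [{
  listenAddresses: [':8980'],
  keepaliveParameters: {
    // Assumed field names: ask clients to reconnect after at most
    // five minutes, with a 30 second grace period to finish calls.
    maxConnectionAge: '300s',
    maxConnectionAgeGrace: '30s',
  },
}],
```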

Graceful termination of LocalBlobAccess

When SIGTERM or SIGINT is received, LocalBlobAccess now synchronizes data to disk before shutting down. Deployments using persistent storage will no longer observe loss of data when restarting the bb_storage services.

Non-sector Aligned Writes to Block Device

Using sector aligned storage is wasteful for the action cache where the messages are typically very small. Buildbarn can now fill all the gaps when writing, making storage more efficient.

DAG Shaped BlobAccess Configuration

Instead of a tree shaped BlobAccess configuration, the with_labels notation allows a directed acyclic graph. See also the bb-storage commit cc295ad.

NFSv4 as worker filesystem

The bb_worker can now supply the working directory for bb_runner using NFSv4. Previously, FUSE and hard linking files from the worker cache were the only two options. This addition was mainly done to overcome the poor FUSE support on macOS.

The NFSv4 server in bb_worker only supports macOS at the moment. No effort has been spent to write custom mount logic for other systems yet.

Specify forwardMetadata with a JMESPath

Metadata forwarding is now more flexible: JMESPath expressions can, for example, add authorization result data. The format is described in grpc.proto.

A common use case is to replace

{
  forwardMetadata: ["build.bazel.remote.execution.v2.requestmetadata-bin"],
}

with

{
  addMetadataJmespathExpression: '{
    "build.bazel.remote.execution.v2.requestmetadata-bin":
      incomingGRPCMetadata."build.bazel.remote.execution.v2.requestmetadata-bin"
  }',
}

Tracing: Deprecate the Jaeger collector span exporter

This option is deprecated, as Jaeger 1.35 and later provide native support for the OpenTelemetry protocol.

bb-deployments Ubuntu 22.04 Example Runner Image

The rbe_autoconfig in bazel-toolchains has been deprecated. In bb-deployments it has been replaced by the Act image ghcr.io/catthehacker/ubuntu:act-22.04, distributed by catthehacker, which is used for running GitHub Actions locally under Ubuntu 22.04.

bb-deployments Integration Tests

The bare deployment and the Docker Compose deployment now have test scripts that build and test @abseil-hello//:hello_test remotely, shut down, and then check for a 100% cache hit rate after restart. Another CI test checks for minimal differences between the Docker Compose deployment and the Kubernetes deployment.

If there are any other changes you feel deserve a mention, feel free to submit a pull request on GitHub using the link below.

Bazel 6 Errors when using Build without the Bytes

· 4 min read
Benjamin Ingberg

**UPDATE:** Bazel has a workaround for this issue that prevents the permanent build failure loop from 6.1.0, and a proper fix with the introduction of --experimental_remote_cache_ttl in Bazel 7.

Starting from v6.0.0, Bazel crashes when building without the bytes, because it sets --experimental_action_cache_store_output_metadata when using --remote_download_minimal or --remote_download_toplevel.

Effectively this leads to Bazel getting stuck in a build failure loop when your remote cache evicts an item you need from the cache.

developer@machine:~$ bazel test @abseil-hello//:hello_test --remote_download_minimal

[0 / 6] [Prepa] BazelWorkspaceStatusAction stable-status.txt
ERROR: /home/developer/.cache/bazel/_bazel_developer/139b99b96c4ab6cba51221931a36e346/external/abseil-hello/BUILD.bazel:26:8: Linking external/abseil-hello/hello_test failed: (Exit 34): 42 errors during bulk transfer:
java.io.FileNotFoundException: /home/developer/.cache/bazel/_bazel_developer/139b99b96c4ab6cba51221931a36e346/execroot/cache_test/bazel-out/k8-fastbuild/bin/external/com_google_absl/absl/base/_objs/base/spinlock.pic.o (No such file or directory)
...
Target @abseil-hello//:hello_test failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 3.820s, Critical Path: 0.88s
INFO: 5 processes: 4 internal, 1 remote.
FAILED: Build did NOT complete successfully
@abseil-hello//:hello_test FAILED TO BUILD

The key here is (Exit 34): xx errors during bulk transfer. 34 is Bazel's error code for Remote Error.

The recommended solution is to set the flag explicitly to false with --experimental_action_cache_store_output_metadata=false. To quickly resolve the issue on your local machine you can run bazel clean; however, this will just push the error into the future.
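To apply the workaround to every invocation, a sketch of a .bazelrc entry (assuming Bazel 6.0.x; test inherits build options, so one line suffices):

```shell
# .bazelrc: keep the existence cache in memory only, avoiding the
# permanent failure loop after remote cache eviction.
build --experimental_action_cache_store_output_metadata=false
```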

The bug is independent of which remote cache system you use and is tracked at GitHub.

Background

When performing an analysis of what to build, Bazel will ask the remote cache which items have already been built. Bazel will only schedule build actions for items that do not already exist in the cache. If running a build without the bytes[1], the intermediary results will not be downloaded to the client.

Should the cached items be evicted, Bazel will run into an unrecoverable error. It wants the remote system to perform an action using inputs from the cache, but they have disappeared, and Bazel cannot upload them, as they were never downloaded to the client. The build then dutifully crashes (some work has been put into resolving this on the Bazel side, but it has not been considered a priority).

This puts an implicit requirement on the remote cache implementation. Artifacts need to be saved for as long as Bazel needs them. The problem here is that this is an undefined period of time. Bazel will not proactively check if the item still exists, nor in any other manner inform the cache that it will need the item in the future.

Before v6.0.0

Bazel tied the lifetime of its knowledge of which items already exist in the cache (the existence cache) to the analysis cache. Whenever the analysis cache was purged, the existence cache was dropped as well.

The analysis cache is purged quite frequently, so in practice it was rare for the existence cache to be out of date. Furthermore, since the existence cache was held in memory, a Bazel crash would forcefully evict it, thereby fixing the issue.

After v6.0.0

With the --experimental_action_cache_store_output_metadata flag enabled by default the existence cache is instead committed to disk and never dropped during normal operation.

This means two things:

  1. The implied requirement on the remote cache is effectively infinite.
  2. Should this requirement not be met the build will fail. And since the existence cache is committed to disk Bazel will just fail again the next time you run it.

Currently the only user-facing way of purging the existence cache is to run bazel clean, which is generally considered an anti-pattern.

If you are using the bb-clientd --remote_output_service to run builds without the bytes (an alternative strategy to --remote_download_minimal), this does not affect you.

Footnotes

  1. When using Bazel with remote execution, remote builds are run in a remote server cluster. There is therefore no need for each developer to download the partial results of the build. Bazel calls this feature Remote Builds Without the Bytes. The progress of the feature can be tracked on GitHub.

BuildBar at the Meroton Office

· One min read
Benjamin Ingberg

After BazelCon we've all gotten a bit giddy, and you might be excited about how the information presented there might impact your development workflow. For that purpose, we're inviting everyone in the wider Bazel community to an open BuildBar at the Meroton offices.

Come over and digest BazelCon with high level technical discussions of the talks, great company, a pleasant atmosphere and also beer.

Feel welcome to come over this Thursday, the first of December, from 16:00 to 20:00.

Directions

You'll find us at our Linköping offices at Fridtunagatan 33. Currently there is some construction but follow the red lines and you'll be fine.

Fridtunagatan 33 Linköping

Remote Executors for the Free Environment

· 2 min read
Benjamin Ingberg

We've performed some updates to the free tier and our pricing model.

Shared Cache

The free tier has been upgraded to an environment with a 1 TB cache. This means you can use the free tier without worrying about hitting limits or going through a setup process.

Do note that while cached items are only accessible with the correct API keys, the storage area can be reclaimed by anyone; i.e., as the cache becomes full, your items might be dropped.

With the current churn we expect items in the cache to last at least a week. However, you should always treat items in the cache as something that might be dropped at any moment. The purpose and design of the cache is to maximize build performance, not to provide storage.

Shared executors

New for the free environment is the introduction of remote execution, which was previously only available to paying customers.

If you're using Bazel, you can simply add the --remote_executor=<...> flag and your builds will be executed remotely.
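As a hedged sketch, the flag can also live in a .bazelrc entry; the endpoint below is a placeholder, so substitute the address for your environment:

```shell
# remote.example.com is a hypothetical endpoint; use your own.
build --remote_executor=grpcs://remote.example.com
```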

The free environment has access to 64 shared executors.

Starter tier

If you need dedicated storage for evaluation purposes we also offer a starter tier. The starter tier is a streamlined environment with 100 GB of dedicated cache and access to the same 64 executors as the free tier.

This allows you to try out remote execution and caching without being disturbed by others. While the 100 GB dedicated cache will only be used by you, and therefore not be overwritten by anyone else's builds, it is still a cache: the system will drop entries if it determines doing so is useful for maximizing performance.

Need more executors?

At the moment the starter tier has a fixed number of shared executors. If you need a scalable solution, contact us to set up a custom Buildbarn environment of any size.

Tips, Tricks & Non-Deterministic Builds

· 2 min read
Benjamin Ingberg

When you have a remote build and cache cluster, it can sometimes be hard to track down what exactly is using all of your build resources. To help with this we have started a tips and tricks section in the documentation, where we share methods we use to debug and resolve slow builds.

The first section is about build non-determinism. Ideally, your build actions should produce the same output when run with the same input; in practice this is sometimes not the case. If you are lucky, a non-deterministic action won't even be noticed: since its inputs are unchanged, it won't be rebuilt.

If you're not so lucky, the non-determinism stems from a bug in the implementation, and you should definitely pay attention to it. But how do you know which actions, if any, are non-deterministic?

This is not trivial but we have added a server side feature which allows detection of non-determinism with virtually no effort on your part.

Once activated, it reruns a configured fraction of your actions and automatically flags those that produce different outputs. The scheduling is done outside of your Bazel invocation, so your build throughput will be unaffected at the cost of an increase in resources consumed. We suggest 1%, which increases resource use by only a trivial amount, but you could of course set it to 100%, which would double the cost of your builds.

Updates to Buildbarn deployment repo as of April 2022

· 2 min read
Benjamin Ingberg

The sample configuration project for Buildbarn was recently updated after a long hiatus. To help people understand which changes have been made, see the following high-level summary.

April 2022 Updates

This includes updates to Buildbarn since December 2020.

Authorizer Overhaul

Authorizers have been overhauled to be more flexible; they are now part of each individual cache and execution configuration.

Using a JWT authorization bearer token has been added as an authorization method.

Hierarchical Blob Access

Using hierarchical blob access allows blobs in instance name foo/bar to be accessed from instance foo/bar/baz but not instance foo or foo/qux.

Action Result Expiration

An expiry can be added to action results, which lets the action cache purge the result of an execution that was performed too far in the past. This can be used to ensure that all targets are rebuilt periodically, even if they are accessed frequently enough to not normally be purged from the cache.

Read Only Cache Replicas

Cache read traffic can now be sent to a read-only replica which is periodically probed for availability.

Concurrency Limiting Blob Replication

Limit the number of concurrent replications to prevent network starvation.

Run Commands as Another User

Allows commands to be run as a different user; on most platforms this means the bb-runner instance must run as root.

Size Class Analysis

Allows executors of different size classes to be used. The scheduler will attempt to utilize executors efficiently, but there is an inherent tradeoff between throughput and latency. Once configured, the scheduler will automatically keep track of which actions are best run on which executors.

Execution Routing Policy

The scheduler accepts an execution routing policy configuration that allows it to determine how to defer builds to different executors.

If you see any other changes you feel should get a mention, feel free to submit a pull request on GitHub using the link below.

Purpose of the Articles

· One min read
Benjamin Ingberg

The purpose of these articles is to have a freeform area for discussing ideas, technical issues, solutions, and news in an in-depth, relaxed manner. They are not meant to serve as reference material; structured reference material should be available in the documentation section.

The article format allows more in-depth discussion of recurring subjects. In contrast to the documentation, published articles aren't changed; if a subject requires a revisit in the future, we publish a new post and add references to the old one.

If you see any errors, feel free to submit a pull request on GitHub using the link below.