Programming and statistics are so important now that they’re essentially part of basic literacy. As the gap widens between superficial familiarity and deep understanding of complex fields like AI and quantum computing, so too does the opportunity for manipulation. Without a grounding in these disciplines, people become easy targets for misleading claims, whether it’s in AI ethics debates, government policy on emerging tech, or even the question of fair use for copyrighted materials in AI training.
Here’s the kicker. If the potential for manipulation is already growing, where’s the line between bias and outright propaganda? When do marketing claims and selective benchmarks cross the threshold from creative spin to deliberate deception?
This post dives into those questions. It unpacks the marketing phenomenon of “benchmarketing” and the tactics that twist data into narratives. But it doesn’t just point fingers. The goal is to give you the tools to recognize and challenge benchmarketing tactics, understand their impact on both the open source and commercial landscape, and sharpen your ability to spot misleading claims. More importantly, it aims to show practitioners how to design transparent, ethical benchmarks that inform and educate rather than mislead.
The Texas Sharpshooter Fallacy
The Texas Sharpshooter fallacy originates from a joke about a Texan who “indiscriminately shoots at the side of a barn and then paints a target around the tightest cluster of hits to proclaim his marksmanship”. This logical fallacy occurs when someone selectively focuses on a subset of data from a larger set, ignoring the rest, to support a specific conclusion.1

Within the realm of product marketing, particularly in the startup scene, the Texas Sharpshooter fallacy isn’t always a simple mistake or statistical error. It’s sometimes an embraced strategy. Marketing narratives are typically designed with a specific aim from the start, like claiming “our product is faster and more reliable than yours.” In less-principled marketing teams, employees and consultants are then directed to find data to support the narrative, rather than the other way around. Any data uncovered that doesn’t fit the narrative is conveniently ignored.
When selective attention to data is deliberate, with the aim of shaping public perception or outmaneuvering competitors, it crosses an ethical boundary. When the end goal is to have a marketed opinion accepted as an established fact, often to the detriment of competing products, it’s propaganda. Let’s start to call this out for what it is.2
When this selective focus on data becomes a deliberate marketing strategy, it gives rise to a practice that has quietly grown in the tech world: benchmarketing. Unlike simple analytical mistakes or honest misinterpretations, benchmarketing involves intentionally constructing a narrative using cherry-picked performance data to support a pre-baked conclusion. This is not careless statistics. It’s a calculated blend of technical spin and corporate storytelling designed to influence public perception and dominate competitor narratives.
Benchmarketing is especially insidious because it masquerades as objective, data-driven insight. But make no mistake. When the real goal is to craft a marketable story rather than provide a balanced view of product performance, the ethics are deeply questionable. As we’ll see, this practice has serious implications for both the integrity of technical marketing and the health of open source ecosystems.
What’s Benchmarketing?
So, what is benchmarketing, exactly? Let’s establish a working definition for the purpose of this essay, although you’re free to disagree with any of the nuances below.
- Benchmarketing is the marketing practice of highlighting specific performance benchmarks to showcase a product favorably, often by focusing excessively on positive data points while ignoring or omitting less favorable ones.
- Benchmarketing often pits one project against another to amplify the distorted presentation of data, for instance by over-indexing on a single feature to make the entire project look superior to a competing one.
- Benchmarketing also tends to conflate specific measurements like latency and throughput with the broader concept of performance, by conveniently omitting important tradeoffs, or even worse, gaming the measurements.
A hypothetical example of benchmarketing might look like this. A database vendor optimizes a very specific query type, like descending-order SELECTs, achieving a 500 percent performance gain with extra tuning compared to a competitor’s default configuration. The marketing team then proclaims, “Our database is 500 percent faster than the competition!”, with fine print noting this applies only to descending queries under a narrowly defined workload. This tactic becomes even more dubious when the engineering team is explicitly instructed to implement the micro-optimization to support the marketing story. The end result is just enough technical truth to provide plausible deniability while promoting a misleading claim that sounds far more impressive than it is.
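To make the arithmetic concrete, here’s a minimal sketch in Python of how that kind of headline gets built. Every product name, query type, and number below is invented for illustration; the only point is how reporting a single best-case measurement hides the rest of the workload.

```python
# All names and numbers are hypothetical. "ours_tuned" is our product with
# extra tuning; "theirs_default" is the competitor's out-of-the-box setup,
# which is already an apples-to-oranges comparison.
results = {
    # query type:        (ours_tuned, theirs_default) in queries/sec
    "descending_select": (6000, 1000),
    "ascending_select":  (1100, 1050),
    "aggregation":       (900, 1200),
    "join_heavy":        (400, 800),
}

def speedup_pct(ours: float, theirs: float) -> float:
    """Percent improvement of 'ours' over 'theirs'."""
    return (ours / theirs - 1) * 100

# The benchmarketing headline: report only the single best-case query type.
best = max(results, key=lambda q: speedup_pct(*results[q]))
print(f"Headline: {speedup_pct(*results[best]):.0f}% faster! (fine print: {best} only)")

# The honest view: report every query type, including the ones we lose.
for query, (ours, theirs) in results.items():
    print(f"{query:20s} {speedup_pct(ours, theirs):+8.1f}%")
```

The headline is technically true, which is exactly what makes it useful as plausible deniability, but only the full table tells the real story.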
Let’s explore the origin of the term “benchmarketing,” trace its evolution, and discuss how to both prevent its production and build resilience against falling for it. Only then can we explore how to avoid these pitfalls and, even better, produce ethical benchmarks in our future projects.
The following example illustrates a case of benchmarketing in the wild, which also happens to be the origin of the term itself. It is important to note that a single instance of benchmarketing does not reflect the overall ethics of any individual or company involved, as long as it becomes a learning opportunity rather than a recurring practice. Providing specific examples is essential for understanding the nature of technical propaganda in today’s context and for encouraging companies to shift away from benchmarketing toward more educational and transparent content.
One of the clearest examples of benchmarketing’s impact, and where the term first gained traction, was the high-profile benchmarking battle between Databricks and Ververica over Apache Spark and Apache Flink performance. Let’s take a closer look at what happened, not so we can point fingers, but so we can learn from this mistake.
The Curious Case of the Broken Benchmark: Revisiting Apache Flink® vs. Databricks Runtime
In 2017, Databricks promoted a series of benchmarks showing a substantial performance advantage for Apache Spark, which sits at the heart of the Databricks Runtime, over Apache Flink and Apache Kafka Streams (KStreams). They published a report called Benchmarking Structured Streaming on Databricks Runtime Against State-of-the-Art Streaming Systems, which summarized the performance of various tests run using the Yahoo Streaming Benchmark, along with Databricks Notebooks.3
Their initial analysis showed that Spark achieved higher throughput than the other systems tested, with a final summary claiming that Spark reached over 4x the throughput of its competitors.
What’s the issue?
Under scrutiny, Databricks’ claims fell apart after technologists from Ververica found non-trivial issues with the analysis. They were the first to call the report purposefully unethical, contrary to the ethos of open source, and a distraction for FOSS because it diverted energy into refuting the report rather than enhancing the projects. Ververica’s rebuttal, The Curious Case of the Broken Benchmark: Revisiting Apache Flink® vs. Databricks Runtime, is where I first came across the term benchmarketing, after a colleague pointed me toward it.4
Ververica identified two main issues with the Databricks benchmark: a bug in the Flink data generator code that was actually written by Databricks themselves, and an incorrect Flink configuration setting regarding object reuse. After addressing these issues, Ververica found that Flink significantly outperformed Spark in throughput for these specific tests. This led to a much more transparent understanding of the benchmark’s results:
- Spark achieved throughput of 2.5 million records per second (in line with what Databricks reported in their post)
- Flink achieved throughput of 4 million records per second (significantly better performance than originally reported by Databricks)
The Ververica rebuttal underscores an important aspect of benchmarking in technology: the specificity and limitations of the benchmark itself are materially important and must be disclosed. A narrow benchmark, such as one focusing solely on word counting, with particular configurations and in a specific deployment environment, is unlikely to represent a range of real-world workloads.
This highlights the importance of context. All benchmarks tell a story, so it’s essential for those conducting benchmarks to be transparent about the story they aim to tell up-front. This should include the full scope and intention of their tests, such as specific workloads they plan to highlight. Without transparency, we quickly slide down the slippery slope of benchmarketing.
Benchmarketing is not Harmless
Why was this particular instance of benchmarketing so harmful to the open source community?
Apache Spark, Apache Flink, and Apache Kafka are all free and open source software (FOSS) projects, governed by the Apache Software Foundation and collaboratively developed by technologists who devote a great deal of time and effort to furthering them. When well-funded vendors conduct biased marketing activities disguised as unbiased research, it harms open source by forcing the FOSS developer community to spend time refuting benchmarks instead of innovating.
There comes a time in the life of every stream processing project when its contributors must decide, “Are we here to solve previously unsolvable production problems for our users, or are we here to write blog posts about benchmarks?” – Stephan Ewen 5
There are many other examples of benchmarketing in the wild, and it wouldn’t take long for you to find some on your own. Rather than run through the entire catalog of benchmarketing examples, let’s instead explore some suggestions on how to produce a benchmark ethically. There’s nothing inherently negative about benchmarks; they can be conducted without forcing developers to compromise their integrity or distracting the FOSS community into writing benchmarketing rebuttals instead of developing the wonderful software that we all benefit from.
How to Produce an Ethical Benchmark
Instead of creating new guidelines and recommendations from scratch, I’ll borrow liberally from Ververica’s rebuttal. I believe their insights are spot-on, so there’s no need to change what already works well. The following is a blend of Ververica’s original suggestions with my own thoughts sprinkled throughout.
1. Involve Neutral Third Parties and Open Reviews
Benchmarks should be carried out by neutral third parties, and all stakeholders should be offered a chance to review and respond.
Neutral third parties should craft and execute benchmarks on behalf of the community, and give all stakeholders, including open source communities, a voice in the review process. A neutral and inclusive approach to benchmarks helps mitigate bias, enhances the legitimacy of the results, and separates real benchmarking initiatives from propaganda.
For organizations with an Open Source Program Office (OSPO), tasking the OSPO with oversight of benchmarking initiatives driven by commercial open source software (COSS) vendors, with minimal corporate influence along the way, can further safeguard accuracy and ethical conduct. There’s no way to eliminate bias entirely, but there are many ways to reduce the opportunity for it.
2. Use Real Workloads and Acknowledge Gaps
Ensure benchmarks represent real workloads, and be honest about gaps.
To ensure benchmarks are meaningful and fair, they must accurately represent real-world workloads and acknowledge any limitations. Benchmarks that selectively focus on specific features of a project to support a premeditated narrative are sliding toward benchmarketing propaganda. This is especially true for complex systems like database management systems and streaming platforms, whose diverse capabilities benchmarks often fail to fully encompass. Claims like “project X is 150 percent faster than project Y” are unethical unless the projects are compared thoroughly across their features, which is rare given compatibility differences between projects and the sheer complexity of such a comprehensive comparison.
Transparently reporting on what exactly has been tested and acknowledging gaps is essential for benchmarks to be trustworthy and valuable.
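As a rough illustration of what that reporting could look like, here’s a minimal sketch of a disclosure published alongside the numbers. The structure and field names are my own, not any established standard, and the URL is a placeholder.

```python
# A hypothetical disclosure that travels with the results. The point is that
# scope, configuration, and known gaps are published next to the numbers,
# not buried in fine print.
benchmark_disclosure = {
    "workload": "word-count style streaming aggregation",
    "systems_under_test": {
        "system_a": {"version": "x.y.z", "config": "tuned (settings in repo)"},
        "system_b": {"version": "x.y.z", "config": "vendor defaults"},
    },
    "environment": {"nodes": 3, "instance_type": "<cloud instance>", "network": "<spec>"},
    "metrics_reported": ["throughput (records/sec)", "p99 latency (ms)"],
    "not_covered": [
        "failure recovery and exactly-once guarantees",
        "large keyed state and stateful joins",
        "mixed or bursty real-world workloads",
    ],
    "harness": "https://example.com/benchmark-harness",  # placeholder URL
}
```

A reader who sees the not_covered list can immediately judge how far the results generalize, which is exactly the context that a “project X is 150 percent faster” headline strips away.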
3. Keep Benchmarks Evolving and Iterative
Living benchmarks are better than static benchmarks.
Software changes frequently, and one-off benchmarks may only be accurate for a short period of time. In my opinion, ClickHouse has one of the best open source benchmark portals I’ve come across, ClickBench. It not only covers a wide array of configurations, hardware, and cloud variants, but also a ton of different query types. ClickBench lets users compare ClickHouse against a wide array of alternatives without a pre-canned narrative, putting the focus on the numbers.
Treating benchmarks as evolving, iterative projects instead of one-off static reports further reduces bias and maintains relevance. Bonus points for enabling developers to draw their own conclusions through tools like ClickBench.
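As a rough sketch of what “living” can mean in practice (the structure and field names below are mine, not ClickBench’s), each run records enough context to be repeated and compared as both the software and the harness evolve.

```python
import json
import platform
from datetime import datetime, timezone

def record_run(system: str, version: str, harness_rev: str, results: dict,
               path: str = "runs.jsonl") -> None:
    """Append one benchmark run with enough context to re-run and compare it later."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "system": system,             # which project was measured
        "version": version,           # software version under test
        "harness_rev": harness_rev,   # revision of the benchmark code itself
        "host": platform.platform(),  # coarse hardware/OS context
        "results": results,           # e.g. {"q1_records_per_sec": 2.5e6}
    }
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")

# Each new release of the software (or of the harness) appends another run,
# turning the benchmark into a time series rather than a one-off snapshot.
record_run("system_a", "1.2.3", "abc1234", {"q1_records_per_sec": 2.5e6})
```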
4. Open Methodology and Community Contributions
Publish a detailed methodology out in the open and accept contributions.
To further highlight ClickHouse, they open sourced the benchmarking framework I mentioned earlier, ClickBench. Beyond the tool itself, they’ve produced excellent learning materials that detail their testing methodology, covering key factors such as reproducibility, realism, and limitations. They also accept contributions to their test harnesses, which makes ClickBench a living, evolving benchmark rather than a static analysis.6
5. Let the Community Lead Rebuttals
Only refute benchmarks if requested by the open source community.
When a vendor releases a benchmarking report, other COSS vendors often feel a knee-jerk urge to respond forcefully. When that pressure seeps into FOSS development and maintenance efforts, it leads to a very distracting cycle of nonsense that adds no real value to any of the projects involved.
It’s critical to let the open source developers and maintainers of targeted projects decide whether or not to respond. Ververica’s approach was solid because, rather than responding right away, they only did so after being pressed by the Apache Flink and Apache Kafka communities. This showed a high level of respect for the people in the trenches.
To highlight the last point, below is a quote that offers more context around Ververica’s decision to respond. By holding off on a response until the community requested help, Ververica helped preserve the focus of contributors to Apache Spark, Apache Flink, and Apache Kafka. The surest way to ensure an ethical approach to benchmarks and rebuttals is for the open source community to drive such initiatives, not whichever COSS marketing team has the deepest pockets to counter benchmarketing with even more benchmarketing.
“We delayed investigating the benchmark results because we quite strongly believe that putting time and resources into an increasingly irrelevant benchmark is of no benefit to users. But we heard feedback from the community that input from a Flink perspective would be helpful, so we decided to take a closer look.” – Aljoscha Krettek, The Curious Case of the Broken Benchmark
By following general common sense guidelines around how to produce an ethical and valuable benchmark analysis, the community can focus on benchmark results that really matter: genuine explorations of new ideas and opportunities to improve the performance of free and open source software.
Summary
Increasingly, free and open source software (FOSS) projects are becoming targets of benchmarketing campaigns by commercial open source software (COSS) companies. In my opinion, this trend will only intensify as we enter a tougher macroeconomic climate. The industry should expect more benchmarketing as the battle for every dollar of revenue heats up. FOSS projects will remain easy targets for unethical COSS marketing teams focused on growing market share by any means necessary.
Benchmarketing arises when a report is designed to support a predetermined conclusion, leading to biased analysis. This type of “rational technical manipulation” is becoming more obvious and, over time, harms both those who produce it and those who rely on it. The key takeaway here is the importance of transparent, community-engaged benchmarking to foster trust and credibility.
It’s up to you to decide whether the Databricks analysis constituted benchmarketing or was simply an honest analytical mistake. However, such controversies are entirely avoidable by following the recommendations outlined above. Databricks has made significant contributions to the FOSS community through its technical innovations, making this an important case to learn from. Including the open source community in benchmarking analysis from the start, along with the other recommendations provided, can greatly reduce the risk of producing inaccurate or biased reports.