<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Google Data &#187; Google Testing</title>
	<atom:link href="/category/google-testing/feed/" rel="self" type="application/rss+xml" />
	<link>https://googledata.org</link>
	<description>Everything Google: News, Products, Services, Content, Culture</description>
	<lastBuildDate>Wed, 28 Dec 2016 21:09:26 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=4.1.13</generator>
	<item>
		<title>Announcing OSS-Fuzz: Continuous Fuzzing for Open Source Software</title>
		<link>https://googledata.org/google-testing/announcing-oss-fuzz-continuous-fuzzing-for-open-source-software/</link>
		<comments>https://googledata.org/google-testing/announcing-oss-fuzz-continuous-fuzzing-for-open-source-software/#comments</comments>
		<pubDate>Thu, 01 Dec 2016 17:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=2adc4c02abbc29fb54b13b6c200c7f69</guid>
		<description><![CDATA[By Mike Aizatsky, Kostya Serebryany (Software Engineers, Dynamic Tools); Oliver Chang, Abhishek Arya (Security Engineers, Google Chrome); and Meredith Whittaker (Open Research Lead).&#160;We are happy to announce OSS-Fuzz, a new Beta program developed ...]]></description>
				<content:encoded><![CDATA[<i>By Mike Aizatsky, Kostya Serebryany (Software Engineers, Dynamic Tools); Oliver Chang, Abhishek Arya (Security Engineers, Google Chrome); and Meredith Whittaker (Open Research Lead).&nbsp;</i><br /><br /><strong>We are happy to announce <a href="https://github.com/google/oss-fuzz">OSS-Fuzz</a>, a new Beta program developed over the past years with the <a href="https://www.coreinfrastructure.org/">Core Infrastructure Initiative community</a>. This program will provide continuous fuzzing for select core open source software.</strong><br /><strong><br /></strong>Open source software is the backbone of the many apps, sites, services, and networked things that make up "the internet." It is important that the open source foundation be stable, secure, and reliable, as cracks and weaknesses impact all who build on it. <br /><br /><a href="https://en.wikipedia.org/wiki/Heartbleed">Recent</a> <a href="https://en.wikipedia.org/wiki/Stagefright_(bug)">security</a> <a href="https://googlechromereleases.blogspot.com/2016/09/stable-channel-updates-for-chrome-os.html">stories</a> confirm that errors like <a href="https://en.wikipedia.org/wiki/Buffer_overflow">buffer overflow</a> and <a href="https://en.wikipedia.org/wiki/Dangling_pointer">use-after-free</a> can have serious, widespread consequences when they occur in critical open source software. These errors are not only serious, but notoriously difficult to find via routine code audits, even for experienced developers. That's where <a href="https://en.wikipedia.org/wiki/Fuzz_testing">fuzz testing</a> comes in. By generating random inputs to a given program, fuzzing triggers and helps uncover errors quickly and thoroughly. <br /><br />In recent years, several efficient general purpose fuzzing engines have been implemented (e.g. 
<a href="http://lcamtuf.coredump.cx/afl/">AFL</a> and <a href="http://libfuzzer.info/">libFuzzer</a>), and we use them to <a href="https://security.googleblog.com/2016/08/guided-in-process-fuzzing-of-chrome.html">fuzz various components of the Chrome browser</a>. These fuzzers, when combined with <a href="https://github.com/google/sanitizers">Sanitizers</a>, can help find security vulnerabilities (e.g. buffer overflows, use-after-free, bad casts, integer overflows, etc), stability bugs (e.g. null dereferences, memory leaks, out-of-memory, assertion failures, etc) and <a href="https://blog.fuzzing-project.org/31-Fuzzing-Math-miscalculations-in-OpenSSLs-BN_mod_exp-CVE-2015-3193.html">sometimes</a> even logical bugs. <br /><br />OSS-Fuzz's goal is to make common software infrastructure more secure and stable by combining modern fuzzing techniques with scalable distributed execution. OSS-Fuzz combines various fuzzing engines (initially, libFuzzer) with Sanitizers (initially, <a href="http://clang.llvm.org/docs/AddressSanitizer.html">AddressSanitizer</a>) and provides a massive distributed execution environment powered by <a href="https://github.com/google/oss-fuzz/blob/master/docs/clusterfuzz.md">ClusterFuzz</a>. <br /><h2>Early successes</h2>Our initial trials with OSS-Fuzz have had good results. An example is the <a href="https://www.freetype.org/">FreeType</a> library, which is used on over a <a href="https://en.wikipedia.org/wiki/FreeType#Users">billion devices</a> to display text (and which might even be rendering the characters you are reading now). It is important for FreeType to be stable and secure in an age when fonts are loaded over the Internet. Werner Lemberg, one of the FreeType developers, <a href="https://savannah.nongnu.org/search/?type_of_search=bugs&amp;words=LibFuzzer&amp;offset=0&amp;max_rows=100#results">was</a> an early adopter of OSS-Fuzz. 
Recently the <a href="http://git.savannah.gnu.org/cgit/freetype/freetype2.git/tree/src/tools/ftfuzzer/ftfuzzer.cc">FreeType fuzzer</a> found a <a href="https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=53">new heap buffer overflow</a> only a few hours after the source change: <br /><br /><pre class="prettyprint">ERROR: AddressSanitizer: heap-buffer-overflow on address 0x615000000ffa <br />READ of size 2 at 0x615000000ffa thread T0<br />SCARINESS: 24 (2-byte-read-heap-buffer-overflow-far-from-bounds)<br />   #0 0x885e06 in tt_face_vary_cvt src/truetype/ttgxvar.c:1556:31<br /></pre><br />OSS-Fuzz automatically <a href="https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=53">notified</a> the maintainer, who <a href="http://git.savannah.gnu.org/cgit/freetype/freetype2.git/commit/?id=7eeaf986b5ebb43cbbc8368c6e528ac311d87805">fixed</a> the bug; then OSS-Fuzz automatically <a href="https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=53#c3">confirmed</a> the fix. All in one day! You can see the <a href="https://bugs.chromium.org/p/oss-fuzz/issues/list?can=1&amp;q=type=Bug-Security,Bug%20-component:Infra%20status:Fixed,Verified&amp;sort=-id&amp;colspec=ID%20Type%20Component%20Status%20Library%20Reported%20Owner%20Summary">full list</a> of fixed and disclosed bugs found by OSS-Fuzz so far. <br /><h2>Contributions and feedback are welcome</h2>OSS-Fuzz has already found <strong><a href="https://bugs.chromium.org/p/oss-fuzz/issues/list?can=1&amp;q=-component%3AInfra+-status%3ADuplicate%2CWontFix&amp;sort=-id&amp;colspec=ID+Type+Component+Status+Library+Reported+Owner+Summary&amp;cells=ids">150 bugs</a></strong> in several widely used open source <a href="https://github.com/google/oss-fuzz/tree/master/projects">projects</a> (and churns <strong>~4 trillion test cases</strong> a week). 
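At its core, the loop that fuzzing engines automate is simple: generate a random input, run the target on it, and record anything that crashes. This toy Python sketch (purely illustrative; not OSS-Fuzz or libFuzzer code, and the parser and its bug are invented) shows how random inputs flush out an edge case a fixed test suite could easily miss:

```python
import random

def parse_header(data: bytes) -> int:
    """Toy parser with a planted bug: it mishandles empty input."""
    if len(data) >= 2 and data[0] == 0xFF:
        return data[1]
    return data[0]  # bug: raises IndexError when data is empty

def fuzz(target, runs=10_000, seed=0):
    """Feed short random byte strings to `target`; collect crashing inputs."""
    rng = random.Random(seed)
    crashes = []
    for _ in range(runs):
        data = bytes(rng.randrange(256) for _ in range(rng.randrange(8)))
        try:
            target(data)
        except Exception as exc:
            crashes.append((data, exc))
    return crashes

crashes = fuzz(parse_header)
print(f"found {len(crashes)} crashing inputs")  # the empty input triggers IndexError
```

Real engines like libFuzzer are far smarter: they are coverage-guided, mutating inputs that reach new code paths, and the sanitizers catch memory errors that would not raise a clean exception.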
With your help, we can make fuzzing a standard part of open source development, and work with the broader community of developers and security testers to ensure that bugs in critical open source applications, libraries, and APIs are discovered and fixed. We believe that this approach to automated security testing will result in real improvements to the security and stability of open source software. <br /><br />OSS-Fuzz is launching in Beta right now, and will be accepting suggestions for candidate open source projects. In order for a project to be accepted to OSS-Fuzz, it needs to have a large user base and/or be critical to global IT infrastructure, a general heuristic that we are intentionally leaving open to interpretation at this early stage. See more details and instructions on how to apply <a href="https://github.com/google/oss-fuzz#accepting-new-projects">here</a>. <br /><br />Once a project is signed up for OSS-Fuzz, it is automatically subject to the 90-day disclosure deadline for newly reported bugs in our <a href="https://bugs.chromium.org/p/oss-fuzz/issues/list">tracker</a> (see details <a href="https://github.com/google/oss-fuzz#bug-disclosure-guidelines">here</a>). This matches industry's <a href="https://googleprojectzero.blogspot.com/2015/02/feedback-and-data-driven-updates-to.html">best practices</a> and improves end-user security and stability by getting patches to users faster. <br /><br />Help us ensure this program truly serves the open source community and the internet that relies on this critical software: contribute and leave your feedback on <a href="https://github.com/google/oss-fuzz">GitHub</a>.]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/announcing-oss-fuzz-continuous-fuzzing-for-open-source-software/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>What Test Engineers do at Google: Building Test Infrastructure</title>
		<link>https://googledata.org/google-testing/what-test-engineers-do-at-google-building-test-infrastructure/</link>
		<comments>https://googledata.org/google-testing/what-test-engineers-do-at-google-building-test-infrastructure/#comments</comments>
		<pubDate>Fri, 18 Nov 2016 17:10:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=f4768eca0908ce57ccca3c4ff68ffcbf</guid>
		<description><![CDATA[Author: Jochen Wuttke. In a recent post, we broadly talked about What Test Engineers do at Google. In this post, I talk about one aspect of the work TEs may do: building and improving test infrastructure to make engineers more productive. Refurbishing le...]]></description>
				<content:encoded><![CDATA[<i>Author: Jochen Wuttke</i><br /><br />In a recent post, we broadly talked about <a href="https://testing.googleblog.com/2016/09/what-test-engineers-do-at-google.html">What Test Engineers do at Google</a>. In this post, I talk about one aspect of the work TEs may do: building and improving test infrastructure to make engineers more productive. <br /><br /><strong>Refurbishing legacy systems makes new tools necessary</strong><br />A few years ago, I joined an engineering team that was working on replacing a legacy system with a new implementation. Because building the replacement would take several years, we had to keep the legacy system operational and even add features, while building the replacement so there would be no impact on our external users. <br /><br />The legacy system was so complex and brittle that the engineers spent most of their time triaging and fixing bugs and <a href="https://testing.googleblog.com/2016/05/flaky-tests-at-google-and-how-we.html">flaky tests</a>, but had little time to implement new features. The goal for the rewrite was to learn from the legacy system and to build something that was easier to maintain and extend. As the team's TE, my job was to understand what caused the high maintenance cost and how to improve on it. I found two main causes: <br /><ul><li>Tight coupling and insufficient abstraction made unit testing very hard, and as a consequence, a lot of end-to-end tests served as functional tests of that code. </li><li>The infrastructure used for the end-to-end tests had no good way to create and inject fakes or mocks for external services. As a result, the tests had to run a large number of servers for all these external dependencies. 
This led to very large and brittle tests that our <a href="https://testing.googleblog.com/2011/06/testing-at-speed-and-scale-of-google.html">existing test execution infrastructure</a> was not able to handle reliably.</li></ul><strong>Exploring solutions</strong><br />At first, I explored whether I could split the large tests into smaller ones that would test specific functionality and depend on fewer external services. This proved impossible, because of the poorly structured legacy code. Making this approach work would have required refactoring the entire system and its dependencies, not just the parts my team owned. <br /><br />In my second approach, I also focused on large tests and tried to mock services that were not required for the functionality under test. This also proved very difficult, because dependencies changed often and individual dependencies were hard to trace in a graph of over 200 services. Ultimately, this approach just shifted the required effort from maintaining test code to maintaining test dependencies and mocks. <br /><br />My third and final approach, illustrated in the figure below, made small tests more powerful. In the typical end-to-end test we faced, the client made <strong><a href="https://en.wikipedia.org/wiki/Remote_procedure_call">RPC</a> calls</strong> to several services, which in turn made RPC calls to other services. Together the client and the transitive closure over all backend services formed a large <a href="https://en.wikipedia.org/wiki/Graph_(discrete_mathematics)">graph</a> (not a <a href="https://en.wikipedia.org/wiki/Tree_(graph_theory)">tree</a>!) of dependencies, which all had to be up and running for the end-to-end test. The new model changes how we test client and service integration. Instead of running the client on inputs that will somehow trigger RPC calls, we write unit tests for the code making <strong>method calls</strong> to the RPC stub. 
The stub itself is mocked with a common mocking framework like Mockito in Java. For each such test, a second test verifies that the data used to drive that mock "makes sense" to the actual service. This is also done with a unit test, where a replay client uses the same data the RPC mock uses to call the RPC handler method of the service. <br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-XV6lUxNAMgs/WC81WN0TtHI/AAAAAAAAAQA/MBWOlMRo73MhXNfDF4E6LkPy7YuukVEXwCLcB/s1600/img1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="296" src="https://2.bp.blogspot.com/-XV6lUxNAMgs/WC81WN0TtHI/AAAAAAAAAQA/MBWOlMRo73MhXNfDF4E6LkPy7YuukVEXwCLcB/s640/img1.png" width="640" /></a></div><br />This pattern of integration testing applies to any RPC call, so the RPC calls made by a backend server to another backend can be tested just as well as front-end client calls. When we apply this approach consistently, we benefit from smaller tests that still test correct integration behavior, and make sure that the behavior we are testing is "real". <br /><br />To arrive at this solution, I had to build, evaluate, and discard several prototypes. While it took a day to build a proof-of-concept for this approach, it took me and another engineer a year to implement a finished tool developers could use. <br /><br /><strong>Adoption</strong><br />The engineers embraced the new solution very quickly when they saw that the new framework removes large amounts of boilerplate code from their tests. To further drive its adoption, I organized multi-day events with the engineering team where we focused on migrating test cases. It took a few months to migrate all existing unit tests to the new framework, close gaps in coverage, and create the new tests that validate the mocks. Once we converted about 80% of the tests, we started comparing the efficacy of the new tests and the existing end-to-end tests. 
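The internal tool itself is not public, but the two-test pattern described above (a mock-driven client test, paired with a replay test that checks the mock's canned data against the real handler) can be sketched generically. This Python analogue of the Java/Mockito setup uses invented names throughout and is a sketch of the idea, not the actual framework:

```python
from unittest.mock import Mock

# The real RPC handler that would normally run inside a server process.
def lookup_user_handler(request: dict) -> dict:
    if "id" not in request:
        raise ValueError("missing id")
    return {"id": request["id"], "name": f"user-{request['id']}"}

# One recorded interaction drives BOTH tests, so the mock can't drift
# away from what the real service actually returns.
RECORDED = {"request": {"id": 42}, "response": {"id": 42, "name": "user-42"}}

# Client code under test: it makes a plain method call on an RPC stub.
def greet(stub, user_id: int) -> str:
    user = stub.lookup_user({"id": user_id})
    return f"Hello, {user['name']}!"

# Test 1: unit-test the client against a mocked stub (no servers needed).
stub = Mock()
stub.lookup_user.return_value = RECORDED["response"]
assert greet(stub, 42) == "Hello, user-42!"
stub.lookup_user.assert_called_once_with(RECORDED["request"])

# Test 2: replay the same recorded data against the real handler,
# verifying the mock's canned response still "makes sense" to the service.
assert lookup_user_handler(RECORDED["request"]) == RECORDED["response"]
```

The key design point is the shared recorded data: if the service's behavior changes, the replay test fails even though the mock-driven client test still passes, which is exactly the gap a bare mock leaves open.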
<br /><br />The results are very good: <br /><ul><li>The new tests are as effective in finding bugs as the end-to-end tests are. </li><li>The new tests run in about 3 minutes instead of 30 minutes for the end-to-end tests. </li><li>The client side tests are 0% flaky. The verification tests are usually less flaky than the end-to-end tests, and never more.</li></ul>Additionally, the new tests are unit tests, so you can run them in your IDE and step through them to debug. These results allowed us to run the end-to-end tests very rarely, only to detect misconfigurations of the interacting services, but not as functional tests. <br /><br />Building and improving test infrastructure to help engineers be more productive is one of the many things test engineers do at Google. Running this project from requirements gathering all the way to a finished product gave me the opportunity to design and implement several prototypes, drive the full implementation of one solution, lead engineering teams to adoption of the new framework, and integrate feedback from engineers and actual measurements into the continuous refinement of the tool. ]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/what-test-engineers-do-at-google-building-test-infrastructure/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Hackable Projects &#8211; Pillar 3: Infrastructure</title>
		<link>https://googledata.org/google-testing/hackable-projects-pillar-3-infrastructure/</link>
		<comments>https://googledata.org/google-testing/hackable-projects-pillar-3-infrastructure/#comments</comments>
		<pubDate>Thu, 10 Nov 2016 19:07:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=637ca2c4bc9e1e152bd61ee6ceac85ed</guid>
		<description><![CDATA[<i>By: Patrik H&#246;glund</i><br /><i><br /></i><i>This is the third article in our series on Hackability; also see <a href="https://testing.googleblog.com/2016/08/hackable-projects.html">the first</a> and <a href="https://testing.googleblog.com/2016/10/hackable-projects-pillar-2-debuggability.html">second article</a>.</i><br /><br /><br />We have seen in our previous articles how <i>Code Health</i> and <i>Debuggability</i> can make a project much easier to work on. The third pillar is a solid infrastructure that gets accurate feedback to your developers as fast as possible. Speed is going to be a major theme in this article, and we&#8217;ll look at a number of things you can do to make your project easier to hack on.<br /><br /><h2>Build Systems Speed</h2><div><br /></div><div>Question: What&#8217;s a change you&#8217;d really like to see in our development tools?<br /><br />&#8220;I feel like this question gets asked almost every time, and I always give the same answer: <br />&#160;I would like them to be faster.&#8221; <br />&#160; &#160; &#160; &#160; -- <a href="http://research.google.com/pubs/author37504.html">Ian Lance Taylor</a></div><br /><br />Replace make with <a href="https://ninja-build.org/">ninja</a>. Use the <a href="https://en.wikipedia.org/wiki/Gold_(linker)">gold</a> linker instead of ld. Detect and delete dead code in your project (perhaps using coverage tools). Reduce the number of dependencies, and enforce dependency rules so new ones are not added lightly. Give the developers faster machines. Use a distributed build, which is available with many open-source continuous integration systems (or use Google&#8217;s system, <a href="http://www.bazel.io/">Bazel</a>!). 
You should do everything you can to make the build faster.<br /><br /><div><a href="https://2.bp.blogspot.com/-olHylBpyDck/WCTBuNiPfQI/AAAAAAAAAPM/vqGFMN1aMkUe7ZyYLO4dH0JGjerFPOtvQCLcB/s1600/image04.jpg"><img border="0" height="226" src="https://2.bp.blogspot.com/-olHylBpyDck/WCTBuNiPfQI/AAAAAAAAAPM/vqGFMN1aMkUe7ZyYLO4dH0JGjerFPOtvQCLcB/s400/image04.jpg" width="400"></a></div><div><i>Figure 1: &#8220;Cheetah chasing its prey&#8221; by Malene Thyssen.</i></div><br /><br />Why is that? There&#8217;s a tremendous difference in hackability if it takes 5 seconds to build and test versus one minute, or even 20 minutes, to build and test. Slow feedback cycles kill hackability, for many reasons:<br /><ul><li>Build and test times longer than a handful of seconds cause many developers&#8217; minds to wander, taking them out of the zone.</li><li>Excessive build or release times* make tinkering and refactoring much harder. All developers have a threshold when they start hedging (e.g. &#8220;I&#8217;d like to remove this branch, but I don&#8217;t know if I&#8217;ll break the iOS build&#8221;) which causes refactoring to not happen.</li></ul><br /><div><div>* The worst I ever heard of was an OS that took 24 hours to build!</div><div><br /><br /></div><div>How do you actually make fast build systems? There are some suggestions in the first paragraph above, but the best general suggestion I can make is to have a few engineers on the project who deeply understand the build systems and have the time to continuously improve them. The main axes of improvement are:</div><div><ol><li>Reduce the amount of code being compiled.</li><li>Replace tools with faster counterparts.</li><li>Increase processing power, maybe through parallelization or distributed systems.</li></ol></div><div>Note that there is a big difference between <i>full builds</i> and <i>incremental builds</i>. Both should be as fast as possible, but incremental builds are by far the most important to optimize. 
The way you tackle the two is different. For instance, reducing the total number of source files will make a full build faster, but it may not make an incremental build faster.&#160;</div><div><br /></div><div>To get faster incremental builds, in general, you need to make each source file as decoupled as possible from the rest of the code base. The less a change ripples through the codebase, the less work to do, right? See <a href="https://testing.googleblog.com/2016/08/hackable-projects.html">&#8220;Loose Coupling and Testability&#8221; in Pillar 1</a> for more on this subject. The exact mechanics of dependencies and interfaces depend on the programming language - one of the hardest to get right is unsurprisingly C++, where you need to be <a href="https://google.github.io/styleguide/cppguide.html#Forward_Declarations">disciplined with includes and forward declarations</a> to get any kind of incremental build performance.&#160;</div><div><br /></div><div>Build scripts and makefiles should be held to standards as high as the code itself. Technical debt and unnecessary dependencies have a tendency to accumulate in build scripts, because no one has the time to understand and fix them. Avoid this by addressing the technical debt as you go.<br /><br /></div><div><h2>Continuous Integration and Presubmit Queues</h2><div><br /></div>You should build and run tests on all platforms you release on. For instance, if you release on all the major desktop platforms, but all your developers are on Linux, this becomes extra important. It&#8217;s bad for hackability to update the repo, build on Windows, and find that lots of stuff is broken. It&#8217;s even worse if broken changes start to stack on top of each other. I think we all know that terrible feeling: when you&#8217;re not sure your change is the one that broke things.</div><div><br /></div><div>At a minimum, you should build and test on all platforms, but it&#8217;s even better if you do it in presubmit. 
The <a href="https://dev.chromium.org/developers/testing/commit-queue">Chromium submit queue</a> does this. It has evolved over the years so that a normal patch now builds and tests on about 30 different build configurations before commit. This is necessary for the 400-patches-per-day velocity of the Chrome project. Most projects obviously don&#8217;t have to go that far. Chromium&#8217;s infrastructure is based on <a href="http://buildbot.net/">BuildBot</a>, but there are many other job scheduling systems depending on your needs.</div></div><div><br /></div><div><div><a href="https://3.bp.blogspot.com/-24eTelErgpc/WCTCuC0wDsI/AAAAAAAAAPU/it0o56UJybMZaZPBOz-MDLhbe-ljyqAtwCLcB/s1600/image02.png"><img border="0" src="https://3.bp.blogspot.com/-24eTelErgpc/WCTCuC0wDsI/AAAAAAAAAPU/it0o56UJybMZaZPBOz-MDLhbe-ljyqAtwCLcB/s1600/image02.png"></a></div><div><i>Figure 2: How a Chromium patch is tested.</i></div></div><div><div><br /></div><div>As we discussed in Build Systems, speed and correctness are critical here. It takes a lot of ongoing work to keep builds, tests, and presubmits fast and free of flakes. You should never accept flakes, since developers very quickly lose trust in flaky tests and systems. Tooling can help a bit with this; for instance, see the <a href="http://chromium-try-flakes.appspot.com/">Chromium flakiness dashboard</a>.<br /><br /></div><div><h2>Test Speed</h2></div><div><br /></div><div>Speed is a feature, and this is particularly true for developer infrastructure. In general, the longer a test takes to execute, the less valuable it is. My rule of thumb is: if it takes more than a minute to execute, its value is greatly diminished. 
There are of course some exceptions, such as soak tests or certain performance tests.&#160;</div></div><div><br /></div><div><div><a href="https://4.bp.blogspot.com/-TnsfrijE-HQ/WCTDKNXmWeI/AAAAAAAAAPc/3J0OEpWIMegHrhD4XCMnwaRoUd6qUILQwCLcB/s1600/image07.png"><img border="0" src="https://4.bp.blogspot.com/-TnsfrijE-HQ/WCTDKNXmWeI/AAAAAAAAAPc/3J0OEpWIMegHrhD4XCMnwaRoUd6qUILQwCLcB/s1600/image07.png"></a></div><div><i>Figure 3. Test value.</i></div></div><div><div><br /></div><div>If you have tests that are slower than 60 seconds, they better be incredibly reliable and easily debuggable. A flaky test that takes several minutes to execute often has negative value because it slows down all work in the code it covers. You probably want to build better integration tests on lower levels instead, so you can make them faster and more reliable.</div><div><br /></div><div>If you have many engineers on a project, reducing the time to run the tests can have a big impact. This is one reason why it&#8217;s great to have <a href="https://testing.googleblog.com/2016/03/from-qa-to-engineering-productivity.html">SETIs</a> or the equivalent. There are many things you can do to improve test speed:</div><div><ul><li><a href="https://github.com/google/gtest-parallel">Sharding and parallelization</a>. Add more machines to your continuous build as your test set or number of developers grows.</li><li>Continuously measure how long it takes to run one build+test cycle in your continuous build, and have someone take action when it gets slower.</li><li>Remove tests that don&#8217;t pull their weight. 
If a test is really slow, it&#8217;s often because of <a href="http://googletesting.blogspot.se/2008/10/gui-testing-dont-sleep-without.html">poorly written wait conditions</a> or because the test bites off more than it can chew (maybe that unit test doesn&#8217;t have to process 15000 audio frames, maybe 50 is enough).</li><li>If you have tests that bring up a local server stack, for instance inter-server integration tests, making your servers boot faster is going to make the tests faster as well. Faster production code is faster to test! See Running on Localhost in <a href="https://testing.googleblog.com/2016/10/hackable-projects-pillar-2-debuggability.html">Pillar 2</a> for more on local server stacks.</li></ul></div><div><br /><h2>Workflow Speed</h2></div><div><br />We&#8217;ve talked about fast builds and tests, but the core developer workflows are also important, of course. Chromium undertook a multi-year project to switch from Subversion to Git, partly because Subversion was becoming too slow. You need to keep track of your core workflows as your project grows. Your version control system may work fine for years, but become too slow once the project becomes big enough. Bug search and management must also be robust and fast, since these are generally systems developers touch every day.<br /><br /></div><div><h2>Release Often</h2></div><div><br />It aids hackability to <a href="https://en.wikipedia.org/wiki/Release_early,_release_often">deploy to real users as fast as possible</a>. No matter how good your product's tests are, there's always a risk that there's something you haven't thought of. If you&#8217;re building a service or web site, you should aim to deploy multiple times per week. 
For client projects, <a href="https://blog.chromium.org/2010/07/release-early-release-often.html">Chrome&#8217;s six-week cycle</a> is a good goal to aim for.</div><div><br /></div><div>You should invest in infrastructure and tests that give you the confidence to do this - you don&#8217;t want to push something that&#8217;s broken. Your developers will thank you for it, since it makes their jobs so much easier. By releasing often, you mitigate risk, and developers will rush less to hack late changes in the release (since they know the next release isn&#8217;t far off).<br /><br /></div><div><h2>Easy Reverts</h2></div><div><br />If you look in the commit log for the Chromium project, you will see that a significant percentage of the commits are reverts of previous commits. In Chromium, bad commits quickly become costly because they impede other engineers, and the high velocity can cause good changes to stack onto bad changes.&#160;</div></div><div><br /></div><div><div><a href="https://4.bp.blogspot.com/-AV5gYtJ0eXY/WCTD6w-yzjI/AAAAAAAAAPg/XrnpxKHkF88qkzjLV28E1KmoKz4tvvfUwCLcB/s1600/image08.png"><img border="0" src="https://4.bp.blogspot.com/-AV5gYtJ0eXY/WCTD6w-yzjI/AAAAAAAAAPg/XrnpxKHkF88qkzjLV28E1KmoKz4tvvfUwCLcB/s1600/image08.png"></a></div></div><div><div><div><i>Figure 4: Chromium&#8217;s revert button.</i></div></div><div><div><br /></div></div><div><br />This is why the policy is &#8220;revert first, ask questions later&#8221;. I believe a revert-first policy is good for small projects as well, since it creates a clear expectation that the product/tools/dev environment should be <i>working</i> at all times (and if it isn&#8217;t, a recent change should probably be reverted).</div><div><br /></div><div>It has a wonderful effect when a revert is simple to make. You can suddenly make <i>speculative reverts</i> if a test went flaky or a performance test regressed. It follows that if a patch is easy to revert, so is the inverse (i.e. 
reverting the revert or relanding the patch). So if you were wrong and that patch wasn&#8217;t guilty after all, it&#8217;s simple to re-land it and try reverting another patch. Developers might initially balk at this (because it can&#8217;t possibly be their patch, right?), but they usually come around when they realize the benefits.</div><div><br /></div><div>For many projects, a revert can simply be:</div><div><pre class="prettyprint">git revert 9fbadbeef<br />git push origin master</pre></div><div><br /></div><div><br />If your project (wisely) involves code review, it will behoove you to add something like Chromium&#8217;s revert button that I mentioned above. The revert button will create a special patch that bypasses review and tests (since we can assume a clean revert takes us back to a more stable state rather than the opposite). See <a href="https://testing.googleblog.com/2016/08/hackable-projects.html">Pillar 1</a> for more on code review and its benefits.</div><div><br /></div><div>For some projects, reverts are going to be harder, especially if you have a slow or laborious release process. Even if you release often, you could still have problems if a revert involves state migrations in your live services (for instance when rolling back a database schema change). You need to have a strategy to deal with such state changes.&#160;</div><div><br /></div><div><br /></div><div>Reverts must always put you back to safer ground, and everyone must be confident they can safely revert. If not, you run the risk of massive fire drills and lost user trust if a bad patch makes it through the tests and you can&#8217;t revert it.<br /><br /></div><div><h2>Performance Tests: Measure Everything</h2></div><div><br />Is it critical that your app starts up within a second? Should your app always render at 60 fps when it&#8217;s scrolled up or down? Should your web server always serve a response within 100 ms? Should your mobile app be smaller than 8 MB? 
If so, you should make a performance test for that. Performance tests aid hackability since developers can quickly see how their change affects performance and thus prevent performance regressions from making it into the wild.</div><div><br /></div><div>You should run your automated performance tests on the same device; all devices are different, and this will be reflected in the numbers. This is fairly straightforward if you have a decent continuous integration system that runs tests sequentially on a known set of worker machines. It&#8217;s harder if you need to run on physical phones or tablets, but it can be done.&#160;</div><div><br /></div><div>A test can be as simple as invoking a particular algorithm and measuring the time it takes to execute it (median and 90th percentile, say, over N runs).</div></div><div><br /></div><div><div><a href="https://1.bp.blogspot.com/-wOxZrXJCrOw/WCTEakhXI4I/AAAAAAAAAPk/EfuFAdV-3YkW-IOLO3Lgv2zursRFmP00wCLcB/s1600/image03.png"><img border="0" height="172" src="https://1.bp.blogspot.com/-wOxZrXJCrOw/WCTEakhXI4I/AAAAAAAAAPk/EfuFAdV-3YkW-IOLO3Lgv2zursRFmP00wCLcB/s400/image03.png" width="400"></a></div><div><i>Figure 5: A VP8-in-WebRTC regression (<a href="https://bugs.chromium.org/p/chromium/issues/detail?id=597705">bug</a>) and its fix displayed in a Catapult dashboard.</i></div></div><div><div><div><br /></div></div><div><br />Write your test so it outputs performance numbers you care about. But what to do with those numbers? Fortunately, Chrome&#8217;s performance test framework <a href="https://github.com/catapult-project/catapult">has been open-sourced</a>, which means you can set up a dashboard, with automatic regression monitoring, with minimal effort. The test framework also includes the powerful <a href="https://github.com/catapult-project/catapult/blob/master/telemetry/README.md">Telemetry</a> framework which can run actions on web pages and Android apps and report performance results. 
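Returning to the simple timing test for a moment, a minimal sketch in Python could look like the following. Here `algorithm()` is a stand-in for whatever code you want to measure, and a real test would feed the numbers into your dashboard instead of printing them:

```python
import statistics
import time

def algorithm():
    # Stand-in for the code under test.
    return sum(i * i for i in range(10000))

def benchmark(func, runs=30):
    """Time func over several runs; return (median, 90th percentile) in ms."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    median_ms = statistics.median(samples)
    p90_ms = samples[min(runs - 1, int(runs * 0.9))]
    return median_ms, p90_ms

if __name__ == "__main__":
    median_ms, p90_ms = benchmark(algorithm)
    print("median: %.2f ms, 90th percentile: %.2f ms" % (median_ms, p90_ms))
```

As noted above, run such a test on the same machine every time; the numbers are only comparable when the hardware is fixed. For anything beyond this, the Catapult and Telemetry tooling mentioned here does the heavy lifting.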
Telemetry and Catapult are battle-tested by use in the Chromium project and are capable of running on a wide set of devices, while getting the most usable performance data out of the devices.<br /><br /></div><div><h2>Sources</h2></div><div><br />Figure 1: By Malene Thyssen (Own work) [CC BY-SA 3.0 (<a href="http://creativecommons.org/licenses/by-sa/3.0">http://creativecommons.org/licenses/by-sa/3.0</a>)], via Wikimedia Commons</div></div><div><br /></div><div><br /></div>]]></description>
				<content:encoded><![CDATA[<i>By: Patrik Höglund</i><br /><i><br /></i><i>This is the third article in our series on Hackability; also see <a href="https://testing.googleblog.com/2016/08/hackable-projects.html">the first</a> and <a href="https://testing.googleblog.com/2016/10/hackable-projects-pillar-2-debuggability.html">second article</a>.</i><br /><br /><br />We have seen in our previous articles how <i>Code Health</i> and <i>Debuggability</i> can make a project much easier to work on. The third pillar is a solid infrastructure that gets accurate feedback to your developers as fast as possible. Speed is going to be a major theme in this article, and we’ll look at a number of things you can do to make your project easier to hack on.<br /><br /><h2>Build Systems Speed</h2><div><br /></div><div style="border: 1px solid black; padding: 10px;">Question: What’s a change you’d really like to see in our development tools?<br /><br />“I feel like this question gets asked almost every time, and I always give the same answer: <br />&nbsp;I would like them to be faster.” <br />&nbsp; &nbsp; &nbsp; &nbsp; -- <a href="http://research.google.com/pubs/author37504.html">Ian Lance Taylor</a></div><br /><br />Replace make with <a href="https://ninja-build.org/">ninja</a>. Use the <a href="https://en.wikipedia.org/wiki/Gold_(linker)">gold</a> linker instead of ld. Detect and delete dead code in your project (perhaps using coverage tools). Reduce the number of dependencies, and enforce dependency rules so new ones are not added lightly. Give the developers faster machines. Use a distributed build, which is available with many open-source continuous integration systems (or use Google’s system, <a href="http://www.bazel.io/">Bazel</a>!). 
You should do everything you can to make the build faster.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-olHylBpyDck/WCTBuNiPfQI/AAAAAAAAAPM/vqGFMN1aMkUe7ZyYLO4dH0JGjerFPOtvQCLcB/s1600/image04.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="226" src="https://2.bp.blogspot.com/-olHylBpyDck/WCTBuNiPfQI/AAAAAAAAAPM/vqGFMN1aMkUe7ZyYLO4dH0JGjerFPOtvQCLcB/s400/image04.jpg" width="400" /></a></div><div style="text-align: center;"><i>Figure 1: “Cheetah chasing its prey” by Malene Thyssen.</i></div><br /><br />Why is that? There’s a tremendous difference in hackability if it takes 5 seconds to build and test versus one minute, or even 20 minutes, to build and test. Slow feedback cycles kill hackability, for many reasons:<br /><ul><li>Build and test times longer than a handful of seconds cause many developers’ minds to wander, taking them out of the zone.</li><li>Excessive build or release times* make tinkering and refactoring much harder. All developers have a threshold at which they start hedging (e.g. “I’d like to remove this branch, but I don’t know if I’ll break the iOS build”) which causes refactoring to not happen.</li></ul><br /><div><div>* The worst I ever heard of was an OS that took 24 hours to build!</div><div><br /><br /></div><div>How do you actually make fast build systems? There are some suggestions in the first paragraph above, but the best general suggestion I can make is to have a few engineers on the project who deeply understand the build systems and have the time to continuously improve them. The main axes of improvement are:</div><div><ol><li>Reduce the amount of code being compiled.</li><li>Replace tools with faster counterparts.</li><li>Increase processing power, maybe through parallelization or distributed systems.</li></ol></div><div>Note that there is a big difference between <i>full builds</i> and <i>incremental builds</i>. 
Both should be as fast as possible, but incremental builds are by far the most important to optimize. The way you tackle the two is different. For instance, reducing the total number of source files will make a full build faster, but it may not make an incremental build faster.&nbsp;</div><div><br /></div><div>To get faster incremental builds, in general, you need to make each source file as decoupled as possible from the rest of the code base. The less a change ripples through the codebase, the less work to do, right? See <a href="https://testing.googleblog.com/2016/08/hackable-projects.html">“Loose Coupling and Testability” in Pillar 1</a> for more on this subject. The exact mechanics of dependencies and interfaces depend on the programming language - one of the hardest to get right is unsurprisingly C++, where you need to be <a href="https://google.github.io/styleguide/cppguide.html#Forward_Declarations">disciplined with includes and forward declarations</a> to get any kind of incremental build performance.&nbsp;</div><div><br /></div><div>Build scripts and makefiles should be held to the same high standards as the code itself. Technical debt and unnecessary dependencies have a tendency to accumulate in build scripts, because no one has the time to understand and fix them. Avoid this by addressing the technical debt as you go.<br /><br /></div><div><h2>Continuous Integration and Presubmit Queues</h2><div><br /></div>You should build and run tests on all platforms you release on. For instance, if you release on all the major desktop platforms, but all your developers are on Linux, this becomes extra important. It’s bad for hackability to update the repo, build on Windows, and find that lots of stuff is broken. It’s even worse if broken changes start to stack on top of each other. 
I think we all know that terrible feeling: when you’re not sure your change is the one that broke things.</div><div><br /></div><div>At a minimum, you should build and test on all platforms, but it’s even better if you do it in presubmit. The <a href="https://dev.chromium.org/developers/testing/commit-queue">Chromium submit queue</a> does this. It has evolved over the years so that a normal patch builds and tests on about 30 different build configurations before commit. This is necessary for the 400-patches-per-day velocity of the Chrome project. Most projects obviously don’t have to go that far. Chromium’s infrastructure is based on <a href="http://buildbot.net/">BuildBot</a>, but there are many other job scheduling systems depending on your needs.</div></div><div><br /></div><div><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-24eTelErgpc/WCTCuC0wDsI/AAAAAAAAAPU/it0o56UJybMZaZPBOz-MDLhbe-ljyqAtwCLcB/s1600/image02.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://3.bp.blogspot.com/-24eTelErgpc/WCTCuC0wDsI/AAAAAAAAAPU/it0o56UJybMZaZPBOz-MDLhbe-ljyqAtwCLcB/s1600/image02.png" /></a></div><div style="text-align: center;"><i>Figure 2: How a Chromium patch is tested.</i></div></div><div><div><br /></div><div>As we discussed in Build Systems, speed and correctness are critical here. It takes a lot of ongoing work to keep build, tests, and presubmits fast and free of flakes. You should never accept flakes, since developers very quickly lose trust in flaky tests and systems. Tooling can help a bit with this; for instance, see the <a href="http://chromium-try-flakes.appspot.com/">Chromium flakiness dashboard</a>.<br /><br /></div><div><h2>Test Speed</h2></div><div><br /></div><div>Speed is a feature, and this is particularly true for developer infrastructure. In general, the longer a test takes to execute, the less valuable it is. 
My rule of thumb is: if it takes more than a minute to execute, its value is greatly diminished. There are of course some exceptions, such as soak tests or certain performance tests.&nbsp;</div></div><div><br /></div><div><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-TnsfrijE-HQ/WCTDKNXmWeI/AAAAAAAAAPc/3J0OEpWIMegHrhD4XCMnwaRoUd6qUILQwCLcB/s1600/image07.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://4.bp.blogspot.com/-TnsfrijE-HQ/WCTDKNXmWeI/AAAAAAAAAPc/3J0OEpWIMegHrhD4XCMnwaRoUd6qUILQwCLcB/s1600/image07.png" /></a></div><div style="text-align: center;"><i>Figure 3. Test value.</i></div></div><div><div><br /></div><div>If you have tests that are slower than 60 seconds, they better be incredibly reliable and easily debuggable. A flaky test that takes several minutes to execute often has negative value because it slows down all work in the code it covers. You probably want to build better integration tests on lower levels instead, so you can make them faster and more reliable.</div><div><br /></div><div>If you have many engineers on a project, reducing the time to run the tests can have a big impact. This is one reason why it’s great to have <a href="https://testing.googleblog.com/2016/03/from-qa-to-engineering-productivity.html">SETIs</a> or the equivalent. There are many things you can do to improve test speed:</div><div><ul><li><a href="https://github.com/google/gtest-parallel">Sharding and parallelization</a>. Add more machines to your continuous build as your test set or number of developers grows.</li><li>Continuously measure how long it takes to run one build+test cycle in your continuous build, and have someone take action when it gets slower.</li><li>Remove tests that don’t pull their weight. 
If a test is really slow, it’s often because of <a href="http://googletesting.blogspot.se/2008/10/gui-testing-dont-sleep-without.html">poorly written wait conditions</a> or because the test bites off more than it can chew (maybe that unit test doesn’t have to process 15000 audio frames, maybe 50 is enough).</li><li>If you have tests that bring up a local server stack, for instance inter-server integration tests, making your servers boot faster is going to make the tests faster as well. Faster production code is faster to test! See Running on Localhost in <a href="https://testing.googleblog.com/2016/10/hackable-projects-pillar-2-debuggability.html">Pillar 2</a> for more on local server stacks.</li></ul></div><div><br /><h2>Workflow Speed</h2></div><div><br />We’ve talked about fast builds and tests, but the core developer workflows are also important, of course. Chromium undertook a multi-year project to switch from Subversion to Git, partly because Subversion was becoming too slow. You need to keep track of your core workflows as your project grows. Your version control system may work fine for years, but become too slow once the project becomes big enough. Bug search and management must also be robust and fast, since these are generally systems developers touch every day.<br /><br /></div><div><h2>Release Often</h2></div><div><br />It aids hackability to <a href="https://en.wikipedia.org/wiki/Release_early,_release_often">deploy to real users as fast as possible</a>. No matter how good your product's tests are, there's always a risk that there's something you haven't thought of. If you’re building a service or web site, you should aim to deploy multiple times per week. 
For client projects, <a href="https://blog.chromium.org/2010/07/release-early-release-often.html">Chrome’s six-week cycle</a> is a good goal to aim for.</div><div><br /></div><div>You should invest in infrastructure and tests that give you the confidence to do this - you don’t want to push something that’s broken. Your developers will thank you for it, since it makes their jobs so much easier. By releasing often, you mitigate risk, and developers will rush less to hack late changes in the release (since they know the next release isn’t far off).<br /><br /></div><div><h2>Easy Reverts</h2></div><div><br />If you look in the commit log for the Chromium project, you will see that a significant percentage of the commits are reverts of previous commits. In Chromium, bad commits quickly become costly because they impede other engineers, and the high velocity can cause good changes to stack onto bad changes.&nbsp;</div></div><div><br /></div><div><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-AV5gYtJ0eXY/WCTD6w-yzjI/AAAAAAAAAPg/XrnpxKHkF88qkzjLV28E1KmoKz4tvvfUwCLcB/s1600/image08.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://4.bp.blogspot.com/-AV5gYtJ0eXY/WCTD6w-yzjI/AAAAAAAAAPg/XrnpxKHkF88qkzjLV28E1KmoKz4tvvfUwCLcB/s1600/image08.png" /></a></div></div><div><div><div style="text-align: center;"><i>Figure 4: Chromium’s revert button.</i></div></div><div><div style="text-align: center;"><br /></div></div><div><br />This is why the policy is “revert first, ask questions later”. I believe a revert-first policy is good for small projects as well, since it creates a clear expectation that the product/tools/dev environment should be <i>working</i> at all times (and if it doesn’t, a recent change should probably be reverted).</div><div><br /></div><div>It has a wonderful effect when a revert is simple to make. 
You can suddenly make <i>speculative reverts</i> if a test went flaky or a performance test regressed. It follows that if a patch is easy to revert, so is the inverse (i.e. reverting the revert or relanding the patch). So if you were wrong and that patch wasn’t guilty after all, it’s simple to re-land it and try reverting another patch. Developers might initially balk at this (because it can’t possibly be their patch, right?), but they usually come around when they realize the benefits.</div><div><br /></div><div>For many projects, a revert can simply be</div><div><span style="font-family: &quot;courier new&quot; , &quot;courier&quot; , monospace;">git revert 9fbadbeef</span></div><div><span style="font-family: &quot;courier new&quot; , &quot;courier&quot; , monospace;">git push origin master</span></div><div><br /></div><div><br />If your project (wisely) involves code review, it will behoove you to add something like Chromium’s revert button that I mentioned above. The revert button will create a special patch that bypasses review and tests (since we can assume a clean revert takes us back to a more stable state rather than the opposite). See <a href="https://testing.googleblog.com/2016/08/hackable-projects.html">Pillar 1</a> for more on code review and its benefits.</div><div><br /></div><div>For some projects, reverts are going to be harder, especially if you have a slow or laborious release process. Even if you release often, you could still have problems if a revert involves state migrations in your live services (for instance when rolling back a database schema change). You need to have a strategy to deal with such state changes.&nbsp;</div><div><br /></div><div><br /></div><div>Reverts must always put you back to safer ground, and everyone must be confident they can safely revert. 
If not, you run the risk of massive fire drills and lost user trust if a bad patch makes it through the tests and you can’t revert it.<br /><br /></div><div><h2>Performance Tests: Measure Everything</h2></div><div><br />Is it critical that your app starts up within a second? Should your app always render at 60 fps when it’s scrolled up or down? Should your web server always serve a response within 100 ms? Should your mobile app be smaller than 8 MB? If so, you should make a performance test for that. Performance tests aid hackability since developers can quickly see how their change affects performance and thus prevent performance regressions from making it into the wild.</div><div><br /></div><div>You should run your automated performance tests on the same device; all devices are different, and this will be reflected in the numbers. This is fairly straightforward if you have a decent continuous integration system that runs tests sequentially on a known set of worker machines. It’s harder if you need to run on physical phones or tablets, but it can be done.&nbsp;</div><div><br /></div><div>A test can be as simple as invoking a particular algorithm and measuring the time it takes to execute it (median and 90th percentile, say, over N runs).</div></div><div><br /></div><div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-wOxZrXJCrOw/WCTEakhXI4I/AAAAAAAAAPk/EfuFAdV-3YkW-IOLO3Lgv2zursRFmP00wCLcB/s1600/image03.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="172" src="https://1.bp.blogspot.com/-wOxZrXJCrOw/WCTEakhXI4I/AAAAAAAAAPk/EfuFAdV-3YkW-IOLO3Lgv2zursRFmP00wCLcB/s400/image03.png" width="400" /></a></div><div style="text-align: center;"><i>Figure 5: A VP8-in-WebRTC regression (<a href="https://bugs.chromium.org/p/chromium/issues/detail?id=597705">bug</a>) and its fix displayed in a Catapult dashboard.</i></div></div><div><div><div style="text-align: center;"><br 
/></div></div><div><br />Write your test so it outputs performance numbers you care about. But what to do with those numbers? Fortunately, Chrome’s performance test framework <a href="https://github.com/catapult-project/catapult">has been open-sourced</a>, which means you can set up a dashboard, with automatic regression monitoring, with minimal effort. The test framework also includes the powerful <a href="https://github.com/catapult-project/catapult/blob/master/telemetry/README.md">Telemetry</a> framework which can run actions on web pages and Android apps and report performance results. Telemetry and Catapult are battle-tested by use in the Chromium project and are capable of running on a wide set of devices, while getting the most usable performance data out of the devices.<br /><br /></div><div><h2>Sources</h2></div><div><br />Figure 1: By Malene Thyssen (Own work) [CC BY-SA 3.0 (<a href="http://creativecommons.org/licenses/by-sa/3.0">http://creativecommons.org/licenses/by-sa/3.0</a>)], via Wikimedia Commons</div></div><div><br /></div><div><br /></div>]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/hackable-projects-pillar-3-infrastructure/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Hackable Projects &#8211; Pillar 2: Debuggability</title>
		<link>https://googledata.org/google-testing/hackable-projects-pillar-2-debuggability/</link>
		<comments>https://googledata.org/google-testing/hackable-projects-pillar-2-debuggability/#comments</comments>
		<pubDate>Tue, 11 Oct 2016 16:04:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=cf6f003f947782ff80e988e208b171f9</guid>
		<description><![CDATA[<i>By: Patrik H&#246;glund </i><br /><i><br /></i><em>This is the second article in our series on Hackability; also see <a href="https://testing.googleblog.com/2016/08/hackable-projects.html">the first article</a>.</em><br /><em><br /></em><br /><div>&#8220;Deep into that darkness peering, long I stood there, wondering, fearing, doubting, dreaming dreams no mortal ever dared to dream before.&#8221;   -- Edgar Allan Poe</div><br /><br />Debuggability can mean being able to use a debugger, but here we&#8217;re interested in a broader meaning. Debuggability means being able to easily find what&#8217;s wrong with a piece of software, whether it&#8217;s through logs, statistics or debugger tools. Debuggability doesn&#8217;t happen by accident: you need to design it into your product. The amount of work it takes will vary depending on your product, programming language(s) and development environment. <br /><br />In this article, I am going to walk through a few examples of how we have aided debuggability for our developers. If you do the same analysis and implementation for your project, perhaps you can help your developers illuminate the dark corners of the codebase and learn what truly goes on there.<br /><div><a href="https://1.bp.blogspot.com/-1WOvm87QI-U/V_0IEnBp4pI/AAAAAAAAAOs/mCppX2KO2jUWmKpXgNdJlepqKPBvRy0eQCLcB/s1600/image00.jpg"><img border="0" height="315" src="https://1.bp.blogspot.com/-1WOvm87QI-U/V_0IEnBp4pI/AAAAAAAAAOs/mCppX2KO2jUWmKpXgNdJlepqKPBvRy0eQCLcB/s400/image00.jpg" width="400"></a></div><div><em>Figure 1: computer log entry from the </em>Mark II<em>, with a moth taped to the page.</em></div><br /><h2>Running on Localhost </h2><div>Read more on the Testing Blog: <a href="http://testing.googleblog.com/2012/10/hermetic-servers.html">Hermetic Servers</a> by Chaitali Narla and Diego Salas </div><br /><br />Suppose you&#8217;re developing a service with a mobile app that connects to that service. 
You&#8217;re working on a new feature in the app that requires changes in the backend. Do you develop in production? That&#8217;s a really bad idea, as you must push unfinished code to production to work on your change. Don&#8217;t do that: it could break your service for your existing users. Instead, you need some kind of script that brings up your <a href="http://testing.googleblog.com/2012/10/hermetic-servers.html">server stack on localhost</a>. <br /><br />You can probably run your servers by hand, but that quickly gets tedious. At Google, we usually use fancy Python scripts that invoke the server binaries with flags. Why do we need those flags? Suppose, for instance, that you have a server A that depends on servers B and C. The default behavior when the server boots should be to connect to B and C in production. When booting on localhost, we want to connect to our local B and C though. For instance: <br /><br /><pre>b_serv --port=1234 --db=/tmp/fakedb<br />c_serv --port=1235<br />a_serv --b_spec=localhost:1234 --c_spec=localhost:1235<br /></pre><br />That makes it a whole lot easier to develop and debug your server. Make sure the logs and stdout/stderr end up in some well-defined directory on localhost so you don&#8217;t waste time looking for them. You may want to write a basic debug client that sends HTTP requests or RPCs or whatever your server handles. It&#8217;s painful to have to boot the real app on a mobile phone just to test something. <br /><br />A localhost setup is also a prerequisite for making <em>hermetic tests</em>, where the test invokes the above script to bring up the server stack. The test can then run, say, integration tests among the servers or even client-server integration tests. Such integration tests can catch protocol drift bugs between client and server, while being super stable by not talking to external or shared services. <br /><h2>Debugging Mobile Apps</h2>First, mobile is hard. 
The tooling is generally less mature than for desktop, although things are steadily improving. Again, unit tests are great for hackability here. It&#8217;s really painful to always load your app on a phone connected to your workstation to see if a change worked. <a href="http://robolectric.org/">Robolectric unit tests</a> and <a href="https://google.github.io/android-testing-support-library/docs/espresso/basics/index.html">Espresso functional tests</a>, for instance, run on your workstation and do not require a real phone. <a href="https://developer.apple.com/reference/xctest">XCTest</a> and <a href="https://github.com/google/EarlGrey">Earl Grey</a> give you the same on iOS. <br /><br />Debuggers ship with Xcode and Android Studio. If your Android app ships JNI code, it&#8217;s a bit trickier, but you can attach GDB to running processes on your phone. It&#8217;s worth spending the time figuring this out early in the project, so you don&#8217;t have to guess what your code is doing. Debugging unit tests is even better and can be done straightforwardly on your workstation. <br /><h2>When Debugging gets Tricky</h2>Some products are harder to debug than others. One example is <a href="https://en.wikipedia.org/wiki/Real-time_computing">hard real-time systems</a>, since their behavior is so dependent on timing (and you better not be hooked up to a real industrial controller or rocket engine when you hit a breakpoint!). One possible solution is to run the software on a fake clock instead of a hardware clock, so the clock stops when the program stops. <br /><br />Another example is multi-process sandboxed programs such as Chromium. Since the browser spawns one renderer process per tab, how do you even attach a debugger to it? The developers have made it quite a lot easier with <a href="https://chromium.googlesource.com/chromium/src/+/master/docs/linux_debugging.md#Multiprocess-Tricks">debugging flags and instructions</a>. 
For instance, this wraps gdb around each renderer process as it starts up: <br /><br /><pre>chrome --renderer-cmd-prefix='xterm -title renderer -e gdb --args'<br /></pre><br />The point is, you need to build these kinds of things into your product; this greatly aids hackability. <br /><h2>Proper Logging</h2><div>Read more on the Testing Blog: <a href="http://testing.googleblog.com/2013/06/optimal-logging.html">Optimal Logging</a> by Anthony Vallone </div><br /><br />It&#8217;s hackability to get the right logs when you need them. It&#8217;s easy to fix a crash if you get a stack trace from the error location. It&#8217;s far from guaranteed you&#8217;ll get such a stack trace, for instance in C++ programs, but this is something you should not stand for. For instance, Chromium had a problem where renderer process crashes didn&#8217;t print in test logs, because the test was running in a separate process. This was later fixed, and this kind of investment is worthwhile to make. A clean stack trace is worth a lot more than a &#8220;renderer crashed&#8221; message. <br /><br />Logs are also useful for development. It&#8217;s an art to determine how much logging is appropriate for a given piece of code, but it is a good idea to keep the default level of logging conservative and give developers the option to turn on more logging for the parts they&#8217;re working on (example: <a href="https://www.chromium.org/for-testers/enable-logging">Chromium</a>). Too much logging isn&#8217;t hackability. <a href="https://googletesting.blogspot.se/2013/06/optimal-logging.html">This article</a> elaborates further on this topic. <br /><br />Logs should also be properly symbolized for C/C++ projects; a naked list of addresses in a stack trace isn&#8217;t very helpful. This is easy if you build for development (e.g. with -g), but if the crash happens in a release build it&#8217;s a bit trickier. 
You then need to build the same binary with the same flags and use addr2line / ndk-stack / etc to symbolize the stack trace. It&#8217;s a good idea to build tools and scripts for this so it&#8217;s as easy as possible. <br /><h2>Monitoring and Statistics</h2>It aids hackability if developers can quickly understand what effect their changes have in the real world. For this, monitoring tools such as <a href="https://cloud.google.com/stackdriver/">Stackdriver for Google Cloud</a> are excellent. If you&#8217;re running a service, such tools can help you keep track of request volumes and error rates. This way you can quickly detect that 30% increase in request errors, and roll back that bad code change, before it does too much damage. It also makes it possible to debug your service in production without disrupting it. <br /><h2>System Under Test (SUT) Size</h2>Tests and debugging go hand in hand: it&#8217;s a lot easier to target a piece of code in a test than in the whole application. Small and focused tests aid debuggability, because when a test breaks there isn&#8217;t an enormous SUT to look for errors in. These tests will also be less flaky. <a href="http://googletesting.blogspot.se/2015/04/just-say-no-to-more-end-to-end-tests.html">This article</a> discusses this fact at length. <br /><br /><div></div><div><a href="https://3.bp.blogspot.com/-oBuMTeOYhww/V_0IswR5HKI/AAAAAAAAAO0/8jxV5oWkr-ErKGmavjvlRlkVpcGbhnWTACLcB/s1600/img3.png"><img border="0" src="https://3.bp.blogspot.com/-oBuMTeOYhww/V_0IswR5HKI/AAAAAAAAAO0/8jxV5oWkr-ErKGmavjvlRlkVpcGbhnWTACLcB/s1600/img3.png"></a></div><div><em><br /></em><em>Figure 2. The smaller the SUT, the more valuable the test.</em></div><br />You should try to keep the above in mind, particularly when writing integration tests. If you&#8217;re testing a mobile app with a server, what bugs are you actually trying to catch? If you&#8217;re trying to ensure the app can still talk to the server (i.e. 
catching protocol drift bugs), you should not involve the UI of the app. That&#8217;s not what you&#8217;re testing here. Instead, break out the signaling part of the app into a library, test that directly against your local server stack, and write separate tests for the UI that only test the UI. <br /><br />Smaller SUTs also greatly aid test speed, since there&#8217;s less to build, less to bring up and less to keep running. In general, strive to keep the SUT as small as possible through whatever means necessary. It will keep the tests smaller, faster and more focused. <br /><h2>Sources</h2>Figure 1: By Courtesy of the Naval Surface Warfare Center, Dahlgren, VA., 1988. - U.S. Naval Historical Center Online Library Photograph NH 96566-KN, Public Domain, https://commons.wikimedia.org/w/index.php?curid=165211  <br /><br /><br /><div>(Continue to <a href="https://testing.googleblog.com/2016/11/hackable-projects-pillar-3.html">Pillar 3: Infrastructure</a>)</div><br />]]></description>
				<content:encoded><![CDATA[<i>By: Patrik Höglund </i><br /><i><br /></i><em>This is the second article in our series on Hackability; also see <a href="https://testing.googleblog.com/2016/08/hackable-projects.html">the first article</a>.</em><br /><em><br /></em><br /><div style="border: 1px solid black; padding: 10px;">“Deep into that darkness peering, long I stood there, wondering, fearing, doubting, dreaming dreams no mortal ever dared to dream before.”   -- Edgar Allan Poe</div><br /><br />Debuggability can mean being able to use a debugger, but here we’re interested in a broader meaning. Debuggability means being able to easily find what’s wrong with a piece of software, whether it’s through logs, statistics or debugger tools. Debuggability doesn’t happen by accident: you need to design it into your product. The amount of work it takes will vary depending on your product, programming language(s) and development environment. <br /><br />In this article, I am going to walk through a few examples of how we have aided debuggability for our developers. 
If you do the same analysis and implementation for your project, perhaps you can help your developers illuminate the dark corners of the codebase and learn what truly goes on there.<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-1WOvm87QI-U/V_0IEnBp4pI/AAAAAAAAAOs/mCppX2KO2jUWmKpXgNdJlepqKPBvRy0eQCLcB/s1600/image00.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="315" src="https://1.bp.blogspot.com/-1WOvm87QI-U/V_0IEnBp4pI/AAAAAAAAAOs/mCppX2KO2jUWmKpXgNdJlepqKPBvRy0eQCLcB/s400/image00.jpg" width="400" /></a></div><div style="text-align: center;"><em>Figure 1: computer log entry from the </em>Mark II<em>, with a moth taped to the page.</em></div><br /><h2>Running on Localhost </h2><div style="border: 1px solid black; padding: 10px;">Read more on the Testing Blog: <a href="http://testing.googleblog.com/2012/10/hermetic-servers.html">Hermetic Servers</a> by Chaitali Narla and Diego Salas </div><br /><br />Suppose you’re developing a service with a mobile app that connects to that service. You’re working on a new feature in the app that requires changes in the backend. Do you develop in production? That’s a really bad idea, as you must push unfinished code to production to work on your change. Don’t do that: it could break your service for your existing users. Instead, you need some kind of script that brings up your <a href="http://testing.googleblog.com/2012/10/hermetic-servers.html">server stack on localhost</a>. <br /><br />You can probably run your servers by hand, but that quickly gets tedious. In Google, we usually use fancy python scripts that invoke the server binaries with flags. Why do we need those flags? Suppose, for instance, that you have a server A that depends on a server B and C. The default behavior when the server boots should be to connect to B and C in production. When booting on localhost, we want to connect to our local B and C though. 
For instance: <br /><br /><pre class="prettyprint">b_serv --port=1234 --db=/tmp/fakedb<br />c_serv --port=1235<br />a_serv --b_spec=localhost:1234 --c_spec=localhost:1235<br /></pre><br />That makes it a whole lot easier to develop and debug your server. Make sure the logs and stdout/stderr end up in some well-defined directory on localhost so you don’t waste time looking for them. You may want to write a basic debug client that sends HTTP requests or RPCs or whatever your server handles. It’s painful to have to boot the real app on a mobile phone just to test something. <br /><br />A localhost setup is also a prerequisite for making <em>hermetic tests</em>, where the test invokes the above script to bring up the server stack. The test can then run, say, integration tests among the servers or even client-server integration tests. Such integration tests can catch protocol drift bugs between client and server, while being super stable by not talking to external or shared services. <br /><h2>Debugging Mobile Apps</h2>First, mobile is hard. The tooling is generally less mature than for desktop, although things are steadily improving. Again, unit tests are great for hackability here. It’s really painful to always load your app on a phone connected to your workstation to see if a change worked. <a href="http://robolectric.org/">Robolectric unit tests</a> and <a href="https://google.github.io/android-testing-support-library/docs/espresso/basics/index.html">Espresso functional tests</a>, for instance, run on your workstation and do not require a real phone. <a href="https://developer.apple.com/reference/xctest">xcTests</a> and <a href="https://github.com/google/EarlGrey">Earl Grey</a> give you the same on iOS. <br /><br />Debuggers ship with Xcode and Android Studio. If your Android app ships JNI code, it’s a bit trickier, but you can attach GDB to running processes on your phone. 
It’s worth spending the time figuring this out early in the project, so you don’t have to guess what your code is doing. Debugging unit tests is even better and can be done straightforwardly on your workstation. <br /><h2>When Debugging gets Tricky</h2>Some products are harder to debug than others. One example is <a href="https://en.wikipedia.org/wiki/Real-time_computing">hard real-time systems</a>, since their behavior is so dependent on timing (and you better not be hooked up to a real industrial controller or rocket engine when you hit a breakpoint!). One possible solution is to run the software on a fake clock instead of a hardware clock, so the clock stops when the program stops. <br /><br />Another example is multi-process sandboxed programs such as Chromium. Since the browser spawns one renderer process per tab, how do you even attach a debugger to it? The developers have made it quite a lot easier with <a href="https://chromium.googlesource.com/chromium/src/+/master/docs/linux_debugging.md#Multiprocess-Tricks">debugging flags and instructions</a>. For instance, this wraps gdb around each renderer process as it starts up: <br /><br /><pre class="prettyprint">chrome --renderer-cmd-prefix='xterm -title renderer -e gdb --args'<br /></pre><br />The point is, you need to build these kinds of things into your product; this greatly aids hackability. <br /><h2>Proper Logging</h2><div style="border: 1px solid black; padding: 10px;">Read more on the Testing Blog: <a href="http://testing.googleblog.com/2013/06/optimal-logging.html">Optimal Logging</a> by Anthony Vallone </div><br /><br />It’s hackability to get the right logs when you need them. It’s easy to fix a crash if you get a stack trace from the error location. It’s far from guaranteed you’ll get such a stack trace, for instance in C++ programs, but this is something you should not stand for. 
For instance, Chromium had a problem where renderer process crashes didn’t print in test logs, because the test was running in a separate process. This was later fixed, and this kind of investment is worthwhile to make. A clean stack trace is worth a lot more than a “renderer crashed” message. <br /><br />Logs are also useful for development. It’s an art to determine how much logging is appropriate for a given piece of code, but it is a good idea to keep the default level of logging conservative and give developers the option to turn on more logging for the parts they’re working on (example: <a href="https://www.chromium.org/for-testers/enable-logging">Chromium</a>). Too much logging isn’t hackability. <a href="https://googletesting.blogspot.se/2013/06/optimal-logging.html">This article</a> elaborates further on this topic. <br /><br />Logs should also be properly symbolized for C/C++ projects; a naked list of addresses in a stack trace isn’t very helpful. This is easy if you build for development (e.g. with -g), but if the crash happens in a release build it’s a bit trickier. You then need to build the same binary with the same flags and use addr2line / ndk-stack / etc to symbolize the stack trace. It’s a good idea to build tools and scripts for this so it’s as easy as possible. <br /><h2>Monitoring and Statistics</h2>It aids hackability if developers can quickly understand what effect their changes have in the real world. For this, monitoring tools such as <a href="https://cloud.google.com/stackdriver/">Stackdriver for Google Cloud</a> are excellent. If you’re running a service, such tools can help you keep track of request volumes and error rates. This way you can quickly detect that 30% increase in request errors, and roll back that bad code change, before it does too much damage. It also makes it possible to debug your service in production without disrupting it. 
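The release-build symbolization workflow described above lends itself to a small helper script. Below is a minimal sketch; the binary name, trace format, and address pattern are assumptions for illustration, not details from the article:

```python
import re


def extract_addresses(trace: str):
    """Pull raw hex frame addresses out of a crash stack trace.

    In a full script, each address would then be resolved against an
    unstripped binary built with the same flags as the release, e.g.:
        addr2line -f -C -e ./my_server 0x00401a2f
    """
    return re.findall(r"0x[0-9a-fA-F]+", trace)


# Hypothetical crash output from a release build:
trace = "#0 0x00401a2f in ??\n#1 0x00401b10 in ??"
print(extract_addresses(trace))  # ['0x00401a2f', '0x00401b10']
```

Wrapping this extraction plus the addr2line invocation behind one command is exactly the kind of "build tools and scripts for this" investment the article recommends.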
<br /><h2>System Under Test (SUT) Size</h2>Tests and debugging go hand in hand: it’s a lot easier to target a piece of code in a test than in the whole application. Small and focused tests aid debuggability, because when a test breaks there isn’t an enormous SUT to look for errors in. These tests will also be less flaky. <a href="http://googletesting.blogspot.se/2015/04/just-say-no-to-more-end-to-end-tests.html">This article</a> discusses this fact at length. <br /><br /><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-oBuMTeOYhww/V_0IswR5HKI/AAAAAAAAAO0/8jxV5oWkr-ErKGmavjvlRlkVpcGbhnWTACLcB/s1600/img3.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://3.bp.blogspot.com/-oBuMTeOYhww/V_0IswR5HKI/AAAAAAAAAO0/8jxV5oWkr-ErKGmavjvlRlkVpcGbhnWTACLcB/s1600/img3.png" /></a></div><div style="text-align: center;"><em><br /></em><em>Figure 2. The smaller the SUT, the more valuable the test.</em></div><br />You should try to keep the above in mind, particularly when writing integration tests. If you’re testing a mobile app with a server, what bugs are you actually trying to catch? If you’re trying to ensure the app can still talk to the server (i.e. catching protocol drift bugs), you should not involve the UI of the app. That’s not what you’re testing here. Instead, break out the signaling part of the app into a library, test that directly against your local server stack, and write separate tests for the UI that only test the UI. <br /><br />Smaller SUTs also greatly aid test speed, since there’s less to build, less to bring up and less to keep running. In general, strive to keep the SUT as small as possible through whatever means necessary. It will keep the tests smaller, faster and more focused. <br /><h2>Sources</h2>Figure 1: By Courtesy of the Naval Surface Warfare Center, Dahlgren, VA., 1988. - U.S. 
Naval Historical Center Online Library Photograph NH 96566-KN, Public Domain, https://commons.wikimedia.org/w/index.php?curid=165211  <br /><br /><br /><div style="text-align: center;">(Continue to <a href="https://testing.googleblog.com/2016/11/hackable-projects-pillar-3.html">Pillar 3: Infrastructure</a>)</div><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/hackable-projects-pillar-2-debuggability/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Testing on the Toilet: What Makes a Good End-to-End Test?</title>
		<link>https://googledata.org/google-testing/testing-on-the-toilet-what-makes-a-good-end-toend-test/</link>
		<comments>https://googledata.org/google-testing/testing-on-the-toilet-what-makes-a-good-end-toend-test/#comments</comments>
		<pubDate>Wed, 21 Sep 2016 17:23:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=c640d80757b5b95554d3f7ff8d72b120</guid>
		<description><![CDATA[by Adam BenderThis article was adapted from a Google Testing on the Toilet (TotT) episode. You can download a printer-friendly version of this TotT episode and post it in your office.An end-to-end test tests your entire system from one end to the other...]]></description>
				<content:encoded><![CDATA[<em>by Adam Bender</em><br /><em><br /></em><em>This article was adapted from a <a href="http://googletesting.blogspot.com/2007/01/introducing-testing-on-toilet.html">Google Testing on the Toilet</a> (TotT) episode. You can download a <a href="https://docs.google.com/document/d/19aUiPozqFYJ7EqTrzP69ityJ_X1ZCYMmf6Z8nBfuDQo/edit?usp=sharing">printer-friendly version</a> of this TotT episode and post it in your office.</em><br /><br />An end-to-end test tests your entire system from one end to the other, treating everything in between as a black box. <strong><span style="color: #990000;">End-to-end tests can catch bugs that manifest <em>across </em>your entire system</span></strong>. In addition to unit and integration tests, they are a critical part of a balanced testing diet, providing confidence about the health of your system in a near production state. Unfortunately, end-to-end tests are <strong><span style="color: #990000;">slower, more flaky, and more expensive to maintain</span></strong> than unit or integration tests. Consider carefully whether an end-to-end test is warranted, and if so, how best to write one. <br /><br />Let's consider how an end-to-end test might work for the following "login flow":<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-XpDz-avnNhA/V-LBGc9cvtI/AAAAAAAAAOQ/FuzdVujJW8EOTzSUZ0rbUPeg2U2VJh2QgCLcB/s1600/image00.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://1.bp.blogspot.com/-XpDz-avnNhA/V-LBGc9cvtI/AAAAAAAAAOQ/FuzdVujJW8EOTzSUZ0rbUPeg2U2VJh2QgCLcB/s1600/image00.png" /></a></div><br /><br />In order to be cost effective, an end-to-end test should <strong><span style="color: #990000;">focus on aspects of your system that cannot be reliably evaluated with smaller tests</span></strong>, such as resource allocation, concurrency issues and API compatibility. 
More specifically:<br /><ul><li><strong><span style="color: #990000;">For each important use case, there should be one corresponding end-to-end test</span>.</strong> This should include one test for each important class of error. The goal is to keep your total end-to-end test count low. </li><li>Be prepared to <strong><span style="color: #990000;">allocate at least one week a quarter per test to keep your end-to-end tests stable</span> </strong>in the face of issues like slow and flaky dependencies or minor UI changes. </li><li><strong><span style="color: #990000;">Focus your efforts on verifying overall system behavior instead of specific implementation details</span></strong>; for example, when testing login behavior, verify that the process succeeds independent of the exact messages or visual layouts, which may change frequently. </li><li><strong><span style="color: #990000;">Make your end-to-end test easy to debug</span></strong> by providing an overview-level log file, documenting common test failure modes, and preserving all relevant system state information (e.g.: screenshots, database snapshots, etc.).</li></ul>End-to-end tests also come with some important caveats: <br /><ul><li>System components that are owned by other teams may change unexpectedly, and break your tests. This increases overall maintenance cost, but can highlight incompatible changes. </li><li><strong><span style="color: #990000;">It may be more difficult to make an end-to-end test fully hermetic</span></strong>; leftover test data may alter future tests and/or production systems. Where possible, keep your test data ephemeral. </li><li>An end-to-end test often necessitates multiple test doubles (<a href="https://testing.googleblog.com/2013/07/testing-on-toilet-know-your-test-doubles.html">fakes or stubs</a>) for underlying dependencies; they can, however, have a high maintenance burden as they drift from the real implementations over time.</li></ul>]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/testing-on-the-toilet-what-makes-a-good-end-toend-test/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>What Test Engineers do at Google</title>
		<link>https://googledata.org/google-testing/what-test-engineers-do-at-google/</link>
		<comments>https://googledata.org/google-testing/what-test-engineers-do-at-google/#comments</comments>
		<pubDate>Mon, 12 Sep 2016 15:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=f1c9595720596f1bcabbcf3204e38f82</guid>
		<description><![CDATA[<i>by Matt Lowrie, Manjusha Parvathaneni, Benjamin Pick, and Jochen Wuttke </i><br /><br />Test engineers (TEs) at Google are a dedicated group of engineers who use proven testing practices to foster excellence in our products. We orchestrate the rapid testing and releasing of products and features our users rely on. Achieving this velocity requires creative and diverse engineering skills that allow us to advocate for our users. By building testable user journeys into the process, we ensure reliable products. TEs are also the glue that brings together feature stakeholders (product managers, development teams, UX designers, release engineers, beta testers, end users, etc.) to confirm successful product launches. Essentially, every day we ask ourselves, &#8220;How can we make our software development process more efficient to deliver products that make our users happy?&#8221; <br /><br />The TE role grew out of the desire to make Google&#8217;s early free products, like Search, Gmail and Docs, better than similar paid products on the market at the time. Early on in Google&#8217;s history, a small group of engineers believed that the company&#8217;s &#8220;launch and iterate&#8221; approach to software deployment could be improved with continuous automated testing. They took it upon themselves to promote good testing practices to every team throughout the company, via some programs you may have heard about: <a href="http://googletesting.blogspot.com/2007/01/introducing-testing-on-toilet.html">Testing on the Toilet</a>, the <a href="https://mike-bland.com/2011/10/18/test-certified.html">Test Certified Program</a>, and the <a href="https://developers.google.com/google-test-automation-conference/">Google Test Automation Conference (GTAC)</a>. 
These efforts resulted in every project taking ownership of all aspects of testing, such as <a href="http://googletesting.blogspot.com/2014/07/measuring-coverage-at-google.html">code coverage</a> and <a href="http://googletesting.blogspot.com/2007/10/performance-testing.html">performance testing</a>. Testing practices quickly became commonplace throughout the company and engineers writing tests for their own code became the standard. Today, TEs carry on this tradition of setting the standard of quality which all products should achieve. <br /><br />Historically, Google has sustained two separate job titles related to product testing and test infrastructure, which has caused confusion. We often get asked what the difference is between the two. The rebranding of the Software engineer, tools and infrastructure (SETI) role, which now concentrates on engineering productivity, has been addressed in a <a href="http://googletesting.blogspot.com/2016/03/from-qa-to-engineering-productivity.html">previous blog post</a>. What this means for test engineers at Google is an enhanced responsibility of being the authority on product excellence. We are expected to uphold testing standards company-wide, both programmatically and persuasively. <br /><br />Test engineer is a unique role at Google. As TEs, we define and organize our own engineering projects, bridging gaps between engineering output and end-user satisfaction. To give you an idea of what TEs do, here are some examples of challenges we need to solve on any particular day: <br /><ul><li>Automate a manual verification process for product release candidates so developers have more time to respond to potential release-blocking issues. </li><li>Design and implement an automated way to track and surface Android battery usage to developers, so that they know immediately when a new feature will drain users&#8217; batteries. 
</li><li>Quantify whether a regenerated data set used by a product, which contains a billion entities, is of better quality than the data set currently live in production. </li><li>Write an automated test suite that validates that content presented to a user is of an acceptable quality level based on their interests. </li><li>Read an engineering design proposal for a new feature and provide suggestions about how and where to build in testability. </li><li>Investigate correlated stack traces submitted by users through our feedback tracking system, and search the code base to find the correct owner for escalation. </li><li>Collaborate on determining the root cause of a production outage, then pinpoint tests that need to be added to prevent similar outages in the future. </li><li>Organize a task force to advise teams across the company about best practices when testing for accessibility.</li></ul>Over the next few weeks leading up to <a href="https://developers.google.com/google-test-automation-conference/">GTAC</a>, we will also post vignettes of actual TEs working on different projects at Google, to showcase the diversity of the Google Test Engineer role. Stay tuned!]]></description>
				<content:encoded><![CDATA[<i>by Matt Lowrie, Manjusha Parvathaneni, Benjamin Pick, and Jochen Wuttke </i><br /><br />Test engineers (TEs) at Google are a dedicated group of engineers who use proven testing practices to foster excellence in our products. We orchestrate the rapid testing and releasing of products and features our users rely on. Achieving this velocity requires creative and diverse engineering skills that allow us to advocate for our users. By building testable user journeys into the process, we ensure reliable products. TEs are also the glue that brings together feature stakeholders (product managers, development teams, UX designers, release engineers, beta testers, end users, etc.) to confirm successful product launches. Essentially, every day we ask ourselves, “How can we make our software development process more efficient to deliver products that make our users happy?” <br /><br />The TE role grew out of the desire to make Google’s early free products, like Search, Gmail and Docs, better than similar paid products on the market at the time. Early on in Google’s history, a small group of engineers believed that the company’s “launch and iterate” approach to software deployment could be improved with continuous automated testing. They took it upon themselves to promote good testing practices to every team throughout the company, via some programs you may have heard about: <a href="http://googletesting.blogspot.com/2007/01/introducing-testing-on-toilet.html">Testing on the Toilet</a>, the <a href="https://mike-bland.com/2011/10/18/test-certified.html">Test Certified Program</a>, and the <a href="https://developers.google.com/google-test-automation-conference/">Google Test Automation Conference (GTAC)</a>. 
These efforts resulted in every project taking ownership of all aspects of testing, such as <a href="http://googletesting.blogspot.com/2014/07/measuring-coverage-at-google.html">code coverage</a> and <a href="http://googletesting.blogspot.com/2007/10/performance-testing.html">performance testing</a>. Testing practices quickly became commonplace throughout the company and engineers writing tests for their own code became the standard. Today, TEs carry on this tradition of setting the standard of quality which all products should achieve. <br /><br />Historically, Google has sustained two separate job titles related to product testing and test infrastructure, which has caused confusion. We often get asked what the difference is between the two. The rebranding of the Software engineer, tools and infrastructure (SETI) role, which now concentrates on engineering productivity, has been addressed in a <a href="http://googletesting.blogspot.com/2016/03/from-qa-to-engineering-productivity.html">previous blog post</a>. What this means for test engineers at Google is an enhanced responsibility of being the authority on product excellence. We are expected to uphold testing standards company-wide, both programmatically and persuasively. <br /><br />Test engineer is a unique role at Google. As TEs, we define and organize our own engineering projects, bridging gaps between engineering output and end-user satisfaction. To give you an idea of what TEs do, here are some examples of challenges we need to solve on any particular day: <br /><ul><li>Automate a manual verification process for product release candidates so developers have more time to respond to potential release-blocking issues. </li><li>Design and implement an automated way to track and surface Android battery usage to developers, so that they know immediately when a new feature will drain users’ batteries. 
</li><li>Quantify whether a regenerated data set used by a product, which contains a billion entities, is of better quality than the data set currently live in production. </li><li>Write an automated test suite that validates that content presented to a user is of an acceptable quality level based on their interests. </li><li>Read an engineering design proposal for a new feature and provide suggestions about how and where to build in testability. </li><li>Investigate correlated stack traces submitted by users through our feedback tracking system, and search the code base to find the correct owner for escalation. </li><li>Collaborate on determining the root cause of a production outage, then pinpoint tests that need to be added to prevent similar outages in the future. </li><li>Organize a task force to advise teams across the company about best practices when testing for accessibility.</li></ul>Over the next few weeks leading up to <a href="https://developers.google.com/google-test-automation-conference/">GTAC</a>, we will also post vignettes of actual TEs working on different projects at Google, to showcase the diversity of the Google Test Engineer role. Stay tuned!]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/what-test-engineers-do-at-google/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Hackable Projects &#8211; Pillar 1: Code Health</title>
		<link>https://googledata.org/google-testing/hackable-projects/</link>
		<comments>https://googledata.org/google-testing/hackable-projects/#comments</comments>
		<pubDate>Thu, 18 Aug 2016 18:14:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=1f7b92bfac3914f912b89814af70d050</guid>
		<description><![CDATA[<em>By: Patrik H&#246;glund </em><br /><h1>Introduction</h1>Software development is difficult. Projects often evolve over several years, under changing requirements and shifting market conditions, impacting developer tools and infrastructure. Technical debt, slow build systems, poor debuggability, and increasing numbers of dependencies can weigh down a project. The developers get weary, and cobwebs accumulate in dusty corners of the code base. <br /><br />Fighting these issues can be taxing and feel like a quixotic undertaking, but don&#8217;t worry &#8212; the Google Testing Blog is riding to the rescue! This is the first article of a series on &#8220;hackability&#8221; that identifies some of the issues that hinder software projects and outlines what Google SETIs usually do about them. <br /><br /><a href="https://en.wiktionary.org/wiki/hackable">According to Wiktionary</a>, hackable is defined as: <br /><div><b>Adjective</b><br /><b>hackable</b> (comparative <b>more hackable</b>, superlative <b>most hackable</b>) <br /><ol><li>(computing) That can be hacked or broken into; insecure, vulnerable.&#160;</li><li>That lends itself to hacking (technical tinkering and modification); moddable.</li></ol></div><br />Obviously, we&#8217;re not going to talk about making your product more vulnerable (by, say, rolling your own crypto or something equally unwise); instead, we will focus on the second definition, which essentially means &#8220;something that is easy to work on.&#8221; This has become the <a href="http://googletesting.blogspot.com/2016/03/from-qa-to-engineering-productivity.html">main focus for SETIs at Google</a> as the role has evolved over the years.<br /><h2>In Practice</h2>In a hackable project, it&#8217;s easy to try things and hard to break things. Hackability means fast feedback cycles that offer useful information to the developer. 
<br /><br />This is hackability: <br /><ul><li>Developing is easy </li><li>Fast build </li><li>Good, fast tests </li><li>Clean code </li><li>Easy running + debugging </li><li>One-click rollbacks</li></ul>In contrast, what is not hackability? <br /><ul><li>Broken HEAD (tip-of-tree) </li><li>Slow presubmit (i.e. <a href="https://www.chromium.org/developers/how-tos/depottools/presubmit-scripts">checks running before submit</a>) </li><li>Builds take hours </li><li>Incremental build/link &#62; 30s </li><li><a href="http://googletesting.blogspot.com/2016/05/flaky-tests-at-google-and-how-we.html">Flaky tests</a></li><li>Can&#8217;t attach debugger </li><li>Logs full of uninteresting information</li></ul><h2>The Three Pillars of Hackability</h2>There are a number of tools and practices that foster hackability. When everything is in place, it feels great to work on the product. Basically no time is spent on figuring out why things are broken, and all time is spent on what matters, which is understanding and working with the code. I believe there are three main pillars that support hackability. If one of them is absent, hackability will suffer. They are: <br /><br /><div><a href="https://3.bp.blogspot.com/-esLisB8th9M/V7XlS7pViVI/AAAAAAAAANs/lSJgkuiOF5w2Eqf1Yt_yaAbHqdwCYlg-gCLcB/s1600/image03.png"><img border="0" height="203" src="https://3.bp.blogspot.com/-esLisB8th9M/V7XlS7pViVI/AAAAAAAAANs/lSJgkuiOF5w2Eqf1Yt_yaAbHqdwCYlg-gCLcB/s320/image03.png" width="320"></a></div><br /><h1>Pillar 1: Code Health</h1><div>&#8220;I found Rome a city of bricks, and left it a city of marble.&#8221;<br />&#160; &#160;-- Augustus</div><br />Keeping the code in good shape is critical for hackability. It&#8217;s a lot harder to tinker and modify something if you don&#8217;t understand what it does (or if it&#8217;s full of hidden traps, for that matter). <br /><h2>Tests</h2>Unit and small integration tests are probably the best things you can do for hackability. 
They&#8217;re a support you can lean on while making your changes, and they contain lots of good information on what the code does. It isn&#8217;t hackability to boot a slow UI and click buttons on every iteration to verify your change worked - it is hackability to run a sub-second set of unit tests! In contrast, end-to-end (E2E) tests generally help hackability much less (and can <a href="http://googletesting.blogspot.com/2015/04/just-say-no-to-more-end-to-end-tests.html">even be a hindrance</a> if they, or the product, are in sufficiently bad shape). <br /><br /><div><a href="https://2.bp.blogspot.com/-H0OCdokbaVw/V7XlkhRcJcI/AAAAAAAAANw/z-v9qXyVig8dspmHHnRu6FQscqSOtvrTgCLcB/s1600/image01.png"><img border="0" height="276" src="https://2.bp.blogspot.com/-H0OCdokbaVw/V7XlkhRcJcI/AAAAAAAAANw/z-v9qXyVig8dspmHHnRu6FQscqSOtvrTgCLcB/s320/image01.png" width="320"></a></div><div><em>Figure 1: the Testing Pyramid.</em></div><br />I&#8217;ve always been interested in how you actually make unit tests happen in a team. It&#8217;s about education. Writing a product such that it has good unit tests is actually a hard problem. It requires knowledge of dependency injection, testing/mocking frameworks, language idioms and refactoring. The difficulty varies by language as well. Writing unit tests in Go or Java is quite easy and natural, whereas in C++ it can be very difficult (and it isn&#8217;t exactly ingrained in C++ culture to write unit tests). <br /><br />It&#8217;s important to educate your developers about unit tests. Sometimes, it is appropriate to lead by example and help review unit tests as well. You can have a large impact on a project by establishing a pattern of unit testing early. If tons of code gets written without unit tests, it will be much harder to add unit tests later. <br /><br />What if you already have tons of poorly tested legacy code? The answer is refactoring and adding tests as you go. 
It&#8217;s hard work, but each line you add a test for is one more line that is easier to hack on. <br /><h2>Readable Code and Code Review</h2>At Google, &#8220;readability&#8221; is a special committer status that is granted per language (C++, Go, Java and so on). It means that a person not only knows the language and its culture and idioms well, but also can write clean, well tested and well structured code. Readability literally means that you&#8217;re a guardian of Google&#8217;s code base and should push back on hacky and ugly code. The use of a <a href="https://github.com/google/styleguide">style guide</a> enforces consistency, and <a href="https://en.wikipedia.org/wiki/Code_review">code review</a> (where at least one person with readability must approve) ensures the code upholds high quality. Engineers must take care to not depend too much on &#8220;review buddies&#8221; here but really make sure to pull in the person that can give the best feedback. <br /><br />Requiring code reviews naturally results in small changes, as reviewers often get grumpy if you dump huge changelists in their lap (at least if reviewers are somewhat fast to respond, which they should be). This is a good thing, since small changes are less risky and are easy to roll back. Furthermore, code review is good for knowledge sharing. You can also do pair programming if your team prefers that (a pair-programmed change is considered reviewed and can be submitted when both engineers are happy). There are multiple open-source review tools out there, such as <a href="https://www.gerritcodereview.com/">Gerrit</a>. <br /><br />Nice, clean code is great for hackability, since you don&#8217;t need to spend time to unwind that nasty pointer hack in your head before making your changes. How do you make all this happen in practice? 
Put together workshops on, say, the <a href="https://en.wikipedia.org/wiki/SOLID_(object-oriented_design)">SOLID principles</a>, unit testing, or concurrency to encourage developers to learn. Spread knowledge through code review, pair programming and mentoring (such as with the Readability concept). You can&#8217;t just mandate higher code quality; it takes a lot of work, effort and consistency. <br /><h2>Presubmit Testing and Lint</h2>Consistently formatted source code aids hackability. You can scan code faster if its formatting is consistent. Automated tooling also aids hackability. It really doesn&#8217;t make sense to waste any time on formatting source code by hand. You should be using tools like <a href="https://golang.org/cmd/gofmt/">gofmt</a>, <a href="http://clang.llvm.org/docs/ClangFormat.html">clang-format</a>, etc. If the patch isn&#8217;t formatted properly, you should see something like this (<a href="https://cs.chromium.org/chromium/tools/depot_tools/presubmit_canned_checks.py?q=%22requires+formatting%22&#38;sq=package:chromium&#38;l=1151">example from Chrome</a>): <br /><br /><pre>$ git cl upload<br />Error: the media/audio directory requires formatting. Please run <br />    git cl format media/audio.</pre><br />Source formatting isn&#8217;t the only thing to check. In fact, you should check pretty much anything you have as a rule in your project. Should other modules not depend on the internals of your modules? <a href="https://cs.chromium.org/chromium/src/buildtools/checkdeps/builddeps.py?q=check_deps&#38;sq=package:chromium&#38;dr=C">Enforce it with a check</a>. Are there already inappropriate dependencies in your project? Whitelist the existing ones for now, but at least block new bad dependencies from forming. Should our app work on Android 16 phones and newer? <a href="https://bugs.chromium.org/p/webrtc/issues/detail?id=5063">Add linting</a>, so we don&#8217;t use level 17+ APIs without gating at runtime. 
Should your project&#8217;s <a href="https://en.wikipedia.org/wiki/VHDL">VHDL</a> code always place-and-route cleanly on a particular brand of <a href="https://en.wikipedia.org/wiki/Field-programmable_gate_array">FPGA</a>? Invoke the layout tool in your presubmit and stop the submit if the layout process fails. <br /><br />Presubmit is the most valuable real estate for aiding hackability. You have limited space in your presubmit, but you can get tremendous value out of it if you put the right things there. You should stop all obvious errors here. <br /><br />It aids hackability to have all this tooling in place, so you don&#8217;t break things for other developers and waste time going back to fix them. Remember you need to maintain the presubmit well; it&#8217;s not hackability to have a slow, overbearing or buggy presubmit. Having a good presubmit can make it tremendously more pleasant to work on a project. We&#8217;re going to talk more in later articles on how to build infrastructure for submit queues and presubmit. <br /><h2>Single Branch And Reducing Risk</h2>Having a single branch for everything, and putting risky new changes behind <a href="https://en.wikipedia.org/wiki/Feature_toggle">feature flags</a>, aids hackability since branches and forks often amass tremendous risk when it&#8217;s time to merge them. Single branches smooth out the risk. Furthermore, running all your tests on many branches is expensive. However, a single branch can have negative effects on hackability if Team A depends on a library from Team B and gets broken by Team B a lot. Having some kind of stabilization on Team B&#8217;s software might be a good idea there. <a href="https://googletesting.blogspot.com/2015/05/multi-repository-development.html">This article</a> covers such situations, and how to integrate often with your dependencies to reduce the risk that one of them will break you. <br /><h2>Loose Coupling and Testability</h2>Tightly coupled code is terrible for hackability. 
To take the most ridiculous example I know: I once heard of a computer game where a developer changed a ballistics algorithm and broke the game&#8217;s <em>chat</em>. That&#8217;s hilarious, but hardly intuitive for the poor developer that made the change. A hallmark of loosely coupled code is that it&#8217;s upfront about its dependencies and behavior and is easy to modify and move around. <br /><br />Loose coupling, coherence and so on is really about design and architecture and is notoriously hard to measure. It really takes experience. One of the best ways to convey such experience is through code review, which we&#8217;ve already mentioned. Education on the SOLID principles, rules of thumb such as tell-don&#8217;t-ask, discussions about anti-patterns and code smells are all good here. Again, it&#8217;s hard to build tooling for this. You could write a presubmit check that forbids methods longer than 20 lines or cyclomatic complexity over 30, but that&#8217;s probably shooting yourself in the foot. Developers would consider that overbearing rather than a helpful assist. <br /><br />SETIs at Google are expected to give input on a product&#8217;s testability. A few well-placed test hooks in your product can enable tremendously powerful testing, such as serving mock content for apps (this enables you to meaningfully test app UI without contacting your real servers, for instance). Testability can also have an influence on architecture. For instance, it&#8217;s a testability problem if your servers are built like a huge monolith that is slow to build and start, or if it can&#8217;t boot on localhost without calling external services. We&#8217;ll cover this in the next article. <br /><h2>Aggressively Reduce Technical Debt</h2>It&#8217;s quite easy to add a lot of code and dependencies and call it a day when the software works. 
New projects can do this without many problems, but as the project becomes older it becomes a &#8220;legacy&#8221; project, weighed down by dependencies and excess code. Don&#8217;t end up there. It&#8217;s bad for hackability to have a slew of bug fixes stacked on top of unwise and obsolete decisions, and understanding and untangling the software becomes more difficult. <br /><br />What constitutes technical debt varies by project and is something you need to learn from experience. It simply means the software isn&#8217;t in optimal form. Some types of technical debt are easy to classify, such as dead code and barely-used dependencies. Some types are harder to identify, such as when the architecture of the project has grown unfit for the task as requirements changed. We can&#8217;t use tooling to help with the latter, but we can with the former. <br /><br />I already mentioned that <a href="https://cs.chromium.org/chromium/src/buildtools/checkdeps/builddeps.py?q=check_deps&#38;sq=package:chromium&#38;dr=C">dependency enforcement</a> can go a long way toward keeping people honest. It helps make sure people are making the appropriate trade-offs instead of just slapping on a new dependency, and it requires them to explain to a fellow engineer when they want to override a dependency rule. This can prevent unhealthy dependencies like circular dependencies, abstract modules depending on concrete modules, or modules depending on the internals of other modules. <br /><br />There are various tools available for visualizing dependency graphs as well. You can use these to get a grip on your current situation and start cleaning up dependencies. If you have a huge dependency you only use a small part of, maybe you can replace it with something simpler. If an old part of your app has inappropriate dependencies and other problems, maybe it&#8217;s time to rewrite that part. 
<br /><br /><div>(Continue to&#160;<a href="https://testing.googleblog.com/2016/10/hackable-projects-pillar-2-debuggability.html">Pillar 2: Debuggability</a>)</div>]]></description>
				<content:encoded><![CDATA[<em>By: Patrik Höglund </em><br /><h1>Introduction</h1>Software development is difficult. Projects often evolve over several years, under changing requirements and shifting market conditions, impacting developer tools and infrastructure. Technical debt, slow build systems, poor debuggability, and increasing numbers of dependencies can weigh down a project. The developers get weary, and cobwebs accumulate in dusty corners of the code base. <br /><br />Fighting these issues can be taxing and feel like a quixotic undertaking, but don’t worry — the Google Testing Blog is riding to the rescue! This is the first article of a series on “hackability” that identifies some of the issues that hinder software projects and outlines what Google SETIs usually do about them. <br /><br /><a href="https://en.wiktionary.org/wiki/hackable">According to Wiktionary</a>, hackable is defined as: <br /><div style="border: 1px solid black; padding: 5px;"><b>Adjective</b><br /><b>hackable</b> ‎(comparative <b>more hackable</b>, superlative <b>most hackable</b>) <br /><ol><li>(computing) That can be hacked or broken into; insecure, vulnerable.&nbsp;</li><li>That lends itself to hacking (technical tinkering and modification); moddable.</li></ol></div><br />Obviously, we’re not going to talk about making your product more vulnerable (by, say, rolling your own crypto or something equally unwise); instead, we will focus on the second definition, which essentially means “something that is easy to work on.” This has become the <a href="http://googletesting.blogspot.com/2016/03/from-qa-to-engineering-productivity.html">main focus for SETIs at Google</a> as the role has evolved over the years.<br /><h2>In Practice</h2>In a hackable project, it’s easy to try things and hard to break things. Hackability means fast feedback cycles that offer useful information to the developer. 
<br /><br />This is hackability: <br /><ul><li>Developing is easy </li><li>Fast build </li><li>Good, fast tests </li><li>Clean code </li><li>Easy running + debugging </li><li>One-click rollbacks</li></ul>In contrast, what is not hackability? <br /><ul><li>Broken HEAD (tip-of-tree) </li><li>Slow presubmit (i.e. <a href="https://www.chromium.org/developers/how-tos/depottools/presubmit-scripts">checks running before submit</a>) </li><li>Builds take hours </li><li>Incremental build/link &gt; 30s </li><li><a href="http://googletesting.blogspot.com/2016/05/flaky-tests-at-google-and-how-we.html">Flaky tests</a></li><li>Can’t attach debugger </li><li>Logs full of uninteresting information</li></ul><h2>The Three Pillars of Hackability</h2>There are a number of tools and practices that foster hackability. When everything is in place, it feels great to work on the product. Basically no time is spent on figuring out why things are broken, and all time is spent on what matters, which is understanding and working with the code. I believe there are three main pillars that support hackability. If one of them is absent, hackability will suffer. They are: <br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-esLisB8th9M/V7XlS7pViVI/AAAAAAAAANs/lSJgkuiOF5w2Eqf1Yt_yaAbHqdwCYlg-gCLcB/s1600/image03.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="203" src="https://3.bp.blogspot.com/-esLisB8th9M/V7XlS7pViVI/AAAAAAAAANs/lSJgkuiOF5w2Eqf1Yt_yaAbHqdwCYlg-gCLcB/s320/image03.png" width="320" /></a></div><br /><h1>Pillar 1: Code Health</h1><div style="border: 1px solid black; padding: 5px;">“I found Rome a city of bricks, and left it a city of marble.”<br />&nbsp; &nbsp;-- Augustus</div><br />Keeping the code in good shape is critical for hackability. It’s a lot harder to tinker and modify something if you don’t understand what it does (or if it’s full of hidden traps, for that matter). 
<br /><h2>Tests</h2>Unit and small integration tests are probably the best things you can do for hackability. They’re a support you can lean on while making your changes, and they contain lots of good information on what the code does. It isn’t hackability to boot a slow UI and click buttons on every iteration to verify your change worked - it is hackability to run a sub-second set of unit tests! In contrast, end-to-end (E2E) tests generally help hackability much less (and can <a href="http://googletesting.blogspot.com/2015/04/just-say-no-to-more-end-to-end-tests.html">even be a hindrance</a> if they, or the product, are in sufficiently bad shape). <br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-H0OCdokbaVw/V7XlkhRcJcI/AAAAAAAAANw/z-v9qXyVig8dspmHHnRu6FQscqSOtvrTgCLcB/s1600/image01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="276" src="https://2.bp.blogspot.com/-H0OCdokbaVw/V7XlkhRcJcI/AAAAAAAAANw/z-v9qXyVig8dspmHHnRu6FQscqSOtvrTgCLcB/s320/image01.png" width="320" /></a></div><div style="text-align: center;"><em>Figure 1: the Testing Pyramid.</em></div><br />I’ve always been interested in how you actually make unit tests happen in a team. It’s about education. Writing a product such that it has good unit tests is actually a hard problem. It requires knowledge of dependency injection, testing/mocking frameworks, language idioms and refactoring. The difficulty varies by language as well. Writing unit tests in Go or Java is quite easy and natural, whereas in C++ it can be very difficult (and it isn’t exactly ingrained in C++ culture to write unit tests). <br /><br />It’s important to educate your developers about unit tests. Sometimes, it is appropriate to lead by example and help review unit tests as well. You can have a large impact on a project by establishing a pattern of unit testing early. 
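To make the dependency-injection point concrete, here is a minimal sketch (all class and method names are hypothetical, not from any real product) of a class whose collaborator is injected, so a test can substitute a mock and run in milliseconds using Python's standard unittest and unittest.mock:

```python
import unittest
from unittest import mock

class GreetingService:
    """Hypothetical production class: the mailer is injected rather than
    constructed internally, so tests can substitute a test double."""

    def __init__(self, mailer):
        self.mailer = mailer

    def greet(self, user):
        message = "Hello, %s!" % user
        self.mailer.send(user, message)
        return message

class GreetingServiceTest(unittest.TestCase):
    def test_greet_sends_mail(self):
        # No UI to boot, no real mail server to contact: the test is sub-second.
        mailer = mock.Mock()
        service = GreetingService(mailer)

        result = service.greet("alice")

        self.assertEqual(result, "Hello, alice!")
        mailer.send.assert_called_once_with("alice", "Hello, alice!")

if __name__ == "__main__":
    unittest.main()
```

The same shape works with Mockito in Java or hand-rolled fakes in Go; the key is that the dependency crosses the constructor boundary instead of being hard-wired.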
If tons of code gets written without unit tests, it will be much harder to add unit tests later. <br /><br />What if you already have tons of poorly tested legacy code? The answer is refactoring and adding tests as you go. It’s hard work, but each line you add a test for is one more line that is easier to hack on. <br /><h2>Readable Code and Code Review</h2>At Google, “readability” is a special committer status that is granted per language (C++, Go, Java and so on). It means that a person not only knows the language and its culture and idioms well, but also can write clean, well tested and well structured code. Readability literally means that you’re a guardian of Google’s code base and should push back on hacky and ugly code. The use of a <a href="https://github.com/google/styleguide">style guide</a> enforces consistency, and <a href="https://en.wikipedia.org/wiki/Code_review">code review</a> (where at least one person with readability must approve) ensures the code upholds high quality. Engineers must take care to not depend too much on “review buddies” here but really make sure to pull in the person that can give the best feedback. <br /><br />Requiring code reviews naturally results in small changes, as reviewers often get grumpy if you dump huge changelists in their lap (at least if reviewers are somewhat fast to respond, which they should be). This is a good thing, since small changes are less risky and are easy to roll back. Furthermore, code review is good for knowledge sharing. You can also do pair programming if your team prefers that (a pair-programmed change is considered reviewed and can be submitted when both engineers are happy). There are multiple open-source review tools out there, such as <a href="https://www.gerritcodereview.com/">Gerrit</a>. <br /><br />Nice, clean code is great for hackability, since you don’t need to spend time to unwind that nasty pointer hack in your head before making your changes. 
How do you make all this happen in practice? Put together workshops on, say, the <a href="https://en.wikipedia.org/wiki/SOLID_(object-oriented_design)">SOLID principles</a>, unit testing, or concurrency to encourage developers to learn. Spread knowledge through code review, pair programming and mentoring (such as with the Readability concept). You can’t just mandate higher code quality; it takes a lot of work, effort and consistency. <br /><h2>Presubmit Testing and Lint</h2>Consistently formatted source code aids hackability. You can scan code faster if its formatting is consistent. Automated tooling also aids hackability. It really doesn’t make sense to waste any time on formatting source code by hand. You should be using tools like <a href="https://golang.org/cmd/gofmt/">gofmt</a>, <a href="http://clang.llvm.org/docs/ClangFormat.html">clang-format</a>, etc. If the patch isn’t formatted properly, you should see something like this (<a href="https://cs.chromium.org/chromium/tools/depot_tools/presubmit_canned_checks.py?q=%22requires+formatting%22&amp;sq=package:chromium&amp;l=1151">example from Chrome</a>): <br /><br /><pre style="background: #f0f0f0; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;">$ git cl upload<br />Error: the media/audio directory requires formatting. Please run <br />    git cl format media/audio.</pre><br />Source formatting isn’t the only thing to check. In fact, you should check pretty much anything you have as a rule in your project. Should other modules not depend on the internals of your modules? <a href="https://cs.chromium.org/chromium/src/buildtools/checkdeps/builddeps.py?q=check_deps&amp;sq=package:chromium&amp;dr=C">Enforce it with a check</a>. Are there already inappropriate dependencies in your project? Whitelist the existing ones for now, but at least block new bad dependencies from forming. Should our app work on Android 16 phones and newer? 
<a href="https://bugs.chromium.org/p/webrtc/issues/detail?id=5063">Add linting</a>, so we don’t use level 17+ APIs without gating at runtime. Should your project’s <a href="https://en.wikipedia.org/wiki/VHDL">VHDL</a> code always place-and-route cleanly on a particular brand of <a href="https://en.wikipedia.org/wiki/Field-programmable_gate_array">FPGA</a>? Invoke the layout tool in your presubmit and stop the submit if the layout process fails. <br /><br />Presubmit is the most valuable real estate for aiding hackability. You have limited space in your presubmit, but you can get tremendous value out of it if you put the right things there. You should stop all obvious errors here. <br /><br />It aids hackability to have all this tooling in place, so you don’t break things for other developers and waste time going back to fix them. Remember you need to maintain the presubmit well; it’s not hackability to have a slow, overbearing or buggy presubmit. Having a good presubmit can make it tremendously more pleasant to work on a project. We’re going to talk more in later articles on how to build infrastructure for submit queues and presubmit. <br /><h2>Single Branch And Reducing Risk</h2>Having a single branch for everything, and putting risky new changes behind <a href="https://en.wikipedia.org/wiki/Feature_toggle">feature flags</a>, aids hackability since branches and forks often amass tremendous risk when it’s time to merge them. Single branches smooth out the risk. Furthermore, running all your tests on many branches is expensive. However, a single branch can have negative effects on hackability if Team A depends on a library from Team B and gets broken by Team B a lot. Having some kind of stabilization on Team B’s software might be a good idea there. <a href="https://googletesting.blogspot.com/2015/05/multi-repository-development.html">This article</a> covers such situations, and how to integrate often with your dependencies to reduce the risk that one of them will break you. 
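A feature flag, at its simplest, is just a guarded code path: the risky rewrite lives on the main branch but stays dark until a configuration change flips it on, and rollback is instant. A minimal sketch (flag names and paths are hypothetical):

```python
# Hypothetical flag registry; in practice this would be loaded from
# configuration so flipping a flag does not require a code merge.
FLAGS = {
    "use_new_renderer": False,  # risky rewrite, off by default
}

def flag_enabled(name):
    return FLAGS.get(name, False)

def render(page):
    if flag_enabled("use_new_renderer"):
        return "new:" + page   # new, risky path; disable the flag to roll back
    return "old:" + page       # stable path, the default
```

Everyone commits to the same branch; the new path is exercised in tests (and perhaps for a fraction of users) long before it becomes the default, which smooths out the merge-day risk that long-lived branches accumulate.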
<br /><h2>Loose Coupling and Testability</h2>Tightly coupled code is terrible for hackability. To take the most ridiculous example I know: I once heard of a computer game where a developer changed a ballistics algorithm and broke the game’s <em>chat</em>. That’s hilarious, but hardly intuitive for the poor developer that made the change. A hallmark of loosely coupled code is that it’s upfront about its dependencies and behavior and is easy to modify and move around. <br /><br />Loose coupling, coherence and so on is really about design and architecture and is notoriously hard to measure. It really takes experience. One of the best ways to convey such experience is through code review, which we’ve already mentioned. Education on the SOLID principles, rules of thumb such as tell-don’t-ask, discussions about anti-patterns and code smells are all good here. Again, it’s hard to build tooling for this. You could write a presubmit check that forbids methods longer than 20 lines or cyclomatic complexity over 30, but that’s probably shooting yourself in the foot. Developers would consider that overbearing rather than a helpful assist. <br /><br />SETIs at Google are expected to give input on a product’s testability. A few well-placed test hooks in your product can enable tremendously powerful testing, such as serving mock content for apps (this enables you to meaningfully test app UI without contacting your real servers, for instance). Testability can also have an influence on architecture. For instance, it’s a testability problem if your servers are built like a huge monolith that is slow to build and start, or if it can’t boot on localhost without calling external services. We’ll cover this in the next article. <br /><h2>Aggressively Reduce Technical Debt</h2>It’s quite easy to add a lot of code and dependencies and call it a day when the software works. 
New projects can do this without many problems, but as the project becomes older it becomes a “legacy” project, weighed down by dependencies and excess code. Don’t end up there. It’s bad for hackability to have a slew of bug fixes stacked on top of unwise and obsolete decisions, and understanding and untangling the software becomes more difficult. <br /><br />What constitutes technical debt varies by project and is something you need to learn from experience. It simply means the software isn’t in optimal form. Some types of technical debt are easy to classify, such as dead code and barely-used dependencies. Some types are harder to identify, such as when the architecture of the project has grown unfit for the task as requirements changed. We can’t use tooling to help with the latter, but we can with the former. <br /><br />I already mentioned that <a href="https://cs.chromium.org/chromium/src/buildtools/checkdeps/builddeps.py?q=check_deps&amp;sq=package:chromium&amp;dr=C">dependency enforcement</a> can go a long way toward keeping people honest. It helps make sure people are making the appropriate trade-offs instead of just slapping on a new dependency, and it requires them to explain to a fellow engineer when they want to override a dependency rule. This can prevent unhealthy dependencies like circular dependencies, abstract modules depending on concrete modules, or modules depending on the internals of other modules. <br /><br />There are various tools available for visualizing dependency graphs as well. You can use these to get a grip on your current situation and start cleaning up dependencies. If you have a huge dependency you only use a small part of, maybe you can replace it with something simpler. If an old part of your app has inappropriate dependencies and other problems, maybe it’s time to rewrite that part. 
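The core of a dependency-enforcement check can be approximated with a small allowlist rule; the following sketch is a toy illustration (module names and rules are hypothetical, and this is not how Chromium's checkdeps works internally):

```python
# Hypothetical allowlist: each module names the modules it may depend on.
# Existing bad edges can be grandfathered into the allowlist, while any
# new, unlisted dependency is reported and blocked at presubmit time.
ALLOWED_DEPS = {
    "app":  {"lib", "util"},
    "lib":  {"util"},
    "util": set(),  # leaf module: may depend on nothing
}

def check_deps(imports):
    """imports: list of (from_module, to_module) pairs; returns violations."""
    violations = []
    for src, dst in imports:
        if dst not in ALLOWED_DEPS.get(src, set()):
            violations.append("%s must not depend on %s" % (src, dst))
    return violations

# "util" importing "app" would create a cycle, so it is flagged:
violations = check_deps([("app", "lib"), ("util", "app")])
```

Run against the import graph in a presubmit, a rule like this turns "please don't add that dependency" from a code-review plea into an enforced invariant, and overriding it requires editing the allowlist where a reviewer will see it.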
<br /><br /><div style="text-align: center;">(Continue to&nbsp;<a href="https://testing.googleblog.com/2016/10/hackable-projects-pillar-2-debuggability.html">Pillar 2: Debuggability</a>)</div>]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/hackable-projects/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Inquiry Method for Test Planning</title>
		<link>https://googledata.org/google-testing/the-inquiry-method-for-test-planning/</link>
		<comments>https://googledata.org/google-testing/the-inquiry-method-for-test-planning/#comments</comments>
		<pubDate>Mon, 06 Jun 2016 13:05:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=06e9c37addfafbb13bffb6c093d14f2a</guid>
		<description><![CDATA[<i>by <a href="http://anthonyvallone.com/">Anthony Vallone</a></i><br /><div><i>updated: July 2016</i></div><br /><br /><br /><br /><div>Creating a test plan is often a complex undertaking. An ideal test plan is accomplished by applying basic principles of <a href="https://en.wikipedia.org/wiki/Cost%E2%80%93benefit_analysis">cost-benefit analysis</a> and <a href="https://en.wikipedia.org/wiki/Risk_analysis">risk analysis</a>, optimally balancing these software development factors:</div><div><ul><li><b>Implementation cost</b>: The time and complexity of implementing testable features and automated tests for specific scenarios will vary, and this affects short-term development cost.</li><li><b>Maintenance cost</b>: Some tests or test plans may vary from easy to difficult to maintain, and this affects long-term development cost. When manual testing is chosen, this also adds to long-term cost.</li><li><b>Monetary cost</b>: Some test approaches may require billed resources.</li><li><b>Benefit</b>: Tests are capable of preventing issues and aiding productivity by varying degrees. Also, the earlier they can catch problems in the development life-cycle, the greater the benefit.</li><li><b>Risk</b>: The probability of failure scenarios may vary from rare to likely, and their consequences may vary from minor nuisance to catastrophic.</li></ul></div><div>Effectively balancing these factors in a plan depends heavily on project criticality, implementation details, resources available, and team opinions. Many projects can achieve outstanding coverage with high-benefit, low-cost unit tests, but they may need to weigh options for larger tests and complex corner cases. Mission critical projects must minimize risk as much as possible, so they will accept higher costs and invest heavily in rigorous testing at all levels.</div><div><br /></div><div>This guide puts the onus on the reader to find the right balance for their project. 
Also, it does not provide a test plan template, because templates are often too generic or too specific and quickly become outdated. Instead, it focuses on selecting the best content when writing a test plan.</div><div><br /><br /></div><div><h3>Test plan vs. strategy</h3><div><br />Before proceeding, two common methods for defining test plans need to be clarified:</div><div><ul><li><b>Single test plan</b>: Some projects have a single "test plan" that describes all implemented and planned testing for the project.</li><li><b>Single test strategy and many plans</b>: Some projects have a "test strategy" document as well as many smaller "test plan" documents. Strategies typically cover the overall test approach and goals, while plans cover specific features or project updates.</li></ul></div><div>Either of these may be embedded in and integrated with project design documents. Both of these methods work well, so choose whichever makes sense for your project. Generally speaking, stable projects benefit from a single plan, whereas rapidly changing projects are best served by infrequently changed strategies and frequently added plans.</div><div><br /></div><div>For the purpose of this guide, I will refer to both test document types simply as "test plans". If you have multiple documents, just apply the advice below to your document aggregation.</div></div><div><br /><br /></div><div><h3>Content selection</h3><div><br />A good approach to creating content for your test plan is to start by listing all questions that need answers. The lists below provide a comprehensive collection of important questions that may or may not apply to your project. Go through the lists and select all that apply. By answering these questions, you will form the contents for your test plan, and you should structure your plan around the chosen content in any format your team prefers. 
Be sure to balance the factors as mentioned above when making decisions.</div></div><div><br /><br /></div><div><h3>Prerequisites</h3><div><br /></div><div><ul><li><b>Do you need a test plan?</b> If there is no project design document or a clear vision for the product, it may be too early to write a test plan.</li><li><b>Has testability been considered in the project design?</b> Before a project gets too far into implementation, all scenarios must be designed as testable, preferably via automation. Both project design documents and test plans should comment on testability as needed.</li><li><b>Will you keep the plan up-to-date?</b> If so, be careful about adding too much detail, otherwise it may be difficult to maintain the plan.</li><li><b>Does this quality effort overlap with other teams?</b> If so, how have you deduplicated the work?</li></ul></div></div><div><br /></div><div><h3>Risk</h3><div><br /></div><div><ul><li><b>Are there any significant project risks, and how will you mitigate them?</b> Consider: <ul><li>Injury to people or animals</li><li>Security and integrity of user data</li><li>User privacy</li><li>Security of company systems</li><li>Hardware or property damage</li><li>Legal and compliance issues</li><li>Exposure of confidential or sensitive data</li><li>Data loss or corruption</li><li>Revenue loss</li><li>Unrecoverable scenarios</li><li>SLAs</li><li>Performance requirements</li><li>Misinforming users</li><li>Impact to other projects</li><li>Impact from other projects</li><li>Impact to company&#8217;s public image</li><li>Loss of productivity</li></ul></li><li><b>What are the project&#8217;s technical vulnerabilities?</b> Consider: <ul><li>Features or components known to be hacky, fragile, or in great need of refactoring</li><li>Dependencies or platforms that frequently cause issues</li><li>Possibility for users to cause harm to the system</li><li>Trends seen in past issues</li></ul></li></ul></div></div><div><br /><h3>Coverage</h3><div><br 
/></div><ul><li><b>What does the test surface look like?</b> Is it a simple library with one method, or a multi-platform client-server stateful system with a combinatorial explosion of use cases? Describe the design and architecture of the system in a way that highlights possible points of failure.</li><li><b>What platforms are supported?</b> Consider listing supported operating systems, hardware, devices, etc. Also describe how testing will be performed and reported for each platform.</li><li><b>What are the features?</b> Consider making a summary list of all features and describe how certain categories of features will be tested.</li><li><b>What will not be tested?</b> No test suite covers every possibility. It&#8217;s best to be up-front about this and provide rationale for not testing certain cases. Examples: low risk areas that are a low priority, complex cases that are a low priority, areas covered by other teams, features not ready for testing, etc.&#160;</li><li><b>What is covered by unit (small), integration (medium), and system (large) tests?</b> Always test as much as possible in smaller tests, leaving fewer cases for larger tests. Describe how certain categories of test cases are best tested by each test size and provide rationale.</li><li><b>What will be tested manually vs. automated?</b> When feasible and cost-effective, automation is usually best. Many projects can automate all testing. However, there may be good reasons to choose manual testing. 
Describe the types of cases that will be tested manually and provide rationale.</li><li><b>How are you covering each test category?</b> Consider: <ul><li><a href="http://www.w3.org/wiki/Accessibility_testing">accessibility</a></li><li><a href="http://en.wikipedia.org/wiki/Functional_testing">functional</a></li><li><a href="http://en.wikipedia.org/wiki/Fuzz_testing">fuzz</a></li><li>internationalization and localization</li><li><a href="http://en.wikipedia.org/wiki/Software_performance_testing">performance</a>, <a href="http://en.wikipedia.org/wiki/Load_testing">load</a>, <a href="http://en.wikipedia.org/wiki/Stress_testing">stress</a>, and <a href="https://en.wikipedia.org/wiki/Soak_testing">endurance</a> (aka soak)</li><li>privacy</li><li><a href="http://en.wikipedia.org/wiki/Security_testing">security</a></li><li><a href="http://en.wikipedia.org/wiki/Smoke_testing_(software)">smoke</a></li><li><a href="http://en.wikipedia.org/wiki/Stability_testing">stability</a></li><li><a href="http://en.wikipedia.org/wiki/Usability_testing">usability</a></li></ul></li><li><b>Will you use static and/or dynamic analysis tools?</b> Both <a href="https://en.wikipedia.org/wiki/Static_program_analysis">static analysis tools</a> and <a href="https://en.wikipedia.org/wiki/Dynamic_program_analysis">dynamic analysis tools</a> can find problems that are hard to catch in reviews and testing, so consider using them.</li><li><b>How will system components and dependencies be stubbed, mocked, faked, staged, or used normally during testing?</b> There are good reasons to do each of these, and they each have a unique impact on coverage.</li><li><b>What builds are your tests running against?</b> Are tests running against a build from HEAD (aka tip), a staged build, and/or a release candidate? 
If only from HEAD, how will you test release build cherry picks (selection of individual changelists for a release) and system configuration changes not normally seen by builds from HEAD?</li><li><b>What kind of testing will be done outside of your team?</b> Examples: <ul><li><a href="https://en.wikipedia.org/wiki/Eating_your_own_dog_food">Dogfooding</a></li><li>External crowdsource testing</li><li>Public alpha/beta versions (how will they be tested before releasing?)</li><li>External trusted testers</li></ul></li><li><b>How are data migrations tested?</b> You may need special testing to compare before and after migration results.</li><li><b>Do you need to be concerned with backward compatibility?</b> You may own previously distributed clients or there may be other systems that depend on your system&#8217;s protocol, configuration, features, and behavior.</li><li><b>Do you need to test upgrade scenarios for server/client/device software or dependencies/platforms/APIs that the software utilizes?</b></li><li><b>Do you have line coverage goals?</b></li></ul></div><div><br /><h3>Tooling and Infrastructure</h3><div><br /></div><ul><li><b>Do you need new test frameworks?</b> If so, describe these or add design links in the plan.</li><li><b>Do you need a new test lab setup?</b> If so, describe these or add design links in the plan.</li><li><b>If your project offers a service to other projects, are you providing test tools to those users?</b> Consider providing mocks, fakes, and/or reliable staged servers for users trying to test their integration with your system.</li><li><b>For end-to-end testing, how will test infrastructure, systems under test, and other dependencies be managed?</b> How will they be deployed? How will persistence be set-up/torn-down? How will you handle required migrations from one datacenter to another?</li><li><b>Do you need tools to help debug system or test failures? 
</b>You may be able to use existing tools, or you may need to develop new ones.</li></ul><br /><h3>Process</h3><div><br /></div><ul><li><b>Are there test schedule requirements? </b>What time commitments have been made? Which tests will be in place (or which test feedback will be provided) by what dates? Are some tests important to deliver before others?</li><li><b>How are builds and tests run continuously?</b> Most small tests will be run by <a href="https://en.wikipedia.org/wiki/Continuous_integration">continuous integration</a> tools, but large tests may need a different approach. Alternatively, you may opt for running large tests as-needed.&#160;</li><li><b>How will build and test results be reported and monitored? </b><ul><li>Do you have a team rotation to monitor continuous integration?</li><li>Large tests might require monitoring by someone with expertise.</li><li>Do you need a dashboard for test results and other project health indicators?</li><li>Who will get email alerts and how?</li><li>Will the person monitoring tests simply report results verbally to the team?</li></ul></li><li><b>How are tests used when releasing? </b><ul><li>Are they run explicitly against the release candidate, or does the release process depend only on continuous test results?&#160;</li><li>If system components and dependencies are released independently, are tests run for each type of release?&#160;</li><li>Will a "release blocker" bug stop the release manager(s) from actually releasing? Is there agreement on the release-blocking criteria?</li><li>When performing canary releases (aka % rollouts), how will progress be monitored and tested?</li></ul></li><li><b>How will external users report bugs?</b> Consider feedback links or other similar tools to collect and cluster reports.</li><li><b>How does bug triage work? </b>Consider labels or categories for bugs in order for them to land in a triage bucket.
Also make sure the teams responsible for filing bugs and/or creating the bug report template are aware of this. Are you using one bug tracker, or do you need to set up an automatic or manual import routine?</li><li><b>Do you have a policy for submitting new tests before closing bugs that could have been caught?</b></li><li><b>How are tests used for unsubmitted changes?</b> If anyone can run all tests against any experimental build (a good thing), consider providing a howto.</li><li><b>How can team members create and/or debug tests?</b> Consider providing a howto.</li></ul><br /><h3>Utility</h3><div><br /></div><ul><li><b>Who are the test plan readers?</b> Some test plans are only read by a few people, while others are read by many. At a minimum, you should consider getting a review from all stakeholders (project managers, tech leads, feature owners). When writing the plan, be sure to understand the expected readers, provide them with enough background to understand the plan, and answer all questions you think they will have - even if your answer is that you don&#8217;t have an answer yet. Also consider adding contacts for the test plan, so any reader can get more information.</li><li><b>How can readers review the actual test cases?</b> Manual cases might be in a test case management tool, in a separate document, or included in the test plan. Consider providing links to directories containing automated test cases.</li><li><b>Do you need traceability between requirements, features, and tests?</b></li><li><b>Do you have any general product health or quality goals, and how will you measure success?</b> Consider: <ul><li>Release cadence</li><li>Number of bugs caught by users in production</li><li>Number of bugs caught in release testing</li><li>Number of open bugs over time</li><li>Code coverage</li><li>Cost of manual testing</li><li>Difficulty of creating new tests</li></ul></li></ul></div><br /><br />]]></description>
				<content:encoded><![CDATA[<i>by <a href="http://anthonyvallone.com/">Anthony Vallone</a></i><br /><div style="text-align: center;"><i>updated: July 2016</i></div><br /><br /><br /><br /><div>Creating a test plan is often a complex undertaking. An ideal test plan is achieved by applying basic principles of <a href="https://en.wikipedia.org/wiki/Cost%E2%80%93benefit_analysis">cost-benefit analysis</a> and <a href="https://en.wikipedia.org/wiki/Risk_analysis">risk analysis</a>, optimally balancing these software development factors:</div><div><ul><li><b>Implementation cost</b>: The time and complexity of implementing testable features and automated tests for specific scenarios will vary, and this affects short-term development cost.</li><li><b>Maintenance cost</b>: Tests and test plans can range from easy to difficult to maintain, and this affects long-term development cost. When manual testing is chosen, this also adds to long-term cost.</li><li><b>Monetary cost</b>: Some test approaches may require billed resources.</li><li><b>Benefit</b>: Tests are capable of preventing issues and aiding productivity to varying degrees. Also, the earlier they can catch problems in the development life-cycle, the greater the benefit.</li><li><b>Risk</b>: The probability of failure scenarios may vary from rare to likely, and their consequences may vary from minor nuisance to catastrophic.</li></ul></div><div>Effectively balancing these factors in a plan depends heavily on project criticality, implementation details, resources available, and team opinions. Many projects can achieve outstanding coverage with high-benefit, low-cost unit tests, but they may need to weigh options for larger tests and complex corner cases. 
Mission-critical projects must minimize risk as much as possible, so they will accept higher costs and invest heavily in rigorous testing at all levels.</div><div><br /></div><div>This guide puts the onus on the reader to find the right balance for their project. Also, it does not provide a test plan template, because templates are often too generic or too specific and quickly become outdated. Instead, it focuses on selecting the best content when writing a test plan.</div><div><br /><br /></div><div><h3>Test plan vs. strategy</h3><div><br />Before proceeding, two common methods for defining test plans need to be clarified:</div><div><ul><li><b>Single test plan</b>: Some projects have a single "test plan" that describes all implemented and planned testing for the project.</li><li><b>Single test strategy and many plans</b>: Some projects have a "test strategy" document as well as many smaller "test plan" documents. Strategies typically cover the overall test approach and goals, while plans cover specific features or project updates.</li></ul></div><div>Either of these may be embedded in and integrated with project design documents. Both of these methods work well, so choose whichever makes sense for your project. Generally speaking, stable projects benefit from a single plan, whereas rapidly changing projects are best served by infrequently changed strategies and frequently added plans.</div><div><br /></div><div>For the purpose of this guide, I will refer to both test document types simply as "test plans". If you have multiple documents, just apply the advice below across all of your documents.</div></div><div><br /><br /></div><div><h3>Content selection</h3><div><br />A good approach to creating content for your test plan is to start by listing all questions that need answers. The lists below provide a comprehensive collection of important questions that may or may not apply to your project. Go through the lists and select all that apply. 
By answering these questions, you will form the contents for your test plan, and you should structure your plan around the chosen content in any format your team prefers. Be sure to balance the factors as mentioned above when making decisions.</div></div><div><br /><br /></div><div><h3>Prerequisites</h3><div><br /></div><div><ul><li><b>Do you need a test plan?</b> If there is no project design document or a clear vision for the product, it may be too early to write a test plan.</li><li><b>Has testability been considered in the project design?</b> Before a project gets too far into implementation, all scenarios must be designed as testable, preferably via automation. Both project design documents and test plans should comment on testability as needed.</li><li><b>Will you keep the plan up-to-date?</b> If so, be careful about adding too much detail, otherwise it may be difficult to maintain the plan.</li><li><b>Does this quality effort overlap with other teams?</b> If so, how have you deduplicated the work?</li></ul></div></div><div><br /></div><div><h3>Risk</h3><div><br /></div><div><ul><li><b>Are there any significant project risks, and how will you mitigate them?</b> Consider: <ul><li>Injury to people or animals</li><li>Security and integrity of user data</li><li>User privacy</li><li>Security of company systems</li><li>Hardware or property damage</li><li>Legal and compliance issues</li><li>Exposure of confidential or sensitive data</li><li>Data loss or corruption</li><li>Revenue loss</li><li>Unrecoverable scenarios</li><li>SLAs</li><li>Performance requirements</li><li>Misinforming users</li><li>Impact to other projects</li><li>Impact from other projects</li><li>Impact to company’s public image</li><li>Loss of productivity</li></ul></li><li><b>What are the project’s technical vulnerabilities?</b> Consider: <ul><li>Features or components known to be hacky, fragile, or in great need of refactoring</li><li>Dependencies or platforms that frequently cause 
issues</li><li>Possibility for users to cause harm to the system</li><li>Trends seen in past issues</li></ul></li></ul></div></div><div><br /><h3>Coverage</h3><div><br /></div><ul><li><b>What does the test surface look like?</b> Is it a simple library with one method, or a multi-platform client-server stateful system with a combinatorial explosion of use cases? Describe the design and architecture of the system in a way that highlights possible points of failure.</li><li><b>What platforms are supported?</b> Consider listing supported operating systems, hardware, devices, etc. Also describe how testing will be performed and reported for each platform.</li><li><b>What are the features?</b> Consider making a summary list of all features and describe how certain categories of features will be tested.</li><li><b>What will not be tested?</b> No test suite covers every possibility. It’s best to be up-front about this and provide rationale for not testing certain cases. Examples: low risk areas that are a low priority, complex cases that are a low priority, areas covered by other teams, features not ready for testing, etc.&nbsp;</li><li><b>What is covered by unit (small), integration (medium), and system (large) tests?</b> Always test as much as possible in smaller tests, leaving fewer cases for larger tests. Describe how certain categories of test cases are best tested by each test size and provide rationale.</li><li><b>What will be tested manually vs. automated?</b> When feasible and cost-effective, automation is usually best. Many projects can automate all testing. However, there may be good reasons to choose manual testing. 
Describe the types of cases that will be tested manually and provide rationale.</li><li><b>How are you covering each test category?</b> Consider: <ul><li><a href="http://www.w3.org/wiki/Accessibility_testing">accessibility</a></li><li><a href="http://en.wikipedia.org/wiki/Functional_testing">functional</a></li><li><a href="http://en.wikipedia.org/wiki/Fuzz_testing">fuzz</a></li><li>internationalization and localization</li><li><a href="http://en.wikipedia.org/wiki/Software_performance_testing">performance</a>, <a href="http://en.wikipedia.org/wiki/Load_testing">load</a>, <a href="http://en.wikipedia.org/wiki/Stress_testing">stress</a>, and <a href="https://en.wikipedia.org/wiki/Soak_testing">endurance</a> (aka soak)</li><li>privacy</li><li><a href="http://en.wikipedia.org/wiki/Security_testing">security</a></li><li><a href="http://en.wikipedia.org/wiki/Smoke_testing_(software)">smoke</a></li><li><a href="http://en.wikipedia.org/wiki/Stability_testing">stability</a></li><li><a href="http://en.wikipedia.org/wiki/Usability_testing">usability</a></li></ul></li><li><b>Will you use static and/or dynamic analysis tools?</b> Both <a href="https://en.wikipedia.org/wiki/Static_program_analysis">static analysis tools</a> and <a href="https://en.wikipedia.org/wiki/Dynamic_program_analysis">dynamic analysis tools</a> can find problems that are hard to catch in reviews and testing, so consider using them.</li><li><b>How will system components and dependencies be stubbed, mocked, faked, staged, or used normally during testing?</b> There are good reasons to do each of these, and they each have a unique impact on coverage.</li><li><b>What builds are your tests running against?</b> Are tests running against a build from HEAD (aka tip), a staged build, and/or a release candidate? 
If only from HEAD, how will you test release build cherry picks (selection of individual changelists for a release) and system configuration changes not normally seen by builds from HEAD?</li><li><b>What kind of testing will be done outside of your team?</b> Examples: <ul><li><a href="https://en.wikipedia.org/wiki/Eating_your_own_dog_food">Dogfooding</a></li><li>External crowdsource testing</li><li>Public alpha/beta versions (how will they be tested before releasing?)</li><li>External trusted testers</li></ul></li><li><b>How are data migrations tested?</b> You may need special testing to compare before and after migration results.</li><li><b>Do you need to be concerned with backward compatibility?</b> You may own previously distributed clients or there may be other systems that depend on your system’s protocol, configuration, features, and behavior.</li><li><b>Do you need to test upgrade scenarios for server/client/device software or dependencies/platforms/APIs that the software utilizes?</b></li><li><b>Do you have line coverage goals?</b></li></ul></div><div><br /><h3>Tooling and Infrastructure</h3><div><br /></div><ul><li><b>Do you need new test frameworks?</b> If so, describe these or add design links in the plan.</li><li><b>Do you need a new test lab setup?</b> If so, describe these or add design links in the plan.</li><li><b>If your project offers a service to other projects, are you providing test tools to those users?</b> Consider providing mocks, fakes, and/or reliable staged servers for users trying to test their integration with your system.</li><li><b>For end-to-end testing, how will test infrastructure, systems under test, and other dependencies be managed?</b> How will they be deployed? How will persistence be set-up/torn-down? How will you handle required migrations from one datacenter to another?</li><li><b>Do you need tools to help debug system or test failures? 
</b>You may be able to use existing tools, or you may need to develop new ones.</li></ul><br /><h3>Process</h3><div><br /></div><ul><li><b>Are there test schedule requirements? </b>What time commitments have been made? Which tests will be in place (or which test feedback will be provided) by what dates? Are some tests important to deliver before others?</li><li><b>How are builds and tests run continuously?</b> Most small tests will be run by <a href="https://en.wikipedia.org/wiki/Continuous_integration">continuous integration</a> tools, but large tests may need a different approach. Alternatively, you may opt for running large tests as-needed.&nbsp;</li><li><b>How will build and test results be reported and monitored? </b><ul><li>Do you have a team rotation to monitor continuous integration?</li><li>Large tests might require monitoring by someone with expertise.</li><li>Do you need a dashboard for test results and other project health indicators?</li><li>Who will get email alerts and how?</li><li>Will the person monitoring tests simply report results verbally to the team?</li></ul></li><li><b>How are tests used when releasing? </b><ul><li>Are they run explicitly against the release candidate, or does the release process depend only on continuous test results?&nbsp;</li><li>If system components and dependencies are released independently, are tests run for each type of release?&nbsp;</li><li>Will a "release blocker" bug stop the release manager(s) from actually releasing? Is there agreement on the release-blocking criteria?</li><li>When performing canary releases (aka % rollouts), how will progress be monitored and tested?</li></ul></li><li><b>How will external users report bugs?</b> Consider feedback links or other similar tools to collect and cluster reports.</li><li><b>How does bug triage work? </b>Consider labels or categories for bugs in order for them to land in a triage bucket. 
Also make sure the teams responsible for filing bugs and/or creating the bug report template are aware of this. Are you using one bug tracker, or do you need to set up an automatic or manual import routine?</li><li><b>Do you have a policy for submitting new tests before closing bugs that could have been caught?</b></li><li><b>How are tests used for unsubmitted changes?</b> If anyone can run all tests against any experimental build (a good thing), consider providing a howto.</li><li><b>How can team members create and/or debug tests?</b> Consider providing a howto.</li></ul><br /><h3>Utility</h3><div><br /></div><ul><li><b>Who are the test plan readers?</b> Some test plans are only read by a few people, while others are read by many. At a minimum, you should consider getting a review from all stakeholders (project managers, tech leads, feature owners). When writing the plan, be sure to understand the expected readers, provide them with enough background to understand the plan, and answer all questions you think they will have - even if your answer is that you don’t have an answer yet. Also consider adding contacts for the test plan, so any reader can get more information.</li><li><b>How can readers review the actual test cases?</b> Manual cases might be in a test case management tool, in a separate document, or included in the test plan. Consider providing links to directories containing automated test cases.</li><li><b>Do you need traceability between requirements, features, and tests?</b></li><li><b>Do you have any general product health or quality goals, and how will you measure success?</b> Consider: <ul><li>Release cadence</li><li>Number of bugs caught by users in production</li><li>Number of bugs caught in release testing</li><li>Number of open bugs over time</li><li>Code coverage</li><li>Cost of manual testing</li><li>Difficulty of creating new tests</li></ul></li></ul></div><br /><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/the-inquiry-method-for-test-planning/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>GTAC 2016 Registration Deadline Extended</title>
		<link>https://googledata.org/google-testing/gtac-2016-registration-deadline-extended/</link>
		<comments>https://googledata.org/google-testing/gtac-2016-registration-deadline-extended/#comments</comments>
		<pubDate>Fri, 03 Jun 2016 03:58:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=f993a359737758ed8f4c06d9b81f1f70</guid>
		<description><![CDATA[by Sonal Shah on behalf of the GTAC CommitteeOur goal in organizing GTAC each year is to make it a first-class conference, dedicated to presenting leading edge industry practices. The quality of submissions we've received for GTAC 2016 so far has been ...]]></description>
				<content:encoded><![CDATA[<i>by Sonal Shah on behalf of the GTAC Committee</i><br /><br />Our goal in organizing GTAC each year is to make it a first-class conference, dedicated to presenting leading edge industry practices. The quality of submissions we've received for GTAC 2016 so far has been overwhelming. In order to include the best talks possible, we are extending the deadline for speaker and attendee submissions by 15 days. The new timelines are as follows:<br /><br /><strike>June 1, 2016</strike> <b>June 15, 2016</b> - Last day for speaker, attendee and diversity scholarship submissions.<br /><strike>June 15, 2016</strike> <b>July 15, 2016</b> - Attendees and scholarship awardees will be notified of selection/rejection/waitlist status. Those on the waitlist will be notified as space becomes available.<br /><strike>August 15, 2016</strike> <b>August 29, 2016</b> - Selected speakers will be notified.<br /><br />To register, please fill out <a href="https://docs.google.com/a/google.com/forms/d/1gT3jZtP03a38zaJ6gFUXb74PW6MugustSem938UWv3c/viewform">this</a> form.<br />To apply for diversity scholarship, please fill out <a href="https://docs.google.com/a/google.com/forms/d/1PozpfkbSH3YX_BVbXRhju5jkZNy-gGcs45dnKIvIEsE/edit?usp=sharing">this</a> form.<br /><br />The <a href="https://developers.google.com/google-test-automation-conference/2016/faq-conference">GTAC website</a> has a list of frequently asked questions. Please do not hesitate to contact <a href="mailto:gtac2016@google.com">gtac2016@google.com</a> if you still have any questions.<br /><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/gtac-2016-registration-deadline-extended/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Flaky Tests at Google and How We Mitigate Them</title>
		<link>https://googledata.org/google-testing/flaky-tests-at-google-and-how-we-mitigate-them/</link>
		<comments>https://googledata.org/google-testing/flaky-tests-at-google-and-how-we-mitigate-them/#comments</comments>
		<pubDate>Sat, 28 May 2016 00:31:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=b651c219e977eff6fcba7d5cd09baba1</guid>
		<description><![CDATA[by John MiccoAt Google, we run a very large corpus of tests continuously to validate our code submissions. Everyone from developers to project managers rely on the results of these tests to make decisions about whether the system is ready for deploymen...]]></description>
				<content:encoded><![CDATA[<i>by John Micco</i><br /><br />At Google, we run a very large corpus of tests continuously to validate our code submissions. Everyone from developers to project managers relies on the results of these tests to make decisions about whether the system is ready for deployment or whether code changes are OK to submit. Productivity for developers at Google relies on the ability of the tests to find real problems with the code being changed or developed in a timely and reliable fashion.<br /><br />Tests are run before submission (pre-submit testing), which gates submission and verifies that changes are acceptable, and again after submission (post-submit testing) to decide whether the project is ready to be released. In both cases, all of the tests for a particular project must report a passing result before submitting code or releasing a project.<br /><br />Unfortunately, across our entire corpus of tests, we see a continual rate of about 1.5% of all test runs reporting a "flaky" result. We define a "flaky" test as one that exhibits both a passing and a failing result with the same code. There are many <a href="http://www.cs.umd.edu/~atif/pubs/gao-icse15.pdf">root causes</a> of flaky results, including concurrency, reliance on non-deterministic or undefined behaviors, flaky third-party code, infrastructure problems, etc. We have invested a lot of effort in removing flakiness from tests, but overall the insertion rate is about the same as the fix rate, meaning we are stuck with a certain rate of tests that provide value but occasionally produce a flaky result. Almost 16% of our tests have some level of flakiness associated with them! 
This is a staggering number; it means that more than 1 in 7 of the tests written by our world-class engineers occasionally fail in a way not caused by changes to the code or tests.<br /><br />When doing post-submit testing, our Continuous Integration (CI) system identifies when a passing test transitions to failing, so that we can investigate the code submission that caused the failure. What we find in practice is that about 84% of the transitions we observe from pass to fail involve a flaky test! This causes extra repetitive work to determine whether a new failure is a flaky result or a legitimate failure. It is quite common to ignore legitimate failures in flaky tests due to the high number of false positives. At the very least, build monitors typically wait for additional CI cycles to run the test again to determine whether the test has been broken by a submission, which adds to the delay in identifying real problems and increases the pool of changes that could be responsible.<br /><br />In addition to the cost of build monitoring, consider that the average project contains 1000 or so individual tests. To release a project, we require that all these tests pass with the latest code changes. If 1.5% of test results are flaky, 15 tests will likely fail, requiring expensive investigation by a build cop or developer. In some cases, developers dismiss a failing result as flaky only to later realize that it was a legitimate failure caused by the code. It is human nature to ignore alarms when there is a history of false signals coming from a system. For example, see <a href="http://www.travelweekly.com/Travel-News/Airline-News/Analysis-shows-pilots-often-ignore-Boeing-737-cockpit-alarm">this article</a> about airline pilots ignoring an alarm on 737s. The same phenomenon occurs with pre-submit testing. The same 15 or so failing tests block submission and introduce costly delays into the core development process. 
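The back-of-the-envelope numbers above can be checked directly. This is an illustrative sketch, assuming (an assumption of mine, not a claim from the post) that each test flakes independently with the same 1.5% probability:

```python
n_tests = 1000      # tests in a typical project, per the post
flake_rate = 0.015  # ~1.5% of test runs report a flaky result

# Expected number of spurious failures in one full run of the suite.
expected_flaky = n_tests * flake_rate

# Probability that a full run of the suite comes back entirely green,
# under the independence assumption stated above.
p_all_green = (1 - flake_rate) ** n_tests

print(round(expected_flaky))  # ~15 spurious failures per run
print(p_all_green)            # on the order of 1e-07: an all-green run is vanishingly rare
```

Under that independence assumption, essentially every full run of a 1000-test suite contains at least one flaky failure, which is why the 15 or so spurious failures per release run are a recurring cost rather than an occasional one.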
Ignoring legitimate failures at this stage results in the submission of broken code.<br /><br />We have several mitigation strategies for flaky tests during pre-submit testing, including the ability to re-run only failing tests and an option to re-run tests automatically when they fail. We even have a way to denote a test as flaky, causing it to report a failure only if it fails 3 times in a row. This reduces false positives, but it encourages developers to ignore flakiness in their own tests unless their tests start failing 3 times in a row, which is hardly a perfect solution.<br />Imagine a 15-minute integration test marked as flaky that is broken by my code submission. The breakage will not be discovered until 3 executions of the test complete, or 45 minutes, after which it will need to be determined whether the test is broken (and needs to be fixed) or whether the test just flaked three times in a row.<br /><br />Other mitigation strategies include:<br /><ul><li>A tool that monitors the flakiness of tests and, if the flakiness is too high, automatically quarantines the test. Quarantining removes the test from the critical path and files a bug for developers to reduce the flakiness. This prevents the test from becoming a problem for developers, but it could easily mask a real race condition or some other bug in the code being tested.</li><li>Another tool that detects changes in the flakiness level of tests and works to identify the code change that caused the shift.</li></ul><br />In summary, test flakiness is an important problem, and Google is continuing to invest in detecting, mitigating, tracking, and fixing test flakiness throughout our code base. 
For example:<br /><ul><li>We have a new team dedicated to providing accurate and timely information about test flakiness to help developers and build monitors so that they know whether they are being harmed by test flakiness.</li><li>As we analyze the data from flaky test executions, we are seeing promising correlations with features that should enable us to identify a flaky result accurately without re-running the test.</li></ul><br /><br />By continually advancing the state of the art for teams at Google, we aim to remove the friction caused by test flakiness from the core developer workflows.<br /><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/flaky-tests-at-google-and-how-we-mitigate-them/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>GTAC Diversity Scholarship</title>
		<link>https://googledata.org/google-testing/gtac-diversity-scholarship/</link>
		<comments>https://googledata.org/google-testing/gtac-diversity-scholarship/#comments</comments>
		<pubDate>Wed, 04 May 2016 14:32:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=c3102e9d34c77fd4dadd6cd27231a327</guid>
		<description><![CDATA[by Lesley Katzen on behalf of the GTAC Diversity CommitteeWe are committed to increasing diversity at GTAC, and we believe the best way to do that is by making sure we have a diverse set of applicants to speak and attend. As part of that commitment, we...]]></description>
				<content:encoded><![CDATA[<i>by Lesley Katzen on behalf of the GTAC Diversity Committee</i><br /><br />We are committed to increasing diversity at GTAC, and we believe the best way to do that is by making sure we have a diverse set of applicants to speak and attend. As part of that commitment, we are excited to announce that we will be offering travel scholarships this year.<br />Travel scholarships will be available for selected applicants from traditionally underrepresented groups in technology.<br /><br />To be eligible for a grant to attend GTAC, applicants must:<br /><br /><ul><li>Be 18 years of age or older.</li><li>Be from a traditionally underrepresented group in technology.</li><li>Work or study in Computer Science, Computer Engineering, Information Technology, or a technical field related to software testing.</li><li>Be able to attend the core dates of GTAC, November 15th - 16th, 2016, in Sunnyvale, CA.</li></ul><br /><br /><b>To apply:</b><br />Please fill out the following <a href="https://docs.google.com/a/google.com/forms/d/1PozpfkbSH3YX_BVbXRhju5jkZNy-gGcs45dnKIvIEsE/edit?usp=sharing">form</a> to be considered for a travel scholarship.<br />The deadline for submission is <strike>June 1st</strike> June 15th. Scholarship recipients will be announced on <strike>June 30th</strike> July 15th. If you are selected, we will contact you with information on how to proceed with booking travel.<br /><br /><br /><b>What the scholarship covers:</b><br />Google will pay for standard coach-class airfare for selected scholarship recipients to San Francisco or San Jose, and 3 nights of accommodations in a hotel near the Sunnyvale campus. Breakfast and lunch will be provided for GTAC attendees and speakers on both days of the conference. We will also provide a $50.00 gift card for other incidentals such as airport transportation or meals. 
You will need to provide your own credit card to cover any hotel incidentals.<br /><br /><br />Google is dedicated to providing a harassment-free and inclusive conference experience for everyone. Our anti-harassment policy can be found at:<br /><a href="https://www.google.com/events/policy/anti-harassmentpolicy.html">https://www.google.com/events/policy/anti-harassmentpolicy.html</a>]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/gtac-diversity-scholarship/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>GTAC 2016 Registration is now open!</title>
		<link>https://googledata.org/google-testing/gtac-2016-registration-is-now-open/</link>
		<comments>https://googledata.org/google-testing/gtac-2016-registration-is-now-open/#comments</comments>
		<pubDate>Wed, 04 May 2016 14:27:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=5b62bf7f3bc4a74e3ab3ba8d784398b6</guid>
		<description><![CDATA[by Sonal Shah on behalf of the GTAC CommitteeThe GTAC (Google Test Automation Conference) 2016 application process is now open for presentation proposals and attendance. GTAC will be held at the Google Sunnyvale office on November 15th - 16th, 2016.GTA...]]></description>
				<content:encoded><![CDATA[<i>by Sonal Shah on behalf of the GTAC Committee</i><br /><br />The GTAC (Google Test Automation Conference) 2016 application process is now open for presentation proposals and attendance. GTAC will be held at the <a href="https://goo.gl/maps/ehTmRgiXqzN2">Google Sunnyvale office</a> on November 15th - 16th, 2016.<br /><br />GTAC will be streamed live on YouTube again this year, so even if you cannot attend in person, you will be able to watch the conference remotely. We will post the livestream information as we get closer to the event, and recordings will be posted afterwards.<br /><br /><b>Speakers</b><br />Presentations are targeted at students, academics, and experienced engineers working on test automation. Full presentations are 30 minutes and lightning talks are 10 minutes. Speakers should be prepared for a question and answer session following their presentation.<br /><br /><b>Application</b><br />For presentation proposals and/or attendance, <a href="https://docs.google.com/a/google.com/forms/d/1gT3jZtP03a38zaJ6gFUXb74PW6MugustSem938UWv3c/viewform">complete this form</a>. We will be selecting about 25 talks and 300 attendees for the event. The selection process is not first come, first served (no need to rush your application), and we select a diverse group of engineers from various locations, company sizes, and technical backgrounds.<br /><br /><b>Deadline</b><br />The due date for both presentation and attendance applications is <strike>June 1st, 2016</strike> June 15, 2016.<br /><br /><b>Cost</b><br />There are no registration fees, but speakers and attendees must arrange and pay for their own travel and accommodations.<br /><br /><b>More information</b><br />Please read our FAQ for the most common questions:<br /><a href="https://developers.google.com/google-test-automation-conference/2016/faq">https://developers.google.com/google-test-automation-conference/2016/faq</a>.]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/gtac-2016-registration-is-now-open/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>GTAC 2016 &#8211; Save the Date</title>
		<link>https://googledata.org/google-testing/gtac-2016-save-the-date/</link>
		<comments>https://googledata.org/google-testing/gtac-2016-save-the-date/#comments</comments>
		<pubDate>Fri, 08 Apr 2016 22:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=bc5b2779e445c26214c6190379b45985</guid>
		<description><![CDATA[by Sonal Shah on behalf of the GTAC CommitteeWe are pleased to announce that the tenth GTAC (Google Test Automation Conference) will be held on Google&#8217;s campus in Sunnyvale (California, USA) on Tuesday and Wednesday, November 15th and 16th, 2016. Bas...]]></description>
				<content:encoded><![CDATA[<i>by Sonal Shah on behalf of the GTAC Committee</i><br /><div><br /><br /></div><div><div>We are pleased to announce that the tenth GTAC (Google Test Automation Conference) will be held on Google’s campus in <a href="https://goo.gl/maps/ehTmRgiXqzN2">Sunnyvale</a> (California, USA) on Tuesday and Wednesday, November 15th and 16th, 2016. &nbsp;</div><div><br /><br /></div><div>Based on feedback from the last GTAC (2015) and the increasing demand every year, we have decided to keep GTAC on a fall schedule. This schedule is a change from what we previously announced.</div><div><br /><br /></div><div>The schedule for the next few months is:</div><div><br /></div><div><b>May 1, 2016</b> &nbsp;- Registration opens for speakers and attendees.</div><div><strike>June 1, 2016</strike>&nbsp;<b>June 15, 2016</b> - Registration closes for speaker and attendee submissions.</div><div><strike>June 30, 2016</strike>&nbsp;<b>July 15, 2016</b> - Selected attendees will be notified.</div><div><strike>August 15, 2016</strike>&nbsp;<b>August 29, 2016</b> - Selected speakers will be notified.</div><div><b>November 14, 2016</b> - Rehearsal day for speakers.</div><div><b>November 15-16, 2016 </b>- GTAC 2016!</div><div><br /><br /></div><div>As part of our efforts to increase diversity of speakers and attendees at GTAC, &nbsp;we will be offering travel scholarships for selected applicants from traditionally underrepresented groups in technology.</div><div><br /><br /></div><div>Stay tuned to this blog and the <a href="http://g.co/gtac">GTAC website</a> for information about attending or presenting at GTAC. Please do not hesitate to contact <a href="mailto:gtac2016@google.com">gtac2016@google.com</a> if you have any questions. We look forward to seeing you there!</div></div><div><br /></div>]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/gtac-2016-save-the-date/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>From QA to Engineering Productivity</title>
		<link>https://googledata.org/google-testing/from-qa-to-engineering-productivity/</link>
		<comments>https://googledata.org/google-testing/from-qa-to-engineering-productivity/#comments</comments>
		<pubDate>Tue, 22 Mar 2016 20:24:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=be93ffbb85a38846163755bcae18a8c6</guid>
		<description><![CDATA[By Ari Shamash In Google&#8217;s early days, a small handful of software engineers built, tested, and released software. But as the user-base grew and products proliferated, engineers started specializing in roles, creating more scale in the development p...]]></description>
				<content:encoded><![CDATA[<i>By Ari Shamash   </i><br /><br />In Google’s early days, a small handful of software engineers built, tested, and released software. But as the user-base grew and products proliferated, engineers started specializing in roles, creating more scale in the development process: <br /><br /><ul><li>Test Engineers (TEs) -- &nbsp;tested new products and systems integration</li><li>Release Engineers (REs) -- &nbsp;pushed bits into production</li><li>Site Reliability Engineers (SREs) -- &nbsp;managed systems and data centers 24x7.</li></ul><br />This story focuses on the evolution of quality assurance and the roles of the engineers behind it at Google. &nbsp;The REs and SREs also evolved, but we’ll leave that for another day.<br /><br />Initially, teams relied heavily on manual operations. &nbsp;When we attempted to automate testing, we largely focused on the frontends, which worked, because Google was small and our products had fewer integrations. &nbsp;However, as Google grew, longer and longer manual test cycles bogged down iterations and delayed feature launches. &nbsp;Also, since we identified bugs later in the development cycle, it took us <a href="http://csse.usc.edu/TECHRPTS/1986/usccse86-501/usccse86-501.pdf">longer and longer</a> to <a href="http://dl.acm.org/citation.cfm?id=53072">fix them</a>. &nbsp;We determined that pushing testing upstream via <a href="https://en.wikipedia.org/wiki/Test_automation">automation</a> would help address these issues and accelerate velocity. 
<br /><br />As manual testing transitioned to automated processes, two separate testing roles began to emerge at Google:<br /><br /><ul><li>Test Engineers (TEs) -- With their deep product knowledge and test/quality domain expertise, TEs focused on what should be tested.</li><li>Software Engineers in Test (SETs) -- Originally software engineers with deep infrastructure and tooling expertise, SETs built the frameworks and packages required to implement automation.</li></ul><br />The impact was significant:<br /><br /><ul><li>Automated tests became more efficient and deterministic (e.g. by improving runtimes, eliminating sources of flakiness, etc.)&nbsp;</li><li>Metrics-driven engineering proliferated (e.g. improving code and feature coverage led to higher-quality products).</li></ul><br />Manual operations were reduced to manual verification of new features, and typically only at end-to-end, cross-product integration boundaries. &nbsp;TEs developed extreme depth of knowledge for the products they supported. &nbsp;They became go-to engineers for product teams that needed expertise in test automation and integration. Their role evolved into a broad spectrum of responsibilities: writing scripts to automate testing, creating tools so developers could test their own code, and constantly designing better and more creative ways to identify weak spots and break software.<br /><br />SETs (in collaboration with TEs and other engineers) built a wide array of test automation tools and developed best practices that were applicable across many products. Release velocity accelerated for products. &nbsp;All was good, and <a href="http://www.urbandictionary.com/define.php?term=And+There+was+Much+Rejoicing">there was much rejoicing</a>! <br /><br />SETs initially focused on building tools for reducing the testing cycle time, since that was the most manually intensive and time-consuming phase of getting product code into production. 
&nbsp;We made some of these tools available to the software development community: <a href="http://www.w3.org/TR/webdriver/">webdriver</a> <a href="http://www.seleniumhq.org/about/history.jsp">improvements</a>, <a href="https://angular.github.io/protractor/#/">protractor</a>, <a href="https://google.github.io/android-testing-support-library/docs/espresso/index.html">espresso</a>, <a href="http://googledevelopers.blogspot.com/2016/02/earlgrey-ios-functional-ui-testing.html">EarlGrey</a>, <a href="https://github.com/google/martian">martian proxy</a>, <a href="https://github.com/karma-runner/karma">karma</a>, and <a href="https://github.com/google/googletest">GoogleTest</a>. SETs were interested in sharing and collaborating with others in the industry and established <a href="https://developers.google.com/google-test-automation-conference">conferences</a>. The industry has also embraced the Test Engineering discipline, as other companies hired <a href="https://en.wikipedia.org/wiki/Software_Design_Engineer_in_Test">software engineers</a> into similar roles, <a href="http://www.developerdotstar.com/mag/articles/reeves_design.html">published articles</a>, and drove <a href="https://en.wikipedia.org/wiki/Test-driven_development">Test-Driven Development</a> into mainstream practices.<br /><br />Through these efforts, the testing cycle time decreased dramatically, but interestingly the overall velocity did not increase proportionately, since <a href="https://en.wikipedia.org/wiki/Queueing_theory">other phases in the development cycle became the bottleneck</a>. 
&nbsp;SETs started building tools to accelerate all other aspects of product development, including:<br /><br /><ul><li>Extending IDEs to make writing and reviewing code easier, shortening the “write code” cycle</li><li>Automating release verification, shortening the “release code” cycle.</li><li><a href="https://developers.google.com/google-test-automation-conference/2015/presentations#Day2Presentation1">Automating real time production system log verification and anomaly detection</a>, helping automate production monitoring.</li><li>Automating measurement of developer productivity, helping understand what’s working and what isn’t.</li></ul><br />In summary, the work done by the SETs naturally progressed from supporting only product testing efforts to include supporting product development efforts as well. Their role now encompassed a much broader Engineering Productivity agenda.<br /><br />Given the expanded SET charter, we wanted the title of the role to reflect the work. But what should the new title be? &nbsp;<a href="http://www.forbes.com/sites/laurahe/2013/03/29/googles-secrets-of-innovation-empowering-its-employees/">We empowered the SETs to choose a new title</a>, and they overwhelmingly (91%) selected Software Engineer, Tools &amp; Infrastructure (abbreviated to SETI). <br /><br />Today, SETIs and TEs still collaborate very closely on optimizing the entire development life cycle with a goal of eliminating all friction from getting features into production. Interested in building next generation tools and infrastructure? &nbsp;Join us (<a href="http://g.co/engprod/seti">SETI</a>, <a href="http://g.co/engprod/te">TE</a>)!<br /><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/from-qa-to-engineering-productivity/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>EarlGrey &#8211; iOS Functional UI Testing Framework</title>
		<link>https://googledata.org/google-testing/earlgrey-ios-functional-ui-testing-framework/</link>
		<comments>https://googledata.org/google-testing/earlgrey-ios-functional-ui-testing-framework/#comments</comments>
		<pubDate>Tue, 16 Feb 2016 23:14:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=c6ccb82c906ad4abdee4363029837004</guid>
		<description><![CDATA[By Siddartha Janga on behalf of Google iOS Developers&#160;Brewing for quite some time, we are excited to announce EarlGrey, a functional UI testing framework for iOS. Several Google apps like YouTube, Google Calendar, Google Photos, Google Translate, ...]]></description>
				<content:encoded><![CDATA[<i>By Siddartha Janga on behalf of Google iOS Developers&nbsp;</i><br /><br />Brewing for quite some time, we are excited to announce EarlGrey, a functional UI testing framework for iOS. Several Google apps like <a href="https://itunes.apple.com/us/app/youtube/id544007664?mt=8">YouTube</a>, <a href="https://itunes.apple.com/us/app/google-calendar/id909319292?mt=8">Google Calendar</a>, <a href="https://itunes.apple.com/us/app/google-photos-store-search/id962194608?mt=8">Google Photos</a>, <a href="https://itunes.apple.com/us/app/google-translate/id414706506?mt=8">Google Translate</a>, <a href="https://itunes.apple.com/us/app/google-play-music/id691797987?mt=8">Google Play Music</a>, and many more have successfully adopted the framework for their functional testing needs.<br /><br />The key features offered by EarlGrey include:<br /><br /><ul><li><b>Powerful built-in synchronization:</b> Tests will automatically wait for events such as animations, network requests, etc. before interacting with the UI. This will result in tests that are easier to write (no sleeps or waits) and simple to maintain (a straightforward, procedural description of test steps).&nbsp;</li><li><b>Visibility checking:</b> All interactions occur on elements that users can see. For example, attempting to tap a button that is behind an image will lead to test failure immediately.&nbsp;</li><li><b>Flexible design: </b>The components that determine element selection, interaction, assertion, and synchronization have been designed to be extensible.&nbsp;</li></ul><br /><br />Are you in need of a cup of refreshing EarlGrey? EarlGrey has been open sourced under the <a href="https://github.com/google/EarlGrey/blob/master/LICENSE">Apache license</a>. Check out the <a href="https://github.com/google/EarlGrey">getting started guide</a> and add EarlGrey to your project using <a href="https://cocoapods.org/">CocoaPods</a>, or add it manually to your Xcode project file. 
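To make the synchronization and visibility-checking features concrete, here is a minimal sketch of what an EarlGrey test can look like in Swift. The screen and its accessibility IDs ("submit_button", "status_label") are invented for illustration, and the call names follow EarlGrey's Swift helper, which may differ slightly between versions, so treat this as a sketch rather than copy-paste code:

```swift
import XCTest

// Hypothetical test case for an app with a submit button and a
// status label, identified by invented accessibility IDs.
class SubmitFlowTest: XCTestCase {
  func testSubmitShowsConfirmation() {
    // No sleeps or polling: EarlGrey synchronizes with animations
    // and in-flight network requests before performing the tap.
    EarlGrey.selectElement(with: grey_accessibilityID("submit_button"))
      .perform(grey_tap())

    // Visibility checking: this assertion fails immediately if the
    // label exists in the hierarchy but is obscured by another view.
    EarlGrey.selectElement(with: grey_accessibilityID("status_label"))
      .assert(grey_sufficientlyVisible())
  }
}
```

Because synchronization is built in, the test body reads as a plain procedural description of user steps, which is what makes such tests simple to maintain.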
<br /><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/earlgrey-ios-functional-ui-testing-framework/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>GTAC 2015 Wrap Up</title>
		<link>https://googledata.org/google-testing/gtac-2015-wrap-up/</link>
		<comments>https://googledata.org/google-testing/gtac-2015-wrap-up/#comments</comments>
		<pubDate>Tue, 08 Dec 2015 20:40:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=434768a51f18d8429f2678b65c372ed6</guid>
		<description><![CDATA[by Michael Klepikov and Lesley Katzen on behalf of the GTAC CommitteeThe ninth GTAC (Google Test Automation Conference) was held on November 10-11 at the Google Cambridge office, the &#8220;Hub&#8221; of innovation. The conference was completely packed wi...]]></description>
				<content:encoded><![CDATA[<i>by Michael Klepikov and Lesley Katzen on behalf of the GTAC Committee</i><br /><br />The ninth <a href="http://g.co/gtac">GTAC</a> (Google Test Automation Conference) was held on November 10-11 at the <a href="http://www.google.com/about/careers/locations/cambridge/">Google Cambridge</a> office, the “Hub” of innovation. The conference was completely packed with presenters and attendees from all over the world, from industry and academia, discussing advances in test automation and the test engineering computer science field, bringing with them a huge diversity of experiences. Speakers from numerous companies and universities (Applitools, Automattic, Bitbar, Georgia Tech, Google, Indian Institute of Science, Intel, LinkedIn, Lockheed Martin, MIT, Nest, Netflix, OptoFidelity, Splunk, Supersonic, Twitter, Uber, University of Waterloo) spoke on a variety of interesting and cutting edge test automation topics. <br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-y8eJDyTpZLc/Vmc-EgabRBI/AAAAAAAAAM4/Xc1sMt4SWkg/s1600/GTAC2015-4520.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="266" src="http://3.bp.blogspot.com/-y8eJDyTpZLc/Vmc-EgabRBI/AAAAAAAAAM4/Xc1sMt4SWkg/s400/GTAC2015-4520.jpg" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-PWoPLlUq3NY/Vmc-oIl6uyI/AAAAAAAAANA/cVsvQOVI_8g/s1600/GTAC2015-4964.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="266" src="http://1.bp.blogspot.com/-PWoPLlUq3NY/Vmc-oIl6uyI/AAAAAAAAANA/cVsvQOVI_8g/s400/GTAC2015-4964.jpg" width="400" /></a></div><br />All presentation videos and slides are posted on the <a href="https://developers.google.com/google-test-automation-conference/2015/stream">Video Recordings</a> and <a 
href="https://developers.google.com/google-test-automation-conference/2015/presentations">Presentations</a> pages. All videos have professionally transcribed closed captions, and the YouTube descriptions link to the slides. Enjoy and share! <br /><br />We had over 1,300 applicants, over 200 of whom applied to speak. Over 250 people filled our venue to capacity, and the live stream had a peak of about 400 concurrent viewers, with about 3,300 total viewing hours. <br /><br />Our goal in hosting GTAC is to make the conference highly relevant and useful both for attendees and for the larger test engineering community. Our post-conference survey shows that we are close to achieving that goal; thanks to everyone who completed the feedback survey!<br /><br /><ul><li>Our 82 survey respondents were mostly (81%) test-focused professionals with a wide range of 1 to 40 years of experience.&nbsp;</li><li>76% of respondents rated the conference as a whole as above average, with marked satisfaction for the venue, the food (those Diwali treats!), and the breadth and coverage of the talks themselves.</li></ul><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-NaEwJbMvasg/Vmc_AwG4ZQI/AAAAAAAAANI/LC5l1LyB8oE/s1600/GTAC2015-4819.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="266" src="http://4.bp.blogspot.com/-NaEwJbMvasg/Vmc_AwG4ZQI/AAAAAAAAANI/LC5l1LyB8oE/s400/GTAC2015-4819.jpg" width="400" /></a></div><br />The top five most popular talks were:<br /><br /><ul><li>The Uber Challenge of Cross-Application/Cross-Device Testing (Apple Chow and Bian Jiang)&nbsp;</li><li>Your Tests Aren't Flaky (Alister Scott)&nbsp;</li><li>Statistical Data Sampling (Celal Ziftci and Ben Greenberg)&nbsp;</li><li>Coverage is Not Strongly Correlated with Test Suite Effectiveness (Laura Inozemtseva)&nbsp;</li><li>Chrome OS Test Automation Lab (Simran Basi and Chris Sosa).</li></ul><br /><div 
class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-WxbA1FGTRo0/Vmc_REC1-iI/AAAAAAAAANQ/_lMTcGDkN4g/s1600/GTAC2015-4657.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="400" src="http://1.bp.blogspot.com/-WxbA1FGTRo0/Vmc_REC1-iI/AAAAAAAAANQ/_lMTcGDkN4g/s400/GTAC2015-4657.jpg" width="266" /></a></div><br />Our social events also proved to be crowd-pleasers. They were a direct response to feedback from GTAC 2014 asking for organized opportunities for socialization among the GTAC attendees. <br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-cd2hd3VYKyw/Vmc_jrqt4zI/AAAAAAAAANY/KGsJPyBaL8A/s1600/GTAC2015-4871.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="266" src="http://2.bp.blogspot.com/-cd2hd3VYKyw/Vmc_jrqt4zI/AAAAAAAAANY/KGsJPyBaL8A/s400/GTAC2015-4871.jpg" width="400" /></a></div><br />This isn’t to say there isn’t room for improvement. 11% of respondents expressed frustration with event communications, and several provided long, thoughtful suggestions for what we could do to improve next year. Also, many of the long-form comments asked for a better mix of technologies, noting that mobile had a big presence in the talks this year. <br /><br />If you have any suggestions on how we can improve, please comment on this post, or better yet – fill out the <a href="https://docs.google.com/a/google.com/forms/d/1_1Jb23jsTxoVat-HGoKJyJ9BcfdvoDWIpatLqv6fTNA/viewform">survey</a>, which remains open. Based on feedback from last year urging more transparency in speaker selection, we included an individual outside of Google in the speaker evaluation. Feedback is precious, we take it very seriously, and we will use it to improve next time around. <br /><br />Thank you to all the speakers, attendees, and online viewers who made this a special event once again. 
To receive announcements about the next GTAC, currently planned for early 2017, subscribe to the <a href="http://googletesting.blogspot.com/">Google Testing Blog</a>. <br /><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/gtac-2015-wrap-up/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>GTAC 2015 is Next Week!</title>
		<link>https://googledata.org/google-testing/gtac-2015-is-next-week/</link>
		<comments>https://googledata.org/google-testing/gtac-2015-is-next-week/#comments</comments>
		<pubDate>Sat, 07 Nov 2015 02:46:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=dd2a53fb5ab2a7891f93d7f1045108ca</guid>
		<description><![CDATA[by Anthony Vallone on behalf of the GTAC CommitteeThe ninth GTAC (Google Test Automation Conference) commences on Tuesday, November 10th, at the Google Cambridge office. You can find the latest details on the conference site, including schedule, speake...]]></description>
				<content:encoded><![CDATA[<i>by <a href="http://anthonyvallone.com/">Anthony Vallone</a> on behalf of the GTAC Committee</i><br /><br />The ninth GTAC (Google Test Automation Conference) commences on Tuesday, November 10th, at the Google Cambridge office. You can find the latest details on the <a href="https://developers.google.com/google-test-automation-conference/">conference site</a>, including <a href="https://developers.google.com/google-test-automation-conference/2015/schedule">schedule</a>, <a href="https://developers.google.com/google-test-automation-conference/2015/speakers">speaker profiles</a>, and <a href="https://developers.google.com/google-test-automation-conference/2015/travel">travel tips</a>.<br /><br />If you have not been invited to attend in person, you can <a href="https://developers.google.com/google-test-automation-conference/2015/stream">watch the event live</a>. And if you miss the livestream, we will post slides and videos later.<br /><br />We have an outstanding speaker lineup this year, and we look forward to seeing you all there or online! <br /><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/gtac-2015-is-next-week/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Announcing the GTAC 2015 Agenda</title>
		<link>https://googledata.org/google-testing/announcing-the-gtac-2015-agenda/</link>
		<comments>https://googledata.org/google-testing/announcing-the-gtac-2015-agenda/#comments</comments>
		<pubDate>Tue, 27 Oct 2015 23:52:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=08d887d3edd547ba4e11c046581d4d52</guid>
		<description><![CDATA[by Anthony Vallone on behalf of the GTAC Committee&#160;We have completed the selection and confirmation of all speakers and attendees for GTAC 2015. You can find the detailed agenda at: developers.google.com/gtac/2015/schedule.Thank you to all who sub...]]></description>
				<content:encoded><![CDATA[<i>by <a href="http://anthonyvallone.com/">Anthony Vallone</a> on behalf of the GTAC Committee&nbsp;</i><br /><br />We have completed the selection and confirmation of all speakers and attendees for GTAC 2015. You can find the detailed agenda at: <a href="http://developers.google.com/gtac/2015/schedule">developers.google.com/gtac/2015/schedule</a>.<br /><br />Thank you to all who submitted proposals!<br /><br />There is a lot of interest in GTAC once again this year, with about 1,400 applicants, about 200 of whom applied to speak. Unfortunately, our venue only seats 250. We will livestream the event as usual, so fret not if you were not selected to attend. Information about the livestream and other details will be posted on the GTAC site soon and announced here. <br /><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/announcing-the-gtac-2015-agenda/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Audio Testing &#8211; Automatic Gain Control</title>
		<link>https://googledata.org/google-testing/audio-testing-automatic-gain-control/</link>
		<comments>https://googledata.org/google-testing/audio-testing-automatic-gain-control/#comments</comments>
		<pubDate>Thu, 08 Oct 2015 16:21:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=cee7a08ad95ade94f70dfcc1ea246349</guid>
		<description><![CDATA[<i>By: Patrik H&#246;glund </i><br /><br /><h2>What is Automatic Gain Control?&#160;</h2>It&#8217;s time to talk about advanced media quality tests again! As experienced Google testing blog readers know, when I write an article it&#8217;s usually about WebRTC, and the unusual testing solutions we build to test it. This article is no exception. Today we&#8217;re going to talk about Automatic Gain Control, or AGC. This is a feature that&#8217;s on by default for WebRTC applications, such as <a href="http://apprtc.appspot.com/">http://apprtc.appspot.com</a>. It uses various means to adjust the microphone signal so your voice makes it loud and clear to the other side of the peer connection. For instance, it can attempt to adjust your microphone gain or try to amplify the signal digitally.<br /><br /><div><a href="http://3.bp.blogspot.com/-gTvHr_vpehs/VhaQZ78-y5I/AAAAAAAAAL0/Z4BAXUNDNcU/s1600/image03.png"><img border="0" height="336" src="http://3.bp.blogspot.com/-gTvHr_vpehs/VhaQZ78-y5I/AAAAAAAAAL0/Z4BAXUNDNcU/s640/image03.png" width="640"></a></div><div><i>Figure 1. How Auto Gain Control works [<a href="https://code.google.com/p/chromium/codesearch#chromium/src/third_party/webrtc/modules/audio_processing/agc/agc_manager_direct.h&#38;sq=package:chromium">code here</a>]. </i></div><div><br /></div>This is an example of automatic control engineering (another example would be the classic <a href="https://en.wikipedia.org/wiki/PID_controller">PID controller</a>) and happens in real time. Therefore, if you move closer to the mic while speaking, the AGC will notice the output stream is too loud, and reduce mic volume and/or digital gain. When you move further away, it tries to adapt up again. The fancy voice activity detector is there so we only amplify speech, and not, say, the microwave oven your spouse just started in the other room.<br /><h2>Testing the AGC</h2>Now, how do we make sure the AGC works? 
The first thing is obviously to <a href="https://code.google.com/p/chromium/codesearch#chromium/src/third_party/webrtc/modules/audio_processing/agc/agc_unittest.cc&#38;sq=package:chromium">write unit tests</a> and <a href="https://code.google.com/p/chromium/codesearch#chromium/src/third_party/webrtc/modules/audio_processing/test/">integration tests</a>. You <a href="http://googletesting.blogspot.com/2015/04/just-say-no-to-more-end-to-end-tests.html">didn&#8217;t think about building that end-to-end test first, did you</a>? Once we have the lower-level tests in place, we can start looking at a bigger test. While developing the WebRTC implementation in Chrome, we had several bugs where the AGC code was working by itself, but was misconfigured in Chrome. In one case, it was simply turned off for all users. In another, it was only turned off in Hangouts.  <br /><br />Only an end-to-end test can catch these integration issues, and we already had <a href="http://googletesting.blogspot.com/2013/11/webrtc-audio-quality-testing.html">stable, low-maintenance audio quality tests</a> with the ability to record Chrome&#8217;s output sound for analysis. I encourage you to read that article, but the bottom line is that those tests can run a WebRTC call in two tabs and record the audio output to a file. Those tests run the <a href="https://en.wikipedia.org/wiki/PESQ">PESQ</a> algorithm on input and output to see how similar they are. <br /><br />That&#8217;s a good framework to have, but I needed to make two changes:<br /><br /><ul><li>Add file support to Chrome&#8217;s fake audio input device, so we can play a known file. 
The original audio test avoided this by using WebAudio, but AGC doesn&#8217;t run in the WebAudio path, just the microphone capture path, so that won&#8217;t work.</li><li>Instead of running PESQ, run an analysis that compares the gain between input and output.</li></ul><h2>Adding Fake File Support</h2>This is always a big part of the work in media testing: controlling the input and output. It&#8217;s unworkable to tape microphones to loudspeakers or point cameras to screens to capture the media, so the easiest solution is usually to add a debug flag. That is exactly what I did <a href="https://code.google.com/p/chromium/codesearch#chromium/src/media/audio/fake_audio_input_stream.cc&#38;sq=package:chromium&#38;type=cs&#38;rcl=1443324274&#38;l=118">here</a>. It was a lot of work, but I won&#8217;t go into much detail since Chrome&#8217;s audio pipeline is complex. The core is <a href="https://code.google.com/p/chromium/codesearch#chromium/src/media/audio/simple_sources.cc&#38;sq=package:chromium&#38;type=cs">this</a>: <br /><br /><pre><span>int</span> FileSource::OnMoreData(AudioBus* audio_bus, uint32 total_bytes_delay) {<br /><span>// Load the file if we haven't already. 
This load needs to happen on the</span><br /><span>// audio thread, otherwise we'll run on the UI thread on Mac for instance.</span><br /><span>// This will massively delay the first OnMoreData, but we'll catch up.</span><br /><span>if</span> (!wav_audio_handler_)<br />    LoadWavFile(path_to_wav_file_);<br /><span>if</span> (load_failed_)<br /><span>return</span> 0;<br /><br />  DCHECK(wav_audio_handler_.get());<br /><br /><span>// Stop playing if we've played out the whole file.</span><br /><span>if</span> (wav_audio_handler_-&#62;AtEnd(wav_file_read_pos_))<br /><span>return</span> 0;<br /><br /><span>// This pulls data from ProvideInput.</span><br />  file_audio_converter_-&#62;Convert(audio_bus);<br /><span>return</span> audio_bus-&#62;frames();<br />}<br /></pre><br />This code runs every 10 ms and reads a small chunk from the file, converts it to Chrome&#8217;s preferred audio format and sends it on through the audio pipeline. After implementing this, I could simply run: <br /><br /><pre>chrome --use-fake-device-for-media-stream <br />       --use-file-for-fake-audio-capture=/tmp/file.wav<br /></pre><br />and whenever I hit a webpage that used WebRTC, the above file would play instead of my microphone input. Sweet!<br /><h2>The Analysis Stage</h2>Next I had to get the analysis stage figured out. It turned out there was something called an <a href="https://code.google.com/p/chromium/codesearch#chromium/src/media/audio/audio_power_monitor.h&#38;q=audiopowermo&#38;sq=package:chromium&#38;l=36">AudioPowerMonitor</a> in the Chrome code, which you feed audio data into and get the average audio power for the data you fed in. This is a measure of how &#8220;loud&#8221; the audio is. Since the whole point of the AGC is getting to the right audio power level, we&#8217;re looking to compute<br /><br /><i>A<sub>diff</sub> = A<sub>out</sub> - A<sub>in</sub></i><br /><br />Or, really, how much louder or weaker is the output compared to the input audio? 
Then we can construct different scenarios: <i>A<sub>diff</sub></i> should be 0 if the AGC is turned off and it should be &#62; 0 dB if the AGC is on and we feed in a low power audio file. Computing the average energy of an audio file was straightforward to <a href="https://code.google.com/p/chromium/codesearch#chromium/src/chrome/browser/media/webrtc_browsertest_audio.cc&#38;sq=package:chromium&#38;type=cs&#38;l=56">implement</a>: <br /><br /><pre>  <span>// ...</span><br />  size_t bytes_written;<br />  wav_audio_handler-&#62;CopyTo(audio_bus.get(), 0, &#38;bytes_written);<br />  CHECK_EQ(bytes_written, wav_audio_handler-&#62;data().size())<br />      &#60;&#60; <span>"Expected to write entire file into bus."</span>;<br /><br /><span>// Set the filter coefficient to the whole file's duration; this will make</span><br /><span>// the power monitor take the entire file into account.</span><br />  media::AudioPowerMonitor power_monitor(wav_audio_handler-&#62;sample_rate(),<br />                                         file_duration);<br />  power_monitor.Scan(*audio_bus, audio_bus-&#62;frames());<br /><span>// ...</span><br /><span>return</span> power_monitor.ReadCurrentPowerAndClip().first;</pre><br />I wrote a new test, and hooked up the above logic instead of PESQ. I could compute&#160;<i>A<sub>in</sub></i>&#160;by running the above algorithm on the reference file (which I fed in using the flag I implemented above) and <i>A<sub>out</sub></i> on the recording of the output audio. At this point I pretty much thought I was done. I ran a WebRTC call with the AGC turned off, expecting to get zero&#8230; and got a huge number. Turns out I wasn&#8217;t done.<br /><h2>What Went Wrong?</h2>I needed more debugging information to figure out what went wrong. Since the AGC was off, I would expect the power curves for output and input to be identical. 
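For concreteness, the gain measurement described above can be sketched in a few lines. This is a toy stand-in for Chromium's AudioPowerMonitor, not the real implementation; the buffer sizes and signal levels below are made up for illustration.

```cpp
#include <cmath>
#include <vector>

// Toy stand-in for AudioPowerMonitor: average power of a float [-1, 1]
// signal, in dB relative to full scale (a constant 1.0 signal is 0 dB).
double AveragePowerDb(const std::vector<float>& samples) {
  double sum_squares = 0.0;
  for (float s : samples) sum_squares += static_cast<double>(s) * s;
  const double mean_square = sum_squares / samples.size();
  return 10.0 * std::log10(mean_square);  // power ratio -> dB
}

// A_diff is then just the difference between the power of the recorded
// output and the power of the reference input:
double GainDiffDb(const std::vector<float>& reference,
                  const std::vector<float>& recording) {
  return AveragePowerDb(recording) - AveragePowerDb(reference);
}
// Example: a constant 0.25 signal is about -12.04 dB; if the "AGC"
// doubled the amplitude to 0.5 (about -6.02 dB), A_diff is about +6 dB.
```

With the AGC off, A<sub>diff</sub> should come out near 0 dB; with the AGC on and a low-power input, it should be positive.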
All I had was the average audio power over the entire file, so I started plotting the audio power for each 10 millisecond segment instead to understand where the curves diverged. I could then plot the detected audio power over the time of the test. I started by plotting <i>A<sub>diff&#160;</sub></i>: <br /><br /><div><a href="http://1.bp.blogspot.com/-DjKeQ5-Bh24/VhaQdgGzpaI/AAAAAAAAAMI/Rpr3raMc0F4/s1600/image01.png"><img border="0" height="161" src="http://1.bp.blogspot.com/-DjKeQ5-Bh24/VhaQdgGzpaI/AAAAAAAAAMI/Rpr3raMc0F4/s640/image01.png" width="640"></a></div><div><i>Figure 2. Plot of A<sub>diff</sub>.</i></div><br />The difference is quite small in the beginning, but grows in amplitude over time. Interesting. I then plotted <i>A<sub>out</sub></i> and <i>A<sub>in</sub></i> next to each other:  <br /><br /><div><a href="http://4.bp.blogspot.com/-32eITCuxF1c/VhaQbhizILI/AAAAAAAAAMA/FuRSRd2-2ek/s1600/image00.png"><img border="0" height="249" src="http://4.bp.blogspot.com/-32eITCuxF1c/VhaQbhizILI/AAAAAAAAAMA/FuRSRd2-2ek/s640/image00.png" width="640"></a></div><div><i>Figure 3. Plot of&#160;A<sub>out&#160;</sub>and&#160;A<sub>in</sub>.</i></div><br />A-ha! The curves drift apart over time; the above shows about 10 seconds of time, and the drift is maybe 80 ms at the end. The more they drift apart, the bigger the diff becomes. Exasperated, I asked our audio engineers about the above. Had my fancy test found its first bug? No, as it turns out - it was by design.<br /><h2>Clock Drift and Packet Loss</h2>Let me explain. As a part of WebRTC audio processing, we run a complex module called <a href="https://code.google.com/p/chromium/codesearch#chromium/src/third_party/webrtc/modules/audio_coding/neteq/&#38;q=neteq&#38;sq=package:chromium">NetEq</a> on the received audio stream. When sending audio over the Internet, there will inevitably be <i>packet loss</i> and <i>clock drift</i>. 
Packet losses always happen on the Internet, depending on the network path between sender and receiver. Clock drift happens because the sample clocks on the sending and receiving sound cards are not perfectly synced. <br /><br />In this particular case, the problem was not packet loss since we have ideal network conditions (one machine, packets go over the machine&#8217;s loopback interface = zero packet loss). But how can we have clock drift? Well, recall the fake device I wrote earlier that reads a file? It never touches the sound card like when the sound comes from the mic, so it runs on the system clock. That clock will drift against the machine&#8217;s sound card clock, even when we are on the same machine.  <br /><br />NetEq uses clever algorithms to conceal clock drift and packet loss. Most commonly it applies time compression or stretching on the audio it plays out, which means it makes the audio a little shorter or longer when needed to compensate for the drift. We humans mostly don&#8217;t even notice that, whereas a drift left uncompensated would result in a depleted or flooded receiver buffer &#8211; very noticeable. Anyway, I digress. This drift of the recording vs. the reference file was natural and I would just have to deal with it.<br /><h2>Silence Splitting to the Rescue!</h2>I could probably have solved this with math and postprocessing of the results (<a href="https://en.wikipedia.org/wiki/Least_squares">least squares</a>  maybe?), but I had another idea. The reference file happened to be comprised of five segments with small pauses between them. What if I made these pauses longer, split the files on the pauses and trimmed away all the silence? This would effectively align the start of each segment with its corresponding segment in the reference file.  
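In sketch form, the splitting idea looks like the function below. The real test delegates the splitting to sox; this simplified illustration treats any single sample under a threshold as silence, whereas a real splitter would also require a minimum silence duration before cutting.

```cpp
#include <cmath>
#include <vector>

// Simplified illustration of splitting on silence: collect runs of
// non-silent samples into segments and drop the silence between them.
// (Not the real implementation, which shells out to sox.)
std::vector<std::vector<float>> SplitOnSilence(
    const std::vector<float>& samples, float threshold) {
  std::vector<std::vector<float>> segments;
  std::vector<float> current;
  for (float s : samples) {
    if (std::fabs(s) >= threshold) {
      current.push_back(s);         // part of a voiced segment
    } else if (!current.empty()) {
      segments.push_back(current);  // silence ends the current segment
      current.clear();
    }
  }
  if (!current.empty()) segments.push_back(current);
  return segments;
}
```

Comparing segment <i>i</i> of the recording against segment <i>i</i> of the reference then keeps NetEq's time stretching from accumulating across the whole file.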
<br /><br /><div><a href="http://2.bp.blogspot.com/-dsxRyso4oLE/VhaQiGyCxUI/AAAAAAAAAMg/TLldcn7OYek/s1600/image05.png"><img border="0" height="112" src="http://2.bp.blogspot.com/-dsxRyso4oLE/VhaQiGyCxUI/AAAAAAAAAMg/TLldcn7OYek/s640/image05.png" width="640"></a></div><div><i>Figure 4. Before silence splitting.</i></div><br /><div><a href="http://1.bp.blogspot.com/-Z0MlHuMC4pI/VhaQg2cK-FI/AAAAAAAAAMY/m7jCmlRmnJU/s1600/image04.png"><img border="0" height="176" src="http://1.bp.blogspot.com/-Z0MlHuMC4pI/VhaQg2cK-FI/AAAAAAAAAMY/m7jCmlRmnJU/s640/image04.png" width="640"></a></div><div><i>Figure 5. After silence splitting.</i></div><br />We would still have NetEQ drift, but as you can see its effects will not stack up towards the end, so if the segments are short enough we should be able to mitigate this problem.<br /><h2>Result</h2>Here is the <a href="https://code.google.com/p/chromium/codesearch#chromium/src/chrome/browser/media/chrome_webrtc_audio_quality_browsertest.cc">final test implementation</a>:  <br /><br /><pre>  base::FilePath reference_file = <br />      test::GetReferenceFilesDir().Append(reference_filename);<br />  base::FilePath recording = CreateTemporaryWaveFile();<br /><br />  ASSERT_NO_FATAL_FAILURE(SetupAndRecordAudioCall(<br />      reference_file, recording, constraints,<br />      base::TimeDelta::FromSeconds(30)));<br /><br />  base::ScopedTempDir split_ref_files;<br />  ASSERT_TRUE(split_ref_files.CreateUniqueTempDir());<br />  ASSERT_NO_FATAL_FAILURE(<br />      SplitFileOnSilenceIntoDir(reference_file, split_ref_files.path()));<br />  std::<span>vector</span>&#60;base::FilePath&#62; ref_segments =<br />      ListWavFilesInDir(split_ref_files.path());<br /><br />  base::ScopedTempDir split_actual_files;<br />  ASSERT_TRUE(split_actual_files.CreateUniqueTempDir());<br />  ASSERT_NO_FATAL_FAILURE(<br />      SplitFileOnSilenceIntoDir(recording, split_actual_files.path()));<br /><br /><span>// Keep the recording and split files if the analysis 
fails.</span><br />  base::FilePath actual_files_dir = split_actual_files.Take();<br />  std::<span>vector</span>&#60;base::FilePath&#62; actual_segments =<br />      ListWavFilesInDir(actual_files_dir);<br /><br />  AnalyzeSegmentsAndPrintResult(<br />      ref_segments, actual_segments, reference_file, perf_modifier);<br /><br />  DeleteFileUnlessTestFailed(recording, <span>false</span>);<br />  DeleteFileUnlessTestFailed(actual_files_dir, <span>true</span>);</pre><br />Where AnalyzeSegmentsAndPrintResult looks like this:  <br /><br /><pre><span>void</span> AnalyzeSegmentsAndPrintResult(<br /><span>const</span> std::<span>vector</span>&#60;base::FilePath&#62;&#38; ref_segments,<br /><span>const</span> std::<span>vector</span>&#60;base::FilePath&#62;&#38; actual_segments,<br /><span>const</span> base::FilePath&#38; reference_file,<br /><span>const</span> std::<span>string</span>&#38; perf_modifier) {<br />  ASSERT_GT(ref_segments.size(), 0u)<br />      &#60;&#60; <span>"Failed to split reference file on silence; sox is likely broken."</span>;<br />  ASSERT_EQ(ref_segments.size(), actual_segments.size())<br />      &#60;&#60; <span>"The recording did not result in the same number of audio segments "</span><br />      &#60;&#60; <span>"after on splitting on silence; WebRTC must have deformed the audio "</span><br />      &#60;&#60; <span>"too much."</span>;<br /><br /><span>for</span> (size_t i = 0; i &#60; ref_segments.size(); i++) {<br /><span>float</span> difference_in_decibel = AnalyzeOneSegment(ref_segments[i],<br />                                                    actual_segments[i],<br />                                                    i);<br />    std::<span>string</span> trace_name = MakeTraceName(reference_file, i);<br />    perf_test::PrintResult(<span>"agc_energy_diff"</span>, perf_modifier, trace_name,<br />                           difference_in_decibel, <span>"dB"</span>, <span>false</span>);<br />  }<br />}<br /></pre><br />The results look 
like <a href="https://chromeperf.appspot.com/report">this</a>:  <br /><br /><div><a href="http://2.bp.blogspot.com/-9-d447SeI5o/VhaQfM5BgGI/AAAAAAAAAMQ/Vi_3bfPacRU/s1600/image02.png"><img border="0" height="321" src="http://2.bp.blogspot.com/-9-d447SeI5o/VhaQfM5BgGI/AAAAAAAAAMQ/Vi_3bfPacRU/s640/image02.png" width="640"></a></div><div><i>Figure 6. Average A<sub>diff</sub> values for each segment on the y axis, Chromium revisions on the x axis.</i></div><br />We can clearly see the AGC applies about 6 dB of gain to the (relatively low-energy) audio file we feed in. The maximum amount of gain the digital AGC can apply is 12 dB, and 7 dB is the default, so in this case the AGC is pretty happy with the level of the input audio. If we run with the AGC turned off, we get the expected 0 dB of gain. The diff varies a bit per segment, since the segments are different in audio power. <br /><br />Using this test, we can detect if the AGC accidentally gets turned off or malfunctions on Windows, Mac, or Linux. If that happens, the with_agc graph will drop from ~6 dB to 0, and we&#8217;ll know something is up. Same thing if the amount of digital gain changes. <br /><br />A more advanced version of this test would also look at the mic level the AGC sets. This mic level is currently ignored in the test, but it could take it into account by artificially amplifying the reference file when played through the fake device. We could also try throwing curveballs at the AGC, like abruptly raising the volume mid-test (as if the user leaned closer to the mic), and look at the gain for the segments to ensure it adapted correctly. <br /><br />]]></description>
				<content:encoded><![CDATA[<i>By: Patrik Höglund </i><br /><br /><h2>What is Automatic Gain Control?&nbsp;</h2>It’s time to talk about advanced media quality tests again! As experienced Google testing blog readers know, when I write an article it’s usually about WebRTC, and the unusual testing solutions we build to test it. This article is no exception. Today we’re going to talk about Automatic Gain Control, or AGC. This is a feature that’s on by default for WebRTC applications, such as <a href="http://apprtc.appspot.com/">http://apprtc.appspot.com</a>. It uses various means to adjust the microphone signal so your voice makes it loud and clear to the other side of the peer connection. For instance, it can attempt to adjust your microphone gain or try to amplify the signal digitally.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-gTvHr_vpehs/VhaQZ78-y5I/AAAAAAAAAL0/Z4BAXUNDNcU/s1600/image03.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="336" src="http://3.bp.blogspot.com/-gTvHr_vpehs/VhaQZ78-y5I/AAAAAAAAAL0/Z4BAXUNDNcU/s640/image03.png" width="640" /></a></div><div style="text-align: center;"><i>Figure 1. How Auto Gain Control works [<a href="https://code.google.com/p/chromium/codesearch#chromium/src/third_party/webrtc/modules/audio_processing/agc/agc_manager_direct.h&amp;sq=package:chromium">code here</a>]. </i></div><div style="text-align: center;"><br /></div>This is an example of automatic control engineering (another example would be the classic <a href="https://en.wikipedia.org/wiki/PID_controller">PID controller</a>) and happens in real time. Therefore, if you move closer to the mic while speaking, the AGC will notice the output stream is too loud, and reduce mic volume and/or digital gain. When you move further away, it tries to adapt up again. 
The fancy voice activity detector is there so we only amplify speech, and not, say, the microwave oven your spouse just started in the other room.<br /><h2>Testing the AGC</h2>Now, how do we make sure the AGC works? The first thing is obviously to <a href="https://code.google.com/p/chromium/codesearch#chromium/src/third_party/webrtc/modules/audio_processing/agc/agc_unittest.cc&amp;sq=package:chromium">write unit tests</a> and <a href="https://code.google.com/p/chromium/codesearch#chromium/src/third_party/webrtc/modules/audio_processing/test/">integration tests</a>. You <a href="http://googletesting.blogspot.com/2015/04/just-say-no-to-more-end-to-end-tests.html">didn’t think about building that end-to-end test first, did you</a>? Once we have the lower-level tests in place, we can start looking at a bigger test. While developing the WebRTC implementation in Chrome, we had several bugs where the AGC code was working by itself, but was misconfigured in Chrome. In one case, it was simply turned off for all users. In another, it was only turned off in Hangouts.  <br /><br />Only an end-to-end test can catch these integration issues, and we already had <a href="http://googletesting.blogspot.com/2013/11/webrtc-audio-quality-testing.html">stable, low-maintenance audio quality tests</a> with the ability to record Chrome’s output sound for analysis. I encourage you to read that article, but the bottom line is that those tests can run a WebRTC call in two tabs and record the audio output to a file. Those tests run the <a href="https://en.wikipedia.org/wiki/PESQ">PESQ</a> algorithm on input and output to see how similar they are. <br /><br />That’s a good framework to have, but I needed to make two changes:<br /><br /><ul><li>Add file support to Chrome’s fake audio input device, so we can play a known file. 
The original audio test avoided this by using WebAudio, but AGC doesn’t run in the WebAudio path, just the microphone capture path, so that won’t work.</li><li>Instead of running PESQ, run an analysis that compares the gain between input and output.</li></ul><h2>Adding Fake File Support</h2>This is always a big part of the work in media testing: controlling the input and output. It’s unworkable to tape microphones to loudspeakers or point cameras to screens to capture the media, so the easiest solution is usually to add a debug flag. That is exactly what I did <a href="https://code.google.com/p/chromium/codesearch#chromium/src/media/audio/fake_audio_input_stream.cc&amp;sq=package:chromium&amp;type=cs&amp;rcl=1443324274&amp;l=118">here</a>. It was a lot of work, but I won’t go into much detail since Chrome’s audio pipeline is complex. The core is <a href="https://code.google.com/p/chromium/codesearch#chromium/src/media/audio/simple_sources.cc&amp;sq=package:chromium&amp;type=cs">this</a>: <br /><br /><pre style="background: #d6d6d6; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;"><span style="color: #0433ff;">int</span> FileSource::OnMoreData(AudioBus* audio_bus, uint32 total_bytes_delay) {<br />  <span style="color: #4f8f00;">// Load the file if we haven't already. 
This load needs to happen on the</span><br />  <span style="color: #4f8f00;">// audio thread, otherwise we'll run on the UI thread on Mac for instance.</span><br />  <span style="color: #4f8f00;">// This will massively delay the first OnMoreData, but we'll catch up.</span><br />  <span style="color: #0433ff;">if</span> (!wav_audio_handler_)<br />    LoadWavFile(path_to_wav_file_);<br />  <span style="color: #0433ff;">if</span> (load_failed_)<br />    <span style="color: #0433ff;">return</span> 0;<br /><br />  DCHECK(wav_audio_handler_.get());<br /><br />  <span style="color: #4f8f00;">// Stop playing if we've played out the whole file.</span><br />  <span style="color: #0433ff;">if</span> (wav_audio_handler_-&gt;AtEnd(wav_file_read_pos_))<br />    <span style="color: #0433ff;">return</span> 0;<br /><br />  <span style="color: #4f8f00;">// This pulls data from ProvideInput.</span><br />  file_audio_converter_-&gt;Convert(audio_bus);<br />  <span style="color: #0433ff;">return</span> audio_bus-&gt;frames();<br />}<br /></pre><br />This code runs every 10 ms and reads a small chunk from the file, converts it to Chrome’s preferred audio format and sends it on through the audio pipeline. After implementing this, I could simply run: <br /><br /><pre style="background: #d6d6d6; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;">chrome --use-fake-device-for-media-stream <br />       --use-file-for-fake-audio-capture=/tmp/file.wav<br /></pre><br />and whenever I hit a webpage that used WebRTC, the above file would play instead of my microphone input. Sweet!<br /><h2>The Analysis Stage</h2>Next I had to get the analysis stage figured out. 
It turned out there was something called an <a href="https://code.google.com/p/chromium/codesearch#chromium/src/media/audio/audio_power_monitor.h&amp;q=audiopowermo&amp;sq=package:chromium&amp;l=36">AudioPowerMonitor</a> in the Chrome code, which you feed audio data into and get the average audio power for the data you fed in. This is a measure of how “loud” the audio is. Since the whole point of the AGC is getting to the right audio power level, we’re looking to compute<br /><br /><i>A<sub>diff</sub> = A<sub>out</sub> - A<sub>in</sub></i><br /><br />Or, really, how much louder or weaker is the output compared to the input audio? Then we can construct different scenarios: <i>A<sub>diff</sub></i> should be 0 if the AGC is turned off and it should be &gt; 0 dB if the AGC is on and we feed in a low power audio file. Computing the average energy of an audio file was straightforward to <a href="https://code.google.com/p/chromium/codesearch#chromium/src/chrome/browser/media/webrtc_browsertest_audio.cc&amp;sq=package:chromium&amp;type=cs&amp;l=56">implement</a>: <br /><br /><pre style="background: #d6d6d6; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;">  <span style="color: #4f8f00;">// ...</span><br />  size_t bytes_written;<br />  wav_audio_handler-&gt;CopyTo(audio_bus.get(), 0, &amp;bytes_written);<br />  CHECK_EQ(bytes_written, wav_audio_handler-&gt;data().size())<br />      &lt;&lt; <span style="color: #ff2600;">"Expected to write entire file into bus."</span>;<br /><br />  <span style="color: #4f8f00;">// Set the filter coefficient to the whole file's duration; this will make</span><br />  <span style="color: #4f8f00;">// the power monitor take the entire file into account.</span><br />  media::AudioPowerMonitor power_monitor(wav_audio_handler-&gt;sample_rate(),<br />                                         file_duration);<br />  power_monitor.Scan(*audio_bus, audio_bus-&gt;frames());<br />  <span 
style="color: #4f8f00;">// ...</span><br />  <span style="color: #0433ff;">return</span> power_monitor.ReadCurrentPowerAndClip().first;</pre><br />I wrote a new test, and hooked up the above logic instead of PESQ. I could compute&nbsp;<i>A<sub>in</sub></i>&nbsp;by running the above algorithm on the reference file (which I fed in using the flag I implemented above) and <i>A<sub>out</sub></i> on the recording of the output audio. At this point I pretty much thought I was done. I ran a WebRTC call with the AGC turned off, expecting to get zero… and got a huge number. Turns out I wasn’t done.<br /><h2>What Went Wrong?</h2>I needed more debugging information to figure out what went wrong. Since the AGC was off, I would expect the power curves for output and input to be identical. All I had was the average audio power over the entire file, so I started plotting the audio power for each 10 millisecond segment instead to understand where the curves diverged. I could then plot the detected audio power over the time of the test. I started by plotting <i>A<sub>diff&nbsp;</sub></i>: <br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-DjKeQ5-Bh24/VhaQdgGzpaI/AAAAAAAAAMI/Rpr3raMc0F4/s1600/image01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="161" src="http://1.bp.blogspot.com/-DjKeQ5-Bh24/VhaQdgGzpaI/AAAAAAAAAMI/Rpr3raMc0F4/s640/image01.png" width="640" /></a></div><div style="text-align: center;"><i>Figure 2. Plot of A<sub>diff</sub>.</i></div><br />The difference is quite small in the beginning, but grows in amplitude over time. Interesting. 
I then plotted <i>A<sub>out</sub></i> and <i>A<sub>in</sub></i> next to each other:  <br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-32eITCuxF1c/VhaQbhizILI/AAAAAAAAAMA/FuRSRd2-2ek/s1600/image00.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="249" src="http://4.bp.blogspot.com/-32eITCuxF1c/VhaQbhizILI/AAAAAAAAAMA/FuRSRd2-2ek/s640/image00.png" width="640" /></a></div><div style="text-align: center;"><i>Figure 3. Plot of&nbsp;A<sub>out&nbsp;</sub>and&nbsp;A<sub>in</sub>.</i></div><br />A-ha! The curves drift apart over time; the above shows about 10 seconds of time, and the drift is maybe 80 ms at the end. The more they drift apart, the bigger the diff becomes. Exasperated, I asked our audio engineers about the above. Had my fancy test found its first bug? No, as it turns out - it was by design.<br /><h2>Clock Drift and Packet Loss</h2>Let me explain. As a part of WebRTC audio processing, we run a complex module called <a href="https://code.google.com/p/chromium/codesearch#chromium/src/third_party/webrtc/modules/audio_coding/neteq/&amp;q=neteq&amp;sq=package:chromium">NetEq</a> on the received audio stream. When sending audio over the Internet, there will inevitably be <i>packet loss</i> and <i>clock drift</i>. Packet losses always happen on the Internet, depending on the network path between sender and receiver. Clock drift happens because the sample clocks on the sending and receiving sound cards are not perfectly synced. <br /><br />In this particular case, the problem was not packet loss since we have ideal network conditions (one machine, packets go over the machine’s loopback interface = zero packet loss). But how can we have clock drift? Well, recall the fake device I wrote earlier that reads a file? It never touches the sound card like when the sound comes from the mic, so it runs on the system clock. 
That clock will drift against the machine’s sound card clock, even when we are on the same machine.  <br /><br />NetEq uses clever algorithms to conceal clock drift and packet loss. Most commonly it applies time compression or stretching on the audio it plays out, which means it makes the audio a little shorter or longer when needed to compensate for the drift. We humans mostly don’t even notice that, whereas a drift left uncompensated would result in a depleted or flooded receiver buffer – very noticeable. Anyway, I digress. This drift of the recording vs. the reference file was natural and I would just have to deal with it.<br /><h2>Silence Splitting to the Rescue!</h2>I could probably have solved this with math and postprocessing of the results (<a href="https://en.wikipedia.org/wiki/Least_squares">least squares</a> maybe?), but I had another idea. The reference file happened to consist of five segments with small pauses between them. What if I made these pauses longer, split the files on the pauses and trimmed away all the silence? This would effectively align the start of each segment with its corresponding segment in the reference file.  <br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-dsxRyso4oLE/VhaQiGyCxUI/AAAAAAAAAMg/TLldcn7OYek/s1600/image05.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="112" src="http://2.bp.blogspot.com/-dsxRyso4oLE/VhaQiGyCxUI/AAAAAAAAAMg/TLldcn7OYek/s640/image05.png" width="640" /></a></div><div style="text-align: center;"><i>Figure 4. 
Before silence splitting.</i></div><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-Z0MlHuMC4pI/VhaQg2cK-FI/AAAAAAAAAMY/m7jCmlRmnJU/s1600/image04.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="176" src="http://1.bp.blogspot.com/-Z0MlHuMC4pI/VhaQg2cK-FI/AAAAAAAAAMY/m7jCmlRmnJU/s640/image04.png" width="640" /></a></div><div style="text-align: center;"><i>Figure 5. After silence splitting.</i></div><br />We would still have NetEQ drift, but as you can see its effects will not stack up towards the end, so if the segments are short enough we should be able to mitigate this problem.<br /><h2>Result</h2>Here is the <a href="https://code.google.com/p/chromium/codesearch#chromium/src/chrome/browser/media/chrome_webrtc_audio_quality_browsertest.cc">final test implementation</a>:  <br /><br /><pre style="background: #d6d6d6; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;">  base::FilePath reference_file = <br />      test::GetReferenceFilesDir().Append(reference_filename);<br />  base::FilePath recording = CreateTemporaryWaveFile();<br /><br />  ASSERT_NO_FATAL_FAILURE(SetupAndRecordAudioCall(<br />      reference_file, recording, constraints,<br />      base::TimeDelta::FromSeconds(30)));<br /><br />  base::ScopedTempDir split_ref_files;<br />  ASSERT_TRUE(split_ref_files.CreateUniqueTempDir());<br />  ASSERT_NO_FATAL_FAILURE(<br />      SplitFileOnSilenceIntoDir(reference_file, split_ref_files.path()));<br />  std::<span style="color: #0433ff;">vector</span>&lt;base::FilePath&gt; ref_segments =<br />      ListWavFilesInDir(split_ref_files.path());<br /><br />  base::ScopedTempDir split_actual_files;<br />  ASSERT_TRUE(split_actual_files.CreateUniqueTempDir());<br />  ASSERT_NO_FATAL_FAILURE(<br />      SplitFileOnSilenceIntoDir(recording, split_actual_files.path()));<br /><br />  <span style="color: #4f8f00;">// Keep 
the recording and split files if the analysis fails.</span><br />  base::FilePath actual_files_dir = split_actual_files.Take();<br />  std::<span style="color: #0433ff;">vector</span>&lt;base::FilePath&gt; actual_segments =<br />      ListWavFilesInDir(actual_files_dir);<br /><br />  AnalyzeSegmentsAndPrintResult(<br />      ref_segments, actual_segments, reference_file, perf_modifier);<br /><br />  DeleteFileUnlessTestFailed(recording, <span style="color: #0433ff;">false</span>);<br />  DeleteFileUnlessTestFailed(actual_files_dir, <span style="color: #0433ff;">true</span>);</pre><br />Where AnalyzeSegmentsAndPrintResult looks like this:  <br /><br /><pre style="background: #d6d6d6; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;"><span style="color: #0433ff;">void</span> AnalyzeSegmentsAndPrintResult(<br />    <span style="color: #0433ff;">const</span> std::<span style="color: #0433ff;">vector</span>&lt;base::FilePath&gt;&amp; ref_segments,<br />    <span style="color: #0433ff;">const</span> std::<span style="color: #0433ff;">vector</span>&lt;base::FilePath&gt;&amp; actual_segments,<br />    <span style="color: #0433ff;">const</span> base::FilePath&amp; reference_file,<br />    <span style="color: #0433ff;">const</span> std::<span style="color: #0433ff;">string</span>&amp; perf_modifier) {<br />  ASSERT_GT(ref_segments.size(), 0u)<br />      &lt;&lt; <span style="color: #ff2600;">"Failed to split reference file on silence; sox is likely broken."</span>;<br />  ASSERT_EQ(ref_segments.size(), actual_segments.size())<br />      &lt;&lt; <span style="color: #ff2600;">"The recording did not result in the same number of audio segments "</span><br />      &lt;&lt; <span style="color: #ff2600;">"after splitting on silence; WebRTC must have deformed the audio "</span><br />      &lt;&lt; <span style="color: #ff2600;">"too much."</span>;<br /><br />  <span style="color: #0433ff;">for</span> (size_t i = 0; i &lt; 
ref_segments.size(); i++) {<br />    <span style="color: #0433ff;">float</span> difference_in_decibel = AnalyzeOneSegment(ref_segments[i],<br />                                                    actual_segments[i],<br />                                                    i);<br />    std::<span style="color: #0433ff;">string</span> trace_name = MakeTraceName(reference_file, i);<br />    perf_test::PrintResult(<span style="color: #ff2600;">"agc_energy_diff"</span>, perf_modifier, trace_name,<br />                           difference_in_decibel, <span style="color: #ff2600;">"dB"</span>, <span style="color: #0433ff;">false</span>);<br />  }<br />}<br /></pre><br />The results look like <a href="https://chromeperf.appspot.com/report">this</a>:  <br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-9-d447SeI5o/VhaQfM5BgGI/AAAAAAAAAMQ/Vi_3bfPacRU/s1600/image02.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="321" src="http://2.bp.blogspot.com/-9-d447SeI5o/VhaQfM5BgGI/AAAAAAAAAMQ/Vi_3bfPacRU/s640/image02.png" width="640" /></a></div><div style="text-align: center;"><i>Figure 6. Average A<sub>diff</sub> values for each segment on the y axis, Chromium revisions on the x axis.</i></div><br />We can clearly see the AGC applies about 6 dB of gain to the (relatively low-energy) audio file we feed in. The maximum amount of gain the digital AGC can apply is 12 dB, and 7 dB is the default, so in this case the AGC is pretty happy with the level of the input audio. If we run with the AGC turned off, we get the expected 0 dB of gain. The diff varies a bit per segment, since the segments are different in audio power. <br /><br />Using this test, we can detect if the AGC accidentally gets turned off or malfunctions on Windows, Mac, or Linux. If that happens, the with_agc graph will drop from ~6 dB to 0, and we’ll know something is up. Same thing if the amount of digital gain changes. 
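To make the decibel numbers concrete, here is a small self-contained Python sketch (illustrative only, not Chrome code; the real test uses AudioPowerMonitor): doubling a signal’s amplitude raises its average power by 10·log10(4) ≈ 6.02 dB, right around the gain the AGC applies to our quiet reference file.

```python
import math

def average_power_db(samples):
    """Average power of samples normalized to [-1, 1], in dB relative to full scale."""
    mean_square = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(mean_square)

# A 440 Hz tone at a quarter of full scale stands in for the quiet reference file.
reference = [0.25 * math.sin(2 * math.pi * 440 * i / 48000) for i in range(48000)]
# Pretend the AGC applied a 2x amplitude gain to the recording.
recorded = [2.0 * s for s in reference]

a_diff = average_power_db(recorded) - average_power_db(reference)
print(round(a_diff, 2))  # 2x amplitude = 10*log10(4) ≈ 6.02 dB
```

Note that the diff depends only on the gain, not on the waveform, which is why comparing per-segment power works even though the segments have different content.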
<br /><br />A more advanced version of this test would also look at the mic level the AGC sets. The test currently ignores this mic level, but it could take it into account by artificially amplifying the reference file when played through the fake device. We could also try throwing curveballs at the AGC, like abruptly raising the volume mid-test (as if the user leaned closer to the mic), and look at the gain for the segments to ensure the AGC adapted correctly. <br /><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/audio-testing-automatic-gain-control/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Deadline to Apply for GTAC 2015 is Monday Aug 10</title>
		<link>https://googledata.org/google-testing/the-deadline-to-apply-for-gtac-2015-is-monday-aug-10/</link>
		<comments>https://googledata.org/google-testing/the-deadline-to-apply-for-gtac-2015-is-monday-aug-10/#comments</comments>
		<pubDate>Fri, 07 Aug 2015 17:33:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=548769f9b837fb4de026c7f565b4101c</guid>
		<description><![CDATA[<i>Posted by <a href="http://anthonyvallone.com/">Anthony Vallone</a> on behalf of the GTAC Committee  </i><br /><br /><div><a href="http://3.bp.blogspot.com/-u-xjf3fpfKU/VcTqhZwpQqI/AAAAAAAAALE/lPHaxZyyGiM/s1600/gtac-small.png"><img border="0" src="https://3.bp.blogspot.com/-u-xjf3fpfKU/VcTqhZwpQqI/AAAAAAAAALE/lPHaxZyyGiM/s1600/gtac-small.png"></a></div><br />The deadline to apply for GTAC 2015 is this Monday, August 10th, 2015. There is a great deal of interest to both attend and speak, and we&#8217;ve received many outstanding proposals. However, it&#8217;s not too late to submit your proposal for consideration. If you would like to speak or attend, be sure to <a href="https://docs.google.com/a/google.com/forms/d/1fEByEl11ixnZ8U-fd_iwJgs4tt5eGPIPisLLRWMiBK0/viewform">complete the form</a> by Monday.  <br /><br />We will be making regular updates to the GTAC site (<a href="http://developers.google.com/gtac/2015">developers.google.com/gtac/2015/</a>) over the next several weeks, and you can find conference details there. <br /><br />For those that have already signed up to attend or speak, we will contact you directly by mid-September. <br /><br />]]></description>
				<content:encoded><![CDATA[<i>Posted by <a href="http://anthonyvallone.com/">Anthony Vallone</a> on behalf of the GTAC Committee  </i><br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-u-xjf3fpfKU/VcTqhZwpQqI/AAAAAAAAALE/lPHaxZyyGiM/s1600/gtac-small.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://3.bp.blogspot.com/-u-xjf3fpfKU/VcTqhZwpQqI/AAAAAAAAALE/lPHaxZyyGiM/s1600/gtac-small.png" /></a></div><br />The deadline to apply for GTAC 2015 is this Monday, August 10th, 2015. There is a great deal of interest to both attend and speak, and we’ve received many outstanding proposals. However, it’s not too late to submit your proposal for consideration. If you would like to speak or attend, be sure to <a href="https://docs.google.com/a/google.com/forms/d/1fEByEl11ixnZ8U-fd_iwJgs4tt5eGPIPisLLRWMiBK0/viewform">complete the form</a> by Monday.  <br /><br />We will be making regular updates to the GTAC site (<a href="http://developers.google.com/gtac/2015">developers.google.com/gtac/2015/</a>) over the next several weeks, and you can find conference details there. <br /><br />For those that have already signed up to attend or speak, we will contact you directly by mid-September. <br /><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/the-deadline-to-apply-for-gtac-2015-is-monday-aug-10/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>GTAC 2015: Call for Proposals &amp; Attendance</title>
		<link>https://googledata.org/google-testing/gtac-2015-call-for-proposals-attendance/</link>
		<comments>https://googledata.org/google-testing/gtac-2015-call-for-proposals-attendance/#comments</comments>
		<pubDate>Tue, 30 Jun 2015 21:11:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=9d5acdc48b906e976e7070bd28bf8c07</guid>
		<description><![CDATA[<i>Posted by <a href="http://anthonyvallone.com/">Anthony Vallone</a> on behalf of the GTAC Committee </i><br /><br />The GTAC (Google Test Automation Conference) 2015 application process is now open for presentation proposals and attendance. GTAC will be held at the <a href="https://www.google.com/about/careers/locations/cambridge/">Google Cambridge office</a> (near Boston, Massachusetts, USA) on November 10th - 11th, 2015. <br /><br />GTAC will be streamed live on YouTube again this year, so even if you can&#8217;t attend in person, you&#8217;ll be able to watch the conference remotely. We will post the live stream information as we get closer to the event, and recordings will be posted afterward. <br /><br /><b>Speakers</b><br />Presentations are targeted at student, academic, and experienced engineers working on test automation. Full presentations are 30 minutes and lightning talks are 10 minutes. Speakers should be prepared for a question and answer session following their presentation. <br /><br /><b>Application</b><br />For presentation proposals and/or attendance, <a href="https://docs.google.com/a/google.com/forms/d/1fEByEl11ixnZ8U-fd_iwJgs4tt5eGPIPisLLRWMiBK0/viewform">complete this form</a>. We will be selecting about 25 talks and 200 attendees for the event. The selection process is not first come first serve (no need to rush your application), and we select a diverse group of engineers from various locations, company sizes, and technical backgrounds (academic, industry expert, junior engineer, etc). <br /><br /><b>Deadline</b><br />The due date for both presentation and attendance applications is August 10th, 2015. <br /><br /><b>Fees</b><br />There are no registration fees, but speakers and attendees must arrange and pay for their own travel and accommodations. <br /><br /><b>More information</b><br />You can find more details at <a href="http://developers.google.com/gtac">developers.google.com/gtac</a>. 
<br /><br />]]></description>
				<content:encoded><![CDATA[<i>Posted by <a href="http://anthonyvallone.com/">Anthony Vallone</a> on behalf of the GTAC Committee </i><br /><br />The GTAC (Google Test Automation Conference) 2015 application process is now open for presentation proposals and attendance. GTAC will be held at the <a href="https://www.google.com/about/careers/locations/cambridge/">Google Cambridge office</a> (near Boston, Massachusetts, USA) on November 10th - 11th, 2015. <br /><br />GTAC will be streamed live on YouTube again this year, so even if you can’t attend in person, you’ll be able to watch the conference remotely. We will post the live stream information as we get closer to the event, and recordings will be posted afterward. <br /><br /><b>Speakers</b><br />Presentations are targeted at student, academic, and experienced engineers working on test automation. Full presentations are 30 minutes and lightning talks are 10 minutes. Speakers should be prepared for a question and answer session following their presentation. <br /><br /><b>Application</b><br />For presentation proposals and/or attendance, <a href="https://docs.google.com/a/google.com/forms/d/1fEByEl11ixnZ8U-fd_iwJgs4tt5eGPIPisLLRWMiBK0/viewform">complete this form</a>. We will be selecting about 25 talks and 200 attendees for the event. The selection process is not first come first serve (no need to rush your application), and we select a diverse group of engineers from various locations, company sizes, and technical backgrounds (academic, industry expert, junior engineer, etc). <br /><br /><b>Deadline</b><br />The due date for both presentation and attendance applications is August 10th, 2015. <br /><br /><b>Fees</b><br />There are no registration fees, but speakers and attendees must arrange and pay for their own travel and accommodations. <br /><br /><b>More information</b><br />You can find more details at <a href="http://developers.google.com/gtac">developers.google.com/gtac</a>. 
<br /><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/gtac-2015-call-for-proposals-attendance/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>GTAC 2015 Coming to Cambridge (Greater Boston) in November</title>
		<link>https://googledata.org/google-testing/gtac-2015-coming-to-cambridge-greater-boston-in-november/</link>
		<comments>https://googledata.org/google-testing/gtac-2015-coming-to-cambridge-greater-boston-in-november/#comments</comments>
		<pubDate>Thu, 28 May 2015 16:56:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=003e13a959c4bac9e91f6376d5fa725e</guid>
		<description><![CDATA[<i>Posted by <a href="http://anthonyvallone.com/">Anthony Vallone</a> on behalf of the GTAC Committee </i><br /><br /><div><a href="http://1.bp.blogspot.com/-EdrZ3m_reY8/VWdG5_ROrSI/AAAAAAAAAKw/mt03BB6n9PE/s1600/gtac-colored-letters.png"><img border="0" height="106" src="https://1.bp.blogspot.com/-EdrZ3m_reY8/VWdG5_ROrSI/AAAAAAAAAKw/mt03BB6n9PE/s320/gtac-colored-letters.png" width="320"></a></div><br />We are pleased to announce that the ninth GTAC (Google Test Automation Conference) will be held in <a href="https://www.google.com/about/careers/locations/cambridge/">Cambridge</a> (Greatah Boston, USA) on November 10th and 11th (<a href="http://en.wikipedia.org/wiki/Boston_accent">Toozdee and Wenzdee</a>), 2015. So, tell everyone to save the date for this wicked good event. <br /><br />GTAC is an annual conference hosted by Google, bringing together engineers from industry and academia to discuss advances in test automation and the test engineering computer science field. It&#8217;s a great opportunity to present, learn, and challenge modern testing technologies and strategies. <br /><br />You can browse presentation abstracts, slides, and videos from previous years on the <a href="https://developers.google.com/google-test-automation-conference/">GTAC</a> site.  <br /><br />Stay tuned to this blog and the GTAC website for application information and opportunities to present at GTAC. Subscribing to this blog is the best way to get notified. We're looking forward to seeing you there!  <br /><br />]]></description>
				<content:encoded><![CDATA[<i>Posted by <a href="http://anthonyvallone.com/">Anthony Vallone</a> on behalf of the GTAC Committee </i><br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-EdrZ3m_reY8/VWdG5_ROrSI/AAAAAAAAAKw/mt03BB6n9PE/s1600/gtac-colored-letters.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="106" src="https://1.bp.blogspot.com/-EdrZ3m_reY8/VWdG5_ROrSI/AAAAAAAAAKw/mt03BB6n9PE/s320/gtac-colored-letters.png" width="320" /></a></div><br />We are pleased to announce that the ninth GTAC (Google Test Automation Conference) will be held in <a href="https://www.google.com/about/careers/locations/cambridge/">Cambridge</a> (Greatah Boston, USA) on November 10th and 11th (<a href="http://en.wikipedia.org/wiki/Boston_accent">Toozdee and Wenzdee</a>), 2015. So, tell everyone to save the date for this wicked good event. <br /><br />GTAC is an annual conference hosted by Google, bringing together engineers from industry and academia to discuss advances in test automation and the test engineering computer science field. It’s a great opportunity to present, learn, and challenge modern testing technologies and strategies. <br /><br />You can browse presentation abstracts, slides, and videos from previous years on the <a href="https://developers.google.com/google-test-automation-conference/">GTAC</a> site.  <br /><br />Stay tuned to this blog and the GTAC website for application information and opportunities to present at GTAC. Subscribing to this blog is the best way to get notified. We're looking forward to seeing you there!  <br /><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/gtac-2015-coming-to-cambridge-greater-boston-in-november/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Multi-Repository Development</title>
		<link>https://googledata.org/google-testing/multi-repository-development/</link>
		<comments>https://googledata.org/google-testing/multi-repository-development/#comments</comments>
		<pubDate>Fri, 15 May 2015 20:53:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=bb39f600c0319cdf6b3314979d9bd9dd</guid>
<description><![CDATA[<i>Author: Patrik H&#246;glund </i><br /><br />As we all know, software development is a complicated activity where we develop features and applications to provide value to our users. Furthermore, any nontrivial modern software is composed of other software. For instance, the Chrome web browser pulls <a href="https://code.google.com/p/chromium/codesearch#chromium/src/third_party/&#38;sq=package:chromium">roughly a hundred libraries</a> into its third_party folder when you build the browser. The most significant of these libraries is Blink, the rendering engine, but there&#8217;s also ffmpeg for audio and video processing, skia for low-level 2D graphics, and WebRTC for real-time communication (to name a few). <br /><br /><div><a href="http://4.bp.blogspot.com/-6nJYH4SDvFc/VVZVadP56QI/AAAAAAAAAKQ/oEJRo9rro_s/s1600/image01.png"><img border="0" src="http://4.bp.blogspot.com/-6nJYH4SDvFc/VVZVadP56QI/AAAAAAAAAKQ/oEJRo9rro_s/s1600/image01.png"></a></div><div><i>Figure 1. Holy dependencies, Batman!</i></div><br />There are many reasons to use software libraries. Why write your own phone number parser when you can use <a href="https://github.com/googlei18n/libphonenumber">libphonenumber</a>, which is battle-tested by real use in Android and Chrome and available under a permissive license? Using such software frees you up to focus on the core of your software so you can deliver a unique experience to your users. On the other hand, you need to keep your application up to date with changes in the library (you want that latest bug fix, right?), and you also run a risk of such a change breaking your application. This article will examine that integration problem and how you can reduce the risks associated with it.<br /><h2>Updating Dependencies is Hard</h2>The simplest solution is to check in a copy of the library, build with it, and avoid touching it as much as possible. 
This solution, however, can be problematic because you miss out on bug fixes and new features in the library. What if you need a new feature or bug fix that just made it in? You have a few options:<br /><ul><li>Update the library to its latest release. If it&#8217;s been a long time since you did this, it can be quite risky and you may have to spend significant testing resources to ensure all the accumulated changes don&#8217;t break your application. You may have to catch up to interface changes in the library as well.&#160;</li><li>Cherry-pick the feature/bug fix you want into your copy of the library. This is even riskier because your cherry-picked patches may depend on other changes in the library in subtle ways. Also, you are still not up to date with the latest version.&#160;</li><li>Find some way to make do without the feature or bug fix.</li></ul>None of the above options are very good. Using this ad-hoc updating model can work if there&#8217;s a low volume of changes in the library and your requirements on the library don&#8217;t change very often. Even if that is the case, what will you do if a critical zero-day exploit is discovered in your socket library?  <br /><br />One way to mitigate the update risk is to integrate more often with your dependencies. As an extreme example, let&#8217;s look at Chrome. <br /><br />In Chrome development, there&#8217;s a massive amount of change going into its dependencies. The Blink rendering engine lives in a separate code repository from the browser. Blink sees hundreds of code changes per day, and Chrome must integrate with Blink often since it&#8217;s an important part of the browser. Another example is the WebRTC implementation, where a large part of Chrome&#8217;s implementation resides in the webrtc.org repository. 
This article will focus on the latter because it&#8217;s the team I happen to work on.<br /><h2>How &#8220;Rolling&#8221; Works&#160;</h2>The open-sourced WebRTC codebase is used by Chrome but also by a number of other companies working on WebRTC. Chrome uses a toolchain called depot_tools to manage dependencies, and there&#8217;s a checked-in text file called <a href="https://code.google.com/p/chromium/codesearch#chromium/src/DEPS&#38;q=DEPS&#38;sq=package:chromium&#38;l=1">DEPS</a> where dependencies are managed. It looks roughly like this:<br /><pre>{<br /><span># ... </span><br /><span>'src/third_party/webrtc'</span>:<br /><span>'https://chromium.googlesource.com/'</span> +<br /><span>'external/webrtc/trunk/webrtc.git'</span> + <br /><span>'@'</span> + <span>'5727038f572c517204e1642b8bc69b25381c4e9f'</span>,<br />}<br /></pre><br />The above means we should pull WebRTC from the specified git repository at the <span>572703...</span> hash, similar to other dependency-provisioning frameworks. To build Chrome with a new version, we change the hash and check in a new version of the <a href="https://code.google.com/p/chromium/codesearch#chromium/src/DEPS&#38;q=DEPS&#38;sq=package:chromium&#38;l=1">DEPS</a> file. If the library&#8217;s API has changed, we must update Chrome to use the new API in the same patch. This process is known as <i>rolling</i> WebRTC to a new version. <br /><br />Now the problem is that we have changed the code going into Chrome. Maybe <a href="https://code.google.com/p/chromium/issues/detail?id=430001">getUserMedia has started crashing on Android</a>, or maybe the browser no longer boots on Windows. We don&#8217;t know until we have built and run all the tests. Therefore a roll patch is subject to the same presubmit checks as any Chrome patch (i.e. <a href="https://www.chromium.org/developers/testing/try-server-usage">many tests, on all platforms we ship on</a>). 
However, roll patches can be considerably more painful and risky than other patches. <br /><br /><br /><div><a href="http://3.bp.blogspot.com/-GWwsqyYm43c/VVZVf_WpvvI/AAAAAAAAAKY/qhlMLTB6nbA/s1600/image00.png"><img border="0" src="http://3.bp.blogspot.com/-GWwsqyYm43c/VVZVf_WpvvI/AAAAAAAAAKY/qhlMLTB6nbA/s1600/image00.png"></a></div><div><i>Figure 2. Life of a Roll Patch.</i></div><br />On the WebRTC team we found ourselves in an uncomfortable position a couple years back. Developers would make changes to the webrtc.org code and there was a fair amount of churn in the interface, which meant we would have to update Chrome to adapt to those changes. Also we frequently broke tests and WebRTC functionality in Chrome because semantic changes had unexpected consequences in Chrome. Since rolls were so risky and painful to make, they started to happen less often, which made things even worse. There could be two weeks between rolls, which meant Chrome was hit by a large number of changes in one patch.<br /><h2>Bots That Can See the Future: &#8220;FYI Bots&#8221;&#160;</h2>We found a way to mitigate this which we called FYI (for your information) bots. A bot is Chrome lingo for a continuous build machine which builds Chrome and runs tests. <br /><br />All the existing Chrome bots at that point would build Chrome as specified in the <a href="https://code.google.com/p/chromium/codesearch#chromium/src/DEPS&#38;q=DEPS&#38;sq=package:chromium&#38;l=1">DEPS</a> file, which meant they would build the WebRTC version we had rolled to up to that point. FYI bots replace that pinned version with WebRTC HEAD, but otherwise build and run Chrome-level tests as usual. 
Therefore:  <br /><br /><ul><li>If all the FYI bots were green, we knew a roll most likely would go smoothly.&#160;</li><li>If the bots didn&#8217;t compile, we knew we would have to adapt Chrome to an interface change in the next roll patch.&#160;</li><li>If the bots were red, we knew we either had a bug in WebRTC or that Chrome would have to be adapted to some semantic change in WebRTC.</li></ul>The <a href="https://build.chromium.org/p/chromium.webrtc.fyi/waterfall">FYI &#8220;waterfall&#8221;</a> (a set of bots that builds and runs tests) is a straight copy of the <a href="https://build.chromium.org/p/chromium.webrtc/waterfall">main waterfall</a>, which is expensive in resources. We could have cheated and just set up FYI bots for one platform (say, Linux), but the most expensive regressions are platform-specific, so we reckoned the extra machines and maintenance were worth it.<br /><h2>Making Gradual Interface Changes&#160;</h2>This solution helped but wasn&#8217;t quite satisfactory. We initially had the policy that it was fine to break the FYI bots since we could not update Chrome to use a new interface until the new interface had actually been rolled into Chrome. This, however, often caused the FYI bots to be compile-broken for days. We quickly started to suffer from <a href="http://www.martinfowler.com/articles/continuousIntegration.html#FixBrokenBuildsImmediately">red blindness</a> <a href="http://googletesting.blogspot.com/#ref1">[1]</a> and had no idea if we would break tests on the roll, especially if an interface change was made early in the roll cycle. <br /><br />The solution was to move to a more careful update policy for the WebRTC API. For the more technically inclined, &#8220;careful&#8221; here means &#8220;following the <a href="https://wiki.eclipse.org/Evolving_Java-based_APIs#API_Prime_Directive">API prime directive</a>&#8221; <a href="http://googletesting.blogspot.com/#ref2">[2]</a>. 
Consider this example:<br /><pre><span>class</span> WebRtcAmplifier {<br />  ...<br /><span>int</span> SetOutputVolume(<span>float</span> volume);<br />}</pre><br />Normally we would just change the method&#8217;s signature when we needed to:<br /><pre><span>class</span> WebRtcAmplifier {<br />  ...<br /><span>int</span> SetOutputVolume(<span>float</span> volume, <span>bool</span> allow_eleven<a href="http://googletesting.blogspot.com/#foot1"><sup>1</sup></a>);<br />}</pre><br />&#8230; but this would compile-break Chome until it could be updated. So we started doing it like this instead:<br /><pre><span>class</span> WebRtcAmplifier {<br />  ...<br /><span>int</span> SetOutputVolume(<span>float</span> volume);<br /><span>int</span> SetOutputVolume2(<span>float</span> volume, <span>bool</span> allow_eleven);<br />}</pre><br />Then we could:<br /><ol><li>Roll into Chrome&#160;</li><li>Make Chrome use SetOutputVolume2&#160;</li><li>Update SetOutputVolume&#8217;s signature&#160;</li><li>Roll again and make Chrome use SetOutputVolume&#160;</li><li>Delete SetOutputVolume2</li></ol>This approach requires several steps but we end up with the right interface and at no point do we break Chrome.<br /><h2>Results</h2>When we implemented the above, we could fix problems as they came up rather than in big batches on each roll. We could institute the policy that the FYI bots should always be green, and that changes breaking them should be immediately rolled back. This made a huge difference. The team could work smoother and roll more often. This reduced our risk quite a bit, particularly when Chrome was about to cut a new version branch. 
Instead of doing panicked and risky rolls around a release, we could work out issues in good time and stay in control.<br /><br />Another benefit of FYI bots is more granular <a href="https://chromeperf.appspot.com/report?masters=ChromiumWebRTCFYI&#38;bots=chromium-webrtc-trunk-tot-rel-linux&#38;tests=browser_tests%2Fagc_energy_diff_with_agc&#38;checked=all">performance tests</a>. Before the FYI bots, it would frequently happen that a bunch of metrics regressed. However, it&#8217;s not fun to find which of the 100 patches in the roll caused the regression! With the FYI bots, we can see precisely which WebRTC revision caused the problem.<br /><h2>Future Work: Optimistic Auto-rolling</h2>The final step on this ladder (short of actually merging the repositories) is auto-rolling. The Blink team implemented this with their <a href="https://www.chromium.org/blink/blinkrollbot">ARB (AutoRollBot)</a>. The bot wakes up periodically and tries to do a roll patch. If it fails on the trybots, it waits and tries again later (perhaps the trybots failed because of a flake or other temporary error, or perhaps the error was real but has been fixed).<br /><br />To pull auto-rolling off, you are going to need very good tests. That goes for any roll patch (or any patch, really), but if you&#8217;re edging closer to a release and an unstoppable flood of code changes keep breaking you, you&#8217;re not in a good place.<br /><br /><hr /><h2>References</h2><a name="ref1"></a>[1] Martin Fowler (May 2006) &#8220;Continuous Integration&#8221;<br /><a name="ref2"></a>[2] Dani Megert, Remy Chi Jian Suen, et. al. (Oct 2014) &#8220;Evolving Java-based APIs&#8221;<br /><h2>Footnotes</h2><ol><li><a name="foot1"></a>We actually did have a hilarious bug in WebRTC where it was possible to set the volume to <a href="https://www.youtube.com/watch?v=KOO5S4vxi0o">1.1</a>, but only 0.0-1.0 was supposed to be allowed. No, really. 
Thus, our WebRTC implementation must be louder than the others since everybody knows 1.1 must be louder than 1.0.</li></ol><br />]]></description>
				<content:encoded><![CDATA[<i>Author: Patrik Höglund </i><br /><br />As we all know, software development is a complicated activity where we develop features and applications to provide value to our users. Furthermore, any nontrivial modern software is composed of other software. For instance, the Chrome web browser pulls <a href="https://code.google.com/p/chromium/codesearch#chromium/src/third_party/&amp;sq=package:chromium">roughly a hundred libraries</a> into its third_party folder when you build the browser. The most significant of these libraries is Blink, the rendering engine, but there’s also ffmpeg for audio and video processing, skia for low-level 2D graphics, and WebRTC for real-time communication (to name a few). <br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-6nJYH4SDvFc/VVZVadP56QI/AAAAAAAAAKQ/oEJRo9rro_s/s1600/image01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-6nJYH4SDvFc/VVZVadP56QI/AAAAAAAAAKQ/oEJRo9rro_s/s1600/image01.png" /></a></div><div style="text-align: center;"><i>Figure 1. Holy dependencies, Batman!</i></div><br />There are many reasons to use software libraries. Why write your own phone number parser when you can use <a href="https://github.com/googlei18n/libphonenumber">libphonenumber</a>, which is battle-tested by real use in Android and Chrome and available under a permissive license? Using such software frees you up to focus on the core of your software so you can deliver a unique experience to your users. On the other hand, you need to keep your application up to date with changes in the library (you want that latest bug fix, right?), and you also run a risk of such a change breaking your application. 
This article will examine that integration problem and how you can reduce the risks associated with it.<br /><h2>Updating Dependencies is Hard</h2>The simplest solution is to check in a copy of the library, build with it, and avoid touching it as much as possible. This solution, however, can be problematic because you miss out on bug fixes and new features in the library. What if you need a new feature or bug fix that just made it in? You have a few options:<br /><ul class="my-list"><li>Update the library to its latest release. If it’s been a long time since you did this, it can be quite risky and you may have to spend significant testing resources to ensure all the accumulated changes don’t break your application. You may have to catch up to interface changes in the library as well.&nbsp;</li><li>Cherry-pick the feature/bug fix you want into your copy of the library. This is even riskier because your cherry-picked patches may depend on other changes in the library in subtle ways. Also, you are still not up to date with the latest version.&nbsp;</li><li>Find some way to make do without the feature or bug fix.</li></ul>None of the above options are very good. Using this ad-hoc updating model can work if there’s a low volume of changes in the library and your requirements on the library don’t change very often. Even if that is the case, what will you do if a critical zero-day exploit is discovered in your socket library?  <br /><br />One way to mitigate the update risk is to integrate more often with your dependencies. As an extreme example, let’s look at Chrome. <br /><br />In Chrome development, there’s a massive amount of change going into its dependencies. The Blink rendering engine lives in a separate code repository from the browser. Blink sees hundreds of code changes per day, and Chrome must integrate with Blink often since it’s an important part of the browser. 
Another example is the WebRTC implementation, where a large part of Chrome’s implementation resides in the webrtc.org repository. This article will focus on the latter because it’s the team I happen to work on.<br /><h2>How “Rolling” Works&nbsp;</h2>The open-sourced WebRTC codebase is used by Chrome but also by a number of other companies working on WebRTC. Chrome uses a toolchain called depot_tools to manage dependencies, and there’s a checked-in text file called <a href="https://code.google.com/p/chromium/codesearch#chromium/src/DEPS&amp;q=DEPS&amp;sq=package:chromium&amp;l=1">DEPS</a> where dependencies are managed. It looks roughly like this:<br /><pre style="background: #f0f0f0; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;">{<br />  <span style="color: #4f8f00;"># ... </span><br />  <span style="color: #ff2600;">'src/third_party/webrtc'</span>:<br />      <span style="color: #ff2600;">'https://chromium.googlesource.com/'</span> +<br />      <span style="color: #ff2600;">'external/webrtc/trunk/webrtc.git'</span> + <br />      <span style="color: #ff2600;">'@'</span> + <span style="color: #ff2600;">'5727038f572c517204e1642b8bc69b25381c4e9f'</span>,<br />}<br /></pre><br />The above means we should pull WebRTC from the specified git repository at the <span style="font-family: Courier New, Courier, monospace;">572703...</span> hash, similar to other dependency-provisioning frameworks. To build Chrome with a new version, we change the hash and check in a new version of the <a href="https://code.google.com/p/chromium/codesearch#chromium/src/DEPS&amp;q=DEPS&amp;sq=package:chromium&amp;l=1">DEPS</a> file. If the library’s API has changed, we must update Chrome to use the new API in the same patch. This process is known as <i>rolling</i> WebRTC to a new version. <br /><br />Now the problem is that we have changed the code going into Chrome. 
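The mechanical half of a roll is just bumping the pinned hash. A hedged sketch, assuming a simplified DEPS-like fragment (real rolls go through depot_tools, which also runs hooks and handles DEPS variables):

```python
import re

# Hypothetical, simplified DEPS fragment; the real file has many entries.
DEPS_TEMPLATE = """{{
  'src/third_party/webrtc':
      'https://chromium.googlesource.com/' +
      'external/webrtc/trunk/webrtc.git' +
      '@' + '{rev}',
}}
"""

def roll(deps_text: str, old_rev: str, new_rev: str) -> str:
    """Return deps_text with the pinned revision old_rev replaced by new_rev."""
    if not re.fullmatch(r"[0-9a-f]{40}", new_rev):
        raise ValueError("expected a full 40-character git hash")
    if old_rev not in deps_text:
        raise ValueError("pinned revision not found in DEPS")
    return deps_text.replace(old_rev, new_rev)

old = "5727038f572c517204e1642b8bc69b25381c4e9f"
new = "f" * 40  # stand-in for the revision we want to roll to
rolled = roll(DEPS_TEMPLATE.format(rev=old), old, new)
assert new in rolled and old not in rolled
```

The hard part is everything after that one-line edit: proving that Chrome still works with the new revision.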
Maybe <a href="https://code.google.com/p/chromium/issues/detail?id=430001">getUserMedia has started crashing on Android</a>, or maybe the browser no longer boots on Windows. We don’t know until we have built and run all the tests. Therefore a roll patch is subject to the same presubmit checks as any Chrome patch (i.e. <a href="https://www.chromium.org/developers/testing/try-server-usage">many tests, on all platforms we ship on</a>). However, roll patches can be considerably more painful and risky than other patches. <br /><br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-GWwsqyYm43c/VVZVf_WpvvI/AAAAAAAAAKY/qhlMLTB6nbA/s1600/image00.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-GWwsqyYm43c/VVZVf_WpvvI/AAAAAAAAAKY/qhlMLTB6nbA/s1600/image00.png" /></a></div><div style="text-align: center;"><i>Figure 2. Life of a Roll Patch.</i></div><br />On the WebRTC team we found ourselves in an uncomfortable position a couple years back. Developers would make changes to the webrtc.org code and there was a fair amount of churn in the interface, which meant we would have to update Chrome to adapt to those changes. Also we frequently broke tests and WebRTC functionality in Chrome because semantic changes had unexpected consequences in Chrome. Since rolls were so risky and painful to make, they started to happen less often, which made things even worse. There could be two weeks between rolls, which meant Chrome was hit by a large number of changes in one patch.<br /><h2>Bots That Can See the Future: “FYI Bots”&nbsp;</h2>We found a way to mitigate this which we called FYI (for your information) bots. A bot is Chrome lingo for a continuous build machine which builds Chrome and runs tests. 
<br /><br />All the existing Chrome bots at that point would build Chrome as specified in the <a href="https://code.google.com/p/chromium/codesearch#chromium/src/DEPS&amp;q=DEPS&amp;sq=package:chromium&amp;l=1">DEPS</a> file, which meant they would build the WebRTC version we had rolled to up to that point. FYI bots replace that pinned version with WebRTC HEAD, but otherwise build and run Chrome-level tests as usual. Therefore:  <br /><br /><ul class="my-list"><li>If all the FYI bots were green, we knew a roll most likely would go smoothly.&nbsp;</li><li>If the bots didn’t compile, we knew we would have to adapt Chrome to an interface change in the next roll patch.&nbsp;</li><li>If the bots were red, we knew we either had a bug in WebRTC or that Chrome would have to be adapted to some semantic change in WebRTC.</li></ul>The <a href="https://build.chromium.org/p/chromium.webrtc.fyi/waterfall">FYI “waterfall”</a> (a set of bots that builds and runs tests) is a straight copy of the <a href="https://build.chromium.org/p/chromium.webrtc/waterfall">main waterfall</a>, which is expensive in resources. We could have cheated and just set up FYI bots for one platform (say, Linux), but the most expensive regressions are platform-specific, so we reckoned the extra machines and maintenance were worth it.<br /><h2>Making Gradual Interface Changes&nbsp;</h2>This solution helped but wasn’t quite satisfactory. We initially had the policy that it was fine to break the FYI bots since we could not update Chrome to use a new interface until the new interface had actually been rolled into Chrome. This, however, often caused the FYI bots to be compile-broken for days. 
We quickly started to suffer from <a href="http://www.martinfowler.com/articles/continuousIntegration.html#FixBrokenBuildsImmediately">red blindness</a> <a href="http://googletesting.blogspot.com/2015/05/multi-repository-development.html#ref1">[1]</a> and had no idea if we would break tests on the roll, especially if an interface change was made early in the roll cycle. <br /><br />The solution was to move to a more careful update policy for the WebRTC API. For the more technically inclined, “careful” here means “following the <a href="https://wiki.eclipse.org/Evolving_Java-based_APIs#API_Prime_Directive">API prime directive</a>” <a href="http://googletesting.blogspot.com/2015/05/multi-repository-development.html#ref2">[2]</a>. Consider this example:<br /><pre style="background: #f0f0f0; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;"><span style="color: #0433ff;">class</span> WebRtcAmplifier {<br />  ...<br />  <span style="color: #0433ff;">int</span> SetOutputVolume(<span style="color: #0433ff;">float</span> volume);<br />}</pre><br />Normally we would just change the method’s signature when we needed to:<br /><pre style="background: #f0f0f0; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;"><span style="color: #0433ff;">class</span> WebRtcAmplifier {<br />  ...<br />  <span style="color: #0433ff;">int</span> SetOutputVolume(<span style="color: #0433ff;">float</span> volume, <span style="color: #0433ff;">bool</span> allow_eleven<a href="http://googletesting.blogspot.com/2015/05/multi-repository-development.html#foot1"><sup>1</sup></a>);<br />}</pre><br />… but this would compile-break Chrome until it could be updated. 
So we started doing it like this instead:<br /><pre style="background: #f0f0f0; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;"><span style="color: #0433ff;">class</span> WebRtcAmplifier {<br />  ...<br />  <span style="color: #0433ff;">int</span> SetOutputVolume(<span style="color: #0433ff;">float</span> volume);<br />  <span style="color: #0433ff;">int</span> SetOutputVolume2(<span style="color: #0433ff;">float</span> volume, <span style="color: #0433ff;">bool</span> allow_eleven);<br />}</pre><br />Then we could:<br /><ol class="my-list"><li>Roll into Chrome&nbsp;</li><li>Make Chrome use SetOutputVolume2&nbsp;</li><li>Update SetOutputVolume’s signature&nbsp;</li><li>Roll again and make Chrome use SetOutputVolume&nbsp;</li><li>Delete SetOutputVolume2</li></ol>This approach requires several steps but we end up with the right interface and at no point do we break Chrome.<br /><h2>Results</h2>When we implemented the above, we could fix problems as they came up rather than in big batches on each roll. We could institute the policy that the FYI bots should always be green, and that changes breaking them should be immediately rolled back. This made a huge difference. The team could work more smoothly and roll more often. This reduced our risk quite a bit, particularly when Chrome was about to cut a new version branch. Instead of doing panicked and risky rolls around a release, we could work out issues in good time and stay in control.<br /><br />Another benefit of FYI bots is more granular <a href="https://chromeperf.appspot.com/report?masters=ChromiumWebRTCFYI&amp;bots=chromium-webrtc-trunk-tot-rel-linux&amp;tests=browser_tests%2Fagc_energy_diff_with_agc&amp;checked=all">performance tests</a>. Before the FYI bots, it would frequently happen that a bunch of metrics regressed. However, it’s not fun to find which of the 100 patches in the roll caused the regression! 
With the FYI bots, we can see precisely which WebRTC revision caused the problem.<br /><h2>Future Work: Optimistic Auto-rolling</h2>The final step on this ladder (short of actually merging the repositories) is auto-rolling. The Blink team implemented this with their <a href="https://www.chromium.org/blink/blinkrollbot">ARB (AutoRollBot)</a>. The bot wakes up periodically and tries to do a roll patch. If it fails on the trybots, it waits and tries again later (perhaps the trybots failed because of a flake or other temporary error, or perhaps the error was real but has been fixed).<br /><br />To pull auto-rolling off, you are going to need very good tests. That goes for any roll patch (or any patch, really), but if you’re edging closer to a release and an unstoppable flood of code changes keeps breaking you, you’re not in a good place.<br /><br /><hr /><h2>References</h2><a name="ref1"></a>[1] Martin Fowler (May 2006) “Continuous Integration”<br /><a name="ref2"></a>[2] Dani Megert, Remy Chi Jian Suen, et al. (Oct 2014) “Evolving Java-based APIs”<br /><h2>Footnotes</h2><ol class="my-list"><li><a name="foot1"></a>We actually did have a hilarious bug in WebRTC where it was possible to set the volume to <a href="https://www.youtube.com/watch?v=KOO5S4vxi0o">1.1</a>, but only 0.0-1.0 was supposed to be allowed. No, really. Thus, our WebRTC implementation must be louder than the others since everybody knows 1.1 must be louder than 1.0.</li></ol><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/multi-repository-development/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Just Say No to More End-to-End Tests</title>
		<link>https://googledata.org/google-testing/just-say-no-to-more-end-to-end-tests/</link>
		<comments>https://googledata.org/google-testing/just-say-no-to-more-end-to-end-tests/#comments</comments>
		<pubDate>Wed, 22 Apr 2015 22:49:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=6ae6cdc3db2683e65d2931252bd552cf</guid>
		<description><![CDATA[<i>by Mike Wacker </i><br /><br />At some point in your life, you can probably recall a movie that you and your friends all wanted to see, and that you and your friends all regretted watching afterwards. Or maybe you remember that time your team thought they&#8217;d found the next "killer feature" for their product, only to see that feature bomb after it was released. <br /><br />Good ideas often fail in practice, and in the world of testing, one pervasive good idea that often fails in practice is a testing strategy built around end-to-end tests. <br /><br />Testers can invest their time in writing many types of automated tests, including unit tests, integration tests, and end-to-end tests, but this strategy invests mostly in end-to-end tests that verify the product or service as a whole. Typically, these tests simulate real user scenarios.<br /><h2>End-to-End Tests in Theory&#160;</h2>While relying primarily on end-to-end tests is a bad idea, one could certainly convince a reasonable person that the idea makes sense in theory. <br /><br />To start, number one on Google's list of <a href="http://www.google.com/about/company/philosophy/">ten things we know to be true</a> is: "Focus on the user and all else will follow." Thus, end-to-end tests that focus on real user scenarios sound like a great idea. 
Additionally, this strategy broadly appeals to many constituencies:<br /><ul><li><b>Developers</b> like it because it offloads most, if not all, of the testing to others.&#160;</li><li><b>Managers and decision-makers</b> like it because tests that simulate real user scenarios can help them easily determine how a failing test would impact the user.&#160;</li><li><b>Testers</b> like it because they often worry about missing a bug or writing a test that does not verify real-world behavior; writing tests from the user's perspective often avoids both problems and gives the tester a greater sense of accomplishment.&#160;</li></ul><h2>End-to-End Tests in Practice&#160;</h2>So if this testing strategy sounds so good in theory, then where does it go wrong in practice? To demonstrate, I present the following composite sketch based on a collection of real experiences familiar to both myself and other testers. In this sketch, a team is building a service for editing documents online (e.g., Google Docs). <br /><br />Let's assume the team already has some fantastic test infrastructure in place. Every night:<br /><ol><li>The latest version of the service is built.&#160;</li><li>This version is then deployed to the team's testing environment.&#160;</li><li>All end-to-end tests then run against this testing environment.&#160;</li><li>An email report summarizing the test results is sent to the team.</li></ol><br />The deadline is approaching fast as our team codes new features for their next release. To maintain a high bar for product quality, they also require that at least 90% of their end-to-end tests pass before features are considered complete. Currently, that deadline is one day away:  <br /><br /><table><tbody><tr><td>Days Left</td><td>Pass %</td><td>Notes</td></tr><tr><td>1</td><td>5%</td><td>Everything is broken! Signing in to the service is broken. 
Almost all tests sign in a user, so almost all tests failed.</td></tr><tr><td>0</td><td>4%</td><td>A partner team we rely on deployed a bad build to their testing environment yesterday.</td></tr><tr><td>-1</td><td>54%</td><td>A dev broke the save scenario yesterday (or the day before?). Half the tests save a document at some point in time. Devs spent most of the day determining if it's a frontend bug or a backend bug.</td></tr><tr><td>-2</td><td>54%</td><td>It's a frontend bug, devs spent half of today figuring out where.</td></tr><tr><td>-3</td><td>54%</td><td>A bad fix was checked in yesterday. The mistake was pretty easy to spot, though, and a correct fix was checked in today.</td></tr><tr><td>-4</td><td>1%</td><td>Hardware failures occurred in the lab for our testing environment.</td></tr><tr><td>-5</td><td>84%</td><td>Many small bugs hiding behind the big bugs (e.g., sign-in broken, save broken). Still working on the small bugs.</td></tr><tr><td>-6</td><td>87%</td><td>We should be above 90%, but are not for some reason.</td></tr><tr><td>-7</td><td>89.54%</td><td>(Rounds up to 90%, close enough.) 
No fixes were checked in yesterday, so the tests must have been flaky yesterday.</td></tr></tbody></table><br /><h3>Analysis&#160;</h3>Despite numerous problems, the tests ultimately did catch real bugs.<br /><br /><h4>What Went Well&#160;</h4><ul><li>Customer-impacting bugs were identified and fixed before they reached the customer.</li></ul><br /><h4>What Went Wrong&#160;</h4><ul><li>The team completed their coding milestone a week late (and worked a lot of overtime).&#160;</li><li>Finding the root cause for a failing end-to-end test is painful and can take a long time.&#160;</li><li>Partner failures and lab failures ruined the test results on multiple days.&#160;</li><li>Many smaller bugs were hidden behind bigger bugs.&#160;</li><li>End-to-end tests were flaky at times.&#160;</li><li>Developers had to wait until the following day to know if a fix worked or not.&#160;</li></ul><br />So now that we know what went wrong with the end-to-end strategy, we need to change our approach to testing to avoid many of these problems. But what is the right approach?<br /><h2>The True Value of Tests&#160;</h2>Typically, a tester's job ends once they have a failing test. A bug is filed, and then it's the developer's job to fix the bug. To identify where the end-to-end strategy breaks down, however, we need to think outside this box and approach the problem from first principles. If we "focus on the user (and all else will follow)," we have to ask ourselves how a failing test benefits the user. Here is the answer:<br /><br /><b>A failing test does not directly benefit the user.&#160;</b><br /><br />While this statement seems shocking at first, it is true. If a product works, it works, whether a test says it works or not. If a product is broken, it is broken, whether a test says it is broken or not. 
So, if failing tests do not benefit the user, then what does benefit the user?<br /><br /><b>A bug fix directly benefits the user.</b><br /><br />The user will only be happy when that unintended behavior - the bug - goes away. Obviously, to fix a bug, you must know the bug exists. To know the bug exists, ideally you have a test that catches the bug (because the user will find the bug if the test does not). But in that entire process, from failing test to bug fix, value is only added at the very last step.  <br /><br /><table><tbody><tr><td>Stage</td><td>Failing Test</td><td>Bug Opened</td><td>Bug Fixed</td></tr><tr><td>Value Added</td><td>No</td><td>No</td><td>Yes</td></tr></tbody></table><br />Thus, to evaluate any testing strategy, you cannot just evaluate how it finds bugs. You also must evaluate how it enables developers to fix (and even prevent) bugs.<br /><h2>Building the Right Feedback Loop</h2>Tests create a feedback loop that informs the developer whether the product is working or not. The ideal feedback loop has several properties:<br /><ul><li><b>It's fast</b>. No developer wants to wait hours or days to find out if their change works. Sometimes the change does not work - nobody is perfect - and the feedback loop needs to run multiple times. A faster feedback loop leads to faster fixes. If the loop is fast enough, developers may even run tests before checking in a change.&#160;</li><li><b>It's reliable</b>. No developer wants to spend hours debugging a test, only to find out it was a flaky test. Flaky tests reduce the developer's trust in the test, and as a result flaky tests are often ignored, even when they find real product issues.&#160;</li><li><b>It isolates failures</b>. To fix a bug, developers need to find the specific lines of code causing the bug. 
When a product contains millions of lines of code, and the bug could be anywhere, it's like trying to find a needle in a haystack.&#160;</li></ul><h2>Think Smaller, Not Larger</h2>So how do we create that ideal feedback loop? By thinking smaller, not larger.<br /><br /><h3>Unit Tests</h3>Unit tests take a small piece of the product and test that piece in isolation. They tend to create that ideal feedback loop:<br /><br /><ul><li><b>Unit tests are fast</b>. We only need to build a small unit to test it, and the tests also tend to be rather small. In fact, one tenth of a second is considered slow for unit tests.&#160;</li><li><b>Unit tests are reliable</b>. Simple systems and small units in general tend to suffer much less from flakiness. Furthermore, best practices for unit testing - in particular practices related to hermetic tests - will remove flakiness entirely.&#160;</li><li><b>Unit tests isolate failures</b>. Even if a product contains millions of lines of code, if a unit test fails, you only need to search that small unit under test to find the bug.&#160;</li></ul><br />Writing effective unit tests requires skills in areas such as dependency management, mocking, and hermetic testing. I won't cover these skills here, but as a start, the typical example offered to new Googlers (or Nooglers) is how Google <a href="https://github.com/google/guava/blob/master/guava/src/com/google/common/base/Stopwatch.java">builds</a> and <a href="https://github.com/google/guava/blob/master/guava-tests/test/com/google/common/base/StopwatchTest.java">tests</a> a stopwatch.<br /><br /><h3>Unit Tests vs. End-to-End Tests</h3>With end-to-end tests, you have to wait: first for the entire product to be built, then for it to be deployed, and finally for all end-to-end tests to run. When the tests do run, flaky tests tend to be a fact of life. 
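By contrast, a hermetic unit test in the spirit of the stopwatch example needs no build-and-deploy cycle and no real clock. A minimal sketch (a toy stopwatch invented here, not Google's actual code), with the time source injected so the test is deterministic:

```python
import unittest

class Stopwatch:
    """Toy stopwatch; the time source is injected to keep tests hermetic."""

    def __init__(self, time_fn):
        self._time_fn = time_fn  # e.g. time.monotonic in production
        self._start = None
        self.elapsed = 0.0

    def start(self):
        self._start = self._time_fn()

    def stop(self):
        self.elapsed += self._time_fn() - self._start
        self._start = None

class StopwatchTest(unittest.TestCase):
    def test_elapsed_time(self):
        ticks = iter([10.0, 12.5])  # fake clock: start() reads 10.0, stop() reads 12.5
        watch = Stopwatch(time_fn=lambda: next(ticks))
        watch.start()
        watch.stop()
        self.assertEqual(watch.elapsed, 2.5)  # no sleeping, no flakiness
```

Because the fake clock is deterministic, this test is fast (no real waiting), reliable (no dependence on machine load), and isolates any failure to the Stopwatch unit.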
And even if a test finds a bug, that bug could be anywhere in the product.<br /><br />Although end-to-end tests do a better job of simulating real user scenarios, this advantage quickly becomes outweighed by all the disadvantages of the end-to-end feedback loop:  <br /><br /><table><tbody><tr><td></td><td>Unit</td><td>End-to-End</td></tr><tr><td>Fast</td><td><div><a href="http://2.bp.blogspot.com/-Fp_uRzlgj_g/VTgeI4I5B2I/AAAAAAAAAJs/MyLzM2aMqoc/s1600/happy.png"><img border="0" src="http://2.bp.blogspot.com/-Fp_uRzlgj_g/VTgeI4I5B2I/AAAAAAAAAJs/MyLzM2aMqoc/s1600/happy.png"></a></div><br /></td><td><div><a href="http://4.bp.blogspot.com/-iWLqXFZLJZs/VTgeM4TVclI/AAAAAAAAAJ0/OLSRuxk_KRY/s1600/sad.png"><img border="0" src="http://4.bp.blogspot.com/-iWLqXFZLJZs/VTgeM4TVclI/AAAAAAAAAJ0/OLSRuxk_KRY/s1600/sad.png"></a></div><br /></td></tr><tr><td>Reliable</td><td><div><a href="http://2.bp.blogspot.com/-Fp_uRzlgj_g/VTgeI4I5B2I/AAAAAAAAAJs/MyLzM2aMqoc/s1600/happy.png"><img border="0" src="http://2.bp.blogspot.com/-Fp_uRzlgj_g/VTgeI4I5B2I/AAAAAAAAAJs/MyLzM2aMqoc/s1600/happy.png"></a></div><br /></td><td><div><a href="http://4.bp.blogspot.com/-iWLqXFZLJZs/VTgeM4TVclI/AAAAAAAAAJ0/OLSRuxk_KRY/s1600/sad.png"><img border="0" src="http://4.bp.blogspot.com/-iWLqXFZLJZs/VTgeM4TVclI/AAAAAAAAAJ0/OLSRuxk_KRY/s1600/sad.png"></a></div><div><br /></div><br /></td></tr><tr><td>Isolates Failures</td><td><div><a href="http://2.bp.blogspot.com/-Fp_uRzlgj_g/VTgeI4I5B2I/AAAAAAAAAJs/MyLzM2aMqoc/s1600/happy.png"><img border="0" src="http://2.bp.blogspot.com/-Fp_uRzlgj_g/VTgeI4I5B2I/AAAAAAAAAJs/MyLzM2aMqoc/s1600/happy.png"></a></div><br /></td><td><div><a href="http://4.bp.blogspot.com/-iWLqXFZLJZs/VTgeM4TVclI/AAAAAAAAAJ0/OLSRuxk_KRY/s1600/sad.png"><img border="0" src="http://4.bp.blogspot.com/-iWLqXFZLJZs/VTgeM4TVclI/AAAAAAAAAJ0/OLSRuxk_KRY/s1600/sad.png"></a></div><br /></td></tr><tr><td>Simulates a Real User</td><td><div><a 
href="http://4.bp.blogspot.com/-iWLqXFZLJZs/VTgeM4TVclI/AAAAAAAAAJ0/OLSRuxk_KRY/s1600/sad.png"><img border="0" src="http://4.bp.blogspot.com/-iWLqXFZLJZs/VTgeM4TVclI/AAAAAAAAAJ0/OLSRuxk_KRY/s1600/sad.png"></a></div><br /></td><td><div><a href="http://2.bp.blogspot.com/-Fp_uRzlgj_g/VTgeI4I5B2I/AAAAAAAAAJs/MyLzM2aMqoc/s1600/happy.png"><img border="0" src="http://2.bp.blogspot.com/-Fp_uRzlgj_g/VTgeI4I5B2I/AAAAAAAAAJs/MyLzM2aMqoc/s1600/happy.png"></a></div><br /></td></tr></tbody></table><br /><h3>Integration Tests</h3>Unit tests do have one major disadvantage: even if the units work well in isolation, you do not know if they work well together. But even then, you do not necessarily need end-to-end tests. For that, you can use an integration test. An integration test takes a small group of units, often two units, and tests their behavior as a whole, verifying that they coherently work together.<br /><br />If two units do not integrate properly, why write an end-to-end test when you can write a much smaller, more focused integration test that will detect the same bug? While you do need to think larger, you only need to think a little larger to verify that units work together.<br /><h2>Testing Pyramid</h2>Even with both unit tests and integration tests, you probably still will want a small number of end-to-end tests to verify the system as a whole. To find the right balance between all three test types, the best visual aid to use is the testing pyramid. 
Here is a simplified version of the <a href="https://docs.google.com/presentation/d/15gNk21rjer3xo-b1ZqyQVGebOp_aPvHU3YH7YnOMxtE/edit#slide=id.g437663ce1_53_98">testing pyramid</a> from the opening keynote of the <a href="https://developers.google.com/google-test-automation-conference/2014/">2014 Google Test Automation Conference</a>:  <br /><br /><div><a href="http://2.bp.blogspot.com/-YTzv_O4TnkA/VTgexlumP1I/AAAAAAAAAJ8/57-rnwyvP6g/s1600/image02.png"><img border="0" src="http://2.bp.blogspot.com/-YTzv_O4TnkA/VTgexlumP1I/AAAAAAAAAJ8/57-rnwyvP6g/s1600/image02.png" height="276" width="320"></a></div><br /><br />The bulk of your tests are unit tests at the bottom of the pyramid. As you move up the pyramid, your tests get larger, but at the same time the number of tests (the width of your pyramid) gets smaller.<br /><br />As a good first guess, Google often suggests a 70/20/10 split: 70% unit tests, 20% integration tests, and 10% end-to-end tests. The exact mix will be different for each team, but in general, it should retain that pyramid shape. Try to avoid these anti-patterns:<br /><ul><li><b>Inverted pyramid/ice cream cone</b>. The team relies primarily on end-to-end tests, using few integration tests and even fewer unit tests.&#160;</li><li><b>Hourglass</b>. The team starts with a lot of unit tests, then uses end-to-end tests where integration tests should be used. The hourglass has many unit tests at the bottom and many end-to-end tests at the top, but few integration tests in the middle.&#160;</li></ul>Just like a regular pyramid tends to be the most stable structure in real life, the testing pyramid also tends to be the most stable testing strategy. <br /><br /><br />]]></description>
				<content:encoded><![CDATA[<i>by Mike Wacker </i><br /><br />At some point in your life, you can probably recall a movie that you and your friends all wanted to see, and that you and your friends all regretted watching afterwards. Or maybe you remember that time your team thought they’d found the next "killer feature" for their product, only to see that feature bomb after it was released. <br /><br />Good ideas often fail in practice, and in the world of testing, one pervasive good idea that often fails in practice is a testing strategy built around end-to-end tests. <br /><br />Testers can invest their time in writing many types of automated tests, including unit tests, integration tests, and end-to-end tests, but this strategy invests mostly in end-to-end tests that verify the product or service as a whole. Typically, these tests simulate real user scenarios.<br /><h2>End-to-End Tests in Theory&nbsp;</h2>While relying primarily on end-to-end tests is a bad idea, one could certainly convince a reasonable person that the idea makes sense in theory. <br /><br />To start, number one on Google's list of <a href="http://www.google.com/about/company/philosophy/">ten things we know to be true</a> is: "Focus on the user and all else will follow." Thus, end-to-end tests that focus on real user scenarios sound like a great idea. 
Additionally, this strategy broadly appeals to many constituencies:<br /><ul class="my-list"><li><b>Developers</b> like it because it offloads most, if not all, of the testing to others.&nbsp;</li><li><b>Managers and decision-makers</b> like it because tests that simulate real user scenarios can help them easily determine how a failing test would impact the user.&nbsp;</li><li><b>Testers</b> like it because they often worry about missing a bug or writing a test that does not verify real-world behavior; writing tests from the user's perspective often avoids both problems and gives the tester a greater sense of accomplishment.&nbsp;</li></ul><h2>End-to-End Tests in Practice&nbsp;</h2>So if this testing strategy sounds so good in theory, then where does it go wrong in practice? To demonstrate, I present the following composite sketch based on a collection of real experiences familiar to both myself and other testers. In this sketch, a team is building a service for editing documents online (e.g., Google Docs). <br /><br />Let's assume the team already has some fantastic test infrastructure in place. Every night:<br /><ol class="my-list"><li>The latest version of the service is built.&nbsp;</li><li>This version is then deployed to the team's testing environment.&nbsp;</li><li>All end-to-end tests then run against this testing environment.&nbsp;</li><li>An email report summarizing the test results is sent to the team.</li></ol><br />The deadline is approaching fast as our team codes new features for their next release. To maintain a high bar for product quality, they also require that at least 90% of their end-to-end tests pass before features are considered complete. Currently, that deadline is one day away:  <br /><br /><table class="my-bordered-table" style="color: black;"><tbody><tr><td style="white-space: nowrap;">Days Left</td><td>Pass %</td><td>Notes</td></tr><tr style="background-color: red;"><td>1</td><td>5%</td><td>Everything is broken! 
Signing in to the service is broken. Almost all tests sign in a user, so almost all tests failed.</td></tr><tr style="background-color: red;"><td>0</td><td>4%</td><td>A partner team we rely on deployed a bad build to their testing environment yesterday.</td></tr><tr style="background-color: orange;"><td>-1</td><td>54%</td><td>A dev broke the save scenario yesterday (or the day before?). Half the tests save a document at some point in time. Devs spent most of the day determining if it's a frontend bug or a backend bug.</td></tr><tr style="background-color: orange;"><td>-2</td><td>54%</td><td>It's a frontend bug; devs spent half of today figuring out where.</td></tr><tr style="background-color: orange;"><td>-3</td><td>54%</td><td>A bad fix was checked in yesterday. The mistake was pretty easy to spot, though, and a correct fix was checked in today.</td></tr><tr style="background-color: red;"><td>-4</td><td>1%</td><td>Hardware failures occurred in the lab for our testing environment.</td></tr><tr style="background-color: yellow;"><td>-5</td><td>84%</td><td>Many small bugs hiding behind the big bugs (e.g., sign-in broken, save broken). Still working on the small bugs.</td></tr><tr style="background-color: yellow;"><td>-6</td><td>87%</td><td>We should be above 90%, but are not for some reason.</td></tr><tr style="background-color: lime;"><td>-7</td><td>89.54%</td><td>(Rounds up to 90%, close enough.) 
No fixes were checked in yesterday, so the tests must have been flaky yesterday.</td></tr></tbody></table><br /><h3>Analysis&nbsp;</h3>Despite numerous problems, the tests ultimately did catch real bugs.<br /><br /><h4>What Went Well&nbsp;</h4><ul class="my-list"><li>Customer-impacting bugs were identified and fixed before they reached the customer.</li></ul><br /><h4>What Went Wrong&nbsp;</h4><ul class="my-list"><li>The team completed their coding milestone a week late (and worked a lot of overtime).&nbsp;</li><li>Finding the root cause for a failing end-to-end test is painful and can take a long time.&nbsp;</li><li>Partner failures and lab failures ruined the test results on multiple days.&nbsp;</li><li>Many smaller bugs were hidden behind bigger bugs.&nbsp;</li><li>End-to-end tests were flaky at times.&nbsp;</li><li>Developers had to wait until the following day to know if a fix worked or not.&nbsp;</li></ul><br />So now that we know what went wrong with the end-to-end strategy, we need to change our approach to testing to avoid many of these problems. But what is the right approach?<br /><h2>The True Value of Tests&nbsp;</h2>Typically, a tester's job ends once they have a failing test. A bug is filed, and then it's the developer's job to fix the bug. To identify where the end-to-end strategy breaks down, however, we need to think outside this box and approach the problem from first principles. If we "focus on the user (and all else will follow)," we have to ask ourselves how a failing test benefits the user. Here is the answer:<br /><br /><b>A failing test does not directly benefit the user.&nbsp;</b><br /><br />While this statement seems shocking at first, it is true. If a product works, it works, whether a test says it works or not. If a product is broken, it is broken, whether a test says it is broken or not. 
So, if failing tests do not benefit the user, then what does benefit the user?<br /><br /><b>A bug fix directly benefits the user.</b><br /><br />The user will only be happy when that unintended behavior - the bug - goes away. Obviously, to fix a bug, you must know the bug exists. To know the bug exists, ideally you have a test that catches the bug (because the user will find the bug if the test does not). But in that entire process, from failing test to bug fix, value is only added at the very last step.  <br /><br /><table class="my-bordered-table" style="color: black; text-align: left;"><tbody><tr><td>Stage</td><td style="background-color: red;">Failing Test</td><td style="background-color: red;">Bug Opened</td><td style="background-color: lime;">Bug Fixed</td></tr><tr><td>Value Added</td><td style="background-color: red;">No</td><td style="background-color: red;">No</td><td style="background-color: lime;">Yes</td></tr></tbody></table><br />Thus, to evaluate any testing strategy, you cannot just evaluate how it finds bugs. You also must evaluate how it enables developers to fix (and even prevent) bugs.<br /><h2>Building the Right Feedback Loop</h2>Tests create a feedback loop that informs the developer whether the product is working or not. The ideal feedback loop has several properties:<br /><ul class="my-list"><li><b>It's fast</b>. No developer wants to wait hours or days to find out if their change works. Sometimes the change does not work - nobody is perfect - and the feedback loop needs to run multiple times. A faster feedback loop leads to faster fixes. If the loop is fast enough, developers may even run tests before checking in a change.&nbsp;</li><li><b>It's reliable</b>. No developer wants to spend hours debugging a test, only to find out it was a flaky test. Flaky tests reduce the developer's trust in the test, and as a result flaky tests are often ignored, even when they find real product issues.&nbsp;</li><li><b>It isolates failures</b>. 
To fix a bug, developers need to find the specific lines of code causing the bug. When a product contains millions of lines of code, and the bug could be anywhere, it's like trying to find a needle in a haystack.&nbsp;</li></ul><h2>Think Smaller, Not Larger</h2>So how do we create that ideal feedback loop? By thinking smaller, not larger.<br /><br /><h3>Unit Tests</h3>Unit tests take a small piece of the product and test that piece in isolation. They tend to create that ideal feedback loop:<br /><br /><ul class="my-list"><li><b>Unit tests are fast</b>. We only need to build a small unit to test it, and the tests also tend to be rather small. In fact, one tenth of a second is considered slow for unit tests.&nbsp;</li><li><b>Unit tests are reliable</b>. Simple systems and small units in general tend to suffer much less from flakiness. Furthermore, best practices for unit testing - in particular practices related to hermetic tests - will remove flakiness entirely.&nbsp;</li><li><b>Unit tests isolate failures</b>. Even if a product contains millions of lines of code, if a unit test fails, you only need to search that small unit under test to find the bug.&nbsp;</li></ul><br />Writing effective unit tests requires skills in areas such as dependency management, mocking, and hermetic testing. I won't cover these skills here, but as a start, the typical example offered to new Googlers (or Nooglers) is how Google <a href="https://github.com/google/guava/blob/master/guava/src/com/google/common/base/Stopwatch.java">builds</a> and <a href="https://github.com/google/guava/blob/master/guava-tests/test/com/google/common/base/StopwatchTest.java">tests</a> a stopwatch.<br /><br /><h3>Unit Tests vs. End-to-End Tests</h3>With end-to-end tests, you have to wait: first for the entire product to be built, then for it to be deployed, and finally for all end-to-end tests to run. When the tests do run, flaky tests tend to be a fact of life. 
And even if a test finds a bug, that bug could be anywhere in the product.<br /><br />Although end-to-end tests do a better job of simulating real user scenarios, this advantage quickly becomes outweighed by all the disadvantages of the end-to-end feedback loop:  <br /><br /><table class="my-bordered-table" style="color: black;"><tbody><tr><td></td><td>Unit</td><td>End-to-End</td></tr><tr><td>Fast</td><td><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-Fp_uRzlgj_g/VTgeI4I5B2I/AAAAAAAAAJs/MyLzM2aMqoc/s1600/happy.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-Fp_uRzlgj_g/VTgeI4I5B2I/AAAAAAAAAJs/MyLzM2aMqoc/s1600/happy.png" /></a></div><br /></td><td><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-iWLqXFZLJZs/VTgeM4TVclI/AAAAAAAAAJ0/OLSRuxk_KRY/s1600/sad.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-iWLqXFZLJZs/VTgeM4TVclI/AAAAAAAAAJ0/OLSRuxk_KRY/s1600/sad.png" /></a></div><br /></td></tr><tr><td>Reliable</td><td><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-Fp_uRzlgj_g/VTgeI4I5B2I/AAAAAAAAAJs/MyLzM2aMqoc/s1600/happy.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-Fp_uRzlgj_g/VTgeI4I5B2I/AAAAAAAAAJs/MyLzM2aMqoc/s1600/happy.png" /></a></div><br /></td><td><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-iWLqXFZLJZs/VTgeM4TVclI/AAAAAAAAAJ0/OLSRuxk_KRY/s1600/sad.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-iWLqXFZLJZs/VTgeM4TVclI/AAAAAAAAAJ0/OLSRuxk_KRY/s1600/sad.png" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><br /></td></tr><tr><td>Isolates Failures</td><td><div 
class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-Fp_uRzlgj_g/VTgeI4I5B2I/AAAAAAAAAJs/MyLzM2aMqoc/s1600/happy.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-Fp_uRzlgj_g/VTgeI4I5B2I/AAAAAAAAAJs/MyLzM2aMqoc/s1600/happy.png" /></a></div><br /></td><td><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-iWLqXFZLJZs/VTgeM4TVclI/AAAAAAAAAJ0/OLSRuxk_KRY/s1600/sad.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-iWLqXFZLJZs/VTgeM4TVclI/AAAAAAAAAJ0/OLSRuxk_KRY/s1600/sad.png" /></a></div><br /></td></tr><tr><td>Simulates a Real User</td><td><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-iWLqXFZLJZs/VTgeM4TVclI/AAAAAAAAAJ0/OLSRuxk_KRY/s1600/sad.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-iWLqXFZLJZs/VTgeM4TVclI/AAAAAAAAAJ0/OLSRuxk_KRY/s1600/sad.png" /></a></div><br /></td><td><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-Fp_uRzlgj_g/VTgeI4I5B2I/AAAAAAAAAJs/MyLzM2aMqoc/s1600/happy.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-Fp_uRzlgj_g/VTgeI4I5B2I/AAAAAAAAAJs/MyLzM2aMqoc/s1600/happy.png" /></a></div><br /></td></tr></tbody></table><br /><h3>Integration Tests</h3>Unit tests do have one major disadvantage: even if the units work well in isolation, you do not know if they work well together. But even then, you do not necessarily need end-to-end tests. For that, you can use an integration test. 
An integration test takes a small group of units, often two units, and tests their behavior as a whole, verifying that they coherently work together.<br /><br />If two units do not integrate properly, why write an end-to-end test when you can write a much smaller, more focused integration test that will detect the same bug? While you do need to think larger, you only need to think a little larger to verify that units work together.<br /><h2>Testing Pyramid</h2>Even with both unit tests and integration tests, you probably still will want a small number of end-to-end tests to verify the system as a whole. To find the right balance between all three test types, the best visual aid to use is the testing pyramid. Here is a simplified version of the <a href="https://docs.google.com/presentation/d/15gNk21rjer3xo-b1ZqyQVGebOp_aPvHU3YH7YnOMxtE/edit#slide=id.g437663ce1_53_98">testing pyramid</a> from the opening keynote of the <a href="https://developers.google.com/google-test-automation-conference/2014/">2014 Google Test Automation Conference</a>:  <br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-YTzv_O4TnkA/VTgexlumP1I/AAAAAAAAAJ8/57-rnwyvP6g/s1600/image02.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-YTzv_O4TnkA/VTgexlumP1I/AAAAAAAAAJ8/57-rnwyvP6g/s1600/image02.png" height="276" width="320" /></a></div><br /><br />The bulk of your tests are unit tests at the bottom of the pyramid. As you move up the pyramid, your tests get larger, but at the same time the number of tests (the width of your pyramid) gets smaller.<br /><br />As a good first guess, Google often suggests a 70/20/10 split: 70% unit tests, 20% integration tests, and 10% end-to-end tests. The exact mix will be different for each team, but in general, it should retain that pyramid shape. 
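To make the unit and integration layers of the pyramid concrete, here is a minimal sketch in Python (the `tokenize` and `count_words` units are hypothetical, invented purely for illustration; they are not code from any Google product):

```python
# Two small, hypothetical units of a product.
def tokenize(text):
    """Split text into lowercase words."""
    return text.lower().split()

def count_words(words):
    """Count how often each word occurs."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Unit tests: each unit exercised in isolation. Small, fast, hermetic.
assert tokenize("To be OR not") == ["to", "be", "or", "not"]
assert count_words(["a", "b", "a"]) == {"a": 2, "b": 1}

# Integration test: only slightly larger, verifying the two units together.
assert count_words(tokenize("Go go go")) == {"go": 3}
```

A failure in either of the first two assertions points at a single unit; a failure in the last one points at the seam between the two units, which is exactly the failure isolation the pyramid is after.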
Try to avoid these anti-patterns:<br /><ul class="my-list"><li><b>Inverted pyramid/ice cream cone</b>. The team relies primarily on end-to-end tests, using few integration tests and even fewer unit tests.&nbsp;</li><li><b>Hourglass</b>. The team starts with a lot of unit tests, then uses end-to-end tests where integration tests should be used. The hourglass has many unit tests at the bottom and many end-to-end tests at the top, but few integration tests in the middle.&nbsp;</li></ul>Just like a regular pyramid tends to be the most stable structure in real life, the testing pyramid also tends to be the most stable testing strategy. <br /><br /><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/just-say-no-to-more-end-to-end-tests/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Quantum Quality</title>
		<link>https://googledata.org/google-testing/quantum-quality/</link>
		<comments>https://googledata.org/google-testing/quantum-quality/#comments</comments>
		<pubDate>Wed, 01 Apr 2015 09:30:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=eb3ec7363b4d897e03762e4693f57dac</guid>
		<description><![CDATA[UPDATE:&#160;Hey, this was an April fool's joke but in fact we wished we could have realized this idea and we are looking forward to the day this has been worked out and becomes a reality. by Kevin Graney. Here at Google we have a long history of capitaliz...]]></description>
				<content:encoded><![CDATA[<div style="text-align: center;"><span style="color: red;">UPDATE</span>:&nbsp;Hey, this was an April fool's joke but in fact we wished we could</div><div style="text-align: center;">have realized this idea and we are looking forward to the day this has</div><div style="text-align: center;">been worked out and becomes a reality.</div><i><br /></i><i>by Kevin Graney </i><br /><br />Here at Google we have a long history of capitalizing on the latest research and technology to improve the quality of our software.  Over our past 16+ years as a company, what started with some humble unit tests has grown into a massive operation.  As our software complexity increased, ever larger and more complex tests were dreamed up by our Software Engineers in Test (SETs). <br /><br />What we have come to realize is that our love of testing is a double-edged sword.  On the one hand, large-scale testing keeps us honest and gives us confidence.  It ensures our products remain reliable, our users' data is kept safe, and our engineers are able to work productively without fear of breaking things.  On the other hand, it's expensive in both engineer and machine time.  Our SETs have been working tirelessly to reduce the expense and latency of software tests at Google, while continuing to increase their quality. <br /><br />Today, we're excited to reveal how Google is tackling this challenge.  In collaboration with the <a href="https://plus.google.com/+QuantumAILab">Quantum AI Lab</a>, SETs at Google have been busy revolutionizing how software is tested.  The theory is relatively simple: bits in a traditional computer are either zero or one, but bits in a quantum computer can be both one and zero at the same time.  This is known as superposition, and the classic example is <a href="http://en.wikipedia.org/wiki/Schr%C3%B6dinger%27s_cat">Schrödinger's cat</a>.  
Through some clever math and cutting edge electrical engineering, researchers at Google have figured out how to utilize superposition to vastly improve the quality of our software testing and the speed at which our tests run. <br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-3Z-mBr4H0E4/VRoJwyCq0cI/AAAAAAAAAJI/lrWTbEvy4cc/s1600/image04.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-3Z-mBr4H0E4/VRoJwyCq0cI/AAAAAAAAAJI/lrWTbEvy4cc/s1600/image04.png" height="316" width="400" /></a></div><br /><div style="text-align: center;"><b>Figure 1</b> Some qubits inside a Google quantum device. </div><br />With superposition, tests at Google are now able to simultaneously model every possible state of the application under test.  The state of the application can be thought of as an <i>n</i> bit sequential memory buffer, consistent with the traditional <a href="http://en.wikipedia.org/wiki/Von_Neumann_architecture">von Neumann architecture</a> of computing.  Because each bit under superposition is simultaneously a 0 and a 1, these tests can simulate 2<span style="font-size: small; vertical-align: super;"><i>n</i></span> different application states at any given instant in time in <i>O(n)</i> space.  Each of these application states can be mutated by application logic to another state in constant time using quantum algorithms developed by Google researchers.  These two properties together allow us to build a state transition graph of the application under test that shows every possible application state and all possible transitions to other application states.  Using traditional computing methods this problem has intractable time complexity, but after leveraging superposition and our quantum algorithms it becomes relatively fast and cheap. 
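Quantum hardware aside, the reachability step described here is ordinary graph search. The following classical Python sketch uses a hypothetical 3-bit transition map chosen to be consistent with Figure 2 (start state 001; states 010 and 100 deadlock); it is an illustration only, not an implementation from the post:

```python
from collections import deque

# Hypothetical transition map for the 3-bit application of Figure 2:
# 000, 011, 110, and 111 are unreachable from 001, and 010 and 100
# have no outgoing transitions, i.e. they deadlock.
transitions = {
    "001": ["101", "010"],
    "101": ["100", "001"],
    "010": [],            # deadlock
    "100": [],            # deadlock
    "000": ["001"],
    "011": ["111"],
    "110": ["111"],
    "111": ["110"],
}

def reachable(start):
    """Breadth-first search: every state reachable from the start state."""
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in transitions[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

states = reachable("001")
assert states == {"001", "101", "010", "100"}

# Assertions are then written only against reachable states,
# for example detecting deadlocks:
assert {s for s in states if not transitions[s]} == {"010", "100"}
```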
<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-gLI5SNZjBig/VRoKv62JrNI/AAAAAAAAAJY/CdmF33KtFjo/s1600/state.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-gLI5SNZjBig/VRoKv62JrNI/AAAAAAAAAJY/CdmF33KtFjo/s1600/state.png" /></a></div><div style="margin-left: auto; margin-right: auto; width: 500px;"><b>Figure 2</b> The application state graph for a demonstrative 3-bit application.  If the start state is 001 then 000, 110, 111, and 011 are all unreachable states.  States 010 and 100 both result in deadlock. </div><br /><div style="text-align: justify;">Once we have the state transition graph for the application under test, testing it becomes almost trivial.  Given the initial startup state of the application, i.e. the executable bits of the application stored on disk, we can find from the application's state transition graph all reachable states.  Assertions that ensure proper behavior are then written against the reachable subset of the transition graph.  This paradigm of test writing allows both Google's security engineers and software engineers to work more productively.  A security engineer can write a test, for example, that asserts "no executable memory regions become mutated in any reachable state".  This one test effectively eliminates the potential for security flaws that result from <a href="http://en.wikipedia.org/wiki/Memory_safety">memory safety violations</a>.  A test engineer can write higher level assertions using graph traversal methods that ensure data integrity is maintained across a subset of application state transitions.  Tests of this nature can detect data corruption bugs. </div><br />We're excited about the work our team has done so far to push the envelope in the field of quantum software quality.  
We're just getting started, but based on early dogfood results among Googlers we believe the potential of this work is huge.  Stay tuned! <br /><br /><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/quantum-quality/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Android UI Automated Testing</title>
		<link>https://googledata.org/google-testing/android-ui-automated-testing/</link>
		<comments>https://googledata.org/google-testing/android-ui-automated-testing/#comments</comments>
		<pubDate>Fri, 20 Mar 2015 20:49:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>
		<category><![CDATA[mobile]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=f22535e0050b3c0f1009e65f441c3016</guid>
		<description><![CDATA[<i>by Mona El Mahdy </i><br /><br /><span><b>Overview </b></span><br /><br />This post reviews four strategies for Android UI testing with the goal of creating UI tests that are fast, reliable, and easy to debug.<br /><br />Before we begin, let&#8217;s not forget an important rule: whatever can be unit tested should be unit tested. <a href="http://robolectric.org/">Robolectric</a> and <a href="http://tools.android.com/tech-docs/unit-testing-support">Gradle unit test support</a> are great examples of unit test frameworks for Android. UI tests, on the other hand, are used to verify that your application returns the correct UI output in response to a sequence of user actions on a device. <a href="http://developer.android.com/tools/testing-support-library/index.html#Espresso">Espresso</a> is a great framework for running UI actions and verifications in the same process. For more details on the Espresso and UI Automator tools, please see: <a href="http://developer.android.com/tools/testing-support-library/index.html">test support libraries</a>. <br /><br />The Google+ team has performed many iterations of UI testing. Below we discuss the lessons learned from each strategy of UI testing. Stay tuned for more posts with more details and code samples. <br /><br /><span><b>Strategy 1: Using an End-To-End Test as a UI Test </b></span><br /><br />Let&#8217;s start with some definitions. A <b>UI test</b> ensures that your application returns the correct UI output in response to a sequence of user actions on a device. An <b>end-to-end (E2E)</b> test brings up the full system of your app including all backend servers and client app. E2E tests will guarantee that data is sent to the client app and that the entire system functions correctly. <br /><br />Usually, in order to make the application UI functional, you need data from backend servers, so UI tests need to simulate the data but not necessarily the backend servers. 
In many cases UI tests are confused with E2E tests because E2E is very similar to manual test scenarios. However, debugging and stabilizing E2E tests is very difficult due to many variables like network flakiness, authentication against real servers, size of your system, etc. <br /><br /><div><a href="http://2.bp.blogspot.com/-ra84Z3ORWmg/VQyFLVWD76I/AAAAAAAAAI0/I63kDk2EF_k/s1600/image04.png"><img border="0" src="http://2.bp.blogspot.com/-ra84Z3ORWmg/VQyFLVWD76I/AAAAAAAAAI0/I63kDk2EF_k/s1600/image04.png" height="370" width="400"></a></div><br />When you use UI tests as E2E tests, you face the following problems:<br /><ul><li>Very large and slow tests.&#160;</li><li>High flakiness rate due to timeouts and memory issues.&#160;</li><li>Hard to debug/investigate failures.&#160;</li><li>Authentication issues (e.g., authentication from automated tests is very tricky).</li></ul><br />Let&#8217;s see how these problems can be fixed using the following strategies.<br /><br /><span><b>Strategy 2: Hermetic UI Testing using Fake Servers</b></span><br /><br />In this strategy, you avoid network calls and external dependencies, but you need to provide your application with data that drives the UI. Update your application to communicate with a local server rather than an external one, and create a fake local server that provides data to your application. You then need a mechanism to generate the data needed by your application. This can be done using various approaches depending on your system design. One approach is to record server responses and replay them in your fake server. <br /><br />Once you have hermetic UI tests talking to a local fake server, you should also have <a href="http://googletesting.blogspot.com/2012/10/hermetic-servers.html">server hermetic tests</a>. 
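The record-and-replay idea can be sketched as follows (a minimal Python illustration; the endpoints and payloads are hypothetical, and a real fake server would speak HTTP to the app under test):

```python
# Responses previously recorded from the real backend (hypothetical data).
recorded_responses = {
    "/api/document/42": {"title": "Draft", "body": "Hello"},
    "/api/user/me": {"name": "testuser"},
}

class FakeServer:
    """Replays recorded responses so UI tests never touch the network."""

    def __init__(self, recordings):
        self.recordings = recordings

    def get(self, path):
        # Fail loudly if a test requests something that was never recorded.
        if path not in self.recordings:
            raise KeyError("no recording for %s" % path)
        return self.recordings[path]

# During a test run, the app is pointed at the fake instead of the backend.
server = FakeServer(recorded_responses)
assert server.get("/api/document/42")["title"] == "Draft"
```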
This way you split your E2E test into a server-side test, a client-side test, and an integration test to verify that the server and client are in sync (for more details on integration tests, see the backend testing section of this <a href="http://googletesting.blogspot.com/2013/08/how-google-team-tests-mobile-apps.html">blog post</a>). <br /><br />Now, the client test flow looks like: <br /><br /><div><a href="http://3.bp.blogspot.com/-TcTGohM5xQ4/VQyFIwtvlMI/AAAAAAAAAIk/QmIHKjJ6irU/s1600/image02.png"><img border="0" src="http://3.bp.blogspot.com/-TcTGohM5xQ4/VQyFIwtvlMI/AAAAAAAAAIk/QmIHKjJ6irU/s1600/image02.png" height="400" width="201"></a></div><br />While this approach drastically reduces the test size and flakiness rate, you still need to maintain a separate fake server as well as your test. Debugging is still not easy as you have two moving parts: the test and the local server. While test stability will be largely improved by this approach, the local server will cause some flakes. <br /><br />Let&#8217;s see how this could be improved... <br /><br /><span><b>Strategy 3: Dependency Injection Design for Apps </b></span><br /><br />To remove the additional dependency of a fake server running on Android, you should use dependency injection in your application for swapping real module implementations with fake ones. One example is <a href="http://square.github.io/dagger/">Dagger</a>, or you can create your own dependency injection mechanism if needed. <br /><br />This will improve the testability of your app for both unit testing and UI testing, providing your tests with the ability to mock dependencies. In <a href="http://developer.android.com/tools/testing/testing_android.html">instrumentation testing</a>, the test APK and the app under test are loaded in the same process, so the test code has runtime access to the app code. 
Not only that, but you can also use classpath override (the fact that the test classpath takes priority over the app under test) to override a certain class and inject test fakes there. For example, to make your test hermetic, your app should support injection of the networking implementation. During testing, the test injects a fake networking implementation into your app, and this fake implementation will provide seeded data instead of communicating with backend servers.<br /><br /><div><a href="http://4.bp.blogspot.com/-wL-DjbOG7V4/VQyFKB5HM8I/AAAAAAAAAIs/xC9vexohL_Q/s1600/image03.png"><img border="0" src="http://4.bp.blogspot.com/-wL-DjbOG7V4/VQyFKB5HM8I/AAAAAAAAAIs/xC9vexohL_Q/s1600/image03.png" height="400" width="226"></a></div><br /><span><b>Strategy 4: Building Apps into Smaller Libraries </b></span><br /><br />If you want to scale your app into many modules and views, and plan to add more features while maintaining stable and fast builds/tests, then you should build your app into small components/libraries. Each library should have its own UI resources and use dependency management. This strategy not only enables mocking dependencies of your libraries for hermetic testing, but also serves as an experimentation platform for various components of your application. <br /><br />Once you have small components with dependency injection support, you can build a test app for each component. <br /><br />The test apps bring up the actual UI of your libraries, fake the data needed, and mock dependencies. Espresso tests will run against these test apps. This enables testing of smaller libraries in isolation. <br /><br />For example, let&#8217;s consider building smaller libraries for login and settings of your app. 
<br /><br /><div><a href="http://2.bp.blogspot.com/-yZbob6hsL2I/VQyFFie9n5I/AAAAAAAAAIU/Jt9sx2hprK4/s1600/image00.png"><img border="0" src="http://2.bp.blogspot.com/-yZbob6hsL2I/VQyFFie9n5I/AAAAAAAAAIU/Jt9sx2hprK4/s1600/image00.png" height="400" width="242"></a></div><br />The settings component test now looks like: <br /><br /><div><a href="http://2.bp.blogspot.com/-gYMFGkbFFME/VQyFHYt3rgI/AAAAAAAAAIc/CWoDna5AQx4/s1600/image01.png"><img border="0" src="http://2.bp.blogspot.com/-gYMFGkbFFME/VQyFHYt3rgI/AAAAAAAAAIc/CWoDna5AQx4/s1600/image01.png" height="400" width="227"></a></div><br /><span><b>Conclusion </b></span><br /><br />UI testing can be very challenging for rich apps on Android. Here are some UI testing lessons learned on the Google+ team:<br /><ol><li>Don&#8217;t write E2E tests instead of UI tests. Instead, write unit tests and integration tests alongside the UI tests.&#160;</li><li>Hermetic tests are the way to go.&#160;</li><li>Use dependency injection while designing your app.&#160;</li><li>Build your application into small libraries/modules, and test each one in isolation. You can then have a few integration tests to verify that integration between components is correct.&#160;</li><li>Componentized UI tests have proven to be much faster than E2E tests and 99%+ stable. Fast and stable tests have proven to drastically improve developer productivity.</li></ol><br />]]></description>
				<content:encoded><![CDATA[<i>by Mona El Mahdy </i><br /><br /><span style="font-size: large;"><b>Overview </b></span><br /><br />This post reviews four strategies for Android UI testing with the goal of creating UI tests that are fast, reliable, and easy to debug.<br /><br />Before we begin, let’s not forget an important rule: whatever can be unit tested should be unit tested. <a href="http://robolectric.org/">Robolectric</a> and <a href="http://tools.android.com/tech-docs/unit-testing-support">gradle unit tests support</a> are great examples of unit test frameworks for Android. UI tests, on the other hand, are used to verify that your application returns the correct UI output in response to a sequence of user actions on a device. <a href="http://developer.android.com/tools/testing-support-library/index.html#Espresso">Espresso</a> is a great framework for running UI actions and verifications in the same process. For more details on the Espresso and UI Automator tools, please see: <a href="http://developer.android.com/tools/testing-support-library/index.html">test support libraries</a>. <br /><br />The Google+ team has performed many iterations of UI testing. Below we discuss the lessons learned during each strategy of UI testing. Stay tuned for more posts with more details and code samples. <br /><br /><span style="font-size: large;"><b>Strategy 1: Using an End-To-End Test as a UI Test </b></span><br /><br />Let’s start with some definitions. A <b>UI test</b> ensures that your application returns the correct UI output in response to a sequence of user actions on a device. An <b>end-to-end (E2E)</b> test brings up the full system of your app including all backend servers and client app. E2E tests will guarantee that data is sent to the client app and that the entire system functions correctly. 
<br /><br />Usually, in order to make the application UI functional, you need data from backend servers, so UI tests need to simulate the data but not necessarily the backend servers. In many cases UI tests are confused with E2E tests because E2E is very similar to manual test scenarios. However, debugging and stabilizing E2E tests is very difficult due to many variables like network flakiness, authentication against real servers, size of your system, etc. <br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-ra84Z3ORWmg/VQyFLVWD76I/AAAAAAAAAI0/I63kDk2EF_k/s1600/image04.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-ra84Z3ORWmg/VQyFLVWD76I/AAAAAAAAAI0/I63kDk2EF_k/s1600/image04.png" height="370" width="400" /></a></div><br />When you use UI tests as E2E tests, you face the following problems:<br /><ul style="line-height: 1.3em; margin-bottom: 0px; margin-top: 0px; padding-bottom: 0px; padding-top: 0px;"><li>Very large and slow tests.&nbsp;</li><li>High flakiness rate due to timeouts and memory issues.&nbsp;</li><li>Hard to debug/investigate failures.&nbsp;</li><li>Authentication issues (e.g., authenticating from automated tests is very tricky).</li></ul><br />Let’s see how these problems can be fixed using the following strategies.<br /><br /><span style="font-size: large;"><b>Strategy 2: Hermetic UI Testing using Fake Servers</b></span><br /><br />In this strategy, you avoid network calls and external dependencies, but you need to provide your application with data that drives the UI. Update your application to communicate with a local server rather than an external one, and create a fake local server that provides data to your application. You then need a mechanism to generate the data needed by your application. This can be done using various approaches depending on your system design. 
One approach is to record server responses and replay them in your fake server. <br /><br />Once you have hermetic UI tests talking to a local fake server, you should also have <a href="http://googletesting.blogspot.com/2012/10/hermetic-servers.html">server hermetic tests</a>. This way you split your E2E test into a server-side test, a client-side test, and an integration test to verify that the server and client are in sync (for more details on integration tests, see the backend testing section of this <a href="http://googletesting.blogspot.com/2013/08/how-google-team-tests-mobile-apps.html">blog post</a>). <br /><br />Now, the client test flow looks like: <br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-TcTGohM5xQ4/VQyFIwtvlMI/AAAAAAAAAIk/QmIHKjJ6irU/s1600/image02.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-TcTGohM5xQ4/VQyFIwtvlMI/AAAAAAAAAIk/QmIHKjJ6irU/s1600/image02.png" height="400" width="201" /></a></div><br />While this approach drastically reduces the test size and flakiness rate, you still need to maintain a separate fake server as well as your test. Debugging is still not easy as you have two moving parts: the test and the local server. While test stability will be largely improved by this approach, the local server will cause some flakes. <br /><br />Let’s see how this could be improved... <br /><br /><span style="font-size: large;"><b>Strategy 3: Dependency Injection Design for Apps </b></span><br /><br />To remove the additional dependency of a fake server running on Android, you should use dependency injection in your application for swapping real module implementations with fake ones. One example is <a href="http://square.github.io/dagger/">Dagger</a>, or you can create your own dependency injection mechanism if needed. 
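The swap between real and fake modules reduces to constructor injection; here is a minimal sketch (the class and method names are illustrative, not from the post — Dagger would generate the equivalent wiring):

```java
import java.util.List;

// The seam: the app talks to the network only through this interface.
interface NetworkClient {
    List<String> fetchStream(String userId);
}

// The production implementation would make a real HTTP call; elided here.
class HttpNetworkClient implements NetworkClient {
    @Override public List<String> fetchStream(String userId) {
        throw new UnsupportedOperationException("real network call");
    }
}

// Test fake: provides seeded data instead of talking to backend servers.
class FakeNetworkClient implements NetworkClient {
    @Override public List<String> fetchStream(String userId) {
        return List.of("post-1 for " + userId, "post-2 for " + userId);
    }
}

// The class under test receives its dependency rather than constructing it,
// so a test can pass FakeNetworkClient where the app passes HttpNetworkClient.
class StreamPresenter {
    private final NetworkClient network;
    StreamPresenter(NetworkClient network) { this.network = network; }

    String renderFirstPost(String userId) {
        List<String> posts = network.fetchStream(userId);
        return posts.isEmpty() ? "(empty)" : posts.get(0);
    }
}
```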
<br /><br />This will improve the testability of your app for both unit testing and UI testing, providing your tests with the ability to mock dependencies. In <a href="http://developer.android.com/tools/testing/testing_android.html">instrumentation testing</a>, the test APK and the app under test are loaded in the same process, so the test code has runtime access to the app code. Not only that, but you can also use classpath override (the fact that the test classpath takes priority over the app under test) to override a certain class and inject test fakes there. For example, to make your test hermetic, your app should support injection of the networking implementation. During testing, the test injects a fake networking implementation into your app, and this fake implementation will provide seeded data instead of communicating with backend servers.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-wL-DjbOG7V4/VQyFKB5HM8I/AAAAAAAAAIs/xC9vexohL_Q/s1600/image03.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-wL-DjbOG7V4/VQyFKB5HM8I/AAAAAAAAAIs/xC9vexohL_Q/s1600/image03.png" height="400" width="226" /></a></div><br /><span style="font-size: large;"><b>Strategy 4: Building Apps into Smaller Libraries </b></span><br /><br />If you want to scale your app into many modules and views, and plan to add more features while maintaining stable and fast builds/tests, then you should build your app into small components/libraries. Each library should have its own UI resources and use dependency management. This strategy not only enables mocking dependencies of your libraries for hermetic testing, but also serves as an experimentation platform for various components of your application. <br /><br />Once you have small components with dependency injection support, you can build a test app for each component. 
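Under this scheme, a per-library "test app" is just a small entry point that wires the library's real UI code to fakes; a schematic version with hypothetical names:

```java
// A settings library's dependency, normally backed by the full app.
interface AccountStore {
    String currentAccount();
}

// Real UI code shipped in the settings library.
class SettingsScreen {
    private final AccountStore accounts;
    SettingsScreen(AccountStore accounts) { this.accounts = accounts; }
    String title() { return "Settings for " + accounts.currentAccount(); }
}

// The test app: brings up the real SettingsScreen against fakes only,
// so Espresso tests can exercise it without the rest of the app.
class SettingsTestApp {
    static SettingsScreen create() {
        return new SettingsScreen(() -> "test@example.com"); // seeded fake account
    }

    public static void main(String[] args) {
        System.out.println(SettingsTestApp.create().title());
    }
}
```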
<br /><br />The test apps bring up the actual UI of your libraries, fake the data needed, and mock dependencies. Espresso tests will run against these test apps. This enables testing of smaller libraries in isolation. <br /><br />For example, let’s consider building smaller libraries for login and settings of your app. <br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-yZbob6hsL2I/VQyFFie9n5I/AAAAAAAAAIU/Jt9sx2hprK4/s1600/image00.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-yZbob6hsL2I/VQyFFie9n5I/AAAAAAAAAIU/Jt9sx2hprK4/s1600/image00.png" height="400" width="242" /></a></div><br />The settings component test now looks like: <br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-gYMFGkbFFME/VQyFHYt3rgI/AAAAAAAAAIc/CWoDna5AQx4/s1600/image01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-gYMFGkbFFME/VQyFHYt3rgI/AAAAAAAAAIc/CWoDna5AQx4/s1600/image01.png" height="400" width="227" /></a></div><br /><span style="font-size: large;"><b>Conclusion </b></span><br /><br />UI testing can be very challenging for rich apps on Android. Here are some UI testing lessons learned on the Google+ team:<br /><ol style="line-height: 1.3em; margin-bottom: 0px; margin-top: 0px; padding-bottom: 0px; padding-top: 0px;"><li>Don’t write E2E tests instead of UI tests. Instead, write unit tests and integration tests alongside the UI tests.&nbsp;</li><li>Hermetic tests are the way to go.&nbsp;</li><li>Use dependency injection while designing your app.&nbsp;</li><li>Build your application into small libraries/modules, and test each one in isolation. You can then have a few integration tests to verify that integration between components is correct.&nbsp;</li><li>Componentized UI tests have proven to be much faster than E2E tests and 99%+ stable. 
Fast and stable tests have proven to drastically improve developer productivity.</li></ol><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/android-ui-automated-testing/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The First Annual Testing on the Toilet Awards</title>
		<link>https://googledata.org/google-testing/the-first-annual-testing-on-the-toilet-awards/</link>
		<comments>https://googledata.org/google-testing/the-first-annual-testing-on-the-toilet-awards/#comments</comments>
		<pubDate>Tue, 03 Feb 2015 16:50:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=3cf7a52f667d3d964670861e791600ed</guid>
		<description><![CDATA[<i>By Andrew Trenk </i><br /><br />The Testing on the Toilet (TotT) series was <a href="http://googletesting.blogspot.com/2007/01/introducing-testing-on-toilet.html">created</a> in 2006 as a way to spread unit-testing knowledge across Google by posting flyers in bathroom stalls. It quickly became a part of Google culture and is still going strong today, with new episodes published every week and read in hundreds of bathrooms by thousands of engineers in Google offices across the world. Initially focused on content related to testing, TotT now covers a variety of technical topics, such as tips on writing cleaner code and ways to prevent security bugs. <br /><br />While TotT episodes often have a big impact on many engineers across Google, until now we never did anything to formally thank authors for their contributions. To fix that, we decided to honor the most popular TotT episodes of 2014 by establishing the Testing on the Toilet Awards. The winners were chosen through a vote that was open to all Google engineers. The Google Testing Blog is proud to present the winners that were posted on this blog (there were two additional winners that weren&#8217;t posted on this blog since we only post testing-related TotT episodes). <br /><br />And the winners are ... <br /><br /><b>Erik Kuefler:</b> <a href="http://googletesting.blogspot.com/2014/04/testing-on-toilet-test-behaviors-not.html">Test Behaviors, Not Methods</a> and <a href="http://googletesting.blogspot.com/2014/07/testing-on-toilet-dont-put-logic-in.html">Don't Put Logic in Tests</a>&#160; <br /><b>Alex Eagle:</b> <a href="http://googletesting.blogspot.com/2015/01/testing-on-toilet-change-detector-tests.html">Change-Detector Tests Considered Harmful</a><br /><br />The authors of these episodes received their very own Flushy trophy, which they can proudly display on their desks. 
<br /><br /><div><a href="http://3.bp.blogspot.com/-AuTtNwV5Fp0/VND7kbIbybI/AAAAAAAAAH0/jvShp-Lgijw/s1600/image00.jpg"><img border="0" src="http://3.bp.blogspot.com/-AuTtNwV5Fp0/VND7kbIbybI/AAAAAAAAAH0/jvShp-Lgijw/s1600/image00.jpg" height="320" width="212"></a></div><br /><br />(The logo on the trophy is the same one we put on the printed version of each TotT episode, which you can see by looking for the &#8220;printer-friendly version&#8221; link in the <a href="http://googletesting.blogspot.com/search/label/TotT">TotT blog posts</a>). <br /><br />Congratulations to the winners! <br /><br />]]></description>
				<content:encoded><![CDATA[<i>By Andrew Trenk </i><br /><br />The Testing on the Toilet (TotT) series was <a href="http://googletesting.blogspot.com/2007/01/introducing-testing-on-toilet.html">created</a> in 2006 as a way to spread unit-testing knowledge across Google by posting flyers in bathroom stalls. It quickly became a part of Google culture and is still going strong today, with new episodes published every week and read in hundreds of bathrooms by thousands of engineers in Google offices across the world. Initially focused on content related to testing, TotT now covers a variety of technical topics, such as tips on writing cleaner code and ways to prevent security bugs. <br /><br />While TotT episodes often have a big impact on many engineers across Google, until now we never did anything to formally thank authors for their contributions. To fix that, we decided to honor the most popular TotT episodes of 2014 by establishing the Testing on the Toilet Awards. The winners were chosen through a vote that was open to all Google engineers. The Google Testing Blog is proud to present the winners that were posted on this blog (there were two additional winners that weren’t posted on this blog since we only post testing-related TotT episodes). <br /><br />And the winners are ... <br /><br /><b>Erik Kuefler:</b> <a href="http://googletesting.blogspot.com/2014/04/testing-on-toilet-test-behaviors-not.html">Test Behaviors, Not Methods</a> and <a href="http://googletesting.blogspot.com/2014/07/testing-on-toilet-dont-put-logic-in.html">Don't Put Logic in Tests</a>&nbsp; <br /><b>Alex Eagle:</b> <a href="http://googletesting.blogspot.com/2015/01/testing-on-toilet-change-detector-tests.html">Change-Detector Tests Considered Harmful</a><br /><br />The authors of these episodes received their very own Flushy trophy, which they can proudly display on their desks. 
<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-AuTtNwV5Fp0/VND7kbIbybI/AAAAAAAAAH0/jvShp-Lgijw/s1600/image00.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-AuTtNwV5Fp0/VND7kbIbybI/AAAAAAAAAH0/jvShp-Lgijw/s1600/image00.jpg" height="320" width="212" /></a></div><br /><br />(The logo on the trophy is the same one we put on the printed version of each TotT episode, which you can see by looking for the “printer-friendly version” link in the <a href="http://googletesting.blogspot.com/search/label/TotT">TotT blog posts</a>). <br /><br />Congratulations to the winners! <br /><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/the-first-annual-testing-on-the-toilet-awards/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Testing on the Toilet: Change-Detector Tests Considered Harmful</title>
		<link>https://googledata.org/google-testing/testing-on-the-toilet-change-detector-tests-considered-harmful/</link>
		<comments>https://googledata.org/google-testing/testing-on-the-toilet-change-detector-tests-considered-harmful/#comments</comments>
		<pubDate>Wed, 28 Jan 2015 00:42:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=9a94142ae086c345610f7f0f0dec7487</guid>
		<description><![CDATA[<i>by Alex Eagle <br /><br />This article was adapted from a <a href="http://googletesting.blogspot.com/2007/01/introducing-testing-on-toilet.html">Google Testing on the Toilet</a> (TotT) episode. You can download a <a href="https://docs.google.com/document/d/13k8AsgYdF-2TJx9QIHHvPiuV3JMbsrMLkWmmjdYBLVg/edit?usp=sharing">printer-friendly version</a> of this TotT episode and post it in your office. </i><br /><br />You have just finished refactoring some code without modifying its behavior. Then you run the tests before committing and&#8230; a bunch of unit tests are failing. <span><b>While fixing the tests, you get a sense that you are wasting time by mechanically applying the same transformation to many tests.</b></span> Maybe you introduced a parameter in a method, and now must update 100 callers of that method in tests to pass an empty string. <br /><br /><span><b>What does it look like to write tests mechanically?</b></span> Here is an absurd but obvious way:<br /><pre><span>// Production code:</span><br />def abs(i: Int)<br />  <span>return</span> (i &#60; 0) ? i * -1 : i<br /><br /><span>// Test code:</span><br /><span>for</span> (line: String in File(prod_source).read_lines())<br />  <span>switch</span> (line.number)<br />    1: assert line.content equals def abs(i: Int)<br />    2: assert line.content equals   <span>return</span> (i &#60; 0) ? i * -1 : i</pre><br /><b><span>That test is clearly not useful</span></b>: it contains an exact copy of the code under test and acts like a checksum. <b><span>A correct or incorrect program is equally likely to pass</span></b> a test that is a derivative of the code under test. 
No one is really writing tests like that, but how different is it from this next example?<br /><pre><span>// Production code:</span><br />def process(w: Work)<br />  firstPart.process(w)<br />  secondPart.process(w)<br /><br /><span>// Test code:</span><br />part1 = mock(FirstPart)<br />part2 = mock(SecondPart)<br />w = Work()<br />Processor(part1, part2).process(w)<br />verify_in_order<br />  was_called part1.process(w)<br />  was_called part2.process(w)</pre><br />It is tempting to write a test like this because it requires little thought and will run quickly. <b><span>This is a change-detector test</span></b>&#8212;it is a transformation of the same information in the code under test&#8212;and <b><span>it breaks in response to any change to the production code, without verifying correct behavior</span></b> of either the original or modified production code. <br /><br /><b><span>Change detectors provide negative value</span></b>, since the tests do not catch any defects, and the added maintenance cost slows down development. These tests should be re-written or deleted. <br /><br />]]></description>
				<content:encoded><![CDATA[<i>by Alex Eagle <br /><br />This article was adapted from a  <a href="http://googletesting.blogspot.com/2007/01/introducing-testing-on-toilet.html">Google Testing on the Toilet</a> (TotT) episode. You can download a  <a href="https://docs.google.com/document/d/13k8AsgYdF-2TJx9QIHHvPiuV3JMbsrMLkWmmjdYBLVg/edit?usp=sharing">printer-friendly version</a> of this TotT episode and post it in your office. </i><br /><br />You have just finished refactoring some code without modifying its behavior. Then you run the tests before committing and… a bunch of unit tests are failing. <span style="color: purple;"><b>While fixing the tests, you get a sense that you are wasting time by mechanically applying the same transformation to many tests.</b></span> Maybe you introduced a parameter in a method, and now must update 100 callers of that method in tests to pass an empty string. <br /><br /><span style="color: purple;"><b>What does it look like to write tests mechanically?</b></span> Here is an absurd but obvious way:<br /><pre style="background: #cfe2f3; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;"><span style="color: #4f8f00;">// Production code:</span><br />def abs(i: Int)<br />  <span style="color: #0433ff;">return</span> (i &lt; 0) ? i * -1 : i<br /><br /><span style="color: #4f8f00;">// Test code:</span><br /><span style="color: #0433ff;">for</span> (line: String in File(prod_source).read_lines())<br />  <span style="color: #0433ff;">switch</span> (line.number)<br />    1: assert line.content equals def abs(i: Int)<br />    2: assert line.content equals   <span style="color: #0433ff;">return</span> (i &lt; 0) ? i * -1 : i</pre><br /><b><span style="color: purple;">That test is clearly not useful</span></b>: it contains an exact copy of the code under test and acts like a checksum. 
<b><span style="color: purple;">A correct or incorrect program is equally likely to pass</span></b> a test that is a derivative of the code under test. No one is really writing tests like that, but how different is it from this next example?<br /><pre style="background: #cfe2f3; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;"><span style="color: #4f8f00;">// Production code:</span><br />def process(w: Work)<br />  firstPart.process(w)<br />  secondPart.process(w)<br /><br /><span style="color: #4f8f00;">// Test code:</span><br />part1 = mock(FirstPart)<br />part2 = mock(SecondPart)<br />w = Work()<br />Processor(part1, part2).process(w)<br />verify_in_order<br />  was_called part1.process(w)<br />  was_called part2.process(w)</pre><br />It is tempting to write a test like this because it requires little thought and will run quickly. <b><span style="color: purple;">This is a change-detector test</span></b>—it is a transformation of the same information in the code under test—and <b><span style="color: purple;">it breaks in response to any change to the production code, without verifying correct behavior</span></b> of either the original or modified production code. <br /><br /><b><span style="color: purple;">Change detectors provide negative value</span></b>, since the tests do not catch any defects, and the added maintenance cost slows down development. These tests should be re-written or deleted. <br /><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/testing-on-the-toilet-change-detector-tests-considered-harmful/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Testing on the Toilet: Prefer Testing Public APIs Over Implementation-Detail Classes</title>
		<link>https://googledata.org/google-testing/testing-on-the-toilet-prefer-testing-public-apis-over-implementation-detail-classes/</link>
		<comments>https://googledata.org/google-testing/testing-on-the-toilet-prefer-testing-public-apis-over-implementation-detail-classes/#comments</comments>
		<pubDate>Wed, 14 Jan 2015 17:30:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=b1e687bba17240a35d8c00fa75f5434d</guid>
		<description><![CDATA[<i>by Andrew Trenk <br /><br />This article was adapted from a <a href="http://googletesting.blogspot.com/2007/01/introducing-testing-on-toilet.html">Google Testing on the Toilet</a> (TotT) episode. You can download a <a href="https://docs.google.com/document/d/1ILEpaxZ9ntMEXNQuL-q4jWHF0QOAMo45elmomn29cZg/edit?usp=sharing">printer-friendly version</a> of this TotT episode and post it in your office. </i><br /><br /><span><b>Does this class need to have tests?  </b></span><br /><pre><span>class</span> UserInfoValidator {<br />  <span>public</span> <span>void</span> validate(UserInfo info) {<br />    <span>if</span> (info.getDateOfBirth().isInFuture()) { <span>throw</span> <span>new</span> ValidationException(); }<br />  }<br />}</pre>Its method has some logic, so it may be a good idea to test it. But <span><b>what if its only user looks like this?  </b></span><br /><pre><span>public</span> <span>class</span> UserInfoService {<br />  <span>private</span> UserInfoValidator validator;<br />  <span>public</span> <span>void</span> save(UserInfo info) {<br />    validator.validate(info); <span>// Throw an exception if the value is invalid.</span><br />    writeToDatabase(info);<br />  }<br />}</pre>The answer is: <span><b>it probably doesn&#8217;t need tests</b></span>, since all paths can be tested through <span>UserInfoService</span>. The key distinction is that <span><b>the class is an implementation detail, not a public API</b></span>.<br /><br /><b><span>A public API can be called by any number of users</span></b>, who can pass in any possible combination of inputs to its methods. <b><span>You want to make sure these are well-tested</span></b>, which ensures users won&#8217;t see issues when they use the API. Examples of public APIs include classes that are used in a different part of a codebase (e.g., a server-side class that&#8217;s used by the client-side) and common utility classes that are used throughout a codebase. 
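To make the episode's point concrete, here is a runnable sketch that exercises UserInfoValidator only through UserInfoService's public save method (the UserInfo shape and the database write are stand-ins for the elided details):

```java
import java.time.LocalDate;

class ValidationException extends RuntimeException {}

// Stand-in for the episode's UserInfo, carrying only a date of birth.
class UserInfo {
    final LocalDate dateOfBirth;
    UserInfo(LocalDate dateOfBirth) { this.dateOfBirth = dateOfBirth; }
    boolean isDateOfBirthInFuture() { return dateOfBirth.isAfter(LocalDate.now()); }
}

class UserInfoValidator {
    void validate(UserInfo info) {
        if (info.isDateOfBirthInFuture()) { throw new ValidationException(); }
    }
}

class UserInfoService {
    private final UserInfoValidator validator = new UserInfoValidator();
    boolean saved = false; // stand-in for writeToDatabase(info)

    void save(UserInfo info) {
        validator.validate(info); // throws if the value is invalid
        saved = true;
    }
}

// Both validator paths are covered without testing the validator directly.
class UserInfoServiceTest {
    static void run() {
        UserInfoService service = new UserInfoService();
        service.save(new UserInfo(LocalDate.of(1990, 1, 1)));
        if (!service.saved) throw new AssertionError("valid user should be saved");

        UserInfoService rejecting = new UserInfoService();
        try {
            rejecting.save(new UserInfo(LocalDate.now().plusYears(1)));
            throw new AssertionError("future date of birth should be rejected");
        } catch (ValidationException expected) {
            // validation path exercised via the public API
        }
    }

    public static void main(String[] args) { run(); }
}
```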
<br /><br /><b><span>An implementation-detail class exists only to support public APIs</span></b> and is called by a very limited number of users (often only one). <b><span>These classes can sometimes be tested indirectly</span></b> by testing the public APIs that use them.  <br /><br /><b><span>Testing implementation-detail classes is still useful in many cases</span></b>, such as if the class is complex or if the tests would be difficult to write for the public API class. When you do test them, <b><span>they often don&#8217;t need to be tested in as much depth as a public API,</span></b> since some inputs may never be passed into their methods (in the above code sample, if <span>UserInfoService</span> ensured that <span>UserInfo</span> were never <span>null</span>, then it wouldn&#8217;t be useful to test what happens when <span>null</span> is passed as an argument to <span>UserInfoValidator.validate</span>, since it would never happen). <br /><br />Implementation-detail classes can sometimes be thought of as private methods that happen to be in a separate class, since you typically don&#8217;t want to test private methods directly either. <b><span>You should also try to restrict the visibility of implementation-detail classes,</span></b> such as by making them package-private in Java. <br /><br /><b><span>Testing implementation-detail classes too often leads to a couple problems</span></b>: <br /><br />- <b><span>Code is harder to maintain since you need to update tests more often</span></b>, such as when changing a method signature of an implementation-detail class or even when doing a refactoring. If testing is done only through public APIs, these changes wouldn&#8217;t affect the tests at all. <br /><br />- <b><span>If you test a behavior only through an implementation-detail class, you may get false confidence in your code</span></b>, since the same code path may not work properly when exercised through the public API. 
<b><span>You also have to be more careful when refactoring</span></b>, since it can be harder to ensure that all the behavior of the public API will be preserved if not all paths are tested through the public API.]]></description>
				<content:encoded><![CDATA[<i>by Andrew Trenk <br /><br />This article was adapted from a  <a href="http://googletesting.blogspot.com/2007/01/introducing-testing-on-toilet.html">Google Testing on the Toilet</a> (TotT) episode. You can download a  <a href="https://docs.google.com/document/d/1ILEpaxZ9ntMEXNQuL-q4jWHF0QOAMo45elmomn29cZg/edit?usp=sharing">printer-friendly version</a> of this TotT episode and post it in your office. </i><br /><br /><span style="color: purple;"><b>Does this class need to have tests?  </b></span><br /><pre style="background: #d9ead3; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;"><span style="color: #945200;">class</span> UserInfoValidator {<br />  <span style="color: #945200;">public</span> <span style="color: #945200;">void</span> validate(UserInfo info) {<br />    <span style="color: #945200;">if</span> (info.getDateOfBirth().isInFuture()) { <span style="color: #945200;">throw</span> <span style="color: #945200;">new</span> ValidationException(); }<br />  }<br />}</pre>Its method has some logic, so it may be a good idea to test it. But <span style="color: purple;"><b>what if its only user looks like this?  
</b></span><br /><pre style="background: #d9ead3; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;"><span style="color: #945200;">public</span> <span style="color: #945200;">class</span> UserInfoService {<br />  <span style="color: #945200;">private</span> UserInfoValidator validator;<br />  <span style="color: #945200;">public</span> <span style="color: #945200;">void</span> save(UserInfo info) {<br />    validator.validate(info); <span style="color: #0096ff;">// Throw an exception if the value is invalid.</span><br />    writeToDatabase(info);   <br />  }<br />}</pre>The answer is: <span style="color: purple;"><b>it probably doesn’t need tests</b></span>, since all paths can be tested through <span style="font-family: Courier New, Courier, monospace;">UserInfoService</span>. The key distinction is that <span style="color: purple;"><b>the class is an implementation detail, not a public API</b></span>.<br /><br /><b><span style="color: purple;">A public API can be called by any number of users</span></b>, who can pass in any possible combination of inputs to its methods. <b><span style="color: purple;">You want to make sure these are well-tested</span></b>, which ensures users won’t see issues when they use the API. Examples of public APIs include classes that are used in a different part of a codebase (e.g., a server-side class that’s used by the client-side) and common utility classes that are used throughout a codebase. <br /><br /><b><span style="color: purple;">An implementation-detail class exists only to support public APIs</span></b> and is called by a very limited number of users (often only one). <b><span style="color: purple;">These classes can sometimes be tested indirectly</span></b> by testing the public APIs that use them.  
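<br /><br />As a rough sketch of what that indirect testing looks like (using minimal stand-in classes and plain Java assertions rather than the article's actual code or any particular test framework):

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

class ValidationException extends RuntimeException {}

class UserInfo {
  private final LocalDate dateOfBirth;
  UserInfo(LocalDate dateOfBirth) { this.dateOfBirth = dateOfBirth; }
  boolean dateOfBirthIsInFuture() { return dateOfBirth.isAfter(LocalDate.now()); }
}

// Implementation detail: exercised only through UserInfoService.
class UserInfoValidator {
  void validate(UserInfo info) {
    if (info.dateOfBirthIsInFuture()) { throw new ValidationException(); }
  }
}

// Public API: every validation path is reachable through save().
class UserInfoService {
  private final UserInfoValidator validator = new UserInfoValidator();
  private final List<UserInfo> database = new ArrayList<>();
  void save(UserInfo info) {
    validator.validate(info); // throws if the value is invalid
    database.add(info);
  }
  int savedCount() { return database.size(); }
}

public class Main {
  public static void main(String[] args) {
    UserInfoService service = new UserInfoService();

    // A valid UserInfo is saved.
    service.save(new UserInfo(LocalDate.of(1990, 1, 1)));

    // An invalid UserInfo is rejected before it reaches the database.
    boolean rejected = false;
    try {
      service.save(new UserInfo(LocalDate.now().plusDays(1)));
    } catch (ValidationException e) {
      rejected = true;
    }

    System.out.println("saved=" + service.savedCount() + " rejected=" + rejected);
  }
}
```

Both validation outcomes are covered here without the test ever referring to UserInfoValidator directly.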
<br /><br /><b><span style="color: purple;">Testing implementation-detail classes is still useful in many cases</span></b>, such as if the class is complex or if the tests would be difficult to write for the public API class. When you do test them, <b><span style="color: purple;">they often don’t need to be tested in as much depth as a public API,</span></b> since some inputs may never be passed into their methods (in the above code sample, if <span style="font-family: Courier New, Courier, monospace;">UserInfoService</span> ensured that <span style="font-family: Courier New, Courier, monospace;">UserInfo</span> were never <span style="font-family: Courier New, Courier, monospace;">null</span>, then it wouldn’t be useful to test what happens when <span style="font-family: Courier New, Courier, monospace;">null</span> is passed as an argument to <span style="font-family: Courier New, Courier, monospace;">UserInfoValidator.validate</span>, since it would never happen). <br /><br />Implementation-detail classes can sometimes be thought of as private methods that happen to be in a separate class, since you typically don’t want to test private methods directly either. <b><span style="color: purple;">You should also try to restrict the visibility of implementation-detail classes,</span></b> such as by making them package-private in Java. <br /><br /><b><span style="color: purple;">Testing implementation-detail classes too often leads to a couple problems</span></b>: <br /><br />- <b><span style="color: purple;">Code is harder to maintain since you need to update tests more often</span></b>, such as when changing a method signature of an implementation-detail class or even when doing a refactoring. If testing is done only through public APIs, these changes wouldn’t affect the tests at all. 
<br /><br />- <b><span style="color: purple;">If you test a behavior only through an implementation-detail class, you may get false confidence in your code</span></b>, since the same code path may not work properly when exercised through the public API. <b><span style="color: purple;">You also have to be more careful when refactoring</span></b>, since it can be harder to ensure that all the behavior of the public API will be preserved if not all paths are tested through the public API.]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/testing-on-the-toilet-prefer-testing-public-apis-over-implementation-detail-classes/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Testing on the Toilet: Truth: a fluent assertion framework</title>
		<link>https://googledata.org/google-testing/testing-on-the-toilet-truth-a-fluent-assertion-framework/</link>
		<comments>https://googledata.org/google-testing/testing-on-the-toilet-truth-a-fluent-assertion-framework/#comments</comments>
		<pubDate>Fri, 19 Dec 2014 18:25:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=fbc1c4a27d93a13c37a996869d86021d</guid>
		<description><![CDATA[by Dori Reuveni and Kurt Alfred Kluever This article was adapted from a  Google Testing on the Toilet (TotT) episode. You can download a  printer-friendly version of this TotT episode and post it in your office. As engineers, we spend most of our time ...]]></description>
				<content:encoded><![CDATA[<i>by Dori Reuveni and Kurt Alfred Kluever <br /><br />This article was adapted from a  <a href="http://googletesting.blogspot.com/2007/01/introducing-testing-on-toilet.html">Google Testing on the Toilet</a> (TotT) episode. You can download a  <a href="https://docs.google.com/document/d/1TyLthjSGeELWVloxx2hI9l57pNRSN4UVAaNdB7H4Ctg/edit?usp=sharing">printer-friendly version</a> of this TotT episode and post it in your office. </i><br /><br /><b><span style="color: purple;">As engineers, we spend most of our time reading existing code, rather than writing new code.</span></b> Therefore, we must make sure we always write clean, readable code. The same goes for our tests; we need a way to clearly express our test assertions.<br /><br /><b><span style="color: purple;">Truth is an open source, fluent testing framework for Java designed to make your test assertions and failure messages more readable.</span></b> The fluent API makes reading (and writing) test assertions much more natural, prose-like, and discoverable in your IDE via autocomplete. For example, compare how the following assertion reads with JUnit vs. Truth:<br /><pre style="background: #cfe2f3; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;"><b>assertEquals</b>(<span style="color: #ff2600;">"March"</span>, monthMap.get(3));          <span style="color: #4f8f00;">// JUnit</span><br /><b>assertThat</b>(monthMap).<b>containsEntry</b>(3, <span style="color: #ff2600;">"March"</span>);  <span style="color: #4f8f00;">// Truth</span></pre>Both statements are asserting the same thing, but the assertion written with Truth can be easily read from left to right, while the JUnit example requires "mental backtracking".<br /><br /><b><span style="color: purple;">Another benefit of Truth over JUnit is the addition of useful default failure messages</span></b>. 
For example:<br /><pre style="background: #cfe2f3; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;">ImmutableSet&lt;String&gt; colors = ImmutableSet.of(<span style="color: #ff2600;">"red"</span>, <span style="color: #ff2600;">"green"</span>, <span style="color: #ff2600;">"blue"</span>, <span style="color: #ff2600;">"yellow"</span>);<br /><b>assertTrue</b>(colors.contains(<span style="color: #ff2600;">"orange"</span>));  <span style="color: #4f8f00;">// JUnit</span><br /><b>assertThat</b>(colors).<b>contains</b>(<span style="color: #ff2600;">"orange"</span>);  <span style="color: #4f8f00;">// Truth</span></pre>In this example, both assertions will fail, but <b><span style="color: purple;">JUnit will not provide a useful failure message</span></b>. However, <span style="color: purple;"><b>Truth will provide a clear and concise failure message</b></span>:<br /><br /><span style="font-family: Courier New, Courier, monospace;">AssertionError: &lt;[red, green, blue, yellow]&gt; should have contained &lt;orange&gt;</span><br /><br /><b><span style="color: purple;">Truth already supports specialized assertions for most of the common JDK types</span></b> (Objects, primitives, arrays, Strings, Classes, Comparables, Iterables, Collections, Lists, Sets, Maps, etc.), as well as some Guava types (Optionals). Additional support for other popular types is planned as well (Throwables, Iterators, Multimaps, UnsignedIntegers, UnsignedLongs, etc.). <br /><br /><b><span style="color: purple;">Truth is also user-extensible: you can easily write a Truth subject to make fluent assertions about your own custom types</span></b>. By creating your own custom subject, both your assertion API and your failure messages can be domain-specific.  <br /><br /><b><span style="color: purple;">Truth's goal is not to replace JUnit assertions, but to improve the readability of complex assertions and their failure messages</span></b>. 
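<br /><br />As a loose illustration of the custom-subject idea (a deliberately simplified sketch, not Truth's real implementation), a fluent subject just wraps the value under test and exposes readable check methods that build descriptive failure messages:

```java
import java.util.Arrays;
import java.util.List;

// Minimal fluent "subject" for lists: an illustration only, not Truth's API.
class IterableSubject {
  private final List<?> actual;
  private IterableSubject(List<?> actual) { this.actual = actual; }

  static IterableSubject assertThat(List<?> actual) {
    return new IterableSubject(actual);
  }

  void contains(Object element) {
    if (!actual.contains(element)) {
      throw new AssertionError(
          "<" + actual + "> should have contained <" + element + ">");
    }
  }
}

public class Main {
  public static void main(String[] args) {
    List<String> colors = Arrays.asList("red", "green", "blue", "yellow");
    IterableSubject.assertThat(colors).contains("red"); // passes silently
    try {
      IterableSubject.assertThat(colors).contains("orange");
    } catch (AssertionError e) {
      System.out.println(e.getMessage()); // descriptive failure message
    }
  }
}
```

With a static import of assertThat, the call sites read exactly like the fluent assertions above.<br /><br />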
JUnit assertions and Truth assertions can (and often do) live side by side in tests. <br /><br /><b><span style="color: purple;">To get started with Truth, check out</span></b> <a href="http://google.github.io/truth/">http://google.github.io/truth/</a><br /><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/testing-on-the-toilet-truth-a-fluent-assertion-framework/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>GTAC 2014 Wrap-up</title>
		<link>https://googledata.org/google-testing/gtac-2014-wrap-up/</link>
		<comments>https://googledata.org/google-testing/gtac-2014-wrap-up/#comments</comments>
		<pubDate>Fri, 05 Dec 2014 00:24:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=096b91c0a12ec3b39172479b1d16ce8c</guid>
		<description><![CDATA[by Anthony Vallone on behalf of the GTAC Committee On October 28th and 29th, GTAC 2014, the eighth GTAC (Google Test Automation Conference), was held at the beautiful Google Kirkland office. The conference was completely packed with presenters and atte...]]></description>
				<content:encoded><![CDATA[<i>by Anthony Vallone on behalf of the GTAC Committee </i><br /><br />On October 28th and 29th, <a href="https://developers.google.com/google-test-automation-conference/2014/">GTAC 2014</a>, the eighth GTAC (Google Test Automation Conference), was held at the beautiful <a href="http://www.google.com/about/careers/locations/seattle-kirkland/">Google Kirkland office</a>. The conference was completely packed with presenters and attendees from all over the world (<i>Argentina, Australia, Canada, China, many European countries, India, Israel, Korea, New Zealand, Puerto Rico, Russia, Taiwan, and many US states</i>), bringing with them a huge diversity of experiences.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-AsIoO6UVjSg/VID5DtgkaeI/AAAAAAAAAHM/VJMsLWzhKZw/s1600/gtac-reception.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-AsIoO6UVjSg/VID5DtgkaeI/AAAAAAAAAHM/VJMsLWzhKZw/s1600/gtac-reception.jpg" height="266" width="400" /></a></div><br />Speakers from numerous companies and universities (<i>Adobe, American Express, Comcast, Dropbox, Facebook, FINRA, Google, HP, Medidata Solutions, Mozilla, Netflix, Orange, and University of Waterloo</i>) spoke on a variety of interesting and cutting edge test automation topics.<br /><br />All of the <a href="https://developers.google.com/google-test-automation-conference/2014/presentations">slides and video recordings</a> are now available on the GTAC site. 
Photos will be available soon as well.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-vFVJPsUDf5w/VID5J4eHzRI/AAAAAAAAAHU/td0jYOzgLlM/s1600/gtac-crowd.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-vFVJPsUDf5w/VID5J4eHzRI/AAAAAAAAAHU/td0jYOzgLlM/s1600/gtac-crowd.jpg" height="266" width="400" /></a></div><br />This was our most popular GTAC to date, with over 1,500 applicants and almost 200 of those for speaking. About 250 people filled our venue to capacity, and the live stream had a peak of about 400 concurrent viewers with 4,700 playbacks during the event. And, there was plenty of interesting <a href="https://twitter.com/search?q=%23gtac2014">Twitter</a> and <a href="https://plus.google.com/u/0/explore/gtac2014">Google+</a> activity during the event.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-99JJwZlZacE/VID5LooJXyI/AAAAAAAAAHc/u5QJvNmlTiw/s1600/gtac-eat.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-99JJwZlZacE/VID5LooJXyI/AAAAAAAAAHc/u5QJvNmlTiw/s1600/gtac-eat.jpg" height="266" width="400" /></a></div><br />Our goal in hosting GTAC is to make the conference highly relevant and useful for, not only attendees, but the larger test engineering community as a whole. 
Our post-conference survey shows that we are close to achieving that goal: <br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-9B3NGNXAyyI/VID5NcheskI/AAAAAAAAAHk/0WPeJvqiPKU/s1600/gtac-stats.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-9B3NGNXAyyI/VID5NcheskI/AAAAAAAAAHk/0WPeJvqiPKU/s1600/gtac-stats.png" height="154" width="640" /></a></div><br /><br />If you have any suggestions on how we can improve, please comment on this post. <br /><br />Thank you to all the <a href="https://developers.google.com/google-test-automation-conference/2014/speakers">speakers</a>, attendees, and online viewers who made this a special event once again. To receive announcements about the next GTAC, subscribe to the <a href="http://googletesting.blogspot.com/">Google Testing Blog</a>. <br /><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/gtac-2014-wrap-up/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Protractor: Angular testing made easy</title>
		<link>https://googledata.org/google-testing/protractor-angular-testing-made-easy/</link>
		<comments>https://googledata.org/google-testing/protractor-angular-testing-made-easy/#comments</comments>
		<pubDate>Sun, 30 Nov 2014 17:46:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=cbd3bf2958b53751e8878580c24ec4bc</guid>
		<description><![CDATA[<i>By Hank Duan, Julie Ralph, and Arif Sukoco in Seattle </i><br /><br />Have you worked with WebDriver but been frustrated with all the waits needed for WebDriver to sync with the website, causing flakes and prolonged test times? If you are working with AngularJS apps, then Protractor is the right tool for you. <br /><br />Protractor (<a href="http://protractortest.org/">protractortest.org</a>) is an end-to-end test framework specifically for AngularJS apps. It was built by a team in Google and released to open source. Protractor is built on top of WebDriverJS and includes important improvements tailored for AngularJS apps. Here are some of Protractor&#8217;s key benefits:<br /><br /><ul><li><b>You don&#8217;t need to add waits or sleeps to your test</b>. Protractor can communicate with your AngularJS app automatically and execute the next step in your test the moment the webpage finishes pending tasks, so you don&#8217;t have to worry about waiting for your test and webpage to sync.&#160;</li><li><b>It supports Angular-specific locator strategies</b> (e.g., binding, model, repeater) as well as native WebDriver locator strategies (e.g., ID, CSS selector, XPath). This allows you to test Angular-specific elements without any setup effort on your part.&#160;</li><li><b>It is easy to set up page objects</b>. Protractor does not execute WebDriver commands until an action is needed (e.g., get, sendKeys, click). 
This way you can set up page objects so tests can manipulate page elements without touching the HTML.&#160;</li><li><b>It uses Jasmine</b>, the framework you use to write AngularJS unit tests, <b>and Javascript</b>, the same language you use to write AngularJS apps.</li></ul><br />Follow these simple steps, and in minutes, you will have your first Protractor test running:  <br /><br /><b>1) Set up environment </b><br /><br />Install the command line tools &#8216;protractor&#8217; and &#8216;webdriver-manager&#8217; using npm:<br /><pre>npm install -g protractor</pre><br />Start up an instance of a selenium server:<br /><pre>webdriver-manager update &#38; webdriver-manager start</pre><br />This downloads the necessary binary, and starts a new webdriver session listening on <a href="http://localhost:4444/">http://localhost:4444</a>. <br /><br /><b>2) Write your test</b><br /><pre><span>// It is a good idea to use page objects to modularize your testing logic</span><br />var angularHomepage = {<br />  nameInput : element(by.model(<span>'yourName'</span>)),<br />  greeting : element(by.binding(<span>'yourName'</span>)),<br />  get : function() {<br />    browser.get(<span>'index.html'</span>);<br />  },<br />  setName : function(name) {<br />    <span>this</span>.nameInput.sendKeys(name);<br />  }<br />};<br /><br /><span>// Here we are using the Jasmine test framework </span><br /><span>// See http://jasmine.github.io/2.0/introduction.html for more details</span><br />describe(<span>'angularjs homepage'</span>, function() {<br />  it(<span>'should greet the named user'</span>, function(){<br />    angularHomepage.get();<br />    angularHomepage.setName(<span>'Julie'</span>);<br />    expect(angularHomepage.greeting.getText()).<br />        toEqual(<span>'Hello Julie!'</span>);<br />  });<br />});</pre><br />3) Write a Protractor configuration file to specify the environment under which you want your test to run:<br /><pre>exports.config = {<br />  seleniumAddress: 
<span>'http://localhost:4444/wd/hub'</span>,<br /><br />  specs: [<span>'testFolder/*'</span>],<br /><br />  multiCapabilities: [{<br /><span>'browserName'</span>: <span>'chrome'</span>,<br /><span>// browser-specific tests</span><br />    specs: <span>'chromeTests/*'</span> <br />  }, {<br /><span>'browserName'</span>: <span>'firefox'</span>,<br /><span>// run tests in parallel</span><br />    shardTestFiles: <span>true</span> <br />  }],<br /><br />  baseUrl: <span>'http://www.angularjs.org'</span>,<br />};</pre><br />4) Run the test: <br /><br />Start the test with the command: <br /><pre>protractor conf.js</pre><br />The test output should be: <br /><pre>1 test, 1 assertions, 0 failures</pre><br /><br />If you want to learn more, here&#8217;s a full tutorial that highlights all of Protractor&#8217;s features: <a href="http://angular.github.io/protractor/#/tutorial">http://angular.github.io/protractor/#/tutorial </a><br /><br />]]></description>
				<content:encoded><![CDATA[<i>By Hank Duan, Julie Ralph, and Arif Sukoco in Seattle </i><br /><br />Have you worked with WebDriver but been frustrated with all the waits needed for WebDriver to sync with the website, causing flakes and prolonged test times? If you are working with AngularJS apps, then Protractor is the right tool for you. <br /><br />Protractor (<a href="http://protractortest.org/">protractortest.org</a>) is an end-to-end test framework specifically for AngularJS apps. It was built by a team in Google and released to open source. Protractor is built on top of WebDriverJS and includes important improvements tailored for AngularJS apps. Here are some of Protractor’s key benefits:<br /><br /><ul style="line-height: 1.3em; margin-bottom: 0px; margin-top: 0px; padding-bottom: 0px; padding-top: 0px;"><li><b>You don’t need to add waits or sleeps to your test</b>. Protractor can communicate with your AngularJS app automatically and execute the next step in your test the moment the webpage finishes pending tasks, so you don’t have to worry about waiting for your test and webpage to sync.&nbsp;</li><li><b>It supports Angular-specific locator strategies</b> (e.g., binding, model, repeater) as well as native WebDriver locator strategies (e.g., ID, CSS selector, XPath). This allows you to test Angular-specific elements without any setup effort on your part.&nbsp;</li><li><b>It is easy to set up page objects</b>. Protractor does not execute WebDriver commands until an action is needed (e.g., get, sendKeys, click). 
This way you can set up page objects so tests can manipulate page elements without touching the HTML.&nbsp;</li><li><b>It uses Jasmine</b>, the framework you use to write AngularJS unit tests, <b>and Javascript</b>, the same language you use to write AngularJS apps.</li></ul><br />Follow these simple steps, and in minutes, you will have your first Protractor test running:  <br /><br /><b>1) Set up environment </b><br /><br />Install the command line tools ‘protractor’ and ‘webdriver-manager’ using npm:<br /><pre style="background: #f0f0f0; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;">npm install -g protractor</pre><br />Start up an instance of a selenium server:<br /><pre style="background: #f0f0f0; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;">webdriver-manager update &amp; webdriver-manager start</pre><br />This downloads the necessary binary, and starts a new webdriver session listening on <a href="http://localhost:4444/">http://localhost:4444</a>. 
<br /><br /><b>2) Write your test</b><br /><pre style="background: #f0f0f0; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;"><span style="color: #4f8f00;">// It is a good idea to use page objects to modularize your testing logic</span><br />var angularHomepage = {<br />  nameInput : element(by.model(<span style="color: #ff2600;">'yourName'</span>)),<br />  greeting : element(by.binding(<span style="color: #ff2600;">'yourName'</span>)),<br />  get : function() {<br />    browser.get(<span style="color: #ff2600;">'index.html'</span>);<br />  },<br />  setName : function(name) {<br />    <span style="color: #0433ff;">this</span>.nameInput.sendKeys(name);<br />  }<br />};<br /><br /><span style="color: #4f8f00;">// Here we are using the Jasmine test framework </span><br /><span style="color: #4f8f00;">// See http://jasmine.github.io/2.0/introduction.html for more details</span><br />describe(<span style="color: #ff2600;">'angularjs homepage'</span>, function() {<br />  it(<span style="color: #ff2600;">'should greet the named user'</span>, function(){<br />    angularHomepage.get();<br />    angularHomepage.setName(<span style="color: #ff2600;">'Julie'</span>);<br />    expect(angularHomepage.greeting.getText()).<br />        toEqual(<span style="color: #ff2600;">'Hello Julie!'</span>);<br />  });<br />});</pre><br />3) Write a Protractor configuration file to specify the environment under which you want your test to run:<br /><pre style="background: #f0f0f0; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;">exports.config = {<br />  seleniumAddress: <span style="color: #ff2600;">'http://localhost:4444/wd/hub'</span>,<br />  <br />  specs: [<span style="color: #ff2600;">'testFolder/*'</span>],<br /><br />  multiCapabilities: [{<br />    <span style="color: #ff2600;">'browserName'</span>: <span style="color: #ff2600;">'chrome'</span>,<br />    <span 
style="color: #4f8f00;">// browser-specific tests</span><br />    specs: <span style="color: #ff2600;">'chromeTests/*'</span> <br />  }, {<br />    <span style="color: #ff2600;">'browserName'</span>: <span style="color: #ff2600;">'firefox'</span>,<br />    <span style="color: #4f8f00;">// run tests in parallel</span><br />    shardTestFiles: <span style="color: #0433ff;">true</span> <br />  }],<br /><br />  baseUrl: <span style="color: #ff2600;">'http://www.angularjs.org'</span>,<br />};</pre><br />4) Run the test: <br /><br />Start the test with the command: <br /><pre style="background: #f0f0f0; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;">protractor conf.js</pre><br />The test output should be: <br /><pre style="background: #f0f0f0; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;">1 test, 1 assertions, 0 failures</pre><br /><br />If you want to learn more, here’s a full tutorial that highlights all of Protractor’s features: <a href="http://angular.github.io/protractor/#/tutorial">http://angular.github.io/protractor/#/tutorial </a><br /><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/protractor-angular-testing-made-easy/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>GTAC 2014 is this Week!</title>
		<link>https://googledata.org/google-testing/gtac-2014-is-this-week/</link>
		<comments>https://googledata.org/google-testing/gtac-2014-is-this-week/#comments</comments>
		<pubDate>Mon, 27 Oct 2014 18:26:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=b3c5e24cfb9f2949e703eed2ec89d3fa</guid>
		<description><![CDATA[by Anthony Vallone on behalf of the GTAC Committee  The eighth GTAC commences on Tuesday at the Google Kirkland office. You can find the latest details on the conference at our site, including speaker profiles. If you are watching remotely, we'll soon ...]]></description>
				<content:encoded><![CDATA[<i>by Anthony Vallone on behalf of the GTAC Committee  </i><br /><br />The eighth GTAC commences on Tuesday at the Google Kirkland office. You can find the latest details on the conference at our site, including <a href="https://developers.google.com/google-test-automation-conference/2014/speakers">speaker profiles</a>. <br /><br />If you are watching remotely, we'll soon be updating the <a href="https://developers.google.com/google-test-automation-conference/2014/stream">live stream page</a> with the stream link and a Google Moderator link for remote Q&amp;A. <br /><br />If you have been selected to attend or speak, be sure to note the <a href="https://developers.google.com/google-test-automation-conference/2014/travel">updated parking information</a>. Google visitors will use off-site parking and shuttles. <br /><br />We look forward to connecting with the greater testing community and sharing new advances and ideas. <br /><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/gtac-2014-is-this-week/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Testing on the Toilet: Writing Descriptive Test Names</title>
		<link>https://googledata.org/google-testing/testing-on-the-toilet-writing-descriptive-test-names/</link>
		<comments>https://googledata.org/google-testing/testing-on-the-toilet-writing-descriptive-test-names/#comments</comments>
		<pubDate>Thu, 16 Oct 2014 20:39:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=0ba0096d1178bf588125ce40628a334a</guid>
		<description><![CDATA[<i>by Andrew Trenk </i><br /><br /><i>This article was adapted from a <a href="http://googletesting.blogspot.com/2007/01/introducing-testing-on-toilet.html">Google Testing on the Toilet</a> (TotT) episode. You can download a <a href="https://docs.google.com/document/d/1UzvzMUkv0mpIFjrLqTcqvHFC5_1ycOCmrMVt-ubR2ec/edit?usp=sharing">printer-friendly version</a> of this TotT episode and post it in your office. </i><br /><br /><b><span>How long does it take you to figure out what behavior is being tested in the following code? </span></b><br /><br /><pre>@Test <span>public</span> <span>void</span> <b>isUserLockedOut_invalidLogin</b>() {<br />  <b>authenticator</b>.authenticate(<b>username</b>, <b>invalidPassword</b>);<br />  assertFalse(<b>authenticator</b>.isUserLockedOut(<b>username</b>));<br /><br />  <b>authenticator</b>.authenticate(<b>username</b>, <b>invalidPassword</b>);<br />  assertFalse(<b>authenticator</b>.isUserLockedOut(<b>username</b>));<br /><br />  <b>authenticator</b>.authenticate(<b>username</b>, <b>invalidPassword</b>);<br />  assertTrue(<b>authenticator</b>.isUserLockedOut(<b>username</b>));<br />}</pre><br /><b><span>You probably had to read through every line of code</span></b> (maybe more than once) and understand what each line is doing. But <b><span>how long would it take you to figure out what behavior is being tested if the test had this name?</span></b><br /><br /><b><span>isUserLockedOut_lockOutUserAfterThreeInvalidLoginAttempts</span></b><br /><b><br /></b><b><span>You should now be able to understand what behavior is being tested by reading just the test name</span></b>, and you don&#8217;t even need to read through the test body. The test name in the above code sample hints at the scenario being tested (&#8220;invalidLogin&#8221;), but it doesn&#8217;t actually say what the expected outcome is supposed to be, so you had to read through the code to figure it out. 
<br /><br /><b><span>Putting both the scenario and the expected outcome in the test name has several other benefits: </span></b><br /><br />- <b><span>If you want to know all the possible behaviors a class has, all you need to do is read through the test names in its test class</span></b>, compared to spending minutes or hours digging through the test code or even the class itself trying to figure out its behavior. This can also be useful during code reviews since you can quickly tell if the tests cover all expected cases. <br /><br />- <b><span>By giving tests more explicit names, it forces you to split up testing different behaviors into separate tests</span></b>. Otherwise you may be tempted to dump assertions for different behaviors into one test, which over time can lead to tests that keep growing and become difficult to understand and maintain. <br /><br />- <b><span>The exact behavior being tested might not always be clear from the test code</span></b>. If the test name isn&#8217;t explicit about this, sometimes you might have to guess what the test is actually testing. <br /><br />- <b><span>You can easily tell if some functionality isn&#8217;t being tested</span></b>. If you don&#8217;t see a test name that describes the behavior you&#8217;re looking for, then you know the test doesn&#8217;t exist.  <br /><br />- <b><span>When a test fails, you can immediately see what functionality is broken without looking at the test&#8217;s source code</span></b>. <br /><br />There are several common patterns for structuring the name of a test (one example is to name tests like an English sentence with &#8220;should&#8221; in the name, e.g., <span>shouldLockOutUserAfterThreeInvalidLoginAttempts</span>). Whichever pattern you use, the same advice still applies: <span><b>Make sure test names contain both the scenario being tested and the expected outcome</b></span>.  
<br /><br />Sometimes just specifying the name of the method under test may be enough, especially if the method is simple and has only a single behavior that is obvious from its name.  <br /><br />]]></description>
				<content:encoded><![CDATA[<i>by Andrew Trenk </i><br /><br /><i>This article was adapted from a <a href="http://googletesting.blogspot.com/2007/01/introducing-testing-on-toilet.html">Google Testing on the Toilet</a> (TotT) episode. You can download a <a href="https://docs.google.com/document/d/1UzvzMUkv0mpIFjrLqTcqvHFC5_1ycOCmrMVt-ubR2ec/edit?usp=sharing">printer-friendly version</a> of this TotT episode and post it in your office. </i><br /><br /><b><span style="color: #990000;">How long does it take you to figure out what behavior is being tested in the following code? </span></b><br /><br /><pre style="background: #fff2cc; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;">@Test <span style="color: #0433ff;">public</span> <span style="color: #0433ff;">void</span> <b>isUserLockedOut_invalidLogin</b>() {<br />  <b>authenticator</b>.authenticate(<b>username</b>, <b>invalidPassword</b>);<br />  assertFalse(<b>authenticator</b>.isUserLockedOut(<b>username</b>));<br /><br />  <b>authenticator</b>.authenticate(<b>username</b>, <b>invalidPassword</b>);<br />  assertFalse(<b>authenticator</b>.isUserLockedOut(<b>username</b>));<br /><br />  <b>authenticator</b>.authenticate(<b>username</b>, <b>invalidPassword</b>);<br />  assertTrue(<b>authenticator</b>.isUserLockedOut(<b>username</b>));<br />}</pre><br /><b><span style="color: #990000;">You probably had to read through every line of code</span></b> (maybe more than once) and understand what each line is doing. 
But <b><span style="color: #990000;">how long would it take you to figure out what behavior is being tested if the test had this name?</span></b><br /><br /><b><span style="font-family: Courier New, Courier, monospace;">isUserLockedOut_lockOutUserAfterThreeInvalidLoginAttempts</span></b><br /><b><br /></b><b><span style="color: #990000;">You should now be able to understand what behavior is being tested by reading just the test name</span></b>, and you don’t even need to read through the test body. The test name in the above code sample hints at the scenario being tested (“invalidLogin”), but it doesn’t actually say what the expected outcome is supposed to be, so you had to read through the code to figure it out. <br /><br /><b><span style="color: #990000;">Putting both the scenario and the expected outcome in the test name has several other benefits: </span></b><br /><br />- <b><span style="color: #990000;">If you want to know all the possible behaviors a class has, all you need to do is read through the test names in its test class</span></b>, instead of spending minutes or hours digging through the test code or even the class itself trying to figure out its behavior. This can also be useful during code reviews since you can quickly tell if the tests cover all expected cases. <br /><br />- <b><span style="color: #990000;">Giving tests more explicit names forces you to split the testing of different behaviors into separate tests</span></b>. Otherwise you may be tempted to dump assertions for different behaviors into one test, which over time can lead to tests that keep growing and become difficult to understand and maintain. <br /><br />- <b><span style="color: #990000;">The exact behavior being tested might not always be clear from the test code</span></b>. If the test name isn’t explicit about this, sometimes you might have to guess what the test is actually testing. 
<br /><br />- <b><span style="color: #990000;">You can easily tell if some functionality isn’t being tested</span></b>. If you don’t see a test name that describes the behavior you’re looking for, then you know the test doesn’t exist.  <br /><br />- <b><span style="color: #990000;">When a test fails, you can immediately see what functionality is broken without looking at the test’s source code</span></b>. <br /><br />There are several common patterns for structuring the name of a test (one example is to name tests like an English sentence with “should” in the name, e.g., <span style="font-family: Courier New, Courier, monospace;">shouldLockOutUserAfterThreeInvalidLoginAttempts</span>). Whichever pattern you use, the same advice still applies: <span style="color: #990000;"><b>Make sure test names contain both the scenario being tested and the expected outcome</b></span>.  <br /><br />Sometimes just specifying the name of the method under test may be enough, especially if the method is simple and has only a single behavior that is obvious from its name.  <br /><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/testing-on-the-toilet-writing-descriptive-test-names/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Announcing the GTAC 2014 Agenda</title>
		<link>https://googledata.org/google-testing/announcing-the-gtac-2014-agenda/</link>
		<comments>https://googledata.org/google-testing/announcing-the-gtac-2014-agenda/#comments</comments>
		<pubDate>Wed, 01 Oct 2014 00:36:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=e1f9e0b7793151892e50873d9142e1d8</guid>
		<description><![CDATA[<i>by Anthony Vallone on behalf of the GTAC Committee  </i><br /><br />We have completed selection and confirmation of all speakers and attendees for GTAC 2014. You can find the detailed agenda at: <br />&#160;&#160;<a href="http://developers.google.com/gtac/2014/schedule">developers.google.com/gtac/2014/schedule</a><br /><br />Thank you to all who submitted proposals! It was very hard to make selections from so many fantastic submissions. <br /><br />There was a tremendous amount of interest in GTAC this year with over 1,500 applicants (up from 533 last year) and 194 of those for speaking (up from 88 last year). Unfortunately, our venue only seats 250. However, don&#8217;t despair if you did not receive an invitation. Just like last year, anyone can join us via <a href="https://developers.google.com/google-test-automation-conference/2014/stream">YouTube live streaming</a>. We&#8217;ll also be setting up Google Moderator, so remote attendees can get involved in Q&#38;A after each talk. Information about live streaming, Moderator, and other details will be posted on the GTAC site soon and announced here. <br /><br />]]></description>
				<content:encoded><![CDATA[<i>by Anthony Vallone on behalf of the GTAC Committee  </i><br /><br />We have completed selection and confirmation of all speakers and attendees for GTAC 2014. You can find the detailed agenda at: <br />&nbsp;&nbsp;<a href="http://developers.google.com/gtac/2014/schedule">developers.google.com/gtac/2014/schedule</a><br /><br />Thank you to all who submitted proposals! It was very hard to make selections from so many fantastic submissions. <br /><br />There was a tremendous amount of interest in GTAC this year with over 1,500 applicants (up from 533 last year) and 194 of those for speaking (up from 88 last year). Unfortunately, our venue only seats 250. However, don’t despair if you did not receive an invitation. Just like last year, anyone can join us via <a href="https://developers.google.com/google-test-automation-conference/2014/stream">YouTube live streaming</a>. We’ll also be setting up Google Moderator, so remote attendees can get involved in Q&amp;A after each talk. Information about live streaming, Moderator, and other details will be posted on the GTAC site soon and announced here. <br /><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/announcing-the-gtac-2014-agenda/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Chrome &#8211; Firefox WebRTC Interop Test &#8211; Pt 2</title>
		<link>https://googledata.org/google-testing/chrome-firefox-webrtc-interop-test-pt-2/</link>
		<comments>https://googledata.org/google-testing/chrome-firefox-webrtc-interop-test-pt-2/#comments</comments>
		<pubDate>Tue, 09 Sep 2014 20:05:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=dc693df5a565e711d4e34e188f2d176c</guid>
		<description><![CDATA[<i>by Patrik H&#246;glund </i><br /><i><br /></i><i>This is the second in a series of articles about Chrome&#8217;s WebRTC Interop Test. <a href="http://googletesting.blogspot.com/2014/08/chrome-firefox-webrtc-interop-test-pt-1.html">See the first</a>.</i><br /><br />In the previous blog post we managed to write an automated test which got a WebRTC call between Firefox and Chrome to run. But how do we verify that the call actually worked?<br /><br /><h3>Verifying the Call</h3>Now we can launch the two browsers, but how do we figure out whether the call actually worked? If you try opening two <a href="http://apprtc.appspot.com/">apprtc.appspot.com</a> tabs in the same room, you will notice the video feeds flip over using a CSS transform: your local video is relegated to a small frame, and a new big video feed with the remote video shows up. For the first version of the test, I just inspected the page in the Chrome debugger and looked for some reliable signal. As it turns out, the <span><b>remoteVideo.style.opacity</b></span> property will go from 0 to 1 when the call goes up and from 1 to 0 when it goes down. Since we can execute arbitrary JavaScript in the Chrome tab from the test, we can simply implement the check like this: <br /><br /><pre><span>bool</span> WaitForCallToComeUp(content::WebContents* tab_contents) {<br />  <span>// Apprtc will set remoteVideo.style.opacity to 1 when the call comes up.</span><br />  std::<span>string</span> javascript =<br />      <span>"window.domAutomationController.send(remoteVideo.style.opacity)"</span>;<br />  <span>return</span> test::PollingWaitUntil(javascript, <span>"1"</span>, tab_contents);<br />}<br /></pre><br /><h3>Verifying Video is Playing</h3>So getting a call up is good, but what if there is a bug where Firefox and Chrome cannot send correct video streams to each other? To check that, we needed to step up our game a bit. 
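One way to step up the game is to check whether the video's pixels actually change between frames. A rough Python sketch of that idea (the function names and frame representation are my own illustration; the real detector is JavaScript running in the page):

```python
# Illustrative sketch only: the actual detector (video_detector.js) runs
# in the browser. Here a "frame" is just a flat sequence of pixel values.
def frames_differ(frame_a, frame_b, min_changed_pixels=1):
    """Return True if at least min_changed_pixels differ between frames."""
    changed = sum(1 for a, b in zip(frame_a, frame_b) if a != b)
    return changed >= min_changed_pixels


def video_is_playing(captured_frames):
    """True if any two consecutive captured frames differ.

    A frozen feed yields identical frames, so this returns False for it.
    """
    return any(frames_differ(earlier, later)
               for earlier, later in zip(captured_frames, captured_frames[1:]))
```

This is only a heuristic: a feed could change pixels without showing meaningful video, but it reliably catches frozen or black streams.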
We decided to use our <a href="https://code.google.com/p/chromium/codesearch#chromium/src/chrome/test/data/webrtc/video_detector.js&#38;q=video_dete&#38;sq=package:chromium&#38;l=1">existing video detector</a>, which looks at a video element and determines if the pixels are changing. This is a very basic check, but it&#8217;s better than nothing. To do this, we simply evaluate the .js file&#8217;s JavaScript in the context of the Chrome tab, making the functions in the file available to us. The implementation then becomes <br /><br /><pre><span>bool</span> DetectRemoteVideoPlaying(content::WebContents* tab_contents) {<br />  <span>if</span> (!EvalInJavascriptFile(tab_contents, GetSourceDir().Append(<br />      FILE_PATH_LITERAL(<br />          <span>"chrome/test/data/webrtc/test_functions.js"</span>))))<br />    <span>return</span> <span>false</span>;<br />  <span>if</span> (!EvalInJavascriptFile(tab_contents, GetSourceDir().Append(<br />      FILE_PATH_LITERAL(<br />          <span>"chrome/test/data/webrtc/video_detector.js"</span>))))<br />    <span>return</span> <span>false</span>;<br /><br />  <span>// The remote video tag is called remoteVideo in the AppRTC code.</span><br />  StartDetectingVideo(tab_contents, <span>"remoteVideo"</span>);<br />  WaitForVideoToPlay(tab_contents);<br />  <span>return</span> <span>true</span>;<br />}</pre><br />where <span><b>StartDetectingVideo</b></span> and <span><b>WaitForVideoToPlay</b></span> call the corresponding JavaScript methods in <a href="https://code.google.com/p/chromium/codesearch#chromium/src/chrome/test/data/webrtc/video_detector.js&#38;q=video_dete&#38;sq=package:chromium&#38;l=61">video_detector.js</a>. If the video feed is frozen and unchanging, the test will time out and fail.<br /><br /><h3>What to Send in the Call</h3>Now we can get a call up between the browsers and detect if video is playing. But what video should we send? 
For Chrome, we have a convenient <span><b>--use-fake-device-for-media-stream</b></span> flag that will make Chrome pretend there&#8217;s a webcam and present a generated video feed (which is a spinning green ball with a timestamp). This turned out to be useful since Firefox and Chrome cannot acquire the same camera at the same time, so if we didn&#8217;t use the fake device we would have two webcams plugged into the bots executing the tests! <br /><br />Bots running in Chrome&#8217;s <a href="http://build.chromium.org/">regular test infrastructure</a> do not have either software or hardware webcams plugged into them, so this test must run on bots with webcams for Firefox to be able to acquire a camera. Fortunately, we have that in the <a href="http://build.chromium.org/p/chromium.webrtc/waterfall">WebRTC waterfalls</a> in order to test that we can actually acquire hardware webcams on all platforms. We also added a check to just succeed the test when there&#8217;s no real webcam on the system since we don&#8217;t want it to fail when a dev runs it on a machine without a webcam:  <br /><br /><pre><span>if</span> (!HasWebcamOnSystem())<br />  <span>return</span>;<br /></pre><br />It would of course be better if Firefox had a similar fake device, but to my knowledge it doesn&#8217;t.<br /><br /><h3>Downloading all Code and Components&#160;</h3>Now we have all we need to run the test and have it verify something useful. We just have the hard part left: how do we actually download all the resources we need to run this test? Recall that this is actually a three-way integration test between Chrome, Firefox and AppRTC, which requires the following:<br /><br /><ul><li>The AppEngine SDK in order to bring up the local AppRTC instance,&#160;</li><li>The AppRTC code itself,&#160;</li><li>Chrome (already present in the checkout), and&#160;</li><li>Firefox nightly.</li></ul><br />While developing the test, I initially just hand-downloaded and installed these, and hard-coded the paths. 
This is a very bad idea in the long run. Recall that the Chromium infrastructure is composed of <a href="http://build.chromium.org/">thousands and thousands of machines</a>, and while this test will only run on perhaps 5 at a time due to its webcam requirements, we don&#8217;t want manual maintenance work whenever we replace a machine. And for that matter, we definitely don&#8217;t want to download a new Firefox by hand every night and put it in the right location on the bots! So how do we automate this? <br /><br /><b>Downloading the AppEngine SDK</b><br />First, let&#8217;s start with the easy part. We don&#8217;t really care if the AppEngine SDK is up-to-date, so a relatively stale version is fine. We could have the test download it from the <a href="https://developers.google.com/appengine/downloads">authoritative source</a>, but that&#8217;s a bad idea for a couple reasons. First, it updates outside our control. Second, there could be anti-robot measures on the page. Third, the download will likely be unreliable and fail the test occasionally. <br /><br />The way we solved this was to upload a copy of the SDK to a <a href="https://cloud.google.com/products/cloud-storage/?utm_source=google&#38;utm_medium=cpc&#38;utm_campaign=cloudstorage-search">Google storage</a> bucket under our control and download it using the <a href="http://dev.chromium.org/developers/how-tos/depottools">depot_tools</a> script <span><b>download_from_google_storage.py</b></span>. This is a lot more reliable than an external website and will not download the SDK if we already have the right version on the bot. <br /><br /><b>Downloading the AppRTC Code</b><br />This code is on GitHub. Experience has shown that git clone commands run against GitHub will fail every now and then, and fail the test. 
We could write some retry mechanism, but we have found it&#8217;s better to simply mirror the git repository in Chromium&#8217;s internal mirrors, which are closer to our bots and thereby more reliable from our perspective. The pull is done by a <a href="http://chromegw/viewvc/chrome/trunk/deps/third_party/webrtc/webrtc.DEPS/DEPS?revision=262390&#38;view=markup">Chromium DEPS file</a> (which is Chromium&#8217;s dependency provisioning framework). <br /><br /><b>Downloading Firefox</b><br />It turns out that Firefox supplies handy libraries for this task. We&#8217;re using mozdownload <a href="http://chromegw/viewvc/chrome/trunk/deps/third_party/webrtc/webrtc.DEPS/download_firefox_nightly.py?revision=241269&#38;view=markup">in this script</a> in order to download the Firefox nightly build. Unfortunately this <a href="https://code.google.com/p/chromium/issues/detail?id=360516">fails every now and then</a>, so we would like to have some retry mechanism, or we could write some mechanism to &#8220;mirror&#8221; the Firefox nightly build in some location we control.<br /><br /><h3>Putting it Together</h3>With that, we have everything we need to deploy the test. You can see the <a href="https://code.google.com/p/chromium/codesearch#chromium/src/chrome/browser/media/chrome_webrtc_apprtc_browsertest.cc&#38;q=chrome_webrtc_app&#38;sq=package:chromium&#38;l=297">final code here</a>.  <br /><br />The provisioning code above was put into a separate &#8220;<a href="http://chromegw/viewvc/chrome/trunk/deps/third_party/webrtc/webrtc.DEPS/">.gclient solution</a>&#8221; so that regular Chrome devs and bots are not burdened with downloading hundreds of megs of SDKs and code that they will not use. When this test runs, you will first see a Chrome browser pop up, which will ensure the local apprtc instance is up. Then a Firefox browser will pop up. 
They will each acquire the fake device and real camera, respectively, and after a short delay the AppRTC call will come up, proving that video interop is working. <br /><br />This is a complicated and expensive test, but we believe it is worth it to keep the main interop case under automation this way, especially as the spec evolves and the browsers are in varying states of implementation. <br /><br /><h3>Future Work</h3><ul><li>Also run on Windows/Mac.&#160;</li><li>Also test Opera.&#160;</li><li>Interop between Chrome/Firefox mobile and desktop browsers.&#160;</li><li>Also ensure audio is playing.&#160;</li><li>Measure bandwidth stats, video quality, etc.</li></ul><br /><br />]]></description>
				<content:encoded><![CDATA[<i>by Patrik Höglund </i><br /><i><br /></i><i>This is the second in a series of articles about Chrome’s WebRTC Interop Test. <a href="http://googletesting.blogspot.com/2014/08/chrome-firefox-webrtc-interop-test-pt-1.html">See the first</a>.</i><br /><br />In the previous blog post we managed to write an automated test which got a WebRTC call between Firefox and Chrome to run. But how do we verify that the call actually worked?<br /><br /><h3>Verifying the Call</h3>Now we can launch the two browsers, but how do we figure out whether the call actually worked? If you try opening two <a href="http://apprtc.appspot.com/">apprtc.appspot.com</a> tabs in the same room, you will notice the video feeds flip over using a CSS transform: your local video is relegated to a small frame, and a new big video feed with the remote video shows up. For the first version of the test, I just inspected the page in the Chrome debugger and looked for some reliable signal. As it turns out, the <span style="color: #6aa84f; font-family: Courier New, Courier, monospace;"><b>remoteVideo.style.opacity</b></span> property will go from 0 to 1 when the call goes up and from 1 to 0 when it goes down. 
Since we can execute arbitrary JavaScript in the Chrome tab from the test, we can simply implement the check like this: <br /><br /><pre style="background: #f0f0f0; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;"><span style="color: #0433ff;">bool</span> WaitForCallToComeUp(content::WebContents* tab_contents) {<br />  <span style="color: #4f8f00;">// Apprtc will set remoteVideo.style.opacity to 1 when the call comes up.</span><br />  std::<span style="color: #0433ff;">string</span> javascript =<br />      <span style="color: #ff2600;">"window.domAutomationController.send(remoteVideo.style.opacity)"</span>;<br />  <span style="color: #0433ff;">return</span> test::PollingWaitUntil(javascript, <span style="color: #ff2600;">"1"</span>, tab_contents);<br />}<br /></pre><br /><h3>Verifying Video is Playing</h3>So getting a call up is good, but what if there is a bug where Firefox and Chrome cannot send correct video streams to each other? To check that, we needed to step up our game a bit. We decided to use our <a href="https://code.google.com/p/chromium/codesearch#chromium/src/chrome/test/data/webrtc/video_detector.js&amp;q=video_dete&amp;sq=package:chromium&amp;l=1">existing video detector</a>, which looks at a video element and determines if the pixels are changing. This is a very basic check, but it’s better than nothing. To do this, we simply evaluate the .js file’s JavaScript in the context of the Chrome tab, making the functions in the file available to us. 
The implementation then becomes <br /><br /><pre style="background: #f0f0f0; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;"><span style="color: #0433ff;">bool</span> DetectRemoteVideoPlaying(content::WebContents* tab_contents) {<br />  <span style="color: #0433ff;">if</span> (!EvalInJavascriptFile(tab_contents, GetSourceDir().Append(<br />      FILE_PATH_LITERAL(<br />          <span style="color: #ff2600;">"chrome/test/data/webrtc/test_functions.js"</span>))))<br />    <span style="color: #0433ff;">return</span> <span style="color: #0433ff;">false</span>;<br />  <span style="color: #0433ff;">if</span> (!EvalInJavascriptFile(tab_contents, GetSourceDir().Append(<br />      FILE_PATH_LITERAL(<br />          <span style="color: #ff2600;">"chrome/test/data/webrtc/video_detector.js"</span>))))<br />    <span style="color: #0433ff;">return</span> <span style="color: #0433ff;">false</span>;<br /><br />  <span style="color: #4f8f00;">// The remote video tag is called remoteVideo in the AppRTC code.</span><br />  StartDetectingVideo(tab_contents, <span style="color: #ff2600;">"remoteVideo"</span>);<br />  WaitForVideoToPlay(tab_contents);<br />  <span style="color: #0433ff;">return</span> <span style="color: #0433ff;">true</span>;<br />}</pre><br />where <span style="color: #6aa84f; font-family: Courier New, Courier, monospace;"><b>StartDetectingVideo</b></span> and <span style="color: #6aa84f; font-family: Courier New, Courier, monospace;"><b>WaitForVideoToPlay</b></span> call the corresponding JavaScript methods in <a href="https://code.google.com/p/chromium/codesearch#chromium/src/chrome/test/data/webrtc/video_detector.js&amp;q=video_dete&amp;sq=package:chromium&amp;l=61">video_detector.js</a>. If the video feed is frozen and unchanging, the test will time out and fail.<br /><br /><h3>What to Send in the Call</h3>Now we can get a call up between the browsers and detect if video is playing. 
But what video should we send? For Chrome, we have a convenient <span style="color: #6aa84f; font-family: Courier New, Courier, monospace;"><b>--use-fake-device-for-media-stream</b></span> flag that will make Chrome pretend there’s a webcam and present a generated video feed (which is a spinning green ball with a timestamp). This turned out to be useful since Firefox and Chrome cannot acquire the same camera at the same time, so if we didn’t use the fake device we would have two webcams plugged into the bots executing the tests! <br /><br />Bots running in Chrome’s <a href="http://build.chromium.org/">regular test infrastructure</a> do not have either software or hardware webcams plugged into them, so this test must run on bots with webcams for Firefox to be able to acquire a camera. Fortunately, we have that in the <a href="http://build.chromium.org/p/chromium.webrtc/waterfall">WebRTC waterfalls</a> in order to test that we can actually acquire hardware webcams on all platforms. We also added a check to just succeed the test when there’s no real webcam on the system since we don’t want it to fail when a dev runs it on a machine without a webcam:  <br /><br /><pre style="background: #f0f0f0; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;"><span style="color: #0433ff;">if</span> (!HasWebcamOnSystem())<br />  <span style="color: #0433ff;">return</span>;<br /></pre><br />It would of course be better if Firefox had a similar fake device, but to my knowledge it doesn’t.<br /><br /><h3>Downloading all Code and Components&nbsp;</h3>Now we have all we need to run the test and have it verify something useful. We just have the hard part left: how do we actually download all the resources we need to run this test? 
Recall that this is actually a three-way integration test between Chrome, Firefox and AppRTC, which requires the following:<br /><br /><ul style="line-height: 1.0em; margin-bottom: 0px; margin-top: 0px; padding-bottom: 0px; padding-top: 0px;"><li>The AppEngine SDK in order to bring up the local AppRTC instance,&nbsp;</li><li>The AppRTC code itself,&nbsp;</li><li>Chrome (already present in the checkout), and&nbsp;</li><li>Firefox nightly.</li></ul><br />While developing the test, I initially just hand-downloaded and installed these, and hard-coded the paths. This is a very bad idea in the long run. Recall that the Chromium infrastructure is composed of <a href="http://build.chromium.org/">thousands and thousands of machines</a>, and while this test will only run on perhaps 5 at a time due to its webcam requirements, we don’t want manual maintenance work whenever we replace a machine. And for that matter, we definitely don’t want to download a new Firefox by hand every night and put it in the right location on the bots! So how do we automate this? <br /><br /><b>Downloading the AppEngine SDK</b><br />First, let’s start with the easy part. We don’t really care if the AppEngine SDK is up-to-date, so a relatively stale version is fine. We could have the test download it from the <a href="https://developers.google.com/appengine/downloads">authoritative source</a>, but that’s a bad idea for a couple reasons. First, it updates outside our control. Second, there could be anti-robot measures on the page. Third, the download will likely be unreliable and fail the test occasionally. 
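Flaky network fetches like these are usually wrapped in retries with exponential backoff. The helper below is a hypothetical Python sketch of that pattern, not the actual bot code:

```python
import time

# Hypothetical illustration of a retry-with-backoff wrapper for flaky
# operations such as downloads or git clones (not real infrastructure code).
def retry_with_backoff(operation, attempts=3, initial_delay_seconds=1.0,
                       sleep=time.sleep):
    delay = initial_delay_seconds
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except Exception as error:
            last_error = error
            if attempt < attempts - 1:
                sleep(delay)   # wait before the next try
                delay *= 2     # exponential backoff: 1s, 2s, 4s, ...
    raise last_error
```

A download step would then be `retry_with_backoff(lambda: fetch(url))`, where `fetch` is whatever performs the actual transfer. As the next paragraphs explain, though, mirroring into infrastructure you control is often more robust than retrying an unreliable source.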
<br /><br />The way we solved this was to upload a copy of the SDK to a <a href="https://cloud.google.com/products/cloud-storage/?utm_source=google&amp;utm_medium=cpc&amp;utm_campaign=cloudstorage-search">Google storage</a> bucket under our control and download it using the <a href="http://dev.chromium.org/developers/how-tos/depottools">depot_tools</a> script <span style="color: #6aa84f; font-family: Courier New, Courier, monospace;"><b>download_from_google_storage.py</b></span>. This is a lot more reliable than an external website and will not download the SDK if we already have the right version on the bot. <br /><br /><b>Downloading the AppRTC Code</b><br />This code is on GitHub. Experience has shown that git clone commands run against GitHub will fail every now and then, and fail the test. We could write some retry mechanism, but we have found it’s better to simply mirror the git repository in Chromium’s internal mirrors, which are closer to our bots and thereby more reliable from our perspective. The pull is done by a <a href="http://chromegw/viewvc/chrome/trunk/deps/third_party/webrtc/webrtc.DEPS/DEPS?revision=262390&amp;view=markup">Chromium DEPS file</a> (which is Chromium’s dependency provisioning framework). <br /><br /><b>Downloading Firefox</b><br />It turns out that Firefox supplies handy libraries for this task. We’re using mozdownload <a href="http://chromegw/viewvc/chrome/trunk/deps/third_party/webrtc/webrtc.DEPS/download_firefox_nightly.py?revision=241269&amp;view=markup">in this script</a> in order to download the Firefox nightly build. Unfortunately this <a href="https://code.google.com/p/chromium/issues/detail?id=360516">fails every now and then</a>, so we would like to have some retry mechanism, or we could write some mechanism to “mirror” the Firefox nightly build in some location we control.<br /><br /><h3>Putting it Together</h3>With that, we have everything we need to deploy the test. 
You can see the <a href="https://code.google.com/p/chromium/codesearch#chromium/src/chrome/browser/media/chrome_webrtc_apprtc_browsertest.cc&amp;q=chrome_webrtc_app&amp;sq=package:chromium&amp;l=297">final code here</a>.  <br /><br />The provisioning code above was put into a separate “<a href="http://chromegw/viewvc/chrome/trunk/deps/third_party/webrtc/webrtc.DEPS/">.gclient solution</a>” so that regular Chrome devs and bots are not burdened with downloading hundreds of megs of SDKs and code that they will not use. When this test runs, you will first see a Chrome browser pop up, which will ensure the local apprtc instance is up. Then a Firefox browser will pop up. They will each acquire the fake device and real camera, respectively, and after a short delay the AppRTC call will come up, proving that video interop is working. <br /><br />This is a complicated and expensive test, but we believe it is worth it to keep the main interop case under automation this way, especially as the spec evolves and the browsers are in varying states of implementation. <br /><br /><h3>Future Work</h3><ul style="line-height: 1.0em; margin-bottom: 0px; margin-top: 0px; padding-bottom: 0px; padding-top: 0px;"><li>Also run on Windows/Mac.&nbsp;</li><li>Also test Opera.&nbsp;</li><li>Interop between Chrome/Firefox mobile and desktop browsers.&nbsp;</li><li>Also ensure audio is playing.&nbsp;</li><li>Measure bandwidth stats, video quality, etc.</li></ul><br /><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/chrome-firefox-webrtc-interop-test-pt-2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Chrome &#8211; Firefox WebRTC Interop Test &#8211; Pt 1</title>
		<link>https://googledata.org/google-testing/chrome-firefox-webrtc-interop-test-pt-1/</link>
		<comments>https://googledata.org/google-testing/chrome-firefox-webrtc-interop-test-pt-1/#comments</comments>
		<pubDate>Tue, 26 Aug 2014 21:05:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=02263ce11a5c1f2094ff18d8d7dd2203</guid>
		<description><![CDATA[<i>by Patrik H&#246;glund </i><br /><br /><a href="http://www.webrtc.org/">WebRTC</a> enables real-time peer-to-peer video and voice transfer in the browser, making it possible to build, among other things, a <a href="https://apprtc.appspot.com/">working video chat</a> with a <a href="https://github.com/GoogleChrome/webrtc/tree/master/samples/web/content/apprtc">small amount of Python and JavaScript</a>. As a web standard, it has several unusual properties which make it hard to test. A regular web standard generally accepts HTML text and yields a bitmap as output (what you see in the browser). For WebRTC, we have real-time RTP media streams on one side being sent to another WebRTC-enabled endpoint. These RTP packets have been jumping across NAT, through firewalls and perhaps through TURN servers to deliver hopefully stutter-free and low latency media. <br /><br />WebRTC is probably the only web standard in which we need to test direct communication between Chrome and other browsers. Remember, WebRTC builds on peer-to-peer technology, which means we talk directly between browsers rather than through a server. Chrome, Firefox and Opera have announced support for WebRTC so far. To test interoperability, we set out to build an automated test to ensure that Chrome and Firefox can get a call up. This article describes how we implemented such a test and the tradeoffs we made along the way.<br /><br /><h3>Calling in WebRTC </h3>Setting up a WebRTC call requires passing SDP blobs over a signaling connection. These blobs contain information on the capabilities of the endpoint, such as what media formats it supports and what preferences it has (for instance, perhaps the endpoint has VP8 decoding hardware, which means the endpoint will handle VP8 more efficiently than, say, H.264). By sending these blobs the endpoints can agree on what media format they will be sending between themselves and how to traverse the network between them. 
Once that is done, the browsers will talk directly to each other, and nothing gets sent over the signaling connection.<br /><br /><div><a href="http://4.bp.blogspot.com/-A6zQCufOABs/U_zvIGck2gI/AAAAAAAAAG0/FlqBUQ8h_XY/s1600/image00.png"><img border="0" src="http://4.bp.blogspot.com/-A6zQCufOABs/U_zvIGck2gI/AAAAAAAAAG0/FlqBUQ8h_XY/s1600/image00.png" width="500"></a></div><div><i>Figure 1. Signaling and media connections.</i></div><br />How these blobs are sent is up to the application. Usually the browsers connect to some server which mediates the connection between the browsers, for instance by using a contact list or a room number. The <a href="https://apprtc.appspot.com/">AppRTC reference application</a> uses room numbers to pair up browsers and sends the SDP blobs from the browsers through the AppRTC server.<br /><br /><h3>Test Design</h3>Instead of designing a new signaling solution from scratch, we chose to use the AppRTC application we already had. This has the additional benefit of testing the AppRTC code, which we are also maintaining. We could also have used the small <a href="https://code.google.com/p/chromium/codesearch#chromium/src/third_party/libjingle/libjingle.gyp&#38;q=peerconnection_server&#38;sq=package:chromium&#38;type=cs&#38;l=289">peerconnection_server</a> binary and some JavaScript, which would give us additional flexibility in what to test. We chose to go with AppRTC since it effectively implements the signaling for us, leading to much less test code. <br /><br />We assumed we would be able to get hold of the latest nightly Firefox and be able to launch that with a given URL. For the Chrome side, we assumed we would be running in a browser test, i.e. on a complete Chrome with some test scaffolding around it. For the first sketch of the test, we imagined just connecting the browsers to the live <a href="http://apprtc.appspot.com/">apprtc.appspot.com</a> with some random room number. 
If the call got established, we would be able to look at the remote video feed on the Chrome side and verify that video was playing (for instance using the <a href="http://googlechrome.github.io/webrtc/samples/web/content/getusermedia-canvas">video+canvas grab trick</a>). Furthermore, we could verify that audio was playing, for instance by using <a href="http://googlechrome.github.io/webrtc/samples/web/content/getusermedia-volume">WebRTC getStats</a> to measure the audio track energy level. <br /><br /><div><a href="http://2.bp.blogspot.com/-KLMJbuz7uGg/U_zvNllz-vI/AAAAAAAAAG8/JJK4j05TQOI/s1600/image01.png"><img border="0" src="http://2.bp.blogspot.com/-KLMJbuz7uGg/U_zvNllz-vI/AAAAAAAAAG8/JJK4j05TQOI/s1600/image01.png" width="500"></a></div><div><i>Figure 2. Basic test design.</i></div><br />However, since we like tests to be <a href="http://googletesting.blogspot.se/2012/10/hermetic-servers.html">hermetic</a>, this isn&#8217;t a good design. I can see several problems. For example, if the network between us and AppRTC is unreliable. Also, what if someone has occupied myroomid? If that were the case, the test would fail and we would be none the wiser. So to make this thing work, we would have to find some way to bring up the AppRTC instance on localhost to make our test hermetic.<br /><br /><h3>Bringing up AppRTC on localhost</h3>AppRTC is a <a href="https://developers.google.com/appengine/">Google App Engine application</a>. As this <a href="https://developers.google.com/appengine/docs/python/gettingstartedpython27/helloworld">hello world example</a> demonstrates, one can test applications locally with<br /><pre>google_appengine/dev_appserver.py apprtc_code/</pre><br />So why not just call this from our test? It turns out we need to solve some complicated problems first, like how to ensure the AppEngine SDK and the AppRTC code is actually available on the executing machine, but we&#8217;ll get to that later. Let&#8217;s assume for now that stuff is just available. 
We can now write the browser test code to launch the local instance:<br /><pre><span>bool</span> LaunchApprtcInstanceOnLocalhost() <br /><span>// ... Figure out locations of SDK and apprtc code ...</span><br />  CommandLine command_line(CommandLine::NO_PROGRAM);<br />  EXPECT_TRUE(GetPythonCommand(&#38;command_line));<br /><br />  command_line.AppendArgPath(appengine_dev_appserver);<br />  command_line.AppendArgPath(apprtc_dir);<br />  command_line.AppendArg(<span>"--port=9999"</span>);<br />  command_line.AppendArg(<span>"--admin_port=9998"</span>);<br />  command_line.AppendArg(<span>"--skip_sdk_update_check"</span>);<br /><br />  VLOG(1) &#60;&#60; <span>"Running "</span> &#60;&#60; command_line.GetCommandLineString();<br /><span>return</span> base::LaunchProcess(command_line, base::LaunchOptions(),<br />                             &#38;dev_appserver_);<br />}<br /></pre><br />That&#8217;s pretty straightforward <a href="http://googletesting.blogspot.com/#foot1"><sup>[1]</sup></a>.<br /><br /><h3>Figuring out Whether the Local Server is Up&#160;</h3>Then we ran into a very typical test problem. So we have the code to get the server up, and launching the two browsers to connect to <span><b>http://localhost:9999?r=some_room</b></span> is easy. But how do we know when to connect? When I first ran the test, it would work sometimes and sometimes not depending on if the server had time to get up. <br /><br />It&#8217;s tempting in these situations to just add a sleep to give the server time to get up. Don&#8217;t do that. That will result in a test that is flaky and/or slow. In these situations we need to identify what we&#8217;re really waiting for. We could probably monitor the stdout of the <b><span>dev_appserver.py</span></b> and look for some message that says &#8220;Server is up!&#8221; or equivalent. 
However, we&#8217;re really waiting for the server to be able to serve web pages, and since we have two browsers that are really good at connecting to servers, why not use them? Consider this code.<br /><pre><span>bool</span> LocalApprtcInstanceIsUp() {<br /><span>// Load the admin page and see if we manage to load it right.</span><br />  ui_test_utils::NavigateToURL(browser(), GURL(<span>"localhost:9998"</span>));<br />  content::WebContents* tab_contents =<br />      browser()-&#62;tab_strip_model()-&#62;GetActiveWebContents();<br />  std::<span>string</span> javascript =<br /><span>"window.domAutomationController.send(document.title)"</span>;<br />  std::<span>string</span> result;<br /><span>if</span> (!content::ExecuteScriptAndExtractString(tab_contents, <br />                                              javascript,<br />                                              &#38;result))<br /><span>return</span> <span>false</span>;<br /><br /><span>return</span> result == kTitlePageOfAppEngineAdminPage;<br />}<br /></pre><br />Here we ask Chrome to load the AppEngine admin page for the local server (we set the admin port to <span><b>9998</b></span> earlier, remember?) and ask it what its title is. If that title is &#8220;Instances&#8221;, the admin page has been displayed, and the server must be up. If the server isn&#8217;t up, Chrome will fail to load the page and the title will be something like &#8220;<span><b>localhost:9999</b></span> is not available&#8221;. <br /><br />Then, we can just do this from the test:<br /><pre><span>while</span> (!LocalApprtcInstanceIsUp())<br />  VLOG(1) &#60;&#60; <span>"Waiting for AppRTC to come up..."</span>;<br /></pre><br />If the server never comes up, for whatever reason, the test will just time out in that loop. If it comes up we can safely proceed with the rest of test.<br /><br /><h3>Launching the Browsers&#160;</h3>A browser window launches itself as a part of every Chromium browser test. 
It&#8217;s also easy for the test to control the command line switches the browser will run under. <br /><br />We have less control over the Firefox browser since it is the &#8220;foreign&#8221; browser in this test, but we can still pass command-line options to it when we invoke the Firefox process. To make this easier, Mozilla provides a Python library called mozrunner. Using that we can set up a <a href="http://chromegw/viewvc/chrome/trunk/deps/third_party/webrtc/webrtc.DEPS/run_firefox_webrtc.py?revision=240895&#38;view=markup">launcher python script</a> we can invoke from the test:<br /><pre><span>from</span> mozprofile <span>import</span> profile<br /><span>from</span> mozrunner <span>import</span> runner<br /><br />WEBRTC_PREFERENCES = {<br /><span>'media.navigator.permission.disabled'</span>: True,<br />}<br /><br /><span>def</span> main():<br /><span># Set up flags, handle SIGTERM, etc</span><br /><span># ...</span><br />  firefox_profile = <br />      profile.FirefoxProfile(preferences=WEBRTC_PREFERENCES)<br />  firefox_runner = runner.FirefoxRunner(<br />      profile=firefox_profile, binary=options.binary, <br />      cmdargs=[options.webpage])<br /><br />  firefox_runner.start()</pre><br />Notice that we need to pass special preferences to make Firefox accept the getUserMedia prompt. Otherwise, the test would get stuck on the prompt and we would be unable to set up a call. Alternatively, we could employ some kind of clickbot to click &#8220;Allow&#8221; on the prompt when it pops up, but that is way harder to set up. 
<br /><br />Without going into too much detail, the code for launching the browsers becomes<br /><pre>GURL room_url = <br />    GURL(base::StringPrintf(<span>"http://localhost:9999?r=room_%d"</span>,<br />                            base::RandInt(0, 65536)));<br />content::WebContents* chrome_tab = <br />    OpenPageAndAcceptUserMedia(room_url);<br />ASSERT_TRUE(LaunchFirefoxWithUrl(room_url));</pre><br />Where LaunchFirefoxWithUrl essentially runs this:<br /><pre>run_firefox_webrtc.py --binary /path/to/firefox --webpage http:<span>//localhost::9999?r=my_room</span></pre><br />Now we can launch the two browsers. Next time we will look at how we actually verify that the call worked, and how we actually download all resources needed by the test in a maintainable and automated manner. Stay tuned!<br /><br /><hr /><sup>1</sup>The explicit ports are because the default ports collided on the bots we were running on, and the --skip_sdk_update_check was because the SDK stopped and asked us something if there was an update. <br /><br />]]></description>
				<content:encoded><![CDATA[<i>by Patrik Höglund </i><br /><br /><a href="http://www.webrtc.org/">WebRTC</a> enables real time peer-to-peer video and voice transfer in the browser, making it possible to build, among other things, a <a href="https://apprtc.appspot.com/">working video chat</a> with a <a href="https://github.com/GoogleChrome/webrtc/tree/master/samples/web/content/apprtc">small amount of Python and JavaScript</a>. As a web standard, it has several unusual properties which make it hard to test. A regular web standard generally accepts HTML text and yields a bitmap as output (what you see in the browser). For WebRTC, we have real-time RTP media streams on one side being sent to another WebRTC-enabled endpoint. These RTP packets have been jumping across NAT, through firewalls and perhaps through TURN servers to deliver hopefully stutter-free and low-latency media. <br /><br />WebRTC is probably the only web standard for which we need to test direct communication between Chrome and other browsers. Remember, WebRTC builds on peer-to-peer technology, which means we talk directly between browsers rather than through a server. Chrome, Firefox and Opera have announced support for WebRTC so far. To test interoperability, we set out to build an automated test to ensure that Chrome and Firefox can get a call up. This article describes how we implemented such a test and the tradeoffs we made along the way.<br /><br /><h3>Calling in WebRTC </h3>Setting up a WebRTC call requires passing SDP blobs over a signaling connection. These blobs contain information on the capabilities of the endpoint, such as what media formats it supports and what preferences it has (for instance, perhaps the endpoint has VP8 decoding hardware, which means the endpoint will handle VP8 more efficiently than, say, H.264). By sending these blobs the endpoints can agree on what media format they will be sending between themselves and how to traverse the network between them. 
Once that is done, the browsers will talk directly to each other, and nothing gets sent over the signaling connection.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-A6zQCufOABs/U_zvIGck2gI/AAAAAAAAAG0/FlqBUQ8h_XY/s1600/image00.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-A6zQCufOABs/U_zvIGck2gI/AAAAAAAAAG0/FlqBUQ8h_XY/s1600/image00.png" width="500" /></a></div><div style="text-align: center;"><i>Figure 1. Signaling and media connections.</i></div><br />How these blobs are sent is up to the application. Usually the browsers connect to some server which mediates the connection between the browsers, for instance by using a contact list or a room number. The <a href="https://apprtc.appspot.com/">AppRTC reference application</a> uses room numbers to pair up browsers and sends the SDP blobs from the browsers through the AppRTC server.<br /><br /><h3>Test Design</h3>Instead of designing a new signaling solution from scratch, we chose to use the AppRTC application we already had. This has the additional benefit of testing the AppRTC code, which we are also maintaining. We could also have used the small <a href="https://code.google.com/p/chromium/codesearch#chromium/src/third_party/libjingle/libjingle.gyp&amp;q=peerconnection_server&amp;sq=package:chromium&amp;type=cs&amp;l=289">peerconnection_server</a> binary and some JavaScript, which would give us additional flexibility in what to test. We chose to go with AppRTC since it effectively implements the signaling for us, leading to much less test code. <br /><br />We assumed we would be able to get hold of the latest nightly Firefox and be able to launch that with a given URL. For the Chrome side, we assumed we would be running in a browser test, i.e. on a complete Chrome with some test scaffolding around it. 
For the first sketch of the test, we imagined just connecting the browsers to the live <a href="http://apprtc.appspot.com/">apprtc.appspot.com</a> with some random room number. If the call got established, we would be able to look at the remote video feed on the Chrome side and verify that video was playing (for instance using the <a href="http://googlechrome.github.io/webrtc/samples/web/content/getusermedia-canvas">video+canvas grab trick</a>). Furthermore, we could verify that audio was playing, for instance by using <a href="http://googlechrome.github.io/webrtc/samples/web/content/getusermedia-volume">WebRTC getStats</a> to measure the audio track energy level. <br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-KLMJbuz7uGg/U_zvNllz-vI/AAAAAAAAAG8/JJK4j05TQOI/s1600/image01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-KLMJbuz7uGg/U_zvNllz-vI/AAAAAAAAAG8/JJK4j05TQOI/s1600/image01.png" width="500" /></a></div><div style="text-align: center;"><i>Figure 2. Basic test design.</i></div><br />However, since we like tests to be <a href="http://googletesting.blogspot.se/2012/10/hermetic-servers.html">hermetic</a>, this isn’t a good design. I can see several problems. For example, the network between us and AppRTC could be unreliable. Also, what if someone else has already occupied our room? In either case, the test would fail and we would be none the wiser. So to make this thing work, we would have to find some way to bring up the AppRTC instance on localhost to make our test hermetic.<br /><br /><h3>Bringing up AppRTC on localhost</h3>AppRTC is a <a href="https://developers.google.com/appengine/">Google App Engine application</a>. 
As this <a href="https://developers.google.com/appengine/docs/python/gettingstartedpython27/helloworld">hello world example</a> demonstrates, one can test applications locally with<br /><pre style="background: #f0f0f0; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;">google_appengine/dev_appserver.py apprtc_code/</pre><br />So why not just call this from our test? It turns out we need to solve some complicated problems first, like how to ensure the AppEngine SDK and the AppRTC code are actually available on the executing machine, but we’ll get to that later. Let’s assume for now that stuff is just available. We can now write the browser test code to launch the local instance:<br /><pre style="background: #f0f0f0; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;"><span style="color: #0433ff;">bool</span> LaunchApprtcInstanceOnLocalhost() {<br />  <span style="color: #4f8f00;">// ... 
Figure out locations of SDK and apprtc code ...</span><br />  CommandLine command_line(CommandLine::NO_PROGRAM);<br />  EXPECT_TRUE(GetPythonCommand(&amp;command_line));<br /><br />  command_line.AppendArgPath(appengine_dev_appserver);<br />  command_line.AppendArgPath(apprtc_dir);<br />  command_line.AppendArg(<span style="color: #ff2600;">"--port=9999"</span>);<br />  command_line.AppendArg(<span style="color: #ff2600;">"--admin_port=9998"</span>);<br />  command_line.AppendArg(<span style="color: #ff2600;">"--skip_sdk_update_check"</span>);<br /><br />  VLOG(1) &lt;&lt; <span style="color: #ff2600;">"Running "</span> &lt;&lt; command_line.GetCommandLineString();<br />  <span style="color: #0433ff;">return</span> base::LaunchProcess(command_line, base::LaunchOptions(),<br />                             &amp;dev_appserver_);<br />}<br /></pre><br />That’s pretty straightforward <a href="http://googletesting.blogspot.com/2014/08/chrome-firefox-webrtc-interop-test-pt-1.html#foot1"><sup>[1]</sup></a>.<br /><br /><h3>Figuring out Whether the Local Server is Up&nbsp;</h3>Then we ran into a very typical test problem. So we have the code to get the server up, and launching the two browsers to connect to <span style="color: #38761d; font-family: Courier New, Courier, monospace;"><b>http://localhost:9999?r=some_room</b></span> is easy. But how do we know when to connect? When I first ran the test, it would work sometimes and sometimes not, depending on whether the server had had time to come up. <br /><br />It’s tempting in these situations to just add a sleep to give the server time to come up. Don’t do that. That will result in a test that is flaky and/or slow. In these situations we need to identify what we’re really waiting for. We could probably monitor the stdout of the <b><span style="color: #38761d; font-family: Courier New, Courier, monospace;">dev_appserver.py</span></b> and look for some message that says “Server is up!” or equivalent. 
However, we’re really waiting for the server to be able to serve web pages, and since we have two browsers that are really good at connecting to servers, why not use them? Consider this code.<br /><pre style="background: #f0f0f0; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;"><span style="color: #0433ff;">bool</span> LocalApprtcInstanceIsUp() {<br />  <span style="color: #4f8f00;">// Load the admin page and see if we manage to load it right.</span><br />  ui_test_utils::NavigateToURL(browser(), GURL(<span style="color: #ff2600;">"localhost:9998"</span>));<br />  content::WebContents* tab_contents =<br />      browser()-&gt;tab_strip_model()-&gt;GetActiveWebContents();<br />  std::<span style="color: #0433ff;">string</span> javascript =<br />      <span style="color: #ff2600;">"window.domAutomationController.send(document.title)"</span>;<br />  std::<span style="color: #0433ff;">string</span> result;<br />  <span style="color: #0433ff;">if</span> (!content::ExecuteScriptAndExtractString(tab_contents, <br />                                              javascript,<br />                                              &amp;result))<br />    <span style="color: #0433ff;">return</span> <span style="color: #0433ff;">false</span>;<br /><br />  <span style="color: #0433ff;">return</span> result == kTitlePageOfAppEngineAdminPage;<br />}<br /></pre><br />Here we ask Chrome to load the AppEngine admin page for the local server (we set the admin port to <span style="color: #38761d; font-family: Courier New, Courier, monospace;"><b>9998</b></span> earlier, remember?) and ask it what its title is. If that title is “Instances”, the admin page has been displayed, and the server must be up. If the server isn’t up, Chrome will fail to load the page and the title will be something like “<span style="color: #38761d; font-family: Courier New, Courier, monospace;"><b>localhost:9998</b></span> is not available”. 
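The same "poll until the server actually serves a page" idea can be expressed outside a browser test as well. A minimal Python sketch (the admin port is the one configured above; the helper name and the use of a plain HTTP probe instead of Chrome are ours, not from the Chromium test):

```python
import urllib.request

def local_apprtc_instance_is_up(admin_url="http://localhost:9998"):
    """Probe the dev_appserver admin page; True once it serves a page.

    Hypothetical helper: mirrors the browser-based readiness check in the
    article, but with a direct HTTP request instead of a page load in Chrome.
    """
    try:
        with urllib.request.urlopen(admin_url, timeout=1) as response:
            return response.status == 200
    except OSError:
        # Connection refused or timed out: the server is not up yet.
        return False
```

Either way, the point is the same: poll a meaningful readiness condition (with the overall test timeout as the safety net) rather than sleep for a guessed duration.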
<br /><br />Then, we can just do this from the test:<br /><pre style="background: #f0f0f0; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;"><span style="color: #0433ff;">while</span> (!LocalApprtcInstanceIsUp())<br />  VLOG(1) &lt;&lt; <span style="color: #ff2600;">"Waiting for AppRTC to come up..."</span>;<br /></pre><br />If the server never comes up, for whatever reason, the test will just time out in that loop. If it comes up, we can safely proceed with the rest of the test.<br /><br /><h3>Launching the Browsers&nbsp;</h3>A browser window launches itself as a part of every Chromium browser test. It’s also easy for the test to control the command line switches the browser will run under. <br /><br />We have less control over the Firefox browser since it is the “foreign” browser in this test, but we can still pass command-line options to it when we invoke the Firefox process. To make this easier, Mozilla provides a Python library called mozrunner. 
Using that we can set up a <a href="http://chromegw/viewvc/chrome/trunk/deps/third_party/webrtc/webrtc.DEPS/run_firefox_webrtc.py?revision=240895&amp;view=markup">launcher Python script</a> we can invoke from the test:<br /><pre style="background: #f0f0f0; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;"><span style="color: #0433ff;">from</span> mozprofile <span style="color: #0433ff;">import</span> profile<br /><span style="color: #0433ff;">from</span> mozrunner <span style="color: #0433ff;">import</span> runner<br /><br />WEBRTC_PREFERENCES = {<br />    <span style="color: #ff2600;">'media.navigator.permission.disabled'</span>: True,<br />}<br /><br /><span style="color: #0433ff;">def</span> main():<br />  <span style="color: #4f8f00;"># Set up flags, handle SIGTERM, etc</span><br />  <span style="color: #4f8f00;"># ...</span><br />  firefox_profile = profile.FirefoxProfile(<br />      preferences=WEBRTC_PREFERENCES)<br />  firefox_runner = runner.FirefoxRunner(<br />      profile=firefox_profile, binary=options.binary, <br />      cmdargs=[options.webpage])<br /><br />  firefox_runner.start()</pre><br />Notice that we need to pass special preferences to make Firefox accept the getUserMedia prompt. Otherwise, the test would get stuck on the prompt and we would be unable to set up a call. Alternatively, we could employ some kind of clickbot to click “Allow” on the prompt when it pops up, but that is way harder to set up. 
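On the Chrome side, the analogous prompt is typically suppressed with command-line switches rather than profile preferences. A hedged sketch (the two switches are Chromium's fake-media flags; the helper function and the paths in the example are ours):

```python
def chrome_webrtc_command(chrome_binary, room_url):
    """Build a Chrome invocation that auto-accepts getUserMedia.

    --use-fake-ui-for-media-stream skips the camera/mic permission prompt,
    and --use-fake-device-for-media-stream substitutes a synthetic capture
    device, so no real webcam or clickbot is needed on the bots.
    """
    return [
        chrome_binary,
        "--use-fake-ui-for-media-stream",
        "--use-fake-device-for-media-stream",
        room_url,
    ]
```

A browser test would normally set these switches through its own scaffolding; the list form above is just the subprocess-style equivalent.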
<br /><br />Without going into too much detail, the code for launching the browsers becomes<br /><pre style="background: #f0f0f0; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;">GURL room_url = <br />    GURL(base::StringPrintf(<span style="color: #ff2600;">"http://localhost:9999?r=room_%d"</span>,<br />                            base::RandInt(0, 65536)));<br />content::WebContents* chrome_tab = <br />    OpenPageAndAcceptUserMedia(room_url);<br />ASSERT_TRUE(LaunchFirefoxWithUrl(room_url));</pre><br />Where LaunchFirefoxWithUrl essentially runs this:<br /><pre style="background: #f0f0f0; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;">run_firefox_webrtc.py --binary /path/to/firefox --webpage http:<span style="color: #4f8f00;">//localhost:9999?r=my_room</span></pre><br />Now we can launch the two browsers. Next time we will look at how we actually verify that the call worked, and how we download all resources needed by the test in a maintainable and automated manner. Stay tuned!<br /><br /><hr /><sup id="foot1">1</sup>The explicit ports are because the default ports collided on the bots we were running on, and the --skip_sdk_update_check was because the SDK would stop and prompt us if there was an update. <br /><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/chrome-firefox-webrtc-interop-test-pt-1/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Testing on the Toilet: Web Testing Made Easier: Debug IDs</title>
		<link>https://googledata.org/google-testing/testing-on-the-toilet-web-testing-made-easier-debug-ids/</link>
		<comments>https://googledata.org/google-testing/testing-on-the-toilet-web-testing-made-easier-debug-ids/#comments</comments>
		<pubDate>Tue, 12 Aug 2014 18:53:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=ae4fdc72c921e8e6aef981115504969a</guid>
		<description><![CDATA[by Ruslan Khamitov.&#160;This article was adapted from a Google Testing on the Toilet (TotT) episode. Adding ID attributes to elements can make it much easier to write tests that interact with the DOM (e.g., WebDriver tests) ...]]></description>
				<content:encoded><![CDATA[<i>by Ruslan Khamitov&nbsp;</i><br /><i><br /></i><i>This article was adapted from a <a href="http://googletesting.blogspot.com/2007/01/introducing-testing-on-toilet.html">Google Testing on the Toilet (TotT) episode</a>. You can download a <a href="https://docs.google.com/document/d/1zHky363AF_eNVGOgnpD-7ouhjTG7QbYmCmGwbeFKruE/edit?usp=sharing">printer-friendly version</a> of this TotT episode and post it in your office.</i><br /><br /><b><span style="color: #990000;">Adding ID attributes to elements can make it much easier to write tests that interact with the DOM</span></b> (e.g., WebDriver tests). Consider the following DOM with two buttons that differ only by inner text:<br /><table style="border-collapse: collapse; border: 1px solid black; width: 100%;"><tbody><tr><td style="border: 1px solid black; text-align: center;">Save button</td><td style="border: 1px solid black; text-align: center;">Edit button</td></tr><tr><td style="background: #cfe2f3; border: 1px solid black;"><pre style="background: #cfe2f3; border-color: #7b7b7b; border-style: solid; border-width: 0px; color: black; overflow: auto; padding: 0px 5px;">&lt;div class="button"&gt;Save&lt;/div&gt;</pre></td><td style="background: #cfe2f3; border: 1px solid black;"><pre style="background: #cfe2f3; border-color: #7b7b7b; border-style: solid; border-width: 0px; color: black; overflow: auto; padding: 0px 5px;">&lt;div class="button"&gt;Edit&lt;/div&gt;</pre></td></tr></tbody></table><br />How would you tell WebDriver to interact with the “Save” button in this case?  You have several options.  
<b><span style="color: #990000;">One option is to interact with the button using a CSS selector:</span></b><br /><pre style="background: #cfe2f3; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;">div.button</pre><br />However, this approach is not sufficient to identify a particular button, and there is no mechanism to filter by text in CSS. <b><span style="color: #990000;">Another option would be to write an XPath, which is generally fragile and discouraged:</span></b><br /><pre style="background: #cfe2f3; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;">//div[@class='button' and text()='Save']</pre><br /><b><span style="color: #990000;">Your best option is to add unique hierarchical IDs</span></b> where each widget is passed a base ID that it prepends to the ID of each of its children. The IDs for each button will be:<br /><pre style="background: #d9ead3; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;"><b><i>contact-form</i>.save-button<br /><i>contact-form</i>.edit-button</b></pre><br /><b><span style="color: #990000;">In GWT you can accomplish this by overriding onEnsureDebugId() on your widgets</span></b>. 
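With hierarchical IDs like these in place, a test can address each button directly instead of fighting with selectors. A minimal sketch (Selenium's Python bindings are assumed; the ID-composing helper is ours, and the IDs are the examples above):

```python
def debug_id(base_id, child_id):
    """Compose a hierarchical debug ID, e.g. "contact-form" + "save-button"."""
    return base_id + "." + child_id

# In a WebDriver test the lookup then becomes unambiguous, assuming an
# already-initialized selenium.webdriver instance named `driver`:
#   from selenium.webdriver.common.by import By
#   driver.find_element(By.ID, debug_id("contact-form", "save-button")).click()
```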
Doing so allows you to create custom logic for applying debug IDs to the sub-elements that make up a custom widget:<br /><pre style="background: #d9ead3; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;">@Override <span style="color: #0433ff;">protected</span> <span style="color: #0433ff;">void</span> <b>onEnsureDebugId</b>(String baseId) {<br />  <span style="color: #0433ff;">super</span>.<b>onEnsureDebugId</b>(baseId);<br />  saveButton.<b>ensureDebugId</b>(baseId + <span style="color: #ff2600;">".save-button"</span>);<br />  editButton.<b>ensureDebugId</b>(baseId + <span style="color: #ff2600;">".edit-button"</span>);<br />}</pre><br />Consider another example. <b><span style="color: #990000;">Let’s set IDs for repeated UI elements in Angular</span></b> using ng-repeat.  Setting an index can help differentiate between repeated instances of each element:<br /><pre style="background: #d9ead3; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;">&lt;tr <b>id="feedback-{{$index}}"</b> class="feedback" ng-repeat="feedback in ctrl.feedbacks"&gt;</pre><br /><b><span style="color: #990000;">In GWT you can do this with ensureDebugId()</span></b>. Let’s set an ID for each of the table cells:<br /><pre style="background: #d9ead3; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;">@UiField FlexTable table;<br />UIObject.ensureDebugId(table.getCellFormatter().getElement(rowIndex, columnIndex),<br />    <b>baseID</b> + <b>columnIndex</b> + <span style="color: #ff2600;">"-"</span> + <b>rowIndex</b>);</pre><br /><span style="color: #990000;"><b>Take-away: Debug IDs are easy to set and make a huge difference for testing</b></span>. Please add them early.<br /><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/testing-on-the-toilet-web-testing-made-easier-debug-ids/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
<enclosure url="" length="" type="" />
		</item>
		<item>
		<title>Testing on the Toilet: Don&#8217;t Put Logic in Tests</title>
		<link>https://googledata.org/google-testing/testing-on-the-toilet-dont-put-logic-in-tests/</link>
		<comments>https://googledata.org/google-testing/testing-on-the-toilet-dont-put-logic-in-tests/#comments</comments>
		<pubDate>Thu, 31 Jul 2014 16:59:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=9eeb5e6826db4d7b82d9a2526e1eb76f</guid>
		<description><![CDATA[<i>by Erik Kuefler </i><br /><i><br /></i><i>This article was adapted from a <a href="http://googletesting.blogspot.com/2007/01/introducing-testing-on-toilet.html">Google Testing on the Toilet</a> (TotT) episode. You can download a <a href="https://docs.google.com/document/d/1f-ISrr6ItyDCJCMaFXOyQBZmekBZwxQ0tZeksoWUZAo/edit?usp=sharing">printer-friendly version</a> of this TotT episode and post it in your office.  </i><br /><br /><b><span>Programming languages give us a lot of expressive power.</span></b> Concepts like operators and conditionals are important tools that allow us to write programs that handle a wide range of inputs. But <b><span>this flexibility comes at the cost of increased complexity</span></b>, which makes our programs harder to understand. <br /><br />Unlike in production code, <b><span>simplicity is more important than flexibility in tests</span></b>. Most unit tests verify that a single, known input produces a single, known output. <b><span>Tests can avoid complexity by stating their inputs and outputs directly rather than computing them</span></b>. Otherwise it's easy for tests to develop their own bugs.<br /><br />Let's take a look at a simple example. <b><span>Does this test look correct to you? </span></b><br /><br /><pre>@Test <span>public</span> <span>void</span> shouldNavigateToPhotosPage() {<br />  String <b>baseUrl</b> = <span>"http://plus.google.com/"</span>;<br />  Navigator nav = <span>new</span> Navigator(<b>baseUrl</b>);<br />  nav.goToPhotosPage();<br />  assertEquals(<b>baseUrl</b> + <span><b>"/u/0/photos"</b></span>, nav.getCurrentUrl());<br />}</pre><br />The author is trying to avoid duplication by storing a shared prefix in a variable. Performing a single string concatenation doesn't seem too bad, but <b><span>what happens if we simplify the test by inlining the variable? 
</span></b><br /><br /><pre>@Test <span>public</span> <span>void</span> shouldNavigateToPhotosPage() {<br />  Navigator nav = <span>new</span> Navigator(<span>"http://plus.google.com/"</span>);<br />  nav.goToPhotosPage();<br />  assertEquals(<span>"<b>http://plus.google.com//u/0/photos</b>"</span>, nav.getCurrentUrl()); <span>// <b>Oops!</b></span><br />}</pre><br /><b><span>After eliminating the unnecessary computation from the test, the bug is obvious</span></b>&#8212;we're expecting two slashes in the URL! This test will either fail or (even worse) incorrectly pass if the production code has the same bug. We never would have written this if we had stated our inputs and outputs directly instead of trying to compute them. And this is a very simple example&#8212;<span><b>when a test adds more operators or includes loops and conditionals, it becomes increasingly difficult to be confident that it is correct. </b></span><br /><br />Another way of saying this is that, <b><span>whereas production code describes a general strategy for computing outputs given inputs, tests are concrete examples of input/output pairs</span></b> (where output might include side effects like verifying interactions with other classes). It's usually easy to tell whether an input/output pair is correct or not, even if the logic required to compute it is very complex. For instance, it's hard to picture the exact DOM that would be created by a JavaScript function for a given server response. So the ideal test for such a function would just compare against a string containing the expected output HTML. <br /><br /><b><span>When tests do need their own logic, such logic should often be moved out of the test bodies and into utilities and helper functions</span></b>. Since such helpers can get quite complex, it's usually a good idea for any nontrivial test utility to have its own tests.<br /><br />]]></description>
				<content:encoded><![CDATA[<i>by Erik Kuefler </i><br /><i><br /></i><i>This article was adapted from a <a href="http://googletesting.blogspot.com/2007/01/introducing-testing-on-toilet.html">Google Testing on the Toilet</a> (TotT) episode. You can download a <a href="https://docs.google.com/document/d/1f-ISrr6ItyDCJCMaFXOyQBZmekBZwxQ0tZeksoWUZAo/edit?usp=sharing">printer-friendly version</a> of this TotT episode and post it in your office.  </i><br /><br /><b><span style="color: #990000;">Programming languages give us a lot of expressive power.</span></b> Concepts like operators and conditionals are important tools that allow us to write programs that handle a wide range of inputs. But <b><span style="color: #990000;">this flexibility comes at the cost of increased complexity</span></b>, which makes our programs harder to understand. <br /><br />Unlike in production code, <b><span style="color: #990000;">simplicity is more important than flexibility in tests</span></b>. Most unit tests verify that a single, known input produces a single, known output. <b><span style="color: #990000;">Tests can avoid complexity by stating their inputs and outputs directly rather than computing them</span></b>. Otherwise it's easy for tests to develop their own bugs.<br /><br />Let's take a look at a simple example. <b><span style="color: #990000;">Does this test look correct to you? 
</span></b><br /><br /><pre style="background: #f0f0f0; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;">@Test <span style="color: #0433ff;">public</span> <span style="color: #0433ff;">void</span> shouldNavigateToPhotosPage() {<br />  String <b>baseUrl</b> = <span style="color: #ff2600;">"http://plus.google.com/"</span>;<br />  Navigator nav = <span style="color: #0433ff;">new</span> Navigator(<b>baseUrl</b>);<br />  nav.goToPhotosPage();<br />  assertEquals(<b>baseUrl</b> + <span style="color: #ff2600;"><b>"/u/0/photos"</b></span>, nav.getCurrentUrl());<br />}</pre><br />The author is trying to avoid duplication by storing a shared prefix in a variable. Performing a single string concatenation doesn't seem too bad, but <b><span style="color: #990000;">what happens if we simplify the test by inlining the variable? </span></b><br /><br /><pre style="background: #f0f0f0; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;">@Test <span style="color: #0433ff;">public</span> <span style="color: #0433ff;">void</span> shouldNavigateToPhotosPage() {<br />  Navigator nav = <span style="color: #0433ff;">new</span> Navigator(<span style="color: #ff2600;">"http://plus.google.com/"</span>);<br />  nav.goToPhotosPage();<br />  assertEquals(<span style="color: #ff2600;">"<b>http://plus.google.com//u/0/photos</b>"</span>, nav.getCurrentUrl()); <span style="color: #4f8f00;">// <b>Oops!</b></span><br />}</pre><br /><b><span style="color: #990000;">After eliminating the unnecessary computation from the test, the bug is obvious</span></b>—we're expecting two slashes in the URL! This test will either fail or (even worse) incorrectly pass if the production code has the same bug. We never would have written this if we had stated our inputs and outputs directly instead of trying to compute them. 
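The corrected test simply states the full expected URL as a literal. A minimal runnable sketch of that version (the Navigator here is a stand-in, since the article's real class isn't shown):

```java
// Stand-in Navigator so the sketch is self-contained; the article's
// real class is not shown.
class Navigator {
    private String currentUrl;
    Navigator(String baseUrl) { this.currentUrl = baseUrl; }
    void goToPhotosPage() { this.currentUrl += "u/0/photos"; }
    String getCurrentUrl() { return currentUrl; }
}

public class PhotosPageTest {
    public static void main(String[] args) {
        Navigator nav = new Navigator("http://plus.google.com/");
        nav.goToPhotosPage();
        // State the expected output literally -- no concatenation to hide bugs.
        if (!"http://plus.google.com/u/0/photos".equals(nav.getCurrentUrl())) {
            throw new AssertionError("got " + nav.getCurrentUrl());
        }
        System.out.println("ok");
    }
}
```

With the expected value written out in full, a double slash (or any other URL-joining mistake) is visible at a glance.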
And this is a very simple example—<span style="color: #990000;"><b>when a test adds more operators or includes loops and conditionals, it becomes increasingly difficult to be confident that it is correct. </b></span><br /><br />Another way of saying this is that, <b><span style="color: #990000;">whereas production code describes a general strategy for computing outputs given inputs, tests are concrete examples of input/output pairs</span></b> (where output might include side effects like verifying interactions with other classes). It's usually easy to tell whether an input/output pair is correct or not, even if the logic required to compute it is very complex. For instance, it's hard to picture the exact DOM that would be created by a JavaScript function for a given server response. So the ideal test for such a function would just compare against a string containing the expected output HTML. <br /><br /><b><span style="color: #990000;">When tests do need their own logic, such logic should often be moved out of the test bodies and into utilities and helper functions</span></b>. Since such helpers can get quite complex, it's usually a good idea for any nontrivial test utility to have its own tests.<br /><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/testing-on-the-toilet-dont-put-logic-in-tests/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
<enclosure url="" length="" type="" />
		</item>
		<item>
		<title>The Deadline to Sign up for GTAC 2014 is Jul 28</title>
		<link>https://googledata.org/google-testing/the-deadline-to-sign-up-for-gtac-2014-is-jul-28/</link>
		<comments>https://googledata.org/google-testing/the-deadline-to-sign-up-for-gtac-2014-is-jul-28/#comments</comments>
		<pubDate>Wed, 23 Jul 2014 00:53:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=54605ca45a93ef96fb4fe92e634cc9be</guid>
		<description><![CDATA[<i>Posted by Anthony Vallone on behalf of the GTAC Committee </i><br /><br />The deadline to sign up for GTAC 2014 is next Monday, July 28th, 2014. There has been a great deal of interest in both attending and speaking, and we&#8217;ve received many outstanding proposals. However, it&#8217;s not too late to add yours for consideration. If you would like to speak or attend, be sure to <a href="https://docs.google.com/a/google.com/forms/d/1HVm6KcFBdQAbhX_uh6LEjVYI1ZCCr-L9t7ocbtvLIMU/viewform">complete the form</a> by Monday.  <br /><br />We will be making regular updates to our site over the next several weeks, and you can find conference details there:<br />&#160; <a href="http://developers.google.com/gtac">developers.google.com/gtac</a><br /><br />For those who have already signed up to attend or speak, we will contact you directly in mid-August. <br /><br />]]></description>
				<content:encoded><![CDATA[<i>Posted by Anthony Vallone on behalf of the GTAC Committee </i><br /><br />The deadline to sign up for GTAC 2014 is next Monday, July 28th, 2014. There has been a great deal of interest in both attending and speaking, and we’ve received many outstanding proposals. However, it’s not too late to add yours for consideration. If you would like to speak or attend, be sure to <a href="https://docs.google.com/a/google.com/forms/d/1HVm6KcFBdQAbhX_uh6LEjVYI1ZCCr-L9t7ocbtvLIMU/viewform">complete the form</a> by Monday.  <br /><br />We will be making regular updates to our site over the next several weeks, and you can find conference details there:<br />&nbsp; <a href="http://developers.google.com/gtac">developers.google.com/gtac</a><br /><br />For those who have already signed up to attend or speak, we will contact you directly in mid-August. <br /><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/the-deadline-to-sign-up-for-gtac-2014-is-jul-28/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
<enclosure url="" length="" type="" />
		</item>
		<item>
		<title>Measuring Coverage at Google</title>
		<link>https://googledata.org/google-testing/measuring-coverage-at-google/</link>
		<comments>https://googledata.org/google-testing/measuring-coverage-at-google/#comments</comments>
		<pubDate>Mon, 14 Jul 2014 19:42:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=41e09e054a805e409acdc753a8dafe10</guid>
		<description><![CDATA[<i>By Marko Ivankovi&#263;, Google Z&#252;rich</i><br /><br /><h3>Introduction</h3><br /><a href="http://en.wikipedia.org/wiki/Code_coverage">Code coverage</a> is a very interesting metric, covered by a <a href="http://scholar.google.com/scholar?q=%22code+coverage%22">large body of research</a> that reaches somewhat contradictory results. Some people think it is an extremely useful metric and that a certain percentage of coverage should be enforced on all code. Some think it is a useful tool to identify areas that need more testing but don&#8217;t necessarily trust that covered code is truly well tested. Still others think that measuring coverage is actively harmful because it provides a false sense of security. <br /><br />Our team&#8217;s mission was to collect coverage-related data, then develop and champion code coverage practices across Google. We designed an opt-in system where engineers could enable two different types of coverage measurements for their projects: daily and per-commit. With daily coverage, we run all tests for their project, whereas with per-commit coverage we run only the tests affected by the commit. The two measurements are independent and many projects opted into both. <br /><br />While we did experiment with branch, function and statement coverage, we ended up focusing mostly on statement coverage because of its relative simplicity and ease of visualization.<br /><br /><h3>How we measured </h3><br />Our job was made significantly easier by the wonderful <a href="http://google-engtools.blogspot.ch/2011/08/build-in-cloud-how-build-system-works.html">Google build system</a> whose parallelism and flexibility allowed us to simply scale our measurements to Google scale. 
The build system had integrated various language-specific open source coverage measurement tools like <a href="https://gcc.gnu.org/onlinedocs/gcc/Gcov.html">Gcov</a> (C++), <a href="http://emma.sourceforge.net/">Emma</a> / <a href="http://www.eclemma.org/jacoco/">JaCoCo</a> (Java) and <a href="https://pypi.python.org/pypi/coverage/3.7.1">Coverage.py</a> (Python), and we provided a central system where teams could sign up for coverage measurement.<br /><br />For daily whole project coverage measurements, each team was provided with a simple cronjob that would run all tests across the project&#8217;s codebase. The results of these runs were available to the teams in a centralized dashboard that displays charts showing coverage over time and allows daily / weekly / quarterly / yearly aggregations and per-language slicing. On this dashboard teams can also compare their project (or projects) with any other project, or Google as a whole. <br /><br />For per-commit measurement, we hook into the Google code review process (<a href="http://googletesting.blogspot.ch/2014/01/the-google-test-and-development_21.html">briefly explained in this article</a>) and display the data visually to both the commit author and the reviewers. We display the data on two levels: color coded lines right next to the color coded diff and a total aggregate number for the entire commit.  <br /><br /><div><a href="http://1.bp.blogspot.com/-cWVgcOaZRyA/U8QuI4RynII/AAAAAAAAAGc/nAbVkwYjTBg/s1600/image01.png"><img border="0" src="http://1.bp.blogspot.com/-cWVgcOaZRyA/U8QuI4RynII/AAAAAAAAAGc/nAbVkwYjTBg/s1600/image01.png" height="145" width="400"></a></div><br />Displayed above is a screenshot of the code review tool. The green line coloring is the standard diff coloring for added lines. The orange and lighter green coloring on the line <b>numbers</b> is the coverage information. We use light green for covered lines, orange for non-covered lines and white for non-instrumented lines. 
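The aggregate number for a commit reduces to covered lines over instrumented lines, with non-instrumented lines (comments, blank lines) excluded. A minimal sketch of that arithmetic with hypothetical line statuses, not the production pipeline:

```java
import java.util.List;

public class CommitCoverage {
    enum LineStatus { COVERED, NOT_COVERED, NOT_INSTRUMENTED }

    // Statement coverage: covered / (covered + not covered).
    // Non-instrumented lines are excluded from the denominator.
    static double coveragePercent(List<LineStatus> lines) {
        long covered = lines.stream().filter(s -> s == LineStatus.COVERED).count();
        long notCovered = lines.stream().filter(s -> s == LineStatus.NOT_COVERED).count();
        long instrumented = covered + notCovered;
        return instrumented == 0 ? 0.0 : 100.0 * covered / instrumented;
    }

    public static void main(String[] args) {
        // Hypothetical commit: 3 covered, 1 not covered, 1 not instrumented.
        List<LineStatus> commit = List.of(
            LineStatus.COVERED, LineStatus.COVERED, LineStatus.NOT_COVERED,
            LineStatus.NOT_INSTRUMENTED, LineStatus.COVERED);
        System.out.println(coveragePercent(commit)); // 75.0
    }
}
```

Excluding non-instrumented lines matters: counting them as uncovered would make otherwise identical code look worse simply because it has more comments.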
<br /><br />It&#8217;s important to note that we surface the coverage information <b>before</b> the commit is submitted to the codebase, because this is the time when engineers are most likely to be interested in improving it.<br /><br /><h3>Results</h3><br />One of the main benefits of working at Google is the scale at which we operate. We have been running the coverage measurement system for some time now and we have collected data for more than 650 different projects, spanning 100,000+ commits. We have a significant amount of data for C++, Java, Python, Go and JavaScript code. <br /><br />I am happy to say that we can share some preliminary results with you today: <br /><br /><div><a href="http://4.bp.blogspot.com/-FBwOXyefWMw/U8QuPBBZa2I/AAAAAAAAAGk/lvzZNmyB_bI/s1600/image00.png"><img border="0" src="http://4.bp.blogspot.com/-FBwOXyefWMw/U8QuPBBZa2I/AAAAAAAAAGk/lvzZNmyB_bI/s1600/image00.png" height="262" width="640"></a></div><br />The chart above is the histogram of average values of measured absolute coverage across Google. The median (50th percentile) code coverage is 78%, the 75th percentile 85% and 90th percentile 90%. We believe that these numbers represent a very healthy codebase. <br /><br />We have also found it very interesting that there are significant differences between languages:  <br /><br /><table><tbody><tr><td>C++</td><td>Java</td><td>Go</td><td>JavaScript</td><td>Python</td></tr><tr><td>56.6%</td><td>61.2%</td><td>63.0%</td><td>76.9%</td><td>84.2%</td></tr></tbody></table><br /><br />The table above shows the total coverage of all analyzed code for each language, averaged over the past quarter. We believe that the large difference is due to structural, paradigm and best practice differences between languages and the more precise ability to measure coverage in certain languages. <br /><br />Note that these numbers should not be interpreted as guidelines for a particular language; the aggregation method used is too simple for that. 
Instead, this finding is simply a data point for any future research that analyzes samples from a single programming language. <br /><br />The feedback from our fellow engineers was overwhelmingly positive. The most loved feature was surfacing the coverage information during code review time. This early surfacing of coverage had a statistically significant impact: our initial analysis suggests that it increased coverage by 10% (averaged across all commits).<br /><br /><h3>Future work</h3><br />We are aware that there are a few problems with the dataset we collected. In particular, the individual tools we use to measure coverage are not perfect. Large integration tests, end-to-end tests and UI tests are difficult to instrument, so large parts of code exercised by such tests can be misreported as non-covered. <br /><br />We are working on improving the tools, but also analyzing the impact of unit tests, integration tests and other types of tests individually.  <br /><br />In addition to languages, we will also investigate other factors that might influence coverage, such as platforms and frameworks, to allow all future research to account for their effect. <br /><br />We will be publishing more of our findings in the future, so stay tuned. <br /><br />And if this sounds like something you would like to work on, why not apply on our <a href="http://goo.gl/tFRejX">job site</a>? <br /><br />]]></description>
				<content:encoded><![CDATA[<i>By Marko Ivanković, Google Zürich</i><br /><br><h3>Introduction</h3><br><a href="http://en.wikipedia.org/wiki/Code_coverage">Code coverage</a> is a very interesting metric, covered by a <a href="http://scholar.google.com/scholar?q=%22code+coverage%22">large body of research</a> that reaches somewhat contradictory results. Some people think it is an extremely useful metric and that a certain percentage of coverage should be enforced on all code. Some think it is a useful tool to identify areas that need more testing but don’t necessarily trust that covered code is truly well tested. Still others think that measuring coverage is actively harmful because it provides a false sense of security. <br /><br />Our team’s mission was to collect coverage-related data, then develop and champion code coverage practices across Google. We designed an opt-in system where engineers could enable two different types of coverage measurements for their projects: daily and per-commit. With daily coverage, we run all tests for their project, whereas with per-commit coverage we run only the tests affected by the commit. The two measurements are independent and many projects opted into both. <br /><br />While we did experiment with branch, function and statement coverage, we ended up focusing mostly on statement coverage because of its relative simplicity and ease of visualization.<br /><br><h3>How we measured </h3><br />Our job was made significantly easier by the wonderful <a href="http://google-engtools.blogspot.ch/2011/08/build-in-cloud-how-build-system-works.html">Google build system</a> whose parallelism and flexibility allowed us to simply scale our measurements to Google scale. 
The build system had integrated various language-specific open source coverage measurement tools like <a href="https://gcc.gnu.org/onlinedocs/gcc/Gcov.html">Gcov</a> (C++), <a href="http://emma.sourceforge.net/">Emma</a> / <a href="http://www.eclemma.org/jacoco/">JaCoCo</a> (Java) and <a href="https://pypi.python.org/pypi/coverage/3.7.1">Coverage.py</a> (Python), and we provided a central system where teams could sign up for coverage measurement.<br /><br />For daily whole project coverage measurements, each team was provided with a simple cronjob that would run all tests across the project’s codebase. The results of these runs were available to the teams in a centralized dashboard that displays charts showing coverage over time and allows daily / weekly / quarterly / yearly aggregations and per-language slicing. On this dashboard teams can also compare their project (or projects) with any other project, or Google as a whole. <br /><br />For per-commit measurement, we hook into the Google code review process (<a href="http://googletesting.blogspot.ch/2014/01/the-google-test-and-development_21.html">briefly explained in this article</a>) and display the data visually to both the commit author and the reviewers. We display the data on two levels: color coded lines right next to the color coded diff and a total aggregate number for the entire commit.  <br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-cWVgcOaZRyA/U8QuI4RynII/AAAAAAAAAGc/nAbVkwYjTBg/s1600/image01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-cWVgcOaZRyA/U8QuI4RynII/AAAAAAAAAGc/nAbVkwYjTBg/s1600/image01.png" height="145" width="400" /></a></div><br />Displayed above is a screenshot of the code review tool. The green line coloring is the standard diff coloring for added lines. The orange and lighter green coloring on the line <b>numbers</b> is the coverage information. 
We use light green for covered lines, orange for non-covered lines and white for non-instrumented lines. <br /><br />It’s important to note that we surface the coverage information <b>before</b> the commit is submitted to the codebase, because this is the time when engineers are most likely to be interested in improving it.<br /><br><h3>Results</h3><br />One of the main benefits of working at Google is the scale at which we operate. We have been running the coverage measurement system for some time now and we have collected data for more than 650 different projects, spanning 100,000+ commits. We have a significant amount of data for C++, Java, Python, Go and JavaScript code. <br /><br />I am happy to say that we can share some preliminary results with you today: <br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-FBwOXyefWMw/U8QuPBBZa2I/AAAAAAAAAGk/lvzZNmyB_bI/s1600/image00.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-FBwOXyefWMw/U8QuPBBZa2I/AAAAAAAAAGk/lvzZNmyB_bI/s1600/image00.png" height="262" width="640" /></a></div><br />The chart above is the histogram of average values of measured absolute coverage across Google. The median (50th percentile) code coverage is 78%, the 75th percentile 85% and 90th percentile 90%. We believe that these numbers represent a very healthy codebase. 
<br /><br />We have also found it very interesting that there are significant differences between languages:  <br /><br /><table style="border-collapse: collapse; border: 1px solid black; width: 100%;"><tbody><tr><td style="border: 1px solid black; padding: 5px;">C++</td><td style="border: 1px solid black; padding: 5px;">Java</td><td style="border: 1px solid black; padding: 5px;">Go</td><td style="border: 1px solid black; padding: 5px;">JavaScript</td><td style="border: 1px solid black; padding: 5px;">Python</td></tr><tr><td style="border: 1px solid black; padding: 5px;">56.6%</td><td style="border: 1px solid black; padding: 5px;">61.2%</td><td style="border: 1px solid black; padding: 5px;">63.0%</td><td style="border: 1px solid black; padding: 5px;">76.9%</td><td style="border: 1px solid black; padding: 5px;">84.2%</td></tr></tbody></table><br /><br />The table above shows the total coverage of all analyzed code for each language, averaged over the past quarter. We believe that the large difference is due to structural, paradigm and best practice differences between languages and the more precise ability to measure coverage in certain languages. <br /><br />Note that these numbers should not be interpreted as guidelines for a particular language; the aggregation method used is too simple for that. Instead, this finding is simply a data point for any future research that analyzes samples from a single programming language. <br /><br />The feedback from our fellow engineers was overwhelmingly positive. The most loved feature was surfacing the coverage information during code review time. This early surfacing of coverage had a statistically significant impact: our initial analysis suggests that it increased coverage by 10% (averaged across all commits).<br /><br><h3>Future work</h3><br />We are aware that there are a few problems with the dataset we collected. In particular, the individual tools we use to measure coverage are not perfect. 
Large integration tests, end-to-end tests and UI tests are difficult to instrument, so large parts of code exercised by such tests can be misreported as non-covered. <br /><br />We are working on improving the tools, but also analyzing the impact of unit tests, integration tests and other types of tests individually.  <br /><br />In addition to languages, we will also investigate other factors that might influence coverage, such as platforms and frameworks, to allow all future research to account for their effect. <br /><br />We will be publishing more of our findings in the future, so stay tuned. <br /><br />And if this sounds like something you would like to work on, why not apply on our <a href="http://goo.gl/tFRejX">job site</a>? <br /><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/measuring-coverage-at-google/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
<enclosure url="" length="" type="" />
		</item>
		<item>
		<title>ThreadSanitizer: Slaughtering Data Races</title>
		<link>https://googledata.org/google-testing/threadsanitizer-slaughtering-data-races/</link>
		<comments>https://googledata.org/google-testing/threadsanitizer-slaughtering-data-races/#comments</comments>
		<pubDate>Mon, 30 Jun 2014 21:30:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=dff169cd4b36bc049672bd674b44b0e0</guid>
		<description><![CDATA[by Dmitry Vyukov, Synchronization Lookout, Google, MoscowHello, I work in the Dynamic Testing Tools team at Google. Our team develops tools like AddressSanitizer, MemorySanitizer and ThreadSanitizer which find various kinds of bugs. In this blog post I...]]></description>
				<content:encoded><![CDATA[<i>by Dmitry Vyukov, Synchronization Lookout, Google, Moscow</i><br /><br />Hello, <br /><br />I work in the Dynamic Testing Tools team at Google. Our team develops tools like <a href="http://address-sanitizer.googlecode.com/">AddressSanitizer</a>, <a href="http://memory-sanitizer.googlecode.com/">MemorySanitizer</a> and <a href="http://thread-sanitizer.googlecode.com/">ThreadSanitizer</a> which find various kinds of bugs. In this blog post I want to tell you about ThreadSanitizer, a fast data race detector for C++ and Go programs. <br /><br />First of all, what is a <a href="http://en.wikipedia.org/wiki/Race_condition">data race</a>? A data race occurs when two threads access the same variable concurrently, and at least one of the accesses is a write. Most programming languages provide very weak guarantees, or no guarantees at all, for programs with data races. For example, in C++ absolutely any data race renders the behavior of the whole program completely undefined (yes, it can suddenly format the hard drive). Data races are common in concurrent programs, and they are notoriously hard to debug and localize. A typical manifestation of a data race is when a program occasionally crashes with obscure symptoms; the symptoms are different each time and do not point to any particular place in the source code. Such bugs can take several months of debugging without particular success, since typical debugging techniques do not work. Fortunately, ThreadSanitizer can catch most data races in the blink of an eye. See Chromium <a href="https://code.google.com/p/chromium/issues/detail?id=15577">issue 15577</a> for an example of such a data race and <a href="https://code.google.com/p/chromium/issues/detail?id=18488">issue 18488</a> for the resolution. <br /><br />Due to the complex nature of bugs caught by ThreadSanitizer, we don't suggest waiting until product release validation to use the tool. 
For example, at Google, we've made our tools easily accessible to programmers during development, so that anyone can use the tool for testing if they suspect that new code might introduce a race. For both Chromium and Google internal server codebase, we run unit tests that use the tool continuously. This catches many regressions instantly. The Chromium project has recently started using ThreadSanitizer on <a href="http://blog.chromium.org/2012/04/fuzzing-for-security.html">ClusterFuzz</a>, a large scale fuzzing system. Finally, some teams also set up periodic end-to-end testing with ThreadSanitizer under a realistic workload, which proves to be extremely valuable. When races are found by the tool, our team has zero tolerance for races and does not consider any race to be benign, as even the most benign races can lead to <a href="https://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong">memory corruption</a>. <br /><br />Our tools are <a href="http://en.wikipedia.org/wiki/Dynamic_program_analysis">dynamic</a> (as opposed to <a href="http://en.wikipedia.org/wiki/Static_analysis_tool">static tools</a>). This means that they do not merely "look" at the code and try to surmise where bugs can be; instead they <a href="http://en.wikipedia.org/wiki/Instrumentation_(computer_programming)">instrument</a> the binary at build time and then analyze dynamic behavior of the program to catch it red-handed. This approach has its pros and cons. On one hand, the tool does not have any false positives, thus it does not bother a developer with something that is not a bug. On the other hand, in order to catch a bug, the test must expose a bug -- the racing data access attempts must be executed in different threads. This requires writing good multi-threaded tests and makes end-to-end testing especially effective. 
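The racy pattern a detector must catch looks the same in most languages. A minimal Java illustration of an unsynchronized read-modify-write and its atomic fix (ThreadSanitizer itself targets C++ and Go; this sketch only shows the shape of the bug):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class RaceDemo {
    static int racyCounter = 0;                                  // plain int: races
    static final AtomicInteger safeCounter = new AtomicInteger(); // atomic: no race

    public static void main(String[] args) throws InterruptedException {
        Runnable work = () -> {
            for (int i = 0; i < 100_000; i++) {
                racyCounter++;                 // unsynchronized read-modify-write
                safeCounter.incrementAndGet(); // atomic read-modify-write
            }
        };
        Thread t1 = new Thread(work);
        Thread t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join(); t2.join();
        // safeCounter is always 200000; racyCounter may be less (lost updates),
        // and the loss varies from run to run -- exactly the flaky symptom above.
        System.out.println(safeCounter.get()); // 200000
    }
}
```

Note that the race only manifests when both threads actually run the increment concurrently, which is why the article stresses that the test must execute the racing accesses in different threads.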
<br /><br />As a bonus, ThreadSanitizer finds some other types of bugs: thread leaks, deadlocks, incorrect uses of mutexes, malloc calls in signal handlers, and <a href="https://code.google.com/p/thread-sanitizer/wiki/DetectableBugs">more</a>. It also natively understands atomic operations and thus can find bugs in <a href="http://en.wikipedia.org/wiki/Lock-free">lock-free</a> algorithms (see e.g. <a href="https://code.google.com/p/chromium/issues/detail?id=330528">this bug</a> in the V8 concurrent garbage collector). <br /><br />The tool is supported by both Clang and GCC compilers (only on Linux/Intel64). Using it is very simple: you just need to add a <b><span style="color: #38761d; font-family: Courier New, Courier, monospace;">-fsanitize=thread</span></b> flag during compilation and linking. For Go programs, you simply need to add a <b><span style="color: #38761d; font-family: Courier New, Courier, monospace;">-race flag</span></b> to the go tool (supported on Linux, Mac and Windows). <br /><br />Interestingly, after integrating the tool into compilers, we've found some bugs in the compilers themselves. For example, LLVM was <a href="http://llvm.org/bugs/show_bug.cgi?id=13691">illegally widening stores</a>, which can introduce very harmful data races into otherwise correct programs. And GCC was injecting <a href="https://gcc.gnu.org/bugzilla/show_bug.cgi?id=48076">unsafe code</a> for initialization of function static variables. Among our other trophies are more than <a href="https://code.google.com/p/thread-sanitizer/wiki/FoundBugs">a thousand bugs</a> in Chromium, Firefox, the Go standard library, WebRTC, OpenSSL, and of course in our internal projects. <br /><br />So what are you waiting for? You know what to do! ]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/threadsanitizer-slaughtering-data-races/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>GTAC 2014: Call for Proposals &amp; Attendance</title>
		<link>https://googledata.org/google-testing/gtac-2014-call-for-proposals-attendance/</link>
		<comments>https://googledata.org/google-testing/gtac-2014-call-for-proposals-attendance/#comments</comments>
		<pubDate>Mon, 16 Jun 2014 21:57:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=21adc4a3d5d1734d009da7814ceaa55c</guid>
		<description><![CDATA[<i>Posted by Anthony Vallone on behalf of the GTAC Committee</i><br /><br />The application process is now open for presentation proposals and attendance for GTAC (Google Test Automation Conference) (<a href="http://googletesting.blogspot.com/2014/06/gtac-2014-coming-to-seattlekirkland-in.html">see initial announcement</a>) to be held at the <a href="http://www.google.com/about/careers/locations/seattle-kirkland/">Google Kirkland office (near Seattle, WA)</a> on October 28-29, 2014.<br /><br />GTAC will be streamed live on YouTube again this year, so even if you can&#8217;t attend, you&#8217;ll be able to watch the conference from your computer.<br /><br /><b>Speakers</b><br />Presentations are targeted at students, academics, and experienced engineers working on test automation. Full presentations and lightning talks are 45 minutes and 15 minutes, respectively. Speakers should be prepared for a question and answer session following their presentation.<br /><br /><b>Application</b><br />For presentation proposals and/or attendance, <a href="https://docs.google.com/forms/d/1HVm6KcFBdQAbhX_uh6LEjVYI1ZCCr-L9t7ocbtvLIMU/viewform?usp=send_form">complete this form</a>. We will be selecting about 300 applicants for the event.<br /><br /><b>Deadline</b><br />The due date for both presentation and attendance applications is July 28, 2014.<br /><br /><b>Fees</b><br />There are no registration fees, and we will send out detailed registration instructions to each invited applicant. Meals will be provided, but speakers and attendees must arrange and pay for their own travel and accommodations. 
<br /><br /><b><span>Update</span>: </b>Our <a href="https://developers.google.com/google-test-automation-conference/2014/contact">contact</a> email was bouncing; this is now fixed.<br /><br /><br /><div><a href="http://2.bp.blogspot.com/-ho3a6DukQw4/U59Eb4GsfdI/AAAAAAAAAGM/8jPcyc3_SHQ/s1600/GTAC+2014+logo.png"><img border="0" src="http://2.bp.blogspot.com/-ho3a6DukQw4/U59Eb4GsfdI/AAAAAAAAAGM/8jPcyc3_SHQ/s1600/GTAC+2014+logo.png" height="200" width="200"></a></div><br />]]></description>
				<content:encoded><![CDATA[<i>Posted by Anthony Vallone on behalf of the GTAC Committee</i><br /><br />The application process is now open for presentation proposals and attendance for GTAC (Google Test Automation Conference) (<a href="http://googletesting.blogspot.com/2014/06/gtac-2014-coming-to-seattlekirkland-in.html">see initial announcement</a>) to be held at the <a href="http://www.google.com/about/careers/locations/seattle-kirkland/">Google Kirkland office (near Seattle, WA)</a> on October 28-29, 2014.<br /><br />GTAC will be streamed live on YouTube again this year, so even if you can’t attend, you’ll be able to watch the conference from your computer.<br /><br /><b>Speakers</b><br />Presentations are targeted at students, academics, and experienced engineers working on test automation. Full presentations and lightning talks are 45 minutes and 15 minutes, respectively. Speakers should be prepared for a question and answer session following their presentation.<br /><br /><b>Application</b><br />For presentation proposals and/or attendance, <a href="https://docs.google.com/forms/d/1HVm6KcFBdQAbhX_uh6LEjVYI1ZCCr-L9t7ocbtvLIMU/viewform?usp=send_form">complete this form</a>. We will be selecting about 300 applicants for the event.<br /><br /><b>Deadline</b><br />The due date for both presentation and attendance applications is July 28, 2014.<br /><br /><b>Fees</b><br />There are no registration fees, and we will send out detailed registration instructions to each invited applicant. Meals will be provided, but speakers and attendees must arrange and pay for their own travel and accommodations. 
<br /><br /><b><span style="color: red;">Update</span>: </b>Our <a href="https://developers.google.com/google-test-automation-conference/2014/contact">contact</a> email was bouncing; this is now fixed.<br /><br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-ho3a6DukQw4/U59Eb4GsfdI/AAAAAAAAAGM/8jPcyc3_SHQ/s1600/GTAC+2014+logo.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-ho3a6DukQw4/U59Eb4GsfdI/AAAAAAAAAGM/8jPcyc3_SHQ/s1600/GTAC+2014+logo.png" height="200" width="200" /></a></div><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/gtac-2014-call-for-proposals-attendance/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>GTAC 2014 Coming to Seattle/Kirkland in October</title>
		<link>https://googledata.org/google-testing/gtac-2014-coming-to-seattlekirkland-in-october/</link>
		<comments>https://googledata.org/google-testing/gtac-2014-coming-to-seattlekirkland-in-october/#comments</comments>
		<pubDate>Wed, 04 Jun 2014 19:21:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=41f5406d799a055857d2c076676ee25b</guid>
		<description><![CDATA[<i>Posted by Anthony Vallone on behalf of the GTAC Committee</i><br /><br />If you're looking for a place to discuss the latest innovations in test automation, then charge your tablets and pack your <a href="http://en.wikipedia.org/wiki/Wellington_boot">gumboots</a> - the eighth <a href="https://developers.google.com/gtac/">GTAC</a> (Google Test Automation Conference) will be held on October 28-29, 2014 at Google Kirkland! The Kirkland office is part of the <a href="http://www.google.com/about/careers/locations/seattle-kirkland/">Seattle/Kirkland campus</a> in beautiful Washington state. This campus is our third-largest engineering office in the USA.<br /><br /><div><a href="http://3.bp.blogspot.com/-6pFux0LhOmA/U49xL6rnkrI/AAAAAAAAAF8/E8uecMVaSio/s1600/GTAC+2014+logo.png"><img border="0" src="http://3.bp.blogspot.com/-6pFux0LhOmA/U49xL6rnkrI/AAAAAAAAAF8/E8uecMVaSio/s1600/GTAC+2014+logo.png" height="200" width="200"></a></div><br /><br />GTAC is a periodic conference hosted by Google, bringing together engineers from industry and academia to discuss advances in test automation and the field of test engineering. It&#8217;s a great opportunity to present, learn, and challenge modern testing technologies and strategies. <br /><br />You can browse the presentation abstracts, slides, and videos from last year on the <a href="https://developers.google.com/google-test-automation-conference/2013/">GTAC 2013 page</a>. <br /><br />Stay tuned to this blog and the GTAC website for application information and opportunities to present at GTAC. Subscribing to this blog is the best way to get notified. We're looking forward to seeing you there! <br /><br />]]></description>
				<content:encoded><![CDATA[<i>Posted by Anthony Vallone on behalf of the GTAC Committee</i><br /><br />If you're looking for a place to discuss the latest innovations in test automation, then charge your tablets and pack your <a href="http://en.wikipedia.org/wiki/Wellington_boot">gumboots</a> - the eighth <a href="https://developers.google.com/gtac/">GTAC</a> (Google Test Automation Conference) will be held on October 28-29, 2014 at Google Kirkland! The Kirkland office is part of the <a href="http://www.google.com/about/careers/locations/seattle-kirkland/">Seattle/Kirkland campus</a> in beautiful Washington state. This campus is our third-largest engineering office in the USA.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-6pFux0LhOmA/U49xL6rnkrI/AAAAAAAAAF8/E8uecMVaSio/s1600/GTAC+2014+logo.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-6pFux0LhOmA/U49xL6rnkrI/AAAAAAAAAF8/E8uecMVaSio/s1600/GTAC+2014+logo.png" height="200" width="200" /></a></div><br /><br />GTAC is a periodic conference hosted by Google, bringing together engineers from industry and academia to discuss advances in test automation and the field of test engineering. It’s a great opportunity to present, learn, and challenge modern testing technologies and strategies. <br /><br />You can browse the presentation abstracts, slides, and videos from last year on the <a href="https://developers.google.com/google-test-automation-conference/2013/">GTAC 2013 page</a>. <br /><br />Stay tuned to this blog and the GTAC website for application information and opportunities to present at GTAC. Subscribing to this blog is the best way to get notified. We're looking forward to seeing you there! <br /><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/gtac-2014-coming-to-seattlekirkland-in-october/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Testing on the Toilet:  Risk-Driven Testing</title>
		<link>https://googledata.org/google-testing/testing-on-the-toilet-risk-driven-testing/</link>
		<comments>https://googledata.org/google-testing/testing-on-the-toilet-risk-driven-testing/#comments</comments>
		<pubDate>Fri, 30 May 2014 23:10:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=5d168bea53db4e0194ac2a4295281b71</guid>
		<description><![CDATA[<i>by Peter Arrenbrecht</i><br /><br /><i>This article was adapted from a <a href="http://googletesting.blogspot.com/2007/01/introducing-testing-on-toilet.html">Google Testing on the Toilet</a> (TotT) episode. You can download a <a href="https://docs.google.com/document/d/1QfmhsvpiCx__rcU0sr03Lu1fM1yhE8P9atyTgeLbRAQ/edit?usp=sharing">printer-friendly version</a> of this TotT episode and post it in your office. </i><br /><br /><b><span>We are all conditioned to write tests</span></b> as we code: unit, functional, UI&#8212;the whole shebang. We are professionals, after all. Many of us like how small tests let us work quickly, and how larger tests inspire safety and closure. Or we may just anticipate flak during review. We are so used to these tests that often <b><span>we no longer question why we write them</span></b>. This can be wasteful and dangerous. <br /><br /><b><span>Tests are a means to an end:</span></b> To <span><b>reduce the key risks</b></span> of a project and to <b><span>get the biggest bang for the buck</span></b>. This bang may not always come from the tests that standard practice has you write; sometimes it does not come from tests at all. <br /><br />Two examples: <br /><br /><div><i>&#8220;We built a new debugging aid. We wrote unit, integration, and UI tests. We were ready to launch.&#8221;</i><br /><br />Outstanding practice. <b>Missing the mark.</b><br /><br />Our key risks were that we'd corrupt our data or bring down our servers for the sake of a debugging aid. None of the tests addressed this, but they gave a false sense of safety and &#8220;being done&#8221;.<br /><b>We stopped the launch.</b></div><br /><br /><div><i>&#8220;We wanted to turn down a feature, so we needed to alert affected users. Again we had unit and integration tests, and even one expensive end-to-end test.&#8221;</i><br /><br />Standard practice. <b>Wasted effort.</b><br /><br />The alert was so critical it actually needed end-to-end coverage for all scenarios. 
But it would be live for only three releases. The cheapest effective test? Manual testing before each release.</div><br /><br /><b>A Better Approach: Risks First </b><br /><br />For every project or feature, <span><b>think about testing</b></span>. Brainstorm your key risks and your best options to reduce them. <span><b>Do this at the start</b></span> so you don't waste effort and can adapt your design. <b><span>Write them down</span></b> as a QA design so you can point to it in reviews and discussions. <br /><br />To be sure, <span><b>standard practice remains a good idea in most cases</b></span> (hence it&#8217;s standard). Small tests are cheap and speed up coding and maintenance, and larger tests safeguard core use-cases and integration. <br /><br /><span><b>Just remember</b></span>: Your tests are a means. <span><b>The bang is what counts</b></span>. It&#8217;s your job to <b><span>maximize it</span></b>. <br /><br />]]></description>
				<content:encoded><![CDATA[<i>by Peter Arrenbrecht</i><br /><br /><i>This article was adapted from a <a href="http://googletesting.blogspot.com/2007/01/introducing-testing-on-toilet.html">Google Testing on the Toilet</a> (TotT) episode. You can download a <a href="https://docs.google.com/document/d/1QfmhsvpiCx__rcU0sr03Lu1fM1yhE8P9atyTgeLbRAQ/edit?usp=sharing">printer-friendly version</a> of this TotT episode and post it in your office. </i><br /><br /><b><span style="color: #990000;">We are all conditioned to write tests</span></b> as we code: unit, functional, UI—the whole shebang. We are professionals, after all. Many of us like how small tests let us work quickly, and how larger tests inspire safety and closure. Or we may just anticipate flak during review. We are so used to these tests that often <b><span style="color: #990000;">we no longer question why we write them</span></b>. This can be wasteful and dangerous. <br /><br /><b><span style="color: #990000;">Tests are a means to an end:</span></b> To <span style="color: #990000;"><b>reduce the key risks</b></span> of a project and to <b><span style="color: #990000;">get the biggest bang for the buck</span></b>. This bang may not always come from the tests that standard practice has you write; sometimes it does not come from tests at all. <br /><br />Two examples: <br /><br /><div style="background: #fef2cc; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;"><i>“We built a new debugging aid. We wrote unit, integration, and UI tests. We were ready to launch.”</i><br /><br />Outstanding practice. <b>Missing the mark.</b><br /><br />Our key risks were that we'd corrupt our data or bring down our servers for the sake of a debugging aid. 
None of the tests addressed this, but they gave a false sense of safety and “being done”.<br /><b>We stopped the launch.</b></div><br /><br /><div style="background: #fef2cc; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;"><i>“We wanted to turn down a feature, so we needed to alert affected users. Again we had unit and  integration tests, and even one expensive end-to-end test.”</i><br /><br />Standard practice. <b>Wasted effort.</b><br /><br />The alert was so critical it actually needed end-to-end coverage for all scenarios. But it would be live for only three releases. The cheapest effective test? Manual testing before each release.</div><br /><br /><b>A Better Approach: Risks First </b><br /><br />For every project or feature, <span style="color: #990000;"><b>think about testing</b></span>. Brainstorm your key risks and your best options to reduce them. <span style="color: #990000;"><b>Do this at the start</b></span> so you don't waste effort and can adapt your design. <b><span style="color: #990000;">Write them down</span></b> as a QA design so you can point to it in reviews and discussions. <br /><br />To be sure, <span style="color: #990000;"><b>standard practice remains a good idea in most cases</b></span> (hence it’s standard). Small tests are cheap and speed up coding and maintenance, and larger tests safeguard core use-cases and integration. <br /><br /><span style="color: #990000;"><b>Just remember</b></span>: Your tests are a means. <span style="color: #990000;"><b>The bang is what counts</b></span>. It’s your job to <b><span style="color: #990000;">maximize it</span></b>. <br /><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/testing-on-the-toilet-risk-driven-testing/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Testing on the Toilet: Effective Testing</title>
		<link>https://googledata.org/google-testing/testing-on-the-toilet-effective-testing/</link>
		<comments>https://googledata.org/google-testing/testing-on-the-toilet-effective-testing/#comments</comments>
		<pubDate>Wed, 07 May 2014 17:10:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=9a052769c5e314fd0b62e7e35de40c01</guid>
		<description><![CDATA[<i>by Rich Martin, Zurich </i><br /><i><br /></i><i>This article was adapted from a <a href="http://googletesting.blogspot.com/2007/01/introducing-testing-on-toilet.html">Google Testing on the Toilet</a> (TotT) episode. You can download a <a href="https://docs.google.com/document/d/1pb8AvYvshNRAP4x2skdd4fc0U2reCBxsMlccwEdFlFs/edit?usp=sharing">printer-friendly version</a> of this TotT episode and post it in your office. </i><br /><br /><br />Whether we are writing an individual unit test or designing a product&#8217;s entire testing process, it is important to take a step back and think about <span><b>how effective our tests are at detecting and reporting bugs in our code</b></span>. To be effective, there are <b><span>three important qualities</span></b> that every test should try to maximize: <br /><br /><b>Fidelity </b><br /><br />When the code under test is broken, the test fails. <b><span>A high-fidelity test is one that is very sensitive to defects in the code under test</span></b>, helping to prevent bugs from creeping into the code. <br /><br />Maximize fidelity by ensuring that your tests cover all the paths through your code and include all relevant assertions on the expected state. <br /><br /><b>Resilience </b><br /><br />A test shouldn&#8217;t fail if the code under test isn&#8217;t defective. <b><span>A resilient test is one that only fails when a breaking change is made to the code under test.</span></b> Refactorings and other non-breaking changes to the code under test can be made without needing to modify the test, reducing the cost of maintaining the tests. <br /><br />Maximize resilience by only testing the exposed API of the code under test; avoid reaching into internals. Favor stubs and fakes over mocks; don't verify interactions with dependencies unless it is that interaction that you are explicitly validating. A flaky test obviously has very low resilience. 
<br /><br /><b>Precision </b><br /><br />When a test fails, <b><span>a high-precision test tells you exactly where the defect lies</span></b>. A well-written unit test can tell you exactly which line of code is at fault. Poorly written tests (especially large end-to-end tests) often exhibit very low precision, telling you that something is broken but not where. <br /><br />Maximize precision by keeping your tests small and tightly focused. Choose descriptive method names that convey exactly what the test is validating. For system integration tests, validate state at every boundary. <br /><br />These three qualities are often in tension with each other. It's easy to write a highly resilient test (the empty test, for example), but writing a test that is both highly resilient and high-fidelity is hard. <b><span>As you design and write tests, use these qualities as a framework to guide your implementation</span></b>. <br /><br />]]></description>
				<content:encoded><![CDATA[<i>by Rich Martin, Zurich </i><br /><i><br /></i><i>This article was adapted from a <a href="http://googletesting.blogspot.com/2007/01/introducing-testing-on-toilet.html">Google Testing on the Toilet</a> (TotT) episode. You can download a <a href="https://docs.google.com/document/d/1pb8AvYvshNRAP4x2skdd4fc0U2reCBxsMlccwEdFlFs/edit?usp=sharing">printer-friendly version</a> of this TotT episode and post it in your office. </i><br /><br /><br />Whether we are writing an individual unit test or designing a product’s entire testing process, it is important to take a step back and think about <span style="color: #660000;"><b>how effective our tests are at detecting and reporting bugs in our code</b></span>. To be effective, there are <b><span style="color: #660000;">three important qualities</span></b> that every test should try to maximize: <br /><br /><b>Fidelity </b><br /><br />When the code under test is broken, the test fails. <b><span style="color: #660000;">A high-fidelity test is one that is very sensitive to defects in the code under test</span></b>, helping to prevent bugs from creeping into the code. <br /><br />Maximize fidelity by ensuring that your tests cover all the paths through your code and include all relevant assertions on the expected state. <br /><br /><b>Resilience </b><br /><br />A test shouldn’t fail if the code under test isn’t defective. <b><span style="color: #660000;">A resilient test is one that only fails when a breaking change is made to the code under test.</span></b> Refactorings and other non-breaking changes to the code under test can be made without needing to modify the test, reducing the cost of maintaining the tests. <br /><br />Maximize resilience by only testing the exposed API of the code under test; avoid reaching into internals. Favor stubs and fakes over mocks; don't verify interactions with dependencies unless it is that interaction that you are explicitly validating. 
A flaky test obviously has very low resilience. <br /><br /><b>Precision </b><br /><br />When a test fails, <b><span style="color: #660000;">a high-precision test tells you exactly where the defect lies</span></b>. A well-written unit test can tell you exactly which line of code is at fault. Poorly written tests (especially large end-to-end tests) often exhibit very low precision, telling you that something is broken but not where. <br /><br />Maximize precision by keeping your tests small and tightly focused. Choose descriptive method names that convey exactly what the test is validating. For system integration tests, validate state at every boundary. <br /><br />These three qualities are often in tension with each other. It's easy to write a highly resilient test (the empty test, for example), but writing a test that is both highly resilient and high-fidelity is hard. <b><span style="color: #660000;">As you design and write tests, use these qualities as a framework to guide your implementation</span></b>. <br /><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/testing-on-the-toilet-effective-testing/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Testing on the Toilet: Test Behaviors, Not Methods</title>
		<link>https://googledata.org/google-testing/testing-on-the-toilet-test-behaviors-not-methods/</link>
		<comments>https://googledata.org/google-testing/testing-on-the-toilet-test-behaviors-not-methods/#comments</comments>
		<pubDate>Mon, 14 Apr 2014 21:25:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=e1dca3362eb6cf8838ecbcd2f3ffb1e7</guid>
		<description><![CDATA[<i>by Erik Kuefler </i><br /><br /><i>This article was adapted from a <a href="http://googletesting.blogspot.com/2007/01/introducing-testing-on-toilet.html">Google Testing on the Toilet</a> (TotT) episode. You can download a <a href="https://docs.google.com/document/d/1W3BQgIYWZ-glc5kxvFGVE28uu4GAodAU1LU9-r4qwIs/edit?usp=sharing">printer-friendly version</a> of this TotT episode and post it in your office. </i><br /><br />After writing a method, it's easy to write just one test that verifies everything the method does. But <span><b>it can be harmful to think that tests and public methods should have a 1:1 relationship</b></span>. What we really want to test are behaviors, where a single method can exhibit many behaviors, and a single behavior sometimes spans multiple methods. <br /><br />Let's take a look at a bad test that verifies an entire method: <br /><br /><pre>@Test <span>public</span> <span>void</span> testProcessTransaction() {<br />  User user = newUserWithBalance(LOW_BALANCE_THRESHOLD.plus(dollars(2)));<br />  transactionProcessor.processTransaction(<br />      user,<br />      <span>new</span> Transaction(<span>"Pile of Beanie Babies"</span>, dollars(3)));<br />  assertContains(<span>"You bought a Pile of Beanie Babies"</span>, ui.getText());<br />  assertEquals(1, user.getEmails().size());<br />  assertEquals(<span>"Your balance is low"</span>, user.getEmails().get(0).getSubject());<br />}<br /></pre><br />Displaying the name of the purchased item and sending an email about the balance being low are two separate behaviors, but this test looks at both of those behaviors together just because they happen to be triggered by the same method. <span><b>Tests like this very often become massive and difficult to maintain over time as additional behaviors keep getting added in</b></span>&#8212;eventually it will be very hard to tell which parts of the input are responsible for which assertions. 
The fact that the test's name is a direct mirror of the method's name is a bad sign. <br /><br /><b><span>It's a much better idea to use separate tests to verify separate behaviors</span></b>: <br /><br /><pre>@Test <span>public</span> <span>void</span> testProcessTransaction_displaysNotification() {<br />  transactionProcessor.processTransaction(<br />      <span>new</span> User(), <span>new</span> Transaction(<span>"Pile of Beanie Babies"</span>));<br />  assertContains(<span>"You bought a Pile of Beanie Babies"</span>, ui.getText());<br />}<br />@Test <span>public</span> <span>void</span> testProcessTransaction_sendsEmailWhenBalanceIsLow() {<br />  User user = newUserWithBalance(LOW_BALANCE_THRESHOLD.plus(dollars(2)));<br />  transactionProcessor.processTransaction(<br />      user,<br />      <span>new</span> Transaction(dollars(3)));<br />  assertEquals(1, user.getEmails().size());<br />  assertEquals(<span>"Your balance is low"</span>, user.getEmails().get(0).getSubject());<br />}<br /></pre><br />Now, when someone adds a new behavior, they will write a new test for that behavior. Each test will remain focused and easy to understand, no matter how many behaviors are added. <b><span>This will make your tests more resilient since adding new behaviors is unlikely to break the existing tests, and clearer since each test contains code to exercise only one behavior</span></b>. <br /><br />]]></description>
				<content:encoded><![CDATA[<i>by Erik Kuefler </i><br /><br /><i>This article was adapted from a <a href="http://googletesting.blogspot.com/2007/01/introducing-testing-on-toilet.html">Google Testing on the Toilet</a> (TotT) episode. You can download a <a href="https://docs.google.com/document/d/1W3BQgIYWZ-glc5kxvFGVE28uu4GAodAU1LU9-r4qwIs/edit?usp=sharing">printer-friendly version</a> of this TotT episode and post it in your office. </i><br /><br />After writing a method, it's easy to write just one test that verifies everything the method does. But <span style="color: purple;"><b>it can be harmful to think that tests and public methods should have a 1:1 relationship</b></span>. What we really want to test are behaviors, where a single method can exhibit many behaviors, and a single behavior sometimes spans multiple methods. <br /><br />Let's take a look at a bad test that verifies an entire method: <br /><br /><pre style="background: #f0f0f0; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;">@Test <span style="color: #0433ff;">public</span> <span style="color: #0433ff;">void</span> testProcessTransaction() {<br />  User user = newUserWithBalance(LOW_BALANCE_THRESHOLD.plus(dollars(2)));<br />  transactionProcessor.processTransaction(<br />      user,<br />      <span style="color: #0433ff;">new</span> Transaction(<span style="color: #ff2600;">"Pile of Beanie Babies"</span>, dollars(3)));<br />  assertContains(<span style="color: #ff2600;">"You bought a Pile of Beanie Babies"</span>, ui.getText());<br />  assertEquals(1, user.getEmails().size());<br />  assertEquals(<span style="color: #ff2600;">"Your balance is low"</span>, user.getEmails().get(0).getSubject());<br />}<br /></pre><br />Displaying the name of the purchased item and sending an email about the balance being low are two separate behaviors, but this test looks at both of those behaviors together just because they happen to be triggered 
by the same method. <span style="color: purple;"><b>Tests like this very often become massive and difficult to maintain over time as additional behaviors keep getting added in</b></span>—eventually it will be very hard to tell which parts of the input are responsible for which assertions. The fact that the test's name is a direct mirror of the method's name is a bad sign. <br /><br /><b><span style="color: purple;">It's a much better idea to use separate tests to verify separate behaviors</span></b>: <br /><br /><pre style="background: #f0f0f0; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;">@Test <span style="color: #0433ff;">public</span> <span style="color: #0433ff;">void</span> testProcessTransaction_displaysNotification() {<br />  transactionProcessor.processTransaction(<br />      <span style="color: #0433ff;">new</span> User(), <span style="color: #0433ff;">new</span> Transaction(<span style="color: #ff2600;">"Pile of Beanie Babies"</span>));<br />  assertContains(<span style="color: #ff2600;">"You bought a Pile of Beanie Babies"</span>, ui.getText());<br />}<br />@Test <span style="color: #0433ff;">public</span> <span style="color: #0433ff;">void</span> testProcessTransaction_sendsEmailWhenBalanceIsLow() {<br />  User user = newUserWithBalance(LOW_BALANCE_THRESHOLD.plus(dollars(2)));<br />  transactionProcessor.processTransaction(<br />      user,<br />      <span style="color: #0433ff;">new</span> Transaction(dollars(3)));<br />  assertEquals(1, user.getEmails().size());<br />  assertEquals(<span style="color: #ff2600;">"Your balance is low"</span>, user.getEmails().get(0).getSubject());<br />}<br /></pre><br />Now, when someone adds a new behavior, they will write a new test for that behavior. Each test will remain focused and easy to understand, no matter how many behaviors are added. 
<b><span style="color: purple;">This will make your tests more resilient since adding new behaviors is unlikely to break the existing tests, and clearer since each test contains code to exercise only one behavior</span></b>. <br /><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/testing-on-the-toilet-test-behaviors-not-methods/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The *Real* Test Driven Development</title>
		<link>https://googledata.org/google-testing/the-real-test-driven-development/</link>
		<comments>https://googledata.org/google-testing/the-real-test-driven-development/#comments</comments>
		<pubDate>Tue, 01 Apr 2014 09:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=6a38d0ae3e632f0524c1dbb34d168b98</guid>
		<description><![CDATA[<i></i><br /><div><span><b>Update: APRIL FOOLS!</b></span></div><div><br /></div><i><br /></i><i>by Kaue Silveira </i><br /><br />Here at Google, we invest heavily in development productivity research. In fact, our TDD research group now occupies nearly an entire building of the <a href="http://en.wikipedia.org/wiki/Googleplex">Googleplex</a>. The group has been working hard to minimize the development cycle time, and we&#8217;d like to share some of the amazing progress they&#8217;ve made. <br /><br /><b>The Concept </b><br /><br />In the ways of old, it used to be that people wrote tests for their existing code. This was changed by TDD (Test-driven Development), where one would write the test first and then write the code to satisfy it. The TDD research group didn&#8217;t think this was enough and wanted to elevate the humble test to the next level. We are pleased to announce the <b>Real</b> TDD, our latest innovation in the <a href="http://en.wikipedia.org/wiki/Program_synthesis">Program Synthesis</a> field, where you write only the tests and have the computer write the code for you!<br /><br />The following graph shows how the number of tests created by a small feature team grew since they started using this tool towards the end of 2013. Over the last 2 quarters, more than 89% of this team&#8217;s production code was written by the tool! 
<br /><br /><div><a href="http://2.bp.blogspot.com/-L7u928SyPZo/UzYwxxJnJ_I/AAAAAAAAAFk/4NPxE5h691o/s1600/image.png"><img border="0" src="http://2.bp.blogspot.com/-L7u928SyPZo/UzYwxxJnJ_I/AAAAAAAAAFk/4NPxE5h691o/s1600/image.png" height="393" width="640"></a></div><b>See it in action: </b><br /><br />Test written by a Software Engineer: <br /><br /><pre><span>class</span> LinkGeneratorTest(googletest.TestCase):<br /><br /><span>def</span> setUp(<span>self</span>):<br /><span>self</span>.generator = link_generator.LinkGenerator()<br /><br /><span>def</span> testGetLinkFromIDs(<span>self</span>):<br />    expected = (<span>'https://frontend.google.com/advancedSearchResults?'</span><br /><span>'s.op=ALL&#38;s.r0.field=ID&#38;s.r0.val=1288585+1310696+1346270+'</span>)<br />    actual = <span>self</span>.generator.GetLinkFromIDs(set((1346270, 1310696, 1288585)))<br /><span>self</span>.assertEqual(expected, actual)</pre><br />Code created by our tool: <br /><br /><pre><span>import</span> urllib<br /><br /><span>class</span> LinkGenerator(object):<br /><br />  _URL = (<br /><span>'https://frontend.google.com/advancedSearchResults?'</span><br /><span>'s.op=ALL&#38;s.r0.field=ID&#38;s.r0.val='</span>)<br /><br /><span>def</span> GetLinkFromIDs(<span>self</span>, ids):<br />    result = []<br /><span>for</span> id <span>in</span> sorted(ids):<br />      result.append(<span>'%s '</span> % id)<br /><span>return</span> <span>self</span>._URL + urllib.quote_plus(<span>''</span>.join(result))<br /></pre><br />Note that the tool is smart enough to not generate the obvious implementation of returning a constant string, but instead it correctly abstracts and generalizes the relation between inputs and outputs. It becomes smarter at every use and it&#8217;s behaving more and more like a human programmer every day. We once saw a comment in the generated code that said "I need some coffee". <br /><br /><b>How does it work? 
</b><br /><br />We&#8217;ve trained the <a href="http://en.wikipedia.org/wiki/Google_Brain">Google Brain</a> with billions of lines of open-source software to learn about coding patterns and how product code correlates with test code. Its accuracy is further improved by using  <a href="http://en.wikipedia.org/wiki/Type_inference">Type Inference</a> to infer types from code and the <a href="https://www.youtube.com/watch?v=h0OkptwfX4g&#38;feature=youtu.be&#38;t=21m30s">Girard-Reynolds Isomorphism</a> to infer code from types. <br /><br />The tool runs every time your unit test is saved, and it uses the learned model to guide a backtracking search for a code snippet that satisfies all assertions in the test. It provides sub-second responses for 99.5% of the cases (as shown in the following graph), thanks to millions of pre-computed assertion-snippet pairs stored in <a href="http://en.wikipedia.org/wiki/Spanner_(database)">Spanner</a> for global low-latency access. <br /><br /><div><a href="http://2.bp.blogspot.com/-nH9kf2nDmPk/UzYxCZvYD0I/AAAAAAAAAFs/bhiaJW-qwjU/s1600/image+(1).png"><img border="0" src="http://2.bp.blogspot.com/-nH9kf2nDmPk/UzYxCZvYD0I/AAAAAAAAAFs/bhiaJW-qwjU/s1600/image+(1).png" height="394" width="640"></a></div><br /><br /><b>How can I use it? </b><br /><br />We will offer a free (rate-limited) service that everyone can use, once we have sorted out the legal issues regarding the possibility of mixing code snippets originating from open-source projects with different licenses (e.g., <a href="http://en.wikipedia.org/wiki/GNU_General_Public_License">GPL-licensed</a> tests will simply refuse to pass <a href="http://en.wikipedia.org/wiki/BSD_licenses">BSD-licensed</a> code snippets). If you would like to try our alpha release before the public launch, leave us a comment! <br /><br />]]></description>
				<content:encoded><![CDATA[<i></i><br /><div style="text-align: center;"><span style="color: red;"><b>Update: APRIL FOOLS!</b></span></div><div><br /></div><i><br /></i><i>by Kaue Silveira </i><br /><br />Here at Google, we invest heavily in development productivity research. In fact, our TDD research group now occupies nearly an entire building of the <a href="http://en.wikipedia.org/wiki/Googleplex">Googleplex</a>. The group has been working hard to minimize the development cycle time, and we’d like to share some of the amazing progress they’ve made. <br /><br /><b>The Concept </b><br /><br />In the ways of old, it used to be that people wrote tests for their existing code. This was changed by TDD (Test-driven Development), where one would write the test first and then write the code to satisfy it. The TDD research group didn’t think this was enough and wanted to elevate the humble test to the next level. We are pleased to announce the <b>Real</b> TDD, our latest innovation in the <a href="http://en.wikipedia.org/wiki/Program_synthesis">Program Synthesis</a> field, where you write only the tests and have the computer write the code for you!<br /><br />The following graph shows how the number of tests created by a small feature team grew since they started using this tool towards the end of 2013. Over the last 2 quarters, more than 89% of this team’s production code was written by the tool! 
<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-L7u928SyPZo/UzYwxxJnJ_I/AAAAAAAAAFk/4NPxE5h691o/s1600/image.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-L7u928SyPZo/UzYwxxJnJ_I/AAAAAAAAAFk/4NPxE5h691o/s1600/image.png" height="393" width="640" /></a></div><b>See it in action: </b><br /><br />Test written by a Software Engineer: <br /><br /><pre style="background: #f0f0f0; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;"><span style="color: #0433ff;">class</span> LinkGeneratorTest(googletest.TestCase):<br /><br />  <span style="color: #0433ff;">def</span> setUp(<span style="color: #0433ff;">self</span>):<br />    <span style="color: #0433ff;">self</span>.generator = link_generator.LinkGenerator()<br /><br />  <span style="color: #0433ff;">def</span> testGetLinkFromIDs(<span style="color: #0433ff;">self</span>):<br />    expected = (<span style="color: #ff2600;">'https://frontend.google.com/advancedSearchResults?'</span><br />                <span style="color: #ff2600;">'s.op=ALL&amp;s.r0.field=ID&amp;s.r0.val=1288585+1310696+1346270+'</span>)<br />    actual = <span style="color: #0433ff;">self</span>.generator.GetLinkFromIDs(set((1346270, 1310696, 1288585)))<br />    <span style="color: #0433ff;">self</span>.assertEqual(expected, actual)</pre><br />Code created by our tool: <br /><br /><pre style="background: #f0f0f0; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;"><span style="color: #0433ff;">import</span> urllib<br /><br /><span style="color: #0433ff;">class</span> LinkGenerator(object):<br /><br />  _URL = (<br />      <span style="color: #ff2600;">'https://frontend.google.com/advancedSearchResults?'</span><br />      <span style="color: #ff2600;">'s.op=ALL&amp;s.r0.field=ID&amp;s.r0.val='</span>)<br /><br />  
<span style="color: #0433ff;">def</span> GetLinkFromIDs(<span style="color: #0433ff;">self</span>, ids):<br />    result = []<br />    <span style="color: #0433ff;">for</span> id <span style="color: #0433ff;">in</span> sorted(ids):<br />      result.append(<span style="color: #ff2600;">'%s '</span> % id)<br />    <span style="color: #0433ff;">return</span> <span style="color: #0433ff;">self</span>._URL + urllib.quote_plus(<span style="color: #ff2600;">''</span>.join(result))<br /></pre><br />Note that the tool is smart enough to not generate the obvious implementation of returning a constant string, but instead it correctly abstracts and generalizes the relation between inputs and outputs. It becomes smarter at every use and it’s behaving more and more like a human programmer every day. We once saw a comment in the generated code that said "I need some coffee". <br /><br /><b>How does it work? </b><br /><br />We’ve trained the <a href="http://en.wikipedia.org/wiki/Google_Brain">Google Brain</a> with billions of lines of open-source software to learn about coding patterns and how product code correlates with test code. Its accuracy is further improved by using  <a href="http://en.wikipedia.org/wiki/Type_inference">Type Inference</a> to infer types from code and the <a href="https://www.youtube.com/watch?v=h0OkptwfX4g&amp;feature=youtu.be&amp;t=21m30s">Girard-Reynolds Isomorphism</a> to infer code from types. <br /><br />The tool runs every time your unit test is saved, and it uses the learned model to guide a backtracking search for a code snippet that satisfies all assertions in the test. It provides sub-second responses for 99.5% of the cases (as shown in the following graph), thanks to millions of pre-computed assertion-snippet pairs stored in <a href="http://en.wikipedia.org/wiki/Spanner_(database)">Spanner</a> for global low-latency access. 
<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-nH9kf2nDmPk/UzYxCZvYD0I/AAAAAAAAAFs/bhiaJW-qwjU/s1600/image+(1).png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-nH9kf2nDmPk/UzYxCZvYD0I/AAAAAAAAAFs/bhiaJW-qwjU/s1600/image+(1).png" height="394" width="640" /></a></div><br /><br /><b>How can I use it? </b><br /><br />We will offer a free (rate-limited) service that everyone can use, once we have sorted out the legal issues regarding the possibility of mixing code snippets originating from open-source projects with different licenses (e.g., <a href="http://en.wikipedia.org/wiki/GNU_General_Public_License">GPL-licensed</a> tests will simply refuse to pass <a href="http://en.wikipedia.org/wiki/BSD_licenses">BSD-licensed</a> code snippets). If you would like to try our alpha release before the public launch, leave us a comment! <br /><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/the-real-test-driven-development/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Testing on the Toilet: What Makes a Good Test?</title>
		<link>https://googledata.org/google-testing/testing-on-the-toilet-what-makes-a-good-test/</link>
		<comments>https://googledata.org/google-testing/testing-on-the-toilet-what-makes-a-good-test/#comments</comments>
		<pubDate>Tue, 18 Mar 2014 22:10:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=f6cef583b1900b373dff33de2ee22942</guid>
		<description><![CDATA[<i>by Erik Kuefler </i><br /><br /><i>This article was adapted from a <a href="http://googletesting.blogspot.com/2007/01/introducing-testing-on-toilet.html">Google Testing on the Toilet</a> (TotT) episode. You can download a <a href="https://docs.google.com/document/d/1C51MIOYAy699S5Sm2WRDsCnutUIQyi7Tzw5EbhRo_Mo/edit?usp=sharing">printer-friendly version</a> of this TotT episode and post it in your office. </i><br /><br />Unit tests are important tools for verifying that our code is correct. But <span><b>writing good tests is about much more than just verifying correctness</b></span> &#8212; a good unit test should exhibit several other properties in order to be readable and maintainable. <br /><br />One property of a good test is clarity. <b><span>Clarity means that a test should serve as readable documentation for humans, describing the code being tested in terms of its public APIs.</span></b> Tests shouldn't refer directly to implementation details. The names of a class's tests should say everything the class does, and the tests themselves should serve as examples for how to use the class. <br /><br />Two more important properties are completeness and conciseness. 
<b><span>A test is complete when its body contains all of the information you need to understand it, and concise when it doesn't contain any other distracting information.</span></b> This test fails on both counts: <br /><br /><pre>@Test <span>public</span> <span>void</span> shouldPerformAddition() {<br />  Calculator <b>calculator</b> = <span>new</span> Calculator(<span>new</span> RoundingStrategy(), <br /><span>"unused"</span>, ENABLE_COSIN_FEATURE, 0.01, calculusEngine, <span>false</span>);<br /><span>int</span> <b>result</b> = <b>calculator</b>.doComputation(makeTestComputation());<br />  assertEquals(<b>5</b>, <b>result</b>); <span>// Where did this number come from?</span><br />}</pre><br />Lots of distracting information is being passed to the constructor, and the important parts are hidden off in a helper method. The test can be made more complete by clarifying the purpose of the helper method, and more concise by using another helper to hide the irrelevant details of constructing the calculator: <br /><br /><pre>@Test <span>public</span> <span>void</span> shouldPerformAddition() {<br />  Calculator <b>calculator</b> = newCalculator();<br /><span>int</span> <b>result</b> = <b>calculator</b>.doComputation(makeAdditionComputation(<b>2</b>, <b>3</b>));<br />  assertEquals(<b>5</b>, <b>result</b>);<br />}</pre><br />One final property of a good test is resilience. Once written, a <b><span>resilient test doesn't have to change unless the purpose or behavior of the class being tested changes</span></b>. Adding new behavior should only require adding new tests, not changing old ones. The original test above isn't resilient since you'll have to update it (and probably dozens of other tests!) whenever you add a new irrelevant constructor parameter. Moving these details into the helper method solved this problem. <br /><br />]]></description>
				<content:encoded><![CDATA[<i>by Erik Kuefler </i><br /><br /><i>This article was adapted from a <a href="http://googletesting.blogspot.com/2007/01/introducing-testing-on-toilet.html">Google Testing on the Toilet</a> (TotT) episode. You can download a <a href="https://docs.google.com/document/d/1C51MIOYAy699S5Sm2WRDsCnutUIQyi7Tzw5EbhRo_Mo/edit?usp=sharing">printer-friendly version</a> of this TotT episode and post it in your office. </i><br /><br />Unit tests are important tools for verifying that our code is correct. But <span style="color: purple;"><b>writing good tests is about much more than just verifying correctness</b></span> — a good unit test should exhibit several other properties in order to be readable and maintainable. <br /><br />One property of a good test is clarity. <b><span style="color: purple;">Clarity means that a test should serve as readable documentation for humans, describing the code being tested in terms of its public APIs.</span></b> Tests shouldn't refer directly to implementation details. The names of a class's tests should say everything the class does, and the tests themselves should serve as examples for how to use the class. <br /><br />Two more important properties are completeness and conciseness. 
<b><span style="color: purple;">A test is complete when its body contains all of the information you need to understand it, and concise when it doesn't contain any other distracting information.</span></b> This test fails on both counts: <br /><br /><pre style="background: #f9f9f9; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;">@Test <span style="color: #0433ff;">public</span> <span style="color: #0433ff;">void</span> shouldPerformAddition() {<br />  Calculator <b>calculator</b> = <span style="color: #0433ff;">new</span> Calculator(<span style="color: #0433ff;">new</span> RoundingStrategy(), <br />      <span style="color: #ff2600;">"unused"</span>, ENABLE_COSIN_FEATURE, 0.01, calculusEngine, <span style="color: #0433ff;">false</span>);<br />  <span style="color: #0433ff;">int</span> <b>result</b> = <b>calculator</b>.doComputation(makeTestComputation());<br />  assertEquals(<b>5</b>, <b>result</b>); <span style="color: #4f8f00;">// Where did this number come from?</span><br />}</pre><br />Lots of distracting information is being passed to the constructor, and the important parts are hidden off in a helper method. The test can be made more complete by clarifying the purpose of the helper method, and more concise by using another helper to hide the irrelevant details of constructing the calculator: <br /><br /><pre style="background: #f9f9f9; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;">@Test <span style="color: #0433ff;">public</span> <span style="color: #0433ff;">void</span> shouldPerformAddition() {<br />  Calculator <b>calculator</b> = newCalculator();<br />  <span style="color: #0433ff;">int</span> <b>result</b> = <b>calculator</b>.doComputation(makeAdditionComputation(<b>2</b>, <b>3</b>));<br />  assertEquals(<b>5</b>, <b>result</b>);<br />}</pre><br />One final property of a good test is resilience. 
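(The helpers <span style="font-family: Courier New, Courier, monospace;">newCalculator()</span> and <span style="font-family: Courier New, Courier, monospace;">makeAdditionComputation()</span> are never shown in the episode. The self-contained sketch below invents minimal stand-in types, since the real Calculator API is unknown, purely to show what such helpers might hide.)

```java
// Hypothetical sketch: Calculator, Computation, and Op are stand-ins
// invented here; only the helper-method idea comes from the episode.
class CalculatorHelperSketch {
  enum Op { ADD }

  static class Computation {
    final Op op;
    final int a;
    final int b;
    Computation(Op op, int a, int b) {
      this.op = op;
      this.a = a;
      this.b = b;
    }
  }

  static class Calculator {
    int doComputation(Computation c) {
      return c.op == Op.ADD ? c.a + c.b : 0;
    }
  }

  // Hides every construction detail that is irrelevant to the behavior under test.
  static Calculator newCalculator() {
    return new Calculator();
  }

  // Names the only inputs the assertion actually depends on.
  static Computation makeAdditionComputation(int a, int b) {
    return new Computation(Op.ADD, a, b);
  }

  public static void main(String[] args) {
    Calculator calculator = newCalculator();
    int result = calculator.doComputation(makeAdditionComputation(2, 3));
    System.out.println(result); // prints 5
  }
}
```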
Once written, a <b><span style="color: purple;">resilient test doesn't have to change unless the purpose or behavior of the class being tested changes</span></b>. Adding new behavior should only require adding new tests, not changing old ones. The original test above isn't resilient since you'll have to update it (and probably dozens of other tests!) whenever you add a new irrelevant constructor parameter. Moving these details into the helper method solved this problem. <br /><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/testing-on-the-toilet-what-makes-a-good-test/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>When/how to use Mockito Answer</title>
		<link>https://googledata.org/google-testing/whenhow-to-use-mockito-answer/</link>
		<comments>https://googledata.org/google-testing/whenhow-to-use-mockito-answer/#comments</comments>
		<pubDate>Mon, 03 Mar 2014 21:19:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=a243a677e729a4a9b268a158e02d6af8</guid>
		<description><![CDATA[<i>by Hongfei Ding, Software Engineer, Shanghai </i><br /><br /><a href="https://code.google.com/p/mockito/">Mockito</a> is a popular open source Java testing framework that allows the creation of mock objects. For example, we have the following interface used in our SUT (System Under Test):<br /><pre><span>interface</span> Service {<br />  Data get();<br />}</pre><br />In our test, normally we want to fake the <span><b>Service</b></span>&#8217;s behavior to return canned data, so that the unit test can focus on testing the code that interacts with the <span><b>Service</b></span>. We use the <span><b>when-return</b></span> clause to stub a method.<br /><pre>when(service.get()).thenReturn(cannedData);</pre><br />But sometimes you need mock object behavior that's too complex for <span><b>when-return</b></span>. An <a href="http://docs.mockito.googlecode.com/hg/latest/org/mockito/stubbing/Answer.html"><span>Answer</span></a> object can be a clean way to do this once you get the syntax right.<br /><br />A common usage of <span><b>Answer</b></span> is to <b>stub asynchronous methods that have callbacks</b>. For example, we have mocked the interface below:<br /><pre><span>interface</span> Service {<br />  <span>void</span> get(Callback callback);<br />}</pre><br />Here you&#8217;ll find that <span><b>when-return</b></span> is not that helpful anymore. <span><b>Answer</b></span> is the replacement. 
For example, we can <b>emulate a success</b> by calling the <span><b>onSuccess</b></span> function of the callback.<br /><pre>doAnswer(<span>new</span> Answer&#60;Void&#62;() {<br /><span>public</span> Void answer(InvocationOnMock invocation) {<br />       Callback callback = (Callback) invocation.getArguments()[0];<br />       callback.onSuccess(cannedData);<br /><span>return</span> null;<br />    }<br />}).when(service).get(any(Callback.<span>class</span>));</pre><br /><span><b>Answer</b></span> can also be used to <b>make smarter stubs for synchronous methods</b>. Smarter here means the stub can return a value depending on the input, rather than canned data. It&#8217;s sometimes quite useful. For example, we have mocked the <span><b>Translator</b></span>&#160;interface below:<br /><pre><span>interface</span> Translator {<br />  String translate(String msg);<br />}</pre><br />We might choose to mock <span><b>Translator</b></span> to return a constant string and then assert the result. However, that test is not thorough, because the input to the <span><b>translator</b></span>&#160;function has been ignored. To improve this, we might capture the input and do extra verification, but then we start to fall into the &#8220;<a href="http://googletesting.blogspot.tw/2013/03/testing-on-toilet-testing-state-vs.html">testing interaction rather than testing state</a>&#8221; trap.  <br /><br />A good usage of <span><b>Answer</b></span> is <b>to reverse the input message as a fake translation</b>. So that both things are assured by checking the result string: 1) <span><b>translate</b></span> has been invoked, 2) the <span><b>msg</b></span> being translated is correct. 
Notice that this time we&#8217;ve used <span><b>thenAnswer</b></span> syntax, a twin of <span><b>doAnswer</b></span>, for stubbing a non-void method.<br /><pre>when(translator.translate(any(String.<span>class</span>))).thenAnswer(reverseMsg());<br />...<br /><span>// extracted a method to put a descriptive name</span><br /><span>private</span> <span>static</span> Answer&#60;String&#62; reverseMsg() {<br />  <span>return</span> <span>new</span> Answer&#60;String&#62;() {<br />    <span>public</span> String answer(InvocationOnMock invocation) {<br />       <span>return</span> reverseString((String) invocation.getArguments()[0]);<br />    }<br />  };<br />}</pre><br />Last but not least, if you find yourself writing many nontrivial <span><b>Answer</b></span>s, you should consider using a <a href="http://googletesting.blogspot.com/2013/06/testing-on-toilet-fake-your-way-to.html">fake</a> instead. <br /><br />]]></description>
				<content:encoded><![CDATA[<i>by Hongfei Ding, Software Engineer, Shanghai </i><br /><br /><a href="https://code.google.com/p/mockito/">Mockito</a> is a popular open source Java testing framework that allows the creation of mock objects. For example, we have the following interface used in our SUT (System Under Test):<br /><pre style="background: #f9f9f9; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;"><span style="color: #0433ff;">interface</span> Service {<br />  Data get();<br />}</pre><br />In our test, normally we want to fake the <span style="color: #38761d; font-family: Courier New, Courier, monospace;"><b>Service</b></span>’s behavior to return canned data, so that the unit test can focus on testing the code that interacts with the <span style="color: #38761d; font-family: Courier New, Courier, monospace;"><b>Service</b></span>. We use the <span style="color: #38761d; font-family: Courier New, Courier, monospace;"><b>when-return</b></span> clause to stub a method.<br /><pre style="background: #f9f9f9; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;">when(service.get()).thenReturn(cannedData);</pre><br />But sometimes you need mock object behavior that's too complex for <span style="color: #38761d; font-family: Courier New, Courier, monospace;"><b>when-return</b></span>. An <a href="http://docs.mockito.googlecode.com/hg/latest/org/mockito/stubbing/Answer.html"><span style="font-family: Courier New, Courier, monospace;">Answer</span></a> object can be a clean way to do this once you get the syntax right.<br /><br />A common usage of <span style="color: #38761d; font-family: Courier New, Courier, monospace;"><b>Answer</b></span> is to <b>stub asynchronous methods that have callbacks</b>. 
For example, we have mocked the interface below:<br /><pre style="background: #f9f9f9; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;"><span style="color: #0433ff;">interface</span> Service {<br />  <span style="color: #0433ff;">void</span> get(Callback callback);<br />}</pre><br />Here you’ll find that <span style="color: #38761d; font-family: Courier New, Courier, monospace;"><b>when-return</b></span> is not that helpful anymore. <span style="color: #38761d; font-family: Courier New, Courier, monospace;"><b>Answer</b></span> is the replacement. For example, we can <b>emulate a success</b> by calling the <span style="color: #38761d; font-family: Courier New, Courier, monospace;"><b>onSuccess</b></span> function of the callback.<br /><pre style="background: #f9f9f9; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;">doAnswer(<span style="color: #0433ff;">new</span> Answer&lt;Void&gt;() {<br />    <span style="color: #0433ff;">public</span> Void answer(InvocationOnMock invocation) {<br />       Callback callback = (Callback) invocation.getArguments()[0];<br />       callback.onSuccess(cannedData);<br />       <span style="color: #0433ff;">return</span> null;<br />    }<br />}).when(service).get(any(Callback.<span style="color: #0433ff;">class</span>));</pre><br /><span style="color: #38761d; font-family: Courier New, Courier, monospace;"><b>Answer</b></span> can also be used to <b>make smarter stubs for synchronous methods</b>. Smarter here means the stub can return a value depending on the input, rather than canned data. It’s sometimes quite useful. 
For example, we have mocked the <span style="color: #38761d; font-family: Courier New, Courier, monospace;"><b>Translator</b></span>&nbsp;interface below:<br /><pre style="background: #f9f9f9; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;"><span style="color: #0433ff;">interface</span> Translator {<br />  String translate(String msg);<br />}</pre><br />We might choose to mock <span style="color: #38761d; font-family: Courier New, Courier, monospace;"><b>Translator</b></span> to return a constant string and then assert the result. However, that test is not thorough, because the input to the <span style="color: #38761d; font-family: Courier New, Courier, monospace;"><b>translator</b></span>&nbsp;function has been ignored. To improve this, we might capture the input and do extra verification, but then we start to fall into the “<a href="http://googletesting.blogspot.tw/2013/03/testing-on-toilet-testing-state-vs.html">testing interaction rather than testing state</a>” trap.  <br /><br />A good usage of <span style="color: #38761d; font-family: Courier New, Courier, monospace;"><b>Answer</b></span> is <b>to reverse the input message as a fake translation</b>, so that checking the result string assures two things: 1) <span style="color: #38761d; font-family: Courier New, Courier, monospace;"><b>translate</b></span> has been invoked, 2) the <span style="color: #38761d; font-family: Courier New, Courier, monospace;"><b>msg</b></span> being translated is correct. 
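The <span style="font-family: Courier New, Courier, monospace;">reverseString</span> helper used in the snippet below is never defined in the post; a minimal version (name and signature assumed) could be:

```java
// Hypothetical helper: the post never shows reverseString, so this is an
// assumed minimal implementation for the fake translation.
class ReverseStringSketch {
  static String reverseString(String msg) {
    // StringBuilder.reverse reverses the character sequence in place.
    return new StringBuilder(msg).reverse().toString();
  }

  public static void main(String[] args) {
    System.out.println(reverseString("hello")); // prints olleh
  }
}
```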
Notice that this time we’ve used <span style="color: #38761d; font-family: Courier New, Courier, monospace;"><b>thenAnswer</b></span> syntax, a twin of <span style="color: #38761d; font-family: Courier New, Courier, monospace;"><b>doAnswer</b></span>, for stubbing a non-void method.<br /><pre style="background: #f9f9f9; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;">when(translator.translate(any(String.<span style="color: #0433ff;">class</span>))).thenAnswer(reverseMsg());<br />...<br /><span style="color: #4f8f00;">// extracted a method to put a descriptive name</span><br /><span style="color: #0433ff;">private</span> <span style="color: #0433ff;">static</span> Answer&lt;String&gt; reverseMsg() {<br />  <span style="color: #0433ff;">return</span> <span style="color: #0433ff;">new</span> Answer&lt;String&gt;() {<br />    <span style="color: #0433ff;">public</span> String answer(InvocationOnMock invocation) {<br />       <span style="color: #0433ff;">return</span> reverseString((String) invocation.getArguments()[0]);<br />    }<br />  };<br />}</pre><br />Last but not least, if you find yourself writing many nontrivial <span style="color: #38761d; font-family: Courier New, Courier, monospace;"><b>Answer</b></span>s, you should consider using a <a href="http://googletesting.blogspot.com/2013/06/testing-on-toilet-fake-your-way-to.html">fake</a> instead. <br /><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/whenhow-to-use-mockito-answer/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Minimizing Unreproducible Bugs</title>
		<link>https://googledata.org/google-testing/minimizing-unreproducible-bugs/</link>
		<comments>https://googledata.org/google-testing/minimizing-unreproducible-bugs/#comments</comments>
		<pubDate>Mon, 03 Feb 2014 19:23:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=1bfb81e6d0dcd91d07751b1497bd77a7</guid>
		<description><![CDATA[<div><a href="http://4.bp.blogspot.com/-W_7cqFXE2DA/Uu_ksr9HQOI/AAAAAAAAAFM/63eqQLF0cR8/s1600/cannot-repro.png"><img border="0" src="http://4.bp.blogspot.com/-W_7cqFXE2DA/Uu_ksr9HQOI/AAAAAAAAAFM/63eqQLF0cR8/s1600/cannot-repro.png" height="307" width="320"></a></div><br /><br /><i>by <a href="http://www.anthonysapps.com/anthony">Anthony Vallone</a></i><br /><br />Unreproducible bugs are the bane of my existence. Far too often, I find a bug, report it, and hear back that it&#8217;s not a bug because it can&#8217;t be reproduced. Of course, the bug is still there, waiting to prey on its next victim. These types of bugs can be very expensive due to increased investigation time and overall lifetime ...]]></description>
				<content:encoded><![CDATA[<div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-W_7cqFXE2DA/Uu_ksr9HQOI/AAAAAAAAAFM/63eqQLF0cR8/s1600/cannot-repro.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-W_7cqFXE2DA/Uu_ksr9HQOI/AAAAAAAAAFM/63eqQLF0cR8/s1600/cannot-repro.png" height="307" width="320" /></a></div><br /><br /><i>by <a href="http://www.anthonysapps.com/anthony">Anthony Vallone</a></i><br /><br />Unreproducible bugs are the bane of my existence. Far too often, I find a bug, report it, and hear back that it’s not a bug because it can’t be reproduced. Of course, the bug is still there, waiting to prey on its next victim. These types of bugs can be very expensive due to increased investigation time and overall lifetime. They can also have a damaging effect on product perception when users reporting these bugs are effectively ignored. We should be doing more to prevent them. In this article, I’ll go over some obvious, and maybe not so obvious, development/testing guidelines that can reduce the likelihood of these bugs occurring. <br /><br /><br /><b><span style="color: blue;">Avoid and test for race conditions, deadlocks, timing issues, memory corruption, uninitialized memory access, memory leaks, and resource issues </span></b><br /><br />I am lumping together many bug types in this section, but they are all somewhat related by how we test for them and how disproportionately hard they are to reproduce and debug. The root cause and effect can be separated by milliseconds or hours, and stack traces might be nonexistent or misleading. A system may fail in strange ways when exposed to unusual traffic spikes or insufficient resources. Race conditions and deadlocks may only be discovered during unique traffic patterns or resource configurations. 
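<br /><br />As a concrete illustration (a hypothetical sketch, not from the original post): many hard-to-reproduce deadlocks reduce to two threads acquiring the same pair of locks in opposite orders, and picking one global acquisition order removes the hazard entirely. The Account/transfer names below are invented for the example.

```java
// Hypothetical sketch: acquire the two locks in a fixed global order
// (ascending id) so that transfer(a, b, ...) and transfer(b, a, ...),
// running concurrently, can never deadlock on each other.
class Account {
    final long id;
    long balance;

    Account(long id, long balance) {
        this.id = id;
        this.balance = balance;
    }

    static void transfer(Account from, Account to, long amount) {
        Account first = from.id < to.id ? from : to;   // lower id locked first
        Account second = (first == from) ? to : from;
        synchronized (first) {
            synchronized (second) {
                from.balance -= amount;
                to.balance += amount;
            }
        }
    }
}
```

The ordering rule costs one comparison and eliminates an entire class of unreproducible hangs.<br /><br />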
Timing issues may only be noticed when many components are integrated and their performance parameters and failure/retry/timeout delays create a chaotic system. Memory corruption or uninitialized memory access may go unnoticed for a large percentage of calls but become fatal for rare states. Memory leaks may be negligible unless the system is exposed to load for an extended period of time. <br /><br /><u>Guidelines for development: </u><br /><br /><ul style="line-height: 1.5em; margin-bottom: 0px; margin-top: 0px; padding-bottom: 0px; padding-top: 0px;"><li>Simplify your synchronization logic. If it’s too hard to understand, it will be difficult to reproduce and debug complex concurrency problems. </li><li>Always obtain locks in the same order. This is a tried-and-true guideline to avoid deadlocks, but I still see code that breaks it periodically. Define an order for obtaining multiple locks and never change that order.  </li><li>Don’t optimize by creating many fine-grained locks, unless you have verified that they are needed. Extra locks increase concurrency complexity. </li><li>Avoid shared memory, unless you truly need it. Shared memory access is very easy to get wrong, and the bugs may be quite difficult to reproduce. </li></ul><br /><u>Guidelines for testing: </u><br /><br /><ul style="line-height: 1.5em; margin-bottom: 0px; margin-top: 0px; padding-bottom: 0px; padding-top: 0px;"><li><a href="http://en.wikipedia.org/wiki/Stress_testing">Stress test</a> your system regularly. You don't want to be surprised by unexpected failures when your system is under heavy load.  </li><li>Test timeouts. Create tests that mock/fake dependencies to test timeout code. If your timeout code does something bad, it may cause a bug that only occurs under certain system conditions. </li><li>Test with debug and optimized builds. You may find that a well behaved debug build works fine, but the system fails in strange ways once optimized. </li><li>Test under constrained resources. 
Try reducing the number of data centers, machines, processes, threads, available disk space, or available memory. Also try simulating reduced network bandwidth. </li><li>Test for longevity. Some bugs require a long period of time to reveal themselves. For example, persistent data may become corrupt over time. </li><li>Use dynamic analysis tools like <a href="http://en.wikipedia.org/wiki/Memory_debugger">memory debuggers</a>, <a href="https://code.google.com/p/address-sanitizer/">ASan</a>, <a href="https://code.google.com/p/thread-sanitizer/">TSan</a>, and <a href="https://code.google.com/p/memory-sanitizer/wiki/MemorySanitizer">MSan</a> regularly. They can help identify many categories of unreproducible memory/threading issues. </li></ul><br /><br /><b><span style="color: blue;">Enforce preconditions </span></b><br /><br />I’ve seen many well-meaning functions with a high tolerance for bad input. For example, consider this function: <br /><br /><pre style="background: #000000; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: white; overflow: auto; padding: 10px;"><span style="color: #76d6ff;">void</span> ScheduleEvent(<span style="color: #76d6ff;">int</span> timeDurationMilliseconds) {<br />  <span style="color: #76d6ff;">if</span> (timeDurationMilliseconds &lt;= 0) {<br />    timeDurationMilliseconds = 1;<br />  }<br />  ...<br />}<br /></pre><br />This function is trying to help the calling code by adjusting the input to an acceptable value, but it may be doing damage by masking a bug. The calling code may be experiencing any number of problems described in this article, and passing garbage to this function will always work fine. The more functions that are written with this level of tolerance, the harder it is to trace back to the root cause, and the more likely it becomes that the end user will see garbage. 
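<br /><br />A fail-fast variant of the function above might look like this in Java (a hypothetical sketch; the class name and exception choice are illustrative):

```java
// Hypothetical fail-fast variant of the ScheduleEvent example: enforce
// the precondition loudly instead of silently "repairing" a bad argument.
class Scheduler {
    static void scheduleEvent(int timeDurationMilliseconds) {
        if (timeDurationMilliseconds <= 0) {
            throw new IllegalArgumentException(
                "timeDurationMilliseconds must be > 0, got " + timeDurationMilliseconds);
        }
        // ... schedule the event ...
    }
}
```

Now a caller passing garbage fails at the call site, where the real bug lives, instead of producing a mysteriously early event.<br /><br />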
Enforcing preconditions, for instance by using asserts, may actually cause a higher number of failures for new systems, but as systems mature, and many minor/major problems are identified early on, these checks can help improve long-term reliability. <br /><br /><u>Guidelines for development: </u><br /><br /><ul style="line-height: 1.5em; margin-bottom: 0px; margin-top: 0px; padding-bottom: 0px; padding-top: 0px;"><li>Enforce preconditions in your functions unless you have a good reason not to. </li></ul><br /><br /><b><span style="color: blue;">Use defensive programming </span></b><br /><br /><a href="http://en.wikipedia.org/wiki/Defensive_programming">Defensive programming</a> is another tried-and-true technique that is great at minimizing unreproducible bugs. If your code calls a dependency to do something, and that dependency quietly fails or returns garbage, how does your code handle it? You could test for situations like this via mocking or faking, but it’s even better to have your production code do sanity checking on its dependencies. For example: <br /><br /><pre style="background: #000000; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: white; overflow: auto; padding: 10px;"><span style="color: #76d6ff;">double</span> GetMonthlyLoanPayment() {<br />  <span style="color: #76d6ff;">double</span> rate = GetTodaysInterestRateFromExternalSystem();<br />  <span style="color: #76d6ff;">if</span> (rate &lt; 0.001 || rate &gt; 0.5) {<br />    <span style="color: #76d6ff;">throw</span> BadInterestRate(rate);<br />  }<br />  ...<br />}<br /></pre><br /><u>Guidelines for development: </u><br /><br /><ul style="line-height: 1.5em; margin-bottom: 0px; margin-top: 0px; padding-bottom: 0px; padding-top: 0px;"><li>When possible, use defensive programming to verify the work of your dependencies with known risks of failure like user-provided data, I/O operations, and RPC calls. 
</li></ul><br /><u>Guidelines for testing: </u><br /><br /><ul style="line-height: 1.5em; margin-bottom: 0px; margin-top: 0px; padding-bottom: 0px; padding-top: 0px;"><li>Use <a href="http://en.wikipedia.org/wiki/Fuzz_testing">fuzz testing</a> to test your system’s hardiness when enduring bad data. </li></ul><br /><br /><b><span style="color: blue;">Don’t hide all errors from the user </span></b><br /><br />There has been a trend in recent years toward hiding failures from users at all costs. In many cases, it makes perfect sense, but in some, we have gone overboard. Code that is very quiet and permissive during minor failures will allow an uninformed user to continue working in a failed state. The software may ultimately reach a fatal tipping point, and all the error conditions that led to failure have been ignored. If the user doesn’t know about the prior errors, they will not be able to report them, and you may not be able to reproduce them. <br /><br /><u>Guidelines for development: </u><br /><br /><ul style="line-height: 1.5em; margin-bottom: 0px; margin-top: 0px; padding-bottom: 0px; padding-top: 0px;"><li>Only hide errors from the user when you are certain that there is no impact to system state or the user. </li><li>Any error with impact to the user should be reported to the user with instructions for how to proceed. The information shown to the user, combined with data available to an engineer, should be enough to determine what went wrong. </li></ul><br /><br /><b><span style="color: blue;">Test error handling </span></b><br /><br />The most common section of code to remain untested is error handling code. Don’t skip test coverage here. Bad error handling code can cause unreproducible bugs and create great risk if it does not handle fatal errors well. <br /><br /><u>Guidelines for testing: </u><br /><br /><ul style="line-height: 1.5em; margin-bottom: 0px; margin-top: 0px; padding-bottom: 0px; padding-top: 0px;"><li>Always test your error handling code. 
This is usually best accomplished by mocking or faking the component triggering the error. </li><li>It’s also a good practice to examine your log quality for all types of error handling.  </li></ul><br /><br /><b><span style="color: blue;">Check for duplicate keys </span></b><br /><br />If unique identifiers or data access keys are generated using random data or are not guaranteed to be globally unique, duplicate keys may cause data corruption or concurrency issues. Key duplication bugs are very difficult to reproduce. <br /><br /><u>Guidelines for development: </u><br /><br /><ul style="line-height: 1.5em; margin-bottom: 0px; margin-top: 0px; padding-bottom: 0px; padding-top: 0px;"><li>Try to guarantee uniqueness of all keys. </li><li>When it’s not possible to guarantee unique keys, check if the recently generated key is already in use before using it. </li><li>Watch out for potential race conditions here and avoid them with synchronization. </li></ul><br /><br /><b><span style="color: blue;">Test for concurrent data access </span></b><br /><br />Some bugs only reveal themselves when multiple clients are reading/writing the same data. Your stress tests might be covering cases like these, but if they are not, you should have special tests for concurrent data access. Cases like these are often unreproducible. For example, a user may have two instances of your app running against the same account, and they may not realize this when reporting a bug. <br /><br /><u>Guidelines for testing: </u><br /><br /><ul style="line-height: 1.5em; margin-bottom: 0px; margin-top: 0px; padding-bottom: 0px; padding-top: 0px;"><li>Always test for concurrent data access if it’s a feature of the system. Actually, even if it’s not a feature, verify that the system rejects it. Testing concurrency can be challenging. 
An approach that usually works for me is to create many worker threads that simultaneously attempt access and a master thread that monitors and verifies that some number of attempts were indeed concurrent, blocked or allowed as expected, and all were successful. Programmatic post-analysis of all attempts and changing system state may also be necessary to ensure that the system behaved well. </li></ul><br /><br /><b><span style="color: blue;">Steer clear of undefined behavior and non-deterministic access to data </span></b><br /><br />Some APIs and basic operations have warnings about <a href="http://en.wikipedia.org/wiki/Undefined_behavior">undefined behavior</a> when in certain states or provided with certain input. Similarly, some data structures do not guarantee an iteration order (example: <a href="http://docs.oracle.com/javase/7/docs/api/java/util/Set.html#iterator()">Java’s Set</a>). Code that ignores these warnings may work fine most of the time but fail in unusual ways that are hard to reproduce.  <br /><br /><u>Guidelines for development: </u><br /><br /><ul style="line-height: 1.5em; margin-bottom: 0px; margin-top: 0px; padding-bottom: 0px; padding-top: 0px;"><li>Understand when the APIs and operations you use might have undefined behavior and prevent those conditions. </li><li>Do not depend on data structure iteration order unless it is guaranteed. It is a common mistake to depend on the ordering of <a href="http://en.wikipedia.org/wiki/Set_(computer_science)">sets</a> or <a href="http://en.wikipedia.org/wiki/Associative_array">associative arrays</a>. </li></ul><br /><br /><b><span style="color: blue;">Log the details for errors or test failures </span></b><br /><br />Issues described in this article can be easier to reproduce and debug when the logs contain enough detail to understand the conditions that led to an error. 
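<br /><br />For example (a hypothetical sketch, not from the original post), an error log that captures the inputs behind a failure is far more useful for reproduction than a bare message; the class and parameter names below are invented:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch: when rejecting a bad value, log every input an
// engineer would need to reconstruct the failing call later.
class RateChecker {
    private static final Logger log = Logger.getLogger(RateChecker.class.getName());

    static double checkRate(double rate, String source, long requestId) {
        if (rate < 0.001 || rate > 0.5) {
            // "bad rate" alone is not reproducible; the surrounding context is.
            log.log(Level.SEVERE,
                    "Bad interest rate: rate={0}, source={1}, requestId={2}",
                    new Object[] {rate, source, requestId});
            throw new IllegalArgumentException("Bad interest rate: " + rate);
        }
        return rate;
    }
}
```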
<br /><br /><u>Guidelines for development: </u><br /><br /><ul style="line-height: 1.5em; margin-bottom: 0px; margin-top: 0px; padding-bottom: 0px; padding-top: 0px;"><li>Follow <a href="http://googletesting.blogspot.com/2013/06/optimal-logging.html">good logging practices</a>, especially in your error handling code. </li><li>If logs are stored on a user’s machine, create an easy way for them to provide you the logs. </li></ul><br /><u>Guidelines for testing: </u><br /><br /><ul style="line-height: 1.5em; margin-bottom: 0px; margin-top: 0px; padding-bottom: 0px; padding-top: 0px;"><li>Save your test logs for potential analysis later. </li></ul><br /><br /><b><span style="color: blue;">Anything to add? </span></b><br /><br />Have I missed any important guidelines for minimizing these bugs? What is your favorite hard-to-reproduce bug that you discovered and resolved? <br /><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/minimizing-unreproducible-bugs/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Google Test and Development Environment &#8211; Pt. 3: Code, Build, and Test</title>
		<link>https://googledata.org/google-testing/the-google-test-and-development-environment-pt-3-code-build-and-test/</link>
		<comments>https://googledata.org/google-testing/the-google-test-and-development-environment-pt-3-code-build-and-test/#comments</comments>
		<pubDate>Tue, 21 Jan 2014 19:03:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=0906cff3f77c93b9c52488b4a2c3e944</guid>
		<description><![CDATA[<i>by <a href="http://www.anthonysapps.com/anthony">Anthony Vallone</a></i><br /><i><br /></i><i>This is the third in a series of articles about our work environment. See the <a href="http://googletesting.blogspot.com/2013/12/the-google-test-and-development.html">first</a>&#160;and <a href="http://googletesting.blogspot.com/2014/01/the-google-test-and-development.html">second</a>.</i><br /><br />I will never forget the awe I felt when running my first load test on my first project at Google. At previous companies I&#8217;ve worked, running a substantial load test took quite a bit of resource planning and preparation. At Google, I wrote less than 100 lines of code and was simulating tens of thousands of users after just minutes of prep work. The ease with which I was able to accomplish this is due to the impressive coding, building, and testing tools available at Google. In this article, I will discuss these tools and how they affect our test and development process. <br /><br /><b>Coding and building </b><br /><br />The tools and process for coding and building make it very easy to change production and test code. Even though we are a large company, we have managed to remain nimble. In a matter of minutes or hours, you can edit, test, review, and submit code to head. We have achieved this without sacrificing code quality by heavily investing in tools, testing, and infrastructure, and by prioritizing code reviews. <br /><br />Most production and test code is in a single, company-wide source control repository (open source projects like Chromium and Android have their own). There is a great deal of code sharing in the codebase, and this provides an incredible suite of code to build on. Most code is also in a single branch, so the majority of development is done at head. All code is also navigable, searchable, and editable from the browser. 
You&#8217;ll find code in numerous languages, but Java, C++, Python, Go, and JavaScript are the most common. <br /><br />Have a strong preference for editor? Engineers are free to choose from many IDEs and editors. The most common are Eclipse, Emacs, Vim, and IntelliJ, but many others are used as well. Engineers that are passionate about their prefered editors have built up and shared some truly impressive editor plugins/tooling over the years. <br /><br />Code reviews for all submissions are enforced via source control tooling. This also applies to test code, as our test code is held to the same standards as production code. The reviews are done via web-based code review tools that even include automatically generated test results. The process is very streamlined and efficient. Engineers can change and submit code in any part of the repository, but it must get reviewed by owners of the code being changed. This is great, because you can easily change code that your team depends on, rather than merely request a change to code you do not own. <br /><br />The <a href="http://google-engtools.blogspot.com/2011/08/build-in-cloud-how-build-system-works.html">Google build system</a> is used for building most code, and it is designed to work across many languages and platforms. It is remarkably simple to define and build targets. You won&#8217;t be needing that old Makefile book. <br /><br /><b>Running jobs and tests </b><br /><br />We have some pretty amazing machine and job management tools at Google. There is a generally available pool of machines in many data centers around the globe. The job management service makes it very easy to start jobs on arbitrary machines in any of these data centers. Failing machines are automatically removed from the pool, so tests rarely fail due to machine issues. With a little effort, you can also set up monitoring and pager alerting for your important jobs. 
<br /><br />From any machine you can spin up a massive number of tests and run them in parallel across many machines in the pool, via a single command. Each of these tests are run in a standard, isolated environment, so we rarely run into the &#8220;it works on my machine!&#8221; issue. <br /><br />Before code is submitted, <a href="http://googletesting.blogspot.com/2008/09/presubmit-and-performance.html">presubmit</a> tests can be run that will find all tests that depend transitively on the change and run them. You can also define presubmit rules that run checks on a code change and verify that tests were run before allowing submission. <br /><br />Once you&#8217;ve submitted test code, the build and test system automatically registers the test, and starts building/testing continuously. If the test starts failing, your team will get notification emails. You can also visit a test dashboard for your team and get details about test runs and test data. Monitoring the build/test status is made even easier with our build orbs designed and built by Googlers. These small devices will glow red if the build starts failing. Many teams have had fun customizing these orbs to various shapes, including a statue of liberty with a glowing torch. <br /><br /><div><a href="http://3.bp.blogspot.com/-il9jbjZdgxk/Ut7DeUdUIeI/AAAAAAAAAE8/J9FqDTO6DG0/s1600/liberty.png"><img border="0" src="http://3.bp.blogspot.com/-il9jbjZdgxk/Ut7DeUdUIeI/AAAAAAAAAE8/J9FqDTO6DG0/s1600/liberty.png" height="320" width="153"></a></div><div><i>Statue of LORBerty </i></div><br />Running larger integration and end-to-end tests takes a little more work, but we have some excellent tools to help with these tests as well: Integration test runners, hermetic environment creation, virtual machine service, web test frameworks, etc. <br /><br /><b>The impact </b><br /><br />So how do these tools actually affect our productivity? For starters, the code is easy to find, edit, review, and submit. 
Engineers are free to choose tools that make them most productive. Before and after submission, running small tests is trivial, and running large tests is relatively easy. Since tests are easy to create and run, it&#8217;s fairly simple to maintain a green build, which most teams do most of the time. This allows us to spend more time on real problems and less on the things that shouldn&#8217;t even be problems. It allows us to focus on creating rigorous tests. It dramatically accelerates a development process that can <a href="http://paulbuchheit.blogspot.com/2009/01/communicating-with-code.html">prototype Gmail in a day</a> and code/test/release service features on a daily schedule. And, of course, it lets us focus on the fun stuff. <br /><br /><b>Thoughts? </b><br /><br />We are interested to hear your thoughts on this topic. Google has the resources to build tools like this, but would small or medium-sized companies benefit from a similar investment in their infrastructure? Did Google create the infrastructure or did the infrastructure create Google? <br /><br />]]></description>
				<content:encoded><![CDATA[<i>by <a href="http://www.anthonysapps.com/anthony">Anthony Vallone</a></i><br /><i><br /></i><i>This is the third in a series of articles about our work environment. See the <a href="http://googletesting.blogspot.com/2013/12/the-google-test-and-development.html">first</a>&nbsp;and <a href="http://googletesting.blogspot.com/2014/01/the-google-test-and-development.html">second</a>.</i><br /><br />I will never forget the awe I felt when running my first load test on my first project at Google. At previous companies I’ve worked for, running a substantial load test took quite a bit of resource planning and preparation. At Google, I wrote less than 100 lines of code and was simulating tens of thousands of users after just minutes of prep work. The ease with which I was able to accomplish this is due to the impressive coding, building, and testing tools available at Google. In this article, I will discuss these tools and how they affect our test and development process. <br /><br /><b>Coding and building </b><br /><br />The tools and process for coding and building make it very easy to change production and test code. Even though we are a large company, we have managed to remain nimble. In a matter of minutes or hours, you can edit, test, review, and submit code to head. We have achieved this without sacrificing code quality by heavily investing in tools, testing, and infrastructure, and by prioritizing code reviews. <br /><br />Most production and test code is in a single, company-wide source control repository (open source projects like Chromium and Android have their own). There is a great deal of code sharing in the codebase, and this provides an incredible suite of code to build on. Most code is also in a single branch, so the majority of development is done at head. All code is also navigable, searchable, and editable from the browser. You’ll find code in numerous languages, but Java, C++, Python, Go, and JavaScript are the most common. 
<br /><br />Have a strong editor preference? Engineers are free to choose from many IDEs and editors. The most common are Eclipse, Emacs, Vim, and IntelliJ, but many others are used as well. Engineers who are passionate about their preferred editors have built up and shared some truly impressive editor plugins/tooling over the years. <br /><br />Code reviews for all submissions are enforced via source control tooling. This also applies to test code, as our test code is held to the same standards as production code. The reviews are done via web-based code review tools that even include automatically generated test results. The process is very streamlined and efficient. Engineers can change and submit code in any part of the repository, but it must get reviewed by owners of the code being changed. This is great, because you can easily change code that your team depends on, rather than merely request a change to code you do not own. <br /><br />The <a href="http://google-engtools.blogspot.com/2011/08/build-in-cloud-how-build-system-works.html">Google build system</a> is used for building most code, and it is designed to work across many languages and platforms. It is remarkably simple to define and build targets. You won’t be needing that old Makefile book. <br /><br /><b>Running jobs and tests </b><br /><br />We have some pretty amazing machine and job management tools at Google. There is a generally available pool of machines in many data centers around the globe. The job management service makes it very easy to start jobs on arbitrary machines in any of these data centers. Failing machines are automatically removed from the pool, so tests rarely fail due to machine issues. With a little effort, you can also set up monitoring and pager alerting for your important jobs. <br /><br />From any machine you can spin up a massive number of tests and run them in parallel across many machines in the pool, via a single command. 
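The single-command fan-out described above can be sketched, very loosely, as a worker pool mapped over test targets. This is not Google's actual tooling; `run_test`, the target names, and the pool size are purely illustrative:

```python
# Toy sketch of fanning test targets out across a worker pool.
from concurrent.futures import ThreadPoolExecutor

def run_test(target):
    # In the real system each target would run in an isolated, hermetic
    # environment on a remote machine; here we just simulate a verdict.
    return (target, "PASS")

# Hypothetical build-target labels, for illustration only.
targets = ["//gmail:storage_test", "//gmail:ui_test", "//common:utils_test"]

with ThreadPoolExecutor(max_workers=8) as pool:
    results = dict(pool.map(run_test, targets))

failing = [t for t, verdict in results.items() if verdict != "PASS"]
print(f"{len(results)} tests run, {len(failing)} failing")
```

The point of the isolation mentioned above is that each worker's environment is identical, so a verdict depends only on the code under test, not on the machine it happened to land on.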
Each of these tests is run in a standard, isolated environment, so we rarely run into the “it works on my machine!” issue. <br /><br />Before code is submitted, <a href="http://googletesting.blogspot.com/2008/09/presubmit-and-performance.html">presubmit</a> tests can be run that will find all tests that depend transitively on the change and run them. You can also define presubmit rules that run checks on a code change and verify that tests were run before allowing submission. <br /><br />Once you’ve submitted test code, the build and test system automatically registers the test and starts building/testing continuously. If the test starts failing, your team will get notification emails. You can also visit a test dashboard for your team and get details about test runs and test data. Monitoring the build/test status is made even easier with our build orbs designed and built by Googlers. These small devices will glow red if the build starts failing. Many teams have had fun customizing these orbs to various shapes, including a Statue of Liberty with a glowing torch. <br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-il9jbjZdgxk/Ut7DeUdUIeI/AAAAAAAAAE8/J9FqDTO6DG0/s1600/liberty.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-il9jbjZdgxk/Ut7DeUdUIeI/AAAAAAAAAE8/J9FqDTO6DG0/s1600/liberty.png" height="320" width="153" /></a></div><div style="text-align: center;"><i>Statue of LORBerty </i></div><br />Running larger integration and end-to-end tests takes a little more work, but we have some excellent tools to help with these tests as well: Integration test runners, hermetic environment creation, virtual machine service, web test frameworks, etc. <br /><br /><b>The impact </b><br /><br />So how do these tools actually affect our productivity? For starters, the code is easy to find, edit, review, and submit. 
Engineers are free to choose tools that make them most productive. Before and after submission, running small tests is trivial, and running large tests is relatively easy. Since tests are easy to create and run, it’s fairly simple to maintain a green build, which most teams do most of the time. This allows us to spend more time on real problems and less on the things that shouldn’t even be problems. It allows us to focus on creating rigorous tests. It dramatically accelerates a development process that can <a href="http://paulbuchheit.blogspot.com/2009/01/communicating-with-code.html">prototype Gmail in a day</a> and code/test/release service features on a daily schedule. And, of course, it lets us focus on the fun stuff. <br /><br /><b>Thoughts? </b><br /><br />We are interested to hear your thoughts on this topic. Google has the resources to build tools like this, but would small or medium-sized companies benefit from a similar investment in their infrastructure? Did Google create the infrastructure or did the infrastructure create Google? <br /><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/the-google-test-and-development-environment-pt-3-code-build-and-test/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Google Test and Development Environment &#8211; Pt. 2: Dogfooding and Office Software</title>
		<link>https://googledata.org/google-testing/the-google-test-and-development-environment-pt-2-dogfooding-and-office-software/</link>
		<comments>https://googledata.org/google-testing/the-google-test-and-development-environment-pt-2-dogfooding-and-office-software/#comments</comments>
		<pubDate>Fri, 03 Jan 2014 18:41:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=d8c93f38838ac082e1528cff8738cefc</guid>
		<description><![CDATA[<i>by <a href="http://www.anthonysapps.com/anthony">Anthony Vallone</a></i><br /><i><br /></i><i>This is the second in a series of articles about our work environment. See the <a href="http://googletesting.blogspot.com/2013/12/the-google-test-and-development.html">first</a>. </i><br /><br />There are few things as frustrating as getting hampered in your work by a bug in a product you depend on. What if it&#8217;s a product developed by your company? Do you report/fix the issue or just work around it and hope it&#8217;ll go away soon? In this article, I&#8217;ll cover how and why Google <a href="http://en.wikipedia.org/wiki/Eating_your_own_dog_food">dogfoods</a> its own products. <br /><br /><b>Dogfooding </b><br /><br />Google makes heavy use of its own products. We have a large ecosystem of development/office tools and use them for nearly everything we do. Because we use them on a daily basis, we can dogfood releases company-wide before launching to the public. These dogfood versions often have features unavailable to the public but may be less stable. Instability is exactly what you want in your tools, right? Or, would you rather that frustration be passed on to your company&#8217;s customers? Of course not! <br /><br />Dogfooding is an important part of our test process. Test teams do their best to find problems before dogfooding, but we all know that testing is never perfect. We often get dogfood bug reports for edge and corner cases not initially covered by testing. We also get many comments about overall product quality and usability. This internal feedback has, on many occasions, changed product design. <br /><br />Not surprisingly, test-focused engineers often have a lot to say during the dogfood phase. I don&#8217;t think there is a single public-facing product that I have not reported bugs on. I really appreciate the fact that I can provide feedback on so many products before release. 
<br /><br />Interested in helping to test Google products? Many of our products have feedback links built-in. Some also have Beta releases available. For example, you can start using <a href="https://www.google.com/intl/en/chrome/browser/beta.html">Chrome Beta</a> and help us file bugs. <br /><br /><b>Office software </b><br /><br />From system design documents, to test plans, to discussions about beer brewing techniques, our products are used internally. A company&#8217;s choice of office tools can have a big impact on productivity, and it is fortunate for Google that we have such a comprehensive suite. The tools have a consistently simple UI (no manual required), perform very well, encourage collaboration, and auto-save in the cloud. Now that I am used to these tools, I would certainly have a hard time going back to the tools of previous companies I have worked at. I&#8217;m sure I would forget to click the save buttons for years to come. <br /><br />Examples of tools frequently used by engineers: <br /><ul><li><a href="http://www.google.com/drive/apps.html">Google Drive Apps</a> (Docs, Sheets, Slides, etc.) 
are used for design documents, test plans, project data, data analysis, presentations, and more.</li><li><a href="https://mail.google.com/intl/en/mail/help/about.html">Gmail</a> and <a href="https://www.google.com/hangouts/">Hangouts</a> are used for email and chat.</li><li><a href="https://support.google.com/calendar">Google Calendar</a> is used to schedule all meetings, reserve conference rooms, and set up video conferencing using Hangouts.</li><li><a href="https://maps.google.com/">Google Maps</a> is used to map office floors.</li><li><a href="https://support.google.com/groups">Google Groups</a> are used for email lists.</li><li><a href="https://support.google.com/sites">Google Sites</a> are used to host team pages, engineering docs, and more.</li><li><a href="https://cloud.google.com/products/app-engine">Google App Engine</a> hosts many corporate, development, and test apps.</li><li><a href="https://www.google.com/intl/en/chrome/browser">Chrome</a> is our primary browser on all platforms.</li><li><a href="https://support.google.com/plus">Google+</a> is used for organizing internal communities on topics such as food or C++, and for socializing.</li></ul><br /><b>Thoughts? </b><br /><br />We are interested to hear your thoughts on this topic. Do you dogfood your company&#8217;s products? Do your office tools help or hinder your productivity? What office software and tools do you find invaluable for your job? Could you use Google Docs/Sheets for large test plans? <br /><br /><div><a href="http://googletesting.blogspot.com/2014/01/the-google-test-and-development_21.html">(Continue to part 3)</a></div>]]></description>
				<content:encoded><![CDATA[<i>by <a href="http://www.anthonysapps.com/anthony">Anthony Vallone</a></i><br /><i><br /></i><i>This is the second in a series of articles about our work environment. See the <a href="http://googletesting.blogspot.com/2013/12/the-google-test-and-development.html">first</a>. </i><br /><br />There are few things as frustrating as getting hampered in your work by a bug in a product you depend on. What if it’s a product developed by your company? Do you report/fix the issue or just work around it and hope it’ll go away soon? In this article, I’ll cover how and why Google <a href="http://en.wikipedia.org/wiki/Eating_your_own_dog_food">dogfoods</a> its own products. <br /><br /><b>Dogfooding </b><br /><br />Google makes heavy use of its own products. We have a large ecosystem of development/office tools and use them for nearly everything we do. Because we use them on a daily basis, we can dogfood releases company-wide before launching to the public. These dogfood versions often have features unavailable to the public but may be less stable. Instability is exactly what you want in your tools, right? Or, would you rather that frustration be passed on to your company’s customers? Of course not! <br /><br />Dogfooding is an important part of our test process. Test teams do their best to find problems before dogfooding, but we all know that testing is never perfect. We often get dogfood bug reports for edge and corner cases not initially covered by testing. We also get many comments about overall product quality and usability. This internal feedback has, on many occasions, changed product design. <br /><br />Not surprisingly, test-focused engineers often have a lot to say during the dogfood phase. I don’t think there is a single public-facing product that I have not reported bugs on. I really appreciate the fact that I can provide feedback on so many products before release. <br /><br />Interested in helping to test Google products? 
Many of our products have feedback links built-in. Some also have Beta releases available. For example, you can start using <a href="https://www.google.com/intl/en/chrome/browser/beta.html">Chrome Beta</a> and help us file bugs. <br /><br /><b>Office software </b><br /><br />From system design documents, to test plans, to discussions about beer brewing techniques, our products are used internally. A company’s choice of office tools can have a big impact on productivity, and it is fortunate for Google that we have such a comprehensive suite. The tools have a consistently simple UI (no manual required), perform very well, encourage collaboration, and auto-save in the cloud. Now that I am used to these tools, I would certainly have a hard time going back to the tools of previous companies I have worked at. I’m sure I would forget to click the save buttons for years to come. <br /><br />Examples of tools frequently used by engineers: <br /><ul style="line-height: 1em; margin-bottom: 0px; margin-top: 0px; padding-bottom: 0px; padding-top: 0px;"><li><a href="http://www.google.com/drive/apps.html">Google Drive Apps</a> (Docs, Sheets, Slides, etc.) 
are used for design documents, test plans, project data, data analysis, presentations, and more.</li><li><a href="https://mail.google.com/intl/en/mail/help/about.html">Gmail</a> and <a href="https://www.google.com/hangouts/">Hangouts</a> are used for email and chat.</li><li><a href="https://support.google.com/calendar">Google Calendar</a> is used to schedule all meetings, reserve conference rooms, and set up video conferencing using Hangouts.</li><li><a href="https://maps.google.com/">Google Maps</a> is used to map office floors.</li><li><a href="https://support.google.com/groups">Google Groups</a> are used for email lists.</li><li><a href="https://support.google.com/sites">Google Sites</a> are used to host team pages, engineering docs, and more.</li><li><a href="https://cloud.google.com/products/app-engine">Google App Engine</a> hosts many corporate, development, and test apps.</li><li><a href="https://www.google.com/intl/en/chrome/browser">Chrome</a> is our primary browser on all platforms.</li><li><a href="https://support.google.com/plus">Google+</a> is used for organizing internal communities on topics such as food or C++, and for socializing.</li></ul><br /><b>Thoughts? </b><br /><br />We are interested to hear your thoughts on this topic. Do you dogfood your company’s products? Do your office tools help or hinder your productivity? What office software and tools do you find invaluable for your job? Could you use Google Docs/Sheets for large test plans? <br /><br /><div style="text-align: center;"><a href="http://googletesting.blogspot.com/2014/01/the-google-test-and-development_21.html">(Continue to part 3)</a></div>]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/the-google-test-and-development-environment-pt-2-dogfooding-and-office-software/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Google Test and Development Environment &#8211; Pt. 1: Office and Equipment</title>
		<link>https://googledata.org/google-testing/the-google-test-and-development-environment-pt-1-office-and-equipment/</link>
		<comments>https://googledata.org/google-testing/the-google-test-and-development-environment-pt-1-office-and-equipment/#comments</comments>
		<pubDate>Fri, 20 Dec 2013 19:58:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=f348d63257320d45443e3e97c53db299</guid>
		<description><![CDATA[<i>by <a href="http://www.anthonysapps.com/anthony">Anthony Vallone</a></i><br /><br />When conducting interviews, I often get questions about our workspace and engineering environment. What IDEs do you use? What programming languages are most common? What kind of tools do you have for testing? What does the workspace look like? <br /><br />Google is a company that is constantly pushing to improve itself. Just like software development itself, most environment improvements happen via a bottom-up approach. All engineers are responsible for fine-tuning, experimenting with, and improving our process, with a goal of eliminating barriers to creating products that amaze.  <br /><br />Office space and engineering equipment can have a considerable impact on productivity. I&#8217;ll focus on these areas of our work environment in this first article of a series on the topic. <br /><br /><b>Office layout </b><br /><br />Google is a highly collaborative workplace, so the open floor plan suits our engineering process. Project teams composed of Software Engineers (SWEs), Software Engineers in Test (SETs), and Test Engineers (TEs) all sit near each other or in large rooms together. The test-focused engineers are involved in every step of the development process, so it&#8217;s critical for them to sit with the product developers. This keeps the lines of communication open.  <br /><br /><div><a href="http://1.bp.blogspot.com/-i9JDR9-1_0Y/UrSew1LuOTI/AAAAAAAAAEU/4d0DnlNkANM/s1600/google-munich.jpg"><img border="0" src="http://1.bp.blogspot.com/-i9JDR9-1_0Y/UrSew1LuOTI/AAAAAAAAAEU/4d0DnlNkANM/s400/google-munich.jpg" height="265" width="400"></a></div><div><i>Google Munich </i></div><br />The office space is far from rigid, and teams often rearrange desks to suit their preferences. 
The facilities team recently finished renovating a new floor in the New York City office, and after a day of engineering debates on optimal arrangements and whiteboard diagrams, the floor was completely transformed. <br /><br />Besides the main office areas, there are lounge areas to which Googlers go for a change of scenery or a little peace and quiet. If you are trying to avoid becoming a casualty of The Great Foam Dart War, lounges are a great place to hide. <br /><br /><div><a href="http://4.bp.blogspot.com/-PU5Uno6oYWY/UrSe4rg-QtI/AAAAAAAAAEc/KNmnkb6Tt3c/s1600/dublin.jpg"><img border="0" src="http://4.bp.blogspot.com/-PU5Uno6oYWY/UrSe4rg-QtI/AAAAAAAAAEc/KNmnkb6Tt3c/s400/dublin.jpg" height="265" width="400"></a></div><div><i>Google Dublin </i></div><br /><b>Working with remote teams </b><br /><br />Google&#8217;s worldwide headquarters is in Mountain View, CA, but it&#8217;s a very global company, and our project teams are often distributed across multiple sites. To help keep teams well connected, most of our conference rooms have video conferencing equipment. We make frequent use of this equipment for team meetings, presentations, and quick chats. <br /><br /><div><a href="http://4.bp.blogspot.com/-Mtimvx4gLl8/UrSfDPmirCI/AAAAAAAAAEk/YqW83CQpC5o/s1600/boston.jpg"><img border="0" src="http://4.bp.blogspot.com/-Mtimvx4gLl8/UrSfDPmirCI/AAAAAAAAAEk/YqW83CQpC5o/s400/boston.jpg" height="298" width="400"></a></div><div><i>Google Boston </i></div><br /><b>What&#8217;s at your desk? </b><br /><br />All engineers get high-end machines and have easy access to data center machines for running large tasks. A new member on my team recently mentioned that his Google machine has 16 times the memory of the machine at his previous company. <br /><br />Most Google code runs on Linux, so the majority of development is done on Linux workstations. However, those who work on client code for Windows, OS X, or mobile develop on the relevant OSes. 
For displays, each engineer has a choice of either two 24-inch monitors or one 30-inch monitor. We also get our choice of laptop, picking from various Chromebook, MacBook, or Linux models. These come in handy when going to meetings, lounges, or working remotely.  <br /><br /><div><a href="http://2.bp.blogspot.com/-f5ukeR3cgzI/UrSfH83SnJI/AAAAAAAAAEs/6SwfI6Ijfxo/s1600/zurich.jpg"><img border="0" src="http://2.bp.blogspot.com/-f5ukeR3cgzI/UrSfH83SnJI/AAAAAAAAAEs/6SwfI6Ijfxo/s400/zurich.jpg" height="266" width="400"></a></div><div><i>Google Zurich </i></div><br /><b>Thoughts? </b><br /><br />We are interested to hear your thoughts on this topic. Do you prefer an open-office layout, cubicles, or private offices? Should test teams be embedded with development teams, or should they operate separately? Do the benefits of offering engineers high-end equipment outweigh the costs? <br /><br /><div><a href="http://googletesting.blogspot.com/2014/01/the-google-test-and-development.html">(Continue to part 2)</a></div><div><br /></div>]]></description>
				<content:encoded><![CDATA[<i>by <a href="http://www.anthonysapps.com/anthony">Anthony Vallone</a></i><br /><br />When conducting interviews, I often get questions about our workspace and engineering environment. What IDEs do you use? What programming languages are most common? What kind of tools do you have for testing? What does the workspace look like? <br /><br />Google is a company that is constantly pushing to improve itself. Just like software development itself, most environment improvements happen via a bottom-up approach. All engineers are responsible for fine-tuning, experimenting with, and improving our process, with a goal of eliminating barriers to creating products that amaze.  <br /><br />Office space and engineering equipment can have a considerable impact on productivity. I’ll focus on these areas of our work environment in this first article of a series on the topic. <br /><br /><b>Office layout </b><br /><br />Google is a highly collaborative workplace, so the open floor plan suits our engineering process. Project teams composed of Software Engineers (SWEs), Software Engineers in Test (SETs), and Test Engineers (TEs) all sit near each other or in large rooms together. The test-focused engineers are involved in every step of the development process, so it’s critical for them to sit with the product developers. This keeps the lines of communication open.  <br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-i9JDR9-1_0Y/UrSew1LuOTI/AAAAAAAAAEU/4d0DnlNkANM/s1600/google-munich.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-i9JDR9-1_0Y/UrSew1LuOTI/AAAAAAAAAEU/4d0DnlNkANM/s400/google-munich.jpg" height="265" width="400" /></a></div><div style="text-align: center;"><i>Google Munich </i></div><br />The office space is far from rigid, and teams often rearrange desks to suit their preferences. 
The facilities team recently finished renovating a new floor in the New York City office, and after a day of engineering debates on optimal arrangements and whiteboard diagrams, the floor was completely transformed. <br /><br />Besides the main office areas, there are lounge areas to which Googlers go for a change of scenery or a little peace and quiet. If you are trying to avoid becoming a casualty of The Great Foam Dart War, lounges are a great place to hide. <br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-PU5Uno6oYWY/UrSe4rg-QtI/AAAAAAAAAEc/KNmnkb6Tt3c/s1600/dublin.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-PU5Uno6oYWY/UrSe4rg-QtI/AAAAAAAAAEc/KNmnkb6Tt3c/s400/dublin.jpg" height="265" width="400" /></a></div><div style="text-align: center;"><i>Google Dublin </i></div><br /><b>Working with remote teams </b><br /><br />Google’s worldwide headquarters is in Mountain View, CA, but it’s a very global company, and our project teams are often distributed across multiple sites. To help keep teams well connected, most of our conference rooms have video conferencing equipment. We make frequent use of this equipment for team meetings, presentations, and quick chats. <br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-Mtimvx4gLl8/UrSfDPmirCI/AAAAAAAAAEk/YqW83CQpC5o/s1600/boston.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-Mtimvx4gLl8/UrSfDPmirCI/AAAAAAAAAEk/YqW83CQpC5o/s400/boston.jpg" height="298" width="400" /></a></div><div style="text-align: center;"><i>Google Boston </i></div><br /><b>What’s at your desk? </b><br /><br />All engineers get high-end machines and have easy access to data center machines for running large tasks. 
A new member on my team recently mentioned that his Google machine has 16 times the memory of the machine at his previous company. <br /><br />Most Google code runs on Linux, so the majority of development is done on Linux workstations. However, those who work on client code for Windows, OS X, or mobile develop on the relevant OSes. For displays, each engineer has a choice of either two 24-inch monitors or one 30-inch monitor. We also get our choice of laptop, picking from various Chromebook, MacBook, or Linux models. These come in handy when going to meetings, lounges, or working remotely.  <br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-f5ukeR3cgzI/UrSfH83SnJI/AAAAAAAAAEs/6SwfI6Ijfxo/s1600/zurich.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-f5ukeR3cgzI/UrSfH83SnJI/AAAAAAAAAEs/6SwfI6Ijfxo/s400/zurich.jpg" height="266" width="400" /></a></div><div style="text-align: center;"><i>Google Zurich </i></div><br /><b>Thoughts? </b><br /><br />We are interested to hear your thoughts on this topic. Do you prefer an open-office layout, cubicles, or private offices? Should test teams be embedded with development teams, or should they operate separately? Do the benefits of offering engineers high-end equipment outweigh the costs? <br /><br /><div style="text-align: center;"><a href="http://googletesting.blogspot.com/2014/01/the-google-test-and-development.html">(Continue to part 2)</a></div><div style="text-align: center;"><br /></div>]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/the-google-test-and-development-environment-pt-1-office-and-equipment/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>WebRTC Audio Quality Testing</title>
		<link>https://googledata.org/google-testing/webrtc-audio-quality-testing/</link>
		<comments>https://googledata.org/google-testing/webrtc-audio-quality-testing/#comments</comments>
		<pubDate>Fri, 08 Nov 2013 18:47:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=c870033d250d027502bcc2b0b5bde1c8</guid>
		<description><![CDATA[by Patrik H&#246;glund. The WebRTC project is all about enabling peer-to-peer video, voice and data transfer in the browser. To give our users the best possible experience we need to adapt the quality of the media to the bandwidth and processing power we have ...]]></description>
				<content:encoded><![CDATA[<i>by Patrik Höglund </i><br /><br />The <a href="http://www.webrtc.org/">WebRTC project</a> is all about enabling peer-to-peer video, voice and data transfer in the browser. To give our users the best possible experience we need to adapt the quality of the media to the bandwidth and processing power we have available. Our users encounter a wide variety of network conditions and run on a variety of devices, from powerful desktop machines with a wired broadband connection to laptops on WiFi to mobile phones on spotty 3G networks. <br /><br />We want to ensure good quality for all these use cases in our implementation in Chrome. To some extent we can do this with manual testing, but the breakneck pace of Chrome development makes it very hard to keep up (several hundred patches land every day)! Therefore, we'd like to test the quality of our video and voice transfer with an automated test. Ideally, we’d like to test for the most common network scenarios our users encounter, but to start we chose to implement a test where we have plenty of CPU and bandwidth. This article covers how we built such a test. <br /><br /><span style="font-size: large;"><b>Quality Metrics </b></span><br />First, we must define what we want to measure. For instance, the <a href="https://code.google.com/p/chromium/codesearch#chromium/src/chrome/browser/media/chrome_webrtc_video_quality_browsertest.cc&amp;q=chrome_webrtc_vi&amp;sq=package:chromium">WebRTC video quality test</a> uses <a href="http://en.wikipedia.org/wiki/Peak_signal-to-noise_ratio">peak signal-to-noise ratio</a> and <a href="http://en.wikipedia.org/wiki/Structural_similarity">structural similarity</a> to measure the quality of the video (or to be more precise, how much the output video differs from the input video; see <a href="http://www.youtube.com/watch?v=IbLNm3LsMaw&amp;list=SPSIUOFhnxEiCODb8XQB-RUQ0RGNZ2yW7d">this GTAC 13 talk</a> for more details). 
The quality of the user experience is a subjective thing though. Arguably, one probably needs dozens of different metrics to really ensure a good user experience. For video, we would have to (at the very least) have some measure for frame rate and resolution besides correctness. To have the system send somewhat correct video frames seemed the most important though, which is why we chose the above metrics. <br /><br />For this test we wanted to start with a similar correctness metric, but for audio. It turns out there's an algorithm called <a href="http://en.wikipedia.org/wiki/PESQ">Perceptual Evaluation of Speech Quality</a> (PESQ) which analyzes two audio files and tells you how similar they are, while taking into account how the human ear works (so it ignores differences a normal person would not hear anyway). That's great, since we want our metrics to measure the user experience as much as possible.  There are many aspects of voice transfer you could measure, such as latency (which is really important for voice calls), but for now we'll focus on measuring how much a voice audio stream gets distorted by the transfer. <br /><br /><span style="font-size: large;"><b>Feeding Audio Into WebRTC </b></span><br />In the WebRTC case we already <a href="https://code.google.com/p/chromium/codesearch#chromium/src/chrome/browser/media/chrome_webrtc_browsertest.cc&amp;sq=package:chromium">had a test</a> which would launch a Chrome browser, open two tabs, get the tabs talking to each other through a signaling server and set up a call on a single machine. Then we just needed to figure out how to feed a reference audio file into a WebRTC call and record what comes out on the other end. This part was actually harder than it sounds. 
The main WebRTC use case is that the web page acquires the user's mic through <a href="http://dev.w3.org/2011/webrtc/editor/getusermedia.html">getUserMedia</a>, sets up a PeerConnection with some remote peer and sends the audio from the mic through the connection to the peer where it is played in the peer's audio output device.  <br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-vYo0ghEwlHk/Un0uYrQ13WI/AAAAAAAAAD4/NBy_CaccBYk/s1600/img1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="316" src="http://4.bp.blogspot.com/-vYo0ghEwlHk/Un0uYrQ13WI/AAAAAAAAAD4/NBy_CaccBYk/s640/img1.png" width="640" /></a></div><br /><br /><div style="text-align: center;"><i>WebRTC calls transmit voice, video and data peer-to-peer, over the Internet. </i></div><br />But since this is an automated test, of course we could not have someone speak in a microphone every time the test runs; we had to feed in a known input file, so we had something to compare the recorded output audio against.  <br /><br />Could we duct-tape a small stereo to the mic and play our audio file on the stereo? That's not very maintainable or reliable, not to mention annoying for anyone in the vicinity. What about some kind of fake device driver which makes a microphone-like device appear on the device level? The problem with that is that it's hard to control a driver from the userspace test program. Also, the test will be more complex and flaky, and the driver interaction will not be portable.<a href="http://googletesting.blogspot.com/2013/11/webrtc-audio-quality-testing.html#foot1"><sup>[1]</sup></a><br /><br />Instead, we chose to sidestep this problem. 
We used a solution where we load an audio file with <a href="https://dvcs.w3.org/hg/audio/raw-file/tip/webaudio/specification.html">WebAudio</a> and play that straight into the peer connection through the <a href="https://dvcs.w3.org/hg/audio/raw-file/tip/webaudio/webrtc-integration.html">WebAudio-PeerConnection integration</a>. That way we start the playing of the file from the same renderer process as the call itself, which made it a lot easier to time the start and end of the file. We still needed to be careful to avoid playing the file too early or too late, so we don't clip the audio at the start or end - that would destroy our PESQ scores! - but it turned out to be a workable approach.<a href="http://googletesting.blogspot.com/2013/11/webrtc-audio-quality-testing.html#foot2"><sup>[2]</sup></a><br /><br /><span style="font-size: large;"><b>Recording the Output </b></span><br />Alright, so now we could get a WebRTC call set up with a known audio file with decent control of when the file starts playing. Now we had to record the output. There are a number of possible solutions. The most end-to-end way is to straight up record what the system sends to default audio out (like speakers or headphones). Alternatively, we could write a hook in our application to dump our audio as late as possible, like when we're just about to send it to the sound card. <br /><br />We went with the former. Our colleagues in the Chrome video stack team in Kirkland had already found  that it's possible to configure a Windows or Linux machine to send the system's audio output (i.e. what plays on the speakers) to a virtual recording device. If we make that virtual recording device the default one, simply invoking SoundRecorder.exe and arecord respectively will record what the system is playing out.  
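For illustration, here is a minimal Python sketch of how the Linux leg of this recording step could be driven (file names are hypothetical, and the actual test is a C++ browser test; this only shows the arecord invocation against the virtual device described above):

```python
import subprocess

def build_record_command(output_wav, duration_secs, device="default"):
    """Build an arecord invocation for a Linux test machine configured so
    that its default recording device is a virtual device carrying the
    system's audio output - i.e. this captures what plays on the speakers."""
    return [
        "arecord",
        "-D", device,              # the virtual loopback recording device
        "-f", "dat",               # 48 kHz, 16-bit, stereo
        "-d", str(duration_secs),  # stop after the call plus safety margins
        output_wav,
    ]

def record_call_output(output_wav, duration_secs):
    # Blocks until arecord finishes; raises if the recording fails.
    subprocess.run(build_record_command(output_wav, duration_secs), check=True)
```

On Windows the same role is played by SoundRecorder.exe; only the command line differs.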
<br /><br />They found this works well if one also uses the <a href="http://sox.sourceforge.net/">sox</a> utility to eliminate silence around the actual audio content (recall we had some safety margins at both ends to ensure we record the whole input file as it plays through the WebRTC call). We adopted the same approach, since it records what the user would hear, and yet uses only standard tools. This means we don't have to install additional software on the myriad machines that will run this test.<a href="http://googletesting.blogspot.com/2013/11/webrtc-audio-quality-testing.html#foot3"><sup>[3]</sup></a><br /><br /><span style="font-size: large;"><b>Analyzing Audio </b></span><br />The only remaining step was to compare the silence-eliminated recording with the input file. When we first did this, we got a really bad score (like 2.0 out of 5.0, which means PESQ thinks it’s barely legible). This didn't seem to make sense, since both the input and recording sounded very similar. Turns out we didn’t think about the following: <br /><br /><ul style="line-height: 1em; margin-bottom: 0px; margin-top: 0px; padding-bottom: 0px; padding-top: 0px;"><li>We were comparing a full-band (24 kHz) input file to a wide-band (8 kHz) result (although both files were sampled at 48 kHz). This essentially amounted to a low pass filtering of the result file. </li><li>Both files were in stereo, but PESQ is only mono-aware. </li><li>The files were 32-bit, but the PESQ implementation is designed for 16 bits. </li></ul><br />As you can see, it’s important to pay attention to what format arecord and SoundRecorder.exe record in, and make sure the input file is recorded in the same way. 
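A sketch of the kind of sox normalization that avoids such mismatches (the exact bit depth, channel count, and sample rate below are illustrative assumptions; match them to whatever your recorder actually emits):

```python
def build_normalize_command(in_wav, out_wav):
    """Build a sox invocation that converts a file to the common format
    both sides of the PESQ comparison must share. Illustrative settings:
    16-bit samples (the PESQ implementation expects 16 bits), mono
    (PESQ is only mono-aware), and one agreed-on sample rate."""
    return [
        "sox", in_wav,
        "-b", "16",     # output bit depth: 16-bit
        "-c", "1",      # output channels: downmix stereo to mono
        "-r", "16000",  # output sample rate: wide-band, 16 kHz
        out_wav,
    ]
```

Running the same conversion on both the reference file and the recording keeps the comparison apples-to-apples.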
After correcting the input file and “rebasing”, we got the score up to about 4.0.<a href="http://googletesting.blogspot.com/2013/11/webrtc-audio-quality-testing.html#foot4"><sup>[4]</sup></a><br /><br />Thus, we ended up with an automated test that runs continuously on the torrent of Chrome change lists and protects WebRTC's ability to transmit sound. You can see the <a href="https://code.google.com/p/chromium/codesearch#chromium/src/chrome/browser/media/chrome_webrtc_audio_quality_browsertest.cc&amp;q=chrome_webrtc_a&amp;sq=package:chromium">finished code here</a>. With automated tests and cleverly chosen metrics you can protect against most regressions a user would notice. If your product includes video and audio handling, such a test is a great addition to your testing mix.  <br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-pYtwelNAxwE/Un0vxiP2SdI/AAAAAAAAAEE/7Ta3KOYspd0/s1600/img2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-pYtwelNAxwE/Un0vxiP2SdI/AAAAAAAAAEE/7Ta3KOYspd0/s1600/img2.png" /></a></div><br /><div style="text-align: center;"><i>How the components of the test fit together. </i></div><br /><span style="font-size: large;"><b>Future work </b></span><br /><br /><ul style="line-height: 1em; margin-bottom: 0px; margin-top: 0px; padding-bottom: 0px; padding-top: 0px;"><li>It might be possible to write a Chrome extension which dumps the audio from Chrome to a file. That way we get a simpler-to-maintain and portable solution. It would be less end-to-end but more than worth it due to the simplified maintenance and setup. Also, the recording tools we use are not perfect and add some distortion, which makes the score less accurate. 
</li><li>There are other algorithms than PESQ to consider - for instance, <a href="http://en.wikipedia.org/wiki/POLQA">POLQA</a> is the successor to PESQ and is better at analyzing high-bandwidth audio signals. </li><li>We are working on a solution which will run this test under simulated network conditions. Simulated networks combined with this test are a really powerful way to test our behavior under various packet loss and delay scenarios and ensure we deliver a good experience to all our users, not just those with great broadband connections. Stay tuned for future articles on that topic! </li><li>Investigate feasibility of running this set-up on mobile devices. </li></ul><br /><br /><hr /><br /><sup id="foot1">1</sup>It would be tolerable if the driver was just looping the input file, eliminating the need for the test to control the driver (i.e. the test doesn't have to tell the driver to start playing the file). This is actually what we do in the video quality test. It's a much better fit to take this approach on the video side since each recorded video frame is independent of the others. We can easily embed barcodes into each frame and evaluate them independently.  <br /><br />This seems much harder for audio. We could possibly do <a href="http://en.wikipedia.org/wiki/Audio_watermark">audio watermarking</a>, or we could embed a kind of start marker (for instance, using DTMF tones) in the first two seconds of the input file and play the real content after that, and then do some fancy audio processing on the receiving end to figure out the start and end of the input audio. We chose not to pursue this approach due to its complexity. <br /><br /><sup id="foot2">2</sup>Unfortunately, this also means we will not test the capturer path (which handles microphones, etc. in WebRTC). This is an example of the frequent tradeoffs one has to make when designing an end-to-end test. 
Often we have to trade end-to-endness (how close the test is to the user experience) for robustness and simplicity of a test. It's not worth it to cover 5% more of the code if the test becomes unreliable or radically more expensive to maintain. Another example: a WebRTC call will generally involve two peers on different devices separated by the real-world internet. Writing such a test and making it reliable would be extremely difficult, so we make the test single-machine and hope we catch most of the bugs anyway. <br /><br /><sup id="foot3">3</sup>It's important to keep the continuous build setup simple and the build machines easy to configure - otherwise you will inevitably pay a heavy price in maintenance when you try to scale your testing up. <br /><br /><sup id="foot4">4</sup>When sending audio over the internet, we have to compress it since lossless audio consumes way too much bandwidth. WebRTC audio generally sounds great, but there are still compression artifacts if you listen closely (and, in fact, the recording tools are not perfect and add some distortion as well). Given that this test is more about detecting regressions than measuring some absolute notion of quality, we'd like to downplay those artifacts. As our Kirkland colleagues found, one of the ways to do that is to "rebase" the input file. That means we start with a pristine recording, feed that through the WebRTC call and record what comes out on the other end. After manually verifying the quality, we use that as our input file for the actual test. In our case, it pushed our PESQ score up from 3 to about 4 (out of 5), which gives us a bit more sensitivity to regressions. <br /><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/webrtc-audio-quality-testing/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
<enclosure url="" length="" type="" />
		</item>
		<item>
		<title>Espresso for Android is here!</title>
		<link>https://googledata.org/google-testing/espresso-for-android-is-here/</link>
		<comments>https://googledata.org/google-testing/espresso-for-android-is-here/#comments</comments>
		<pubDate>Fri, 18 Oct 2013 21:26:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>
		<category><![CDATA[android]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=dd78de03a255f68c0200bd23fdb348df</guid>
		<description><![CDATA[Cross-posted from the Android Developers Google+ PageEarlier this year, we presented Espresso at GTAC as a solution to the UI testing problem. Today we are announcing the launch of the developer preview for Espresso!The compelling thing about developin...]]></description>
				<content:encoded><![CDATA[<i>Cross-posted from the <a href="https://plus.sandbox.google.com/+AndroidDevelopers/posts">Android Developers Google+ Page</a></i><br /><br />Earlier this year, we presented Espresso at GTAC as a solution to the UI testing problem. Today we are announcing the launch of the developer preview for Espresso!<br /><br />Our driving goal in developing Espresso was to make it easy and fun for developers to write reliable UI tests. Espresso has a small, predictable, and easy-to-learn API, which is still open for customization. But most importantly, Espresso removes the need to think about the complexity of multi-threaded testing. With Espresso, you can think procedurally and write concise, beautiful, and reliable Android UI tests quickly.<br /><br />Espresso is now being used by over 30 applications within Google (Drive, Maps and G+, just to name a few). Starting from today, Espresso will also be available to our great developer community. We hope you will also enjoy testing your applications with Espresso, and we look forward to your feedback and contributions!<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-KD6xUPTsNjs/UmGmdurLO5I/AAAAAAAAADo/mjf_ZPao27c/s1600/logo.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-KD6xUPTsNjs/UmGmdurLO5I/AAAAAAAAADo/mjf_ZPao27c/s1600/logo.jpeg" /></a></div><br /><div style="text-align: center;">Android Test Kit: <a href="https://code.google.com/p/android-test-kit/">https://code.google.com/p/android-test-kit/</a></div><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/espresso-for-android-is-here/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
<enclosure url="" length="" type="" />
		</item>
		<item>
		<title>How the Google+ Team Tests Mobile Apps</title>
		<link>https://googledata.org/google-testing/how-the-google-team-tests-mobile-apps/</link>
		<comments>https://googledata.org/google-testing/how-the-google-team-tests-mobile-apps/#comments</comments>
		<pubDate>Fri, 30 Aug 2013 19:30:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>
		<category><![CDATA[mobile]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=5e246733be683e3be8ffbfcaa87f9fe1</guid>
		<description><![CDATA[<i>by Eduardo Bravo Ortiz </i><br /><br />&#8220;Mobile first&#8221; is the motto these days for many companies. However, being able to test a mobile app in a meaningful way is very challenging. On the Google+ team we have had our share of trial and error that has led us to successful strategies for testing mobile applications on both iOS and Android. <br /><br /><span><b>General </b></span><br /><br /><ul><li>Understand the platform. Testing on Android is not the same as testing on iOS. The testing tools and frameworks available for each platform are significantly different. (e.g., Android uses Java while iOS uses Objective-C, UI layouts are built differently on each platform, and UI testing frameworks also work very differently on the two platforms.) </li><li>Stabilize your test suite and test environments. Flaky tests are worse than having no tests, because a flaky test pollutes your build health and decreases the credibility of your suite. </li><li>Break down testing into manageable pieces. There are too many complex pieces when testing on mobile (e.g., emulator/device state, actions triggered by the OS).  </li><li>Provide a <a href="http://googletesting.blogspot.com/2012/10/hermetic-servers.html">hermetic test</a> environment for your tests. Mobile UI tests are flaky by nature; don&#8217;t add more flakiness to them by having external dependencies. </li><li>Unit tests are the backbone of your mobile test strategy. Try to separate the app code logic from the UI as much as possible. This separation will make unit tests more granular and faster. </li></ul><br /><br /><b><span>Android Testing </span></b><br /><br /><b>Unit Tests </b><br />Separating UI code from code logic is especially hard in Android. For example, an Activity is expected to act as a controller and view at the same time; make sure you keep this in mind when writing unit tests. 
Another useful recommendation is to decouple unit tests from the Android emulator; this removes the need to build and install an APK, and your tests will run much faster. <a href="http://pivotal.github.io/robolectric/">Robolectric</a> is a perfect tool for this; it stubs the implementation of the Android platform while running tests. <br /><br /><b>Hermetic UI Tests </b><br />A hermetic UI test typically runs as a test without network calls or external dependencies. Once the tests can run in a <a href="http://googletesting.blogspot.com/2012/10/hermetic-servers.html">hermetic environment</a>, a white box testing framework like <a href="http://www.youtube.com/watch?v=T7ugmCuNxDU">Espresso</a> can simulate user actions on the UI and is tightly coupled to the app code. Espresso will also synchronize your test actions with events on the UI thread, reducing flakiness. More information on Espresso is coming in a future Google Testing Blog article. <br /><br /><div><b>Diagram: Non-Hermetic Flow vs. Hermetic Flow </b></div><div><a href="http://1.bp.blogspot.com/--yeG5Bgqzas/UiDv4W5iyWI/AAAAAAAAADQ/hwPYIYmaz9M/s1600/diagram1.png"><img border="0" height="199" src="http://1.bp.blogspot.com/--yeG5Bgqzas/UiDv4W5iyWI/AAAAAAAAADQ/hwPYIYmaz9M/s640/diagram1.png" width="640"></a></div><br /><br /><b>Monkey Tests </b><br />Monkey tests look for crashes and <a href="http://developer.android.com/training/articles/perf-anr.html">ANRs</a> by stressing your Android application. They exercise <a href="http://en.wikipedia.org/wiki/Pseudorandom_number_generator">pseudo-random</a> events like clicks or gestures on the app under test. Monkey test results are reproducible to a certain extent; timing and latency are not completely under your control and can cause a test failure. Re-running the same monkey test against the same configuration will often reproduce these failures, though. 
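To make that reproducibility concrete, here is a small sketch (package name hypothetical) of pinning down a run of Android's built-in monkey tool by fixing its random seed:

```python
def build_monkey_command(package, event_count, seed):
    """Build an 'adb shell monkey' invocation. Fixing the seed makes the
    pseudo-random event stream repeatable, so a crash found in one run
    can usually be reproduced against the same configuration."""
    return [
        "adb", "shell", "monkey",
        "-p", package,           # restrict events to the app under test
        "-s", str(seed),         # fixed seed -> same event sequence
        "--throttle", "250",     # ms between events; tames timing flakiness a bit
        str(event_count),
    ]
```

For example, build_monkey_command("com.example.app", 5000, seed=42) would stress a hypothetical app with 5000 events, and rerunning with the same seed replays the same sequence.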
If you run them daily against different SDKs, they are very effective at catching bugs earlier in the development cycle of a new release. <br /><br /><b><span>iOS Testing </span></b><br /><br /><b>Unit Tests </b><br />Unit test frameworks like <a href="http://www.sente.ch/software/ocunit/">OCUnit</a>, which comes bundled with Xcode, or <a href="https://code.google.com/p/google-toolbox-for-mac/">GTMSenTestCase</a> are both good choices. <br /><br /><b>Hermetic UI Tests </b><br /><a href="https://github.com/kif-framework/KIF">KIF</a> has proven to be a powerful solution for writing Objective-C UI tests. It runs in-process, which allows tests to be more tightly coupled with the app under test, making the tests inherently more stable. KIF allows iOS developers to write tests using the same language as their application. <br /><br />Following the same paradigm as Android UI tests, you want Objective-C tests to be hermetic. A good approach is to mock the server with pre-canned responses. Since KIF tests run in-process, responses can be built programmatically, making tests easier to maintain and more stable. <br /><br /><b>Monkey Tests </b><br />Unlike Android, iOS has no equivalent native tool for writing monkey tests; however, this type of test still adds value on iOS (e.g. we found 16 crashes in one of our recent Google+ releases). The Google+ team developed their own custom monkey testing framework, but there are also many third-party options available. <br /><br /><b><span>Backend Testing </span></b><br /><br />A mobile testing strategy is not complete without testing the integration between server backends and mobile clients. This is especially true when the release cycles of the mobile clients and backends are very different. A replay test strategy can be very effective at preventing backends from breaking mobile clients. The theory behind this strategy is to simulate mobile clients by having a set of golden request and response files that are known to be correct. 
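The golden-file comparison can be sketched as a diff that ignores fields known to vary between runs (the field names below are hypothetical; a real suite would diff full serialized responses):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Sketch of a replay-test diff: compare a live response to a golden
// response while ignoring fields that legitimately differ between
// runs (timestamps, request IDs, and so on).
public class GoldenDiff {
    public static boolean matches(Map<String, String> golden,
                                  Map<String, String> actual,
                                  Set<String> ignoredFields) {
        Map<String, String> g = new HashMap<>(golden);
        Map<String, String> a = new HashMap<>(actual);
        for (String field : ignoredFields) {   // drop nondeterministic fields
            g.remove(field);
            a.remove(field);
        }
        return g.equals(a);                    // the rest must match exactly
    }
}
```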
The replay test suite should then send golden requests to the backend server and assert that the response returned by the server matches the expected golden response. Since client/server responses are often not completely deterministic, you will need to use a diffing tool that can ignore expected differences. <br /><br />To make this strategy successful, you need a way to seed a repeatable data set on the backend and make all dependencies that are not relevant to your backend hermetic. Using in-memory servers with fake data or an RPC replay to external dependencies are good ways of achieving repeatable data sets and hermetic environments. The Google+ mobile backend uses <a href="https://code.google.com/p/google-guice/">Guice</a> for dependency injection, which allows us to easily swap out dependencies with fake implementations during testing and seed data fixtures. <br /><br /><div><b>Diagram: Normal flow vs Replay Tests flow </b></div><div><a href="http://3.bp.blogspot.com/-3w15bC7RSj8/UiDwRP-Ob_I/AAAAAAAAADY/9gmTy3HDMZg/s1600/diagram2.png"><img border="0" height="230" src="http://3.bp.blogspot.com/-3w15bC7RSj8/UiDwRP-Ob_I/AAAAAAAAADY/9gmTy3HDMZg/s640/diagram2.png" width="640"></a></div><br /><br /><b><span>Conclusion </span></b><br /><br />Mobile app testing can be very challenging, but building a comprehensive test strategy that understands the nature of different platforms and tools is the key to success. Providing a reliable and hermetic test environment is as important as the tests you write. <br /><br />Finally, make sure you prioritize your automation efforts according to your team&#8217;s needs. This is how we prioritize on the Google+ team: <br /><ol><li>Unit tests: These should be your first priority in either Android or iOS. They run fast and are less flaky than any other type of test. </li><li>Backend tests: Make sure your backend doesn&#8217;t break your mobile clients. 
Breakages are very likely to happen when the release cycles of mobile clients and backends are different. </li><li>UI tests: These are slower and flakier by nature. They also take more time to write and maintain. Make sure you provide coverage for at least the critical paths of your app. </li><li>Monkey tests: This is the final step to complete your mobile automation strategy. </li></ol><br /><br />Happy mobile testing from the Google+ team.]]></description>
				<content:encoded><![CDATA[<i>by Eduardo Bravo Ortiz </i><br /><br />“Mobile first” is the motto these days for many companies. However, being able to test a mobile app in a meaningful way is very challenging. On the Google+ team we have had our share of trial and error that has led us to successful strategies for testing mobile applications on both iOS and Android. <br /><br /><span style="color: blue;"><b>General </b></span><br /><br /><ul style="line-height: 1em; margin-bottom: 0px; margin-top: 0px; padding-bottom: 0px; padding-top: 0px;"><li>Understand the platform. Testing on Android is not the same as testing on iOS. The testing tools and frameworks available for each platform are significantly different. (e.g., Android uses Java while iOS uses Objective-C, UI layouts are built differently on each platform, UI testing frameworks also work very differently in both platforms.) </li><li>Stabilize your test suite and test environments. Flaky tests are worse than having no tests, because a flaky test pollutes your build health and decreases the credibility of your suite. </li><li>Break down testing into manageable pieces. There are too many complex pieces when testing on mobile (e.g., emulator/device state, actions triggered by the OS).  </li><li>Provide a <a href="http://googletesting.blogspot.com/2012/10/hermetic-servers.html">hermetic test</a> environment for your tests. Mobile UI tests are flaky by nature; don’t add more flakiness to them by having external dependencies. </li><li>Unit tests are the backbone of your mobile test strategy. Try to separate the app code logic from the UI as much as possible. This separation will make unit tests more granular and faster. </li></ul><br /><br /><b><span style="color: blue;">Android Testing </span></b><br /><br /><b>Unit Tests </b><br />Separating UI code from code logic is especially hard in Android. 
For example, an Activity is expected to act as a controller and view at the same time; make sure you keep this in mind when writing unit tests. Another useful recommendation is to decouple unit tests from the Android emulator; this removes the need to build and install an APK, so your tests run much faster. <a href="http://pivotal.github.io/robolectric/">Robolectric</a> is a perfect tool for this; it stubs the implementation of the Android platform while running tests. <br /><br /><b>Hermetic UI Tests </b><br />A hermetic UI test typically runs as a test without network calls or external dependencies. Once the tests can run in a <a href="http://googletesting.blogspot.com/2012/10/hermetic-servers.html">hermetic environment</a>, a white box testing framework like <a href="http://www.youtube.com/watch?v=T7ugmCuNxDU">Espresso</a> can simulate user actions on the UI and is tightly coupled to the app code. Espresso will also synchronize your test actions with events on the UI thread, reducing flakiness. More information on Espresso is coming in a future Google Testing Blog article. <br /><br /><div style="text-align: center;"><b>Diagram: Non-Hermetic Flow vs. Hermetic Flow </b></div><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/--yeG5Bgqzas/UiDv4W5iyWI/AAAAAAAAADQ/hwPYIYmaz9M/s1600/diagram1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="199" src="http://1.bp.blogspot.com/--yeG5Bgqzas/UiDv4W5iyWI/AAAAAAAAADQ/hwPYIYmaz9M/s640/diagram1.png" width="640" /></a></div><br /><br /><b>Monkey Tests </b><br />Monkey tests look for crashes and <a href="http://developer.android.com/training/articles/perf-anr.html">ANRs</a> by stressing your Android application. They exercise <a href="http://en.wikipedia.org/wiki/Pseudorandom_number_generator">pseudo-random</a> events like clicks or gestures on the app under test. 
Monkey test results are reproducible to a certain extent; timing and latency are not completely under your control and can cause a test failure. Re-running the same monkey test against the same configuration will often reproduce these failures, though. If you run them daily against different SDKs, they are very effective at catching bugs earlier in the development cycle of a new release. <br /><br /><b><span style="color: blue;">iOS Testing </span></b><br /><br /><b>Unit Tests </b><br />Unit test frameworks like <a href="http://www.sente.ch/software/ocunit/">OCUnit</a>, which comes bundled with Xcode, or <a href="https://code.google.com/p/google-toolbox-for-mac/">GTMSenTestCase</a> are both good choices. <br /><br /><b>Hermetic UI Tests </b><br /><a href="https://github.com/kif-framework/KIF">KIF</a> has proven to be a powerful solution for writing Objective-C UI tests. It runs in-process, which allows tests to be more tightly coupled with the app under test, making the tests inherently more stable. KIF allows iOS developers to write tests using the same language as their application. <br /><br />Following the same paradigm as Android UI tests, you want Objective-C tests to be hermetic. A good approach is to mock the server with pre-canned responses. Since KIF tests run in-process, responses can be built programmatically, making tests easier to maintain and more stable. <br /><br /><b>Monkey Tests </b><br />Unlike Android, iOS has no equivalent native tool for writing monkey tests; however, this type of test still adds value on iOS (e.g. we found 16 crashes in one of our recent Google+ releases). The Google+ team developed their own custom monkey testing framework, but there are also many third-party options available. <br /><br /><b><span style="color: blue;">Backend Testing </span></b><br /><br />A mobile testing strategy is not complete without testing the integration between server backends and mobile clients. 
This is especially true when the release cycles of the mobile clients and backends are very different. A replay test strategy can be very effective at preventing backends from breaking mobile clients. The theory behind this strategy is to simulate mobile clients by having a set of golden request and response files that are known to be correct. The replay test suite should then send golden requests to the backend server and assert that the response returned by the server matches the expected golden response. Since client/server responses are often not completely deterministic, you will need to utilize a diffing tool that can ignore expected differences. <br /><br />To make this strategy successful you need a way to seed a repeatable data set on the backend and make all dependencies that are not relevant to your backend hermetic. Using in-memory servers with fake data or an RPC replay to external dependencies are good ways of achieving repeatable data sets and hermetic environments. Google+ mobile backend uses <a href="https://code.google.com/p/google-guice/">Guice</a> for dependency injection, which allows us to easily swap out dependencies with fake implementations during testing and seed data fixtures. <br /><br /><div style="text-align: center;"><b>Diagram: Normal flow vs Replay Tests flow </b></div><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-3w15bC7RSj8/UiDwRP-Ob_I/AAAAAAAAADY/9gmTy3HDMZg/s1600/diagram2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="230" src="http://3.bp.blogspot.com/-3w15bC7RSj8/UiDwRP-Ob_I/AAAAAAAAADY/9gmTy3HDMZg/s640/diagram2.png" width="640" /></a></div><br /><br /><b><span style="color: blue;">Conclusion </span></b><br /><br />Mobile app testing can be very challenging, but building a comprehensive test strategy that understands the nature of different platforms and tools is the key to success. 
Providing a reliable and hermetic test environment is as important as the tests you write. <br /><br />Finally, make sure you prioritize your automation efforts according to your team’s needs. This is how we prioritize on the Google+ team: <br /><ol style="line-height: 1em; margin-bottom: 0px; margin-top: 0px; padding-bottom: 0px; padding-top: 0px;"><li>Unit tests: These should be your first priority in either Android or iOS. They run fast and are less flaky than any other type of test. </li><li>Backend tests: Make sure your backend doesn’t break your mobile clients. Breakages are very likely to happen when the release cycles of mobile clients and backends are different. </li><li>UI tests: These are slower and flakier by nature. They also take more time to write and maintain. Make sure you provide coverage for at least the critical paths of your app. </li><li>Monkey tests: This is the final step to complete your mobile automation strategy. </li></ol><br /><br />Happy mobile testing from the Google+ team.]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/how-the-google-team-tests-mobile-apps/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Testing on the Toilet: Know Your Test Doubles</title>
		<link>https://googledata.org/google-testing/testing-on-the-toilet-know-your-test-doubles/</link>
		<comments>https://googledata.org/google-testing/testing-on-the-toilet-know-your-test-doubles/#comments</comments>
		<pubDate>Thu, 18 Jul 2013 17:45:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=f8e8c80bdcc8721d1ed82d91ddf9ae69</guid>
		<description><![CDATA[By Andrew Trenk <br /><br /><i>This article was adapted from a <a href="http://googletesting.blogspot.com/2007/01/introducing-testing-on-toilet.html">Google Testing on the Toilet</a> (TotT) episode. You can download a <a href="https://docs.google.com/file/d/0B_EQS8-2f7m5Y1lTcDRnekk3Rjg/edit?usp=sharing">printer-friendly version</a> of this TotT episode and post it in your office. </i><br /><br /><b>A test double is an object that can stand in for a real object in a test</b>, similar to how a stunt double stands in for an actor in a movie. All of these are commonly referred to as &#8220;mocks&#8221;, but it's important to distinguish between the different types of test doubles since they all have different uses. <b>The most common types of test doubles are stubs, mocks, and fakes. </b><br /><br />A <b>stub</b> has no logic, and only returns what you tell it to return. <b>Stubs can be used when you need an object to return specific values in order to get your code under test into a certain state.</b> While it's usually easy to write stubs by hand, using a mocking framework is often a convenient way to reduce boilerplate. <br /><br /><pre><span>// Pass in a stub that was created by a mocking framework.</span><br />AccessManager <b>accessManager</b> = <span>new</span> AccessManager(<b>stubAuthenticationService</b>);<br /><span>// The user shouldn't have access when the authentication service returns false.</span><br />when(<b>stubAuthenticationService</b>.isAuthenticated(USER_ID)).thenReturn(<span>false</span>);<br />assertFalse(<b>accessManager</b>.userHasAccess(USER_ID));<br /><span>// The user should have access when the authentication service returns true. 
</span><br />when(<b>stubAuthenticationService</b>.isAuthenticated(USER_ID)).thenReturn(<span>true</span>);<br />assertTrue(<b>accessManager</b>.userHasAccess(USER_ID));<br /></pre><br />A <b>mock</b> has expectations about the way it should be called, and a test should fail if it&#8217;s not called that way. <b>Mocks are used to test interactions between objects</b>, and are useful in cases where there are no other visible state  changes or return results that you can verify (e.g. if your code reads from disk and you want to ensure that it doesn't do more than one disk read, you can use a mock to verify that the method that does the read is only called once). <br /><br /><pre><span>// Pass in a mock that was created by a mocking framework.</span><br />AccessManager <b>accessManager</b> = <span>new</span> AccessManager(<b>mockAuthenticationService</b>);<br /><b>accessManager</b>.userHasAccess(USER_ID);<br /><span>// The test should fail if accessManager.userHasAccess(USER_ID) didn't call</span><br /><span>// mockAuthenticationService.isAuthenticated(USER_ID) or if it called it more than once.</span><br />verify(<b>mockAuthenticationService</b>).isAuthenticated(USER_ID);</pre><br />A <b>fake</b> doesn&#8217;t use a mocking framework: it&#8217;s a lightweight implementation of an API that behaves like the real implementation, but isn't suitable for production (e.g. an in-memory database). <b>Fakes can be used when you can't use a real implementation in your test</b> (e.g. if the real implementation is too slow or it talks over the network). You shouldn't need to write your own fakes often since fakes should usually be created and maintained by the person or team that owns the real implementation. 
<br /><br /><pre><span>// Creating the fake is fast and easy.</span><br />AuthenticationService <b>fakeAuthenticationService</b> = <span>new</span> FakeAuthenticationService();<br />AccessManager <b>accessManager</b> = <span>new</span> AccessManager(<b>fakeAuthenticationService</b>);<br /><span>// The user shouldn't have access since the authentication service doesn't</span><br /><span>// know about the user.</span><br />assertFalse(<b>accessManager</b>.userHasAccess(USER_ID));<br /><span>// The user should have access after it's added to the authentication service.</span><br /><b>fakeAuthenticationService</b>.addAuthenticatedUser(USER_ID);<br />assertTrue(<b>accessManager</b>.userHasAccess(USER_ID));</pre><br /><i>The term &#8220;test double&#8221; was coined by Gerard Meszaros in the book xUnit Test Patterns. You can find more information about test doubles in the book, or on the book&#8217;s <a href="http://xunitpatterns.com/Test%20Double%20Patterns.html">website</a>. You can also find a discussion about the different types of test doubles in this <a href="http://martinfowler.com/articles/mocksArentStubs.html">article</a> by Martin Fowler. </i><br /><br />]]></description>
				<content:encoded><![CDATA[By Andrew Trenk <br /><br /><i>This article was adapted from a <a href="http://googletesting.blogspot.com/2007/01/introducing-testing-on-toilet.html">Google Testing on the Toilet</a> (TotT) episode. You can download a <a href="https://docs.google.com/file/d/0B_EQS8-2f7m5Y1lTcDRnekk3Rjg/edit?usp=sharing">printer-friendly version</a> of this TotT episode and post it in your office. </i><br /><br /><b>A test double is an object that can stand in for a real object in a test</b>, similar to how a stunt double stands in for an actor in a movie. All of these are commonly referred to as “mocks”, but it's important to distinguish between the different types of test doubles since they all have different uses. <b>The most common types of test doubles are stubs, mocks, and fakes. </b><br /><br />A <b>stub</b> has no logic, and only returns what you tell it to return. <b>Stubs can be used when you need an object to return specific values in order to get your code under test into a certain state.</b> While it's usually easy to write stubs by hand, using a mocking framework is often a convenient way to reduce boilerplate. <br /><br /><pre style="background: #f9f9f9; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;"><span style="color: #4f8f00;">// Pass in a stub that was created by a mocking framework.</span><br />AccessManager <b>accessManager</b> = <span style="color: #0433ff;">new</span> AccessManager(<b>stubAuthenticationService</b>);<br /><span style="color: #4f8f00;">// The user shouldn't have access when the authentication service returns false.</span><br />when(<b>stubAuthenticationService</b>.isAuthenticated(USER_ID)).thenReturn(<span style="color: #0433ff;">false</span>);<br />assertFalse(<b>accessManager</b>.userHasAccess(USER_ID));<br /><span style="color: #4f8f00;">// The user should have access when the authentication service returns true. 
</span><br />when(<b>stubAuthenticationService</b>.isAuthenticated(USER_ID)).thenReturn(<span style="color: #0433ff;">true</span>);<br />assertTrue(<b>accessManager</b>.userHasAccess(USER_ID));<br /></pre><br />A <b>mock</b> has expectations about the way it should be called, and a test should fail if it’s not called that way. <b>Mocks are used to test interactions between objects</b>, and are useful in cases where there are no other visible state  changes or return results that you can verify (e.g. if your code reads from disk and you want to ensure that it doesn't do more than one disk read, you can use a mock to verify that the method that does the read is only called once). <br /><br /><pre style="background: #f9f9f9; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;"><span style="color: #4f8f00;">// Pass in a mock that was created by a mocking framework.</span><br />AccessManager <b>accessManager</b> = <span style="color: #0433ff;">new</span> AccessManager(<b>mockAuthenticationService</b>);<br /><b>accessManager</b>.userHasAccess(USER_ID);<br /><span style="color: #4f8f00;">// The test should fail if accessManager.userHasAccess(USER_ID) didn't call</span><br /><span style="color: #4f8f00;">// mockAuthenticationService.isAuthenticated(USER_ID) or if it called it more than once.</span><br />verify(<b>mockAuthenticationService</b>).isAuthenticated(USER_ID);</pre><br />A <b>fake</b> doesn’t use a mocking framework: it’s a lightweight implementation of an API that behaves like the real implementation, but isn't suitable for production (e.g. an in-memory database). <b>Fakes can be used when you can't use a real implementation in your test</b> (e.g. if the real implementation is too slow or it talks over the network). You shouldn't need to write your own fakes often since fakes should usually be created and maintained by the person or team that owns the real implementation. 
<br /><br /><pre style="background: #f9f9f9; border-color: #7b7b7b; border-style: solid; border-width: 1px; color: black; overflow: auto; padding: 10px;"><span style="color: #4f8f00;">// Creating the fake is fast and easy.</span><br />AuthenticationService <b>fakeAuthenticationService</b> = <span style="color: #0433ff;">new</span> FakeAuthenticationService();<br />AccessManager <b>accessManager</b> = <span style="color: #0433ff;">new</span> AccessManager(<b>fakeAuthenticationService</b>);<br /><span style="color: #4f8f00;">// The user shouldn't have access since the authentication service doesn't</span><br /><span style="color: #4f8f00;">// know about the user.</span><br />assertFalse(<b>accessManager</b>.userHasAccess(USER_ID));<br /><span style="color: #4f8f00;">// The user should have access after it's added to the authentication service.</span><br /><b>fakeAuthenticationService</b>.addAuthenticatedUser(USER_ID);<br />assertTrue(<b>accessManager</b>.userHasAccess(USER_ID));</pre><br /><i>The term “test double” was coined by Gerard Meszaros in the book xUnit Test Patterns. You can find more information about test doubles in the book, or on the book’s <a href="http://xunitpatterns.com/Test%20Double%20Patterns.html">website</a>. You can also find a discussion about the different types of test doubles in this <a href="http://martinfowler.com/articles/mocksArentStubs.html">article</a> by Martin Fowler. </i><br /><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/testing-on-the-toilet-know-your-test-doubles/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Optimal Logging</title>
		<link>https://googledata.org/google-testing/optimal-logging/</link>
		<comments>https://googledata.org/google-testing/optimal-logging/#comments</comments>
		<pubDate>Fri, 14 Jun 2013 14:58:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=99045ac8b2a79a62d8db2088c822d3fc</guid>
		<description><![CDATA[<i>by <a href="http://www.anthonysapps.com/anthony">Anthony Vallone</a></i><br /><br />How long does it take to find the root cause of a failure in your system? Five minutes? Five days? If you answered close to five minutes, it&#8217;s very likely that your production system and tests have great logging. All too often, seemingly unessential features like logging, exception handling, and (dare I say it) testing are an implementation afterthought. Like exception handling and testing, you really need to have a strategy for logging in both your systems and your tests. Never underestimate the power of logging. With optimal logging, you can even eliminate the need for debuggers. Below are some guidelines that have been useful to me over the years.<br /><br /><br /><b>Channeling Goldilocks</b><br /><br />Never log too much. Massive, disk-quota-burning logs are a clear indicator that little thought was put into logging. If you log too much, you&#8217;ll need to devise complex approaches to minimize disk access, maintain log history, archive large quantities of data, and query these large sets of data. More importantly, you&#8217;ll make it very difficult to find valuable information in all the chatter.<br /><br />The only thing worse than logging too much is logging too little. There are normally two main goals of logging: bug investigation and event confirmation. 
If your log can&#8217;t explain the cause of a bug or whether a certain transaction took place, you are logging too little.<br /><br />Good things to log: <br /><ul><li>Important startup configuration</li><li>Errors</li><li>Warnings</li><li>Changes to persistent data</li><li>Requests and responses between major system components</li><li>Significant state changes</li><li>User interactions</li><li>Calls with a known risk of failure</li><li>Waits on conditions that could take measurable time to satisfy</li><li>Periodic progress during long-running tasks</li><li>Significant branch points of logic and conditions that led to the branch</li><li>Summaries of processing steps or events from high level functions - Avoid logging every step of a complex process in low-level functions.</li></ul><br />Bad things to log: <br /><ul><li>Function entry - Don&#8217;t log a function entry unless it is significant or logged at the debug level.</li><li>Data within a loop - Avoid logging from many iterations of a loop. It is OK to log from iterations of small loops or to log periodically from large loops.</li><li>Content of large messages or files - Truncate or summarize the data in some way that will be useful to debugging.</li><li>Benign errors - Errors that are not really errors can confuse the log reader. This sometimes happens when exception handling is part of successful execution flow.</li><li>Repetitive errors - Do not repetitively log the same or similar error. This can quickly fill a log and hide the actual cause. Frequency of error types is best handled by monitoring. Logs only need to capture detail for some of those errors.</li></ul><br /><br /><b>There is More Than One Level</b><br /><br />Don't log everything at the same log level. Most logging libraries offer several log levels, and you can enable certain levels at system startup. 
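With java.util.logging, for example, the enabled level is a single setting per logger (a minimal sketch of the idea; the logger names are arbitrary, and most logging libraries expose something similar):

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Sketch: one logger, two configurations. In production everything
// except debug (FINE) is enabled; while developing or debugging,
// everything is enabled.
public class LogConfig {

    public static Logger productionLogger() {
        Logger logger = Logger.getLogger("prod");
        logger.setLevel(Level.CONFIG);  // INFO/WARNING/SEVERE pass; FINE does not
        return logger;
    }

    public static Logger debugLogger() {
        Logger logger = Logger.getLogger("debug");
        logger.setLevel(Level.ALL);     // everything, including FINE
        return logger;
    }
}
```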
This provides a convenient control for log verbosity.<br /><br />The classic levels are: <br /><ul><li>Debug - verbose and only useful while developing and/or debugging.</li><li>Info - the most popular level.</li><li>Warning - strange or unexpected states that are acceptable.</li><li>Error - something went wrong, but the process can recover.</li><li>Critical - the process cannot recover, and it will shut down or restart.</li></ul><br />Practically speaking, only two log configurations are needed: <br /><ul><li>Production - Every level is enabled except debug. If something goes wrong in production, the logs should reveal the cause.</li><li>Development &#38; Debug - While developing new code or trying to reproduce a production issue, enable all levels.</li></ul><br /><br /><b>Test Logs Are Important Too</b><br /><br />Log quality is equally important in test and production code. When a test fails, the log should clearly show whether the failure was a problem with the test or the production system. If it doesn't, then test logging is broken.<br /><br />Test logs should always contain: <br /><ul><li>Test execution environment</li><li>Initial state</li><li>Setup steps</li><li>Test case steps</li><li>Interactions with the system</li><li>Expected results</li><li>Actual results</li><li>Teardown steps</li></ul><br /><br /><b>Conditional Verbosity With Temporary Log Queues</b><br /><br />When errors occur, the log should contain a lot of detail. Unfortunately, the detail that led to an error is often unavailable once the error is encountered. Also, if you&#8217;ve followed advice about not logging too much, your log records prior to the error record may not provide adequate detail. A good way to solve this problem is to create temporary, in-memory log queues. Throughout processing of a transaction, append verbose details about each step to the queue. If the transaction completes successfully, discard the queue and log a summary. 
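A minimal sketch of such a queue (the class and method names are illustrative, and a real log sink replaces the in-memory list used here):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a temporary in-memory log queue: verbose detail is
// buffered per transaction and only emitted if the transaction fails.
public class TransactionLog {
    private final List<String> queue = new ArrayList<>();
    private final List<String> emitted = new ArrayList<>(); // stand-in for the real log sink

    public void detail(String message) {
        queue.add(message);                 // buffered, not yet written
    }

    public void success(String summary) {
        queue.clear();                      // discard the detail, stay terse
        emitted.add("INFO: " + summary);
    }

    public void failure(String error) {
        emitted.addAll(queue);              // flush all buffered detail
        queue.clear();
        emitted.add("ERROR: " + error);
    }

    public List<String> emittedLines() {
        return emitted;
    }
}
```

A successful transaction leaves one summary line in the log; a failed one leaves every buffered step followed by the error.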
If an error is encountered, log the content of the entire queue and the error. This technique is especially useful for test logging of system interactions.<br /><br /><br /><b>Failures and Flakiness Are Opportunities</b><br /><br />When production problems occur, you&#8217;ll obviously be focused on finding and correcting the problem, but you should also think about the logs. If you have a hard time determining the cause of an error, it's a great opportunity to improve your logging. Before fixing the problem, fix your logging so that the logs clearly show the cause. If this problem ever happens again, it&#8217;ll be much easier to identify.<br /><br />If you cannot reproduce the problem, or you have a flaky test, enhance the logs so that the problem can be tracked down when it happens again.<br /><br />Use failures to improve logging throughout the development process. While writing new code, try to refrain from using debuggers and only use the logs. Do the logs describe what is going on? If not, the logging is insufficient.<br /><br /><br /><b>Might As Well Log Performance Data</b><br /><br />Logged timing data can help debug performance issues. For example, it can be very difficult to determine the cause of a timeout in a large system, unless you can trace the time spent on every significant processing step. This can be easily accomplished by logging the start and finish times of calls that can take measurable time: <br /><ul><li>Significant system calls</li><li>Network requests</li><li>CPU-intensive operations</li><li>Connected device interactions</li><li>Transactions</li></ul><br /><br /><b>Following the Trail Through Many Threads and Processes</b><br /><br />You should create unique identifiers for transactions that involve processing across many threads and/or processes. The initiator of the transaction should create the ID, and it should be passed to every component that performs work for the transaction. 
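The ID-propagation idea can be sketched as follows (the names are illustrative; a real system would carry the ID in RPC or request metadata rather than as an explicit parameter everywhere):

```java
import java.util.UUID;

// Sketch: the initiator creates one ID; every component prefixes its
// log lines with that ID, so a single transaction can be filtered out
// of a log full of concurrent transactions.
public class TraceId {

    // Called once by the initiator of the transaction.
    public static String newId() {
        return UUID.randomUUID().toString();
    }

    // Each component formats its log lines with the ID it was handed.
    public static String logLine(String traceId, String component, String message) {
        return "[" + traceId + "] " + component + ": " + message;
    }
}
```

Grepping the log for `[<id>]` then yields the full trail of one transaction across threads and processes.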
This ID should be logged by each component when logging information about the transaction. This makes it much easier to trace a specific transaction when many transactions are being processed concurrently.<br /><br /><br /><b>Monitoring and Logging Complement Each Other</b><br /><br />A production service should have both logging and monitoring. Monitoring provides a real-time statistical summary of the system state. It can alert you if a percentage of certain request types is failing, if the system is experiencing unusual traffic patterns, if performance is degrading, or if other anomalies occur. In some cases, this information alone will clue you in to the cause of a problem. However, in most cases, a monitoring alert is simply a trigger for you to start an investigation. Monitoring shows the symptoms of problems. Logs provide details and state on individual transactions, so you can fully understand the cause of problems.<br /><div><br /></div>]]></description>
				<content:encoded><![CDATA[<i>by <a href="http://www.anthonysapps.com/anthony">Anthony Vallone</a></i><br /><br />How long does it take to find the root cause of a failure in your system? Five minutes? Five days? If you answered close to five minutes, it’s very likely that your production system and tests have great logging. All too often, seemingly unessential features like logging, exception handling, and (dare I say it) testing are an implementation afterthought. Like exception handling and testing, you really need to have a strategy for logging in both your systems and your tests. Never underestimate the power of logging. With optimal logging, you can even eliminate the necessity for debuggers. Below are some guidelines that have been useful to me over the years.<br /><br /><br /><b>Channeling Goldilocks</b><br /><br />Never log too much. Massive, disk-quota burning logs are a clear indicator that little thought was put into logging. If you log too much, you’ll need to devise complex approaches to minimize disk access, maintain log history, archive large quantities of data, and query these large sets of data. More importantly, you’ll make it very difficult to find valuable information in all the chatter.<br /><br />The only thing worse than logging too much is logging too little. There are normally two main goals of logging: help with bug investigation and event confirmation. 
If your log can’t explain the cause of a bug or whether a certain transaction took place, you are logging too little.<br /><br />Good things to log: <br /><ul style="line-height: 1em; margin-bottom: 0px; margin-top: 0px; padding-bottom: 0px; padding-top: 0px;"><li>Important startup configuration</li><li>Errors</li><li>Warnings</li><li>Changes to persistent data</li><li>Requests and responses between major system components</li><li>Significant state changes</li><li>User interactions</li><li>Calls with a known risk of failure</li><li>Waits on conditions that could take measurable time to satisfy</li><li>Periodic progress during long-running tasks</li><li>Significant branch points of logic and conditions that led to the branch</li><li>Summaries of processing steps or events from high level functions - Avoid logging every step of a complex process in low-level functions.</li></ul><br />Bad things to log: <br /><ul style="line-height: 1em; margin-bottom: 0px; margin-top: 0px; padding-bottom: 0px; padding-top: 0px;"><li>Function entry - Don’t log a function entry unless it is significant or logged at the debug level.</li><li>Data within a loop - Avoid logging from many iterations of a loop. It is OK to log from iterations of small loops or to log periodically from large loops.</li><li>Content of large messages or files - Truncate or summarize the data in some way that will be useful to debugging.</li><li>Benign errors - Errors that are not really errors can confuse the log reader. This sometimes happens when exception handling is part of successful execution flow.</li><li>Repetitive errors - Do not repetitively log the same or similar error. This can quickly fill a log and hide the actual cause. Frequency of error types is best handled by monitoring. Logs only need to capture detail for some of those errors.</li></ul><br /><br /><b>There is More Than One Level</b><br /><br />Don't log everything at the same log level. 
Most logging libraries offer several log levels, and you can enable certain levels at system startup. This provides a convenient control for log verbosity.<br /><br />The classic levels are: <br /><ul style="line-height: 1em; margin-bottom: 0px; margin-top: 0px; padding-bottom: 0px; padding-top: 0px;"><li>Debug - verbose and only useful while developing and/or debugging.</li><li>Info - the most popular level.</li><li>Warning - strange or unexpected states that are acceptable.</li><li>Error - something went wrong, but the process can recover.</li><li>Critical - the process cannot recover, and it will shut down or restart.</li></ul><br />Practically speaking, only two log configurations are needed: <br /><ul style="line-height: 1em; margin-bottom: 0px; margin-top: 0px; padding-bottom: 0px; padding-top: 0px;"><li>Production - Every level is enabled except debug. If something goes wrong in production, the logs should reveal the cause.</li><li>Development &amp; Debug - While developing new code or trying to reproduce a production issue, enable all levels.</li></ul><br /><br /><b>Test Logs Are Important Too</b><br /><br />Log quality is equally important in test and production code. When a test fails, the log should clearly show whether the failure was a problem with the test or production system. If it doesn't, then test logging is broken.<br /><br />Test logs should always contain: <br /><ul style="line-height: 1em; margin-bottom: 0px; margin-top: 0px; padding-bottom: 0px; padding-top: 0px;"><li>Test execution environment</li><li>Initial state</li><li>Setup steps</li><li>Test case steps</li><li>Interactions with the system</li><li>Expected results</li><li>Actual results</li><li>Teardown steps</li></ul><br /><br /><b>Conditional Verbosity With Temporary Log Queues</b><br /><br />When errors occur, the log should contain a lot of detail. Unfortunately, detail that led to an error is often unavailable once the error is encountered. 
Also, if you’ve followed advice about not logging too much, your log records prior to the error record may not provide adequate detail. A good way to solve this problem is to create temporary, in-memory log queues. Throughout processing of a transaction, append verbose details about each step to the queue. If the transaction completes successfully, discard the queue and log a summary. If an error is encountered, log the content of the entire queue and the error. This technique is especially useful for test logging of system interactions.<br /><br /><br /><b>Failures and Flakiness Are Opportunities</b><br /><br />When production problems occur, you’ll obviously be focused on finding and correcting the problem, but you should also think about the logs. If you have a hard time determining the cause of an error, it's a great opportunity to improve your logging. Before fixing the problem, fix your logging so that the logs clearly show the cause. If this problem ever happens again, it’ll be much easier to identify.<br /><br />If you cannot reproduce the problem, or you have a flaky test, enhance the logs so that the problem can be tracked down when it happens again.<br /><br />Use failures to improve logging throughout the development process. While writing new code, try to refrain from using debuggers and only use the logs. Do the logs describe what is going on? If not, the logging is insufficient.<br /><br /><br /><b>Might As Well Log Performance Data</b><br /><br />Logged timing data can help debug performance issues. For example, it can be very difficult to determine the cause of a timeout in a large system, unless you can trace the time spent on every significant processing step. 
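A minimal way to capture such timings (sketched with Python's standard logging and time modules; the helper, logger, and step names are hypothetical) is a context manager that logs the start and finish of each potentially slow step:

```python
import logging
import time
from contextlib import contextmanager

log = logging.getLogger("perf")  # hypothetical logger name

@contextmanager
def timed(step):
    """Log start and finish times (and duration) of a potentially slow step."""
    log.info("start: %s", step)
    start = time.monotonic()
    try:
        yield
    finally:
        log.info("finish: %s (%.3f s)", step, time.monotonic() - start)

# Usage: wrap any call that can take measurable time.
with timed("fetch user profile"):
    time.sleep(0.01)  # stands in for a network request
```

A pair of timestamped start/finish records per step is usually enough to see where a timeout's budget went.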
This can be easily accomplished by logging the start and finish times of calls that can take measurable time: <br /><ul style="line-height: 1em; margin-bottom: 0px; margin-top: 0px; padding-bottom: 0px; padding-top: 0px;"><li>Significant system calls</li><li>Network requests</li><li>CPU intensive operations</li><li>Connected device interactions</li><li>Transactions</li></ul><br /><br /><b>Following the Trail Through Many Threads and Processes</b><br /><br />You should create unique identifiers for transactions that involve processing across many threads and/or processes. The initiator of the transaction should create the ID, and it should be passed to every component that performs work for the transaction. This ID should be logged by each component when logging information about the transaction. This makes it much easier to trace a specific transaction when many transactions are being processed concurrently.<br /><br /><br /><b>Monitoring and Logging Complement Each Other</b><br /><br />A production service should have both logging and monitoring. Monitoring provides a real-time statistical summary of the system state. It can alert you if a percentage of certain request types are failing, the system is experiencing unusual traffic patterns, performance is degrading, or other anomalies occur. In some cases, this information alone will clue you in to the cause of a problem. However, in most cases, a monitoring alert is simply a trigger for you to start an investigation. Monitoring shows the symptoms of problems. Logs provide details and state on individual transactions, so you can fully understand the cause of problems.<br /><div><br /></div>]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/optimal-logging/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Two New Videos About Testing at Google</title>
		<link>https://googledata.org/google-testing/two-new-videos-about-testing-at-google/</link>
		<comments>https://googledata.org/google-testing/two-new-videos-about-testing-at-google/#comments</comments>
		<pubDate>Fri, 12 Apr 2013 15:13:00 +0000</pubDate>
		<dc:creator><![CDATA[Google Testing Bloggers]]></dc:creator>
				<category><![CDATA[Google Testing]]></category>

		<guid isPermaLink="false">https://googledata.org/?guid=a889a3dd5940fa0320fe26207a56e0de</guid>
		<description><![CDATA[by <a href="http://www.anthonysapps.com/anthony">Anthony Vallone</a><br /><br />We have two excellent, new videos to share about testing at Google. If you are curious about the work that our Test Engineers (TEs) and Software Engineers in Test (SETs) do, you&#8217;ll find both of these videos very interesting.<br /><br />The  <a href="http://www.youtube.com/lifeatgoogle">Life at Google</a>  team produced a video series called  <a href="http://www.youtube.com/watch?v=eZHEnPCEe0o&#38;feature=c4-overview-vl&#38;list=PLllx_3tLoo4fd1deqnzvyZrIrJzRdSC6-">Do Cool Things That Matter</a>.  This series includes a video from an SET and TE on the Maps team (Sean Jordan and Yvette Nameth) discussing their work on the Google Maps team.<br /><br /><a href="http://www.youtube.com/watch?v=FcwMMEsRFW4&#38;list=PLllx_3tLoo4fd1deqnzvyZrIrJzRdSC6-">Meet Yvette and Sean from the Google Maps Test Team</a><br /><br /><div></div><br /><br />The  <a href="https://plus.sandbox.google.com/u/0/101571483150813305324">Google Students</a>  team hosted a  <a href="http://www.google.com/+/learnmore/hangouts/onair.html">Hangouts On Air</a>  event with several Google SETs (Diego Salas, Karin Lundberg, Jonathan Velasquez, Chaitali Narla, and Dave Chen)  discussing the SET role.<br /><br /><a href="https://plus.google.com/u/0/101571483150813305324/posts/diXcYNHmMvM">Software Engineers in Test at Google - Covering your (Code)Bases</a><br /><br /><div></div><br /><br />Interested in joining the ranks of TEs or SETs at Google? Search for <a href="http://goo.gl/I0w5J">Google test jobs</a>.<br /><br />]]></description>
				<content:encoded><![CDATA[by <a href="http://www.anthonysapps.com/anthony">Anthony Vallone</a><br /><br />We have two excellent, new videos to share about testing at Google. If you are curious about the work that our Test Engineers (TEs) and Software Engineers in Test (SETs) do, you’ll find both of these videos very interesting.<br /><br />The  <a href="http://www.youtube.com/lifeatgoogle">Life at Google</a>  team produced a video series called  <a href="http://www.youtube.com/watch?v=eZHEnPCEe0o&amp;feature=c4-overview-vl&amp;list=PLllx_3tLoo4fd1deqnzvyZrIrJzRdSC6-">Do Cool Things That Matter</a>.  This series includes a video from an SET and TE on the Maps team (Sean Jordan and Yvette Nameth) discussing their work on the Google Maps team.<br /><br /><a href="http://www.youtube.com/watch?v=FcwMMEsRFW4&amp;list=PLllx_3tLoo4fd1deqnzvyZrIrJzRdSC6-">Meet Yvette and Sean from the Google Maps Test Team</a><br /><br /><div style="text-align: center;"><iframe allowfullscreen="" frameborder="0" height="315" src="http://www.youtube.com/embed/FcwMMEsRFW4?list=PLllx_3tLoo4fd1deqnzvyZrIrJzRdSC6-" width="560"></iframe></div><br /><br />The  <a href="https://plus.sandbox.google.com/u/0/101571483150813305324">Google Students</a>  team hosted a  <a href="http://www.google.com/+/learnmore/hangouts/onair.html">Hangouts On Air</a>  event with several Google SETs (Diego Salas, Karin Lundberg, Jonathan Velasquez, Chaitali Narla, and Dave Chen)  discussing the SET role.<br /><br /><a href="https://plus.google.com/u/0/101571483150813305324/posts/diXcYNHmMvM">Software Engineers in Test at Google - Covering your (Code)Bases</a><br /><br /><div style="text-align: center;"><iframe allowfullscreen="" frameborder="0" height="315" src="http://www.youtube.com/embed/OZk8nlcpe3w" width="560"></iframe></div><br /><br />Interested in joining the ranks of TEs or SETs at Google? Search for <a href="http://goo.gl/I0w5J">Google test jobs</a>.<br /><br />]]></content:encoded>
			<wfw:commentRss>https://googledata.org/google-testing/two-new-videos-about-testing-at-google/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
