About: Google Testing Bloggers

Profile: http://www.blogger.com/profile/03153388556673050910

Posts by Google Testing Bloggers

ThreadSanitizer: Slaughtering Data Races

Jun 30, 2014

by Dmitry Vyukov, Synchronization Lookout, Google, Moscow

Hello, I work in the Dynamic Testing Tools team at Google. Our team develops tools like AddressSanitizer, MemorySanitizer and ThreadSanitizer which find various kinds of bugs. In this blog post I…

Read | No Comments | Tags: Google Testing

GTAC 2014: Call for Proposals & Attendance

Jun 16, 2014

Posted by Anthony Vallone on behalf of the GTAC Committee

The application process is now open for presentation proposals and attendance for GTAC (Google Test Automation Conference) (see initial announcement), to be held at the Google Kirkland office (near Seattle, WA) on October 28-29, 2014.

GTAC will be streamed live on YouTube again this year, so even if you can’t attend, you’ll be able to watch the conference from your computer.

Speakers
Presentations are targeted at students, academics, and experienced engineers working on test automation. Full presentations and lightning talks are 45 minutes and 15 minutes respectively. Speakers should be prepared for a question and answer session following their presentation.

Application
For presentation proposals and/or attendance, complete this form. We will be selecting about 300 applicants for the event.

Deadline
The due date for both presentation and attendance applications is July 28, 2014.

Fees
There are no registration fees, and we will send out detailed registration instructions to each invited applicant. Meals will be provided, but speakers and attendees must arrange and pay for their own travel and accommodations.

Update: Our contact email was bouncing; this is now fixed.

GTAC 2014 Coming to Seattle/Kirkland in October

Jun 4, 2014

Posted by Anthony Vallone on behalf of the GTAC Committee

If you’re looking for a place to discuss the latest innovations in test automation, then charge your tablets and pack your gumboots – the eighth GTAC (Google Test Automation Conference) will be held on October 28-29, 2014 at Google Kirkland! The Kirkland office is part of the Seattle/Kirkland campus in beautiful Washington state. This campus forms our third largest engineering office in the USA.

GTAC is a periodic conference hosted by Google, bringing together engineers from industry and academia to discuss advances in test automation and the test engineering computer science field. It’s a great opportunity to present, learn, and challenge modern testing technologies and strategies.

You can browse the presentation abstracts, slides, and videos from last year on the GTAC 2013 page.

Stay tuned to this blog and the GTAC website for application information and opportunities to present at GTAC. Subscribing to this blog is the best way to get notified. We’re looking forward to seeing you there!

Testing on the Toilet: Risk-Driven Testing

May 30, 2014

by Peter Arrenbrecht

This article was adapted from a Google Testing on the Toilet (TotT) episode. You can download a printer-friendly version of this TotT episode and post it in your office.

We are all conditioned to write tests as we code: unit, functional, UI—the whole shebang. We are professionals, after all. Many of us like how small tests let us work quickly, and how larger tests inspire safety and closure. Or we may just anticipate flak during review. We are so used to these tests that often we no longer question why we write them. This can be wasteful and dangerous.

Tests are a means to an end: To reduce the key risks of a project, and to get the biggest bang for the buck. This bang may not always come from the tests that standard practice has you write, or not even from tests at all.

Two examples:

“We built a new debugging aid. We wrote unit, integration, and UI tests. We were ready to launch.”

Outstanding practice. Missing the mark.

Our key risks were that we’d corrupt our data or bring down our servers for the sake of a debugging aid. None of the tests addressed this, but they gave a false sense of safety and “being done”.
We stopped the launch.

“We wanted to turn down a feature, so we needed to alert affected users. Again we had unit and integration tests, and even one expensive end-to-end test.”

Standard practice. Wasted effort.

The alert was so critical it actually needed end-to-end coverage for all scenarios. But it would be live for only three releases. The cheapest effective test? Manual testing before each release.

A Better Approach: Risks First

For every project or feature, think about testing. Brainstorm your key risks and your best options to reduce them. Do this at the start so you don’t waste effort and can adapt your design. Write them down as a QA design so you can point to it in reviews and discussions.

To be sure, standard practice remains a good idea in most cases (hence it’s standard). Small tests are cheap and speed up coding and maintenance, and larger tests safeguard core use-cases and integration.

Just remember: Your tests are a means. The bang is what counts. It’s your job to maximize it.

Testing on the Toilet: Effective Testing

May 7, 2014

by Rich Martin, Zurich

This article was adapted from a Google Testing on the Toilet (TotT) episode. You can download a printer-friendly version of this TotT episode and post it in your office.

Whether we are writing an individual unit test or designing a product’s entire testing process, it is important to take a step back and think about how effective our tests are at detecting and reporting bugs in our code. To be effective, there are three important qualities that every test should try to maximize:

Fidelity

When the code under test is broken, the test fails. A high-fidelity test is one which is very sensitive to defects in the code under test, helping to prevent bugs from creeping into the code.

Maximize fidelity by ensuring that your tests cover all the paths through your code and include all relevant assertions on the expected state.

Resilience

A test shouldn’t fail if the code under test isn’t defective. A resilient test is one that only fails when a breaking change is made to the code under test. Refactorings and other non-breaking changes to the code under test can be made without needing to modify the test, reducing the cost of maintaining the tests.

Maximize resilience by only testing the exposed API of the code under test; avoid reaching into internals. Favor stubs and fakes over mocks; don’t verify interactions with dependencies unless it is that interaction that you are explicitly validating. A flaky test obviously has very low resilience.
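
As a minimal sketch of favoring a fake over interaction checks (written in Python; InMemoryUserStore and register_user are hypothetical names invented for this illustration), the test below asserts only on observable state. It should keep passing if register_user is refactored to talk to the store differently, as long as the stored result is the same:

```python
class InMemoryUserStore:
    """Hypothetical fake: an in-memory stand-in for a real user database."""

    def __init__(self):
        self._users = {}

    def save(self, name, email):
        self._users[name] = email

    def find_email(self, name):
        return self._users.get(name)


def register_user(store, name, email):
    """Hypothetical code under test: normalizes the email and stores the user."""
    store.save(name, email.strip().lower())


def test_register_user_stores_normalized_email():
    store = InMemoryUserStore()
    register_user(store, "alice", "  Alice@Example.COM ")
    # Assert on the resulting state, not on which store methods were called.
    assert store.find_email("alice") == "alice@example.com"


test_register_user_stores_normalized_email()
```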

Precision

When a test fails, a high-precision test tells you exactly where the defect lies. A well-written unit test can tell you exactly which line of code is at fault. Poorly written tests (especially large end-to-end tests) often exhibit very low precision, telling you that something is broken but not where.

Maximize precision by keeping your tests small and tightly focused. Choose descriptive method names that convey exactly what the test is validating. For system integration tests, validate state at every boundary.

These three qualities are often in tension with each other. It’s easy to write a highly resilient test (the empty test, for example), but writing a test that is both highly resilient and high-fidelity is hard. As you design and write tests, use these qualities as a framework to guide your implementation.

Testing on the Toilet: Test Behaviors, Not Methods

Apr 14, 2014

by Erik Kuefler

This article was adapted from a Google Testing on the Toilet (TotT) episode. You can download a printer-friendly version of this TotT episode and post it in your office.

After writing a method, it’s easy to write just one test that verifies everything the method does. But it can be harmful to think that tests and public methods should have a 1:1 relationship. What we really want to test are behaviors, where a single method can exhibit many behaviors, and a single behavior sometimes spans across multiple methods.

Let’s take a look at a bad test that verifies an entire method:

@Test public void testProcessTransaction() {
  User user = newUserWithBalance(LOW_BALANCE_THRESHOLD.plus(dollars(2)));
  transactionProcessor.processTransaction(
      user,
      new Transaction("Pile of Beanie Babies", dollars(3)));
  assertContains("You bought a Pile of Beanie Babies", ui.getText());
  assertEquals(1, user.getEmails().size());
  assertEquals("Your balance is low", user.getEmails().get(0).getSubject());
}

Displaying the name of the purchased item and sending an email about the balance being low are two separate behaviors, but this test looks at both of those behaviors together just because they happen to be triggered by the same method. Tests like this very often become massive and difficult to maintain over time as additional behaviors keep getting added in—eventually it will be very hard to tell which parts of the input are responsible for which assertions. The fact that the test’s name is a direct mirror of the method’s name is a bad sign.

It’s a much better idea to use separate tests to verify separate behaviors:

@Test public void testProcessTransaction_displaysNotification() {
  transactionProcessor.processTransaction(
      new User(), new Transaction("Pile of Beanie Babies"));
  assertContains("You bought a Pile of Beanie Babies", ui.getText());
}

@Test public void testProcessTransaction_sendsEmailWhenBalanceIsLow() {
  User user = newUserWithBalance(LOW_BALANCE_THRESHOLD.plus(dollars(2)));
  transactionProcessor.processTransaction(
      user,
      new Transaction(dollars(3)));
  assertEquals(1, user.getEmails().size());
  assertEquals("Your balance is low", user.getEmails().get(0).getSubject());
}

Now, when someone adds a new behavior, they will write a new test for that behavior. Each test will remain focused and easy to understand, no matter how many behaviors are added. This will make your tests more resilient since adding new behaviors is unlikely to break the existing tests, and clearer since each test contains code to exercise only one behavior.

The Real Test Driven Development

Apr 1, 2014

Update: APRIL FOOLS!

by Kaue Silveira

Here at Google, we invest heavily in development productivity research. In fact, our TDD research group now occupies nearly an entire building of the Googleplex. The group has been working hard to minimize the development cycle time, and we’d like to share some of the amazing progress they’ve made.

The Concept

In the ways of old, it used to be that people wrote tests for their existing code. This was changed by TDD (Test-driven Development), where one would write the test first and then write the code to satisfy it. The TDD research group didn’t think this was enough and wanted to elevate the humble test to the next level. We are pleased to announce the Real TDD, our latest innovation in the Program Synthesis field, where you write only the tests and have the computer write the code for you!

The following graph shows how the number of tests created by a small feature team grew since they started using this tool towards the end of 2013. Over the last 2 quarters, more than 89% of this team’s production code was written by the tool!

See it in action:

Test written by a Software Engineer:

class LinkGeneratorTest(googletest.TestCase):

  def setUp(self):
    self.generator = link_generator.LinkGenerator()

  def testGetLinkFromIDs(self):
    expected = ('https://frontend.google.com/advancedSearchResults?'
                's.op=ALL&s.r0.field=ID&s.r0.val=1288585+1310696+1346270+')
    actual = self.generator.GetLinkFromIDs(set((1346270, 1310696, 1288585)))
    self.assertEqual(expected, actual)

Code created by our tool:

import urllib

class LinkGenerator(object):

  _URL = (
      'https://frontend.google.com/advancedSearchResults?'
      's.op=ALL&s.r0.field=ID&s.r0.val=')

  def GetLinkFromIDs(self, ids):
    result = []
    for id in sorted(ids):
      result.append('%s ' % id)
    return self._URL + urllib.quote_plus(''.join(result))

Note that the tool is smart enough to not generate the obvious implementation of returning a constant string, but instead it correctly abstracts and generalizes the relation between inputs and outputs. It becomes smarter at every use and it’s behaving more and more like a human programmer every day. We once saw a comment in the generated code that said “I need some coffee”.

How does it work?

We’ve trained the Google Brain with billions of lines of open-source software to learn about coding patterns and how product code correlates with test code. Its accuracy is further improved by using Type Inference to infer types from code and the Girard-Reynolds Isomorphism to infer code from types.

The tool runs every time your unit test is saved, and it uses the learned model to guide a backtracking search for a code snippet that satisfies all assertions in the test. It provides sub-second responses for 99.5% of the cases (as shown in the following graph), thanks to millions of pre-computed assertion-snippet pairs stored in Spanner for global low-latency access.

How can I use it?

We will offer a free (rate-limited) service that everyone can use, once we have sorted out the legal issues regarding the possibility of mixing code snippets originating from open-source projects with different licenses (e.g., GPL-licensed tests will simply refuse to pass BSD-licensed code snippets). If you would like to try our alpha release before the public launch, leave us a comment!

Testing on the Toilet: What Makes a Good Test?

Mar 18, 2014

by Erik Kuefler

This article was adapted from a Google Testing on the Toilet (TotT) episode. You can download a printer-friendly version of this TotT episode and post it in your office.

Unit tests are important tools for verifying that our code is correct. But writing good tests is about much more than just verifying correctness — a good unit test should exhibit several other properties in order to be readable and maintainable.

One property of a good test is clarity. Clarity means that a test should serve as readable documentation for humans, describing the code being tested in terms of its public APIs. Tests shouldn’t refer directly to implementation details. The names of a class’s tests should say everything the class does, and the tests themselves should serve as examples for how to use the class.

Two more important properties are completeness and conciseness. A test is complete when its body contains all of the information you need to understand it, and concise when it doesn’t contain any other distracting information. This test fails on both counts:

@Test public void shouldPerformAddition() {
  Calculator calculator = new Calculator(new RoundingStrategy(),
      "unused", ENABLE_COSIN_FEATURE, 0.01, calculusEngine, false);
  int result = calculator.doComputation(makeTestComputation());
  assertEquals(5, result); // Where did this number come from?
}

Lots of distracting information is being passed to the constructor, and the important parts are hidden off in a helper method. The test can be made more complete by clarifying the purpose of the helper method, and more concise by using another helper to hide the irrelevant details of constructing the calculator:

@Test public void shouldPerformAddition() {
  Calculator calculator = newCalculator();
  int result = calculator.doComputation(makeAdditionComputation(2, 3));
  assertEquals(5, result);
}

One final property of a good test is resilience. Once written, a resilient test doesn’t have to change unless the purpose or behavior of the class being tested changes. Adding new behavior should only require adding new tests, not changing old ones. The original test above isn’t resilient since you’ll have to update it (and probably dozens of other tests!) whenever you add a new irrelevant constructor parameter. Moving these details into the helper method solved this problem.

When/how to use Mockito Answer

Mar 3, 2014

by Hongfei Ding, Software Engineer, Shanghai

Mockito is a popular open source Java testing framework that allows the creation of mock objects. For example, suppose our SUT (System Under Test) uses the interface below:

interface Service {
  Data get();
}

In our test, we normally want to fake the Service’s behavior to return canned data, so that the unit test can focus on testing the code that interacts with the Service. We use a when-return clause to stub a method.

when(service.get()).thenReturn(cannedData);

But sometimes you need mock object behavior that’s too complex for when-return. An Answer object can be a clean way to do this once you get the syntax right.

A common usage of Answer is to stub asynchronous methods that have callbacks. For example, we have mocked the interface below:

interface Service {
  void get(Callback callback);
}

Here you’ll find that when-return is not that helpful anymore. Answer is the replacement. For example, we can emulate a success by calling the onSuccess function of the callback.

doAnswer(new Answer<Void>() {
  public Void answer(InvocationOnMock invocation) {
    Callback callback = (Callback) invocation.getArguments()[0];
    callback.onSuccess(cannedData);
    return null;
  }
}).when(service).get(any(Callback.class));

Answer can also be used to make smarter stubs for synchronous methods. Smarter here means the stub can return a value depending on the input, rather than canned data. It’s sometimes quite useful. For example, we have mocked the Translator interface below:

interface Translator {
  String translate(String msg);
}

We might choose to mock Translator to return a constant string and then assert the result. However, that test is not thorough, because the input to the translator function has been ignored. To improve this, we might capture the input and do extra verification, but then we start to fall into the “testing interaction rather than testing state” trap.

A good use of Answer is to reverse the input message as a fake translation, so that checking the result string assures two things: 1) translate has been invoked, and 2) the msg being translated is correct. Notice that this time we’ve used the thenAnswer syntax, a twin of doAnswer, for stubbing a non-void method.

when(translator.translate(any(String.class))).thenAnswer(reverseMsg());
...
// extracted a method to put a descriptive name
private static Answer<String> reverseMsg() {
  return new Answer<String>() {
    public String answer(InvocationOnMock invocation) {
      return reverseString((String) invocation.getArguments()[0]);
    }
  };
}

Last but not least, if you find yourself writing many nontrivial Answers, you should consider using a fake instead.
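
For readers working in Python, roughly the same trick can be sketched with the standard library's unittest.mock, where side_effect plays a role similar to Answer: the stub computes its return value from the arguments it receives. The greet function below is a hypothetical piece of code under test, invented for this illustration.

```python
from unittest import mock


def greet(translator, name):
    """Hypothetical code under test: translates a greeting for the user."""
    return translator.translate("hello " + name)


def test_greet_translates_full_greeting():
    translator = mock.Mock()
    # side_effect derives the return value from the call's arguments, so the
    # result proves both that translate was called and what it was called with.
    translator.translate.side_effect = lambda msg: msg[::-1]
    assert greet(translator, "bob") == "bob olleh"


test_greet_translates_full_greeting()
```

As with Mockito, once several stubs like this pile up, a small hand-written fake is likely the cleaner choice.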

Minimizing Unreproducible Bugs

Feb 3, 2014

by Anthony Vallone

Unreproducible bugs are the bane of my existence. Far too often, I find a bug, report it, and hear back that it’s not a bug because it can’t be reproduced. Of course, the bug is still there, waiting to prey on its next victim. These types of bugs can be very expensive due to increased investigation time and overall lifetime. They can also have a damaging effect on product perception when users reporting these bugs are effectively ignored. We should be doing more to prevent them. In this article, I’ll go over some obvious, and maybe not so obvious, development/testing guidelines that can reduce the likelihood of these bugs occurring.

Avoid and test for race conditions, deadlocks, timing issues, memory corruption, uninitialized memory access, memory leaks, and resource issues

I am lumping together many bug types in this section, but they are all related somewhat by how we test for them and how disproportionately hard they are to reproduce and debug. The root cause and effect can be separated by milliseconds or hours, and stack traces might be nonexistent or misleading. A system may fail in strange ways when exposed to unusual traffic spikes or insufficient resources. Race conditions and deadlocks may only be discovered during unique traffic patterns or resource configurations. Timing issues may only be noticed when many components are integrated and their performance parameters and failure/retry/timeout delays create a chaotic system. Memory corruption or uninitialized memory access may go unnoticed for a large percentage of calls but become fatal for rare states. Memory leaks may be negligible unless the system is exposed to load for an extended period of time.

Guidelines for development:

Simplify your synchronization logic. If it’s too hard to understand, it will be difficult to reproduce and debug complex concurrency problems.
Always obtain locks in the same order. This is a tried-and-true guideline to avoid deadlocks, but I still see code that breaks it periodically. Define an order for obtaining multiple locks and never change that order.
Don’t optimize by creating many fine-grained locks, unless you have verified that they are needed. Extra locks increase concurrency complexity.
Avoid shared memory, unless you truly need it. Shared memory access is very easy to get wrong, and the bugs may be quite difficult to reproduce.
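
The lock-ordering guideline can be sketched as follows. This is a minimal Python illustration, not code from any real system: Account and transfer are hypothetical names, and ordering locks by account_id is just one possible way to define a global order.

```python
import threading


class Account:
    """Hypothetical account guarded by its own lock."""

    def __init__(self, account_id, balance):
        self.account_id = account_id
        self.balance = balance
        self.lock = threading.Lock()


def transfer(source, target, amount):
    """Moves money between two distinct accounts.

    Locks are always taken in ascending account_id order, regardless of the
    transfer direction, so two opposite transfers cannot deadlock.
    """
    first, second = sorted((source, target), key=lambda a: a.account_id)
    with first.lock:
        with second.lock:
            source.balance -= amount
            target.balance += amount


# Run transfers in both directions at once; with a consistent lock order
# every thread finishes and the totals are preserved.
a, b = Account(1, 100), Account(2, 100)
threads = [threading.Thread(target=transfer, args=(a, b, 1)) for _ in range(50)]
threads += [threading.Thread(target=transfer, args=(b, a, 1)) for _ in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

If transfer instead locked source first and target second, the two directions would acquire the same pair of locks in opposite orders, which is the classic recipe for an intermittent deadlock.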

Guidelines for testing:

Stress test your system regularly. You don’t want to be surprised by unexpected failures when your system is under heavy load.
Test timeouts. Create tests that mock/fake dependencies to test timeout code. If your timeout code does something bad, it may cause a bug that only occurs under certain system conditions.
Test with debug and optimized builds. You may find that a well behaved debug build works fine, but the system fails in strange ways once optimized.
Test under constrained resources. Try reducing the number of data centers, machines, processes, threads, available disk space, or available memory. Also try simulating reduced network bandwidth.
Test for longevity. Some bugs require a long period of time to reveal themselves. For example, persistent data may become corrupt over time.
Use dynamic analysis tools like memory debuggers, ASan, TSan, and MSan regularly. They can help identify many categories of unreproducible memory/threading issues.
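
As one illustration of the timeout guideline above, here is a minimal Python sketch that uses a fake dependency to force the timeout path. SlowBackendFake, load_profile, and the fallback value are all hypothetical and exist only for this example.

```python
class SlowBackendFake:
    """Hypothetical fake dependency that always times out."""

    def fetch(self, key, timeout_seconds):
        raise TimeoutError("no response within %.1fs" % timeout_seconds)


def load_profile(backend, key, timeout_seconds=0.5):
    """Hypothetical code under test: falls back to a default on timeout."""
    try:
        return backend.fetch(key, timeout_seconds)
    except TimeoutError:
        return {"name": "unknown", "stale": True}


def test_load_profile_falls_back_on_timeout():
    profile = load_profile(SlowBackendFake(), "user-1")
    assert profile == {"name": "unknown", "stale": True}


test_load_profile_falls_back_on_timeout()
```

Because the fake raises immediately, the test exercises the timeout handling without waiting and without depending on real system load.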

Enforce preconditions

I’ve seen many well-meaning functions with a high tolerance for bad input. For example, consider this function:

void ScheduleEvent(int timeDurationMilliseconds) {
  if (timeDurationMilliseconds <= 0) {
    timeDurationMilliseconds = 1;
  }
  ...
}

This function is trying to help the calling code by adjusting the input to an acceptable value, but it may be doing damage by masking a bug. The calling code may be experiencing any number of problems described in this article, and passing garbage to this function will always work fine. The more functions that are written with this level of tolerance, the harder it is to trace back to the root cause, and the more likely it becomes that the end user will see garbage. Enforcing preconditions, for instance by using asserts, may actually cause a higher number of failures for new systems, but as systems mature, and many minor/major problems are identified early on, these checks can help improve long-term reliability.

Guidelines for development:

Enforce preconditions in your functions unless you have a good reason not to.
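
A stricter version of such a function might look like the Python sketch below. It assumes, for illustration, that non-positive durations are the invalid inputs; schedule_event and its return value are hypothetical.

```python
def schedule_event(time_duration_milliseconds):
    """Rejects invalid durations instead of silently adjusting them."""
    if time_duration_milliseconds <= 0:
        raise ValueError(
            "time_duration_milliseconds must be positive, got %r"
            % (time_duration_milliseconds,))
    # ... schedule the event here ...
    return time_duration_milliseconds
```

A caller that passes garbage now fails loudly at the call site, close to the root cause, rather than scheduling an event with a quietly corrected duration.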

Use defensive programming

Defensive programming is another tried-and-true technique that is great at minimizing unreproducible bugs. If your code calls a dependency to do something, and that dependency quietly fails or returns garbage, how does your code handle it? You could test for situations like this via mocking or faking, but it’s even better to have your production code do sanity checking on its dependencies. For example:

double GetMonthlyLoanPayment() {
  double rate = GetTodaysInterestRateFromExternalSystem();
  if (rate < 0.001 || rate > 0.5) {
    throw BadInterestRate(rate);
  }
  ...
}

Guidelines for development:

When possible, use defensive programming to verify the work of your dependencies with known risks of failure like user-provided data, I/O operations, and RPC calls.

Guidelines for testing:

Use fuzz testing to test your system’s hardiness when enduring bad data.
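
A very small fuzz test can be sketched in a few lines of Python. Everything here is an assumption made for illustration: parse_interest_rate is a hypothetical defensive parser, the accepted range is arbitrary, and a real fuzzing effort would more likely use a dedicated fuzzing tool than a hand-rolled loop.

```python
import random


class BadInterestRate(ValueError):
    """Raised when a rate is malformed or outside the plausible range."""


def parse_interest_rate(raw):
    """Defensively validates a rate received from an untrusted source.

    The accepted range (0, 0.5] is an illustrative assumption.
    """
    try:
        rate = float(raw)
    except (TypeError, ValueError):
        raise BadInterestRate("not a number: %r" % (raw,))
    if not (0.0 < rate <= 0.5):
        raise BadInterestRate("out of range: %r" % (raw,))
    return rate


def fuzz_parse_interest_rate(iterations=1000, seed=1234):
    """Feeds random junk to the parser; only BadInterestRate may escape."""
    rng = random.Random(seed)  # fixed seed keeps any failure reproducible
    alphabet = "0123456789.-+eE naninf\x00"
    accepted = 0
    for _ in range(iterations):
        raw = "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 8)))
        try:
            rate = parse_interest_rate(raw)
        except BadInterestRate:
            continue
        # Anything the parser accepts must satisfy its own contract.
        assert 0.0 < rate <= 0.5
        accepted += 1
    return accepted
```

Seeding the random generator matters for this article's theme: a fuzz failure that cannot be replayed is just another unreproducible bug.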

Don’t hide all errors from the user

There has been a trend in recent years toward hiding failures from users at all costs. In many cases, it makes perfect sense, but in some, we have gone overboard. Code that is very quiet and permissive during minor failures will allow an uninformed user to continue working in a failed state. The software may ultimately reach a fatal tipping point, and all the error conditions that led to failure have been ignored. If the user doesn’t know about the prior errors, they will not be able to report them, and you may not be able to reproduce them.

Guidelines for development:

Only hide errors from the user when you are certain that there is no impact to system state or the user.
Any error with impact to the user should be reported to the user with instructions for how to proceed. The information shown to the user, combined with data available to an engineer, should be enough to determine what went wrong.

Test error handling

Error handling code is the section of code most commonly left untested. Don’t skip test coverage here. Bad error handling code can cause unreproducible bugs and create great risk if it does not handle fatal errors well.

Guidelines for testing:

Always test your error handling code. This is usually best accomplished by mocking or faking the component triggering the error.
It’s also a good practice to examine your log quality for all types of error handling.
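
Both guidelines can be combined in one test. The Python sketch below uses a fake that triggers the error and then checks both the handling and the log message; FailingGatewayFake, charge_customer, and the "payments" logger name are hypothetical.

```python
import logging

logger = logging.getLogger("payments")


class FailingGatewayFake:
    """Hypothetical fake that simulates the dependency's failure mode."""

    def charge(self, account_id, cents):
        raise ConnectionError("gateway unreachable")


def charge_customer(gateway, account_id, cents):
    """Hypothetical code under test: reports failure and logs the details."""
    try:
        gateway.charge(account_id, cents)
        return True
    except ConnectionError as error:
        logger.error("charge failed: account=%s cents=%d error=%s",
                     account_id, cents, error)
        return False


class ListHandler(logging.Handler):
    """Collects formatted log messages so a test can inspect them."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def test_charge_failure_is_handled_and_logged():
    handler = ListHandler()
    logger.addHandler(handler)
    try:
        assert charge_customer(FailingGatewayFake(), "acct-7", 1250) is False
    finally:
        logger.removeHandler(handler)
    # The log line should carry enough context to reproduce the failure.
    assert handler.messages == [
        "charge failed: account=acct-7 cents=1250 error=gateway unreachable"]


test_charge_failure_is_handled_and_logged()
```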

Check for duplicate keys

If unique identifiers or data access keys are generated using random data or are not guaranteed to be globally unique, duplicate keys may cause data corruption or concurrency issues. Key duplication bugs are very difficult to reproduce.

Guidelines for development:

Try to guarantee uniqueness of all keys.
When it is not possible to guarantee unique keys, check whether the newly generated key is already in use before using it.
Watch out for potential race conditions here and avoid them with synchronization.
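
The check-before-use guideline, including the synchronization caveat, can be sketched in Python as follows. KeyRegistry is a hypothetical class, and the tiny key space exists only to make collisions likely in a demonstration.

```python
import random
import threading


class KeyRegistry:
    """Hypothetical registry that hands out random keys and never repeats one."""

    def __init__(self, key_space, rng=None):
        self._key_space = key_space
        self._rng = rng or random.Random()
        self._in_use = set()
        self._lock = threading.Lock()

    def new_key(self):
        # The check and the reservation happen under one lock, so two threads
        # cannot both see the same key as free.
        with self._lock:
            if len(self._in_use) >= self._key_space:
                raise RuntimeError("key space exhausted")
            while True:
                key = self._rng.randrange(self._key_space)
                if key not in self._in_use:
                    self._in_use.add(key)
                    return key


# Draw every key from a deliberately small space; collisions are retried
# internally, so all returned keys are distinct.
registry = KeyRegistry(key_space=64, rng=random.Random(7))
keys = [registry.new_key() for _ in range(64)]
```

Doing the membership check outside the lock and the insertion inside it would reintroduce the race the guideline warns about.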

Test for concurrent data access

Some bugs only reveal themselves when multiple clients are reading/writing the same data. Your stress tests might be covering cases like these, but if they are not, you should have special tests for concurrent data access. Cases like these are often unreproducible. For example, a user may have two instances of your app running against the same account, and they may not realize this when reporting a bug.

Guidelines for testing:

Always test for concurrent data access if it’s a feature of the system. Actually, even if it’s not a feature, verify that the system rejects it. Testing concurrency can be challenging. An approach that usually works for me is to create many worker threads that simultaneously attempt access and a master thread that monitors and verifies that some number of attempts were indeed concurrent, blocked or allowed as expected, and all were successful. Programmatic post-analysis of all attempts and changing system state may also be necessary to ensure that the system behaved well.
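
A simplified version of this worker-thread approach can be sketched in Python. Counter and run_concurrent_increments are hypothetical names; the barrier only makes overlapping access likely rather than proving it, so a fuller test would also record and verify that attempts really did overlap.

```python
import threading


class Counter:
    """Hypothetical shared resource; access is guarded by a lock."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:
            self._value += 1

    @property
    def value(self):
        with self._lock:
            return self._value


def run_concurrent_increments(counter, workers=8, increments_per_worker=1000):
    """Releases all workers at once and returns errors raised by any of them."""
    barrier = threading.Barrier(workers)
    errors = []

    def worker():
        try:
            barrier.wait()  # line every thread up so the accesses overlap
            for _ in range(increments_per_worker):
                counter.increment()
        except Exception as error:  # surface failures to the main thread
            errors.append(error)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return errors


# The main thread verifies the final state after all workers finish.
counter = Counter()
errors = run_concurrent_increments(counter)
assert errors == []
assert counter.value == 8 * 1000
```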

Steer clear of undefined behavior and non-deterministic access to data

Some APIs and basic operations have warnings about undefined behavior when in certain states or provided with certain input. Similarly, some data structures do not guarantee an iteration order (example: Java’s Set). Code that ignores these warnings may work fine most of the time but fail in unusual ways that are hard to reproduce.

Guidelines for development:

Understand when the APIs and operations you use might have undefined behavior and prevent those conditions.
Do not depend on data structure iteration order unless it is guaranteed. It is a common mistake to depend on the ordering of sets or associative arrays.
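
The iteration-order guideline is easy to illustrate. In this Python sketch, format_recipients is a hypothetical helper that imposes an explicit order instead of relying on whatever order the set happens to yield:

```python
def format_recipients(recipients):
    """Builds a stable, comma-separated list from an unordered collection."""
    # A set has no guaranteed iteration order, so sort before joining.
    return ", ".join(sorted(recipients))


assert format_recipients({"carol", "alice", "bob"}) == "alice, bob, carol"
```

Joining the set directly might produce the same output for many runs and then a different one on another run or platform, which is exactly the kind of failure that is hard to reproduce.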

Log the details for errors or test failures

Issues described in this article can be easier to reproduce and debug when the logs contain enough detail to understand the conditions that led to an error.

Guidelines for development:

Follow good logging practices, especially in your error handling code.
If logs are stored on a user’s machine, create an easy way for them to provide you the logs.

Guidelines for testing:

Save your test logs for potential analysis later.

Anything to add?

Have I missed any important guidelines for minimizing these bugs? What is your favorite hard-to-reproduce bug that you discovered and resolved?

The Google Test and Development Environment – Pt. 3: Code, Build, and Test

Jan 21, 2014

by Anthony Vallone

This is the third in a series of articles about our work environment. See the first and second.

I will never forget the awe I felt when running my first load test on my first project at Google. At previous companies where I’ve worked, running a substantial load test took quite a bit of resource planning and preparation. At Google, I wrote less than 100 lines of code and was simulating tens of thousands of users after just minutes of prep work. The ease with which I was able to accomplish this is due to the impressive coding, building, and testing tools available at Google. In this article, I will discuss these tools and how they affect our test and development process.

Coding and building

The tools and process for coding and building make it very easy to change production and test code. Even though we are a large company, we have managed to remain nimble. In a matter of minutes or hours, you can edit, test, review, and submit code to head. We have achieved this without sacrificing code quality by heavily investing in tools, testing, and infrastructure, and by prioritizing code reviews.

Most production and test code is in a single, company-wide source control repository (open source projects like Chromium and Android have their own). There is a great deal of code sharing in the codebase, and this provides an incredible suite of code to build on. Most code is also in a single branch, so the majority of development is done at head. All code is also navigable, searchable, and editable from the browser. You’ll find code in numerous languages, but Java, C++, Python, Go, and JavaScript are the most common.

Have a strong preference for an editor? Engineers are free to choose from many IDEs and editors. The most common are Eclipse, Emacs, Vim, and IntelliJ, but many others are used as well. Engineers who are passionate about their preferred editors have built up and shared some truly impressive editor plugins/tooling over the years.

Code reviews for all submissions are enforced via source control tooling. This also applies to test code, as our test code is held to the same standards as production code. The reviews are done via web-based code review tools that even include automatically generated test results. The process is very streamlined and efficient. Engineers can change and submit code in any part of the repository, but it must get reviewed by owners of the code being changed. This is great, because you can easily change code that your team depends on, rather than merely request a change to code you do not own.

The Google build system is used for building most code, and it is designed to work across many languages and platforms. It is remarkably simple to define and build targets. You won’t be needing that old Makefile book.

Running jobs and tests

We have some pretty amazing machine and job management tools at Google. There is a generally available pool of machines in many data centers around the globe. The job management service makes it very easy to start jobs on arbitrary machines in any of these data centers. Failing machines are automatically removed from the pool, so tests rarely fail due to machine issues. With a little effort, you can also set up monitoring and pager alerting for your important jobs.

From any machine you can spin up a massive number of tests and run them in parallel across many machines in the pool, via a single command. Each of these tests is run in a standard, isolated environment, so we rarely run into the “it works on my machine!” issue.

Before code is submitted, presubmit tests can be run that will find all tests that depend transitively on the change and run them. You can also define presubmit rules that run checks on a code change and verify that tests were run before allowing submission.
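As a rough illustration of the "find all tests that depend transitively on the change" step, the Python sketch below walks a reverse dependency graph. The target names, the `DEPS` mapping, and the convention that test targets end in `_test` are all hypothetical. This is not Google's build system, only a minimal sketch of the idea.

```python
from collections import deque


def affected_tests(deps, changed):
    """Return the sorted test targets that depend, directly or transitively, on `changed`.

    `deps` maps each target to the targets it depends on (hypothetical data).
    By assumption in this sketch, a target is a test if its name ends in "_test".
    """
    # Invert the graph: for each target, record which targets depend on it.
    dependents = {}
    for target, requirements in deps.items():
        for requirement in requirements:
            dependents.setdefault(requirement, set()).add(target)

    # Breadth-first walk outward from the changed targets.
    seen = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)

    return sorted(t for t in seen if t.endswith("_test"))


# Hypothetical targets, for illustration only.
DEPS = {
    "strings": [],
    "net": ["strings"],
    "server": ["net"],
    "ui": [],
    "strings_test": ["strings"],
    "server_test": ["server"],
    "ui_test": ["ui"],
}
```

In this toy graph, a change to `strings` selects both `strings_test` and `server_test`, because `server` reaches `strings` through `net`, while `ui_test` is left alone.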

Once you’ve submitted test code, the build and test system automatically registers the test, and starts building/testing continuously. If the test starts failing, your team will get notification emails. You can also visit a test dashboard for your team and get details about test runs and test data. Monitoring the build/test status is made even easier with our build orbs designed and built by Googlers. These small devices will glow red if the build starts failing. Many teams have had fun customizing these orbs to various shapes, including a Statue of Liberty with a glowing torch.

Statue of LORBerty

Running larger integration and end-to-end tests takes a little more work, but we have some excellent tools to help with these tests as well: Integration test runners, hermetic environment creation, virtual machine service, web test frameworks, etc.

The impact

So how do these tools actually affect our productivity? For starters, the code is easy to find, edit, review, and submit. Engineers are free to choose tools that make them most productive. Before and after submission, running small tests is trivial, and running large tests is relatively easy. Since tests are easy to create and run, it’s fairly simple to maintain a green build, which most teams do most of the time. This allows us to spend more time on real problems and less on the things that shouldn’t even be problems. It allows us to focus on creating rigorous tests. It dramatically accelerates a development process that can prototype Gmail in a day and code, test, and release service features on a daily schedule. And, of course, it lets us focus on the fun stuff.

Thoughts?

We are interested to hear your thoughts on this topic. Google has the resources to build tools like this, but would small or medium-sized companies benefit from a similar investment in their infrastructure? Did Google create the infrastructure or did the infrastructure create Google?


The Google Test and Development Environment – Pt. 2: Dogfooding and Office Software

Jan 3, 2014

by Anthony Vallone

This is the second in a series of articles about our work environment. See the first.

There are few things as frustrating as getting hampered in your work by a bug in a product you depend on. What if it’s a product developed by your company? Do you report/fix the issue or just work around it and hope it’ll go away soon? In this article, I’ll cover how and why Google dogfoods its own products.

Dogfooding

Google makes heavy use of its own products. We have a large ecosystem of development/office tools and use them for nearly everything we do. Because we use them on a daily basis, we can dogfood releases company-wide before launching to the public. These dogfood versions often have features unavailable to the public but may be less stable. Instability is exactly what you want in your tools, right? Or, would you rather that frustration be passed on to your company’s customers? Of course not!

Dogfooding is an important part of our test process. Test teams do their best to find problems before dogfooding, but we all know that testing is never perfect. We often get dogfood bug reports for edge and corner cases not initially covered by testing. We also get many comments about overall product quality and usability. This internal feedback has, on many occasions, changed product design.

Not surprisingly, test-focused engineers often have a lot to say during the dogfood phase. I don’t think there is a single public-facing product that I have not reported bugs on. I really appreciate the fact that I can provide feedback on so many products before release.

Interested in helping to test Google products? Many of our products have feedback links built-in. Some also have Beta releases available. For example, you can start using Chrome Beta and help us file bugs.

Office software

From system design documents, to test plans, to discussions about beer brewing techniques, our products are used internally. A company’s choice of office tools can have a big impact on productivity, and it is fortunate for Google that we have such a comprehensive suite. The tools have a consistently simple UI (no manual required), perform very well, encourage collaboration, and auto-save in the cloud. Now that I am used to these tools, I would certainly have a hard time going back to the tools of previous companies where I have worked. I’m sure I would forget to click the save buttons for years to come.

Examples of frequently used tools by engineers:

Google Drive Apps (Docs, Sheets, Slides, etc.) are used for design documents, test plans, project data, data analysis, presentations, and more.
Gmail and Hangouts are used for email and chat.
Google Calendar is used to schedule all meetings, reserve conference rooms, and set up video conferencing using Hangouts.
Google Maps is used to map office floors.
Google Groups are used for email lists.
Google Sites are used to host team pages, engineering docs, and more.
Google App Engine hosts many corporate, development, and test apps.
Chrome is our primary browser on all platforms.
Google+ is used for organizing internal communities on topics such as food or C++, and for socializing.

Thoughts?

We are interested to hear your thoughts on this topic. Do you dogfood your company’s products? Do your office tools help or hinder your productivity? What office software and tools do you find invaluable for your job? Could you use Google Docs/Sheets for large test plans?

(Continue to part 3)


The Google Test and Development Environment – Pt. 1: Office and Equipment

Dec 20, 2013

by Anthony Vallone

When conducting interviews, I often get questions about our workspace and engineering environment. What IDEs do you use? What programming languages are most common? What kind of tools do you have for testing? What does the workspace look like?

Google is a company that is constantly pushing to improve itself. Just like software development itself, most environment improvements happen via a bottom-up approach. All engineers are responsible for fine-tuning, experimenting with, and improving our process, with a goal of eliminating barriers to creating products that amaze.

Office space and engineering equipment can have a considerable impact on productivity. I’ll focus on these areas of our work environment in this first article of a series on the topic.

Office layout

Google is a highly collaborative workplace, so the open floor plan suits our engineering process. Project teams composed of Software Engineers (SWEs), Software Engineers in Test (SETs), and Test Engineers (TEs) all sit near each other or in large rooms together. The test-focused engineers are involved in every step of the development process, so it’s critical for them to sit with the product developers. This keeps the lines of communication open.

Google Munich

The office space is far from rigid, and teams often rearrange desks to suit their preferences. The facilities team recently finished renovating a new floor in the New York City office, and after a day of engineering debates on optimal arrangements and white board diagrams, the floor was completely transformed.

Besides the main office areas, there are lounge areas to which Googlers go for a change of scenery or a little peace and quiet. If you are trying to avoid becoming a casualty of The Great Foam Dart War, lounges are a great place to hide.

Google Dublin

Working with remote teams

Google’s worldwide headquarters is in Mountain View, CA, but it’s a very global company, and our project teams are often distributed across multiple sites. To help keep teams well connected, most of our conference rooms have video conferencing equipment. We make frequent use of this equipment for team meetings, presentations, and quick chats.

Google Boston

What’s at your desk?

All engineers get high-end machines and have easy access to data center machines for running large tasks. A new member on my team recently mentioned that his Google machine has 16 times the memory of the machine at his previous company.

Most Google code runs on Linux, so the majority of development is done on Linux workstations. However, those who work on client code for Windows, OS X, or mobile develop on the relevant OSes. For displays, each engineer has a choice of either two 24-inch monitors or one 30-inch monitor. We also get our choice of laptop, picking from various models of Chromebook, MacBook, or Linux. These come in handy when going to meetings, lounges, or working remotely.

Google Zurich

Thoughts?

We are interested to hear your thoughts on this topic. Do you prefer an open-office layout, cubicles, or private offices? Should test teams be embedded with development teams, or should they operate separately? Do the benefits of offering engineers high-end equipment outweigh the costs?

(Continue to part 2)


WebRTC Audio Quality Testing

Nov 8, 2013

by Patrik Höglund

The WebRTC project is all about enabling peer-to-peer video, voice and data transfer in the browser. To give our users the best possible experience we need to adapt the quality of the media to the bandwidth and processing power we have available. Our users encounter a wide variety of network conditions and run on a variety of devices, from powerful desktop machines with a wired broadband connection to laptops on WiFi to mobile phones on spotty 3G networks.

We want to ensure good quality for all these use cases in our implementation in Chrome. To some extent we can do this with manual testing, but the breakneck pace of Chrome development makes it very hard to keep up (several hundred patches land every day)! Therefore, we’d like to test the quality of our video and voice transfer with an automated test. Ideally, we’d like to test for the most common network scenarios our users encounter, but to start we chose to implement a test where we have plenty of CPU and bandwidth. This article covers how we built such a test.

Quality Metrics
First, we must define what we want to measure. For instance, the WebRTC video quality test uses peak signal-to-noise ratio and structural similarity to measure the quality of the video (or to be more precise, how much the output video differs from the input video; see this GTAC 13 talk for more details). The quality of the user experience is a subjective thing though. Arguably, one probably needs dozens of different metrics to really ensure a good user experience. For video, we would have to (at the very least) have some measure for frame rate and resolution besides correctness. To have the system send somewhat correct video frames seemed the most important though, which is why we chose the above metrics.
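To make the first of those metrics concrete, here is a minimal Python sketch of peak signal-to-noise ratio using its standard textbook definition, 10 · log10(MAX² / MSE). It is not the code from the WebRTC video quality test. Treating a frame as a flat list of 8-bit values (so `max_value=255`) is an assumption made for illustration.

```python
import math


def psnr(reference, received, max_value=255):
    """Peak signal-to-noise ratio, in dB, between two equal-length sample sequences.

    Higher is better; identical inputs give infinity.
    """
    if len(reference) != len(received) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error between the reference and what was received.
    mse = sum((a - b) ** 2 for a, b in zip(reference, received)) / len(reference)
    if mse == 0:
        return math.inf
    return 10 * math.log10(max_value ** 2 / mse)
```

A per-frame score like this only says how far the output drifted from the input. It says nothing about frame rate or resolution, which is why the paragraph above notes that more metrics would be needed for a full picture.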

For this test we wanted to start with a similar correctness metric, but for audio. It turns out there’s an algorithm called Perceptual Evaluation of Speech Quality (PESQ) which analyzes two audio files and tells you how similar they are, while taking into account how the human ear works (so it ignores differences a normal person would not hear anyway). That’s great, since we want our metrics to measure the user experience as much as possible. There are many aspects of voice transfer you could measure, such as latency (which is really important for voice calls), but for now we’ll focus on measuring how much a voice audio stream gets distorted by the transfer.

Feeding Audio Into WebRTC
In the WebRTC case we already had a test which would launch a Chrome browser, open two tabs, get the tabs talking to each other through a signaling server and set up a call on a single machine. Then we just needed to figure out how to feed a reference audio file into a WebRTC call and record what comes out on the other end. This part was actually harder than it sounds. The main WebRTC use case is that the web page acquires the user’s mic through getUserMedia, sets up a PeerConnection with some remote peer and sends the audio from the mic through the connection to the peer where it is played in the peer’s audio output device.

WebRTC calls transmit voice, video and data peer-to-peer, over the Internet.

But since this is an automated test, of course we could not have someone speak into a microphone every time the test runs; we had to feed in a known input file, so we had something to compare the recorded output audio against.

Could we duct-tape a small stereo to the mic and play our audio file on the stereo? That’s not very maintainable or reliable, not to mention annoying for anyone in the vicinity. What about some kind of fake device driver which makes a microphone-like device appear on the device level? The problem with that is that it’s hard to control a driver from the userspace test program. Also, the test will be more complex and flaky, and the driver interaction will not be portable.^[1]

Instead, we chose to sidestep this problem. We used a solution where we load an audio file with WebAudio and play that straight into the peer connection through the WebAudio-PeerConnection integration. That way we start the playing of the file from the same renderer process as the call itself, which made it a lot easier to time the start and end of the file. We still needed to be careful to avoid playing the file too early or too late, so we don’t clip the audio at the start or end – that would destroy our PESQ scores! – but it turned out to be a workable approach.^[2]

Recording the Output
Alright, so now we could get a WebRTC call set up with a known audio file with decent control of when the file starts playing. Now we had to record the output. There are a number of possible solutions. The most end-to-end way is to straight up record what the system sends to default audio out (like speakers or headphones). Alternatively, we could write a hook in our application to dump our audio as late as possible, like when we’re just about to send it to the sound card.

We went with the former. Our colleagues in the Chrome video stack team in Kirkland had already found that it’s possible to configure a Windows or Linux machine to send the system’s audio output (i.e. what plays on the speakers) to a virtual recording device. If we make that virtual recording device the default one, simply invoking SoundRecorder.exe and arecord respectively will record what the system is playing out.

They found this works well if one also uses the sox utility to eliminate silence around the actual audio content (recall we had some safety margins at both ends to ensure we record the whole input file as playing through the WebRTC call). We adopted the same approach, since it records what the user would hear, and yet uses only standard tools. This means we don’t have to install additional software on the myriad machines that will run this test.^[3]
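The silence-elimination step can be pictured with a small Python sketch. This is not how sox works internally and not the command line the test uses. It only illustrates the idea of dropping the quiet safety margins at both ends of a recording. The threshold value and the representation of audio as a list of signed integer samples are assumptions.

```python
def trim_silence(samples, threshold=500):
    """Drop leading and trailing samples whose absolute value is below `threshold`.

    Quiet samples in the middle of the recording are kept untouched.
    """
    start = 0
    end = len(samples)
    # Skip the quiet lead-in.
    while start < end and abs(samples[start]) < threshold:
        start += 1
    # Skip the quiet tail.
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    return samples[start:end]
```

A real tool would typically look at short windows of audio rather than single samples, so that one noisy sample in the margin does not count as the start of the content.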

Analyzing Audio
The only remaining step was to compare the silence-eliminated recording with the input file. When we first did this, we got a really bad score (like 2.0 out of 5.0, which means PESQ thinks it’s barely intelligible). This didn’t seem to make sense, since both the input and recording sounded very similar. Turns out we didn’t think about the following:

We were comparing a full-band (24 kHz) input file to a wide-band (8 kHz) result (although both files were sampled at 48 kHz). This essentially amounted to a low pass filtering of the result file.
Both files were in stereo, but PESQ is only mono-aware.
The files were 32-bit, but the PESQ implementation is designed for 16 bits.

As you can see, it’s important to pay attention to what format arecord and SoundRecorder.exe record in, and make sure the input file is recorded in the same way. After correcting the input file and “rebasing”, we got the score up to about 4.0.^[4]
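Two of the mismatches listed above, channel count and bit depth, are easy to picture in code. The Python sketch below downmixes interleaved stereo samples to mono and reduces 32-bit samples to 16 bits. It assumes signed integer PCM with left and right samples interleaved, which the article does not state, and it does not address the bandwidth mismatch, which needs proper resampling or filtering. It is not the conversion the authors used.

```python
def stereo32_to_mono16(interleaved):
    """Convert interleaved stereo 32-bit integer samples to mono 16-bit samples.

    `interleaved` is [left0, right0, left1, right1, ...] (assumed layout).
    """
    if len(interleaved) % 2:
        raise ValueError("expected an even number of interleaved samples")
    mono = []
    for i in range(0, len(interleaved), 2):
        left, right = interleaved[i], interleaved[i + 1]
        average = (left + right) // 2   # downmix the two channels
        mono.append(average >> 16)      # keep the top 16 of the 32 bits
    return mono
```

In practice an audio tool would do this conversion, usually with dithering when reducing bit depth. The point is only that both files need the same layout before a tool like PESQ compares them.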

Thus, we ended up with an automated test that runs continuously on the torrent of Chrome change lists and protects WebRTC’s ability to transmit sound. You can see the finished code here. With automated tests and cleverly chosen metrics you can protect against most regressions a user would notice. If your product includes video and audio handling, such a test is a great addition to your testing mix.

How the components of the test fit together.

Future work

It might be possible to write a Chrome extension which dumps the audio from Chrome to a file. That way we get a simpler-to-maintain and portable solution. It would be less end-to-end but more than worth it due to the simplified maintenance and setup. Also, the recording tools we use are not perfect and add some distortion, which makes the score less accurate.
There are other algorithms than PESQ to consider – for instance, POLQA is the successor to PESQ and is better at analyzing high-bandwidth audio signals.
We are working on a solution which will run this test under simulated network conditions. Simulated networks combined with this test are a really powerful way to test our behavior under various packet loss and delay scenarios and ensure we deliver a good experience to all our users, not just those with great broadband connections. Stay tuned for future articles on that topic!
Investigate feasibility of running this set-up on mobile devices.

¹It would be tolerable if the driver was just looping the input file, eliminating the need for the test to control the driver (i.e. the test doesn’t have to tell the driver to start playing the file). This is actually what we do in the video quality test. It’s a much better fit to take this approach on the video side since each recorded video frame is independent of the others. We can easily embed barcodes into each frame and evaluate them independently.

This seems much harder for audio. We could possibly do audio watermarking, or we could embed a kind of start marker (for instance, using DTMF tones) in the first two seconds of the input file and play the real content after that, and then do some fancy audio processing on the receiving end to figure out the start and end of the input audio. We chose not to pursue this approach due to its complexity.

²Unfortunately, this also means we will not test the capturer path (which handles microphones, etc. in WebRTC). This is an example of the frequent tradeoffs one has to make when designing an end-to-end test. Often we have to trade end-to-endness (how close the test is to the user experience) against the robustness and simplicity of a test. It’s not worth it to cover 5% more of the code if the test becomes unreliable or radically more expensive to maintain. Another example: A WebRTC call will generally involve two peers on different devices separated by the real-world internet. Writing such a test and making it reliable would be extremely difficult, so we make the test single-machine and hope we catch most of the bugs anyway.

³It’s important to keep the continuous build setup simple and the build machines easy to configure – otherwise you will inevitably pay a heavy price in maintenance when you try to scale your testing up.

⁴When sending audio over the internet, we have to compress it since lossless audio consumes way too much bandwidth. WebRTC audio generally sounds great, but there are still compression artifacts if you listen closely (and, in fact, the recording tools are not perfect and add some distortion as well). Given that this test is more about detecting regressions than measuring some absolute notion of quality, we’d like to downplay those artifacts. As our Kirkland colleagues found, one of the ways to do that is to “rebase” the input file. That means we start with a pristine recording, feed that through the WebRTC call and record what comes out on the other end. After manually verifying the quality, we use that as our input file for the actual test. In our case, it pushed our PESQ score up from 3 to about 4 (out of 5), which gives us a bit more sensitivity to regressions.


Espresso for Android is here!

Oct 18, 2013

Cross-posted from the Android Developers Google+ Page

Earlier this year, we presented Espresso at GTAC as a solution to the UI testing problem. Today we are announcing the launch of the developer preview for Espresso!

The compelling thing about developin…
