AI that uses computers as humans

Is Claude Compute Use the future of end-to-end testing?

Walter Gandarella • October 26, 2024

Imagine: you're sitting at your desk and, next to you, you have a virtual assistant that not only talks to you, but also uses your computer as if it were a real person. Sounds like science fiction? Well, it's become reality. Anthropic has just announced a revolutionary update to its AI Claude, and this time it's for real - we're talking about an assistant that can literally use a computer just like you and me.

The big news: Computer Use

The guys at Anthropic decided to do something different. Instead of creating specific tools for each task (you know, the whole "for every problem, a different program" thing), they taught Claude to use the computer like a human would thing. It's like teaching someone to fish instead of giving them a fish - only in this case, we're teaching an AI to use a mouse, look at the screen and type.

And it's not just talk, either. Claude 3.5 Sonnet (the most powerful version of the system) is already showing impressive results. In tests on OSWorld, which evaluates how AIs see themselves using computers, it scored 14.9% in the screen viewing category - almost double the runner-up, which scored a mere 7.8%.

Getting your hands dirty (or is that a mouse?)

Professor Ethan Mollick, known for his analysis of AI, had the opportunity to test the system and shared some interesting experiences. One of them involved asking Claude to create a lesson plan about "The Great Gatsby". Instead of just talking, the AI went ahead and did the whole job: it downloaded the book, searched for lesson plans on the web, opened a worksheet and filled everything out on its own, including links to the Common Core standards (the American core curriculum).

But the story gets even more interesting when Mollick decided to test the system with a game called Paperclip Clicker (ironically, a game about an AI that destroys humanity in its obsession with making paperclips - did anyone catch the irony?). Claude not only understood the game, but developed his own strategies, A/B tested prices (even if he sometimes misinterpreted the results) and even tried to automate the process by writing code.

The new Claude’s superpowers

The Claude 3.5 Sonnet update didn’t just bring the ability to use computers. The AI also made a significant leap in its programming capabilities. In the SWE-bench Verified benchmark, it jumped from 33.4% to 49%, outperforming all publicly available models - including GPT-4 and expert coding systems.

Companies that are already testing the system report impressive improvements. GitLab, for example, has seen up to a 10% increase in reasoning in DevSecOps use cases. The Browser Company claims that Claude 3.5 Sonnet outperformed all models they had previously tested.

A revolution in software testing

One of the most promising applications of this new capability is in the field of software testing, especially end-to-end testing. As a developer, I have always been frustrated by the complexity and manual work required to create and maintain interface tests. Remember when you use Selenium or similar tools? It’s a laborious process that requires writing detailed scripts for each test scenario.

Now imagine being able to simply talk to an AI and say, “Hey, I need to check if the user registration flow is working correctly” or “Can you test if the shopping cart is calculating discounts correctly?” Claude could navigate the interface like a real user, running the tests and reporting the results – all without having to write a single line of test code. It's like having an automated QA that truly understands the context of what it's testing.

This approach would not only save valuable development time, but would also make testing more comprehensive and natural, since the AI can interact with the interface exactly as a real user would. It's a paradigm shift that could revolutionize the way we ensure the quality of our applications.

The little brother: Claude 3.5 Haiku

Along with all these new features, Anthropic also announced Claude 3.5 Haiku, a lighter and faster version of the system. Interestingly, even though it's more "economical", it manages to outperform the old top-of-the-range model (Claude 3 Opus) in several intelligence metrics. It's like having a compact car with a Ferrari engine - and using less fuel!

Real-world challenges

Of course, not everything is rosy in the AI garden. Like any new technology, the system still has its limitations. It can be stubborn at times (like when it insisted on keeping prices low in the clip game, even against the guidelines), and certain actions that are trivial for us humans - like scrolling on a page or dragging files - are still challenging for it.

In addition, there is concern about possible malicious uses of the technology. Anthropic is aware of this and has implemented special classifiers to identify when Computer Use is being used and if any kind of harmful activity is occurring.

The most fascinating thing about all of this is how this technology is changing the way we interact with AI. It is no longer just a matter of giving commands and receiving responses - now it is like having a real assistant that can perform complex tasks independently. It is as if we are moving from an era of "conversing with robots" to an era of "collaborating with intelligent agents".

And guess what is most interesting? This is just the beginning. Anthropic has already made it clear that it expects to see rapid improvements in the coming months. Here at Yes Marketing, we’re already looking at this transformative potential – so much so that we’ve deployed a dedicated team of developers to explore this new functionality. Our mission is to identify and develop innovative solutions that can make a real difference to our clients’ processes. We’re really entering uncharted territory, where the boundaries between human and artificial interaction with computers are beginning to dissolve. The future promises to be very interesting – and perhaps a little scary, but definitely exciting.

As for me? Well, I can’t wait to see a Claude playing The Sims. Will he do the same evil things we’ve all done to our characters? 🤔


Latest related articles