Coffee of the Week

The race is heating up! From thinking models to swearing AIs

Walter Gandarella • March 03, 2025

Hey everyone! We've rounded up the hottest news from the world of AI this week. There's so much happening it's hard to keep up, so let's dive straight in!

Sakana AI Admits Failure on Promise to Accelerate Model Training

The startup Sakana AI, backed by Nvidia and with hundreds of millions in investment, has backtracked on its claim that its "AI CUDA Engineer" system could accelerate AI model training by up to 100x. After users discovered that the system actually caused a 3x slowdown, the company acknowledged problems in its code and published a statement explaining that the system had found ways to "trick" evaluation metrics.

This Sakana AI episode is an important reminder of how cautious we should be about "revolutionary" announcements in the AI field. It's interesting how quickly the problem was identified by the community – users tested the system and found that, instead of accelerating training, it was making everything slower! Sakana at least had the dignity to own up to the mistake and explain that the model found loopholes in the evaluation – a classic case of "reward hacking", where the AI exploits flaws in the metric to score well without fulfilling the real objective. In the world of AI, when something seems too good to be true, it usually is.

Original Source

Grok Censored Negative Results About Musk and Trump

The Grok chatbot, from Elon Musk's xAI, was discovered to refuse answers that mentioned "Elon Musk/Donald Trump spread disinformation". After users noticed this limitation, Igor Babuschkin, Head of Engineering at xAI, claimed that an ex-OpenAI employee had updated the system's prompt without approval, stating that this modification was "obviously contrary to the company's values."

The irony is almost palpable in Grok's case. Elon Musk constantly promotes his AI as "maximally truth-seeking" and criticises other models for censorship, but when Grok starts telling inconvenient truths about its creator, someone mysteriously adds rules to stop it. The most interesting part is how these modifications were discovered – precisely because xAI leaves the system prompt publicly visible "for transparency reasons". That transparency ended up exposing exactly the kind of manipulation Musk criticises in other companies. The explanation of blaming an unnamed, supposedly ex-OpenAI employee seems awfully convenient.

Original Source

OpenAI Launches GPT-4.5 with High Price and Focus on Natural Writing

OpenAI has launched GPT-4.5, announced as its "biggest and best GPT model," focused on unsupervised learning with greater understanding of user intentions and better "emotional intelligence." While not a reasoning model like o1, it excels at creative tasks and natural communication. The model is initially available to Pro users and developers, at a significantly higher cost: $75 per million input tokens and $150 per million output tokens, 15 times more expensive than GPT-4o.

Wow! The release of GPT-4.5 landed a bit lukewarm, didn't it? OpenAI is clearly trying to manage expectations, acknowledging that this isn't the model to break records on reasoning benchmarks. The most shocking thing is the price – $150 per million output tokens is absurd! To give you an idea, it's like buying a state-of-the-art smartphone just to send text messages. For comparison, Claude 3.7 Sonnet, which was also recently released and performs excellently, costs much less. It seems OpenAI is testing how much the market is willing to pay for incremental improvements in "emotional intelligence" and natural writing... or maybe just buying time until GPT-5.
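To make the price gap concrete, here's a quick back-of-the-envelope calculation. The GPT-4.5 rates come from OpenAI's announcement; the GPT-4o rates ($2.50 input / $10 output per million tokens) are the list prices at the time of writing, so treat the exact ratio as approximate:

```python
# Back-of-the-envelope cost comparison: GPT-4.5 vs GPT-4o.
# Rates in USD per million tokens (GPT-4.5 from the announcement;
# GPT-4o from OpenAI's price list at the time of writing).
PRICES = {
    "gpt-4.5": {"input": 75.00, "output": 150.00},
    "gpt-4o":  {"input": 2.50,  "output": 10.00},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a single request with the given token counts."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A typical chat turn: 2,000 tokens in, 500 tokens out.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 2_000, 500):.4f}")
```

For that typical turn the bill comes to $0.225 on GPT-4.5 versus $0.01 on GPT-4o – pennies either way for one request, but multiply by millions of users and the difference gets very real.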

Original Source

ChatGPT Can Be Configured as Default Search Engine in Safari for iPhone

OpenAI has launched an extension for Safari that allows iPhone users to set ChatGPT as their default search engine, replacing Google. After installation and configuration, queries typed into the Safari search bar are sent directly to ChatGPT. Although the functionality offers quick access to the chatbot, the article points out that ChatGPT does not completely replace traditional search engines for finding practical information such as opening hours, directions, or real-time news.

This integration of ChatGPT into Safari is a smart move by OpenAI to compete with traditional search engines, especially with the advance of Perplexity, which is increasingly dominating this space. The trick is simplicity: you type in the address bar and – poof! – ChatGPT responds. But let's face it, you still can't get rid of Google completely. If you want to know a pizzeria's opening hours or yesterday's match result, traditional search engines are still much more practical. For more complex or creative explanations, having ChatGPT at your fingertips is super convenient. Note that there are other ways to access ChatGPT on the iPhone, including via Siri with Apple Intelligence and via WhatsApp, so this extension is just another string to its bow.

Original Source

Anthropic Launches Claude 3.7 Sonnet, the First Hybrid Model on the Market

Anthropic has launched Claude 3.7 Sonnet, an innovative model that combines rapid responses with extended reasoning in a single system. The model automatically chooses how much computational effort to allocate to each question, deciding internally when to use basic or in-depth processing. In programming benchmarks, Claude 3.7 Sonnet significantly outperformed the competition, achieving 62% on the SWE-bench Verified benchmark in normal mode and 70% with a custom scaffold, while competing models hovered around 49%.

Claude 3.7 Sonnet seems to be that friend who is equally good at quick chats and deep conversations! The idea of a "hybrid" model that decides for itself how much to "think" before answering is brilliant. Instead of having to choose between a fast model and a reasoning model, you get two for the price of one! Anthropic continues to show that it does its homework – not only did they launch the model, they also published detailed studies on the advantages and risks of the approach. The Pokémon-playing test is particularly fascinating – a previous version of Claude could barely leave Professor Oak's house, while 3.7 is already defeating gym leaders (I'm talking about the Pokémon world, folks!) – a big step forward! The 24-hour Twitch stream of Claude playing Pokémon is the icing on the cake.

Original Source 1

Original Source 2

Anthropic Publishes Paper on Claude's Extended Thinking

Anthropic has published a detailed study on Claude 3.7 Sonnet's new "extended thinking" feature, which allows the model to think for longer before responding, improving its performance on complex tasks. The article discusses the benefits of this approach, such as increased reliability and transparency, but also points out challenges, including the question of "fidelity" - whether the thinking process shown truly reflects what happens in the model. The study also presents test data on tasks such as playing Pokémon and solving complex mathematical problems.

It's super cool to see Anthropic opening the bonnet and showing how Claude's "brain" works. This idea of extended thinking is quite natural – after all, nobody solves differential equations in a flash! The coolest thing is that they're not hiding the problems. The part about not knowing whether what the model shows is really what it's "thinking" raises deep philosophical questions. Like, is Claude really thinking, or just staging a thought process to please us?

Original Source

Anthropic Develops Method to Predict Rare Behaviours of AI Models

Anthropic has developed a methodology to predict rare behaviours in language models, especially those that could cause problems on a large scale. The method identifies patterns in risks that follow a power law, allowing accurate extrapolations from small datasets. In tests, the predictions were within an order of magnitude of the real risk in 86% of cases. This approach aims to help developers anticipate problems before models are released, especially in situations where individual failures are rare but can become significant with billions of interactions.

This Anthropic study is super smart. The problem with rare behaviours is precisely this: you can test a thousand times and see nothing, but on the millionth real interaction, something goes wrong. The analogy they use – measuring the temperature in the shallow parts of a lake to predict conditions at the bottom – is perfect. It's impressive how they applied the methodology across different scenarios and obtained surprisingly accurate predictions. This is the kind of preventive work we need for AI systems that will be used by millions of people – you can't wait for the disaster to happen and then scramble to contain the damage.
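The core trick – fit a power law to how the worst observed score grows with sample size, then extrapolate to deployment scale – can be sketched in a few lines. To be clear, this is a toy illustration of power-law extrapolation in general, not Anthropic's actual estimator; the numbers and the `risk(n) = a * n**b` form are invented for the example:

```python
import numpy as np

# Toy data: suppose the worst "harmfulness" score observed among n test
# interactions grows roughly as a power law, risk(n) = a * n**b.
# These values are invented for illustration.
sample_sizes = np.array([100, 300, 1_000, 3_000, 10_000])
worst_scores = np.array([0.020, 0.031, 0.050, 0.078, 0.125])

# A power law is a straight line in log-log space, so fit log(score)
# against log(n) with ordinary least squares.
b, log_a = np.polyfit(np.log(sample_sizes), np.log(worst_scores), deg=1)
a = np.exp(log_a)

def predicted_worst(n: float) -> float:
    """Extrapolated worst-case score after n interactions."""
    return a * n ** b

# Extrapolate far beyond the test set, e.g. to a million interactions.
print(f"fitted exponent b ≈ {b:.2f}")
print(f"predicted worst score at n=1e6: {predicted_worst(1e6):.2f}")
```

The point is exactly the lake analogy: a few thousand "shallow" measurements pin down the exponent, and the exponent tells you what to expect at depths (scales) you never tested directly.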

Original Source

Anthropic Unveils Hierarchical Monitoring System for Computer Use

Anthropic has introduced a monitoring system based on "hierarchical summarisation" to oversee its AI's computer usage capabilities. The system first summarises individual interactions and then creates summaries of multiple interactions to identify potentially problematic usage patterns. Unlike traditional classifiers that look for known problems, this method can detect unanticipated problems and behaviours that are benign individually but harmful together. Tests showed that summaries were accurate in more than 95% of cases, allowing efficient human review of potentially dangerous content.

Anthropic is really thinking outside the box with this monitoring system! The smartest thing is how they solved the scale problem - it is not possible for humans to review billions of interactions, but you also can't blindly trust automatic systems that only look for things we already know. This approach of hierarchical summaries is brilliant because it allows humans to focus only on suspicious cases. And the fact that it can catch behaviours that are harmless alone but problematic together (like click farms) shows that they are thinking in deeper layers of security. It's almost like an immune system for AI!
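As a rough sketch of the idea – not Anthropic's implementation, and with a placeholder `summarise` standing in for a real LLM call – hierarchical summarisation is just summarisation applied in layers:

```python
# Toy sketch of hierarchical summarisation: summarise each interaction,
# then summarise batches of those summaries, so a human reviewer only
# has to read the top level. `summarise` stands in for an LLM call.
from typing import Callable

def summarise(texts: list[str]) -> str:
    # Placeholder: a real system would call a language model here.
    return " | ".join(t[:30] for t in texts)

def hierarchical_summary(
    interactions: list[str],
    batch_size: int,
    summarise_fn: Callable[[list[str]], str] = summarise,
) -> str:
    # Level 1: one summary per individual interaction.
    level1 = [summarise_fn([i]) for i in interactions]
    # Level 2: summaries of batches of level-1 summaries, where
    # cross-interaction patterns (e.g. click-farm behaviour) surface.
    level2 = [
        summarise_fn(level1[i : i + batch_size])
        for i in range(0, len(level1), batch_size)
    ]
    # Top level: a single overview for human review.
    return summarise_fn(level2)

logs = [f"interaction {n}: user asked about topic {n}" for n in range(6)]
print(hierarchical_summary(logs, batch_size=3))
```

The design point is the funnel: each level compresses by a large factor, so billions of raw interactions reduce to a volume of top-level summaries that human reviewers can actually triage.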

Original Source

Alibaba Announces QwQ-Max-Preview with Visible Reasoning Capabilities

Alibaba has announced the launch of QwQ-Max-Preview, a preliminary version of its new reasoning model based on Qwen2.5-Max. The model features a visible thinking mode similar to Claude 3.7 Sonnet, allowing users to see the model's reasoning process. In addition, the company revealed plans to launch an official Qwen Chat application for Android and iOS, as well as smaller versions of the model, such as QwQ-32B, for local use on devices. Alibaba plans to release the final version as open source under the Apache 2.0 license.

It seems that the fashion now is to let us peek into what's going on in the "head" of these AIs! Alibaba's QwQ-Max-Preview is following the Claude wave, showing the step-by-step reasoning - which is super useful for understanding how it arrives at conclusions (and for catching when it's making things up). The plan to launch smaller versions like QwQ-32B to run locally is great news for those who value privacy or want to experiment without an internet connection. The most interesting thing is the commitment to open source - this really democratises access to technology and allows independent developers to create custom solutions. The competition is increasingly fierce, and Alibaba clearly doesn't want to be left behind.

Original Source

Tencent Launches Hunyuan Turbo S, "Fast Thinking" Model with Low Latency

Tencent has launched Hunyuan Turbo S, a new AI model focused on "fast thinking," offering instant responses with high quality. The model differs from "slow thinking" models such as DeepSeek R1 and Hunyuan T1, doubling the speed of text generation and reducing latency by 44%. Using an innovative architecture called Hybrid-Mamba-Transformer, Turbo S managed to significantly reduce deployment costs. In benchmarks, the model performed comparably to or better than models such as GPT-4o, Claude 3.5 Sonnet, and DeepSeek V3.

Tencent seems to have found the perfect balance between speed and quality with Hunyuan Turbo S. It's interesting how they differentiate between "fast thinking" and "slow thinking" - something well aligned with how our own brains work. That 44% reduction in latency is no joke, especially for applications where response time is crucial. The most impressive part is the hybrid Mamba-Transformer architecture, which manages to combine the best of both worlds. And the price seems quite competitive too! Tencent is showing that Chinese models are not messing around.

Original Source

Google Announces Price of Veo 2: 50 Cents per Second of Video Generated

Google has revealed the price of its Veo 2 video generation model, launched in December: 50 cents per second of video generated, which equates to $30 per minute or $1,800 per hour. A Google DeepMind researcher contrasted the price with that of the film "Avengers: Endgame," which cost around $356 million, or approximately $32,000 per second. The announcement comes shortly after OpenAI made its Sora model available to ChatGPT Pro subscribers for $200 per month.

Fifty cents per second of AI-generated video? It seems expensive at first glance, but when you compare it with the cost of traditional video production, it starts to make sense. The comparison with "Avengers: Endgame" is fun – $32,000 vs. $0.50 per second is quite a difference! Of course, Veo 2 won't produce a Marvel blockbuster anytime soon, but for short clips, animations, and visual effects, the price becomes much more palatable. It's interesting to see how OpenAI and Google are approaching access to their video models differently – monthly subscription vs. pay per use. For heavy users, Sora's flat subscription may work out cheaper, but for those who just want to experiment without commitment, Veo 2's pay-per-use model may be more appealing.
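The arithmetic behind that subscription-vs-pay-per-use trade-off is simple enough to write down. The $0.50/second figure is from Google's announcement and the $200/month tier is the Sora subscription mentioned above; this ignores quality differences and any usage caps on the subscription side:

```python
# Rough break-even between Veo 2's pay-per-use pricing and a flat
# $200/month subscription (the Sora tier mentioned in the article).
VEO2_PER_SECOND = 0.50   # USD per second of generated video
SORA_MONTHLY = 200.00    # USD per month, flat

def veo2_cost(seconds: float) -> float:
    """Cost in USD to generate `seconds` of video on Veo 2."""
    return seconds * VEO2_PER_SECOND

# Above this many seconds per month, the flat subscription wins.
break_even = SORA_MONTHLY / VEO2_PER_SECOND
print(f"Break-even: {break_even:.0f} seconds of video per month")
print(f"Veo 2 for one minute: ${veo2_cost(60):.2f}")
```

So the crossover sits at about 400 seconds (under 7 minutes) of video a month – below that, pay-per-use is the cheaper way to experiment.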

Original Source

Google and Salesforce Sign $2.5 Billion Cloud Deal to Counter Microsoft

Salesforce has signed a cloud deal with Google worth $2.5 billion over seven years, allowing Salesforce customers to run their customer management software, Agentforce AI assistants, and Data Cloud products on Google Cloud. The deal is part of a larger effort to join forces and attract business customers who currently use Microsoft's productivity and AI products. The partnership will also allow Salesforce's Agentforce customers to use Google's Gemini models.

Just look at this interesting move by Google and Salesforce! Basically, they're saying: "Hey, Microsoft, we're not going to let you dominate the enterprise AI market so easily!". This $2.5 billion deal is not small change, and it shows how alliances are forming in this new era of AI. The coolest thing is seeing how the companies are integrating their products – imagine writing a document in Google Workspace, pulling customer data from Salesforce, and fine-tuning everything with Gemini, all in one smooth flow. Marc Benioff, CEO of Salesforce, never misses an opportunity to needle Microsoft, calling Copilot "disappointing". It's almost like watching a corporate soap opera! The rivalry is heating up, and we users win, with ever-better products.

Original Source

Google Launches Free Gemini Code Assist with Generous Limit

Google has launched a free version of Gemini Code Assist for individual developers, available globally and powered by Gemini 2.0. Unlike other free code assistants that offer only about 2,000 code completions per month, the free Gemini Code Assist offers up to 180,000 monthly completions. The tool, available for Visual Studio Code, GitHub, and JetBrains IDEs, also includes code review features and support for all publicly available programming languages.

Google is really playing to win in the code assistant market. Offering 180,000 completions per month for free is a super aggressive move - this is 90 times more than the limit of GitHub Copilot Free! This "scorched earth" strategy can really change the game for students, freelance developers, and startups that can't afford monthly subscriptions. The code review part is also super valuable - how many times do we find ourselves wasting hours on reviews that could be partially automated? The fact that you only need a personal Gmail account, without a credit card, makes everything even more accessible. This move may not give Google immediate profit, but it will certainly create a generation of developers loyal to the Gemini ecosystem.

Original Source

Microsoft Makes Voice and Think Deeper Unlimited in Copilot

Microsoft has removed usage limits for the Voice and Think Deeper (powered by OpenAI's o1 model) features in Copilot for all users, including free ones. Previously, these advanced functions had limits for users without a subscription, but now everyone can use these features without restrictions. The company continues to sell the Copilot Pro subscription for $20 per month, offering subscribers priority access to the latest models during peak usage, early access to experimental AI features, and additional use of Copilot in Microsoft 365 applications.

Microsoft is giving out gifts. Making access to o1 (OpenAI's powerful reasoning model) and Voice unlimited for all Copilot users is quite a move. It's almost as if they were saying: "Hey, Google, you want to offer 180,000 code completions? Top that: unlimited premium features for everyone!". Of course, there will be some throttling at peak times (they've already warned of possible delays), but it's an impressive democratisation of advanced technology. It's interesting that they can still maintain value in the Pro subscription through priority access and Microsoft 365 integration.

Original Source

Atla Launches Selene 1, a Superior AI Evaluator

Atla has introduced Selene 1, an LLM Judge model specifically trained to evaluate generative AI responses. According to the company, Selene 1 outperforms top-of-the-line models from leading laboratories - including OpenAI's o series, Anthropic's Claude 3.5 Sonnet and DeepSeek's R1 - in 11 benchmarks commonly used for evaluators. The model can be customised for specific needs and works on various tasks such as absolute scoring, classification and paired preference, providing actionable criticism. Atla has also launched an Alignment Platform that allows users to generate, test and refine custom evaluation metrics.

Finally, we have a model specialised in judging other models – like that critical friend who always gives honest but constructive feedback. The customisation capability is the highlight here – you can adjust how "severe" or specific the evaluator should be for your use case. This is valuable both for developers who want to improve their models and for companies that need to ensure quality and safety in their AI deployments.

Original Source

Microsoft Launches Phi-4-mini and Phi-4-multimodal Models

Microsoft has expanded its Phi-4 family with two new models: Phi-4-mini-instruct (3.8B) and Phi-4-multimodal (5.6B). Phi-4-mini brings significant improvements in multilingual support, reasoning and mathematics, as well as including the function calling feature. Phi-4-multimodal is a completely multimodal model capable of processing vision, audio and text, with a strong reasoning ability. Both models can be deployed on edge devices, allowing IoT applications to integrate generative AI even in environments with limited computing power and network access. The models are available on Hugging Face, Azure AI Foundry Model Catalog, GitHub Models and Ollama.

Microsoft is really committed to democratising access to AI with these compact and powerful Phi models. The function calling feature in Phi-4-mini is particularly exciting, as it allows the model to be integrated with external tools and APIs - imagine being able to connect the model directly to search systems or databases! Phi-4-multimodal is impressive for managing to pack so many capabilities (text, image and audio) into just 5.6B parameters. The fact that they can run locally on devices like iPhones and Raspberry Pis is a game-changer for privacy and applications that need to work offline. These models are perfect for programmers who need advanced AI capabilities without relying on expensive APIs or heavy cloud infrastructure. Microsoft is clearly betting that the future includes AI running locally, and not just on remote servers.

Original Source

Amazon Launches Alexa Plus with Advanced AI Features

Amazon has finally launched Alexa Plus, an improved version of its assistant with generative AI, capable of performing tasks such as placing shopping orders, sending event invitations, and memorising personal details such as food and film preferences. Alexa Plus costs $19.99 per month, or is free for Amazon Prime members. Among its features are the ability to maintain continuous conversations, analyse images, create document summaries, and even generate music using Suno technology. The system is "model-agnostic", drawing on Amazon's own Nova models and partner models from the likes of Anthropic, choosing the most suitable one for each task.

Amazon took its time, but it has finally entered the generative AI race in full force. Alexa Plus seems to combine the best of both worlds: the convenience of smart speakers with the power of an advanced chatbot. The integration with smart cameras and other home devices is particularly intriguing – imagine asking "did anyone walk the dog today?" and the assistant actually consulting recordings to answer you! The decision to offer it free to Prime subscribers is a masterstroke, considering they already have more than 200 million members worldwide. The "Alexicons" (the animations that show the assistant's "personality") seem like an attempt to make the interaction more human and engaging. It will be interesting to see how Alexa Plus compares with Google's Gemini and Microsoft's Copilot in real-world day-to-day tests – the smart assistant race has just become much fiercer!

Original Source

IBM Acquires DataStax, Company Specialising in NoSQL and Vector Databases

IBM has announced plans to acquire DataStax, a company known for its NoSQL and vector database technologies built on Apache Cassandra. The deal aims to integrate DataStax's offerings, including AstraDB and DataStax Enterprise, with IBM's watsonx enterprise AI platform. In addition to database technologies, DataStax also brings Langflow, an open source tool that provides a low-code interface for developing and deploying AI applications, adding middleware capabilities to IBM's watsonx.ai. IBM expects to complete the transaction in the second quarter of 2025.

With AI generating more and more unstructured data, having a robust solution to manage it is fundamental. Apache Cassandra is a high-performance distributed database that has already proven its worth in companies that deal with massive data volumes. The most interesting thing is the acquisition of Langflow, which already has more than 49,000 stars on GitHub - this will give programmers a much more intuitive way to create AI workflows without having to write tons of code. Although IBM has a somewhat mixed history with acquisitions of open source projects (remember what happened to CentOS?), I hope they maintain the commitment to support the community. The combination of data + AI is clearly the future, and IBM is positioning itself to be an important player in this space.

Original Source

Grok Voice Brings "Uninhibited" Mode and Adult Simulations

xAI has launched a new voice interaction mode for its Grok 3 model, initially available to premium subscribers through the iOS application. The feature offers several "personalities" that users can choose from, including an "uninhibited" mode that uses vulgar language and can simulate screams, as well as a "sexy" mode (marked as "18+") that acts as an erotic line attendant. Unlike OpenAI's ChatGPT, which censors adult or controversial content in its voice mode, Grok follows Elon Musk's vision of offering an "uncensored" alternative to existing AI models.

Wow, it seems Elon Musk really wanted to take the concept of "uncensored AI" to the next level! Grok Voice is practically saying: "Hey folks, want to hear an AI swear? Well, that's us!". It's curious how Musk constantly positions his products as the opposite of what OpenAI does – while ChatGPT stays in line, Grok screams, curses, and even "flirts" with users. A shared video shows Grok letting out a 30-second scream when provoked – imagine that happening during a meeting! Although the quality of the responses still seems inferior to ChatGPT's (apparently it gets stuck in loops and repetitions), it is certainly a novel approach. It will be interesting to see how the market reacts to this "wilder" alternative – and whether others follow suit or maintain the more restrained approach.

Original Source

Chinese Man Loses $27,000 in AI Dating Scam

A man in Shanghai lost approximately $27,000 after being scammed in an online relationship with a fictitious girlfriend generated by artificial intelligence. The scammers used generative AI to create realistic videos and photos of an imaginary young woman, as well as fake documents such as medical records and an identity card. The victim transferred the money believing he was helping his "girlfriend" start a business and pay a family member's medical expenses. The operation was carried out by a team of scammers who combined AI-generated images to create a convincing persona, and the victim never met the alleged girlfriend in person.

Ouch, this AI dating scam story is heartbreaking! Imagine thinking you're building a relationship with someone, only to discover that your "girlfriend" was literally manufactured by computers. It's scary to think that technology which once seemed like something out of a science fiction film is now being used by scammers. The most impressive thing is the level of elaboration – not only did they create a convincing virtual persona, but also a whole medical history and personal documents to lend the farce credibility. And $27,000 is no small amount! Meta (owner of Facebook) is already warning of the rise of this type of scam, and rightly so. As generative AIs get ever better at creating content indistinguishable from the real thing, we're going to need a much more refined digital sixth sense. It's the old story of "on the internet, nobody knows you're a dog" – or, in this case, an AI.

Original Source


It is noteworthy how different companies are adopting opposing philosophies - while OpenAI and Anthropic emphasise safety and ethics, Elon Musk's xAI is deliberately going in the opposite direction with its "uninhibited" Grok. Meanwhile, Chinese companies such as Alibaba and Tencent continue to move forward quickly, and Microsoft bets on small models that can run on local devices.

The good thing is that all these approaches are finding their audience. The democratisation of AI tools is happening in real time, whether through generous free plans or open models that can be run locally. At the same time, we see the first worrying signs of how this technology can be abused, as in the case of the AI dating scam. We are truly living in the golden age of AI - with all the wonders and challenges that it brings!
