GPT-4 wins chatbot lawyer contest – but is still not as good as humans

Several AI chatbots were tested on the kinds of legal reasoning tasks that human lawyers handle in everyday practice – GPT-4 performed the best, but still wasn’t great.

Compared to other AI chatbots, GPT-4 performs best on a test of legal reasoning – but it still falls short of the knowledge required for human lawyers. Early attempts to use AI chatbots in courtrooms have sometimes proven disastrous, and this finding adds to evidence that AI isn’t ready to handle the complexities of real-world legal arguments.

AI is increasingly being used by lawyers, but chatbots still don’t do that well at everyday legal tasks
WESTOCK PRODUCTIONS/Shutterstock


Artificial intelligence researchers and lawyers worked together to design LegalBench, which evaluates how well AI chatbots can do six different types of legal reasoning. LegalBench includes 162 practical tasks that human lawyers must handle in everyday practice, such as correctly analysing legal documents and detecting different types of legal language.


That makes LegalBench more relevant to how lawyers practise law in real life than simply testing whether an AI can memorise enough information to pass the bar exam, says Neel Guha at Stanford University in California. OpenAI’s GPT-4 has already shown it can outperform the average bar exam test-taker with a score of 75 per cent.

He and his colleagues worked with legal professionals to design LegalBench, and then evaluated 20 commercial and open-source large language models, including OpenAI’s GPT-4 and GPT-3.5. GPT-4 generally performed the best with scores in the 70s and 80s out of 100. However, it performed poorly on recalling the specifics of legal rules, with a score of just 59. OpenAI’s GPT-3.5, Anthropic’s Claude and Meta’s Llama models generally trailed behind.


“I think the expectation is that a human legal professional trained for the task should get close to perfect on the majority of LegalBench tasks,” says Guha. “Unfortunately, we haven’t had the chance to evaluate a human legal professional on these tasks, so we don’t have a precise answer” about how they compare to AI’s performance.


Such academic research providing “an independent, comparative ranking of the various [AIs] is likely to be useful for law firms and in-house legal teams,” says Tom Roberts, a founding partner at Allen & Overy, a global law firm headquartered in London.


Since February 2023, his law firm has been testing how well OpenAI’s generative AI technology – customised by the company Harvey – can draft emails and research legal topics. The greatest challenges in adopting AI involve costs, risk management and information security for proprietary data – but also, crucially, how such technologies handle more complex legal reasoning, says Roberts.


But given the “significant limitations” of AI, the law firm has been careful to “place clear limits on the technology’s use and always have a human in the loop,” says Roberts.

AI chatbots may still produce misleading or wrong answers when asked to provide information about specific cases, laws or regulations, says Guha. In an infamous real-life example, a US federal judge fined lawyers for submitting materials based on non-existent legal cases that were entirely made up by OpenAI’s ChatGPT.


The use of AI chatbots in the legal profession also raises ethical issues about unauthorised practice of law, copyright issues, legal malpractice and who can access such legal help, says Guha. For example, a company providing an AI chatbot for legal services without any human in the loop may violate US state laws that require licensed attorneys to perform such services.


Reference

arXiv DOI: 10.48550/arXiv.2308.11462
