
DataStax partners with Wikimedia for AI knowledge project


DataStax and Wikimedia Deutschland have announced a collaborative AI Knowledge Project aimed at making open, editable data more accessible through advances in vector embedding technology.

Wikidata, a crucial component of all Wikipedia language versions, is the largest collaborative knowledge graph, supporting more than 300 languages. Working with NVIDIA AI, the initiative has ingested, processed and vector-embedded over 10 million Wikidata entries in under three days.

Wikimedia Deutschland, the organisation that backs German Wikipedia and develops Wikidata and Wikibase, is using the DataStax AI Platform together with NVIDIA's NeMo Retriever and NIM microservices. The aim is to make Wikidata available to developers as an embedded, vectorised database.
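The article does not show the developer-facing interface, but as a rough illustration of what querying Wikidata as a vectorised database could look like, the sketch below uses DataStax's astrapy Python client to run a similarity search against a hypothetical collection of pre-embedded Wikidata entries. The endpoint, collection name and field names are placeholders, not details from the project.

```python
from astrapy import DataAPIClient

# Hypothetical endpoint, token and collection name; the real Wikidata
# collection layout is not described in the article.
client = DataAPIClient("YOUR_ASTRA_DB_APPLICATION_TOKEN")
db = client.get_database("https://<db-id>-<region>.apps.astra.datastax.com")
entities = db.get_collection("wikidata_entities")

# The query embedding would come from the same model used to vectorise
# the entries; a fixed placeholder vector stands in for it here.
query_embedding = [0.01] * 1024  # placeholder of the collection's dimension

# Nearest-neighbour search over the stored entity embeddings.
results = entities.find(
    sort={"$vector": query_embedding},
    limit=5,
    include_similarity=True,
)
for doc in results:
    print(doc.get("label"), doc.get("$similarity"))
```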

Dr. Jonathan Fraine, Chief Technology Officer at Wikimedia Deutschland, stated, "WMDE plans to make Wikidata's data easily accessible for the Open Source AI/ML Community via an advanced vector search by expanding the functionality with fully multilingual models, such as Jina AI through DataStax's API portal, to semantically search up to 100 of the languages represented on Wikidata. To vector embed a large, massively multilingual, multicultural, and dynamic dataset is a hard challenge, especially for low-resource, low-capacity open source developers. With DataStax's collaboration, there is a chance that the world can soon access large subsets of Wikidata's data for their AI/ML applications through an easier-to-access method."

He also pointed to the gain in processing speed: "Although only available in English for now, DataStax's solution provided a valuable initial experiment ~10x faster than our previous, on-premise GPU solution. This near-real-time speed will permit us to experiment at scale and speed by testing the integration of large subsets in a vector database aligned with the frequent updates of Wikidata."
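As a small illustration of the cross-lingual semantic search Fraine describes, the sketch below embeds the same question in two languages with a publicly available multilingual Jina embedding model and compares the vectors. Whether the project uses this exact model or serving path is not stated in the article and is assumed here purely for illustration.

```python
from sentence_transformers import SentenceTransformer, util

# jina-embeddings-v3 is a multilingual embedding model published by Jina AI;
# its use here is an assumption for illustration, not a project detail.
model = SentenceTransformer("jinaai/jina-embeddings-v3", trust_remote_code=True)

queries = [
    "Who painted the Mona Lisa?",     # English
    "Wer hat die Mona Lisa gemalt?",  # German
]
embeddings = model.encode(queries)

# A high cosine similarity means the two phrasings land near each other in
# the shared embedding space, which is what makes multilingual search work.
print(util.cos_sim(embeddings[0], embeddings[1]).item())
```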

Given the scale of Wikidata's knowledge graph, the Wikimedia Deutschland team emphasised the importance of developer efficiency. The DataStax AI Platform, running on AWS, processed more than 10 million entries within three days, with the data remaining available under a free CC0 licence.

Lydia Pintscher, Portfolio Lead for Wikidata at Wikimedia Deutschland, expressed optimism: "Our cooperation with DataStax and their approach has unlocked new capabilities and streamlined our processes, which will allow us to deliver faster and more accurate insights to our community. DataStax offers a combination of scalability, ease of use, and advanced embedding models that supports and encourages the development of AI applications for the public good with open and high-quality data."

Ed Anuff, Chief Product Officer at DataStax, highlighted the achievement: "We're thrilled to see Wikimedia Deutschland improving accessibility to the world's largest knowledge graph with our AI platform. The open source community is crucial as it can bring more common good and many new ideas and innovations to the digital world."

Future work will focus on improving search reliability and extending accessibility across hundreds of languages. The integration of Astra DB's serverless model, powered by AWS, is set to support Wikimedia Deutschland's expanding infrastructure needs as data demand grows.

DataStax has also introduced Astra Vectorize, which generates vector embeddings directly within Astra DB on AWS, alongside integrations such as Amazon Bedrock and Langflow. These enhancements aim to help AWS users manage total cost of ownership, with AWS Graviton processors reducing operational costs.
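The article does not include usage details, but the sketch below shows the kind of workflow Astra Vectorize enables: embeddings generated server-side through the Data API's reserved $vectorize field, rather than computed by the client. Collection and field names are invented for illustration, and the collection is assumed to have been created with a Vectorize-enabled embedding provider.

```python
from astrapy import DataAPIClient

# Hypothetical names throughout; assumes a Vectorize-enabled collection,
# so embeddings are generated on the server from the $vectorize text.
client = DataAPIClient("YOUR_ASTRA_DB_APPLICATION_TOKEN")
db = client.get_database("https://<db-id>-<region>.apps.astra.datastax.com")
entities = db.get_collection("wikidata_entities")

# On insert, the text under $vectorize is embedded by the configured model.
entities.insert_one({
    "qid": "Q42",
    "label": "Douglas Adams",
    "$vectorize": "Douglas Adams: English author and humorist",
})

# On query, the search string is embedded with the same model and used
# for a nearest-neighbour lookup.
hit = entities.find_one(
    sort={"$vectorize": "author of The Hitchhiker's Guide to the Galaxy"}
)
print(hit["label"] if hit else "no match")
```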
