Poll: data quality remains top issue for data engineers
A recent poll has revealed that data quality issues remain one of the most significant frustrations for data engineers, with nearly half of respondents ranking them above other common pipeline challenges. The survey, conducted by Pipeliner, the company behind a data transformation and infrastructure management tool of the same name, gathered responses from more than 100 data engineers.
The results show that almost one in five engineers identified integration with other systems as their primary challenge, while just under 20 percent cited performance bottlenecks. Other notable frustrations included GDPR compliance, poor team cooperation, and a lack of access and permissions.
Xavi Forde, a founding engineer at Pipeliner and a practising data engineer, commented on the findings: "It's no secret that data quality continues to be a root cause of major frustration for many data engineers. Couple this with an increasing number of organisations looking to adopt AI to support enterprise growth, and data engineers are under increasing pressure to ensure data is insight- and AI-ready."
Forde emphasised the importance of well-documented pipelines in mitigating data quality issues. "We know data is never perfect, but there are absolutely ways engineers can reduce the chances of data being compromised as it moves through the pipeline. It all starts with having a well-documented pipeline, with complete traceability between your intended data transformation rules and your data transformation code, so that no engineer has to spend hours trying to untangle someone else's badly written SQL," he added.
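To make the idea concrete, a mapping specification with that kind of traceability might give every transformation rule a stable identifier and a plain-English description alongside the logic it governs. The structure below is a hypothetical sketch in Python for illustration only, not Pipeliner's actual specification format:

```python
# Hypothetical mapping specification: each rule carries a stable ID and a
# human-readable description next to the transformation it governs, so the
# documentation and the code cannot silently drift apart.
# (Illustrative structure only; not Pipeliner's actual format.)
mapping_spec = {
    "source_table": "raw.orders",
    "target_table": "curated.orders",
    "rules": [
        {
            "id": "R-001",
            "description": "Normalise currency codes to upper case",
            "target_column": "currency",
            "expression": "UPPER(currency)",
        },
        {
            "id": "R-002",
            "description": "Exclude cancelled orders",
            "filter": "status <> 'cancelled'",
        },
    ],
}
```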
To address these challenges, Pipeliner launched its metadata-driven data transformation and infrastructure management tool in July. This tool takes mapping specifications as input and delivers data pipeline and infrastructure code directly to a data engineer's GitHub repository, thereby accelerating the development of data lakes while enforcing data governance. According to Pipeliner, this process allows an engineer to go from a mapping specification to a live pipeline within minutes instead of hours or even days.
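One way such a generator could preserve the traceability Forde describes is to carry each rule's identifier into the code it emits, so every line of generated SQL points back to its documented intent. The function below is a minimal sketch that builds on the hypothetical mapping_spec above; it is not Pipeliner's actual generator:

```python
def generate_sql(spec: dict) -> str:
    """Render a SQL SELECT from the hypothetical mapping spec above,
    annotating every generated clause with the ID and description of
    the rule that produced it, so code traces back to the spec."""
    select_lines, where_lines = [], []
    for rule in spec["rules"]:
        if "expression" in rule:
            select_lines.append(
                f"    -- {rule['id']}: {rule['description']}\n"
                f"    {rule['expression']} AS {rule['target_column']}"
            )
        if "filter" in rule:
            where_lines.append(
                f"    -- {rule['id']}: {rule['description']}\n"
                f"    {rule['filter']}"
            )
    sql = "SELECT\n" + ",\n".join(select_lines or ["    *"])
    sql += f"\nFROM {spec['source_table']}"
    if where_lines:
        sql += "\nWHERE\n" + "\n    AND ".join(where_lines)
    return sql

print(generate_sql(mapping_spec))
```

Run against the sample specification, this prints a SELECT statement in which each clause is preceded by a comment naming the rule that produced it.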
Commenting on the capabilities of Pipeliner, founder Svetlana Tarnagurskaja noted, "Pipeliner can help you build complex, production-grade data transformation pipelines significantly faster. It's a tool built by engineers for engineers, with users retaining full control and ownership of their code, which was of paramount importance to us."
Pipeliner's mission is to make building and maintaining high-quality, bespoke data lakes more affordable and accessible for the industry. "Whether it's a small team in the charity sector or an established engineering team under pressure to unlock cost savings in a large enterprise, Pipeliner automates the most time-consuming part of infrastructure and data transformation code creation. This removes bottlenecks, increases productivity, and reduces cloud costs, potentially saving engineers days, even weeks of time," Tarnagurskaja explained.
Pipeliner operates through a three-stage process. In the first stage, 'Define', analysts or engineers define the source-to-target transformation logic and data structures in a mapping specification. In the second, 'Generate', Pipeliner takes the mapping specification as input and generates the ETL jobs and infrastructure code. In the third, 'Deploy', it delivers fully editable code straight to the Git repository of choice, ready for deployment, so the engineering team retains full control of its code.
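Under the same illustrative assumptions, the three stages could be strung together as below; writing into a local checkout stands in for Pipeliner's delivery to a hosted Git repository, and generate_sql is the hypothetical generator sketched earlier:

```python
import json
from pathlib import Path

def define(spec_path: Path) -> dict:
    """Stage 1, 'Define': load the mapping specification authored by
    analysts or engineers (assumed here to be stored as JSON)."""
    return json.loads(spec_path.read_text())

def deploy(code: str, repo_dir: Path, filename: str) -> Path:
    """Stage 3, 'Deploy': write the fully editable generated code into a
    local checkout of the team's Git repository, ready to commit."""
    target = repo_dir / "pipelines" / filename
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(code)
    return target

# End-to-end flow; the paths below are placeholders for illustration.
spec = define(Path("specs/orders.json"))                            # Define
code = generate_sql(spec)                                           # Generate
out = deploy(code, Path("data-pipelines"), "orders_transform.sql")  # Deploy
print(f"Generated pipeline code written to {out}")
```

From there, the team commits the generated file like any other hand-written code, which is how full ownership of the pipeline is retained.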
The tool is available through AWS Marketplace and complies fully with the AWS Well-Architected Framework, aiming to empower businesses to extract actionable insights, make informed decisions, and foster growth through efficient data management.