MACSO Data Collection Tool: Revolutionising Data Collection and Engineering - An Audio Case Study
Data collection and processing:
The mantra of machine learning is “garbage in, garbage out”. That is, the quality of the machine learning model in the end product is almost solely dependent on the quality of the data used to teach the model to perform the desired function. The highest quality data one can obtain is often provided by domain experts. With this in mind, data collection strategies that are error-prone, inefficient or unintuitive to the contributor will erode the quality of the data provided. At MACSO, we developed a browser-based data collection tool to address this need. This software allows for a sophisticated yet easy-to-use labelling system that can be operated by the provider without the requirement of any machine learning expertise. There are also customizable safeguards to decrease the likelihood of corrupt data. Finally, being browser-based, the contributor will not be required to follow any complicated installation processes; so long as they have a device with internet connectivity, they will have full access to the tool. With a robust collection practice enabled by this tool, the data cleaning and processing time is significantly reduced. In this case study we will outline a real-life use case of this tool for an audio project.
Case study background:
The case study centres on a semiconductor company that sought assistance in enabling its microphones to detect seven specific keywords on the edge. The company first used conventional data collection methods and subsequently adopted the MACSO data collection tool. By comparing the two approaches, this study aims to quantify and illustrate the advantages the MACSO tool offers over conventional data collection methods.
Using traditional data collection techniques, the customer initially collected audio files from an estimated 334 contributors.
Approach 1, followed by the customer:
For audio classification projects involving human speech, the most common way to expedite data collection is to use a third-party application. For example, Audacity is a popular application for collecting audio data. Being a third-party application, it is often not tailored to the specific use case. To ensure data quality, one may need to provide additional instructions that require manual intervention during the contribution process, which may not only complicate the experience for the contributor or collector but also expose the data to additional quality risks. This is especially problematic when training machine learning models for speech recognition, as the “collector” is synonymous with the “contributor” due to the requirement for a large variety of voices. Simply put, we want as many different voices as possible, so we can’t simply teach a single individual to collect multiple data points, as each individual is themselves a data point.
The shortcomings of such data collection practices can be generalized as follows:
1. Inefficiency arising from requiring the manual installation of third-party software
2. Unintuitive and error-prone implementation of required constraints
3. Inefficient, inconsistent and error-prone labelling processes
4. Inefficient and error-prone transfer of data from contributor to collector
With this in mind, the following issues may arise:
· Suppose that an audio collection task requires all audio files to be of the same length. To satisfy this requirement, one may request that the contributor manually stop the recording at the desired completion time. Not only is this error-prone in terms of accuracy, but the action of manually stopping the recording may introduce noise that pollutes the dataset.
· Suppose that there are volume constraints to be considered. If the third-party software can be modified to accommodate these constraints, this may require each individual contributor to make such modifications during the installation process. To mitigate this, one may consider installing the software onto one machine that is used by all contributors. However, this introduces an administrative overhead through the organization of the data collection event required to allow all users to access this machine.
· For machine learning, the quality of data pertains not only to the audio recording itself but also to the insight into the overall collected dataset. Often, this insight can be achieved through metadata labelling of the dataset. However, third-party audio software rarely offers this functionality natively. This, once again, introduces administrative overhead for the contributor, through additional requirements during the contribution process, and/or for the collector, through the subsequent formatting and organization of the data.
· Transferring data from the contributor to the collector is another process that third-party audio recording software does not natively offer. This process is therefore often manual, and thus inefficient and error-prone.
· When using third-party software, the organization of customer-specific datasets is an entirely manual process, as the software rarely provides it natively.
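The fixed-length requirement in the first issue above can be enforced by software rather than by the contributor. The following is a minimal sketch, assuming mono recordings held as lists of float samples; the function name `fit_to_length` and the choice to pad short recordings with silence are illustrative assumptions, not a description of any particular tool.

```python
def fit_to_length(samples, sample_rate, duration_s):
    """Trim or zero-pad a mono recording to exactly duration_s seconds.

    samples: list of float samples (hypothetical in-memory representation).
    sample_rate: samples per second, e.g. 16000.
    duration_s: required clip length in seconds.
    """
    target = int(round(sample_rate * duration_s))
    # Cut off anything recorded after the required length.
    trimmed = list(samples[:target])
    # Pad with silence if the recording was stopped early.
    return trimmed + [0.0] * (target - len(trimmed))


if __name__ == "__main__":
    clip = fit_to_length([0.1, 0.2, 0.3], sample_rate=4, duration_s=1.0)
    print(len(clip))  # every clip ends up with sample_rate * duration_s samples
```

Because the length is fixed in code, the contributor never has to stop the recording by hand, which removes both the timing error and the click noise described above.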
The data collection approach initially employed by the semiconductor company yielded significant inconsistencies due to human error and the absence of automated quality control for contributors' data. As a result, the collector, the semiconductor company’s partner organization, had to invest two months of data engineering effort to manually label the data and remove undesirable samples, and was ultimately left with insufficient clean data for model training. To compensate for this shortage, volume augmentation techniques were applied. However, the limited diversity within the collected dataset led to a biased and skewed model, predominantly influenced by the voices of the small group of contributors who had contributed the most data points.
This reliance on conventional data collection methods highlighted the importance of implementing robust automated processes to ensure data quality and reduce human error. The shortcomings experienced during the data collection phase underscored the need for a more efficient and comprehensive approach, such as the adoption of the MACSO data collection tool.
Approach 2, adoption of MACSO’s data collection tool:
Once the customer engaged MACSO, our first step was to study the customer's use case and applicable environments, the roadblocks to attracting more contributors, and the roadblocks the collector faced in identifying the root cause of bugs in the model and debugging them. We identified the following challenges:
· Challenge 1: The adopted data collection method proved to be administratively heavy, requiring contributors to read extensive material and follow a lengthy process.
· Challenge 2: Contributors were not willing to install heavy software with complicated operational instructions.
· Challenge 3: Contributors had privacy concerns regarding the use of their data and their personal information.
· Challenge 4: The customer had security concerns about transferring data to the collector.
· Challenge 5: Contributors did not have robust guidelines to follow to produce good-quality data.
· Challenge 6: Lack of insight into the type of microphone used by the contributor.
· Challenge 7: Lack of insight into the distance of the contributor from the microphone.
· Challenge 8: Lack of insight into other characteristics of the contributor such as gender, age, and accent.
The following are the detailed steps of how the MACSO data collection tool works and how it addresses each of the challenges:
Challenges 1 and 2: reducing the administrative process posed by the conventional method and eliminating the installation of complicated software.
To address ease of access and reduce administration time, we developed a browser-based tool accessible to any contributor who has access to a laptop, a phone, or any other mobile device equipped with a browser and microphone.
The contributor navigates to http://macso.ai and clicks the “Train Our AI” button on the top right:
Challenges 3 and 4: security and privacy.
We use Azure B2C, which acts as an authentication layer for the application. Contributors need to be provisioned with an account before they can begin recording. The login page is displayed below:
This protects the quality of the data by denying access to anyone who might deliberately attempt to sabotage our datasets by polluting them with incorrect contributions. Additionally, the email domain of the login name lets us distinguish between the companies the contributors are associated with. This is important as the contribution instructions may contain sensitive information that should not be exposed to all contributors. Our web browser application also restricts the information obtained from Azure B2C to ensure that no personally identifiable information is stored on our cloud storage system following contribution.
Only the domain portion of the login email is exposed to our web browser application and cloud storage systems; the contributor otherwise preserves complete anonymity.
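To illustrate the idea of keeping only the domain portion of a login email, here is a minimal sketch. It assumes the login name arrives as a plain string; the function name `email_domain` is hypothetical, and the sketch does not reflect how Azure B2C or the MACSO application actually exposes this information.

```python
def email_domain(login_email):
    """Return only the lower-cased domain part of a login email.

    The local part (everything before the last "@") is discarded so that
    nothing identifying the individual contributor is kept.
    """
    local, sep, domain = login_email.strip().rpartition("@")
    if not sep or not local or not domain:
        raise ValueError("not a valid login email")
    return domain.lower()


if __name__ == "__main__":
    # Only the company-level domain survives; the person's name does not.
    print(email_domain("Jane.Doe@Example.com"))
```

Storing only this value is enough to group contributions by company while leaving the individual unidentified.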
Challenges 5 to 8: lack of guidance provided to the contributors.
Once logged in, the contributor will see a page with dropdown menus asking them for a variety of attributes such as age, gender, accent, type and version of device and browser used for recording. The page also includes a short guideline recommending the ideal environment for the contributor to record their voice.
The information provided here will be automatically appended to the contributor’s contributed data. Note that the form shown here asks for the contributor’s name, which would undermine the anonymity measures described in the previous point. However, all fields in this form are configurable; this form comes from a MACSO internal collection task where anonymity was not necessary, so the “Name” field was included.
Additionally, if incorrect or incomplete information is provided, the contributor will be unable to select “Start Recording”. With these measures in place, we ensure that our dataset is correctly labelled, which in turn ensures the quality of insight we have into our collected dataset.
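The gating behaviour described here can be sketched as a simple validation step. This is an illustration only: the field names and the allowed options in `REQUIRED_FIELDS` are hypothetical placeholders, and the real form's fields are configurable per collection task.

```python
# Hypothetical form definition: field name -> set of allowed dropdown options.
REQUIRED_FIELDS = {
    "age_group": {"18-29", "30-44", "45-59", "60+"},
    "gender": {"female", "male", "other", "prefer not to say"},
    "accent": {"australian", "british", "american", "other"},
    "device_type": {"laptop", "phone", "tablet"},
}


def missing_or_invalid(form):
    """Return the sorted names of fields that are absent or not an allowed option."""
    return sorted(
        field
        for field, allowed in REQUIRED_FIELDS.items()
        if form.get(field) not in allowed
    )


def can_start_recording(form):
    """Recording is only enabled once every required field is valid."""
    return not missing_or_invalid(form)


if __name__ == "__main__":
    form = {"age_group": "30-44", "gender": "female", "accent": "british"}
    print(can_start_recording(form), missing_or_invalid(form))
```

Restricting each field to a fixed set of options, rather than free text, is what keeps the resulting labels consistent across contributors.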
After successfully completing the metadata form, the contributor can click “Start Recording” to contribute data. The following pages present the keywords that need to be collected, alongside the automated quality control features of our tool. These features check that:
· The background environment matches that of the guideline.
· The speaker is not speaking too loudly.
· The speaker is not speaking too quietly.
Furthermore, once the data reaches our cloud storage, automated algorithms are run to discard any audio file that does not match the intended keywords.
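One common way to implement a "too loud / too quiet" check is to compare the RMS level of a recording against two thresholds. The sketch below assumes samples normalised to the range [-1.0, 1.0]; the function names and the threshold values of -40 dBFS and -6 dBFS are illustrative assumptions, not the values or the method used by the MACSO tool.

```python
import math


def rms_dbfs(samples):
    """RMS level in dB relative to full scale, for samples in [-1.0, 1.0]."""
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")  # digital silence
    # 10 * log10(mean square) equals 20 * log10(RMS).
    return 10 * math.log10(mean_square)


def volume_verdict(samples, quiet_dbfs=-40.0, loud_dbfs=-6.0):
    """Classify a recording as 'too quiet', 'too loud' or 'ok' (assumed thresholds)."""
    level = rms_dbfs(samples)
    if level < quiet_dbfs:
        return "too quiet"
    if level > loud_dbfs:
        return "too loud"
    return "ok"


if __name__ == "__main__":
    print(volume_verdict([0.1, -0.1] * 50))
```

Running a check like this at recording time lets the contributor re-record immediately, instead of the collector discovering unusable files later.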
The ease of use of this tool enabled 713 contributors from the customer to contribute data in less than one week, compared with the previous 334 contributors over a much longer collection period.
The adoption of the MACSO data collection tool revolutionized the data labelling and quality control process by introducing automation at the collection and storage stages. As a result, the extensive 2-month data engineering effort was drastically reduced to a mere 40 minutes. This substantial reduction in time highlights the efficiency and effectiveness of leveraging automated processes, emphasizing the value of high-quality data over sheer volume. Notably, the use of the MACSO tool eliminated the need for volume augmentation, demonstrating that a focused and curated dataset can outperform a larger but less diverse dataset in training accurate models. The superiority of good quality data over big data is clearly evident in this scenario.
The table provided below offers a concise summary of the tasks performed under each approach (conventional and MACSO) and the corresponding outcomes. It serves as a comprehensive reference, highlighting the benefits and impact of adopting the MACSO data collection tool on the overall data collection process.
Furthermore, a comprehensive report on the data points is generated for the customer, showing the diversity of the data across different aspects, from contributors to recording devices. This report is also crucial for MACSO as we perform various tests and debugging processes.
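A report of this kind can be approximated by counting how the collected data points are distributed over each metadata field. The sketch below is a simplified illustration; the record layout, the field names and the function `diversity_report` are hypothetical, and the actual MACSO report is not described in this case study.

```python
from collections import Counter


def diversity_report(records, fields):
    """For each metadata field, return the share of data points per value."""
    report = {}
    for field in fields:
        counts = Counter(record.get(field, "unknown") for record in records)
        total = sum(counts.values())
        report[field] = {value: n / total for value, n in counts.items()}
    return report


if __name__ == "__main__":
    # Hypothetical data points, each labelled with anonymous metadata.
    records = [
        {"contributor": "c1", "device_type": "laptop"},
        {"contributor": "c1", "device_type": "phone"},
        {"contributor": "c2", "device_type": "laptop"},
        {"contributor": "c1", "device_type": "laptop"},
    ]
    for field, shares in diversity_report(records, ["contributor", "device_type"]).items():
        print(field, shares)
```

A large share concentrated in a single contributor is an early warning of the kind of skew observed with the conventional approach.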
Beyond this case study, our primary objective is to further expand the capabilities of our data collection tool, making it compatible with various sensors and applicable to diverse use cases. While we recognize that audio-related case studies offer quick, tangible results and a high return on investment (ROI), our focus is on the continued growth of our tool and on empowering customers to use it across different domains.
As we advance our data collection tool, several key aspects remain our primary focus. First and foremost, we aim to ensure ease of access and usability for non-technical users, enabling a broader audience to leverage the tool effectively. Additionally, we prioritize the aspects of security and privacy, ensuring that customer data is protected throughout the data collection process.
Automated labelling and quality control at different stages of data collection also play a crucial role in our growth strategy. By implementing automated processes, we strive to enhance efficiency, accuracy, and reliability, enabling our customers to obtain high-quality data for their specific use cases.
As we progress, our vision encompasses extending the tool's compatibility to a wide range of sensors and addressing various use cases. By maintaining our focus on accessibility, security, privacy, and automated data management, we aim to provide an all-encompassing solution that caters to the evolving needs of our expanding customer base.
Need to leverage this tool for your data collection needs? Get in touch with us by emailing us at firstname.lastname@example.org
The screenshots of the following pages are not included in this case study as they are the intellectual property of MACSO.
Given that the customer’s data collection was done prior to MACSO’s engagement, we are unsure how long the process took. However, we were verbally informed that it took about 6-8 weeks.
While no augmentation was needed for this project, the clean data allows us to reuse the same data for different environments using noise augmentation. Furthermore, to deploy projects at scale, our recommendation is a minimum of 1000 data points per audio event. In this project, more data was collected after the study.