Found 3 result(s) for "cluster"! Click on the links for more details
Distributed_Analytics_of_US_Residential_Zoning
This is a project that aims to do distributed analytics using clusters using a spatial dataset. Our goal with this project was to analyze the impact of single family rresidential zoning in the US and correlate it to quality of life measures in an effort to dissuade a segreggation of zoning types and promote inclusivity. We hoped to be able to compare the results against data from other countries that have more includive zoning laws, but this was not possible due to constraints on data availability and language barriers. For the distributed component, we are using a cluster of 10 machines that are managed by Yarn. To do the processing of data and calculations, we applied Spark using Java and Gradle. The data itself was stored using HDFS and totaled to ~3.2 GB. For more detail on our motivation, procedures, project structure, and results, please reference the latex file or the presentation in the GitHub repo.
Analysis of the MovieLens Dataset using Apache Spark
This project was an introduction to using Apache Spark to analyze a large file (~800 MB), namely the Movie Lens dataset containing movies, genres, ratings, etc. The files were stored using HDFS and cluster size consisted of 10 machines. There is 1 Java file with 7 Spark jobs which are focused on answering the 7 questions that can be found on GitHub.
Analysis of Million Songs Dataset using Hadoop MapReduce
This project was an introduction to using Hadoop MapReduce to analyze a large file (~1.6 GB), namely the Million Song subset containing 10,000 songs. The files were stored using HDFS and cluster size consisted of 10 machines. There are 10 Java files with jobs of their own which are focused on answering the 10 questions below. Please visit the github for more details on the questions, answers, and more.