Lessons Learned - Research

Abhay Sri
Oct 23, 2021

I know this is my second post about my summer research, but please bear with me. In case you didn't know, this summer I reached out to professors about doing research on pH prediction. At the time, I thought I knew enough to conduct most of it myself, but I was definitely wrong. I had a background in R and Java, and when Dr. Cano from VCU Engineering agreed to become my mentor, I was delighted.

The first task he assigned me was to clean up the existing pH2O Analytics models. I had left them in a very rudimentary form, where the graphs and data were hard to read. Eager to please him and get off to a good start, I spent the next couple of days looking for better ways to display my graphs. What I found was that date-times in R are a lot harder to deal with than you might think. Converting each data set's starting and ending points to a date/time format was extremely tedious, mainly because the data sets did not all use the same date format. For example, one data set could use MM/DD/YY and another MM-DD-YY, so I had to find a way to standardize them. Once that was done, I showed Dr. Cano my work, and he was impressed with my efficiency. From there, he decided to help me build a machine learning model to predict pH levels, with the ultimate goal of writing a research paper. Here are some of the lessons I have learned throughout this journey:

Thoroughly evaluate your constraints.
Don't get frustrated; it will all work out.
The jump from learning to application is huge.
Data science is an amazing and powerful tool that can be applied virtually anywhere. Its merits in environmental science are superb.

The first lesson comes from what is probably the single biggest constraint of any project: time. Cost was a factor too, since I used Google Colab Pro, but time was much more limited. The machine learning models I built would sometimes take a whole night to output a working predicted data set, and if something was wrong, I wouldn't find out until the next day. I should have estimated those run times up front and planned my schedule around them.

This brings me to my second point: don't get frustrated. Whenever I ran the model and found something wrong the next day, I became engrossed in figuring out what had happened. If I couldn't find why it was performing poorly, I got extremely frustrated, especially since the smaller data set I had trained on worked completely fine.

As my third point says, the jump from learning to application is tremendous. I learned how to code the models and even tested them on a smaller data set. However, as the size of the data grew, so did the errors. Sometimes there were null values, or values that wouldn't work, so I had to filter these out, which took a good week at least.

Lastly, looking at the experience as a whole, I can proudly say that I was successful. Data science is a great tool for environmental science, and I can't wait to see its applications in the future!
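
For readers curious what this kind of cleanup can look like, here is a minimal sketch in Python. It is not the code from the project: the function names `parse_date` and `clean_rows`, the sample rows, and the rule that a usable pH reading falls between 0 and 14 are all assumptions made for illustration. It tries the two date formats mentioned above (MM/DD/YY and MM-DD-YY) and drops rows with null or unusable values.

```python
from datetime import datetime

# Only these two formats are named in the post; a real data set may need more.
DATE_FORMATS = ("%m/%d/%y", "%m-%d-%y")


def parse_date(text):
    """Return a datetime for the first format that matches, or None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    return None


def clean_rows(rows):
    """Keep (date, pH) rows with a parseable date and a numeric pH in 0..14.

    The 0..14 range is an assumed validity check, not a rule from the post.
    """
    cleaned = []
    for date_text, ph in rows:
        if date_text is None or ph is None:
            continue  # null value: skip the row
        date = parse_date(date_text)
        try:
            value = float(ph)
        except (TypeError, ValueError):
            continue  # non-numeric pH such as "n/a"
        if date is None or not 0.0 <= value <= 14.0:
            continue  # unknown date format or out-of-range reading
        cleaned.append((date.date().isoformat(), value))
    return cleaned


# Hypothetical sample rows mixing both date formats and some bad values.
sample = [("07/04/21", "7.2"), ("07-05-21", 6.9), ("07/06/21", None), ("bad", 7.0)]
print(clean_rows(sample))
```

Returning every date in a single ISO format (YYYY-MM-DD) means the later plotting and modeling steps only ever see one format, whichever way the source file wrote it.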

IMG Source: https://www.pmworld360.com/blog/2018/07/13/pmbok-finally-expands-on-lessons-learned-but-is-it-enough/