How Web Scraping Is Used to Extract Mobile App Data at Scale

Introduction

Scraping data from mobile apps is nothing new, but most approaches have not scaled well. At Web Screen Scraping, we are working hard on mobile app data extraction at scale, which is why we wrote this post to share relevant information on the topic.

Reverse Engineering

Now the question is, how do we go about it? Assume you need to scrape data from a mobile app: you have the APK of an Android app and want to scrape 500,000 data points (UI screens) per day. How can you achieve this, and what would it cost?

It's critical to figure out how the client communicates with its servers: which protocol it uses and how the two sides send messages to one another.

Although this appears to be the most scalable and cost-effective option, the result may only work for a single application. What if we want to repeat the process for more applications? What happens when an API is updated? As you can see, the effort involved is difficult to estimate.

Following that, we installed the APK on an Android emulator, connected the emulator to a proxy, and monitored the data.
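As a rough sketch of this setup, the helper below builds the command line for starting an emulator whose traffic is routed through an intercepting proxy. The AVD name and proxy address are placeholders rather than values from our setup; `-avd` and `-http-proxy` are standard Android Emulator command-line options.

```python
import shlex


def emulator_proxy_command(avd_name: str, proxy_host: str, proxy_port: int) -> list[str]:
    """Build the argv for launching an Android emulator behind an HTTP proxy."""
    return [
        "emulator",
        "-avd", avd_name,  # name of an existing Android Virtual Device (placeholder)
        "-http-proxy", f"http://{proxy_host}:{proxy_port}",  # proxy address (placeholder)
    ]


cmd = emulator_proxy_command("scraper_avd", "127.0.0.1", 8080)
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.Popen`. Keep in mind that inspecting HTTPS traffic usually also requires the emulator to trust the proxy's CA certificate.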

After a few hours, we were able to watch the traffic between the client and the server and even mimic calls to the server, because everything was done over HTTPS.
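Mimicking a captured call can look roughly like the sketch below. The base URL, path, payload, and headers are all hypothetical stand-ins; in practice you would copy them from the requests observed in the proxy.

```python
import json
import urllib.request


def build_replay_request(base_url: str, path: str, payload: dict, headers: dict) -> urllib.request.Request:
    """Rebuild a captured JSON POST call so it can be sent without the app."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", **headers},
    )


# Every value below is a hypothetical example, not a real API.
request = build_replay_request(
    "https://api.example.com",
    "/v1/items/search",
    {"query": "shoes", "page": 1},
    {"User-Agent": "ExampleApp/1.0 (Android)"},
)
print(request.get_method(), request.full_url)

# Sending is left commented out because the endpoint is fictional:
# with urllib.request.urlopen(request, timeout=10) as response:
#     data = json.load(response)
```

Real apps often add signed headers or session tokens to each call, which is part of why the effort is hard to estimate up front.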

Result

Reverse engineering is simple to begin with and appears to be the most cost-effective and scalable method. However, it can take several long days, development costs are unpredictable, and you do not always end up with a working product.

Appium or Selendroid

The situation is drastically different with tools like Selendroid or Appium. You can quickly script the scenario you wish to test and have it run automatically, over and over again. We chose to use Appium in conjunction with the Android emulator.
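To give an idea of what an Appium session needs, here is a minimal sketch of the capabilities for an emulator-based run. The APK path and device ID are placeholders, and the choice of the UiAutomator2 driver is an assumption, since this post does not depend on a particular driver.

```python
def android_capabilities(apk_path: str, device_udid: str) -> dict[str, str]:
    """Return W3C-style capabilities for an Appium session on an Android emulator."""
    return {
        "platformName": "Android",
        "appium:automationName": "UiAutomator2",  # assumed driver, not confirmed by this post
        "appium:app": apk_path,                   # APK to install and launch (placeholder)
        "appium:udid": device_udid,               # emulator to target, as listed by adb (placeholder)
    }


caps = android_capabilities("/apps/target.apk", "emulator-5554")
print(caps)
```

An Appium client passes a dictionary like this to the Appium server when it opens a session; the screen-by-screen steps that follow depend entirely on the app being scraped.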

Android emulators have a reputation for being difficult to work with in mobile development, but since the release of x86 emulators, things run more smoothly, and it now feels as if apps running on a laptop are faster than on the physical devices themselves.

Later, we created a Docker container with Ubuntu 16.04, Appium, and an Android x86 emulator to test how many of them we could run simultaneously.

So, assuming that one CPU can run one emulator, we would need 700 CPUs to run 700 emulators. That is a significant demand, and a costly one.
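The arithmetic behind a number like this can be sketched as follows. It assumes the target of 500,000 screens per day mentioned earlier and emulators that run around the clock; the time budget per screen is an illustrative assumption, not a measured value.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def seconds_per_screen(screens_per_day: int, emulators: int) -> float:
    """Time budget each emulator has per screen if the load is spread evenly."""
    screens_per_emulator = screens_per_day / emulators
    return SECONDS_PER_DAY / screens_per_emulator


def emulators_needed(screens_per_day: int, seconds_per_screen_budget: float) -> int:
    """Number of emulators needed when one screen takes the given number of seconds."""
    screens_per_emulator = SECONDS_PER_DAY / seconds_per_screen_budget
    return math.ceil(screens_per_day / screens_per_emulator)


print(round(seconds_per_screen(500_000, 700), 1))  # budget per screen with 700 emulators
print(emulators_needed(500_000, 120))              # emulators needed at 120 s per screen
```

With 700 emulators, each one has roughly two minutes per screen, so every second saved per screen translates directly into fewer emulators.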

Result

Physical hardware always delivers good performance, but it is difficult to manage at scale.

So, how do you avoid having to deal with physical hardware management?

Well, we can use a public cloud such as AWS. When we applied this strategy to the cloud, however, things went very differently. Linux, Docker, AWS, and Android have all worked well together in the past, but not with an emulator. AWS EC2 provides you with a virtual machine, and the Android emulator is another virtual machine. To take advantage of hardware acceleration, an x86 Android emulator needs the host machine to expose that capability. Amazon, like other public clouds, does not expose it; instead, it uses the capability itself to serve us virtual machines. As a result, we were unable to even launch an Android x86 emulator.
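On a Linux host, one way to check whether hardware virtualization is exposed is to look for the `vmx` (Intel) or `svm` (AMD) CPU flags and for the `/dev/kvm` device. The sketch below does exactly that; it is a quick diagnostic under the assumption of a Linux x86 host, not a complete compatibility check.

```python
from pathlib import Path


def cpu_virtualization_flags(cpuinfo_text: str) -> set[str]:
    """Return the hardware virtualization flags (vmx or svm) found in /proc/cpuinfo text."""
    found: set[str] = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            flags = line.split(":", 1)[1].split()
            found.update(flag for flag in flags if flag in ("vmx", "svm"))
    return found


def host_supports_kvm(cpuinfo_path: str = "/proc/cpuinfo", kvm_device: str = "/dev/kvm") -> bool:
    """True if the CPU advertises virtualization and the KVM device node exists."""
    cpuinfo = Path(cpuinfo_path)
    if not cpuinfo.exists():
        return False  # not a Linux host, or /proc is unavailable
    has_flags = bool(cpu_virtualization_flags(cpuinfo.read_text()))
    return has_flags and Path(kvm_device).exists()


print(host_supports_kvm())
```

On a cloud VM that hides these capabilities, the check returns `False`, which matches the failure to launch described above.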

So, how did we get around that? We turned to Ravello, which we had used before.

Ravello Cloud

The Ravello solution supports nested virtualization, which makes Kernel-based Virtual Machine (KVM) acceleration available even when the host itself runs on a public cloud.

This made it possible for us to run x86 Android emulators in the cloud. We tried it, and it worked. In terms of performance, however, the process took three times as long as on physical machines, and the situation worsened as more emulators were added.

Result

The Ravello Cloud solution works, but its performance is lacking.

Genymotion Cloud

Genymotion Cloud, which offers Android images as Amazon Machine Images (AMIs) for Amazon EC2, is another option.

As a result, instead of getting a Windows or Ubuntu VM, you get an Android VM. It appeared to be the best public cloud-based option. Using the AMI, we were able to run the scraping script on a t2.small instance (1 CPU, 2 GB RAM), just as we had on the physical hardware.

The problem with this method is the expense: each instance, apart from the image, costs $0.149 per hour, which adds up quickly when you run 700 Android emulators.
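To see how quickly this adds up, here is a small calculation. It assumes that $0.149 is the full hourly price per instance and that all 700 instances run continuously; both are simplifications.

```python
from decimal import Decimal

HOURLY_RATE = Decimal("0.149")  # USD per instance per hour, as quoted above


def fleet_cost(instances: int, hours: int, hourly_rate: Decimal = HOURLY_RATE) -> Decimal:
    """Total cost in USD of running the given number of instances for the given hours."""
    return hourly_rate * instances * hours


per_hour = fleet_cost(700, 1)
per_day = fleet_cost(700, 24)
per_month = fleet_cost(700, 24 * 30)  # a 30-day month
print(per_hour, per_day, per_month)
```

Under these assumptions, the fleet costs about $104 per hour, roughly $2,500 per day, and around $75,000 for a 30-day month.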

Result

Genymotion performs well in the cloud and delivers roughly the same throughput as a physical machine, but it becomes expensive at large scale.

Bluestacks and Nox

These emulators were designed specifically for gamers, but that doesn't mean we can't use them. To test them, we created a t2.medium Windows VM on AWS EC2.

The installation of Nox failed because the graphics card driver was out of date. Even after we got past this, more problems arose, so we decided to use Bluestacks instead.

The installation of Bluestacks proceeded smoothly, and it performed admirably.

However, we could not find a way to run several Bluestacks instances within our virtual machine in the cloud, and our APK test did not perform well on it either, possibly because Bluestacks operates in tablet mode.

Result

Bluestacks performed very well on the virtual machine, it is free, and it is even visible over ADB, which means we can run Appium tests on it. However, it only runs on Mac or Windows, you can only run one instance at a time, and it only works in tablet mode.

Whichever emulator option you use, certain optimizations can shorten the time it takes to scrape the data. For example, use landing URLs, deep links, and similar techniques where possible to jump straight to the screen you need. When run on powerful PCs, the app may also be faster than on an actual device.
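As an illustration of the deep-link technique, the sketch below builds an `adb` command that opens a specific screen directly through a `VIEW` intent. The device serial, URI scheme, and package name are hypothetical; the real values depend on which deep links the target app declares.

```python
import shlex


def deep_link_command(device_serial: str, uri: str, package: str) -> list[str]:
    """Build an adb command that opens a deep link in the given app."""
    return [
        "adb", "-s", device_serial,         # target a specific emulator or device
        "shell", "am", "start",
        "-W",                               # wait until the launch completes
        "-a", "android.intent.action.VIEW",
        "-d", uri,                          # deep link to open (hypothetical URI)
        package,                            # restrict the intent to this app (hypothetical)
    ]


cmd = deep_link_command("emulator-5554", "exampleapp://product/12345", "com.example.app")
print(shlex.join(cmd))
```

Opening a screen this way skips the taps and page loads that an automated script would otherwise need to navigate there.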

Conclusion

To summarize, if you need to scrape mobile app data on a large scale, and reverse engineering works well and meets your needs, then go for it: in our experience at Web Screen Scraping, it is the most cost-effective and scalable method.

The other options, which rely on an Android emulator, are limited and can become prohibitively expensive. If you have any other ideas for scraping mobile apps on a large scale, please share them with us.

Looking for mobile app data scraping services? Contact Web Screen Scraping today or request a quote!

