There are currently 2 vendors from which we need to fetch millions of phone numbers.
First, I will explain the process I have implemented for fetching the Intelligent data. We can discuss the Peerless part in later emails, once the Intelligent logic is clear.
Please go through the process flow carefully; it should help you understand the current situation and the difficulties and questions I have.
It is a little lengthy, but only so that everything about the task is clear, so please bear with me.
1) Intelligent –
Current Process Flow Explanation –
Here we are able to fetch the data state-wise and can also query it using page numbers (paging).
We have set up 2 tables: ‘pool_numbers’ (where we store the actual phone numbers received) and ‘pool_status’ (where we store the state and page for which we have received data).
We can find out how many numbers exist for a state. For example, AL (Alabama) has 307345 numbers, and we try to fetch 5000 numbers in each request.
So the total number of API requests needed to get all phone numbers for AL (Alabama) is 307345 / 5000 = 61.47, rounded up to 62 requests to the Intelligent API.
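As a quick check of the calculation above, here is a small Python sketch. It is only for illustration; the function name is made up and is not part of the actual script.

```python
import math


def requests_needed(total_numbers: int, page_size: int) -> int:
    """Number of page requests needed to cover all numbers (rounded up)."""
    return math.ceil(total_numbers / page_size)


# AL (Alabama): 307345 numbers, 5000 per request
print(requests_needed(307345, 5000))  # prints 62
```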
After each request, we loop over the 5000 numbers received and store them in the pool_numbers table.
We query the data sorted by phone number, so each request returns new data.
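The fetch-and-store loop can be sketched roughly as below. This is a simplified illustration and not the real script: fake_fetch_page is a stand-in for the vendor API call (it returns 3 numbers per page and nothing after page 6), and the pool_numbers columns shown here are assumed.

```python
import sqlite3
from typing import Callable, List


def fetch_state(conn: sqlite3.Connection, state: str,
                fetch_page: Callable[[str, int], List[str]],
                total_pages: int) -> int:
    """Store pages 1..total_pages, stopping at the first empty page.

    Returns the last page that actually returned data.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pool_numbers ("
        "phone_number TEXT PRIMARY KEY, state TEXT)"
    )
    last_page = 0
    for page in range(1, total_pages + 1):
        numbers = fetch_page(state, page)
        if not numbers:  # the API returned nothing for this page
            break
        conn.executemany(
            "INSERT OR IGNORE INTO pool_numbers (phone_number, state) "
            "VALUES (?, ?)",
            [(number, state) for number in numbers],
        )
        conn.commit()
        last_page = page
    return last_page


def fake_fetch_page(state: str, page: int) -> List[str]:
    """Stand-in for the vendor call (assumption, not the real API)."""
    if page > 6:
        return []
    return [f"+1205{page:03d}{i:04d}" for i in range(3)]


conn = sqlite3.connect(":memory:")
last = fetch_state(conn, "AL", fake_fetch_page, total_pages=62)
stored = conn.execute("SELECT COUNT(*) FROM pool_numbers").fetchone()[0]
print(last, stored)  # prints: 6 18
```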
Difficulty 1 – Even if I fetch data only for state AL (Alabama), I need to make 62 requests to the Intelligent API to get all the numbers for this state. But after 6 requests (that is, 30,000 numbers received), the API does not provide any data for the subsequent page requests.
Question 1 –
Can you please let me know why the API does not provide continuous data for all page requests, even when I try a single state only? This happens for all states that have a large amount of data. I guess this is something to be checked on your side.
Question 2 –
Many times it does not provide data for a state after a certain amount of numbers has been received. For example, for AL, once we have received 30,000 numbers, no matter how many times the cron script executes and tries to fetch the next numbers, the API does not send any data back. Can you please have this checked on your side?
Every time we make a request for a state and a particular page, we store this information in the pool_status table.
The next time the cron script runs, it uses the pool_status table to resume from the page number where data was not received.
So, in the example above, if we were able to fetch data for AL for 6 page requests in the first cron execution, the next execution will start from page 7. This way we do not query for repeated data.
This approach is useful when we want to get all the numbers fetched and stored in our local database.
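The resume logic can be sketched as below. This is a simplified illustration: the pool_status columns (state, page, received) and the function names are assumptions, not the exact schema of the real table.

```python
import sqlite3


def init_status(conn: sqlite3.Connection) -> None:
    """Create a simplified pool_status table (assumed columns)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pool_status ("
        "state TEXT, page INTEGER, received INTEGER, "
        "PRIMARY KEY (state, page))"
    )


def record_page(conn: sqlite3.Connection, state: str, page: int,
                received: bool) -> None:
    """Remember whether a given page of a state returned data."""
    conn.execute(
        "INSERT OR REPLACE INTO pool_status (state, page, received) "
        "VALUES (?, ?, ?)",
        (state, page, int(received)),
    )
    conn.commit()


def next_page(conn: sqlite3.Connection, state: str) -> int:
    """Page to request on the next cron run: last received page + 1."""
    row = conn.execute(
        "SELECT COALESCE(MAX(page), 0) FROM pool_status "
        "WHERE state = ? AND received = 1",
        (state,),
    ).fetchone()
    return row[0] + 1


status_conn = sqlite3.connect(":memory:")
init_status(status_conn)
for page in range(1, 7):
    record_page(status_conn, "AL", page, received=True)
record_page(status_conn, "AL", 7, received=False)  # page 7 came back empty
print(next_page(status_conn, "AL"))  # prints 7
```

A state with no rows in pool_status yet simply starts from page 1.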
Difficulty 2 – Once we have all the data for AL stored in our local database, we need to check whether there are any new numbers or any changes in the price of a number.
In this case we always have to fetch all 307345 numbers again and compare them with our local database: if a number is already present we update its price, and if it is not present we store it. This is not a good approach, because it takes a huge amount of time even for a single state with this many numbers, and doing it for all states is not feasible.
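The compare-and-update step described above looks roughly like this. It is a simplified sketch: the price column and the function name are assumed, and the real table has more fields.

```python
import sqlite3
from typing import Iterable, Tuple


def sync_numbers(conn: sqlite3.Connection, state: str,
                 rows: Iterable[Tuple[str, float]]) -> Tuple[int, int]:
    """Insert unknown numbers and update changed prices.

    Returns (new_count, changed_count).
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pool_numbers ("
        "phone_number TEXT PRIMARY KEY, state TEXT, price REAL)"
    )
    new_count = changed_count = 0
    for number, price in rows:
        existing = conn.execute(
            "SELECT price FROM pool_numbers WHERE phone_number = ?",
            (number,),
        ).fetchone()
        if existing is None:
            # Number not in the local database yet: store it.
            conn.execute(
                "INSERT INTO pool_numbers (phone_number, state, price) "
                "VALUES (?, ?, ?)",
                (number, state, price),
            )
            new_count += 1
        elif existing[0] != price:
            # Number already present but the price differs: update it.
            conn.execute(
                "UPDATE pool_numbers SET price = ? WHERE phone_number = ?",
                (price, number),
            )
            changed_count += 1
    conn.commit()
    return new_count, changed_count


sync_conn = sqlite3.connect(":memory:")
first = sync_numbers(sync_conn, "AL", [("+12050000001", 1.5), ("+12050000002", 2.0)])
second = sync_numbers(sync_conn, "AL", [("+12050000001", 1.5), ("+12050000002", 2.5),
                                        ("+12050000003", 1.0)])
print(first, second)  # prints: (2, 0) (1, 1)
```

Even in this form, every number has to be downloaded and checked on each run, which is why the full re-fetch is so slow.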
Question 3:- Given ‘Difficulty 2’, is there a flag or something similar with which the API can return only changed / new numbers? That would save a lot of time.
I was not able to find any such API method in the documentation.
Question 4:- Suppose we have all phone numbers for AL in our local database and now only want to check for new numbers or changed prices. Because of the issues mentioned in ‘Question 1’ and ‘Question 2’, it is not possible to get all the data in a single cron execution.
Difficulty 3 :- How do we manage the cron scripts? We had the idea of running the scripts every six hours by default, or at an interval defined by the user.
But given the issues above, it is not possible to get all the numbers for all states in one cron execution within 6 hours. Can you suggest how we should proceed?
My idea is to group the states into small chunks. We currently have 58 states for which we need to fetch data.
So if we make 6 groups of about 10 states each and run the cron scripts as below, we should be able to fetch more data.
Group 1 – starts at 01:00 AM (runs every 6 hours)
Group 2 – starts at 02:00 AM (runs every 6 hours)
Group 3 – starts at 03:00 AM (runs every 6 hours)
Group 4 – starts at 04:00 AM (runs every 6 hours)
Group 5 – starts at 05:00 AM (runs every 6 hours)
Group 6 – starts at 06:00 AM (runs every 6 hours)
This way there is a gap of 1 hour between the start of each group's cron script, and each cron script runs at an interval of 6 hours. Because the states are grouped, data for all the states can be fetched within the same 6-hour cycle.
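To make the grouping and the schedule concrete, here is a small Python sketch that splits the states into 6 groups and prints one crontab line per group. The state codes are placeholders (S01 to S58), and the command name fetch_numbers with its --group option is made up for illustration only.

```python
from typing import List


def make_groups(states: List[str], group_count: int) -> List[List[str]]:
    """Distribute states round-robin over group_count groups."""
    return [states[i::group_count] for i in range(group_count)]


def cron_hours(start_hour: int, interval: int = 6) -> str:
    """Hours of the day for a job starting at start_hour, repeating every interval hours."""
    hours = sorted((start_hour + offset) % 24 for offset in range(0, 24, interval))
    return ",".join(str(hour) for hour in hours)


# Placeholder codes standing in for the 58 real state codes.
states = [f"S{i:02d}" for i in range(1, 59)]
groups = make_groups(states, 6)

for index, group in enumerate(groups, start=1):
    # Group 1 starts at 01:00, group 2 at 02:00, and so on.
    print(f"0 {cron_hours(index)} * * * fetch_numbers --group {index}")
    print(f"   ({len(group)} states in group {index})")
```

With 58 states and 6 groups, four groups get 10 states and two groups get 9. Group 1 would run at 01:00, 07:00, 13:00 and 19:00, and group 6 at 00:00, 06:00, 12:00 and 18:00.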