Lab 2 – More Java Fun
Lab 2 – More Java Fun
- Lab 2 Home
- Warmup
- Part 1
- Part 2
- Part 3
- Submission
Part 3 – Benford Analysis
Benford’s Law, also known as the “first-digit law,” is an observation about the distribution of digits in numbers in data sources. In brief, in many natural datasets the first digit will be a ‘1’ about 30% of the time with each higher digit occurring less frequently than the previous ending with a ‘9’ occurring less than 5% of the time.
There is a nice graph of this distribution on the Wikipedia page.
For this part of the lab, you will create a Java class named Benford
which
finds the distribution of digits of a set of files.
Processing files
This program will be very similar to some of the ones you have already worked on. You will give a list of files as arguments via Run Configurations… as on prior labs. Your program will work through all of them (if you need to remember how to modify Run Configuration arguments, consult last week’s warmup). If a file cannot be opened, just print a message to the user and continue on.
Counting first digits
For this program, you are only concerned about words that start with a
numerical digit, and then you need to get the value of that digit. You will
use an int array of size ten to maintain a count of the number of words you
encounter that start with the digits 0 through 9. You should use the digit as
the index into the array (i.e., count[0]
gives you the number of words that
start with 0).
There are a number of ways you could get at the first character, determine if it is a digit, and get the value. One way is to:
- Get a word from your input (probably using
Scanner
andhasNext()
/next()
. - Use the word’s
charAt()
method to get the first character and pass it into Character.isDigit() to determine if it is a digit. - If it is, then you can use the word’s substring method to extract that first
digit as a
String
. - And that new
String
can be passed intoInteger.parseInt()
to determine the integer value. - Note that we first use
charAt()
to get the character in order to useCharacter
’sisDigit()
method, and then use substring to get the same character becauseInteger.parseInt()
takes aString
and not aCharacter
. We could also convert theCharacter
to aString
.
Printing out the frequency chart
Once you have gone through and made a count of all your leading digits, you should print out a histogram of the various frequencies that you encountered. You should should print out the following information for each row,
- the leading digit 0–9;
- the total count of words beginning with that digit;
- the frequency of that digit being at the start of words beginning with digits (i.e., the count for that digit divided by the sum of counts for all digits); and
- A bar of
*
characters representing the frequency graphically, scaled by the maximum frequency.
When run with the input file numbers/census-pop-est.txt
, this will print:
Welcome to the Benford analysis program
0 1116 0.9% : **
1 35294 29.8% : **************************************************
2 20825 17.6% : ******************************
3 14070 11.9% : ********************
4 11405 9.6% : ****************
5 9210 7.8% : *************
6 8019 6.8% : ***********
7 6809 5.7% : **********
8 6472 5.5% : *********
9 5259 4.4% : *******
The most frequently occurring digit should have a bar of length MAX_WIDTH
where MAX_WIDTH
is defined by a constant in your class.
public static final int MAX_WIDTH = 50;
All of the other bars’ lengths should be scaled proportionally to this. Take
the digit’s count, divide by the max count, and multiply by MAX_WIDTH
. You
should round this number using Math.round()
to get it to the nearest integer.
If all the frequencies are about even, you’d expect all the bars to be about
50 units wide. If they are more varied, then the longest will be 50 and the
others will be of various shorter widths. Note that if the frequency is 0, the
bar should have 0 *
characters.
To print out the table in a formatted fashion, you might want to use
System.out.printf()
to do things. You can pass it a format string that uses %d
for decimal numbers and %f
for floating point numbers. You can also specify a
width for each field by including a number between the %
and the letter. Here
is what I used in the example shown above.
System.out.printf("%d %8d %4.1f%% : ", digit, count[digit], freq );
%d
- is just a decimal number%8d
- is a decimal number where 8 spaces will be used%4.1f
- is a float with a width of 4 and 1 space after the decimal%%
- because%
is used to signify a place to put data, you put 2 of them in a row to actually get a%
in your output
After the string, you list the values you want to print in the same order
Note that unlike in python, we don’t have a different operator for real vs
integer division: instead, dividing one integer by another will always return
an integer. In order to get a float
, you will need to cast one of them as a
float
, like: count/(float)totalcount
.
Sample data
I collected a number of pieces of sample data for you to try out your Benford
analysis tool on numbers.zip.
Because these are large files, and I don’t want you to turn them in, they are
all in a subfolder called numbers
. This data comes from the US Census
Bureau, CIA World Factbook, and a variety of other sources around the web. Try
running your tool on the various files. Note in your comments for your program
which ones seem to follow Benford’s law and which ones don’t. Feel free to
speculate why the ones that don’t are that way.
When run on the input numbers/census-pop-est.txt kittens
numbers/primes-10000.txt
, this will look like:
Welcome to the Benford analysis program
Could not open kittens (No such file or directory)
0 1116 0.9% : **
1 36898 28.7% : **************************************************
2 21954 17.1% : ******************************
3 15167 11.8% : *********************
4 12474 9.7% : *****************
5 10265 8.0% : **************
6 9032 7.0% : ************
7 7836 6.1% : ***********
8 7475 5.8% : **********
9 6265 4.9% : ********