I wrote a previous post about a password breach database that I was able to find online. Since then I’ve found many more, and I’ve been using them to compile word lists for password cracking. Good word lists, when combined with a tool like Hashcat or John, are vital for any penetration tester, and what better source for great word lists than real-world password breaches?
Typically, password breach databases are distributed as many flat text files, usually all formatted the same way, with a few exceptions. Unfortunately, a lot of strange characters appear in some of these files, including characters from non-Latin character sets (Asian, Arabic, etc.) and other garbage characters that are not printable in a terminal and can cause terminal corruption. In addition, some breaches contain passwords that are still in hashed format, since whoever released the breach data was unable to crack all of the passwords in that breach. For the purposes of creating password word lists, these are useless.
These breach text files typically contain usernames and passwords in the format [username]:[password], but I’ve also seen them formatted with [username];[password] and rarely [username],[password]. In the Antipublic dump that I have, text files are named MYR(1).txt, MYR(2).txt, etc. all the way through MYR(211).txt. To check the structure of the list of files I’m working with, I’ll use the following command.
This command will allow you to preview the file, page by page, without loading the entire file into memory, so it’s very fast.
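A pager such as less fits the description above (note that a name like MYR(1).txt has to be quoted, because the shell treats parentheses specially). As a complementary, non-interactive check, the following sketch peeks at the first lines of a file and counts how many lines contain each candidate delimiter. The file sample.txt and its contents are made up for illustration.

```shell
# Work in a scratch directory; sample.txt is a hypothetical stand-in for one breach file.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' \
    'alice@example.com:hunter2' \
    'bob@example.com:pass00' \
    'carol@example.com;letmein' > sample.txt

# Look at the first lines only; head stops reading after two lines.
# For interactive paging, something like: less 'MYR(1).txt'
head -n 2 sample.txt

# Count the lines that contain each candidate delimiter.
# grep -c exits non-zero when the count is 0, hence the "|| true".
colon_lines=$(grep -c ':' sample.txt || true)
semicolon_lines=$(grep -c ';' sample.txt || true)
comma_lines=$(grep -c ',' sample.txt || true)
echo "colon=$colon_lines semicolon=$semicolon_lines comma=$comma_lines"
```

If one delimiter dominates and the others only show up on a handful of lines, that is a good hint about which format the dump uses and which lines need special handling.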
Combining Breach Files
For the purposes of this example I’m working with the Antipublic breach data, which consists of 211 text files totaling 91.5GB, with contents formatted [username]:[password]. The first thing I’m going to do is combine all of these separate text files into a single text file that I can work with, the reason for which will become apparent later.
cat MYR*.txt > antipublic.txt
This command takes all files whose names begin with MYR and end with the .txt extension and prints their contents to standard out (the terminal). I use > to redirect the output from standard out to a text file called antipublic.txt. In effect, this combines all of the MYR*.txt files into a single file. It should be noted that unlike Windows, UNIX-style operating systems including Linux and macOS don’t need file extensions, as a file’s type can be determined by reading the header of the file (which is what the file command does). Having extensions won’t hurt anything, though, and they sometimes make it easier for humans to distinguish different file types.
A good check to see if the operation was successful is to check the file size of the output file, and compare it to the size of all combined files. The sizes should be the same.
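One way to make that comparison, sketched here with two tiny made-up files standing in for the real MYR*.txt set, is to count bytes on both sides with wc -c:

```shell
# Scratch directory with two hypothetical input files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'a@example.com:one\n' > 'MYR(1).txt'
printf 'b@example.com:two\n' > 'MYR(2).txt'

# Combine them as in the article.
cat MYR*.txt > antipublic.txt

# Total bytes across the parts versus bytes in the combined file.
parts_bytes=$(cat MYR*.txt | wc -c)
combined_bytes=$(wc -c < antipublic.txt)

if [ "$parts_bytes" -eq "$combined_bytes" ]; then
    echo "sizes match: $combined_bytes bytes"
else
    echo "size mismatch: parts=$parts_bytes combined=$combined_bytes"
fi
```

On a 91.5GB data set this reads everything a second time, so comparing the sizes reported by ls -l or du is a cheaper alternative when an exact byte count is not needed.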
Removing Usernames/Email Addresses
For the purpose of creating a word list, the usernames contained in the breach don’t matter to me, so I’m going to use the cut command to remove them from this massive 91.5GB text file. This should reduce the size of the file considerably.
cut -d : -f 2 antipublic.txt > antipublic_pw.txt
This command is fairly simple. The -d flag specifies the delimiter, which is the character within the file antipublic.txt that separates the username field from the password field. In this case, it’s a colon. The -f flag specifies which fields I want to “keep” – i.e., which ones will be printed to standard output (the terminal). Since I want to keep the second field, the passwords, 2 is specified. Finally, the entire output is redirected from standard out to the file antipublic_pw.txt.
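One thing to be aware of: with -f 2, a password that itself contains a colon is cut off at that colon, because everything after it counts as a third field. A possible variation, not the command used for the numbers in this post, is -f 2-, which keeps the second field and everything after it. The sample line below is made up.

```shell
# Hypothetical dump line whose password contains the delimiter.
line='user@example.com:pa:ss'

# Field 2 only: the password is truncated at its own colon.
second_field=$(printf '%s\n' "$line" | cut -d : -f 2)

# Field 2 through the end of the line: the password survives intact.
rest_of_line=$(printf '%s\n' "$line" | cut -d : -f 2-)

echo "$second_field"
echo "$rest_of_line"
```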
You may find that you’ve got some entries in the dumps that are formatted [username];[password]. This can easily be fixed by sending the output of the first cut command through the cut command again to strip out the email addresses/usernames before the semicolon. Lines that contain no semicolon pass through cut unchanged. Since the shell treats the semicolon as a command separator, it must be put in single quotes. For example:
cut -d ';' -f 2 antipublic_pw.txt > antipublic_pw2.txt
Keep in mind that doing this will mess up any passwords in your file that have a semicolon in them as well, since the command just says “strip off anything before and including the semicolon.” If there are only a few lines in the file like this, it might be safer to edit them manually at the end of the process.
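If the dump mixes both delimiters, an alternative worth considering is to strip everything up to and including the first colon or semicolon in a single pass over the original file, so that delimiters inside the password are left alone. This is only a sketch, and it assumes that usernames themselves never contain a colon or semicolon. The sample lines are made up.

```shell
# Three hypothetical lines: colon-delimited, semicolon-delimited, and no delimiter at all.
stripped=$(printf '%s\n' \
    'alice@example.com:pa;ss' \
    'bob@example.com;pa:ss' \
    'nodelimiter' |
    sed 's/^[^:;]*[:;]//')

# [^:;]* matches the username, [:;] matches the first delimiter;
# lines with neither character are printed unchanged.
printf '%s\n' "$stripped"
```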
Removing all of the usernames (email addresses in this case) cut the 91.5GB file down to roughly 30GB.
Sorting Passwords and Removing Duplicates
Since the usernames have been removed, and since multiple users typically use the same common passwords, the next thing I’ll do is sort the contents of the text file alphabetically and remove all duplicate lines from the file. This operation can take a very long time and is very CPU and disk intensive, so if you have an NVMe drive, you’ll want to move the large text file there in order to speed this operation up.
sort antipublic_pw.txt | uniq > antipublic_pwu.txt
I purposefully do not specify any flags on the uniq command here. I have seen some tutorials that show examples with the -u flag, which is not correct. If the -u flag is specified, uniq will only output lines that were unique and that did not contain a duplicate line anywhere in the file. Lines that are duplicates are simply stripped out and a deduplicated version of the line is not output. For example, given a text file called passwords.txt that contains the following data:
Password00
password00
Password01
password01
pass00
pass00
Using the command sort passwords.txt | uniq will produce the following output (the exact ordering depends on your locale):
pass00
password00
Password00
password01
Password01
The command sort passwords.txt | uniq -u will produce the following output:
password00
Password00
password01
Password01
Notice that by adding the -u flag, pass00 is not in the output. This is because -u tells uniq to only output lines that are unique, and to simply remove duplicate lines from the output. Not specifying -u will deduplicate all duplicate lines, but will still show the deduplicated password in the output.
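The difference is easy to reproduce on a small file. The sketch below rebuilds passwords.txt and runs both variants; LC_ALL=C is set only to make the sort order predictable (plain byte order, so capital letters sort first), which is an addition for this demo. It also shows that sort -u gives the same result as sort | uniq.

```shell
# Recreate the example file in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' Password00 password00 Password01 password01 pass00 pass00 > passwords.txt

# Deduplicate: every distinct line appears exactly once.
dedup=$(LC_ALL=C sort passwords.txt | uniq)

# Same result in one step.
dedup_short=$(LC_ALL=C sort -u passwords.txt)

# uniq -u: only lines that never had a duplicate, so pass00 is dropped entirely.
only_singletons=$(LC_ALL=C sort passwords.txt | uniq -u)

echo "sort | uniq:"
printf '%s\n' "$dedup"
echo "sort | uniq -u:"
printf '%s\n' "$only_singletons"
```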
Again, this sort | uniq command can take a very long time, and it shows no progress bar or other output while it runs, so it can look as though the command is frozen. You can use a program like top in a separate terminal window to monitor the status of the command. You should see high CPU usage from sort, and from uniq once sort starts writing its output, since sort has to read all of its input before it can print anything. Be patient and it will finish eventually.
After the command finishes, the size of the file should be considerably smaller. After removing duplicate passwords from the Antipublic breach, my file size dropped from 30.7GB down to 7.3GB. This is still a very large flat file to work with, but it is much more manageable.
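When the input is larger than memory, sort spills to temporary files, so where those files live matters as much as where the input lives. The sketch below points sort at a scratch directory with -T and writes the result with -o. The directory name sort_tmp and the tiny input are made up; on a real run the directory would sit on the fastest drive available. GNU sort also accepts -S to set the memory buffer size and --parallel to set the number of threads, which may be worth experimenting with.

```shell
# Tiny stand-in for antipublic_pw.txt in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' zebra apple zebra mango apple > antipublic_pw.txt

# Hypothetical temp directory for sort's spill files.
mkdir sort_tmp

# -u deduplicates while sorting, -T sets the temp directory, -o names the output file.
# LC_ALL=C sorts by raw byte value, which is usually faster than locale-aware collation.
LC_ALL=C sort -u -T sort_tmp -o antipublic_pwu.txt antipublic_pw.txt

cat antipublic_pwu.txt
```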
Optional: Removing Passwords Over or Under a Specific Length
One easy way to see if there are any garbage lines in the file is to output lines that are longer than a specific length. Users typically don’t have 20 character passwords, but some garbage lines can be that long. To determine whether or not I need to manually clean up the password list, I use the following command:
awk 'length>20' antipublic_pwu.txt
This command quickly shows whether or not garbage lines exist in the deduplicated file. An optional step is to use the same approach to cut out any lines that are over a certain number of characters. A lot of times these dumps contain long lines of garbage, lines of hex, random characters combined with email addresses all on one line after the delimiter, etc. The following example outputs only lines with fewer than 24 characters and redirects those lines to antipublic_pwul24.txt.
awk 'length<24' antipublic_pwu.txt > antipublic_pwul24.txt
This further reduces the size of my file, taking it from 7.3GB down to 5.0GB.
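The same awk test can enforce a minimum length as well, and a grep pattern can drop lines that look like leftover hashes. The sketch below is illustrative only: the minimum of 6 characters, the output file names, and the choice to treat any line of exactly 32 hexadecimal digits as an MD5-style hash are all assumptions, and a real password could in principle match that pattern.

```shell
# Hypothetical sample of a deduplicated list in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' \
    'abc' \
    'hunter2' \
    'correcthorse' \
    '5f4dcc3b5aa765d61d8327deb882cf99' \
    'this line is far too long to be a real password' > antipublic_pwu.txt

# Keep lines that are at least 6 and fewer than 24 characters long.
LC_ALL=C awk 'length >= 6 && length < 24' antipublic_pwu.txt > by_length.txt

# Drop lines that consist of exactly 32 hex digits (the shape of an MD5 hash).
grep -Ev '^[0-9A-Fa-f]{32}$' antipublic_pwu.txt > no_md5.txt

echo "by length:"; cat by_length.txt
echo "without 32-digit hex lines:"; cat no_md5.txt
```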
Splitting Large Word Lists for Easier Editing
If manually cleaning up the remaining file is necessary (and it usually is), this can most easily be accomplished by splitting the file into smaller chunks. The following command will split the file up into 1GB chunks:
split -b 1000M antipublic_pwul24.txt
The result of this command is several 1GB files called “xaa,” “xab,” “xac,” etc. Each of these files can then be edited using a text editor like Sublime, which I’ve found is one of the best GUI text editors for editing large files. I prefer GUI editors when cleaning up these large databases, since scrolling through huge files is much easier with a mouse than with arrow keys, especially if your mouse has a freewheeling wheel (one that just spins forever like a fidget spinner if you give it a good push, versus clicking). Sublime is very quick to open 1GB files, and does so on my laptop in less than 20 seconds.
If your password files contain null bytes, and you’re using Sublime to edit them, you may see nothing but hexadecimal format when you open the files. This is because if a file contains a null byte, Sublime will by default assume the file is to be opened in hex format. To keep Sublime from doing this, go to Preferences > Settings and in the right pane between the brackets, add the following line, press CTRL+S to save the file, and restart Sublime.
Keep in mind that when you open a text file, the entire contents of the file are loaded into RAM. If you are going to work with 1GB files, make sure you have at least 1GB of free RAM to work with. If you don’t, you may have to split the file into smaller chunks.
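Because split -b cuts at an exact byte count, a chunk boundary can land in the middle of a line, leaving half a password at the end of one chunk and the other half at the start of the next. Splitting by line count with -l avoids that. The sketch below uses a five-line made-up file, a chunk size of 2 lines, and a hypothetical chunk_ prefix, then joins the pieces back together to confirm nothing was lost. GNU split also has a -C option that caps each chunk at a byte size while keeping lines whole.

```shell
# Small stand-in for the real word list in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' alpha bravo charlie delta echo > antipublic_pwul24.txt

# Split into chunks of 2 lines each, named chunk_aa, chunk_ab, chunk_ac.
split -l 2 antipublic_pwul24.txt chunk_
ls chunk_*

# After editing, the chunks can be joined again; the glob expands in sorted order.
cat chunk_* > rejoined.txt
cmp -s antipublic_pwul24.txt rejoined.txt && echo "round trip OK"
```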
Bonus: Removing HEX and “Junk” Characters
After combining several of these lists, I found that several editors and even command line tools like less and tail would display word lists as “garbage.” This is likely because unprintable characters in some of the lists I was combining tricked certain applications into thinking that the file was a binary file.
Since we know our word lists aren’t binary files, the following command will strip all of the junk out of a file called “foo” and output all of the printable ASCII characters into a file called “bar.”
tr -cd '\11\12\15\40-\176' < foo > bar
This command executes very quickly and will save you a lot of time if you’re trying to manually identify and remove these characters from large text files. Thanks so much to Alvin Alexander for this information.
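To see what the character list does: -c complements the set and -d deletes, so tr removes every byte that is not a tab (\11), line feed (\12), carriage return (\15), or a printable ASCII character from space (\40) through tilde (\176). The sketch below runs it on made-up input containing a null byte and a two-byte UTF-8 character. LC_ALL=C is added here so tr works on raw bytes regardless of locale. The second command is a variation that also drops carriage returns, which can be handy for lists that came from Windows systems.

```shell
# Input: "pass<NUL>word", then "caf" + UTF-8 e-acute + tab + "ok".
cleaned=$(printf 'pass\000word\ncaf\303\251\tok\n' |
    LC_ALL=C tr -cd '\11\12\15\40-\176')
printf '%s\n' "$cleaned"

# Variation: leave \15 out of the kept set to strip carriage returns too.
no_cr=$(printf 'abc\r\n' | LC_ALL=C tr -cd '\11\12\40-\176')
printf '%s\n' "$no_cr"
```

Note that this also removes every non-ASCII character, so passwords written in other scripts are shortened or emptied rather than preserved.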