If you are not familiar with the term, Section 16 is a section of the Securities Exchange Act of 1934 that attempts to mitigate insider trading at companies. It states that any person who is a director of the company, an officer of the company, or a beneficial owner, directly or indirectly, of more than 10% of a class of the company’s equity securities must file Forms 3, 4, and 5. These forms report the person’s level of equity interest and any significant change to it. The terms that you will see referenced here are reporting owner, who is the person of interest, and issuer, which is the company in which the reporting owner has equity.
Friday, January 04. 2019
LDC #117: Building a Section 16 Data Scraper, Part 1: Collection
Today we are going to start building a new feature that will help in the creation of Section 16 forms in GoFiler. It takes the CIK of a reporting owner and builds a spreadsheet of past filing data, and then allows a user to create a new Form 4 or 5 that can automatically pull in information from previous filings as needed. This is going to be broken down into multiple weeks of blog posts, with a different focus each week. Each week is going to build on the previous post and script, and there may be significant changes to previous scripts as time goes on.
The first topic we are going to talk about is gathering old filings from the SEC. Just as companies are given CIK and CCC credentials, reporting owners are also given CIK and CCC credentials for use on the EDGAR system. This gives each reporting owner a unique identifier on the system, since they may report ownership in multiple issuers. It also allows us to pull all of the Section 16 filings done by a particular reporting owner, even if the filing itself is done under the issuer’s credentials.
Today’s script is just the stepping stone for the next few weeks. We are going to take a reporting owner’s CIK, get all of the Section 16 filings done by them, and then cache the filings locally. Finally, we put the list of all of the filings into a table. By the end of the script we have a list of Section 16 filings. In the next blog post we can then iterate through the list to get all of the information out of the filings.
Here is the full script:
    /*****************************************************************************************************************

            Section 16 Data Collector
            -------------------------

            Revision:

                01-04-19  JCK       Initial creation - get filings from SEC

            Notes:

                -Stage 1

                                    (c) 2019 Novaworks, LLC. All Rights Reserved.

    *****************************************************************************************************************/

    #define TEST_CIK "0001214156"

    boolean IsSection16 (string type);

    void main() {

        //creation variables
        string files[];         //Files from SEC
        string s16filings[];    //Table for us
        string cache;           //Location of cache
        string name;            //Name of file
        boolean s16Check;       //Placeholder boolean
        int rc;                 //Error Checker
        string c_cik;           //CIK of Reporting Owner
        handle file;            //Handle to the EDGAR Archive
        string base;

        c_cik = TEST_CIK;
        files = EDGARFetchArchiveList(c_cik);
        rc = GetLastError();
        if (rc == (ERROR_REMOTE | ERROR_FILE_NOT_FOUND)) {
            MessageBox("CIK Does Not Exist");
            return;
        }
        if (rc == ERROR_OVERFLOW) {
            files = EDGARFetchArchiveList(c_cik, FALSE, 0, 0, TRUE);
        }

        int count;
        int numfiles;
        int max;

        max = ArrayGetAxisDepth(files);
        count = 0;
        numfiles = 0;

        cache = GetScriptFolder() + "\\Cache\\" + c_cik;
        CreateFolders(cache);
        if (IsFolder(cache) == FALSE) {
            MessageBox('x', "Unable to create document cache.");
            return;
        }

        ProgressOpen("Getting Files");
        ProgressSetPhaseCount(max-1);

        while (count < max) {
            ProgressSetPhase(count);
            ProgressSetStatus(FormatString("Getting file %d of %d", count, max));
            ProgressUpdate(count, max);
            base = ReplaceInString(GetFilename(files[count]), ".txt", ".xml");
            if (DoesFileExist(AddPaths(cache, base)) == FALSE) {
                file = EDGARArchiveOpen(files[count]);
                s16Check = IsSection16(EDGARArchiveGetDocType(file, 0));
                if (s16Check) {
                    name = AddPaths(cache, EDGARArchiveGetProperty(file, "accession_number") + ".xml");
                    EDGARArchiveGetDocFile(file, 0, name);
                    s16filings[numfiles++] = name;      //Only list filings we actually cached
                }
                CloseHandle(file);
            }
            else {
                s16filings[numfiles++] = AddPaths(cache, base);
            }
            count++;
        }

        ProgressClose();

        count = ArrayGetAxisDepth(s16filings) - 1;
        while (count >= 0) {
            AddMessage("%s", GetFilename(s16filings[count]));
            count--;
        }
    }

    //Returns true if S16 file type, false if anything else
    boolean IsSection16(string type) {
        switch (type) {
            case "3":
                return true;
            case "4":
                return true;
            case "5":
                return true;
            case "3/A":
                return true;
            case "4/A":
                return true;
            case "5/A":
                return true;
        }
        return false;
    }
Now let’s break down how the script works and take a look at what we are doing to get this list of filings.
    #define TEST_CIK "0001214156"

    boolean IsSection16 (string type);

    void main() {

        //creation variables
        string files[];         //Files from SEC
        string s16filings[];    //Table for us
        string cache;           //Location of cache
        string name;            //Name of file
        boolean s16Check;       //Placeholder boolean
        int rc;                 //Error Checker
        string c_cik;           //CIK of Reporting Owner
        handle file;            //Handle to the EDGAR Archive
        string base;

        c_cik = TEST_CIK;
        files = EDGARFetchArchiveList(c_cik);
        rc = GetLastError();
        if (rc == (ERROR_REMOTE | ERROR_FILE_NOT_FOUND)) {
            MessageBox("CIK Does Not Exist");
            return;
        }
        if (rc == ERROR_OVERFLOW) {
            files = EDGARFetchArchiveList(c_cik, FALSE, 0, 0, TRUE);
        }
In a future post we are going to allow the user to enter a CIK, but since I did not want to delve into creating a dialog first thing, I’m currently hardcoding a CIK that we will use for our tests. You can update the define to be any valid CIK and this script will download all of the Section 16 filings done by that CIK. In practice we want to limit that to just reporting owners, but there is no effective way of doing that based on the information provided by EDGAR.
Other than defining a CIK, we also declare our supporting function and then enter main. We then declare a bunch of variables that we will use throughout the script. Next we set our CIK to be our defined TEST_CIK and fetch the EDGAR Archive List for the CIK using the EDGARFetchArchiveList function. This gets all of the EDGAR filings ever done under this CIK. We then check a couple of things from the archive list function. First we check to see if the CIK is valid. If the CIK is not found, it was not valid, and there is no reason to continue running the script. Finally we check to see if there’s an overflow error, which can happen if there are a large number of filings done under this CIK. While this should rarely be hit because the limit is so large (approximately 2000 filings), checking for it is a good habit to get into because it keeps the script working in every case. If an overflow is hit, we rerun the fetch of the archive with the piq_flag (perform incremental query) set to TRUE, which will cause GoFiler to run the query month-by-month. This gets around the 2000 filing limit, but will significantly increase the execution time. This is why we only go to this method if the normal list overflows.
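The "fast query first, slow query only on overflow" pattern is not specific to Legato. As a rough illustration only, here it is sketched in Python. The names `fetch_all`, `fetch_by_month`, and `QueryOverflow` are hypothetical stand-ins for the two ways of calling EDGARFetchArchiveList and for the ERROR_OVERFLOW result; none of them exist in GoFiler.

```python
class QueryOverflow(Exception):
    """Hypothetical stand-in for Legato's ERROR_OVERFLOW result."""


def fetch_with_fallback(cik, fetch_all, fetch_by_month):
    """Try the fast single query first; fall back to the slow one on overflow.

    fetch_all and fetch_by_month are hypothetical callables that take a CIK
    and return a list of archive URLs.
    """
    try:
        return fetch_all(cik)
    except QueryOverflow:
        # Only pay for the month-by-month query when the fast one overflows.
        return fetch_by_month(cik)
```

The design choice is the same as in the script: the expensive path stays available for completeness, but a typical reporting owner never triggers it.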
        int count;
        int numfiles;
        int max;

        max = ArrayGetAxisDepth(files);
        count = 0;
        numfiles = 0;

        cache = GetScriptFolder() + "\\Cache\\" + c_cik;
        CreateFolders(cache);
        if (IsFolder(cache) == FALSE) {
            MessageBox('x', "Unable to create document cache.");
            return;
        }

        ProgressOpen("Getting Files");
        ProgressSetPhaseCount(max-1);
We then define some integers that we will use to control our loops. We get the depth of the files array, which gives us the number of used elements in the array. Next we create our cache folder by getting the current script folder, adding a “Cache” directory, and then adding a folder named after the CIK of the reporting owner. This gives each reporting owner their own personal cache, so each time we run the script we only have to download a file the first time we see it. We use CreateFolders to make the cache folder, and report an error if we can’t. A failure likely means that the script is not saved anywhere, or that it is saved in a location the user cannot modify. More than likely, when we put the finishing touches on the script we will make the location a value that can be modified by the user, but for now we are making the script work before making it look good.
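For readers who think more easily in another language, the per-owner cache folder can be sketched in Python as below. This is only an illustration of the folder layout described above; `make_cache_dir` and its `base_dir` parameter are hypothetical names, not part of the script.

```python
import os


def make_cache_dir(base_dir, cik):
    """Create <base_dir>/Cache/<cik> if needed and return its path.

    Raises OSError if the folder cannot be created, which plays the role of
    the "Unable to create document cache" check in the script.
    """
    cache = os.path.join(base_dir, "Cache", cik)
    os.makedirs(cache, exist_ok=True)  # no error if it already exists
    return cache
```

Because the call succeeds when the folder already exists, it is safe to run on every execution, just as the script calls CreateFolders each time.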
The last thing we do in this section is create a progress dialog and set the number of phases based on our file array depth. After the first time that we run the script for a CIK, the progress box will not show up for very long because the cached filings are processed quickly. On the initial run, when it has to download all of the filings, it takes a few seconds per filing. We use a progress bar to give the user a small heads up as to how long the script is going to take.
After this point we get to the main loop of this portion of the script:
        while (count < max) {
            ProgressSetPhase(count);
            ProgressSetStatus(FormatString("Getting file %d of %d", count, max));
            ProgressUpdate(count, max);
            base = ReplaceInString(GetFilename(files[count]), ".txt", ".xml");
            if (DoesFileExist(AddPaths(cache, base)) == FALSE) {
                file = EDGARArchiveOpen(files[count]);
                s16Check = IsSection16(EDGARArchiveGetDocType(file, 0));
We iterate through the list of files from the EDGAR Archive. The array is a series of URLs that point to the EDGAR website, so we get the filename from the URL and change “.txt” to “.xml” because we will end up only saving the XML portion in our cache. We set the progress as we get to each file so that the user knows how far along we are. We then check to see if the file has already been cached. If it has not been cached we open the EDGAR Archive and we check to see if the file type matches a Section 16 file type. Let’s take a look at that function and then I’ll go in more depth about what the EDGAR Archive is.
    //Returns true if S16 file type, false if anything else
    boolean IsSection16(string type) {
        switch (type) {
            case "3":
                return true;
            case "4":
                return true;
            case "5":
                return true;
            case "3/A":
                return true;
            case "4/A":
                return true;
            case "5/A":
                return true;
        }
        return false;
    }
This is a very quick function that takes the type and turns it into a boolean. We use it to make sure that the file type of each filing is a Section 16 file type before we download it and add it to the cache. Section 16 file types are 3, 3/A, 4, 4/A, 5, and 5/A. I threw them all into a switch statement with return true if it is caught anywhere in the statement. We don’t have to worry about breaks for each case as returning breaks out of the whole function at that point. If we make it through the switch statement the function returns false.
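The same check can also be expressed as a set membership test. The sketch below is Python rather than Legato and is offered only to show the idea; `SECTION_16_TYPES` and `is_section16` are names made up for this example.

```python
# The six form types named above, stored once so the check is a single lookup.
SECTION_16_TYPES = frozenset({"3", "3/A", "4", "4/A", "5", "5/A"})


def is_section16(form_type):
    """Return True only for the six Section 16 form types."""
    return form_type in SECTION_16_TYPES
```

Keeping the types in one collection means that adding or removing a form type is a one-line change rather than a new case in the switch.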
So now let’s quickly talk about the EDGAR Archive object. This is the object that we get when we use the EDGARArchiveOpen function. This function loads an archive file into memory and gives us a handle to it. We can then get properties from the file. This script only scratches the surface of the properties that can be retrieved; we will retrieve the accession number and the date filed. We could also ask for: Period of Report, Item Information, Company Name, Company CIK, Company SIC, IRS Number, Fiscal Year End, as well as all of the different parts of the company’s address. A full list of properties can be found in the Legato Documentation for the function EDGARArchiveGetProperty.
Let’s finish up the while loop that we started in our main function:
                if (s16Check) {
                    name = AddPaths(cache, EDGARArchiveGetProperty(file, "accession_number") + ".xml");
                    EDGARArchiveGetDocFile(file, 0, name);
                    s16filings[numfiles++] = name;      //Only list filings we actually cached
                }
                CloseHandle(file);
            }
            else {
                s16filings[numfiles++] = AddPaths(cache, base);
            }
            count++;
        }

        ProgressClose();
If the file in the archive is a Section 16 file type, we put together the name of the file by getting the accession number and putting a “.xml” at the end. We attach that to the folder that we created for the cache and then use the EDGARArchiveGetDocFile function to download the actual XML file from the SEC’s website and stick it into our cache folder. We then add the file to the list of filings and close the handle to our EDGAR Archive.
If the cache file already exists, we add the location of the cached filing to our list of filings. The very last thing to do is to increment the count and continue the loop. Once the loop has finished we close the progress window.
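Taken as a whole, the loop is a "use the cached copy, otherwise download it" pattern. The Python sketch below mirrors that logic for illustration only. `get_form_type` and `download` are hypothetical callables standing in for the EDGAR Archive functions, and the example assumes, as the script does, that the archive URL's filename matches the name of the cached file.

```python
import os
from urllib.parse import urlparse

SECTION_16_TYPES = {"3", "3/A", "4", "4/A", "5", "5/A"}


def cached_name(url):
    """Map an archive URL ending in .txt to the .xml name used in the cache."""
    filename = os.path.basename(urlparse(url).path)
    stem, _ext = os.path.splitext(filename)
    return stem + ".xml"


def collect(urls, cache_dir, get_form_type, download):
    """Return cache paths of Section 16 filings, downloading only new ones.

    get_form_type(url) and download(url, path) are hypothetical callables
    standing in for the EDGAR Archive functions used by the script.
    """
    filings = []
    for url in urls:
        path = os.path.join(cache_dir, cached_name(url))
        if os.path.exists(path):
            filings.append(path)      # already cached: no download needed
        elif get_form_type(url) in SECTION_16_TYPES:
            download(url, path)       # first time we see this filing
            filings.append(path)
        # anything else is not a Section 16 filing and is skipped
    return filings
```

Like the script, this sketch has to re-check the type of every non-Section 16 filing on each run, because only Section 16 filings leave a file behind in the cache.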
At this point we are essentially done with this week’s script. We take a CIK and cache any new Section 16 filings this CIK has done since the last time the script was run, and then we have the list of filings in an object in memory. The last section of the script shows us what it has done, in a way that we will reuse in the next post to iterate through the files:
        count = ArrayGetAxisDepth(s16filings) - 1;
        while (count >= 0) {
            AddMessage("%s", GetFilename(s16filings[count]));
            count--;
        }
    }
We get the number of elements in our filings table and iterate through them from last to first, printing the name of each file to the log. This is a proof of concept that our collector is working, in addition to the files that are created and saved on the computer when it is run.
So here we are at a good stopping point. Our base structure is complete, but there is still a lot of work to be done before this is a fully functioning feature. We still need to parse the files that we have collected and aggregate the data contained within. We still need to present the aggregated data in a human-friendly form and allow the user to use this information in filings. And we still need to give the script a nice user interface so that users do not have to modify hard-coded CIK values.
Still a long road ahead, but we will keep building on our base framework until the structure is complete.
Joshua Kwiatkowski is a developer at Novaworks, primarily working on Novaworks’ cloud-based solution, GoFiler Online. He is a graduate of the Rochester Institute of Technology with a Bachelor of Science degree in Game Design and Development. He has been with the company since 2013.