Posts on Russell ‛Russ’ Frith

Scoop Over Chocolatey: A Windows Package Manager Transition Guide

Mon, 04 Mar 2024 15:08:01 -0400

Introduction

For the last decade, I have been using Chocolatey as my package manager on Windows. I recently needed to install software on a new machine, and decided that now was an ideal time to revisit some popular package managers and perhaps make a switch.

One of my favorite benefits of package managers is the ability to install software in a consistent way across a variety of different applications. Having a command line tool with remote repositories of current versions of software makes it easy to keep applications up to date. Finally, it makes them easy to remove in a unified way should they no longer be needed. When setting up a new machine, the ability to have a script that installs most if not all the applications that I will need on that machine using such a tool is also highly beneficial.

Package Managers

A package manager is a tool or system that automates the processes of installing, upgrading, configuring, and using software, primarily aimed at developer tools. Developers specify their required tools, and the package manager installs and configures them based on these specifications. This not only saves time in setting up a development environment, but also ensures consistency in the versions of packages installed. Let’s explore some of the most popular package managers and examine their respective strengths and weaknesses.

winget

The Windows Package Manager command-line tool, winget, is available on current versions of Windows as part of the App Installer. Microsoft launched it to simplify application management on Windows systems. Announced in May 2020 at Microsoft Build, it allows users to discover, install, upgrade, remove, and configure applications, filling a gap in Windows’ developer tools by providing functionality similar to package managers in Linux or macOS. Its introduction marks a significant step by Microsoft to enhance the Windows developer experience.

winget distinguishes itself from other popular Windows package managers in several ways:

Integration with Windows: As a Microsoft product, winget is closely integrated with the Windows ecosystem. This integration offers potential for more seamless updates and compatibility with Windows-specific features and updates.
Official Microsoft Support: winget is developed and supported by Microsoft, which may provide a sense of security and stability for users and organizations, particularly in terms of long-term support and integration with other Microsoft products.
Open Source and Community Contributions: Unlike Ninite or the discontinued AppGet, winget is open source, allowing the community to contribute to its development, suggest features, and improve the tool over time.
Manifest-Based System: winget uses a manifest system for package definitions, which can be more standardized and consistent compared to some methods used in other package managers like Chocolatey or Scoop.
Native Command-Line Experience: winget offers a native command-line experience that is consistent with other Windows command-line tools, which might be more familiar to Windows users than the interfaces provided by Chocolatey or Scoop.
No Need for Third-Party Repository: While Chocolatey and Scoop rely heavily on community-maintained repositories, winget can pull software directly from the Microsoft Store in addition to its community repository, potentially offering more official and trusted sources for software.
Scripted Installations and Batch Operations: winget allows for more complex scripted installations and batch operations, similar to Chocolatey and Scoop, but with the potential for deeper integration with Windows-specific features.
Community-Driven Repositories: Like Scoop and Chocolatey, winget allows for community-driven repositories, but being a Microsoft product, it potentially has a broader reach and might garner a more extensive repository of packages.

In summary, winget combines the familiarity and integration of a Microsoft product with the flexibility and community-driven approach of open-source package managers like Chocolatey and Scoop, while also providing a native and streamlined command-line interface for Windows users.

Chocolatey

Chocolatey is distinguished by its extensive repository encompassing a wide array of both open-source and proprietary software. It excels in offering advanced features like custom package creation, deep integration with Windows and PowerShell for enhanced scripting, and robust version control for precise software management. Particularly appealing to system administrators and power users, Chocolatey supports integration with configuration management tools like Puppet, Chef, and Ansible, facilitating automated, consistent configurations across multiple machines. Its command-line interface is comprehensive and user-friendly, catering to both seasoned scripters and those new to command-line operations. The community-driven approach in maintaining and updating its packages further adds to its appeal, ensuring a broad software selection and timely updates.

Chocolatey distinguishes itself from other Windows package managers through several unique features and capabilities:

Extensive Software Repository: Chocolatey offers a vast and diverse repository of both open-source and proprietary software, surpassing the variety found in other managers like winget, Scoop, or Ninite.
Package Creation and Customization: Unlike Ninite or AppGet, Chocolatey allows users and organizations to create and host their own packages. This feature is particularly beneficial for businesses that need to manage internal software distributions or specific software configurations.
Integration with Windows Features and PowerShell: Chocolatey’s deep integration with Windows features, especially PowerShell, stands out compared to alternatives. This integration allows for more advanced scripting and automation, which is particularly appealing for system administrators and power users.
Version Control and Software Management: Chocolatey excels in its ability to install specific versions of software and manage upgrades and downgrades, offering more control compared to winget and Ninite. This is crucial in environments where specific software versions are required for compatibility or testing purposes.
Community-Driven Package Maintenance: The Chocolatey community plays a significant role in maintaining and updating packages. This community-driven approach often leads to quicker updates and a broader range of software than what might be found in more centrally managed repositories like winget or Ninite.
Compatibility with Configuration Management Tools: Chocolatey uniquely integrates with configuration management tools like Puppet, Chef, and Ansible. This makes it a more versatile choice for enterprise environments where consistent and automated configuration across multiple Windows machines is essential.
Comprehensive CLI Support: While Scoop and winget also offer command line interfaces, Chocolatey’s CLI is more robust and feature-rich, providing a level of control and flexibility that is particularly beneficial in automated scripts and batch operations.
Security and Compliance Features: For enterprise users, Chocolatey offers additional security and compliance features, like package internalization and private repositories, which are not as prominent in winget, Scoop, or Ninite.

In summary, Chocolatey’s unique combination of an extensive software repository, package creation and customization, deep Windows integration, advanced version control, community support, enterprise features, and powerful CLI distinguishes it from other Windows package managers like winget, Scoop, and Ninite.

Scoop

Scoop is a command-line package manager for Windows, known for its simplicity and focus on user convenience. It stands out for its straightforward setup and zero-configuration philosophy, appealing especially to developers and those who prefer a minimalistic approach. Scoop installs programs to the user’s home directory, avoiding system-wide changes and making it non-intrusive compared to traditional Windows installers. This approach also facilitates easy backup and sync of installed applications. It primarily focuses on open-source and developer tools, offering a curated selection that simplifies the installation and update process for commonly used utilities and applications. Scoop’s repository is community-maintained, ensuring a relevant and up-to-date software selection. Its ease of use, coupled with the ability to handle dependencies elegantly, makes it a favored choice for users seeking a lightweight and efficient package management solution on Windows.

Scoop has several unique features:

Focus on Developer Tools: Scoop is primarily focused on making it easy for developers to install tools they need. It’s optimized for command-line users and simplifies the process of managing development tools.
User Space Installation: Unlike some other package managers, Scoop installs programs in the user’s space, not system-wide. This means no administrator privileges are needed for installation. It also leaves the system-wide PATH untouched by default: installed commands are exposed through a single shims directory on the user’s PATH, reducing potential system conflicts.
Portable Applications: Scoop is designed with a preference for portable applications. This means that applications are less likely to interfere with the rest of the system, as they are self-contained.
Simplified App Management: Scoop provides a simple command line interface for installing applications, without the need for a GUI. It also allows for easy version management and updating of installed programs.
Git-Powered: Scoop uses Git to manage its buckets (collections of applications), making it easy for users to add or create their own buckets. This contrasts with the centralized repositories of some other package managers.
Manifest-Driven: In Scoop, each application has a manifest that describes how it should be installed. This makes it easier to add new applications to Scoop compared to some other package managers that may require more complex packaging processes.
Focus on Avoiding Bloatware: Scoop tends to avoid applications that come bundled with unwanted extras, a practice sometimes seen in other free software distributions.
Customizability and Scripting: Because of its command-line nature and use of PowerShell, Scoop is highly customizable and can be integrated into scripts for automation purposes.

In contrast:

winget is Microsoft’s own package manager, integrated with Windows and focusing more broadly on all types of software.
Chocolatey is more feature-rich and caters to both end-users and enterprise needs with a focus on comprehensive package management solutions, including GUI support.
Ninite is known for its simplicity and is often used for quick, bulk installations of popular applications, usually for setting up new systems.

Each of these package managers serves different user needs and preferences, with Scoop being particularly favorable for developers and command-line enthusiasts who prefer a minimalistic and scriptable approach.

Ninite

Ninite is a streamlined package management solution for Windows, recognized for its user-friendly approach, particularly suited for non-technical users. It simplifies the process of installing and updating multiple applications with just a few clicks. Users select their desired software from the Ninite website, which then creates a customized installer to install all chosen programs in one go, without any further interaction. Ninite stands out for its automated, batch installation process that smartly avoids toolbars or extra junk, ensuring a clean and hassle-free setup. It primarily caters to popular, commonly-used software, making it a convenient choice for quick setup of new computers or routine maintenance. The service is especially popular for its reliability and security, as it automatically updates all installed programs to their latest versions, reducing the risk of security vulnerabilities. Ninite’s appeal lies in its simplicity and effectiveness, making it an excellent choice for those who prefer a straightforward, no-fuss software management tool.

Ninite is unique in several ways:

User-Friendliness: Ninite is known for its exceptionally user-friendly interface. It allows users to select multiple applications from its website and creates a single installer to install all chosen apps at once. This simplicity contrasts with other package managers, which often require command-line interaction.
Silent Installation: Ninite automates the installation process without requiring user interaction. It automatically declines toolbar installations and other add-ons, installs the latest version of the software, and sets up programs in their default locations. This automatic, ‘silent’ approach is different from others that might require user input during installations.
Batch Updates: Ninite can also be used for updating installed programs. It checks for and installs updates for all supported apps, again doing so silently and automatically. This is a distinctive feature that may not be as straightforward in other package managers.
No Bundleware: Ninite is known for not including any bundleware or extra software in its installations, which is a significant advantage over some other package managers that might include additional offers or software in their installations.
Focus on Free and Popular Software: Ninite primarily focuses on popular free software, making it a go-to for quickly setting up a new PC with essential tools and utilities. Other package managers may offer a wider range of software, including more niche or specialized tools.
No Command-Line Interface (CLI): Unlike winget, Chocolatey, and Scoop, which offer command-line interfaces for more control and scripting capabilities, Ninite operates almost entirely through a graphical interface. This makes it more accessible to casual users but less flexible for power users.
Automatic Background Updates (Pro Version): Ninite Pro offers additional features like automatic background updates and network deployment, which can be particularly useful in a corporate environment for maintaining multiple machines.

These features make Ninite especially appealing for less technical users or those looking for a hassle-free way to install and update a standard set of applications on their PCs. However, it may be less suitable for users who need a broader range of software options or more control over the installation process.

Why Scoop Stands Out

Scoop stands out for a developer like me due to its primary focus on developer tools, making it an ideal choice for those needing a straightforward, command-line-centric tool for managing development applications. The user space installations are a key feature for individuals without administrator rights or those who prefer not to make system-wide changes, enhancing both ease of use and system stability. Additionally, Scoop’s preference for portable applications and its Git-powered, manifest-driven approach offer a level of flexibility and transparency that is highly valued in a developer’s workflow. Moreover, its design philosophy to avoid bloatware, coupled with its capabilities for customization and automation through scripting, make it an attractive option for users seeking minimalistic and efficient package management. While winget, Chocolatey, and Ninite each have their own strengths, Scoop’s specific focus on minimal system impact, developer-friendly features, and ease of use in a command-line environment sets it apart as the go-to choice for many developers and command-line users.

Scoop Deep Dive

Installing Scoop

Installing Scoop is a straightforward process. Here’s a step-by-step guide:

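Open PowerShell: First, you need to open PowerShell. You can do this by searching for PowerShell in the Start menu. A regular, non-elevated session is sufficient: Scoop installs into your user profile, and its installer refuses to run as administrator by default.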
Configure Execution Policy (if needed): Scoop requires the ability to execute scripts. In PowerShell, run the following command to allow script execution:
```
Set-ExecutionPolicy RemoteSigned -scope CurrentUser
```
This command changes the policy to allow the execution of scripts, while still protecting against the execution of scripts downloaded from the Internet, unless they are signed by a trusted publisher.

Note on Security: Setting the PowerShell execution policy to RemoteSigned for the CurrentUser scope, as done with Set-ExecutionPolicy RemoteSigned -Scope CurrentUser, poses a moderate security risk. This policy allows locally created scripts to run without a digital signature, but requires scripts downloaded from the internet to be signed by a trusted publisher. While this setting prevents the execution of unsigned scripts from external sources, it doesn’t safeguard against all risks. Locally created or modified scripts, which might include malicious code, can still run without any signature verification. Additionally, attackers could potentially exploit this policy by tricking users into downloading and locally modifying scripts, thereby bypassing the need for a digital signature. This setting strikes a balance between usability and security, but users should remain cautious of the scripts they run, especially those obtained from untrusted sources.
Install Scoop: To install Scoop, run the following command in PowerShell:
```
Invoke-RestMethod -Uri https://get.scoop.sh | Invoke-Expression
```
This command downloads the Scoop installation script and executes it. Invoke-RestMethod fetches the script directly from https://get.scoop.sh, and Invoke-Expression runs the script.
Install Git (Recommended): After installing Scoop, it is recommended to install Git. Many packages, including those in the ’extras’ bucket, require Git. To install Git, use the following Scoop command:
```
scoop install git
```
Verification: After the installation script completes, you can verify the installation by typing scoop help in PowerShell. This should display a list of Scoop commands, indicating that Scoop is installed correctly.
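Optional - Configure Scoop to use a Proxy (if needed): If you are behind a proxy, you may need to configure Scoop to use it. Scoop has its own proxy setting (scoop config proxy host:port), which is the route its documentation describes. Tools in the same session that read the conventional http_proxy and https_proxy environment variables can be pointed at the proxy by setting those variables in PowerShell: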
```
$env:http_proxy="http://yourproxy:port"
$env:https_proxy="http://yourproxy:port"
```
Replace yourproxy and port with your proxy details.
Optional - Associate .git* and .sh Files: After installing Git, you can optionally create file associations for .git* and .sh files. This can be done by running the install-file-associations.reg file, which is included in the Git installation via Scoop.
```
cd path\to\git\installation
& .\install-file-associations.reg
```
Replace path\to\git\installation with the actual installation path of Git. This step is especially useful for developers who frequently work with Git and shell scripts.

Scoop is designed to make the installation process easy and straightforward, particularly for command-line tools and developer utilities. If you encounter any issues during the installation, make sure your execution policy allows scripts to run, that you are in a regular (non-elevated) PowerShell session, and that your internet connection is stable. Scoop’s GitHub page and community forums can also be useful resources for troubleshooting.

Basic Commands

Scoop is a command-line package manager for Windows that offers a variety of commands for managing software installations. Here’s a guide to some of the basic commands and their usage:

Install a Package: To install a package with Scoop, use the install command followed by the package name. For example, to install Visual Studio Code:
```
scoop install vscode
```
Note on Additional Customization: Some packages, like Visual Studio Code, offer additional setup options such as creating context menu entries (right-click menu options) or file associations. After installing such packages, you might find additional scripts or instructions in the package’s installation folder to enable these features. This can enhance your workflow by integrating the software more deeply into your Windows environment. However, it’s important to note that using these .reg files or scripts might deviate from the standard Scoop practice of keeping programs self-contained within the user folder. For detailed steps and to understand the implications of these modifications, refer to the documentation of the specific package or check the installation folder.
List Installed Packages: To see a list of all packages that you’ve installed with Scoop, use the list command:
```
scoop list
```
Search for a Package: If you’re not sure about the exact name of a package, you can search for it using the search command:
```
scoop search [package-name]
```
Replace [package-name] with the name or part of the name of the package you’re looking for.
Check for Updates: To check if any of your installed packages have updates available, use the status command:
```
scoop status
```
Update a Package: If there’s an update available for a package, you can update it using the update command. To update Git, for example:
```
scoop update git
```
To update all your installed packages at once, use:
```
scoop update *
```
Uninstall a Package: To remove a package that you no longer need, use the uninstall command:
```
scoop uninstall [package-name]
```
View Package Information: To view information about a specific package (like version, dependencies, etc.), use the info command:
```
scoop info [package-name]
```
Add a Bucket: Buckets in Scoop are like repositories that contain manifests for installing software. To add a new bucket, use the bucket add command. For example, to add the ’extras’ bucket:
```
scoop bucket add extras
```
List Buckets: To see a list of all the buckets you’ve added, use the bucket list command:
```
scoop bucket list
```
Cleanup Old Versions: Over time, you may accumulate older versions of packages. To remove these old versions and free up disk space, use the cleanup command:
```
scoop cleanup [package-name]
```
To cleanup old versions of all packages, simply run:
```
scoop cleanup *
```

These basic commands cover most of the everyday usage scenarios for Scoop. It’s worth noting that Scoop is particularly popular among developers and power users who prefer a command-line interface for managing software installations. For more advanced commands and options, you can always refer to the Scoop documentation or use the scoop help command.

Managing Multiple Versions

Scoop provides additional commands and features that are particularly useful for managing multiple versions of the same software. This capability is especially valuable for developers who may need to switch between different versions of tools or programming languages for different projects. Here are some key commands and concepts related to managing multiple versions in Scoop:

Hold and Unhold a Version:
- Hold: If you want to prevent a specific version of a package from being updated, you can ‘hold’ it. This ensures that the package stays at the current version even when you update other packages.
```
scoop hold [package-name]
```
- Unhold: To remove the hold and allow updates again, use ‘unhold’:
```
scoop unhold [package-name]
```
Install Specific Versions: Scoop allows you to install specific versions of a package (if available in the bucket). Use the install command with the @[version] syntax:
```
scoop install [package-name]@[version]
```
Replace [version] with the desired version number.
Switch Between Versions: If you have multiple versions of a package installed, you can switch between them using the reset command. This command sets the specified version as the current active version.
```
scoop reset [package-name]@[version]
```
List All Versions of a Package: To see all the installed versions of a particular package, use the list command with the package name:
```
scoop list [package-name]
```
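Check Available Versions: Scoop has no subcommand that lists every historical version of a package; a manifest normally describes only the current release. Alternative major versions are usually published as separately named manifests in the ’versions’ bucket, so adding that bucket and searching is the practical way to see what is installable:
```
scoop bucket add versions
scoop search [package-name]
```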
Remove Specific Versions: If you want to remove a specific version of a package, use the uninstall command with the specific version:
```
scoop uninstall [package-name]@[version]
```

These commands enhance Scoop’s versatility, making it a powerful tool for scenarios where version management is crucial. It’s particularly useful in development environments where testing across multiple versions of languages, libraries, or tools is necessary. Remember that the availability of different versions depends on the package and the bucket it belongs to. Not all packages may have multiple versions available in Scoop’s default buckets.

Buckets and Repositories

Customizing Scoop with buckets and repositories is a key aspect of its flexibility and power. Buckets in Scoop are analogous to repositories in systems like Git. They are collections of “manifests” (JSON files) that describe how to install each application. By adding and managing buckets, you can extend Scoop’s range of available software and tailor it to your specific needs. Here’s how to customize Scoop using buckets and repositories:

Understanding Buckets:
- A Scoop bucket is a Git repository containing application manifests. Scoop uses these manifests to install applications.
- By default, Scoop comes with its main bucket, which contains a wide range of commonly used applications.
Adding Custom Buckets:
- To add a new bucket, use the bucket add command followed by the bucket’s name and, optionally, the Git URL if it’s a third-party bucket:
```
scoop bucket add [bucket-name] [bucket-url]
```
- For well-known buckets, like ’extras’, you don’t need to specify the URL:
```
scoop bucket add extras
```
- These commands add the specified bucket to your Scoop installation, making all the applications in that bucket available for installation.
Listing Available Buckets:
- You can list all the buckets currently added to your Scoop installation with the bucket list command:
```
scoop bucket list
```
Searching Across Buckets:
- Once you’ve added a bucket, you can search for applications across all your added buckets using the search command:
```
scoop search [application-name]
```
Creating Your Own Bucket:
- If you have a set of applications or specific versions that aren’t available in existing buckets, you can create your own. To do this, you need to:
  - Create a new Git repository.
  - Add application manifests in the form of JSON files to this repository.
  - Use the bucket add command to add your custom bucket to Scoop.
Contributing to Existing Buckets:
- If you want to add a new application or update an existing one in a public bucket, you can fork the bucket’s repository, make your changes, and then submit a pull request. This is common in open-source projects and allows for community contributions.
Removing Buckets:
- If you no longer need a specific bucket, you can remove it from your Scoop installation using the bucket rm command:
```
scoop bucket rm [bucket-name]
```
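As a rough illustration of the "Creating Your Own Bucket" steps, here is a minimal sketch, written in Python, that produces the JSON text for one manifest. The application name, URLs, and hash are placeholders invented for this example; only the field names (version, description, homepage, license, url, hash, bin) follow Scoop's manifest format.

```python
import json

# Hypothetical application: every value below is a placeholder, not a real package.
manifest = {
    "version": "1.0.0",
    "description": "Example command-line tool",
    "homepage": "https://example.com/mytool",
    "license": "MIT",
    "url": "https://example.com/mytool-1.0.0.zip",
    "hash": "<sha256 of the downloaded archive>",
    "bin": "mytool.exe",  # executable that Scoop should expose on the PATH
}

# Manifests are plain JSON files, one per application.
manifest_text = json.dumps(manifest, indent=4)
print(manifest_text)
```

Saved as mytool.json in the bucket's Git repository, a file like this is what scoop install mytool would read once the bucket has been added.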

Customizing Scoop with buckets allows you to significantly expand the range of software that you can manage with it. Whether you’re adding existing third-party buckets, creating your own for unique needs, or contributing to the community, buckets provide a powerful way to tailor Scoop to your workflow.

Conclusion

For my new machine, I’ve chosen Scoop as the primary installation tool, primarily because it neatly isolates application installations within my User folder. While I’m not strictly committed to using only one installer, I’ve found Scoop’s approach efficient for most applications.

However, for certain software not available in Scoop’s buckets, I’m open to using alternatives like winget or specific application installers. A case in point is Visual Studio, a crucial tool for some of my development work. For this, I utilized the Visual Studio Installer, which not only facilitates updating and managing different versions of Visual Studio but also simplifies the modification of workloads.

So far, my experience with Scoop has been positive. I anticipate that as I continue using it, I’ll gain a deeper understanding of how separate installation paths can streamline software management on my machines. I encourage you to explore different package managers, especially if you’ve been relying on the same one or haven’t used one at all. Diversifying your toolkit can be both enlightening and practical.

Additional Resources

Excel to CSV Using Python Pandas

Tue, 26 Jul 2022 15:08:01 -0400

Introduction

Python Pandas is a powerful package used by data scientists to analyze data. My recent use case was far more pedestrian. I had one vendor that was providing a data file in Excel format, and another vendor that needed to consume that data in a completely different schema as a CSV file.

As is often the case, the file I was receiving was formed oddly. I’m sure you can relate. The primary problem was that the columns were not all defined. They were unnamed and formatted for some previously defined form, such that the first column might be a person’s name, the second column might be the first line of the address, or a second person’s name.

Column 0	Column 1	Column 2	Column 3
Mary Smith	1 Main St	Newark, NJ
John Doe	Jane Doe	2 Main St	Oakland, CA

In the table above, column 0 is always a name. Column 1 may be a name, or an address, etc. The table above is a very simplified example of the problem, but we can use it to learn how Pandas can help us.

Our export requirement is to combine the names, charges, credits, descriptions, and balances into comma separated values, with pipe separated values in columns where multiple values exist as follows:

Full Name	Address
Mary Smith	1 Main St, Newark, NJ
John Doe|Jane Doe	2 Main St, Oakland, CA

Pandas Series

A pandas.Series is a one-dimensional array with axis labels. A Series is comparable to a single Excel column. The object supports both integer and label-based indexing and provides methods for performing operations involving the index.
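A minimal sketch of the two indexing styles, assuming a small made-up Series (the values and the labels r0 to r2 are illustrative and do not come from the vendor file):

```python
import pandas as pd

# Hypothetical data: three names labeled with made-up row keys.
names = pd.Series(["Mary Smith", "John Doe", "Jane Doe"], index=["r0", "r1", "r2"])

first_by_position = names.iloc[0]   # integer-position access
first_by_label = names.loc["r0"]    # label-based access

# String methods under .str apply to every element at once.
upper = names.str.upper()
print(first_by_position, first_by_label, list(upper))
```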

Pandas DataFrame

A pandas.DataFrame is a two-dimensional data structure, like a two-dimensional array, or a table with rows and columns. You can think of it as a dictionary of Series objects. A DataFrame is one of the primary data structures of a Pandas project.
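To make the "dictionary of Series" idea concrete, here is a small sketch that builds a DataFrame from two Series. The column names mirror the export format described earlier; the variable names are assumptions made for this example.

```python
import pandas as pd

# Each Series becomes one column of the DataFrame.
full_name = pd.Series(["Mary Smith", "John Doe|Jane Doe"])
address = pd.Series(["1 Main St, Newark, NJ", "2 Main St, Oakland, CA"])

df = pd.DataFrame({"Full Name": full_name, "Address": address})

# Looking up a column by its key returns the underlying Series.
column = df["Address"]
print(df.shape)
print(type(column).__name__)
```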

Read Excel

Pandas can open files and load them into a DataFrame. In addition to reading and writing Excel and CSV files, Pandas supports many other formats, including JSON, XML, and SQL. We will open an XLSX file as shown below:

```
import pandas as pd

# header=None keeps the first row as data, since the sheet has no heading row.
df_input = pd.read_excel('input.xlsx', sheet_name='Sheet2', header=None)
```

DataFrame iloc

Since the file I was given did not include headings, and because of the problem described above such headings would be meaningless, we must access columns by their integer position. Pandas provides the DataFrame.loc indexer to select rows and columns by label, and the DataFrame.iloc indexer to select them by integer position.

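```
import pandas as pd

mydict = [{'a': 1, 'b': 2, 'c': 3, 'd': 4},
          {'a': 100, 'b': 200, 'c': 300, 'd': 400},
          {'a': 1000, 'b': 2000, 'c': 3000, 'd': 4000}]

df = pd.DataFrame(mydict)

print(df)
#       a     b     c     d
# 0     1     2     3     4
# 1   100   200   300   400
# 2  1000  2000  3000  4000

# A single integer selects one row as a Series.
print(df.iloc[0])
# a    1
# b    2
# c    3
# d    4
# Name: 0, dtype: int64

# Row position 0, column position 1.
print(df.iloc[0, 1])
# 2

# Lists select several rows and columns at once.
print(df.iloc[[0, 2], [1, 3]])
#       b     d
# 0     2     4
# 2  2000  4000
```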
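The name and address logic later in this post selects whole columns by position. A short sketch of that pattern, using a made-up two-row frame with no headings (the data is illustrative only):

```python
import pandas as pd

# Data without headings gets default integer column labels 0, 1, 2, ...
df = pd.DataFrame([
    ["Mary Smith", "1 Main St"],
    ["John Doe", "Jane Doe"],
])

# ":" means every row; the second index is the column position.
first_column = df.iloc[:, 0]

# Check whether the first character of each cell in column 1 is a digit.
starts_with_digit = df.iloc[:, 1].str[:1].str.isnumeric()
print(first_column.tolist(), starts_with_digit.tolist())
```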

NumPy Select

One option is to decide which cells are names and which are addresses based on a list of conditions. NumPy provides the numpy.select() function for this. The condlist is a list of conditions that determine from which array in the choicelist the output elements are taken. If multiple conditions are satisfied, the first one in condlist is used. The choicelist is the list of arrays from which the output elements are taken. If all conditions evaluate to false, the default value is returned.

numpy.select(condlist, choicelist, default=0)
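As a minimal sketch of how the three arguments interact, the example below uses made-up numbers (they are not from my data set). The second condition also matches the first two values, but the first matching condition wins.

```python
import numpy as np

# Hypothetical sample values, not taken from the original data.
x = np.array([1, 5, 10, 15, 20])

# Conditions are checked in order; the first one that is True wins.
condlist = [x < 6, x < 12]
# One array of candidate outputs per condition, in the same order.
choicelist = [x * 10, x * 100]

# Elements that match no condition receive the default value.
result = np.select(condlist, choicelist, default=-1)
print(result.tolist())  # [10, 50, 1000, -1, -1]
```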

Functions

Unlike many popular programming languages that use braces (or “curly brackets”), Python uses indentation to indicate a block of code. Therefore, a Python function looks like this:

def my_function():
  print("Hello from a function")

my_function()

Names & Addresses

My first thought was to find a library that parsed addresses. I found a promising option called usaddress. It provided all of the features I needed, but it did not appear to be in active development, and was only compatible with Python 2.7. Since I had a limited data set and could assume that addresses would start with a house number, and that names would not, I was able to use NumPy to identify names vs addresses.

Using what we have seen above, we can combine this to solve the name and address problem. I find this solution to be a bit brute force and ugly, but my goal was to learn enough to solve the problem and move on. A deeper understanding of Python and these libraries may come later.

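import numpy as np

# True where the cell in the given column does not start with a digit,
# which in this data set means it holds a name rather than an address.
def is_name(column):
    return ~df_input.iloc[:, column].str[:1].str.isnumeric()

# Case 1: one name column followed by the address.
# Case 2: two name columns followed by the address.
name_conditions = [
    is_name(0) & ~is_name(1),
    is_name(0) & is_name(1) & ~is_name(2)
]

# The name to use for each case, in the same order as name_conditions.
names = [
    df_input.iloc[:, 0],
    df_input.iloc[:, 0] + '|' + df_input.iloc[:, 1]
]

# The address to use for each case: the two columns that follow the name(s).
addresses = [
    df_input.iloc[:, 1] + ', ' + df_input.iloc[:, 2],
    df_input.iloc[:, 2] + ', ' + df_input.iloc[:, 3]
]

# Rows that match neither case receive the default value of 0.
df_input['Full Name'] = np.select(name_conditions, names)
df_input['Address'] = np.select(name_conditions, addresses)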

Charges and Credits

In addition to the name and address challenge outlined above, I also needed to parse the transactions. The source company had built formatting into the export file. The Date, Description, Charge, Credit, etc. of each transaction are combined into a single column, and we don’t know how many such columns there will be. This was done so that the previous consumer could simply print the entire column, and it would be appropriately formatted. They were unwilling or unable to provide a cleaner export. Therefore, I needed to parse the column to separate the charges and the credits.

One small thing that made this easier was the fact that the unknown number of charges and credits were the final columns. Therefore, we could assume that if the first possible column with a charge or credit was at index 10, the remaining columns would all be charges, credits, or empty.

Imagine the column is 80 characters wide. The first 10 characters include the Date. Characters 15 through 25 include a Description. Characters 35 through 45 include the charge if applicable. Characters 50 through 60 include the credit if applicable. Finally, characters 70 through 80 include the subtotal. The columns could be parsed as follows:

def transaction_date(col_value):
    return col_value[:10]

def transaction_description(col_value):
    return col_value[15:25].strip()

def charge_amount(col_value):
    return col_value[35:45].strip()

def is_charge(col_value):
    return charge_amount(col_value) != ''

def credit_amount(col_value):
    return col_value[50:60].strip()

def is_credit(col_value):
    return credit_amount(col_value) != ''

def subtotal(col_value):
    return col_value[70:80].strip()

def charge_list():
    # Builds one pipe-delimited string per row for each new output column.
    separator = '|'
    transaction_dates = []
    transaction_amounts = []
    transaction_descriptions = []
    transaction_subtotals = []

    for row in range(len(df_input)):
        amounts, dates, descriptions, subtotals = [], [], [], []

        # Transaction cells start at column index 10 and run to the last column.
        for column in range(10, len(df_input.columns)):
            col_value = df_input.iloc[row, column]
            if not isinstance(col_value, str):
                continue  # skip cells that do not hold text
            if is_charge(col_value):
                amounts.append(charge_amount(col_value))
            elif is_credit(col_value):
                amounts.append(credit_amount(col_value))
            else:
                continue  # neither a charge nor a credit
            dates.append(transaction_date(col_value))
            descriptions.append(transaction_description(col_value))
            subtotals.append(subtotal(col_value))

        transaction_amounts.append(separator.join(amounts))
        transaction_dates.append(separator.join(dates))
        transaction_descriptions.append(separator.join(descriptions))
        transaction_subtotals.append(separator.join(subtotals))

    df_input['Charges'] = transaction_amounts
    df_input['Dates'] = transaction_dates
    df_input['Descriptions'] = transaction_descriptions
    df_input['Balances'] = transaction_subtotals

charge_list()
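To sanity-check offsets like the ones above, it can help to build a sample line and slice it back apart. This is only a sketch: build_line and parse_transaction are hypothetical helpers written for this illustration, and the sample values are invented.

```python
def build_line(date, description, charge="", credit="", subtotal=""):
    """Lay the fields out at the example offsets: 0, 15, 35, 50, and 70."""
    return f"{date:<15}{description:<20}{charge:<15}{credit:<20}{subtotal:<10}"

def parse_transaction(col_value):
    """Slice one fixed-width cell into its fields, trimming the padding."""
    return {
        "date": col_value[:10],
        "description": col_value[15:25].strip(),
        "charge": col_value[35:45].strip(),
        "credit": col_value[50:60].strip(),
        "subtotal": col_value[70:80].strip(),
    }

# Invented sample transaction: a charge with no credit.
line = build_line("2024-01-15", "Rent", charge="1200.00", subtotal="1200.00")
parsed = parse_transaction(line)
print(len(line))  # 80
print(parsed)
```

An empty string in the charge or credit field is what distinguishes the two kinds of transaction.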

Write CSV

Once we make the necessary changes to our dataframe, we can export it to a new file:

df_input.to_csv('output.csv')

Wrapping Up

Python Pandas is great at manipulating data, but it can also be used to import, transform, and export data when the situation arises. I found it a relatively simple way to take a poorly planned dataset and manipulate it for use in another application. My use case combined with a smallish dataset made my brute-force approach usable. However, if you have a large dataset, or if you are using Pandas as intended, you should prefer vectorized operations on a whole Series over looping through individual cells.
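As a rough sketch of what operating on a Series could look like, the string accessor can apply the same slice to every cell at once. The sample cells below are invented, and only the date and charge offsets from the earlier example are shown.

```python
import pandas as pd

# Hypothetical fixed-width cells: date at positions 0-9, charge at 35-44.
cells = pd.Series([
    f"{'2024-01-15':<35}{'1200.00':<45}",
    f"{'2024-02-15':<35}{'':<45}",
])

# .str.slice applies the same slice to every element without a Python loop.
dates = cells.str.slice(0, 10)
charges = cells.str.slice(35, 45).str.strip()
has_charge = charges != ""

print(dates.tolist())       # ['2024-01-15', '2024-02-15']
print(charges.tolist())     # ['1200.00', '']
print(has_charge.tolist())  # [True, False]
```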

PyInstaller

My script will need to be run regularly by someone else. Therefore, I need to create an executable version of my Python script that requires no dependencies. Two options are auto-py-to-exe and PyInstaller. I chose PyInstaller for the command line interface.

pip install pyinstaller

pyinstaller --onefile name_of_script.py

Now anyone can run this executable on their Windows machine without the need for Python to be installed.

I generally avoid hacks, and as much as it pains me to share this one, it did solve my problem. This was not my favorite project, but if it helps one person understand how Python Pandas can manipulate data when they have a similar situation, then the pain was worth suffering.

Is Identity GUID or Is It Int?

Tue, 19 Apr 2022 13:31:33 -0400

Introduction

I have long waffled on whether a GUID or an integer makes for the better primary key. I wrote this with the goal of refining or challenging my current opinion.

The identity, or primary key, uniquely identifies an entity or record. To be clear, ‘identity’ is a Domain Driven Design concept that describes an immutable identifying attribute of an entity (not to be confused with a SQL Server identity column, which is an automatically incrementing integer). A ‘primary key’ is a database concept that serves the same purpose for a record and enforces referential integrity. In this post, I will use the term ‘primary key’, since we are focusing on the database. The concept of an identity attribute is more than just a database concern; it is a domain concern. Still, it is important to consider where these concerns intersect.

While my bias as a software developer guides me to focus on the domain model, the realities of database persistence cannot be overlooked. For this post, we are going to assume that we are using SQL Server for persistence. Therefore, its performance must be factored into the identifier decision. Consider that other databases are going to have similar issues.

GUID

From a domain perspective, I have an affinity for GUID (or UUID) as my identifier. It is almost assuredly unique in the world and therefore will not collide with other GUIDs should my system be distributed, replicated, merged, or otherwise interact with other systems or contexts. It also has the benefit that it can be created in the application, rather than being a side-effect of adding the entity to the database.

Column Size

The most basic and easy to understand issue with GUID is the size of the column, which is 16 bytes. Consider how keys will proliferate in your database. A key consumes space not only in the table where it is a primary key, but also in every table where it is a foreign key. If every key in your database is a GUID, you can imagine the storage implications versus a 4-byte int or an 8-byte bigint. Consider further that the space is consumed not only on disk, but also in the RAM of the server.

Clustered Index Fragmentation

A common argument against the use of GUID for the primary key is fragmentation.[2] Clustered indexes sort and store the data rows based on their key values. There can be only one clustered index per table, because the data rows themselves can be stored in only one order. When setting a primary key for a table, SQL Server creates a clustered index on that column by default. That may be what you want initially. When you have an auto-incrementing integer as a clustering key, new rows arrive in the same order as the clustered index and are added at its end. When the key is not ordered, as with a randomly generated GUID, new rows land in the middle of the index, which causes page splits and fragmentation.

There are some options to mitigate fragmentation when using a GUID. COMB (short for COMBined)[3] or sequential GUIDs are ordered in the same way as integers, thus mitigating clustered index fragmentation. A COMB GUID has the benefit of being generated in code like a regular GUID, using libraries such as RT.Comb. NEWSEQUENTIALID() is a Transact-SQL function built into SQL Server that creates a GUID that is greater than any GUID previously generated by this function on a given computer since Windows was started.

There is also the option of declaring a separate primary key and clustering key. This way you could have a GUID as your primary key, and an integer as your clustering key. Other causes of fragmentation exist that have nothing to do with the key, but that is a topic for another time.
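To make the COMB idea concrete, here is a minimal sketch in Python. It is not how RT.Comb or the original COMB article builds the value: new_comb is a hypothetical helper, and using a Unix millisecond timestamp is my own simplifying assumption. It keeps ten random bytes and places a six-byte timestamp at the end, which is the group of bytes SQL Server weighs most heavily when ordering uniqueidentifier values.

```python
import time
import uuid

def new_comb(timestamp_ms=None):
    """Return a COMB-style UUID: 10 random bytes followed by a 6-byte timestamp."""
    if timestamp_ms is None:
        # Assumption: milliseconds since the Unix epoch, which fits in 6 bytes.
        timestamp_ms = time.time_ns() // 1_000_000
    random_part = uuid.uuid4().bytes[:10]
    # Big-endian keeps the byte order of the tail the same as the time order.
    time_part = timestamp_ms.to_bytes(6, "big")
    return uuid.UUID(bytes=random_part + time_part)

earlier = new_comb(1_000)
later = new_comb(2_000)
# The random parts differ, but the tails sort in creation order.
print(earlier.bytes[10:] < later.bytes[10:])  # True
```

Two values created within the same millisecond share a tail, so their relative order falls back to the random bytes.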

Index Structure Size

Another strong argument against the use of GUID as the clustering key is the size of the resulting index structures. The pointer from an index row in a nonclustered index to a data row is called a row locator. For a clustered table, the row locator is the clustered index key. Since every nonclustered index contains the clustering key, a wider clustering key widens every nonclustered index as well. A GUID is 16 bytes versus 4 bytes for an int, so the key portion of every index row is four times larger with a GUID than with an int.
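A back-of-the-envelope calculation shows how the key width multiplies. This is a deliberately rough sketch: key_storage_bytes is a hypothetical helper, the row and index counts are invented, and it counts only one copy of the clustering key per row in the clustered index and in each nonclustered index, ignoring row overhead, other columns, and page structure.

```python
# Storage size in bytes of each SQL Server key type discussed above.
KEY_BYTES = {"int": 4, "bigint": 8, "uniqueidentifier": 16}

def key_storage_bytes(rows, nonclustered_indexes, key_type):
    """Bytes spent on copies of the clustering key alone."""
    # One copy in the clustered index, plus one in each nonclustered index.
    copies_per_row = 1 + nonclustered_indexes
    return rows * copies_per_row * KEY_BYTES[key_type]

# Invented example: 10 million rows with 4 nonclustered indexes.
int_bytes = key_storage_bytes(10_000_000, 4, "int")
guid_bytes = key_storage_bytes(10_000_000, 4, "uniqueidentifier")
print(int_bytes)   # 200000000
print(guid_bytes)  # 800000000
```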

Hard to Remember

Finally, GUIDs are just plain ugly. Integers are much easier to remember and type. If your table has 10,000 rows, it is far easier to remember a key of ‘8,711’ than it is to remember ‘55e7ad83-c81b-4148-a658-07766c221558’. You may be able to hold the former in your head, but the latter will require the use of the clipboard.

API

Occasionally, you will see an API that is hackable, either by design or accident. This probably is not what you want. If someone has order ‘100’, you don’t want them to have the ability to enter ‘99’ or ‘101’ to see the orders before and after them. GUIDs make this less likely. Simply encoding the GUID to a BASE64 string gives us something that is highly unlikely to be hacked in this manner.

public string GuidToBase64(Guid guid)
{
  string base64 = Convert.ToBase64String(guid.ToByteArray());
  base64 = base64.Replace("/", "_").Replace("+", "-");
  return base64.Substring(0, 22);
}

public Guid Base64ToGuid(string base64)
{
  Guid guid = default(Guid);
  // Reverse the URL-safe substitutions made in GuidToBase64, then restore the padding.
  base64 = base64.Replace("_", "/").Replace("-", "+") + "==";

  try {
    guid = new Guid(Convert.FromBase64String(base64));
  }
  catch (Exception ex) {
    throw new Exception("Failed to convert BASE64 string to GUID", ex);
  }
  }

  return guid;
}

Using the code above, a GUID of ‘55e7ad83-c81b-4148-a658-07766c221558’ would return a BASE64 string of ‘g63nVRvISEGmWAd2bCIVWA’. This string could then be used on your API. (Keep in mind that this ID is case sensitive.)
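For comparison, the same round trip can be sketched in Python using only the standard library. The function names are my own, and the sketch assumes that uuid's bytes_le layout matches the byte order produced by .NET's Guid.ToByteArray(), which is what makes the example value above come out the same.

```python
import base64
import uuid

def guid_to_base64(guid):
    """Encode a UUID as a 22-character URL-safe string."""
    # urlsafe_b64encode already uses '-' and '_' in place of '+' and '/'.
    encoded = base64.urlsafe_b64encode(guid.bytes_le).decode("ascii")
    # 16 bytes always encode to 22 characters plus '==' padding.
    return encoded.rstrip("=")

def base64_to_guid(text):
    """Decode a 22-character URL-safe string back into a UUID."""
    raw = base64.urlsafe_b64decode(text + "==")
    return uuid.UUID(bytes_le=raw)

original = uuid.UUID("55e7ad83-c81b-4148-a658-07766c221558")
token = guid_to_base64(original)
print(token)                  # g63nVRvISEGmWAd2bCIVWA
print(base64_to_guid(token))  # 55e7ad83-c81b-4148-a658-07766c221558
```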

If one uses integers as their primary key, there are options to obfuscate that key on their API. One such library is Hashids, a small open-source library that generates short, unique, non-sequential IDs from numbers, which is available for a variety of programming languages.

While you may see tutorials or production websites that use hackable IDs, you will want to use a mechanism like those suggested above to decouple the keys of your database from your public API.

Conclusion

I began writing this with the premise of showing how a COMB or similar sequential GUID is a good default choice for an identity attribute or primary key. There remain many strong arguments in favor of the GUID. However, the table and index structure size of GUIDs make integers a wise choice. My recommendations are:

If you choose GUID as a primary and clustering key, use some form of sequential GUID,
Choose GUID if you need to generate the key in code,
Choose GUID if you require its distributed benefits,
If using GUID, consider an int or bigint clustering key,
If performance is your top concern, use int or bigint as your primary and clustering key,
Only expose hashed keys to the outside world.

This is an interesting topic to dive into deeper. I recommend reading the posts below, and if you use SQL Server, consume just about everything that Kimberly Tripp writes or says.

References

[1] Robert C. Martin. No DB. The Clean Code Blog, May 15, 2012.
[2] Kimberly Tripp. GUIDs as PRIMARY KEYs and/or the clustering key. SQLskills, March 5, 2009.
[3] Jimmy Nilsson. The Cost of GUIDs as Primary Keys. InformIT, March 8, 2002.
[4] Jeff Atwood. Primary Keys: IDs versus GUIDs. Coding Horror, March 19, 2007.
[5] Tom Harrison. UUID or GUID as Primary Keys? Be Careful! Tom Harrison’s Blog, February 12, 2017.

The Disposable Toothbrush

Wed, 30 Mar 2022 19:07:03 -0400

I know the inventor of the disposable toothbrush. They would stay at hotels that provided soap, shampoo, conditioner, etc., but a recurring issue was that they would forget their toothbrush, and hotels didn’t furnish a solution. It seemed obvious that hotels should have individually wrapped toothbrushes, just as they had other toiletries. Why didn’t anyone make an inexpensive short-term use toothbrush with toothpaste already applied? So, this person invented it – only they didn’t. They did in fact come up with the idea, as I’m sure did many others, but they never acted on it. Someone else eventually did. To this day, every time this person sees one, they declare how it was their invention.

This is my long overdue blog. It will not be perfect. It probably will not be widely read. Upon reviewing old items to be donated or thrown away, I came across some books and articles about blogging – from twenty years ago! It was important to me then, but not important enough for me to get started. My thoughts have remained stuck in my head or written in personal notes. I will now endeavor to put some of them out into the world. They may be ugly, unpopular, or just plain wrong, but they will be authentically mine.

Inaction is easy. We can easily fall victim to the default effect. There is a reason that GDPR requires that websites affirmatively receive users’ consent before using unnecessary cookies, rather than just making the settings available. There is a reason that creating a new online account or installing new software requires you to opt out of marketing and data collection options, rather than opt in. There is a reason that we defend our viewpoints without thinking, naturally tend to resist change, and support the status quo. The reason in all these cases is that we tend to accept the default. An object at rest remains at rest, and an object in motion remains in motion in the same potentially unproductive direction that it has been moving in all along. We can let someone else start a company. We can let others write their blogs and novels. We can let yet another invent the disposable toothbrush.