/pokutuna/I want to linearly interpolate int64 and datetime64

generated at 2/22/2025, 10:58:37 AM
I want to linearly interpolate int64 and datetime64
#Numpy_&_Pandas #Python

np.nan is a float64 value, so concatenating it into a Series turns that Series into float64
max.py
import numpy as np
import pandas as pd

i64info = np.iinfo(np.int64)
s1 = pd.Series([i64info.max, i64info.min])
s2 = pd.Series([np.nan])

display(s1)
# 0    9223372036854775807
# 1   -9223372036854775808
# dtype: int64

display(s2)
# 0   NaN
# dtype: float64

s = pd.concat([s1, s2])
display(s)
# 0    9.223372e+18
# 1   -9.223372e+18
# 0             NaN
# dtype: float64

s.interpolate(method='linear').astype(np.int64)
# 0   -9223372036854775808
# 1   -9223372036854775808
# 0   -9223372036854775808
# dtype: int64

 np.nan is a floating-point number, so by the time you are thinking about df.interpolate, any data containing np.nan should already be float64
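
The point above can be checked with a small sketch. It also shows that float64 cannot hold the int64 maximum exactly, which is presumably why the round trip through float64 in max.py does not give the original values back. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# np.nan is a plain Python float, so it can only live in a float Series.
nan_is_float = isinstance(np.nan, float)

# Concatenating an int64 Series with a NaN-only Series upcasts to float64.
combined = pd.concat([pd.Series([1, 2]), pd.Series([np.nan])])

# float64 has a 53-bit significand, so int64 max (2**63 - 1) rounds up to 2**63,
# which is outside the int64 range.
i64_max = np.iinfo(np.int64).max
as_float = float(i64_max)

print(nan_is_float)            # True
print(combined.dtype)          # float64
print(as_float == 2.0 ** 63)   # True
```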

int64 with missing values: dtype=Int64
To convert to Int64 or Float64, use convert_dtypes
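
A minimal sketch of what convert_dtypes does to an integer-valued column that became float64 because of a NaN. The sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan])      # float64 because of the NaN
converted = s.convert_dtypes()     # integer-valued floats become Int64, NaN becomes <NA>

# The nullable dtype can also be requested directly.
explicit = pd.Series([1, 2, None], dtype="Int64")

print(s.dtype)           # float64
print(converted.dtype)   # Int64
print(explicit.dtype)    # Int64
```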

You also get Int64 when fetching a nullable INT64 column from BigQuery
On the other hand, interpolate(method='linear') does not work properly on it, while ffill and the like do (pandas 1.4.4)
BUG: Interpolate over time does not work with Int64 or Float64 · Issue 40252 · pandas-dev/pandas
NaT support has landed and looks like it will be released soon, but Int64 and the like are apparently still unsupported
Interpolate NaT · Issue 11701 · pandas-dev/pandas
BUG: Series.interpolate with dt64/td64 raises by jbrockmendel · Pull Request 51005 · pandas-dev/pandas PR 

Nullable integer data type — pandas 1.5.3 documentation
Missing values
 pd.NA 
 np.NaN  (float64)
 None 

 Calling astype('float64') on a Series containing pd.NA does not turn it into np.nan; it raises an exception
 pd.Series([1,2,3,pd.NA,5]).replace({pd.NA: np.nan}) gets you from Int64 to float64 fairly painlessly
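
Another possible route, not taken from the notes above and assuming the Series really has the Int64 dtype, is to_numpy with an explicit na_value. This is only a sketch of an alternative, and it returns a NumPy array rather than a Series.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, pd.NA, 5], dtype="Int64")

# Ask for float64 and say explicitly what <NA> should become.
as_float = s.to_numpy(dtype="float64", na_value=np.nan)

print(as_float.dtype)   # float64
```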


Working with missing data — pandas 1.5.3 documentation
 To treat inf and -inf as NA:
 pandas.options.mode.use_inf_as_na = True
Fill gaps with the mean:
 dff.fillna(dff.mean())
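
A small sketch of the mean fill on made-up data. Instead of the global option, it replaces inf explicitly with NaN first, which is an assumption about what is wanted here rather than something from the pandas page above.

```python
import numpy as np
import pandas as pd

dff = pd.DataFrame({"a": [1.0, np.inf, 3.0], "b": [10.0, 20.0, np.nan]})

# Treat +/-inf as missing by replacing it explicitly.
dff = dff.replace([np.inf, -np.inf], np.nan)

# Fill each column's gaps with that column's mean (NaN is skipped when averaging).
filled = dff.fillna(dff.mean())

print(filled["a"].tolist())   # [1.0, 2.0, 3.0]
print(filled["b"].tolist())   # [10.0, 20.0, 15.0]
```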

Interpolating int64
interpolate_int64.py
df = pd.DataFrame(data={'num': [np.iinfo(np.int64).max, 320, 92, 3, 103, 91]})
display(df)
# 0	9223372036854775807
# 1	320
# 2	92
# 3	3
# 4	103
# 5	91

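# Linearly interpolate values below the 1st percentile
# Row 3 (value 3) is the target
cond = df['num'] < df['num'].quantile(0.01)
# Use .loc instead of chained indexing, and cast only the interpolated rows back to int64
df.loc[cond, 'num'] = df['num'].mask(cond).interpolate(method="linear")[cond].astype('int64')
df
# 0 9223372036854775807 ← no precision lost to type conversion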
# 1	320
# 2	92
# 3	97
# 4	103
# 5	91

If you don't mind the type conversion, you can also write it in one line (precision is lost)
 df['num'] = df['num'].mask(df['num'] < df['num'].quantile(0.01)).interpolate(method="linear").astype('int64') 

To handle the bottom 1% and top 1% together, combine them in the mask condition
 df.mask((df['num'] < df['num'].quantile(0.01)) | (df['num'].quantile(0.99) < df['num'])) 
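
A self-contained sketch of the two-sided version, on made-up data with one low and one high outlier. It casts the whole column back to int64, so it is only appropriate when every value survives a float64 round trip.

```python
import pandas as pd

df = pd.DataFrame({"num": [100, 101, 1, 103, 104, 10000, 106, 107]})

low = df["num"].quantile(0.01)
high = df["num"].quantile(0.99)
outlier = (df["num"] < low) | (high < df["num"])

# Mask both tails, interpolate linearly, then cast back to int64.
df["num"] = df["num"].mask(outlier).interpolate(method="linear").astype("int64")

print(df["num"].tolist())   # [100, 101, 102, 103, 104, 105, 106, 107]
```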

Does it fail with Int64?
float64 -> int64 truncates the fractional part, but float64 -> Int64 raises an exception
Apply np.floor, np.round, or similar explicitly before converting to Int64
Essential basic functionality — pandas 1.5.3 documentation
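
The difference can be sketched as follows, on made-up values. The exception type is caught broadly because the exact class is an assumption on my part and may differ between pandas versions.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 97.5, np.nan])

# float64 -> Int64 refuses to drop the fractional part silently.
try:
    s.astype("Int64")
    raised = False
except (TypeError, ValueError):
    raised = True

# Flooring first makes every value integer-like, so the cast succeeds; NaN becomes <NA>.
floored = np.floor(s).astype("Int64")

print(raised)          # True
print(floored.dtype)   # Int64
```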

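interpolate.py
# Fill the extremely low spot with NA
df.at['2022-10-19', 'pv'] = pd.NA

# Interpolate column by column
# To avoid type conversion as much as possible, use mask so only the NA spots get the interpolated value (floored)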
for s in ['pv', 'users', 'hoge']:
  df[s] = df[s].mask(df[s].isna(), np.floor(df[s].astype('f8').interpolate(method='linear')))
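
A self-contained variant of the loop above for a single column, with a made-up frame in which the low value has already been replaced by NA. Unlike the original, it casts the interpolated values to Int64 before calling mask, so both sides share a dtype. That is my assumption about a safer form, not something the notes require.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"pv": pd.array([100, pd.NA, 105], dtype="Int64")},
    index=["2022-10-18", "2022-10-19", "2022-10-20"],
)

# Interpolate on a float64 copy, floor, and go back to Int64.
interpolated = np.floor(
    df["pv"].astype("f8").interpolate(method="linear")
).astype("Int64")

# Only the NA positions take the interpolated value; other rows are untouched.
df["pv"] = df["pv"].mask(df["pv"].isna(), interpolated)

print(df["pv"].tolist())   # [100, 102, 105]
```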


Interpolating datetime
Incidentally, when you SELECT a TIMESTAMP column containing NULLs from BigQuery,
 the dtype is datetime64[ns, UTC] and the NULLs become pd.NaT (Not a Time), a value for which isnull() also returns True
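
A sketch of the NaT behaviour without touching BigQuery: the Series below is built by hand as a stand-in for such a query result, so the data and the way it is constructed are assumptions.

```python
import pandas as pd

# Stand-in for a nullable TIMESTAMP column: tz-aware datetimes with one missing value.
ts = pd.to_datetime(pd.Series(["2023-01-01", None, "2023-01-03"]), utc=True)

print(ts.isnull().tolist())   # [False, True, False]
print(ts[1] is pd.NaT)        # True
```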

BUG: Series.interpolate with dt64/td64 raises by jbrockmendel · Pull Request 51005 · pandas-dev/pandas
Recent versions seem to support it (still an RC as of 2023/3/29)
interpolate_datetime.py
>>> pd.to_datetime(pd.Series(['2023-01-01', '2023-01-02', None, '2023-01-04'])).interpolate(method='linear')
0   2023-01-01
1   2023-01-02
2   2023-01-03
3   2023-01-04
dtype: datetime64[ns]
>>> pd.__version__
'2.0.0rc0'

If you want to avoid type conversion as much as possible, it is probably better to do it in a separate df
Interpolate NaT · Issue 11701 · pandas-dev/pandas
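
One reason to keep the conversion away from the original column, sketched with a made-up timestamp: nanosecond epoch values need more than the 53 bits float64 offers, so a round trip through f8 can change them.

```python
import pandas as pd

# Nanoseconds since the epoch for a timestamp with a non-zero nanosecond part.
ns = pd.Timestamp("2023-01-01 00:00:00.000000001").value

# Round trip through float64, as an i8 -> f8 -> i8 conversion would do.
round_tripped = int(float(ns))

print(ns)                   # 1672531200000000001
print(round_tripped - ns)   # -1
```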

interpolate_datetime.py
# Setup
df = pd.DataFrame(data={'date': ['2023-01-01', '2023-01-02', None, '2023-01-04']})
df['date'] = pd.to_datetime(df['date'])
display(df)
# 0	2023-01-01 00:00:00
# 1	2023-01-02 00:00:00
# 2	NaT
# 3	2023-01-04 00:00:00

# Prepare a separate Series (tmp_date) converted to f8, and put np.nan only where the value is NaT
tmp_date = df['date'].astype('i8').astype('f8')
tmp_date[df['date'].isnull()] = np.nan

# Interpolate tmp_date and put the result only into the null positions of the original df
# mask
df['date'].mask(df['date'].isnull(), pd.to_datetime(tmp_date.interpolate(method='linear')))

pandas.DataFrame.mask — pandas 1.5.3 documentation
 mask replaces the entries where the condition is True; inplace=True also works


Look at BQ's to_dataframe
googleapis/python-bigquery@be0255e - google/cloud/bigquery/job/query.py#L1620